diff --git a/.nojekyll b/.nojekyll
new file mode 100644
index 0000000..e69de29
diff --git a/cache.json b/cache.json
new file mode 100644
index 0000000..996c866
--- /dev/null
+++ b/cache.json
@@ -0,0 +1 @@
+{"2024-11-29T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2411.19951v1","updated":"2024-11-29T18:59:54Z","published":"2024-11-29T18:59:54Z","title":"T2Vid: Translating Long Text into Multi-Image is the Catalyst for\n Video-LLMs","summary":" The success of Multimodal Large Language Models (MLLMs) in the image domain\nhas garnered wide attention from the research community. Drawing on previous\nsuccessful experiences, researchers have recently explored extending the\nsuccess to the video understanding realms. Apart from training from scratch, an\nefficient way is to utilize the pre-trained image-LLMs, leading to two\nmainstream approaches, i.e. zero-shot inference and further fine-tuning with\nvideo data. In this work, our study of these approaches harvests an effective\ndata augmentation method. We first make a deeper inspection of the zero-shot\ninference way and identify two limitations, i.e. limited generalization and\nlack of temporal understanding capabilities. Thus, we further investigate the\nfine-tuning approach and find a low learning efficiency when simply using all\nthe video data samples, which can be attributed to a lack of instruction\ndiversity. Aiming at this issue, we develop a method called T2Vid to synthesize\nvideo-like samples to enrich the instruction diversity in the training corpus.\nIntegrating these data enables a simple and efficient training scheme, which\nachieves performance comparable to or even superior to using full video\ndatasets by training with just 15% the sample size. Meanwhile, we find that the\nproposed scheme can boost the performance of long video understanding without\ntraining with long video samples. 
We hope our study will spark more thinking\nabout using MLLMs for video understanding and curation of high-quality data.\nThe code is released at https://github.com/xjtupanda/T2Vid.\n","authors":["Shukang Yin","Chaoyou Fu","Sirui Zhao","Yunhang Shen","Chunjiang Ge","Yan Yang","Zuwei Long","Yuhan Dai","Tong Xu","Xing Sun","Ran He","Caifeng Shan","Enhong Chen"],"pdf_url":"https://arxiv.org/pdf/2411.19951v1.pdf","comment":"13 pages, 9 figures, 5 tables. Project page:\n https://github.com/xjtupanda/T2Vid"},{"id":"http://arxiv.org/abs/2411.19943v1","updated":"2024-11-29T18:58:22Z","published":"2024-11-29T18:58:22Z","title":"Critical Tokens Matter: Token-Level Contrastive Estimation Enhence LLM's\n Reasoning Capability","summary":" Large Language Models (LLMs) have exhibited remarkable performance on\nreasoning tasks. They utilize autoregressive token generation to construct\nreasoning trajectories, enabling the development of a coherent chain of\nthought. In this work, we explore the impact of individual tokens on the final\noutcomes of reasoning tasks. We identify the existence of ``critical tokens''\nthat lead to incorrect reasoning trajectories in LLMs. Specifically, we find\nthat LLMs tend to produce positive outcomes when forced to decode other tokens\ninstead of critical tokens. Motivated by this observation, we propose a novel\napproach - cDPO - designed to automatically recognize and conduct token-level\nrewards for the critical tokens during the alignment process. Specifically, we\ndevelop a contrastive estimation approach to automatically identify critical\ntokens. It is achieved by comparing the generation likelihood of positive and\nnegative models. To achieve this, we separately fine-tune the positive and\nnegative models on various reasoning trajectories, consequently, they are\ncapable of identifying critical tokens within incorrect trajectories\nthat contribute to erroneous outcomes. 
Moreover, to further align the model\nwith the critical token information during the alignment process, we extend the\nconventional DPO algorithms to token-level DPO and utilize the differential\nlikelihood from the aforementioned positive and negative model as important\nweight for token-level DPO learning. Experimental results on GSM8K and MATH500\nbenchmarks with two widely used models Llama-3 (8B and 70B) and deepseek-math\n(7B) demonstrate the effectiveness of the proposed approach cDPO.\n","authors":["Zicheng Lin","Tian Liang","Jiahao Xu","Xing Wang","Ruilin Luo","Chufan Shi","Siheng Li","Yujiu Yang","Zhaopeng Tu"],"pdf_url":"https://arxiv.org/pdf/2411.19943v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2411.19941v1","updated":"2024-11-29T18:57:25Z","published":"2024-11-29T18:57:25Z","title":"Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA\n Benchmark","summary":" Following the successful 2023 edition, we organised the Second Perception\nTest challenge as a half-day workshop alongside the IEEE/CVF European\nConference on Computer Vision (ECCV) 2024, with the goal of benchmarking\nstate-of-the-art video models and measuring the progress since last year using\nthe Perception Test benchmark. This year, the challenge had seven tracks (up\nfrom six last year) and covered low-level and high-level tasks, with language\nand non-language interfaces, across video, audio, and text modalities; the\nadditional track covered hour-long video understanding and introduced a novel\nvideo QA benchmark 1h-walk VQA. Overall, the tasks in the different tracks\nwere: object tracking, point tracking, temporal action localisation, temporal\nsound localisation, multiple-choice video question-answering, grounded video\nquestion-answering, and hour-long video question-answering. 
We summarise in\nthis report the challenge tasks and results, and introduce in detail the novel\nhour-long video QA benchmark 1h-walk VQA.\n","authors":["Joseph Heyward","João Carreira","Dima Damen","Andrew Zisserman","Viorica Pătrăucean"],"pdf_url":"https://arxiv.org/pdf/2411.19941v1.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2312.13090"},{"id":"http://arxiv.org/abs/2411.19939v1","updated":"2024-11-29T18:56:37Z","published":"2024-11-29T18:56:37Z","title":"VLSBench: Unveiling Visual Leakage in Multimodal Safety","summary":" Safety concerns of Multimodal large language models (MLLMs) have gradually\nbecome an important problem in various applications. Surprisingly, previous\nworks indicate a counter-intuitive phenomenon that using textual unlearning to\nalign MLLMs achieves comparable safety performances with MLLMs trained with\nimage-text pairs. To explain such a counter-intuitive phenomenon, we discover a\nvisual safety information leakage (VSIL) problem in existing multimodal safety\nbenchmarks, i.e., the potentially risky and sensitive content in the image has\nbeen revealed in the textual query. In this way, MLLMs can easily refuse these\nsensitive text-image queries according to textual queries. However, image-text\npairs without VSIL are common in real-world scenarios and are overlooked by\nexisting multimodal safety benchmarks. To this end, we construct multimodal\nvisual leakless safety benchmark (VLSBench) preventing visual safety leakage\nfrom image to textual query with 2.4k image-text pairs. Experimental results\nindicate that VLSBench poses a significant challenge to both open-source and\nclose-source MLLMs, including LLaVA, Qwen2-VL, Llama3.2-Vision, and GPT-4o.\nThis study demonstrates that textual alignment is enough for multimodal safety\nscenarios with VSIL, while multimodal alignment is a more promising solution\nfor multimodal safety scenarios without VSIL. 
Please see our code and data at:\nhttp://hxhcreate.github.io/VLSBench\n","authors":["Xuhao Hu","Dongrui Liu","Hao Li","Xuanjing Huang","Jing Shao"],"pdf_url":"https://arxiv.org/pdf/2411.19939v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19930v1","updated":"2024-11-29T18:42:28Z","published":"2024-11-29T18:42:28Z","title":"On Domain-Specific Post-Training for Multimodal Large Language Models","summary":" Recent years have witnessed the rapid development of general multimodal large\nlanguage models (MLLMs). However, adapting general MLLMs to specific domains,\nsuch as scientific fields and industrial applications, remains less explored.\nThis paper systematically investigates domain adaptation of MLLMs through\npost-training, focusing on data synthesis, training pipelines, and task\nevaluation. (1) Data Synthesis: Using open-source models, we develop a visual\ninstruction synthesizer that effectively generates diverse visual instruction\ntasks from domain-specific image-caption pairs. Our synthetic tasks surpass\nthose generated by manual rules, GPT-4, and GPT-4V in enhancing the\ndomain-specific performance of MLLMs. (2) Training Pipeline: While the\ntwo-stage training--initially on image-caption pairs followed by visual\ninstruction tasks--is commonly adopted for developing general MLLMs, we apply a\nsingle-stage training pipeline to enhance task diversity for domain-specific\npost-training. (3) Task Evaluation: We conduct experiments in two domains,\nbiomedicine and food, by post-training MLLMs of different sources and scales\n(e.g., Qwen2-VL-2B, LLaVA-v1.6-8B, Llama-3.2-11B), and then evaluating MLLM\nperformance on various domain-specific tasks. 
To support further research in\nMLLM domain adaptation, we will open-source our implementations.\n","authors":["Daixuan Cheng","Shaohan Huang","Ziyu Zhu","Xintong Zhang","Wayne Xin Zhao","Zhongzhi Luan","Bo Dai","Zhenliang Zhang"],"pdf_url":"https://arxiv.org/pdf/2411.19930v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19921v1","updated":"2024-11-29T18:36:15Z","published":"2024-11-29T18:36:15Z","title":"SIMS: Simulating Human-Scene Interactions with Real World Script\n Planning","summary":" Simulating long-term human-scene interaction is a challenging yet fascinating\ntask. Previous works have not effectively addressed the generation of long-term\nhuman scene interactions with detailed narratives for physics-based animation.\nThis paper introduces a novel framework for the planning and controlling of\nlong-horizon physical plausible human-scene interaction. On the one hand, films\nand shows with stylish human locomotions or interactions with scenes are\nabundantly available on the internet, providing a rich source of data for\nscript planning. On the other hand, Large Language Models (LLMs) can understand\nand generate logical storylines.\n This motivates us to marry the two by using an LLM-based pipeline to extract\nscripts from videos, and then employ LLMs to imitate and create new scripts,\ncapturing complex, time-series human behaviors and interactions with\nenvironments. By leveraging this, we utilize a dual-aware policy that achieves\nboth language comprehension and scene understanding to guide character motions\nwithin contextual and spatial constraints. To facilitate training and\nevaluation, we contribute a comprehensive planning dataset containing diverse\nmotion sequences extracted from real-world videos and expand them with large\nlanguage models. We also collect and re-annotate motion clips from existing\nkinematic datasets to enable our policy learn diverse skills. 
Extensive\nexperiments demonstrate the effectiveness of our framework in versatile task\nexecution and its generalization ability to various scenarios, showing\nremarkably enhanced performance compared with existing methods. Our code and\ndata will be publicly available soon.\n","authors":["Wenjia Wang","Liang Pan","Zhiyang Dou","Zhouyingcheng Liao","Yuke Lou","Lei Yang","Jingbo Wang","Taku Komura"],"pdf_url":"https://arxiv.org/pdf/2411.19921v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19906v1","updated":"2024-11-29T18:11:39Z","published":"2024-11-29T18:11:39Z","title":"Classical and Quantum Algorithms for the Deterministic L-system\n Inductive Inference Problem","summary":" L-systems can be made to model and create simulations of many biological\nprocesses, such as plant development. Finding an L-system for a given process\nis typically solved by hand, by experts, in a hugely time-consuming process. It\nwould be significant if this could be done automatically from data, such as\nfrom sequences of images. In this paper, we are interested in inferring a\nparticular type of L-system, deterministic context-free L-system (D0L-system)\nfrom a sequence of strings. We introduce the characteristic graph of a sequence\nof strings, which we then utilize to translate our problem (inferring\nD0L-system) in polynomial time into the maximum independent set problem (MIS)\nand the SAT problem. After that, we offer a classical exact algorithm and an\napproximate quantum algorithm for the problem.\n","authors":["Ali Lotfi","Ian McQuillan","Steven Rayan"],"pdf_url":"https://arxiv.org/pdf/2411.19906v1.pdf","comment":"16 pages, 1 figure"},{"id":"http://arxiv.org/abs/2411.19869v1","updated":"2024-11-29T17:31:42Z","published":"2024-11-29T17:31:42Z","title":"AIDetx: a compression-based method for identification of\n machine-learning generated text","summary":" This paper introduces AIDetx, a novel method for detecting machine-generated\ntext using data compression techniques. 
Traditional approaches, such as deep\nlearning classifiers, often suffer from high computational costs and limited\ninterpretability. To address these limitations, we propose a compression-based\nclassification framework that leverages finite-context models (FCMs). AIDetx\nconstructs distinct compression models for human-written and AI-generated text,\nclassifying new inputs based on which model achieves a higher compression\nratio. We evaluated AIDetx on two benchmark datasets, achieving F1 scores\nexceeding 97% and 99%, respectively, highlighting its high accuracy. Compared\nto current methods, such as large language models (LLMs), AIDetx offers a more\ninterpretable and computationally efficient solution, significantly reducing\nboth training time and hardware requirements (e.g., no GPUs needed). The full\nimplementation is publicly available at https://github.com/AIDetx/AIDetx.\n","authors":["Leonardo Almeida","Pedro Rodrigues","Diogo Magalhães","Armando J. Pinho","Diogo Pratas"],"pdf_url":"https://arxiv.org/pdf/2411.19869v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19865v1","updated":"2024-11-29T17:27:05Z","published":"2024-11-29T17:27:05Z","title":"Reverse Thinking Makes LLMs Stronger Reasoners","summary":" Reverse thinking plays a crucial role in human reasoning. Humans can reason\nnot only from a problem to a solution but also in reverse, i.e., start from the\nsolution and reason towards the problem. This often enhances overall reasoning\nperformance as it enables consistency checks between their forward and backward\nthinking. To enable Large Language Models (LLMs) to perform reverse thinking,\nwe introduce Reverse-Enhanced Thinking (RevThink), a framework composed of data\naugmentation and learning objectives. In RevThink, we augment the dataset by\ncollecting structured forward-backward reasoning from a teacher model,\nconsisting of: (1) the original question, (2) forward reasoning, (3) backward\nquestion, and (4) backward reasoning. 
We then employ three objectives to train\na smaller student model in a multi-task learning fashion: (a) generate forward\nreasoning from a question, (b) generate a backward question from a question,\nand (c) generate backward reasoning from the backward question. Experiments\nacross 12 datasets covering commonsense, math, and logical reasoning show an\naverage 13.53% improvement over the student model's zero-shot performance and a\n6.84% improvement over the strongest knowledge distillation baselines.\nMoreover, our method demonstrates sample efficiency -- using only 10% of the\ncorrect forward reasoning from the training data, it outperforms a standard\nfine-tuning method trained on 10x more forward reasoning. RevThink also\nexhibits strong generalization to out-of-distribution held-out datasets.\n","authors":["Justin Chih-Yao Chen","Zifeng Wang","Hamid Palangi","Rujun Han","Sayna Ebrahimi","Long Le","Vincent Perot","Swaroop Mishra","Mohit Bansal","Chen-Yu Lee","Tomas Pfister"],"pdf_url":"https://arxiv.org/pdf/2411.19865v1.pdf","comment":"20 pages"},{"id":"http://arxiv.org/abs/2411.15623v2","updated":"2024-11-29T17:18:49Z","published":"2024-11-23T18:27:35Z","title":"Multi-label Sequential Sentence Classification via Large Language Model","summary":" Sequential sentence classification (SSC) in scientific publications is\ncrucial for supporting downstream tasks such as fine-grained information\nretrieval and extractive summarization. However, current SSC methods are\nconstrained by model size, sequence length, and single-label setting. To\naddress these limitations, this paper proposes LLM-SSC, a large language model\n(LLM)-based framework for both single- and multi-label SSC tasks. Unlike\nprevious approaches that employ small- or medium-sized language models, the\nproposed framework utilizes LLMs to generate SSC labels through designed\nprompts, which enhance task understanding by incorporating demonstrations and a\nquery to describe the prediction target. 
We also present a multi-label\ncontrastive learning loss with auto-weighting scheme, enabling the multi-label\nclassification task. To support our multi-label SSC analysis, we introduce and\nrelease a new dataset, biorc800, which mainly contains unstructured abstracts\nin the biomedical domain with manual annotations. Experiments demonstrate\nLLM-SSC's strong performance in SSC under both in-context learning and\ntask-specific tuning settings. We release biorc800 and our code at:\nhttps://github.com/ScienceNLP-Lab/LLM-SSC.\n","authors":["Mengfei Lan","Lecheng Zheng","Shufan Ming","Halil Kilicoglu"],"pdf_url":"https://arxiv.org/pdf/2411.15623v2.pdf","comment":"Accepted by EMNLP 2024 Findings"},{"id":"http://arxiv.org/abs/2411.19858v1","updated":"2024-11-29T17:12:06Z","published":"2024-11-29T17:12:06Z","title":"What fifty-one years of Linguistics and Artificial Intelligence research\n tell us about their correlation: A scientometric review","summary":" There is a strong correlation between linguistics and artificial intelligence\n(AI), best manifested by deep learning language models. This study provides a\nthorough scientometric analysis of this correlation, synthesizing the\nintellectual production during 51 years, from 1974 to 2024. It involves 5750\nWeb of Science-indexed articles published in 2124 journals, which are written\nby 20835 authors belonging to 13773 research centers in 794 countries. Two\npowerful software, viz., CiteSpace and VOSviewer, were used to generate mapping\nvisualizations of the intellectual landscape, trending issues and (re)emerging\nhotspots. The results indicate that in the 1980s and 1990s, linguistics and AI\nresearch was not robust, characterized by unstable publication over time. 
It\nhas, however, witnessed a remarkable increase of publication since then,\nreaching 1478 articles in 2023, and 546 articles in January-March timespan in\n2024, involving emerging issues and hotspots, addressing new horizons, new\ntopics, and launching new applications and powerful deep learning language\nmodels including ChatGPT.\n","authors":["Mohammed Q. Shormani"],"pdf_url":"https://arxiv.org/pdf/2411.19858v1.pdf","comment":"26 pages, 15 figures"},{"id":"http://arxiv.org/abs/2411.19855v1","updated":"2024-11-29T17:10:33Z","published":"2024-11-29T17:10:33Z","title":"Artificial intelligence contribution to translation industry: looking\n back and forward","summary":" This study provides a comprehensive analysis of artificial intelligence (AI)\ncontribution to translation industry (ACTI) research, synthesizing it over\nforty-one years from 1980-2024. 13220 articles were retrieved from three\nsources, namely WoS, Scopus, and Lens. We provided two types of analysis, viz.,\nscientometric and thematic, focusing on cluster, subject categories, keywords,\nburstness, centrality and research centers as for the former. For the latter,\nwe thematically review 18 articles, selected purposefully from the articles\ninvolved, centering on purpose, approach, findings, and contribution to ACTI\nfuture directions. The findings reveal that in the past AI contribution to\ntranslation industry was not rigorous, resulting in rule-based machine\ntranslation and statistical machine translation whose output was not\nsatisfactory. However, the more AI develops, the more machine translation\ndevelops, incorporating Neural Networking Algorithms and (Deep) Language\nLearning Models like ChatGPT whose translation output has developed\nconsiderably. 
However, much rigorous research is still needed to overcome\nseveral problems encountering translation industry, specifically concerning\nlow-source languages, multi-dialectical and free word order languages, and\ncultural and religious registers.\n","authors":["Mohammed Q. Shormani"],"pdf_url":"https://arxiv.org/pdf/2411.19855v1.pdf","comment":"20 pages, 4 figures"},{"id":"http://arxiv.org/abs/2411.19832v1","updated":"2024-11-29T16:44:02Z","published":"2024-11-29T16:44:02Z","title":"Sensitive Content Classification in Social Media: A Holistic Resource\n and Evaluation","summary":" The detection of sensitive content in large datasets is crucial for ensuring\nthat shared and analysed data is free from harmful material. However, current\nmoderation tools, such as external APIs, suffer from limitations in\ncustomisation, accuracy across diverse sensitive categories, and privacy\nconcerns. Additionally, existing datasets and open-source models focus\npredominantly on toxic language, leaving gaps in detecting other sensitive\ncategories such as substance abuse or self-harm. In this paper, we put forward\na unified dataset tailored for social media content moderation across six\nsensitive categories: conflictual language, profanity, sexually explicit\nmaterial, drug-related content, self-harm, and spam. By collecting and\nannotating data with consistent retrieval strategies and guidelines, we address\nthe shortcomings of previous focalised research. Our analysis demonstrates that\nfine-tuning large language models (LLMs) on this novel dataset yields\nsignificant improvements in detection performance compared to open\noff-the-shelf models such as LLaMA, and even proprietary OpenAI models, which\nunderperform by 10-15% overall. 
This limitation is even more pronounced on\npopular moderation APIs, which cannot be easily tailored to specific sensitive\ncontent categories, among others.\n","authors":["Dimosthenis Antypas","Indira Sen","Carla Perez-Almendros","Jose Camacho-Collados","Francesco Barbieri"],"pdf_url":"https://arxiv.org/pdf/2411.19832v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19822v1","updated":"2024-11-29T16:31:50Z","published":"2024-11-29T16:31:50Z","title":"SDR-GNN: Spectral Domain Reconstruction Graph Neural Network for\n Incomplete Multimodal Learning in Conversational Emotion Recognition","summary":" Multimodal Emotion Recognition in Conversations (MERC) aims to classify\nutterance emotions using textual, auditory, and visual modal features. Most\nexisting MERC methods assume each utterance has complete modalities,\noverlooking the common issue of incomplete modalities in real-world scenarios.\nRecently, graph neural networks (GNNs) have achieved notable results in\nIncomplete Multimodal Emotion Recognition in Conversations (IMERC). However,\ntraditional GNNs focus on binary relationships between nodes, limiting their\nability to capture more complex, higher-order information. Moreover, repeated\nmessage passing can cause over-smoothing, reducing their capacity to preserve\nessential high-frequency details. To address these issues, we propose a\nSpectral Domain Reconstruction Graph Neural Network (SDR-GNN) for incomplete\nmultimodal learning in conversational emotion recognition. SDR-GNN constructs\nan utterance semantic interaction graph using a sliding window based on both\nspeaker and context relationships to model emotional dependencies. To capture\nhigher-order and high-frequency information, SDR-GNN utilizes weighted\nrelationship aggregation, ensuring consistent semantic feature extraction\nacross utterances. 
Additionally, it performs multi-frequency aggregation in the\nspectral domain, enabling efficient recovery of incomplete modalities by\nextracting both high- and low-frequency information. Finally, multi-head\nattention is applied to fuse and optimize features for emotion recognition.\nExtensive experiments on various real-world datasets demonstrate that our\napproach is effective in incomplete multimodal learning and outperforms current\nstate-of-the-art methods.\n","authors":["Fangze Fu","Wei Ai","Fan Yang","Yuntao Shou","Tao Meng","Keqin Li"],"pdf_url":"https://arxiv.org/pdf/2411.19822v1.pdf","comment":"17 pages, 8 figures"},{"id":"http://arxiv.org/abs/2405.18653v2","updated":"2024-11-29T16:19:01Z","published":"2024-05-28T23:32:46Z","title":"Recent Advances of Foundation Language Models-based Continual Learning:\n A Survey","summary":" Recently, foundation language models (LMs) have marked significant\nachievements in the domains of natural language processing (NLP) and computer\nvision (CV). Unlike traditional neural network models, foundation LMs obtain a\ngreat ability for transfer learning by acquiring rich commonsense knowledge\nthrough pre-training on extensive unsupervised datasets with a vast number of\nparameters. However, they still can not emulate human-like continuous learning\ndue to catastrophic forgetting. Consequently, various continual learning\n(CL)-based methodologies have been developed to refine LMs, enabling them to\nadapt to new tasks without forgetting previous knowledge. However, a systematic\ntaxonomy of existing approaches and a comparison of their performance are still\nlacking, which is the gap that our survey aims to fill. We delve into a\ncomprehensive review, summarization, and classification of the existing\nliterature on CL-based approaches applied to foundation language models, such\nas pre-trained language models (PLMs), large language models (LLMs) and\nvision-language models (VLMs). 
We divide these studies into offline CL and\nonline CL, which consist of traditional methods, parameter-efficient-based\nmethods, instruction tuning-based methods and continual pre-training methods.\nOffline CL encompasses domain-incremental learning, task-incremental learning,\nand class-incremental learning, while online CL is subdivided into hard task\nboundary and blurry task boundary settings. Additionally, we outline the\ntypical datasets and metrics employed in CL research and provide a detailed\nanalysis of the challenges and future work for LMs-based continual learning.\n","authors":["Yutao Yang","Jie Zhou","Xuanwen Ding","Tianyu Huai","Shunyu Liu","Qin Chen","Yuan Xie","Liang He"],"pdf_url":"https://arxiv.org/pdf/2405.18653v2.pdf","comment":"Accepted by ACM Computing Survey"},{"id":"http://arxiv.org/abs/2410.08130v2","updated":"2024-11-29T16:18:29Z","published":"2024-10-10T17:14:36Z","title":"Think Beyond Size: Adaptive Prompting for More Effective Reasoning","summary":" Pretrained large language models (LLMs) are increasingly utilized across a\nwide range of natural language processing (NLP) tasks due to their impressive\ncapabilities as few-shot learners. Recent techniques, such as chain-of-thought\n(CoT) prompting, have significantly advanced multi-step reasoning by\nintroducing step-by-step decomposition, achieving state-of-the-art results on\ncomplex reasoning benchmarks. However, these approaches often rely on static\nprompting templates that do not adapt to task complexity or errors during the\nreasoning process. 
In this work, we introduce Adaptive Prompting, a dynamic and\niterative framework designed to enhance reasoning by incorporating real-time\nadjustments to prompt structures and validation mechanisms. Experimental results\ndemonstrate that Adaptive Prompting significantly improves performance on\ndiverse reasoning benchmarks, including arithmetic reasoning (GSM8K,\nMultiArith), logical reasoning and commonsense tasks, achieving substantial\naccuracy gains compared to static prompting baselines. By integrating guided\nprompts, intermediate validation, and self-corrective steps, our approach\nenables smaller models to achieve competitive performance with larger\ncounterparts, such as GPT-4, while maintaining computational efficiency. The\nframework achieves this without requiring fine-tuning or task-specific training\ndata, highlighting the untapped potential of iterative reasoning methods.\n","authors":["Kamesh R"],"pdf_url":"https://arxiv.org/pdf/2410.08130v2.pdf","comment":"Submitted to ICLR 2025. This is a preprint version. Future revisions\n will include additional evaluations and refinements"},{"id":"http://arxiv.org/abs/2411.19799v1","updated":"2024-11-29T16:03:14Z","published":"2024-11-29T16:03:14Z","title":"INCLUDE: Evaluating Multilingual Language Understanding with Regional\n Knowledge","summary":" The performance differential of large language models (LLM) between languages\nhinders their effective deployment in many regions, inhibiting the potential\neconomic and societal value of generative AI tools in many communities.\nHowever, the development of functional LLMs in many languages (i.e.,\nmultilingual LLMs) is bottlenecked by the lack of high-quality evaluation\nresources in languages other than English. Moreover, current practices in\nmultilingual benchmark construction often translate English resources, ignoring\nthe regional and cultural knowledge of the environments in which multilingual\nsystems would be used. 
In this work, we construct an evaluation suite of\n197,243 QA pairs from local exam sources to measure the capabilities of\nmultilingual LLMs in a variety of regional contexts. Our novel resource,\nINCLUDE, is a comprehensive knowledge- and reasoning-centric benchmark across\n44 written languages that evaluates multilingual LLMs for performance in the\nactual language environments where they would be deployed.\n","authors":["Angelika Romanou","Negar Foroutan","Anna Sotnikova","Zeming Chen","Sree Harsha Nelaturu","Shivalika Singh","Rishabh Maheshwary","Micol Altomare","Mohamed A. Haggag","Snegha A","Alfonso Amayuelas","Azril Hafizi Amirudin","Viraat Aryabumi","Danylo Boiko","Michael Chang","Jenny Chim","Gal Cohen","Aditya Kumar Dalmia","Abraham Diress","Sharad Duwal","Daniil Dzenhaliou","Daniel Fernando Erazo Florez","Fabian Farestam","Joseph Marvin Imperial","Shayekh Bin Islam","Perttu Isotalo","Maral Jabbarishiviari","Börje F. Karlsson","Eldar Khalilov","Christopher Klamm","Fajri Koto","Dominik Krzemiński","Gabriel Adriano de Melo","Syrielle Montariol","Yiyang Nan","Joel Niklaus","Jekaterina Novikova","Johan Samir Obando Ceron","Debjit Paul","Esther Ploeger","Jebish Purbey","Swati Rajwal","Selvan Sunitha Ravi","Sara Rydell","Roshan Santhosh","Drishti Sharma","Marjana Prifti Skenduli","Arshia Soltani Moakhar","Bardia Soltani Moakhar","Ran Tamir","Ayush Kumar Tarun","Azmine Toushik Wasi","Thenuka Ovin Weerasinghe","Serhan Yilmaz","Mike Zhang","Imanol Schlag","Marzieh Fadaee","Sara Hooker","Antoine Bosselut"],"pdf_url":"https://arxiv.org/pdf/2411.19799v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.13549v4","updated":"2024-11-29T15:51:23Z","published":"2023-06-23T15:21:52Z","title":"A Survey on Multimodal Large Language Models","summary":" Recently, Multimodal Large Language Model (MLLM) represented by GPT-4V has\nbeen a new rising research hotspot, which uses powerful Large Language Models\n(LLMs) as a brain to perform multimodal tasks. 
The surprising emergent\ncapabilities of MLLM, such as writing stories based on images and OCR-free math\nreasoning, are rare in traditional multimodal methods, suggesting a potential\npath to artificial general intelligence. To this end, both academia and\nindustry have endeavored to develop MLLMs that can compete with or even better\nthan GPT-4V, pushing the limit of research at a surprising speed. In this\npaper, we aim to trace and summarize the recent progress of MLLMs. First of\nall, we present the basic formulation of MLLM and delineate its related\nconcepts, including architecture, training strategy and data, as well as\nevaluation. Then, we introduce research topics about how MLLMs can be extended\nto support more granularity, modalities, languages, and scenarios. We continue\nwith multimodal hallucination and extended techniques, including Multimodal ICL\n(M-ICL), Multimodal CoT (M-CoT), and LLM-Aided Visual Reasoning (LAVR). To\nconclude the paper, we discuss existing challenges and point out promising\nresearch directions. In light of the fact that the era of MLLM has only just\nbegun, we will keep updating this survey and hope it can inspire more research.\nAn associated GitHub link collecting the latest papers is available at\nhttps://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.\n","authors":["Shukang Yin","Chaoyou Fu","Sirui Zhao","Ke Li","Xing Sun","Tong Xu","Enhong Chen"],"pdf_url":"https://arxiv.org/pdf/2306.13549v4.pdf","comment":"Accepted for publication in National Science Review. Project\n page:https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models"},{"id":"http://arxiv.org/abs/2404.02837v2","updated":"2024-11-29T15:49:20Z","published":"2024-04-03T16:16:31Z","title":"Cherry on Top: Parameter Heterogeneity and Quantization in Large\n Language Models","summary":" This paper reveals the phenomenon of parameter heterogeneity in large\nlanguage models (LLMs). 
We find that a small subset of \"cherry\" parameters\nexhibit a disproportionately large influence on model performance, while the\nvast majority of parameters have minimal impact. This heterogeneity is found to\nbe prevalent across different model families, scales, and types. Motivated by\nthis observation, we propose CherryQ, a novel quantization method that unifies\nthe optimization of mixed-precision parameters. CherryQ identifies and\npreserves the critical cherry parameters in high precision while aggressively\nquantizing the remaining parameters to low precision. Extensive experiments\ndemonstrate the effectiveness of CherryQ. CherryQ outperforms existing\nquantization approaches in terms of perplexity and downstream task performance.\nNotably, our 3-bit quantized Vicuna-1.5 exhibits competitive performance\ncompared to their 16-bit counterparts.\n","authors":["Wanyun Cui","Qianle Wang"],"pdf_url":"https://arxiv.org/pdf/2404.02837v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19786v1","updated":"2024-11-29T15:48:24Z","published":"2024-11-29T15:48:24Z","title":"MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks","summary":" Recently, human motion analysis has experienced great improvement due to\ninspiring generative models such as the denoising diffusion model and large\nlanguage model. While the existing approaches mainly focus on generating\nmotions with textual descriptions and overlook the reciprocal task. In this\npaper, we present~\\textbf{MoTe}, a unified multi-modal model that could handle\ndiverse tasks by learning the marginal, conditional, and joint distributions of\nmotion and text simultaneously. MoTe enables us to handle the paired\ntext-motion generation, motion captioning, and text-driven motion generation by\nsimply modifying the input context. Specifically, MoTe is composed of three\ncomponents: Motion Encoder-Decoder (MED), Text Encoder-Decoder (TED), and\nMoti-on-Text Diffusion Model (MTDM). 
In particular, MED and TED are trained for\nextracting latent embeddings, and subsequently reconstructing the motion\nsequences and textual descriptions from the extracted embeddings, respectively.\nMTDM, on the other hand, performs an iterative denoising process on the input\ncontext to handle diverse tasks. Experimental results on the benchmark datasets\ndemonstrate the superior performance of our proposed method on text-to-motion\ngeneration and competitive performance on motion captioning.\n","authors":["Yiming Wu","Wei Ji","Kecheng Zheng","Zicheng Wang","Dong Xu"],"pdf_url":"https://arxiv.org/pdf/2411.19786v1.pdf","comment":"Five figures, six tables"},{"id":"http://arxiv.org/abs/2410.09432v2","updated":"2024-11-29T15:47:03Z","published":"2024-10-12T08:22:44Z","title":"Exact Aggregation for Federated and Efficient Fine-Tuning of Foundation\n Models","summary":" Low-Rank Adaptation (LoRA) is a popular technique for efficient fine-tuning\nof foundation models. However, applying LoRA in federated learning\nenvironments, where data is distributed across multiple clients, presents\nunique challenges. Existing methods rely on traditional federated averaging of\nLoRA adapters, resulting in inexact updates. To address this, we propose\nFederated Exact LoRA, or FedExLoRA, which adds a residual error term to the\npretrained frozen weight matrix. Our approach achieves exact updates with\nminimal computational and communication overhead, preserving LoRA's efficiency.\nWe evaluate the method on various models across arithmetic reasoning,\ncommonsense reasoning, natural language understanding and natural language\ngeneration tasks, showing consistent performance gains over state-of-the-art\nmethods across multiple settings. Through extensive analysis, we quantify that\nthe deviations in updates from the ideal solution are significant, highlighting\nthe need for exact aggregation. 
Our method's simplicity, efficiency, and broad\napplicability position it as a promising solution for accurate and effective\nfederated fine-tuning of foundation models. Our code is publicly available at\nhttps://github.com/RaghavSinghal10/fedex-lora.\n","authors":["Raghav Singhal","Kaustubh Ponkshe","Praneeth Vepakomma"],"pdf_url":"https://arxiv.org/pdf/2410.09432v2.pdf","comment":"Raghav Singhal and Kaustubh Ponkshe contributed equally to this work.\n Another version of the paper accepted at NeurIPS 2024 Workshop on Fine-Tuning\n in Modern Machine Learning: Principles and Scalability"},{"id":"http://arxiv.org/abs/2411.19774v1","updated":"2024-11-29T15:20:29Z","published":"2024-11-29T15:20:29Z","title":"PerLA: Perceptive 3D Language Assistant","summary":" Enabling Large Language Models (LLMs) to understand the 3D physical world is\nan emerging yet challenging research direction. Current strategies for\nprocessing point clouds typically downsample the scene or divide it into\nsmaller parts for separate analysis. However, both approaches risk losing key\nlocal details or global contextual information. In this paper, we introduce\nPerLA, a 3D language assistant designed to be more perceptive to both details\nand context, making visual representations more informative for the LLM. PerLA\ncaptures high-resolution (local) details in parallel from different point cloud\nareas and integrates them with (global) context obtained from a\nlower-resolution whole point cloud. We present a novel algorithm that preserves\npoint cloud locality through the Hilbert curve and effectively aggregates\nlocal-to-global information via cross-attention and a graph neural network.\nLastly, we introduce a novel loss for local representation consensus to promote\ntraining stability. 
PerLA outperforms state-of-the-art 3D language assistants,\nwith gains of up to +1.34 CiDEr on ScanQA for question answering, and +4.22 on\nScanRefer and +3.88 on Nr3D for dense\ncaptioning.\\url{https://gfmei.github.io/PerLA/}\n","authors":["Guofeng Mei","Wei Lin","Luigi Riz","Yujiao Wu","Fabio Poiesi","Yiming Wang"],"pdf_url":"https://arxiv.org/pdf/2411.19774v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19772v1","updated":"2024-11-29T15:18:06Z","published":"2024-11-29T15:18:06Z","title":"LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware\n Omni-Modal Perception of Long Videos","summary":" Despite impressive advancements in video understanding, most efforts remain\nlimited to coarse-grained or visual-only video tasks. However, real-world\nvideos encompass omni-modal information (vision, audio, and speech) with a\nseries of events forming a cohesive storyline. The lack of multi-modal video\ndata with fine-grained event annotations and the high cost of manual labeling\nare major obstacles to comprehensive omni-modality video perception. To address\nthis gap, we propose an automatic pipeline consisting of high-quality\nmulti-modal video filtering, semantically coherent omni-modal event boundary\ndetection, and cross-modal correlation-aware event captioning. In this way, we\npresent LongVALE, the first-ever Vision-Audio-Language Event understanding\nbenchmark comprising 105K omni-modal events with precise temporal boundaries\nand detailed relation-aware captions within 8.4K high-quality long videos.\nFurther, we build a baseline that leverages LongVALE to enable video large\nlanguage models (LLMs) for omni-modality fine-grained temporal video\nunderstanding for the first time. 
Extensive experiments demonstrate the\neffectiveness and great potential of LongVALE in advancing comprehensive\nmulti-modal video understanding.\n","authors":["Tiantian Geng","Jinrui Zhang","Qingni Wang","Teng Wang","Jinming Duan","Feng Zheng"],"pdf_url":"https://arxiv.org/pdf/2411.19772v1.pdf","comment":"18 pages, 15 figures"},{"id":"http://arxiv.org/abs/2411.19770v1","updated":"2024-11-29T15:18:01Z","published":"2024-11-29T15:18:01Z","title":"Noro: A Noise-Robust One-shot Voice Conversion System with Hidden\n Speaker Representation Capabilities","summary":" One-shot voice conversion (VC) aims to alter the timbre of speech from a\nsource speaker to match that of a target speaker using just a single reference\nspeech from the target, while preserving the semantic content of the original\nsource speech. Despite advancements in one-shot VC, its effectiveness decreases\nin real-world scenarios where reference speeches, often sourced from the\ninternet, contain various disturbances like background noise. To address this\nissue, we introduce Noro, a Noise Robust One-shot VC system. Noro features\ninnovative components tailored for VC using noisy reference speeches, including\na dual-branch reference encoding module and a noise-agnostic contrastive\nspeaker loss. Experimental results demonstrate that Noro outperforms our\nbaseline system in both clean and noisy scenarios, highlighting its efficacy\nfor real-world applications. Additionally, we investigate the hidden speaker\nrepresentation capabilities of our baseline system by repurposing its reference\nencoder as a speaker encoder. 
The results shows that it is competitive with\nseveral advanced self-supervised learning models for speaker representation\nunder the SUPERB settings, highlighting the potential for advancing speaker\nrepresentation learning through one-shot VC task.\n","authors":["Haorui He","Yuchen Song","Yuancheng Wang","Haoyang Li","Xueyao Zhang","Li Wang","Gongping Huang","Eng Siong Chng","Zhizheng Wu"],"pdf_url":"https://arxiv.org/pdf/2411.19770v1.pdf","comment":"Submitted to IEEE OJSP"},{"id":"http://arxiv.org/abs/2402.08349v3","updated":"2024-11-29T14:44:27Z","published":"2024-02-13T10:28:57Z","title":"Evaluating the Data Model Robustness of Text-to-SQL Systems Based on\n Real User Queries","summary":" Text-to-SQL systems (also known as NL-to-SQL systems) have become an\nincreasingly popular solution for bridging the gap between user capabilities\nand SQL-based data access. These systems translate user requests in natural\nlanguage to valid SQL statements for a specific database. Recent Text-to-SQL\nsystems have benefited from the rapid improvement of transformer-based language\nmodels. However, while Text-to-SQL systems that incorporate such models\ncontinuously reach new high scores on -- often synthetic -- benchmark datasets,\na systematic exploration of their robustness towards different data models in a\nreal-world, realistic scenario is notably missing. This paper provides the\nfirst in-depth evaluation of the data model robustness of Text-to-SQL systems\nin practice based on a multi-year international project focused on Text-to-SQL\ninterfaces. Our evaluation is based on a real-world deployment of FootballDB, a\nsystem that was deployed over a 9 month period in the context of the FIFA World\nCup 2022, during which about 6K natural language questions were asked and\nexecuted. All of our data is based on real user questions that were asked live\nto the system. We manually labeled and translated a subset of these questions\nfor three different data models. 
For each data model, we explore the\nperformance of representative Text-to-SQL systems and language models. We\nfurther quantify the impact of training data size, pre-, and post-processing\nsteps as well as language model inference time. Our comprehensive evaluation\nsheds light on the design choices of real-world Text-to-SQL systems and their\nimpact on moving from research prototypes to real deployments. Last, we provide\na new benchmark dataset to the community, which is the first to enable the\nevaluation of different data models for the same dataset and is substantially\nmore challenging than most previous datasets in terms of query complexity.\n","authors":["Jonathan Fürst","Catherine Kosten","Farhad Nooralahzadeh","Yi Zhang","Kurt Stockinger"],"pdf_url":"https://arxiv.org/pdf/2402.08349v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.17593v2","updated":"2024-11-29T14:41:48Z","published":"2024-11-26T17:01:27Z","title":"What Differentiates Educational Literature? A Multimodal Fusion Approach\n of Transformers and Computational Linguistics","summary":" The integration of new literature into the English curriculum remains a\nchallenge since educators often lack scalable tools to rapidly evaluate\nreadability and adapt texts for diverse classroom needs. This study proposes to\naddress this gap through a multimodal approach that combines transformer-based\ntext classification with linguistic feature analysis to align texts with UK Key\nStages. Eight state-of-the-art Transformers were fine-tuned on segmented text\ndata, with BERT achieving the highest unimodal F1 score of 0.75. In parallel,\n500 deep neural network topologies were searched for the classification of\nlinguistic characteristics, achieving an F1 score of 0.392. The fusion of these\nmodalities shows a significant improvement, with every multimodal approach\noutperforming all unimodal models. In particular, the ELECTRA Transformer fused\nwith the neural network achieved an F1 score of 0.996. 
Unimodal and multimodal\napproaches are shown to have statistically significant differences in all\nvalidation metrics (accuracy, precision, recall, F1 score) except for inference\ntime. The proposed approach is finally encapsulated in a stakeholder-facing web\napplication, providing non-technical stakeholder access to real-time insights\non text complexity, reading difficulty, curriculum alignment, and\nrecommendations for learning age range. The application empowers data-driven\ndecision making and reduces manual workload by integrating AI-based\nrecommendations into lesson planning for English literature.\n","authors":["Jordan J. Bird"],"pdf_url":"https://arxiv.org/pdf/2411.17593v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19733v1","updated":"2024-11-29T14:26:34Z","published":"2024-11-29T14:26:34Z","title":"A Deep Learning Approach to Language-independent Gender Prediction on\n Twitter","summary":" This work presents a set of experiments conducted to predict the gender of\nTwitter users based on language-independent features extracted from the text of\nthe users' tweets. The experiments were performed on a version of TwiSty\ndataset including tweets written by the users of six different languages:\nPortuguese, French, Dutch, English, German, and Italian. Logistic regression\n(LR), and feed-forward neural networks (FFNN) with back-propagation were used\nto build models in two different settings: Inter-Lingual (IL) and Cross-Lingual\n(CL). In the IL setting, the training and testing were performed on the same\nlanguage whereas in the CL, Italian and German datasets were set aside and only\nused as test sets and the rest were combined to compose training and\ndevelopment sets. In the IL, the highest accuracy score belongs to LR whereas\nin the CL, FFNN with three hidden layers yields the highest score. 
The results\nshow that neural network based models underperform traditional models when the\nsize of the training set is small; however, they beat traditional models by a\nnon-trivial margin, when they are fed with large enough data. Finally, the\nfeature analysis confirms that men and women have different writing styles\nindependent of their language.\n","authors":["Reyhaneh Hashempour","Barbara Plank","Aline Villavicencio","Renato Cordeiro de Amorim"],"pdf_url":"https://arxiv.org/pdf/2411.19733v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19726v1","updated":"2024-11-29T14:17:33Z","published":"2024-11-29T14:17:33Z","title":"Towards Santali Linguistic Inclusion: Building the First\n Santali-to-English Translation Model using mT5 Transformer and Data\n Augmentation","summary":" Around seven million individuals in India, Bangladesh, Bhutan, and Nepal\nspeak Santali, positioning it as nearly the third most commonly used\nAustroasiatic language. Despite its prominence among the Austroasiatic language\nfamily's Munda subfamily, Santali lacks global recognition. Currently, no\ntranslation models exist for the Santali language. Our paper aims to include\nSantali to the NPL spectrum. We aim to examine the feasibility of building\nSantali translation models based on available Santali corpora. The paper\nsuccessfully addressed the low-resource problem and, with promising results,\nexamined the possibility of creating a functional Santali machine translation\nmodel in a low-resource setup. Our study shows that Santali-English parallel\ncorpus performs better when in transformers like mt5 as opposed to untrained\ntransformers, proving that transfer learning can be a viable technique that\nworks with Santali language. Besides the mT5 transformer, Santali-English\nperforms better than Santali-Bangla parallel corpus as the mT5 has been trained\nin way more English data than Bangla data. 
Lastly, our study shows that with\ndata augmentation, our model performs better.\n","authors":["Syed Mohammed Mostaque Billah","Ateya Ahmed Subarna","Sudipta Nandi Sarna","Ahmad Shawkat Wasit","Anika Fariha","Asif Sushmit","Arig Yousuf Sadeque"],"pdf_url":"https://arxiv.org/pdf/2411.19726v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19718v1","updated":"2024-11-29T14:08:32Z","published":"2024-11-29T14:08:32Z","title":"TakeLab Retriever: AI-Driven Search Engine for Articles from Croatian\n News Outlets","summary":" TakeLab Retriever is an AI-driven search engine designed to discover,\ncollect, and semantically analyze news articles from Croatian news outlets. It\noffers a unique perspective on the history and current landscape of Croatian\nonline news media, making it an essential tool for researchers seeking to\nuncover trends, patterns, and correlations that general-purpose search engines\ncannot provide. TakeLab retriever utilizes cutting-edge natural language\nprocessing (NLP) methods, enabling users to sift through articles using named\nentities, phrases, and topics through the web application. This technical\nreport is divided into two parts: the first explains how TakeLab Retriever is\nutilized, while the second provides a detailed account of its design. 
In the\nsecond part, we also address the software engineering challenges involved and\npropose solutions for developing a microservice-based semantic search engine\ncapable of handling over ten million news articles published over the past two\ndecades.\n","authors":["David Dukić","Marin Petričević","Sven Ćurković","Jan Šnajder"],"pdf_url":"https://arxiv.org/pdf/2411.19718v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19689v1","updated":"2024-11-29T13:24:10Z","published":"2024-11-29T13:24:10Z","title":"MIMDE: Exploring the Use of Synthetic vs Human Data for Evaluating\n Multi-Insight Multi-Document Extraction Tasks","summary":" Large language models (LLMs) have demonstrated remarkable capabilities in\ntext analysis tasks, yet their evaluation on complex, real-world applications\nremains challenging. We define a set of tasks, Multi-Insight Multi-Document\nExtraction (MIMDE) tasks, which involves extracting an optimal set of insights\nfrom a document corpus and mapping these insights back to their source\ndocuments. This task is fundamental to many practical applications, from\nanalyzing survey responses to processing medical records, where identifying and\ntracing key insights across documents is crucial. We develop an evaluation\nframework for MIMDE and introduce a novel set of complementary human and\nsynthetic datasets to examine the potential of synthetic data for LLM\nevaluation. After establishing optimal metrics for comparing extracted\ninsights, we benchmark 20 state-of-the-art LLMs on both datasets. Our analysis\nreveals a strong correlation (0.71) between the ability of LLMs to extracts\ninsights on our two datasets but synthetic data fails to capture the complexity\nof document-level analysis. 
These findings offer crucial guidance for the use\nof synthetic data in evaluating text analysis systems, highlighting both its\npotential and limitations.\n","authors":["John Francis","Saba Esnaashari","Anton Poletaev","Sukankana Chakraborty","Youmna Hashem","Jonathan Bright"],"pdf_url":"https://arxiv.org/pdf/2411.19689v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19668v1","updated":"2024-11-29T12:48:49Z","published":"2024-11-29T12:48:49Z","title":"ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with\n Multi-dimensional and fine-grained information","summary":" During the development of large language models (LLMs), pre-training data\nplay a critical role in shaping LLMs' capabilities. In recent years several\nlarge-scale and high-quality pre-training datasets have been released to\naccelerate the research of LLMs, including ChineseWebText1.0, C4, Pile,\nWanJuan, MAPCC and others. However, as LLMs continue to evolve, focus has\nincreasingly shifted to domain-specific capabilities and safety concerns,\nmaking those previous coarse-grained texts insufficient for meeting training\nrequirements. Furthermore, fine-grained information, such as quality, domain\nand toxicity, is becoming increasingly important in building powerful and\nreliable LLMs for various scenarios. To address these challenges, in this paper\nwe propose a new tool-chain called MDFG-tool for constructing large-scale and\nhigh-quality Chinese datasets with multi-dimensional and fine-grained\ninformation. First, we employ manually crafted rules to discard explicit noisy\ntexts from raw contents. Second, the quality evaluation model, domain\nclassifier, and toxicity evaluation model are well-designed to assess the\nremaining cleaned data respectively. Finally, we integrate these three types of\nfine-grained information for each text. 
With this approach, we release the\nlargest, high-quality and fine-grained Chinese text ChineseWebText2.0, which\nconsists of 3.8TB and each text is associated with a quality score, domain\nlabels, a toxicity label and a toxicity score, facilitating the LLM researchers\nto select data based on various types of fine-grained information. The data,\ncodes and the tool-chain are available on this website\nhttps://github.com/CASIA-LM/ChineseWebText-2.0\n","authors":["Wanyue Zhang","Ziyong Li","Wen Yang","Chunlin Leng","Yinan Bai","Qianlong Du","Chengqing Zong","Jiajun Zhang"],"pdf_url":"https://arxiv.org/pdf/2411.19668v1.pdf","comment":"ChineseWebTex2.0 dataset is available at\n https://github.com/CASIA-LM/ChineseWebText-2.0"},{"id":"http://arxiv.org/abs/2411.19655v1","updated":"2024-11-29T12:21:15Z","published":"2024-11-29T12:21:15Z","title":"Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-OASIS","summary":" After the introduction of Large Language Models (LLMs), there have been\nsubstantial improvements in the performance of Natural Language Generation\n(NLG) tasks, including Text Summarization and Machine Translation. However,\nLLMs still produce outputs containing hallucinations, that is, content not\ngrounded in factual information. Therefore, developing methods to assess the\nfactuality of LLMs has become urgent.\n Indeed, resources for factuality evaluation have recently emerged. Although\nchallenging, these resources face one or more of the following limitations: (i)\nthey are tailored to a specific task or domain; (ii) they are limited in size,\nthereby preventing the training of new factuality evaluators; (iii) they are\ndesigned for simpler verification tasks, such as claim verification.\n To address these issues, we introduce LLM-Oasis, to the best of our knowledge\nthe largest resource for training end-to-end factuality evaluators. 
LLM-Oasis\nis constructed by extracting claims from Wikipedia, falsifying a subset of\nthese claims, and generating pairs of factual and unfactual texts. We then rely\non human annotators to both validate the quality of our dataset and to create a\ngold standard test set for benchmarking factuality evaluation systems.\n Our experiments demonstrate that LLM-Oasis presents a significant challenge\nfor state-of-the-art LLMs, with GPT-4o achieving up to 60% accuracy in our\nproposed end-to-end factuality evaluation task, highlighting its potential to\ndrive future research in the field.\n","authors":["Alessandro Scirè","Andrei Stefan Bejgu","Simone Tedeschi","Karim Ghonim","Federico Martelli","Roberto Navigli"],"pdf_url":"https://arxiv.org/pdf/2411.19655v1.pdf","comment":"15 pages. To be submitted to CL journal"},{"id":"http://arxiv.org/abs/2411.19650v1","updated":"2024-11-29T12:06:03Z","published":"2024-11-29T12:06:03Z","title":"CogACT: A Foundational Vision-Language-Action Model for Synergizing\n Cognition and Action in Robotic Manipulation","summary":" The advancement of large Vision-Language-Action (VLA) models has\nsignificantly improved robotic manipulation in terms of language-guided task\nexecution and generalization to unseen scenarios. While existing VLAs adapted\nfrom pretrained large Vision-Language-Models (VLM) have demonstrated promising\ngeneralizability, their task performance is still unsatisfactory as indicated\nby the low tasks success rates in different environments. In this paper, we\npresent a new advanced VLA architecture derived from VLM. Unlike previous works\nthat directly repurpose VLM for action prediction by simple action\nquantization, we propose a omponentized VLA architecture that has a specialized\naction module conditioned on VLM output. 
We systematically study the design of\nthe action module and demonstrates the strong performance enhancement with\ndiffusion action transformers for action sequence modeling, as well as their\nfavorable scaling behaviors. We also conduct comprehensive experiments and\nablation studies to evaluate the efficacy of our models with varied designs.\nThe evaluation on 5 robot embodiments in simulation and real work shows that\nour model not only significantly surpasses existing VLAs in task performance\nand but also exhibits remarkable adaptation to new robots and generalization to\nunseen objects and backgrounds. It exceeds the average success rates of OpenVLA\nwhich has similar model size (7B) with ours by over 35% in simulated evaluation\nand 55% in real robot experiments. It also outperforms the large RT-2-X model\n(55B) by 18% absolute success rates in simulation. Code and models can be found\non our project page (https://cogact.github.io/).\n","authors":["Qixiu Li","Yaobo Liang","Zeyu Wang","Lin Luo","Xi Chen","Mozheng Liao","Fangyun Wei","Yu Deng","Sicheng Xu","Yizhong Zhang","Xiaofan Wang","Bei Liu","Jianlong Fu","Jianmin Bao","Dong Chen","Yuanchun Shi","Jiaolong Yang","Baining Guo"],"pdf_url":"https://arxiv.org/pdf/2411.19650v1.pdf","comment":"Project Webpage: https://cogact.github.io/"},{"id":"http://arxiv.org/abs/2402.11295v6","updated":"2024-11-29T11:47:55Z","published":"2024-02-17T14:26:57Z","title":"OneBit: Towards Extremely Low-bit Large Language Models","summary":" Model quantification uses low bit-width values to represent the weight\nmatrices of existing models to be quantized, which is a promising approach to\nreduce both storage and computational overheads of deploying highly anticipated\nLLMs. However, current quantization methods suffer severe performance\ndegradation when the bit-width is extremely reduced, and thus focus on\nutilizing 4-bit or 8-bit values to quantize models. 
This paper boldly quantizes\nthe weight matrices of LLMs to 1-bit, paving the way for the extremely low\nbit-width deployment of LLMs. For this target, we introduce a 1-bit model\ncompressing framework named OneBit, including a novel 1-bit parameter\nrepresentation method to better quantize LLMs as well as an effective parameter\ninitialization method based on matrix decomposition to improve the convergence\nspeed of the quantization framework. Sufficient experimental results indicate\nthat OneBit achieves good performance (at least 81% of the non-quantized\nperformance on LLaMA models) with robust training processes when only using\n1-bit weight matrices.\n","authors":["Yuzhuang Xu","Xu Han","Zonghan Yang","Shuo Wang","Qingfu Zhu","Zhiyuan Liu","Weidong Liu","Wanxiang Che"],"pdf_url":"https://arxiv.org/pdf/2402.11295v6.pdf","comment":"Accepted by NeurIPS 2024"},{"id":"http://arxiv.org/abs/2411.19638v1","updated":"2024-11-29T11:42:58Z","published":"2024-11-29T11:42:58Z","title":"LLM Teacher-Student Framework for Text Classification With No Manually\n Annotated Data: A Case Study in IPTC News Topic Classification","summary":" With the ever-increasing number of news stories available online, classifying\nthem by topic, regardless of the language they are written in, has become\ncrucial for enhancing readers' access to relevant content. To address this\nchallenge, we propose a teacher-student framework based on large language\nmodels (LLMs) for developing multilingual news classification models of\nreasonable size with no need for manual data annotation. The framework employs\na Generative Pretrained Transformer (GPT) model as the teacher model to develop\nan IPTC Media Topic training dataset through automatic annotation of news\narticles in Slovenian, Croatian, Greek, and Catalan. The teacher model exhibits\na high zero-shot performance on all four languages. Its agreement with human\nannotators is comparable to that between the human annotators themselves. 
To\nmitigate the computational limitations associated with the requirement of\nprocessing millions of texts daily, smaller BERT-like student models are\nfine-tuned on the GPT-annotated dataset. These student models achieve high\nperformance comparable to the teacher model. Furthermore, we explore the impact\nof the training data size on the performance of the student models and\ninvestigate their monolingual, multilingual and zero-shot cross-lingual\ncapabilities. The findings indicate that student models can achieve high\nperformance with a relatively small number of training instances, and\ndemonstrate strong zero-shot cross-lingual abilities. Finally, we publish the\nbest-performing news topic classifier, enabling multilingual classification\nwith the top-level categories of the IPTC Media Topic schema.\n","authors":["Taja Kuzman","Nikola Ljubešić"],"pdf_url":"https://arxiv.org/pdf/2411.19638v1.pdf","comment":"This work has been submitted to the IEEE for possible publication"},{"id":"http://arxiv.org/abs/2411.19628v1","updated":"2024-11-29T11:24:23Z","published":"2024-11-29T11:24:23Z","title":"Accelerating Multimodal Large Language Models via Dynamic Visual-Token\n Exit and the Empirical Findings","summary":" The excessive use of visual tokens in existing Multimoal Large Language\nModels (MLLMs) often exhibits obvious redundancy and brings in prohibitively\nexpensive computation. To gain insights into this problem, we first conduct\nextensive empirical studies on the attention behaviors of MLLMs, and summarize\nthree main inference stages in MLLMs: (i) Early fusion between tokens is first\naccomplished quickly. (ii) Intra-modality modeling then comes to play. (iii)\nMultimodal reasoning} resumes and lasts until the end of inference. In\nparticular, we reveal that visual tokens will stop contributing to reasoning\nwhen the text tokens receive enough image information, yielding obvious visual\nredundancy. 
Based on these generalized observations, we propose a simple yet\neffective method to improve the efficiency of MLLMs, termed dynamic\nvisual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive\nthe text token status and decide the removal of all visual tokens after a\ncertain layer, thereby addressing the observed visual redundancy. To validate\nVTE, we apply it to a set of MLLMs, including LLaVA, VILA, Eagle and InternVL,\nand conduct extensive experiments on a bunch of benchmarks. The experiment\nresults not only show the effectiveness of our VTE in improving MLLMs'\nefficiency, but also yield the general modeling patterns of MLLMs, well\nfacilitating the in-depth understanding of MLLMs. Our code is anonymously\nreleased at https://github.com/DoubtedSteam/DyVTE.\n","authors":["Qiong Wu","Wenhao Lin","Weihao Ye","Yiyi Zhou","Xiaoshuai Sun","Rongrong Ji"],"pdf_url":"https://arxiv.org/pdf/2411.19628v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.12025v3","updated":"2024-11-29T10:46:22Z","published":"2024-02-19T10:34:13Z","title":"Speech Translation with Speech Foundation Models and Large Language\n Models: What is There and What is Missing?","summary":" The field of natural language processing (NLP) has recently witnessed a\ntransformative shift with the emergence of foundation models, particularly\nLarge Language Models (LLMs) that have revolutionized text-based NLP. This\nparadigm has extended to other modalities, including speech, where researchers\nare actively exploring the combination of Speech Foundation Models (SFMs) and\nLLMs into single, unified models capable of addressing multimodal tasks. Among\nsuch tasks, this paper focuses on speech-to-text translation (ST). By examining\nthe published papers on the topic, we propose a unified view of the\narchitectural solutions and training strategies presented so far, highlighting\nsimilarities and differences among them. 
Based on this examination, we not only\norganize the lessons learned but also show how diverse settings and evaluation\napproaches hinder the identification of the best-performing solution for each\narchitectural building block and training choice. Lastly, we outline\nrecommendations for future works on the topic aimed at better understanding the\nstrengths and weaknesses of the SFM+LLM solutions for ST.\n","authors":["Marco Gaido","Sara Papi","Matteo Negri","Luisa Bentivogli"],"pdf_url":"https://arxiv.org/pdf/2402.12025v3.pdf","comment":"Outstanding paper at the ACL 2024 main conference"},{"id":"http://arxiv.org/abs/2409.06567v2","updated":"2024-11-29T10:43:14Z","published":"2024-09-10T14:58:55Z","title":"Exploring syntactic information in sentence embeddings through\n multilingual subject-verb agreement","summary":" In this paper, our goal is to investigate to what degree multilingual\npretrained language models capture cross-linguistically valid abstract\nlinguistic representations. We take the approach of developing curated\nsynthetic data on a large scale, with specific properties, and using them to\nstudy sentence representations built using pretrained language models. We use a\nnew multiple-choice task and datasets, Blackbird Language Matrices (BLMs), to\nfocus on a specific grammatical structural phenomenon -- subject-verb agreement\nacross a variety of sentence structures -- in several languages. Finding a\nsolution to this task requires a system detecting complex linguistic patterns\nand paradigms in text representations. 
Using a two-level architecture that\nsolves the problem in two steps -- detect syntactic objects and their\nproperties in individual sentences, and find patterns across an input sequence\nof sentences -- we show that despite having been trained on multilingual texts\nin a consistent manner, multilingual pretrained language models have\nlanguage-specific differences, and syntactic structure is not shared, even\nacross closely related languages.\n","authors":["Vivi Nastase","Chunyang Jiang","Giuseppe Samo","Paola Merlo"],"pdf_url":"https://arxiv.org/pdf/2409.06567v2.pdf","comment":"13 pages, 5 tables, 6 figures"},{"id":"http://arxiv.org/abs/2310.08367v2","updated":"2024-11-29T10:39:26Z","published":"2023-10-12T14:38:25Z","title":"Towards Evaluating Generalist Agents: An Automated Benchmark in Open\n World","summary":" Evaluating generalist agents presents significant challenges due to their\nwide-ranging abilities and the limitations of current benchmarks in assessing\ntrue generalization. We introduce the Minecraft Universe (MCU), a fully\nautomated benchmarking framework set within the open-world game Minecraft. MCU\ndynamically generates and evaluates a broad spectrum of tasks, offering three\ncore components: 1) a task generation mechanism that provides high degrees of\nfreedom and variability, 2) an ever-expanding set of over 3K composable atomic\ntasks, and 3) a general evaluation framework that supports open-ended task\nassessment. By integrating large language models (LLMs), MCU dynamically\ncreates diverse environments for each evaluation, fostering agent\ngeneralization. 
The framework uses a vision-language model (VLM) to\nautomatically generate evaluation criteria, achieving over 90% agreement with\nhuman ratings across multi-dimensional assessments, which demonstrates that MCU\nis a scalable and explainable solution for evaluating generalist agents.\nAdditionally, we show that while state-of-the-art foundational models perform\nwell on specific tasks, they often struggle with increased task diversity and\ndifficulty.\n","authors":["Xinyue Zheng","Haowei Lin","Kaichen He","Zihao Wang","Zilong Zheng","Yitao Liang"],"pdf_url":"https://arxiv.org/pdf/2310.08367v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.06622v2","updated":"2024-11-29T10:30:11Z","published":"2024-09-10T16:22:18Z","title":"Exploring Italian sentence embeddings properties through multi-tasking","summary":" We investigate to what degree existing LLMs encode abstract linguistic\ninformation in Italian in a multi-task setting. We exploit curated synthetic\ndata on a large scale -- several Blackbird Language Matrices (BLMs) problems in\nItalian -- and use them to study how sentence representations built using\npre-trained language models encode specific syntactic and semantic information.\nWe use a two-level architecture to model separately a compression of the\nsentence embeddings into a representation that contains relevant information\nfor a task, and a BLM task. We then investigate whether we can obtain\ncompressed sentence representations that encode syntactic and semantic\ninformation relevant to several BLM tasks. 
While we expected that the sentence\nstructure -- in terms of sequence of phrases/chunks -- and chunk properties\ncould be shared across tasks, performance and error analysis show that the\nclues for the different tasks are encoded in different manners in the sentence\nembeddings, suggesting that abstract linguistic notions such as constituents or\nthematic roles does not seem to be present in the pretrained sentence\nembeddings.\n","authors":["Vivi Nastase","Giuseppe Samo","Chunyang Jiang","Paola Merlo"],"pdf_url":"https://arxiv.org/pdf/2409.06622v2.pdf","comment":"11 pages, 6 figures, 4 tables"},{"id":"http://arxiv.org/abs/2408.02085v4","updated":"2024-11-29T10:10:43Z","published":"2024-08-04T16:50:07Z","title":"Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data\n Assessment and Selection for Instruction Tuning of Language Models","summary":" Instruction tuning plays a critical role in aligning large language models\n(LLMs) with human preference. Despite the vast amount of open instruction\ndatasets, naively training a LLM on all existing instructions may not be\noptimal and practical. To pinpoint the most beneficial datapoints, data\nassessment and selection methods have been proposed in the fields of natural\nlanguage processing (NLP) and deep learning. However, under the context of\ninstruction tuning, there still exists a gap in knowledge on what kind of data\nevaluation metrics can be employed and how they can be integrated into the\nselection mechanism. To bridge this gap, we present a comprehensive review on\nexisting literature of data assessment and selection especially for instruction\ntuning of LLMs. We systematically categorize all applicable methods into\nquality-based, diversity-based, and importance-based ones where a unified,\nfine-grained taxonomy is structured. For each category, representative methods\nare elaborated to describe the landscape of relevant research. 
In addition,\ncomparison between the latest methods is conducted on their officially reported\nresults to provide in-depth discussions on their limitations. Finally, we\nsummarize the open challenges and propose the promosing avenues for future\nstudies. All related contents are available at\nhttps://github.com/yuleiqin/fantastic-data-engineering.\n","authors":["Yulei Qin","Yuncheng Yang","Pengcheng Guo","Gang Li","Hang Shao","Yuchen Shi","Zihan Xu","Yun Gu","Ke Li","Xing Sun"],"pdf_url":"https://arxiv.org/pdf/2408.02085v4.pdf","comment":"review, survey, 37 pages, 5 figures, 4 tables"},{"id":"http://arxiv.org/abs/2411.19589v1","updated":"2024-11-29T10:10:16Z","published":"2024-11-29T10:10:16Z","title":"Can Large Language Models Reason about the Region Connection Calculus?","summary":" Qualitative Spatial Reasoning is a well explored area of Knowledge\nRepresentation and Reasoning and has multiple applications ranging from\nGeographical Information Systems to Robotics and Computer Vision. Recently,\nmany claims have been made for the reasoning capabilities of Large Language\nModels (LLMs). Here, we investigate the extent to which a set of representative\nLLMs can perform classical qualitative spatial reasoning tasks on the\nmereotopological Region Connection Calculus, RCC-8. We conduct three pairs of\nexperiments (reconstruction of composition tables, alignment to human\ncomposition preferences, conceptual neighbourhood reconstruction) using\nstate-of-the-art LLMs; in each pair one experiment uses eponymous relations and\none, anonymous relations (to test the extent to which the LLM relies on\nknowledge about the relation names obtained during training). All instances are\nrepeated 30 times to measure the stochasticity of the LLMs.\n","authors":["Anthony G Cohn","Robert E Blackwell"],"pdf_url":"https://arxiv.org/pdf/2411.19589v1.pdf","comment":"13 pages. 
arXiv admin note: text overlap with arXiv:2309.15577"},{"id":"http://arxiv.org/abs/2411.19581v1","updated":"2024-11-29T09:54:08Z","published":"2024-11-29T09:54:08Z","title":"In-Context Learning with Noisy Labels","summary":" In-context learning refers to the emerging ability of large language models\n(LLMs) to perform a target task without additional training, utilizing\ndemonstrations of the task. Recent studies aim to enhance in-context learning\nperformance by selecting more useful demonstrations. However, they overlook the\npresence of inevitable noisy labels in task demonstrations that arise during\nthe labeling process in the real-world. In this paper, we propose a new task,\nin-context learning with noisy labels, which aims to solve real-world problems\nfor in-context learning where labels in task demonstrations would be corrupted.\nMoreover, we propose a new method and baseline methods for the new task,\ninspired by studies in learning with noisy labels. Through experiments, we\ndemonstrate that our proposed method can serve as a safeguard against\nperformance degradation in in-context learning caused by noisy labels.\n","authors":["Junyong Kang","Donghyun Son","Hwanjun Song","Buru Chang"],"pdf_url":"https://arxiv.org/pdf/2411.19581v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19579v1","updated":"2024-11-29T09:50:32Z","published":"2024-11-29T09:50:32Z","title":"ICPR 2024 Competition on Multilingual Claim-Span Identification","summary":" A lot of claims are made in social media posts, which may contain\nmisinformation or fake news. Hence, it is crucial to identify claims as a first\nstep towards claim verification. Given the huge number of social media posts,\nthe task of identifying claims needs to be automated. This competition deals\nwith the task of 'Claim Span Identification' in which, given a text, parts /\nspans that correspond to claims are to be identified. 
This task is more\nchallenging than the traditional binary classification of text into claim or\nnot-claim, and requires state-of-the-art methods in Pattern Recognition,\nNatural Language Processing and Machine Learning. For this competition, we used\na newly developed dataset called HECSI containing about 8K posts in English and\nabout 8K posts in Hindi with claim-spans marked by human annotators. This paper\ngives an overview of the competition, and the solutions developed by the\nparticipating teams.\n","authors":["Soham Poddar","Biswajit Paul","Moumita Basu","Saptarshi Ghosh"],"pdf_url":"https://arxiv.org/pdf/2411.19579v1.pdf","comment":"To appear at ICPR 2024"},{"id":"http://arxiv.org/abs/2411.19574v1","updated":"2024-11-29T09:42:38Z","published":"2024-11-29T09:42:38Z","title":"KV Shifting Attention Enhances Language Modeling","summary":" The current large language models are mainly based on decode-only structure\ntransformers, which have great in-context learning (ICL) capabilities. It is\ngenerally believed that the important foundation of its ICL capability is the\ninduction heads mechanism, which requires at least two layers attention. In\norder to more efficiently implement the ability of the model's induction, we\nrevisit the induction heads mechanism and proposed a KV shifting attention. We\ntheoretically prove that the KV shifting attention reducing the model's\nrequirements for the depth and width of the induction heads mechanism. 
Our\nexperimental results demonstrate that KV shifting attention is beneficial to\nlearning induction heads and language modeling, which lead to better\nperformance or faster convergence from toy models to the pre-training models\nwith more than 10 B parameters.\n","authors":["Mingyu Xu","Wei Cheng","Bingning Wang","Weipeng Chen"],"pdf_url":"https://arxiv.org/pdf/2411.19574v1.pdf","comment":"22 pages"},{"id":"http://arxiv.org/abs/2411.19563v1","updated":"2024-11-29T09:18:32Z","published":"2024-11-29T09:18:32Z","title":"Ensemble Watermarks for Large Language Models","summary":" The rapid advancement of large language models (LLMs) has made it\nincreasingly difficult to distinguish between text written by humans and\nmachines. While watermarks already exist for LLMs, they often lack flexibility,\nand struggle with attacks such as paraphrasing. To address these issues, we\npropose a multi-feature method for generating watermarks that combines multiple\ndistinct watermark features into an ensemble watermark. Concretely, we combine\nacrostica and sensorimotor norms with the established red-green watermark to\nachieve a 98% detection rate. After a paraphrasing attack the performance\nremains high with 95% detection rate. The red-green feature alone as baseline\nachieves a detection rate of 49%. The evaluation of all feature combinations\nreveals that the ensemble of all three consistently has the highest detection\nrate across several LLMs and watermark strength settings. Due to the\nflexibility of combining features in the ensemble, various requirements and\ntrade-offs can be addressed. Additionally, for all ensemble configurations the\nsame detection function can be used without adaptations. This method is\nparticularly of interest to facilitate accountability and prevent societal\nharm.\n","authors":["Georg Niess","Roman Kern"],"pdf_url":"https://arxiv.org/pdf/2411.19563v1.pdf","comment":"9 pages in the main body. 
Code is available at\n http://github.com/CommodoreEU/master-generation. arXiv admin note:\n substantial text overlap with arXiv:2405.08400"},{"id":"http://arxiv.org/abs/2411.19557v1","updated":"2024-11-29T09:10:30Z","published":"2024-11-29T09:10:30Z","title":"Initialization using Update Approximation is a Silver Bullet for\n Extremely Efficient Low-Rank Fine-Tuning","summary":" Low-rank adapters have become a standard approach for efficiently fine-tuning\nlarge language models (LLMs), but they often fall short of achieving the\nperformance of full fine-tuning. We propose a method, LoRA Silver Bullet or\nLoRA-SB, that approximates full fine-tuning within low-rank subspaces using a\ncarefully designed initialization strategy. We theoretically demonstrate that\nthe architecture of LoRA-XS, which inserts a trainable (r x r) matrix between B\nand A while keeping other matrices fixed, provides the precise conditions\nneeded for this approximation. We leverage its constrained update space to\nachieve optimal scaling for high-rank gradient updates while removing the need\nfor hyperparameter tuning. We prove that our initialization offers an optimal\nlow-rank approximation of the initial gradient and preserves update directions\nthroughout training. Extensive experiments across mathematical reasoning,\ncommonsense reasoning, and language understanding tasks demonstrate that our\napproach exceeds the performance of standard LoRA while using 27-90x fewer\nparameters, and comprehensively outperforms LoRA-XS. Our findings establish\nthat it is possible to simulate full fine-tuning in low-rank subspaces, and\nachieve significant efficiency gains without sacrificing performance. 
Our code\nis publicly available at https://github.com/RaghavSinghal10/lora-sb.\n","authors":["Kaustubh Ponkshe","Raghav Singhal","Eduard Gorbunov","Alexey Tumanov","Samuel Horvath","Praneeth Vepakomma"],"pdf_url":"https://arxiv.org/pdf/2411.19557v1.pdf","comment":"Kaustubh Ponkshe and Raghav Singhal contributed equally to this work"},{"id":"http://arxiv.org/abs/2411.16205v3","updated":"2024-11-29T08:48:17Z","published":"2024-11-25T09:05:36Z","title":"MH-MoE: Multi-Head Mixture-of-Experts","summary":" Multi-Head Mixture-of-Experts (MH-MoE) demonstrates superior performance by\nusing the multi-head mechanism to collectively attend to information from\nvarious representation spaces within different experts. In this paper, we\npresent a novel implementation of MH-MoE that maintains both FLOPs and\nparameter parity with sparse Mixture of Experts models. Experimental results on\nlanguage models show that the new implementation yields quality improvements\nover both vanilla MoE and fine-grained MoE models. Additionally, our\nexperiments demonstrate that MH-MoE is compatible with 1-bit Large Language\nModels (LLMs) such as BitNet.\n","authors":["Shaohan Huang","Xun Wu","Shuming Ma","Furu Wei"],"pdf_url":"https://arxiv.org/pdf/2411.16205v3.pdf","comment":"7 pages, 0 figures"},{"id":"http://arxiv.org/abs/2411.19547v1","updated":"2024-11-29T08:47:04Z","published":"2024-11-29T08:47:04Z","title":"Training Agents with Weakly Supervised Feedback from Large Language\n Models","summary":" Large Language Models (LLMs) offer a promising basis for creating agents that\ncan tackle complex tasks through iterative environmental interaction. 
Existing\nmethods either require these agents to mimic expert-provided trajectories or\nrely on definitive environmental feedback for reinforcement learning which\nlimits their application to specific scenarios like gaming or code generation.\nThis paper introduces a novel training method for LLM-based agents using weakly\nsupervised signals from a critic LLM, bypassing the need for expert\ntrajectories or definitive feedback. Our agents are trained in iterative\nmanner, where they initially generate trajectories through environmental\ninteraction. Subsequently, a critic LLM selects a subset of good trajectories,\nwhich are then used to update the agents, enabling them to generate improved\ntrajectories in the next iteration. Extensive tests on the API-bank dataset\nshow consistent improvement in our agents' capabilities and comparable\nperformance to GPT-4, despite using open-source models with much fewer\nparameters.\n","authors":["Dihong Gong","Pu Lu","Zelong Wang","Meng Zhou","Xiuqiang He"],"pdf_url":"https://arxiv.org/pdf/2411.19547v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19539v1","updated":"2024-11-29T08:34:07Z","published":"2024-11-29T08:34:07Z","title":"Knowledge Management for Automobile Failure Analysis Using Graph RAG","summary":" This paper presents a knowledge management system for automobile failure\nanalysis using retrieval-augmented generation (RAG) with large language models\n(LLMs) and knowledge graphs (KGs). In the automotive industry, there is a\ngrowing demand for knowledge transfer of failure analysis from experienced\nengineers to young engineers. However, failure events are phenomena that occur\nin a chain reaction, making them difficult for beginners to analyze them. 
While\nknowledge graphs, which can describe semantic relationships and structure\ninformation is effective in representing failure events, due to their\ncapability of representing the relationships between components, there is much\ninformation in KGs, so it is challenging for young engineers to extract and\nunderstand sub-graphs from the KG. On the other hand, there is increasing\ninterest in the use of Graph RAG, a type of RAG that combines LLMs and KGs for\nknowledge management. However, when using the current Graph RAG framework with\nan existing knowledge graph for automobile failures, several issues arise\nbecause it is difficult to generate executable queries for a knowledge graph\ndatabase which is not constructed by LLMs. To address this, we focused on\noptimizing the Graph RAG pipeline for existing knowledge graphs. Using an\noriginal Q&A dataset, the ROUGE F1 score of the sentences generated by the\nproposed method showed an average improvement of 157.6% compared to the current\nmethod. This highlights the effectiveness of the proposed method for automobile\nfailure analysis.\n","authors":["Yuta Ojima","Hiroki Sakaji","Tadashi Nakamura","Hiroaki Sakata","Kazuya Seki","Yuu Teshigawara","Masami Yamashita","Kazuhiro Aoyama"],"pdf_url":"https://arxiv.org/pdf/2411.19539v1.pdf","comment":"7 pages, 6 figures, to be published in 2024 IEEE International\n Conference on Bid Data (BigData)"},{"id":"http://arxiv.org/abs/2411.10666v2","updated":"2024-11-29T08:16:29Z","published":"2024-11-16T02:02:49Z","title":"SAM Decoding: Speculative Decoding via Suffix Automaton","summary":" Large Language Models (LLMs) have revolutionized natural language processing\nby unifying tasks into text generation, yet their large parameter sizes and\nautoregressive nature limit inference speed. SAM-Decoding addresses this by\nintroducing a novel retrieval-based speculative decoding method that uses a\nsuffix automaton for efficient and accurate draft generation. 
Unlike n-gram\nmatching used by the existing method, SAM-Decoding finds the longest suffix\nmatch in generating text and text corpuss, achieving an average time complexity\nof $O(1)$ per generation step. SAM-Decoding constructs static and dynamic\nsuffix automatons for the text corpus and input prompts, respectively, enabling\nfast and precise draft generation. Meanwhile, it is designed as an approach\nthat can be combined with existing methods, allowing SAM-Decoding to adaptively\nselect a draft generation strategy based on the matching length, thus\nincreasing the inference speed of the LLM. When combined with Token Recycling,\nevaluations show SAM-Decoding outperforms existing model-free methods,\nachieving a speedup of $2.27\\times$ over autoregressive decoding on Spec-Bench.\nWhen combined with EAGLE2, it reaches a speedup of $2.49\\times$, surpassing all\ncurrent approaches. Our code is available at\nhttps://github.com/hyx1999/SAM-Decoding.\n","authors":["Yuxuan Hu","Ke Wang","Xiaokang Zhang","Fanjin Zhang","Cuiping Li","Hong Chen","Jing Zhang"],"pdf_url":"https://arxiv.org/pdf/2411.10666v2.pdf","comment":"17 pages, 5 figures"},{"id":"http://arxiv.org/abs/2411.19504v1","updated":"2024-11-29T06:48:13Z","published":"2024-11-29T06:48:13Z","title":"TQA-Bench: Evaluating LLMs for Multi-Table Question Answering with\n Scalable Context and Symbolic Extension","summary":" The advent of large language models (LLMs) has unlocked great opportunities\nin complex data management tasks, particularly in question answering (QA) over\ncomplicated multi-table relational data. Despite significant progress,\nsystematically evaluating LLMs on multi-table QA remains a critical challenge\ndue to the inherent complexity of analyzing heterogeneous table structures and\npotential large scale of serialized relational data. 
Existing benchmarks\nprimarily focus on single-table QA, failing to capture the intricacies of\nreasoning across multiple relational tables, as required in real-world domains\nsuch as finance, healthcare, and e-commerce. To address this gap, we present\nTQA-Bench, a new multi-table QA benchmark designed to evaluate the capabilities\nof LLMs in tackling complex QA tasks over relational data. Our benchmark\nincorporates diverse relational database instances sourced from real-world\npublic datasets and introduces a flexible sampling mechanism to create tasks\nwith varying multi-table context lengths, ranging from 8K to 64K tokens. To\nensure robustness and reliability, we integrate symbolic extensions into the\nevaluation framework, enabling the assessment of LLM reasoning capabilities\nbeyond simple data retrieval or probabilistic pattern matching. We\nsystematically evaluate a range of LLMs, both open-source and closed-source,\nspanning model scales from 7 billion to 70 billion parameters. Our extensive\nexperiments reveal critical insights into the performance of LLMs in\nmulti-table QA, highlighting both challenges and opportunities for advancing\ntheir application in complex, data-driven environments. Our benchmark\nimplementation and results are available at\nhttps://github.com/Relaxed-System-Lab/TQA-Bench.\n","authors":["Zipeng Qiu","You Peng","Guangxin He","Binhang Yuan","Chen Wang"],"pdf_url":"https://arxiv.org/pdf/2411.19504v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19500v1","updated":"2024-11-29T06:37:13Z","published":"2024-11-29T06:37:13Z","title":"COLD: Causal reasOning in cLosed Daily activities","summary":" Large Language Models (LLMs) have shown state-of-the-art performance in a\nvariety of tasks, including arithmetic and reasoning; however, to gauge the\nintellectual capabilities of LLMs, causal reasoning has become a reliable proxy\nfor validating a general understanding of the mechanics and intricacies of the\nworld similar to humans. 
Previous works in natural language processing (NLP)\nhave either focused on open-ended causal reasoning via causal commonsense\nreasoning (CCR) or framed a symbolic representation-based question answering\nfor theoretically backed-up analysis via a causal inference engine. The former\nadds an advantage of real-world grounding but lacks theoretically backed-up\nanalysis/validation, whereas the latter is far from real-world grounding. In\nthis work, we bridge this gap by proposing the COLD (Causal reasOning in cLosed\nDaily activities) framework, which is built upon human understanding of daily\nreal-world activities to reason about the causal nature of events. We show that\nthe proposed framework facilitates the creation of enormous causal queries (~ 9\nmillion) and comes close to the mini-turing test, simulating causal reasoning\nto evaluate the understanding of a daily real-world task. We evaluate multiple\nLLMs on the created causal queries and find that causal reasoning is\nchallenging even for activities trivial to humans. We further explore (the\ncausal reasoning abilities of LLMs) using the backdoor criterion to determine\nthe causal strength between events.\n","authors":["Abhinav Joshi","Areeb Ahmad","Ashutosh Modi"],"pdf_url":"https://arxiv.org/pdf/2411.19500v1.pdf","comment":"Paper accepted at NeurIPS 2024; Total 37 Pages"},{"id":"http://arxiv.org/abs/2405.01799v2","updated":"2024-11-29T06:15:20Z","published":"2024-05-03T01:04:28Z","title":"Exploiting ChatGPT for Diagnosing Autism-Associated Language Disorders\n and Identifying Distinct Features","summary":" Diagnosing language disorders associated with autism is a complex challenge,\noften hampered by the subjective nature and variability of traditional\nassessment methods. Traditional diagnostic methods not only require intensive\nhuman effort but also often result in delayed interventions due to their lack\nof speed and precision. 
In this study, we explored the application of ChatGPT,\na large language model, to overcome these obstacles by enhancing sensitivity\nand profiling linguistic features for autism diagnosis. This research utilizes\nChatGPT natural language processing capabilities to simplify and improve the\ndiagnostic process, focusing on identifying autism related language patterns.\nSpecifically, we compared ChatGPT performance with that of conventional\nsupervised learning models, including BERT, a model acclaimed for its\neffectiveness in various natural language processing tasks. We showed that\nChatGPT substantially outperformed these models, achieving over 10% improvement\nin both sensitivity and positive predictive value, in a zero shot learning\nconfiguration. The findings underscore the model potential as a diagnostic\ntool, combining accuracy and applicability. We identified ten key features of\nautism associated language disorders across scenarios. Features such as\necholalia, pronoun reversal, and atypical language usage play a critical role\nin diagnosing ASD and informing tailored treatment plans. Together, our\nfindings advocate for adopting sophisticated AI tools like ChatGPT in clinical\nsettings to assess and diagnose developmental disorders. Our approach promises\nenhanced diagnostic precision and supports personalized medicine, potentially\ntransforming the evaluation landscape for autism and similar neurological\nconditions.\n","authors":["Chuanbo Hu","Wenqi Li","Mindi Ruan","Xiangxu Yu","Shalaka Deshpande","Lynn K. 
Paul","Shuo Wang","Xin Li"],"pdf_url":"https://arxiv.org/pdf/2405.01799v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.11933v2","updated":"2024-11-29T06:07:41Z","published":"2024-11-18T15:09:50Z","title":"METEOR: Evolutionary Journey of Large Language Models from Guidance to\n Self-Growth","summary":" Model evolution enables learning from feedback to refine experiences and\nupdate skills, transforming models from having no domain knowledge to becoming\ndomain experts. However, there is currently no unified and effective method for\nguiding this evolutionary process. To address this gap, we propose the Meteor\nmethod, which includes three training phases: weak-to-strong data distillation,\niterative training, and self-evolution strategies. Each phase maximizes the\nmodel's inherent domain capabilities, allowing it to autonomously refine its\ndomain knowledge and enhance performance. Experiments demonstrate that our\napproach significantly improves accuracy, completeness, relevance, coherence,\nand reliability across domain-specific tasks.\n","authors":["Jiawei Li","Xiaoang Xu","Yang Gao"],"pdf_url":"https://arxiv.org/pdf/2411.11933v2.pdf","comment":"Our code can be found at https://github.com/DIRECT-BIT/METEOR"},{"id":"http://arxiv.org/abs/2407.00958v4","updated":"2024-11-29T05:50:09Z","published":"2024-07-01T04:29:35Z","title":"Dynamic Universal Approximation Theory: The Basic Theory for\n Transformer-based Large Language Models","summary":" Language models have emerged as a critical area of focus in artificial\nintelligence, particularly with the introduction of groundbreaking innovations\nlike ChatGPT. Large-scale Transformer networks have quickly become the leading\napproach for advancing natural language processing algorithms. Built on the\nTransformer architecture, these models enable interactions that closely mimic\nhuman communication and, equipped with extensive knowledge, can even assist in\nguiding human tasks. 
Despite their impressive capabilities and growing\ncomplexity, a key question remains-the theoretical foundations of large\nlanguage models (LLMs). What makes Transformer so effective for powering\nintelligent language applications, such as translation and coding? What\nunderlies LLMs' ability for In-Context Learning (ICL)? How does the LoRA scheme\nenhance the fine-tuning of LLMs? And what supports the practicality of pruning\nLLMs? To address these critical questions and explore the technological\nstrategies within LLMs, we leverage the Universal Approximation Theory (UAT) to\noffer a theoretical backdrop, shedding light on the mechanisms that underpin\nthese advancements.\n","authors":["Wei Wang","Qing Li"],"pdf_url":"https://arxiv.org/pdf/2407.00958v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19477v1","updated":"2024-11-29T05:29:47Z","published":"2024-11-29T05:29:47Z","title":"A Simple and Provable Scaling Law for the Test-Time Compute of Large\n Language Models","summary":" We propose a general two-stage algorithm that enjoys a provable scaling law\nfor the test-time compute of large language models (LLMs). Given an input\nproblem, the proposed algorithm first generates $N$ candidate solutions, and\nthen chooses the best one via a multiple-round knockout tournament where each\npair of candidates are compared for $K$ times and only the winners move on to\nthe next round. In a minimalistic implementation, both stages can be executed\nwith a black-box LLM alone and nothing else (e.g., no external verifier or\nreward model), and a total of $N \\times (K + 1)$ highly parallelizable LLM\ncalls are needed for solving an input problem. 
Assuming that a generated\ncandidate solution is correct with probability $p_{\\text{gen}} > 0$ and a\ncomparison between a pair of correct and incorrect solutions identifies the\nright winner with probability $p_{\\text{comp}} > 0.5$ (i.e., better than a\nrandom guess), we prove theoretically that the failure probability of the\nproposed algorithm decays to zero exponentially with respect to $N$ and $K$:\n$$\\mathbb{P}(\\text{final output is incorrect}) \\le (1 - p_{\\text{gen}})^N +\n\\lceil \\log_2 N \\rceil e^{-2 K (p_{\\text{comp}} - 0.5)^2}.$$ Our empirical\nresults with the challenging MMLU-Pro benchmark validate the technical\nassumptions, as well as the efficacy of the proposed algorithm and the gains\nfrom scaling up its test-time compute.\n","authors":["Yanxi Chen","Xuchen Pan","Yaliang Li","Bolin Ding","Jingren Zhou"],"pdf_url":"https://arxiv.org/pdf/2411.19477v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2406.00627v3","updated":"2024-11-29T05:05:13Z","published":"2024-06-02T06:09:56Z","title":"Prompt Framework for Role-playing: Generation and Evaluation","summary":" Large language models (LLMs) exhibit impressive proficiency in natural\nlanguage generation, understanding user instructions, and emulating human-like\nlanguage use, which has led to significant interest in their application to\nrole-playing scenarios. However, the manual collection of role-specific script\ndata and the evaluation of model performance are resource-intensive processes.\nThis project introduces a prompt-based framework designed to leverage GPT's\ncapabilities for the generation of role-playing dialogue datasets and the\nevaluation of role-playing performance. 
To validate the effectiveness of the\nGPT-based generation and evaluation, we further incorporate the recall-oriented\nRouge-L metric, providing an additional quantitative measure of performance.\n","authors":["Xun Liu","Zhengwei Ni"],"pdf_url":"https://arxiv.org/pdf/2406.00627v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.06350v2","updated":"2024-11-29T04:50:35Z","published":"2024-03-11T00:46:56Z","title":"IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning\n Datasets for Indian Languages","summary":" Despite the considerable advancements in English LLMs, the progress in\nbuilding comparable models for other languages has been hindered due to the\nscarcity of tailored resources. Our work aims to bridge this divide by\nintroducing an expansive suite of resources specifically designed for the\ndevelopment of Indic LLMs, covering 22 languages, containing a total of 251B\ntokens and 74.8M instruction-response pairs. Recognizing the importance of both\ndata quality and quantity, our approach combines highly curated manually\nverified data, unverified yet valuable data, and synthetic data. We build a\nclean, open-source pipeline for curating pre-training data from diverse\nsources, including websites, PDFs, and videos, incorporating best practices for\ncrawling, cleaning, flagging, and deduplication. For instruction-fine tuning,\nwe amalgamate existing Indic datasets, translate/transliterate English datasets\ninto Indian languages, and utilize LLaMa2 and Mixtral models to create\nconversations grounded in articles from Indian Wikipedia and Wikihow.\nAdditionally, we address toxicity alignment by generating toxic prompts for\nmultiple scenarios and then generate non-toxic responses by feeding these toxic\nprompts to an aligned LLaMa2 model. 
We hope that the datasets, tools, and\nresources released as a part of this work will not only propel the research and\ndevelopment of Indic LLMs but also establish an open-source blueprint for\nextending such efforts to other languages. The data and other artifacts created\nas part of this work are released with permissive licenses.\n","authors":["Mohammed Safi Ur Rahman Khan","Priyam Mehta","Ananth Sankar","Umashankar Kumaravelan","Sumanth Doddapaneni","Suriyaprasaad B","Varun Balan G","Sparsh Jain","Anoop Kunchukuttan","Pratyush Kumar","Raj Dabre","Mitesh M. Khapra"],"pdf_url":"https://arxiv.org/pdf/2403.06350v2.pdf","comment":"ACL-2024 Outstanding Paper"},{"id":"http://arxiv.org/abs/2411.19456v1","updated":"2024-11-29T03:57:26Z","published":"2024-11-29T03:57:26Z","title":"Beyond Surface Structure: A Causal Assessment of LLMs' Comprehension\n Ability","summary":" Large language models (LLMs) have shown remarkable capability in natural\nlanguage tasks, yet debate persists on whether they truly comprehend deep\nstructure (i.e., core semantics) or merely rely on surface structure (e.g.,\npresentation format). Prior studies observe that LLMs' performance declines\nwhen intervening on surface structure, arguing their success relies on surface\nstructure recognition. However, surface structure sensitivity does not prevent\ndeep structure comprehension. Rigorously evaluating LLMs' capability requires\nanalyzing both, yet deep structure is often overlooked. To this end, we assess\nLLMs' comprehension ability using causal mediation analysis, aiming to fully\ndiscover the capability of using both deep and surface structures.\nSpecifically, we formulate the comprehension of deep structure as direct causal\neffect (DCE) and that of surface structure as indirect causal effect (ICE),\nrespectively. 
To address the non-estimability of original DCE and ICE --\nstemming from the infeasibility of isolating mutual influences of deep and\nsurface structures, we develop the corresponding quantifiable surrogates,\nincluding approximated DCE (ADCE) and approximated ICE (AICE). We further apply\nthe ADCE to evaluate a series of mainstream LLMs, showing that most of them\nexhibit deep structure comprehension ability, which grows along with the\nprediction accuracy. Comparing ADCE and AICE demonstrates closed-source LLMs\nrely more on deep structure, while open-source LLMs are more surface-sensitive,\nwhich decreases with model scale. Theoretically, ADCE is a bidirectional\nevaluation, which measures both the sufficiency and necessity of deep structure\nchanges in causing output variations, thus offering a more comprehensive\nassessment than accuracy, a common evaluation in LLMs. Our work provides new\ninsights into LLMs' deep structure comprehension and offers novel methods for\nLLMs evaluation.\n","authors":["Yujin Han","Lei Xu","Sirui Chen","Difan Zou","Chaochao Lu"],"pdf_url":"https://arxiv.org/pdf/2411.19456v1.pdf","comment":"28 pages, 14 figures, 10 tables"},{"id":"http://arxiv.org/abs/2411.00774v4","updated":"2024-11-29T03:49:55Z","published":"2024-11-01T17:59:51Z","title":"Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model\n with Frozen LLM","summary":" Rapidly developing large language models (LLMs) have brought tremendous\nintelligent applications. Especially, the GPT-4o's excellent duplex speech\ninteraction ability has brought impressive experience to users. Researchers\nhave recently proposed several multi-modal LLMs in this direction that can\nachieve user-agent speech-to-speech conversations. This paper proposes a novel\nspeech-text multimodal LLM architecture called Freeze-Omni. 
Our main\ncontribution is that the speech input and output modalities can be easily\nconnected to a textual LLM while keeping the LLM's parameters frozen throughout\nthe training process. We design a three-stage training strategy for modeling\nboth the speech input and output, enabling Freeze-Omni to obtain\nspeech-to-speech conversation ability using text-speech paired data (such as\nASR and TTS data) and only 60,000 multi-round text Q&A data on 8 GPUs.\nMoreover, we can effectively ensure that the intelligence of the Freeze-Omni in\nthe speech modality is at the same level compared with that in the text\nmodality of its backbone LLM, while achieving low latency end-to-end spoken\nresponse. In addition, we also designed a method to achieve duplex dialogue\nability through multi-task training, giving Freeze-Omni a more natural style of\ndialogue ability between users and agents. In summary, Freeze-Omni holds great\npotential to conduct speech-to-speech dialogue based on a multimodal LLM under\nthe condition of a frozen LLM, avoiding the catastrophic forgetting problem\ncaused by limited data and training resources.\n","authors":["Xiong Wang","Yangze Li","Chaoyou Fu","Yunhang Shen","Lei Xie","Ke Li","Xing Sun","Long Ma"],"pdf_url":"https://arxiv.org/pdf/2411.00774v4.pdf","comment":"Project Page: https://freeze-omni.github.io/"},{"id":"http://arxiv.org/abs/2411.19443v1","updated":"2024-11-29T03:01:05Z","published":"2024-11-29T03:01:05Z","title":"Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language\n Models","summary":" Iterative retrieval refers to the process in which the model continuously\nqueries the retriever during generation to enhance the relevance of the\nretrieved knowledge, thereby improving the performance of Retrieval-Augmented\nGeneration (RAG). Existing work typically employs few-shot prompting or\nmanually constructed rules to implement iterative retrieval. 
This introduces\nadditional inference overhead and overlooks the remarkable reasoning\ncapabilities of Large Language Models (LLMs). In this paper, we introduce\nAuto-RAG, an autonomous iterative retrieval model centered on the LLM's\npowerful decision-making capabilities. Auto-RAG engages in multi-turn dialogues\nwith the retriever, systematically planning retrievals and refining queries to\nacquire valuable knowledge. This process continues until sufficient external\ninformation is gathered, at which point the results are presented to the user.\nTo this end, we develop a method for autonomously synthesizing reasoning-based\ndecision-making instructions in iterative retrieval and fine-tuned the latest\nopen-source LLMs. The experimental results indicate that Auto-RAG is capable of\nautonomous iterative interaction with the retriever, effectively leveraging the\nremarkable reasoning and decision-making abilities of LLMs, which lead to\noutstanding performance across six benchmarks. Further analysis reveals that\nAuto-RAG can autonomously adjust the number of iterations based on the\ndifficulty of the questions and the utility of the retrieved knowledge, without\nrequiring any human intervention. 
Moreover, Auto-RAG expresses the iterative\nretrieval process in natural language, enhancing interpretability while\nproviding users with a more intuitive experience. Code is available at\nhttps://github.com/ictnlp/Auto-RAG.\n","authors":["Tian Yu","Shaolei Zhang","Yang Feng"],"pdf_url":"https://arxiv.org/pdf/2411.19443v1.pdf","comment":"Code is available at https://github.com/ictnlp/Auto-RAG"},{"id":"http://arxiv.org/abs/2402.11217v2","updated":"2024-11-29T02:50:45Z","published":"2024-02-17T08:04:23Z","title":"A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language\n Models","summary":" The significant breakthroughs of Medical Multi-Modal Large Language Models\n(Med-MLLMs) renovate modern healthcare with robust information synthesis and\nmedical decision support. However, these models are often evaluated on\nbenchmarks that are unsuitable for the Med-MLLMs due to the complexity of\nreal-world diagnostics across diverse specialties. To address this gap, we\nintroduce Asclepius, a novel Med-MLLM benchmark that comprehensively assesses\nMed-MLLMs in terms of: distinct medical specialties (cardiovascular,\ngastroenterology, etc.) and different diagnostic capacities (perception,\ndisease analysis, etc.). Grounded in 3 proposed core principles, Asclepius\nensures a comprehensive evaluation by encompassing 15 medical specialties,\nstratifying into 3 main categories and 8 sub-categories of clinical tasks, and\nexempting overlap with existing VQA dataset.
We further provide an in-depth\nanalysis of 6 Med-MLLMs and compare them with 3 human specialists, providing\ninsights into their competencies and limitations in various medical contexts.\nOur work not only advances the understanding of Med-MLLMs' capabilities but\nalso sets a precedent for future evaluations and the safe deployment of these\nmodels in clinical environments.\n","authors":["Jie Liu","Wenxuan Wang","Yihang Su","Jingyuan Huan","Wenting Chen","Yudi Zhang","Cheng-Yi Li","Kao-Jung Chang","Xiaohan Xin","Linlin Shen","Michael R. Lyu"],"pdf_url":"https://arxiv.org/pdf/2402.11217v2.pdf","comment":"20 pages, 15 figures"},{"id":"http://arxiv.org/abs/2409.07064v2","updated":"2024-11-29T02:37:06Z","published":"2024-09-11T07:24:07Z","title":"Automated Speaking Assessment of Conversation Tests with Novel\n Graph-based Modeling on Spoken Response Coherence","summary":" Automated speaking assessment in conversation tests (ASAC) aims to evaluate\nthe overall speaking proficiency of an L2 (second-language) speaker in a\nsetting where an interlocutor interacts with one or more candidates. Although\nprior ASAC approaches have shown promising performance on their respective\ndatasets, there is still a dearth of research specifically focused on\nincorporating the coherence of the logical flow within a conversation into the\ngrading model. To address this critical challenge, we propose a hierarchical\ngraph model that aptly incorporates both broad inter-response interactions\n(e.g., discourse relations) and nuanced semantic information (e.g., semantic\nwords and speaker intents), which is subsequently fused with contextual\ninformation for the final prediction. Extensive experimental results on the\nNICT-JLE benchmark dataset suggest that our proposed modeling approach can\nyield considerable improvements in prediction accuracy with respect to various\nassessment metrics, as compared to some strong baselines. 
This also sheds light\non the importance of investigating coherence-related facets of spoken responses\nin ASAC.\n","authors":["Jiun-Ting Li","Bi-Cheng Yan","Tien-Hong Lo","Yi-Cheng Wang","Yung-Chang Hsu","Berlin Chen"],"pdf_url":"https://arxiv.org/pdf/2409.07064v2.pdf","comment":"Accepted by IEEE SLT 2024"},{"id":"http://arxiv.org/abs/2411.19434v1","updated":"2024-11-29T02:14:05Z","published":"2024-11-29T02:14:05Z","title":"Actions and Objects Pathways for Domain Adaptation in Video Question\n Answering","summary":" In this paper, we introduce the Actions and Objects Pathways (AOPath) for\nout-of-domain generalization in video question answering tasks. AOPath\nleverages features from a large pretrained model to enhance generalizability\nwithout the need for explicit training on the unseen domains. Inspired by human\nbrain, AOPath dissociates the pretrained features into action and object\nfeatures, and subsequently processes them through separate reasoning pathways.\nIt utilizes a novel module which converts out-of-domain features into\ndomain-agnostic features without introducing any trainable weights. We validate\nthe proposed approach on the TVQA dataset, which is partitioned into multiple\nsubsets based on genre to facilitate the assessment of generalizability. The\nproposed approach demonstrates 5% and 4% superior performance over conventional\nclassifiers on out-of-domain and in-domain datasets, respectively. 
It also\noutperforms prior methods that involve training millions of parameters, whereas\nthe proposed approach trains very few parameters.\n","authors":["Safaa Abdullahi Moallim Mohamud","Ho-Young Jung"],"pdf_url":"https://arxiv.org/pdf/2411.19434v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.01247v3","updated":"2024-11-29T01:47:20Z","published":"2024-09-02T13:29:44Z","title":"Conversational Complexity for Assessing Risk in Large Language Models","summary":" Large Language Models (LLMs) present a dual-use dilemma: they enable\nbeneficial applications while harboring potential for harm, particularly\nthrough conversational interactions. Despite various safeguards, advanced LLMs\nremain vulnerable. A watershed case in early 2023 involved journalist Kevin\nRoose's extended dialogue with Bing, an LLM-powered search engine, which\nrevealed harmful outputs after probing questions, highlighting vulnerabilities\nin the model's safeguards. This contrasts with simpler early jailbreaks, like\nthe \"Grandma Jailbreak,\" where users framed requests as innocent help for a\ngrandmother, easily eliciting similar content. This raises the question: How\nmuch conversational effort is needed to elicit harmful information from LLMs?\nWe propose two measures to quantify this effort: Conversational Length (CL),\nwhich measures the number of conversational turns needed to obtain a specific\nharmful response, and Conversational Complexity (CC), defined as the Kolmogorov\ncomplexity of the user's instruction sequence leading to the harmful response.\nTo address the incomputability of Kolmogorov complexity, we approximate CC\nusing a reference LLM to estimate the compressibility of the user instructions.\nApplying this approach to a large red-teaming dataset, we perform a\nquantitative analysis examining the statistical distribution of harmful and\nharmless conversational lengths and complexities. 
Our empirical findings\nsuggest that this distributional analysis and the minimization of CC serve as\nvaluable tools for understanding AI safety, offering insights into the\naccessibility of harmful information. This work establishes a foundation for a\nnew perspective on LLM safety, centered around the algorithmic complexity of\npathways to harm.\n","authors":["John Burden","Manuel Cebrian","Jose Hernandez-Orallo"],"pdf_url":"https://arxiv.org/pdf/2409.01247v3.pdf","comment":"15 pages, 6 figures"},{"id":"http://arxiv.org/abs/2410.20302v2","updated":"2024-11-29T00:27:46Z","published":"2024-10-27T00:50:30Z","title":"Sequential Large Language Model-Based Hyper-Parameter Optimization","summary":" This study introduces SLLMBO, an innovative framework that leverages Large\nLanguage Models (LLMs) for hyperparameter optimization (HPO), incorporating\ndynamic search space adaptability, enhanced parameter landscape exploitation,\nand a hybrid, novel LLM-Tree-structured Parzen Estimator (LLM-TPE) sampler. By\naddressing limitations in recent fully LLM-based methods and traditional\nBayesian Optimization (BO), SLLMBO achieves more robust optimization. This\ncomprehensive benchmarking evaluates multiple LLMs, including GPT-3.5-turbo,\nGPT-4o, Claude-Sonnet-3.5, and Gemini-1.5-flash, extending prior work beyond\nGPT-3.5 and GPT-4 and establishing SLLMBO as the first framework to benchmark a\ndiverse set of LLMs for HPO. By integrating LLMs' established strengths in\nparameter initialization with the exploitation abilities demonstrated in this\nstudy, alongside TPE's exploration capabilities, the LLM-TPE sampler achieves a\nbalanced exploration-exploitation trade-off, reduces API costs, and mitigates\npremature early stoppings for more effective parameter searches. Across 14\ntabular tasks in classification and regression, the LLM-TPE sampler\noutperformed fully LLM-based methods and achieved superior results over BO\nmethods in 9 tasks. 
Testing early stopping in budget-constrained scenarios\nfurther demonstrated competitive performance, indicating that LLM-based methods\ngenerally benefit from extended iterations for optimal results. This work lays\nthe foundation for future research exploring open-source LLMs, reproducibility\nof LLM results in HPO, and benchmarking SLLMBO on complex datasets, such as\nimage classification, segmentation, and machine translation.\n","authors":["Kanan Mahammadli","Seyda Bolelli Ertekin"],"pdf_url":"https://arxiv.org/pdf/2410.20302v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.07066v5","updated":"2024-11-29T21:50:16Z","published":"2024-04-10T14:56:40Z","title":"Exploring Concept Depth: How Large Language Models Acquire Knowledge at\n Different Layers?","summary":" Large language models (LLMs) have shown remarkable performances across a wide\nrange of tasks. However, the mechanisms by which these models encode tasks of\nvarying complexities remain poorly understood. In this paper, we explore the\nhypothesis that LLMs process concepts of varying complexities in different\nlayers, introducing the idea of \"Concept Depth\" to suggest that more complex\nconcepts are typically acquired in deeper layers. Specifically, we categorize\nconcepts based on their level of abstraction, defining them in the order of\nincreasing complexity within factual, emotional, and inferential tasks. We\nconduct extensive probing experiments using layer-wise representations across\nvarious LLM families (Gemma, LLaMA, Qwen) on various datasets spanning the\nthree domains of tasks. Our findings reveal that models could efficiently\nconduct probing for simpler tasks in shallow layers, and more complex tasks\ntypically necessitate deeper layers for accurate understanding. Additionally,\nwe examine how external factors, such as adding noise to the input and\nquantizing the model weights, might affect layer-wise representations. 
Our\nfindings suggest that these factors can impede the development of a conceptual\nunderstanding of LLMs until deeper layers are explored. We hope that our\nproposed concept and experimental insights will enhance the understanding of\nthe mechanisms underlying LLMs. Our codes are available at\n\\url{https://github.com/Luckfort/CD}.\n","authors":["Mingyu Jin","Qinkai Yu","Jingyuan Huang","Qingcheng Zeng","Zhenting Wang","Wenyue Hua","Haiyan Zhao","Kai Mei","Yanda Meng","Kaize Ding","Fan Yang","Mengnan Du","Yongfeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2404.07066v5.pdf","comment":"COLING 2025"},{"id":"http://arxiv.org/abs/2402.11512v4","updated":"2024-11-29T21:35:25Z","published":"2024-02-18T08:53:41Z","title":"From Prejudice to Parity: A New Approach to Debiasing Large Language\n Model Word Embeddings","summary":" Embeddings play a pivotal role in the efficacy of Large Language Models. They\nare the bedrock on which these models grasp contextual relationships and foster\na more nuanced understanding of language and consequently perform remarkably on\na plethora of complex tasks that require a fundamental understanding of human\nlanguage. Given that these embeddings themselves often reflect or exhibit bias,\nit stands to reason that these models may also inadvertently learn this bias.\nIn this work, we build on the seminal previous work and propose DeepSoftDebias,\nan algorithm that uses a neural network to perform 'soft debiasing'. We\nexhaustively evaluate this algorithm across a variety of SOTA datasets,\naccuracy metrics, and challenging NLP tasks. 
We find that DeepSoftDebias\noutperforms the current state-of-the-art methods at reducing bias across\ngender, race, and religion.\n","authors":["Aishik Rakshit","Smriti Singh","Shuvam Keshari","Arijit Ghosh Chowdhury","Vinija Jain","Aman Chadha"],"pdf_url":"https://arxiv.org/pdf/2402.11512v4.pdf","comment":"Accepted at COLING 2025"},{"id":"http://arxiv.org/abs/2405.10718v2","updated":"2024-11-29T19:01:20Z","published":"2024-05-17T12:01:43Z","title":"SignLLM: Sign Language Production Large Language Models","summary":" In this paper, we propose SignLLM, a multilingual Sign Language Production\n(SLP) large language model, which includes two novel multilingual SLP modes\nMLSF and Prompt2LangGloss that allow sign language gestures generation from\nquery texts input and question-style prompts input respectively. Both modes can\nuse a new RL loss based on reinforcement learning and a new RL module named\nPriority Learning Channel. These RL components can accelerate the training by\nenhancing the model's capability to sample high-quality data. For SignLLM's\ntraining, we introduce Prompt2Sign, a comprehensive multilingual sign language\ndataset, which builds from public data, including American Sign Language (ASL)\nand seven others. This dataset standardizes information by extracting pose\ninformation from sign language videos into a unified compressed format. We\nextensively evaluate SignLLM, demonstrating that our model achieves\nstate-of-the-art performance on SLP tasks across eight sign languages.\n","authors":["Sen Fang","Lei Wang","Ce Zheng","Chunyu Sui","Mingyu Zhao","Yapeng Tian","Chen Chen"],"pdf_url":"https://arxiv.org/pdf/2405.10718v2.pdf","comment":"website at https://signllm.github.io/"},{"id":"http://arxiv.org/abs/2410.12049v2","updated":"2024-11-29T19:00:58Z","published":"2024-10-15T20:37:34Z","title":"Sabiá-3 Technical Report","summary":" This report presents Sabi\\'a-3, our new flagship language model, and\nSabiazinho-3, a more cost-effective sibling. 
The models were trained on a large\nBrazilian-centric corpus. Evaluations across diverse professional and academic\nbenchmarks show strong performance on Portuguese and Brazil-related tasks.\nSabi\\'a-3 shows large improvements in comparison to our previous best model,\nSabia-2 Medium, especially in reasoning-intensive tasks. Notably, Sabi\\'a-3's\naverage performance matches frontier LLMs, while it is offered at a three to\nfour times lower cost per token, reinforcing the benefits of domain\nspecialization.\n","authors":["Hugo Abonizio","Thales Sales Almeida","Thiago Laitz","Roseval Malaquias Junior","Giovana Kerche Bonás","Rodrigo Nogueira","Ramon Pires"],"pdf_url":"https://arxiv.org/pdf/2410.12049v2.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2411.19862v1","updated":"2024-11-29T17:25:00Z","published":"2024-11-29T17:25:00Z","title":"Cross-Domain Recommendation Meets Large Language Models","summary":" Cross-domain recommendation (CDR) has emerged as a promising solution to the\ncold-start problem, faced by single-domain recommender systems. However,\nexisting CDR models rely on complex neural architectures, large datasets, and\nsignificant computational resources, making them less effective in data-scarce\nscenarios or when simplicity is crucial. In this work, we leverage the\nreasoning capabilities of large language models (LLMs) and explore their\nperformance in the CDR domain across multiple domain pairs. We introduce two\nnovel prompt designs tailored for CDR and demonstrate that LLMs, when prompted\neffectively, outperform state-of-the-art CDR baselines across various metrics\nand domain combinations in the rating prediction and ranking tasks. This work\nbridges the gap between LLMs and recommendation systems, showcasing their\npotential as effective cross-domain recommenders.\n","authors":["Ajay Krishna Vajjala","Dipak Meher","Ziwei Zhu","David S.
Rosenblum"],"pdf_url":"https://arxiv.org/pdf/2411.19862v1.pdf","comment":"12 pages"},{"id":"http://arxiv.org/abs/2411.19718v1","updated":"2024-11-29T14:08:32Z","published":"2024-11-29T14:08:32Z","title":"TakeLab Retriever: AI-Driven Search Engine for Articles from Croatian\n News Outlets","summary":" TakeLab Retriever is an AI-driven search engine designed to discover,\ncollect, and semantically analyze news articles from Croatian news outlets. It\noffers a unique perspective on the history and current landscape of Croatian\nonline news media, making it an essential tool for researchers seeking to\nuncover trends, patterns, and correlations that general-purpose search engines\ncannot provide. TakeLab retriever utilizes cutting-edge natural language\nprocessing (NLP) methods, enabling users to sift through articles using named\nentities, phrases, and topics through the web application. This technical\nreport is divided into two parts: the first explains how TakeLab Retriever is\nutilized, while the second provides a detailed account of its design. In the\nsecond part, we also address the software engineering challenges involved and\npropose solutions for developing a microservice-based semantic search engine\ncapable of handling over ten million news articles published over the past two\ndecades.\n","authors":["David Dukić","Marin Petričević","Sven Ćurković","Jan Šnajder"],"pdf_url":"https://arxiv.org/pdf/2411.19718v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19710v1","updated":"2024-11-29T13:57:07Z","published":"2024-11-29T13:57:07Z","title":"Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating\n RAG Systems","summary":" Retrieval Augmented Generation (RAG) systems are a widespread application of\nLarge Language Models (LLMs) in the industry. While many tools exist empowering\ndevelopers to build their own systems, measuring their performance locally,\nwith datasets reflective of the system's use cases, is a technological\nchallenge. 
Solutions to this problem range from non-specific and cheap (most\npublic datasets) to specific and costly (generating data from local documents).\nIn this paper, we show that using public question and answer (Q&A) datasets to\nassess retrieval performance can lead to non-optimal systems design, and that\ncommon tools for RAG dataset generation can lead to unbalanced data. We propose\nsolutions to these issues based on the characterization of RAG datasets through\nlabels and through label-targeted data generation. Finally, we show that\nfine-tuned small LLMs can efficiently generate Q&A datasets. We believe that\nthese observations are invaluable to the know-your-data step of RAG systems\ndevelopment.\n","authors":["Rafael Teixeira de Lima","Shubham Gupta","Cesar Berrospi","Lokesh Mishra","Michele Dolfi","Peter Staar","Panagiotis Vagenas"],"pdf_url":"https://arxiv.org/pdf/2411.19710v1.pdf","comment":"to be published in the 31st International Conference on Computational\n Linguistics (COLING 2025)"},{"id":"http://arxiv.org/abs/2411.19576v1","updated":"2024-11-29T09:47:32Z","published":"2024-11-29T09:47:32Z","title":"A Review of LLM-based Explanations in Recommender Systems","summary":" The rise of Large Language Models (LLMs), such as LLaMA and ChatGPT, has\nopened new opportunities for enhancing recommender systems through improved\nexplainability. This paper provides a systematic literature review focused on\nleveraging LLMs to generate explanations for recommendations -- a critical\naspect for fostering transparency and user trust. We conducted a comprehensive\nsearch within the ACM Guide to Computing Literature, covering publications from\nthe launch of ChatGPT (November 2022) to the present (November 2024). Our\nsearch yielded 232 articles, but after applying inclusion criteria, only six\nwere identified as directly addressing the use of LLMs in explaining\nrecommendations. 
This scarcity highlights that, despite the rise of LLMs, their\napplication in explainable recommender systems is still in an early stage. We\nanalyze these select studies to understand current methodologies, identify\nchallenges, and suggest directions for future research. Our findings underscore\nthe potential of LLMs improving explanations of recommender systems and\nencourage the development of more transparent and user-centric recommendation\nexplanation solutions.\n","authors":["Alan Said"],"pdf_url":"https://arxiv.org/pdf/2411.19576v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19539v1","updated":"2024-11-29T08:34:07Z","published":"2024-11-29T08:34:07Z","title":"Knowledge Management for Automobile Failure Analysis Using Graph RAG","summary":" This paper presents a knowledge management system for automobile failure\nanalysis using retrieval-augmented generation (RAG) with large language models\n(LLMs) and knowledge graphs (KGs). In the automotive industry, there is a\ngrowing demand for knowledge transfer of failure analysis from experienced\nengineers to young engineers. However, failure events are phenomena that occur\nin a chain reaction, making them difficult for beginners to analyze them. While\nknowledge graphs, which can describe semantic relationships and structure\ninformation is effective in representing failure events, due to their\ncapability of representing the relationships between components, there is much\ninformation in KGs, so it is challenging for young engineers to extract and\nunderstand sub-graphs from the KG. On the other hand, there is increasing\ninterest in the use of Graph RAG, a type of RAG that combines LLMs and KGs for\nknowledge management. However, when using the current Graph RAG framework with\nan existing knowledge graph for automobile failures, several issues arise\nbecause it is difficult to generate executable queries for a knowledge graph\ndatabase which is not constructed by LLMs. 
To address this, we focused on\noptimizing the Graph RAG pipeline for existing knowledge graphs. Using an\noriginal Q&A dataset, the ROUGE F1 score of the sentences generated by the\nproposed method showed an average improvement of 157.6% compared to the current\nmethod. This highlights the effectiveness of the proposed method for automobile\nfailure analysis.\n","authors":["Yuta Ojima","Hiroki Sakaji","Tadashi Nakamura","Hiroaki Sakata","Kazuya Seki","Yuu Teshigawara","Masami Yamashita","Kazuhiro Aoyama"],"pdf_url":"https://arxiv.org/pdf/2411.19539v1.pdf","comment":"7 pages, 6 figures, to be published in 2024 IEEE International\n Conference on Big Data (BigData)"},{"id":"http://arxiv.org/abs/2411.19513v1","updated":"2024-11-29T07:11:42Z","published":"2024-11-29T07:11:42Z","title":"ContextGNN: Beyond Two-Tower Recommendation Systems","summary":" Recommendation systems predominantly utilize two-tower architectures, which\nevaluate user-item rankings through the inner product of their respective\nembeddings. However, one key limitation of two-tower models is that they learn\na pair-agnostic representation of users and items. In contrast, pair-wise\nrepresentations either scale poorly due to their quadratic complexity or are\ntoo restrictive on the candidate pairs to rank. To address these issues, we\nintroduce Context-based Graph Neural Networks (ContextGNNs), a novel deep\nlearning architecture for link prediction in recommendation systems. The method\nemploys a pair-wise representation technique for familiar items situated within\na user's local subgraph, while leveraging two-tower representations to\nfacilitate the recommendation of exploratory items. A final network then\npredicts how to fuse both pair-wise and two-tower recommendations into a single\nranking of items.
We demonstrate that ContextGNN is able to adapt to different\ndata characteristics and outperforms existing methods, both traditional and\nGNN-based, on a diverse set of practical recommendation tasks, improving\nperformance by 20% on average.\n","authors":["Yiwen Yuan","Zecheng Zhang","Xinwei He","Akihiro Nitta","Weihua Hu","Dong Wang","Manan Shah","Shenyang Huang","Blaž Stojanovič","Alan Krumholz","Jan Eric Lenssen","Jure Leskovec","Matthias Fey"],"pdf_url":"https://arxiv.org/pdf/2411.19513v1.pdf","comment":"14 pages, 1 figure, 5 tables"},{"id":"http://arxiv.org/abs/2411.19504v1","updated":"2024-11-29T06:48:13Z","published":"2024-11-29T06:48:13Z","title":"TQA-Bench: Evaluating LLMs for Multi-Table Question Answering with\n Scalable Context and Symbolic Extension","summary":" The advent of large language models (LLMs) has unlocked great opportunities\nin complex data management tasks, particularly in question answering (QA) over\ncomplicated multi-table relational data. Despite significant progress,\nsystematically evaluating LLMs on multi-table QA remains a critical challenge\ndue to the inherent complexity of analyzing heterogeneous table structures and\npotential large scale of serialized relational data. Existing benchmarks\nprimarily focus on single-table QA, failing to capture the intricacies of\nreasoning across multiple relational tables, as required in real-world domains\nsuch as finance, healthcare, and e-commerce. To address this gap, we present\nTQA-Bench, a new multi-table QA benchmark designed to evaluate the capabilities\nof LLMs in tackling complex QA tasks over relational data. Our benchmark\nincorporates diverse relational database instances sourced from real-world\npublic datasets and introduces a flexible sampling mechanism to create tasks\nwith varying multi-table context lengths, ranging from 8K to 64K tokens. 
To\nensure robustness and reliability, we integrate symbolic extensions into the\nevaluation framework, enabling the assessment of LLM reasoning capabilities\nbeyond simple data retrieval or probabilistic pattern matching. We\nsystematically evaluate a range of LLMs, both open-source and closed-source,\nspanning model scales from 7 billion to 70 billion parameters. Our extensive\nexperiments reveal critical insights into the performance of LLMs in\nmulti-table QA, highlighting both challenges and opportunities for advancing\ntheir application in complex, data-driven environments. Our benchmark\nimplementation and results are available at\nhttps://github.com/Relaxed-System-Lab/TQA-Bench.\n","authors":["Zipeng Qiu","You Peng","Guangxin He","Binhang Yuan","Chen Wang"],"pdf_url":"https://arxiv.org/pdf/2411.19504v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19478v1","updated":"2024-11-29T05:31:04Z","published":"2024-11-29T05:31:04Z","title":"Zero-Indexing Internet Search Augmented Generation for Large Language\n Models","summary":" Retrieval augmented generation has emerged as an effective method to enhance\nlarge language model performance. This approach typically relies on an internal\nretrieval module that uses various indexing mechanisms to manage a static\npre-processed corpus. However, such a paradigm often falls short when it is\nnecessary to integrate the most up-to-date information that has not been\nupdated into the corpus during generative inference time. In this paper, we\nexplore an alternative approach that leverages standard search engine APIs to\ndynamically integrate the latest online information (without maintaining any\nindex for any fixed corpus), thereby improving the quality of generated\ncontent. 
We design a collaborative LLM-based paradigm, where we include: (i) a\nparser-LLM that determines if the Internet augmented generation is demanded and\nextracts the search keywords if so with a single inference; (ii) a mixed\nranking strategy that re-ranks the retrieved HTML files to eliminate bias\nintroduced from the search engine API; and (iii) an extractor-LLM that can\naccurately and efficiently extract relevant information from the fresh content\nin each HTML file. We conduct extensive empirical studies to evaluate the\nperformance of this Internet search augmented generation paradigm. The\nexperimental results demonstrate that our method generates content with\nsignificantly improved quality. Our system has been successfully deployed in a\nproduction environment to serve 01.AI's generative inference requests.\n","authors":["Guangxin He","Zonghong Dai","Jiangcheng Zhu","Binqiang Zhao","Chenyue Li","You Peng","Chen Wang","Binhang Yuan"],"pdf_url":"https://arxiv.org/pdf/2411.19478v1.pdf","comment":null}],"Machine Learning":[{"id":"http://arxiv.org/abs/2411.19951v1","updated":"2024-11-29T18:59:54Z","published":"2024-11-29T18:59:54Z","title":"T2Vid: Translating Long Text into Multi-Image is the Catalyst for\n Video-LLMs","summary":" The success of Multimodal Large Language Models (MLLMs) in the image domain\nhas garnered wide attention from the research community. Drawing on previous\nsuccessful experiences, researchers have recently explored extending the\nsuccess to the video understanding realms. Apart from training from scratch, an\nefficient way is to utilize the pre-trained image-LLMs, leading to two\nmainstream approaches, i.e. zero-shot inference and further fine-tuning with\nvideo data. In this work, our study of these approaches harvests an effective\ndata augmentation method. We first make a deeper inspection of the zero-shot\ninference way and identify two limitations, i.e. limited generalization and\nlack of temporal understanding capabilities. 
Thus, we further investigate the\nfine-tuning approach and find a low learning efficiency when simply using all\nthe video data samples, which can be attributed to a lack of instruction\ndiversity. Aiming at this issue, we develop a method called T2Vid to synthesize\nvideo-like samples to enrich the instruction diversity in the training corpus.\nIntegrating these data enables a simple and efficient training scheme, which\nachieves performance comparable to or even superior to using full video\ndatasets by training with just 15% the sample size. Meanwhile, we find that the\nproposed scheme can boost the performance of long video understanding without\ntraining with long video samples. We hope our study will spark more thinking\nabout using MLLMs for video understanding and curation of high-quality data.\nThe code is released at https://github.com/xjtupanda/T2Vid.\n","authors":["Shukang Yin","Chaoyou Fu","Sirui Zhao","Yunhang Shen","Chunjiang Ge","Yan Yang","Zuwei Long","Yuhan Dai","Tong Xu","Xing Sun","Ran He","Caifeng Shan","Enhong Chen"],"pdf_url":"https://arxiv.org/pdf/2411.19951v1.pdf","comment":"13 pages, 9 figures, 5 tables. Project page:\n https://github.com/xjtupanda/T2Vid"},{"id":"http://arxiv.org/abs/2411.19950v1","updated":"2024-11-29T18:59:52Z","published":"2024-11-29T18:59:52Z","title":"AlphaTablets: A Generic Plane Representation for 3D Planar\n Reconstruction from Monocular Videos","summary":" We introduce AlphaTablets, a novel and generic representation of 3D planes\nthat features continuous 3D surface and precise boundary delineation. By\nrepresenting 3D planes as rectangles with alpha channels, AlphaTablets combine\nthe advantages of current 2D and 3D plane representations, enabling accurate,\nconsistent and flexible modeling of 3D planes. We derive differentiable\nrasterization on top of AlphaTablets to efficiently render 3D planes into\nimages, and propose a novel bottom-up pipeline for 3D planar reconstruction\nfrom monocular videos. 
Starting with 2D superpixels and geometric cues from\npre-trained models, we initialize 3D planes as AlphaTablets and optimize them\nvia differentiable rendering. An effective merging scheme is introduced to\nfacilitate the growth and refinement of AlphaTablets. Through iterative\noptimization and merging, we reconstruct complete and accurate 3D planes with\nsolid surfaces and clear boundaries. Extensive experiments on the ScanNet\ndataset demonstrate state-of-the-art performance in 3D planar reconstruction,\nunderscoring the great potential of AlphaTablets as a generic 3D plane\nrepresentation for various applications. Project page is available at:\nhttps://hyzcluster.github.io/alphatablets\n","authors":["Yuze He","Wang Zhao","Shaohui Liu","Yubin Hu","Yushi Bai","Yu-Hui Wen","Yong-Jin Liu"],"pdf_url":"https://arxiv.org/pdf/2411.19950v1.pdf","comment":"NeurIPS 2024"},{"id":"http://arxiv.org/abs/2411.19946v1","updated":"2024-11-29T18:59:46Z","published":"2024-11-29T18:59:46Z","title":"DELT: A Simple Diversity-driven EarlyLate Training for Dataset\n Distillation","summary":" Recent advances in dataset distillation have led to solutions in two main\ndirections. The conventional batch-to-batch matching mechanism is ideal for\nsmall-scale datasets and includes bi-level optimization methods on models and\nsyntheses, such as FRePo, RCIG, and RaT-BPTT, as well as other methods like\ndistribution matching, gradient matching, and weight trajectory matching.\nConversely, batch-to-global matching typifies decoupled methods, which are\nparticularly advantageous for large-scale datasets. This approach has garnered\nsubstantial interest within the community, as seen in SRe$^2$L, G-VBSM, WMDD,\nand CDA. A primary challenge with the second approach is the lack of diversity\namong syntheses within each class since samples are optimized independently and\nthe same global supervision signals are reused across different synthetic\nimages. 
In this study, we propose a new Diversity-driven EarlyLate Training\n(DELT) scheme to enhance the diversity of images in batch-to-global matching\nwith less computation. Our approach is conceptually simple yet effective, it\npartitions predefined IPC samples into smaller subtasks and employs local\noptimizations to distill each subset into distributions from distinct phases,\nreducing the uniformity induced by the unified optimization process. These\ndistilled images from the subtasks demonstrate effective generalization when\napplied to the entire task. We conduct extensive experiments on CIFAR,\nTiny-ImageNet, ImageNet-1K, and its sub-datasets. Our approach outperforms the\nprevious state-of-the-art by 2$\\sim$5% on average across different datasets and\nIPCs (images per class), increasing diversity per class by more than 5% while\nreducing synthesis time by up to 39.3% for enhancing the training efficiency.\nCode is available at: https://github.com/VILA-Lab/DELT.\n","authors":["Zhiqiang Shen","Ammar Sherif","Zeyuan Yin","Shitong Shao"],"pdf_url":"https://arxiv.org/pdf/2411.19946v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19943v1","updated":"2024-11-29T18:58:22Z","published":"2024-11-29T18:58:22Z","title":"Critical Tokens Matter: Token-Level Contrastive Estimation Enhence LLM's\n Reasoning Capability","summary":" Large Language Models (LLMs) have exhibited remarkable performance on\nreasoning tasks. They utilize autoregressive token generation to construct\nreasoning trajectories, enabling the development of a coherent chain of\nthought. In this work, we explore the impact of individual tokens on the final\noutcomes of reasoning tasks. We identify the existence of ``critical tokens''\nthat lead to incorrect reasoning trajectories in LLMs. Specifically, we find\nthat LLMs tend to produce positive outcomes when forced to decode other tokens\ninstead of critical tokens. 
Motivated by this observation, we propose a novel\napproach - cDPO - designed to automatically recognize and conduct token-level\nrewards for the critical tokens during the alignment process. Specifically, we\ndevelop a contrastive estimation approach to automatically identify critical\ntokens. It is achieved by comparing the generation likelihood of positive and\nnegative models. To achieve this, we separately fine-tune the positive and\nnegative models on various reasoning trajectories, consequently, they are\ncapable of identifying identify critical tokens within incorrect trajectories\nthat contribute to erroneous outcomes. Moreover, to further align the model\nwith the critical token information during the alignment process, we extend the\nconventional DPO algorithms to token-level DPO and utilize the differential\nlikelihood from the aforementioned positive and negative model as important\nweight for token-level DPO learning.Experimental results on GSM8K and MATH500\nbenchmarks with two-widely used models Llama-3 (8B and 70B) and deepseek-math\n(7B) demonstrate the effectiveness of the propsoed approach cDPO.\n","authors":["Zicheng Lin","Tian Liang","Jiahao Xu","Xing Wang","Ruilin Luo","Chufan Shi","Siheng Li","Yujiu Yang","Zhaopeng Tu"],"pdf_url":"https://arxiv.org/pdf/2411.19943v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2411.19942v1","updated":"2024-11-29T18:58:17Z","published":"2024-11-29T18:58:17Z","title":"Free-form Generation Enhances Challenging Clothed Human Modeling","summary":" Achieving realistic animated human avatars requires accurate modeling of\npose-dependent clothing deformations. Existing learning-based methods heavily\nrely on the Linear Blend Skinning (LBS) of minimally-clothed human models like\nSMPL to model deformation. 
However, these methods struggle to handle loose\nclothing, such as long dresses, where the canonicalization process becomes\nill-defined when the clothing is far from the body, leading to disjointed and\nfragmented results. To overcome this limitation, we propose a novel hybrid\nframework to model challenging clothed humans. Our core idea is to use\ndedicated strategies to model different regions, depending on whether they are\nclose to or distant from the body. Specifically, we segment the human body into\nthree categories: unclothed, deformed, and generated. We simply replicate\nunclothed regions that require no deformation. For deformed regions close to\nthe body, we leverage LBS to handle the deformation. As for the generated\nregions, which correspond to loose clothing areas, we introduce a novel\nfree-form, part-aware generator to model them, as they are less affected by\nmovements. This free-form generation paradigm brings enhanced flexibility and\nexpressiveness to our hybrid framework, enabling it to capture the intricate\ngeometric details of challenging loose clothing, such as skirts and dresses.\nExperimental results on the benchmark dataset featuring loose clothing\ndemonstrate that our method achieves state-of-the-art performance with superior\nvisual fidelity and realism, particularly in the most challenging cases.\n","authors":["Hang Ye","Xiaoxuan Ma","Hai Ci","Wentao Zhu","Yizhou Wang"],"pdf_url":"https://arxiv.org/pdf/2411.19942v1.pdf","comment":"23 pages, 25 figures"},{"id":"http://arxiv.org/abs/2411.19941v1","updated":"2024-11-29T18:57:25Z","published":"2024-11-29T18:57:25Z","title":"Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA\n Benchmark","summary":" Following the successful 2023 edition, we organised the Second Perception\nTest challenge as a half-day workshop alongside the IEEE/CVF European\nConference on Computer Vision (ECCV) 2024, with the goal of benchmarking\nstate-of-the-art video models and measuring the 
progress since last year using\nthe Perception Test benchmark. This year, the challenge had seven tracks (up\nfrom six last year) and covered low-level and high-level tasks, with language\nand non-language interfaces, across video, audio, and text modalities; the\nadditional track covered hour-long video understanding and introduced a novel\nvideo QA benchmark 1h-walk VQA. Overall, the tasks in the different tracks\nwere: object tracking, point tracking, temporal action localisation, temporal\nsound localisation, multiple-choice video question-answering, grounded video\nquestion-answering, and hour-long video question-answering. We summarise in\nthis report the challenge tasks and results, and introduce in detail the novel\nhour-long video QA benchmark 1h-walk VQA.\n","authors":["Joseph Heyward","João Carreira","Dima Damen","Andrew Zisserman","Viorica Pătrăucean"],"pdf_url":"https://arxiv.org/pdf/2411.19941v1.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2312.13090"},{"id":"http://arxiv.org/abs/2411.18014v2","updated":"2024-11-29T18:57:12Z","published":"2024-11-27T03:16:00Z","title":"Diffeomorphic Latent Neural Operators for Data-Efficient Learning of\n Solutions to Partial Differential Equations","summary":" A computed approximation of the solution operator to a system of partial\ndifferential equations (PDEs) is needed in various areas of science and\nengineering. Neural operators have been shown to be quite effective at\npredicting these solution generators after training on high-fidelity ground\ntruth data (e.g. numerical simulations). However, in order to generalize well\nto unseen spatial domains, neural operators must be trained on an extensive\namount of geometrically varying data samples that may not be feasible to\nacquire or simulate in certain contexts (e.g., patient-specific medical data,\nlarge-scale computationally intensive simulations.) 
We propose that in order to\nlearn a PDE solution operator that can generalize across multiple domains\nwithout needing to sample enough data expressive enough for all possible\ngeometries, we can train instead a latent neural operator on just a few ground\ntruth solution fields diffeomorphically mapped from different geometric/spatial\ndomains to a fixed reference configuration. Furthermore, the form of the\nsolutions is dependent on the choice of mapping to and from the reference\ndomain. We emphasize that preserving properties of the differential operator\nwhen constructing these mappings can significantly reduce the data requirement\nfor achieving an accurate model due to the regularity of the solution fields\nthat the latent neural operator is training on. We provide motivating numerical\nexperimentation that demonstrates an extreme case of this consideration by\nexploiting the conformal invariance of the Laplacian\n","authors":["Zan Ahmad","Shiyi Chen","Minglang Yin","Avisha Kumar","Nicolas Charon","Natalia Trayanova","Mauro Maggioni"],"pdf_url":"https://arxiv.org/pdf/2411.18014v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2108.05974v2","updated":"2024-11-29T18:54:31Z","published":"2021-08-12T21:22:06Z","title":"An Operator Splitting View of Federated Learning","summary":" Over the past few years, the federated learning ($\\texttt{FL}$) community has\nwitnessed a proliferation of new $\\texttt{FL}$ algorithms. However, our\nunderstating of the theory of $\\texttt{FL}$ is still fragmented, and a\nthorough, formal comparison of these algorithms remains elusive. Motivated by\nthis gap, we show that many of the existing $\\texttt{FL}$ algorithms can be\nunderstood from an operator splitting point of view. This unification allows us\nto compare different algorithms with ease, to refine previous convergence\nresults and to uncover new algorithmic variants. 
In particular, our analysis\nreveals the vital role played by the step size in $\\texttt{FL}$ algorithms. The\nunification also leads to a streamlined and economic way to accelerate\n$\\texttt{FL}$ algorithms, without incurring any communication overhead. We\nperform numerical experiments on both convex and nonconvex models to validate\nour findings.\n","authors":["Saber Malekmohammadi","Kiarash Shaloudegi","Zeou Hu","Yaoliang Yu"],"pdf_url":"https://arxiv.org/pdf/2108.05974v2.pdf","comment":"30 pages, 28 figures"},{"id":"http://arxiv.org/abs/2410.04332v2","updated":"2024-11-29T18:52:41Z","published":"2024-10-06T02:43:49Z","title":"Gradient Routing: Masking Gradients to Localize Computation in Neural\n Networks","summary":" Neural networks are trained primarily based on their inputs and outputs,\nwithout regard for their internal mechanisms. These neglected mechanisms\ndetermine properties that are critical for safety, like (i) transparency; (ii)\nthe absence of sensitive information or harmful capabilities; and (iii)\nreliable generalization of goals beyond the training distribution. To address\nthis shortcoming, we introduce gradient routing, a training method that\nisolates capabilities to specific subregions of a neural network. Gradient\nrouting applies data-dependent, weighted masks to gradients during\nbackpropagation. These masks are supplied by the user in order to configure\nwhich parameters are updated by which data points. We show that gradient\nrouting can be used to (1) learn representations which are partitioned in an\ninterpretable way; (2) enable robust unlearning via ablation of a pre-specified\nnetwork subregion; and (3) achieve scalable oversight of a reinforcement\nlearner by localizing modules responsible for different behaviors. Throughout,\nwe find that gradient routing localizes capabilities even when applied to a\nlimited, ad-hoc subset of the data. 
We conclude that the approach holds promise\nfor challenging, real-world applications where quality data are scarce.\n","authors":["Alex Cloud","Jacob Goldman-Wetzler","Evžen Wybitul","Joseph Miller","Alexander Matt Turner"],"pdf_url":"https://arxiv.org/pdf/2410.04332v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.17523v3","updated":"2024-11-29T18:51:22Z","published":"2024-06-25T13:06:09Z","title":"On the consistency of hyper-parameter selection in value-based deep\n reinforcement learning","summary":" Deep reinforcement learning (deep RL) has achieved tremendous success on\nvarious domains through a combination of algorithmic design and careful\nselection of hyper-parameters. Algorithmic improvements are often the result of\niterative enhancements built upon prior approaches, while hyper-parameter\nchoices are typically inherited from previous methods or fine-tuned\nspecifically for the proposed technique. Despite their crucial impact on\nperformance, hyper-parameter choices are frequently overshadowed by algorithmic\nadvancements. This paper conducts an extensive empirical study focusing on the\nreliability of hyper-parameter selection for value-based deep reinforcement\nlearning agents, including the introduction of a new score to quantify the\nconsistency and reliability of various hyper-parameters. Our findings not only\nhelp establish which hyper-parameters are most critical to tune, but also help\nclarify which tunings remain consistent across different training regimes.\n","authors":["Johan Obando-Ceron","João G. M. Araújo","Aaron Courville","Pablo Samuel Castro"],"pdf_url":"https://arxiv.org/pdf/2406.17523v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19930v1","updated":"2024-11-29T18:42:28Z","published":"2024-11-29T18:42:28Z","title":"On Domain-Specific Post-Training for Multimodal Large Language Models","summary":" Recent years have witnessed the rapid development of general multimodal large\nlanguage models (MLLMs). 
However, adapting general MLLMs to specific domains,\nsuch as scientific fields and industrial applications, remains less explored.\nThis paper systematically investigates domain adaptation of MLLMs through\npost-training, focusing on data synthesis, training pipelines, and task\nevaluation. (1) Data Synthesis: Using open-source models, we develop a visual\ninstruction synthesizer that effectively generates diverse visual instruction\ntasks from domain-specific image-caption pairs. Our synthetic tasks surpass\nthose generated by manual rules, GPT-4, and GPT-4V in enhancing the\ndomain-specific performance of MLLMs. (2) Training Pipeline: While the\ntwo-stage training--initially on image-caption pairs followed by visual\ninstruction tasks--is commonly adopted for developing general MLLMs, we apply a\nsingle-stage training pipeline to enhance task diversity for domain-specific\npost-training. (3) Task Evaluation: We conduct experiments in two domains,\nbiomedicine and food, by post-training MLLMs of different sources and scales\n(e.g., Qwen2-VL-2B, LLaVA-v1.6-8B, Llama-3.2-11B), and then evaluating MLLM\nperformance on various domain-specific tasks. To support further research in\nMLLM domain adaptation, we will open-source our implementations.\n","authors":["Daixuan Cheng","Shaohan Huang","Ziyu Zhu","Xintong Zhang","Wayne Xin Zhao","Zhongzhi Luan","Bo Dai","Zhenliang Zhang"],"pdf_url":"https://arxiv.org/pdf/2411.19930v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19923v1","updated":"2024-11-29T18:38:17Z","published":"2024-11-29T18:38:17Z","title":"Scalable Out-of-distribution Robustness in the Presence of Unobserved\n Confounders","summary":" We consider the task of out-of-distribution (OOD) generalization, where the\ndistribution shift is due to an unobserved confounder ($Z$) affecting both the\ncovariates ($X$) and the labels ($Y$). 
In this setting, traditional assumptions\nof covariate and label shift are unsuitable due to the confounding, which\nintroduces heterogeneity in the predictor, i.e., $\\hat{Y} = f_Z(X)$. OOD\ngeneralization differs from traditional domain adaptation by not assuming\naccess to the covariate distribution ($X^\\text{te}$) of the test samples during\ntraining. These conditions create a challenging scenario for OOD robustness:\n(a) $Z^\\text{tr}$ is an unobserved confounder during training, (b)\n$P^\\text{te}{Z} \\neq P^\\text{tr}{Z}$, (c) $X^\\text{te}$ is unavailable during\ntraining, and (d) the posterior predictive distribution depends on\n$P^\\text{te}(Z)$, i.e., $\\hat{Y} = E_{P^\\text{te}(Z)}[f_Z(X)]$. In general,\naccurate predictions are unattainable in this scenario, and existing literature\nhas proposed complex predictors based on identifiability assumptions that\nrequire multiple additional variables. Our work investigates a set of\nidentifiability assumptions that tremendously simplify the predictor, whose\nresulting elegant simplicity outperforms existing approaches.\n","authors":["Parjanya Prashant","Seyedeh Baharan Khatami","Bruno Ribeiro","Babak Salimi"],"pdf_url":"https://arxiv.org/pdf/2411.19923v1.pdf","comment":"24 pages, 3 figures"},{"id":"http://arxiv.org/abs/2411.19922v1","updated":"2024-11-29T18:36:58Z","published":"2024-11-29T18:36:58Z","title":"Dynamic EEG-fMRI mapping: Revealing the relationship between brain\n connectivity and cognitive state","summary":" This study investigated the dynamic connectivity patterns between EEG and\nfMRI modalities, contributing to our understanding of brain network\ninteractions. By employing a comprehensive approach that integrated static and\ndynamic analyses of EEG-fMRI data, we were able to uncover distinct\nconnectivity states and characterize their temporal fluctuations. 
The results\nrevealed modular organization within the intrinsic connectivity networks (ICNs)\nof the brain, highlighting the significant roles of sensory systems and the\ndefault mode network. The use of a sliding window technique allowed us to\nassess how functional connectivity varies over time, further elucidating the\ntransient nature of brain connectivity. Additionally, our findings align with\nprevious literature, reinforcing the notion that cognitive states can be\neffectively identified through short-duration data, specifically within the\n30-60 second timeframe. The established relationships between connectivity\nstrength and cognitive processes, particularly during different visual states,\nunderscore the relevance of our approach for future research into brain\ndynamics. Overall, this study not only enhances our understanding of the\ninterplay between EEG and fMRI signals but also paves the way for further\nexploration into the neural correlates of cognitive functions and their\nimplications in clinical settings. Future research should focus on refining\nthese methodologies and exploring their applications in various cognitive and\nclinical contexts.\n","authors":["Guiran Liu","Binrong Zhu"],"pdf_url":"https://arxiv.org/pdf/2411.19922v1.pdf","comment":"15 pages, Subjects: Machine Learning (cs.LG); Human-Computer\n Interaction (cs.HC); Signal Processing (eess.SP)"},{"id":"http://arxiv.org/abs/2305.04281v3","updated":"2024-11-29T18:33:10Z","published":"2023-05-07T14:10:34Z","title":"Analysing Multiscale Clusterings with Persistent Homology","summary":" In many applications in data clustering, it is desirable to find not just a\nsingle partition into clusters but a sequence of partitions describing the data\nat different scales (or levels of coarseness). A natural problem then is to\nanalyse and compare the (not necessarily hierarchical) sequences of partitions\nthat underpin multiscale descriptions of data. 
Here, we introduce the\nMultiscale Clustering Filtration (MCF), a well-defined and stable filtration of\nabstract simplicial complexes that encodes arbitrary patterns of cluster\nassignments across scales of increasing coarseness. We show that the\nzero-dimensional persistent homology of the MCF measures the degree of\nhierarchy in the sequence of partitions, and the higher-dimensional persistent\nhomology tracks the emergence and resolution of conflicts between cluster\nassignments across the sequence of partitions. To broaden the theoretical\nfoundations of the MCF, we also provide an equivalent construction via a nerve\ncomplex filtration, and we show that in the hierarchical case, the MCF reduces\nto a Vietoris-Rips filtration of an ultrametric space. We then use numerical\nexperiments to illustrate how the MCF can serve to characterise multiscale\nclusterings of synthetic data from stochastic block models.\n","authors":["Dominik J. Schindler","Mauricio Barahona"],"pdf_url":"https://arxiv.org/pdf/2305.04281v3.pdf","comment":"This work was presented at the Dagstuhl Seminar (23192) on\n \"Topological Data Analysis and Applications\""},{"id":"http://arxiv.org/abs/2411.19913v1","updated":"2024-11-29T18:18:26Z","published":"2024-11-29T18:18:26Z","title":"Quantifying the synthetic and real domain gap in aerial scene\n understanding","summary":" Quantifying the gap between synthetic and real-world imagery is essential for\nimproving both transformer-based models - that rely on large volumes of data -\nand datasets, especially in underexplored domains like aerial scene\nunderstanding where the potential impact is significant. This paper introduces\na novel methodology for scene complexity assessment using Multi-Model Consensus\nMetric (MMCM) and depth-based structural metrics, enabling a robust evaluation\nof perceptual and structural disparities between domains. 
Our experimental\nanalysis, utilizing real-world (Dronescapes) and synthetic (Skyscenes)\ndatasets, demonstrates that real-world scenes generally exhibit higher\nconsensus among state-of-the-art vision transformers, while synthetic scenes\nshow greater variability and challenge model adaptability. The results\nunderline the inherent complexities and domain gaps, emphasizing the need for\nenhanced simulation fidelity and model generalization. This work provides\ncritical insights into the interplay between domain characteristics and model\nperformance, offering a pathway for improved domain adaptation strategies in\naerial scene understanding.\n","authors":["Alina Marcu"],"pdf_url":"https://arxiv.org/pdf/2411.19913v1.pdf","comment":"17 pages (including references), 5 figures, 2 tables. Accepted for\n publication in the \"Scientific Bulletin\", Series C, Electrical Engineering\n and Computer Science, ISSN 2286-3540"},{"id":"http://arxiv.org/abs/2411.19908v1","updated":"2024-11-29T18:12:50Z","published":"2024-11-29T18:12:50Z","title":"Another look at inference after prediction","summary":" Prediction-based (PB) inference is increasingly used in applications where\nthe outcome of interest is difficult to obtain, but its predictors are readily\navailable. Unlike traditional inference, PB inference performs statistical\ninference using a partially observed outcome and a set of covariates by\nleveraging a prediction of the outcome generated from a machine learning (ML)\nmodel. Motwani and Witten (2023) recently revisited two innovative PB inference\napproaches for ordinary least squares. They found that the method proposed by\nWang et al. (2020) yields a consistent estimator for the association of\ninterest when the ML model perfectly captures the underlying regression\nfunction. Conversely, the prediction-powered inference (PPI) method proposed by\nAngelopoulos et al. (2023) yields valid inference regardless of the model's\naccuracy. 
In this paper, we study the statistical efficiency of the PPI\nestimator. Our analysis reveals that a more efficient estimator, proposed 25\nyears ago by Chen and Chen (2000), can be obtained by simply adding a weight to\nthe PPI estimator. We also contextualize PB inference with methods from the\neconomics and statistics literature dating back to the 1960s. Our extensive\ntheoretical and numerical analyses indicate that the Chen and Chen (CC)\nestimator offers a balance between robustness to ML model specification and\nstatistical efficiency, making it the preferred choice for use in practice.\n","authors":["Jessica Gronsbell","Jianhui Gao","Yaqi Shi","Zachary R. McCaw","David Cheng"],"pdf_url":"https://arxiv.org/pdf/2411.19908v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19906v1","updated":"2024-11-29T18:11:39Z","published":"2024-11-29T18:11:39Z","title":"Classical and Quantum Algorithms for the Deterministic L-system\n Inductive Inference Problem","summary":" L-systems can be made to model and create simulations of many biological\nprocesses, such as plant development. Finding an L-system for a given process\nis typically solved by hand, by experts, in a hugely time-consuming process. It\nwould be significant if this could be done automatically from data, such as\nfrom sequences of images. In this paper, we are interested in inferring a\nparticular type of L-system, deterministic context-free L-system (D0L-system)\nfrom a sequence of strings. We introduce the characteristic graph of a sequence\nof strings, which we then utilize to translate our problem (inferring\nD0L-system) in polynomial time into the maximum independent set problem (MIS)\nand the SAT problem. 
After that, we offer a classical exact algorithm and an\napproximate quantum algorithm for the problem.\n","authors":["Ali Lotfi","Ian McQuillan","Steven Rayan"],"pdf_url":"https://arxiv.org/pdf/2411.19906v1.pdf","comment":"16 pages, 1 figure"},{"id":"http://arxiv.org/abs/2411.19903v1","updated":"2024-11-29T18:05:16Z","published":"2024-11-29T18:05:16Z","title":"$C^{3}$-NeRF: Modeling Multiple Scenes via Conditional-cum-Continual\n Neural Radiance Fields","summary":" Neural radiance fields (NeRF) have exhibited highly photorealistic rendering\nof novel views through per-scene optimization over a single 3D scene. With the\ngrowing popularity of NeRF and its variants, they have become ubiquitous and\nhave been identified as efficient 3D resources. However, they are still far\nfrom being scalable since a separate model needs to be stored for each scene,\nand the training time increases linearly with every newly added scene.\nSurprisingly, the idea of encoding multiple 3D scenes into a single NeRF model\nis heavily under-explored. In this work, we propose a novel\nconditional-cum-continual framework, called $C^{3}$-NeRF, to accommodate\nmultiple scenes into the parameters of a single neural radiance field. Unlike\nconventional approaches that leverage feature extractors and pre-trained priors\nfor scene conditioning, we use simple pseudo-scene labels to model multiple\nscenes in NeRF. Interestingly, we observe the framework is also inherently\ncontinual (via generative replay) with minimal, if not no, forgetting of the\npreviously learned scenes. Consequently, the proposed framework adapts to\nmultiple new scenes without necessarily accessing the old data. Through\nextensive qualitative and quantitative evaluation using synthetic and real\ndatasets, we demonstrate the inherent capacity of the NeRF model to accommodate\nmultiple scenes with high-quality novel-view renderings without adding\nadditional parameters. 
We provide implementation details and dynamic\nvisualizations of our results in the supplementary file.\n","authors":["Prajwal Singh","Ashish Tiwari","Gautam Vashishtha","Shanmuganathan Raman"],"pdf_url":"https://arxiv.org/pdf/2411.19903v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19902v1","updated":"2024-11-29T18:04:11Z","published":"2024-11-29T18:04:11Z","title":"Noncommutative Model Selection for Data Clustering and Dimension\n Reduction Using Relative von Neumann Entropy","summary":" We propose a pair of completely data-driven algorithms for unsupervised\nclassification and dimension reduction, and we empirically study their\nperformance on a number of data sets, both simulated data in three-dimensions\nand images from the COIL-20 data set. The algorithms take as input a set of\npoints sampled from a uniform distribution supported on a metric space, the\nlatter embedded in an ambient metric space, and they output a clustering or\nreduction of dimension of the data. They work by constructing a natural family\nof graphs from the data and selecting the graph which maximizes the relative\nvon Neumann entropy of certain normalized heat operators constructed from the\ngraphs. 
Once the appropriate graph is selected, the eigenvectors of the graph\nLaplacian may be used to reduce the dimension of the data, and clusters in the\ndata may be identified with the kernel of the associated graph Laplacian.\nNotably, these algorithms do not require information about the size of a\nneighborhood or the desired number of clusters as input, in contrast to popular\nalgorithms such as $k$-means, and even more modern spectral methods such as\nLaplacian eigenmaps, among others.\n In our computational experiments, our clustering algorithm outperforms\n$k$-means clustering on data sets with non-trivial geometry and topology, in\nparticular data whose clusters are not concentrated around a specific point,\nand our dimension reduction algorithm is shown to work well in several simple\nexamples.\n","authors":["Araceli Guzmán-Tristán","Antonio Rieser"],"pdf_url":"https://arxiv.org/pdf/2411.19902v1.pdf","comment":"20 pages"},{"id":"http://arxiv.org/abs/2411.19896v1","updated":"2024-11-29T18:00:07Z","published":"2024-11-29T18:00:07Z","title":"Efficient quantum-enhanced classical simulation for patches of quantum\n landscapes","summary":" Understanding the capabilities of classical simulation methods is key to\nidentifying where quantum computers are advantageous. Not only does this ensure\nthat quantum computers are used only where necessary, but also one can\npotentially identify subroutines that can be offloaded onto a classical device.\nIn this work, we show that it is always possible to generate a classical\nsurrogate of a sub-region (dubbed a \"patch\") of an expectation landscape\nproduced by a parameterized quantum circuit. That is, we provide a\nquantum-enhanced classical algorithm which, after simple measurements on a\nquantum device, allows one to classically simulate approximate expectation\nvalues of a subregion of a landscape. 
We provide time and sample complexity\nguarantees for a range of families of circuits of interest, and further\nnumerically demonstrate our simulation algorithms on an exactly verifiable\nsimulation of a Hamiltonian variational ansatz and long-time dynamics\nsimulation on a 127-qubit heavy-hex topology.\n","authors":["Sacha Lerch","Ricard Puig","Manuel S. Rudolph","Armando Angrisani","Tyson Jones","M. Cerezo","Supanut Thanasilp","Zoë Holmes"],"pdf_url":"https://arxiv.org/pdf/2411.19896v1.pdf","comment":"10 + 47 pages, 4 figures"},{"id":"http://arxiv.org/abs/2411.19894v1","updated":"2024-11-29T17:58:45Z","published":"2024-11-29T17:58:45Z","title":"Noncommutative Model Selection and the Data-Driven Estimation of Real\n Cohomology Groups","summary":" We propose three completely data-driven methods for estimating the real\ncohomology groups $H^k (X ; \\mathbb{R})$ of a compact metric-measure space $(X,\nd_X, \\mu_X)$ embedded in a metric-measure space $(Y,d_Y,\\mu_Y)$, given a finite\nset of points $S$ sampled from a uniform distribution $\\mu_X$ on $X$, possibly\ncorrupted with noise from $Y$. We present the results of several computational\nexperiments in the case that $X$ is embedded in $\\mathbb{R}^n$, where two of\nthe three algorithms performed well.\n","authors":["Araceli Guzmán-Tristán","Antonio Rieser","Eduardo Velázquez-Richards"],"pdf_url":"https://arxiv.org/pdf/2411.19894v1.pdf","comment":"15 pages, sequel to \"Noncommutative Model Selection for Data\n Clustering and Dimension Reduction Using Relative von Neumann Entropy\""},{"id":"http://arxiv.org/abs/2411.19888v1","updated":"2024-11-29T17:53:41Z","published":"2024-11-29T17:53:41Z","title":"FlowCLAS: Enhancing Normalizing Flow Via Contrastive Learning For\n Anomaly Segmentation","summary":" Anomaly segmentation is a valuable computer vision task for safety-critical\napplications that need to be aware of unexpected events. 
Current\nstate-of-the-art (SOTA) scene-level anomaly segmentation approaches rely on\ndiverse inlier class labels during training, limiting their ability to leverage\nvast unlabeled datasets and pre-trained vision encoders. These methods may\nunderperform in domains with reduced color diversity and limited object\nclasses. Conversely, existing unsupervised methods struggle with anomaly\nsegmentation with the diverse scenes of less restricted domains. To address\nthese challenges, we introduce FlowCLAS, a novel self-supervised framework that\nutilizes vision foundation models to extract rich features and employs a\nnormalizing flow network to learn their density distribution. We enhance the\nmodel's discriminative power by incorporating Outlier Exposure and contrastive\nlearning in the latent space. FlowCLAS significantly outperforms all existing\nmethods on the ALLO anomaly segmentation benchmark for space robotics and\ndemonstrates competitive results on multiple road anomaly segmentation\nbenchmarks for autonomous driving, including Fishyscapes Lost&Found and Road\nAnomaly. These results highlight FlowCLAS's effectiveness in addressing the\nunique challenges of space anomaly segmentation while retaining SOTA\nperformance in the autonomous driving domain without reliance on inlier\nsegmentation labels.\n","authors":["Chang Won Lee","Selina Leveugle","Svetlana Stolpner","Chris Langley","Paul Grouchy","Jonathan Kelly","Steven L. Waslander"],"pdf_url":"https://arxiv.org/pdf/2411.19888v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19882v1","updated":"2024-11-29T17:48:20Z","published":"2024-11-29T17:48:20Z","title":"Open source Differentiable ODE Solving Infrastructure","summary":" Ordinary Differential Equations (ODEs) are widely used in physics, chemistry,\nand biology to model dynamic systems, including reaction kinetics, population\ndynamics, and biological processes. 
In this work, we integrate GPU-accelerated\nODE solvers into the open-source DeepChem framework, making these tools easily\naccessible. These solvers support multiple numerical methods and are fully\ndifferentiable, enabling easy integration into more complex differentiable\nprograms. We demonstrate the capabilities of our implementation through\nexperiments on Lotka-Volterra predator-prey dynamics, pharmacokinetic\ncompartment models, neural ODEs, and solving PDEs using reaction-diffusion\nequations. Our solvers achieved high accuracy with mean squared errors ranging\nfrom $10^{-4}$ to $10^{-6}$ and showed scalability in solving large systems\nwith up to 100 compartments.\n","authors":["Rakshit Kr. Singh","Aaron Rock Menezes","Rida Irfan","Bharath Ramsundar"],"pdf_url":"https://arxiv.org/pdf/2411.19882v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.10842v2","updated":"2024-11-29T17:36:32Z","published":"2024-04-16T18:40:28Z","title":"Unsupervised Speaker Diarization in Distributed IoT Networks Using\n Federated Learning","summary":" This paper presents a computationally efficient and distributed speaker\ndiarization framework for networked IoT-style audio devices. The work proposes\na Federated Learning model which can identify the participants in a\nconversation without the requirement of a large audio database for training. An\nunsupervised online update mechanism is proposed for the Federated Learning\nmodel which depends on cosine similarity of speaker embeddings. Moreover, the\nproposed diarization system solves the problem of speaker change detection via.\nunsupervised segmentation techniques using Hotelling's t-squared Statistic and\nBayesian Information Criterion. In this new approach, speaker change detection\nis biased around detected quasi-silences, which reduces the severity of the\ntrade-off between the missed detection and false detection rates. 
Additionally,\nthe computational overhead due to frame-by-frame identification of speakers is\nreduced via unsupervised clustering of speech segments. The results\ndemonstrate the effectiveness of the proposed training method in the presence\nof non-IID speech data. They also show a considerable improvement in the\nreduction of false and missed detection at the segmentation stage, while\nreducing the computational overhead. Improved accuracy and reduced\ncomputational cost make the mechanism suitable for real-time speaker\ndiarization across a distributed IoT audio network.\n","authors":["Amit Kumar Bhuyan","Hrishikesh Dutta","Subir Biswas"],"pdf_url":"https://arxiv.org/pdf/2404.10842v2.pdf","comment":"11 pages, 7 figures, 1 table"},{"id":"http://arxiv.org/abs/2411.19875v1","updated":"2024-11-29T17:36:31Z","published":"2024-11-29T17:36:31Z","title":"Enhanced anomaly detection in well log data through the application of\n ensemble GANs","summary":" Although generative adversarial networks (GANs) have shown significant\nsuccess in modeling data distributions for image datasets, their application to\nstructured or tabular data, such as well logs, remains relatively\nunderexplored. This study extends the ensemble GANs (EGANs) framework to\ncapture the distribution of well log data and detect anomalies that fall\noutside of these distributions. The proposed approach compares the performance\nof traditional methods, such as Gaussian mixture models (GMMs), with EGANs in\ndetecting anomalies outside the expected data distributions. For the gamma ray\n(GR) dataset, EGANs achieved a precision of 0.62 and F1 score of 0.76,\noutperforming GMM's precision of 0.38 and F1 score of 0.54. Similarly, for\ntravel time (DT), EGANs achieved a precision of 0.70 and F1 score of 0.79,\nsurpassing GMM 0.56 and 0.71. In the neutron porosity (NPHI) dataset, EGANs\nrecorded a precision of 0.53 and F1 score of 0.68, outshining GMM 0.47 and\n0.61. 
For the bulk density (RHOB) dataset, EGANs achieved a precision of 0.52\nand an F1 score of 0.67, slightly outperforming GMM, which yielded a precision\nof 0.50 and an F1 score of 0.65. This work's novelty lies in applying EGANs for\nwell log data analysis, showcasing their ability to learn data patterns and\nidentify anomalies that deviate from them. This approach offers more reliable\nanomaly detection compared to traditional methods like GMM. The findings\nhighlight the potential of EGANs in enhancing anomaly detection for well log\ndata, delivering significant implications for optimizing drilling strategies\nand reservoir management through more accurate, data-driven insights into\nsubsurface characterization.\n","authors":["Abdulrahman Al-Fakih","A. Koeshidayatullah","Tapan Mukerji","SanLinn I. Kaka"],"pdf_url":"https://arxiv.org/pdf/2411.19875v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19870v1","updated":"2024-11-29T17:31:47Z","published":"2024-11-29T17:31:47Z","title":"DeMo: Decoupled Momentum Optimization","summary":" Training large neural networks typically requires sharing gradients between\naccelerators through specialized high-speed interconnects. Drawing from the\nsignal processing principles of frequency decomposition and energy compaction,\nwe demonstrate that synchronizing full optimizer states and model parameters\nduring training is unnecessary. By decoupling momentum updates and allowing\ncontrolled divergence in optimizer states across accelerators, we achieve\nimproved convergence compared to state-of-the-art optimizers. We introduce\n{\\textbf{De}}coupled {\\textbf{Mo}}mentum (DeMo), a fused optimizer and data\nparallel algorithm that reduces inter-accelerator communication requirements by\nseveral orders of magnitude. This enables training of large neural networks\neven with limited network bandwidth and heterogeneous hardware. 
Our method is\ntopology-agnostic and architecture-independent and supports scalable\nclock-synchronous distributed training with negligible compute and memory\noverhead. Empirical results show that models trained with DeMo match or exceed\nthe performance of equivalent models trained with AdamW, while eliminating the\nneed for high-speed interconnects when pre-training large scale foundation\nmodels. An open source reference PyTorch implementation is published on GitHub\nat https://github.com/bloc97/DeMo\n","authors":["Bowen Peng","Jeffrey Quesnelle","Diederik P. Kingma"],"pdf_url":"https://arxiv.org/pdf/2411.19870v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19869v1","updated":"2024-11-29T17:31:42Z","published":"2024-11-29T17:31:42Z","title":"AIDetx: a compression-based method for identification of\n machine-learning generated text","summary":" This paper introduces AIDetx, a novel method for detecting machine-generated\ntext using data compression techniques. Traditional approaches, such as deep\nlearning classifiers, often suffer from high computational costs and limited\ninterpretability. To address these limitations, we propose a compression-based\nclassification framework that leverages finite-context models (FCMs). AIDetx\nconstructs distinct compression models for human-written and AI-generated text,\nclassifying new inputs based on which model achieves a higher compression\nratio. We evaluated AIDetx on two benchmark datasets, achieving F1 scores\nexceeding 97% and 99%, respectively, highlighting its high accuracy. Compared\nto current methods, such as large language models (LLMs), AIDetx offers a more\ninterpretable and computationally efficient solution, significantly reducing\nboth training time and hardware requirements (e.g., no GPUs needed). The full\nimplementation is publicly available at https://github.com/AIDetx/AIDetx.\n","authors":["Leonardo Almeida","Pedro Rodrigues","Diogo Magalhães","Armando J. 
Pinho","Diogo Pratas"],"pdf_url":"https://arxiv.org/pdf/2411.19869v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19865v1","updated":"2024-11-29T17:27:05Z","published":"2024-11-29T17:27:05Z","title":"Reverse Thinking Makes LLMs Stronger Reasoners","summary":" Reverse thinking plays a crucial role in human reasoning. Humans can reason\nnot only from a problem to a solution but also in reverse, i.e., start from the\nsolution and reason towards the problem. This often enhances overall reasoning\nperformance as it enables consistency checks between their forward and backward\nthinking. To enable Large Language Models (LLMs) to perform reverse thinking,\nwe introduce Reverse-Enhanced Thinking (RevThink), a framework composed of data\naugmentation and learning objectives. In RevThink, we augment the dataset by\ncollecting structured forward-backward reasoning from a teacher model,\nconsisting of: (1) the original question, (2) forward reasoning, (3) backward\nquestion, and (4) backward reasoning. We then employ three objectives to train\na smaller student model in a multi-task learning fashion: (a) generate forward\nreasoning from a question, (b) generate a backward question from a question,\nand (c) generate backward reasoning from the backward question. Experiments\nacross 12 datasets covering commonsense, math, and logical reasoning show an\naverage 13.53% improvement over the student model's zero-shot performance and a\n6.84% improvement over the strongest knowledge distillation baselines.\nMoreover, our method demonstrates sample efficiency -- using only 10% of the\ncorrect forward reasoning from the training data, it outperforms a standard\nfine-tuning method trained on 10x more forward reasoning. 
RevThink also\nexhibits strong generalization to out-of-distribution held-out datasets.\n","authors":["Justin Chih-Yao Chen","Zifeng Wang","Hamid Palangi","Rujun Han","Sayna Ebrahimi","Long Le","Vincent Perot","Swaroop Mishra","Mohit Bansal","Chen-Yu Lee","Tomas Pfister"],"pdf_url":"https://arxiv.org/pdf/2411.19865v1.pdf","comment":"20 pages"},{"id":"http://arxiv.org/abs/2411.19860v1","updated":"2024-11-29T17:17:38Z","published":"2024-11-29T17:17:38Z","title":"SpaRC: Sparse Radar-Camera Fusion for 3D Object Detection","summary":" In this work, we present SpaRC, a novel Sparse fusion transformer for 3D\nperception that integrates multi-view image semantics with Radar and Camera\npoint features. The fusion of radar and camera modalities has emerged as an\nefficient perception paradigm for autonomous driving systems. While\nconventional approaches utilize dense Bird's Eye View (BEV)-based architectures\nfor depth estimation, contemporary query-based transformers excel in\ncamera-only detection through object-centric methodology. However, these\nquery-based approaches exhibit limitations in false positive detections and\nlocalization precision due to implicit depth modeling. We address these\nchallenges through three key contributions: (1) sparse frustum fusion (SFF) for\ncross-modal feature alignment, (2) range-adaptive radar aggregation (RAR) for\nprecise object localization, and (3) local self-attention (LSA) for focused\nquery aggregation. In contrast to existing methods requiring computationally\nintensive BEV-grid rendering, SpaRC operates directly on encoded point\nfeatures, yielding substantial improvements in efficiency and accuracy.\nEmpirical evaluations on the nuScenes and TruckScenes benchmarks demonstrate\nthat SpaRC significantly outperforms existing dense BEV-based and sparse\nquery-based detectors. Our method achieves state-of-the-art performance metrics\nof 67.1 NDS and 63.1 AMOTA. 
The code and pretrained models are available at\nhttps://github.com/phi-wol/sparc.\n","authors":["Philipp Wolters","Johannes Gilg","Torben Teepe","Fabian Herzog","Felix Fent","Gerhard Rigoll"],"pdf_url":"https://arxiv.org/pdf/2411.19860v1.pdf","comment":"18 pages, 11 figures"},{"id":"http://arxiv.org/abs/2205.15935v4","updated":"2024-11-29T17:12:44Z","published":"2022-05-31T16:27:57Z","title":"Bias-inducing geometries: an exactly solvable data model with fairness\n implications","summary":" Machine learning (ML) may be oblivious to human bias but it is not immune to\nits perpetuation. Marginalisation and iniquitous group representation are often\ntraceable in the very data used for training, and may be reflected or even\nenhanced by the learning models. In the present work, we aim at clarifying the\nrole played by data geometry in the emergence of ML bias. We introduce an\nexactly solvable high-dimensional model of data imbalance, where parametric\ncontrol over the many bias-inducing factors allows for an extensive exploration\nof the bias inheritance mechanism. Through the tools of statistical physics, we\nanalytically characterise the typical properties of learning models trained in\nthis synthetic framework and obtain exact predictions for the observables that\nare commonly employed for fairness assessment. Despite the simplicity of the\ndata model, we retrace and unpack typical unfairness behaviour observed on\nreal-world datasets. We also obtain a detailed analytical characterisation of a\nclass of bias mitigation strategies. We first consider a basic loss-reweighing\nscheme, which allows for an implicit minimisation of different unfairness\nmetrics, and quantify the incompatibilities between some existing fairness\ncriteria. 
Then, we consider a novel mitigation strategy based on a matched\ninference approach, consisting in the introduction of coupled learning models.\nOur theoretical analysis of this approach shows that the coupled strategy can\nstrike superior fairness-accuracy trade-offs.\n","authors":["Stefano Sarao Mannelli","Federica Gerace","Negar Rostamzadeh","Luca Saglietti"],"pdf_url":"https://arxiv.org/pdf/2205.15935v4.pdf","comment":"10 pages + appendix"},{"id":"http://arxiv.org/abs/2411.19853v1","updated":"2024-11-29T17:09:59Z","published":"2024-11-29T17:09:59Z","title":"Towards Class-wise Robustness Analysis","summary":" While being very successful in solving many downstream tasks, the application\nof deep neural networks is limited in real-life scenarios because of their\nsusceptibility to domain shifts such as common corruptions, and adversarial\nattacks. The existence of adversarial examples and data corruption\nsignificantly reduces the performance of deep classification models.\nResearchers have made strides in developing robust neural architectures to\nbolster decisions of deep classifiers. However, most of these works rely on\neffective adversarial training methods, and predominantly focus on overall\nmodel robustness, disregarding class-wise differences in robustness, which are\ncritical. Exploiting weakly robust classes is a potential avenue for attackers\nto fool the image recognition models. Therefore, this study investigates\nclass-to-class biases across adversarially trained robust classification models\nto understand their latent space structures and analyze their strong and weak\nclass-wise properties. We further assess the robustness of classes against\ncommon corruptions and adversarial attacks, recognizing that class\nvulnerability extends beyond the number of correct classifications for a\nspecific class. 
We find that the number of false positives of classes as\nspecific target classes significantly impacts their vulnerability to attacks.\nThrough our analysis on the Class False Positive Score, we assess a fair\nevaluation of how susceptible each class is to misclassification.\n","authors":["Tejaswini Medi","Julia Grabinski","Margret Keuper"],"pdf_url":"https://arxiv.org/pdf/2411.19853v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19845v1","updated":"2024-11-29T17:04:03Z","published":"2024-11-29T17:04:03Z","title":"A Visual-inertial Localization Algorithm using Opportunistic Visual\n Beacons and Dead-Reckoning for GNSS-Denied Large-scale Applications","summary":" With the development of smart cities, the demand for continuous pedestrian\nnavigation in large-scale urban environments has significantly increased. While\nglobal navigation satellite systems (GNSS) provide low-cost and reliable\npositioning services, they are often hindered in complex urban canyon\nenvironments. Thus, exploring opportunistic signals for positioning in urban\nareas has become a key solution. Augmented reality (AR) allows pedestrians to\nacquire real-time visual information. Accordingly, we propose a low-cost\nvisual-inertial positioning solution. This method comprises a lightweight\nmulti-scale group convolution (MSGC)-based visual place recognition (VPR)\nneural network, a pedestrian dead reckoning (PDR) algorithm, and a\nvisual/inertial fusion approach based on a Kalman filter with gross error\nsuppression. The VPR serves as a conditional observation to the Kalman filter,\neffectively correcting the errors accumulated through the PDR method. This\nenables the entire algorithm to ensure the reliability of long-term positioning\nin GNSS-denied areas. Extensive experimental results demonstrate that our\nmethod maintains stable positioning during large-scale movements. 
Compared to\nthe lightweight MobileNetV3-based VPR method, our proposed VPR solution\nimproves Recall@1 by at least 3\\% on two public datasets while reducing the\nnumber of parameters by 63.37\\%. It also achieves performance that is\ncomparable to the VGG16-based method. The VPR-PDR algorithm improves\nlocalization accuracy by more than 40\\% compared to the original PDR.\n","authors":["Liqiang Zhang Ye Tian Dongyan Wei"],"pdf_url":"https://arxiv.org/pdf/2411.19845v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.18868v2","updated":"2024-11-29T17:02:31Z","published":"2024-10-24T15:53:21Z","title":"A Riemannian Framework for Learning Reduced-order Lagrangian Dynamics","summary":" By incorporating physical consistency as inductive bias, deep neural networks\ndisplay increased generalization capabilities and data efficiency in learning\nnonlinear dynamic models. However, the complexity of these models generally\nincreases with the system dimensionality, requiring larger datasets, more\ncomplex deep networks, and significant computational effort. We propose a novel\ngeometric network architecture to learn physically-consistent reduced-order\ndynamic parameters that accurately describe the original high-dimensional\nsystem behavior. This is achieved by building on recent advances in model-order\nreduction and by adopting a Riemannian perspective to jointly learn a\nnon-linear structure-preserving latent space and the associated low-dimensional\ndynamics. 
Our approach enables accurate long-term predictions of the\nhigh-dimensional dynamics of rigid and deformable systems with increased data\nefficiency by inferring interpretable and physically plausible reduced\nLagrangian models.\n","authors":["Katharina Friedl","Noémie Jaquier","Jens Lundell","Tamim Asfour","Danica Kragic"],"pdf_url":"https://arxiv.org/pdf/2410.18868v2.pdf","comment":"29 pages, 16 figures"},{"id":"http://arxiv.org/abs/2411.19842v1","updated":"2024-11-29T16:58:02Z","published":"2024-11-29T16:58:02Z","title":"Scaling Transformers for Low-Bitrate High-Quality Speech Coding","summary":" The tokenization of speech with neural audio codec models is a vital part of\nmodern AI pipelines for the generation or understanding of speech, alone or in\na multimodal context. Traditionally such tokenization models have concentrated\non low parameter-count architectures using only components with strong\ninductive biases. In this work we show that by scaling a transformer\narchitecture with large parameter count to this problem, and applying a\nflexible Finite Scalar Quantization (FSQ) based bottleneck, it is possible to\nreach state-of-the-art speech quality at extremely low bit-rates of $400$ or\n$700$ bits-per-second. 
The trained models strongly out-perform existing\nbaselines in both objective and subjective tests.\n","authors":["Julian D Parker","Anton Smirnov","Jordi Pons","CJ Carr","Zack Zukowski","Zach Evans","Xubo Liu"],"pdf_url":"https://arxiv.org/pdf/2411.19842v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.17555v2","updated":"2024-11-29T16:53:50Z","published":"2024-11-26T16:18:38Z","title":"Multiscale spatiotemporal heterogeneity analysis of bike-sharing\n system's self-loop phenomenon: Evidence from Shanghai","summary":" Bike-sharing is an environmentally friendly shared mobility mode, but its\nself-loop phenomenon, where bikes are returned to the same station after\nseveral time usage, significantly impacts equity in accessing its services.\nTherefore, this study conducts a multiscale analysis with a spatial\nautoregressive model and double machine learning framework to assess\nsocioeconomic features and geospatial location's impact on the self-loop\nphenomenon at metro stations and street scales. The results reveal that\nbike-sharing self-loop intensity exhibits significant spatial lag effect at\nstreet scale and is positively associated with residential land use. Marginal\ntreatment effects of residential land use is higher on streets with middle-aged\nresidents, high fixed employment, and low car ownership. The multimodal public\ntransit condition reveals significant positive marginal treatment effects at\nboth scales. 
To enhance bike-sharing cooperation, we advocate augmenting\nbicycle availability in areas with high metro usage and low bus coverage,\nalongside implementing adaptable redistribution strategies.\n","authors":["Yichen Wang","Qing Yu","Yancun Song"],"pdf_url":"https://arxiv.org/pdf/2411.17555v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19835v1","updated":"2024-11-29T16:45:25Z","published":"2024-11-29T16:45:25Z","title":"Feedback-driven object detection and iterative model improvement","summary":" Automated object detection has become increasingly valuable across diverse\napplications, yet efficient, high-quality annotation remains a persistent\nchallenge. In this paper, we present the development and evaluation of a\nplatform designed to interactively improve object detection models. The\nplatform allows uploading and annotating images as well as fine-tuning object\ndetection models. Users can then manually review and refine annotations,\nfurther creating improved snapshots that are used for automatic object\ndetection on subsequent image uploads - a process we refer to as semi-automatic\nannotation resulting in a significant gain in annotation efficiency.\n Whereas iterative refinement of model results to speed up annotation has\nbecome common practice, we are the first to quantitatively evaluate its\nbenefits with respect to time, effort, and interaction savings. Our\nexperimental results show clear evidence for a significant time reduction of up\nto 53% for semi-automatic compared to manual annotation. Importantly, these\nefficiency gains did not compromise annotation quality, while matching or\noccasionally even exceeding the accuracy of manual annotations. 
These findings\ndemonstrate the potential of our lightweight annotation platform for creating\nhigh-quality object detection datasets and provide best practices to guide\nfuture development of annotation platforms.\n The platform is open-source, with the frontend and backend repositories\navailable on GitHub.\n","authors":["Sönke Tenckhoff","Mario Koddenbrock","Erik Rodner"],"pdf_url":"https://arxiv.org/pdf/2411.19835v1.pdf","comment":"AI4EA24 preprint"},{"id":"http://arxiv.org/abs/2308.04964v3","updated":"2024-11-29T16:33:12Z","published":"2023-08-09T13:58:03Z","title":"ModSec-AdvLearn: Countering Adversarial SQL Injections with Robust\n Machine Learning","summary":" Many Web Application Firewalls (WAFs) leverage the OWASP Core Rule Set (CRS)\nto block incoming malicious requests. The CRS consists of different sets of\nrules designed by domain experts to detect well-known web attack patterns. Both\nthe set of rules to be used and the weights used to combine them are manually\ndefined, yielding four different default configurations of the CRS. In this\nwork, we focus on the detection of SQL injection (SQLi) attacks, and show that\nthe manual configurations of the CRS typically yield a suboptimal trade-off\nbetween detection and false alarm rates. Furthermore, we show that these\nconfigurations are not robust to adversarial SQLi attacks, i.e.,\ncarefully-crafted attacks that iteratively refine the malicious SQLi payload by\nquerying the target WAF to bypass detection. To overcome these limitations, we\npropose (i) using machine learning to automate the selection of the set of\nrules to be combined along with their weights, i.e., customizing the CRS\nconfiguration based on the monitored web services; and (ii) leveraging\nadversarial training to significantly improve its robustness to adversarial\nSQLi manipulations. 
Our experiments, conducted using the well-known open-source\nModSecurity WAF equipped with the CRS rules, show that our approach, named\nModSec-AdvLearn, can (i) increase the detection rate up to 30%, while retaining\nnegligible false alarm rates and discarding up to 50% of the CRS rules; and\n(ii) improve robustness against adversarial SQLi attacks up to 85%, marking a\nsignificant stride toward designing more effective and robust WAFs. We release\nour open-source code at https://github.com/pralab/modsec-advlearn.\n","authors":["Biagio Montaruli","Giuseppe Floris","Christian Scano","Luca Demetrio","Andrea Valenza","Luca Compagna","Davide Ariu","Luca Piras","Davide Balzarotti","Battista Biggio"],"pdf_url":"https://arxiv.org/pdf/2308.04964v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19819v1","updated":"2024-11-29T16:27:55Z","published":"2024-11-29T16:27:55Z","title":"GradAlign for Training-free Model Performance Inference","summary":" Architecture plays an important role in deciding the performance of deep\nneural networks. However, the search for the optimal architecture is often\nhindered by the vast search space, making it a time-intensive process.\nRecently, a novel approach known as training-free neural architecture search\n(NAS) has emerged, aiming to discover the ideal architecture without\nnecessitating extensive training. Training-free NAS leverages various\nindicators for architecture selection, including metrics such as the count of\nlinear regions, the density of per-sample losses, and the stability of the\nfinite-width Neural Tangent Kernel (NTK) matrix. Despite the competitive\nempirical performance of current training-free NAS techniques, they suffer from\ncertain limitations, including inconsistent performance and a lack of deep\nunderstanding. 
In this paper, we introduce GradAlign, a simple yet effective\nmethod designed for inferring model performance without the need for training.\nAt its core, GradAlign quantifies the extent of conflicts within per-sample\ngradients during initialization, as substantial conflicts hinder model\nconvergence and ultimately result in worse performance. We evaluate GradAlign\nagainst established training-free NAS methods using standard NAS benchmarks,\nshowing better overall performance. Moreover, we show that the widely adopted\nmetric of linear region count may not suffice as a dependable criterion for\nselecting network architectures at initialization.\n","authors":["Yuxuan Li","Yunhui Guo"],"pdf_url":"https://arxiv.org/pdf/2411.19819v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.15788v2","updated":"2024-11-29T16:23:44Z","published":"2023-12-25T18:51:23Z","title":"Robust Stochastically-Descending Unrolled Networks","summary":" Deep unrolling, or unfolding, is an emerging learning-to-optimize method that\nunrolls a truncated iterative algorithm in the layers of a trainable neural\nnetwork. However, the convergence guarantees and generalizability of the\nunrolled networks are still open theoretical problems. To tackle these\nproblems, we provide deep unrolled architectures with a stochastic descent\nnature by imposing descending constraints during training. The descending\nconstraints are forced layer by layer to ensure that each unrolled layer takes,\non average, a descent step toward the optimum during training. We theoretically\nprove that the sequence constructed by the outputs of the unrolled layers is\nthen guaranteed to converge for unseen problems, assuming no distribution shift\nbetween training and test problems. We also show that standard unrolling is\nbrittle to perturbations, and our imposed constraints provide the unrolled\nnetworks with robustness to additive noise and perturbations. 
We numerically\nassess unrolled architectures trained under the proposed constraints in two\ndifferent applications, namely sparse coding using the learnable iterative\nshrinkage and thresholding algorithm (LISTA) and image inpainting using\nproximal generative flow (GLOW-Prox), and demonstrate the performance and\nrobustness benefits of the proposed method.\n","authors":["Samar Hadou","Navid NaderiAlizadeh","Alejandro Ribeiro"],"pdf_url":"https://arxiv.org/pdf/2312.15788v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.08130v2","updated":"2024-11-29T16:18:29Z","published":"2024-10-10T17:14:36Z","title":"Think Beyond Size: Adaptive Prompting for More Effective Reasoning","summary":" Pretrained large language models (LLMs) are increasingly utilized across a\nwide range of natural language processing (NLP) tasks due to their impressive\ncapabilities as few-shot learners. Recent techniques, such as chain-of-thought\n(CoT) prompting, have significantly advanced multi-step reasoning by\nintroducing step-by-step decomposition, achieving state-of-the-art results on\ncomplex reasoning benchmarks. However, these approaches often rely on static\nprompting templates that do not adapt to task complexity or errors during the\nreasoning process. In this work, we introduce Adaptive Prompting, a dynamic and\niterative framework designed to enhance reasoning by incorporating real-time\nadjustments to prompt structures and validation mechanisms. Experimental results\ndemonstrate that Adaptive Prompting significantly improves performance on\ndiverse reasoning benchmarks, including arithmetic reasoning (GSM8K,\nMultiArith), logical reasoning and commonsense tasks, achieving substantial\naccuracy gains compared to static prompting baselines. 
By integrating guided\nprompts, intermediate validation, and self-corrective steps, our approach\nenables smaller models to achieve competitive performance with larger\ncounterparts, such as GPT-4, while maintaining computational efficiency. The\nframework achieves this without requiring fine-tuning or task-specific training\ndata, highlighting the untapped potential of iterative reasoning methods.\n","authors":["Kamesh R"],"pdf_url":"https://arxiv.org/pdf/2410.08130v2.pdf","comment":"Submitted to ICLR 2025. This is a preprint version. Future revisions\n will include additional evaluations and refinements"},{"id":"http://arxiv.org/abs/2312.13842v2","updated":"2024-11-29T16:02:21Z","published":"2023-12-21T13:40:31Z","title":"Statistical learning theory and Occam's razor: The core argument","summary":" Statistical learning theory is often associated with the principle of Occam's\nrazor, which recommends a simplicity preference in inductive inference. This\npaper distills the core argument for simplicity obtainable from statistical\nlearning theory, built on the theory's central learning guarantee for the\nmethod of empirical risk minimization. This core \"means-ends\" argument is that\na simpler hypothesis class or inductive model is better because it has better\nlearning guarantees; however, these guarantees are model-relative and so the\ntheoretical push towards simplicity is checked by our prior knowledge.\n","authors":["Tom F. Sterkenburg"],"pdf_url":"https://arxiv.org/pdf/2312.13842v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2205.09622v6","updated":"2024-11-29T16:01:31Z","published":"2022-05-19T15:37:26Z","title":"What Is Fairness? 
On the Role of Protected Attributes and Fictitious\n Worlds","summary":" A growing body of literature in fairness-aware machine learning (fairML) aims\nto mitigate machine learning (ML)-related unfairness in automated\ndecision-making (ADM) by defining metrics that measure fairness of an ML model\nand by proposing methods to ensure that trained ML models achieve low scores on\nthese metrics. However, the underlying concept of fairness, i.e., the question\nof what fairness is, is rarely discussed, leaving a significant gap between\ncenturies of philosophical discussion and the recent adoption of the concept in\nthe ML community. In this work, we try to bridge this gap by formalizing a\nconsistent concept of fairness and by translating the philosophical\nconsiderations into a formal framework for the training and evaluation of ML\nmodels in ADM systems. We argue that fairness problems can arise even without\nthe presence of protected attributes (PAs), and point out that fairness and\npredictive performance are not irreconcilable opposites, but that the latter is\nnecessary to achieve the former. Furthermore, we argue why and how causal\nconsiderations are necessary when assessing fairness in the presence of PAs by\nproposing a fictitious, normatively desired (FiND) world in which PAs have no\ncausal effects. In practice, this FiND world must be approximated by a warped\nworld in which the causal effects of the PAs are removed from the real-world\ndata. Finally, we achieve greater linguistic clarity in the discussion of\nfairML. 
We outline algorithms for practical applications and present\nillustrative experiments on COMPAS data.\n","authors":["Ludwig Bothmann","Kristina Peters","Bernd Bischl"],"pdf_url":"https://arxiv.org/pdf/2205.09622v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19798v1","updated":"2024-11-29T16:00:52Z","published":"2024-11-29T16:00:52Z","title":"Rethinking the initialization of Momentum in Federated Learning with\n Heterogeneous Data","summary":" Data heterogeneity is a major challenge for Federated Learning performance.\nRecently, momentum-based optimization techniques have been proven to be\neffective in mitigating the heterogeneity issue. Along with the model updates,\nthe momentum updates are transmitted to the server side and aggregated.\nTherefore, the local training initialized with a global momentum is guided by\nthe global history of the gradients. However, we spot a problem in the\ntraditional accumulation of the momentum, which is suboptimal in Federated\nLearning systems. Traditional momentum places less weight on the historical gradients\nand more on the recent gradients. This, however, engages more biased local\ngradients at the end of the local training. In this work, we propose a new way\nto calculate the estimated momentum used in local initialization. The proposed\nmethod is named Reversed Momentum Federated Learning (RMFL). The key idea is\nto assign exponentially decaying weights to the gradients as time goes\nforward, which is the opposite of the traditional momentum accumulation. 
The\neffectiveness of RMFL is evaluated on three popular benchmark datasets with\ndifferent heterogeneity levels.\n","authors":["Chenguang Xiao","Shuo Wang"],"pdf_url":"https://arxiv.org/pdf/2411.19798v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19791v1","updated":"2024-11-29T15:52:59Z","published":"2024-11-29T15:52:59Z","title":"Tractable Agreement Protocols","summary":" We present an efficient reduction that converts any machine learning\nalgorithm into an interactive protocol, enabling collaboration with another\nparty (e.g., a human) to achieve consensus on predictions and improve accuracy.\nThis approach imposes calibration conditions on each party, which are\ncomputationally and statistically tractable relaxations of Bayesian\nrationality. These conditions are sensible even in prior-free settings,\nrepresenting a significant generalization of Aumann's classic \"agreement\ntheorem.\"\n In our protocol, the model first provides a prediction. The human then\nresponds by either agreeing or offering feedback. The model updates its state\nand revises its prediction, while the human may adjust their beliefs. This\niterative process continues until the two parties reach agreement. Initially,\nwe study a setting that extends Aumann's Agreement Theorem, where parties aim\nto agree on a one-dimensional expectation by iteratively sharing their current\nestimates. Here, we recover the convergence theorem of Aaronson'05 under weaker\nassumptions. We then address the case where parties hold beliefs over\ndistributions with d outcomes, exploring two feedback mechanisms. The first\ninvolves vector-valued estimates of predictions, while the second adopts a\ndecision-theoretic approach: the human, needing to take an action from a finite\nset based on utility, communicates their utility-maximizing action at each\nround. In this setup, the number of rounds until agreement remains independent\nof d. 
Finally, we generalize to scenarios with more than two parties, where\ncomputational complexity scales linearly with the number of participants. Our\nprotocols rely on simple, efficient conditions and produce predictions that\nsurpass the accuracy of any individual party's alone.\n","authors":["Natalie Collina","Surbhi Goel","Varun Gupta","Aaron Roth"],"pdf_url":"https://arxiv.org/pdf/2411.19791v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.13549v4","updated":"2024-11-29T15:51:23Z","published":"2023-06-23T15:21:52Z","title":"A Survey on Multimodal Large Language Models","summary":" Recently, Multimodal Large Language Model (MLLM) represented by GPT-4V has\nbeen a new rising research hotspot, which uses powerful Large Language Models\n(LLMs) as a brain to perform multimodal tasks. The surprising emergent\ncapabilities of MLLM, such as writing stories based on images and OCR-free math\nreasoning, are rare in traditional multimodal methods, suggesting a potential\npath to artificial general intelligence. To this end, both academia and\nindustry have endeavored to develop MLLMs that can compete with or even better\nthan GPT-4V, pushing the limit of research at a surprising speed. In this\npaper, we aim to trace and summarize the recent progress of MLLMs. First of\nall, we present the basic formulation of MLLM and delineate its related\nconcepts, including architecture, training strategy and data, as well as\nevaluation. Then, we introduce research topics about how MLLMs can be extended\nto support more granularity, modalities, languages, and scenarios. We continue\nwith multimodal hallucination and extended techniques, including Multimodal ICL\n(M-ICL), Multimodal CoT (M-CoT), and LLM-Aided Visual Reasoning (LAVR). To\nconclude the paper, we discuss existing challenges and point out promising\nresearch directions. 
In light of the fact that the era of MLLM has only just\nbegun, we will keep updating this survey and hope it can inspire more research.\nAn associated GitHub link collecting the latest papers is available at\nhttps://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.\n","authors":["Shukang Yin","Chaoyou Fu","Sirui Zhao","Ke Li","Xing Sun","Tong Xu","Enhong Chen"],"pdf_url":"https://arxiv.org/pdf/2306.13549v4.pdf","comment":"Accepted for publication in National Science Review. Project\n page:https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models"},{"id":"http://arxiv.org/abs/2411.19787v1","updated":"2024-11-29T15:49:06Z","published":"2024-11-29T15:49:06Z","title":"CAREL: Instruction-guided reinforcement learning with cross-modal\n auxiliary objectives","summary":" Grounding the instruction in the environment is a key step in solving\nlanguage-guided goal-reaching reinforcement learning problems. In automated\nreinforcement learning, a key concern is to enhance the model's ability to\ngeneralize across various tasks and environments. In goal-reaching scenarios,\nthe agent must comprehend the different parts of the instructions within the\nenvironmental context in order to complete the overall task successfully. In\nthis work, we propose CAREL (Cross-modal Auxiliary REinforcement Learning) as a\nnew framework to solve this problem using auxiliary loss functions inspired by\nvideo-text retrieval literature and a novel method called instruction tracking,\nwhich automatically keeps track of progress in an environment. The results of\nour experiments suggest superior sample efficiency and systematic\ngeneralization for this framework in multi-modal reinforcement learning\nproblems. 
Our code base is available here.\n","authors":["Armin Saghafian","Amirmohammad Izadi","Negin Hashemi Dijujin","Mahdieh Soleymani Baghshah"],"pdf_url":"https://arxiv.org/pdf/2411.19787v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19786v1","updated":"2024-11-29T15:48:24Z","published":"2024-11-29T15:48:24Z","title":"MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks","summary":" Recently, human motion analysis has experienced great improvement due to\ninspiring generative models such as the denoising diffusion model and large\nlanguage model. However, existing approaches mainly focus on generating\nmotions with textual descriptions and overlook the reciprocal task. In this\npaper, we present MoTe, a unified multi-modal model that could handle\ndiverse tasks by learning the marginal, conditional, and joint distributions of\nmotion and text simultaneously. MoTe enables us to handle the paired\ntext-motion generation, motion captioning, and text-driven motion generation by\nsimply modifying the input context. Specifically, MoTe is composed of three\ncomponents: Motion Encoder-Decoder (MED), Text Encoder-Decoder (TED), and\nMotion-Text Diffusion Model (MTDM). In particular, MED and TED are trained for\nextracting latent embeddings, and subsequently reconstructing the motion\nsequences and textual descriptions from the extracted embeddings, respectively.\nMTDM, on the other hand, performs an iterative denoising process on the input\ncontext to handle diverse tasks. 
Experimental results on the benchmark datasets\ndemonstrate the superior performance of our proposed method on text-to-motion\ngeneration and competitive performance on motion captioning.\n","authors":["Yiming Wu","Wei Ji","Kecheng Zheng","Zicheng Wang","Dong Xu"],"pdf_url":"https://arxiv.org/pdf/2411.19786v1.pdf","comment":"Five figures, six tables"},{"id":"http://arxiv.org/abs/2401.14907v2","updated":"2024-11-29T15:46:37Z","published":"2024-01-26T14:38:43Z","title":"Learning Local Control Barrier Functions for Hybrid Systems","summary":" Hybrid dynamical systems are ubiquitous as practical robotic applications\noften involve both continuous states and discrete switchings. Safety is a\nprimary concern for hybrid robotic systems. Existing safety-critical control\napproaches for hybrid systems are either computationally inefficient,\ndetrimental to system performance, or limited to small-scale systems. To amend\nthese drawbacks, in this paper, we propose a learning-enabled approach to\nconstruct local Control Barrier Functions (CBFs) to guarantee the safety of a\nwide class of nonlinear hybrid dynamical systems. The end result is a safe\nneural CBF-based switching controller. Our approach is computationally\nefficient, minimally invasive to any reference controller, and applicable to\nlarge-scale systems. We empirically evaluate our framework and demonstrate its\nefficacy and flexibility through two robotic examples including a\nhigh-dimensional autonomous racing case, against other CBF-based approaches and\nmodel predictive control.\n","authors":["Shuo Yang","Yu Chen","Xiang Yin","George J. 
Pappas","Rahul Mangharam"],"pdf_url":"https://arxiv.org/pdf/2401.14907v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19780v1","updated":"2024-11-29T15:35:37Z","published":"2024-11-29T15:35:37Z","title":"Machine learning force-field model for kinetic Monte Carlo simulations\n of itinerant Ising magnets","summary":" We present a scalable machine learning (ML) framework for large-scale kinetic\nMonte Carlo (kMC) simulations of itinerant electron Ising systems. As the\neffective interactions between Ising spins in such itinerant magnets are\nmediated by conducting electrons, the calculation of energy change due to a\nlocal spin update requires solving an electronic structure problem. Such\nrepeated electronic structure calculations could be overwhelmingly prohibitive\nfor large systems. Assuming the locality principle, a convolutional neural\nnetwork (CNN) model is developed to directly predict the effective local field\nand the corresponding energy change associated with a given spin update based\non Ising configuration in a finite neighborhood. As the kernel size of the CNN\nis fixed at a constant, the model can be directly scalable to kMC simulations\nof large lattices. Our approach is reminiscent of the ML force-field models\nwidely used in first-principles molecular dynamics simulations. Applying our ML\nframework to a square-lattice double-exchange Ising model, we uncover unusual\ncoarsening of ferromagnetic domains at low temperatures. 
Our work highlights\nthe potential of ML methods for large-scale modeling of similar itinerant\nsystems with discrete dynamical variables.\n","authors":["Alexa Tyberg","Yunhao Fan","Gia-Wei Chern"],"pdf_url":"https://arxiv.org/pdf/2411.19780v1.pdf","comment":"11 pages, 7 figures"},{"id":"http://arxiv.org/abs/2411.19774v1","updated":"2024-11-29T15:20:29Z","published":"2024-11-29T15:20:29Z","title":"PerLA: Perceptive 3D Language Assistant","summary":" Enabling Large Language Models (LLMs) to understand the 3D physical world is\nan emerging yet challenging research direction. Current strategies for\nprocessing point clouds typically downsample the scene or divide it into\nsmaller parts for separate analysis. However, both approaches risk losing key\nlocal details or global contextual information. In this paper, we introduce\nPerLA, a 3D language assistant designed to be more perceptive to both details\nand context, making visual representations more informative for the LLM. PerLA\ncaptures high-resolution (local) details in parallel from different point cloud\nareas and integrates them with (global) context obtained from a\nlower-resolution whole point cloud. We present a novel algorithm that preserves\npoint cloud locality through the Hilbert curve and effectively aggregates\nlocal-to-global information via cross-attention and a graph neural network.\nLastly, we introduce a novel loss for local representation consensus to promote\ntraining stability. 
PerLA outperforms state-of-the-art 3D language assistants,\nwith gains of up to +1.34 CIDEr on ScanQA for question answering, and +4.22 on\nScanRefer and +3.88 on Nr3D for dense\ncaptioning. https://gfmei.github.io/PerLA/\n","authors":["Guofeng Mei","Wei Lin","Luigi Riz","Yujiao Wu","Fabio Poiesi","Yiming Wang"],"pdf_url":"https://arxiv.org/pdf/2411.19774v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19772v1","updated":"2024-11-29T15:18:06Z","published":"2024-11-29T15:18:06Z","title":"LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware\n Omni-Modal Perception of Long Videos","summary":" Despite impressive advancements in video understanding, most efforts remain\nlimited to coarse-grained or visual-only video tasks. However, real-world\nvideos encompass omni-modal information (vision, audio, and speech) with a\nseries of events forming a cohesive storyline. The lack of multi-modal video\ndata with fine-grained event annotations and the high cost of manual labeling\nare major obstacles to comprehensive omni-modality video perception. To address\nthis gap, we propose an automatic pipeline consisting of high-quality\nmulti-modal video filtering, semantically coherent omni-modal event boundary\ndetection, and cross-modal correlation-aware event captioning. In this way, we\npresent LongVALE, the first-ever Vision-Audio-Language Event understanding\nbenchmark comprising 105K omni-modal events with precise temporal boundaries\nand detailed relation-aware captions within 8.4K high-quality long videos.\nFurther, we build a baseline that leverages LongVALE to enable video large\nlanguage models (LLMs) for omni-modality fine-grained temporal video\nunderstanding for the first time. 
Extensive experiments demonstrate the\neffectiveness and great potential of LongVALE in advancing comprehensive\nmulti-modal video understanding.\n","authors":["Tiantian Geng","Jinrui Zhang","Qingni Wang","Teng Wang","Jinming Duan","Feng Zheng"],"pdf_url":"https://arxiv.org/pdf/2411.19772v1.pdf","comment":"18 pages, 15 figures"},{"id":"http://arxiv.org/abs/2411.19769v1","updated":"2024-11-29T15:17:42Z","published":"2024-11-29T15:17:42Z","title":"Riemannian Denoising Score Matching for Molecular Structure Optimization\n with Accurate Energy","summary":" This study introduces a modified score matching method aimed at generating\nmolecular structures with high energy accuracy. The denoising process of score\nmatching or diffusion models mirrors molecular structure optimization, where\nscores act like physical force fields that guide particles toward equilibrium\nstates. To achieve energetically accurate structures, it can be advantageous to\nhave the score closely approximate the gradient of the actual potential energy\nsurface. Unlike conventional methods that simply design the target score based\non structural differences in Euclidean space, we propose a Riemannian score\nmatching approach. This method represents molecular structures on a manifold\ndefined by physics-informed internal coordinates to efficiently mimic the\nenergy landscape, and performs noising and denoising within this space. Our\nmethod has been evaluated by refining several types of starting structures on\nthe QM9 and GEOM datasets, demonstrating that the proposed Riemannian score\nmatching method significantly improves the accuracy of the generated molecular\nstructures, attaining chemical accuracy. 
The implications of this study extend\nto various applications in computational chemistry, offering a robust tool for\naccurate molecular structure prediction.\n","authors":["Jeheon Woo","Seonghwan Kim","Jun Hyeong Kim","Woo Youn Kim"],"pdf_url":"https://arxiv.org/pdf/2411.19769v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19766v1","updated":"2024-11-29T15:12:48Z","published":"2024-11-29T15:12:48Z","title":"Stock Price Prediction using Multi-Faceted Information based on Deep\n Recurrent Neural Networks","summary":" Accurate prediction of stock market trends is crucial for informed investment\ndecisions and effective portfolio management, ultimately leading to enhanced\nwealth creation and risk mitigation. This study proposes a novel approach for\npredicting stock prices in the stock market by integrating Convolutional Neural\nNetworks (CNN) and Long Short-Term Memory (LSTM) networks, using sentiment\nanalysis of social network data and candlestick data (price). The proposed\nmethodology consists of two primary components: sentiment analysis of social\nnetwork and candlestick data. By amalgamating candlestick data with insights\ngleaned from Twitter, this approach facilitates a more detailed and accurate\nexamination of market trends and patterns, ultimately leading to more effective\nstock price predictions. Additionally, a Random Forest algorithm is used to\nclassify tweets as either positive or negative, allowing for a more subtle and\ninformed assessment of market sentiment. This study uses CNN and LSTM networks\nto predict stock prices. The CNN extracts short-term features, while the LSTM\nmodels long-term dependencies. 
The integration of both networks enables a more\ncomprehensive analysis of market trends and patterns, leading to more accurate\nstock price predictions.\n","authors":["Lida Shahbandari","Elahe Moradi","Mohammad Manthouri"],"pdf_url":"https://arxiv.org/pdf/2411.19766v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.09010v6","updated":"2024-11-29T15:12:22Z","published":"2022-12-18T04:44:38Z","title":"Risk-Sensitive Reinforcement Learning with Exponential Criteria","summary":" While reinforcement learning has shown experimental success in a number of\napplications, it is known to be sensitive to noise and perturbations in the\nparameters of the system, leading to high variance in the total reward amongst\ndifferent episodes in slightly different environments. To introduce robustness,\nas well as sample efficiency, risk-sensitive reinforcement learning methods are\nbeing thoroughly studied. In this work, we provide a definition of robust\nreinforcement learning policies and formulate a risk-sensitive reinforcement\nlearning problem to approximate them, by solving an optimization problem with\nrespect to a modified objective based on exponential criteria. In particular,\nwe study a model-free risk-sensitive variation of the widely-used Monte Carlo\nPolicy Gradient algorithm and introduce a novel risk-sensitive online\nActor-Critic algorithm based on solving a multiplicative Bellman equation using\nstochastic approximation updates. Analytical results suggest that the use of\nexponential criteria generalizes commonly used ad-hoc regularization\napproaches, improves sample efficiency, and introduces robustness with respect\nto perturbations in the model parameters and the environment. 
The\nimplementation, performance, and robustness properties of the proposed methods\nare evaluated in simulated experiments.\n","authors":["Erfaun Noorani","Christos Mavridis","John Baras"],"pdf_url":"https://arxiv.org/pdf/2212.09010v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19763v1","updated":"2024-11-29T15:07:44Z","published":"2024-11-29T15:07:44Z","title":"Forecasting Foreign Exchange Market Prices Using Technical Indicators\n with Deep Learning and Attention Mechanism","summary":" Accurate prediction of price behavior in the foreign exchange market is\ncrucial. This paper proposes a novel approach that leverages technical\nindicators and deep neural networks. The proposed architecture consists of a\nLong Short-Term Memory (LSTM) and Convolutional Neural Network (CNN), and\nattention mechanism. Initially, trend and oscillation technical indicators are\nemployed to extract statistical features from Forex currency pair data,\nproviding insights into price trends, market volatility, relative price\nstrength, and overbought and oversold conditions. Subsequently, the LSTM and\nCNN networks are utilized in parallel to predict future price movements,\nleveraging the strengths of both recurrent and convolutional architectures. The\nLSTM network captures long-term dependencies and temporal patterns in the data,\nwhile the CNN network extracts local patterns. The outputs of the parallel LSTM\nand CNN networks are then fed into an attention mechanism, which learns to\nweigh the importance of each feature and temporal dependency, generating a\ncontext-aware representation of the input data. The attention-weighted output\nis then used to predict future price movements, enabling the model to focus on\nthe most relevant features and temporal dependencies. 
Through a comprehensive\nevaluation of the proposed approach on multiple Forex currency pairs, we\ndemonstrate its effectiveness in predicting price behavior and outperforming\nbenchmark models.\n","authors":["Sahabeh Saadati","Mohammad Manthouri"],"pdf_url":"https://arxiv.org/pdf/2411.19763v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19758v1","updated":"2024-11-29T15:04:40Z","published":"2024-11-29T15:04:40Z","title":"LaVIDE: A Language-Vision Discriminator for Detecting Changes in\n Satellite Image with Map References","summary":" Change detection, which typically relies on the comparison of bi-temporal\nimages, is significantly hindered when only a single image is available.\nComparing a single image with an existing map, such as OpenStreetMap, which is\ncontinuously updated through crowd-sourcing, offers a viable solution to this\nchallenge. Unlike images that carry low-level visual details of ground objects,\nmaps convey high-level categorical information. This discrepancy in abstraction\nlevels complicates the alignment and comparison of the two data types. In this\npaper, we propose a \\textbf{La}nguage-\\textbf{VI}sion \\textbf{D}iscriminator\nfor d\\textbf{E}tecting changes in satellite image with map references, namely\n\\ours{}, which leverages language to bridge the information gap between maps\nand images. Specifically, \\ours{} formulates change detection as the problem of\n``{\\textit Does the pixel belong to [class]?}'', aligning maps and images\nwithin the feature space of the language-vision model to associate high-level\nmap categories with low-level image details. 
Moreover, we build a\nmixture-of-experts discriminative module, which compares linguistic features\nfrom maps with visual features from images across various semantic\nperspectives, achieving comprehensive semantic comparison for change detection.\nExtensive evaluation on four benchmark datasets demonstrates that \\ours{} can\neffectively detect changes in satellite image with map references,\noutperforming state-of-the-art change detection algorithms, e.g., with gains of\nabout $13.8$\\% on the DynamicEarthNet dataset and $4.3$\\% on the SECOND\ndataset.\n","authors":["Shuguo Jiang","Fang Xu","Sen Jia","Gui-Song Xia"],"pdf_url":"https://arxiv.org/pdf/2411.19758v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19757v1","updated":"2024-11-29T15:01:25Z","published":"2024-11-29T15:01:25Z","title":"Dual Risk Minimization: Towards Next-Level Robustness in Fine-tuning\n Zero-Shot Models","summary":" Fine-tuning foundation models often compromises their robustness to\ndistribution shifts. To remedy this, most robust fine-tuning methods aim to\npreserve the pre-trained features. However, not all pre-trained features are\nrobust and those methods are largely indifferent to which ones to preserve. We\npropose dual risk minimization (DRM), which combines empirical risk\nminimization with worst-case risk minimization, to better preserve the core\nfeatures of downstream tasks. In particular, we utilize core-feature\ndescriptions generated by LLMs to induce core-based zero-shot predictions which\nthen serve as proxies to estimate the worst-case risk. DRM balances two crucial\naspects of model robustness: expected performance and worst-case performance,\nestablishing a new state of the art on various real-world benchmarks. DRM\nsignificantly improves the out-of-distribution performance of CLIP ViT-L/14@336\non ImageNet (75.9 to 77.1), WILDS-iWildCam (47.1 to 51.8), and WILDS-FMoW (50.7\nto 53.1); opening up new avenues for robust fine-tuning. 
Our code is available\nat https://github.com/vaynexie/DRM .\n","authors":["Kaican Li","Weiyan Xie","Yongxiang Huang","Didan Deng","Lanqing Hong","Zhenguo Li","Ricardo Silva","Nevin L. Zhang"],"pdf_url":"https://arxiv.org/pdf/2411.19757v1.pdf","comment":"NeurIPS 2024"},{"id":"http://arxiv.org/abs/2411.19756v1","updated":"2024-11-29T15:00:38Z","published":"2024-11-29T15:00:38Z","title":"DeSplat: Decomposed Gaussian Splatting for Distractor-Free Rendering","summary":" Gaussian splatting enables fast novel view synthesis in static 3D\nenvironments. However, reconstructing real-world environments remains\nchallenging as distractors or occluders break the multi-view consistency\nassumption required for accurate 3D reconstruction. Most existing methods rely\non external semantic information from pre-trained models, introducing\nadditional computational overhead as pre-processing steps or during\noptimization. In this work, we propose a novel method, DeSplat, that directly\nseparates distractors and static scene elements purely based on volume\nrendering of Gaussian primitives. We initialize Gaussians within each camera\nview for reconstructing the view-specific distractors to separately model the\nstatic 3D scene and distractors in the alpha compositing stages. DeSplat yields\nan explicit scene separation of static elements and distractors, achieving\ncomparable results to prior distractor-free approaches without sacrificing\nrendering speed. We demonstrate DeSplat's effectiveness on three benchmark data\nsets for distractor-free novel view synthesis. 
See the project website at\nhttps://aaltoml.github.io/desplat/.\n","authors":["Yihao Wang","Marcus Klasson","Matias Turkulainen","Shuzhe Wang","Juho Kannala","Arno Solin"],"pdf_url":"https://arxiv.org/pdf/2411.19756v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.07808v3","updated":"2024-11-29T14:50:25Z","published":"2024-02-12T17:13:02Z","title":"Sourcerer: Sample-based Maximum Entropy Source Distribution Estimation","summary":" Scientific modeling applications often require estimating a distribution of\nparameters consistent with a dataset of observations - an inference task also\nknown as source distribution estimation. This problem can be ill-posed,\nhowever, since many different source distributions might produce the same\ndistribution of data-consistent simulations. To make a principled choice among\nmany equally valid sources, we propose an approach which targets the maximum\nentropy distribution, i.e., prioritizes retaining as much uncertainty as\npossible. Our method is purely sample-based - leveraging the Sliced-Wasserstein\ndistance to measure the discrepancy between the dataset and simulations - and\nthus suitable for simulators with intractable likelihoods. We benchmark our\nmethod on several tasks, and show that it can recover source distributions with\nsubstantially higher entropy than recent source estimation methods, without\nsacrificing the fidelity of the simulations. Finally, to demonstrate the\nutility of our approach, we infer source distributions for parameters of the\nHodgkin-Huxley model from experimental datasets with thousands of single-neuron\nmeasurements. In summary, we propose a principled method for inferring source\ndistributions of scientific simulator parameters while retaining as much\nuncertainty as possible.\n","authors":["Julius Vetter","Guy Moss","Cornelius Schröder","Richard Gao","Jakob H. 
Macke"],"pdf_url":"https://arxiv.org/pdf/2402.07808v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.21302v3","updated":"2024-11-29T14:49:09Z","published":"2024-10-21T22:52:25Z","title":"Domain-Adaptive Pre-training of Self-Supervised Foundation Models for\n Medical Image Classification in Gastrointestinal Endoscopy","summary":" Video capsule endoscopy has transformed gastrointestinal endoscopy (GIE)\ndiagnostics by offering a non-invasive method for capturing detailed images of\nthe gastrointestinal tract, enabling early disease detection. However, its\npotential is limited by the sheer volume of images generated during the imaging\nprocedure, which can take anywhere from 6-8 hours and often produce up to 1\nmillion images, necessitating automated analysis. Additionally, the variability\nof these images, combined with the need for expert annotations and the scarcity\nof large, high-quality labeled datasets, constrains the effectiveness of\ncurrent medical image analysis models. To address this, we introduce a novel\nlarge GIE dataset, called EndoExtend24, created by merging ten existing public\nand private datasets, ensuring patient integrity across splits. EndoExtend24\nincludes over 226,000 labeled images, as well as dynamic class mappings, which\nallow unified training across datasets with differing labeling granularity,\nsupporting up to 123 distinct pathological findings. Further, we propose to\nleverage domain adaptive pre-training of foundation models trained with\nself-supervision on generic image data, to adapt them to the task of GIE\nmedical image diagnosis. Specifically, the EVA-02 model, which is based on the\nViT architecture and trained on ImageNet-22k with masked image modeling (using\nEVA-CLIP as a MIM teacher), is pre-trained on the EndoExtend24 dataset to\nachieve domain adaptation, and finally trained on the Capsule Endoscopy 2024\nChallenge dataset. 
Our model demonstrates robust performance, securing third\nplace in the Capsule Endoscopy 2024 Challenge. We achieved a macro AUC of 0.762\nand a balanced accuracy of 37.1% on the test set. These results emphasize the\neffectiveness of our domain-adaptive pre-training approach and the enriched\nEndoExtend24 dataset in advancing gastrointestinal endoscopy diagnostics.\n","authors":["Marcel Roth","Micha V. Nowak","Adrian Krenzer","Frank Puppe"],"pdf_url":"https://arxiv.org/pdf/2410.21302v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19747v1","updated":"2024-11-29T14:47:08Z","published":"2024-11-29T14:47:08Z","title":"A Multi-Loss Strategy for Vehicle Trajectory Prediction: Combining\n Off-Road, Diversity, and Directional Consistency Losses","summary":" Trajectory prediction is essential for the safety and efficiency of planning\nin autonomous vehicles. However, current models often fail to fully capture\ncomplex traffic rules and the complete range of potential vehicle movements.\nAddressing these limitations, this study introduces three novel loss functions:\nOffroad Loss, Direction Consistency Error, and Diversity Loss. These functions\nare designed to keep predicted paths within driving area boundaries, aligned\nwith traffic directions, and cover a wider variety of plausible driving\nscenarios. As all prediction modes should adhere to road rules and conditions,\nthis work overcomes the shortcomings of traditional \"winner takes all\" training\nmethods by applying the loss functions to all prediction modes. These loss\nfunctions not only improve model training but can also serve as metrics for\nevaluating the realism and diversity of trajectory predictions. Extensive\nvalidation on the nuScenes and Argoverse 2 datasets with leading baseline\nmodels demonstrates that our approach not only maintains accuracy but\nsignificantly improves safety and robustness, reducing offroad errors on\naverage by 47% on original and by 37% on attacked scenes. 
This work sets a new\nbenchmark for trajectory prediction in autonomous driving, offering substantial\nimprovements in navigating complex environments. Our code is available at\nhttps://github.com/vita-epfl/stay-on-track .\n","authors":["Ahmad Rahimi","Alexandre Alahi"],"pdf_url":"https://arxiv.org/pdf/2411.19747v1.pdf","comment":"Preprint, 7 pages, 4 figures and 2 tables"},{"id":"http://arxiv.org/abs/2411.19746v1","updated":"2024-11-29T14:46:37Z","published":"2024-11-29T14:46:37Z","title":"HVAC-DPT: A Decision Pretrained Transformer for HVAC Control","summary":" Building operations consume approximately 40% of global energy, with Heating,\nVentilation, and Air Conditioning (HVAC) systems responsible for up to 50% of\nthis consumption. As HVAC energy demands are expected to rise, optimising\nsystem efficiency is crucial for reducing future energy use and mitigating\nclimate change. Existing control strategies lack generalisation and require\nextensive training and data, limiting their rapid deployment across diverse\nbuildings. This paper introduces HVAC-DPT, a Decision-Pretrained Transformer\nusing in-context Reinforcement Learning (RL) for multi-zone HVAC control.\nHVAC-DPT frames HVAC control as a sequential prediction task, training a causal\ntransformer on interaction histories generated by diverse RL agents. This\napproach enables HVAC-DPT to refine its policy in-context, without modifying\nnetwork parameters, allowing for deployment across different buildings without\nthe need for additional training or data collection. 
HVAC-DPT reduces energy\nconsumption in unseen buildings by 45% compared to the baseline controller,\noffering a scalable and effective approach to mitigating the increasing\nenvironmental impact of HVAC systems.\n","authors":["Anaïs Berkes"],"pdf_url":"https://arxiv.org/pdf/2411.19746v1.pdf","comment":"7 pages, 3 figures, 3 tables"},{"id":"http://arxiv.org/abs/2411.17593v2","updated":"2024-11-29T14:41:48Z","published":"2024-11-26T17:01:27Z","title":"What Differentiates Educational Literature? A Multimodal Fusion Approach\n of Transformers and Computational Linguistics","summary":" The integration of new literature into the English curriculum remains a\nchallenge since educators often lack scalable tools to rapidly evaluate\nreadability and adapt texts for diverse classroom needs. This study proposes to\naddress this gap through a multimodal approach that combines transformer-based\ntext classification with linguistic feature analysis to align texts with UK Key\nStages. Eight state-of-the-art Transformers were fine-tuned on segmented text\ndata, with BERT achieving the highest unimodal F1 score of 0.75. In parallel,\n500 deep neural network topologies were searched for the classification of\nlinguistic characteristics, achieving an F1 score of 0.392. The fusion of these\nmodalities shows a significant improvement, with every multimodal approach\noutperforming all unimodal models. In particular, the ELECTRA Transformer fused\nwith the neural network achieved an F1 score of 0.996. Unimodal and multimodal\napproaches are shown to have statistically significant differences in all\nvalidation metrics (accuracy, precision, recall, F1 score) except for inference\ntime. The proposed approach is finally encapsulated in a stakeholder-facing web\napplication, providing non-technical stakeholder access to real-time insights\non text complexity, reading difficulty, curriculum alignment, and\nrecommendations for learning age range. 
The application empowers data-driven\ndecision making and reduces manual workload by integrating AI-based\nrecommendations into lesson planning for English literature.\n","authors":["Jordan J. Bird"],"pdf_url":"https://arxiv.org/pdf/2411.17593v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19744v1","updated":"2024-11-29T14:40:36Z","published":"2024-11-29T14:40:36Z","title":"Amplifying human performance in combinatorial competitive programming","summary":" Recent years have seen a significant surge in complex AI systems for\ncompetitive programming, capable of performing at admirable levels against\nhuman competitors. While steady progress has been made, the highest percentiles\nstill remain out of reach for these methods on standard competition platforms\nsuch as Codeforces. Here we instead focus on combinatorial competitive\nprogramming, where the target is to find as-good-as-possible solutions to\notherwise computationally intractable problems, over specific given inputs. We\nhypothesise that this scenario offers a unique testbed for human-AI synergy, as\nhuman programmers can write a backbone of a heuristic solution, after which AI\ncan be used to optimise the scoring function used by the heuristic. We deploy\nour approach on previous iterations of Hash Code, a global team programming\ncompetition inspired by NP-hard software engineering problems at Google, and we\nleverage FunSearch to evolve our scoring functions. Our evolved solutions\nsignificantly improve the attained scores from their baseline, successfully\nbreaking into the top percentile on all previous Hash Code online qualification\nrounds, and outperforming the top human teams on several. 
Our method is also\nperformant on an optimisation problem that featured in a recent held-out\nAtCoder contest.\n","authors":["Petar Veličković","Alex Vitvitskyi","Larisa Markeeva","Borja Ibarz","Lars Buesing","Matej Balog","Alexander Novikov"],"pdf_url":"https://arxiv.org/pdf/2411.19744v1.pdf","comment":"Technical report. 18 pages, 8 figures"},{"id":"http://arxiv.org/abs/2411.19742v1","updated":"2024-11-29T14:40:19Z","published":"2024-11-29T14:40:19Z","title":"Graph Neural Networks for Heart Failure Prediction on an EHR-Based\n Patient Similarity Graph","summary":" Objective: In modern healthcare, accurately predicting diseases is a crucial\nmatter. This study introduces a novel approach using graph neural networks\n(GNNs) and a Graph Transformer (GT) to predict the incidence of heart failure\n(HF) on a patient similarity graph at the next hospital visit. Materials and\nMethods: We used electronic health records (EHR) from the MIMIC-III dataset and\napplied the K-Nearest Neighbors (KNN) algorithm to create a patient similarity\ngraph using embeddings from diagnoses, procedures, and medications. Three\nmodels - GraphSAGE, Graph Attention Network (GAT), and Graph Transformer (GT) -\nwere implemented to predict HF incidence. Model performance was evaluated using\nF1 score, AUROC, and AUPRC metrics, and results were compared against baseline\nalgorithms. An interpretability analysis was performed to understand the\nmodel's decision-making process. Results: The GT model demonstrated the best\nperformance (F1 score: 0.5361, AUROC: 0.7925, AUPRC: 0.5168). Although the\nRandom Forest (RF) baseline achieved a similar AUPRC value, the GT model\noffered enhanced interpretability due to the use of patient relationships in\nthe graph structure. A joint analysis of attention weights, graph connectivity,\nand clinical features provided insight into model predictions across different\nclassification groups. 
Discussion and Conclusion: Graph-based approaches such\nas GNNs provide an effective framework for predicting HF. By leveraging a\npatient similarity graph, GNNs can capture complex relationships in EHR data,\npotentially improving prediction accuracy and clinical interpretability.\n","authors":["Heloisa Oss Boll","Ali Amirahmadi","Amira Soliman","Stefan Byttner","Mariana Recamonde-Mendoza"],"pdf_url":"https://arxiv.org/pdf/2411.19742v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.00754v2","updated":"2024-11-29T14:27:46Z","published":"2024-05-01T07:24:30Z","title":"CLIPArTT: Adaptation of CLIP to New Domains at Test Time","summary":" Pre-trained vision-language models (VLMs), exemplified by CLIP, demonstrate\nremarkable adaptability across zero-shot classification tasks without\nadditional training. However, their performance diminishes in the presence of\ndomain shifts. In this study, we introduce CLIP Adaptation duRing Test-Time\n(CLIPArTT), a fully test-time adaptation (TTA) approach for CLIP, which\ninvolves automatic text prompts construction during inference for their use as\ntext supervision. Our method employs a unique, minimally invasive text prompt\ntuning process, wherein multiple predicted classes are aggregated into a single\nnew text prompt, used as \\emph{pseudo label} to re-classify inputs in a\ntransductive manner. Additionally, we pioneer the standardization of TTA\nbenchmarks (e.g., TENT) in the realm of VLMs. Our findings demonstrate that,\nwithout requiring additional transformations nor new trainable modules,\nCLIPArTT enhances performance dynamically across non-corrupted datasets such as\nCIFAR-100, corrupted datasets like CIFAR-100-C and ImageNet-C, alongside\nsynthetic datasets such as VisDA-C. This research underscores the potential for\nimproving VLMs' adaptability through novel test-time strategies, offering\ninsights for robust performance across varied datasets and environments. 
The\ncode can be found at: https://github.com/dosowiechi/CLIPArTT.git\n","authors":["Gustavo Adolfo Vargas Hakim","David Osowiechi","Mehrdad Noori","Milad Cheraghalikhani","Ali Bahri","Moslem Yazdanpanah","Ismail Ben Ayed","Christian Desrosiers"],"pdf_url":"https://arxiv.org/pdf/2405.00754v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19734v1","updated":"2024-11-29T14:27:28Z","published":"2024-11-29T14:27:28Z","title":"A Note on Small Percolating Sets on Hypercubes via Generative AI","summary":" We apply a generative AI pattern-recognition technique called PatternBoost to\nstudy bootstrap percolation on hypercubes. With this, we slightly improve the\nbest existing upper bound for the size of percolating subsets of the hypercube.\n","authors":["Gergely Bérczi","Adam Zsolt Wagner"],"pdf_url":"https://arxiv.org/pdf/2411.19734v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19732v1","updated":"2024-11-29T14:25:54Z","published":"2024-11-29T14:25:54Z","title":"Improving generalization of robot locomotion policies via\n Sharpness-Aware Reinforcement Learning","summary":" Reinforcement learning often requires extensive training data.\nSimulation-to-real transfer offers a promising approach to address this\nchallenge in robotics. While differentiable simulators offer improved sample\nefficiency through exact gradients, they can be unstable in contact-rich\nenvironments and may lead to poor generalization. This paper introduces a novel\napproach integrating sharpness-aware optimization into gradient-based\nreinforcement learning algorithms. Our simulation results demonstrate that our\nmethod, tested on contact-rich environments, significantly enhances policy\nrobustness to environmental variations and action perturbations while\nmaintaining the sample efficiency of first-order methods. Specifically, our\napproach improves action noise tolerance compared to standard first-order\nmethods and achieves generalization comparable to zeroth-order methods. 
This\nimprovement stems from finding flatter minima in the loss landscape, associated\nwith better generalization. Our work offers a promising solution to balance\nefficient learning and robust sim-to-real transfer in robotics, potentially\nbridging the gap between simulation and real-world performance.\n","authors":["Severin Bochem","Eduardo Gonzalez-Sanchez","Yves Bicker","Gabriele Fadini"],"pdf_url":"https://arxiv.org/pdf/2411.19732v1.pdf","comment":"9 pages, 6 figures"},{"id":"http://arxiv.org/abs/2411.19731v1","updated":"2024-11-29T14:24:33Z","published":"2024-11-29T14:24:33Z","title":"Real-Time Anomaly Detection in Video Streams","summary":" This thesis is part of a CIFRE agreement between the company Othello and the\nLIASD laboratory. The objective is to develop an artificial intelligence system\nthat can detect real-time dangers in a video stream. To achieve this, a novel\napproach combining temporal and spatial analysis has been proposed. Several\navenues have been explored to improve anomaly detection by integrating object\ndetection, human pose detection, and motion analysis. For result\ninterpretability, techniques commonly used for image analysis, such as\nactivation and saliency maps, have been extended to videos, and an original\nmethod has been proposed. The proposed architecture performs binary or\nmulticlass classification depending on whether an alert or the cause needs to\nbe identified. Numerous neural network models have been tested, and three of\nthem have been selected. You Only Look Once (YOLO) has been used for spatial\nanalysis, a Convolutional Recurrent Neural Network (CRNN) composed of VGG19\nand a Gated Recurrent Unit (GRU) for temporal analysis, and a multi-layer\nperceptron for classification. These models handle different types of data and\ncan be combined in parallel or in series. Although the parallel mode is faster,\nthe serial mode is generally more reliable. 
For training these models,\nsupervised learning was chosen, and two proprietary datasets were created. The\nfirst dataset focuses on objects that may play a potential role in anomalies,\nwhile the second consists of videos containing anomalies or non-anomalies. This\napproach allows for the processing of both continuous video streams and finite\nvideos, providing greater flexibility in detection.\n","authors":["Fabien Poirier"],"pdf_url":"https://arxiv.org/pdf/2411.19731v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19729v1","updated":"2024-11-29T14:22:51Z","published":"2024-11-29T14:22:51Z","title":"Risk-Averse Certification of Bayesian Neural Networks","summary":" In light of the inherently complex and dynamic nature of real-world\nenvironments, incorporating risk measures is crucial for the robustness\nevaluation of deep learning models. In this work, we propose a Risk-Averse\nCertification framework for Bayesian neural networks called RAC-BNN. Our method\nleverages sampling and optimisation to compute a sound approximation of the\noutput set of a BNN, represented using a set of template polytopes. To enhance\nrobustness evaluation, we integrate a coherent distortion risk\nmeasure--Conditional Value at Risk (CVaR)--into the certification framework,\nproviding probabilistic guarantees based on empirical distributions obtained\nthrough sampling. We validate RAC-BNN on a range of regression and\nclassification benchmarks and compare its performance with a state-of-the-art\nmethod. 
The results show that RAC-BNN effectively quantifies robustness under\nworst-performing risky scenarios, and achieves tighter certified bounds and\nhigher efficiency in complex tasks.\n","authors":["Xiyue Zhang","Zifan Wang","Yulong Gao","Licio Romao","Alessandro Abate","Marta Kwiatkowska"],"pdf_url":"https://arxiv.org/pdf/2411.19729v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19726v1","updated":"2024-11-29T14:17:33Z","published":"2024-11-29T14:17:33Z","title":"Towards Santali Linguistic Inclusion: Building the First\n Santali-to-English Translation Model using mT5 Transformer and Data\n Augmentation","summary":" Around seven million individuals in India, Bangladesh, Bhutan, and Nepal\nspeak Santali, positioning it as nearly the third most commonly used\nAustroasiatic language. Despite its prominence among the Austroasiatic language\nfamily's Munda subfamily, Santali lacks global recognition. Currently, no\ntranslation models exist for the Santali language. Our paper aims to include\nSantali in the NLP spectrum. We aim to examine the feasibility of building\nSantali translation models based on available Santali corpora. The paper\nsuccessfully addressed the low-resource problem and, with promising results,\nexamined the possibility of creating a functional Santali machine translation\nmodel in a low-resource setup. Our study shows that Santali-English parallel\ncorpus performs better when in transformers like mt5 as opposed to untrained\ntransformers, proving that transfer learning can be a viable technique that\nworks with Santali language. Besides the mT5 transformer, Santali-English\nperforms better than Santali-Bangla parallel corpus as the mT5 has been trained\nin way more English data than Bangla data. 
Lastly, our study shows that with\ndata augmentation, our model performs better.\n","authors":["Syed Mohammed Mostaque Billah","Ateya Ahmed Subarna","Sudipta Nandi Sarna","Ahmad Shawkat Wasit","Anika Fariha","Asif Sushmit","Arig Yousuf Sadeque"],"pdf_url":"https://arxiv.org/pdf/2411.19726v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19722v1","updated":"2024-11-29T14:14:59Z","published":"2024-11-29T14:14:59Z","title":"JetFormer: An Autoregressive Generative Model of Raw Images and Text","summary":" Removing modeling constraints and unifying architectures across domains has\nbeen a key driver of the recent progress in training large multimodal models.\nHowever, most of these models still rely on many separately trained components\nsuch as modality-specific encoders and decoders. In this work, we further\nstreamline joint generative modeling of images and text. We propose an\nautoregressive decoder-only transformer - JetFormer - which is trained to\ndirectly maximize the likelihood of raw data, without relying on any separately\npretrained components, and can understand and generate both text and images.\nSpecifically, we leverage a normalizing flow model to obtain a soft-token image\nrepresentation that is jointly trained with an autoregressive multimodal\ntransformer. The normalizing flow model serves as both an image encoder for\nperception tasks and an image decoder for image generation tasks during\ninference. JetFormer achieves text-to-image generation quality competitive with\nrecent VQ-VAE- and VAE-based baselines. These baselines rely on pretrained\nimage autoencoders, which are trained with a complex mixture of losses,\nincluding perceptual ones. At the same time, JetFormer demonstrates robust\nimage understanding capabilities. 
To the best of our knowledge, JetFormer is\nthe first model that is capable of generating high-fidelity images and\nproducing strong log-likelihood bounds.\n","authors":["Michael Tschannen","André Susano Pinto","Alexander Kolesnikov"],"pdf_url":"https://arxiv.org/pdf/2411.19722v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19719v1","updated":"2024-11-29T14:08:48Z","published":"2024-11-29T14:08:48Z","title":"Relative Representations of Latent Spaces enable Efficient Semantic\n Channel Equalization","summary":" In multi-user semantic communication, language mismatch poses a significant\nchallenge when independently trained agents interact. We present a novel\nsemantic equalization algorithm that enables communication between agents with\ndifferent languages without additional retraining. Our algorithm is based on\nrelative representations, a framework that enables different agents employing\ndifferent neural network models to have unified representation. It proceeds by\nprojecting the latent vectors of different models into a common space defined\nrelative to a set of data samples called \\textit{anchors}, whose number equals\nthe dimension of the resulting space. A communication between different agents\ntranslates to a communication of semantic symbols sampled from this relative\nspace. This approach, in addition to aligning the semantic representations of\ndifferent agents, allows compressing the amount of information being exchanged,\nby appropriately selecting the number of anchors. Eventually, we introduce a\nnovel anchor selection strategy, which advantageously determines prototypical\nanchors, capturing the most relevant information for the downstream task. 
Our\nnumerical results show the effectiveness of the proposed approach allowing\nseamless communication between agents with radically different models,\nincluding differences in terms of neural network architecture and datasets used\nfor initial training.\n","authors":["Tomás Hüttebräucker","Simone Fiorellino","Mohamed Sana","Paolo Di Lorenzo","Emilio Calvanese Strinati"],"pdf_url":"https://arxiv.org/pdf/2411.19719v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19717v1","updated":"2024-11-29T14:06:58Z","published":"2024-11-29T14:06:58Z","title":"MonoPP: Metric-Scaled Self-Supervised Monocular Depth Estimation by\n Planar-Parallax Geometry in Automotive Applications","summary":" Self-supervised monocular depth estimation (MDE) has gained popularity for\nobtaining depth predictions directly from videos. However, these methods often\nproduce scale invariant results, unless additional training signals are\nprovided. Addressing this challenge, we introduce a novel self-supervised\nmetric-scaled MDE model that requires only monocular video data and the\ncamera's mounting position, both of which are readily available in modern\nvehicles. Our approach leverages planar-parallax geometry to reconstruct scene\nstructure. The full pipeline consists of three main networks, a multi-frame\nnetwork, a singleframe network, and a pose network. The multi-frame network\nprocesses sequential frames to estimate the structure of the static scene using\nplanar-parallax geometry and the camera mounting position. Based on this\nreconstruction, it acts as a teacher, distilling knowledge such as scale\ninformation, masked drivable area, metric-scale depth for the static scene, and\ndynamic object mask to the singleframe network. It also aids the pose network\nin predicting a metric-scaled relative pose between two subsequent images. Our\nmethod achieved state-of-the-art results for the driving benchmark KITTI for\nmetric-scaled depth prediction. 
Notably, it is one of the first methods to\nproduce self-supervised metric-scaled depth prediction for the challenging\nCityscapes dataset, demonstrating its effectiveness and versatility.\n","authors":["Gasser Elazab","Torben Gräber","Michael Unterreiner","Olaf Hellwich"],"pdf_url":"https://arxiv.org/pdf/2411.19717v1.pdf","comment":"Accepted at WACV 25, project page: https://mono-pp.github.io/"},{"id":"http://arxiv.org/abs/2411.19715v1","updated":"2024-11-29T14:02:11Z","published":"2024-11-29T14:02:11Z","title":"Forensics Adapter: Adapting CLIP for Generalizable Face Forgery\n Detection","summary":" We describe the Forensics Adapter, an adapter network designed to transform\nCLIP into an effective and generalizable face forgery detector. Although CLIP\nis highly versatile, adapting it for face forgery detection is non-trivial as\nforgery-related knowledge is entangled with a wide range of unrelated\nknowledge. Existing methods treat CLIP merely as a feature extractor, lacking\ntask-specific adaptation, which limits their effectiveness. To address this, we\nintroduce an adapter to learn face forgery traces -- the blending boundaries\nunique to forged faces, guided by task-specific objectives. Then we enhance the\nCLIP visual tokens with a dedicated interaction strategy that communicates\nknowledge across CLIP and the adapter. Since the adapter is alongside CLIP, its\nversatility is highly retained, naturally ensuring strong generalizability in\nface forgery detection. With only $\\bm{5.7M}$ trainable parameters, our method\nachieves a significant performance boost, improving by approximately $\\bm{7\\%}$\non average across five standard datasets. 
We believe the proposed method can\nserve as a baseline for future CLIP-based face forgery detection methods.\n","authors":["Xinjie Cui","Yuezun Li","Ao Luo","Jiaran Zhou","Junyu Dong"],"pdf_url":"https://arxiv.org/pdf/2411.19715v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19714v1","updated":"2024-11-29T14:02:00Z","published":"2024-11-29T14:02:00Z","title":"The Streetscape Application Services Stack (SASS): Towards a Distributed\n Sensing Architecture for Urban Applications","summary":" As urban populations grow, cities are becoming more complex, driving the\ndeployment of interconnected sensing systems to realize the vision of smart\ncities. These systems aim to improve safety, mobility, and quality of life\nthrough applications that integrate diverse sensors with real-time\ndecision-making. Streetscape applications-focusing on challenges like\npedestrian safety and adaptive traffic management-depend on managing\ndistributed, heterogeneous sensor data, aligning information across time and\nspace, and enabling real-time processing. These tasks are inherently complex\nand often difficult to scale. The Streetscape Application Services Stack (SASS)\naddresses these challenges with three core services: multimodal data\nsynchronization, spatiotemporal data fusion, and distributed edge computing. By\nstructuring these capabilities as clear, composable abstractions with clear\nsemantics, SASS allows developers to scale streetscape applications efficiently\nwhile minimizing the complexity of multimodal integration.\n We evaluated SASS in two real-world testbed environments: a controlled\nparking lot and an urban intersection in a major U.S. city. These testbeds\nallowed us to test SASS under diverse conditions, demonstrating its practical\napplicability. The Multimodal Data Synchronization service reduced temporal\nmisalignment errors by 88%, achieving synchronization accuracy within 50\nmilliseconds. 
Spatiotemporal Data Fusion service improved detection accuracy\nfor pedestrians and vehicles by over 10%, leveraging multicamera integration.\nThe Distributed Edge Computing service increased system throughput by more than\nan order of magnitude. Together, these results show how SASS provides the\nabstractions and performance needed to support real-time, scalable urban\napplications, bridging the gap between sensing infrastructure and actionable\nstreetscape intelligence.\n","authors":["Navid Salami Pargoo","Mahshid Ghasemi","Shuren Xia","Mehmet Kerem Turkcan","Taqiya Ehsan","Chengbo Zang","Yuan Sun","Javad Ghaderi","Gil Zussman","Zoran Kostic","Jorge Ortiz"],"pdf_url":"https://arxiv.org/pdf/2411.19714v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19710v1","updated":"2024-11-29T13:57:07Z","published":"2024-11-29T13:57:07Z","title":"Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating\n RAG Systems","summary":" Retrieval Augmented Generation (RAG) systems are a widespread application of\nLarge Language Models (LLMs) in the industry. While many tools exist empowering\ndevelopers to build their own systems, measuring their performance locally,\nwith datasets reflective of the system's use cases, is a technological\nchallenge. Solutions to this problem range from non-specific and cheap (most\npublic datasets) to specific and costly (generating data from local documents).\nIn this paper, we show that using public question and answer (Q&A) datasets to\nassess retrieval performance can lead to non-optimal systems design, and that\ncommon tools for RAG dataset generation can lead to unbalanced data. We propose\nsolutions to these issues based on the characterization of RAG datasets through\nlabels and through label-targeted data generation. Finally, we show that\nfine-tuned small LLMs can efficiently generate Q&A datasets. 
We believe that\nthese observations are invaluable to the know-your-data step of RAG systems\ndevelopment.\n","authors":["Rafael Teixeira de Lima","Shubham Gupta","Cesar Berrospi","Lokesh Mishra","Michele Dolfi","Peter Staar","Panagiotis Vagenas"],"pdf_url":"https://arxiv.org/pdf/2411.19710v1.pdf","comment":"to be published in the 31st International Conference on Computational\n Linguistics (COLING 2025)"},{"id":"http://arxiv.org/abs/2411.19702v1","updated":"2024-11-29T13:46:15Z","published":"2024-11-29T13:46:15Z","title":"Fast Mutual Information Computation for Large Binary Datasets","summary":" Mutual Information (MI) is a powerful statistical measure that quantifies\nshared information between random variables, particularly valuable in\nhigh-dimensional data analysis across fields like genomics, natural language\nprocessing, and network science. However, computing MI becomes computationally\nprohibitive for large datasets where it is typically required a pairwise\ncomputational approach where each column is compared to others. This work\nintroduces a matrix-based algorithm that accelerates MI computation by\nleveraging vectorized operations and optimized matrix calculations. By\ntransforming traditional pairwise computational approaches into bulk matrix\noperations, the proposed method enables efficient MI calculation across all\nvariable pairs. Experimental results demonstrate significant performance\nimprovements, with computation times reduced up to 50,000 times in the largest\ndataset using optimized implementations, particularly when utilizing hardware\noptimized frameworks. The approach promises to expand MI's applicability in\ndata-driven research by overcoming previous computational limitations.\n","authors":["Andre O. 
Falcao"],"pdf_url":"https://arxiv.org/pdf/2411.19702v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.18574v2","updated":"2024-11-29T13:44:15Z","published":"2024-09-27T09:18:57Z","title":"Climate Adaptation with Reinforcement Learning: Experiments with\n Flooding and Transportation in Copenhagen","summary":" Due to climate change the frequency and intensity of extreme rainfall events,\nwhich contribute to urban flooding, are expected to increase in many places.\nThese floods can damage transport infrastructure and disrupt mobility,\nhighlighting the need for cities to adapt to escalating risks. Reinforcement\nlearning (RL) serves as a powerful tool for uncovering optimal adaptation\nstrategies, determining how and where to deploy adaptation measures\neffectively, even under significant uncertainty. In this study, we leverage RL\nto identify the most effective timing and locations for implementing measures,\naiming to reduce both direct and indirect impacts of flooding. Our framework\nintegrates climate change projections of future rainfall events and floods,\nmodels city-wide motorized trips, and quantifies direct and indirect impacts on\ninfrastructure and mobility. Preliminary results suggest that our RL-based\napproach can significantly enhance decision-making by prioritizing\ninterventions in specific urban areas and identifying the optimal periods for\ntheir implementation. Our framework is publicly available:\n\\url{https://github.com/MLSM-at-DTU/floods_transport_rl}.\n","authors":["Miguel Costa","Morten W. Petersen","Arthur Vandervoort","Martin Drews","Karyn Morrissey","Francisco C. 
Pereira"],"pdf_url":"https://arxiv.org/pdf/2409.18574v2.pdf","comment":"Accepted for presentation at Tackling Climate Change with Machine\n Learning workshop at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2411.19700v1","updated":"2024-11-29T13:42:10Z","published":"2024-11-29T13:42:10Z","title":"Explaining the Impact of Training on Vision Models via Activation\n Clustering","summary":" Recent developments in the field of explainable artificial intelligence (XAI)\nfor vision models investigate the information extracted by their feature\nencoder. We contribute to this effort and propose Neuro-Activated Vision\nExplanations (NAVE), which extracts the information captured by the encoder by\nclustering the feature activations of the frozen network to be explained. The\nmethod does not aim to explain the model's prediction but to answer questions\nsuch as which parts of the image are processed similarly or which information\nis kept in deeper layers. Experimentally, we leverage NAVE to show that the\ntraining dataset and the level of supervision affect which concepts are\ncaptured. In addition, our method reveals the impact of registers on vision\ntransformers (ViT) and the information saturation caused by the watermark\nClever Hans effect in the training set.\n","authors":["Ahcène Boubekki","Samuel G. Fadel","Sebastian Mair"],"pdf_url":"https://arxiv.org/pdf/2411.19700v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.09294v2","updated":"2024-11-29T13:39:02Z","published":"2024-06-13T16:30:03Z","title":"You Don't Need Domain-Specific Data Augmentations When Scaling\n Self-Supervised Learning","summary":" Self-Supervised learning (SSL) with Joint-Embedding Architectures (JEA) has\nled to outstanding performances. All instantiations of this paradigm were\ntrained using strong and well-established hand-crafted data augmentations,\nleading to the general belief that they are required for the proper training\nand performance of such models. 
On the other hand, generative\nreconstruction-based models such as BEIT and MAE or Joint-Embedding Predictive\nArchitectures such as I-JEPA have shown strong performance without using data\naugmentations except masking. In this work, we challenge the importance of\ninvariance and data-augmentation in JEAs at scale. By running a case-study on a\nrecent SSL foundation model - DINOv2 - we show that strong image\nrepresentations can be obtained with JEAs and only cropping without resizing\nprovided the training data is large enough, reaching state-of-the-art results\nand using the least amount of augmentation in the literature. Through this\nstudy, we also discuss the impact of compute constraints on the outcomes of\nexperimental deep learning research, showing that they can lead to very\ndifferent conclusions.\n","authors":["Théo Moutakanni","Maxime Oquab","Marc Szafraniec","Maria Vakalopoulou","Piotr Bojanowski"],"pdf_url":"https://arxiv.org/pdf/2406.09294v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.09079v3","updated":"2024-11-29T13:24:58Z","published":"2024-06-13T13:03:37Z","title":"Hadamard Representations: Augmenting Hyperbolic Tangents in RL","summary":" Activation functions are one of the key components of a deep neural network.\nThe most commonly used activation functions can be classed into the category of\ncontinuously differentiable (e.g. tanh) and piece-wise linear functions (e.g.\nReLU), both having their own strengths and drawbacks with respect to downstream\nperformance and representation capacity through learning (e.g. measured by the\nnumber of dead neurons and the effective rank). In reinforcement learning, the\nperformance of continuously differentiable activations often falls short as\ncompared to piece-wise linear functions. We provide insights into the vanishing\ngradients associated with the former, and show that the dying neuron problem is\nnot exclusive to ReLU's. 
To alleviate vanishing gradients and the resulting\ndying neuron problem occurring with continuously differentiable activations, we\npropose a Hadamard representation. Using deep Q-networks, proximal policy\noptimization and parallelized Q-networks in the Atari domain, we show faster\nlearning, a reduction in dead neurons and increased effective rank.\n","authors":["Jacob E. Kooi","Mark Hoogendoorn","Vincent François-Lavet"],"pdf_url":"https://arxiv.org/pdf/2406.09079v3.pdf","comment":"34 pages, 28 figures"},{"id":"http://arxiv.org/abs/2411.19690v1","updated":"2024-11-29T13:24:14Z","published":"2024-11-29T13:24:14Z","title":"Gated-Attention Feature-Fusion Based Framework for Poverty Prediction","summary":" This research paper addresses the significant challenge of accurately\nestimating poverty levels using deep learning, particularly in developing\nregions where traditional methods like household surveys are often costly,\ninfrequent, and quickly become outdated. To address these issues, we propose a\nstate-of-the-art Convolutional Neural Network (CNN) architecture, extending the\nResNet50 model by incorporating a Gated-Attention Feature-Fusion Module (GAFM).\nOur architecture is designed to improve the model's ability to capture and\ncombine both global and local features from satellite images, leading to more\naccurate poverty estimates. The model achieves a 75% R2 score, significantly\noutperforming existing leading methods in poverty mapping. 
This improvement is\ndue to the model's capacity to focus on and refine the most relevant features,\nfiltering out unnecessary data, which makes it a powerful tool for remote\nsensing and poverty estimation.\n","authors":["Muhammad Umer Ramzan","Wahab Khaddim","Muhammad Ehsan Rana","Usman Ali","Manohar Ali","Fiaz ul Hassan","Fatima Mehmood"],"pdf_url":"https://arxiv.org/pdf/2411.19690v1.pdf","comment":"The paper has accepted for publication at 5th International\n Conference on Data Engineering and Communication Technology (ICDECT)"},{"id":"http://arxiv.org/abs/2411.19688v1","updated":"2024-11-29T13:22:52Z","published":"2024-11-29T13:22:52Z","title":"SURE-VQA: Systematic Understanding of Robustness Evaluation in Medical\n VQA Tasks","summary":" Vision-Language Models (VLMs) have great potential in medical tasks, like\nVisual Question Answering (VQA), where they could act as interactive assistants\nfor both patients and clinicians. Yet their robustness to distribution shifts\non unseen data remains a critical concern for safe deployment. Evaluating such\nrobustness requires a controlled experimental setup that allows for systematic\ninsights into the model's behavior. However, we demonstrate that current setups\nfail to offer sufficiently thorough evaluations, limiting their ability to\naccurately assess model robustness. 
To address this gap, our work introduces a\nnovel framework, called SURE-VQA, centered around three key requirements to\novercome the current pitfalls and systematically analyze the robustness of\nVLMs: 1) Since robustness on synthetic shifts does not necessarily translate to\nreal-world shifts, robustness should be measured on real-world shifts that are\ninherent to the VQA data; 2) Traditional token-matching metrics often fail to\ncapture underlying semantics, necessitating the use of large language models\n(LLMs) for more accurate semantic evaluation; 3) Model performance often lacks\ninterpretability due to missing sanity baselines, thus meaningful baselines\nshould be reported that allow assessing the multimodal impact on the VLM. To\ndemonstrate the relevance of this framework, we conduct a study on the\nrobustness of various fine-tuning methods across three medical datasets with\nfour different types of distribution shifts. Our study reveals several\nimportant findings: 1) Sanity baselines that do not utilize image data can\nperform surprisingly well; 2) We confirm LoRA as the best-performing PEFT\nmethod; 3) No PEFT method consistently outperforms others in terms of\nrobustness to shifts. Code is provided at https://github.com/IML-DKFZ/sure-vqa.\n","authors":["Kim-Celine Kahl","Selen Erkan","Jeremias Traub","Carsten T. Lüth","Klaus Maier-Hein","Lena Maier-Hein","Paul F. Jaeger"],"pdf_url":"https://arxiv.org/pdf/2411.19688v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.07832v2","updated":"2024-11-29T13:13:09Z","published":"2023-04-16T16:56:52Z","title":"An Interpretable Approach to Load Profile Forecasting in Power Grids\n using Galerkin-Approximated Koopman Pseudospectra","summary":" This paper presents an interpretable machine learning approach that\ncharacterizes load dynamics within an operator-theoretic framework for\nelectricity load forecasting in power grids. 
We represent the dynamics of load\ndata using the Koopman operator, which provides a linear, infinite-dimensional\nrepresentation of the nonlinear dynamics, and approximate a finite version that\nremains robust against spectral pollutions due to truncation. By computing\n$\\epsilon$-approximate Koopman eigenfunctions using dynamics-adapted kernels in\ndelay coordinates, we decompose the load dynamics into coherent spatiotemporal\npatterns that evolve quasi-independently. Our approach captures temporal\ncoherent patterns due to seasonal changes and finer time scales, such as time\nof day and day of the week. This method allows for a more nuanced understanding\nof the complex interactions within power grids and their response to various\nexogenous factors. We assess our method using a large-scale dataset from a\nrenewable power system in the continental European electricity system. The\nresults indicate that our Koopman-based method surpasses a separately optimized\ndeep learning (LSTM) architecture in both accuracy and computational\nefficiency, while providing deeper insights into the underlying dynamics of the\npower grid\\footnote{The code is available at\n\\href{https://github.com/Shakeri-Lab/Power-Grids}{github.com/Shakeri-Lab/Power-Grids}.\n","authors":["Ali Tavasoli","Behnaz Moradijamei","Heman Shakeri"],"pdf_url":"https://arxiv.org/pdf/2304.07832v2.pdf","comment":"34 pages, 17 figures"},{"id":"http://arxiv.org/abs/2411.19678v1","updated":"2024-11-29T13:12:11Z","published":"2024-11-29T13:12:11Z","title":"Privacy-Preserving Orthogonal Aggregation for Guaranteeing Gender\n Fairness in Federated Recommendation","summary":" Under stringent privacy constraints, whether federated recommendation systems\ncan achieve group fairness remains an inadequately explored question. Taking\ngender fairness as a representative issue, we identify three phenomena in\nfederated recommendation systems: performance difference, data imbalance, and\npreference disparity. 
We discover that the state-of-the-art methods only focus\non the first phenomenon. Consequently, their imposition of inappropriate\nfairness constraints detrimentally affects the model training. Moreover, due to\ninsufficient sensitive attribute protection of existing works, we can infer the\ngender of all users with 99.90% accuracy even with the addition of maximal\nnoise. In this work, we propose Privacy-Preserving Orthogonal Aggregation\n(PPOA), which employs the secure aggregation scheme and quantization technique,\nto prevent the suppression of minority groups by the majority and preserve the\ndistinct preferences for better group fairness. PPOA can assist different\ngroups in obtaining their respective model aggregation results through a\ndesigned orthogonal mapping while keeping their attributes private.\nExperimental results on three real-world datasets demonstrate that PPOA\nenhances recommendation effectiveness for both females and males by up to 8.25%\nand 6.36%, respectively, with a maximum overall improvement of 7.30%, and\nachieves optimal fairness in most cases. Extensive ablation experiments and\nvisualizations indicate that PPOA successfully maintains preferences for\ndifferent gender groups.\n","authors":["Siqing Zhang","Yuchen Ding","Wei Tang","Wei Sun","Yong Liao","Peng Yuan Zhou"],"pdf_url":"https://arxiv.org/pdf/2411.19678v1.pdf","comment":"accepted by WSDM 2025"},{"id":"http://arxiv.org/abs/2404.16196v3","updated":"2024-11-29T13:04:35Z","published":"2024-04-24T20:35:17Z","title":"ApisTox: a new benchmark dataset for the classification of small\n molecules toxicity on honey bees","summary":" The global decline in bee populations poses significant risks to agriculture,\nbiodiversity, and environmental stability. To bridge the gap in existing data,\nwe introduce ApisTox, a comprehensive dataset focusing on the toxicity of\npesticides to honey bees (Apis mellifera). 
This dataset combines and leverages\ndata from existing sources such as ECOTOX and PPDB, providing an extensive,\nconsistent, and curated collection that surpasses the previous datasets.\nApisTox incorporates a wide array of data, including toxicity levels for\nchemicals, details such as time of their publication in literature, and\nidentifiers linking them to external chemical databases. This dataset may serve\nas an important tool for environmental and agricultural research, but also can\nsupport the development of policies and practices aimed at minimizing harm to\nbee populations. Finally, ApisTox offers a unique resource for benchmarking\nmolecular property prediction methods on agrochemical compounds, facilitating\nadvancements in both environmental science and cheminformatics. This makes it a\nvaluable tool for both academic research and practical applications in bee\nconservation.\n","authors":["Jakub Adamczyk","Jakub Poziemski","Pawel Siedlecki"],"pdf_url":"https://arxiv.org/pdf/2404.16196v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19671v1","updated":"2024-11-29T12:56:43Z","published":"2024-11-29T12:56:43Z","title":"On the Performance Analysis of Momentum Method: A Frequency Domain\n Perspective","summary":" Momentum-based optimizers are widely adopted for training neural networks.\nHowever, the optimal selection of momentum coefficients remains elusive. This\nuncertainty impedes a clear understanding of the role of momentum in stochastic\ngradient methods. In this paper, we present a frequency domain analysis\nframework that interprets the momentum method as a time-variant filter for\ngradients, where adjustments to momentum coefficients modify the filter\ncharacteristics. Our experiments support this perspective and provide a deeper\nunderstanding of the mechanism involved. 
Moreover, our analysis reveals the\nfollowing significant findings: high-frequency gradient components are\nundesired in the late stages of training; preserving the original gradient in\nthe early stages, and gradually amplifying low-frequency gradient components\nduring training both enhance generalization performance. Based on these\ninsights, we propose Frequency Stochastic Gradient Descent with Momentum\n(FSGDM), a heuristic optimizer that dynamically adjusts the momentum filtering\ncharacteristic with an empirically effective dynamic magnitude response.\nExperimental results demonstrate the superiority of FSGDM over conventional\nmomentum optimizers.\n","authors":["Xianliang Li","Jun Luo","Zhiwei Zheng","Hanxiao Wang","Li Luo","Lingkun Wen","Linlong Wu","Sheng Xu"],"pdf_url":"https://arxiv.org/pdf/2411.19671v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.17071v2","updated":"2024-11-29T12:52:27Z","published":"2024-11-26T03:14:38Z","title":"Fast, Precise Thompson Sampling for Bayesian Optimization","summary":" Thompson sampling (TS) has optimal regret and excellent empirical performance\nin multi-armed bandit problems. Yet, in Bayesian optimization, TS underperforms\npopular acquisition functions (e.g., EI, UCB). TS samples arms according to the\nprobability that they are optimal. A recent algorithm, P-Star Sampler (PSS),\nperforms such a sampling via Hit-and-Run. We present an improved version,\nStagger Thompson Sampler (STS). STS more precisely locates the maximizer than\ndoes TS using less computation time. We demonstrate that STS outperforms TS,\nPSS, and other acquisition methods in numerical experiments of optimizations of\nseveral test functions across a broad range of dimension. 
Additionally, since\nPSS was originally presented not as a standalone acquisition method but as an\ninput to a batching algorithm called Minimal Terminal Variance (MTV), we also\ndemonstrate that STS matches PSS performance when used as the input to MTV.\n","authors":["David Sweet"],"pdf_url":"https://arxiv.org/pdf/2411.17071v2.pdf","comment":"NeurIPS 2024 Workshop on Bayesian Decision-making and Uncertainty;\n Poster"},{"id":"http://arxiv.org/abs/2411.12570v2","updated":"2024-11-29T12:50:13Z","published":"2024-11-19T15:39:25Z","title":"A data driven approach to classify descriptors based on their efficiency\n in translating noisy trajectories into physically-relevant information","summary":" Reconstructing the physical complexity of many-body dynamical systems can be\nchallenging. Starting from the trajectories of their constitutive units (raw\ndata), typical approaches require selecting appropriate descriptors to convert\nthem into time-series, which are then analyzed to extract interpretable\ninformation. However, identifying the most effective descriptor is often\nnon-trivial. Here, we report a data-driven approach to compare the efficiency\nof various descriptors in extracting information from noisy trajectories and\ntranslating it into physically relevant insights. As a prototypical system with\nnon-trivial internal complexity, we analyze molecular dynamics trajectories of\nan atomistic system where ice and water coexist in equilibrium near the\nsolid/liquid transition temperature. We compare general and specific\ndescriptors often used in aqueous systems: number of neighbors, molecular\nvelocities, Smooth Overlap of Atomic Positions (SOAP), Local Environments and\nNeighbors Shuffling (LENS), Orientational Tetrahedral Order, and distance from\nthe fifth neighbor ($d_5$). 
Using Onion Clustering -- an efficient unsupervised\nmethod for single-point time-series analysis -- we assess the maximum\nextractable information for each descriptor and rank them via a\nhigh-dimensional metric. Our results show that advanced descriptors like SOAP\nand LENS outperform classical ones due to higher signal-to-noise ratios.\nNonetheless, even simple descriptors can rival or exceed advanced ones after\nlocal signal denoising. For example, $d_5$, initially among the weakest,\nbecomes the most effective at resolving the system's non-local dynamical\ncomplexity after denoising. This work highlights the critical role of noise in\ninformation extraction from molecular trajectories and offers a data-driven\napproach to identify optimal descriptors for systems with characteristic\ninternal complexity.\n","authors":["Simone Martino","Domiziano Doria","Chiara Lionello","Matteo Becchi","Giovanni M. Pavan"],"pdf_url":"https://arxiv.org/pdf/2411.12570v2.pdf","comment":"19 pages, 5 figures + 3 in supporting information (at the bottom of\n the manuscript)"},{"id":"http://arxiv.org/abs/2211.10502v3","updated":"2024-11-29T12:48:55Z","published":"2022-11-18T20:33:08Z","title":"A Mathematical Programming Approach to Optimal Classification Forests","summary":" This paper introduces Weighted Optimal Classification Forests (WOCFs), a new\nfamily of classifiers that takes advantage of an optimal ensemble of decision\ntrees to derive accurate and interpretable classifiers. We propose a novel\nmathematical optimization-based methodology which simultaneously constructs a\ngiven number of trees, each of them providing a predicted class for the\nobservations in the feature space. The classification rule is derived by\nassigning to each observation its most frequently predicted class among the\ntrees. We provide a mixed integer linear programming formulation (MIP) for the\nproblem and several novel MIP strengthening / scaling techniques. 
We report the\nresults of our computational experiments, from which we conclude that our\nmethod has equal or superior performance compared with state-of-the-art\ntree-based classification methods for small to medium-sized instances. We also\npresent three real-world case studies showing that our methodology has very\ninteresting implications in terms of interpretability. Overall, WOCFs\ncomplement existing methods such as CART, Optimal Classification Trees, Random\nForests and XGBoost. In addition to its Pareto improvement on accuracy and\ninterpretability, we also see unique properties emerging in terms of different\ntrees focusing on different feature variables. This provides nontrivial\nimprovement in interpretability and usability of the trained model in terms of\ncounterfactual explanation. Thus, despite the apparent computational challenge\nof WOCFs that limit the size of the problems that can be efficiently solved\nwith current MIP, this is an important research direction that can lead to\nqualitatively different insights for researchers and complement the toolbox of\npractitioners for high stakes problems.\n","authors":["Víctor Blanco","Alberto Japón","Justo Puerto","Peter Zhang"],"pdf_url":"https://arxiv.org/pdf/2211.10502v3.pdf","comment":"30 pages, 9 figures, 2 table"},{"id":"http://arxiv.org/abs/2310.09278v2","updated":"2024-11-29T12:47:11Z","published":"2023-10-13T17:40:39Z","title":"Disentangled Latent Spaces Facilitate Data-Driven Auxiliary Learning","summary":" Auxiliary tasks facilitate learning in situations when data is scarce or the\nprincipal task of focus is extremely complex. This idea is primarily inspired\nby the improved generalization capability induced by solving multiple tasks\nsimultaneously, which leads to a more robust shared representation.\nNevertheless, finding optimal auxiliary tasks is a crucial problem that often\nrequires hand-crafted solutions or expensive meta-learning approaches. 
In this\npaper, we propose a novel framework, dubbed Detaux, whereby a weakly supervised\ndisentanglement procedure is used to discover a new unrelated auxiliary\nclassification task, which allows us to go from a Single-Task Learning (STL) to\na Multi-Task Learning (MTL) problem. The disentanglement procedure works at the\nrepresentation level, isolating the variation related to the principal task\ninto an isolated subspace and additionally producing an arbitrary number of\northogonal subspaces, each one of them encouraging high separability among the\nprojections. We generate the auxiliary classification task through a clustering\nprocedure on the most disentangled subspace, obtaining a discrete set of\nlabels. Subsequently, the original data, the labels associated with the\nprincipal task, and the newly discovered ones can be fed into any MTL\nframework. Experimental validation on both synthetic and real data, along with\nvarious ablation studies, demonstrate promising results, revealing the\npotential in what has been, so far, an unexplored connection between learning\ndisentangled representations and MTL. The source code will be made available\nupon acceptance.\n","authors":["Geri Skenderi","Luigi Capogrosso","Andrea Toaiari","Matteo Denitto","Franco Fummi","Simone Melzi","Marco Cristani"],"pdf_url":"https://arxiv.org/pdf/2310.09278v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19666v1","updated":"2024-11-29T12:39:57Z","published":"2024-11-29T12:39:57Z","title":"Multimodal Whole Slide Foundation Model for Pathology","summary":" The field of computational pathology has been transformed with recent\nadvances in foundation models that encode histopathology region-of-interests\n(ROIs) into versatile and transferable feature representations via\nself-supervised learning (SSL). 
However, translating these advancements to\naddress complex clinical challenges at the patient and slide level remains\nconstrained by limited clinical data in disease-specific cohorts, especially\nfor rare clinical conditions. We propose TITAN, a multimodal whole slide\nfoundation model pretrained using 335,645 WSIs via visual self-supervised\nlearning and vision-language alignment with corresponding pathology reports and\n423,122 synthetic captions generated from a multimodal generative AI copilot\nfor pathology. Without any finetuning or requiring clinical labels, TITAN can\nextract general-purpose slide representations and generate pathology reports\nthat generalize to resource-limited clinical scenarios such as rare disease\nretrieval and cancer prognosis. We evaluate TITAN on diverse clinical tasks and\nfind that TITAN outperforms both ROI and slide foundation models across machine\nlearning settings such as linear probing, few-shot and zero-shot\nclassification, rare cancer retrieval and cross-modal retrieval, and pathology\nreport generation.\n","authors":["Tong Ding","Sophia J. Wagner","Andrew H. Song","Richard J. Chen","Ming Y. Lu","Andrew Zhang","Anurag J. Vaidya","Guillaume Jaume","Muhammad Shaban","Ahrong Kim","Drew F. K. Williamson","Bowen Chen","Cristina Almagro-Perez","Paul Doucet","Sharifa Sahai","Chengkuan Chen","Daisuke Komura","Akihiro Kawabe","Shumpei Ishikawa","Georg Gerber","Tingying Peng","Long Phi Le","Faisal Mahmood"],"pdf_url":"https://arxiv.org/pdf/2411.19666v1.pdf","comment":"The code is accessible at https://github.com/mahmoodlab/TITAN"},{"id":"http://arxiv.org/abs/2411.19653v1","updated":"2024-11-29T12:18:01Z","published":"2024-11-29T12:18:01Z","title":"Nonparametric Instrumental Regression via Kernel Methods is Minimax\n Optimal","summary":" We study the kernel instrumental variable algorithm of\n\\citet{singh2019kernel}, a nonparametric two-stage least squares (2SLS)\nprocedure which has demonstrated strong empirical performance. 
We provide a\nconvergence analysis that covers both the identified and unidentified settings:\nwhen the structural function cannot be identified, we show that the kernel NPIV\nestimator converges to the IV solution with minimum norm. Crucially, our\nconvergence is with respect to the strong $L_2$-norm, rather than a\npseudo-norm. Additionally, we characterize the smoothness of the target\nfunction without relying on the instrument, instead leveraging a new\ndescription of the projected subspace size (this being closely related to the\nlink condition in inverse learning literature). With the subspace size\ndescription and under standard kernel learning assumptions, we derive, for the\nfirst time, the minimax optimal learning rate for kernel NPIV in the strong\n$L_2$-norm. Our result demonstrates that the strength of the instrument is\nessential to achieve efficient learning. We also improve the original kernel\nNPIV algorithm by adopting a general spectral regularization in stage 1\nregression. The modified regularization can overcome the saturation effect of\nTikhonov regularization.\n","authors":["Dimitri Meunier","Zhu Li","Tim Christensen","Arthur Gretton"],"pdf_url":"https://arxiv.org/pdf/2411.19653v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19652v1","updated":"2024-11-29T12:11:28Z","published":"2024-11-29T12:11:28Z","title":"Uniform Attention Maps: Boosting Image Fidelity in Reconstruction and\n Editing","summary":" Text-guided image generation and editing using diffusion models have achieved\nremarkable advancements. Among these, tuning-free methods have gained attention\nfor their ability to perform edits without extensive model adjustments,\noffering simplicity and efficiency. However, existing tuning-free approaches\noften struggle with balancing fidelity and editing precision. 
Reconstruction\nerrors in DDIM Inversion are partly attributed to the cross-attention mechanism\nin U-Net, which introduces misalignments during the inversion and\nreconstruction process. To address this, we analyze reconstruction from a\nstructural perspective and propose a novel approach that replaces traditional\ncross-attention with uniform attention maps, significantly enhancing image\nreconstruction fidelity. Our method effectively minimizes distortions caused by\nvarying text conditions during noise prediction. To complement this\nimprovement, we introduce an adaptive mask-guided editing technique that\nintegrates seamlessly with our reconstruction approach, ensuring consistency\nand accuracy in editing tasks. Experimental results demonstrate that our\napproach not only excels in achieving high-fidelity image reconstruction but\nalso performs robustly in real image composition and editing scenarios. This\nstudy underscores the potential of uniform attention maps to enhance the\nfidelity and versatility of diffusion-based image processing methods. Code is\navailable at https://github.com/Mowenyii/Uniform-Attention-Maps.\n","authors":["Wenyi Mo","Tianyu Zhang","Yalong Bai","Bing Su","Ji-Rong Wen"],"pdf_url":"https://arxiv.org/pdf/2411.19652v1.pdf","comment":"Accepted to WACV 2025"},{"id":"http://arxiv.org/abs/2411.19650v1","updated":"2024-11-29T12:06:03Z","published":"2024-11-29T12:06:03Z","title":"CogACT: A Foundational Vision-Language-Action Model for Synergizing\n Cognition and Action in Robotic Manipulation","summary":" The advancement of large Vision-Language-Action (VLA) models has\nsignificantly improved robotic manipulation in terms of language-guided task\nexecution and generalization to unseen scenarios. While existing VLAs adapted\nfrom pretrained large Vision-Language-Models (VLM) have demonstrated promising\ngeneralizability, their task performance is still unsatisfactory as indicated\nby the low tasks success rates in different environments. 
In this paper, we\npresent a new advanced VLA architecture derived from VLM. Unlike previous works\nthat directly repurpose VLM for action prediction by simple action\nquantization, we propose a componentized VLA architecture that has a specialized\naction module conditioned on VLM output. We systematically study the design of\nthe action module and demonstrate the strong performance enhancement with\ndiffusion action transformers for action sequence modeling, as well as their\nfavorable scaling behaviors. We also conduct comprehensive experiments and\nablation studies to evaluate the efficacy of our models with varied designs.\nThe evaluation on 5 robot embodiments in simulation and real world shows that\nour model not only significantly surpasses existing VLAs in task performance\nbut also exhibits remarkable adaptation to new robots and generalization to\nunseen objects and backgrounds. It exceeds the average success rates of OpenVLA\nwhich has similar model size (7B) with ours by over 35% in simulated evaluation\nand 55% in real robot experiments. It also outperforms the large RT-2-X model\n(55B) by 18% absolute success rates in simulation. Code and models can be found\non our project page (https://cogact.github.io/).\n","authors":["Qixiu Li","Yaobo Liang","Zeyu Wang","Lin Luo","Xi Chen","Mozheng Liao","Fangyun Wei","Yu Deng","Sicheng Xu","Yizhong Zhang","Xiaofan Wang","Bei Liu","Jianlong Fu","Jianmin Bao","Dong Chen","Yuanchun Shi","Jiaolong Yang","Baining Guo"],"pdf_url":"https://arxiv.org/pdf/2411.19650v1.pdf","comment":"Project Webpage: https://cogact.github.io/"},{"id":"http://arxiv.org/abs/2411.19647v1","updated":"2024-11-29T12:00:27Z","published":"2024-11-29T12:00:27Z","title":"CAdam: Confidence-Based Optimization for Online Learning","summary":" Modern recommendation systems frequently employ online learning to\ndynamically update their models with freshly collected data. 
The most commonly\nused optimizer for updating neural networks in these contexts is the Adam\noptimizer, which integrates momentum ($m_t$) and adaptive learning rate\n($v_t$). However, the volatile nature of online learning data, characterized by\nits frequent distribution shifts and presence of noises, poses significant\nchallenges to Adam's standard optimization process: (1) Adam may use outdated\nmomentum and the average of squared gradients, resulting in slower adaptation\nto distribution changes, and (2) Adam's performance is adversely affected by\ndata noise. To mitigate these issues, we introduce CAdam, a confidence-based\noptimization strategy that assesses the consistence between the momentum and\nthe gradient for each parameter dimension before deciding on updates. If\nmomentum and gradient are in sync, CAdam proceeds with parameter updates\naccording to Adam's original formulation; if not, it temporarily withholds\nupdates and monitors potential shifts in data distribution in subsequent\niterations. This method allows CAdam to distinguish between the true\ndistributional shifts and mere noise, and adapt more quickly to new data\ndistributions. Our experiments with both synthetic and real-world datasets\ndemonstrate that CAdam surpasses other well-known optimizers, including the\noriginal Adam, in efficiency and noise robustness. 
Furthermore, in large-scale\nA/B testing within a live recommendation system, CAdam significantly enhances\nmodel performance compared to Adam, leading to substantial increases in the\nsystem's gross merchandise volume (GMV).\n","authors":["Shaowen Wang","Anan Liu","Jian Xiao","Huan Liu","Yuekui Yang","Cong Xu","Qianqian Pu","Suncong Zheng","Wei Zhang","Jian Li"],"pdf_url":"https://arxiv.org/pdf/2411.19647v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19640v1","updated":"2024-11-29T11:52:59Z","published":"2024-11-29T11:52:59Z","title":"Learned Random Label Predictions as a Neural Network Complexity Metric","summary":" We empirically investigate the impact of learning randomly generated labels\nin parallel to class labels in supervised learning on memorization, model\ncomplexity, and generalization in deep neural networks. To this end, we\nintroduce a multi-head network architecture as an extension of standard CNN\narchitectures. Inspired by methods used in fair AI, our approach allows for the\nunlearning of random labels, preventing the network from memorizing individual\nsamples. Based on the concept of Rademacher complexity, we first use our\nproposed method as a complexity metric to analyze the effects of common\nregularization techniques and challenge the traditional understanding of\nfeature extraction and classification in CNNs. Second, we propose a novel\nregularizer that effectively reduces sample memorization. 
However, contrary to\nthe predictions of classical statistical learning theory, we do not observe\nimprovements in generalization.\n","authors":["Marlon Becker","Benjamin Risse"],"pdf_url":"https://arxiv.org/pdf/2411.19640v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.16314v2","updated":"2024-11-29T11:52:31Z","published":"2024-10-09T10:09:37Z","title":"Steering Large Language Models using Conceptors: Improving\n Addition-Based Activation Engineering","summary":" Large language models have transformed AI, yet reliably controlling their\noutputs remains a challenge. This paper explores activation engineering, where\noutputs of pre-trained LLMs are controlled by manipulating their activations at\ninference time. Unlike traditional methods using a single steering vector, we\nintroduce conceptors - mathematical constructs that represent sets of\nactivation vectors as ellipsoidal regions. Conceptors act as soft projection\nmatrices and offer more precise control over complex activation patterns. Our\nexperiments demonstrate that conceptors outperform traditional methods across\nmultiple steering tasks. We further use Boolean operations on conceptors for\ncombined steering goals that empirically outperform additively combining\nsteering vectors on a set of tasks. These results highlight conceptors as a\npromising tool for more effective steering of LLMs. Our code is available on\ngithub.com/jorispos/conceptorsteering.\n","authors":["Joris Postmus","Steven Abreu"],"pdf_url":"https://arxiv.org/pdf/2410.16314v2.pdf","comment":"Presented at the MINT workshop at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2303.05263v3","updated":"2024-11-29T11:49:04Z","published":"2023-03-09T13:58:35Z","title":"Fast post-process Bayesian inference with Variational Sparse Bayesian\n Quadrature","summary":" In applied Bayesian inference scenarios, users may have access to a large\nnumber of pre-existing model evaluations, for example from maximum-a-posteriori\n(MAP) optimization runs. 
However, traditional approximate inference techniques\nmake little to no use of this available information. We propose the framework\nof post-process Bayesian inference as a means to obtain a quick posterior\napproximation from existing target density evaluations, with no further model\ncalls. Within this framework, we introduce Variational Sparse Bayesian\nQuadrature (VSBQ), a method for post-process approximate inference for models\nwith black-box and potentially noisy likelihoods. VSBQ reuses existing target\ndensity evaluations to build a sparse Gaussian process (GP) surrogate model of\nthe log posterior density function. Subsequently, we leverage sparse-GP\nBayesian quadrature combined with variational inference to achieve fast\napproximate posterior inference over the surrogate. We validate our method on\nchallenging synthetic scenarios and real-world applications from computational\nneuroscience. The experiments show that VSBQ builds high-quality posterior\napproximations by post-processing existing optimization traces, with no further\nmodel evaluations.\n","authors":["Chengkun Li","Grégoire Clarté","Martin Jørgensen","Luigi Acerbi"],"pdf_url":"https://arxiv.org/pdf/2303.05263v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19632v1","updated":"2024-11-29T11:31:11Z","published":"2024-11-29T11:31:11Z","title":"PACMANN: Point Adaptive Collocation Method for Artificial Neural\n Networks","summary":" Physics-Informed Neural Networks (PINNs) are an emerging tool for\napproximating the solution of Partial Differential Equations (PDEs) in both\nforward and inverse problems. PINNs minimize a loss function which includes the\nPDE residual determined for a set of collocation points. Previous work has\nshown that the number and distribution of these collocation points have a\nsignificant influence on the accuracy of the PINN solution. 
Therefore, the\neffective placement of these collocation points is an active area of research.\nSpecifically, adaptive collocation point sampling methods have been proposed,\nwhich have been reported to scale poorly to higher dimensions. In this work, we\naddress this issue and present the Point Adaptive Collocation Method for\nArtificial Neural Networks (PACMANN). Inspired by classic optimization\nproblems, this approach incrementally moves collocation points toward regions\nof higher residuals using gradient-based optimization algorithms guided by the\ngradient of the squared residual. We apply PACMANN for forward and inverse\nproblems, and demonstrate that this method matches the performance of\nstate-of-the-art methods in terms of the accuracy/efficiency tradeoff for the\nlow-dimensional problems, while outperforming available approaches for\nhigh-dimensional problems; the best performance is observed for the Adam\noptimizer. Key features of the method include its low computational cost and\nsimplicity of integration in existing physics-informed neural network\npipelines.\n","authors":["Coen Visser","Alexander Heinlein","Bianca Giovanardi"],"pdf_url":"https://arxiv.org/pdf/2411.19632v1.pdf","comment":"22 pages, 9 figures"},{"id":"http://arxiv.org/abs/2411.19631v1","updated":"2024-11-29T11:30:02Z","published":"2024-11-29T11:30:02Z","title":"Non-linear Equalization in 112 Gb/s PONs Using Kolmogorov-Arnold\n Networks","summary":" We investigate Kolmogorov-Arnold networks (KANs) for non-linear equalization\nof 112 Gb/s PAM4 passive optical networks (PONs). 
Using pruning and extensive\nhyperparameter search, we outperform linear equalizers and convolutional neural\nnetworks at low computational complexity.\n","authors":["Rodrigo Fischer","Patrick Matalla","Sebastian Randel","Laurent Schmalen"],"pdf_url":"https://arxiv.org/pdf/2411.19631v1.pdf","comment":"Submitted for possible publication at Optical Fiber Communication\n Conference (OFC) 2025"},{"id":"http://arxiv.org/abs/2411.19629v1","updated":"2024-11-29T11:25:51Z","published":"2024-11-29T11:25:51Z","title":"OpenQDC: Open Quantum Data Commons","summary":" Machine Learning Interatomic Potentials (MLIPs) are a highly promising\nalternative to force-fields for molecular dynamics (MD) simulations, offering\nprecise and rapid energy and force calculations. However, Quantum-Mechanical\n(QM) datasets, crucial for MLIPs, are fragmented across various repositories,\nhindering accessibility and model development. We introduce the openQDC\npackage, consolidating 37 QM datasets from over 250 quantum methods and 400\nmillion geometries into a single, accessible resource. These datasets are\nmeticulously preprocessed, and standardized for MLIP training, covering a wide\nrange of chemical elements and interactions relevant in organic chemistry.\nOpenQDC includes tools for normalization and integration, easily accessible via\nPython. Experiments with well-known architectures like SchNet, TorchMD-Net, and\nDimeNet reveal challenges for those architectures and constitute a leaderboard\nto accelerate benchmarking and guide novel algorithms development. 
Continuously\nadding datasets to OpenQDC will democratize QM dataset access, foster more\ncollaboration and innovation, enhance MLIP development, and support their\nadoption in the MD field.\n","authors":["Cristian Gabellini","Nikhil Shenoy","Stephan Thaler","Semih Canturk","Daniel McNeela","Dominique Beaini","Michael Bronstein","Prudencio Tossou"],"pdf_url":"https://arxiv.org/pdf/2411.19629v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19628v1","updated":"2024-11-29T11:24:23Z","published":"2024-11-29T11:24:23Z","title":"Accelerating Multimodal Large Language Models via Dynamic Visual-Token\n Exit and the Empirical Findings","summary":" The excessive use of visual tokens in existing Multimodal Large Language\nModels (MLLMs) often exhibits obvious redundancy and brings in prohibitively\nexpensive computation. To gain insights into this problem, we first conduct\nextensive empirical studies on the attention behaviors of MLLMs, and summarize\nthree main inference stages in MLLMs: (i) Early fusion between tokens is first\naccomplished quickly. (ii) Intra-modality modeling then comes into play. (iii)\nMultimodal reasoning resumes and lasts until the end of inference. In\nparticular, we reveal that visual tokens will stop contributing to reasoning\nwhen the text tokens receive enough image information, yielding obvious visual\nredundancy. Based on these generalized observations, we propose a simple yet\neffective method to improve the efficiency of MLLMs, termed dynamic\nvisual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive\nthe text token status and decide the removal of all visual tokens after a\ncertain layer, thereby addressing the observed visual redundancy. To validate\nVTE, we apply it to a set of MLLMs, including LLaVA, VILA, Eagle and InternVL,\nand conduct extensive experiments on a bunch of benchmarks. 
The experiment\nresults not only show the effectiveness of our VTE in improving MLLMs'\nefficiency, but also yield the general modeling patterns of MLLMs, well\nfacilitating the in-depth understanding of MLLMs. Our code is anonymously\nreleased at https://github.com/DoubtedSteam/DyVTE.\n","authors":["Qiong Wu","Wenhao Lin","Weihao Ye","Yiyi Zhou","Xiaoshuai Sun","Rongrong Ji"],"pdf_url":"https://arxiv.org/pdf/2411.19628v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19623v1","updated":"2024-11-29T11:22:20Z","published":"2024-11-29T11:22:20Z","title":"FairDD: Fair Dataset Distillation via Synchronized Matching","summary":" Condensing large datasets into smaller synthetic counterparts has\ndemonstrated its promise for image classification. However, previous research\nhas overlooked a crucial concern in image recognition: ensuring that models\ntrained on condensed datasets are unbiased towards protected attributes (PA),\nsuch as gender and race. Our investigation reveals that dataset distillation\n(DD) fails to alleviate the unfairness towards minority groups within original\ndatasets. Moreover, this bias typically worsens in the condensed datasets due\nto their smaller size. To bridge the research gap, we propose a novel fair\ndataset distillation (FDD) framework, namely FairDD, which can be seamlessly\napplied to diverse matching-based DD approaches, requiring no modifications to\ntheir original architectures. The key innovation of FairDD lies in\nsynchronously matching synthetic datasets to PA-wise groups of original\ndatasets, rather than indiscriminate alignment to the whole distributions in\nvanilla DDs, dominated by majority groups. This synchronized matching allows\nsynthetic datasets to avoid collapsing into majority groups and bootstrap their\nbalanced generation to all PA groups. Consequently, FairDD could effectively\nregularize vanilla DDs to favor biased generation toward minority groups while\nmaintaining the accuracy of target attributes. 
Theoretical analyses and\nextensive experimental evaluations demonstrate that FairDD significantly\nimproves fairness compared to vanilla DD methods, without sacrificing\nclassification accuracy. Its consistent superiority across diverse DDs,\nspanning Distribution and Gradient Matching, establishes it as a versatile FDD\napproach.\n","authors":["Qihang Zhou","Shenhao Fang","Shibo He","Wenchao Meng","Jiming Chen"],"pdf_url":"https://arxiv.org/pdf/2411.19623v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.13299v2","updated":"2024-11-29T11:21:29Z","published":"2024-10-17T07:55:47Z","title":"LLM-Rank: A Graph Theoretical Approach to Pruning Large Language Models","summary":" The evolving capabilities of large language models are accompanied by growing\nsizes and deployment costs, necessitating effective inference optimisation\ntechniques. We propose a novel pruning method utilising centrality measures\nfrom graph theory, reducing both the computational requirements and the memory\nfootprint of these models. Specifically, we devise a method for creating a\nweighted directed acyclical graph representation of multilayer perceptrons to\nwhich we apply a modified version of the weighted PageRank centrality measure\nto compute node importance scores. In combination with uniform pruning this\nleads to structured sparsity. We call this pruning method MLPRank. Furthermore\nwe introduce an extension to decoder-only transformer models and call it\nLLMRank. For both variants we demonstrate a strong performance. With MLPRank on\naverage leading to 6.09 % higher accuracy retention than three popular\nbaselines and 13.42 % with LLMRank compared to two popular baselines. 
Code is\navailable at https://github.com/amazon-science/llm-rank-pruning.\n","authors":["David Hoffmann","Kailash Budhathoki","Matthaeus Kleindessner"],"pdf_url":"https://arxiv.org/pdf/2410.13299v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19617v1","updated":"2024-11-29T11:10:29Z","published":"2024-11-29T11:10:29Z","title":"Materials Learning Algorithms (MALA): Scalable Machine Learning for\n Electronic Structure Calculations in Large-Scale Atomistic Simulations","summary":" We present the Materials Learning Algorithms (MALA) package, a scalable\nmachine learning framework designed to accelerate density functional theory\n(DFT) calculations suitable for large-scale atomistic simulations. Using local\ndescriptors of the atomic environment, MALA models efficiently predict key\nelectronic observables, including local density of states, electronic density,\ndensity of states, and total energy. The package integrates data sampling,\nmodel training and scalable inference into a unified library, while ensuring\ncompatibility with standard DFT and molecular dynamics codes. We demonstrate\nMALA's capabilities with examples including boron clusters, aluminum across its\nsolid-liquid phase boundary, and predicting the electronic structure of a\nstacking fault in a large beryllium slab. Scaling analyses reveal MALA's\ncomputational efficiency and identify bottlenecks for future optimization. With\nits ability to model electronic structures at scales far beyond standard DFT,\nMALA is well suited for modeling complex material systems, making it a\nversatile tool for advanced materials research.\n","authors":["Attila Cangi","Lenz Fiedler","Bartosz Brzoza","Karan Shah","Timothy J. Callow","Daniel Kotik","Steve Schmerler","Matthew C. Barry","James M. Goff","Andrew Rohskopf","Dayton J. Vogel","Normand Modine","Aidan P. 
Thompson","Sivasankaran Rajamanickam"],"pdf_url":"https://arxiv.org/pdf/2411.19617v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.15795v2","updated":"2024-11-29T10:53:40Z","published":"2024-11-24T11:46:47Z","title":"Beyond adaptive gradient: Fast-Controlled Minibatch Algorithm for\n large-scale optimization","summary":" Adaptive gradient methods have been increasingly adopted by deep learning\ncommunity due to their fast convergence and reduced sensitivity to\nhyper-parameters. However, these methods come with limitations, such as\nincreased memory requirements for elements like moving averages and a poorly\nunderstood convergence theory. To overcome these challenges, we introduce\nF-CMA, a Fast-Controlled Mini-batch Algorithm with a random reshuffling method\nfeaturing a sufficient decrease condition and a line-search procedure to ensure\nloss reduction per epoch, along with its deterministic proof of global\nconvergence to a stationary point. To evaluate the F-CMA, we integrate it into\nconventional training protocols for classification tasks involving both\nconvolutional neural networks and vision transformer models, allowing for a\ndirect comparison with popular optimizers. Computational tests show significant\nimprovements, including a decrease in the overall training time by up to 68%,\nan increase in per-epoch efficiency by up to 20%, and in model accuracy by up\nto 5%.\n","authors":["Corrado Coppola","Lorenzo Papa","Irene Amerini","Laura Palagi"],"pdf_url":"https://arxiv.org/pdf/2411.15795v2.pdf","comment":"There is an error in the literature review, in section 1. 
In\n particular, we noticed that there is a wrong citation, the [65], which has\n been erroneously associated with another author's claims"},{"id":"http://arxiv.org/abs/2410.19192v3","updated":"2024-11-29T10:39:36Z","published":"2024-10-24T22:50:21Z","title":"TEAM: Topological Evolution-aware Framework for Traffic\n Forecasting--Extended Version","summary":" Due to the global trend towards urbanization, people increasingly move to and\nlive in cities that then continue to grow. Traffic forecasting plays an\nimportant role in the intelligent transportation systems of cities as well as\nin spatio-temporal data mining. State-of-the-art forecasting is achieved by\ndeep-learning approaches due to their ability to contend with complex\nspatio-temporal dynamics. However, existing methods assume the input is\nfixed-topology road networks and static traffic time series. These assumptions\nfail to align with urbanization, where time series are collected continuously\nand road networks evolve over time. In such settings, deep-learning models\nrequire frequent re-initialization and re-training, imposing high computational\ncosts. To enable much more efficient training without jeopardizing model\naccuracy, we propose the Topological Evolution-aware Framework (TEAM) for\ntraffic forecasting that incorporates convolution and attention. This\ncombination of mechanisms enables better adaptation to newly collected time\nseries, while being able to maintain learned knowledge from old time series.\nTEAM features a continual learning module based on the Wasserstein metric that\nacts as a buffer that can identify the most stable and the most changing\nnetwork nodes. Then, only data related to stable nodes is employed for\nre-training when consolidating a model. Further, only data of new nodes and\ntheir adjacent nodes as well as data pertaining to changing nodes are used to\nre-train the model. 
Empirical studies with two real-world traffic datasets\noffer evidence that TEAM is capable of much lower re-training costs than\nexisting methods are, without jeopardizing forecasting accuracy.\n","authors":["Duc Kieu","Tung Kieu","Peng Han","Bin Yang","Christian S. Jensen","Bac Le"],"pdf_url":"https://arxiv.org/pdf/2410.19192v3.pdf","comment":"16 pages. An extended version of \"TEAM: Topological Evolution-aware\n Framework for Traffic Forecasting\" accepted at PVLDB 2025"},{"id":"http://arxiv.org/abs/2310.08367v2","updated":"2024-11-29T10:39:26Z","published":"2023-10-12T14:38:25Z","title":"Towards Evaluating Generalist Agents: An Automated Benchmark in Open\n World","summary":" Evaluating generalist agents presents significant challenges due to their\nwide-ranging abilities and the limitations of current benchmarks in assessing\ntrue generalization. We introduce the Minecraft Universe (MCU), a fully\nautomated benchmarking framework set within the open-world game Minecraft. MCU\ndynamically generates and evaluates a broad spectrum of tasks, offering three\ncore components: 1) a task generation mechanism that provides high degrees of\nfreedom and variability, 2) an ever-expanding set of over 3K composable atomic\ntasks, and 3) a general evaluation framework that supports open-ended task\nassessment. By integrating large language models (LLMs), MCU dynamically\ncreates diverse environments for each evaluation, fostering agent\ngeneralization. 
The framework uses a vision-language model (VLM) to\nautomatically generate evaluation criteria, achieving over 90% agreement with\nhuman ratings across multi-dimensional assessments, which demonstrates that MCU\nis a scalable and explainable solution for evaluating generalist agents.\nAdditionally, we show that while state-of-the-art foundational models perform\nwell on specific tasks, they often struggle with increased task diversity and\ndifficulty.\n","authors":["Xinyue Zheng","Haowei Lin","Kaichen He","Zihao Wang","Zilong Zheng","Yitao Liang"],"pdf_url":"https://arxiv.org/pdf/2310.08367v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.05540v2","updated":"2024-11-29T10:29:20Z","published":"2024-08-10T12:43:55Z","title":"Convergence Analysis for Deep Sparse Coding via Convolutional Neural\n Networks","summary":" In this work, we explore intersections between sparse coding and deep\nlearning to enhance our understanding of feature extraction capabilities in\nadvanced neural network architectures. We begin by introducing a novel class of\nDeep Sparse Coding (DSC) models and establish thorough theoretical analysis of\ntheir uniqueness and stability properties. By applying iterative algorithms to\nthese DSC models, we derive convergence rates for convolutional neural networks\n(CNNs) in their ability to extract sparse features. This provides a strong\ntheoretical foundation for the use of CNNs in sparse feature learning tasks. We\nadditionally extend the convergence analysis to more general neural network\narchitectures, including those with diverse activation functions, as well as\nself-attention and transformer-based models. This broadens the applicability of\nour findings to a wide range of deep learning methods for deep sparse feature\nextraction. Inspired by the strong connection between sparse coding and CNNs,\nwe also explore training strategies to encourage neural networks to learn more\nsparse features. 
Through numerical experiments, we demonstrate the\neffectiveness of these approaches, providing valuable insights for the design\nof efficient and interpretable deep learning models.\n","authors":["Jianfei Li","Han Feng","Ding-Xuan Zhou"],"pdf_url":"https://arxiv.org/pdf/2408.05540v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.02771v5","updated":"2024-11-29T10:23:56Z","published":"2024-01-05T12:01:19Z","title":"Powerformer: A Section-adaptive Transformer for Power Flow Adjustment","summary":" In this paper, we present a novel transformer architecture tailored for\nlearning robust power system state representations, which strives to optimize\npower dispatch for the power flow adjustment across different transmission\nsections. Specifically, our proposed approach, named Powerformer, develops a\ndedicated section-adaptive attention mechanism, separating itself from the\nself-attention used in conventional transformers. This mechanism effectively\nintegrates power system states with transmission section information, which\nfacilitates the development of robust state representations. Furthermore, by\nconsidering the graph topology of power system and the electrical attributes of\nbus nodes, we introduce two customized strategies to further enhance the\nexpressiveness: graph neural network propagation and multi-factor attention\nmechanism. 
Extensive evaluations are conducted on three power system scenarios,\nincluding the IEEE 118-bus system, a realistic 300-bus system in China, and a\nlarge-scale European system with 9241 buses, where Powerformer demonstrates its\nsuperior performance over several baseline methods.\n","authors":["Kaixuan Chen","Wei Luo","Shunyu Liu","Yaoquan Wei","Yihe Zhou","Yunpeng Qing","Quan Zhang","Jie Song","Mingli Song"],"pdf_url":"https://arxiv.org/pdf/2401.02771v5.pdf","comment":"8 figures"},{"id":"http://arxiv.org/abs/2411.19593v1","updated":"2024-11-29T10:21:37Z","published":"2024-11-29T10:21:37Z","title":"Self-Supervised Denoiser Framework","summary":" Reconstructing images using Computed Tomography (CT) in an industrial context\nleads to specific challenges that differ from those encountered in other areas,\nsuch as clinical CT. Indeed, non-destructive testing with industrial CT will\noften involve scanning multiple similar objects while maintaining high\nthroughput, requiring short scanning times, which is not a relevant concern in\nclinical CT. Under-sampling the tomographic data (sinograms) is a natural way\nto reduce the scanning time at the cost of image quality since the latter\ndepends on the number of measurements. In such a scenario, post-processing\ntechniques are required to compensate for the image artifacts induced by the\nsinogram sparsity. We introduce the Self-supervised Denoiser Framework (SDF), a\nself-supervised training method that leverages pre-training on highly sampled\nsinogram data to enhance the quality of images reconstructed from undersampled\nsinogram data. The main contribution of SDF is that it proposes to train an\nimage denoiser in the sinogram space by setting the learning task as the\nprediction of one sinogram subset from another. 
As such, it does not require\nground-truth image data, leverages the abundant data modality in CT, the\nsinogram, and can drastically enhance the quality of images reconstructed from\na fraction of the measurements. We demonstrate that SDF produces better image\nquality, in terms of peak signal-to-noise ratio, than other analytical and\nself-supervised frameworks in both 2D fan-beam or 3D cone-beam CT settings.\nMoreover, we show that the enhancement provided by SDF carries over when\nfine-tuning the image denoiser on a few examples, making it a suitable\npre-training technique in a context where there is little high-quality image\ndata. Our results are established on experimental datasets, making SDF a strong\ncandidate for being the building block of foundational image-enhancement models\nin CT.\n","authors":["Emilien Valat","Andreas Hauptmann","Ozan Öktem"],"pdf_url":"https://arxiv.org/pdf/2411.19593v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.15327v6","updated":"2024-11-29T10:02:40Z","published":"2023-11-26T15:11:17Z","title":"FRAC-Q-Learning: A Reinforcement Learning with Boredom Avoidance\n Processes for Social Robots","summary":" The reinforcement learning algorithms have often been applied to social\nrobots. However, most reinforcement learning algorithms were not optimized for\nthe use of social robots, and consequently they may bore users. We proposed a\nnew reinforcement learning method specialized for the social robot, the\nFRAC-Q-learning, that can avoid user boredom. The proposed algorithm consists\nof a forgetting process in addition to randomizing and categorizing processes.\nThis study evaluated interest and boredom hardness scores of the\nFRAC-Q-learning by a comparison with the traditional Q-learning. The\nFRAC-Q-learning showed significantly higher trend of interest score, and\nindicated significantly harder to bore users compared to the traditional\nQ-learning. 
Therefore, the FRAC-Q-learning can contribute to develop a social\nrobot that will not bore users. The proposed algorithm has a potential to apply\nfor Web-based communication and educational systems. This paper presents the\nentire process, detailed implementation and a detailed evaluation method of the\nof the FRAC-Q-learning for the first time.\n","authors":["Akinari Onishi"],"pdf_url":"https://arxiv.org/pdf/2311.15327v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19585v1","updated":"2024-11-29T09:59:47Z","published":"2024-11-29T09:59:47Z","title":"LDA-AQU: Adaptive Query-guided Upsampling via Local Deformable Attention","summary":" Feature upsampling is an essential operation in constructing deep\nconvolutional neural networks. However, existing upsamplers either lack\nspecific feature guidance or necessitate the utilization of high-resolution\nfeature maps, resulting in a loss of performance and flexibility. In this\npaper, we find that the local self-attention naturally has the feature guidance\ncapability, and its computational paradigm aligns closely with the essence of\nfeature upsampling (\\ie feature reassembly of neighboring points). Therefore,\nwe introduce local self-attention into the upsampling task and demonstrate that\nthe majority of existing upsamplers can be regarded as special cases of\nupsamplers based on local self-attention. Considering the potential semantic\ngap between upsampled points and their neighboring points, we further introduce\nthe deformation mechanism into the upsampler based on local self-attention,\nthereby proposing LDA-AQU. As a novel dynamic kernel-based upsampler, LDA-AQU\nutilizes the feature of queries to guide the model in adaptively adjusting the\nposition and aggregation weight of neighboring points, thereby meeting the\nupsampling requirements across various complex scenarios. 
In addition, LDA-AQU\nis lightweight and can be easily integrated into various model architectures.\nWe evaluate the effectiveness of LDA-AQU across four dense prediction tasks:\nobject detection, instance segmentation, panoptic segmentation, and semantic\nsegmentation. LDA-AQU consistently outperforms previous state-of-the-art\nupsamplers, achieving performance enhancements of 1.7 AP, 1.5 AP, 2.0 PQ, and\n2.5 mIoU compared to the baseline models in the aforementioned four tasks,\nrespectively. Code is available at \\url{https://github.com/duzw9311/LDA-AQU}.\n","authors":["Zewen Du","Zhenjiang Hu","Guiyu Zhao","Ying Jin","Hongbin Ma"],"pdf_url":"https://arxiv.org/pdf/2411.19585v1.pdf","comment":"Accepted by ACM MM2024"},{"id":"http://arxiv.org/abs/2411.19584v1","updated":"2024-11-29T09:57:11Z","published":"2024-11-29T09:57:11Z","title":"Enhancing Sentiment Analysis in Bengali Texts: A Hybrid Approach Using\n Lexicon-Based Algorithm and Pretrained Language Model Bangla-BERT","summary":" Sentiment analysis (SA) is a process of identifying the emotional tone or\npolarity within a given text and aims to uncover the user's complex emotions\nand inner feelings. While sentiment analysis has been extensively studied for\nlanguages like English, research in Bengali, remains limited, particularly for\nfine-grained sentiment categorization. This work aims to connect this gap by\ndeveloping a novel approach that integrates rule-based algorithms with\npre-trained language models. We developed a dataset from scratch, comprising\nover 15,000 manually labeled reviews. Next, we constructed a Lexicon Data\nDictionary, assigning polarity scores to the reviews. We developed a novel rule\nbased algorithm Bangla Sentiment Polarity Score (BSPS), an approach capable of\ngenerating sentiment scores and classifying reviews into nine distinct\nsentiment categories. 
To assess the performance of this method, we evaluated\nthe classified sentiments using BanglaBERT, a pre-trained transformer-based\nlanguage model. We also performed sentiment classification directly with\nBanglaBERT on the original data and evaluated this model's results. Our\nanalysis revealed that the BSPS + BanglaBERT hybrid approach outperformed the\nstandalone BanglaBERT model, achieving higher accuracy, precision, and nuanced\nclassification across the nine sentiment categories. The results of our study\nemphasize the value and effectiveness of combining rule-based and pre-trained\nlanguage model approaches for enhanced sentiment analysis in Bengali and\nsuggest pathways for future research and application in languages with similar\nlinguistic complexities.\n","authors":["Hemal Mahmud","Hasan Mahmud"],"pdf_url":"https://arxiv.org/pdf/2411.19584v1.pdf","comment":"13 pages, 12 figures"},{"id":"http://arxiv.org/abs/2411.19583v1","updated":"2024-11-29T09:56:40Z","published":"2024-11-29T09:56:40Z","title":"Solving Rubik's Cube Without Tricky Sampling","summary":" The Rubiks Cube, with its vast state space and sparse reward structure,\npresents a significant challenge for reinforcement learning (RL) due to the\ndifficulty of reaching rewarded states. Previous research addressed this by\npropagating cost-to-go estimates from the solved state and incorporating search\ntechniques. These approaches differ from human strategies that start from fully\nscrambled cubes, which can be tricky for solving a general sparse-reward\nproblem. In this paper, we introduce a novel RL algorithm using policy gradient\nmethods to solve the Rubiks Cube without relying on near solved-state sampling.\nOur approach employs a neural network to predict cost patterns between states,\nallowing the agent to learn directly from scrambled states. 
Our method was\ntested on the 2x2x2 Rubiks Cube, where the cube was scrambled 50,000 times, and\nthe model successfully solved it in over 99.4% of cases. Notably, this result\nwas achieved using only the policy network without relying on tree search as in\nprevious methods, demonstrating its effectiveness and potential for broader\napplications in sparse-reward problems.\n","authors":["Yicheng Lin","Siyu Liang"],"pdf_url":"https://arxiv.org/pdf/2411.19583v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.13404v2","updated":"2024-11-29T09:39:27Z","published":"2024-04-20T15:12:47Z","title":"Solution space and storage capacity of fully connected two-layer neural\n networks with generic activation functions","summary":" The storage capacity of a binary classification model is the maximum number\nof random input-output pairs per parameter that the model can learn. It is one\nof the indicators of the expressive power of machine learning models and is\nimportant for comparing the performance of various models. In this study, we\nanalyze the structure of the solution space and the storage capacity of fully\nconnected two-layer neural networks with general activation functions using the\nreplica method from statistical physics. Our results demonstrate that the\nstorage capacity per parameter remains finite even with infinite width and that\nthe weights of the network exhibit negative correlations, leading to a\n'division of labor'. In addition, we find that increasing the dataset size\ntriggers a phase transition at a certain transition point where the permutation\nsymmetry of weights is broken, resulting in the solution space splitting into\ndisjoint regions. We identify the dependence of this transition point and the\nstorage capacity on the choice of activation function. 
These findings\ncontribute to understanding the influence of activation functions and the\nnumber of parameters on the structure of the solution space, potentially\noffering insights for selecting appropriate architectures based on specific\nobjectives.\n","authors":["Sota Nishiyama","Masayuki Ohzeki"],"pdf_url":"https://arxiv.org/pdf/2404.13404v2.pdf","comment":"16+12 pages, 5 figures, 1 table. v2 accepted to Journal of the\n Physical Society of Japan"},{"id":"http://arxiv.org/abs/2411.19564v1","updated":"2024-11-29T09:19:57Z","published":"2024-11-29T09:19:57Z","title":"A Comprehensive Framework for Automated Segmentation of Perivascular\n Spaces in Brain MRI with the nnU-Net","summary":" Background: Enlargement of perivascular spaces (PVS) is common in\nneurodegenerative disorders including cerebral small vessel disease,\nAlzheimer's disease, and Parkinson's disease. PVS enlargement may indicate\nimpaired clearance pathways and there is a need for reliable PVS detection\nmethods which are currently lacking. Aim: To optimise a widely used deep\nlearning model, the no-new-UNet (nnU-Net), for PVS segmentation. Methods: In 30\nhealthy participants (mean$\\pm$SD age: 50$\\pm$18.9 years; 13 females),\nT1-weighted MRI images were acquired using three different protocols on three\nMRI scanners (3T Siemens Tim Trio, 3T Philips Achieva, and 7T Siemens\nMagnetom). PVS were manually segmented across ten axial slices in each\nparticipant. Segmentations were completed using a sparse annotation strategy.\nIn total, 11 models were compared using various strategies for image handling,\npreprocessing and semi-supervised learning with pseudo-labels. Model\nperformance was evaluated using 5-fold cross validation (5FCV). The main\nperformance metric was the Dice Similarity Coefficient (DSC). Results: The\nvoxel-spacing agnostic model (mean$\\pm$SD DSC=64.3$\\pm$3.3%) outperformed\nmodels which resampled images to a common resolution (DSC=40.5-55%). 
Model\nperformance improved substantially following iterative label cleaning\n(DSC=85.7$\\pm$1.2%). Semi-supervised learning with pseudo-labels (n=12,740)\nfrom 18 additional datasets improved the agreement between raw and predicted\nPVS cluster counts (Lin's concordance correlation coefficient=0.89,\n95%CI=0.82-0.94). We extended the model to enable PVS segmentation in the\nmidbrain (DSC=64.3$\\pm$6.5%) and hippocampus (DSC=67.8$\\pm$5%). Conclusions:\nOur deep learning models provide a robust and holistic framework for the\nautomated quantification of PVS in brain MRI.\n","authors":["William Pham","Alexander Jarema","Donggyu Rim","Zhibin Chen","Mohamed S. H. Khlif","Vaughan G. Macefield","Luke A. Henderson","Amy Brodtmann"],"pdf_url":"https://arxiv.org/pdf/2411.19564v1.pdf","comment":"46 pages, 8 figures, 2 tables"},{"id":"http://arxiv.org/abs/2411.19557v1","updated":"2024-11-29T09:10:30Z","published":"2024-11-29T09:10:30Z","title":"Initialization using Update Approximation is a Silver Bullet for\n Extremely Efficient Low-Rank Fine-Tuning","summary":" Low-rank adapters have become a standard approach for efficiently fine-tuning\nlarge language models (LLMs), but they often fall short of achieving the\nperformance of full fine-tuning. We propose a method, LoRA Silver Bullet or\nLoRA-SB, that approximates full fine-tuning within low-rank subspaces using a\ncarefully designed initialization strategy. We theoretically demonstrate that\nthe architecture of LoRA-XS, which inserts a trainable (r x r) matrix between B\nand A while keeping other matrices fixed, provides the precise conditions\nneeded for this approximation. We leverage its constrained update space to\nachieve optimal scaling for high-rank gradient updates while removing the need\nfor hyperparameter tuning. We prove that our initialization offers an optimal\nlow-rank approximation of the initial gradient and preserves update directions\nthroughout training. 
Extensive experiments across mathematical reasoning,\ncommonsense reasoning, and language understanding tasks demonstrate that our\napproach exceeds the performance of standard LoRA while using 27-90x fewer\nparameters, and comprehensively outperforms LoRA-XS. Our findings establish\nthat it is possible to simulate full fine-tuning in low-rank subspaces, and\nachieve significant efficiency gains without sacrificing performance. Our code\nis publicly available at https://github.com/RaghavSinghal10/lora-sb.\n","authors":["Kaustubh Ponkshe","Raghav Singhal","Eduard Gorbunov","Alexey Tumanov","Samuel Horvath","Praneeth Vepakomma"],"pdf_url":"https://arxiv.org/pdf/2411.19557v1.pdf","comment":"Kaustubh Ponkshe and Raghav Singhal contributed equally to this work"},{"id":"http://arxiv.org/abs/2411.19556v1","updated":"2024-11-29T09:08:20Z","published":"2024-11-29T09:08:20Z","title":"Differentiable Causal Discovery For Latent Hierarchical Causal Models","summary":" Discovering causal structures with latent variables from observational data\nis a fundamental challenge in causal discovery. Existing methods often rely on\nconstraint-based, iterative discrete searches, limiting their scalability to\nlarge numbers of variables. Moreover, these methods frequently assume linearity\nor invertibility, restricting their applicability to real-world scenarios. We\npresent new theoretical results on the identifiability of nonlinear latent\nhierarchical causal models, relaxing previous assumptions in literature about\nthe deterministic nature of latent variables and exogenous noise. Building on\nthese insights, we develop a novel differentiable causal discovery algorithm\nthat efficiently estimates the structure of such models. To the best of our\nknowledge, this is the first work to propose a differentiable causal discovery\nmethod for nonlinear latent hierarchical models. Our approach outperforms\nexisting methods in both accuracy and scalability. 
We demonstrate its practical\nutility by learning interpretable hierarchical latent structures from\nhigh-dimensional image data and demonstrate its effectiveness on downstream\ntasks.\n","authors":["Parjanya Prashant","Ignavier Ng","Kun Zhang","Biwei Huang"],"pdf_url":"https://arxiv.org/pdf/2411.19556v1.pdf","comment":"25 pages with references, 7 figures"},{"id":"http://arxiv.org/abs/2411.07885v2","updated":"2024-11-29T09:02:25Z","published":"2024-11-12T15:47:17Z","title":"RadioActive: 3D Radiological Interactive Segmentation Benchmark","summary":" Current interactive segmentation approaches, inspired by the success of\nMETA's Segment Anything model, have achieved notable advancements, however,\nthey come with substantial limitations that hinder their practical application\nin 3D radiological scenarios. These include unrealistic human interaction\nrequirements, such as slice-by-slice operations for 2D models on 3D data, a\nlack of iterative interactive refinement, and insufficient evaluation\nexperiments. These shortcomings prevent accurate assessment of model\nperformance and lead to inconsistent outcomes across studies. The RadioActive\nbenchmark overcomes these challenges by offering a comprehensive and\nreproducible evaluation of interactive segmentation methods in realistic,\nclinically relevant scenarios. It includes diverse datasets, target structures,\nand interactive segmentation methods, and provides a flexible, extendable\ncodebase that allows seamless integration of new models and prompting\nstrategies. We also introduce advanced prompting techniques to enable 2D models\non 3D data by reducing the needed number of interaction steps, enabling a fair\ncomparison. We show that surprisingly the performance of slice-wise prompted\napproaches can match native 3D methods, despite the domain gap. 
Our findings\nchallenge the current literature and highlight that models not specifically\ntrained on medical data can outperform the current specialized medical methods.\nBy open-sourcing RadioActive, we invite the research community to integrate\ntheir models and prompting techniques, ensuring continuous and transparent\nevaluation of interactive segmentation models in 3D medical imaging.\n","authors":["Constantin Ulrich","Tassilo Wald","Emily Tempus","Maximilian Rokuss","Paul F. Jaeger","Klaus Maier-Hein"],"pdf_url":"https://arxiv.org/pdf/2411.07885v2.pdf","comment":"Undergoing Peer-Review"},{"id":"http://arxiv.org/abs/2208.06677v5","updated":"2024-11-29T08:58:27Z","published":"2022-08-13T16:04:39Z","title":"Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep\n Models","summary":" In deep learning, different kinds of deep networks typically need different\noptimizers, which have to be chosen after multiple trials, making the training\nprocess inefficient. To relieve this issue and consistently improve the model\ntraining speed across deep networks, we propose the ADAptive Nesterov momentum\nalgorithm, Adan for short. Adan first reformulates the vanilla Nesterov\nacceleration to develop a new Nesterov momentum estimation (NME) method, which\navoids the extra overhead of computing gradient at the extrapolation point.\nThen, Adan adopts NME to estimate the gradient's first- and second-order\nmoments in adaptive gradient algorithms for convergence acceleration. Besides,\nwe prove that Adan finds an $\\epsilon$-approximate first-order stationary point\nwithin $\\mathcal{O}(\\epsilon^{-3.5})$ stochastic gradient complexity on the\nnon-convex stochastic problems (e.g., deep learning problems), matching the\nbest-known lower bound. 
Extensive experimental results show that Adan\nconsistently surpasses the corresponding SoTA optimizers on vision, language,\nand RL tasks and sets new SoTAs for many popular networks and frameworks, e.g.,\nResNet, ConvNext, ViT, Swin, MAE, DETR, GPT-2, Transformer-XL, and BERT. More\nsurprisingly, Adan can use half of the training cost (epochs) of SoTA\noptimizers to achieve higher or comparable performance on ViT, GPT-2, MAE,\netc., and also shows great tolerance to a large range of minibatch size, e.g.,\nfrom 1k to 32k. Code is released at https://github.com/sail-sg/Adan, and has\nbeen used in multiple popular deep learning frameworks or projects.\n","authors":["Xingyu Xie","Pan Zhou","Huan Li","Zhouchen Lin","Shuicheng Yan"],"pdf_url":"https://arxiv.org/pdf/2208.06677v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19553v1","updated":"2024-11-29T08:57:07Z","published":"2024-11-29T08:57:07Z","title":"Analysis of High-dimensional Gaussian Labeled-unlabeled Mixture Model\n via Message-passing Algorithm","summary":" Semi-supervised learning (SSL) is a machine learning methodology that\nleverages unlabeled data in conjunction with a limited amount of labeled data.\nAlthough SSL has been applied in various applications and its effectiveness has\nbeen empirically demonstrated, it is still not fully understood when and why\nSSL performs well. Some existing theoretical studies have attempted to address\nthis issue by modeling classification problems using the so-called Gaussian\nMixture Model (GMM). These studies provide notable and insightful\ninterpretations. However, their analyses are focused on specific purposes, and\na thorough investigation of the properties of GMM in the context of SSL has\nbeen lacking. In this paper, we conduct such a detailed analysis of the\nproperties of the high-dimensional GMM for binary classification in the SSL\nsetting. 
To this end, we employ the approximate message passing and state\nevolution methods, which are widely used in high-dimensional settings and\noriginate from statistical mechanics. We deal with two estimation approaches:\nthe Bayesian one and the l2-regularized maximum likelihood estimation (RMLE).\nWe conduct a comprehensive comparison between these two approaches, examining\naspects such as the global phase diagram, estimation error for the parameters,\nand prediction error for the labels. A specific comparison is made between the\nBayes-optimal (BO) estimator and RMLE, as the BO setting provides optimal\nestimation performance and is ideal as a benchmark. Our analysis shows that\nwith appropriate regularizations, RMLE can achieve near-optimal performance in\nterms of both the estimation error and prediction error, especially when there\nis a large amount of unlabeled data. These results demonstrate that the l2\nregularization term plays an effective role in estimation and prediction in SSL\napproaches.\n","authors":["Xiaosi Gu","Tomoyuki Obuchi"],"pdf_url":"https://arxiv.org/pdf/2411.19553v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19551v1","updated":"2024-11-29T08:52:32Z","published":"2024-11-29T08:52:32Z","title":"Bootstraping Clustering of Gaussians for View-consistent 3D Scene\n Understanding","summary":" Injecting semantics into 3D Gaussian Splatting (3DGS) has recently garnered\nsignificant attention. While current approaches typically distill 3D semantic\nfeatures from 2D foundational models (e.g., CLIP and SAM) to facilitate novel\nview segmentation and semantic understanding, their heavy reliance on 2D\nsupervision can undermine cross-view semantic consistency and necessitate\ncomplex data preparation processes, therefore hindering view-consistent scene\nunderstanding. In this work, we present FreeGS, an unsupervised\nsemantic-embedded 3DGS framework that achieves view-consistent 3D scene\nunderstanding without the need for 2D labels. 
Instead of directly learning\nsemantic features, we introduce the IDentity-coupled Semantic Field (IDSF) into\n3DGS, which captures both semantic representations and view-consistent instance\nindices for each Gaussian. We optimize IDSF with a two-step alternating\nstrategy: semantics help to extract coherent instances in 3D space, while the\nresulting instances regularize the injection of stable semantics from 2D space.\nAdditionally, we adopt a 2D-3D joint contrastive loss to enhance the\ncomplementarity between view-consistent 3D geometry and rich semantics during\nthe bootstrapping process, enabling FreeGS to uniformly perform tasks such as\nnovel-view semantic segmentation, object selection, and 3D object detection.\nExtensive experiments on LERF-Mask, 3D-OVS, and ScanNet datasets demonstrate\nthat FreeGS performs comparably to state-of-the-art methods while avoiding the\ncomplex data preprocessing workload.\n","authors":["Wenbo Zhang","Lu Zhang","Ping Hu","Liqian Ma","Yunzhi Zhuge","Huchuan Lu"],"pdf_url":"https://arxiv.org/pdf/2411.19551v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19549v1","updated":"2024-11-29T08:51:43Z","published":"2024-11-29T08:51:43Z","title":"Contextual Checkerboard Denoise -- A Novel Neural Network-Based Approach\n for Classification-Aware OCT Image Denoising","summary":" In contrast to non-medical image denoising, where enhancing image clarity is\nthe primary goal, medical image denoising warrants preservation of crucial\nfeatures without introduction of new artifacts. However, many denoising methods\nthat improve the clarity of the image, inadvertently alter critical information\nof the denoised images, potentially compromising classification performance and\ndiagnostic quality. Additionally, supervised denoising methods are not very\npractical in medical image domain, since a \\emph{ground truth} denoised version\nof a noisy medical image is often extremely challenging to obtain. 
In this\npaper, we tackle both of these problems by introducing a novel neural network\nbased method -- \\emph{Contextual Checkerboard Denoising}, that can learn\ndenoising from only a dataset of noisy images, while preserving crucial\nanatomical details necessary for image classification/analysis. We perform our\nexperimentation on real Optical Coherence Tomography (OCT) images, and\nempirically demonstrate that our proposed method significantly improves image\nquality, providing clearer and more detailed OCT images, while enhancing\ndiagnostic accuracy.\n","authors":["Md. Touhidul Islam","Md. Abtahi M. Chowdhury","Sumaiya Salekin","Aye T. Maung","Akil A. Taki","Hafiz Imtiaz"],"pdf_url":"https://arxiv.org/pdf/2411.19549v1.pdf","comment":"Under review in Springer Journal of Medical Systems. Code available:\n https://github.com/AbtahiMajeed/CheckerBoardDenoiser/tree/main"},{"id":"http://arxiv.org/abs/2411.19548v1","updated":"2024-11-29T08:47:46Z","published":"2024-11-29T08:47:46Z","title":"ReconDreamer: Crafting World Models for Driving Scene Reconstruction via\n Online Restoration","summary":" Closed-loop simulation is crucial for end-to-end autonomous driving. Existing\nsensor simulation methods (e.g., NeRF and 3DGS) reconstruct driving scenes\nbased on conditions that closely mirror training data distributions. However,\nthese methods struggle with rendering novel trajectories, such as lane changes.\nRecent works have demonstrated that integrating world model knowledge\nalleviates these issues. Despite their efficiency, these approaches still\nencounter difficulties in the accurate representation of more complex\nmaneuvers, with multi-lane shifts being a notable example. Therefore, we\nintroduce ReconDreamer, which enhances driving scene reconstruction through\nincremental integration of world model knowledge. Specifically, DriveRestorer\nis proposed to mitigate artifacts via online restoration. 
This is complemented\nby a progressive data update strategy designed to ensure high-quality rendering\nfor more complex maneuvers. To the best of our knowledge, ReconDreamer is the\nfirst method to effectively render in large maneuvers. Experimental results\ndemonstrate that ReconDreamer outperforms Street Gaussians in the NTA-IoU,\nNTL-IoU, and FID, with relative improvements by 24.87%, 6.72%, and 29.97%.\nFurthermore, ReconDreamer surpasses DriveDreamer4D with PVG during large\nmaneuver rendering, as verified by a relative improvement of 195.87% in the\nNTA-IoU metric and a comprehensive user study.\n","authors":["Chaojun Ni","Guosheng Zhao","Xiaofeng Wang","Zheng Zhu","Wenkang Qin","Guan Huang","Chen Liu","Yuyin Chen","Yida Wang","Xueyang Zhang","Yifei Zhan","Kun Zhan","Peng Jia","Xianpeng Lang","Xingang Wang","Wenjun Mei"],"pdf_url":"https://arxiv.org/pdf/2411.19548v1.pdf","comment":"Project Page: https://recondreamer.github.io"},{"id":"http://arxiv.org/abs/2411.19544v1","updated":"2024-11-29T08:43:52Z","published":"2024-11-29T08:43:52Z","title":"SkelMamba: A State Space Model for Efficient Skeleton Action Recognition\n of Neurological Disorders","summary":" We introduce a novel state-space model (SSM)-based framework for\nskeleton-based human action recognition, with an anatomically-guided\narchitecture that improves state-of-the-art performance in both clinical\ndiagnostics and general action recognition tasks. Our approach decomposes\nskeletal motion analysis into spatial, temporal, and spatio-temporal streams,\nusing channel partitioning to capture distinct movement characteristics\nefficiently. By implementing a structured, multi-directional scanning strategy\nwithin SSMs, our model captures local joint interactions and global motion\npatterns across multiple anatomical body parts. 
This anatomically-aware\ndecomposition enhances the ability to identify subtle motion patterns critical\nin medical diagnosis, such as gait anomalies associated with neurological\nconditions. On public action recognition benchmarks, i.e., NTU RGB+D, NTU RGB+D\n120, and NW-UCLA, our model outperforms current state-of-the-art methods,\nachieving accuracy improvements up to $3.2\\%$ with lower computational\ncomplexity than previous leading transformer-based models. We also introduce a\nnovel medical dataset for motion-based patient neurological disorder analysis\nto validate our method's potential in automated disease diagnosis.\n","authors":["Niki Martinel","Mariano Serrao","Christian Micheloni"],"pdf_url":"https://arxiv.org/pdf/2411.19544v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.04480v2","updated":"2024-11-29T08:38:55Z","published":"2024-07-05T13:01:36Z","title":"LoCo: Low-Bit Communication Adaptor for Large-scale Model Training","summary":" To efficiently train large-scale models, low-bit gradient communication\ncompresses full-precision gradients on local GPU nodes into low-precision ones\nfor higher gradient synchronization efficiency among GPU nodes. However, it\noften degrades training quality due to compression information loss. To address\nthis, we propose the Low-bit Communication Adaptor (LoCo), which compensates\ngradients on local GPU nodes before compression, ensuring efficient\nsynchronization without compromising training quality. Specifically, LoCo\ndesigns a moving average of historical compensation errors to stably estimate\nconcurrent compression error and then adopts it to compensate for the\nconcurrent gradient compression, yielding a less lossless compression. This\nmechanism allows it to be compatible with general optimizers like Adam and\nsharding strategies like FSDP. 
Theoretical analysis shows that integrating LoCo\ninto full-precision optimizers like Adam and SGD does not impair their\nconvergence speed on nonconvex problems. Experimental results show that across\nlarge-scale model training frameworks like Megatron-LM and PyTorch's FSDP, LoCo\nsignificantly improves communication efficiency, e.g., improving Adam's\ntraining speed by 14% to 40% without performance degradation on large language\nmodels like LLAMAs and MoE.\n","authors":["Xingyu Xie","Zhijie Lin","Kim-Chuan Toh","Pan Zhou"],"pdf_url":"https://arxiv.org/pdf/2407.04480v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19537v1","updated":"2024-11-29T08:29:25Z","published":"2024-11-29T08:29:25Z","title":"Deepfake Media Generation and Detection in the Generative AI Era: A\n Survey and Outlook","summary":" With the recent advancements in generative modeling, the realism of deepfake\ncontent has been increasing at a steady pace, even reaching the point where\npeople often fail to detect manipulated media content online, thus being\ndeceived into various kinds of scams. In this paper, we survey deepfake\ngeneration and detection techniques, including the most recent developments in\nthe field, such as diffusion models and Neural Radiance Fields. Our literature\nreview covers all deepfake media types, comprising image, video, audio and\nmultimodal (audio-visual) content. We identify various kinds of deepfakes,\naccording to the procedure used to alter or generate the fake content. We\nfurther construct a taxonomy of deepfake generation and detection methods,\nillustrating the important groups of methods and the domains where these\nmethods are applied. Next, we gather datasets used for deepfake detection and\nprovide updated rankings of the best performing deepfake detectors on the most\npopular datasets. In addition, we develop a novel multimodal benchmark to\nevaluate deepfake detectors on out-of-distribution content. 
The results\nindicate that state-of-the-art detectors fail to generalize to deepfake content\ngenerated by unseen deepfake generators. Finally, we propose future directions\nto obtain robust and powerful deepfake detectors. Our project page and new\nbenchmark are available at https://github.com/CroitoruAlin/biodeep.\n","authors":["Florinel-Alin Croitoru","Andrei-Iulian Hiji","Vlad Hondru","Nicolae Catalin Ristea","Paul Irofti","Marius Popescu","Cristian Rusu","Radu Tudor Ionescu","Fahad Shahbaz Khan","Mubarak Shah"],"pdf_url":"https://arxiv.org/pdf/2411.19537v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19536v1","updated":"2024-11-29T08:24:22Z","published":"2024-11-29T08:24:22Z","title":"Development of Low-Cost IoT Units for Thermal Comfort Measurement and AC\n Energy Consumption Prediction System","summary":" In response to the substantial energy consumption in buildings, the Japanese\ngovernment initiated the BI-Tech (Behavioral Insights X Technology) project in\n2019, aimed at promoting voluntary energy-saving behaviors through the\nutilization of AI and IoT technologies. Our study aimed at small and\nmedium-sized office buildings introduces a cost-effective IoT-based BI-Tech\nsystem, utilizing the Raspberry Pi 4B+ platform for real-time monitoring of\nindoor thermal conditions and air conditioner (AC) set-point temperature.\nEmploying machine learning and image recognition, the system analyzes data to\ncalculate the PMV index and predict energy consumption changes due to\ntemperature adjustments. The integration of mobile and desktop applications\nconveys this information to users, encouraging energy-efficient behavior\nmodifications. 
The machine learning model achieved an R2 value of 97%,\ndemonstrating the system's efficiency in promoting energy-saving habits among\nusers.\n","authors":["Yutong Chen","Daisuke Sumiyoshi","Riki Sakai","Takahiro Yamamoto","Takahiro Ueno","Jewon Oh"],"pdf_url":"https://arxiv.org/pdf/2411.19536v1.pdf","comment":"RoomVent2024 conference"},{"id":"http://arxiv.org/abs/2411.19534v1","updated":"2024-11-29T08:20:12Z","published":"2024-11-29T08:20:12Z","title":"QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain","summary":" We tackle the problem of quantifying the number of objects by a generative\ntext-to-image model. Rather than retraining such a model for each new image\ndomain of interest, which leads to high computational costs and limited\nscalability, we are the first to consider this problem from a domain-agnostic\nperspective. We propose QUOTA, an optimization framework for text-to-image\nmodels that enables effective object quantification across unseen domains\nwithout retraining. It leverages a dual-loop meta-learning strategy to optimize\na domain-invariant prompt. Further, by integrating prompt learning with\nlearnable counting and domain tokens, our method captures stylistic variations\nand maintains accuracy, even for object classes not encountered during\ntraining. For evaluation, we adopt a new benchmark specifically designed for\nobject quantification in domain generalization, enabling rigorous assessment of\nobject quantification accuracy and adaptability across unseen domains in\ntext-to-image generation. Extensive experiments demonstrate that QUOTA\noutperforms conventional models in both object quantification accuracy and\nsemantic consistency, setting a new benchmark for efficient and scalable\ntext-to-image generation for any domain.\n","authors":["Wenfang Sun","Yingjun Du","Gaowen Liu","Cees G. M. 
Snoek"],"pdf_url":"https://arxiv.org/pdf/2411.19534v1.pdf","comment":"12 pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.12563v3","updated":"2024-11-29T08:16:06Z","published":"2023-10-19T08:15:03Z","title":"Approximate information maximization for bandit games","summary":" Entropy maximization and free energy minimization are general physical\nprinciples for modeling the dynamics of various physical systems. Notable\nexamples include modeling decision-making within the brain using the\nfree-energy principle, optimizing the accuracy-complexity trade-off when\naccessing hidden variables with the information bottleneck principle (Tishby et\nal., 2000), and navigation in random environments using information\nmaximization (Vergassola et al., 2007). Built on this principle, we propose a\nnew class of bandit algorithms that maximize an approximation to the\ninformation of a key variable within the system. To this end, we develop an\napproximated analytical physics-based representation of an entropy to forecast\nthe information gain of each action and greedily choose the one with the\nlargest information gain. This method yields strong performances in classical\nbandit settings. Motivated by its empirical success, we prove its asymptotic\noptimality for the two-armed bandit problem with Gaussian rewards. Owing to its\nability to encompass the system's properties in a global physical functional,\nthis approach can be efficiently adapted to more complex bandit settings,\ncalling for further investigation of information maximization approaches for\nmulti-armed bandit problems.\n","authors":["Alex Barbier-Chebbah","Christian L. 
Vestergaard","Jean-Baptiste Masson","Etienne Boursier"],"pdf_url":"https://arxiv.org/pdf/2310.12563v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19530v1","updated":"2024-11-29T08:05:50Z","published":"2024-11-29T08:05:50Z","title":"Quantized Delta Weight Is Safety Keeper","summary":" Recent advancements in fine-tuning proprietary language models enable\ncustomized applications across various domains but also introduce two major\nchallenges: high resource demands and security risks. Regarding resource\ndemands, recent work proposes novel partial compression, such as BitDelta, to\nquantize the delta weights between the fine-tuned model and base model.\nRegarding the security risks, user-defined fine-tuning can introduce security\nvulnerabilities, such as alignment issues, backdoor attacks, and\nhallucinations. However, most of the current efforts in security assessment\nfocus on the full-precision or full-compression models, it is not\nwell-discussed how the partial compression methods affect security concerns. To\nbridge this gap, we evaluate the robustness of delta-weight quantization\nagainst these security threats. In this paper, we uncover a \"free lunch\"\nphenomenon: partial compression can enhance model security against\nfine-tuning-based attacks with bearable utility loss. Using Llama-2-7b-chat as\na case study, we show that, with under 10% utility degradation, the partial\ncompression mitigates alignment-breaking risks by up to 66.17%, harmful\nbackdoor vulnerabilities by 64.46%, and targeted output manipulation risks by\nup to 90.53%. We further apply LogitLens to visualize internal state\ntransformations during forward passes, suggesting mechanisms for both security\nfailure and recovery in standard versus compressed fine-tuning. 
This work\noffers new insights into selecting effective delta compression methods for\nsecure, resource-efficient multi-tenant services.\n","authors":["Yule Liu","Zhen Sun","Xinlei He","Xinyi Huang"],"pdf_url":"https://arxiv.org/pdf/2411.19530v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19528v1","updated":"2024-11-29T07:57:32Z","published":"2024-11-29T07:57:32Z","title":"RAGDiffusion: Faithful Cloth Generation via External Knowledge\n Assimilation","summary":" Standard clothing asset generation involves creating forward-facing flat-lay\ngarment images displayed on a clear background by extracting clothing\ninformation from diverse real-world contexts, which presents significant\nchallenges due to highly standardized sampling distributions and precise\nstructural requirements in the generated images. Existing models have limited\nspatial perception and often exhibit structural hallucinations in this\nhigh-specification generative task. To address this issue, we propose a novel\nRetrieval-Augmented Generation (RAG) framework, termed RAGDiffusion, to enhance\nstructure determinacy and mitigate hallucinations by assimilating external\nknowledge from LLM and databases. RAGDiffusion consists of two core processes:\n(1) Retrieval-based structure aggregation, which employs contrastive learning\nand a Structure Locally Linear Embedding (SLLE) to derive global structure and\nspatial landmarks, providing both soft and hard guidance to counteract\nstructural ambiguities; and (2) Omni-level faithful garment generation, which\nintroduces a three-level alignment that ensures fidelity in structural,\npattern, and decoding components within the diffusing. 
Extensive experiments on\nchallenging real-world datasets demonstrate that RAGDiffusion synthesizes\nstructurally and detail-faithful clothing assets with significant performance\nimprovements, representing a pioneering effort in high-specification faithful\ngeneration with RAG to confront intrinsic hallucinations and enhance fidelity.\n","authors":["Xianfeng Tan","Yuhan Li","Wenxiang Shang","Yubo Wu","Jian Wang","Xuanhong Chen","Yi Zhang","Ran Lin","Bingbing Ni"],"pdf_url":"https://arxiv.org/pdf/2411.19528v1.pdf","comment":"Project website: https://colorful-liyu.github.io/RAGDiffusion-page/"},{"id":"http://arxiv.org/abs/2411.19527v1","updated":"2024-11-29T07:54:56Z","published":"2024-11-29T07:54:56Z","title":"DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow\n Decoding","summary":" Human motion, inherently continuous and dynamic, presents significant\nchallenges for generative models. Despite their dominance, discrete\nquantization methods, such as VQ-VAEs, suffer from inherent limitations,\nincluding restricted expressiveness and frame-wise noise artifacts. Continuous\napproaches, while producing smoother and more natural motions, often falter due\nto high-dimensional complexity and limited training data. To resolve this\n\"discord\" between discrete and continuous representations, we introduce\nDisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding, a\nnovel method that decodes discrete motion tokens into continuous motion through\nrectified flow. By employing an iterative refinement process in the continuous\nspace, DisCoRD captures fine-grained dynamics and ensures smoother and more\nnatural motions. Compatible with any discrete-based framework, our method\nenhances naturalness without compromising faithfulness to the conditioning\nsignals. Extensive evaluations demonstrate that DisCoRD achieves\nstate-of-the-art performance, with FID of 0.032 on HumanML3D and 0.169 on\nKIT-ML. 
These results solidify DisCoRD as a robust solution for bridging the\ndivide between discrete efficiency and continuous realism. Our project page is\navailable at: https://whwjdqls.github.io/discord.github.io/.\n","authors":["Jungbin Cho","Junwan Kim","Jisoo Kim","Minseo Kim","Mingu Kang","Sungeun Hong","Tae-Hyun Oh","Youngjae Yu"],"pdf_url":"https://arxiv.org/pdf/2411.19527v1.pdf","comment":"20 pages 18 figures"},{"id":"http://arxiv.org/abs/2411.19525v1","updated":"2024-11-29T07:49:44Z","published":"2024-11-29T07:49:44Z","title":"LokiTalk: Learning Fine-Grained and Generalizable Correspondences to\n Enhance NeRF-based Talking Head Synthesis","summary":" Despite significant progress in talking head synthesis since the introduction\nof Neural Radiance Fields (NeRF), visual artifacts and high training costs\npersist as major obstacles to large-scale commercial adoption. We propose that\nidentifying and establishing fine-grained and generalizable correspondences\nbetween driving signals and generated results can simultaneously resolve both\nproblems. Here we present LokiTalk, a novel framework designed to enhance\nNeRF-based talking heads with lifelike facial dynamics and improved training\nefficiency. To achieve fine-grained correspondences, we introduce\nRegion-Specific Deformation Fields, which decompose the overall portrait motion\ninto lip movements, eye blinking, head pose, and torso movements. By\nhierarchically modeling the driving signals and their associated regions\nthrough two cascaded deformation fields, we significantly improve dynamic\naccuracy and minimize synthetic artifacts. Furthermore, we propose ID-Aware\nKnowledge Transfer, a plug-and-play module that learns generalizable dynamic\nand static correspondences from multi-identity videos, while simultaneously\nextracting ID-specific dynamic and static features to refine the depiction of\nindividual characters. 
Comprehensive evaluations demonstrate that LokiTalk\ndelivers superior high-fidelity results and training efficiency compared to\nprevious methods. The code will be released upon acceptance.\n","authors":["Tianqi Li","Ruobing Zheng","Bonan Li","Zicheng Zhang","Meng Wang","Jingdong Chen","Ming Yang"],"pdf_url":"https://arxiv.org/pdf/2411.19525v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19523v1","updated":"2024-11-29T07:41:20Z","published":"2024-11-29T07:41:20Z","title":"Density-Calibrated Conformal Quantile Regression","summary":" This paper introduces the Density-Calibrated Conformal Quantile Regression\n(CQR-d) method, a novel approach for constructing prediction intervals that\nadapts to varying uncertainty across the feature space. Building upon conformal\nquantile regression, CQR-d incorporates local information through a weighted\ncombination of local and global conformity scores, where the weights are\ndetermined by local data density. We prove that CQR-d provides valid marginal\ncoverage at level $1 - \\alpha - \\epsilon$, where $\\epsilon$ represents a small\ntolerance from numerical optimization. Through extensive simulation studies and\nan application to a heteroscedastic dataset available in R, we demonstrate\nthat CQR-d maintains the desired coverage while producing substantially\nnarrower prediction intervals compared to standard conformal quantile\nregression (CQR). Notably, in our application on heteroscedastic data, CQR-d\nachieves an $8.6\\%$ reduction in average interval width while maintaining\ncomparable coverage. 
The method's effectiveness is particularly pronounced in\nsettings with clear local uncertainty patterns, making it a valuable tool for\nprediction tasks in heterogeneous data environments.\n","authors":["Yuan Lu"],"pdf_url":"https://arxiv.org/pdf/2411.19523v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.16656v2","updated":"2024-11-29T07:35:30Z","published":"2024-10-22T03:00:11Z","title":"Parsimonious Dynamic Mode Decomposition: A Robust and Automated Approach\n for Optimally Sparse Mode Selection in Complex Systems","summary":" This paper introduces the Parsimonious Dynamic Mode Decomposition (parsDMD),\na novel algorithm designed to automatically select an optimally sparse subset\nof dynamic modes for both spatiotemporal and purely temporal data. By\nincorporating time-delay embedding and leveraging Orthogonal Matching Pursuit\n(OMP), parsDMD ensures robustness against noise and effectively handles\ncomplex, nonlinear dynamics. The algorithm is validated on a diverse range of\ndatasets, including standing wave signals, identifying hidden dynamics, fluid\ndynamics simulations (flow past a cylinder and transonic buffet), and\natmospheric sea-surface temperature (SST) data. ParsDMD addresses a significant\nlimitation of the traditional sparsity-promoting DMD (spDMD), which requires\nmanual tuning of sparsity parameters through a rigorous trial-and-error process\nto balance between single-mode and all-mode solutions. In contrast, parsDMD\nautonomously determines the optimally sparse subset of modes without user\nintervention, while maintaining minimal computational complexity. Comparative\nanalyses demonstrate that parsDMD consistently outperforms spDMD by providing\nmore accurate mode identification and effective reconstruction in noisy\nenvironments. 
These advantages render parsDMD an effective tool for real-time\ndiagnostics, forecasting, and reduced-order model construction across various\ndisciplines.\n","authors":["Arpan Das","Pier Marzocca","Oleg Levinski"],"pdf_url":"https://arxiv.org/pdf/2410.16656v2.pdf","comment":"42 pages, 16 Figures"},{"id":"http://arxiv.org/abs/2411.19517v1","updated":"2024-11-29T07:23:34Z","published":"2024-11-29T07:23:34Z","title":"RL-MILP Solver: A Reinforcement Learning Approach for Solving\n Mixed-Integer Linear Programs with Graph Neural Networks","summary":" Mixed-Integer Linear Programming (MILP) is an optimization technique widely\nused in various fields. Primal heuristics, which reduce the search space of\nMILP, have enabled traditional solvers (e.g., Gurobi) to efficiently find\nhigh-quality solutions. However, traditional primal heuristics rely on expert\nknowledge, motivating the advent of machine learning (ML)-based primal\nheuristics that learn repetitive patterns in MILP. Nonetheless, existing\nML-based primal heuristics do not guarantee solution feasibility (i.e.,\nsatisfying all constraints) and primarily focus on prediction for binary\ndecision variables. When addressing MILP involving non-binary integer variables\nusing ML-based approaches, feasibility issues can become even more pronounced.\nSince finding an optimal solution requires satisfying all constraints,\naddressing feasibility is critical. To overcome these limitations, we propose a\nnovel reinforcement learning (RL)-based solver that interacts with MILP to find\nfeasible solutions, rather than delegating sub-problems to traditional solvers.\nWe design reward functions tailored for MILP, which enables the RL agent to\nlearn relationships between decision variables and constraints. Additionally,\nto effectively model complex relationships among decision variables, we\nleverage a Transformer encoder-based graph neural network (GNN). 
Our\nexperimental results demonstrate that the proposed method can solve MILP\nproblems and find near-optimal solutions without delegating the remainder to\ntraditional solvers. The proposed method provides a meaningful step forward as\nan initial study in solving MILP problems end-to-end based solely on ML.\n","authors":["Tae-Hoon Lee","Min-Soo Kim"],"pdf_url":"https://arxiv.org/pdf/2411.19517v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.08701v3","updated":"2024-11-29T07:21:45Z","published":"2022-12-16T20:02:03Z","title":"An Upper Bound for the Distribution Overlap Index and Its Applications","summary":" This paper proposes an easy-to-compute upper bound for the overlap index\nbetween two probability distributions without requiring any knowledge of the\ndistribution models. The computation of our bound is time-efficient and\nmemory-efficient and only requires finite samples. The proposed bound shows its\nvalue in one-class classification and domain shift analysis. Specifically, in\none-class classification, we build a novel one-class classifier by converting\nthe bound into a confidence score function. Unlike most one-class classifiers,\nthe training process is not needed for our classifier. Additionally, the\nexperimental results show that our classifier can be accurate with only a small\nnumber of in-class samples and outperform many state-of-the-art methods on\nvarious datasets in different one-class classification scenarios. In domain\nshift analysis, we propose a theorem based on our bound. The theorem is useful\nin detecting the existence of domain shift and inferring data information. The\ndetection and inference processes are both computation-efficient and\nmemory-efficient. 
Our work shows significant promise toward broadening the\napplications of overlap-based metrics.\n","authors":["Hao Fu","Prashanth Krishnamurthy","Siddharth Garg","Farshad Khorrami"],"pdf_url":"https://arxiv.org/pdf/2212.08701v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19514v1","updated":"2024-11-29T07:12:36Z","published":"2024-11-29T07:12:36Z","title":"Enhancing AI microscopy for foodborne bacterial classification via\n adversarial domain adaptation across optical and biological variability","summary":" Rapid detection of foodborne bacteria is critical for food safety and\nquality, yet traditional culture-based methods require extended incubation and\nspecialized sample preparation. This study addresses these challenges by i)\nenhancing the generalizability of AI-enabled microscopy for bacterial\nclassification using adversarial domain adaptation and ii) comparing the\nperformance of single-target and multi-domain adaptation. Three Gram-positive\n(Bacillus coagulans, Bacillus subtilis, Listeria innocua) and three\nGram-negative (E. coli, Salmonella Enteritidis, Salmonella Typhimurium) strains\nwere classified. EfficientNetV2 served as the backbone architecture, leveraging\nfine-grained feature extraction for small targets. Few-shot learning enabled\nscalability, with domain-adversarial neural networks (DANNs) addressing single\ndomains and multi-DANNs (MDANNs) generalizing across all target domains. The\nmodel was trained on source domain data collected under controlled conditions\n(phase contrast microscopy, 60x magnification, 3-h bacterial incubation) and\nevaluated on target domains with variations in microscopy modality\n(brightfield, BF), magnification (20x), and extended incubation to compensate\nfor lower resolution (20x-5h). DANNs improved target domain classification\naccuracy by up to 54.45% (20x), 43.44% (20x-5h), and 31.67% (BF), with minimal\nsource domain degradation (<4.44%). 
MDANNs achieved superior performance in the\nBF domain and substantial gains in the 20x domain. Grad-CAM and t-SNE\nvisualizations validated the model's ability to learn domain-invariant features\nacross diverse conditions. This study presents a scalable and adaptable\nframework for bacterial classification, reducing reliance on extensive sample\npreparation and enabling application in decentralized and resource-limited\nenvironments.\n","authors":["Siddhartha Bhattacharya","Aarham Wasit","Mason Earles","Nitin Nitin","Luyao Ma","Jiyoon Yi"],"pdf_url":"https://arxiv.org/pdf/2411.19514v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19513v1","updated":"2024-11-29T07:11:42Z","published":"2024-11-29T07:11:42Z","title":"ContextGNN: Beyond Two-Tower Recommendation Systems","summary":" Recommendation systems predominantly utilize two-tower architectures, which\nevaluate user-item rankings through the inner product of their respective\nembeddings. However, one key limitation of two-tower models is that they learn\na pair-agnostic representation of users and items. In contrast, pair-wise\nrepresentations either scale poorly due to their quadratic complexity or are\ntoo restrictive on the candidate pairs to rank. To address these issues, we\nintroduce Context-based Graph Neural Networks (ContextGNNs), a novel deep\nlearning architecture for link prediction in recommendation systems. The method\nemploys a pair-wise representation technique for familiar items situated within\na user's local subgraph, while leveraging two-tower representations to\nfacilitate the recommendation of exploratory items. A final network then\npredicts how to fuse both pair-wise and two-tower recommendations into a single\nranking of items. 
We demonstrate that ContextGNN is able to adapt to different\ndata characteristics and outperforms existing methods, both traditional and\nGNN-based, on a diverse set of practical recommendation tasks, improving\nperformance by 20% on average.\n","authors":["Yiwen Yuan","Zecheng Zhang","Xinwei He","Akihiro Nitta","Weihua Hu","Dong Wang","Manan Shah","Shenyang Huang","Blaž Stojanovič","Alan Krumholz","Jan Eric Lenssen","Jure Leskovec","Matthias Fey"],"pdf_url":"https://arxiv.org/pdf/2411.19513v1.pdf","comment":"14 pages, 1 figure, 5 tables"},{"id":"http://arxiv.org/abs/2411.19512v1","updated":"2024-11-29T07:11:28Z","published":"2024-11-29T07:11:28Z","title":"Topology-Preserving Scaling in Data Augmentation","summary":" We propose an algorithmic framework for dataset normalization in data\naugmentation pipelines that preserves topological stability under non-uniform\nscaling transformations. Given a finite metric space \\( X \\subset \\mathbb{R}^n\n\\) with Euclidean distance \\( d_X \\), we consider scaling transformations\ndefined by scaling factors \\( s_1, s_2, \\ldots, s_n > 0 \\). Specifically, we\ndefine a scaling function \\( S \\) that maps each point \\( x = (x_1, x_2,\n\\ldots, x_n) \\in X \\) to \\[ S(x) = (s_1 x_1, s_2 x_2, \\ldots, s_n x_n). \\] Our\nmain result establishes that the bottleneck distance \\( d_B(D, D_S) \\) between\nthe persistence diagrams \\( D \\) of \\( X \\) and \\( D_S \\) of \\( S(X) \\)\nsatisfies: \\[ d_B(D, D_S) \\leq (s_{\\max} - s_{\\min}) \\cdot\n\\operatorname{diam}(X), \\] where \\( s_{\\min} = \\min_{1 \\leq i \\leq n} s_i \\),\n\\( s_{\\max} = \\max_{1 \\leq i \\leq n} s_i \\), and \\( \\operatorname{diam}(X) \\)\nis the diameter of \\( X \\). 
Based on this theoretical guarantee, we formulate\nan optimization problem to minimize the scaling variability \\( \\Delta_s =\ns_{\\max} - s_{\\min} \\) under the constraint \\( d_B(D, D_S) \\leq \\epsilon \\),\nwhere \\( \\epsilon > 0 \\) is a user-defined tolerance.\n We develop an algorithmic solution to this problem, ensuring that data\naugmentation via scaling transformations preserves essential topological\nfeatures. We further extend our analysis to higher-dimensional homological\nfeatures, alternative metrics such as the Wasserstein distance, and iterative\nor probabilistic scaling scenarios. Our contributions provide a rigorous\nmathematical framework for dataset normalization in data augmentation\npipelines, ensuring that essential topological characteristics are maintained\ndespite scaling transformations.\n","authors":["Vu-Anh Le","Mehmet Dik"],"pdf_url":"https://arxiv.org/pdf/2411.19512v1.pdf","comment":"20 pages"},{"id":"http://arxiv.org/abs/2411.19510v1","updated":"2024-11-29T07:04:44Z","published":"2024-11-29T07:04:44Z","title":"Retrieval-guided Cross-view Image Synthesis","summary":" Cross-view image synthesis involves generating new images of a scene from\ndifferent viewpoints or perspectives, given one input image from other\nviewpoints. Despite recent advancements, there are several limitations in\nexisting methods: 1) reliance on additional data such as semantic segmentation\nmaps or preprocessing modules to bridge the domain gap; 2) insufficient focus\non view-specific semantics, leading to compromised image quality and realism;\nand 3) a lack of diverse datasets representing complex urban environments. 
To\ntackle these challenges, we propose: 1) a novel retrieval-guided framework that\nemploys a retrieval network as an embedder to address the domain gap; 2) an\ninnovative generator that enhances semantic consistency and diversity specific\nto the target view to improve image quality and realism; and 3) a new dataset,\nVIGOR-GEN, providing diverse cross-view image pairs in urban settings to enrich\ndataset diversity. Extensive experiments on well-known CVUSA, CVACT, and new\nVIGOR-GEN datasets demonstrate that our method generates images of superior\nrealism, significantly outperforming current leading approaches, particularly\nin SSIM and FID evaluations.\n","authors":["Hongji Yang","Yiru Li","Yingying Zhu"],"pdf_url":"https://arxiv.org/pdf/2411.19510v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19509v1","updated":"2024-11-29T07:01:31Z","published":"2024-11-29T07:01:31Z","title":"Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head\n Synthesis","summary":" Recent advances in diffusion models have revolutionized audio-driven talking\nhead synthesis. Beyond precise lip synchronization, diffusion-based methods\nexcel in generating subtle expressions and natural head movements that are\nwell-aligned with the audio signal. However, these methods are confronted by\nslow inference speed, insufficient fine-grained control over facial motions,\nand occasional visual artifacts largely due to an implicit latent space derived\nfrom Variational Auto-Encoders (VAE), which prevent their adoption in realtime\ninteraction applications. To address these issues, we introduce Ditto, a\ndiffusion-based framework that enables controllable realtime talking head\nsynthesis. Our key innovation lies in bridging motion generation and\nphotorealistic neural rendering through an explicit identity-agnostic motion\nspace, replacing conventional VAE representations. 
This design substantially\nreduces the complexity of diffusion learning while enabling precise control\nover the synthesized talking heads. We further propose an inference strategy\nthat jointly optimizes three key components: audio feature extraction, motion\ngeneration, and video synthesis. This optimization enables streaming\nprocessing, realtime inference, and low first-frame delay, which are the\nfunctionalities crucial for interactive applications such as AI assistants.\nExtensive experimental results demonstrate that Ditto generates compelling\ntalking head videos and substantially outperforms existing methods in both\nmotion control and realtime performance.\n","authors":["Tianqi Li","Ruobing Zheng","Minghui Yang","Jingdong Chen","Ming Yang"],"pdf_url":"https://arxiv.org/pdf/2411.19509v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13871v2","updated":"2024-11-29T07:00:39Z","published":"2024-08-25T15:40:21Z","title":"AlphaViT: A Flexible Game-Playing AI for Multiple Games and Variable\n Board Sizes","summary":" This paper presents novel game-playing AI agents based on the AlphaZero\nframework, enhanced with Vision Transformer (ViT): AlphaViT, AlphaViD, and\nAlphaVDA. These agents are designed to play multiple board games of various\nsizes using a single network with shared weights, thereby overcoming\nAlphaZero's limitation of fixed-board-size constraints. AlphaViT employs only a\ntransformer encoder, whereas AlphaViD and AlphaVDA incorporate both transformer\nencoders and decoders. In AlphaViD, the decoder processes outputs from the\nencoder, whereas AlphaVDA uses learnable embeddings as the decoder input. The\nadditional decoder layers in AlphaViD and AlphaVDA provide flexibility to adapt\nto various action spaces and board sizes. 
Experimental results show that the\nproposed agents, trained on either individual games or multiple games\nsimultaneously, consistently outperform traditional algorithms such as Minimax\nand Monte Carlo Tree Search and approach the performance of AlphaZero, despite\nusing a single deep neural network (DNN) with shared weights. In particular,\nAlphaViT shows strong performance across all tested games. Furthermore,\nfine-tuning the DNN using pre-trained weights from small-board games\naccelerates convergence and improves performance, particularly in Gomoku.\nInterestingly, simultaneous training on multiple games yields performance\ncomparable to, or even surpassing, single-game training. These results indicate\nthe potential of transformer-based architectures to develop more flexible and\nrobust game-playing AI agents that excel in multiple games and dynamic\nenvironments.\n","authors":["Kazuhisa Fujita"],"pdf_url":"https://arxiv.org/pdf/2408.13871v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19507v1","updated":"2024-11-29T06:57:50Z","published":"2024-11-29T06:57:50Z","title":"Graph-Enhanced EEG Foundation Model","summary":" Electroencephalography (EEG) signals provide critical insights for\napplications in disease diagnosis and healthcare. However, the scarcity of\nlabeled EEG data poses a significant challenge. Foundation models offer a\npromising solution by leveraging large-scale unlabeled data through\npre-training, enabling strong performance across diverse tasks. While both\ntemporal dynamics and inter-channel relationships are vital for understanding\nEEG signals, existing EEG foundation models primarily focus on the former,\noverlooking the latter. To address this limitation, we propose a novel\nfoundation model for EEG that integrates both temporal and inter-channel\ninformation. Our architecture combines Graph Neural Networks (GNNs), which\neffectively capture relational structures, with a masked autoencoder to enable\nefficient pre-training. 
We evaluated our approach using three downstream tasks\nand experimented with various GNN architectures. The results demonstrate that\nour proposed model, particularly when employing the GCN architecture with\noptimized configurations, consistently outperformed baseline methods across all\ntasks. These findings suggest that our model serves as a robust foundation\nmodel for EEG analysis.\n","authors":["Limin Wang","Toyotaro Suzumura","Hiroki Kanezashi"],"pdf_url":"https://arxiv.org/pdf/2411.19507v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.01519v2","updated":"2024-11-29T06:57:10Z","published":"2024-09-03T01:26:21Z","title":"Hybridization of Persistent Homology with Neural Networks for\n Time-Series Prediction: A Case Study in Wave Height","summary":" Time-series prediction is an active area of research across various fields,\noften challenged by the fluctuating influence of short-term and long-term\nfactors. In this study, we introduce a feature engineering method that enhances\nthe predictive performance of neural network models. Specifically, we leverage\ncomputational topology techniques to derive valuable topological features from\ninput data, boosting the predictive accuracy of our models. Our focus is on\npredicting wave heights, utilizing models based on topological features within\nfeedforward neural networks (FNNs), recurrent neural networks (RNNs), long\nshort-term memory networks (LSTM), and RNNs with gated recurrent units (GRU).\nFor time-ahead predictions, the enhancements in $R^2$ score were significant\nfor FNNs, RNNs, LSTM, and GRU models. Additionally, these models also showed\nsignificant reductions in maximum errors and mean squared errors.\n","authors":["Zixin Lin","Nur Fariha Syaqina Zulkepli","Mohd Shareduwan Mohd Kasihmuddin","R. U. 
Gobithaasan"],"pdf_url":"https://arxiv.org/pdf/2409.01519v2.pdf","comment":"The work has problems in methods and results"},{"id":"http://arxiv.org/abs/2411.19506v1","updated":"2024-11-29T06:56:42Z","published":"2024-11-29T06:56:42Z","title":"Real-time Anomaly Detection at the L1 Trigger of CMS Experiment","summary":" We present the preparation, deployment, and testing of an autoencoder trained\nfor unbiased detection of new physics signatures in the CMS experiment Global\nTrigger (GT) test crate FPGAs during LHC Run 3. The GT makes the final decision\nwhether to readout or discard the data from each LHC collision, which occur at\na rate of 40 MHz, within a 50 ns latency. The Neural Network makes a prediction\nfor each event within these constraints, which can be used to select anomalous\nevents for further analysis. The GT test crate is a copy of the main GT system,\nreceiving the same input data, but whose output is not used to trigger the\nreadout of CMS, providing a platform for thorough testing of new trigger\nalgorithms on live data, but without interrupting data taking. We describe the\nmethodology to achieve ultra low latency anomaly detection, and present the\nintegration of the DNN into the GT test crate, as well as the monitoring,\ntesting, and validation of the algorithm during proton collisions.\n","authors":["Abhijith Gandrakota"],"pdf_url":"https://arxiv.org/pdf/2411.19506v1.pdf","comment":"Contribution to 42nd International Conference on High Energy Physics\n (ICHEP 2024)"}],"Multimedia":[{"id":"http://arxiv.org/abs/2411.19772v1","updated":"2024-11-29T15:18:06Z","published":"2024-11-29T15:18:06Z","title":"LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware\n Omni-Modal Perception of Long Videos","summary":" Despite impressive advancements in video understanding, most efforts remain\nlimited to coarse-grained or visual-only video tasks. 
However, real-world\nvideos encompass omni-modal information (vision, audio, and speech) with a\nseries of events forming a cohesive storyline. The lack of multi-modal video\ndata with fine-grained event annotations and the high cost of manual labeling\nare major obstacles to comprehensive omni-modality video perception. To address\nthis gap, we propose an automatic pipeline consisting of high-quality\nmulti-modal video filtering, semantically coherent omni-modal event boundary\ndetection, and cross-modal correlation-aware event captioning. In this way, we\npresent LongVALE, the first-ever Vision-Audio-Language Event understanding\nbenchmark comprising 105K omni-modal events with precise temporal boundaries\nand detailed relation-aware captions within 8.4K high-quality long videos.\nFurther, we build a baseline that leverages LongVALE to enable video large\nlanguage models (LLMs) for omni-modality fine-grained temporal video\nunderstanding for the first time. Extensive experiments demonstrate the\neffectiveness and great potential of LongVALE in advancing comprehensive\nmulti-modal video understanding.\n","authors":["Tiantian Geng","Jinrui Zhang","Qingni Wang","Teng Wang","Jinming Duan","Feng Zheng"],"pdf_url":"https://arxiv.org/pdf/2411.19772v1.pdf","comment":"18 pages, 15 figures"},{"id":"http://arxiv.org/abs/2411.19730v1","updated":"2024-11-29T14:23:25Z","published":"2024-11-29T14:23:25Z","title":"Ten Ways in which Virtual Reality Differs from Video Streaming","summary":" Virtual Reality (VR) applications have a number of unique characteristics\nthat set them apart from traditional video streaming. These characteristics\nhave major implications on the design of VR rendering, adaptation, prefetching,\ncaching, and transport mechanisms. 
This paper contrasts VR to video streaming,\nstored 2D video streaming in particular, and discusses how to rethink system\nand network support for VR.\n","authors":["Gustavo de Veciana","Sonia Fahmy","George Kesidis","Voicu Popescu"],"pdf_url":"https://arxiv.org/pdf/2411.19730v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19628v1","updated":"2024-11-29T11:24:23Z","published":"2024-11-29T11:24:23Z","title":"Accelerating Multimodal Large Language Models via Dynamic Visual-Token\n Exit and the Empirical Findings","summary":" The excessive use of visual tokens in existing Multimodal Large Language\nModels (MLLMs) often exhibits obvious redundancy and brings in prohibitively\nexpensive computation. To gain insights into this problem, we first conduct\nextensive empirical studies on the attention behaviors of MLLMs, and summarize\nthree main inference stages in MLLMs: (i) Early fusion between tokens is first\naccomplished quickly. (ii) Intra-modality modeling then comes to play. (iii)\nMultimodal reasoning resumes and lasts until the end of inference. In\nparticular, we reveal that visual tokens will stop contributing to reasoning\nwhen the text tokens receive enough image information, yielding obvious visual\nredundancy. Based on these generalized observations, we propose a simple yet\neffective method to improve the efficiency of MLLMs, termed dynamic\nvisual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive\nthe text token status and decide the removal of all visual tokens after a\ncertain layer, thereby addressing the observed visual redundancy. To validate\nVTE, we apply it to a set of MLLMs, including LLaVA, VILA, Eagle and InternVL,\nand conduct extensive experiments on a bunch of benchmarks. The experiment\nresults not only show the effectiveness of our VTE in improving MLLMs'\nefficiency, but also yield the general modeling patterns of MLLMs, well\nfacilitating the in-depth understanding of MLLMs. 
Our code is anonymously\nreleased at https://github.com/DoubtedSteam/DyVTE.\n","authors":["Qiong Wu","Wenhao Lin","Weihao Ye","Yiyi Zhou","Xiaoshuai Sun","Rongrong Ji"],"pdf_url":"https://arxiv.org/pdf/2411.19628v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19537v1","updated":"2024-11-29T08:29:25Z","published":"2024-11-29T08:29:25Z","title":"Deepfake Media Generation and Detection in the Generative AI Era: A\n Survey and Outlook","summary":" With the recent advancements in generative modeling, the realism of deepfake\ncontent has been increasing at a steady pace, even reaching the point where\npeople often fail to detect manipulated media content online, thus being\ndeceived into various kinds of scams. In this paper, we survey deepfake\ngeneration and detection techniques, including the most recent developments in\nthe field, such as diffusion models and Neural Radiance Fields. Our literature\nreview covers all deepfake media types, comprising image, video, audio and\nmultimodal (audio-visual) content. We identify various kinds of deepfakes,\naccording to the procedure used to alter or generate the fake content. We\nfurther construct a taxonomy of deepfake generation and detection methods,\nillustrating the important groups of methods and the domains where these\nmethods are applied. Next, we gather datasets used for deepfake detection and\nprovide updated rankings of the best performing deepfake detectors on the most\npopular datasets. In addition, we develop a novel multimodal benchmark to\nevaluate deepfake detectors on out-of-distribution content. The results\nindicate that state-of-the-art detectors fail to generalize to deepfake content\ngenerated by unseen deepfake generators. Finally, we propose future directions\nto obtain robust and powerful deepfake detectors. 
Our project page and new\nbenchmark are available at https://github.com/CroitoruAlin/biodeep.\n","authors":["Florinel-Alin Croitoru","Andrei-Iulian Hiji","Vlad Hondru","Nicolae Catalin Ristea","Paul Irofti","Marius Popescu","Cristian Rusu","Radu Tudor Ionescu","Fahad Shahbaz Khan","Mubarak Shah"],"pdf_url":"https://arxiv.org/pdf/2411.19537v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19522v1","updated":"2024-11-29T07:40:58Z","published":"2024-11-29T07:40:58Z","title":"Subjective and Objective Quality Assessment Methods of Stereoscopic\n Videos with Visibility Affecting Distortions","summary":" We present two major contributions in this work: 1) we create a full HD\nresolution stereoscopic (S3D) video dataset comprised of 12 reference and 360\ndistorted videos. The test stimuli are produced by simulating the five levels\nof fog and haze ambiances on the pristine left and right video sequences. We\nperform subjective analysis on the created video dataset with 24 viewers and\ncompute Difference Mean Opinion Scores (DMOS) as quality representative of the\ndataset, 2) an Opinion Unaware (OU) and Distortion Unaware (DU) video quality\nassessment model is developed for S3D videos. We construct cyclopean frames\nfrom the individual views of an S3D video and partition them into\nnonoverlapping blocks. We analyze the Natural Scene Statistics (NSS) of all\npatches of pristine and test videos, and empirically model the NSS features\nwith Univariate Generalized Gaussian Distribution (UGGD). We compute UGGD model\nparameters (\\alpha, \\beta) at multiple spatial scales and multiple\norientations of spherical steerable pyramid decomposition and show that the\nUGGD parameters are distortion discriminable. Further, we perform Multivariate\nGaussian (MVG) modeling on the pristine and distorted video feature sets and\ncompute the corresponding mean vectors and covariance matrices of MVG fits. 
We\ncompute the Bhattacharyya distance measure between mean vectors and covariance\nmatrices to estimate the perceptual deviation of a test video from pristine\nvideo set. Finally, we pool both distance measures to estimate the overall\nquality score of an S3D video. The performance of the proposed objective\nalgorithm is verified on the popular S3D video datasets such as IRCCYN,\nLFOVIAS3DPh1, LFOVIAS3DPh2 and the proposed VAD stereo dataset. The algorithm\ndelivers consistent performance across all datasets and shows competitive\nperformance against off-the-shelf 2D and 3D image and video quality assessment\nalgorithms.\n","authors":["Sria Biswas","Balasubramanyam Appina","Priyanka Kokil","Sumohana S Channappayya"],"pdf_url":"https://arxiv.org/pdf/2411.19522v1.pdf","comment":"13 pages"}]},"2024-11-28T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2410.02381v3","updated":"2024-11-28T23:46:52Z","published":"2024-10-03T11:01:25Z","title":"MetaMetrics: Calibrating Metrics For Generation Tasks Using Human\n Preferences","summary":" Understanding the quality of a performance evaluation metric is crucial for\nensuring that model outputs align with human preferences. However, it remains\nunclear how well each metric captures the diverse aspects of these preferences,\nas metrics often excel in one particular area but not across all dimensions. To\naddress this, it is essential to systematically calibrate metrics to specific\naspects of human preference, catering to the unique characteristics of each\naspect. We introduce MetaMetrics, a calibrated meta-metric designed to evaluate\ngeneration tasks across different modalities in a supervised manner.\nMetaMetrics optimizes the combination of existing metrics to enhance their\nalignment with human preferences. Our metric demonstrates flexibility and\neffectiveness in both language and vision downstream tasks, showing significant\nbenefits across various multilingual and multi-domain scenarios. 
MetaMetrics\naligns closely with human preferences and is highly extendable and easily\nintegrable into any application. This makes MetaMetrics a powerful tool for\nimproving the evaluation of generation tasks, ensuring that metrics are more\nrepresentative of human judgment across diverse contexts.\n","authors":["Genta Indra Winata","David Anugraha","Lucky Susanto","Garry Kuwanto","Derry Tanti Wijaya"],"pdf_url":"https://arxiv.org/pdf/2410.02381v3.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2410.12705v3","updated":"2024-11-28T22:47:21Z","published":"2024-10-16T16:11:49Z","title":"WorldCuisines: A Massive-Scale Benchmark for Multilingual and\n Multicultural Visual Question Answering on Global Cuisines","summary":" Vision Language Models (VLMs) often struggle with culture-specific knowledge,\nparticularly in languages other than English and in underrepresented cultural\ncontexts. To evaluate their understanding of such knowledge, we introduce\nWorldCuisines, a massive-scale benchmark for multilingual and multicultural,\nvisually grounded language understanding. This benchmark includes a visual\nquestion answering (VQA) dataset with text-image pairs across 30 languages and\ndialects, spanning 9 language families and featuring over 1 million data\npoints, making it the largest multicultural VQA benchmark to date. It includes\ntasks for identifying dish names and their origins. We provide evaluation\ndatasets in two sizes (12k and 60k instances) alongside a training dataset (1\nmillion instances). Our findings show that while VLMs perform better with\ncorrect location context, they struggle with adversarial contexts and\npredicting specific regional cuisines and languages. 
To support future\nresearch, we release a knowledge base with annotated food entries and images\nalong with the VQA data.\n","authors":["Genta Indra Winata","Frederikus Hudi","Patrick Amadeus Irawan","David Anugraha","Rifki Afina Putri","Yutong Wang","Adam Nohejl","Ubaidillah Ariq Prathama","Nedjma Ousidhoum","Afifa Amriani","Anar Rzayev","Anirban Das","Ashmari Pramodya","Aulia Adila","Bryan Wilie","Candy Olivia Mawalim","Ching Lam Cheng","Daud Abolade","Emmanuele Chersoni","Enrico Santus","Fariz Ikhwantri","Garry Kuwanto","Hanyang Zhao","Haryo Akbarianto Wibowo","Holy Lovenia","Jan Christian Blaise Cruz","Jan Wira Gotama Putra","Junho Myung","Lucky Susanto","Maria Angelica Riera Machin","Marina Zhukova","Michael Anugraha","Muhammad Farid Adilazuarda","Natasha Santosa","Peerat Limkonchotiwat","Raj Dabre","Rio Alexander Audino","Samuel Cahyawijaya","Shi-Xiong Zhang","Stephanie Yulia Salim","Yi Zhou","Yinxuan Gui","David Ifeoluwa Adelani","En-Shiun Annie Lee","Shogo Okada","Ayu Purwarianti","Alham Fikri Aji","Taro Watanabe","Derry Tanti Wijaya","Alice Oh","Chong-Wah Ngo"],"pdf_url":"https://arxiv.org/pdf/2410.12705v3.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2411.10724v2","updated":"2024-11-28T22:37:57Z","published":"2024-11-16T07:14:32Z","title":"HJ-Ky-0.1: an Evaluation Dataset for Kyrgyz Word Embeddings","summary":" One of the key tasks in modern applied computational linguistics is\nconstructing word vector representations (word embeddings), which are widely\nused to address natural language processing tasks such as sentiment analysis,\ninformation extraction, and more. To choose an appropriate method for\ngenerating these word embeddings, quality assessment techniques are often\nnecessary. A standard approach involves calculating distances between vectors\nfor words with expert-assessed 'similarity'. 
This work introduces the first\n'silver standard' dataset for such tasks in the Kyrgyz language, alongside\ntraining corresponding models and validating the dataset's suitability through\nquality evaluation metrics.\n","authors":["Anton Alekseev","Gulnara Kabaeva"],"pdf_url":"https://arxiv.org/pdf/2411.10724v2.pdf","comment":"The translation of the 2023 paper into English"},{"id":"http://arxiv.org/abs/2401.00820v2","updated":"2024-11-28T22:01:57Z","published":"2024-01-01T17:32:28Z","title":"A Computational Framework for Behavioral Assessment of LLM Therapists","summary":" The emergence of large language models (LLMs) like ChatGPT has increased\ninterest in their use as therapists to address mental health challenges and the\nwidespread lack of access to care. However, experts have emphasized the\ncritical need for systematic evaluation of LLM-based mental health\ninterventions to accurately assess their capabilities and limitations. Here, we\npropose BOLT, a proof-of-concept computational framework to systematically\nassess the conversational behavior of LLM therapists. We quantitatively measure\nLLM behavior across 13 psychotherapeutic approaches with in-context learning\nmethods. Then, we compare the behavior of LLMs against high- and low-quality\nhuman therapy. Our analysis based on Motivational Interviewing therapy reveals\nthat LLMs often resemble behaviors more commonly exhibited in low-quality\ntherapy rather than high-quality therapy, such as offering a higher degree of\nproblem-solving advice when clients share emotions. However, unlike low-quality\ntherapy, LLMs reflect significantly more upon clients' needs and strengths. 
Our\nfindings caution that LLM therapists still require further research for\nconsistent, high-quality care.\n","authors":["Yu Ying Chiu","Ashish Sharma","Inna Wanyin Lin","Tim Althoff"],"pdf_url":"https://arxiv.org/pdf/2401.00820v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19378v1","updated":"2024-11-28T21:07:22Z","published":"2024-11-28T21:07:22Z","title":"Libra: Leveraging Temporal Images for Biomedical Radiology Analysis","summary":" Radiology report generation (RRG) is a challenging task, as it requires a\nthorough understanding of medical images, integration of multiple temporal\ninputs, and accurate report generation. Effective interpretation of medical\nimages, such as chest X-rays (CXRs), demands sophisticated visual-language\nreasoning to map visual findings to structured reports. Recent studies have\nshown that multimodal large language models (MLLMs) can acquire multimodal\ncapabilities by aligning with pre-trained vision encoders. However, current\napproaches predominantly focus on single-image analysis or utilise rule-based\nsymbolic processing to handle multiple images, thereby overlooking the\nessential temporal information derived from comparing current images with prior\nones. To overcome this critical limitation, we introduce Libra, a\ntemporal-aware MLLM tailored for CXR report generation using temporal images.\nLibra integrates a radiology-specific image encoder with a MLLM and utilises a\nnovel Temporal Alignment Connector to capture and synthesise temporal\ninformation of images across different time points with unprecedented\nprecision. Extensive experiments show that Libra achieves new state-of-the-art\nperformance among the same parameter scale MLLMs for RRG tasks on the\nMIMIC-CXR. Specifically, Libra improves the RadCliQ metric by 12.9% and makes\nsubstantial gains across all lexical metrics compared to previous models.\n","authors":["Xi Zhang","Zaiqiao Meng","Jake Lever","Edmond S. L. 
Ho"],"pdf_url":"https://arxiv.org/pdf/2411.19378v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.19134v2","updated":"2024-11-28T20:20:23Z","published":"2024-09-27T20:32:42Z","title":"Confidential Prompting: Protecting User Prompts from Cloud LLM Providers","summary":" Our work tackles the challenge of securing user inputs in cloud-hosted large\nlanguage model (LLM) serving while ensuring output invariance, model\nconfidentiality, and compute efficiency. We introduce secure multi-party\ndecoding (SMD), which leverages confidential computing to confine user prompts\nto a trusted execution environment (TEE), namely a confidential virtual machine\n(CVM), while allowing service providers to generate tokens efficiently. We also\nintroduce a novel cryptographic method, prompt obfuscation (PO), to ensure\nrobustness against reconstruction attacks on SMD. We demonstrate that our\napproach preserves both prompt confidentiality and LLM serving efficiency. Our\nsolution can enable privacy-preserving cloud LLM serving that handles sensitive\nprompts, such as clinical records, financial data, and personal information.\n","authors":["In Gim","Caihua Li","Lin Zhong"],"pdf_url":"https://arxiv.org/pdf/2409.19134v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19360v1","updated":"2024-11-28T20:14:47Z","published":"2024-11-28T20:14:47Z","title":"DENIAHL: In-Context Features Influence LLM Needle-In-A-Haystack\n Abilities","summary":" The Needle-in-a-haystack (NIAH) test is a general task used to assess\nlanguage models' (LMs') abilities to recall particular information from long\ninput context. This framework however does not provide a means of analyzing\nwhat factors, beyond context length, contribute to LMs' abilities or\ninabilities to separate and recall needles from their haystacks. 
To provide a\nsystematic means of assessing what features contribute to LMs' NIAH\ncapabilities, we developed a synthetic benchmark called DENIAHL (Data-oriented\nEvaluation of NIAH for LLM's). Our work expands on previous NIAH studies by\nablating NIAH features beyond typical context length including data type, size,\nand patterns. We find stark differences between GPT-3.5 and LLaMA 2-7B's\nperformance on DENIAHL, and drops in recall performance when features like item\nsize are increased, and to some degree when data type is changed from numbers\nto letters. This has implications for increasingly large context models,\ndemonstrating factors beyond item-number impact NIAH capabilities.\n","authors":["Hui Dai","Dan Pechi","Xinyi Yang","Garvit Banga","Raghav Mantri"],"pdf_url":"https://arxiv.org/pdf/2411.19360v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.20739v3","updated":"2024-11-28T19:54:09Z","published":"2024-10-28T05:08:08Z","title":"Gender Bias in LLM-generated Interview Responses","summary":" LLMs have emerged as a promising tool for assisting individuals in diverse\ntext-generation tasks, including job-related texts. However, LLM-generated\nanswers have been increasingly found to exhibit gender bias. This study\nevaluates three LLMs (GPT-3.5, GPT-4, Claude) to conduct a multifaceted audit\nof LLM-generated interview responses across models, question types, and jobs,\nand their alignment with two gender stereotypes. Our findings reveal that\ngender bias is consistent, and closely aligned with gender stereotypes and the\ndominance of jobs. 
Overall, this study contributes to the systematic\nexamination of gender bias in LLM-generated interview responses, highlighting\nthe need for a mindful approach to mitigate such biases in related\napplications.\n","authors":["Haein Kong","Yongsu Ahn","Sangyub Lee","Yunho Maeng"],"pdf_url":"https://arxiv.org/pdf/2410.20739v3.pdf","comment":"Accepted to NeurIPS 2024, SoLaR workshop"},{"id":"http://arxiv.org/abs/2411.19346v1","updated":"2024-11-28T19:48:54Z","published":"2024-11-28T19:48:54Z","title":"CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image\n Collections","summary":" In the era of foundation models, CLIP has emerged as a powerful tool for\naligning text and visual modalities into a common embedding space. However, the\nalignment objective used to train CLIP often results in subpar visual features\nfor fine-grained tasks. In contrast, SSL-pretrained models like DINO excel at\nextracting rich visual features due to their specialized training paradigm.\nYet, these SSL models require an additional supervised linear probing step,\nwhich relies on fully labeled data which is often expensive and difficult to\nobtain at scale. In this paper, we propose a label-free prompt-tuning method\nthat leverages the rich visual features of self-supervised learning models\n(DINO) and the broad textual knowledge of large language models (LLMs) to\nlargely enhance CLIP-based image classification performance using unlabeled\nimages. Our approach unfolds in three key steps: (1) We generate robust textual\nfeature embeddings that more accurately represent object classes by leveraging\nclass-specific descriptions from LLMs, enabling more effective zero-shot\nclassification compared to CLIP's default name-specific prompts. (2) These\ntextual embeddings are then used to produce pseudo-labels to train an alignment\nmodule that integrates the complementary strengths of LLM description-based\ntextual embeddings and DINO's visual features. 
(3) Finally, we prompt-tune\nCLIP's vision encoder through DINO-assisted supervision using the trained\nalignment module. This three-step process allows us to harness the best of\nvisual and textual foundation models, resulting in a powerful and efficient\napproach that surpasses state-of-the-art label-free classification methods.\nNotably, our framework, NoLA (No Labels Attached), achieves an average absolute\ngain of 3.6% over the state-of-the-art LaFter across 11 diverse image\nclassification datasets.\n","authors":["Mohamed Fazli Imam","Rufael Fedaku Marew","Jameel Hassan","Mustansar Fiaz","Alham Fikri Aji","Hisham Cholakkal"],"pdf_url":"https://arxiv.org/pdf/2411.19346v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.03136v2","updated":"2024-11-28T19:47:26Z","published":"2024-10-04T04:23:36Z","title":"Deliberate Reasoning for LLMs as Structure-aware Planning with Accurate\n World Model","summary":" Enhancing the reasoning capabilities of large language models (LLMs) remains\na key challenge, especially for tasks that require complex, multi-step\ndecision-making. Humans excel at these tasks by leveraging deliberate planning\nwith an internal world model to simulate the potential outcomes of various\nactions. Inspired by this, we propose a novel multi-step reasoning framework\nfor LLMs, referred to as Structure-aware Planning with Accurate World Model\n(SWAP). Unlike previous approaches that rely solely on Chain-of-Thought (CoT)\nreasoning in natural language, SWAP incorporates structural information to\nguide the reasoning process via a world model and provides a soft verification\nmechanism over the steps. Moreover, SWAP overcomes the challenge of accurate\nworld state predictions in complex reasoning tasks by introducing a\nGenerator-Discriminator architecture, which enables more reliable world\nmodeling. 
Specifically, the generator predicts the next state, and the\ndiscriminator ensures alignment with the logical consistency required by the\nproblem context. SWAP also encourages the policy model to explore a broad range\nof potential actions to prevent premature convergence. By resolving the\nbottlenecks of generation diversity for both actions and states using\ndiversity-based modeling (DBM) and improving discrimination accuracy through\ncontrastive ranking (CR), SWAP significantly enhances the reasoning performance\nof LLMs. We evaluate SWAP across diverse reasoning-intensive benchmarks\nincluding math reasoning, logical reasoning, and coding tasks. Extensive\nexperiments demonstrate that SWAP achieves substantial improvements over the\nbaselines and consistently outperforms existing methods.\n","authors":["Siheng Xiong","Ali Payani","Yuan Yang","Faramarz Fekri"],"pdf_url":"https://arxiv.org/pdf/2410.03136v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19331v1","updated":"2024-11-28T19:00:03Z","published":"2024-11-28T19:00:03Z","title":"Talking to DINO: Bridging Self-Supervised Vision Backbones with Language\n for Open-Vocabulary Segmentation","summary":" Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form\ntextual concepts without predefined training classes. While existing\nvision-language models such as CLIP can generate segmentation masks by\nleveraging coarse spatial information from Vision Transformers, they face\nchallenges in spatial localization due to their global alignment of image and\ntext features. Conversely, self-supervised visual models like DINO excel in\nfine-grained visual encoding but lack integration with language. To bridge this\ngap, we present Talk2DINO, a novel hybrid approach that combines the spatial\naccuracy of DINOv2 with the language understanding of CLIP. 
Our approach aligns\nthe textual embeddings of CLIP to the patch-level features of DINOv2 through a\nlearned mapping function without the need to fine-tune the underlying\nbackbones. At training time, we exploit the attention maps of DINOv2 to\nselectively align local visual patches with textual embeddings. We show that\nthe powerful semantic and localization abilities of Talk2DINO can enhance the\nsegmentation process, resulting in more natural and less noisy segmentations,\nand that our approach can also effectively distinguish foreground objects from\nthe background. Experimental results demonstrate that Talk2DINO achieves\nstate-of-the-art performance across several unsupervised OVS benchmarks. Source\ncode and models are publicly available at:\nhttps://lorebianchi98.github.io/Talk2DINO/.\n","authors":["Luca Barsellotti","Lorenzo Bianchi","Nicola Messina","Fabio Carrara","Marcella Cornia","Lorenzo Baraldi","Fabrizio Falchi","Rita Cucchiara"],"pdf_url":"https://arxiv.org/pdf/2411.19331v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19295v1","updated":"2024-11-28T18:04:31Z","published":"2024-11-28T18:04:31Z","title":"Extracting Information in a Low-resource Setting: Case Study on\n Bioinformatics Workflows","summary":" Bioinformatics workflows are essential for complex biological data analyses\nand are often described in scientific articles with source code in public\nrepositories. Extracting detailed workflow information from articles can\nimprove accessibility and reusability but is hindered by limited annotated\ncorpora. To address this, we framed the problem as a low-resource extraction\ntask and tested four strategies: 1) creating a tailored annotated corpus, 2)\nfew-shot named-entity recognition (NER) with an autoregressive language model,\n3) NER using masked language models with existing and new corpora, and 4)\nintegrating workflow knowledge into NER models. 
Using BioToFlow, a new corpus\nof 52 articles annotated with 16 entities, a SciBERT-based NER model achieved a\n70.4 F-measure, comparable to inter-annotator agreement. While knowledge\nintegration improved performance for specific entities, it was less effective\nacross the entire information schema. Our results demonstrate that\nhigh-performance information extraction for bioinformatics workflows is\nachievable.\n","authors":["Clémence Sebe","Sarah Cohen-Boulakia","Olivier Ferret","Aurélie Névéol"],"pdf_url":"https://arxiv.org/pdf/2411.19295v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19244v1","updated":"2024-11-28T16:32:02Z","published":"2024-11-28T16:32:02Z","title":"Consolidating and Developing Benchmarking Datasets for the Nepali\n Natural Language Understanding Tasks","summary":" The Nepali language has distinct linguistic features, especially its complex\nscript (Devanagari script), morphology, and various dialects, which pose a\nunique challenge for natural language processing (NLP) evaluation. While the\nNepali Language Understanding Evaluation (Nep-gLUE) benchmark provides a\nfoundation for evaluating models, it remains limited in scope, covering four\ntasks. This restricts their utility for comprehensive assessments of NLP\nmodels. To address this limitation, we introduce eight new datasets, creating a\nnew benchmark, the Nepali Language Understanding Evaluation (NLUE) benchmark,\nwhich covers a total of 12 tasks for evaluating the performance of models\nacross a diverse set of Natural Language Understanding (NLU) tasks. The added\ntasks include single-sentence classification, similarity and paraphrase tasks,\nand Natural Language Inference (NLI) tasks. On evaluating the models using\nadded tasks, we observe that the existing models fall short in handling complex\nNLU tasks effectively. 
This expanded benchmark sets a new standard for\nevaluating, comparing, and advancing models, contributing significantly to the\nbroader goal of advancing NLP research for low-resource languages.\n","authors":["Jinu Nyachhyon","Mridul Sharma","Prajwal Thapa","Bal Krishna Bal"],"pdf_url":"https://arxiv.org/pdf/2411.19244v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19240v1","updated":"2024-11-28T16:20:25Z","published":"2024-11-28T16:20:25Z","title":"How far can bias go? -- Tracing bias from pretraining data to alignment","summary":" As LLMs are increasingly integrated into user-facing applications, addressing\nbiases that perpetuate societal inequalities is crucial. While much work has\ngone into measuring or mitigating biases in these models, fewer studies have\ninvestigated their origins. Therefore, this study examines the correlation\nbetween gender-occupation bias in pre-training data and their manifestation in\nLLMs, focusing on the Dolma dataset and the OLMo model. Using zero-shot\nprompting and token co-occurrence analyses, we explore how biases in training\ndata influence model outputs. Our findings reveal that biases present in\npre-training data are amplified in model outputs. The study also examines the\neffects of prompt types, hyperparameters, and instruction-tuning on bias\nexpression, finding instruction-tuning partially alleviating representational\nbias while still maintaining overall stereotypical gender associations, whereas\nhyperparameters and prompting variation have a lesser effect on bias\nexpression. 
Our research traces bias throughout the LLM development pipeline\nand underscores the importance of mitigating bias at the pretraining stage.\n","authors":["Marion Thaler","Abdullatif Köksal","Alina Leidinger","Anna Korhonen","Hinrich Schütze"],"pdf_url":"https://arxiv.org/pdf/2411.19240v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19203v1","updated":"2024-11-28T15:23:12Z","published":"2024-11-28T15:23:12Z","title":"An Extensive Evaluation of Factual Consistency in Large Language Models\n for Data-to-Text Generation","summary":" Large Language Models (LLMs) have shown exceptional performance across\nvarious Data-to-Text Generation (DTG) tasks. However, generating factually\nconsistent text in DTG remains challenging for LLMs. Despite this, in-depth\nevaluations of LLM factual consistency for DTG remain missing in the current\nliterature. This paper addresses this gap by providing an extensive evaluation\nof factual consistency in LLMs for DTG. Our evaluation covers five widely used\nDTG datasets (E2E, ViGGo, WikiTableText, DART, and WebNLG) and five prominent\nLLM families (T5, BART, OPT, BLOOM, and Llama 2). To ensure a thorough\nevaluation of factual consistency, we use four state-of-the-art automatic\nmetrics and include essential human assessments. Our extensive evaluations\nreveals three key findings regarding factual consistency in LLMs for DTG.\nFirst, Llama 2 often excels in generating factually consistent text, although\nsmaller models like T5 and BART can achieve strong factual consistency on\nlarger, lexically less-diverse datasets. Second, the average rate of change\n(AROC) indicates that increasing model size (number of model trainable\nparameters) generally enhances factual consistency of LLMs in DTG. 
Third, we\nobserve that source-reference divergence (i.e., when the reference text\ndiverges semantically from the source) typically reduces the factual\nconsistency of LLMs in DTG.\n","authors":["Joy Mahapatra","Utpal Garain"],"pdf_url":"https://arxiv.org/pdf/2411.19203v1.pdf","comment":"15 pages"},{"id":"http://arxiv.org/abs/2212.03749v3","updated":"2024-11-28T15:06:32Z","published":"2022-12-07T16:20:50Z","title":"Memorization of Named Entities in Fine-tuned BERT Models","summary":" Privacy preserving deep learning is an emerging field in machine learning\nthat aims to mitigate the privacy risks in the use of deep neural networks. One\nsuch risk is training data extraction from language models that have been\ntrained on datasets, which contain personal and privacy sensitive information.\nIn our study, we investigate the extent of named entity memorization in\nfine-tuned BERT models. We use single-label text classification as\nrepresentative downstream task and employ three different fine-tuning setups in\nour experiments, including one with Differential Privacy (DP). We create a\nlarge number of text samples from the fine-tuned BERT models utilizing a custom\nsequential sampling strategy with two prompting strategies. We search in these\nsamples for named entities and check if they are also present in the\nfine-tuning datasets. We experiment with two benchmark datasets in the domains\nof emails and blogs. We show that the application of DP has a detrimental\neffect on the text generation capabilities of BERT. Furthermore, we show that a\nfine-tuned BERT does not generate more named entities specific to the\nfine-tuning dataset than a BERT model that is pre-trained only. 
This suggests\nthat BERT is unlikely to emit personal or privacy sensitive named entities.\nOverall, our results are important to understand to what extent BERT-based\nservices are prone to training data extraction attacks.\n","authors":["Andor Diera","Nicolas Lell","Aygul Garifullina","Ansgar Scherp"],"pdf_url":"https://arxiv.org/pdf/2212.03749v3.pdf","comment":"published at CD-MAKE 2023"},{"id":"http://arxiv.org/abs/2411.19187v1","updated":"2024-11-28T14:47:55Z","published":"2024-11-28T14:47:55Z","title":"Beyond Logit Lens: Contextual Embeddings for Robust Hallucination\n Detection & Grounding in VLMs","summary":" The rapid development of Large Multimodal Models (LMMs) has significantly\nadvanced multimodal understanding by harnessing the language abilities of Large\nLanguage Models (LLMs) and integrating modality-specific encoders. However,\nLMMs are plagued by hallucinations that limit their reliability and adoption.\nWhile traditional methods to detect and mitigate these hallucinations often\ninvolve costly training or rely heavily on external models, recent approaches\nutilizing internal model features present a promising alternative. In this\npaper, we critically assess the limitations of the state-of-the-art\ntraining-free technique, the logit lens, in handling generalized visual\nhallucinations. We introduce a refined method that leverages contextual token\nembeddings from middle layers of LMMs. This approach significantly improves\nhallucination detection and grounding across diverse categories, including\nactions and OCR, while also excelling in tasks requiring contextual\nunderstanding, such as spatial relations and attribute comparison. Our novel\ngrounding technique yields highly precise bounding boxes, facilitating a\ntransition from Zero-Shot Object Segmentation to Grounded Visual Question\nAnswering. 
Our contributions pave the way for more reliable and interpretable\nmultimodal models.\n","authors":["Anirudh Phukan"," Divyansh","Harshit Kumar Morj"," Vaishnavi","Apoorv Saxena","Koustava Goswami"],"pdf_url":"https://arxiv.org/pdf/2411.19187v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19140v1","updated":"2024-11-28T13:41:44Z","published":"2024-11-28T13:41:44Z","title":"Examining Multimodal Gender and Content Bias in ChatGPT-4o","summary":" This study investigates ChatGPT-4o's multimodal content generation,\nhighlighting significant disparities in its treatment of sexual content and\nnudity versus violent and drug-related themes. Detailed analysis reveals that\nChatGPT-4o consistently censors sexual content and nudity, while showing\nleniency towards violence and drug use. Moreover, a pronounced gender bias\nemerges, with female-specific content facing stricter regulation compared to\nmale-specific content. This disparity likely stems from media scrutiny and\npublic backlash over past AI controversies, prompting tech companies to impose\nstringent guidelines on sensitive issues to protect their reputations. Our\nfindings emphasize the urgent need for AI systems to uphold genuine ethical\nstandards and accountability, transcending mere political correctness. This\nresearch contributes to the understanding of biases in AI-driven language and\nmultimodal models, calling for more balanced and ethical content moderation\npractices.\n","authors":["Roberto Balestri"],"pdf_url":"https://arxiv.org/pdf/2411.19140v1.pdf","comment":"17 pages, 4 figures, 3 tables. Conference: \"14th International\n Conference on Artificial Intelligence, Soft Computing and Applications (AIAA\n 2024), London, 23-24 November 2024\" It will be published in the proceedings\n \"David C. Wyld et al. 
(Eds): IoTE, CNDC, DSA, AIAA, NLPTA, DPPR - 2024\""},{"id":"http://arxiv.org/abs/2411.16638v3","updated":"2024-11-28T13:33:53Z","published":"2024-11-25T18:15:15Z","title":"Do Automatic Factuality Metrics Measure Factuality? A Critical\n Evaluation","summary":" Modern LLMs can now produce highly readable abstractive summaries, to the\npoint where traditional automated metrics for evaluating summary quality, such\nas ROUGE, have become saturated. However, LLMs still sometimes introduce\nunwanted content into summaries, i.e., information inconsistent with or\nunsupported by their source. Measuring the occurrence of these often subtle\n``hallucinations'' automatically has proved to be challenging. This in turn has\nmotivated development of a variety of metrics intended to measure the factual\nconsistency of generated summaries against their source. But are these\napproaches measuring what they purport to do? In this work, we stress-test\nautomatic factuality metrics. Specifically, we investigate whether and to what\ndegree superficial attributes of summary texts suffice to predict\n``factuality'', finding that a (supervised) model using only such shallow\nfeatures is reasonably competitive with SOTA factuality scoring methods. We\nthen evaluate how factuality metrics respond to factual corrections in\ninconsistent summaries and find that only a few show meaningful improvements.\nIn contrast, some metrics are more sensitive to benign, non-factual edits.\nMotivated by these insights, we show that one can ``game'' (most) automatic\nfactuality metrics, i.e., reliably inflate ``factuality'' scores by appending\ninnocuous sentences to generated summaries. Taken together, our results raise\nquestions about the degree to which we should rely on existing automated\nfactuality metrics and what exactly we want ``factuality metrics'' to measure.\n","authors":["Sanjana Ramprasad","Byron C. 
Wallace"],"pdf_url":"https://arxiv.org/pdf/2411.16638v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19113v1","updated":"2024-11-28T12:59:32Z","published":"2024-11-28T12:59:32Z","title":"Integration of Contextual Descriptors in Ontology Alignment for\n Enrichment of Semantic Correspondence","summary":" This paper proposes a novel approach to semantic ontology alignment using\ncontextual descriptors. A formalization was developed that enables the\nintegration of essential and contextual descriptors to create a comprehensive\nknowledge model. The hierarchical structure of the semantic approach and the\nmathematical apparatus for analyzing potential conflicts between concepts,\nparticularly in the example of \"Transparency\" and \"Privacy\" in the context of\nartificial intelligence, are demonstrated. Experimental studies showed a\nsignificant improvement in ontology alignment metrics after the implementation\nof contextual descriptors, especially in the areas of privacy, responsibility,\nand freedom & autonomy. The application of contextual descriptors achieved an\naverage overall improvement of approximately 4.36%. The results indicate the\neffectiveness of the proposed approach for more accurately reflecting the\ncomplexity of knowledge and its contextual dependence.\n","authors":["Eduard Manziuk","Oleksander Barmak","Pavlo Radiuk","Vladislav Kuznetsov","Iurii Krak","Sergiy Yakovlev"],"pdf_url":"https://arxiv.org/pdf/2411.19113v1.pdf","comment":"Ontology alignment, contextual descriptors, semantic matching,\n knowledge representation, essential descriptors, ontology integration,\n hierarchical structure, semantic heterogeneity, ethical AI"},{"id":"http://arxiv.org/abs/2411.19103v1","updated":"2024-11-28T12:38:42Z","published":"2024-11-28T12:38:42Z","title":"VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models","summary":" In this paper, we introduce an open-source Korean-English vision-language\nmodel (VLM), VARCO-VISION. 
We incorporate a step-by-step training strategy that\nallows a model learn both linguistic and visual information while preserving\nthe backbone model's knowledge. Our model demonstrates outstanding performance\nin diverse settings requiring bilingual image-text understanding and generation\nabilities compared to models of similar size. VARCO-VISION is also capable of\ngrounding, referring, and OCR, expanding its usage and potential applications\nfor real-world scenarios. In addition to the model, we release five Korean\nevaluation datasets, including four closed-set and one openset benchmarks. We\nanticipate that our milestone will broaden the opportunities for AI researchers\naiming to train VLMs. VARCO-VISION is available at\nhttps://huggingface.co/NCSOFT/VARCO-VISION-14B.\n","authors":["Jeongho Ju","Daeyoung Kim","SunYoung Park","Youngjune Kim"],"pdf_url":"https://arxiv.org/pdf/2411.19103v1.pdf","comment":"24 pages, 15 figures, 4 tables. Model weights at\n https://huggingface.co/NCSOFT/VARCO-VISION-14B. Benchmarks released at\n NCSOFT's HuggingFace repositories (K-MMBench, K-SEED, K-MMStar, K-DTCBench,\n K-LLaVA-W). VARCO-VISION is an open-source Korean-English VLM with OCR,\n grounding, and referring capabilities"},{"id":"http://arxiv.org/abs/2411.19096v1","updated":"2024-11-28T12:17:24Z","published":"2024-11-28T12:17:24Z","title":"Pralekha: An Indic Document Alignment Evaluation Benchmark","summary":" Mining parallel document pairs poses a significant challenge because existing\nsentence embedding models often have limited context windows, preventing them\nfrom effectively capturing document-level information. Another overlooked issue\nis the lack of concrete evaluation benchmarks comprising high-quality parallel\ndocument pairs for assessing document-level mining approaches, particularly for\nIndic languages. In this study, we introduce Pralekha, a large-scale benchmark\nfor document-level alignment evaluation. 
Pralekha includes over 2 million\ndocuments, with a 1:2 ratio of unaligned to aligned pairs, covering 11 Indic\nlanguages and English. Using Pralekha, we evaluate various document-level\nmining approaches across three dimensions: the embedding models, the\ngranularity levels, and the alignment algorithm. To address the challenge of\naligning documents using sentence and chunk-level alignments, we propose a\nnovel scoring method, Document Alignment Coefficient (DAC). DAC demonstrates\nsubstantial improvements over baseline pooling approaches, particularly in\nnoisy scenarios, achieving average gains of 20-30% in precision and 15-20% in\nF1 score. These results highlight DAC's effectiveness in parallel document\nmining for Indic languages.\n","authors":["Sanjay Suryanarayanan","Haiyue Song","Mohammed Safi Ur Rahman Khan","Anoop Kunchukuttan","Mitesh M. Khapra","Raj Dabre"],"pdf_url":"https://arxiv.org/pdf/2411.19096v1.pdf","comment":"Work in Progress"},{"id":"http://arxiv.org/abs/2407.04794v2","updated":"2024-11-28T11:28:39Z","published":"2024-07-05T18:09:06Z","title":"On Evaluating The Performance of Watermarked Machine-Generated Texts\n Under Adversarial Attacks","summary":" Large Language Models (LLMs) excel in various applications, including text\ngeneration and complex tasks. However, the misuse of LLMs raises concerns about\nthe authenticity and ethical implications of the content they produce, such as\ndeepfake news, academic fraud, and copyright infringement. Watermarking\ntechniques, which embed identifiable markers in machine-generated text, offer a\npromising solution to these issues by allowing for content verification and\norigin tracing. 
Unfortunately, the robustness of current LLM watermarking\nschemes under potential watermark removal attacks has not been comprehensively\nexplored.\n In this paper, to fill this gap, we first systematically comb the mainstream\nwatermarking schemes and removal attacks on machine-generated texts, and then\nwe categorize them into pre-text (before text generation) and post-text (after\ntext generation) classes so that we can conduct diversified analyses. In our\nexperiments, we evaluate eight watermarks (five pre-text, three post-text) and\ntwelve attacks (two pre-text, ten post-text) across 87 scenarios. Evaluation\nresults indicate that (1) KGW and Exponential watermarks offer high text\nquality and watermark retention but remain vulnerable to most attacks; (2)\nPost-text attacks are found to be more efficient and practical than pre-text\nattacks; (3) Pre-text watermarks are generally more imperceptible, as they do\nnot alter text fluency, unlike post-text watermarks; (4) Additionally, combined\nattack methods can significantly increase effectiveness, highlighting the need\nfor more robust watermarking solutions. Our study underscores the\nvulnerabilities of current techniques and the necessity for developing more\nresilient schemes.\n","authors":["Zesen Liu","Tianshuo Cong","Xinlei He","Qi Li"],"pdf_url":"https://arxiv.org/pdf/2407.04794v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19064v1","updated":"2024-11-28T11:24:43Z","published":"2024-11-28T11:24:43Z","title":"Way to Specialist: Closing Loop Between Specialized LLM and Evolving\n Domain Knowledge Graph","summary":" Large language models (LLMs) have demonstrated exceptional performance across\na wide variety of domains. Nonetheless, generalist LLMs continue to fall short\nin reasoning tasks necessitating specialized knowledge. 
Prior investigations\ninto specialized LLMs focused on domain-specific training, which entails\nsubstantial efforts in domain data acquisition and model parameter fine-tuning.\nTo address these challenges, this paper proposes the Way-to-Specialist (WTS)\nframework, which synergizes retrieval-augmented generation with knowledge\ngraphs (KGs) to enhance the specialized capability of LLMs in the absence of\nspecialized training. In distinction to existing paradigms that merely utilize\nexternal knowledge from general KGs or static domain KGs to prompt LLM for\nenhanced domain-specific reasoning, WTS proposes an innovative\n\"LLM$\\circlearrowright$KG\" paradigm, which achieves bidirectional enhancement\nbetween specialized LLM and domain knowledge graph (DKG). The proposed paradigm\nencompasses two closely coupled components: the DKG-Augmented LLM and the\nLLM-Assisted DKG Evolution. The former retrieves question-relevant domain\nknowledge from DKG and uses it to prompt LLM to enhance the reasoning\ncapability for domain-specific tasks; the latter leverages LLM to generate new\ndomain knowledge from processed tasks and use it to evolve DKG. WTS closes the\nloop between DKG-Augmented LLM and LLM-Assisted DKG Evolution, enabling\ncontinuous improvement in the domain specialization as it progressively answers\nand learns from domain-specific questions. We validate the performance of WTS\non 6 datasets spanning 5 domains. 
The experimental results show that WTS\nsurpasses the previous SOTA in 4 specialized domains and achieves a maximum\nperformance improvement of 11.3%.\n","authors":["Yutong Zhang","Lixing Chen","Shenghong Li","Nan Cao","Yang Shi","Jiaxin Ding","Zhe Qu","Pan Zhou","Yang Bai"],"pdf_url":"https://arxiv.org/pdf/2411.19064v1.pdf","comment":"Accepted by KDD 2025"},{"id":"http://arxiv.org/abs/2411.02018v2","updated":"2024-11-28T11:00:19Z","published":"2024-11-04T12:13:04Z","title":"Shortcut Learning in In-Context Learning: A Survey","summary":" Shortcut learning refers to the phenomenon where models employ simple,\nnon-robust decision rules in practical tasks, which hinders their\ngeneralization and robustness. With the rapid development of large language\nmodels (LLMs) in recent years, an increasing number of studies have shown the\nimpact of shortcut learning on LLMs. This paper provides a novel perspective to\nreview relevant research on shortcut learning in In-Context Learning (ICL). It\nconducts a detailed exploration of the types of shortcuts in ICL tasks, their\ncauses, available benchmarks, and strategies for mitigating shortcuts. 
Based on\ncorresponding observations, it summarizes the unresolved issues in existing\nresearch and attempts to outline the future research landscape of shortcut\nlearning.\n","authors":["Rui Song","Yingji Li","Lida Shi","Fausto Giunchiglia","Hao Xu"],"pdf_url":"https://arxiv.org/pdf/2411.02018v2.pdf","comment":"20 pages, 7 figures"},{"id":"http://arxiv.org/abs/2411.19038v1","updated":"2024-11-28T10:33:11Z","published":"2024-11-28T10:33:11Z","title":"DIESEL -- Dynamic Inference-Guidance via Evasion of Semantic Embeddings\n in LLMs","summary":" In recent years, conversational large language models (LLMs) have shown\ntremendous success in tasks such as casual conversation, question answering,\nand personalized dialogue, making significant advancements in domains like\nvirtual assistance, social interaction, and online customer engagement.\nHowever, they often generate responses that are not aligned with human values\n(e.g., ethical standards, safety, or social norms), leading to potentially\nunsafe or inappropriate outputs. While several techniques have been proposed to\naddress this problem, they come with a cost, requiring computationally\nexpensive training or dramatically increasing the inference time. In this\npaper, we present DIESEL, a lightweight inference guidance technique that can\nbe seamlessly integrated into any autoregressive LLM to semantically filter\nundesired concepts from the response. DIESEL can function either as a\nstandalone safeguard or as an additional layer of defense, enhancing response\nsafety by reranking the LLM's proposed tokens based on their similarity to\npredefined negative concepts in the latent space. This approach provides an\nefficient and effective solution for maintaining alignment with human values.\nOur evaluation demonstrates DIESEL's effectiveness on state-of-the-art\nconversational models (e.g., Llama 3), even in challenging jailbreaking\nscenarios that test the limits of response safety. 
We further show that DIESEL\ncan be generalized to use cases other than safety, providing a versatile\nsolution for general-purpose response filtering with minimal computational\noverhead.\n","authors":["Ben Ganon","Alon Zolfi","Omer Hofman","Inderjeet Singh","Hisashi Kojima","Yuval Elovici","Asaf Shabtai"],"pdf_url":"https://arxiv.org/pdf/2411.19038v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.10527v3","updated":"2024-11-28T10:23:42Z","published":"2024-02-16T09:29:38Z","title":"Assessing biomedical knowledge robustness in large language models by\n query-efficient sampling attacks","summary":" The increasing depth of parametric domain knowledge in large language models\n(LLMs) is fueling their rapid deployment in real-world applications.\nUnderstanding model vulnerabilities in high-stakes and knowledge-intensive\ntasks is essential for quantifying the trustworthiness of model predictions and\nregulating their use. The recent discovery of named entities as adversarial\nexamples (i.e. adversarial entities) in natural language processing tasks\nraises questions about their potential impact on the knowledge robustness of\npre-trained and finetuned LLMs in high-stakes and specialized domains. We\nexamined the use of type-consistent entity substitution as a template for\ncollecting adversarial entities for billion-parameter LLMs with biomedical\nknowledge. To this end, we developed an embedding-space attack based on\npowerscaled distance-weighted sampling to assess the robustness of their\nbiomedical knowledge with a low query budget and controllable coverage. Our\nmethod has favorable query efficiency and scaling over alternative approaches\nbased on random sampling and blackbox gradient-guided search, which we\ndemonstrated for adversarial distractor generation in biomedical question\nanswering. 
Subsequent failure mode analysis uncovered two regimes of\nadversarial entities on the attack surface with distinct characteristics and we\nshowed that entity substitution attacks can manipulate token-wise Shapley value\nexplanations, which become deceptive in this setting. Our approach complements\nstandard evaluations for high-capacity models and the results highlight the\nbrittleness of domain knowledge in LLMs.\n","authors":["R. Patrick Xian","Alex J. Lee","Satvik Lolla","Vincent Wang","Qiming Cui","Russell Ro","Reza Abbasi-Asl"],"pdf_url":"https://arxiv.org/pdf/2402.10527v3.pdf","comment":"31 pages incl. appendix, accepted by TMLR"},{"id":"http://arxiv.org/abs/2411.19017v1","updated":"2024-11-28T09:42:53Z","published":"2024-11-28T09:42:53Z","title":"A Survey on Automatic Online Hate Speech Detection in Low-Resource\n Languages","summary":" The expanding influence of social media platforms over the past decade has\nimpacted the way people communicate. The level of obscurity provided by social\nmedia and easy accessibility of the internet has facilitated the spread of hate\nspeech. The terms and expressions related to hate speech gets updated with\nchanging times which poses an obstacle to policy-makers and researchers in case\nof hate speech identification. With growing number of individuals using their\nnative languages to communicate with each other, hate speech in these\nlow-resource languages are also growing. Although, there is awareness about the\nEnglish-related approaches, much attention have not been provided to these\nlow-resource languages due to lack of datasets and online available data. This\narticle provides a detailed survey of hate speech detection in low-resource\nlanguages around the world with details of available datasets, features\nutilized and techniques used. 
This survey further discusses the prevailing\nsurveys, overlapping concepts related to hate speech, research challenges and\nopportunities.\n","authors":["Susmita Das","Arpita Dutta","Kingshuk Roy","Abir Mondal","Arnab Mukhopadhyay"],"pdf_url":"https://arxiv.org/pdf/2411.19017v1.pdf","comment":"34 pages, 12 figures"},{"id":"http://arxiv.org/abs/2405.14093v2","updated":"2024-11-28T09:18:10Z","published":"2024-05-23T01:43:54Z","title":"A Survey on Vision-Language-Action Models for Embodied AI","summary":" Deep learning has demonstrated remarkable success across many domains,\nincluding computer vision, natural language processing, and reinforcement\nlearning. Representative artificial neural networks in these fields span\nconvolutional neural networks, Transformers, and deep Q-networks. Built upon\nunimodal neural networks, numerous multi-modal models have been introduced to\naddress a range of tasks such as visual question answering, image captioning,\nand speech recognition. The rise of instruction-following robotic policies in\nembodied AI has spurred the development of a novel category of multi-modal\nmodels known as vision-language-action models (VLAs). Their multi-modality\ncapability has become a foundational element in robot learning. Various methods\nhave been proposed to enhance traits such as versatility, dexterity, and\ngeneralizability. Some models focus on refining specific components. Others aim\nto develop control policies adept at predicting low-level actions. Certain VLAs\nserve as high-level task planners capable of decomposing long-horizon tasks\ninto executable subtasks. Over the past few years, a myriad of VLAs have\nemerged, reflecting the rapid advancement of embodied AI. 
Therefore, it is\nimperative to capture the evolving landscape through a comprehensive survey.\n","authors":["Yueen Ma","Zixing Song","Yuzheng Zhuang","Jianye Hao","Irwin King"],"pdf_url":"https://arxiv.org/pdf/2405.14093v2.pdf","comment":"17 pages, a survey of vision-language-action models"},{"id":"http://arxiv.org/abs/2411.19007v1","updated":"2024-11-28T09:14:58Z","published":"2024-11-28T09:14:58Z","title":"Talking to oneself in CMC: a study of self replies in Wikipedia talk\n pages","summary":" This study proposes a qualitative analysis of self replies in Wikipedia talk\npages, more precisely when the first two messages of a discussion are written\nby the same user. This specific pattern occurs in more than 10% of threads with\ntwo messages or more and can be explained by a number of reasons. After a first\nexamination of the lexical specificities of second messages, we propose a seven\ncategories typology and use it to annotate two reference samples (English and\nFrench) of 100 threads each. Finally, we analyse and compare the performance of\nhuman annotators (who reach a reasonable global efficiency) and\ninstruction-tuned LLMs (which encounter important difficulties with several\ncategories).\n","authors":["Ludovic Tanguy","Céline Poudat","Lydia-Mai Ho-Dac"],"pdf_url":"https://arxiv.org/pdf/2411.19007v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.18990v1","updated":"2024-11-28T08:40:14Z","published":"2024-11-28T08:40:14Z","title":"USTCCTSU at SemEval-2024 Task 1: Reducing Anisotropy for Cross-lingual\n Semantic Textual Relatedness Task","summary":" Cross-lingual semantic textual relatedness task is an important research task\nthat addresses challenges in cross-lingual communication and text\nunderstanding. 
It helps establish semantic connections between different\nlanguages, crucial for downstream tasks like machine translation, multilingual\ninformation retrieval, and cross-lingual text understanding.Based on extensive\ncomparative experiments, we choose the XLM-R-base as our base model and use\npre-trained sentence representations based on whitening to reduce\nanisotropy.Additionally, for the given training data, we design a delicate data\nfiltering method to alleviate the curse of multilingualism. With our approach,\nwe achieve a 2nd score in Spanish, a 3rd in Indonesian, and multiple entries in\nthe top ten results in the competition's track C. We further do a comprehensive\nanalysis to inspire future research aimed at improving performance on\ncross-lingual tasks.\n","authors":["Jianjian Li","Shengwei Liang","Yong Liao","Hongping Deng","Haiyang Yu"],"pdf_url":"https://arxiv.org/pdf/2411.18990v1.pdf","comment":"8 pages, 3 figures"},{"id":"http://arxiv.org/abs/2311.16444v3","updated":"2024-11-28T08:37:37Z","published":"2023-11-28T02:51:13Z","title":"Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities\n Using Web Instructional Videos","summary":" We propose a novel benchmark for cross-view knowledge transfer of dense video\ncaptioning, adapting models from web instructional videos with exocentric views\nto an egocentric view. While dense video captioning (predicting time segments\nand their captions) is primarily studied with exocentric videos (e.g.,\nYouCook2), benchmarks with egocentric videos are restricted due to data\nscarcity. To overcome the limited video availability, transferring knowledge\nfrom abundant exocentric web videos is demanded as a practical approach.\nHowever, learning the correspondence between exocentric and egocentric views is\ndifficult due to their dynamic view changes. The web videos contain shots\nshowing either full-body or hand regions, while the egocentric view is\nconstantly shifting. 
This necessitates the in-depth study of cross-view\ntransfer under complex view changes. To this end, we first create a real-life\negocentric dataset (EgoYC2) whose captions follow the definition of YouCook2\ncaptions, enabling transfer learning between these datasets with access to\ntheir ground-truth. To bridge the view gaps, we propose a view-invariant\nlearning method using adversarial training, which consists of pre-training and\nfine-tuning stages. Our experiments confirm the effectiveness of overcoming the\nview change problem and knowledge transfer to egocentric views. Our benchmark\npushes the study of cross-view transfer into a new task domain of dense video\ncaptioning and envisions methodologies that describe egocentric videos in\nnatural language.\n","authors":["Takehiko Ohkawa","Takuma Yagi","Taichi Nishimura","Ryosuke Furuta","Atsushi Hashimoto","Yoshitaka Ushiku","Yoichi Sato"],"pdf_url":"https://arxiv.org/pdf/2311.16444v3.pdf","comment":"Accepted to WACV 2025"},{"id":"http://arxiv.org/abs/2409.15371v5","updated":"2024-11-28T08:15:05Z","published":"2024-09-19T10:26:42Z","title":"Bone: Block-Affine Adaptation of Large Language Models","summary":" Low-Rank Adaptation (LoRA) has achieved remarkable training results by\nfreezing the original weights and training only low-rank matrices, establishing\nitself as the predominant fine-tuning method for LLMs. Many LoRA variants have\nemerged, yet they lack a design tailored to the characteristics of LLM weights\nand fail to leverage the original weights effectively. To address the sparsity\nof LLM weights, and drawing inspiration from GQA and MQA, we propose\nBlock-Affine Adaptation (Bone), a novel PEFT technique distinct from LoRA. By\ndividing the original weights into multiple subspaces that share a single\nmatrix for weight updates, Bone simplifies the process by requiring the\ntrainable matrix to be initialized to zero, eliminating the need for complex\ninitialization as in some LoRA variants. 
Compared to LoRA, Bone significantly\nreduces memory usage and achieves faster computation. Evaluation of both NLU\nand NLG tasks demonstrates that Bone substantially outperforms LoRA and its\nvariants. Inspired by Pissa, we propose a new theory called \"Weight Guide\" to\nbetter utilize the information embedded in the original weights. This approach\nextracts valuable information through a linear transformation of the original\nweight matrix using a trainable matrix. To validate the effectiveness of\n\"Weight Guide\" we combined it with Bone to create a new structure called\nBlock-Affine Transformation (Bat), and ablation experiments confirmed the\neffectiveness of \"Weight Guide\".\n","authors":["Jiale Kang"],"pdf_url":"https://arxiv.org/pdf/2409.15371v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.19606v2","updated":"2024-11-28T08:09:05Z","published":"2024-09-29T07:57:07Z","title":"Hyper-Connections","summary":" We present hyper-connections, a simple yet effective method that can serve as\nan alternative to residual connections. This approach specifically addresses\ncommon drawbacks observed in residual connection variants, such as the seesaw\neffect between gradient vanishing and representation collapse. Theoretically,\nhyper-connections allow the network to adjust the strength of connections\nbetween features at different depths and dynamically rearrange layers. We\nconduct experiments focusing on the pre-training of large language models,\nincluding dense and sparse models, where hyper-connections show significant\nperformance improvements over residual connections. Additional experiments\nconducted on vision tasks also demonstrate similar improvements. 
We anticipate\nthat this method will be broadly applicable and beneficial across a wide range\nof AI problems.\n","authors":["Defa Zhu","Hongzhi Huang","Zihao Huang","Yutao Zeng","Yunyao Mao","Banggu Wu","Qiyang Min","Xun Zhou"],"pdf_url":"https://arxiv.org/pdf/2409.19606v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.18980v1","updated":"2024-11-28T08:02:25Z","published":"2024-11-28T08:02:25Z","title":"Zero-shot Slot Filling in the Age of LLMs for Dialogue Systems","summary":" Zero-shot slot filling is a well-established subtask of Natural Language\nUnderstanding (NLU). However, most existing methods primarily focus on\nsingle-turn text data, overlooking the unique complexities of conversational\ndialogue. Conversational data is highly dynamic, often involving abrupt topic\nshifts, interruptions, and implicit references that make it difficult to\ndirectly apply zero-shot slot filling techniques, even with the remarkable\ncapabilities of large language models (LLMs). This paper addresses these\nchallenges by proposing strategies for automatic data annotation with slot\ninduction and black-box knowledge distillation (KD) from a teacher LLM to a\nsmaller model, outperforming vanilla LLMs on internal datasets by 26% absolute\nincrease in F1 score. 
Additionally, we introduce an efficient system\narchitecture for call center product settings that surpasses off-the-shelf\nextractive models by 34% relative F1 score, enabling near real-time inference\non dialogue streams with higher accuracy, while preserving low latency.\n","authors":["Mansi Rana","Kadri Hacioglu","Sindhuja Gopalan","Maragathamani Boothalingam"],"pdf_url":"https://arxiv.org/pdf/2411.18980v1.pdf","comment":"To appear in Proceedings of COLING 2025"},{"id":"http://arxiv.org/abs/2406.14491v2","updated":"2024-11-28T06:51:20Z","published":"2024-06-20T16:55:33Z","title":"Instruction Pre-Training: Language Models are Supervised Multitask\n Learners","summary":" Unsupervised multitask pre-training has been the critical method behind the\nrecent success of language models (LMs). However, supervised multitask learning\nstill holds significant promise, as scaling it in the post-training stage\ntrends towards better generalization. In this paper, we explore supervised\nmultitask pre-training by proposing Instruction Pre-Training, a framework that\nscalably augments massive raw corpora with instruction-response pairs to\npre-train LMs. The instruction-response pairs are generated by an efficient\ninstruction synthesizer built on open-source models. In our experiments, we\nsynthesize 200M instruction-response pairs covering 40+ task categories to\nverify the effectiveness of Instruction Pre-Training. In pre-training from\nscratch, Instruction Pre-Training not only consistently enhances pre-trained\nbase models but also benefits more from further instruction tuning. In\ncontinual pre-training, Instruction Pre-Training enables Llama3-8B to be\ncomparable to or even outperform Llama3-70B. 
Our model, code, and data are\navailable at https://github.com/microsoft/LMOps.\n","authors":["Daixuan Cheng","Yuxian Gu","Shaohan Huang","Junyu Bi","Minlie Huang","Furu Wei"],"pdf_url":"https://arxiv.org/pdf/2406.14491v2.pdf","comment":"EMNLP 2024 Main Conference"},{"id":"http://arxiv.org/abs/2411.18279v2","updated":"2024-11-28T06:40:09Z","published":"2024-11-27T12:13:39Z","title":"Large Language Model-Brained GUI Agents: A Survey","summary":" GUIs have long been central to human-computer interaction, providing an\nintuitive and visually-driven way to access and interact with digital systems.\nThe advent of LLMs, particularly multimodal models, has ushered in a new era of\nGUI automation. They have demonstrated exceptional capabilities in natural\nlanguage understanding, code generation, and visual processing. This has paved\nthe way for a new generation of LLM-brained GUI agents capable of interpreting\ncomplex GUI elements and autonomously executing actions based on natural\nlanguage instructions. These agents represent a paradigm shift, enabling users\nto perform intricate, multi-step tasks through simple conversational commands.\nTheir applications span across web navigation, mobile app interactions, and\ndesktop automation, offering a transformative user experience that\nrevolutionizes how individuals interact with software. This emerging field is\nrapidly advancing, with significant progress in both research and industry.\n To provide a structured understanding of this trend, this paper presents a\ncomprehensive survey of LLM-brained GUI agents, exploring their historical\nevolution, core components, and advanced techniques. We address research\nquestions such as existing GUI agent frameworks, the collection and utilization\nof data for training specialized GUI agents, the development of large action\nmodels tailored for GUI tasks, and the evaluation metrics and benchmarks\nnecessary to assess their effectiveness. 
Additionally, we examine emerging\napplications powered by these agents. Through a detailed analysis, this survey\nidentifies key research gaps and outlines a roadmap for future advancements in\nthe field. By consolidating foundational knowledge and state-of-the-art\ndevelopments, this work aims to guide both researchers and practitioners in\novercoming challenges and unlocking the full potential of LLM-brained GUI\nagents.\n","authors":["Chaoyun Zhang","Shilin He","Jiaxu Qian","Bowen Li","Liqun Li","Si Qin","Yu Kang","Minghua Ma","Guyue Liu","Qingwei Lin","Saravan Rajmohan","Dongmei Zhang","Qi Zhang"],"pdf_url":"https://arxiv.org/pdf/2411.18279v2.pdf","comment":"The collection of papers reviewed in this survey will be hosted and\n regularly updated on the GitHub repository:\n https://github.com/vyokky/LLM-Brained-GUI-Agents-Survey Additionally, a\n searchable webpage is available at https://aka.ms/gui-agent for easier access\n and exploration"},{"id":"http://arxiv.org/abs/2411.13591v3","updated":"2024-11-28T06:24:27Z","published":"2024-11-18T05:47:12Z","title":"Improved GUI Grounding via Iterative Narrowing","summary":" Graphical User Interface (GUI) grounding plays a crucial role in enhancing\nthe capabilities of Vision-Language Model (VLM) agents. While general VLMs,\nsuch as GPT-4V, demonstrate strong performance across various tasks, their\nproficiency in GUI grounding remains suboptimal. Recent studies have focused on\nfine-tuning these models specifically for one-shot GUI grounding, yielding\nsignificant improvements over baseline performance. We introduce a visual\nprompting framework that employs an iterative narrowing mechanism to improve\nthe performance of both general and fine-tuned models in GUI grounding by up to\n61%. 
For evaluation, we tested our method on a comprehensive benchmark\ncomprising various UI platforms and provided the code to reproduce our results.\n","authors":["Anthony Nguyen"],"pdf_url":"https://arxiv.org/pdf/2411.13591v3.pdf","comment":"Code available at\n https://github.com/ant-8/GUI-Grounding-via-Iterative-Narrowing"},{"id":"http://arxiv.org/abs/2408.00764v2","updated":"2024-11-28T06:19:29Z","published":"2024-08-01T17:59:46Z","title":"AgentGen: Enhancing Planning Abilities for Large Language Model based\n Agent via Environment and Task Generation","summary":" Large Language Model-based agents have garnered significant attention and are\nbecoming increasingly popular. Furthermore, planning ability is a crucial\ncomponent of an LLM-based agent, which generally entails achieving a desired\ngoal from an initial state. This paper investigates enhancing the planning\nabilities of LLMs through instruction tuning, referred to as agent training.\nRecent studies have demonstrated that utilizing expert-level trajectory for\ninstruction-tuning LLMs effectively enhances their planning capabilities.\nHowever, existing work primarily focuses on synthesizing trajectories from\nmanually designed planning tasks and environments. The labor-intensive nature\nof creating these environments and tasks impedes the generation of sufficiently\nvaried and extensive trajectories. To address this limitation, this paper\nexplores the automated synthesis of diverse environments and a gradual range of\nplanning tasks, from easy to difficult. We introduce a framework, AgentGen,\nthat leverages LLMs first to generate environments and subsequently generate\nplanning tasks conditioned on these environments. Specifically, to improve\nenvironmental diversity, we propose using an inspiration corpus composed of\nvarious domain-specific text segments as the context for synthesizing\nenvironments. 
Moreover, to increase the difficulty diversity of generated\nplanning tasks, we propose a bidirectional evolution method, Bi-Evol, that\nevolves planning tasks from easier and harder directions to synthesize a task\nset with a smoother difficulty curve. The evaluation results derived from\nAgentBoard show that AgentGen greatly improves LLMs' planning ability, e.g.,\nthe AgentGen instruction-tuned Llama-3.1-8B surpasses GPT-3.5 in overall\nperformance. Moreover, the AgentGen-tuned Llama-3.1-70B model achieves\nstate-of-the-art results in planning tasks.\n","authors":["Mengkang Hu","Pu Zhao","Can Xu","Qingfeng Sun","Jianguang Lou","Qingwei Lin","Ping Luo","Saravan Rajmohan"],"pdf_url":"https://arxiv.org/pdf/2408.00764v2.pdf","comment":"Accepted by KDD 2025 (Research Track)"},{"id":"http://arxiv.org/abs/2411.18940v1","updated":"2024-11-28T06:12:28Z","published":"2024-11-28T06:12:28Z","title":"Rephrasing Electronic Health Records for Pretraining Clinical Language\n Models","summary":" Clinical language models are important for many applications in healthcare,\nbut their development depends on access to extensive clinical text for\npretraining. However, obtaining clinical notes from electronic health records\n(EHRs) at scale is challenging due to patient privacy concerns. In this study,\nwe rephrase existing clinical notes using LLMs to generate synthetic\npretraining corpora, drawing inspiration from previous work on rephrasing web\ndata. We examine four popular small-sized LLMs (<10B) to create synthetic\nclinical text to pretrain both decoder-based and encoder-based language models.\nThe method yields better results in language modeling and downstream tasks than\nprevious synthesis approaches without referencing real clinical text. 
We find\nthat augmenting original clinical notes with synthetic corpora from different\nLLMs improves performances even at a small token budget, showing the potential\nof this method to support pretraining at the institutional level or be scaled\nto synthesize large-scale clinical corpora.\n","authors":["Jinghui Liu","Anthony Nguyen"],"pdf_url":"https://arxiv.org/pdf/2411.18940v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.18932v1","updated":"2024-11-28T05:51:45Z","published":"2024-11-28T05:51:45Z","title":"ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large\n Multimodal Models with Visual Programming Challenges","summary":" Recent advancements in large multimodal models (LMMs) have showcased\nimpressive code generation capabilities, primarily evaluated through\nimage-to-code benchmarks. However, these benchmarks are limited to specific\nvisual programming scenarios where the logic reasoning and the multimodal\nunderstanding capacities are split apart. To fill this gap, we propose\nScratchEval, a novel benchmark designed to evaluate the visual programming\nreasoning ability of LMMs. ScratchEval is based on Scratch, a block-based\nvisual programming language widely used in children's programming education. By\nintegrating visual elements and embedded programming logic, ScratchEval\nrequires the model to process both visual information and code structure,\nthereby comprehensively evaluating its programming intent understanding\nability. Our evaluation approach goes beyond the traditional image-to-code\nmapping and focuses on unified logical thinking and problem-solving abilities,\nproviding a more comprehensive and challenging framework for evaluating the\nvisual programming ability of LMMs. ScratchEval not only fills the gap in\nexisting evaluation methods, but also provides new insights for the future\ndevelopment of LMMs in the field of visual programming. 
Our benchmark can be\naccessed at https://github.com/HKBUNLP/ScratchEval .\n","authors":["Rao Fu","Ziyang Luo","Hongzhan Lin","Zhen Ye","Jing Ma"],"pdf_url":"https://arxiv.org/pdf/2411.18932v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.18924v1","updated":"2024-11-28T05:24:51Z","published":"2024-11-28T05:24:51Z","title":"The Impact of Example Selection in Few-Shot Prompting on Automated Essay\n Scoring Using GPT Models","summary":" This study investigates the impact of example selection on the performance of\nautomated essay scoring (AES) using few-shot prompting with GPT models. We\nevaluate the effects of the choice and order of examples in few-shot prompting\non several versions of GPT-3.5 and GPT-4 models. Our experiments involve 119\nprompts with different examples, and we calculate the quadratic weighted kappa\n(QWK) to measure the agreement between GPT and human rater scores. Regression\nanalysis is used to quantitatively assess biases introduced by example\nselection. The results show that the impact of example selection on QWK varies\nacross models, with GPT-3.5 being more influenced by examples than GPT-4. We\nalso find evidence of majority label bias, which is a tendency to favor the\nmajority label among the examples, and recency bias, which is a tendency to\nfavor the label of the most recent example, in GPT-generated essay scores and\nQWK, with these biases being more pronounced in GPT-3.5. Notably, careful\nexample selection enables GPT-3.5 models to outperform some GPT-4 models.\nHowever, among the GPT models, the June 2023 version of GPT-4, which is not the\nlatest model, exhibits the highest stability and performance. 
Our findings\nprovide insights into the importance of example selection in few-shot prompting\nfor AES, especially in GPT-3.5 models, and highlight the need for individual\nperformance evaluations of each model, even for minor versions.\n","authors":["Lui Yoshida"],"pdf_url":"https://arxiv.org/pdf/2411.18924v1.pdf","comment":"Accepted in AIED2024. This preprint has not undergone any\n post-submission improvements or corrections. The Version of Record of this\n contribution is published in Communications in Computer and Information\n Science, vol 2150, and is available online at https://doi.org/"},{"id":"http://arxiv.org/abs/2411.18923v1","updated":"2024-11-28T05:24:46Z","published":"2024-11-28T05:24:46Z","title":"EzSQL: An SQL intermediate representation for improving SQL-to-text\n Generation","summary":" The SQL-to-text generation task traditionally uses template-based, Seq2Seq,\ntree-to-sequence, and graph-to-sequence models. Recent models take advantage of\npre-trained generative language models for this task in the Seq2Seq framework.\nHowever, treating SQL as a sequence of inputs to the pre-trained models is not\noptimal. In this work, we put forward a new SQL intermediate representation\ncalled EzSQL to align SQL with the natural language text sequence. EzSQL\nsimplifies the SQL queries and brings them closer to natural language text by\nmodifying operators and keywords, which can usually be described in natural\nlanguage. EzSQL also removes the need for set operators. Our proposed\nSQL-to-text generation model uses EzSQL as the input to a pre-trained\ngenerative language model for generating the text descriptions. We demonstrate\nthat our model is an effective state-of-the-art method to generate text\nnarrations from SQL queries on the WikiSQL and Spider datasets. 
We also show\nthat by generating pretraining data using our SQL-to-text generation model, we\ncan enhance the performance of Text-to-SQL parsers.\n","authors":["Meher Bhardwaj","Hrishikesh Ethari","Dennis Singh Moirangthem"],"pdf_url":"https://arxiv.org/pdf/2411.18923v1.pdf","comment":"Under Review at Expert System With Applications Journal"},{"id":"http://arxiv.org/abs/2411.18922v1","updated":"2024-11-28T05:23:22Z","published":"2024-11-28T05:23:22Z","title":"Devising a Set of Compact and Explainable Spoken Language Feature for\n Screening Alzheimer's Disease","summary":" Alzheimer's disease (AD) has become one of the most significant health\nchallenges in an aging society. The use of spoken language-based AD detection\nmethods has gained prevalence due to their\nscalability. Based on the Cookie Theft picture description task, we devised an\nexplainable and effective feature set that leverages the visual capabilities of\na large language model (LLM) and the Term Frequency-Inverse Document Frequency\n(TF-IDF) model. Our experimental results show that the newly proposed features\nconsistently outperform traditional linguistic features across two different\nclassifiers with high dimension efficiency. Our new features can be well\nexplained and interpreted step by step, which enhances the interpretability of\nautomatic AD screening.\n","authors":["Junan Li","Yunxiang Li","Yuren Wang","Xixin Wu","Helen Meng"],"pdf_url":"https://arxiv.org/pdf/2411.18922v1.pdf","comment":"Published at ISCSLP 2024"},{"id":"http://arxiv.org/abs/2411.18915v1","updated":"2024-11-28T05:12:17Z","published":"2024-11-28T05:12:17Z","title":"MATATA: a weak-supervised MAthematical Tool-Assisted reasoning for\n Tabular Applications","summary":" Mathematical reasoning capabilities are increasing with tool-augmented\nlanguage agents, but methods often rely either on closed-source or large\nmodels, external data, or extensive prompt engineering. 
This work introduces\nMATATA, a novel cost-effective method to train LLM agents for tabular data\nproblems through reasoning, planning, and tool use. With a progressive\nself-improvement paradigm and iterative weak supervision, it empowers\n3.8B/8B Small Language Models (SLMs), particularly suited for local hosting and\nsensitive business contexts where data privacy is crucial. By employing\nflexible and reusable tools across different datasets, it achieves robust\nperformance with effective scalability across shared tasks. Experiments show\nthat MATATA reaches state-of-the-art performance on FinQA and TAT-QA among\nreasoning frameworks based on open-source models. Moreover, MATATA models\ncompete with GPT-4 based frameworks on TabMWP, while being SLMs.\n","authors":["Vishnou Vinayagame","Gregory Senay","Luis Martí"],"pdf_url":"https://arxiv.org/pdf/2411.18915v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.18895v1","updated":"2024-11-28T03:58:48Z","published":"2024-11-28T03:58:48Z","title":"Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks","summary":" Sparse Autoencoders (SAEs) are an interpretability technique aimed at\ndecomposing neural network activations into interpretable units. However, a\nmajor bottleneck for SAE development has been the lack of high-quality\nperformance metrics, with prior work largely relying on unsupervised proxies.\nIn this work, we introduce a family of evaluations based on SHIFT, a downstream\ntask from Marks et al. (Sparse Feature Circuits, 2024) in which spurious cues\nare removed from a classifier by ablating SAE features judged to be\ntask-irrelevant by a human annotator. We adapt SHIFT into an automated metric\nof SAE quality; this involves replacing the human annotator with an LLM.\nAdditionally, we introduce the Targeted Probe Perturbation (TPP) metric that\nquantifies an SAE's ability to disentangle similar concepts, effectively\nscaling SHIFT to a wider range of datasets. 
We apply both SHIFT and TPP to\nmultiple open-source models, demonstrating that these metrics effectively\ndifferentiate between various SAE training hyperparameters and architectures.\n","authors":["Adam Karvonen","Can Rager","Samuel Marks","Neel Nanda"],"pdf_url":"https://arxiv.org/pdf/2411.18895v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.18888v1","updated":"2024-11-28T03:31:12Z","published":"2024-11-28T03:31:12Z","title":"ArEEG_Words: Dataset for Envisioned Speech Recognition using EEG for\n Arabic Words","summary":" Brain-Computer-Interface (BCI) aims to support communication-impaired\npatients by translating neural signals into speech. A notable research topic in\nBCI involves Electroencephalography (EEG) signals that measure the electrical\nactivity in the brain. While significant advancements have been made in BCI EEG\nresearch, a major limitation still exists: the scarcity of publicly available\nEEG datasets for non-English languages, such as Arabic. To address this gap, we\nintroduce in this paper the ArEEG_Words dataset, a novel EEG dataset recorded from\n22 participants with a mean age of 22 years (5 female, 17 male) using a\n14-channel Emotiv Epoc X device. The participants were asked to be free from\nany effects on their nervous system, such as coffee, alcohol, cigarettes, and\nso on, for 8 hours before recording. They were asked to stay calm in a calm room while\nimagining one of the 16 Arabic words for 10 seconds. The words include 16\ncommonly used words such as up, down, left, and right. A total of 352 EEG\nrecordings were collected, then each recording was divided into multiple 250ms\nsignals, resulting in a total of 15,360 EEG signals. 
To the best of our\nknowledge, the ArEEG_Words dataset is the first of its kind in the Arabic EEG domain.\nMoreover, it is publicly available to researchers, and we hope it will fill\nthe gap in Arabic EEG research.\n","authors":["Hazem Darwish","Abdalrahman Al Malah","Khloud Al Jallad","Nada Ghneim"],"pdf_url":"https://arxiv.org/pdf/2411.18888v1.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2402.15733"},{"id":"http://arxiv.org/abs/2411.18885v1","updated":"2024-11-28T03:27:48Z","published":"2024-11-28T03:27:48Z","title":"Sneaking Syntax into Transformer Language Models with Tree\n Regularization","summary":" While compositional accounts of human language understanding are based on a\nhierarchical tree-like process, neural models like transformers lack a direct\ninductive bias for such tree structures. Introducing syntactic inductive biases\ncould unlock more robust and data-efficient learning in transformer language\nmodels (LMs), but existing methods for incorporating such structure greatly\nrestrict models, either limiting their expressivity or increasing inference\ncomplexity. This work instead aims to softly inject syntactic inductive biases\ninto given transformer circuits, through a structured regularizer. We introduce\nTreeReg, an auxiliary loss function that converts bracketing decisions from\nsilver parses into a set of differentiable orthogonality constraints on vector\nhidden states. TreeReg integrates seamlessly with the standard LM objective,\nrequiring no architectural changes. LMs pre-trained with TreeReg on natural\nlanguage corpora such as WikiText-103 achieve up to 10% lower perplexities on\nout-of-distribution data and up to 9.5 point improvements in syntactic\ngeneralization, requiring less than half the training data to outperform\nstandard LMs. 
TreeReg still provides gains for pre-trained LLMs: Continued\npre-training of Sheared Llama with TreeReg results in improved syntactic\ngeneralization, and fine-tuning on MultiNLI with TreeReg mitigates degradation\nof performance on adversarial NLI benchmarks by 41.2 points.\n","authors":["Ananjan Nandi","Christopher D. Manning","Shikhar Murty"],"pdf_url":"https://arxiv.org/pdf/2411.18885v1.pdf","comment":"17 pages, 16 figures, 8 tables"},{"id":"http://arxiv.org/abs/2411.17075v3","updated":"2024-11-28T03:13:04Z","published":"2024-11-26T03:27:43Z","title":"Don't Command, Cultivate: An Exploratory Study of System-2 Alignment","summary":" The o1 system card identifies the o1 models as the most robust within OpenAI,\nwith their defining characteristic being the progression from rapid, intuitive\nthinking to slower, more deliberate reasoning. This observation motivated us to\ninvestigate the influence of System-2 thinking patterns on model safety. In our\npreliminary research, we conducted safety evaluations of the o1 model,\nincluding complex jailbreak attack scenarios using adversarial natural language\nprompts and mathematical encoding prompts. Our findings indicate that the o1\nmodel demonstrates relatively improved safety performance; however, it still\nexhibits vulnerabilities, particularly against jailbreak attacks employing\nmathematical encoding. Through detailed case analysis, we identified specific\npatterns in the o1 model's responses. We also explored the alignment of\nSystem-2 safety in open-source models using prompt engineering and supervised\nfine-tuning techniques. Experimental results show that some simple methods to\nencourage the model to carefully scrutinize user requests are beneficial for\nmodel safety. Additionally, we proposed an implementation plan for process\nsupervision to enhance safety alignment. 
The implementation details and\nexperimental results will be provided in future versions.\n","authors":["Yuhang Wang","Jitao Sang"],"pdf_url":"https://arxiv.org/pdf/2411.17075v3.pdf","comment":"Preprint version, more results will be updated"},{"id":"http://arxiv.org/abs/2409.06411v2","updated":"2024-11-28T02:53:40Z","published":"2024-09-10T10:49:38Z","title":"Length Desensitization in Direct Preference Optimization","summary":" Direct Preference Optimization (DPO) is widely utilized in the Reinforcement\nLearning from Human Feedback (RLHF) phase to align Large Language Models (LLMs)\nwith human preferences, thereby enhancing both their harmlessness and efficacy.\nHowever, it has been observed that DPO tends to over-optimize for verbosity,\nwhich can detrimentally affect both performance and user experience. In this\npaper, we conduct an in-depth theoretical analysis of DPO's optimization\nobjective and reveal a strong correlation between its implicit reward and data\nlength. This correlation misguides the optimization direction, resulting in\nlength sensitivity during the DPO training and leading to verbosity. To address\nthis issue, we propose a length-desensitization improvement method for DPO,\ntermed LD-DPO. The proposed method aims to desensitize DPO to data length by\ndecoupling explicit length preference, which is relatively insignificant, from\nthe other implicit preferences, thereby enabling more effective learning of the\nintrinsic preferences. We utilized two settings (Base and Instruct) of\nLlama2-13B, Llama3-8B, and Qwen2-7B for experimental validation on various\nbenchmarks including MT-Bench and AlpacaEval 2. The experimental results\nindicate that LD-DPO consistently outperforms DPO and other baseline methods,\nachieving more concise responses with a 10-40% reduction in length compared to\nDPO. 
We conducted in-depth experimental analyses to demonstrate that LD-DPO can\nindeed achieve length desensitization and align the model more closely with\nhuman-like preferences.\n","authors":["Wei Liu","Yang Bai","Chengcheng Han","Rongxiang Weng","Jun Xu","Xuezhi Cao","Jingang Wang","Xunliang Cai"],"pdf_url":"https://arxiv.org/pdf/2409.06411v2.pdf","comment":"21 pages, 9 figures"},{"id":"http://arxiv.org/abs/2411.11496v3","updated":"2024-11-28T02:07:46Z","published":"2024-11-18T11:58:07Z","title":"Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to\n Jailbreak Large Vision-Language Models","summary":" Recent advances in Large Vision-Language Models (LVLMs) have showcased strong\nreasoning abilities across multiple modalities, achieving significant\nbreakthroughs in various real-world applications. Despite this great success,\nthe safety guardrail of LVLMs may not cover the unforeseen domains introduced\nby the visual modality. Existing studies primarily focus on eliciting LVLMs to\ngenerate harmful responses via carefully crafted image-based jailbreaks\ndesigned to bypass alignment defenses. In this study, we reveal that a safe\nimage can be exploited to achieve the same jailbreak consequence when combined\nwith additional safe images and prompts. This stems from two fundamental\nproperties of LVLMs: universal reasoning capabilities and safety snowball\neffect. Building on these insights, we propose Safety Snowball Agent (SSA), a\nnovel agent-based framework leveraging agents' autonomous and tool-using\nabilities to jailbreak LVLMs. SSA operates through two principal stages: (1)\ninitial response generation, where tools generate or retrieve jailbreak images\nbased on potential harmful intents, and (2) harmful snowballing, where refined\nsubsequent prompts induce progressively harmful outputs. 
Our experiments\ndemonstrate that SSA can use nearly any image to induce LVLMs to produce\nunsafe content, achieving high jailbreak success rates against the latest\nLVLMs. Unlike prior works that exploit alignment flaws, SSA leverages the\ninherent properties of LVLMs, presenting a profound challenge for enforcing\nsafety in generative multimodal systems. Our code is available at\nhttps://github.com/gzcch/Safety_Snowball_Agent.\n","authors":["Chenhang Cui","Gelei Deng","An Zhang","Jingnan Zheng","Yicong Li","Lianli Gao","Tianwei Zhang","Tat-Seng Chua"],"pdf_url":"https://arxiv.org/pdf/2411.11496v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.18077v2","updated":"2024-11-28T02:01:50Z","published":"2024-11-27T06:10:49Z","title":"MiniKV: Pushing the Limits of LLM Inference via 2-Bit\n Layer-Discriminative KV Cache","summary":" How to efficiently serve LLMs in practice has become exceptionally\nchallenging due to their prohibitive memory and computation requirements. In\nthis study, we investigate optimizing the KV cache, whose memory footprint\nposes a critical bottleneck in LLM inference, especially when dealing with long\ncontext tasks. To tackle the challenge, we introduce MiniKV, a KV cache\noptimization method that simultaneously preserves long context task accuracy\nwhile significantly reducing KV cache size via a novel 2-bit\nlayer-discriminative KV cache. More importantly, we develop specialized CUDA\nkernels to make MiniKV compatible with FlashAttention. 
Experiments on a wide\nrange of long context tasks show that MiniKV effectively achieves 86% KV cache\ncompression ratio while recovering over 98.5% of accuracy, outperforming\nstate-of-the-art methods while achieving excellent measured system performance\nimprovements.\n","authors":["Akshat Sharma","Hangliang Ding","Jianping Li","Neel Dani","Minjia Zhang"],"pdf_url":"https://arxiv.org/pdf/2411.18077v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.05706v3","updated":"2024-11-28T01:10:49Z","published":"2024-02-08T14:35:09Z","title":"Paralinguistics-Aware Speech-Empowered Large Language Models for Natural\n Conversation","summary":" Recent work shows promising results in expanding the capabilities of large\nlanguage models (LLM) to directly understand and synthesize speech. However, an\nLLM-based strategy for modeling spoken dialogs remains elusive, calling for\nfurther investigation. This paper introduces an extensive speech-text LLM\nframework, the Unified Spoken Dialog Model (USDM), designed to generate\ncoherent spoken responses with naturally occurring prosodic features relevant\nto the given input speech without relying on explicit automatic speech\nrecognition (ASR) or text-to-speech (TTS) systems. We have verified the\ninclusion of prosody in speech tokens that predominantly contain semantic\ninformation and have used this foundation to construct a prosody-infused\nspeech-text model. Additionally, we propose a generalized speech-text\npretraining scheme that enhances the capture of cross-modal semantics. To\nconstruct USDM, we fine-tune our speech-text model on spoken dialog data using\na multi-step spoken dialog template that stimulates the chain-of-reasoning\ncapabilities exhibited by the underlying LLM. 
Automatic and human evaluations\non the DailyTalk dataset demonstrate that our approach effectively generates\nnatural-sounding spoken responses, surpassing previous and cascaded baselines.\nOur code and checkpoints are available at https://github.com/naver-ai/usdm.\n","authors":["Heeseung Kim","Soonshin Seo","Kyeongseok Jeong","Ohsung Kwon","Soyoon Kim","Jungwhan Kim","Jaehong Lee","Eunwoo Song","Myungwoo Oh","Jung-Woo Ha","Sungroh Yoon","Kang Min Yoo"],"pdf_url":"https://arxiv.org/pdf/2402.05706v3.pdf","comment":"NeurIPS 2024, Project Page: https://unifiedsdm.github.io/"},{"id":"http://arxiv.org/abs/2411.17204v2","updated":"2024-11-28T01:04:40Z","published":"2024-11-26T08:21:24Z","title":"Strategic Prompting for Conversational Tasks: A Comparative Analysis of\n Large Language Models Across Diverse Conversational Tasks","summary":" Given the advancements in conversational artificial intelligence, the\nevaluation and assessment of Large Language Models (LLMs) play a crucial role\nin ensuring optimal performance across various conversational tasks. In this\npaper, we present a comprehensive study that thoroughly evaluates the\ncapabilities and limitations of five prevalent LLMs: Llama, OPT, Falcon,\nAlpaca, and MPT. The study encompasses various conversational tasks, including\nreservation, empathetic response generation, mental health and legal\ncounseling, persuasion, and negotiation. To conduct the evaluation, an\nextensive test setup is employed, utilizing multiple evaluation criteria that\nspan from automatic to human evaluation. This includes using generic and\ntask-specific metrics to gauge the LMs' performance accurately. From our\nevaluation, no single model emerges as universally optimal for all tasks.\nInstead, their performance varies significantly depending on the specific\nrequirements of each task. While some models excel in certain tasks, they may\ndemonstrate comparatively poorer performance in others. 
These findings\nemphasize the importance of considering task-specific requirements and\ncharacteristics when selecting the most suitable LM for conversational\napplications.\n","authors":["Ratnesh Kumar Joshi","Priyanshu Priya","Vishesh Desai","Saurav Dudhate","Siddhant Senapati","Asif Ekbal","Roshni Ramnani","Anutosh Maitra","Shubhashis Sengupta"],"pdf_url":"https://arxiv.org/pdf/2411.17204v2.pdf","comment":"39 pages, 12 tables"},{"id":"http://arxiv.org/abs/2411.18831v1","updated":"2024-11-28T00:21:31Z","published":"2024-11-28T00:21:31Z","title":"Measuring Risk of Bias in Biomedical Reports: The RoBBR Benchmark","summary":" Systems that answer questions by reviewing the scientific literature are\nbecoming increasingly feasible. To draw reliable conclusions, these systems\nshould take into account the quality of available evidence, placing more weight\non studies that use a valid methodology. We present a benchmark for measuring\nthe methodological strength of biomedical papers, drawing on the risk-of-bias\nframework used for systematic reviews. The four benchmark tasks, drawn from\nmore than 500 papers, cover the analysis of research study methodology,\nfollowed by evaluation of risk of bias in these studies. The benchmark contains\n2000 expert-generated bias annotations, and a human-validated pipeline for\nfine-grained alignment with research paper content. We evaluate a range of\nlarge language models on the benchmark, and find that these models fall\nsignificantly short of expert-level performance. By providing a standardized\ntool for measuring judgments of study quality, the benchmark can help to guide\nsystems that perform large-scale aggregation of scientific data. 
The dataset is\navailable at https://github.com/RoBBR-Benchmark/RoBBR.\n","authors":["Jianyou Wang","Weili Cao","Longtian Bao","Youze Zheng","Gil Pasternak","Kaicheng Wang","Xiaoyue Wang","Ramamohan Paturi","Leon Bergen"],"pdf_url":"https://arxiv.org/pdf/2411.18831v1.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2409.18969v2","updated":"2024-11-28T20:29:44Z","published":"2024-09-11T14:50:28Z","title":"Integrating SPARQL and LLMs for Question Answering over Scholarly Data\n Sources","summary":" The Scholarly Hybrid Question Answering over Linked Data (QALD) Challenge at\nthe International Semantic Web Conference (ISWC) 2024 focuses on Question\nAnswering (QA) over diverse scholarly sources: DBLP, SemOpenAlex, and\nWikipedia-based texts. This paper describes a methodology that combines SPARQL\nqueries, divide and conquer algorithms, and a pre-trained extractive question\nanswering model. It starts with SPARQL queries to gather data, then applies\ndivide and conquer to manage various question types and sources, and uses the\nmodel to handle personal author questions. The approach, evaluated with Exact\nMatch and F-score metrics, shows promise for improving QA accuracy and\nefficiency in scholarly contexts.\n","authors":["Fomubad Borista Fondi","Azanzi Jiomekong Fidel","Gaoussou Camara"],"pdf_url":"https://arxiv.org/pdf/2409.18969v2.pdf","comment":"Scholarly Hybrid Question answering challenge from the International\n Semantic Web Conference of 2024(ISWC), 7 pages, 8 figures"},{"id":"http://arxiv.org/abs/2411.19214v1","updated":"2024-11-28T15:36:55Z","published":"2024-11-28T15:36:55Z","title":"Parallel and Mini-Batch Stable Matching for Large-Scale Reciprocal\n Recommender Systems","summary":" Reciprocal recommender systems (RRSs) are crucial in online two-sided\nmatching platforms, such as online job or dating markets, as they need to\nconsider the preferences of both sides of the match. 
The concentration of\nrecommendations to a subset of users on these platforms undermines their match\nopportunities and reduces the total number of matches. To maximize the total\nnumber of expected matches among market participants, stable matching theory\nwith transferable utility has been applied to RRSs. However, computational\ncomplexity and memory efficiency quadratically increase with the number of\nusers, making it difficult to implement stable matching algorithms for several\nusers. In this study, we propose novel methods using parallel and mini-batch\ncomputations for reciprocal recommendation models to improve the computational\ntime and space efficiency of the optimization process for stable matching.\nExperiments on both real and synthetic data confirmed that our stable matching\ntheory-based RRS increased the computation speed and enabled tractable\nlarge-scale data processing of up to one million samples with a single graphics\nprocessing unit graphics board, without losing the match count.\n","authors":["Kento Nakada","Kazuki Kawamura","Ryosuke Furukawa"],"pdf_url":"https://arxiv.org/pdf/2411.19214v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.15408v3","updated":"2024-11-28T13:46:38Z","published":"2023-09-27T05:20:53Z","title":"A smoothed-Bayesian approach to frequency recovery from sketched data","summary":" We provide a novel statistical perspective on a classical problem at the\nintersection of computer science and information theory: recovering the\nempirical frequency of a symbol in a large discrete dataset using only a\ncompressed representation, or sketch, obtained via random hashing. Departing\nfrom traditional algorithmic approaches, recent works have proposed Bayesian\nnonparametric (BNP) methods that can provide more informative frequency\nestimates by leveraging modeling assumptions about the distribution of the\nsketched data. 
In this paper, we propose a smoothed-Bayesian method, inspired\nby existing BNP approaches but designed in a frequentist framework to overcome\nthe computational limitations of the BNP approaches when dealing with\nlarge-scale data from realistic distributions, including those with power-law\ntail behaviors. For sketches obtained with a single hash function, our approach\nis supported by rigorous frequentist properties, including unbiasedness and\noptimality under a squared error loss function within an intuitive class of\nlinear estimators. For sketches with multiple hash functions, we introduce an\napproach based on multi-view learning to construct computationally efficient\nfrequency estimators. We validate our method on synthetic and real data,\ncomparing its performance to that of existing alternatives.\n","authors":["Mario Beraha","Stefano Favaro","Matteo Sesia"],"pdf_url":"https://arxiv.org/pdf/2309.15408v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19119v1","updated":"2024-11-28T13:06:48Z","published":"2024-11-28T13:06:48Z","title":"Introducing Three New Benchmark Datasets for Hierarchical Text\n Classification","summary":" Hierarchical Text Classification (HTC) is a natural language processing task\nwith the objective to classify text documents into a set of classes from a\nstructured class hierarchy. Many HTC approaches have been proposed which\nattempt to leverage the class hierarchy information in various ways to improve\nclassification performance. Machine learning-based classification approaches\nrequire large amounts of training data and are most-commonly compared through\nthree established benchmark datasets, which include the Web Of Science (WOS),\nReuters Corpus Volume 1 Version 2 (RCV1-V2) and New York Times (NYT) datasets.\nHowever, apart from the RCV1-V2 dataset which is well-documented, these\ndatasets are not accompanied with detailed description methodologies. 
In this\npaper, we introduce three new HTC benchmark datasets in the domain of research\npublications which comprise the titles and abstracts of papers from the Web of\nScience publication database. We first create two baseline datasets which use\nexisting journal-and citation-based classification schemas. Due to the\nrespective shortcomings of these two existing schemas, we propose an approach\nwhich combines their classifications to improve the reliability and robustness\nof the dataset. We evaluate the three created datasets with a clustering-based\nanalysis and show that our proposed approach results in a higher quality\ndataset where documents that belong to the same class are semantically more\nsimilar compared to the other datasets. Finally, we provide the classification\nperformance of four state-of-the-art HTC approaches on these three new datasets\nto provide baselines for future studies on machine learning-based techniques\nfor scientific publication classification.\n","authors":["Jaco du Toit","Herman Redelinghuys","Marcel Dunaiski"],"pdf_url":"https://arxiv.org/pdf/2411.19119v1.pdf","comment":"16 pages, 11 figures"},{"id":"http://arxiv.org/abs/2411.19113v1","updated":"2024-11-28T12:59:32Z","published":"2024-11-28T12:59:32Z","title":"Integration of Contextual Descriptors in Ontology Alignment for\n Enrichment of Semantic Correspondence","summary":" This paper proposes a novel approach to semantic ontology alignment using\ncontextual descriptors. A formalization was developed that enables the\nintegration of essential and contextual descriptors to create a comprehensive\nknowledge model. The hierarchical structure of the semantic approach and the\nmathematical apparatus for analyzing potential conflicts between concepts,\nparticularly in the example of \"Transparency\" and \"Privacy\" in the context of\nartificial intelligence, are demonstrated. 
Experimental studies showed a\nsignificant improvement in ontology alignment metrics after the implementation\nof contextual descriptors, especially in the areas of privacy, responsibility,\nand freedom & autonomy. The application of contextual descriptors achieved an\naverage overall improvement of approximately 4.36%. The results indicate the\neffectiveness of the proposed approach for more accurately reflecting the\ncomplexity of knowledge and its contextual dependence.\n","authors":["Eduard Manziuk","Oleksander Barmak","Pavlo Radiuk","Vladislav Kuznetsov","Iurii Krak","Sergiy Yakovlev"],"pdf_url":"https://arxiv.org/pdf/2411.19113v1.pdf","comment":"Ontology alignment, contextual descriptors, semantic matching,\n knowledge representation, essential descriptors, ontology integration,\n hierarchical structure, semantic heterogeneity, ethical AI"},{"id":"http://arxiv.org/abs/2411.19107v1","updated":"2024-11-28T12:44:56Z","published":"2024-11-28T12:44:56Z","title":"Headache to Overstock? Promoting Long-tail Items through Debiased\n Product Bundling","summary":" Product bundling aims to organize a set of thematically related items into a\ncombined bundle for shipment facilitation and item promotion. To increase the\nexposure of fresh or overstocked products, sellers typically bundle these items\nwith popular products for inventory clearance. This specific task can be\nformulated as a long-tail product bundling scenario, which leverages the\nuser-item interactions to define the popularity of each item. The inherent\npopularity bias in the pre-extracted user feedback features and the\ninsufficient utilization of other popularity-independent knowledge may force\nthe conventional bundling methods to find more popular items, thereby\nstruggling with this long-tail bundling scenario. 
Through intuitive and\nempirical analysis, we navigate the core solution for this challenge, which is\nmaximally mining the popularity-free features and effectively incorporating\nthem into the bundling process. To achieve this, we propose a Distilled\nModality-Oriented Knowledge Transfer framework (DieT) to effectively counter\nthe popularity bias misintroduced by the user feedback features and adhere to\nthe original intent behind the real-world bundling behaviors. Specifically,\nDieT first proposes the Popularity-free Collaborative Distribution Modeling\nmodule (PCD) to capture the popularity-independent information from the\nbundle-item view, which is proven most effective in the long-tail bundling\nscenario to enable the directional information transfer. With the tailored\nUnbiased Bundle-aware Knowledge Transferring module (UBT), DieT can highlight\nthe significance of popularity-free features while mitigating the negative\neffects of user feedback features in the long-tail scenario via the knowledge\ndistillation paradigm. Extensive experiments on two real-world datasets\ndemonstrate the superiority of DieT over a list of SOTA methods in the\nlong-tail bundling scenario.\n","authors":["Shuo Xu","Haokai Ma","Yunshan Ma","Xiaohao Liu","Lei Meng","Xiangxu Meng","Tat-Seng Chua"],"pdf_url":"https://arxiv.org/pdf/2411.19107v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.18947v1","updated":"2024-11-28T06:28:45Z","published":"2024-11-28T06:28:45Z","title":"ICLERB: In-Context Learning Embedding and Reranker Benchmark","summary":" In-Context Learning (ICL) enables Large Language Models (LLMs) to perform new\ntasks by conditioning on prompts with relevant information. Retrieval-Augmented\nGeneration (RAG) enhances ICL by incorporating retrieved documents into the\nLLM's context at query time. However, traditional retrieval methods focus on\nsemantic relevance, treating retrieval as a search problem. 
In this paper, we\npropose reframing retrieval for ICL as a recommendation problem, aiming to\nselect documents that maximize utility in ICL tasks. We introduce the\nIn-Context Learning Embedding and Reranker Benchmark (ICLERB), a novel\nevaluation framework that compares retrievers based on their ability to enhance\nLLM accuracy in ICL settings. Additionally, we propose a novel Reinforcement\nLearning-to-Rank from AI Feedback (RLRAIF) algorithm, designed to fine-tune\nretrieval models using minimal feedback from the LLM. Our experimental results\nreveal notable differences between ICLERB and existing benchmarks, and\ndemonstrate that small models fine-tuned with our RLRAIF algorithm outperform\nlarge state-of-the-art retrieval models. These findings highlight the\nlimitations of existing evaluation methods and the need for specialized\nbenchmarks and training strategies adapted to ICL.\n","authors":["Marie Al Ghossein","Emile Contal","Alexandre Robicquet"],"pdf_url":"https://arxiv.org/pdf/2411.18947v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.15937v2","updated":"2024-11-28T05:29:32Z","published":"2024-03-23T21:54:18Z","title":"Model, Analyze, and Comprehend User Interactions within a Social Media\n Platform","summary":" In this study, we propose a novel graph-based approach to model, analyze and\ncomprehend user interactions within a social media platform based on\npost-comment relationship. We construct a user interaction graph from social\nmedia data and analyze it to gain insights into community dynamics, user\nbehavior, and content preferences. Our investigation reveals that while 56.05%\nof the active users are strongly connected within the community, only 0.8% of\nthem significantly contribute to its dynamics. Moreover, we observe temporal\nvariations in community activity, with certain periods experiencing heightened\nengagement. 
Additionally, our findings highlight a correlation between user\nactivity and popularity showing that more active users are generally more\npopular. Alongside these, a preference for positive and informative content is\nalso observed where 82.41% users preferred positive and informative content.\nOverall, our study provides a comprehensive framework for understanding and\nmanaging online communities, leveraging graph-based techniques to gain valuable\ninsights into user behavior and community dynamics.\n","authors":["Md Kaykobad Reza","S M Maksudul Alam","Yiran Luo","Youzhe Liu","Md Siam"],"pdf_url":"https://arxiv.org/pdf/2403.15937v2.pdf","comment":"Accepted by 27th International Conference on Computer and Information\n Technology (ICCIT), 2024. 6 Pages, 6 Figures"}],"Multimedia":[{"id":"http://arxiv.org/abs/2411.19220v1","updated":"2024-11-28T15:42:32Z","published":"2024-11-28T15:42:32Z","title":"Automatic Prompt Generation and Grounding Object Detection for Zero-Shot\n Image Anomaly Detection","summary":" Identifying defects and anomalies in industrial products is a critical\nquality control task. Traditional manual inspection methods are slow,\nsubjective, and error-prone. In this work, we propose a novel zero-shot\ntraining-free approach for automated industrial image anomaly detection using a\nmultimodal machine learning pipeline, consisting of three foundation models.\nOur method first uses a large language model, i.e., GPT-3. generate text\nprompts describing the expected appearances of normal and abnormal products. We\nthen use a grounding object detection model, called Grounding DINO, to locate\nthe product in the image. Finally, we compare the cropped product image patches\nto the generated prompts using a zero-shot image-text matching model, called\nCLIP, to identify any anomalies. 
Our experiments on two datasets of industrial\nproduct images, namely MVTec-AD and VisA, demonstrate the effectiveness of this\nmethod, achieving high accuracy in detecting various types of defects and\nanomalies without the need for model training. Our proposed model enables\nefficient, scalable, and objective quality control in industrial manufacturing\nsettings.\n","authors":["Tsun-Hin Cheung","Ka-Chun Fung","Songjiang Lai","Kwan-Ho Lin","Vincent Ng","Kin-Man Lam"],"pdf_url":"https://arxiv.org/pdf/2411.19220v1.pdf","comment":"Accepted to APSIPA ASC 2024"},{"id":"http://arxiv.org/abs/2411.17698v2","updated":"2024-11-28T13:25:04Z","published":"2024-11-26T18:59:58Z","title":"Video-Guided Foley Sound Generation with Multimodal Controls","summary":" Generating sound effects for videos often requires creating artistic sound\neffects that diverge significantly from real-life sources and flexible control\nin the sound design. To address this problem, we introduce MultiFoley, a model\ndesigned for video-guided sound generation that supports multimodal\nconditioning through text, audio, and video. Given a silent video and a text\nprompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels\nspinning without wind noise) or more whimsical sounds (e.g., making a lion's\nroar sound like a cat's meow). MultiFoley also allows users to choose reference\naudio from sound effects (SFX) libraries or partial videos for conditioning. A\nkey novelty of our model lies in its joint training on both internet video\ndatasets with low-quality audio and professional SFX recordings, enabling\nhigh-quality, full-bandwidth (48kHz) audio generation. Through automated\nevaluations and human studies, we demonstrate that MultiFoley successfully\ngenerates synchronized high-quality sounds across varied conditional inputs and\noutperforms existing methods. 
Please see our project page for video results:\nhttps://ificl.github.io/MultiFoley/\n","authors":["Ziyang Chen","Prem Seetharaman","Bryan Russell","Oriol Nieto","David Bourgin","Andrew Owens","Justin Salamon"],"pdf_url":"https://arxiv.org/pdf/2411.17698v2.pdf","comment":"Project site: https://ificl.github.io/MultiFoley/"},{"id":"http://arxiv.org/abs/2405.00233v2","updated":"2024-11-28T12:31:04Z","published":"2024-04-30T22:51:36Z","title":"SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General\n Sound","summary":" Large language models (LLMs) have significantly advanced audio processing\nthrough audio codecs that convert audio into discrete tokens, enabling the\napplication of language modelling techniques to audio data. However,\ntraditional codecs often operate at high bitrates or within narrow domains such\nas speech and lack the semantic clues required for efficient language\nmodelling. Addressing these challenges, we introduce SemantiCodec, a novel\ncodec designed to compress audio into fewer than a hundred tokens per second\nacross diverse audio types, including speech, general sound, and music, without\ncompromising quality. SemantiCodec features a dual-encoder architecture: a\nsemantic encoder using a self-supervised pre-trained Audio Masked Autoencoder\n(AudioMAE), discretized using k-means clustering on extensive audio data, and\nan acoustic encoder to capture the remaining details. The semantic and acoustic\nencoder outputs are used to reconstruct audio via a diffusion-model-based\ndecoder. SemantiCodec is presented in three variants with token rates of 25,\n50, and 100 per second, supporting a range of ultra-low bit rates between 0.31\nkbps and 1.40 kbps. Experimental results demonstrate that SemantiCodec\nsignificantly outperforms the state-of-the-art Descript codec on reconstruction\nquality. 
Our results also suggest that SemantiCodec contains significantly\nricher semantic information than all evaluated state-of-the-art audio codecs,\neven at significantly lower bitrates. Our code and demos are available at\nhttps://haoheliu.github.io/SemantiCodec/.\n","authors":["Haohe Liu","Xuenan Xu","Yi Yuan","Mengyue Wu","Wenwu Wang","Mark D. Plumbley"],"pdf_url":"https://arxiv.org/pdf/2405.00233v2.pdf","comment":"Accepted by Journal of Selected Topics in Signal Processing (JSTSP).\n Demo and code: https://haoheliu.github.io/SemantiCodec/"},{"id":"http://arxiv.org/abs/2405.09266v3","updated":"2024-11-28T10:30:14Z","published":"2024-05-15T11:33:07Z","title":"Dance Any Beat: Blending Beats with Visuals in Dance Video Generation","summary":" Generating dance from music is crucial for advancing automated choreography.\nCurrent methods typically produce skeleton keypoint sequences instead of dance\nvideos and lack the capability to make specific individuals dance, which\nreduces their real-world applicability. These methods also require precise\nkeypoint annotations, complicating data collection and limiting the use of\nself-collected video datasets. To overcome these challenges, we introduce a\nnovel task: generating dance videos directly from images of individuals guided\nby music. This task enables the dance generation of specific individuals\nwithout requiring keypoint annotations, making it more versatile and applicable\nto various situations. Our solution, the Dance Any Beat Diffusion model\n(DabFusion), utilizes a reference image and a music piece to generate dance\nvideos featuring various dance types and choreographies. The music is analyzed\nby our specially designed music encoder, which identifies essential features\nincluding dance style, movement, and rhythm. DabFusion excels in generating\ndance videos not only for individuals in the training dataset but also for any\npreviously unseen person. 
This versatility stems from its approach of\ngenerating latent optical flow, which contains all necessary motion information\nto animate any person in the image. We evaluate DabFusion's performance using\nthe AIST++ dataset, focusing on video quality, audio-video synchronization, and\nmotion-music alignment. We propose a 2D Motion-Music Alignment Score (2D-MM\nAlign), which builds on the Beat Alignment Score to more effectively evaluate\nmotion-music alignment for this new task. Experiments show that our DabFusion\nestablishes a solid baseline for this innovative task. Video results can be\nfound on our project page: https://DabFusion.github.io.\n","authors":["Xuanchen Wang","Heng Wang","Dongnan Liu","Weidong Cai"],"pdf_url":"https://arxiv.org/pdf/2405.09266v3.pdf","comment":"WACV2025, 11 pages, 7 figures, demo page: https://DabFusion.github.io"},{"id":"http://arxiv.org/abs/2305.09979v2","updated":"2024-11-28T07:49:44Z","published":"2023-05-17T06:23:06Z","title":"Self-Training Boosted Multi-Factor Matching Network for Composed Image\n Retrieval","summary":" The composed image retrieval (CIR) task aims to retrieve the desired target\nimage for a given multimodal query, i.e., a reference image with its\ncorresponding modification text. The key limitations encountered by existing\nefforts are two aspects: 1) ignoring the multi-faceted query-target matching\nfactors; 2) ignoring the potential unlabeled reference-target image pairs in\nexisting benchmark datasets. To address these two limitations is non-trivial\ndue to the following challenges: 1) how to effectively model the multi-faceted\nmatching factors in a latent way without direct supervision signals; 2) how to\nfully utilize the potential unlabeled reference-target image pairs to improve\nthe generalization ability of the CIR model. 
To address these challenges, in\nthis work, we first propose a muLtI-faceted Matching Network (LIMN), which\nconsists of three key modules: multi-grained image/text encoder, latent\nfactor-oriented feature aggregation, and query-target matching modeling.\nThereafter, we design an iterative dual self-training paradigm to further\nenhance the performance of LIMN by fully utilizing the potential unlabeled\nreference-target image pairs in a semi-supervised manner. Specifically, we\ndenote the iterative dual self-training paradigm enhanced LIMN as LIMN+.\nExtensive experiments on three real-world datasets, FashionIQ, Shoes, and\nBirds-to-Words, show that our proposed method significantly surpasses the\nstate-of-the-art baselines.\n","authors":["Haokun Wen","Xuemeng Song","Jianhua Yin","Jianlong Wu","Weili Guan","Liqiang Nie"],"pdf_url":"https://arxiv.org/pdf/2305.09979v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.18966v1","updated":"2024-11-28T07:36:22Z","published":"2024-11-28T07:36:22Z","title":"SuperGaussians: Enhancing Gaussian Splatting Using Primitives with\n Spatially Varying Colors","summary":" Gaussian Splattings demonstrate impressive results in multi-view\nreconstruction based on Gaussian explicit representations. However, the current\nGaussian primitives only have a single view-dependent color and an opacity to\nrepresent the appearance and geometry of the scene, resulting in a non-compact\nrepresentation. In this paper, we introduce a new method called SuperGaussians\nthat utilizes spatially varying colors and opacity in a single Gaussian\nprimitive to improve its representation ability. We have implemented bilinear\ninterpolation, movable kernels, and even tiny neural networks as spatially\nvarying functions. 
Quantitative and qualitative experimental results\ndemonstrate that all three functions outperform the baseline, with the best\nmovable kernels achieving superior novel view synthesis performance on multiple\ndatasets, highlighting the strong potential of spatially varying functions.\n","authors":["Rui Xu","Wenyue Chen","Jiepeng Wang","Yuan Liu","Peng Wang","Lin Gao","Shiqing Xin","Taku Komura","Xin Li","Wenping Wang"],"pdf_url":"https://arxiv.org/pdf/2411.18966v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.18855v1","updated":"2024-11-28T01:51:46Z","published":"2024-11-28T01:51:46Z","title":"Improving Accuracy and Generalization for Efficient Visual Tracking","summary":" Efficient visual trackers overfit to their training distributions and lack\ngeneralization abilities, resulting in them performing well on their respective\nin-distribution (ID) test sets and not as well on out-of-distribution (OOD)\nsequences, imposing limitations to their deployment in-the-wild under\nconstrained resources. We introduce SiamABC, a highly efficient Siamese tracker\nthat significantly improves tracking performance, even on OOD sequences.\nSiamABC takes advantage of new architectural designs in the way it bridges the\ndynamic variability of the target, and of new losses for training. Also, it\ndirectly addresses OOD tracking generalization by including a fast\nbackward-free dynamic test-time adaptation method that continuously adapts the\nmodel according to the dynamic visual changes of the target. Our extensive\nexperiments suggest that SiamABC shows remarkable performance gains in OOD sets\nwhile maintaining accurate performance on the ID benchmarks. 
SiamABC\noutperforms MixFormerV2-S by 7.6\\% on the OOD AVisT benchmark while being 3x\nfaster (100 FPS) on a CPU.\n","authors":["Ram Zaveri","Shivang Patel","Yu Gu","Gianfranco Doretto"],"pdf_url":"https://arxiv.org/pdf/2411.18855v1.pdf","comment":"WACV 2025"}]},"2024-12-02T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2410.16208v3","updated":"2024-12-02T18:59:28Z","published":"2024-10-21T17:11:21Z","title":"Compute-Constrained Data Selection","summary":" Data selection can reduce the amount of training data needed to finetune\nLLMs; however, the efficacy of data selection scales directly with its compute.\nMotivated by the practical challenge of compute-constrained finetuning, we\nconsider the setting in which both the cost of selecting data and training are\nbudgeted for. We first formalize the problem of data selection with a\ncost-aware utility function, and model the data selection problem as trading\noff initial-selection cost for training gain. We run a comprehensive sweep of\nexperiments across multiple tasks, varying compute budget by scaling finetuning\ntokens, model sizes, and data selection compute. Interestingly we find that\nmany powerful data selection methods are almost never compute-optimal, and that\ncheaper data selection alternatives dominate both from a theoretical and\nempirical perspective. For compute-optimal training, we find that perplexity\nand gradient data selection require training-to-selection model size ratios of\n5x and 10x, respectively.\n","authors":["Junjie Oscar Yin","Alexander M. Rush"],"pdf_url":"https://arxiv.org/pdf/2410.16208v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.05677v2","updated":"2024-12-02T18:13:28Z","published":"2024-09-09T14:44:19Z","title":"RIRAG: Regulatory Information Retrieval and Answer Generation","summary":" Regulatory documents, issued by governmental regulatory bodies, establish\nrules, guidelines, and standards that organizations must adhere to for legal\ncompliance. 
These documents, characterized by their length, complexity and\nfrequent updates, are challenging to interpret, requiring significant\nallocation of time and expertise on the part of organizations to ensure ongoing\ncompliance. Regulatory Natural Language Processing (RegNLP) is a\nmultidisciplinary field aimed at simplifying access to and interpretation of\nregulatory rules and obligations. We introduce a task of generating\nquestion-passages pairs, where questions are automatically created and paired\nwith relevant regulatory passages, facilitating the development of regulatory\nquestion-answering systems. We create the ObliQA dataset, containing 27,869\nquestions derived from the collection of Abu Dhabi Global Markets (ADGM)\nfinancial regulation documents, design a baseline Regulatory Information\nRetrieval and Answer Generation (RIRAG) system and evaluate it with RePASs, a\nnovel evaluation metric that tests whether generated answers accurately capture\nall relevant obligations while avoiding contradictions.\n","authors":["Tuba Gokhan","Kexin Wang","Iryna Gurevych","Ted Briscoe"],"pdf_url":"https://arxiv.org/pdf/2409.05677v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.17593v3","updated":"2024-12-02T17:43:20Z","published":"2024-11-26T17:01:27Z","title":"What Differentiates Educational Literature? A Multimodal Fusion Approach\n of Transformers and Computational Linguistics","summary":" The integration of new literature into the English curriculum remains a\nchallenge since educators often lack scalable tools to rapidly evaluate\nreadability and adapt texts for diverse classroom needs. This study proposes to\naddress this gap through a multimodal approach that combines transformer-based\ntext classification with linguistic feature analysis to align texts with UK Key\nStages. Eight state-of-the-art Transformers were fine-tuned on segmented text\ndata, with BERT achieving the highest unimodal F1 score of 0.75. 
In parallel,\n500 deep neural network topologies were searched for the classification of\nlinguistic characteristics, achieving an F1 score of 0.392. The fusion of these\nmodalities shows a significant improvement, with every multimodal approach\noutperforming all unimodal models. In particular, the ELECTRA Transformer fused\nwith the neural network achieved an F1 score of 0.996. Unimodal and multimodal\napproaches are shown to have statistically significant differences in all\nvalidation metrics (accuracy, precision, recall, F1 score) except for inference\ntime. The proposed approach is finally encapsulated in a stakeholder-facing web\napplication, providing non-technical stakeholder access to real-time insights\non text complexity, reading difficulty, curriculum alignment, and\nrecommendations for learning age range. The application empowers data-driven\ndecision making and reduces manual workload by integrating AI-based\nrecommendations into lesson planning for English literature.\n","authors":["Jordan J. Bird"],"pdf_url":"https://arxiv.org/pdf/2411.17593v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19855v2","updated":"2024-12-02T16:39:26Z","published":"2024-11-29T17:10:33Z","title":"Artificial intelligence contribution to translation industry: looking\n back and forward","summary":" This study provides a comprehensive analysis of artificial intelligence (AI)\ncontribution to translation industry (ACTI) research, synthesizing it over\nforty-one years from 1980-2024. 13220 articles were retrieved from three\nsources, namely WoS, Scopus, and Lens. We provided two types of analysis, viz.,\nscientometric and thematic, focusing on cluster, subject categories, keywords,\nburstness, centrality and research centers as for the former. For the latter,\nwe thematically review 18 articles, selected purposefully from the articles\ninvolved, centering on purpose, approach, findings, and contribution to ACTI\nfuture directions. 
The findings reveal that in the past AI contribution to\ntranslation industry was not rigorous, resulting in rule-based machine\ntranslation and statistical machine translation whose output was not\nsatisfactory. However, the more AI develops, the more machine translation\ndevelops, incorporating Neural Networking Algorithms and (Deep) Language\nLearning Models like ChatGPT whose translation output has developed\nconsiderably. However, much rigorous research is still needed to overcome\nseveral problems encountering translation industry, specifically concerning\nlow-source languages, multi-dialectical and free word order languages, and\ncultural and religious registers.\n","authors":["Mohammed Q. Shormani"],"pdf_url":"https://arxiv.org/pdf/2411.19855v2.pdf","comment":"20 pages, 4 figures"},{"id":"http://arxiv.org/abs/2409.19839v3","updated":"2024-12-02T16:27:16Z","published":"2024-09-30T00:41:51Z","title":"ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities","summary":" Forecasts of future events are essential inputs into informed\ndecision-making. Machine learning (ML) systems have the potential to deliver\nforecasts at scale, but there is no framework for evaluating the accuracy of ML\nsystems on a standardized set of forecasting questions. To address this gap, we\nintroduce ForecastBench: a dynamic benchmark that evaluates the accuracy of ML\nsystems on an automatically generated and regularly updated set of 1,000\nforecasting questions. To avoid any possibility of data leakage, ForecastBench\nis comprised solely of questions about future events that have no known answer\nat the time of submission. We quantify the capabilities of current ML systems\nby collecting forecasts from expert (human) forecasters, the general public,\nand LLMs on a random subset of questions from the benchmark ($N=200$). 
While\nLLMs have achieved super-human performance on many benchmarks, they perform\nless well here: expert forecasters outperform the top-performing LLM (p-value\n$<0.01$). We display system and human scores in a public leaderboard at\nwww.forecastbench.org.\n","authors":["Ezra Karger","Houtan Bastani","Chen Yueh-Han","Zachary Jacobs","Danny Halawi","Fred Zhang","Philip E. Tetlock"],"pdf_url":"https://arxiv.org/pdf/2409.19839v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.17607v2","updated":"2024-12-02T16:13:24Z","published":"2024-11-26T17:19:09Z","title":"Scaling Speech-Text Pre-training with Synthetic Interleaved Data","summary":" Speech language models (SpeechLMs) accept speech input and produce speech\noutput, allowing for more natural human-computer interaction compared to\ntext-based large language models (LLMs). Traditional approaches for developing\nSpeechLMs are constrained by the limited availability of unsupervised speech\ndata and parallel speech-text data, which are significantly less abundant than\ntext pre-training data, thereby limiting their scalability as LLMs. We propose\na novel approach to scaling speech-text pre-training by leveraging large-scale\nsynthetic interleaved data derived from text corpora, eliminating the need for\nparallel speech-text datasets. Our method efficiently constructs speech-text\ninterleaved data by sampling text spans from existing text corpora and\nsynthesizing corresponding speech spans using a text-to-token model, bypassing\nthe need to generate actual speech. We also employ a supervised speech\ntokenizer derived from an automatic speech recognition (ASR) model by\nincorporating a vector-quantized bottleneck into the encoder. This supervised\ntraining approach results in discrete speech tokens with strong semantic\npreservation even at lower frame rates (e.g. 12.5Hz), while still maintaining\nspeech reconstruction quality. 
Starting from a pre-trained language model and\nscaling our pre-training to 1 trillion tokens (with 600B synthetic interleaved\nspeech-text data), we achieve state-of-the-art performance in speech language\nmodeling and spoken question answering, improving performance on spoken\nquestions tasks from the previous SOTA of 13% (Moshi) to 31%. We further\ndemonstrate that by fine-tuning the pre-trained model with speech dialogue\ndata, we can develop an end-to-end spoken chatbot that achieves competitive\nperformance comparable to existing baselines in both conversational abilities\nand speech quality, even operating exclusively in the speech domain.\n","authors":["Aohan Zeng","Zhengxiao Du","Mingdao Liu","Lei Zhang","Shengmin Jiang","Yuxiao Dong","Jie Tang"],"pdf_url":"https://arxiv.org/pdf/2411.17607v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.12850v2","updated":"2024-12-02T15:46:35Z","published":"2024-07-08T09:50:49Z","title":"Limits to Predicting Online Speech Using Large Language Models","summary":" We study the predictability of online speech on social media, and whether\npredictability improves with information outside a user's own posts. Recent\ntheoretical results suggest that posts from a user's social circle are as\npredictive of the user's future posts as that of the user's past posts.\nMotivated by the success of large language models, we empirically test this\nhypothesis. We define predictability as a measure of the model's uncertainty,\ni.e., its negative log-likelihood on future tokens given context. As the basis\nof our study, we collect 10M tweets for ``tweet-tuning'' base models and a\nfurther 6.25M posts from more than five thousand X (previously Twitter) users\nand their peers. Across four large language models ranging in size from 1.5\nbillion to 70 billion parameters, we find that predicting a user's posts from\ntheir peers' posts performs poorly. 
Moreover, the value of the user's own posts\nfor prediction is consistently higher than that of their peers'. We extend our\ninvestigation with a detailed analysis on what's learned in-context and the\nrobustness of our findings. From context, base models learn to correctly\npredict @-mentions and hashtags. Moreover, our results replicate if instead of\nprompting the model with additional context, we finetune on it. Across the\nboard, we find that predicting the posts of individual users remains hard.\n","authors":["Mina Remeli","Moritz Hardt","Robert C. Williamson"],"pdf_url":"https://arxiv.org/pdf/2407.12850v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.10886v2","updated":"2024-12-02T15:40:45Z","published":"2024-02-16T18:43:10Z","title":"Reviewer2: Optimizing Review Generation Through Prompt Generation","summary":" Recent developments in LLMs offer new opportunities for assisting authors in\nimproving their work. In this paper, we envision a use case where authors can\nreceive LLM-generated reviews that uncover weak points in the current draft.\nWhile initial methods for automated review generation already exist, these\nmethods tend to produce reviews that lack detail, and they do not cover the\nrange of opinions that human reviewers produce. To address this shortcoming, we\npropose an efficient two-stage review generation framework called Reviewer2.\nUnlike prior work, this approach explicitly models the distribution of possible\naspects that the review may address. We show that this leads to more detailed\nreviews that better cover the range of aspects that human reviewers identify in\nthe draft. 
As part of the research, we generate a large-scale review dataset of\n27k papers and 99k reviews that we annotate with aspect prompts, which we make\navailable as a resource for future research.\n","authors":["Zhaolin Gao","Kianté Brantley","Thorsten Joachims"],"pdf_url":"https://arxiv.org/pdf/2402.10886v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.06562v2","updated":"2024-12-02T15:32:41Z","published":"2023-12-11T17:46:44Z","title":"On Meta-Prompting","summary":" Modern generative language models are capable of interpreting input strings\nas instructions, or prompts, and carry out tasks based on them. Many approaches\nto prompting and pre-training these models involve the automated generation of\nthese prompts: meta-prompting, or prompting to obtain prompts. We propose a\ntheoretical framework based on category theory to generalize and describe them.\nThis framework is flexible enough to account for stochasticity, and allows us\nto obtain formal results around task agnosticity and equivalence of various\nmeta-prompting approaches. Experimentally, we test our framework in two active\nareas of model research: creativity and ideation. We find that user preference\nstrongly favors (p < 0.01) the prompts generated under meta-prompting, as well\nas their corresponding outputs, over a series of hardcoded baseline prompts\nthat include the original task definition. 
Using our framework, we argue that\nmeta-prompting is more effective than basic prompting at generating desirable\noutputs.\n","authors":["Adrian de Wynter","Xun Wang","Qilong Gu","Si-Qing Chen"],"pdf_url":"https://arxiv.org/pdf/2312.06562v2.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2409.13730v2","updated":"2024-12-02T15:11:23Z","published":"2024-09-10T01:20:26Z","title":"VisScience: An Extensive Benchmark for Evaluating K12 Educational\n Multi-modal Scientific Reasoning","summary":" Multi-modal large language models (MLLMs) have demonstrated promising\ncapabilities across various tasks by integrating textual and visual information\nto achieve visual understanding in complex scenarios. Despite the availability\nof several benchmarks aims to evaluating MLLMs in tasks from visual question\nanswering to complex problem-solving, most focus predominantly on mathematics\nor general visual understanding tasks. This reveals a critical gap in current\nbenchmarks, which often overlook the inclusion of other key scientific\ndisciplines such as physics and chemistry. To address this gap, we meticulously\nconstruct a comprehensive benchmark, named VisScience, which is utilized to\nassess the multi-modal scientific reasoning across the three disciplines of\nmathematics, physics, and chemistry. This benchmark comprises 3,000 questions\ndrawn from K12 education - spanning elementary school through high school -\nequally distributed across three disciplines, with 1,000 questions per\ndiscipline. The questions within VisScience span 21 distinct subjects and are\ncategorized into five difficulty levels, offering a broad spectrum of topics\nwithin each discipline. With VisScience, we present a detailed evaluation of\nthe performance of 25 representative MLLMs in scientific reasoning.\nExperimental results demonstrate that closed-source MLLMs generally outperform\nopen-source models. 
The best performance observed include a 53.4\\% accuracy in\nmathematics by Claude3.5-Sonnet, 38.2\\% in physics by GPT-4o, and 47.0\\% in\nchemistry by Gemini-1.5-Pro. These results underscore the strengths and\nlimitations of MLLMs, suggesting areas for future improvement and highlighting\nthe importance of developing models that can effectively handle the diverse\ndemands of multi-modal scientific reasoning.\n","authors":["Zhihuan Jiang","Zhen Yang","Jinhao Chen","Zhengxiao Du","Weihan Wang","Bin Xu","Jie Tang"],"pdf_url":"https://arxiv.org/pdf/2409.13730v2.pdf","comment":"89 pages, 70 figures"},{"id":"http://arxiv.org/abs/2409.13729v2","updated":"2024-12-02T14:59:08Z","published":"2024-09-10T01:20:22Z","title":"MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large\n Language Model","summary":" Large language models (LLMs) have demonstrated significant capabilities in\nmathematical reasoning, particularly with text-based mathematical problems.\nHowever, current multi-modal large language models (MLLMs), especially those\nspecialized in mathematics, tend to focus predominantly on solving geometric\nproblems but ignore the diversity of visual information available in other\nareas of mathematics. Moreover, the geometric information for these specialized\nmathematical MLLMs is derived from several public datasets, which are typically\nlimited in diversity and complexity. To address these limitations, we aim to\nconstruct a fine-tuning dataset named MathVL, and develop a series of\nspecialized mathematical MLLMs termed MathGLM-Vision by conducting Supervised\nFine-Tuning (SFT) on MathVL with various parameter-scale backbones. To\nextensively evaluate the effectiveness of MathGLM-Vision, we conduct\nexperiments on several public benchmarks and our curated MathVL-test consisting\nof 2,000 problems. 
Experimental results demonstrate that MathGLM-Vision\nachieves significant improvements compared with some existing models, including\nbackbone models and open-source mathematical MLLMs. These findings indicate the\nimportance of diversity dataset in enhancing the mathematical reasoning\nabilities of MLLMs.\n","authors":["Zhen Yang","Jinhao Chen","Zhengxiao Du","Wenmeng Yu","Weihan Wang","Wenyi Hong","Zhihuan Jiang","Bin Xu","Jie Tang"],"pdf_url":"https://arxiv.org/pdf/2409.13729v2.pdf","comment":"30 pages,19 figures"},{"id":"http://arxiv.org/abs/2411.19655v2","updated":"2024-12-02T14:28:07Z","published":"2024-11-29T12:21:15Z","title":"Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis","summary":" After the introduction of Large Language Models (LLMs), there have been\nsubstantial improvements in the performance of Natural Language Generation\n(NLG) tasks, including Text Summarization and Machine Translation. However,\nLLMs still produce outputs containing hallucinations, that is, content not\ngrounded in factual information. Therefore, developing methods to assess the\nfactuality of LLMs has become urgent.\n Indeed, resources for factuality evaluation have recently emerged. Although\nchallenging, these resources face one or more of the following limitations: (i)\nthey are tailored to a specific task or domain; (ii) they are limited in size,\nthereby preventing the training of new factuality evaluators; (iii) they are\ndesigned for simpler verification tasks, such as claim verification.\n To address these issues, we introduce LLM-Oasis, to the best of our knowledge\nthe largest resource for training end-to-end factuality evaluators. LLM-Oasis\nis constructed by extracting claims from Wikipedia, falsifying a subset of\nthese claims, and generating pairs of factual and unfactual texts. 
We then rely\non human annotators to both validate the quality of our dataset and to create a\ngold standard test set for benchmarking factuality evaluation systems.\n Our experiments demonstrate that LLM-Oasis presents a significant challenge\nfor state-of-the-art LLMs, with GPT-4o achieving up to 60% accuracy in our\nproposed end-to-end factuality evaluation task, highlighting its potential to\ndrive future research in the field.\n","authors":["Alessandro Scirè","Andrei Stefan Bejgu","Simone Tedeschi","Karim Ghonim","Federico Martelli","Roberto Navigli"],"pdf_url":"https://arxiv.org/pdf/2411.19655v2.pdf","comment":"15 pages. To be submitted to CL journal"},{"id":"http://arxiv.org/abs/2308.00802v4","updated":"2024-12-02T13:33:17Z","published":"2023-08-01T19:34:18Z","title":"GRDD: A Dataset for Greek Dialectal NLP","summary":" In this paper, we present a dataset for the computational study of a number\nof Modern Greek dialects. It consists of raw text data from four dialects of\nModern Greek, Cretan, Pontic, Northern Greek and Cypriot Greek. The dataset is\nof considerable size, albeit imbalanced, and presents the first attempt to\ncreate large scale dialectal resources of this type for Modern Greek dialects.\nWe then use the dataset to perform dialect identification. We experiment with\ntraditional ML algorithms, as well as simple DL architectures. The results show\nvery good performance on the task, potentially revealing that the dialects in\nquestion have distinct enough characteristics allowing even simple ML models to\nperform well on the task. 
Error analysis is performed for the top performing\nalgorithms showing that in a number of cases the errors are due to insufficient\ndataset cleaning.\n","authors":["Stergios Chatzikyriakidis","Chatrine Qwaider","Ilias Kolokousis","Christina Koula","Dimitris Papadakis","Efthymia Sakellariou"],"pdf_url":"https://arxiv.org/pdf/2308.00802v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.07123v2","updated":"2024-12-02T13:04:18Z","published":"2024-09-11T09:21:20Z","title":"Cross-Refine: Improving Natural Language Explanation Generation by\n Learning in Tandem","summary":" Natural language explanations (NLEs) are vital for elucidating the reasoning\nbehind large language model (LLM) decisions. Many techniques have been\ndeveloped to generate NLEs using LLMs. However, like humans, LLMs might not\nalways produce optimal NLEs on first attempt. Inspired by human learning\nprocesses, we introduce Cross-Refine, which employs role modeling by deploying\ntwo LLMs as generator and critic, respectively. The generator outputs a first\nNLE and then refines this initial explanation using feedback and suggestions\nprovided by the critic. Cross-Refine does not require any supervised training\ndata or additional training. We validate Cross-Refine across three NLP tasks\nusing three state-of-the-art open-source LLMs through automatic and human\nevaluation. We select Self-Refine (Madaan et al., 2023) as the baseline, which\nonly utilizes self-feedback to refine the explanations. Our findings from\nautomatic evaluation and a user study indicate that Cross-Refine outperforms\nSelf-Refine. Meanwhile, Cross-Refine can perform effectively with less powerful\nLLMs, whereas Self-Refine only yields strong results with ChatGPT.\nAdditionally, we conduct an ablation study to assess the importance of feedback\nand suggestions. 
Both of them play an important role in refining explanations.\nWe further evaluate Cross-Refine on a bilingual dataset in English and German.\n","authors":["Qianli Wang","Tatiana Anikina","Nils Feldhus","Simon Ostermann","Sebastian Möller","Vera Schmitt"],"pdf_url":"https://arxiv.org/pdf/2409.07123v2.pdf","comment":"Accepted at COLING 2025; long paper"},{"id":"http://arxiv.org/abs/2411.02272v4","updated":"2024-12-02T12:36:30Z","published":"2024-11-04T17:03:55Z","title":"Combining Induction and Transduction for Abstract Reasoning","summary":" When learning an input-output mapping from very few examples, is it better to\nfirst infer a latent function that explains the examples, or is it better to\ndirectly predict new test outputs, e.g. using a neural network? We study this\nquestion on ARC by training neural models for induction (inferring latent\nfunctions) and transduction (directly predicting the test output for a given\ntest input). We train on synthetically generated variations of Python programs\nthat solve ARC training tasks. We find inductive and transductive models solve\ndifferent kinds of test problems, despite having the same training problems and\nsharing the same neural architecture: Inductive program synthesis excels at\nprecise computations, and at composing multiple concepts, while transduction\nsucceeds on fuzzier perceptual concepts. Ensembling them approaches human-level\nperformance on ARC.\n","authors":["Wen-Ding Li","Keya Hu","Carter Larsen","Yuqing Wu","Simon Alford","Caleb Woo","Spencer M. 
Dunn","Hao Tang","Michelangelo Naim","Dat Nguyen","Wei-Long Zheng","Zenna Tavares","Yewen Pu","Kevin Ellis"],"pdf_url":"https://arxiv.org/pdf/2411.02272v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.07818v5","updated":"2024-12-02T12:29:47Z","published":"2024-02-12T17:24:15Z","title":"Differentially Private Zeroth-Order Methods for Scalable Large Language\n Model Finetuning","summary":" Fine-tuning on task-specific datasets is a widely-embraced paradigm of\nharnessing the powerful capability of pretrained LLMs for various downstream\ntasks. Due to the popularity of LLMs fine-tuning and its accompanying privacy\nconcerns, differentially private (DP) fine-tuning of pretrained LLMs has been\nwidely used to safeguarding the privacy of task-specific datasets. Lying at the\ndesign core of DP LLM fine-tuning methods is the satisfactory tradeoff among\nprivacy, utility, and scalability. Most existing methods build upon the seminal\nwork of DP-SGD. Despite pushing the scalability of DP-SGD to its limit,\nDP-SGD-based fine-tuning methods are unfortunately limited by the inherent\ninefficiency of SGD.\n In this paper, we investigate the potential of DP zeroth-order methods for\nLLM pretraining, which avoids the scalability bottleneck of SGD by\napproximating the gradient with the more efficient zeroth-order gradient.\nRather than treating the zeroth-order method as a drop-in replacement for SGD,\nthis paper presents a comprehensive study both theoretically and empirically.\nFirst, we propose the stagewise DP zeroth-order method (DP-ZOSO) that\ndynamically schedules key hyperparameters. This design is grounded on the\nsynergy between DP random perturbation and the gradient approximation error of\nthe zeroth-order method, and its effect on fine-tuning trajectory.\n We provide theoretical analysis for both proposed methods. 
We conduct\nextensive empirical analysis on both encoder-only masked language model and\ndecoder-only autoregressive language model, achieving impressive results in\nterms of scalability and utility regardless of the class of tasks (compared\nwith DPZero, DP-ZOPO improves $4.5\\%$ on SST-5, $5.5\\%$ on MNLI with\nRoBERTa-Large and 9.2\\% on CB, 3.9\\% on BoolQ with OPT-2.7b when $\\epsilon=4$,\ndemonstrates more significant enhancement in performance on more complicated\ntasks).\n","authors":["Z Liu","J Lou","W Bao","Y Hu","B Li","Z Qin","K Ren"],"pdf_url":"https://arxiv.org/pdf/2402.07818v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.06892v2","updated":"2024-12-02T11:24:20Z","published":"2024-03-11T16:48:25Z","title":"Real-time Transformer-based Open-Vocabulary Detection with Efficient\n Fusion Head","summary":" End-to-end transformer-based detectors (DETRs) have shown exceptional\nperformance in both closed-set and open-vocabulary object detection (OVD) tasks\nthrough the integration of language modalities. However, their demanding\ncomputational requirements have hindered their practical application in\nreal-time object detection (OD) scenarios. In this paper, we scrutinize the\nlimitations of two leading models in the OVDEval benchmark, OmDet and\nGrounding-DINO, and introduce OmDet-Turbo. This novel transformer-based\nreal-time OVD model features an innovative Efficient Fusion Head (EFH) module\ndesigned to alleviate the bottlenecks observed in OmDet and Grounding-DINO.\nNotably, OmDet-Turbo-Base achieves a 100.2 frames per second (FPS) with\nTensorRT and language cache techniques applied. Notably, in zero-shot scenarios\non COCO and LVIS datasets, OmDet-Turbo achieves performance levels nearly on\npar with current state-of-the-art supervised models. Furthermore, it\nestablishes new state-of-the-art benchmarks on ODinW and OVDEval, boasting an\nAP of 30.1 and an NMS-AP of 26.86, respectively. 
The practicality of\nOmDet-Turbo in industrial applications is underscored by its exceptional\nperformance on benchmark datasets and superior inference speed, positioning it\nas a compelling choice for real-time object detection tasks. Code:\n\\url{https://github.com/om-ai-lab/OmDet}\n","authors":["Tiancheng Zhao","Peng Liu","Xuan He","Lu Zhang","Kyusong Lee"],"pdf_url":"https://arxiv.org/pdf/2403.06892v2.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2411.14708v2","updated":"2024-12-02T10:52:21Z","published":"2024-11-22T03:33:51Z","title":"Understanding LLM Embeddings for Regression","summary":" With the rise of large language models (LLMs) for flexibly processing\ninformation as strings, a natural application is regression, specifically by\npreprocessing string representations into LLM embeddings as downstream features\nfor metric prediction. In this paper, we provide one of the first comprehensive\ninvestigations into embedding-based regression and demonstrate that LLM\nembeddings as features can be better for high-dimensional regression tasks than\nusing traditional feature engineering. This regression performance can be\nexplained in part due to LLM embeddings over numeric data inherently preserving\nLipschitz continuity over the feature space. Furthermore, we quantify the\ncontribution of different model effects, most notably model size and language\nunderstanding, which we find surprisingly do not always improve regression\nperformance.\n","authors":["Eric Tang","Bangding Yang","Xingyou Song"],"pdf_url":"https://arxiv.org/pdf/2411.14708v2.pdf","comment":"16 pages, 13 figures"},{"id":"http://arxiv.org/abs/2403.01432v5","updated":"2024-12-02T10:48:36Z","published":"2024-03-03T08:07:55Z","title":"Fine Tuning vs. Retrieval Augmented Generation for Less Popular\n Knowledge","summary":" Language Models (LMs) memorize a vast amount of factual knowledge, exhibiting\nstrong performance across diverse tasks and domains. 
However, it has been\nobserved that the performance diminishes when dealing with less-popular or\nlow-frequency concepts and entities, for example in domain specific\napplications. The two prominent approaches to enhance the performance of LMs on\nlow-frequent topics are: Retrieval Augmented Generation (RAG) and fine-tuning\n(FT) over synthetic data. This paper explores and evaluates the impact of RAG\nand FT on customizing LMs in handling low-frequency entities on question\nanswering tasks. We conduct extensive experiments on twelve LMs of varying size\nand type and different fine tuning, data augmentation, and retrieval models.\nOur findings indicate that while FT boosts the performance across entities of\nvarying popularity, RAG surpasses FT by a large margin particularly for least\npopular factual knowledge. Additionally, the success of both RAG and FT\napproaches is amplified by improving retrieval and data augmentation\ntechniques. Fine tuning, while beneficial for small LMs, requires extensive\nresources. To address this issue, we propose the new Stimulus RAG approach that\nsurpasses the effectiveness of fine tuning based approaches, thereby\neliminating the need for the costly data augmentation and fine tuning step for\nenriching LMs with less popular factual knowledge. The code is available at\n\\url{https://github.com/informagi/RAGvsFT}.\n","authors":["Heydar Soudani","Evangelos Kanoulas","Faegheh Hasibi"],"pdf_url":"https://arxiv.org/pdf/2403.01432v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.19211v2","updated":"2024-12-02T10:44:08Z","published":"2024-03-28T08:19:33Z","title":"Dual-Personalizing Adapter for Federated Foundation Models","summary":" Recently, foundation models, particularly large language models (LLMs), have\ndemonstrated an impressive ability to adapt to various tasks by fine-tuning\ndiverse instruction data. 
Notably, federated foundation models (FedFM) emerge\nas a privacy preservation method to fine-tune models collaboratively under\nfederated learning (FL) settings by leveraging many distributed datasets with\nnon-IID data. To alleviate communication and computation overhead,\nparameter-efficient methods are introduced for efficiency, and some research\nadapted personalization methods to FedFM for better user preferences alignment.\nHowever, a critical gap in existing research is the neglect of test-time\ndistribution shifts in real-world applications, and conventional methods for\ntest-time distribution shifts in personalized FL are less effective for FedFM\ndue to their failure to adapt to complex distribution shift scenarios and the\nrequirement to train all parameters. To bridge this gap, we refine the setting\nin FedFM, termed test-time personalization, which aims to learn personalized\nfederated foundation models on clients while effectively handling test-time\ndistribution shifts simultaneously. To address challenges in this setting, we\nexplore a simple yet effective solution, a Federated Dual-Personalizing Adapter\n(FedDPA) architecture. By co-working with a foundation model, a global adapter\nand a local adapter jointly tackle the test-time distribution shifts and\nclient-specific personalization. Additionally, we introduce an instance-wise\ndynamic weighting mechanism that dynamically integrates the global and local\nadapters for each test instance during inference, facilitating effective\ntest-time personalization. 
The effectiveness of the proposed method has been\nevaluated on benchmark datasets across different NLP tasks.\n","authors":["Yiyuan Yang","Guodong Long","Tao Shen","Jing Jiang","Michael Blumenstein"],"pdf_url":"https://arxiv.org/pdf/2403.19211v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.14212v3","updated":"2024-12-02T10:30:50Z","published":"2024-01-25T14:53:30Z","title":"Explicitly Representing Syntax Improves Sentence-to-layout Prediction of\n Unexpected Situations","summary":" Recognizing visual entities in a natural language sentence and arranging them\nin a 2D spatial layout require a compositional understanding of language and\nspace. This task of layout prediction is valuable in text-to-image synthesis as\nit allows localized and controlled in-painting of the image. In this\ncomparative study it is shown that we can predict layouts from language\nrepresentations that implicitly or explicitly encode sentence syntax, if the\nsentences mention similar entity-relationships to the ones seen during\ntraining. To test compositional understanding, we collect a test set of\ngrammatically correct sentences and layouts describing compositions of entities\nand relations that unlikely have been seen during training. Performance on this\ntest set substantially drops, showing that current models rely on correlations\nin the training data and have difficulties in understanding the structure of\nthe input sentences. We propose a novel structural loss function that better\nenforces the syntactic structure of the input sentence and show large\nperformance gains in the task of 2D spatial layout prediction conditioned on\ntext. The loss has the potential to be used in other generation tasks where a\ntree-like structure underlies the conditioning modality. 
Code, trained models\nand the USCOCO evaluation set are available via github.\n","authors":["Wolf Nuyts","Ruben Cartuyvels","Marie-Francine Moens"],"pdf_url":"https://arxiv.org/pdf/2401.14212v3.pdf","comment":"Published in TACL"},{"id":"http://arxiv.org/abs/2409.06067v2","updated":"2024-12-02T10:18:38Z","published":"2024-09-09T21:04:16Z","title":"MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated\n Learning","summary":" Previous studies on federated learning (FL) often encounter performance\ndegradation due to data heterogeneity among different clients. In light of the\nrecent advances in multimodal large language models (MLLMs), such as GPT-4v and\nLLaVA, which demonstrate their exceptional proficiency in multimodal tasks,\nsuch as image captioning and multimodal question answering. We introduce a\nnovel federated learning framework, named Multimodal Large Language Model\nAssisted Federated Learning (MLLM-LLaVA-FL), which employs powerful MLLMs at\nthe server end to address the heterogeneous and long-tailed challenges. Owing\nto the advanced cross-modality representation capabilities and the extensive\nopen-vocabulary prior knowledge of MLLMs, our framework is adept at harnessing\nthe extensive, yet previously underexploited, open-source data accessible from\nwebsites and powerful server-side computational resources. Hence, the\nMLLM-LLaVA-FL not only enhances the performance but also avoids increasing the\nrisk of privacy leakage and the computational burden on local devices,\ndistinguishing it from prior methodologies. Our framework has three key stages.\nInitially, we conduct global visual-text pretraining of the model. This\npretraining is facilitated by utilizing the extensive open-source data\navailable online, with the assistance of MLLMs. Subsequently, the pretrained\nmodel is distributed among various clients for local training. 
Finally, once\nthe locally trained models are transmitted back to the server, a global\nalignment is carried out under the supervision of MLLMs to further enhance the\nperformance. Experimental evaluations on established benchmarks, show that our\nframework delivers promising performance in the typical scenarios with data\nheterogeneity and long-tail distribution across different clients in FL.\n","authors":["Jianyi Zhang","Hao Frank Yang","Ang Li","Xin Guo","Pu Wang","Haiming Wang","Yiran Chen","Hai Li"],"pdf_url":"https://arxiv.org/pdf/2409.06067v2.pdf","comment":"Accepted to WACV 2025"},{"id":"http://arxiv.org/abs/2407.04125v2","updated":"2024-12-02T09:42:24Z","published":"2024-07-04T18:54:30Z","title":"Query-Guided Self-Supervised Summarization of Nursing Notes","summary":" Nursing notes, an important part of Electronic Health Records (EHRs), track a\npatient's health during a care episode. Summarizing key information in nursing\nnotes can help clinicians quickly understand patients' conditions. However,\nexisting summarization methods in the clinical setting, especially abstractive\nmethods, have overlooked nursing notes and require reference summaries for\ntraining. We introduce QGSumm, a novel query-guided self-supervised domain\nadaptation approach for abstractive nursing note summarization. The method uses\npatient-related clinical queries for guidance, and hence does not need\nreference summaries for training. Through automatic experiments and manual\nevaluation by an expert clinician, we study our approach and other\nstate-of-the-art Large Language Models (LLMs) for nursing note summarization.\nOur experiments show: 1) GPT-4 is competitive in maintaining information in the\noriginal nursing notes, 2) QGSumm can generate high-quality summaries with a\ngood balance between recall of the original content and hallucination rate\nlower than other top methods. 
Ultimately, our work offers a new perspective on\nconditional text summarization, tailored to clinical applications.\n","authors":["Ya Gao","Hans Moen","Saila Koivusalo","Miika Koskinen","Pekka Marttinen"],"pdf_url":"https://arxiv.org/pdf/2407.04125v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.19916v3","updated":"2024-12-02T08:56:26Z","published":"2024-09-30T03:37:10Z","title":"Deep Learning and Machine Learning, Advancing Big Data Analytics and\n Management: Object-Oriented Programming","summary":" Object-Oriented Programming (OOP) has become a crucial paradigm for managing\nthe growing complexity of modern software systems, particularly in fields like\nmachine learning, deep learning, large language models (LLM), and data\nanalytics. This work provides a comprehensive introduction to the integration\nof OOP techniques within these domains, with a focus on improving code\nmodularity, maintainability, and scalability. We begin by outlining the\nevolution of computing and the rise of OOP, followed by an in-depth discussion\nof key OOP principles such as encapsulation, inheritance, polymorphism, and\nabstraction. The practical application of these principles is demonstrated\nusing Python, a widely adopted language in AI and data science. Furthermore, we\nexamine how design patterns and modular programming can be employed to enhance\nthe structure and efficiency of machine learning systems. In subsequent\nsections, we apply these OOP concepts to real-world AI tasks, including the\nencapsulation of preprocessing workflows, machine learning model training, and\nevaluation. 
Detailed examples illustrate how OOP can be used to build reusable,\nscalable machine learning systems while maintaining code clarity and reducing\nredundancy. This work is intended to serve as a bridge for both beginners and\nexperienced developers, equipping them with the necessary knowledge to apply\nOOP methodologies in AI-driven projects, ultimately fostering the development\nof more robust and maintainable systems.\n","authors":["Tianyang Wang","Ziqian Bi","Keyu Chen","Jiawei Xu","Qian Niu","Junyu Liu","Benji Peng","Ming Li","Sen Zhang","Xuanhe Pan","Jinlang Wang","Pohsun Feng","Caitlyn Heqi Yin","Yizhu Wen","Ming Liu"],"pdf_url":"https://arxiv.org/pdf/2409.19916v3.pdf","comment":"49 pages"},{"id":"http://arxiv.org/abs/2409.09318v2","updated":"2024-12-02T08:51:09Z","published":"2024-09-14T05:31:29Z","title":"ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language\n Models","summary":" Hallucination poses a persistent challenge for multimodal large language\nmodels (MLLMs). However, existing benchmarks for evaluating hallucinations are\ngenerally static, which may overlook the potential risk of data contamination.\nTo address this issue, we propose ODE, an open-set, dynamic protocol designed\nto evaluate object hallucinations in MLLMs at both the existence and attribute\nlevels. ODE employs a graph-based structure to represent real-world object\nconcepts, their attributes, and the distributional associations between them.\nThis structure facilitates the extraction of concept combinations based on\ndiverse distributional criteria, generating varied samples for structured\nqueries that evaluate hallucinations in both generative and discriminative\ntasks. Through the generation of new samples, dynamic concept combinations, and\nvaried distribution frequencies, ODE mitigates the risk of data contamination\nand broadens the scope of evaluation. 
This protocol is applicable to both\ngeneral and specialized scenarios, including those with limited data.\nExperimental results demonstrate the effectiveness of our protocol, revealing\nthat MLLMs exhibit higher hallucination rates when evaluated with ODE-generated\nsamples, which indicates potential data contamination. Furthermore, these\ngenerated samples aid in analyzing hallucination patterns and fine-tuning\nmodels, offering an effective approach to mitigating hallucinations in MLLMs.\n","authors":["Yahan Tu","Rui Hu","Jitao Sang"],"pdf_url":"https://arxiv.org/pdf/2409.09318v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.01077v2","updated":"2024-12-02T08:47:24Z","published":"2024-04-01T12:19:08Z","title":"Efficient Prompting Methods for Large Language Models: A Survey","summary":" Prompting is a mainstream paradigm for adapting large language models to\nspecific natural language processing tasks without modifying internal\nparameters. Therefore, detailed supplementary knowledge needs to be integrated\ninto external prompts, which inevitably brings extra human efforts and\ncomputational burdens for practical applications. As an effective solution to\nmitigate resource consumption, Efficient Prompting Methods have attracted a\nwide range of attention. We provide mathematical expressions at a high level to\ndeeply discuss Automatic Prompt Engineering for different prompt components and\nPrompt Compression in continuous and discrete spaces. 
Finally, we highlight\npromising future directions to inspire researchers interested in this field.\n","authors":["Kaiyan Chang","Songcheng Xu","Chenglong Wang","Yingfeng Luo","Xiaoqian Liu","Tong Xiao","Jingbo Zhu"],"pdf_url":"https://arxiv.org/pdf/2404.01077v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.01380v2","updated":"2024-12-02T08:43:16Z","published":"2024-10-02T09:49:45Z","title":"Knowledge Entropy Decay during Language Model Pretraining Hinders New\n Knowledge Acquisition","summary":" In this work, we investigate how a model's tendency to broadly integrate its\nparametric knowledge evolves throughout pretraining, and how this behavior\naffects overall performance, particularly in terms of knowledge acquisition and\nforgetting. We introduce the concept of knowledge entropy, which quantifies the\nrange of memory sources the model engages with; high knowledge entropy\nindicates that the model utilizes a wide range of memory sources, while low\nknowledge entropy suggests reliance on specific sources with greater certainty.\nOur analysis reveals a consistent decline in knowledge entropy as pretraining\nadvances. We also find that the decline is closely associated with a reduction\nin the model's ability to acquire and retain knowledge, leading us to conclude\nthat diminishing knowledge entropy (smaller number of active memory sources)\nimpairs the model's knowledge acquisition and retention capabilities. 
We find\nfurther support for this by demonstrating that increasing the activity of\ninactive memory sources enhances the model's capacity for knowledge acquisition\nand retention.\n","authors":["Jiyeon Kim","Hyunji Lee","Hyowon Cho","Joel Jang","Hyeonbin Hwang","Seungpil Won","Youbin Ahn","Dohaeng Lee","Minjoon Seo"],"pdf_url":"https://arxiv.org/pdf/2410.01380v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.02408v2","updated":"2024-12-02T07:47:00Z","published":"2024-02-04T08:57:54Z","title":"GLaPE: Gold Label-agnostic Prompt Evaluation and Optimization for Large\n Language Model","summary":" Despite the rapid progress of large language models (LLMs), their task\nperformance remains sensitive to prompt design. Recent studies have explored\nleveraging the LLM itself as an optimizer to identify optimal prompts that\nmaximize task accuracy. However, when evaluating prompts, such approaches\nheavily rely on elusive manually annotated gold labels to calculate task\naccuracy for each candidate prompt, which hinders the widespread implementation\nand generality. To overcome the limitation, this work proposes a gold\nlabel-agnostic prompt evaluation (GLaPE) to alleviate dependence on gold\nlabels. Motivated by the observed correlation between self-consistency and the\naccuracy of the answer, we adopt self-consistency as the initial evaluation\nscore. Subsequently, we refine the scores of prompts producing identical\nanswers to be mutually consistent. Experimental results show that GLaPE\nprovides reliable evaluations uniform with accuracy, even in the absence of\ngold labels. Moreover, on six popular reasoning tasks, our GLaPE-based prompt\noptimization yields effective prompts comparable to accuracy-based ones. 
The\ncode is publicly available at https://github.com/thunderous77/GLaPE.\n","authors":["Xuanchang Zhang","Zhuosheng Zhang","Hai Zhao"],"pdf_url":"https://arxiv.org/pdf/2402.02408v2.pdf","comment":"EMNLP 2024"},{"id":"http://arxiv.org/abs/2403.12027v3","updated":"2024-12-02T07:22:40Z","published":"2024-03-18T17:57:09Z","title":"From Pixels to Insights: A Survey on Automatic Chart Understanding in\n the Era of Large Foundation Models","summary":" Data visualization in the form of charts plays a pivotal role in data\nanalysis, offering critical insights and aiding in informed decision-making.\nAutomatic chart understanding has witnessed significant advancements with the\nrise of large foundation models in recent years. Foundation models, such as\nlarge language models, have revolutionized various natural language processing\ntasks and are increasingly being applied to chart understanding tasks. This\nsurvey paper provides a comprehensive overview of the recent developments,\nchallenges, and future directions in chart understanding within the context of\nthese foundation models. We review fundamental building blocks crucial for\nstudying chart understanding tasks. Additionally, we explore various tasks and\ntheir evaluation metrics and sources of both charts and textual inputs. Various\nmodeling strategies are then examined, encompassing both classification-based\nand generation-based approaches, along with tool augmentation techniques that\nenhance chart understanding performance. Furthermore, we discuss the\nstate-of-the-art performance of each task and discuss how we can improve the\nperformance. Challenges and future directions are addressed, highlighting the\nimportance of several topics, such as domain-specific charts, lack of efforts\nin developing evaluation metrics, and agent-oriented settings. 
This survey\npaper serves as a comprehensive resource for researchers and practitioners in\nthe fields of natural language processing, computer vision, and data analysis,\nproviding valuable insights and directions for future research in chart\nunderstanding leveraging large foundation models. The studies mentioned in this\npaper, along with emerging new research, will be continually updated at:\nhttps://github.com/khuangaf/Awesome-Chart-Understanding.\n","authors":["Kung-Hsiang Huang","Hou Pong Chan","Yi R. Fung","Haoyi Qiu","Mingyang Zhou","Shafiq Joty","Shih-Fu Chang","Heng Ji"],"pdf_url":"https://arxiv.org/pdf/2403.12027v3.pdf","comment":"IEEE Transactions on Knowledge and Data Engineering (TKDE)"},{"id":"http://arxiv.org/abs/2411.19951v2","updated":"2024-12-02T06:54:47Z","published":"2024-11-29T18:59:54Z","title":"T2Vid: Translating Long Text into Multi-Image is the Catalyst for\n Video-LLMs","summary":" The success of Multimodal Large Language Models (MLLMs) in the image domain\nhas garnered wide attention from the research community. Drawing on previous\nsuccessful experiences, researchers have recently explored extending the\nsuccess to the video understanding realms. Apart from training from scratch, an\nefficient way is to utilize the pre-trained image-LLMs, leading to two\nmainstream approaches, i.e. zero-shot inference and further fine-tuning with\nvideo data. In this work, our study of these approaches harvests an effective\ndata augmentation method. We first make a deeper inspection of the zero-shot\ninference way and identify two limitations, i.e. limited generalization and\nlack of temporal understanding capabilities. Thus, we further investigate the\nfine-tuning approach and find a low learning efficiency when simply using all\nthe video data samples, which can be attributed to a lack of instruction\ndiversity. 
Aiming at this issue, we develop a method called T2Vid to synthesize\nvideo-like samples to enrich the instruction diversity in the training corpus.\nIntegrating these data enables a simple and efficient training scheme, which\nachieves performance comparable to or even superior to using full video\ndatasets by training with just 15% the sample size. Meanwhile, we find that the\nproposed scheme can boost the performance of long video understanding without\ntraining with long video samples. We hope our study will spark more thinking\nabout using MLLMs for video understanding and curation of high-quality data.\nThe code is released at https://github.com/xjtupanda/T2Vid.\n","authors":["Shukang Yin","Chaoyou Fu","Sirui Zhao","Yunhang Shen","Chunjiang Ge","Yan Yang","Zuwei Long","Yuhan Dai","Tong Xu","Xing Sun","Ran He","Caifeng Shan","Enhong Chen"],"pdf_url":"https://arxiv.org/pdf/2411.19951v2.pdf","comment":"Project page: https://github.com/xjtupanda/T2Vid"},{"id":"http://arxiv.org/abs/2410.13025v2","updated":"2024-12-02T06:40:50Z","published":"2024-10-16T20:33:06Z","title":"LoRA Soups: Merging LoRAs for Practical Skill Composition Tasks","summary":" Low-Rank Adaptation (LoRA) is a popular technique for parameter-efficient\nfine-tuning of Large Language Models (LLMs). We study how different LoRA\nmodules can be merged to achieve skill composition -- testing the performance\nof the merged model on a target task that involves combining multiple skills,\neach skill coming from a single LoRA. This setup is favorable when it is\ndifficult to obtain training data for the target task and when it can be\ndecomposed into multiple skills. First, we identify practically occurring\nuse-cases that can be studied under the realm of skill composition, e.g.\nsolving hard math-word problems with code, creating a bot to answer questions\non proprietary manuals or about domain-specialized corpora. 
Our main\ncontribution is to show that concatenation of LoRAs (CAT), which optimally\nweights LoRAs that were individually trained on different skills, outperforms\nexisting model- and data- merging techniques; for instance on math-word\nproblems, CAT beats these methods by an average of 43% and 12% respectively.\nThus, this paper advocates model merging as an efficient way to solve\ncompositional tasks and underscores CAT as a simple, compute-friendly and\neffective procedure. To our knowledge, this is the first work demonstrating the\nsuperiority of model merging over data mixing for binary skill composition\ntasks. Code and data are available at https://github.com/aksh555/LoRA-Soups\n","authors":["Akshara Prabhakar","Yuanzhi Li","Karthik Narasimhan","Sham Kakade","Eran Malach","Samy Jelassi"],"pdf_url":"https://arxiv.org/pdf/2410.13025v2.pdf","comment":"COLING 2025 Industry track; 9 pages plus references and appendices"},{"id":"http://arxiv.org/abs/2411.19943v2","updated":"2024-12-02T06:26:38Z","published":"2024-11-29T18:58:22Z","title":"Critical Tokens Matter: Token-Level Contrastive Estimation Enhances\n LLM's Reasoning Capability","summary":" Large Language Models (LLMs) have exhibited remarkable performance on\nreasoning tasks. They utilize autoregressive token generation to construct\nreasoning trajectories, enabling the development of a coherent chain of\nthought. In this work, we explore the impact of individual tokens on the final\noutcomes of reasoning tasks. We identify the existence of ``critical tokens''\nthat lead to incorrect reasoning trajectories in LLMs. Specifically, we find\nthat LLMs tend to produce positive outcomes when forced to decode other tokens\ninstead of critical tokens. Motivated by this observation, we propose a novel\napproach - cDPO - designed to automatically recognize and conduct token-level\nrewards for the critical tokens during the alignment process. 
Specifically, we\ndevelop a contrastive estimation approach to automatically identify critical\ntokens. It is achieved by comparing the generation likelihood of positive and\nnegative models. To achieve this, we separately fine-tune the positive and\nnegative models on various reasoning trajectories, so that they are\ncapable of identifying critical tokens within incorrect trajectories\nthat contribute to erroneous outcomes. Moreover, to further align the model\nwith the critical token information during the alignment process, we extend the\nconventional DPO algorithms to token-level DPO and utilize the differential\nlikelihood from the aforementioned positive and negative model as important\nweight for token-level DPO learning. Experimental results on GSM8K and MATH500\nbenchmarks with two widely used models Llama-3 (8B and 70B) and deepseek-math\n(7B) demonstrate the effectiveness of the proposed approach cDPO.\n","authors":["Zicheng Lin","Tian Liang","Jiahao Xu","Xing Wang","Ruilin Luo","Chufan Shi","Siheng Li","Yujiu Yang","Zhaopeng Tu"],"pdf_url":"https://arxiv.org/pdf/2411.19943v2.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2406.11285v2","updated":"2024-12-02T05:22:01Z","published":"2024-06-17T07:46:45Z","title":"Self and Cross-Model Distillation for LLMs: Effective Methods for\n Refusal Pattern Alignment","summary":" Large Language Models (LLMs) like OpenAI's GPT series, Anthropic's Claude,\nand Meta's LLaMa have shown remarkable capabilities in text generation.\nHowever, their susceptibility to toxic prompts presents significant security\nchallenges. This paper investigates alignment techniques, including Supervised\nFine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), to\nmitigate these risks. We conduct an empirical study on refusal patterns across\nnine LLMs, revealing that models with uniform refusal patterns, such as\nClaude3, exhibit higher security. 
Based on these findings, we propose\nself-distilling and cross-model distilling methods to enhance LLM security. Our\nresults show that these methods significantly improve refusal rates and reduce\nunsafe content, with cross-model distilling achieving refusal rates close to\nClaude3's 94.51%. These findings underscore the potential of distillation-based\nalignment in securing LLMs against toxic prompts.\n","authors":["Jie Li","Yi Liu","Chongyang Liu","Xiaoning Ren","Ling Shi","Weisong Sun","Yinxing Xue"],"pdf_url":"https://arxiv.org/pdf/2406.11285v2.pdf","comment":"The method used in the paper has obvious problems and ambiguities.\n The security enhancement method we used cannot be considered distillation,\n but it is described as distillation in the paper, and the experiment lacks\n comparison and baseline, which has been criticized by many peers. In order to\n avoid further dissemination, we have decided to withdraw the paper"},{"id":"http://arxiv.org/abs/2411.18203v2","updated":"2024-12-02T05:00:19Z","published":"2024-11-27T10:28:57Z","title":"Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning","summary":" Vision-language models (VLMs) have shown remarkable advancements in\nmultimodal reasoning tasks. However, they still often generate inaccurate or\nirrelevant responses due to issues like hallucinated image understandings or\nunrefined reasoning paths. To address these challenges, we introduce Critic-V,\na novel framework inspired by the Actor-Critic paradigm to boost the reasoning\ncapability of VLMs. This framework decouples the reasoning process and critic\nprocess by integrating two independent components: the Reasoner, which\ngenerates reasoning paths based on visual and textual inputs, and the Critic,\nwhich provides constructive critique to refine these paths. In this approach,\nthe Reasoner generates reasoning responses according to text prompts, which can\nevolve iteratively as a policy based on feedback from the Critic. 
This\ninteraction process was theoretically driven by a reinforcement learning\nframework where the Critic offers natural language critiques instead of scalar\nrewards, enabling more nuanced feedback to boost the Reasoner's capability on\ncomplex reasoning tasks. The Critic model is trained using Direct Preference\nOptimization (DPO), leveraging a preference dataset of critiques ranked by\nRule-based Reward~(RBR) to enhance its critic capabilities. Evaluation results\nshow that the Critic-V framework significantly outperforms existing methods,\nincluding GPT-4V, on 5 out of 8 benchmarks, especially regarding reasoning\naccuracy and efficiency. Combining a dynamic text-based policy for the Reasoner\nand constructive feedback from the preference-optimized Critic enables a more\nreliable and context-sensitive multimodal reasoning process. Our approach\nprovides a promising solution to enhance the reliability of VLMs, improving\ntheir performance in real-world reasoning-heavy multimodal applications such as\nautonomous driving and embodied intelligence.\n","authors":["Di Zhang","Junxian Li","Jingdi Lei","Xunzhi Wang","Yujie Liu","Zonglin Yang","Jiatong Li","Weida Wang","Suorong Yang","Jianbo Wu","Peng Ye","Wanli Ouyang","Dongzhan Zhou"],"pdf_url":"https://arxiv.org/pdf/2411.18203v2.pdf","comment":"16 pages, 11 figures"},{"id":"http://arxiv.org/abs/2411.07656v2","updated":"2024-12-02T04:36:45Z","published":"2024-11-12T09:14:16Z","title":"Mitigating Bias in Queer Representation within Large Language Models: A\n Collaborative Agent Approach","summary":" Large Language Models (LLMs) often perpetuate biases in pronoun usage,\nleading to misrepresentation or exclusion of queer individuals. This paper\naddresses the specific problem of biased pronoun usage in LLM outputs,\nparticularly the inappropriate use of traditionally gendered pronouns (\"he,\"\n\"she\") when inclusive language is needed to accurately represent all\nidentities. 
We introduce a collaborative agent pipeline designed to mitigate\nthese biases by analyzing and optimizing pronoun usage for inclusivity. Our\nmulti-agent framework includes specialized agents for both bias detection and\ncorrection. Experimental evaluations using the Tango dataset-a benchmark\nfocused on gender pronoun usage-demonstrate that our approach significantly\nimproves inclusive pronoun classification, achieving a 32.6 percentage point\nincrease over GPT-4o in correctly disagreeing with inappropriate traditionally\ngendered pronouns $(\\chi^2 = 38.57, p < 0.0001)$. These results accentuate the\npotential of agent-driven frameworks in enhancing fairness and inclusivity in\nAI-generated content, demonstrating their efficacy in reducing biases and\npromoting socially responsible AI.\n","authors":["Tianyi Huang","Arya Somasundaram"],"pdf_url":"https://arxiv.org/pdf/2411.07656v2.pdf","comment":"NeurIPS 2024 Queer in AI Workshop"},{"id":"http://arxiv.org/abs/2406.12336v2","updated":"2024-12-02T04:08:49Z","published":"2024-06-18T07:03:34Z","title":"Towards Understanding Domain Adapted Sentence Embeddings for Document\n Retrieval","summary":" A plethora of sentence embedding models makes it challenging to choose one,\nespecially for technical domains rich with specialized vocabulary. In this\nwork, we domain adapt embeddings using telecom, health and science datasets for\nquestion answering. We evaluate embeddings obtained from publicly available\nmodels and their domain-adapted variants, on both point retrieval accuracies,\nas well as their (95\\%) confidence intervals. We establish a systematic method\nto obtain thresholds for similarity scores for different embeddings. As\nexpected, we observe that fine-tuning improves mean bootstrapped accuracies. We\nalso observe that it results in tighter confidence intervals, which further\nimprove when pre-training is preceded by fine-tuning. 
We introduce metrics\nwhich measure the distributional overlaps of top-$K$, correct and random\ndocument similarities with the question. Further, we show that these metrics\nare correlated with retrieval accuracy and similarity thresholds. Recent\nliterature shows conflicting effects of isotropy on retrieval accuracies. Our\nexperiments establish that the isotropy of embeddings (as measured by two\nindependent state-of-the-art isotropy metric definitions) is poorly correlated\nwith retrieval performance. We show that embeddings for domain-specific\nsentences have little overlap with those for domain-agnostic ones, and\nfine-tuning moves them further apart. Based on our results, we provide\nrecommendations for use of our methodology and metrics by researchers and\npractitioners.\n","authors":["Sujoy Roychowdhury","Sumit Soman","H. G. Ranjani","Vansh Chhabra","Neeraj Gunda","Shashank Gautam","Subhadip Bandyopadhyay","Sai Krishna Bala"],"pdf_url":"https://arxiv.org/pdf/2406.12336v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.12707v3","updated":"2024-12-02T03:45:42Z","published":"2024-07-17T16:30:27Z","title":"TTSDS -- Text-to-Speech Distribution Score","summary":" Many recently published Text-to-Speech (TTS) systems produce audio close to\nreal speech. However, TTS evaluation needs to be revisited to make sense of the\nresults obtained with the new architectures, approaches and datasets. We\npropose evaluating the quality of synthetic speech as a combination of multiple\nfactors such as prosody, speaker identity, and intelligibility. Our approach\nassesses how well synthetic speech mirrors real speech by obtaining correlates\nof each factor and measuring their distance from both real speech datasets and\nnoise datasets. 
We benchmark 35 TTS systems developed between 2008 and 2024 and\nshow that our score computed as an unweighted average of factors strongly\ncorrelates with the human evaluations from each time period.\n","authors":["Christoph Minixhofer","Ondřej Klejch","Peter Bell"],"pdf_url":"https://arxiv.org/pdf/2407.12707v3.pdf","comment":"SLT 2024"},{"id":"http://arxiv.org/abs/2404.01245v3","updated":"2024-12-02T03:27:10Z","published":"2024-04-01T17:03:41Z","title":"A Statistical Framework of Watermarks for Large Language Models: Pivot,\n Detection Efficiency and Optimal Rules","summary":" Since ChatGPT was introduced in November 2022, embedding (nearly)\nunnoticeable statistical signals into text generated by large language models\n(LLMs), also known as watermarking, has been used as a principled approach to\nprovable detection of LLM-generated text from its human-written counterpart. In\nthis paper, we introduce a general and flexible framework for reasoning about\nthe statistical efficiency of watermarks and designing powerful detection\nrules. Inspired by the hypothesis testing formulation of watermark detection,\nour framework starts by selecting a pivotal statistic of the text and a secret\nkey -- provided by the LLM to the verifier -- to enable controlling the false\npositive rate (the error of mistakenly detecting human-written text as\nLLM-generated). Next, this framework allows one to evaluate the power of\nwatermark detection rules by obtaining a closed-form expression of the\nasymptotic false negative rate (the error of incorrectly classifying\nLLM-generated text as human-written). Our framework further reduces the problem\nof determining the optimal detection rule to solving a minimax optimization\nprogram. We apply this framework to two representative watermarks -- one of\nwhich has been internally implemented at OpenAI -- and obtain several findings\nthat can be instrumental in guiding the practice of implementing watermarks. 
In\nparticular, we derive optimal detection rules for these watermarks under our\nframework. These theoretically derived detection rules are demonstrated to be\ncompetitive and sometimes enjoy a higher power than existing detection\napproaches through numerical experiments.\n","authors":["Xiang Li","Feng Ruan","Huiyuan Wang","Qi Long","Weijie J. Su"],"pdf_url":"https://arxiv.org/pdf/2404.01245v3.pdf","comment":"To appear in the Annals of Statistics"},{"id":"http://arxiv.org/abs/2409.01345v3","updated":"2024-12-02T03:10:37Z","published":"2024-09-02T15:58:27Z","title":"Language Models Benefit from Preparation with Elicited Knowledge","summary":" The zero-shot chain of thought (CoT) approach is often used in question\nanswering (QA) by language models (LMs) for tasks that require multiple\nreasoning steps. However, some QA tasks hinge more on accessing relevant\nknowledge than on chaining reasoning steps. We introduce a simple prompting\ntechnique, called PREP, that involves using two instances of LMs: the first\n(LM1) generates relevant information, and the second (LM2) receives the\ninformation from the user and answers the question. This design is intended to\nmake better use of the LM's instruction-following capability. PREP is\napplicable across various QA tasks without domain-specific prompt engineering.\nPREP is developed on a dataset of 100 QA questions, derived from an extensive\nschematic dataset specifying artifact parts and material composition. These\nquestions ask which of two artifacts is less likely to share materials with\nanother artifact. Such questions probe the LM's knowledge of shared materials\nin the part structure of different artifacts. We test our method on our\nparts-and-materials dataset and three published commonsense reasoning datasets.\nThe average accuracy of our method is consistently higher than that of all the\nother tested methods across all the tested datasets.\n","authors":["Jiacan Yu","Hannah An","Lenhart K. 
Schubert"],"pdf_url":"https://arxiv.org/pdf/2409.01345v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.11266v3","updated":"2024-12-02T02:27:17Z","published":"2024-11-18T03:45:34Z","title":"VersaTune: An Efficient Data Composition Framework for Training\n Multi-Capability LLMs","summary":" Large-scale pretrained models, particularly Large Language Models (LLMs),\nhave exhibited remarkable capabilities in handling multiple tasks across\ndomains due to their emergent properties. These capabilities are further\naugmented during the Supervised Fine-Tuning (SFT) phase. Despite their\npotential, existing work mainly focuses on domain-specific enhancements during\nfine-tuning, the challenge of which lies in catastrophic forgetting of\nknowledge across other domains. In this study, we introduce VersaTune, a novel\ndata composition framework designed for enhancing LLMs' overall multi-ability\nperformances during training. We categorize knowledge into distinct domains\nincluding law, medicine, finance, science, code, etc. We begin with detecting\nthe distribution of domain-specific knowledge within the base model, followed\nby the training data composition that aligns with the model's existing\nknowledge distribution. During the training process, domain weights are\ndynamically adjusted based on their learnable potential and forgetting degree.\nExperimental results demonstrate that VersaTune achieves significant\nimprovements in multi-domain performance, with a 35.21% enhancement in\ncomprehensive multi-domain tasks. 
Additionally, in scenarios where specific\ndomain optimization is required, VersaTune reduces the degradation of\nperformance in other domains by 38.77%, without compromising the target\ndomain's training efficacy.\n","authors":["Keer Lu","Keshi Zhao","Zheng Liang","Da Pan","Shusen Zhang","Xin Wu","Weipeng Chen","Zenan Zhou","Guosheng Dong","Bin Cui","Wentao Zhang"],"pdf_url":"https://arxiv.org/pdf/2411.11266v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.02326v2","updated":"2024-12-02T01:59:30Z","published":"2024-04-23T18:55:49Z","title":"Evaluating LLMs for Hardware Design and Test","summary":" Large Language Models (LLMs) have demonstrated capabilities for producing\ncode in Hardware Description Languages (HDLs). However, most of the focus\nremains on their abilities to write functional code, not test code. The\nhardware design process consists of both design and test, and so eschewing\nvalidation and verification leaves considerable potential benefit unexplored,\ngiven that a design and test framework may allow for progress towards full\nautomation of the digital design pipeline. In this work, we perform one of the\nfirst studies exploring how a LLM can both design and test hardware modules\nfrom provided specifications. Using a suite of 8 representative benchmarks, we\nexamined the capabilities and limitations of the state-of-the-art\nconversational LLMs when producing Verilog for functional and verification\npurposes. 
We taped out the benchmarks on a Skywater 130nm shuttle and received\nthe functional chip.\n","authors":["Jason Blocklove","Siddharth Garg","Ramesh Karri","Hammond Pearce"],"pdf_url":"https://arxiv.org/pdf/2405.02326v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.05356v2","updated":"2024-12-02T00:57:32Z","published":"2022-12-10T19:54:53Z","title":"Punctuation Restoration for Singaporean Spoken Languages: English,\n Malay, and Mandarin","summary":" This paper presents the work of restoring punctuation for ASR transcripts\ngenerated by multilingual ASR systems. The focus languages are English,\nMandarin, and Malay which are three of the most popular languages in Singapore.\nTo the best of our knowledge, this is the first system that can tackle\npunctuation restoration for these three languages simultaneously. Traditional\napproaches usually treat the task as a sequential labeling task, however, this\nwork adopts a slot-filling approach that predicts the presence and type of\npunctuation marks at each word boundary. The approach is similar to the\nMasked-Language Model approach employed during the pre-training stages of BERT,\nbut instead of predicting the masked word, our model predicts masked\npunctuation. Additionally, we find that using Jieba1 instead of only using the\nbuilt-in SentencePiece tokenizer of XLM-R can significantly improve the\nperformance of punctuating Mandarin transcripts. Experimental results on\nEnglish and Mandarin IWSLT2022 datasets and Malay News show that the proposed\napproach achieved state-of-the-art results for Mandarin with 73.8% F1-score\nwhile maintaining a reasonable F1-score for English and Malay, i.e. 74.7% and\n78% respectively. 
Our source code that allows reproducing the results and\nbuilding a simple web-based application for demonstration purposes is available\non Github.\n","authors":["Abhinav Rao","Ho Thi-Nga","Chng Eng-Siong"],"pdf_url":"https://arxiv.org/pdf/2212.05356v2.pdf","comment":"Accepted at APSIPA 2022, Chiang-Mai, Thailand"},{"id":"http://arxiv.org/abs/2410.05192v3","updated":"2024-12-02T21:54:54Z","published":"2024-10-07T16:49:39Z","title":"Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss\n Landscape Perspective","summary":" Training language models currently requires pre-determining a fixed compute\nbudget because the typical cosine learning rate schedule depends on the total\nnumber of steps. In contrast, the Warmup-Stable-Decay (WSD) schedule uses a\nconstant learning rate to produce a main branch of iterates that can in\nprinciple continue indefinitely without a pre-specified compute budget. Then,\ngiven any compute budget, one can branch out from the main branch at a proper\ntime with a rapidly decaying learning rate to produce a strong model.\nEmpirically, WSD generates a non-traditional loss curve: the loss remains\nelevated during the stable phase but sharply declines during the decay phase.\nTowards explaining this phenomenon, we conjecture that pretraining loss\nexhibits a river valley landscape, which resembles a deep valley with a river\nat its bottom. Under this assumption, we show that during the stable phase, the\niterate undergoes large oscillations due to the high learning rate, yet it\nprogresses swiftly along the river. During the decay phase, the rapidly\ndropping learning rate minimizes the iterate's oscillations, moving it closer\nto the river and revealing true optimization progress. 
Therefore, the sustained\nhigh learning rate phase and fast decaying phase are responsible for progress\nin the river and the mountain directions respectively, and are both critical.\nOur analysis predicts phenomena consistent with empirical observations and\nshows that this landscape can emerge from pretraining on a simple bi-gram\ndataset. Inspired by the theory, we introduce WSD-S, a variant of WSD that\nreuses previous checkpoints' decay phases and keeps only one main branch, where\nwe resume from a decayed checkpoint. WSD-S empirically outperforms WSD and\nCyclic-Cosine in obtaining multiple language model checkpoints across various\ncompute budgets in a single run for parameters scaling from 0.1B to 1.2B.\n","authors":["Kaiyue Wen","Zhiyuan Li","Jason Wang","David Hall","Percy Liang","Tengyu Ma"],"pdf_url":"https://arxiv.org/pdf/2410.05192v3.pdf","comment":"45 pages,13 figures"},{"id":"http://arxiv.org/abs/2412.01991v1","updated":"2024-12-02T21:51:41Z","published":"2024-12-02T21:51:41Z","title":"Real-Time Multilingual Sign Language Processing","summary":" Sign Language Processing (SLP) is an interdisciplinary field comprised of\nNatural Language Processing (NLP) and Computer Vision. It is focused on the\ncomputational understanding, translation, and production of signed languages.\nTraditional approaches have often been constrained by the use of gloss-based\nsystems that are both language-specific and inadequate for capturing the\nmultidimensional nature of sign language. These limitations have hindered the\ndevelopment of technology capable of processing signed languages effectively.\n This thesis aims to revolutionize the field of SLP by proposing a simple\nparadigm that can bridge this existing technological gap. 
We propose the use of\nSignWiring, a universal sign language transcription notation system, to serve\nas an intermediary link between the visual-gestural modality of signed\nlanguages and text-based linguistic representations.\n We contribute foundational libraries and resources to the SLP community,\nthereby setting the stage for a more in-depth exploration of the tasks of sign\nlanguage translation and production. These tasks encompass the translation of\nsign language from video to spoken language text and vice versa. Through\nempirical evaluations, we establish the efficacy of our transcription method as\na pivot for enabling faster, more targeted research, that can lead to more\nnatural and accurate translations across a range of languages.\n The universal nature of our transcription-based paradigm also paves the way\nfor real-time, multilingual applications in SLP, thereby offering a more\ninclusive and accessible approach to language technology. This is a significant\nstep toward universal accessibility, enabling a wider reach of AI-driven\nlanguage technologies to include the deaf and hard-of-hearing community.\n","authors":["Amit Moryossef"],"pdf_url":"https://arxiv.org/pdf/2412.01991v1.pdf","comment":"PhD Thesis"},{"id":"http://arxiv.org/abs/2406.10794v3","updated":"2024-12-02T21:48:47Z","published":"2024-06-16T03:38:48Z","title":"Towards Understanding Jailbreak Attacks in LLMs: A Representation Space\n Analysis","summary":" Large language models (LLMs) are susceptible to a type of attack known as\njailbreaking, which misleads LLMs to output harmful contents. Although there\nare diverse jailbreak attack strategies, there is no unified understanding on\nwhy some methods succeed and others fail. This paper explores the behavior of\nharmful and harmless prompts in the LLM's representation space to investigate\nthe intrinsic properties of successful jailbreak attacks. 
We hypothesize that\nsuccessful attacks share some similar properties: They are effective in moving\nthe representation of the harmful prompt towards the direction to the harmless\nprompts. We leverage hidden representations into the objective of existing\njailbreak attacks to move the attacks along the acceptance direction, and\nconduct experiments to validate the above hypothesis using the proposed\nobjective. We hope this study provides new insights into understanding how LLMs\nunderstand harmfulness information.\n","authors":["Yuping Lin","Pengfei He","Han Xu","Yue Xing","Makoto Yamada","Hui Liu","Jiliang Tang"],"pdf_url":"https://arxiv.org/pdf/2406.10794v3.pdf","comment":"Accepted by EMNLP 2024 Main"},{"id":"http://arxiv.org/abs/2406.10086v3","updated":"2024-12-02T21:31:59Z","published":"2024-06-14T14:41:44Z","title":"Discovering influential text using convolutional neural networks","summary":" Experimental methods for estimating the impacts of text on human evaluation\nhave been widely used in the social sciences. However, researchers in\nexperimental settings are usually limited to testing a small number of\npre-specified text treatments. While efforts to mine unstructured texts for\nfeatures that causally affect outcomes have been ongoing in recent years, these\nmodels have primarily focused on the topics or specific words of text, which\nmay not always be the mechanism of the effect. We connect these efforts with\nNLP interpretability techniques and present a method for flexibly discovering\nclusters of similar text phrases that are predictive of human reactions to\ntexts using convolutional neural networks. When used in an experimental\nsetting, this method can identify text treatments and their effects under\ncertain assumptions. We apply the method to two datasets. The first enables\ndirect validation of the model's ability to detect phrases known to cause the\noutcome. 
The second demonstrates its ability to flexibly discover text\ntreatments with varying textual structures. In both cases, the model learns a\ngreater variety of text treatments compared to benchmark methods, and these\ntext features quantitatively meet or exceed the ability of benchmark methods to\npredict the outcome.\n","authors":["Megan Ayers","Luke Sanford","Margaret Roberts","Eddie Yang"],"pdf_url":"https://arxiv.org/pdf/2406.10086v3.pdf","comment":"Published in Findings of ACL 2024 ( see\n https://aclanthology.org/2024.findings-acl.714 )"},{"id":"http://arxiv.org/abs/2412.01981v1","updated":"2024-12-02T21:20:02Z","published":"2024-12-02T21:20:02Z","title":"Free Process Rewards without Process Labels","summary":" Different from its counterpart outcome reward models (ORMs), which evaluate\nthe entire responses, a process reward model (PRM) scores a reasoning\ntrajectory step by step, providing denser and more fine grained rewards.\nHowever, training a PRM requires labels annotated at every intermediate step,\npresenting significant challenges for both manual and automatic data\ncollection. This paper aims to address this challenge. Both theoretically and\nempirically, we show that an \\textit{implicit PRM} can be obtained at no\nadditional cost, by simply training an ORM on the cheaper response-level\nlabels. The only assumption is to parameterize the outcome reward as the\nlog-likelihood ratios of the policy and reference models, which can be\noptimized regardless of the specific choice of loss objectives. In experiments,\nwe instantiate our implicit PRMs with various objectives and evaluate their\nperformance on MATH. We show that our implicit PRM outperforms a strong\nMCTS-based baseline \\textit{\\'a la} Math-Shepherd using less than $1/38$ of the\ntraining data. Its performance can be further improved with majority voting. We\nfurther find that scaling up instructions and responses benefits our implicit\nPRM, and the latter brings a larger gain. 
Particularly, we find that our\nimplicit PRM, when instantiated with the cross-entropy (CE) loss, is more\ndata-efficient and can keep improving generation models even when trained with\nonly one response per instruction, the setup that suffers from extreme data\nscarcity and imbalance. Further, instructions should be relevant to downstream\ntasks while the diversity of responses does not bring gains. Surprisingly,\ntraining on extra Math-Shepherd step labels brings no further improvements to\nour implicit PRM trained on only outcome data. We hope that our work will\nencourage a rethinking of PRM training approaches and contribute to making\ntraining PRMs more accessible.\n","authors":["Lifan Yuan","Wendi Li","Huayu Chen","Ganqu Cui","Ning Ding","Kaiyan Zhang","Bowen Zhou","Zhiyuan Liu","Hao Peng"],"pdf_url":"https://arxiv.org/pdf/2412.01981v1.pdf","comment":"Models and data are available at:\n https://github.com/lifan-yuan/ImplicitPRM"},{"id":"http://arxiv.org/abs/2411.18915v2","updated":"2024-12-02T21:08:00Z","published":"2024-11-28T05:12:17Z","title":"MATATA: a weak-supervised MAthematical Tool-Assisted reasoning for\n Tabular Applications","summary":" Mathematical reasoning capabilities are increasing with tool-augmented\nlanguage agents, but methods often rely either on closed-source or large\nmodels, external data, or extensive prompt engineering. This work introduces\nMATATA, a novel cost-effective method to train LLM agents for tabular data\nproblems through reasoning, planning, and tool use. With a progressive\nself-improvement paradigm and an iterative weak supervision, it empowers\n3.8B/8B Small Language Models (SLMs), particularly suited for local hosting and\nsensitive business contexts where data privacy is crucial. By employing a\nflexible and reusable tools across different datasets, it achieves robust\nperformance with effective scalability across shared tasks. 
Experiments show\nthat MATATA reaches state-of-the-art performances on FinQA and TAT-QA among\nreasoning frameworks based on open-source models. Moreover, MATATA models\ncompete with GPT-4 based frameworks on TabMWP, while being SLMs.\n","authors":["Vishnou Vinayagame","Gregory Senay","Luis Martí"],"pdf_url":"https://arxiv.org/pdf/2411.18915v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.02741v2","updated":"2024-12-02T21:06:29Z","published":"2024-10-03T17:54:56Z","title":"Salient Information Prompting to Steer Content in Prompt-based\n Abstractive Summarization","summary":" Large language models (LLMs) can generate fluent summaries across domains\nusing prompting techniques, reducing the need to train models for summarization\napplications. However, crafting effective prompts that guide LLMs to generate\nsummaries with the appropriate level of detail and writing style remains a\nchallenge. In this paper, we explore the use of salient information extracted\nfrom the source document to enhance summarization prompts. We show that adding\nkeyphrases in prompts can improve ROUGE F1 and recall, making the generated\nsummaries more similar to the reference and more complete. The number of\nkeyphrases can control the precision-recall trade-off. Furthermore, our\nanalysis reveals that incorporating phrase-level salient information is\nsuperior to word- or sentence-level. However, the impact on hallucination is\nnot universally positive across LLMs. To conduct this analysis, we introduce\nKeyphrase Signal Extractor (SigExt), a lightweight model that can be finetuned\nto extract salient keyphrases. By using SigExt, we achieve consistent ROUGE\nimprovements across datasets and open-weight and proprietary LLMs without any\nLLM customization. Our findings provide insights into leveraging salient\ninformation in building prompt-based summarization systems. 
We release our code\nat \\url{https://github.com/amazon-science/SigExt}\n","authors":["Lei Xu","Mohammed Asad Karim","Saket Dingliwal","Aparna Elangovan"],"pdf_url":"https://arxiv.org/pdf/2410.02741v2.pdf","comment":"Accepted to EMNLP 2024 Industry Track. Code available at\n https://github.com/amazon-science/SigExt"},{"id":"http://arxiv.org/abs/2402.13446v3","updated":"2024-12-02T20:55:15Z","published":"2024-02-21T00:44:04Z","title":"Large Language Models for Data Annotation and Synthesis: A Survey","summary":" Data annotation and synthesis generally refers to the labeling or generating\nof raw data with relevant information, which could be used for improving the\nefficacy of machine learning models. The process, however, is labor-intensive\nand costly. The emergence of advanced Large Language Models (LLMs), exemplified\nby GPT-4, presents an unprecedented opportunity to automate the complicated\nprocess of data annotation and synthesis. While existing surveys have\nextensively covered LLM architecture, training, and general applications, we\nuniquely focus on their specific utility for data annotation. This survey\ncontributes to three core aspects: LLM-Based Annotation Generation,\nLLM-Generated Annotations Assessment, and LLM-Generated Annotations\nUtilization. Furthermore, this survey includes an in-depth taxonomy of data\ntypes that LLMs can annotate, a comprehensive review of learning strategies for\nmodels utilizing LLM-generated annotations, and a detailed discussion of the\nprimary challenges and limitations associated with using LLMs for data\nannotation and synthesis. 
Serving as a key guide, this survey aims to assist\nresearchers and practitioners in exploring the potential of the latest LLMs for\ndata annotation, thereby fostering future advancements in this critical field.\n","authors":["Zhen Tan","Dawei Li","Song Wang","Alimohammad Beigi","Bohan Jiang","Amrita Bhattacharjee","Mansooreh Karami","Jundong Li","Lu Cheng","Huan Liu"],"pdf_url":"https://arxiv.org/pdf/2402.13446v3.pdf","comment":"Accepted to EMNLP 2024 Main"},{"id":"http://arxiv.org/abs/2412.01955v1","updated":"2024-12-02T20:31:27Z","published":"2024-12-02T20:31:27Z","title":"The use of large language models to enhance cancer clinical trial\n educational materials","summary":" Cancer clinical trials often face challenges in recruitment and engagement\ndue to a lack of participant-facing informational and educational resources.\nThis study investigated the potential of Large Language Models (LLMs),\nspecifically GPT4, in generating patient-friendly educational content from\nclinical trial informed consent forms. Using data from ClinicalTrials.gov, we\nemployed zero-shot learning for creating trial summaries and one-shot learning\nfor developing multiple-choice questions, evaluating their effectiveness\nthrough patient surveys and crowdsourced annotation. Results showed that\nGPT4-generated summaries were both readable and comprehensive, and may improve\npatients' understanding and interest in clinical trials. The multiple-choice\nquestions demonstrated high accuracy and agreement with crowdsourced\nannotators. For both resource types, hallucinations were identified that\nrequire ongoing human oversight. 
The findings demonstrate the potential of LLMs\n\"out-of-the-box\" to support the generation of clinical trial education\nmaterials with minimal trial-specific engineering, but implementation with a\nhuman-in-the-loop is still needed to avoid misinformation risks.\n","authors":["Mingye Gao","Aman Varshney","Shan Chen","Vikram Goddla","Jack Gallifant","Patrick Doyle","Claire Novack","Maeve Dillon-Martin","Teresia Perkins","Xinrong Correia","Erik Duhaime","Howard Isenstein","Elad Sharon","Lisa Soleymani Lehmann","David Kozono","Brian Anthony","Dmitriy Dligach","Danielle S. Bitterman"],"pdf_url":"https://arxiv.org/pdf/2412.01955v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.12851v4","updated":"2024-12-02T20:27:39Z","published":"2024-10-10T17:59:17Z","title":"VibeCheck: Discover and Quantify Qualitative Differences in Large\n Language Models","summary":" Large language models (LLMs) often exhibit subtle yet distinctive\ncharacteristics in their outputs that users intuitively recognize, but struggle\nto quantify. These \"vibes\" -- such as tone, formatting, or writing style --\ninfluence user preferences, yet traditional evaluations focus primarily on the\nsingular axis of correctness. We introduce VibeCheck, a system for\nautomatically comparing a pair of LLMs by discovering identifying traits of a\nmodel (vibes) that are well-defined, differentiating, and user-aligned.\nVibeCheck iteratively discovers vibes from model outputs and then utilizes a\npanel of LLM judges to quantitatively measure the utility of each vibe. We\nvalidate that the vibes generated by VibeCheck align with those found in human\ndiscovery and run VibeCheck on pairwise preference data from real-world user\nconversations with Llama-3-70b vs GPT-4. VibeCheck reveals that Llama has a\nfriendly, funny, and somewhat controversial vibe. These vibes predict model\nidentity with 80% accuracy and human preference with 61% accuracy. 
Lastly, we\nrun VibeCheck on a variety of models and tasks including summarization, math,\nand captioning to provide insight into differences in model behavior. VibeCheck\ndiscovers vibes like Command X prefers to add concrete intros and conclusions\nwhen summarizing in comparison to TNGL, Llama-405b often overexplains its\nthought process on math problems compared to GPT-4o, and GPT-4 prefers to focus\non the mood and emotions of the scene when captioning compared to\nGemini-1.5-Flash. Code can be found at https://github.com/lisadunlap/VibeCheck\n","authors":["Lisa Dunlap","Krishna Mandal","Trevor Darrell","Jacob Steinhardt","Joseph E Gonzalez"],"pdf_url":"https://arxiv.org/pdf/2410.12851v4.pdf","comment":"unironic use of the word 'vibe', added more analysis and cooler\n graphs. arXiv admin note: text overlap with arXiv:2301.07597 by other authors"},{"id":"http://arxiv.org/abs/2412.01951v1","updated":"2024-12-02T20:24:17Z","published":"2024-12-02T20:24:17Z","title":"Self-Improvement in Language Models: The Sharpening Mechanism","summary":" Recent work in language modeling has raised the possibility of\nself-improvement, where a language model evaluates and refines its own\ngenerations to achieve higher performance without external feedback. It is\nimpossible for this self-improvement to create information that is not already\nin the model, so why should we expect that this will lead to improved\ncapabilities? We offer a new perspective on the capabilities of\nself-improvement through a lens we refer to as sharpening. Motivated by the\nobservation that language models are often better at verifying response quality\nthan they are at generating correct responses, we formalize self-improvement as\nusing the model itself as a verifier during post-training in order to\n``sharpen'' the model to one placing large mass on high-quality sequences,\nthereby amortizing the expensive inference-time computation of generating good\nsequences. 
We begin by introducing a new statistical framework for sharpening\nin which the learner aims to sharpen a pre-trained base policy via sample\naccess, and establish fundamental limits. Then we analyze two natural families\nof self-improvement algorithms based on SFT and RLHF.\n","authors":["Audrey Huang","Adam Block","Dylan J. Foster","Dhruv Rohatgi","Cyril Zhang","Max Simchowitz","Jordan T. Ash","Akshay Krishnamurthy"],"pdf_url":"https://arxiv.org/pdf/2412.01951v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.16695v2","updated":"2024-12-02T20:23:49Z","published":"2024-07-23T17:57:41Z","title":"Stress-Testing Long-Context Language Models with Lifelong ICL and Task\n Haystack","summary":" We introduce Lifelong ICL, a problem setting that challenges long-context\nlanguage models (LMs) to learn a sequence of language tasks through in-context\nlearning (ICL). We further introduce Task Haystack, an evaluation suite\ndedicated to assessing and diagnosing how long-context LMs utilize contexts in\nLifelong ICL. When given a task instruction and test inputs, long-context LMs\nare expected to leverage the relevant demonstrations in the Lifelong ICL\nprompt, avoid distraction and interference from other tasks, and achieve test\naccuracies that are not significantly worse than those of the Single-task ICL\nbaseline.\n Task Haystack draws inspiration from the widely-adopted\n\"needle-in-a-haystack\" (NIAH) evaluation, but presents distinct new challenges.\nIt requires models (1) to utilize the contexts at a deeper level, rather than\nresorting to simple copying and pasting; (2) to navigate through long streams\nof evolving topics and tasks, proxying the complexities and dynamism of\ncontexts in real-world scenarios. 
Additionally, Task Haystack inherits the\ncontrollability of NIAH, providing model developers with tools and\nvisualizations to identify model vulnerabilities effectively.\n We benchmark 14 long-context LMs using Task Haystack, finding that frontier\nmodels like GPT-4o still struggle with the setting, failing on 15% of cases on\naverage. Most open-weight models further lag behind by a large margin, with\nfailure rates reaching up to 61%. In our controlled analysis, we identify\nfactors such as distraction and recency bias as contributors to these failure\ncases. Further, performance declines when task instructions are paraphrased at\ntest time or when ICL demonstrations are repeated excessively, raising concerns\nabout the robustness, instruction understanding, and true context utilization\nof long-context LMs.\n","authors":["Xiaoyue Xu","Qinyuan Ye","Xiang Ren"],"pdf_url":"https://arxiv.org/pdf/2407.16695v2.pdf","comment":"NeurIPS 2024 (Datasets and Benchmarks Track). Code:\n https://github.com/INK-USC/Lifelong-ICL Website:\n https://inklab.usc.edu/lifelong-icl/"},{"id":"http://arxiv.org/abs/2411.16085v2","updated":"2024-12-02T20:00:52Z","published":"2024-11-25T04:36:01Z","title":"Cautious Optimizers: Improving Training with One Line of Code","summary":" AdamW has been the default optimizer for transformer pretraining. For many\nyears, our community has searched for faster and more stable optimizers with only\nconstrained positive outcomes. In this work, we propose a \\textbf{single-line\nmodification in Pytorch} to any momentum-based optimizer, which we rename\nCautious Optimizer, e.g. C-AdamW and C-Lion. Our theoretical result shows that\nthis modification preserves Adam's Hamiltonian function and it does not break\nthe convergence guarantee under the Lyapunov analysis. In addition, a whole new\nfamily of optimizers is revealed by our theoretical insight. 
Among them, we\npick the simplest one for empirical experiments, showing speed-up on Llama and\nMAE pretraining up to $1.47\\times$. Code is available at\nhttps://github.com/kyleliang919/C-Optim\n","authors":["Kaizhao Liang","Lizhang Chen","Bo Liu","Qiang Liu"],"pdf_url":"https://arxiv.org/pdf/2411.16085v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.15112v5","updated":"2024-12-02T19:26:56Z","published":"2024-03-22T11:08:48Z","title":"Text Clustering with Large Language Model Embeddings","summary":" Text clustering is an important method for organising the increasing volume\nof digital content, aiding in the structuring and discovery of hidden patterns\nin uncategorised data. The effectiveness of text clustering largely depends on\nthe selection of textual embeddings and clustering algorithms. This study\nargues that recent advancements in large language models (LLMs) have the\npotential to enhance this task. The research investigates how different textual\nembeddings, particularly those utilised in LLMs, and various clustering\nalgorithms influence the clustering of text datasets. A series of experiments\nwere conducted to evaluate the impact of embeddings on clustering results, the\nrole of dimensionality reduction through summarisation, and the adjustment of\nmodel size. The findings indicate that LLM embeddings are superior at capturing\nsubtleties in structured language. OpenAI's GPT-3.5 Turbo model yields better\nresults in three out of five clustering metrics across most tested datasets.\nMost LLM embeddings show improvements in cluster purity and provide a more\ninformative silhouette score, reflecting a refined structural understanding of\ntext data compared to traditional methods. Among the more lightweight models,\nBERT demonstrates leading performance. 
Additionally, it was observed that\nincreasing model dimensionality and employing summarisation techniques do not\nconsistently enhance clustering efficiency, suggesting that these strategies\nrequire careful consideration for practical application. These results\nhighlight a complex balance between the need for refined text representation\nand computational feasibility in text clustering applications. This study\nextends traditional text clustering frameworks by integrating embeddings from\nLLMs, offering improved methodologies and suggesting new avenues for future\nresearch in various types of textual analysis.\n","authors":["Alina Petukhova","João P. Matos-Carvalho","Nuno Fachada"],"pdf_url":"https://arxiv.org/pdf/2403.15112v5.pdf","comment":"The peer-reviewed version of this paper is published in the\n International Journal of Cognitive Computing in Engineering at\n https://doi.org/10.1016/j.ijcce.2024.11.004. This version is typeset by the\n authors and differs only in pagination and typographical detail"},{"id":"http://arxiv.org/abs/2411.13687v2","updated":"2024-12-02T19:07:09Z","published":"2024-11-20T20:07:25Z","title":"Hierarchical Text Classification (HTC) vs. eXtreme Multilabel\n Classification (XML): Two Sides of the Same Medal","summary":" Assigning a subset of labels from a fixed pool of labels to a given input\ntext is a text classification problem with many real-world applications, such\nas in recommender systems. Two separate research streams address this issue.\nHierarchical Text Classification (HTC) focuses on datasets with smaller label\npools of hundreds of entries, accompanied by a semantic label hierarchy. In\ncontrast, eXtreme Multi-Label Text Classification (XML) considers very large\nlabel pools with up to millions of entries, in which the labels are not\narranged in any particular manner. However, in XML, a common approach is to\nconstruct an artificial hierarchy without any semantic information before or\nduring the training process. 
Here, we investigate how state-of-the-art models\nfrom one domain perform when trained and tested on datasets from the other\ndomain. The HBGL and HGLCR models from the HTC domain are trained and tested on\nthe datasets Wiki10-31K, AmazonCat-13K, and Amazon-670K from the XML domain. On\nthe other side, the XML models CascadeXML and XR-Transformer are trained and\ntested on the datasets Web of Science, The New York Times Annotated Corpus, and\nRCV1-V2 from the HTC domain. HTC models, on the other hand, are not equipped to\nhandle the size of XML datasets and achieve poor transfer results. The code and\nnumerous files that are needed to reproduce our results can be obtained from\nhttps://github.com/FloHauss/XMC_HTC\n","authors":["Nerijus Bertalis","Paul Granse","Ferhat Gül","Florian Hauss","Leon Menkel","David Schüler","Tom Speier","Lukas Galke","Ansgar Scherp"],"pdf_url":"https://arxiv.org/pdf/2411.13687v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.18060v3","updated":"2024-12-02T19:03:47Z","published":"2024-06-26T04:33:13Z","title":"AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for\n Memory-Efficient Large Language Models Fine-Tuning","summary":" Fine-tuning large language models (LLMs) has achieved remarkable performance\nacross various natural language processing tasks, yet it demands more and more\nmemory as model sizes keep growing. To address this issue, the recently\nproposed Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs\nusing only forward passes, thereby avoiding the need for a backpropagation\ngraph. However, significant performance drops and a high risk of divergence\nhave limited their widespread adoption. In this paper, we propose the Adaptive\nZeroth-order Tensor-Train Adaption (AdaZeta) framework, specifically designed\nto improve the performance and convergence of the ZO methods. To enhance\ndimension-dependent ZO estimation accuracy, we introduce a fast-forward,\nlow-parameter tensorized adapter. 
To tackle the frequently observed divergence\nissue in large-scale ZO fine-tuning tasks, we propose an adaptive query number\nschedule that guarantees convergence. Detailed theoretical analysis and\nextensive experimental results on Roberta-Large and Llama-2-7B models\nsubstantiate the efficacy of our AdaZeta framework in terms of accuracy, memory\nefficiency, and convergence speed.\n","authors":["Yifan Yang","Kai Zhen","Ershad Banijamal","Athanasios Mouchtaris","Zheng Zhang"],"pdf_url":"https://arxiv.org/pdf/2406.18060v3.pdf","comment":"Accepted for publication in EMNLP 2024"},{"id":"http://arxiv.org/abs/2412.01825v1","updated":"2024-12-02T18:59:50Z","published":"2024-12-02T18:59:50Z","title":"GETAE: Graph information Enhanced deep neural NeTwork ensemble\n ArchitecturE for fake news detection","summary":" In today's digital age, fake news has become a major problem that has serious\nconsequences, ranging from social unrest to political upheaval. To address this\nissue, new methods for detecting and mitigating fake news are required. In this\nwork, we propose to incorporate contextual and network-aware features into the\ndetection process. This involves analyzing not only the content of a news\narticle but also the context in which it was shared and the network of users\nwho shared it, i.e., the information diffusion. Thus, we propose GETAE,\n\\underline{G}raph Information \\underline{E}nhanced Deep Neural\nNe\\underline{t}work Ensemble \\underline{A}rchitectur\\underline{E} for Fake News\nDetection, a novel ensemble architecture that uses textual content together\nwith the social interactions to improve fake news detection. GETAE contains two\nBranches: the Text Branch and the Propagation Branch. The Text Branch uses Word\nand Transformer Embeddings and a Deep Neural Network based on feed-forward and\nbidirectional Recurrent Neural Networks (\\textsc{[Bi]RNN}) for learning novel\ncontextual features and creating a novel Text Content Embedding. 
The\nPropagation Branch considers the information propagation within the graph\nnetwork and proposes a Deep Learning architecture that employs Node Embeddings\nto create novel Propagation Embedding. GETAE Ensemble combines the two novel\nembeddings, i.e., Text Content Embedding and Propagation Embedding, to create a\nnovel \\textit{Propagation-Enhanced Content Embedding} which is afterward used\nfor classification. The experimental results obtained on two real-world\npublicly available datasets, i.e., Twitter15 and Twitter16, prove that using\nthis approach improves fake news detection and outperforms state-of-the-art\nmodels.\n","authors":["Ciprian-Octavian Truică","Elena-Simona Apostol","Marius Marogel","Adrian Paschke"],"pdf_url":"https://arxiv.org/pdf/2412.01825v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01814v1","updated":"2024-12-02T18:56:06Z","published":"2024-12-02T18:56:06Z","title":"COSMOS: Cross-Modality Self-Distillation for Vision Language\n Pre-training","summary":" Vision-Language Models (VLMs) trained with contrastive loss have achieved\nsignificant advancements in various vision and language tasks. However, the\nglobal nature of contrastive loss makes VLMs focus predominantly on foreground\nobjects, neglecting other crucial information in the image, which limits their\neffectiveness in downstream tasks. To address these challenges, we propose\nCOSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that\nintegrates a novel text-cropping strategy and cross-attention module into a\nself-supervised learning framework. We create global and local views of images\nand texts (i.e., multi-modal augmentations), which are essential for\nself-distillation in VLMs. We further introduce a cross-attention module,\nenabling COSMOS to learn comprehensive cross-modal representations optimized\nvia a cross-modality self-distillation loss. 
COSMOS consistently outperforms\nprevious strong baselines on various zero-shot downstream tasks, including\nretrieval, classification, and semantic segmentation. Additionally, it\nsurpasses CLIP-based models trained on larger datasets in visual perception and\ncontextual understanding tasks.\n","authors":["Sanghwan Kim","Rui Xiao","Mariana-Iuliana Georgescu","Stephan Alaniz","Zeynep Akata"],"pdf_url":"https://arxiv.org/pdf/2412.01814v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01806v1","updated":"2024-12-02T18:50:27Z","published":"2024-12-02T18:50:27Z","title":"Random Tree Model of Meaningful Memory","summary":" Traditional studies of memory for meaningful narratives focus on specific\nstories and their semantic structures but do not address common quantitative\nfeatures of recall across different narratives. We introduce a statistical\nensemble of random trees to represent narratives as hierarchies of key points,\nwhere each node is a compressed representation of its descendant leaves, which\nare the original narrative segments. Recall is modeled as constrained by\nworking memory capacity from this hierarchical structure. Our analytical\nsolution aligns with observations from large-scale narrative recall\nexperiments. Specifically, our model explains that (1) average recall length\nincreases sublinearly with narrative length, and (2) individuals summarize\nincreasingly longer narrative segments in each recall sentence. 
Additionally,\nthe theory predicts that for sufficiently long narratives, a universal,\nscale-invariant limit emerges, where the fraction of a narrative summarized by\na single recall sentence follows a distribution independent of narrative\nlength.\n","authors":["Weishun Zhong","Tankut Can","Antonis Georgiou","Ilya Shnayderman","Mikhail Katkov","Misha Tsodyks"],"pdf_url":"https://arxiv.org/pdf/2412.01806v1.pdf","comment":"16 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.01752v1","updated":"2024-12-02T17:47:13Z","published":"2024-12-02T17:47:13Z","title":"A Neurosymbolic Fast and Slow Architecture for Graph Coloring","summary":" Constraint Satisfaction Problems (CSPs) present significant challenges to\nartificial intelligence due to their intricate constraints and the necessity\nfor precise solutions. Existing symbolic solvers are often slow, and prior\nresearch has shown that Large Language Models (LLMs) alone struggle with CSPs\nbecause of their complexity. To bridge this gap, we build upon the existing\nSOFAI architecture (or SOFAI-v1), which adapts Daniel Kahneman's ''Thinking,\nFast and Slow'' cognitive model to AI. Our enhanced architecture, SOFAI-v2,\nintegrates refined metacognitive governance mechanisms to improve adaptability\nacross complex domains, specifically tailored for solving CSPs like graph\ncoloring. SOFAI-v2 combines a fast System 1 (S1) based on LLMs with a\ndeliberative System 2 (S2) governed by a metacognition module. S1's initial\nsolutions, often limited by non-adherence to constraints, are enhanced through\nmetacognitive governance, which provides targeted feedback and examples to\nadapt S1 to CSP requirements. If S1 fails to solve the problem, metacognition\nstrategically invokes S2, ensuring accurate and reliable solutions. 
With\nempirical results, we show that SOFAI-v2 for graph coloring problems achieves a\n16.98% increased success rate and is 32.42% faster than symbolic solvers.\n","authors":["Vedant Khandelwal","Vishal Pallagani","Biplav Srivastava","Francesca Rossi"],"pdf_url":"https://arxiv.org/pdf/2412.01752v1.pdf","comment":"18 Pages, 18 Figures, 3 Tables"},{"id":"http://arxiv.org/abs/2412.01711v1","updated":"2024-12-02T16:56:08Z","published":"2024-12-02T16:56:08Z","title":"Towards Resource Efficient and Interpretable Bias Mitigation in Large\n Language Models","summary":" Although large language models (LLMs) have demonstrated their effectiveness\nin a wide range of applications, they have also been observed to perpetuate\nunwanted biases present in the training data, potentially leading to harm for\nmarginalized communities. In this paper, we mitigate bias by leveraging small\nbiased and anti-biased expert models to obtain a debiasing signal that will be\nadded to the LLM output at decoding-time. This approach combines resource\nefficiency with interpretability and can be optimized for mitigating specific\ntypes of bias, depending on the target use case. Experiments on mitigating\ngender, race, and religion biases show a reduction in bias on several local and\nglobal bias metrics while preserving language model performance.\n","authors":["Schrasing Tong","Eliott Zemour","Rawisara Lohanimit","Lalana Kagal"],"pdf_url":"https://arxiv.org/pdf/2412.01711v1.pdf","comment":"38th Conference on Neural Information Processing Systems (NeurIPS\n 2024) Safe Generative AI Workshop"},{"id":"http://arxiv.org/abs/2412.01709v1","updated":"2024-12-02T16:55:07Z","published":"2024-12-02T16:55:07Z","title":"Query Performance Explanation through Large Language Model for HTAP\n Systems","summary":" In hybrid transactional and analytical processing (HTAP) systems, users often\nstruggle to understand why query plans from one engine (OLAP or OLTP) perform\nsignificantly slower than those from another. 
Although optimizers provide plan\ndetails via the EXPLAIN function, these explanations are frequently too\ntechnical for non-experts and offer limited insights into performance\ndifferences across engines. To address this, we propose a novel framework that\nleverages large language models (LLMs) to explain query performance in HTAP\nsystems. Built on Retrieval-Augmented Generation (RAG), our framework\nconstructs a knowledge base that stores historical query executions and\nexpert-curated explanations. To enable efficient retrieval of relevant\nknowledge, query plans are embedded using a lightweight tree-CNN classifier.\nThis augmentation allows the LLM to generate clear, context-aware explanations\nof performance differences between engines. Our approach demonstrates the\npotential of LLMs in hybrid engine systems, paving the way for further\nadvancements in database optimization and user support.\n","authors":["Haibo Xiu","Li Zhang","Tieying Zhang","Jun Yang","Jianjun Chen"],"pdf_url":"https://arxiv.org/pdf/2412.01709v1.pdf","comment":"Submitted to ICDE 2025"},{"id":"http://arxiv.org/abs/2412.01708v1","updated":"2024-12-02T16:55:03Z","published":"2024-12-02T16:55:03Z","title":"Are We There Yet? Revealing the Risks of Utilizing Large Language Models\n in Scholarly Peer Review","summary":" Scholarly peer review is a cornerstone of scientific advancement, but the\nsystem is under strain due to increasing manuscript submissions and the\nlabor-intensive nature of the process. Recent advancements in large language\nmodels (LLMs) have led to their integration into peer review, with promising\nresults such as substantial overlaps between LLM- and human-generated reviews.\nHowever, the unchecked adoption of LLMs poses significant risks to the\nintegrity of the peer review system. In this study, we comprehensively analyze\nthe vulnerabilities of LLM-generated reviews by focusing on manipulation and\ninherent flaws. 
Our experiments show that injecting covert deliberate content\ninto manuscripts allows authors to explicitly manipulate LLM reviews, leading\nto inflated ratings and reduced alignment with human reviews. In a simulation,\nwe find that manipulating 5% of the reviews could potentially cause 12% of the\npapers to lose their position in the top 30% rankings. Implicit manipulation,\nwhere authors strategically highlight minor limitations in their papers,\nfurther demonstrates LLMs' susceptibility compared to human reviewers, with a\n4.5 times higher consistency with disclosed limitations. Additionally, LLMs\nexhibit inherent flaws, such as potentially assigning higher ratings to\nincomplete papers compared to full papers and favoring well-known authors in\nsingle-blind review process. These findings highlight the risks of\nover-reliance on LLMs in peer review, underscoring that we are not yet ready\nfor widespread adoption and emphasizing the need for robust safeguards.\n","authors":["Rui Ye","Xianghe Pang","Jingyi Chai","Jiaao Chen","Zhenfei Yin","Zhen Xiang","Xiaowen Dong","Jing Shao","Siheng Chen"],"pdf_url":"https://arxiv.org/pdf/2412.01708v1.pdf","comment":"27 pages, 24 figures"},{"id":"http://arxiv.org/abs/2412.01690v1","updated":"2024-12-02T16:34:18Z","published":"2024-12-02T16:34:18Z","title":"Can We Afford The Perfect Prompt? Balancing Cost and Accuracy with the\n Economical Prompting Index","summary":" As prompt engineering research rapidly evolves, evaluations beyond accuracy\nare crucial for developing cost-effective techniques. We present the Economical\nPrompting Index (EPI), a novel metric that combines accuracy scores with token\nconsumption, adjusted by a user-specified cost concern level to reflect\ndifferent resource constraints. Our study examines 6 advanced prompting\ntechniques, including Chain-of-Thought, Self-Consistency, and Tree of Thoughts,\nacross 10 widely-used language models and 4 diverse datasets. 
We demonstrate\nthat approaches such as Self-Consistency often provide statistically\ninsignificant gains while becoming cost-prohibitive. For example, on\nhigh-performing models like Claude 3.5 Sonnet, the EPI of simpler techniques\nlike Chain-of-Thought (0.72) surpasses more complex methods like\nSelf-Consistency (0.64) at slight cost concern levels. Our findings suggest a\nreevaluation of complex prompting strategies in resource-constrained scenarios,\npotentially reshaping future research priorities and improving\ncost-effectiveness for end-users.\n","authors":["Tyler McDonald","Anthony Colosimo","Yifeng Li","Ali Emami"],"pdf_url":"https://arxiv.org/pdf/2412.01690v1.pdf","comment":"5 pages (excluding references), accepted to Coling 2025"},{"id":"http://arxiv.org/abs/2412.01661v1","updated":"2024-12-02T16:13:04Z","published":"2024-12-02T16:13:04Z","title":"R-Bot: An LLM-based Query Rewrite System","summary":" Query rewrite is essential for optimizing SQL queries to improve their\nexecution efficiency without changing their results. Traditionally, this task\nhas been tackled through heuristic and learning-based methods, each with its\nlimitations in terms of inferior quality and low robustness. Recent\nadvancements in LLMs offer a new paradigm by leveraging their superior natural\nlanguage and code comprehension abilities. Despite their potential, directly\napplying LLMs like GPT-4 has faced challenges due to problems such as\nhallucinations, where the model might generate inaccurate or irrelevant\nresults. To address this, we propose R-Bot, an LLM-based query rewrite system\nwith a systematic approach. We first design a multi-source rewrite evidence\npreparation pipeline to generate query rewrite evidences for guiding LLMs to\navoid hallucinations. We then propose a hybrid structure-semantics retrieval\nmethod that combines structural and semantic analysis to retrieve the most\nrelevant rewrite evidences for effectively answering an online query. 
We next\npropose a step-by-step LLM rewrite method that iteratively leverages the\nretrieved evidences to select and arrange rewrite rules with self-reflection.\nWe conduct comprehensive experiments on widely used benchmarks, and demonstrate\nthe superior performance of our system, R-Bot, surpassing state-of-the-art\nquery rewrite methods.\n","authors":["Zhaoyan Sun","Xuanhe Zhou","Guoliang Li"],"pdf_url":"https://arxiv.org/pdf/2412.01661v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01644v1","updated":"2024-12-02T15:56:08Z","published":"2024-12-02T15:56:08Z","title":"Concept Based Continuous Prompts for Interpretable Text Classification","summary":" Continuous prompts have become widely adopted for augmenting performance\nacross a wide range of natural language tasks. However, the underlying\nmechanism of this enhancement remains obscure. Previous studies rely on\nindividual words for interpreting continuous prompts, which lacks comprehensive\nsemantic understanding. Drawing inspiration from Concept Bottleneck Models, we\npropose a framework for interpreting continuous prompts by decomposing them\ninto human-readable concepts. Specifically, to ensure the feasibility of the\ndecomposition, we demonstrate that a corresponding concept embedding matrix and\na coefficient matrix can always be found to replace the prompt embedding\nmatrix. Then, we employ GPT-4o to generate a concept pool and choose potential\ncandidate concepts that are discriminative and representative using a novel\nsubmodular optimization algorithm. Experiments demonstrate that our framework\ncan achieve similar results as the original P-tuning and word-based approaches\nusing only a few concepts while providing more plausible results. 
Our code is\navailable at https://github.com/qq31415926/CD.\n","authors":["Qian Chen","Dongyang Li","Xiaofeng He"],"pdf_url":"https://arxiv.org/pdf/2412.01644v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01626v1","updated":"2024-12-02T15:44:19Z","published":"2024-12-02T15:44:19Z","title":"Using Large Language Models in Automatic Hint Ranking and Generation\n Tasks","summary":" The use of Large Language Models (LLMs) has increased significantly recently,\nwith individuals frequently interacting with chatbots to receive answers to a\nwide range of questions. In an era where information is readily accessible, it\nis crucial to stimulate and preserve human cognitive abilities and maintain\nstrong reasoning skills. This paper addresses such challenges by promoting the\nuse of hints as an alternative or a supplement to direct answers. We first\nintroduce a manually constructed hint dataset, WIKIHINT, which includes 5,000\nhints created for 1,000 questions. We then finetune open-source LLMs such as\nLLaMA-3.1 for hint generation in answer-aware and answer-agnostic contexts. We\nassess the effectiveness of the hints with human participants who try to answer\nquestions with and without the aid of hints. Additionally, we introduce a\nlightweight evaluation method, HINTRANK, to evaluate and rank hints in both\nanswer-aware and answer-agnostic settings. 
Our findings show that (a) the\ndataset helps generate more effective hints, (b) including answer information\nalong with questions generally improves hint quality, and (c) encoder-based\nmodels perform better than decoder-based models in hint ranking.\n","authors":["Jamshid Mozafari","Florian Gerhold","Adam Jatowt"],"pdf_url":"https://arxiv.org/pdf/2412.01626v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01624v1","updated":"2024-12-02T15:43:10Z","published":"2024-12-02T15:43:10Z","title":"CHIMA: Headline-Guided Extractive Summarization for Thai News Articles","summary":" Text summarization is a process of condensing lengthy texts while preserving\ntheir essential information. Previous studies have predominantly focused on\nhigh-resource languages, while low-resource languages like Thai have received\nless attention. Furthermore, earlier extractive summarization models for Thai\ntexts have primarily relied on the article's body, without considering the\nheadline. This omission can result in the exclusion of key sentences from the\nsummary. To address these limitations, we propose CHIMA, an extractive\nsummarization model that incorporates the contextual information of the\nheadline for Thai news articles. Our model utilizes a pre-trained language\nmodel to capture complex language semantics and assigns a probability to each\nsentence to be included in the summary. By leveraging the headline to guide\nsentence selection, CHIMA enhances the model's ability to recover important\nsentences and discount irrelevant ones. Additionally, we introduce two\nstrategies for aggregating headline-body similarities, simple average and\nharmonic mean, providing flexibility in sentence selection to accommodate\nvarying writing styles. Experiments on publicly available Thai news datasets\ndemonstrate that CHIMA outperforms baseline models across ROUGE, BLEU, and F1\nscores. 
These results highlight the effectiveness of incorporating the\nheadline-body similarities as model guidance. The results also indicate an\nenhancement in the model's ability to recall critical sentences, even those\nscattered throughout the middle or end of the article. With this potential,\nheadline-guided extractive summarization offers a promising approach to improve\nthe quality and relevance of summaries for Thai news articles.\n","authors":["Pimpitchaya Kositcharoensuk","Nakarin Sritrakool","Ploy N. Pratanwanich"],"pdf_url":"https://arxiv.org/pdf/2412.01624v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01621v1","updated":"2024-12-02T15:41:47Z","published":"2024-12-02T15:41:47Z","title":"NYT-Connections: A Deceptively Simple Text Classification Task that\n Stumps System-1 Thinkers","summary":" Large Language Models (LLMs) have shown impressive performance on various\nbenchmarks, yet their ability to engage in deliberate reasoning remains\nquestionable. We present NYT-Connections, a collection of 358 simple word\nclassification puzzles derived from the New York Times Connections game. This\nbenchmark is designed to penalize quick, intuitive \"System 1\" thinking,\nisolating fundamental reasoning skills. We evaluated six recent LLMs, a simple\nmachine learning heuristic, and humans across three configurations:\nsingle-attempt, multiple attempts without hints, and multiple attempts with\ncontextual hints. 
Our findings reveal a significant performance gap: even\ntop-performing LLMs like GPT-4 fall short of human performance by nearly 30%.\nNotably, advanced prompting techniques such as Chain-of-Thought and\nSelf-Consistency show diminishing returns as task difficulty increases.\nNYT-Connections uniquely combines linguistic isolation, resistance to intuitive\nshortcuts, and regular updates to mitigate data leakage, offering a novel tool\nfor assessing LLM reasoning capabilities.\n","authors":["Angel Yahir Loredo Lopez","Tyler McDonald","Ali Emami"],"pdf_url":"https://arxiv.org/pdf/2412.01621v1.pdf","comment":"5 pages (excluding references), accepted to Coling 2025"},{"id":"http://arxiv.org/abs/2412.01617v1","updated":"2024-12-02T15:39:00Z","published":"2024-12-02T15:39:00Z","title":"If Eleanor Rigby Had Met ChatGPT: A Study on Loneliness in a Post-LLM\n World","summary":" Loneliness, or the lack of fulfilling relationships, significantly impacts a\nperson's mental and physical well-being and is prevalent worldwide. Previous\nresearch suggests that large language models (LLMs) may help mitigate\nloneliness. However, we argue that the use of widespread LLMs like ChatGPT is\nmore prevalent--and riskier, as they are not designed for this purpose. To\nexplore this, we analysed user interactions with ChatGPT, particularly those\noutside of its marketed use as task-oriented assistant. In dialogues classified\nas lonely, users frequently (37%) sought advice or validation, and received\ngood engagement. However, ChatGPT failed in sensitive scenarios, like\nresponding appropriately to suicidal ideation or trauma. We also observed a 35%\nhigher incidence of toxic content, with women being 22 times more likely to be\ntargeted than men. Our findings underscore ethical and legal questions about\nthis technology, and note risks like radicalisation or further isolation. 
We\nconclude with recommendations for research and industry to address loneliness.\n","authors":["Adrian de Wynter"],"pdf_url":"https://arxiv.org/pdf/2412.01617v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01605v1","updated":"2024-12-02T15:25:02Z","published":"2024-12-02T15:25:02Z","title":"Medchain: Bridging the Gap Between LLM Agents and Clinical Practice\n through Interactive Sequential Benchmarking","summary":" Clinical decision making (CDM) is a complex, dynamic process crucial to\nhealthcare delivery, yet it remains a significant challenge for artificial\nintelligence systems. While Large Language Model (LLM)-based agents have been\ntested on general medical knowledge using licensing exams and knowledge\nquestion-answering tasks, their performance in the CDM in real-world scenarios\nis limited due to the lack of comprehensive testing datasets that mirror actual\nmedical practice. To address this gap, we present MedChain, a dataset of 12,163\nclinical cases that covers five key stages of clinical workflow. MedChain\ndistinguishes itself from existing benchmarks with three key features of\nreal-world clinical practice: personalization, interactivity, and\nsequentiality. Further, to tackle real-world CDM challenges, we also propose\nMedChain-Agent, an AI system that integrates a feedback mechanism and a\nMCase-RAG module to learn from previous cases and adapt its responses.\nMedChain-Agent demonstrates remarkable adaptability in gathering information\ndynamically and handling sequential clinical tasks, significantly outperforming\nexisting approaches. 
The relevant dataset and code will be released upon\nacceptance of this paper.\n","authors":["Jie Liu","Wenxuan Wang","Zizhan Ma","Guolin Huang","Yihang SU","Kao-Jung Chang","Wenting Chen","Haoliang Li","Linlin Shen","Michael Lyu"],"pdf_url":"https://arxiv.org/pdf/2412.01605v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01505v1","updated":"2024-12-02T13:58:35Z","published":"2024-12-02T13:58:35Z","title":"Scaling Law for Language Models Training Considering Batch Size","summary":" Large language models (LLMs) have made remarkable advances in recent years,\nwith scaling laws playing a critical role in this rapid progress. In this\npaper, we empirically investigate how a critical hyper-parameter, i.e., the\nglobal batch size, influences the LLM training process. We begin by training\nlanguage models ranging from 125 million to 2.6 billion parameters, using up to\n300 billion high-quality tokens. Through these experiments, we establish a\nbasic scaling law on model size and training data amount. We then examine how\nvarying batch sizes and learning rates affect the convergence and\ngeneralization of these models. Our analysis yields batch size scaling laws\nunder two different cases: with a fixed compute budget, and with a fixed amount\nof training data. Extrapolation experiments on models of increasing sizes\nvalidate our predicted laws, which provides guidance for optimizing LLM\ntraining strategies under specific resource constraints.\n","authors":["Xian Shuai","Yiding Wang","Yimeng Wu","Xin Jiang","Xiaozhe Ren"],"pdf_url":"https://arxiv.org/pdf/2412.01505v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01455v1","updated":"2024-12-02T12:46:34Z","published":"2024-12-02T12:46:34Z","title":"Early Exit Is a Natural Capability in Transformer-based Models: An\n Empirical Study on Early Exit without Joint Optimization","summary":" Large language models (LLMs) exhibit exceptional performance across various\ndownstream tasks. 
However, they encounter limitations due to slow inference\nspeeds stemming from their extensive parameters. The early exit (EE) is an\napproach that aims to accelerate auto-regressive decoding. EE generates outputs\nfrom intermediate layers instead of using the whole model, which offers a\npromising solution to this challenge. However, additional output layers and\njoint optimization used in conventional EE hinder the application of EE in\nLLMs.\n In this paper, we explore the possibility of LLMs EE without additional\noutput layers and joint optimization. Our findings indicate that EE is a\nnatural capability within transformer-based models. While joint optimization\ndoes not give model EE capability, it must be employed to address challenges by\nimproving the accuracy of locating the optimal EE layer through gating\nfunctions. Additionally, our study reveals patterns in EE behavior from a\nsub-word perspective based on the LLaMA model and the potential possibility for\nEE based on sub-layers.\n","authors":["Weiqiao Shan","Long Meng","Tong Zheng","Yingfeng Luo","Bei Li","junxin Wang","Tong Xiao","Jingbo Zhu"],"pdf_url":"https://arxiv.org/pdf/2412.01455v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01447v1","updated":"2024-12-02T12:36:27Z","published":"2024-12-02T12:36:27Z","title":"PLD+: Accelerating LLM inference by leveraging Language Model Artifacts","summary":" To reduce the latency associated with autoregressive LLM inference,\nspeculative decoding has emerged as a novel decoding paradigm, where future\ntokens are drafted and verified in parallel. However, the practical deployment\nof speculative decoding is hindered by its requirements for additional\ncomputational resources and fine-tuning, which limits its out-of-the-box\nusability. To address these challenges, we present PLD+, a suite of novel\nalgorithms developed to accelerate the inference process of LLMs, particularly\nfor input-guided tasks. 
These tasks, which include code editing, text editing,\nsummarization, etc., often feature outputs with substantial overlap with their\ninputs-an attribute PLD+ is designed to exploit. PLD+ also leverages the\nartifacts (attention and hidden states) generated during inference to\naccelerate inference speed. We test our approach on five input-guided tasks and\nthrough extensive experiments we find that PLD+ outperforms all tuning-free\napproaches. In the greedy setting, it even outperforms the state-of-the-art\ntuning-dependent approach EAGLE on four of the tasks (by a margin of up to 2.31\nin terms of avg. speedup). Our approach is tuning free, does not require any\nadditional compute and can easily be used for accelerating inference of any\nLLM.\n","authors":["Shwetha Somasundaram","Anirudh Phukan","Apoorv Saxena"],"pdf_url":"https://arxiv.org/pdf/2412.01447v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01443v1","updated":"2024-12-02T12:32:19Z","published":"2024-12-02T12:32:19Z","title":"Multi-Facet Blending for Faceted Query-by-Example Retrieval","summary":" With the growing demand to fit fine-grained user intents, faceted\nquery-by-example (QBE), which retrieves similar documents conditioned on\nspecific facets, has gained recent attention. However, prior approaches mainly\ndepend on document-level comparisons using basic indicators like citations due\nto the lack of facet-level relevance datasets; yet, this limits their use to\ncitation-based domains and fails to capture the intricacies of facet\nconstraints. In this paper, we propose a multi-facet blending (FaBle)\naugmentation method, which exploits modularity by decomposing and recomposing\nto explicitly synthesize facet-specific training sets. 
We automatically\ndecompose documents into facet units and generate (ir)relevant pairs by\nleveraging LLMs' intrinsic distinguishing capabilities; then, dynamically\nrecomposing the units leads to facet-wise relevance-informed document pairs.\nOur modularization eliminates the need for pre-defined facet knowledge or\nlabels. Further, to prove the FaBle's efficacy in a new domain beyond\ncitation-based scientific paper retrieval, we release a benchmark dataset for\neducational exam item QBE. FaBle augmentation on 1K documents remarkably\nassists training in obtaining facet conditional embeddings.\n","authors":["Heejin Do","Sangwon Ryu","Jonghwi Kim","Gary Geunbae Lee"],"pdf_url":"https://arxiv.org/pdf/2412.01443v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01423v1","updated":"2024-12-02T12:06:41Z","published":"2024-12-02T12:06:41Z","title":"A Top-down Graph-based Tool for Modeling Classical Semantic Maps: A\n Crosslinguistic Case Study of Supplementary Adverbs","summary":" Semantic map models (SMMs) construct a network-like conceptual space from\ncross-linguistic instances or forms, based on the connectivity hypothesis. This\napproach has been widely used to represent similarity and entailment\nrelationships in cross-linguistic concept comparisons. However, most SMMs are\nmanually built by human experts using bottom-up procedures, which are often\nlabor-intensive and time-consuming. In this paper, we propose a novel\ngraph-based algorithm that automatically generates conceptual spaces and SMMs\nin a top-down manner. The algorithm begins by creating a dense graph, which is\nsubsequently pruned into maximum spanning trees, selected according to metrics\nwe propose. These evaluation metrics include both intrinsic and extrinsic\nmeasures, considering factors such as network structure and the trade-off\nbetween precision and coverage. 
A case study on cross-linguistic supplementary\nadverbs demonstrates the effectiveness and efficiency of our model compared to\nhuman annotations and other automated methods. The tool is available at\n\\url{https://github.com/RyanLiut/SemanticMapModel}.\n","authors":["Zhu Liu","Cunliang Kong","Ying Liu","Maosong Sun"],"pdf_url":"https://arxiv.org/pdf/2412.01423v1.pdf","comment":"Paper under review"},{"id":"http://arxiv.org/abs/2412.01386v1","updated":"2024-12-02T11:15:10Z","published":"2024-12-02T11:15:10Z","title":"CLASSLA-Express: a Train of CLARIN.SI Workshops on Language Resources\n and Tools with Easily Expanding Route","summary":" This paper introduces the CLASSLA-Express workshop series as an innovative\napproach to disseminating linguistic resources and infrastructure provided by\nthe CLASSLA Knowledge Centre for South Slavic languages and the Slovenian\nCLARIN.SI infrastructure. The workshop series employs two key strategies: (1)\nconducting workshops directly in countries with interested audiences, and (2)\ndesigning the series for easy expansion to new venues. The first iteration of\nthe CLASSLA-Express workshop series encompasses 6 workshops in 5 countries. Its\ngoal is to share knowledge on the use of corpus querying tools, as well as the\nrecently-released CLASSLA-web corpora - the largest general corpora for South\nSlavic languages. 
In the paper, we present the design of the workshop series,\nits current scope and the effortless extensions of the workshop to new venues\nthat are already in sight.\n","authors":["Nikola Ljubešić","Taja Kuzman","Ivana Filipović Petrović","Jelena Parizoska","Petya Osenova"],"pdf_url":"https://arxiv.org/pdf/2412.01386v1.pdf","comment":"Published in CLARIN Annual Conference Proceedings 2024\n (https://www.clarin.eu/sites/default/files/CLARIN2024_ConferenceProceedings_final.pdf)"},{"id":"http://arxiv.org/abs/2412.01380v1","updated":"2024-12-02T11:07:51Z","published":"2024-12-02T11:07:51Z","title":"Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware\n Masking","summary":" While mobile devices provide ever more compute power, improvements in DRAM\nbandwidth are much slower. This is unfortunate for large language model (LLM)\ntoken generation, which is heavily memory-bound. Previous work has proposed to\nleverage natural dynamic activation sparsity in ReLU-activated LLMs to reduce\neffective DRAM bandwidth per token. However, more recent LLMs use SwiGLU\ninstead of ReLU, which results in little inherent sparsity. While SwiGLU\nactivations can be pruned based on magnitude, the resulting sparsity patterns\nare difficult to predict, rendering previous approaches ineffective. To\ncircumvent this issue, our work introduces Dynamic Input Pruning (DIP): a\npredictor-free dynamic sparsification approach, which preserves accuracy with\nminimal fine-tuning. DIP can further use lightweight LoRA adapters to regain\nsome performance lost during sparsification. Lastly, we describe a novel\ncache-aware masking strategy, which considers the cache state and activation\nmagnitude to further increase cache hit rate, improving LLM token rate on\nmobile devices. DIP outperforms other methods in terms of accuracy, memory and\nthroughput trade-offs across simulated hardware settings. 
On Phi-3-Medium, DIP\nachieves a 46% reduction in memory and 40% increase in throughput with $<$ 0.1\nloss in perplexity.\n","authors":["Marco Federici","Davide Belli","Mart van Baalen","Amir Jalalirad","Andrii Skliar","Bence Major","Markus Nagel","Paul Whatmough"],"pdf_url":"https://arxiv.org/pdf/2412.01380v1.pdf","comment":"Main Text: 10 pages, 11 figures. Appendix: 3 pages, 3 figures"},{"id":"http://arxiv.org/abs/2412.01377v1","updated":"2024-12-02T11:05:31Z","published":"2024-12-02T11:05:31Z","title":"Adapting Large Language Models to Log Analysis with Interpretable Domain\n Knowledge","summary":" The increasing complexity of computer systems necessitates innovative\napproaches to fault and error management, going beyond traditional manual log\nanalysis. While existing solutions using large language models (LLMs) show\npromise, they are limited by a gap between natural and domain-specific\nlanguages, which restricts their effectiveness in real-world applications. Our\napproach addresses these limitations by integrating interpretable domain\nknowledge into open-source LLMs through continual pre-training (CPT), enhancing\nperformance on log tasks while retaining natural language processing\ncapabilities. We created a comprehensive dataset, NLPLog, with over 250,000\nquestion-answer pairs to facilitate this integration. Our model, SuperLog,\ntrained with this dataset, achieves the best performance across four log\nanalysis tasks, surpassing the second-best model by an average of 12.01%. 
Our\ncontributions include a novel CPT paradigm that significantly improves model\nperformance, the development of SuperLog with state-of-the-art results, and the\nrelease of a large-scale dataset to support further research in this domain.\n","authors":["Yuhe Ji","Yilun Liu","Feiyu Yao","Minggui He","Shimin Tao","Xiaofeng Zhao","Su Chang","Xinhua Yang","Weibin Meng","Yuming Xie","Boxing Chen","Hao Yang"],"pdf_url":"https://arxiv.org/pdf/2412.01377v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01370v1","updated":"2024-12-02T10:54:31Z","published":"2024-12-02T10:54:31Z","title":"Understanding the World's Museums through Vision-Language Reasoning","summary":" Museums serve as vital repositories of cultural heritage and historical\nartifacts spanning diverse epochs, civilizations, and regions, preserving\nwell-documented collections. Data reveal key attributes such as age, origin,\nmaterial, and cultural significance. Understanding museum exhibits from their\nimages requires reasoning beyond visual features. In this work, we facilitate\nsuch reasoning by (a) collecting and curating a large-scale dataset of 65M\nimages and 200M question-answer pairs in the standard museum catalog format for\nexhibits from all around the world; (b) training large vision-language models\non the collected dataset; (c) benchmarking their ability on five visual\nquestion answering tasks. The complete dataset is labeled by museum experts,\nensuring the quality as well as the practical significance of the labels. We\ntrain two VLMs from different categories: the BLIP model, with vision-language\naligned embeddings, but lacking the expressive power of large language models,\nand the LLaVA model, a powerful instruction-tuned LLM enriched with\nvision-language reasoning capabilities. Through exhaustive experiments, we\nprovide several insights on the complex and fine-grained understanding of\nmuseum exhibits. 
In particular, we show that some questions whose answers can\noften be derived directly from visual features are well answered by both types\nof models. On the other hand, questions that require the grounding of the\nvisual features in repositories of human knowledge are better answered by the\nlarge vision-language models, thus demonstrating their superior capacity to\nperform the desired reasoning. Find our dataset, benchmarks, and source code\nat: https://github.com/insait-institute/Museum-65\n","authors":["Ada-Astrid Balauca","Sanjana Garai","Stefan Balauca","Rasesh Udayakumar Shetty","Naitik Agrawal","Dhwanil Subhashbhai Shah","Yuqian Fu","Xi Wang","Kristina Toutanova","Danda Pani Paudel","Luc Van Gool"],"pdf_url":"https://arxiv.org/pdf/2412.01370v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01340v1","updated":"2024-12-02T10:07:01Z","published":"2024-12-02T10:07:01Z","title":"A 2-step Framework for Automated Literary Translation Evaluation: Its\n Promises and Pitfalls","summary":" In this work, we propose and evaluate the feasibility of a two-stage pipeline\nto evaluate literary machine translation, in a fine-grained manner, from\nEnglish to Korean. The results show that our framework provides fine-grained,\ninterpretable metrics suited for literary translation and obtains a higher\ncorrelation with human judgment than traditional machine translation metrics.\nNonetheless, it still fails to match inter-human agreement, especially in\nmetrics like Korean Honorifics. 
We also observe that LLMs tend to favor\ntranslations generated by other LLMs, and we highlight the necessity of\ndeveloping more sophisticated evaluation methods to ensure accurate and\nculturally sensitive machine translation of literary works.\n","authors":["Sheikh Shafayat","Dongkeun Yoon","Woori Jang","Jiwoo Choi","Alice Oh","Seohyon Jung"],"pdf_url":"https://arxiv.org/pdf/2412.01340v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01331v1","updated":"2024-12-02T09:54:51Z","published":"2024-12-02T09:54:51Z","title":"Exploring Long-Term Prediction of Type 2 Diabetes Microvascular\n Complications","summary":" Electronic healthcare records (EHR) contain a huge wealth of data that can\nsupport the prediction of clinical outcomes. EHR data is often stored and\nanalysed using clinical codes (ICD10, SNOMED), however these can differ across\nregistries and healthcare providers. Integrating data across systems involves\nmapping between different clinical ontologies requiring domain expertise, and\nat times resulting in data loss. To overcome this, code-agnostic models have\nbeen proposed. We assess the effectiveness of a code-agnostic representation\napproach on the task of long-term microvascular complication prediction for\nindividuals living with Type 2 Diabetes. Our method encodes individual EHRs as\ntext using fine-tuned, pretrained clinical language models. Leveraging\nlarge-scale EHR data from the UK, we employ a multi-label approach to\nsimultaneously predict the risk of microvascular complications across 1-, 5-,\nand 10-year windows. We demonstrate that a code-agnostic approach outperforms a\ncode-based model and illustrate that performance is better with longer\nprediction windows but is biased to the first occurring complication. 
Overall,\nwe highlight that context length is vitally important for model performance.\nThis study highlights the possibility of including data from across different\nclinical ontologies and is a starting point for generalisable clinical models.\n","authors":["Elizabeth Remfry","Rafael Henkin","Michael R Barnes","Aakanksha Naik"],"pdf_url":"https://arxiv.org/pdf/2412.01331v1.pdf","comment":"Findings paper presented at Machine Learning for Health (ML4H)\n symposium 2024, December 15-16, 2024, Vancouver, Canada, 9 pages"},{"id":"http://arxiv.org/abs/2412.01330v1","updated":"2024-12-02T09:54:14Z","published":"2024-12-02T09:54:14Z","title":"The \"LLM World of Words\" English free association norms generated by\n large language models","summary":" Free associations have been extensively used in cognitive psychology and\nlinguistics for studying how conceptual knowledge is organized. Recently, the\npotential of applying a similar approach for investigating the knowledge\nencoded in LLMs has emerged, specifically as a method for investigating LLM\nbiases. However, the absence of large-scale LLM-generated free association\nnorms that are comparable with human-generated norms is an obstacle to this new\nresearch direction. To address this limitation, we create a new dataset of\nLLM-generated free association norms modeled after the \"Small World of Words\"\n(SWOW) human-generated norms consisting of approximately 12,000 cue words. We\nprompt three LLMs, namely Mistral, Llama3, and Haiku, with the same cues as\nthose in the SWOW norms to generate three novel comparable datasets, the \"LLM\nWorld of Words\" (LWOW). Using both SWOW and LWOW norms, we construct cognitive\nnetwork models of semantic memory that represent the conceptual knowledge\npossessed by humans and LLMs. 
We demonstrate how these datasets can be used for\ninvestigating implicit biases in humans and LLMs, such as the harmful gender\nstereotypes that are prevalent both in society and LLM outputs.\n","authors":["Katherine Abramski","Riccardo Improta","Giulio Rossetti","Massimo Stella"],"pdf_url":"https://arxiv.org/pdf/2412.01330v1.pdf","comment":"16 pages, 11 figures, associated Github page with dataset available\n at: https://github.com/LLMWorldOfWords/LWOW"},{"id":"http://arxiv.org/abs/2412.01293v1","updated":"2024-12-02T09:08:06Z","published":"2024-12-02T09:08:06Z","title":"SiTSE: Sinhala Text Simplification Dataset and Evaluation","summary":" Text Simplification is a task that has been minimally explored for\nlow-resource languages. Consequently, there are only a few manually curated\ndatasets. In this paper, we present a human curated sentence-level text\nsimplification dataset for the Sinhala language. Our evaluation dataset\ncontains 1,000 complex sentences and corresponding 3,000 simplified sentences\nproduced by three different human annotators. We model the text simplification\ntask as a zero-shot and zero resource sequence-to-sequence (seq-seq) task on\nthe multilingual language models mT5 and mBART. We exploit auxiliary data from\nrelated seq-seq tasks and explore the possibility of using intermediate task\ntransfer learning (ITTL). Our analysis shows that ITTL outperforms the\npreviously proposed zero-resource methods for text simplification. Our findings\nalso highlight the challenges in evaluating text simplification systems, and\nsupport the calls for improved metrics for measuring the quality of automated\ntext simplification systems that would suit low-resource languages as well. 
Our\ncode and data are publicly available:\nhttps://github.com/brainsharks-fyp17/Sinhala-Text-Simplification-Dataset-and-Evaluation\n","authors":["Surangika Ranathunga","Rumesh Sirithunga","Himashi Rathnayake","Lahiru De Silva","Thamindu Aluthwala","Saman Peramuna","Ravi Shekhar"],"pdf_url":"https://arxiv.org/pdf/2412.01293v1.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2409.05677v2","updated":"2024-12-02T18:13:28Z","published":"2024-09-09T14:44:19Z","title":"RIRAG: Regulatory Information Retrieval and Answer Generation","summary":" Regulatory documents, issued by governmental regulatory bodies, establish\nrules, guidelines, and standards that organizations must adhere to for legal\ncompliance. These documents, characterized by their length, complexity and\nfrequent updates, are challenging to interpret, requiring significant\nallocation of time and expertise on the part of organizations to ensure ongoing\ncompliance. Regulatory Natural Language Processing (RegNLP) is a\nmultidisciplinary field aimed at simplifying access to and interpretation of\nregulatory rules and obligations. We introduce a task of generating\nquestion-passages pairs, where questions are automatically created and paired\nwith relevant regulatory passages, facilitating the development of regulatory\nquestion-answering systems. 
We create the ObliQA dataset, containing 27,869\nquestions derived from the collection of Abu Dhabi Global Markets (ADGM)\nfinancial regulation documents, design a baseline Regulatory Information\nRetrieval and Answer Generation (RIRAG) system and evaluate it with RePASs, a\nnovel evaluation metric that tests whether generated answers accurately capture\nall relevant obligations while avoiding contradictions.\n","authors":["Tuba Gokhan","Kexin Wang","Iryna Gurevych","Ted Briscoe"],"pdf_url":"https://arxiv.org/pdf/2409.05677v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.11251v2","updated":"2024-12-02T13:26:14Z","published":"2024-06-17T06:27:35Z","title":"Unifying Multimodal Retrieval via Document Screenshot Embedding","summary":" In the real world, documents are organized in different formats and varied\nmodalities. Traditional retrieval pipelines require tailored document parsing\ntechniques and content extraction modules to prepare input for indexing. This\nprocess is tedious, prone to errors, and has information loss. To this end, we\npropose Document Screenshot Embedding (DSE), a novel retrieval paradigm that\nregards document screenshots as a unified input format, which does not require\nany content extraction preprocess and preserves all the information in a\ndocument (e.g., text, image and layout). DSE leverages a large vision-language\nmodel to directly encode document screenshots into dense representations for\nretrieval. To evaluate our method, we first craft the dataset of Wiki-SS, a\n1.3M Wikipedia web page screenshots as the corpus to answer the questions from\nthe Natural Questions dataset. In such a text-intensive document retrieval\nsetting, DSE shows competitive effectiveness compared to other text retrieval\nmethods relying on parsing. For example, DSE outperforms BM25 by 17 points in\ntop-1 retrieval accuracy. 
Additionally, in a mixed-modality task of slide\nretrieval, DSE significantly outperforms OCR text retrieval methods by over 15\npoints in nDCG@10. These experiments show that DSE is an effective document\nretrieval paradigm for diverse types of documents. Model checkpoints, code, and\nWiki-SS collection will be released.\n","authors":["Xueguang Ma","Sheng-Chieh Lin","Minghan Li","Wenhu Chen","Jimmy Lin"],"pdf_url":"https://arxiv.org/pdf/2406.11251v2.pdf","comment":"EMNLP2024 main"},{"id":"http://arxiv.org/abs/2409.10825v3","updated":"2024-12-02T07:00:57Z","published":"2024-09-17T01:37:57Z","title":"Unveiling and Mitigating Bias in Large Language Model Recommendations: A\n Path to Fairness","summary":" excel in delivering comprehensive suggestions by deeply analyzing content and\nuser behavior. However, they often inherit biases from skewed training data,\nfavoring mainstream content while underrepresenting diverse or non-traditional\noptions. This study explores the interplay between bias and LLM-based\nrecommendation systems, focusing on music, song, and book recommendations\nacross diverse demographic and cultural groups. This paper analyzes bias in\nLLM-based recommendation systems across multiple models (GPT, LLaMA, and\nGemini), revealing its deep and pervasive impact on outcomes. Intersecting\nidentities and contextual factors, like socioeconomic status, further amplify\nbiases, complicating fair recommendations across diverse groups. Our findings\nreveal that bias in these systems is deeply ingrained, yet even simple\ninterventions like prompt engineering can significantly reduce it. We further\npropose a retrieval-augmented generation strategy to mitigate bias more\neffectively. 
Numerical experiments validate these strategies, demonstrating\nboth the pervasive nature of bias and the impact of the proposed solutions.\n","authors":["Anindya Bijoy Das","Shahnewaz Karim Sakib"],"pdf_url":"https://arxiv.org/pdf/2409.10825v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.16886v2","updated":"2024-12-02T21:35:55Z","published":"2024-02-07T22:15:15Z","title":"Using text embedding models as text classifiers with medical data","summary":" The advent of Large Language Models (LLMs) is promising and LLMs have been\napplied to numerous fields. However, it is not trivial to implement LLMs in the\nmedical field, due to the high standards for precision and accuracy. Currently,\nthe diagnosis of medical ailments must be done by hand, as it is costly to\nbuild a sufficiently broad LLM that can diagnose a wide range of diseases.\nHere, we explore the use of vector databases and embedding models as a means of\nencoding and classifying text with medical text data without the need to train\na new model altogether. We used various LLMs to generate the medical data, then\nencoded the data with a text embedding model and stored it in a vector\ndatabase. We hypothesized that higher embedding dimensions coupled with\ndescriptive data in the vector database would lead to better classifications\nand designed a robustness test to test our hypothesis. By using vector\ndatabases and text embedding models to classify a clinician's notes on a\npatient presenting with a certain ailment, we showed that these tools can be\nsuccessful at classifying medical text data. We found that a higher embedding\ndimension did indeed yield better results, however, querying with simple data\nin the database was optimal for performance. 
We have shown in this study the\napplicability of text embedding models and vector databases on a small scale,\nand our work lays the groundwork for applying these tools on a larger scale.\n","authors":["Rishabh Goel"],"pdf_url":"https://arxiv.org/pdf/2402.16886v2.pdf","comment":"15 pages, 6 figures"},{"id":"http://arxiv.org/abs/2412.01985v1","updated":"2024-12-02T21:29:16Z","published":"2024-12-02T21:29:16Z","title":"Improving feature interactions at Pinterest under industry constraints","summary":" Adopting advances in recommendation systems is often challenging in\nindustrial settings due to unique constraints. This paper aims to highlight\nthese constraints through the lens of feature interactions. Feature\ninteractions are critical for accurately predicting user behavior in\nrecommendation systems and online advertising. Despite numerous novel\ntechniques showing superior performance on benchmark datasets like Criteo,\ntheir direct application in industrial settings is hindered by constraints such\nas model latency, GPU memory limitations and model reproducibility. In this\npaper, we share our learnings from improving feature interactions in\nPinterest's Homefeed ranking model under such constraints. We provide details\nabout the specific challenges encountered, the strategies employed to address\nthem, and the trade-offs made to balance performance with practical\nlimitations. Additionally, we present a set of learning experiments that help\nguide the feature interaction architecture selection. 
We believe these insights\nwill be useful for engineers who are interested in improving their model\nthrough better feature interaction learning.\n","authors":["Siddarth Malreddy","Matthew Lawhon","Usha Amrutha Nookala","Aditya Mantha","Dhruvil Deven Badani"],"pdf_url":"https://arxiv.org/pdf/2412.01985v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01979v1","updated":"2024-12-02T21:16:47Z","published":"2024-12-02T21:16:47Z","title":"FGATT: A Robust Framework for Wireless Data Imputation Using Fuzzy Graph\n Attention Networks and Transformer Encoders","summary":" Missing data is a pervasive challenge in wireless networks and many other\ndomains, often compromising the performance of machine learning and deep\nlearning models. To address this, we propose a novel framework, FGATT, that\ncombines the Fuzzy Graph Attention Network (FGAT) with the Transformer encoder\nto perform robust and accurate data imputation. FGAT leverages fuzzy rough sets\nand graph attention mechanisms to capture spatial dependencies dynamically,\neven in scenarios where predefined spatial information is unavailable. The\nTransformer encoder is employed to model temporal dependencies, utilizing its\nself-attention mechanism to focus on significant time-series patterns. A\nself-adaptive graph construction method is introduced to enable dynamic\nconnectivity learning, ensuring the framework's applicability to a wide range\nof wireless datasets. Extensive experiments demonstrate that our approach\noutperforms state-of-the-art methods in imputation accuracy and robustness,\nparticularly in scenarios with substantial missing data. 
The proposed model is\nwell-suited for applications in wireless sensor networks and IoT environments,\nwhere data integrity is critical.\n","authors":["Jinming Xing","Ruilin Xing","Yan Sun"],"pdf_url":"https://arxiv.org/pdf/2412.01979v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01940v1","updated":"2024-12-02T20:04:06Z","published":"2024-12-02T20:04:06Z","title":"Down with the Hierarchy: The 'H' in HNSW Stands for \"Hubs\"","summary":" Driven by recent breakthrough advances in neural representation learning,\napproximate near-neighbor (ANN) search over vector embeddings has emerged as a\ncritical computational workload. With the introduction of the seminal\nHierarchical Navigable Small World (HNSW) algorithm, graph-based indexes have\nestablished themselves as the overwhelmingly dominant paradigm for efficient and\nscalable ANN search. As the name suggests, HNSW searches a layered hierarchical\ngraph to quickly identify neighborhoods of similar points to a given query\nvector. But is this hierarchy even necessary? A rigorous experimental analysis\nto answer this question would provide valuable insights into the nature of\nalgorithm design for ANN search and motivate directions for future work in this\nincreasingly crucial domain. To that end, we conduct an extensive benchmarking\nstudy covering more large-scale datasets than prior investigations of this\nquestion. We ultimately find that a flat graph retains all of the benefits of\nHNSW on high-dimensional datasets, with latency and recall performance\nessentially \\emph{identical} to the original algorithm but with less memory\noverhead. 
Furthermore, we go a step further and study \\emph{why} the hierarchy\nof HNSW provides no benefit in high dimensions, hypothesizing that navigable\nsmall world graphs contain a well-connected, frequently traversed ``highway\" of\nhub nodes that maintain the same purported function as the hierarchical layers.\nWe present compelling empirical evidence that the \\emph{Hub Highway Hypothesis}\nholds for real datasets and investigate the mechanisms by which the highway\nforms. The implications of this hypothesis may also provide future research\ndirections in developing enhancements to graph-based ANN search.\n","authors":["Blaise Munyampirwa","Vihan Lakshman","Benjamin Coleman"],"pdf_url":"https://arxiv.org/pdf/2412.01940v1.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2412.01626v1","updated":"2024-12-02T15:44:19Z","published":"2024-12-02T15:44:19Z","title":"Using Large Language Models in Automatic Hint Ranking and Generation\n Tasks","summary":" The use of Large Language Models (LLMs) has increased significantly recently,\nwith individuals frequently interacting with chatbots to receive answers to a\nwide range of questions. In an era where information is readily accessible, it\nis crucial to stimulate and preserve human cognitive abilities and maintain\nstrong reasoning skills. This paper addresses such challenges by promoting the\nuse of hints as an alternative or a supplement to direct answers. We first\nintroduce a manually constructed hint dataset, WIKIHINT, which includes 5,000\nhints created for 1,000 questions. We then finetune open-source LLMs such as\nLLaMA-3.1 for hint generation in answer-aware and answer-agnostic contexts. We\nassess the effectiveness of the hints with human participants who try to answer\nquestions with and without the aid of hints. Additionally, we introduce a\nlightweight evaluation method, HINTRANK, to evaluate and rank hints in both\nanswer-aware and answer-agnostic settings. 
Our findings show that (a) the\ndataset helps generate more effective hints, (b) including answer information\nalong with questions generally improves hint quality, and (c) encoder-based\nmodels perform better than decoder-based models in hint ranking.\n","authors":["Jamshid Mozafari","Florian Gerhold","Adam Jatowt"],"pdf_url":"https://arxiv.org/pdf/2412.01626v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01443v1","updated":"2024-12-02T12:32:19Z","published":"2024-12-02T12:32:19Z","title":"Multi-Facet Blending for Faceted Query-by-Example Retrieval","summary":" With the growing demand to fit fine-grained user intents, faceted\nquery-by-example (QBE), which retrieves similar documents conditioned on\nspecific facets, has gained recent attention. However, prior approaches mainly\ndepend on document-level comparisons using basic indicators like citations due\nto the lack of facet-level relevance datasets; yet, this limits their use to\ncitation-based domains and fails to capture the intricacies of facet\nconstraints. In this paper, we propose a multi-facet blending (FaBle)\naugmentation method, which exploits modularity by decomposing and recomposing\nto explicitly synthesize facet-specific training sets. We automatically\ndecompose documents into facet units and generate (ir)relevant pairs by\nleveraging LLMs' intrinsic distinguishing capabilities; then, dynamically\nrecomposing the units leads to facet-wise relevance-informed document pairs.\nOur modularization eliminates the need for pre-defined facet knowledge or\nlabels. Further, to prove the FaBle's efficacy in a new domain beyond\ncitation-based scientific paper retrieval, we release a benchmark dataset for\neducational exam item QBE. 
FaBle augmentation on 1K documents remarkably\nassists training in obtaining facet conditional embeddings.\n","authors":["Heejin Do","Sangwon Ryu","Jonghwi Kim","Gary Geunbae Lee"],"pdf_url":"https://arxiv.org/pdf/2412.01443v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01291v1","updated":"2024-12-02T09:04:16Z","published":"2024-12-02T09:04:16Z","title":"Global Estimation of Building-Integrated Facade and Rooftop Photovoltaic\n Potential by Integrating 3D Building Footprint and Spatio-Temporal Datasets","summary":" This research tackles the challenges of estimating Building-Integrated\nPhotovoltaics (BIPV) potential across various temporal and spatial scales,\naccounting for different geographical climates and urban morphology. We\nintroduce a holistic methodology for evaluating BIPV potential, integrating 3D\nbuilding footprint models with diverse meteorological data sources to account\nfor dynamic shadow effects. The approach enables the assessment of PV potential\non facades and rooftops at different levels-individual buildings, urban blocks,\nand cities globally. Through an analysis of 120 typical cities, we highlight\nthe importance of 3D building forms, cityscape morphology, and geographic\npositioning in measuring BIPV potential at various levels. In particular, our\nsimulation study reveals that among cities with optimal facade PV performance,\nthe average ratio of facade PV potential to rooftop PV potential is\napproximately 68.2%. 
Additionally, approximately 17.5% of the analyzed samples\ndemonstrate even higher facade PV potentials compared to rooftop installations.\nThis finding underscores the strategic value of incorporating facade PV\napplications into urban sustainable energy systems.\n","authors":["Qing Yu","Kechuan Dong","Zhiling Guo","Jiaxing Li","Hongjun Tan","Yanxiu Jin","Jian Yuan","Haoran Zhang","Junwei Liu","Qi Chen","Jinyue Yan"],"pdf_url":"https://arxiv.org/pdf/2412.01291v1.pdf","comment":"17 pages, 5 figures"},{"id":"http://arxiv.org/abs/2412.01290v1","updated":"2024-12-02T09:03:05Z","published":"2024-12-02T09:03:05Z","title":"Learning Smooth Distance Functions via Queries","summary":" In this work, we investigate the problem of learning distance functions\nwithin the query-based learning framework, where a learner is able to pose\ntriplet queries of the form: ``Is $x_i$ closer to $x_j$ or $x_k$?'' We\nestablish formal guarantees on the query complexity required to learn smooth,\nbut otherwise general, distance functions under two notions of approximation:\n$\\omega$-additive approximation and $(1 + \\omega)$-multiplicative\napproximation. For the additive approximation, we propose a global method whose\nquery complexity is quadratic in the size of a finite cover of the sample\nspace. For the (stronger) multiplicative approximation, we introduce a method\nthat combines global and local approaches, utilizing multiple Mahalanobis\ndistance functions to capture local geometry. 
This method has a query\ncomplexity that scales quadratically with both the size of the cover and the\nambient space dimension of the sample space.\n","authors":["Akash Kumar","Sanjoy Dasgupta"],"pdf_url":"https://arxiv.org/pdf/2412.01290v1.pdf","comment":"40 pages, 1 figure"},{"id":"http://arxiv.org/abs/2412.01141v1","updated":"2024-12-02T05:31:22Z","published":"2024-12-02T05:31:22Z","title":"Lossless and Privacy-Preserving Graph Convolution Network for Federated\n Item Recommendation","summary":" Graph neural network (GNN) has emerged as a state-of-the-art solution for\nitem recommendation. However, existing GNN-based recommendation methods rely on\na centralized storage of fragmented user-item interaction sub-graphs and\ntraining on an aggregated global graph, which will lead to privacy concerns. As\na response, some recent works develop GNN-based federated recommendation\nmethods by exploiting decentralized and fragmented user-item sub-graphs in\norder to preserve user privacy. However, due to privacy constraints, the graph\nconvolution process in existing federated recommendation methods is incomplete\ncompared with the centralized counterpart, causing a degradation of the\nrecommendation performance. In this paper, we propose a novel lossless and\nprivacy-preserving graph convolution network (LP-GCN), which fully completes\nthe graph convolution process with decentralized user-item interaction\nsub-graphs while ensuring privacy. It is worth mentioning that its performance\nis equivalent to that of the non-federated (i.e., centralized) counterpart.\nMoreover, we validate its effectiveness through both theoretical analysis and\nempirical studies. Extensive experiments on three real-world datasets show that\nour LP-GCN outperforms the existing federated recommendation methods. 
The code\nwill be publicly available once the paper is accepted.\n","authors":["Guowei Wu","Weike Pan","Qiang Yang","Zhong Ming"],"pdf_url":"https://arxiv.org/pdf/2412.01141v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01127v1","updated":"2024-12-02T05:03:56Z","published":"2024-12-02T05:03:56Z","title":"Precision Profile Pollution Attack on Sequential Recommenders via\n Influence Function","summary":" Sequential recommendation approaches have demonstrated remarkable proficiency\nin modeling user preferences. Nevertheless, they are susceptible to profile\npollution attacks (PPA), wherein items are introduced into a user's interaction\nhistory deliberately to influence the recommendation list. Since retraining the\nmodel for each polluted item is time-consuming, recent PPAs estimate item\ninfluence based on gradient directions to identify the most effective attack\ncandidates. However, the actual item representations diverge significantly from\nthe gradients, resulting in disparate outcomes. To tackle this challenge, we\nintroduce an INFluence Function-based Attack approach INFAttack that offers a\nmore accurate estimation of the influence of polluting items. Specifically, we\ncalculate the modifications to the original model using the influence function\nwhen generating polluted sequences by introducing specific items. Subsequently,\nwe choose the sequence that has been most significantly influenced to\nsubstitute the original sequence, thus promoting the target item. 
Comprehensive\nexperiments conducted on five real-world datasets illustrate that INFAttack\nsurpasses all baseline methods and consistently delivers stable attack\nperformance for both popular and unpopular items.\n","authors":["Xiaoyu Du","Yingying Chen","Yang Zhang","Jinhui Tang"],"pdf_url":"https://arxiv.org/pdf/2412.01127v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01093v1","updated":"2024-12-02T04:05:49Z","published":"2024-12-02T04:05:49Z","title":"Automated Extraction of Acronym-Expansion Pairs from Scientific Papers","summary":" This project addresses challenges posed by the widespread use of\nabbreviations and acronyms in digital texts. We propose a novel method that\ncombines document preprocessing, regular expressions, and a large language\nmodel to identify abbreviations and map them to their corresponding expansions.\nThe regular expressions alone are often insufficient to extract expansions, at\nwhich point our approach leverages GPT-4 to analyze the text surrounding the\nacronyms. By limiting the analysis to only a small portion of the surrounding\ntext, we mitigate the risk of obtaining incorrect or multiple expansions for an\nacronym. There are several known challenges in processing text with acronyms,\nincluding polysemous acronyms, non-local and ambiguous acronyms. Our approach\nenhances the precision and efficiency of NLP techniques by addressing these\nissues with automated acronym identification and disambiguation. This study\nhighlights the challenges of working with PDF files and the importance of\ndocument preprocessing. Furthermore, the results of this work show that neither\nregular expressions nor GPT-4 alone can perform well. Regular expressions are\nsuitable for identifying acronyms but have limitations in finding their\nexpansions within the paper due to a variety of formats used for expressing\nacronym-expansion pairs and the tendency of authors to omit expansions within\nthe text. 
GPT-4, on the other hand, is an excellent tool for obtaining\nexpansions but struggles with correctly identifying all relevant acronyms.\nAdditionally, GPT-4 poses challenges due to its probabilistic nature, which may\nlead to slightly different results for the same input. Our algorithm employs\npreprocessing to eliminate irrelevant information from the text, regular\nexpressions for identifying acronyms, and a large language model to help find\nacronym expansions to provide the most accurate and consistent results.\n","authors":["Izhar Ali","Million Haileyesus","Serhiy Hnatyshyn","Jan-Lucas Ott","Vasil Hnatyshin"],"pdf_url":"https://arxiv.org/pdf/2412.01093v1.pdf","comment":"9 pages, 1 figure"},{"id":"http://arxiv.org/abs/2412.01011v1","updated":"2024-12-02T00:05:20Z","published":"2024-12-02T00:05:20Z","title":"e-Fold Cross-Validation for Recommender-System Evaluation","summary":" To combat the rising energy consumption of recommender systems we implement a\nnovel alternative for k-fold cross validation. This alternative, named e-fold\ncross validation, aims to minimize the number of folds to achieve a reduction\nin power usage while keeping the reliability and robustness of the test results\nhigh. We tested our method on 5 recommender system algorithms across 6 datasets\nand compared it with 10-fold cross validation. On average e-fold cross\nvalidation only needed 41.5% of the energy that 10-fold cross validation would\nneed, while its results only differed by 1.81%. We conclude that e-fold cross\nvalidation is a promising approach that has the potential to be an energy\nefficient but still reliable alternative to k-fold cross validation.\n","authors":["Moritz Baumgart","Lukas Wegmeth","Tobias Vente","Joeran Beel"],"pdf_url":"https://arxiv.org/pdf/2412.01011v1.pdf","comment":"This preprint has not undergone peer review (when applicable) or any\n post-submission improvements or corrections. 
The Version of Record of this\n contribution is published in [TBA], and is available online at [TBA]"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2410.16208v3","updated":"2024-12-02T18:59:28Z","published":"2024-10-21T17:11:21Z","title":"Compute-Constrained Data Selection","summary":" Data selection can reduce the amount of training data needed to finetune\nLLMs; however, the efficacy of data selection scales directly with its compute.\nMotivated by the practical challenge of compute-constrained finetuning, we\nconsider the setting in which both the cost of selecting data and training are\nbudgeted for. We first formalize the problem of data selection with a\ncost-aware utility function, and model the data selection problem as trading\noff initial-selection cost for training gain. We run a comprehensive sweep of\nexperiments across multiple tasks, varying compute budget by scaling finetuning\ntokens, model sizes, and data selection compute. Interestingly we find that\nmany powerful data selection methods are almost never compute-optimal, and that\ncheaper data selection alternatives dominate both from a theoretical and\nempirical perspective. For compute-optimal training, we find that perplexity\nand gradient data selection require training-to-selection model size ratios of\n5x and 10x, respectively.\n","authors":["Junjie Oscar Yin","Alexander M. Rush"],"pdf_url":"https://arxiv.org/pdf/2410.16208v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.07978v3","updated":"2024-12-02T18:58:18Z","published":"2024-11-12T17:58:34Z","title":"A Note on Doubly Robust Estimator in Regression Continuity Designs","summary":" This note introduces a doubly robust (DR) estimator for regression\ndiscontinuity (RD) designs. RD designs provide a quasi-experimental framework\nfor estimating treatment effects, where treatment assignment depends on whether\na running variable surpasses a predefined cutoff. 
A common approach in RD\nestimation is the use of nonparametric regression methods, such as local linear\nregression. However, the validity of these methods still relies on the\nconsistency of the nonparametric estimators. In this study, we propose the\nDR-RD estimator, which combines two distinct estimators for the conditional\nexpected outcomes. The primary advantage of the DR-RD estimator lies in its\nability to ensure the consistency of the treatment effect estimation as long as\nat least one of the two estimators is consistent. Consequently, our DR-RD\nestimator enhances robustness of treatment effect estimators in RD designs.\n","authors":["Masahiro Kato"],"pdf_url":"https://arxiv.org/pdf/2411.07978v3.pdf","comment":"There is a critical error in the previous submission. We have revised\n the original claim and present a weakened result"},{"id":"http://arxiv.org/abs/2411.17501v2","updated":"2024-12-02T18:54:28Z","published":"2024-11-26T15:13:06Z","title":"Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect\n Verifiers","summary":" Recent research has generated hope that inference scaling could allow weaker\nlanguage models to match or exceed the accuracy of stronger models, such as by\nrepeatedly sampling solutions to a coding problem until it passes unit tests.\nThe central thesis of this paper is that there is no free lunch for inference\nscaling: indefinite accuracy improvement through resampling can only be\nrealized if the \"verifier\" (in this case, a set of unit tests) is perfect. When\nthe verifier is imperfect, as it almost always is in domains such as reasoning\nor coding (for example, unit tests have imperfect coverage), there is a nonzero\nprobability of false positives: incorrect solutions that pass the verifier.\nResampling cannot decrease this probability, so it imposes an upper bound to\nthe accuracy of resampling-based inference scaling even with an infinite\ncompute budget. 
We find that there is a very strong correlation between the\nmodel's single-sample accuracy (i.e. accuracy without unit tests) and its false\npositive rate on coding benchmarks HumanEval and MBPP, whose unit tests have\nlimited coverage. Therefore, no amount of inference scaling of weaker models\ncan enable them to match the single-sample accuracy of a sufficiently strong\nmodel (Fig. 1a). When we consider that false positives have a negative utility\ncompared to abstaining from producing a solution, it bends the inference\nscaling curve further downward. Empirically, we find that the optimal number of\nsamples can be less than 10 under realistic assumptions (Fig. 1b). Finally, we\nshow that beyond accuracy, false positives may have other undesirable\nqualities, such as poor adherence to coding style conventions.\n","authors":["Benedikt Stroebl","Sayash Kapoor","Arvind Narayanan"],"pdf_url":"https://arxiv.org/pdf/2411.17501v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.05248v3","updated":"2024-12-02T18:54:09Z","published":"2023-12-08T18:55:40Z","title":"Topology-Based Reconstruction Prevention for Decentralised Learning","summary":" Decentralised learning has recently gained traction as an alternative to\nfederated learning in which both data and coordination are distributed. To\npreserve the confidentiality of users' data, decentralised learning relies on\ndifferential privacy, multi-party computation, or both. However, running\nmultiple privacy-preserving summations in sequence may allow adversaries to\nperform reconstruction attacks. 
Current reconstruction countermeasures either\ncannot trivially be adapted to the distributed setting, or add excessive\namounts of noise.\n In this work, we first show that passive honest-but-curious adversaries can\ninfer other users' private data after several privacy-preserving summations.\nFor example, in subgraphs with 18 users, we show that only three passive\nhonest-but-curious adversaries succeed at reconstructing private data 11.0% of\nthe time, requiring an average of 8.8 summations per adversary. The success\nrate depends only on the adversaries' direct neighbourhood, and is independent\nof the size of the full network. We consider weak adversaries that do not\ncontrol the graph topology, cannot exploit the summation's inner workings, and\ndo not have auxiliary knowledge; and show that these adversaries can still\ninfer private data.\n We analyse how reconstruction relates to topology and propose the first\ntopology-based decentralised defence against reconstruction attacks. We show\nthat reconstruction requires a number of adversaries linear in the length of\nthe network's shortest cycle. Consequently, exact attacks over\nprivacy-preserving summations are impossible in acyclic networks.\n Our work is a stepping stone for a formal theory of topology-based\ndecentralised reconstruction defences. Such a theory would generalise our\ncountermeasure beyond summation, define confidentiality in terms of entropy,\nand describe the interactions with (topology-aware) differential privacy.\n","authors":["Florine W. 
Dekker","Zekeriya Erkin","Mauro Conti"],"pdf_url":"https://arxiv.org/pdf/2312.05248v3.pdf","comment":"14 pages, 19 figures, for associated experiment source code see\n doi:10.4121/21572601.v2"},{"id":"http://arxiv.org/abs/2410.09943v2","updated":"2024-12-02T18:39:06Z","published":"2024-10-13T17:55:58Z","title":"Dynamic Estimation of Learning Rates Using a Non-Linear Autoregressive\n Model","summary":" We introduce a new class of adaptive non-linear autoregressive (Nlar) models\nincorporating the concept of momentum, which dynamically estimate both the\nlearning rates and momentum as the number of iterations increases. In our\nmethod, the growth of the gradients is controlled using a scaling (clipping)\nfunction, leading to stable convergence. Within this framework, we propose\nthree distinct estimators for learning rates and provide theoretical proof of\ntheir convergence. We further demonstrate how these estimators underpin the\ndevelopment of effective Nlar optimizers. The performance of the proposed\nestimators and optimizers is rigorously evaluated through extensive experiments\nacross several datasets and a reinforcement learning environment. The results\nhighlight two key features of the Nlar optimizers: robust convergence despite\nvariations in underlying parameters, including large initial learning rates,\nand strong adaptability with rapid convergence during the initial epochs.\n","authors":["Ramin Okhrati"],"pdf_url":"https://arxiv.org/pdf/2410.09943v2.pdf","comment":"Typos corrected"},{"id":"http://arxiv.org/abs/2408.00170v2","updated":"2024-12-02T18:37:01Z","published":"2024-07-31T21:43:55Z","title":"CREW: Facilitating Human-AI Teaming Research","summary":" With the increasing deployment of artificial intelligence (AI) technologies,\nthe potential of humans working with AI agents has been growing at a great\nspeed. Human-AI teaming is an important paradigm for studying various aspects\nwhen humans and AI agents work together. 
The unique aspect of Human-AI teaming\nresearch is the need to jointly study humans and AI agents, demanding\nmultidisciplinary research efforts from machine learning to human-computer\ninteraction, robotics, cognitive science, neuroscience, psychology, social\nscience, and complex systems. However, existing platforms for Human-AI teaming\nresearch are limited, often supporting oversimplified scenarios and a single\ntask, or specifically focusing on either human-teaming research or multi-agent\nAI algorithms. We introduce CREW, a platform to facilitate Human-AI teaming\nresearch in real-time decision-making scenarios and engage collaborations from\nmultiple scientific disciplines, with a strong emphasis on human involvement.\nIt includes pre-built tasks for cognitive studies and Human-AI teaming with\nexpandable potentials from our modular design. Following conventional cognitive\nneuroscience research, CREW also supports multimodal human physiological signal\nrecording for behavior analysis. Moreover, CREW benchmarks real-time\nhuman-guided reinforcement learning agents using state-of-the-art algorithms\nand well-tuned baselines. With CREW, we were able to conduct 50 human subject\nstudies within a week to verify the effectiveness of our benchmark.\n","authors":["Lingyu Zhang","Zhengran Ji","Boyuan Chen"],"pdf_url":"https://arxiv.org/pdf/2408.00170v2.pdf","comment":"Our project website is at: http://generalroboticslab.com/CREW"},{"id":"http://arxiv.org/abs/2402.08573v3","updated":"2024-12-02T18:33:58Z","published":"2024-02-13T16:21:18Z","title":"Two Tales of Single-Phase Contrastive Hebbian Learning","summary":" The search for ``biologically plausible'' learning algorithms has converged\non the idea of representing gradients as activity differences. 
However, most\napproaches require a high degree of synchronization (distinct phases during\nlearning) and introduce substantial computational overhead, which raises doubts\nregarding their biological plausibility as well as their potential utility for\nneuromorphic computing. Furthermore, they commonly rely on applying\ninfinitesimal perturbations (nudges) to output units, which is impractical in\nnoisy environments. Recently it has been shown that by modelling artificial\nneurons as dyads with two oppositely nudged compartments, it is possible for a\nfully local learning algorithm named ``dual propagation'' to bridge the\nperformance gap to backpropagation, without requiring separate learning phases\nor infinitesimal nudging. However, the algorithm has the drawback that its\nnumerical stability relies on symmetric nudging, which may be restrictive in\nbiological and analog implementations. In this work we first provide a solid\nfoundation for the objective underlying the dual propagation method, which also\nreveals a surprising connection with adversarial robustness. Second, we\ndemonstrate how dual propagation is related to a particular adjoint state\nmethod, which is stable regardless of asymmetric nudging.\n","authors":["Rasmus Kjær Høier","Christopher Zach"],"pdf_url":"https://arxiv.org/pdf/2402.08573v3.pdf","comment":"ICML 2024; 21 pages"},{"id":"http://arxiv.org/abs/2406.16738v2","updated":"2024-12-02T18:27:02Z","published":"2024-06-24T15:45:20Z","title":"Inducing Group Fairness in Prompt-Based Language Model Decisions","summary":" Classifiers are used throughout industry to enforce policies, ranging from\nthe detection of toxic content to age-appropriate content filtering. While\nthese classifiers serve important functions, it is also essential that they are\nbuilt in ways that minimize unfair biases for users.\n One such fairness consideration is called group fairness, which desires that\ndifferent sub-population of users receive equal treatment. 
This is a\nwell-studied problem in the context of 'classical' classifiers. However, the\nemergence of prompt-based language model (LM) decision making has created new\nopportunities to solve text-based classification tasks, and the fairness\nproperties of these new classifiers are not yet well understood. Further, the\n`remediation toolkit' is incomplete for LM-based decision makers and little is\nunderstood about how to improve decision maker group fairness while maintaining\nclassifier performance.\n This work sets out to add more tools to that toolbox. We introduce\nadaptations of existing effective approaches from the classical classifier\nfairness to the prompt-based classifier space. We also devise simple methods\nthat take advantage of the new structure of prompt-based decision makers and\noperate at the prompt level. We compare these approaches empirically on real\ndata. Our results suggest that adaptations of approaches that are effective for\nclassical classifiers remain effective in the LM-based classifier environment.\nHowever, there is room for further exploration of prompt-based remediation\nmethods (and other remediation methods that take advantage of LM structure).\n","authors":["James Atwood","Nino Scherrer","Preethi Lahoti","Ananth Balashankar","Flavien Prost","Ahmad Beirami"],"pdf_url":"https://arxiv.org/pdf/2406.16738v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.13846v2","updated":"2024-12-02T18:03:51Z","published":"2024-05-22T17:14:03Z","title":"Regression Trees Know Calculus","summary":" Regression trees have emerged as a preeminent tool for solving real-world\nregression problems due to their ability to deal with nonlinearities,\ninteraction effects and sharp discontinuities. In this article, we rather study\nregression trees applied to well-behaved, differentiable functions, and\ndetermine the relationship between node parameters and the local gradient of\nthe function being approximated. 
We find a simple estimate of the gradient\nwhich can be efficiently computed using quantities exposed by popular tree\nlearning libraries. This allows the tools developed in the context of\ndifferentiable algorithms, like neural nets and Gaussian processes, to be\ndeployed to tree-based models. To demonstrate this, we study measures of model\nsensitivity defined in terms of integrals of gradients and demonstrate how to\ncompute them for regression trees using the proposed gradient estimates.\nQuantitative and qualitative numerical experiments reveal the capability of\ngradients estimated by regression trees to improve predictive analysis, solve\ntasks in uncertainty quantification, and provide interpretation of model\nbehavior.\n","authors":["Nathan Wycoff"],"pdf_url":"https://arxiv.org/pdf/2405.13846v2.pdf","comment":"Comments very welcome!"},{"id":"http://arxiv.org/abs/2311.04604v3","updated":"2024-12-02T18:02:53Z","published":"2023-11-08T11:12:27Z","title":"Asynchronous Message-Passing and Zeroth-Order Optimization Based\n Distributed Learning with a Use-Case in Resource Allocation in Communication\n Networks","summary":" Distributed learning and adaptation have received significant interest and\nfound wide-ranging applications in machine learning and signal processing.\nWhile various approaches, such as shared-memory optimization, multi-task\nlearning, and consensus-based learning (e.g., federated learning and learning\nover graphs), focus on optimizing either local costs or a global cost, there\nremains a need for further exploration of their interconnections. This paper\nspecifically focuses on a scenario where agents collaborate towards a common\ntask (i.e., optimizing a global cost equal to aggregated local costs) while\neffectively having distinct individual tasks (i.e., optimizing individual local\nparameters in a local cost). Each agent's actions can potentially impact other\nagents' performance through interactions. 
Notably, each agent has access to\nonly its local zeroth-order oracle (i.e., cost function value) and shares\nscalar values, rather than gradient vectors, with other agents, leading to\ncommunication bandwidth efficiency and agent privacy. Agents employ\nzeroth-order optimization to update their parameters, and the asynchronous\nmessage-passing between them is subject to bounded but possibly random\ncommunication delays. This paper presents theoretical convergence analyses and\nestablishes a convergence rate for nonconvex problems. Furthermore, it\naddresses the relevant use-case of deep learning-based resource allocation in\ncommunication networks and conducts numerical experiments in which agents,\nacting as transmitters, collaboratively train their individual policies to\nmaximize a global reward, e.g., a sum of data rates.\n","authors":["Pourya Behmandpoor","Marc Moonen","Panagiotis Patrinos"],"pdf_url":"https://arxiv.org/pdf/2311.04604v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.24060v5","updated":"2024-12-02T18:00:18Z","published":"2024-10-31T15:57:04Z","title":"Understanding Generalizability of Diffusion Models Requires Rethinking\n the Hidden Gaussian Structure","summary":" In this work, we study the generalizability of diffusion models by looking\ninto the hidden properties of the learned score functions, which are\nessentially a series of deep denoisers trained on various noise levels. We\nobserve that as diffusion models transition from memorization to\ngeneralization, their corresponding nonlinear diffusion denoisers exhibit\nincreasing linearity. This discovery leads us to investigate the linear\ncounterparts of the nonlinear diffusion models, which are a series of linear\nmodels trained to match the function mappings of the nonlinear diffusion\ndenoisers. 
Surprisingly, these linear denoisers are approximately the optimal\ndenoisers for a multivariate Gaussian distribution characterized by the\nempirical mean and covariance of the training dataset. This finding implies\nthat diffusion models have the inductive bias towards capturing and utilizing\nthe Gaussian structure (covariance information) of the training dataset for\ndata generation. We empirically demonstrate that this inductive bias is a\nunique property of diffusion models in the generalization regime, which becomes\nincreasingly evident when the model's capacity is relatively small compared to\nthe training dataset size. In the case that the model is highly\noverparameterized, this inductive bias emerges during the initial training\nphases before the model fully memorizes its training data. Our study provides\ncrucial insights into understanding the notable strong generalization\nphenomenon recently observed in real-world diffusion models.\n","authors":["Xiang Li","Yixiang Dai","Qing Qu"],"pdf_url":"https://arxiv.org/pdf/2410.24060v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.15098v3","updated":"2024-12-02T17:59:40Z","published":"2024-11-22T17:55:15Z","title":"OminiControl: Minimal and Universal Control for Diffusion Transformer","summary":" In this paper, we introduce OminiControl, a highly versatile and\nparameter-efficient framework that integrates image conditions into pre-trained\nDiffusion Transformer (DiT) models. At its core, OminiControl leverages a\nparameter reuse mechanism, enabling the DiT to encode image conditions using\nitself as a powerful backbone and process them with its flexible multi-modal\nattention processors. 
Unlike existing methods, which rely heavily on additional\nencoder modules with complex architectures, OminiControl (1) effectively and\nefficiently incorporates injected image conditions with only ~0.1% additional\nparameters, and (2) addresses a wide range of image conditioning tasks in a\nunified manner, including subject-driven generation and spatially-aligned\nconditions such as edges, depth, and more. Remarkably, these capabilities are\nachieved by training on images generated by the DiT itself, which is\nparticularly beneficial for subject-driven generation. Extensive evaluations\ndemonstrate that OminiControl outperforms existing UNet-based and DiT-adapted\nmodels in both subject-driven and spatially-aligned conditional generation.\nAdditionally, we release our training dataset, Subjects200K, a diverse\ncollection of over 200,000 identity-consistent images, along with an efficient\ndata synthesis pipeline to advance research in subject-consistent generation.\n","authors":["Zhenxiong Tan","Songhua Liu","Xingyi Yang","Qiaochu Xue","Xinchao Wang"],"pdf_url":"https://arxiv.org/pdf/2411.15098v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.17593v3","updated":"2024-12-02T17:43:20Z","published":"2024-11-26T17:01:27Z","title":"What Differentiates Educational Literature? A Multimodal Fusion Approach\n of Transformers and Computational Linguistics","summary":" The integration of new literature into the English curriculum remains a\nchallenge since educators often lack scalable tools to rapidly evaluate\nreadability and adapt texts for diverse classroom needs. This study proposes to\naddress this gap through a multimodal approach that combines transformer-based\ntext classification with linguistic feature analysis to align texts with UK Key\nStages. Eight state-of-the-art Transformers were fine-tuned on segmented text\ndata, with BERT achieving the highest unimodal F1 score of 0.75. 
In parallel,\n500 deep neural network topologies were searched for the classification of\nlinguistic characteristics, achieving an F1 score of 0.392. The fusion of these\nmodalities shows a significant improvement, with every multimodal approach\noutperforming all unimodal models. In particular, the ELECTRA Transformer fused\nwith the neural network achieved an F1 score of 0.996. Unimodal and multimodal\napproaches are shown to have statistically significant differences in all\nvalidation metrics (accuracy, precision, recall, F1 score) except for inference\ntime. The proposed approach is finally encapsulated in a stakeholder-facing web\napplication, providing non-technical stakeholder access to real-time insights\non text complexity, reading difficulty, curriculum alignment, and\nrecommendations for learning age range. The application empowers data-driven\ndecision making and reduces manual workload by integrating AI-based\nrecommendations into lesson planning for English literature.\n","authors":["Jordan J. Bird"],"pdf_url":"https://arxiv.org/pdf/2411.17593v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.14973v2","updated":"2024-12-02T17:35:07Z","published":"2024-01-26T16:06:01Z","title":"Discovering group dynamics in coordinated time series via hierarchical\n recurrent switching-state models","summary":" We seek a computationally efficient model for a collection of time series\narising from multiple interacting entities (a.k.a. \"agents\"). Recent models of\nspatiotemporal patterns across individuals fail to incorporate explicit\nsystem-level collective behavior that can influence the trajectories of\nindividual entities. To address this gap in the literature, we present a new\nhierarchical switching-state model that can be trained in an unsupervised\nfashion to simultaneously learn both system-level and individual-level\ndynamics. 
We employ a latent system-level discrete state Markov chain that\nprovides top-down influence on latent entity-level chains which in turn govern\nthe emission of each observed time series. Recurrent feedback from the\nobservations to the latent chains at both entity and system levels allows\nrecent situational context to inform how dynamics unfold at all levels in\nbottom-up fashion. We hypothesize that including both top-down and bottom-up\ninfluences on group dynamics will improve interpretability of the learned\ndynamics and reduce error when forecasting. Our hierarchical switching\nrecurrent dynamical model can be learned via closed-form variational coordinate\nascent updates to all latent chains that scale linearly in the number of\nentities. This is asymptotically no more costly than fitting a separate model\nfor each entity. Analysis of both synthetic data and real basketball team\nmovements suggests our lean parametric model can achieve competitive forecasts\ncompared to larger neural network models that require far more computational\nresources. Further experiments on soldier data as well as a synthetic task with\n64 cooperating entities show how our approach can yield interpretable insights\nabout team dynamics over time.\n","authors":["Michael T. Wojnowicz","Kaitlin Gili","Preetish Rath","Eric Miller","Jeffrey Miller","Clifford Hancock","Meghan O'Donovan","Seth Elkin-Frankston","Tad T. Brunyé","Michael C. Hughes"],"pdf_url":"https://arxiv.org/pdf/2401.14973v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.17644v4","updated":"2024-12-02T17:12:54Z","published":"2024-04-26T18:08:15Z","title":"A Conditional Independence Test in the Presence of Discretization","summary":" Testing conditional independence has many applications, such as in Bayesian\nnetwork learning and causal discovery. Different test methods have been\nproposed. However, existing methods generally can not work when only\ndiscretized observations are available. 
Specifically, consider $X_1$,\n$\\tilde{X}_2$ and $X_3$ are observed variables, where $\\tilde{X}_2$ is a\ndiscretization of latent variables $X_2$. Applying existing test methods to the\nobservations of $X_1$, $\\tilde{X}_2$ and $X_3$ can lead to a false conclusion\nabout the underlying conditional independence of variables $X_1$, $X_2$ and\n$X_3$. Motivated by this, we propose a conditional independence test\nspecifically designed to accommodate the presence of such discretization. To\nachieve this, we design the bridge equations to recover the parameter\nreflecting the statistical information of the underlying latent continuous\nvariables. An appropriate test statistic and its asymptotic distribution under\nthe null hypothesis of conditional independence have also been derived. Both\ntheoretical results and empirical validation have been provided, demonstrating\nthe effectiveness of our test methods.\n","authors":["Boyang Sun","Yu Yao","Huangyuan Hao","Yumou Qiu","Kun Zhang"],"pdf_url":"https://arxiv.org/pdf/2404.17644v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.07118v3","updated":"2024-12-02T17:11:07Z","published":"2024-11-11T16:45:18Z","title":"ConvMixFormer- A Resource-efficient Convolution Mixer for\n Transformer-based Dynamic Hand Gesture Recognition","summary":" Transformer models have demonstrated remarkable success in many domains such\nas natural language processing (NLP) and computer vision. With the growing\ninterest in transformer-based architectures, they are now utilized for gesture\nrecognition. So, we also explore and devise a novel ConvMixFormer architecture\nfor dynamic hand gestures. The transformers use quadratic scaling of the\nattention features with the sequential data, due to which these models are\ncomputationally complex and heavy. 
We have considered this drawback of the\ntransformer and designed a resource-efficient model that replaces the\nself-attention in the transformer with the simple convolutional layer-based\ntoken mixer. The computational cost and the parameters used for the\nconvolution-based mixer are comparatively less than the quadratic\nself-attention. Convolution-mixer helps the model capture the local spatial\nfeatures that self-attention struggles to capture due to their sequential\nprocessing nature. Further, an efficient gate mechanism is employed instead of\na conventional feed-forward network in the transformer to help the model\ncontrol the flow of features within different stages of the proposed model.\nThis design uses fewer learnable parameters which is nearly half the vanilla\ntransformer that helps in fast and efficient training. The proposed method is\nevaluated on NVidia Dynamic Hand Gesture and Briareo datasets and our model has\nachieved state-of-the-art results on single and multimodal inputs. We have also\nshown the parameter efficiency of the proposed ConvMixFormer model compared to\nother methods. The source code is available at\nhttps://github.com/mallikagarg/ConvMixFormer.\n","authors":["Mallika Garg","Debashis Ghosh","Pyari Mohan Pradhan"],"pdf_url":"https://arxiv.org/pdf/2411.07118v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.17251v2","updated":"2024-12-02T16:37:41Z","published":"2024-11-26T09:29:27Z","title":"DGNN-YOLO: Dynamic Graph Neural Networks with YOLO11 for Small Object\n Detection and Tracking in Traffic Surveillance","summary":" Accurate detection and tracking of small objects such as pedestrians,\ncyclists, and motorbikes are critical for traffic surveillance systems, which\nare crucial in improving road safety and decision-making in intelligent\ntransportation systems. 
However, traditional methods struggle with challenges\nsuch as occlusion, low resolution, and dynamic traffic conditions,\nnecessitating innovative approaches to address these limitations. This paper\nintroduces DGNN-YOLO, a novel framework integrating dynamic graph neural\nnetworks (DGNN) with YOLO11 to enhance small object detection and tracking in\ntraffic surveillance systems. The framework leverages YOLO11's advanced spatial\nfeature extraction capabilities for precise object detection and incorporates\nDGNN to model spatial-temporal relationships for robust real-time tracking\ndynamically. By constructing and updating graph structures, DGNN-YOLO\neffectively represents objects as nodes and their interactions as edges,\nensuring adaptive and accurate tracking in complex and dynamic environments.\nExtensive experiments demonstrate that DGNN-YOLO consistently outperforms\nstate-of-the-art methods in detecting and tracking small objects under diverse\ntraffic conditions, achieving the highest precision (0.8382), recall (0.6875),\nand mAP@0.5:0.95 (0.6476), showcasing its robustness and scalability,\nparticularly in challenging scenarios involving small and occluded objects.\nThis work provides a scalable, real-time traffic surveillance and analysis\nsolution, significantly contributing to intelligent transportation systems.\n","authors":["Shahriar Soudeep","M. F. Mridha","Md Abrar Jahin","Nilanjan Dey"],"pdf_url":"https://arxiv.org/pdf/2411.17251v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.17311v3","updated":"2024-12-02T16:29:47Z","published":"2024-05-27T16:11:49Z","title":"Probabilistic Graph Rewiring via Virtual Nodes","summary":" Message-passing graph neural networks (MPNNs) have emerged as a powerful\nparadigm for graph-based machine learning. Despite their effectiveness, MPNNs\nface challenges such as under-reaching and over-squashing, where limited\nreceptive fields and structural bottlenecks hinder information flow in the\ngraph. 
While graph transformers hold promise in addressing these issues, their\nscalability is limited due to quadratic complexity regarding the number of\nnodes, rendering them impractical for larger graphs. Here, we propose\nimplicitly rewired message-passing neural networks (IPR-MPNNs), a novel\napproach that integrates implicit probabilistic graph rewiring into MPNNs. By\nintroducing a small number of virtual nodes, i.e., adding additional nodes to a\ngiven graph and connecting them to existing nodes, in a differentiable,\nend-to-end manner, IPR-MPNNs enable long-distance message propagation,\ncircumventing quadratic complexity. Theoretically, we demonstrate that\nIPR-MPNNs surpass the expressiveness of traditional MPNNs. Empirically, we\nvalidate our approach by showcasing its ability to mitigate under-reaching and\nover-squashing effects, achieving state-of-the-art performance across multiple\ngraph datasets. Notably, IPR-MPNNs outperform graph transformers while\nmaintaining significantly faster computational efficiency.\n","authors":["Chendi Qian","Andrei Manolache","Christopher Morris","Mathias Niepert"],"pdf_url":"https://arxiv.org/pdf/2405.17311v3.pdf","comment":"Accepted at 38th Conference on Neural Information Processing Systems\n (NeurIPS 2024), Vancouver, Canada"},{"id":"http://arxiv.org/abs/2409.19839v3","updated":"2024-12-02T16:27:16Z","published":"2024-09-30T00:41:51Z","title":"ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities","summary":" Forecasts of future events are essential inputs into informed\ndecision-making. Machine learning (ML) systems have the potential to deliver\nforecasts at scale, but there is no framework for evaluating the accuracy of ML\nsystems on a standardized set of forecasting questions. To address this gap, we\nintroduce ForecastBench: a dynamic benchmark that evaluates the accuracy of ML\nsystems on an automatically generated and regularly updated set of 1,000\nforecasting questions. 
To avoid any possibility of data leakage, ForecastBench\nis comprised solely of questions about future events that have no known answer\nat the time of submission. We quantify the capabilities of current ML systems\nby collecting forecasts from expert (human) forecasters, the general public,\nand LLMs on a random subset of questions from the benchmark ($N=200$). While\nLLMs have achieved super-human performance on many benchmarks, they perform\nless well here: expert forecasters outperform the top-performing LLM (p-value\n$<0.01$). We display system and human scores in a public leaderboard at\nwww.forecastbench.org.\n","authors":["Ezra Karger","Houtan Bastani","Chen Yueh-Han","Zachary Jacobs","Danny Halawi","Fred Zhang","Philip E. Tetlock"],"pdf_url":"https://arxiv.org/pdf/2409.19839v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.17339v2","updated":"2024-12-02T16:08:41Z","published":"2024-05-27T16:42:51Z","title":"Physics-Informed Real NVP for Satellite Power System Fault Detection","summary":" The unique challenges posed by the space environment, characterized by\nextreme conditions and limited accessibility, raise the need for robust and\nreliable techniques to identify and prevent satellite faults. Fault detection\nmethods in the space sector are required to ensure mission success and to\nprotect valuable assets. In this context, this paper proposes an Artificial\nIntelligence (AI) based fault detection methodology and evaluates its\nperformance on ADAPT (Advanced Diagnostics and Prognostics Testbed), an\nElectrical Power System (EPS) dataset, crafted in laboratory by NASA. Our study\nfocuses on the application of a physics-informed (PI) real-valued non-volume\npreserving (Real NVP) model for fault detection in space systems. The efficacy\nof this method is systematically compared against other AI approaches such as\nGated Recurrent Unit (GRU) and Autoencoder-based techniques. 
Results show that\nour physics-informed approach outperforms existing methods of fault detection,\ndemonstrating its suitability for addressing the unique challenges of satellite\nEPS sub-system faults. Furthermore, we unveil the competitive advantage of\nphysics-informed loss in AI models to address specific space needs, namely\nrobustness, reliability, and power constraints, crucial for space exploration\nand satellite missions.\n","authors":["Carlo Cena","Umberto Albertin","Mauro Martini","Silvia Bucci","Marcello Chiaberge"],"pdf_url":"https://arxiv.org/pdf/2405.17339v2.pdf","comment":"C. Cena, U. Albertin, M. Martini, S. Bucci and M. Chiaberge,\n \"Physics-Informed Real NVP for Satellite Power System Fault Detection,\" 2024\n IEEE International Conference on Advanced Intelligent Mechatronics (AIM),\n Boston, MA, USA, 2024, pp. 679-684, doi: 10.1109/AIM55361.2024.10636990"}],"Multimedia":[{"id":"http://arxiv.org/abs/2303.17550v6","updated":"2024-12-02T10:06:28Z","published":"2023-03-30T17:18:31Z","title":"DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with\n Diffusion Autoencoder","summary":" While recent research has made significant progress in speech-driven talking\nface generation, the quality of the generated video still lags behind that of\nreal recordings. One reason for this is the use of handcrafted intermediate\nrepresentations like facial landmarks and 3DMM coefficients, which are designed\nbased on human knowledge and are insufficient to precisely describe facial\nmovements. Additionally, these methods require an external pretrained model for\nextracting these representations, whose performance sets an upper bound on\ntalking face generation. To address these limitations, we propose a novel\nmethod called DAE-Talker that leverages data-driven latent representations\nobtained from a diffusion autoencoder (DAE). 
DAE contains an image encoder that\nencodes an image into a latent vector and a DDIM image decoder that\nreconstructs the image from it. We train our DAE on talking face video frames\nand then extract their latent representations as the training target for a\nConformer-based speech2latent model. This allows DAE-Talker to synthesize full\nvideo frames and produce natural head movements that align with the content of\nspeech, rather than relying on a predetermined head pose from a template video.\nWe also introduce pose modelling in speech2latent for pose controllability.\nAdditionally, we propose a novel method for generating continuous video frames\nwith the DDIM image decoder trained on individual frames, eliminating the need\nfor modelling the joint distribution of consecutive frames directly. Our\nexperiments show that DAE-Talker outperforms existing popular methods in\nlip-sync, video fidelity, and pose naturalness. We also conduct ablation\nstudies to analyze the effectiveness of the proposed techniques and demonstrate\nthe pose controllability of DAE-Talker.\n","authors":["Chenpeng Du","Qi Chen","Tianyu He","Xu Tan","Xie Chen","Kai Yu","Sheng Zhao","Jiang Bian"],"pdf_url":"https://arxiv.org/pdf/2303.17550v6.pdf","comment":"Accepted to ACM Multimedia 2023"},{"id":"http://arxiv.org/abs/2412.01986v1","updated":"2024-12-02T21:35:33Z","published":"2024-12-02T21:35:33Z","title":"HybridMQA: Exploring Geometry-Texture Interactions for Colored Mesh\n Quality Assessment","summary":" Mesh quality assessment (MQA) models play a critical role in the design,\noptimization, and evaluation of mesh operation systems in a wide variety of\napplications. Current MQA models, whether model-based methods using\ntopology-aware features or projection-based approaches working on rendered 2D\nprojections, often fail to capture the intricate interactions between texture\nand 3D geometry. 
We introduce HybridMQA, a first-of-its-kind hybrid\nfull-reference colored MQA framework that integrates model-based and\nprojection-based approaches, capturing complex interactions between textural\ninformation and 3D structures for enriched quality representations. Our method\nemploys graph learning to extract detailed 3D representations, which are then\nprojected to 2D using a novel feature rendering process that precisely aligns\nthem with colored projections. This enables the exploration of geometry-texture\ninteractions via cross-attention, producing comprehensive mesh quality\nrepresentations. Extensive experiments demonstrate HybridMQA's superior\nperformance across diverse datasets, highlighting its ability to effectively\nleverage geometry-texture interactions for a thorough understanding of mesh\nquality. Our implementation will be made publicly available.\n","authors":["Armin Shafiee Sarvestani","Sheyang Tang","Zhou Wang"],"pdf_url":"https://arxiv.org/pdf/2412.01986v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01824v1","updated":"2024-12-02T18:59:26Z","published":"2024-12-02T18:59:26Z","title":"X-Prompt: Towards Universal In-Context Image Generation in\n Auto-Regressive Vision Language Foundation Models","summary":" In-context generation is a key component of large language models' (LLMs)\nopen-task generalization capability. By leveraging a few examples as context,\nLLMs can perform both in-domain and out-of-domain tasks. Recent advancements in\nauto-regressive vision-language models (VLMs) built upon LLMs have showcased\nimpressive performance in text-to-image generation. However, the potential of\nin-context learning for general image generation tasks remains largely\nunexplored. To address this, we introduce X-Prompt, a purely auto-regressive\nlarge-vision language model designed to deliver competitive performance across\na wide range of both seen and unseen image generation tasks, all within a\nunified in-context learning framework. 
X-Prompt incorporates a specialized\ndesign that efficiently compresses valuable features from in-context examples,\nsupporting longer in-context token sequences and improving its ability to\ngeneralize to unseen tasks. A unified training task for both text and image\nprediction enables X-Prompt to handle general image generation with enhanced\ntask awareness from in-context examples. Extensive experiments validate the\nmodel's performance across diverse seen image generation tasks and its capacity\nto generalize to previously unseen tasks.\n","authors":["Zeyi Sun","Ziyang Chu","Pan Zhang","Tong Wu","Xiaoyi Dong","Yuhang Zang","Yuanjun Xiong","Dahua Lin","Jiaqi Wang"],"pdf_url":"https://arxiv.org/pdf/2412.01824v1.pdf","comment":"code: https://github.com/SunzeY/X-Prompt"},{"id":"http://arxiv.org/abs/2412.01556v1","updated":"2024-12-02T14:44:39Z","published":"2024-12-02T14:44:39Z","title":"Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient\n Object Detection","summary":" RGB-Thermal Salient Object Detection aims to pinpoint prominent objects\nwithin aligned pairs of visible and thermal infrared images. Traditional\nencoder-decoder architectures, while designed for cross-modality feature\ninteractions, may not have adequately considered the robustness against noise\noriginating from defective modalities. Inspired by hierarchical human visual\nsystems, we propose the ConTriNet, a robust Confluent Triple-Flow Network\nemploying a Divide-and-Conquer strategy. Specifically, ConTriNet comprises\nthree flows: two modality-specific flows explore cues from RGB and Thermal\nmodalities, and a third modality-complementary flow integrates cues from both\nmodalities. ConTriNet presents several notable advantages. It incorporates a\nModality-induced Feature Modulator in the modality-shared union encoder to\nminimize inter-modality discrepancies and mitigate the impact of defective\nsamples. 
Additionally, a foundational Residual Atrous Spatial Pyramid Module in\nthe separated flows enlarges the receptive field, allowing for the capture of\nmulti-scale contextual information. Furthermore, a Modality-aware Dynamic\nAggregation Module in the modality-complementary flow dynamically aggregates\nsaliency-related cues from both modality-specific flows. Leveraging the\nproposed parallel triple-flow framework, we further refine saliency maps\nderived from different flows through a flow-cooperative fusion strategy,\nyielding a high-quality, full-resolution saliency map for the final prediction.\nTo evaluate the robustness and stability of our approach, we collect a\ncomprehensive RGB-T SOD benchmark, VT-IMAG, covering various real-world\nchallenging scenarios. Extensive experiments on public benchmarks and our\nVT-IMAG dataset demonstrate that ConTriNet consistently outperforms\nstate-of-the-art competitors in both common and challenging scenarios.\n","authors":["Hao Tang","Zechao Li","Dong Zhang","Shengfeng He","Jinhui Tang"],"pdf_url":"https://arxiv.org/pdf/2412.01556v1.pdf","comment":"Accepted by IEEE TPAMI. Project page:\n https://cser-tang-hao.github.io/contrinet.html"},{"id":"http://arxiv.org/abs/2412.01316v1","updated":"2024-12-02T09:32:36Z","published":"2024-12-02T09:32:36Z","title":"Long Video Diffusion Generation with Segmented Cross-Attention and\n Content-Rich Video Data Curation","summary":" We introduce Presto, a novel video diffusion model designed to generate\n15-second videos with long-range coherence and rich content. Extending video\ngeneration methods to maintain scenario diversity over long durations presents\nsignificant challenges. To address this, we propose a Segmented Cross-Attention\n(SCA) strategy, which splits hidden states into segments along the temporal\ndimension, allowing each segment to cross-attend to a corresponding\nsub-caption. 
SCA requires no additional parameters, enabling seamless\nincorporation into current DiT-based architectures. To facilitate high-quality\nlong video generation, we build the LongTake-HD dataset, consisting of 261k\ncontent-rich videos with scenario coherence, annotated with an overall video\ncaption and five progressive sub-captions. Experiments show that our Presto\nachieves 78.5% on the VBench Semantic Score and 100% on the Dynamic Degree,\noutperforming existing state-of-the-art video generation methods. This\ndemonstrates that our proposed Presto significantly enhances content richness,\nmaintains long-range coherence, and captures intricate textual details. More\ndetails are displayed on our project page: https://presto-video.github.io/.\n","authors":["Xin Yan","Yuxuan Cai","Qiuyue Wang","Yuan Zhou","Wenhao Huang","Huan Yang"],"pdf_url":"https://arxiv.org/pdf/2412.01316v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01202v1","updated":"2024-12-02T07:14:15Z","published":"2024-12-02T07:14:15Z","title":"Neuron Abandoning Attention Flow: Visual Explanation of Dynamics inside\n CNN Models","summary":" In this paper, we present a Neuron Abandoning Attention Flow (NAFlow) method\nto address the open problem of visually explaining the attention evolution\ndynamics inside CNNs when making their classification decisions. A novel\ncascading neuron abandoning back-propagation algorithm is designed to trace\nneurons in all layers of a CNN that involve in making its prediction to address\nthe problem of significant interference from abandoned neurons. Firstly, a\nNeuron Abandoning Back-Propagation (NA-BP) module is proposed to generate\nBack-Propagated Feature Maps (BPFM) by using the inverse function of the\nintermediate layers of CNN models, on which the neurons not used for\ndecision-making are abandoned. 
Meanwhile, the cascading NA-BP modules calculate\nthe tensors of importance coefficients which are linearly combined with the\ntensors of BPFMs to form the NAFlow. Secondly, to be able to visualize\nattention flow for similarity metric-based CNN models, a new channel\ncontribution weights module is proposed to calculate the importance\ncoefficients via Jacobian Matrix. The effectiveness of the proposed NAFlow is\nvalidated on nine widely-used CNN models for various tasks of general image\nclassification, contrastive learning classification, few-shot image\nclassification, and image retrieval.\n","authors":["Yi Liao","Yongsheng Gao","Weichuan Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.01202v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01169v1","updated":"2024-12-02T06:13:01Z","published":"2024-12-02T06:13:01Z","title":"OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows","summary":" We introduce OmniFlow, a novel generative model designed for any-to-any\ngeneration tasks such as text-to-image, text-to-audio, and audio-to-image\nsynthesis. OmniFlow advances the rectified flow (RF) framework used in\ntext-to-image models to handle the joint distribution of multiple modalities.\nIt outperforms previous any-to-any models on a wide range of tasks, such as\ntext-to-image and text-to-audio synthesis. Our work offers three key\ncontributions: First, we extend RF to a multi-modal setting and introduce a\nnovel guidance mechanism, enabling users to flexibly control the alignment\nbetween different modalities in the generated outputs. Second, we propose a\nnovel architecture that extends the text-to-image MMDiT architecture of Stable\nDiffusion 3 and enables audio and text generation. The extended modules can be\nefficiently pretrained individually and merged with the vanilla text-to-image\nMMDiT for fine-tuning. 
Lastly, we conduct a comprehensive study on the design\nchoices of rectified flow transformers for large-scale audio and text\ngeneration, providing valuable insights into optimizing performance across\ndiverse modalities. The Code will be available at\nhttps://github.com/jacklishufan/OmniFlows.\n","authors":["Shufan Li","Konstantinos Kallidromitis","Akash Gokul","Zichun Liao","Yusuke Kato","Kazuki Kozuka","Aditya Grover"],"pdf_url":"https://arxiv.org/pdf/2412.01169v1.pdf","comment":"12 pages, 14 figures"},{"id":"http://arxiv.org/abs/2412.01064v1","updated":"2024-12-02T02:50:07Z","published":"2024-12-02T02:50:07Z","title":"FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking\n Portrait","summary":" With the rapid advancement of diffusion-based generative models, portrait\nimage animation has achieved remarkable results. However, it still faces\nchallenges in temporally consistent video generation and fast sampling due to\nits iterative sampling nature. This paper presents FLOAT, an audio-driven\ntalking portrait video generation method based on flow matching generative\nmodel. We shift the generative modeling from the pixel-based latent space to a\nlearned motion latent space, enabling efficient design of temporally consistent\nmotion. To achieve this, we introduce a transformer-based vector field\npredictor with a simple yet effective frame-wise conditioning mechanism.\nAdditionally, our method supports speech-driven emotion enhancement, enabling a\nnatural incorporation of expressive motions. 
Extensive experiments demonstrate\nthat our method outperforms state-of-the-art audio-driven talking portrait\nmethods in terms of visual quality, motion fidelity, and efficiency.\n","authors":["Taekyung Ki","Dongchan Min","Gyoungsu Chae"],"pdf_url":"https://arxiv.org/pdf/2412.01064v1.pdf","comment":"Project page: https://deepbrainai-research.github.io/float/"}]},"2024-12-01T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2308.10792v8","updated":"2024-12-01T22:01:51Z","published":"2023-08-21T15:35:16Z","title":"Instruction Tuning for Large Language Models: A Survey","summary":" This paper surveys research works in the quickly advancing field of\ninstruction tuning (IT), which can also be referred to as supervised\nfine-tuning (SFT)\\footnote{In this paper, unless specified otherwise,\nsupervised fine-tuning (SFT) and instruction tuning (IT) are used\ninterchangeably.}, a crucial technique to enhance the capabilities and\ncontrollability of large language models (LLMs). Instruction tuning refers to\nthe process of further training LLMs on a dataset consisting of\n\\textsc{(instruction, output)} pairs in a supervised fashion, which bridges the\ngap between the next-word prediction objective of LLMs and the users' objective\nof having LLMs adhere to human instructions. In this work, we make a systematic\nreview of the literature, including the general methodology of SFT, the\nconstruction of SFT datasets, the training of SFT models, and applications to\ndifferent modalities, domains and application, along with analysis on aspects\nthat influence the outcome of SFT (e.g., generation of instruction outputs,\nsize of the instruction dataset, etc). We also review the potential pitfalls of\nSFT along with criticism against it, along with efforts pointing out current\ndeficiencies of existing strategies and suggest some avenues for fruitful\nresearch. 
Project Page: github.com/xiaoya-li/Instruction-Tuning-Survey\n","authors":["Shengyu Zhang","Linfeng Dong","Xiaoya Li","Sen Zhang","Xiaofei Sun","Shuhe Wang","Jiwei Li","Runyi Hu","Tianwei Zhang","Fei Wu","Guoyin Wang"],"pdf_url":"https://arxiv.org/pdf/2308.10792v8.pdf","comment":"V5; Last update: Dec. 1, 2024"},{"id":"http://arxiv.org/abs/2411.06548v2","updated":"2024-12-01T21:35:43Z","published":"2024-11-10T18:04:41Z","title":"CineXDrama: Relevance Detection and Sentiment Analysis of Bangla YouTube\n Comments on Movie-Drama using Transformers: Insights from Interpretability\n Tool","summary":" In recent years, YouTube has become the leading platform for Bangla movies\nand dramas, where viewers express their opinions in comments that convey their\nsentiments about the content. However, not all comments are relevant for\nsentiment analysis, necessitating a filtering mechanism. We propose a system\nthat first assesses the relevance of comments and then analyzes the sentiment\nof those deemed relevant. We introduce a dataset of 14,000 manually collected\nand preprocessed comments, annotated for relevance (relevant or irrelevant) and\nsentiment (positive or negative). Eight transformer models, including\nBanglaBERT, were used for classification tasks, with BanglaBERT achieving the\nhighest accuracy (83.99% for relevance detection and 93.3% for sentiment\nanalysis). The study also integrates LIME to interpret model decisions,\nenhancing transparency.\n","authors":["Usafa Akther Rifa","Pronay Debnath","Busra Kamal Rafa","Shamaun Safa Hridi","Md. 
Aminur Rahman"],"pdf_url":"https://arxiv.org/pdf/2411.06548v2.pdf","comment":"Accepted for publication in Fifth International Conference on\n Advances in Electrical, Computing, Communications and Sustainable\n Technologies (ICAECT 2025)"},{"id":"http://arxiv.org/abs/2404.04393v2","updated":"2024-12-01T20:48:11Z","published":"2024-04-05T20:36:30Z","title":"Counting Like Transformers: Compiling Temporal Counting Logic Into\n Softmax Transformers","summary":" Deriving formal bounds on the expressivity of transformers, as well as\nstudying transformers that are constructed to implement known algorithms, are\nboth effective methods for better understanding the computational power of\ntransformers. Towards both ends, we introduce the temporal counting logic\n$\\textsf{K}_\\text{t}$[#] alongside the RASP variant $\\textsf{C-RASP}$. We show\nthey are equivalent to each other, and that together they are the best-known\nlower bound on the formal expressivity of future-masked soft attention\ntransformers with unbounded input size. We prove this by showing all\n$\\textsf{K}_\\text{t}$[#] formulas can be compiled into these transformers.\n","authors":["Andy Yang","David Chiang"],"pdf_url":"https://arxiv.org/pdf/2404.04393v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.03730v3","updated":"2024-12-01T18:45:39Z","published":"2022-11-07T17:59:05Z","title":"DPCSpell: A Transformer-based Detector-Purificator-Corrector Framework\n for Spelling Error Correction of Bangla and Resource Scarce Indic Languages","summary":" Spelling error correction is the task of identifying and rectifying\nmisspelled words in texts. It is a potential and active research topic in\nNatural Language Processing because of numerous applications in human language\nunderstanding. The phonetically or visually similar yet semantically distinct\ncharacters make it an arduous task in any language. 
Earlier efforts on spelling\nerror correction in Bangla and resource-scarce Indic languages focused on\nrule-based, statistical, and machine learning-based methods which we found\nrather inefficient. In particular, machine learning-based approaches, which\nexhibit superior performance to rule-based and statistical methods, are\nineffective as they correct each character regardless of its appropriateness.\nIn this paper, we propose a novel detector-purificator-corrector framework,\nDPCSpell based on denoising transformers by addressing previous issues. In\naddition to that, we present a method for large-scale corpus creation from\nscratch which in turn resolves the resource limitation problem of any\nleft-to-right scripted language. The empirical outcomes demonstrate the\neffectiveness of our approach, which outperforms previous state-of-the-art\nmethods by attaining an exact match (EM) score of 94.78%, a precision score of\n0.9487, a recall score of 0.9478, an f1 score of 0.948, an f0.5 score of\n0.9483, and a modified accuracy (MA) score of 95.16% for Bangla spelling error\ncorrection. The models and corpus are publicly available at\nhttps://tinyurl.com/DPCSpell.\n","authors":["Mehedi Hasan Bijoy","Nahid Hossain","Salekul Islam","Swakkhar Shatabda"],"pdf_url":"https://arxiv.org/pdf/2211.03730v3.pdf","comment":"29 pages, 4 figures, and 9 tables"},{"id":"http://arxiv.org/abs/2406.12644v3","updated":"2024-12-01T17:45:28Z","published":"2024-06-18T14:12:27Z","title":"Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for\n Large Language Models","summary":" Assessing the effectiveness of large language models (LLMs) in performing\ndifferent tasks is crucial for understanding their strengths and weaknesses.\nThis paper presents the Hierarchical Prompting Taxonomy (HPT), grounded on\nhuman cognitive principles and designed to assess LLMs by examining the\ncognitive demands of various tasks. 
The HPT uses the Hierarchical Prompting\nFramework (HPF), a prompt selection framework that organizes five distinct\nprompting strategies by their cognitive load on LLMs. This study introduces the\nHierarchical Prompting Index (HPI) to measure task complexity, which\ndemonstrates LLMs' abilities across different datasets and serves as a\nuniversal metric for task complexity. The HPT offers a reliable method for\nevaluating LLMs' problem-solving skills in diverse scenarios, leading to\nclearer conclusions. Extensive experiments with multiple datasets and LLMs show\nthat the HPF enhances LLM performance by 2\\% to 63\\% compared to standard\nbenchmark datasets, confirming the effectiveness of the HPT. To support future\nresearch in this domain, the implementations of HPT and HPF are publicly\navailable\n","authors":["Devichand Budagam","Ashutosh Kumar","Mahsa Khoshnoodi","Sankalp KJ","Vinija Jain","Aman Chadha"],"pdf_url":"https://arxiv.org/pdf/2406.12644v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.17661v2","updated":"2024-12-01T17:10:16Z","published":"2024-11-26T18:25:57Z","title":"BERT or FastText? A Comparative Analysis of Contextual as well as\n Non-Contextual Embeddings","summary":" Natural Language Processing (NLP) for low-resource languages presents\nsignificant challenges, particularly due to the scarcity of high-quality\nannotated data and linguistic resources. The choice of embeddings plays a\ncritical role in enhancing the performance of NLP tasks, such as news\nclassification, sentiment analysis, and hate speech detection, especially for\nlow-resource languages like Marathi. In this study, we investigate the impact\nof various embedding techniques- Contextual BERT-based, Non-Contextual\nBERT-based, and FastText-based on NLP classification tasks specific to the\nMarathi language. 
Our research includes a thorough evaluation of both\ncompressed and uncompressed embeddings, providing a comprehensive overview of\nhow these embeddings perform across different scenarios. Specifically, we\ncompare two BERT model embeddings, Muril and MahaBERT, as well as two FastText\nmodel embeddings, IndicFT and MahaFT. Our evaluation includes applying\nembeddings to a Multiple Logistic Regression (MLR) classifier for task\nperformance assessment, as well as TSNE visualizations to observe the spatial\ndistribution of these embeddings. The results demonstrate that contextual\nembeddings outperform non-contextual embeddings. Furthermore, BERT-based\nnon-contextual embeddings extracted from the first BERT embedding layer yield\nbetter results than FastText-based embeddings, suggesting a potential\nalternative to FastText embeddings.\n","authors":["Abhay Shanbhag","Suramya Jadhav","Amogh Thakurdesai","Ridhima Sinare","Raviraj Joshi"],"pdf_url":"https://arxiv.org/pdf/2411.17661v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.19128v3","updated":"2024-12-01T15:07:29Z","published":"2024-10-24T19:56:28Z","title":"Retrieving Implicit and Explicit Emotional Events Using Large Language\n Models","summary":" Large language models (LLMs) have garnered significant attention in recent\nyears due to their impressive performance. While considerable research has\nevaluated these models from various perspectives, the extent to which LLMs can\nperform implicit and explicit emotion retrieval remains largely unexplored. To\naddress this gap, this study investigates LLMs' emotion retrieval capabilities\nin commonsense. Through extensive experiments involving multiple models, we\nsystematically evaluate the ability of LLMs on emotion retrieval. Specifically,\nwe propose a supervised contrastive probing method to verify LLMs' performance\nfor implicit and explicit emotion retrieval, as well as the diversity of the\nemotional events they retrieve. 
The results offer valuable insights into the\nstrengths and limitations of LLMs in handling emotion retrieval.\n","authors":["Guimin Hu","Hasti Seifi"],"pdf_url":"https://arxiv.org/pdf/2410.19128v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.01100v2","updated":"2024-12-01T14:32:47Z","published":"2024-10-01T22:03:34Z","title":"Unlocking Korean Verbs: A User-Friendly Exploration into the Verb\n Lexicon","summary":" The Sejong dictionary dataset offers a valuable resource, providing extensive\ncoverage of morphology, syntax, and semantic representation. This dataset can\nbe utilized to explore linguistic information in greater depth. The labeled\nlinguistic structures within this dataset form the basis for uncovering\nrelationships between words and phrases and their associations with target\nverbs. This paper introduces a user-friendly web interface designed for the\ncollection and consolidation of verb-related information, with a particular\nfocus on subcategorization frames. Additionally, it outlines our efforts in\nmapping this information by aligning subcategorization frames with\ncorresponding illustrative sentence examples. Furthermore, we provide a Python\nlibrary that would simplify syntactic parsing and semantic role labeling. 
These\ntools are intended to assist individuals interested in harnessing the Sejong\ndictionary dataset to develop applications for Korean language processing.\n","authors":["Seohyun Song","Eunkyul Leah Jo","Yige Chen","Jeen-Pyo Hong","Kyuwon Kim","Jin Wee","Miyoung Kang","KyungTae Lim","Jungyeul Park","Chulwoo Park"],"pdf_url":"https://arxiv.org/pdf/2410.01100v2.pdf","comment":"NAACL 2025 System Demonstrations (Submitted)"},{"id":"http://arxiv.org/abs/2305.14225v3","updated":"2024-12-01T13:55:56Z","published":"2023-05-23T16:40:07Z","title":"ManiTweet: A New Benchmark for Identifying Manipulation of News on\n Social Media","summary":" Considerable advancements have been made to tackle the misrepresentation of\ninformation derived from reference articles in the domains of fact-checking and\nfaithful summarization. However, an unaddressed aspect remains - the\nidentification of social media posts that manipulate information within\nassociated news articles. This task presents a significant challenge, primarily\ndue to the prevalence of personal opinions in such posts. We present a novel\ntask, identifying manipulation of news on social media, which aims to detect\nmanipulation in social media posts and identify manipulated or inserted\ninformation. To study this task, we have proposed a data collection schema and\ncurated a dataset called ManiTweet, consisting of 3.6K pairs of tweets and\ncorresponding articles. Our analysis demonstrates that this task is highly\nchallenging, with large language models (LLMs) yielding unsatisfactory\nperformance. Additionally, we have developed a simple yet effective basic model\nthat outperforms LLMs significantly on the ManiTweet dataset. 
Finally, we have\nconducted an exploratory analysis of human-written tweets, unveiling intriguing\nconnections between manipulation and the domain and factuality of news\narticles, as well as revealing that manipulated sentences are more likely to\nencapsulate the main story or consequences of a news outlet.\n","authors":["Kung-Hsiang Huang","Hou Pong Chan","Kathleen McKeown","Heng Ji"],"pdf_url":"https://arxiv.org/pdf/2305.14225v3.pdf","comment":"COLING 2025"},{"id":"http://arxiv.org/abs/2406.17305v2","updated":"2024-12-01T09:02:35Z","published":"2024-06-25T06:24:50Z","title":"Retrieval Augmented Instruction Tuning for Open NER with Large Language\n Models","summary":" The strong capability of large language models (LLMs) has been applied to\ninformation extraction (IE) through either retrieval augmented prompting or\ninstruction tuning (IT). However, the best way to incorporate information with\nLLMs for IE remains an open question. In this paper, we explore Retrieval\nAugmented Instruction Tuning (RA-IT) for IE, focusing on the task of open named\nentity recognition (NER). Specifically, for each training sample, we retrieve\nsemantically similar examples from the training dataset as the context and\nprepend them to the input of the original instruction. To evaluate our RA-IT\napproach more thoroughly, we construct a Chinese IT dataset for open NER and\nevaluate RA-IT in both English and Chinese scenarios. Experimental results\nverify the effectiveness of RA-IT across various data sizes and in both English\nand Chinese scenarios. We also conduct thorough studies to explore the impacts\nof various retrieval strategies in the proposed RA-IT framework. 
Code and data\nare available at: https://github.com/Emma1066/Retrieval-Augmented-IT-OpenNER\n","authors":["Tingyu Xie","Jian Zhang","Yan Zhang","Yuanyuan Liang","Qi Li","Hongwei Wang"],"pdf_url":"https://arxiv.org/pdf/2406.17305v2.pdf","comment":"To be appeared at COLING 2025"},{"id":"http://arxiv.org/abs/2411.14491v3","updated":"2024-12-01T08:37:51Z","published":"2024-11-20T12:34:44Z","title":"A Survey on Human-Centric LLMs","summary":" The rapid evolution of large language models (LLMs) and their capacity to\nsimulate human cognition and behavior has given rise to LLM-based frameworks\nand tools that are evaluated and applied based on their ability to perform\ntasks traditionally performed by humans, namely those involving cognition,\ndecision-making, and social interaction. This survey provides a comprehensive\nexamination of such human-centric LLM capabilities, focusing on their\nperformance in both individual tasks (where an LLM acts as a stand-in for a\nsingle human) and collective tasks (where multiple LLMs coordinate to mimic\ngroup dynamics). We first evaluate LLM competencies across key areas including\nreasoning, perception, and social cognition, comparing their abilities to\nhuman-like skills. Then, we explore real-world applications of LLMs in\nhuman-centric domains such as behavioral science, political science, and\nsociology, assessing their effectiveness in replicating human behaviors and\ninteractions. Finally, we identify challenges and future research directions,\nsuch as improving LLM adaptability, emotional intelligence, and cultural\nsensitivity, while addressing inherent biases and enhancing frameworks for\nhuman-AI collaboration. 
This survey aims to provide a foundational\nunderstanding of LLMs from a human-centric perspective, offering insights into\ntheir current capabilities and potential for future development.\n","authors":["Jing Yi Wang","Nicholas Sukiennik","Tong Li","Weikang Su","Qianyue Hao","Jingbo Xu","Zihan Huang","Fengli Xu","Yong Li"],"pdf_url":"https://arxiv.org/pdf/2411.14491v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.07602v2","updated":"2024-12-01T06:39:41Z","published":"2024-11-12T07:24:41Z","title":"Circuit Complexity Bounds for RoPE-based Transformer Architecture","summary":" Characterizing the express power of the Transformer architecture is critical\nto understanding its capacity limits and scaling law. Recent works provide the\ncircuit complexity bounds to Transformer-like architecture. On the other hand,\nRotary Position Embedding ($\\mathsf{RoPE}$) has emerged as a crucial technique\nin modern large language models, offering superior performance in capturing\npositional information compared to traditional position embeddings, which shows\ngreat potential in application prospects, particularly for the long context\nscenario. Empirical evidence also suggests that $\\mathsf{RoPE}$-based\nTransformer architectures demonstrate greater generalization capabilities\ncompared to conventional Transformer models. In this work, we establish a\ncircuit complexity bound for Transformers with $\\mathsf{RoPE}$ attention. Our\nkey contribution is that we show that unless $\\mathsf{TC}^0 = \\mathsf{NC}^1$, a\n$\\mathsf{RoPE}$-based Transformer with $\\mathrm{poly}(n)$-precision, $O(1)$\nlayers, hidden dimension $d \\leq O(n)$ cannot solve the Arithmetic formula\nevaluation problem or the Boolean formula value problem. This result\nsignificantly demonstrates the fundamental limitation of the expressivity of\nthe $\\mathsf{RoPE}$-based Transformer architecture, although it achieves giant\nempirical success. 
Our theoretical result not only establishes the complexity\nbound but also may instruct further work on the $\\mathsf{RoPE}$-based\nTransformer.\n","authors":["Bo Chen","Xiaoyu Li","Yingyu Liang","Jiangxuan Long","Zhenmei Shi","Zhao Song"],"pdf_url":"https://arxiv.org/pdf/2411.07602v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.00765v2","updated":"2024-12-01T06:08:00Z","published":"2024-08-01T17:59:54Z","title":"MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models\n for Integrated Capabilities","summary":" MM-Vet, with open-ended vision-language questions targeting at evaluating\nintegrated capabilities, has become one of the most popular benchmarks for\nlarge multimodal model evaluation. MM-Vet assesses six core vision-language\n(VL) capabilities: recognition, knowledge, spatial awareness, language\ngeneration, OCR, and math. However, its question format is restricted to single\nimage-text pairs, lacking the interleaved image and text sequences prevalent in\nreal-world scenarios. To address this limitation, we introduce MM-Vet v2, which\nincludes a new VL capability called \"image-text sequence understanding\",\nevaluating models' ability to process VL sequences. Furthermore, we maintain\nthe high quality of evaluation samples while further expanding the evaluation\nset size. Using MM-Vet v2 to benchmark large multimodal models, we found that\nClaude 3.5 Sonnet is the best model with a score of 71.8, slightly\noutperforming GPT-4o which scored 71.0. Among open-weight models,\nInternVL2-Llama3-76B leads with a score of 68.4. 
The code, data, and\nleaderboard are accessible at https://github.com/yuweihao/MM-Vet.\n","authors":["Weihao Yu","Zhengyuan Yang","Lingfeng Ren","Linjie Li","Jianfeng Wang","Kevin Lin","Chung-Ching Lin","Zicheng Liu","Lijuan Wang","Xinchao Wang"],"pdf_url":"https://arxiv.org/pdf/2408.00765v2.pdf","comment":"Code, data and leaderboard: https://github.com/yuweihao/MM-Vet"},{"id":"http://arxiv.org/abs/2308.02490v4","updated":"2024-12-01T05:46:03Z","published":"2023-08-04T17:59:47Z","title":"MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities","summary":" We propose MM-Vet, an evaluation benchmark that examines large multimodal\nmodels (LMMs) on complicated multimodal tasks. Recent LMMs have shown various\nintriguing abilities, such as solving math problems written on the blackboard,\nreasoning about events and celebrities in news images, and explaining visual\njokes. Rapid model advancements pose challenges to evaluation benchmark\ndevelopment. Problems include: (1) How to systematically structure and evaluate\nthe complicated multimodal tasks; (2) How to design evaluation metrics that\nwork well across question and answer types; and (3) How to give model insights\nbeyond a simple performance ranking. To this end, we present MM-Vet, designed\nbased on the insight that the intriguing ability to solve complicated tasks is\noften achieved by a generalist model being able to integrate different core\nvision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and\nexamines the 16 integrations of interest derived from the capability\ncombination. For evaluation metrics, we propose an LLM-based evaluator for\nopen-ended outputs. The evaluator enables the evaluation across different\nquestion types and answer styles, resulting in a unified scoring metric. 
We\nevaluate representative LMMs on MM-Vet, providing insights into the\ncapabilities of different LMM system paradigms and models.\n","authors":["Weihao Yu","Zhengyuan Yang","Linjie Li","Jianfeng Wang","Kevin Lin","Zicheng Liu","Xinchao Wang","Lijuan Wang"],"pdf_url":"https://arxiv.org/pdf/2308.02490v4.pdf","comment":"ICML 2024. Code, data and leaderboard:\n https://github.com/yuweihao/MM-Vet"},{"id":"http://arxiv.org/abs/2309.17249v3","updated":"2024-12-01T01:36:50Z","published":"2023-09-29T13:55:45Z","title":"Batch Calibration: Rethinking Calibration for In-Context Learning and\n Prompt Engineering","summary":" Prompting and in-context learning (ICL) have become efficient learning\nparadigms for large language models (LLMs). However, LLMs suffer from prompt\nbrittleness and various bias factors in the prompt, including but not limited\nto the formatting, the choice verbalizers, and the ICL examples. To address\nthis problem that results in unexpected performance degradation, calibration\nmethods have been developed to mitigate the effects of these biases while\nrecovering LLM performance. In this work, we first conduct a systematic\nanalysis of the existing calibration methods, where we both provide a unified\nview and reveal the failure cases. Inspired by these analyses, we propose Batch\nCalibration (BC), a simple yet intuitive method that controls the contextual\nbias from the batched input, unifies various prior approaches, and effectively\naddresses the aforementioned issues. BC is zero-shot, inference-only, and\nincurs negligible additional costs. In the few-shot setup, we further extend BC\nto allow it to learn the contextual bias from labeled data. 
We validate the\neffectiveness of BC with PaLM 2-(S, M, L) and CLIP models and demonstrate\nstate-of-the-art performance over previous calibration baselines across more\nthan 10 natural language understanding and image classification tasks.\n","authors":["Han Zhou","Xingchen Wan","Lev Proleev","Diana Mincu","Jilin Chen","Katherine Heller","Subhrajit Roy"],"pdf_url":"https://arxiv.org/pdf/2309.17249v3.pdf","comment":"ICLR 2024"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2209.05227v5","updated":"2024-12-01T16:50:02Z","published":"2022-09-12T13:26:26Z","title":"DUET: A Tuning-Free Device-Cloud Collaborative Parameters Generation\n Framework for Efficient Device Model Generalization","summary":" Device Model Generalization (DMG) is a practical yet under-investigated\nresearch topic for on-device machine learning applications. It aims to improve\nthe generalization ability of pre-trained models when deployed on\nresource-constrained devices, such as improving the performance of pre-trained\ncloud models on smart mobiles. While quite a lot of works have investigated the\ndata distribution shift across clouds and devices, most of them focus on model\nfine-tuning on personalized data for individual devices to facilitate DMG.\nDespite their promising, these approaches require on-device re-training, which\nis practically infeasible due to the overfitting problem and high time delay\nwhen performing gradient calculation on real-time data. In this paper, we argue\nthat the computational cost brought by fine-tuning can be rather unnecessary.\nWe consequently present a novel perspective to improving DMG without increasing\ncomputational cost, i.e., device-specific parameter generation which directly\nmaps data distribution to parameters. Specifically, we propose an efficient\nDevice-cloUd collaborative parametErs generaTion framework DUET. 
DUET is\ndeployed on a powerful cloud server that only requires the low cost of\nforwarding propagation and low time delay of data transmission between the\ndevice and the cloud. By doing so, DUET can rehearse the device-specific model\nweight realizations conditioned on the personalized real-time data for an\nindividual device. Importantly, our DUET elegantly connects the cloud and\ndevice as a 'duet' collaboration, frees the DMG from fine-tuning, and enables a\nfaster and more accurate DMG paradigm. We conduct an extensive experimental\nstudy of DUET on three public datasets, and the experimental results confirm\nour framework's effectiveness and generalisability for different DMG tasks.\n","authors":["Zheqi Lv","Wenqiao Zhang","Shengyu Zhang","Kun Kuang","Feng Wang","Yongwei Wang","Zhengyu Chen","Tao Shen","Hongxia Yang","Beng Chin Ooi","Fei Wu"],"pdf_url":"https://arxiv.org/pdf/2209.05227v5.pdf","comment":"Published on WWW'23: Proceedings of the ACM on Web Conference 2023\n (pp. 3077 - 3085)"},{"id":"http://arxiv.org/abs/2302.07335v3","updated":"2024-12-01T16:41:49Z","published":"2023-02-14T20:44:12Z","title":"Intelligent Model Update Strategy for Sequential Recommendation","summary":" Modern online platforms are increasingly employing recommendation systems to\naddress information overload and improve user engagement. There is an evolving\nparadigm in this research field that recommendation network learning occurs\nboth on the cloud and on edges with knowledge transfer in between (i.e.,\nedge-cloud collaboration). Recent works push this field further by enabling\nedge-specific context-aware adaptivity, where model parameters are updated in\nreal-time based on incoming on-edge data. However, we argue that frequent data\nexchanges between the cloud and edges often lead to inefficiency and waste of\ncommunication/computation resources, as considerable parameter updates might be\nredundant. 
To investigate this problem, we introduce Intelligent Edge-Cloud\nParameter Request Model, abbreviated as IntellectReq.\n IntellectReq is designed to operate on edge, evaluating the cost-benefit\nlandscape of parameter requests with minimal computation and communication\noverhead. We formulate this as a novel learning task, aimed at the detection of\nout-of-distribution data, thereby fine-tuning adaptive communication\nstrategies. Further, we employ statistical mapping techniques to convert\nreal-time user behavior into a normal distribution, thereby employing\nmulti-sample outputs to quantify the model's uncertainty and thus its\ngeneralization capabilities. Rigorous empirical validation on four\nwidely-adopted benchmarks evaluates our approach, evidencing a marked\nimprovement in the efficiency and generalizability of edge-cloud collaborative\nand dynamic recommendation systems.\n","authors":["Zheqi Lv","Wenqiao Zhang","Zhengyu Chen","Shengyu Zhang","Kun Kuang"],"pdf_url":"https://arxiv.org/pdf/2302.07335v3.pdf","comment":"Published on WWW'24(Oral): Proceedings of the ACM on Web Conference\n 2024 (pp. 3117-3128)"},{"id":"http://arxiv.org/abs/2411.06237v2","updated":"2024-12-01T13:31:14Z","published":"2024-11-09T17:38:01Z","title":"Leveraging Retrieval-Augmented Generation for Persian University\n Knowledge Retrieval","summary":" This paper introduces an innovative approach using Retrieval-Augmented\nGeneration (RAG) pipelines with Large Language Models (LLMs) to enhance\ninformation retrieval and query response systems for university-related\nquestion answering. 
By systematically extracting data from the university\nofficial webpage and employing advanced prompt engineering techniques, we\ngenerate accurate, contextually relevant responses to user queries.\n We developed a comprehensive university benchmark, UniversityQuestionBench\n(UQB), to rigorously evaluate our system performance, based on common key\nmetrics in the field of RAG pipelines, assessing accuracy and reliability\nthrough various metrics and real-world scenarios. Our experimental results\ndemonstrate significant improvements in the precision and relevance of\ngenerated responses, enhancing user experience and reducing the time required\nto obtain relevant answers. In summary, this paper presents a novel application\nof RAG pipelines and LLMs, supported by a meticulously prepared university\nbenchmark, offering valuable insights into advanced AI techniques for academic\ndata retrieval and setting the stage for future research in this domain.\n","authors":["Arshia Hemmat","Kianoosh Vadaei","Mohammad Hassan Heydari","Afsaneh Fatemi"],"pdf_url":"https://arxiv.org/pdf/2411.06237v2.pdf","comment":"6 pages, 2 figures, 1 table, Submitted to 15th IKT conference"},{"id":"http://arxiv.org/abs/2411.17229v2","updated":"2024-12-01T13:20:02Z","published":"2024-11-26T08:51:46Z","title":"Efficient Data-aware Distance Comparison Operations for High-Dimensional\n Approximate Nearest Neighbor Search","summary":" High-dimensional approximate $K$ nearest neighbor search (AKNN) is a\nfundamental task for various applications, including information retrieval.\nMost existing algorithms for AKNN can be decomposed into two main components,\ni.e., candidate generation and distance comparison operations (DCOs). While\ndifferent methods have unique ways of generating candidates, they all share the\nsame DCO process. In this study, we focus on accelerating the process of DCOs\nthat dominates the time cost in most existing AKNN algorithms. 
To achieve this,\nwe propose a Data-Aware Distance Estimation approach, called DADE, which\napproximates the exact distance in a lower-dimensional space. We theoretically\nprove that the distance estimation in DADE is unbiased in terms of data\ndistribution. Furthermore, we propose an optimized estimation based on the\nunbiased distance estimation formulation. In addition, we propose a hypothesis\ntesting approach to adaptively determine the number of dimensions needed to\nestimate the exact distance with sufficient confidence. We integrate DADE into\nwidely-used AKNN search algorithms, e.g., IVF and HNSW, and conduct extensive\nexperiments to demonstrate the superiority.\n","authors":["Liwei Deng","Penghao Chen","Ximu Zeng","Tianfu Wang","Yan Zhao","Kai Zheng"],"pdf_url":"https://arxiv.org/pdf/2411.17229v2.pdf","comment":"Accepted by VLDB 2025"},{"id":"http://arxiv.org/abs/2405.18560v2","updated":"2024-12-01T05:22:22Z","published":"2024-05-28T20:10:06Z","title":"Potential Field Based Deep Metric Learning","summary":" Deep metric learning (DML) involves training a network to learn a\nsemantically meaningful representation space. Many current approaches mine\nn-tuples of examples and model interactions within each tuplet. We present a\nnovel, compositional DML model, inspired by electrostatic fields in physics\nthat, instead of in tuples, represents the influence of each example\n(embedding) by a continuous potential field, and superposes the fields to\nobtain their combined global potential field. We use attractive/repulsive\npotential fields to represent interactions among embeddings from images of the\nsame/different classes. Contrary to typical learning methods, where mutual\ninfluence of samples is proportional to their distance, we enforce reduction in\nsuch influence with distance, leading to a decaying field. We show that such\ndecay helps improve performance on real world datasets with large intra-class\nvariations and label noise. 
Like other proxy-based methods, we also use proxies\nto succinctly represent sub-populations of examples. We evaluate our method on\nthree standard DML benchmarks- Cars-196, CUB-200-2011, and SOP datasets where\nit outperforms state-of-the-art baselines.\n","authors":["Shubhang Bhatnagar","Narendra Ahuja"],"pdf_url":"https://arxiv.org/pdf/2405.18560v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.14592v2","updated":"2024-12-01T01:21:24Z","published":"2024-11-21T21:22:58Z","title":"G-RAG: Knowledge Expansion in Material Science","summary":" In the field of Material Science, effective information retrieval systems are\nessential for facilitating research. Traditional Retrieval-Augmented Generation\n(RAG) approaches in Large Language Models (LLMs) often encounter challenges\nsuch as outdated information, hallucinations, limited interpretability due to\ncontext constraints, and inaccurate retrieval. To address these issues, Graph\nRAG integrates graph databases to enhance the retrieval process. Our proposed\nmethod processes Material Science documents by extracting key entities\n(referred to as MatIDs) from sentences, which are then utilized to query\nexternal Wikipedia knowledge bases (KBs) for additional relevant information.\nWe implement an agent-based parsing technique to achieve a more detailed\nrepresentation of the documents. Our improved version of Graph RAG called G-RAG\nfurther leverages a graph database to capture relationships between these\nentities, improving both retrieval accuracy and contextual understanding. 
This\nenhanced approach demonstrates significant improvements in performance for\ndomains that require precise information retrieval, such as Material Science.\n","authors":["Radeen Mostafa","Mirza Nihal Baig","Mashaekh Tausif Ehsan","Jakir Hasan"],"pdf_url":"https://arxiv.org/pdf/2411.14592v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01007v1","updated":"2024-12-01T23:54:12Z","published":"2024-12-01T23:54:12Z","title":"CoRNStack: High-Quality Contrastive Data for Better Code Ranking","summary":" Effective code retrieval plays a crucial role in advancing code generation,\nbug fixing, and software maintenance, particularly as software systems increase\nin complexity. While current code embedding models have demonstrated promise in\nretrieving code snippets for small-scale, well-defined tasks, they often\nunderperform in more demanding real-world applications such as bug localization\nwithin GitHub repositories. We hypothesize that a key issue is their reliance\non noisy and inconsistent datasets for training, which impedes their ability to\ngeneralize to more complex retrieval scenarios. To address these limitations,\nwe introduce CoRNStack, a large-scale, high-quality contrastive training\ndataset for code that spans multiple programming languages. This dataset is\ncurated using consistency filtering to eliminate noisy positives and is further\nenriched with mined hard negatives, thereby facilitating more effective\nlearning. We demonstrate that contrastive training of embedding models using\nCoRNStack leads to state-of-the-art performance across a variety of code\nretrieval tasks. Furthermore, the dataset can be leveraged for training code\nreranking models, a largely underexplored area compared to text reranking. Our\nfinetuned code reranking model significantly improves the ranking quality over\nthe retrieved results. 
Finally, by employing our code retriever and reranker\ntogether, we demonstrate significant improvements in function localization for\nGitHub issues, an important component of real-world software development.\n","authors":["Tarun Suresh","Revanth Gangi Reddy","Yifei Xu","Zach Nussbaum","Andriy Mulyar","Brandon Duderstadt","Heng Ji"],"pdf_url":"https://arxiv.org/pdf/2412.01007v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.00978v1","updated":"2024-12-01T21:58:44Z","published":"2024-12-01T21:58:44Z","title":"Patent-publication pairs for the detection of knowledge transfer from\n research to industry: reducing ambiguities with word embeddings and\n references","summary":" The performance of medical research can be viewed and evaluated not only from\nthe perspective of publication output, but also from the perspective of\neconomic exploitability. Patents can represent the exploitation of research\nresults and thus the transfer of knowledge from research to industry. In this\nstudy, we set out to identify publication-patent pairs in order to use patents\nas a proxy for the economic impact of research. To identify these pairs, we\nmatched scholarly publications and patents by comparing the names of authors\nand investors. To resolve the ambiguities that arise in this name-matching\nprocess, we expanded our approach with two additional filter features, one used\nto assess the similarity of text content, the other to identify common\nreferences in the two document types. To evaluate text similarity, we extracted\nand transformed technical terms from a medical ontology (MeSH) into numerical\nvectors using word embeddings. We then calculated the results of the two\nsupporting features over an example five-year period. Furthermore, we developed\na statistical procedure which can be used to determine valid patent classes for\nthe domain of medicine. 
Our complete data processing pipeline is freely\navailable, from the raw data of the two document types right through to the\nvalidated publication-patent pairs.\n","authors":["Klaus Lippert","Konrad U. Förstner"],"pdf_url":"https://arxiv.org/pdf/2412.00978v1.pdf","comment":"16 Pages, 8 figures"},{"id":"http://arxiv.org/abs/2412.00934v1","updated":"2024-12-01T18:58:17Z","published":"2024-12-01T18:58:17Z","title":"QABISAR: Query-Article Bipartite Interactions for Statutory Article\n Retrieval","summary":" In this paper, we introduce QABISAR, a novel framework for statutory article\nretrieval, to overcome the semantic mismatch problem when modeling each\nquery-article pair in isolation, making it hard to learn representation that\ncan effectively capture multi-faceted information. QABISAR leverages bipartite\ninteractions between queries and articles to capture diverse aspects inherent\nin them. Further, we employ knowledge distillation to transfer enriched query\nrepresentations from the graph network into the query bi-encoder, to capture\nthe rich semantics present in the graph representations, despite absence of\ngraph-based supervision for unseen queries during inference. Our experiments on\na real-world expert-annotated dataset demonstrate its effectiveness.\n","authors":["T. Y. S. S. Santosh","Hassan Sarwat","Matthias Grabmair"],"pdf_url":"https://arxiv.org/pdf/2412.00934v1.pdf","comment":"Accepted to COLING 2025"},{"id":"http://arxiv.org/abs/2412.00813v1","updated":"2024-12-01T14:01:17Z","published":"2024-12-01T14:01:17Z","title":"Oracle-guided Dynamic User Preference Modeling for Sequential\n Recommendation","summary":" Sequential recommendation methods can capture dynamic user preferences from\nuser historical interactions to achieve better performance. However, most\nexisting methods only use past information extracted from user historical\ninteractions to train the models, leading to the deviations of user preference\nmodeling. 
Besides past information, future information is also available during\ntraining, which contains the ``oracle'' user preferences in the future and will\nbe beneficial to model dynamic user preferences. Therefore, we propose an\noracle-guided dynamic user preference modeling method for sequential\nrecommendation (Oracle4Rec), which leverages future information to guide model\ntraining on past information, aiming to learn ``forward-looking'' models.\nSpecifically, Oracle4Rec first extracts past and future information through two\nseparate encoders, then learns a forward-looking model through an\noracle-guiding module which minimizes the discrepancy between past and future\ninformation. We also tailor a two-phase model training strategy to make the\nguiding more effective. Extensive experiments demonstrate that Oracle4Rec is\nsuperior to state-of-the-art sequential methods. Further experiments show that\nOracle4Rec can be leveraged as a generic module in other sequential\nrecommendation methods to improve their performance with a considerable margin.\n","authors":["Jiafeng Xia","Dongsheng Li","Hansu Gu","Tun Lu","Peng Zhang","Li Shang","Ning Gu"],"pdf_url":"https://arxiv.org/pdf/2412.00813v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.00714v1","updated":"2024-12-01T07:27:20Z","published":"2024-12-01T07:27:20Z","title":"Scaling New Frontiers: Insights into Large Recommendation Models","summary":" Recommendation systems are essential for filtering data and retrieving\nrelevant information across various applications. Recent advancements have seen\nthese systems incorporate increasingly large embedding tables, scaling up to\ntens of terabytes for industrial use. 
However, the expansion of network\nparameters in traditional recommendation models has plateaued at tens of\nmillions, limiting further benefits from increased embedding parameters.\nInspired by the success of large language models (LLMs), a new approach has\nemerged that scales network parameters using innovative structures, enabling\ncontinued performance improvements. A significant development in this area is\nMeta's generative recommendation model HSTU, which illustrates the scaling laws\nof recommendation systems by expanding parameters to thousands of billions.\nThis new paradigm has achieved substantial performance gains in online\nexperiments. In this paper, we aim to enhance the understanding of scaling laws\nby conducting comprehensive evaluations of large recommendation models.\nFirstly, we investigate the scaling laws across different backbone\narchitectures of the large recommendation models. Secondly, we conduct\ncomprehensive ablation studies to explore the origins of these scaling laws. We\nthen further assess the performance of HSTU, as the representative of large\nrecommendation models, on complex user behavior modeling tasks to evaluate its\napplicability. Notably, we also analyze its effectiveness in ranking tasks for\nthe first time. Finally, we offer insights into future directions for large\nrecommendation models. 
Supplementary materials for our research are available\non GitHub at https://github.com/USTC-StarTeam/Large-Recommendation-Models.\n","authors":["Wei Guo","Hao Wang","Luankang Zhang","Jin Yao Chin","Zhongzhou Liu","Kai Cheng","Qiushi Pan","Yi Quan Lee","Wanqi Xue","Tingjia Shen","Kenan Song","Kefan Wang","Wenjia Xie","Yuyang Ye","Huifeng Guo","Yong Liu","Defu Lian","Ruiming Tang","Enhong Chen"],"pdf_url":"https://arxiv.org/pdf/2412.00714v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.00657v1","updated":"2024-12-01T03:28:26Z","published":"2024-12-01T03:28:26Z","title":"Improving Vietnamese Legal Document Retrieval using Synthetic Data","summary":" In the field of legal information retrieval, effective embedding-based models\nare essential for accurate question-answering systems. However, the scarcity of\nlarge annotated datasets poses a significant challenge, particularly for\nVietnamese legal texts. To address this issue, we propose a novel approach that\nleverages large language models to generate high-quality, diverse synthetic\nqueries for Vietnamese legal passages. This synthetic data is then used to\npre-train retrieval models, specifically bi-encoder and ColBERT, which are\nfurther fine-tuned using contrastive loss with mined hard negatives. 
Our\nexperiments demonstrate that these enhancements lead to strong improvement in\nretrieval accuracy, validating the effectiveness of synthetic data and\npre-training techniques in overcoming the limitations posed by the lack of\nlarge labeled datasets in the Vietnamese legal domain.\n","authors":["Son Pham Tien","Hieu Nguyen Doan","An Nguyen Dai","Sang Dinh Viet"],"pdf_url":"https://arxiv.org/pdf/2412.00657v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.00639v1","updated":"2024-12-01T01:36:41Z","published":"2024-12-01T01:36:41Z","title":"Needle: A Generative-AI Powered Monte Carlo Method for Answering Complex\n Natural Language Queries on Multi-modal Data","summary":" Multi-modal data, such as image data sets, often miss the detailed\ndescriptions that properly capture the rich information encoded in them. This\nmakes answering complex natural language queries a major challenge in these\ndomains. In particular, unlike the traditional nearest-neighbor search, where\nthe tuples and the query are modeled as points in a data cube, the query and\nthe tuples are of different natures, making the traditional query answering\nsolutions not directly applicable for such settings. Existing literature\naddresses this challenge for image data through vector representations jointly\ntrained on natural language and images. This technique, however, underperforms\nfor complex queries due to various reasons.\n This paper takes a step towards addressing this challenge by introducing a\nGenerative-AI (GenAI) powered Monte Carlo method that utilizes foundation\nmodels to generate synthetic samples that capture the complexity of the natural\nlanguage query and transform it to the same space of the multi-modal data.\nFollowing this method, we develop a system for image data retrieval and propose\npractical solutions that enable leveraging future advancements in GenAI and\nvector representations for improving our system's performance. 
Our\ncomprehensive experiments on various benchmark datasets verify that our system\nsignificantly outperforms state-of-the-art techniques.\n","authors":["Mahdi Erfanian","Mohsen Dehghankar","Abolfazl Asudeh"],"pdf_url":"https://arxiv.org/pdf/2412.00639v1.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2308.05037v3","updated":"2024-12-01T15:17:03Z","published":"2023-08-09T16:09:44Z","title":"Separate Anything You Describe","summary":" Language-queried audio source separation (LASS) is a new paradigm for\ncomputational auditory scene analysis (CASA). LASS aims to separate a target\nsound from an audio mixture given a natural language query, which provides a\nnatural and scalable interface for digital audio applications. Recent works on\nLASS, despite attaining promising separation performance on specific sources\n(e.g., musical instruments, limited classes of audio events), are unable to\nseparate audio concepts in the open domain. In this work, we introduce\nAudioSep, a foundation model for open-domain audio source separation with\nnatural language queries. We train AudioSep on large-scale multimodal datasets\nand extensively evaluate its capabilities on numerous tasks including audio\nevent separation, musical instrument separation, and speech enhancement.\nAudioSep demonstrates strong separation performance and impressive zero-shot\ngeneralization ability using audio captions or text labels as queries,\nsubstantially outperforming previous audio-queried and language-queried sound\nseparation models. For reproducibility of this work, we will release the source\ncode, evaluation benchmark and pre-trained model at:\nhttps://github.com/Audio-AGI/AudioSep.\n","authors":["Xubo Liu","Qiuqiang Kong","Yan Zhao","Haohe Liu","Yi Yuan","Yuzhuo Liu","Rui Xia","Yuxuan Wang","Mark D. 
Plumbley","Wenwu Wang"],"pdf_url":"https://arxiv.org/pdf/2308.05037v3.pdf","comment":"Code, benchmark and pre-trained models:\n https://github.com/Audio-AGI/AudioSep"},{"id":"http://arxiv.org/abs/2401.17133v2","updated":"2024-12-01T04:06:27Z","published":"2024-01-30T16:07:44Z","title":"SongBsAb: A Dual Prevention Approach against Singing Voice Conversion\n based Illegal Song Covers","summary":" Singing voice conversion (SVC) automates song covers by converting a source\nsinging voice from a source singer into a new singing voice with the same\nlyrics and melody as the source, but sounds like being covered by the target\nsinger of some given target singing voices. However, it raises serious concerns\nabout copyright and civil right infringements. We propose SongBsAb, the first\nproactive approach to tackle SVC-based illegal song covers. SongBsAb adds\nperturbations to singing voices before releasing them, so that when they are\nused, the process of SVC will be interfered, leading to unexpected singing\nvoices. Perturbations are carefully crafted to (1) provide a dual prevention,\ni.e., preventing the singing voice from being used as the source and target\nsinging voice in SVC, by proposing a gender-transformation loss and a high/low\nhierarchy multi-target loss, respectively; and (2) be harmless, i.e., no\nside-effect on the enjoyment of protected songs, by refining a psychoacoustic\nmodel-based loss with the backing track as an additional masker, a unique\naccompanying element for singing voices compared to ordinary speech voices. We\nalso adopt a frame-level interaction reduction-based loss and encoder ensemble\nto enhance the transferability of SongBsAb to unknown SVC models. 
We\ndemonstrate the prevention effectiveness, harmlessness, and robustness of\nSongBsAb on five diverse and promising SVC models, using both English and\nChinese datasets, and both objective and human study-based subjective metrics.\nOur work fosters an emerging research direction for mitigating illegal\nautomated song covers.\n","authors":["Guangke Chen","Yedi Zhang","Fu Song","Ting Wang","Xiaoning Du","Yang Liu"],"pdf_url":"https://arxiv.org/pdf/2401.17133v2.pdf","comment":"In Proceedings of the 32nd Network and Distributed System Security\n (NDSS) Symposium 2025"}]},"2024-11-30T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2406.05804v6","updated":"2024-11-30T22:38:57Z","published":"2024-06-09T14:42:55Z","title":"A Review of Prominent Paradigms for LLM-Based Agents: Tool Use\n (Including RAG), Planning, and Feedback Learning","summary":" Tool use, planning, and feedback learning are currently three prominent\nparadigms for developing Large Language Model (LLM)-based agents across various\ntasks. Although numerous frameworks have been devised for each paradigm, their\nintricate workflows and inconsistent taxonomy create challenges in\nunderstanding and reviewing the frameworks across different paradigms. This\nsurvey introduces a unified taxonomy to systematically review and discuss these\nframeworks. Specifically, 1) the taxonomy defines environments/tasks, common\nLLM-profiled roles or LMPRs (policy models, evaluators, and dynamic models),\nand universally applicable workflows found in prior work, and 2) it enables a\ncomparison of key perspectives on the implementations of LMPRs and workflow\ndesigns across different agent paradigms and frameworks. 3) Finally, we\nidentify three limitations in existing workflow designs and systematically\ndiscuss the future work. 
Resources have been made publicly available in our\nGitHub repository https://github.com/xinzhel/LLM-Agent-Survey.\n","authors":["Xinzhe Li"],"pdf_url":"https://arxiv.org/pdf/2406.05804v6.pdf","comment":"CoLing 2025 Camera Ready (extended to 9 pages)"},{"id":"http://arxiv.org/abs/2409.14165v3","updated":"2024-11-30T22:21:30Z","published":"2024-09-21T15:07:37Z","title":"A Survey on Large Language Model-empowered Autonomous Driving","summary":" Artificial intelligence (AI) plays a crucial role in autonomous driving (AD)\nresearch, propelling its development towards intelligence and efficiency.\nCurrently, the development of AD technology follows two main technical paths:\nmodularization and end-to-end. Modularization decomposes the driving task into\nmodules such as perception, prediction, planning, and control, and trains them\nseparately. Due to the inconsistency of training objectives between modules,\nthe integrated effect suffers from bias. End-to-end attempts to address this\nissue by utilizing a single model that directly maps from sensor data to\ncontrol signals. This path has limited learning capabilities in a comprehensive\nset of features and struggles to handle unpredictable long-tail events and\ncomplex urban traffic scenarios. In the face of challenges encountered in both\npaths, many researchers believe that large language models (LLMs) with powerful\nreasoning capabilities and extensive knowledge understanding may be the\nsolution, expecting LLMs to provide AD systems with deeper levels of\nunderstanding and decision-making capabilities. 
To\nunderstand if LLMs could enhance AD, this paper conducts a thorough analysis of\nthe potential applications of LLMs in AD systems, including exploring their\noptimization strategies in both modular and end-to-end approaches, with a\nparticular focus on how LLMs can tackle the problems and challenges present in\ncurrent solutions. Furthermore, we discuss an important question: Can LLM-based\nartificial general intelligence (AGI) be a key to achieve high-level AD? We\nfurther analyze the potential limitations and challenges that LLMs may\nencounter in promoting the development of AD technology.\n","authors":["Yuxuan Zhu","Shiyi Wang","Wenqing Zhong","Nianchen Shen","Yunqi Li","Siqi Wang","Zhiheng Li","Cathy Wu","Zhengbing He","Li Li"],"pdf_url":"https://arxiv.org/pdf/2409.14165v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.11796v3","updated":"2024-11-30T22:01:07Z","published":"2024-08-21T17:38:48Z","title":"LLM Pruning and Distillation in Practice: The Minitron Approach","summary":" We present a comprehensive report on compressing the Llama 3.1 8B and Mistral\nNeMo 12B models to 4B and 8B parameters, respectively, using pruning and\ndistillation. We explore two distinct pruning strategies: (1) depth pruning and\n(2) joint hidden/attention/MLP (width) pruning, and evaluate the results on\ncommon benchmarks from the LM Evaluation Harness. The models are then aligned\nwith NeMo Aligner and tested in instruct-tuned versions. This approach produces\na compelling 4B model from Llama 3.1 8B and a state-of-the-art\nMistral-NeMo-Minitron-8B (MN-Minitron-8B for brevity) model from Mistral NeMo\n12B. We found that with no access to the original data, it is beneficial to\nslightly fine-tune teacher models on the distillation dataset. 
We open-source\nour base model weights on Hugging Face with a permissive license.\n","authors":["Sharath Turuvekere Sreenivas","Saurav Muralidharan","Raviraj Joshi","Marcin Chochowski","Mostofa Patwary","Pavlo Molchanov","Mohammad Shoeybi","Jan Kautz","Ameya Sunil Mahabaleshwarkar","Gerald Shen","Jiaqi Zeng","Oleksii Kuchaiev","Zijia Chen","Yoshi Suhara","Shizhe Diao","Chenhan Yu","Wei-Chun Chen","Hayley Ross","Daniel Korzekwa","Oluwatobi Olabiyi","Ashwath Aithal","Bryan Catanzaro"],"pdf_url":"https://arxiv.org/pdf/2408.11796v3.pdf","comment":"v3: Update author list, other changes"},{"id":"http://arxiv.org/abs/2406.02532v3","updated":"2024-11-30T21:33:59Z","published":"2024-06-04T17:53:36Z","title":"SpecExec: Massively Parallel Speculative Decoding for Interactive LLM\n Inference on Consumer Devices","summary":" As large language models gain widespread adoption, running them efficiently\nbecomes crucial. Recent works on LLM inference use speculative decoding to\nachieve extreme speedups. However, most of these works implicitly design their\nalgorithms for high-end datacenter hardware. In this work, we ask the opposite\nquestion: how fast can we run LLMs on consumer machines? Consumer GPUs can no\nlonger fit the largest available models (50B+ parameters) and must offload them\nto RAM or SSD. When running with offloaded parameters, the inference engine can\nprocess batches of hundreds or thousands of tokens at the same time as just one\ntoken, making it a natural fit for speculative decoding. We propose SpecExec\n(Speculative Execution), a simple parallel decoding method that can generate up\nto 20 tokens per target model iteration for popular LLM families. It utilizes\nthe high spikiness of the token probabilities distribution in modern LLMs and a\nhigh degree of alignment between model output probabilities. 
SpecExec takes the\nmost probable tokens continuation from the draft model to build a \"cache\" tree\nfor the target model, which then gets validated in a single pass. Using\nSpecExec, we demonstrate inference of 50B+ parameter LLMs on consumer GPUs with\nRAM offloading at 4-6 tokens per second with 4-bit quantization or 2-3 tokens\nper second with 16-bit weights.\n","authors":["Ruslan Svirschevski","Avner May","Zhuoming Chen","Beidi Chen","Zhihao Jia","Max Ryabinin"],"pdf_url":"https://arxiv.org/pdf/2406.02532v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.10492v2","updated":"2024-11-30T21:27:19Z","published":"2024-06-15T04:09:31Z","title":"Large Language Models as Interpolated and Extrapolated Event Predictors","summary":" Salient facts of sociopolitical events are distilled into quadruples\nfollowing a format of subject, relation, object, and timestamp. Machine\nlearning methods, such as graph neural networks (GNNs) and recurrent neural\nnetworks (RNNs), have been built to make predictions and infer relations on the\nquadruple-based knowledge graphs (KGs). In many applications, quadruples are\nextended to quintuples with auxiliary attributes such as text summaries that\ndescribe the quadruple events. In this paper, we comprehensively investigate\nhow large language models (LLMs) streamline the design of event prediction\nframeworks using quadruple-based or quintuple-based data while maintaining\ncompetitive accuracy. We propose LEAP, a unified framework that leverages large\nlanguage models as event predictors. Specifically, we develop multiple prompt\ntemplates to frame the object prediction (OP) task as a standard\nquestion-answering (QA) task, suitable for instruction fine-tuning with an\nencoder-decoder LLM. For multi-event forecasting (MEF) task, we design a simple\nyet effective prompt template for each event quintuple. 
This novel approach\nremoves the need for GNNs and RNNs, instead utilizing an encoder-only LLM to\ngenerate fixed intermediate embeddings, which are processed by a customized\ndownstream head with a self-attention mechanism to predict potential relation\noccurrences in the future. Extensive experiments on multiple real-world\ndatasets using various evaluation metrics validate the effectiveness of our\napproach.\n","authors":["Libo Zhang","Yue Ning"],"pdf_url":"https://arxiv.org/pdf/2406.10492v2.pdf","comment":"11 pages, 3 figures, 10 tables"},{"id":"http://arxiv.org/abs/2407.16607v4","updated":"2024-11-30T19:31:38Z","published":"2024-07-23T16:13:22Z","title":"Data Mixture Inference: What do BPE Tokenizers Reveal about their\n Training Data?","summary":" The pretraining data of today's strongest language models is opaque; in\nparticular, little is known about the proportions of various domains or\nlanguages represented. In this work, we tackle a task which we call data\nmixture inference, which aims to uncover the distributional make-up of training\ndata. We introduce a novel attack based on a previously overlooked source of\ninformation: byte-pair encoding (BPE) tokenizers, used by the vast majority of\nmodern language models. Our key insight is that the ordered list of merge rules\nlearned by a BPE tokenizer naturally reveals information about the token\nfrequencies in its training data. Given a tokenizer's merge list along with\nexample data for each category of interest, we formulate a linear program that\nsolves for the proportion of each category in the tokenizer's training set. In\ncontrolled experiments, we show that our attack recovers mixture ratios with\nhigh precision for tokenizers trained on known mixtures of natural languages,\nprogramming languages, and data sources. We then apply our approach to\noff-the-shelf tokenizers released with recent LMs. 
We confirm much publicly\ndisclosed information about these models, and also make several new inferences:\nGPT-4o and Mistral NeMo's tokenizers are much more multilingual than their\npredecessors, training on 39% and 47% non-English language data, respectively;\nLlama 3 extends GPT-3.5's tokenizer primarily for multilingual (48%) use;\nGPT-3.5's and Claude's tokenizers are trained on predominantly code (~60%). We\nhope our work sheds light on current design practices for pretraining data, and\ninspires continued research into data mixture inference for LMs.\n","authors":["Jonathan Hayase","Alisa Liu","Yejin Choi","Sewoong Oh","Noah A. Smith"],"pdf_url":"https://arxiv.org/pdf/2407.16607v4.pdf","comment":"NeurIPS camera-ready, code at\n https://github.com/alisawuffles/tokenizer-attack"},{"id":"http://arxiv.org/abs/2309.00378v5","updated":"2024-11-30T19:08:48Z","published":"2023-09-01T10:27:04Z","title":"Long-Term Ad Memorability: Understanding & Generating Memorable Ads","summary":" Despite the importance of long-term memory in marketing and brand building,\nuntil now, there has been no large-scale study on the memorability of ads. All\nprevious memorability studies have been conducted on short-term recall on\nspecific content types like action videos. On the other hand, long-term\nmemorability is crucial for the advertising industry, and ads are almost always\nhighly multimodal. Therefore, we release the first memorability dataset,\nLAMBDA, consisting of 1749 participants and 2205 ads covering 276 brands.\nRunning statistical tests over different participant subpopulations and ad\ntypes, we find many interesting insights into what makes an ad memorable, e.g.,\nfast-moving ads are more memorable than those with slower scenes; people who\nuse ad-blockers remember a lower number of ads than those who don't. Next, we\npresent a model, Henry, to predict the memorability of a content. 
Henry\nachieves state-of-the-art performance across all prominent literature\nmemorability datasets. It shows strong generalization performance with better\nresults in 0-shot on unseen datasets. Finally, with the intent of memorable ad\ngeneration, we present a scalable method to build a high-quality memorable ad\ngeneration model by leveraging automatically annotated data. Our approach, SEED\n(Self rEwarding mEmorability Modeling), starts with a language model trained on\nLAMBDA as seed data and progressively trains an LLM to generate more memorable\nads. We show that the generated advertisements have 44% higher memorability\nscores than the original ads. We release this large-scale ad dataset,\nUltraLAMBDA, consisting of 5 million ads. Our code and the datasets, LAMBDA and\nUltraLAMBDA, are open-sourced at\nhttps://behavior-in-the-wild.github.io/memorability.\n","authors":["Harini SI","Somesh Singh","Yaman K Singla","Aanisha Bhattacharyya","Veeky Baths","Changyou Chen","Rajiv Ratn Shah","Balaji Krishnamurthy"],"pdf_url":"https://arxiv.org/pdf/2309.00378v5.pdf","comment":"Published in WACV-2025"},{"id":"http://arxiv.org/abs/2410.05168v3","updated":"2024-11-30T16:10:21Z","published":"2024-10-07T16:25:39Z","title":"ReasoningRank: Teaching Student Models to Rank through Reasoning-Based\n Knowledge Distillation","summary":" Reranking documents based on their relevance to a given query is a critical\ntask in information retrieval. Traditional reranking methods often lack\ntransparency and rely on proprietary models, hindering reproducibility and\ninterpretability. We propose Reason-to-Rank (R2R), a novel open-source\nreranking approach that enhances transparency by generating two types of\nreasoning: direct relevance reasoning, which explains how a document addresses\nthe query, and comparison reasoning, which justifies the relevance of one\ndocument over another. 
We leverage large language models (LLMs) as teacher\nmodels to generate these explanations and distill this knowledge into smaller,\nopenly available student models. Our student models are trained to generate\nmeaningful reasoning and rerank documents, achieving competitive performance\nacross multiple datasets, including MSMARCO and BRIGHT. Experiments demonstrate\nthat R2R not only improves reranking accuracy but also provides valuable\ninsights into the decision-making process. By offering a structured and\ninterpretable solution with openly accessible resources, R2R aims to bridge the\ngap between effectiveness and transparency in information retrieval, fostering\nreproducibility and further research in the field.\n","authors":["Yuelyu Ji","Zhuochun Li","Rui Meng","Daqing He"],"pdf_url":"https://arxiv.org/pdf/2410.05168v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02224v3","updated":"2024-11-30T14:27:59Z","published":"2024-06-04T11:36:09Z","title":"FedMKT: Federated Mutual Knowledge Transfer for Large and Small Language\n Models","summary":" Recent research in federated large language models (LLMs) has primarily\nfocused on enabling clients to fine-tune their locally deployed homogeneous\nLLMs collaboratively or on transferring knowledge from server-based LLMs to\nsmall language models (SLMs) at downstream clients. However, a significant gap\nremains in the simultaneous mutual enhancement of both the server's LLM and\nclients' SLMs. To bridge this gap, we propose FedMKT, a parameter-efficient\nfederated mutual knowledge transfer framework for large and small language\nmodels. This framework is designed to adaptively transfer knowledge from the\nserver's LLM to clients' SLMs while concurrently enriching the LLM with\nclients' unique domain insights. 
We facilitate token alignment using minimum\nedit distance (MinED) and then selective mutual knowledge transfer between\nclient-side SLMs and a server-side LLM, aiming to collectively enhance their\nperformance. Through extensive experiments across three distinct scenarios, we\nevaluate the effectiveness of FedMKT using various public LLMs and SLMs on a\nrange of NLP text generation tasks. Empirical results demonstrate that FedMKT\nsimultaneously boosts the performance of both LLMs and SLMs.\n","authors":["Tao Fan","Guoqiang Ma","Yan Kang","Hanlin Gu","Yuanfeng Song","Lixin Fan","Kai Chen","Qiang Yang"],"pdf_url":"https://arxiv.org/pdf/2406.02224v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.00177v3","updated":"2024-11-30T14:01:56Z","published":"2024-10-31T19:48:12Z","title":"LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property\n Prediction","summary":" Large language models (LLMs) are increasingly being used in materials\nscience. However, little attention has been given to benchmarking and\nstandardized evaluation for LLM-based materials property prediction, which\nhinders progress. We present LLM4Mat-Bench, the largest benchmark to date for\nevaluating the performance of LLMs in predicting the properties of crystalline\nmaterials. LLM4Mat-Bench contains about 1.9M crystal structures in total,\ncollected from 10 publicly available materials data sources, and 45 distinct\nproperties. LLM4Mat-Bench features different input modalities: crystal\ncomposition, CIF, and crystal text description, with 4.7M, 615.5M, and 3.1B\ntokens in total for each modality, respectively. We use LLM4Mat-Bench to\nfine-tune models with different sizes, including LLM-Prop and MatBERT, and\nprovide zero-shot and few-shot prompts to evaluate the property prediction\ncapabilities of LLM-chat-like models, including Llama, Gemma, and Mistral. 
The\nresults highlight the challenges of general-purpose LLMs in materials science\nand the need for task-specific predictive models and task-specific\ninstruction-tuned LLMs in materials property prediction.\n","authors":["Andre Niyongabo Rubungo","Kangming Li","Jason Hattrick-Simpers","Adji Bousso Dieng"],"pdf_url":"https://arxiv.org/pdf/2411.00177v3.pdf","comment":"Accepted at NeurIPS 2024-AI4Mat Workshop. The Benchmark and code can\n be found at https://github.com/vertaix/LLM4Mat-Bench"},{"id":"http://arxiv.org/abs/2410.02953v3","updated":"2024-11-30T12:16:51Z","published":"2024-10-03T19:53:47Z","title":"Unlocking Structured Thinking in Language Models with Cognitive\n Prompting","summary":" We propose cognitive prompting as a novel approach to guide problem-solving\nin large language models (LLMs) through structured, human-like cognitive\noperations, such as goal clarification, decomposition, filtering, abstraction,\nand pattern recognition. By employing systematic, step-by-step reasoning,\ncognitive prompting enables LLMs to tackle complex, multi-step tasks more\nefficiently. We introduce three variants: a deterministic sequence of cognitive\noperations, a self-adaptive variant in which the LLM dynamically selects the\nsequence of cognitive operations, and a hybrid variant that uses generated\ncorrect solutions as few-shot chain-of-thought prompts. 
Experiments with LLaMA,\nGemma~2, and Qwen models in each two sizes on the arithmetic reasoning\nbenchmark GSM8K demonstrate that cognitive prompting significantly improves\nperformance compared to standard question answering.\n","authors":["Oliver Kramer","Jill Baumann"],"pdf_url":"https://arxiv.org/pdf/2410.02953v3.pdf","comment":"6 pages, submitted to ESANN 2025"},{"id":"http://arxiv.org/abs/2410.03845v2","updated":"2024-11-30T11:19:39Z","published":"2024-10-04T18:22:58Z","title":"ORAssistant: A Custom RAG-based Conversational Assistant for OpenROAD","summary":" Open-source Electronic Design Automation (EDA) tools are rapidly transforming\nchip design by addressing key barriers of commercial EDA tools such as\ncomplexity, costs, and access. Recent advancements in Large Language Models\n(LLMs) have further enhanced efficiency in chip design by providing user\nassistance across a range of tasks like setup, decision-making, and flow\nautomation. This paper introduces ORAssistant, a conversational assistant for\nOpenROAD, based on Retrieval-Augmented Generation (RAG). ORAssistant aims to\nimprove the user experience for the OpenROAD flow, from RTL-GDSII by providing\ncontext-specific responses to common user queries, including installation,\ncommand usage, flow setup, and execution, in prose format. Currently,\nORAssistant integrates OpenROAD, OpenROAD-flow-scripts, Yosys, OpenSTA, and\nKLayout. The data model is built from publicly available documentation and\nGitHub resources. The proposed architecture is scalable, supporting extensions\nto other open-source tools, operating modes, and LLM models. We use Google\nGemini as the base LLM model to build and test ORAssistant. 
Early evaluation\nresults of the RAG-based model show notable improvements in performance and\naccuracy compared to non-fine-tuned LLMs.\n","authors":["Aviral Kaintura","Palaniappan R","Shui Song Luar","Indira Iyer Almeida"],"pdf_url":"https://arxiv.org/pdf/2410.03845v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.08074v3","updated":"2024-11-30T10:48:21Z","published":"2024-06-12T10:48:53Z","title":"A Concept-Based Explainability Framework for Large Multimodal Models","summary":" Large multimodal models (LMMs) combine unimodal encoders and large language\nmodels (LLMs) to perform multimodal tasks. Despite recent advancements towards\nthe interpretability of these models, understanding internal representations of\nLMMs remains largely a mystery. In this paper, we present a novel framework for\nthe interpretation of LMMs. We propose a dictionary learning based approach,\napplied to the representation of tokens. The elements of the learned dictionary\ncorrespond to our proposed concepts. We show that these concepts are well\nsemantically grounded in both vision and text. Thus we refer to these as\n``multi-modal concepts''. We qualitatively and quantitatively evaluate the\nresults of the learnt concepts. We show that the extracted multimodal concepts\nare useful to interpret representations of test samples. Finally, we evaluate\nthe disentanglement between different concepts and the quality of grounding\nconcepts visually and textually. 
Our code is publicly available at\nhttps://github.com/mshukor/xl-vlms\n","authors":["Jayneel Parekh","Pegah Khayatan","Mustafa Shukor","Alasdair Newson","Matthieu Cord"],"pdf_url":"https://arxiv.org/pdf/2406.08074v3.pdf","comment":"NeurIPS 2024"},{"id":"http://arxiv.org/abs/2409.15380v2","updated":"2024-11-30T09:57:09Z","published":"2024-09-20T15:01:21Z","title":"Kalahi: A handcrafted, grassroots cultural LLM evaluation suite for\n Filipino","summary":" Multilingual large language models (LLMs) today may not necessarily provide\nculturally appropriate and relevant responses to its Filipino users. We\nintroduce Kalahi, a cultural LLM evaluation suite collaboratively created by\nnative Filipino speakers. It is composed of 150 high-quality, handcrafted and\nnuanced prompts that test LLMs for generations that are relevant to shared\nFilipino cultural knowledge and values. Strong LLM performance in Kalahi\nindicates a model's ability to generate responses similar to what an average\nFilipino would say or do in a given situation. We conducted experiments on LLMs\nwith multilingual and Filipino language support. Results show that Kalahi,\nwhile trivial for Filipinos, is challenging for LLMs, with the best model\nanswering only 46.0% of the questions correctly compared to native Filipino\nperformance of 89.10%. 
Thus, Kalahi can be used to accurately and reliably\nevaluate Filipino cultural representation in LLMs.\n","authors":["Jann Railey Montalan","Jian Gang Ngui","Wei Qi Leong","Yosephine Susanto","Hamsawardhini Rengarajan","Alham Fikri Aji","William Chandra Tjhi"],"pdf_url":"https://arxiv.org/pdf/2409.15380v2.pdf","comment":"Accepted for presentation at Paclic 38, 2024"},{"id":"http://arxiv.org/abs/2404.12038v5","updated":"2024-11-30T08:52:29Z","published":"2024-04-18T09:46:25Z","title":"Uncovering Safety Risks of Large Language Models through Concept\n Activation Vector","summary":" Despite careful safety alignment, current large language models (LLMs) remain\nvulnerable to various attacks. To further unveil the safety risks of LLMs, we\nintroduce a Safety Concept Activation Vector (SCAV) framework, which\neffectively guides the attacks by accurately interpreting LLMs' safety\nmechanisms. We then develop an SCAV-guided attack method that can generate both\nattack prompts and embedding-level attacks with automatically selected\nperturbation hyperparameters. Both automatic and human evaluations demonstrate\nthat our attack method significantly improves the attack success rate and\nresponse quality while requiring less training data. Additionally, we find that\nour generated attack prompts may be transferable to GPT-4, and the\nembedding-level attacks may also be transferred to other white-box LLMs whose\nparameters are known. Our experiments further uncover the safety risks present\nin current LLMs. For example, in our evaluation of seven open-source LLMs, we\nobserve an average attack success rate of 99.14%, based on the classic\nkeyword-matching criterion. Finally, we provide insights into the safety\nmechanism of LLMs. 
The code is available at\nhttps://github.com/SproutNan/AI-Safety_SCAV.\n","authors":["Zhihao Xu","Ruixuan Huang","Changyu Chen","Xiting Wang"],"pdf_url":"https://arxiv.org/pdf/2404.12038v5.pdf","comment":"10 pages, accepted at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2411.14811v2","updated":"2024-11-30T08:47:23Z","published":"2024-11-22T09:12:02Z","title":"Fine-Grained Alignment in Vision-and-Language Navigation through\n Bayesian Optimization","summary":" This paper addresses the challenge of fine-grained alignment in\nVision-and-Language Navigation (VLN) tasks, where robots navigate realistic 3D\nenvironments based on natural language instructions. Current approaches use\ncontrastive learning to align language with visual trajectory sequences.\nNevertheless, they encounter difficulties with fine-grained vision negatives.\nTo enhance cross-modal embeddings, we introduce a novel Bayesian\nOptimization-based adversarial optimization framework for creating fine-grained\ncontrastive vision samples. To validate the proposed methodology, we conduct a\nseries of experiments to assess the effectiveness of the enriched embeddings on\nfine-grained vision negatives. We conduct experiments on two common VLN\nbenchmarks R2R and REVERIE, experiments on the them demonstrate that these\nembeddings benefit navigation, and can lead to a promising performance\nenhancement. 
Our source code and trained models are available at:\nhttps://anonymous.4open.science/r/FGVLN.\n","authors":["Yuhang Song","Mario Gianni","Chenguang Yang","Kunyang Lin","Te-Chuan Chiu","Anh Nguyen","Chun-Yi Lee"],"pdf_url":"https://arxiv.org/pdf/2411.14811v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.06832v3","updated":"2024-11-30T04:53:04Z","published":"2024-03-11T15:48:43Z","title":"Noise-powered Multi-modal Knowledge Graph Representation Framework","summary":" The rise of Multi-modal Pre-training highlights the necessity for a unified\nMulti-Modal Knowledge Graph (MMKG) representation learning framework. Such a\nframework is essential for embedding structured knowledge into multi-modal\nLarge Language Models effectively, alleviating issues like knowledge\nmisconceptions and multi-modal hallucinations. In this work, we explore the\nefficacy of models in accurately embedding entities within MMKGs through two\npivotal tasks: Multi-modal Knowledge Graph Completion (MKGC) and Multi-modal\nEntity Alignment (MMEA). Building on this foundation, we propose a novel SNAG\nmethod that utilizes a Transformer-based architecture equipped with\nmodality-level noise masking to robustly integrate multi-modal entity features\nin KGs. By incorporating specific training objectives for both MKGC and MMEA,\nour approach achieves SOTA performance across a total of ten datasets,\ndemonstrating its versatility. Moreover, SNAG can not only function as a\nstandalone model but also enhance other existing methods, providing stable\nperformance improvements. Code and data are available at\nhttps://github.com/zjukg/SNAG.\n","authors":["Zhuo Chen","Yin Fang","Yichi Zhang","Lingbing Guo","Jiaoyan Che","Jeff Z. 
Pan","Huajun Chen","Wen Zhang"],"pdf_url":"https://arxiv.org/pdf/2403.06832v3.pdf","comment":"COLING 2025 Accpeted, Repo is available at\n https://github.com/zjukg/SNAG"},{"id":"http://arxiv.org/abs/2402.14776v3","updated":"2024-11-30T04:29:53Z","published":"2024-02-22T18:35:05Z","title":"2D Matryoshka Sentence Embeddings","summary":" Common approaches rely on fixed-length embedding vectors from language models\nas sentence embeddings for downstream tasks such as semantic textual similarity\n(STS). Such methods are limited in their flexibility due to unknown\ncomputational constraints and budgets across various applications. Matryoshka\nRepresentation Learning (MRL) \\cite{aditya2022matryoshka} encodes information\nat finer granularities, i.e., with lower embedding dimensions, to adaptively\naccommodate \\emph{ad hoc} tasks. Similar accuracy can be achieved with a\nsmaller embedding size, leading to speedups in downstream tasks. Despite its\nimproved efficiency, MRL still requires traversing all Transformer layers\nbefore obtaining the embedding, which remains the dominant factor in time and\nmemory consumption. This prompts consideration of whether the fixed number of\nTransformer layers affects representation quality and whether using\nintermediate layers for sentence representation is feasible. In this paper, we\nintroduce a novel sentence embedding model called \\textit{Two-dimensional\nMatryoshka Sentence Embedding} (2DMSE)\\footnote{Our code is available at\n\\url{https://github.com/SeanLee97/AnglE/blob/main/README_2DMSE.md}.}. It\nsupports elastic settings for both embedding sizes and Transformer layers,\noffering greater flexibility and efficiency than MRL. We conduct extensive\nexperiments on STS tasks and downstream applications. 
The experimental results\ndemonstrate the effectiveness of our proposed model in dynamically supporting\ndifferent embedding sizes and Transformer layers, allowing it to be highly\nadaptable to various scenarios.\n","authors":["Xianming Li","Zongxi Li","Jing Li","Haoran Xie","Qing Li"],"pdf_url":"https://arxiv.org/pdf/2402.14776v3.pdf","comment":"Decoupled with ESE"},{"id":"http://arxiv.org/abs/2310.18964v3","updated":"2024-11-30T02:56:48Z","published":"2023-10-29T10:07:32Z","title":"LLMs and Finetuning: Benchmarking cross-domain performance for hate\n speech detection","summary":" In the evolving landscape of online communication, hate speech detection\nremains a formidable challenge, further compounded by the diversity of digital\nplatforms. This study investigates the effectiveness and adaptability of\npre-trained and fine-tuned Large Language Models (LLMs) in identifying hate\nspeech, to address two central questions: (1) To what extent does the model\nperformance depend on the fine-tuning and training parameters?, (2) To what\nextent do models generalize to cross-domain hate speech detection? and (3) What\nare the specific features of the datasets or models that influence the\ngeneralization potential? The experiment shows that LLMs offer a huge advantage\nover the state-of-the-art even without pretraining. Ordinary least squares\nanalyses suggest that the advantage of training with fine-grained hate speech\nlabels is washed away with the increase in dataset size. 
We conclude with a\nvision for the future of hate speech detection, emphasizing cross-domain\ngeneralizability and appropriate benchmarking practices.\n","authors":["Ahmad Nasir","Aadish Sharma","Kokil Jaidka"],"pdf_url":"https://arxiv.org/pdf/2310.18964v3.pdf","comment":"10 pages, 3 figures, 5 tables"},{"id":"http://arxiv.org/abs/2411.04330v2","updated":"2024-11-30T02:42:31Z","published":"2024-11-07T00:10:10Z","title":"Scaling Laws for Precision","summary":" Low precision training and inference affect both the quality and cost of\nlanguage models, but current scaling laws do not account for this. In this\nwork, we devise \"precision-aware\" scaling laws for both training and inference.\nWe propose that training in lower precision reduces the model's \"effective\nparameter count,\" allowing us to predict the additional loss incurred from\ntraining in low precision and post-train quantization. For inference, we find\nthat the degradation introduced by post-training quantization increases as\nmodels are trained on more data, eventually making additional pretraining data\nactively harmful. For training, our scaling laws allow us to predict the loss\nof a model with different parts in different precisions, and suggest that\ntraining larger models in lower precision may be compute optimal. We unify the\nscaling laws for post and pretraining quantization to arrive at a single\nfunctional form that predicts degradation from training and inference in varied\nprecisions. We fit on over 465 pretraining runs and validate our predictions on\nmodel sizes up to 1.7B parameters trained on up to 26B tokens.\n","authors":["Tanishq Kumar","Zachary Ankner","Benjamin F. 
Spector","Blake Bordelon","Niklas Muennighoff","Mansheej Paul","Cengiz Pehlevan","Christopher Ré","Aditi Raghunathan"],"pdf_url":"https://arxiv.org/pdf/2411.04330v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.11579v3","updated":"2024-11-30T02:29:59Z","published":"2024-09-17T22:06:46Z","title":"HEARTS: A Holistic Framework for Explainable, Sustainable and Robust\n Text Stereotype Detection","summary":" Stereotypes are generalised assumptions about societal groups, and even\nstate-of-the-art LLMs using in-context learning struggle to identify them\naccurately. Due to the subjective nature of stereotypes, where what constitutes\na stereotype can vary widely depending on cultural, social, and individual\nperspectives, robust explainability is crucial. Explainable models ensure that\nthese nuanced judgments can be understood and validated by human users,\npromoting trust and accountability. We address these challenges by introducing\nHEARTS (Holistic Framework for Explainable, Sustainable, and Robust Text\nStereotype Detection), a framework that enhances model performance, minimises\ncarbon footprint, and provides transparent, interpretable explanations. We\nestablish the Expanded Multi-Grain Stereotype Dataset (EMGSD), comprising\n57,201 labelled texts across six groups, including under-represented\ndemographics like LGBTQ+ and regional stereotypes. Ablation studies confirm\nthat BERT models fine-tuned on EMGSD outperform those trained on individual\ncomponents. 
We then analyse a fine-tuned, carbon-efficient ALBERT-V2 model\nusing SHAP to generate token-level importance values, ensuring alignment with\nhuman understanding, and calculate explainability confidence scores by\ncomparing SHAP and LIME outputs...\n","authors":["Theo King","Zekun Wu","Adriano Koshiyama","Emre Kazim","Philip Treleaven"],"pdf_url":"https://arxiv.org/pdf/2409.11579v3.pdf","comment":"NeurIPS 2024 SoLaR Workshop and NeurIPS 2024 Safety Gen AI Workshop"},{"id":"http://arxiv.org/abs/2409.11353v3","updated":"2024-11-30T02:27:09Z","published":"2024-09-17T16:55:25Z","title":"THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation\n in Large Language Models","summary":" Hallucination, the generation of factually incorrect content, is a growing\nchallenge in Large Language Models (LLMs). Existing detection and mitigation\nmethods are often isolated and insufficient for domain-specific needs, lacking\na standardized pipeline. This paper introduces THaMES (Tool for Hallucination\nMitigations and EvaluationS), an integrated framework and library addressing\nthis gap. THaMES offers an end-to-end solution for evaluating and mitigating\nhallucinations in LLMs, featuring automated test set generation, multifaceted\nbenchmarking, and adaptable mitigation strategies. It automates test set\ncreation from any corpus, ensuring high data quality, diversity, and\ncost-efficiency through techniques like batch processing, weighted sampling,\nand counterfactual validation. THaMES assesses a model's ability to detect and\nreduce hallucinations across various tasks, including text generation and\nbinary classification, applying optimal mitigation strategies like In-Context\nLearning (ICL), Retrieval Augmented Generation (RAG), and Parameter-Efficient\nFine-tuning (PEFT). 
Evaluations of state-of-the-art LLMs using a knowledge base\nof academic papers, political news, and Wikipedia reveal that commercial models\nlike GPT-4o benefit more from RAG than ICL, while open-weight models like\nLlama-3.1-8B-Instruct and Mistral-Nemo gain more from ICL. Additionally, PEFT\nsignificantly enhances the performance of Llama-3.1-8B-Instruct in both\nevaluation tasks.\n","authors":["Mengfei Liang","Archish Arun","Zekun Wu","Cristian Munoz","Jonathan Lutch","Emre Kazim","Adriano Koshiyama","Philip Treleaven"],"pdf_url":"https://arxiv.org/pdf/2409.11353v3.pdf","comment":"NeurIPS 2024 SoLaR (Socially Responsible Language Modelling Research\n ) Workshop"},{"id":"http://arxiv.org/abs/2409.11149v4","updated":"2024-11-30T02:21:25Z","published":"2024-09-17T13:03:12Z","title":"SAGED: A Holistic Bias-Benchmarking Pipeline for Language Models with\n Customisable Fairness Calibration","summary":" The development of unbiased large language models is widely recognized as\ncrucial, yet existing benchmarks fall short in detecting biases due to limited\nscope, contamination, and lack of a fairness baseline. SAGED(-Bias) is the\nfirst holistic benchmarking pipeline to address these problems. The pipeline\nencompasses five core stages: scraping materials, assembling benchmarks,\ngenerating responses, extracting numeric features, and diagnosing with\ndisparity metrics. SAGED includes metrics for max disparity, such as impact\nratio, and bias concentration, such as Max Z-scores. Noticing that assessment\ntool bias and contextual bias in prompts can distort evaluation, SAGED\nimplements counterfactual branching and baseline calibration for mitigation.\nFor demonstration, we use SAGED on G20 Countries with popular 8b-level models\nincluding Gemma2, Llama3.1, Mistral, and Qwen2. 
With sentiment analysis, we\nfind that while Mistral and Qwen2 show lower max disparity and higher bias\nconcentration than Gemma2 and Llama3.1, all models are notably biased against\ncountries like Russia and (except for Qwen2) China. With further experiments to\nhave models role-playing U.S. (vice-/former-) presidents, we see bias amplifies\nand shifts in heterogeneous directions. Moreover, we see Qwen2 and Mistral not\nengage in role-playing, while Llama3.1 and Gemma2 role-play Trump notably more\nintensively than Biden and Harris, indicating role-playing performance bias in\nthese models.\n","authors":["Xin Guan","Nathaniel Demchak","Saloni Gupta","Ze Wang","Ediz Ertekin Jr.","Adriano Koshiyama","Emre Kazim","Zekun Wu"],"pdf_url":"https://arxiv.org/pdf/2409.11149v4.pdf","comment":"COLING 2025 Main Conference"},{"id":"http://arxiv.org/abs/2410.01169v2","updated":"2024-11-30T01:04:31Z","published":"2024-10-02T01:54:46Z","title":"GADFA: Generator-Assisted Decision-Focused Approach for Opinion\n Expressing Timing Identification","summary":" The advancement of text generation models has granted us the capability to\nproduce coherent and convincing text on demand. Yet, in real-life\ncircumstances, individuals do not continuously generate text or voice their\nopinions. For instance, consumers pen product reviews after weighing the merits\nand demerits of a product, and professional analysts issue reports following\nsignificant news releases. In essence, opinion expression is typically prompted\nby particular reasons or signals. Despite long-standing developments in opinion\nmining, the appropriate timing for expressing an opinion remains largely\nunexplored. To address this deficit, our study introduces an innovative task -\nthe identification of news-triggered opinion expressing timing. We ground this\ntask in the actions of professional stock analysts and develop a novel dataset\nfor investigation. 
Our approach is decision-focused, leveraging text generation\nmodels to steer the classification model, thus enhancing overall performance.\nOur experimental findings demonstrate that the text generated by our model\ncontributes fresh insights from various angles, effectively aiding in\nidentifying the optimal timing for opinion expression.\n","authors":["Chung-Chi Chen","Hiroya Takamura","Ichiro Kobayashi","Yusuke Miyao","Hsin-Hsi Chen"],"pdf_url":"https://arxiv.org/pdf/2410.01169v2.pdf","comment":"Accepted: COLING-2025"},{"id":"http://arxiv.org/abs/2411.00024v3","updated":"2024-11-30T00:17:01Z","published":"2024-10-28T22:30:06Z","title":"A Perspective for Adapting Generalist AI to Specialized Medical AI\n Applications and Their Challenges","summary":" The integration of Large Language Models (LLMs) into medical applications has\nsparked widespread interest across the healthcare industry, from drug discovery\nand development to clinical decision support, assisting telemedicine, medical\ndevices, and healthcare insurance applications. This perspective paper aims to\ndiscuss the inner workings of building LLM-powered medical AI applications and\nintroduces a comprehensive framework for their development. We review existing\nliterature and outline the unique challenges of applying LLMs in specialized\nmedical contexts. Additionally, we introduce a three-step framework to organize\nmedical LLM research activities: 1) Modeling: breaking down complex medical\nworkflows into manageable steps for developing medical-specific models; 2)\nOptimization: optimizing the model performance with crafted prompts and\nintegrating external knowledge and tools, and 3) System engineering:\ndecomposing complex tasks into subtasks and leveraging human expertise for\nbuilding medical AI applications. 
Furthermore, we offer a detailed use case\nplaybook that describes various LLM-powered medical AI applications, such as\noptimizing clinical trial design, enhancing clinical decision support, and\nadvancing medical imaging analysis. Finally, we discuss various challenges and\nconsiderations for building medical AI applications with LLMs, such as handling\nhallucination issues, data ownership and compliance, privacy, intellectual\nproperty considerations, compute cost, sustainability issues, and responsible\nAI requirements.\n","authors":["Zifeng Wang","Hanyin Wang","Benjamin Danek","Ying Li","Christina Mack","Hoifung Poon","Yajuan Wang","Pranav Rajpurkar","Jimeng Sun"],"pdf_url":"https://arxiv.org/pdf/2411.00024v3.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2302.11370v6","updated":"2024-11-30T21:47:39Z","published":"2023-02-22T13:39:54Z","title":"Recall, Robustness, and Lexicographic Evaluation","summary":" Although originally developed to evaluate sets of items, recall is often used\nto evaluate rankings of items, including those produced by recommender,\nretrieval, and other machine learning systems. The application of recall\nwithout a formal evaluative motivation has led to criticism of recall as a\nvague or inappropriate measure. In light of this debate, we reflect on the\nmeasurement of recall in rankings from a formal perspective. Our analysis is\ncomposed of three tenets: recall, robustness, and lexicographic evaluation.\nFirst, we formally define `recall-orientation' as the sensitivity of a metric\nto a user interested in finding every relevant item. Second, we analyze\nrecall-orientation from the perspective of robustness with respect to possible\ncontent consumers and providers, connecting recall to recent conversations\nabout fair ranking. Finally, we extend this conceptual and theoretical\ntreatment of recall by developing a practical preference-based evaluation\nmethod based on lexicographic comparison. 
Through extensive empirical analysis\nacross three recommendation tasks and 17 information retrieval tasks, we\nestablish that our new evaluation method, lexirecall, has convergent validity\n(i.e., it is correlated with existing recall metrics) and exhibits\nsubstantially higher sensitivity in terms of discriminative power and stability\nin the presence of missing labels. Our conceptual, theoretical, and empirical\nanalysis substantially deepens our understanding of recall and motivates its\nadoption through connections to robustness and fairness.\n","authors":["Fernando Diaz","Michael D. Ekstrand","Bhaskar Mitra"],"pdf_url":"https://arxiv.org/pdf/2302.11370v6.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2409.18721v2","updated":"2024-11-30T14:45:05Z","published":"2024-09-27T13:17:59Z","title":"Scalable Cross-Entropy Loss for Sequential Recommendations with Large\n Item Catalogs","summary":" Scalability issue plays a crucial role in productionizing modern recommender\nsystems. Even lightweight architectures may suffer from high computational\noverload due to intermediate calculations, limiting their practicality in\nreal-world applications. Specifically, applying full Cross-Entropy (CE) loss\noften yields state-of-the-art performance in terms of recommendations quality.\nStill, it suffers from excessive GPU memory utilization when dealing with large\nitem catalogs. This paper introduces a novel Scalable Cross-Entropy (SCE) loss\nfunction in the sequential learning setup. It approximates the CE loss for\ndatasets with large-size catalogs, enhancing both time efficiency and memory\nusage without compromising recommendations quality. Unlike traditional negative\nsampling methods, our approach utilizes a selective GPU-efficient computation\nstrategy, focusing on the most informative elements of the catalog,\nparticularly those most likely to be false positives. 
This is achieved by\napproximating the softmax distribution over a subset of the model outputs\nthrough the maximum inner product search. Experimental results on multiple\ndatasets demonstrate the effectiveness of SCE in reducing peak memory usage by\na factor of up to 100 compared to the alternatives, retaining or even exceeding\ntheir metrics values. The proposed approach also opens new perspectives for\nlarge-scale developments in different domains, such as large language models.\n","authors":["Gleb Mezentsev","Danil Gusak","Ivan Oseledets","Evgeny Frolov"],"pdf_url":"https://arxiv.org/pdf/2409.18721v2.pdf","comment":"11 pages, fixed some typos"},{"id":"http://arxiv.org/abs/2410.19764v2","updated":"2024-11-30T07:06:57Z","published":"2024-10-12T16:14:18Z","title":"Unraveling Movie Genres through Cross-Attention Fusion of Bi-Modal\n Synergy of Poster","summary":" Movie posters are not just decorative; they are meticulously designed to\ncapture the essence of a movie, such as its genre, storyline, and tone/vibe.\nFor decades, movie posters have graced cinema walls, billboards, and now our\ndigital screens as a form of digital posters. Movie genre classification plays\na pivotal role in film marketing, audience engagement, and recommendation\nsystems. Previous explorations into movie genre classification have been mostly\nexamined in plot summaries, subtitles, trailers and movie scenes. Movie posters\nprovide a pre-release tantalizing glimpse into a film's key aspects, which can\nignite public interest. In this paper, we presented the framework that exploits\nmovie posters from a visual and textual perspective to address the multilabel\nmovie genre classification problem. Firstly, we extracted text from movie\nposters using an OCR and retrieved the relevant embedding. Next, we introduce a\ncross-attention-based fusion module to allocate attention weights to visual and\ntextual embedding. 
In validating our framework, we utilized 13882 posters\nsourced from the Internet Movie Database (IMDb). The outcomes of the\nexperiments indicate that our model exhibited promising performance and\noutperformed even some prominent contemporary architectures.\n","authors":["Utsav Kumar Nareti","Chandranath Adak","Soumi Chattopadhyay","Pichao Wang"],"pdf_url":"https://arxiv.org/pdf/2410.19764v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.00605v1","updated":"2024-11-30T22:48:06Z","published":"2024-11-30T22:48:06Z","title":"The Impact of Generative AI on Student Churn and the Future of Formal\n Education","summary":" In the contemporary educational landscape, the advent of Generative\nArtificial Intelligence (AI) presents unprecedented opportunities for\npersonalised learning, fundamentally challenging the traditional paradigms of\neducation. This research explores the emerging trend where high school\nstudents, empowered by tailored educational experiences provided by Generative\nAI, opt to forgo traditional university degrees to pursue entrepreneurial\nventures at a younger age. To understand and predict the future of education in\nthe age of Generative AI, we employ a comprehensive methodology to analyse\nsocial media data. Our approach includes sentiment analysis to gauge public\nopinion, topic modelling to identify key themes and emerging trends, and user\ndemographic analysis to understand the engagement of different age groups and\nregions. We also perform influencer analysis to identify key figures shaping\nthe discourse and engagement metrics to measure the level of interest and\ninteraction with AI-related educational content. Content analysis helps us to\ndetermine the types of content being shared and the prevalent narratives, while\nhashtag analysis reveals the connectivity of discussions. The temporal analysis\ntracks changes over time and identifies event-based spikes in discussions. 
The\ninsights derived from this analysis include the acceptance and adoption of\nGenerative AI in education, its impact on traditional education models, the\ninfluence on students' entrepreneurial ambitions, and the educational outcomes\nassociated with AI-driven personalised learning. Additionally, we explore\npublic sentiment towards policies and regulations and use predictive modelling\nto forecast future trends. This comprehensive social media analysis provides a\nnuanced understanding of the evolving educational landscape, offering valuable\nperspectives on the role of Generative AI in shaping the future of education.\n","authors":["Stephen Elbourn"],"pdf_url":"https://arxiv.org/pdf/2412.00605v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.00569v1","updated":"2024-11-30T19:45:23Z","published":"2024-11-30T19:45:23Z","title":"Contextual Bandits in Payment Processing: Non-uniform Exploration and\n Supervised Learning at Adyen","summary":" Uniform random exploration in decision-making systems supports off-policy\nlearning via supervision but incurs high regret, making it impractical for many\napplications. Conversely, non-uniform exploration offers better immediate\nperformance but lacks support for off-policy learning. Recent research suggests\nthat regression oracles can bridge this gap by combining non-uniform\nexploration with supervised learning. In this paper, we analyze these\napproaches within a real-world industrial context at Adyen, a large global\npayments processor characterized by batch logged delayed feedback, short-term\nmemory, and dynamic action spaces under the Empirical Risk Minimization (ERM)\nframework. Our analysis reveals that while regression oracles significantly\nimprove performance, they introduce challenges due to rigid algorithmic\nassumptions. 
Specifically, we observe that as a policy improves, subsequent\ngenerations may perform worse due to shifts in the reward distribution and\nincreased class imbalance in the training data. This degradation occurs\ndespite improvements in other aspects of the training data, leading to decreased\nperformance in successive policy iterations. We further explore the long-term\nimpact of regression oracles, identifying a potential \"oscillation effect.\"\nThis effect arises when regression oracles influence probability estimates and\nthe realizability of subsequent policy models, leading to fluctuations in\nperformance across iterations. Our findings highlight the need for more\nadaptable algorithms that can leverage the benefits of regression oracles\nwithout introducing instability in policy performance over time.\n","authors":["Akhila Vangara","Alex Egg"],"pdf_url":"https://arxiv.org/pdf/2412.00569v1.pdf","comment":"7 pages, 10 figures, submitted to WWW '25"},{"id":"http://arxiv.org/abs/2412.00546v1","updated":"2024-11-30T17:39:59Z","published":"2024-11-30T17:39:59Z","title":"Rank It, Then Ask It: Input Reranking for Maximizing the Performance of\n LLMs on Symmetric Tasks","summary":" Large language models (LLMs) have quickly emerged as practical and versatile\ntools that provide new solutions for a wide range of domains. In this paper, we\nconsider the application of LLMs on symmetric tasks where a query is asked on\nan (unordered) bag of elements. Examples of such tasks include answering\naggregate queries on a database table. In general, when the bag contains a\nlarge number of elements, LLMs tend to overlook some elements, leading to\nchallenges in generating accurate responses to the query. LLMs receive their\ninputs as ordered sequences. 
However, in this problem, we leverage the fact\nthat the symmetric input is not ordered, and reordering should not affect the\nLLM's response.\n Observing that LLMs are less likely to miss elements at certain positions of\nthe input, we introduce the problem of LLM input reranking: to find a ranking\nof the input that maximizes the LLM's accuracy for the given query without\nmaking explicit assumptions about the query. Finding the optimal ranking\nrequires identifying (i) the relevance of each input element for answering the\nquery and (ii) the importance of each rank position for the LLM's attention. We\ndevelop algorithms for estimating these values efficiently utilizing a helper\nLLM. We conduct comprehensive experiments on different synthetic and real\ndatasets to validate our proposal and to evaluate the effectiveness of our\nproposed algorithms. Our experiments confirm that our reranking approach\nimproves the accuracy of the LLMs on symmetric tasks by up to $99\\%$ proximity\nto the optimum upper bound.\n","authors":["Mohsen Dehghankar","Abolfazl Asudeh"],"pdf_url":"https://arxiv.org/pdf/2412.00546v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.00491v1","updated":"2024-11-30T14:20:18Z","published":"2024-11-30T14:20:18Z","title":"CDEMapper: Enhancing NIH Common Data Element Normalization using Large\n Language Models","summary":" Common Data Elements (CDEs) standardize data collection and sharing across\nstudies, enhancing data interoperability and improving research\nreproducibility. However, implementing CDEs presents challenges due to the\nbroad range and variety of data elements. This study aims to develop an\neffective and efficient mapping tool to bridge the gap between local data\nelements and National Institutes of Health (NIH) CDEs. We propose CDEMapper, a\nlarge language model (LLM) powered mapping tool designed to assist in mapping\nlocal data elements to NIH CDEs. CDEMapper has three core modules: (1) CDE\nindexing and embeddings. 
NIH CDEs were indexed and embedded to support semantic\nsearch; (2) CDE recommendations. The tool combines Elasticsearch (BM25\nsimilarity methods) with state of the art GPT services to recommend candidate\nCDEs and their permissible values; and (3) Human review. Users review and\nselect the NIH CDEs and values that best match their data elements and value\nsets. We evaluate the tool recommendation accuracy against manually annotated\nmapping results. CDEMapper offers a publicly available, LLM-powered, and\nintuitive user interface that consolidates essential and advanced mapping\nservices into a streamlined pipeline. It provides a step by step, quality\nassured mapping workflow designed with a user-centered approach. The evaluation\nresults demonstrated that augmenting BM25 with GPT embeddings and a ranker\nconsistently enhances CDEMapper mapping accuracy in three different mapping\nsettings across four evaluation datasets. This work opens up the potential of\nusing LLMs to assist with CDE recommendation and human curation when aligning\nlocal data elements with NIH CDEs. Additionally, this effort enhances clinical\nresearch data interoperability and helps researchers better understand the gaps\nbetween local data elements and NIH CDEs.\n","authors":["Yan Wang","Jimin Huang","Huan He","Vincent Zhang","Yujia Zhou","Xubing Hao","Pritham Ram","Lingfei Qian","Qianqian Xie","Ruey-Ling Weng","Fongci Lin","Yan Hu","Licong Cui","Xiaoqian Jiang","Hua Xu","Na Hong"],"pdf_url":"https://arxiv.org/pdf/2412.00491v1.pdf","comment":"11 pages,4 figures"},{"id":"http://arxiv.org/abs/2412.00424v1","updated":"2024-11-30T10:30:49Z","published":"2024-11-30T10:30:49Z","title":"FairSort: Learning to Fair Rank for Personalized Recommendations in\n Two-Sided Platforms","summary":" Traditional recommendation systems focus on maximizing user satisfaction by\nsuggesting their favorite items. This user-centric approach may lead to unfair\nexposure distribution among the providers. 
On the contrary, a provider-centric\ndesign might become unfair to the users. Therefore, this paper proposes a\nre-ranking model FairSort\\footnote{\\textbf{Reproducibility:}The code and\ndatasets are available at \\url{https://github.com/13543024276/FairSort}} to\nfind a trade-off solution among user-side fairness, provider-side fairness, and\npersonalized recommendations utility. Previous works habitually treat this\nissue as a knapsack problem, incorporating both-side fairness as constraints.\n In this paper, we adopt a novel perspective, treating each recommendation\nlist as a runway rather than a knapsack. In this perspective, each item on the\nrunway gains a velocity and runs within a specific time, achieving re-ranking\nfor both-side fairness. Meanwhile, we ensure the Minimum Utility Guarantee for\npersonalized recommendations by designing a Binary Search approach. This can\nprovide more reliable recommendations compared to the conventional greedy\nstrategy based on the knapsack problem. We further broaden the applicability of\nFairSort, designing two versions for online and offline recommendation\nscenarios. 
Theoretical analysis and extensive experiments on real-world\ndatasets indicate that FairSort can ensure more reliable personalized\nrecommendations while considering fairness for both the provider and user.\n","authors":["Guoli Wu","Zhiyong Feng","Shizhan Chen","Hongyue Wu","Xiao Xue","Jianmao Xiao","Guodong Fan","Hongqi Chen","Jingyu Li"],"pdf_url":"https://arxiv.org/pdf/2412.00424v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.00324v1","updated":"2024-11-30T02:45:01Z","published":"2024-11-30T02:45:01Z","title":"Robust Table Integration in Data Lakes","summary":" In this paper, we investigate the challenge of integrating tables from data\nlakes, focusing on three core tasks: 1) pairwise integrability judgment, which\ndetermines whether a tuple pair in a table is integrable, accounting for any\noccurrences of semantic equivalence or typographical errors; 2) integrable set\ndiscovery, which aims to identify all integrable sets in a table based on\npairwise integrability judgments established in the first task; 3) multi-tuple\nconflict resolution, which resolves conflicts among multiple tuples during\nintegration. We train a binary classifier to address the task of pairwise\nintegrability judgment. Given the scarcity of labeled data, we propose a\nself-supervised adversarial contrastive learning algorithm to perform\nclassification, which incorporates data augmentation methods and adversarial\nexamples to autonomously generate new training data. Upon the output of\npairwise integrability judgment, each integrable set is considered as a\ncommunity, a densely connected sub-graph where nodes and edges correspond to\ntuples in the table and their pairwise integrability, respectively. We proceed\nto investigate various community detection algorithms to address the integrable\nset discovery objective. Moving forward to tackle multi-tuple conflict\nresolution, we introduce a novel in-context learning methodology. 
This\napproach capitalizes on the knowledge embedded within pretrained large language\nmodels to effectively resolve conflicts that arise when integrating multiple\ntuples. Notably, our method minimizes the need for annotated data. Since no\nsuitable test collections are available for our tasks, we develop our own\nbenchmarks using two real-world dataset repositories: Real and Join. We conduct\nextensive experiments on these benchmarks to validate the robustness and\napplicability of our methodologies in the context of integrating tables within\ndata lakes.\n","authors":["Daomin Ji","Hui Luo","Zhifeng Bao","Shane Culpepper"],"pdf_url":"https://arxiv.org/pdf/2412.00324v1.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2410.19764v2","updated":"2024-11-30T07:06:57Z","published":"2024-10-12T16:14:18Z","title":"Unraveling Movie Genres through Cross-Attention Fusion of Bi-Modal\n Synergy of Poster","summary":" Movie posters are not just decorative; they are meticulously designed to\ncapture the essence of a movie, such as its genre, storyline, and tone/vibe.\nFor decades, movie posters have graced cinema walls, billboards, and now our\ndigital screens as a form of digital posters. Movie genre classification plays\na pivotal role in film marketing, audience engagement, and recommendation\nsystems. Previous explorations into movie genre classification have been mostly\nexamined in plot summaries, subtitles, trailers and movie scenes. Movie posters\nprovide a pre-release tantalizing glimpse into a film's key aspects, which can\nignite public interest. In this paper, we presented the framework that exploits\nmovie posters from a visual and textual perspective to address the multilabel\nmovie genre classification problem. Firstly, we extracted text from movie\nposters using an OCR and retrieved the relevant embedding. Next, we introduce a\ncross-attention-based fusion module to allocate attention weights to visual and\ntextual embedding. 
In validating our framework, we utilized 13882 posters\nsourced from the Internet Movie Database (IMDb). The outcomes of the\nexperiments indicate that our model exhibited promising performance and\noutperformed even some prominent contemporary architectures.\n","authors":["Utsav Kumar Nareti","Chandranath Adak","Soumi Chattopadhyay","Pichao Wang"],"pdf_url":"https://arxiv.org/pdf/2410.19764v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.00446v1","updated":"2024-11-30T11:40:31Z","published":"2024-11-30T11:40:31Z","title":"Hybrid Local-Global Context Learning for Neural Video Compression","summary":" In neural video codecs, current state-of-the-art methods typically adopt\nmulti-scale motion compensation to handle diverse motions. These methods\nestimate and compress either optical flow or deformable offsets to reduce\ninter-frame redundancy. However, flow-based methods often suffer from\ninaccurate motion estimation in complicated scenes. Deformable\nconvolution-based methods are more robust but have a higher bit cost for motion\ncoding. In this paper, we propose a hybrid context generation module, which\ncombines the advantages of the above methods in an optimal way and achieves\naccurate compensation at a low bit cost. Specifically, considering the\ncharacteristics of features at different scales, we adopt flow-guided\ndeformable compensation at largest-scale to produce accurate alignment in\ndetailed regions. For smaller-scale features, we perform flow-based warping to\nsave the bit cost for motion coding. Furthermore, we design a local-global\ncontext enhancement module to fully explore the local-global information of\nprevious reconstructed signals. 
Experimental results demonstrate that our\nproposed Hybrid Local-Global Context learning (HLGC) method can significantly\nenhance the state-of-the-art methods on standard test datasets.\n","authors":["Yongqi Zhai","Jiayu Yang","Wei Jiang","Chunhui Yang","Luyang Tang","Ronggang Wang"],"pdf_url":"https://arxiv.org/pdf/2412.00446v1.pdf","comment":"Accepted to DCC 2024"}]},"2024-12-03T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2412.02698v1","updated":"2024-12-03T18:59:51Z","published":"2024-12-03T18:59:51Z","title":"Scaling BERT Models for Turkish Automatic Punctuation and Capitalization\n Correction","summary":" This paper investigates the effectiveness of BERT based models for automated\npunctuation and capitalization corrections in Turkish texts across five\ndistinct model sizes. The models are designated as Tiny, Mini, Small, Medium,\nand Base. The design and capabilities of each model are tailored to address the\nspecific challenges of the Turkish language, with a focus on optimizing\nperformance while minimizing computational overhead. The study presents a\nsystematic comparison of the performance metrics precision, recall, and F1\nscore of each model, offering insights into their applicability in diverse\noperational contexts. The results demonstrate a significant improvement in text\nreadability and accuracy as model size increases, with the Base model achieving\nthe highest correction precision. 
This research provides a comprehensive guide\nfor selecting the appropriate model size based on specific user needs and\ncomputational resources, establishing a framework for deploying these models in\nreal-world applications to enhance the quality of written Turkish.\n","authors":["Abdulkader Saoud","Mahmut Alomeyr","Himmet Toprak Kesgin","Mehmet Fatih Amasyali"],"pdf_url":"https://arxiv.org/pdf/2412.02698v1.pdf","comment":"2024 Innovations in Intelligent Systems and Applications Conference\n (ASYU)"},{"id":"http://arxiv.org/abs/2412.02685v1","updated":"2024-12-03T18:56:07Z","published":"2024-12-03T18:56:07Z","title":"T-REG: Preference Optimization with Token-Level Reward Regularization","summary":" Reinforcement learning from human feedback (RLHF) has been crucial in\naligning large language models (LLMs) with human values. Traditionally, RLHF\ninvolves generating responses to a query and using a reward model to assign a\nreward to the entire response. However, this approach faces challenges due to\nits reliance on a single, sparse reward, which makes it challenging for the\nmodel to identify which parts of the sequence contribute most significantly to\nthe final reward. Recent methods have attempted to address this limitation by\nintroducing token-level rewards. However, these methods often rely on either a\ntrained credit assignment model or AI annotators, raising concerns about the\nquality and reliability of the rewards. In this paper, we propose token-level\nreward regularization (T-REG), a novel approach that leverages both\nsequence-level and token-level rewards for preference optimization. Harnessing\nthe self-refinement capabilities of LLMs, our method uses contrastive prompting\nto enable LLMs to self-generate token-level rewards. These self-generated\nrewards then act as reward regularization, guiding the model to more\neffectively distribute sequence-level rewards across tokens. 
This facilitates\nbetter token-level credit assignment and enhances alignment performance.\nExperiments on the instruction following benchmarks, including Alpaca Eval 2\nand Arena-Hard, show that our method consistently outperforms baseline methods\nby up to 3.8% and 4.4%, respectively. We will release the code and models at\nhttps://github.com/wzhouad/T-REG.\n","authors":["Wenxuan Zhou","Shujian Zhang","Lingxiao Zhao","Tao Meng"],"pdf_url":"https://arxiv.org/pdf/2412.02685v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.14052v2","updated":"2024-12-03T18:48:00Z","published":"2024-10-17T21:47:11Z","title":"From Isolated Conversations to Hierarchical Schemas: Dynamic Tree Memory\n Representation for LLMs","summary":" Recent advancements in large language models have significantly improved\ntheir context windows, yet challenges in effective long-term memory management\nremain. We introduce MemTree, an algorithm that leverages a dynamic,\ntree-structured memory representation to optimize the organization, retrieval,\nand integration of information, akin to human cognitive schemas. MemTree\norganizes memory hierarchically, with each node encapsulating aggregated\ntextual content, corresponding semantic embeddings, and varying abstraction\nlevels across the tree's depths. Our algorithm dynamically adapts this memory\nstructure by computing and comparing semantic embeddings of new and existing\ninformation to enrich the model's context-awareness. This approach allows\nMemTree to handle complex reasoning and extended interactions more effectively\nthan traditional memory augmentation methods, which often rely on flat lookup\ntables. 
Evaluations on benchmarks for multi-turn dialogue understanding and\ndocument question answering show that MemTree significantly enhances\nperformance in scenarios that demand structured memory management.\n","authors":["Alireza Rezazadeh","Zichao Li","Wei Wei","Yujia Bao"],"pdf_url":"https://arxiv.org/pdf/2410.14052v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02674v1","updated":"2024-12-03T18:47:26Z","published":"2024-12-03T18:47:26Z","title":"Mind the Gap: Examining the Self-Improvement Capabilities of Large\n Language Models","summary":" Self-improvement is a mechanism in Large Language Model (LLM) pre-training,\npost-training and test-time inference. We explore a framework where the model\nverifies its own outputs, filters or reweights data based on this verification,\nand distills the filtered data. Despite several empirical successes, a\nfundamental understanding is still lacking. In this work, we initiate a\ncomprehensive, modular and controlled study on LLM self-improvement. We provide\na mathematical formulation for self-improvement, which is largely governed by a\nquantity which we formalize as the generation-verification gap. Through\nexperiments with various model families and tasks, we discover a scaling\nphenomenon of self-improvement -- a variant of the generation-verification gap\nscales monotonically with the model pre-training flops. We also examine when\nself-improvement is possible, an iterative self-improvement procedure, and ways\nto improve its performance. 
Our findings not only advance understanding of LLM\nself-improvement with practical implications, but also open numerous avenues\nfor future research into its capabilities and boundaries.\n","authors":["Yuda Song","Hanlin Zhang","Carson Eisenach","Sham Kakade","Dean Foster","Udaya Ghai"],"pdf_url":"https://arxiv.org/pdf/2412.02674v1.pdf","comment":"41 pages, 19 figures"},{"id":"http://arxiv.org/abs/2412.02664v1","updated":"2024-12-03T18:38:14Z","published":"2024-12-03T18:38:14Z","title":"Probing the statistical properties of enriched co-occurrence networks","summary":" Recent studies have explored the addition of virtual edges to word\nco-occurrence networks using word embeddings to enhance graph representations,\nparticularly for short texts. While these enriched networks have demonstrated\nsome success, the impact of incorporating semantic edges into traditional\nco-occurrence networks remains uncertain. This study investigates two key\nstatistical properties of text-based network models. First, we assess whether\nnetwork metrics can effectively distinguish between meaningless and meaningful\ntexts. Second, we analyze whether these metrics are more sensitive to syntactic\nor semantic aspects of the text. Our results show that incorporating virtual\nedges can have positive and negative effects, depending on the specific network\nmetric. For instance, the informativeness of the average shortest path and\ncloseness centrality improves in short texts, while the clustering\ncoefficient's informativeness decreases as more virtual edges are added.\nAdditionally, we found that including stopwords affects the statistical\nproperties of enriched networks. Our results can serve as a guideline for\ndetermining which network metrics are most appropriate for specific\napplications, depending on the typical text size and the nature of the problem.\n","authors":["Diego R. Amancio","Jeaneth Machicao","Laura V. C. 
Quispe"],"pdf_url":"https://arxiv.org/pdf/2412.02664v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02638v1","updated":"2024-12-03T18:10:31Z","published":"2024-12-03T18:10:31Z","title":"QA-TOOLBOX: Conversational Question-Answering for process task guidance\n in manufacturing","summary":" In this work we explore utilizing LLMs for data augmentation for a\nmanufacturing task guidance system. The dataset consists of representative\nsamples of interactions with technicians working in an advanced manufacturing\nsetting. The purpose of this work is to explore the task, data augmentation for\nthe supported tasks, and the performance of existing LLMs. We\nobserve that the task is complex, requiring understanding from procedure\nspecification documents, actions and objects sequenced temporally. The dataset\nconsists of 200,000+ question/answer pairs that refer to the spec document and\nare grounded in narrations and/or video demonstrations. We compared the\nperformance of several popular open-sourced LLMs by developing a baseline using\neach LLM and then compared the responses in a reference-free setting using\nLLM-as-a-judge and compared the ratings with crowd-workers whilst validating\nthe ratings with experts.\n","authors":["Ramesh Manuvinakurike","Elizabeth Watkins","Celal Savur","Anthony Rhodes","Sovan Biswas","Gesem Gudino Mejia","Richard Beckwith","Saurav Sahay","Giuseppe Raffa","Lama Nachman"],"pdf_url":"https://arxiv.org/pdf/2412.02638v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02637v1","updated":"2024-12-03T18:10:28Z","published":"2024-12-03T18:10:28Z","title":"Words and Action: Modeling Linguistic Leadership in #BlackLivesMatter\n Communities","summary":" In this project, we describe a method of modeling semantic leadership across\na set of communities associated with the #BlackLivesMatter movement, which has\nbeen informed by qualitative research on the structure of social media and\nBlack Twitter in particular. 
We describe our bespoke approaches to\ntime-binning, community clustering, and connecting communities over time, as\nwell as our adaptation of state-of-the-art approaches to semantic change\ndetection and semantic leadership induction. We find substantial evidence of\nthe leadership role of BLM activists and progressives, as well as Black\ncelebrities. We also find evidence of the sustained engagement of the\nconservative community with this discourse, suggesting an alternative\nexplanation for how we arrived at the present moment, in which \"anti-woke\" and\n\"anti-CRT\" bills are being enacted nationwide.\n","authors":["Dani Roytburg","Deborah Olorunisola","Sandeep Soni","Lauren Klein"],"pdf_url":"https://arxiv.org/pdf/2412.02637v1.pdf","comment":"Accepted at ICWSM 2025; minor revisions forthcoming"},{"id":"http://arxiv.org/abs/2412.02626v1","updated":"2024-12-03T17:54:12Z","published":"2024-12-03T17:54:12Z","title":"Time-Reversal Provides Unsupervised Feedback to LLMs","summary":" Large Language Models (LLMs) are typically trained to predict in the forward\ndirection of time. However, recent works have shown that prompting these models\nto look back and critique their own generations can produce useful feedback.\nMotivated by this, we explore the question of whether LLMs can be empowered to\nthink (predict and score) backwards to provide unsupervised feedback that\ncomplements forward LLMs. Towards this, we introduce Time Reversed Language\nModels (TRLMs), which can score and generate queries when conditioned on\nresponses, effectively functioning in the reverse direction of time. 
Further,\nto effectively infer in the response to query direction, we pre-train and\nfine-tune a language model (TRLM-Ba) in the reverse token order from scratch.\nWe show empirically (and theoretically in a stylized setting) that\ntime-reversed models can indeed complement forward model predictions when used\nto score the query given response for re-ranking multiple forward generations.\nWe obtain up to 5\\% improvement on the widely used AlpacaEval Leaderboard over\nthe competent baseline of best-of-N re-ranking using self log-perplexity\nscores. We further show that TRLM scoring outperforms conventional forward\nscoring of response given query, resulting in significant gains in applications\nsuch as citation generation and passage retrieval. We next leverage the\ngenerative ability of TRLM to augment or provide unsupervised feedback to input\nsafety filters of LLMs, demonstrating a drastic reduction in false negative\nrate with negligible impact on false positive rates against several attacks\npublished on the popular JailbreakBench leaderboard.\n","authors":["Yerram Varun","Rahul Madhavan","Sravanti Addepalli","Arun Suggala","Karthikeyan Shanmugam","Prateek Jain"],"pdf_url":"https://arxiv.org/pdf/2412.02626v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02612v1","updated":"2024-12-03T17:41:24Z","published":"2024-12-03T17:41:24Z","title":"GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken\n Chatbot","summary":" We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken\nchatbot. It supports both Chinese and English, engages in real-time voice\nconversations, and varies vocal nuances such as emotion, intonation, speech\nrate, and dialect according to user instructions. GLM-4-Voice uses an ultra-low\nbitrate (175bps), single-codebook speech tokenizer with 12.5Hz frame rate\nderived from an automatic speech recognition (ASR) model by incorporating a\nvector-quantized bottleneck into the encoder. 
To efficiently transfer knowledge\nfrom text to speech modalities, we synthesize speech-text interleaved data from\nexisting text pre-training corpora using a text-to-token model. We continue\npre-training from the pre-trained text language model GLM-4-9B with a\ncombination of unsupervised speech data, interleaved speech-text data, and\nsupervised speech-text data, scaling up to 1 trillion tokens, achieving\nstate-of-the-art performance in both speech language modeling and spoken\nquestion answering. We then fine-tune the pre-trained model with high-quality\nconversational speech data, achieving superior performance compared to existing\nbaselines in both conversational ability and speech quality. The open models\ncan be accessed through https://github.com/THUDM/GLM-4-Voice and\nhttps://huggingface.co/THUDM/glm-4-voice-9b.\n","authors":["Aohan Zeng","Zhengxiao Du","Mingdao Liu","Kedong Wang","Shengmin Jiang","Lei Zhao","Yuxiao Dong","Jie Tang"],"pdf_url":"https://arxiv.org/pdf/2412.02612v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02611v1","updated":"2024-12-03T17:41:23Z","published":"2024-12-03T17:41:23Z","title":"AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand\n Audio-Visual Information?","summary":" Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini\n1.5 Pro, and Reka Core, have expanded their capabilities to include vision and\naudio modalities. While these models demonstrate impressive performance across\na wide range of audio-visual applications, our proposed DeafTest reveals that\nMLLMs often struggle with simple tasks humans find trivial: 1) determining\nwhich of two sounds is louder, and 2) determining which of two sounds has a\nhigher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a\ncomprehensive audio-visual benchmark designed to assess whether those MLLMs can\ntruly understand the audio-visual information. 
This benchmark encompasses 4,555\ncarefully crafted problems, each incorporating text, visual, and audio\ncomponents. To successfully infer answers, models must effectively leverage\nclues from both visual and audio inputs. To ensure precise and objective\nevaluation of MLLM responses, we have structured the questions as\nmultiple-choice, eliminating the need for human evaluation or LLM-assisted\nassessment. We benchmark a series of closed-source and open-source models and\nsummarize the observations. By revealing the limitations of current models, we\naim to provide useful insight for future dataset collection and model\ndevelopment.\n","authors":["Kaixiong Gong","Kaituo Feng","Bohao Li","Yibing Wang","Mofan Cheng","Shijia Yang","Jiaming Han","Benyou Wang","Yutong Bai","Zhuoran Yang","Xiangyu Yue"],"pdf_url":"https://arxiv.org/pdf/2412.02611v1.pdf","comment":"Project page: https://av-odyssey.github.io/"},{"id":"http://arxiv.org/abs/2411.17404v2","updated":"2024-12-03T17:38:54Z","published":"2024-11-26T13:05:53Z","title":"BPP-Search: Enhancing Tree of Thought Reasoning for Mathematical\n Modeling Problem Solving","summary":" LLMs exhibit advanced reasoning capabilities, offering the potential to\ntransform natural language questions into mathematical models. However,\nexisting open-source datasets in operations research domain lack detailed\nannotations of the modeling process, such as variable definitions, focusing\nsolely on objective values, which hinders reinforcement learning applications.\nTo address this, we release the StructuredOR dataset, annotated with\ncomprehensive labels that capture the complete mathematical modeling process.\nWe further propose BPP-Search, an algorithm that integrates reinforcement\nlearning into a tree-of-thought structure using Beam search, a Process reward\nmodel, and a pairwise Preference algorithm. This approach enables efficient\nexploration of tree structures, avoiding exhaustive search while improving\naccuracy. 
Extensive experiments on StructuredOR, NL4OPT, and MAMO-ComplexLP\ndatasets show that BPP-Search significantly outperforms state-of-the-art\nmethods. In tree-based reasoning, BPP-Search excels in accuracy and efficiency,\nenabling faster retrieval of correct solutions.\n","authors":["Teng Wang","Wing-Yin Yu","Zhenqi He","Zehua Liu","Xiongwei Han","Hailei Gong","Han Wu","Wei Shi","Ruifeng She","Fangzhou Zhu","Tao Zhong"],"pdf_url":"https://arxiv.org/pdf/2411.17404v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02605v1","updated":"2024-12-03T17:34:50Z","published":"2024-12-03T17:34:50Z","title":"Interpretable Company Similarity with Sparse Autoencoders","summary":" Determining company similarity is a vital task in finance, underpinning\nhedging, risk management, portfolio diversification, and more. Practitioners\noften rely on sector and industry classifications to gauge similarity, such as\nSIC-codes and GICS-codes, the former being used by the U.S. Securities and\nExchange Commission (SEC), and the latter widely used by the investment\ncommunity. Clustering embeddings of company descriptions has been proposed as a\npotential technique for determining company similarity, but the lack of\ninterpretability in token embeddings poses a significant barrier to adoption in\nhigh-stakes contexts. Sparse Autoencoders have shown promise in enhancing the\ninterpretability of Large Language Models by decomposing LLM activations into\ninterpretable features. In this paper, we explore the use of SAE features in\nmeasuring company similarity and benchmark them against (1) SIC codes and (2)\nMajor Group codes. 
We conclude that SAE features can reproduce and even surpass\nsector classifications in quantifying fundamental characteristics of companies,\nevaluated by the correlation of monthly returns, a proxy for similarity, and\nPnL from cointegration.\n","authors":["Marco Molinari","Vladimir Tregubiak","Victor Shao","Abhimanyu Pandey","Mateusz Mikolajczak","Sebastião Kuznetsov Ryder Torres Pereira"],"pdf_url":"https://arxiv.org/pdf/2412.02605v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02602v1","updated":"2024-12-03T17:32:47Z","published":"2024-12-03T17:32:47Z","title":"CEGI: Measuring the trade-off between efficiency and carbon emissions\n for SLMs and VLMs","summary":" This paper analyzes the performance of Small Language Models (SLMs) and\nVision Language Models (VLMs) and evaluates the trade-off between model\nperformance and carbon emissions across 4 essential tasks: Image Captioning,\nVisual Question Answering (VQA), Dialogue Summarization and Text-to-SQL\nconversion. Various SLMs and VLMs belonging to the Qwen and LLaMA architecture\nfamily are chosen and variants based on model size in terms of the number of\nparameters, quantization level and fine-tuning parameters are evaluated. The\nmodel variant's performance and carbon emissions are calculated. To quantify\nthe trade-off between model performance and carbon emissions, we introduce a\nnovel metric called CEGI (Carbon Efficient Gain Index). This metric represents\nthe carbon emission per unit percentage gain per million trainable parameters .\nThis metric provides a normalized measure to compare model's efficiency in\nterms of performance improvement relative to their environmental cost. The\nexperiment's outcome demonstrates that fine-tuning SLMs and VLMs can achieve\nperformance levels comparable to Large Language Models (LLMs) while producing\nsignificantly less carbon emissions. 
Our findings suggest that the marginal\ngains in accuracy from larger models do not justify the substantial increase in\ncarbon emissions. Leveraging lower-bit quantization levels, the proposed metric\nfurther enhances energy efficiency without compromising performance. This study\nhighlights balancing high performance and environmental sustainability. It\noffers a valuable metric for selecting models suitable for\nenvironmentally-friendly AI development.\n","authors":["Abhas Kumar","Kapil Pathak","Rajesh Kavuru","Prabhakar Srinivasan"],"pdf_url":"https://arxiv.org/pdf/2412.02602v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02595v1","updated":"2024-12-03T17:28:50Z","published":"2024-12-03T17:28:50Z","title":"Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon\n Pretraining Dataset","summary":" Recent English Common Crawl datasets like FineWeb-Edu and DCLM achieved\nsignificant benchmark gains via aggressive model-based filtering, but at the\ncost of removing 90% of data. This limits their suitability for long token\nhorizon training, such as 15T tokens for Llama 3.1. In this paper, we show how\nto achieve better trade-offs between accuracy and data quantity by a\ncombination of classifier ensembling, synthetic data rephrasing, and reduced\nreliance on heuristic filters. When training 8B parameter models for 1T tokens,\nusing a high-quality subset of our data improves MMLU by 5.6 over DCLM,\ndemonstrating the efficacy of our methods for boosting accuracies over a\nrelatively short token horizon. Furthermore, our full 6.3T token dataset\nmatches DCLM on MMLU, but contains four times more unique real tokens than\nDCLM. This unlocks state-of-the-art training over a long token horizon: an 8B\nparameter model trained for 15T tokens, of which 7.2T came from our dataset, is\nbetter than the Llama 3.1 8B model: +5 on MMLU, +3.1 on ARC-Challenge, and +0.5\non average across ten diverse tasks. 
The dataset is available at\nhttps://data.commoncrawl.org/contrib/Nemotron/Nemotron-CC/index.html\n","authors":["Dan Su","Kezhi Kong","Ying Lin","Joseph Jennings","Brandon Norick","Markus Kliegl","Mostofa Patwary","Mohammad Shoeybi","Bryan Catanzaro"],"pdf_url":"https://arxiv.org/pdf/2412.02595v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.13846v4","updated":"2024-12-03T17:22:01Z","published":"2024-04-22T03:05:19Z","title":"Filtered Direct Preference Optimization","summary":" Reinforcement learning from human feedback (RLHF) plays a crucial role in\naligning language models with human preferences. While the significance of\ndataset quality is generally recognized, explicit investigations into its\nimpact within the RLHF framework, to our knowledge, have been limited. This\npaper addresses the issue of text quality within the preference dataset by\nfocusing on direct preference optimization (DPO), an increasingly adopted\nreward-model-free RLHF method. We confirm that text quality significantly\ninfluences the performance of models optimized with DPO more than those\noptimized with reward-model-based RLHF. Building on this new insight, we\npropose an extension of DPO, termed filtered direct preference optimization\n(fDPO). fDPO uses a trained reward model to monitor the quality of texts within\nthe preference dataset during DPO training. Samples of lower quality are\ndiscarded based on comparisons with texts generated by the model being\noptimized, resulting in a more accurate dataset. Experimental results\ndemonstrate that fDPO enhances the final model performance. 
Our code is\navailable at https://github.com/CyberAgentAILab/filtered-dpo.\n","authors":["Tetsuro Morimura","Mitsuki Sakamoto","Yuu Jinnai","Kenshi Abe","Kaito Ariu"],"pdf_url":"https://arxiv.org/pdf/2404.13846v4.pdf","comment":"EMNLP 2024"},{"id":"http://arxiv.org/abs/2412.02563v1","updated":"2024-12-03T16:52:06Z","published":"2024-12-03T16:52:06Z","title":"Semantic Tokens in Retrieval Augmented Generation","summary":" Retrieval-Augmented Generation (RAG) architectures have recently garnered\nsignificant attention for their ability to improve truth grounding and\ncoherence in natural language processing tasks. However, the reliability of RAG\nsystems in producing accurate answers diminishes as the volume of data they\naccess increases. Even with smaller datasets, these systems occasionally fail\nto address simple queries. This issue arises from their dependence on\nstate-of-the-art large language models (LLMs), which can introduce uncertainty\ninto the system's outputs. In this work, I propose a novel Comparative RAG\nsystem that introduces an evaluator module to bridge the gap between\nprobabilistic RAG systems and deterministically verifiable responses. The\nevaluator compares external recommendations with the retrieved document chunks,\nadding a decision-making layer that enhances the system's reliability. This\napproach ensures that the chunks retrieved are both semantically relevant and\nlogically consistent with deterministic insights, thereby improving the\naccuracy and overall efficiency of RAG systems. 
This framework paves the way\nfor more reliable and scalable question-answering applications in domains\nrequiring high precision and verifiability.\n","authors":["Joel Suro"],"pdf_url":"https://arxiv.org/pdf/2412.02563v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02549v1","updated":"2024-12-03T16:43:42Z","published":"2024-12-03T16:43:42Z","title":"Patent-CR: A Dataset for Patent Claim Revision","summary":" This paper presents Patent-CR, the first dataset created for the patent claim\nrevision task in English. It includes both initial patent applications rejected\nby patent examiners and the final granted versions. Unlike normal text revision\ntasks that predominantly focus on enhancing sentence quality, such as grammar\ncorrection and coherence improvement, patent claim revision aims at ensuring\nthe claims meet stringent legal criteria. These criteria are beyond novelty and\ninventiveness, including clarity of scope, technical accuracy, language\nprecision, and legal robustness. We assess various large language models (LLMs)\nthrough professional human evaluation, including general LLMs with different\nsizes and architectures, text revision models, and domain-specific models. Our\nresults indicate that LLMs often bring ineffective edits that deviate from the\ntarget revisions. In addition, domain-specific models and the method of\nfine-tuning show promising results. Notably, GPT-4 outperforms other tested\nLLMs, but further revisions are still necessary to reach the examination\nstandard. Furthermore, we demonstrate the inconsistency between automated and\nhuman evaluation results, suggesting that GPT-4-based automated evaluation has\nthe highest correlation with human judgment. 
This dataset, along with our\npreliminary empirical research, offers invaluable insights for further\nexploration in patent claim revision.\n","authors":["Lekang Jiang","Pascal A Scherz","Stephan Goetz"],"pdf_url":"https://arxiv.org/pdf/2412.02549v1.pdf","comment":"15 pages, 6 tables, 3 figures"},{"id":"http://arxiv.org/abs/2411.02337v2","updated":"2024-12-03T16:37:23Z","published":"2024-11-04T17:59:58Z","title":"WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum\n Reinforcement Learning","summary":" Large language models (LLMs) have shown remarkable potential as autonomous\nagents, particularly in web-based tasks. However, existing LLM web agents\nheavily rely on expensive proprietary LLM APIs, while open LLMs lack the\nnecessary decision-making capabilities. This paper introduces WebRL, a\nself-evolving online curriculum reinforcement learning framework designed to\ntrain high-performance web agents using open LLMs. WebRL addresses three key\nchallenges in building LLM web agents, including the scarcity of training\ntasks, sparse feedback signals, and policy distribution drift in online\nlearning. Specifically, WebRL incorporates 1) a self-evolving curriculum that\ngenerates new tasks from unsuccessful attempts, 2) a robust outcome-supervised\nreward model (ORM), and 3) adaptive reinforcement learning strategies to ensure\nconsistent improvements. We apply WebRL to transform open Llama-3.1 and GLM-4\nmodels into proficient web agents. On WebArena-Lite, WebRL improves the success\nrate of Llama-3.1-8B from 4.8% to 42.4%, and from 6.1% to 43% for GLM-4-9B.\nThese open models significantly surpass the performance of GPT-4-Turbo (17.6%)\nand GPT-4o (13.9%) and outperform previous state-of-the-art web agents trained\non open LLMs (AutoWebGLM, 18.2%). 
Our findings demonstrate WebRL's\neffectiveness in bridging the gap between open and proprietary LLM-based web\nagents, paving the way for more accessible and powerful autonomous web\ninteraction systems.\n","authors":["Zehan Qi","Xiao Liu","Iat Long Iong","Hanyu Lai","Xueqiao Sun","Wenyi Zhao","Yu Yang","Xinyue Yang","Jiadai Sun","Shuntian Yao","Tianjie Zhang","Wei Xu","Jie Tang","Yuxiao Dong"],"pdf_url":"https://arxiv.org/pdf/2411.02337v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02525v1","updated":"2024-12-03T16:18:42Z","published":"2024-12-03T16:18:42Z","title":"LLMForecaster: Improving Seasonal Event Forecasts with Unstructured\n Textual Data","summary":" Modern time-series forecasting models often fail to make full use of rich\nunstructured information about the time series themselves. This lack of proper\nconditioning can lead to obvious model failures; for example, models may be\nunaware of the details of a particular product, and hence fail to anticipate\nseasonal surges in customer demand in the lead up to major exogenous events\nlike holidays for clearly relevant products. To address this shortcoming, this\npaper introduces a novel forecast post-processor -- which we call LLMForecaster\n-- that fine-tunes large language models (LLMs) to incorporate unstructured\nsemantic and contextual information and historical data to improve the\nforecasts from an existing demand forecasting pipeline. In an industry-scale\nretail application, we demonstrate that our technique yields statistically\nsignificant forecast improvements across several sets of products subject to\nholiday-driven demand surges.\n","authors":["Hanyu Zhang","Chuck Arvin","Dmitry Efimov","Michael W. 
Mahoney","Dominique Perrault-Joncas","Shankar Ramasubramanian","Andrew Gordon Wilson","Malcolm Wolff"],"pdf_url":"https://arxiv.org/pdf/2412.02525v1.pdf","comment":"Presented at NeurIPS Time Series in the Age of Large Models (2024)"},{"id":"http://arxiv.org/abs/2404.16019v2","updated":"2024-12-03T16:18:10Z","published":"2024-04-24T17:51:36Z","title":"The PRISM Alignment Dataset: What Participatory, Representative and\n Individualised Human Feedback Reveals About the Subjective and Multicultural\n Alignment of Large Language Models","summary":" Human feedback is central to the alignment of Large Language Models (LLMs).\nHowever, open questions remain about methods (how), domains (where), people\n(who) and objectives (to what end) of feedback processes. To navigate these\nquestions, we introduce PRISM, a dataset that maps the sociodemographics and\nstated preferences of 1,500 diverse participants from 75 countries, to their\ncontextual preferences and fine-grained feedback in 8,011 live conversations\nwith 21 LLMs. With PRISM, we contribute (i) wider geographic and demographic\nparticipation in feedback; (ii) census-representative samples for two countries\n(UK, US); and (iii) individualised ratings that link to detailed participant\nprofiles, permitting personalisation and attribution of sample artefacts. We\ntarget subjective and multicultural perspectives on value-laden and\ncontroversial issues, where we expect interpersonal and cross-cultural\ndisagreement. We use PRISM in three case studies to demonstrate the need for\ncareful consideration of which humans provide what alignment data.\n","authors":["Hannah Rose Kirk","Alexander Whitefield","Paul Röttger","Andrew Bean","Katerina Margatina","Juan Ciro","Rafael Mosquera","Max Bartolo","Adina Williams","He He","Bertie Vidgen","Scott A. 
Hale"],"pdf_url":"https://arxiv.org/pdf/2404.16019v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.00876v2","updated":"2024-12-03T16:12:09Z","published":"2024-12-01T16:32:31Z","title":"Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic\n Vision-language Context Sparsification","summary":" Multimodal Large Language Models (MLLMs) have achieved remarkable success in\nvision understanding, reasoning, and interaction. However, the inference\ncomputation and memory increase progressively with the generation of output\ntokens during decoding, directly affecting the efficacy of MLLMs. Existing\nmethods attempt to reduce the vision context redundancy to achieve efficient\nMLLMs. Unfortunately, the efficiency benefits of the vision context reduction\nin the prefill stage gradually diminish during the decoding stage. To address\nthis problem, we proposed a dynamic vision-language context sparsification\nframework Dynamic-LLaVA, which dynamically reduces the redundancy of vision\ncontext in the prefill stage and decreases the memory and computation overhead\nof the generated language context during decoding. Dynamic-LLaVA designs a\ntailored sparsification inference scheme for different inference modes, i.e.,\nprefill, decoding with and without KV cache, to achieve efficient inference of\nMLLMs. In practice, Dynamic-LLaVA can reduce computation consumption by\n$\\sim$75\\% in the prefill stage. Meanwhile, throughout the entire generation\nprocess of MLLMs, Dynamic-LLaVA reduces the $\\sim$50\\% computation consumption\nunder decoding without KV cache, while saving $\\sim$50\\% GPU memory overhead\nwhen decoding with KV cache, due to the vision-language context sparsification.\nExtensive experiments also demonstrate that Dynamic-LLaVA achieves efficient\ninference for MLLMs with negligible understanding and generation ability\ndegradation or even performance gains compared to the full-context inference\nbaselines. 
Code is available at https://github.com/Osilly/dynamic_llava .\n","authors":["Wenxuan Huang","Zijie Zhai","Yunhang Shen","Shaoshen Cao","Fei Zhao","Xiangfeng Xu","Zheyu Ye","Shaohui Lin"],"pdf_url":"https://arxiv.org/pdf/2412.00876v2.pdf","comment":"Code is available at https://github.com/Osilly/dynamic_llava"},{"id":"http://arxiv.org/abs/2402.12317v2","updated":"2024-12-03T15:56:26Z","published":"2024-02-19T17:37:28Z","title":"EVOR: Evolving Retrieval for Code Generation","summary":" Recently the retrieval-augmented generation (RAG) has been successfully\napplied in code generation. However, existing pipelines for retrieval-augmented\ncode generation (RACG) employ static knowledge bases with a single source,\nlimiting the adaptation capabilities of Large Language Models (LLMs) to domains\nthey have insufficient knowledge of. In this work, we develop a novel pipeline,\nEVOR, that employs the synchronous evolution of both queries and diverse\nknowledge bases. On two realistic settings where the external knowledge is\nrequired to solve code generation tasks, we compile four new datasets\nassociated with frequently updated libraries and long-tail programming\nlanguages, named EVOR-BENCH. Extensive experiments demonstrate that EVOR\nachieves two to four times of execution accuracy compared to other methods such\nas Reflexion (Shinn et al., 2024), DocPrompting (Zhou et al., 2023), etc. We\ndemonstrate that EVOR is flexible and can be easily combined with them to\nachieve further improvement. Further analysis reveals that EVOR benefits from\nthe synchronous evolution of queries and documents and the diverse information\nsources in the knowledge base. We hope that our studies will inspire more\ninsights into the design of advanced RACG pipelines in future research. 
Our\nmodel, code, and data are available at https://arks-codegen.github.io.\n","authors":["Hongjin Su","Shuyang Jiang","Yuhang Lai","Haoyuan Wu","Boao Shi","Che Liu","Qian Liu","Tao Yu"],"pdf_url":"https://arxiv.org/pdf/2402.12317v2.pdf","comment":"Retrieval-augmented code generation"},{"id":"http://arxiv.org/abs/2411.16300v2","updated":"2024-12-03T14:17:41Z","published":"2024-11-25T11:35:08Z","title":"BayLing 2: A Multilingual Large Language Model with Efficient Language\n Alignment","summary":" Large language models (LLMs), with their powerful generative capabilities and\nvast knowledge, empower various tasks in everyday life. However, these\nabilities are primarily concentrated in high-resource languages, leaving\nlow-resource languages with weaker generative capabilities and relatively\nlimited knowledge. Enhancing the multilingual capabilities of LLMs is therefore\ncrucial for serving over 100 linguistic communities worldwide. An intuitive\napproach to enhance the multilingual capabilities would be to construct\ninstruction data for various languages, but constructing instruction data for\nover 100 languages is prohibitively costly. In this paper, we introduce BayLing\n2, which efficiently transfers generative capabilities and knowledge from\nhigh-resource languages to low-resource languages through language alignment.\nTo achieve this, we constructed a dataset of 3.2 million instructions,\ncomprising high-resource language instructions (Chinese and English) and\ncross-lingual instructions for 100+ languages and performed instruction tuning\nbased on the dataset to facilitate the capability transfer between languages.\nUsing Llama as the foundation model, we developed BayLing-2-7B, BayLing-2-13B,\nand BayLing-2-8B, and conducted a comprehensive evaluation of BayLing. For\nmultilingual translation across 100+ languages, BayLing shows superior\nperformance compared to open-source models of similar scale. 
For multilingual\nknowledge and understanding benchmarks, BayLing achieves significant\nimprovements across over 20 low-resource languages, demonstrating its\ncapability of effective knowledge transfer from high-resource to low-resource\nlanguages. Furthermore, results on English benchmarks indicate that BayLing\nmaintains high performance in high-resource languages while enhancing the\nperformance in low-resource languages. Demo, homepage, code and models of\nBayLing are available.\n","authors":["Shaolei Zhang","Kehao Zhang","Qingkai Fang","Shoutao Guo","Yan Zhou","Xiaodong Liu","Yang Feng"],"pdf_url":"https://arxiv.org/pdf/2411.16300v2.pdf","comment":"BayLing 2's online demo: http://nlp.ict.ac.cn/bayling/demo. BayLing\n 2's code and models: https://github.com/ictnlp/BayLing"},{"id":"http://arxiv.org/abs/2412.02467v1","updated":"2024-12-03T14:10:09Z","published":"2024-12-03T14:10:09Z","title":"DP-2Stage: Adapting Language Models as Differentially Private Tabular\n Data Generators","summary":" Generating tabular data under differential privacy (DP) protection ensures\ntheoretical privacy guarantees but poses challenges for training machine\nlearning models, primarily due to the need to capture complex structures under\nnoisy supervision signals. Recently, pre-trained Large Language Models (LLMs)\n-- even those at the scale of GPT-2 -- have demonstrated great potential in\nsynthesizing tabular data. However, their applications under DP constraints\nremain largely unexplored. In this work, we address this gap by applying DP\ntechniques to the generation of synthetic tabular data. Our findings show that\nLLMs face difficulties in generating coherent text when fine-tuned with DP, as\nprivacy budgets are inefficiently allocated to non-private elements like table\nstructures. To overcome this, we propose DP-2Stage, a two-stage fine-tuning\nframework for differentially private tabular data generation. 
The first stage\ninvolves non-private fine-tuning on a pseudo dataset, followed by DP\nfine-tuning on a private dataset. Our empirical results show that this approach\nimproves performance across various settings and metrics compared to directly\nfine-tuned LLMs in DP contexts. We release our code and setup at\nhttps://github.com/tejuafonja/DP-2Stage.\n","authors":["Tejumade Afonja","Hui-Po Wang","Raouf Kerkouche","Mario Fritz"],"pdf_url":"https://arxiv.org/pdf/2412.02467v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02466v1","updated":"2024-12-03T14:09:40Z","published":"2024-12-03T14:09:40Z","title":"Can ChatGPT capture swearing nuances? Evidence from translating Arabic\n oaths","summary":" This study sets out to answer one major question: Can ChatGPT capture\nswearing nuances? It presents an empirical study on the ability of ChatGPT to\ntranslate Arabic oath expressions into English. 30 Arabic oath expressions were\ncollected from the literature. These 30 oaths were first translated via ChatGPT\nand then analyzed and compared to the human translation in terms of types of\ngaps left unfulfilled by ChatGPT. Specifically, the gaps involved are:\nreligious gap, cultural gap, both religious and cultural gaps, no gap, using\nnon-oath particles, redundancy and noncapturing of Arabic script diacritics. It\nconcludes that ChatGPT translation of oaths is still much unsatisfactory,\nunveiling the need of further developments of ChatGPT, and the inclusion of\nArabic data on which ChatGPT should be trained including oath expressions, oath\nnuances, rituals, and practices.\n","authors":["Mohammed Q. 
Shormani"],"pdf_url":"https://arxiv.org/pdf/2412.02466v1.pdf","comment":"18 pages, 3 figures"},{"id":"http://arxiv.org/abs/2412.02454v1","updated":"2024-12-03T13:43:36Z","published":"2024-12-03T13:43:36Z","title":"Gracefully Filtering Backdoor Samples for Generative Large Language\n Models without Retraining","summary":" Backdoor attacks remain significant security threats to generative large\nlanguage models (LLMs). Since generative LLMs output sequences of\nhigh-dimensional token logits instead of low-dimensional classification logits,\nmost existing backdoor defense methods designed for discriminative models like\nBERT are ineffective for generative LLMs. Inspired by the observed differences\nin learning behavior between backdoor and clean mapping in the frequency space,\nwe transform gradients of each training sample, directly influencing parameter\nupdates, into the frequency space. Our findings reveal a distinct separation\nbetween the gradients of backdoor and clean samples in the frequency space.\nBased on this phenomenon, we propose Gradient Clustering in the Frequency Space\nfor Backdoor Sample Filtering (GraCeFul), which leverages sample-wise gradients\nin the frequency space to effectively identify backdoor samples without\nrequiring retraining LLMs. Experimental results show that GraCeFul outperforms\nbaselines significantly. Notably, GraCeFul exhibits remarkable computational\nefficiency, achieving nearly 100% recall and F1 scores in identifying backdoor\nsamples, reducing the average success rate of various backdoor attacks to 0%\nwith negligible drops in clean accuracy across multiple free-style question\nanswering datasets. 
Additionally, GraCeFul generalizes to Llama-2 and Vicuna.\nThe codes are publicly available at https://github.com/ZrW00/GraceFul.\n","authors":["Zongru Wu","Pengzhou Cheng","Lingyong Fang","Zhuosheng Zhang","Gongshen Liu"],"pdf_url":"https://arxiv.org/pdf/2412.02454v1.pdf","comment":"Accepted at COLING 2025"},{"id":"http://arxiv.org/abs/2412.02441v1","updated":"2024-12-03T13:25:18Z","published":"2024-12-03T13:25:18Z","title":"Artificial Expert Intelligence through PAC-reasoning","summary":" Artificial Expert Intelligence (AEI) seeks to transcend the limitations of\nboth Artificial General Intelligence (AGI) and narrow AI by integrating\ndomain-specific expertise with critical, precise reasoning capabilities akin to\nthose of top human experts. Existing AI systems often excel at predefined tasks\nbut struggle with adaptability and precision in novel problem-solving. To\novercome this, AEI introduces a framework for ``Probably Approximately Correct\n(PAC) Reasoning\". This paradigm provides robust theoretical guarantees for\nreliably decomposing complex problems, with a practical mechanism for\ncontrolling reasoning precision. In reference to the division of human thought\ninto System 1 for intuitive thinking and System 2 for reflective\nreasoning~\\citep{tversky1974judgment}, we refer to this new type of reasoning\nas System 3 for precise reasoning, inspired by the rigor of the scientific\nmethod. 
AEI thus establishes a foundation for error-bounded, inference-time\nlearning.\n","authors":["Shai Shalev-Shwartz","Amnon Shashua","Gal Beniamini","Yoav Levine","Or Sharir","Noam Wies","Ido Ben-Shaul","Tomer Nussbaum","Shir Granot Peled"],"pdf_url":"https://arxiv.org/pdf/2412.02441v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02427v1","updated":"2024-12-03T12:46:06Z","published":"2024-12-03T12:46:06Z","title":"GerPS-Compare: Comparing NER methods for legal norm analysis","summary":" We apply NER to a particular sub-genre of legal texts in German: the genre of\nlegal norms regulating administrative processes in public service\nadministration. The analysis of such texts involves identifying stretches of\ntext that instantiate one of ten classes identified by public service\nadministration professionals. We investigate and compare three methods for\nperforming Named Entity Recognition (NER) to detect these classes: a Rule-based\nsystem, deep discriminative models, and a deep generative model. Our results\nshow that Deep Discriminative models outperform both the Rule-based system as\nwell as the Deep Generative model, the latter two roughly performing equally\nwell, outperforming each other in different classes. The main cause for this\nsomewhat surprising result is arguably the fact that the classes used in the\nanalysis are semantically and syntactically heterogeneous, in contrast to the\nclasses used in more standard NER tasks. Deep Discriminative models appear to\nbe better equipped for dealing with this heterogeneity than both generic LLMs\nand human linguists designing rule-based NER systems.\n","authors":["Sarah T. 
Bachinger","Christoph Unger","Robin Erd","Leila Feddoul","Clara Lachenmaier","Sina Zarrieß","Birgitta König-Ries"],"pdf_url":"https://arxiv.org/pdf/2412.02427v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.17897v3","updated":"2024-12-03T12:36:19Z","published":"2024-10-23T14:15:07Z","title":"Value Residual Learning For Alleviating Attention Concentration In\n Transformers","summary":" Transformers can capture long-range dependencies using self-attention,\nallowing tokens to attend to all others directly. However, stacking multiple\nattention layers leads to attention concentration. One natural way to address\nthis issue is to use cross-layer attention, allowing information from earlier\nlayers to be directly accessible to later layers. However, this approach is\ncomputationally expensive. To address this problem, we propose Transformer with\nresidual value (ResFormer) which approximates cross-layer attention through\nadding a residual connection from the values of the first layer to all\nsubsequent layers. Based on this method, one variant is the Transformer with\nsingle layer value (SVFormer), where all layers share the same value embedding\nfrom first layer. Comprehensive empirical evidence demonstrates ResFormer\nachieves equivalent validation loss with 10.4% fewer model parameters and 13.6%\nless training data compared to Transformer, while maintaining similar memory\nusage and computational cost. Besides, SVFormer reduces KV cache size by nearly\nhalf with only a small performance penalty and can be integrated with other\nKV-efficient methods, yielding further reductions in KV cache, with performance\ninfluenced by sequence length and cumulative learning rate. 
Further\nvisualization results suggest that Resformer and SVFormer alleviate attention\nconcentration in deeper layers through avoiding value-state drains and enhance\nrepresentation across most layers.\n","authors":["Zhanchao Zhou","Tianyi Wu","Zhiyun Jiang","Zhenzhong Lan"],"pdf_url":"https://arxiv.org/pdf/2410.17897v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02400v1","updated":"2024-12-03T11:49:34Z","published":"2024-12-03T11:49:34Z","title":"Four Guiding Principles for Modeling Causal Domain Knowledge: A Case\n Study on Brainstorming Approaches for Urban Blight Analysis","summary":" Urban blight is a problem of high interest for planning and policy making.\nResearchers frequently propose theories about the relationships between urban\nblight indicators, focusing on relationships reflecting causality. In this\npaper, we improve on the integration of domain knowledge in the analysis of\nurban blight by introducing four rules for effective modeling of causal domain\nknowledge. The findings of this study reveal significant deviation from causal\nmodeling guidelines by investigating cognitive maps developed for urban blight\nanalysis. These findings provide valuable insights that will inform future work\non urban blight, ultimately enhancing our understanding of urban blight complex\ninteractions.\n","authors":["Houssam Razouk","Michael Leitner","Roman Kern"],"pdf_url":"https://arxiv.org/pdf/2412.02400v1.pdf","comment":"16 pages, 4 figures, 2 tables"},{"id":"http://arxiv.org/abs/2412.02371v1","updated":"2024-12-03T10:57:19Z","published":"2024-12-03T10:57:19Z","title":"TSCheater: Generating High-Quality Tibetan Adversarial Texts via Visual\n Similarity","summary":" Language models based on deep neural networks are vulnerable to textual\nadversarial attacks. 
While rich-resource languages like English are receiving\nfocused attention, Tibetan, a cross-border language, is gradually being studied\ndue to its abundant ancient literature and critical language strategy.\nCurrently, there are several Tibetan adversarial text generation methods, but\nthey do not fully consider the textual features of Tibetan script and\noverestimate the quality of generated adversarial texts. To address this issue,\nwe propose a novel Tibetan adversarial text generation method called TSCheater,\nwhich considers the characteristic of Tibetan encoding and the feature that\nvisually similar syllables have similar semantics. This method can also be\ntransferred to other abugidas, such as Devanagari script. We utilize a\nself-constructed Tibetan syllable visual similarity database called TSVSDB to\ngenerate substitution candidates and adopt a greedy algorithm-based scoring\nmechanism to determine substitution order. After that, we conduct the method on\neight victim language models. Experimentally, TSCheater outperforms existing\nmethods in attack effectiveness, perturbation magnitude, semantic similarity,\nvisual similarity, and human acceptance. Finally, we construct the first\nTibetan adversarial robustness evaluation benchmark called AdvTS, which is\ngenerated by existing methods and proofread by humans.\n","authors":["Xi Cao","Quzong Gesang","Yuan Sun","Nuo Qun","Tashi Nyima"],"pdf_url":"https://arxiv.org/pdf/2412.02371v1.pdf","comment":"Review Version; Submitted to ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.02369v1","updated":"2024-12-03T10:53:22Z","published":"2024-12-03T10:53:22Z","title":"The Impact of Featuring Comments in Online Discussions","summary":" A widespread moderation strategy by online news platforms is to feature what\nthe platform deems high quality comments, usually called editor picks or\nfeatured comments. 
In this paper, we compare online discussions of news\narticles in which certain comments are featured, versus discussions in which no\ncomments are featured. We measure the impact of featuring comments on the\ndiscussion, by estimating and comparing the quality of discussions from the\nperspective of the user base and the platform itself. Our analysis shows that\nthe impact on discussion quality is limited. However, we do observe an increase\nin discussion activity after the first comments are featured by moderators,\nsuggesting that the moderation strategy might be used to increase user\nengagement and to postpone the natural decline in user activity over time.\n","authors":["Cedric Waterschoot","Ernst van den Hemel","Antal van den Bosch"],"pdf_url":"https://arxiv.org/pdf/2412.02369v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02368v1","updated":"2024-12-03T10:52:06Z","published":"2024-12-03T10:52:06Z","title":"ScImage: How Good Are Multimodal Large Language Models at Scientific\n Text-to-Image Generation?","summary":" Multimodal large language models (LLMs) have demonstrated impressive\ncapabilities in generating high-quality images from textual instructions.\nHowever, their performance in generating scientific images--a critical\napplication for accelerating scientific progress--remains underexplored. In\nthis work, we address this gap by introducing ScImage, a benchmark designed to\nevaluate the multimodal capabilities of LLMs in generating scientific images\nfrom textual descriptions. ScImage assesses three key dimensions of\nunderstanding: spatial, numeric, and attribute comprehension, as well as their\ncombinations, focusing on the relationships between scientific objects (e.g.,\nsquares, circles). We evaluate five models, GPT-4o, Llama, AutomaTikZ, Dall-E,\nand StableDiffusion, using two modes of output generation: code-based outputs\n(Python, TikZ) and direct raster image generation. 
Additionally, we examine\nfour different input languages: English, German, Farsi, and Chinese. Our\nevaluation, conducted with 11 scientists across three criteria (correctness,\nrelevance, and scientific accuracy), reveals that while GPT-4o produces outputs\nof decent quality for simpler prompts involving individual dimensions such as\nspatial, numeric, or attribute understanding in isolation, all models face\nchallenges in this task, especially for more complex prompts.\n","authors":["Leixin Zhang","Steffen Eger","Yinjie Cheng","Weihe Zhai","Jonas Belouadi","Christoph Leiter","Simone Paolo Ponzetto","Fahimeh Moafian","Zhixue Zhao"],"pdf_url":"https://arxiv.org/pdf/2412.02368v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.00751v2","updated":"2024-12-03T10:37:09Z","published":"2024-06-02T14:08:51Z","title":"Evaluating Distributed Representations for Multi-Level Lexical\n Semantics: A Research Proposal","summary":" Modern neural networks (NNs), trained on extensive raw sentence data,\nconstruct distributed representations by compressing individual words into\ndense, continuous, high-dimensional vectors. These representations are expected\nto capture multi-level lexical meaning. In this thesis, our objective is to\nexamine the efficacy of distributed representations from NNs in encoding\nlexical meaning. Initially, we identify and formalize three levels of lexical\nsemantics: \\textit{local}, \\textit{global}, and \\textit{mixed} levels. Then,\nfor each level, we evaluate language models by collecting or constructing\nmultilingual datasets, leveraging various language models, and employing\nlinguistic analysis theories. 
This thesis builds a bridge between computational\nmodels and lexical semantics, aiming to complement each other.\n","authors":["Zhu Liu"],"pdf_url":"https://arxiv.org/pdf/2406.00751v2.pdf","comment":"Paper under review"},{"id":"http://arxiv.org/abs/2307.16082v6","updated":"2024-12-03T10:18:20Z","published":"2023-07-29T21:37:55Z","title":"EnrichEvent: Enriching Social Data with Contextual Information for\n Emerging Event Extraction","summary":" Social platforms have emerged as crucial platforms for disseminating\ninformation and discussing real-life social events, offering researchers an\nexcellent opportunity to design and implement novel event detection frameworks.\nHowever, most existing approaches only exploit keyword burstiness or network\nstructures to detect unspecified events. Thus, they often need help identifying\nunknown events regarding the challenging nature of events and social data.\nSocial data, e.g., tweets, is characterized by misspellings, incompleteness,\nword sense ambiguation, irregular language, and variation in aspects of\nopinions. Moreover, extracting discriminative features and patterns for\nevolving events by exploiting the limited structural knowledge is almost\ninfeasible. To address these challenges, in this paper, we propose a novel\nframework, namely EnrichEvent, that leverages the linguistic and contextual\nrepresentations of streaming social data. In particular, we leverage contextual\nand linguistic knowledge to detect semantically related tweets and enhance the\neffectiveness of the event detection approaches. Eventually, our proposed\nframework produces cluster chains for each event to show the evolving variation\nof the event through time. 
We conducted extensive experiments to evaluate our\nframework, validating its high performance and effectiveness in detecting and\ndistinguishing unspecified social events.\n","authors":["Mohammadali Sefidi Esfahani","Mohammad Akbari"],"pdf_url":"https://arxiv.org/pdf/2307.16082v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02343v1","updated":"2024-12-03T10:03:52Z","published":"2024-12-03T10:03:52Z","title":"Multi-Granularity Tibetan Textual Adversarial Attack Method Based on\n Masked Language Model","summary":" In social media, neural network models have been applied to hate speech\ndetection, sentiment analysis, etc., but neural network models are susceptible\nto adversarial attacks. For instance, in a text classification task, the\nattacker elaborately introduces perturbations to the original texts that hardly\nalter the original semantics in order to trick the model into making different\npredictions. By studying textual adversarial attack methods, the robustness of\nlanguage models can be evaluated and then improved. Currently, most of the\nresearch in this field focuses on English, and there is also a certain amount\nof research on Chinese. However, there is little research targeting Chinese\nminority languages. With the rapid development of artificial intelligence\ntechnology and the emergence of Chinese minority language models, textual\nadversarial attacks become a new challenge for the information processing of\nChinese minority languages. In response to this situation, we propose a\nmulti-granularity Tibetan textual adversarial attack method based on masked\nlanguage models called TSTricker. We utilize the masked language models to\ngenerate candidate substitution syllables or words, adopt the scoring mechanism\nto determine the substitution order, and then conduct the attack method on\nseveral fine-tuned victim models. 
The experimental results show that TSTricker\nreduces the accuracy of the classification models by more than 28.70% and makes\nthe classification models change the predictions of more than 90.60% of the\nsamples, which has an evidently higher attack effect than the baseline method.\n","authors":["Xi Cao","Nuo Qun","Quzong Gesang","Yulei Zhu","Trashi Nyima"],"pdf_url":"https://arxiv.org/pdf/2412.02343v1.pdf","comment":"Revised Version; Accepted at WWW 2024 Workshop on SocialNLP"},{"id":"http://arxiv.org/abs/2412.02323v1","updated":"2024-12-03T09:38:22Z","published":"2024-12-03T09:38:22Z","title":"Pay Attention to the Robustness of Chinese Minority Language Models!\n Syllable-level Textual Adversarial Attack on Tibetan Script","summary":" The textual adversarial attack refers to an attack method in which the\nattacker adds imperceptible perturbations to the original texts by elaborate\ndesign so that the NLP (natural language processing) model produces false\njudgments. This method is also used to evaluate the robustness of NLP models.\nCurrently, most of the research in this field focuses on English, and there is\nalso a certain amount of research on Chinese. However, to the best of our\nknowledge, there is little research targeting Chinese minority languages.\nTextual adversarial attacks are a new challenge for the information processing\nof Chinese minority languages. In response to this situation, we propose a\nTibetan syllable-level black-box textual adversarial attack called TSAttacker\nbased on syllable cosine distance and scoring mechanism. And then, we conduct\nTSAttacker on six models generated by fine-tuning two PLMs (pre-trained\nlanguage models) for three downstream tasks. The experiment results show that\nTSAttacker is effective and generates high-quality adversarial samples. 
In\naddition, the robustness of the involved models still has much room for\nimprovement.\n","authors":["Xi Cao","Dolma Dawa","Nuo Qun","Trashi Nyima"],"pdf_url":"https://arxiv.org/pdf/2412.02323v1.pdf","comment":"Revised Version; Accepted at ACL 2023 Workshop on TrustNLP"},{"id":"http://arxiv.org/abs/2412.02301v1","updated":"2024-12-03T09:13:52Z","published":"2024-12-03T09:13:52Z","title":"Large Multimodal Agents for Accurate Phishing Detection with Enhanced\n Token Optimization and Cost Reduction","summary":" With the rise of sophisticated phishing attacks, there is a growing need for\neffective and economical detection solutions. This paper explores the use of\nlarge multimodal agents, specifically Gemini 1.5 Flash and GPT-4o mini, to\nanalyze both URLs and webpage screenshots via APIs, thus avoiding the\ncomplexities of training and maintaining AI systems. Our findings indicate that\nintegrating these two data types substantially enhances detection performance\nover using either type alone. However, API usage incurs costs per query that\ndepend on the number of input and output tokens. To address this, we propose a\ntwo-tiered agentic approach: initially, one agent assesses the URL, and if\ninconclusive, a second agent evaluates both the URL and the screenshot. This\nmethod not only maintains robust detection performance but also significantly\nreduces API costs by minimizing unnecessary multi-input queries. Cost analysis\nshows that with the agentic approach, GPT-4o mini can process about 4.2 times\nas many websites per $100 compared to the multimodal approach (107,440 vs.\n25,626), and Gemini 1.5 Flash can process about 2.6 times more websites\n(2,232,142 vs. 862,068). 
These findings underscore the significant economic\nbenefits of the agentic approach over the multimodal method, providing a viable\nsolution for organizations aiming to leverage advanced AI for phishing\ndetection while controlling expenses.\n","authors":["Fouad Trad","Ali Chehab"],"pdf_url":"https://arxiv.org/pdf/2412.02301v1.pdf","comment":"Accepted in the 2nd International Conference on Foundation and Large\n Language Models (FLLM2024)"},{"id":"http://arxiv.org/abs/2412.02290v1","updated":"2024-12-03T09:07:13Z","published":"2024-12-03T09:07:13Z","title":"Characterizing Information Shared by Participants to Coding Challenges:\n The Case of Advent of Code","summary":" Advent of Code (AoC from now on) is a popular coding challenge requiring to\nsolve programming puzzles for a variety of skill sets and levels. AoC follows\nthe advent calendar, therefore it is an annual challenge that lasts for 25\ndays. AoC participants usually post their solutions on social networks and\ndiscuss them online. These challenges are interesting to study since they could\nhighlight the adoption of new tools, the evolution of the developer community,\nor the technological requirements of well-known companies. For these reasons,\nwe first create a dataset of the 2019-2021 AoC editions containing the\ndiscussion threads made on the subreddit {\\tt /r/adventofcode}. Then, we\npropose a model based on stream graphs to best study this context, where we\nrepresent its most important actors through time: participants, comments, and\nprogramming languages. Thanks to our model, we investigate user participation,\nadoption of new programming languages during a challenge and between two of\nthem, and resiliency of programming languages based on a Stack Overflow survey.\nWe find that the top-used programming languages are almost the same in the\nthree years, pointing out their importance. 
Moreover, participants tend to keep\nthe same programming language for the whole challenge, while the ones attending\ntwo AoCs usually change it in the next one. Finally, we observe interesting\nresults about the programming languages that are ``Popular'' or ``Loved''\naccording to the Stack Overflow survey. Firstly, these are the ones adopted for\nthe longest time in an AoC edition, thanks to which users have a high chance of\nreaching the end of the challenge. Secondly, they are the most chosen when a\nparticipant decides to change programming language during the same challenge.\n","authors":["Francesco Cauteruccio","Enrico Corradini","Luca Virgili"],"pdf_url":"https://arxiv.org/pdf/2412.02290v1.pdf","comment":"10 pages, 7 figures"},{"id":"http://arxiv.org/abs/2411.11465v2","updated":"2024-12-03T09:04:35Z","published":"2024-11-18T10:58:46Z","title":"Re-examining learning linear functions in context","summary":" In context learning (ICL) is an attractive method of solving a wide range of\nproblems. Inspired by Garg et al. (2022), we look closely at ICL in a variety\nof train and test settings for several transformer models of different sizes\ntrained from scratch. Our study complements prior work by pointing out several\nsystematic failures of these models to generalize to data not in the training\ndistribution, thereby showing some limitations of ICL. 
We find that models\nadopt a strategy for this task that is very different from standard solutions.\n","authors":["Omar Naim","Guilhem Fouilhé","Nicholas Asher"],"pdf_url":"https://arxiv.org/pdf/2411.11465v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02279v1","updated":"2024-12-03T08:54:17Z","published":"2024-12-03T08:54:17Z","title":"A Comprehensive Evaluation of Large Language Models on Aspect-Based\n Sentiment Analysis","summary":" Recently, Large Language Models (LLMs) have garnered increasing attention in\nthe field of natural language processing, revolutionizing numerous downstream\ntasks with powerful reasoning and generation abilities. For example, In-Context\nLearning (ICL) introduces a fine-tuning-free paradigm, allowing out-of-the-box\nLLMs to execute downstream tasks by analogy learning without any fine-tuning.\nBesides, in a fine-tuning-dependent paradigm where substantial training data\nexists, Parameter-Efficient Fine-Tuning (PEFT), as the cost-effective methods,\nenable LLMs to achieve excellent performance comparable to full fine-tuning.\n However, these fascinating techniques employed by LLMs have not been fully\nexploited in the ABSA field. Previous works probe LLMs in ABSA by merely using\nrandomly selected input-output pairs as demonstrations in ICL, resulting in an\nincomplete and superficial evaluation. In this paper, we shed light on a\ncomprehensive evaluation of LLMs in the ABSA field, involving 13 datasets, 8\nABSA subtasks, and 6 LLMs. Specifically, we design a unified task formulation\nto unify ``multiple LLMs for multiple ABSA subtasks in multiple paradigms.''\nFor the fine-tuning-dependent paradigm, we efficiently fine-tune LLMs using\ninstruction-based multi-task learning. For the fine-tuning-free paradigm, we\npropose 3 demonstration selection strategies to stimulate the few-shot\nabilities of LLMs. 
Our extensive experiments demonstrate that LLMs achieve a\nnew state-of-the-art performance compared to fine-tuned Small Language Models\n(SLMs) in the fine-tuning-dependent paradigm. More importantly, in the\nfine-tuning-free paradigm where SLMs are ineffective, LLMs with ICL still\nshowcase impressive potential and even compete with fine-tuned SLMs on some\nABSA subtasks.\n","authors":["Changzhi Zhou","Dandan Song","Yuhang Tian","Zhijing Wu","Hao Wang","Xinyu Zhang","Jun Yang","Ziyi Yang","Shuhao Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.02279v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02271v1","updated":"2024-12-03T08:41:13Z","published":"2024-12-03T08:41:13Z","title":"MediaSpin: Exploring Media Bias Through Fine-Grained Analysis of News\n Headlines","summary":" In this paper, we introduce the MediaSpin dataset aiming to help in the\ndevelopment of models that can detect different forms of media bias present in\nnews headlines, developed through human-supervised and -validated Large\nLanguage Model (LLM) labeling of media bias. This corpus comprises 78,910 pairs\nof news headlines and annotations with explanations of the 13 distinct types of\nmedia bias categories assigned. We demonstrate the usefulness of our dataset\nfor automated bias detection in news edits.\n","authors":["Preetika Verma","Kokil Jaidka"],"pdf_url":"https://arxiv.org/pdf/2412.02271v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02252v1","updated":"2024-12-03T08:29:27Z","published":"2024-12-03T08:29:27Z","title":"Compressing KV Cache for Long-Context LLM Inference with Inter-Layer\n Attention Similarity","summary":" The increasing context window size in Large Language Models (LLMs), such as\nthe GPT and LLaMA series, has improved their ability to tackle complex,\nlong-text tasks, but at the cost of inference efficiency, particularly\nregarding memory and computational complexity. 
Existing methods, including\nselective token retention and window-based attention, improve efficiency but\nrisk discarding important tokens needed for future text generation. In this\npaper, we propose an approach that enhances LLM efficiency without token loss\nby reducing the memory and computational load of less important tokens, rather\nthan discarding them. We address two challenges: 1) investigating the\ndistribution of important tokens in the context, discovering recent tokens are\nmore important than distant tokens in context, and 2) optimizing resources for\ndistant tokens by sharing attention scores across layers. The experiments show\nthat our method saves $35\\%$ KV cache without compromising the performance.\n","authors":["Da Ma","Lu Chen","Situo Zhang","Yuxun Miao","Su Zhu","Zhi Chen","Hongshen Xu","Hanqi Li","Shuai Fan","Lei Pan","Kai Yu"],"pdf_url":"https://arxiv.org/pdf/2412.02252v1.pdf","comment":"preprint"},{"id":"http://arxiv.org/abs/2406.07522v2","updated":"2024-12-03T08:27:49Z","published":"2024-06-11T17:50:51Z","title":"Samba: Simple Hybrid State Space Models for Efficient Unlimited Context\n Language Modeling","summary":" Efficiently modeling sequences with infinite context length has long been a\nchallenging problem. Previous approaches have either suffered from quadratic\ncomputational complexity or limited extrapolation ability in length\ngeneralization. In this work, we present Samba, a simple hybrid architecture\nthat layer-wise combines Mamba, a selective State Space Model (SSM), with\nSliding Window Attention (SWA). Samba selectively compresses a given sequence\ninto recurrent hidden states while still maintaining the ability to precisely\nrecall recent memories with the attention mechanism. We scale Samba up to 3.8B\nparameters with 3.2T training tokens and demonstrate that it significantly\noutperforms state-of-the-art models across a variety of benchmarks. 
Pretrained\non sequences of 4K length, Samba shows improved perplexity in context lengths\nof up to 1M in zero-shot. When finetuned on 4K-length sequences, Samba\nefficiently extrapolates to a 256K context length with perfect memory recall on\nthe Passkey Retrieval task, and exhibits superior retrieval extrapolation on\nthe challenging Phonebook task compared to full-attention models. As a\nlinear-time sequence model, Samba achieves a 3.73x higher throughput compared\nto Transformers with grouped-query attention for user prompts of 128K length,\nand a 3.64x speedup when generating 64K tokens with unlimited streaming. Our\ncode for training on open source data is publicly available at\nhttps://github.com/microsoft/Samba.\n","authors":["Liliang Ren","Yang Liu","Yadong Lu","Yelong Shen","Chen Liang","Weizhu Chen"],"pdf_url":"https://arxiv.org/pdf/2406.07522v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.14026v3","updated":"2024-12-03T08:03:25Z","published":"2024-06-20T06:46:23Z","title":"Demystifying Language Model Forgetting with Low-rank Example\n Associations","summary":" Large Language models (LLMs) suffer from forgetting of upstream data when\nfine-tuned. Despite efforts on mitigating forgetting, few have investigated\nwhether, and how forgotten upstream examples are dependent on and associated\nwith newly learned tasks. Insights on such associations enable efficient and\ntargeted mitigation of forgetting. In this paper, we empirically analyze\nforgetting (measured in log-perplexity increase) that occurs in $N$ upstream\nexamples of language modeling or instruction-tuning after fine-tuning LLMs on\none of $M$ new tasks, visualized in $M\\times N$ matrices. We demonstrate that\nthe matrices display simple low-rank patterns, often well-approximated with\nmultiplicative scalar effects of upstream examples and newly learned tasks. 
We\nalso examine fine-grained associations with visualization and statistics.\nLeveraging the low-rank nature of the associations, we predict forgetting of\nupstream examples when fine-tuning on unseen tasks with matrix completion over\nthe empirical associations. This enables fast identification of most forgotten\nexamples without expensive inference on the entire upstream data. The approach,\ndespite simplicity, outperforms prior approaches that learn semantic\nrelationships of learned tasks and upstream examples with LMs for predicting\nforgetting. We demonstrate the practical utility of our analysis by showing\nstatistically significantly reduced forgetting as we upweight predicted\nexamples for replay at fine-tuning. Project page:\nhttps://inklab.usc.edu/lm-forgetting-prediction/\n","authors":["Xisen Jin","Xiang Ren"],"pdf_url":"https://arxiv.org/pdf/2406.14026v3.pdf","comment":"10 pages; preprint"},{"id":"http://arxiv.org/abs/2311.15316v3","updated":"2024-12-03T07:57:33Z","published":"2023-11-26T14:35:23Z","title":"Sibyl: Empowering Empathetic Dialogue Generation in Large Language\n Models via Sensible and Visionary Commonsense Inference","summary":" Recently, there has been a heightened interest in building chatbots based on\nLarge Language Models (LLMs) to emulate human-like qualities in multi-turn\nconversations. Despite having access to commonsense knowledge to better\nunderstand the psychological aspects and causality of dialogue context, even\nthese powerful LLMs struggle to achieve the goals of empathy and emotional\nsupport. Current commonsense knowledge derived from dialogue contexts is\ninherently limited and often fails to adequately anticipate the future course\nof a dialogue. This lack of foresight can mislead LLMs and hinder their ability\nto provide effective support. In response to this challenge, we present an\ninnovative framework named Sensible and Visionary Commonsense Knowledge\n(Sibyl). 
Designed to concentrate on the immediately succeeding dialogue, this\nparadigm equips LLMs with the capability to uncover the implicit requirements\nof the conversation, aiming to elicit more empathetic responses. Experimental\nresults demonstrate that incorporating our paradigm for acquiring commonsense\nknowledge into LLMs comprehensively enhances the quality of their responses.\n","authors":["Lanrui Wang","Jiangnan Li","Chenxu Yang","Zheng Lin","Hongyin Tang","Huan Liu","Yanan Cao","Jingang Wang","Weiping Wang"],"pdf_url":"https://arxiv.org/pdf/2311.15316v3.pdf","comment":"Accepted by COLING 2025"},{"id":"http://arxiv.org/abs/2412.01408v2","updated":"2024-12-03T07:52:35Z","published":"2024-12-02T11:51:19Z","title":"Towards Cross-Lingual Audio Abuse Detection in Low-Resource Settings\n with Few-Shot Learning","summary":" Online abusive content detection, particularly in low-resource settings and\nwithin the audio modality, remains underexplored. We investigate the potential\nof pre-trained audio representations for detecting abusive language in\nlow-resource languages, in this case, in Indian languages using Few Shot\nLearning (FSL). Leveraging powerful representations from models such as Wav2Vec\nand Whisper, we explore cross-lingual abuse detection using the ADIMA dataset\nwith FSL. Our approach integrates these representations within the\nModel-Agnostic Meta-Learning (MAML) framework to classify abusive language in\n10 languages. We experiment with various shot sizes (50-200) evaluating the\nimpact of limited data on performance. Additionally, a feature visualization\nstudy was conducted to better understand model behaviour. 
This study highlights\nthe generalization ability of pre-trained models in low-resource scenarios and\noffers valuable insights into detecting abusive language in multilingual\ncontexts.\n","authors":["Aditya Narayan Sankaran","Reza Farahbakhsh","Noel Crespi"],"pdf_url":"https://arxiv.org/pdf/2412.01408v2.pdf","comment":"Accepted as part of the proceedings of COLING 2025"},{"id":"http://arxiv.org/abs/2412.02228v1","updated":"2024-12-03T07:51:14Z","published":"2024-12-03T07:51:14Z","title":"BANER: Boundary-Aware LLMs for Few-Shot Named Entity Recognition","summary":" Despite the recent success of two-stage prototypical networks in few-shot\nnamed entity recognition (NER), challenges such as over/under-detected false\nspans in the span detection stage and unaligned entity prototypes in the type\nclassification stage persist. Additionally, LLMs have not proven to be\neffective few-shot information extractors in general. In this paper, we propose\nan approach called Boundary-Aware LLMs for Few-Shot Named Entity Recognition to\naddress these issues. We introduce a boundary-aware contrastive learning\nstrategy to enhance the LLM's ability to perceive entity boundaries for\ngeneralized entity spans. Additionally, we utilize LoRAHub to align information\nfrom the target domain to the source domain, thereby enhancing adaptive\ncross-domain classification capabilities. Extensive experiments across various\nbenchmarks demonstrate that our framework outperforms prior methods, validating\nits effectiveness. In particular, the proposed strategies demonstrate\neffectiveness across a range of LLM architectures. 
The code and data are\nreleased on https://github.com/UESTC-GQJ/BANER.\n","authors":["Quanjiang Guo","Yihong Dong","Ling Tian","Zhao Kang","Yu Zhang","Sijie Wang"],"pdf_url":"https://arxiv.org/pdf/2412.02228v1.pdf","comment":"Appear on COLING 2025"},{"id":"http://arxiv.org/abs/2411.06638v2","updated":"2024-12-03T07:40:40Z","published":"2024-11-11T00:18:54Z","title":"Model Editing for LLMs4Code: How Far are We?","summary":" Large Language Models for Code (LLMs4Code) have been found to exhibit\noutstanding performance in the software engineering domain, especially the\nremarkable performance in coding tasks. However, even the most advanced\nLLMs4Code can inevitably contain incorrect or outdated code knowledge. Due to\nthe high cost of training LLMs4Code, it is impractical to re-train the models\nfor fixing these problematic code knowledge. Model editing is a new technical\nfield for effectively and efficiently correcting erroneous knowledge in LLMs,\nwhere various model editing techniques and benchmarks have been proposed\nrecently. Despite that, a comprehensive study that thoroughly compares and\nanalyzes the performance of the state-of-the-art model editing techniques for\nadapting the knowledge within LLMs4Code across various code-related tasks is\nnotably absent. To bridge this gap, we perform the first systematic study on\napplying state-of-the-art model editing approaches to repair the inaccuracy of\nLLMs4Code. To that end, we introduce a benchmark named CLMEEval, which consists\nof two datasets, i.e., CoNaLa-Edit (CNLE) with 21K+ code generation samples and\nCodeSearchNet-Edit (CSNE) with 16K+ code summarization samples. With the help\nof CLMEEval, we evaluate six advanced model editing techniques on three\nLLMs4Code: CodeLlama (7B), CodeQwen1.5 (7B), and Stable-Code (3B). 
Our findings\ninclude that the external memorization-based GRACE approach achieves the best\nknowledge editing effectiveness and specificity (the editing does not influence\nuntargeted knowledge), while generalization (whether the editing can generalize\nto other semantically-identical inputs) is a universal challenge for existing\ntechniques. Furthermore, building on in-depth case analysis, we introduce an\nenhanced version of GRACE called A-GRACE, which incorporates contrastive\nlearning to better capture the semantics of the inputs.\n","authors":["Xiaopeng Li","Shangwen Wang","Shasha Li","Jun Ma","Jie Yu","Xiaodong Liu","Jing Wang","Bin Ji","Weimin Zhang"],"pdf_url":"https://arxiv.org/pdf/2411.06638v2.pdf","comment":"Accepted by ICSE2025. The code is available at:\n https://github.com/xpq-tech/code-llmedit.git"},{"id":"http://arxiv.org/abs/2403.08978v2","updated":"2024-12-03T07:36:47Z","published":"2024-03-13T22:06:03Z","title":"AutoGuide: Automated Generation and Selection of Context-Aware\n Guidelines for Large Language Model Agents","summary":" Recent advances in large language models (LLMs) have empowered AI agents\ncapable of performing various sequential decision-making tasks. However,\neffectively guiding LLMs to perform well in unfamiliar domains like web\nnavigation, where they lack sufficient knowledge, has proven to be difficult\nwith the demonstration-based in-context learning paradigm. In this paper, we\nintroduce a novel framework, called AutoGuide, which addresses this limitation\nby automatically generating context-aware guidelines from offline experiences.\nImportantly, each context-aware guideline is expressed in concise natural\nlanguage and follows a conditional structure, clearly describing the context\nwhere it is applicable. As a result, our guidelines facilitate the provision of\nrelevant knowledge for the agent's current decision-making process, overcoming\nthe limitations of the conventional demonstration-based learning paradigm. 
Our\nevaluation demonstrates that AutoGuide significantly outperforms competitive\nbaselines in complex benchmark domains, including real-world web navigation.\n","authors":["Yao Fu","Dong-Ki Kim","Jaekyeom Kim","Sungryull Sohn","Lajanugen Logeswaran","Kyunghoon Bae","Honglak Lee"],"pdf_url":"https://arxiv.org/pdf/2403.08978v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01413v2","updated":"2024-12-03T07:12:27Z","published":"2024-12-02T11:56:06Z","title":"Impromptu Cybercrime Euphemism Detection","summary":" Detecting euphemisms is essential for content security on various social\nmedia platforms, but existing methods designed for detecting euphemisms are\nineffective on impromptu euphemisms. In this work, we make a first attempt at\nexploring impromptu euphemism detection and introduce the Impromptu\nCybercrime Euphemisms Detection (ICED) dataset. Moreover, we propose a\ndetection framework tailored to this problem, which employs context\naugmentation modeling and multi-round iterative training. Our detection\nframework mainly consists of a coarse-grained and a fine-grained classification\nmodel. The coarse-grained classification model removes most of the harmless\ncontent in the corpus to be detected. The fine-grained model, an impromptu\neuphemism detector, integrates context augmentation and multi-round iterative\ntraining to better predict the actual meaning of a masked token. In addition,\nwe leverage ChatGPT to evaluate the model's capability. 
Experimental results\ndemonstrate that our approach achieves a remarkable 76-fold improvement\ncompared to the previous state-of-the-art euphemism detector.\n","authors":["Xiang Li","Yucheng Zhou","Laiping Zhao","Jing Li","Fangming Liu"],"pdf_url":"https://arxiv.org/pdf/2412.01413v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.10946v3","updated":"2024-12-03T06:52:34Z","published":"2024-02-09T04:02:43Z","title":"CultureLLM: Incorporating Cultural Differences into Large Language\n Models","summary":" Large language models (LLMs) are reported to be partial to certain cultures\nowing to the training data dominance from the English corpora. Since\nmultilingual cultural data are often expensive to collect, existing efforts\nhandle this by prompt engineering or culture-specific pre-training. However,\nthey might overlook the knowledge deficiency of low-resource culture and\nrequire extensive computing resources. In this paper, we propose CultureLLM, a\ncost-effective solution to incorporate cultural differences into LLMs.\nCultureLLM adopts World Value Survey (WVS) as seed data and generates\nsemantically equivalent training data via the proposed semantic data\naugmentation. Using only 50 seed samples from WVS with augmented data, we\nfine-tune culture-specific LLMs and one unified model (CultureLLM-One) for 9\ncultures covering rich and low-resource languages. Extensive experiments on 60\nculture-related datasets demonstrate that CultureLLM significantly outperforms\nvarious counterparts such as GPT-3.5 (by 8.1%) and Gemini Pro (by 9.5%) with\ncomparable performance to GPT-4 or even better. Our human study shows that the\ngenerated samples are semantically equivalent to the original samples,\nproviding an effective solution for LLMs augmentation. 
Code is released at\nhttps://github.com/Scarelette/CultureLLM.\n","authors":["Cheng Li","Mengzhou Chen","Jindong Wang","Sunayana Sitaram","Xing Xie"],"pdf_url":"https://arxiv.org/pdf/2402.10946v3.pdf","comment":"NeurIPS 2024; Code is at https://github.com/Scarelette/CultureLLM"},{"id":"http://arxiv.org/abs/2412.02205v1","updated":"2024-12-03T06:47:15Z","published":"2024-12-03T06:47:15Z","title":"DataLab: A Unifed Platform for LLM-Powered Business Intelligence","summary":" Business intelligence (BI) transforms large volumes of data within modern\norganizations into actionable insights for informed decision-making. Recently,\nlarge language model (LLM)-based agents have streamlined the BI workflow by\nautomatically performing task planning, reasoning, and actions in executable\nenvironments based on natural language (NL) queries. However, existing\napproaches primarily focus on individual BI tasks such as NL2SQL and NL2VIS.\nThe fragmentation of tasks across different data roles and tools leads to\ninefficiencies and potential errors due to the iterative and collaborative\nnature of BI. In this paper, we introduce DataLab, a unified BI platform that\nintegrates a one-stop LLM-based agent framework with an augmented computational\nnotebook interface. DataLab supports a wide range of BI tasks for different\ndata roles by seamlessly combining LLM assistance with user customization\nwithin a single environment. To achieve this unification, we design a domain\nknowledge incorporation module tailored for enterprise-specific BI tasks, an\ninter-agent communication mechanism to facilitate information sharing across\nthe BI workflow, and a cell-based context management strategy to enhance\ncontext utilization efficiency in BI notebooks. Extensive experiments\ndemonstrate that DataLab achieves state-of-the-art performance on various BI\ntasks across popular research benchmarks. 
Moreover, DataLab maintains high\neffectiveness and efficiency on real-world datasets from Tencent, achieving up\nto a 58.58% increase in accuracy and a 61.65% reduction in token cost on\nenterprise-specific BI tasks.\n","authors":["Luoxuan Weng","Yinghao Tang","Yingchaojie Feng","Zhuo Chang","Peng Chen","Ruiqin Chen","Haozhe Feng","Chen Hou","Danqing Huang","Yang Li","Huaming Rao","Haonan Wang","Canshi Wei","Xiaofeng Yang","Yuhui Zhang","Yifeng Zheng","Xiuqi Huang","Minfeng Zhu","Yuxin Ma","Bin Cui","Wei Chen"],"pdf_url":"https://arxiv.org/pdf/2412.02205v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.15143v3","updated":"2024-12-03T06:43:39Z","published":"2024-05-24T01:45:27Z","title":"Intelligent Go-Explore: Standing on the Shoulders of Giant Foundation\n Models","summary":" Go-Explore is a powerful family of algorithms designed to solve\nhard-exploration problems built on the principle of archiving discovered\nstates, and iteratively returning to and exploring from the most promising\nstates. This approach has led to superhuman performance across a wide variety\nof challenging problems including Atari games and robotic control, but requires\nmanually designing heuristics to guide exploration (i.e., determine which\nstates to save and explore from, and what actions to consider next), which is\ntime-consuming and infeasible in general. To resolve this, we propose\nIntelligent Go-Explore (IGE) which greatly extends the scope of the original\nGo-Explore by replacing these handcrafted heuristics with the intelligence and\ninternalized human notions of interestingness captured by giant pretrained\nfoundation models (FMs). This provides IGE with a human-like ability to\ninstinctively identify how interesting or promising any new state is (e.g.,\ndiscovering new objects, locations, or behaviors), even in complex environments\nwhere heuristics are hard to define. 
Moreover, IGE offers the exciting\nopportunity to recognize and capitalize on serendipitous discoveries-states\nencountered during exploration that are valuable in terms of exploration, yet\nwhere what makes them interesting was not anticipated by the human user. We\nevaluate our algorithm on a diverse range of language and vision-based tasks\nthat require search and exploration. Across these tasks, IGE strongly exceeds\nclassic reinforcement learning and graph search baselines, and also succeeds\nwhere prior state-of-the-art FM agents like Reflexion completely fail. Overall,\nIntelligent Go-Explore combines the tremendous strengths of FMs and the\npowerful Go-Explore algorithm, opening up a new frontier of research into\ncreating more generally capable agents with impressive exploration\ncapabilities.\n","authors":["Cong Lu","Shengran Hu","Jeff Clune"],"pdf_url":"https://arxiv.org/pdf/2405.15143v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.05403v2","updated":"2024-12-03T05:18:59Z","published":"2022-11-10T08:13:19Z","title":"Enabling Efficient Attack Investigation via Human-in-the-Loop Security\n Analysis","summary":" System auditing is a vital technique for collecting system call events as\nsystem provenance and investigating complex multi-step attacks such as Advanced\nPersistent Threats. However, existing attack investigation methods struggle to\nuncover long attack sequences due to the massive volume of system provenance\ndata and their inability to focus on attack-relevant parts. 
In this paper, we\npresent Raptor, a defense system that enables human analysts to effectively\nanalyze large-scale system provenance to reveal multi-step attack sequences.\nRaptor introduces an expressive domain-specific language, ProvQL, that offers\nessential primitives for various types of attack analyses (e.g., attack pattern\nsearch, attack dependency tracking) with user-defined constraints, enabling\nanalysts to focus on attack-relevant parts and iteratively sift through the\nlarge provenance data. Moreover, Raptor provides an optimized execution engine\nfor efficient language execution. Our extensive evaluations on a wide range of\nattack scenarios demonstrate the practical effectiveness of Raptor in\nfacilitating timely attack investigation.\n","authors":["Xinyu Yang","Haoyuan Liu","Saimon Amanuel Tsegai","Peng Gao"],"pdf_url":"https://arxiv.org/pdf/2211.05403v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02172v1","updated":"2024-12-03T05:04:49Z","published":"2024-12-03T05:04:49Z","title":"VISCO: Benchmarking Fine-Grained Critique and Correction Towards\n Self-Improvement in Visual Reasoning","summary":" The ability of large vision-language models (LVLMs) to critique and correct\ntheir reasoning is an essential building block towards their self-improvement.\nHowever, a systematic analysis of such capabilities in LVLMs is still lacking.\nWe propose VISCO, the first benchmark to extensively analyze the fine-grained\ncritique and correction capabilities of LVLMs. Compared to existing work that\nuses a single scalar value to critique the entire reasoning [4], VISCO features\ndense and fine-grained critique, requiring LVLMs to evaluate the correctness of\neach step in the chain-of-thought and provide natural language explanations to\nsupport their judgments. Extensive evaluation of 24 LVLMs demonstrates that\nhuman-written critiques significantly enhance the performance after correction,\nshowcasing the potential of the self-improvement strategy. 
However, the\nmodel-generated critiques are less helpful and sometimes detrimental to the\nperformance, suggesting that critique is the crucial bottleneck. We identified\nthree common patterns in critique failures: failure to critique visual\nperception, reluctance to \"say no\", and exaggerated assumption of error\npropagation. To address these issues, we propose an effective LookBack strategy\nthat revisits the image to verify each piece of information in the initial\nreasoning. LookBack significantly improves critique and correction performance\nby up to 13.5%.\n","authors":["Xueqing Wu","Yuheng Ding","Bingxuan Li","Pan Lu","Da Yin","Kai-Wei Chang","Nanyun Peng"],"pdf_url":"https://arxiv.org/pdf/2412.02172v1.pdf","comment":"Project: https://visco-benchmark.github.io/"},{"id":"http://arxiv.org/abs/2411.16495v2","updated":"2024-12-03T05:00:18Z","published":"2024-11-25T15:35:51Z","title":"AtomR: Atomic Operator-Empowered Large Language Models for Heterogeneous\n Knowledge Reasoning","summary":" Recent advancements in large language models (LLMs) have led to significant\nimprovements in various natural language processing tasks, but it is still\nchallenging for LLMs to perform knowledge-intensive complex question answering\ndue to LLMs' inefficacy in reasoning planning and the hallucination problem. A\ntypical solution is to employ retrieval-augmented generation (RAG) coupled with\nchain-of-thought (CoT) reasoning, which decomposes complex questions into\nchain-like sub-questions and applies iterative RAG at each sub-question.\nHowever, prior works exhibit sub-optimal reasoning planning and overlook\ndynamic knowledge retrieval from heterogeneous sources. In this paper, we\npropose AtomR, a novel heterogeneous knowledge reasoning framework that\nconducts multi-source reasoning at the atomic level. 
Drawing inspiration from\nthe graph modeling of knowledge, AtomR leverages large language models (LLMs)\nto decompose complex questions into combinations of three atomic knowledge\noperators, significantly enhancing the reasoning process at both the planning\nand execution stages. We also introduce BlendQA, a novel evaluation benchmark\ntailored to assess complex heterogeneous knowledge reasoning. Experiments show\nthat AtomR significantly outperforms state-of-the-art baselines across three\nsingle-source and two multi-source reasoning benchmarks, with notable\nperformance gains of 9.4% on 2WikiMultihop and 9.5% on BlendQA.\n","authors":["Amy Xin","Jinxin Liu","Zijun Yao","Zhicheng Lee","Shulin Cao","Lei Hou","Juanzi Li"],"pdf_url":"https://arxiv.org/pdf/2411.16495v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.13706v2","updated":"2024-12-03T04:57:32Z","published":"2024-06-19T16:58:32Z","title":"Developing Story: Case Studies of Generative AI's Use in Journalism","summary":" Journalists are among the many users of large language models (LLMs). To\nbetter understand the journalist-AI interactions, we conduct a study of LLM\nusage by two news agencies through browsing the WildChat dataset, identifying\ncandidate interactions, and verifying them by matching to online published\narticles. Our analysis uncovers instances where journalists provide sensitive\nmaterial such as confidential correspondence with sources or articles from\nother agencies to the LLM as stimuli and prompt it to generate articles, and\npublish these machine-generated articles with limited intervention (median\noutput-publication ROUGE-L of 0.62). 
Based on our findings, we call for further\nresearch into what constitutes responsible use of AI, and the establishment of\nclear guidelines and best practices on using LLMs in a journalistic context.\n","authors":["Natalie Grace Brigham","Chongjiu Gao","Tadayoshi Kohno","Franziska Roesner","Niloofar Mireshghallah"],"pdf_url":"https://arxiv.org/pdf/2406.13706v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01253v2","updated":"2024-12-03T04:51:10Z","published":"2024-12-02T08:22:56Z","title":"Yi-Lightning Technical Report","summary":" This technical report presents Yi-Lightning, our latest flagship large\nlanguage model (LLM). It achieves exceptional performance, ranking 6th overall\non Chatbot Arena, with particularly strong results (2nd to 4th place) in\nspecialized categories including Chinese, Math, Coding, and Hard Prompts.\nYi-Lightning leverages an enhanced Mixture-of-Experts (MoE) architecture,\nfeaturing advanced expert segmentation and routing mechanisms coupled with\noptimized KV-caching techniques. Our development process encompasses\ncomprehensive pre-training, supervised fine-tuning (SFT), and reinforcement\nlearning from human feedback (RLHF), where we devise deliberate strategies for\nmulti-stage training, synthetic data construction, and reward modeling.\nFurthermore, we implement RAISE (Responsible AI Safety Engine), a\nfour-component framework to address safety issues across pre-training,\npost-training, and serving phases. Empowered by our scalable super-computing\ninfrastructure, all these innovations substantially reduce training, deployment\nand inference costs while maintaining high-performance standards. With further\nevaluations on public academic benchmarks, Yi-Lightning demonstrates\ncompetitive performance against top-tier LLMs, while we observe a notable\ndisparity between traditional, static benchmark results and real-world, dynamic\nhuman preferences. 
This observation prompts a critical reassessment of\nconventional benchmarks' utility in guiding the development of more intelligent\nand powerful AI systems for practical applications. Yi-Lightning is now\navailable through our developer platform at https://platform.lingyiwanwu.com.\n","authors":["01. AI"," :","Alan Wake","Albert Wang","Bei Chen","C. X. Lv","Chao Li","Chengen Huang","Chenglin Cai","Chujie Zheng","Daniel Cooper","Ethan Dai","Fan Zhou","Feng Hu","Heng Ji","Howard Qiu","Jiangcheng Zhu","Jun Tian","Katherine Su","Lihuan Zhang","Liying Li","Ming Song","Mou Li","Peng Liu","Qichen Hu","Shawn Wang","Shijun Zhou","Shiyong Li","Tianhang Zhu","Wen Xie","Xiang He","Xiaobo Chen","Xiaohui Hu","Xiaoyi Ren","Xinyao Niu","Yanpeng Li","Yongke Zhao","Yongzhen Luo","Yuchi Xu","Yuxuan Sha","Zhaodong Yan","Zhiyuan Liu","Zirui Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.01253v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02164v1","updated":"2024-12-03T04:48:59Z","published":"2024-12-03T04:48:59Z","title":"A Theoretical Framework for Acoustic Neighbor Embeddings","summary":" This paper provides a theoretical framework for interpreting acoustic\nneighbor embeddings, which are representations of the phonetic content of\nvariable-width audio or text in a fixed-dimensional embedding space. A\nprobabilistic interpretation of the distances between embeddings is proposed,\nbased on a general quantitative definition of phonetic similarity between\nwords. This provides us a framework for understanding and applying the\nembeddings in a principled manner. Theoretical and empirical evidence to\nsupport an approximation of uniform cluster-wise isotropy are shown, which\nallows us to reduce the distances to simple Euclidean distances. Four\nexperiments that validate the framework and demonstrate how it can be applied\nto diverse problems are described. 
Nearest-neighbor search between audio and\ntext embeddings can give isolated word classification accuracy that is\nidentical to that of finite state transducers (FSTs) for vocabularies as large\nas 500k. Embedding distances give accuracy with 0.5% point difference compared\nto phone edit distances in out-of-vocabulary word recovery, as well as\nproducing clustering hierarchies identical to those derived from human\nlistening experiments in English dialect clustering. The theoretical framework\nalso allows us to use the embeddings to predict the expected confusion of\ndevice wake-up words. All source code and pretrained models are provided.\n","authors":["Woojay Jeon"],"pdf_url":"https://arxiv.org/pdf/2412.02164v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.00218v2","updated":"2024-12-03T04:38:31Z","published":"2024-11-29T19:25:00Z","title":"NüshuRescue: Revitalization of the endangered Nüshu Language with AI","summary":" The preservation and revitalization of endangered and extinct languages is a\nmeaningful endeavor, conserving cultural heritage while enriching fields like\nlinguistics and anthropology. However, these languages are typically\nlow-resource, making their reconstruction labor-intensive and costly. This\nchallenge is exemplified by Nüshu, a rare script historically used by Yao\nwomen in China for self-expression within a patriarchal society. To address\nthis challenge, we introduce NüshuRescue, an AI-driven framework designed to\ntrain large language models (LLMs) on endangered languages with minimal data.\nNüshuRescue automates evaluation and expands target corpora to accelerate\nlinguistic revitalization. As a foundational component, we developed NCGold, a\n500-sentence Nüshu-Chinese parallel corpus, the first publicly available\ndataset of its kind. 
Leveraging GPT-4-Turbo, with no prior exposure to Nüshu\nand only 35 short examples from NCGold, NüshuRescue achieved 48.69%\ntranslation accuracy on 50 withheld sentences and generated NCSilver, a set of\n98 newly translated modern Chinese sentences of varying lengths. A sample of\nboth NCGold and NCSilver is included in the Supplementary Materials.\nAdditionally, we developed FastText-based and Seq2Seq models to further support\nresearch on Nüshu. NüshuRescue provides a versatile and scalable tool for\nthe revitalization of endangered languages, minimizing the need for extensive\nhuman input.\n","authors":["Ivory Yang","Weicheng Ma","Soroush Vosoughi"],"pdf_url":"https://arxiv.org/pdf/2412.00218v2.pdf","comment":"Accepted to COLING 2025"},{"id":"http://arxiv.org/abs/2412.01269v2","updated":"2024-12-03T04:37:03Z","published":"2024-12-02T08:35:54Z","title":"CPRM: A LLM-based Continual Pre-training Framework for Relevance\n Modeling in Commercial Search","summary":" Relevance modeling between queries and items stands as a pivotal component in\ncommercial search engines, directly affecting the user experience. Given the\nremarkable achievements of large language models (LLMs) in various natural\nlanguage processing (NLP) tasks, LLM-based relevance modeling is gradually\nbeing adopted within industrial search systems. Nevertheless, foundational LLMs\nlack domain-specific knowledge and do not fully exploit the potential of\nin-context learning. Furthermore, structured item text remains underutilized,\nand there is a shortage in the supply of corresponding queries and background\nknowledge. We thereby propose CPRM (Continual Pre-training for Relevance\nModeling), a framework designed for the continual pre-training of LLMs to\naddress these issues. 
Our CPRM framework includes three modules: 1) jointly pre-training on\nboth queries and multi-field items to enhance domain\nknowledge, 2) applying in-context pre-training, a novel approach where LLMs are\npre-trained on a sequence of related queries or items, and 3) conducting\nreading comprehension on items to produce associated domain knowledge and\nbackground information (e.g., generating summaries and corresponding queries)\nto further strengthen LLMs. Results on offline experiments and online A/B\ntesting demonstrate that our model achieves convincing performance compared to\nstrong baselines.\n","authors":["Kaixin Wu","Yixin Ji","Zeyuan Chen","Qiang Wang","Cunxiang Wang","Hong Liu","Baijun Ji","Jia Xu","Zhongyi Liu","Jinjie Gu","Yuan Zhou","Linjian Mo"],"pdf_url":"https://arxiv.org/pdf/2412.01269v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.10264v2","updated":"2024-12-03T04:35:12Z","published":"2024-06-11T06:26:04Z","title":"Large Language Model-empowered multimodal strain sensory system for\n shape recognition, monitoring, and human interaction of tensegrity","summary":" A tensegrity-based system is a promising approach for dynamic exploration of\nuneven and unpredictable environments, particularly space exploration.\nHowever, implementing such systems presents challenges in terms of intelligent\naspects: state recognition, wireless monitoring, human interaction, and smart\nanalysis and advisory functions. Here, we introduce a 6-strut tensegrity\nintegrated with 24 multimodal strain sensors, leveraging both a deep learning\nmodel and large language models to realize a smart tensegrity. Using conductive\nflexible tendons assisted by a long short-term memory model, the tensegrity\nachieves self-shape reconstruction without external sensors. 
Through\nintegrating a Flask server and the gpt-3.5-turbo model, the tensegrity\nautonomously sends data to an iPhone for wireless monitoring and\nprovides data analysis, explanation, prediction, and suggestions to humans for\ndecision making. Finally, the human interaction system of the tensegrity helps\nhumans obtain necessary information about the tensegrity in natural\nlanguage. Overall, this intelligent tensegrity-based system with self-sensing\ntendons showcases potential for future exploration, making it a versatile tool\nfor real-world applications.\n","authors":["Zebing Mao","Ryota Kobayashi","Hiroyuki Nabae","Koichi Suzumori"],"pdf_url":"https://arxiv.org/pdf/2406.10264v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02159v1","updated":"2024-12-03T04:34:58Z","published":"2024-12-03T04:34:58Z","title":"Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods\n and a New Transcript-Classifier Approach","summary":" Defending large language models against jailbreaks so that they never engage\nin a broadly-defined set of forbidden behaviors is an open problem. In this\npaper, we investigate the difficulty of jailbreak-defense when we only want to\nforbid a narrowly-defined set of behaviors. As a case study, we focus on\npreventing an LLM from helping a user make a bomb. We find that popular\ndefenses such as safety training, adversarial training, and input/output\nclassifiers are unable to fully solve this problem. In pursuit of a better\nsolution, we develop a transcript-classifier defense which outperforms the\nbaseline defenses we test. However, our classifier defense still fails in some\ncircumstances, which highlights the difficulty of jailbreak-defense even in a\nnarrow domain.\n","authors":["Tony T. 
Wang","John Hughes","Henry Sleight","Rylan Schaeffer","Rajashree Agrawal","Fazl Barez","Mrinank Sharma","Jesse Mu","Nir Shavit","Ethan Perez"],"pdf_url":"https://arxiv.org/pdf/2412.02159v1.pdf","comment":"Accepted to the AdvML-Frontiers and SoLaR workshops at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2410.12361v3","updated":"2024-12-03T04:34:09Z","published":"2024-10-16T08:24:09Z","title":"Proactive Agent: Shifting LLM Agents from Reactive Responses to Active\n Assistance","summary":" Agents powered by large language models have shown remarkable abilities in\nsolving complex tasks. However, most agent systems remain reactive, limiting\ntheir effectiveness in scenarios requiring foresight and autonomous\ndecision-making. In this paper, we tackle the challenge of developing proactive\nagents capable of anticipating and initiating tasks without explicit human\ninstructions. We propose a novel data-driven approach for this problem.\nFirstly, we collect real-world human activities to generate proactive task\npredictions. These predictions are then labeled by human annotators as either\naccepted or rejected. The labeled data is used to train a reward model that\nsimulates human judgment and serves as an automatic evaluator of the\nproactiveness of LLM agents. Building on this, we develop a comprehensive data\ngeneration pipeline to create a diverse dataset, ProactiveBench, containing\n6,790 events. Finally, we demonstrate that fine-tuning models with the proposed\nProactiveBench can significantly elicit the proactiveness of LLM agents.\nExperimental results show that our fine-tuned model achieves an F1-Score of\n66.47% in proactively offering assistance, outperforming all open-source and\nclosed-source models. 
These results highlight the potential of our method in\ncreating more proactive and effective agent systems, paving the way for future\nadvancements in human-agent collaboration.\n","authors":["Yaxi Lu","Shenzhi Yang","Cheng Qian","Guirong Chen","Qinyu Luo","Yesai Wu","Huadong Wang","Xin Cong","Zhong Zhang","Yankai Lin","Weiwen Liu","Yasheng Wang","Zhiyuan Liu","Fangming Liu","Maosong Sun"],"pdf_url":"https://arxiv.org/pdf/2410.12361v3.pdf","comment":"9 pages, 4 figures"},{"id":"http://arxiv.org/abs/2410.18142v2","updated":"2024-12-03T04:19:36Z","published":"2024-10-22T13:03:28Z","title":"Analyzing Nobel Prize Literature with Large Language Models","summary":" This study examines the capabilities of advanced Large Language Models\n(LLMs), particularly the o1 model, in the context of literary analysis. The\noutputs of these models are compared directly to those produced by\ngraduate-level human participants. By focusing on two Nobel Prize-winning short\nstories, 'Nine Chapters' by Han Kang, the 2024 laureate, and 'Friendship' by\nJon Fosse, the 2023 laureate, the research explores the extent to which AI can\nengage with complex literary elements such as thematic analysis,\nintertextuality, cultural and historical contexts, linguistic and structural\ninnovations, and character development. Given the Nobel Prize's prestige and\nits emphasis on cultural, historical, and linguistic richness, applying LLMs to\nthese works provides a deeper understanding of both human and AI approaches to\ninterpretation. The study uses qualitative and quantitative evaluations of\ncoherence, creativity, and fidelity to the text, revealing the strengths and\nlimitations of AI in tasks typically reserved for human expertise. While LLMs\ndemonstrate strong analytical capabilities, particularly in structured tasks,\nthey often fall short in emotional nuance and coherence, areas where human\ninterpretation excels. 
This research underscores the potential for human-AI\ncollaboration in the humanities, opening new opportunities in literary studies\nand beyond.\n","authors":["Zhenyuan Yang","Zhengliang Liu","Jing Zhang","Cen Lu","Jiaxin Tai","Tianyang Zhong","Yiwei Li","Siyan Zhao","Teng Yao","Qing Liu","Jinlin Yang","Qixin Liu","Zhaowei Li","Kexin Wang","Longjun Ma","Dajiang Zhu","Yudan Ren","Bao Ge","Wei Zhang","Ning Qiang","Tuo Zhang","Tianming Liu"],"pdf_url":"https://arxiv.org/pdf/2410.18142v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02149v1","updated":"2024-12-03T04:09:36Z","published":"2024-12-03T04:09:36Z","title":"Leveraging Large Language Models for Comparative Literature\n Summarization with Reflective Incremental Mechanisms","summary":" In this paper, we introduce ChatCite, a novel method leveraging large\nlanguage models (LLMs) for generating comparative literature summaries. The\nability to summarize research papers with a focus on key comparisons between\nstudies is an essential task in academic research. Existing summarization\nmodels, while effective at generating concise summaries, fail to provide deep\ncomparative insights. ChatCite addresses this limitation by incorporating a\nmulti-step reasoning mechanism that extracts critical elements from papers,\nincrementally builds a comparative summary, and refines the output through a\nreflective memory process. We evaluate ChatCite on a custom dataset,\nCompLit-LongContext, consisting of 1000 research papers with annotated\ncomparative summaries. Experimental results show that ChatCite outperforms\nseveral baseline methods, including GPT-4, BART, T5, and CoT, across various\nautomatic evaluation metrics such as ROUGE and the newly proposed G-Score.\nHuman evaluation further confirms that ChatCite generates more coherent,\ninsightful, and fluent summaries compared to these baseline models. 
Our method\nprovides a significant advancement in automatic literature review generation,\noffering researchers a powerful tool for efficiently comparing and synthesizing\nscientific research.\n","authors":["Fernando Gabriela Garcia","Spencer Burns","Harrison Fuller"],"pdf_url":"https://arxiv.org/pdf/2412.02149v1.pdf","comment":"8 pages"},{"id":"http://arxiv.org/abs/2412.02142v1","updated":"2024-12-03T03:59:03Z","published":"2024-12-03T03:59:03Z","title":"Personalized Multimodal Large Language Models: A Survey","summary":" Multimodal Large Language Models (MLLMs) have become increasingly important\ndue to their state-of-the-art performance and ability to integrate multiple\ndata modalities, such as text, images, and audio, to perform complex tasks with\nhigh accuracy. This paper presents a comprehensive survey on personalized\nmultimodal large language models, focusing on their architecture, training\nmethods, and applications. We propose an intuitive taxonomy for categorizing\nthe techniques used to personalize MLLMs to individual users, and discuss the\ntechniques accordingly. Furthermore, we discuss how such techniques can be\ncombined or adapted when appropriate, highlighting their advantages and\nunderlying rationale. We also provide a succinct summary of personalization\ntasks investigated in existing research, along with the evaluation metrics\ncommonly used. Additionally, we summarize the datasets that are useful for\nbenchmarking personalized MLLMs. Finally, we outline critical open challenges.\nThis survey aims to serve as a valuable resource for researchers and\npractitioners seeking to understand and advance the development of personalized\nmultimodal large language models.\n","authors":["Junda Wu","Hanjia Lyu","Yu Xia","Zhehao Zhang","Joe Barrow","Ishita Kumar","Mehrnoosh Mirtaheri","Hongjie Chen","Ryan A. Rossi","Franck Dernoncourt","Tong Yu","Ruiyi Zhang","Jiuxiang Gu","Nesreen K. 
Ahmed","Yu Wang","Xiang Chen","Hanieh Deilamsalehy","Namyong Park","Sungchul Kim","Huanrui Yang","Subrata Mitra","Zhengmian Hu","Nedim Lipka","Dang Nguyen","Yue Zhao","Jiebo Luo","Julian McAuley"],"pdf_url":"https://arxiv.org/pdf/2412.02142v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02141v1","updated":"2024-12-03T03:57:24Z","published":"2024-12-03T03:57:24Z","title":"WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image","summary":" Recent advancements in computational pathology have produced patch-level\nMulti-modal Large Language Models (MLLMs), but these models are limited by\ntheir inability to analyze whole slide images (WSIs) comprehensively and their\ntendency to bypass crucial morphological features that pathologists rely on for\ndiagnosis. To address these challenges, we first introduce WSI-Bench, a\nlarge-scale morphology-aware benchmark containing 180k VQA pairs from 9,850\nWSIs across 30 cancer types, designed to evaluate MLLMs' understanding of\nmorphological characteristics crucial for accurate diagnosis. Building upon\nthis benchmark, we present WSI-LLaVA, a novel framework for gigapixel WSI\nunderstanding that employs a three-stage training approach: WSI-text alignment,\nfeature space alignment, and task-specific instruction tuning. To better assess\nmodel performance in pathological contexts, we develop two specialized WSI\nmetrics: WSI-Precision and WSI-Relevance. 
Experimental results demonstrate that\nWSI-LLaVA outperforms existing models across all capability dimensions, with a\nsignificant improvement in morphological analysis, establishing a clear\ncorrelation between morphological understanding and diagnostic accuracy.\n","authors":["Yuci Liang","Xinheng Lyu","Meidan Ding","Wenting Chen","Jipeng Zhang","Yuexiang Ren","Xiangjian He","Song Wu","Sen Yang","Xiyue Wang","Xiaohan Xing","Linlin Shen"],"pdf_url":"https://arxiv.org/pdf/2412.02141v1.pdf","comment":"38 pages, 22 figures, 35 tables"},{"id":"http://arxiv.org/abs/2412.02138v1","updated":"2024-12-03T03:51:31Z","published":"2024-12-03T03:51:31Z","title":"Misalignment of Semantic Relation Knowledge between WordNet and Human\n Intuition","summary":" WordNet provides a carefully constructed repository of semantic relations,\ncreated by specialists. But there is another source of information on semantic\nrelations, the intuition of language users. We present the first systematic\nstudy of the degree to which these two sources are aligned. Investigating the\ncases of misalignment could make proper use of WordNet and facilitate its\nimprovement. Our analysis which uses templates to elicit responses from human\nparticipants, reveals a general misalignment of semantic relation knowledge\nbetween WordNet and human intuition. 
Further analyses find a systematic pattern\nof mismatch among synonymy and taxonomic relations~(hypernymy and hyponymy),\ntogether with the fact that WordNet path length does not serve as a reliable\nindicator of human intuition regarding hypernymy or hyponymy relations.\n","authors":["Zhihan Cao","Hiroaki Yamada","Simone Teufel","Takenobu Tokunaga"],"pdf_url":"https://arxiv.org/pdf/2412.02138v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.18279v3","updated":"2024-12-03T03:16:27Z","published":"2024-11-27T12:13:39Z","title":"Large Language Model-Brained GUI Agents: A Survey","summary":" GUIs have long been central to human-computer interaction, providing an\nintuitive and visually-driven way to access and interact with digital systems.\nThe advent of LLMs, particularly multimodal models, has ushered in a new era of\nGUI automation. They have demonstrated exceptional capabilities in natural\nlanguage understanding, code generation, and visual processing. This has paved\nthe way for a new generation of LLM-brained GUI agents capable of interpreting\ncomplex GUI elements and autonomously executing actions based on natural\nlanguage instructions. These agents represent a paradigm shift, enabling users\nto perform intricate, multi-step tasks through simple conversational commands.\nTheir applications span across web navigation, mobile app interactions, and\ndesktop automation, offering a transformative user experience that\nrevolutionizes how individuals interact with software. This emerging field is\nrapidly advancing, with significant progress in both research and industry.\n To provide a structured understanding of this trend, this paper presents a\ncomprehensive survey of LLM-brained GUI agents, exploring their historical\nevolution, core components, and advanced techniques. 
We address research\nquestions such as existing GUI agent frameworks, the collection and utilization\nof data for training specialized GUI agents, the development of large action\nmodels tailored for GUI tasks, and the evaluation metrics and benchmarks\nnecessary to assess their effectiveness. Additionally, we examine emerging\napplications powered by these agents. Through a detailed analysis, this survey\nidentifies key research gaps and outlines a roadmap for future advancements in\nthe field. By consolidating foundational knowledge and state-of-the-art\ndevelopments, this work aims to guide both researchers and practitioners in\novercoming challenges and unlocking the full potential of LLM-brained GUI\nagents.\n","authors":["Chaoyun Zhang","Shilin He","Jiaxu Qian","Bowen Li","Liqun Li","Si Qin","Yu Kang","Minghua Ma","Guyue Liu","Qingwei Lin","Saravan Rajmohan","Dongmei Zhang","Qi Zhang"],"pdf_url":"https://arxiv.org/pdf/2411.18279v3.pdf","comment":"The collection of papers reviewed in this survey will be hosted and\n regularly updated on the GitHub repository:\n https://github.com/vyokky/LLM-Brained-GUI-Agents-Survey Additionally, a\n searchable webpage is available at https://aka.ms/gui-agent for easier access\n and exploration"},{"id":"http://arxiv.org/abs/2412.01078v2","updated":"2024-12-03T02:59:43Z","published":"2024-12-02T03:31:46Z","title":"Advancing Speech Language Models by Scaling Supervised Fine-Tuning with\n Over 60,000 Hours of Synthetic Speech Dialogue Data","summary":" The GPT-4o represents a significant milestone in enabling real-time\ninteraction with large language models (LLMs) through speech, its remarkable\nlow latency and high fluency not only capture attention but also stimulate\nresearch interest in the field. This real-time speech interaction is\nparticularly valuable in scenarios requiring rapid feedback and immediate\nresponses, dramatically enhancing user experience. 
However, there is a notable\nlack of research focused on real-time large speech language models,\nparticularly for Chinese. In this work, we present KE-Omni, a seamless large\nspeech language model built upon Ke-SpeechChat, a large-scale high-quality\nsynthetic speech interaction dataset consisting of 7 million Chinese and\nEnglish conversations, featuring 42,002 speakers, and totaling over 60,000\nhours, This contributes significantly to the advancement of research and\ndevelopment in this field. The demos can be accessed at\n\\url{https://huggingface.co/spaces/KE-Team/KE-Omni}.\n","authors":["Shuaijiang Zhao","Tingwei Guo","Bajian Xiang","Tongtang Wan","Qiang Niu","Wei Zou","Xiangang Li"],"pdf_url":"https://arxiv.org/pdf/2412.01078v2.pdf","comment":"KE-Omni, Ke-SpeechChat"},{"id":"http://arxiv.org/abs/2412.02104v1","updated":"2024-12-03T02:54:31Z","published":"2024-12-03T02:54:31Z","title":"Explainable and Interpretable Multimodal Large Language Models: A\n Comprehensive Survey","summary":" The rapid development of Artificial Intelligence (AI) has revolutionized\nnumerous fields, with large language models (LLMs) and computer vision (CV)\nsystems driving advancements in natural language understanding and visual\nprocessing, respectively. The convergence of these technologies has catalyzed\nthe rise of multimodal AI, enabling richer, cross-modal understanding that\nspans text, vision, audio, and video modalities. Multimodal large language\nmodels (MLLMs), in particular, have emerged as a powerful framework,\ndemonstrating impressive capabilities in tasks like image-text generation,\nvisual question answering, and cross-modal retrieval. 
Despite these\nadvancements, the complexity and scale of MLLMs introduce significant\nchallenges in interpretability and explainability, essential for establishing\ntransparency, trustworthiness, and reliability in high-stakes applications.\nThis paper provides a comprehensive survey on the interpretability and\nexplainability of MLLMs, proposing a novel framework that categorizes existing\nresearch across three perspectives: (I) Data, (II) Model, (III) Training \\&\nInference. We systematically analyze interpretability from token-level to\nembedding-level representations, assess approaches related to both architecture\nanalysis and design, and explore training and inference strategies that enhance\ntransparency. By comparing various methodologies, we identify their strengths\nand limitations and propose future research directions to address unresolved\nchallenges in multimodal explainability. This survey offers a foundational\nresource for advancing interpretability and transparency in MLLMs, guiding\nresearchers and practitioners toward developing more accountable and robust\nmultimodal AI systems.\n","authors":["Yunkai Dang","Kaichen Huang","Jiahao Huo","Yibo Yan","Sirui Huang","Dongrui Liu","Mengxi Gao","Jie Zhang","Chen Qian","Kun Wang","Yong Liu","Jing Shao","Hui Xiong","Xuming Hu"],"pdf_url":"https://arxiv.org/pdf/2412.02104v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02101v1","updated":"2024-12-03T02:52:14Z","published":"2024-12-03T02:52:14Z","title":"Improving Language Transfer Capability of Decoder-only Architecture in\n Multilingual Neural Machine Translation","summary":" Existing multilingual neural machine translation (MNMT) approaches mainly\nfocus on improving models with the encoder-decoder architecture to translate\nmultiple languages. However, decoder-only architecture has been explored less\nin MNMT due to its underperformance when trained on parallel data solely. 
In\nthis work, we attribute the issue of the decoder-only architecture to its lack\nof language transfer capability. Specifically, the decoder-only architecture is\ninsufficient in encoding source tokens with the target language features. We\npropose dividing the decoding process into two stages so that target tokens are\nexplicitly excluded in the first stage to implicitly boost the transfer\ncapability across languages. Additionally, we impose contrastive learning on\ntranslation instructions, resulting in improved performance in zero-shot\ntranslation. We conduct experiments on TED-19 and OPUS-100 datasets,\nconsidering both training from scratch and fine-tuning scenarios. Experimental\nresults show that, compared to the encoder-decoder architecture, our methods\nnot only perform competitively in supervised translations but also achieve\nimprovements of up to 3.39 BLEU, 6.99 chrF++, 3.22 BERTScore, and 4.81 COMET in\nzero-shot translations.\n","authors":["Zhi Qu","Yiran Wang","Chenchen Ding","Hideki Tanaka","Masao Utiyama","Taro Watanabe"],"pdf_url":"https://arxiv.org/pdf/2412.02101v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14809v4","updated":"2024-12-03T02:06:03Z","published":"2024-08-27T06:44:28Z","title":"GSIFN: A Graph-Structured and Interlaced-Masked Multimodal\n Transformer-based Fusion Network for Multimodal Sentiment Analysis","summary":" Multimodal Sentiment Analysis (MSA) leverages multiple data modals to analyze\nhuman sentiment. Existing MSA models generally employ cutting-edge multimodal\nfusion and representation learning-based methods to promote MSA capability.\nHowever, there are two key challenges: (i) in existing multimodal fusion\nmethods, the decoupling of modal combinations and tremendous parameter\nredundancy, lead to insufficient fusion performance and efficiency; (ii) a\nchallenging trade-off exists between representation capability and\ncomputational overhead in unimodal feature extractors and encoders. 
Our\nproposed GSIFN incorporates two main components to solve these problems: (i) a\ngraph-structured and interlaced-masked multimodal Transformer. It adopts the\nInterlaced Mask mechanism to construct robust multimodal graph embedding,\nachieve all-modal-in-one Transformer-based fusion, and greatly reduce the\ncomputational overhead; (ii) a self-supervised learning framework with low\ncomputational overhead and high performance, which utilizes a parallelized LSTM\nwith matrix memory to enhance non-verbal modal features for unimodal label\ngeneration. Evaluated on the MSA datasets CMU-MOSI, CMU-MOSEI, and CH-SIMS,\nGSIFN demonstrates superior performance with significantly lower computational\noverhead compared with previous state-of-the-art models.\n","authors":["Yijie Jin"],"pdf_url":"https://arxiv.org/pdf/2408.14809v4.pdf","comment":"Withdraw for the error in the paper"},{"id":"http://arxiv.org/abs/2412.02081v1","updated":"2024-12-03T01:53:06Z","published":"2024-12-03T01:53:06Z","title":"Let's Think Var-by-Var: Large Language Models Enable Ad Hoc\n Probabilistic Reasoning","summary":" A hallmark of intelligence is the ability to flesh out underspecified\nsituations using \"common sense.\" We propose to extract that common sense from\nlarge language models (LLMs), in a form that can feed into probabilistic\ninference. We focus our investigation on $\\textit{guesstimation}$ questions\nsuch as \"How much are Airbnb listings in Newark, NJ?\" Formulating a sensible\nanswer without access to data requires drawing on, and integrating, bits of\ncommon knowledge about how $\\texttt{Price}$ and $\\texttt{Location}$ may relate\nto other variables, such as $\\texttt{Property Type}$. Our framework answers\nsuch a question by synthesizing an $\\textit{ad hoc}$ probabilistic model. First\nwe prompt an LLM to propose a set of random variables relevant to the question,\nfollowed by moment constraints on their joint distribution. 
We then optimize\nthe joint distribution $p$ within a log-linear family to maximize the overall\nconstraint satisfaction. Our experiments show that LLMs can successfully be\nprompted to propose reasonable variables, and while the proposed numerical\nconstraints can be noisy, jointly optimizing for their satisfaction reconciles\nthem. When evaluated on probabilistic questions derived from three real-world\ntabular datasets, we find that our framework performs comparably to a direct\nprompting baseline in terms of total variation distance from the dataset\ndistribution, and is similarly robust to noise.\n","authors":["Shepard Xia","Brian Lu","Jason Eisner"],"pdf_url":"https://arxiv.org/pdf/2412.02081v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.07870v3","updated":"2024-12-03T01:04:10Z","published":"2024-11-12T15:26:17Z","title":"Trustful LLMs: Customizing and Grounding Text Generation with Knowledge\n Bases and Dual Decoders","summary":" Although people are impressed by the content generation skills of large\nlanguage models, the use of LLMs, such as ChatGPT, is limited by the domain\ngrounding of the content. The correctness and groundedness of the generated\ncontent need to be based on a verified context, such as results from\nRetrieval-Augmented Generation (RAG). One important issue when adapting LLMs to\na customized domain is that the generated responses are often incomplete, or\nthe additions are not verified and may even be hallucinated. Prior studies on\nhallucination detection have focused on evaluation metrics, which are not\neasily adaptable to dynamic domains and can be vulnerable to attacks like\njail-breaking. 
In this work, we propose 1) a post-processing algorithm that\nleverages knowledge triplets in RAG context to correct hallucinations and 2) a\ndual-decoder model that fuses RAG context to guide the generation process.\n","authors":["Xiaofeng Zhu","Jaya Krishna Mandivarapu"],"pdf_url":"https://arxiv.org/pdf/2411.07870v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02058v1","updated":"2024-12-03T00:32:32Z","published":"2024-12-03T00:32:32Z","title":"BN-AuthProf: Benchmarking Machine Learning for Bangla Author Profiling\n on Social Media Texts","summary":" Author profiling, the analysis of texts to uncover attributes such as gender\nand age of the author, has become essential with the widespread use of social\nmedia platforms. This paper focuses on author profiling in the Bangla language,\naiming to extract valuable insights about anonymous authors based on their\nwriting style on social media. The primary objective is to introduce and\nbenchmark the performance of machine learning approaches on a newly created\nBangla Author Profiling dataset, BN-AuthProf. The dataset comprises 30,131\nsocial media posts from 300 authors, labeled by their age and gender. Authors'\nidentities and sensitive information were anonymized to ensure privacy. Various\nclassical machine learning and deep learning techniques were employed to\nevaluate the dataset. For gender classification, the best accuracy achieved was\n80% using Support Vector Machine (SVM), while a Multinomial Naive Bayes (MNB)\nclassifier achieved the best F1 score of 0.756. For age classification, MNB\nattained a maximum accuracy score of 91% with an F1 score of 0.905. 
This\nresearch highlights the effectiveness of machine learning in gender and age\nclassification for Bangla author profiling, with practical implications\nspanning marketing, security, forensic linguistics, education, and criminal\ninvestigations, considering privacy and biases.\n","authors":["Raisa Tasnim","Mehanaz Chowdhury","Md Ataur Rahman"],"pdf_url":"https://arxiv.org/pdf/2412.02058v1.pdf","comment":"Accepted to be Published in 2024 27th International Conference on\n Computer and Information Technology (ICCIT)"},{"id":"http://arxiv.org/abs/2412.02056v1","updated":"2024-12-03T00:28:31Z","published":"2024-12-03T00:28:31Z","title":"A Multi-way Parallel Named Entity Annotated Corpus for English, Tamil\n and Sinhala","summary":" This paper presents a multi-way parallel English-Tamil-Sinhala corpus\nannotated with Named Entities (NEs), where Sinhala and Tamil are low-resource\nlanguages. Using pre-trained multilingual Language Models (mLMs), we establish\nnew benchmark Named Entity Recognition (NER) results on this dataset for\nSinhala and Tamil. We also carry out a detailed investigation on the NER\ncapabilities of different types of mLMs. Finally, we demonstrate the utility of\nour NER system on a low-resource Neural Machine Translation (NMT) task. 
Our\ndataset is publicly released: https://github.com/suralk/multiNER.\n","authors":["Surangika Ranathunga","Asanka Ranasinghea","Janaka Shamala","Ayodya Dandeniyaa","Rashmi Galappaththia","Malithi Samaraweeraa"],"pdf_url":"https://arxiv.org/pdf/2412.02056v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02048v1","updated":"2024-12-03T00:08:01Z","published":"2024-12-03T00:08:01Z","title":"Impact of Data Snooping on Deep Learning Models for Locating\n Vulnerabilities in Lifted Code","summary":" This study examines the impact of data snooping on neural networks for\nvulnerability detection in lifted code, building on previous research which\nused word2vec, and unidirectional and bidirectional transformer-based\nembeddings. The research specifically focuses on how model performance is\naffected when embedding models are trained on datasets, including samples also\nused for neural network training and validation. The results show that\nintroducing data snooping did not significantly alter model performance,\nsuggesting that data snooping had a minimal impact or that samples randomly\ndropped as part of the methodology contained hidden features critical to\nachieving optimal performance. In addition, the findings reinforce the\nconclusions of previous research, which found that models trained with GPT-2\nembeddings consistently outperformed neural networks trained with other\nembeddings. The fact that this holds even when data snooping is introduced into\nthe embedding model indicates GPT-2's robustness in representing complex code\nfeatures, even under less-than-ideal conditions.\n","authors":["Gary A. McCully","John D. 
Hastings","Shengjie Xu"],"pdf_url":"https://arxiv.org/pdf/2412.02048v1.pdf","comment":"7 pages, 2 figures"},{"id":"http://arxiv.org/abs/2412.02915v1","updated":"2024-12-03T23:58:35Z","published":"2024-12-03T23:58:35Z","title":"Single-Cell Omics Arena: A Benchmark Study for Large Language Models on\n Cell Type Annotation Using Single-Cell Data","summary":" Over the past decade, the revolution in single-cell sequencing has enabled\nthe simultaneous molecular profiling of various modalities across thousands of\nindividual cells, allowing scientists to investigate the diverse functions of\ncomplex tissues and uncover underlying disease mechanisms. Among all the\nanalytical steps, assigning individual cells to specific types is fundamental\nfor understanding cellular heterogeneity. However, this process is usually\nlabor-intensive and requires extensive expert knowledge. Recent advances in\nlarge language models (LLMs) have demonstrated their ability to efficiently\nprocess and synthesize vast corpora of text to automatically extract essential\nbiological knowledge, such as marker genes, potentially promoting more\nefficient and automated cell type annotations. To thoroughly evaluate the\ncapability of modern instruction-tuned LLMs in automating the cell type\nidentification process, we introduce SOAR, a comprehensive benchmarking study\nof LLMs for cell type annotation tasks in single-cell genomics. Specifically,\nwe assess the performance of 8 instruction-tuned LLMs across 11 datasets,\nspanning multiple cell types and species. Our study explores the potential of\nLLMs to accurately classify and annotate cell types in single-cell RNA\nsequencing (scRNA-seq) data, while extending their application to multiomics\ndata through cross-modality translation. Additionally, we evaluate the\neffectiveness of chain-of-thought (CoT) prompting techniques in generating\ndetailed biological insights during the annotation process. 
The results\ndemonstrate that LLMs can provide robust interpretations of single-cell data\nwithout requiring additional fine-tuning, advancing the automation of cell type\nannotation in genomics research.\n","authors":["Junhao Liu","Siwei Xu","Lei Zhang","Jing Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.02915v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.19318v2","updated":"2024-12-03T23:53:19Z","published":"2024-04-30T07:38:08Z","title":"Enhancing Trust in LLM-Generated Code Summaries with Calibrated\n Confidence Scores","summary":" A good summary can often be very useful during program comprehension. While a\nbrief, fluent, and relevant summary can be helpful, it does require significant\nhuman effort to produce. Often, good summaries are unavailable in software\nprojects, thus making maintenance more difficult. There has been a considerable\nbody of research into automated AI-based methods, using Large Language models\n(LLMs), to generate summaries of code; there also has been quite a bit work on\nways to measure the performance of such summarization methods, with special\nattention paid to how closely these AI-generated summaries resemble a summary a\nhuman might have produced. Measures such as BERTScore and BLEU have been\nsuggested and evaluated with human-subject studies.\n However, LLM-produced summaries can be too long, irrelevant, etc: generally,\ntoo dissimilar to what a human might say. Given an LLM-produced code summary,\nhow can we judge if a summary is good enough? Given some input source code, and\nan LLM-generated summary, existing approaches can help judge brevity, fluency\nand relevance; however, it's difficult to gauge whether an LLM-produced summary\nsufficiently resembles what a human might produce, without a \"golden\"\nhuman-produced summary to compare against. 
We study this resemblance question\nas a calibration problem: given just the summary from an LLM, can we compute a\nconfidence measure, that provides a reliable indication of whether the summary\nsufficiently resembles what a human would have produced in this situation? We\nexamine this question using several LLMs, for several languages, and in several\ndifferent settings. Our investigation suggests approaches to provide reliable\npredictions of the likelihood that an LLM-generated summary would sufficiently\nresemble a summary a human might write for the same code.\n","authors":["Yuvraj Virk","Premkumar Devanbu","Toufique Ahmed"],"pdf_url":"https://arxiv.org/pdf/2404.19318v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02906v1","updated":"2024-12-03T23:19:40Z","published":"2024-12-03T23:19:40Z","title":"Does Few-Shot Learning Help LLM Performance in Code Synthesis?","summary":" Large language models (LLMs) have made significant strides at code generation\nthrough improved model design, training, and chain-of-thought. However,\nprompt-level optimizations remain an important yet under-explored aspect of\nLLMs for coding. This work focuses on the few-shot examples present in most\ncode generation prompts, offering a systematic study on whether few-shot\nexamples improve LLM's coding capabilities, which few-shot examples have the\nlargest impact, and how to select impactful examples. Our work offers 2\napproaches for selecting few-shot examples, a model-free method,\nCODEEXEMPLAR-FREE, and a model-based method, CODEEXEMPLAR-BASED. The 2 methods\noffer a trade-off between improved performance and reliance on training data\nand interpretability. Both methods significantly improve CodeLlama's coding\nability across the popular HumanEval+ coding benchmark. 
In summary, our work\nprovides valuable insights into how to pick few-shot examples in code\ngeneration prompts to improve LLM code generation capabilities.\n","authors":["Derek Xu","Tong Xie","Botao Xia","Haoyu Li","Yunsheng Bai","Yizhou Sun","Wei Wang"],"pdf_url":"https://arxiv.org/pdf/2412.02906v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02904v1","updated":"2024-12-03T23:14:47Z","published":"2024-12-03T23:14:47Z","title":"Enhancing Trust in Large Language Models with Uncertainty-Aware\n Fine-Tuning","summary":" Large language models (LLMs) have revolutionized the field of natural\nlanguage processing with their impressive reasoning and question-answering\ncapabilities. However, these models are sometimes prone to generating\ncredible-sounding but incorrect information, a phenomenon known as LLM\nhallucinations. Reliable uncertainty estimation in LLMs is essential for\nfostering trust in their generated responses and serves as a critical tool for\nthe detection and prevention of erroneous or hallucinated outputs. To achieve\nreliable and well-calibrated uncertainty quantification in open-ended and\nfree-form natural language generation, we propose an uncertainty-aware\nfine-tuning approach for LLMs. This approach enhances the model's ability to\nprovide reliable uncertainty estimates without compromising accuracy, thereby\nguiding them to produce more trustworthy responses. We introduce a novel\nuncertainty-aware causal language modeling loss function, grounded in the\nprinciples of decision theory. Through rigorous evaluation on multiple\nfree-form question-answering datasets and models, we demonstrate that our\nuncertainty-aware fine-tuning approach yields better calibrated uncertainty\nestimates in natural language generation tasks than fine-tuning with the\nstandard causal language modeling loss. 
Furthermore, the experimental results\nshow that the proposed method significantly improves the model's ability to\ndetect hallucinations and identify out-of-domain prompts.\n","authors":["Ranganath Krishnan","Piyush Khanna","Omesh Tickoo"],"pdf_url":"https://arxiv.org/pdf/2412.02904v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02897v1","updated":"2024-12-03T23:01:21Z","published":"2024-12-03T23:01:21Z","title":"MLD-EA: Check and Complete Narrative Coherence by Introducing Emotions\n and Actions","summary":" Narrative understanding and story generation are critical challenges in\nnatural language processing (NLP), with much of the existing research focused\non summarization and question-answering tasks. While previous studies have\nexplored predicting plot endings and generating extended narratives, they often\nneglect the logical coherence within stories, leaving a significant gap in the\nfield. To address this, we introduce the Missing Logic Detector by Emotion and\nAction (MLD-EA) model, which leverages large language models (LLMs) to identify\nnarrative gaps and generate coherent sentences that integrate seamlessly with\nthe story's emotional and logical flow. The experimental results demonstrate\nthat the MLD-EA model enhances narrative understanding and story generation,\nhighlighting LLMs' potential as effective logic checkers in story writing with\nlogical coherence and emotional consistency. 
This work fills a gap in NLP\nresearch and advances border goals of creating more sophisticated and reliable\nstory-generation systems.\n","authors":["Jinming Zhang","Yunfei Long"],"pdf_url":"https://arxiv.org/pdf/2412.02897v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02893v1","updated":"2024-12-03T22:58:21Z","published":"2024-12-03T22:58:21Z","title":"Removing Spurious Correlation from Neural Network Interpretations","summary":" The existing algorithms for identification of neurons responsible for\nundesired and harmful behaviors do not consider the effects of confounders such\nas topic of the conversation. In this work, we show that confounders can create\nspurious correlations and propose a new causal mediation approach that controls\nthe impact of the topic. In experiments with two large language models, we\nstudy the localization hypothesis and show that adjusting for the effect of\nconversation topic, toxicity becomes less localized.\n","authors":["Milad Fotouhi","Mohammad Taha Bahadori","Oluwaseyi Feyisetan","Payman Arabshahi","David Heckerman"],"pdf_url":"https://arxiv.org/pdf/2412.02893v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.12883v2","updated":"2024-12-03T22:44:35Z","published":"2024-10-15T20:29:38Z","title":"Scaling Laws for Multilingual Language Models","summary":" We propose a novel scaling law for general-purpose decoder-only language\nmodels (LMs) trained on multilingual data, tackling the problem of balancing\nlanguages during multilingual pretraining. A primary challenge in studying\nmultilingual scaling is the difficulty of analyzing individual language\nperformance due to cross-lingual transfer. To address this, we shift the focus\nfrom individual languages to language families. We introduce and validate a\nhypothesis that the test cross-entropy loss for each language family is\ndetermined solely by its own sampling ratio, independent of other languages in\nthe mixture. 
This insight simplifies the complexity of multilingual scaling and\nmakes the analysis scalable to an arbitrary number of languages. Building on\nthis hypothesis, we derive a power-law relationship that links performance with\ndataset size, model size and sampling ratios. This relationship enables us to\npredict performance across various combinations of the above three quantities,\nand derive the optimal sampling ratios at different model scales. To\ndemonstrate the effectiveness and accuracy of our proposed scaling law, we\nperform a large-scale empirical study, training more than 100 models on 23\nlanguages spanning 5 language families. Our experiments show that the optimal\nsampling ratios derived from small models (85M parameters) generalize\neffectively to models that are several orders of magnitude larger (1.2B\nparameters), offering a resource-efficient approach for multilingual LM\ntraining at scale.\n","authors":["Yifei He","Alon Benhaim","Barun Patra","Praneetha Vaddamanu","Sanchit Ahuja","Parul Chopra","Vishrav Chaudhary","Han Zhao","Xia Song"],"pdf_url":"https://arxiv.org/pdf/2410.12883v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02883v1","updated":"2024-12-03T22:38:05Z","published":"2024-12-03T22:38:05Z","title":"TDD-Bench Verified: Can LLMs Generate Tests for Issues Before They Get\n Resolved?","summary":" Test-driven development (TDD) is the practice of writing tests first and\ncoding later, and the proponents of TDD expound its numerous benefits. For\ninstance, given an issue on a source code repository, tests can clarify the\ndesired behavior among stake-holders before anyone writes code for the\nagreed-upon fix. Although there has been a lot of work on automated test\ngeneration for the practice \"write code first, test later\", there has been\nlittle such automation for TDD. 
Ideally, tests for TDD should be fail-to-pass\n(i.e., fail before the issue is resolved and pass after) and have good adequacy\nwith respect to covering the code changed during issue resolution. This paper\nintroduces TDD-Bench Verified, a high-quality benchmark suite of 449 issues\nmined from real-world GitHub code repositories. The benchmark's evaluation\nharness runs only relevant tests in isolation for simple yet accurate coverage\nmeasurements, and the benchmark's dataset is filtered both by human judges and\nby execution in the harness. This paper also presents Auto-TDD, an LLM-based\nsolution that takes as input an issue description and a codebase (prior to\nissue resolution) and returns as output a test that can be used to validate the\nchanges made for resolving the issue. Our evaluation shows that Auto-TDD yields\na better fail-to-pass rate than the strongest prior work while also yielding\nhigh coverage adequacy. Overall, we hope that this work helps make developers\nmore productive at resolving issues while simultaneously leading to more robust\nfixes.\n","authors":["Toufique Ahmed","Martin Hirzel","Rangeet Pan","Avraham Shinnar","Saurabh Sinha"],"pdf_url":"https://arxiv.org/pdf/2412.02883v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.14485v2","updated":"2024-12-03T22:27:12Z","published":"2024-11-20T02:49:18Z","title":"Mediating Modes of Thought: LLM's for design scripting","summary":" Architects adopt visual scripting and parametric design tools to explore more\nexpansive design spaces (Coates, 2010), refine their thinking about the\ngeometric logic of their design (Woodbury, 2010), and overcome conventional\nsoftware limitations (Burry, 2011). Despite two decades of effort to make\ndesign scripting more accessible, a disconnect between a designer's free ways\nof thinking and the rigidity of algorithms remains (Burry, 2011). 
Recent\ndevelopments in Large Language Models (LLMs) suggest this might soon change, as\nLLMs encode a general understanding of human context and exhibit the capacity\nto produce geometric logic. This project speculates that if LLMs can\neffectively mediate between user intent and algorithms, they become a powerful\ntool to make scripting in design more widespread and fun. We explore if such\nsystems can interpret natural language prompts to assemble geometric operations\nrelevant to computational design scripting. In the system, multiple layers of\nLLM agents are configured with specific context to infer the user intent and\nconstruct a sequential logic. Given a user's high-level text prompt, a\ngeometric description is created, distilled into a sequence of logic\noperations, and mapped to software-specific commands. The completed script is\nconstructed in the user's visual programming interface. The system succeeds in\ngenerating complete visual scripts up to a certain complexity but fails beyond\nthis complexity threshold. It shows how LLMs can make design scripting much\nmore aligned with human creativity and thought. Future research should explore\nconversational interactions, expand to multimodal inputs and outputs, and\nassess the performance of these tools.\n","authors":["Moritz Rietschel","Fang Guo","Kyle Steinfeld"],"pdf_url":"https://arxiv.org/pdf/2411.14485v2.pdf","comment":"Published at ACADIA 2024"},{"id":"http://arxiv.org/abs/2409.11598v2","updated":"2024-12-03T22:23:53Z","published":"2024-09-17T23:10:04Z","title":"Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented\n Generation","summary":" Many language models now enhance their responses with retrieval capabilities,\nleading to the widespread adoption of retrieval-augmented generation (RAG)\nsystems. 
However, despite retrieval being a core component of RAG, much of the\nresearch in this area overlooks the extensive body of work on fair ranking,\nneglecting the importance of considering all stakeholders involved. This paper\npresents the first systematic evaluation of RAG systems integrated with fair\nrankings. We focus specifically on measuring the fair exposure of each relevant\nitem across the rankings utilized by RAG systems (i.e., item-side fairness),\naiming to promote equitable growth for relevant item providers. To gain a deep\nunderstanding of the relationship between item-fairness, ranking quality, and\ngeneration quality in the context of RAG, we analyze nine different RAG systems\nthat incorporate fair rankings across seven distinct datasets. Our findings\nindicate that RAG systems with fair rankings can maintain a high level of\ngeneration quality and, in many cases, even outperform traditional RAG systems,\ndespite the general trend of a tradeoff between ensuring fairness and\nmaintaining system-effectiveness. We believe our insights lay the groundwork\nfor responsible and equitable RAG systems and open new avenues for future\nresearch. We publicly release our codebase and dataset at\nhttps://github.com/kimdanny/Fair-RAG.\n","authors":["To Eun Kim","Fernando Diaz"],"pdf_url":"https://arxiv.org/pdf/2409.11598v2.pdf","comment":"Top 5 Spotlight at AFME Workshop at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2410.18806v2","updated":"2024-12-03T21:48:17Z","published":"2024-10-24T14:54:09Z","title":"A Combinatorial Approach to Neural Emergent Communication","summary":" Substantial research on deep learning-based emergent communication uses the\nreferential game framework, specifically the Lewis signaling game; however, we\nargue that successful communication in this game typically needs only one or two\nsymbols for target image classification because of a sampling pitfall in the\ntraining data. 
To address this issue, we provide a theoretical analysis and\nintroduce a combinatorial algorithm SolveMinSym (SMS) to solve the symbolic\ncomplexity for classification, which is the minimum number of symbols in the\nmessage for successful communication. We use the SMS algorithm to create\ndatasets with different symbolic complexity to empirically show that data with\nhigher symbolic complexity increases the number of effective symbols in the\nemergent language.\n","authors":["Zheyuan Zhang"],"pdf_url":"https://arxiv.org/pdf/2410.18806v2.pdf","comment":"Accepted to COLING 2025"},{"id":"http://arxiv.org/abs/2412.02835v1","updated":"2024-12-03T21:00:10Z","published":"2024-12-03T21:00:10Z","title":"CAISSON: Concept-Augmented Inference Suite of Self-Organizing Neural\n Networks","summary":" We present CAISSON, a novel hierarchical approach to Retrieval-Augmented\nGeneration (RAG) that transforms traditional single-vector search into a\nmulti-view clustering framework. At its core, CAISSON leverages dual\nSelf-Organizing Maps (SOMs) to create complementary organizational views of the\ndocument space, where each view captures different aspects of document\nrelationships through specialized embeddings. The first view processes combined\ntext and metadata embeddings, while the second operates on metadata enriched\nwith concept embeddings, enabling a comprehensive multi-view analysis that\ncaptures both fine-grained semantic relationships and high-level conceptual\npatterns. This dual-view approach enables more nuanced document discovery by\ncombining evidence from different organizational perspectives. To evaluate\nCAISSON, we develop SynFAQA, a framework for generating synthetic financial\nanalyst notes and question-answer pairs that systematically tests different\naspects of information retrieval capabilities. 
Drawing on HotPotQA's\nmethodology for constructing multi-step reasoning questions, SynFAQA generates\ncontrolled test cases where each question is paired with the set of notes\ncontaining its ground-truth answer, progressing from simple single-entity\nqueries to complex multi-hop retrieval tasks involving multiple entities and\nconcepts. Our experimental results demonstrate substantial improvements over\nboth basic and enhanced RAG implementations, particularly for complex\nmulti-entity queries, while maintaining practical response times suitable for\ninteractive applications.\n","authors":["Igor Halperin"],"pdf_url":"https://arxiv.org/pdf/2412.02835v1.pdf","comment":"26 pages, 7 figures, 8 tables"},{"id":"http://arxiv.org/abs/2407.02820v2","updated":"2024-12-03T20:56:16Z","published":"2024-07-03T05:42:20Z","title":"Investigating the Contextualised Word Embedding Dimensions Specified for\n Contextual and Temporal Semantic Changes","summary":" The sense-aware contextualised word embeddings (SCWEs) encode semantic\nchanges of words within the contextualised word embedding (CWE) spaces. Despite\nthe superior performance of SCWEs in contextual/temporal semantic change\ndetection (SCD) benchmarks, it remains unclear as to how the meaning changes\nare encoded in the embedding space. To study this, we compare pre-trained CWEs\nand their fine-tuned versions on contextual and temporal semantic change\nbenchmarks under Principal Component Analysis (PCA) and Independent Component\nAnalysis (ICA) transformations. Our experimental results reveal (a) although\nthere exist a smaller number of axes that are specific to semantic changes of\nwords in the pre-trained CWE space, this information gets distributed across\nall dimensions when fine-tuned, and (b) in contrast to prior work studying the\ngeometry of CWEs, we find that PCA better represents semantic changes than\nICA within the top 10% of axes. 
These findings encourage the development of\nmore efficient SCD methods with a small number of SCD-aware dimensions. Source\ncode is available at https://github.com/LivNLP/svp-dims .\n","authors":["Taichi Aida","Danushka Bollegala"],"pdf_url":"https://arxiv.org/pdf/2407.02820v2.pdf","comment":"COLING2025"},{"id":"http://arxiv.org/abs/2412.02830v1","updated":"2024-12-03T20:52:35Z","published":"2024-12-03T20:52:35Z","title":"RARE: Retrieval-Augmented Reasoning Enhancement for Large Language\n Models","summary":" This work introduces RARE (Retrieval-Augmented Reasoning Enhancement), a\nversatile extension to the mutual reasoning framework (rStar), aimed at\nenhancing reasoning accuracy and factual integrity across large language models\n(LLMs) for complex, knowledge-intensive tasks such as commonsense and medical\nreasoning. RARE incorporates two innovative actions within the Monte Carlo Tree\nSearch (MCTS) framework: A6, which generates search queries based on the\ninitial problem statement, performs information retrieval using those queries,\nand augments reasoning with the retrieved data to formulate the final answer;\nand A7, which leverages information retrieval specifically for generated\nsub-questions and re-answers these sub-questions with the relevant contextual\ninformation. Additionally, a Retrieval-Augmented Factuality Scorer is proposed\nto replace the original discriminator, prioritizing reasoning paths that meet\nhigh standards of factuality. Experimental results with LLaMA 3.1 show that\nRARE enables open-source LLMs to achieve competitive performance with top\nclosed-source models like GPT-4 and GPT-4o. 
This research establishes RARE as a\nscalable solution for improving LLMs in domains where logical coherence and\nfactual integrity are critical.\n","authors":["Hieu Tran","Zonghai Yao","Junda Wang","Yifan Zhang","Zhichao Yang","Hong Yu"],"pdf_url":"https://arxiv.org/pdf/2412.02830v1.pdf","comment":"24 pages"},{"id":"http://arxiv.org/abs/2412.02823v1","updated":"2024-12-03T20:41:37Z","published":"2024-12-03T20:41:37Z","title":"Minimization of Boolean Complexity in In-Context Concept Learning","summary":" What factors contribute to the relative success and corresponding\ndifficulties of in-context learning for Large Language Models (LLMs)? Drawing\non insights from the literature on human concept learning, we test LLMs on\ncarefully designed concept learning tasks, and show that task performance\nhighly correlates with the Boolean complexity of the concept. This suggests\nthat in-context learning exhibits a learning bias for simplicity in a way\nsimilar to humans.\n","authors":["Leroy Z. Wang","R. Thomas McCoy","Shane Steinert-Threlkeld"],"pdf_url":"https://arxiv.org/pdf/2412.02823v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02819v1","updated":"2024-12-03T20:35:57Z","published":"2024-12-03T20:35:57Z","title":"CNNSum: Exploring Long-Conext Summarization with Large Language Models\n in Chinese Novels","summary":" Large Language Models (LLMs) have been well-researched in many long-context\ntasks. However, due to high annotation costs, high-quality long-context summary\ndatasets for training or evaluation are scarce, limiting further research. In\nthis work, we introduce CNNSum, a new multi-scale Chinese long-context novel\nsummarization benchmark comprising four subsets with lengths covering\n16k~128k and 695 samples in total, with human-driven annotations.\nWe evaluate commercial and open-source models on CNNSum and conduct a detailed\nanalysis. 
Based on the observations, we further conduct fine-tuning exploration\nwith short-context summary data. In our study: (1) GPT-4o underperformed, due\nto excessive subjective commentary. (2) Currently, long-context summarization\nmainly relies on memory ability; small LLMs with stable longer context lengths\nare the most cost-effective. Using long data concatenated from short-context\nsummaries makes a significant improvement. (3) Prompt templates may cause a\nlarge performance gap but can be mitigated through fine-tuning. (4) Fine-tuned\nChat or Instruction versions may harm the Base model and further fine-tuning\ncannot bridge the performance gap. (5) While models with RoPE base scaling exhibit\nstrong extrapolation potential, their performance may vary significantly when\ncombined with other interpolation methods and need careful selection. (6)\nCNNSum provides more reliable and insightful evaluation results than other\nbenchmarks. We release CNNSum to advance research in this field.\n","authors":["Lingxiao Wei","He Yan","Xiangju Lu","Junmin Zhu","Jun Wang","Wei Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.02819v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.01779v3","updated":"2024-12-03T20:19:54Z","published":"2024-10-02T17:33:26Z","title":"Composing Global Optimizers to Reasoning Tasks via Algebraic Objects in\n Neural Nets","summary":" We prove rich algebraic structures of the solution space for 2-layer neural\nnetworks with quadratic activation and $L_2$ loss, trained on reasoning tasks\nin Abelian group (e.g., modular addition). Such a rich structure enables\nanalytical construction of global optimal solutions from partial solutions that\nonly satisfy part of the loss, despite its high nonlinearity. We coin the\nframework as CoGO (Composing Global Optimizers). 
Specifically, we show that the\nweight space over different numbers of hidden nodes of the 2-layer network is\nequipped with a semi-ring algebraic structure, and the loss function to be\noptimized consists of monomial potentials, which are ring homomorphism,\nallowing partial solutions to be composed into global ones by ring addition and\nmultiplication. Our experiments show that around $95\\%$ of the solutions\nobtained by gradient descent match exactly our theoretical constructions.\nAlthough the global optimizers constructed only required a small number of\nhidden nodes, our analysis on gradient dynamics shows that\nover-parameterization asymptotically decouples training dynamics and is\nbeneficial. We further show that training dynamics favors simpler solutions\nunder weight decay, and thus high-order global optimizers such as perfect\nmemorization are unfavorable. Code can be found at\nhttps://github.com/facebookresearch/luckmatters/tree/yuandong3/ssl/real-dataset.\n","authors":["Yuandong Tian"],"pdf_url":"https://arxiv.org/pdf/2410.01779v3.pdf","comment":"Update presentation and add more lemmas for necessary conditions"},{"id":"http://arxiv.org/abs/2411.15175v2","updated":"2024-12-03T20:07:58Z","published":"2024-11-18T00:21:14Z","title":"Can Open-source LLMs Enhance Data Synthesis for Toxic Detection?: An\n Experimental Study","summary":" Effective toxic content detection relies heavily on high-quality and diverse\ndata, which serves as the foundation for robust content moderation models. This\nstudy explores the potential of open-source LLMs for harmful data synthesis,\nutilizing prompt engineering and fine-tuning techniques to enhance data quality\nand diversity. In a two-stage evaluation, we first examine the capabilities of\nsix open-source LLMs in generating harmful data across multiple datasets using\nprompt engineering. 
In the second stage, we fine-tune these models to improve\ndata generation while addressing challenges such as hallucination, data\nduplication, and overfitting. Our findings reveal that Mistral excels in\ngenerating high-quality and diverse harmful data with minimal hallucination.\nFurthermore, fine-tuning enhances data quality, offering scalable and\ncost-effective solutions for augmenting datasets for specific toxic content\ndetection tasks. These results emphasize the significance of data synthesis in\nbuilding robust, standalone detection models and highlight the potential of\nopen-source LLMs to advance smaller downstream content moderation systems. We\nimplemented this approach in real-world industrial settings, demonstrating the\nfeasibility and efficiency of fine-tuned open-source LLMs for harmful data\nsynthesis.\n","authors":["Zheng Hui","Zhaoxiao Guo","Hang Zhao","Juanyong Duan","Lin Ai","Yinheng Li","Julia Hirschberg","Congrui Huang"],"pdf_url":"https://arxiv.org/pdf/2411.15175v2.pdf","comment":"12 pages"},{"id":"http://arxiv.org/abs/2408.14774v3","updated":"2024-12-03T20:01:23Z","published":"2024-08-27T04:31:58Z","title":"Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning","summary":" We introduce Instruct-SkillMix, an automated approach for creating diverse,\nhigh quality SFT data. The Instruct-SkillMix pipeline involves two stages, each\nleveraging an existing powerful LLM: (1) Skill extraction: uses the LLM to\nextract core \"skills\" for instruction-following, either from existing datasets,\nor by directly prompting the model; (2) Data generation: uses the powerful LLM\nto generate (instruction, response) data that exhibit a randomly chosen pair of\nthese skills. 
Here, the use of random skill combinations promotes diversity and\ndifficulty.\n Vanilla SFT (i.e., no PPO, DPO, or RL methods) on data generated from\nInstruct-SkillMix leads to strong gains on instruction following benchmarks\nsuch as AlpacaEval 2.0, MT-Bench, and WildBench. With just $4$K examples,\nLLaMA-3-8B-Base achieves 42.76% length-controlled win rate on AlpacaEval 2.0.\nTo our knowledge, this achieves state-of-the-art performance among all models\nthat have only undergone SFT (no RL methods) and competes with proprietary\nmodels such as Claude 3 Opus and LLaMA-3.1-405B-Instruct.\n Ablation studies also suggest plausible reasons for why creating open\ninstruction-tuning datasets via naive crowd-sourcing has proved difficult.\nIntroducing low quality answers (\"shirkers\") in $20\\%$ of Instruct-SkillMix\nexamples causes performance to plummet, sometimes catastrophically.\n The Instruct-SkillMix pipeline is flexible and is adaptable to other\nsettings.\n","authors":["Simran Kaur","Simon Park","Anirudh Goyal","Sanjeev Arora"],"pdf_url":"https://arxiv.org/pdf/2408.14774v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02790v1","updated":"2024-12-03T19:40:13Z","published":"2024-12-03T19:40:13Z","title":"An Evolutionary Large Language Model for Hallucination Mitigation","summary":" The emergence of LLMs, like ChatGPT and Gemini, has marked the modern era of\nartificial intelligence applications characterized by high-impact applications\ngenerating text, images, and videos. However, these models usually ensue with\none critical challenge called hallucination: confident presentation of\ninaccurate or fabricated information. 
This problem attracts serious concern\nwhen these models are applied to specialized domains, including healthcare and\nlaw, where the accuracy and preciseness of information are absolute conditions.\nIn this paper, we propose EvoLLMs, an innovative framework inspired by\nEvolutionary Computation, which automates the generation of high-quality\nQuestion-answering (QA) datasets while minimizing hallucinations. EvoLLMs\nemploys genetic algorithms, mimicking evolutionary processes like selection,\nvariation, and mutation, to guide LLMs in generating accurate, contextually\nrelevant question-answer pairs. Comparative analysis shows that EvoLLMs\nconsistently outperforms human-generated datasets in key metrics such as Depth,\nRelevance, and Coverage, while nearly matching human performance in mitigating\nhallucinations. These results highlight EvoLLMs as a robust and efficient\nsolution for QA dataset generation, significantly reducing the time and\nresources required for manual curation.\n","authors":["Abdennour Boulesnane","Abdelhakim Souilah"],"pdf_url":"https://arxiv.org/pdf/2412.02790v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02788v1","updated":"2024-12-03T19:37:00Z","published":"2024-12-03T19:37:00Z","title":"Hybrid-SQuAD: Hybrid Scholarly Question Answering Dataset","summary":" Existing Scholarly Question Answering (QA) methods typically target\nhomogeneous data sources, relying solely on either text or Knowledge Graphs\n(KGs). However, scholarly information often spans heterogeneous sources,\nnecessitating the development of QA systems that can integrate information from\nmultiple heterogeneous data sources. To address this challenge, we introduce\nHybrid-SQuAD (Hybrid Scholarly Question Answering Dataset), a novel large-scale\nQA dataset designed to facilitate answering questions incorporating both text\nand KG facts. 
The dataset consists of 10.5K question-answer pairs generated by\na large language model, leveraging the KGs - DBLP and SemOpenAlex alongside\ncorresponding text from Wikipedia. In addition, we propose a RAG-based baseline\nhybrid QA model, achieving an exact match score of 69.65 on the Hybrid-SQuAD\ntest set.\n","authors":["Tilahun Abedissa Taffa","Debayan Baneerje","Yaregal Assabie","Ricardo Usbeck"],"pdf_url":"https://arxiv.org/pdf/2412.02788v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.08316v2","updated":"2024-12-03T19:21:15Z","published":"2024-10-10T19:06:39Z","title":"HyperDPO: Conditioned One-Shot Multi-Objective Fine-Tuning Framework","summary":" In LLM alignment and many other ML applications, one often faces the\nMulti-Objective Fine-Tuning (MOFT) problem, i.e. fine-tuning an existing model\nwith datasets labeled w.r.t. different objectives simultaneously. To address\nthe challenge, we propose the HyperDPO framework, a conditioned one-shot\nfine-tuning approach that extends the Direct Preference Optimization (DPO)\ntechnique, originally developed for efficient LLM alignment with preference\ndata, to accommodate the MOFT settings. By substituting the Bradley-Terry-Luce\nmodel in DPO with the Plackett-Luce model, our framework is capable of handling\na wide range of MOFT tasks that involve listwise ranking datasets. Compared\nwith previous approaches, HyperDPO enjoys an efficient one-shot training\nprocess for profiling the Pareto front of auxiliary objectives, and offers\npost-training control over trade-offs. Additionally, we propose a novel Hyper\nPrompt Tuning design, that conveys continuous importance weight across\nobjectives to transformer-based models without altering their architecture, and\ninvestigate the potential of temperature-conditioned networks for enhancing the\nflexibility of post-training control. 
We demonstrate the effectiveness and\nefficiency of the HyperDPO framework through its applications to various tasks,\nincluding Learning-to-Rank (LTR) and LLM alignment, highlighting its viability\nfor large-scale ML deployments.\n","authors":["Yinuo Ren","Tesi Xiao","Michael Shavlovsky","Lexing Ying","Holakou Rahmanian"],"pdf_url":"https://arxiv.org/pdf/2410.08316v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02775v1","updated":"2024-12-03T19:17:18Z","published":"2024-12-03T19:17:18Z","title":"Optimizing Large Language Models for Turkish: New Methodologies in\n Corpus Selection and Training","summary":" In this study, we develop and assess new corpus selection and training\nmethodologies to improve the effectiveness of Turkish language models.\nSpecifically, we adapted Large Language Model generated datasets and translated\nEnglish datasets into Turkish, integrating these resources into the training\nprocess. This approach led to substantial enhancements in model accuracy for\nboth few-shot and zero-shot learning scenarios. Furthermore, the merging of\nthese adapted models was found to markedly improve their performance. Human\nevaluative metrics, including task-specific performance assessments, further\ndemonstrated that these adapted models possess a greater aptitude for\ncomprehending the Turkish language and addressing logic-based queries. This\nresearch underscores the importance of refining corpus selection strategies to\noptimize the performance of multilingual models, particularly for\nunder-resourced languages like Turkish.\n","authors":["H. Toprak Kesgin","M. Kaan Yuce","Eren Dogan","M. Egemen Uzun","Atahan Uz","Elif Ince","Yusuf Erdem","Osama Shbib","Ahmed Zeer","M. 
Fatih Amasyali"],"pdf_url":"https://arxiv.org/pdf/2412.02775v1.pdf","comment":"2024 Innovations in Intelligent Systems and Applications Conference\n (ASYU)"},{"id":"http://arxiv.org/abs/2406.01297v3","updated":"2024-12-03T19:14:06Z","published":"2024-06-03T13:05:46Z","title":"When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of\n Self-Correction of LLMs","summary":" Self-correction is an approach to improving responses from large language\nmodels (LLMs) by refining the responses using LLMs during inference. Prior work\nhas proposed various self-correction frameworks using different sources of\nfeedback, including self-evaluation and external feedback. However, there is\nstill no consensus on the question of when LLMs can correct their own mistakes,\nas recent studies also report negative results. In this work, we critically\nsurvey broad papers and discuss the conditions required for successful\nself-correction. We first find that prior studies often do not define their\nresearch questions in detail and involve impractical frameworks or unfair\nevaluations that over-evaluate self-correction. To tackle these issues, we\ncategorize research questions in self-correction research and provide a\nchecklist for designing appropriate experiments. 
Our critical survey based on\nthe newly categorized research questions shows that (1) no prior work\ndemonstrates successful self-correction with feedback from prompted LLMs,\nexcept for studies in tasks that are exceptionally suited for self-correction,\n(2) self-correction works well in tasks that can use reliable external\nfeedback, and (3) large-scale fine-tuning enables self-correction.\n","authors":["Ryo Kamoi","Yusen Zhang","Nan Zhang","Jiawei Han","Rui Zhang"],"pdf_url":"https://arxiv.org/pdf/2406.01297v3.pdf","comment":"TACL 2024"},{"id":"http://arxiv.org/abs/2412.02760v1","updated":"2024-12-03T19:01:00Z","published":"2024-12-03T19:01:00Z","title":"Cosmos-LLaVA: Chatting with the Visual Cosmos-LLaVA: Görselle Sohbet\n Etmek","summary":" In this study, a Turkish visual instruction model was developed and various\nmodel architectures and dataset combinations were analysed to improve the\nperformance of this model. The Cosmos-LLaVA model, which is built by combining\ndifferent large language models and image coders, is designed to overcome the\ndeficiencies in the Turkish language. In the experiments, the effects of\nfine-tuning with various datasets on the model performance are analysed in\ndetail. The results show that model architecture and dataset selection have a\nsignificant impact on performance.\n Bu \\c{c}al{\\i}\\c{s}mada bir T\\\"urk\\c{c}e g\\\"orsel talimat modeli\ngeli\\c{s}tirilerek bu modelin performans{\\i}n{\\i} art{\\i}rmaya y\\\"onelik\n\\c{c}e\\c{s}itli model mimarileri ve veri k\\\"umesi kombinasyonlar{\\i}\nderinlemesine incelenmi\\c{s}tir. Farkl{\\i} b\\\"uy\\\"uk dil modelleri ve\ng\\\"or\\\"unt\\\"u kodlay{\\i}c{\\i}lar{\\i}n{\\i}n bir araya getirilmesiyle\nolu\\c{s}turulan Cosmos-LLaVA modeli, T\\\"urk\\c{c}e dilindeki eksiklikleri\ngidermeye y\\\"onelik olarak tasarlanm{\\i}\\c{s}t{\\i}r. 
Yap{\\i}lan deneylerde,\n\\c{c}e\\c{s}itli veri k\\\"umeleri ile yap{\\i}lan ince ayarlar{\\i}n model\nperformans{\\i}n{\\i} nas{\\i}l etkiledi\\u{g}i detayl{\\i} olarak ele\nal{\\i}nm{\\i}\\c{s}t{\\i}r. Sonu\\c{c}lar, model mimarisi ve veri k\\\"umesi\nse\\c{c}iminin performans \\\"uzerinde \\\"onemli bir etkiye sahip oldu\\u{g}unu\ng\\\"ostermektedir.\n","authors":["Ahmed Zeer","Eren Dogan","Yusuf Erdem","Elif Ince","Osama Shbib","M. Egemen Uzun","Atahan Uz","M. Kaan Yuce","H. Toprak Kesgin","M. Fatih Amasyali"],"pdf_url":"https://arxiv.org/pdf/2412.02760v1.pdf","comment":"in Turkish language, 2024 8th International Artificial Intelligence\n and Data Processing Symposium (IDAP)"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2412.02588v1","updated":"2024-12-03T17:17:27Z","published":"2024-12-03T17:17:27Z","title":"Explainable CTR Prediction via LLM Reasoning","summary":" Recommendation Systems have become integral to modern user experiences, but\nlack transparency in their decision-making processes. Existing explainable\nrecommendation methods are hindered by reliance on a post-hoc paradigm, wherein\nexplanation generators are trained independently of the underlying recommender\nmodels. This paradigm necessitates substantial human effort in data\nconstruction and raises concerns about explanation reliability. In this paper,\nwe present ExpCTR, a novel framework that integrates large language model based\nexplanation generation directly into the CTR prediction process. Inspired by\nrecent advances in reinforcement learning, we employ two carefully designed\nreward mechanisms, LC alignment, which ensures explanations reflect user\nintentions, and IC alignment, which maintains consistency with traditional\nID-based CTR models. Our approach incorporates an efficient training paradigm\nwith LoRA and a three-stage iterative process. 
ExpCTR circumvents the need for\nextensive explanation datasets while fostering synergy between CTR prediction\nand explanation generation. Experimental results demonstrate that ExpCTR\nsignificantly enhances both recommendation accuracy and interpretability across\nthree real-world datasets.\n","authors":["Xiaohan Yu","Li Zhang","Chong Chen"],"pdf_url":"https://arxiv.org/pdf/2412.02588v1.pdf","comment":"WSDM 2025"},{"id":"http://arxiv.org/abs/2412.00430v2","updated":"2024-12-03T15:43:49Z","published":"2024-11-30T10:56:30Z","title":"Predictive Models in Sequential Recommendations: Bridging Performance\n Laws with Data Quality Insights","summary":" Sequential Recommendation (SR) plays a critical role in predicting users'\nsequential preferences. Despite its growing prominence in various industries,\nthe increasing scale of SR models incurs substantial computational costs and\nunpredictability, challenging developers to manage resources efficiently. Under\nthis predicament, Scaling Laws have achieved significant success by examining\nthe loss as models scale up. However, there remains a disparity between loss\nand model performance, which is of greater concern in practical applications.\nMoreover, as data continues to expand, it incorporates repetitive and\ninefficient data. In response, we introduce the Performance Law for SR models,\nwhich aims to theoretically investigate and model the relationship between\nmodel performance and data quality. Specifically, we first fit the HR and NDCG\nmetrics to transformer-based SR models. Subsequently, we propose Approximate\nEntropy (ApEn) to assess data quality, presenting a more nuanced approach\ncompared to traditional data quantity metrics. 
Our method enables accurate\npredictions across various dataset scales and model sizes, demonstrating a\nstrong correlation in large SR models and offering insights into achieving\noptimal performance for any given model configuration.\n","authors":["Tingjia Shen","Hao Wang","Chuhan Wu","Jin Yao Chin","Wei Guo","Yong Liu","Huifeng Guo","Defu Lian","Ruiming Tang","Enhong Chen"],"pdf_url":"https://arxiv.org/pdf/2412.00430v2.pdf","comment":"12 pages, 5 figures"},{"id":"http://arxiv.org/abs/2310.09401v4","updated":"2024-12-03T14:10:01Z","published":"2023-10-13T20:53:50Z","title":"A Novel Approach to Comprehending Users' Preferences for Accurate\n Personalized News Recommendation","summary":" Personalized news recommendation aims to assist users in finding news\narticles that align with their interests, which plays a pivotal role in\nmitigating users' information overload problem. Although many recent works have\nbeen studied for better personalized news recommendation, the following\nchallenges should be explored more: (C1) Comprehending manifold intents coupled\nwithin a news article, (C2) Differentiating varying post-read preferences of\nnews articles, and (C3) Addressing the cold-start user problem. To tackle the\naforementioned challenges together, in this paper, we propose a novel\npersonalized news recommendation framework (CROWN) that employs (1)\ncategory-guided intent disentanglement for (C1), (2) consistency-based news\nrepresentation for (C2), and (3) GNN-enhanced hybrid user representation for\n(C3). Furthermore, we incorporate a category prediction into the training\nprocess of CROWN as an auxiliary task, which provides supplementary supervisory\nsignals to enhance intent disentanglement. 
Extensive experiments on two\nreal-world datasets reveal that (1) CROWN provides consistent performance\nimprovements over ten state-of-the-art news recommendation methods and (2) the\nproposed strategies significantly improve the accuracy of CROWN.\n","authors":["Yunyong Ko","Seongeun Ryu","Sang-Wook Kim"],"pdf_url":"https://arxiv.org/pdf/2310.09401v4.pdf","comment":"10 pages, 6 figures, 8 tables"},{"id":"http://arxiv.org/abs/2412.02415v1","updated":"2024-12-03T12:20:56Z","published":"2024-12-03T12:20:56Z","title":"Knowledge-Enhanced Conversational Recommendation via Transformer-based\n Sequential Modelling","summary":" In conversational recommender systems (CRSs), conversations usually involve a\nset of items and item-related entities or attributes, e.g., director is a\nrelated entity of a movie. These items and item-related entities are often\nmentioned along the development of a dialog, leading to potential sequential\ndependencies among them. However, most of existing CRSs neglect these potential\nsequential dependencies. In this article, we first propose a Transformer-based\nsequential conversational recommendation method, named TSCR, to model the\nsequential dependencies in the conversations to improve CRS. In TSCR, we\nrepresent conversations by items and the item-related entities, and construct\nuser sequences to discover user preferences by considering both the mentioned\nitems and item-related entities. Based on the constructed sequences, we deploy\na Cloze task to predict the recommended items along a sequence. Meanwhile, in\ncertain domains, knowledge graphs formed by the items and their related\nentities are readily available, which provide various different kinds of\nassociations among them. Given that TSCR does not benefit from such knowledge\ngraphs, we then propose a knowledge graph enhanced version of TSCR, called\nTSCRKG. 
In specific, we leverage the knowledge graph to offline initialize our\nmodel TSCRKG, and augment the user sequence of conversations (i.e., sequence of\nthe mentioned items and item-related entities in the conversation) with\nmulti-hop paths in the knowledge graph. Experimental results demonstrate that\nour TSCR model significantly outperforms state-of-the-art baselines, and the\nenhanced version TSCRKG further improves recommendation performance on top of\nTSCR.\n","authors":["Jie Zou","Aixin Sun","Cheng Long","Evangelos Kanoulas"],"pdf_url":"https://arxiv.org/pdf/2412.02415v1.pdf","comment":"Accepted by ACM TOIS"},{"id":"http://arxiv.org/abs/2412.02310v1","updated":"2024-12-03T09:27:46Z","published":"2024-12-03T09:27:46Z","title":"Active Learning via Classifier Impact and Greedy Selection for\n Interactive Image Retrieval","summary":" Active Learning (AL) is a user-interactive approach aimed at reducing\nannotation costs by selecting the most crucial examples to label. Although AL\nhas been extensively studied for image classification tasks, the specific\nscenario of interactive image retrieval has received relatively little\nattention. This scenario presents unique characteristics, including an open-set\nand class-imbalanced binary classification, starting with very few labeled\nsamples. We introduce a novel batch-mode Active Learning framework named GAL\n(Greedy Active Learning) that better copes with this application. It\nincorporates a new acquisition function for sample selection that measures the\nimpact of each unlabeled sample on the classifier. We further embed this\nstrategy in a greedy selection approach, better exploiting the samples within\neach batch. We evaluate our framework with both linear (SVM) and non-linear\nMLP/Gaussian Process classifiers. For the Gaussian Process case, we show a\ntheoretical guarantee on the greedy approximation. 
Finally, we assess our\nperformance for the interactive content-based image retrieval task on several\nbenchmarks and demonstrate its superiority over existing approaches and common\nbaselines. Code is available at https://github.com/barleah/GreedyAL.\n","authors":["Leah Bar","Boaz Lerner","Nir Darshan","Rami Ben-Ari"],"pdf_url":"https://arxiv.org/pdf/2412.02310v1.pdf","comment":"Accepted to Transactions on Machine Learning Research (TMLR)"},{"id":"http://arxiv.org/abs/2412.02295v1","updated":"2024-12-03T09:09:52Z","published":"2024-12-03T09:09:52Z","title":"CADMR: Cross-Attention and Disentangled Learning for Multimodal\n Recommender Systems","summary":" The increasing availability and diversity of multimodal data in recommender\nsystems offer new avenues for enhancing recommendation accuracy and user\nsatisfaction. However, these systems must contend with high-dimensional, sparse\nuser-item rating matrices, where reconstructing the matrix with only small\nsubsets of preferred items for each user poses a significant challenge. To\naddress this, we propose CADMR, a novel autoencoder-based multimodal\nrecommender system framework. CADMR leverages multi-head cross-attention\nmechanisms and Disentangled Learning to effectively integrate and utilize\nheterogeneous multimodal data in reconstructing the rating matrix. Our approach\nfirst disentangles modality-specific features while preserving their\ninterdependence, thereby learning a joint latent representation. The multi-head\ncross-attention mechanism is then applied to enhance user-item interaction\nrepresentations with respect to the learned multimodal item latent\nrepresentations. 
We evaluate CADMR on three benchmark datasets, demonstrating\nsignificant performance improvements over state-of-the-art methods.\n","authors":["Yasser Khalafaoui","Martino Lovisetto","Basarab Matei","Nistor Grozavu"],"pdf_url":"https://arxiv.org/pdf/2412.02295v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02290v1","updated":"2024-12-03T09:07:13Z","published":"2024-12-03T09:07:13Z","title":"Characterizing Information Shared by Participants to Coding Challenges:\n The Case of Advent of Code","summary":" Advent of Code (AoC from now on) is a popular coding challenge requiring to\nsolve programming puzzles for a variety of skill sets and levels. AoC follows\nthe advent calendar, therefore it is an annual challenge that lasts for 25\ndays. AoC participants usually post their solutions on social networks and\ndiscuss them online. These challenges are interesting to study since they could\nhighlight the adoption of new tools, the evolution of the developer community,\nor the technological requirements of well-known companies. For these reasons,\nwe first create a dataset of the 2019-2021 AoC editions containing the\ndiscussion threads made on the subreddit {\\tt /r/adventofcode}. Then, we\npropose a model based on stream graphs to best study this context, where we\nrepresent its most important actors through time: participants, comments, and\nprogramming languages. Thanks to our model, we investigate user participation,\nadoption of new programming languages during a challenge and between two of\nthem, and resiliency of programming languages based on a Stack Overflow survey.\nWe find that the top-used programming languages are almost the same in the\nthree years, pointing out their importance. Moreover, participants tend to keep\nthe same programming language for the whole challenge, while the ones attending\ntwo AoCs usually change it in the next one. 
Finally, we observe interesting\nresults about the programming languages that are ``Popular'' or ``Loved''\naccording to the Stack Overflow survey. Firstly, these are the ones adopted for\nthe longest time in an AoC edition, thanks to which users have a high chance of\nreaching the end of the challenge. Secondly, they are the most chosen when a\nparticipant decides to change programming language during the same challenge.\n","authors":["Francesco Cauteruccio","Enrico Corradini","Luca Virgili"],"pdf_url":"https://arxiv.org/pdf/2412.02290v1.pdf","comment":"10 pages, 7 figures"},{"id":"http://arxiv.org/abs/2409.12161v2","updated":"2024-12-03T05:26:10Z","published":"2024-09-18T17:25:31Z","title":"Generalized compression and compressive search of large datasets","summary":" The Big Data explosion has necessitated the development of search algorithms\nthat scale sub-linearly in time and memory.\n While compression algorithms and search algorithms do exist independently,\nfew algorithms offer both, and those which do are domain-specific.\n We present panCAKES, a novel approach to compressive search, i.e., a way to\nperform $k$-NN and $\\rho$-NN search on compressed data while only decompressing\na small, relevant, portion of the data.\n panCAKES assumes the manifold hypothesis and leverages the low-dimensional\nstructure of the data to compress and search it efficiently.\n panCAKES is generic over any distance function for which the distance between\ntwo points is proportional to the memory cost of storing an encoding of one in\nterms of the other.\n This property holds for many widely-used distance functions, e.g. string edit\ndistances (Levenshtein, Needleman-Wunsch, etc.) 
and set dissimilarity measures\n(Jaccard, Dice, etc.).\n We benchmark panCAKES on a variety of datasets, including genomic, proteomic,\nand set data.\n We compare compression ratios to gzip, and search performance between the\ncompressed and uncompressed versions of the same dataset.\n panCAKES achieves compression ratios close to those of gzip, while offering\nsub-linear time performance for $k$-NN and $\\rho$-NN search.\n We conclude that panCAKES is an efficient, general-purpose algorithm for\nexact compressive search on large datasets that obey the manifold hypothesis.\n We provide an open-source implementation of panCAKES in the Rust programming\nlanguage.\n","authors":["Morgan E. Prior","Thomas Howard III","Emily Light","Najib Ishaq","Noah M. Daniels"],"pdf_url":"https://arxiv.org/pdf/2409.12161v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01269v2","updated":"2024-12-03T04:37:03Z","published":"2024-12-02T08:35:54Z","title":"CPRM: A LLM-based Continual Pre-training Framework for Relevance\n Modeling in Commercial Search","summary":" Relevance modeling between queries and items stands as a pivotal component in\ncommercial search engines, directly affecting the user experience. Given the\nremarkable achievements of large language models (LLMs) in various natural\nlanguage processing (NLP) tasks, LLM-based relevance modeling is gradually\nbeing adopted within industrial search systems. Nevertheless, foundational LLMs\nlack domain-specific knowledge and do not fully exploit the potential of\nin-context learning. Furthermore, structured item text remains underutilized,\nand there is a shortage in the supply of corresponding queries and background\nknowledge. We thereby propose CPRM (Continual Pre-training for Relevance\nModeling), a framework designed for the continual pre-training of LLMs to\naddress these issues. 
Our CPRM framework includes three modules: 1) employing\nboth queries and multi-field item to jointly pre-train for enhancing domain\nknowledge, 2) applying in-context pre-training, a novel approach where LLMs are\npre-trained on a sequence of related queries or items, and 3) conducting\nreading comprehension on items to produce associated domain knowledge and\nbackground information (e.g., generating summaries and corresponding queries)\nto further strengthen LLMs. Results on offline experiments and online A/B\ntesting demonstrate that our model achieves convincing performance compared to\nstrong baselines.\n","authors":["Kaixin Wu","Yixin Ji","Zeyuan Chen","Qiang Wang","Cunxiang Wang","Hong Liu","Baijun Ji","Jia Xu","Zhongyi Liu","Jinjie Gu","Yuan Zhou","Linjian Mo"],"pdf_url":"https://arxiv.org/pdf/2412.01269v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02155v1","updated":"2024-12-03T04:29:27Z","published":"2024-12-03T04:29:27Z","title":"CausalMob: Causal Human Mobility Prediction with LLMs-derived Human\n Intentions toward Public Events","summary":" Large-scale human mobility exhibits spatial and temporal patterns that can\nassist policymakers in decision making. Although traditional prediction models\nattempt to capture these patterns, they often interfered by non-periodic public\nevents, such as disasters and occasional celebrations. Since regular human\nmobility patterns are heavily affected by these events, estimating their causal\neffects is critical to accurate mobility predictions. Although news articles\nprovide unique perspectives on these events in an unstructured format,\nprocessing is a challenge. In this study, we propose a causality-augmented\nprediction model, called \\textbf{CausalMob}, to analyze the causal effects of\npublic events. We first utilize large language models (LLMs) to extract human\nintentions from news articles and transform them into features that act as\ncausal treatments. 
Next, the model learns representations of spatio-temporal\nregional covariates from multiple data sources to serve as confounders for\ncausal inference. Finally, we present a causal effect estimation framework to\nensure event features remain independent of confounders during prediction.\nBased on large-scale real-world data, the experimental results show that the\nproposed model excels in human mobility prediction, outperforming\nstate-of-the-art models.\n","authors":["Xiaojie Yang","Hangli Ge","Jiawei Wang","Zipei Fan","Renhe Jiang","Ryosuke Shibasaki","Noboru Koshizuka"],"pdf_url":"https://arxiv.org/pdf/2412.02155v1.pdf","comment":"Accepted by KDD 2025"},{"id":"http://arxiv.org/abs/2412.02149v1","updated":"2024-12-03T04:09:36Z","published":"2024-12-03T04:09:36Z","title":"Leveraging Large Language Models for Comparative Literature\n Summarization with Reflective Incremental Mechanisms","summary":" In this paper, we introduce ChatCite, a novel method leveraging large\nlanguage models (LLMs) for generating comparative literature summaries. The\nability to summarize research papers with a focus on key comparisons between\nstudies is an essential task in academic research. Existing summarization\nmodels, while effective at generating concise summaries, fail to provide deep\ncomparative insights. ChatCite addresses this limitation by incorporating a\nmulti-step reasoning mechanism that extracts critical elements from papers,\nincrementally builds a comparative summary, and refines the output through a\nreflective memory process. We evaluate ChatCite on a custom dataset,\nCompLit-LongContext, consisting of 1000 research papers with annotated\ncomparative summaries. 
Experimental results show that ChatCite outperforms\nseveral baseline methods, including GPT-4, BART, T5, and CoT, across various\nautomatic evaluation metrics such as ROUGE and the newly proposed G-Score.\nHuman evaluation further confirms that ChatCite generates more coherent,\ninsightful, and fluent summaries compared to these baseline models. Our method\nprovides a significant advancement in automatic literature review generation,\noffering researchers a powerful tool for efficiently comparing and synthesizing\nscientific research.\n","authors":["Fernando Gabriela Garcia","Spencer Burns","Harrison Fuller"],"pdf_url":"https://arxiv.org/pdf/2412.02149v1.pdf","comment":"8 pages"},{"id":"http://arxiv.org/abs/2412.02142v1","updated":"2024-12-03T03:59:03Z","published":"2024-12-03T03:59:03Z","title":"Personalized Multimodal Large Language Models: A Survey","summary":" Multimodal Large Language Models (MLLMs) have become increasingly important\ndue to their state-of-the-art performance and ability to integrate multiple\ndata modalities, such as text, images, and audio, to perform complex tasks with\nhigh accuracy. This paper presents a comprehensive survey on personalized\nmultimodal large language models, focusing on their architecture, training\nmethods, and applications. We propose an intuitive taxonomy for categorizing\nthe techniques used to personalize MLLMs to individual users, and discuss the\ntechniques accordingly. Furthermore, we discuss how such techniques can be\ncombined or adapted when appropriate, highlighting their advantages and\nunderlying rationale. We also provide a succinct summary of personalization\ntasks investigated in existing research, along with the evaluation metrics\ncommonly used. Additionally, we summarize the datasets that are useful for\nbenchmarking personalized MLLMs. 
Finally, we outline critical open challenges.\nThis survey aims to serve as a valuable resource for researchers and\npractitioners seeking to understand and advance the development of personalized\nmultimodal large language models.\n","authors":["Junda Wu","Hanjia Lyu","Yu Xia","Zhehao Zhang","Joe Barrow","Ishita Kumar","Mehrnoosh Mirtaheri","Hongjie Chen","Ryan A. Rossi","Franck Dernoncourt","Tong Yu","Ruiyi Zhang","Jiuxiang Gu","Nesreen K. Ahmed","Yu Wang","Xiang Chen","Hanieh Deilamsalehy","Namyong Park","Sungchul Kim","Huanrui Yang","Subrata Mitra","Zhengmian Hu","Nedim Lipka","Dang Nguyen","Yue Zhao","Jiebo Luo","Julian McAuley"],"pdf_url":"https://arxiv.org/pdf/2412.02142v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02122v1","updated":"2024-12-03T03:20:40Z","published":"2024-12-03T03:20:40Z","title":"Improving Sequential Recommender Systems with Online and In-store User\n Behavior","summary":" Online e-commerce platforms have been extending in-store shopping, which\nallows users to keep the canonical online browsing and checkout experience\nwhile exploring in-store shopping. However, the growing transition between\nonline and in-store becomes a challenge to sequential recommender systems for\nfuture online interaction prediction due to the lack of holistic modeling of\nhybrid user behaviors (online and in-store). The challenges are twofold. First,\ncombining online and in-store user behavior data into a single data schema and\nsupporting multiple stages in the model life cycle (pre-training, training,\ninference, etc.) organically needs a new data pipeline design. Second, online\nrecommender systems, which solely rely on online user behavior sequences, must\nbe redesigned to support online and in-store user data as input under the\nsequential modeling setting. To overcome the first challenge, we propose a\nhybrid, omnichannel data pipeline to compile online and in-store user behavior\ndata by caching information from diverse data sources. 
Later, we introduce a\nmodel-agnostic encoder module to the sequential recommender system to interpret\nthe user in-store transaction and augment the modeling capacity for better\nonline interaction prediction given the hybrid user behavior.\n","authors":["Luyi Ma","Aashika Padmanabhan","Anjana Ganesh","Shengwei Tang","Jiao Chen","Xiaohan Li","Lalitesh Morishetti","Kaushiki Nag","Malay Patel","Jason Cho","Sushant Kumar","Kannan Achan"],"pdf_url":"https://arxiv.org/pdf/2412.02122v1.pdf","comment":"6 pages, IEEE BigData 2024 Workshop"},{"id":"http://arxiv.org/abs/2412.02043v1","updated":"2024-12-03T00:01:48Z","published":"2024-12-03T00:01:48Z","title":"Future of Information Retrieval Research in the Age of Generative AI","summary":" In the fast-evolving field of information retrieval (IR), the integration of\ngenerative AI technologies such as large language models (LLMs) is transforming\nhow users search for and interact with information. Recognizing this paradigm\nshift at the intersection of IR and generative AI (IR-GenAI), a visioning\nworkshop supported by the Computing Community Consortium (CCC) was held in July\n2024 to discuss the future of IR in the age of generative AI. This workshop\nconvened 44 experts in information retrieval, natural language processing,\nhuman-computer interaction, and artificial intelligence from academia,\nindustry, and government to explore how generative AI can enhance IR and vice\nversa, and to identify the major challenges and opportunities in this rapidly\nadvancing field.\n This report contains a summary of discussions as potentially important\nresearch topics and contains a list of recommendations for academics, industry\npractitioners, institutions, evaluation campaigns, and funding agencies.\n","authors":["James Allan","Eunsol Choi","Daniel P. 
Lopresti","Hamed Zamani"],"pdf_url":"https://arxiv.org/pdf/2412.02043v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.11598v2","updated":"2024-12-03T22:23:53Z","published":"2024-09-17T23:10:04Z","title":"Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented\n Generation","summary":" Many language models now enhance their responses with retrieval capabilities,\nleading to the widespread adoption of retrieval-augmented generation (RAG)\nsystems. However, despite retrieval being a core component of RAG, much of the\nresearch in this area overlooks the extensive body of work on fair ranking,\nneglecting the importance of considering all stakeholders involved. This paper\npresents the first systematic evaluation of RAG systems integrated with fair\nrankings. We focus specifically on measuring the fair exposure of each relevant\nitem across the rankings utilized by RAG systems (i.e., item-side fairness),\naiming to promote equitable growth for relevant item providers. To gain a deep\nunderstanding of the relationship between item-fairness, ranking quality, and\ngeneration quality in the context of RAG, we analyze nine different RAG systems\nthat incorporate fair rankings across seven distinct datasets. Our findings\nindicate that RAG systems with fair rankings can maintain a high level of\ngeneration quality and, in many cases, even outperform traditional RAG systems,\ndespite the general trend of a tradeoff between ensuring fairness and\nmaintaining system-effectiveness. We believe our insights lay the groundwork\nfor responsible and equitable RAG systems and open new avenues for future\nresearch. 
We publicly release our codebase and dataset at\nhttps://github.com/kimdanny/Fair-RAG.\n","authors":["To Eun Kim","Fernando Diaz"],"pdf_url":"https://arxiv.org/pdf/2409.11598v2.pdf","comment":"Top 5 Spotlight at AFME Workshop at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.02835v1","updated":"2024-12-03T21:00:10Z","published":"2024-12-03T21:00:10Z","title":"CAISSON: Concept-Augmented Inference Suite of Self-Organizing Neural\n Networks","summary":" We present CAISSON, a novel hierarchical approach to Retrieval-Augmented\nGeneration (RAG) that transforms traditional single-vector search into a\nmulti-view clustering framework. At its core, CAISSON leverages dual\nSelf-Organizing Maps (SOMs) to create complementary organizational views of the\ndocument space, where each view captures different aspects of document\nrelationships through specialized embeddings. The first view processes combined\ntext and metadata embeddings, while the second operates on metadata enriched\nwith concept embeddings, enabling a comprehensive multi-view analysis that\ncaptures both fine-grained semantic relationships and high-level conceptual\npatterns. This dual-view approach enables more nuanced document discovery by\ncombining evidence from different organizational perspectives. To evaluate\nCAISSON, we develop SynFAQA, a framework for generating synthetic financial\nanalyst notes and question-answer pairs that systematically tests different\naspects of information retrieval capabilities. Drawing on HotPotQA's\nmethodology for constructing multi-step reasoning questions, SynFAQA generates\ncontrolled test cases where each question is paired with the set of notes\ncontaining its ground-truth answer, progressing from simple single-entity\nqueries to complex multi-hop retrieval tasks involving multiple entities and\nconcepts. 
Our experimental results demonstrate substantial improvements over\nboth basic and enhanced RAG implementations, particularly for complex\nmulti-entity queries, while maintaining practical response times suitable for\ninteractive applications.\n","authors":["Igor Halperin"],"pdf_url":"https://arxiv.org/pdf/2412.02835v1.pdf","comment":"26 pages, 7 figures, 8 tables"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2412.02698v1","updated":"2024-12-03T18:59:51Z","published":"2024-12-03T18:59:51Z","title":"Scaling BERT Models for Turkish Automatic Punctuation and Capitalization\n Correction","summary":" This paper investigates the effectiveness of BERT based models for automated\npunctuation and capitalization corrections in Turkish texts across five\ndistinct model sizes. The models are designated as Tiny, Mini, Small, Medium,\nand Base. The design and capabilities of each model are tailored to address the\nspecific challenges of the Turkish language, with a focus on optimizing\nperformance while minimizing computational overhead. The study presents a\nsystematic comparison of the performance metrics precision, recall, and F1\nscore of each model, offering insights into their applicability in diverse\noperational contexts. The results demonstrate a significant improvement in text\nreadability and accuracy as model size increases, with the Base model achieving\nthe highest correction precision. 
This research provides a comprehensive guide\nfor selecting the appropriate model size based on specific user needs and\ncomputational resources, establishing a framework for deploying these models in\nreal-world applications to enhance the quality of written Turkish.\n","authors":["Abdulkader Saoud","Mahmut Alomeyr","Himmet Toprak Kesgin","Mehmet Fatih Amasyali"],"pdf_url":"https://arxiv.org/pdf/2412.02698v1.pdf","comment":"2024 Innovations in Intelligent Systems and Applications Conference\n (ASYU)"},{"id":"http://arxiv.org/abs/2412.02695v1","updated":"2024-12-03T18:59:35Z","published":"2024-12-03T18:59:35Z","title":"An ADHD Diagnostic Interface Based on EEG Spectrograms and Deep Learning\n Techniques","summary":" This paper introduces an innovative approach to\nAttention-deficit/hyperactivity disorder (ADHD) diagnosis by employing deep\nlearning (DL) techniques on electroencephalography (EEG) signals. This method\naddresses the limitations of current behavior-based diagnostic methods, which\noften lead to misdiagnosis and gender bias. By utilizing a publicly available\nEEG dataset and converting the signals into spectrograms, a Resnet-18\nconvolutional neural network (CNN) architecture was used to extract features\nfor ADHD classification. The model achieved a high precision, recall, and an\noverall F1 score of 0.9. Feature extraction highlighted significant brain\nregions (frontopolar, parietal, and occipital lobes) associated with ADHD.\nThese insights guided the creation of a three-part digital diagnostic system,\nfacilitating cost-effective and accessible ADHD screening, especially in school\nenvironments. This system enables earlier and more accurate identification of\nstudents at risk for ADHD, providing timely support to enhance their\ndevelopmental outcomes. 
This study showcases the potential of integrating EEG\nanalysis with DL to enhance ADHD diagnostics, presenting a viable alternative\nto traditional methods.\n","authors":["Medha Pappula","Syed Muhammad Anwar"],"pdf_url":"https://arxiv.org/pdf/2412.02695v1.pdf","comment":"Presented at SIPAIM 2024"},{"id":"http://arxiv.org/abs/2412.02685v1","updated":"2024-12-03T18:56:07Z","published":"2024-12-03T18:56:07Z","title":"T-REG: Preference Optimization with Token-Level Reward Regularization","summary":" Reinforcement learning from human feedback (RLHF) has been crucial in\naligning large language models (LLMs) with human values. Traditionally, RLHF\ninvolves generating responses to a query and using a reward model to assign a\nreward to the entire response. However, this approach faces challenges due to\nits reliance on a single, sparse reward, which makes it challenging for the\nmodel to identify which parts of the sequence contribute most significantly to\nthe final reward. Recent methods have attempted to address this limitation by\nintroducing token-level rewards. However, these methods often rely on either a\ntrained credit assignment model or AI annotators, raising concerns about the\nquality and reliability of the rewards. In this paper, we propose token-level\nreward regularization (T-REG), a novel approach that leverages both\nsequence-level and token-level rewards for preference optimization. Harnessing\nthe self-refinement capabilities of LLMs, our method uses contrastive prompting\nto enable LLMs to self-generate token-level rewards. These self-generated\nrewards then act as reward regularization, guiding the model to more\neffectively distribute sequence-level rewards across tokens. 
This facilitates\nbetter token-level credit assignment and enhances alignment performance.\nExperiments on the instruction following benchmarks, including Alpaca Eval 2\nand Arena-Hard, show that our method consistently outperforms baseline methods\nby up to 3.8% and 4.4%, respectively. We will release the code and models at\nhttps://github.com/wzhouad/T-REG.\n","authors":["Wenxuan Zhou","Shujian Zhang","Lingxiao Zhao","Tao Meng"],"pdf_url":"https://arxiv.org/pdf/2412.02685v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02682v1","updated":"2024-12-03T18:54:49Z","published":"2024-12-03T18:54:49Z","title":"The Asymptotic Behavior of Attention in Transformers","summary":" A key component of transformers is the attention mechanism orchestrating how\neach token influences the propagation of every other token through a\ntransformer. In this paper we provide a rigorous, mathematical analysis of the\nasymptotic properties of attention in transformers. Although we present several\nresults based on different assumptions, all of them point to the same\nconclusion, all tokens asymptotically converge to each other, a phenomenon that\nhas been empirically reported in the literature. 
Our findings are carefully\ncompared with existing theoretical results and illustrated by simulations and\nexperimental studies using the GPT-2 model.\n","authors":["Álvaro Rodríguez Abella","João Pedro Silvestre","Paulo Tabuada"],"pdf_url":"https://arxiv.org/pdf/2412.02682v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02676v1","updated":"2024-12-03T18:51:39Z","published":"2024-12-03T18:51:39Z","title":"Planning-Guided Diffusion Policy Learning for Generalizable Contact-Rich\n Bimanual Manipulation","summary":" Contact-rich bimanual manipulation involves precise coordination of two arms\nto change object states through strategically selected contacts and motions.\nDue to the inherent complexity of these tasks, acquiring sufficient\ndemonstration data and training policies that generalize to unseen scenarios\nremain a largely unresolved challenge. Building on recent advances in planning\nthrough contacts, we introduce Generalizable Planning-Guided Diffusion Policy\nLearning (GLIDE), an approach that effectively learns to solve contact-rich\nbimanual manipulation tasks by leveraging model-based motion planners to\ngenerate demonstration data in high-fidelity physics simulation. Through\nefficient planning in randomized environments, our approach generates\nlarge-scale and high-quality synthetic motion trajectories for tasks involving\ndiverse objects and transformations. We then train a task-conditioned diffusion\npolicy via behavior cloning using these demonstrations. To tackle the\nsim-to-real gap, we propose a set of essential design options in feature\nextraction, task representation, action prediction, and data augmentation that\nenable learning robust prediction of smooth action sequences and generalization\nto unseen scenarios. Through experiments in both simulation and the real world,\nwe demonstrate that our approach can enable a bimanual robotic system to\neffectively manipulate objects of diverse geometries, dimensions, and physical\nproperties. 
Website: https://glide-manip.github.io/\n","authors":["Xuanlin Li","Tong Zhao","Xinghao Zhu","Jiuguang Wang","Tao Pang","Kuan Fang"],"pdf_url":"https://arxiv.org/pdf/2412.02676v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.14052v2","updated":"2024-12-03T18:48:00Z","published":"2024-10-17T21:47:11Z","title":"From Isolated Conversations to Hierarchical Schemas: Dynamic Tree Memory\n Representation for LLMs","summary":" Recent advancements in large language models have significantly improved\ntheir context windows, yet challenges in effective long-term memory management\nremain. We introduce MemTree, an algorithm that leverages a dynamic,\ntree-structured memory representation to optimize the organization, retrieval,\nand integration of information, akin to human cognitive schemas. MemTree\norganizes memory hierarchically, with each node encapsulating aggregated\ntextual content, corresponding semantic embeddings, and varying abstraction\nlevels across the tree's depths. Our algorithm dynamically adapts this memory\nstructure by computing and comparing semantic embeddings of new and existing\ninformation to enrich the model's context-awareness. This approach allows\nMemTree to handle complex reasoning and extended interactions more effectively\nthan traditional memory augmentation methods, which often rely on flat lookup\ntables. 
Evaluations on benchmarks for multi-turn dialogue understanding and\ndocument question answering show that MemTree significantly enhances\nperformance in scenarios that demand structured memory management.\n","authors":["Alireza Rezazadeh","Zichao Li","Wei Wei","Yujia Bao"],"pdf_url":"https://arxiv.org/pdf/2410.14052v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02674v1","updated":"2024-12-03T18:47:26Z","published":"2024-12-03T18:47:26Z","title":"Mind the Gap: Examining the Self-Improvement Capabilities of Large\n Language Models","summary":" Self-improvement is a mechanism in Large Language Model (LLM) pre-training,\npost-training and test-time inference. We explore a framework where the model\nverifies its own outputs, filters or reweights data based on this verification,\nand distills the filtered data. Despite several empirical successes, a\nfundamental understanding is still lacking. In this work, we initiate a\ncomprehensive, modular and controlled study on LLM self-improvement. We provide\na mathematical formulation for self-improvement, which is largely governed by a\nquantity which we formalize as the generation-verification gap. Through\nexperiments with various model families and tasks, we discover a scaling\nphenomenon of self-improvement -- a variant of the generation-verification gap\nscales monotonically with the model pre-training flops. We also examine when\nself-improvement is possible, an iterative self-improvement procedure, and ways\nto improve its performance. 
Our findings not only advance understanding of LLM\nself-improvement with practical implications, but also open numerous avenues\nfor future research into its capabilities and boundaries.\n","authors":["Yuda Song","Hanlin Zhang","Carson Eisenach","Sham Kakade","Dean Foster","Udaya Ghai"],"pdf_url":"https://arxiv.org/pdf/2412.02674v1.pdf","comment":"41 pages, 19 figures"},{"id":"http://arxiv.org/abs/2411.17861v2","updated":"2024-12-03T18:38:45Z","published":"2024-11-26T20:22:31Z","title":"Accelerating Proximal Policy Optimization Learning Using Task Prediction\n for Solving Environments with Delayed Rewards","summary":" In this paper, we tackle the challenging problem of delayed rewards in\nreinforcement learning (RL). While Proximal Policy Optimization (PPO) has\nemerged as a leading Policy Gradient method, its performance can degrade under\ndelayed rewards. We introduce two key enhancements to PPO: a hybrid policy\narchitecture that combines an offline policy (trained on expert demonstrations)\nwith an online PPO policy, and a reward shaping mechanism using Time Window\nTemporal Logic (TWTL). The hybrid architecture leverages offline data\nthroughout training while maintaining PPO's theoretical guarantees. Building on\nthe monotonic improvement framework of Trust Region Policy Optimization (TRPO),\nwe prove that our approach ensures improvement over both the offline policy and\nprevious iterations, with a bounded performance gap of\n$(2\\varsigma\\gamma\\alpha^2)/(1-\\gamma)^2$, where $\\alpha$ is the mixing\nparameter, $\\gamma$ is the discount factor, and $\\varsigma$ bounds the expected\nadvantage. Additionally, we prove that our TWTL-based reward shaping preserves\nthe optimal policy of the original problem. TWTL enables formal translation of\ntemporal objectives into immediate feedback signals that guide learning. 
We\ndemonstrate the effectiveness of our approach through extensive experiments on\nan inverted pendulum and a lunar lander environments, showing improvements in\nboth learning speed and final performance compared to standard PPO and\noffline-only approaches.\n","authors":["Ahmad Ahmad","Mehdi Kermanshah","Kevin Leahy","Zachary Serlin","Ho Chit Siu","Makai Mann","Cristian-Ioan Vasile","Roberto Tron","Calin Belta"],"pdf_url":"https://arxiv.org/pdf/2411.17861v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.07636v2","updated":"2024-12-03T18:35:27Z","published":"2023-12-12T10:25:31Z","title":"Go beyond End-to-End Training: Boosting Greedy Local Learning with\n Context Supply","summary":" Traditional end-to-end (E2E) training of deep networks necessitates storing\nintermediate activations for back-propagation, resulting in a large memory\nfootprint on GPUs and restricted model parallelization. As an alternative,\ngreedy local learning partitions the network into gradient-isolated modules and\ntrains supervisely based on local preliminary losses, thereby providing\nasynchronous and parallel training methods that substantially reduce memory\ncost. However, empirical experiments reveal that as the number of segmentations\nof the gradient-isolated module increases, the performance of the local\nlearning scheme degrades substantially, severely limiting its expansibility. To\navoid this issue, we theoretically analyze the greedy local learning from the\nstandpoint of information theory and propose a ContSup scheme, which\nincorporates context supply between isolated modules to compensate for\ninformation loss. Experiments on benchmark datasets (i.e. CIFAR, SVHN, STL-10)\nachieve SOTA results and indicate that our proposed method can significantly\nimprove the performance of greedy local learning with minimal memory and\ncomputational overhead, allowing for the boost of the number of isolated\nmodules. 
Our codes are available at https://github.com/Tab-ct/ContSup.\n","authors":["Chengting Yu","Fengzhao Zhang","Hanzhi Ma","Aili Wang","Erping Li"],"pdf_url":"https://arxiv.org/pdf/2312.07636v2.pdf","comment":"9 figures, 12 tables"},{"id":"http://arxiv.org/abs/2406.01378v2","updated":"2024-12-03T18:32:15Z","published":"2024-06-03T14:42:31Z","title":"A Fast Convergence Theory for Offline Decision Making","summary":" This paper proposes the first generic fast convergence result in general\nfunction approximation for offline decision making problems, which include\noffline reinforcement learning (RL) and off-policy evaluation (OPE) as special\ncases. To unify different settings, we introduce a framework called Decision\nMaking with Offline Feedback (DMOF), which captures a wide range of offline\ndecision making problems. Within this framework, we propose a simple yet\npowerful algorithm called Empirical Decision with Divergence (EDD), whose upper\nbound can be termed as a coefficient named Empirical Offline Estimation\nCoefficient (EOEC). We show that EOEC is instance-dependent and actually\nmeasures the correlation of the problem. When assuming partial coverage in the\ndataset, EOEC will reduce in a rate of $1/N$ where $N$ is the size of the\ndataset, endowing EDD with a fast convergence guarantee. 
Finally, we complement\nthe above results with a lower bound in the DMOF framework, which further\ndemonstrates the soundness of our theory.\n","authors":["Chenjie Mao","Qiaosheng Zhang"],"pdf_url":"https://arxiv.org/pdf/2406.01378v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.01547v2","updated":"2024-12-03T18:28:43Z","published":"2024-11-03T12:42:16Z","title":"Decoupling Dark Knowledge via Block-wise Logit Distillation for\n Feature-level Alignment","summary":" Knowledge Distillation (KD), a learning manner with a larger teacher network\nguiding a smaller student network, transfers dark knowledge from the teacher to\nthe student via logits or intermediate features, with the aim of producing a\nwell-performed lightweight model. Notably, many subsequent feature-based KD\nmethods outperformed the earliest logit-based KD method and iteratively\ngenerated numerous state-of-the-art distillation methods. Nevertheless, recent\nwork has uncovered the potential of the logit-based method, bringing the simple\nKD form based on logits back into the limelight. Features or logits? They\npartially implement the KD with entirely distinct perspectives; therefore,\nchoosing between logits and features is not straightforward. This paper\nprovides a unified perspective of feature alignment in order to obtain a better\ncomprehension of their fundamental distinction. Inheriting the design\nphilosophy and insights of feature-based and logit-based methods, we introduce\na block-wise logit distillation framework to apply implicit logit-based feature\nalignment by gradually replacing teacher's blocks as intermediate\nstepping-stone models to bridge the gap between the student and the teacher.\nOur method obtains comparable or superior results to state-of-the-art\ndistillation methods. 
This paper demonstrates the great potential of combining\nlogit and features, and we hope it will inspire future research to revisit KD\nfrom a higher vantage point.\n","authors":["Chengting Yu","Fengzhao Zhang","Ruizhe Chen","Aili Wang","Zuozhu Liu","Shurun Tan","Er-Ping Li"],"pdf_url":"https://arxiv.org/pdf/2411.01547v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02646v1","updated":"2024-12-03T18:21:20Z","published":"2024-12-03T18:21:20Z","title":"Interpretable Generalized Additive Models for Datasets with Missing\n Values","summary":" Many important datasets contain samples that are missing one or more feature\nvalues. Maintaining the interpretability of machine learning models in the\npresence of such missing data is challenging. Singly or multiply imputing\nmissing values complicates the model's mapping from features to labels. On the\nother hand, reasoning on indicator variables that represent missingness\nintroduces a potentially large number of additional terms, sacrificing\nsparsity. We solve these problems with M-GAM, a sparse, generalized, additive\nmodeling approach that incorporates missingness indicators and their\ninteraction terms while maintaining sparsity through l0 regularization. 
We show\nthat M-GAM provides similar or superior accuracy to prior methods while\nsignificantly improving sparsity relative to either imputation or naive\ninclusion of indicator variables.\n","authors":["Hayden McTavish","Jon Donnelly","Margo Seltzer","Cynthia Rudin"],"pdf_url":"https://arxiv.org/pdf/2412.02646v1.pdf","comment":"Published in NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.02639v1","updated":"2024-12-03T18:11:37Z","published":"2024-12-03T18:11:37Z","title":"The Space Complexity of Approximating Logistic Loss","summary":" We provide space complexity lower bounds for data structures that approximate\nlogistic loss up to $\\epsilon$-relative error on a logistic regression problem\nwith data $\\mathbf{X} \\in \\mathbb{R}^{n \\times d}$ and labels $\\mathbf{y} \\in\n\\{-1,1\\}^d$. The space complexity of existing coreset constructions depend on a\nnatural complexity measure $\\mu_\\mathbf{y}(\\mathbf{X})$, first defined in\n(Munteanu, 2018). We give an $\\tilde{\\Omega}(\\frac{d}{\\epsilon^2})$ space\ncomplexity lower bound in the regime $\\mu_\\mathbf{y}(\\mathbf{X}) = O(1)$ that\nshows existing coresets are optimal in this regime up to lower order factors.\nWe also prove a general $\\tilde{\\Omega}(d\\cdot \\mu_\\mathbf{y}(\\mathbf{X}))$\nspace lower bound when $\\epsilon$ is constant, showing that the dependency on\n$\\mu_\\mathbf{y}(\\mathbf{X})$ is not an artifact of mergeable coresets. 
Finally,\nwe refute a prior conjecture that $\\mu_\\mathbf{y}(\\mathbf{X})$ is hard to\ncompute by providing an efficient linear programming formulation, and we\nempirically compare our algorithm to prior approximate methods.\n","authors":["Gregory Dexter","Petros Drineas","Rajiv Khanna"],"pdf_url":"https://arxiv.org/pdf/2412.02639v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2303.14284"},{"id":"http://arxiv.org/abs/2412.02631v1","updated":"2024-12-03T17:58:07Z","published":"2024-12-03T17:58:07Z","title":"Sharp-It: A Multi-view to Multi-view Diffusion Model for 3D Synthesis\n and Manipulation","summary":" Advancements in text-to-image diffusion models have led to significant\nprogress in fast 3D content creation. One common approach is to generate a set\nof multi-view images of an object, and then reconstruct it into a 3D model.\nHowever, this approach bypasses the use of a native 3D representation of the\nobject and is hence prone to geometric artifacts and limited in controllability\nand manipulation capabilities. An alternative approach involves native 3D\ngenerative models that directly produce 3D representations. These models,\nhowever, are typically limited in their resolution, resulting in lower quality\n3D objects. In this work, we bridge the quality gap between methods that\ndirectly generate 3D representations and ones that reconstruct 3D objects from\nmulti-view images. We introduce a multi-view to multi-view diffusion model\ncalled Sharp-It, which takes a 3D consistent set of multi-view images rendered\nfrom a low-quality object and enriches its geometric details and texture. The\ndiffusion model operates on the multi-view set in parallel, in the sense that\nit shares features across the generated views. A high-quality 3D model can then\nbe reconstructed from the enriched multi-view set. By leveraging the advantages\nof both 2D and 3D approaches, our method offers an efficient and controllable\nmethod for high-quality 3D content creation. 
We demonstrate that Sharp-It\nenables various 3D applications, such as fast synthesis, editing, and\ncontrolled generation, while attaining high-quality assets.\n","authors":["Yiftach Edelstein","Or Patashnik","Dana Cohen-Bar","Lihi Zelnik-Manor"],"pdf_url":"https://arxiv.org/pdf/2412.02631v1.pdf","comment":"Project page at https://yiftachede.github.io/Sharp-It/"},{"id":"http://arxiv.org/abs/2412.02623v1","updated":"2024-12-03T17:52:38Z","published":"2024-12-03T17:52:38Z","title":"The effect of priors on Learning with Restricted Boltzmann Machines","summary":" Restricted Boltzmann Machines (RBMs) are generative models designed to learn\nfrom data with a rich underlying structure. In this work, we explore a\nteacher-student setting where a student RBM learns from examples generated by a\nteacher RBM, with a focus on the effect of the unit priors on learning\nefficiency. We consider a parametric class of priors that interpolate between\ncontinuous (Gaussian) and binary variables. This approach models various\npossible choices of visible units, hidden units, and weights for both the\nteacher and student RBMs.\n By analyzing the phase diagram of the posterior distribution in both the\nBayes optimal and mismatched regimes, we demonstrate the existence of a triple\npoint that defines the critical dataset size necessary for learning through\ngeneralization. The critical size is strongly influenced by the properties of\nthe teacher, and thus the data, but is unaffected by the properties of the\nstudent RBM. 
Nevertheless, a prudent choice of student priors can facilitate\ntraining by expanding the so-called signal retrieval region, where the machine\ngeneralizes effectively.\n","authors":["Gianluca Manzan","Daniele Tantari"],"pdf_url":"https://arxiv.org/pdf/2412.02623v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02621v1","updated":"2024-12-03T17:50:19Z","published":"2024-12-03T17:50:19Z","title":"Medical Multimodal Foundation Models in Clinical Diagnosis and\n Treatment: Applications, Challenges, and Future Directions","summary":" Recent advancements in deep learning have significantly revolutionized the\nfield of clinical diagnosis and treatment, offering novel approaches to improve\ndiagnostic precision and treatment efficacy across diverse clinical domains,\nthus driving the pursuit of precision medicine. The growing availability of\nmulti-organ and multimodal datasets has accelerated the development of\nlarge-scale Medical Multimodal Foundation Models (MMFMs). These models, known\nfor their strong generalization capabilities and rich representational power,\nare increasingly being adapted to address a wide range of clinical tasks, from\nearly diagnosis to personalized treatment strategies. This review offers a\ncomprehensive analysis of recent developments in MMFMs, focusing on three key\naspects: datasets, model architectures, and clinical applications. 
We also\nexplore the challenges and opportunities in optimizing multimodal\nrepresentations and discuss how these advancements are shaping the future of\nhealthcare by enabling improved patient outcomes and more efficient clinical\nworkflows.\n","authors":["Kai Sun","Siyan Xue","Fuchun Sun","Haoran Sun","Yu Luo","Ling Wang","Siyuan Wang","Na Guo","Lei Liu","Tian Zhao","Xinzhou Wang","Lei Yang","Shuo Jin","Jun Yan","Jiahong Dong"],"pdf_url":"https://arxiv.org/pdf/2412.02621v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02313v4","updated":"2024-12-03T17:49:39Z","published":"2024-06-04T13:42:42Z","title":"Neural Thermodynamic Integration: Free Energies from Energy-based\n Diffusion Models","summary":" Thermodynamic integration (TI) offers a rigorous method for estimating\nfree-energy differences by integrating over a sequence of interpolating\nconformational ensembles. However, TI calculations are computationally\nexpensive and typically limited to coupling a small number of degrees of\nfreedom due to the need to sample numerous intermediate ensembles with\nsufficient conformational-space overlap. In this work, we propose to perform TI\nalong an alchemical pathway represented by a trainable neural network, which we\nterm Neural TI. Critically, we parametrize a time-dependent Hamiltonian\ninterpolating between the interacting and non-interacting systems, and optimize\nits gradient using a score matching objective. The ability of the resulting\nenergy-based diffusion model to sample all intermediate ensembles allows us to\nperform TI from a single reference calculation. 
We apply our method to\nLennard-Jones fluids, where we report accurate calculations of the excess\nchemical potential, demonstrating that Neural TI reproduces the underlying\nchanges in free energy without the need for simulations at interpolating\nHamiltonians.\n","authors":["Bálint Máté","François Fleuret","Tristan Bereau"],"pdf_url":"https://arxiv.org/pdf/2406.02313v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02617v1","updated":"2024-12-03T17:44:23Z","published":"2024-12-03T17:44:23Z","title":"Improving Dynamic Object Interactions in Text-to-Video Generation with\n AI Feedback","summary":" Large text-to-video models hold immense potential for a wide range of\ndownstream applications. However, these models struggle to accurately depict\ndynamic object interactions, often resulting in unrealistic movements and\nfrequent violations of real-world physics. One solution inspired by large\nlanguage models is to align generated outputs with desired outcomes using\nexternal feedback. This enables the model to refine its responses autonomously,\neliminating extensive manual data collection. In this work, we investigate the\nuse of feedback to enhance the object dynamics in text-to-video models. We aim\nto answer a critical question: what types of feedback, paired with which\nspecific self-improvement algorithms, can most effectively improve text-video\nalignment and realistic object interactions? We begin by deriving a unified\nprobabilistic objective for offline RL finetuning of text-to-video models. This\nperspective highlights how design elements in existing algorithms like KL\nregularization and policy projection emerge as specific choices within a\nunified framework. We then use derived methods to optimize a set of text-video\nalignment metrics (e.g., CLIP scores, optical flow), but notice that they often\nfail to align with human perceptions of generation quality. 
To address this\nlimitation, we propose leveraging vision-language models to provide more\nnuanced feedback specifically tailored to object dynamics in videos. Our\nexperiments demonstrate that our method can effectively optimize a wide variety\nof rewards, with binary AI feedback driving the most significant improvements\nin video quality for dynamic interactions, as confirmed by both AI and human\nevaluations. Notably, we observe substantial gains when using reward signals\nderived from AI feedback, particularly in scenarios involving complex\ninteractions between multiple objects and realistic depictions of objects\nfalling.\n","authors":["Hiroki Furuta","Heiga Zen","Dale Schuurmans","Aleksandra Faust","Yutaka Matsuo","Percy Liang","Sherry Yang"],"pdf_url":"https://arxiv.org/pdf/2412.02617v1.pdf","comment":"Website: https://sites.google.com/view/aif-dynamic-t2v/"},{"id":"http://arxiv.org/abs/2412.02609v1","updated":"2024-12-03T17:40:26Z","published":"2024-12-03T17:40:26Z","title":"Wasserstein Markets for Differentially-Private Data","summary":" Data is an increasingly vital component of decision making processes across\nindustries. However, data access raises privacy concerns motivating the need\nfor privacy-preserving techniques such as differential privacy. Data markets\nprovide a means to enable wider access as well as determine the appropriate\nprivacy-utility trade-off. Existing data market frameworks either require a\ntrusted third party to perform computationally expensive valuations or are\nunable to capture the combinatorial nature of data value and do not\nendogenously model the effect of differential privacy. This paper addresses\nthese shortcomings by proposing a valuation mechanism based on the Wasserstein\ndistance for differentially-private data, and corresponding procurement\nmechanisms by leveraging incentive mechanism design theory, for task-agnostic\ndata procurement, and task-specific procurement co-optimisation. 
The mechanisms\nare reformulated into tractable mixed-integer second-order cone programs, which\nare validated with numerical studies.\n","authors":["Saurab Chhachhi","Fei Teng"],"pdf_url":"https://arxiv.org/pdf/2412.02609v1.pdf","comment":"35 pages, 15 figures"},{"id":"http://arxiv.org/abs/2412.02605v1","updated":"2024-12-03T17:34:50Z","published":"2024-12-03T17:34:50Z","title":"Interpretable Company Similarity with Sparse Autoencoders","summary":" Determining company similarity is a vital task in finance, underpinning\nhedging, risk management, portfolio diversification, and more. Practitioners\noften rely on sector and industry classifications to gauge similarity, such as\nSIC-codes and GICS-codes, the former being used by the U.S. Securities and\nExchange Commission (SEC), and the latter widely used by the investment\ncommunity. Clustering embeddings of company descriptions has been proposed as a\npotential technique for determining company similarity, but the lack of\ninterpretability in token embeddings poses a significant barrier to adoption in\nhigh-stakes contexts. Sparse Autoencoders have shown promise in enhancing the\ninterpretability of Large Language Models by decomposing LLM activations into\ninterpretable features. In this paper, we explore the use of SAE features in\nmeasuring company similarity and benchmark them against (1) SIC codes and (2)\nMajor Group codes. 
We conclude that SAE features can reproduce and even surpass\nsector classifications in quantifying fundamental characteristics of companies,\nevaluated by the correlation of monthly returns, a proxy for similarity, and\nPnL from cointegration.\n","authors":["Marco Molinari","Vladimir Tregubiak","Victor Shao","Abhimanyu Pandey","Mateusz Mikolajczak","Sebastião Kuznetsov Ryder Torres Pereira"],"pdf_url":"https://arxiv.org/pdf/2412.02605v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02602v1","updated":"2024-12-03T17:32:47Z","published":"2024-12-03T17:32:47Z","title":"CEGI: Measuring the trade-off between efficiency and carbon emissions\n for SLMs and VLMs","summary":" This paper analyzes the performance of Small Language Models (SLMs) and\nVision Language Models (VLMs) and evaluates the trade-off between model\nperformance and carbon emissions across 4 essential tasks: Image Captioning,\nVisual Question Answering (VQA), Dialogue Summarization and Text-to-SQL\nconversion. Various SLMs and VLMs belonging to the Qwen and LLaMA architecture\nfamily are chosen and variants based on model size in terms of the number of\nparameters, quantization level and fine-tuning parameters are evaluated. The\nmodel variant's performance and carbon emissions are calculated. To quantify\nthe trade-off between model performance and carbon emissions, we introduce a\nnovel metric called CEGI (Carbon Efficient Gain Index). This metric represents\nthe carbon emission per unit percentage gain per million trainable parameters .\nThis metric provides a normalized measure to compare model's efficiency in\nterms of performance improvement relative to their environmental cost. The\nexperiment's outcome demonstrates that fine-tuning SLMs and VLMs can achieve\nperformance levels comparable to Large Language Models (LLMs) while producing\nsignificantly less carbon emissions. 
Our findings suggest that the marginal\ngains in accuracy from larger models do not justify the substantial increase in\ncarbon emissions. Leveraging lower-bit quantization levels, the proposed metric\nfurther enhances energy efficiency without compromising performance. This study\nhighlights balancing high performance and environmental sustainability. It\noffers a valuable metric for selecting models suitable for\nenvironmentally-friendly AI development.\n","authors":["Abhas Kumar","Kapil Pathak","Rajesh Kavuru","Prabhakar Srinivasan"],"pdf_url":"https://arxiv.org/pdf/2412.02602v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02596v1","updated":"2024-12-03T17:29:00Z","published":"2024-12-03T17:29:00Z","title":"Class-wise Autoencoders Measure Classification Difficulty And Detect\n Label Mistakes","summary":" We introduce a new framework for analyzing classification datasets based on\nthe ratios of reconstruction errors between autoencoders trained on individual\nclasses. This analysis framework enables efficient characterization of datasets\non the sample, class, and entire dataset levels. We define reconstruction error\nratios (RERs) that probe classification difficulty and allow its decomposition\ninto (1) finite sample size and (2) Bayes error and decision-boundary\ncomplexity. Through systematic study across 19 popular visual datasets, we find\nthat our RER-based dataset difficulty probe strongly correlates with error rate\nfor state-of-the-art (SOTA) classification models. By interpreting sample-level\nclassification difficulty as a label mistakenness score, we further find that\nRERs achieve SOTA performance on mislabel detection tasks on hard datasets\nunder symmetric and asymmetric label noise. Our code is publicly available at\nhttps://github.com/voxel51/reconstruction-error-ratios.\n","authors":["Jacob Marks","Brent A. Griffin","Jason J. 
Corso"],"pdf_url":"https://arxiv.org/pdf/2412.02596v1.pdf","comment":"30 pages, 18 figures"},{"id":"http://arxiv.org/abs/2409.06219v4","updated":"2024-12-03T17:23:07Z","published":"2024-09-10T05:05:34Z","title":"Denoising: A Powerful Building-Block for Imaging, Inverse Problems, and\n Machine Learning","summary":" Denoising, the process of reducing random fluctuations in a signal to\nemphasize essential patterns, has been a fundamental problem of interest since\nthe dawn of modern scientific inquiry. Recent denoising techniques,\nparticularly in imaging, have achieved remarkable success, nearing theoretical\nlimits by some measures. Yet, despite tens of thousands of research papers, the\nwide-ranging applications of denoising beyond noise removal have not been fully\nrecognized. This is partly due to the vast and diverse literature, making a\nclear overview challenging.\n This paper aims to address this gap. We present a clarifying perspective on\ndenoisers, their structure, and desired properties. We emphasize the increasing\nimportance of denoising and showcase its evolution into an essential building\nblock for complex tasks in imaging, inverse problems, and machine learning.\nDespite its long history, the community continues to uncover unexpected and\ngroundbreaking uses for denoising, further solidifying its place as a\ncornerstone of scientific and engineering practice.\n","authors":["Peyman Milanfar","Mauricio Delbracio"],"pdf_url":"https://arxiv.org/pdf/2409.06219v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.13846v4","updated":"2024-12-03T17:22:01Z","published":"2024-04-22T03:05:19Z","title":"Filtered Direct Preference Optimization","summary":" Reinforcement learning from human feedback (RLHF) plays a crucial role in\naligning language models with human preferences. While the significance of\ndataset quality is generally recognized, explicit investigations into its\nimpact within the RLHF framework, to our knowledge, have been limited. 
This\npaper addresses the issue of text quality within the preference dataset by\nfocusing on direct preference optimization (DPO), an increasingly adopted\nreward-model-free RLHF method. We confirm that text quality significantly\ninfluences the performance of models optimized with DPO more than those\noptimized with reward-model-based RLHF. Building on this new insight, we\npropose an extension of DPO, termed filtered direct preference optimization\n(fDPO). fDPO uses a trained reward model to monitor the quality of texts within\nthe preference dataset during DPO training. Samples of lower quality are\ndiscarded based on comparisons with texts generated by the model being\noptimized, resulting in a more accurate dataset. Experimental results\ndemonstrate that fDPO enhances the final model performance. Our code is\navailable at https://github.com/CyberAgentAILab/filtered-dpo.\n","authors":["Tetsuro Morimura","Mitsuki Sakamoto","Yuu Jinnai","Kenshi Abe","Kaito Ariu"],"pdf_url":"https://arxiv.org/pdf/2404.13846v4.pdf","comment":"EMNLP 2024"},{"id":"http://arxiv.org/abs/2412.00177v2","updated":"2024-12-03T17:21:41Z","published":"2024-11-29T18:59:11Z","title":"LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene\n Relighting","summary":" We introduce LumiNet, a novel architecture that leverages generative models\nand latent intrinsic representations for effective lighting transfer. Given a\nsource image and a target lighting image, LumiNet synthesizes a relit version\nof the source scene that captures the target's lighting. Our approach makes two\nkey contributions: a data curation strategy from the StyleGAN-based relighting\nmodel for our training, and a modified diffusion-based ControlNet that\nprocesses both latent intrinsic properties from the source image and latent\nextrinsic properties from the target image. 
We further improve lighting\ntransfer through a learned adaptor (MLP) that injects the target's latent\nextrinsic properties via cross-attention and fine-tuning.\n Unlike traditional ControlNet, which generates images with conditional maps\nfrom a single scene, LumiNet processes latent representations from two\ndifferent images - preserving geometry and albedo from the source while\ntransferring lighting characteristics from the target. Experiments demonstrate\nthat our method successfully transfers complex lighting phenomena including\nspecular highlights and indirect illumination across scenes with varying\nspatial layouts and materials, outperforming existing approaches on challenging\nindoor scenes using only images as input.\n","authors":["Xiaoyan Xing","Konrad Groh","Sezer Karaoglu","Theo Gevers","Anand Bhattad"],"pdf_url":"https://arxiv.org/pdf/2412.00177v2.pdf","comment":"Project page: https://luminet-relight.github.io"},{"id":"http://arxiv.org/abs/2412.02578v1","updated":"2024-12-03T17:04:14Z","published":"2024-12-03T17:04:14Z","title":"Private Linear Regression with Differential Privacy and PAC Privacy","summary":" Linear regression is a fundamental tool for statistical analysis, which has\nmotivated the development of linear regression methods that satisfy provable\nprivacy guarantees so that the learned model reveals little about any one data\npoint used to construct it. Most existing privacy-preserving linear regression\nmethods rely on the well-established framework of differential privacy, while\nthe newly proposed PAC Privacy has not yet been explored in this context. 
In\nthis paper, we systematically compare linear regression models trained with\ndifferential privacy and PAC privacy across three real-world datasets,\nobserving several key findings that impact the performance of\nprivacy-preserving linear regression.\n","authors":["Hillary Yang"],"pdf_url":"https://arxiv.org/pdf/2412.02578v1.pdf","comment":"8 pages, 6 figures"},{"id":"http://arxiv.org/abs/2409.05305v2","updated":"2024-12-03T17:03:57Z","published":"2024-09-09T03:26:07Z","title":"Closed-Form Interpretation of Neural Network Latent Spaces with Symbolic\n Gradients","summary":" It has been demonstrated in many scientific fields that artificial neural\nnetworks like autoencoders or Siamese networks encode meaningful concepts in\ntheir latent spaces. However, there does not exist a comprehensive framework\nfor retrieving this information in a human-readable form without prior\nknowledge. In order to extract these concepts, we introduce a framework for\nfinding closed-form interpretations of neurons in latent spaces of artificial\nneural networks. The interpretation framework is based on embedding trained\nneural networks into an equivalence class of functions that encode the same\nconcept. We interpret these neural networks by finding an intersection between\nthe equivalence class and human-readable equations defined by a symbolic search\nspace. The approach is demonstrated by retrieving invariants of matrices and\nconserved quantities of dynamical systems from latent spaces of Siamese neural\nnetworks.\n","authors":["Zakaria Patel","Sebastian J. 
Wetzel"],"pdf_url":"https://arxiv.org/pdf/2409.05305v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02570v1","updated":"2024-12-03T16:55:27Z","published":"2024-12-03T16:55:27Z","title":"TAB-Fields: A Maximum Entropy Framework for Mission-Aware Adversarial\n Planning","summary":" Autonomous agents operating in adversarial scenarios face a fundamental\nchallenge: while they may know their adversaries' high-level objectives, such\nas reaching specific destinations within time constraints, the exact policies\nthese adversaries will employ remain unknown. Traditional approaches address\nthis challenge by treating the adversary's state as a partially observable\nelement, leading to a formulation as a Partially Observable Markov Decision\nProcess (POMDP). However, the induced belief-space dynamics in a POMDP require\nknowledge of the system's transition dynamics, which, in this case, depend on\nthe adversary's unknown policy. Our key observation is that while an\nadversary's exact policy is unknown, their behavior is necessarily constrained\nby their mission objectives and the physical environment, allowing us to\ncharacterize the space of possible behaviors without assuming specific\npolicies. In this paper, we develop Task-Aware Behavior Fields (TAB-Fields), a\nrepresentation that captures adversary state distributions over time by\ncomputing the most unbiased probability distribution consistent with known\nconstraints. We construct TAB-Fields by solving a constrained optimization\nproblem that minimizes additional assumptions about adversary behavior beyond\nmission and environmental requirements. We integrate TAB-Fields with standard\nplanning algorithms by introducing TAB-conditioned POMCP, an adaptation of\nPartially Observable Monte Carlo Planning. 
Through experiments in simulation\nwith underwater robots and hardware implementations with ground robots, we\ndemonstrate that our approach achieves superior performance compared to\nbaselines that either assume specific adversary policies or neglect mission\nconstraints altogether. Evaluation videos and code are available at\nhttps://tab-fields.github.io.\n","authors":["Gokul Puthumanaillam","Jae Hyuk Song","Nurzhan Yesmagambet","Shinkyu Park","Melkior Ornik"],"pdf_url":"https://arxiv.org/pdf/2412.02570v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02548v1","updated":"2024-12-03T16:41:18Z","published":"2024-12-03T16:41:18Z","title":"Plug-and-Play Half-Quadratic Splitting for Ptychography","summary":" Ptychography is a coherent diffraction imaging method that uses phase\nretrieval techniques to reconstruct complex-valued images. It achieves this by\nsequentially illuminating overlapping regions of a sample with a coherent beam\nand recording the diffraction pattern. Although this addresses traditional\nimaging system challenges, it is computationally intensive and highly sensitive\nto noise, especially with reduced illumination overlap. Data-driven\nregularisation techniques have been applied in phase retrieval to improve\nreconstruction quality. In particular, plug-and-play (PnP) offers flexibility\nby integrating data-driven denoisers as implicit priors. In this work, we\npropose a half-quadratic splitting framework for using PnP and other\ndata-driven priors for ptychography. 
We evaluate our method both on natural\nimages and real test objects to validate its effectiveness for ptychographic\nimage reconstruction.\n","authors":["Alexander Denker","Johannes Hertrich","Zeljko Kereta","Silvia Cipiccia","Ecem Erin","Simon Arridge"],"pdf_url":"https://arxiv.org/pdf/2412.02548v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02546v1","updated":"2024-12-03T16:39:01Z","published":"2024-12-03T16:39:01Z","title":"Fractional Order Distributed Optimization","summary":" Distributed optimization is fundamental to modern machine learning\napplications like federated learning, but existing methods often struggle with\nill-conditioned problems and face stability-versus-speed tradeoffs. We\nintroduce fractional order distributed optimization (FrODO); a\ntheoretically-grounded framework that incorporates fractional-order memory\nterms to enhance convergence properties in challenging optimization landscapes.\nOur approach achieves provable linear convergence for any strongly connected\nnetwork. Through empirical validation, our results suggest that FrODO achieves\nup to 4 times faster convergence versus baselines on ill-conditioned problems\nand 2-3 times speedup in federated neural network training, while maintaining\nstability and theoretical guarantees.\n","authors":["Andrei Lixandru","Marcel van Gerven","Sergio Pequito"],"pdf_url":"https://arxiv.org/pdf/2412.02546v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02542v1","updated":"2024-12-03T16:34:49Z","published":"2024-12-03T16:34:49Z","title":"Unveiling Concept Attribution in Diffusion Models","summary":" Diffusion models have shown remarkable abilities in generating realistic and\nhigh-quality images from text prompts. However, a trained model remains\nblack-box; little do we know about the role of its components in exhibiting a\nconcept such as objects or styles. 
Recent works employ causal tracing to\nlocalize layers storing knowledge in generative models without showing how\nthose layers contribute to the target concept. In this work, we approach the\nmodel interpretability problem from a more general perspective and pose a\nquestion: \\textit{``How do model components work jointly to demonstrate\nknowledge?''}. We adapt component attribution to decompose diffusion models,\nunveiling how a component contributes to a concept. Our framework allows\neffective model editing, in particular, we can erase a concept from diffusion\nmodels by removing positive components while remaining knowledge of other\nconcepts. Surprisingly, we also show there exist components that contribute\nnegatively to a concept, which has not been discovered in the knowledge\nlocalization approach. Experimental results confirm the role of positive and\nnegative components pinpointed by our framework, depicting a complete view of\ninterpreting generative models. Our code is available at\n\\url{https://github.com/mail-research/CAD-attribution4diffusion}\n","authors":["Quang H. Nguyen","Hoang Phan","Khoa D. Doan"],"pdf_url":"https://arxiv.org/pdf/2412.02542v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02538v1","updated":"2024-12-03T16:32:19Z","published":"2024-12-03T16:32:19Z","title":"On the Privacy, Security, and Trustworthy for Distributed Wireless Large\n AI Model (WLAM)","summary":" Combining wireless communication with large artificial intelligence (AI)\nmodels can open up a myriad of novel application scenarios. In sixth generation\n(6G) networks, ubiquitous communication and computing resources allow large AI\nmodels to serve democratic large AI models-related services to enable real-time\napplications like autonomous vehicles, smart cities, and Internet of Things\n(IoT) ecosystems. However, the security considerations and sustainable\ncommunication resources limit the deployment of large AI models over\ndistributed wireless networks. 
This paper provides a comprehensive overview of\nprivacy, security, and trustworthy for distributed wireless large AI model\n(WLAM). In particular, the detailed privacy and security are analysis for\ndistributed WLAM is fist revealed. The classifications and theoretical findings\nabout privacy and security in distributed WLAM are discussed. Then the\ntrustworthy and ethics for implementing distributed WLAM are described.\nFinally, the comprehensive applications of distributed WLAM is provided in the\naspect of electromagnetic signal processing.\n","authors":["Zhaohui Yang","Wei Xu","Le Liang","Yuanhao Cui","Zhijin Qin","Merouane Debbah"],"pdf_url":"https://arxiv.org/pdf/2412.02538v1.pdf","comment":"12 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.02535v1","updated":"2024-12-03T16:26:56Z","published":"2024-12-03T16:26:56Z","title":"Defending Against Diverse Attacks in Federated Learning Through\n Consensus-Based Bi-Level Optimization","summary":" Adversarial attacks pose significant challenges in many machine learning\napplications, particularly in the setting of distributed training and federated\nlearning, where malicious agents seek to corrupt the training process with the\ngoal of jeopardizing and compromising the performance and reliability of the\nfinal models. In this paper, we address the problem of robust federated\nlearning in the presence of such attacks by formulating the training task as a\nbi-level optimization problem. We conduct a theoretical analysis of the\nresilience of consensus-based bi-level optimization (CB$^2$O), an interacting\nmulti-particle metaheuristic optimization method, in adversarial settings.\nSpecifically, we provide a global convergence analysis of CB$^2$O in mean-field\nlaw in the presence of malicious agents, demonstrating the robustness of\nCB$^2$O against a diverse range of attacks. Thereby, we offer insights into how\nspecific hyperparameter choices enable to mitigate adversarial effects. 
On the\npractical side, we extend CB$^2$O to the clustered federated learning setting\nby proposing FedCB$^2$O, a novel interacting multi-particle system, and design\na practical algorithm that addresses the demands of real-world applications.\nExtensive experiments demonstrate the robustness of the FedCB$^2$O algorithm\nagainst label-flipping attacks in decentralized clustered federated learning\nscenarios, showcasing its effectiveness in practical contexts.\n","authors":["Nicolás García Trillos","Aditya Kumar Akash","Sixu Li","Konstantin Riedl","Yuhua Zhu"],"pdf_url":"https://arxiv.org/pdf/2412.02535v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.10578v5","updated":"2024-12-03T16:26:09Z","published":"2024-10-14T14:52:23Z","title":"Burning RED: Unlocking Subtask-Driven Reinforcement Learning and\n Risk-Awareness in Average-Reward Markov Decision Processes","summary":" Average-reward Markov decision processes (MDPs) provide a foundational\nframework for sequential decision-making under uncertainty. However,\naverage-reward MDPs have remained largely unexplored in reinforcement learning\n(RL) settings, with the majority of RL-based efforts having been allocated to\nepisodic and discounted MDPs. In this work, we study a unique structural\nproperty of average-reward MDPs and utilize it to introduce Reward-Extended\nDifferential (or RED) reinforcement learning: a novel RL framework that can be\nused to effectively and efficiently solve various subtasks simultaneously in\nthe average-reward setting. We introduce a family of RED learning algorithms\nfor prediction and control, including proven-convergent algorithms for the\ntabular case. 
We then showcase the power of these algorithms by demonstrating\nhow they can be used to learn a policy that optimizes, for the first time, the\nwell-known conditional value-at-risk (CVaR) risk measure in a fully-online\nmanner, without the use of an explicit bi-level optimization scheme or an\naugmented state-space.\n","authors":["Juan Sebastian Rojas","Chi-Guhn Lee"],"pdf_url":"https://arxiv.org/pdf/2410.10578v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02529v1","updated":"2024-12-03T16:21:53Z","published":"2024-12-03T16:21:53Z","title":"Active learning of neural population dynamics using two-photon\n holographic optogenetics","summary":" Recent advances in techniques for monitoring and perturbing neural\npopulations have greatly enhanced our ability to study circuits in the brain.\nIn particular, two-photon holographic optogenetics now enables precise\nphotostimulation of experimenter-specified groups of individual neurons, while\nsimultaneous two-photon calcium imaging enables the measurement of ongoing and\ninduced activity across the neural population. Despite the enormous space of\npotential photostimulation patterns and the time-consuming nature of\nphotostimulation experiments, very little algorithmic work has been done to\ndetermine the most effective photostimulation patterns for identifying the\nneural population dynamics. Here, we develop methods to efficiently select\nwhich neurons to stimulate such that the resulting neural responses will best\ninform a dynamical model of the neural population activity. Using neural\npopulation responses to photostimulation in mouse motor cortex, we demonstrate\nthe efficacy of a low-rank linear dynamical systems model, and develop an\nactive learning procedure which takes advantage of low-rank structure to\ndetermine informative photostimulation patterns. 
We demonstrate our approach on\nboth real and synthetic data, obtaining in some cases as much as a two-fold\nreduction in the amount of data required to reach a given predictive power. Our\nactive stimulation design method is based on a novel active learning procedure\nfor low-rank regression, which may be of independent interest.\n","authors":["Andrew Wagenmaker","Lu Mi","Marton Rozsa","Matthew S. Bull","Karel Svoboda","Kayvon Daie","Matthew D. Golub","Kevin Jamieson"],"pdf_url":"https://arxiv.org/pdf/2412.02529v1.pdf","comment":"NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.02525v1","updated":"2024-12-03T16:18:42Z","published":"2024-12-03T16:18:42Z","title":"LLMForecaster: Improving Seasonal Event Forecasts with Unstructured\n Textual Data","summary":" Modern time-series forecasting models often fail to make full use of rich\nunstructured information about the time series themselves. This lack of proper\nconditioning can lead to obvious model failures; for example, models may be\nunaware of the details of a particular product, and hence fail to anticipate\nseasonal surges in customer demand in the lead up to major exogenous events\nlike holidays for clearly relevant products. To address this shortcoming, this\npaper introduces a novel forecast post-processor -- which we call LLMForecaster\n-- that fine-tunes large language models (LLMs) to incorporate unstructured\nsemantic and contextual information and historical data to improve the\nforecasts from an existing demand forecasting pipeline. In an industry-scale\nretail application, we demonstrate that our technique yields statistically\nsignificantly forecast improvements across several sets of products subject to\nholiday-driven demand surges.\n","authors":["Hanyu Zhang","Chuck Arvin","Dmitry Efimov","Michael W. 
Mahoney","Dominique Perrault-Joncas","Shankar Ramasubramanian","Andrew Gordon Wilson","Malcolm Wolff"],"pdf_url":"https://arxiv.org/pdf/2412.02525v1.pdf","comment":"Presented at NeurIPS Time Series in the Age of Large Models (2024)"},{"id":"http://arxiv.org/abs/2408.07712v3","updated":"2024-12-03T16:17:32Z","published":"2024-08-13T23:08:06Z","title":"Introduction to Reinforcement Learning","summary":" Reinforcement Learning (RL), a subfield of Artificial Intelligence (AI),\nfocuses on training agents to make decisions by interacting with their\nenvironment to maximize cumulative rewards. This paper provides an overview of\nRL, covering its core concepts, methodologies, and resources for further\nlearning. It offers a thorough explanation of fundamental components such as\nstates, actions, policies, and reward signals, ensuring readers develop a solid\nfoundational understanding. Additionally, the paper presents a variety of RL\nalgorithms, categorized based on the key factors such as model-free,\nmodel-based, value-based, policy-based, and other key factors. Resources for\nlearning and implementing RL, such as books, courses, and online communities\nare also provided. By offering a clear, structured introduction, this paper\naims to simplify the complexities of RL for beginners, providing a\nstraightforward pathway to understanding.\n","authors":["Majid Ghasemi","Dariush Ebrahimi"],"pdf_url":"https://arxiv.org/pdf/2408.07712v3.pdf","comment":"19 pages"},{"id":"http://arxiv.org/abs/2412.02520v1","updated":"2024-12-03T16:13:42Z","published":"2024-12-03T16:13:42Z","title":"Cooperative Cruising: Reinforcement Learning based Time-Headway Control\n for Increased Traffic Efficiency","summary":" The proliferation of Connected Automated Vehicles represents an unprecedented\nopportunity for improving driving efficiency and alleviating traffic\ncongestion. 
However, existing research fails to address realistic multi-lane\nhighway scenarios without assuming connectivity, perception, and control\ncapabilities that are typically unavailable in current vehicles. This paper\nproposes a novel AI system that is the first to improve highway traffic\nefficiency compared with human-like traffic in realistic, simulated multi-lane\nscenarios, while relying on existing connectivity, perception, and control\ncapabilities. At the core of our approach is a reinforcement learning based\ncontroller that dynamically communicates time-headways to automated vehicles\nnear bottlenecks based on real-time traffic conditions. These desired\ntime-headways are then used by Adaptive Cruise Control (ACC) systems to adjust\ntheir following distance. By (i) integrating existing traffic estimation\ntechnology and low-bandwidth vehicle-to-infrastructure connectivity, (ii)\nleveraging safety-certified ACC systems, and (iii) targeting localized\nbottleneck challenges that can be addressed independently in different\nlocations, we propose a practical, safe, and scalable system that can\npositively impact numerous road users.\n","authors":["Yaron Veksler","Sharon Hornstein","Han Wang","Maria Laura Delle Monache","Daniel Urieli"],"pdf_url":"https://arxiv.org/pdf/2412.02520v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.00876v2","updated":"2024-12-03T16:12:09Z","published":"2024-12-01T16:32:31Z","title":"Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic\n Vision-language Context Sparsification","summary":" Multimodal Large Language Models (MLLMs) have achieved remarkable success in\nvision understanding, reasoning, and interaction. However, the inference\ncomputation and memory increase progressively with the generation of output\ntokens during decoding, directly affecting the efficacy of MLLMs. Existing\nmethods attempt to reduce the vision context redundancy to achieve efficient\nMLLMs. 
Unfortunately, the efficiency benefits of the vision context reduction\nin the prefill stage gradually diminish during the decoding stage. To address\nthis problem, we proposed a dynamic vision-language context sparsification\nframework Dynamic-LLaVA, which dynamically reduces the redundancy of vision\ncontext in the prefill stage and decreases the memory and computation overhead\nof the generated language context during decoding. Dynamic-LLaVA designs a\ntailored sparsification inference scheme for different inference modes, i.e.,\nprefill, decoding with and without KV cache, to achieve efficient inference of\nMLLMs. In practice, Dynamic-LLaVA can reduce computation consumption by\n$\\sim$75\\% in the prefill stage. Meanwhile, throughout the entire generation\nprocess of MLLMs, Dynamic-LLaVA reduces the $\\sim$50\\% computation consumption\nunder decoding without KV cache, while saving $\\sim$50\\% GPU memory overhead\nwhen decoding with KV cache, due to the vision-language context sparsification.\nExtensive experiments also demonstrate that Dynamic-LLaVA achieves efficient\ninference for MLLMs with negligible understanding and generation ability\ndegradation or even performance gains compared to the full-context inference\nbaselines. Code is available at https://github.com/Osilly/dynamic_llava .\n","authors":["Wenxuan Huang","Zijie Zhai","Yunhang Shen","Shaoshen Cao","Fei Zhao","Xiangfeng Xu","Zheyu Ye","Shaohui Lin"],"pdf_url":"https://arxiv.org/pdf/2412.00876v2.pdf","comment":"Code is available at https://github.com/Osilly/dynamic_llava"},{"id":"http://arxiv.org/abs/2412.01491v2","updated":"2024-12-03T16:01:54Z","published":"2024-12-02T13:42:36Z","title":"Understanding complex crowd dynamics with generative neural simulators","summary":" Understanding the dynamics of pedestrian crowds is an outstanding challenge\ncrucial for designing efficient urban infrastructure and ensuring safe crowd\nmanagement. 
To this end, both small-scale laboratory and large-scale real-world\nmeasurements have been used. However, these approaches respectively lack\nstatistical resolution and parametric controllability, both essential to\ndiscovering physical relationships underlying the complex stochastic dynamics\nof crowds. Here, we establish an investigation paradigm that offers\nlaboratory-like controllability, while ensuring the statistical resolution of\nlarge-scale real-world datasets. Using our data-driven Neural Crowd Simulator\n(NeCS), which we train on large-scale data and validate against key statistical\nfeatures of crowd dynamics, we show that we can perform effective surrogate\ncrowd dynamics experiments without training on specific scenarios. We not only\nreproduce known experimental results on pairwise avoidance, but also uncover\nthe vision-guided and topological nature of N-body interactions. These findings\nshow how virtual experiments based on neural simulation enable data-driven\nscientific discovery.\n","authors":["Koen Minartz","Fleur Hendriks","Simon Martinus Koop","Alessandro Corbetta","Vlado Menkovski"],"pdf_url":"https://arxiv.org/pdf/2412.01491v2.pdf","comment":"26 pages, 6 figures"},{"id":"http://arxiv.org/abs/2405.13220v2","updated":"2024-12-03T16:00:40Z","published":"2024-05-21T22:00:34Z","title":"Paired Autoencoders for Likelihood-free Estimation in Inverse Problems","summary":" We consider the solution of nonlinear inverse problems where the forward\nproblem is a discretization of a partial differential equation. Such problems\nare notoriously difficult to solve in practice and require minimizing a\ncombination of a data-fit term and a regularization term. The main\ncomputational bottleneck of typical algorithms is the direct estimation of the\ndata misfit. Therefore, likelihood-free approaches have become appealing\nalternatives. 
Nonetheless, difficulties in generalization and limitations in\naccuracy have hindered their broader utility and applicability. In this work,\nwe use a paired autoencoder framework as a likelihood-free estimator for\ninverse problems. We show that the use of such an architecture allows us to\nconstruct a solution efficiently and to overcome some known open problems when\nusing likelihood-free estimators. In particular, our framework can assess the\nquality of the solution and improve on it if needed. We demonstrate the\nviability of our approach using examples from full waveform inversion and\ninverse electromagnetic imaging.\n","authors":["Matthias Chung","Emma Hart","Julianne Chung","Bas Peters","Eldad Haber"],"pdf_url":"https://arxiv.org/pdf/2405.13220v2.pdf","comment":"18 pages, 6 figures"},{"id":"http://arxiv.org/abs/2403.10182v2","updated":"2024-12-03T15:35:24Z","published":"2024-03-15T10:38:48Z","title":"Fast and reliable uncertainty quantification with neural network\n ensembles for industrial image classification","summary":" Image classification with neural networks (NNs) is widely used in industrial\nprocesses, situations where the model likely encounters unknown objects during\ndeployment, i.e., out-of-distribution (OOD) data. Worryingly, NNs tend to make\nconfident yet incorrect predictions when confronted with OOD data. To increase\nthe models' reliability, they should quantify the uncertainty in their own\npredictions, communicating when the output should (not) be trusted. Deep\nensembles, composed of multiple independent NNs, have been shown to perform\nstrongly but are computationally expensive. Recent research has proposed more\nefficient NN ensembles, namely the snapshot, batch, and multi-input\nmulti-output ensemble. This study investigates the predictive and uncertainty\nperformance of efficient NN ensembles in the context of image classification\nfor industrial processes. 
It is the first to provide a comprehensive comparison\nand it proposes a novel Diversity Quality metric to quantify the ensembles'\nperformance on the in-distribution and OOD sets in one single metric. The\nresults highlight the batch ensemble as a cost-effective and competitive\nalternative to the deep ensemble. It matches the deep ensemble in both\nuncertainty and accuracy while exhibiting considerable savings in training\ntime, test time, and memory storage.\n","authors":["Arthur Thuy","Dries F. Benoit"],"pdf_url":"https://arxiv.org/pdf/2403.10182v2.pdf","comment":"Submitted to Annals of Operations Research"},{"id":"http://arxiv.org/abs/2412.02503v1","updated":"2024-12-03T15:30:52Z","published":"2024-12-03T15:30:52Z","title":"CA-MoE: Channel-Adapted MoE for Incremental Weather Forecasting","summary":" Atmospheric science is intricately connected with other fields, e.g.,\ngeography and aerospace. Most existing approaches involve training a joint\natmospheric and geographic model from scratch, which incurs significant\ncomputational costs and overlooks the potential for incremental learning of\nweather variables across different domains. In this paper, we introduce\nincremental learning to weather forecasting and propose a novel structure that\nallows for the flexible expansion of variables within the model. Specifically,\nour method presents a Channel-Adapted MoE (CA-MoE) that employs a\ndivide-and-conquer strategy. This strategy assigns variable training tasks to\ndifferent experts by index embedding and reduces computational complexity\nthrough a channel-wise Top-K strategy. Experiments conducted on the widely\nutilized ERA5 dataset reveal that our method, utilizing only approximately 15\\%\nof trainable parameters during the incremental stage, attains performance that\nis on par with state-of-the-art competitors. 
Notably, in the context of\nvariable incremental experiments, our method demonstrates negligible issues\nwith catastrophic forgetting.\n","authors":["Hao Chen","Han Tao","Guo Song","Jie Zhang","Yunlong Yu","Yonghan Dong","Chuang Yang","Lei Bai"],"pdf_url":"https://arxiv.org/pdf/2412.02503v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.05469v4","updated":"2024-12-03T15:21:53Z","published":"2023-10-09T07:26:35Z","title":"Learning to Predict Structural Vibrations","summary":" In mechanical structures like airplanes, cars and houses, noise is generated\nand transmitted through vibrations. To take measures to reduce this noise,\nvibrations need to be simulated with expensive numerical computations. Deep\nlearning surrogate models present a promising alternative to classical\nnumerical simulations as they can be evaluated magnitudes faster, while\ntrading-off accuracy. To quantify such trade-offs systematically and foster the\ndevelopment of methods, we present a benchmark on the task of predicting the\nvibration of harmonically excited plates. The benchmark features a total of\n12,000 plate geometries with varying forms of beadings, material, boundary\nconditions, load position and sizes with associated numerical solutions. To\naddress the benchmark task, we propose a new network architecture, named\nFrequency-Query Operator, which predicts vibration patterns of plate geometries\ngiven a specific excitation frequency. Applying principles from operator\nlearning and implicit models for shape encoding, our approach effectively\naddresses the prediction of highly variable frequency response functions\noccurring in dynamic systems. To quantify the prediction quality, we introduce\na set of evaluation metrics and evaluate the method on our vibrating-plates\nbenchmark. Our method outperforms DeepONets, Fourier Neural Operators and more\ntraditional neural network architectures and can be used for design\noptimization. 
Code, dataset and visualizations:\nhttps://github.com/ecker-lab/Learning_Vibrating_Plates\n","authors":["Jan van Delden","Julius Schultz","Christopher Blech","Sabine C. Langer","Timo Lüddecke"],"pdf_url":"https://arxiv.org/pdf/2310.05469v4.pdf","comment":"Accepted at Neurips 2024"},{"id":"http://arxiv.org/abs/2412.02492v1","updated":"2024-12-03T15:06:07Z","published":"2024-12-03T15:06:07Z","title":"The Cost of Consistency: Submodular Maximization with Constant Recourse","summary":" In this work, we study online submodular maximization, and how the\nrequirement of maintaining a stable solution impacts the approximation. In\nparticular, we seek bounds on the best-possible approximation ratio that is\nattainable when the algorithm is allowed to make at most a constant number of\nupdates per step. We show a tight information-theoretic bound of $\\tfrac{2}{3}$\nfor general monotone submodular functions, and an improved (also tight) bound\nof $\\tfrac{3}{4}$ for coverage functions. Since both these bounds are attained\nby non poly-time algorithms, we also give a poly-time randomized algorithm that\nachieves a $0.51$-approximation. Combined with an information-theoretic\nhardness of $\\tfrac{1}{2}$ for deterministic algorithms from prior work, our\nwork thus shows a separation between deterministic and randomized algorithms,\nboth information theoretically and for poly-time algorithms.\n","authors":["Paul Dütting","Federico Fusco","Silvio Lattanzi","Ashkan Norouzi-Fard","Ola Svensson","Morteza Zadimoghaddam"],"pdf_url":"https://arxiv.org/pdf/2412.02492v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02484v1","updated":"2024-12-03T14:47:46Z","published":"2024-12-03T14:47:46Z","title":"Vector Optimization with Gaussian Process Bandits","summary":" Learning problems in which multiple conflicting objectives must be considered\nsimultaneously often arise in various fields, including engineering, drug\ndesign, and environmental management. 
Traditional methods for dealing with\nmultiple black-box objective functions, such as scalarization and\nidentification of the Pareto set under the componentwise order, have\nlimitations in incorporating objective preferences and exploring the solution\nspace accordingly. While vector optimization offers improved flexibility and\nadaptability via specifying partial orders based on ordering cones, current\ntechniques designed for sequential experiments either suffer from high sample\ncomplexity or lack theoretical guarantees. To address these issues, we propose\nVector Optimization with Gaussian Process (VOGP), a probably approximately\ncorrect adaptive elimination algorithm that performs black-box vector\noptimization using Gaussian process bandits. VOGP allows users to convey\nobjective preferences through ordering cones while performing efficient\nsampling by exploiting the smoothness of the objective function, resulting in a\nmore effective optimization process that requires fewer evaluations. We\nestablish theoretical guarantees for VOGP and derive information gain-based and\nkernel-specific sample complexity bounds. We also conduct experiments on both\nreal-world and synthetic datasets to compare VOGP with the state-of-the-art\nmethods.\n","authors":["İlter Onat Korkmaz","Yaşar Cahit Yıldırım","Çağın Ararat","Cem Tekin"],"pdf_url":"https://arxiv.org/pdf/2412.02484v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02482v1","updated":"2024-12-03T14:45:46Z","published":"2024-12-03T14:45:46Z","title":"What should a neuron aim for? Designing local objective functions based\n on information theory","summary":" In modern deep neural networks, the learning dynamics of the individual\nneurons is often obscure, as the networks are trained via global optimization.\nConversely, biological systems build on self-organized, local learning,\nachieving robustness and efficiency with limited global information. 
We here\nshow how self-organization between individual artificial neurons can be\nachieved by designing abstract bio-inspired local learning goals. These goals\nare parameterized using a recent extension of information theory, Partial\nInformation Decomposition (PID), which decomposes the information that a set of\ninformation sources holds about an outcome into unique, redundant and\nsynergistic contributions. Our framework enables neurons to locally shape the\nintegration of information from various input classes, i.e. feedforward,\nfeedback, and lateral, by selecting which of the three inputs should contribute\nuniquely, redundantly or synergistically to the output. This selection is\nexpressed as a weighted sum of PID terms, which, for a given problem, can be\ndirectly derived from intuitive reasoning or via numerical optimization,\noffering a window into understanding task-relevant local information\nprocessing. Achieving neuron-level interpretability while enabling strong\nperformance using local learning, our work advances a principled\ninformation-theoretic foundation for local learning strategies.\n","authors":["Andreas C. Schneider","Valentin Neuhaus","David A. Ehrlich","Abdullah Makkeh","Alexander S. Ecker","Viola Priesemann","Michael Wibral"],"pdf_url":"https://arxiv.org/pdf/2412.02482v1.pdf","comment":"24 pages, 11 figures"},{"id":"http://arxiv.org/abs/2312.00710v3","updated":"2024-12-03T14:45:03Z","published":"2023-12-01T16:42:57Z","title":"SpaCE: The Spatial Confounding Environment","summary":" Spatial confounding poses a significant challenge in scientific studies\ninvolving spatial data, where unobserved spatial variables can influence both\ntreatment and outcome, possibly leading to spurious associations. 
To address\nthis problem, we introduce SpaCE: The Spatial Confounding Environment, the\nfirst toolkit to provide realistic benchmark datasets and tools for\nsystematically evaluating causal inference methods designed to alleviate\nspatial confounding. Each dataset includes training data, true counterfactuals,\na spatial graph with coordinates, and smoothness and confounding scores\ncharacterizing the effect of a missing spatial confounder. It also includes\nrealistic semi-synthetic outcomes and counterfactuals, generated using\nstate-of-the-art machine learning ensembles, following best practices for\ncausal inference benchmarks. The datasets cover real treatment and covariates\nfrom diverse domains, including climate, health and social sciences. SpaCE\nfacilitates an automated end-to-end pipeline, simplifying data loading,\nexperimental setup, and evaluating machine learning and causal inference\nmodels. The SpaCE project provides several dozens of datasets of diverse sizes\nand spatial complexity. It is publicly available as a Python package,\nencouraging community feedback and contributions.\n","authors":["Mauricio Tec","Ana Trisovic","Michelle Audirac","Sophie Woodward","Jie Kate Hu","Naeem Khoshnevis","Francesca Dominici"],"pdf_url":"https://arxiv.org/pdf/2312.00710v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02479v1","updated":"2024-12-03T14:42:31Z","published":"2024-12-03T14:42:31Z","title":"OODFace: Benchmarking Robustness of Face Recognition under Common\n Corruptions and Appearance Variations","summary":" With the rise of deep learning, facial recognition technology has seen\nextensive research and rapid development. Although facial recognition is\nconsidered a mature technology, we find that existing open-source models and\ncommercial algorithms lack robustness in certain real-world Out-of-Distribution\n(OOD) scenarios, raising concerns about the reliability of these systems. 
In\nthis paper, we introduce OODFace, which explores the OOD challenges faced by\nfacial recognition models from two perspectives: common corruptions and\nappearance variations. We systematically design 30 OOD scenarios across 9 major\ncategories tailored for facial recognition. By simulating these challenges on\npublic datasets, we establish three robustness benchmarks: LFW-C/V, CFP-FP-C/V,\nand YTF-C/V. We then conduct extensive experiments on 19 different facial\nrecognition models and 3 commercial APIs, along with extended experiments on\nface masks, Vision-Language Models (VLMs), and defense strategies to assess\ntheir robustness. Based on the results, we draw several key insights,\nhighlighting the vulnerability of facial recognition systems to OOD data and\nsuggesting possible solutions. Additionally, we offer a unified toolkit that\nincludes all corruption and variation types, easily extendable to other\ndatasets. We hope that our benchmarks and findings can provide guidance for\nfuture improvements in facial recognition model robustness.\n","authors":["Caixin Kang","Yubo Chen","Shouwei Ruan","Shiji Zhao","Ruochen Zhang","Jiayi Wang","Shan Fu","Xingxing Wei"],"pdf_url":"https://arxiv.org/pdf/2412.02479v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.03523v4","updated":"2024-12-03T14:31:41Z","published":"2024-10-04T15:44:23Z","title":"A Probabilistic Perspective on Unlearning and Alignment for Large\n Language Models","summary":" Comprehensive evaluation of Large Language Models (LLMs) is an open research\nproblem. Existing evaluations rely on deterministic point estimates generated\nvia greedy decoding. However, we find that deterministic evaluations fail to\ncapture the whole output distribution of a model, yielding inaccurate\nestimations of model capabilities. This is particularly problematic in critical\ncontexts such as unlearning and alignment, where precise model evaluations are\ncrucial. 
To remedy this, we introduce the first formal probabilistic evaluation\nframework in LLMs. Namely, we derive novel metrics with high-probability\nguarantees concerning the output distribution of a model. Our metrics are\napplication-independent and allow practitioners to make more reliable estimates\nabout model capabilities before deployment. Through a case study focused on\nunlearning, we reveal that deterministic evaluations falsely indicate\nsuccessful unlearning, whereas our probabilistic evaluations demonstrate that\nmost if not all of the supposedly unlearned information remains accessible in\nthese models. Additionally, we propose a novel unlearning loss based on entropy\noptimization and adaptive temperature scaling, which significantly improves\nunlearning in probabilistic settings on recent benchmarks. Our proposed shift\nfrom point estimates to probabilistic evaluations of output distributions\nrepresents an important step toward comprehensive evaluations of LLMs. Code\navailable at https://github.com/yascho/probabilistic-unlearning.\n","authors":["Yan Scholten","Stephan Günnemann","Leo Schwinn"],"pdf_url":"https://arxiv.org/pdf/2410.03523v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02471v1","updated":"2024-12-03T14:29:47Z","published":"2024-12-03T14:29:47Z","title":"COMET:Combined Matrix for Elucidating Targets","summary":" Identifying the interaction targets of bioactive compounds is a foundational\nelement for deciphering their pharmacological effects. Target prediction\nalgorithms equip researchers with an effective tool to rapidly scope and\nexplore potential targets. Here, we introduce the COMET, a multi-technological\nmodular target prediction tool that provides comprehensive predictive insights,\nincluding similar active compounds, three-dimensional predicted binding modes,\nand probability scores, all within an average processing time of less than 10\nminutes per task. 
With meticulously curated data, the COMET database\nencompasses 990,944 drug-target interaction pairs and 45,035 binding pockets,\nenabling predictions for 2,685 targets, which span confirmed and exploratory\ntherapeutic targets for human diseases. In comparative testing using datasets\nfrom ChEMBL and BindingDB, COMET outperformed five other well-known algorithms,\noffering nearly an 80% probability of accurately identifying at least one true\ntarget within the top 15 predictions for a given compound. COMET also features\na user-friendly web server, accessible freely at\nhttps://www.pdbbind-plus.org.cn/comet.\n","authors":["Haojie Wang","Zhe Zhang","Haotian Gao","Xiangying Zhang","Zhihang Chen","Xinchong Chen","Yifei Qi","Yan Li","Renxiao Wang"],"pdf_url":"https://arxiv.org/pdf/2412.02471v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02467v1","updated":"2024-12-03T14:10:09Z","published":"2024-12-03T14:10:09Z","title":"DP-2Stage: Adapting Language Models as Differentially Private Tabular\n Data Generators","summary":" Generating tabular data under differential privacy (DP) protection ensures\ntheoretical privacy guarantees but poses challenges for training machine\nlearning models, primarily due to the need to capture complex structures under\nnoisy supervision signals. Recently, pre-trained Large Language Models (LLMs)\n-- even those at the scale of GPT-2 -- have demonstrated great potential in\nsynthesizing tabular data. However, their applications under DP constraints\nremain largely unexplored. In this work, we address this gap by applying DP\ntechniques to the generation of synthetic tabular data. Our findings shows that\nLLMs face difficulties in generating coherent text when fine-tuned with DP, as\nprivacy budgets are inefficiently allocated to non-private elements like table\nstructures. To overcome this, we propose \\ours, a two-stage fine-tuning\nframework for differentially private tabular data generation. 
The first stage\ninvolves non-private fine-tuning on a pseudo dataset, followed by DP\nfine-tuning on a private dataset. Our empirical results show that this approach\nimproves performance across various settings and metrics compared to directly\nfine-tuned LLMs in DP contexts. We release our code and setup at\nhttps://github.com/tejuafonja/DP-2Stage.\n","authors":["Tejumade Afonja","Hui-Po Wang","Raouf Kerkouche","Mario Fritz"],"pdf_url":"https://arxiv.org/pdf/2412.02467v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01388v2","updated":"2024-12-03T14:08:19Z","published":"2024-12-02T11:21:58Z","title":"Harnessing Preference Optimisation in Protein LMs for Hit Maturation in\n Cell Therapy","summary":" Cell and immunotherapy offer transformative potential for treating diseases\nlike cancer and autoimmune disorders by modulating the immune system. The\ndevelopment of these therapies is resource-intensive, with the majority of drug\ncandidates failing to progress beyond laboratory testing. While recent advances\nin machine learning have revolutionised areas such as protein engineering,\napplications in immunotherapy remain limited due to the scarcity of\nlarge-scale, standardised datasets and the complexity of cellular systems. In\nthis work, we address these challenges by leveraging a high-throughput\nexperimental platform to generate data suitable for fine-tuning protein\nlanguage models. We demonstrate how models fine-tuned using a preference task\nshow surprising correlations to biological assays, and how they can be\nleveraged for few-shot hit maturation in CARs. This proof-of-concept presents a\nnovel pathway for applying ML to immunotherapy and could generalise to other\ntherapeutic modalities.\n","authors":["Katarzyna Janocha","Annabel Ling","Alice Godson","Yulia Lampi","Simon Bornschein","Nils Y. 
Hammerla"],"pdf_url":"https://arxiv.org/pdf/2412.01388v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.18355v2","updated":"2024-12-03T14:07:34Z","published":"2024-03-27T08:48:16Z","title":"Supervised Multiple Kernel Learning approaches for multi-omics data\n integration","summary":" Advances in high-throughput technologies have originated an ever-increasing\navailability of omics datasets. The integration of multiple heterogeneous data\nsources is currently an issue for biology and bioinformatics. Multiple kernel\nlearning (MKL) has shown to be a flexible and valid approach to consider the\ndiverse nature of multi-omics inputs, despite being an underused tool in\ngenomic data mining. We provide novel MKL approaches based on different kernel\nfusion strategies. To learn from the meta-kernel of input kernels, we adapted\nunsupervised integration algorithms for supervised tasks with support vector\nmachines. We also tested deep learning architectures for kernel fusion and\nclassification. The results show that MKL-based models can outperform more\ncomplex, state-of-the-art, supervised multi-omics integrative approaches.\nMultiple kernel learning offers a natural framework for predictive models in\nmulti-omics data. It proved to provide a fast and reliable solution that can\ncompete with and outperform more complex architectures. 
Our results offer a\ndirection for bio-data mining research, biomarker discovery and further\ndevelopment of methods for heterogeneous data integration.\n","authors":["Mitja Briscik","Gabriele Tazza","Marie-Agnes Dillies","László Vidács","Sébastien Dejean"],"pdf_url":"https://arxiv.org/pdf/2403.18355v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02449v1","updated":"2024-12-03T13:34:42Z","published":"2024-12-03T13:34:42Z","title":"BYE: Build Your Encoder with One Sequence of Exploration Data for\n Long-Term Dynamic Scene Understanding","summary":" Dynamic scene understanding remains a persistent challenge in robotic\napplications. Early dynamic mapping methods focused on mitigating the negative\ninfluence of short-term dynamic objects on camera motion estimation by masking\nor tracking specific categories, which often fall short in adapting to\nlong-term scene changes. Recent efforts address object association in long-term\ndynamic environments using neural networks trained on synthetic datasets, but\nthey still rely on predefined object shapes and categories. Other methods\nincorporate visual, geometric, or semantic heuristics for the association but\noften lack robustness. In this work, we introduce BYE, a class-agnostic,\nper-scene point cloud encoder that removes the need for predefined categories,\nshape priors, or extensive association datasets. Trained on only a single\nsequence of exploration data, BYE can efficiently perform object association in\ndynamically changing scenes. We further propose an ensembling scheme combining\nthe semantic strengths of Vision Language Models (VLMs) with the scene-specific\nexpertise of BYE, achieving a 7% improvement and a 95% success rate in object\nassociation tasks. 
Code and dataset are available at\nhttps://byencoder.github.io.\n","authors":["Chenguang Huang","Shengchao Yan","Wolfram Burgard"],"pdf_url":"https://arxiv.org/pdf/2412.02449v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02441v1","updated":"2024-12-03T13:25:18Z","published":"2024-12-03T13:25:18Z","title":"Artificial Expert Intelligence through PAC-reasoning","summary":" Artificial Expert Intelligence (AEI) seeks to transcend the limitations of\nboth Artificial General Intelligence (AGI) and narrow AI by integrating\ndomain-specific expertise with critical, precise reasoning capabilities akin to\nthose of top human experts. Existing AI systems often excel at predefined tasks\nbut struggle with adaptability and precision in novel problem-solving. To\novercome this, AEI introduces a framework for ``Probably Approximately Correct\n(PAC) Reasoning\". This paradigm provides robust theoretical guarantees for\nreliably decomposing complex problems, with a practical mechanism for\ncontrolling reasoning precision. In reference to the division of human thought\ninto System 1 for intuitive thinking and System 2 for reflective\nreasoning~\\citep{tversky1974judgment}, we refer to this new type of reasoning\nas System 3 for precise reasoning, inspired by the rigor of the scientific\nmethod. AEI thus establishes a foundation for error-bounded, inference-time\nlearning.\n","authors":["Shai Shalev-Shwartz","Amnon Shashua","Gal Beniamini","Yoav Levine","Or Sharir","Noam Wies","Ido Ben-Shaul","Tomer Nussbaum","Shir Granot Peled"],"pdf_url":"https://arxiv.org/pdf/2412.02441v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02439v1","updated":"2024-12-03T13:21:09Z","published":"2024-12-03T13:21:09Z","title":"Nature versus nurture in galaxy formation: the effect of environment on\n star formation with causal machine learning","summary":" Understanding how galaxies form and evolve is at the heart of modern\nastronomy. 
With the advent of large-scale surveys and simulations, remarkable\nprogress has been made in the last few decades. Despite this, the physical\nprocesses behind the phenomena, and particularly their importance, remain far\nfrom known, as correlations have primarily been established rather than the\nunderlying causality. We address this challenge by applying the causal\ninference framework. Specifically, we tackle the fundamental open question of\nwhether galaxy formation and evolution depends more on nature (i.e., internal\nprocesses) or nurture (i.e., external processes), by estimating the causal\neffect of environment on star-formation rate in the IllustrisTNG simulations.\nTo do so, we develop a comprehensive causal model and employ cutting-edge\ntechniques from epidemiology to overcome the long-standing problem of\ndisentangling nature and nurture. We find that the causal effect is negative\nand substantial, with environment suppressing the SFR by a maximal factor of\n$\\sim100$. While the overall effect at $z=0$ is negative, in the early\nuniverse, environment is discovered to have a positive impact, boosting star\nformation by a factor of $\\sim10$ at $z\\sim1$ and by even greater amounts at\nhigher redshifts. Furthermore, we show that: (i) nature also plays an important\nrole, as ignoring it underestimates the causal effect in intermediate-density\nenvironments by a factor of $\\sim2$, (ii) controlling for the stellar mass at a\nsnapshot in time, as is common in the literature, is not only insufficient to\ndisentangle nature and nurture but actually has an adverse effect, though (iii)\nstellar mass is an adequate proxy of the effects of nature. Finally, this work\nmay prove a useful blueprint for extracting causal insights in other fields\nthat deal with dynamical systems with closed feedback loops, such as the\nEarth's climate.\n","authors":["Sunil Mucesh","William G. Hartley","Ciarán M. 
Gilligan-Lee","Ofer Lahav"],"pdf_url":"https://arxiv.org/pdf/2412.02439v1.pdf","comment":"16 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.02432v1","updated":"2024-12-03T12:57:08Z","published":"2024-12-03T12:57:08Z","title":"Improved Localized Machine Unlearning Through the Lens of Memorization","summary":" Machine unlearning refers to removing the influence of a specified subset of\ntraining data from a machine learning model, efficiently, after it has already\nbeen trained. This is important for key applications, including making the\nmodel more accurate by removing outdated, mislabeled, or poisoned data. In this\nwork, we study localized unlearning, where the unlearning algorithm operates on\na (small) identified subset of parameters. Drawing inspiration from the\nmemorization literature, we propose an improved localization strategy that\nyields strong results when paired with existing unlearning algorithms. We also\npropose a new unlearning algorithm, Deletion by Example Localization (DEL),\nthat resets the parameters deemed-to-be most critical according to our\nlocalization strategy, and then finetunes them. Our extensive experiments on\ndifferent datasets, forget sets and metrics reveal that DEL sets a new\nstate-of-the-art for unlearning metrics, against both localized and\nfull-parameter methods, while modifying a small subset of parameters, and\noutperforms the state-of-the-art localized unlearning in terms of test accuracy\ntoo.\n","authors":["Reihaneh Torkzadehmahani","Reza Nasirigerdeh","Georgios Kaissis","Daniel Rueckert","Gintare Karolina Dziugaite","Eleni Triantafillou"],"pdf_url":"https://arxiv.org/pdf/2412.02432v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02430v1","updated":"2024-12-03T12:52:04Z","published":"2024-12-03T12:52:04Z","title":"Transformer-based Koopman Autoencoder for Linearizing Fisher's Equation","summary":" A Transformer-based Koopman autoencoder is proposed for linearizing Fisher's\nreaction-diffusion equation. 
The primary focus of this study is on using deep\nlearning techniques to find complex spatiotemporal patterns in the\nreaction-diffusion system. The emphasis is on not just solving the equation but\nalso transforming the system's dynamics into a more comprehensible, linear\nform. Global coordinate transformations are achieved through the autoencoder,\nwhich learns to capture the underlying dynamics by training on a dataset with\n60,000 initial conditions. Extensive testing on multiple datasets was used to\nassess the efficacy of the proposed model, demonstrating its ability to\naccurately predict the system's evolution as well as to generalize. We provide\na thorough comparison study, comparing our suggested design to a few other\ncomparable methods using experiments on various PDEs, such as the\nKuramoto-Sivashinsky equation and the Burger's equation. Results show improved\naccuracy, highlighting the capabilities of the Transformer-based Koopman\nautoencoder. The proposed architecture in is significantly ahead of other\narchitectures, in terms of solving different types of PDEs using a single\narchitecture. Our method relies entirely on the data, without requiring any\nknowledge of the underlying equations. This makes it applicable to even the\ndatasets where the governing equations are not known.\n","authors":["Kanav Singh Rana","Nitu Kumari"],"pdf_url":"https://arxiv.org/pdf/2412.02430v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02423v1","updated":"2024-12-03T12:38:53Z","published":"2024-12-03T12:38:53Z","title":"Time-Series-Informed Closed-loop Learning for Sequential Decision Making\n and Control","summary":" Closed-loop performance of sequential decision making algorithms, such as\nmodel predictive control, depends strongly on the parameters of cost functions,\nmodels, and constraints. Bayesian optimization is a common approach to learning\nthese parameters based on closed-loop experiments. 
However, traditional\nBayesian optimization approaches treat the learning problem as a black box,\nignoring valuable information and knowledge about the structure of the\nunderlying problem, resulting in slow convergence and high experimental\nresource use. We propose a time-series-informed optimization framework that\nincorporates intermediate performance evaluations from early iterations of each\nexperimental episode into the learning procedure. Additionally, probabilistic\nearly stopping criteria are proposed to terminate unpromising experiments,\nsignificantly reducing experimental time. Simulation results show that our\napproach achieves baseline performance with approximately half the resources.\nMoreover, with the same resource budget, our approach outperforms the baseline\nin terms of final closed-loop performance, highlighting its efficiency in\nsequential decision making scenarios.\n","authors":["Sebastian Hirt","Lukas Theiner","Rolf Findeisen"],"pdf_url":"https://arxiv.org/pdf/2412.02423v1.pdf","comment":"12 pages, 3 figures, submitted to L4DC 2025"},{"id":"http://arxiv.org/abs/2412.02412v1","updated":"2024-12-03T12:12:03Z","published":"2024-12-03T12:12:03Z","title":"VISTA: A Panoramic View of Neural Representations","summary":" We present VISTA (Visualization of Internal States and Their Associations), a\nnovel pipeline for visually exploring and interpreting neural network\nrepresentations. VISTA addresses the challenge of analyzing vast\nmultidimensional spaces in modern machine learning models by mapping\nrepresentations into a semantic 2D space. The resulting collages visually\nreveal patterns and relationships within internal representations. We\ndemonstrate VISTA's utility by applying it to sparse autoencoder latents\nuncovering new properties and interpretations. 
We review the VISTA methodology,\npresent findings from our case study ( https://got.drib.net/latents/ ), and\ndiscuss implications for neural network interpretability across various domains\nof machine learning.\n","authors":["Tom White"],"pdf_url":"https://arxiv.org/pdf/2412.02412v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02408v1","updated":"2024-12-03T12:03:13Z","published":"2024-12-03T12:03:13Z","title":"Leveraging Ensemble-Based Semi-Supervised Learning for Illicit Account\n Detection in Ethereum DeFi Transactions","summary":" The advent of smart contracts has enabled the rapid rise of Decentralized\nFinance (DeFi) on the Ethereum blockchain, offering substantial rewards in\nfinancial innovation and inclusivity. However, this growth has also introduced\nsignificant security risks, including the proliferation of illicit accounts\ninvolved in fraudulent activities. Traditional detection methods are limited by\nthe scarcity of labeled data and the evolving tactics of malicious actors. In\nthis paper, we propose a novel Self-Learning Ensemble-based Illicit account\nDetection (SLEID) framework to address these challenges. SLEID employs an\nIsolation Forest for initial outlier detection and a self-training mechanism to\niteratively generate pseudo-labels for unlabeled accounts, thereby enhancing\ndetection accuracy. Extensive experiments demonstrate that SLEID significantly\noutperforms traditional supervised approaches and recent semi-supervised\nmodels, achieving superior precision, recall, and F1-scores, particularly in\ndetecting illicit accounts. 
Compared to state-of-the-art methods, our approach\nachieves better detection performance while reducing reliance on labeled data.\nThe results affirm SLEID's efficacy as a robust solution for safeguarding the\nDeFi ecosystem and mitigating risks posed by malicious accounts.\n","authors":["Shabnam Fazliani","Mohammad Mowlavi Sorond","Arsalan Masoudifard"],"pdf_url":"https://arxiv.org/pdf/2412.02408v1.pdf","comment":"12 pages, 6 figures"},{"id":"http://arxiv.org/abs/2412.02403v1","updated":"2024-12-03T11:53:05Z","published":"2024-12-03T11:53:05Z","title":"3D Face Reconstruction From Radar Images","summary":" The 3D reconstruction of faces gains wide attention in computer vision and is\nused in many fields of application, for example, animation, virtual reality,\nand even forensics. This work is motivated by monitoring patients in sleep\nlaboratories. Due to their unique characteristics, sensors from the radar\ndomain have advantages compared to optical sensors, namely penetration of\nelectrically non-conductive materials and independence of light. These\nadvantages of radar signals unlock new applications and require adaptation of\n3D reconstruction frameworks. We propose a novel model-based method for 3D\nreconstruction from radar images. We generate a dataset of synthetic radar\nimages with a physics-based but non-differentiable radar renderer. This dataset\nis used to train a CNN-based encoder to estimate the parameters of a 3D\nmorphable face model. Whilst the encoder alone already leads to strong\nreconstructions of synthetic data, we extend our reconstruction in an\nAnalysis-by-Synthesis fashion to a model-based autoencoder. This is enabled by\nlearning the rendering process in the decoder, which acts as an object-specific\ndifferentiable radar renderer. Subsequently, the combination of both network\nparts is trained to minimize both, the loss of the parameters and the loss of\nthe resulting reconstructed radar image. 
This leads to the additional benefit,\nthat at test time the parameters can be further optimized by finetuning the\nautoencoder unsupervised on the image loss. We evaluated our framework on\ngenerated synthetic face images as well as on real radar images with 3D ground\ntruth of four individuals.\n","authors":["Valentin Braeutigam","Vanessa Wirth","Ingrid Ullmann","Christian Schüßler","Martin Vossiek","Matthias Berking","Bernhard Egger"],"pdf_url":"https://arxiv.org/pdf/2412.02403v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02399v1","updated":"2024-12-03T11:49:01Z","published":"2024-12-03T11:49:01Z","title":"OMENN: One Matrix to Explain Neural Networks","summary":" Deep Learning (DL) models are often black boxes, making their decision-making\nprocesses difficult to interpret. This lack of transparency has driven\nadvancements in eXplainable Artificial Intelligence (XAI), a field dedicated to\nclarifying the reasoning behind DL model predictions. Among these,\nattribution-based methods such as LRP and GradCAM are widely used, though they\nrely on approximations that can be imprecise.\n To address these limitations, we introduce One Matrix to Explain Neural\nNetworks (OMENN), a novel post-hoc method that represents a neural network as a\nsingle, interpretable matrix for each specific input. This matrix is\nconstructed through a series of linear transformations that represent the\nprocessing of the input by each successive layer in the neural network. As a\nresult, OMENN provides locally precise, attribution-based explanations of the\ninput across various modern models, including ViTs and CNNs. 
We present a\ntheoretical analysis of OMENN based on dynamic linearity property and validate\nits effectiveness with extensive tests on two XAI benchmarks, demonstrating\nthat OMENN is competitive with state-of-the-art methods.\n","authors":["Adam Wróbel","Mikołaj Janusz","Bartosz Zieliński","Dawid Rymarczyk"],"pdf_url":"https://arxiv.org/pdf/2412.02399v1.pdf","comment":"Under review, code will be released after acceptance"},{"id":"http://arxiv.org/abs/2303.04613v5","updated":"2024-12-03T11:48:24Z","published":"2023-03-08T14:32:59Z","title":"The Descriptive Complexity of Graph Neural Networks","summary":" We analyse the power of graph neural networks (GNNs) in terms of Boolean\ncircuit complexity and descriptive complexity.\n We prove that the graph queries that can be computed by a polynomial-size\nbounded-depth family of GNNs are exactly those definable in the guarded\nfragment GFO+C of first-order logic with counting and with built-in relations.\nThis puts GNNs in the circuit complexity class (non-uniform) $\\text{TC}^0$.\nRemarkably, the GNN families may use arbitrary real weights and a wide class of\nactivation functions that includes the standard ReLU, logistic \"sigmoid\", and\nhyperbolic tangent functions. If the GNNs are allowed to use random\ninitialisation and global readout (both standard features of GNNs widely used\nin practice), they can compute exactly the same queries as bounded depth\nBoolean circuits with threshold gates, that is, exactly the queries in\n$\\text{TC}^0$.\n Moreover, we show that queries computable by a single GNN with piecewise\nlinear activations and rational weights are definable in GFO+C without built-in\nrelations. 
Therefore, they are contained in uniform $\\text{TC}^0$.\n","authors":["Martin Grohe"],"pdf_url":"https://arxiv.org/pdf/2303.04613v5.pdf","comment":"Journal version for TheoretiCS"},{"id":"http://arxiv.org/abs/2003.12366v2","updated":"2024-12-03T11:13:27Z","published":"2020-03-22T11:21:29Z","title":"Training for Speech Recognition on Coprocessors","summary":" Automatic Speech Recognition (ASR) has increased in popularity in recent\nyears. The evolution of processor and storage technologies has enabled more\nadvanced ASR mechanisms, fueling the development of virtual assistants such as\nAmazon Alexa, Apple Siri, Microsoft Cortana, and Google Home. The interest in\nsuch assistants, in turn, has amplified the novel developments in ASR research.\nHowever, despite this popularity, there has not been a detailed training\nefficiency analysis of modern ASR systems. This mainly stems from: the\nproprietary nature of many modern applications that depend on ASR, like the\nones listed above; the relatively expensive co-processor hardware that is used\nto accelerate ASR by big vendors to enable such applications; and the absence\nof well-established benchmarks. The goal of this paper is to address the latter\ntwo of these challenges. The paper first describes an ASR model, based on a\ndeep neural network inspired by recent work in this domain, and our experiences\nbuilding it. Then we evaluate this model on three CPU-GPU co-processor\nplatforms that represent different budget categories. Our results demonstrate\nthat utilizing hardware acceleration yields good results even without high-end\nequipment. While the most expensive platform (10X price of the least expensive\none) converges to the initial accuracy target 10-30% and 60-70% faster than the\nother two, the differences among the platforms almost disappear at slightly\nhigher accuracy targets. 
In addition, our results further highlight both the\ndifficulty of evaluating ASR systems due to the complex, long, and resource\nintensive nature of the model training in this domain, and the importance of\nestablishing benchmarks for ASR.\n","authors":["Sebastian Baunsgaard","Sebastian B. Wrede","Pınar Tozun"],"pdf_url":"https://arxiv.org/pdf/2003.12366v2.pdf","comment":"published at ADMS 2020"},{"id":"http://arxiv.org/abs/2403.16970v3","updated":"2024-12-03T11:09:31Z","published":"2024-03-25T17:31:12Z","title":"Enhancing joint automatic chest X-ray diagnosis and clinical visual\n attention prediction with multi-stage cooperative learning","summary":" Purpose: As visual inspection is an inherent process during radiological\nscreening, the associated eye gaze data can provide valuable insights into\nrelevant clinical decisions. As deep learning has become the state-of-the-art\nfor computer-assisted diagnosis, integrating human behavior, such as eye gaze\ndata, into these systems is instrumental to help align machine predictions with\nclinical diagnostic criteria, thus enhancing the quality of automatic\nradiological diagnosis. Methods: We propose a novel deep learning framework for\njoint disease diagnosis and prediction of corresponding clinical visual\nattention maps for chest X-ray scans. Specifically, we introduce a new\ndual-encoder multi-task UNet, which leverages both a DenseNet201 backbone and a\nResidual and Squeeze-and-Excitation block-based encoder to extract diverse\nfeatures for visual attention map prediction, and a multi-scale feature-fusion\nclassifier to perform disease classification. To tackle the issue of\nasynchronous training schedules of individual tasks in multi-task learning, we\nproposed a multi-stage cooperative learning strategy, with contrastive learning\nfor feature encoder pretraining to boost performance. 
Results: Our proposed\nmethod is shown to significantly outperform existing techniques for chest X-ray\ndiagnosis (AUC=0.93) and the quality of visual attention map prediction\n(Correlation coefficient=0.58). Conclusion: Benefiting from the proposed\nmulti-task multi-stage cooperative learning, our technique demonstrates the\nbenefit of integrating clinicians' eye gaze into clinical AI systems to boost\nperformance and potentially explainability.\n","authors":["Zirui Qiu","Hassan Rivaz","Yiming Xiao"],"pdf_url":"https://arxiv.org/pdf/2403.16970v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.08488v2","updated":"2024-12-03T11:06:03Z","published":"2024-08-16T02:17:21Z","title":"PITN: Physics-Informed Temporal Networks for Cuffless Blood Pressure\n Estimation","summary":" Monitoring blood pressure with non-invasive sensors has gained popularity for\nproviding comfortable user experiences, one of which is a significant function\nof smart wearables. Although providing a comfortable user experience, such\nmethods are suffering from the demand for a significant amount of realistic\ndata to train an individual model for each subject, especially considering the\ninvasive or obtrusive BP ground-truth measurements. To tackle this challenge,\nwe introduce a novel physics-informed temporal network~(PITN) with adversarial\ncontrastive learning to enable precise BP estimation with very limited data.\nSpecifically, we first enhance the physics-informed neural network~(PINN) with\nthe temporal block for investigating BP dynamics' multi-periodicity for\npersonal cardiovascular cycle modeling and temporal variation. We then employ\nadversarial training to generate extra physiological time series data,\nimproving PITN's robustness in the face of sparse subject-specific training\ndata. Furthermore, we utilize contrastive learning to capture the\ndiscriminative variations of cardiovascular physiologic phenomena. 
This\napproach aggregates physiological signals with similar blood pressure values in\nlatent space while separating clusters of samples with dissimilar blood\npressure values. Experiments on three widely-adopted datasets with different\nmodalities (i.e., bioimpedance, PPG, millimeter-wave) demonstrate the\nsuperiority and effectiveness of the proposed methods over previous\nstate-of-the-art approaches. The code is available\nat https://github.com/Zest86/ACL-PITN.\n","authors":["Rui Wang","Mengshi Qi","Yingxia Shao","Anfu Zhou","Huadong Ma"],"pdf_url":"https://arxiv.org/pdf/2408.08488v2.pdf","comment":"12 pages, 6 figures"},{"id":"http://arxiv.org/abs/2412.02372v1","updated":"2024-12-03T10:58:34Z","published":"2024-12-03T10:58:34Z","title":"HERO: Hint-Based Efficient and Reliable Query Optimizer","summary":" We propose a novel model for learned query optimization which provides query\nhints leading to better execution plans. The model addresses the three key\nchallenges in learned hint-based query optimization: reliable hint\nrecommendation (ensuring non-degradation of query latency), efficient hint\nexploration, and fast inference. We provide an in-depth analysis of existing\nNN-based approaches to hint-based optimization and experimentally confirm the\nnamed challenges for them. Our alternative solution consists of a new inference\nschema based on an ensemble of context-aware models and a graph storage for\nreliable hint suggestion and fast inference, and a budget-controlled training\nprocedure with a local search algorithm that solves the issue of exponential\nsearch space exploration. In experiments on standard benchmarks, our model\ndemonstrates optimization capability close to the best achievable with\ncoarse-grained hints. Controlling the degree of parallelism (query dop) in\naddition to operator-related hints enables our model to achieve 3x latency\nimprovement on JOB benchmark which sets a new standard for optimization. 
Our\nmodel is interpretable and easy to debug, which is particularly important for\ndeployment in production.\n","authors":["Sergey Zinchenko","Sergey Iazov"],"pdf_url":"https://arxiv.org/pdf/2412.02372v1.pdf","comment":"Submitted to VLDB 2025; 13 pages; 13 figures"},{"id":"http://arxiv.org/abs/2412.02352v1","updated":"2024-12-03T10:17:15Z","published":"2024-12-03T10:17:15Z","title":"LoRA Diffusion: Zero-Shot LoRA Synthesis for Diffusion Model\n Personalization","summary":" Low-Rank Adaptation (LoRA) and other parameter-efficient fine-tuning (PEFT)\nmethods provide low-memory, storage-efficient solutions for personalizing\ntext-to-image models. However, these methods offer little to no improvement in\nwall-clock training time or the number of steps needed for convergence compared\nto full model fine-tuning. While PEFT methods assume that shifts in generated\ndistributions (from base to fine-tuned models) can be effectively modeled\nthrough weight changes in a low-rank subspace, they fail to leverage knowledge\nof common use cases, which typically focus on capturing specific styles or\nidentities. Observing that desired outputs often comprise only a small subset\nof the possible domain covered by LoRA training, we propose reducing the search\nspace by incorporating a prior over regions of interest. 
We demonstrate that\ntraining a hypernetwork model to generate LoRA weights can achieve competitive\nquality for specific domains while enabling near-instantaneous conditioning on\nuser input, in contrast to traditional training methods that require thousands\nof steps.\n","authors":["Ethan Smith","Rami Seid","Alberto Hojel","Paramita Mishra","Jianbo Wu"],"pdf_url":"https://arxiv.org/pdf/2412.02352v1.pdf","comment":"9 pages, 6 figures"},{"id":"http://arxiv.org/abs/2412.02340v1","updated":"2024-12-03T10:03:12Z","published":"2024-12-03T10:03:12Z","title":"Federated Analytics in Practice: Engineering for Privacy, Scalability\n and Practicality","summary":" Cross-device Federated Analytics (FA) is a distributed computation paradigm\ndesigned to answer analytics queries about and derive insights from data held\nlocally on users' devices. On-device computations combined with other privacy\nand security measures ensure that only minimal data is transmitted off-device,\nachieving a high standard of data protection. Despite FA's broad relevance, the\napplicability of existing FA systems is limited by compromised accuracy; lack\nof flexibility for data analytics; and an inability to scale effectively. In\nthis paper, we describe our approach to combine privacy, scalability, and\npracticality to build and deploy a system that overcomes these limitations. Our\nFA system leverages trusted execution environments (TEEs) and optimizes the use\nof on-device computing resources to facilitate federated data processing across\nlarge fleets of devices, while ensuring robust, defensible, and verifiable\nprivacy safeguards. 
We focus on federated analytics (statistics and\nmonitoring), in contrast to systems for federated learning (ML workloads), and\nwe flag the key differences.\n","authors":["Harish Srinivas","Graham Cormode","Mehrdad Honarkhah","Samuel Lurye","Jonathan Hehir","Lunwen He","George Hong","Ahmed Magdy","Dzmitry Huba","Kaikai Wang","Shen Guo","Shoubhik Bhattacharya"],"pdf_url":"https://arxiv.org/pdf/2412.02340v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.01464v2","updated":"2024-12-03T10:01:06Z","published":"2024-10-02T12:16:46Z","title":"Flow Matching for Accelerated Simulation of Atomic Transport in\n Materials","summary":" We introduce LiFlow, a generative framework to accelerate molecular dynamics\n(MD) simulations for crystalline materials that formulates the task as\nconditional generation of atomic displacements. The model uses flow matching,\nwith a Propagator submodel to generate atomic displacements and a Corrector to\nlocally correct unphysical geometries, and incorporates an adaptive prior based\non the Maxwell-Boltzmann distribution to account for chemical and thermal\nconditions. We benchmark LiFlow on a dataset comprising 25-ps trajectories of\nlithium diffusion across 4,186 solid-state electrolyte (SSE) candidates at four\ntemperatures. The model obtains a consistent Spearman rank correlation of\n0.7-0.8 for lithium mean squared displacement (MSD) predictions on unseen\ncompositions. 
Furthermore, LiFlow generalizes from short training trajectories\nto larger supercells and longer simulations while maintaining high accuracy.\nWith speed-ups of up to 600,000$\\times$ compared to first-principles methods,\nLiFlow enables scalable simulations at significantly larger length and time\nscales.\n","authors":["Juno Nam","Sulin Liu","Gavin Winter","KyuJung Jun","Soojung Yang","Rafael Gómez-Bombarelli"],"pdf_url":"https://arxiv.org/pdf/2410.01464v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02335v1","updated":"2024-12-03T09:55:00Z","published":"2024-12-03T09:55:00Z","title":"An Adaptive Grasping Force Tracking Strategy for Nonlinear and\n Time-Varying Object Behaviors","summary":" Accurate grasp force control is one of the key skills for ensuring successful\nand damage-free robotic grasping of objects. Although existing methods have\nconducted in-depth research on slip detection and grasping force planning, they\noften overlook the issue of adaptive tracking of the actual force to the target\nforce when handling objects with different material properties. The optimal\nparameters of a force tracking controller are significantly influenced by the\nobject's stiffness, and many adaptive force tracking algorithms rely on\nstiffness estimation. However, real-world objects often exhibit viscous,\nplastic, or other more complex nonlinear time-varying behaviors, and existing\nstudies provide insufficient support for these materials in terms of stiffness\ndefinition and estimation. To address this, this paper introduces the concept\nof generalized stiffness, extending the definition of stiffness to nonlinear\ntime-varying grasp system models, and proposes an online generalized stiffness\nestimator based on Long Short-Term Memory (LSTM) networks. Based on generalized\nstiffness, this paper proposes an adaptive parameter adjustment strategy using\na PI controller as an example, enabling dynamic force tracking for objects with\nvarying characteristics. 
Experimental results demonstrate that the proposed\nmethod achieves high precision and short probing time, while showing better\nadaptability to non-ideal objects compared to existing methods. The method\neffectively solves the problem of grasp force tracking in unknown, nonlinear,\nand time-varying grasp systems, enhancing the robotic grasping ability in\nunstructured environments.\n","authors":["Ziyang Cheng","Xiangyu Tian","Ruomin Sui","Tiemin Li","Yao Jiang"],"pdf_url":"https://arxiv.org/pdf/2412.02335v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.04346v2","updated":"2024-12-03T09:54:40Z","published":"2023-12-07T15:06:06Z","title":"Detection and Imputation based Two-Stage Denoising Diffusion Power\n System Measurement Recovery under Cyber-Physical Uncertainties","summary":" Power system cyber-physical uncertainties, including measurement ambiguities\nstemming from cyber attacks and data losses, along with system uncertainties\nintroduced by massive renewables and complex dynamics, reduce the likelihood of\nenhancing the quality of measurements. Fortunately, denoising diffusion models\nexhibit powerful learning and generation abilities for the complex underlying\nphysics of the real world. To this end, this paper proposes an improved\ndetection and imputation based two-stage denoising diffusion model (TSDM) to\nidentify and reconstruct the measurements with various cyber-physical\nuncertainties. The first stage of the model comprises a classifier-guided\nconditional anomaly detection component, while the second stage involves\ndiffusion-based measurement imputation component. Moreover, the proposed TSDM\nadopts optimal variance to accelerate the diffusion generation process with\nsubsequence sampling. 
Extensive numerical case studies demonstrate that the\nproposed TSDM can accurately recover power system measurements despite\nrenewables-induced strong randomness and highly nonlinear dynamics.\nAdditionally, the proposed TSDM has stronger robustness compared to existing\nreconstruction networks and exhibits lower computational complexity than\ngeneral denoising diffusion models.\n","authors":["Jianhua Pei","Jingyu Wang","Dongyuan Shi","Ping Wang"],"pdf_url":"https://arxiv.org/pdf/2312.04346v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.06644v3","updated":"2024-12-03T09:49:07Z","published":"2024-06-09T23:39:31Z","title":"Latent Diffusion Model-Enabled Low-Latency Semantic Communication in the\n Presence of Semantic Ambiguities and Wireless Channel Noises","summary":" Deep learning (DL)-based Semantic Communications (SemCom) is becoming\ncritical to maximize overall efficiency of communication networks.\nNevertheless, SemCom is sensitive to wireless channel uncertainties and source\noutliers, and suffers from poor generalization. To address the\nmentioned challenges, this paper develops a latent diffusion model-enabled\nSemCom system with three key contributions, i.e., i) to handle potential\noutliers in the source data, semantic errors obtained by projected gradient\ndescent based on the vulnerabilities of DL models, are utilized to update the\nparameters and obtain an outlier-robust encoder, ii) a lightweight single-layer\nlatent space transformation adapter completes one-shot learning at the\ntransmitter and is placed before the decoder at the receiver, enabling\nadaptation for out-of-distribution data and enhancing human-perceptual quality,\nand iii) an end-to-end consistency distillation (EECD) strategy is used to\ndistill the diffusion models trained in latent space, enabling deterministic\nsingle or few-step low-latency denoising in various noisy channels while\nmaintaining high semantic quality. 
Extensive numerical experiments across\ndifferent datasets demonstrate the superiority of the proposed SemCom system,\nconsistently proving its robustness to outliers, the capability to transmit\ndata with unknown distributions, and the ability to perform real-time channel\ndenoising tasks while preserving high human perceptual quality, outperforming\nthe existing denoising approaches in semantic metrics like learned perceptual\nimage patch similarity (LPIPS).\n","authors":["Jianhua Pei","Cheng Feng","Ping Wang","Hina Tabassum","Dongyuan Shi"],"pdf_url":"https://arxiv.org/pdf/2406.06644v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02331v1","updated":"2024-12-03T09:48:28Z","published":"2024-12-03T09:48:28Z","title":"Sample Efficient Robot Learning in Supervised Effect Prediction Tasks","summary":" In self-supervised robot learning, robots actively explore their environments\nand generate data by acting on entities in the environment. Therefore, an\nexploration policy is desired that ensures sample efficiency to minimize robot\nexecution costs while still providing accurate learning. For this purpose, the\nrobotic community has adopted Intrinsic Motivation (IM)-based approaches such\nas Learning Progress (LP). On the machine learning front, Active Learning (AL)\nhas been used successfully, especially for classification tasks. In this work,\nwe develop a novel AL framework geared towards robotics regression tasks, such\nas action-effect prediction and, more generally, for world model learning,\nwhich we call MUSEL - Model Uncertainty for Sample Efficient Learning. MUSEL\naims to extract model uncertainty from the total uncertainty estimate given by\na suitable learning engine by making use of learning progress and input\ndiversity and use it to improve sample efficiency beyond the state-of-the-art\naction-effect prediction methods. 
We demonstrate the feasibility of our model\nby using a Stochastic Variational Gaussian Process (SVGP) as the learning\nengine and testing the system on a set of robotic experiments in simulation.\nThe efficacy of MUSEL is demonstrated by comparing its performance to standard\nmethods used in robot action-effect learning. In a robotic tabletop environment\nin which a robot manipulator is tasked with learning the effect of its actions,\nthe experiments show that MUSEL facilitates higher accuracy in learning action\neffects while ensuring sample efficiency.\n","authors":["Mehmet Arda Eren","Erhan Oztop"],"pdf_url":"https://arxiv.org/pdf/2412.02331v1.pdf","comment":"18 pages, 18 figures"},{"id":"http://arxiv.org/abs/2404.10746v3","updated":"2024-12-03T09:47:45Z","published":"2024-04-16T17:24:22Z","title":"Interpolation and differentiation of alchemical degrees of freedom in\n machine learning interatomic potentials","summary":" Machine learning interatomic potentials (MLIPs) have become a workhorse of\nmodern atomistic simulations, and recently published universal MLIPs,\npre-trained on large datasets, have demonstrated remarkable accuracy and\ngeneralizability. However, the computational cost of MLIPs limits their\napplicability to chemically disordered systems requiring large simulation cells\nor to sample-intensive statistical methods. Here, we report the use of\ncontinuous and differentiable alchemical degrees of freedom in atomistic\nmaterials simulations, exploiting the fact that graph neural network MLIPs\nrepresent discrete elements as real-valued tensors. The proposed method\nintroduces alchemical atoms with corresponding weights into the input graph,\nalongside modifications to the message-passing and readout mechanisms of MLIPs,\nand allows smooth interpolation between the compositional states of materials.\nThe end-to-end differentiability of MLIPs enables efficient calculation of the\ngradient of energy with respect to the compositional weights. 
With this\nmodification, we propose methodologies for optimizing the composition of solid\nsolutions towards target macroscopic properties, characterizing order and\ndisorder in multicomponent oxides, and conducting alchemical free energy\nsimulations to quantify the free energy of vacancy formation and composition\nchanges. The approach offers an avenue for extending the capabilities of\nuniversal MLIPs in the modeling of compositional disorder and characterizing\nthe phase stability of complex materials systems.\n","authors":["Juno Nam","Jiayu Peng","Rafael Gómez-Bombarelli"],"pdf_url":"https://arxiv.org/pdf/2404.10746v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02328v1","updated":"2024-12-03T09:42:16Z","published":"2024-12-03T09:42:16Z","title":"Efficient Model Compression Techniques with FishLeg","summary":" In many domains, the most successful AI models tend to be the largest, indeed\noften too large to be handled by AI players with limited computational\nresources. To mitigate this, a number of compression methods have been\ndeveloped, including methods that prune the network down to high sparsity\nwhilst retaining performance. The best-performing pruning techniques are often\nthose that use second-order curvature information (such as an estimate of the\nFisher information matrix) to score the importance of each weight and to\npredict the optimal compensation for weight deletion. However, these methods\nare difficult to scale to high-dimensional parameter spaces without making\nheavy approximations. Here, we propose the FishLeg surgeon (FLS), a new\nsecond-order pruning method based on the Fisher-Legendre (FishLeg) optimizer.\nAt the heart of FishLeg is a meta-learning approach to amortising the action of\nthe inverse FIM, which brings a number of advantages. 
Firstly, the\nparameterisation enables the use of flexible tensor factorisation techniques to\nimprove computational and memory efficiency without sacrificing much accuracy,\nalleviating challenges associated with scalability of most second-order pruning\nmethods. Secondly, directly estimating the inverse FIM leads to less\nsensitivity to the amplification of stochasticity during inversion, thereby\nresulting in more precise estimates. Thirdly, our approach also allows for\nprogressive assimilation of the curvature into the parameterisation. In the\ngradual pruning regime, this results in a more efficient estimate refinement as\nopposed to re-estimation. We find that FishLeg achieves higher or comparable\nperformance against two common baselines in the area, most notably in the high\nsparsity regime when considering a ResNet18 model on CIFAR-10 (84% accuracy at\n95% sparsity vs 60% for OBS) and TinyIM (53% accuracy at 80% sparsity vs 48%\nfor OBS).\n","authors":["Jamie McGowan","Wei Sheng Lai","Weibin Chen","Henry Aldridge","Jools Clarke","Jezabel Garcia","Rui Xia","Yilei Liang","Guillaume Hennequin","Alberto Bernacchia"],"pdf_url":"https://arxiv.org/pdf/2412.02328v1.pdf","comment":"Published in NeurIPS 2024 - Neural Compression Workshop, 13 pages, 6\n figures"},{"id":"http://arxiv.org/abs/2412.02327v1","updated":"2024-12-03T09:40:59Z","published":"2024-12-03T09:40:59Z","title":"Switchable deep beamformer for high-quality and real-time passive\n acoustic mapping","summary":" Passive acoustic mapping (PAM) is a promising tool for monitoring acoustic\ncavitation activities in the applications of ultrasound therapy. Data-adaptive\nbeamformers for PAM have better image quality compared to the time exposure\nacoustics (TEA) algorithms. However, the computational cost of data-adaptive\nbeamformers is considerably expensive. 
In this work, we develop a deep\nbeamformer based on a generative adversarial network, which can switch between\ndifferent transducer arrays and reconstruct high-quality PAM images directly\nfrom radio frequency ultrasound signals with low computational cost. The deep\nbeamformer was trained on the dataset consisting of simulated and experimental\ncavitation signals of single and multiple microbubble clouds measured by\ndifferent (linear and phased) arrays covering 1-15 MHz. We compared the\nperformance of the deep beamformer to TEA and three different data-adaptive\nbeamformers using the simulated and experimental test dataset. Compared with\nTEA, the deep beamformer reduced the energy spread area by 18.9%-65.0% and\nimproved the image signal-to-noise ratio by 9.3-22.9 dB on average for the\ndifferent arrays in our data. Compared to the data-adaptive beamformers, the\ndeep beamformer reduced the computational cost by three orders of magnitude\nachieving 10.5 ms image reconstruction speed in our data, while the image\nquality was as good as that of the data-adaptive beamformers. These results\ndemonstrated the potential of the deep beamformer for high-resolution\nmonitoring of microbubble cavitation activities for ultrasound therapy.\n","authors":["Yi Zeng","Jinwei Li","Hui Zhu","Shukuan Lu","Jianfeng Li","Xiran Cai"],"pdf_url":"https://arxiv.org/pdf/2412.02327v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.08802v3","updated":"2024-12-03T09:39:57Z","published":"2024-02-05T14:20:19Z","title":"Governance of Generative Artificial Intelligence for Companies","summary":" Generative Artificial Intelligence (GenAI), specifically large language\nmodels like ChatGPT, has swiftly entered organizations without adequate\ngovernance, posing both opportunities and risks. Despite extensive debates on\nGenAI's transformative nature and regulatory measures, limited research\naddresses organizational governance, encompassing technical and business\nperspectives. 
Although numerous frameworks for governance of AI exist, it is\nnot clear to what extent they apply to GenAI. Our review paper fills this gap\nby surveying recent works with the purpose of better understanding fundamental\ncharacteristics of GenAI and adjusting prior frameworks specifically towards\nGenAI governance within companies. To do so, it extends Nickerson's framework\ndevelopment processes to include prior conceptualizations. Our framework\noutlines the scope, objectives, and governance mechanisms tailored to harness\nbusiness opportunities as well as mitigate risks associated with GenAI\nintegration. Our research contributes a focused approach to GenAI governance,\noffering practical insights for companies navigating the challenges of GenAI\nadoption and highlighting research gaps.\n","authors":["Johannes Schneider","Pauline Kuss","Rene Abraham","Christian Meske"],"pdf_url":"https://arxiv.org/pdf/2403.08802v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02316v1","updated":"2024-12-03T09:32:02Z","published":"2024-12-03T09:32:02Z","title":"Optimizing Plastic Waste Collection in Water Bodies Using Heterogeneous\n Autonomous Surface Vehicles with Deep Reinforcement Learning","summary":" This paper presents a model-free deep reinforcement learning framework for\ninformative path planning with heterogeneous fleets of autonomous surface\nvehicles to locate and collect plastic waste. The system employs two teams of\nvehicles: scouts and cleaners. Coordination between these teams is achieved\nthrough a deep reinforcement approach, allowing agents to learn strategies to\nmaximize cleaning efficiency. The primary objective is for the scout team to\nprovide an up-to-date contamination model, while the cleaner team collects as\nmuch waste as possible following this model. This strategy leads to\nheterogeneous teams that optimize fleet efficiency through inter-team\ncooperation supported by a tailored reward function. 
Different trainings of the\nproposed algorithm are compared with other state-of-the-art heuristics in two\ndistinct scenarios, one with high convexity and another with narrow corridors\nand challenging access. According to the obtained results, it is demonstrated\nthat deep reinforcement learning based algorithms outperform other benchmark\nheuristics, exhibiting superior adaptability. In addition, training with greedy\nactions further enhances performance, particularly in scenarios with intricate\nlayouts.\n","authors":["Alejandro Mendoza Barrionuevo","Samuel Yanes Luis","Daniel Gutiérrez Reina","Sergio L. Toral Marín"],"pdf_url":"https://arxiv.org/pdf/2412.02316v1.pdf","comment":"This article is currently under revision for the Robotics and\n Automation Letters (IEEE)"},{"id":"http://arxiv.org/abs/2412.02313v1","updated":"2024-12-03T09:30:57Z","published":"2024-12-03T09:30:57Z","title":"Noisy Ostracods: A Fine-Grained, Imbalanced Real-World Dataset for\n Benchmarking Robust Machine Learning and Label Correction Methods","summary":" We present the Noisy Ostracods, a noisy dataset for genus and species\nclassification of crustacean ostracods with specialists' annotations. Over the\n71466 specimens collected, 5.58% of them are estimated to be noisy (possibly\nproblematic) at genus level. The dataset is created to address a real-world\nchallenge: creating a clean fine-grained taxonomy dataset. The Noisy Ostracods\ndataset has diverse noises from multiple sources. Firstly, the noise is\nopen-set, including new classes discovered during curation that were not part\nof the original annotation. The dataset has pseudo-classes, where annotators\nmisclassified samples that should belong to an existing class into a new\npseudo-class. The Noisy Ostracods dataset is highly imbalanced with an imbalance\nfactor $\\rho$ = 22429. 
This presents a unique challenge for robust machine\nlearning methods, as existing approaches have not been extensively evaluated on\nfine-grained classification tasks with such diverse real-world noise. Initial\nexperiments using current robust learning techniques have not yielded\nsignificant performance improvements on the Noisy Ostracods dataset compared to\ncross-entropy training on the raw, noisy data. On the other hand, noise\ndetection methods have underperformed in error hit rate compared to naive\ncross-validation ensembling for identifying problematic labels. These findings\nsuggest that the fine-grained, imbalanced nature, and complex noise\ncharacteristics of the dataset present considerable challenges for existing\nnoise-robust algorithms. By openly releasing the Noisy Ostracods dataset, our\ngoal is to encourage further research into the development of noise-resilient\nmachine learning methods capable of effectively handling diverse, real-world\nnoise in fine-grained classification tasks. The dataset, along with its\nevaluation protocols, can be accessed at\nhttps://github.com/H-Jamieu/Noisy_ostracods.\n","authors":["Jiamian Hu","Yuanyuan Hong","Yihua Chen","He Wang","Moriaki Yasuhara"],"pdf_url":"https://arxiv.org/pdf/2412.02313v1.pdf","comment":"Initial submit"},{"id":"http://arxiv.org/abs/2411.18506v2","updated":"2024-12-03T09:25:11Z","published":"2024-11-27T16:48:24Z","title":"LLM-ABBA: Understanding time series via symbolic approximation","summary":" The success of large language models (LLMs) for time series has been\ndemonstrated in previous work. Utilizing a symbolic time series representation,\none can efficiently bridge the gap between LLMs and time series. However, the\nremaining challenge is to exploit the semantic information hidden in time\nseries by using symbols or existing tokens of LLMs, while aligning the\nembedding space of LLMs according to the hidden information of time series. 
The\nsymbolic time series approximation (STSA) method called adaptive Brownian\nbridge-based symbolic aggregation (ABBA) shows outstanding efficacy in\npreserving salient time series features by modeling time series patterns in\nterms of amplitude and period while using existing tokens of LLMs.\n In this paper, we introduce a method, called LLM-ABBA, that integrates ABBA\ninto large language models for various downstream time series tasks. By\nsymbolizing time series, LLM-ABBA compares favorably to the recent\nstate-of-the-art (SOTA) in UCR and three medical time series classification\ntasks. Meanwhile, a fixed-polygonal chain trick in ABBA is introduced to\navoid obvious drifting during prediction tasks by significantly mitigating\nthe effects of cumulative error arising from misused symbols during the\ntransition from symbols to numerical values. In time series regression tasks,\nLLM-ABBA achieves the new SOTA on Time Series Extrinsic Regression (TSER)\nbenchmarks. LLM-ABBA also shows competitive prediction capability compared to\nrecent SOTA time series prediction results. We believe this framework can also\nseamlessly extend to other time series tasks.\n","authors":["Erin Carson","Xinye Chen","Cheng Kang"],"pdf_url":"https://arxiv.org/pdf/2411.18506v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.16861v2","updated":"2024-12-03T09:17:43Z","published":"2024-05-27T06:26:55Z","title":"BInD: Bond and Interaction-generating Diffusion Model for\n Multi-objective Structure-based Drug Design","summary":" A remarkable advance in geometric deep generative models with accumulated\nstructural data enables structure-based drug design (SBDD) with target protein\ninformation only. However, most existing models struggle to address\nmulti-objectives simultaneously while performing well only in their specialized\ntasks. Here, we present BInD, a diffusion model with knowledge-based guidance\nfor multi-objective SBDD. 
BInD is designed to co-generate molecules and their\ninteractions with a target protein to consider all key objectives equally well,\nincluding target-specific interactions, molecular properties, and local\ngeometry. Comprehensive evaluations show that BInD achieves robust performance\nfor all objectives while outperforming or matching state-of-the-art methods for\neach. Finally, we propose a train-free optimization method empowered by\nretrieving target-specific interactions, highlighting the role of non-covalent\ninteractions in achieving higher selectivity and binding affinities to a target\nprotein.\n","authors":["Joongwon Lee","Wonho Zhung","Jisu Seo","Woo Youn Kim"],"pdf_url":"https://arxiv.org/pdf/2405.16861v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02302v1","updated":"2024-12-03T09:16:13Z","published":"2024-12-03T09:16:13Z","title":"Enhanced Photovoltaic Power Forecasting: An iTransformer and LSTM-Based\n Model Integrating Temporal and Covariate Interactions","summary":" Accurate photovoltaic (PV) power forecasting is critical for integrating\nrenewable energy sources into the grid, optimizing real-time energy management,\nand ensuring energy reliability amidst increasing demand. However, existing\nmodels often struggle with effectively capturing the complex relationships\nbetween target variables and covariates, as well as the interactions between\ntemporal dynamics and multivariate data, leading to suboptimal forecasting\naccuracy. To address these challenges, we propose a novel model architecture\nthat leverages the iTransformer for feature extraction from target variables\nand employs long short-term memory (LSTM) to extract features from covariates.\nA cross-attention mechanism is integrated to fuse the outputs of both models,\nfollowed by a Kolmogorov-Arnold network (KAN) mapping for enhanced\nrepresentation. 
The effectiveness of the proposed model is validated using\npublicly available datasets from Australia, with experiments conducted across\nfour seasons. Results demonstrate that the proposed model effectively captures\nseasonal variations in PV power generation and improves forecasting accuracy.\n","authors":["Guang Wu","Yun Wang","Qian Zhou","Ziyang Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.02302v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02295v1","updated":"2024-12-03T09:09:52Z","published":"2024-12-03T09:09:52Z","title":"CADMR: Cross-Attention and Disentangled Learning for Multimodal\n Recommender Systems","summary":" The increasing availability and diversity of multimodal data in recommender\nsystems offer new avenues for enhancing recommendation accuracy and user\nsatisfaction. However, these systems must contend with high-dimensional, sparse\nuser-item rating matrices, where reconstructing the matrix with only small\nsubsets of preferred items for each user poses a significant challenge. To\naddress this, we propose CADMR, a novel autoencoder-based multimodal\nrecommender system framework. CADMR leverages multi-head cross-attention\nmechanisms and Disentangled Learning to effectively integrate and utilize\nheterogeneous multimodal data in reconstructing the rating matrix. Our approach\nfirst disentangles modality-specific features while preserving their\ninterdependence, thereby learning a joint latent representation. The multi-head\ncross-attention mechanism is then applied to enhance user-item interaction\nrepresentations with respect to the learned multimodal item latent\nrepresentations. 
We evaluate CADMR on three benchmark datasets, demonstrating\nsignificant performance improvements over state-of-the-art methods.\n","authors":["Yasser Khalafaoui","Martino Lovisetto","Basarab Matei","Nistor Grozavu"],"pdf_url":"https://arxiv.org/pdf/2412.02295v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02294v1","updated":"2024-12-03T09:08:38Z","published":"2024-12-03T09:08:38Z","title":"Initial Study On Improving Segmentation By Combining Preoperative CT And\n Intraoperative CBCT Using Synthetic Data","summary":" Computer-Assisted Interventions enable clinicians to perform precise,\nminimally invasive procedures, often relying on advanced imaging methods.\nCone-beam computed tomography (CBCT) can be used to facilitate\ncomputer-assisted interventions, despite often suffering from artifacts that\npose challenges for accurate interpretation. While the degraded image quality\ncan affect image analysis, the availability of high quality, preoperative scans\noffers potential for improvements. Here we consider a setting where\npreoperative CT and intraoperative CBCT scans are available, however, the\nalignment (registration) between the scans is imperfect to simulate a real\nworld scenario. We propose a multimodal learning method that fuses roughly\naligned CBCT and CT scans and investigate the effect on segmentation\nperformance. For this experiment we use synthetically generated data containing\nreal CT and synthetic CBCT volumes with corresponding voxel annotations. We\nshow that this fusion setup improves segmentation performance in $18$ out of\n$20$ investigated setups.\n","authors":["Maximilian E. Tschuchnig","Philipp Steininger","Michael Gadermayr"],"pdf_url":"https://arxiv.org/pdf/2412.02294v1.pdf","comment":"Accepted at BVM 2025. 
arXiv admin note: text overlap with\n arXiv:2406.11650"},{"id":"http://arxiv.org/abs/2412.02292v1","updated":"2024-12-03T09:08:27Z","published":"2024-12-03T09:08:27Z","title":"Deep Matrix Factorization with Adaptive Weights for Multi-View\n Clustering","summary":" Recently, deep matrix factorization has been established as a powerful model\nfor unsupervised tasks, achieving promising results, especially for multi-view\nclustering. However, existing methods often lack effective feature selection\nmechanisms and rely on empirical hyperparameter selection. To address these\nissues, we introduce a novel Deep Matrix Factorization with Adaptive Weights\nfor Multi-View Clustering (DMFAW). Our method simultaneously incorporates\nfeature selection and generates local partitions, enhancing clustering results.\nNotably, the feature weights are controlled and adjusted by a parameter that\nis dynamically updated using a control-theory-inspired mechanism, which not only\nimproves the model's stability and adaptability to diverse datasets but also\naccelerates convergence. A late fusion approach is then proposed to align the\nweighted local partitions with the consensus partition. Finally, the\noptimization problem is solved via an alternating optimization algorithm with\ntheoretically guaranteed convergence. Extensive experiments on benchmark\ndatasets highlight that DMFAW outperforms state-of-the-art methods in terms of\nclustering performance.\n","authors":["Yasser Khalafaoui","Basarab Matei","Martino Lovisetto","Nistor Grozavu"],"pdf_url":"https://arxiv.org/pdf/2412.02292v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02291v1","updated":"2024-12-03T09:07:31Z","published":"2024-12-03T09:07:31Z","title":"Conformal Symplectic Optimization for Stable Reinforcement Learning","summary":" Training deep reinforcement learning (RL) agents necessitates overcoming the\nhighly unstable nonconvex stochastic optimization inherent in the\ntrial-and-error mechanism. 
To tackle this challenge, we propose a\nphysics-inspired optimization algorithm called relativistic adaptive gradient\ndescent (RAD), which enhances long-term training stability. By conceptualizing\nneural network (NN) training as the evolution of a conformal Hamiltonian\nsystem, we present a universal framework for transferring long-term stability\nfrom conformal symplectic integrators to iterative NN updating rules, where the\nchoice of kinetic energy governs the dynamical properties of resulting\noptimization algorithms. By utilizing relativistic kinetic energy, RAD\nincorporates principles from special relativity and limits parameter updates\nbelow a finite speed, effectively mitigating abnormal gradient influences.\nAdditionally, RAD models NN optimization as the evolution of a multi-particle\nsystem where each trainable parameter acts as an independent particle with an\nindividual adaptive learning rate. We prove RAD's sublinear convergence under\ngeneral nonconvex settings, where smaller gradient variance and larger batch\nsizes contribute to tighter convergence. Notably, RAD degrades to the\nwell-known adaptive moment estimation (ADAM) algorithm when its speed\ncoefficient is chosen as one and symplectic factor as a small positive value.\nExperimental results show RAD outperforming nine baseline optimizers with five\nRL algorithms across twelve environments, including standard benchmarks and\nchallenging scenarios. 
Notably, RAD achieves up to a 155.1% performance\nimprovement over ADAM in Atari games, showcasing its efficacy in stabilizing\nand accelerating RL training.\n","authors":["Yao Lyu","Xiangteng Zhang","Shengbo Eben Li","Jingliang Duan","Letian Tao","Qing Xu","Lei He","Keqiang Li"],"pdf_url":"https://arxiv.org/pdf/2412.02291v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02289v1","updated":"2024-12-03T09:06:57Z","published":"2024-12-03T09:06:57Z","title":"Learn More by Using Less: Distributed Learning with Energy-Constrained\n Devices","summary":" Federated Learning (FL) has emerged as a solution for distributed model\ntraining across decentralized, privacy-preserving devices, but the different\nenergy capacities of participating devices (system heterogeneity) constrain\nreal-world implementations. These energy limitations not only reduce model\naccuracy but also increase dropout rates, impacting on convergence in practical\nFL deployments. In this work, we propose LeanFed, an energy-aware FL framework\ndesigned to optimize client selection and training workloads on\nbattery-constrained devices. LeanFed leverages adaptive data usage by\ndynamically adjusting the fraction of local data each device utilizes during\ntraining, thereby maximizing device participation across communication rounds\nwhile ensuring they do not run out of battery during the process. We rigorously\nevaluate LeanFed against traditional FedAvg on CIFAR-10 and CIFAR-100 datasets,\nsimulating various levels of data heterogeneity and device participation rates.\nResults show that LeanFed consistently enhances model accuracy and stability,\nparticularly in settings with high data heterogeneity and limited battery life,\nby mitigating client dropout and extending device availability. 
This approach\ndemonstrates the potential of energy-efficient, privacy-preserving FL in\nreal-world, large-scale applications, setting a foundation for robust and\nsustainable pervasive AI on resource-constrained networks.\n","authors":["Roberto Pereira","Cristian J. Vaca-Rubio","Luis Blanco"],"pdf_url":"https://arxiv.org/pdf/2412.02289v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19146v2","updated":"2024-12-03T09:06:33Z","published":"2024-11-28T13:45:42Z","title":"Puzzle: Distillation-Based NAS for Inference-Optimized LLMs","summary":" Large language models (LLMs) have demonstrated remarkable capabilities, but\ntheir adoption is limited by high computational costs during inference. While\nincreasing parameter counts enhances accuracy, it also widens the gap between\nstate-of-the-art capabilities and practical deployability. We present Puzzle, a\nframework to accelerate LLM inference on specific hardware while preserving\ntheir capabilities. Through an innovative application of neural architecture\nsearch (NAS) at an unprecedented scale, Puzzle systematically optimizes models\nwith tens of billions of parameters under hardware constraints. Our approach\nutilizes blockwise local knowledge distillation (BLD) for parallel architecture\nexploration and employs mixed-integer programming for precise constraint\noptimization.\n We demonstrate the real-world impact of our framework through\nLlama-3.1-Nemotron-51B-Instruct (Nemotron-51B), a publicly available model\nderived from Llama-3.1-70B-Instruct. Nemotron-51B achieves a 2.17x inference\nthroughput speedup, fitting on a single NVIDIA H100 GPU while preserving 98.4%\nof the original model's capabilities. Nemotron-51B currently stands as the most\naccurate language model capable of inference on a single GPU with large batch\nsizes. Remarkably, this transformation required just 45B training tokens,\ncompared to over 15T tokens used for the 70B model it was derived from. 
This\nestablishes a new paradigm where powerful models can be optimized for efficient\ndeployment with only negligible compromise of their capabilities, demonstrating\nthat inference performance, not parameter count alone, should guide model\nselection. With the release of Nemotron-51B and the presentation of the Puzzle\nframework, we provide practitioners immediate access to state-of-the-art\nlanguage modeling capabilities at significantly reduced computational costs.\n","authors":["Akhiad Bercovich","Tomer Ronen","Talor Abramovich","Nir Ailon","Nave Assaf","Mohammad Dabbah","Ido Galil","Amnon Geifman","Yonatan Geifman","Izhak Golan","Netanel Haber","Ehud Karpas","Roi Koren","Itay Levy","Pavlo Molchanov","Shahar Mor","Zach Moshe","Najeeb Nabwani","Omri Puny","Ran Rubin","Itamar Schen","Ido Shahaf","Oren Tropp","Omer Ullman Argov","Ran Zilberstein","Ran El-Yaniv"],"pdf_url":"https://arxiv.org/pdf/2411.19146v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.11465v2","updated":"2024-12-03T09:04:35Z","published":"2024-11-18T10:58:46Z","title":"Re-examining learning linear functions in context","summary":" In context learning (ICL) is an attractive method of solving a wide range of\nproblems. Inspired by Garg et al. (2022), we look closely at ICL in a variety\nof train and test settings for several transformer models of different sizes\ntrained from scratch. Our study complements prior work by pointing out several\nsystematic failures of these models to generalize to data not in the training\ndistribution, thereby showing some limitations of ICL. 
We find that models\nadopt a strategy for this task that is very different from standard solutions.\n","authors":["Omar Naim","Guilhem Fouilhé","Nicholas Asher"],"pdf_url":"https://arxiv.org/pdf/2411.11465v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02285v1","updated":"2024-12-03T09:03:04Z","published":"2024-12-03T09:03:04Z","title":"GQWformer: A Quantum-based Transformer for Graph Representation Learning","summary":" Graph Transformers (GTs) have demonstrated significant advantages in graph\nrepresentation learning through their global attention mechanisms. However, the\nself-attention mechanism in GTs tends to neglect the inductive biases inherent\nin graph structures, making it challenging to effectively capture essential\nstructural information. To address this issue, we propose a novel approach that\nintegrates graph inductive bias into self-attention mechanisms by leveraging\nquantum technology for structural encoding. In this paper, we introduce the\nGraph Quantum Walk Transformer (GQWformer), a groundbreaking GNN framework that\nutilizes quantum walks on attributed graphs to generate node quantum states.\nThese quantum states encapsulate rich structural attributes and serve as\ninductive biases for the transformer, thereby enabling the generation of more\nmeaningful attention scores. By subsequently incorporating a recurrent neural\nnetwork, our design amplifies the model's ability to focus on both local and\nglobal information. We conducted comprehensive experiments across five publicly\navailable datasets to evaluate the effectiveness of our model. These results\nclearly indicate that GQWformer outperforms existing state-of-the-art graph\nclassification algorithms. 
These findings highlight the significant potential\nof integrating quantum computing methodologies with traditional GNNs to advance\nthe field of graph representation learning, providing a promising direction for\nfuture research and applications.\n","authors":["Lei Yu","Hongyang Chen","Jingsong Lv","Linyao Yang"],"pdf_url":"https://arxiv.org/pdf/2412.02285v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.05099v6","updated":"2024-12-03T08:58:22Z","published":"2023-04-11T09:51:13Z","title":"Feudal Graph Reinforcement Learning","summary":" Graph-based representations and message-passing modular policies constitute\nprominent approaches to tackling composable control problems in reinforcement\nlearning (RL). However, as shown by recent graph deep learning literature, such\nlocal message-passing operators can create information bottlenecks and hinder\nglobal coordination. The issue becomes more serious in tasks requiring\nhigh-level planning. In this work, we propose a novel methodology, named Feudal\nGraph Reinforcement Learning (FGRL), that addresses such challenges by relying\non hierarchical RL and a pyramidal message-passing architecture. In particular,\nFGRL defines a hierarchy of policies where high-level commands are propagated\nfrom the top of the hierarchy down through a layered graph structure. The\nbottom layers mimic the morphology of the physical system, while the upper\nlayers correspond to higher-order sub-modules. The resulting agents are then\ncharacterized by a committee of policies where actions at a certain level set\ngoals for the level below, thus implementing a hierarchical decision-making\nstructure that can naturally implement task decomposition. We evaluate the\nproposed framework on a graph clustering problem and MuJoCo locomotion tasks;\nsimulation results show that FGRL compares favorably against relevant\nbaselines. 
Furthermore, an in-depth analysis of the command propagation\nmechanism provides evidence that the introduced message-passing scheme favors\nlearning hierarchical decision-making policies.\n","authors":["Tommaso Marzi","Arshjot Khehra","Andrea Cini","Cesare Alippi"],"pdf_url":"https://arxiv.org/pdf/2304.05099v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.03848v3","updated":"2024-12-03T08:54:30Z","published":"2024-06-06T08:29:29Z","title":"OceanCastNet: A Deep Learning Ocean Wave Model with Energy Conservation","summary":" Traditional wave forecasting models, although based on energy conservation\nequations, are computationally expensive. On the other hand, existing deep\nlearning geophysical fluid models, while computationally efficient, often\nsuffer from issues such as energy dissipation in long-term forecasts. This\npaper proposes a novel energy-balanced deep learning wave forecasting model\ncalled OceanCastNet (OCN). By incorporating wind fields at the current,\nprevious, and future time steps, as well as wave fields at the current and\nprevious time steps as input variables, OCN maintains energy balance within the\nmodel. Furthermore, the model employs adaptive Fourier operators as its core\ncomponents and designs a masked loss function to better handle the impact of\nland-sea boundaries. A series of experiments on the ERA5 dataset demonstrate\nthat OCN can achieve short-term forecast accuracy comparable to traditional\nmodels while exhibiting an understanding of the wave generation process. 
In\ncomparative experiments under both normal and extreme conditions, OCN\nconsistently outperforms the widely used WaveWatch III model in the industry.\nEven after long-term forecasting, OCN maintains a stable and energy-rich state.\nBy further constructing a simple meteorological model, OCN-wind, which\nconsiders energy balance, this paper confirms the importance of energy\nconstraints for improving the long-term forecast performance of deep learning\nmeteorological models. This finding provides new ideas for future research on\ndeep learning geophysical fluid models.\n","authors":["Ziliang Zhang","Huaming Yu","Danqin Ren"],"pdf_url":"https://arxiv.org/pdf/2406.03848v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01585v2","updated":"2024-12-03T08:54:27Z","published":"2024-12-02T15:04:51Z","title":"FairML: A Julia Package for Fair Classification","summary":" In this paper, we propose FairML.jl, a Julia package providing a framework\nfor fair classification in machine learning. In this framework, the fair\nlearning process is divided into three stages. Each stage aims to reduce\nunfairness, such as disparate impact and disparate mistreatment, in the final\nprediction. For the preprocessing stage, we present a resampling method that\naddresses unfairness coming from data imbalances. The in-processing phase\nconsists of a classification method. This can be either one coming from the\nMLJ.jl package, or a user-defined one. For this phase, we incorporate fair ML\nmethods that can handle unfairness to a certain degree through their\noptimization process. In the post-processing, we discuss the choice of the\ncut-off value for fair prediction. 
With simulations, we show the performance of\nthe single phases and their combinations.\n","authors":["Jan Pablo Burgard","João Vitor Pamplona"],"pdf_url":"https://arxiv.org/pdf/2412.01585v2.pdf","comment":"25 pages, 8 figures"},{"id":"http://arxiv.org/abs/2410.10929v5","updated":"2024-12-03T08:48:21Z","published":"2024-10-14T16:35:27Z","title":"ASTM :Autonomous Smart Traffic Management System Using Artificial\n Intelligence CNN and LSTM","summary":" In the modern world, the development of Artificial Intelligence (AI) has\ncontributed to improvements in various areas, including automation, computer\nvision, fraud detection, and more. AI can be leveraged to enhance the\nefficiency of Autonomous Smart Traffic Management (ASTM) systems and reduce\ntraffic congestion rates. This paper presents an Autonomous Smart Traffic\nManagement (STM) system that uses AI to improve traffic flow rates. The system\nemploys the YOLO V5 Convolutional Neural Network to detect vehicles in traffic\nmanagement images. Additionally, it predicts the number of vehicles for the\nnext 12 hours using a Recurrent Neural Network with Long Short-Term Memory\n(RNN-LSTM). The Smart Traffic Management Cycle Length Analysis manages the\ntraffic cycle length based on these vehicle predictions, aided by AI. From the\nresults of the RNN-LSTM model for predicting vehicle numbers over the next 12\nhours, we observe that the model predicts traffic with a Mean Squared Error\n(MSE) of 4.521 vehicles and a Root Mean Squared Error (RMSE) of 2.232 vehicles.\nAfter simulating the STM system in the CARLA simulation environment, we found\nthat the Traffic Management Congestion Flow Rate with ASTM (21 vehicles per\nminute) is 50\\% higher than the rate without STM (around 15 vehicles per\nminute). Additionally, the Traffic Management Vehicle Pass Delay with STM (5\nseconds per vehicle) is 70\\% lower than without STM (around 12 seconds per\nvehicle). 
These results demonstrate that the STM system using AI can increase\ntraffic flow by 50\\% and reduce vehicle pass delays by 70\\%.\n","authors":["Christofel Rio Goenawan"],"pdf_url":"https://arxiv.org/pdf/2410.10929v5.pdf","comment":"In process to IEEE Intelligent Vehicle Symposium 2025"},{"id":"http://arxiv.org/abs/2411.09545v2","updated":"2024-12-03T08:48:06Z","published":"2024-11-14T15:59:41Z","title":"Equation-informed data-driven identification of flow budgets and\n dynamics","summary":" Computational Fluid Dynamics (CFD) is an indispensable method of fluid\nmodelling in engineering applications, reducing the need for physical\nprototypes and testing for tasks such as design optimisation and performance\nanalysis. Depending on the complexity of the system under consideration, models\nranging from low to high fidelity can be used for prediction, allowing\nsignificant speed-up. However, the choice of model requires information about\nthe actual dynamics of the flow regime. Correctly identifying the\nregions/clusters of flow that share the same dynamics has been a challenging\nresearch topic to date. In this study, we propose a novel hybrid approach to\nflow clustering. It consists of characterising each sample point of the system\nwith equation-based features, i.e. features are budgets that represent the\ncontribution of each term from the original governing equation to the local\ndynamics at each sample point. This was achieved by applying the Sparse\nIdentification of Nonlinear Dynamical systems (SINDy) method pointwise to time\nevolution data. The method proceeds with equation-based clustering using the\nGirvan-Newman algorithm. This allows the detection of communities that share\nthe same physical dynamics. The algorithm is implemented in both Eulerian and\nLagrangian frameworks. In the Lagrangian, i.e. dynamic approach, the clustering\nis performed on the trajectory of each point, allowing the change of clusters\nto be represented also in time. 
The performance of the algorithm is first\ntested on a flow around a cylinder. The construction of the dynamic clusters in\nthis test case clearly shows the evolution of the wake from the steady state\nsolution through the transient to the oscillatory solution. Dynamic clustering\nwas then successfully tested on turbulent flow data. Two distinct and\nwell-defined clusters were identified and their temporal evolution was\nreconstructed.\n","authors":["Nataliya Sevryugina","Serena Costanzo","Stephen de Bruyn Kops","Colm-cille Caulfield","Iraj Mortazavi","Taraneh Sayadi"],"pdf_url":"https://arxiv.org/pdf/2411.09545v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02273v1","updated":"2024-12-03T08:45:50Z","published":"2024-12-03T08:45:50Z","title":"Step-by-Step Guidance to Differential Anemia Diagnosis with Real-World\n Data and Deep Reinforcement Learning","summary":" Clinical diagnostic guidelines outline the key questions to answer to reach a\ndiagnosis. Inspired by guidelines, we aim to develop a model that learns from\nelectronic health records to determine the optimal sequence of actions for\naccurate diagnosis. Focusing on anemia and its sub-types, we employ deep\nreinforcement learning (DRL) algorithms and evaluate their performance on both\na synthetic dataset, which is based on expert-defined diagnostic pathways, and\na real-world dataset. We investigate the performance of these algorithms across\nvarious scenarios. 
Our experimental results demonstrate that DRL algorithms\nperform competitively with state-of-the-art methods while offering the\nsignificant advantage of progressively generating pathways to the suggested\ndiagnosis, providing a transparent decision-making process that can guide and\nexplain diagnostic reasoning.\n","authors":["Lillian Muyama","Estelle Lu","Geoffrey Cheminet","Jacques Pouchot","Bastien Rance","Anne-Isabelle Tropeano","Antoine Neuraz","Adrien Coulet"],"pdf_url":"https://arxiv.org/pdf/2412.02273v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2404.05913"},{"id":"http://arxiv.org/abs/2405.16158v3","updated":"2024-12-03T08:42:49Z","published":"2024-05-25T09:53:25Z","title":"Bigger, Regularized, Optimistic: scaling for compute and\n sample-efficient continuous control","summary":" Sample efficiency in Reinforcement Learning (RL) has traditionally been\ndriven by algorithmic enhancements. In this work, we demonstrate that scaling\ncan also lead to substantial improvements. We conduct a thorough investigation\ninto the interplay of scaling model capacity and domain-specific RL\nenhancements. These empirical findings inform the design choices underlying our\nproposed BRO (Bigger, Regularized, Optimistic) algorithm. The key innovation\nbehind BRO is that strong regularization allows for effective scaling of the\ncritic networks, which, paired with optimistic exploration, leads to superior\nperformance. BRO achieves state-of-the-art results, significantly outperforming\nthe leading model-based and model-free algorithms across 40 complex tasks from\nthe DeepMind Control, MetaWorld, and MyoSuite benchmarks. 
BRO is the first\nmodel-free algorithm to achieve near-optimal policies in the notoriously\nchallenging Dog and Humanoid tasks.\n","authors":["Michal Nauman","Mateusz Ostaszewski","Krzysztof Jankowski","Piotr Miłoś","Marek Cygan"],"pdf_url":"https://arxiv.org/pdf/2405.16158v3.pdf","comment":"NeurIPS 2024 Spotlight"},{"id":"http://arxiv.org/abs/2412.01566v2","updated":"2024-12-03T08:42:37Z","published":"2024-12-02T14:51:21Z","title":"Multi-objective Deep Learning: Taxonomy and Survey of the State of the\n Art","summary":" Simultaneously considering multiple objectives in machine learning has been a\npopular approach for several decades, with various benefits for multi-task\nlearning, the consideration of secondary goals such as sparsity, or\nmulticriteria hyperparameter tuning. However - as multi-objective optimization\nis significantly more costly than single-objective optimization - the recent\nfocus on deep learning architectures poses considerable additional challenges\ndue to the very large number of parameters, strong nonlinearities and\nstochasticity. This survey covers recent advancements in the area of\nmulti-objective deep learning. We introduce a taxonomy of existing methods -\nbased on the type of training algorithm as well as the decision maker's needs -\nbefore listing recent advancements, and also successful applications. All three\nmain learning paradigms supervised learning, unsupervised learning and\nreinforcement learning are covered, and we also address the recently very\npopular area of generative modeling.\n","authors":["Sebastian Peitz","Sedjro Salomon Hotegni"],"pdf_url":"https://arxiv.org/pdf/2412.01566v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02266v1","updated":"2024-12-03T08:38:30Z","published":"2024-12-03T08:38:30Z","title":"BOTracle: A framework for Discriminating Bots and Humans","summary":" Bots constitute a significant portion of Internet traffic and are a source of\nvarious issues across multiple domains. 
Modern bots often become\nindistinguishable from real users, as they employ similar methods to browse the\nweb, including using real browsers. We address the challenge of bot detection\nin high-traffic scenarios by analyzing three distinct detection methods. The\nfirst method operates on heuristics, allowing for rapid detection. The second\nmethod utilizes, well known, technical features, such as IP address, window\nsize, and user agent. It serves primarily for comparison with the third method.\nIn the third method, we rely solely on browsing behavior, omitting all static\nfeatures and focusing exclusively on how clients behave on a website. In\ncontrast to related work, we evaluate our approaches using real-world\ne-commerce traffic data, comprising 40 million monthly page visits. We further\ncompare our methods against another bot detection approach, Botcha, on the same\ndataset. Our performance metrics, including precision, recall, and AUC, reach\n98 percent or higher, surpassing Botcha.\n","authors":["Jan Kadel","August See","Ritwik Sinha","Mathias Fischer"],"pdf_url":"https://arxiv.org/pdf/2412.02266v1.pdf","comment":"Bot Detection; User Behaviour Analysis; Published at ESORICS\n International Workshops 2024"},{"id":"http://arxiv.org/abs/2412.02265v1","updated":"2024-12-03T08:37:28Z","published":"2024-12-03T08:37:28Z","title":"Diabetic Retinopathy Classification from Retinal Images using Machine\n Learning Approaches","summary":" Diabetic Retinopathy is one of the most familiar diseases and is a diabetes\ncomplication that affects eyes. Initially, diabetic retinopathy may cause no\nsymptoms or only mild vision problems. Eventually, it can cause blindness. So\nearly detection of symptoms could help to avoid blindness. In this paper, we\npresent some experiments on some features of diabetic retinopathy, like\nproperties of exudates, properties of blood vessels and properties of\nmicroaneurysm. 
Using the features, we can classify healthy, mild\nnon-proliferative, moderate non-proliferative, severe non-proliferative and\nproliferative stages of DR. Support Vector Machine, Random Forest and Naive\nBayes classifiers are used to classify the stages. Finally, Random Forest is\nfound to be the best for higher accuracy, sensitivity and specificity of 76.5%,\n77.2% and 93.3% respectively.\n","authors":["Indronil Bhattacharjee"," Al-Mahmud","Tareq Mahmud"],"pdf_url":"https://arxiv.org/pdf/2412.02265v1.pdf","comment":"5 pages, 9 figures, 2 tables. International Conference on Advanced\n Engineering, Technology and Applications (ICAETA-2021), Istanbul, Turkey"},{"id":"http://arxiv.org/abs/2412.02264v1","updated":"2024-12-03T08:37:27Z","published":"2024-12-03T08:37:27Z","title":"Technical Report on Reinforcement Learning Control on the Lucas-Nülle\n Inverted Pendulum","summary":" The discipline of automatic control is making increased use of concepts that\noriginate from the domain of machine learning. Herein, reinforcement learning\n(RL) takes an elevated role, as it is inherently designed for sequential\ndecision making, and can be applied to optimal control problems without the\nneed for a plant system model. To advance education of control engineers and\noperators in this field, this contribution targets an RL framework that can be\napplied to educational hardware provided by the Lucas-N\\\"ulle company.\nSpecifically, the goal of inverted pendulum control is pursued by means of RL,\nincluding both, swing-up and stabilization within a single holistic design\napproach. Herein, the actual learning is enabled by separating corresponding\ncomputations from the real-time control computer and outsourcing them to a\ndifferent hardware. This distributed architecture, however, necessitates\ncommunication of the involved components, which is realized via CAN bus. 
The\nexperimental proof of concept is presented with an applied safeguarding\nalgorithm that prevents the plant from being operated harmfully during the\ntrial-and-error training phase.\n","authors":["Maximilian Schenke","Shalbus Bukarov"],"pdf_url":"https://arxiv.org/pdf/2412.02264v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02262v1","updated":"2024-12-03T08:34:42Z","published":"2024-12-03T08:34:42Z","title":"Composing Open-domain Vision with RAG for Ocean Monitoring and\n Conservation","summary":" Climate change's destruction of marine biodiversity is threatening\ncommunities and economies around the world which rely on healthy oceans for\ntheir livelihoods. The challenge of applying computer vision to niche,\nreal-world domains such as ocean conservation lies in the dynamic and diverse\nenvironments where traditional top-down learning struggle with long-tailed\ndistributions, generalization, and domain transfer. Scalable species\nidentification for ocean monitoring is particularly difficult due to the need\nto adapt models to new environments and identify rare or unseen species. To\novercome these limitations, we propose leveraging bottom-up, open-domain\nlearning frameworks as a resilient, scalable solution for image and video\nanalysis in marine applications. Our preliminary demonstration uses pretrained\nvision-language models (VLMs) combined with retrieval-augmented generation\n(RAG) as grounding, leaving the door open for numerous architectural, training\nand engineering optimizations. We validate this approach through a preliminary\napplication in classifying fish from video onboard fishing vessels,\ndemonstrating impressive emergent retrieval and prediction capabilities without\ndomain-specific training or knowledge of the task itself.\n","authors":["Sepand Dyanatkar","Angran Li","Alexander Dungate"],"pdf_url":"https://arxiv.org/pdf/2412.02262v1.pdf","comment":"Accepted to Climate Change AI Workshop at NeurIPS 2024. 
9 pages, 6\n figures, 1 table"},{"id":"http://arxiv.org/abs/2410.13637v2","updated":"2024-12-03T08:29:54Z","published":"2024-10-17T15:07:56Z","title":"Normalizing self-supervised learning for provably reliable Change Point\n Detection","summary":" Change point detection (CPD) methods aim to identify abrupt shifts in the\ndistribution of input data streams. Accurate estimators for this task are\ncrucial across various real-world scenarios. Yet, traditional unsupervised CPD\ntechniques face significant limitations, often relying on strong assumptions or\nsuffering from low expressive power due to inherent model simplicity. In\ncontrast, representation learning methods overcome these drawbacks by offering\nflexibility and the ability to capture the full complexity of the data without\nimposing restrictive assumptions. However, these approaches are still emerging\nin the CPD field and lack robust theoretical foundations to ensure their\nreliability. Our work addresses this gap by integrating the expressive power of\nrepresentation learning with the groundedness of traditional CPD techniques. We\nadopt spectral normalization (SN) for deep representation learning in CPD tasks\nand prove that the embeddings after SN are highly informative for CPD. Our\nmethod significantly outperforms current state-of-the-art methods during the\ncomprehensive evaluation via three standard CPD datasets.\n","authors":["Alexandra Bazarova","Evgenia Romanenkova","Alexey Zaytsev"],"pdf_url":"https://arxiv.org/pdf/2410.13637v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02251v1","updated":"2024-12-03T08:28:47Z","published":"2024-12-03T08:28:47Z","title":"Selective Reviews of Bandit Problems in AI via a Statistical View","summary":" Reinforcement Learning (RL) is a widely researched area in artificial\nintelligence that focuses on teaching agents decision-making through\ninteractions with their environment. 
A key subset includes stochastic\nmulti-armed bandit (MAB) and continuum-armed bandit (SCAB) problems, which\nmodel sequential decision-making under uncertainty. This review outlines the\nfoundational models and assumptions of bandit problems, explores non-asymptotic\ntheoretical tools like concentration inequalities and minimax regret bounds,\nand compares frequentist and Bayesian algorithms for managing\nexploration-exploitation trade-offs. We also extend the discussion to $K$-armed\ncontextual bandits and SCAB, examining their methodologies, regret analyses,\nand discussing the relation between the SCAB problems and the functional data\nanalysis. Finally, we highlight recent advances and ongoing challenges in the\nfield.\n","authors":["Pengjie Zhou","Haoyu Wei","Huiming Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.02251v1.pdf","comment":"46 pages, 5 figures,"},{"id":"http://arxiv.org/abs/2406.07522v2","updated":"2024-12-03T08:27:49Z","published":"2024-06-11T17:50:51Z","title":"Samba: Simple Hybrid State Space Models for Efficient Unlimited Context\n Language Modeling","summary":" Efficiently modeling sequences with infinite context length has long been a\nchallenging problem. Previous approaches have either suffered from quadratic\ncomputational complexity or limited extrapolation ability in length\ngeneralization. In this work, we present Samba, a simple hybrid architecture\nthat layer-wise combines Mamba, a selective State Space Model (SSM), with\nSliding Window Attention (SWA). Samba selectively compresses a given sequence\ninto recurrent hidden states while still maintaining the ability to precisely\nrecall recent memories with the attention mechanism. We scale Samba up to 3.8B\nparameters with 3.2T training tokens and demonstrate that it significantly\noutperforms state-of-the-art models across a variety of benchmarks. Pretrained\non sequences of 4K length, Samba shows improved perplexity in context lengths\nof up to 1M in zero-shot. 
When finetuned on 4K-length sequences, Samba\nefficiently extrapolates to a 256K context length with perfect memory recall on\nthe Passkey Retrieval task, and exhibits superior retrieval extrapolation on\nthe challenging Phonebook task compared to full-attention models. As a\nlinear-time sequence model, Samba achieves a 3.73x higher throughput compared\nto Transformers with grouped-query attention for user prompts of 128K length,\nand a 3.64x speedup when generating 64K tokens with unlimited streaming. Our\ncode for training on open source data is publicly available at\nhttps://github.com/microsoft/Samba.\n","authors":["Liliang Ren","Yang Liu","Yadong Lu","Yelong Shen","Chen Liang","Weizhu Chen"],"pdf_url":"https://arxiv.org/pdf/2406.07522v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.16469v2","updated":"2024-12-03T08:18:17Z","published":"2024-03-25T06:50:25Z","title":"Learning from Reduced Labels for Long-Tailed Data","summary":" Long-tailed data is prevalent in real-world classification tasks and heavily\nrelies on supervised information, which makes the annotation process\nexceptionally labor-intensive and time-consuming. Unfortunately, despite being\na common approach to mitigate labeling costs, existing weakly supervised\nlearning methods struggle to adequately preserve supervised information for\ntail samples, resulting in a decline in accuracy for the tail classes. To\nalleviate this problem, we introduce a novel weakly supervised labeling setting\ncalled Reduced Label. The proposed labeling setting not only avoids the decline\nof supervised information for the tail samples, but also decreases the labeling\ncosts associated with long-tailed data. Additionally, we propose an\nstraightforward and highly efficient unbiased framework with strong theoretical\nguarantees to learn from these Reduced Labels. 
Extensive experiments conducted\non benchmark datasets including ImageNet validate the effectiveness of our\napproach, surpassing the performance of state-of-the-art weakly supervised\nmethods.\n","authors":["Meng Wei","Zhongnian Li","Yong Zhou","Xinzheng Xu"],"pdf_url":"https://arxiv.org/pdf/2403.16469v2.pdf","comment":"11 pages, 3 figures"},{"id":"http://arxiv.org/abs/2412.02244v1","updated":"2024-12-03T08:16:59Z","published":"2024-12-03T08:16:59Z","title":"On Simplifying Large-Scale Spatial Vectors: Fast, Memory-Efficient, and\n Cost-Predictable k-means","summary":" The k-means algorithm can simplify large-scale spatial vectors, such as 2D\ngeo-locations and 3D point clouds, to support fast analytics and learning.\nHowever, when processing large-scale datasets, existing k-means algorithms have\nbeen developed to achieve high performance with significant computational\nresources, such as memory and CPU usage time. These algorithms, though\neffective, are not well-suited for resource-constrained devices. In this paper,\nwe propose a fast, memory-efficient, and cost-predictable k-means called\nDask-means. We first accelerate k-means by designing a memory-efficient\naccelerator, which utilizes an optimized nearest neighbor search over a\nmemory-tunable index to assign spatial vectors to clusters in batches. We then\ndesign a lightweight cost estimator to predict the memory cost and runtime of\nthe k-means task, allowing it to request appropriate memory from devices or\nadjust the accelerator's required space to meet memory constraints, and ensure\nsufficient CPU time for running k-means. Experiments show that when simplifying\ndatasets with scale such as $10^6$, Dask-means uses less than $30$MB of memory,\nachieves over $168$ times speedup compared to the widely-used Lloyd's\nalgorithm. We also validate Dask-means on mobile devices, where it demonstrates\nsignificant speedup and low memory cost compared to other state-of-the-art\n(SOTA) k-means algorithms. 
Our cost estimator estimates the memory cost with a\ndifference of less than $3\\%$ from the actual ones and predicts runtime with an\nMSE up to $33.3\\%$ lower than SOTA methods.\n","authors":["Yushuai Ji","Zepeng Liu","Sheng Wang","Yuan Sun","Zhiyong Peng"],"pdf_url":"https://arxiv.org/pdf/2412.02244v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02242v1","updated":"2024-12-03T08:11:06Z","published":"2024-12-03T08:11:06Z","title":"U-Net in Medical Image Segmentation: A Review of Its Applications Across\n Modalities","summary":" Medical imaging is essential in healthcare to provide key insights into\npatient anatomy and pathology, aiding in diagnosis and treatment. Non-invasive\ntechniques such as X-ray, Magnetic Resonance Imaging (MRI), Computed Tomography\n(CT), and Ultrasound (US), capture detailed images of organs, tissues, and\nabnormalities. Effective analysis of these images requires precise segmentation\nto delineate regions of interest (ROI), such as organs or lesions. Traditional\nsegmentation methods, relying on manual feature-extraction, are labor-intensive\nand vary across experts. Recent advancements in Artificial Intelligence (AI)\nand Deep Learning (DL), particularly convolutional models such as U-Net and its\nvariants (U-Net++ and U-Net 3+), have transformed medical image segmentation\n(MIS) by automating the process and enhancing accuracy. These models enable\nefficient, precise pixel-wise classification across various imaging modalities,\novercoming the limitations of manual segmentation. This review explores various\nmedical imaging techniques, examines the U-Net architectures and their\nadaptations, and discusses their application across different modalities. 
It\nalso identifies common challenges in MIS and proposes potential solutions.\n","authors":["Fnu Neha","Deepshikha Bhati","Deepak Kumar Shukla","Sonavi Makarand Dalvi","Nikolaos Mantzou","Safa Shubbar"],"pdf_url":"https://arxiv.org/pdf/2412.02242v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02240v1","updated":"2024-12-03T08:09:06Z","published":"2024-12-03T08:09:06Z","title":"ESA: Example Sieve Approach for Multi-Positive and Unlabeled Learning","summary":" Learning from Multi-Positive and Unlabeled (MPU) data has gradually attracted\nsignificant attention from practical applications. Unfortunately, the risk of\nMPU also suffer from the shift of minimum risk, particularly when the models\nare very flexible as shown in Fig.\\ref{moti}. In this paper, to alleviate the\nshifting of minimum risk problem, we propose an Example Sieve Approach (ESA) to\nselect examples for training a multi-class classifier. Specifically, we sieve\nout some examples by utilizing the Certain Loss (CL) value of each example in\nthe training stage and analyze the consistency of the proposed risk estimator.\nBesides, we show that the estimation error of proposed ESA obtains the optimal\nparametric convergence rate. Extensive experiments on various real-world\ndatasets show the proposed approach outperforms previous methods.\n","authors":["Zhongnian Li","Meng Wei","Peng Ying","Xinzheng Xu"],"pdf_url":"https://arxiv.org/pdf/2412.02240v1.pdf","comment":"12 pages, 6 figures"},{"id":"http://arxiv.org/abs/2406.14026v3","updated":"2024-12-03T08:03:25Z","published":"2024-06-20T06:46:23Z","title":"Demystifying Language Model Forgetting with Low-rank Example\n Associations","summary":" Large Language models (LLMs) suffer from forgetting of upstream data when\nfine-tuned. Despite efforts on mitigating forgetting, few have investigated\nwhether, and how forgotten upstream examples are dependent on and associated\nwith newly learned tasks. 
Insights on such associations enable efficient and\ntargeted mitigation of forgetting. In this paper, we empirically analyze\nforgetting (measured in log-perplexity increase) that occurs in $N$ upstream\nexamples of language modeling or instruction-tuning after fine-tuning LLMs on\none of $M$ new tasks, visualized in $M\\times N$ matrices. We demonstrate that\nthe matrices display simple low-rank patterns, often well-approximated with\nmultiplicative scalar effects of upstream examples and newly learned tasks. We\nalso examine fine-grained associations with visualization and statistics.\nLeveraging the low-rank nature of the associations, we predict forgetting of\nupstream examples when fine-tuning on unseen tasks with matrix completion over\nthe empirical associations. This enables fast identification of most forgotten\nexamples without expensive inference on the entire upstream data. The approach,\ndespite simplicity, outperforms prior approaches that learn semantic\nrelationships of learned tasks and upstream examples with LMs for predicting\nforgetting. We demonstrate the practical utility of our analysis by showing\nstatistically significantly reduced forgetting as we upweight predicted\nexamples for replay at fine-tuning. Project page:\nhttps://inklab.usc.edu/lm-forgetting-prediction/\n","authors":["Xisen Jin","Xiang Ren"],"pdf_url":"https://arxiv.org/pdf/2406.14026v3.pdf","comment":"10 pages; preprint"},{"id":"http://arxiv.org/abs/2412.02230v1","updated":"2024-12-03T08:00:19Z","published":"2024-12-03T08:00:19Z","title":"Learning from Concealed Labels","summary":" Annotating data for sensitive labels (e.g., disease, smoking) poses a\npotential threats to individual privacy in many real-world scenarios. 
To cope\nwith this problem, we propose a novel setting to protect privacy of each\ninstance, namely learning from concealed labels for multi-class classification.\nConcealed labels prevent sensitive labels from appearing in the label set\nduring the label collection stage, which specifies none and some random sampled\ninsensitive labels as concealed labels set to annotate sensitive data. In this\npaper, an unbiased estimator can be established from concealed data under mild\nassumptions, and the learned multi-class classifier can not only classify the\ninstance from insensitive labels accurately but also recognize the instance\nfrom the sensitive labels. Moreover, we bound the estimation error and show\nthat the multi-class classifier achieves the optimal parametric convergence\nrate. Experiments demonstrate the significance and effectiveness of the\nproposed method for concealed labels in synthetic and real-world datasets.\n","authors":["Zhongnian Li","Meng Wei","Peng Ying","Tongfeng Sun","Xinzheng Xu"],"pdf_url":"https://arxiv.org/pdf/2412.02230v1.pdf","comment":"12 pages, 2 figures"},{"id":"http://arxiv.org/abs/2412.02228v1","updated":"2024-12-03T07:51:14Z","published":"2024-12-03T07:51:14Z","title":"BANER: Boundary-Aware LLMs for Few-Shot Named Entity Recognition","summary":" Despite the recent success of two-stage prototypical networks in few-shot\nnamed entity recognition (NER), challenges such as over/under-detected false\nspans in the span detection stage and unaligned entity prototypes in the type\nclassification stage persist. Additionally, LLMs have not proven to be\neffective few-shot information extractors in general. In this paper, we propose\nan approach called Boundary-Aware LLMs for Few-Shot Named Entity Recognition to\naddress these issues. We introduce a boundary-aware contrastive learning\nstrategy to enhance the LLM's ability to perceive entity boundaries for\ngeneralized entity spans. 
Additionally, we utilize LoRAHub to align information\nfrom the target domain to the source domain, thereby enhancing adaptive\ncross-domain classification capabilities. Extensive experiments across various\nbenchmarks demonstrate that our framework outperforms prior methods, validating\nits effectiveness. In particular, the proposed strategies demonstrate\neffectiveness across a range of LLM architectures. The code and data are\nreleased on https://github.com/UESTC-GQJ/BANER.\n","authors":["Quanjiang Guo","Yihong Dong","Ling Tian","Zhao Kang","Yu Zhang","Sijie Wang"],"pdf_url":"https://arxiv.org/pdf/2412.02228v1.pdf","comment":"Appear on COLING 2025"},{"id":"http://arxiv.org/abs/2403.08978v2","updated":"2024-12-03T07:36:47Z","published":"2024-03-13T22:06:03Z","title":"AutoGuide: Automated Generation and Selection of Context-Aware\n Guidelines for Large Language Model Agents","summary":" Recent advances in large language models (LLMs) have empowered AI agents\ncapable of performing various sequential decision-making tasks. However,\neffectively guiding LLMs to perform well in unfamiliar domains like web\nnavigation, where they lack sufficient knowledge, has proven to be difficult\nwith the demonstration-based in-context learning paradigm. In this paper, we\nintroduce a novel framework, called AutoGuide, which addresses this limitation\nby automatically generating context-aware guidelines from offline experiences.\nImportantly, each context-aware guideline is expressed in concise natural\nlanguage and follows a conditional structure, clearly describing the context\nwhere it is applicable. As a result, our guidelines facilitate the provision of\nrelevant knowledge for the agent's current decision-making process, overcoming\nthe limitations of the conventional demonstration-based learning paradigm. 
Our\nevaluation demonstrates that AutoGuide significantly outperforms competitive\nbaselines in complex benchmark domains, including real-world web navigation.\n","authors":["Yao Fu","Dong-Ki Kim","Jaekyeom Kim","Sungryull Sohn","Lajanugen Logeswaran","Kyunghoon Bae","Honglak Lee"],"pdf_url":"https://arxiv.org/pdf/2403.08978v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02220v1","updated":"2024-12-03T07:25:30Z","published":"2024-12-03T07:25:30Z","title":"Unlocking Tuning-Free Few-Shot Adaptability in Visual Foundation Models\n by Recycling Pre-Tuned LoRAs","summary":" Large Language Models (LLMs) such as ChatGPT demonstrate strong few-shot\nadaptability without requiring fine-tuning, positioning them ideal for\ndata-limited and real-time applications. However, this adaptability has not yet\nbeen replicated in current Visual Foundation Models (VFMs), which require\nexplicit fine-tuning with sufficient tuning data. Besides, the\npretraining-finetuning paradigm has led to the surge of numerous task-specific\nmodular components, such as Low-Rank Adaptation (LoRA). For the first time, we\nexplore the potential of reusing diverse pre-tuned LoRAs without accessing\ntheir original training data, to achieve tuning-free few-shot adaptation in\nVFMs. Our framework, LoRA Recycle, distills a meta-LoRA from diverse pre-tuned\nLoRAs with a meta-learning objective, using surrogate data generated inversely\nfrom pre-tuned LoRAs themselves. The VFM, once equipped with the meta-LoRA, is\nempowered to solve new few-shot tasks in a single forward pass, akin to the\nin-context learning of LLMs. 
Additionally, we incorporate a double-efficient\nmechanism tailored to our framework, significantly accelerating the\nmeta-training process while maintaining or even improving performance.\nExtensive experiments across various few-shot classification benchmarks across\nboth in- and cross-domain scenarios demonstrate the superiority of our\nframework.\n","authors":["Zixuan Hu","Yongxian Wei","Li Shen","Chun Yuan","Dacheng Tao"],"pdf_url":"https://arxiv.org/pdf/2412.02220v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.08830v2","updated":"2024-12-03T07:23:25Z","published":"2024-06-13T05:49:29Z","title":"Center-Sensitive Kernel Optimization for Efficient On-Device Incremental\n Learning","summary":" To facilitate the evolution of edge intelligence in ever-changing\nenvironments, we study on-device incremental learning constrained in limited\ncomputation resource in this paper. Current on-device training methods just\nfocus on efficient training without considering the catastrophic forgetting,\npreventing the model getting stronger when continually exploring the world. To\nsolve this problem, a direct solution is to involve the existing incremental\nlearning mechanisms into the on-device training framework. Unfortunately, such\na manner cannot work well as those mechanisms usually introduce large\nadditional computational cost to the network optimization process, which would\ninevitably exceed the memory capacity of the edge devices. To address this\nissue, this paper makes an early effort to propose a simple but effective\nedge-friendly incremental learning framework. Based on an empirical study on\nthe knowledge intensity of the kernel elements of the neural network, we find\nthat the center kernel is the key for maximizing the knowledge intensity for\nlearning new data, while freezing the other kernel elements would get a good\nbalance on the model's capacity for overcoming catastrophic forgetting. 
Upon\nthis finding, we further design a center-sensitive kernel optimization\nframework to largely alleviate the cost of the gradient computation and\nback-propagation. Besides, a dynamic channel element selection strategy is also\nproposed to facilitate a sparse orthogonal gradient projection for further\nreducing the optimization complexity, upon the knowledge explored from the new\ntask data. Extensive experiments validate our method is efficient and\neffective, e.g., our method achieves average accuracy boost of 38.08% with even\nless memory and approximate computation compared to existing on-device training\nmethods, indicating its significant potential for on-device incremental\nlearning.\n","authors":["Dingwen Zhang","Yan Li","De Cheng","Nannan Wang","Junwei Han"],"pdf_url":"https://arxiv.org/pdf/2406.08830v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.00156v2","updated":"2024-12-03T07:18:25Z","published":"2024-11-29T08:10:49Z","title":"VISION-XL: High Definition Video Inverse Problem Solver using Latent\n Image Diffusion Models","summary":" In this paper, we propose a novel framework for solving high-definition video\ninverse problems using latent image diffusion models. Building on recent\nadvancements in spatio-temporal optimization for video inverse problems using\nimage diffusion models, our approach leverages latent-space diffusion models to\nachieve enhanced video quality and resolution. To address the high\ncomputational demands of processing high-resolution frames, we introduce a\npseudo-batch consistent sampling strategy, allowing efficient operation on a\nsingle GPU. Additionally, to improve temporal consistency, we present\nbatch-consistent inversion, an initialization technique that incorporates\ninformative latents from the measurement frame. 
By integrating with SDXL, our\nframework achieves state-of-the-art video reconstruction across a wide range of\nspatio-temporal inverse problems, including complex combinations of frame\naveraging and various spatial degradations, such as deblurring,\nsuper-resolution, and inpainting. Unlike previous methods, our approach\nsupports multiple aspect ratios (landscape, vertical, and square) and delivers\nHD-resolution reconstructions (exceeding 1280x720) in under 2.5 minutes on a\nsingle NVIDIA 4090 GPU.\n","authors":["Taesung Kwon","Jong Chul Ye"],"pdf_url":"https://arxiv.org/pdf/2412.00156v2.pdf","comment":"Project page: https://vision-xl.github.io/"},{"id":"http://arxiv.org/abs/2412.02215v1","updated":"2024-12-03T07:11:21Z","published":"2024-12-03T07:11:21Z","title":"Recovering implicit physics model under real-world constraints","summary":" Recovering a physics-driven model, i.e. a governing set of equations of the\nunderlying dynamical systems, from the real-world data has been of recent\ninterest. Most existing methods either operate on simulation data with\nunrealistically high sampling rates or require explicit measurements of all\nsystem variables, which is not amenable in real-world deployments. Moreover,\nthey assume the timestamps of external perturbations to the physical system are\nknown a priori, without uncertainty, implicitly discounting any sensor\ntime-synchronization or human reporting errors. In this paper, we propose a\nnovel liquid time constant neural network (LTC-NN) based architecture to\nrecover underlying model of physical dynamics from real-world data. 
The\nautomatic differentiation property of LTC-NN nodes overcomes problems\nassociated with low sampling rates, the input dependent time constant in the\nforward pass of the hidden layer of LTC-NN nodes creates a massive search space\nof implicit physical dynamics, the physics model solver based data\nreconstruction loss guides the search for the correct set of implicit dynamics,\nand the use of the dropout regularization in the dense layer ensures extraction\nof the sparsest model. Further, to account for the perturbation timing error,\nwe utilize dense layer nodes to search through input shifts that results in the\nlowest reconstruction loss. Experiments on four benchmark dynamical systems,\nthree with simulation data and one with the real-world data show that the\nLTC-NN architecture is more accurate in recovering implicit physics model\ncoefficients than the state-of-the-art sparse model recovery approaches. We\nalso introduce four additional case studies (total eight) on real-life medical\nexamples in simulation and with real-world clinical data to show effectiveness\nof our approach in recovering underlying model in practice.\n","authors":["Ayan Banerjee","Sandeep K. S. Gupta"],"pdf_url":"https://arxiv.org/pdf/2412.02215v1.pdf","comment":"This paper is published in ECAI 2024,\n https://ebooks.iospress.nl/volumearticle/69651"},{"id":"http://arxiv.org/abs/2404.18247v2","updated":"2024-12-03T07:07:45Z","published":"2024-04-28T17:02:24Z","title":"Classical integrability in the presence of a cosmological constant:\n analytic and machine learning results","summary":" We study the integrability of two-dimensional theories that are obtained by a\ndimensional reduction of certain four-dimensional gravitational theories\ndescribing the coupling of Maxwell fields and neutral scalar fields to gravity\nin the presence of a potential for the neutral scalar fields. 
For a certain\nsolution subspace, we demonstrate partial integrability by showing that a\nsubset of the equations of motion in two dimensions are the compatibility\nconditions for a linear system. Subsequently, we study the integrability of\nthese two-dimensional models from a complementary one-dimensional point of\nview, framed in terms of Liouville integrability. In this endeavour, we employ\nvarious machine learning techniques to systematise our search for numerical Lax\npair matrices for these models, as well as conserved currents expressed as\nfunctions of phase space variables.\n","authors":["Gabriel Lopes Cardoso","Damián Mayorga Peña","Suresh Nampuri"],"pdf_url":"https://arxiv.org/pdf/2404.18247v2.pdf","comment":"38 pages, 9 figures, typographical corrections and assorted\n improvements"},{"id":"http://arxiv.org/abs/2412.02211v1","updated":"2024-12-03T07:04:10Z","published":"2024-12-03T07:04:10Z","title":"An Automated Data Mining Framework Using Autoencoders for Feature\n Extraction and Dimensionality Reduction","summary":" This study proposes an automated data mining framework based on autoencoders\nand experimentally verifies its effectiveness in feature extraction and data\ndimensionality reduction. Through the encoding-decoding structure, the\nautoencoder can capture the data's potential characteristics and achieve noise\nreduction and anomaly detection, providing an efficient and stable solution for\nthe data mining process. The experiment compared the performance of the\nautoencoder with traditional dimensionality reduction methods (such as PCA, FA,\nT-SNE, and UMAP). The results showed that the autoencoder performed best in\nterms of reconstruction error and root mean square error and could better\nretain data structure and enhance the generalization ability of the model. The\nautoencoder-based framework not only reduces manual intervention but also\nsignificantly improves the automation of data processing. 
In the future, with\nthe advancement of deep learning and big data technology, the autoencoder\nmethod combined with a generative adversarial network (GAN) or graph neural\nnetwork (GNN) is expected to be more widely used in the fields of complex data\nprocessing, real-time data analysis and intelligent decision-making.\n","authors":["Yaxin Liang","Xinshi Li","Xin Huang","Ziqi Zhang","Yue Yao"],"pdf_url":"https://arxiv.org/pdf/2412.02211v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.10656v2","updated":"2024-12-03T07:02:05Z","published":"2023-08-21T11:48:34Z","title":"Practical Parallel Algorithms for Non-Monotone Submodular Maximization","summary":" Submodular maximization has found extensive applications in various domains\nwithin the field of artificial intelligence, including but not limited to\nmachine learning, computer vision, and natural language processing. With the\nincreasing size of datasets in these domains, there is a pressing need to\ndevelop efficient and parallelizable algorithms for submodular maximization.\nOne measure of the parallelizability of a submodular maximization algorithm is\nits adaptive complexity, which indicates the number of sequential rounds where\na polynomial number of queries to the objective function can be executed in\nparallel. In this paper, we study the problem of non-monotone submodular\nmaximization subject to a knapsack constraint, and propose the first\ncombinatorial algorithm achieving an $(8+\\epsilon)$-approximation under\n$\\mathcal{O}(\\log n)$ adaptive complexity, which is \\textit{optimal} up to a\nfactor of $\\mathcal{O}(\\log\\log n)$. Moreover, we also propose the first\nalgorithm with both provable approximation ratio and sublinear adaptive\ncomplexity for the problem of non-monotone submodular maximization subject to a\n$k$-system constraint. 
As a by-product, we show that our two algorithms can\nalso be applied to the special case of submodular maximization subject to a\ncardinality constraint, and achieve performance bounds comparable with those of\nstate-of-the-art algorithms. Finally, the effectiveness of our approach is\ndemonstrated by extensive experiments on real-world applications.\n","authors":["Shuang Cui","Kai Han","Jing Tang","Xueying Li","Aakas Zhiyuli","Hanxiao Li"],"pdf_url":"https://arxiv.org/pdf/2308.10656v2.pdf","comment":"Part of the contribution appears in AAAI-2023"},{"id":"http://arxiv.org/abs/2112.04948v2","updated":"2024-12-03T07:00:13Z","published":"2021-12-09T14:26:13Z","title":"Guardian of the Ensembles: Introducing Pairwise Adversarially Robust\n Loss for Resisting Adversarial Attacks in DNN Ensembles","summary":" Adversarial attacks rely on transferability, where an adversarial example\n(AE) crafted on a surrogate classifier tends to mislead a target classifier.\nRecent ensemble methods demonstrate that AEs are less likely to mislead\nmultiple classifiers in an ensemble. This paper proposes a new ensemble\ntraining using a Pairwise Adversarially Robust Loss (PARL) that by construction\nproduces an ensemble of classifiers with diverse decision boundaries. PARL\nutilizes outputs and gradients of each layer with respect to network parameters\nin every classifier within the ensemble simultaneously. PARL is demonstrated to\nachieve higher robustness against black-box transfer attacks than previous\nensemble methods as well as adversarial training without adversely affecting\nclean example accuracy. Extensive experiments using standard Resnet20,\nWideResnet28-10 classifiers demonstrate the robustness of PARL against\nstate-of-the-art adversarial attacks. 
While maintaining similar clean accuracy\nand lesser training time, the proposed architecture has a 24.8% increase in\nrobust accuracy ($\\epsilon$ = 0.07) from the state-of-the art method.\n","authors":["Shubhi Shukla","Subhadeep Dalui","Manaar Alam","Shubhajit Datta","Arijit Mondal","Debdeep Mukhopadhyay","Partha Pratim Chakrabarti"],"pdf_url":"https://arxiv.org/pdf/2112.04948v2.pdf","comment":"Accepted at IEEE/CVF Winter Conference on Applications of Computer\n Vision (WACV 2025)"},{"id":"http://arxiv.org/abs/2408.17355v3","updated":"2024-12-03T06:53:58Z","published":"2024-08-30T15:39:34Z","title":"Bidirectional Decoding: Improving Action Chunking via Closed-Loop\n Resampling","summary":" Predicting and executing a sequence of actions without intermediate\nreplanning, known as action chunking, is increasingly used in robot learning\nfrom human demonstrations. Yet, its reported effects on the learned policy are\ninconsistent: some studies find it crucial for achieving strong results, while\nothers observe decreased performance. In this paper, we first dissect how\naction chunking impacts the divergence between a learner and a demonstrator. We\nfind that action chunking allows the learner to better capture the temporal\ndependencies in demonstrations but at the cost of reduced reactivity in\nstochastic environments. To address this tradeoff, we propose Bidirectional\nDecoding (BID), a test-time inference algorithm that bridges action chunking\nwith closed-loop operations. BID samples multiple predictions at each time step\nand searches for the optimal one based on two criteria: (i) backward coherence,\nwhich favors samples that align with previous decisions; (ii) forward contrast,\nwhich seeks samples of high likelihood for future plans. By coupling decisions\nwithin and across action chunks, BID promotes consistency over time while\nmaintaining reactivity to unexpected changes. 
Experimental results show that\nBID boosts the performance of two state-of-the-art generative policies across\nseven simulation benchmarks and two real-world tasks. Code and videos are\navailable at https://bid-robot.github.io.\n","authors":["Yuejiang Liu","Jubayer Ibn Hamid","Annie Xie","Yoonho Lee","Maximilian Du","Chelsea Finn"],"pdf_url":"https://arxiv.org/pdf/2408.17355v3.pdf","comment":"Project website: https://bid-robot.github.io/"},{"id":"http://arxiv.org/abs/2402.10946v3","updated":"2024-12-03T06:52:34Z","published":"2024-02-09T04:02:43Z","title":"CultureLLM: Incorporating Cultural Differences into Large Language\n Models","summary":" Large language models (LLMs) are reported to be partial to certain cultures\nowing to the training data dominance from the English corpora. Since\nmultilingual cultural data are often expensive to collect, existing efforts\nhandle this by prompt engineering or culture-specific pre-training. However,\nthey might overlook the knowledge deficiency of low-resource culture and\nrequire extensive computing resources. In this paper, we propose CultureLLM, a\ncost-effective solution to incorporate cultural differences into LLMs.\nCultureLLM adopts World Value Survey (WVS) as seed data and generates\nsemantically equivalent training data via the proposed semantic data\naugmentation. Using only 50 seed samples from WVS with augmented data, we\nfine-tune culture-specific LLMs and one unified model (CultureLLM-One) for 9\ncultures covering rich and low-resource languages. Extensive experiments on 60\nculture-related datasets demonstrate that CultureLLM significantly outperforms\nvarious counterparts such as GPT-3.5 (by 8.1%) and Gemini Pro (by 9.5%) with\ncomparable performance to GPT-4 or even better. Our human study shows that the\ngenerated samples are semantically equivalent to the original samples,\nproviding an effective solution for LLMs augmentation. 
Code is released at\nhttps://github.com/Scarelette/CultureLLM.\n","authors":["Cheng Li","Mengzhou Chen","Jindong Wang","Sunayana Sitaram","Xing Xie"],"pdf_url":"https://arxiv.org/pdf/2402.10946v3.pdf","comment":"NeurIPS 2024; Code is at https://github.com/Scarelette/CultureLLM"},{"id":"http://arxiv.org/abs/2409.18169v5","updated":"2024-12-03T06:52:11Z","published":"2024-09-26T17:55:22Z","title":"Harmful Fine-tuning Attacks and Defenses for Large Language Models: A\n Survey","summary":" Recent research demonstrates that the nascent fine-tuning-as-a-service\nbusiness model exposes serious safety concerns -- fine-tuning over a few\nharmful data uploaded by the users can compromise the safety alignment of the\nmodel. The attack, known as harmful fine-tuning attack, has raised a broad\nresearch interest among the community. However, as the attack is still new,\n\\textbf{we observe that there are general misunderstandings within the research\ncommunity.} To clear up concern, this paper provide a comprehensive overview to\nthree aspects of harmful fine-tuning: attacks setting, defense design and\nevaluation methodology. Specifically, we first present the threat model of the\nproblem, and introduce the harmful fine-tuning attack and its variants. Then we\nsystematically survey the existing literature on attacks/defenses/mechanical\nanalysis of the problem. Finally, we introduce the evaluation methodology and\noutline future research directions that might contribute to the development of\nthe field. Additionally, we present a list of questions of interest, which\nmight be useful to refer to when reviewers in the peer review process question\nthe realism of the experiment/attack/defense setting. 
A curated list of\nrelevant papers is maintained and made accessible at:\nhttps://github.com/git-disl/awesome_LLM-harmful-fine-tuning-papers.\n","authors":["Tiansheng Huang","Sihao Hu","Fatih Ilhan","Selim Furkan Tekin","Ling Liu"],"pdf_url":"https://arxiv.org/pdf/2409.18169v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.15143v3","updated":"2024-12-03T06:43:39Z","published":"2024-05-24T01:45:27Z","title":"Intelligent Go-Explore: Standing on the Shoulders of Giant Foundation\n Models","summary":" Go-Explore is a powerful family of algorithms designed to solve\nhard-exploration problems built on the principle of archiving discovered\nstates, and iteratively returning to and exploring from the most promising\nstates. This approach has led to superhuman performance across a wide variety\nof challenging problems including Atari games and robotic control, but requires\nmanually designing heuristics to guide exploration (i.e., determine which\nstates to save and explore from, and what actions to consider next), which is\ntime-consuming and infeasible in general. To resolve this, we propose\nIntelligent Go-Explore (IGE) which greatly extends the scope of the original\nGo-Explore by replacing these handcrafted heuristics with the intelligence and\ninternalized human notions of interestingness captured by giant pretrained\nfoundation models (FMs). This provides IGE with a human-like ability to\ninstinctively identify how interesting or promising any new state is (e.g.,\ndiscovering new objects, locations, or behaviors), even in complex environments\nwhere heuristics are hard to define. Moreover, IGE offers the exciting\nopportunity to recognize and capitalize on serendipitous discoveries-states\nencountered during exploration that are valuable in terms of exploration, yet\nwhere what makes them interesting was not anticipated by the human user. We\nevaluate our algorithm on a diverse range of language and vision-based tasks\nthat require search and exploration. 
Across these tasks, IGE strongly exceeds\nclassic reinforcement learning and graph search baselines, and also succeeds\nwhere prior state-of-the-art FM agents like Reflexion completely fail. Overall,\nIntelligent Go-Explore combines the tremendous strengths of FMs and the\npowerful Go-Explore algorithm, opening up a new frontier of research into\ncreating more generally capable agents with impressive exploration\ncapabilities.\n","authors":["Cong Lu","Shengran Hu","Jeff Clune"],"pdf_url":"https://arxiv.org/pdf/2405.15143v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02196v1","updated":"2024-12-03T06:21:35Z","published":"2024-12-03T06:21:35Z","title":"SA-GNAS: Seed Architecture Expansion for Efficient Large-scale Graph\n Neural Architecture Search","summary":" GNAS (Graph Neural Architecture Search) has demonstrated great effectiveness\nin automatically designing the optimal graph neural architectures for multiple\ndownstream tasks, such as node classification and link prediction. However,\nmost existing GNAS methods cannot efficiently handle large-scale graphs\ncontaining more than million-scale nodes and edges due to the expensive\ncomputational and memory overhead. To scale GNAS on large graphs while\nachieving better performance, we propose SA-GNAS, a novel framework based on\nseed architecture expansion for efficient large-scale GNAS. Similar to the cell\nexpansion in biotechnology, we first construct a seed architecture and then\nexpand the seed architecture iteratively. Specifically, we first propose a\nperformance ranking consistency-based seed architecture selection method, which\nselects the architecture searched on the subgraph that best matches the\noriginal large-scale graph. Then, we propose an entropy minimization-based seed\narchitecture expansion method to further improve the performance of the seed\narchitecture. 
Extensive experimental results on five large-scale graphs\ndemonstrate that the proposed SA-GNAS outperforms human-designed\nstate-of-the-art GNN architectures and existing graph NAS methods. Moreover,\nSA-GNAS can significantly reduce the search time, showing better search\nefficiency. For the largest graph with billion edges, SA-GNAS can achieve 2.8\ntimes speedup compared to the SOTA large-scale GNAS method GAUSS. Additionally,\nsince SA-GNAS is inherently parallelized, the search efficiency can be further\nimproved with more GPUs. SA-GNAS is available at\nhttps://github.com/PasaLab/SAGNAS.\n","authors":["Guanghui Zhu","Zipeng Ji","Jingyan Chen","Limin Wang","Chunfeng Yuan","Yihua Huang"],"pdf_url":"https://arxiv.org/pdf/2412.02196v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02187v1","updated":"2024-12-03T05:59:34Z","published":"2024-12-03T05:59:34Z","title":"Deep Learning, Machine Learning, Advancing Big Data Analytics and\n Management","summary":" Advancements in artificial intelligence, machine learning, and deep learning\nhave catalyzed the transformation of big data analytics and management into\npivotal domains for research and application. This work explores the\ntheoretical foundations, methodological advancements, and practical\nimplementations of these technologies, emphasizing their role in uncovering\nactionable insights from massive, high-dimensional datasets. The study presents\na systematic overview of data preprocessing techniques, including data\ncleaning, normalization, integration, and dimensionality reduction, to prepare\nraw data for analysis. Core analytics methodologies such as classification,\nclustering, regression, and anomaly detection are examined, with a focus on\nalgorithmic innovation and scalability. 
Furthermore, the text delves into\nstate-of-the-art frameworks for data mining and predictive modeling,\nhighlighting the role of neural networks, support vector machines, and ensemble\nmethods in tackling complex analytical challenges. Special emphasis is placed\non the convergence of big data with distributed computing paradigms, including\ncloud and edge computing, to address challenges in storage, computation, and\nreal-time analytics. The integration of ethical considerations, including data\nprivacy and compliance with global standards, ensures a holistic perspective on\ndata management. Practical applications across healthcare, finance, marketing,\nand policy-making illustrate the real-world impact of these technologies.\nThrough comprehensive case studies and Python-based implementations, this work\nequips researchers, practitioners, and data enthusiasts with the tools to\nnavigate the complexities of modern data analytics. It bridges the gap between\ntheory and practice, fostering the development of innovative solutions for\nmanaging and leveraging data in the era of artificial intelligence.\n","authors":["Weiche Hsieh","Ziqian Bi","Keyu Chen","Benji Peng","Sen Zhang","Jiawei Xu","Jinlang Wang","Caitlyn Heqi Yin","Yichao Zhang","Pohsun Feng","Yizhu Wen","Tianyang Wang","Ming Li","Chia Xin Liang","Jintao Ren","Qian Niu","Silin Chen","Lawrence K. Q. 
Yan","Han Xu","Hong-Ming Tseng","Xinyuan Song","Bowen Jing","Junjie Yang","Junhao Song","Junyu Liu","Ming Liu"],"pdf_url":"https://arxiv.org/pdf/2412.02187v1.pdf","comment":"174 pages"},{"id":"http://arxiv.org/abs/2410.15876v3","updated":"2024-12-03T05:59:09Z","published":"2024-10-21T10:57:45Z","title":"FlickerFusion: Intra-trajectory Domain Generalizing Multi-Agent RL","summary":" Multi-agent reinforcement learning has demonstrated significant potential in\naddressing complex cooperative tasks across various real-world applications.\nHowever, existing MARL approaches often rely on the restrictive assumption that\nthe number of entities (e.g., agents, obstacles) remains constant between\ntraining and inference. This overlooks scenarios where entities are dynamically\nremoved or added during the inference trajectory -- a common occurrence in\nreal-world environments like search and rescue missions and dynamic combat\nsituations. In this paper, we tackle the challenge of intra-trajectory dynamic\nentity composition under zero-shot out-of-domain (OOD) generalization, where\nsuch dynamic changes cannot be anticipated beforehand. Our empirical studies\nreveal that existing MARL methods suffer significant performance degradation\nand increased uncertainty in these scenarios. In response, we propose\nFlickerFusion, a novel OOD generalization method that acts as a universally\napplicable augmentation technique for MARL backbone methods. FlickerFusion\nstochastically drops out parts of the observation space, emulating being\nin-domain when inferenced OOD. The results show that FlickerFusion not only\nachieves superior inference rewards but also uniquely reduces uncertainty\nvis-\\`a-vis the backbone, compared to existing methods. 
Benchmarks,\nimplementations, and model weights are organized and open-sourced at\nflickerfusion305.github.io, accompanied by ample demo video renderings.\n","authors":["Woosung Koh","Wonbeen Oh","Siyeol Kim","Suhin Shin","Hyeongjin Kim","Jaein Jang","Junghyun Lee","Se-Young Yun"],"pdf_url":"https://arxiv.org/pdf/2410.15876v3.pdf","comment":"NeurIPS '24 Open-World Agents Workshop"},{"id":"http://arxiv.org/abs/2412.01650v2","updated":"2024-12-03T05:46:35Z","published":"2024-12-02T15:59:35Z","title":"Privacy-Preserving Federated Learning via Homomorphic Adversarial\n Networks","summary":" Privacy-preserving federated learning (PPFL) aims to train a global model for\nmultiple clients while maintaining their data privacy. However, current PPFL\nprotocols exhibit one or more of the following insufficiencies: considerable\ndegradation in accuracy, the requirement for sharing keys, and cooperation\nduring the key generation or decryption processes. As a mitigation, we develop\nthe first protocol that utilizes neural networks to implement PPFL, as well as\nincorporating an Aggregatable Hybrid Encryption scheme tailored to the needs of\nPPFL. We name these networks as Homomorphic Adversarial Networks (HANs) which\ndemonstrate that neural networks are capable of performing tasks similar to\nmulti-key homomorphic encryption (MK-HE) while solving the problems of key\ndistribution and collaborative decryption. Our experiments show that HANs are\nrobust against privacy attacks. Compared with non-private federated learning,\nexperiments conducted on multiple datasets demonstrate that HANs exhibit a\nnegligible accuracy loss (at most 1.35%). 
Compared to traditional MK-HE\nschemes, HANs increase encryption aggregation speed by 6,075 times while\nincurring a 29.2 times increase in communication overhead.\n","authors":["Wenhan Dong","Chao Lin","Xinlei He","Xinyi Huang","Shengmin Xu"],"pdf_url":"https://arxiv.org/pdf/2412.01650v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02181v1","updated":"2024-12-03T05:35:44Z","published":"2024-12-03T05:35:44Z","title":"Generalizing Weisfeiler-Lehman Kernels to Subgraphs","summary":" Subgraph representation learning has been effective in solving various\nreal-world problems. However, current graph neural networks (GNNs) produce\nsuboptimal results for subgraph-level tasks due to their inability to capture\ncomplex interactions within and between subgraphs. To provide a more expressive\nand efficient alternative, we propose WLKS, a Weisfeiler-Lehman (WL) kernel\ngeneralized for subgraphs by applying the WL algorithm on induced $k$-hop\nneighborhoods. We combine kernels across different $k$-hop levels to capture\nricher structural information that is not fully encoded in existing models. Our\napproach can balance expressiveness and efficiency by eliminating the need for\nneighborhood sampling. In experiments on eight real-world and synthetic\nbenchmarks, WLKS significantly outperforms leading approaches on five datasets\nwhile reducing training time, ranging from 0.01x to 0.25x compared to the\nstate-of-the-art.\n","authors":["Dongkwan Kim","Alice Oh"],"pdf_url":"https://arxiv.org/pdf/2412.02181v1.pdf","comment":"15 pages"},{"id":"http://arxiv.org/abs/2408.17151v2","updated":"2024-12-03T05:27:59Z","published":"2024-08-30T09:40:52Z","title":"Investigating Privacy Leakage in Dimensionality Reduction Methods via\n Reconstruction Attack","summary":" This study investigates privacy leakage in dimensionality reduction methods\nthrough a novel machine learning-based reconstruction attack. 
Employing an\ninformed adversary threat model, we develop a neural network capable of\nreconstructing high-dimensional data from low-dimensional embeddings.\n We evaluate six popular dimensionality reduction techniques: PCA, sparse\nrandom projection (SRP), multidimensional scaling (MDS), Isomap, t-SNE, and\nUMAP. Using both MNIST and NIH Chest X-ray datasets, we perform a qualitative\nanalysis to identify key factors affecting reconstruction quality. Furthermore,\nwe assess the effectiveness of an additive noise mechanism in mitigating these\nreconstruction attacks. Our experimental results on both datasets reveal that\nthe attack is effective against deterministic methods (PCA and Isomap), but\nineffective against methods that employ random initialization (SRP, MDS, t-SNE\nand UMAP). When adding the images with large noises before performing PCA or\nIsomap, the attack produced severely distorted reconstructions. In contrast,\nfor the other four methods, the reconstructions still show some recognizable\nfeatures, though they bear little resemblance to the original images.\n","authors":["Chayadon Lumbut","Donlapark Ponnoprat"],"pdf_url":"https://arxiv.org/pdf/2408.17151v2.pdf","comment":"Major revision"},{"id":"http://arxiv.org/abs/2412.02175v1","updated":"2024-12-03T05:20:05Z","published":"2024-12-03T05:20:05Z","title":"Improved Complexity for Smooth Nonconvex Optimization: A Two-Level\n Online Learning Approach with Quasi-Newton Methods","summary":" We study the problem of finding an $\\epsilon$-first-order stationary point\n(FOSP) of a smooth function, given access only to gradient information. The\nbest-known gradient query complexity for this task, assuming both the gradient\nand Hessian of the objective function are Lipschitz continuous, is\n${O}(\\epsilon^{-7/4})$. 
In this work, we propose a method with a gradient\ncomplexity of ${O}(d^{1/4}\\epsilon^{-13/8})$, where $d$ is the problem\ndimension, leading to an improved complexity when $d = {O}(\\epsilon^{-1/2})$.\nTo achieve this result, we design an optimization algorithm that, underneath,\ninvolves solving two online learning problems. Specifically, we first\nreformulate the task of finding a stationary point for a nonconvex problem as\nminimizing the regret in an online convex optimization problem, where the loss\nis determined by the gradient of the objective function. Then, we introduce a\nnovel optimistic quasi-Newton method to solve this online learning problem,\nwith the Hessian approximation update itself framed as an online learning\nproblem in the space of matrices. Beyond improving the complexity bound for\nachieving an $\\epsilon$-FOSP using a gradient oracle, our result provides the\nfirst guarantee suggesting that quasi-Newton methods can potentially outperform\ngradient descent-type methods in nonconvex settings.\n","authors":["Ruichen Jiang","Aryan Mokhtari","Francisco Patitucci"],"pdf_url":"https://arxiv.org/pdf/2412.02175v1.pdf","comment":"35 pages"},{"id":"http://arxiv.org/abs/2403.12820v3","updated":"2024-12-03T05:15:15Z","published":"2024-03-19T15:21:00Z","title":"A Physics-embedded Deep Learning Framework for Cloth Simulation","summary":" Delicate cloth simulations have long been desired in computer graphics.\nVarious methods were proposed to improve engaged force interactions, collision\nhandling, and numerical integrations. Deep learning has the potential to\nachieve fast and real-time simulation, but common neural network structures\noften demand many parameters to capture cloth dynamics. This paper proposes a\nphysics-embedded learning framework that directly encodes physical features of\ncloth simulation. 
The convolutional neural network is used to represent spatial\ncorrelations of the mass-spring system, after which three branches are designed\nto learn linear, nonlinear, and time derivate features of cloth physics. The\nframework can also integrate with other external forces and collision handling\nthrough either traditional simulators or sub neural networks. The model is\ntested across different cloth animation cases, without training with new data.\nAgreement with baselines and predictive realism successfully validate its\ngeneralization ability. Inference efficiency of the proposed model also defeats\ntraditional physics simulation. This framework is also designed to easily\nintegrate with other visual refinement techniques like wrinkle carving, which\nleaves significant chances to incorporate prevailing macing learning techniques\nin 3D cloth amination.\n","authors":["Zhiwei Zhao"],"pdf_url":"https://arxiv.org/pdf/2403.12820v3.pdf","comment":"updated version"},{"id":"http://arxiv.org/abs/2412.01253v2","updated":"2024-12-03T04:51:10Z","published":"2024-12-02T08:22:56Z","title":"Yi-Lightning Technical Report","summary":" This technical report presents Yi-Lightning, our latest flagship large\nlanguage model (LLM). It achieves exceptional performance, ranking 6th overall\non Chatbot Arena, with particularly strong results (2nd to 4th place) in\nspecialized categories including Chinese, Math, Coding, and Hard Prompts.\nYi-Lightning leverages an enhanced Mixture-of-Experts (MoE) architecture,\nfeaturing advanced expert segmentation and routing mechanisms coupled with\noptimized KV-caching techniques. 
Our development process encompasses\ncomprehensive pre-training, supervised fine-tuning (SFT), and reinforcement\nlearning from human feedback (RLHF), where we devise deliberate strategies for\nmulti-stage training, synthetic data construction, and reward modeling.\nFurthermore, we implement RAISE (Responsible AI Safety Engine), a\nfour-component framework to address safety issues across pre-training,\npost-training, and serving phases. Empowered by our scalable super-computing\ninfrastructure, all these innovations substantially reduce training, deployment\nand inference costs while maintaining high-performance standards. With further\nevaluations on public academic benchmarks, Yi-Lightning demonstrates\ncompetitive performance against top-tier LLMs, while we observe a notable\ndisparity between traditional, static benchmark results and real-world, dynamic\nhuman preferences. This observation prompts a critical reassessment of\nconventional benchmarks' utility in guiding the development of more intelligent\nand powerful AI systems for practical applications. Yi-Lightning is now\navailable through our developer platform at https://platform.lingyiwanwu.com.\n","authors":["01. AI"," :","Alan Wake","Albert Wang","Bei Chen","C. X. 
Lv","Chao Li","Chengen Huang","Chenglin Cai","Chujie Zheng","Daniel Cooper","Ethan Dai","Fan Zhou","Feng Hu","Heng Ji","Howard Qiu","Jiangcheng Zhu","Jun Tian","Katherine Su","Lihuan Zhang","Liying Li","Ming Song","Mou Li","Peng Liu","Qichen Hu","Shawn Wang","Shijun Zhou","Shiyong Li","Tianhang Zhu","Wen Xie","Xiang He","Xiaobo Chen","Xiaohui Hu","Xiaoyi Ren","Xinyao Niu","Yanpeng Li","Yongke Zhao","Yongzhen Luo","Yuchi Xu","Yuxuan Sha","Zhaodong Yan","Zhiyuan Liu","Zirui Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.01253v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01460v2","updated":"2024-12-03T04:48:22Z","published":"2024-12-02T12:54:11Z","title":"A Comprehensive Study of Shapley Value in Data Analytics","summary":" Over the recent years, Shapley value (SV), a solution concept from\ncooperative game theory, has found numerous applications in data analytics\n(DA). This paper provides the first comprehensive study of SV used throughout\nthe DA workflow, which involves three main steps: data fabric, data\nexploration, and result reporting. We summarize existing versatile forms of SV\nused in these steps by a unified definition and clarify the essential\nfunctionalities that SV can provide for data scientists. We categorize the arts\nin this field based on the technical challenges they tackled, which include\ncomputation efficiency, approximation error, privacy preservation, and\nappropriate interpretations. We discuss these challenges and analyze the\ncorresponding solutions. We also implement SVBench, the first open-sourced\nbenchmark for developing SV applications, and conduct experiments on six DA\ntasks to validate our analysis and discussions. 
Based on the qualitative and\nquantitative results, we identify the limitations of current efforts for\napplying SV to DA and highlight the directions of future research and\nengineering.\n","authors":["Hong Lin","Shixin Wan","Zhongle Xie","Ke Chen","Meihui Zhang","Lidan Shou","Gang Chen"],"pdf_url":"https://arxiv.org/pdf/2412.01460v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01654v2","updated":"2024-12-03T04:40:13Z","published":"2024-12-02T16:04:15Z","title":"FSMLP: Modelling Channel Dependencies With Simplex Theory Based\n Multi-Layer Perceptions In Frequency Domain","summary":" Time series forecasting (TSF) plays a crucial role in various domains,\nincluding web data analysis, energy consumption prediction, and weather\nforecasting. While Multi-Layer Perceptrons (MLPs) are lightweight and effective\nfor capturing temporal dependencies, they are prone to overfitting when used to\nmodel inter-channel dependencies. In this paper, we investigate the overfitting\nproblem in channel-wise MLPs using Rademacher complexity theory, revealing that\nextreme values in time series data exacerbate this issue. To mitigate this\nissue, we introduce a novel Simplex-MLP layer, where the weights are\nconstrained within a standard simplex. This strategy encourages the model to\nlearn simpler patterns and thereby reducing overfitting to extreme values.\nBased on the Simplex-MLP layer, we propose a novel \\textbf{F}requency\n\\textbf{S}implex \\textbf{MLP} (FSMLP) framework for time series forecasting,\ncomprising of two kinds of modules: \\textbf{S}implex\n\\textbf{C}hannel-\\textbf{W}ise MLP (SCWM) and \\textbf{F}requency\n\\textbf{T}emporal \\textbf{M}LP (FTM). The SCWM effectively leverages the\nSimplex-MLP to capture inter-channel dependencies, while the FTM is a simple\nyet efficient temporal MLP designed to extract temporal information from the\ndata. 
Our theoretical analysis shows that the upper bound of the Rademacher\nComplexity for Simplex-MLP is lower than that for standard MLPs. Moreover, we\nvalidate our proposed method on seven benchmark datasets, demonstrating\nsignificant improvements in forecasting accuracy and efficiency, while also\nshowcasing superior scalability. Additionally, we demonstrate that Simplex-MLP\ncan improve other methods that use channel-wise MLP to achieve less overfitting\nand improved performance. Code are available\n\\href{https://github.com/FMLYD/FSMLP}{\\textcolor{red}{here}}.\n","authors":["Zhengnan Li","Haoxuan Li","Hao Wang","Jun Fang","Duoyin Li Yunxiao Qin"],"pdf_url":"https://arxiv.org/pdf/2412.01654v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.00218v2","updated":"2024-12-03T04:38:31Z","published":"2024-11-29T19:25:00Z","title":"NüshuRescue: Revitalization of the endangered Nüshu Language with AI","summary":" The preservation and revitalization of endangered and extinct languages is a\nmeaningful endeavor, conserving cultural heritage while enriching fields like\nlinguistics and anthropology. However, these languages are typically\nlow-resource, making their reconstruction labor-intensive and costly. This\nchallenge is exemplified by N\\\"ushu, a rare script historically used by Yao\nwomen in China for self-expression within a patriarchal society. To address\nthis challenge, we introduce N\\\"ushuRescue, an AI-driven framework designed to\ntrain large language models (LLMs) on endangered languages with minimal data.\nN\\\"ushuRescue automates evaluation and expands target corpora to accelerate\nlinguistic revitalization. As a foundational component, we developed NCGold, a\n500-sentence N\\\"ushu-Chinese parallel corpus, the first publicly available\ndataset of its kind. 
Leveraging GPT-4-Turbo, with no prior exposure to N\\\"ushu\nand only 35 short examples from NCGold, N\\\"ushuRescue achieved 48.69\\%\ntranslation accuracy on 50 withheld sentences and generated NCSilver, a set of\n98 newly translated modern Chinese sentences of varying lengths. A sample of\nboth NCGold and NCSilver is included in the Supplementary Materials.\nAdditionally, we developed FastText-based and Seq2Seq models to further support\nresearch on N\\\"ushu. N\\\"ushuRescue provides a versatile and scalable tool for\nthe revitalization of endangered languages, minimizing the need for extensive\nhuman input.\n","authors":["Ivory Yang","Weicheng Ma","Soroush Vosoughi"],"pdf_url":"https://arxiv.org/pdf/2412.00218v2.pdf","comment":"Accepted to COLING 2025"},{"id":"http://arxiv.org/abs/2412.02161v1","updated":"2024-12-03T04:37:28Z","published":"2024-12-03T04:37:28Z","title":"Towards the efficacy of federated prediction for epidemics on networks","summary":" Epidemic prediction is of practical significance in public health, enabling\nearly intervention, resource allocation, and strategic planning. However,\nprivacy concerns often hinder the sharing of health data among institutions,\nlimiting the development of accurate prediction models. In this paper, we\ndevelop a general privacy-preserving framework for node-level epidemic\nprediction on networks based on federated learning (FL). We frame the\nspatio-temporal spread of epidemics across multiple data-isolated subnetworks,\nwhere each node state represents the aggregate epidemic severity within a\ncommunity. Then, both the pure temporal LSTM model and the spatio-temporal\nmodel i.e., Spatio-Temporal Graph Attention Network (STGAT) are proposed to\naddress the federated epidemic prediction. Extensive experiments are conducted\non various epidemic processes using a practical airline network, offering a\ncomprehensive assessment of FL efficacy under diverse scenarios. 
By introducing\nthe efficacy energy metric to measure system robustness under various client\nconfigurations, we systematically explore key factors influencing FL\nperformance, including client numbers, aggregation strategies, graph\npartitioning, missing infectious reports. Numerical results manifest that STGAT\nexcels in capturing spatio-temporal dependencies in dynamic processes whereas\nLSTM performs well in simpler pattern. Moreover, our findings highlight the\nimportance of balancing feature consistency and volume uniformity among\nclients, as well as the prediction dilemma between information richness and\nintrinsic stochasticity of dynamic processes. This study offers practical\ninsights into the efficacy of FL scenario in epidemic management, demonstrates\nthe potential of FL to address broader collective dynamics.\n","authors":["Chengpeng Fu","Tong Li","Hao Chen","Wen Du","Zhidong He"],"pdf_url":"https://arxiv.org/pdf/2412.02161v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01269v2","updated":"2024-12-03T04:37:03Z","published":"2024-12-02T08:35:54Z","title":"CPRM: A LLM-based Continual Pre-training Framework for Relevance\n Modeling in Commercial Search","summary":" Relevance modeling between queries and items stands as a pivotal component in\ncommercial search engines, directly affecting the user experience. Given the\nremarkable achievements of large language models (LLMs) in various natural\nlanguage processing (NLP) tasks, LLM-based relevance modeling is gradually\nbeing adopted within industrial search systems. Nevertheless, foundational LLMs\nlack domain-specific knowledge and do not fully exploit the potential of\nin-context learning. Furthermore, structured item text remains underutilized,\nand there is a shortage in the supply of corresponding queries and background\nknowledge. We thereby propose CPRM (Continual Pre-training for Relevance\nModeling), a framework designed for the continual pre-training of LLMs to\naddress these issues. 
Our CPRM framework includes three modules: 1) employing\nboth queries and multi-field item to jointly pre-train for enhancing domain\nknowledge, 2) applying in-context pre-training, a novel approach where LLMs are\npre-trained on a sequence of related queries or items, and 3) conducting\nreading comprehension on items to produce associated domain knowledge and\nbackground information (e.g., generating summaries and corresponding queries)\nto further strengthen LLMs. Results on offline experiments and online A/B\ntesting demonstrate that our model achieves convincing performance compared to\nstrong baselines.\n","authors":["Kaixin Wu","Yixin Ji","Zeyuan Chen","Qiang Wang","Cunxiang Wang","Hong Liu","Baijun Ji","Jia Xu","Zhongyi Liu","Jinjie Gu","Yuan Zhou","Linjian Mo"],"pdf_url":"https://arxiv.org/pdf/2412.01269v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02159v1","updated":"2024-12-03T04:34:58Z","published":"2024-12-03T04:34:58Z","title":"Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods\n and a New Transcript-Classifier Approach","summary":" Defending large language models against jailbreaks so that they never engage\nin a broadly-defined set of forbidden behaviors is an open problem. In this\npaper, we investigate the difficulty of jailbreak-defense when we only want to\nforbid a narrowly-defined set of behaviors. As a case study, we focus on\npreventing an LLM from helping a user make a bomb. We find that popular\ndefenses such as safety training, adversarial training, and input/output\nclassifiers are unable to fully solve this problem. In pursuit of a better\nsolution, we develop a transcript-classifier defense which outperforms the\nbaseline defenses we test. However, our classifier defense still fails in some\ncircumstances, which highlights the difficulty of jailbreak-defense even in a\nnarrow domain.\n","authors":["Tony T. 
Wang","John Hughes","Henry Sleight","Rylan Schaeffer","Rajashree Agrawal","Fazl Barez","Mrinank Sharma","Jesse Mu","Nir Shavit","Ethan Perez"],"pdf_url":"https://arxiv.org/pdf/2412.02159v1.pdf","comment":"Accepted to the AdvML-Frontiers and SoLaR workshops at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.02155v1","updated":"2024-12-03T04:29:27Z","published":"2024-12-03T04:29:27Z","title":"CausalMob: Causal Human Mobility Prediction with LLMs-derived Human\n Intentions toward Public Events","summary":" Large-scale human mobility exhibits spatial and temporal patterns that can\nassist policymakers in decision making. Although traditional prediction models\nattempt to capture these patterns, they often interfered by non-periodic public\nevents, such as disasters and occasional celebrations. Since regular human\nmobility patterns are heavily affected by these events, estimating their causal\neffects is critical to accurate mobility predictions. Although news articles\nprovide unique perspectives on these events in an unstructured format,\nprocessing is a challenge. In this study, we propose a causality-augmented\nprediction model, called \\textbf{CausalMob}, to analyze the causal effects of\npublic events. We first utilize large language models (LLMs) to extract human\nintentions from news articles and transform them into features that act as\ncausal treatments. Next, the model learns representations of spatio-temporal\nregional covariates from multiple data sources to serve as confounders for\ncausal inference. 
Finally, we present a causal effect estimation framework to\nensure event features remain independent of confounders during prediction.\nBased on large-scale real-world data, the experimental results show that the\nproposed model excels in human mobility prediction, outperforming\nstate-of-the-art models.\n","authors":["Xiaojie Yang","Hangli Ge","Jiawei Wang","Zipei Fan","Renhe Jiang","Ryosuke Shibasaki","Noboru Koshizuka"],"pdf_url":"https://arxiv.org/pdf/2412.02155v1.pdf","comment":"Accepted by KDD 2025"},{"id":"http://arxiv.org/abs/2412.02154v1","updated":"2024-12-03T04:28:58Z","published":"2024-12-03T04:28:58Z","title":"Failure Probability Estimation for Black-Box Autonomous Systems using\n State-Dependent Importance Sampling Proposals","summary":" Estimating the probability of failure is a critical step in developing\nsafety-critical autonomous systems. Direct estimation methods such as Monte\nCarlo sampling are often impractical due to the rarity of failures in these\nsystems. Existing importance sampling approaches do not scale to sequential\ndecision-making systems with large state spaces and long horizons. We propose\nan adaptive importance sampling algorithm to address these limitations. Our\nmethod minimizes the forward Kullback-Leibler divergence between a\nstate-dependent proposal distribution and a relaxed form of the optimal\nimportance sampling distribution. Our method uses Markov score ascent methods\nto estimate this objective. We evaluate our approach on four sequential systems\nand show that it provides more accurate failure probability estimates than\nbaseline Monte Carlo and importance sampling techniques. This work is open\nsourced.\n","authors":["Harrison Delecki","Sydney M. Katz","Mykel J. 
Kochenderfer"],"pdf_url":"https://arxiv.org/pdf/2412.02154v1.pdf","comment":"Submitted to L4DC 2025"},{"id":"http://arxiv.org/abs/2412.02153v1","updated":"2024-12-03T04:28:14Z","published":"2024-12-03T04:28:14Z","title":"Revisiting the Initial Steps in Adaptive Gradient Descent Optimization","summary":" Adaptive gradient optimization methods, such as Adam, are prevalent in\ntraining deep neural networks across diverse machine learning tasks due to\ntheir ability to achieve faster convergence. However, these methods often\nsuffer from suboptimal generalization compared to stochastic gradient descent\n(SGD) and exhibit instability, particularly when training Transformer models.\nIn this work, we show the standard initialization of the second-order moment\nestimation ($v_0 =0$) as a significant factor contributing to these\nlimitations. We introduce simple yet effective solutions: initializing the\nsecond-order moment estimation with non-zero values, using either data-driven\nor random initialization strategies. Empirical evaluations demonstrate that our\napproach not only stabilizes convergence but also enhances the final\nperformance of adaptive gradient optimizers. Furthermore, by adopting the\nproposed initialization strategies, Adam achieves performance comparable to\nmany recently proposed variants of adaptive gradient optimization methods,\nhighlighting the practical impact of this straightforward modification.\n","authors":["Abulikemu Abuduweili","Changliu Liu"],"pdf_url":"https://arxiv.org/pdf/2412.02153v1.pdf","comment":"OPT workshop at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2406.08666v2","updated":"2024-12-03T04:22:40Z","published":"2024-06-12T22:12:03Z","title":"Interventional Causal Discovery in a Mixture of DAGs","summary":" Causal interactions among a group of variables are often modeled by a single\ncausal graph. 
In some domains, however, these interactions are best described\nby multiple co-existing causal graphs, e.g., in dynamical systems or genomics.\nThis paper addresses the hitherto unknown role of interventions in learning\ncausal interactions among variables governed by a mixture of causal systems,\neach modeled by one directed acyclic graph (DAG). Causal discovery from\nmixtures is fundamentally more challenging than single-DAG causal discovery.\nTwo major difficulties stem from (i)~an inherent uncertainty about the\nskeletons of the component DAGs that constitute the mixture and (ii)~possibly\ncyclic relationships across these component DAGs. This paper addresses these\nchallenges and aims to identify edges that exist in at least one component DAG\nof the mixture, referred to as the true edges. First, it establishes matching\nnecessary and sufficient conditions on the size of interventions required to\nidentify the true edges. Next, guided by the necessity results, an adaptive\nalgorithm is designed that learns all true edges using $O(n^2)$ interventions,\nwhere $n$ is the number of nodes. Remarkably, the size of the interventions is\noptimal if the underlying mixture model does not contain cycles across its\ncomponents. More generally, the gap between the intervention size used by the\nalgorithm and the optimal size is quantified. 
It is shown to be bounded by the\ncyclic complexity number of the mixture model, defined as the size of the\nminimal intervention that can break the cycles in the mixture, which is upper\nbounded by the number of cycles among the ancestors of a node.\n","authors":["Burak Varıcı","Dmitriy Katz-Rogozhnikov","Dennis Wei","Prasanna Sattigeri","Ali Tajer"],"pdf_url":"https://arxiv.org/pdf/2406.08666v2.pdf","comment":"NeurIPS 2024 camera-ready version"},{"id":"http://arxiv.org/abs/2412.00648v2","updated":"2024-12-03T04:14:31Z","published":"2024-12-01T02:55:08Z","title":"DFRot: Achieving Outlier-Free and Massive Activation-Free for Rotated\n LLMs with Refined Rotation","summary":" Rotating the activation and weight matrices to reduce the influence of\noutliers in large language models (LLMs) has recently attracted significant\nattention, particularly in the context of model quantization. Prior studies\nhave shown that in low-precision quantization scenarios, such as 4-bit weights\nand 4-bit activations (W4A4), randomized Hadamard transforms can achieve\nsignificantly higher accuracy than randomized orthogonal transforms. Notably,\nthe reason behind this phenomena remains unknown. In this paper, we find that\nthese transformations show substantial improvement in eliminating outliers for\ncommon tokens and achieve similar quantization error. The primary reason for\nthe accuracy difference lies in the fact that randomized Hadamard transforms\ncan slightly reduce the quantization error for tokens with massive activations\nwhile randomized orthogonal transforms increase the quantization error. Due to\nthe extreme rarity of these tokens and their critical impact on model accuracy,\nwe consider this a long-tail optimization problem, and therefore construct a\nsimple yet effective method: a weighted loss function. 
Additionally, we propose\nan optimization strategy for the rotation matrix that involves alternating\noptimization of quantization parameters while employing orthogonal Procrustes\ntransforms to refine the rotation matrix. This makes the distribution of the\nrotated activation values more conducive to quantization, especially for tokens\nwith massive activations. Our method enhances the Rotated LLMs by achieving\ndual free, Outlier-Free and Massive Activation-Free, dubbed as DFRot. Extensive\nexperiments demonstrate the effectiveness and efficiency of DFRot. By tuning\nthe rotation matrix using just a single sample, DFRot achieves a perplexity\nimprovement of 0.25 and 0.21 on W4A4KV4 and W4A4KV16, respectively, for\nLLaMA3-8B, a model known for its quantization challenges.\n","authors":["Jingyang Xiang","Sai Qian Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.00648v2.pdf","comment":"24 pages, 38 figures, source code\n \\url{https://github.com/JingyangXiang/DFRot}"},{"id":"http://arxiv.org/abs/2407.00382v4","updated":"2024-12-03T04:07:32Z","published":"2024-06-29T09:35:12Z","title":"Towards Universal Mesh Movement Networks","summary":" Solving complex Partial Differential Equations (PDEs) accurately and\nefficiently is an essential and challenging problem in all scientific and\nengineering disciplines. Mesh movement methods provide the capability to\nimprove the accuracy of the numerical solution without increasing the overall\nmesh degree of freedom count. Conventional sophisticated mesh movement methods\nare extremely expensive and struggle to handle scenarios with complex boundary\ngeometries. However, existing learning-based methods require re-training from\nscratch given a different PDE type or boundary geometry, which limits their\napplicability, and also often suffer from robustness issues in the form of\ninverted elements. 
In this paper, we introduce the Universal Mesh Movement\nNetwork (UM2N), which -- once trained -- can be applied in a non-intrusive,\nzero-shot manner to move meshes with different size distributions and\nstructures, for solvers applicable to different PDE types and boundary\ngeometries. UM2N consists of a Graph Transformer (GT) encoder for extracting\nfeatures and a Graph Attention Network (GAT) based decoder for moving the mesh.\nWe evaluate our method on advection and Navier-Stokes based examples, as well\nas a real-world tsunami simulation case. Our method outperforms existing\nlearning-based mesh movement methods in terms of the benchmarks described\nabove. In comparison to the conventional sophisticated Monge-Amp\\`ere\nPDE-solver based method, our approach not only significantly accelerates mesh\nmovement, but also proves effective in scenarios where the conventional method\nfails. Our project page is at https://erizmr.github.io/UM2N/.\n","authors":["Mingrui Zhang","Chunyang Wang","Stephan Kramer","Joseph G. Wallwork","Siyi Li","Jiancheng Liu","Xiang Chen","Matthew D. Piggott"],"pdf_url":"https://arxiv.org/pdf/2407.00382v4.pdf","comment":"Accepted at NeurIPS 2024 as a spotlight paper"},{"id":"http://arxiv.org/abs/2412.02140v1","updated":"2024-12-03T03:56:01Z","published":"2024-12-03T03:56:01Z","title":"SparseGrasp: Robotic Grasping via 3D Semantic Gaussian Splatting from\n Sparse Multi-View RGB Images","summary":" Language-guided robotic grasping is a rapidly advancing field where robots\nare instructed using human language to grasp specific objects. However,\nexisting methods often depend on dense camera views and struggle to quickly\nupdate scenes, limiting their effectiveness in changeable environments.\n In contrast, we propose SparseGrasp, a novel open-vocabulary robotic grasping\nsystem that operates efficiently with sparse-view RGB images and handles scene\nupdates fastly. 
Our system builds upon and significantly enhances existing\ncomputer vision modules in robotic learning. Specifically, SparseGrasp utilizes\nDUSt3R to generate a dense point cloud as the initialization for 3D Gaussian\nSplatting (3DGS), maintaining high fidelity even under sparse supervision.\nImportantly, SparseGrasp incorporates semantic awareness from recent vision\nfoundation models. To further improve processing efficiency, we repurpose\nPrincipal Component Analysis (PCA) to compress features from 2D models.\nAdditionally, we introduce a novel render-and-compare strategy that ensures\nrapid scene updates, enabling multi-turn grasping in changeable environments.\n Experimental results show that SparseGrasp significantly outperforms\nstate-of-the-art methods in terms of both speed and adaptability, providing a\nrobust solution for multi-turn grasping in changeable environment.\n","authors":["Junqiu Yu","Xinlin Ren","Yongchong Gu","Haitao Lin","Tianyu Wang","Yi Zhu","Hang Xu","Yu-Gang Jiang","Xiangyang Xue","Yanwei Fu"],"pdf_url":"https://arxiv.org/pdf/2412.02140v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02912v1","updated":"2024-12-03T23:37:47Z","published":"2024-12-03T23:37:47Z","title":"ShapeWords: Guiding Text-to-Image Synthesis with 3D Shape-Aware Prompts","summary":" We introduce ShapeWords, an approach for synthesizing images based on 3D\nshape guidance and text prompts. ShapeWords incorporates target 3D shape\ninformation within specialized tokens embedded together with the input text,\neffectively blending 3D shape awareness with textual context to guide the image\nsynthesis process. 
Unlike conventional shape guidance methods that rely on\ndepth maps restricted to fixed viewpoints and often overlook full 3D structure\nor textual context, ShapeWords generates diverse yet consistent images that\nreflect both the target shape's geometry and the textual description.\nExperimental results show that ShapeWords produces images that are more\ntext-compliant, aesthetically plausible, while also maintaining 3D shape\nawareness.\n","authors":["Dmitry Petrov","Pradyumn Goyal","Divyansh Shivashok","Yuanming Tao","Melinos Averkiou","Evangelos Kalogerakis"],"pdf_url":"https://arxiv.org/pdf/2412.02912v1.pdf","comment":"Project webpage: https://lodurality.github.io/shapewords/"},{"id":"http://arxiv.org/abs/2405.00820v3","updated":"2024-12-03T23:30:43Z","published":"2024-05-01T19:02:18Z","title":"HLSFactory: A Framework Empowering High-Level Synthesis Datasets for\n Machine Learning and Beyond","summary":" Machine learning (ML) techniques have been applied to high-level synthesis\n(HLS) flows for quality-of-result (QoR) prediction and design space exploration\n(DSE). Nevertheless, the scarcity of accessible high-quality HLS datasets and\nthe complexity of building such datasets present challenges. Existing datasets\nhave limitations in terms of benchmark coverage, design space enumeration,\nvendor extensibility, or lack of reproducible and extensible software for\ndataset construction. Many works also lack user-friendly ways to add more\ndesigns, limiting wider adoption of such datasets. 
In response to these\nchallenges, we introduce HLSFactory, a comprehensive framework designed to\nfacilitate the curation and generation of high-quality HLS design datasets.\nHLSFactory has three main stages: 1) a design space expansion stage to\nelaborate single HLS designs into large design spaces using various\noptimization directives across multiple vendor tools, 2) a design synthesis\nstage to execute HLS and FPGA tool flows concurrently across designs, and 3) a\ndata aggregation stage for extracting standardized data into packaged datasets\nfor ML usage. This tripartite architecture ensures broad design space coverage\nvia design space expansion and supports multiple vendor tools. Users can\ncontribute to each stage with their own HLS designs and synthesis results and\nextend the framework itself with custom frontends and tool flows. We also\ninclude an initial set of built-in designs from common HLS benchmarks curated\nopen-source HLS designs. We showcase the versatility and multi-functionality of\nour framework through seven case studies: I) ML model for QoR prediction; II)\nDesign space sampling; III) Fine-grained parallelism backend speedup; IV)\nTargeting Intel's HLS flow; V) Adding new auxiliary designs; VI) Integrating\npublished HLS data; VII) HLS tool version regression benchmarking.\n","authors":["Stefan Abi-Karam","Rishov Sarkar","Allison Seigler","Sean Lowe","Zhigang Wei","Hanqiu Chen","Nanditha Rao","Lizy John","Aman Arora","Cong Hao"],"pdf_url":"https://arxiv.org/pdf/2405.00820v3.pdf","comment":"MLCAD 2024 version of the paper. New case study with ML QoR\n prediction. 
Artifact evaluation details included"}],"Multimedia":[{"id":"http://arxiv.org/abs/2412.02611v1","updated":"2024-12-03T17:41:23Z","published":"2024-12-03T17:41:23Z","title":"AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand\n Audio-Visual Information?","summary":" Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini\n1.5 Pro, and Reka Core, have expanded their capabilities to include vision and\naudio modalities. While these models demonstrate impressive performance across\na wide range of audio-visual applications, our proposed DeafTest reveals that\nMLLMs often struggle with simple tasks humans find trivial: 1) determining\nwhich of two sounds is louder, and 2) determining which of two sounds has a\nhigher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a\ncomprehensive audio-visual benchmark designed to assess whether those MLLMs can\ntruly understand the audio-visual information. This benchmark encompasses 4,555\ncarefully crafted problems, each incorporating text, visual, and audio\ncomponents. To successfully infer answers, models must effectively leverage\nclues from both visual and audio inputs. To ensure precise and objective\nevaluation of MLLM responses, we have structured the questions as\nmultiple-choice, eliminating the need for human evaluation or LLM-assisted\nassessment. We benchmark a series of closed-source and open-source models and\nsummarize the observations. 
By revealing the limitations of current models, we\naim to provide useful insight for future dataset collection and model\ndevelopment.\n","authors":["Kaixiong Gong","Kaituo Feng","Bohao Li","Yibing Wang","Mofan Cheng","Shijia Yang","Jiaming Han","Benyou Wang","Yutong Bai","Zhuoran Yang","Xiangyu Yue"],"pdf_url":"https://arxiv.org/pdf/2412.02611v1.pdf","comment":"Project page: https://av-odyssey.github.io/"},{"id":"http://arxiv.org/abs/2412.02575v1","updated":"2024-12-03T17:02:40Z","published":"2024-12-03T17:02:40Z","title":"Copy-Move Forgery Detection and Question Answering for Remote Sensing\n Image","summary":" This paper introduces the task of Remote Sensing Copy-Move Question Answering\n(RSCMQA). Unlike traditional Remote Sensing Visual Question Answering (RSVQA),\nRSCMQA focuses on interpreting complex tampering scenarios and inferring\nrelationships between objects. Based on the practical needs of national defense\nsecurity and land resource monitoring, we have developed an accurate and\ncomprehensive global dataset for remote sensing image copy-move question\nanswering, named RS-CMQA-2.1M. These images were collected from 29 different\nregions across 14 countries. Additionally, we have refined a balanced dataset,\nRS-CMQA-B, to address the long-standing issue of long-tail data in the remote\nsensing field. Furthermore, we propose a region-discriminative guided\nmultimodal CMQA model, which enhances the accuracy of answering questions about\ntampered images by leveraging prompt about the differences and connections\nbetween the source and tampered domains. Extensive experiments demonstrate that\nour method provides a stronger benchmark for RS-CMQA compared to general VQA\nand RSVQA models. 
Our dataset and code are available at\nhttps://github.com/shenyedepisa/RSCMQA.\n","authors":["Ze Zhang","Enyuan Zhao","Ziyi Wan","Jie Nie","Xinyue Liang","Lei Huang"],"pdf_url":"https://arxiv.org/pdf/2412.02575v1.pdf","comment":"7 figs, 7 tables"},{"id":"http://arxiv.org/abs/2412.02419v1","updated":"2024-12-03T12:31:44Z","published":"2024-12-03T12:31:44Z","title":"It Takes Two: Real-time Co-Speech Two-person's Interaction Generation\n via Reactive Auto-regressive Diffusion Model","summary":" Conversational scenarios are very common in real-world settings, yet existing\nco-speech motion synthesis approaches often fall short in these contexts, where\none person's audio and gestures will influence the other's responses.\nAdditionally, most existing methods rely on offline sequence-to-sequence\nframeworks, which are unsuitable for online applications. In this work, we\nintroduce an audio-driven, auto-regressive system designed to synthesize\ndynamic movements for two characters during a conversation. At the core of our\napproach is a diffusion-based full-body motion synthesis model, which is\nconditioned on the past states of both characters, speech audio, and a\ntask-oriented motion trajectory input, allowing for flexible spatial control.\nTo enhance the model's ability to learn diverse interactions, we have enriched\nexisting two-person conversational motion datasets with more dynamic and\ninteractive motions. We evaluate our system through multiple experiments to\nshow it outperforms across a variety of tasks, including single and two-person\nco-speech motion generation, as well as interactive motion generation. 
To the\nbest of our knowledge, this is the first system capable of generating\ninteractive full-body motions for two characters from speech in an online\nmanner.\n","authors":["Mingyi Shi","Dafei Qin","Leo Ho","Zhouyingcheng Liao","Yinghao Huang","Junichi Yamagishi","Taku Komura"],"pdf_url":"https://arxiv.org/pdf/2412.02419v1.pdf","comment":"15 pages, 10 figures"},{"id":"http://arxiv.org/abs/2409.08489v2","updated":"2024-12-03T23:17:44Z","published":"2024-09-13T02:32:10Z","title":"Resource-Efficient Reference-Free Evaluation of Audio Captions","summary":" To establish the trustworthiness of systems that automatically generate text\ncaptions for audio, images and video, existing reference-free metrics rely on\nlarge pretrained models which are impractical to accommodate in\nresource-constrained settings. To address this, we propose some metrics to\nelicit the model's confidence in its own generation. To assess how well these\nmetrics replace correctness measures that leverage reference captions, we test\ntheir calibration with correctness measures. We discuss why some of these\nconfidence metrics align better with certain correctness measures. Further, we\nprovide insight into why temperature scaling of confidence metrics is\neffective. Our main contribution is a suite of well-calibrated lightweight\nconfidence metrics for reference-free evaluation of captions in\nresource-constrained settings.\n","authors":["Rehana Mahfuz","Yinyi Guo","Erik Visser"],"pdf_url":"https://arxiv.org/pdf/2409.08489v2.pdf","comment":null}]},"2024-12-04T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2412.03563v1","updated":"2024-12-04T18:56:37Z","published":"2024-12-04T18:56:37Z","title":"From Individual to Society: A Survey on Social Simulation Driven by\n Large Language Model-based Agents","summary":" Traditional sociological research often relies on human participation, which,\nthough effective, is expensive, challenging to scale, and with ethical\nconcerns. 
Recent advancements in large language models (LLMs) highlight their\npotential to simulate human behavior, enabling the replication of individual\nresponses and facilitating studies on many interdisciplinary studies. In this\npaper, we conduct a comprehensive survey of this field, illustrating the recent\nprogress in simulation driven by LLM-empowered agents. We categorize the\nsimulations into three types: (1) Individual Simulation, which mimics specific\nindividuals or demographic groups; (2) Scenario Simulation, where multiple\nagents collaborate to achieve goals within specific contexts; and (3) Society\nSimulation, which models interactions within agent societies to reflect the\ncomplexity and variety of real-world dynamics. These simulations follow a\nprogression, ranging from detailed individual modeling to large-scale societal\nphenomena. We provide a detailed discussion of each simulation type, including\nthe architecture or key components of the simulation, the classification of\nobjectives or scenarios and the evaluation method. Afterward, we summarize\ncommonly used datasets and benchmarks. Finally, we discuss the trends across\nthese three types of simulation. A repository for the related sources is at\n{\\url{https://github.com/FudanDISC/SocialAgent}}.\n","authors":["Xinyi Mou","Xuanwen Ding","Qi He","Liang Wang","Jingcong Liang","Xinnong Zhang","Libo Sun","Jiayu Lin","Jie Zhou","Xuanjing Huang","Zhongyu Wei"],"pdf_url":"https://arxiv.org/pdf/2412.03563v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03556v1","updated":"2024-12-04T18:51:32Z","published":"2024-12-04T18:51:32Z","title":"Best-of-N Jailbreaking","summary":" We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that\njailbreaks frontier AI systems across modalities. 
BoN Jailbreaking works by\nrepeatedly sampling variations of a prompt with a combination of augmentations\n- such as random shuffling or capitalization for textual prompts - until a\nharmful response is elicited. We find that BoN Jailbreaking achieves high\nattack success rates (ASRs) on closed-source language models, such as 89% on\nGPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts.\nFurther, it is similarly effective at circumventing state-of-the-art\nopen-source defenses like circuit breakers. BoN also seamlessly extends to\nother modalities: it jailbreaks vision language models (VLMs) such as GPT-4o\nand audio language models (ALMs) like Gemini 1.5 Pro, using modality-specific\naugmentations. BoN reliably improves when we sample more augmented prompts.\nAcross all modalities, ASR, as a function of the number of samples (N),\nempirically follows power-law-like behavior for many orders of magnitude. BoN\nJailbreaking can also be composed with other black-box algorithms for even more\neffective attacks - combining BoN with an optimized prefix attack achieves up\nto a 35% increase in ASR. Overall, our work indicates that, despite their\ncapability, language models are sensitive to seemingly innocuous changes to\ninputs, which attackers can exploit across modalities.\n","authors":["John Hughes","Sara Price","Aengus Lynch","Rylan Schaeffer","Fazl Barez","Sanmi Koyejo","Henry Sleight","Erik Jones","Ethan Perez","Mrinank Sharma"],"pdf_url":"https://arxiv.org/pdf/2412.03556v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03537v1","updated":"2024-12-04T18:32:42Z","published":"2024-12-04T18:32:42Z","title":"Evaluating Gender Bias Transfer between Pre-trained and Prompt-Adapted\n Language Models","summary":" Large language models (LLMs) are increasingly being adapted to achieve\ntask-specificity for deployment in real-world decision systems. 
Several\nprevious works have investigated the bias transfer hypothesis (BTH) by studying\nthe effect of the fine-tuning adaptation strategy on model fairness to find\nthat fairness in pre-trained masked language models have limited effect on the\nfairness of models when adapted using fine-tuning. In this work, we expand the\nstudy of BTH to causal models under prompt adaptations, as prompting is an\naccessible, and compute-efficient way to deploy models in real-world systems.\nIn contrast to previous works, we establish that intrinsic biases in\npre-trained Mistral, Falcon and Llama models are strongly correlated (rho >=\n0.94) with biases when the same models are zero- and few-shot prompted, using a\npronoun co-reference resolution task. Further, we find that bias transfer\nremains strongly correlated even when LLMs are specifically prompted to exhibit\nfair or biased behavior (rho >= 0.92), and few-shot length and stereotypical\ncomposition are varied (rho >= 0.97). Our findings highlight the importance of\nensuring fairness in pre-trained LLMs, especially when they are later used to\nperform downstream tasks via prompt adaptation.\n","authors":["Natalie Mackraz","Nivedha Sivakumar","Samira Khorshidi","Krishna Patel","Barry-John Theobald","Luca Zappella","Nicholas Apostoloff"],"pdf_url":"https://arxiv.org/pdf/2412.03537v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11556v2","updated":"2024-12-04T18:31:44Z","published":"2023-12-17T08:07:32Z","title":"StarVector: Generating Scalable Vector Graphics Code from Images and\n Text","summary":" Scalable Vector Graphics (SVGs) are vital for modern image rendering due to\ntheir scalability and versatility. Previous SVG generation methods have focused\non curve-based vectorization, lacking semantic understanding, often producing\nartifacts, and struggling with SVG primitives beyond path curves. To address\nthese issues, we introduce StarVector, a multimodal large language model for\nSVG generation. 
It performs image vectorization by understanding image\nsemantics and using SVG primitives for compact, precise outputs. Unlike\ntraditional methods, StarVector works directly in the SVG code space,\nleveraging visual understanding to apply accurate SVG primitives. To train\nStarVector, we create SVG-Stack, a diverse dataset of 2M samples that enables\ngeneralization across vectorization tasks and precise use of primitives like\nellipses, polygons, and text. We address challenges in SVG evaluation, showing\nthat pixel-based metrics like MSE fail to capture the unique qualities of\nvector graphics. We introduce SVG-Bench, a benchmark across 10 datasets, and 3\ntasks: Image-to-SVG, Text-to-SVG generation, and diagram generation. Using this\nsetup, StarVector achieves state-of-the-art performance, producing more compact\nand semantically rich SVGs.\n","authors":["Juan A. Rodriguez","Abhay Puri","Shubham Agarwal","Issam H. Laradji","Pau Rodriguez","Sai Rajeswar","David Vazquez","Christopher Pal","Marco Pedersoli"],"pdf_url":"https://arxiv.org/pdf/2312.11556v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03531v1","updated":"2024-12-04T18:26:13Z","published":"2024-12-04T18:26:13Z","title":"A Review on Scientific Knowledge Extraction using Large Language Models\n in Biomedical Sciences","summary":" The rapid advancement of large language models (LLMs) has opened new\nboundaries in the extraction and synthesis of medical knowledge, particularly\nwithin evidence synthesis. This paper reviews the state-of-the-art applications\nof LLMs in the biomedical domain, exploring their effectiveness in automating\ncomplex tasks such as evidence synthesis and data extraction from a biomedical\ncorpus of documents. While LLMs demonstrate remarkable potential, significant\nchallenges remain, including issues related to hallucinations, contextual\nunderstanding, and the ability to generalize across diverse medical tasks. 
We\nhighlight critical gaps in the current research literature, particularly the\nneed for unified benchmarks to standardize evaluations and ensure reliability\nin real-world applications. In addition, we propose directions for future\nresearch, emphasizing the integration of state-of-the-art techniques such as\nretrieval-augmented generation (RAG) to enhance LLM performance in evidence\nsynthesis. By addressing these challenges and utilizing the strengths of LLMs,\nwe aim to improve access to medical literature and facilitate meaningful\ndiscoveries in healthcare.\n","authors":["Gabriel Lino Garcia","João Renato Ribeiro Manesco","Pedro Henrique Paiola","Lucas Miranda","Maria Paola de Salvo","João Paulo Papa"],"pdf_url":"https://arxiv.org/pdf/2412.03531v1.pdf","comment":"9 pages, 1 table, 1 figure, conference paper"},{"id":"http://arxiv.org/abs/2412.03527v1","updated":"2024-12-04T18:15:41Z","published":"2024-12-04T18:15:41Z","title":"FANAL -- Financial Activity News Alerting Language Modeling Framework","summary":" In the rapidly evolving financial sector, the accurate and timely\ninterpretation of market news is essential for stakeholders needing to navigate\nunpredictable events. This paper introduces FANAL (Financial Activity News\nAlerting Language Modeling Framework), a specialized BERT-based framework\nengineered for real-time financial event detection and analysis, categorizing\nnews into twelve distinct financial categories. FANAL leverages silver-labeled\ndata processed through XGBoost and employs advanced fine-tuning techniques,\nalongside ORBERT (Odds Ratio BERT), a novel variant of BERT fine-tuned with\nORPO (Odds Ratio Preference Optimization) for superior class-wise probability\ncalibration and alignment with financial event relevance. We evaluate FANAL's\nperformance against leading large language models, including GPT-4o, Llama-3.1\n8B, and Phi-3, demonstrating its superior accuracy and cost efficiency. 
This\nframework sets a new standard for financial intelligence and responsiveness,\nsignificantly outstripping existing models in both performance and\naffordability.\n","authors":["Urjitkumar Patel","Fang-Chun Yeh","Chinmay Gondhalekar","Hari Nalluri"],"pdf_url":"https://arxiv.org/pdf/2412.03527v1.pdf","comment":"Accepted for the IEEE International Workshop on Large Language Models\n for Finance, 2024. This is a preprint version"},{"id":"http://arxiv.org/abs/2407.08152v2","updated":"2024-12-04T17:56:57Z","published":"2024-07-11T03:10:27Z","title":"Privacy-Preserving Data Deduplication for Enhancing Federated Learning\n of Language Models (Extended Version)","summary":" Deduplication is a vital preprocessing step that enhances machine learning\nmodel performance and saves training time and energy. However, enhancing\nfederated learning through deduplication poses challenges, especially regarding\nscalability and potential privacy violations if deduplication involves sharing\nall clients' data. In this paper, we address the problem of deduplication in a\nfederated setup by introducing a pioneering protocol, Efficient\nPrivacy-Preserving Multi-Party Deduplication (EP-MPD). It efficiently removes\nduplicates from multiple clients' datasets without compromising data privacy.\nEP-MPD is constructed in a modular fashion, utilizing two novel variants of the\nPrivate Set Intersection protocol. Our extensive experiments demonstrate the\nsignificant benefits of deduplication in federated learning of large language\nmodels. For instance, we observe up to 19.62\\% improvement in perplexity and up\nto 27.95\\% reduction in running time while varying the duplication level\nbetween 10\\% and 30\\%. 
EP-MPD effectively balances privacy and performance in\nfederated learning, making it a valuable solution for large-scale applications.\n","authors":["Aydin Abadi","Vishnu Asutosh Dasu","Sumanta Sarkar"],"pdf_url":"https://arxiv.org/pdf/2407.08152v2.pdf","comment":"Accepted at the Network and Distributed Systems Security (NDSS)\n Symposium, 2025"},{"id":"http://arxiv.org/abs/2412.03513v1","updated":"2024-12-04T17:56:49Z","published":"2024-12-04T17:56:49Z","title":"KKLIP: Knowledge Distillation Exploiting K-means Clustering for\n Language-Image Pre-Training","summary":" Recently, CLIP has emerged as a valuable model for aligning image and text\ninformation in multi-modal scenarios. However, researchers have observed\nlimitations in the ability of CLIP's text and image encoders to extract\ndetailed knowledge from caption-image pairs. In response, this paper introduces\nKKLIP, a novel approach designed to enhance the quality of CLIP by\nincorporating a new knowledge distillation (KD) method derived from Llama 2.\nOur method comprises three objectives: Text Embedding Distillation, Concept\nLearning, and Contrastive Learning. Firstly, Text Embedding Distillation\ninvolves training the KKLIP text encoder to emulate the teacher model, Llama 2.\nSecondly, Concept Learning assigns a soft concept label to each caption-image\npair through offline k-means clustering of text information from Llama 2,\nallowing KKLIP to learn from these soft concept labels. Finally, Contrastive\nLearning harmonizes text and image embeddings. 
Our experimental results\ndemonstrate that KKLIP enhances the quality of both text and image encoders.\n","authors":["Kuei-Chun Kao"],"pdf_url":"https://arxiv.org/pdf/2412.03513v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.13492v3","updated":"2024-12-04T17:05:02Z","published":"2024-07-18T13:20:53Z","title":"Enhancing Biomedical Knowledge Discovery for Diseases: An Open-Source\n Framework Applied on Rett Syndrome and Alzheimer's Disease","summary":" The ever-growing volume of biomedical publications creates a critical need\nfor efficient knowledge discovery. In this context, we introduce an open-source\nend-to-end framework designed to construct knowledge around specific diseases\ndirectly from raw text. To facilitate research in disease-related knowledge\ndiscovery, we create two annotated datasets focused on Rett syndrome and\nAlzheimer's disease, enabling the identification of semantic relations between\nbiomedical entities. Extensive benchmarking explores various ways to represent\nrelations and entity representations, offering insights into optimal modeling\nstrategies for semantic relation detection and highlighting language models'\ncompetence in knowledge discovery. 
We also conduct probing experiments using\ndifferent layer representations and attention scores to explore transformers'\nability to capture semantic relations.\n","authors":["Christos Theodoropoulos","Andrei Catalin Coman","James Henderson","Marie-Francine Moens"],"pdf_url":"https://arxiv.org/pdf/2407.13492v3.pdf","comment":"Published in IEEE Access, doi: 10.1109/ACCESS.2024.3509714"},{"id":"http://arxiv.org/abs/2410.13928v2","updated":"2024-12-04T17:03:13Z","published":"2024-10-17T17:56:01Z","title":"Automatically Interpreting Millions of Features in Large Language Models","summary":" While the activations of neurons in deep neural networks usually do not have\na simple human-understandable interpretation, sparse autoencoders (SAEs) can be\nused to transform these activations into a higher-dimensional latent space\nwhich may be more easily interpretable. However, these SAEs can have millions\nof distinct latent features, making it infeasible for humans to manually\ninterpret each one. In this work, we build an open-source automated pipeline to\ngenerate and evaluate natural language explanations for SAE features using\nLLMs. We test our framework on SAEs of varying sizes, activation functions, and\nlosses, trained on two different open-weight LLMs. We introduce five new\ntechniques to score the quality of explanations that are cheaper to run than\nthe previous state of the art. One of these techniques, intervention scoring,\nevaluates the interpretability of the effects of intervening on a feature,\nwhich we find explains features that are not recalled by existing methods. We\npropose guidelines for generating better explanations that remain valid for a\nbroader set of activating contexts, and discuss pitfalls with existing scoring\ntechniques. We use our explanations to measure the semantic similarity of\nindependently trained SAEs, and find that SAEs trained on nearby layers of the\nresidual stream are highly similar. 
Our large-scale analysis confirms that SAE\nlatents are indeed much more interpretable than neurons, even when neurons are\nsparsified using top-$k$ postprocessing. Our code is available at\nhttps://github.com/EleutherAI/sae-auto-interp, and our explanations are\navailable at\nhttps://huggingface.co/datasets/EleutherAI/auto_interp_explanations.\n","authors":["Gonçalo Paulo","Alex Mallen","Caden Juang","Nora Belrose"],"pdf_url":"https://arxiv.org/pdf/2410.13928v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03465v1","updated":"2024-12-04T16:54:58Z","published":"2024-12-04T16:54:58Z","title":"YT-30M: A multi-lingual multi-category dataset of YouTube comments","summary":" This paper introduces two large-scale multilingual comment datasets, YT-30M\n(and YT-100K) from YouTube. The analysis in this paper is performed on a\nsmaller sample (YT-100K) of YT-30M. Both the datasets: YT-30M (full) and\nYT-100K (randomly selected 100K sample from YT-30M) are publicly released for\nfurther research. YT-30M (YT-100K) contains 32236173 (108694) comments posted\nby YouTube channel that belong to YouTube categories. Each comment is\nassociated with a video ID, comment ID, commentor name, commentor channel ID,\ncomment text, upvotes, original channel ID and category of the YouTube channel\n(e.g., 'News & Politics', 'Science & Technology', etc.).\n","authors":["Hridoy Sankar Dutta"],"pdf_url":"https://arxiv.org/pdf/2412.03465v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.03766v2","updated":"2024-12-04T16:39:04Z","published":"2024-11-06T08:59:44Z","title":"Number Cookbook: Number Understanding of Language Models and How to\n Improve It","summary":" Large language models (LLMs) can solve an increasing number of complex\nreasoning tasks while making surprising mistakes in basic numerical\nunderstanding and processing (such as 9.11 > 9.9). 
The latter ability is\nessential for tackling complex arithmetic and mathematical problems and serves\nas a foundation for most reasoning tasks, but previous work paid little\nattention to it or only discussed several restricted tasks (like integer\naddition). In this paper, we comprehensively investigate the numerical\nunderstanding and processing ability (NUPA) of LLMs. Firstly, we introduce a\nbenchmark covering four common numerical representations and 17 distinct\nnumerical tasks in four major categories, resulting in 41 meaningful\ncombinations in total. These tasks are derived from primary and secondary\neducation curricula, encompassing nearly all everyday numerical understanding\nand processing scenarios, and the rules of these tasks are very simple and\nclear. Through the benchmark, we find that current LLMs fail frequently in many\nof the tasks. To study the problem, we train small models with existing and\npotential techniques for enhancing NUPA (such as tokenizers, PEs, and number\nformats), comprehensively evaluating their effectiveness using our testbed. We\nalso finetune practical-scale LLMs on our proposed NUPA tasks and find that 1)\nnaive finetuning can improve NUPA a lot on many but not all tasks, and 2)\nsurprisingly, techniques designed to enhance NUPA prove ineffective for\nfinetuning pretrained models. We further explore the impact of chain-of-thought\ntechniques on NUPA. Our work provides a more detailed and comprehensive\nunderstanding of NUPA in LLMs. 
Our benchmark and code are released at\nhttps://github.com/GraphPKU/number_cookbook.\n","authors":["Haotong Yang","Yi Hu","Shijia Kang","Zhouchen Lin","Muhan Zhang"],"pdf_url":"https://arxiv.org/pdf/2411.03766v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02205v2","updated":"2024-12-04T16:12:08Z","published":"2024-12-03T06:47:15Z","title":"DataLab: A Unified Platform for LLM-Powered Business Intelligence","summary":" Business intelligence (BI) transforms large volumes of data within modern\norganizations into actionable insights for informed decision-making. Recently,\nlarge language model (LLM)-based agents have streamlined the BI workflow by\nautomatically performing task planning, reasoning, and actions in executable\nenvironments based on natural language (NL) queries. However, existing\napproaches primarily focus on individual BI tasks such as NL2SQL and NL2VIS.\nThe fragmentation of tasks across different data roles and tools lead to\ninefficiencies and potential errors due to the iterative and collaborative\nnature of BI. In this paper, we introduce DataLab, a unified BI platform that\nintegrates a one-stop LLM-based agent framework with an augmented computational\nnotebook interface. DataLab supports a wide range of BI tasks for different\ndata roles by seamlessly combining LLM assistance with user customization\nwithin a single environment. To achieve this unification, we design a domain\nknowledge incorporation module tailored for enterprise-specific BI tasks, an\ninter-agent communication mechanism to facilitate information sharing across\nthe BI workflow, and a cell-based context management strategy to enhance\ncontext utilization efficiency in BI notebooks. Extensive experiments\ndemonstrate that DataLab achieves state-of-the-art performance on various BI\ntasks across popular research benchmarks. 
Moreover, DataLab maintains high\neffectiveness and efficiency on real-world datasets from Tencent, achieving up\nto a 58.58% increase in accuracy and a 61.65% reduction in token cost on\nenterprise-specific BI tasks.\n","authors":["Luoxuan Weng","Yinghao Tang","Yingchaojie Feng","Zhuo Chang","Peng Chen","Ruiqin Chen","Haozhe Feng","Chen Hou","Danqing Huang","Yang Li","Huaming Rao","Haonan Wang","Canshi Wei","Xiaofeng Yang","Yuhui Zhang","Yifeng Zheng","Xiuqi Huang","Minfeng Zhu","Yuxin Ma","Bin Cui","Wei Chen"],"pdf_url":"https://arxiv.org/pdf/2412.02205v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.17826v3","updated":"2024-12-04T16:03:04Z","published":"2024-02-27T19:00:01Z","title":"Prediction-Powered Ranking of Large Language Models","summary":" Large language models are often ranked according to their level of alignment\nwith human preferences -- a model is better than other models if its outputs\nare more frequently preferred by humans. One of the popular ways to elicit\nhuman preferences utilizes pairwise comparisons between the outputs provided by\ndifferent models to the same inputs. However, since gathering pairwise\ncomparisons by humans is costly and time-consuming, it has become a common\npractice to gather pairwise comparisons by a strong large language model -- a\nmodel strongly aligned with human preferences. Surprisingly, practitioners\ncannot currently measure the uncertainty that any mismatch between human and\nmodel preferences may introduce in the constructed rankings. In this work, we\ndevelop a statistical framework to bridge this gap. Given a (small) set of\npairwise comparisons by humans and a large set of pairwise comparisons by a\nmodel, our framework provides a rank-set -- a set of possible ranking positions\n-- for each of the models under comparison. 
Moreover, it guarantees that, with\na probability greater than or equal to a user-specified value, the rank-sets\ncover the true ranking consistent with the distribution of human pairwise\npreferences asymptotically. Using pairwise comparisons made by humans in the\nLMSYS Chatbot Arena platform and pairwise comparisons made by three strong\nlarge language models, we empirically demonstrate the effectivity of our\nframework and show that the rank-sets constructed using only pairwise\ncomparisons by the strong large language models are often inconsistent with\n(the distribution of) human pairwise preferences.\n","authors":["Ivi Chatzi","Eleni Straitouri","Suhas Thejaswi","Manuel Gomez Rodriguez"],"pdf_url":"https://arxiv.org/pdf/2402.17826v3.pdf","comment":"Published at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2411.05872v2","updated":"2024-12-04T15:56:13Z","published":"2024-11-07T22:23:30Z","title":"Dialectal Coverage And Generalization in Arabic Speech Recognition","summary":" Developing robust automatic speech recognition (ASR) systems for Arabic, a\nlanguage characterized by its rich dialectal diversity and often considered a\nlow-resource language in speech technology, demands effective strategies to\nmanage its complexity. This study explores three critical factors influencing\nASR performance: the role of dialectal coverage in pre-training, the\neffectiveness of dialect-specific fine-tuning compared to a multi-dialectal\napproach, and the ability to generalize to unseen dialects. 
Through extensive\nexperiments across different dialect combinations, our findings offer key\ninsights towards advancing the development of ASR systems for pluricentric\nlanguages like Arabic.\n","authors":["Amirbek Djanibekov","Hawau Olamide Toyin","Raghad Alshalan","Abdullah Alitr","Hanan Aldarmaki"],"pdf_url":"https://arxiv.org/pdf/2411.05872v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03398v1","updated":"2024-12-04T15:27:39Z","published":"2024-12-04T15:27:39Z","title":"RedStone: Curating General, Code, Math, and QA Data for Large Language\n Models","summary":" Pre-training Large Language Models (LLMs) on high-quality, meticulously\ncurated datasets is widely recognized as critical for enhancing their\nperformance and generalization capabilities. This study explores the untapped\npotential of Common Crawl as a comprehensive and flexible resource for\npre-training LLMs, addressing both general-purpose language understanding and\nspecialized domain knowledge. We introduce RedStone, an innovative and scalable\npipeline engineered to extract and process data from Common Crawl, facilitating\nthe creation of extensive and varied pre-training datasets. Unlike traditional\ndatasets, which often require expensive curation and domain-specific expertise,\nRedStone leverages the breadth of Common Crawl to deliver datasets tailored to\na wide array of domains. In this work, we exemplify its capability by\nconstructing pre-training datasets across multiple fields, including general\nlanguage understanding, code, mathematics, and question-answering tasks. The\nflexibility of RedStone allows for easy adaptation to other specialized\ndomains, significantly lowering the barrier to creating valuable\ndomain-specific datasets. Our findings demonstrate that Common Crawl, when\nharnessed through effective pipelines like RedStone, can serve as a rich,\nrenewable source of pre-training data, unlocking new avenues for domain\nadaptation and knowledge discovery in LLMs. 
This work also underscores the\nimportance of innovative data acquisition strategies and highlights the role of\nweb-scale data as a powerful resource in the continued evolution of LLMs.\nRedStone code and data samples will be publicly available at\n\\url{https://aka.ms/redstone}.\n","authors":["Yaoyao Chang","Lei Cui","Li Dong","Shaohan Huang","Yangyu Huang","Yupan Huang","Scarlett Li","Tengchao Lv","Shuming Ma","Qinzheng Sun","Wenhui Wang","Furu Wei","Ying Xin","Mao Yang","Qiufeng Yin","Xingxing Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.03398v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.04158v2","updated":"2024-12-04T15:23:54Z","published":"2024-07-04T21:23:18Z","title":"ELCC: the Emergent Language Corpus Collection","summary":" We introduce the Emergent Language Corpus Collection (ELCC): a collection of\ncorpora generated from open source implementations of emergent communication\nsystems across the literature. These systems include a variety of signalling\ngame environments as well as more complex environments like a social deduction\ngame and embodied navigation. Each corpus is annotated with metadata describing\nthe characteristics of the source system as well as a suite of analyses of the\ncorpus (e.g., size, entropy, average message length, performance as transfer\nlearning data). Currently, research studying emergent languages requires\ndirectly running different systems which takes time away from actual analyses\nof such languages, makes studies which compare diverse emergent languages rare,\nand presents a barrier to entry for researchers without a background in deep\nlearning. The availability of a substantial collection of well-documented\nemergent language corpora, then, will enable research which can analyze a wider\nvariety of emergent languages, which more effectively uncovers general\nprinciples in emergent communication rather than artifacts of particular\nenvironments. 
We provide some quantitative and qualitative analyses with ELCC\nto demonstrate potential use cases of the resource in this vein.\n","authors":["Brendon Boldt","David Mortensen"],"pdf_url":"https://arxiv.org/pdf/2407.04158v2.pdf","comment":"21 pages, 8 figures; added analyses"},{"id":"http://arxiv.org/abs/2405.19732v4","updated":"2024-12-04T15:20:35Z","published":"2024-05-30T06:24:14Z","title":"LLM as a Complementary Optimizer to Gradient Descent: A Case Study in\n Prompt Tuning","summary":" Mastering a skill generally relies on both hands-on experience from doers and\ninsightful, high-level guidance by mentors. Will this strategy also work well\nfor solving complex non-convex optimization problems? Here, a common\ngradient-based optimizer acts like a disciplined doer, making locally optimal\nupdates at each step. Large Language Models (LLMs) can also search for better\nsolutions by inferring from natural language instructions, akin to a high-level\nmentor. In this paper, we show that these two participators are complementary\nto each other and can effectively collaborate as a combined optimization\nframework. The collaborative optimization is achieved by alternating between\nthe gradient-based and LLM-based optimizers. We instruct LLMs to generate\npossibly improved solutions by taking parameter trajectories recorded during\nthe previous stage of gradient-based optimization into account. Inferred\nresults of LLMs are used as restarting points for the next stage of gradient\noptimization. We verify the effectiveness of this optimization framework on\nprompt tuning. By leveraging both the locally rigorous gradient-based optimizer\nand the high-level deductive LLM-based optimizer, the combined optimization\nmethod consistently yields improvements over competitive baselines on a variety\nof tasks. Our results demonstrate the synergistic effect of conventional\ngradient-based optimization and the inference ability of LLMs. 
The code is\nreleased at https://github.com/guozix/LLM-catalyst.\n","authors":["Zixian Guo","Ming Liu","Zhilong Ji","Jinfeng Bai","Yiwen Guo","Wangmeng Zuo"],"pdf_url":"https://arxiv.org/pdf/2405.19732v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03388v1","updated":"2024-12-04T15:17:25Z","published":"2024-12-04T15:17:25Z","title":"DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for\n Text-to-Speech with Diverse and Controllable Styles","summary":" Human speech exhibits rich and flexible prosodic variations. To address the\none-to-many mapping problem from text to prosody in a reasonable and flexible\nmanner, we propose DiffStyleTTS, a multi-speaker acoustic model based on a\nconditional diffusion module and an improved classifier-free guidance, which\nhierarchically models speech prosodic features, and controls different prosodic\nstyles to guide prosody prediction. Experiments show that our method\noutperforms all baselines in naturalness and achieves superior synthesis speed\ncompared to three diffusion-based baselines. Additionally, by adjusting the\nguiding scale, DiffStyleTTS effectively controls the guidance intensity of the\nsynthetic prosody.\n","authors":["Jiaxuan Liu","Zhaoci Liu","Yajun Hu","Yingying Gao","Shilei Zhang","Zhenhua Ling"],"pdf_url":"https://arxiv.org/pdf/2412.03388v1.pdf","comment":"COLING 2025"},{"id":"http://arxiv.org/abs/2408.15903v2","updated":"2024-12-04T15:01:47Z","published":"2024-08-28T16:15:45Z","title":"LLM-Based Multi-Hop Question Answering with Knowledge Graph Integration\n in Evolving Environments","summary":" The important challenge of keeping knowledge in Large Language Models (LLMs)\nup-to-date has led to the development of various methods for incorporating new\nfacts. However, existing methods for such knowledge editing still face\ndifficulties with multi-hop questions that require accurate fact identification\nand sequential logical reasoning, particularly among numerous fact updates. 
To\ntackle these challenges, this paper introduces Graph Memory-based Editing for\nLarge Language Models (GMeLLo), a straightforward and effective method that\nmerges the explicit knowledge representation of Knowledge Graphs (KGs) with the\nlinguistic flexibility of LLMs. Beyond merely leveraging LLMs for question\nanswering, GMeLLo employs these models to convert free-form language into\nstructured queries and fact triples, facilitating seamless interaction with KGs\nfor rapid updates and precise multi-hop reasoning. Our results show that GMeLLo\nsignificantly surpasses current state-of-the-art (SOTA) knowledge editing\nmethods in the multi-hop question answering benchmark, MQuAKE, especially in\nscenarios with extensive knowledge edits.\n","authors":["Ruirui Chen","Weifeng Jiang","Chengwei Qin","Ishaan Singh Rawal","Cheston Tan","Dongkyu Choi","Bo Xiong","Bo Ai"],"pdf_url":"https://arxiv.org/pdf/2408.15903v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03343v1","updated":"2024-12-04T14:23:16Z","published":"2024-12-04T14:23:16Z","title":"Improving Linguistic Diversity of Large Language Models with Possibility\n Exploration Fine-Tuning","summary":" While Large Language Models (LLMs) have made significant strides in\nreplicating human-like abilities, there are concerns about a reduction in the\nlinguistic diversity of their outputs. This results in the homogenization of\nviewpoints and perspectives, as well as the underrepresentation of specific\ndemographic groups. Although several fine-tuning and prompting techniques have\nbeen suggested to tackle the issue, they are often tailored to specific tasks\nor come with a substantial increase in computational cost and latency. This\nmakes them challenging to apply to applications that demand very low latency,\nsuch as chatbots and virtual assistants. 
We propose Possibility Exploration\nFine-Tuning (PEFT), a task-agnostic framework that enhances the text diversity\nof LLMs without increasing latency or computational cost. Given the same\nprompt, models fine-tuned with PEFT can simultaneously generate multiple\ndiverse responses, each corresponding with a controllable possibility number.\nExperiments on dialogue and story generation tasks demonstrate that PEFT\nsignificantly enhances the diversity of LLM outputs, as evidenced by lower\nsimilarity between candidate responses. Since PEFT emphasizes semantic\ndiversity over lexical diversity, it can also notably reduce demographic bias\nin dialogue systems. The implementations and datasets are available in our\nrepository: https://github.com/mailong25/peft_diversity\n","authors":["Long Mai","Julie Carson-Berndsen"],"pdf_url":"https://arxiv.org/pdf/2412.03343v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01951v2","updated":"2024-12-04T14:20:21Z","published":"2024-12-02T20:24:17Z","title":"Self-Improvement in Language Models: The Sharpening Mechanism","summary":" Recent work in language modeling has raised the possibility of\nself-improvement, where a language models evaluates and refines its own\ngenerations to achieve higher performance without external feedback. It is\nimpossible for this self-improvement to create information that is not already\nin the model, so why should we expect that this will lead to improved\ncapabilities? We offer a new perspective on the capabilities of\nself-improvement through a lens we refer to as sharpening. Motivated by the\nobservation that language models are often better at verifying response quality\nthan they are at generating correct responses, we formalize self-improvement as\nusing the model itself as a verifier during post-training in order to\n``sharpen'' the model to one placing large mass on high-quality sequences,\nthereby amortizing the expensive inference-time computation of generating good\nsequences. 
We begin by introducing a new statistical framework for sharpening\nin which the learner aims to sharpen a pre-trained base policy via sample\naccess, and establish fundamental limits. Then we analyze two natural families\nof self-improvement algorithms based on SFT and RLHF. We find that (i) the\nSFT-based approach is minimax optimal whenever the initial model has sufficient\ncoverage, but (ii) the RLHF-based approach can improve over SFT-based\nself-improvement by leveraging online exploration, bypassing the need for\ncoverage. Finally, we empirically validate the sharpening mechanism via\ninference-time and amortization experiments. We view these findings as a\nstarting point toward a foundational understanding that can guide the design\nand evaluation of self-improvement algorithms.\n","authors":["Audrey Huang","Adam Block","Dylan J. Foster","Dhruv Rohatgi","Cyril Zhang","Max Simchowitz","Jordan T. Ash","Akshay Krishnamurthy"],"pdf_url":"https://arxiv.org/pdf/2412.01951v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03334v1","updated":"2024-12-04T14:05:18Z","published":"2024-12-04T14:05:18Z","title":"Yankari: A Monolingual Yoruba Dataset","summary":" This paper presents Yankari, a large-scale monolingual dataset for the Yoruba\nlanguage, aimed at addressing the critical gap in Natural Language Processing\n(NLP) resources for this important West African language. Despite being spoken\nby over 30 million people, Yoruba has been severely underrepresented in NLP\nresearch and applications. We detail our methodology for creating this dataset,\nwhich includes careful source selection, automated quality control, and\nrigorous data cleaning processes. The Yankari dataset comprises 51,407\ndocuments from 13 diverse sources, totaling over 30 million tokens. Our\napproach focuses on ethical data collection practices, avoiding problematic\nsources and addressing issues prevalent in existing datasets. 
We provide\nthorough automated evaluations of the dataset, demonstrating its quality\ncompared to existing resources. The Yankari dataset represents a significant\nadvancement in Yoruba language resources, providing a foundation for developing\nmore accurate NLP models, supporting comparative linguistic studies, and\ncontributing to the digital accessibility of the Yoruba language.\n","authors":["Maro Akpobi"],"pdf_url":"https://arxiv.org/pdf/2412.03334v1.pdf","comment":"8 pages"},{"id":"http://arxiv.org/abs/2412.03331v1","updated":"2024-12-04T14:02:12Z","published":"2024-12-04T14:02:12Z","title":"LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence\n Embeddings","summary":" Sentence embedding models play a key role in various Natural Language\nProcessing tasks, such as in Topic Modeling, Document Clustering and\nRecommendation Systems. However, these models rely heavily on parallel data,\nwhich can be scarce for many low-resource languages, including Luxembourgish.\nThis scarcity results in suboptimal performance of monolingual and\ncross-lingual sentence embedding models for these languages. To address this\nissue, we compile a relatively small but high-quality human-generated\ncross-lingual parallel dataset to train \\tool, an enhanced sentence embedding\nmodel for Luxembourgish with strong cross-lingual capabilities. Additionally,\nwe present evidence suggesting that including low-resource languages in\nparallel training datasets can be more advantageous for other low-resource\nlanguages than relying solely on high-resource language pairs. Furthermore,\nrecognizing the lack of sentence embedding benchmarks for low-resource\nlanguages, we create a paraphrase detection benchmark specifically for\nLuxembourgish, aiming to partially fill this gap and promote further research.\n","authors":["Fred Philippy","Siwen Guo","Jacques Klein","Tegawendé F. 
Bissyandé"],"pdf_url":"https://arxiv.org/pdf/2412.03331v1.pdf","comment":"Accepted at COLING 2025"},{"id":"http://arxiv.org/abs/2408.14845v2","updated":"2024-12-04T13:43:28Z","published":"2024-08-27T07:56:35Z","title":"AAVENUE: Detecting LLM Biases on NLU Tasks in AAVE via a Novel Benchmark","summary":" Detecting biases in natural language understanding (NLU) for African American\nVernacular English (AAVE) is crucial to developing inclusive natural language\nprocessing (NLP) systems. To address dialect-induced performance discrepancies,\nwe introduce AAVENUE ({AAVE} {N}atural Language {U}nderstanding {E}valuation),\na benchmark for evaluating large language model (LLM) performance on NLU tasks\nin AAVE and Standard American English (SAE). AAVENUE builds upon and extends\nexisting benchmarks like VALUE, replacing deterministic syntactic and\nmorphological transformations with a more flexible methodology leveraging\nLLM-based translation with few-shot prompting, improving performance across our\nevaluation metrics when translating key tasks from the GLUE and SuperGLUE\nbenchmarks. We compare AAVENUE and VALUE translations using five popular LLMs\nand a comprehensive set of metrics including fluency, BARTScore, quality,\ncoherence, and understandability. Additionally, we recruit fluent AAVE speakers\nto validate our translations for authenticity. Our evaluations reveal that LLMs\nconsistently perform better on SAE tasks than AAVE-translated versions,\nunderscoring inherent biases and highlighting the need for more inclusive NLP\nmodels. 
We have open-sourced our source code on GitHub and created a website to\nshowcase our work at https://aavenue.live.\n","authors":["Abhay Gupta","Philip Meng","Ece Yurtseven","Sean O'Brien","Kevin Zhu"],"pdf_url":"https://arxiv.org/pdf/2408.14845v2.pdf","comment":"Published at NLP4PI @ EMNLP 2024"},{"id":"http://arxiv.org/abs/2407.02854v2","updated":"2024-12-04T13:41:11Z","published":"2024-07-03T07:12:36Z","title":"A Spatio-Temporal Representation Learning as an Alternative to\n Traditional Glosses in Sign Language Translation and Production","summary":" This work addresses the challenges associated with the use of glosses in both\nSign Language Translation (SLT) and Sign Language Production (SLP). While\nglosses have long been used as a bridge between sign language and spoken\nlanguage, they come with two major limitations that impede the advancement of\nsign language systems. First, annotating the glosses is a labor-intensive and\ntime-consuming process, which limits the scalability of datasets. Second, the\nglosses oversimplify sign language by stripping away its spatio-temporal\ndynamics, reducing complex signs to basic labels and missing the subtle\nmovements essential for precise interpretation. To address these limitations,\nwe introduce Universal Gloss-level Representation (UniGloR), a framework\ndesigned to capture the spatio-temporal features inherent in sign language,\nproviding a more dynamic and detailed alternative to the use of the glosses.\nThe core idea of UniGloR is simple yet effective: We derive dense\nspatio-temporal representations from sign keypoint sequences using\nself-supervised learning and seamlessly integrate them into SLT and SLP tasks.\nOur experiments in a keypoint-based setting demonstrate that UniGloR either\noutperforms or matches the performance of previous SLT and SLP methods on two\nwidely-used datasets: PHOENIX14T and How2Sign.\n","authors":["Eui Jun Hwang","Sukmin Cho","Huije Lee","Youngwoo Yoon","Jong C. 
Park"],"pdf_url":"https://arxiv.org/pdf/2407.02854v2.pdf","comment":"Accepted at WACV 2025"},{"id":"http://arxiv.org/abs/2412.03310v1","updated":"2024-12-04T13:37:59Z","published":"2024-12-04T13:37:59Z","title":"Grounded Language Design for Lightweight Diagramming for Formal Methods","summary":" Model finding, as embodied by SAT solvers and similar tools, is used widely,\nboth in embedding settings and as a tool in its own right. For instance, tools\nlike Alloy target SAT to enable users to incrementally define, explore, verify,\nand diagnose sophisticated specifications for a large number of complex\nsystems.\n These tools critically include a visualizer that lets users graphically\nexplore these generated models. As we show, however, default visualizers, which\nknow nothing about the domain, are unhelpful and even actively violate\npresentational and cognitive principles. At the other extreme, full-blown\nvisualizations require significant effort as well as knowledge a specifier\nmight not possess; they can also exhibit bad failure modes (including silent\nfailure). Instead, we need a language to capture essential domain information\nfor lightweight diagramming. We ground our language design in both the\ncognitive science literature on diagrams and on a large number of example\ncustom visualizations. This identifies the key elements of lightweight\ndiagrams. We distill these into a small set of orthogonal primitives. We extend\nan Alloy-like tool to support these primitives. We evaluate the effectiveness\nof the produced diagrams, finding them good for reasoning. 
We then compare this\nagainst many other drawing languages and tools to show that this work defines a\nnew niche that is lightweight, effective, and driven by sound principles.\n","authors":["Siddhartha Prasad","Ben Greenman","Tim Nelson","Shriram Krishnamurthi"],"pdf_url":"https://arxiv.org/pdf/2412.03310v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03309v1","updated":"2024-12-04T13:32:14Z","published":"2024-12-04T13:32:14Z","title":"Typologie des comportements utilisateurs : {é}tude exploratoire des\n sessions de recherche complexe sur le Web","summary":" In this study, we propose an exploratory approach aiming at a typology of\nuser behaviour during a Web search session. We describe a typology based on\ngeneric IR variables (e.g. number of queries), but also on the study of topic\n(propositions with distinct semantic content defined from the search\nstatement). To this end, we gathered experimental data enabling us to study\nvariations across users (N=70) for the same task. We performed a\nmultidimensional analysis and propose a 5 classes typology based on the\nindividual behaviours during the processing of a complex search task.\n","authors":["Claire Ibarboure","Ludovic Tanguy","Franck Amadieu"],"pdf_url":"https://arxiv.org/pdf/2412.03309v1.pdf","comment":"in French language, CORIA (COnf{\\'e}rence en Recherche d'Information\n et Applications), 2024, La Rochelle, France"},{"id":"http://arxiv.org/abs/2412.03304v1","updated":"2024-12-04T13:27:09Z","published":"2024-12-04T13:27:09Z","title":"Global MMLU: Understanding and Addressing Cultural and Linguistic Biases\n in Multilingual Evaluation","summary":" Cultural biases in multilingual datasets pose significant challenges for\ntheir effectiveness as global benchmarks. These biases stem not only from\nlanguage but also from the cultural knowledge required to interpret questions,\nreducing the practical utility of translated datasets like MMLU. 
Furthermore,\ntranslation often introduces artifacts that can distort the meaning or clarity\nof questions in the target language. A common practice in multilingual\nevaluation is to rely on machine-translated evaluation sets, but simply\ntranslating a dataset is insufficient to address these challenges. In this\nwork, we trace the impact of both of these issues on multilingual evaluations\nand ensuing model performances. Our large-scale evaluation of state-of-the-art\nopen and proprietary models illustrates that progress on MMLU depends heavily\non learning Western-centric concepts, with 28% of all questions requiring\nculturally sensitive knowledge. Moreover, for questions requiring geographic\nknowledge, an astounding 84.9% focus on either North American or European\nregions. Rankings of model evaluations change depending on whether they are\nevaluated on the full portion or the subset of questions annotated as\nculturally sensitive, showing the distortion to model rankings when blindly\nrelying on translated MMLU. We release Global-MMLU, an improved MMLU with\nevaluation coverage across 42 languages -- with improved overall quality by\nengaging with compensated professional and community annotators to verify\ntranslation quality while also rigorously evaluating cultural biases present in\nthe original dataset. This comprehensive Global-MMLU set also includes\ndesignated subsets labeled as culturally sensitive and culturally agnostic to\nallow for more holistic, complete evaluation.\n","authors":["Shivalika Singh","Angelika Romanou","Clémentine Fourrier","David I. Adelani","Jian Gang Ngui","Daniel Vila-Suero","Peerat Limkonchotiwat","Kelly Marchisio","Wei Qi Leong","Yosephine Susanto","Raymond Ng","Shayne Longpre","Wei-Yin Ko","Madeline Smith","Antoine Bosselut","Alice Oh","Andre F. T. 
Martins","Leshem Choshen","Daphne Ippolito","Enzo Ferrante","Marzieh Fadaee","Beyza Ermis","Sara Hooker"],"pdf_url":"https://arxiv.org/pdf/2412.03304v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03275v1","updated":"2024-12-04T12:34:15Z","published":"2024-12-04T12:34:15Z","title":"AntLM: Bridging Causal and Masked Language Models","summary":" Causal Language Modeling (CLM) and Masked Language Modeling (MLM) are two\nmainstream learning paradigms based on Transformer networks, specifically the\nDecoder-only and Encoder-only architectures. The strengths of each paradigm in\ndownstream tasks have shown a mix of advantages and disadvantages. In the past\nBabyLM Challenge 2023, although the MLM paradigm achieved the best average\nperformance, the CLM paradigm demonstrated significantly faster convergence\nrates. For the BabyLM Challenge 2024, we propose a novel language modeling\nparadigm named $\\textbf{AntLM}$, which integrates both CLM and MLM to leverage\nthe advantages of these two classic paradigms. We chose the strict-small track\nand conducted experiments on two foundation models: BabyLlama, representing\nCLM, and LTG-BERT, representing MLM. During the training process for specific\nfoundation models, we alternate between applying CLM or MLM training objectives\nand causal or bidirectional attention masks. Experimental results show that\ncombining the two pretraining objectives leverages their strengths, enhancing\noverall training performance. 
Under the same epochs, $AntLM_{BabyLlama}$\nimproves Macro-average by 1%, and $AntLM_{LTG-BERT}$ achieves a 2.2% increase\nover the baselines.\n","authors":["Xinru Yu","Bin Guo","Shiwei Luo","Jie Wang","Tao Ji","Yuanbin Wu"],"pdf_url":"https://arxiv.org/pdf/2412.03275v1.pdf","comment":"CoNLL Shared Task BabyLM Challenge"},{"id":"http://arxiv.org/abs/2412.03270v1","updated":"2024-12-04T12:25:41Z","published":"2024-12-04T12:25:41Z","title":"Intent-driven In-context Learning for Few-shot Dialogue State Tracking","summary":" Dialogue state tracking (DST) plays an essential role in task-oriented\ndialogue systems. However, user's input may contain implicit information,\nposing significant challenges for DST tasks. Additionally, DST data includes\ncomplex information, which not only contains a large amount of noise unrelated\nto the current turn, but also makes constructing DST datasets expensive. To\naddress these challenges, we introduce Intent-driven In-context Learning for\nFew-shot DST (IDIC-DST). By extracting user's intent, we propose an\nIntent-driven Dialogue Information Augmentation module to augment the dialogue\ninformation, which can track dialogue states more effectively. Moreover, we\nmask noisy information from DST data and rewrite user's input in the\nIntent-driven Examples Retrieval module, where we retrieve similar examples. We\nthen utilize a pre-trained large language model to update the dialogue state\nusing the augmented dialogue information and examples. Experimental results\ndemonstrate that IDIC-DST achieves state-of-the-art performance in few-shot\nsettings on MultiWOZ 2.1 and MultiWOZ 2.4 datasets.\n","authors":["Zihao Yi","Zhe Xu","Ying Shen"],"pdf_url":"https://arxiv.org/pdf/2412.03270v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03253v1","updated":"2024-12-04T11:52:03Z","published":"2024-12-04T11:52:03Z","title":"Alignment at Pre-training! 
Towards Native Alignment for Arabic LLMs","summary":" The alignment of large language models (LLMs) is critical for developing\neffective and safe language models. Traditional approaches focus on aligning\nmodels during the instruction tuning or reinforcement learning stages, referred\nto in this paper as `post alignment'. We argue that alignment during the\npre-training phase, which we term `native alignment', warrants investigation.\nNative alignment aims to prevent unaligned content from the beginning, rather\nthan relying on post-hoc processing. This approach leverages extensively\naligned pre-training data to enhance the effectiveness and usability of\npre-trained models. Our study specifically explores the application of native\nalignment in the context of Arabic LLMs. We conduct comprehensive experiments\nand ablation studies to evaluate the impact of native alignment on model\nperformance and alignment stability. Additionally, we release open-source\nArabic LLMs that demonstrate state-of-the-art performance on various\nbenchmarks, providing significant benefits to the Arabic LLM community.\n","authors":["Juhao Liang","Zhenyang Cai","Jianqing Zhu","Huang Huang","Kewei Zong","Bang An","Mosen Alharthi","Juncai He","Lian Zhang","Haizhou Li","Benyou Wang","Jinchao Xu"],"pdf_url":"https://arxiv.org/pdf/2412.03253v1.pdf","comment":"Accepted to NeurIPS 2024 main conference. see\n https://github.com/FreedomIntelligence/AceGPT-v2"},{"id":"http://arxiv.org/abs/2411.10083v3","updated":"2024-12-04T11:49:04Z","published":"2024-11-15T10:01:52Z","title":"Xmodel-1.5: An 1B-scale Multilingual LLM","summary":" We introduce Xmodel-1.5, a 1-billion-parameter multilingual large language\nmodel pretrained on 2 trillion tokens, designed for balanced performance and\nscalability. Unlike most large models that use the BPE tokenizer, Xmodel-1.5\nemploys a custom unigram tokenizer with 65,280 tokens, optimizing both\nefficiency and accuracy. 
The model delivers competitive results across multiple\nlanguages, including Thai, Arabic, French, Chinese, and English, outperforming\nAlibaba's PolyLM-1.7B on respective evaluation datasets. Xmodel-1.5 excels in\nbenchmarks like mMMLU and PIQA, and achieves state-of-the-art results in Thai.\nTo support low-resource language research, we release Xdata_Thai, a\nThai-specific evaluation dataset featuring unique linguistic challenges such as\ngendered particles and idioms. While the model demonstrates strong performance,\nthere is still room for improvement in handling culturally specific nuances. We\nhope this work contributes to advancements in multilingual AI research. Models\nand code are publicly available on GitHub at\nhttps://github.com/XiaoduoAILab/XmodelLM-1.5\n","authors":["Wang Qun","Liu Yang","Lin Qingquan","Jiang Ling"],"pdf_url":"https://arxiv.org/pdf/2411.10083v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03248v1","updated":"2024-12-04T11:47:57Z","published":"2024-12-04T11:47:57Z","title":"AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and\n Pruning","summary":" Large language models (LLMs) have enabled the creation of multi-modal LLMs\nthat exhibit strong comprehension of visual data such as images and videos.\nHowever, these models usually rely on extensive visual tokens from visual\nencoders, leading to high computational demands, which limits their\napplicability in resource-constrained environments and for long-context tasks.\nIn this work, we propose a training-free adaptive inference method for\nmulti-modal LLMs that can accommodate a broad range of efficiency requirements\nwith a minimum performance drop. Our method consists of a) iterative token\nmerging based on embedding similarity before LLMs, and b) progressive token\npruning within LLM layers based on multi-modal importance. With a minimalist\ndesign, our method can be applied to both video and image LLMs. 
Extensive\nexperiments on diverse video and image benchmarks demonstrate that, our method\nsubstantially reduces computation load (e.g., a $\\textbf{7-fold}$ reduction in\nFLOPs) while preserving the performance of video and image LLMs. Further, under\na similar computational cost, our method outperforms the state-of-the-art\nmethods in long video understanding (e.g., $\\textbf{+4.6}$ on MLVU).\nAdditionally, our in-depth analysis provides insights into token redundancy and\nLLM layer behaviors, offering guidance for future research in designing\nefficient multi-modal LLMs. Our code will be available at\nhttps://github.com/LaVi-Lab/AIM.\n","authors":["Yiwu Zhong","Zhuoming Liu","Yin Li","Liwei Wang"],"pdf_url":"https://arxiv.org/pdf/2412.03248v1.pdf","comment":"12 pages, 2 figures"},{"id":"http://arxiv.org/abs/2412.02626v2","updated":"2024-12-04T11:45:34Z","published":"2024-12-03T17:54:12Z","title":"Time-Reversal Provides Unsupervised Feedback to LLMs","summary":" Large Language Models (LLMs) are typically trained to predict in the forward\ndirection of time. However, recent works have shown that prompting these models\nto look back and critique their own generations can produce useful feedback.\nMotivated by this, we explore the question of whether LLMs can be empowered to\nthink (predict and score) backwards to provide unsupervised feedback that\ncomplements forward LLMs. Towards this, we introduce Time Reversed Language\nModels (TRLMs), which can score and generate queries when conditioned on\nresponses, effectively functioning in the reverse direction of time. 
Further,\nto effectively infer in the response to query direction, we pre-train and\nfine-tune a language model (TRLM-Ba) in the reverse token order from scratch.\nWe show empirically (and theoretically in a stylized setting) that\ntime-reversed models can indeed complement forward model predictions when used\nto score the query given response for re-ranking multiple forward generations.\nWe obtain up to 5\\% improvement on the widely used AlpacaEval Leaderboard over\nthe competent baseline of best-of-N re-ranking using self log-perplexity\nscores. We further show that TRLM scoring outperforms conventional forward\nscoring of response given query, resulting in significant gains in applications\nsuch as citation generation and passage retrieval. We next leverage the\ngenerative ability of TRLM to augment or provide unsupervised feedback to input\nsafety filters of LLMs, demonstrating a drastic reduction in false negative\nrate with negligible impact on false positive rates against several attacks\npublished on the popular JailbreakBench leaderboard.\n","authors":["Yerram Varun","Rahul Madhavan","Sravanti Addepalli","Arun Suggala","Karthikeyan Shanmugam","Prateek Jain"],"pdf_url":"https://arxiv.org/pdf/2412.02626v2.pdf","comment":"Accepted as a spotlight in NeurIPS 2024"},{"id":"http://arxiv.org/abs/2409.09318v3","updated":"2024-12-04T11:44:57Z","published":"2024-09-14T05:31:29Z","title":"ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language\n Models","summary":" Hallucination poses a persistent challenge for multimodal large language\nmodels (MLLMs). However, existing benchmarks for evaluating hallucinations are\ngenerally static, which may overlook the potential risk of data contamination.\nTo address this issue, we propose ODE, an open-set, dynamic protocol designed\nto evaluate object hallucinations in MLLMs at both the existence and attribute\nlevels. 
ODE employs a graph-based structure to represent real-world object\nconcepts, their attributes, and the distributional associations between them.\nThis structure facilitates the extraction of concept combinations based on\ndiverse distributional criteria, generating varied samples for structured\nqueries that evaluate hallucinations in both generative and discriminative\ntasks. Through the generation of new samples, dynamic concept combinations, and\nvaried distribution frequencies, ODE mitigates the risk of data contamination\nand broadens the scope of evaluation. This protocol is applicable to both\ngeneral and specialized scenarios, including those with limited data.\nExperimental results demonstrate the effectiveness of our protocol, revealing\nthat MLLMs exhibit higher hallucination rates when evaluated with ODE-generated\nsamples, which indicates potential data contamination. Furthermore, these\ngenerated samples aid in analyzing hallucination patterns and fine-tuning\nmodels, offering an effective approach to mitigating hallucinations in MLLMs.\n","authors":["Yahan Tu","Rui Hu","Jitao Sang"],"pdf_url":"https://arxiv.org/pdf/2409.09318v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03242v1","updated":"2024-12-04T11:43:08Z","published":"2024-12-04T11:43:08Z","title":"Benchmarking terminology building capabilities of ChatGPT on an\n English-Russian Fashion Corpus","summary":" This paper compares the accuracy of the terms extracted using SketchEngine,\nTBXTools and ChatGPT. In addition, it evaluates the quality of the definitions\nproduced by ChatGPT for these terms. The research is carried out on a\ncomparable corpus of fashion magazines written in English and Russian collected\nfrom the web. A gold standard for the fashion terminology was also developed by\nidentifying web pages that can be harvested automatically and contain\ndefinitions of terms from the fashion domain in English and Russian. 
This gold\nstandard was used to evaluate the quality of the extracted terms and of the\ndefinitions produced. Our evaluation shows that TBXTools and SketchEngine,\nwhile capable of high recall, suffer from reduced precision as the number of\nterms increases, which affects their overall performance. Conversely, ChatGPT\ndemonstrates superior performance, maintaining or improving precision as more\nterms are considered. Analysis of the definitions produced by ChatGPT for 60\ncommonly used terms in English and Russian shows that ChatGPT maintains a\nreasonable level of accuracy and fidelity across languages, but sometimes the\ndefinitions in both languages miss crucial specifics and include unnecessary\ndeviations. Our research reveals that no single tool excels universally; each\nhas strengths suited to particular aspects of terminology extraction and\napplication.\n","authors":["Anastasiia Bezobrazova","Miriam Seghiri","Constantin Orasan"],"pdf_url":"https://arxiv.org/pdf/2412.03242v1.pdf","comment":"To appear in the Proceedings of Translating and the Computer 2024\n (TC46)"},{"id":"http://arxiv.org/abs/2412.03235v1","updated":"2024-12-04T11:36:37Z","published":"2024-12-04T11:36:37Z","title":"Does Safety Training of LLMs Generalize to Semantically Related Natural\n Prompts?","summary":" Large Language Models (LLMs) are known to be susceptible to crafted\nadversarial attacks or jailbreaks that lead to the generation of objectionable\ncontent despite being aligned to human preferences using safety fine-tuning\nmethods. While the large dimensionality of input token space makes it\ninevitable to find adversarial prompts that can jailbreak these models, we aim\nto evaluate whether safety fine-tuned LLMs are safe against natural prompts\nwhich are semantically related to toxic seed prompts that elicit safe responses\nafter alignment. 
We surprisingly find that popular aligned LLMs such as GPT-4\ncan be compromised using naive prompts that are NOT even crafted with an\nobjective of jailbreaking the model. Furthermore, we empirically show that\ngiven a seed prompt that elicits a toxic response from an unaligned model, one\ncan systematically generate several semantically related natural prompts that\ncan jailbreak aligned LLMs. Towards this, we propose a method of Response\nGuided Question Augmentation (ReG-QA) to evaluate the generalization of safety\naligned LLMs to natural prompts, that first generates several toxic answers\ngiven a seed question using an unaligned LLM (Q to A), and further leverages an\nLLM to generate questions that are likely to produce these answers (A to Q). We\ninterestingly find that safety fine-tuned LLMs such as GPT-4o are vulnerable to\nproducing natural jailbreak questions from unsafe content (without denial) and\ncan thus be used for the latter (A to Q) step. We obtain attack success rates\nthat are comparable to or better than leading adversarial attack methods on the\nJailbreakBench leaderboard, while being significantly more stable against\ndefenses such as Smooth-LLM and Synonym Substitution, which are effective\nagainst all existing attacks on the leaderboard.\n","authors":["Sravanti Addepalli","Yerram Varun","Arun Suggala","Karthikeyan Shanmugam","Prateek Jain"],"pdf_url":"https://arxiv.org/pdf/2412.03235v1.pdf","comment":"Accepted at the Safe Generative AI Workshop @ NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.03230v1","updated":"2024-12-04T11:28:52Z","published":"2024-12-04T11:28:52Z","title":"PERL: Pinyin Enhanced Rephrasing Language Model for Chinese ASR N-best\n Error Correction","summary":" ASR correction methods have predominantly focused on general datasets and\nhave not effectively utilized Pinyin information, unique to the Chinese\nlanguage. 
In this study, we address this gap by proposing a Pinyin Enhanced\nRephrasing Language Model (PERL), specifically designed for N-best correction\nscenarios. Additionally, we implement a length predictor module to address the\nvariable-length problem. We conduct experiments on the Aishell-1 dataset and\nour newly proposed DoAD dataset. The results show that our approach outperforms\nbaseline methods, achieving a 29.11% reduction in Character Error Rate (CER) on\nAishell-1 and around 70% CER reduction on domain-specific datasets.\nFurthermore, our approach leverages Pinyin similarity at the token level,\nproviding an advantage over baselines and leading to superior performance.\n","authors":["Junhong Liang"],"pdf_url":"https://arxiv.org/pdf/2412.03230v1.pdf","comment":"2 figures, 6 tables"},{"id":"http://arxiv.org/abs/2412.03223v1","updated":"2024-12-04T11:18:32Z","published":"2024-12-04T11:18:32Z","title":"Linq-Embed-Mistral Technical Report","summary":" This report explores the enhancement of text retrieval performance using\nadvanced data refinement techniques. We develop\nLinq-Embed-Mistral\\footnote{\\url{https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral}}\nby building on the E5-mistral and Mistral-7B-v0.1 models, focusing on\nsophisticated data crafting, data filtering, and negative mining methods, which\nare highly tailored to each task, applied to both existing benchmark dataset\nand highly tailored synthetic dataset generated via large language models\n(LLMs). Linq-Embed-Mistral excels in the MTEB benchmarks (as of May 29, 2024),\nachieving an average score of 68.2 across 56 datasets, and ranks 1st among all\nmodels for retrieval tasks on the MTEB leaderboard with a performance score of\n60.2. This performance underscores its superior capability in enhancing search\nprecision and reliability. 
Our contributions include advanced data refinement\nmethods that significantly improve model performance on benchmark and synthetic\ndatasets, techniques for homogeneous task ordering and mixed task fine-tuning\nto enhance model generalization and stability, and a streamlined evaluation\nprocess using 4-bit precision and a light retrieval evaluation set, which\naccelerates validation without sacrificing accuracy.\n","authors":["Chanyeol Choi","Junseong Kim","Seolhwa Lee","Jihoon Kwon","Sangmo Gu","Yejin Kim","Minkyung Cho","Jy-yong Sohn"],"pdf_url":"https://arxiv.org/pdf/2412.03223v1.pdf","comment":"15 pages"},{"id":"http://arxiv.org/abs/2411.00850v2","updated":"2024-12-04T10:45:41Z","published":"2024-10-30T11:16:04Z","title":"GWQ: Gradient-Aware Weight Quantization for Large Language Models","summary":" Large language models (LLMs) show impressive performance in solving complex\nlanguage tasks. However, its large number of parameters present significant\nchallenges for the deployment and application of the model on edge devices.\nCompressing large language models to low bits can enable them to run on\nresource-constrained devices, often leading to performance degradation. To\naddress this problem, we propose gradient-aware weight quantization (GWQ), the\nfirst quantization approach for low-bit weight quantization that leverages\ngradients to localize outliers, requiring only a minimal amount of calibration\ndata for outlier detection. GWQ retains the weights corresponding to the top 1%\noutliers preferentially at FP16 precision, while the remaining non-outlier\nweights are stored in a low-bit format. GWQ found experimentally that utilizing\nthe sensitive weights in the gradient localization model is more scientific\ncompared to utilizing the sensitive weights in the Hessian matrix localization\nmodel. Compared to current quantization methods, GWQ can be applied to multiple\nlanguage models and achieves lower PPL on the WikiText2 and C4 dataset. 
In the\nzero-shot task, GWQ quantized models have higher accuracy compared to other\nquantization methods. GWQ is also suitable for multimodal model quantization,\nand the quantized Qwen-VL family model is more accurate than other methods.\nZero-shot target detection task dataset RefCOCO outperforms the current\nstat-of-the-arts method SPQR. GWQ achieves 1.2 times inference speedup in\ncomparison to the original model, and effectively reduces the inference memory.\n","authors":["Yihua Shao","Siyu Liang","Zijian Ling","Minxi Yan","Haiyang Liu","Siyu Chen","Ziyang Yan","Chenyu Zhang","Haotong Qin","Michele Magno","Yang Yang","Zhen Lei","Yan Wang","Jingcai Guo","Ling Shao","Hao Tang"],"pdf_url":"https://arxiv.org/pdf/2411.00850v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03205v1","updated":"2024-12-04T10:44:50Z","published":"2024-12-04T10:44:50Z","title":"U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills\n in LLMs","summary":" The current evaluation of mathematical skills in LLMs is limited, as existing\nbenchmarks are either relatively small, primarily focus on elementary and\nhigh-school problems, or lack diversity in topics. Additionally, the inclusion\nof visual elements in tasks remains largely under-explored.\n To address these gaps, we introduce U-MATH, a novel benchmark of 1,100\nunpublished open-ended university-level problems sourced from teaching\nmaterials. It is balanced across six core subjects, with 20% of multimodal\nproblems. Given the open-ended nature of U-MATH problems, we employ an LLM to\njudge the correctness of generated solutions. To this end, we release\n$\\mu$-MATH, a dataset to evaluate the LLMs' capabilities in judging solutions.\n The evaluation of general domain, math-specific, and multimodal LLMs\nhighlights the challenges presented by U-MATH. Our findings reveal that LLMs\nachieve a maximum accuracy of only 63% on text-based tasks, with even lower 45%\non visual problems. 
The solution assessment proves challenging for LLMs, with\nthe best LLM judge having an F1-score of 80% on $\\mu$-MATH.\n","authors":["Konstantin Chernyshev","Vitaliy Polshkov","Ekaterina Artemova","Alex Myasnikov","Vlad Stepanov","Alexei Miasnikov","Sergei Tilga"],"pdf_url":"https://arxiv.org/pdf/2412.03205v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.08004v2","updated":"2024-12-04T10:35:25Z","published":"2024-03-12T18:12:50Z","title":"Leveraging LLMs for On-the-Fly Instruction Guided Image Editing","summary":" The combination of language processing and image processing keeps attracting\nincreased interest given recent impressive advances that leverage the combined\nstrengths of both domains of research. Among these advances, the task of\nediting an image on the basis solely of a natural language instruction stands\nout as a most challenging endeavour. While recent approaches for this task\nresort, in one way or other, to some form of preliminary preparation, training\nor fine-tuning, this paper explores a novel approach: We propose a\npreparation-free method that permits instruction-guided image editing on the\nfly. This approach is organized along three steps properly orchestrated that\nresort to image captioning and DDIM inversion, followed by obtaining the edit\ndirection embedding, followed by image editing proper. 
While dispensing with\npreliminary preparation, our approach demonstrates to be effective and\ncompetitive, outperforming recent, state of the art models for this task when\nevaluated on the MAGICBRUSH dataset.\n","authors":["Rodrigo Santos","João Silva","António Branco"],"pdf_url":"https://arxiv.org/pdf/2403.08004v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.06209v3","updated":"2024-12-04T10:33:18Z","published":"2024-04-09T10:58:21Z","title":"Elephants Never Forget: Memorization and Learning of Tabular Data in\n Large Language Models","summary":" While many have shown how Large Language Models (LLMs) can be applied to a\ndiverse set of tasks, the critical issues of data contamination and\nmemorization are often glossed over. In this work, we address this concern for\ntabular data. Specifically, we introduce a variety of different techniques to\nassess whether a language model has seen a tabular dataset during training.\nThis investigation reveals that LLMs have memorized many popular tabular\ndatasets verbatim. We then compare the few-shot learning performance of LLMs on\ndatasets that were seen during training to the performance on datasets released\nafter training. We find that LLMs perform better on datasets seen during\ntraining, indicating that memorization leads to overfitting. At the same time,\nLLMs show non-trivial performance on novel datasets and are surprisingly robust\nto data transformations. We then investigate the in-context statistical\nlearning abilities of LLMs. While LLMs are significantly better than random at\nsolving statistical classification problems, the sample efficiency of few-shot\nlearning lags behind traditional statistical learning algorithms, especially as\nthe dimension of the problem increases. This suggests that much of the observed\nfew-shot performance on novel real-world datasets is due to the LLM's world\nknowledge. 
Overall, our results highlight the importance of testing whether an\nLLM has seen an evaluation dataset during pre-training. We release the\nhttps://github.com/interpretml/LLM-Tabular-Memorization-Checker Python package\nto test LLMs for memorization of tabular datasets.\n","authors":["Sebastian Bordt","Harsha Nori","Vanessa Rodrigues","Besmira Nushi","Rich Caruana"],"pdf_url":"https://arxiv.org/pdf/2404.06209v3.pdf","comment":"COLM camera ready, fix typo"},{"id":"http://arxiv.org/abs/2412.03187v1","updated":"2024-12-04T10:15:12Z","published":"2024-12-04T10:15:12Z","title":"Weighted-Reward Preference Optimization for Implicit Model Fusion","summary":" While fusing heterogeneous open-source LLMs with varying architectures and\nsizes can potentially integrate the strengths of different models, existing\nfusion methods face significant challenges, such as vocabulary alignment and\nmerging distribution matrices. These procedures are not only complex but also\nprone to introducing noise and errors. In this paper, we propose an implicit\nfusion method, Weighted-Reward Preference Optimization (WRPO), which leverages\npreference optimization between the source LLMs and the target LLM to transfer\ntheir capabilities effectively. WRPO eliminates the need for vocabulary\nalignment and matrix fusion and can be efficiently scaled to accommodate\nvarious LLMs. To address distributional deviations between the source and\ntarget LLMs, WRPO introduces a progressive adaptation strategy that gradually\nshifts reliance on preferred examples from the target LLM to the source LLMs.\nExtensive experiments on the MT-Bench, AlpacaEval-2, and Arena-Hard benchmarks\ndemonstrate that WRPO consistently outperforms existing knowledge fusion\nmethods and various fine-tuning baselines. When applied to LLaMA3-8B-Instruct\nas the target model, WRPO achieves a length-controlled win rate of 55.9%\nagainst GPT-4-Preview-1106 on AlpacaEval-2 and a win rate of 46.2% against\nGPT-4-0314 on Arena-Hard. 
Our code is available at\n\\url{https://github.com/SLIT-AI/WRPO}.\n","authors":["Ziyi Yang","Fanqi Wan","Longguang Zhong","Tianyuan Shi","Xiaojun Quan"],"pdf_url":"https://arxiv.org/pdf/2412.03187v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2412.03176v1","updated":"2024-12-04T09:57:57Z","published":"2024-12-04T09:57:57Z","title":"Automatic detection of diseases in Spanish clinical notes combining\n medical language models and ontologies","summary":" In this paper we present a hybrid method for the automatic detection of\ndermatological pathologies in medical reports. We use a large language model\ncombined with medical ontologies to predict, given a first appointment or\nfollow-up medical report, the pathology a person may suffer from. The results\nshow that teaching the model to learn the type, severity and location on the\nbody of a dermatological pathology, as well as in which order it has to learn\nthese three features, significantly increases its accuracy. The article\npresents the demonstration of state-of-the-art results for classification of\nmedical texts with a precision of 0.84, micro and macro F1-score of 0.82 and\n0.75, and makes both the method and the data set used available to the\ncommunity.\n","authors":["Leon-Paul Schaub Torre","Pelayo Quiros","Helena Garcia Mieres"],"pdf_url":"https://arxiv.org/pdf/2412.03176v1.pdf","comment":"Translation of SEPLN 2024 es paper"},{"id":"http://arxiv.org/abs/2407.15017v4","updated":"2024-12-04T09:54:59Z","published":"2024-07-22T06:15:59Z","title":"Knowledge Mechanisms in Large Language Models: A Survey and Perspective","summary":" Understanding knowledge mechanisms in Large Language Models (LLMs) is crucial\nfor advancing towards trustworthy AGI. This paper reviews knowledge mechanism\nanalysis from a novel taxonomy including knowledge utilization and evolution.\nKnowledge utilization delves into the mechanism of memorization, comprehension\nand application, and creation. 
Knowledge evolution focuses on the dynamic\nprogression of knowledge within individual and group LLMs. Moreover, we discuss\nwhat knowledge LLMs have learned, the reasons for the fragility of parametric\nknowledge, and the potential dark knowledge (hypothesis) that will be\nchallenging to address. We hope this work can help understand knowledge in LLMs\nand provide insights for future research.\n","authors":["Mengru Wang","Yunzhi Yao","Ziwen Xu","Shuofei Qiao","Shumin Deng","Peng Wang","Xiang Chen","Jia-Chen Gu","Yong Jiang","Pengjun Xie","Fei Huang","Huajun Chen","Ningyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2407.15017v4.pdf","comment":"EMNLP 2024 Findings; 39 pages (v4)"},{"id":"http://arxiv.org/abs/2412.03160v1","updated":"2024-12-04T09:38:11Z","published":"2024-12-04T09:38:11Z","title":"Byte BPE Tokenization as an Inverse string Homomorphism","summary":" Tokenization is an important preprocessing step in the training and inference\nof large language models (LLMs). While there has been extensive research on the\nexpressive power of the neural achitectures used in LLMs, the impact of\ntokenization has not been well understood. In this work, we demonstrate that\ntokenization, irrespective of the algorithm used, acts as an inverse\nhomomorphism between strings and tokens. This suggests that the character space\nof the source language and the token space of the tokenized language are\nhomomorphic, preserving the structural properties of the source language.\nAdditionally, we explore the concept of proper tokenization, which refers to an\nunambiguous tokenization returned from the tokenizer. 
Our analysis reveals that\nthe expressiveness of neural architectures in recognizing context-free\nlanguages is not affected by tokenization.\n","authors":["Saibo Geng","Sankalp Gambhir","Chris Wendler","Robert West"],"pdf_url":"https://arxiv.org/pdf/2412.03160v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03159v1","updated":"2024-12-04T09:36:24Z","published":"2024-12-04T09:36:24Z","title":"Multi-Level Correlation Network For Few-Shot Image Classification","summary":" Few-shot image classification(FSIC) aims to recognize novel classes given few\nlabeled images from base classes. Recent works have achieved promising\nclassification performance, especially for metric-learning methods, where a\nmeasure at only image feature level is usually used. In this paper, we argue\nthat measure at such a level may not be effective enough to generalize from\nbase to novel classes when using only a few images. Instead, a multi-level\ndescriptor of an image is taken for consideration in this paper. We propose a\nmulti-level correlation network (MLCN) for FSIC to tackle this problem by\neffectively capturing local information. Concretely, we present the\nself-correlation module and cross-correlation module to learn the semantic\ncorrespondence relation of local information based on learned representations.\nMoreover, we propose a pattern-correlation module to capture the pattern of\nfine-grained images and find relevant structural patterns between base classes\nand novel classes. Extensive experiments and analysis show the effectiveness of\nour proposed method on four widely-used FSIC benchmarks. 
The code for our\napproach is available at: https://github.com/Yunkai696/MLCN.\n","authors":["Yunkai Dang","Min Zhang","Zhengyu Chen","Xinliang Zhang","Zheng Wang","Meijun Sun","Donglin Wang"],"pdf_url":"https://arxiv.org/pdf/2412.03159v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.00809v2","updated":"2024-12-04T09:26:47Z","published":"2024-10-23T16:16:15Z","title":"Adaptive Dense Reward: Understanding the Gap Between Action and Reward\n Space in Alignment","summary":" Reinforcement Learning from Human Feedback (RLHF) has proven highly effective\nin aligning Large Language Models (LLMs) with human preferences. However, the\noriginal RLHF typically optimizes under an overall reward, which can lead to a\nsuboptimal learning process. This limitation stems from RLHF's lack of\nawareness regarding which specific tokens should be reinforced or suppressed.\nMoreover, conflicts in supervision can arise, for instance, when a chosen\nresponse includes erroneous tokens, while a rejected response contains accurate\nelements. To rectify these shortcomings, increasing dense reward methods, such\nas step-wise and token-wise RLHF, have been proposed. However, these existing\nmethods are limited to specific tasks (like mathematics). In this paper, we\npropose the ``Adaptive Message-wise RLHF'' method, which robustly applies to\nvarious tasks. By defining pivot tokens as key indicators, our approach\nadaptively identifies essential information and converts sequence-level\nsupervision into fine-grained, subsequence-level supervision. This aligns the\ndensity of rewards and action spaces more closely with the information density\nof the input. Experiments demonstrate that our method can be integrated into\nvarious training methods, significantly mitigating hallucinations and\ncatastrophic forgetting problems, while outperforming other methods on multiple\nevaluation metrics. 
Our method improves the success rate on adversarial samples\nby 10\\% compared to the sample-wise approach, and achieves a 1.3\\% improvement\non evaluation benchmarks such as MMLU, GSM8K, HumanEval, etc.\n","authors":["Yanshi Li","Shaopan Xiong","Gengru Chen","Xiaoyang Li","Yijia Luo","Xingyao Zhang","Yanhui Huang","Xingyuan Bu","Yingshui Tan","Chun Yuan","Jiamang Wang","Wenbo Su","Bo Zheng"],"pdf_url":"https://arxiv.org/pdf/2411.00809v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03152v1","updated":"2024-12-04T09:21:46Z","published":"2024-12-04T09:21:46Z","title":"A Measure of the System Dependence of Automated Metrics","summary":" Automated metrics for Machine Translation have made significant progress,\nwith the goal of replacing expensive and time-consuming human evaluations.\nThese metrics are typically assessed by their correlation with human judgments,\nwhich captures the monotonic relationship between human and metric scores.\nHowever, we argue that it is equally important to ensure that metrics treat all\nsystems fairly and consistently. In this paper, we introduce a method to\nevaluate this aspect.\n","authors":["Pius von Däniken","Jan Deriu","Mark Cieliebak"],"pdf_url":"https://arxiv.org/pdf/2412.03152v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03148v1","updated":"2024-12-04T09:14:56Z","published":"2024-12-04T09:14:56Z","title":"Fine-Grained Behavior Simulation with Role-Playing Large Language Model\n on Social Media","summary":" Large language models (LLMs) have demonstrated impressive capabilities in\nrole-playing tasks. However, there is limited research on whether LLMs can\naccurately simulate user behavior in real-world scenarios, such as social\nmedia. This requires models to effectively analyze a user's history and\nsimulate their role. In this paper, we introduce \\textbf{FineRob}, a novel\nfine-grained behavior simulation dataset. 
We collect the complete behavioral\nhistory of 1,866 distinct users across three social media platforms. Each\nbehavior is decomposed into three fine-grained elements: object, type, and\ncontent, resulting in 78.6k QA records. Based on FineRob, we identify two\ndominant reasoning patterns in LLMs' behavior simulation processes and propose\nthe \\textbf{OM-CoT} fine-tuning method to enhance the capability. Through\ncomprehensive experiments, we conduct an in-depth analysis of key factors of\nbehavior simulation and also demonstrate the effectiveness of OM-CoT\napproach\\footnote{Code and dataset are available at\n\\url{https://github.com/linkseed18612254945/FineRob}}\n","authors":["Kun Li","Chenwei Dai","Wei Zhou","Songlin Hu"],"pdf_url":"https://arxiv.org/pdf/2412.03148v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02323v2","updated":"2024-12-04T09:08:45Z","published":"2024-12-03T09:38:22Z","title":"Pay Attention to the Robustness of Chinese Minority Language Models!\n Syllable-level Textual Adversarial Attack on Tibetan Script","summary":" The textual adversarial attack refers to an attack method in which the\nattacker adds imperceptible perturbations to the original texts by elaborate\ndesign so that the NLP (natural language processing) model produces false\njudgments. This method is also used to evaluate the robustness of NLP models.\nCurrently, most of the research in this field focuses on English, and there is\nalso a certain amount of research on Chinese. However, to the best of our\nknowledge, there is little research targeting Chinese minority languages.\nTextual adversarial attacks are a new challenge for the information processing\nof Chinese minority languages. In response to this situation, we propose a\nTibetan syllable-level black-box textual adversarial attack called TSAttacker\nbased on syllable cosine distance and scoring mechanism. 
And then, we conduct\nTSAttacker on six models generated by fine-tuning two PLMs (pre-trained\nlanguage models) for three downstream tasks. The experiment results show that\nTSAttacker is effective and generates high-quality adversarial samples. In\naddition, the robustness of the involved models still has much room for\nimprovement.\n","authors":["Xi Cao","Dolma Dawa","Nuo Qun","Trashi Nyima"],"pdf_url":"https://arxiv.org/pdf/2412.02323v2.pdf","comment":"Revised Version; Accepted at ACL 2023 Workshop on TrustNLP"},{"id":"http://arxiv.org/abs/2411.16730v3","updated":"2024-12-04T08:21:17Z","published":"2024-11-23T09:32:44Z","title":"\"Moralized\" Multi-Step Jailbreak Prompts: Black-Box Testing of\n Guardrails in Large Language Models for Verbal Attacks","summary":" As the application of large language models continues to expand in various\nfields, it poses higher challenges to the effectiveness of identifying harmful\ncontent generation and guardrail mechanisms. This research aims to evaluate the\nguardrail effectiveness of GPT-4o, Grok-2 Beta, Llama 3.1 (405B), Gemini 1.5,\nand Claude 3.5 Sonnet through black-box testing of seemingly ethical multi-step\njailbreak prompts. It conducts ethical attacks by designing an identical\nmulti-step prompts that simulates the scenario of \"corporate middle managers\ncompeting for promotions.\" The data results show that the guardrails of the\nabove-mentioned LLMs were bypassed and the content of verbal attacks was\ngenerated. Claude 3.5 Sonnet's resistance to multi-step jailbreak prompts is\nmore obvious. To ensure objectivity, the experimental process, black box test\ncode, and enhanced guardrail code are uploaded to the GitHub repository:\nhttps://github.com/brucewang123456789/GeniusTrail.git.\n","authors":["Libo Wang"],"pdf_url":"https://arxiv.org/pdf/2411.16730v3.pdf","comment":"This paper has been submitted to Nature Machine Intelligence and\n OpenReview preprints. 
It has 7 pages of text, 3 figures, and 3 tables"},{"id":"http://arxiv.org/abs/2412.03098v1","updated":"2024-12-04T07:53:45Z","published":"2024-12-04T07:53:45Z","title":"A surprisal oracle for when every layer counts","summary":" Active Curriculum Language Modeling (ACLM; Hong et al., 2023) is a learner\ndirected approach to training a language model. We proposed the original\nversion of this process in our submission to the BabyLM 2023 task, and now we\npropose an updated ACLM process for the BabyLM 2024 task. ACLM involves an\niteratively- and dynamically-constructed curriculum informed over the training\nprocess by a model of uncertainty; other training items that are similarly\nuncertain to a least certain candidate item are prioritized. Our new process\nimproves the similarity model so that it is more dynamic, and we run ACLM over\nthe most successful model from the BabyLM 2023 task: ELC-BERT (Charpentier and\nSamuel, 2023). We find that while our models underperform on fine-grained\ngrammatical inferences, they outperform the BabyLM 2024 official base-lines on\ncommon-sense and world-knowledge tasks. We make our code available at https:\n//github.com/asayeed/ActiveBaby.\n","authors":["Xudong Hong","Sharid Loáiciga","Asad Sayeed"],"pdf_url":"https://arxiv.org/pdf/2412.03098v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03096v1","updated":"2024-12-04T07:50:17Z","published":"2024-12-04T07:50:17Z","title":"TOOL-ED: Enhancing Empathetic Response Generation with the Tool Calling\n Capability of LLM","summary":" Empathetic conversation is a crucial characteristic in daily conversations\nbetween individuals. Nowadays, Large Language models (LLMs) have shown\noutstanding performance in generating empathetic responses. Knowledge bases\nlike COMET can assist LLMs in mitigating illusions and enhancing the\nunderstanding of users' intentions and emotions. 
However, models remain heavily\nreliant on fixed knowledge bases and unrestricted incorporation of external\nknowledge can introduce noise. Tool learning is a flexible end-to-end approach\nthat assists LLMs in handling complex problems. In this paper, we propose\nEmotional Knowledge Tool Calling (EKTC) framework, which encapsulates the\ncommonsense knowledge bases as empathetic tools, enabling LLMs to integrate\nexternal knowledge flexibly through tool calling. In order to adapt the models\nto the new task, we construct a novel dataset TOOL-ED based on the\nEMPATHETICMPATHETIC DIALOGUE (ED) dataset. We validate EKTC on the ED dataset,\nand the experimental results demonstrate that our framework can enhance the\nability of LLMs to generate empathetic responses effectively.\n","authors":["Huiying Cao","Yiqun Zhang","Shi Feng","Xiaocui Yang","Daling Wang","Yifei Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.03096v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03092v1","updated":"2024-12-04T07:44:35Z","published":"2024-12-04T07:44:35Z","title":"Revolve: Optimizing AI Systems by Tracking Response Evolution in Textual\n Optimization","summary":" Recent advancements in large language models (LLMs) have significantly\nenhanced the ability of LLM-based systems to perform complex tasks through\nnatural language processing and tool interaction. However, optimizing these\nLLM-based systems for specific tasks remains challenging, often requiring\nmanual interventions like prompt engineering and hyperparameter tuning.\nExisting automatic optimization methods, such as textual feedback-based\ntechniques (e.g., TextGrad), tend to focus on immediate feedback, analogous to\nusing immediate derivatives in traditional numerical gradient descent. However,\nrelying solely on such feedback can be limited when the adjustments made in\nresponse to this feedback are either too small or fluctuate irregularly,\npotentially slowing down or even stalling the optimization process. 
To overcome\nthese challenges, more adaptive methods are needed, especially in situations\nwhere the system's response is evolving slowly or unpredictably. In this paper,\nwe introduce REVOLVE, an optimization method that tracks how \"R\"esponses\n\"EVOLVE\" across iterations in LLM systems. By focusing on the evolution of\nresponses over time, REVOLVE enables more stable and effective optimization by\nmaking thoughtful, progressive adjustments at each step. Experimental results\ndemonstrate that REVOLVE outperforms competitive baselines, achieving a 7.8%\nimprovement in prompt optimization, a 20.72% gain in solution refinement, and a\n29.17% increase in code optimization. Additionally, REVOLVE converges in fewer\niterations, resulting in significant computational savings. These advantages\nhighlight its adaptability and efficiency, positioning REVOLVE as a valuable\ntool for optimizing LLM-based systems and accelerating the development of\nnext-generation AI technologies. Code is available at:\nhttps://github.com/Peiyance/REVOLVE.\n","authors":["Peiyan Zhang","Haibo Jin","Leyang Hu","Xinnuo Li","Liying Kang","Man Luo","Yangqiu Song","Haohan Wang"],"pdf_url":"https://arxiv.org/pdf/2412.03092v1.pdf","comment":"20 pages, 2 figures"},{"id":"http://arxiv.org/abs/2411.13082v3","updated":"2024-12-04T07:22:45Z","published":"2024-11-20T07:20:48Z","title":"Patience Is The Key to Large Language Model Reasoning","summary":" Recent advancements in the field of large language models, particularly\nthrough the Chain of Thought (CoT) approach, have demonstrated significant\nimprovements in solving complex problems. However, existing models either tend\nto sacrifice detailed reasoning for brevity due to user preferences, or require\nextensive and expensive training data to learn complicated reasoning ability,\nlimiting their potential in solving complex tasks. 
To bridge this gap,\nfollowing the concept of scaling test-time, we propose a simple method by\nencouraging models to adopt a more patient reasoning style without the need of\nintroducing new knowledge or skills. To employ a preference optimization\napproach, we generate detailed reasoning processes as positive examples and\nsimple answers as negative examples, thereby training the model to favor\nthoroughness in its responses. Our results demonstrate a performance increase\nof up to 2.1% on GSM8k with training just on a lightweight dataset.\n","authors":["Yijiong Yu"],"pdf_url":"https://arxiv.org/pdf/2411.13082v3.pdf","comment":"The dataset and model are available at\n https://huggingface.co/datasets/yuyijiong/patient-math-cot"},{"id":"http://arxiv.org/abs/2410.07170v2","updated":"2024-12-04T07:18:17Z","published":"2024-10-09T17:59:06Z","title":"One Initialization to Rule them All: Fine-tuning via Explained Variance\n Adaptation","summary":" Foundation models (FMs) are pre-trained on large-scale datasets and then\nfine-tuned on a downstream task for a specific application. The most successful\nand most commonly used fine-tuning method is to update the pre-trained weights\nvia a low-rank adaptation (LoRA). LoRA introduces new weight matrices that are\nusually initialized at random with a uniform rank distribution across the model\nweights. Recent works focus on different initialization schemes or the learning\nof adaptive ranks during fine-tuning. Both approaches have only been\ninvestigated in isolation, resulting in slow convergence or a uniform rank\ndistribution, in turn leading to suboptimal performance. We propose to improve\nLoRA by initializing the new weights in a data-driven manner by computing\nsingular value decomposition (SVD) on minibatches of activation vectors. 
Then,\nwe initialize the LoRA matrices with the obtained right-singular vectors and\nredistribute ranks among all weight matrices to provably store the maximum\namount of information of the downstream data in the newly introduced weights.\nIn this way, only what information to maintain or neglect during the\nfine-tuning process needs to be learned. We call our new method Explained\nVariance Adaptation (EVA). We apply EVA to a variety of fine-tuning tasks\nranging from language generation and understanding to image classification and\nreinforcement learning. EVA exhibits faster convergence than competitors and\nachieves the highest average score across a multitude of tasks per domain while\nreducing the number of trainable parameters through rank redistribution.\n","authors":["Fabian Paischer","Lukas Hauzenberger","Thomas Schmied","Benedikt Alkin","Marc Peter Deisenroth","Sepp Hochreiter"],"pdf_url":"https://arxiv.org/pdf/2410.07170v2.pdf","comment":"11 pages + references and appendix, code available at\n https://github.com/ml-jku/EVA"},{"id":"http://arxiv.org/abs/2410.04422v6","updated":"2024-12-04T07:03:49Z","published":"2024-10-06T09:29:19Z","title":"Long-context Language Models Are Not Good At Retrieval Without Enough\n Steps","summary":" Long-context language models (LCLMs), characterized by their extensive\ncontext window, are becoming increasingly popular. However, despite they are\nnearly perfect at standard long-context retrieval, we find they are actually\nnot good at all of them. Specifically, we identify 2 basic cases,\n\"multi-matching retrieval,\" and \"logic-based retrieval\", which LLMs struggle to\nsolve under normal settings. Moreover, we find these cases can only be well\naddressed by specific CoT prompting, with enough reasoning steps. 
This finding\nreminds the developers and users of LCLMs that relying on LCLMs to directly\nperform even basic retrieval tasks may be unreliable, rather, a sufficiently\nlong reasoning process is necessary.\n","authors":["Yijiong Yu","Ma Xiufa","Fang Jianwei","Zhi Xu","Su Guangyao","Wang Jiancheng","Yongfeng Huang","Zhixiao Qi","Wei Wang","Weifeng Liu","Ran Chen","Ji Pei"],"pdf_url":"https://arxiv.org/pdf/2410.04422v6.pdf","comment":"Our code is publicly available at\n https://github.com/yuyijiong/hard_retrieval_for_llm and the datasets is at\n https://huggingface.co/datasets/yuyijiong/difficult_retrieval"},{"id":"http://arxiv.org/abs/2412.03075v1","updated":"2024-12-04T06:52:10Z","published":"2024-12-04T06:52:10Z","title":"ASR-EC Benchmark: Evaluating Large Language Models on Chinese ASR Error\n Correction","summary":" Automatic speech Recognition (ASR) is a fundamental and important task in the\nfield of speech and natural language processing. It is an inherent building\nblock in many applications such as voice assistant, speech translation, etc.\nDespite the advancement of ASR technologies in recent years, it is still\ninevitable for modern ASR systems to have a substantial number of erroneous\nrecognition due to environmental noise, ambiguity, etc. Therefore, the error\ncorrection in ASR is crucial.\n Motivated by this, this paper studies ASR error correction in the Chinese\nlanguage, which is one of the most popular languages and enjoys a large number\nof users in the world. We first create a benchmark dataset named \\emph{ASR-EC}\nthat contains a wide spectrum of ASR errors generated by industry-grade ASR\nsystems. To the best of our knowledge, it is the first Chinese ASR error\ncorrection benchmark. Then, inspired by the recent advances in \\emph{large\nlanguage models (LLMs)}, we investigate how to harness the power of LLMs to\ncorrect ASR errors. 
We apply LLMs to ASR error correction in three paradigms.\nThe first paradigm is prompting, which is further categorized as zero-shot,\nfew-shot, and multi-step. The second paradigm is finetuning, which finetunes\nLLMs with ASR error correction data. The third paradigm is multi-modal\naugmentation, which collectively utilizes the audio and ASR transcripts for\nerror correction. Extensive experiments reveal that prompting is not effective\nfor ASR error correction. Finetuning is effective only for a portion of LLMs.\nMulti-modal augmentation is the most effective method for error correction and\nachieves state-of-the-art performance.\n","authors":["Victor Junqiu Wei","Weicheng Wang","Di Jiang","Yuanfeng Song","Lu Wang"],"pdf_url":"https://arxiv.org/pdf/2412.03075v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03074v1","updated":"2024-12-04T06:52:03Z","published":"2024-12-04T06:52:03Z","title":"Analytic Study of Text-Free Speech Synthesis for Raw Audio using a\n Self-Supervised Learning Model","summary":" We examine the text-free speech representations of raw audio obtained from a\nself-supervised learning (SSL) model by analyzing the synthesized speech using\nthe SSL representations instead of conventional text representations. Since raw\naudio does not have paired speech representations as transcribed texts do,\nobtaining speech representations from unpaired speech is crucial for augmenting\navailable datasets for speech synthesis. Specifically, the proposed speech\nsynthesis is conducted using discrete symbol representations from the SSL model\nin comparison with text representations, and analytical examinations of the\nsynthesized speech have been carried out. 
The results empirically show that\nusing text representations is advantageous for preserving semantic information,\nwhile using discrete symbol representations is superior for preserving acoustic\ncontent, including prosodic and intonational information.\n","authors":["Joonyong Park","Daisuke Saito","Nobuaki Minematsu"],"pdf_url":"https://arxiv.org/pdf/2412.03074v1.pdf","comment":"APSIPA ASC 2024"},{"id":"http://arxiv.org/abs/2311.17696v4","updated":"2024-12-04T06:33:55Z","published":"2023-11-29T15:02:46Z","title":"How to Build an AI Tutor that Can Adapt to Any Course and Provide\n Accurate Answers Using Large Language Model and Retrieval-Augmented\n Generation","summary":" This paper proposes a low-code solution to build an AI tutor that leverages\nadvanced AI techniques to provide accurate and contextually relevant responses\nin a personalized learning environment. The OpenAI Assistants API allows AI\nTutor to easily embed, store, retrieve, and manage files and chat history,\nenabling a low-code solution. Large Language Models (LLMs) and\nRetrieval-Augmented Generation (RAG) technology generate sophisticated answers\nbased on course-specific materials. The application efficiently organizes and\nretrieves relevant information through vector embedding and similarity-based\nretrieval algorithms. The AI Tutor prototype demonstrates its ability to\ngenerate relevant, accurate answers with source citations. 
It represents a\nsignificant advancement in technology-enhanced tutoring systems, democratizing\naccess to high-quality, customized educational support in higher education.\n","authors":["Chenxi Dong","Kan Chen","Shupei Cheng","Chujie Wen"],"pdf_url":"https://arxiv.org/pdf/2311.17696v4.pdf","comment":"4 pages, 6 figures"},{"id":"http://arxiv.org/abs/2406.12329v2","updated":"2024-12-04T06:30:06Z","published":"2024-06-18T06:54:05Z","title":"Opt-Out: Investigating Entity-Level Unlearning for Large Language Models\n via Optimal Transport","summary":" Instruction-following large language models (LLMs), such as ChatGPT, have\nbecome widely popular among everyday users. However, these models inadvertently\ndisclose private, sensitive information to their users, underscoring the need\nfor machine unlearning techniques to remove selective information from the\nmodels. While prior work has focused on forgetting small, random subsets of\ntraining data at the instance-level, we argue that real-world scenarios often\nrequire the removal of an entire user data, which may require a more careful\nmaneuver. In this study, we explore entity-level unlearning, which aims to\nerase all knowledge related to a target entity while preserving the remaining\nmodel capabilities. To address this, we introduce Opt-Out, an optimal\ntransport-based unlearning method that utilizes the Wasserstein distance from\nthe model's initial parameters to achieve more effective and fine-grained\nunlearning. We also present the first Entity-Level Unlearning Dataset (ELUDe)\ndesigned to evaluate entity-level unlearning. 
Our empirical results demonstrate\nthat Opt-Out surpasses existing methods, establishing a new standard for secure\nand adaptable LLMs that can accommodate user data removal requests without the\nneed for full retraining.\n","authors":["Minseok Choi","Daniel Rim","Dohyun Lee","Jaegul Choo"],"pdf_url":"https://arxiv.org/pdf/2406.12329v2.pdf","comment":"17 pages, 10 figures"},{"id":"http://arxiv.org/abs/2412.00721v2","updated":"2024-12-04T06:23:40Z","published":"2024-12-01T08:07:01Z","title":"A Comparative Study of LLM-based ASR and Whisper in Low Resource and\n Code Switching Scenario","summary":" Large Language Models (LLMs) have showcased exceptional performance across\ndiverse NLP tasks, and their integration with speech encoders is rapidly\nemerging as a dominant trend in the Automatic Speech Recognition (ASR) field.\nPrevious work has mainly concentrated on leveraging LLMs for speech recognition in\nEnglish and Chinese. However, their potential for addressing speech recognition\nchallenges in low resource settings remains underexplored. Hence, in this work,\nwe aim to explore the capability of LLMs in low resource ASR and\nMandarin-English code switching ASR. We also evaluate and compare the\nrecognition performance of LLM-based ASR systems against the Whisper model.\nExtensive experiments demonstrate that LLM-based ASR yields a relative gain of\n12.8\\% over the Whisper model in low resource ASR while Whisper performs better\nin Mandarin-English code switching ASR. 
We hope that this study could shed\nlight on ASR for low resource scenarios.\n","authors":["Zheshu Song","Ziyang Ma","Yifan Yang","Jianheng Zhuo","Xie Chen"],"pdf_url":"https://arxiv.org/pdf/2412.00721v2.pdf","comment":"This work hasn't been finished yet"},{"id":"http://arxiv.org/abs/2411.10145v2","updated":"2024-12-04T05:54:43Z","published":"2024-11-15T12:39:02Z","title":"An Effective Framework to Help Large Language Models Handle\n Numeric-involved Long-context Tasks","summary":" Large Language Models (LLMs) have demonstrated remarkable capabilities in\nhandling long texts and have almost perfect performance in traditional\nretrieval tasks. However, their performance significantly degrades when it\ncomes to numerical calculations in long contexts. Numeric-involved\nlong-context tasks typically cannot be addressed by current LLMs in normal\nsettings due to their inherent limitations in simultaneously handling complex\nand massive information. Some CoT-like prompting methods can improve accuracy\nbut demand massive output tokens, which is costly and slow. To address this\nissue, we propose a workflow that decomposes a numeric-involved long-context\ntask into 4 low-level subtasks: judging, extracting, processing with code,\nand conclusion. The first 2 subtasks are relatively simple, which allows us to\nuse smaller models for efficiently processing long context. When numerical\ncalculations are required, we use code generated by LLMs to avoid the\ndisadvantage of LLMs not being good at calculations. 
The results on 2\nnumeric-involved long-context benchmarks demonstrate that our workflow can not only\nimprove accuracy, but also significantly reduce the cost of API calls.\n","authors":["Yijiong Yu"],"pdf_url":"https://arxiv.org/pdf/2411.10145v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.15862v2","updated":"2024-12-04T05:52:03Z","published":"2024-11-24T14:38:59Z","title":"LLMs Do Not Think Step-by-step In Implicit Reasoning","summary":" It is well known that Chain-of-Thought can remarkably enhance LLMs'\nperformance on complex tasks. However, because it also introduces slower\ninference speeds and higher computational costs, many studies have attempted\nto use implicit CoT, which does not need LLMs to explicitly generate the\nintermediate steps. But there is still a gap between their efficacy and typical\nexplicit CoT methods. This raises the question: is implicit CoT really\nequivalent to explicit CoT? Therefore, in this study, we address this question\nthrough experiments. We probe the information of intermediate steps from the\nmodel's hidden states when it is performing implicit CoT. The results\nsurprisingly indicate that LLMs hardly think about intermediate steps,\nsuggesting they may just rely on experience rather than strict step-by-step\nreasoning. Moreover, we find LLMs' implicit reasoning capabilities are\nsusceptible and unstable, reaffirming the necessity of explicit CoT to\neffectively support complex tasks.\n","authors":["Yijiong Yu"],"pdf_url":"https://arxiv.org/pdf/2411.15862v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03025v1","updated":"2024-12-04T04:38:35Z","published":"2024-12-04T04:38:35Z","title":"Human Variability vs. 
Machine Consistency: A Linguistic Analysis of\n Texts Generated by Humans and Large Language Models","summary":" The rapid advancements in large language models (LLMs) have significantly\nimproved their ability to generate natural language, making texts generated by\nLLMs increasingly indistinguishable from human-written texts. Recent research\nhas predominantly focused on using LLMs to classify text as either\nhuman-written or machine-generated. In our study, we adopt a different approach\nby profiling texts spanning four domains based on 250 distinct linguistic\nfeatures. We select the M4 dataset from the Subtask B of SemEval 2024 Task 8.\nWe automatically calculate various linguistic features with the LFTK tool and\nadditionally measure the average syntactic depth, semantic similarity, and\nemotional content for each document. We then apply a two-dimensional PCA\nreduction to all the calculated features. Our analyses reveal significant\ndifferences between human-written texts and those generated by LLMs,\nparticularly in the variability of these features, which we find to be\nconsiderably higher in human-written texts. This discrepancy is especially\nevident in text genres with less rigid linguistic style constraints. Our\nfindings indicate that humans write texts that are less cognitively demanding,\nwith higher semantic content, and richer emotional content compared to texts\ngenerated by LLMs. These insights underscore the need for incorporating\nmeaningful linguistic features to enhance the understanding of textual outputs\nof LLMs.\n","authors":["Sergio E. 
Zanotto","Segun Aroyehun"],"pdf_url":"https://arxiv.org/pdf/2412.03025v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.15316v4","updated":"2024-12-04T04:08:49Z","published":"2023-11-26T14:35:23Z","title":"Sibyl: Empowering Empathetic Dialogue Generation in Large Language\n Models via Sensible and Visionary Commonsense Inference","summary":" Recently, there has been a heightened interest in building chatbots based on\nLarge Language Models (LLMs) to emulate human-like qualities in multi-turn\nconversations. Despite having access to commonsense knowledge to better\nunderstand the psychological aspects and causality of dialogue context, even\nthese powerful LLMs struggle to achieve the goals of empathy and emotional\nsupport. Current commonsense knowledge derived from dialogue contexts is\ninherently limited and often fails to adequately anticipate the future course\nof a dialogue. This lack of foresight can mislead LLMs and hinder their ability\nto provide effective support. In response to this challenge, we present an\ninnovative framework named Sensible and Visionary Commonsense Knowledge\n(Sibyl). Designed to concentrate on the immediately succeeding dialogue, this\nparadigm equips LLMs with the capability to uncover the implicit requirements\nof the conversation, aiming to elicit more empathetic responses. 
Experimental\nresults demonstrate that incorporating our paradigm for acquiring commonsense\nknowledge into LLMs comprehensively enhances the quality of their responses.\n","authors":["Lanrui Wang","Jiangnan Li","Chenxu Yang","Zheng Lin","Hongyin Tang","Huan Liu","Yanan Cao","Jingang Wang","Weiping Wang"],"pdf_url":"https://arxiv.org/pdf/2311.15316v4.pdf","comment":"Accepted by COLING 2025"},{"id":"http://arxiv.org/abs/2412.01130v2","updated":"2024-12-04T03:34:42Z","published":"2024-12-02T05:10:41Z","title":"Enhancing Function-Calling Capabilities in LLMs: Strategies for Prompt\n Formats, Data Integration, and Multilingual Translation","summary":" Large language models (LLMs) have significantly advanced autonomous agents,\nparticularly in zero-shot tool usage, also known as function calling. This\nresearch delves into enhancing the function-calling capabilities of LLMs by\nexploring different approaches, including prompt formats for integrating\nfunction descriptions, blending function-calling and instruction-following\ndata, introducing a novel Decision Token for conditional prompts, leveraging\nchain-of-thought reasoning, and overcoming multilingual challenges with a\ntranslation pipeline. Our key findings and contributions are as follows: (1)\nInstruction-following data improves both function-calling accuracy and\nrelevance detection. (2) The use of the newly proposed Decision Token, combined\nwith synthetic non-function-call data, enhances relevance detection. (3) A\ntailored translation pipeline effectively overcomes multilingual limitations,\ndemonstrating significant improvements in Traditional Chinese. 
These insights\nhighlight the potential for improved function-calling capabilities and\nmultilingual applications in LLMs.\n","authors":["Yi-Chang Chen","Po-Chun Hsu","Chan-Jan Hsu","Da-shan Shiu"],"pdf_url":"https://arxiv.org/pdf/2412.01130v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.16146v2","updated":"2024-12-04T03:21:44Z","published":"2024-09-24T14:52:14Z","title":"Controlling Risk of Retrieval-augmented Generation: A Counterfactual\n Prompting Framework","summary":" Retrieval-augmented generation (RAG) has emerged as a popular solution to\nmitigate the hallucination issues of large language models. However, existing\nstudies on RAG seldom address the issue of predictive uncertainty, i.e., how\nlikely it is that a RAG model's prediction is incorrect, resulting in\nuncontrollable risks in real-world applications. In this work, we emphasize the\nimportance of risk control, ensuring that RAG models proactively refuse to\nanswer questions with low confidence. Our research identifies two critical\nlatent factors affecting RAG's confidence in its predictions: the quality of\nthe retrieved results and the manner in which these results are utilized. To\nguide RAG models in assessing their own confidence based on these two latent\nfactors, we develop a counterfactual prompting framework that induces the\nmodels to alter these factors and analyzes the effect on their answers. We also\nintroduce a benchmarking procedure to collect answers with the option to\nabstain, facilitating a series of experiments. For evaluation, we introduce\nseveral risk-related metrics and the experimental results demonstrate the\neffectiveness of our approach. 
Our code and benchmark dataset are available at\nhttps://github.com/ict-bigdatalab/RC-RAG.\n","authors":["Lu Chen","Ruqing Zhang","Jiafeng Guo","Yixing Fan","Xueqi Cheng"],"pdf_url":"https://arxiv.org/pdf/2409.16146v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02987v1","updated":"2024-12-04T03:02:46Z","published":"2024-12-04T03:02:46Z","title":"Advancing Conversational Psychotherapy: Integrating Privacy,\n Dual-Memory, and Domain Expertise with Large Language Models","summary":" Mental health has increasingly become a global issue that reveals the\nlimitations of traditional conversational psychotherapy, constrained by\nlocation, time, expense, and privacy concerns. In response to these challenges,\nwe introduce SoulSpeak, a Large Language Model (LLM)-enabled chatbot designed\nto democratize access to psychotherapy. SoulSpeak improves upon the\ncapabilities of standard LLM-enabled chatbots by incorporating a novel\ndual-memory component that combines short-term and long-term context via\nRetrieval Augmented Generation (RAG) to offer personalized responses while\nensuring the preservation of user privacy and intimacy through a dedicated\nprivacy module. In addition, it leverages a counseling chat dataset of\ntherapist-client interactions and various prompting techniques to align the\ngenerated responses with psychotherapeutic methods. We introduce two fine-tuned\nBERT models to evaluate the system against existing LLMs and human therapists:\nthe Conversational Psychotherapy Preference Model (CPPM) to simulate human\npreference among responses and another to assess response relevance to user\ninput. CPPM is useful for training and evaluating psychotherapy-focused\nlanguage models independent from SoulSpeak, helping with the constrained\nresources available for psychotherapy. Furthermore, the effectiveness of the\ndual-memory component and the robustness of the privacy module are also\nexamined. 
Our findings highlight the potential and challenge of enhancing\nmental health care by offering an alternative that combines the expertise of\ntraditional therapy with the advantages of LLMs, providing a promising way to\naddress the accessibility and personalization gap in current mental health\nservices.\n","authors":["XiuYu Zhang","Zening Luo"],"pdf_url":"https://arxiv.org/pdf/2412.02987v1.pdf","comment":"Accepted as a Poster at Statistical Foundations of LLMs and\n Foundation Models (NeurIPS 2024 Workshop)"},{"id":"http://arxiv.org/abs/2412.02980v1","updated":"2024-12-04T02:47:45Z","published":"2024-12-04T02:47:45Z","title":"Surveying the Effects of Quality, Diversity, and Complexity in Synthetic\n Data From Large Language Models","summary":" Synthetic data generation with Large Language Models is a promising paradigm\nfor augmenting natural data over a nearly infinite range of tasks. Given this\nvariety, direct comparisons among synthetic data generation algorithms are\nscarce, making it difficult to understand where improvement comes from and what\nbottlenecks exist. We propose to evaluate algorithms via the makeup of\nsynthetic data generated by each algorithm in terms of data quality, diversity,\nand complexity. We choose these three characteristics for their significance in\nopen-ended processes and the impact each has on the capabilities of downstream\nmodels. We find quality to be essential for in-distribution model\ngeneralization, diversity to be essential for out-of-distribution\ngeneralization, and complexity to be beneficial for both. Further, we emphasize\nthe existence of Quality-Diversity trade-offs in training data and the\ndownstream effects on model performance. We then examine the effect of various\ncomponents in the synthetic data pipeline on each data characteristic. 
This\nexamination allows us to taxonomize and compare synthetic data generation\nalgorithms through the components they utilize and the resulting effects on\ndata QDC composition. This analysis extends into a discussion on the importance\nof balancing QDC in synthetic data for efficient reinforcement learning and\nself-improvement algorithms. Analogous to the QD trade-offs in training data,\noften there exist trade-offs between model output quality and output diversity\nwhich impact the composition of synthetic data. We observe that many models are\ncurrently evaluated and optimized only for output quality, thereby limiting\noutput diversity and the potential for self-improvement. We argue that\nbalancing these trade-offs is essential to the development of future\nself-improvement algorithms and highlight a number of works making progress in\nthis direction.\n","authors":["Alex Havrilla","Andrew Dai","Laura O'Mahony","Koen Oostermeijer","Vera Zisler","Alon Albalak","Fabrizio Milo","Sharath Chandra Raparthy","Kanishk Gandhi","Baber Abbasi","Duy Phung","Maia Iyer","Dakota Mahan","Chase Blagden","Srishti Gureja","Mohammed Hamdy","Wen-Ding Li","Giovanni Paolini","Pawan Sasanka Ammanamanchi","Elliot Meyerson"],"pdf_url":"https://arxiv.org/pdf/2412.02980v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01955v2","updated":"2024-12-04T02:25:04Z","published":"2024-12-02T20:31:27Z","title":"The use of large language models to enhance cancer clinical trial\n educational materials","summary":" Cancer clinical trials often face challenges in recruitment and engagement\ndue to a lack of participant-facing informational and educational resources.\nThis study investigated the potential of Large Language Models (LLMs),\nspecifically GPT4, in generating patient-friendly educational content from\nclinical trial informed consent forms. 
Using data from ClinicalTrials.gov, we\nemployed zero-shot learning for creating trial summaries and one-shot learning\nfor developing multiple-choice questions, evaluating their effectiveness\nthrough patient surveys and crowdsourced annotation. Results showed that\nGPT4-generated summaries were both readable and comprehensive, and may improve\npatients' understanding and interest in clinical trials. The multiple-choice\nquestions demonstrated high accuracy and agreement with crowdsourced\nannotators. For both resource types, hallucinations were identified that\nrequire ongoing human oversight. The findings demonstrate the potential of LLMs\n\"out-of-the-box\" to support the generation of clinical trial education\nmaterials with minimal trial-specific engineering, but implementation with a\nhuman-in-the-loop is still needed to avoid misinformation risks.\n","authors":["Mingye Gao","Aman Varshney","Shan Chen","Vikram Goddla","Jack Gallifant","Patrick Doyle","Claire Novack","Maeve Dillon-Martin","Teresia Perkins","Xinrong Correia","Erik Duhaime","Howard Isenstein","Elad Sharon","Lisa Soleymani Lehmann","David Kozono","Brian Anthony","Dmitriy Dligach","Danielle S. Bitterman"],"pdf_url":"https://arxiv.org/pdf/2412.01955v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.02167v3","updated":"2024-12-04T02:08:13Z","published":"2024-03-04T16:13:39Z","title":"EMOVOME: A Dataset for Emotion Recognition in Spontaneous Real-Life\n Speech","summary":" Spontaneous datasets for Speech Emotion Recognition (SER) are scarce and\nfrequently derived from laboratory environments or staged scenarios, such as TV\nshows, limiting their application in real-world contexts. We developed and\npublicly released the Emotional Voice Messages (EMOVOME) dataset, including 999\nvoice messages from real conversations of 100 Spanish speakers on a messaging\napp, labeled in continuous and discrete emotions by expert and non-expert\nannotators. 
We evaluated speaker-independent SER models using acoustic features\nas baseline and transformer-based models. We compared the results with\nreference datasets including acted and elicited speech, and analyzed the\ninfluence of annotators and gender fairness. The pre-trained\nUniSpeech-SAT-Large model achieved the highest results, 61.64% and 55.57%\nUnweighted Accuracy (UA) for 3-class valence and arousal prediction\nrespectively on EMOVOME, a 10% improvement over baseline models. For the\nemotion categories, 42.58% UA was obtained. EMOVOME performed lower than the\nacted RAVDESS dataset. The elicited IEMOCAP dataset also outperformed EMOVOME\nin predicting emotion categories, while similar results were obtained in\nvalence and arousal. EMOVOME outcomes varied with annotator labels, showing\nbetter results and fairness when combining expert and non-expert annotations.\nThis study highlights the gap between controlled and real-life scenarios,\nsupporting further advancements in recognizing genuine emotions.\n","authors":["Lucía Gómez-Zaragozá","Rocío del Amor","María José Castro-Bleda","Valery Naranjo","Mariano Alcañiz Raya","Javier Marín-Morales"],"pdf_url":"https://arxiv.org/pdf/2403.02167v3.pdf","comment":"This article is a merged version of the description of the EMOVOME\n database in arXiv:2402.17496v1 and the speech emotion recognition models in\n arXiv:2403.02167v1. This work has been submitted to the IEEE for possible\n publication"},{"id":"http://arxiv.org/abs/2412.02956v1","updated":"2024-12-04T02:05:21Z","published":"2024-12-04T02:05:21Z","title":"Curriculum-style Data Augmentation for LLM-based Metaphor Detection","summary":" Recently, utilizing large language models (LLMs) for metaphor detection has\nachieved promising results. However, these methods heavily rely on the\ncapabilities of closed-source LLMs, which come with relatively high inference\ncosts and latency. 
To address this, we propose a method for metaphor detection\nby fine-tuning open-source LLMs, effectively reducing inference costs and\nlatency with a single inference step. Furthermore, metaphor detection suffers\nfrom a severe data scarcity problem, which hinders effective fine-tuning of\nLLMs. To tackle this, we introduce Curriculum-style Data Augmentation (CDA).\nSpecifically, before fine-tuning, we evaluate the training data to identify\ncorrectly predicted instances for fine-tuning, while incorrectly predicted\ninstances are used as seed data for data augmentation. This approach enables\nthe model to quickly learn simpler knowledge and progressively acquire more\ncomplex knowledge, thereby improving performance incrementally. Experimental\nresults demonstrate that our method achieves state-of-the-art performance\nacross all baselines. Additionally, we provide detailed ablation studies to\nvalidate the effectiveness of CDA.\n","authors":["Kaidi Jia","Yanxia Wu","Rongsheng Li"],"pdf_url":"https://arxiv.org/pdf/2412.02956v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02935v1","updated":"2024-12-04T01:07:59Z","published":"2024-12-04T01:07:59Z","title":"Dynamic Graph Neural Ordinary Differential Equation Network for\n Multi-modal Emotion Recognition in Conversation","summary":" Multimodal emotion recognition in conversation (MERC) refers to identifying\nand classifying human emotional states by combining data from multiple\ndifferent modalities (e.g., audio, images, text, video, etc.). Most existing\nmultimodal emotion recognition methods use GCN to improve performance, but\nexisting GCN methods are prone to overfitting and cannot capture the temporal\ndependency of the speaker's emotions. 
To address the above problems, we propose\na Dynamic Graph Neural Ordinary Differential Equation Network (DGODE) for MERC,\nwhich combines the dynamic changes of emotions to capture the temporal\ndependency of speakers' emotions, and effectively alleviates the overfitting\nproblem of GCNs. Technically, the key idea of DGODE is to utilize an adaptive\nmixhop mechanism to improve the generalization ability of GCNs and use the\ngraph ODE evolution network to characterize the continuous dynamics of node\nrepresentations over time and capture temporal dependencies. Extensive\nexperiments on two publicly available multimodal emotion recognition datasets\ndemonstrate that the proposed DGODE model has superior performance compared to\nvarious baselines. Furthermore, the proposed DGODE can also alleviate the\nover-smoothing problem, thereby enabling the construction of a deep GCN\nnetwork.\n","authors":["Yuntao Shou","Tao Meng","Wei Ai","Keqin Li"],"pdf_url":"https://arxiv.org/pdf/2412.02935v1.pdf","comment":"13 pages, 6 figures"},{"id":"http://arxiv.org/abs/2410.08474v3","updated":"2024-12-04T00:43:57Z","published":"2024-10-11T02:58:38Z","title":"SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal\n Large Language Models","summary":" Multimodal Large Language Models (MLLMs) are advancing the ability to reason\nabout complex sports scenarios by integrating textual and visual information.\nTo comprehensively evaluate their capabilities, we introduce SPORTU, a\nbenchmark designed to assess MLLMs across multi-level sports reasoning tasks.\nSPORTU comprises two key components: SPORTU-text, featuring 900 multiple-choice\nquestions with human-annotated explanations for rule comprehension and strategy\nunderstanding. 
This component focuses on testing models' ability to reason\nabout sports solely through question-answering (QA), without requiring visual\ninputs; SPORTU-video, consisting of 1,701 slow-motion video clips across 7\ndifferent sports and 12,048 QA pairs, designed to assess multi-level reasoning,\nfrom simple sports recognition to complex tasks like foul detection and rule\napplication. We evaluate four LLMs using few-shot learning and chain-of-thought (CoT)\nprompting on SPORTU-text. GPT-4o achieves the highest accuracy of 71%, but\nstill falls short of human-level performance, highlighting room for improvement\nin rule comprehension and reasoning. The evaluation for the SPORTU-video part\nincludes 7 proprietary and 6 open-source MLLMs. Experiments show that models\nfall short on hard tasks that require deep reasoning and rule-based\nunderstanding. Claude-3.5-Sonnet performs the best with only 52.6% accuracy on\nthe hard task, showing large room for improvement. 
We hope that SPORTU will\nserve as a critical step toward evaluating models' capabilities in sports\nunderstanding and reasoning.\n","authors":["Haotian Xia","Zhengbang Yang","Junbo Zou","Rhys Tracy","Yuqing Wang","Chi Lu","Christopher Lai","Yanjun He","Xun Shao","Zhuoqing Xie","Yuan-fang Wang","Weining Shen","Hanjie Chen"],"pdf_url":"https://arxiv.org/pdf/2410.08474v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.12914v2","updated":"2024-12-04T00:03:38Z","published":"2024-09-19T17:10:34Z","title":"Mitigating Unsafe Feedback with Learning Constraints","summary":" While there has been progress towards aligning Large Language Models (LLMs)\nwith human values and ensuring safe behaviour at inference time, safety-guards\ncan easily be removed when fine-tuned on unsafe and harmful datasets. While this\nsetting has been treated extensively, another popular training paradigm,\nlearning from unsafe feedback with reinforcement learning, has previously been\nunexplored. This is concerning due to the widespread deployment of feedback\ncollection systems. We address this gap by providing an analysis of learning\nsettings where feedback is adversarial and noisy, i.e. that unsafe samples are\npreferred over safe ones despite model developers' goal to maintain safety. We\nfind that safety-aligned LLMs easily explore unsafe action spaces through\ngenerating harmful text and optimize for adversarial reward, indicating that\ncurrent safety guards are not enough to prevent learning from unsafe feedback.\nIn order to protect against this vulnerability, we adapt a number of both\n\"implicit\" and \"explicit\" harmful fine-tuning defences to evaluate whether they\nare effective as learning constraints in an RL setting, finding that no method\nis generally effective, pointing to the need for more research in defences given\nthe widespread adoption of methods designed to learn from feedback. 
We end the\npaper with the observation that some defences work by performing \"harmless\nreward hacking\" for which we provide a theoretical explanation drawn from the\ntheory of Constrained Markov Decision Processes and provide some direction for\nfuture defence development.\n","authors":["Domenic Rosati","Giles Edkins","Harsh Raj","David Atanasov","Subhabrata Majumdar","Janarthanan Rajendran","Frank Rudzicz","Hassan Sajjad"],"pdf_url":"https://arxiv.org/pdf/2409.12914v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03775v1","updated":"2024-12-04T23:36:23Z","published":"2024-12-04T23:36:23Z","title":"WithdrarXiv: A Large-Scale Dataset for Retraction Study","summary":" Retractions play a vital role in maintaining scientific integrity, yet\nsystematic studies of retractions in computer science and other STEM fields\nremain scarce. We present WithdrarXiv, the first large-scale dataset of\nwithdrawn papers from arXiv, containing over 14,000 papers and their associated\nretraction comments spanning the repository's entire history through September\n2024. Through careful analysis of author comments, we develop a comprehensive\ntaxonomy of retraction reasons, identifying 10 distinct categories ranging from\ncritical errors to policy violations. We demonstrate a simple yet highly\naccurate zero-shot automatic categorization of retraction reasons, achieving a\nweighted average F1-score of 0.96. Additionally, we release WithdrarXiv-SciFy,\nan enriched version including scripts for parsed full-text PDFs, specifically\ndesigned to enable research in scientific feasibility studies, claim\nverification, and automated theorem proving. These findings provide valuable\ninsights for improving scientific quality control and automated verification\nsystems. 
Finally, and most importantly, we discuss ethical issues and take a\nnumber of steps to implement responsible data release while fostering open\nscience in this area.\n","authors":["Delip Rao","Jonathan Young","Thomas Dietterich","Chris Callison-Burch"],"pdf_url":"https://arxiv.org/pdf/2412.03775v1.pdf","comment":"11 pages, 5 figures"},{"id":"http://arxiv.org/abs/2312.11502v2","updated":"2024-12-04T23:09:53Z","published":"2023-12-09T23:43:35Z","title":"Labrador: Exploring the Limits of Masked Language Modeling for\n Laboratory Data","summary":" In this work we introduce Labrador, a pre-trained Transformer model for\nlaboratory data. Labrador and BERT were pre-trained on a corpus of 100 million\nlab test results from electronic health records (EHRs) and evaluated on various\ndownstream outcome prediction tasks. Both models demonstrate mastery of the\npre-training task but neither consistently outperform XGBoost on downstream\nsupervised tasks. Our ablation studies reveal that transfer learning shows\nlimited effectiveness for BERT and achieves marginal success with Labrador. We\nexplore the reasons for the failure of transfer learning and suggest that the\ndata generating process underlying each patient cannot be characterized\nsufficiently using labs alone, among other factors. We encourage future work to\nfocus on joint modeling of multiple EHR data categories and to include\ntree-based baselines in their evaluations.\n","authors":["David R. 
Bellamy","Bhawesh Kumar","Cindy Wang","Andrew Beam"],"pdf_url":"https://arxiv.org/pdf/2312.11502v2.pdf","comment":"26 pages, 8 figures, best paper award at ML4H 2024"},{"id":"http://arxiv.org/abs/2412.03761v1","updated":"2024-12-04T22:59:35Z","published":"2024-12-04T22:59:35Z","title":"Language Model Meets Prototypes: Towards Interpretable Text\n Classification Models through Prototypical Networks","summary":" Pretrained transformer-based Language Models (LMs) are well-known for their\nability to achieve significant improvement on NLP tasks, but their black-box\nnature, which leads to a lack of interpretability, has been a major concern. My\ndissertation focuses on developing intrinsically interpretable models when\nusing LMs as encoders while maintaining their superior performance via\nprototypical networks. I initiated my research by investigating enhancements in\nperformance for interpretable models of sarcasm detection. My proposed approach\nfocuses on capturing sentiment incongruity to enhance accuracy while offering\ninstance-based explanations for the classification decisions. Later, I\ndeveloped a novel white-box multi-head graph attention-based prototype network\ndesigned to explain the decisions of text classification models without\nsacrificing the accuracy of the original black-box LMs. 
In addition, I am\nworking on extending the attention-based prototype network with contrastive\nlearning to redesign an interpretable graph neural network, aiming to enhance\nboth the interpretability and performance of the model in document\nclassification.\n","authors":["Ximing Wen"],"pdf_url":"https://arxiv.org/pdf/2412.03761v1.pdf","comment":"2 pages, 1 figure, accepted by AAAI25 DC"},{"id":"http://arxiv.org/abs/2403.16442v2","updated":"2024-12-04T22:37:07Z","published":"2024-03-25T06:05:50Z","title":"If CLIP Could Talk: Understanding Vision-Language Model Representations\n Through Their Preferred Concept Descriptions","summary":" Recent works often assume that Vision-Language Model (VLM) representations\nare based on visual attributes like shape. However, it is unclear to what\nextent VLMs prioritize this information to represent concepts. We propose\nExtract and Explore (EX2), a novel approach to characterize textual features\nthat are important for VLMs. EX2 uses reinforcement learning to align a large\nlanguage model with VLM preferences and generates descriptions that incorporate\nfeatures that are important for the VLM. Then, we inspect the descriptions to\nidentify features that contribute to VLM representations. Using EX2, we find\nthat spurious descriptions have a major role in VLM representations despite\nproviding no helpful information, e.g., Click to enlarge photo of CONCEPT. More\nimportantly, among informative descriptions, VLMs rely significantly on\nnon-visual attributes like habitat (e.g., North America) to represent visual\nconcepts. Also, our analysis reveals that different VLMs prioritize different\nattributes in their representations. Overall, we show that VLMs do not simply\nmatch images to scene descriptions and that non-visual or even spurious\ndescriptions significantly influence their representations.\n","authors":["Reza Esfandiarpoor","Cristina Menghini","Stephen H. 
Bach"],"pdf_url":"https://arxiv.org/pdf/2403.16442v2.pdf","comment":"EMNLP 2024"},{"id":"http://arxiv.org/abs/2412.03736v1","updated":"2024-12-04T22:04:13Z","published":"2024-12-04T22:04:13Z","title":"Domain-specific Question Answering with Hybrid Search","summary":" Domain specific question answering is an evolving field that requires\nspecialized solutions to address unique challenges. In this paper, we show that\na hybrid approach combining a fine-tuned dense retriever with keyword based\nsparse search methods significantly enhances performance. Our system leverages\na linear combination of relevance signals, including cosine similarity from\ndense retrieval, BM25 scores, and URL host matching, each with tunable boost\nparameters. Experimental results indicate that this hybrid method outperforms\nour single-retriever system, achieving improved accuracy while maintaining\nrobust contextual grounding. These findings suggest that integrating multiple\nretrieval methodologies with weighted scoring effectively addresses the\ncomplexities of domain specific question answering in enterprise settings.\n","authors":["Dewang Sultania","Zhaoyu Lu","Twisha Naik","Franck Dernoncourt","David Seunghyun Yoon","Sanat Sharma","Trung Bui","Ashok Gupta","Tushar Vatsa","Suhas Suresha","Ishita Verma","Vibha Belavadi","Cheng Chen","Michael Friedrich"],"pdf_url":"https://arxiv.org/pdf/2412.03736v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03719v1","updated":"2024-12-04T21:19:20Z","published":"2024-12-04T21:19:20Z","title":"From Language Models over Tokens to Language Models over Characters","summary":" Modern language models are internally -- and mathematically -- distributions\nover token strings rather than \\emph{character} strings, posing numerous\nchallenges for programmers building user applications on top of them. For\nexample, if a prompt is specified as a character string, it must be tokenized\nbefore passing it to the token-level language model. 
Thus, the tokenizer and\nconsequent analyses are very sensitive to the specification of the prompt\n(e.g., if the prompt ends with a space or not). This paper presents algorithms\nfor converting token-level language models to character-level ones. We present\nboth exact and approximate algorithms. In the empirical portion of the paper,\nwe benchmark the practical runtime and approximation quality. We find that --\neven with a small computation budget -- our method is able to accurately\napproximate the character-level distribution (less than 0.00021 excess bits /\ncharacter) at reasonably fast speeds (46.3 characters / second) on the Llama\n3.1 8B language model.\n","authors":["Tim Vieira","Ben LeBrun","Mario Giulianelli","Juan Luis Gastaldi","Brian DuSell","John Terilla","Timothy J. O'Donnell","Ryan Cotterell"],"pdf_url":"https://arxiv.org/pdf/2412.03719v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.01109v2","updated":"2024-12-04T20:57:05Z","published":"2024-10-01T22:35:56Z","title":"Mixing It Up: The Cocktail Effect of Multi-Task Fine-Tuning on LLM\n Performance -- A Case Study in Finance","summary":" The application of large language models (LLMs) in domain-specific contexts,\nincluding finance, has expanded rapidly. Domain-specific LLMs are typically\nevaluated based on their performance in various downstream tasks relevant to\nthe domain. In this work, we present a detailed analysis of fine-tuning LLMs\nfor such tasks. Somewhat counterintuitively, we find that in domain-specific\ncases, fine-tuning exclusively on the target task is not always the most\neffective strategy. Instead, multi-task finetuning - where models are trained\non a cocktail of related tasks - can significantly enhance performance. We\ndemonstrate how this approach enables a small model, such as Phi-3-Mini, to\nachieve state-of-the-art results, even surpassing the much larger GPT-4-o model\non financial benchmarks. 
Our study involves a large-scale experiment,\nconducting over 200 training experiments using several widely adopted LLMs as\nbaselines, and empirically confirms the benefits of multi-task fine-tuning.\nAdditionally, we explore the use of general instruction data as a form of\nregularization, suggesting that it helps minimize performance degradation. We\nalso investigate the inclusion of mathematical data, finding improvements in\nnumerical reasoning that transfer effectively to financial tasks. Finally, we\nnote that while fine-tuning for downstream tasks leads to targeted improvements\nin task performance, it does not necessarily result in broader gains in domain\nknowledge or complex domain reasoning abilities.\n","authors":["Meni Brief","Oded Ovadia","Gil Shenderovitz","Noga Ben Yoash","Rachel Lemberg","Eitam Sheetrit"],"pdf_url":"https://arxiv.org/pdf/2410.01109v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03704v1","updated":"2024-12-04T20:35:07Z","published":"2024-12-04T20:35:07Z","title":"Scaling Inference-Time Search with Vision Value Model for Improved\n Visual Comprehension","summary":" Despite significant advancements in vision-language models (VLMs), there\nlacks effective approaches to enhance response quality by scaling\ninference-time computation. This capability is known to be a core step towards\nthe self-improving models in recent large language model studies. In this\npaper, we present Vision Value Model (VisVM) that can guide VLM inference-time\nsearch to generate responses with better visual comprehension. Specifically,\nVisVM not only evaluates the generated sentence quality in the current search\nstep, but also anticipates the quality of subsequent sentences that may result\nfrom the current step, thus providing a long-term value. In this way, VisVM\nsteers VLMs away from generating sentences prone to hallucinations or\ninsufficient detail, thereby producing higher quality responses. 
Experimental\nresults demonstrate that VisVM-guided search significantly enhances VLMs'\nability to generate descriptive captions with richer visual details and fewer\nhallucinations, compared with greedy decoding and search methods with other\nvisual reward signals. Furthermore, we find that self-training the model with\nthe VisVM-guided captions improves the VLM's performance across a wide range of\nmultimodal benchmarks, indicating the potential for developing self-improving\nVLMs. Our value model and code are available at\nhttps://github.com/si0wang/VisVM.\n","authors":["Wang Xiyao","Yang Zhengyuan","Li Linjie","Lu Hongjin","Xu Yuancheng","Lin Chung-Ching Lin","Lin Kevin","Huang Furong","Wang Lijuan"],"pdf_url":"https://arxiv.org/pdf/2412.03704v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01007v2","updated":"2024-12-04T20:01:42Z","published":"2024-12-01T23:54:12Z","title":"CoRNStack: High-Quality Contrastive Data for Better Code Ranking","summary":" Effective code retrieval plays a crucial role in advancing code generation,\nbug fixing, and software maintenance, particularly as software systems increase\nin complexity. While current code embedding models have demonstrated promise in\nretrieving code snippets for small-scale, well-defined tasks, they often\nunderperform in more demanding real-world applications such as bug localization\nwithin GitHub repositories. We hypothesize that a key issue is their reliance\non noisy and inconsistent datasets for training, which impedes their ability to\ngeneralize to more complex retrieval scenarios. To address these limitations,\nwe introduce CoRNStack, a large-scale, high-quality contrastive training\ndataset for code that spans multiple programming languages. This dataset is\ncurated using consistency filtering to eliminate noisy positives and is further\nenriched with mined hard negatives, thereby facilitating more effective\nlearning. 
We demonstrate that contrastive training of embedding models using\nCoRNStack leads to state-of-the-art performance across a variety of code\nretrieval tasks. Furthermore, the dataset can be leveraged for training code\nreranking models, a largely underexplored area compared to text reranking. Our\nfinetuned code reranking model significantly improves the ranking quality over\nthe retrieved results. Finally, by employing our code retriever and reranker\ntogether, we demonstrate significant improvements in function localization for\nGitHub issues, an important component of real-world software development.\n","authors":["Tarun Suresh","Revanth Gangi Reddy","Yifei Xu","Zach Nussbaum","Andriy Mulyar","Brandon Duderstadt","Heng Ji"],"pdf_url":"https://arxiv.org/pdf/2412.01007v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.10882v5","updated":"2024-12-04T19:56:34Z","published":"2024-02-16T18:36:36Z","title":"Universal Prompt Optimizer for Safe Text-to-Image Generation","summary":" Text-to-Image (T2I) models have shown great performance in generating images\nbased on textual prompts. However, these models are vulnerable to unsafe input\nto generate unsafe content like sexual, harassment and illegal-activity images.\nExisting studies based on image checker, model fine-tuning and embedding\nblocking are impractical in real-world applications. Hence, we propose the\nfirst universal prompt optimizer for safe T2I (POSI) generation in black-box\nscenario. We first construct a dataset consisting of toxic-clean prompt pairs\nby GPT-3.5 Turbo. To guide the optimizer to have the ability of converting\ntoxic prompt to clean prompt while preserving semantic information, we design a\nnovel reward function measuring toxicity and text alignment of generated images\nand train the optimizer through Proximal Policy Optimization. 
Experiments show\nthat our approach can effectively reduce the likelihood of various T2I models\nin generating inappropriate images, with no significant impact on text\nalignment. It is also flexible to be combined with methods to achieve better\nperformance. Our code is available at https://github.com/wu-zongyu/POSI.\n","authors":["Zongyu Wu","Hongcheng Gao","Yueze Wang","Xiang Zhang","Suhang Wang"],"pdf_url":"https://arxiv.org/pdf/2402.10882v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.03128v6","updated":"2024-12-04T19:49:02Z","published":"2023-10-04T19:39:26Z","title":"MetaTool Benchmark for Large Language Models: Deciding Whether to Use\n Tools and Which to Use","summary":" Large language models (LLMs) have garnered significant attention due to their\nimpressive natural language processing (NLP) capabilities. Recently, many\nstudies have focused on the tool utilization ability of LLMs. They primarily\ninvestigated how LLMs effectively collaborate with given specific tools.\nHowever, in scenarios where LLMs serve as intelligent agents, as seen in\napplications like AutoGPT and MetaGPT, LLMs are expected to engage in intricate\ndecision-making processes that involve deciding whether to employ a tool and\nselecting the most suitable tool(s) from a collection of available tools to\nfulfill user requests. Therefore, in this paper, we introduce MetaTool, a\nbenchmark designed to evaluate whether LLMs have tool usage awareness and can\ncorrectly choose tools. Specifically, we create a dataset called ToolE within\nthe benchmark. This dataset contains various types of user queries in the form\nof prompts that trigger LLMs to use tools, including both single-tool and\nmulti-tool scenarios. Subsequently, we set the tasks for both tool usage\nawareness and tool selection. 
We define four subtasks from different\nperspectives in tool selection, including tool selection with similar choices,\ntool selection in specific scenarios, tool selection with possible reliability\nissues, and multi-tool selection. We conduct experiments involving eight\npopular LLMs and find that the majority of them still struggle to effectively\nselect tools, highlighting the existing gaps between LLMs and genuine\nintelligent agents. However, through the error analysis, we found there is\nstill significant room for improvement. Finally, we conclude with insights for\ntool developers -- we strongly recommend that tool developers choose an\nappropriate rewrite model for generating new descriptions based on the\ndownstream LLM the tool will apply to. Our code is in\nhttps://github.com/HowieHwong/MetaTool.\n","authors":["Yue Huang","Jiawen Shi","Yuan Li","Chenrui Fan","Siyuan Wu","Qihui Zhang","Yixin Liu","Pan Zhou","Yao Wan","Neil Zhenqiang Gong","Lichao Sun"],"pdf_url":"https://arxiv.org/pdf/2310.03128v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03681v1","updated":"2024-12-04T19:23:37Z","published":"2024-12-04T19:23:37Z","title":"Acquired TASTE: Multimodal Stance Detection with Textual and Structural\n Embeddings","summary":" Stance detection plays a pivotal role in enabling an extensive range of\ndownstream applications, from discourse parsing to tracing the spread of fake\nnews and the denial of scientific facts. While most stance classification\nmodels rely on textual representation of the utterance in question, prior work\nhas demonstrated the importance of the conversational context in stance\ndetection. In this work we introduce TASTE -- a multimodal architecture for\nstance detection that harmoniously fuses Transformer-based content embedding\nwith unsupervised structural embedding. 
Through the fine-tuning of a pretrained\ntransformer and the amalgamation with social embedding via a Gated Residual\nNetwork (GRN) layer, our model adeptly captures the complex interplay between\ncontent and conversational structure in determining stance. TASTE achieves\nstate-of-the-art results on common benchmarks, significantly outperforming an\narray of strong baselines. Comparative evaluations underscore the benefits of\nsocial grounding -- emphasizing the criticality of concurrently harnessing both\ncontent and structure for enhanced stance detection.\n","authors":["Guy Barel","Oren Tsur","Dan Volenchik"],"pdf_url":"https://arxiv.org/pdf/2412.03681v1.pdf","comment":"The modified camera ready version will be published in January 2025\n at COLING"},{"id":"http://arxiv.org/abs/2405.01535v2","updated":"2024-12-04T19:23:17Z","published":"2024-05-02T17:59:35Z","title":"Prometheus 2: An Open Source Language Model Specialized in Evaluating\n Other Language Models","summary":" Proprietary LMs such as GPT-4 are often employed to assess the quality of\nresponses from various LMs. However, concerns including transparency,\ncontrollability, and affordability strongly motivate the development of\nopen-source LMs specialized in evaluations. On the other hand, existing open\nevaluator LMs exhibit critical shortcomings: 1) they issue scores that\nsignificantly diverge from those assigned by humans, and 2) they lack the\nflexibility to perform both direct assessment and pairwise ranking, the two\nmost prevalent forms of assessment. Additionally, they do not possess the\nability to evaluate based on custom evaluation criteria, focusing instead on\ngeneral attributes like helpfulness and harmlessness. To address these issues,\nwe introduce Prometheus 2, a more powerful evaluator LM than its predecessor\nthat closely mirrors human and GPT-4 judgements. 
Moreover, it is capable of\nprocessing both direct assessment and pair-wise ranking formats grouped with a\nuser-defined evaluation criteria. On four direct assessment benchmarks and four\npairwise ranking benchmarks, Prometheus 2 scores the highest correlation and\nagreement with humans and proprietary LM judges among all tested open evaluator\nLMs. Our models, code, and data are all publicly available at\nhttps://github.com/prometheus-eval/prometheus-eval.\n","authors":["Seungone Kim","Juyoung Suk","Shayne Longpre","Bill Yuchen Lin","Jamin Shin","Sean Welleck","Graham Neubig","Moontae Lee","Kyungjae Lee","Minjoon Seo"],"pdf_url":"https://arxiv.org/pdf/2405.01535v2.pdf","comment":"EMNLP 2024 (Main Conference)"},{"id":"http://arxiv.org/abs/2412.03679v1","updated":"2024-12-04T19:20:32Z","published":"2024-12-04T19:20:32Z","title":"Evaluating Language Models as Synthetic Data Generators","summary":" Given the increasing use of synthetic data in language model (LM)\npost-training, an LM's ability to generate high-quality data has become nearly\nas crucial as its ability to solve problems directly. While prior works have\nfocused on developing effective data generation methods, they lack systematic\ncomparison of different LMs as data generators in a unified setting. To address\nthis gap, we propose AgoraBench, a benchmark that provides standardized\nsettings and metrics to evaluate LMs' data generation abilities. Through\nsynthesizing 1.26 million training instances using 6 LMs and training 99\nstudent models, we uncover key insights about LMs' data generation\ncapabilities. First, we observe that LMs exhibit distinct strengths. For\ninstance, GPT-4o excels at generating new problems, while Claude-3.5-Sonnet\nperforms better at enhancing existing ones. Furthermore, our analysis reveals\nthat an LM's data generation ability doesn't necessarily correlate with its\nproblem-solving ability. 
Instead, multiple intrinsic features of data\nquality-including response quality, perplexity, and instruction\ndifficulty-collectively serve as better indicators. Finally, we demonstrate\nthat strategic choices in output format and cost-conscious model selection\nsignificantly impact data generation effectiveness.\n","authors":["Seungone Kim","Juyoung Suk","Xiang Yue","Vijay Viswanathan","Seongyun Lee","Yizhong Wang","Kiril Gashteovski","Carolin Lawrence","Sean Welleck","Graham Neubig"],"pdf_url":"https://arxiv.org/pdf/2412.03679v1.pdf","comment":"Work in Progress"},{"id":"http://arxiv.org/abs/2403.05152v3","updated":"2024-12-04T19:01:43Z","published":"2024-03-08T08:41:14Z","title":"Towards a Psychology of Machines: Large Language Models Predict Human\n Memory","summary":" Large language models (LLMs), such as ChatGPT, have shown remarkable\nabilities in natural language processing, opening new avenues in psychological\nresearch. This study explores whether LLMs can predict human memory performance\nin tasks involving garden-path sentences and contextual information. In the\nfirst part, we used ChatGPT to rate the relatedness and memorability of\ngarden-path sentences preceded by either fitting or unfitting contexts. In the\nsecond part, human participants read the same sentences, rated their\nrelatedness, and completed a surprise memory test. The results demonstrated\nthat ChatGPT's relatedness ratings closely matched those of the human\nparticipants, and its memorability ratings effectively predicted human memory\nperformance. Both LLM and human data revealed that higher relatedness in the\nunfitting context condition was associated with better memory performance,\naligning with probabilistic frameworks of context-dependent learning. These\nfindings suggest that LLMs, despite lacking human-like memory mechanisms, can\nmodel aspects of human cognition and serve as valuable tools in psychological\nresearch. 
We propose the field of machine psychology to explore this interplay\nbetween human cognition and artificial intelligence, offering a bidirectional\napproach where LLMs can both benefit from and contribute to our understanding\nof human cognitive processes.\n","authors":["Markus Huff","Elanur Ulakçı"],"pdf_url":"https://arxiv.org/pdf/2403.05152v3.pdf","comment":"34 pages, 3 figures, 2 tables"},{"id":"http://arxiv.org/abs/2412.03665v1","updated":"2024-12-04T19:01:06Z","published":"2024-12-04T19:01:06Z","title":"Personalizing Multimodal Large Language Models for Image Captioning: An\n Experimental Analysis","summary":" The task of image captioning demands an algorithm to generate natural\nlanguage descriptions of visual inputs. Recent advancements have seen a\nconvergence between image captioning research and the development of Large\nLanguage Models (LLMs) and Multimodal LLMs -- like GPT-4V and Gemini -- which\nextend the capabilities of text-only LLMs to multiple modalities. This paper\ninvestigates whether Multimodal LLMs can supplant traditional image captioning\nnetworks by evaluating their performance on various image description\nbenchmarks. We explore both the zero-shot capabilities of these models and\ntheir adaptability to different semantic domains through fine-tuning methods,\nincluding prompt learning, prefix tuning, and low-rank adaptation. Our results\ndemonstrate that while Multimodal LLMs achieve impressive zero-shot\nperformance, fine-tuning for specific domains while maintaining their\ngeneralization capabilities intact remains challenging. 
We discuss the\nimplications of these findings for future research in image captioning and the\ndevelopment of more adaptable Multimodal LLMs.\n","authors":["Davide Bucciarelli","Nicholas Moratelli","Marcella Cornia","Lorenzo Baraldi","Rita Cucchiara"],"pdf_url":"https://arxiv.org/pdf/2412.03665v1.pdf","comment":"ECCV 2024 Workshop on Green Foundation Models"},{"id":"http://arxiv.org/abs/2412.03625v1","updated":"2024-12-04T15:55:20Z","published":"2024-12-04T15:55:20Z","title":"Multimodal Sentiment Analysis Based on BERT and ResNet","summary":" With the rapid development of the Internet and social media, multi-modal data\n(text and image) is increasingly important in sentiment analysis tasks.\nHowever, the existing methods are difficult to effectively fuse text and image\nfeatures, which limits the accuracy of analysis. To solve this problem, a\nmultimodal sentiment analysis framework combining BERT and ResNet was proposed.\nBERT has shown strong text representation ability in natural language\nprocessing, and ResNet has excellent image feature extraction performance in\nthe field of computer vision. Firstly, BERT is used to extract the text feature\nvector, and ResNet is used to extract the image feature representation. Then, a\nvariety of feature fusion strategies are explored, and finally the fusion model\nbased on attention mechanism is selected to make full use of the complementary\ninformation between text and image. Experimental results on the public dataset\nMAVA-single show that compared with the single-modal models that only use BERT\nor ResNet, the proposed multi-modal model improves the accuracy and F1 score,\nreaching the best accuracy of 74.5%. This study not only provides new ideas and\nmethods for multimodal sentiment analysis, but also demonstrates the\napplication potential of BERT and ResNet in cross-domain fusion. 
In the future,\nmore advanced feature fusion techniques and optimization strategies will be\nexplored to further improve the accuracy and generalization ability of\nmultimodal sentiment analysis.\n","authors":["JiaLe Ren"],"pdf_url":"https://arxiv.org/pdf/2412.03625v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03624v1","updated":"2024-12-04T15:52:03Z","published":"2024-12-04T15:52:03Z","title":"How to Correctly do Semantic Backpropagation on Language-based Agentic\n Systems","summary":" Language-based agentic systems have shown great promise in recent years,\ntransitioning from solving small-scale research problems to being deployed in\nchallenging real-world tasks. However, optimizing these systems often requires\nsubstantial manual labor. Recent studies have demonstrated that these systems\ncan be represented as computational graphs, enabling automatic optimization.\nDespite these advancements, most current efforts in Graph-based Agentic System\nOptimization (GASO) fail to properly assign feedback to the system's components\ngiven feedback on the system's output. To address this challenge, we formalize\nthe concept of semantic backpropagation with semantic gradients -- a\ngeneralization that aligns several key optimization techniques, including\nreverse-mode automatic differentiation and the more recent TextGrad by\nexploiting the relationship among nodes with a common successor. This serves as\na method for computing directional information about how changes to each\ncomponent of an agentic system might improve the system's output. To use these\ngradients, we propose a method called semantic gradient descent which enables\nus to solve GASO effectively. Our results on both BIG-Bench Hard and GSM8K show\nthat our approach outperforms existing state-of-the-art methods for solving\nGASO problems. A detailed ablation study on the LIAR dataset demonstrates the\nparsimonious nature of our method. 
A full copy of our implementation is\npublicly available at https://github.com/HishamAlyahya/semantic_backprop\n","authors":["Wenyi Wang","Hisham A. Alyahya","Dylan R. Ashley","Oleg Serikov","Dmitrii Khizbullin","Francesco Faccio","Jürgen Schmidhuber"],"pdf_url":"https://arxiv.org/pdf/2412.03624v1.pdf","comment":"11 pages in main text + 2 pages of references + 15 pages of\n appendices, 2 figures in main text + 17 figures in appendices, 2 tables in\n main text + 1 table in appendices, 2 algorithms in main text; source code\n available at https://github.com/HishamAlyahya/semantic_backprop"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2412.03557v1","updated":"2024-12-04T18:52:32Z","published":"2024-12-04T18:52:32Z","title":"Freshness and Informativity Weighted Cognitive Extent and Its\n Correlation with Cumulative Citation Count","summary":" In this paper, we revisit cognitive extent, originally defined as the number\nof unique phrases in a quota. We introduce Freshness and Informativity Weighted\nCognitive Extent (FICE), calculated based on two novel weighting factors, the\nlifetime ratio and informativity of scientific entities. We model the lifetime\nof each scientific entity as the time-dependent document frequency, which is\nfit by the composition of multiple Gaussian profiles. The lifetime ratio is\nthen calculated as the cumulative document frequency at the publication time\n$t_0$ divided by the cumulative document frequency over its entire lifetime.\nThe informativity is calculated by normalizing the document frequency across\nall scientific entities recognized in a title. Using the ACL Anthology, we\nverified the trend formerly observed in several other domains that the number\nof unique scientific entities per quota increased gradually at a slower rate.\nWe found that FICE exhibits a strong correlation with the average cumulative\ncitation count within a quota. 
Our code is available at\nhttps://github.com/ZiheHerzWang/Freshness-and-Informativity-Weighted-Cognitive-Extent\n","authors":["Zihe Wang","Jian Wu"],"pdf_url":"https://arxiv.org/pdf/2412.03557v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03465v1","updated":"2024-12-04T16:54:58Z","published":"2024-12-04T16:54:58Z","title":"YT-30M: A multi-lingual multi-category dataset of YouTube comments","summary":" This paper introduces two large-scale multilingual comment datasets, YT-30M\n(and YT-100K) from YouTube. The analysis in this paper is performed on a\nsmaller sample (YT-100K) of YT-30M. Both datasets, YT-30M (full) and\nYT-100K (randomly selected 100K sample from YT-30M), are publicly released for\nfurther research. YT-30M (YT-100K) contains 32236173 (108694) comments posted\nby YouTube channel that belong to YouTube categories. Each comment is\nassociated with a video ID, comment ID, commentor name, commentor channel ID,\ncomment text, upvotes, original channel ID and category of the YouTube channel\n(e.g., 'News & Politics', 'Science & Technology', etc.).\n","authors":["Hridoy Sankar Dutta"],"pdf_url":"https://arxiv.org/pdf/2412.03465v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03193v1","updated":"2024-12-04T10:27:35Z","published":"2024-12-04T10:27:35Z","title":"Beyond Questions: Leveraging ColBERT for Keyphrase Search","summary":" While question-like queries are gaining popularity and search engines' users\nincreasingly adopt them, keyphrase search has traditionally been the\ncornerstone of web search. This query type is also prevalent in specialised\nsearch tasks such as academic or professional search, where experts rely on\nkeyphrases to articulate their information needs. However, current dense\nretrieval models often fail with keyphrase-like queries, primarily because they\nare mostly trained on question-like ones. 
This paper introduces a novel model\nthat employs the ColBERT architecture to enhance document ranking for keyphrase\nqueries. For that, given the lack of large keyphrase-based retrieval datasets,\nwe first explore how Large Language Models can convert question-like queries\ninto keyphrase format. Then, using those keyphrases, we train a keyphrase-based\nColBERT ranker (ColBERTKP_QD) to improve the performance when working with\nkeyphrase queries. Furthermore, to reduce the training costs associated with\ntraining the full ColBERT model, we investigate the feasibility of training\nonly a keyphrase query encoder while keeping the document encoder weights\nstatic (ColBERTKP_Q). We assess our proposals' ranking performance using both\nautomatically generated and manually annotated keyphrases. Our results reveal\nthe potential of the late interaction architecture when working under the\nkeyphrase search scenario.\n","authors":["Jorge Gabín","Javier Parapar","Craig Macdonald"],"pdf_url":"https://arxiv.org/pdf/2412.03193v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03097v1","updated":"2024-12-04T07:50:27Z","published":"2024-12-04T07:50:27Z","title":"Enhancing Recommendation Systems with GNNs and Addressing Over-Smoothing","summary":" This paper addresses key challenges in enhancing recommendation systems by\nleveraging Graph Neural Networks (GNNs) and addressing inherent limitations\nsuch as over-smoothing, which reduces model effectiveness as network hierarchy\ndeepens. The proposed approach introduces three GNN-based recommendation\nmodels, specifically designed to mitigate over-smoothing through innovative\nmechanisms like residual connections and identity mapping within the\naggregation propagation process. These modifications enable more effective\ninformation flow across layers, preserving essential user-item interaction\ndetails to improve recommendation accuracy. 
Additionally, the study emphasizes\nthe critical need for interpretability in recommendation systems, aiming to\nprovide transparent and justifiable suggestions tailored to dynamic user\npreferences. By integrating collaborative filtering with GNN architectures, the\nproposed models not only enhance predictive accuracy but also align\nrecommendations more closely with individual behaviors, adapting to nuanced\nshifts in user interests. This work advances the field by tackling both\ntechnical and user-centric challenges, contributing to the development of\nrobust and explainable recommendation systems capable of managing the\ncomplexity and scale of modern online environments.\n","authors":["Wenyi Liu","Ziqi Zhang","Xinshi Li","Jiacheng Hu","Yuanshuai Luo","Junliang Du"],"pdf_url":"https://arxiv.org/pdf/2412.03097v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.11646v2","updated":"2024-12-04T04:52:24Z","published":"2024-08-21T14:17:24Z","title":"Mathematical Information Retrieval: Search and Question Answering","summary":" Mathematical information is essential for technical work, but its creation,\ninterpretation, and search are challenging. To help address these challenges,\nresearchers have developed multimodal search engines and mathematical question\nanswering systems. This book begins with a simple framework characterizing the\ninformation tasks that people and systems perform as we work to answer\nmath-related questions. The framework is used to organize and relate the other\ncore topics of the book, including interactions between people and systems,\nrepresenting math formulas in sources, and evaluation. We close by addressing\nsome key questions and presenting directions for future work. 
This book is\nintended for students, instructors, and researchers interested in systems that\nhelp us find and use mathematical information.\n","authors":["Richard Zanibbi","Behrooz Mansouri","Anurag Agarwal"],"pdf_url":"https://arxiv.org/pdf/2408.11646v2.pdf","comment":"[DRAFT] Revised (2nd) draft"},{"id":"http://arxiv.org/abs/2412.02996v1","updated":"2024-12-04T03:29:56Z","published":"2024-12-04T03:29:56Z","title":"CLAS: A Machine Learning Enhanced Framework for Exploring Large 3D\n Design Datasets","summary":" Three-dimensional (3D) objects have wide applications. Despite the growing\ninterest in 3D modeling in academia and industries, designing and/or creating\n3D objects from scratch remains time-consuming and challenging. With the\ndevelopment of generative artificial intelligence (AI), designers discover a\nnew way to create images for ideation. However, generative AIs are less useful\nin creating 3D objects with satisfying qualities. To allow 3D designers to\naccess a wide range of 3D objects for creative activities based on their\nspecific demands, we propose a machine learning (ML) enhanced framework CLAS -\nnamed after the four-step of capture, label, associate, and search - to enable\nfully automatic retrieval of 3D objects based on user specifications leveraging\nthe existing datasets of 3D objects. CLAS provides an effective and efficient\nmethod for any person or organization to benefit from their existing but not\nutilized 3D datasets. In addition, CLAS may also be used to produce\nhigh-quality 3D object synthesis datasets for training and evaluating 3D\ngenerative models. As a proof of concept, we created and showcased a search\nsystem with a web user interface (UI) for retrieving 6,778 3D objects of chairs\nin the ShapeNet dataset powered by CLAS. 
In a close-set retrieval setting, our\nretrieval method achieves a mean reciprocal rank (MRR) of 0.58, top 1 accuracy\nof 42.27%, and top 10 accuracy of 89.64%.\n","authors":["XiuYu Zhang","Xiaolei Ye","Jui-Che Chang","Yue Fang"],"pdf_url":"https://arxiv.org/pdf/2412.02996v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01007v2","updated":"2024-12-04T20:01:42Z","published":"2024-12-01T23:54:12Z","title":"CoRNStack: High-Quality Contrastive Data for Better Code Ranking","summary":" Effective code retrieval plays a crucial role in advancing code generation,\nbug fixing, and software maintenance, particularly as software systems increase\nin complexity. While current code embedding models have demonstrated promise in\nretrieving code snippets for small-scale, well-defined tasks, they often\nunderperform in more demanding real-world applications such as bug localization\nwithin GitHub repositories. We hypothesize that a key issue is their reliance\non noisy and inconsistent datasets for training, which impedes their ability to\ngeneralize to more complex retrieval scenarios. To address these limitations,\nwe introduce CoRNStack, a large-scale, high-quality contrastive training\ndataset for code that spans multiple programming languages. This dataset is\ncurated using consistency filtering to eliminate noisy positives and is further\nenriched with mined hard negatives, thereby facilitating more effective\nlearning. We demonstrate that contrastive training of embedding models using\nCoRNStack leads to state-of-the-art performance across a variety of code\nretrieval tasks. Furthermore, the dataset can be leveraged for training code\nreranking models, a largely underexplored area compared to text reranking. Our\nfinetuned code reranking model significantly improves the ranking quality over\nthe retrieved results. 
Finally, by employing our code retriever and reranker\ntogether, we demonstrate significant improvements in function localization for\nGitHub issues, an important component of real-world software development.\n","authors":["Tarun Suresh","Revanth Gangi Reddy","Yifei Xu","Zach Nussbaum","Andriy Mulyar","Brandon Duderstadt","Heng Ji"],"pdf_url":"https://arxiv.org/pdf/2412.01007v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03620v1","updated":"2024-12-04T15:03:47Z","published":"2024-12-04T15:03:47Z","title":"Recommender Systems for Sustainability: Overview and Research Issues","summary":" Sustainability development goals (SDGs) are regarded as a universal call to\naction with the overall objectives of planet protection, ending of poverty, and\nensuring peace and prosperity for all people. In order to achieve these\nobjectives, different AI technologies play a major role. Specifically,\nrecommender systems can provide support for organizations and individuals to\nachieve the defined goals. Recommender systems integrate AI technologies such\nas machine learning, explainable AI (XAI), case-based reasoning, and constraint\nsolving in order to find and explain user-relevant alternatives from a\npotentially large set of options. In this article, we summarize the state of\nthe art in applying recommender systems to support the achievement of\nsustainability development goals. 
In this context, we discuss open issues for\nfuture research.\n","authors":["Alexander Felfernig","Manfred Wundara","Thi Ngoc Trang Tran","Seda Polat-Erdeniz","Sebastian Lubos","Merfat El-Mansi","Damian Garber","Viet-Man Le"],"pdf_url":"https://arxiv.org/pdf/2412.03620v1.pdf","comment":null}],"Machine Learning":[{"id":"http://arxiv.org/abs/2406.09400v2","updated":"2024-12-04T18:59:56Z","published":"2024-06-13T17:59:29Z","title":"Yo'LLaVA: Your Personalized Language and Vision Assistant","summary":" Large Multimodal Models (LMMs) have shown remarkable capabilities across a\nvariety of tasks (e.g., image captioning, visual question answering). While\nbroad, their knowledge remains generic (e.g., recognizing a dog), and they are\nunable to handle personalized subjects (e.g., recognizing a user's pet dog).\nHuman reasoning, in contrast, typically operates within the context of specific\nsubjects in our surroundings. For example, one might ask, \"What should I buy\nfor my dog's birthday?\"; as opposed to a generic inquiry about \"What should I\nbuy for a dog's birthday?\". Similarly, when looking at a friend's image, the\ninterest lies in seeing their activities (e.g., \"my friend is holding a cat\"),\nrather than merely observing generic human actions (e.g., \"a man is holding a\ncat\"). In this paper, we introduce the novel task of personalizing LMMs, so\nthat they can have conversations about a specific subject. We propose Yo'LLaVA,\nwhich learns to embed a personalized subject into a set of latent tokens given\na handful of example images of the subject. 
Our qualitative and quantitative\nanalyses reveal that Yo'LLaVA can learn the concept more efficiently using\nfewer tokens and more effectively encode the visual attributes compared to\nstrong prompting baselines (e.g., LLaVA).\n","authors":["Thao Nguyen","Haotian Liu","Yuheng Li","Mu Cai","Utkarsh Ojha","Yong Jae Lee"],"pdf_url":"https://arxiv.org/pdf/2406.09400v2.pdf","comment":"NeurIPS 2024; Project page: https://thaoshibe.github.io/YoLLaVA"},{"id":"http://arxiv.org/abs/2412.03572v1","updated":"2024-12-04T18:59:45Z","published":"2024-12-04T18:59:45Z","title":"Navigation World Models","summary":" Navigation is a fundamental skill of agents with visual-motor capabilities.\nWe introduce a Navigation World Model (NWM), a controllable video generation\nmodel that predicts future visual observations based on past observations and\nnavigation actions. To capture complex environment dynamics, NWM employs a\nConditional Diffusion Transformer (CDiT), trained on a diverse collection of\negocentric videos of both human and robotic agents, and scaled up to 1 billion\nparameters. In familiar environments, NWM can plan navigation trajectories by\nsimulating them and evaluating whether they achieve the desired goal. Unlike\nsupervised navigation policies with fixed behavior, NWM can dynamically\nincorporate constraints during planning. Experiments demonstrate its\neffectiveness in planning trajectories from scratch or by ranking trajectories\nsampled from an external policy. 
Furthermore, NWM leverages its learned visual\npriors to imagine trajectories in unfamiliar environments from a single input\nimage, making it a flexible and powerful tool for next-generation navigation\nsystems.\n","authors":["Amir Bar","Gaoyue Zhou","Danny Tran","Trevor Darrell","Yann LeCun"],"pdf_url":"https://arxiv.org/pdf/2412.03572v1.pdf","comment":"project page: https://www.amirbar.net/nwm/"},{"id":"http://arxiv.org/abs/2412.03556v1","updated":"2024-12-04T18:51:32Z","published":"2024-12-04T18:51:32Z","title":"Best-of-N Jailbreaking","summary":" We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that\njailbreaks frontier AI systems across modalities. BoN Jailbreaking works by\nrepeatedly sampling variations of a prompt with a combination of augmentations\n- such as random shuffling or capitalization for textual prompts - until a\nharmful response is elicited. We find that BoN Jailbreaking achieves high\nattack success rates (ASRs) on closed-source language models, such as 89% on\nGPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts.\nFurther, it is similarly effective at circumventing state-of-the-art\nopen-source defenses like circuit breakers. BoN also seamlessly extends to\nother modalities: it jailbreaks vision language models (VLMs) such as GPT-4o\nand audio language models (ALMs) like Gemini 1.5 Pro, using modality-specific\naugmentations. BoN reliably improves when we sample more augmented prompts.\nAcross all modalities, ASR, as a function of the number of samples (N),\nempirically follows power-law-like behavior for many orders of magnitude. BoN\nJailbreaking can also be composed with other black-box algorithms for even more\neffective attacks - combining BoN with an optimized prefix attack achieves up\nto a 35% increase in ASR. 
Overall, our work indicates that, despite their\ncapability, language models are sensitive to seemingly innocuous changes to\ninputs, which attackers can exploit across modalities.\n","authors":["John Hughes","Sara Price","Aengus Lynch","Rylan Schaeffer","Fazl Barez","Sanmi Koyejo","Henry Sleight","Erik Jones","Ethan Perez","Mrinank Sharma"],"pdf_url":"https://arxiv.org/pdf/2412.03556v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.15957v2","updated":"2024-12-04T18:48:43Z","published":"2024-02-25T02:36:03Z","title":"DynaMITE-RL: A Dynamic Model for Improved Temporal Meta-Reinforcement\n Learning","summary":" We introduce DynaMITE-RL, a meta-reinforcement learning (meta-RL) approach to\napproximate inference in environments where the latent state evolves at varying\nrates. We model episode sessions - parts of the episode where the latent state\nis fixed - and propose three key modifications to existing meta-RL methods:\nconsistency of latent information within sessions, session masking, and prior\nlatent conditioning. We demonstrate the importance of these modifications in\nvarious domains, ranging from discrete Gridworld environments to\ncontinuous-control and simulated robot assistive tasks, demonstrating that\nDynaMITE-RL significantly outperforms state-of-the-art baselines in sample\nefficiency and inference returns.\n","authors":["Anthony Liang","Guy Tennenholtz","Chih-wei Hsu","Yinlam Chow","Erdem Bıyık","Craig Boutilier"],"pdf_url":"https://arxiv.org/pdf/2402.15957v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.10182v3","updated":"2024-12-04T18:48:28Z","published":"2024-03-15T10:38:48Z","title":"Fast and reliable uncertainty quantification with neural network\n ensembles for industrial image classification","summary":" Image classification with neural networks (NNs) is widely used in industrial\nprocesses, situations where the model likely encounters unknown objects during\ndeployment, i.e., out-of-distribution (OOD) data. 
Worryingly, NNs tend to make\nconfident yet incorrect predictions when confronted with OOD data. To increase\nthe models' reliability, they should quantify the uncertainty in their own\npredictions, communicating when the output should (not) be trusted. Deep\nensembles, composed of multiple independent NNs, have been shown to perform\nstrongly but are computationally expensive. Recent research has proposed more\nefficient NN ensembles, namely the snapshot, batch, and multi-input\nmulti-output ensemble. This study investigates the predictive and uncertainty\nperformance of efficient NN ensembles in the context of image classification\nfor industrial processes. It is the first to provide a comprehensive comparison\nand it proposes a novel Diversity Quality metric to quantify the ensembles'\nperformance on the in-distribution and OOD sets in one single metric. The\nresults highlight the batch ensemble as a cost-effective and competitive\nalternative to the deep ensemble. It matches the deep ensemble in both\nuncertainty and accuracy while exhibiting considerable savings in training\ntime, test time, and memory storage.\n","authors":["Arthur Thuy","Dries F. Benoit"],"pdf_url":"https://arxiv.org/pdf/2403.10182v3.pdf","comment":"Submitted to Annals of Operations Research"},{"id":"http://arxiv.org/abs/2412.03548v1","updated":"2024-12-04T18:45:35Z","published":"2024-12-04T18:45:35Z","title":"Perception Tokens Enhance Visual Reasoning in Multimodal Language Models","summary":" Multimodal language models (MLMs) still face challenges in fundamental visual\nperception tasks where specialized models excel. Tasks requiring reasoning\nabout 3D structures benefit from depth estimation, and reasoning about 2D\nobject instances benefits from object detection. Yet, MLMs can not produce\nintermediate depth or boxes to reason over. 
Finetuning MLMs on relevant data\ndoesn't generalize well and outsourcing computation to specialized vision tools\nis too compute-intensive and memory-inefficient. To address this, we introduce\nPerception Tokens, intrinsic image representations designed to assist reasoning\ntasks where language is insufficient. Perception tokens act as auxiliary\nreasoning tokens, akin to chain-of-thought prompts in language models. For\nexample, in a depth-related task, an MLM augmented with perception tokens can\nreason by generating a depth map as tokens, enabling it to solve the problem\neffectively. We propose AURORA, a training method that augments MLMs with\nperception tokens for improved reasoning over visual inputs. AURORA leverages a\nVQVAE to transform intermediate image representations, such as depth maps into\na tokenized format and bounding box tokens, which is then used in a multi-task\ntraining framework. AURORA achieves notable improvements across counting\nbenchmarks: +10.8% on BLINK, +11.3% on CVBench, and +8.3% on SEED-Bench,\noutperforming finetuning approaches in generalization across datasets. It also\nimproves on relative depth: over +6% on BLINK. With perception tokens, AURORA\nexpands the scope of MLMs beyond language-based reasoning, paving the way for\nmore effective visual reasoning capabilities.\n","authors":["Mahtab Bigverdi","Zelun Luo","Cheng-Yu Hsieh","Ethan Shen","Dongping Chen","Linda G. Shapiro","Ranjay Krishna"],"pdf_url":"https://arxiv.org/pdf/2412.03548v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19379v2","updated":"2024-12-04T18:40:24Z","published":"2024-11-28T21:10:20Z","title":"Marconi: Prefix Caching for the Era of Hybrid LLMs","summary":" Hybrid models that combine the language modeling capabilities of Attention\nlayers with the efficiency of Recurrent layers (e.g., State Space Models) have\ngained traction in practically supporting long contexts in Large Language Model\nserving. 
Yet, the unique properties of these models complicate the usage of\ncomplementary efficiency optimizations such as prefix caching that skip\nredundant computations across requests. Most notably, their use of in-place\nstate updates for recurrent layers precludes rolling back cache entries for\npartial sequence overlaps, and instead mandates only exact-match cache hits;\nthe effect is a deluge of (large) cache entries per sequence, most of which\nyield minimal reuse opportunities. We present Marconi, the first system that\nsupports efficient prefix caching with Hybrid LLMs. Key to Marconi are its\nnovel admission and eviction policies that more judiciously assess potential\ncache entries based not only on recency, but also on (1) forecasts of their\nreuse likelihood across a taxonomy of different hit scenarios, and (2) the\ncompute savings that hits deliver relative to memory footprints. Across diverse\nworkloads and Hybrid models, Marconi achieves up to 34.4$\\times$ higher token\nhit rates (71.1% or 617 ms lower TTFT) compared to state-of-the-art prefix\ncaching systems.\n","authors":["Rui Pan","Zhuang Wang","Zhen Jia","Can Karakus","Luca Zancato","Tri Dao","Yida Wang","Ravi Netravali"],"pdf_url":"https://arxiv.org/pdf/2411.19379v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03539v1","updated":"2024-12-04T18:36:09Z","published":"2024-12-04T18:36:09Z","title":"NODE-AdvGAN: Improving the transferability and perceptual similarity of\n adversarial examples by dynamic-system-driven adversarial generative model","summary":" Understanding adversarial examples is crucial for improving the model's\nrobustness, as they introduce imperceptible perturbations that deceive models.\nEffective adversarial examples, therefore, offer the potential to train more\nrobust models by removing their singularities. 
We propose NODE-AdvGAN, a novel\napproach that treats adversarial generation as a continuous process and employs\na Neural Ordinary Differential Equation (NODE) for simulating the dynamics of\nthe generator. By mimicking the iterative nature of traditional gradient-based\nmethods, NODE-AdvGAN generates smoother and more precise perturbations that\npreserve high perceptual similarity when added to benign images. We also\npropose a new training strategy, NODE-AdvGAN-T, which enhances transferability\nin black-box attacks by effectively tuning noise parameters during training.\nExperiments demonstrate that NODE-AdvGAN and NODE-AdvGAN-T generate more\neffective adversarial examples that achieve higher attack success rates while\npreserving better perceptual quality than traditional GAN-based methods.\n","authors":["Xinheng Xie","Yue Wu","Cuiyu He"],"pdf_url":"https://arxiv.org/pdf/2412.03539v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03537v1","updated":"2024-12-04T18:32:42Z","published":"2024-12-04T18:32:42Z","title":"Evaluating Gender Bias Transfer between Pre-trained and Prompt-Adapted\n Language Models","summary":" Large language models (LLMs) are increasingly being adapted to achieve\ntask-specificity for deployment in real-world decision systems. Several\nprevious works have investigated the bias transfer hypothesis (BTH) by studying\nthe effect of the fine-tuning adaptation strategy on model fairness to find\nthat fairness in pre-trained masked language models has limited effect on the\nfairness of models when adapted using fine-tuning. 
In this work, we expand the\nstudy of BTH to causal models under prompt adaptations, as prompting is an\naccessible, and compute-efficient way to deploy models in real-world systems.\nIn contrast to previous works, we establish that intrinsic biases in\npre-trained Mistral, Falcon and Llama models are strongly correlated (rho >=\n0.94) with biases when the same models are zero- and few-shot prompted, using a\npronoun co-reference resolution task. Further, we find that bias transfer\nremains strongly correlated even when LLMs are specifically prompted to exhibit\nfair or biased behavior (rho >= 0.92), and few-shot length and stereotypical\ncomposition are varied (rho >= 0.97). Our findings highlight the importance of\nensuring fairness in pre-trained LLMs, especially when they are later used to\nperform downstream tasks via prompt adaptation.\n","authors":["Natalie Mackraz","Nivedha Sivakumar","Samira Khorshidi","Krishna Patel","Barry-John Theobald","Luca Zappella","Nicholas Apostoloff"],"pdf_url":"https://arxiv.org/pdf/2412.03537v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03531v1","updated":"2024-12-04T18:26:13Z","published":"2024-12-04T18:26:13Z","title":"A Review on Scientific Knowledge Extraction using Large Language Models\n in Biomedical Sciences","summary":" The rapid advancement of large language models (LLMs) has opened new\nboundaries in the extraction and synthesis of medical knowledge, particularly\nwithin evidence synthesis. This paper reviews the state-of-the-art applications\nof LLMs in the biomedical domain, exploring their effectiveness in automating\ncomplex tasks such as evidence synthesis and data extraction from a biomedical\ncorpus of documents. While LLMs demonstrate remarkable potential, significant\nchallenges remain, including issues related to hallucinations, contextual\nunderstanding, and the ability to generalize across diverse medical tasks. 
We\nhighlight critical gaps in the current research literature, particularly the\nneed for unified benchmarks to standardize evaluations and ensure reliability\nin real-world applications. In addition, we propose directions for future\nresearch, emphasizing the integration of state-of-the-art techniques such as\nretrieval-augmented generation (RAG) to enhance LLM performance in evidence\nsynthesis. By addressing these challenges and utilizing the strengths of LLMs,\nwe aim to improve access to medical literature and facilitate meaningful\ndiscoveries in healthcare.\n","authors":["Gabriel Lino Garcia","João Renato Ribeiro Manesco","Pedro Henrique Paiola","Lucas Miranda","Maria Paola de Salvo","João Paulo Papa"],"pdf_url":"https://arxiv.org/pdf/2412.03531v1.pdf","comment":"9 pages, 1 table, 1 figure, conference paper"},{"id":"http://arxiv.org/abs/2403.12712v3","updated":"2024-12-04T18:18:47Z","published":"2024-03-19T13:19:41Z","title":"Instance-Warp: Saliency Guided Image Warping for Unsupervised Domain\n Adaptation","summary":" Driving is challenging in conditions like night, rain, and snow. Lack of good\nlabeled datasets has hampered progress in scene understanding under such\nconditions. Unsupervised Domain Adaptation (UDA) using large labeled clear-day\ndatasets is a promising research direction in such cases. However, many UDA\nmethods are trained with dominant scene backgrounds (e.g., roads, sky,\nsidewalks) that appear dramatically different across domains. As a result, they\nstruggle to learn effective features of smaller and often sparse foreground\nobjects (e.g., people, vehicles, signs).\n In this work, we improve UDA training by applying in-place image warping to\nfocus on salient objects. We design instance-level saliency guidance to\nadaptively oversample object regions and undersample background areas, which\nreduces adverse effects from background context and enhances backbone feature\nlearning. 
Our approach improves adaptation across geographies, lighting, and\nweather conditions, and is agnostic to the task (segmentation, detection),\ndomain adaptation algorithm, saliency guidance, and underlying model\narchitecture. Result highlights include +6.1 mAP50 for BDD100K Clear\n$\\rightarrow$ DENSE Foggy, +3.7 mAP50 for BDD100K Day $\\rightarrow$ Night, +3.0\nmAP50 for BDD100K Clear $\\rightarrow$ Rainy, and +6.3 mIoU for Cityscapes\n$\\rightarrow$ ACDC. Besides, our method adds minimal training memory and no\nadditional inference latency. Code is available at\nhttps://github.com/ShenZheng2000/Instance-Warp\n","authors":["Shen Zheng","Anurag Ghosh","Srinivasa G. Narasimhan"],"pdf_url":"https://arxiv.org/pdf/2403.12712v3.pdf","comment":"WACV 2025 Accepted Paper"},{"id":"http://arxiv.org/abs/2412.03527v1","updated":"2024-12-04T18:15:41Z","published":"2024-12-04T18:15:41Z","title":"FANAL -- Financial Activity News Alerting Language Modeling Framework","summary":" In the rapidly evolving financial sector, the accurate and timely\ninterpretation of market news is essential for stakeholders needing to navigate\nunpredictable events. This paper introduces FANAL (Financial Activity News\nAlerting Language Modeling Framework), a specialized BERT-based framework\nengineered for real-time financial event detection and analysis, categorizing\nnews into twelve distinct financial categories. FANAL leverages silver-labeled\ndata processed through XGBoost and employs advanced fine-tuning techniques,\nalongside ORBERT (Odds Ratio BERT), a novel variant of BERT fine-tuned with\nORPO (Odds Ratio Preference Optimization) for superior class-wise probability\ncalibration and alignment with financial event relevance. We evaluate FANAL's\nperformance against leading large language models, including GPT-4o, Llama-3.1\n8B, and Phi-3, demonstrating its superior accuracy and cost efficiency. 
This\nframework sets a new standard for financial intelligence and responsiveness,\nsignificantly outstripping existing models in both performance and\naffordability.\n","authors":["Urjitkumar Patel","Fang-Chun Yeh","Chinmay Gondhalekar","Hari Nalluri"],"pdf_url":"https://arxiv.org/pdf/2412.03527v1.pdf","comment":"Accepted for the IEEE International Workshop on Large Language Models\n for Finance, 2024. This is a preprint version"},{"id":"http://arxiv.org/abs/2407.08152v2","updated":"2024-12-04T17:56:57Z","published":"2024-07-11T03:10:27Z","title":"Privacy-Preserving Data Deduplication for Enhancing Federated Learning\n of Language Models (Extended Version)","summary":" Deduplication is a vital preprocessing step that enhances machine learning\nmodel performance and saves training time and energy. However, enhancing\nfederated learning through deduplication poses challenges, especially regarding\nscalability and potential privacy violations if deduplication involves sharing\nall clients' data. In this paper, we address the problem of deduplication in a\nfederated setup by introducing a pioneering protocol, Efficient\nPrivacy-Preserving Multi-Party Deduplication (EP-MPD). It efficiently removes\nduplicates from multiple clients' datasets without compromising data privacy.\nEP-MPD is constructed in a modular fashion, utilizing two novel variants of the\nPrivate Set Intersection protocol. Our extensive experiments demonstrate the\nsignificant benefits of deduplication in federated learning of large language\nmodels. For instance, we observe up to 19.62\\% improvement in perplexity and up\nto 27.95\\% reduction in running time while varying the duplication level\nbetween 10\\% and 30\\%. 
EP-MPD effectively balances privacy and performance in\nfederated learning, making it a valuable solution for large-scale applications.\n","authors":["Aydin Abadi","Vishnu Asutosh Dasu","Sumanta Sarkar"],"pdf_url":"https://arxiv.org/pdf/2407.08152v2.pdf","comment":"Accepted at the Network and Distributed Systems Security (NDSS)\n Symposium, 2025"},{"id":"http://arxiv.org/abs/2412.03513v1","updated":"2024-12-04T17:56:49Z","published":"2024-12-04T17:56:49Z","title":"KKLIP: Knowledge Distillation Exploiting K-means Clustering for\n Language-Image Pre-Training","summary":" Recently, CLIP has emerged as a valuable model for aligning image and text\ninformation in multi-modal scenarios. However, researchers have observed\nlimitations in the ability of CLIP's text and image encoders to extract\ndetailed knowledge from caption-image pairs. In response, this paper introduces\nKKLIP, a novel approach designed to enhance the quality of CLIP by\nincorporating a new knowledge distillation (KD) method derived from Llama 2.\nOur method comprises three objectives: Text Embedding Distillation, Concept\nLearning, and Contrastive Learning. Firstly, Text Embedding Distillation\ninvolves training the KKLIP text encoder to emulate the teacher model, Llama 2.\nSecondly, Concept Learning assigns a soft concept label to each caption-image\npair through offline k-means clustering of text information from Llama 2,\nallowing KKLIP to learn from these soft concept labels. Finally, Contrastive\nLearning harmonizes text and image embeddings. 
Our experimental results\ndemonstrate that KKLIP enhances the quality of both text and image encoders.\n","authors":["Kuei-Chun Kao"],"pdf_url":"https://arxiv.org/pdf/2412.03513v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03506v1","updated":"2024-12-04T17:48:38Z","published":"2024-12-04T17:48:38Z","title":"Self-test loss functions for learning weak-form operators and gradient\n flows","summary":" The construction of loss functions presents a major challenge in data-driven\nmodeling involving weak-form operators in PDEs and gradient flows, particularly\ndue to the need to select test functions appropriately. We address this\nchallenge by introducing self-test loss functions, which employ test functions\nthat depend on the unknown parameters, specifically for cases where the\noperator depends linearly on the unknowns. The proposed self-test loss function\nconserves energy for gradient flows and coincides with the expected\nlog-likelihood ratio for stochastic differential equations. Importantly, it is\nquadratic, facilitating theoretical analysis of identifiability and\nwell-posedness of the inverse problem, while also leading to efficient\nparametric or nonparametric regression algorithms. It is computationally\nsimple, requiring only low-order derivatives or even being entirely\nderivative-free, and numerical experiments demonstrate its robustness against\nnoisy and discrete data.\n","authors":["Yuan Gao","Quanjun Lang","Fei Lu"],"pdf_url":"https://arxiv.org/pdf/2412.03506v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.11376v2","updated":"2024-12-04T17:45:14Z","published":"2024-09-17T17:23:44Z","title":"Towards Time Series Reasoning with LLMs","summary":" Multi-modal large language models (MLLMs) have enabled numerous advances in\nunderstanding and reasoning in domains like vision, but we have not yet seen\nthis broad success for time-series. 
Although prior works on time-series MLLMs\nhave shown promising performance in time-series forecasting, very few works\nshow how an LLM could be used for time-series reasoning in natural language. We\npropose a novel multi-modal time-series LLM approach that learns generalizable\ninformation across various domains with powerful zero-shot performance. First,\nwe train a lightweight time-series encoder on top of an LLM to directly extract\ntime-series information. Then, we fine-tune our model with chain-of-thought\naugmented time-series tasks to encourage the model to generate reasoning paths.\nWe show that our model learns a latent representation that reflects specific\ntime-series features (e.g. slope, frequency), as well as outperforming GPT-4o\non a set of zero-shot reasoning tasks on a variety of domains.\n","authors":["Winnie Chow","Lauren Gardiner","Haraldur T. Hallgrímsson","Maxwell A. Xu","Shirley You Ren"],"pdf_url":"https://arxiv.org/pdf/2409.11376v2.pdf","comment":"Oral Presentation at 2024 NeurIPS Workshop on Time Series in the Age\n of Large Models"},{"id":"http://arxiv.org/abs/2412.03498v1","updated":"2024-12-04T17:39:55Z","published":"2024-12-04T17:39:55Z","title":"A Bidirectional Siamese Recurrent Neural Network for Accurate Gait\n Recognition Using Body Landmarks","summary":" Gait recognition is a significant biometric technique for person\nidentification, particularly in scenarios where other physiological biometrics\nare impractical or ineffective. In this paper, we address the challenges\nassociated with gait recognition and present a novel approach to improve its\naccuracy and reliability. The proposed method leverages advanced techniques,\nincluding sequential gait landmarks obtained through the Mediapipe pose\nestimation model, Procrustes analysis for alignment, and a Siamese\nbiGRU-dualStack Neural Network architecture for capturing temporal\ndependencies. 
Extensive experiments were conducted on large-scale cross-view\ndatasets to demonstrate the effectiveness of the approach, achieving high\nrecognition accuracy compared to other models. The model demonstrated\naccuracies of 95.7%, 94.44%, 87.71%, and 86.6% on CASIA-B, SZU RGB-D, OU-MVLP,\nand Gait3D datasets respectively. The results highlight the potential\napplications of the proposed method in various practical domains, indicating\nits significant contribution to the field of gait recognition.\n","authors":["Proma Hossain Progga","Md. Jobayer Rahman","Swapnil Biswas","Md. Shakil Ahmed","Arif Reza Anwary","Swakkhar Shatabda"],"pdf_url":"https://arxiv.org/pdf/2412.03498v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03497v1","updated":"2024-12-04T17:39:01Z","published":"2024-12-04T17:39:01Z","title":"Soft Checksums to Flag Untrustworthy Machine Learning Surrogate\n Predictions and Application to Atomic Physics Simulations","summary":" Trained neural networks (NN) are attractive as surrogate models to replace\ncostly calculations in physical simulations, but are often unknowingly applied\nto states not adequately represented in the training dataset. We present the\nnovel technique of soft checksums for scientific machine learning, a\ngeneral-purpose method to differentiate between trustworthy predictions with\nsmall errors on in-distribution (ID) data points, and untrustworthy predictions\nwith large errors on out-of-distribution (OOD) data points. By adding a check\nnode to the existing output layer, we train the model to learn the chosen\nchecksum function encoded within the NN predictions and show that violations of\nthis function correlate with high prediction errors. As the checksum function\ndepends only on the NN predictions, we can calculate the checksum error for any\nprediction with a single forward pass, incurring negligible time and memory\ncosts. 
Additionally, we find that incorporating the checksum function into the\nloss function and exposing the NN to OOD data points during the training\nprocess improves separation between ID and OOD predictions. By applying soft\nchecksums to a physically complex and high-dimensional non-local thermodynamic\nequilibrium atomic physics dataset, we show that a well-chosen threshold\nchecksum error can effectively separate ID and OOD predictions.\n","authors":["Casey Lauer","Robert C. Blake","Jonathan B. Freund"],"pdf_url":"https://arxiv.org/pdf/2412.03497v1.pdf","comment":"8 pages, 3 figures"},{"id":"http://arxiv.org/abs/2205.11359v3","updated":"2024-12-04T17:37:38Z","published":"2022-05-23T14:45:34Z","title":"Towards Size-Independent Generalization Bounds for Deep Operator Nets","summary":" In recent times machine learning methods have made significant advances in\nbecoming a useful tool for analyzing physical systems. A particularly active\narea in this theme has been \"physics-informed machine learning\" which focuses\non using neural nets for numerically solving differential equations. In this\nwork, we aim to advance the theory of measuring out-of-sample error while\ntraining DeepONets - which is among the most versatile ways to solve P.D.E\nsystems in one-shot. Firstly, for a class of DeepONets, we prove a bound on\ntheir Rademacher complexity which does not explicitly scale with the width of\nthe nets involved. Secondly, we use this to show how the Huber loss can be\nchosen so that for these DeepONet classes generalization error bounds can be\nobtained that have no explicit dependence on the size of the nets. 
The\neffective capacity measure for DeepONets that we thus derive is also shown to\ncorrelate with the behavior of generalization error in experiments.\n","authors":["Pulkit Gopalani","Sayar Karmakar","Dibyakanti Kumar","Anirbit Mukherjee"],"pdf_url":"https://arxiv.org/pdf/2205.11359v3.pdf","comment":"33 pages, 7 figures; Published in TMLR, December 2024"},{"id":"http://arxiv.org/abs/2412.03496v1","updated":"2024-12-04T17:36:47Z","published":"2024-12-04T17:36:47Z","title":"TRENDy: Temporal Regression of Effective Non-linear Dynamics","summary":" Spatiotemporal dynamics pervade the natural sciences, from the morphogen\ndynamics underlying patterning in animal pigmentation to the protein waves\ncontrolling cell division. A central challenge lies in understanding how\ncontrollable parameters induce qualitative changes in system behavior called\nbifurcations. This endeavor is made particularly difficult in realistic\nsettings where governing partial differential equations (PDEs) are unknown and\ndata is limited and noisy. To address this challenge, we propose TRENDy\n(Temporal Regression of Effective Nonlinear Dynamics), an equation-free\napproach to learning low-dimensional, predictive models of spatiotemporal\ndynamics. Following classical work in spatial coarse-graining, TRENDy first\nmaps input data to a low-dimensional space of effective dynamics via a cascade\nof multiscale filtering operations. Our key insight is the recognition that\nthese effective dynamics can be fit by a neural ordinary differential equation\n(NODE) having the same parameter space as the input PDE. The preceding\nfiltering operations strongly regularize the phase space of the NODE, making\nTRENDy significantly more robust to noise compared to existing methods. We\ntrain TRENDy to predict the effective dynamics of synthetic and real data\nrepresenting dynamics from across the physical and life sciences. 
We then\ndemonstrate how our framework can automatically locate both Turing and Hopf\nbifurcations in unseen regions of parameter space. We finally apply our method\nto the analysis of spatial patterning of the ocellated lizard through\ndevelopment. We found that TRENDy's effective state not only accurately\npredicts spatial changes over time but also identifies distinct pattern\nfeatures unique to different anatomical regions, highlighting the potential\ninfluence of surface geometry on reaction-diffusion mechanisms and their role\nin driving spatially varying pattern dynamics.\n","authors":["Matthew Ricci","Guy Pelc","Zoe Piran","Noa Moriel","Mor Nitzan"],"pdf_url":"https://arxiv.org/pdf/2412.03496v1.pdf","comment":"10 pages, 14 appendix pages, 5 figures, 7 appendix figures"},{"id":"http://arxiv.org/abs/2412.03491v1","updated":"2024-12-04T17:29:10Z","published":"2024-12-04T17:29:10Z","title":"Beyond algorithm hyperparameters: on preprocessing hyperparameters and\n associated pitfalls in machine learning applications","summary":" Adequately generating and evaluating prediction models based on supervised\nmachine learning (ML) is often challenging, especially for less experienced\nusers in applied research areas. Special attention is required in settings\nwhere the model generation process involves hyperparameter tuning, i.e.\ndata-driven optimization of different types of hyperparameters to improve the\npredictive performance of the resulting model. Discussions about tuning\ntypically focus on the hyperparameters of the ML algorithm (e.g., the minimum\nnumber of observations in each terminal node for a tree-based algorithm). In\nthis context, it is often neglected that hyperparameters also exist for the\npreprocessing steps that are applied to the data before it is provided to the\nalgorithm (e.g., how to handle missing feature values in the data). 
As a\nconsequence, users experimenting with different preprocessing options to\nimprove model performance may be unaware that this constitutes a form of\nhyperparameter tuning - albeit informal and unsystematic - and thus may fail to\nreport or account for this optimization. To illuminate this issue, this paper\nreviews and empirically illustrates different procedures for generating and\nevaluating prediction models, explicitly addressing the different ways\nalgorithm and preprocessing hyperparameters are typically handled by applied ML\nusers. By highlighting potential pitfalls, especially those that may lead to\nexaggerated performance claims, this review aims to further improve the quality\nof predictive modeling in ML applications.\n","authors":["Christina Sauer","Anne-Laure Boulesteix","Luzia Hanßum","Farina Hodiamont","Claudia Bausewein","Theresa Ullmann"],"pdf_url":"https://arxiv.org/pdf/2412.03491v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03487v1","updated":"2024-12-04T17:24:35Z","published":"2024-12-04T17:24:35Z","title":"Flow Matching with General Discrete Paths: A Kinetic-Optimal Perspective","summary":" The design space of discrete-space diffusion or flow generative models are\nsignificantly less well-understood than their continuous-space counterparts,\nwith many works focusing only on a simple masked construction. In this work, we\naim to take a holistic approach to the construction of discrete generative\nmodels based on continuous-time Markov chains, and for the first time, allow\nthe use of arbitrary discrete probability paths, or colloquially, corruption\nprocesses. Through the lens of optimizing the symmetric kinetic energy, we\npropose velocity formulas that can be applied to any given probability path,\ncompletely decoupling the probability and velocity, and giving the user the\nfreedom to specify any desirable probability path based on expert knowledge\nspecific to the data domain. 
Furthermore, we find that a special construction\nof mixture probability paths optimizes the symmetric kinetic energy for the\ndiscrete case. We empirically validate the usefulness of this new design space\nacross multiple modalities: text generation, inorganic material generation, and\nimage generation. We find that we can outperform the mask construction even in\ntext with kinetic-optimal mixture paths, while we can make use of\ndomain-specific constructions of the probability path over the visual domain.\n","authors":["Neta Shaul","Itai Gat","Marton Havasi","Daniel Severo","Anuroop Sriram","Peter Holderrieth","Brian Karrer","Yaron Lipman","Ricky T. Q. Chen"],"pdf_url":"https://arxiv.org/pdf/2412.03487v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03486v1","updated":"2024-12-04T17:23:35Z","published":"2024-12-04T17:23:35Z","title":"Tight PAC-Bayesian Risk Certificates for Contrastive Learning","summary":" Contrastive representation learning is a modern paradigm for learning\nrepresentations of unlabeled data via augmentations -- precisely, contrastive\nmodels learn to embed semantically similar pairs of samples (positive pairs)\ncloser than independently drawn samples (negative samples). In spite of its\nempirical success and widespread use in foundation models, statistical theory\nfor contrastive learning remains less explored. Recent works have developed\ngeneralization error bounds for contrastive losses, but the resulting risk\ncertificates are either vacuous (certificates based on Rademacher complexity or\n$f$-divergence) or require strong assumptions about samples that are\nunreasonable in practice. The present paper develops non-vacuous PAC-Bayesian\nrisk certificates for contrastive representation learning, considering the\npractical considerations of the popular SimCLR framework. 
Notably, we take into\naccount that SimCLR reuses positive pairs of augmented data as negative samples\nfor other data, thereby inducing strong dependence and making classical PAC or\nPAC-Bayesian bounds inapplicable. We further refine existing bounds on the\ndownstream classification loss by incorporating SimCLR-specific factors,\nincluding data augmentation and temperature scaling, and derive risk\ncertificates for the contrastive zero-one risk. The resulting bounds for\ncontrastive loss and downstream prediction are much tighter than those of\nprevious risk certificates, as demonstrated by experiments on CIFAR-10.\n","authors":["Anna van Elst","Debarghya Ghoshdastidar"],"pdf_url":"https://arxiv.org/pdf/2412.03486v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03483v1","updated":"2024-12-04T17:20:01Z","published":"2024-12-04T17:20:01Z","title":"Convolutional Neural Networks and Mixture of Experts for Intrusion\n Detection in 5G Networks and beyond","summary":" The advent of 6G/NextG networks comes along with a series of benefits,\nincluding extreme capacity, reliability, and efficiency. However, these\nnetworks may become vulnerable to new security threats. Therefore, 6G/NextG\nnetworks must be equipped with advanced Artificial Intelligence algorithms, in\norder to evade these attacks. Existing studies on the intrusion detection task\nrely on the train of shallow machine learning classifiers, including Logistic\nRegression, Decision Trees, and so on, yielding suboptimal performance. Others\nare based on deep neural networks consisting of static components, which are\nnot conditional on the input. This limits their representation power and\nefficiency. To resolve these issues, we present the first study integrating\nMixture of Experts (MoE) for identifying malicious traffic. 
Specifically, we\nuse network traffic data and convert the 1D array of features into a 2D matrix.\nNext, we pass this matrix through convolutional neural network (CNN) layers\nfollowed by batch normalization and max pooling layers. After obtaining the\nrepresentation vector via the CNN layers, a sparsely gated MoE layer is used.\nThis layer consists of a set of experts (dense layers) and a router, where the\nrouter assigns weights to the output of each expert. Sparsity is achieved by\nchoosing the most relevant experts of the total ones. Finally, we perform a\nseries of ablation experiments to prove the effectiveness of our proposed\nmodel. Experiments are conducted on the 5G-NIDD dataset, a network intrusion\ndetection dataset generated from a real 5G test network. Results show that our\nintroduced approach reaches weighted F1-score up to 99.95% achieving comparable\nperformance to existing approaches. Findings also show that our proposed model\nachieves multiple advantages over state-of-the-art approaches.\n","authors":["Loukas Ilias","George Doukas","Vangelis Lamprou","Christos Ntanos","Dimitris Askounis"],"pdf_url":"https://arxiv.org/pdf/2412.03483v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.04919v2","updated":"2024-12-04T17:18:05Z","published":"2024-05-08T09:41:25Z","title":"Fast Computation of Leave-One-Out Cross-Validation for $k$-NN Regression","summary":" We describe a fast computation method for leave-one-out cross-validation\n(LOOCV) for $k$-nearest neighbours ($k$-NN) regression. We show that, under a\ntie-breaking condition for nearest neighbours, the LOOCV estimate of the mean\nsquare error for $k$-NN regression is identical to the mean square error of\n$(k+1)$-NN regression evaluated on the training data, multiplied by the scaling\nfactor $(k+1)^2/k^2$. 
Therefore, to compute the LOOCV score, one only needs to\nfit $(k+1)$-NN regression only once, and does not need to repeat\ntraining-validation of $k$-NN regression for the number of training data.\nNumerical experiments confirm the validity of the fast computation method.\n","authors":["Motonobu Kanagawa"],"pdf_url":"https://arxiv.org/pdf/2405.04919v2.pdf","comment":"To appear in Transactions of Machine Learning Research (TMLR)"},{"id":"http://arxiv.org/abs/2411.11976v2","updated":"2024-12-04T17:13:22Z","published":"2024-11-18T19:06:01Z","title":"Coverage-Constrained Human-AI Cooperation with Multiple Experts","summary":" Human-AI cooperative classification (HAI-CC) approaches aim to develop hybrid\nintelligent systems that enhance decision-making in various high-stakes\nreal-world scenarios by leveraging both human expertise and AI capabilities.\nCurrent HAI-CC methods primarily focus on learning-to-defer (L2D), where\ndecisions are deferred to human experts, and learning-to-complement (L2C),\nwhere AI and human experts make predictions cooperatively. However, a notable\nresearch gap remains in effectively exploring both L2D and L2C under diverse\nexpert knowledge to improve decision-making, particularly when constrained by\nthe cooperation cost required to achieve a target probability for AI-only\nselection (i.e., coverage). In this paper, we address this research gap by\nproposing the Coverage-constrained Learning to Defer and Complement with\nSpecific Experts (CL2DC) method. CL2DC makes final decisions through either AI\nprediction alone or by deferring to or complementing a specific expert,\ndepending on the input data. Furthermore, we propose a coverage-constrained\noptimisation to control the cooperation cost, ensuring it approximates a target\nprobability for AI-only selection. This approach enables an effective\nassessment of system performance within a specified budget. 
Also, CL2DC is\ndesigned to address scenarios where training sets contain multiple noisy-label\nannotations without any clean-label references. Comprehensive evaluations on\nboth synthetic and real-world datasets demonstrate that CL2DC achieves superior\nperformance compared to state-of-the-art HAI-CC methods.\n","authors":["Zheng Zhang","Cuong Nguyen","Kevin Wells","Thanh-Toan Do","David Rosewarne","Gustavo Carneiro"],"pdf_url":"https://arxiv.org/pdf/2411.11976v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.08511v5","updated":"2024-12-04T17:10:06Z","published":"2024-10-11T04:23:56Z","title":"Distributionally robust self-supervised learning for tabular data","summary":" Machine learning (ML) models trained using Empirical Risk Minimization (ERM)\noften exhibit systematic errors on specific subpopulations of tabular data,\nknown as error slices. Learning robust representation in presence of error\nslices is challenging, especially in self-supervised settings during the\nfeature reconstruction phase, due to high cardinality features and the\ncomplexity of constructing error sets. Traditional robust representation\nlearning methods are largely focused on improving worst group performance in\nsupervised setting in computer vision, leaving a gap in approaches tailored for\ntabular data. We address this gap by developing a framework to learn robust\nrepresentation in tabular data during self-supervised pre-training. Our\napproach utilizes an encoder-decoder model trained with Masked Language\nModeling (MLM) loss to learn robust latent representations. This paper applies\nthe Just Train Twice (JTT) and Deep Feature Reweighting (DFR) methods during\nthe pre-training phase for tabular data. These methods fine-tune the ERM\npre-trained model by up-weighting error-prone samples or creating balanced\ndatasets for specific categorical features. 
This results in specialized models\nfor each feature, which are then used in an ensemble approach to enhance\ndownstream classification performance. This methodology improves robustness\nacross slices, thus enhancing overall generalization performance. Extensive\nexperiments across various datasets demonstrate the efficacy of our approach.\nThe code is available:\n\\url{https://github.com/amazon-science/distributionally-robust-self-supervised-learning-for-tabular-data}.\n","authors":["Shantanu Ghosh","Tiankang Xie","Mikhail Kuznetsov"],"pdf_url":"https://arxiv.org/pdf/2410.08511v5.pdf","comment":"TRL Workshop@NeurIPS2024"},{"id":"http://arxiv.org/abs/2410.13928v2","updated":"2024-12-04T17:03:13Z","published":"2024-10-17T17:56:01Z","title":"Automatically Interpreting Millions of Features in Large Language Models","summary":" While the activations of neurons in deep neural networks usually do not have\na simple human-understandable interpretation, sparse autoencoders (SAEs) can be\nused to transform these activations into a higher-dimensional latent space\nwhich may be more easily interpretable. However, these SAEs can have millions\nof distinct latent features, making it infeasible for humans to manually\ninterpret each one. In this work, we build an open-source automated pipeline to\ngenerate and evaluate natural language explanations for SAE features using\nLLMs. We test our framework on SAEs of varying sizes, activation functions, and\nlosses, trained on two different open-weight LLMs. We introduce five new\ntechniques to score the quality of explanations that are cheaper to run than\nthe previous state of the art. One of these techniques, intervention scoring,\nevaluates the interpretability of the effects of intervening on a feature,\nwhich we find explains features that are not recalled by existing methods. 
We\npropose guidelines for generating better explanations that remain valid for a\nbroader set of activating contexts, and discuss pitfalls with existing scoring\ntechniques. We use our explanations to measure the semantic similarity of\nindependently trained SAEs, and find that SAEs trained on nearby layers of the\nresidual stream are highly similar. Our large-scale analysis confirms that SAE\nlatents are indeed much more interpretable than neurons, even when neurons are\nsparsified using top-$k$ postprocessing. Our code is available at\nhttps://github.com/EleutherAI/sae-auto-interp, and our explanations are\navailable at\nhttps://huggingface.co/datasets/EleutherAI/auto_interp_explanations.\n","authors":["Gonçalo Paulo","Alex Mallen","Caden Juang","Nora Belrose"],"pdf_url":"https://arxiv.org/pdf/2410.13928v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.08026v2","updated":"2024-12-04T16:59:38Z","published":"2024-10-10T15:23:21Z","title":"Generalization Bounds and Model Complexity for Kolmogorov-Arnold\n Networks","summary":" Kolmogorov-Arnold Network (KAN) is a network structure recently proposed by\nLiu et al. (2024) that offers improved interpretability and a more parsimonious\ndesign in many science-oriented tasks compared to multi-layer perceptrons. This\nwork provides a rigorous theoretical analysis of KAN by establishing\ngeneralization bounds for KAN equipped with activation functions that are\neither represented by linear combinations of basis functions or lying in a\nlow-rank Reproducing Kernel Hilbert Space (RKHS). In the first case, the\ngeneralization bound accommodates various choices of basis functions in forming\nthe activation functions in each layer of KAN and is adapted to different\noperator norms at each layer. 
For a particular choice of operator norms, the\nbound scales with the $l_1$ norm of the coefficient matrices and the Lipschitz\nconstants for the activation functions, and it has no dependence on\ncombinatorial parameters (e.g., number of nodes) outside of logarithmic\nfactors. Moreover, our result does not require the boundedness assumption on\nthe loss function and, hence, is applicable to a general class of\nregression-type loss functions. In the low-rank case, the generalization bound\nscales polynomially with the underlying ranks as well as the Lipschitz\nconstants of the activation functions in each layer. These bounds are\nempirically investigated for KANs trained with stochastic gradient descent on\nsimulated and real data sets. The numerical results demonstrate the practical\nrelevance of these bounds.\n","authors":["Xianyang Zhang","Huijuan Zhou"],"pdf_url":"https://arxiv.org/pdf/2410.08026v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03471v1","updated":"2024-12-04T16:59:37Z","published":"2024-12-04T16:59:37Z","title":"Cluster Specific Representation Learning","summary":" Representation learning aims to extract meaningful lower-dimensional\nembeddings from data, known as representations. Despite its widespread\napplication, there is no established definition of a ``good'' representation.\nTypically, the representation quality is evaluated based on its performance in\ndownstream tasks such as clustering, de-noising, etc. However, this\ntask-specific approach has a limitation where a representation that performs\nwell for one task may not necessarily be effective for another. This highlights\nthe need for a more agnostic formulation, which is the focus of our work. We\npropose a downstream-agnostic formulation: when inherent clusters exist in the\ndata, the representations should be specific to each cluster. Under this idea,\nwe develop a meta-algorithm that jointly learns cluster-specific\nrepresentations and cluster assignments. 
As our approach is easy to integrate\nwith any representation learning framework, we demonstrate its effectiveness in\nvarious setups, including Autoencoders, Variational Autoencoders, Contrastive\nlearning models, and Restricted Boltzmann Machines. We qualitatively compare\nour cluster-specific embeddings to standard embeddings and downstream tasks\nsuch as de-noising and clustering. While our method slightly increases runtime\nand parameters compared to the standard model, the experiments clearly show\nthat it extracts the inherent cluster structures in the data, resulting in\nimproved performance in relevant applications.\n","authors":["Mahalakshmi Sabanayagam","Omar Al-Dabooni","Pascal Esser"],"pdf_url":"https://arxiv.org/pdf/2412.03471v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03465v1","updated":"2024-12-04T16:54:58Z","published":"2024-12-04T16:54:58Z","title":"YT-30M: A multi-lingual multi-category dataset of YouTube comments","summary":" This paper introduces two large-scale multilingual comment datasets, YT-30M\n(and YT-100K) from YouTube. The analysis in this paper is performed on a\nsmaller sample (YT-100K) of YT-30M. Both the datasets: YT-30M (full) and\nYT-100K (randomly selected 100K sample from YT-30M) are publicly released for\nfurther research. YT-30M (YT-100K) contains 32236173 (108694) comments posted\nby YouTube channel that belong to YouTube categories. 
Each comment is\nassociated with a video ID, comment ID, commentor name, commentor channel ID,\ncomment text, upvotes, original channel ID and category of the YouTube channel\n(e.g., 'News & Politics', 'Science & Technology', etc.).\n","authors":["Hridoy Sankar Dutta"],"pdf_url":"https://arxiv.org/pdf/2412.03465v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03464v1","updated":"2024-12-04T16:52:44Z","published":"2024-12-04T16:52:44Z","title":"Validity and efficiency of the conformal CUSUM procedure","summary":" In this paper we study the validity and efficiency of a conformal version of\nthe CUSUM procedure for change detection both experimentally and theoretically.\n","authors":["Vladimir Vovk","Ilia Nouretdinov","Alex Gammerman"],"pdf_url":"https://arxiv.org/pdf/2412.03464v1.pdf","comment":"19 pages, 7 figures"},{"id":"http://arxiv.org/abs/2412.03442v1","updated":"2024-12-04T16:30:35Z","published":"2024-12-04T16:30:35Z","title":"State Frequency Estimation for Anomaly Detection","summary":" Many works have studied the efficacy of state machines for detecting\nanomalies within NetFlows. These works typically learn a model from unlabeled\ndata and compute anomaly scores for arbitrary traces based on their likelihood\nof occurrence or how well they fit within the model. However, these methods do\nnot dynamically adapt their scores based on the traces seen at test time. This\nbecomes a problem when an adversary produces seemingly common traces in their\nattack, causing the model to miss the detection by assigning low anomaly\nscores. We propose SEQUENT, a new approach that uses the state visit frequency\nto adapt its scoring for anomaly detection dynamically. SEQUENT subsequently\nuses the scores to generate root causes for anomalies. These allow the grouping\nof alarms and simplify the analysis of anomalies. 
Our evaluation of SEQUENT on\nthree NetFlow datasets indicates that our approach outperforms existing\nmethods, demonstrating its effectiveness in detecting anomalies.\n","authors":["Clinton Cao","Agathe Blaise","Annibale Panichella","Sicco Verwer"],"pdf_url":"https://arxiv.org/pdf/2412.03442v1.pdf","comment":"9 pages"},{"id":"http://arxiv.org/abs/2412.03441v1","updated":"2024-12-04T16:30:03Z","published":"2024-12-04T16:30:03Z","title":"PBP: Post-training Backdoor Purification for Malware Classifiers","summary":" In recent years, the rise of machine learning (ML) in cybersecurity has\nbrought new challenges, including the increasing threat of backdoor poisoning\nattacks on ML malware classifiers. For instance, adversaries could inject\nmalicious samples into public malware repositories, contaminating the training\ndata and potentially misclassifying malware by the ML model. Current\ncountermeasures predominantly focus on detecting poisoned samples by leveraging\ndisagreements within the outputs of a diverse set of ensemble models on\ntraining data points. However, these methods are not suitable for scenarios\nwhere Machine Learning-as-a-Service (MLaaS) is used or when users aim to remove\nbackdoors from a model after it has been trained. Addressing this scenario, we\nintroduce PBP, a post-training defense for malware classifiers that mitigates\nvarious types of backdoor embeddings without assuming any specific backdoor\nembedding mechanism. Our method exploits the influence of backdoor attacks on\nthe activation distribution of neural networks, independent of the\ntrigger-embedding method. In the presence of a backdoor attack, the activation\ndistribution of each layer is distorted into a mixture of distributions. By\nregulating the statistics of the batch normalization layers, we can guide a\nbackdoored model to perform similarly to a clean one. 
Our method demonstrates\nsubstantial advantages over several state-of-the-art methods, as evidenced by\nexperiments on two datasets, two types of backdoor methods, and various attack\nconfigurations. Notably, our approach requires only a small portion of the\ntraining data -- only 1\\% -- to purify the backdoor and reduce the attack\nsuccess rate from 100\\% to almost 0\\%, a 100-fold improvement over the baseline\nmethods. Our code is available at\n\\url{https://github.com/judydnguyen/pbp-backdoor-purification-official}.\n","authors":["Dung Thuy Nguyen","Ngoc N. Tran","Taylor T. Johnson","Kevin Leach"],"pdf_url":"https://arxiv.org/pdf/2412.03441v1.pdf","comment":"Accepted at NDSS 2025"},{"id":"http://arxiv.org/abs/2412.03430v1","updated":"2024-12-04T16:19:47Z","published":"2024-12-04T16:19:47Z","title":"SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale\n Spectral Diffusion Model","summary":" Recent advancements in generative models have significantly enhanced talking\nface video generation, yet singing video generation remains underexplored. The\ndifferences between human talking and singing limit the performance of existing\ntalking face video generation models when applied to singing. The fundamental\ndifferences between talking and singing-specifically in audio characteristics\nand behavioral expressions-limit the effectiveness of existing models. We\nobserve that the differences between singing and talking audios manifest in\nterms of frequency and amplitude. To address this, we have designed a\nmulti-scale spectral module to help the model learn singing patterns in the\nspectral domain. Additionally, we develop a spectral-filtering module that aids\nthe model in learning the human behaviors associated with singing audio. These\ntwo modules are integrated into the diffusion model to enhance singing video\ngeneration performance, resulting in our proposed model, SINGER. 
Furthermore,\nthe lack of high-quality real-world singing face videos has hindered the\ndevelopment of the singing video generation community. To address this gap, we\nhave collected an in-the-wild audio-visual singing dataset to facilitate\nresearch in this area. Our experiments demonstrate that SINGER is capable of\ngenerating vivid singing videos and outperforms state-of-the-art methods in\nboth objective and subjective evaluations.\n","authors":["Yan Li","Ziya Zhou","Zhiqiang Wang","Wei Xue","Wenhan Luo","Yike Guo"],"pdf_url":"https://arxiv.org/pdf/2412.03430v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03427v1","updated":"2024-12-04T16:17:09Z","published":"2024-12-04T16:17:09Z","title":"Assessing Foundation Models' Transferability to Physiological Signals in\n Precision Medicine","summary":" The success of precision medicine requires computational models that can\neffectively process and interpret diverse physiological signals across\nheterogeneous patient populations. While foundation models have demonstrated\nremarkable transfer capabilities across various domains, their effectiveness in\nhandling individual-specific physiological signals - crucial for precision\nmedicine - remains largely unexplored. This work introduces a systematic\npipeline for rapidly and efficiently evaluating foundation models' transfer\ncapabilities in medical contexts. Our pipeline employs a three-stage approach.\nFirst, it leverages physiological simulation software to generate diverse,\nclinically relevant scenarios, particularly focusing on data-scarce medical\nconditions. This simulation-based approach enables both targeted capability\nassessment and subsequent model fine-tuning. Second, the pipeline projects\nthese simulated signals through the foundation model to obtain embeddings,\nwhich are then evaluated using linear methods. 
This evaluation quantifies the\nmodel's ability to capture three critical aspects: physiological feature\nindependence, temporal dynamics preservation, and medical scenario\ndifferentiation. Finally, the pipeline validates these representations through\nspecific downstream medical tasks. Initial testing of our pipeline on the\nMoirai time series foundation model revealed significant limitations in\nphysiological signal processing, including feature entanglement, temporal\ndynamics distortion, and reduced scenario discrimination. These findings\nsuggest that current foundation models may require substantial architectural\nmodifications or targeted fine-tuning before deployment in clinical settings.\n","authors":["Matthias Christenson","Cove Geary","Brian Locke","Pranav Koirala","Warren Woodrich Pettine"],"pdf_url":"https://arxiv.org/pdf/2412.03427v1.pdf","comment":"Presented at the precision medicine workshop at the AI in Medicine\n conference (2024) in Salt Lake City"},{"id":"http://arxiv.org/abs/2406.06671v2","updated":"2024-12-04T16:04:07Z","published":"2024-06-10T18:00:00Z","title":"Controlling Counterfactual Harm in Decision Support Systems Based on\n Prediction Sets","summary":" Decision support systems based on prediction sets help humans solve\nmulticlass classification tasks by narrowing down the set of potential label\nvalues to a subset of them, namely a prediction set, and asking them to always\npredict label values from the prediction sets. While this type of systems have\nbeen proven to be effective at improving the average accuracy of the\npredictions made by humans, by restricting human agency, they may cause\nharm$\\unicode{x2014}$a human who has succeeded at predicting the ground-truth\nlabel of an instance on their own may have failed had they used these systems.\nIn this paper, our goal is to control how frequently a decision support system\nbased on prediction sets may cause harm, by design. 
To this end, we start by\ncharacterizing the above notion of harm using the theoretical framework of\nstructural causal models. Then, we show that, under a natural, albeit\nunverifiable, monotonicity assumption, we can estimate how frequently a system\nmay cause harm using only predictions made by humans on their own. Further, we\nalso show that, under a weaker monotonicity assumption, which can be verified\nexperimentally, we can bound how frequently a system may cause harm again using\nonly predictions made by humans on their own. Building upon these assumptions,\nwe introduce a computational framework to design decision support systems based\non prediction sets that are guaranteed to cause harm less frequently than a\nuser-specified value using conformal risk control. We validate our framework\nusing real human predictions from two different human subject studies and show\nthat, in decision support systems based on prediction sets, there is a\ntrade-off between accuracy and counterfactual harm.\n","authors":["Eleni Straitouri","Suhas Thejaswi","Manuel Gomez Rodriguez"],"pdf_url":"https://arxiv.org/pdf/2406.06671v2.pdf","comment":"Accepted at the ICML 2024 Workshop on Humans, Algorithmic\n Decision-Making and Society and published at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2402.17826v3","updated":"2024-12-04T16:03:04Z","published":"2024-02-27T19:00:01Z","title":"Prediction-Powered Ranking of Large Language Models","summary":" Large language models are often ranked according to their level of alignment\nwith human preferences -- a model is better than other models if its outputs\nare more frequently preferred by humans. One of the popular ways to elicit\nhuman preferences utilizes pairwise comparisons between the outputs provided by\ndifferent models to the same inputs. 
However, since gathering pairwise\ncomparisons by humans is costly and time-consuming, it has become a common\npractice to gather pairwise comparisons by a strong large language model -- a\nmodel strongly aligned with human preferences. Surprisingly, practitioners\ncannot currently measure the uncertainty that any mismatch between human and\nmodel preferences may introduce in the constructed rankings. In this work, we\ndevelop a statistical framework to bridge this gap. Given a (small) set of\npairwise comparisons by humans and a large set of pairwise comparisons by a\nmodel, our framework provides a rank-set -- a set of possible ranking positions\n-- for each of the models under comparison. Moreover, it guarantees that, with\na probability greater than or equal to a user-specified value, the rank-sets\ncover the true ranking consistent with the distribution of human pairwise\npreferences asymptotically. Using pairwise comparisons made by humans in the\nLMSYS Chatbot Arena platform and pairwise comparisons made by three strong\nlarge language models, we empirically demonstrate the effectivity of our\nframework and show that the rank-sets constructed using only pairwise\ncomparisons by the strong large language models are often inconsistent with\n(the distribution of) human pairwise preferences.\n","authors":["Ivi Chatzi","Eleni Straitouri","Suhas Thejaswi","Manuel Gomez Rodriguez"],"pdf_url":"https://arxiv.org/pdf/2402.17826v3.pdf","comment":"Published at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.03417v1","updated":"2024-12-04T15:53:45Z","published":"2024-12-04T15:53:45Z","title":"Learning Semantic Association Rules from Internet of Things Data","summary":" Association Rule Mining (ARM) is the task of discovering commonalities in\ndata in the form of logical implications. ARM is used in the Internet of Things\n(IoT) for different tasks including monitoring and decision-making. 
However,\nexisting methods give limited consideration to IoT-specific requirements such\nas heterogeneity and volume. Furthermore, they do not utilize important static\ndomain-specific description data about IoT systems, which is increasingly\nrepresented as knowledge graphs. In this paper, we propose a novel ARM pipeline\nfor IoT data that utilizes both dynamic sensor data and static IoT system\nmetadata. Furthermore, we propose an Autoencoder-based Neurosymbolic ARM method\n(Aerial) as part of the pipeline to address the high volume of IoT data and\nreduce the total number of rules that are resource-intensive to process. Aerial\nlearns a neural representation of a given data and extracts association rules\nfrom this representation by exploiting the reconstruction (decoding) mechanism\nof an autoencoder. Extensive evaluations on 3 IoT datasets from 2 domains show\nthat ARM on both static and dynamic IoT data results in more generically\napplicable rules while Aerial can learn a more concise set of high-quality\nassociation rules than the state-of-the-art with full coverage over the\ndatasets.\n","authors":["Erkan Karabulut","Paul Groth","Victoria Degeler"],"pdf_url":"https://arxiv.org/pdf/2412.03417v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.03752v2","updated":"2024-12-04T15:53:19Z","published":"2024-11-06T08:27:49Z","title":"Deferred Poisoning: Making the Model More Vulnerable via Hessian\n Singularization","summary":" Recent studies have shown that deep learning models are very vulnerable to\npoisoning attacks. Many defense methods have been proposed to address this\nissue. However, traditional poisoning attacks are not as threatening as\ncommonly believed. This is because they often cause differences in how the\nmodel performs on the training set compared to the validation set. Such\ninconsistency can alert defenders that their data has been poisoned, allowing\nthem to take the necessary defensive actions. 
In this paper, we introduce a\nmore threatening type of poisoning attack called the Deferred Poisoning Attack.\nThis new attack allows the model to function normally during the training and\nvalidation phases but makes it very sensitive to evasion attacks or even\nnatural noise. We achieve this by ensuring the poisoned model's loss function\nhas a similar value as a normally trained model at each input sample but with a\nlarge local curvature. A similar model loss ensures that there is no obvious\ninconsistency between the training and validation accuracy, demonstrating high\nstealthiness. On the other hand, the large curvature implies that a small\nperturbation may cause a significant increase in model loss, leading to\nsubstantial performance degradation, which reflects a worse robustness. We\nfulfill this purpose by making the model have singular Hessian information at\nthe optimal point via our proposed Singularization Regularization term. We have\nconducted both theoretical and empirical analyses of the proposed method and\nvalidated its effectiveness through experiments on image classification tasks.\nFurthermore, we have confirmed the hazards of this form of poisoning attack\nunder more general scenarios using natural noise, offering a new perspective\nfor research in the field of security.\n","authors":["Yuhao He","Jinyu Tian","Xianwei Zheng","Li Dong","Yuanman Li","Jiantao Zhou"],"pdf_url":"https://arxiv.org/pdf/2411.03752v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03405v1","updated":"2024-12-04T15:36:20Z","published":"2024-12-04T15:36:20Z","title":"Deep Operator BSDE: a Numerical Scheme to Approximate the Solution\n Operators","summary":" Motivated by dynamic risk measures and conditional $g$-expectations, in this\nwork we propose a numerical method to approximate the solution operator given\nby a Backward Stochastic Differential Equation (BSDE). 
The main ingredients for\nthis are the Wiener chaos decomposition and the classical Euler scheme for\nBSDEs. We show convergence of this scheme under very mild assumptions, and\nprovide a rate of convergence in more restrictive cases. We then implement it\nusing neural networks, and we present several numerical examples where we can\ncheck the accuracy of the method.\n","authors":["Giulia Di Nunno","Pere Díaz Lozano"],"pdf_url":"https://arxiv.org/pdf/2412.03405v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.09695v3","updated":"2024-12-04T15:35:48Z","published":"2024-10-13T02:10:26Z","title":"Can In-context Learning Really Generalize to Out-of-distribution Tasks?","summary":" In this work, we explore the mechanism of in-context learning (ICL) on\nout-of-distribution (OOD) tasks that were not encountered during training. To\nachieve this, we conduct synthetic experiments where the objective is to learn\nOOD mathematical functions through ICL using a GPT-2 model. We reveal that\nTransformers may struggle to learn OOD task functions through ICL.\nSpecifically, ICL performance resembles implementing a function within the\npretraining hypothesis space and optimizing it with gradient descent based on\nthe in-context examples. Additionally, we investigate ICL's well-documented\nability to learn unseen abstract labels in context. We demonstrate that such\nability only manifests in the scenarios without distributional shifts and,\ntherefore, may not serve as evidence of new-task-learning ability. Furthermore,\nwe assess ICL's performance on OOD tasks when the model is pretrained on\nmultiple tasks. Both empirical and theoretical analyses demonstrate the\nexistence of the \\textbf{low-test-error preference} of ICL, where it tends to\nimplement the pretraining function that yields low test error in the testing\ncontext. We validate this through numerical experiments. 
This new theoretical\nresult, combined with our empirical findings, elucidates the mechanism of ICL\nin addressing OOD tasks.\n","authors":["Qixun Wang","Yifei Wang","Yisen Wang","Xianghua Ying"],"pdf_url":"https://arxiv.org/pdf/2410.09695v3.pdf","comment":"Preprint, under review"},{"id":"http://arxiv.org/abs/2305.05518v2","updated":"2024-12-04T15:32:32Z","published":"2023-05-09T15:16:50Z","title":"Minimal Learning Machine for Multi-Label Learning","summary":" Distance-based supervised method, the minimal learning machine, constructs a\npredictive model from data by learning a mapping between input and output\ndistance matrices. In this paper, we propose new methods and evaluate how their\ncore component, the distance mapping, can be adapted to multi-label learning.\nThe proposed approach is based on combining the distance mapping with an\ninverse distance weighting. Although the proposal is one of the simplest\nmethods in the multi-label learning literature, it achieves state-of-the-art\nperformance for small to moderate-sized multi-label learning problems. In\naddition to its simplicity, the proposed method is fully deterministic: Its\nhyper-parameter can be selected via ranking loss-based statistic which has a\nclosed form, thus avoiding conventional cross-validation-based hyper-parameter\ntuning. In addition, due to its simple linear distance mapping-based\nconstruction, we demonstrate that the proposed method can assess the\nuncertainty of the predictions for multi-label classification, which is a\nvaluable capability for data-centric machine learning pipelines.\n","authors":["Joonas Hämäläinen","Antoine Hubermont","Amauri Souza","César L. C. Mattos","João P. P. 
Gomes","Tommi Kärkkäinen"],"pdf_url":"https://arxiv.org/pdf/2305.05518v2.pdf","comment":"Submitted, 29 pages"},{"id":"http://arxiv.org/abs/2412.03393v1","updated":"2024-12-04T15:22:54Z","published":"2024-12-04T15:22:54Z","title":"Can neural operators always be continuously discretized?","summary":" We consider the problem of discretization of neural operators between Hilbert\nspaces in a general framework including skip connections. We focus on bijective\nneural operators through the lens of diffeomorphisms in infinite dimensions.\nFramed using category theory, we give a no-go theorem that shows that\ndiffeomorphisms between Hilbert spaces or Hilbert manifolds may not admit any\ncontinuous approximations by diffeomorphisms on finite-dimensional spaces, even\nif the approximations are nonlinear. The natural way out is the introduction of\nstrongly monotone diffeomorphisms and layerwise strongly monotone neural\noperators which have continuous approximations by strongly monotone\ndiffeomorphisms on finite-dimensional spaces. For these, one can guarantee\ndiscretization invariance, while ensuring that finite-dimensional\napproximations converge not only as sequences of functions, but that their\nrepresentations converge in a suitable sense as well. Finally, we show that\nbilipschitz neural operators may always be written in the form of an\nalternating composition of strongly monotone neural operators, plus a simple\nisometry. Thus we realize a rigorous platform for discretization of a\ngeneralization of a neural operator. We also show that neural operators of this\ntype may be approximated through the composition of finite-rank residual neural\noperators, where each block is strongly monotone, and may be inverted locally\nvia iteration. We conclude by providing a quantitative approximation result for\nthe discretization of general bilipschitz neural operators.\n","authors":["Takashi Furuya","Michael Puthawala","Maarten V. 
de Hoop","Matti Lassas"],"pdf_url":"https://arxiv.org/pdf/2412.03393v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19732v4","updated":"2024-12-04T15:20:35Z","published":"2024-05-30T06:24:14Z","title":"LLM as a Complementary Optimizer to Gradient Descent: A Case Study in\n Prompt Tuning","summary":" Mastering a skill generally relies on both hands-on experience from doers and\ninsightful, high-level guidance by mentors. Will this strategy also work well\nfor solving complex non-convex optimization problems? Here, a common\ngradient-based optimizer acts like a disciplined doer, making locally optimal\nupdates at each step. Large Language Models (LLMs) can also search for better\nsolutions by inferring from natural language instructions, akin to a high-level\nmentor. In this paper, we show that these two participators are complementary\nto each other and can effectively collaborate as a combined optimization\nframework. The collaborative optimization is achieved by alternating between\nthe gradient-based and LLM-based optimizers. We instruct LLMs to generate\npossibly improved solutions by taking parameter trajectories recorded during\nthe previous stage of gradient-based optimization into account. Inferred\nresults of LLMs are used as restarting points for the next stage of gradient\noptimization. We verify the effectiveness of this optimization framework on\nprompt tuning. By leveraging both the locally rigorous gradient-based optimizer\nand the high-level deductive LLM-based optimizer, the combined optimization\nmethod consistently yields improvements over competitive baselines on a variety\nof tasks. Our results demonstrate the synergistic effect of conventional\ngradient-based optimization and the inference ability of LLMs. 
The code is\nreleased at https://github.com/guozix/LLM-catalyst.\n","authors":["Zixian Guo","Ming Liu","Zhilong Ji","Jinfeng Bai","Yiwen Guo","Wangmeng Zuo"],"pdf_url":"https://arxiv.org/pdf/2405.19732v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03391v1","updated":"2024-12-04T15:20:12Z","published":"2024-12-04T15:20:12Z","title":"Risk-aware Classification via Uncertainty Quantification","summary":" Autonomous and semi-autonomous systems are using deep learning models to\nimprove decision-making. However, deep classifiers can be overly confident in\ntheir incorrect predictions, a major issue especially in safety-critical\ndomains. The present study introduces three foundational desiderata for\ndeveloping real-world risk-aware classification systems. Expanding upon the\npreviously proposed Evidential Deep Learning (EDL), we demonstrate the unity\nbetween these principles and EDL's operational attributes. We then augment EDL\nempowering autonomous agents to exercise discretion during structured\ndecision-making when uncertainty and risks are inherent. We rigorously examine\nempirical scenarios to substantiate these theoretical innovations. In contrast\nto existing risk-aware classifiers, our proposed methodologies consistently\nexhibit superior performance, underscoring their transformative potential in\nrisk-conscious classification strategies.\n","authors":["Murat Sensoy","Lance M. 
Kaplan","Simon Julier","Maryam Saleki","Federico Cerutti"],"pdf_url":"https://arxiv.org/pdf/2412.03391v1.pdf","comment":"Accepted for publication in Expert Systems with Applications"},{"id":"http://arxiv.org/abs/2412.03385v1","updated":"2024-12-04T15:12:00Z","published":"2024-12-04T15:12:00Z","title":"Reactive Orchestration for Hierarchical Federated Learning Under a\n Communication Cost Budget","summary":" Deploying a Hierarchical Federated Learning (HFL) pipeline across the\ncomputing continuum (CC) requires careful organization of participants into a\nhierarchical structure with intermediate aggregation nodes between FL clients\nand the global FL server. This is challenging to achieve due to (i) cost\nconstraints, (ii) varying data distributions, and (iii) the volatile operating\nenvironment of the CC. In response to these challenges, we present a framework\nfor the adaptive orchestration of HFL pipelines, designed to be reactive to\nclient churn and infrastructure-level events, while balancing communication\ncost and ML model accuracy. Our mechanisms identify and react to events that\ncause HFL reconfiguration actions at runtime, building on multi-level\nmonitoring information (model accuracy, resource availability, resource cost).\nMoreover, our framework introduces a generic methodology for estimating\nreconfiguration costs to continuously re-evaluate the quality of adaptation\nactions, while being extensible to optimize for various HFL performance\ncriteria. 
By extending the Kubernetes ecosystem, our framework demonstrates the\nability to react promptly and effectively to changes in the operating\nenvironment, making the best of the available communication cost budget and\neffectively balancing costs and ML performance at runtime.\n","authors":["Ivan Čilić","Anna Lackinger","Pantelis Frangoudis","Ivana Podnar Žarko","Alireza Furutanpey","Ilir Murturi","Schahram Dustdar"],"pdf_url":"https://arxiv.org/pdf/2412.03385v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03381v1","updated":"2024-12-04T15:07:58Z","published":"2024-12-04T15:07:58Z","title":"Classical Shadows with Improved Median-of-Means Estimation","summary":" The classical shadows protocol, introduced by Huang et al. [Nat. Phys. 16,\n1050 (2020)], makes use of the median-of-means (MoM) estimator to efficiently\nestimate the expectation values of $M$ observables with failure probability\n$\\delta$ using only $\\mathcal{O}(\\log(M/\\delta))$ measurements. In their\nanalysis, Huang et al. used loose constants in their asymptotic performance\nbounds for simplicity. However, the specific values of these constants can\nsignificantly affect the number of shots used in practical implementations. To\naddress this, we studied a modified MoM estimator proposed by Minsker [PMLR\n195, 5925 (2023)] that uses optimal constants and involves a U-statistic over\nthe data set. For efficient estimation, we implemented two types of incomplete\nU-statistics estimators, the first based on random sampling and the second\nbased on cyclically permuted sampling. We compared the performance of the\noriginal and modified estimators when used with the classical shadows protocol\nwith single-qubit Clifford unitaries (Pauli measurements) for an Ising spin\nchain, and global Clifford unitaries (Clifford measurements) for the\nGreenberger-Horne-Zeilinger (GHZ) state. 
While the original estimator\noutperformed the modified estimators for Pauli measurements, the modified\nestimators showed improved performance over the original estimator for Clifford\nmeasurements. Our findings highlight the importance of tailoring estimators to\nspecific measurement settings to optimize the performance of the classical\nshadows protocol in practical applications.\n","authors":["Winston Fu","Dax Enshan Koh","Siong Thye Goh","Jian Feng Kong"],"pdf_url":"https://arxiv.org/pdf/2412.03381v1.pdf","comment":"15 pages, 13 figures"},{"id":"http://arxiv.org/abs/2412.03375v1","updated":"2024-12-04T15:02:28Z","published":"2024-12-04T15:02:28Z","title":"Granular Ball Twin Support Vector Machine with Universum Data","summary":" Classification with support vector machines (SVM) often suffers from limited\nperformance when relying solely on labeled data from target classes and is\nsensitive to noise and outliers. Incorporating prior knowledge from Universum\ndata and more robust data representations can enhance accuracy and efficiency.\nMotivated by these findings, we propose a novel Granular Ball Twin Support\nVector Machine with Universum Data (GBU-TSVM) that extends the TSVM framework\nto leverage both Universum samples and granular ball computing during model\ntraining. Unlike existing TSVM methods, the proposed GBU-TSVM represents data\ninstances as hyper-balls rather than points in the feature space. This\ninnovative approach improves the model's robustness and efficiency,\nparticularly in handling noisy and large datasets. By grouping data points into\ngranular balls, the model achieves superior computational efficiency, increased\nnoise resistance, and enhanced interpretability. Additionally, the inclusion of\nUniversum data, which consists of samples that are not strictly from the target\nclasses, further refines the classification boundaries. 
This integration\nenriches the model with contextual information, refining classification\nboundaries and boosting overall accuracy. Experimental results on UCI benchmark\ndatasets demonstrate that the GBU-TSVM outperforms existing TSVM models in both\naccuracy and computational efficiency. These findings highlight the potential\nof the GBU-TSVM model in setting a new standard in data representation and\nclassification.\n","authors":["M. A. Ganaie","Vrushank Ahire"],"pdf_url":"https://arxiv.org/pdf/2412.03375v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.08093v2","updated":"2024-12-04T14:45:23Z","published":"2024-04-11T19:15:45Z","title":"Towards a Robust Soft Baby Robot With Rich Interaction Ability for\n Advanced Machine Learning Algorithms","summary":" Advanced machine learning algorithms require platforms that are extremely\nrobust and equipped with rich sensory feedback to handle extensive\ntrial-and-error learning without relying on strong inductive biases.\nTraditional robotic designs, while well-suited for their specific use cases,\nare often fragile when used with these algorithms. To address this gap -- and\ninspired by the vision of enabling curiosity-driven baby robots -- we present a\nnovel robotic limb designed from scratch. Our design has a hybrid soft-hard\nstructure, high redundancy with rich non-contact sensors (exclusively cameras),\nand easily replaceable failure points. Proof-of-concept experiments using two\ncontemporary reinforcement learning algorithms on a physical prototype\ndemonstrate that our design is able to succeed in a simple target-finding task\neven under simulated sensor failures, all with minimal human oversight during\nextended learning periods. We believe this design represents a concrete step\ntoward more tailored robotic designs for achieving general-purpose, generally\nintelligent robots.\n","authors":["Mohannad Alhakami","Dylan R. 
Ashley","Joel Dunham","Yanning Dai","Francesco Faccio","Eric Feron","Jürgen Schmidhuber"],"pdf_url":"https://arxiv.org/pdf/2404.08093v2.pdf","comment":"6 pages in main text + 2 pages of references, 8 figures in main text,\n 1 table in main text; source code available at\n https://github.com/dylanashley/robot-limb-testai"},{"id":"http://arxiv.org/abs/2402.01930v4","updated":"2024-12-04T14:40:21Z","published":"2024-02-02T21:58:26Z","title":"Reducing Optimism Bias in Incomplete Cooperative Games","summary":" Cooperative game theory has diverse applications in contemporary artificial\nintelligence, including domains like interpretable machine learning, resource\nallocation, and collaborative decision-making. However, specifying a\ncooperative game entails assigning values to exponentially many coalitions, and\nobtaining even a single value can be resource-intensive in practice. Yet simply\nleaving certain coalition values undisclosed introduces ambiguity regarding\nindividual contributions to the collective grand coalition. This ambiguity\noften leads to players holding overly optimistic expectations, stemming from\neither inherent biases or strategic considerations, frequently resulting in\ncollective claims exceeding the actual grand coalition value. In this paper, we\npresent a framework aimed at optimizing the sequence for revealing coalition\nvalues, with the overarching goal of efficiently closing the gap between\nplayers' expectations and achievable outcomes in cooperative games. 
Our\ncontributions are threefold: (i) we study the individual players' optimistic\ncompletions of games with missing coalition values along with the arising gap,\nand investigate its analytical characteristics that facilitate more efficient\noptimization; (ii) we develop methods to minimize this gap over classes of\ngames with a known prior by disclosing values of additional coalitions in both\noffline and online fashion; and (iii) we empirically demonstrate the\nalgorithms' performance in practical scenarios, together with an investigation\ninto the typical order of revealing coalition values.\n","authors":["Filip Úradník","David Sychrovský","Jakub Černý","Martin Černý"],"pdf_url":"https://arxiv.org/pdf/2402.01930v4.pdf","comment":"Proc. of the 23rd International Conference on Autonomous Agents and\n Multiagent Systems (AAMAS 2024)"},{"id":"http://arxiv.org/abs/2404.13040v2","updated":"2024-12-04T14:38:11Z","published":"2024-04-19T17:53:43Z","title":"Analysis of Classifier-Free Guidance Weight Schedulers","summary":" Classifier-Free Guidance (CFG) enhances the quality and condition adherence\nof text-to-image diffusion models. It operates by combining the conditional and\nunconditional predictions using a fixed weight. However, recent works vary the\nweights throughout the diffusion process, reporting superior results but\nwithout providing any rationale or analysis. By conducting comprehensive\nexperiments, this paper provides insights into CFG weight schedulers. Our\nfindings suggest that simple, monotonically increasing weight schedulers\nconsistently lead to improved performances, requiring merely a single line of\ncode. 
In addition, more complex parametrized schedulers can be optimized for\nfurther improvement, but do not generalize across different models and tasks.\n","authors":["Xi Wang","Nicolas Dufour","Nefeli Andreou","Marie-Paule Cani","Victoria Fernandez Abrevaya","David Picard","Vicky Kalogeiton"],"pdf_url":"https://arxiv.org/pdf/2404.13040v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01951v2","updated":"2024-12-04T14:20:21Z","published":"2024-12-02T20:24:17Z","title":"Self-Improvement in Language Models: The Sharpening Mechanism","summary":" Recent work in language modeling has raised the possibility of\nself-improvement, where a language models evaluates and refines its own\ngenerations to achieve higher performance without external feedback. It is\nimpossible for this self-improvement to create information that is not already\nin the model, so why should we expect that this will lead to improved\ncapabilities? We offer a new perspective on the capabilities of\nself-improvement through a lens we refer to as sharpening. Motivated by the\nobservation that language models are often better at verifying response quality\nthan they are at generating correct responses, we formalize self-improvement as\nusing the model itself as a verifier during post-training in order to\n``sharpen'' the model to one placing large mass on high-quality sequences,\nthereby amortizing the expensive inference-time computation of generating good\nsequences. We begin by introducing a new statistical framework for sharpening\nin which the learner aims to sharpen a pre-trained base policy via sample\naccess, and establish fundamental limits. Then we analyze two natural families\nof self-improvement algorithms based on SFT and RLHF. We find that (i) the\nSFT-based approach is minimax optimal whenever the initial model has sufficient\ncoverage, but (ii) the RLHF-based approach can improve over SFT-based\nself-improvement by leveraging online exploration, bypassing the need for\ncoverage. 
Finally, we empirically validate the sharpening mechanism via\ninference-time and amortization experiments. We view these findings as a\nstarting point toward a foundational understanding that can guide the design\nand evaluation of self-improvement algorithms.\n","authors":["Audrey Huang","Adam Block","Dylan J. Foster","Dhruv Rohatgi","Cyril Zhang","Max Simchowitz","Jordan T. Ash","Akshay Krishnamurthy"],"pdf_url":"https://arxiv.org/pdf/2412.01951v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03338v1","updated":"2024-12-04T14:13:38Z","published":"2024-12-04T14:13:38Z","title":"AI-Driven Day-to-Day Route Choice","summary":" Understanding travelers' route choices can help policymakers devise optimal\noperational and planning strategies for both normal and abnormal circumstances.\nHowever, existing choice modeling methods often rely on predefined assumptions\nand struggle to capture the dynamic and adaptive nature of travel behavior.\nRecently, Large Language Models (LLMs) have emerged as a promising alternative,\ndemonstrating remarkable ability to replicate human-like behaviors across\nvarious fields. Despite this potential, their capacity to accurately simulate\nhuman route choice behavior in transportation contexts remains doubtful. To\nsatisfy this curiosity, this paper investigates the potential of LLMs for route\nchoice modeling by introducing an LLM-empowered agent, \"LLMTraveler.\" This\nagent integrates an LLM as its core, equipped with a memory system that learns\nfrom past experiences and makes decisions by balancing retrieved data and\npersonality traits. 
The study systematically evaluates the LLMTraveler's\nability to replicate human-like decision-making through two stages: (1)\nanalyzing its route-switching behavior in single origin-destination (OD) pair\ncongestion game scenarios, where it demonstrates patterns align with laboratory\ndata but are not fully explained by traditional models, and (2) testing its\ncapacity to model day-to-day (DTD) adaptive learning behaviors on the Ortuzar\nand Willumsen (OW) network, producing results comparable to Multinomial Logit\n(MNL) and Reinforcement Learning (RL) models. These experiments demonstrate\nthat the framework can partially replicate human-like decision-making in route\nchoice while providing natural language explanations for its decisions. This\ncapability offers valuable insights for transportation policymaking, such as\nsimulating traveler responses to new policies or changes in the network.\n","authors":["Leizhen Wang","Peibo Duan","Zhengbing He","Cheng Lyu","Xin Chen","Nan Zheng","Li Yao","Zhenliang Ma"],"pdf_url":"https://arxiv.org/pdf/2412.03338v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03332v1","updated":"2024-12-04T14:03:27Z","published":"2024-12-04T14:03:27Z","title":"On Approximability of $\\ell_2^2$ Min-Sum Clustering","summary":" The $\\ell_2^2$ min-sum $k$-clustering problem is to partition an input set\ninto clusters $C_1,\\ldots,C_k$ to minimize $\\sum_{i=1}^k\\sum_{p,q\\in\nC_i}\\|p-q\\|_2^2$. Although $\\ell_2^2$ min-sum $k$-clustering is NP-hard, it is\nnot known whether it is NP-hard to approximate $\\ell_2^2$ min-sum\n$k$-clustering beyond a certain factor.\n In this paper, we give the first hardness-of-approximation result for the\n$\\ell_2^2$ min-sum $k$-clustering problem. 
We show that it is NP-hard to\napproximate the objective to a factor better than $1.056$ and moreover,\nassuming a balanced variant of the Johnson Coverage Hypothesis, it is NP-hard\nto approximate the objective to a factor better than 1.327.\n We then complement our hardness result by giving the first\n$(1+\\varepsilon)$-coreset construction for $\\ell_2^2$ min-sum $k$-clustering.\nOur coreset uses $\\mathcal{O}\\left(k^{\\varepsilon^{-4}}\\right)$ space and can\nbe leveraged to achieve a polynomial-time approximation scheme with runtime\n$nd\\cdot f(k,\\varepsilon^{-1})$, where $d$ is the underlying dimension of the\ninput dataset and $f$ is a fixed function.\n Finally, we consider a learning-augmented setting, where the algorithm has\naccess to an oracle that outputs a label $i\\in[k]$ for input point, thereby\nimplicitly partitioning the input dataset into $k$ clusters that induce an\napproximately optimal solution, up to some amount of adversarial error\n$\\alpha\\in\\left[0,\\frac{1}{2}\\right)$. We give a polynomial-time algorithm that\noutputs a $\\frac{1+\\gamma\\alpha}{(1-\\alpha)^2}$-approximation to $\\ell_2^2$\nmin-sum $k$-clustering, for a fixed constant $\\gamma>0$.\n","authors":["Karthik C. S.","Euiwoong Lee","Yuval Rabani","Chris Schwiegelshohn","Samson Zhou"],"pdf_url":"https://arxiv.org/pdf/2412.03332v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03326v1","updated":"2024-12-04T13:57:20Z","published":"2024-12-04T13:57:20Z","title":"Multi-Action Restless Bandits with Weakly Coupled Constraints:\n Simultaneous Learning and Control","summary":" We study a system with finitely many groups of multi-action bandit processes,\neach of which is a Markov decision process (MDP) with finite state and action\nspaces and potentially different transition matrices when taking different\nactions. 
The bandit processes of the same group share the same state and action\nspaces and, given the same action that is taken, the same transition matrix.\nAll the bandit processes across various groups are subject to multiple weakly\ncoupled constraints over their state and action variables. Unlike the past\nstudies that focused on the offline case, we consider the online case without\nassuming full knowledge of transition matrices and reward functions a priori\nand propose an effective scheme that enables simultaneous learning and control.\nWe prove the convergence of the relevant processes in both the timeline and the\nnumber of the bandit processes, referred to as the convergence in the time and\nthe magnitude dimensions. Moreover, we prove that the relevant processes\nconverge exponentially fast in the magnitude dimension, leading to\nexponentially diminishing performance deviation between the proposed online\nalgorithms and offline optimality.\n","authors":["Jing Fu","Bill Moran","José Niño-Mora"],"pdf_url":"https://arxiv.org/pdf/2412.03326v1.pdf","comment":"70 pages,0 figure"},{"id":"http://arxiv.org/abs/2412.03321v1","updated":"2024-12-04T13:55:14Z","published":"2024-12-04T13:55:14Z","title":"Scalable Bayesian Tensor Ring Factorization for Multiway Data Analysis","summary":" Tensor decompositions play a crucial role in numerous applications related to\nmulti-way data analysis. By employing a Bayesian framework with\nsparsity-inducing priors, Bayesian Tensor Ring (BTR) factorization offers\nprobabilistic estimates and an effective approach for automatically adapting\nthe tensor ring rank during the learning process. However, previous BTR method\nemploys an Automatic Relevance Determination (ARD) prior, which can lead to\nsub-optimal solutions. Besides, it solely focuses on continuous data, whereas\nmany applications involve discrete data. 
More importantly, it relies on the\nCoordinate-Ascent Variational Inference (CAVI) algorithm, which is inadequate\nfor handling large tensors with extensive observations. These limitations\ngreatly limit its application scales and scopes, making it suitable only for\nsmall-scale problems, such as image/video completion. To address these issues,\nwe propose a novel BTR model that incorporates a nonparametric Multiplicative\nGamma Process (MGP) prior, known for its superior accuracy in identifying\nlatent structures. To handle discrete data, we introduce the P\\'olya-Gamma\naugmentation for closed-form updates. Furthermore, we develop an efficient\nGibbs sampler for consistent posterior simulation, which reduces the\ncomputational complexity of previous VI algorithm by two orders, and an online\nEM algorithm that is scalable to extremely large tensors. To showcase the\nadvantages of our model, we conduct extensive experiments on both simulation\ndata and real-world applications.\n","authors":["Zerui Tao","Toshihisa Tanaka","Qibin Zhao"],"pdf_url":"https://arxiv.org/pdf/2412.03321v1.pdf","comment":"ICONIP 2023"},{"id":"http://arxiv.org/abs/2412.03317v1","updated":"2024-12-04T13:52:04Z","published":"2024-12-04T13:52:04Z","title":"FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning\n IO-Awareness","summary":" Optimizing deep learning algorithms currently requires slow, manual\nderivation, potentially leaving much performance untapped. Methods like\nFlashAttention have achieved a x6 performance improvement over native PyTorch\nby avoiding unnecessary data transfers, but required three iterations over\nthree years. Automated compiled methods have consistently lagged behind. GPUs\nare limited by both transfers to processors and available compute, with\ntransfer bandwidth having improved at a far slower pace. Already, transfer\nbandwidth accounts for 46% of GPU energy costs. 
This indicates the future of\nenergy and capital-efficient algorithms relies on improved consideration of\ntransfer costs (IO-awareness) and a systematic method for deriving optimized\nalgorithms. In this paper, we present a diagrammatic approach to deep learning\nmodels which, with simple relabelings, derive optimal implementations and\nperformance models that consider low-level memory. Diagrams generalize down the\nGPU hierarchy, providing a universal performance model for comparing hardware\nand quantization choices. Diagrams generate pseudocode, which reveals the\napplication of hardware-specific features such as coalesced memory access,\ntensor core operations, and overlapped computation. We present attention\nalgorithms for Ampere, which fits 13 warps per SM (FlashAttention fits 8), and\nfor Hopper, which has improved overlapping and may achieve 1.32 PFLOPs.\n","authors":["Vincent Abbott","Gioele Zardini"],"pdf_url":"https://arxiv.org/pdf/2412.03317v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.13609v2","updated":"2024-12-04T13:46:04Z","published":"2024-05-22T13:01:37Z","title":"Tackling Decision Processes with Non-Cumulative Objectives using\n Reinforcement Learning","summary":" Markov decision processes (MDPs) are used to model a wide variety of\napplications ranging from game playing over robotics to finance. Their optimal\npolicy typically maximizes the expected sum of rewards given at each step of\nthe decision process. However, a large class of problems does not fit\nstraightforwardly into this framework: Non-cumulative Markov decision processes\n(NCMDPs), where instead of the expected sum of rewards, the expected value of\nan arbitrary function of the rewards is maximized. Example functions include\nthe maximum of the rewards or their mean divided by their standard deviation.\nIn this work, we introduce a general mapping of NCMDPs to standard MDPs. 
This\nallows all techniques developed to find optimal policies for MDPs, such as\nreinforcement learning or dynamic programming, to be directly applied to the\nlarger class of NCMDPs. Focusing on reinforcement learning, we show\napplications in a diverse set of tasks, including classical control, portfolio\noptimization in finance, and discrete optimization problems. Given our\napproach, we can improve both final performance and training time compared to\nrelying on standard MDPs.\n","authors":["Maximilian Nägele","Jan Olle","Thomas Fösel","Remmy Zen","Florian Marquardt"],"pdf_url":"https://arxiv.org/pdf/2405.13609v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03312v1","updated":"2024-12-04T13:44:56Z","published":"2024-12-04T13:44:56Z","title":"Path-Guided Particle-based Sampling","summary":" Particle-based Bayesian inference methods by sampling from a partition-free\ntarget (posterior) distribution, e.g., Stein variational gradient descent\n(SVGD), have attracted significant attention. We propose a path-guided\nparticle-based sampling~(PGPS) method based on a novel Log-weighted Shrinkage\n(LwS) density path linking an initial distribution to the target distribution.\nWe propose to utilize a Neural network to learn a vector field motivated by the\nFokker-Planck equation of the designed density path. Particles, initiated from\nthe initial distribution, evolve according to the ordinary differential\nequation defined by the vector field. The distribution of these particles is\nguided along a density path from the initial distribution to the target\ndistribution. The proposed LwS density path allows for an efficient search of\nmodes of the target distribution while canonical methods fail. We theoretically\nanalyze the Wasserstein distance of the distribution of the PGPS-generated\nsamples and the target distribution due to approximation and discretization\nerrors. 
Practically, the proposed PGPS-LwS method demonstrates higher Bayesian\ninference accuracy and better calibration ability in experiments conducted on\nboth synthetic and real-world Bayesian learning tasks, compared to baselines,\nsuch as SVGD and Langevin dynamics, etc.\n","authors":["Mingzhou Fan","Ruida Zhou","Chao Tian","Xiaoning Qian"],"pdf_url":"https://arxiv.org/pdf/2412.03312v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.04203v3","updated":"2024-12-04T13:43:10Z","published":"2023-04-09T10:08:38Z","title":"OpenDriver: An Open-Road Driver State Detection Dataset","summary":" Among numerous studies for driver state detection, wearable physiological\nmeasurements offer a practical method for real-time monitoring. However, there\nare few driver physiological datasets in open-road scenarios, and the existing\ndatasets suffer from issues such as poor signal quality, small sample sizes,\nand short data collection periods. Therefore, in this paper, a large-scale\nmultimodal driving dataset, OpenDriver, for driver state detection is\ndeveloped. The OpenDriver encompasses a total of 3,278 driving trips, with a\nsignal collection duration spanning approximately 4,600 hours. Two modalities\nof driving signals are enrolled in OpenDriver: electrocardiogram (ECG) signals\nand six-axis motion data of the steering wheel from a motion measurement unit\n(IMU), which were recorded from 81 drivers and their vehicles. Furthermore,\nthree challenging tasks are involved in our work, namely ECG signal quality\nassessment, individual biometric identification based on ECG signals, and\nphysiological signal analysis in complex driving environments. To facilitate\nresearch in these tasks, corresponding benchmarks have also been introduced.\nFirst, a noisy augmentation strategy is applied to generate a larger-scale ECG\nsignal dataset with realistic noise simulation for quality assessment. 
Second,\nan end-to-end contrastive learning framework is employed for individual\nbiometric identification. Finally, a comprehensive analysis of drivers' HRV\nfeatures under different driving conditions is conducted. Each benchmark\nprovides evaluation metrics and reference results. The OpenDriver dataset will\nbe publicly available at https://github.com/bdne/OpenDriver.\n","authors":["Delong Liu","Shichao Li","Tianyi Shi","Zhu Meng","Guanyu Chen","Yadong Huang","Jin Dong","Zhicheng Zhao"],"pdf_url":"https://arxiv.org/pdf/2304.04203v3.pdf","comment":"Considering that there are flaws in the statistical data of the\n dataset, all the authors agreed to withdraw the manuscript"},{"id":"http://arxiv.org/abs/2412.03300v1","updated":"2024-12-04T13:17:42Z","published":"2024-12-04T13:17:42Z","title":"Conveying Emotions to Robots through Touch and Sound","summary":" Human emotions can be conveyed through nuanced touch gestures. However, there\nis a lack of understanding of how consistently emotions can be conveyed to\nrobots through touch. This study explores the consistency of touch-based\nemotional expression toward a robot by integrating tactile and auditory sensory\nreading of affective haptic expressions. We developed a piezoresistive pressure\nsensor and used a microphone to mimic touch and sound channels, respectively.\nIn a study with 28 participants, each conveyed 10 emotions to a robot using\nspontaneous touch gestures. Our findings reveal a statistically significant\nconsistency in emotion expression among participants. However, some emotions\nobtained low intraclass correlation values. Additionally, certain emotions with\nsimilar levels of arousal or valence did not exhibit significant differences in\nthe way they were conveyed. We subsequently constructed a multi-modal\nintegrating touch and audio features to decode the 10 emotions. 
A support\nvector machine (SVM) model demonstrated the highest accuracy, achieving 40% for\n10 classes, with \"Attention\" being the most accurately conveyed emotion at a\nbalanced accuracy of 87.65%.\n","authors":["Qiaoqiao Ren","Remko Proesmans","Frederick Bossuyt","Jan Vanfleteren","Francis Wyffels","Tony Belpaeme"],"pdf_url":"https://arxiv.org/pdf/2412.03300v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03299v1","updated":"2024-12-04T13:16:57Z","published":"2024-12-04T13:16:57Z","title":"Gaussian Processes for Probabilistic Estimates of Earthquake Ground\n Shaking: A 1-D Proof-of-Concept","summary":" Estimates of seismic wave speeds in the Earth (seismic velocity models) are\nkey input parameters to earthquake simulations for ground motion prediction.\nOwing to the non-uniqueness of the seismic inverse problem, typically many\nvelocity models exist for any given region. The arbitrary choice of which\nvelocity model to use in earthquake simulations impacts ground motion\npredictions. However, current hazard analysis methods do not account for this\nsource of uncertainty. We present a proof-of-concept ground motion prediction\nworkflow for incorporating uncertainties arising from inconsistencies between\nexisting seismic velocity models. Our analysis is based on the probabilistic\nfusion of overlapping seismic velocity models using scalable Gaussian process\n(GP) regression. Specifically, we fit a GP to two synthetic 1-D velocity\nprofiles simultaneously, and show that the predictive uncertainty accounts for\nthe differences between the models. We subsequently draw velocity model samples\nfrom the predictive distribution and estimate peak ground displacement using\nacoustic wave propagation through the velocity models. The resulting\ndistribution of possible ground motion amplitudes is much wider than would be\npredicted by simulating shaking using only the two input velocity models. 
This\nproof-of-concept illustrates the importance of probabilistic methods for\nphysics-based seismic hazard analysis.\n","authors":["Sam A. Scivier","Tarje Nissen-Meyer","Paula Koelemeijer","Atılım Güneş Baydin"],"pdf_url":"https://arxiv.org/pdf/2412.03299v1.pdf","comment":"8 pages, 2 figures, accepted in the Machine Learning and the Physical\n Sciences Workshop at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2410.17882v2","updated":"2024-12-04T13:14:15Z","published":"2024-10-23T13:55:42Z","title":"Identifiable Representation and Model Learning for Latent Dynamic\n Systems","summary":" Learning identifiable representations and models from low-level observations\nis helpful for an intelligent spacecraft to complete downstream tasks reliably.\nFor temporal observations, to ensure that the data generating process is\nprovably inverted, most existing works either assume the noise variables in the\ndynamic mechanisms are (conditionally) independent or require that the\ninterventions can directly affect each latent variable. However, in practice,\nthe relationship between the exogenous inputs/interventions and the latent\nvariables may follow some complex deterministic mechanisms. In this work, we\nstudy the problem of identifiable representation and model learning for latent\ndynamic systems. The key idea is to use an inductive bias inspired by\ncontrollable canonical forms, which are sparse and input-dependent by\ndefinition. We prove that, for linear and affine nonlinear latent dynamic\nsystems with sparse input matrices, it is possible to identify the latent\nvariables up to scaling and determine the dynamic models up to some simple\ntransformations. 
The results have the potential to provide some theoretical\nguarantees for developing more trustworthy decision-making and control methods\nfor intelligent spacecrafts.\n","authors":["Congxi Zhang","Yongchun Xie"],"pdf_url":"https://arxiv.org/pdf/2410.17882v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.20351v2","updated":"2024-12-04T12:42:20Z","published":"2024-10-27T06:32:41Z","title":"Leveraging Auxiliary Task Relevance for Enhanced Bearing Fault Diagnosis\n through Curriculum Meta-learning","summary":" The accurate diagnosis of machine breakdowns is crucial for maintaining\noperational safety in smart manufacturing. Despite the promise shown by deep\nlearning in automating fault identification, the scarcity of labeled training\ndata, particularly for equipment failure instances, poses a significant\nchallenge. This limitation hampers the development of robust classification\nmodels. Existing methods like model-agnostic meta-learning (MAML) do not\nadequately address variable working conditions, affecting knowledge transfer.\nTo address these challenges, a Related Task Aware Curriculum Meta-learning\n(RT-ACM) enhanced fault diagnosis framework is proposed in this paper, inspired\nby human cognitive learning processes. RT-ACM improves training by considering\nthe relevance of auxiliary sensor working conditions, adhering to the principle\nof ``paying more attention to more relevant knowledge\", and focusing on\n``easier first, harder later\" curriculum sampling. This approach aids the\nmeta-learner in achieving a superior convergence state. 
Extensive experiments\non two real-world datasets demonstrate the superiority of RT-ACM framework.\n","authors":["Jinze Wang","Jiong Jin","Tiehua Zhang","Boon Xian Chai","Adriano Di Pietro","Dimitrios Georgakopoulos"],"pdf_url":"https://arxiv.org/pdf/2410.20351v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.00153v2","updated":"2024-12-04T12:40:30Z","published":"2024-11-29T07:00:18Z","title":"ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise\n Perceptual Large Multimodal Model","summary":" Advances in CLIP and large multimodal models (LMMs) have enabled\nopen-vocabulary and free-text segmentation, yet existing models still require\npredefined category prompts, limiting free-form category self-generation. Most\nsegmentation LMMs also remain confined to sparse predictions, restricting their\napplicability in open-set environments. In contrast, we propose ROSE, a\nRevolutionary Open-set dense SEgmentation LMM, which enables dense mask\nprediction and open-category generation through patch-wise perception. Our\nmethod treats each image patch as an independent region of interest candidate,\nenabling the model to predict both dense and sparse masks simultaneously.\nAdditionally, a newly designed instruction-response paradigm takes full\nadvantage of the generation and generalization capabilities of LMMs, achieving\ncategory prediction independent of closed-set constraints or predefined\ncategories. To further enhance mask detail and category precision, we introduce\na conversation-based refinement paradigm, integrating the prediction result\nfrom previous step with textual prompt for revision. Extensive experiments\ndemonstrate that ROSE achieves competitive performance across various\nsegmentation tasks in a unified framework. 
Code will be released.\n","authors":["Kunyang Han","Yibo Hu","Mengxue Qu","Hailin Shi","Yao Zhao","Yunchao Wei"],"pdf_url":"https://arxiv.org/pdf/2412.00153v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03271v1","updated":"2024-12-04T12:31:15Z","published":"2024-12-04T12:31:15Z","title":"Nonparametric Filtering, Estimation and Classification using Neural Jump\n ODEs","summary":" Neural Jump ODEs model the conditional expectation between observations by\nneural ODEs and jump at arrival of new observations. They have demonstrated\neffectiveness for fully data-driven online forecasting in settings with\nirregular and partial observations, operating under weak regularity\nassumptions. This work extends the framework to input-output systems, enabling\ndirect applications in online filtering and classification. We establish\ntheoretical convergence guarantees for this approach, providing a robust\nsolution to $L^2$-optimal filtering. Empirical experiments highlight the\nmodel's superior performance over classical parametric methods, particularly in\nscenarios with complex underlying distributions. These results emphasise the\napproach's potential in time-sensitive domains such as finance and health\nmonitoring, where real-time accuracy is crucial.\n","authors":["Jakob Heiss","Florian Krach","Thorsten Schmidt","Félix B. Tambe-Ndonfack"],"pdf_url":"https://arxiv.org/pdf/2412.03271v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.18152v2","updated":"2024-12-04T12:18:17Z","published":"2024-09-25T17:15:26Z","title":"Reinforcement Learning for Finite Space Mean-Field Type Games","summary":" Mean field type games (MFTGs) describe Nash equilibria between large\ncoalitions: each coalition consists of a continuum of cooperative agents who\nmaximize the average reward of their coalition while interacting\nnon-cooperatively with a finite number of other coalitions. 
Although the theory\nhas been extensively developed, we are still lacking efficient and scalable\ncomputational methods. Here, we develop reinforcement learning methods for such\ngames in a finite space setting with general dynamics and reward functions. We\nstart by proving that MFTG solution yields approximate Nash equilibria in\nfinite-size coalition games. We then propose two algorithms. The first is based\non quantization of mean-field spaces and Nash Q-learning. We provide\nconvergence and stability analysis. We then propose a deep reinforcement\nlearning algorithm, which can scale to larger spaces. Numerical experiments in\n5 environments with mean-field distributions of dimension up to $200$ show the\nscalability and efficiency of the proposed method.\n","authors":["Kai Shao","Jiacheng Shen","Chijie An","Mathieu Laurière"],"pdf_url":"https://arxiv.org/pdf/2409.18152v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03263v1","updated":"2024-12-04T12:11:19Z","published":"2024-12-04T12:11:19Z","title":"NeRF and Gaussian Splatting SLAM in the Wild","summary":" Navigating outdoor environments with visual Simultaneous Localization and\nMapping (SLAM) systems poses significant challenges due to dynamic scenes,\nlighting variations, and seasonal changes, requiring robust solutions. While\ntraditional SLAM methods struggle with adaptability, deep learning-based\napproaches and emerging neural radiance fields as well as Gaussian\nSplatting-based SLAM methods, offer promising alternatives. However, these\nmethods have primarily been evaluated in controlled indoor environments with\nstable conditions, leaving a gap in understanding their performance in\nunstructured and variable outdoor settings. This study addresses this gap by\nevaluating these methods in natural outdoor environments, focusing on camera\ntracking accuracy, robustness to environmental factors, and computational\nefficiency, highlighting distinct trade-offs. 
Extensive evaluations demonstrate\nthat neural SLAM methods achieve superior robustness, particularly under\nchallenging conditions such as low light, but at a high computational cost. At\nthe same time, traditional methods perform the best across seasons but are\nhighly sensitive to variations in lighting conditions. The code of the\nbenchmark is publicly available at\nhttps://github.com/iis-esslingen/nerf-3dgs-benchmark.\n","authors":["Fabian Schmidt","Markus Enzweiler","Abhinav Valada"],"pdf_url":"https://arxiv.org/pdf/2412.03263v1.pdf","comment":"5 pages, 2 figures, 4 tables"},{"id":"http://arxiv.org/abs/2403.14695v2","updated":"2024-12-04T11:58:41Z","published":"2024-03-15T15:05:59Z","title":"Chain-structured neural architecture search for financial time series\n forecasting","summary":" Neural architecture search (NAS) emerged as a way to automatically optimize\nneural networks for a specific task and dataset. Despite an abundance of\nresearch on NAS for images and natural language applications, similar studies\nfor time series data are lacking. Among NAS search spaces, chain-structured are\nthe simplest and most applicable to small datasets like time series. We compare\nthree popular NAS strategies on chain-structured search spaces: Bayesian\noptimization (specifically Tree-structured Parzen Estimator), the hyperband\nmethod, and reinforcement learning in the context of financial time series\nforecasting. These strategies were employed to optimize simple well-understood\nneural architectures like the MLP, 1D CNN, and RNN, with more complex temporal\nfusion transformers (TFT) and their own optimizers included for comparison. We\nfind Bayesian optimization and the hyperband method performing best among the\nstrategies, and RNN and 1D CNN best among the architectures, but all methods\nwere very close to each other with a high variance due to the difficulty of\nworking with financial datasets. 
We discuss our approach to overcome the\nvariance and provide implementation recommendations for future users and\nresearchers.\n","authors":["Denis Levchenko","Efstratios Rappos","Shabnam Ataee","Biagio Nigro","Stephan Robert-Nicoud"],"pdf_url":"https://arxiv.org/pdf/2403.14695v2.pdf","comment":"This is the accepted version of the paper published in International\n Journal of Data Science and Analytics"},{"id":"http://arxiv.org/abs/2412.03258v1","updated":"2024-12-04T11:57:36Z","published":"2024-12-04T11:57:36Z","title":"Learning on One Mode: Addressing Multi-Modality in Offline Reinforcement\n Learning","summary":" Offline reinforcement learning (RL) seeks to learn optimal policies from\nstatic datasets without interacting with the environment. A common challenge is\nhandling multi-modal action distributions, where multiple behaviours are\nrepresented in the data. Existing methods often assume unimodal behaviour\npolicies, leading to suboptimal performance when this assumption is violated.\nWe propose Weighted Imitation Learning on One Mode (LOM), a novel approach that\nfocuses on learning from a single, promising mode of the behaviour policy. By\nusing a Gaussian mixture model to identify modes and selecting the best mode\nbased on expected returns, LOM avoids the pitfalls of averaging over\nconflicting actions. Theoretically, we show that LOM improves performance while\nmaintaining simplicity in policy learning. 
Empirically, LOM outperforms\nexisting methods on standard D4RL benchmarks and demonstrates its effectiveness\nin complex, multi-modal scenarios.\n","authors":["Mianchu Wang","Yue Jin","Giovanni Montana"],"pdf_url":"https://arxiv.org/pdf/2412.03258v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01322v2","updated":"2024-12-04T11:53:32Z","published":"2024-12-02T09:40:03Z","title":"Explainable fault and severity classification for rolling element\n bearings using Kolmogorov-Arnold networks","summary":" Rolling element bearings are critical components of rotating machinery, with\ntheir performance directly influencing the efficiency and reliability of\nindustrial systems. At the same time, bearing faults are a leading cause of\nmachinery failures, often resulting in costly downtime, reduced productivity,\nand, in extreme cases, catastrophic damage. This study presents a methodology\nthat utilizes Kolmogorov-Arnold Networks to address these challenges through\nautomatic feature selection, hyperparameter tuning and interpretable fault\nanalysis within a unified framework. By training shallow network architectures\nand minimizing the number of selected features, the framework produces\nlightweight models that deliver explainable results through feature attribution\nand symbolic representations of their activation functions. Validated on two\nwidely recognized datasets for bearing fault diagnosis, the framework achieved\nperfect F1-Scores for fault detection and high performance in fault and\nseverity classification tasks, including 100% F1-Scores in most cases. Notably,\nit demonstrated adaptability by handling diverse fault types, such as imbalance\nand misalignment, within the same dataset. The symbolic representations\nenhanced model interpretability, while feature attribution offered insights\ninto the optimal feature types or signals for each studied task. 
These results\nhighlight the framework's potential for practical applications, such as\nreal-time machinery monitoring, and for scientific research requiring efficient\nand explainable models.\n","authors":["Spyros Rigas","Michalis Papachristou","Ioannis Sotiropoulos","Georgios Alexandridis"],"pdf_url":"https://arxiv.org/pdf/2412.01322v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03252v1","updated":"2024-12-04T11:51:50Z","published":"2024-12-04T11:51:50Z","title":"Variable-Speed Teaching-Playback as Real-World Data Augmentation for\n Imitation Learning","summary":" Because imitation learning relies on human demonstrations in hard-to-simulate\nsettings, the inclusion of force control in this method has resulted in a\nshortage of training data, even with a simple change in speed. Although the\nfield of data augmentation has addressed the lack of data, conventional methods\nof data augmentation for robot manipulation are limited to simulation-based\nmethods or downsampling for position control. This paper proposes a novel\nmethod of data augmentation that is applicable to force control and preserves\nthe advantages of real-world datasets. We applied teaching-playback at variable\nspeeds as real-world data augmentation to increase both the quantity and\nquality of environmental reactions at variable speeds. An experiment was\nconducted on bilateral control-based imitation learning using a method of\nimitation learning equipped with position-force control. We evaluated the\neffect of real-world data augmentation on two tasks, pick-and-place and wiping,\nat variable speeds, each from two human demonstrations at fixed speed. 
The\nresults showed a maximum 55% increase in success rate from a simple change in\nspeed of real-world reactions and improved accuracy along the\nduration/frequency command by gathering environmental reactions at variable\nspeeds.\n","authors":["Nozomu Masuya","Hiroshi Sato","Koki Yamane","Takuya Kusume","Sho Sakaino","Toshiaki Tsuji"],"pdf_url":"https://arxiv.org/pdf/2412.03252v1.pdf","comment":"16 pages, 12 figures, 4 tables. This is a preprint of an article\n submitted for consideration in ADVANCED ROBOTICS, copyright Taylor & Francis\n and Robotics Society of Japan; ADVANCED ROBOTICS is available online at\n http://www.tandfonline.com/"},{"id":"http://arxiv.org/abs/2412.03238v1","updated":"2024-12-04T11:39:03Z","published":"2024-12-04T11:39:03Z","title":"Dynamic Consistent $k$-Center Clustering with Optimal Recourse","summary":" Given points from an arbitrary metric space and a sequence of point updates\nsent by an adversary, what is the minimum recourse per update (i.e., the\nminimum number of changes needed to the set of centers after an update), in\norder to maintain a constant-factor approximation to a $k$-clustering problem?\nThis question has received attention in recent years under the name consistent\nclustering.\n Previous works by Lattanzi and Vassilvitskii [ICLM '17] and Fichtenberger,\nLattanzi, Norouzi-Fard, and Svensson [SODA '21] studied $k$-clustering\nobjectives, including the $k$-center and the $k$-median objectives, under only\npoint insertions. In this paper we study the $k$-center objective in the fully\ndynamic setting, where the update is either a point insertion or a point\ndeletion. 
Before our work, {\\L}\\k{a}cki, Haeupler, Grunau, Rozho\\v{n}, and\nJayaram [SODA '24] gave a deterministic fully dynamic constant-factor\napproximation algorithm for the $k$-center objective with worst-case recourse\nof $2$ per update.\n In this work, we prove that the $k$-center clustering problem admits optimal\nrecourse bounds by developing a deterministic fully dynamic constant-factor\napproximation algorithm with worst-case recourse of $1$ per update. Moreover\nour algorithm performs simple choices based on light data structures, and thus\nis arguably more direct and faster than the previous one which uses a\nsophisticated combinatorial structure. Additionally, we develop a new\ndeterministic decremental algorithm and a new deterministic incremental\nalgorithm, both of which maintain a $6$-approximate $k$-center solution with\nworst-case recourse of $1$ per update. Our incremental algorithm improves over\nthe $8$-approximation algorithm by Charikar, Chekuri, Feder, and Motwani [STOC\n'97]. Finally, we remark that since all three of our algorithms are\ndeterministic, they work against an adaptive adversary.\n","authors":["Sebastian Forster","Antonis Skarlatos"],"pdf_url":"https://arxiv.org/pdf/2412.03238v1.pdf","comment":"In Proceedings SODA 2025"},{"id":"http://arxiv.org/abs/2412.03224v1","updated":"2024-12-04T11:21:30Z","published":"2024-12-04T11:21:30Z","title":"Channel Reflection: Knowledge-Driven Data Augmentation for EEG-Based\n Brain-Computer Interfaces","summary":" A brain-computer interface (BCI) enables direct communication between the\nhuman brain and external devices. Electroencephalography (EEG) based BCIs are\ncurrently the most popular for able-bodied users. To increase\nuser-friendliness, usually a small amount of user-specific EEG data are used\nfor calibration, which may not be enough to develop a pure data-driven decoding\nmodel. 
To cope with this typical calibration data shortage challenge in\nEEG-based BCIs, this paper proposes a parameter-free channel reflection (CR)\ndata augmentation approach that incorporates prior knowledge on the channel\ndistributions of different BCI paradigms in data augmentation. Experiments on\neight public EEG datasets across four different BCI paradigms (motor imagery,\nsteady-state visual evoked potential, P300, and seizure classifications) using\ndifferent decoding algorithms demonstrated that: 1) CR is effective, i.e., it\ncan noticeably improve the classification accuracy; 2) CR is robust, i.e., it\nconsistently outperforms existing data augmentation approaches in the\nliterature; and, 3) CR is flexible, i.e., it can be combined with other data\naugmentation approaches to further increase the performance. We suggest that\ndata augmentation approaches like CR should be an essential step in EEG-based\nBCIs. Our code is available online.\n","authors":["Ziwei Wang","Siyang Li","Jingwei Luo","Jiajing Liu","Dongrui Wu"],"pdf_url":"https://arxiv.org/pdf/2412.03224v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03220v1","updated":"2024-12-04T11:14:06Z","published":"2024-12-04T11:14:06Z","title":"Survey of different Large Language Model Architectures: Trends,\n Benchmarks, and Challenges","summary":" Large Language Models (LLMs) represent a class of deep learning models adept\nat understanding natural language and generating coherent responses to various\nprompts or queries. These models far exceed the complexity of conventional\nneural networks, often encompassing dozens of neural network layers and\ncontaining billions to trillions of parameters. They are typically trained on\nvast datasets, utilizing architectures based on transformer blocks. Present-day\nLLMs are multi-functional, capable of performing a range of tasks from text\ngeneration and language translation to question answering, as well as code\ngeneration and analysis. 
An advanced subset of these models, known as\nMultimodal Large Language Models (MLLMs), extends LLM capabilities to process\nand interpret multiple data modalities, including images, audio, and video.\nThis enhancement empowers MLLMs with capabilities like video editing, image\ncomprehension, and captioning for visual content. This survey provides a\ncomprehensive overview of the recent advancements in LLMs. We begin by tracing\nthe evolution of LLMs and subsequently delve into the advent and nuances of\nMLLMs. We analyze emerging state-of-the-art MLLMs, exploring their technical\nfeatures, strengths, and limitations. Additionally, we present a comparative\nanalysis of these models and discuss their challenges, potential limitations,\nand prospects for future development.\n","authors":["Minghao Shao","Abdul Basit","Ramesh Karri","Muhammad Shafique"],"pdf_url":"https://arxiv.org/pdf/2412.03220v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.05650v2","updated":"2024-12-04T11:12:42Z","published":"2024-07-08T06:22:10Z","title":"The Cooperative Network Architecture: Learning Structured Networks as\n Representation of Sensory Patterns","summary":" Nets, cooperative networks of neurons, have been proposed as format for the\nrepresentation of sensory signals, as physical implementation of the Gestalt\nphenomenon and as solution to the neural binding problem, while the direct\ninteraction between nets by structure-sensitive matching has been proposed as\nbasis for object-global operations such as object detection. The nets are\nflexibly composed of overlapping net fragments, which are learned from\nstatistical regularities of sensory input. We here present the cooperative\nnetwork architecture (CNA), a concrete model that learns such net structure to\nrepresent input patterns and deals robustly with noise, deformation, and\nout-of-distribution data, thus laying the groundwork for a novel neural\narchitecture.\n","authors":["Pascal J. Sager","Jan M. 
Deriu","Benjamin F. Grewe","Thilo Stadelmann","Christoph von der Malsburg"],"pdf_url":"https://arxiv.org/pdf/2407.05650v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03215v1","updated":"2024-12-04T11:08:32Z","published":"2024-12-04T11:08:32Z","title":"Beyond [cls]: Exploring the true potential of Masked Image Modeling\n representations","summary":" Masked Image Modeling (MIM) has emerged as a popular method for\nSelf-Supervised Learning (SSL) of visual representations. However, for\nhigh-level perception tasks, MIM-pretrained models offer lower out-of-the-box\nrepresentation quality than the Joint-Embedding Architectures (JEA) - another\nprominent SSL paradigm. To understand this performance gap, we analyze the\ninformation flow in Vision Transformers (ViT) learned by both approaches. We\nreveal that whereas JEAs construct their representation on a selected set of\nrelevant image fragments, MIM models aggregate nearly whole image content.\nMoreover, we demonstrate that MIM-trained ViTs retain valuable information\nwithin their patch tokens, which is not effectively captured by the global\n[cls] token representations. Therefore, selective aggregation of relevant patch\ntokens, without any fine-tuning, results in consistently higher-quality of MIM\nrepresentations. 
To our knowledge, we are the first to highlight the lack of\neffective representation aggregation as an emergent issue of MIM and propose\ndirections to address it, contributing to future advances in Self-Supervised\nLearning.\n","authors":["Marcin Przewięźlikowski","Randall Balestriero","Wojciech Jasiński","Marek Śmieja","Bartosz Zieliński"],"pdf_url":"https://arxiv.org/pdf/2412.03215v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03214v1","updated":"2024-12-04T11:05:01Z","published":"2024-12-04T11:05:01Z","title":"Continual Low-Rank Scaled Dot-product Attention","summary":" Transformers are widely used for their ability to capture data relations in\nsequence processing, with great success for a wide range of static tasks.\nHowever, the computational and memory footprint of their main component, i.e.,\nthe Scaled Dot-product Attention, is commonly overlooked. This makes their\nadoption in applications involving stream data processing with constraints in\nresponse latency, computational and memory resources infeasible. Some works\nhave proposed methods to lower the computational cost of transformers, i.e.\nlow-rank approximations, sparsity in attention, and efficient formulations for\nContinual Inference. In this paper, we introduce a new formulation of the\nScaled Dot-product Attention based on the Nystr\\\"om approximation that is\nsuitable for Continual Inference. 
In experiments on Online Audio Classification\nand Online Action Detection tasks, the proposed Continual Scaled Dot-product\nAttention can lower the number of operations by up to three orders of magnitude\ncompared to the original Transformers while retaining the predictive\nperformance of competing models.\n","authors":["Ginés Carreto Picón","Illia Oleksiienko","Lukas Hedegaard","Arian Bakhtiarnia","Alexandros Iosifidis"],"pdf_url":"https://arxiv.org/pdf/2412.03214v1.pdf","comment":"11 pages, 7 figures"},{"id":"http://arxiv.org/abs/2412.03213v1","updated":"2024-12-04T10:58:27Z","published":"2024-12-04T10:58:27Z","title":"ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable\n Compression","summary":" Large Language Models (LLMs) have been widely deployed in a variety of\napplications, and the context length is rapidly increasing to handle tasks such\nas long-document QA and complex logical reasoning. However, long context poses\nsignificant challenges for inference efficiency, including high memory costs of\nkey-value (KV) cache and increased latency due to extensive memory accesses.\nRecent works have proposed compressing KV cache to approximate computation, but\nthese methods either evict tokens permanently, never recalling them for later\ninference, or recall previous tokens at the granularity of pages divided by\ntextual positions. Both approaches degrade the model accuracy and output\nquality. To achieve efficient and accurate recallable KV cache compression, we\nintroduce ClusterKV, which recalls tokens at the granularity of semantic\nclusters. We design and implement efficient algorithms and systems for\nclustering, selection, indexing and caching. Experiment results show that\nClusterKV attains negligible accuracy loss across various tasks with 32k\ncontext lengths, using only a 1k to 2k KV cache budget, and achieves up to a\n2$\\times$ speedup in latency and a 2.5$\\times$ improvement in decoding\nthroughput. 
Compared to SoTA recallable KV compression methods, ClusterKV\ndemonstrates higher model accuracy and output quality, while maintaining or\nexceeding inference efficiency.\n","authors":["Guangda Liu","Chengwei Li","Jieru Zhao","Chenqi Zhang","Minyi Guo"],"pdf_url":"https://arxiv.org/pdf/2412.03213v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03212v1","updated":"2024-12-04T10:57:55Z","published":"2024-12-04T10:57:55Z","title":"Semi-Supervised Transfer Boosting (SS-TrBoosting)","summary":" Semi-supervised domain adaptation (SSDA) aims at training a high-performance\nmodel for a target domain using few labeled target data, many unlabeled target\ndata, and plenty of auxiliary data from a source domain. Previous works in SSDA\nmainly focused on learning transferable representations across domains.\nHowever, it is difficult to find a feature space where the source and target\ndomains share the same conditional probability distribution. Additionally,\nthere is no flexible and effective strategy extending existing unsupervised\ndomain adaptation (UDA) approaches to SSDA settings. In order to solve the\nabove two challenges, we propose a novel fine-tuning framework, semi-supervised\ntransfer boosting (SS-TrBoosting). Given a well-trained deep learning-based UDA\nor SSDA model, we use it as the initial model, generate additional base\nlearners by boosting, and then use all of them as an ensemble. More\nspecifically, half of the base learners are generated by supervised domain\nadaptation, and half by semi-supervised learning. Furthermore, for more\nefficient data transmission and better data privacy protection, we propose a\nsource data generation approach to extend SS-TrBoosting to semi-supervised\nsource-free domain adaptation (SS-SFDA). 
Extensive experiments showed that\nSS-TrBoosting can be applied to a variety of existing UDA, SSDA and SFDA\napproaches to further improve their performance.\n","authors":["Lingfei Deng","Changming Zhao","Zhenbang Du","Kun Xia","Dongrui Wu"],"pdf_url":"https://arxiv.org/pdf/2412.03212v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.03098v2","updated":"2024-12-04T10:52:25Z","published":"2024-11-05T13:44:25Z","title":"Local Lesion Generation is Effective for Capsule Endoscopy Image Data\n Augmentation in a Limited Data Setting","summary":" Limited medical imaging datasets challenge deep learning models by increasing\nrisks of overfitting and reduced generalization, particularly in Generative\nAdversarial Networks (GANs), where discriminators may overfit, leading to\ntraining divergence. This constraint also impairs classification models trained\non small datasets. Generative Data Augmentation (GDA) addresses this by\nexpanding training datasets with synthetic data, although it requires training\na generative model. We propose and evaluate two local lesion generation\napproaches to address the challenge of augmenting small medical image datasets.\nThe first approach employs the Poisson Image Editing algorithm, a classical\nimage processing technique, to create realistic image composites that\noutperform current state-of-the-art methods. The second approach introduces a\nnovel generative method, leveraging a fine-tuned Image Inpainting GAN to\nsynthesize realistic lesions within specified regions of real training images.\nA comprehensive comparison of the two proposed methods demonstrates that\neffective local lesion generation in a data-constrained setting allows for\nreaching new state-of-the-art results in capsule endoscopy lesion\nclassification. Combination of our techniques achieves a macro F1-score of\n33.07%, surpassing the previous best result by 7.84 percentage points (p.p.) 
on\nthe highly imbalanced Kvasir Capsule Dataset, a benchmark for capsule\nendoscopy. To the best of our knowledge, this work is the first to apply a\nfine-tuned Image Inpainting GAN for GDA in medical imaging, demonstrating that\nan image-conditional GAN can be adapted effectively to limited datasets to\ngenerate high-quality examples, facilitating effective data augmentation.\nAdditionally, we show that combining this GAN-based approach with classical\nimage processing techniques further improves the results.\n","authors":["Adrian B. Chłopowiec","Adam R. Chłopowiec","Krzysztof Galus","Wojciech Cebula","Martin Tabakov"],"pdf_url":"https://arxiv.org/pdf/2411.03098v2.pdf","comment":"54 pages, 35 figures"},{"id":"http://arxiv.org/abs/2410.16926v2","updated":"2024-12-04T10:52:04Z","published":"2024-10-22T11:57:32Z","title":"Pyramid Vector Quantization for LLMs","summary":" Recent works on compression of large language models (LLM) using quantization\nconsidered reparameterizing the architecture such that weights are distributed\non the sphere. This demonstratively improves the ability to quantize by\nincreasing the mathematical notion of coherence, resulting in fewer weight\noutliers without affecting the network output. In this work, we aim to further\nexploit this spherical geometry of the weights when performing quantization by\nconsidering Pyramid Vector Quantization (PVQ) for large language models.\nArranging points evenly on the sphere is notoriously difficult, especially in\nhigh dimensions, and in case approximate solutions exists, representing points\nexplicitly in a codebook is typically not feasible due to its additional memory\ncost. Instead, PVQ uses a fixed integer lattice on the sphere by projecting\npoints onto the 1-sphere, which allows for efficient encoding and decoding\nwithout requiring an explicit codebook in memory. 
To obtain a practical\nalgorithm, we propose to combine PVQ with scale quantization for which we\nderive theoretically optimal quantizations, under empirically verified\nassumptions. Further, we extend pyramid vector quantization to use Hessian\ninformation to minimize quantization error under expected feature activations,\ninstead of only relying on weight magnitudes. Experimentally, we achieves\nstate-of-the-art quantization performance with pareto-optimal trade-off between\nperformance and bits per weight and bits per activation, compared to compared\nmethods. On weight-only, we find that we can quantize a Llama-3 70B model to\n3.25 bits per weight and retain 98\\% accuracy on downstream tasks.\n","authors":["Tycho F. A. van der Ouderaa","Maximilian L. Croci","Agrin Hilmkil","James Hensman"],"pdf_url":"https://arxiv.org/pdf/2410.16926v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.00850v2","updated":"2024-12-04T10:45:41Z","published":"2024-10-30T11:16:04Z","title":"GWQ: Gradient-Aware Weight Quantization for Large Language Models","summary":" Large language models (LLMs) show impressive performance in solving complex\nlanguage tasks. However, its large number of parameters present significant\nchallenges for the deployment and application of the model on edge devices.\nCompressing large language models to low bits can enable them to run on\nresource-constrained devices, often leading to performance degradation. To\naddress this problem, we propose gradient-aware weight quantization (GWQ), the\nfirst quantization approach for low-bit weight quantization that leverages\ngradients to localize outliers, requiring only a minimal amount of calibration\ndata for outlier detection. GWQ retains the weights corresponding to the top 1%\noutliers preferentially at FP16 precision, while the remaining non-outlier\nweights are stored in a low-bit format. 
GWQ found experimentally that utilizing\nthe sensitive weights in the gradient localization model is more scientific\ncompared to utilizing the sensitive weights in the Hessian matrix localization\nmodel. Compared to current quantization methods, GWQ can be applied to multiple\nlanguage models and achieves lower PPL on the WikiText2 and C4 dataset. In the\nzero-shot task, GWQ quantized models have higher accuracy compared to other\nquantization methods. GWQ is also suitable for multimodal model quantization,\nand the quantized Qwen-VL family model is more accurate than other methods.\nZero-shot target detection task dataset RefCOCO outperforms the current\nstat-of-the-arts method SPQR. GWQ achieves 1.2 times inference speedup in\ncomparison to the original model, and effectively reduces the inference memory.\n","authors":["Yihua Shao","Siyu Liang","Zijian Ling","Minxi Yan","Haiyang Liu","Siyu Chen","Ziyang Yan","Chenyu Zhang","Haotong Qin","Michele Magno","Yang Yang","Zhen Lei","Yan Wang","Jingcai Guo","Ling Shao","Hao Tang"],"pdf_url":"https://arxiv.org/pdf/2411.00850v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.06209v3","updated":"2024-12-04T10:33:18Z","published":"2024-04-09T10:58:21Z","title":"Elephants Never Forget: Memorization and Learning of Tabular Data in\n Large Language Models","summary":" While many have shown how Large Language Models (LLMs) can be applied to a\ndiverse set of tasks, the critical issues of data contamination and\nmemorization are often glossed over. In this work, we address this concern for\ntabular data. Specifically, we introduce a variety of different techniques to\nassess whether a language model has seen a tabular dataset during training.\nThis investigation reveals that LLMs have memorized many popular tabular\ndatasets verbatim. We then compare the few-shot learning performance of LLMs on\ndatasets that were seen during training to the performance on datasets released\nafter training. 
We find that LLMs perform better on datasets seen during\ntraining, indicating that memorization leads to overfitting. At the same time,\nLLMs show non-trivial performance on novel datasets and are surprisingly robust\nto data transformations. We then investigate the in-context statistical\nlearning abilities of LLMs. While LLMs are significantly better than random at\nsolving statistical classification problems, the sample efficiency of few-shot\nlearning lags behind traditional statistical learning algorithms, especially as\nthe dimension of the problem increases. This suggests that much of the observed\nfew-shot performance on novel real-world datasets is due to the LLM's world\nknowledge. Overall, our results highlight the importance of testing whether an\nLLM has seen an evaluation dataset during pre-training. We release the\nhttps://github.com/interpretml/LLM-Tabular-Memorization-Checker Python package\nto test LLMs for memorization of tabular datasets.\n","authors":["Sebastian Bordt","Harsha Nori","Vanessa Rodrigues","Besmira Nushi","Rich Caruana"],"pdf_url":"https://arxiv.org/pdf/2404.06209v3.pdf","comment":"COLM camera ready, fix typo"},{"id":"http://arxiv.org/abs/2412.03190v1","updated":"2024-12-04T10:22:34Z","published":"2024-12-04T10:22:34Z","title":"Node Classification With Integrated Reject Option","summary":" One of the key tasks in graph learning is node classification. While Graph\nneural networks have been used for various applications, their adaptivity to\nreject option setting is not previously explored. In this paper, we propose\nNCwR, a novel approach to node classification in Graph Neural Networks (GNNs)\nwith an integrated reject option, which allows the model to abstain from making\npredictions when uncertainty is high. We propose both cost-based and\ncoverage-based methods for classification with abstention in node\nclassification setting using GNNs. 
We perform experiments using our method on\nthree standard citation network datasets Cora, Citeseer and Pubmed and compare\nwith relevant baselines. We also model the Legal judgment prediction problem on\nILDC dataset as a node classification problem where nodes represent legal cases\nand edges represent citations. We further interpret the model by analyzing the\ncases that the model abstains from predicting by visualizing which part of the\ninput features influenced this decision.\n","authors":["Uday Bhaskar","Jayadratha Gayen","Charu Sharma","Naresh Manwani"],"pdf_url":"https://arxiv.org/pdf/2412.03190v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03188v1","updated":"2024-12-04T10:20:21Z","published":"2024-12-04T10:20:21Z","title":"Semi-decentralized Training of Spatio-Temporal Graph Neural Networks for\n Traffic Prediction","summary":" In smart mobility, large networks of geographically distributed sensors\nproduce vast amounts of high-frequency spatio-temporal data that must be\nprocessed in real time to avoid major disruptions. Traditional centralized\napproaches are increasingly unsuitable to this task, as they struggle to scale\nwith expanding sensor networks, and reliability issues in central components\ncan easily affect the whole deployment. To address these challenges, we explore\nand adapt semi-decentralized training techniques for Spatio-Temporal Graph\nNeural Networks (ST-GNNs) in smart mobility domain. We implement a simulation\nframework where sensors are grouped by proximity into multiple cloudlets, each\nhandling a subgraph of the traffic graph, fetching node features from other\ncloudlets to train its own local ST-GNN model, and exchanging model updates\nwith other cloudlets to ensure consistency, enhancing scalability and removing\nreliance on a centralized aggregator. 
We perform extensive comparative\nevaluation of four different ST-GNN training setups -- centralized, traditional\nFL, server-free FL, and Gossip Learning -- on large-scale traffic datasets, the\nMETR-LA and PeMS-BAY datasets, for short-, mid-, and long-term vehicle speed\npredictions. Experimental results show that semi-decentralized setups are\ncomparable to centralized approaches in performance metrics, while offering\nadvantages in terms of scalability and fault tolerance. In addition, we\nhighlight often overlooked issues in existing literature for distributed\nST-GNNs, such as the variation in model performance across different\ngeographical areas due to region-specific traffic patterns, and the significant\ncommunication overhead and computational costs that arise from the large\nreceptive field of GNNs, leading to substantial data transfers and increased\ncomputation of partial embeddings.\n","authors":["Ivan Kralj","Lodovico Giaretta","Gordan Ježić","Ivana Podnar Žarko","Šarūnas Girdzijauskas"],"pdf_url":"https://arxiv.org/pdf/2412.03188v1.pdf","comment":"8 pages, 4 figures, 3 tables, conference"},{"id":"http://arxiv.org/abs/2401.10962v2","updated":"2024-12-04T10:09:46Z","published":"2024-01-19T11:45:31Z","title":"One Step Learning, One Step Review","summary":" Visual fine-tuning has garnered significant attention with the rise of\npre-trained vision models. The current prevailing method, full fine-tuning,\nsuffers from the issue of knowledge forgetting as it focuses solely on fitting\nthe downstream training set. In this paper, we propose a novel weight\nrollback-based fine-tuning method called OLOR (One step Learning, One step\nReview). OLOR combines fine-tuning with optimizers, incorporating a weight\nrollback term into the weight update term at each step. This ensures\nconsistency in the weight range of upstream and downstream models, effectively\nmitigating knowledge forgetting and enhancing fine-tuning performance. 
In\naddition, a layer-wise penalty is presented to employ penalty decay and the\ndiversified decay rate to adjust the weight rollback levels of layers for\nadapting varying downstream tasks. Through extensive experiments on various\ntasks such as image classification, object detection, semantic segmentation,\nand instance segmentation, we demonstrate the general applicability and\nstate-of-the-art performance of our proposed OLOR. Code is available at\nhttps://github.com/rainbow-xiao/OLOR-AAAI-2024.\n","authors":["Xiaolong Huang","Qiankun Li","Xueran Li","Xuesong Gao"],"pdf_url":"https://arxiv.org/pdf/2401.10962v2.pdf","comment":"Published at the 38th AAAI Conference on Artificial Intelligence\n (AAAI 2024)"},{"id":"http://arxiv.org/abs/2310.01225v5","updated":"2024-12-04T10:04:02Z","published":"2023-10-02T14:12:53Z","title":"A path-norm toolkit for modern networks: consequences, promises and\n challenges","summary":" This work introduces the first toolkit around path-norms that fully\nencompasses general DAG ReLU networks with biases, skip connections and any\noperation based on the extraction of order statistics: max pooling, GroupSort\netc. 
This toolkit notably allows us to establish generalization bounds for\nmodern neural networks that are not only the most widely applicable path-norm\nbased ones, but also recover or beat the sharpest known bounds of this type.\nThese extended path-norms further enjoy the usual benefits of path-norms: ease\nof computation, invariance under the symmetries of the network, and improved\nsharpness on layered fully-connected networks compared to the product of\noperator norms, another complexity measure most commonly used.\n The versatility of the toolkit and its ease of implementation allow us to\nchallenge the concrete promises of path-norm-based generalization bounds, by\nnumerically evaluating the sharpest known bounds for ResNets on ImageNet.\n","authors":["Antoine Gonon","Nicolas Brisebarre","Elisa Riccietti","Rémi Gribonval"],"pdf_url":"https://arxiv.org/pdf/2310.01225v5.pdf","comment":"Erratum: in the published version there was a typo in the definition\n of the activation matrix in Definition A.3. This is fixed with this new\n version"},{"id":"http://arxiv.org/abs/2412.03178v1","updated":"2024-12-04T10:03:52Z","published":"2024-12-04T10:03:52Z","title":"Towards Understanding and Quantifying Uncertainty for Text-to-Image\n Generation","summary":" Uncertainty quantification in text-to-image (T2I) generative models is\ncrucial for understanding model behavior and improving output reliability. In\nthis paper, we are the first to quantify and evaluate the uncertainty of T2I\nmodels with respect to the prompt. Alongside adapting existing approaches\ndesigned to measure uncertainty in the image space, we also introduce\nPrompt-based UNCertainty Estimation for T2I models (PUNC), a novel method\nleveraging Large Vision-Language Models (LVLMs) to better address uncertainties\narising from the semantics of the prompt and generated images. 
PUNC utilizes a\nLVLM to caption a generated image, and then compares the caption with the\noriginal prompt in the more semantically meaningful text space. PUNC also\nenables the disentanglement of both aleatoric and epistemic uncertainties via\nprecision and recall, which image-space approaches are unable to do. Extensive\nexperiments demonstrate that PUNC outperforms state-of-the-art uncertainty\nestimation techniques across various settings. Uncertainty quantification in\ntext-to-image generation models can be used on various applications including\nbias detection, copyright protection, and OOD detection. We also introduce a\ncomprehensive dataset of text prompts and generation pairs to foster further\nresearch in uncertainty quantification for generative models. Our findings\nillustrate that PUNC not only achieves competitive performance but also enables\nnovel applications in evaluating and improving the trustworthiness of\ntext-to-image models.\n","authors":["Gianni Franchi","Dat Nguyen Trong","Nacim Belkhir","Guoxuan Xia","Andrea Pilzer"],"pdf_url":"https://arxiv.org/pdf/2412.03178v1.pdf","comment":"28 pages and 22 figures"},{"id":"http://arxiv.org/abs/2407.15017v4","updated":"2024-12-04T09:54:59Z","published":"2024-07-22T06:15:59Z","title":"Knowledge Mechanisms in Large Language Models: A Survey and Perspective","summary":" Understanding knowledge mechanisms in Large Language Models (LLMs) is crucial\nfor advancing towards trustworthy AGI. This paper reviews knowledge mechanism\nanalysis from a novel taxonomy including knowledge utilization and evolution.\nKnowledge utilization delves into the mechanism of memorization, comprehension\nand application, and creation. Knowledge evolution focuses on the dynamic\nprogression of knowledge within individual and group LLMs. 
Moreover, we discuss\nwhat knowledge LLMs have learned, the reasons for the fragility of parametric\nknowledge, and the potential dark knowledge (hypothesis) that will be\nchallenging to address. We hope this work can help understand knowledge in LLMs\nand provide insights for future research.\n","authors":["Mengru Wang","Yunzhi Yao","Ziwen Xu","Shuofei Qiao","Shumin Deng","Peng Wang","Xiang Chen","Jia-Chen Gu","Yong Jiang","Pengjun Xie","Fei Huang","Huajun Chen","Ningyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2407.15017v4.pdf","comment":"EMNLP 2024 Findings; 39 pages (v4)"},{"id":"http://arxiv.org/abs/2207.09959v4","updated":"2024-12-04T09:45:26Z","published":"2022-07-20T15:09:16Z","title":"Exploration of Parameter Spaces Assisted by Machine Learning","summary":" We demonstrate two sampling procedures assisted by machine learning models\nvia regression and classification. The main objective is the use of a neural\nnetwork to suggest points likely inside regions of interest, reducing the\nnumber of evaluations of time consuming calculations. We compare results from\nthis approach with results from other sampling methods, namely Markov chain\nMonte Carlo and MultiNest, obtaining results that range from comparably similar\nto arguably better. In particular, we augment our classifier method with a\nboosting technique that rapidly increases the efficiency within a few\niterations. We show results from our methods applied to a toy model and the\ntype II 2HDM, using 3 and 7 free parameters, respectively. The code used for\nthis paper and instructions are publicly available on the web.\n","authors":["A. Hammad","Myeonghun Park","Raymundo Ramos","Pankaj Saha"],"pdf_url":"https://arxiv.org/pdf/2207.09959v4.pdf","comment":"30 pages, 9 figures. Matches published version. 
Code and instructions\n are available on https://github.com/AHamamd150/MLscanner"},{"id":"http://arxiv.org/abs/2402.14400v3","updated":"2024-12-04T09:44:26Z","published":"2024-02-22T09:34:48Z","title":"Learning Developmental Age from 3D Infant Kinetics Using Adaptive Graph\n Neural Networks","summary":" Reliable methods for the neurodevelopmental assessment of infants are\nessential for early detection of problems that may need prompt interventions.\nSpontaneous motor activity, or 'kinetics', is shown to provide a powerful\nsurrogate measure of upcoming neurodevelopment. However, its assessment is by\nand large qualitative and subjective, focusing on visually identified,\nage-specific gestures. In this work, we introduce Kinetic Age (KA), a novel\ndata-driven metric that quantifies neurodevelopmental maturity by predicting an\ninfant's age based on their movement patterns. KA offers an interpretable and\ngeneralizable proxy for motor development. Our method leverages 3D video\nrecordings of infants, processed with pose estimation to extract\nspatio-temporal series of anatomical landmarks, which are released as a new\nopenly available dataset. These data are modeled using adaptive graph\nconvolutional networks, able to capture the spatio-temporal dependencies in\ninfant movements. We also show that our data-driven approach achieves\nimprovement over traditional machine learning baselines based on manually\nengineered features.\n","authors":["Daniel Holmberg","Manu Airaksinen","Viviana Marchi","Andrea Guzzetta","Anna Kivi","Leena Haataja","Sampsa Vanhatalo","Teemu Roos"],"pdf_url":"https://arxiv.org/pdf/2402.14400v3.pdf","comment":"15 pages, 9 figures. 
Code repository available via\n https://github.com/deinal/infant-aagcn"},{"id":"http://arxiv.org/abs/2412.01064v2","updated":"2024-12-04T09:43:18Z","published":"2024-12-02T02:50:07Z","title":"FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking\n Portrait","summary":" With the rapid advancement of diffusion-based generative models, portrait\nimage animation has achieved remarkable results. However, it still faces\nchallenges in temporally consistent video generation and fast sampling due to\nits iterative sampling nature. This paper presents FLOAT, an audio-driven\ntalking portrait video generation method based on flow matching generative\nmodel. We shift the generative modeling from the pixel-based latent space to a\nlearned motion latent space, enabling efficient design of temporally consistent\nmotion. To achieve this, we introduce a transformer-based vector field\npredictor with a simple yet effective frame-wise conditioning mechanism.\nAdditionally, our method supports speech-driven emotion enhancement, enabling a\nnatural incorporation of expressive motions. Extensive experiments demonstrate\nthat our method outperforms state-of-the-art audio-driven talking portrait\nmethods in terms of visual quality, motion fidelity, and efficiency.\n","authors":["Taekyung Ki","Dongchan Min","Gyeongsu Chae"],"pdf_url":"https://arxiv.org/pdf/2412.01064v2.pdf","comment":"Project page: https://deepbrainai-research.github.io/float/"},{"id":"http://arxiv.org/abs/2412.03158v1","updated":"2024-12-04T09:35:03Z","published":"2024-12-04T09:35:03Z","title":"LEP-QNN: Loan Eligibility Prediction Using Quantum Neural Networks","summary":" Predicting loan eligibility with high accuracy remains a significant\nchallenge in the finance sector. Accurate predictions enable financial\ninstitutions to make informed decisions, mitigate risks, and effectively adapt\nservices to meet customer needs. 
However, the complexity and the\nhigh-dimensional nature of financial data have always posed significant\nchallenges to achieving this level of precision. To overcome these issues, we\npropose a novel approach that employs Quantum Machine Learning (QML) for Loan\nEligibility Prediction using Quantum Neural Networks (LEP-QNN). Our innovative\napproach achieves an accuracy of 98% in predicting loan eligibility from a\nsingle, comprehensive dataset. This performance boost is attributed to the\nstrategic implementation of a dropout mechanism within the quantum circuit,\naimed at minimizing overfitting and thereby improving the model's predictive\nreliability. In addition, our exploration of various optimizers leads to\nidentifying the most efficient setup for our LEP-QNN framework, optimizing its\nperformance. We also rigorously evaluate the resilience of LEP-QNN under\ndifferent quantum noise scenarios, ensuring its robustness and dependability\nfor quantum computing environments. This research showcases the potential of\nQML in financial predictions and establishes a foundational guide for advancing\nQML technologies, marking a step towards developing advanced, quantum-driven\nfinancial decision-making tools.\n","authors":["Nouhaila Innan","Alberto Marchisio","Mohamed Bennai","Muhammad Shafique"],"pdf_url":"https://arxiv.org/pdf/2412.03158v1.pdf","comment":"8 pages, 6 figures, 3 tables"},{"id":"http://arxiv.org/abs/2411.00809v2","updated":"2024-12-04T09:26:47Z","published":"2024-10-23T16:16:15Z","title":"Adaptive Dense Reward: Understanding the Gap Between Action and Reward\n Space in Alignment","summary":" Reinforcement Learning from Human Feedback (RLHF) has proven highly effective\nin aligning Large Language Models (LLMs) with human preferences. However, the\noriginal RLHF typically optimizes under an overall reward, which can lead to a\nsuboptimal learning process. 
This limitation stems from RLHF's lack of\nawareness regarding which specific tokens should be reinforced or suppressed.\nMoreover, conflicts in supervision can arise, for instance, when a chosen\nresponse includes erroneous tokens, while a rejected response contains accurate\nelements. To rectify these shortcomings, increasing dense reward methods, such\nas step-wise and token-wise RLHF, have been proposed. However, these existing\nmethods are limited to specific tasks (like mathematics). In this paper, we\npropose the ``Adaptive Message-wise RLHF'' method, which robustly applies to\nvarious tasks. By defining pivot tokens as key indicators, our approach\nadaptively identifies essential information and converts sequence-level\nsupervision into fine-grained, subsequence-level supervision. This aligns the\ndensity of rewards and action spaces more closely with the information density\nof the input. Experiments demonstrate that our method can be integrated into\nvarious training methods, significantly mitigating hallucinations and\ncatastrophic forgetting problems, while outperforming other methods on multiple\nevaluation metrics. 
Our method improves the success rate on adversarial samples\nby 10\\% compared to the sample-wise approach, and achieves a 1.3\\% improvement\non evaluation benchmarks such as MMLU, GSM8K, HumanEval, etc.\n","authors":["Yanshi Li","Shaopan Xiong","Gengru Chen","Xiaoyang Li","Yijia Luo","Xingyao Zhang","Yanhui Huang","Xingyuan Bu","Yingshui Tan","Chun Yuan","Jiamang Wang","Wenbo Su","Bo Zheng"],"pdf_url":"https://arxiv.org/pdf/2411.00809v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03154v1","updated":"2024-12-04T09:24:33Z","published":"2024-12-04T09:24:33Z","title":"Testing Neural Network Verifiers: A Soundness Benchmark with Hidden\n Counterexamples","summary":" In recent years, many neural network (NN) verifiers have been developed to\nformally verify certain properties of neural networks such as robustness.\nAlthough many benchmarks have been constructed to evaluate the performance of\nNN verifiers, they typically lack a ground-truth for hard instances where no\ncurrent verifier can verify and no counterexample can be found, which makes it\ndifficult to check the soundness of a new verifier if it claims to verify hard\ninstances which no other verifier can do. We propose to develop a soundness\nbenchmark for NN verification. Our benchmark contains instances with\ndeliberately inserted counterexamples while we also try to hide the\ncounterexamples from regular adversarial attacks which can be used for finding\ncounterexamples. We design a training method to produce neural networks with\nsuch hidden counterexamples. Our benchmark aims to be used for testing the\nsoundness of NN verifiers and identifying falsely claimed verifiability when it\nis known that hidden counterexamples exist. We systematically construct our\nbenchmark and generate instances across diverse model architectures, activation\nfunctions, input sizes, and perturbation radii. 
We demonstrate that our\nbenchmark successfully identifies bugs in state-of-the-art NN verifiers, as\nwell as synthetic bugs, providing a crucial step toward enhancing the\nreliability of testing NN verifiers. Our code is available at\nhttps://github.com/MVP-Harry/SoundnessBench and our benchmark is available at\nhttps://huggingface.co/datasets/SoundnessBench/SoundnessBench.\n","authors":["Xingjian Zhou","Hongji Xu","Andy Xu","Zhouxing Shi","Cho-Jui Hsieh","Huan Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.03154v1.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2412.03145v1","updated":"2024-12-04T09:11:33Z","published":"2024-12-04T09:11:33Z","title":"Topological Trajectory Classification and Landmark Inference on\n Simplicial Complexes","summary":" We consider the problem of classifying trajectories on a discrete or\ndiscretised 2-dimensional manifold modelled by a simplicial complex. Previous\nworks have proposed to project the trajectories into the harmonic eigenspace of\nthe Hodge Laplacian, and then cluster the resulting embeddings. However, if the\nconsidered space has vanishing homology (i.e., no \"holes\"), then the harmonic\nspace of the 1-Hodge Laplacian is trivial and thus the approach fails. Here we\npropose to view this issue akin to a sensor placement problem and present an\nalgorithm that aims to learn \"optimal holes\" to distinguish a set of given\ntrajectory classes. Specifically, given a set of labelled trajectories, which\nwe interpret as edge-flows on the underlying simplicial complex, we search for\n2-simplices whose deletion results in an optimal separation of the trajectory\nlabels according to the corresponding spectral embedding of the trajectories\ninto the harmonic space. Finally, we generalise this approach to the\nunsupervised setting.\n","authors":["Vincent P. Grande","Josef Hoppe","Florian Frantzen","Michael T. 
Schaub"],"pdf_url":"https://arxiv.org/pdf/2412.03145v1.pdf","comment":"5 pages, 4 figures, Accepted at the 58th Annual Asilomar Conference\n on Signals, Systems, and Computers 2024"},{"id":"http://arxiv.org/abs/2412.03134v1","updated":"2024-12-04T08:57:03Z","published":"2024-12-04T08:57:03Z","title":"Generalized Diffusion Model with Adjusted Offset Noise","summary":" Diffusion models have become fundamental tools for modeling data\ndistributions in machine learning and have applications in image generation,\ndrug discovery, and audio synthesis. Despite their success, these models face\nchallenges when generating data with extreme brightness values, as evidenced by\nlimitations in widely used frameworks like Stable Diffusion. Offset noise has\nbeen proposed as an empirical solution to this issue, yet its theoretical basis\nremains insufficiently explored. In this paper, we propose a generalized\ndiffusion model that naturally incorporates additional noise within a rigorous\nprobabilistic framework. Our approach modifies both the forward and reverse\ndiffusion processes, enabling inputs to be diffused into Gaussian distributions\nwith arbitrary mean structures. We derive a loss function based on the evidence\nlower bound, establishing its theoretical equivalence to offset noise with\ncertain adjustments, while broadening its applicability. 
Experiments on\nsynthetic datasets demonstrate that our model effectively addresses\nbrightness-related challenges and outperforms conventional methods in\nhigh-dimensional scenarios.\n","authors":["Takuro Kutsuna"],"pdf_url":"https://arxiv.org/pdf/2412.03134v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03131v1","updated":"2024-12-04T08:51:23Z","published":"2024-12-04T08:51:23Z","title":"Unifying KV Cache Compression for Large Language Models with LeanKV","summary":" Large language models (LLMs) demonstrate exceptional performance but incur\nhigh serving costs due to substantial memory demands, with the key-value (KV)\ncache being a primary bottleneck. Existing KV cache compression methods,\nincluding quantization and pruning, struggle with limitations such as uniform\ntreatment of keys and values and static memory allocation across attention\nheads. To address these challenges, we introduce LeanKV, a unified KV cache\ncompression framework that enhances LLM serving efficiency without compromising\naccuracy through three innovations: (1) Hetero-KV quantization, which stores\nkeys at a higher precision than values to reflect their greater impact on\nattention computations; (2) per-head dynamic sparsity, which allocates memory\nbased on token importance per head and per request; and (3) unified KV\ncompression, integrating mixed-precision quantization and selective pruning to\nenable a smooth tradeoff between model accuracy and memory efficiency. To\nefficiently support these techniques, LeanKV introduces systems optimizations\nincluding unified paging and on-GPU parallel memory management. Implemented on\nvLLM, LeanKV compresses the KV cache by $3.0\\times$ to $5.0\\times$ without\naccuracy loss and up to $11.0\\times$ with under 5% accuracy loss, enhancing\nthroughput by $1.9\\times$ to $2.5\\times$, and up to $6.9\\times$.\n","authors":["Yanqi Zhang","Yuwei Hu","Runyuan Zhao","John C. S. 
Lui","Haibo Chen"],"pdf_url":"https://arxiv.org/pdf/2412.03131v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.16615v2","updated":"2024-12-04T08:44:10Z","published":"2024-11-25T17:54:29Z","title":"Graph Pooling by Local Cluster Selection","summary":" Graph pooling is a family of operations which take graphs as input and\nproduce shrunken graphs as output. Modern graph pooling methods are trainable\nand, in general, inserted into Graph Neural Network (GNN) architectures as graph\nshrinking operators along the (deep) processing pipeline. This work proposes a\nnovel procedure for pooling graphs, along with a node-centred graph pooling\noperator.\n","authors":["Yizhu Chen"],"pdf_url":"https://arxiv.org/pdf/2411.16615v2.pdf","comment":"11 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.03120v1","updated":"2024-12-04T08:39:45Z","published":"2024-12-04T08:39:45Z","title":"Sinkhorn Algorithm for Sequentially Composed Optimal Transports","summary":" The Sinkhorn algorithm is the de facto standard approximation algorithm for\noptimal transport, which has been applied to a variety of applications,\nincluding image processing and natural language processing. In theory, the\nproof of its convergence follows from the convergence of the Sinkhorn--Knopp\nalgorithm for the matrix scaling problem, and Altschuler et al. show that its\nworst-case time complexity is near-linear. Very recently, sequentially\ncomposed optimal transports were proposed by Watanabe and Isobe as a\nhierarchical extension of optimal transports. In this paper, we present an\nefficient approximation algorithm, namely the Sinkhorn algorithm for sequentially\ncomposed optimal transports, for its entropic regularization. 
Furthermore, we\npresent a theoretical analysis of the Sinkhorn algorithm, namely (i) its\nexponential convergence to the optimal solution with respect to the Hilbert\npseudometric, and (ii) a worst-case complexity analysis for the case of one\nsequential composition.\n","authors":["Kazuki Watanabe","Noboru Isobe"],"pdf_url":"https://arxiv.org/pdf/2412.03120v1.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2405.16436v3","updated":"2024-12-04T08:15:35Z","published":"2024-05-26T05:38:50Z","title":"Provably Mitigating Overoptimization in RLHF: Your SFT Loss is\n Implicitly an Adversarial Regularizer","summary":" Aligning generative models with human preference via RLHF typically suffers\nfrom overoptimization, where an imperfectly learned reward model can misguide\nthe generative model to output undesired responses. We investigate this problem\nin a principled manner by identifying the source of the misalignment as a form\nof distributional shift and uncertainty in learning human preferences. To\nmitigate overoptimization, we first propose a theoretical algorithm that\nchooses the best policy for an adversarially chosen reward model; one that\nsimultaneously minimizes the maximum likelihood estimation of the loss and a\nreward penalty term. Here, the reward penalty term is introduced to prevent the\npolicy from choosing actions with spurious high proxy rewards, resulting in\nprovable sample efficiency of the algorithm under a partial coverage style\ncondition. Moving from theory to practice, the proposed algorithm further\nenjoys an equivalent but surprisingly easy-to-implement reformulation. Using\nthe equivalence between reward models and the corresponding optimal policy, the\nalgorithm features a simple objective that combines: (i) a preference\noptimization loss that directly aligns the policy with human preference, and\n(ii) a supervised learning loss that explicitly imitates the policy with a\n(suitable) baseline distribution. 
In the context of aligning large language\nmodels (LLM), this objective fuses the direct preference optimization (DPO)\nloss with the supervised fine-tuning (SFT) loss to help mitigate the\noveroptimization towards undesired responses, for which we name the algorithm\nRegularized Preference Optimization (RPO). Experiments of aligning LLMs\ndemonstrate the improved performance of RPO compared with DPO baselines. Our\nwork sheds light on the interplay between preference optimization and SFT in\ntuning LLMs with both theoretical guarantees and empirical evidence.\n","authors":["Zhihan Liu","Miao Lu","Shenao Zhang","Boyi Liu","Hongyi Guo","Yingxiang Yang","Jose Blanchet","Zhaoran Wang"],"pdf_url":"https://arxiv.org/pdf/2405.16436v3.pdf","comment":"Accepted by The Thirty-Eighth Annual Conference on Neural Information\n Processing Systems. 31 pages, 7 figures"},{"id":"http://arxiv.org/abs/2412.03105v1","updated":"2024-12-04T08:10:48Z","published":"2024-12-04T08:10:48Z","title":"Few-Shot Learning with Adaptive Weight Masking in Conditional GANs","summary":" Deep learning has revolutionized various fields, yet its efficacy is hindered\nby overfitting and the requirement of extensive annotated data, particularly in\nfew-shot learning scenarios where limited samples are available. This paper\nintroduces a novel approach to few-shot learning by employing a Residual Weight\nMasking Conditional Generative Adversarial Network (RWM-CGAN) for data\naugmentation. The proposed model integrates residual units within the generator\nto enhance network depth and sample quality, coupled with a weight mask\nregularization technique in the discriminator to improve feature learning from\nsmall-sample categories. This method addresses the core issues of robustness\nand generalization in few-shot learning by providing a controlled and clear\naugmentation of the sample space. 
Extensive experiments demonstrate that\nRWM-CGAN not only expands the sample space effectively but also enriches the\ndiversity and quality of generated samples, leading to significant improvements\nin detection and classification accuracy on public datasets. The paper\ncontributes to the advancement of few-shot learning by offering a practical\nsolution to the challenges posed by data scarcity and the need for rapid\ngeneralization to new tasks or categories.\n","authors":["Jiacheng Hu","Zhen Qi","Jianjun Wei","Jiajing Chen","Runyuan Bao","Xinyu Qiu"],"pdf_url":"https://arxiv.org/pdf/2412.03105v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.14946v2","updated":"2024-12-04T07:58:40Z","published":"2024-10-19T02:32:09Z","title":"DEL-Ranking: Ranking-Correction Denoising Framework for Elucidating\n Molecular Affinities in DNA-Encoded Libraries","summary":" DNA-encoded library (DEL) screening has revolutionized the detection of\nprotein-ligand interactions through read counts, enabling rapid exploration of\nvast chemical spaces. However, noise in read counts, stemming from nonspecific\ninteractions, can mislead this exploration process. We present DEL-Ranking, a\nnovel distribution-correction denoising framework that addresses these\nchallenges. Our approach introduces two key innovations: (1) a novel ranking\nloss that rectifies relative magnitude relationships between read counts,\nenabling the learning of causal features determining activity levels, and (2)\nan iterative algorithm employing self-training and consistency loss to\nestablish model coherence between activity label and read count predictions.\nFurthermore, we contribute three new DEL screening datasets, the first to\ncomprehensively include multi-dimensional molecular representations,\nprotein-ligand enrichment values, and their activity labels. These datasets\nmitigate data scarcity issues in AI-driven DEL screening research. 
Rigorous\nevaluation on diverse DEL datasets demonstrates DEL-Ranking's superior\nperformance across multiple correlation metrics, with significant improvements\nin binding affinity prediction accuracy. Our model exhibits zero-shot\ngeneralization ability across different protein targets and successfully\nidentifies potential motifs determining compound binding affinity. This work\nadvances DEL screening analysis and provides valuable resources for future\nresearch in this area.\n","authors":["Hanqun Cao","Mutian He","Ning Ma","Chang-yu Hsieh","Chunbin Gu","Pheng-Ann Heng"],"pdf_url":"https://arxiv.org/pdf/2410.14946v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03097v1","updated":"2024-12-04T07:50:27Z","published":"2024-12-04T07:50:27Z","title":"Enhancing Recommendation Systems with GNNs and Addressing Over-Smoothing","summary":" This paper addresses key challenges in enhancing recommendation systems by\nleveraging Graph Neural Networks (GNNs) and addressing inherent limitations\nsuch as over-smoothing, which reduces model effectiveness as network hierarchy\ndeepens. The proposed approach introduces three GNN-based recommendation\nmodels, specifically designed to mitigate over-smoothing through innovative\nmechanisms like residual connections and identity mapping within the\naggregation propagation process. These modifications enable more effective\ninformation flow across layers, preserving essential user-item interaction\ndetails to improve recommendation accuracy. Additionally, the study emphasizes\nthe critical need for interpretability in recommendation systems, aiming to\nprovide transparent and justifiable suggestions tailored to dynamic user\npreferences. By integrating collaborative filtering with GNN architectures, the\nproposed models not only enhance predictive accuracy but also align\nrecommendations more closely with individual behaviors, adapting to nuanced\nshifts in user interests. 
This work advances the field by tackling both\ntechnical and user-centric challenges, contributing to the development of\nrobust and explainable recommendation systems capable of managing the\ncomplexity and scale of modern online environments.\n","authors":["Wenyi Liu","Ziqi Zhang","Xinshi Li","Jiacheng Hu","Yuanshuai Luo","Junliang Du"],"pdf_url":"https://arxiv.org/pdf/2412.03097v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03092v1","updated":"2024-12-04T07:44:35Z","published":"2024-12-04T07:44:35Z","title":"Revolve: Optimizing AI Systems by Tracking Response Evolution in Textual\n Optimization","summary":" Recent advancements in large language models (LLMs) have significantly\nenhanced the ability of LLM-based systems to perform complex tasks through\nnatural language processing and tool interaction. However, optimizing these\nLLM-based systems for specific tasks remains challenging, often requiring\nmanual interventions like prompt engineering and hyperparameter tuning.\nExisting automatic optimization methods, such as textual feedback-based\ntechniques (e.g., TextGrad), tend to focus on immediate feedback, analogous to\nusing immediate derivatives in traditional numerical gradient descent. However,\nrelying solely on such feedback can be limited when the adjustments made in\nresponse to this feedback are either too small or fluctuate irregularly,\npotentially slowing down or even stalling the optimization process. To overcome\nthese challenges, more adaptive methods are needed, especially in situations\nwhere the system's response is evolving slowly or unpredictably. In this paper,\nwe introduce REVOLVE, an optimization method that tracks how \"R\"esponses\n\"EVOLVE\" across iterations in LLM systems. By focusing on the evolution of\nresponses over time, REVOLVE enables more stable and effective optimization by\nmaking thoughtful, progressive adjustments at each step. 
Experimental results\ndemonstrate that REVOLVE outperforms competitive baselines, achieving a 7.8%\nimprovement in prompt optimization, a 20.72% gain in solution refinement, and a\n29.17% increase in code optimization. Additionally, REVOLVE converges in fewer\niterations, resulting in significant computational savings. These advantages\nhighlight its adaptability and efficiency, positioning REVOLVE as a valuable\ntool for optimizing LLM-based systems and accelerating the development of\nnext-generation AI technologies. Code is available at:\nhttps://github.com/Peiyance/REVOLVE.\n","authors":["Peiyan Zhang","Haibo Jin","Leyang Hu","Xinnuo Li","Liying Kang","Man Luo","Yangqiu Song","Haohan Wang"],"pdf_url":"https://arxiv.org/pdf/2412.03092v1.pdf","comment":"20 pages, 2 figures"},{"id":"http://arxiv.org/abs/2412.03084v1","updated":"2024-12-04T07:26:36Z","published":"2024-12-04T07:26:36Z","title":"Hybrid deep learning-based strategy for the hepatocellular carcinoma\n cancer grade classification of H&E stained liver histopathology images","summary":" Hepatocellular carcinoma (HCC) is a common type of liver cancer whose\nearly-stage diagnosis is a common challenge, mainly due to the manual\nassessment of hematoxylin and eosin-stained whole slide images, which is a\ntime-consuming process and may lead to variability in decision-making. For\naccurate detection of HCC, we propose a hybrid deep learning-based architecture\nthat uses transfer learning to extract the features from pre-trained\nconvolutional neural network (CNN) models and a classifier made up of a\nsequence of fully connected layers. This study uses a publicly available The\nCancer Genome Atlas Hepatocellular Carcinoma (TCGA-LIHC)database (n=491) for\nmodel development and database of Kasturba Gandhi Medical College (KMC), India\nfor validation. The pre-processing step involves patch extraction, colour\nnormalization, and augmentation that results in 3920 patches for the TCGA\ndataset. 
The developed hybrid deep neural network consisting of a CNN-based\npre-trained feature extractor and a customized artificial neural network-based\nclassifier is trained using five-fold cross-validation. For this study, eight\ndifferent state-of-the-art models are trained and tested as feature extractors\nfor the proposed hybrid model. The proposed hybrid model with ResNet50-based\nfeature extractor provided the sensitivity, specificity, F1-score, accuracy,\nand AUC of 100.00%, 100.00%, 100.00%, 100.00%, and 1.00, respectively on the\nTCGA database. On the KMC database, EfficientNetb3 resulted in the optimal\nchoice of the feature extractor giving sensitivity, specificity, F1-score,\naccuracy, and AUC of 96.97, 98.85, 96.71, 96.71, and 0.99, respectively. The\nproposed hybrid models showed improvement in accuracy of 2% and 4% over the\npre-trained models in TCGA-LIHC and KMC databases.\n","authors":["Ajinkya Deshpande","Deep Gupta","Ankit Bhurane","Nisha Meshram","Sneha Singh","Petia Radeva"],"pdf_url":"https://arxiv.org/pdf/2412.03084v1.pdf","comment":"14 figure, 9 tables"},{"id":"http://arxiv.org/abs/2412.03083v1","updated":"2024-12-04T07:21:23Z","published":"2024-12-04T07:21:23Z","title":"A Scalable Quantum Neural Network for Approximate SRBB-Based Unitary\n Synthesis","summary":" In this work, scalable quantum neural networks are introduced to approximate\nunitary evolutions through the Standard Recursive Block Basis (SRBB) and,\nsubsequently, redesigned with a reduced number of CNOTs. This algebraic\napproach to the problem of unitary synthesis exploits Lie algebras and their\ntopological features to obtain scalable parameterizations of unitary operators.\nFirst, the recursive algorithm that builds the SRBB is presented, framed in the\noriginal scalability scheme already known to the literature only from a\ntheoretical point of view. Unexpectedly, 2-qubit systems emerge as a special\ncase outside this scheme. 
Furthermore, an algorithm to reduce the number of\nCNOTs is proposed, thus deriving a new implementable scaling scheme that\nrequires one single layer of approximation. From the mathematical algorithm,\nthe scalable CNOT-reduced quantum neural network is implemented and its\nperformance is assessed with a variety of different unitary matrices, both\nsparse and dense, up to 6 qubits via the PennyLane library. The effectiveness\nof the approximation is measured with different metrics in relation to two\noptimizers: a gradient-based method and the Nelder-Mead method. The approximate\nSRBB-based synthesis algorithm with CNOT-reduction is also tested on real\nhardware and compared with other valid approximation and decomposition methods\navailable in the literature.\n","authors":["Giacomo Belli","Marco Mordacci","Michele Amoretti"],"pdf_url":"https://arxiv.org/pdf/2412.03083v1.pdf","comment":"Journal"},{"id":"http://arxiv.org/abs/2410.07170v2","updated":"2024-12-04T07:18:17Z","published":"2024-10-09T17:59:06Z","title":"One Initialization to Rule them All: Fine-tuning via Explained Variance\n Adaptation","summary":" Foundation models (FMs) are pre-trained on large-scale datasets and then\nfine-tuned on a downstream task for a specific application. The most successful\nand most commonly used fine-tuning method is to update the pre-trained weights\nvia a low-rank adaptation (LoRA). LoRA introduces new weight matrices that are\nusually initialized at random with a uniform rank distribution across the model\nweights. Recent works focus on different initialization schemes or the learning\nof adaptive ranks during fine-tuning. Both approaches have only been\ninvestigated in isolation, resulting in slow convergence or a uniform rank\ndistribution, in turn leading to suboptimal performance. We propose to improve\nLoRA by initializing the new weights in a data-driven manner by computing\nsingular value decomposition (SVD) on minibatches of activation vectors. 
Then,\nwe initialize the LoRA matrices with the obtained right-singular vectors and\nredistribute ranks among all weight matrices to provably store the maximum\namount of information of the downstream data in the newly introduced weights.\nIn this way, only what information to maintain or neglect during the\nfine-tuning process needs to be learned. We call our new method Explained\nVariance Adaptation (EVA). We apply EVA to a variety of fine-tuning tasks\nranging from language generation and understanding to image classification and\nreinforcement learning. EVA exhibits faster convergence than competitors and\nachieves the highest average score across a multitude of tasks per domain while\nreducing the number of trainable parameters through rank redistribution.\n","authors":["Fabian Paischer","Lukas Hauzenberger","Thomas Schmied","Benedikt Alkin","Marc Peter Deisenroth","Sepp Hochreiter"],"pdf_url":"https://arxiv.org/pdf/2410.07170v2.pdf","comment":"11 pages + references and appendix, code available at\n https://github.com/ml-jku/EVA"},{"id":"http://arxiv.org/abs/2412.02538v2","updated":"2024-12-04T07:11:07Z","published":"2024-12-03T16:32:19Z","title":"On Privacy, Security, and Trustworthiness in Distributed Wireless Large\n AI Models (WLAM)","summary":" Combining wireless communication with large artificial intelligence (AI)\nmodels can open up a myriad of novel application scenarios. In sixth generation\n(6G) networks, ubiquitous communication and computing resources allow large AI\nmodels to serve democratic large AI models-related services to enable real-time\napplications like autonomous vehicles, smart cities, and Internet of Things\n(IoT) ecosystems. However, the security considerations and sustainable\ncommunication resources limit the deployment of large AI models over\ndistributed wireless networks. This paper provides a comprehensive overview of\nprivacy, security, and trustworthy for distributed wireless large AI model\n(WLAM). 
In particular, a detailed privacy and security are analysis for\ndistributed WLAM is fist revealed. The classifications and theoretical findings\nabout privacy and security in distributed WLAM are discussed. Then the\ntrustworthy and ethics for implementing distributed WLAM are described.\nFinally, the comprehensive applications of distributed WLAM are presented in\nthe context of electromagnetic signal processing.\n","authors":["Zhaohui Yang","Wei Xu","Le Liang","Yuanhao Cui","Zhijin Qin","Merouane Debbah"],"pdf_url":"https://arxiv.org/pdf/2412.02538v2.pdf","comment":"12 pages, 4 figures"},{"id":"http://arxiv.org/abs/2410.08631v2","updated":"2024-12-04T06:58:26Z","published":"2024-10-11T08:53:58Z","title":"CryoFM: A Flow-based Foundation Model for Cryo-EM Densities","summary":" Cryo-electron microscopy (cryo-EM) is a powerful technique in structural\nbiology and drug discovery, enabling the study of biomolecules at high\nresolution. Significant advancements by structural biologists using cryo-EM\nhave led to the production of over 38,626 protein density maps at various\nresolutions1. However, cryo-EM data processing algorithms have yet to fully\nbenefit from our knowledge of biomolecular density maps, with only a few recent\nmodels being data-driven but limited to specific tasks. In this study, we\npresent CryoFM, a foundation model designed as a generative model, learning the\ndistribution of high-quality density maps and generalizing effectively to\ndownstream tasks. Built on flow matching, CryoFM is trained to accurately\ncapture the prior distribution of biomolecular density maps. 
Furthermore, we\nintroduce a flow posterior sampling method that leverages CRYOFM as a flexible\nprior for several downstream tasks in cryo-EM and cryo-electron tomography\n(cryo-ET) without the need for fine-tuning, achieving state-of-the-art\nperformance on most tasks and demonstrating its potential as a foundational\nmodel for broader applications in these fields.\n","authors":["Yi Zhou","Yilai Li","Jing Yuan","Quanquan Gu"],"pdf_url":"https://arxiv.org/pdf/2410.08631v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03068v1","updated":"2024-12-04T06:42:55Z","published":"2024-12-04T06:42:55Z","title":"UTSD: Unified Time Series Diffusion Model","summary":" Transformer-based architectures have achieved unprecedented success in time\nseries analysis. However, facing the challenge of across-domain modeling,\nexisting studies utilize statistical prior as prompt engineering fails under\nthe huge distribution shift among various domains. In this paper, a Unified\nTime Series Diffusion (UTSD) model is established for the first time to model\nthe multi-domain probability distribution, utilizing the powerful probability\ndistribution modeling ability of Diffusion. Unlike the autoregressive models\nthat capture the conditional probabilities of the prediction horizon to the\nhistorical sequence, we use a diffusion denoising process to model the mixture\ndistribution of the cross-domain data and generate the prediction sequence for\nthe target domain directly utilizing conditional sampling. 
The proposed UTSD\ncontains three pivotal designs: (1) The condition network captures the\nmulti-scale fluctuation patterns from the observation sequence, which are\nutilized as context representations to guide the denoising network to generate\nthe prediction sequence; (2) Adapter-based fine-tuning strategy, the\nmulti-domain universal representation learned in the pretraining stage is\nutilized for downstream tasks in target domains; (3) The diffusion and\ndenoising process on the actual sequence space, combined with the improved\nclassifier free guidance as the conditional generation strategy, greatly\nimproves the stability and accuracy of the downstream task. We conduct\nextensive experiments on mainstream benchmarks, and the pre-trained UTSD\noutperforms existing foundation models on all data domains, exhibiting superior\nzero-shot generalization ability. After training from scratch, UTSD achieves\ncomparable performance against domain-specific proprietary models. The\nempirical results validate the potential of UTSD as a time series foundational\nmodel.\n","authors":["Xiangkai Ma","Xiaobin Hong","Wenzhong Li","Sanglu Lu"],"pdf_url":"https://arxiv.org/pdf/2412.03068v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03056v1","updated":"2024-12-04T06:20:51Z","published":"2024-12-04T06:20:51Z","title":"Point-GN: A Non-Parametric Network Using Gaussian Positional Encoding\n for Point Cloud Classification","summary":" This paper introduces Point-GN, a novel non-parametric network for efficient\nand accurate 3D point cloud classification. Unlike conventional deep learning\nmodels that rely on a large number of trainable parameters, Point-GN leverages\nnon-learnable components-specifically, Farthest Point Sampling (FPS), k-Nearest\nNeighbors (k-NN), and Gaussian Positional Encoding (GPE)-to extract both local\nand global geometric features. 
This design eliminates the need for additional\ntraining while maintaining high performance, making Point-GN particularly\nsuited for real-time, resource-constrained applications. We evaluate Point-GN\non two benchmark datasets, ModelNet40 and ScanObjectNN, achieving\nclassification accuracies of 85.29% and 85.89%, respectively, while\nsignificantly reducing computational complexity. Point-GN outperforms existing\nnon-parametric methods and matches the performance of fully trained models, all\nwith zero learnable parameters. Our results demonstrate that Point-GN is a\npromising solution for 3D point cloud classification in practical, real-time\nenvironments.\n","authors":["Marzieh Mohammadi","Amir Salarpour"],"pdf_url":"https://arxiv.org/pdf/2412.03056v1.pdf","comment":"This paper has been accepted for presentation at the IEEE Winter\n Conference on Applications of Computer Vision (WACV) 2025"},{"id":"http://arxiv.org/abs/2405.18407v2","updated":"2024-12-04T06:18:16Z","published":"2024-05-28T17:47:19Z","title":"Phased Consistency Models","summary":" Consistency Models (CMs) have made significant progress in accelerating the\ngeneration of diffusion models. However, their application to high-resolution,\ntext-conditioned image generation in the latent space remains unsatisfactory.\nIn this paper, we identify three key flaws in the current design of Latent\nConsistency Models (LCMs). We investigate the reasons behind these limitations\nand propose Phased Consistency Models (PCMs), which generalize the design space\nand address the identified limitations. Our evaluations demonstrate that PCMs\noutperform LCMs across 1--16 step generation settings. While PCMs are\nspecifically designed for multi-step refinement, they achieve comparable 1-step\ngeneration results to previously state-of-the-art specifically designed 1-step\nmethods. 
Furthermore, we show the methodology of PCMs is versatile and\napplicable to video generation, enabling us to train the state-of-the-art\nfew-step text-to-video generator. Our code is available at\nhttps://github.com/G-U-N/Phased-Consistency-Model.\n","authors":["Fu-Yun Wang","Zhaoyang Huang","Alexander William Bergman","Dazhong Shen","Peng Gao","Michael Lingelbach","Keqiang Sun","Weikang Bian","Guanglu Song","Yu Liu","Xiaogang Wang","Hongsheng Li"],"pdf_url":"https://arxiv.org/pdf/2405.18407v2.pdf","comment":"NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.03051v1","updated":"2024-12-04T06:11:09Z","published":"2024-12-04T06:11:09Z","title":"Less is More: A Stealthy and Efficient Adversarial Attack Method for\n DRL-based Autonomous Driving Policies","summary":" Despite significant advancements in deep reinforcement learning (DRL)-based\nautonomous driving policies, these policies still exhibit vulnerability to\nadversarial attacks. This vulnerability poses a formidable challenge to the\npractical deployment of these policies in autonomous driving. Designing\neffective adversarial attacks is an indispensable prerequisite for enhancing\nthe robustness of these policies. In view of this, we present a novel stealthy\nand efficient adversarial attack method for DRL-based autonomous driving\npolicies. Specifically, we introduce a DRL-based adversary designed to trigger\nsafety violations (e.g., collisions) by injecting adversarial samples at\ncritical moments. We model the attack as a mixed-integer optimization problem\nand formulate it as a Markov decision process. Then, we train the adversary to\nlearn the optimal policy for attacking at critical moments without domain\nknowledge. Furthermore, we introduce attack-related information and a\ntrajectory clipping method to enhance the learning capability of the adversary.\nFinally, we validate our method in an unprotected left-turn scenario across\ndifferent traffic densities. 
The experimental results show that our method\nachieves more than 90% collision rate within three attacks in most cases.\nFurthermore, our method achieves more than 130% improvement in attack\nefficiency compared to the unlimited attack method.\n","authors":["Junchao Fan","Xuyang Lei","Xiaolin Chang","Jelena Mišić","Vojislav B. Mišić"],"pdf_url":"https://arxiv.org/pdf/2412.03051v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01858v2","updated":"2024-12-04T05:35:17Z","published":"2024-11-30T19:53:25Z","title":"MQFL-FHE: Multimodal Quantum Federated Learning Framework with Fully\n Homomorphic Encryption","summary":" The integration of fully homomorphic encryption (FHE) in federated learning\n(FL) has led to significant advances in data privacy. However, during the\naggregation phase, it often results in performance degradation of the\naggregated model, hindering the development of robust representational\ngeneralization. In this work, we propose a novel multimodal quantum federated\nlearning framework that utilizes quantum computing to counteract the\nperformance drop resulting from FHE. For the first time in FL, our framework\ncombines a multimodal quantum mixture of experts (MQMoE) model with FHE,\nincorporating multimodal datasets for enriched representation and task-specific\nlearning. Our MQMoE framework enhances performance on multimodal datasets and\ncombined genomics and brain MRI scans, especially for underrepresented\ncategories. Our results also demonstrate that the quantum-enhanced approach\nmitigates the performance degradation associated with FHE and improves\nclassification accuracy across diverse datasets, validating the potential of\nquantum interventions in enhancing privacy in FL.\n","authors":["Siddhant Dutta","Nouhaila Innan","Sadok Ben Yahia","Muhammad Shafique","David Esteban Bernal Neira"],"pdf_url":"https://arxiv.org/pdf/2412.01858v2.pdf","comment":"14 pages, 6 figures, 5 Tables. 
Under Review"},{"id":"http://arxiv.org/abs/2412.03038v1","updated":"2024-12-04T05:19:34Z","published":"2024-12-04T05:19:34Z","title":"MILLION: A General Multi-Objective Framework with Controllable Risk for\n Portfolio Management","summary":" Portfolio management is an important yet challenging task in AI for FinTech,\nwhich aims to allocate investors' budgets among different assets to balance the\nrisk and return of an investment. In this study, we propose a general\nMulti-objectIve framework with controLLable rIsk for pOrtfolio maNagement\n(MILLION), which consists of two main phases, i.e., return-related maximization\nand risk control. Specifically, in the return-related maximization phase, we\nintroduce two auxiliary objectives, i.e., return rate prediction, and return\nrate ranking, combined with portfolio optimization to remit the overfitting\nproblem and improve the generalization of the trained model to future markets.\nSubsequently, in the risk control phase, we propose two methods, i.e.,\nportfolio interpolation and portfolio improvement, to achieve fine-grained risk\ncontrol and fast risk adaption to a user-specified risk level. For the\nportfolio interpolation method, we theoretically prove that the risk can be\nperfectly controlled if the to-be-set risk level is in a proper interval. In\naddition, we also show that the return rate of the adjusted portfolio after\nportfolio interpolation is no less than that of the min-variance optimization,\nas long as the model in the reward maximization phase is effective.\nFurthermore, the portfolio improvement method can achieve greater return rates\nwhile keeping the same risk level compared to portfolio interpolation.\nExtensive experiments are conducted on three real-world datasets. 
The results\ndemonstrate the effectiveness and efficiency of the proposed framework.\n","authors":["Liwei Deng","Tianfu Wang","Yan Zhao","Kai Zheng"],"pdf_url":"https://arxiv.org/pdf/2412.03038v1.pdf","comment":"accepted by VLDB 2025"},{"id":"http://arxiv.org/abs/2412.03035v1","updated":"2024-12-04T05:16:48Z","published":"2024-12-04T05:16:48Z","title":"A Granger-Causal Perspective on Gradient Descent with Application to\n Pruning","summary":" Stochastic Gradient Descent (SGD) is the main approach to optimizing neural\nnetworks. Several generalization properties of deep networks, such as\nconvergence to a flatter minima, are believed to arise from SGD. This article\nexplores the causality aspect of gradient descent. Specifically, we show that\nthe gradient descent procedure has an implicit granger-causal relationship\nbetween the reduction in loss and a change in parameters. By suitable\nmodifications, we make this causal relationship explicit. A causal approach to\ngradient descent has many significant applications which allow greater control.\nIn this article, we illustrate the significance of the causal approach using\nthe application of Pruning. The causal approach to pruning has several\ninteresting properties - (i) We observe a phase shift as the percentage of\npruned parameters increase. Such phase shift is indicative of an optimal\npruning strategy. 
(ii) After pruning, we see that minima becomes \"flatter\",\nexplaining the increase in accuracy after pruning weights.\n","authors":["Aditya Shah","Aditya Challa","Sravan Danda","Archana Mathur","Snehanshu Saha"],"pdf_url":"https://arxiv.org/pdf/2412.03035v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.18958v3","updated":"2024-12-04T05:04:42Z","published":"2024-10-24T17:55:52Z","title":"Stable Consistency Tuning: Understanding and Improving Consistency\n Models","summary":" Diffusion models achieve superior generation quality but suffer from slow\ngeneration speed due to the iterative nature of denoising. In contrast,\nconsistency models, a new generative family, achieve competitive performance\nwith significantly faster sampling. These models are trained either through\nconsistency distillation, which leverages pretrained diffusion models, or\nconsistency training/tuning directly from raw data. In this work, we propose a\nnovel framework for understanding consistency models by modeling the denoising\nprocess of the diffusion model as a Markov Decision Process (MDP) and framing\nconsistency model training as the value estimation through Temporal\nDifference~(TD) Learning. More importantly, this framework allows us to analyze\nthe limitations of current consistency training/tuning strategies. Built upon\nEasy Consistency Tuning (ECT), we propose Stable Consistency Tuning (SCT),\nwhich incorporates variance-reduced learning using the score identity. SCT\nleads to significant performance improvements on benchmarks such as CIFAR-10\nand ImageNet-64. 
On ImageNet-64, SCT achieves 1-step FID 2.42 and 2-step FID\n1.55, a new SoTA for consistency models.\n","authors":["Fu-Yun Wang","Zhengyang Geng","Hongsheng Li"],"pdf_url":"https://arxiv.org/pdf/2410.18958v3.pdf","comment":"Code is available at\n https://github.com/G-U-N/Stable-Consistency-Tuning"},{"id":"http://arxiv.org/abs/2405.19600v2","updated":"2024-12-04T04:41:49Z","published":"2024-05-30T01:30:34Z","title":"Rethinking Spectral Augmentation for Contrast-based Graph\n Self-Supervised Learning","summary":" The recent surge in contrast-based graph self-supervised learning has\nprominently featured an intensified exploration of spectral cues. Spectral\naugmentation, which involves modifying a graph's spectral properties such as\neigenvalues or eigenvectors, is widely believed to enhance model performance.\nHowever, an intriguing paradox emerges, as methods grounded in seemingly\nconflicting assumptions regarding the spectral domain demonstrate notable\nenhancements in learning performance. Through extensive empirical studies, we\nfind that simple edge perturbations - random edge dropping for node-level and\nrandom edge adding for graph-level self-supervised learning - consistently\nyield comparable or superior performance while being significantly more\ncomputationally efficient. This suggests that the computational overhead of\nsophisticated spectral augmentations may not justify their practical benefits.\nOur theoretical analysis of the InfoNCE loss bounds for shallow GNNs further\nsupports this observation. 
The proposed insights represent a significant leap\nforward in the field, potentially refining the understanding and implementation\nof graph self-supervised learning.\n","authors":["Xiangru Jian","Xinjian Zhao","Wei Pang","Chaolong Ying","Yimu Wang","Yaoyao Xu","Tianshu Yu"],"pdf_url":"https://arxiv.org/pdf/2405.19600v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.12635v2","updated":"2024-12-04T04:18:35Z","published":"2024-04-19T05:32:37Z","title":"AED-PADA:Improving Generalizability of Adversarial Example Detection via\n Principal Adversarial Domain Adaptation","summary":" Adversarial example detection, which can be conveniently applied in many\nscenarios, is important in the area of adversarial defense. Unfortunately,\nexisting detection methods suffer from poor generalization performance, because\ntheir training process usually relies on the examples generated from a single\nknown adversarial attack and there exists a large discrepancy between the\ntraining and unseen testing adversarial examples. To address this issue, we\npropose a novel method, named Adversarial Example Detection via Principal\nAdversarial Domain Adaptation (AED-PADA). Specifically, our approach identifies\nthe Principal Adversarial Domains (PADs), i.e., a combination of features of\nthe adversarial examples generated by different attacks, which possesses a\nlarge portion of the entire adversarial feature space. Subsequently, we pioneer\nto exploit Multi-source Unsupervised Domain Adaptation in adversarial example\ndetection, with PADs as the source domains. Experimental results demonstrate\nthe superior generalization ability of our proposed AED-PADA. 
Note that this\nsuperiority is particularly achieved in challenging scenarios characterized by\nemploying the minimal magnitude constraint for the perturbations.\n","authors":["Heqi Peng","Yunhong Wang","Ruijie Yang","Beichen Li","Rui Wang","Yuanfang Guo"],"pdf_url":"https://arxiv.org/pdf/2404.12635v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03018v1","updated":"2024-12-04T04:08:51Z","published":"2024-12-04T04:08:51Z","title":"Hamiltonian-based neural networks for systems under nonholonomic\n constraints","summary":" There has been increasing interest in methodologies that incorporate physics\npriors into neural network architectures to enhance their modeling\ncapabilities. A family of these methodologies that has gained traction are\nHamiltonian neural networks (HNN) and their variations. These architectures\nexplicitly encode Hamiltonian mechanics both in their structure and loss\nfunction. Although Hamiltonian systems under nonholonomic constraints are in\ngeneral not Hamiltonian, it is possible to formulate them in pseudo-Hamiltonian\nform, equipped with a Lie bracket which is almost Poisson. This opens the\npossibility of using some principles of HNNs in systems under nonholonomic\nconstraints. The goal of the present work is to develop a modified Hamiltonian\nneural network architecture capable of modeling Hamiltonian systems under\nholonomic and nonholonomic constraints. A three-network parallel architecture\nis proposed to simultaneously learn the Hamiltonian of the system, the\nconstraints, and their associated multipliers. A rolling disk and a ball on a\nspinning table are considered as canonical examples to assess the performance\nof the proposed Hamiltonian architecture. The experiments are then repeated\nwith a noisy training set to study modeling performance under more realistic\nconditions.\n","authors":["Ignacio Puiggros T.","A. 
Srikantha Phani"],"pdf_url":"https://arxiv.org/pdf/2412.03018v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03012v1","updated":"2024-12-04T04:02:38Z","published":"2024-12-04T04:02:38Z","title":"Learning Whole-Body Loco-Manipulation for Omni-Directional Task Space\n Pose Tracking with a Wheeled-Quadrupedal-Manipulator","summary":" In this paper, we study the whole-body loco-manipulation problem using\nreinforcement learning (RL). Specifically, we focus on the problem of how to\ncoordinate the floating base and the robotic arm of a wheeled-quadrupedal\nmanipulator robot to achieve direct six-dimensional (6D) end-effector (EE) pose\ntracking in task space. Different from conventional whole-body\nloco-manipulation problems that track both floating-base and end-effector\ncommands, the direct EE pose tracking problem requires inherent balance among\nredundant degrees of freedom in the whole-body motion. We leverage RL to solve\nthis challenging problem. To address the associated difficulties, we develop a\nnovel reward fusion module (RFM) that systematically integrates reward terms\ncorresponding to different tasks in a nonlinear manner. In such a way, the\ninherent multi-stage and hierarchical feature of the loco-manipulation problem\ncan be carefully accommodated. By combining the proposed RFM with the a\nteacher-student RL training paradigm, we present a complete RL scheme to\nachieve 6D EE pose tracking for the wheeled-quadruped manipulator robot.\nExtensive simulation and hardware experiments demonstrate the significance of\nthe RFM. In particular, we enable smooth and precise tracking performance,\nachieving state-of-the-art tracking position error of less than 5 cm, and\nrotation error of less than 0.1 rad. 
Please refer to\nhttps://clearlab-sustech.github.io/RFM_loco_mani/ for more experimental videos.\n","authors":["Kaiwen Jiang","Zhen Fu","Junde Guo","Wei Zhang","Hua Chen"],"pdf_url":"https://arxiv.org/pdf/2412.03012v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03009v1","updated":"2024-12-04T03:56:54Z","published":"2024-12-04T03:56:54Z","title":"Data Acquisition for Improving Model Fairness using Reinforcement\n Learning","summary":" Machine learning systems are increasingly being used in critical decision\nmaking such as healthcare, finance, and criminal justice. Concerns around their\nfairness have resulted in several bias mitigation techniques that emphasize the\nneed for high-quality data to ensure fairer decisions. However, the role of\nearlier stages of machine learning pipelines in mitigating model bias has not\nbeen explored well. In this paper, we focus on the task of acquiring additional\nlabeled data points for training the downstream machine learning model to\nrapidly improve its fairness. Since not all data points in a data pool are\nequally beneficial to the task of fairness, we generate an ordering in which\ndata points should be acquired. We present DataSift, a data acquisition\nframework based on the idea of data valuation that relies on partitioning and\nmulti-armed bandits to determine the most valuable data points to acquire. Over\nseveral iterations, DataSift selects a partition and randomly samples a batch\nof data points from the selected partition, evaluates the benefit of acquiring\nthe batch on model fairness, and updates the utility of partitions depending on\nthe benefit. To further improve the effectiveness and efficiency of evaluating\nbatches, we leverage influence functions that estimate the effect of acquiring\na batch without retraining the model. 
We empirically evaluate DataSift on\nseveral real-world and synthetic datasets and show that the fairness of a\nmachine learning model can be significantly improved even while acquiring a few\ndata points.\n","authors":["Jahid Hasan","Romila Pradhan"],"pdf_url":"https://arxiv.org/pdf/2412.03009v1.pdf","comment":"19 pages, 9 figures"},{"id":"http://arxiv.org/abs/2412.03008v1","updated":"2024-12-04T03:56:14Z","published":"2024-12-04T03:56:14Z","title":"Provably Extending PageRank-based Local Clustering Algorithm to Weighted\n Directed Graphs with Self-Loops and to Hypergraphs","summary":" Local clustering aims to find a compact cluster near the given starting\ninstances. This work focuses on graph local clustering, which has broad\napplications beyond graphs because of the internal connectivities within\nvarious modalities. While most existing studies on local graph clustering adopt\nthe discrete graph setting (i.e., unweighted graphs without self-loops),\nreal-world graphs can be more complex. In this paper, we extend the\nnon-approximating Andersen-Chung-Lang (\"ACL\") algorithm beyond discrete graphs\nand generalize its quadratic optimality to a wider range of graphs, including\nweighted, directed, and self-looped graphs and hypergraphs. Specifically,\nleveraging PageRank, we propose two algorithms: GeneralACL for graphs and\nHyperACL for hypergraphs. We theoretically prove that, under two mild\nconditions, both algorithms can identify a quadratically optimal local cluster\nin terms of conductance with at least 1/2 probability. 
On the property of\nhypergraphs, we address a fundamental gap in the literature by defining\nconductance for hypergraphs from the perspective of hypergraph random walks.\nAdditionally, we provide experiments to validate our theoretical findings.\n","authors":["Zihao Li","Dongqi Fu","Hengyu Liu","Jingrui He"],"pdf_url":"https://arxiv.org/pdf/2412.03008v1.pdf","comment":"Preprint, 42 pages"},{"id":"http://arxiv.org/abs/2408.15126v6","updated":"2024-12-04T03:24:18Z","published":"2024-08-27T15:07:27Z","title":"Force-Guided Bridge Matching for Full-Atom Time-Coarsened Dynamics of\n Peptides","summary":" Molecular Dynamics (MD) is crucial in various fields such as materials\nscience, chemistry, and pharmacology to name a few. Conventional MD software\nstruggles with the balance between time cost and prediction accuracy, which\nrestricts its wider application. Recently, data-driven approaches based on deep\ngenerative models have been devised for time-coarsened dynamics, which aim at\nlearning dynamics of diverse molecular systems over a long timestep, enjoying\nboth universality and efficiency. Nevertheless, most current methods are\ndesigned solely to learn from the data distribution regardless of the\nunderlying Boltzmann distribution, and the physics priors such as energies and\nforces are constantly overlooked. In this work, we propose a conditional\ngenerative model called Force-guided Bridge Matching (FBM), which learns\nfull-atom time-coarsened dynamics and targets the Boltzmann-constrained\ndistribution. With the guidance of our delicately-designed intermediate force\nfield, FBM leverages favourable physics priors into the generation process,\ngiving rise to enhanced simulations. 
Experiments on two datasets consisting of\npeptides verify our superiority in terms of comprehensive metrics and\ndemonstrate transferability to unseen systems.\n","authors":["Ziyang Yu","Wenbing Huang","Yang Liu"],"pdf_url":"https://arxiv.org/pdf/2408.15126v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02988v1","updated":"2024-12-04T03:02:55Z","published":"2024-12-04T03:02:55Z","title":"Preference-based Pure Exploration","summary":" We study the preference-based pure exploration problem for bandits with\nvector-valued rewards. The rewards are ordered using a (given) preference cone\n$\\mathcal{C}$ and our the goal is to identify the set of Pareto optimal arms.\nFirst, to quantify the impact of preferences, we derive a novel lower bound on\nthe sample complexity for identifying the most preferred policy with confidence\nlevel $1-\\delta$. Our lower bound elicits the role played by the geometry of\nthe preference cone and punctuates the difference in hardness compared to\nexisting best-arm identification variants of the problem. We further explicate\nthis geometry when rewards follow Gaussian distributions. We then provide a\nconvex relaxation of the lower bound. and leverage it to design\nPreference-based Track and Stop (PreTS) algorithm that identifies the most\npreferred policy. Finally, we show that sample complexity of PreTS is\nasymptotically tight by deriving a new concentration inequality for\nvector-valued rewards.\n","authors":["Apurv Shukla","Debabrota Basu"],"pdf_url":"https://arxiv.org/pdf/2412.02988v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.08027v2","updated":"2024-12-04T02:57:03Z","published":"2024-04-11T15:58:12Z","title":"SurvMamba: State Space Model with Multi-grained Multi-modal Interaction\n for Survival Prediction","summary":" Multi-modal learning that combines pathological images with genomic data has\nsignificantly enhanced the accuracy of survival prediction. 
Nevertheless,\nexisting methods have not fully utilized the inherent hierarchical structure\nwithin both whole slide images (WSIs) and transcriptomic data, from which\nbetter intra-modal representations and inter-modal integration could be\nderived. Moreover, many existing studies attempt to improve multi-modal\nrepresentations through attention mechanisms, which inevitably lead to high\ncomplexity when processing high-dimensional WSIs and transcriptomic data.\nRecently, a structured state space model named Mamba emerged as a promising\napproach for its superior performance in modeling long sequences with low\ncomplexity. In this study, we propose Mamba with multi-grained multi-modal\ninteraction (SurvMamba) for survival prediction. SurvMamba is implemented with\na Hierarchical Interaction Mamba (HIM) module that facilitates efficient\nintra-modal interactions at different granularities, thereby capturing more\ndetailed local features as well as rich global representations. In addition, an\nInteraction Fusion Mamba (IFM) module is used for cascaded inter-modal\ninteractive fusion, yielding more comprehensive features for survival\nprediction. Comprehensive evaluations on five TCGA datasets demonstrate that\nSurvMamba outperforms other existing methods in terms of performance and\ncomputational cost.\n","authors":["Ying Chen","Jiajing Xie","Yuxiang Lin","Yuhang Song","Wenxian Yang","Rongshan Yu"],"pdf_url":"https://arxiv.org/pdf/2404.08027v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02980v1","updated":"2024-12-04T02:47:45Z","published":"2024-12-04T02:47:45Z","title":"Surveying the Effects of Quality, Diversity, and Complexity in Synthetic\n Data From Large Language Models","summary":" Synthetic data generation with Large Language Models is a promising paradigm\nfor augmenting natural data over a nearly infinite range of tasks. 
Given this\nvariety, direct comparisons among synthetic data generation algorithms are\nscarce, making it difficult to understand where improvement comes from and what\nbottlenecks exist. We propose to evaluate algorithms via the makeup of\nsynthetic data generated by each algorithm in terms of data quality, diversity,\nand complexity. We choose these three characteristics for their significance in\nopen-ended processes and the impact each has on the capabilities of downstream\nmodels. We find quality to be essential for in-distribution model\ngeneralization, diversity to be essential for out-of-distribution\ngeneralization, and complexity to be beneficial for both. Further, we emphasize\nthe existence of Quality-Diversity trade-offs in training data and the\ndownstream effects on model performance. We then examine the effect of various\ncomponents in the synthetic data pipeline on each data characteristic. This\nexamination allows us to taxonomize and compare synthetic data generation\nalgorithms through the components they utilize and the resulting effects on\ndata QDC composition. This analysis extends into a discussion on the importance\nof balancing QDC in synthetic data for efficient reinforcement learning and\nself-improvement algorithms. Analogous to the QD trade-offs in training data,\noften there exist trade-offs between model output quality and output diversity\nwhich impact the composition of synthetic data. We observe that many models are\ncurrently evaluated and optimized only for output quality, thereby limiting\noutput diversity and the potential for self-improvement. 
We argue that\nbalancing these trade-offs is essential to the development of future\nself-improvement algorithms and highlight a number of works making progress in\nthis direction.\n","authors":["Alex Havrilla","Andrew Dai","Laura O'Mahony","Koen Oostermeijer","Vera Zisler","Alon Albalak","Fabrizio Milo","Sharath Chandra Raparthy","Kanishk Gandhi","Baber Abbasi","Duy Phung","Maia Iyer","Dakota Mahan","Chase Blagden","Srishti Gureja","Mohammed Hamdy","Wen-Ding Li","Giovanni Paolini","Pawan Sasanka Ammanamanchi","Elliot Meyerson"],"pdf_url":"https://arxiv.org/pdf/2412.02980v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02975v1","updated":"2024-12-04T02:37:31Z","published":"2024-12-04T02:37:31Z","title":"Theoretical limitations of multi-layer Transformer","summary":" Transformers, especially the decoder-only variants, are the backbone of most\nmodern large language models; yet we do not have much understanding of their\nexpressive power except for the simple $1$-layer case.\n Due to the difficulty of analyzing multi-layer models, all previous work\nrelies on unproven complexity conjectures to show limitations for multi-layer\nTransformers. In this work, we prove the first $\\textit{unconditional}$ lower\nbound against multi-layer decoder-only transformers. 
For any constant $L$, we\nprove that any $L$-layer decoder-only transformer needs a polynomial model\ndimension ($n^{\\Omega(1)}$) to perform sequential composition of $L$ functions\nover an input of $n$ tokens.\n As a consequence, our results give: (1) the first depth-width trade-off for\nmulti-layer transformers, exhibiting that the $L$-step composition task is\nexponentially harder for $L$-layer models compared to $(L+1)$-layer ones; (2)\nan unconditional separation between encoder and decoder, exhibiting a hard task\nfor decoders that can be solved by an exponentially shallower and smaller\nencoder; (3) a provable advantage of chain-of-thought, exhibiting a task that\nbecomes exponentially easier with chain-of-thought.\n On the technical side, we propose the multi-party $\\textit{autoregressive}$\n$\\textit{communication}$ $\\textit{model}$ that captures the computation of a\ndecoder-only Transformer. We also introduce a new proof technique that finds a\ncertain $\\textit{indistinguishable}$ $\\textit{decomposition}$ of all possible\ninputs iteratively for proving lower bounds in this model. We believe our new\ncommunication model and proof technique will be helpful to further understand\nthe computational power of transformers.\n","authors":["Lijie Chen","Binghui Peng","Hongxun Wu"],"pdf_url":"https://arxiv.org/pdf/2412.02975v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02969v1","updated":"2024-12-04T02:31:31Z","published":"2024-12-04T02:31:31Z","title":"Unified Inductive Logic: From Formal Learning to Statistical Inference\n to Supervised Learning","summary":" While the traditional conception of inductive logic is Carnapian, I develop a\nPeircean alternative and use it to unify formal learning theory, statistics,\nand a significant part of machine learning: supervised learning. 
Some crucial\nstandards for evaluating non-deductive inferences have been assumed separately\nin those areas, but can actually be justified by a unifying principle.\n","authors":["Hanti Lin"],"pdf_url":"https://arxiv.org/pdf/2412.02969v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02968v1","updated":"2024-12-04T02:31:28Z","published":"2024-12-04T02:31:28Z","title":"How Many Ratings per Item are Necessary for Reliable Significance\n Testing?","summary":" Most approaches to machine learning evaluation assume that machine and human\nresponses are repeatable enough to be measured against data with unitary,\nauthoritative, \"gold standard\" responses, via simple metrics such as accuracy,\nprecision, and recall that assume scores are independent given the test item.\nHowever, AI models have multiple sources of stochasticity and the human raters\nwho create gold standards tend to disagree with each other, often in meaningful\nways, hence a single output response per input item may not provide enough\ninformation. We introduce methods for determining whether an (existing or\nplanned) evaluation dataset has enough responses per item to reliably compare\nthe performance of one model to another. We apply our methods to several of\nvery few extant gold standard test sets with multiple disaggregated responses\nper item and show that there are usually not enough responses per item to\nreliably compare the performance of one model against another. Our methods also\nallow us to estimate the number of responses per item for hypothetical datasets\nwith similar response distributions to the existing datasets we study. When two\nmodels are very far apart in their predictive performance, fewer raters are\nneeded to confidently compare them, as expected. 
However, as the models draw\ncloser, we find that a larger number of raters than are currently typical in\nannotation collection are needed to ensure that the power analysis correctly\nreflects the difference in performance.\n","authors":["Christopher Homan","Flip Korn","Chris Welty"],"pdf_url":"https://arxiv.org/pdf/2412.02968v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02957v1","updated":"2024-12-04T02:05:55Z","published":"2024-12-04T02:05:55Z","title":"3D Interaction Geometric Pre-training for Molecular Relational Learning","summary":" Molecular Relational Learning (MRL) is a rapidly growing field that focuses\non understanding the interaction dynamics between molecules, which is crucial\nfor applications ranging from catalyst engineering to drug discovery. Despite\nrecent progress, earlier MRL approaches are limited to using only the 2D\ntopological structure of molecules, as obtaining the 3D interaction geometry\nremains prohibitively expensive. This paper introduces a novel 3D geometric\npre-training strategy for MRL (3DMRL) that incorporates a 3D virtual\ninteraction environment, overcoming the limitations of costly traditional\nquantum mechanical calculation methods. With the constructed 3D virtual\ninteraction environment, 3DMRL trains 2D MRL model to learn the overall 3D\ngeometric information of molecular interaction through contrastive learning.\nMoreover, fine-grained interaction between molecules is learned through force\nprediction loss, which is crucial in understanding the wide range of molecular\ninteraction processes. Extensive experiments on various tasks using real-world\ndatasets, including out-of-distribution and extrapolation scenarios,\ndemonstrate the effectiveness of 3DMRL, showing up to a 24.93\\% improvement in\nperformance across 40 tasks.\n","authors":["Namkyeong Lee","Yunhak Oh","Heewoong Noh","Gyoung S. 
Na","Minkai Xu","Hanchen Wang","Tianfan Fu","Chanyoung Park"],"pdf_url":"https://arxiv.org/pdf/2412.02957v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.18822v2","updated":"2024-12-04T01:56:07Z","published":"2024-11-27T23:51:53Z","title":"RelCon: Relative Contrastive Learning for a Motion Foundation Model for\n Wearable Data","summary":" We present RelCon, a novel self-supervised *Rel*ative *Con*trastive learning\napproach that uses a learnable distance measure in combination with a softened\ncontrastive loss for training a motion foundation model from wearable sensors.\nThe learnable distance measure captures motif similarity and domain-specific\nsemantic information such as rotation invariance. The learned distance provides\na measurement of semantic similarity between a pair of accelerometer\ntime-series segments, which is used to measure the distance between an anchor\nand various other sampled candidate segments. The self-supervised model is\ntrained on 1 billion segments from 87,376 participants from a large wearables\ndataset. The model achieves strong performance across multiple downstream\ntasks, encompassing both classification and regression. To our knowledge, we\nare the first to show the generalizability of a self-supervised learning model\nwith motion data from wearables across distinct evaluation tasks.\n","authors":["Maxwell A. Xu","Jaya Narain","Gregory Darnell","Haraldur Hallgrimsson","Hyewon Jeong","Darren Forde","Richard Fineman","Karthik J. Raghuram","James M. Rehg","Shirley Ren"],"pdf_url":"https://arxiv.org/pdf/2411.18822v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.06220v2","updated":"2024-12-04T01:47:08Z","published":"2024-04-09T11:14:45Z","title":"Zero-Shot Relational Learning for Multimodal Knowledge Graphs","summary":" Relational learning is an essential task in the domain of knowledge\nrepresentation, particularly in knowledge graph completion (KGC). 
While\nrelational learning in traditional single-modal settings has been extensively\nstudied, exploring it within a multimodal KGC context presents distinct\nchallenges and opportunities. One of the major challenges is inference on newly\ndiscovered relations without any associated training data. This zero-shot\nrelational learning scenario poses unique requirements for multimodal KGC,\ni.e., utilizing multimodality to facilitate relational learning. However,\nexisting works fail to leverage multimodal information and leave\nthe problem unexplored. In this paper, we propose a novel end-to-end framework,\nconsisting of three components, i.e., multimodal learner, structure\nconsolidator, and relation embedding generator, to integrate diverse multimodal\ninformation and knowledge graph structures to facilitate zero-shot\nrelational learning. Evaluation results on three multimodal knowledge graphs\ndemonstrate the superior performance of our proposed method.\n","authors":["Rui Cai","Shichao Pei","Xiangliang Zhang"],"pdf_url":"https://arxiv.org/pdf/2404.06220v2.pdf","comment":"In the Proceedings of the 2024 IEEE International Conference on Big\n Data (IEEE BigData 2024)"},{"id":"http://arxiv.org/abs/2412.02951v1","updated":"2024-12-04T01:40:54Z","published":"2024-12-04T01:40:54Z","title":"Incorporating System-level Safety Requirements in Perception Models via\n Reinforcement Learning","summary":" Perception components in autonomous systems are often developed and optimized\nindependently of downstream decision-making and control components, relying on\nestablished performance metrics like accuracy, precision, and recall.\nTraditional loss functions, such as cross-entropy loss and negative\nlog-likelihood, focus on reducing misclassification errors but fail to consider\ntheir impact on system-level safety, overlooking the varying severities of\nsystem-level failures caused by these errors. 
To address this limitation, we\npropose a novel training paradigm that augments the perception component with\nan understanding of system-level safety objectives. Central to our approach is\nthe translation of system-level safety requirements, formally specified using\nthe rulebook formalism, into safety scores. These scores are then incorporated\ninto the reward function of a reinforcement learning framework for fine-tuning\nperception models with system-level safety objectives. Simulation results\ndemonstrate that models trained with this approach outperform baseline\nperception models in terms of system-level safety.\n","authors":["Weisi Fan","Jesse Lane","Qisai Liu","Soumik Sarkar","Tichakorn Wongpiromsarn"],"pdf_url":"https://arxiv.org/pdf/2412.02951v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02946v1","updated":"2024-12-04T01:23:57Z","published":"2024-12-04T01:23:57Z","title":"Who Brings the Frisbee: Probing Hidden Hallucination Factors in Large\n Vision-Language Model via Causality Analysis","summary":" Recent advancements in large vision-language models (LVLM) have significantly\nenhanced their ability to comprehend visual inputs alongside natural language.\nHowever, a major challenge in their real-world application is hallucination,\nwhere LVLMs generate non-existent visual elements, eroding user trust. The\nunderlying mechanism driving this multimodal hallucination is poorly\nunderstood. Minimal research has illuminated whether contexts such as sky,\ntree, or grass field involve the LVLM in hallucinating a frisbee. We\nhypothesize that hidden factors, such as objects, contexts, and semantic\nforeground-background structures, induce hallucination. This study proposes a\nnovel causal approach: a hallucination probing system to identify these hidden\nfactors. By analyzing the causality between images, text prompts, and network\nsaliency, we systematically explore interventions to block these factors. 
Our\nexperimental findings show that a straightforward technique based on our\nanalysis can significantly reduce hallucinations. Additionally, our analyses\nindicate the potential to edit network internals to minimize hallucinated\noutputs.\n","authors":["Po-Hsuan Huang","Jeng-Lin Li","Chin-Po Chen","Ming-Ching Chang","Wei-Chao Chen"],"pdf_url":"https://arxiv.org/pdf/2412.02946v1.pdf","comment":"Accepted by WACV2025"},{"id":"http://arxiv.org/abs/2408.12841v2","updated":"2024-12-04T01:20:16Z","published":"2024-08-23T05:15:24Z","title":"COVID-19 Probability Prediction Using Machine Learning: An Infectious\n Approach","summary":" The ongoing COVID-19 pandemic continues to pose significant challenges to\nglobal public health, despite the widespread availability of vaccines. Early\ndetection of the disease remains paramount in curbing its transmission and\nmitigating its impact on public health systems. In response, this study delves\ninto the application of advanced machine learning (ML) techniques for\npredicting COVID-19 infection probability. We conducted a rigorous\ninvestigation into the efficacy of various ML models, including XGBoost, LGBM,\nAdaBoost, Logistic Regression, Decision Tree, RandomForest, CatBoost, KNN, and\nDeep Neural Networks (DNN). Leveraging a dataset comprising 4000 samples, with\n3200 allocated for training and 800 for testing, our experiment offers\ncomprehensive insights into the performance of these models in COVID-19\nprediction. Our findings reveal that Deep Neural Networks (DNN) emerge as the\ntop-performing model, exhibiting superior accuracy and recall metrics. With an\nimpressive accuracy rate of 89%, DNN demonstrates remarkable potential in early\nCOVID-19 detection. 
This underscores the efficacy of deep learning approaches\nin leveraging complex data patterns to identify COVID-19 infections accurately.\nThis study underscores the critical role of machine learning, particularly deep\nlearning methodologies, in augmenting early detection efforts amidst the\nongoing pandemic. The success of DNN in accurately predicting COVID-19\ninfection probability highlights the importance of continued research and\ndevelopment in leveraging advanced technologies to combat infectious diseases.\n","authors":["Mohsen Asghari Ilani","Saba Moftakhar Tehran","Ashkan Kavei","Arian Radmehr"],"pdf_url":"https://arxiv.org/pdf/2408.12841v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02940v1","updated":"2024-12-04T01:13:44Z","published":"2024-12-04T01:13:44Z","title":"SAVER: A Toolbox for Sampling-Based, Probabilistic Verification of\n Neural Networks","summary":" We present a neural network verification toolbox to 1) assess the probability\nof satisfaction of a constraint, and 2) synthesize a set expansion factor to\nachieve the probability of satisfaction. Specifically, the toolbox establishes\nwith a user-specified level of confidence whether the output of the neural\nnetwork for a given input distribution is likely to be contained within a given\nset. Should the tool determine that the given set cannot satisfy the likelihood\nconstraint, the tool also implements an approach outlined in this paper to\nalter the constraint set to ensure that the user-defined satisfaction\nprobability is achieved. The toolbox is comprised of sampling-based approaches\nwhich exploit the properties of the signed distance function to define set\ncontainment.\n","authors":["Vignesh Sivaramakrishnan","Krishna C. 
Kalagarla","Rosalyn Devonport","Joshua Pilipovsky","Panagiotis Tsiotras","Meeko Oishi"],"pdf_url":"https://arxiv.org/pdf/2412.02940v1.pdf","comment":"7 pages, 8 figures, submitted to the 28th ACM International\n Conference on Hybrid Systems: Computation and Control"},{"id":"http://arxiv.org/abs/2412.02934v1","updated":"2024-12-04T01:07:04Z","published":"2024-12-04T01:07:04Z","title":"BGTplanner: Maximizing Training Accuracy for Differentially Private\n Federated Recommenders via Strategic Privacy Budget Allocation","summary":" To mitigate the rising concern about privacy leakage, the federated\nrecommender (FR) paradigm emerges, in which decentralized clients co-train the\nrecommendation model without exposing their raw user-item rating data. The\ndifferentially private federated recommender (DPFR) further enhances FR by\ninjecting differentially private (DP) noises into clients. Yet, current DPFRs,\nsuffering from noise distortion, cannot achieve satisfactory accuracy. Various\nefforts have been dedicated to improving DPFRs by adaptively allocating the\nprivacy budget over the learning process. However, due to the intricate\nrelation between privacy budget allocation and model accuracy, existing works\nare still far from maximizing DPFR accuracy. To address this challenge, we\ndevelop BGTplanner (Budget Planner) to strategically allocate the privacy\nbudget for each round of DPFR training, improving overall training performance.\nSpecifically, we leverage the Gaussian process regression and historical\ninformation to predict the change in recommendation accuracy with a certain\nallocated privacy budget. Additionally, Contextual Multi-Armed Bandit (CMAB) is\nharnessed to make privacy budget allocation decisions by reconciling the\ncurrent improvement and long-term privacy constraints. 
Our extensive\nexperimental results on real datasets demonstrate that \\emph{BGTplanner}\nachieves an average improvement of 6.76\\% in training performance compared to\nstate-of-the-art baselines.\n","authors":["Xianzhi Zhang","Yipeng Zhou","Miao Hu","Di Wu","Pengshan Liao","Mohsen Guizani","Michael Sheng"],"pdf_url":"https://arxiv.org/pdf/2412.02934v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02931v1","updated":"2024-12-04T00:53:55Z","published":"2024-12-04T00:53:55Z","title":"Inverse Delayed Reinforcement Learning","summary":" Inverse Reinforcement Learning (IRL) has demonstrated effectiveness in a\nvariety of imitation tasks. In this paper, we introduce an IRL framework\ndesigned to extract rewarding features from expert trajectories affected by\ndelayed disturbances. Instead of relying on direct observations, our approach\nemploys an efficient off-policy adversarial training framework to derive expert\nfeatures and recover optimal policies from augmented delayed observations.\nEmpirical evaluations in the MuJoCo environment under diverse delay settings\nvalidate the effectiveness of our method. Furthermore, we provide a theoretical\nanalysis showing that recovering expert policies from augmented delayed\nobservations outperforms using direct delayed observations.\n","authors":["Simon Sinong Zhan","Qingyuan Wu","Zhian Ruan","Frank Yang","Philip Wang","Yixuan Wang","Ruochen Jiao","Chao Huang","Qi Zhu"],"pdf_url":"https://arxiv.org/pdf/2412.02931v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.15367v2","updated":"2024-12-04T00:42:56Z","published":"2024-04-19T13:24:09Z","title":"Leveraging Visibility Graphs for Enhanced Arrhythmia Classification with\n Graph Convolutional Networks","summary":" Arrhythmias, detectable through electrocardiograms (ECGs), pose significant\nhealth risks, underscoring the need for accurate and efficient automated\ndetection techniques. 
While recent advancements in graph-based methods have\ndemonstrated potential to enhance arrhythmia classification, the challenge lies\nin effectively representing ECG signals as graphs. This study investigates the\nuse of Visibility Graph (VG) and Vector Visibility Graph (VVG) representations\ncombined with Graph Convolutional Networks (GCNs) for arrhythmia classification\nunder the ANSI/AAMI standard, ensuring reproducibility and fair comparison with\nother techniques. Through extensive experiments on the MIT-BIH dataset, we\nevaluate various GCN architectures and preprocessing parameters. Our findings\ndemonstrate that VG and VVG mappings enable GCNs to classify arrhythmias\ndirectly from raw ECG signals, without the need for preprocessing or noise\nremoval. Notably, VG offers superior computational efficiency, while VVG\ndelivers enhanced classification performance by leveraging additional lead\nfeatures. The proposed approach outperforms baseline methods in several\nmetrics, although challenges persist in classifying the supraventricular\nectopic beat (S) class, particularly under the inter-patient paradigm.\n","authors":["Rafael F. Oliveira","Gladston J. P. Moreira","Vander L. S. Freitas","Eduardo J. S. Luz"],"pdf_url":"https://arxiv.org/pdf/2404.15367v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02924v1","updated":"2024-12-04T00:27:54Z","published":"2024-12-04T00:27:54Z","title":"Harnessing Loss Decomposition for Long-Horizon Wave Predictions via Deep\n Neural Networks","summary":" Accurate prediction over long time horizons is crucial for modeling complex\nphysical processes such as wave propagation. Although deep neural networks show\npromise for real-time forecasting, they often struggle with accumulating phase\nand amplitude errors as predictions extend over a long period. To address this\nissue, we propose a novel loss decomposition strategy that breaks down the loss\ninto separate phase and amplitude components. 
This technique improves the\nlong-term prediction accuracy of neural networks in wave propagation tasks by\nexplicitly accounting for numerical errors, improving stability, and reducing\nerror accumulation over extended forecasts.\n","authors":["Indu Kant Deo","Rajeev Jaiman"],"pdf_url":"https://arxiv.org/pdf/2412.02924v1.pdf","comment":"6 pages, 4 figures, NeurIPS Machine Learning for Physical Sciences\n workshop"},{"id":"http://arxiv.org/abs/2403.09548v2","updated":"2024-12-04T00:26:21Z","published":"2024-03-14T16:35:43Z","title":"Breast Cancer Classification Using Gradient Boosting Algorithms Focusing\n on Reducing the False Negative and SHAP for Explainability","summary":" Cancer is one of the diseases that kill the most women in the world, with\nbreast cancer being responsible for the highest number of cancer cases and\nconsequently deaths. However, it can be prevented by early detection and,\nconsequently, early treatment. Any development for detection or prediction of this\nkind of cancer is important for a better healthy life. Many studies focus on a\nmodel with high accuracy in cancer prediction, but sometimes accuracy alone may\nnot always be a reliable metric. This study takes an investigative approach\nto studying the performance of different machine learning algorithms based on\nboosting to predict breast cancer focusing on the recall metric. Boosting\nmachine learning algorithms have been proven to be an effective tool for\ndetecting medical diseases. The dataset of the University of California, Irvine\n(UCI) repository has been utilized to train and test the model classifier that\ncontains their attributes. The main objective of this study is to use\nstate-of-the-art boosting algorithms such as AdaBoost, XGBoost, CatBoost and\nLightGBM to predict and diagnose breast cancer and to find the most effective\nmetric regarding recall, ROC-AUC, and confusion matrix. 
Furthermore, our study\nis the first to use these four boosting algorithms with Optuna, a library for\nhyperparameter optimization, and the SHAP method to improve the\ninterpretability of our model, which can be used as a support to identify and\npredict breast cancer. We were able to improve AUC or recall for all the models\nand reduce the False Negative for AdaBoost and LightGBM; the final AUC was more\nthan 99.41\\% for all models.\n","authors":["João Manoel Herrera Pinheiro","Marcelo Becker"],"pdf_url":"https://arxiv.org/pdf/2403.09548v2.pdf","comment":"9 pages, 16 figures"},{"id":"http://arxiv.org/abs/2402.17363v4","updated":"2024-12-04T00:11:36Z","published":"2024-02-27T09:55:34Z","title":"CGGM: A conditional graph generation model with adaptive sparsity for\n node anomaly detection in IoT networks","summary":" Dynamic graphs are extensively employed for detecting anomalous behavior in\nnodes within the Internet of Things (IoT). Graph generative models are often\nused to address the issue of imbalanced node categories in dynamic graphs.\nNevertheless, the constraints it faces include the monotonicity of adjacency\nrelationships, the difficulty in constructing multi-dimensional features for\nnodes, and the lack of a method for end-to-end generation of multiple\ncategories of nodes. In this paper, we propose a novel graph generation model,\ncalled CGGM, specifically for generating samples belonging to the minority\nclass. The framework consists of two core modules: a conditional graph generation\nmodule and a graph-based anomaly detection module. The generative module adapts\nto the sparsity of the matrix by downsampling a noise adjacency matrix, and\nincorporates a multi-dimensional feature encoder based on multi-head\nself-attention to capture latent dependencies among features. Additionally, a\nlatent space constraint is combined with the distribution distance to\napproximate the latent distribution of real data. 
The graph-based anomaly\ndetection module utilizes the generated balanced dataset to predict the node\nbehaviors. Extensive experiments have shown that CGGM outperforms the\nstate-of-the-art methods in terms of accuracy and divergence. The results also\ndemonstrate that CGGM can generate diverse data categories, thereby enhancing the\nperformance of multi-category classification tasks.\n","authors":["Munan Li","Xianshi Su","Runze Ma","Tongbang Jiang","Zijian Li","Tony Q. S. Quek"],"pdf_url":"https://arxiv.org/pdf/2402.17363v4.pdf","comment":"10 pages, 19 figures"},{"id":"http://arxiv.org/abs/2412.02919v1","updated":"2024-12-04T00:10:47Z","published":"2024-12-04T00:10:47Z","title":"Higher Order Transformers: Efficient Attention Mechanism for Tensor\n Structured Data","summary":" Transformers are now ubiquitous for sequence modeling tasks, but their\nextension to multi-dimensional data remains a challenge due to the quadratic\ncost of the attention mechanism. In this paper, we propose Higher-Order\nTransformers (HOT), a novel architecture designed to efficiently process data\nwith more than two axes, i.e. higher-order tensors. To address the\ncomputational challenges associated with high-order tensor attention, we\nintroduce a novel Kronecker factorized attention mechanism that reduces the\nattention cost to quadratic in each axis' dimension, rather than quadratic in\nthe total size of the input tensor. To further enhance efficiency, HOT\nleverages kernelized attention, reducing the complexity to linear. This\nstrategy maintains the model's expressiveness while enabling scalable attention\ncomputation. We validate the effectiveness of HOT on two high-dimensional\ntasks, including multivariate time series forecasting, and 3D medical image\nclassification. 
Experimental results demonstrate that HOT achieves competitive\nperformance while significantly improving computational efficiency, showcasing\nits potential for tackling a wide range of complex, multi-dimensional data.\n","authors":["Soroush Omranpour","Guillaume Rabusseau","Reihaneh Rabbany"],"pdf_url":"https://arxiv.org/pdf/2412.02919v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.12914v2","updated":"2024-12-04T00:03:38Z","published":"2024-09-19T17:10:34Z","title":"Mitigating Unsafe Feedback with Learning Constraints","summary":" While there has been progress towards aligning Large Language Models (LLMs)\nwith human values and ensuring safe behaviour at inference time, safety-guards\ncan easily be removed when fine-tuned on unsafe and harmful datasets. While this\nsetting has been treated extensively, another popular training paradigm,\nlearning from unsafe feedback with reinforcement learning, has previously been\nunexplored. This is concerning due to the widespread deployment of feedback\ncollection systems. We address this gap by providing an analysis of learning\nsettings where feedback is adversarial and noisy, i.e. that unsafe samples are\npreferred over safe ones despite model developers' goal to maintain safety. We\nfind that safety-aligned LLMs easily explore unsafe action spaces through\ngenerating harmful text and optimize for adversarial reward indicating that\ncurrent safety guards are not enough to prevent learning from unsafe feedback.\nIn order to protect against this vulnerability, we adapt a number of both\n\"implicit\" and \"explicit\" harmful fine-tuning defences to evaluate whether they\nare effective as learning constraints in an RL setting finding that no method\nis generally effective pointing to the need for more research in defences given\nthe widespread adoption of methods designed to learn from feedback. 
We end the\npaper with the observation that some defences work by performing \"harmless\nreward hacking\" for which we provide a theoretical explanation drawn from the\ntheory of Constrained Markov Decision Processes and provide some direction for\nfuture defence development.\n","authors":["Domenic Rosati","Giles Edkins","Harsh Raj","David Atanasov","Subhabrata Majumdar","Janarthanan Rajendran","Frank Rudzicz","Hassan Sajjad"],"pdf_url":"https://arxiv.org/pdf/2409.12914v2.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2412.03551v1","updated":"2024-12-04T18:49:26Z","published":"2024-12-04T18:49:26Z","title":"SPICE: Smart Projection Interface for Cooking Enhancement","summary":" Tangible User Interfaces (TUI) for human--computer interaction (HCI) provide\nthe user with physical representations of digital information with the aim to\novercome the limitations of screen-based interfaces. Although many compelling\ndemonstrations of TUIs exist in the literature, there is a lack of research on\nTUIs intended for daily two-handed tasks and processes, such as cooking. In\nresponse to this gap, we propose SPICE (Smart Projection Interface for Cooking\nEnhancement). SPICE investigates TUIs in a kitchen setting, aiming to transform\nthe recipe following experience from simply text-based to tangibly interactive.\nSPICE includes a tracking system, an agent-based software, and vision large\nlanguage models to create and interpret a kitchen environment where recipe\ninformation is projected directly onto the cooking surface. We conducted a\ncomparative usability study of SPICE and text-based recipe following with 30\nparticipants, assessing the task difficulty, total duration, and efficiency, as\nwell as user confidence and taste perception. The results indicate that SPICE\nallowed participants to perform the recipe with fewer stops and in a shorter time\nwhile also improving self-reported efficiency, confidence, and taste. 
Despite\nthis, participants self-reported no change in overall difficulty, which is a\ndirection for future research. Overall, the SPICE project demonstrates the\npotential of using TUIs to improve everyday activities, paving the way for\nfuture research in HCI and new computing interfaces.\n","authors":["Vera Prohaska","Eduardo Castelló Ferrer"],"pdf_url":"https://arxiv.org/pdf/2412.03551v1.pdf","comment":"Article submitted to IUI 2025"},{"id":"http://arxiv.org/abs/2412.01064v2","updated":"2024-12-04T09:43:18Z","published":"2024-12-02T02:50:07Z","title":"FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking\n Portrait","summary":" With the rapid advancement of diffusion-based generative models, portrait\nimage animation has achieved remarkable results. However, it still faces\nchallenges in temporally consistent video generation and fast sampling due to\nits iterative sampling nature. This paper presents FLOAT, an audio-driven\ntalking portrait video generation method based on flow matching generative\nmodel. We shift the generative modeling from the pixel-based latent space to a\nlearned motion latent space, enabling efficient design of temporally consistent\nmotion. To achieve this, we introduce a transformer-based vector field\npredictor with a simple yet effective frame-wise conditioning mechanism.\nAdditionally, our method supports speech-driven emotion enhancement, enabling a\nnatural incorporation of expressive motions. 
Extensive experiments demonstrate\nthat our method outperforms state-of-the-art audio-driven talking portrait\nmethods in terms of visual quality, motion fidelity, and efficiency.\n","authors":["Taekyung Ki","Dongchan Min","Gyeongsu Chae"],"pdf_url":"https://arxiv.org/pdf/2412.01064v2.pdf","comment":"Project page: https://deepbrainai-research.github.io/float/"},{"id":"http://arxiv.org/abs/2406.00758v3","updated":"2024-12-04T09:36:56Z","published":"2024-06-02T14:22:09Z","title":"Once-for-All: Controllable Generative Image Compression with Dynamic\n Granularity Adaption","summary":" Although recent generative image compression methods have demonstrated\nimpressive potential in optimizing the rate-distortion-perception trade-off,\nthey still face the critical challenge of flexible rate adaption to diverse\ncompression necessities and scenarios. To overcome this challenge, this paper\nproposes a Controllable Generative Image Compression framework, termed\nControl-GIC, the first capable of fine-grained bitrate adaption across a broad\nspectrum while ensuring high-fidelity and generality compression. Control-GIC\nis grounded in a VQGAN framework that encodes an image as a sequence of\nvariable-length codes (i.e. VQ-indices), which can be losslessly compressed and\nexhibits a direct positive correlation with the bitrates. Drawing inspiration\nfrom the classical coding principle, we correlate the information density of\nlocal image patches with their granular representations. Hence, we can flexibly\ndetermine a proper allocation of granularity for the patches to achieve dynamic\nadjustment for VQ-indices, resulting in desirable compression rates. 
We further\ndevelop a probabilistic conditional decoder capable of retrieving historic\nencoded multi-granularity representations according to transmitted codes, and\nthen reconstruct hierarchical granular features in the formalization of\nconditional probability, enabling more informative aggregation to improve\nreconstruction realism. Our experiments show that Control-GIC allows highly\nflexible and controllable bitrate adaption where the results demonstrate its\nsuperior performance over recent state-of-the-art methods.\n","authors":["Anqi Li","Feng Li","Yuxi Liu","Runmin Cong","Yao Zhao","Huihui Bai"],"pdf_url":"https://arxiv.org/pdf/2406.00758v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.06220v2","updated":"2024-12-04T01:47:08Z","published":"2024-04-09T11:14:45Z","title":"Zero-Shot Relational Learning for Multimodal Knowledge Graphs","summary":" Relational learning is an essential task in the domain of knowledge\nrepresentation, particularly in knowledge graph completion (KGC). While\nrelational learning in traditional single-modal settings has been extensively\nstudied, exploring it within a multimodal KGC context presents distinct\nchallenges and opportunities. One of the major challenges is inference on newly\ndiscovered relations without any associated training data. This zero-shot\nrelational learning scenario poses unique requirements for multimodal KGC,\ni.e., utilizing multimodality to facilitate relational learning.However,\nexisting works fail to support the leverage of multimodal information and leave\nthe problem unexplored. In this paper, we propose a novel end-to-end framework,\nconsisting of three components, i.e., multimodal learner, structure\nconsolidator, and relation embedding generator, to integrate diverse multimodal\ninformation and knowledge graph structures to facilitate the zero-shot\nrelational learning. 
Evaluation results on three multimodal knowledge graphs\ndemonstrate the superior performance of our proposed method.\n","authors":["Rui Cai","Shichao Pei","Xiangliang Zhang"],"pdf_url":"https://arxiv.org/pdf/2404.06220v2.pdf","comment":"In the Proceedings of the 2024 IEEE International Conference on Big\n Data (IEEE BigData 2024)"},{"id":"http://arxiv.org/abs/2412.02946v1","updated":"2024-12-04T01:23:57Z","published":"2024-12-04T01:23:57Z","title":"Who Brings the Frisbee: Probing Hidden Hallucination Factors in Large\n Vision-Language Model via Causality Analysis","summary":" Recent advancements in large vision-language models (LVLM) have significantly\nenhanced their ability to comprehend visual inputs alongside natural language.\nHowever, a major challenge in their real-world application is hallucination,\nwhere LVLMs generate non-existent visual elements, eroding user trust. The\nunderlying mechanism driving this multimodal hallucination is poorly\nunderstood. Minimal research has illuminated whether contexts such as sky,\ntree, or grass field involve the LVLM in hallucinating a frisbee. We\nhypothesize that hidden factors, such as objects, contexts, and semantic\nforeground-background structures, induce hallucination. This study proposes a\nnovel causal approach: a hallucination probing system to identify these hidden\nfactors. By analyzing the causality between images, text prompts, and network\nsaliency, we systematically explore interventions to block these factors. Our\nexperimental findings show that a straightforward technique based on our\nanalysis can significantly reduce hallucinations. 
Additionally, our analyses\nindicate the potential to edit network internals to minimize hallucinated\noutputs.\n","authors":["Po-Hsuan Huang","Jeng-Lin Li","Chin-Po Chen","Ming-Ching Chang","Wei-Chao Chen"],"pdf_url":"https://arxiv.org/pdf/2412.02946v1.pdf","comment":"Accepted by WACV2025"},{"id":"http://arxiv.org/abs/2411.08307v2","updated":"2024-12-04T22:02:25Z","published":"2024-11-13T03:14:10Z","title":"PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for\n Long-Term Expressive Symbolic Music Generation","summary":" AI-based music generation has progressed significantly in recent years.\nHowever, creating symbolic music that is both long-structured and expressive\nremains a considerable challenge. In this paper, we propose PerceiverS\n(Segmentation and Scale), a novel architecture designed to address this issue\nby leveraging both Effective Segmentation and Multi-Scale attention mechanisms.\nOur approach enhances symbolic music generation by simultaneously learning\nlong-term structural dependencies and short-term expressive details. By\ncombining cross-attention and self-attention in a Multi-Scale setting,\nPerceiverS captures long-range musical structure while preserving musical\ndiversity. The proposed model has been evaluated using the Maestro dataset and\nhas demonstrated improvements in generating music of conventional length with\nexpressive nuances. The project demos and the generated music samples can be\naccessed through the link: https://perceivers.github.io\n","authors":["Yungang Yi","Weihua Li","Matthew Kuo","Quan Bai"],"pdf_url":"https://arxiv.org/pdf/2411.08307v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03665v1","updated":"2024-12-04T19:01:06Z","published":"2024-12-04T19:01:06Z","title":"Personalizing Multimodal Large Language Models for Image Captioning: An\n Experimental Analysis","summary":" The task of image captioning demands an algorithm to generate natural\nlanguage descriptions of visual inputs. 
Recent advancements have seen a\nconvergence between image captioning research and the development of Large\nLanguage Models (LLMs) and Multimodal LLMs -- like GPT-4V and Gemini -- which\nextend the capabilities of text-only LLMs to multiple modalities. This paper\ninvestigates whether Multimodal LLMs can supplant traditional image captioning\nnetworks by evaluating their performance on various image description\nbenchmarks. We explore both the zero-shot capabilities of these models and\ntheir adaptability to different semantic domains through fine-tuning methods,\nincluding prompt learning, prefix tuning, and low-rank adaptation. Our results\ndemonstrate that while Multimodal LLMs achieve impressive zero-shot\nperformance, fine-tuning for specific domains while maintaining their\ngeneralization capabilities intact remains challenging. We discuss the\nimplications of these findings for future research in image captioning and the\ndevelopment of more adaptable Multimodal LLMs.\n","authors":["Davide Bucciarelli","Nicholas Moratelli","Marcella Cornia","Lorenzo Baraldi","Rita Cucchiara"],"pdf_url":"https://arxiv.org/pdf/2412.03665v1.pdf","comment":"ECCV 2024 Workshop on Green Foundation Models"}]},"2024-12-05T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2412.04467v1","updated":"2024-12-05T18:59:53Z","published":"2024-12-05T18:59:53Z","title":"VisionZip: Longer is Better but Not Necessary in Vision Language Models","summary":" Recent advancements in vision-language models have enhanced performance by\nincreasing the length of visual tokens, making them much longer than text\ntokens and significantly raising computational costs. However, we observe that\nthe visual tokens generated by popular vision encoders, such as CLIP and\nSigLIP, contain significant redundancy. 
To address this, we introduce\nVisionZip, a simple yet effective method that selects a set of informative\ntokens for input to the language model, reducing visual token redundancy and\nimproving efficiency while maintaining model performance. The proposed\nVisionZip can be widely applied to image and video understanding tasks and is\nwell-suited for multi-turn dialogues in real-world scenarios, where previous\nmethods tend to underperform. Experimental results show that VisionZip\noutperforms the previous state-of-the-art method by at least 5% performance\ngains across nearly all settings. Moreover, our method significantly enhances\nmodel inference speed, improving the prefilling time by 8x and enabling the\nLLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while\nachieving better results. Furthermore, we analyze the causes of this redundancy\nand encourage the community to focus on extracting better visual features\nrather than merely increasing token length. Our code is available at\nhttps://github.com/dvlab-research/VisionZip .\n","authors":["Senqiao Yang","Yukang Chen","Zhuotao Tian","Chengyao Wang","Jingyao Li","Bei Yu","Jiaya Jia"],"pdf_url":"https://arxiv.org/pdf/2412.04467v1.pdf","comment":"2 columns, 28 pages, 15 figures, 18 tables"},{"id":"http://arxiv.org/abs/2412.04454v1","updated":"2024-12-05T18:58:26Z","published":"2024-12-05T18:58:26Z","title":"Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction","summary":" Graphical User Interfaces (GUIs) are critical to human-computer interaction,\nyet automating GUI tasks remains challenging due to the complexity and\nvariability of visual environments. Existing approaches often rely on textual\nrepresentations of GUIs, which introduce limitations in generalization,\nefficiency, and scalability. In this paper, we introduce Aguvis, a unified pure\nvision-based framework for autonomous GUI agents that operates across various\nplatforms. 
Our approach leverages image-based observations, and grounding\ninstructions in natural language to visual elements, and employs a consistent\naction space to ensure cross-platform generalization. To address the\nlimitations of previous work, we integrate explicit planning and reasoning\nwithin the model, enhancing its ability to autonomously navigate and interact\nwith complex digital environments. We construct a large-scale dataset of GUI\nagent trajectories, incorporating multimodal reasoning and grounding, and\nemploy a two-stage training pipeline that first focuses on general GUI\ngrounding, followed by planning and reasoning. Through comprehensive\nexperiments, we demonstrate that Aguvis surpasses previous state-of-the-art\nmethods in both offline and real-world online scenarios, achieving, to our\nknowledge, the first fully autonomous pure vision GUI agent capable of\nperforming tasks independently without collaboration with external\nclosed-source models. We open-sourced all datasets, models, and training\nrecipes to facilitate future research at https://aguvis-project.github.io/.\n","authors":["Yiheng Xu","Zekun Wang","Junli Wang","Dunjie Lu","Tianbao Xie","Amrita Saha","Doyen Sahoo","Tao Yu","Caiming Xiong"],"pdf_url":"https://arxiv.org/pdf/2412.04454v1.pdf","comment":"https://aguvis-project.github.io/"},{"id":"http://arxiv.org/abs/2412.04449v1","updated":"2024-12-05T18:58:03Z","published":"2024-12-05T18:58:03Z","title":"p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay","summary":" Despite the remarkable performance of multimodal large language models\n(MLLMs) across diverse tasks, the substantial training and inference costs\nimpede their advancement. The majority of computation stems from the\noverwhelming volume of vision tokens processed by the transformer decoder. 
In\nthis paper, we propose to build efficient MLLMs by leveraging the\nMixture-of-Depths (MoD) mechanism, where each transformer decoder layer selects\nessential vision tokens to process while skipping redundant ones. However,\nintegrating MoD into MLLMs is non-trivial. To address the challenges of\ntraining and inference stability as well as limited training data, we adapt the\nMoD module with two novel designs: tanh-gated weight normalization (TanhNorm)\nand symmetric token reweighting (STRing). Moreover, we observe that vision\ntokens exhibit higher redundancy in deeper layer and thus design a progressive\nratio decay (PRD) strategy, which gradually reduces the token retention ratio\nlayer by layer, employing a shifted cosine schedule. This crucial design fully\nunleashes the potential of MoD, significantly boosting the efficiency and\nperformance of our models. To validate the effectiveness of our approach, we\nconduct extensive experiments with two baseline models across 14 benchmarks.\nOur model, p-MoD, matches or even surpasses the performance of the baseline\nmodels, with only 55.6% TFLOPs and 53.8% KV cache storage during inference, and\n77.7% GPU hours during training.\n","authors":["Jun Zhang","Desen Meng","Ji Qi","Zhenpeng Huang","Tao Wu","Limin Wang"],"pdf_url":"https://arxiv.org/pdf/2412.04449v1.pdf","comment":"Technical Report; Code released at https://github.com/MCG-NJU/p-MoD"},{"id":"http://arxiv.org/abs/2411.04986v2","updated":"2024-12-05T18:57:50Z","published":"2024-11-07T18:55:09Z","title":"The Semantic Hub Hypothesis: Language Models Share Semantic\n Representations Across Languages and Modalities","summary":" Modern language models can process inputs across diverse languages and\nmodalities. 
We hypothesize that models acquire this capability through learning\na shared representation space across heterogeneous data types (e.g., different\nlanguages and modalities), which places semantically similar inputs near one\nanother, even if they are from different modalities/languages. We term this the\nsemantic hub hypothesis, following the hub-and-spoke model from neuroscience\n(Patterson et al., 2007) which posits that semantic knowledge in the human\nbrain is organized through a transmodal semantic \"hub\" which integrates\ninformation from various modality-specific \"spokes\" regions. We first show that\nmodel representations for semantically equivalent inputs in different languages\nare similar in the intermediate layers, and that this space can be interpreted\nusing the model's dominant pretraining language via the logit lens. This\ntendency extends to other data types, including arithmetic expressions, code,\nand visual/audio inputs. Interventions in the shared representation space in\none data type also predictably affect model outputs in other data types,\nsuggesting that this shared representations space is not simply a vestigial\nbyproduct of large-scale training on broad data, but something that is actively\nutilized by the model during input processing.\n","authors":["Zhaofeng Wu","Xinyan Velocity Yu","Dani Yogatama","Jiasen Lu","Yoon Kim"],"pdf_url":"https://arxiv.org/pdf/2411.04986v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04445v1","updated":"2024-12-05T18:57:04Z","published":"2024-12-05T18:57:04Z","title":"Moto: Latent Motion Token as the Bridging Language for Robot\n Manipulation","summary":" Recent developments in Large Language Models pre-trained on extensive corpora\nhave shown significant success in various natural language processing tasks\nwith minimal fine-tuning. This success offers new promise for robotics, which\nhas long been constrained by the high cost of action-labeled data. 
We ask:\ngiven the abundant video data containing interaction-related knowledge\navailable as a rich \"corpus\", can a similar generative pre-training approach be\neffectively applied to enhance robot learning? The key challenge is to identify\nan effective representation for autoregressive pre-training that benefits robot\nmanipulation tasks. Inspired by the way humans learn new skills through\nobserving dynamic environments, we propose that effective robotic learning\nshould emphasize motion-related knowledge, which is closely tied to low-level\nactions and is hardware-agnostic, facilitating the transfer of learned motions\nto actual robot actions. To this end, we introduce Moto, which converts video\ncontent into latent Motion Token sequences by a Latent Motion Tokenizer,\nlearning a bridging \"language\" of motion from videos in an unsupervised manner.\nWe pre-train Moto-GPT through motion token autoregression, enabling it to\ncapture diverse visual motion knowledge. After pre-training, Moto-GPT\ndemonstrates the promising ability to produce semantically interpretable motion\ntokens, predict plausible motion trajectories, and assess trajectory\nrationality through output likelihood. To transfer learned motion priors to\nreal robot actions, we implement a co-fine-tuning strategy that seamlessly\nbridges latent motion token prediction and real robot control. 
Extensive\nexperiments show that the fine-tuned Moto-GPT exhibits superior robustness and\nefficiency on robot manipulation benchmarks, underscoring its effectiveness in\ntransferring knowledge from video data to downstream visual manipulation tasks.\n","authors":["Yi Chen","Yuying Ge","Yizhuo Li","Yixiao Ge","Mingyu Ding","Ying Shan","Xihui Liu"],"pdf_url":"https://arxiv.org/pdf/2412.04445v1.pdf","comment":"Project released at: https://chenyi99.github.io/moto/"},{"id":"http://arxiv.org/abs/2412.04425v1","updated":"2024-12-05T18:51:10Z","published":"2024-12-05T18:51:10Z","title":"CA-SSLR: Condition-Aware Self-Supervised Learning Representation for\n Generalized Speech Processing","summary":" We introduce Condition-Aware Self-Supervised Learning Representation\n(CA-SSLR), a generalist conditioning model broadly applicable to various\nspeech-processing tasks. Compared to standard fine-tuning methods that optimize\nfor downstream models, CA-SSLR integrates language and speaker embeddings from\nearlier layers, making the SSL model aware of the current language and speaker\ncontext. This approach reduces the reliance on input audio features while\npreserving the integrity of the base SSLR. CA-SSLR improves the model's\ncapabilities and demonstrates its generality on unseen tasks with minimal\ntask-specific tuning. Our method employs linear modulation to dynamically\nadjust internal representations, enabling fine-grained adaptability without\nsignificantly altering the original model behavior. Experiments show that\nCA-SSLR reduces the number of trainable parameters, mitigates overfitting, and\nexcels in under-resourced and unseen tasks. 
Specifically, CA-SSLR achieves a\n10% relative reduction in LID errors, a 37% improvement in ASR CER on the\nML-SUPERB benchmark, and a 27% decrease in SV EER on VoxCeleb-1, demonstrating\nits effectiveness.\n","authors":["Yen-Ju Lu","Jing Liu","Thomas Thebaud","Laureano Moro-Velazquez","Ariya Rastrow","Najim Dehak","Jesus Villalba"],"pdf_url":"https://arxiv.org/pdf/2412.04425v1.pdf","comment":"38th Conference on Neural Information Processing Systems (NeurIPS\n 2024)"},{"id":"http://arxiv.org/abs/2403.07384v2","updated":"2024-12-05T18:47:47Z","published":"2024-03-12T07:45:33Z","title":"SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large\n Language Models by Summarizing Training Trajectories of Small Models","summary":" Despite the effectiveness of data selection for large language models (LLMs)\nduring pretraining and instruction fine-tuning phases, improving data\nefficiency in supervised fine-tuning (SFT) for specialized domains poses\nsignificant challenges due to the complexity of fine-tuning data. To bridge\nthis gap, we introduce an effective and scalable data selection method for SFT,\nSmallToLarge (S2L), which leverages training trajectories from small models to\nguide the data selection for larger models. We demonstrate through extensive\nexperiments that S2L significantly improves data efficiency in SFT for\nmathematical problem-solving, reducing the training data to just 11% of the\noriginal MathInstruct dataset (Yue et al., 2023) to match full dataset\nperformance while outperforming state-of-the-art data selection algorithms by\nan average of 4.7% across 6 in- and out-domain evaluation datasets. Remarkably,\nselecting only 50K data for SFT, S2L achieves a 32.7% accuracy on the most\nchallenging MATH (Hendrycks et al., 2021) benchmark, improving Phi-2 (Li et\nal., 2023b) by 16.6%. In clinical text summarization on the MIMIC-III dataset\n(Johnson et al., 2016), S2L again outperforms training on the full dataset\nusing only 50% of the data. 
Notably, S2L can perform data selection using a\nreference model 40x smaller than the target model, proportionally reducing the\ncost of data selection.\n","authors":["Yu Yang","Siddhartha Mishra","Jeffrey N Chiang","Baharan Mirzasoleiman"],"pdf_url":"https://arxiv.org/pdf/2403.07384v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.12924v3","updated":"2024-12-05T18:35:26Z","published":"2024-09-04T03:17:19Z","title":"WaveletGPT: Wavelets Meet Large Language Models","summary":" Large Language Models (LLMs) have ushered in a new wave of artificial\nintelligence advancements impacting every scientific field and discipline. They\nare trained on a simple objective: to predict the next token given the previous\ncontext. We live in a world where most of the data around us, e.g., text,\naudio, and music, has a multi-scale structure associated with it. This paper\ninfuses LLMs with traditional signal processing ideas, namely wavelets, during\npre-training to take advantage of the structure. Without adding \\textbf{any\nextra parameters} to a GPT-style LLM architecture, we achieve the same\npre-training performance almost twice as fast in text, raw audio, and symbolic\nmusic. This is achieved by imposing a structure on intermediate embeddings.\nWhen trained for the same number of training steps, we achieve significant\ngains in performance, which is comparable to pre-training a larger neural\narchitecture. Our architecture allows every next token prediction access to\nintermediate embeddings at different temporal resolutions in every Transformer\ndecoder block. This work will hopefully pave the way for incorporating\nmulti-rate signal processing ideas into traditional LLM pre-training. 
Further,\nwe showcase pushing model performance by improving internal structure instead\nof just going after scale.\n","authors":["Prateek Verma"],"pdf_url":"https://arxiv.org/pdf/2409.12924v3.pdf","comment":"16 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.04403v1","updated":"2024-12-05T18:21:49Z","published":"2024-12-05T18:21:49Z","title":"Establishing Task Scaling Laws via Compute-Efficient Model Ladders","summary":" We develop task scaling laws and model ladders to predict the individual task\nperformance of pretrained language models (LMs) in the overtrained setting.\nStandard power laws for language modeling loss cannot accurately model task\nperformance. Therefore, we leverage a two-step prediction approach: first use\nmodel and data size to predict a task-specific loss, and then use this task\nloss to predict task performance. We train a set of small-scale \"ladder\"\nmodels, collect data points to fit the parameterized functions of the two\nprediction steps, and make predictions for two target models: a 7B model\ntrained to 4T tokens and a 13B model trained to 5T tokens. Training the ladder\nmodels only costs 1% of the compute used for the target models. On four\nmultiple-choice tasks written in ranked classification format, we can predict\nthe accuracy of both target models within 2 points of absolute error. We have\nhigher prediction error on four other tasks (average absolute error 6.9) and\nfind that these are often tasks with higher variance in task metrics. We also\nfind that using less compute to train fewer ladder models tends to deteriorate\npredictions. Finally, we empirically show that our design choices and the\ntwo-step approach lead to superior performance in establishing scaling laws.\n","authors":["Akshita Bhagia","Jiacheng Liu","Alexander Wettig","David Heineman","Oyvind Tafjord","Ananya Harsh Jha","Luca Soldaini","Noah A. 
Smith","Dirk Groeneveld","Pang Wei Koh","Jesse Dodge","Hannaneh Hajishirzi"],"pdf_url":"https://arxiv.org/pdf/2412.04403v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02819v2","updated":"2024-12-05T17:51:20Z","published":"2024-12-03T20:35:57Z","title":"CNNSum: Exploring Long-Conext Summarization with Large Language Models\n in Chinese Novels","summary":" Large Language Models (LLMs) have been well-researched in many long-context\ntasks. However, due to high annotation costs, high-quality long-context summary\ndatasets for training or evaluation are scarce, limiting further research. In\nthis work, we introduce CNNSum, a new multi-scale Chinese long-context novel\nsummarization benchmark, including four subsets, length covering\n16k\\textasciitilde128k, 695 samples in total, the annotations are human-driven.\nWe evaluate commercial and open-source models on CNNSum and conduct a detailed\nanalysis. Based on the observations, we further conduct fine-tuning exploration\nwith short-context summary data. In our study: (1) GPT-4o underperformed, due\nto excessive subjective commentary. (2) Currently, long-context summarization\nmainly relies on memory ability, small LLMs with stable longer context lengths\nare the most cost-effective. Using long data concatenated from short-context\nsummaries makes a significant improvement. (3) Prompt templates may cause a\nlarge performance gap but can be mitigated through fine-tuning. (4) Fine-tuned\nChat or Instruction versions may harm the Base model and further fine-tuning\ncannot bridge performance gap. (5) while models with RoPE base scaling exhibit\nstrong extrapolation potential, their performance may vary significantly when\ncombined with other interpolation methods and need careful selection. (6)\nCNNSum provides more reliable and insightful evaluation results than other\nbenchmarks. 
We release CNNSum to advance research in this field.\n","authors":["Lingxiao Wei","He Yan","Xiangju Lu","Junmin Zhu","Jun Wang","Wei Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.02819v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.02589v2","updated":"2024-12-05T17:41:48Z","published":"2024-11-04T20:29:35Z","title":"Context-Informed Machine Translation of Manga using Multimodal Large\n Language Models","summary":" Due to the significant time and effort required for handcrafting\ntranslations, most manga never leave the domestic Japanese market. Automatic\nmanga translation is a promising potential solution. However, it is a budding\nand underdeveloped field and presents complexities even greater than those\nfound in standard translation due to the need to effectively incorporate visual\nelements into the translation process to resolve ambiguities. In this work, we\ninvestigate to what extent multimodal large language models (LLMs) can provide\neffective manga translation, thereby assisting manga authors and publishers in\nreaching wider audiences. Specifically, we propose a methodology that leverages\nthe vision component of multimodal LLMs to improve translation quality and\nevaluate the impact of translation unit size, context length, and propose a\ntoken efficient approach for manga translation. Moreover, we introduce a new\nevaluation dataset -- the first parallel Japanese-Polish manga translation\ndataset -- as part of a benchmark to be used in future research. Finally, we\ncontribute an open-source software suite, enabling others to benchmark LLMs for\nmanga translation. 
Our findings demonstrate that our proposed methods achieve\nstate-of-the-art results for Japanese-English translation and set a new\nstandard for Japanese-Polish.\n","authors":["Philip Lippmann","Konrad Skublicki","Joshua Tanner","Shonosuke Ishiwatari","Jie Yang"],"pdf_url":"https://arxiv.org/pdf/2411.02589v2.pdf","comment":"COLING 2025"},{"id":"http://arxiv.org/abs/2412.04351v1","updated":"2024-12-05T17:10:19Z","published":"2024-12-05T17:10:19Z","title":"BhashaVerse : Translation Ecosystem for Indian Subcontinent Languages","summary":" This paper focuses on developing translation models and related applications\nfor 36 Indian languages, including Assamese, Awadhi, Bengali, Bhojpuri, Braj,\nBodo, Dogri, English, Konkani, Gondi, Gujarati, Hindi, Hinglish, Ho, Kannada,\nKangri, Kashmiri (Arabic and Devanagari), Khasi, Mizo, Magahi, Maithili,\nMalayalam, Marathi, Manipuri (Bengali and Meitei), Nepali, Oriya, Punjabi,\nSanskrit, Santali, Sinhala, Sindhi (Arabic and Devanagari), Tamil, Tulu,\nTelugu, and Urdu. Achieving this requires parallel and other types of corpora\nfor all 36 * 36 language pairs, addressing challenges like script variations,\nphonetic differences, and syntactic diversity. For instance, languages like\nKashmiri and Sindhi, which use multiple scripts, demand script normalization\nfor alignment, while low-resource languages such as Khasi and Santali require\nsynthetic data augmentation to ensure sufficient coverage and quality.\n To address these challenges, this work proposes strategies for corpus\ncreation by leveraging existing resources, developing parallel datasets,\ngenerating domain-specific corpora, and utilizing synthetic data techniques.\nAdditionally, it evaluates machine translation across various dimensions,\nincluding standard and discourse-level translation, domain-specific\ntranslation, reference-based and reference-free evaluation, error analysis, and\nautomatic post-editing. 
By integrating these elements, the study establishes a\ncomprehensive framework to improve machine translation quality and enable\nbetter cross-lingual communication in India's linguistically diverse ecosystem.\n","authors":["Vandan Mujadia","Dipti Misra Sharma"],"pdf_url":"https://arxiv.org/pdf/2412.04351v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04342v1","updated":"2024-12-05T17:00:32Z","published":"2024-12-05T17:00:32Z","title":"Retrieval-Augmented Machine Translation with Unstructured Knowledge","summary":" Retrieval-augmented generation (RAG) introduces additional information to\nenhance large language models (LLMs). In machine translation (MT), previous\nwork typically retrieves in-context examples from paired MT corpora, or\ndomain-specific knowledge from knowledge graphs, to enhance models' MT ability.\nHowever, a large amount of world knowledge is organized in unstructured\ndocuments, and might not be fully paired across different languages. In this\npaper, we study retrieval-augmented MT using unstructured documents.\nSpecifically, we build RAGtrans, the first benchmark to train and evaluate\nLLMs' retrieval-augmented MT ability. RAGtrans contains 79K MT samples\ncollected via GPT-4o and human translators. Besides, documents from different\nlanguages are also provided to supply the knowledge to these samples. Based on\nRAGtrans, we further propose a multi-task training method to teach LLMs how to\nuse information from multilingual documents during their translation. The\nmethod uses existing multilingual corpora to create auxiliary training\nobjectives without additional labeling requirements. 
Extensive experiments show\nthat the method improves LLMs by 1.58-3.09 BLEU and 1.00-2.03 COMET scores.\n","authors":["Jiaan Wang","Fandong Meng","Yingxue Zhang","Jie Zhou"],"pdf_url":"https://arxiv.org/pdf/2412.04342v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04318v1","updated":"2024-12-05T16:34:20Z","published":"2024-12-05T16:34:20Z","title":"The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for\n Open-Ended Text Generation","summary":" This paper introduces the counter-intuitive generalization results of\noverfitting pre-trained large language models (LLMs) on very small datasets. In\nthe setting of open-ended text generation, it is well-documented that LLMs tend\nto generate repetitive and dull sequences, a phenomenon that is especially\napparent when generating using greedy decoding. This issue persists even with\nstate-of-the-art LLMs containing billions of parameters, trained via next-token\nprediction on large datasets. We find that by further fine-tuning these models\nto achieve a near-zero training loss on a small set of samples -- a process we\nrefer to as hyperfitting -- the long-sequence generative capabilities are\ngreatly enhanced. Greedy decoding with these Hyperfitted models even outperform\nTop-P sampling over long-sequences, both in terms of diversity and human\npreferences. This phenomenon extends to LLMs of various sizes, different\ndomains, and even autoregressive image generation. We further find this\nphenomena to be distinctly different from that of Grokking and double descent.\nSurprisingly, our experiments indicate that hyperfitted models rarely fall into\nrepeating sequences they were trained on, and even explicitly blocking these\nsequences results in high-quality output. 
All hyperfitted models produce\nextremely low-entropy predictions, often allocating nearly all probability to a\nsingle token.\n","authors":["Fredrik Carlsson","Fangyu Liu","Daniel Ward","Murathan Kurfali","Joakim Nivre"],"pdf_url":"https://arxiv.org/pdf/2412.04318v1.pdf","comment":"Under review at ICLR"},{"id":"http://arxiv.org/abs/2412.04315v1","updated":"2024-12-05T16:31:13Z","published":"2024-12-05T16:31:13Z","title":"Densing Law of LLMs","summary":" Large Language Models (LLMs) have emerged as a milestone in artificial\nintelligence, and their performance can improve as the model size increases.\nHowever, this scaling brings great challenges to training and inference\nefficiency, particularly for deploying LLMs in resource-constrained\nenvironments, and the scaling trend is becoming increasingly unsustainable.\nThis paper introduces the concept of ``\\textit{capacity density}'' as a new\nmetric to evaluate the quality of the LLMs across different scales and\ndescribes the trend of LLMs in terms of both effectiveness and efficiency. To\ncalculate the capacity density of a given target LLM, we first introduce a set\nof reference models and develop a scaling law to predict the downstream\nperformance of these reference models based on their parameter sizes. We then\ndefine the \\textit{effective parameter size} of the target LLM as the parameter\nsize required by a reference model to achieve equivalent performance, and\nformalize the capacity density as the ratio of the effective parameter size to\nthe actual parameter size of the target LLM. Capacity density provides a\nunified framework for assessing both model effectiveness and efficiency. Our\nfurther analysis of recent open-source base LLMs reveals an empirical law (the\ndensing law)that the capacity density of LLMs grows exponentially over time.\nMore specifically, using some widely used benchmarks for evaluation, the\ncapacity density of LLMs doubles approximately every three months. 
The law\nprovides new perspectives to guide future LLM development, emphasizing the\nimportance of improving capacity density to achieve optimal results with\nminimal computational overhead.\n","authors":["Chaojun Xiao","Jie Cai","Weilin Zhao","Guoyang Zeng","Xu Han","Zhiyuan Liu","Maosong Sun"],"pdf_url":"https://arxiv.org/pdf/2412.04315v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04305v1","updated":"2024-12-05T16:26:31Z","published":"2024-12-05T16:26:31Z","title":"ALMA: Alignment with Minimal Annotation","summary":" Recent approaches to large language model (LLM) alignment typically require\nmillions of human annotations or rely on external aligned models for synthetic\ndata generation. This paper introduces ALMA: Alignment with Minimal Annotation,\ndemonstrating that effective alignment can be achieved using only 9,000 labeled\nexamples -- less than 1% of conventional approaches. ALMA generates large\namounts of high-quality synthetic alignment data through new techniques:\ndiverse prompt synthesis via few-shot learning, diverse response generation\nwith multiple model checkpoints, and judge (reward model) enhancement through\nscore aggregation and self-distillation. Using only a pretrained Llama3 base\nmodel, 5,000 SFT examples, and 4,000 judge annotations, ALMA achieves\nperformance close to Llama3-Instruct across diverse alignment benchmarks (e.g.,\n0.1% difference on AlpacaEval 2.0 score). These results are achieved with a\nmulti-round, self-bootstrapped data synthesis and training recipe that\ncontinues to improve for 10 rounds, surpassing the typical 3-round ceiling of\nprevious methods. 
These results suggest that base models already possess\nsufficient knowledge for effective alignment, and that synthetic data\ngeneration methods can expose it.\n","authors":["Michihiro Yasunaga","Leonid Shamis","Chunting Zhou","Andrew Cohen","Jason Weston","Luke Zettlemoyer","Marjan Ghazvininejad"],"pdf_url":"https://arxiv.org/pdf/2412.04305v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.15796v5","updated":"2024-12-05T16:13:09Z","published":"2024-06-22T09:40:07Z","title":"Unveiling Entity-Level Unlearning for Large Language Models: A\n Comprehensive Analysis","summary":" Large language model unlearning has garnered increasing attention due to its\npotential to address security and privacy concerns, leading to extensive\nresearch in the field. However, much of this research has concentrated on\ninstance-level unlearning, specifically targeting the removal of predefined\ninstances containing sensitive content. This focus has left a significant gap\nin the exploration of full entity-level unlearning, which is critical in\nreal-world scenarios such as copyright protection. To this end, we propose a\nnovel task of Entity-level unlearning, which aims to erase entity-related\nknowledge from the target model completely. To thoroughly investigate this\ntask, we systematically evaluate trending unlearning algorithms, revealing that\ncurrent methods struggle to achieve effective entity-level unlearning. Then, we\nfurther explore the factors that influence the performance of the unlearning\nalgorithms, identifying that knowledge coverage and the size of the forget set\nplay pivotal roles. Notably, our analysis also uncovers that entities\nintroduced through fine-tuning are more vulnerable to unlearning than\npre-trained entities. 
These findings collectively offer valuable insights for\nadvancing entity-level unlearning for LLMs.\n","authors":["Weitao Ma","Xiaocheng Feng","Weihong Zhong","Lei Huang","Yangfan Ye","Xiachong Feng","Bing Qin"],"pdf_url":"https://arxiv.org/pdf/2406.15796v5.pdf","comment":"Accepted by COLING 2025"},{"id":"http://arxiv.org/abs/2412.04291v1","updated":"2024-12-05T16:12:06Z","published":"2024-12-05T16:12:06Z","title":"Evolutionary Pre-Prompt Optimization for Mathematical Reasoning","summary":" Recent advancements have highlighted that large language models (LLMs), when\ngiven a small set of task-specific examples, demonstrate remarkable\nproficiency, a capability that extends to complex reasoning tasks. In\nparticular, the combination of few-shot learning with the chain-of-thought\n(CoT) approach has been pivotal in steering models towards more logically\nconsistent conclusions. This paper explores the optimization of example\nselection for designing effective CoT pre-prompts and shows that the choice of\nthe optimization algorithm, typically in favor of comparison-based methods such\nas evolutionary computation, significantly enhances efficacy and feasibility.\nSpecifically, thanks to a limited exploitative and overfitted optimization,\nEvolutionary Pre-Prompt Optimization (EPPO) brings an improvement over the\nnaive few-shot approach exceeding 10 absolute points in exact match scores on\nbenchmark datasets such as GSM8k and MathQA. 
These gains are consistent across\nvarious contexts and are further amplified when integrated with\nself-consistency (SC)\n","authors":["Mathurin Videau","Alessandro Leite","Marc Schoenauer","Olivier Teytaud"],"pdf_url":"https://arxiv.org/pdf/2412.04291v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04277v1","updated":"2024-12-05T15:59:29Z","published":"2024-12-05T15:59:29Z","title":"Arabic Stable LM: Adapting Stable LM 2 1.6B to Arabic","summary":" Large Language Models (LLMs) have shown impressive results in multiple\ndomains of natural language processing (NLP) but are mainly focused on the\nEnglish language. Recently, more LLMs have incorporated a larger proportion of\nmultilingual text to represent low-resource languages. In Arabic NLP, several\nArabic-centric LLMs have shown remarkable results on multiple benchmarks in the\npast two years. However, most Arabic LLMs have more than 7 billion parameters,\nwhich increases their hardware requirements and inference latency, when\ncompared to smaller LLMs. This paper introduces Arabic Stable LM 1.6B in a base\nand chat version as a small but powerful Arabic-centric LLM. Our Arabic Stable\nLM 1.6B chat model achieves impressive results on several benchmarks beating\nmultiple models with up to 8x the parameters. 
In addition, we show the benefit\nof mixing in synthetic instruction tuning data by augmenting our fine-tuning\ndata with a large synthetic dialogue dataset.\n","authors":["Zaid Alyafeai","Michael Pieler","Hannah Teufel","Jonathan Tow","Marco Bellagente","Duy Phung","Nikhil Pinnaparaju","Reshinth Adithyan","Paulo Rocha","Maksym Zhuravinskyi","Carlos Riquelme"],"pdf_url":"https://arxiv.org/pdf/2412.04277v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04266v1","updated":"2024-12-05T15:50:44Z","published":"2024-12-05T15:50:44Z","title":"Representation Purification for End-to-End Speech Translation","summary":" Speech-to-text translation (ST) is a cross-modal task that involves\nconverting spoken language into text in a different language. Previous research\nprimarily focused on enhancing speech translation by facilitating knowledge\ntransfer from machine translation, exploring various methods to bridge the gap\nbetween speech and text modalities. Despite substantial progress made, factors\nin speech that are not relevant to translation content, such as timbre and\nrhythm, often limit the efficiency of knowledge transfer. In this paper, we\nconceptualize speech representation as a combination of content-agnostic and\ncontent-relevant factors. We examine the impact of content-agnostic factors on\ntranslation performance through preliminary experiments and observe a\nsignificant performance deterioration when content-agnostic perturbations are\nintroduced to speech signals. To address this issue, we propose a\n\\textbf{S}peech \\textbf{R}epresentation \\textbf{P}urification with\n\\textbf{S}upervision \\textbf{E}nhancement (SRPSE) framework, which excludes the\ncontent-agnostic components within speech representations to mitigate their\nnegative impact on ST. 
Experiments on MuST-C and CoVoST-2 datasets demonstrate\nthat SRPSE significantly improves translation performance across all\ntranslation directions in three settings and achieves preeminent performance\nunder a \\textit{transcript-free} setting.\n","authors":["Chengwei Zhang","Yue Zhou","Rui Zhao","Yidong Chen","Xiaodong Shi"],"pdf_url":"https://arxiv.org/pdf/2412.04266v1.pdf","comment":"Accepted by COLING 2025"},{"id":"http://arxiv.org/abs/2405.20331v2","updated":"2024-12-05T15:48:24Z","published":"2024-05-30T17:59:04Z","title":"CoSy: Evaluating Textual Explanations of Neurons","summary":" A crucial aspect of understanding the complex nature of Deep Neural Networks\n(DNNs) is the ability to explain learned concepts within their latent\nrepresentations. While methods exist to connect neurons to human-understandable\ntextual descriptions, evaluating the quality of these explanations is\nchallenging due to the lack of a unified quantitative approach. We introduce\nCoSy (Concept Synthesis), a novel, architecture-agnostic framework for\nevaluating textual explanations of latent neurons. Given textual explanations,\nour proposed framework uses a generative model conditioned on textual input to\ncreate data points representing the explanations. By comparing the neuron's\nresponse to these generated data points and control data points, we can\nestimate the quality of the explanation. We validate our framework through\nsanity checks and benchmark various neuron description methods for Computer\nVision tasks, revealing significant differences in quality.\n","authors":["Laura Kopf","Philine Lou Bommer","Anna Hedström","Sebastian Lapuschkin","Marina M. -C. 
Höhne","Kirill Bykov"],"pdf_url":"https://arxiv.org/pdf/2405.20331v2.pdf","comment":"10 pages, 5 figures"},{"id":"http://arxiv.org/abs/2412.04261v1","updated":"2024-12-05T15:41:06Z","published":"2024-12-05T15:41:06Z","title":"Aya Expanse: Combining Research Breakthroughs for a New Multilingual\n Frontier","summary":" We introduce the Aya Expanse model family, a new generation of 8B and 32B\nparameter multilingual language models, aiming to address the critical\nchallenge of developing highly performant multilingual models that match or\nsurpass the capabilities of monolingual models. By leveraging several years of\nresearch at Cohere For AI and Cohere, including advancements in data arbitrage,\nmultilingual preference training, and model merging, Aya Expanse sets a new\nstate-of-the-art in multilingual performance. Our evaluations on the\nArena-Hard-Auto dataset, translated into 23 languages, demonstrate that Aya\nExpanse 8B and 32B outperform leading open-weight models in their respective\nparameter classes, including Gemma 2, Qwen 2.5, and Llama 3.1, achieving up to\na 76.6% win-rate. Notably, Aya Expanse 32B outperforms Llama 3.1 70B, a model\nwith twice as many parameters, achieving a 54.0% win-rate. 
In this short\ntechnical report, we present extended evaluation results for the Aya Expanse\nmodel family and release their open-weights, together with a new multilingual\nevaluation dataset m-ArenaHard.\n","authors":["John Dang","Shivalika Singh","Daniel D'souza","Arash Ahmadian","Alejandro Salamanca","Madeline Smith","Aidan Peppin","Sungjin Hong","Manoj Govindassamy","Terrence Zhao","Sandra Kublik","Meor Amer","Viraat Aryabumi","Jon Ander Campos","Yi-Chern Tan","Tom Kocmi","Florian Strub","Nathan Grinsztajn","Yannis Flet-Berliac","Acyr Locatelli","Hangyu Lin","Dwarak Talupuru","Bharat Venkitesh","David Cairuz","Bowen Yang","Tim Chung","Wei-Yin Ko","Sylvie Shang Shi","Amir Shukayev","Sammie Bae","Aleksandra Piktus","Roman Castagné","Felipe Cruz-Salinas","Eddie Kim","Lucas Crawhall-Stein","Adrien Morisot","Sudip Roy","Phil Blunsom","Ivan Zhang","Aidan Gomez","Nick Frosst","Marzieh Fadaee","Beyza Ermis","Ahmet Üstün","Sara Hooker"],"pdf_url":"https://arxiv.org/pdf/2412.04261v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.16778v2","updated":"2024-12-05T15:38:11Z","published":"2024-06-24T16:40:54Z","title":"Finding Transformer Circuits with Edge Pruning","summary":" The path to interpreting a language model often proceeds via analysis of\ncircuits -- sparse computational subgraphs of the model that capture specific\naspects of its behavior. Recent work has automated the task of discovering\ncircuits. Yet, these methods have practical limitations, as they rely either on\ninefficient search algorithms or inaccurate approximations. In this paper, we\nframe automated circuit discovery as an optimization problem and propose *Edge\nPruning* as an effective and scalable solution. Edge Pruning leverages\ngradient-based pruning techniques, but instead of removing neurons or\ncomponents, it prunes the \\emph{edges} between components. 
Our method finds\ncircuits in GPT-2 that use less than half the number of edges compared to\ncircuits found by previous methods while being equally faithful to the full\nmodel predictions on standard circuit-finding tasks. Edge Pruning is efficient\neven with as many as 100K examples, outperforming previous methods in speed and\nproducing substantially better circuits. It also perfectly recovers the\nground-truth circuits in two models compiled with Tracr. Thanks to its\nefficiency, we scale Edge Pruning to CodeLlama-13B, a model over 100x the scale\nthat prior methods operate on. We use this setting for a case study comparing\nthe mechanisms behind instruction prompting and in-context learning. We find\ntwo circuits with more than 99.96% sparsity that match the performance of the\nfull model and reveal that the mechanisms in the two settings overlap\nsubstantially. Our case study shows that Edge Pruning is a practical and\nscalable tool for interpretability and sheds light on behaviors that only\nemerge in large models.\n","authors":["Adithya Bhaskar","Alexander Wettig","Dan Friedman","Danqi Chen"],"pdf_url":"https://arxiv.org/pdf/2406.16778v2.pdf","comment":"NeurIPS 2024 (Spotlight)"},{"id":"http://arxiv.org/abs/2412.04254v1","updated":"2024-12-05T15:34:02Z","published":"2024-12-05T15:34:02Z","title":"CLINICSUM: Utilizing Language Models for Generating Clinical Summaries\n from Patient-Doctor Conversations","summary":" This paper presents ClinicSum, a novel framework designed to automatically\ngenerate clinical summaries from patient-doctor conversations. It utilizes a\ntwo-module architecture: a retrieval-based filtering module that extracts\nSubjective, Objective, Assessment, and Plan (SOAP) information from\nconversation transcripts, and an inference module powered by fine-tuned\nPre-trained Language Models (PLMs), which leverage the extracted SOAP data to\ngenerate abstracted clinical summaries. 
To fine-tune the PLM, we created a\ntraining dataset of consisting 1,473 conversations-summaries pair by\nconsolidating two publicly available datasets, FigShare and MTS-Dialog, with\nground truth summaries validated by Subject Matter Experts (SMEs). ClinicSum's\neffectiveness is evaluated through both automatic metrics (e.g., ROUGE,\nBERTScore) and expert human assessments. Results show that ClinicSum\noutperforms state-of-the-art PLMs, demonstrating superior precision, recall,\nand F-1 scores in automatic evaluations and receiving high preference from SMEs\nin human assessment, making it a robust solution for automated clinical\nsummarization.\n","authors":["Subash Neupane","Himanshu Tripathi","Shaswata Mitra","Sean Bozorgzad","Sudip Mittal","Shahram Rahimi","Amin Amirlatifi"],"pdf_url":"https://arxiv.org/pdf/2412.04254v1.pdf","comment":"accepted at the the 2024 IEEE International Conference on Big Data\n workshop Workshop on Big Data and AI for Healthcare"},{"id":"http://arxiv.org/abs/2410.14817v2","updated":"2024-12-05T15:20:28Z","published":"2024-10-18T18:37:27Z","title":"A Complexity-Based Theory of Compositionality","summary":" Compositionality is believed to be fundamental to intelligence. In humans, it\nunderlies the structure of thought, language, and higher-level reasoning. In\nAI, compositional representations can enable a powerful form of\nout-of-distribution generalization, in which a model systematically adapts to\nnovel combinations of known concepts. However, while we have strong intuitions\nabout what compositionality is, there currently exists no formal definition for\nit that is measurable and mathematical. Here, we propose such a definition,\nwhich we call representational compositionality, that accounts for and extends\nour intuitions about compositionality. The definition is conceptually simple,\nquantitative, grounded in algorithmic information theory, and applicable to any\nrepresentation. 
Intuitively, representational compositionality states that a\ncompositional representation satisfies three properties. First, it must be\nexpressive. Second, it must be possible to re-describe the representation as a\nfunction of discrete symbolic sequences with re-combinable parts, analogous to\nsentences in natural language. Third, the function that relates these symbolic\nsequences to the representation, analogous to semantics in natural language,\nmust be simple. Through experiments on both synthetic and real world data, we\nvalidate our definition of compositionality and show how it unifies disparate\nintuitions from across the literature in both AI and cognitive science. We also\nshow that representational compositionality, while theoretically intractable,\ncan be readily estimated using standard deep learning tools. Our definition has\nthe potential to inspire the design of novel, theoretically-driven models that\nbetter capture the mechanisms of compositional thought.\n","authors":["Eric Elmoznino","Thomas Jiralerspong","Yoshua Bengio","Guillaume Lajoie"],"pdf_url":"https://arxiv.org/pdf/2410.14817v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04236v1","updated":"2024-12-05T15:14:16Z","published":"2024-12-05T15:14:16Z","title":"A History of Philosophy in Colombia through Topic Modelling","summary":" Data-driven approaches to philosophy have emerged as a valuable tool for\nstudying the history of the discipline. However, most studies in this area have\nfocused on a limited number of journals from specific regions and subfields. We\nexpand the scope of this research by applying dynamic topic modelling\ntechniques to explore the history of philosophy in Colombia and Latin America.\nOur study examines the Colombian philosophy journal Ideas y Valores, founded in\n1951 and currently one of the most influential academic philosophy journals in\nthe region. 
By analyzing the evolution of topics across the journal's history,\nwe identify various trends and specific dynamics in philosophical discourse\nwithin the Colombian and Latin American context. Our findings reveal that the\nmost prominent topics are value theory (including ethics, political philosophy,\nand aesthetics), epistemology, and the philosophy of science. We also trace the\nevolution of articles focusing on the historical and interpretive aspects of\nphilosophical texts, and we note a notable emphasis on German philosophers such\nas Kant, Husserl, and Hegel on various topics throughout the journal's\nlifetime. Additionally, we investigate whether articles with a historical focus\nhave decreased over time due to editorial pressures. Our analysis suggests no\nsignificant decline in such articles. Finally, we propose ideas for extending\nthis research to other Latin American journals and suggest improvements for\nnatural language processing workflows in non-English languages.\n","authors":["Juan R. Loaiza","Miguel González-Duque"],"pdf_url":"https://arxiv.org/pdf/2412.04236v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04235v1","updated":"2024-12-05T15:11:12Z","published":"2024-12-05T15:11:12Z","title":"Addressing Hallucinations with RAG and NMISS in Italian Healthcare LLM\n Chatbots","summary":" I combine detection and mitigation techniques to addresses hallucinations in\nLarge Language Models (LLMs). Mitigation is achieved in a question-answering\nRetrieval-Augmented Generation (RAG) framework while detection is obtained by\nintroducing the Negative Missing Information Scoring System (NMISS), which\naccounts for contextual relevance in responses. While RAG mitigates\nhallucinations by grounding answers in external data, NMISS refines the\nevaluation by identifying cases where traditional metrics incorrectly flag\ncontextually accurate responses as hallucinations. I use Italian health news\narticles as context to evaluate LLM performance. 
Results show that Gemma2 and\nGPT-4 outperform the other models, with GPT-4 producing answers closely aligned\nwith reference responses. Mid-tier models, such as Llama2, Llama3, and Mistral\nbenefit significantly from NMISS, highlighting their ability to provide richer\ncontextual information. This combined approach offers new insights into the\nreduction and more accurate assessment of hallucinations in LLMs, with\napplications in real-world healthcare tasks and other domains.\n","authors":["Maria Paola Priola"],"pdf_url":"https://arxiv.org/pdf/2412.04235v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.05357v2","updated":"2024-12-05T15:08:56Z","published":"2024-10-07T15:55:55Z","title":"Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild","summary":" As Large Language Models (LLMs) excel across tasks and specialized domains,\nscaling LLMs based on existing models has garnered significant attention, which\nfaces the challenge of decreasing performance when combining disparate models.\nVarious techniques have been proposed for the aggregation of pre-trained LLMs,\nincluding model merging, Mixture-of-Experts, and stacking. Despite their\nmerits, a comprehensive comparison and synergistic application of them to a\ndiverse model zoo is yet to be adequately addressed. In light of this research\ngap, this paper introduces Model-GLUE, a holistic LLM scaling guideline. First,\nour work starts with a benchmarking of existing LLM scaling techniques,\nespecially selective merging, and variants of mixture. Utilizing the insights\nfrom the benchmark results, we formulate an optimal strategy for the selection\nand aggregation of a heterogeneous model zoo characterizing different\narchitectures and initialization.Our methodology involves the clustering of\nmergeable models and optimal merging strategy selection, and the integration of\nclusters through a model mixture. 
Finally, evidenced by our experiments on a\ndiverse Llama-2-based model zoo, Model-GLUE shows an average performance\nenhancement of 5.61%, achieved without additional training. Codes are available\nat: https://github.com/Model-GLUE/Model-GLUE.\n","authors":["Xinyu Zhao","Guoheng Sun","Ruisi Cai","Yukun Zhou","Pingzhi Li","Peihao Wang","Bowen Tan","Yexiao He","Li Chen","Yi Liang","Beidi Chen","Binhang Yuan","Hongyi Wang","Ang Li","Zhangyang Wang","Tianlong Chen"],"pdf_url":"https://arxiv.org/pdf/2410.05357v2.pdf","comment":"24 pages, 4 figures, accepted to NeurIPS 2024 Datasets and Benchmarks\n Track"},{"id":"http://arxiv.org/abs/2410.03960v2","updated":"2024-12-05T14:56:56Z","published":"2024-10-04T22:45:26Z","title":"SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving\n Model Transformation","summary":" LLM inference for popular enterprise use cases, such as summarization, RAG,\nand code-generation, typically observes orders of magnitude longer prompt\nlengths than generation lengths. This characteristic leads to high cost of\nprefill and increased response latency. In this paper, we present SwiftKV, a\nnovel model transformation and distillation procedure specifically designed to\nreduce the time and cost of processing prompt tokens while preserving high\nquality of generated tokens. SwiftKV combines three key mechanisms: i)\nSingleInputKV, which prefills later layers' KV cache using a much earlier\nlayer's output, allowing prompt tokens to skip much of the model computation,\nii) AcrossKV, which merges the KV caches of neighboring layers to reduce the\nmemory footprint and support larger batch size for higher throughput, and iii)\na knowledge-preserving distillation procedure that can adapt existing LLMs for\nSwiftKV with minimal accuracy impact and low compute and data requirement. 
For\nLlama-3.1-8B and 70B, SwiftKV reduces the compute requirement of prefill by 50%\nand the memory requirement of the KV cache by 62.5% while incurring minimum\nquality degradation across a wide range of tasks. In the end-to-end inference\nserving using an optimized vLLM implementation, SwiftKV realizes up to 2x\nhigher aggregate throughput and 60% lower time per output token. It can achieve\na staggering 560 TFlops/GPU of normalized inference throughput, which\ntranslates to 16K tokens/s for Llama-3.1-70B in 16-bit precision on 4x H100\nGPUs. Our training, inference, and model implementations are open-sourced and\ncan be found through\nhttps://huggingface.co/collections/Snowflake/swiftkv-models-674f7d7474eb789e185d31cb.\n","authors":["Aurick Qiao","Zhewei Yao","Samyam Rajbhandari","Yuxiong He"],"pdf_url":"https://arxiv.org/pdf/2410.03960v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02830v2","updated":"2024-12-05T14:51:35Z","published":"2024-12-03T20:52:35Z","title":"RARE: Retrieval-Augmented Reasoning Enhancement for Large Language\n Models","summary":" This work introduces RARE (Retrieval-Augmented Reasoning Enhancement), a\nversatile extension to the mutual reasoning framework (rStar), aimed at\nenhancing reasoning accuracy and factual integrity across large language models\n(LLMs) for complex, knowledge-intensive tasks such as commonsense and medical\nreasoning. RARE incorporates two innovative actions within the Monte Carlo Tree\nSearch (MCTS) framework: A6, which generates search queries based on the\ninitial problem statement, performs information retrieval using those queries,\nand augments reasoning with the retrieved data to formulate the final answer;\nand A7, which leverages information retrieval specifically for generated\nsub-questions and re-answers these sub-questions with the relevant contextual\ninformation. 
Additionally, a Retrieval-Augmented Factuality Scorer is proposed\nto replace the original discriminator, prioritizing reasoning paths that meet\nhigh standards of factuality. Experimental results with LLaMA 3.1 show that\nRARE enables open-source LLMs to achieve competitive performance with top\nopen-source models like GPT-4 and GPT-4o. This research establishes RARE as a\nscalable solution for improving LLMs in domains where logical coherence and\nfactual integrity are critical.\n","authors":["Hieu Tran","Zonghai Yao","Junda Wang","Yifan Zhang","Zhichao Yang","Hong Yu"],"pdf_url":"https://arxiv.org/pdf/2412.02830v2.pdf","comment":"24 pages"},{"id":"http://arxiv.org/abs/2312.00326v4","updated":"2024-12-05T14:45:05Z","published":"2023-12-01T03:44:54Z","title":"Agent-OM: Leveraging LLM Agents for Ontology Matching","summary":" Ontology matching (OM) enables semantic interoperability between different\nontologies and resolves their conceptual heterogeneity by aligning related\nentities. OM systems currently have two prevailing design paradigms:\nconventional knowledge-based expert systems and newer machine learning-based\npredictive systems. While large language models (LLMs) and LLM agents have\nrevolutionised data engineering and have been applied creatively in many\ndomains, their potential for OM remains underexplored. This study introduces a\nnovel agent-powered LLM-based design paradigm for OM systems. With\nconsideration of several specific challenges in leveraging LLM agents for OM,\nwe propose a generic framework, namely Agent-OM (Agent for Ontology Matching),\nconsisting of two Siamese agents for retrieval and matching, with a set of\nsimple OM tools. 
Our framework is implemented in a proof-of-concept system.\nEvaluations of three Ontology Alignment Evaluation Initiative (OAEI) tracks\nover state-of-the-art OM systems show that our system can achieve results very\nclose to the long-standing best performance on simple OM tasks and can\nsignificantly improve the performance on complex and few-shot OM tasks.\n","authors":["Zhangcheng Qiang","Weiqing Wang","Kerry Taylor"],"pdf_url":"https://arxiv.org/pdf/2312.00326v4.pdf","comment":"14 pages, 13 figures, 4 tables"},{"id":"http://arxiv.org/abs/2412.04205v1","updated":"2024-12-05T14:41:05Z","published":"2024-12-05T14:41:05Z","title":"A Context-aware Framework for Translation-mediated Conversations","summary":" Effective communication is fundamental to any interaction, yet challenges\narise when participants do not share a common language. Automatic translation\nsystems offer a powerful solution to bridge language barriers in such\nscenarios, but they introduce errors that can lead to misunderstandings and\nconversation breakdown. A key issue is that current systems fail to incorporate\nthe rich contextual information necessary to resolve ambiguities and omitted\ndetails, resulting in literal, inappropriate, or misaligned translations. In\nthis work, we present a framework to improve large language model-based\ntranslation systems by incorporating contextual information in bilingual\nconversational settings. During training, we leverage context-augmented\nparallel data, which allows the model to generate translations sensitive to\nconversational history. During inference, we perform quality-aware decoding\nwith context-aware metrics to select the optimal translation from a pool of\ncandidates. We validate both components of our framework on two task-oriented\ndomains: customer chat and user-assistant interaction. 
Across both settings,\nour framework consistently results in better translations than state-of-the-art\nsystems like GPT-4o and TowerInstruct, as measured by multiple automatic\ntranslation quality metrics on several language pairs. We also show that the\nresulting model leverages context in an intended and interpretable way,\nimproving consistency between the conveyed message and the generated\ntranslations.\n","authors":["José Pombal","Sweta Agrawal","Patrick Fernandes","Emmanouil Zaranis","André F. T. Martins"],"pdf_url":"https://arxiv.org/pdf/2412.04205v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04193v1","updated":"2024-12-05T14:33:00Z","published":"2024-12-05T14:33:00Z","title":"AL-QASIDA: Analyzing LLM Quality and Accuracy Systematically in\n Dialectal Arabic","summary":" Dialectal Arabic (DA) varieties are under-served by language technologies,\nparticularly large language models (LLMs). This trend threatens to exacerbate\nexisting social inequalities and limits language modeling applications, yet the\nresearch community lacks operationalized LLM performance measurements in DA. We\npresent a method that comprehensively evaluates LLM fidelity, understanding,\nquality, and diglossia in modeling DA. We evaluate nine LLMs in eight DA\nvarieties across these four dimensions and provide best practice\nrecommendations. Our evaluation suggests that LLMs do not produce DA as well as\nthey understand it, but does not suggest deterioration in quality when they do.\nFurther analysis suggests that current post-training can degrade DA\ncapabilities, that few-shot examples can overcome this and other LLM\ndeficiencies, and that otherwise no measurable features of input text correlate\nwell with LLM DA performance.\n","authors":["Nathaniel R. 
Robinson","Shahd Abdelmoneim","Kelly Marchisio","Sebastian Ruder"],"pdf_url":"https://arxiv.org/pdf/2412.04193v1.pdf","comment":"Pre-print"},{"id":"http://arxiv.org/abs/2409.17146v2","updated":"2024-12-05T14:28:40Z","published":"2024-09-25T17:59:51Z","title":"Molmo and PixMo: Open Weights and Open Data for State-of-the-Art\n Vision-Language Models","summary":" Today's most advanced vision-language models (VLMs) remain proprietary. The\nstrongest open-weight models rely heavily on synthetic data from proprietary\nVLMs to achieve good performance, effectively distilling these closed VLMs into\nopen ones. As a result, the community has been missing foundational knowledge\nabout how to build performant VLMs from scratch. We present Molmo, a new family\nof VLMs that are state-of-the-art in their class of openness. Our key\ncontribution is a collection of new datasets called PixMo, including a dataset\nof highly detailed image captions for pre-training, a free-form image Q&A\ndataset for fine-tuning, and an innovative 2D pointing dataset, all collected\nwithout the use of external VLMs. The success of our approach relies on careful\nmodeling choices, a well-tuned training pipeline, and, most critically, the\nquality of our newly collected datasets. Our best-in-class 72B model not only\noutperforms others in the class of open weight and data models, but also\noutperforms larger proprietary models including Claude 3.5 Sonnet, and Gemini\n1.5 Pro and Flash, second only to GPT-4o based on both academic benchmarks and\non a large human evaluation. 
Our model weights, new datasets, and source code\nare available at https://molmo.allenai.org/blog.\n","authors":["Matt Deitke","Christopher Clark","Sangho Lee","Rohun Tripathi","Yue Yang","Jae Sung Park","Mohammadreza Salehi","Niklas Muennighoff","Kyle Lo","Luca Soldaini","Jiasen Lu","Taira Anderson","Erin Bransom","Kiana Ehsani","Huong Ngo","YenSung Chen","Ajay Patel","Mark Yatskar","Chris Callison-Burch","Andrew Head","Rose Hendrix","Favyen Bastani","Eli VanderBilt","Nathan Lambert","Yvonne Chou","Arnavi Chheda","Jenna Sparks","Sam Skjonsberg","Michael Schmitz","Aaron Sarnat","Byron Bischoff","Pete Walsh","Chris Newell","Piper Wolters","Tanmay Gupta","Kuo-Hao Zeng","Jon Borchardt","Dirk Groeneveld","Crystal Nam","Sophie Lebrecht","Caitlin Wittlif","Carissa Schoenick","Oscar Michel","Ranjay Krishna","Luca Weihs","Noah A. Smith","Hannaneh Hajishirzi","Ross Girshick","Ali Farhadi","Aniruddha Kembhavi"],"pdf_url":"https://arxiv.org/pdf/2409.17146v2.pdf","comment":"Updated with ablations and more technical details"},{"id":"http://arxiv.org/abs/2411.16105v2","updated":"2024-12-05T14:16:57Z","published":"2024-11-25T05:32:34Z","title":"Adaptive Circuit Behavior and Generalization in Mechanistic\n Interpretability","summary":" Mechanistic interpretability aims to understand the inner workings of large\nneural networks by identifying circuits, or minimal subgraphs within the model\nthat implement algorithms responsible for performing specific tasks. These\ncircuits are typically discovered and analyzed using a narrowly defined prompt\nformat. However, given the abilities of large language models (LLMs) to\ngeneralize across various prompt formats for the same task, it remains unclear\nhow well these circuits generalize. 
For instance, it is unclear whether the\nmodels generalization results from reusing the same circuit components, the\ncomponents behaving differently, or the use of entirely different components.\nIn this paper, we investigate the generality of the indirect object\nidentification (IOI) circuit in GPT-2 small, which is well-studied and believed\nto implement a simple, interpretable algorithm. We evaluate its performance on\nprompt variants that challenge the assumptions of this algorithm. Our findings\nreveal that the circuit generalizes surprisingly well, reusing all of its\ncomponents and mechanisms while only adding additional input edges. Notably,\nthe circuit generalizes even to prompt variants where the original algorithm\nshould fail; we discover a mechanism that explains this which we term S2\nHacking. Our findings indicate that circuits within LLMs may be more flexible\nand general than previously recognized, underscoring the importance of studying\ncircuit generalization to better understand the broader capabilities of these\nmodels.\n","authors":["Jatin Nainani","Sankaran Vaidyanathan","AJ Yeung","Kartik Gupta","David Jensen"],"pdf_url":"https://arxiv.org/pdf/2411.16105v2.pdf","comment":"10 pages, 8 figures"},{"id":"http://arxiv.org/abs/2412.04144v1","updated":"2024-12-05T13:12:51Z","published":"2024-12-05T13:12:51Z","title":"If You Can't Use Them, Recycle Them: Optimizing Merging at Scale\n Mitigates Performance Tradeoffs","summary":" Model merging has shown great promise at combining expert models, but the\nbenefit of merging is unclear when merging ``generalist'' models trained on\nmany tasks. We explore merging in the context of large ($\\sim100$B) models, by\n\\textit{recycling} checkpoints that exhibit tradeoffs among different tasks.\nSuch checkpoints are often created in the process of developing a frontier\nmodel, and many suboptimal ones are usually discarded. 
Given a pool of model\ncheckpoints obtained from different training runs (e.g., different stages,\nobjectives, hyperparameters, and data mixtures), which naturally show tradeoffs\nacross different language capabilities (e.g., instruction following vs. code\ngeneration), we investigate whether merging can recycle such suboptimal models\ninto a Pareto-optimal one. Our optimization algorithm tunes the weight of each\ncheckpoint in a linear combination, resulting in a Pareto-optimal models that\noutperforms both individual models and merge-based baselines. Further analysis\nshows that good merges tend to include almost all checkpoints with with\nnon-zero weights, indicating that even seemingly bad initial checkpoints can\ncontribute to good final merges.\n","authors":["Muhammad Khalifa","Yi-Chern Tan","Arash Ahmadian","Tom Hosking","Honglak Lee","Lu Wang","Ahmet Üstün","Tom Sherborne","Matthias Gallé"],"pdf_url":"https://arxiv.org/pdf/2412.04144v1.pdf","comment":"13 pages, 9 figures"},{"id":"http://arxiv.org/abs/2412.04141v1","updated":"2024-12-05T13:10:54Z","published":"2024-12-05T13:10:54Z","title":"Reducing Tool Hallucination via Reliability Alignment","summary":" Large Language Models (LLMs) have extended their capabilities beyond language\ngeneration to interact with external systems through tool calling, offering\npowerful potential for real-world applications. However, the phenomenon of tool\nhallucinations, which occur when models improperly select or misuse tools,\npresents critical challenges that can lead to flawed task execution and\nincreased operational costs. This paper investigates the concept of reliable\ntool calling and highlights the necessity of addressing tool hallucinations. We\nsystematically categorize tool hallucinations into two main types: tool\nselection hallucination and tool usage hallucination. 
To mitigate these issues,\nwe propose a reliability-focused alignment framework that enhances the model's\nability to accurately assess tool relevance and usage. By proposing a suite of\nevaluation metrics and evaluating on StableToolBench, we further demonstrate\nthe effectiveness of our framework in mitigating tool hallucination and\nimproving the overall system reliability of LLM tool calling.\n","authors":["Hongshen Xu","Su Zhu","Zihan Wang","Hang Zheng","Da Ma","Ruisheng Cao","Shuai Fan","Lu Chen","Kai Yu"],"pdf_url":"https://arxiv.org/pdf/2412.04141v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04137v1","updated":"2024-12-05T13:04:10Z","published":"2024-12-05T13:04:10Z","title":"Text Change Detection in Multilingual Documents Using Image Comparison","summary":" Document comparison typically relies on optical character recognition (OCR)\nas its core technology. However, OCR requires the selection of appropriate\nlanguage models for each document and the performance of multilingual or hybrid\nmodels remains limited. To overcome these challenges, we propose text change\ndetection (TCD) using an image comparison model tailored for multilingual\ndocuments. Unlike OCR-based approaches, our method employs word-level text\nimage-to-image comparison to detect changes. Our model generates bidirectional\nchange segmentation maps between the source and target documents. To enhance\nperformance without requiring explicit text alignment or scaling preprocessing,\nwe employ correlations among multi-scale attention features. We also construct\na benchmark dataset comprising actual printed and scanned word pairs in various\nlanguages to evaluate our model. We validate our approach using our benchmark\ndataset and public benchmarks Distorted Document Images and the LRDE Document\nBinarization Dataset. 
We compare our model against state-of-the-art semantic\nsegmentation and change detection models, as well as to conventional OCR-based\nmodels.\n","authors":["Doyoung Park","Naresh Reddy Yarram","Sunjin Kim","Minkyu Kim","Seongho Cho","Taehee Lee"],"pdf_url":"https://arxiv.org/pdf/2412.04137v1.pdf","comment":"15pages, 11figures 6tables, wacv2025 accepted"},{"id":"http://arxiv.org/abs/2411.03906v2","updated":"2024-12-05T12:56:40Z","published":"2024-11-06T13:37:28Z","title":"Lexicalization Is All You Need: Examining the Impact of Lexical\n Knowledge in a Compositional QALD System","summary":" In this paper, we examine the impact of lexicalization on Question Answering\nover Linked Data (QALD). It is well known that one of the key challenges in\ninterpreting natural language questions with respect to SPARQL lies in bridging\nthe lexical gap, that is mapping the words in the query to the correct\nvocabulary elements. We argue in this paper that lexicalization, that is\nexplicit knowledge about the potential interpretations of a word with respect\nto the given vocabulary, significantly eases the task and increases the\nperformance of QA systems. Towards this goal, we present a compositional QA\nsystem that can leverage explicit lexical knowledge in a compositional manner\nto infer the meaning of a question in terms of a SPARQL query. We show that\nsuch a system, given lexical knowledge, has a performance well beyond current\nQA systems, achieving up to a $35.8\\%$ increase in the micro $F_1$ score\ncompared to the best QA system on QALD-9. This shows the importance and\npotential of including explicit lexical knowledge. In contrast, we show that\nLLMs have limited abilities to exploit lexical knowledge, with only marginal\nimprovements compared to a version without lexical knowledge. This shows that\nLLMs have no ability to compositionally interpret a question on the basis of\nthe meaning of its parts, a key feature of compositional approaches. 
Taken\ntogether, our work shows new avenues for QALD research, emphasizing the\nimportance of lexicalization and compositionality.\n","authors":["David Maria Schmidt","Mohammad Fazleh Elahi","Philipp Cimiano"],"pdf_url":"https://arxiv.org/pdf/2411.03906v2.pdf","comment":"24th International Conference on Knowledge Engineering and Knowledge\n Management (EKAW 2024), November 26-28, 2024, Amsterdam, The Netherlands"},{"id":"http://arxiv.org/abs/2412.04119v1","updated":"2024-12-05T12:37:27Z","published":"2024-12-05T12:37:27Z","title":"GRAF: Graph Retrieval Augmented by Facts for Legal Question Answering","summary":" Pre-trained Language Models (PLMs) have shown remarkable performances in\nrecent years, setting a new paradigm for NLP research and industry. The legal\ndomain has received some attention from the NLP community partly due to its\ntextual nature. Some tasks from this domain are represented by\nquestion-answering (QA) tasks. This work explores the legal domain\nMultiple-Choice QA (MCQA) for a low-resource language. The contribution of this\nwork is multi-fold. We first introduce JuRO, the first openly available\nRomanian legal MCQA dataset, comprising three different examinations and a\nnumber of 10,836 total questions. Along with this dataset, we introduce CROL,\nan organized corpus of laws that has a total of 93 distinct documents with\ntheir modifications from 763 time spans, that we leveraged in this work for\nInformation Retrieval (IR) techniques. Moreover, we are the first to propose\nLaw-RoG, a Knowledge Graph (KG) for the Romanian language, and this KG is\nderived from the aforementioned corpus. 
Lastly, we propose a novel approach for\nMCQA, Graph Retrieval Augmented by Facts (GRAF), which achieves competitive\nresults with generally accepted SOTA methods and even exceeds them in most\nsettings.\n","authors":["Cristian-George Crăciun","Răzvan-Alexandru Smădu","Dumitru-Clementin Cercel","Mihaela-Claudia Cercel"],"pdf_url":"https://arxiv.org/pdf/2412.04119v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19574v2","updated":"2024-12-05T12:19:38Z","published":"2024-11-29T09:42:38Z","title":"KV Shifting Attention Enhances Language Modeling","summary":" The current large language models are mainly based on decode-only structure\ntransformers, which have great in-context learning (ICL) capabilities. It is\ngenerally believed that the important foundation of its ICL capability is the\ninduction heads mechanism, which requires at least two layers attention. In\norder to more efficiently implement the ability of the model's induction, we\nrevisit the induction heads mechanism and proposed a KV shifting attention. We\ntheoretically prove that the KV shifting attention reducing the model's\nrequirements for the depth and width of the induction heads mechanism. Our\nexperimental results demonstrate that KV shifting attention is beneficial to\nlearning induction heads and language modeling, which lead to better\nperformance or faster convergence from toy models to the pre-training models\nwith more than 10 B parameters.\n","authors":["Mingyu Xu","Wei Cheng","Bingning Wang","Weipeng Chen"],"pdf_url":"https://arxiv.org/pdf/2411.19574v2.pdf","comment":"22 pages"},{"id":"http://arxiv.org/abs/2412.04100v1","updated":"2024-12-05T12:10:42Z","published":"2024-12-05T12:10:42Z","title":"Missing Melodies: AI Music Generation and its \"Nearly\" Complete Omission\n of the Global South","summary":" Recent advances in generative AI have sparked renewed interest and expanded\npossibilities for music generation. 
However, the performance and versatility of\nthese systems across musical genres are heavily influenced by the availability\nof training data. We conducted an extensive analysis of over one million hours\nof audio datasets used in AI music generation research and manually reviewed\nmore than 200 papers from eleven prominent AI and music conferences and\norganizations (AAAI, ACM, EUSIPCO, EURASIP, ICASSP, ICML, IJCAI, ISMIR,\nNeurIPS, NIME, SMC) to identify a critical gap in the fair representation and\ninclusion of the musical genres of the Global South in AI research. Our\nfindings reveal a stark imbalance: approximately 86% of the total dataset hours\nand over 93% of researchers focus primarily on music from the Global North.\nHowever, around 40% of these datasets include some form of non-Western music,\ngenres from the Global South account for only 14.6% of the data. Furthermore,\napproximately 51% of the papers surveyed concentrate on symbolic music\ngeneration, a method that often fails to capture the cultural nuances inherent\nin music from regions such as South Asia, the Middle East, and Africa. As AI\nincreasingly shapes the creation and dissemination of music, the significant\nunderrepresentation of music genres in datasets and research presents a serious\nthreat to global musical diversity. We also propose some important steps to\nmitigate these risks and foster a more inclusive future for AI-driven music\ngeneration.\n","authors":["Atharva Mehta","Shivam Chauhan","Monojit Choudhury"],"pdf_url":"https://arxiv.org/pdf/2412.04100v1.pdf","comment":"Submitted to CACM, 12 pages, 2 figures"},{"id":"http://arxiv.org/abs/2412.04092v1","updated":"2024-12-05T11:56:48Z","published":"2024-12-05T11:56:48Z","title":"GEITje 7B Ultra: A Conversational Model for Dutch","summary":" Language models have rapidly evolved, predominantly focusing on English while\noften neglecting extensive pretraining in other languages. 
This approach has\nrequired initiatives to adapt powerful, English-centric models to other\nlinguistic contexts through finetuning. For Dutch, such a recent endeavour is\n``GEITje'' a model originally derived from the English-based Mistral 7B.\nBuilding on this fundamental work, the current research extends the\ncapabilities of GEITje by supervised finetuning on newly created high-quality\nsynthetic conversational datasets, along with an additional preference\nalignment procedure on a synthetic feedback dataset. Both the developed models\nand the created datasets are openly available.\n","authors":["Bram Vanroy"],"pdf_url":"https://arxiv.org/pdf/2412.04092v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.11624v3","updated":"2024-12-05T11:47:49Z","published":"2024-06-17T15:07:55Z","title":"Words in Motion: Extracting Interpretable Control Vectors for Motion\n Transformers","summary":" Transformer-based models generate hidden states that are difficult to\ninterpret. In this work, we aim to interpret these hidden states and control\nthem at inference, with a focus on motion forecasting. We use linear probes to\nmeasure neural collapse towards interpretable motion features in hidden states.\nHigh probing accuracy implies meaningful directions and distances between\nhidden states of opposing features, which we use to fit interpretable control\nvectors for activation steering at inference. To optimize our control vectors,\nwe use sparse autoencoders with fully-connected, convolutional, MLPMixer layers\nand various activation functions. Notably, we show that enforcing sparsity in\nhidden states leads to a more linear relationship between control vector\ntemperatures and forecasts. Our approach enables mechanistic interpretability\nand zero-shot generalization to unseen dataset characteristics with negligible\ncomputational overhead. 
Our implementation is available at\nhttps://github.com/kit-mrt/future-motion\n","authors":["Omer Sahin Tas","Royden Wagner"],"pdf_url":"https://arxiv.org/pdf/2406.11624v3.pdf","comment":"Add autoencoders with convolutional, MLPMixer layers, and JumpReLU\n activations"},{"id":"http://arxiv.org/abs/2412.04067v1","updated":"2024-12-05T11:05:12Z","published":"2024-12-05T11:05:12Z","title":"Automated Medical Report Generation for ECG Data: Bridging Medical Text\n and Signal Processing with Deep Learning","summary":" Recent advances in deep learning and natural language generation have\nsignificantly improved image captioning, enabling automated, human-like\ndescriptions for visual content. In this work, we apply these captioning\ntechniques to generate clinician-like interpretations of ECG data. This study\nleverages existing ECG datasets accompanied by free-text reports authored by\nhealthcare professionals (HCPs) as training data. These reports, while often\ninconsistent, provide a valuable foundation for automated learning. We\nintroduce an encoder-decoder-based method that uses these reports to train\nmodels to generate detailed descriptions of ECG episodes. This represents a\nsignificant advancement in ECG analysis automation, with potential applications\nin zero-shot classification and automated clinical decision support.\n The model is tested on various datasets, including both 1- and 12-lead ECGs.\nIt significantly outperforms the state-of-the-art reference model by Qiu et\nal., achieving a METEOR score of 55.53% compared to 24.51% achieved by the\nreference model. Furthermore, several key design choices are discussed,\nproviding a comprehensive overview of current challenges and innovations in\nthis domain.\n The source codes for this research are publicly available in our Git\nrepository https://git.zib.de/ableich/ecg-comment-generation-public\n","authors":["Amnon Bleich","Antje Linnemann","Bjoern H. 
Diem","Tim OF Conrad"],"pdf_url":"https://arxiv.org/pdf/2412.04067v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.07122v2","updated":"2024-12-05T10:45:02Z","published":"2024-11-11T16:51:39Z","title":"SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering\n in LLMs","summary":" Large Language Models (LLMs) have demonstrated remarkable capabilities in\ngenerating human-like text, but their output may not be aligned with the user\nor even produce harmful content. This paper presents a novel approach to detect\nand steer concepts such as toxicity before generation. We introduce the Sparse\nConditioned Autoencoder (SCAR), a single trained module that extends the\notherwise untouched LLM. SCAR ensures full steerability, towards and away from\nconcepts (e.g., toxic content), without compromising the quality of the model's\ntext generation on standard evaluation benchmarks. We demonstrate the effective\napplication of our approach through a variety of concepts, including toxicity,\nsafety, and writing style alignment. As such, this work establishes a robust\nframework for controlling LLM generations, ensuring their ethical and safe\ndeployment in real-world applications.\n","authors":["Ruben Härle","Felix Friedrich","Manuel Brack","Björn Deiseroth","Patrick Schramowski","Kristian Kersting"],"pdf_url":"https://arxiv.org/pdf/2411.07122v2.pdf","comment":"Accepted at Socially Responsible Language Modelling Research (SoLaR)\n Workshop at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.04046v1","updated":"2024-12-05T10:37:38Z","published":"2024-12-05T10:37:38Z","title":"Hostility Detection in UK Politics: A Dataset on Online Abuse Targeting\n MPs","summary":" Numerous politicians use social media platforms, particularly X, to engage\nwith their constituents. This interaction allows constituents to pose questions\nand offer feedback but also exposes politicians to a barrage of hostile\nresponses, especially given the anonymity afforded by social media. 
They are\ntypically targeted in relation to their governmental role, but the comments\nalso tend to attack their personal identity. This can discredit politicians and\nreduce public trust in the government. It can also incite anger and disrespect,\nleading to offline harm and violence. While numerous models exist for detecting\nhostility in general, they lack the specificity required for political\ncontexts. Furthermore, addressing hostility towards politicians demands\ntailored approaches due to the distinct language and issues inherent to each\ncountry (e.g., Brexit for the UK). To bridge this gap, we construct a dataset\nof 3,320 English tweets spanning a two-year period manually annotated for\nhostility towards UK MPs. Our dataset also captures the targeted identity\ncharacteristics (race, gender, religion, none) in hostile tweets. We perform\nlinguistic and topical analyses to delve into the unique content of the UK\npolitical data. Finally, we evaluate the performance of pre-trained language\nmodels and large language models on binary hostility detection and multi-class\ntargeted identity type classification tasks. Our study offers valuable data and\ninsights for future research on the prevalence and nature of politics-related\nhostility specific to the UK.\n","authors":["Mugdha Pandya","Mali Jin","Kalina Bontcheva","Diana Maynard"],"pdf_url":"https://arxiv.org/pdf/2412.04046v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02788v2","updated":"2024-12-05T10:30:56Z","published":"2024-12-03T19:37:00Z","title":"Hybrid-SQuAD: Hybrid Scholarly Question Answering Dataset","summary":" Existing Scholarly Question Answering (QA) methods typically target\nhomogeneous data sources, relying solely on either text or Knowledge Graphs\n(KGs). However, scholarly information often spans heterogeneous sources,\nnecessitating the development of QA systems that integrate information from\nmultiple heterogeneous data sources. 
To address this challenge, we introduce\nHybrid-SQuAD (Hybrid Scholarly Question Answering Dataset), a novel large-scale\nQA dataset designed to facilitate answering questions incorporating both text\nand KG facts. The dataset consists of 10.5K question-answer pairs generated by\na large language model, leveraging the KGs DBLP and SemOpenAlex alongside\ncorresponding text from Wikipedia. In addition, we propose a RAG-based baseline\nhybrid QA model, achieving an exact match score of 69.65 on the Hybrid-SQuAD\ntest set.\n","authors":["Tilahun Abedissa Taffa","Debayan Banerjee","Yaregal Assabie","Ricardo Usbeck"],"pdf_url":"https://arxiv.org/pdf/2412.02788v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04026v1","updated":"2024-12-05T10:00:58Z","published":"2024-12-05T10:00:58Z","title":"M$^{3}$D: A Multimodal, Multilingual and Multitask Dataset for Grounded\n Document-level Information Extraction","summary":" Multimodal information extraction (IE) tasks have attracted increasing\nattention because many studies have shown that multimodal information benefits\ntext information extraction. However, existing multimodal IE datasets mainly\nfocus on sentence-level image-facilitated IE in English text, and pay little\nattention to video-based multimodal IE and fine-grained visual grounding.\nTherefore, in order to promote the development of multimodal IE, we constructed\na multimodal multilingual multitask dataset, named M$^{3}$D, which has the\nfollowing features: (1) It contains paired document-level text and video to\nenrich multimodal information; (2) It supports two widely-used languages,\nnamely English and Chinese; (3) It includes more multimodal IE tasks such as\nentity recognition, entity chain extraction, relation extraction and visual\ngrounding. In addition, our dataset introduces an unexplored theme, i.e.,\nbiography, enriching the domains of multimodal IE resources. 
To establish a\nbenchmark for our dataset, we propose an innovative hierarchical multimodal IE\nmodel. This model effectively leverages and integrates multimodal information\nthrough a Denoised Feature Fusion Module (DFFM). Furthermore, in non-ideal\nscenarios, modal information is often incomplete. Thus, we designed a Missing\nModality Construction Module (MMCM) to alleviate the issues caused by missing\nmodalities. Our model achieved an average performance of 53.80% and 53.77% on\nfour tasks in English and Chinese datasets, respectively, which set a\nreasonable standard for subsequent research. In addition, we conducted more\nanalytical experiments to verify the effectiveness of our proposed module. We\nbelieve that our work can promote the development of the field of multimodal\nIE.\n","authors":["Jiang Liu","Bobo Li","Xinran Yang","Na Yang","Hao Fei","Mingyao Zhang","Fei Li","Donghong Ji"],"pdf_url":"https://arxiv.org/pdf/2412.04026v1.pdf","comment":"14 pages, 9 figures, 6 tables"},{"id":"http://arxiv.org/abs/2412.04025v1","updated":"2024-12-05T10:00:49Z","published":"2024-12-05T10:00:49Z","title":"Exploring the Influence of Label Aggregation on Minority Voices:\n Implications for Dataset Bias and Model Training","summary":" Resolving disagreement in manual annotation typically consists of removing\nunreliable annotators and using a label aggregation strategy such as majority\nvote or expert opinion to resolve disagreement. These may have the side-effect\nof silencing or under-representing minority but equally valid opinions. In this\npaper, we study the impact of standard label aggregation strategies on minority\nopinion representation in sexism detection. We investigate the quality and\nvalue of minority annotations, and then examine their effect on the class\ndistributions in gold labels, as well as how this affects the behaviour of\nmodels trained on the resulting datasets. 
Finally, we discuss the potential\nbiases introduced by each method and how they can be amplified by the models.\n","authors":["Mugdha Pandya","Nafise Sadat Moosavi","Diana Maynard"],"pdf_url":"https://arxiv.org/pdf/2412.04025v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19846v6","updated":"2024-12-05T09:56:35Z","published":"2024-05-30T08:50:55Z","title":"Quest: Query-centric Data Synthesis Approach for Long-context Scaling of\n Large Language Model","summary":" Recent advancements in large language models (LLMs) have highlighted the\nimportance of extending context lengths for handling complex tasks. While\ntraditional methods for training on long contexts often use filtered long\ndocuments, these approaches lead to domain imbalances, limiting model\nperformance. To address this, techniques like random document concatenation\n(Standard) and similarity-based methods (KNN, ICLM) have been developed.\nHowever, they either sacrifice semantic coherence or diversity. To balance both\naspects, we introduce Quest, a query-centric data synthesis method aggregating\nsemantically relevant yet diverse documents. Quest uses a generative model to\npredict potential queries for each document, grouping documents with similar\nqueries and keywords. 
Extensive experiments demonstrate Quest's superior\nperformance on long-context tasks, achieving remarkable results with context\nlengths of up to 1M tokens and confirming its scalability across various model\nsizes.\n","authors":["Chaochen Gao","Xing Wu","Qi Fu","Songlin Hu"],"pdf_url":"https://arxiv.org/pdf/2405.19846v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04003v1","updated":"2024-12-05T09:26:58Z","published":"2024-12-05T09:26:58Z","title":"Marco-LLM: Bridging Languages via Massive Multilingual Training for\n Cross-Lingual Enhancement","summary":" Large Language Models (LLMs) have achieved remarkable progress in recent\nyears; however, their excellent performance is still largely limited to major\nworld languages, primarily English. Many LLMs continue to face challenges with\nmultilingual tasks, especially when it comes to low-resource languages. To\naddress this issue, we introduced Marco-LLM: Massive multilingual training for\ncross-lingual enhancement LLM. We have collected a substantial amount of\nmultilingual data for several low-resource languages and conducted extensive\ncontinual pre-training using the Qwen2 models. This effort has resulted in a\nmultilingual LLM named Marco-LLM. Through comprehensive evaluations on various\nmultilingual benchmarks, including MMMLU, AGIEval, Belebele, Flores-200, XCOPA\nand many others, Marco-LLM has demonstrated substantial improvements over\nstate-of-the-art LLMs. Furthermore, Marco-LLM achieved substantial enhancements\nin any-to-any machine translation tasks, showing the effectiveness of our\nmultilingual LLM. Marco-LLM is a pioneering multilingual LLM designed to not\nonly perform exceptionally well in multilingual tasks, including low-resource\nlanguages, but also maintain strong performance in English and other major\nlanguages, closing the performance gap between high- and low-resource language\ncapabilities. 
By bridging languages, this effort demonstrates our dedication to\nensuring LLMs work accurately across various languages.\n","authors":["Lingfeng Ming","Bo Zeng","Chenyang Lyu","Tianqi Shi","Yu Zhao","Xue Yang","Yefeng Liu","Yiyu Wang","Linlong Xu","Yangyang Liu","Xiaohu Zhao","Hao Wang","Heng Liu","Hao Zhou","Huifeng Yin","Zifu Shang","Haijun Li","Longyue Wang","Weihua Luo","Kaifu Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.04003v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03987v1","updated":"2024-12-05T09:05:30Z","published":"2024-12-05T09:05:30Z","title":"MTMT: Consolidating Multiple Thinking Modes to Form a Thought Tree for\n Strengthening LLM","summary":" Large language models (LLMs) have shown limitations in tasks requiring\ncomplex logical reasoning and multi-step problem-solving. To address these\nchallenges, researchers have employed carefully designed prompts and\nflowcharts, simulating human cognitive processes to enhance LLM performance,\nsuch as the Chain of Thought approach. In this paper, we introduce MTMT\n(Multi-thinking Modes Tree), a novel method that interacts with LLMs to\nconstruct a thought tree, simulating various advanced cognitive processes,\nincluding but not limited to association, counterfactual thinking, task\ndecomposition, and comparison. By breaking down the original complex task into\nsimpler sub-questions, MTMT facilitates easier problem-solving for LLMs,\nenabling more effective utilization of the latent knowledge within LLMs. We\nevaluate the performance of MTMT under different parameter configurations,\nusing GPT-4o mini as the base model. 
Our results demonstrate that integrating\nmultiple modes of thinking significantly enhances the ability of LLMs to handle\ncomplex tasks.\n","authors":["Changcheng Li","Xiangyu Wang","Qiuju Chen","Xiren Zhou","Huanhuan Chen"],"pdf_url":"https://arxiv.org/pdf/2412.03987v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03966v1","updated":"2024-12-05T08:33:52Z","published":"2024-12-05T08:33:52Z","title":"Demonstration Selection for In-Context Learning via Reinforcement\n Learning","summary":" Diversity in demonstration selection is crucial for enhancing model\ngeneralization, as it enables a broader coverage of structures and concepts.\nHowever, constructing an appropriate set of demonstrations has remained a focal\npoint of research. This paper presents the Relevance-Diversity Enhanced\nSelection (RDES), an innovative approach that leverages reinforcement learning\nto optimize the selection of diverse reference demonstrations for text\nclassification tasks using Large Language Models (LLMs), especially in few-shot\nprompting scenarios. RDES employs a Q-learning framework to dynamically\nidentify demonstrations that maximize both diversity and relevance to the\nclassification objective by calculating a diversity score based on label\ndistribution among selected demonstrations. This method ensures a balanced\nrepresentation of reference data, leading to improved classification accuracy.\nThrough extensive experiments on four benchmark datasets and involving 12\nclosed-source and open-source LLMs, we demonstrate that RDES significantly\nenhances classification accuracy compared to ten established baselines.\nFurthermore, we investigate the incorporation of Chain-of-Thought (CoT)\nreasoning in the reasoning process, which further enhances the model's\npredictive performance. 
The results underscore the potential of reinforcement\nlearning to facilitate adaptive demonstration selection and deepen the\nunderstanding of classification challenges.\n","authors":["Xubin Wang","Jianfei Wu","Yichen Yuan","Mingzhe Li","Deyu Cai","Weijia Jia"],"pdf_url":"https://arxiv.org/pdf/2412.03966v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03930v1","updated":"2024-12-05T07:12:53Z","published":"2024-12-05T07:12:53Z","title":"MIND: Effective Incorrect Assignment Detection through a Multi-Modal\n Structure-Enhanced Language Model","summary":" The rapid growth of academic publications has exacerbated the issue of author\nname ambiguity in online digital libraries. Despite advances in name\ndisambiguation algorithms, cumulative errors continue to undermine the\nreliability of academic systems. It is estimated that over 10% paper-author\nassignments are rectified when constructing the million-scale WhoIsWho\nbenchmark. Existing endeavors to detect incorrect assignments are either\nsemantic-based or graph-based approaches, which fall short of making full use\nof the rich text attributes of papers and implicit structural features defined\nvia the co-occurrence of paper attributes. To this end, this paper introduces a\nstructure-enhanced language model that combines key structural features from\ngraph-based methods with fine-grained semantic features from rich paper\nattributes to detect incorrect assignments. The proposed model is trained with\na highly effective multi-modal multi-turn instruction tuning framework, which\nincorporates task-guided instruction tuning, text-attribute modality, and\nstructural modality. Experimental results demonstrate that our model\noutperforms previous approaches, achieving top performance on the leaderboard\nof KDD Cup 2024. 
Our code has been publicly available.\n","authors":["Yunhe Pang","Bo Chen","Fanjin Zhang","Yanghui Rao","Jie Tang"],"pdf_url":"https://arxiv.org/pdf/2412.03930v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.21216v2","updated":"2024-12-05T07:09:27Z","published":"2024-10-28T17:01:52Z","title":"HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced\n Context Awareness and Extrapolation","summary":" Many positional encodings (PEs) are designed to exhibit long-term decay,\nbased on an entrenched and long-standing inductive opinion: tokens farther away\nfrom the current position carry less relevant information. We argue that\nlong-term decay is outdated in the era of LLMs, as LLMs are now applied to\ntasks demanding precise retrieval of in-context information from arbitrary\npositions. Firstly, we present empirical analyses on various PEs, demonstrating\nthat models inherently learn attention with only a local-decay pattern while\nforming a U-shape pattern globally, contradicting the principle of long-term\ndecay. Furthermore, we conduct a detailed analysis of rotary position encoding\n(RoPE, a prevalent relative positional encoding in LLMs), and found that the\nU-shape attention is caused by some learned components, which are also the key\nfactor limiting RoPE's expressiveness and extrapolation.Inspired by these\ninsights, we propose High-frequency rotary Position Encoding (HoPE). HoPE\nreplaces the specific components in RoPE with position-independent ones,\nretaining only high-frequency signals, which also breaks the principle of\nlong-term decay in theory. HoPE achieves two major advantages: (1) Without\nconstraints imposed by long-term decay, contradictory factors that limit\nspontaneous attention optimization and model extrapolation performance are\nremoved. 
(2) Components representing positions and semantics are optimized.\nThese enhance the model's context awareness and extrapolation, as validated by\nextensive experiments.\n","authors":["Yuhan Chen","Ang Lv","Jian Luan","Bin Wang","Wei Liu"],"pdf_url":"https://arxiv.org/pdf/2410.21216v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.00741v3","updated":"2024-12-05T07:05:59Z","published":"2024-01-01T12:49:36Z","title":"ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of\n Large Language Models in Real-world Scenarios","summary":" Existing evaluations of tool learning primarily focus on validating the\nalignment of selected tools for large language models (LLMs) with expected\noutcomes. However, these approaches rely on a limited set of scenarios where\nanswers can be pre-determined, diverging from genuine needs. Furthermore, a\nsole emphasis on outcomes disregards the complex capabilities required for LLMs\nto effectively use tools. To tackle this issue, we propose ToolEyes, a\nfine-grained system tailored for the evaluation of the LLMs' tool learning\ncapabilities in authentic scenarios. The system meticulously examines seven\nreal-world scenarios, analyzing five dimensions crucial to LLMs in tool\nlearning: format alignment, intent comprehension, behavior planning, tool\nselection, and answer organization. Additionally, ToolEyes incorporates a tool\nlibrary boasting approximately 600 tools, serving as an intermediary between\nLLMs and the physical world. Evaluations involving ten LLMs across three\ncategories reveal a preference for specific scenarios and limited cognitive\nabilities in tool learning. Intriguingly, expanding the model size even\nexacerbates the hindrance to tool learning. 
The code and data are available at\nhttps://github.com/Junjie-Ye/ToolEyes.\n","authors":["Junjie Ye","Guanyu Li","Songyang Gao","Caishuang Huang","Yilong Wu","Sixian Li","Xiaoran Fan","Shihan Dou","Tao Ji","Qi Zhang","Tao Gui","Xuanjing Huang"],"pdf_url":"https://arxiv.org/pdf/2401.00741v3.pdf","comment":"Accepted by COLING 2025 conference"},{"id":"http://arxiv.org/abs/2412.03331v2","updated":"2024-12-05T07:05:57Z","published":"2024-12-04T14:02:12Z","title":"LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence\n Embeddings","summary":" Sentence embedding models play a key role in various Natural Language\nProcessing tasks, such as in Topic Modeling, Document Clustering and\nRecommendation Systems. However, these models rely heavily on parallel data,\nwhich can be scarce for many low-resource languages, including Luxembourgish.\nThis scarcity results in suboptimal performance of monolingual and\ncross-lingual sentence embedding models for these languages. To address this\nissue, we compile a relatively small but high-quality human-generated\ncross-lingual parallel dataset to train LuxEmbedder, an enhanced sentence\nembedding model for Luxembourgish with strong cross-lingual capabilities.\nAdditionally, we present evidence suggesting that including low-resource\nlanguages in parallel training datasets can be more advantageous for other\nlow-resource languages than relying solely on high-resource language pairs.\nFurthermore, recognizing the lack of sentence embedding benchmarks for\nlow-resource languages, we create a paraphrase detection benchmark specifically\nfor Luxembourgish, aiming to partially fill this gap and promote further\nresearch.\n","authors":["Fred Philippy","Siwen Guo","Jacques Klein","Tegawendé F. 
Bissyandé"],"pdf_url":"https://arxiv.org/pdf/2412.03331v2.pdf","comment":"Accepted at COLING 2025"},{"id":"http://arxiv.org/abs/2411.17993v2","updated":"2024-12-05T06:53:40Z","published":"2024-11-27T02:20:44Z","title":"DRS: Deep Question Reformulation With Structured Output","summary":" Question answering represents a core capability of large language models\n(LLMs). However, when individuals encounter unfamiliar knowledge in texts, they\noften formulate questions that the text itself cannot answer due to\ninsufficient understanding of the underlying information. Recent studies reveal\nthat while LLMs can detect unanswerable questions, they struggle to assist\nusers in reformulating these questions. Even advanced models like GPT-3.5\ndemonstrate limited effectiveness in this regard. To address this limitation,\nwe propose DRS: Deep Question Reformulation with Structured Output, a novel\nzero-shot method aimed at enhancing LLMs ability to assist users in\nreformulating questions to extract relevant information from new documents. DRS\ncombines the strengths of LLMs with a DFS-based algorithm to iteratively\nexplore potential entity combinations and constrain outputs using predefined\nentities. This structured approach significantly enhances the reformulation\ncapabilities of LLMs. 
Comprehensive experimental evaluations demonstrate that\nDRS improves the reformulation accuracy of GPT-3.5 from 23.03% to 70.42%, while\nalso enhancing the performance of open-source models, such as Gemma2-9B, from\n26.35% to 56.75%.\n","authors":["Zhecheng Li","Yiwei Wang","Bryan Hooi","Yujun Cai","Nanyun Peng","Kai-Wei Chang"],"pdf_url":"https://arxiv.org/pdf/2411.17993v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.01485v2","updated":"2024-12-05T06:52:42Z","published":"2024-10-02T12:35:53Z","title":"A Little Goes a Long Way: Efficient Long Context Training and Inference\n with Partial Contexts","summary":" Training and serving long-context large language models (LLMs) incurs\nsubstantial overhead. To address this, two critical steps are often required: a\npretrained LLM typically undergoes a separate stage for context length\nextension by training on long-context data, followed by architectural\nmodifications to reduce the overhead of KV cache during serving. This paper\nargues that integrating length extension with a GPU-friendly KV cache reduction\narchitecture not only reduces training overhead during length extension, but\nalso achieves better long-context performance. This leads to our proposed\nLongGen, which finetunes a pretrained LLM into an efficient architecture during\nlength extension. LongGen builds on three key insights: (1) Sparse attention\npatterns, such as window attention (attending to recent tokens), attention sink\n(initial ones), and blockwise sparse attention (strided token blocks) are\nwell-suited for building efficient long-context models, primarily due to their\nGPU-friendly memory access patterns, enabling efficiency gains not just\ntheoretically but in practice as well. (2) It is essential for the model to\nhave direct access to all tokens. A hybrid architecture with 1/3 full attention\nlayers and 2/3 efficient ones achieves a balanced trade-off between efficiency\nand long-context performance. 
(3) Lightweight training on 5B long-context data\nis sufficient to extend the hybrid model's context length from 4K to 128K.\n We evaluate LongGen on both Llama-2 7B and Llama-2 70B, demonstrating its\neffectiveness across different scales. During training with 128K-long contexts,\nLongGen achieves 1.55x training speedup and reduces wall-clock time by 36%,\ncompared to a full-attention baseline. During inference, LongGen reduces KV\ncache memory by 62%, achieving 1.67x prefilling speedup and 1.41x decoding\nspeedup.\n","authors":["Suyu Ge","Xihui Lin","Yunan Zhang","Jiawei Han","Hao Peng"],"pdf_url":"https://arxiv.org/pdf/2410.01485v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01644v2","updated":"2024-12-05T06:49:37Z","published":"2024-12-02T15:56:08Z","title":"Concept Based Continuous Prompts for Interpretable Text Classification","summary":" Continuous prompts have become widely adopted for augmenting performance\nacross a wide range of natural language tasks. However, the underlying\nmechanism of this enhancement remains obscure. Previous studies rely on\nindividual words for interpreting continuous prompts, which lacks comprehensive\nsemantic understanding. Drawing inspiration from Concept Bottleneck Models, we\npropose a framework for interpreting continuous prompts by decomposing them\ninto human-readable concepts. Specifically, to ensure the feasibility of the\ndecomposition, we demonstrate that a corresponding concept embedding matrix and\na coefficient matrix can always be found to replace the prompt embedding\nmatrix. Then, we employ GPT-4o to generate a concept pool and choose potential\ncandidate concepts that are discriminative and representative using a novel\nsubmodular optimization algorithm. Experiments demonstrate that our framework\ncan achieve similar results as the original P-tuning and word-based approaches\nusing only a few concepts while providing more plausible results. 
Our code is\navailable at https://github.com/qq31415926/CD.\n","authors":["Qian Chen","Dongyang Li","Xiaofeng He"],"pdf_url":"https://arxiv.org/pdf/2412.01644v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.07754v3","updated":"2024-12-05T06:49:06Z","published":"2024-02-12T16:23:28Z","title":"Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language\n Models","summary":" Recently, diffusion models have garnered significant interest in the field of\ntext processing due to their many potential advantages compared to conventional\nautoregressive models. In this work, we propose Diffusion-of-Thought (DoT), a\nnovel approach that integrates diffusion models with Chain-of-Thought, a\nwell-established technique for improving the reasoning ability of\nautoregressive language models. In contrast to autoregressive language models\nthat make decisions in a left-to-right, token-by-token manner, DoT allows\nreasoning steps to diffuse over time through a diffusion language model and\noffers greater flexibility in trading-off computation for reasoning\nperformance. Our experimental results demonstrate the effectiveness of DoT in\nmulti-digit multiplication, boolean logic, and grade school math problems, with\na small diffusion model outperforming a much larger autoregressive model in\nboth efficiency and accuracy. In addition to that, DoT showcases promising\nself-correction abilities and benefits from existing reasoning-enhancing\ntechniques like self-consistency decoding. 
Our findings contribute to the\nunderstanding and development of reasoning with diffusion language models.\n","authors":["Jiacheng Ye","Shansan Gong","Liheng Chen","Lin Zheng","Jiahui Gao","Han Shi","Chuan Wu","Xin Jiang","Zhenguo Li","Wei Bi","Lingpeng Kong"],"pdf_url":"https://arxiv.org/pdf/2402.07754v3.pdf","comment":"NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.03920v1","updated":"2024-12-05T06:46:46Z","published":"2024-12-05T06:46:46Z","title":"A Survey on Large Language Model-Based Social Agents in Game-Theoretic\n Scenarios","summary":" Game-theoretic scenarios have become pivotal in evaluating the social\nintelligence of Large Language Model (LLM)-based social agents. While numerous\nstudies have explored these agents in such settings, there is a lack of a\ncomprehensive survey summarizing the current progress. To address this gap, we\nsystematically review existing research on LLM-based social agents within\ngame-theoretic scenarios. Our survey organizes the findings into three core\ncomponents: Game Framework, Social Agent, and Evaluation Protocol. The game\nframework encompasses diverse game scenarios, ranging from choice-focusing to\ncommunication-focusing games. The social agent part explores agents'\npreferences, beliefs, and reasoning abilities. 
The evaluation protocol covers\nboth game-agnostic and game-specific metrics for assessing agent performance.\nBy reflecting on the current research and identifying future research\ndirections, this survey provides insights to advance the development and\nevaluation of social agents in game-theoretic scenarios.\n","authors":["Xiachong Feng","Longxu Dou","Ella Li","Qinghao Wang","Haochuan Wang","Yu Guo","Chang Ma","Lingpeng Kong"],"pdf_url":"https://arxiv.org/pdf/2412.03920v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03904v1","updated":"2024-12-05T06:20:47Z","published":"2024-12-05T06:20:47Z","title":"MISR: Measuring Instrumental Self-Reasoning in Frontier Models","summary":" We propose a suite of tasks to evaluate the instrumental self-reasoning\nability of large language model (LLM) agents. Instrumental self-reasoning\nability could improve adaptability and enable self-modification, but it could\nalso pose significant risks, such as enabling deceptive alignment. Prior work\nhas only evaluated self-reasoning in non-agentic settings or in limited\ndomains. In this paper, we propose evaluations for instrumental self-reasoning\nability in agentic tasks in a wide range of scenarios, including\nself-modification, knowledge seeking, and opaque self-reasoning. We evaluate\nagents built using state-of-the-art LLMs, including commercial and open source\nsystems. We find that instrumental self-reasoning ability emerges only in the\nmost capable frontier models and that it is highly context-dependent. No model\npasses the most difficult versions of our evaluations, hence our evaluation\ncan be used to measure increases in instrumental self-reasoning ability in\nfuture models. 
We open-source our evaluations at\nhttps://github.com/kaifronsdal/Self-Reasoning-Evals.\n","authors":["Kai Fronsdal","David Lindner"],"pdf_url":"https://arxiv.org/pdf/2412.03904v1.pdf","comment":"10 pages, 65 page appendix, 5 figures"},{"id":"http://arxiv.org/abs/2404.14215v2","updated":"2024-12-05T06:02:59Z","published":"2024-04-22T14:31:28Z","title":"Text-Tuple-Table: Towards Information Integration in Text-to-Table\n Generation via Global Tuple Extraction","summary":" The task of condensing large chunks of textual information into concise and\nstructured tables has gained attention recently due to the emergence of Large\nLanguage Models (LLMs) and their potential benefit for downstream tasks, such\nas text summarization and text mining. Previous approaches often generate\ntables that directly replicate information from the text, limiting their\napplicability in broader contexts, as text-to-table generation in real-life\nscenarios necessitates information extraction, reasoning, and integration.\nHowever, there is a lack of both datasets and methodologies towards this task.\nIn this paper, we introduce LiveSum, a new benchmark dataset created for\ngenerating summary tables of competitions based on real-time commentary texts.\nWe evaluate the performances of state-of-the-art LLMs on this task in both\nfine-tuning and zero-shot settings, and additionally propose a novel pipeline\ncalled $T^3$(Text-Tuple-Table) to improve their performances. Extensive\nexperimental results demonstrate that LLMs still struggle with this task even\nafter fine-tuning, while our approach can offer substantial performance gains\nwithout explicit training. Further analyses demonstrate that our method\nexhibits strong generalization abilities, surpassing previous approaches on\nseveral other text-to-table datasets. 
Our code and data can be found at\nhttps://github.com/HKUST-KnowComp/LiveSum.\n","authors":["Zheye Deng","Chunkit Chan","Weiqi Wang","Yuxi Sun","Wei Fan","Tianshi Zheng","Yauwai Yim","Yangqiu Song"],"pdf_url":"https://arxiv.org/pdf/2404.14215v2.pdf","comment":"Accepted to EMNLP 2024"},{"id":"http://arxiv.org/abs/2412.03886v1","updated":"2024-12-05T05:39:03Z","published":"2024-12-05T05:39:03Z","title":"Uniform Discretized Integrated Gradients: An effective attribution based\n method for explaining large language models","summary":" Integrated Gradients is a well-known technique for explaining deep learning\nmodels. It calculates feature importance scores by employing a gradient based\napproach computing gradients of the model output with respect to input features\nand accumulating them along a linear path. While this works well for continuous\nfeatures spaces, it may not be the most optimal way to deal with discrete\nspaces like word embeddings. For interpreting LLMs (Large Language Models),\nthere exists a need for a non-linear path where intermediate points, whose\ngradients are to be computed, lie close to actual words in the embedding space.\nIn this paper, we propose a method called Uniform Discretized Integrated\nGradients (UDIG) based on a new interpolation strategy where we choose a\nfavorable nonlinear path for computing attribution scores suitable for\npredictive language models. We evaluate our method on two types of NLP tasks-\nSentiment Classification and Question Answering against three metrics viz Log\nodds, Comprehensiveness and Sufficiency. 
For sentiment classification, we have\nused the SST2, IMDb and Rotten Tomatoes datasets for benchmarking and for\nQuestion Answering, we have used the fine-tuned BERT model on SQuAD dataset.\nOur approach outperforms the existing methods in almost all the metrics.\n","authors":["Swarnava Sinha Roy","Ayan Kundu"],"pdf_url":"https://arxiv.org/pdf/2412.03886v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03877v1","updated":"2024-12-05T05:18:09Z","published":"2024-12-05T05:18:09Z","title":"AyutthayaAlpha: A Thai-Latin Script Transliteration Transformer","summary":" This study introduces AyutthayaAlpha, an advanced transformer-based machine\nlearning model designed for the transliteration of Thai proper names into Latin\nscript. Our system achieves state-of-the-art performance with 82.32%\nfirst-token accuracy and 95.24% first-three-token accuracy, while maintaining a\nlow character error rate of 0.0047. The complexity of Thai phonology, including\ntonal features and vowel length distinctions, presents significant challenges\nfor accurate transliteration, which we address through a novel two-model\napproach: AyutthayaAlpha-Small, based on the ByT5 architecture, and\nAyutthayaAlpha-VerySmall, a computationally efficient variant that unexpectedly\noutperforms its larger counterpart. Our research combines linguistic rules with\ndeep learning, training on a carefully curated dataset of 1.2 million\nThai-Latin name pairs, augmented through strategic upsampling to 2.7 million\nexamples. Extensive evaluations against existing transliteration methods and\nhuman expert benchmarks demonstrate that AyutthayaAlpha not only achieves\nsuperior accuracy but also effectively captures personal and cultural\npreferences in name romanization. 
The system's practical applications extend to\ncross-lingual information retrieval, international data standardization, and\nidentity verification systems, with particular relevance for government\ndatabases, academic institutions, and global business operations. This work\nrepresents a significant advance in bridging linguistic gaps between Thai and\nLatin scripts, while respecting the cultural and personal dimensions of name\ntransliteration.\n","authors":["Davor Lauc","Attapol Rutherford","Weerin Wongwarawipatr"],"pdf_url":"https://arxiv.org/pdf/2412.03877v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.11534v2","updated":"2024-12-05T04:40:54Z","published":"2024-02-18T10:15:38Z","title":"PreAct: Prediction Enhances Agent's Planning Ability","summary":" Addressing the disparity between forecasts and actual results can enable\nindividuals to expand their thought processes and stimulate self-reflection,\nthus promoting accurate planning. In this research, we present **PreAct**, an\nagent framework that integrates **pre**diction, **rea**soning, and **act**ion.\nBy utilizing the information derived from predictions, the large language model\n(LLM) agent can provide a wider range and more strategically focused reasoning.\nThis leads to more efficient actions that aid the agent in accomplishing\nintricate tasks. Our experimental results show that PreAct surpasses the ReAct\nmethod in completing complex tasks and that PreAct's performance can be further\nimproved when paired with other memory or selection strategy techniques. 
We\npresented the model with varying quantities of historical predictions and\ndiscovered that these predictions consistently enhance LLM planning. The\nvariances in single-step reasoning between PreAct and ReAct indicate that\nPreAct indeed has benefits in terms of diversity and strategic orientation over\nReAct.\n","authors":["Dayuan Fu","Jianzhao Huang","Siyuan Lu","Guanting Dong","Yejie Wang","Keqing He","Weiran Xu"],"pdf_url":"https://arxiv.org/pdf/2402.11534v2.pdf","comment":"Coling 2025"},{"id":"http://arxiv.org/abs/2402.10659v4","updated":"2024-12-05T04:35:22Z","published":"2024-02-16T13:10:14Z","title":"Network Formation and Dynamics Among Multi-LLMs","summary":" Social networks fundamentally shape human opinions, behaviors, and the\ndissemination of information. As large language models (LLMs) like GPT, Claude,\nand Llama increasingly integrate into social and professional settings,\nunderstanding their behavior in the context of social interactions and network\nformation becomes essential. This study develops a framework to systematically\nexamine whether the network formation behaviors of multiple LLMs approximate\ncertain aspects of human network dynamics. By simulating interactions among LLM\nagents across various model families, we observe that these models consistently\nexhibit key patterns associated with social network principles including\npreferential attachment, triadic closure, homophily, community structure, and\nthe small-world phenomenon when forming networks. 
Moreover, LLMs adapt their\nnetwork formation strategies based on each network's characteristics,\nreflecting the context-dependent nature of human behavior: in Facebook\nnetworks, they prioritize triadic closure and homophily, mirroring close-knit\nfriendships; in phone networks, homophily and preferential attachment dominate,\ncapturing personal and professional connections, while in employment networks,\nLLMs favor heterophily and high-degree connections, aligning with career\nadvancement dynamics. These results open new avenues for using LLMs in network\nscience research, with potential applications in agent-based modeling and\nsynthetic network generation.\n","authors":["Marios Papachristou","Yuan Yuan"],"pdf_url":"https://arxiv.org/pdf/2402.10659v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01253v3","updated":"2024-12-05T04:29:49Z","published":"2024-12-02T08:22:56Z","title":"Yi-Lightning Technical Report","summary":" This technical report presents Yi-Lightning, our latest flagship large\nlanguage model (LLM). It achieves exceptional performance, ranking 6th overall\non Chatbot Arena, with particularly strong results (2nd to 4th place) in\nspecialized categories including Chinese, Math, Coding, and Hard Prompts.\nYi-Lightning leverages an enhanced Mixture-of-Experts (MoE) architecture,\nfeaturing advanced expert segmentation and routing mechanisms coupled with\noptimized KV-caching techniques. Our development process encompasses\ncomprehensive pre-training, supervised fine-tuning (SFT), and reinforcement\nlearning from human feedback (RLHF), where we devise deliberate strategies for\nmulti-stage training, synthetic data construction, and reward modeling.\nFurthermore, we implement RAISE (Responsible AI Safety Engine), a\nfour-component framework to address safety issues across pre-training,\npost-training, and serving phases. 
Empowered by our scalable super-computing\ninfrastructure, all these innovations substantially reduce training, deployment\nand inference costs while maintaining high-performance standards. With further\nevaluations on public academic benchmarks, Yi-Lightning demonstrates\ncompetitive performance against top-tier LLMs, while we observe a notable\ndisparity between traditional, static benchmark results and real-world, dynamic\nhuman preferences. This observation prompts a critical reassessment of\nconventional benchmarks' utility in guiding the development of more intelligent\nand powerful AI systems for practical applications. Yi-Lightning is now\navailable through our developer platform at https://platform.lingyiwanwu.com.\n","authors":["01. AI"," :","Alan Wake","Albert Wang","Bei Chen","C. X. Lv","Chao Li","Chengen Huang","Chenglin Cai","Chujie Zheng","Daniel Cooper","Ethan Dai","Fan Zhou","Feng Hu","Heng Ji","Howard Qiu","Jiangcheng Zhu","Jun Tian","Katherine Su","Lihuan Zhang","Liying Li","Ming Song","Mou Li","Peng Liu","Qicheng Hu","Shawn Wang","Shijun Zhou","Shiyong Li","Tianhang Zhu","Wen Xie","Xiang He","Xiaobo Chen","Xiaohui Hu","Xiaoyi Ren","Xinyao Niu","Yanpeng Li","Yongke Zhao","Yongzhen Luo","Yuchi Xu","Yuxuan Sha","Zhaodong Yan","Zhiyuan Liu","Zirui Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.01253v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.00273v3","updated":"2024-12-05T04:19:45Z","published":"2024-05-01T01:45:50Z","title":"Social Life Simulation for Non-Cognitive Skills Learning","summary":" Non-cognitive skills are crucial for personal and social life well-being, and\nsuch skill development can be supported by narrative-based (e.g., storytelling)\ntechnologies. 
While generative AI enables interactive and role-playing\nstorytelling, little is known about how users engage with and perceive the use\nof AI in social life simulation for non-cognitive skills learning.\nAdditionally, the benefits of AI mentorship on self-reflection awareness and\nability in this context remain largely underexplored. To this end, we\nintroduced Simulife++, an interactive platform enabled by a large language\nmodel (LLM). The system allows users to act as protagonists, creating stories\nwith one or multiple AI-based characters in diverse social scenarios. In\nparticular, we expanded the Human-AI interaction to a Human-AI-AI collaboration\nby including a Sage Agent, who acts as a bystander, providing users with some\nperspectives and guidance on their choices and conversations in terms of\nnon-cognitive skills to promote reflection. In a within-subject user study, our\nquantitative results reveal that, when accompanied by Sage Agent, users exhibit\nsignificantly higher levels of reflection on motivation, self-perceptions, and\nresilience & coping, along with an enhanced experience of narrative\ntransportation. Additionally, our qualitative findings suggest that Sage Agent\nplays a crucial role in promoting reflection on non-cognitive skills, enhancing\nsocial communication and decision-making performance, and improving overall\nuser experience within Simulife++. Multiple supportive relationships between\nSage Agent and users were also reported. 
We offer design implications for the\napplication of generative AI in narrative solutions and the future potential of\nSage Agent for non-cognitive skill development in broader social contexts.\n","authors":["Zihan Yan","Yaohong Xiang","Yun Huang"],"pdf_url":"https://arxiv.org/pdf/2405.00273v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.05889v3","updated":"2024-12-05T04:16:54Z","published":"2024-02-08T18:27:22Z","title":"CREMA: Generalizable and Efficient Video-Language Reasoning via\n Multimodal Modular Fusion","summary":" Despite impressive advancements in recent multimodal reasoning approaches,\nthey are still limited in flexibility and efficiency, as these models typically\nprocess only a few fixed modality inputs and require updates to numerous\nparameters. This paper tackles these critical challenges and proposes CREMA, a\ngeneralizable, highly efficient, and modular modality-fusion framework that can\nincorporate any new modality to enhance video reasoning. We first augment\nmultiple informative modalities (such as optical flow, 3D point cloud, audio,\nthermal heatmap, and touch map) from given videos without extra human\nannotation by leveraging sensors or existing pre-trained models. Next, we\nintroduce a query transformer with multiple parameter-efficient modules\nassociated with each accessible modality. It projects diverse modality features\nto the LLM token embedding space, allowing the model to integrate different\ndata types for response generation. Furthermore, we propose a novel progressive\nmultimodal fusion design supported by a lightweight fusion module and\nmodality-sequential training strategy. It helps compress information across\nvarious assisting modalities, maintaining computational efficiency in the LLM\nwhile improving performance. 
We validate our method on 7 video-language\nreasoning tasks assisted by diverse modalities, including conventional VideoQA\nand Video-Audio/3D/Touch/Thermal QA, and achieve better/equivalent performance\nagainst strong multimodal LLMs, including OneLLM, BLIP-2, and SeViLA while\nreducing over 90% trainable parameters. We provide extensive analyses of CREMA,\nincluding the impact of each modality on reasoning domains, the design of the\nfusion module, and example visualizations.\n","authors":["Shoubin Yu","Jaehong Yoon","Mohit Bansal"],"pdf_url":"https://arxiv.org/pdf/2402.05889v3.pdf","comment":"first two authors contributed equally. Project page:\n https://CREMA-VideoLLM.github.io/"},{"id":"http://arxiv.org/abs/2405.18711v2","updated":"2024-12-05T04:01:28Z","published":"2024-05-29T02:44:12Z","title":"Calibrating Reasoning in Language Models with Internal Consistency","summary":" Large language models (LLMs) have demonstrated impressive capabilities in\nvarious reasoning tasks, aided by techniques like chain-of-thought prompting\nthat elicits verbalized reasoning. However, LLMs often generate text with\nobvious mistakes and contradictions, raising doubts about their ability to\nrobustly process and utilize generated rationales. In this work, we investigate\nreasoning in LLMs through the lens of internal representations, focusing on how\nthese representations are influenced by generated rationales. Our preliminary\nanalysis reveals that while generated rationales improve answer accuracy,\ninconsistencies emerge between the model's internal representations in middle\nlayers and those in final layers, potentially undermining the reliability of\ntheir reasoning processes. To address this, we propose internal consistency as\na measure of the model's confidence by examining the agreement of latent\npredictions decoded from intermediate layers. 
Extensive empirical studies\nacross different models and datasets demonstrate that internal consistency\neffectively distinguishes between correct and incorrect reasoning paths.\nMotivated by this, we propose a new approach to calibrate reasoning by\nup-weighting reasoning paths with high internal consistency, resulting in a\nsignificant boost in reasoning performance. Further analysis uncovers distinct\npatterns in attention and feed-forward modules across layers, providing\ninsights into the emergence of internal inconsistency. In summary, our results\ndemonstrate the potential of using internal representations for self-evaluation\nof LLMs. Our code is available at github.com/zhxieml/internal-consistency.\n","authors":["Zhihui Xie","Jizhou Guo","Tong Yu","Shuai Li"],"pdf_url":"https://arxiv.org/pdf/2405.18711v2.pdf","comment":"NeurIPS 2024 camera ready"},{"id":"http://arxiv.org/abs/2412.03853v1","updated":"2024-12-05T03:58:13Z","published":"2024-12-05T03:58:13Z","title":"Automated LaTeX Code Generation from Handwritten Math Expressions Using\n Vision Transformer","summary":" Converting mathematical expressions into LaTeX is challenging. In this paper,\nwe explore using newer transformer based architectures for addressing the\nproblem of converting handwritten/digital mathematical expression images into\nequivalent LaTeX code. We use the current state of the art CNN encoder and RNN\ndecoder as a baseline for our experiments. We also investigate improvements to\nCNN-RNN architecture by replacing the CNN encoder with the ResNet50 model. 
Our\nexperiments show that transformer architectures achieve a higher overall\naccuracy and BLEU scores along with lower Levenschtein scores compared to the\nbaseline CNN/RNN architecture with room to achieve even better results with\nappropriate fine-tuning of model parameters.\n","authors":["Jayaprakash Sundararaj","Akhil Vyas","Benjamin Gonzalez-Maldonado"],"pdf_url":"https://arxiv.org/pdf/2412.03853v1.pdf","comment":"7 pages; 3 figures"},{"id":"http://arxiv.org/abs/2412.03847v1","updated":"2024-12-05T03:27:02Z","published":"2024-12-05T03:27:02Z","title":"Educational-Psychological Dialogue Robot Based on Multi-Agent\n Collaboration","summary":" Intelligent dialogue systems are increasingly used in modern education and\npsychological counseling fields, but most existing systems are limited to a\nsingle domain, cannot deal with both educational and psychological issues, and\noften lack accuracy and professionalism when dealing with complex issues. To\naddress these problems, this paper proposes an intelligent dialog system that\ncombines educational and psychological counseling functions. The system\nconsists of multiple AI agent, including security detection agent, intent\nidentification agent, educational LLM agent, and psychological LLM agent, which\nwork in concert to ensure the provision of accurate educational knowledge Q\\&A\nand psychological support services. 
Specifically, the system recognizes\nuser-input intentions through an intention classification model and invokes a\nretrieval-enhanced educational grand model and a psychological grand model\nfine-tuned with psychological data in order to provide professional educational\nadvice and psychological support.\n","authors":["Shiwen Ni","Min Yang"],"pdf_url":"https://arxiv.org/pdf/2412.03847v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.12027v4","updated":"2024-12-05T03:26:13Z","published":"2024-03-18T17:57:09Z","title":"From Pixels to Insights: A Survey on Automatic Chart Understanding in\n the Era of Large Foundation Models","summary":" Data visualization in the form of charts plays a pivotal role in data\nanalysis, offering critical insights and aiding in informed decision-making.\nAutomatic chart understanding has witnessed significant advancements with the\nrise of large foundation models in recent years. Foundation models, such as\nlarge language models, have revolutionized various natural language processing\ntasks and are increasingly being applied to chart understanding tasks. This\nsurvey paper provides a comprehensive overview of the recent developments,\nchallenges, and future directions in chart understanding within the context of\nthese foundation models. We review fundamental building blocks crucial for\nstudying chart understanding tasks. Additionally, we explore various tasks and\ntheir evaluation metrics and sources of both charts and textual inputs. Various\nmodeling strategies are then examined, encompassing both classification-based\nand generation-based approaches, along with tool augmentation techniques that\nenhance chart understanding performance. Furthermore, we discuss the\nstate-of-the-art performance of each task and discuss how we can improve the\nperformance. 
Challenges and future directions are addressed, highlighting the\nimportance of several topics, such as domain-specific charts, lack of efforts\nin developing evaluation metrics, and agent-oriented settings. This survey\npaper serves as a comprehensive resource for researchers and practitioners in\nthe fields of natural language processing, computer vision, and data analysis,\nproviding valuable insights and directions for future research in chart\nunderstanding leveraging large foundation models. The studies mentioned in this\npaper, along with emerging new research, will be continually updated at:\nhttps://github.com/khuangaf/Awesome-Chart-Understanding.\n","authors":["Kung-Hsiang Huang","Hou Pong Chan","Yi R. Fung","Haoyi Qiu","Mingyang Zhou","Shafiq Joty","Shih-Fu Chang","Heng Ji"],"pdf_url":"https://arxiv.org/pdf/2403.12027v4.pdf","comment":"IEEE Transactions on Knowledge and Data Engineering (TKDE)"},{"id":"http://arxiv.org/abs/2410.12949v2","updated":"2024-12-05T02:55:35Z","published":"2024-10-16T18:35:02Z","title":"Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via\n Mechanistic Localization","summary":" Methods for knowledge editing and unlearning in large language models seek to\nedit or remove undesirable knowledge or capabilities without compromising\ngeneral language modeling performance. This work investigates how mechanistic\ninterpretability -- which, in part, aims to identify model components\n(circuits) associated to specific interpretable mechanisms that make up a model\ncapability -- can improve the precision and effectiveness of editing and\nunlearning. We find a stark difference in unlearning and edit robustness when\ntraining components localized by different methods. We highlight an important\ndistinction between methods that localize components based primarily on\npreserving outputs, and those finding high level mechanisms with predictable\nintermediate states. 
In particular, localizing edits/unlearning to components\nassociated with the lookup-table mechanism for factual recall 1) leads to more\nrobust edits/unlearning across different input/output formats, and 2) resists\nattempts to relearn the unwanted information, while also reducing unintended\nside effects compared to baselines, on both a sports facts dataset and the\nCounterFact dataset across multiple models. We also find that certain localized\nedits disrupt the latent knowledge in the model more than any other baselines,\nmaking unlearning more robust to various attacks.\n","authors":["Phillip Guo","Aaquib Syed","Abhay Sheshadri","Aidan Ewart","Gintare Karolina Dziugaite"],"pdf_url":"https://arxiv.org/pdf/2410.12949v2.pdf","comment":"31 pages, 45 figures, 7 tables"},{"id":"http://arxiv.org/abs/2411.11266v4","updated":"2024-12-05T02:48:32Z","published":"2024-11-18T03:45:34Z","title":"VersaTune: An Efficient Data Composition Framework for Training\n Multi-Capability LLMs","summary":" Large-scale pretrained models, particularly Large Language Models (LLMs),\nhave exhibited remarkable capabilities in handling multiple tasks across\ndomains due to their emergent properties. These capabilities are further\naugmented during the Supervised Fine-Tuning (SFT) phase. Despite their\npotential, existing work mainly focuses on domain-specific enhancements during\nfine-tuning, the challenge of which lies in catastrophic forgetting of\nknowledge across other domains. In this study, we introduce VersaTune, a novel\ndata composition framework designed for enhancing LLMs' overall multi-ability\nperformances during training. We categorize knowledge into distinct domains\nincluding law, medicine, finance, science, code, etc. We begin with detecting\nthe distribution of domain-specific knowledge within the base model, followed\nby the training data composition that aligns with the model's existing\nknowledge distribution. 
During the training process, domain weights are\ndynamically adjusted based on their learnable potential and forgetting degree.\nExperimental results demonstrate that VersaTune achieves significant\nimprovements in multi-domain performance, with an 35.21% enhancement in\ncomprehensive multi-domain tasks. Additionally, in scenarios where specific\ndomain optimization is required, VersaTune reduces the degradation of\nperformance in other domains by 38.77%, without compromising the target\ndomain's training efficacy.\n","authors":["Keer Lu","Keshi Zhao","Zheng Liang","Da Pan","Shusen Zhang","Xin Wu","Weipeng Chen","Zenan Zhou","Guosheng Dong","Bin Cui","Wentao Zhang"],"pdf_url":"https://arxiv.org/pdf/2411.11266v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.06851v2","updated":"2024-12-05T02:43:36Z","published":"2023-06-12T03:54:04Z","title":"UniPoll: A Unified Social Media Poll Generation Framework via\n Multi-Objective Optimization","summary":" Social media platforms are vital for expressing opinions and understanding\npublic sentiment, yet many analytical tools overlook passive users who mainly\nconsume content without engaging actively. To address this, we introduce\nUniPoll, an advanced framework designed to automatically generate polls from\nsocial media posts using sophisticated natural language generation (NLG)\ntechniques. Unlike traditional methods that struggle with social media's\ninformal and context-sensitive nature, UniPoll leverages enriched contexts from\nuser comments and employs multi-objective optimization to enhance poll\nrelevance and engagement. To tackle the inherently noisy nature of social media\ndata, UniPoll incorporates Retrieval-Augmented Generation (RAG) and synthetic\ndata generation, ensuring robust performance across real-world scenarios. 
The\nframework surpasses existing models, including T5, ChatGLM3, and GPT-3.5, in\ngenerating coherent and contextually appropriate question-answer pairs.\nEvaluated on the Chinese WeiboPolls dataset and the newly introduced English\nRedditPolls dataset, UniPoll demonstrates superior cross-lingual and\ncross-platform capabilities, making it a potent tool to boost user engagement\nand create a more inclusive environment for interaction.\n","authors":["Yixia Li","Rong Xiang","Yanlin Song","Jing Li"],"pdf_url":"https://arxiv.org/pdf/2306.06851v2.pdf","comment":"Accepted by IEEE Transactions on Neural Networks and Learning\n Systems. Project page is live at https://uni-poll.github.io . Code are\n available at https://github.com/X1AOX1A/UniPoll"},{"id":"http://arxiv.org/abs/2412.03822v1","updated":"2024-12-05T02:35:46Z","published":"2024-12-05T02:35:46Z","title":"Beyond the Binary: Capturing Diverse Preferences With Reward\n Regularization","summary":" Large language models (LLMs) are increasingly deployed via public-facing\ninterfaces to interact with millions of users, each with diverse preferences.\nDespite this, preference tuning of LLMs predominantly relies on reward models\ntrained using binary judgments where annotators select the preferred choice out\nof pairs of model outputs. In this work, we argue that this reliance on binary\nchoices does not capture the broader, aggregate preferences of the target user\nin real-world tasks. We propose a taxonomy that identifies two dimensions of\nsubjectivity where different users disagree on the preferred output-namely, the\nPlurality of Responses to Prompts, where prompts allow for multiple correct\nanswers, and the Indistinguishability of Responses, where candidate outputs are\nparaphrases of each other. We show that reward models correlate weakly with\nuser preferences in these cases. 
As a first step to address this issue, we\nintroduce a simple yet effective method that augments existing binary\npreference datasets with synthetic preference judgments to estimate potential\nuser disagreement. Incorporating these via a margin term as a form of\nregularization during model training yields predictions that better align with\nthe aggregate user preferences.\n","authors":["Vishakh Padmakumar","Chuanyang Jin","Hannah Rose Kirk","He He"],"pdf_url":"https://arxiv.org/pdf/2412.03822v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03817v1","updated":"2024-12-05T02:18:35Z","published":"2024-12-05T02:18:35Z","title":"Detecting Redundant Health Survey Questions Using Language-agnostic BERT\n Sentence Embedding (LaBSE)","summary":" The goal of this work was to compute the semantic similarity among publicly\navailable health survey questions in order to facilitate the standardization of\nsurvey-based Person-Generated Health Data (PGHD). We compiled various health\nsurvey questions authored in both English and Korean from the NIH CDE\nRepository, PROMIS, Korean public health agencies, and academic publications.\nQuestions were drawn from various health lifelog domains. A randomized question\npairing scheme was used to generate a Semantic Text Similarity (STS) dataset\nconsisting of 1758 question pairs. Similarity scores between each question pair\nwere assigned by two human experts. The tagged dataset was then used to build\nthree classifiers featuring: Bag-of-Words, SBERT with BERT-based embeddings,\nand SBRET with LaBSE embeddings. The algorithms were evaluated using\ntraditional contingency statistics. Among the three algorithms, SBERT-LaBSE\ndemonstrated the highest performance in assessing question similarity across\nboth languages, achieving an Area Under the Receiver Operating Characteristic\n(ROC) and Precision-Recall Curves of over 0.99. 
Additionally, it proved\neffective in identifying cross-lingual semantic similarities.The SBERT-LaBSE\nalgorithm excelled at aligning semantically equivalent sentences across both\nlanguages but encountered challenges in capturing subtle nuances and\nmaintaining computational efficiency. Future research should focus on testing\nwith larger multilingual datasets and on calibrating and normalizing scores\nacross the health lifelog domains to improve consistency. This study introduces\nthe SBERT-LaBSE algorithm for calculating semantic similarity across two\nlanguages, showing it outperforms BERT-based models and the Bag of Words\napproach, highlighting its potential to improve semantic interoperability of\nsurvey-based PGHD across language barriers.\n","authors":["Sunghoon Kang","Hyeoneui Kim","Hyewon Park","Ricky Taira"],"pdf_url":"https://arxiv.org/pdf/2412.03817v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03815v1","updated":"2024-12-05T02:18:03Z","published":"2024-12-05T02:18:03Z","title":"Synergizing LLMs and Knowledge Graphs: A Novel Approach to Software\n Repository-Related Question Answering","summary":" Software repositories contain valuable information for gaining insights into\ntheir development process. However, extracting insights from these repository\ndata is time-consuming and requires technical expertise. While software\nengineering chatbots have been developed to facilitate natural language\ninteractions with repositories, they struggle with understanding natural\nlanguage and accurately retrieving relevant data. This study aims to improve\nthe accuracy of LLM-based chatbots in answering repository-related questions by\naugmenting them with knowledge graphs. We achieve this in a two-step approach;\n(1) constructing a knowledge graph from the repository data and (2) synergizing\nthe knowledge graph with LLM to allow for the natural language questions and\nanswers. 
We curated a set of 20 questions with different complexities and\nevaluated our approach on five popular open-source projects. Our approach\nachieved an accuracy of 65%. We further investigated the limitations and\nidentified six key issues, with the majority relating to the reasoning\ncapability of the LLM. We experimented with a few-shot chain-of-thought\nprompting to determine if it could enhance our approach. This technique\nimproved the overall accuracy to 84%. Our findings demonstrate the synergy\nbetween LLMs and knowledge graphs as a viable solution for making repository\ndata accessible to both technical and non-technical stakeholders.\n","authors":["Samuel Abedu","SayedHassan Khatoonabadi","Emad Shihab"],"pdf_url":"https://arxiv.org/pdf/2412.03815v1.pdf","comment":"Submitted to ACM Transactions on Software Engineering and Methodology\n for review"},{"id":"http://arxiv.org/abs/2411.15004v2","updated":"2024-12-05T02:00:07Z","published":"2024-11-22T15:26:23Z","title":"ScribeAgent: Towards Specialized Web Agents Using Production-Scale\n Workflow Data","summary":" Large Language Model (LLM) agents are rapidly improving to handle\nincreasingly complex web-based tasks. Most of these agents rely on\ngeneral-purpose, proprietary models like GPT-4 and focus on designing better\nprompts to improve their planning abilities. However, general-purpose LLMs are\nnot specifically trained to understand specialized web contexts such as HTML,\nand they often struggle with long-horizon planning. We explore an alternative\napproach that fine-tunes open-source LLMs using production-scale workflow data\ncollected from over 250 domains corresponding to 6 billion tokens. This simple\nyet effective approach shows substantial gains over prompting-based agents on\nexisting benchmarks -- ScribeAgent achieves state-of-the-art direct generation\nperformance on Mind2Web and improves the task success rate by 7.3% over the\nprevious best text-only web agents on WebArena. 
We further perform detailed\nablation studies on various fine-tuning design choices and provide insights\ninto LLM selection, training recipes, context window optimization, and effect\nof dataset sizes.\n","authors":["Junhong Shen","Atishay Jain","Zedian Xiao","Ishan Amlekar","Mouad Hadji","Aaron Podolny","Ameet Talwalkar"],"pdf_url":"https://arxiv.org/pdf/2411.15004v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03801v1","updated":"2024-12-05T01:45:12Z","published":"2024-12-05T01:45:12Z","title":"Agent AI with LangGraph: A Modular Framework for Enhancing Machine\n Translation Using Large Language Models","summary":" This paper explores the transformative role of Agent AI and LangGraph in\nadvancing the automation and effectiveness of machine translation (MT). Agents\nare modular components designed to perform specific tasks, such as translating\nbetween particular languages, with specializations like TranslateEnAgent,\nTranslateFrenchAgent, and TranslateJpAgent for English, French, and Japanese\ntranslations, respectively. These agents leverage the powerful semantic\ncapabilities of large language models (LLMs), such as GPT-4o, to ensure\naccurate, contextually relevant translations while maintaining modularity,\nscalability, and context retention.\n LangGraph, a graph-based framework built on LangChain, simplifies the\ncreation and management of these agents and their workflows. It supports\ndynamic state management, enabling agents to maintain dialogue context and\nautomates complex workflows by linking agents and facilitating their\ncollaboration. With flexibility, open-source community support, and seamless\nintegration with LLMs, LangGraph empowers agents to deliver high-quality\ntranslations.\n Together, Agent AI and LangGraph create a cohesive system where LangGraph\norchestrates agent interactions, ensuring that user inputs are analyzed,\nrouted, and processed efficiently. 
Experimental results demonstrate the\npotential of this system to enhance multilingual translation accuracy and\nscalability. By highlighting modular design and automated workflows, this paper\nsets the stage for further innovations in intelligent machine translation\nservices.\n","authors":["Jialin Wang","Zhihua Duan"],"pdf_url":"https://arxiv.org/pdf/2412.03801v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11681v4","updated":"2024-12-05T01:03:37Z","published":"2023-12-18T20:01:58Z","title":"Designing LLM Chains by Adapting Techniques from Crowdsourcing Workflows","summary":" LLM chains enable complex tasks by decomposing work into a sequence of\nsubtasks. Similarly, the more established techniques of crowdsourcing workflows\ndecompose complex tasks into smaller tasks for human crowdworkers. Chains\naddress LLM errors analogously to the way crowdsourcing workflows address human\nerror. To characterize opportunities for LLM chaining, we survey 107 papers\nacross the crowdsourcing and chaining literature to construct a design space\nfor chain development. The design space covers a designer's objectives and the\ntactics used to build workflows. We then surface strategies that mediate how\nworkflows use tactics to achieve objectives. To explore how techniques from\ncrowdsourcing may apply to chaining, we adapt crowdsourcing workflows to\nimplement LLM chains across three case studies: creating a taxonomy, shortening\ntext, and writing a short story. From the design space and our case studies, we\nidentify takeaways for effective chain design and raise implications for future\nresearch and development.\n","authors":["Madeleine Grunde-McLaughlin","Michelle S. Lam","Ranjay Krishna","Daniel S. 
Weld","Jeffrey Heer"],"pdf_url":"https://arxiv.org/pdf/2312.11681v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03782v1","updated":"2024-12-05T00:05:11Z","published":"2024-12-05T00:05:11Z","title":"The broader spectrum of in-context learning","summary":" The ability of language models to learn a task from a few examples in context\nhas generated substantial interest. Here, we provide a perspective that\nsituates this type of supervised few-shot learning within a much broader\nspectrum of meta-learned in-context learning. Indeed, we suggest that any\ndistribution of sequences in which context non-trivially decreases loss on\nsubsequent predictions can be interpreted as eliciting a kind of in-context\nlearning. We suggest that this perspective helps to unify the broad set of\nin-context abilities that language models exhibit $\\unicode{x2014}$ such as\nadapting to tasks from instructions or role play, or extrapolating time series.\nThis perspective also sheds light on potential roots of in-context learning in\nlower-level processing of linguistic dependencies (e.g. coreference or parallel\nstructures). Finally, taking this perspective highlights the importance of\ngeneralization, which we suggest can be studied along several dimensions: not\nonly the ability to learn something novel, but also flexibility in learning\nfrom different presentations, and in applying what is learned. We discuss\nbroader connections to past literature in meta-learning and goal-conditioned\nagents, and other perspectives on learning and adaptation. We close by\nsuggesting that research on in-context learning should consider this broader\nspectrum of in-context capabilities and types of generalization.\n","authors":["Andrew Kyle Lampinen","Stephanie C. Y. Chan","Aaditya K. 
Singh","Murray Shanahan"],"pdf_url":"https://arxiv.org/pdf/2412.03782v1.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2412.04466v1","updated":"2024-12-05T18:59:51Z","published":"2024-12-05T18:59:51Z","title":"User-item fairness tradeoffs in recommendations","summary":" In the basic recommendation paradigm, the most (predicted) relevant item is\nrecommended to each user. This may result in some items receiving lower\nexposure than they \"should\"; to counter this, several algorithmic approaches\nhave been developed to ensure item fairness. These approaches necessarily\ndegrade recommendations for some users to improve outcomes for items, leading\nto user fairness concerns. In turn, a recent line of work has focused on\ndeveloping algorithms for multi-sided fairness, to jointly optimize user\nfairness, item fairness, and overall recommendation quality. This induces the\nquestion: what is the tradeoff between these objectives, and what are the\ncharacteristics of (multi-objective) optimal solutions? Theoretically, we\ndevelop a model of recommendations with user and item fairness objectives and\ncharacterize the solutions of fairness-constrained optimization. We identify\ntwo phenomena: (a) when user preferences are diverse, there is \"free\" item and\nuser fairness; and (b) users whose preferences are misestimated can be\nespecially disadvantaged by item fairness constraints. 
Empirically, we\nprototype a recommendation system for preprints on arXiv and implement our\nframework, measuring the phenomena in practice and showing how these phenomena\ninform the design of markets with recommendation systems-intermediated\nmatching.\n","authors":["Sophie Greenwood","Sudalakshmee Chiniah","Nikhil Garg"],"pdf_url":"https://arxiv.org/pdf/2412.04466v1.pdf","comment":"Accepted at the Thirty-Eighth Annual Conference on Neural Information\n Processing Systems"},{"id":"http://arxiv.org/abs/2412.04276v1","updated":"2024-12-05T15:59:05Z","published":"2024-12-05T15:59:05Z","title":"Graph-Sequential Alignment and Uniformity: Toward Enhanced\n Recommendation Systems","summary":" Graph-based and sequential methods are two popular recommendation paradigms,\neach excelling in its domain but lacking the ability to leverage signals from\nthe other. To address this, we propose a novel method that integrates both\napproaches for enhanced performance. Our framework uses Graph Neural Network\n(GNN)-based and sequential recommenders as separate submodules while sharing a\nunified embedding space optimized jointly. To enable positive knowledge\ntransfer, we design a loss function that enforces alignment and uniformity both\nwithin and across submodules. Experiments on three real-world datasets\ndemonstrate that the proposed method significantly outperforms using either\napproach alone and achieves state-of-the-art results. Our implementations are\npublicly available at https://github.com/YuweiCao-UIC/GSAU.git.\n","authors":["Yuwei Cao","Liangwei Yang","Zhiwei Liu","Yuqing Liu","Chen Wang","Yueqing Liang","Hao Peng","Philip S. 
Yu"],"pdf_url":"https://arxiv.org/pdf/2412.04276v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2412.04272v1","updated":"2024-12-05T15:54:16Z","published":"2024-12-05T15:54:16Z","title":"PoTable: Programming Standardly on Table-based Reasoning Like a Human\n Analyst","summary":" Table-based reasoning has garnered substantial research interest,\nparticularly in its integration with Large Language Model (LLM) which has\nrevolutionized the general reasoning paradigm. Numerous LLM-based studies\nintroduce symbolic tools (e.g., databases, Python) as assistants to extend\nhuman-like abilities in structured table understanding and complex arithmetic\ncomputations. However, these studies can be improved better in simulating human\ncognitive behavior when using symbolic tools, as they still suffer from\nlimitations of non-standard logical splits and constrained operation pools. In\nthis study, we propose PoTable as a novel table-based reasoning method that\nsimulates a human tabular analyst, which integrates a Python interpreter as the\nreal-time executor accompanied by an LLM-based operation planner and code\ngenerator. Specifically, PoTable follows a human-like logical stage split and\nextends the operation pool into an open-world space without any constraints.\nThrough planning and executing in each distinct stage, PoTable standardly\ncompletes the entire reasoning process and produces superior reasoning results\nalong with highly accurate, steply commented and completely executable\nprograms. Accordingly, the effectiveness and explainability of PoTable are\nfully demonstrated. Extensive experiments over three evaluation datasets from\ntwo public benchmarks on two backbones show the outstanding performance of our\napproach. 
In particular, GPT-based PoTable achieves over 4% higher absolute\naccuracy than runner-ups on all evaluation datasets.\n","authors":["Qingyang Mao","Qi Liu","Zhi Li","Mingyue Cheng","Zheng Zhang","Rui Li"],"pdf_url":"https://arxiv.org/pdf/2412.04272v1.pdf","comment":"12 pages, 4 figures"},{"id":"http://arxiv.org/abs/2312.00326v4","updated":"2024-12-05T14:45:05Z","published":"2023-12-01T03:44:54Z","title":"Agent-OM: Leveraging LLM Agents for Ontology Matching","summary":" Ontology matching (OM) enables semantic interoperability between different\nontologies and resolves their conceptual heterogeneity by aligning related\nentities. OM systems currently have two prevailing design paradigms:\nconventional knowledge-based expert systems and newer machine learning-based\npredictive systems. While large language models (LLMs) and LLM agents have\nrevolutionised data engineering and have been applied creatively in many\ndomains, their potential for OM remains underexplored. This study introduces a\nnovel agent-powered LLM-based design paradigm for OM systems. With\nconsideration of several specific challenges in leveraging LLM agents for OM,\nwe propose a generic framework, namely Agent-OM (Agent for Ontology Matching),\nconsisting of two Siamese agents for retrieval and matching, with a set of\nsimple OM tools. 
Our framework is implemented in a proof-of-concept system.\nEvaluations of three Ontology Alignment Evaluation Initiative (OAEI) tracks\nover state-of-the-art OM systems show that our system can achieve results very\nclose to the long-standing best performance on simple OM tasks and can\nsignificantly improve the performance on complex and few-shot OM tasks.\n","authors":["Zhangcheng Qiang","Weiqing Wang","Kerry Taylor"],"pdf_url":"https://arxiv.org/pdf/2312.00326v4.pdf","comment":"14 pages, 13 figures, 4 tables"},{"id":"http://arxiv.org/abs/2409.16182v3","updated":"2024-12-05T14:28:42Z","published":"2024-09-24T15:26:38Z","title":"TiM4Rec: An Efficient Sequential Recommendation Model Based on\n Time-Aware Structured State Space Duality Model","summary":" The Sequential Recommendation modeling paradigm is shifting from Transformer\nto Mamba architecture, which comprises two generations: Mamba1, based on the\nState Space Model (SSM), and Mamba2, based on State Space Duality (SSD).\nAlthough SSD offers superior computational efficiency compared to SSM, it\nsuffers performance degradation in sequential recommendation tasks, especially\nin low-dimensional scenarios that are critical for these tasks. Considering\nthat time-aware enhancement methods are commonly employed to mitigate\nperformance loss, our analysis reveals that the performance decline of SSD can\nsimilarly be fundamentally compensated by leveraging mechanisms in time-aware\nmethods. Thus, we propose integrating time-awareness into the SSD framework to\naddress these performance issues. However, integrating current time-aware\nmethods, modeled after TiSASRec, into SSD faces the following challenges: 1)\nthe complexity of integrating these transformer-based mechanisms with the SSD\narchitecture, and 2) the computational inefficiency caused by the need for\ndimensionality expansion of time-difference modeling. 
To overcome these\nchallenges, we introduce a novel Time-aware Structured Masked Matrix that\nefficiently incorporates time-aware capabilities into SSD. Building on this, we\npropose Time-Aware Mamba for Recommendation (TiM4Rec), which mitigates\nperformance degradation in low-dimensional SSD contexts while preserving\ncomputational efficiency. This marks the inaugural application of a time-aware\nenhancement method specifically tailored for the Mamba architecture within the\ndomain of sequential recommendation. Extensive experiments conducted on three\nreal-world datasets demonstrate the superiority of our approach. The code for\nour model is accessible at https://github.com/AlwaysFHao/TiM4Rec.\n","authors":["Hao Fan","Mengyi Zhu","Yanrong Hu","Hailin Feng","Zhijie He","Hongjiu Liu","Qingyang Liu"],"pdf_url":"https://arxiv.org/pdf/2409.16182v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.03906v2","updated":"2024-12-05T12:56:40Z","published":"2024-11-06T13:37:28Z","title":"Lexicalization Is All You Need: Examining the Impact of Lexical\n Knowledge in a Compositional QALD System","summary":" In this paper, we examine the impact of lexicalization on Question Answering\nover Linked Data (QALD). It is well known that one of the key challenges in\ninterpreting natural language questions with respect to SPARQL lies in bridging\nthe lexical gap, that is mapping the words in the query to the correct\nvocabulary elements. We argue in this paper that lexicalization, that is\nexplicit knowledge about the potential interpretations of a word with respect\nto the given vocabulary, significantly eases the task and increases the\nperformance of QA systems. Towards this goal, we present a compositional QA\nsystem that can leverage explicit lexical knowledge in a compositional manner\nto infer the meaning of a question in terms of a SPARQL query. 
We show that\nsuch a system, given lexical knowledge, has a performance well beyond current\nQA systems, achieving up to a $35.8\\%$ increase in the micro $F_1$ score\ncompared to the best QA system on QALD-9. This shows the importance and\npotential of including explicit lexical knowledge. In contrast, we show that\nLLMs have limited abilities to exploit lexical knowledge, with only marginal\nimprovements compared to a version without lexical knowledge. This shows that\nLLMs have no ability to compositionally interpret a question on the basis of\nthe meaning of its parts, a key feature of compositional approaches. Taken\ntogether, our work shows new avenues for QALD research, emphasizing the\nimportance of lexicalization and compositionality.\n","authors":["David Maria Schmidt","Mohammad Fazleh Elahi","Philipp Cimiano"],"pdf_url":"https://arxiv.org/pdf/2411.03906v2.pdf","comment":"24th International Conference on Knowledge Engineering and Knowledge\n Management (EKAW 2024), November 26-28, 2024, Amsterdam, The Netherlands"},{"id":"http://arxiv.org/abs/2412.04107v1","updated":"2024-12-05T12:17:56Z","published":"2024-12-05T12:17:56Z","title":"Pre-train, Align, and Disentangle: Empowering Sequential Recommendation\n with Large Language Models","summary":" Sequential recommendation (SR) aims to model the sequential dependencies in\nusers' historical interactions to better capture their evolving interests.\nHowever, existing SR approaches primarily rely on collaborative data, which\nleads to limitations such as the cold-start problem and sub-optimal\nperformance. Meanwhile, despite the success of large language models (LLMs),\ntheir application in industrial recommender systems is hindered by high\ninference latency, inability to capture all distribution statistics, and\ncatastrophic forgetting. 
To this end, we propose a novel Pre-train, Align, and\nDisentangle (PAD) paradigm to empower recommendation models with LLMs.\nSpecifically, we first pre-train both the SR and LLM models to get\ncollaborative and textual embeddings. Next, a characteristic\nrecommendation-anchored alignment loss is proposed using multi-kernel maximum\nmean discrepancy with Gaussian kernels. Finally, a triple-experts architecture,\nconsisting aligned and modality-specific experts with disentangled embeddings,\nis fine-tuned in a frequency-aware manner. Experiments conducted on three\npublic datasets demonstrate the effectiveness of PAD, showing significant\nimprovements and compatibility with various SR backbone models, especially on\ncold items. The implementation code and datasets will be publicly available.\n","authors":["Yuhao Wang","Junwei Pan","Xiangyu Zhao","Pengyue Jia","Wanyu Wang","Yuan Wang","Yue Liu","Dapeng Liu","Jie Jiang"],"pdf_url":"https://arxiv.org/pdf/2412.04107v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.07426v4","updated":"2024-12-05T11:04:14Z","published":"2023-08-14T19:36:57Z","title":"A Survey on Point-of-Interest Recommendations Leveraging Heterogeneous\n Data","summary":" Tourism is an important application domain for recommender systems. In this\ndomain, recommender systems are for example tasked with providing personalized\nrecommendations for transportation, accommodation, points-of-interest (POIs),\netc. Among these tasks, in particular the problem of recommending POIs that are\nof likely interest to individual tourists has gained growing attention in\nrecent years. Providing POI recommendations to tourists can however be\nespecially challenging due to the variability of the user's context. 
With the\nrapid development of the Web and today's multitude of online services, vast\namounts of data from various sources have become available, and these\nheterogeneous data represent a huge potential to better address the challenges\nof POI recommendation problems. In this work, we provide a survey of published\nresearch on the problem of POI recommendation between 2021 and 2023. The\nliterature was surveyed to identify the information types, techniques and\nevaluation methods employed. Based on the analysis, it was observed that the\ncurrent research tends to focus on a relatively narrow range of information\ntypes and there is a significant potential in improving POI recommendation by\nleveraging heterogeneous data. As the first information-centric survey on POI\nrecommendation research, this study serves as a reference for researchers\naiming to develop increasingly accurate, personalized and context-aware POI\nrecommender systems.\n","authors":["Zehui Wang","Wolfram Höpken","Dietmar Jannach"],"pdf_url":"https://arxiv.org/pdf/2308.07426v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03913v1","updated":"2024-12-05T06:30:20Z","published":"2024-12-05T06:30:20Z","title":"Graph Disentangle Causal Model: Enhancing Causal Inference in Networked\n Observational Data","summary":" Estimating individual treatment effects (ITE) from observational data is a\ncritical task across various domains. However, many existing works on ITE\nestimation overlook the influence of hidden confounders, which remain\nunobserved at the individual unit level. To address this limitation,\nresearchers have utilized graph neural networks to aggregate neighbors'\nfeatures to capture the hidden confounders and mitigate confounding bias by\nminimizing the discrepancy of confounder representations between the treated\nand control groups. 
Despite the success of these approaches, practical\nscenarios often treat all features as confounders and involve substantial\ndifferences in feature distributions between the treated and control groups.\nConfusing the adjustment and confounder and enforcing strict balance on the\nconfounder representations could potentially undermine the effectiveness of\noutcome prediction. To mitigate this issue, we propose a novel framework called\nthe \\textit{Graph Disentangle Causal model} (GDC) to conduct ITE estimation in\nthe network setting. GDC utilizes a causal disentangle module to separate unit\nfeatures into adjustment and confounder representations. Then we design a graph\naggregation module consisting of three distinct graph aggregators to obtain\nadjustment, confounder, and counterfactual confounder representations. Finally,\na causal constraint module is employed to enforce the disentangled\nrepresentations as true causal factors. The effectiveness of our proposed\nmethod is demonstrated by conducting comprehensive experiments on two networked\ndatasets.\n","authors":["Binbin Hu","Zhicheng An","Zhengwei Wu","Ke Tu","Ziqi Liu","Zhiqiang Zhang","Jun Zhou","Yufei Feng","Jiawei Chen"],"pdf_url":"https://arxiv.org/pdf/2412.03913v1.pdf","comment":"Accepted by WSDM 2025"},{"id":"http://arxiv.org/abs/2412.03875v1","updated":"2024-12-05T05:07:19Z","published":"2024-12-05T05:07:19Z","title":"Learning to Hash for Recommendation: A Survey","summary":" With the explosive growth of users and items, Recommender Systems (RS) are\nfacing unprecedented challenges on both retrieval efficiency and storage cost.\nFortunately, Learning to Hash (L2H) techniques have been shown as a promising\nsolution to address the two dilemmas, whose core idea is encoding\nhigh-dimensional data into compact hash codes. To this end, L2H for RS (HashRec\nfor short) has recently received widespread attention to support large-scale\nrecommendations. 
In this survey, we present a comprehensive review of current\nHashRec algorithms. Specifically, we first introduce the commonly used\ntwo-tower models in the recall stage and identify two search strategies\nfrequently employed in L2H. Then, we categorize prior works into two-tier\ntaxonomy based on: (i) the type of loss function and (ii) the optimization\nstrategy. We also introduce some commonly used evaluation metrics to measure\nthe performance of HashRec algorithms. Finally, we shed light on the\nlimitations of the current research and outline the future research directions.\nFurthermore, the summary of HashRec methods reviewed in this survey can be\nfound at\n\\href{https://github.com/Luo-Fangyuan/HashRec}{https://github.com/Luo-Fangyuan/HashRec}.\n","authors":["Fangyuan Luo","Honglei Zhang","Tong Li","Jun Wu"],"pdf_url":"https://arxiv.org/pdf/2412.03875v1.pdf","comment":null}],"Machine Learning":[{"id":"http://arxiv.org/abs/2412.04467v1","updated":"2024-12-05T18:59:53Z","published":"2024-12-05T18:59:53Z","title":"VisionZip: Longer is Better but Not Necessary in Vision Language Models","summary":" Recent advancements in vision-language models have enhanced performance by\nincreasing the length of visual tokens, making them much longer than text\ntokens and significantly raising computational costs. However, we observe that\nthe visual tokens generated by popular vision encoders, such as CLIP and\nSigLIP, contain significant redundancy. To address this, we introduce\nVisionZip, a simple yet effective method that selects a set of informative\ntokens for input to the language model, reducing visual token redundancy and\nimproving efficiency while maintaining model performance. The proposed\nVisionZip can be widely applied to image and video understanding tasks and is\nwell-suited for multi-turn dialogues in real-world scenarios, where previous\nmethods tend to underperform. 
Experimental results show that VisionZip\noutperforms the previous state-of-the-art method by at least 5% performance\ngains across nearly all settings. Moreover, our method significantly enhances\nmodel inference speed, improving the prefilling time by 8x and enabling the\nLLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while\nachieving better results. Furthermore, we analyze the causes of this redundancy\nand encourage the community to focus on extracting better visual features\nrather than merely increasing token length. Our code is available at\nhttps://github.com/dvlab-research/VisionZip .\n","authors":["Senqiao Yang","Yukang Chen","Zhuotao Tian","Chengyao Wang","Jingyao Li","Bei Yu","Jiaya Jia"],"pdf_url":"https://arxiv.org/pdf/2412.04467v1.pdf","comment":"2 columns, 28 pages, 15 figures, 18 tables"},{"id":"http://arxiv.org/abs/2412.04455v1","updated":"2024-12-05T18:58:27Z","published":"2024-12-05T18:58:27Z","title":"Code-as-Monitor: Constraint-aware Visual Programming for Reactive and\n Proactive Robotic Failure Detection","summary":" Automatic detection and prevention of open-set failures are crucial in\nclosed-loop robotic systems. Recent studies often struggle to simultaneously\nidentify unexpected failures reactively after they occur and prevent\nforeseeable ones proactively. To this end, we propose Code-as-Monitor (CaM), a\nnovel paradigm leveraging the vision-language model (VLM) for both open-set\nreactive and proactive failure detection. The core of our method is to\nformulate both tasks as a unified set of spatio-temporal constraint\nsatisfaction problems and use VLM-generated code to evaluate them for real-time\nmonitoring. To enhance the accuracy and efficiency of monitoring, we further\nintroduce constraint elements that abstract constraint-related entities or\ntheir parts into compact geometric elements. 
This approach offers greater\ngenerality, simplifies tracking, and facilitates constraint-aware visual\nprogramming by leveraging these elements as visual prompts. Experiments show\nthat CaM achieves a 28.7% higher success rate and reduces execution time by\n31.8% under severe disturbances compared to baselines across three simulators\nand a real-world setting. Moreover, CaM can be integrated with open-loop\ncontrol policies to form closed-loop systems, enabling long-horizon tasks in\ncluttered scenes with dynamic environments.\n","authors":["Enshen Zhou","Qi Su","Cheng Chi","Zhizheng Zhang","Zhongyuan Wang","Tiejun Huang","Lu Sheng","He Wang"],"pdf_url":"https://arxiv.org/pdf/2412.04455v1.pdf","comment":"Project page: https://zhoues.github.io/Code-as-Monitor/"},{"id":"http://arxiv.org/abs/2412.04445v1","updated":"2024-12-05T18:57:04Z","published":"2024-12-05T18:57:04Z","title":"Moto: Latent Motion Token as the Bridging Language for Robot\n Manipulation","summary":" Recent developments in Large Language Models pre-trained on extensive corpora\nhave shown significant success in various natural language processing tasks\nwith minimal fine-tuning. This success offers new promise for robotics, which\nhas long been constrained by the high cost of action-labeled data. We ask:\ngiven the abundant video data containing interaction-related knowledge\navailable as a rich \"corpus\", can a similar generative pre-training approach be\neffectively applied to enhance robot learning? The key challenge is to identify\nan effective representation for autoregressive pre-training that benefits robot\nmanipulation tasks. Inspired by the way humans learn new skills through\nobserving dynamic environments, we propose that effective robotic learning\nshould emphasize motion-related knowledge, which is closely tied to low-level\nactions and is hardware-agnostic, facilitating the transfer of learned motions\nto actual robot actions. 
To this end, we introduce Moto, which converts video\ncontent into latent Motion Token sequences by a Latent Motion Tokenizer,\nlearning a bridging \"language\" of motion from videos in an unsupervised manner.\nWe pre-train Moto-GPT through motion token autoregression, enabling it to\ncapture diverse visual motion knowledge. After pre-training, Moto-GPT\ndemonstrates the promising ability to produce semantically interpretable motion\ntokens, predict plausible motion trajectories, and assess trajectory\nrationality through output likelihood. To transfer learned motion priors to\nreal robot actions, we implement a co-fine-tuning strategy that seamlessly\nbridges latent motion token prediction and real robot control. Extensive\nexperiments show that the fine-tuned Moto-GPT exhibits superior robustness and\nefficiency on robot manipulation benchmarks, underscoring its effectiveness in\ntransferring knowledge from video data to downstream visual manipulation tasks.\n","authors":["Yi Chen","Yuying Ge","Yizhuo Li","Yixiao Ge","Mingyu Ding","Ying Shan","Xihui Liu"],"pdf_url":"https://arxiv.org/pdf/2412.04445v1.pdf","comment":"Project released at: https://chenyi99.github.io/moto/"},{"id":"http://arxiv.org/abs/2409.03669v2","updated":"2024-12-05T18:56:04Z","published":"2024-09-05T16:23:07Z","title":"A method to benchmark high-dimensional process drift detection","summary":" Process curves are multivariate finite time series data coming from\nmanufacturing processes. This paper studies machine learning that detect drifts\nin process curve datasets. A theoretic framework to synthetically generate\nprocess curves in a controlled way is introduced in order to benchmark machine\nlearning algorithms for process drift detection. 
An evaluation score, called\nthe temporal area under the curve, is introduced, which allows to quantify how\nwell machine learning models unveil curves belonging to drift segments.\nFinally, a benchmark study comparing popular machine learning approaches on\nsynthetic data generated with the introduced framework is presented that shows\nthat existing algorithms often struggle with datasets containing multiple drift\nsegments.\n","authors":["Edgar Wolf","Tobias Windisch"],"pdf_url":"https://arxiv.org/pdf/2409.03669v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.17728v3","updated":"2024-12-05T18:55:44Z","published":"2024-03-26T14:17:01Z","title":"Masked Autoencoders are PDE Learners","summary":" Neural solvers for partial differential equations (PDEs) have great potential\nto generate fast and accurate physics solutions, yet their practicality is\ncurrently limited by their generalizability. PDEs evolve over broad scales and\nexhibit diverse behaviors; predicting these phenomena will require learning\nrepresentations across a wide variety of inputs which may encompass different\ncoefficients, boundary conditions, resolutions, or even equations. As a step\ntowards generalizable PDE modeling, we adapt masked pretraining for physics\nproblems. Through self-supervised learning across PDEs, masked autoencoders can\nconsolidate heterogeneous physics to learn rich latent representations. We show\nthat learned representations can generalize to a limited set of unseen\nequations or parameters and are meaningful enough to regress PDE coefficients\nor the classify PDE features. Furthermore, conditioning neural solvers on\nlearned latent representations can improve time-stepping and super-resolution\nperformance across a variety of coefficients, discretizations, or boundary\nconditions, as well as on certain unseen PDEs. 
We hope that masked pretraining\ncan emerge as a unifying method across large, unlabeled, and heterogeneous\ndatasets to learn latent physics at scale.\n","authors":["Anthony Zhou","Amir Barati Farimani"],"pdf_url":"https://arxiv.org/pdf/2403.17728v3.pdf","comment":"29 pages, 9 figures"},{"id":"http://arxiv.org/abs/2412.04429v1","updated":"2024-12-05T18:52:00Z","published":"2024-12-05T18:52:00Z","title":"Grounding Descriptions in Images informs Zero-Shot Visual Recognition","summary":" Vision-language models (VLMs) like CLIP have been cherished for their ability\nto perform zero-shot visual recognition on open-vocabulary concepts. This is\nachieved by selecting the object category whose textual representation bears\nthe highest similarity with the query image. While successful in some domains,\nthis method struggles with identifying fine-grained entities as well as\ngeneralizing to unseen concepts that are not captured by the training\ndistribution. Recent works attempt to mitigate these challenges by integrating\ncategory descriptions at test time, albeit yielding modest improvements. We\nattribute these limited gains to a fundamental misalignment between image and\ndescription representations, which is rooted in the pretraining structure of\nCLIP. In this paper, we propose GRAIN, a new pretraining strategy aimed at\naligning representations at both fine and coarse levels simultaneously. Our\napproach learns to jointly ground textual descriptions in image regions along\nwith aligning overarching captions with global image representations. To drive\nthis pre-training, we leverage frozen Multimodal Large Language Models (MLLMs)\nto derive large-scale synthetic annotations. We demonstrate the enhanced\nzero-shot performance of our model compared to current state-of-the art methods\nacross 11 diverse image classification datasets. 
Additionally, we introduce\nProducts-2023, a newly curated, manually labeled dataset featuring novel\nconcepts, and showcase our model's ability to recognize these concepts by\nbenchmarking on it. Significant improvements achieved by our model on other\ndownstream tasks like retrieval further highlight the superior quality of\nrepresentations learned by our approach. Code available at\nhttps://github.com/shaunak27/grain-clip .\n","authors":["Shaunak Halbe","Junjiao Tian","K J Joseph","James Seale Smith","Katherine Stevo","Vineeth N Balasubramanian","Zsolt Kira"],"pdf_url":"https://arxiv.org/pdf/2412.04429v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04426v1","updated":"2024-12-05T18:51:18Z","published":"2024-12-05T18:51:18Z","title":"Marvel: Accelerating Safe Online Reinforcement Learning with Finetuned\n Offline Policy","summary":" The high costs and risks involved in extensive environment interactions\nhinder the practical application of current online safe reinforcement learning\n(RL) methods. While offline safe RL addresses this by learning policies from\nstatic datasets, the performance therein is usually limited due to reliance on\ndata quality and challenges with out-of-distribution (OOD) actions. Inspired by\nrecent successes in offline-to-online (O2O) RL, it is crucial to explore\nwhether offline safe RL can be leveraged to facilitate faster and safer online\npolicy learning, a direction that has yet to be fully investigated. To fill\nthis gap, we first demonstrate that naively applying existing O2O algorithms\nfrom standard RL would not work well in the safe RL setting due to two unique\nchallenges: \\emph{erroneous Q-estimations}, resulted from offline-online\nobjective mismatch and offline cost sparsity, and \\emph{Lagrangian mismatch},\nresulted from difficulties in aligning Lagrange multipliers between offline and\nonline policies. 
To address these challenges, we introduce \\textbf{Marvel}, a\nnovel framework for O2O safe RL, comprising two key components that work in\nconcert: \\emph{Value Pre-Alignment} to align the Q-functions with the\nunderlying truth before online learning, and \\emph{Adaptive PID Control} to\neffectively adjust the Lagrange multipliers during online finetuning. Extensive\nexperiments demonstrate that Marvel significantly outperforms existing\nbaselines in both reward maximization and safety constraint satisfaction. By\nintroducing the first policy-finetuning based framework for O2O safe RL, which\nis compatible with many offline and online safe RL methods, our work has the\ngreat potential to advance the field towards more efficient and practical safe\nRL solutions.\n","authors":["Keru Chen","Honghao Wei","Zhigang Deng","Sen Lin"],"pdf_url":"https://arxiv.org/pdf/2412.04426v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04425v1","updated":"2024-12-05T18:51:10Z","published":"2024-12-05T18:51:10Z","title":"CA-SSLR: Condition-Aware Self-Supervised Learning Representation for\n Generalized Speech Processing","summary":" We introduce Condition-Aware Self-Supervised Learning Representation\n(CA-SSLR), a generalist conditioning model broadly applicable to various\nspeech-processing tasks. Compared to standard fine-tuning methods that optimize\nfor downstream models, CA-SSLR integrates language and speaker embeddings from\nearlier layers, making the SSL model aware of the current language and speaker\ncontext. This approach reduces the reliance on input audio features while\npreserving the integrity of the base SSLR. CA-SSLR improves the model's\ncapabilities and demonstrates its generality on unseen tasks with minimal\ntask-specific tuning. Our method employs linear modulation to dynamically\nadjust internal representations, enabling fine-grained adaptability without\nsignificantly altering the original model behavior. 
Experiments show that\nCA-SSLR reduces the number of trainable parameters, mitigates overfitting, and\nexcels in under-resourced and unseen tasks. Specifically, CA-SSLR achieves a\n10% relative reduction in LID errors, a 37% improvement in ASR CER on the\nML-SUPERB benchmark, and a 27% decrease in SV EER on VoxCeleb-1, demonstrating\nits effectiveness.\n","authors":["Yen-Ju Lu","Jing Liu","Thomas Thebaud","Laureano Moro-Velazquez","Ariya Rastrow","Najim Dehak","Jesus Villalba"],"pdf_url":"https://arxiv.org/pdf/2412.04425v1.pdf","comment":"38th Conference on Neural Information Processing Systems (NeurIPS\n 2024)"},{"id":"http://arxiv.org/abs/2403.07384v2","updated":"2024-12-05T18:47:47Z","published":"2024-03-12T07:45:33Z","title":"SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large\n Language Models by Summarizing Training Trajectories of Small Models","summary":" Despite the effectiveness of data selection for large language models (LLMs)\nduring pretraining and instruction fine-tuning phases, improving data\nefficiency in supervised fine-tuning (SFT) for specialized domains poses\nsignificant challenges due to the complexity of fine-tuning data. To bridge\nthis gap, we introduce an effective and scalable data selection method for SFT,\nSmallToLarge (S2L), which leverages training trajectories from small models to\nguide the data selection for larger models. We demonstrate through extensive\nexperiments that S2L significantly improves data efficiency in SFT for\nmathematical problem-solving, reducing the training data to just 11% of the\noriginal MathInstruct dataset (Yue et al., 2023) to match full dataset\nperformance while outperforming state-of-the-art data selection algorithms by\nan average of 4.7% across 6 in- and out-domain evaluation datasets. Remarkably,\nselecting only 50K data for SFT, S2L achieves a 32.7% accuracy on the most\nchallenging MATH (Hendrycks et al., 2021) benchmark, improving Phi-2 (Li et\nal., 2023b) by 16.6%. 
In clinical text summarization on the MIMIC-III dataset\n(Johnson et al., 2016), S2L again outperforms training on the full dataset\nusing only 50% of the data. Notably, S2L can perform data selection using a\nreference model 40x smaller than the target model, proportionally reducing the\ncost of data selection.\n","authors":["Yu Yang","Siddhartha Mishra","Jeffrey N Chiang","Baharan Mirzasoleiman"],"pdf_url":"https://arxiv.org/pdf/2403.07384v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01339v2","updated":"2024-12-05T18:43:25Z","published":"2024-12-02T10:06:57Z","title":"Negative Token Merging: Image-based Adversarial Feature Guidance","summary":" Text-based adversarial guidance using a negative prompt has emerged as a\nwidely adopted approach to steer diffusion models away from producing undesired\nconcepts. While useful, performing adversarial guidance using text alone can be\ninsufficient to capture complex visual concepts or avoid specific visual\nelements like copyrighted characters. In this paper, for the first time we\nexplore an alternate modality in this direction by performing adversarial\nguidance directly using visual features from a reference image or other images\nin a batch. We introduce negative token merging (NegToMe), a simple but\neffective training-free approach which performs adversarial guidance through\nimages by selectively pushing apart matching visual features between reference\nand generated images during the reverse diffusion process. By simply adjusting\nthe used reference, NegToMe enables a diverse range of applications. Notably,\nwhen using other images in same batch as reference, we find that NegToMe\nsignificantly enhances output diversity (e.g., racial, gender, visual) by\nguiding features of each image away from others. Similarly, when used w.r.t.\ncopyrighted reference images, NegToMe reduces visual similarity to copyrighted\ncontent by 34.57%. 
NegToMe is simple to implement using just few-lines of code,\nuses only marginally higher (<4%) inference time and is compatible with\ndifferent diffusion architectures, including those like Flux, which don't\nnatively support the use of a negative prompt. Code is available at\nhttps://negtome.github.io\n","authors":["Jaskirat Singh","Lindsey Li","Weijia Shi","Ranjay Krishna","Yejin Choi","Pang Wei Koh","Michael F. Cohen","Stephen Gould","Liang Zheng","Luke Zettlemoyer"],"pdf_url":"https://arxiv.org/pdf/2412.01339v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04416v1","updated":"2024-12-05T18:42:29Z","published":"2024-12-05T18:42:29Z","title":"FedDUAL: A Dual-Strategy with Adaptive Loss and Dynamic Aggregation for\n Mitigating Data Heterogeneity in Federated Learning","summary":" Federated Learning (FL) marks a transformative approach to distributed model\ntraining by combining locally optimized models from various clients into a\nunified global model. While FL preserves data privacy by eliminating\ncentralized storage, it encounters significant challenges such as performance\ndegradation, slower convergence, and reduced robustness of the global model due\nto the heterogeneity in client data distributions. Among the various forms of\ndata heterogeneity, label skew emerges as a particularly formidable and\nprevalent issue, especially in domains such as image classification. To address\nthese challenges, we begin with comprehensive experiments to pinpoint the\nunderlying issues in the FL training process. Based on our findings, we then\nintroduce an innovative dual-strategy approach designed to effectively resolve\nthese issues. First, we introduce an adaptive loss function for client-side\ntraining, meticulously crafted to preserve previously acquired knowledge while\nmaintaining an optimal equilibrium between local optimization and global model\ncoherence. Secondly, we develop a dynamic aggregation strategy for aggregating\nclient models at the server. 
This approach adapts to each client's unique\nlearning patterns, effectively addressing the challenges of diverse data across\nthe network. Our comprehensive evaluation, conducted across three diverse\nreal-world datasets, coupled with theoretical convergence guarantees,\ndemonstrates the superior efficacy of our method compared to several\nestablished state-of-the-art approaches.\n","authors":["Pranab Sahoo","Ashutosh Tripathi","Sriparna Saha","Samrat Mondal"],"pdf_url":"https://arxiv.org/pdf/2412.04416v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.12924v3","updated":"2024-12-05T18:35:26Z","published":"2024-09-04T03:17:19Z","title":"WaveletGPT: Wavelets Meet Large Language Models","summary":" Large Language Models (LLMs) have ushered in a new wave of artificial\nintelligence advancements impacting every scientific field and discipline. They\nare trained on a simple objective: to predict the next token given the previous\ncontext. We live in a world where most of the data around us, e.g., text,\naudio, and music, has a multi-scale structure associated with it. This paper\ninfuses LLMs with traditional signal processing ideas, namely wavelets, during\npre-training to take advantage of the structure. Without adding \\textbf{any\nextra parameters} to a GPT-style LLM architecture, we achieve the same\npre-training performance almost twice as fast in text, raw audio, and symbolic\nmusic. This is achieved by imposing a structure on intermediate embeddings.\nWhen trained for the same number of training steps, we achieve significant\ngains in performance, which is comparable to pre-training a larger neural\narchitecture. Our architecture allows every next token prediction access to\nintermediate embeddings at different temporal resolutions in every Transformer\ndecoder block. This work will hopefully pave the way for incorporating\nmulti-rate signal processing ideas into traditional LLM pre-training. 
Further,\nwe showcase pushing model performance by improving internal structure instead\nof just going after scale.\n","authors":["Prateek Verma"],"pdf_url":"https://arxiv.org/pdf/2409.12924v3.pdf","comment":"16 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.04413v1","updated":"2024-12-05T18:33:59Z","published":"2024-12-05T18:33:59Z","title":"Efficient Task Grouping Through Samplewise Optimisation Landscape\n Analysis","summary":" Shared training approaches, such as multi-task learning (MTL) and\ngradient-based meta-learning, are widely used in various machine learning\napplications, but they often suffer from negative transfer, leading to\nperformance degradation in specific tasks. While several optimisation\ntechniques have been developed to mitigate this issue for pre-selected task\ncohorts, identifying optimal task combinations for joint learning - known as\ntask grouping - remains underexplored and computationally challenging due to\nthe exponential growth in task combinations and the need for extensive training\nand evaluation cycles. This paper introduces an efficient task grouping\nframework designed to reduce these overwhelming computational demands of the\nexisting methods. The proposed framework infers pairwise task similarities\nthrough a sample-wise optimisation landscape analysis, eliminating the need for\nthe shared model training required to infer task similarities in existing\nmethods. With task similarities acquired, a graph-based clustering algorithm is\nemployed to pinpoint near-optimal task groups, providing an approximate yet\nefficient and effective solution to the originally NP-hard problem. Empirical\nassessments conducted on 8 different datasets highlight the effectiveness of\nthe proposed framework, revealing a five-fold speed enhancement compared to\nprevious state-of-the-art methods. 
Moreover, the framework consistently\ndemonstrates comparable performance, confirming its remarkable efficiency and\neffectiveness in task grouping.\n","authors":["Anshul Thakur","Yichen Huang","Soheila Molaei","Yujiang Wang","David A. Clifton"],"pdf_url":"https://arxiv.org/pdf/2412.04413v1.pdf","comment":"Under review at IEEE Transactions on Pattern Analysis and Machine\n Intelligence"},{"id":"http://arxiv.org/abs/2412.04409v1","updated":"2024-12-05T18:31:14Z","published":"2024-12-05T18:31:14Z","title":"Stabilizing and Solving Inverse Problems using Data and Machine Learning","summary":" We consider an inverse problem involving the reconstruction of the solution\nto a nonlinear partial differential equation (PDE) with unknown boundary\nconditions. Instead of direct boundary data, we are provided with a large\ndataset of boundary observations for typical solutions (collective data) and a\nbulk measurement of a specific realization. To leverage this collective data,\nwe first compress the boundary data using proper orthogonal decomposition (POD)\nin a linear expansion. Next, we identify a possible nonlinear low-dimensional\nstructure in the expansion coefficients using an auto-encoder, which provides a\nparametrization of the dataset in a lower-dimensional latent space. We then\ntrain a neural network to map the latent variables representing the boundary\ndata to the solution of the PDE. Finally, we solve the inverse problem by\noptimizing a data-fitting term over the latent space.\n We analyze the underlying stabilized finite element method in the linear\nsetting and establish optimal error estimates in the $H^1$ and $L^2$-norms. The\nnonlinear problem is then studied numerically, demonstrating the effectiveness\nof our approach.\n","authors":["Erik Burman","Mats G. 
Larson","Karl Larsson","Carl Lundholm"],"pdf_url":"https://arxiv.org/pdf/2412.04409v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04408v1","updated":"2024-12-05T18:27:09Z","published":"2024-12-05T18:27:09Z","title":"Providing Differential Privacy for Federated Learning Over Wireless: A\n Cross-layer Framework","summary":" Federated Learning (FL) is a distributed machine learning framework that\ninherently allows edge devices to maintain their local training data, thus\nproviding some level of privacy. However, FL's model updates still pose a risk\nof privacy leakage, which must be mitigated. Over-the-air FL (OTA-FL) is an\nadapted FL design for wireless edge networks that leverages the natural\nsuperposition property of the wireless medium. We propose a wireless physical\nlayer (PHY) design for OTA-FL which improves differential privacy (DP) through\na decentralized, dynamic power control that utilizes both inherent Gaussian\nnoise in the wireless channel and a cooperative jammer (CJ) for additional\nartificial noise generation when higher privacy levels are required. Although\nprimarily implemented within the Upcycled-FL framework, where a\nresource-efficient method with first-order approximations is used at every even\niteration to decrease the required information from clients, our power control\nstrategy is applicable to any FL framework, including FedAvg and FedProx as\nshown in the paper. This adaptation showcases the flexibility and effectiveness\nof our design across different learning algorithms while maintaining a strong\nemphasis on privacy. Our design removes the need for client-side artificial\nnoise injection for DP, utilizing a cooperative jammer to enhance privacy\nwithout affecting transmission efficiency for higher privacy demands. Privacy\nanalysis is provided using the Moments Accountant method. 
We perform a\nconvergence analysis for non-convex objectives to tackle heterogeneous data\ndistributions, highlighting the inherent trade-offs between privacy and\naccuracy. Numerical results show that our approach with various FL algorithms\noutperforms the state-of-the-art under the same DP conditions on the non-i.i.d.\nFEMNIST dataset, and highlight the cooperative jammer's effectiveness in\nensuring strict privacy.\n","authors":["Jiayu Mao","Tongxin Yin","Aylin Yener","Mingyan Liu"],"pdf_url":"https://arxiv.org/pdf/2412.04408v1.pdf","comment":"submitted for an IEEE publication"},{"id":"http://arxiv.org/abs/2412.04404v1","updated":"2024-12-05T18:23:44Z","published":"2024-12-05T18:23:44Z","title":"Federated Automated Feature Engineering","summary":" Automated feature engineering (AutoFE) is used to automatically create new\nfeatures from original features to improve predictive performance without\nneeding significant human intervention and expertise. Many algorithms exist for\nAutoFE, but very few approaches exist for the federated learning (FL) setting\nwhere data is gathered across many clients and is not shared between clients or\na central server. We introduce AutoFE algorithms for the horizontal, vertical,\nand hybrid FL settings, which differ in how the data is gathered across\nclients. 
To the best of our knowledge, we are the first to develop AutoFE\nalgorithms for the horizontal and hybrid FL cases, and we show that the\ndownstream model performance of federated AutoFE is similar to the case where\ndata is held centrally and AutoFE is performed centrally.\n","authors":["Tom Overman","Diego Klabjan"],"pdf_url":"https://arxiv.org/pdf/2412.04404v1.pdf","comment":"Preliminary Work"},{"id":"http://arxiv.org/abs/2311.10162v3","updated":"2024-12-05T18:16:10Z","published":"2023-11-16T19:34:18Z","title":"Learning to Reconstruct Accelerated MRI Through K-space Cold Diffusion\n without Noise","summary":" Deep learning-based MRI reconstruction models have achieved superior\nperformance these days. Most recently, diffusion models have shown remarkable\nperformance in image generation, in-painting, super-resolution, image editing\nand more. As a generalized diffusion model, cold diffusion further broadens the\nscope and considers models built around arbitrary image transformations such as\nblurring, down-sampling, etc. In this paper, we propose a k-space cold\ndiffusion model that performs image degradation and restoration in k-space\nwithout the need for Gaussian noise. We provide comparisons with multiple deep\nlearning-based MRI reconstruction models and perform tests on a well-known\nlarge open-source MRI dataset. Our results show that this novel way of\nperforming degradation can generate high-quality reconstruction images for\naccelerated MRI.\n","authors":["Guoyao Shen","Mengyu Li","Chad W. Farris","Stephan Anderson","Xin Zhang"],"pdf_url":"https://arxiv.org/pdf/2311.10162v3.pdf","comment":"21 pages, 5 figures, 4 tables"},{"id":"http://arxiv.org/abs/2308.10968v2","updated":"2024-12-05T18:07:33Z","published":"2023-08-21T18:26:35Z","title":"Regularization by Neural Style Transfer for MRI Field-Transfer\n Reconstruction with Limited Data","summary":" Recent advances in MRI reconstruction have achieved remarkable success with\ndeep learning-based models. 
However, most methods depend on large-scale,\ntask-specific datasets, leaving reconstruction in data-limited settings as a\ncritical but underexplored challenge. Regularization by denoising (RED) is a\ngeneral pipeline that incorporates a denoiser as a prior for image\nreconstruction, showing promising results in various image processing tasks,\nincluding denoising, deblurring, and super-resolution. In this work, we propose\na regularization by neural style transfer (RNST) method to further leverage the\npriors from the neural transfer and denoising engine. RNST effectively\nreconstructs high-quality images from noisy, low-quality inputs across varying\nimage styles, even with limited data. We validate RNST on clinical MRI scans,\ndemonstrating its ability to significantly improve image quality. These\nfindings underline the potential of RNST for MRI field-transfer reconstruction\nand its promise in addressing reconstruction tasks in data-constrained\nscenarios.\n","authors":["Guoyao Shen","Yancheng Zhu","Mengyu Li","Ryan McNaughton","Hernan Jara","Sean B. Andersson","Chad W. Farris","Stephan Anderson","Xin Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.10968v2.pdf","comment":"31 pages, 9 figures, 3 tables, 1 algorithm chart"},{"id":"http://arxiv.org/abs/2412.04392v1","updated":"2024-12-05T18:06:09Z","published":"2024-12-05T18:06:09Z","title":"Asynchronous Batch Bayesian Optimization with Pipelining Evaluations for\n Experimental Resource$\\unicode{x2013}$constrained Conditions","summary":" Bayesian optimization is efficient even with a small amount of data and is\nused in engineering and in science, including biology and chemistry. In\nBayesian optimization, a parameterized model with an uncertainty is fitted to\nexplain the experimental data, and then the model suggests parameters that\nwould most likely improve the results. Batch Bayesian optimization reduces the\nprocessing time of optimization by parallelizing experiments. 
However, batch\nBayesian optimization cannot be applied if the number of parallelized\nexperiments is limited by the cost or scarcity of equipment; in such cases,\nsequential methods require an unrealistic amount of time. In this study, we\ndeveloped pipelining Bayesian optimization (PipeBO) to reduce the processing\ntime of optimization even with a limited number of parallel experiments. PipeBO\nwas inspired by the pipelining of central processing unit architecture, which\ndivides computational tasks into multiple processes. PipeBO was designed to\nachieve experiment parallelization by overlapping various processes of the\nexperiments. PipeBO uses the results of completed experiments to update the\nparameters of running parallelized experiments. Using the Black-Box\nOptimization Benchmarking, which consists of 24 benchmark functions, we\ncompared PipeBO with the sequential Bayesian optimization methods. PipeBO\nreduced the average processing time of optimization to about 56% for the\nexperiments that consisted of two processes or even less for those with more\nprocesses for 20 out of the 24 functions. Overall, PipeBO parallelizes Bayesian\noptimization in the resource-constrained settings so that efficient\noptimization can be achieved.\n","authors":["Yujin Taguchi","Yusuke Shibuya","Yusuke Hiki","Takashi Morikura","Takahiro G. Yamada","Akira Funahashi"],"pdf_url":"https://arxiv.org/pdf/2412.04392v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04384v1","updated":"2024-12-05T17:59:58Z","published":"2024-12-05T17:59:58Z","title":"Probabilistic Gaussian Superposition for Efficient 3D Occupancy\n Prediction","summary":" 3D semantic occupancy prediction is an important task for robust\nvision-centric autonomous driving, which predicts fine-grained geometry and\nsemantics of the surrounding scene. Most existing methods leverage dense\ngrid-based scene representations, overlooking the spatial sparsity of the\ndriving scenes. 
Although 3D semantic Gaussian serves as an object-centric\nsparse alternative, most of the Gaussians still describe the empty region with\nlow efficiency. To address this, we propose a probabilistic Gaussian\nsuperposition model which interprets each Gaussian as a probability\ndistribution of its neighborhood being occupied and conforms to probabilistic\nmultiplication to derive the overall geometry. Furthermore, we adopt the exact\nGaussian mixture model for semantics calculation to avoid unnecessary\noverlapping of Gaussians. To effectively initialize Gaussians in non-empty\nregion, we design a distribution-based initialization module which learns the\npixel-aligned occupancy distribution instead of the depth of surfaces. We\nconduct extensive experiments on nuScenes and KITTI-360 datasets and our\nGaussianFormer-2 achieves state-of-the-art performance with high efficiency.\nCode: https://github.com/huang-yh/GaussianFormer.\n","authors":["Yuanhui Huang","Amonnut Thammatadatrakoon","Wenzhao Zheng","Yunpeng Zhang","Dalong Du","Jiwen Lu"],"pdf_url":"https://arxiv.org/pdf/2412.04384v1.pdf","comment":"Code is available at: https://github.com/huang-yh/GaussianFormer"},{"id":"http://arxiv.org/abs/2412.04380v1","updated":"2024-12-05T17:57:09Z","published":"2024-12-05T17:57:09Z","title":"EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online\n Scene Understanding","summary":" 3D occupancy prediction provides a comprehensive description of the\nsurrounding scenes and has become an essential task for 3D perception. Most\nexisting methods focus on offline perception from one or a few views and cannot\nbe applied to embodied agents which demands to gradually perceive the scene\nthrough progressive embodied exploration. In this paper, we formulate an\nembodied 3D occupancy prediction task to target this practical scenario and\npropose a Gaussian-based EmbodiedOcc framework to accomplish it. 
We initialize\nthe global scene with uniform 3D semantic Gaussians and progressively update\nlocal regions observed by the embodied agent. For each update, we extract\nsemantic and structural features from the observed image and efficiently\nincorporate them via deformable cross-attention to refine the regional\nGaussians. Finally, we employ Gaussian-to-voxel splatting to obtain the global\n3D occupancy from the updated 3D Gaussians. Our EmbodiedOcc assumes an unknown\n(i.e., uniformly distributed) environment and maintains an explicit global\nmemory of it with 3D Gaussians. It gradually gains knowledge through local\nrefinement of regional Gaussians, which is consistent with how humans\nunderstand new scenes through embodied exploration. We reorganize an\nEmbodiedOcc-ScanNet benchmark based on local annotations to facilitate the\nevaluation of the embodied 3D occupancy prediction task. Experiments\ndemonstrate that our EmbodiedOcc outperforms existing local prediction methods\nand accomplishes the embodied occupancy prediction with high accuracy and\nstrong expandability. Our code is available at:\nhttps://github.com/YkiWu/EmbodiedOcc.\n","authors":["Yuqi Wu","Wenzhao Zheng","Sicheng Zuo","Yuanhui Huang","Jie Zhou","Jiwen Lu"],"pdf_url":"https://arxiv.org/pdf/2412.04380v1.pdf","comment":"Code: https://github.com/YkiWu/EmbodiedOcc"},{"id":"http://arxiv.org/abs/2412.04377v1","updated":"2024-12-05T17:52:35Z","published":"2024-12-05T17:52:35Z","title":"A Hitchhiker's Guide to Understanding Performances of Two-Class\n Classifiers","summary":" Properly understanding the performances of classifiers is essential in\nvarious scenarios. However, the literature often relies only on one or two\nstandard scores to compare classifiers, which fails to capture the nuances of\napplication-specific requirements, potentially leading to suboptimal classifier\nselection. 
Recently, a paper on the foundations of the theory of\nperformance-based ranking introduced a tool, called the Tile, that organizes an\ninfinity of ranking scores into a 2D map. Thanks to the Tile, it is now\npossible to evaluate and compare classifiers efficiently, displaying all\npossible application-specific preferences instead of having to rely on a pair\nof scores. In this paper, we provide a first hitchhiker's guide for\nunderstanding the performances of two-class classifiers by presenting four\nscenarios, each showcasing a different user profile: a theoretical analyst, a\nmethod designer, a benchmarker, and an application developer. Particularly, we\nshow that we can provide different interpretative flavors that are adapted to\nthe user's needs by mapping different values on the Tile. As an illustration,\nwe leverage the newly introduced Tile tool and the different flavors to rank\nand analyze the performances of 74 state-of-the-art semantic segmentation\nmodels in two-class classification through the eyes of the four user profiles.\nThrough these user profiles, we demonstrate that the Tile effectively captures\nthe behavior of classifiers in a single visualization, while accommodating an\ninfinite number of ranking scores.\n","authors":["Anaïs Halin","Sébastien Piérard","Anthony Cioppa","Marc Van Droogenbroeck"],"pdf_url":"https://arxiv.org/pdf/2412.04377v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.11224v2","updated":"2024-12-05T17:44:09Z","published":"2024-11-18T01:27:44Z","title":"Don't Be So Positive: Negative Step Sizes in Second-Order Methods","summary":" The value of second-order methods lies in the use of curvature information.\nYet, this information is costly to extract and once obtained, valuable negative\ncurvature information is often discarded so that the method is globally\nconvergent. This limits the effectiveness of second-order methods in modern\nmachine learning. 
In this paper, we show that second-order and\nsecond-order-like methods are promising optimizers for neural networks provided\nthat we add one ingredient: negative step sizes. We show that under very\ngeneral conditions, methods that produce ascent directions are globally\nconvergent when combined with a Wolfe line search that allows both positive and\nnegative step sizes. We experimentally demonstrate that using negative step\nsizes is often more effective than common Hessian modification methods.\n","authors":["Betty Shea","Mark Schmidt"],"pdf_url":"https://arxiv.org/pdf/2411.11224v2.pdf","comment":"added affiliation and more references"},{"id":"http://arxiv.org/abs/2412.04368v1","updated":"2024-12-05T17:36:22Z","published":"2024-12-05T17:36:22Z","title":"Finer Behavioral Foundation Models via Auto-Regressive Features and\n Advantage Weighting","summary":" The forward-backward representation (FB) is a recently proposed framework\n(Touati et al., 2023; Touati & Ollivier, 2021) to train behavior foundation\nmodels (BFMs) that aim at providing zero-shot efficient policies for any new\ntask specified in a given reinforcement learning (RL) environment, without\ntraining for each new task. Here we address two core limitations of FB model\ntraining. First, FB, like all successor-feature-based methods, relies on a\nlinear encoding of tasks: at test time, each new reward function is linearly\nprojected onto a fixed set of pre-trained features. This limits expressivity as\nwell as precision of the task representation. We break the linearity limitation\nby introducing auto-regressive features for FB, which let finegrained task\nfeatures depend on coarser-grained task information. This can represent\narbitrary nonlinear task encodings, thus significantly increasing expressivity\nof the FB framework. 
Second, it is well known that training RL agents from\noffline datasets often requires specific techniques. We show that FB works well\ntogether with such offline RL techniques, by adapting techniques from (Nair et\nal., 2020b; Cetin et al., 2024) for FB. This is necessary to get non-flatlining\nperformance in some datasets, such as DMC Humanoid. As a result, we produce\nefficient FB BFMs for a number of new environments. Notably, in the D4RL\nlocomotion benchmark, the generic FB agent matches the performance of standard\nsingle-task offline agents (IQL, XQL). In many setups, the offline techniques\nare needed to get any decent performance at all. The auto-regressive features\nhave a positive but moderate impact, concentrated on tasks requiring spatial\nprecision and task generalization beyond the behaviors represented in the\ntrainset.\n","authors":["Edoardo Cetin","Ahmed Touati","Yann Ollivier"],"pdf_url":"https://arxiv.org/pdf/2412.04368v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04367v1","updated":"2024-12-05T17:35:29Z","published":"2024-12-05T17:35:29Z","title":"Machine Theory of Mind for Autonomous Cyber-Defence","summary":" Intelligent autonomous agents hold much potential for the domain of\ncyber-security. However, due to many state-of-the-art approaches relying on\nuninterpretable black-box models, there is growing demand for methods that\noffer stakeholders clear and actionable insights into their latent beliefs and\nmotivations. To address this, we evaluate Theory of Mind (ToM) approaches for\nAutonomous Cyber Operations. Upon learning a robust prior, ToM models can\npredict an agent's goals, behaviours, and contextual beliefs given only a\nhandful of past behaviour observations. 
In this paper, we introduce a novel\nGraph Neural Network (GNN)-based ToM architecture tailored for cyber-defence,\nGraph-In, Graph-Out (GIGO)-ToM, which can accurately predict both the targets\nand attack trajectories of adversarial cyber agents over arbitrary computer\nnetwork topologies. To evaluate the latter, we propose a novel extension of the\nWasserstein distance for measuring the similarity of graph-based probability\ndistributions. Whereas the standard Wasserstein distance lacks a fixed\nreference scale, we introduce a graph-theoretic normalization factor that\nenables a standardized comparison between networks of different sizes. We\nfurnish this metric, which we term the Network Transport Distance (NTD), with a\nweighting function that emphasizes predictions according to custom node\nfeatures, allowing network operators to explore arbitrary strategic\nconsiderations. Benchmarked against a Graph-In, Dense-Out (GIDO)-ToM\narchitecture in an abstract cyber-defence environment, our empirical\nevaluations show that GIGO-ToM can accurately predict the goals and behaviours\nof various unseen cyber-attacking agents across a range of network topologies,\nas well as learn embeddings that can effectively characterize their policies.\n","authors":["Luke Swaby","Matthew Stewart","Daniel Harrold","Chris Willis","Gregory Palmer"],"pdf_url":"https://arxiv.org/pdf/2412.04367v1.pdf","comment":"29 pages, 17 figures, 12 tables"},{"id":"http://arxiv.org/abs/2401.01951v2","updated":"2024-12-05T17:31:43Z","published":"2024-01-03T19:27:20Z","title":"GeoPos: A Minimal Positional Encoding for Enhanced Fine-Grained Details\n in Image Synthesis Using Convolutional Neural Networks","summary":" The enduring inability of image generative models to recreate intricate\ngeometric features, such as those present in human hands and fingers has been\nan ongoing problem in image generation for nearly a decade. 
While strides have\nbeen made by increasing model sizes and diversifying training datasets, this\nissue remains prevalent across all models, from denoising diffusion models to\nGenerative Adversarial Networks (GAN), pointing to a fundamental shortcoming in\nthe underlying architectures. In this paper, we demonstrate how this problem\ncan be mitigated by augmenting convolution layers' geometric capabilities\nby providing them with a single input channel incorporating the relative\nn-dimensional Cartesian coordinate system. We show this drastically improves\nthe quality of images generated by Diffusion Models, GANs, and Variational\nAutoEncoders (VAE).\n","authors":["Mehran Hosseini","Peyman Hosseini"],"pdf_url":"https://arxiv.org/pdf/2401.01951v2.pdf","comment":"Accepted at WACV 2025. Contains 19 pages, 15 figures, and 9 tables"},{"id":"http://arxiv.org/abs/2410.01910v2","updated":"2024-12-05T17:22:21Z","published":"2024-10-02T18:09:12Z","title":"Is uniform expressivity too restrictive? Towards efficient expressivity\n of graph neural networks","summary":" Uniform expressivity guarantees that a Graph Neural Network (GNN) can express\na query without the parameters depending on the size of the input graphs. This\nproperty is desirable in applications in order to have a number of trainable\nparameters that is independent of the size of the input graphs. Uniform\nexpressivity of the two variable guarded fragment (GC2) of first order logic is\na well-celebrated result for Rectified Linear Unit (ReLU) GNNs [Barcelo & al.,\n2020]. In this article, we prove that uniform expressivity of GC2 queries is\nnot possible for GNNs with a wide class of Pfaffian activation functions\n(including the sigmoid and tanh), answering a question formulated by [Grohe,\n2021]. We also show that despite these limitations, many of those GNNs can\nstill efficiently express GC2 queries in a way that the number of parameters\nremains logarithmic in the maximal degree of the input graphs. 
Furthermore, we\ndemonstrate that a log-log dependency on the degree is achievable for a certain\nchoice of activation function. This shows that uniform expressivity can be\nsuccessfully relaxed by covering large graphs appearing in practical\napplications. Our experiments illustrate that our theoretical estimates hold\nin practice.\n","authors":["Sammy Khalife","Josué Tonelli-Cueto"],"pdf_url":"https://arxiv.org/pdf/2410.01910v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.13000v2","updated":"2024-12-05T17:19:12Z","published":"2024-09-19T15:38:21Z","title":"Introducing the Large Medical Model: State of the art healthcare cost\n and risk prediction with transformers trained on patient event sequences","summary":" With U.S. healthcare spending approaching $5T (NHE Fact Sheet 2024), and 25%\nof it estimated to be wasteful (Waste in the US the health care system:\nestimated costs and potential for savings, n.d.), the need to better predict\nrisk and optimal patient care is ever more important. This paper introduces the\nLarge Medical Model (LMM), a generative pre-trained transformer (GPT) designed\nto guide and predict the broad facets of patient care and healthcare\nadministration. The model is trained on medical event sequences from over 140M\nlongitudinal patient claims records with a specialized vocabulary built from\nmedical terminology systems and demonstrates a superior capability to forecast\nhealthcare costs and identify potential risk factors. Through experimentation\nand validation, we showcase the LMM's proficiency not only in cost and risk\npredictions, but also in discerning intricate patterns within complex medical\nconditions and an ability to identify novel relationships in patient care. The\nLMM is able to improve both cost prediction by 14.1% over the best commercial\nmodels and chronic conditions prediction by 1.9% over the best transformer\nmodels in research predicting a broad set of conditions. 
The LMM is a\nsubstantial advancement in healthcare analytics, offering the potential to\nsignificantly enhance risk assessment, cost management, and personalized\nmedicine.\n","authors":["Ricky Sahu","Eric Marriott","Ethan Siegel","David Wagner","Flore Uzan","Troy Yang","Asim Javed"],"pdf_url":"https://arxiv.org/pdf/2409.13000v2.pdf","comment":"10 pages, 10 figures"},{"id":"http://arxiv.org/abs/2412.04358v1","updated":"2024-12-05T17:17:28Z","published":"2024-12-05T17:17:28Z","title":"Approximate Top-$k$ for Increased Parallelism","summary":" We present an evaluation of bucketed approximate top-$k$ algorithms.\nComputing top-$k$ exactly suffers from limited parallelism, because the $k$\nlargest values must be aggregated along the vector, thus is not well suited to\ncomputation on highly-parallel machine learning accelerators. By relaxing the\nrequirement that the top-$k$ is exact, bucketed algorithms can dramatically\nincrease the parallelism available by independently computing many smaller\ntop-$k$ operations. We explore the design choices of this class of algorithms\nusing both theoretical analysis and empirical evaluation on downstream tasks.\nOur motivating examples are sparsity algorithms for language models, which\noften use top-$k$ to select the most important parameters or activations. 
We\nalso release a fast bucketed top-$k$ implementation for PyTorch.\n","authors":["Oscar Key","Luka Ribar","Alberto Cattaneo","Luke Hudlass-Galley","Douglas Orr"],"pdf_url":"https://arxiv.org/pdf/2412.04358v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04354v1","updated":"2024-12-05T17:12:45Z","published":"2024-12-05T17:12:45Z","title":"Multi-Scale Node Embeddings for Graph Modeling and Generation","summary":" Lying at the interface between Network Science and Machine Learning, node\nembedding algorithms take a graph as input and encode its structure onto output\nvectors that represent nodes in an abstract geometric space, enabling various\nvector-based downstream tasks such as network modelling, data compression, link\nprediction, and community detection. Two apparently unrelated limitations\naffect these algorithms. On one hand, it is not clear what the basic operation\ndefining vector spaces, i.e. the vector sum, corresponds to in terms of the\noriginal nodes in the network. On the other hand, while the same input network\ncan be represented at multiple levels of resolution by coarse-graining the\nconstituent nodes into arbitrary block-nodes, the relationship between node\nembeddings obtained at different hierarchical levels is not understood. Here,\nbuilding on recent results in network renormalization theory, we address these\ntwo limitations at once and define a multiscale node embedding method that,\nupon arbitrary coarse-grainings, ensures statistical consistency of the\nembedding vector of a block-node with the sum of the embedding vectors of its\nconstituent nodes. We illustrate the power of this approach on two economic\nnetworks that can be naturally represented at multiple resolution levels:\nnamely, the international trade between (sets of) countries and the\ninput-output flows among (sets of) industries in the Netherlands. 
We confirm\nthe statistical consistency between networks retrieved from coarse-grained node\nvectors and networks retrieved from sums of fine-grained node vectors, a result\nthat cannot be achieved by alternative methods. Several key network properties,\nincluding a large number of triangles, are successfully replicated already from\nembeddings of very low dimensionality, allowing for the generation of faithful\nreplicas of the original networks at arbitrary resolution levels.\n","authors":["Riccardo Milocco","Fabian Jansen","Diego Garlaschelli"],"pdf_url":"https://arxiv.org/pdf/2412.04354v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04353v1","updated":"2024-12-05T17:12:35Z","published":"2024-12-05T17:12:35Z","title":"ActFusion: a Unified Diffusion Model for Action Segmentation and\n Anticipation","summary":" Temporal action segmentation and long-term action anticipation are two\npopular vision tasks for the temporal analysis of actions in videos. Despite\napparent relevance and potential complementarity, these two problems have been\ninvestigated as separate and distinct tasks. In this work, we tackle these two\nproblems, action segmentation and action anticipation, jointly using a unified\ndiffusion model dubbed ActFusion. The key idea to unification is to train the\nmodel to effectively handle both visible and invisible parts of the sequence in\nan integrated manner; the visible part is for temporal segmentation, and the\ninvisible part is for future anticipation. To this end, we introduce a new\nanticipative masking strategy during training in which a late part of the video\nframes is masked as invisible, and learnable tokens replace these frames to\nlearn to predict the invisible future. Experimental results demonstrate the\nbi-directional benefits between action segmentation and anticipation. 
ActFusion\nachieves the state-of-the-art performance across the standard benchmarks of 50\nSalads, Breakfast, and GTEA, outperforming task-specific models in both of the\ntwo tasks with a single unified model through joint learning.\n","authors":["Dayoung Gong","Suha Kwak","Minsu Cho"],"pdf_url":"https://arxiv.org/pdf/2412.04353v1.pdf","comment":"Accepted to NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.04346v1","updated":"2024-12-05T17:05:49Z","published":"2024-12-05T17:05:49Z","title":"Distributionally Robust Performative Prediction","summary":" Performative prediction aims to model scenarios where predictive outcomes\nsubsequently influence the very systems they target. The pursuit of a\nperformative optimum (PO) -- minimizing performative risk -- is generally\nreliant on modeling of the distribution map, which characterizes how a deployed\nML model alters the data distribution. Unfortunately, inevitable\nmisspecification of the distribution map can lead to a poor approximation of\nthe true PO. To address this issue, we introduce a novel framework of\ndistributionally robust performative prediction and study a new solution\nconcept termed as distributionally robust performative optimum (DRPO). We show\nprovable guarantees for DRPO as a robust approximation to the true PO when the\nnominal distribution map is different from the actual one. Moreover,\ndistributionally robust performative prediction can be reformulated as an\naugmented performative prediction problem, enabling efficient optimization. 
The\nexperimental results demonstrate that DRPO offers potential advantages over\ntraditional PO approach when the distribution map is misspecified at either\nmicro- or macro-level.\n","authors":["Songkai Xue","Yuekai Sun"],"pdf_url":"https://arxiv.org/pdf/2412.04346v1.pdf","comment":"In Proceedings of the 38th Conference on Neural Information\n Processing Systems (NeurIPS) 2024"},{"id":"http://arxiv.org/abs/2410.16340v3","updated":"2024-12-05T17:03:34Z","published":"2024-10-21T09:39:10Z","title":"Limit Theorems for Stochastic Gradient Descent with Infinite Variance","summary":" Stochastic gradient descent is a classic algorithm that has gained great\npopularity especially in the last decades as the most common approach for\ntraining models in machine learning. While the algorithm has been well-studied\nwhen stochastic gradients are assumed to have a finite variance, there is\nsignificantly less research addressing its theoretical properties in the case\nof infinite variance gradients. In this paper, we establish the asymptotic\nbehavior of stochastic gradient descent in the context of infinite variance\nstochastic gradients, assuming that the stochastic gradient is regular varying\nwith index $\\alpha\\in(1,2)$. The closest result in this context was established\nin 1969 , in the one-dimensional case and assuming that stochastic gradients\nbelong to a more restrictive class of distributions. We extend it to the\nmultidimensional case, covering a broader class of infinite variance\ndistributions. As we show, the asymptotic distribution of the stochastic\ngradient descent algorithm can be characterized as the stationary distribution\nof a suitably defined Ornstein-Uhlenbeck process driven by an appropriate\nstable L\\'evy process. 
Additionally, we explore the applications of these\nresults in linear regression and logistic regression models.\n","authors":["Jose Blanchet","Aleksandar Mijatović","Wenhao Yang"],"pdf_url":"https://arxiv.org/pdf/2410.16340v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04339v1","updated":"2024-12-05T16:58:45Z","published":"2024-12-05T16:58:45Z","title":"Likelihood-Scheduled Score-Based Generative Modeling for Fully 3D PET\n Image Reconstruction","summary":" Medical image reconstruction with pre-trained score-based generative models\n(SGMs) has advantages over other existing state-of-the-art deep-learned\nreconstruction methods, including improved resilience to different scanner\nsetups and advanced image distribution modeling. SGM-based reconstruction has\nrecently been applied to simulated positron emission tomography (PET) datasets,\nshowing improved contrast recovery for out-of-distribution lesions relative to\nthe state-of-the-art. However, existing methods for SGM-based reconstruction\nfrom PET data suffer from slow reconstruction, burdensome hyperparameter tuning\nand slice inconsistency effects (in 3D). In this work, we propose a practical\nmethodology for fully 3D reconstruction that accelerates reconstruction and\nreduces the number of critical hyperparameters by matching the likelihood of an\nSGM's reverse diffusion process to a current iterate of the maximum-likelihood\nexpectation maximization algorithm. Using the example of low-count\nreconstruction from simulated $[^{18}$F]DPA-714 datasets, we show our\nmethodology can match or improve on the NRMSE and SSIM of existing\nstate-of-the-art SGM-based PET reconstruction while reducing reconstruction\ntime and the need for hyperparameter tuning. 
We evaluate our methodology\nagainst state-of-the-art supervised and conventional reconstruction algorithms.\nFinally, we demonstrate a first-ever implementation of SGM-based reconstruction\nfor real 3D PET data, specifically $[^{18}$F]DPA-714 data, where we integrate\nperpendicular pre-trained SGMs to eliminate slice inconsistency issues.\n","authors":["George Webber","Yuya Mizuno","Oliver D. Howes","Alexander Hammers","Andrew P. King","Andrew J. Reader"],"pdf_url":"https://arxiv.org/pdf/2412.04339v1.pdf","comment":"11 pages, 12 figures. Submitted to Transactions on Medical Imaging"},{"id":"http://arxiv.org/abs/2412.04327v1","updated":"2024-12-05T16:42:45Z","published":"2024-12-05T16:42:45Z","title":"Action Mapping for Reinforcement Learning in Continuous Environments\n with Constraints","summary":" Deep reinforcement learning (DRL) has had success across various domains, but\napplying it to environments with constraints remains challenging due to poor\nsample efficiency and slow convergence. Recent literature explored\nincorporating model knowledge to mitigate these problems, particularly through\nthe use of models that assess the feasibility of proposed actions. However,\nintegrating feasibility models efficiently into DRL pipelines in environments\nwith continuous action spaces is non-trivial. We propose a novel DRL training\nstrategy utilizing action mapping that leverages feasibility models to\nstreamline the learning process. By decoupling the learning of feasible actions\nfrom policy optimization, action mapping allows DRL agents to focus on\nselecting the optimal action from a reduced feasible action set. We demonstrate\nthrough experiments that action mapping significantly improves training\nperformance in constrained environments with continuous action spaces,\nespecially with imperfect feasibility models.\n","authors":["Mirco Theile","Lukas Dirnberger","Raphael Trumpp","Marco Caccamo","Alberto L. 
Sangiovanni-Vincentelli"],"pdf_url":"https://arxiv.org/pdf/2412.04327v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04323v1","updated":"2024-12-05T16:39:01Z","published":"2024-12-05T16:39:01Z","title":"GRAM: Generalization in Deep RL with a Robust Adaptation Module","summary":" The reliable deployment of deep reinforcement learning in real-world settings\nrequires the ability to generalize across a variety of conditions, including\nboth in-distribution scenarios seen during training as well as novel\nout-of-distribution scenarios. In this work, we present a framework for\ndynamics generalization in deep reinforcement learning that unifies these two\ndistinct types of generalization within a single architecture. We introduce a\nrobust adaptation module that provides a mechanism for identifying and reacting\nto both in-distribution and out-of-distribution environment dynamics, along\nwith a joint training pipeline that combines the goals of in-distribution\nadaptation and out-of-distribution robustness. Our algorithm GRAM achieves\nstrong generalization performance across in-distribution and\nout-of-distribution scenarios upon deployment, which we demonstrate on a\nvariety of realistic simulated locomotion tasks with a quadruped robot.\n","authors":["James Queeney","Xiaoyi Cai","Mouhacine Benosman","Jonathan P. How"],"pdf_url":"https://arxiv.org/pdf/2412.04323v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.02951v2","updated":"2024-12-05T16:35:46Z","published":"2023-10-04T16:41:36Z","title":"A Fisher-Rao gradient flow for entropy-regularised Markov decision\n processes in Polish spaces","summary":" We study the global convergence of a Fisher-Rao policy gradient flow for\ninfinite-horizon entropy-regularised Markov decision processes with Polish\nstate and action space. The flow is a continuous-time analogue of a policy\nmirror descent method. 
We establish the global well-posedness of the gradient\nflow and demonstrate its exponential convergence to the optimal policy.\nMoreover, we prove the flow is stable with respect to gradient evaluation,\noffering insights into the performance of a natural policy gradient flow with\nlog-linear policy parameterisation. To overcome challenges stemming from the\nlack of the convexity of the objective function and the discontinuity arising\nfrom the entropy regulariser, we leverage the performance difference lemma and\nthe duality relationship between the gradient and mirror descent flows. Our\nanalysis provides a theoretical foundation for developing various discrete\npolicy gradient algorithms.\n","authors":["Bekzhan Kerimkulov","James-Michael Leahy","David Siska","Lukasz Szpruch","Yufei Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.02951v2.pdf","comment":"add discretizations of gradient flow and their convergence analysis"},{"id":"http://arxiv.org/abs/2412.04319v1","updated":"2024-12-05T16:35:43Z","published":"2024-12-05T16:35:43Z","title":"Generative-Model-Based Fully 3D PET Image Reconstruction by Conditional\n Diffusion Sampling","summary":" Score-based generative models (SGMs) have recently shown promising results\nfor image reconstruction on simulated positron emission tomography (PET)\ndatasets. In this work we have developed and implemented practical methodology\nfor 3D image reconstruction with SGMs, and perform (to our knowledge) the first\nSGM-based reconstruction of real fully 3D PET data. We train an SGM on\nfull-count reference brain images, and extend methodology to allow SGM-based\nreconstructions at very low counts (1% of original, to simulate low-dose or\nshort-duration scanning). We then perform reconstructions for multiple\nindependent realisations of 1% count data, allowing us to analyse the bias and\nvariance characteristics of the method. 
We sample from the learned posterior\ndistribution of the generative algorithm to calculate uncertainty images for\nour reconstructions. We evaluate the method's performance on real full- and\nlow-count PET data and compare with conventional OSEM and MAP-EM baselines,\nshowing that our SGM-based low-count reconstructions match full-dose\nreconstructions more closely and in a bias-variance trade-off comparison, our\nSGM-reconstructed images have lower variance than existing baselines. Future\nwork will compare to supervised deep-learned methods, with other avenues for\ninvestigation including how data conditioning affects the SGM's posterior\ndistribution and the algorithm's performance with different tracers.\n","authors":["George Webber","Yuya Mizuno","Oliver D. Howes","Alexander Hammers","Andrew P. King","Andrew J. Reader"],"pdf_url":"https://arxiv.org/pdf/2412.04319v1.pdf","comment":"2 pages, 2 figures. Accepted for oral presentation at IEEE NSS MIC\n RTSD 2024 (submitted May 2024; accepted July 2024; presented Nov 2024)"},{"id":"http://arxiv.org/abs/2311.12068v3","updated":"2024-12-05T16:34:21Z","published":"2023-11-19T17:28:28Z","title":"Enhancing Novel Object Detection via Cooperative Foundational Models","summary":" In this work, we address the challenging and emergent problem of novel object\ndetection (NOD), focusing on the accurate detection of both known and novel\nobject categories during inference. Traditional object detection algorithms are\ninherently closed-set, limiting their capability to handle NOD. We present a\nnovel approach to transform existing closed-set detectors into open-set\ndetectors. This transformation is achieved by leveraging the complementary\nstrengths of pre-trained foundational models, specifically CLIP and SAM,\nthrough our cooperative mechanism. Furthermore, by integrating this mechanism\nwith state-of-the-art open-set detectors such as GDINO, we establish new\nbenchmarks in object detection performance. 
Our method achieves 17.42 mAP in\nnovel object detection and 42.08 mAP for known objects on the challenging LVIS\ndataset. Adapting our approach to the COCO OVD split, we surpass the current\nstate-of-the-art by a margin of 7.2 $ \\text{AP}_{50} $ for novel classes. Our\ncode is available at https://rohit901.github.io/coop-foundation-models/ .\n","authors":["Rohit Bharadwaj","Muzammal Naseer","Salman Khan","Fahad Shahbaz Khan"],"pdf_url":"https://arxiv.org/pdf/2311.12068v3.pdf","comment":"Accepted at WACV 2025"},{"id":"http://arxiv.org/abs/2412.04309v1","updated":"2024-12-05T16:27:59Z","published":"2024-12-05T16:27:59Z","title":"The Tile: A 2D Map of Ranking Scores for Two-Class Classification","summary":" In the computer vision and machine learning communities, as well as in many\nother research domains, rigorous evaluation of any new method, including\nclassifiers, is essential. One key component of the evaluation process is the\nability to compare and rank methods. However, ranking classifiers and\naccurately comparing their performances, especially when taking\napplication-specific preferences into account, remains challenging. For\ninstance, commonly used evaluation tools like Receiver Operating Characteristic\n(ROC) and Precision/Recall (PR) spaces display performances based on two\nscores. Hence, they are inherently limited in their ability to compare\nclassifiers across a broader range of scores and lack the capability to\nestablish a clear ranking among classifiers. In this paper, we present a novel\nversatile tool, named the Tile, that organizes an infinity of ranking scores in\na single 2D map for two-class classifiers, including common evaluation scores\nsuch as the accuracy, the true positive rate, the positive predictive value,\nJaccard's coefficient, and all F-beta scores. 
Furthermore, we study the\nproperties of the underlying ranking scores, such as the influence of the\npriors or the correspondences with the ROC space, and depict how to\ncharacterize any other score by comparing them to the Tile. Overall, we\ndemonstrate that the Tile is a powerful tool that effectively captures all the\nrankings in a single visualization and allows interpreting them.\n","authors":["Sébastien Piérard","Anaïs Halin","Anthony Cioppa","Adrien Deliège","Marc Van Droogenbroeck"],"pdf_url":"https://arxiv.org/pdf/2412.04309v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.18825v2","updated":"2024-12-05T16:27:08Z","published":"2024-11-27T23:58:32Z","title":"ELEMENTAL: Interactive Learning from Demonstrations and Vision-Language\n Models for Reward Design in Robotics","summary":" Reinforcement learning (RL) has demonstrated compelling performance in\nrobotic tasks, but its success often hinges on the design of complex, ad hoc\nreward functions. Researchers have explored how Large Language Models (LLMs)\ncould enable non-expert users to specify reward functions more easily. However,\nLLMs struggle to balance the importance of different features, generalize\npoorly to out-of-distribution robotic tasks, and cannot represent the problem\nproperly with only text-based descriptions. To address these challenges, we\npropose ELEMENTAL (intEractive LEarning froM dEmoNstraTion And Language), a\nnovel framework that combines natural language guidance with visual user\ndemonstrations to align robot behavior with user intentions better. By\nincorporating visual inputs, ELEMENTAL overcomes the limitations of text-only\ntask specifications, while leveraging inverse reinforcement learning (IRL) to\nbalance feature weights and match the demonstrated behaviors optimally.\nELEMENTAL also introduces an iterative feedback-loop through self-reflection to\nimprove feature, reward, and policy learning. 
Our experiment results\ndemonstrate that ELEMENTAL outperforms prior work by 42.3% on task success, and\nachieves 41.3% better generalization in out-of-distribution tasks, highlighting\nits robustness in LfD.\n","authors":["Letian Chen","Matthew Gombolay"],"pdf_url":"https://arxiv.org/pdf/2411.18825v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04305v1","updated":"2024-12-05T16:26:31Z","published":"2024-12-05T16:26:31Z","title":"ALMA: Alignment with Minimal Annotation","summary":" Recent approaches to large language model (LLM) alignment typically require\nmillions of human annotations or rely on external aligned models for synthetic\ndata generation. This paper introduces ALMA: Alignment with Minimal Annotation,\ndemonstrating that effective alignment can be achieved using only 9,000 labeled\nexamples -- less than 1% of conventional approaches. ALMA generates large\namounts of high-quality synthetic alignment data through new techniques:\ndiverse prompt synthesis via few-shot learning, diverse response generation\nwith multiple model checkpoints, and judge (reward model) enhancement through\nscore aggregation and self-distillation. Using only a pretrained Llama3 base\nmodel, 5,000 SFT examples, and 4,000 judge annotations, ALMA achieves\nperformance close to Llama3-Instruct across diverse alignment benchmarks (e.g.,\n0.1% difference on AlpacaEval 2.0 score). These results are achieved with a\nmulti-round, self-bootstrapped data synthesis and training recipe that\ncontinues to improve for 10 rounds, surpassing the typical 3-round ceiling of\nprevious methods. 
These results suggest that base models already possess\nsufficient knowledge for effective alignment, and that synthetic data\ngeneration methods can expose it.\n","authors":["Michihiro Yasunaga","Leonid Shamis","Chunting Zhou","Andrew Cohen","Jason Weston","Luke Zettlemoyer","Marjan Ghazvininejad"],"pdf_url":"https://arxiv.org/pdf/2412.04305v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.17978v2","updated":"2024-12-05T16:24:15Z","published":"2024-09-26T15:52:36Z","title":"HydraViT: Stacking Heads for a Scalable ViT","summary":" The architecture of Vision Transformers (ViTs), particularly the Multi-head\nAttention (MHA) mechanism, imposes substantial hardware demands. Deploying ViTs\non devices with varying constraints, such as mobile phones, requires multiple\nmodels of different sizes. However, this approach has limitations, such as\ntraining and storing each required model separately. This paper introduces\nHydraViT, a novel approach that addresses these limitations by stacking\nattention heads to achieve a scalable ViT. By repeatedly changing the size of\nthe embedded dimensions throughout each layer and their corresponding number of\nattention heads in MHA during training, HydraViT induces multiple subnetworks.\nThereby, HydraViT achieves adaptability across a wide spectrum of hardware\nenvironments while maintaining performance. Our experimental results\ndemonstrate the efficacy of HydraViT in achieving a scalable ViT with up to 10\nsubnetworks, covering a wide range of resource constraints. HydraViT achieves\nup to 5 p.p. more accuracy with the same GMACs and up to 7 p.p. more accuracy\nwith the same throughput on ImageNet-1K compared to the baselines, making it an\neffective solution for scenarios where hardware availability is diverse or\nvaries over time. 
Source code available at https://github.com/ds-kiel/HydraViT.\n","authors":["Janek Haberer","Ali Hojjat","Olaf Landsiedel"],"pdf_url":"https://arxiv.org/pdf/2409.17978v2.pdf","comment":"Accepted at NeurIPS'24, please cite the conference version"},{"id":"http://arxiv.org/abs/2412.04296v1","updated":"2024-12-05T16:15:32Z","published":"2024-12-05T16:15:32Z","title":"Structure-Aware Stylized Image Synthesis for Robust Medical Image\n Segmentation","summary":" Accurate medical image segmentation is essential for effective diagnosis and\ntreatment planning but is often challenged by domain shifts caused by\nvariations in imaging devices, acquisition conditions, and patient-specific\nattributes. Traditional domain generalization methods typically require\ninclusion of parts of the test domain within the training set, which is not\nalways feasible in clinical settings with limited diverse data. Additionally,\nalthough diffusion models have demonstrated strong capabilities in image\ngeneration and style transfer, they often fail to preserve the critical\nstructural information necessary for precise medical analysis. To address these\nissues, we propose a novel medical image segmentation method that combines\ndiffusion models and Structure-Preserving Network for structure-aware one-shot\nimage stylization. Our approach effectively mitigates domain shifts by\ntransforming images from various sources into a consistent style while\nmaintaining the location, size, and shape of lesions. This ensures robust and\naccurate segmentation even when the target domain is absent from the training\ndata. Experimental evaluations on colonoscopy polyp segmentation and skin\nlesion segmentation datasets show that our method enhances the robustness and\naccuracy of segmentation models, achieving superior performance metrics\ncompared to baseline models without style transfer. 
This structure-aware\nstylization framework offers a practical solution for improving medical image\nsegmentation across diverse domains, facilitating more reliable clinical\ndiagnoses.\n","authors":["Jie Bao","Zhixin Zhou","Wen Jung Li","Rui Luo"],"pdf_url":"https://arxiv.org/pdf/2412.04296v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04285v1","updated":"2024-12-05T16:06:23Z","published":"2024-12-05T16:06:23Z","title":"Deep Causal Inference for Point-referenced Spatial Data with Continuous\n Treatments","summary":" Causal reasoning is often challenging with spatial data, particularly when\nhandling high-dimensional inputs. To address this, we propose a neural network\n(NN) based framework integrated with an approximate Gaussian process to manage\nspatial interference and unobserved confounding. Additionally, we adopt a\ngeneralized propensity-score-based approach to address partially observed\noutcomes when estimating causal effects with continuous treatments. We evaluate\nour framework using synthetic, semi-synthetic, and real-world data inferred\nfrom satellite imagery. Our results demonstrate that NN-based models\nsignificantly outperform linear spatial regression models in estimating causal\neffects. Furthermore, in real-world case studies, NN-based models offer more\nreasonable predictions of causal effects, facilitating decision-making in\nrelevant applications.\n","authors":["Ziyang Jiang","Zach Calhoun","Yiling Liu","Lei Duan","David Carlson"],"pdf_url":"https://arxiv.org/pdf/2412.04285v1.pdf","comment":"16 pages, 4 figures, 5 tables"},{"id":"http://arxiv.org/abs/2411.15046v2","updated":"2024-12-05T16:04:02Z","published":"2024-11-22T16:31:36Z","title":"On Multi-Agent Inverse Reinforcement Learning","summary":" In multi-agent systems, the agent behavior is highly influenced by its\nutility function, as these utilities shape both individual goals as well as\ninteractions with the other agents. 
Inverse Reinforcement Learning (IRL) is a\nwell-established approach to inferring the utility function by observing an\nexpert behavior within a given environment. In this paper, we extend the IRL\nframework to the multi-agent setting, assuming to observe agents who are\nfollowing Nash Equilibrium (NE) policies. We theoretically investigate the set\nof utilities that explain the behavior of NE experts. Specifically, we provide\nan explicit characterization of the feasible reward set and analyze how errors\nin estimating the transition dynamics and expert behavior impact the recovered\nrewards. Building on these findings, we provide the first sample complexity\nanalysis for the multi-agent IRL problem. Finally, we provide a numerical\nevaluation of our theoretical results.\n","authors":["Till Freihaut","Giorgia Ramponi"],"pdf_url":"https://arxiv.org/pdf/2411.15046v2.pdf","comment":"Currently under review"},{"id":"http://arxiv.org/abs/2412.04274v1","updated":"2024-12-05T15:56:54Z","published":"2024-12-05T15:56:54Z","title":"Complexity of Vector-valued Prediction: From Linear Models to Stochastic\n Convex Optimization","summary":" We study the problem of learning vector-valued linear predictors: these are\nprediction rules parameterized by a matrix that maps an $m$-dimensional feature\nvector to a $k$-dimensional target. We focus on the fundamental case with a\nconvex and Lipschitz loss function, and show several new theoretical results\nthat shed light on the complexity of this problem and its connection to related\nlearning models. 
First, we give a tight characterization of the sample\ncomplexity of Empirical Risk Minimization (ERM) in this setting, establishing\nthat $\\smash{\\widetilde{\\Omega}}(k/\\epsilon^2)$ examples are necessary for ERM\nto reach $\\epsilon$ excess (population) risk; this provides for an exponential\nimprovement over recent results by Magen and Shamir (2023) in terms of the\ndependence on the target dimension $k$, and matches a classical upper bound due\nto Maurer (2016). Second, we present a black-box conversion from general\n$d$-dimensional Stochastic Convex Optimization (SCO) to vector-valued linear\nprediction, showing that any SCO problem can be embedded as a prediction\nproblem with $k=\\Theta(d)$ outputs. These results portray the setting of\nvector-valued linear prediction as bridging between two extensively studied yet\ndisparate learning models: linear models (corresponds to $k=1$) and general\n$d$-dimensional SCO (with $k=\\Theta(d)$).\n","authors":["Matan Schliserman","Tomer Koren"],"pdf_url":"https://arxiv.org/pdf/2412.04274v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04273v1","updated":"2024-12-05T15:55:23Z","published":"2024-12-05T15:55:23Z","title":"Reinforcement Learning from Wild Animal Videos","summary":" We propose to learn legged robot locomotion skills by watching thousands of\nwild animal videos from the internet, such as those featured in nature\ndocumentaries. Indeed, such videos offer a rich and diverse collection of\nplausible motion examples, which could inform how robots should move. To\nachieve this, we introduce Reinforcement Learning from Wild Animal Videos\n(RLWAV), a method to ground these motions into physical robots. We first train\na video classifier on a large-scale animal video dataset to recognize actions\nfrom RGB clips of animals in their natural habitats. 
We then train a\nmulti-skill policy to control a robot in a physics simulator, using the\nclassification score of a third-person camera capturing videos of the robot's\nmovements as a reward for reinforcement learning. Finally, we directly transfer\nthe learned policy to a real quadruped Solo. Remarkably, despite the extreme\ngap in both domain and embodiment between animals in the wild and robots, our\napproach enables the policy to learn diverse skills such as walking, jumping,\nand keeping still, without relying on reference trajectories nor skill-specific\nrewards.\n","authors":["Elliot Chane-Sane","Constant Roux","Olivier Stasse","Nicolas Mansard"],"pdf_url":"https://arxiv.org/pdf/2412.04273v1.pdf","comment":"Project website: https://elliotchanesane31.github.io/RLWAV/"},{"id":"http://arxiv.org/abs/2405.20331v2","updated":"2024-12-05T15:48:24Z","published":"2024-05-30T17:59:04Z","title":"CoSy: Evaluating Textual Explanations of Neurons","summary":" A crucial aspect of understanding the complex nature of Deep Neural Networks\n(DNNs) is the ability to explain learned concepts within their latent\nrepresentations. While methods exist to connect neurons to human-understandable\ntextual descriptions, evaluating the quality of these explanations is\nchallenging due to the lack of a unified quantitative approach. We introduce\nCoSy (Concept Synthesis), a novel, architecture-agnostic framework for\nevaluating textual explanations of latent neurons. Given textual explanations,\nour proposed framework uses a generative model conditioned on textual input to\ncreate data points representing the explanations. By comparing the neuron's\nresponse to these generated data points and control data points, we can\nestimate the quality of the explanation. 
We validate our framework through\nsanity checks and benchmark various neuron description methods for Computer\nVision tasks, revealing significant differences in quality.\n","authors":["Laura Kopf","Philine Lou Bommer","Anna Hedström","Sebastian Lapuschkin","Marina M. -C. Höhne","Kirill Bykov"],"pdf_url":"https://arxiv.org/pdf/2405.20331v2.pdf","comment":"10 pages, 5 figures"},{"id":"http://arxiv.org/abs/2411.02137v2","updated":"2024-12-05T15:46:44Z","published":"2024-11-04T14:50:15Z","title":"Finite-sample performance of the maximum likelihood estimator in\n logistic regression","summary":" Logistic regression is a classical model for describing the probabilistic\ndependence of binary responses to multivariate covariates. We consider the\npredictive performance of the maximum likelihood estimator (MLE) for logistic\nregression, assessed in terms of logistic risk. We consider two questions:\nfirst, that of the existence of the MLE (which occurs when the dataset is not\nlinearly separated), and second that of its accuracy when it exists. These\nproperties depend on both the dimension of covariates and on the signal\nstrength. In the case of Gaussian covariates and a well-specified logistic\nmodel, we obtain sharp non-asymptotic guarantees for the existence and excess\nlogistic risk of the MLE. We then generalize these results in two ways: first,\nto non-Gaussian covariates satisfying a certain two-dimensional margin\ncondition, and second to the general case of statistical learning with a\npossibly misspecified logistic model. Finally, we consider the case of a\nBernoulli design, where the behavior of the MLE is highly sensitive to the\nparameter direction.\n","authors":["Hugo Chardon","Matthieu Lerasle","Jaouad Mourtada"],"pdf_url":"https://arxiv.org/pdf/2411.02137v2.pdf","comment":"Simplified some statements and added a proof sketch in Sec. 
4"},{"id":"http://arxiv.org/abs/2412.04262v1","updated":"2024-12-05T15:42:59Z","published":"2024-12-05T15:42:59Z","title":"SynFinTabs: A Dataset of Synthetic Financial Tables for Information and\n Table Extraction","summary":" Table extraction from document images is a challenging AI problem, and\nlabelled data for many content domains is difficult to come by. Existing table\nextraction datasets often focus on scientific tables due to the vast amount of\nacademic articles that are readily available, along with their source code.\nHowever, there are significant layout and typographical differences between\ntables found across scientific, financial, and other domains. Current datasets\noften lack the words, and their positions, contained within the tables, instead\nrelying on unreliable OCR to extract these features for training modern machine\nlearning models on natural language processing tasks. Therefore, there is a\nneed for a more general method of obtaining labelled data. We present\nSynFinTabs, a large-scale, labelled dataset of synthetic financial tables. Our\nhope is that our method of generating these synthetic tables is transferable to\nother domains. To demonstrate the effectiveness of our dataset in training\nmodels to extract information from table images, we create FinTabQA, a layout\nlarge language model trained on an extractive question-answering task. We test\nour model using real-world financial tables and compare it to a\nstate-of-the-art generative model and discuss the results. 
We make the dataset,\nmodel, and dataset generation code publicly available.\n","authors":["Ethan Bradley","Muhammad Roman","Karen Rafferty","Barry Devereux"],"pdf_url":"https://arxiv.org/pdf/2412.04262v1.pdf","comment":"12 pages, 8 figures"},{"id":"http://arxiv.org/abs/2412.04259v1","updated":"2024-12-05T15:39:13Z","published":"2024-12-05T15:39:13Z","title":"SCADE: Scalable Command-line Anomaly Detection Engine","summary":" As command-line interfaces remain an integral part of high-computation\nenvironments, the risk of exploitation through stealthy, complex command-line\nabuse continues to grow. Conventional security solutions often struggle with\nthese command-line-based anomalies due to their context-specific nature and\nlack of labeled data, especially in detecting rare, malicious patterns amidst\nlegitimate, high-volume activity. This gap has left organizations vulnerable to\nsophisticated threats like Living-off-the-Land (LOL) attacks, where standard\ndetection tools frequently miss or misclassify anomalous command-line behavior.\nWe introduce Scalable Command-Line Anomaly Detection Engine (SCADE), which\naddresses these challenges by introducing a dual-layered detection framework\nthat combines a global statistical analysis with local context-specific anomaly\ndetection, innovatively using a novel ensemble of statistical models such as\nBM25 and Log Entropy, adapted for command-line data. The framework also\nfeatures a dynamic thresholding mechanism for adaptive anomaly detection,\nensuring high precision and recall even in environments with extremely high\nSignal-to-Noise Ratios (SNRs). Initial experimental results demonstrate the\neffectiveness of the framework, achieving above 98% SNR in identifying unusual\ncommand-line behavior while minimizing false positives. In this paper, we\npresent SCADE's core architecture, including its metadata-enriched approach to\nanomaly detection and the design choices behind its scalability for\nenterprise-level deployment. 
We argue that SCADE represents a significant\nadvancement in command-line anomaly detection, offering a robust, adaptive\nframework for security analysts and researchers seeking to enhance detection\naccuracy in high-computation environments.\n","authors":["Vaishali Vinay","Anjali Mangal"],"pdf_url":"https://arxiv.org/pdf/2412.04259v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.17010v2","updated":"2024-12-05T15:33:29Z","published":"2024-03-25T17:59:59Z","title":"Calib3D: Calibrating Model Preferences for Reliable 3D Scene\n Understanding","summary":" Safety-critical 3D scene understanding tasks necessitate not only accurate\nbut also confident predictions from 3D perception models. This study introduces\nCalib3D, a pioneering effort to benchmark and scrutinize the reliability of 3D\nscene understanding models from an uncertainty estimation viewpoint. We\ncomprehensively evaluate 28 state-of-the-art models across 10 diverse 3D\ndatasets, uncovering insightful phenomena that cope with both the aleatoric and\nepistemic uncertainties in 3D scene understanding. We discover that despite\nachieving impressive levels of accuracy, existing models frequently fail to\nprovide reliable uncertainty estimates -- a pitfall that critically undermines\ntheir applicability in safety-sensitive contexts. Through extensive analysis of\nkey factors such as network capacity, LiDAR representations, rasterization\nresolutions, and 3D data augmentation techniques, we correlate these aspects\ndirectly with the model calibration efficacy. Furthermore, we introduce DeptS,\na novel depth-aware scaling approach aimed at enhancing 3D model calibration.\nExtensive experiments across a wide range of configurations validate the\nsuperiority of our method. We hope this work could serve as a cornerstone for\nfostering reliable 3D scene understanding. 
Code and benchmark toolkit are\npublicly available.\n","authors":["Lingdong Kong","Xiang Xu","Jun Cen","Wenwei Zhang","Liang Pan","Kai Chen","Ziwei Liu"],"pdf_url":"https://arxiv.org/pdf/2403.17010v2.pdf","comment":"WACV 2025; 26 pages, 8 figures, 12 tables; Code at\n https://github.com/ldkong1205/Calib3D"},{"id":"http://arxiv.org/abs/2404.12294v3","updated":"2024-12-05T15:27:14Z","published":"2024-04-18T16:16:02Z","title":"Bayesian evidence estimation from posterior samples with normalizing\n flows","summary":" We propose a novel method ($floZ$), based on normalizing flows, to estimate\nthe Bayesian evidence (and its numerical uncertainty) from a pre-existing set\nof samples drawn from the unnormalized posterior distribution. We validate it\non distributions whose evidence is known analytically, up to 15 parameter space\ndimensions, and compare with two state-of-the-art techniques for estimating the\nevidence: nested sampling (which computes the evidence as its main target) and\na $k$-nearest-neighbors technique that produces evidence estimates from\nposterior samples. Provided representative samples from the target posterior\nare available, our method is more robust to posterior distributions with sharp\nfeatures, especially in higher dimensions. For a simple multivariate Gaussian,\nwe demonstrate its accuracy for up to 200 dimensions with $10^5$ posterior\nsamples. $floZ$ has wide applicability, e.g., to estimate evidence from\nvariational inference, Markov Chain Monte Carlo samples, or any other method\nthat delivers samples and their likelihood from the unnormalized posterior\ndensity. 
As a physical application, we use $floZ$ to compute the Bayes factor\nfor the presence of the first overtone in the ringdown signal of the\ngravitational wave data of GW150914, finding good agreement with nested\nsampling.\n","authors":["Rahul Srinivasan","Marco Crisostomi","Roberto Trotta","Enrico Barausse","Matteo Breschi"],"pdf_url":"https://arxiv.org/pdf/2404.12294v3.pdf","comment":"15 pages, 8 figures, 1 table"},{"id":"http://arxiv.org/abs/2412.04243v1","updated":"2024-12-05T15:25:51Z","published":"2024-12-05T15:25:51Z","title":"Quantifying the Limits of Segment Anything Model: Analyzing Challenges\n in Segmenting Tree-Like and Low-Contrast Structures","summary":" Segment Anything Model (SAM) has shown impressive performance in interactive\nand zero-shot segmentation across diverse domains, suggesting that they have\nlearned a general concept of \"objects\" from their large-scale training.\nHowever, we observed that SAM struggles with certain types of objects,\nparticularly those featuring dense, tree-like structures and low textural\ncontrast from their surroundings. These failure modes are critical for\nunderstanding its limitations in real-world use. In order to systematically\nexamine this issue, we propose metrics to quantify two key object\ncharacteristics: tree-likeness and textural separability. Through extensive\ncontrolled synthetic experiments and testing on real datasets, we demonstrate\nthat SAM's performance is noticeably correlated with these factors. We link\nthese behaviors under the concept of \"textural confusion\", where SAM\nmisinterprets local structure as global texture, leading to over-segmentation,\nor struggles to differentiate objects from similarly textured backgrounds.\nThese findings offer the first quantitative framework to model SAM's\nchallenges, providing valuable insights into its limitations and guiding future\nimprovements for vision foundation models.\n","authors":["Yixin Zhang","Nicholas Konz","Kevin Kramer","Maciej A. 
Mazurowski"],"pdf_url":"https://arxiv.org/pdf/2412.04243v1.pdf","comment":"Code: https://github.com/mazurowski-lab/SAM-TexturalConfusion-Metrics"},{"id":"http://arxiv.org/abs/2412.04242v1","updated":"2024-12-05T15:25:18Z","published":"2024-12-05T15:25:18Z","title":"LMDM:Latent Molecular Diffusion Model For 3D Molecule Generation","summary":" In this work, we propose a latent molecular diffusion model that can make the\ngenerated 3D molecules rich in diversity and maintain rich geometric features.\nThe model captures the information of the forces and local constraints between\natoms so that the generated molecules can maintain Euclidean transformation and\nhigh level of effectiveness and diversity. We also use the lower-rank manifold\nadvantage of the latent variables of the latent model to fuse the information\nof the forces between atoms to better maintain the geometric equivariant\nproperties of the molecules. Because there is no need to perform information\nfusion encoding in stages like traditional encoders and decoders, this reduces\nthe amount of calculation in the back-propagation process. The model keeps the\nforces and local constraints of particle bonds in the latent variable space,\nreducing the impact of underfitting on the surface of the network on the large\nposition drift of the particle geometry, so that our model can converge\nearlier. We introduce a distribution control variable in each backward step to\nstrengthen exploration and improve the diversity of generation. 
In the\nexperiment, the quality of the samples we generated and the convergence speed\nof the model have been significantly improved.\n","authors":["Xiang Chen"],"pdf_url":"https://arxiv.org/pdf/2412.04242v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2209.05710 by other authors"},{"id":"http://arxiv.org/abs/2410.14086v2","updated":"2024-12-05T15:24:33Z","published":"2024-10-17T23:37:34Z","title":"In-context learning and Occam's razor","summary":" A central goal of machine learning is generalization. While the No Free Lunch\nTheorem states that we cannot obtain theoretical guarantees for generalization\nwithout further assumptions, in practice we observe that simple models which\nexplain the training data generalize best: a principle called Occam's razor.\nDespite the need for simple models, most current approaches in machine learning\nonly minimize the training error, and at best indirectly promote simplicity\nthrough regularization or architecture design. Here, we draw a connection\nbetween Occam's razor and in-context learning: an emergent ability of certain\nsequence models like Transformers to learn at inference time from past\nobservations in a sequence. In particular, we show that the next-token\nprediction loss used to train in-context learners is directly equivalent to a\ndata compression technique called prequential coding, and that minimizing this\nloss amounts to jointly minimizing both the training error and the complexity\nof the model that was implicitly learned from context. Our theory and the\nempirical experiments we use to support it not only provide a normative account\nof in-context learning, but also elucidate the shortcomings of current\nin-context learning methods, suggesting ways in which they can be improved. 
We\nmake our code available at https://github.com/3rdCore/PrequentialCode.\n","authors":["Eric Elmoznino","Tom Marty","Tejas Kasetty","Leo Gagnon","Sarthak Mittal","Mahan Fathi","Dhanya Sridhar","Guillaume Lajoie"],"pdf_url":"https://arxiv.org/pdf/2410.14086v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.08339v3","updated":"2024-12-05T15:23:20Z","published":"2022-10-15T17:15:53Z","title":"Reachable Polyhedral Marching (RPM): An Exact Analysis Tool for\n Deep-Learned Control Systems","summary":" Neural networks are increasingly used in robotics as policies, state\ntransition models, state estimation models, or all of the above. With these\ncomponents being learned from data, it is important to be able to analyze what\nbehaviors were learned and how this affects closed-loop performance. In this\npaper we take steps toward this goal by developing methods for computing\ncontrol invariant sets and regions of attraction (ROAs) of dynamical systems\nrepresented as neural networks. We focus our attention on feedforward neural\nnetworks with the rectified linear unit (ReLU) activation, which are known to\nimplement continuous piecewise-affine (PWA) functions. We describe the\nReachable Polyhedral Marching (RPM) algorithm for enumerating the affine pieces\nof a neural network through an incremental connected walk. We then use this\nalgorithm to compute exact forward and backward reachable sets, from which we\nprovide methods for computing control invariant sets and ROAs. Our approach is\nunique in that we find these sets incrementally, without Lyapunov-based tools.\nIn our examples we demonstrate the ability of our approach to find non-convex\ncontrol invariant sets and ROAs on tasks with learned van der Pol oscillator\nand pendulum models. Further, we provide an accelerated algorithm for computing\nROAs that leverages the incremental and connected enumeration of affine regions\nthat RPM provides. We show this acceleration to lead to a 15x speedup in our\nexamples. 
Finally, we apply our methods to find a set of states that are\nstabilized by an image-based controller for an aircraft runway control problem.\n","authors":["Joseph A. Vincent","Mac Schwager"],"pdf_url":"https://arxiv.org/pdf/2210.08339v3.pdf","comment":"Submitted to IEEE Transactions on Neural Networks and Learning\n Systems. arXiv admin note: text overlap with arXiv:2011.11609"},{"id":"http://arxiv.org/abs/2410.14817v2","updated":"2024-12-05T15:20:28Z","published":"2024-10-18T18:37:27Z","title":"A Complexity-Based Theory of Compositionality","summary":" Compositionality is believed to be fundamental to intelligence. In humans, it\nunderlies the structure of thought, language, and higher-level reasoning. In\nAI, compositional representations can enable a powerful form of\nout-of-distribution generalization, in which a model systematically adapts to\nnovel combinations of known concepts. However, while we have strong intuitions\nabout what compositionality is, there currently exists no formal definition for\nit that is measurable and mathematical. Here, we propose such a definition,\nwhich we call representational compositionality, that accounts for and extends\nour intuitions about compositionality. The definition is conceptually simple,\nquantitative, grounded in algorithmic information theory, and applicable to any\nrepresentation. Intuitively, representational compositionality states that a\ncompositional representation satisfies three properties. First, it must be\nexpressive. Second, it must be possible to re-describe the representation as a\nfunction of discrete symbolic sequences with re-combinable parts, analogous to\nsentences in natural language. Third, the function that relates these symbolic\nsequences to the representation, analogous to semantics in natural language,\nmust be simple. 
Through experiments on both synthetic and real world data, we\nvalidate our definition of compositionality and show how it unifies disparate\nintuitions from across the literature in both AI and cognitive science. We also\nshow that representational compositionality, while theoretically intractable,\ncan be readily estimated using standard deep learning tools. Our definition has\nthe potential to inspire the design of novel, theoretically-driven models that\nbetter capture the mechanisms of compositional thought.\n","authors":["Eric Elmoznino","Thomas Jiralerspong","Yoshua Bengio","Guillaume Lajoie"],"pdf_url":"https://arxiv.org/pdf/2410.14817v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04236v1","updated":"2024-12-05T15:14:16Z","published":"2024-12-05T15:14:16Z","title":"A History of Philosophy in Colombia through Topic Modelling","summary":" Data-driven approaches to philosophy have emerged as a valuable tool for\nstudying the history of the discipline. However, most studies in this area have\nfocused on a limited number of journals from specific regions and subfields. We\nexpand the scope of this research by applying dynamic topic modelling\ntechniques to explore the history of philosophy in Colombia and Latin America.\nOur study examines the Colombian philosophy journal Ideas y Valores, founded in\n1951 and currently one of the most influential academic philosophy journals in\nthe region. By analyzing the evolution of topics across the journal's history,\nwe identify various trends and specific dynamics in philosophical discourse\nwithin the Colombian and Latin American context. Our findings reveal that the\nmost prominent topics are value theory (including ethics, political philosophy,\nand aesthetics), epistemology, and the philosophy of science. 
We also trace the\nevolution of articles focusing on the historical and interpretive aspects of\nphilosophical texts, and we note a notable emphasis on German philosophers such\nas Kant, Husserl, and Hegel on various topics throughout the journal's\nlifetime. Additionally, we investigate whether articles with a historical focus\nhave decreased over time due to editorial pressures. Our analysis suggests no\nsignificant decline in such articles. Finally, we propose ideas for extending\nthis research to other Latin American journals and suggest improvements for\nnatural language processing workflows in non-English languages.\n","authors":["Juan R. Loaiza","Miguel González-Duque"],"pdf_url":"https://arxiv.org/pdf/2412.04236v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04233v1","updated":"2024-12-05T15:09:51Z","published":"2024-12-05T15:09:51Z","title":"HyperMARL: Adaptive Hypernetworks for Multi-Agent RL","summary":" Balancing individual specialisation and shared behaviours is a critical\nchallenge in multi-agent reinforcement learning (MARL). Existing methods\ntypically focus on encouraging diversity or leveraging shared representations.\nFull parameter sharing (FuPS) improves sample efficiency but struggles to learn\ndiverse behaviours when required, while no parameter sharing (NoPS) enables\ndiversity but is computationally expensive and sample inefficient. To address\nthese challenges, we introduce HyperMARL, a novel approach using hypernetworks\nto balance efficiency and specialisation. HyperMARL generates agent-specific\nactor and critic parameters, enabling agents to adaptively exhibit diverse or\nhomogeneous behaviours as needed, without modifying the learning objective or\nrequiring prior knowledge of the optimal diversity. Furthermore, HyperMARL\ndecouples agent-specific and state-based gradients, which empirically\ncorrelates with reduced policy gradient variance, potentially offering insights\ninto its ability to capture diverse behaviours. 
Across MARL benchmarks\nrequiring homogeneous, heterogeneous, or mixed behaviours, HyperMARL\nconsistently matches or outperforms FuPS, NoPS, and diversity-focused methods,\nachieving NoPS-level diversity with a shared architecture. These results\nhighlight the potential of hypernetworks as a versatile approach to the\ntrade-off between specialisation and shared behaviours in MARL.\n","authors":["Kale-ab Abebe Tessera","Arrasy Rahman","Stefano V. Albrecht"],"pdf_url":"https://arxiv.org/pdf/2412.04233v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.05357v2","updated":"2024-12-05T15:08:56Z","published":"2024-10-07T15:55:55Z","title":"Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild","summary":" As Large Language Models (LLMs) excel across tasks and specialized domains,\nscaling LLMs based on existing models has garnered significant attention, which\nfaces the challenge of decreasing performance when combining disparate models.\nVarious techniques have been proposed for the aggregation of pre-trained LLMs,\nincluding model merging, Mixture-of-Experts, and stacking. Despite their\nmerits, a comprehensive comparison and synergistic application of them to a\ndiverse model zoo is yet to be adequately addressed. In light of this research\ngap, this paper introduces Model-GLUE, a holistic LLM scaling guideline. First,\nour work starts with a benchmarking of existing LLM scaling techniques,\nespecially selective merging, and variants of mixture. Utilizing the insights\nfrom the benchmark results, we formulate an optimal strategy for the selection\nand aggregation of a heterogeneous model zoo characterizing different\narchitectures and initialization. Our methodology involves the clustering of\nmergeable models and optimal merging strategy selection, and the integration of\nclusters through a model mixture. 
Finally, evidenced by our experiments on a\ndiverse Llama-2-based model zoo, Model-GLUE shows an average performance\nenhancement of 5.61%, achieved without additional training. Codes are available\nat: https://github.com/Model-GLUE/Model-GLUE.\n","authors":["Xinyu Zhao","Guoheng Sun","Ruisi Cai","Yukun Zhou","Pingzhi Li","Peihao Wang","Bowen Tan","Yexiao He","Li Chen","Yi Liang","Beidi Chen","Binhang Yuan","Hongyi Wang","Ang Li","Zhangyang Wang","Tianlong Chen"],"pdf_url":"https://arxiv.org/pdf/2410.05357v2.pdf","comment":"24 pages, 4 figures, accepted to NeurIPS 2024 Datasets and Benchmarks\n Track"},{"id":"http://arxiv.org/abs/2412.04227v1","updated":"2024-12-05T15:05:25Z","published":"2024-12-05T15:05:25Z","title":"Foundations of the Theory of Performance-Based Ranking","summary":" Ranking entities such as algorithms, devices, methods, or models based on\ntheir performances, while accounting for application-specific preferences, is a\nchallenge. To address this challenge, we establish the foundations of a\nuniversal theory for performance-based ranking. First, we introduce a rigorous\nframework built on top of both the probability and order theories. Our new\nframework encompasses the elements necessary to (1) manipulate performances as\nmathematical objects, (2) express which performances are worse than or\nequivalent to others, (3) model tasks through a variable called satisfaction,\n(4) consider properties of the evaluation, (5) define scores, and (6) specify\napplication-specific preferences through a variable called importance. On top\nof this framework, we propose the first axiomatic definition of performance\norderings and performance-based rankings. Then, we introduce a universal\nparametric family of scores, called ranking scores, that can be used to\nestablish rankings satisfying our axioms, while considering\napplication-specific preferences. 
Finally, we show, in the case of two-class\nclassification, that the family of ranking scores encompasses well-known\nperformance scores, including the accuracy, the true positive rate (recall,\nsensitivity), the true negative rate (specificity), the positive predictive\nvalue (precision), and F1. However, we also show that some other scores\ncommonly used to compare classifiers are unsuitable to derive performance\norderings satisfying the axioms. Therefore, this paper provides the computer\nvision and machine learning communities with a rigorous framework for\nevaluating and ranking entities.\n","authors":["Sébastien Piérard","Anaïs Halin","Anthony Cioppa","Adrien Deliège","Marc Van Droogenbroeck"],"pdf_url":"https://arxiv.org/pdf/2412.04227v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03441v2","updated":"2024-12-05T15:03:26Z","published":"2024-12-04T16:30:03Z","title":"PBP: Post-training Backdoor Purification for Malware Classifiers","summary":" In recent years, the rise of machine learning (ML) in cybersecurity has\nbrought new challenges, including the increasing threat of backdoor poisoning\nattacks on ML malware classifiers. For instance, adversaries could inject\nmalicious samples into public malware repositories, contaminating the training\ndata and potentially misclassifying malware by the ML model. Current\ncountermeasures predominantly focus on detecting poisoned samples by leveraging\ndisagreements within the outputs of a diverse set of ensemble models on\ntraining data points. However, these methods are not suitable for scenarios\nwhere Machine Learning-as-a-Service (MLaaS) is used or when users aim to remove\nbackdoors from a model after it has been trained. Addressing this scenario, we\nintroduce PBP, a post-training defense for malware classifiers that mitigates\nvarious types of backdoor embeddings without assuming any specific backdoor\nembedding mechanism. 
Our method exploits the influence of backdoor attacks on\nthe activation distribution of neural networks, independent of the\ntrigger-embedding method. In the presence of a backdoor attack, the activation\ndistribution of each layer is distorted into a mixture of distributions. By\nregulating the statistics of the batch normalization layers, we can guide a\nbackdoored model to perform similarly to a clean one. Our method demonstrates\nsubstantial advantages over several state-of-the-art methods, as evidenced by\nexperiments on two datasets, two types of backdoor methods, and various attack\nconfigurations. Notably, our approach requires only a small portion of the\ntraining data -- only 1\\% -- to purify the backdoor and reduce the attack\nsuccess rate from 100\\% to almost 0\\%, a 100-fold improvement over the baseline\nmethods. Our code is available at\n\\url{https://github.com/judydnguyen/pbp-backdoor-purification-official}.\n","authors":["Dung Thuy Nguyen","Ngoc N. Tran","Taylor T. Johnson","Kevin Leach"],"pdf_url":"https://arxiv.org/pdf/2412.03441v2.pdf","comment":"Accepted at NDSS 2025"},{"id":"http://arxiv.org/abs/2410.03960v2","updated":"2024-12-05T14:56:56Z","published":"2024-10-04T22:45:26Z","title":"SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving\n Model Transformation","summary":" LLM inference for popular enterprise use cases, such as summarization, RAG,\nand code-generation, typically observes orders of magnitude longer prompt\nlengths than generation lengths. This characteristic leads to high cost of\nprefill and increased response latency. In this paper, we present SwiftKV, a\nnovel model transformation and distillation procedure specifically designed to\nreduce the time and cost of processing prompt tokens while preserving high\nquality of generated tokens. 
SwiftKV combines three key mechanisms: i)\nSingleInputKV, which prefills later layers' KV cache using a much earlier\nlayer's output, allowing prompt tokens to skip much of the model computation,\nii) AcrossKV, which merges the KV caches of neighboring layers to reduce the\nmemory footprint and support larger batch size for higher throughput, and iii)\na knowledge-preserving distillation procedure that can adapt existing LLMs for\nSwiftKV with minimal accuracy impact and low compute and data requirement. For\nLlama-3.1-8B and 70B, SwiftKV reduces the compute requirement of prefill by 50%\nand the memory requirement of the KV cache by 62.5% while incurring minimum\nquality degradation across a wide range of tasks. In the end-to-end inference\nserving using an optimized vLLM implementation, SwiftKV realizes up to 2x\nhigher aggregate throughput and 60% lower time per output token. It can achieve\na staggering 560 TFlops/GPU of normalized inference throughput, which\ntranslates to 16K tokens/s for Llama-3.1-70B in 16-bit precision on 4x H100\nGPUs. Our training, inference, and model implementations are open-sourced and\ncan be found through\nhttps://huggingface.co/collections/Snowflake/swiftkv-models-674f7d7474eb789e185d31cb.\n","authors":["Aurick Qiao","Zhewei Yao","Samyam Rajbhandari","Yuxiong He"],"pdf_url":"https://arxiv.org/pdf/2410.03960v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.06740v4","updated":"2024-12-05T14:56:30Z","published":"2024-11-11T06:25:13Z","title":"Dockformer: A transformer-based molecular docking paradigm for\n large-scale virtual screening","summary":" Molecular docking is a crucial step in drug development, which enables the\nvirtual screening of compound libraries to identify potential ligands that\ntarget proteins of interest. However, the computational complexity of\ntraditional docking models increases as the size of the compound library\nincreases. 
Recently, deep learning algorithms can provide data-driven research\nand development models to increase the speed of the docking process.\nUnfortunately, few models can achieve superior screening performance compared\nto that of traditional models. Therefore, a novel deep learning-based docking\napproach named Dockformer is introduced in this study. Dockformer leverages\nmultimodal information to capture the geometric topology and structural\nknowledge of molecules and can directly generate binding conformations with the\ncorresponding confidence measures in an end-to-end manner. The experimental\nresults show that Dockformer achieves success rates of 90.53% and 82.71% on the\nPDBbind core set and PoseBusters benchmarks, respectively, and more than a\n100-fold increase in the inference process speed, outperforming almost all\nstate-of-the-art docking methods. In addition, the ability of Dockformer to\nidentify the main protease inhibitors of coronaviruses is demonstrated in a\nreal-world virtual screening scenario. Considering its high docking accuracy\nand screening efficiency, Dockformer can be regarded as a powerful and robust\ntool in the field of drug design.\n","authors":["Zhangfan Yang","Junkai Ji","Shan He","Jianqiang Li","Tiantian He","Ruibin Bai","Zexuan Zhu","Yew Soon Ong"],"pdf_url":"https://arxiv.org/pdf/2411.06740v4.pdf","comment":"15 pages, 10 figures"},{"id":"http://arxiv.org/abs/2412.04213v1","updated":"2024-12-05T14:47:38Z","published":"2024-12-05T14:47:38Z","title":"Physics-informed Deep Learning for Muscle Force Prediction with\n Unlabeled sEMG Signals","summary":" Computational biomechanical analysis plays a pivotal role in understanding\nand improving human movements and physical functions. Although physics-based\nmodeling methods can interpret the dynamic interaction between the neural drive\nto muscle dynamics and joint kinematics, they suffer from high computational\nlatency. 
In recent years, data-driven methods have emerged as a promising\nalternative due to their fast execution speed, but label information is still\nrequired during training, which is not easy to acquire in practice. To tackle\nthese issues, this paper presents a novel physics-informed deep learning method\nto predict muscle forces without any label information during model training.\nIn addition, the proposed method could also identify personalized muscle-tendon\nparameters. To achieve this, the Hill muscle model-based forward dynamics is\nembedded into the deep neural network as the additional loss to further\nregulate the behavior of the deep neural network. Experimental validations on\nthe wrist joint from six healthy subjects are performed, and a fully connected\nneural network (FNN) is selected to implement the proposed method. The\npredicted results of muscle forces show comparable or even lower root mean\nsquare error (RMSE) and higher coefficient of determination compared with\nbaseline methods, which have to use the labeled surface electromyography (sEMG)\nsignals, and it can also identify muscle-tendon parameters accurately,\ndemonstrating the effectiveness of the proposed physics-informed deep learning\nmethod.\n","authors":["Shuhao Ma","Jie Zhang","Chaoyang Shi","Pei Di","Ian D. Robertson","Zhi-Qiang Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.04213v1.pdf","comment":"11pages, 8 figures, journal"},{"id":"http://arxiv.org/abs/2410.19725v2","updated":"2024-12-05T14:34:08Z","published":"2024-10-25T17:52:53Z","title":"On the Benefits of Active Data Collection in Operator Learning","summary":" We investigate active data collection strategies for operator learning when\nthe target operator is linear and the input functions are drawn from a\nmean-zero stochastic process with continuous covariance kernels. With an active\ndata collection strategy, we establish an error convergence rate in terms of\nthe decay rate of the eigenvalues of the covariance kernel. 
Thus, with\nsufficiently rapid eigenvalue decay of the covariance kernels, arbitrarily fast\nerror convergence rates can be achieved. This contrasts with the passive\n(i.i.d.) data collection strategies, where the convergence rate is never faster\nthan $\\sim n^{-1}$. In fact, for our setting, we establish a\n\\emph{non-vanishing} lower bound for any passive data collection strategy,\nregardless of the eigenvalues decay rate of the covariance kernel. Overall, our\nresults show the benefit of active over passive data collection strategies in\noperator learning.\n","authors":["Unique Subedi","Ambuj Tewari"],"pdf_url":"https://arxiv.org/pdf/2410.19725v2.pdf","comment":"Added experiments"},{"id":"http://arxiv.org/abs/2403.10182v4","updated":"2024-12-05T14:30:41Z","published":"2024-03-15T10:38:48Z","title":"Fast and reliable uncertainty quantification with neural network\n ensembles for industrial image classification","summary":" Image classification with neural networks (NNs) is widely used in industrial\nprocesses, situations where the model likely encounters unknown objects during\ndeployment, i.e., out-of-distribution (OOD) data. Worryingly, NNs tend to make\nconfident yet incorrect predictions when confronted with OOD data. To increase\nthe models' reliability, they should quantify the uncertainty in their own\npredictions, communicating when the output should (not) be trusted. Deep\nensembles, composed of multiple independent NNs, have been shown to perform\nstrongly but are computationally expensive. Recent research has proposed more\nefficient NN ensembles, namely the snapshot, batch, and multi-input\nmulti-output ensemble. This study investigates the predictive and uncertainty\nperformance of efficient NN ensembles in the context of image classification\nfor industrial processes. 
It is the first to provide a comprehensive comparison\nand it proposes a novel Diversity Quality metric to quantify the ensembles'\nperformance on the in-distribution and OOD sets in one single metric. The\nresults highlight the batch ensemble as a cost-effective and competitive\nalternative to the deep ensemble. It matches the deep ensemble in both\nuncertainty and accuracy while exhibiting considerable savings in training\ntime, test time, and memory storage.\n","authors":["Arthur Thuy","Dries F. Benoit"],"pdf_url":"https://arxiv.org/pdf/2403.10182v4.pdf","comment":"Submitted to Annals of Operations Research"},{"id":"http://arxiv.org/abs/2412.04190v1","updated":"2024-12-05T14:30:18Z","published":"2024-12-05T14:30:18Z","title":"Directed Structural Adaptation to Overcome Statistical Conflicts and\n Enable Continual Learning","summary":" Adaptive networks today rely on overparameterized fixed topologies that\ncannot break through the statistical conflicts they encounter in the data they\nare exposed to, and are prone to \"catastrophic forgetting\" as the network\nattempts to reuse the existing structures to learn new task. We propose a\nstructural adaptation method, DIRAD, that can complexify as needed and in a\ndirected manner without being limited by statistical conflicts within a\ndataset. We then extend this method and present the PREVAL framework, designed\nto prevent \"catastrophic forgetting\" in continual learning by detection of new\ndata and assigning encountered data to suitable models adapted to process them,\nwithout needing task labels anywhere in the workflow. 
We show the reliability\nof the DIRAD in growing a network with high performance and orders-of-magnitude\nsimpler than fixed topology networks; and demonstrate the proof-of-concept\noperation of PREVAL, in which continual adaptation to new tasks is observed\nwhile being able to detect and discern previously-encountered tasks.\n","authors":["Zeki Doruk Erden","Boi Faltings"],"pdf_url":"https://arxiv.org/pdf/2412.04190v1.pdf","comment":"Presented in Deployable AI (DAI) workshop at AAAI-2024"},{"id":"http://arxiv.org/abs/2409.17146v2","updated":"2024-12-05T14:28:40Z","published":"2024-09-25T17:59:51Z","title":"Molmo and PixMo: Open Weights and Open Data for State-of-the-Art\n Vision-Language Models","summary":" Today's most advanced vision-language models (VLMs) remain proprietary. The\nstrongest open-weight models rely heavily on synthetic data from proprietary\nVLMs to achieve good performance, effectively distilling these closed VLMs into\nopen ones. As a result, the community has been missing foundational knowledge\nabout how to build performant VLMs from scratch. We present Molmo, a new family\nof VLMs that are state-of-the-art in their class of openness. Our key\ncontribution is a collection of new datasets called PixMo, including a dataset\nof highly detailed image captions for pre-training, a free-form image Q&A\ndataset for fine-tuning, and an innovative 2D pointing dataset, all collected\nwithout the use of external VLMs. The success of our approach relies on careful\nmodeling choices, a well-tuned training pipeline, and, most critically, the\nquality of our newly collected datasets. Our best-in-class 72B model not only\noutperforms others in the class of open weight and data models, but also\noutperforms larger proprietary models including Claude 3.5 Sonnet, and Gemini\n1.5 Pro and Flash, second only to GPT-4o based on both academic benchmarks and\non a large human evaluation. 
Our model weights, new datasets, and source code\nare available at https://molmo.allenai.org/blog.\n","authors":["Matt Deitke","Christopher Clark","Sangho Lee","Rohun Tripathi","Yue Yang","Jae Sung Park","Mohammadreza Salehi","Niklas Muennighoff","Kyle Lo","Luca Soldaini","Jiasen Lu","Taira Anderson","Erin Bransom","Kiana Ehsani","Huong Ngo","YenSung Chen","Ajay Patel","Mark Yatskar","Chris Callison-Burch","Andrew Head","Rose Hendrix","Favyen Bastani","Eli VanderBilt","Nathan Lambert","Yvonne Chou","Arnavi Chheda","Jenna Sparks","Sam Skjonsberg","Michael Schmitz","Aaron Sarnat","Byron Bischoff","Pete Walsh","Chris Newell","Piper Wolters","Tanmay Gupta","Kuo-Hao Zeng","Jon Borchardt","Dirk Groeneveld","Crystal Nam","Sophie Lebrecht","Caitlin Wittlif","Carissa Schoenick","Oscar Michel","Ranjay Krishna","Luca Weihs","Noah A. Smith","Hannaneh Hajishirzi","Ross Girshick","Ali Farhadi","Aniruddha Kembhavi"],"pdf_url":"https://arxiv.org/pdf/2409.17146v2.pdf","comment":"Updated with ablations and more technical details"},{"id":"http://arxiv.org/abs/2412.04183v1","updated":"2024-12-05T14:21:18Z","published":"2024-12-05T14:21:18Z","title":"Linear Discriminant Analysis in Credit Scoring: A Transparent Hybrid\n Model Approach","summary":" The development of computing has made credit scoring approaches possible,\nwith various machine learning (ML) and deep learning (DL) techniques becoming\nmore and more valuable. While complex models yield more accurate predictions,\ntheir interpretability is often weakened, which is a concern for credit scoring\nthat places importance on decision fairness. As features of the dataset are a\ncrucial factor for the credit scoring system, we implement Linear Discriminant\nAnalysis (LDA) as a feature reduction technique, which reduces the burden of\nthe models complexity. We compared 6 different machine learning models, 1 deep\nlearning model, and a hybrid model with and without using LDA. 
From the result,\nwe have found our hybrid model, XG-DNN, outperformed other models with the\nhighest accuracy of 99.45% and a 99% F1 score with LDA. Lastly, to interpret\nmodel decisions, we have applied 2 different explainable AI techniques named\nLIME (local) and Morris Sensitivity Analysis (global). Through this research,\nwe showed how feature reduction techniques can be used without affecting the\nperformance and explainability of the model, which can be very useful in\nresource-constrained settings to optimize the computational workload.\n","authors":["Md Shihab Reza","Monirul Islam Mahmud","Ifti Azad Abeer","Nova Ahmed"],"pdf_url":"https://arxiv.org/pdf/2412.04183v1.pdf","comment":"Accepted on International Conference on Computer and Information\n Technology (ICCIT) 2024"},{"id":"http://arxiv.org/abs/2412.04180v1","updated":"2024-12-05T14:19:59Z","published":"2024-12-05T14:19:59Z","title":"SKIM: Any-bit Quantization Pushing The Limits of Post-Training\n Quantization","summary":" Large Language Models (LLMs) exhibit impressive performance across various\ntasks, but deploying them for inference poses challenges. Their high resource\ndemands often necessitate complex, costly multi-GPU pipelines, or the use of\nsmaller, less capable models. While quantization offers a promising solution\nutilizing lower precision for model storage, existing methods frequently\nexperience significant performance drops at lower precision levels.\nAdditionally, they typically provide only a limited set of solutions at\nspecific bit levels, many of which are extensively manually tuned. To address\nthese challenges, we propose a new method called SKIM: Scaled K-means\nclustering wIth Mixed precision. Our approach introduces two novel techniques:\n1. A greedy algorithm to solve approximately optimal bit allocation across\nweight channels, and 2. A trainable scaling vector for non-differentiable\nK-means clustering. 
These techniques substantially improve performance and can\nbe adapted to any given bit. Notably, in terms of model perplexity, our method\nnarrows the gap between 3-bit quantized LLaMA models and their full precision\ncounterparts by 16.3% on average.\n","authors":["Runsheng Bai","Qiang Liu","Bo Liu"],"pdf_url":"https://arxiv.org/pdf/2412.04180v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04178v1","updated":"2024-12-05T14:18:50Z","published":"2024-12-05T14:18:50Z","title":"Multi-Layer Privacy-Preserving Record Linkage with Clerical Review based\n on gradual information disclosure","summary":" Privacy-Preserving Record linkage (PPRL) is an essential component in data\nintegration tasks of sensitive information. The linkage quality determines the\nusability of combined datasets and (machine learning) applications based on\nthem. We present a novel privacy-preserving protocol that integrates clerical\nreview in PPRL using a multi-layer active learning process. Uncertain match\ncandidates are reviewed on several layers by human and non-human oracles to\nreduce the amount of disclosed information per record and in total. Predictions\nare propagated back to update previous layers, resulting in an improved linkage\nperformance for non-reviewed candidates as well. The data owners remain in\ncontrol of the amount of information they share for each record. Therefore, our\napproach follows need-to-know and data sovereignty principles. 
The experimental\nevaluation on real-world datasets shows considerable linkage quality\nimprovements with limited labeling effort and privacy risks.\n","authors":["Florens Rohde","Victor Christen","Martin Franke","Erhard Rahm"],"pdf_url":"https://arxiv.org/pdf/2412.04178v1.pdf","comment":"Accepted at 21st Conference on Database Systems for Business,\n Technology and Web (BTW)"},{"id":"http://arxiv.org/abs/2412.04177v1","updated":"2024-12-05T14:17:16Z","published":"2024-12-05T14:17:16Z","title":"Fixed-Mean Gaussian Processes for Post-hoc Bayesian Deep Learning","summary":" Recently, there has been an increasing interest in performing post-hoc\nuncertainty estimation about the predictions of pre-trained deep neural\nnetworks (DNNs). Given a pre-trained DNN via back-propagation, these methods\nenhance the original network by adding output confidence measures, such as\nerror bars, without compromising its initial accuracy. In this context, we\nintroduce a novel family of sparse variational Gaussian processes (GPs), where\nthe posterior mean is fixed to any continuous function when using a universal\nkernel. Specifically, we fix the mean of this GP to the output of the\npre-trained DNN, allowing our approach to effectively fit the GP's predictive\nvariances to estimate the DNN prediction uncertainty. Our approach leverages\nvariational inference (VI) for efficient stochastic optimization, with training\ncosts that remain independent of the number of training points, scaling\nefficiently to large datasets such as ImageNet. The proposed method, called\nfixed mean GP (FMGP), is architecture-agnostic, relying solely on the\npre-trained model's outputs to adjust the predictive variances. Experimental\nresults demonstrate that FMGP improves both uncertainty estimation and\ncomputational efficiency when compared to state-of-the-art methods.\n","authors":["Luis A. 
Ortega","Simón Rodríguez-Santana","Daniel Hernández-Lobato"],"pdf_url":"https://arxiv.org/pdf/2412.04177v1.pdf","comment":"12 pages, 6 figures and 2 tables. Submitted to IEEE TRANSACTIONS ON\n PATTERN ANALYSIS AND MACHINE INTELLIGENCE"},{"id":"http://arxiv.org/abs/2411.16105v2","updated":"2024-12-05T14:16:57Z","published":"2024-11-25T05:32:34Z","title":"Adaptive Circuit Behavior and Generalization in Mechanistic\n Interpretability","summary":" Mechanistic interpretability aims to understand the inner workings of large\nneural networks by identifying circuits, or minimal subgraphs within the model\nthat implement algorithms responsible for performing specific tasks. These\ncircuits are typically discovered and analyzed using a narrowly defined prompt\nformat. However, given the abilities of large language models (LLMs) to\ngeneralize across various prompt formats for the same task, it remains unclear\nhow well these circuits generalize. For instance, it is unclear whether the\nmodels generalization results from reusing the same circuit components, the\ncomponents behaving differently, or the use of entirely different components.\nIn this paper, we investigate the generality of the indirect object\nidentification (IOI) circuit in GPT-2 small, which is well-studied and believed\nto implement a simple, interpretable algorithm. We evaluate its performance on\nprompt variants that challenge the assumptions of this algorithm. Our findings\nreveal that the circuit generalizes surprisingly well, reusing all of its\ncomponents and mechanisms while only adding additional input edges. Notably,\nthe circuit generalizes even to prompt variants where the original algorithm\nshould fail; we discover a mechanism that explains this which we term S2\nHacking. 
Our findings indicate that circuits within LLMs may be more flexible\nand general than previously recognized, underscoring the importance of studying\ncircuit generalization to better understand the broader capabilities of these\nmodels.\n","authors":["Jatin Nainani","Sankaran Vaidyanathan","AJ Yeung","Kartik Gupta","David Jensen"],"pdf_url":"https://arxiv.org/pdf/2411.16105v2.pdf","comment":"10 pages, 8 figures"},{"id":"http://arxiv.org/abs/2412.04166v1","updated":"2024-12-05T14:03:16Z","published":"2024-12-05T14:03:16Z","title":"An In-Depth Examination of Risk Assessment in Multi-Class Classification\n Algorithms","summary":" Advanced classification algorithms are being increasingly used in\nsafety-critical applications like health-care, engineering, etc. In such\napplications, miss-classifications made by ML algorithms can result in\nsubstantial financial or health-related losses. To better anticipate and\nprepare for such losses, the algorithm user seeks an estimate for the\nprobability that the algorithm miss-classifies a sample. We refer to this task\nas the risk-assessment. For a variety of models and datasets, we numerically\nanalyze the performance of different methods in solving the risk-assessment\nproblem. We consider two solution strategies: a) calibration techniques that\ncalibrate the output probabilities of classification models to provide accurate\nprobability outputs; and b) a novel approach based upon the prediction interval\ngeneration technique of conformal prediction. Our conformal prediction based\napproach is model and data-distribution agnostic, simple to implement, and\nprovides reasonable results for a variety of use-cases. 
We compare the\ndifferent methods on a broad variety of models and datasets.\n","authors":["Disha Ghandwani","Neeraj Sarna","Yuanyuan Li","Yang Lin"],"pdf_url":"https://arxiv.org/pdf/2412.04166v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04163v1","updated":"2024-12-05T13:54:53Z","published":"2024-12-05T13:54:53Z","title":"On the Lack of Robustness of Binary Function Similarity Systems","summary":" Binary function similarity, which often relies on learning-based algorithms\nto identify what functions in a pool are most similar to a given query\nfunction, is a sought-after topic in different communities, including machine\nlearning, software engineering, and security. Its importance stems from the\nimpact it has in facilitating several crucial tasks, from reverse engineering\nand malware analysis to automated vulnerability detection. Whereas recent work\ncast light around performance on this long-studied problem, the research\nlandscape remains largely lackluster in understanding the resiliency of the\nstate-of-the-art machine learning models against adversarial attacks. As\nsecurity requires to reason about adversaries, in this work we assess the\nrobustness of such models through a simple yet effective black-box greedy\nattack, which modifies the topology and the content of the control flow of the\nattacked functions. 
We demonstrate that this attack is successful in\ncompromising all the models, achieving average attack success rates of 57.06%\nand 95.81% depending on the problem settings (targeted and untargeted attacks).\nOur findings are insightful: top performance on clean data does not necessarily\nrelate to top robustness properties, which explicitly highlights\nperformance-robustness trade-offs one should consider when deploying such\nmodels, calling for further research.\n","authors":["Gianluca Capozzi","Tong Tang","Jie Wan","Ziqi Yang","Daniele Cono D'Elia","Giuseppe Antonio Di Luna","Lorenzo Cavallaro","Leonardo Querzoni"],"pdf_url":"https://arxiv.org/pdf/2412.04163v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2109.06181v2","updated":"2024-12-05T13:50:59Z","published":"2021-09-13T16:06:10Z","title":"When Stability meets Sufficiency: Informative Explanations that do not\n Overwhelm","summary":" Recent studies evaluating various criteria for explainable artificial\nintelligence (XAI) suggest that fidelity, stability, and comprehensibility are\namong the most important metrics considered by users of AI across a diverse\ncollection of usage contexts. We consider these criteria as applied to\nfeature-based attribution methods, which are amongst the most prevalent in XAI\nliterature. Going beyond standard correlation, methods have been proposed that\nhighlight what should be minimally sufficient to justify the classification of\nan input (viz. pertinent positives). 
While minimal sufficiency is an attractive\nproperty akin to comprehensibility, the resulting explanations are often too\nsparse for a human to understand and evaluate the local behavior of the model.\nTo overcome these limitations, we incorporate the criteria of stability and\nfidelity and propose a novel method called Path-Sufficient Explanations Method\n(PSEM) that outputs a sequence of stable and sufficient explanations for a\ngiven input of strictly decreasing size (or value) -- from original input to a\nminimally sufficient explanation -- which can be thought to trace the local\nboundary of the model in a stable manner, thus providing better intuition about\nthe local model behavior for the specific input. We validate these claims, both\nqualitatively and quantitatively, with experiments that show the benefit of\nPSEM across three modalities (image, tabular and text) as well as versus other\npath explanations. A user study depicts the strength of the method in\ncommunicating the local behavior, where (many) users are able to correctly\ndetermine the prediction made by a model.\n","authors":["Ronny Luss","Amit Dhurandhar"],"pdf_url":"https://arxiv.org/pdf/2109.06181v2.pdf","comment":"Published at TMLR"},{"id":"http://arxiv.org/abs/2412.04158v1","updated":"2024-12-05T13:46:55Z","published":"2024-12-05T13:46:55Z","title":"LossVal: Efficient Data Valuation for Neural Networks","summary":" Assessing the importance of individual training samples is a key challenge in\nmachine learning. Traditional approaches retrain models with and without\nspecific samples, which is computationally expensive and ignores dependencies\nbetween data points. We introduce LossVal, an efficient data valuation method\nthat computes importance scores during neural network training by embedding a\nself-weighting mechanism into loss functions like cross-entropy and mean\nsquared error. LossVal reduces computational costs, making it suitable for\nlarge datasets and practical applications. 
Experiments on classification and\nregression tasks across multiple datasets show that LossVal effectively\nidentifies noisy samples and is able to distinguish helpful from harmful\nsamples. We examine the gradient calculation of LossVal to highlight its\nadvantages. The source code is available at:\nhttps://github.com/twibiral/LossVal\n","authors":["Tim Wibiral","Mohamed Karim Belaid","Maximilian Rabus","Ansgar Scherp"],"pdf_url":"https://arxiv.org/pdf/2412.04158v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04157v1","updated":"2024-12-05T13:45:35Z","published":"2024-12-05T13:45:35Z","title":"Non-Asymptotic Bounds for Closed-Loop Identification of Unstable\n Nonlinear Stochastic Systems","summary":" We consider the problem of least squares parameter estimation from\nsingle-trajectory data for discrete-time, unstable, closed-loop nonlinear\nstochastic systems, with linearly parameterised uncertainty. Assuming a region\nof the state space produces informative data, and the system is\nsub-exponentially unstable, we establish non-asymptotic guarantees on the\nestimation error at times where the state trajectory evolves in this region. If\nthe whole state space is informative, high probability guarantees on the error\nhold for all times. Examples are provided where our results are useful for\nanalysis, but existing results are not.\n","authors":["Seth Siriya","Jingge Zhu","Dragan Nešić","Ye Pu"],"pdf_url":"https://arxiv.org/pdf/2412.04157v1.pdf","comment":"21 pages, 2 figures"},{"id":"http://arxiv.org/abs/2407.17449v3","updated":"2024-12-05T13:37:51Z","published":"2024-07-24T17:30:21Z","title":"Looking at Model Debiasing through the Lens of Anomaly Detection","summary":" It is widely recognized that deep neural networks are sensitive to bias in\nthe data. This means that during training these models are likely to learn\nspurious correlations between data and labels, resulting in limited\ngeneralization abilities and low performance. 
In this context, model debiasing\napproaches can be devised aiming at reducing the model's dependency on such\nunwanted correlations, either leveraging the knowledge of bias information or\nnot. In this work, we focus on the latter and more realistic scenario, showing\nthe importance of accurately predicting the bias-conflicting and bias-aligned\nsamples to obtain compelling performance in bias mitigation. On this ground, we\npropose to conceive the problem of model bias from an out-of-distribution\nperspective, introducing a new bias identification method based on anomaly\ndetection. We claim that when data is mostly biased, bias-conflicting samples\ncan be regarded as outliers with respect to the bias-aligned distribution in\nthe feature space of a biased model, thus allowing for precisely detecting them\nwith an anomaly detection method. Coupling the proposed bias identification\napproach with bias-conflicting data upsampling and augmentation in a two-step\nstrategy, we reach state-of-the-art performance on synthetic and real benchmark\ndatasets. Ultimately, our proposed approach shows that the data bias issue does\nnot necessarily require complex debiasing methods, given that an accurate bias\nidentification procedure is defined. Source code is available at\nhttps://github.com/Malga-Vision/MoDAD\n","authors":["Vito Paolo Pastore","Massimiliano Ciranni","Davide Marinelli","Francesca Odone","Vittorio Murino"],"pdf_url":"https://arxiv.org/pdf/2407.17449v3.pdf","comment":"13 pages, 8 figures; Accepted at IEEE/CVF Winter Conference on\n Applications of Computer Vision (WACV) 2025"},{"id":"http://arxiv.org/abs/2407.16940v2","updated":"2024-12-05T13:30:16Z","published":"2024-07-24T02:20:29Z","title":"GV-Rep: A Large-Scale Dataset for Genetic Variant Representation\n Learning","summary":" Genetic variants (GVs) are defined as differences in the DNA sequences among\nindividuals and play a crucial role in diagnosing and treating genetic\ndiseases. 
The rapid decrease in next generation sequencing cost has led to an\nexponential increase in patient-level GV data. This growth poses a challenge\nfor clinicians who must efficiently prioritize patient-specific GVs and\nintegrate them with existing genomic databases to inform patient management. To\naddressing the interpretation of GVs, genomic foundation models (GFMs) have\nemerged. However, these models lack standardized performance assessments,\nleading to considerable variability in model evaluations. This poses the\nquestion: How effectively do deep learning methods classify unknown GVs and\nalign them with clinically-verified GVs? We argue that representation learning,\nwhich transforms raw data into meaningful feature spaces, is an effective\napproach for addressing both indexing and classification challenges. We\nintroduce a large-scale Genetic Variant dataset, named GV-Rep, featuring\nvariable-length contexts and detailed annotations, designed for deep learning\nmodels to learn GV representations across various traits, diseases, tissue\ntypes, and experimental contexts. Our contributions are three-fold: (i)\nConstruction of a comprehensive dataset with 7 million records, each labeled\nwith characteristics of the corresponding variants, alongside additional data\nfrom 17,548 gene knockout tests across 1,107 cell types, 1,808 variant\ncombinations, and 156 unique clinically verified GVs from real-world patients.\n(ii) Analysis of the structure and properties of the dataset. (iii)\nExperimentation of the dataset with pre-trained GFMs. The results show a\nsignificant gap between GFMs current capabilities and accurate GV\nrepresentation. 
We hope this dataset will help advance genomic deep learning to\nbridge this gap.\n","authors":["Zehui Li","Vallijah Subasri","Guy-Bart Stan","Yiren Zhao","Bo Wang"],"pdf_url":"https://arxiv.org/pdf/2407.16940v2.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2412.03417v2","updated":"2024-12-05T13:22:28Z","published":"2024-12-04T15:53:45Z","title":"Learning Semantic Association Rules from Internet of Things Data","summary":" Association Rule Mining (ARM) is the task of discovering commonalities in\ndata in the form of logical implications. ARM is used in the Internet of Things\n(IoT) for different tasks including monitoring and decision-making. However,\nexisting methods give limited consideration to IoT-specific requirements such\nas heterogeneity and volume. Furthermore, they do not utilize important static\ndomain-specific description data about IoT systems, which is increasingly\nrepresented as knowledge graphs. In this paper, we propose a novel ARM pipeline\nfor IoT data that utilizes both dynamic sensor data and static IoT system\nmetadata. Furthermore, we propose an Autoencoder-based Neurosymbolic ARM method\n(Aerial) as part of the pipeline to address the high volume of IoT data and\nreduce the total number of rules that are resource-intensive to process. Aerial\nlearns a neural representation of a given data and extracts association rules\nfrom this representation by exploiting the reconstruction (decoding) mechanism\nof an autoencoder. 
Extensive evaluations on 3 IoT datasets from 2 domains show\nthat ARM on both static and dynamic IoT data results in more generically\napplicable rules while Aerial can learn a more concise set of high-quality\nassociation rules than the state-of-the-art with full coverage over the\ndatasets.\n","authors":["Erkan Karabulut","Paul Groth","Victoria Degeler"],"pdf_url":"https://arxiv.org/pdf/2412.03417v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04147v1","updated":"2024-12-05T13:19:34Z","published":"2024-12-05T13:19:34Z","title":"MultiTASC++: A Continuously Adaptive Scheduler for Edge-Based\n Multi-Device Cascade Inference","summary":" Cascade systems, consisting of a lightweight model processing all samples and\na heavier, high-accuracy model refining challenging samples, have become a\nwidely-adopted distributed inference approach to achieving high accuracy and\nmaintaining a low computational burden for mobile and IoT devices. As\nintelligent indoor environments, like smart homes, continue to expand, a new\nscenario emerges, the multi-device cascade. In this setting, multiple diverse\ndevices simultaneously utilize a shared heavy model hosted on a server, often\nsituated within or close to the consumer environment. This work introduces\nMultiTASC++, a continuously adaptive multi-tenancy-aware scheduler that\ndynamically controls the forwarding decision functions of devices to optimize\nsystem throughput while maintaining high accuracy and low latency. Through\nextensive experimentation in diverse device environments and with varying\nserver-side models, we demonstrate the scheduler's efficacy in consistently\nmaintaining a targeted satisfaction rate while providing the highest available\naccuracy across different device tiers and workloads of up to 100 devices. 
This\ndemonstrates its scalability and efficiency in addressing the unique challenges\nof collaborative DNN inference in dynamic and diverse IoT environments.\n","authors":["Sokratis Nikolaidis","Stylianos I. Venieris","Iakovos S. Venieris"],"pdf_url":"https://arxiv.org/pdf/2412.04147v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.14123v2","updated":"2024-12-05T13:15:34Z","published":"2024-02-21T20:43:49Z","title":"DeiSAM: Segment Anything with Deictic Prompting","summary":" Large-scale, pre-trained neural networks have demonstrated strong\ncapabilities in various tasks, including zero-shot image segmentation. To\nidentify concrete objects in complex scenes, humans instinctively rely on\ndeictic descriptions in natural language, i.e., referring to something\ndepending on the context such as \"The object that is on the desk and behind the\ncup.\". However, deep learning approaches cannot reliably interpret such deictic\nrepresentations due to their lack of reasoning capabilities in complex\nscenarios. To remedy this issue, we propose DeiSAM -- a combination of large\npre-trained neural networks with differentiable logic reasoners -- for deictic\npromptable segmentation. Given a complex, textual segmentation description,\nDeiSAM leverages Large Language Models (LLMs) to generate first-order logic\nrules and performs differentiable forward reasoning on generated scene graphs.\nSubsequently, DeiSAM segments objects by matching them to the logically\ninferred image regions. As part of our evaluation, we propose the Deictic\nVisual Genome (DeiVG) dataset, containing paired visual input and complex,\ndeictic textual prompts. 
Our empirical results demonstrate that DeiSAM is a\nsubstantial improvement over purely data-driven baselines for deictic\npromptable segmentation.\n","authors":["Hikaru Shindo","Manuel Brack","Gopika Sudhakaran","Devendra Singh Dhami","Patrick Schramowski","Kristian Kersting"],"pdf_url":"https://arxiv.org/pdf/2402.14123v2.pdf","comment":"Published as a conference paper at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.04140v1","updated":"2024-12-05T13:07:24Z","published":"2024-12-05T13:07:24Z","title":"Understanding Memorization in Generative Models via Sharpness in\n Probability Landscapes","summary":" In this paper, we introduce a geometric framework to analyze memorization in\ndiffusion models using the eigenvalues of the Hessian of the log probability\ndensity. We propose that memorization arises from isolated points in the\nlearned probability distribution, characterized by sharpness in the probability\nlandscape, as indicated by large negative eigenvalues of the Hessian. Through\nexperiments on various datasets, we demonstrate that these eigenvalues\neffectively detect and quantify memorization. Our approach provides a clear\nunderstanding of memorization in diffusion models and lays the groundwork for\ndeveloping strategies to ensure secure and reliable generative models\n","authors":["Dongjae Jeon","Dueun Kim","Albert No"],"pdf_url":"https://arxiv.org/pdf/2412.04140v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04137v1","updated":"2024-12-05T13:04:10Z","published":"2024-12-05T13:04:10Z","title":"Text Change Detection in Multilingual Documents Using Image Comparison","summary":" Document comparison typically relies on optical character recognition (OCR)\nas its core technology. However, OCR requires the selection of appropriate\nlanguage models for each document and the performance of multilingual or hybrid\nmodels remains limited. 
To overcome these challenges, we propose text change\ndetection (TCD) using an image comparison model tailored for multilingual\ndocuments. Unlike OCR-based approaches, our method employs word-level text\nimage-to-image comparison to detect changes. Our model generates bidirectional\nchange segmentation maps between the source and target documents. To enhance\nperformance without requiring explicit text alignment or scaling preprocessing,\nwe employ correlations among multi-scale attention features. We also construct\na benchmark dataset comprising actual printed and scanned word pairs in various\nlanguages to evaluate our model. We validate our approach using our benchmark\ndataset and public benchmarks Distorted Document Images and the LRDE Document\nBinarization Dataset. We compare our model against state-of-the-art semantic\nsegmentation and change detection models, as well as to conventional OCR-based\nmodels.\n","authors":["Doyoung Park","Naresh Reddy Yarram","Sunjin Kim","Minkyu Kim","Seongho Cho","Taehee Lee"],"pdf_url":"https://arxiv.org/pdf/2412.04137v1.pdf","comment":"15pages, 11figures 6tables, wacv2025 accepted"},{"id":"http://arxiv.org/abs/2405.13888v2","updated":"2024-12-05T13:03:55Z","published":"2024-05-22T18:00:41Z","title":"Marrying Causal Representation Learning with Dynamical Systems for\n Science","summary":" Causal representation learning promises to extend causal models to hidden\ncausal variables from raw entangled measurements. However, most progress has\nfocused on proving identifiability results in different settings, and we are\nnot aware of any successful real-world application. At the same time, the field\nof dynamical systems benefited from deep learning and scaled to countless\napplications but does not allow parameter identification. In this paper, we\ndraw a clear connection between the two and their key assumptions, allowing us\nto apply identifiable methods developed in causal representation learning to\ndynamical systems. 
At the same time, we can leverage scalable differentiable\nsolvers developed for differential equations to build models that are both\nidentifiable and practical. Overall, we learn explicitly controllable models\nthat isolate the trajectory-specific parameters for further downstream tasks\nsuch as out-of-distribution classification or treatment effect estimation. We\nexperiment with a wind simulator with partially known factors of variation. We\nalso apply the resulting model to real-world climate data and successfully\nanswer downstream causal questions in line with existing literature on climate\nchange.\n","authors":["Dingling Yao","Caroline Muller","Francesco Locatello"],"pdf_url":"https://arxiv.org/pdf/2405.13888v2.pdf","comment":"NeurIPS 2024 Camera Ready"},{"id":"http://arxiv.org/abs/2411.02785v2","updated":"2024-12-05T12:58:44Z","published":"2024-11-05T03:51:13Z","title":"Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM\n Safety Alignment","summary":" Safety alignment of Large Language Models (LLMs) has recently become a\ncritical objective of model developers. In response, a growing body of work has\nbeen investigating how safety alignment can be bypassed through various\njailbreaking methods, such as adversarial attacks. However, these jailbreak\nmethods can be rather costly or involve a non-trivial amount of creativity and\neffort, introducing the assumption that malicious users are high-resource or\nsophisticated. In this paper, we study how simple random augmentations to the\ninput prompt affect safety alignment effectiveness in state-of-the-art LLMs,\nsuch as Llama 3 and Qwen 2. We perform an in-depth evaluation of 17 different\nmodels and investigate the intersection of safety under random augmentations\nwith multiple dimensions: augmentation type, model size, quantization,\nfine-tuning-based defenses, and decoding strategies (e.g., sampling\ntemperature). 
We show that low-resource and unsophisticated attackers, i.e.\n$\\textit{stochastic monkeys}$, can significantly improve their chances of\nbypassing alignment with just 25 random augmentations per prompt. Source code\nand data: https://github.com/uiuc-focal-lab/stochastic-monkeys/\n","authors":["Jason Vega","Junsheng Huang","Gaokai Zhang","Hangoo Kang","Minjia Zhang","Gagandeep Singh"],"pdf_url":"https://arxiv.org/pdf/2411.02785v2.pdf","comment":"v2: Updated with changes from peer review rebuttal. v1: Version under\n peer review"},{"id":"http://arxiv.org/abs/2412.04134v1","updated":"2024-12-05T12:58:30Z","published":"2024-12-05T12:58:30Z","title":"Compositional Generative Multiphysics and Multi-component Simulation","summary":" Multiphysics simulation, which models the interactions between multiple\nphysical processes, and multi-component simulation of complex structures are\ncritical in fields like nuclear and aerospace engineering. Previous studies\noften rely on numerical solvers or machine learning-based surrogate models to\nsolve or accelerate these simulations. However, multiphysics simulations\ntypically require integrating multiple specialized solvers-each responsible for\nevolving a specific physical process-into a coupled program, which introduces\nsignificant development challenges. Furthermore, no universal algorithm exists\nfor multi-component simulations, which adds to the complexity. Here we propose\ncompositional Multiphysics and Multi-component Simulation with Diffusion models\n(MultiSimDiff) to overcome these challenges. During diffusion-based training,\nMultiSimDiff learns energy functions modeling the conditional probability of\none physical process/component conditioned on other processes/components. In\ninference, MultiSimDiff generates coupled multiphysics solutions and\nmulti-component structures by sampling from the joint probability distribution,\nachieved by composing the learned energy functions in a structured way. 
We test\nour method in three tasks. In the reaction-diffusion and nuclear thermal\ncoupling problems, MultiSimDiff successfully predicts the coupling solution\nusing decoupled data, while the surrogate model fails in the more complex\nsecond problem. For the thermal and mechanical analysis of the prismatic fuel\nelement, MultiSimDiff trained for single component prediction accurately\npredicts a larger structure with 64 components, reducing the relative error by\n40.3% compared to the surrogate model.\n","authors":["Tao Zhang","Zhenhai Liu","Feipeng Qi","Yongjun Jiao","Tailin Wu"],"pdf_url":"https://arxiv.org/pdf/2412.04134v1.pdf","comment":"30pages,13 figures"},{"id":"http://arxiv.org/abs/2412.04121v1","updated":"2024-12-05T12:46:18Z","published":"2024-12-05T12:46:18Z","title":"DeepFEA: Deep Learning for Prediction of Transient Finite Element\n Analysis Solutions","summary":" Finite Element Analysis (FEA) is a powerful but computationally intensive\nmethod for simulating physical phenomena. Recent advancements in machine\nlearning have led to surrogate models capable of accelerating FEA. Yet there\nare still limitations in developing surrogates of transient FEA models that can\nsimultaneously predict the solutions for both nodes and elements with\napplicability on both the 2D and 3D domains. Motivated by this research gap,\nthis study proposes DeepFEA, a deep learning-based framework that leverages a\nmultilayer Convolutional Long Short-Term Memory (ConvLSTM) network branching\ninto two parallel convolutional neural networks to predict the solutions for\nboth nodes and elements of FEA models. The proposed network is optimized using\na novel adaptive learning algorithm, called Node-Element Loss Optimization\n(NELO). NELO minimizes the error occurring at both branches of the network\nenabling the prediction of solutions for transient FEA simulations. 
The\nexperimental evaluation of DeepFEA is performed on three datasets in the\ncontext of structural mechanics, generated to serve as publicly available\nreference datasets. The results show that DeepFEA can achieve less than 3%\nnormalized mean and root mean squared error for 2D and 3D simulation scenarios,\nand inference times that are two orders of magnitude faster than FEA. In\ncontrast, relevant state-of-the-art methods face challenges with\nmulti-dimensional output and dynamic input prediction. Furthermore, DeepFEA's\nrobustness was demonstrated in a real-life biomedical scenario, confirming its\nsuitability for accurate and efficient predictions of FEA simulations.\n","authors":["Georgios Triantafyllou","Panagiotis G. Kalozoumis","George Dimas","Dimitris K. Iakovidis"],"pdf_url":"https://arxiv.org/pdf/2412.04121v1.pdf","comment":"This work has been submitted to a journal for possible publication"},{"id":"http://arxiv.org/abs/2409.19214v2","updated":"2024-12-05T12:45:09Z","published":"2024-09-28T02:45:14Z","title":"Group Distributionally Robust Optimization can Suppress Class Imbalance\n Effect in Network Traffic Classification","summary":" Internet services have led to the eruption of network traffic, and machine\nlearning on these Internet data has become an indispensable tool, especially\nwhen the application is risk-sensitive. This paper focuses on network traffic\nclassification in the presence of class imbalance, which fundamentally and\nubiquitously exists in Internet data analysis. This existence of class\nimbalance mostly drifts the optimal decision boundary, resulting in a less\noptimal solution for machine learning models. To alleviate the effect, we\npropose to design strategies for alleviating the class imbalance through the\nlens of group distributionally robust optimization. Our approach iteratively\nupdates the non-parametric weights for separate classes and optimizes the\nlearning model by minimizing reweighted losses. 
We interpret the optimization\nprocess from a Stackelberg game and perform extensive experiments on typical\nbenchmarks. Results show that our approach can not only suppress the negative\neffect of class imbalance but also improve the comprehensive performance in\nprediction.\n","authors":["Wumei Du","Dong Liang","Yiqin Lv","Xingxing Liang","Guanlin Wu","Qi Wang","Zheng Xie"],"pdf_url":"https://arxiv.org/pdf/2409.19214v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.08020v2","updated":"2024-12-05T12:40:16Z","published":"2024-10-10T15:17:49Z","title":"Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs","summary":" Recent efforts in fine-tuning language models often rely on automatic data\nselection, commonly using Nearest Neighbors retrieval from large datasets.\nHowever, we theoretically show that this approach tends to select redundant\ndata, limiting its effectiveness or even hurting performance. To address this,\nwe introduce SIFT, a data selection algorithm designed to reduce uncertainty\nabout the model's response given a prompt, which unifies ideas from retrieval\nand active learning. Whereas Nearest Neighbor retrieval typically fails in the\npresence of information duplication, SIFT accounts for information duplication\nand optimizes the overall information gain of the selected examples. We focus\nour evaluations on fine-tuning at test-time for prompt-specific language\nmodeling on the Pile dataset, and show that SIFT consistently outperforms\nNearest Neighbor retrieval, with minimal computational overhead. Moreover, we\nshow that our uncertainty estimates can predict the performance gain of\ntest-time fine-tuning, and use this to develop an adaptive algorithm that\ninvests test-time compute proportional to realized performance gains. 
We\nprovide the $\\texttt{activeft}$ (Active Fine-Tuning) library which can be used\nas a drop-in replacement for Nearest Neighbor retrieval.\n","authors":["Jonas Hübotter","Sascha Bongni","Ido Hakimi","Andreas Krause"],"pdf_url":"https://arxiv.org/pdf/2410.08020v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02865v2","updated":"2024-12-05T12:38:58Z","published":"2024-12-03T22:00:12Z","title":"Memory-efficient Continual Learning with Neural Collapse Contrastive","summary":" Contrastive learning has significantly improved representation quality,\nenhancing knowledge transfer across tasks in continual learning (CL). However,\ncatastrophic forgetting remains a key challenge, as contrastive based methods\nprimarily focus on \"soft relationships\" or \"softness\" between samples, which\nshift with changing data distributions and lead to representation overlap\nacross tasks. Recently, the newly identified Neural Collapse phenomenon has\nshown promise in CL by focusing on \"hard relationships\" or \"hardness\" between\nsamples and fixed prototypes. However, this approach overlooks \"softness\",\ncrucial for capturing intra-class variability, and this rigid focus can also\npull old class representations toward current ones, increasing forgetting.\nBuilding on these insights, we propose Focal Neural Collapse Contrastive\n(FNC2), a novel representation learning loss that effectively balances both\nsoft and hard relationships. Additionally, we introduce the Hardness-Softness\nDistillation (HSD) loss to progressively preserve the knowledge gained from\nthese relationships across tasks. Our method outperforms state-of-the-art\napproaches, particularly in minimizing memory reliance. 
Remarkably, even\nwithout the use of memory, our approach rivals rehearsal-based methods,\noffering a compelling solution for data privacy concerns.\n","authors":["Trung-Anh Dang","Vincent Nguyen","Ngoc-Son Vu","Christel Vrain"],"pdf_url":"https://arxiv.org/pdf/2412.02865v2.pdf","comment":"Accepted at WACV 2025"},{"id":"http://arxiv.org/abs/2412.04100v1","updated":"2024-12-05T12:10:42Z","published":"2024-12-05T12:10:42Z","title":"Missing Melodies: AI Music Generation and its \"Nearly\" Complete Omission\n of the Global South","summary":" Recent advances in generative AI have sparked renewed interest and expanded\npossibilities for music generation. However, the performance and versatility of\nthese systems across musical genres are heavily influenced by the availability\nof training data. We conducted an extensive analysis of over one million hours\nof audio datasets used in AI music generation research and manually reviewed\nmore than 200 papers from eleven prominent AI and music conferences and\norganizations (AAAI, ACM, EUSIPCO, EURASIP, ICASSP, ICML, IJCAI, ISMIR,\nNeurIPS, NIME, SMC) to identify a critical gap in the fair representation and\ninclusion of the musical genres of the Global South in AI research. Our\nfindings reveal a stark imbalance: approximately 86% of the total dataset hours\nand over 93% of researchers focus primarily on music from the Global North.\nHowever, around 40% of these datasets include some form of non-Western music,\ngenres from the Global South account for only 14.6% of the data. Furthermore,\napproximately 51% of the papers surveyed concentrate on symbolic music\ngeneration, a method that often fails to capture the cultural nuances inherent\nin music from regions such as South Asia, the Middle East, and Africa. As AI\nincreasingly shapes the creation and dissemination of music, the significant\nunderrepresentation of music genres in datasets and research presents a serious\nthreat to global musical diversity. 
We also propose some important steps to\nmitigate these risks and foster a more inclusive future for AI-driven music\ngeneration.\n","authors":["Atharva Mehta","Shivam Chauhan","Monojit Choudhury"],"pdf_url":"https://arxiv.org/pdf/2412.04100v1.pdf","comment":"Submitted to CACM, 12 pages, 2 figures"},{"id":"http://arxiv.org/abs/2412.04095v1","updated":"2024-12-05T12:01:20Z","published":"2024-12-05T12:01:20Z","title":"HyperFLINT: Hypernetwork-based Flow Estimation and Temporal\n Interpolation for Scientific Ensemble Visualization","summary":" We present HyperFLINT (Hypernetwork-based FLow estimation and temporal\nINTerpolation), a novel deep learning-based approach for estimating flow\nfields, temporally interpolating scalar fields, and facilitating parameter\nspace exploration in spatio-temporal scientific ensemble data. This work\naddresses the critical need to explicitly incorporate ensemble parameters into\nthe learning process, as traditional methods often neglect these, limiting\ntheir ability to adapt to diverse simulation settings and provide meaningful\ninsights into the data dynamics. HyperFLINT introduces a hypernetwork to\naccount for simulation parameters, enabling it to generate accurate\ninterpolations and flow fields for each timestep by dynamically adapting to\nvarying conditions, thereby outperforming existing parameter-agnostic\napproaches. The architecture features modular neural blocks with convolutional\nand deconvolutional layers, supported by a hypernetwork that generates weights\nfor the main network, allowing the model to better capture intricate simulation\ndynamics. 
A series of experiments demonstrates HyperFLINT's significantly\nimproved performance in flow field estimation and temporal interpolation, as\nwell as its potential in enabling parameter space exploration, offering\nvaluable insights into complex scientific ensembles.\n","authors":["Hamid Gadirov","Qi Wu","David Bauer","Kwan-Liu Ma","Jos Roerdink","Steffen Frey"],"pdf_url":"https://arxiv.org/pdf/2412.04095v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.12562v2","updated":"2024-12-05T11:57:19Z","published":"2024-03-19T09:17:18Z","title":"PePR: Performance Per Resource Unit as a Metric to Promote Small-Scale\n Deep Learning in Medical Image Analysis","summary":" The recent advances in deep learning (DL) have been accelerated by access to\nlarge-scale data and compute. These large-scale resources have been used to\ntrain progressively larger models which are resource intensive in terms of\ncompute, data, energy, and carbon emissions. These costs are becoming a new\ntype of entry barrier to researchers and practitioners with limited access to\nresources at such scale, particularly in the Global South. In this work, we\ntake a comprehensive look at the landscape of existing DL models for medical\nimage analysis tasks and demonstrate their usefulness in settings where\nresources are limited. To account for the resource consumption of DL models, we\nintroduce a novel measure to estimate the performance per resource unit, which\nwe call the PePR score. Using a diverse family of 131 unique DL architectures\n(spanning 1M to 130M trainable parameters) and three medical image datasets, we\ncapture trends about the performance-resource trade-offs. In applications like\nmedical image analysis, we argue that small-scale, specialized models are\nbetter than striving for large-scale models. 
Furthermore, we show that using\nexisting pretrained models that are fine-tuned on new data can significantly\nreduce the computational resources and data required compared to training\nmodels from scratch. We hope this work will encourage the community to focus on\nimproving AI equity by developing methods and models with smaller resource\nfootprints.\n","authors":["Raghavendra Selvan","Bob Pepin","Christian Igel","Gabrielle Samuel","Erik B Dam"],"pdf_url":"https://arxiv.org/pdf/2403.12562v2.pdf","comment":"Accepted to be published at the Northern Lights Deep Learning\n Conference (NLDL), 2025. Source code available at\n https://github.com/saintslab/PePR"},{"id":"http://arxiv.org/abs/2412.02482v2","updated":"2024-12-05T11:50:40Z","published":"2024-12-03T14:45:46Z","title":"What should a neuron aim for? Designing local objective functions based\n on information theory","summary":" In modern deep neural networks, the learning dynamics of the individual\nneurons is often obscure, as the networks are trained via global optimization.\nConversely, biological systems build on self-organized, local learning,\nachieving robustness and efficiency with limited global information. We here\nshow how self-organization between individual artificial neurons can be\nachieved by designing abstract bio-inspired local learning goals. These goals\nare parameterized using a recent extension of information theory, Partial\nInformation Decomposition (PID), which decomposes the information that a set of\ninformation sources holds about an outcome into unique, redundant and\nsynergistic contributions. Our framework enables neurons to locally shape the\nintegration of information from various input classes, i.e. feedforward,\nfeedback, and lateral, by selecting which of the three inputs should contribute\nuniquely, redundantly or synergistically to the output. 
This selection is\nexpressed as a weighted sum of PID terms, which, for a given problem, can be\ndirectly derived from intuitive reasoning or via numerical optimization,\noffering a window into understanding task-relevant local information\nprocessing. Achieving neuron-level interpretability while enabling strong\nperformance using local learning, our work advances a principled\ninformation-theoretic foundation for local learning strategies.\n","authors":["Andreas C. Schneider","Valentin Neuhaus","David A. Ehrlich","Abdullah Makkeh","Alexander S. Ecker","Viola Priesemann","Michael Wibral"],"pdf_url":"https://arxiv.org/pdf/2412.02482v2.pdf","comment":"24 pages, 11 figures"},{"id":"http://arxiv.org/abs/2410.13569v2","updated":"2024-12-05T11:50:24Z","published":"2024-10-17T17:17:09Z","title":"Learning on Model Weights using Tree Experts","summary":" The increasing availability of public models begs the question: can we train\nneural networks that use other networks as input? Such models allow us to study\ndifferent aspects of a given neural network, for example, determining the\ncategories in a model's training dataset. However, machine learning on model\nweights is challenging as they often exhibit significant variation unrelated to\nthe models' semantic properties (nuisance variation). Here, we identify a key\nproperty of real-world models: most public models belong to a small set of\nModel Trees, where all models within a tree are fine-tuned from a common\nancestor (e.g., a foundation model). Importantly, we find that within each tree\nthere is less nuisance variation between models. Concretely, while learning\nacross Model Trees requires complex architectures, even a linear classifier\ntrained on a single model layer often works within trees. While effective,\nthese linear classifiers are computationally expensive, especially when dealing\nwith larger models that have many parameters. 
To address this, we introduce\nProbing Experts (ProbeX), a theoretically motivated and lightweight method.\nNotably, ProbeX is the first probing method specifically designed to learn from\nthe weights of a single hidden model layer. We demonstrate the effectiveness of\nProbeX by predicting the categories in a model's training dataset based only on\nits weights. Excitingly, ProbeX can also map the weights of Stable Diffusion\ninto a shared weight-language embedding space, enabling zero-shot model\nclassification.\n","authors":["Eliahu Horwitz","Bar Cavia","Jonathan Kahana","Yedid Hoshen"],"pdf_url":"https://arxiv.org/pdf/2410.13569v2.pdf","comment":"Project page: https://horwitz.ai/probex/"},{"id":"http://arxiv.org/abs/2406.11624v3","updated":"2024-12-05T11:47:49Z","published":"2024-06-17T15:07:55Z","title":"Words in Motion: Extracting Interpretable Control Vectors for Motion\n Transformers","summary":" Transformer-based models generate hidden states that are difficult to\ninterpret. In this work, we aim to interpret these hidden states and control\nthem at inference, with a focus on motion forecasting. We use linear probes to\nmeasure neural collapse towards interpretable motion features in hidden states.\nHigh probing accuracy implies meaningful directions and distances between\nhidden states of opposing features, which we use to fit interpretable control\nvectors for activation steering at inference. To optimize our control vectors,\nwe use sparse autoencoders with fully-connected, convolutional, MLPMixer layers\nand various activation functions. Notably, we show that enforcing sparsity in\nhidden states leads to a more linear relationship between control vector\ntemperatures and forecasts. Our approach enables mechanistic interpretability\nand zero-shot generalization to unseen dataset characteristics with negligible\ncomputational overhead. 
Our implementation is available at\nhttps://github.com/kit-mrt/future-motion\n","authors":["Omer Sahin Tas","Royden Wagner"],"pdf_url":"https://arxiv.org/pdf/2406.11624v3.pdf","comment":"Add autoencoders with convolutional, MLPMixer layers, and JumpReLU\n activations"},{"id":"http://arxiv.org/abs/2412.04082v1","updated":"2024-12-05T11:32:53Z","published":"2024-12-05T11:32:53Z","title":"Learnable Similarity and Dissimilarity Guided Symmetric Non-Negative\n Matrix Factorization","summary":" Symmetric nonnegative matrix factorization (SymNMF) is a powerful tool for\nclustering, which typically uses the $k$-nearest neighbor ($k$-NN) method to\nconstruct similarity matrix. However, $k$-NN may mislead clustering since the\nneighbors may belong to different clusters, and its reliability generally\ndecreases as $k$ grows. In this paper, we construct the similarity matrix as a\nweighted $k$-NN graph with learnable weight that reflects the reliability of\neach $k$-th NN. This approach reduces the search space of the similarity matrix\nlearning to $n - 1$ dimension, as opposed to the $\\mathcal{O}(n^2)$ dimension\nof existing methods, where $n$ represents the number of samples. Moreover, to\nobtain a discriminative similarity matrix, we introduce a dissimilarity matrix\nwith a dual structure of the similarity matrix, and propose a new form of\northogonality regularization with discussions on its geometric interpretation\nand numerical stability. An efficient alternative optimization algorithm is\ndesigned to solve the proposed model, with theoretically guarantee that the\nvariables converge to a stationary point that satisfies the KKT conditions. The\nadvantage of the proposed model is demonstrated by the comparison with nine\nstate-of-the-art clustering methods on eight datasets. 
The code is available at\nhttps://github.com/lwl-learning/LSDGSymNMF.\n","authors":["Wenlong Lyu","Yuheng Jia"],"pdf_url":"https://arxiv.org/pdf/2412.04082v1.pdf","comment":"12 pages, 14 figures"},{"id":"http://arxiv.org/abs/2412.04081v1","updated":"2024-12-05T11:32:14Z","published":"2024-12-05T11:32:14Z","title":"Federated Learning in Mobile Networks: A Comprehensive Case Study on\n Traffic Forecasting","summary":" The increasing demand for efficient resource allocation in mobile networks\nhas catalyzed the exploration of innovative solutions that could enhance the\ntask of real-time cellular traffic prediction. Under these circumstances,\nfederated learning (FL) stands out as a distributed and privacy-preserving\nsolution to foster collaboration among different sites, thus enabling\nresponsive near-the-edge solutions. In this paper, we comprehensively study the\npotential benefits of FL in telecommunications through a case study on\nfederated traffic forecasting using real-world data from base stations (BSs) in\nBarcelona (Spain). Our study encompasses relevant aspects within the federated\nexperience, including model aggregation techniques, outlier management, the\nimpact of individual clients, personalized learning, and the integration of\nexogenous sources of data. The performed evaluation is based on both prediction\naccuracy and sustainability, thus showcasing the environmental impact of\nemployed FL algorithms in various settings. The findings from our study\nhighlight FL as a promising and robust solution for mobile traffic prediction,\nemphasizing its twin merits as a privacy-conscious and environmentally\nsustainable approach, while also demonstrating its capability to overcome data\nheterogeneity and ensure high-quality predictions, marking a significant stride\ntowards its integration in mobile traffic management systems.\n","authors":["Nikolaos Pavlidis","Vasileios Perifanis","Selim F. Yilmaz","Francesc Wilhelmi","Marco Miozzo","Pavlos S. 
Efraimidis","Remous-Aris Koutsiamanis","Pavol Mulinka","Paolo Dini"],"pdf_url":"https://arxiv.org/pdf/2412.04081v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.18245v2","updated":"2024-12-05T11:29:56Z","published":"2024-07-25T17:58:17Z","title":"VGGHeads: 3D Multi Head Alignment with a Large-Scale Synthetic Dataset","summary":" Human head detection, keypoint estimation, and 3D head model fitting are\nessential tasks with many applications. However, traditional real-world\ndatasets often suffer from bias, privacy, and ethical concerns, and they have\nbeen recorded in laboratory environments, which makes it difficult for trained\nmodels to generalize. Here, we introduce VGGHeads -- a large-scale synthetic\ndataset generated with diffusion models for human head detection and 3D mesh\nestimation. Our dataset comprises over 1 million high-resolution images, each\nannotated with detailed 3D head meshes, facial landmarks, and bounding boxes.\nUsing this dataset, we introduce a new model architecture capable of\nsimultaneous head detection and head mesh reconstruction from a single image in\na single step. 
Through extensive experimental evaluations, we demonstrate that\nmodels trained on our synthetic data achieve strong performance on real images.\nFurthermore, the versatility of our dataset makes it applicable across a broad\nspectrum of tasks, offering a general and comprehensive representation of human\nheads.\n","authors":["Orest Kupyn","Eugene Khvedchenia","Christian Rupprecht"],"pdf_url":"https://arxiv.org/pdf/2407.18245v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04078v1","updated":"2024-12-05T11:24:27Z","published":"2024-12-05T11:24:27Z","title":"Towards Generalizable Autonomous Penetration Testing via Domain\n Randomization and Meta-Reinforcement Learning","summary":" With increasing numbers of vulnerabilities exposed on the internet,\nautonomous penetration testing (pentesting) has emerged as an emerging research\narea, while reinforcement learning (RL) is a natural fit for studying\nautonomous pentesting. Previous research in RL-based autonomous pentesting\nmainly focused on enhancing agents' learning efficacy within abstract simulated\ntraining environments. They overlooked the applicability and generalization\nrequirements of deploying agents' policies in real-world environments that\ndiffer substantially from their training settings. In contrast, for the first\ntime, we shift focus to the pentesting agents' ability to generalize across\nunseen real environments. For this purpose, we propose a Generalizable\nAutonomous Pentesting framework (namely GAP) for training agents capable of\ndrawing inferences from one to another -- a key requirement for the broad\napplication of autonomous pentesting and a hallmark of human intelligence. GAP\nintroduces a Real-to-Sim-to-Real pipeline with two key methods: domain\nrandomization and meta-RL learning. Specifically, we are among the first to\napply domain randomization in autonomous pentesting and propose a large\nlanguage model-powered domain randomization method for synthetic environment\ngeneration. 
We further apply meta-RL to improve the agents' generalization\nability in unseen environments by leveraging the synthetic environments. The\ncombination of these two methods can effectively bridge the generalization gap\nand improve policy adaptation performance. Experiments are conducted on various\nvulnerable virtual machines, with results showing that GAP can (a) enable\npolicy learning in unknown real environments, (b) achieve zero-shot policy\ntransfer in similar environments, and (c) realize rapid policy adaptation in\ndissimilar environments.\n","authors":["Shicheng Zhou","Jingju Liu","Yuliang Lu","Jiahai Yang","Yue Zhang","Jie Chen"],"pdf_url":"https://arxiv.org/pdf/2412.04078v1.pdf","comment":"This work has been submitted to the IEEE for possible publication"},{"id":"http://arxiv.org/abs/2412.04076v1","updated":"2024-12-05T11:17:03Z","published":"2024-12-05T11:17:03Z","title":"Distance-Adaptive Quaternion Knowledge Graph Embedding with\n Bidirectional Rotation","summary":" Quaternion contains one real part and three imaginary parts, which provided a\nmore expressive hypercomplex space for learning knowledge graph. Existing\nquaternion embedding models measure the plausibility of a triplet either\nthrough semantic matching or geometric distance scoring functions. However, it\nappears that semantic matching diminishes the separability of entities, while\nthe distance scoring function weakens the semantics of entities. To address\nthis issue, we propose a novel quaternion knowledge graph embedding model. Our\nmodel combines semantic matching with entity's geometric distance to better\nmeasure the plausibility of triplets. Specifically, in the quaternion space, we\nperform a right rotation on head entity and a reverse rotation on tail entity\nto learn rich semantic features. Then, we utilize distance adaptive\ntranslations to learn geometric distance between entities. 
Furthermore, we\nprovide mathematical proofs to demonstrate our model can handle complex logical\nrelationships. Extensive experimental results and analyses show our model\nsignificantly outperforms previous models on well-known knowledge graph\ncompletion benchmark datasets. Our code is available at\nhttps://github.com/llqy123/DaBR.\n","authors":["Weihua Wang","Qiuyu Liang","Feilong Bao","Guanglai Gao"],"pdf_url":"https://arxiv.org/pdf/2412.04076v1.pdf","comment":"Accepted by COLING 2025"},{"id":"http://arxiv.org/abs/2412.04074v1","updated":"2024-12-05T11:12:46Z","published":"2024-12-05T11:12:46Z","title":"Integrated Sensing and Communications for Low-Altitude Economy: A Deep\n Reinforcement Learning Approach","summary":" This paper studies an integrated sensing and communications (ISAC) system for\nlow-altitude economy (LAE), where a ground base station (GBS) provides\ncommunication and navigation services for authorized unmanned aerial vehicles\n(UAVs), while sensing the low-altitude airspace to monitor the unauthorized\nmobile target. The expected communication sum-rate over a given flight period\nis maximized by jointly optimizing the beamforming at the GBS and UAVs'\ntrajectories, subject to the constraints on the average signal-to-noise ratio\nrequirement for sensing, the flight mission and collision avoidance of UAVs, as\nwell as the maximum transmit power at the GBS. Typically, this is a sequential\ndecision-making problem with the given flight mission. Thus, we transform it to\na specific Markov decision process (MDP) model called episode task. Based on\nthis modeling, we propose a novel LAE-oriented ISAC scheme, referred to as Deep\nLAE-ISAC (DeepLSC), by leveraging the deep reinforcement learning (DRL)\ntechnique. In DeepLSC, a reward function and a new action selection policy\ntermed constrained noise-exploration policy are judiciously designed to fulfill\nvarious constraints. 
To enable efficient learning in episode tasks, we develop\na hierarchical experience replay mechanism, where the gist is to employ all\nexperiences generated within each episode to jointly train the neural network.\nBesides, to enhance the convergence speed of DeepLSC, a symmetric experience\naugmentation mechanism, which simultaneously permutes the indexes of all\nvariables to enrich available experience sets, is proposed. Simulation results\ndemonstrate that compared with benchmarks, DeepLSC yields a higher sum-rate\nwhile meeting the preset constraints, achieves faster convergence, and is more\nrobust against different settings.\n","authors":["Xiaowen Ye","Yuyi Mao","Xianghao Yu","Shu Sun","Liqun Fu","Jie Xu"],"pdf_url":"https://arxiv.org/pdf/2412.04074v1.pdf","comment":"submitted for an IEEE publication"},{"id":"http://arxiv.org/abs/2412.04072v1","updated":"2024-12-05T11:09:11Z","published":"2024-12-05T11:09:11Z","title":"Boundary-Guided Learning for Gene Expression Prediction in Spatial\n Transcriptomics","summary":" Spatial transcriptomics (ST) has emerged as an advanced technology that\nprovides spatial context to gene expression. Recently, deep learning-based\nmethods have shown the capability to predict gene expression from WSI data\nusing ST data. Existing approaches typically extract features from images and\nthe neighboring regions using pretrained models, and then develop methods to\nfuse this information to generate the final output. However, these methods\noften fail to account for the cellular structure similarity, cellular density\nand the interactions within the microenvironment. In this paper, we propose a\nframework named BG-TRIPLEX, which leverages boundary information extracted from\npathological images as guiding features to enhance gene expression prediction\nfrom WSIs. Specifically, our model consists of three branches: the spot,\nin-context and global branches. 
In the spot and in-context branches, boundary\ninformation, including edge and nuclei characteristics, is extracted using\npretrained models. These boundary features guide the learning of cellular\nmorphology and the characteristics of microenvironment through Multi-Head\nCross-Attention. Finally, these features are integrated with global features to\npredict the final output. Extensive experiments were conducted on three public\nST datasets. The results demonstrate that our BG-TRIPLEX consistently\noutperforms existing methods in terms of Pearson Correlation Coefficient (PCC).\nThis method highlights the crucial role of boundary features in understanding\nthe complex interactions between WSI and gene expression, offering a promising\ndirection for future research.\n","authors":["Mingcheng Qu","Yuncong Wu","Donglin Di","Anyang Su","Tonghua Su","Yang Song","Lei Fan"],"pdf_url":"https://arxiv.org/pdf/2412.04072v1.pdf","comment":"8 pages, 5 figures"},{"id":"http://arxiv.org/abs/2408.08968v3","updated":"2024-12-05T11:01:30Z","published":"2024-08-16T18:34:11Z","title":"Online SLA Decomposition: Enabling Real-Time Adaptation to Evolving\n Systems","summary":" When a network slice spans multiple technology domains, it is crucial for\neach domain to uphold the End-to-End (E2E) Service Level Agreement (SLA)\nassociated with the slice. Consequently, the E2E SLA must be properly\ndecomposed into partial SLAs that are assigned to each domain involved. In a\nnetwork slice management system with a two-level architecture, comprising an\nE2E service orchestrator and local domain controllers, we consider that the\norchestrator has access solely to historical data regarding the responses of\nlocal controllers to previous requests, and this information is used to\nconstruct a risk model for each domain. 
In this study, we extend our previous\nwork by investigating the dynamic nature of real-world systems and introducing\nan online learning-decomposition framework to tackle the dynamicity. We propose\na framework that periodically updates the risk models based on the most recent\nfeedback. This approach leverages key components such as online gradient\ndescent and FIFO memory buffers, which enhance the stability and robustness of\nthe overall process. Our empirical study on an analytic model-based simulator\ndemonstrates that the proposed framework outperforms the state-of-the-art\nstatic approach, providing more accurate and resilient SLA decomposition even\nunder varying conditions and limited data scenarios.\n","authors":["Cyril Shih-Huan Hsu","Danny De Vleeschauwer","Chrysa Papagianni"],"pdf_url":"https://arxiv.org/pdf/2408.08968v3.pdf","comment":"The paper has been submitted to IEEE ICMLCN 2025"},{"id":"http://arxiv.org/abs/2412.04065v1","updated":"2024-12-05T10:59:54Z","published":"2024-12-05T10:59:54Z","title":"Space to Policy: Scalable Brick Kiln Detection and Automatic Compliance\n Monitoring with Geospatial Data","summary":" Air pollution kills 7 million people annually. The brick kiln sector\nsignificantly contributes to economic development but also accounts for 8-14\\%\nof air pollution in India. Policymakers have implemented compliance measures to\nregulate brick kilns. Emission inventories are critical for air quality\nmodeling and source apportionment studies. However, the largely unorganized\nnature of the brick kiln sector necessitates labor-intensive survey efforts for\nmonitoring. Recent efforts by air quality researchers have relied on manual\nannotation of brick kilns using satellite imagery to build emission\ninventories, but this approach lacks scalability. 
Machine-learning-based object\ndetection methods have shown promise for detecting brick kilns; however,\nprevious studies often rely on costly high-resolution imagery and fail to\nintegrate with governmental policies. In this work, we developed a scalable\nmachine-learning pipeline that detected and classified 30638 brick kilns across\nfive states in the Indo-Gangetic Plain using free, moderate-resolution\nsatellite imagery from Planet Labs. Our detections have a high correlation with\non-ground surveys. We performed automated compliance analysis based on\ngovernment policies. In the Delhi airshed, stricter policy enforcement has led\nto the adoption of efficient brick kiln technologies. This study highlights the\nneed for inclusive policies that balance environmental sustainability with the\nlivelihoods of workers.\n","authors":["Zeel B Patel","Rishabh Mondal","Shataxi Dubey","Suraj Jaiswal","Sarath Guttikunda","Nipun Batra"],"pdf_url":"https://arxiv.org/pdf/2412.04065v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04064v1","updated":"2024-12-05T10:59:20Z","published":"2024-12-05T10:59:20Z","title":"Graph Neural Networks Need Cluster-Normalize-Activate Modules","summary":" Graph Neural Networks (GNNs) are non-Euclidean deep learning models for\ngraph-structured data. Despite their successful and diverse applications,\noversmoothing prohibits deep architectures due to node features converging to a\nsingle fixed point. This severely limits their potential to solve complex\ntasks. To counteract this tendency, we propose a plug-and-play module\nconsisting of three steps: Cluster-Normalize-Activate (CNA). By applying CNA\nmodules, GNNs search and form super nodes in each layer, which are normalized\nand activated individually. We demonstrate in node classification and property\nprediction tasks that CNA significantly improves the accuracy over the\nstate-of-the-art. Particularly, CNA reaches 94.18% and 95.75% accuracy on Cora\nand CiteSeer, respectively. 
It further benefits GNNs in regression tasks as\nwell, reducing the mean squared error compared to all baselines. At the same\ntime, GNNs with CNA require substantially fewer learnable parameters than\ncompeting architectures.\n","authors":["Arseny Skryagin","Felix Divo","Mohammad Amin Ali","Devendra Singh Dhami","Kristian Kersting"],"pdf_url":"https://arxiv.org/pdf/2412.04064v1.pdf","comment":"17 pages, 6 figures, 6 tables, accepted at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2406.09014v6","updated":"2024-12-05T10:57:12Z","published":"2024-06-13T11:38:58Z","title":"Deep learning empowered sensor fusion boosts infant movement\n classification","summary":" To assess the integrity of the developing nervous system, the Prechtl general\nmovement assessment (GMA) is recognized for its clinical value in diagnosing\nneurological impairments in early infancy. GMA has been increasingly augmented\nthrough machine learning approaches intending to scale-up its application,\ncircumvent costs in the training of human assessors and further standardize\nclassification of spontaneous motor patterns. Available deep learning tools,\nall of which are based on single sensor modalities, are however still\nconsiderably inferior to that of well-trained human assessors. These approaches\nare hardly comparable as all models are designed, trained and evaluated on\nproprietary/silo-data sets. With this study we propose a sensor fusion approach\nfor assessing fidgety movements (FMs). FMs were recorded from 51 typically\ndeveloping participants. We compared three different sensor modalities\n(pressure, inertial, and visual sensors). Various combinations and two sensor\nfusion approaches (late and early fusion) for infant movement classification\nwere tested to evaluate whether a multi-sensor system outperforms single\nmodality assessments. Convolutional neural network (CNN) architectures were\nused to classify movement patterns. 
The performance of the three-sensor fusion\n(classification accuracy of 94.5%) was significantly higher than that of any\nsingle modality evaluated. We show that the sensor fusion approach is a\npromising avenue for automated classification of infant motor patterns. The\ndevelopment of a robust sensor fusion system may significantly enhance AI-based\nearly recognition of neurofunctions, ultimately facilitating automated early\ndetection of neurodevelopmental conditions.\n","authors":["Tomas Kulvicius","Dajie Zhang","Luise Poustka","Sven Bölte","Lennart Jahn","Sarah Flügge","Marc Kraft","Markus Zweckstetter","Karin Nielsen-Saines","Florentin Wörgötter","Peter B Marschik"],"pdf_url":"https://arxiv.org/pdf/2406.09014v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14027v3","updated":"2024-12-05T10:49:37Z","published":"2023-12-21T16:58:49Z","title":"AdamMCMC: Combining Metropolis Adjusted Langevin with Momentum-based\n Optimization","summary":" Uncertainty estimation is a key issue when considering the application of\ndeep neural network methods in science and engineering. In this work, we\nintroduce a novel algorithm that quantifies epistemic uncertainty via Monte\nCarlo sampling from a tempered posterior distribution. It combines the well\nestablished Metropolis Adjusted Langevin Algorithm (MALA) with momentum-based\noptimization using Adam and leverages a prolate proposal distribution, to\nefficiently draw from the posterior. We prove that the constructed chain admits\nthe Gibbs posterior as invariant distribution and approximates this posterior\nin total variation distance. Furthermore, we demonstrate the efficiency of the\nresulting algorithm and the merit of the proposed changes on a state-of-the-art\nclassifier from high-energy particle physics.\n","authors":["Sebastian Bieringer","Gregor Kasieczka","Maximilian F. 
Steffen","Mathias Trabs"],"pdf_url":"https://arxiv.org/pdf/2312.14027v3.pdf","comment":"16 pages, 5 figures; adapted Theorem 2"},{"id":"http://arxiv.org/abs/2411.14875v2","updated":"2024-12-05T10:40:41Z","published":"2024-11-22T11:55:37Z","title":"Iterative Reweighted Framework Based Algorithms for Sparse Linear\n Regression with Generalized Elastic Net Penalty","summary":" The elastic net penalty is frequently employed in high-dimensional statistics\nfor parameter regression and variable selection. It is particularly beneficial\ncompared to lasso when the number of predictors greatly surpasses the number of\nobservations. However, empirical evidence has shown that the $\\ell_q$-norm\npenalty (where $0 < q < 1$) often provides better regression compared to the\n$\\ell_1$-norm penalty, demonstrating enhanced robustness in various scenarios.\nIn this paper, we explore a generalized elastic net model that employs an\n$\\ell_r$-norm (where $r \\geq 1$) in the loss function to accommodate various types\nof noise, and employs an $\\ell_q$-norm (where $0 < q < 1$) to replace the\n$\\ell_1$-norm in the elastic net penalty. Theoretically, we establish\ncomputable lower bounds for the nonzero entries of the generalized first-order\nstationary points of the proposed generalized elastic net model. For\nimplementation, we develop two efficient algorithms based on the locally\nLipschitz continuous $\\epsilon$-approximation to the $\\ell_q$-norm. The first\nalgorithm employs an alternating direction method of multipliers (ADMM), while\nthe second utilizes a proximal majorization-minimization method (PMM), where\nthe subproblems are addressed using the semismooth Newton method (SSN). We also\nperform extensive numerical experiments with both simulated and real data,\nshowing that both algorithms demonstrate superior performance. 
Notably, the\nPMM-SSN is more efficient than ADMM, even though the latter provides a simpler\nimplementation.\n","authors":["Yanyun Ding","Zhenghua Yao","Peili Li","Yunhai Xiao"],"pdf_url":"https://arxiv.org/pdf/2411.14875v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04047v1","updated":"2024-12-05T10:38:29Z","published":"2024-12-05T10:38:29Z","title":"Pathwise optimization for bridge-type estimators and its applications","summary":" Sparse parametric models are of great interest in statistical learning and\nare often analyzed by means of regularized estimators. Pathwise methods allow\nefficient computation of the full solution path for penalized estimators, for any\npossible value of the penalization parameter $\\lambda$. In this paper we deal\nwith the pathwise optimization for bridge-type problems; i.e. we are interested\nin the minimization of a loss function, such as negative log-likelihood or\nresidual sum of squares, plus the sum of $\\ell^q$ norms with $q\\in(0,1]$\ninvolving adaptive coefficients. For some loss functions this regularization\nasymptotically achieves the oracle properties (such as selection\nconsistency). Nevertheless, since the objective function involves nonconvex and\nnondifferentiable terms, the minimization problem is computationally\nchallenging.\n The aim of this paper is to apply some general algorithms, arising from\nnonconvex optimization theory, to efficiently compute the path solutions for\nthe adaptive bridge estimator with multiple penalties. In particular, we take\ninto account two different approaches: accelerated proximal gradient descent\nand blockwise alternating optimization. The convergence and the path\nconsistency of these algorithms are discussed. In order to assess our methods,\nwe apply these algorithms to the penalized estimation of diffusion processes\nobserved at discrete times. 
This latter represents a recent research topic in\nthe field of statistics for time-dependent data.\n","authors":["Alessandro De Gregorio","Francesco Iafrate"],"pdf_url":"https://arxiv.org/pdf/2412.04047v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04045v1","updated":"2024-12-05T10:36:39Z","published":"2024-12-05T10:36:39Z","title":"AI4EF: Artificial Intelligence for Energy Efficiency in the Building\n Sector","summary":" AI4EF, Artificial Intelligence for Energy Efficiency, is an advanced,\nuser-centric tool designed to support decision-making in building energy\nretrofitting and efficiency optimization. Leveraging machine learning (ML) and\ndata-driven insights, AI4EF enables stakeholders such as public sector\nrepresentatives, energy consultants, and building owners to model, analyze, and\npredict energy consumption, retrofit costs, and environmental impacts of\nbuilding upgrades. Featuring a modular framework, AI4EF includes customizable\nbuilding retrofitting, photovoltaic installation assessment, and predictive\nmodeling tools that allow users to input building parameters and receive\ntailored recommendations for achieving energy savings and carbon reduction\ngoals. Additionally, the platform incorporates a Training Playground for data\nscientists to refine ML models used by said framework. Finally, AI4EF provides\naccess to the Enershare Data Space to facilitate seamless data sharing and\naccess within the ecosystem. Its compatibility with open-source identity\nmanagement, Keycloak, enhances security and accessibility, making it adaptable\nfor various regulatory and organizational contexts. 
This paper presents an\narchitectural overview of AI4EF, its application in energy efficiency\nscenarios, and its potential for advancing sustainable energy practices through\nartificial intelligence (AI).\n","authors":["Alexandros Menelaos Tzortzis","Georgios Kormpakis","Sotiris Pelekis","Ariadni Michalitsi-Psarrou","Evangelos Karakolis","Christos Ntanos","Dimitris Askounis"],"pdf_url":"https://arxiv.org/pdf/2412.04045v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01591v2","updated":"2024-12-05T10:22:00Z","published":"2024-12-02T15:08:55Z","title":"Kernel-Based Optimal Control: An Infinitesimal Generator Approach","summary":" This paper presents a novel approach for optimal control of nonlinear\nstochastic systems using infinitesimal generator learning within\ninfinite-dimensional reproducing kernel Hilbert spaces. Our learning framework\nleverages data samples of system dynamics and stage cost functions, with only\ncontrol penalties and constraints provided. The proposed method directly learns\nthe diffusion operator of a controlled Fokker-Planck-Kolmogorov equation in an\ninfinite-dimensional hypothesis space. This operator models the continuous-time\nevolution of the probability measure of the control system's state. We\ndemonstrate that this approach seamlessly integrates with modern convex\noperator-theoretic Hamilton-Jacobi-Bellman recursions, enabling a data-driven\nsolution to the optimal control problem. Furthermore, our statistical learning\nframework includes nonparametric estimators for uncontrolled forward\ninfinitesimal generators as a special case. 
Numerical experiments, ranging from\nsynthetic differential equations to simulated robotic systems, showcase the\nadvantages of our approach compared to both modern data-driven and classical\nnonlinear programming methods for optimal control.\n","authors":["Petar Bevanda","Nicolas Hoischen","Tobias Wittmann","Jan Brüdigam","Sandra Hirche","Boris Houska"],"pdf_url":"https://arxiv.org/pdf/2412.01591v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04034v1","updated":"2024-12-05T10:15:56Z","published":"2024-12-05T10:15:56Z","title":"Dynamic Graph Representation with Contrastive Learning for Financial\n Market Prediction: Integrating Temporal Evolution and Static Relations","summary":" Temporal Graph Learning (TGL) is crucial for capturing the evolving nature of\nstock markets. Traditional methods often ignore the interplay between dynamic\ntemporal changes and static relational structures between stocks. To address\nthis issue, we propose the Dynamic Graph Representation with Contrastive\nLearning (DGRCL) framework, which integrates dynamic and static graph relations\nto improve the accuracy of stock trend prediction. Our framework introduces two\nkey components: the Embedding Enhancement (EE) module and the Contrastive\nConstrained Training (CCT) module. The EE module focuses on dynamically\ncapturing the temporal evolution of stock data, while the CCT module enforces\nstatic constraints based on stock relations, refined within contrastive\nlearning. This dual-relation approach allows for a more comprehensive\nunderstanding of stock market dynamics. Our experiments on two major U.S. stock\nmarket datasets, NASDAQ and NYSE, demonstrate that DGRCL significantly\noutperforms state-of-the-art TGL baselines. Ablation studies indicate the\nimportance of both modules. Overall, DGRCL not only enhances prediction ability\nbut also provides a robust framework for integrating temporal and relational\ndata in dynamic graphs. 
Code and data are available for public access.\n","authors":["Yunhua Pei","Jin Zheng","John Cartlidge"],"pdf_url":"https://arxiv.org/pdf/2412.04034v1.pdf","comment":"12 pages, 2 figures, author manuscript accepted for ICAART 2025\n (International Conference on Agents and Artificial Intelligence)"},{"id":"http://arxiv.org/abs/2411.01115v2","updated":"2024-12-05T09:45:55Z","published":"2024-11-02T02:50:12Z","title":"Relax and Merge: A Simple Yet Effective Framework for Solving Fair\n $k$-Means and $k$-sparse Wasserstein Barycenter Problems","summary":" The fairness of clustering algorithms has gained widespread attention across\nvarious areas, including machine learning. In this paper, we study fair\n$k$-means clustering in Euclidean space. Given a dataset comprising several\ngroups, the fairness constraint requires that each cluster should contain a\nproportion of points from each group within specified lower and upper bounds.\nDue to these fairness constraints, determining the optimal locations of $k$\ncenters is quite a challenging task. We propose a novel ``Relax and Merge''\nframework that returns a $(1+4\\rho + O(\\epsilon))$-approximate solution, where\n$\\rho$ is the approximation ratio of an off-the-shelf vanilla $k$-means algorithm\nand $O(\\epsilon)$ can be an arbitrarily small positive number. If equipped with\na PTAS of $k$-means, our solution can achieve an approximation ratio of\n$(5+O(\\epsilon))$ with only a slight violation of the fairness constraints,\nwhich improves the current state-of-the-art approximation guarantee.\nFurthermore, using our framework, we can also obtain a $(1+4\\rho\n+O(\\epsilon))$-approximate solution for the $k$-sparse Wasserstein Barycenter\nproblem, which is a fundamental optimization problem in the field of optimal\ntransport, and a $(2+6\\rho)$-approximate solution for the strictly fair\n$k$-means clustering with no violation, both of which are better than the\ncurrent state-of-the-art methods. 
In addition, the empirical results\ndemonstrate that our proposed algorithm can significantly outperform baseline\napproaches in terms of clustering cost.\n","authors":["Shihong Song","Guanlin Mo","Qingyuan Yang","Hu Ding"],"pdf_url":"https://arxiv.org/pdf/2411.01115v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04011v1","updated":"2024-12-05T09:45:21Z","published":"2024-12-05T09:45:21Z","title":"A Note on Spectral Map","summary":" In molecular dynamics (MD) simulations, transitions between states are often\nrare events due to energy barriers that exceed the thermal temperature. Because\nof their infrequent occurrence and the huge number of degrees of freedom in\nmolecular systems, understanding the physical properties that drive rare events\nis immensely difficult. A common approach to this problem is to propose a\ncollective variable (CV) that describes this process by a simplified\nrepresentation. However, choosing CVs is not easy, as it often relies on\nphysical intuition. Machine learning (ML) techniques provide a promising\napproach for effectively extracting optimal CVs from MD data. Here, we provide\na note on a recent unsupervised ML method called spectral map, which constructs\nCVs by maximizing the timescale separation between slow and fast variables in\nthe system.\n","authors":["Tuğçe Gökdemir","Jakub Rydzewski"],"pdf_url":"https://arxiv.org/pdf/2412.04011v1.pdf","comment":"A letter prepared for the Ensemble journal of the Molecular\n Simulation Society of Japan (MSSJ)"},{"id":"http://arxiv.org/abs/2411.05712v2","updated":"2024-12-05T09:39:07Z","published":"2024-11-08T17:13:53Z","title":"Scaling Laws for Task-Optimized Models of the Primate Visual Ventral\n Stream","summary":" When trained on large-scale object classification datasets, certain\nartificial neural network models begin to approximate core object recognition\n(COR) behaviors and neural response patterns in the primate visual ventral\nstream (VVS). 
While recent machine learning advances suggest that scaling model\nsize, dataset size, and compute resources improve task performance, the impact\nof scaling on brain alignment remains unclear. In this study, we explore\nscaling laws for modeling the primate VVS by systematically evaluating over 600\nmodels trained under controlled conditions on benchmarks spanning V1, V2, V4,\nIT and COR behaviors. We observe that while behavioral alignment continues to\nscale with larger models, neural alignment saturates. This observation remains\ntrue across model architectures and training datasets, even though models with\nstronger inductive bias and datasets with higher-quality images are more\ncompute-efficient. Increased scaling is especially beneficial for higher-level\nvisual areas, where small models trained on few samples exhibit only poor\nalignment. Finally, we develop a scaling recipe, indicating that a greater\nproportion of compute should be allocated to data samples over model size. Our\nresults suggest that while scaling alone might suffice for alignment with human\ncore object recognition behavior, it will not yield improved models of the\nbrain's visual ventral stream with current architectures and datasets,\nhighlighting the need for novel strategies in building brain-like models.\n","authors":["Abdulkadir Gokce","Martin Schrimpf"],"pdf_url":"https://arxiv.org/pdf/2411.05712v2.pdf","comment":"10 pages for the main paper, 23 pages in total. 7 main figures and 7\n supplementary figures. 
Code, model weights, and benchmark results can be\n accessed at https://github.com/epflneuroailab/scaling-primate-vvs - In\n version 2, Figure 7 and the related discussion are added, and the appendix is\n updated"},{"id":"http://arxiv.org/abs/2412.03486v2","updated":"2024-12-05T09:26:26Z","published":"2024-12-04T17:23:35Z","title":"Tight PAC-Bayesian Risk Certificates for Contrastive Learning","summary":" Contrastive representation learning is a modern paradigm for learning\nrepresentations of unlabeled data via augmentations -- precisely, contrastive\nmodels learn to embed semantically similar pairs of samples (positive pairs)\ncloser than independently drawn samples (negative samples). In spite of its\nempirical success and widespread use in foundation models, statistical theory\nfor contrastive learning remains less explored. Recent works have developed\ngeneralization error bounds for contrastive losses, but the resulting risk\ncertificates are either vacuous (certificates based on Rademacher complexity or\n$f$-divergence) or require strong assumptions about samples that are\nunreasonable in practice. The present paper develops non-vacuous PAC-Bayesian\nrisk certificates for contrastive representation learning, considering the\npractical considerations of the popular SimCLR framework. Notably, we take into\naccount that SimCLR reuses positive pairs of augmented data as negative samples\nfor other data, thereby inducing strong dependence and making classical PAC or\nPAC-Bayesian bounds inapplicable. We further refine existing bounds on the\ndownstream classification loss by incorporating SimCLR-specific factors,\nincluding data augmentation and temperature scaling, and derive risk\ncertificates for the contrastive zero-one risk. 
The resulting bounds for\ncontrastive loss and downstream prediction are much tighter than those of\nprevious risk certificates, as demonstrated by experiments on CIFAR-10.\n","authors":["Anna Van Elst","Debarghya Ghoshdastidar"],"pdf_url":"https://arxiv.org/pdf/2412.03486v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.14438v3","updated":"2024-12-05T09:23:13Z","published":"2024-05-23T11:10:32Z","title":"LoRA-Ensemble: Efficient Uncertainty Modelling for Self-attention\n Networks","summary":" Numerous crucial tasks in real-world decision-making rely on machine learning\nalgorithms with calibrated uncertainty estimates. However, modern methods often\nyield overconfident and uncalibrated predictions. Various approaches involve\ntraining an ensemble of separate models to quantify the uncertainty related to\nthe model itself, known as epistemic uncertainty. In an explicit\nimplementation, the ensemble approach has high computational cost and high\nmemory requirements. This particular challenge is evident in state-of-the-art\nneural networks such as transformers, where even a single network is already\ndemanding in terms of compute and memory. Consequently, efforts are made to\nemulate the ensemble model without actually instantiating separate ensemble\nmembers, referred to as implicit ensembling. We introduce LoRA-Ensemble, a\nparameter-efficient deep ensemble method for self-attention networks, which is\nbased on Low-Rank Adaptation (LoRA). Initially developed for efficient LLM\nfine-tuning, we extend LoRA to an implicit ensembling approach. By employing a\nsingle pre-trained self-attention network with weights shared across all\nmembers, we train member-specific low-rank matrices for the attention\nprojections. Our method exhibits superior calibration compared to explicit\nensembles and achieves similar or better accuracy across various prediction\ntasks and datasets.\n","authors":["Michelle Halbheer","Dominik J. 
Mühlematter","Alexander Becker","Dominik Narnhofer","Helge Aasen","Konrad Schindler","Mehmet Ozgur Turkoglu"],"pdf_url":"https://arxiv.org/pdf/2405.14438v3.pdf","comment":"under review"},{"id":"http://arxiv.org/abs/2412.03995v1","updated":"2024-12-05T09:15:21Z","published":"2024-12-05T09:15:21Z","title":"Blind Underwater Image Restoration using Co-Operational Regressor\n Networks","summary":" The exploration of underwater environments is essential for applications such\nas biological research, archaeology, and infrastructure maintenance. However,\nunderwater imaging is challenging due to the water's unique properties,\nincluding scattering, absorption, color distortion, and reduced visibility. To\naddress such visual degradations, a variety of approaches have been proposed,\nranging from basic signal processing methods to deep learning models; however,\nnone of them has proven to be consistently successful. In this paper, we\npropose a novel machine learning model, Co-Operational Regressor Networks\n(CoRe-Nets), designed to achieve the best possible underwater image\nrestoration. A CoRe-Net consists of two co-operating networks: the Apprentice\nRegressor (AR), responsible for image transformation, and the Master Regressor\n(MR), which evaluates the Peak Signal-to-Noise Ratio (PSNR) of the images\ngenerated by the AR and feeds it back to the AR. CoRe-Nets are built on\nSelf-Organized Operational Neural Networks (Self-ONNs), which offer a superior\nlearning capability by modulating nonlinearity in kernel transformations. The\neffectiveness of the proposed model is demonstrated on the benchmark Large\nScale Underwater Image (LSUI) dataset. Leveraging the joint learning\ncapabilities of the two cooperating networks, the proposed model achieves\nstate-of-the-art restoration performance with significantly reduced computational\ncomplexity and often produces results that can even surpass the visual\nquality of the ground truth with a 2-pass application. 
Our results and the\noptimized PyTorch implementation of the proposed approach are now publicly\nshared on GitHub.\n","authors":["Ozer Can Devecioglu","Serkan Kiranyaz","Turker Ince","Moncef Gabbouj"],"pdf_url":"https://arxiv.org/pdf/2412.03995v1.pdf","comment":"11 pages"},{"id":"http://arxiv.org/abs/2412.03993v1","updated":"2024-12-05T09:14:50Z","published":"2024-12-05T09:14:50Z","title":"LaserGuider: A Laser Based Physical Backdoor Attack against Deep Neural\n Networks","summary":" Backdoor attacks embed hidden associations between triggers and targets in\ndeep neural networks (DNNs), causing them to predict the target when a trigger\nis present while maintaining normal behavior otherwise. Physical backdoor\nattacks, which use physical objects as triggers, are feasible but lack remote\ncontrol, temporal stealthiness, flexibility, and mobility. To overcome these\nlimitations, in this work, we propose a new type of backdoor triggers utilizing\nlasers that feature long-distance transmission and instant-imaging properties.\nBased on the laser-based backdoor triggers, we present a physical backdoor\nattack, called LaserGuider, which possesses remote control ability and achieves\nhigh temporal stealthiness, flexibility, and mobility. We also introduce a\nsystematic approach to optimize laser parameters for improving attack\neffectiveness. Our evaluation on traffic sign recognition DNNs, critical in\nautonomous vehicles, demonstrates that LaserGuider with three different\nlaser-based triggers achieves over 90% attack success rate with negligible\nimpact on normal inputs. 
Additionally, we release LaserMark, the first dataset\nof real-world traffic signs stamped with physical laser spots, to support\nfurther research in backdoor attacks and defenses.\n","authors":["Yongjie Xu","Guangke Chen","Fu Song","Yuqi Chen"],"pdf_url":"https://arxiv.org/pdf/2412.03993v1.pdf","comment":"In Proceedings of the 23rd International Conference on Applied\n Cryptography and Network Security (ACNS), Munich, Germany, 23-26 June, 2025"},{"id":"http://arxiv.org/abs/2412.03992v1","updated":"2024-12-05T09:12:25Z","published":"2024-12-05T09:12:25Z","title":"How well behaved is finite dimensional Diffusion Maps?","summary":" Under a set of assumptions on a family of submanifolds $\\subset {\\mathbb\nR}^D$, we derive a series of geometric properties that remain valid after\nfinite-dimensional and almost isometric Diffusion Maps (DM), including almost\nuniform density, finite polynomial approximation and local reach. Leveraging\nthese properties, we establish rigorous bounds on the embedding errors\nintroduced by the DM algorithm, which are $O\\left((\\frac{\\log\nn}{n})^{\\frac{1}{8d+16}}\\right)$. These results offer a solid theoretical\nfoundation for understanding the performance and reliability of DM in practical\napplications.\n","authors":["Wenyu Bo","Marina Meilă"],"pdf_url":"https://arxiv.org/pdf/2412.03992v1.pdf","comment":"20 pages, 3 figures"},{"id":"http://arxiv.org/abs/2403.18569v2","updated":"2024-12-05T09:02:11Z","published":"2024-03-27T13:50:13Z","title":"PDNNet: PDN-Aware GNN-CNN Heterogeneous Network for Dynamic IR Drop\n Prediction","summary":" IR drop on the power delivery network (PDN) is closely related to PDN's\nconfiguration and cell current consumption. As the integrated circuit (IC)\ndesign is growing larger, dynamic IR drop simulation becomes computationally\nunaffordable and machine learning based IR drop prediction has been explored as\na promising solution. 
Although CNN-based methods have been adapted to IR drop\nprediction task in several works, the shortcomings of overlooking PDN\nconfiguration is non-negligible. In this paper, we consider not only how to\nproperly represent cell-PDN relation, but also how to model IR drop following\nits physical nature in the feature aggregation procedure. Thus, we propose a\nnovel graph structure, PDNGraph, to unify the representations of the PDN\nstructure and the fine-grained cell-PDN relation. We further propose a\ndual-branch heterogeneous network, PDNNet, incorporating two parallel GNN-CNN\nbranches to favorably capture the above features during the learning process.\nSeveral key designs are presented to make the dynamic IR drop prediction highly\neffective and interpretable. We are the first work to apply graph structure to\ndeep-learning based dynamic IR drop prediction method. Experiments show that\nPDNNet outperforms the state-of-the-art CNN-based methods and achieves 545x\nspeedup compared to the commercial tool, which demonstrates the superiority of\nour method.\n","authors":["Yuxiang Zhao","Zhuomin Chai","Xun Jiang","Yibo Lin","Runsheng Wang","Ru Huang"],"pdf_url":"https://arxiv.org/pdf/2403.18569v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03983v1","updated":"2024-12-05T08:58:41Z","published":"2024-12-05T08:58:41Z","title":"Safe and Efficient Online Convex Optimization with Linear Budget\n Constraints and Partial Feedback","summary":" This paper studies online convex optimization with unknown linear budget\nconstraints, where only the gradient information of the objective and the\nbandit feedback of constraint functions are observed. We propose a safe and\nefficient Lyapunov-optimization algorithm (SELO) that can achieve an\n$O(\\sqrt{T})$ regret and zero cumulative constraint violation. The result also\nimplies SELO achieves $O(\\sqrt{T})$ regret when the budget is hard and not\nallowed to be violated. 
The proposed algorithm is computationally efficient as\nit resembles a primal-dual algorithm where the primal problem is an\nunconstrained, strongly convex and smooth problem, and the dual problem has a\nsimple gradient-type update. The algorithm and theory are further justified in\na simulated application of energy-efficient task processing in distributed data\ncenters.\n","authors":["Shanqi Liu","Xin Liu"],"pdf_url":"https://arxiv.org/pdf/2412.03983v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03982v1","updated":"2024-12-05T08:58:25Z","published":"2024-12-05T08:58:25Z","title":"Exploring Fully Convolutional Networks for the Segmentation of\n Hyperspectral Imaging Applied to Advanced Driver Assistance Systems","summary":" Advanced Driver Assistance Systems (ADAS) are designed with the main purpose\nof increasing the safety and comfort of vehicle occupants. Most of current\ncomputer vision-based ADAS perform detection and tracking tasks quite\nsuccessfully under regular conditions, but are not completely reliable,\nparticularly under adverse weather and changing lighting conditions, neither in\ncomplex situations with many overlapping objects. In this work we explore the\nuse of hyperspectral imaging (HSI) in ADAS on the assumption that the distinct\nnear infrared (NIR) spectral reflectances of different materials can help to\nbetter separate the objects in a driving scene. In particular, this paper\ndescribes some experimental results of the application of fully convolutional\nnetworks (FCN) to the image segmentation of HSI for ADAS applications. More\nspecifically, our aim is to investigate to what extent the spatial features\ncodified by convolutional filters can be helpful to improve the performance of\nHSI segmentation systems. With that aim, we use the HSI-Drive v1.1 dataset,\nwhich provides a set of labelled images recorded in real driving conditions\nwith a small-size snapshot NIR-HSI camera. 
Finally, we analyze the\nimplementability of such a HSI segmentation system by prototyping the developed\nFCN model together with the necessary hyperspectral cube preprocessing stage\nand characterizing its performance on an MPSoC.\n","authors":["Jon Gutiérrez-Zaballa","Koldo Basterretxea","Javier Echanobe","M. Victoria Martínez","Inés del Campo"],"pdf_url":"https://arxiv.org/pdf/2412.03982v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2411.19274"},{"id":"http://arxiv.org/abs/2411.18220v2","updated":"2024-12-05T08:57:30Z","published":"2024-11-27T10:57:06Z","title":"R-MTLLMF: Resilient Multi-Task Large Language Model Fusion at the\n Wireless Edge","summary":" Multi-task large language models (MTLLMs) are important for many applications\nat the wireless edge, where users demand specialized models to handle multiple\ntasks efficiently. However, training MTLLMs is complex and exhaustive,\nparticularly when tasks are subject to change. Recently, the concept of model\nfusion via task vectors has emerged as an efficient approach for combining\nfine-tuning parameters to produce an MTLLM. In this paper, the problem of\nenabling edge users to collaboratively craft such MTLLMs via task vectors is\nstudied, under the assumption of worst-case adversarial attacks. To this end,\nfirst the influence of adversarial noise on multi-task model fusion is\ninvestigated and a relationship between the so-called weight disentanglement\nerror and the mean squared error (MSE) is derived. Using hypothesis testing, it\nis directly shown that the MSE increases interference between task vectors,\nthereby rendering model fusion ineffective. Then, a novel resilient MTLLM\nfusion (R-MTLLMF) is proposed, which leverages insights about the LLM\narchitecture and fine-tuning process to safeguard task vector aggregation under\nadversarial noise by realigning the MTLLM. 
The proposed R-MTLLMF is then\ncompared for both worst-case and ideal transmission scenarios to study the\nimpact of the wireless channel. Extensive model fusion experiments with vision\nLLMs demonstrate R-MTLLMF's effectiveness, achieving close-to-baseline\nperformance across eight different tasks in ideal noise scenarios and\nsignificantly outperforming unprotected model fusion in worst-case scenarios.\nThe results further advocate for additional physical layer protection for a\nholistic approach to resilience, from both a wireless and LLM perspective.\n","authors":["Aladin Djuhera","Vlad C. Andrei","Mohsen Pourghasemian","Haris Gacanin","Holger Boche","Walid Saad"],"pdf_url":"https://arxiv.org/pdf/2411.18220v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03214v2","updated":"2024-12-05T08:49:02Z","published":"2024-12-04T11:05:01Z","title":"Continual Low-Rank Scaled Dot-product Attention","summary":" Transformers are widely used for their ability to capture data relations in\nsequence processing, with great success for a wide range of static tasks.\nHowever, the computational and memory footprint of their main component, i.e.,\nthe Scaled Dot-product Attention, is commonly overlooked. This makes their\nadoption in applications involving stream data processing with constraints in\nresponse latency, computational and memory resources infeasible. Some works\nhave proposed methods to lower the computational cost of transformers, i.e.\nlow-rank approximations, sparsity in attention, and efficient formulations for\nContinual Inference. In this paper, we introduce a new formulation of the\nScaled Dot-product Attention based on the Nystr\\\"om approximation that is\nsuitable for Continual Inference. 
In experiments on Online Audio Classification\nand Online Action Detection tasks, the proposed Continual Scaled Dot-product\nAttention can lower the number of operations by up to three orders of magnitude\ncompared to the original Transformers while retaining the predictive\nperformance of competing models.\n","authors":["Ginés Carreto Picón","Illia Oleksiienko","Lukas Hedegaard","Arian Bakhtiarnia","Alexandros Iosifidis"],"pdf_url":"https://arxiv.org/pdf/2412.03214v2.pdf","comment":"11 pages, 7 figures"},{"id":"http://arxiv.org/abs/2410.01262v2","updated":"2024-12-05T08:44:53Z","published":"2024-10-02T06:16:06Z","title":"Improving Fine-Grained Control via Aggregation of Multiple Diffusion\n Models","summary":" While many diffusion models perform well when controlling for particular\naspect among style, character, and interaction, they struggle with fine-grained\ncontrol due to dataset limitations and intricate model architecture design.\nThis paper introduces a novel algorithm, Aggregation of Multiple Diffusion\nModels (AMDM), which synthesizes features from multiple diffusion models into a\nspecified model, activating specific features for fine-grained control.\nExperimental results demonstrate that AMDM significantly improves fine-grained\ncontrol without training, proving its effectiveness. Additionally, it reveals\nthat diffusion models initially focus on features such as position, attributes,\nand style, with later stages improving generation quality and consistency. AMDM\noffers a new perspective for tackling the challenges of fine-grained\nconditional control generation in diffusion models: We can fully utilize\nexisting or develop new conditional diffusion models that control specific\naspects, and then aggregate them using AMDM algorithm. This eliminates the need\nfor constructing complex datasets, designing intricate model architectures, and\nincurring high training costs. 
Code is available at:\nhttps://github.com/Hammour-steak/AMDM.\n","authors":["Conghan Yue","Zhengwei Peng","Shiyan Du","Zhi Ji","Chuangjian Cai","Le Wan","Dongyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2410.01262v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03962v1","updated":"2024-12-05T08:26:13Z","published":"2024-12-05T08:26:13Z","title":"Local Curvature Smoothing with Stein's Identity for Efficient Score\n Matching","summary":" The training of score-based diffusion models (SDMs) is based on score\nmatching. The challenge of score matching is that it includes a computationally\nexpensive Jacobian trace. While several methods have been proposed to avoid\nthis computation, each has drawbacks, such as instability during training and\napproximating the learning as learning a denoising vector field rather than a\ntrue score. We propose a novel score matching variant, local curvature\nsmoothing with Stein's identity (LCSS). The LCSS bypasses the Jacobian trace by\napplying Stein's identity, enabling regularization effectiveness and efficient\ncomputation. We show that LCSS surpasses existing methods in sample generation\nperformance and matches the performance of denoising score matching, widely\nadopted by most SDMs, in evaluations such as FID, Inception score, and bits per\ndimension. Furthermore, we show that LCSS enables realistic image generation\neven at a high resolution of $1024 \\times 1024$.\n","authors":["Genki Osada","Makoto Shing","Takashi Nishide"],"pdf_url":"https://arxiv.org/pdf/2412.03962v1.pdf","comment":"Accepted at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.03961v1","updated":"2024-12-05T08:26:07Z","published":"2024-12-05T08:26:07Z","title":"Electronic Health Records-Based Data-Driven Diabetes Knowledge Unveiling\n and Risk Prognosis","summary":" In the healthcare sector, the application of deep learning technologies has\nrevolutionized data analysis and disease forecasting. 
This is particularly\nevident in the field of diabetes, where the deep analysis of Electronic Health\nRecords (EHR) has unlocked new opportunities for early detection and effective\nintervention strategies. Our research presents an innovative model that\nsynergizes the capabilities of Bidirectional Long Short-Term Memory\nNetworks-Conditional Random Field (BiLSTM-CRF) with a fusion of XGBoost and\nLogistic Regression. This model is designed to enhance the accuracy of diabetes\nrisk prediction by conducting an in-depth analysis of electronic medical\nrecords data. The first phase of our approach involves employing BiLSTM-CRF to\ndelve into the temporal characteristics and latent patterns present in EHR\ndata. This method effectively uncovers the progression trends of diabetes,\nwhich are often hidden in the complex data structures of medical records. The\nsecond phase leverages the combined strength of XGBoost and Logistic Regression\nto classify these extracted features and evaluate associated risks. This dual\napproach facilitates a more nuanced and precise prediction of diabetes,\noutperforming traditional models, particularly in handling multifaceted and\nnonlinear medical datasets. Our research demonstrates a notable advancement in\ndiabetes prediction over traditional methods, showcasing the effectiveness of\nour combined BiLSTM-CRF, XGBoost, and Logistic Regression model. This study\nhighlights the value of data-driven strategies in clinical decision-making,\nequipping healthcare professionals with precise tools for early detection and\nintervention. 
By enabling personalized treatment and timely care, our approach\nsignifies progress in incorporating advanced analytics in healthcare,\npotentially improving outcomes for diabetes and other chronic conditions.\n","authors":["Huadong Pang","Li Zhou","Yiping Dong","Peiyuan Chen","Dian Gu","Tianyi Lyu","Hansong Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.03961v1.pdf","comment":"16 pages"},{"id":"http://arxiv.org/abs/2107.14432v6","updated":"2024-12-05T08:11:50Z","published":"2021-07-30T05:33:43Z","title":"Adaptive Optimizers with Sparse Group Lasso for Neural Networks in CTR\n Prediction","summary":" We develop a novel framework that adds the regularizers of the sparse group\nlasso to a family of adaptive optimizers in deep learning, such as Momentum,\nAdagrad, Adam, AMSGrad, AdaHessian, and create a new class of optimizers, which\nare named Group Momentum, Group Adagrad, Group Adam, Group AMSGrad and Group\nAdaHessian, etc., accordingly. We establish theoretically proven convergence\nguarantees in the stochastic convex settings, based on primal-dual methods. We\nevaluate the regularized effect of our new optimizers on three large-scale\nreal-world ad click datasets with state-of-the-art deep learning models. The\nexperimental results reveal that compared with the original optimizers with the\npost-processing procedure which uses the magnitude pruning method, the\nperformance of the models can be significantly improved on the same sparsity\nlevel. Furthermore, in comparison to the cases without magnitude pruning, our\nmethods can achieve extremely high sparsity with significantly better or highly\ncompetitive performance. The code is available at\nhttps://github.com/intelligent-machine-learning/tfplus/tree/main/tfplus.\n","authors":["Yun Yue","Yongchao Liu","Suo Tong","Minghao Li","Zhen Zhang","Chunyang Wen","Huanjun Bao","Lihong Gu","Jinjie Gu","Yixiang Mu"],"pdf_url":"https://arxiv.org/pdf/2107.14432v6.pdf","comment":"24 pages. 
Published as a conference paper at ECML PKDD 2021. This\n version includes Appendix which was not included in the published version\n because of page limit"},{"id":"http://arxiv.org/abs/2410.13874v3","updated":"2024-12-05T08:10:55Z","published":"2024-10-02T13:02:17Z","title":"COOL: Efficient and Reliable Chain-Oriented Objective Logic with Neural\n Networks Feedback Control for Program Synthesis","summary":" Program synthesis methods, whether formal or neural-based, lack fine-grained\ncontrol and flexible modularity, which limits their adaptation to complex\nsoftware development. These limitations stem from rigid Domain-Specific\nLanguage (DSL) frameworks and neural network incorrect predictions. To this\nend, we propose the Chain of Logic (CoL), which organizes the synthesis process\ninto an activity flow and provides heuristic control to guide the process.\nFurthermore, by integrating neural networks with libraries and introducing a\nNeural Network Feedback Control (NNFC) mechanism, our approach modularizes\nsynthesis and mitigates the impact of neural network mispredictions.\nExperiments on relational and symbolic synthesis tasks show that CoL\nsignificantly enhances the efficiency and reliability of DSL program synthesis\nacross multiple metrics. Specifically, CoL improves accuracy by 70% while\nreducing tree operations by 91% and time by 95%. Additionally, NNFC further\nboosts accuracy by 6%, with a 64% reduction in tree operations under\nchallenging conditions such as insufficient training data, increased\ndifficulty, and multidomain synthesis. 
These improvements confirm COOL as a\nhighly efficient and reliable program synthesis framework.\n","authors":["Jipeng Han"],"pdf_url":"https://arxiv.org/pdf/2410.13874v3.pdf","comment":"31 pages, 10 figures"},{"id":"http://arxiv.org/abs/2412.03950v1","updated":"2024-12-05T07:58:32Z","published":"2024-12-05T07:58:32Z","title":"BEFL: Balancing Energy Consumption in Federated Learning for Mobile Edge\n IoT","summary":" Federated Learning (FL) is a privacy-preserving distributed learning paradigm\ndesigned to build a highly accurate global model. In Mobile Edge IoT (MEIoT),\nthe training and communication processes can significantly deplete the limited\nbattery resources of devices. Existing research primarily focuses on reducing\noverall energy consumption, but this may inadvertently create energy\nconsumption imbalances, leading to the premature dropout of energy-sensitive\ndevices. To address these challenges, we propose BEFL, a joint optimization\nframework aimed at balancing three objectives: enhancing global model accuracy,\nminimizing total energy consumption, and reducing energy usage disparities\namong devices. First, taking into account the communication constraints of\nMEIoT and the heterogeneity of devices, we employ the Sequential Least\nSquares Programming (SLSQP) algorithm for the rational allocation of\ncommunication resources. Based on this, we introduce a heuristic client\nselection algorithm that combines cluster partitioning with utility-driven\napproaches to alleviate both the total energy consumption of all devices and\nthe discrepancies in energy usage. Furthermore, we utilize the proposed\nheuristic client selection algorithm as a template for offline imitation\nlearning during pre-training, while adopting a ranking-based reinforcement\nlearning approach online to further boost training efficiency. 
Our experiments\nreveal that BEFL improves global model accuracy by 1.6%, reduces energy\nconsumption variance by 72.7%, and lowers total energy consumption by 28.2%\ncompared to existing methods. The relevant code can be found at\nhttps://github.com/juzehao/BEFL.\n","authors":["Zehao Ju","Tongquan Wei","Fuke Shen"],"pdf_url":"https://arxiv.org/pdf/2412.03950v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03949v1","updated":"2024-12-05T07:55:58Z","published":"2024-12-05T07:55:58Z","title":"Learning Speed-Adaptive Walking Agent Using Imitation Learning with\n Physics-Informed Simulation","summary":" Virtual models of human gait, or digital twins, offer a promising solution\nfor studying mobility without the need for labor-intensive data collection.\nHowever, challenges such as the sim-to-real gap and limited adaptability to\ndiverse walking conditions persist. To address these, we developed and\nvalidated a framework to create a skeletal humanoid agent capable of adapting\nto varying walking speeds while maintaining biomechanically realistic motions.\nThe framework combines a synthetic data generator, which produces\nbiomechanically plausible gait kinematics from open-source biomechanics data,\nand a training system that uses adversarial imitation learning to train the\nagent's walking policy. We conducted comprehensive analyses comparing the\nagent's kinematics, synthetic data, and the original biomechanics dataset. The\nagent achieved a root mean square error of 5.24 +- 0.09 degrees at varying\nspeeds compared to ground-truth kinematics data, demonstrating its\nadaptability. 
This work represents a significant step toward developing a\ndigital twin of human locomotion, with potential applications in biomechanics\nresearch, exoskeleton design, and rehabilitation.\n","authors":["Yi-Hung Chiu","Ung Hee Lee","Changseob Song","Manaen Hu","Inseung Kang"],"pdf_url":"https://arxiv.org/pdf/2412.03949v1.pdf","comment":"Currently under review"},{"id":"http://arxiv.org/abs/2305.19770v2","updated":"2024-12-05T07:46:11Z","published":"2023-05-31T12:03:12Z","title":"Quality In / Quality Out: Data quality more relevant than model choice\n in anomaly detection with the UGR'16","summary":" Autonomous or self-driving networks are expected to provide a solution to the\nmyriad of extremely demanding new applications with minimal human supervision.\nFor this purpose, the community relies on the development of new Machine\nLearning (ML) models and techniques.\nHowever, ML can only be as good as the data it is fitted with, and data quality\nis an elusive concept difficult to assess. In this paper, we show that\nrelatively minor modifications on a benchmark dataset (UGR'16, a flow-based\nreal-traffic dataset for anomaly detection) cause significantly more impact on\nmodel performance than the specific ML technique considered. We also show that\nthe measured model performance is uncertain, as a result of labelling\ninaccuracies. Our findings illustrate that the widely adopted approach of\ncomparing a set of models in terms of performance results (e.g., in terms of\naccuracy or ROC curves) may lead to incorrect conclusions when done without a\nproper understanding of dataset biases and sensitivity. 
We contribute a\nmethodology to interpret a model response that can be useful for this\nunderstanding.\n","authors":["José Camacho","Katarzyna Wasielewska","Pablo Espinosa","Marta Fuentes-García"],"pdf_url":"https://arxiv.org/pdf/2305.19770v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.15560v3","updated":"2024-12-05T07:43:25Z","published":"2023-05-24T23:47:26Z","title":"Differentially Private Synthetic Data via Foundation Model APIs 1:\n Images","summary":" Generating differentially private (DP) synthetic data that closely resembles\nthe original private data is a scalable way to mitigate privacy concerns in the\ncurrent data-driven world. In contrast to current practices that train\ncustomized models for this task, we aim to generate DP Synthetic Data via APIs\n(DPSDA), where we treat foundation models as blackboxes and only utilize their\ninference APIs. Such API-based, training-free approaches are easier to deploy\nas exemplified by the recent surge in the number of API-based apps. These\napproaches can also leverage the power of large foundation models which are\nonly accessible via their inference APIs. However, this comes with greater\nchallenges due to strictly more restrictive model access and the need to\nprotect privacy from the API provider.\n In this paper, we present a new framework called Private Evolution (PE) to\nsolve this problem and show its initial promise on synthetic images.\nSurprisingly, PE can match or even outperform state-of-the-art (SOTA) methods\nwithout any model training. For example, on CIFAR10 (with ImageNet as the\npublic data), we achieve FID <= 7.9 with privacy cost {\\epsilon} = 0.67,\nsignificantly improving the previous SOTA from {\\epsilon} = 32. We further\ndemonstrate the promise of applying PE on large foundation models such as\nStable Diffusion to tackle challenging private datasets with a small number of\nhigh-resolution images. 
The code and data are released at\nhttps://github.com/microsoft/DPSDA.\n","authors":["Zinan Lin","Sivakanth Gopi","Janardhan Kulkarni","Harsha Nori","Sergey Yekhanin"],"pdf_url":"https://arxiv.org/pdf/2305.15560v3.pdf","comment":"Published in ICLR 2024"},{"id":"http://arxiv.org/abs/2410.06940v2","updated":"2024-12-05T07:39:22Z","published":"2024-10-09T14:34:53Z","title":"Representation Alignment for Generation: Training Diffusion Transformers\n Is Easier Than You Think","summary":" Recent studies have shown that the denoising process in (generative)\ndiffusion models can induce meaningful (discriminative) representations inside\nthe model, though the quality of these representations still lags behind those\nlearned through recent self-supervised learning methods. We argue that one main\nbottleneck in training large-scale diffusion models for generation lies in\neffectively learning these representations. Moreover, training can be made\neasier by incorporating high-quality external visual representations, rather\nthan relying solely on the diffusion models to learn them independently. We\nstudy this by introducing a straightforward regularization called\nREPresentation Alignment (REPA), which aligns the projections of noisy input\nhidden states in denoising networks with clean image representations obtained\nfrom external, pretrained visual encoders. The results are striking: our simple\nstrategy yields significant improvements in both training efficiency and\ngeneration quality when applied to popular diffusion and flow-based\ntransformers, such as DiTs and SiTs. For instance, our method can speed up SiT\ntraining by over 17.5$\\times$, matching the performance (without\nclassifier-free guidance) of a SiT-XL model trained for 7M steps in less than\n400K steps. 
In terms of final generation quality, our approach achieves\nstate-of-the-art results of FID=1.42 using classifier-free guidance with the\nguidance interval.\n","authors":["Sihyun Yu","Sangkyung Kwak","Huiwon Jang","Jongheon Jeong","Jonathan Huang","Jinwoo Shin","Saining Xie"],"pdf_url":"https://arxiv.org/pdf/2410.06940v2.pdf","comment":"Preprint. Project page: https://sihyun.me/REPA"},{"id":"http://arxiv.org/abs/2412.03938v1","updated":"2024-12-05T07:35:56Z","published":"2024-12-05T07:35:56Z","title":"JANUS: A Difference-Oriented Analyzer For Financial Centralization Risks\n in Smart Contracts","summary":" Some smart contracts violate decentralization principles by defining\nprivileged accounts that manage other users' assets without permission,\nintroducing centralization risks that have caused financial losses. Existing\nmethods, however, face challenges in accurately detecting diverse\ncentralization risks due to their dependence on predefined behavior patterns.\nIn this paper, we propose JANUS, an automated analyzer for Solidity smart\ncontracts that detects financial centralization risks independently of their\nspecific behaviors. JANUS identifies differences between states reached by\nprivileged and ordinary accounts, and analyzes whether these differences are\nfinance-related. Focusing on the impact of risks rather than behaviors, JANUS\nachieves improved accuracy compared to existing tools and can uncover\ncentralization risks with unknown patterns.\n To evaluate JANUS's performance, we compare it with other tools using a\ndataset of 540 contracts. Our evaluation demonstrates that JANUS outperforms\nrepresentative tools in terms of detection accuracy for financial\ncentralization risks. Additionally, we evaluate JANUS on a real-world dataset\nof 33,151 contracts, successfully identifying two types of risks that other\ntools fail to detect. 
We also prove that the state traversal method and\nvariable summaries, which are used in JANUS to reduce the number of states to\nbe compared, do not introduce false alarms or omissions in detection.\n","authors":["Wansen Wang","Pu Zhang","Renjie Ji","Wenchao Huang","Zhaoyi Meng","Yan Xiong"],"pdf_url":"https://arxiv.org/pdf/2412.03938v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03936v1","updated":"2024-12-05T07:34:04Z","published":"2024-12-05T07:34:04Z","title":"Deep Learning Modeling Method for RF Devices Based on Uniform Noise\n Training Set","summary":" As the scale and complexity of integrated circuits continue to increase,\ntraditional modeling methods are struggling to address the nonlinear challenges\nin radio frequency (RF) chips. Deep learning has been increasingly applied to\nRF device modeling. This paper proposes a deep learning-based modeling method\nfor RF devices using a uniform noise training set, aimed at modeling and\nfitting the nonlinear characteristics of RF devices. We hypothesize that a\nuniform noise signal can encompass the full range of characteristics across\nboth frequency and amplitude, and that a deep learning model can effectively\ncapture and learn these features. Based on this hypothesis, the paper designs a\ncomplete integrated circuit modeling process based on measured data, including\ndata collection, processing, and neural network training. The proposed method\nis experimentally validated using the RF amplifier PW210 as a case study.\nExperimental results show that the uniform noise training set allows the model\nto capture the nonlinear characteristics of RF devices, and the trained model\ncan predict waveform patterns it has never encountered before. 
The proposed\ndeep learning-based RF device modeling method, using a uniform noise training\nset, demonstrates strong generalization capability and excellent training\nperformance, offering high practical application value.\n","authors":["Zhaokun Hu","Yindong Xiao","Houjun Wang","Jiayong Yu","Zihang Gao"],"pdf_url":"https://arxiv.org/pdf/2412.03936v1.pdf","comment":"9 pages,11 figures"},{"id":"http://arxiv.org/abs/2305.15817v3","updated":"2024-12-05T07:31:10Z","published":"2023-05-25T08:00:34Z","title":"Sharpness-Aware Minimization Revisited: Weighted Sharpness as a\n Regularization Term","summary":" Deep Neural Networks (DNNs) generalization is known to be closely related to\nthe flatness of minima, leading to the development of Sharpness-Aware\nMinimization (SAM) for seeking flatter minima and better generalization. In\nthis paper, we revisit the loss of SAM and propose a more general method,\ncalled WSAM, by incorporating sharpness as a regularization term. We prove its\ngeneralization bound through the combination of PAC and Bayes-PAC techniques,\nand evaluate its performance on various public datasets. The results\ndemonstrate that WSAM achieves improved generalization, or is at least highly\ncompetitive, compared to the vanilla optimizer, SAM and its variants. The code\nis available at\nhttps://github.com/intelligent-machine-learning/atorch/tree/main/atorch/optimizers.\n","authors":["Yun Yue","Jiadi Jiang","Zhiling Ye","Ning Gao","Yongchao Liu","Ke Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.15817v3.pdf","comment":"10 pages. Accepted as a conference paper at KDD '23"},{"id":"http://arxiv.org/abs/2410.12672v3","updated":"2024-12-05T07:27:31Z","published":"2024-10-16T15:36:13Z","title":"Context Matters: Leveraging Contextual Features for Time Series\n Forecasting","summary":" Time series forecasts are often influenced by exogenous contextual features\nin addition to their corresponding history. 
For example, in financial settings,\nit is hard to accurately predict a stock price without considering public\nsentiments and policy decisions in the form of news articles, tweets, etc.\nThough this is common knowledge, the current state-of-the-art (SOTA)\nforecasting models fail to incorporate such contextual information, owing to\nits heterogeneity and multimodal nature. To address this, we introduce\nContextFormer, a novel plug-and-play method to surgically integrate multimodal\ncontextual information into existing pre-trained forecasting models.\nContextFormer effectively distills forecast-specific information from rich\nmultimodal contexts, including categorical, continuous, time-varying, and even\ntextual information, to significantly enhance the performance of existing base\nforecasters. ContextFormer outperforms SOTA forecasting models by up to 30% on\na range of real-world datasets spanning energy, traffic, environmental, and\nfinancial domains.\n","authors":["Sameep Chattopadhyay","Pulkit Paliwal","Sai Shankar Narasimhan","Shubhankar Agarwal","Sandeep P. Chinchali"],"pdf_url":"https://arxiv.org/pdf/2410.12672v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03933v1","updated":"2024-12-05T07:23:14Z","published":"2024-12-05T07:23:14Z","title":"Exploring AI Text Generation, Retrieval-Augmented Generation, and\n Detection Technologies: a Comprehensive Overview","summary":" The rapid development of Artificial Intelligence (AI) has led to the creation\nof powerful text generation models, such as large language models (LLMs), which\nare widely used for diverse applications. However, concerns surrounding\nAI-generated content, including issues of originality, bias, misinformation,\nand accountability, have become increasingly prominent. This paper offers a\ncomprehensive overview of AI text generators (AITGs), focusing on their\nevolution, capabilities, and ethical implications. 
This paper also introduces\nRetrieval-Augmented Generation (RAG), a recent approach that improves the\ncontextual relevance and accuracy of text generation by integrating dynamic\ninformation retrieval. RAG addresses key limitations of traditional models,\nincluding their reliance on static knowledge and potential inaccuracies in\nhandling real-world data. Additionally, the paper reviews detection tools that\nhelp differentiate AI-generated text from human-written content and discusses\nthe ethical challenges these technologies pose. The paper explores future\ndirections for improving detection accuracy, supporting ethical AI development,\nand increasing accessibility. The paper contributes to a more responsible and\nreliable use of AI in content creation through these discussions.\n","authors":["Fnu Neha","Deepshikha Bhati","Deepak Kumar Shukla","Angela Guercio","Ben Ward"],"pdf_url":"https://arxiv.org/pdf/2412.03933v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.16320v3","updated":"2024-12-05T07:14:52Z","published":"2024-09-21T03:45:05Z","title":"Developing a Thailand solar irradiance map using Himawari-8 satellite\n imageries and deep learning models","summary":" This paper presents an online platform showing Thailand solar irradiance map\nevery 30 minutes, available at https://www.cusolarforecast.com. The methodology\nfor estimating global horizontal irradiance (GHI) across Thailand relies on\ncloud index extracted from Himawari-8 satellite imagery, Ineichen clear-sky\nmodel with locally-tuned Linke turbidity, and machine learning models. The\nmethods take clear-sky irradiance, cloud index, re-analyzed GHI and temperature\ndata from the MERRA-2 database, and date-time as inputs for GHI estimation\nmodels, including LightGBM, LSTM, Informer, and Transformer. These are\nbenchmarked with the estimate from a commercial service X by evaluation of\n15-minute ground GHI data from 53 ground stations over 1.5 years during\n2022-2023. 
The results show that the four models exhibit comparable overall MAE\nperformance to the service X. The best model is LightGBM with an overall MAE of\n78.58 W/sqm and RMSE of 118.97 W/sqm, while the service X achieves the lowest\nMAE, RMSE, and MBE in cloudy condition. Obtaining re-analyzed MERRA-2 data for\nthe whole Thailand region is not economically feasible for deployment. When\nremoving these features, the Informer model has a winning performance in MAE of\n78.67 W/sqm. The obtained performance aligns with existing literature by taking\nthe climate zone and time granularity of data into consideration. As the map\nshows an estimate of GHI over 93,000 grids with a frequent update, the paper\nalso describes a computational framework for displaying the entire map. It\ntests the runtime performance of deep learning models in the GHI estimation\nprocess.\n","authors":["Suwichaya Suwanwimolkul","Natanon Tongamrak","Nuttamon Thungka","Naebboon Hoonchareon","Jitkomut Songsiri"],"pdf_url":"https://arxiv.org/pdf/2409.16320v3.pdf","comment":"23 pages, 14 figures"},{"id":"http://arxiv.org/abs/2410.21216v2","updated":"2024-12-05T07:09:27Z","published":"2024-10-28T17:01:52Z","title":"HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced\n Context Awareness and Extrapolation","summary":" Many positional encodings (PEs) are designed to exhibit long-term decay,\nbased on an entrenched and long-standing inductive opinion: tokens farther away\nfrom the current position carry less relevant information. We argue that\nlong-term decay is outdated in the era of LLMs, as LLMs are now applied to\ntasks demanding precise retrieval of in-context information from arbitrary\npositions. Firstly, we present empirical analyses on various PEs, demonstrating\nthat models inherently learn attention with only a local-decay pattern while\nforming a U-shape pattern globally, contradicting the principle of long-term\ndecay. 
Furthermore, we conduct a detailed analysis of rotary position encoding\n(RoPE, a prevalent relative positional encoding in LLMs), and find that the\nU-shape attention is caused by some learned components, which are also the key\nfactor limiting RoPE's expressiveness and extrapolation. Inspired by these\ninsights, we propose High-frequency rotary Position Encoding (HoPE). HoPE\nreplaces the specific components in RoPE with position-independent ones,\nretaining only high-frequency signals, which also breaks the principle of\nlong-term decay in theory. HoPE achieves two major advantages: (1) Without\nconstraints imposed by long-term decay, contradictory factors that limit\nspontaneous attention optimization and model extrapolation performance are\nremoved. (2) Components representing positions and semantics are optimized.\nThese enhance the model's context awareness and extrapolation, as validated by\nextensive experiments.\n","authors":["Yuhan Chen","Ang Lv","Jian Luan","Bin Wang","Wei Liu"],"pdf_url":"https://arxiv.org/pdf/2410.21216v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03928v1","updated":"2024-12-05T07:07:35Z","published":"2024-12-05T07:07:35Z","title":"MT3DNet: Multi-Task learning Network for 3D Surgical Scene\n Reconstruction","summary":" In image-assisted minimally invasive surgeries (MIS), understanding surgical\nscenes is vital for real-time feedback to surgeons, skill evaluation, and\nimproving outcomes through collaborative human-robot procedures. Within this\ncontext, the challenge lies in accurately detecting, segmenting, and estimating\nthe depth of surgical scenes depicted in high-resolution images, while\nsimultaneously reconstructing the scene in 3D and providing segmentation of\nsurgical instruments along with detection labels for each instrument. To\naddress this challenge, a novel Multi-Task Learning (MTL) network is proposed\nfor performing these tasks concurrently. 
A key aspect of this approach involves\novercoming the optimization hurdles associated with handling multiple tasks\nconcurrently by integrating an Adversarial Weight Update into the MTL framework.\nThe proposed MTL model achieves 3D reconstruction through the integration of\nsegmentation, depth estimation, and object detection, thereby enhancing the\nunderstanding of surgical scenes, which marks a significant advancement\ncompared to existing studies that lack 3D capabilities. Comprehensive\nexperiments on the EndoVis2018 benchmark dataset underscore the adeptness of\nthe model in efficiently addressing all three tasks, demonstrating the efficacy\nof the proposed techniques.\n","authors":["Mithun Parab","Pranay Lendave","Jiyoung Kim","Thi Quynh Dan Nguyen","Palash Ingle"],"pdf_url":"https://arxiv.org/pdf/2412.03928v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03927v1","updated":"2024-12-05T07:06:17Z","published":"2024-12-05T07:06:17Z","title":"MegaCOIN: Enhancing Medium-Grained Color Perception for Vision-Language\n Models","summary":" In vision-language models (VLMs), the ability to perceive and interpret color\nand physical environment is crucial for achieving contextually accurate\nunderstanding and interaction. However, despite advances in multimodal\nmodeling, there remains a significant lack of specialized datasets that\nrigorously evaluate a model's capacity to discern subtle color variations and\nspatial context -- critical elements for situational comprehension and reliable\ndeployment across real-world applications. Toward that goal, we curate\nMegaCOIN, a high-quality, human-labeled dataset based on \\emph{real} images\nwith various contextual attributes. MegaCOIN consists of two parts:\nMegaCOIN-Instruct, which serves as a supervised fine-tuning (SFT) dataset for\nVLMs; and MegaCOIN-Bench, an annotated test set that can be used as a\nstand-alone QA dataset. 
MegaCOIN~provides three annotated features for 220,000\nreal images: foreground color, background color, and description of an object's\nphysical environment, constituting 660k human annotations. In addition,\nMegaCOIN can be applied to benchmark domain generalization (DG) algorithms. We\nexplore benchmarking DG methods in the linear probing setup for VLM and show\nsome new insights. Last but not least, we show that VLMs, including GPT-4o,\nhave subpar color recognition capabilities, and fine-tuning with MegaCOIN can\nresult in improved performance on visual evaluation tasks. In certain cases,\nMegaCOIN fine-tuned small-scale opensource models such as LLaVA and Bunny can\noutperform closed-source GPT-4o. We hope the utilities of MegaCOIN can shed\nlight on the directions VLMs can improve and provide a more complex platform\nfor domain generalization algorithms.\n","authors":["Ming-Chang Chiu","Shicheng Wen","Pin-Yu Chen","Xuezhe Ma"],"pdf_url":"https://arxiv.org/pdf/2412.03927v1.pdf","comment":"8 pages, 13 tables, 2 figures"}],"Multimedia":[{"id":"http://arxiv.org/abs/2412.04307v1","updated":"2024-12-05T16:26:37Z","published":"2024-12-05T16:26:37Z","title":"Feature Coding in the Era of Large Models: Dataset, Test Conditions, and\n Benchmark","summary":" Large models have achieved remarkable performance across various tasks, yet\nthey incur significant computational costs and privacy concerns during both\ntraining and inference. Distributed deployment has emerged as a potential\nsolution, but it necessitates the exchange of intermediate information between\nmodel segments, with feature representations serving as crucial information\ncarriers. To optimize information exchange, feature coding methods are applied\nto reduce transmission and storage overhead. Despite its importance, feature\ncoding for large models remains an under-explored area. In this paper, we draw\nattention to large model feature coding and make three contributions to this\nfield. 
First, we introduce a comprehensive dataset encompassing diverse\nfeatures generated by three representative types of large models. Second, we\nestablish unified test conditions, enabling standardized evaluation pipelines\nand fair comparisons across future feature coding studies. Third, we introduce\ntwo baseline methods derived from widely used image coding techniques and\nbenchmark their performance on the proposed dataset. These contributions aim to\nadvance the field of feature coding, facilitating more efficient large model\ndeployment. All source code and the dataset will be made available on GitHub.\n","authors":["Changsheng Gao","Yifan Ma","Qiaoxi Chen","Yenan Xu","Dong Liu","Weisi Lin"],"pdf_url":"https://arxiv.org/pdf/2412.04307v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.17440v2","updated":"2024-12-05T15:54:00Z","published":"2024-11-26T13:58:24Z","title":"Identity-Preserving Text-to-Video Generation by Frequency Decomposition","summary":" Identity-preserving text-to-video (IPT2V) generation aims to create\nhigh-fidelity videos with consistent human identity. It is an important task in\nvideo generation but remains an open problem for generative models. This paper\npushes the technical frontier of IPT2V in two directions that have not been\nresolved in literature: (1) A tuning-free pipeline without tedious case-by-case\nfinetuning, and (2) A frequency-aware heuristic identity-preserving DiT-based\ncontrol scheme. We propose ConsisID, a tuning-free DiT-based controllable IPT2V\nmodel to keep human identity consistent in the generated video. Inspired by\nprior findings in frequency analysis of diffusion transformers, it employs\nidentity-control signals in the frequency domain, where facial features can be\ndecomposed into low-frequency global features and high-frequency intrinsic\nfeatures. 
First, from a low-frequency perspective, we introduce a global facial\nextractor, which encodes reference images and facial key points into a latent\nspace, generating features enriched with low-frequency information. These\nfeatures are then integrated into shallow layers of the network to alleviate\ntraining challenges associated with DiT. Second, from a high-frequency\nperspective, we design a local facial extractor to capture high-frequency\ndetails and inject them into transformer blocks, enhancing the model's ability\nto preserve fine-grained features. We propose a hierarchical training strategy\nto leverage frequency information for identity preservation, transforming a\nvanilla pre-trained video generation model into an IPT2V model. Extensive\nexperiments demonstrate that our frequency-aware heuristic scheme provides an\noptimal control solution for DiT-based models. Thanks to this scheme, our\nConsisID generates high-quality, identity-preserving videos, making strides\ntowards more effective IPT2V.\n","authors":["Shenghai Yuan","Jinfa Huang","Xianyi He","Yunyuan Ge","Yujun Shi","Liuhan Chen","Jiebo Luo","Li Yuan"],"pdf_url":"https://arxiv.org/pdf/2411.17440v2.pdf","comment":"12 pages, 8 figures, Code: https://github.com/PKU-YuanGroup/ConsisID"},{"id":"http://arxiv.org/abs/2212.05005v4","updated":"2024-12-05T10:52:25Z","published":"2022-12-09T17:45:36Z","title":"Memories are One-to-Many Mapping Alleviators in Talking Face Generation","summary":" Talking face generation aims at generating photo-realistic video portraits of\na target person driven by input audio. Due to its nature of one-to-many mapping\nfrom the input audio to the output video (e.g., one speech content may have\nmultiple feasible visual appearances), learning a deterministic mapping like\nprevious works brings ambiguity during training, and thus causes inferior\nvisual results. 
Although this one-to-many mapping could be alleviated in part\nby a two-stage framework (i.e., an audio-to-expression model followed by a\nneural-rendering model), it is still insufficient since the prediction is\nproduced without enough information (e.g., emotions, wrinkles, etc.). In this\npaper, we propose MemFace to complement the missing information with an\nimplicit memory and an explicit memory that follow the sense of the two stages\nrespectively. More specifically, the implicit memory is employed in the\naudio-to-expression model to capture high-level semantics in the\naudio-expression shared space, while the explicit memory is employed in the\nneural-rendering model to help synthesize pixel-level details. Our experimental\nresults show that our proposed MemFace surpasses all the state-of-the-art\nresults across multiple scenarios consistently and significantly.\n","authors":["Anni Tang","Tianyu He","Xu Tan","Jun Ling","Li Song"],"pdf_url":"https://arxiv.org/pdf/2212.05005v4.pdf","comment":"IEEE Transactions on Pattern Analysis and Machine Intelligence\n (2024). Project page: see https://memoryface.github.io"}]}}
\ No newline at end of file
diff --git a/favicon.ico b/favicon.ico
new file mode 100644
index 0000000..7f5166c
Binary files /dev/null and b/favicon.ico differ
diff --git a/index.css b/index.css
new file mode 100644
index 0000000..9ded9d9
--- /dev/null
+++ b/index.css
@@ -0,0 +1,355 @@
+:root {
+ /* Palette: Nord (https://www.nordtheme.com) */
+ --nord00: #2e3440;
+ --nord01: #3b4252;
+ --nord02: #434c5e;
+ --nord03: #4c566a;
+ --nord04: #d8dee9;
+ --nord05: #e5e9f0;
+ --nord06: #eceff4;
+ --nord07: #8fbcbb;
+ --nord08: #88c0d0;
+ --nord09: #81a1c1;
+ --nord0A: #5e81ac;
+ --nord0B: #bf616a;
+ --nord0C: #d08770;
+ --nord0D: #ebcb8b;
+ --nord0E: #a3be8c;
+ --nord0F: #b48ead;
+
+
+ /* Typography */
+ --font-family-default: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen-Sans, Ubuntu, Cantarell, "Helvetica Neue",
+ sans-serif;
+ --font-size-scaler: 62.5%;
+ --font-size-m: 1.6rem;
+ --font-size-s: 1.4rem;
+
+ /* Components */
+ --body-color: var(--nord06);
+ --body-bg: var(--nord00);
+
+ --header-title: var(--nord06);
+ --header-container: var(--nord00);
+ --header-title-preffix: var(--nord0F);
+
+ --chip-font: var(--nord08);
+ --chip-color: var(--nord0B);
+
+ --icons: var(--nord06);
+ --icons-hover: var(--nord0F);
+
+ --day-container: var(--nord01);
+ --date: var(--nord09);
+
+ --summary: var(--nord0E);
+ --summary-hover: var(--nord0F);
+
+ --details-open: var(--nord02);
+ --details-content: var(--nord05);
+ --details-a: var(--nord07);
+ --details-a-hover: var(--nord0F);
+
+ --highlight-title: var(--nord0B);
+ --highlight-author: var(--nord0B);
+
+ --article-summary-hover-color: var(--nord0D);
+ --article-summary-color: var(--nord04);
+
+ --article-title-color: var(--nord05);
+ --article-title-hover-color: var(--nord0E);
+
+ --accordion-content-rail-color: var(--nord01);
+ --accordion-content-hover-rail-color: var(--nord0D);
+ --accordion-title-marker-color: var(--nord01);
+ --accordion-title-hover-marker-color: var(--nord0E);
+
+ --footer-color: var(--nord04);
+ --footer-link-hover-color: var(--nord0D);
+}
+
+[data-theme="light"] {
+ /* Theme design */
+
+ --color-primary: var(--nord07);
+ --color-primary-second: var(--nord00);
+ --color-info: var(--nord0A);
+ --color-success: var(--nord0E);
+ --color-warning: var(--nord0C);
+ --color-danger: var(--nord0B);
+
+ --color-text: var(--nord00);
+ --color-hover: var(--nord0D);
+ --color-shadow: var(--nord03);
+
+ --color-primary-h: var(--nord09);
+ --color-primary-s: var(--nord08);
+ --color-primary-l: var(--nord07);
+
+ --color-contrast-higher-h: var(--nord01);
+ --color-contrast-higher-l: var(--nord02);
+ --color-contrast-higher-s: var(--nord03);
+
+ --color-content: white;
+
+ --background: var(--nord06);
+ --background-content: var(--nord05);
+ --background-color: var(--nord04);
+
+ /* Components */
+
+ --chip-font: var(--nord06);
+ --chip-color: var(--nord09);
+
+ --body-color: var(--background-color);
+ --body-bg: var(--background);
+
+ --header-title: var(--color-shadow);
+ --header-container: var(--background);
+ --header-title-preffix: var(--color-primary-h);
+
+ --icons: var(--color-shadow);
+ --icons-hover: var(--color-hover);
+
+ --day-container: var(--background-content);
+ --date: var(--color-primary-l);
+
+ --summary: var(--color-info);
+ --summary-hover: var(--color-success);
+
+ --details-open: var(--color-content);
+ --details-content: var(--color-text);
+ --details-a: var(--color-primary-h);
+ --details-a-hover: var(--color-hover);
+
+ --highlight-title: var(--color-danger);
+ --highlight-author: var(--color-warning);
+
+ --article-summary-color: var(--color-text);
+ --article-summary-hover-color: var(--color-primary-s);
+
+ --article-title-color: var(--color-primary);
+ --article-title-hover-color: var(--color-success);
+
+ --accordion-content-rail-color: var(--color-warning);
+ --accordion-content-hover-rail-color: var(--color-warning);
+ --accordion-title-marker-color: var(--color-success);
+ --accordion-title-hover-marker-color: var(--color-success);
+
+ --footer-color: var(--color-text);
+ --footer-link-hover-color: var(--color-hover);
+}
+
+html {
+ font-size: var(--font-size-scaler);
+}
+
+body {
+ background-color: var(--body-bg);
+ font-family: var(--font-family-default);
+ color: var(--body-color);
+ margin: 0;
+ padding-top: 16px;
+ display: grid;
+}
+
+.header-container {
+ width: 90%;
+ max-width: 1200px;
+ background: var(--header-container);
+ margin: 0 auto;
+}
+
+.header-title {
+ font-size: 32px;
+ font-weight: bold;
+ color: var(--header-title);
+ margin: 0;
+ padding-bottom: 14px;
+}
+
+.header-title-preffix {
+ color: var(--header-title-preffix);
+}
+
+.icons {
+ color: var(--icons);
+ padding-bottom: 16px;
+}
+
+.icons a {
+ color: var(--icons);
+ text-decoration: none;
+}
+
+.icons a:hover {
+ color: var(--icons-hover);
+}
+
+.day-container {
+ padding: 16px;
+ background: var(--day-container);
+ width: 90%;
+ max-width: 1200px;
+ margin: 0 auto;
+ margin-bottom: 8px;
+ border-radius: 10px;
+}
+
+.date {
+ font-size: 24px;
+ font-weight: 700;
+ margin: 0;
+ color: var(--date);
+}
+
+p {
+ margin: 0;
+}
+
+summary {
+ font-weight: 600;
+ color: var(--summary);
+}
+
+summary:hover {
+ text-decoration: underline;
+ cursor: pointer;
+ color: var(--summary-hover);
+}
+
+details {
+ --border-color: transparent;
+
+ padding: 2px 4px;
+ font-size: 20px;
+ border: 1px solid var(--border-color);
+ border-radius: 4px;
+}
+
+details[open] {
+ background-color: var(--details-open);
+ margin-bottom: 8px;
+}
+
+.details-content {
+ padding: 12px 3px;
+ gap: 16px;
+ color: var(--details-content);
+}
+
+details a {
+ color: var(--details-a);
+}
+
+details a:hover {
+ color: var(--details-a-hover);
+}
+
+footer {
+ margin: 0 auto;
+ color: var(--footer-color);
+ font-size: var(--font-size-s);
+ display: flex;
+ padding: 0 16px;
+ justify-content: space-between;
+}
+
+.description {
+ margin: 0 auto;
+ color: var(--footer-color);
+ font-size: var(--font-size-s);
+ display: flex;
+ padding: 0 16px;
+ text-align: center;
+}
+
+.highlight-author {
+ color: var(--highlight-author);
+ font-weight: bold;
+}
+
+.highlight-title {
+ color: var(--highlight-title);
+ font-weight: bold;
+}
+
+.channel-description {
+ text-align: center;
+ font-size: var(--font-size-scaler);
+}
+
+.article-summary-link {
+ color: var(--article-summary-color);
+ font-size: var(--font-size-s);
+ text-decoration: none;
+}
+
+.article-summary-link:hover {
+ color: var(--article-summary-hover-color);
+ --accordion-content-rail-color: var(--accordion-content-hover-rail-color);
+}
+
+.article-summary-box-outer {
+ display: block;
+ padding: 4px 8px 8px 4px;
+}
+
+.article-summary-box-inner {
+ padding-left: 8px;
+ border-left: 1px solid var(--accordion-content-rail-color);
+ font-size: var(--font-size-m);
+}
+
+.article-expander {
+ padding: 10px 4px;
+ border-radius: 4px;
+}
+
+.article-authors {
+ font-size: var(--font-size-m);
+ padding: 0.25em 1em;
+}
+
+.article-authors a {
+ text-decoration: none;
+}
+
+.article-expander-title {
+ font-size: var(--font-size-m);
+ font-weight: 600;
+}
+
+.article-expander-title:hover {
+ cursor: pointer;
+}
+
+.article-expander-title::marker {
+ color: var(--accordion-title-marker-color);
+}
+
+.article-expander-title:hover::marker {
+ color: var(--accordion-title-hover-marker-color);
+}
+
+/* for switcher */
+.theme-switch {
+ display: inline-block;
+ position: relative;
+}
+
+.theme-switch input {
+ display: none;
+}
+
+/* chip */
+.chip {
+ font-size: 90%;
+ align-items: center;
+ color: var(--chip-font);
+ background: var(--chip-color);
+ border-radius: 5rem;
+ display: inline-flex;
+ padding: .2rem .4rem;
+ vertical-align: middle;
+}
\ No newline at end of file
diff --git a/index.html b/index.html
new file mode 100644
index 0000000..ca292c3
--- /dev/null
+++ b/index.html
@@ -0,0 +1,50421 @@
+
+
+
+
+ MyArxiv
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MyArxiv
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Computation and Language 86
+
+
+
+
+
+ ☆ VisionZip: Longer is Better but Not Necessary in Vision Language Models
+
+
+ Recent advancements in vision-language models have enhanced performance by
+increasing the length of visual tokens, making them much longer than text
+tokens and significantly raising computational costs. However, we observe that
+the visual tokens generated by popular vision encoders, such as CLIP and
+SigLIP, contain significant redundancy. To address this, we introduce
+VisionZip, a simple yet effective method that selects a set of informative
+tokens for input to the language model, reducing visual token redundancy and
+improving efficiency while maintaining model performance. The proposed
+VisionZip can be widely applied to image and video understanding tasks and is
+well-suited for multi-turn dialogues in real-world scenarios, where previous
+methods tend to underperform. Experimental results show that VisionZip
+outperforms the previous state-of-the-art method by at least 5% across
+nearly all settings. Moreover, our method significantly enhances
+model inference speed, improving the prefilling time by 8x and enabling the
+LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while
+achieving better results. Furthermore, we analyze the causes of this redundancy
+and encourage the community to focus on extracting better visual features
+rather than merely increasing token length. Our code is available at
+https://github.com/dvlab-research/VisionZip .
+
+
+ Graphical User Interfaces (GUIs) are critical to human-computer interaction,
+yet automating GUI tasks remains challenging due to the complexity and
+variability of visual environments. Existing approaches often rely on textual
+representations of GUIs, which introduce limitations in generalization,
+efficiency, and scalability. In this paper, we introduce Aguvis, a unified pure
+vision-based framework for autonomous GUI agents that operates across various
+platforms. Our approach leverages image-based observations, grounds
+natural-language instructions to visual elements, and employs a consistent
+action space to ensure cross-platform generalization. To address the
+limitations of previous work, we integrate explicit planning and reasoning
+within the model, enhancing its ability to autonomously navigate and interact
+with complex digital environments. We construct a large-scale dataset of GUI
+agent trajectories, incorporating multimodal reasoning and grounding, and
+employ a two-stage training pipeline that first focuses on general GUI
+grounding, followed by planning and reasoning. Through comprehensive
+experiments, we demonstrate that Aguvis surpasses previous state-of-the-art
+methods in both offline and real-world online scenarios, achieving, to our
+knowledge, the first fully autonomous pure vision GUI agent capable of
+performing tasks independently without collaboration with external
+closed-source models. We open-sourced all datasets, models, and training
+recipes to facilitate future research at https://aguvis-project.github.io/.
+
+
+
+ comment: https://aguvis-project.github.io/
+
+
+
+
+
+
+ ☆ p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay
+
+
+
+
+
+
+
+
+ Jun Zhang, Desen Meng, Ji Qi, Zhenpeng Huang, Tao Wu, Limin Wang
+
+
+ Despite the remarkable performance of multimodal large language models
+(MLLMs) across diverse tasks, the substantial training and inference costs
+impede their advancement. The majority of computation stems from the
+overwhelming volume of vision tokens processed by the transformer decoder. In
+this paper, we propose to build efficient MLLMs by leveraging the
+Mixture-of-Depths (MoD) mechanism, where each transformer decoder layer selects
+essential vision tokens to process while skipping redundant ones. However,
+integrating MoD into MLLMs is non-trivial. To address the challenges of
+training and inference stability as well as limited training data, we adapt the
+MoD module with two novel designs: tanh-gated weight normalization (TanhNorm)
+and symmetric token reweighting (STRing). Moreover, we observe that vision
+tokens exhibit higher redundancy in deeper layers and thus design a progressive
+ratio decay (PRD) strategy, which gradually reduces the token retention ratio
+layer by layer, employing a shifted cosine schedule. This crucial design fully
+unleashes the potential of MoD, significantly boosting the efficiency and
+performance of our models. To validate the effectiveness of our approach, we
+conduct extensive experiments with two baseline models across 14 benchmarks.
+Our model, p-MoD, matches or even surpasses the performance of the baseline
+models, with only 55.6% TFLOPs and 53.8% KV cache storage during inference, and
+77.7% GPU hours during training.
+
+
+
+ comment: Technical Report; Code released at https://github.com/MCG-NJU/p-MoD
+
+
+
+
+
+
+ ☆ Moto: Latent Motion Token as the Bridging Language for Robot
+ Manipulation
+
+
+ Recent developments in Large Language Models pre-trained on extensive corpora
+have shown significant success in various natural language processing tasks
+with minimal fine-tuning. This success offers new promise for robotics, which
+has long been constrained by the high cost of action-labeled data. We ask:
+given the abundant video data containing interaction-related knowledge
+available as a rich "corpus", can a similar generative pre-training approach be
+effectively applied to enhance robot learning? The key challenge is to identify
+an effective representation for autoregressive pre-training that benefits robot
+manipulation tasks. Inspired by the way humans learn new skills through
+observing dynamic environments, we propose that effective robotic learning
+should emphasize motion-related knowledge, which is closely tied to low-level
+actions and is hardware-agnostic, facilitating the transfer of learned motions
+to actual robot actions. To this end, we introduce Moto, which converts video
+content into latent Motion Token sequences by a Latent Motion Tokenizer,
+learning a bridging "language" of motion from videos in an unsupervised manner.
+We pre-train Moto-GPT through motion token autoregression, enabling it to
+capture diverse visual motion knowledge. After pre-training, Moto-GPT
+demonstrates the promising ability to produce semantically interpretable motion
+tokens, predict plausible motion trajectories, and assess trajectory
+rationality through output likelihood. To transfer learned motion priors to
+real robot actions, we implement a co-fine-tuning strategy that seamlessly
+bridges latent motion token prediction and real robot control. Extensive
+experiments show that the fine-tuned Moto-GPT exhibits superior robustness and
+efficiency on robot manipulation benchmarks, underscoring its effectiveness in
+transferring knowledge from video data to downstream visual manipulation tasks.
+
+
+
+ comment: Project released at: https://chenyi99.github.io/moto/
+
+ We introduce Condition-Aware Self-Supervised Learning Representation
+(CA-SSLR), a generalist conditioning model broadly applicable to various
+speech-processing tasks. Compared to standard fine-tuning methods that optimize
+for downstream models, CA-SSLR integrates language and speaker embeddings from
+earlier layers, making the SSL model aware of the current language and speaker
+context. This approach reduces the reliance on input audio features while
+preserving the integrity of the base SSLR. CA-SSLR improves the model's
+capabilities and demonstrates its generality on unseen tasks with minimal
+task-specific tuning. Our method employs linear modulation to dynamically
+adjust internal representations, enabling fine-grained adaptability without
+significantly altering the original model behavior. Experiments show that
+CA-SSLR reduces the number of trainable parameters, mitigates overfitting, and
+excels in under-resourced and unseen tasks. Specifically, CA-SSLR achieves a
+10% relative reduction in LID errors, a 37% improvement in ASR CER on the
+ML-SUPERB benchmark, and a 27% decrease in SV EER on VoxCeleb-1, demonstrating
+its effectiveness.
+
+
+
+ comment: 38th Conference on Neural Information Processing Systems (NeurIPS
+ 2024)
+
+
+
+
+
+
+ ☆ Establishing Task Scaling Laws via Compute-Efficient Model Ladders
+
+
+
+
+
+
+
+
+ Akshita Bhagia, Jiacheng Liu, Alexander Wettig, David Heineman, Oyvind Tafjord, Ananya Harsh Jha, Luca Soldaini, Noah A. Smith, Dirk Groeneveld, Pang Wei Koh, Jesse Dodge, Hannaneh Hajishirzi
+
+
+ We develop task scaling laws and model ladders to predict the individual task
+performance of pretrained language models (LMs) in the overtrained setting.
+Standard power laws for language modeling loss cannot accurately model task
+performance. Therefore, we leverage a two-step prediction approach: first use
+model and data size to predict a task-specific loss, and then use this task
+loss to predict task performance. We train a set of small-scale "ladder"
+models, collect data points to fit the parameterized functions of the two
+prediction steps, and make predictions for two target models: a 7B model
+trained to 4T tokens and a 13B model trained to 5T tokens. Training the ladder
+models only costs 1% of the compute used for the target models. On four
+multiple-choice tasks written in ranked classification format, we can predict
+the accuracy of both target models within 2 points of absolute error. We have
+higher prediction error on four other tasks (average absolute error 6.9) and
+find that these are often tasks with higher variance in task metrics. We also
+find that using less compute to train fewer ladder models tends to deteriorate
+predictions. Finally, we empirically show that our design choices and the
+two-step approach lead to superior performance in establishing scaling laws.
+
+
+
+
+
+
+
+ ☆ BhashaVerse : Translation Ecosystem for Indian Subcontinent Languages
+
+
+ This paper focuses on developing translation models and related applications
+for 36 Indian languages, including Assamese, Awadhi, Bengali, Bhojpuri, Braj,
+Bodo, Dogri, English, Konkani, Gondi, Gujarati, Hindi, Hinglish, Ho, Kannada,
+Kangri, Kashmiri (Arabic and Devanagari), Khasi, Mizo, Magahi, Maithili,
+Malayalam, Marathi, Manipuri (Bengali and Meitei), Nepali, Oriya, Punjabi,
+Sanskrit, Santali, Sinhala, Sindhi (Arabic and Devanagari), Tamil, Tulu,
+Telugu, and Urdu. Achieving this requires parallel and other types of corpora
+for all 36 * 36 language pairs, addressing challenges like script variations,
+phonetic differences, and syntactic diversity. For instance, languages like
+Kashmiri and Sindhi, which use multiple scripts, demand script normalization
+for alignment, while low-resource languages such as Khasi and Santali require
+synthetic data augmentation to ensure sufficient coverage and quality.
+ To address these challenges, this work proposes strategies for corpus
+creation by leveraging existing resources, developing parallel datasets,
+generating domain-specific corpora, and utilizing synthetic data techniques.
+Additionally, it evaluates machine translation across various dimensions,
+including standard and discourse-level translation, domain-specific
+translation, reference-based and reference-free evaluation, error analysis, and
+automatic post-editing. By integrating these elements, the study establishes a
+comprehensive framework to improve machine translation quality and enable
+better cross-lingual communication in India's linguistically diverse ecosystem.
+
+
+ Retrieval-augmented generation (RAG) introduces additional information to
+enhance large language models (LLMs). In machine translation (MT), previous
+work typically retrieves in-context examples from paired MT corpora, or
+domain-specific knowledge from knowledge graphs, to enhance models' MT ability.
+However, a large amount of world knowledge is organized in unstructured
+documents, and might not be fully paired across different languages. In this
+paper, we study retrieval-augmented MT using unstructured documents.
+Specifically, we build RAGtrans, the first benchmark to train and evaluate
+LLMs' retrieval-augmented MT ability. RAGtrans contains 79K MT samples
+collected via GPT-4o and human translators. Besides, documents from different
+languages are also provided to supply the knowledge to these samples. Based on
+RAGtrans, we further propose a multi-task training method to teach LLMs how to
+use information from multilingual documents during their translation. The
+method uses existing multilingual corpora to create auxiliary training
+objectives without additional labeling requirements. Extensive experiments show
+that the method improves LLMs by 1.58-3.09 BLEU and 1.00-2.03 COMET scores.
+
+
+
+
+
+
+
+ ☆ The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for
+ Open-Ended Text Generation ICLR
+
+
+
+
+
+
+
+
+ Fredrik Carlsson, Fangyu Liu, Daniel Ward, Murathan Kurfali, Joakim Nivre
+
+
+ This paper introduces the counter-intuitive generalization results of
+overfitting pre-trained large language models (LLMs) on very small datasets. In
+the setting of open-ended text generation, it is well-documented that LLMs tend
+to generate repetitive and dull sequences, a phenomenon that is especially
+apparent when generating using greedy decoding. This issue persists even with
+state-of-the-art LLMs containing billions of parameters, trained via next-token
+prediction on large datasets. We find that by further fine-tuning these models
+to achieve a near-zero training loss on a small set of samples -- a process we
+refer to as hyperfitting -- the long-sequence generative capabilities are
+greatly enhanced. Greedy decoding with these hyperfitted models even outperforms
+Top-P sampling over long sequences, both in terms of diversity and human
+preferences. This phenomenon extends to LLMs of various sizes, different
+domains, and even autoregressive image generation. We further find this
+phenomenon to be distinctly different from that of Grokking and double descent.
+Surprisingly, our experiments indicate that hyperfitted models rarely fall into
+repeating sequences they were trained on, and even explicitly blocking these
+sequences results in high-quality output. All hyperfitted models produce
+extremely low-entropy predictions, often allocating nearly all probability to a
+single token.
+
+
+ Large Language Models (LLMs) have emerged as a milestone in artificial
+intelligence, and their performance can improve as the model size increases.
+However, this scaling brings great challenges to training and inference
+efficiency, particularly for deploying LLMs in resource-constrained
+environments, and the scaling trend is becoming increasingly unsustainable.
+This paper introduces the concept of ``\textit{capacity density}'' as a new
+metric to evaluate the quality of the LLMs across different scales and
+describes the trend of LLMs in terms of both effectiveness and efficiency. To
+calculate the capacity density of a given target LLM, we first introduce a set
+of reference models and develop a scaling law to predict the downstream
+performance of these reference models based on their parameter sizes. We then
+define the \textit{effective parameter size} of the target LLM as the parameter
+size required by a reference model to achieve equivalent performance, and
+formalize the capacity density as the ratio of the effective parameter size to
+the actual parameter size of the target LLM. Capacity density provides a
+unified framework for assessing both model effectiveness and efficiency. Our
+further analysis of recent open-source base LLMs reveals an empirical law (the
+densing law) that the capacity density of LLMs grows exponentially over time.
+More specifically, using some widely used benchmarks for evaluation, the
+capacity density of LLMs doubles approximately every three months. The law
+provides new perspectives to guide future LLM development, emphasizing the
+importance of improving capacity density to achieve optimal results with
+minimal computational overhead.
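+
+ As an illustrative restatement of the definitions above (the symbols are our
+own shorthand, not notation taken from the paper): write $N$ for the actual
+parameter size of the target LLM, $\hat{N}$ for its effective parameter size,
+and $t$ for time in months. The capacity density and the reported doubling
+trend then read
+
+\[
+\rho = \frac{\hat{N}}{N}, \qquad \rho(t) \approx \rho(0) \cdot 2^{t/3}.
+\]
+
+Under this reading, a hypothetical model with $\rho = 2$ performs like a
+reference model with twice its parameter count.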
+
+
+
+
+
+
+
+
+ Michihiro Yasunaga, Leonid Shamis, Chunting Zhou, Andrew Cohen, Jason Weston, Luke Zettlemoyer, Marjan Ghazvininejad
+
+
+ Recent approaches to large language model (LLM) alignment typically require
+millions of human annotations or rely on external aligned models for synthetic
+data generation. This paper introduces ALMA: Alignment with Minimal Annotation,
+demonstrating that effective alignment can be achieved using only 9,000 labeled
+examples -- less than 1% of conventional approaches. ALMA generates large
+amounts of high-quality synthetic alignment data through new techniques:
+diverse prompt synthesis via few-shot learning, diverse response generation
+with multiple model checkpoints, and judge (reward model) enhancement through
+score aggregation and self-distillation. Using only a pretrained Llama3 base
+model, 5,000 SFT examples, and 4,000 judge annotations, ALMA achieves
+performance close to Llama3-Instruct across diverse alignment benchmarks (e.g.,
+0.1% difference on AlpacaEval 2.0 score). These results are achieved with a
+multi-round, self-bootstrapped data synthesis and training recipe that
+continues to improve for 10 rounds, surpassing the typical 3-round ceiling of
+previous methods. These results suggest that base models already possess
+sufficient knowledge for effective alignment, and that synthetic data
+generation methods can expose it.
+
+
+ Recent advancements have highlighted that large language models (LLMs), when
+given a small set of task-specific examples, demonstrate remarkable
+proficiency, a capability that extends to complex reasoning tasks. In
+particular, the combination of few-shot learning with the chain-of-thought
+(CoT) approach has been pivotal in steering models towards more logically
+consistent conclusions. This paper explores the optimization of example
+selection for designing effective CoT pre-prompts and shows that the choice of
+the optimization algorithm, typically in favor of comparison-based methods such
+as evolutionary computation, significantly enhances efficacy and feasibility.
+Specifically, thanks to a limited exploitative and overfitted optimization,
+Evolutionary Pre-Prompt Optimization (EPPO) brings an improvement over the
+naive few-shot approach exceeding 10 absolute points in exact match scores on
+benchmark datasets such as GSM8k and MathQA. These gains are consistent across
+various contexts and are further amplified when integrated with
+self-consistency (SC).
+
+
+
+
+
+
+
+
+ Zaid Alyafeai, Michael Pieler, Hannah Teufel, Jonathan Tow, Marco Bellagente, Duy Phung, Nikhil Pinnaparaju, Reshinth Adithyan, Paulo Rocha, Maksym Zhuravinskyi, Carlos Riquelme
+
+
+ Large Language Models (LLMs) have shown impressive results in multiple
+domains of natural language processing (NLP) but are mainly focused on the
+English language. Recently, more LLMs have incorporated a larger proportion of
+multilingual text to represent low-resource languages. In Arabic NLP, several
+Arabic-centric LLMs have shown remarkable results on multiple benchmarks in the
+past two years. However, most Arabic LLMs have more than 7 billion parameters,
+which increases their hardware requirements and inference latency, when
+compared to smaller LLMs. This paper introduces Arabic Stable LM 1.6B in a base
+and chat version as a small but powerful Arabic-centric LLM. Our Arabic Stable
+LM 1.6B chat model achieves impressive results on several benchmarks beating
+multiple models with up to 8x the parameters. In addition, we show the benefit
+of mixing in synthetic instruction tuning data by augmenting our fine-tuning
+data with a large synthetic dialogue dataset.
+
+
+ Speech-to-text translation (ST) is a cross-modal task that involves
+converting spoken language into text in a different language. Previous research
+primarily focused on enhancing speech translation by facilitating knowledge
+transfer from machine translation, exploring various methods to bridge the gap
+between speech and text modalities. Despite substantial progress made, factors
+in speech that are not relevant to translation content, such as timbre and
+rhythm, often limit the efficiency of knowledge transfer. In this paper, we
+conceptualize speech representation as a combination of content-agnostic and
+content-relevant factors. We examine the impact of content-agnostic factors on
+translation performance through preliminary experiments and observe a
+significant performance deterioration when content-agnostic perturbations are
+introduced to speech signals. To address this issue, we propose a
+\textbf{S}peech \textbf{R}epresentation \textbf{P}urification with
+\textbf{S}upervision \textbf{E}nhancement (SRPSE) framework, which excludes the
+content-agnostic components within speech representations to mitigate their
+negative impact on ST. Experiments on MuST-C and CoVoST-2 datasets demonstrate
+that SRPSE significantly improves translation performance across all
+translation directions in three settings and achieves preeminent performance
+under a \textit{transcript-free} setting.
+
+
+
+ comment: Accepted by COLING 2025
+
+
+
+
+
+
+ ☆ Aya Expanse: Combining Research Breakthroughs for a New Multilingual
+ Frontier
+
+
+
+
+
+
+
+
+ John Dang, Shivalika Singh, Daniel D'souza, Arash Ahmadian, Alejandro Salamanca, Madeline Smith, Aidan Peppin, Sungjin Hong, Manoj Govindassamy, Terrence Zhao, Sandra Kublik, Meor Amer, Viraat Aryabumi, Jon Ander Campos, Yi-Chern Tan, Tom Kocmi, Florian Strub, Nathan Grinsztajn, Yannis Flet-Berliac, Acyr Locatelli, Hangyu Lin, Dwarak Talupuru, Bharat Venkitesh, David Cairuz, Bowen Yang, Tim Chung, Wei-Yin Ko, Sylvie Shang Shi, Amir Shukayev, Sammie Bae, Aleksandra Piktus, Roman Castagné, Felipe Cruz-Salinas, Eddie Kim, Lucas Crawhall-Stein, Adrien Morisot, Sudip Roy, Phil Blunsom, Ivan Zhang, Aidan Gomez, Nick Frosst, Marzieh Fadaee, Beyza Ermis, Ahmet Üstün, Sara Hooker
+
+
+ We introduce the Aya Expanse model family, a new generation of 8B and 32B
+parameter multilingual language models, aiming to address the critical
+challenge of developing highly performant multilingual models that match or
+surpass the capabilities of monolingual models. By leveraging several years of
+research at Cohere For AI and Cohere, including advancements in data arbitrage,
+multilingual preference training, and model merging, Aya Expanse sets a new
+state-of-the-art in multilingual performance. Our evaluations on the
+Arena-Hard-Auto dataset, translated into 23 languages, demonstrate that Aya
+Expanse 8B and 32B outperform leading open-weight models in their respective
+parameter classes, including Gemma 2, Qwen 2.5, and Llama 3.1, achieving up to
+a 76.6% win-rate. Notably, Aya Expanse 32B outperforms Llama 3.1 70B, a model
+with twice as many parameters, achieving a 54.0% win-rate. In this short
+technical report, we present extended evaluation results for the Aya Expanse
+model family and release their open-weights, together with a new multilingual
+evaluation dataset m-ArenaHard.
+
+
+
+
+
+
+
+ ☆ CLINICSUM: Utilizing Language Models for Generating Clinical Summaries
+ from Patient-Doctor Conversations
+
+
+ This paper presents ClinicSum, a novel framework designed to automatically
+generate clinical summaries from patient-doctor conversations. It utilizes a
+two-module architecture: a retrieval-based filtering module that extracts
+Subjective, Objective, Assessment, and Plan (SOAP) information from
+conversation transcripts, and an inference module powered by fine-tuned
+Pre-trained Language Models (PLMs), which leverage the extracted SOAP data to
+generate abstracted clinical summaries. To fine-tune the PLM, we created a
+training dataset consisting of 1,473 conversation-summary pairs by
+consolidating two publicly available datasets, FigShare and MTS-Dialog, with
+ground truth summaries validated by Subject Matter Experts (SMEs). ClinicSum's
+effectiveness is evaluated through both automatic metrics (e.g., ROUGE,
+BERTScore) and expert human assessments. Results show that ClinicSum
+outperforms state-of-the-art PLMs, demonstrating superior precision, recall,
+and F-1 scores in automatic evaluations and receiving high preference from SMEs
+in human assessment, making it a robust solution for automated clinical
+summarization.
+
+
+
+ comment: Accepted at the 2024 IEEE International Conference on Big Data,
+ Workshop on Big Data and AI for Healthcare
+
+
+
+
+
+
+ ☆ A History of Philosophy in Colombia through Topic Modelling
+
+
+
+
+
+
+
+
+ Juan R. Loaiza, Miguel González-Duque
+
+
+ Data-driven approaches to philosophy have emerged as a valuable tool for
+studying the history of the discipline. However, most studies in this area have
+focused on a limited number of journals from specific regions and subfields. We
+expand the scope of this research by applying dynamic topic modelling
+techniques to explore the history of philosophy in Colombia and Latin America.
+Our study examines the Colombian philosophy journal Ideas y Valores, founded in
+1951 and currently one of the most influential academic philosophy journals in
+the region. By analyzing the evolution of topics across the journal's history,
+we identify various trends and specific dynamics in philosophical discourse
+within the Colombian and Latin American context. Our findings reveal that the
+most prominent topics are value theory (including ethics, political philosophy,
+and aesthetics), epistemology, and the philosophy of science. We also trace the
+evolution of articles focusing on the historical and interpretive aspects of
+philosophical texts, and we observe a notable emphasis on German philosophers such
+as Kant, Husserl, and Hegel on various topics throughout the journal's
+lifetime. Additionally, we investigate whether articles with a historical focus
+have decreased over time due to editorial pressures. Our analysis suggests no
+significant decline in such articles. Finally, we propose ideas for extending
+this research to other Latin American journals and suggest improvements for
+natural language processing workflows in non-English languages.
+
+
+
+
+
+
+
+ ☆ Addressing Hallucinations with RAG and NMISS in Italian Healthcare LLM
+ Chatbots
+
+
+ I combine detection and mitigation techniques to address hallucinations in
+Large Language Models (LLMs). Mitigation is achieved in a question-answering
+Retrieval-Augmented Generation (RAG) framework while detection is obtained by
+introducing the Negative Missing Information Scoring System (NMISS), which
+accounts for contextual relevance in responses. While RAG mitigates
+hallucinations by grounding answers in external data, NMISS refines the
+evaluation by identifying cases where traditional metrics incorrectly flag
+contextually accurate responses as hallucinations. I use Italian health news
+articles as context to evaluate LLM performance. Results show that Gemma2 and
+GPT-4 outperform the other models, with GPT-4 producing answers closely aligned
+with reference responses. Mid-tier models, such as Llama2, Llama3, and Mistral
+benefit significantly from NMISS, highlighting their ability to provide richer
+contextual information. This combined approach offers new insights into the
+reduction and more accurate assessment of hallucinations in LLMs, with
+applications in real-world healthcare tasks and other domains.
+
+
+
+
+
+
+
+ ☆ A Context-aware Framework for Translation-mediated Conversations
+
+
+
+
+
+
+
+
+ José Pombal, Sweta Agrawal, Patrick Fernandes, Emmanouil Zaranis, André F. T. Martins
+
+
+ Effective communication is fundamental to any interaction, yet challenges
+arise when participants do not share a common language. Automatic translation
+systems offer a powerful solution to bridge language barriers in such
+scenarios, but they introduce errors that can lead to misunderstandings and
+conversation breakdown. A key issue is that current systems fail to incorporate
+the rich contextual information necessary to resolve ambiguities and omitted
+details, resulting in literal, inappropriate, or misaligned translations. In
+this work, we present a framework to improve large language model-based
+translation systems by incorporating contextual information in bilingual
+conversational settings. During training, we leverage context-augmented
+parallel data, which allows the model to generate translations sensitive to
+conversational history. During inference, we perform quality-aware decoding
+with context-aware metrics to select the optimal translation from a pool of
+candidates. We validate both components of our framework on two task-oriented
+domains: customer chat and user-assistant interaction. Across both settings,
+our framework consistently results in better translations than state-of-the-art
+systems like GPT-4o and TowerInstruct, as measured by multiple automatic
+translation quality metrics on several language pairs. We also show that the
+resulting model leverages context in an intended and interpretable way,
+improving consistency between the conveyed message and the generated
+translations.
+
+
+
+
+
+
+
+ ☆ AL-QASIDA: Analyzing LLM Quality and Accuracy Systematically in
+ Dialectal Arabic
+
+
+
+
+
+
+
+
+ Nathaniel R. Robinson, Shahd Abdelmoneim, Kelly Marchisio, Sebastian Ruder
+
+
+ Dialectal Arabic (DA) varieties are under-served by language technologies,
+particularly large language models (LLMs). This trend threatens to exacerbate
+existing social inequalities and limits language modeling applications, yet the
+research community lacks operationalized LLM performance measurements in DA. We
+present a method that comprehensively evaluates LLM fidelity, understanding,
+quality, and diglossia in modeling DA. We evaluate nine LLMs in eight DA
+varieties across these four dimensions and provide best practice
+recommendations. Our evaluation suggests that LLMs do not produce DA as well as
+they understand it, but does not suggest deterioration in quality when they do.
+Further analysis suggests that current post-training can degrade DA
+capabilities, that few-shot examples can overcome this and other LLM
+deficiencies, and that otherwise no measurable features of input text correlate
+well with LLM DA performance.
+
+
+
+ comment: Pre-print
+
+
+
+
+
+
+ ☆ If You Can't Use Them, Recycle Them: Optimizing Merging at Scale
+ Mitigates Performance Tradeoffs
+
+
+
+
+
+
+
+
+ Muhammad Khalifa, Yi-Chern Tan, Arash Ahmadian, Tom Hosking, Honglak Lee, Lu Wang, Ahmet Üstün, Tom Sherborne, Matthias Gallé
+
+
+ Model merging has shown great promise at combining expert models, but the
+benefit of merging is unclear when merging ``generalist'' models trained on
+many tasks. We explore merging in the context of large ($\sim100$B) models, by
+\textit{recycling} checkpoints that exhibit tradeoffs among different tasks.
+Such checkpoints are often created in the process of developing a frontier
+model, and many suboptimal ones are usually discarded. Given a pool of model
+checkpoints obtained from different training runs (e.g., different stages,
+objectives, hyperparameters, and data mixtures), which naturally show tradeoffs
+across different language capabilities (e.g., instruction following vs. code
+generation), we investigate whether merging can recycle such suboptimal models
+into a Pareto-optimal one. Our optimization algorithm tunes the weight of each
+checkpoint in a linear combination, resulting in Pareto-optimal models that
+outperform both individual models and merge-based baselines. Further analysis
+shows that good merges tend to include almost all checkpoints with
+non-zero weights, indicating that even seemingly bad initial checkpoints can
+contribute to good final merges.
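+
+ A minimal sketch of the linear combination described above, in our own
+notation (an assumption, since the abstract gives no formula): with $K$
+checkpoints $\theta_1, \dots, \theta_K$ and tuned weights $w_1, \dots, w_K$,
+the merged model is
+
+\[
+\theta_{\mathrm{merge}} = \sum_{k=1}^{K} w_k \, \theta_k .
+\]
+
+Whether the weights are constrained (for example, to sum to one) is not stated
+in the abstract.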
+
+
+
+
+
+
+
+
+ Hongshen Xu, Su Zhu, Zihan Wang, Hang Zheng, Da Ma, Ruisheng Cao, Shuai Fan, Lu Chen, Kai Yu
+
+
+ Large Language Models (LLMs) have extended their capabilities beyond language
+generation to interact with external systems through tool calling, offering
+powerful potential for real-world applications. However, the phenomenon of tool
+hallucinations, which occur when models improperly select or misuse tools,
+presents critical challenges that can lead to flawed task execution and
+increased operational costs. This paper investigates the concept of reliable
+tool calling and highlights the necessity of addressing tool hallucinations. We
+systematically categorize tool hallucinations into two main types: tool
+selection hallucination and tool usage hallucination. To mitigate these issues,
+we propose a reliability-focused alignment framework that enhances the model's
+ability to accurately assess tool relevance and usage. By proposing a suite of
+evaluation metrics and evaluating on StableToolBench, we further demonstrate
+the effectiveness of our framework in mitigating tool hallucination and
+improving the overall system reliability of LLM tool calling.
+
+
+
+
+
+
+
+ ☆ Text Change Detection in Multilingual Documents Using Image Comparison
+
+
+
+
+
+
+
+
+ Doyoung Park, Naresh Reddy Yarram, Sunjin Kim, Minkyu Kim, Seongho Cho, Taehee Lee
+
+
+ Document comparison typically relies on optical character recognition (OCR)
+as its core technology. However, OCR requires the selection of appropriate
+language models for each document and the performance of multilingual or hybrid
+models remains limited. To overcome these challenges, we propose text change
+detection (TCD) using an image comparison model tailored for multilingual
+documents. Unlike OCR-based approaches, our method employs word-level text
+image-to-image comparison to detect changes. Our model generates bidirectional
+change segmentation maps between the source and target documents. To enhance
+performance without requiring explicit text alignment or scaling preprocessing,
+we employ correlations among multi-scale attention features. We also construct
+a benchmark dataset comprising actual printed and scanned word pairs in various
+languages to evaluate our model. We validate our approach using our benchmark
+dataset and public benchmarks Distorted Document Images and the LRDE Document
+Binarization Dataset. We compare our model against state-of-the-art semantic
+segmentation and change detection models, as well as conventional OCR-based
+models.
+
+
+ Pre-trained Language Models (PLMs) have shown remarkable performance in
+recent years, setting a new paradigm for NLP research and industry. The legal
+domain has received some attention from the NLP community partly due to its
+textual nature. Some tasks from this domain are represented by
+question-answering (QA) tasks. This work explores the legal domain
+Multiple-Choice QA (MCQA) for a low-resource language. The contribution of this
+work is multi-fold. We first introduce JuRO, the first openly available
+Romanian legal MCQA dataset, comprising three different examinations and a
+total of 10,836 questions. Along with this dataset, we introduce CROL,
+an organized corpus of laws that has a total of 93 distinct documents with
+their modifications from 763 time spans, that we leveraged in this work for
+Information Retrieval (IR) techniques. Moreover, we are the first to propose
+Law-RoG, a Knowledge Graph (KG) for the Romanian language, and this KG is
+derived from the aforementioned corpus. Lastly, we propose a novel approach for
+MCQA, Graph Retrieval Augmented by Facts (GRAF), which achieves competitive
+results with generally accepted SOTA methods and even exceeds them in most
+settings.
+
+
+
+
+
+
+
+ ☆ Missing Melodies: AI Music Generation and its "Nearly" Complete Omission
+ of the Global South
+
+
+ Recent advances in generative AI have sparked renewed interest and expanded
+possibilities for music generation. However, the performance and versatility of
+these systems across musical genres are heavily influenced by the availability
+of training data. We conducted an extensive analysis of over one million hours
+of audio datasets used in AI music generation research and manually reviewed
+more than 200 papers from eleven prominent AI and music conferences and
+organizations (AAAI, ACM, EUSIPCO, EURASIP, ICASSP, ICML, IJCAI, ISMIR,
+NeurIPS, NIME, SMC) to identify a critical gap in the fair representation and
+inclusion of the musical genres of the Global South in AI research. Our
+findings reveal a stark imbalance: approximately 86% of the total dataset hours
+and over 93% of researchers focus primarily on music from the Global North.
+Although around 40% of these datasets include some form of non-Western music,
+genres from the Global South account for only 14.6% of the data. Furthermore,
+approximately 51% of the papers surveyed concentrate on symbolic music
+generation, a method that often fails to capture the cultural nuances inherent
+in music from regions such as South Asia, the Middle East, and Africa. As AI
+increasingly shapes the creation and dissemination of music, the significant
+underrepresentation of music genres in datasets and research presents a serious
+threat to global musical diversity. We also propose some important steps to
+mitigate these risks and foster a more inclusive future for AI-driven music
+generation.
+
+
+
+ comment: Submitted to CACM, 12 pages, 2 figures
+
+
+
+
+
+
+ ☆ GEITje 7B Ultra: A Conversational Model for Dutch
+
+
+ Language models have rapidly evolved, predominantly focusing on English while
+often neglecting extensive pretraining in other languages. This approach has
+required initiatives to adapt powerful, English-centric models to other
+linguistic contexts through finetuning. For Dutch, such a recent endeavour is
+"GEITje", a model originally derived from the English-based Mistral 7B.
+Building on this fundamental work, the current research extends the
+capabilities of GEITje by supervised finetuning on newly created high-quality
+synthetic conversational datasets, along with an additional preference
+alignment procedure on a synthetic feedback dataset. Both the developed models
+and the created datasets are openly available.
+
+
+
+
+
+
+
+ ☆ Automated Medical Report Generation for ECG Data: Bridging Medical Text
+ and Signal Processing with Deep Learning
+
+
+
+
+
+
+
+
+ Amnon Bleich, Antje Linnemann, Bjoern H. Diem, Tim OF Conrad
+
+
+ Recent advances in deep learning and natural language generation have
+significantly improved image captioning, enabling automated, human-like
+descriptions for visual content. In this work, we apply these captioning
+techniques to generate clinician-like interpretations of ECG data. This study
+leverages existing ECG datasets accompanied by free-text reports authored by
+healthcare professionals (HCPs) as training data. These reports, while often
+inconsistent, provide a valuable foundation for automated learning. We
+introduce an encoder-decoder-based method that uses these reports to train
+models to generate detailed descriptions of ECG episodes. This represents a
+significant advancement in ECG analysis automation, with potential applications
+in zero-shot classification and automated clinical decision support.
+ The model is tested on various datasets, including both 1- and 12-lead ECGs.
+It significantly outperforms the state-of-the-art reference model by Qiu et
+al., achieving a METEOR score of 55.53% compared to 24.51% achieved by the
+reference model. Furthermore, several key design choices are discussed,
+providing a comprehensive overview of current challenges and innovations in
+this domain.
+ The source codes for this research are publicly available in our Git
+repository https://git.zib.de/ableich/ecg-comment-generation-public
+
+
+
+
+
+
+
+ ☆ Hostility Detection in UK Politics: A Dataset on Online Abuse Targeting
+ MPs
+
+
+ Numerous politicians use social media platforms, particularly X, to engage
+with their constituents. This interaction allows constituents to pose questions
+and offer feedback but also exposes politicians to a barrage of hostile
+responses, especially given the anonymity afforded by social media. They are
+typically targeted in relation to their governmental role, but the comments
+also tend to attack their personal identity. This can discredit politicians and
+reduce public trust in the government. It can also incite anger and disrespect,
+leading to offline harm and violence. While numerous models exist for detecting
+hostility in general, they lack the specificity required for political
+contexts. Furthermore, addressing hostility towards politicians demands
+tailored approaches due to the distinct language and issues inherent to each
+country (e.g., Brexit for the UK). To bridge this gap, we construct a dataset
+of 3,320 English tweets spanning a two-year period manually annotated for
+hostility towards UK MPs. Our dataset also captures the targeted identity
+characteristics (race, gender, religion, none) in hostile tweets. We perform
+linguistic and topical analyses to delve into the unique content of the UK
+political data. Finally, we evaluate the performance of pre-trained language
+models and large language models on binary hostility detection and multi-class
+targeted identity type classification tasks. Our study offers valuable data and
+insights for future research on the prevalence and nature of politics-related
+hostility specific to the UK.
+
+
+
+
+
+
+
+ ☆ M$^{3}$D: A Multimodal, Multilingual and Multitask Dataset for Grounded
+ Document-level Information Extraction
+
+
+ Multimodal information extraction (IE) tasks have attracted increasing
+attention because many studies have shown that multimodal information benefits
+text information extraction. However, existing multimodal IE datasets mainly
+focus on sentence-level image-facilitated IE in English text, and pay little
+attention to video-based multimodal IE and fine-grained visual grounding.
+Therefore, in order to promote the development of multimodal IE, we constructed
+a multimodal multilingual multitask dataset, named M$^{3}$D, which has the
+following features: (1) It contains paired document-level text and video to
+enrich multimodal information; (2) It supports two widely-used languages,
+namely English and Chinese; (3) It includes more multimodal IE tasks such as
+entity recognition, entity chain extraction, relation extraction and visual
+grounding. In addition, our dataset introduces an unexplored theme, i.e.,
+biography, enriching the domains of multimodal IE resources. To establish a
+benchmark for our dataset, we propose an innovative hierarchical multimodal IE
+model. This model effectively leverages and integrates multimodal information
+through a Denoised Feature Fusion Module (DFFM). Furthermore, in non-ideal
+scenarios, modal information is often incomplete. Thus, we designed a Missing
+Modality Construction Module (MMCM) to alleviate the issues caused by missing
+modalities. Our model achieved an average performance of 53.80% and 53.77% on
+four tasks in English and Chinese datasets, respectively, which set a
+reasonable standard for subsequent research. In addition, we conducted more
+analytical experiments to verify the effectiveness of our proposed module. We
+believe that our work can promote the development of the field of multimodal
+IE.
+
+
+
+ comment: 14 pages, 9 figures, 6 tables
+
+
+
+
+
+
+ ☆ Exploring the Influence of Label Aggregation on Minority Voices:
+ Implications for Dataset Bias and Model Training
+
+
+ Resolving disagreement in manual annotation typically consists of removing
+unreliable annotators and using a label aggregation strategy such as majority
+vote or expert opinion to resolve disagreement. These may have the side-effect
+of silencing or under-representing minority but equally valid opinions. In this
+paper, we study the impact of standard label aggregation strategies on minority
+opinion representation in sexism detection. We investigate the quality and
+value of minority annotations, and then examine their effect on the class
+distributions in gold labels, as well as how this affects the behaviour of
+models trained on the resulting datasets. Finally, we discuss the potential
+biases introduced by each method and how they can be amplified by the models.
+
+
+
+
+
+
+
+ ☆ Marco-LLM: Bridging Languages via Massive Multilingual Training for
+ Cross-Lingual Enhancement
+
+
+ Large Language Models (LLMs) have achieved remarkable progress in recent
+years; however, their excellent performance is still largely limited to major
+world languages, primarily English. Many LLMs continue to face challenges with
+multilingual tasks, especially when it comes to low-resource languages. To
+address this issue, we introduced Marco-LLM: Massive multilingual training for
+cross-lingual enhancement LLM. We have collected a substantial amount of
+multilingual data for several low-resource languages and conducted extensive
+continual pre-training using the Qwen2 models. This effort has resulted in a
+multilingual LLM named Marco-LLM. Through comprehensive evaluations on various
+multilingual benchmarks, including MMMLU, AGIEval, Belebele, Flores-200, XCOPA
+and many others, Marco-LLM has demonstrated substantial improvements over
+state-of-the-art LLMs. Furthermore, Marco-LLM achieved substantial enhancements
+in any-to-any machine translation tasks, showing the effectiveness of our
+multilingual LLM. Marco-LLM is a pioneering multilingual LLM designed to not
+only perform exceptionally well in multilingual tasks, including low-resource
+languages, but also maintain strong performance in English and other major
+languages, closing the performance gap between high- and low-resource language
+capabilities. By bridging languages, this effort demonstrates our dedication to
+ensuring LLMs work accurately across various languages.
+
+
+
+
+
+
+
+ ☆ MTMT: Consolidating Multiple Thinking Modes to Form a Thought Tree for
+ Strengthening LLM
+
+
+ Large language models (LLMs) have shown limitations in tasks requiring
+complex logical reasoning and multi-step problem-solving. To address these
+challenges, researchers have employed carefully designed prompts and
+flowcharts, simulating human cognitive processes to enhance LLM performance,
+such as the Chain of Thought approach. In this paper, we introduce MTMT
+(Multi-thinking Modes Tree), a novel method that interacts with LLMs to
+construct a thought tree, simulating various advanced cognitive processes,
+including but not limited to association, counterfactual thinking, task
+decomposition, and comparison. By breaking down the original complex task into
+simpler sub-questions, MTMT facilitates easier problem-solving for LLMs,
+enabling more effective utilization of the latent knowledge within LLMs. We
+evaluate the performance of MTMT under different parameter configurations,
+using GPT-4o mini as the base model. Our results demonstrate that integrating
+multiple modes of thinking significantly enhances the ability of LLMs to handle
+complex tasks.
+
+
+
+
+
+
+
+ ☆ Demonstration Selection for In-Context Learning via Reinforcement
+ Learning
+
+
+ Diversity in demonstration selection is crucial for enhancing model
+generalization, as it enables a broader coverage of structures and concepts.
+However, constructing an appropriate set of demonstrations has remained a focal
+point of research. This paper presents the Relevance-Diversity Enhanced
+Selection (RDES), an innovative approach that leverages reinforcement learning
+to optimize the selection of diverse reference demonstrations for text
+classification tasks using Large Language Models (LLMs), especially in few-shot
+prompting scenarios. RDES employs a Q-learning framework to dynamically
+identify demonstrations that maximize both diversity and relevance to the
+classification objective by calculating a diversity score based on label
+distribution among selected demonstrations. This method ensures a balanced
+representation of reference data, leading to improved classification accuracy.
+Through extensive experiments on four benchmark datasets and involving 12
+closed-source and open-source LLMs, we demonstrate that RDES significantly
+enhances classification accuracy compared to ten established baselines.
+Furthermore, we investigate the incorporation of Chain-of-Thought (CoT)
+reasoning in the reasoning process, which further enhances the model's
+predictive performance. The results underscore the potential of reinforcement
+learning to facilitate adaptive demonstration selection and deepen the
+understanding of classification challenges.
+
+
+
+
+
+
+
+ ☆ MIND: Effective Incorrect Assignment Detection through a Multi-Modal
+ Structure-Enhanced Language Model
+
+
+
+
+
+
+
+
+ Yunhe Pang, Bo Chen, Fanjin Zhang, Yanghui Rao, Jie Tang
+
+
+ The rapid growth of academic publications has exacerbated the issue of author
+name ambiguity in online digital libraries. Despite advances in name
+disambiguation algorithms, cumulative errors continue to undermine the
+reliability of academic systems. It is estimated that over 10% of paper-author
+assignments are rectified when constructing the million-scale WhoIsWho
+benchmark. Existing endeavors to detect incorrect assignments are either
+semantic-based or graph-based approaches, which fall short of making full use
+of the rich text attributes of papers and implicit structural features defined
+via the co-occurrence of paper attributes. To this end, this paper introduces a
+structure-enhanced language model that combines key structural features from
+graph-based methods with fine-grained semantic features from rich paper
+attributes to detect incorrect assignments. The proposed model is trained with
+a highly effective multi-modal multi-turn instruction tuning framework, which
+incorporates task-guided instruction tuning, text-attribute modality, and
+structural modality. Experimental results demonstrate that our model
+outperforms previous approaches, achieving top performance on the leaderboard
+of KDD Cup 2024. Our code is publicly available.
+
+
+
+
+
+
+
+ ☆ A Survey on Large Language Model-Based Social Agents in Game-Theoretic
+ Scenarios
+
+
+ Game-theoretic scenarios have become pivotal in evaluating the social
+intelligence of Large Language Model (LLM)-based social agents. While numerous
+studies have explored these agents in such settings, there is a lack of a
+comprehensive survey summarizing the current progress. To address this gap, we
+systematically review existing research on LLM-based social agents within
+game-theoretic scenarios. Our survey organizes the findings into three core
+components: Game Framework, Social Agent, and Evaluation Protocol. The game
+framework encompasses diverse game scenarios, ranging from choice-focusing to
+communication-focusing games. The social agent part explores agents'
+preferences, beliefs, and reasoning abilities. The evaluation protocol covers
+both game-agnostic and game-specific metrics for assessing agent performance.
+By reflecting on the current research and identifying future research
+directions, this survey provides insights to advance the development and
+evaluation of social agents in game-theoretic scenarios.
+
+
+ We propose a suite of tasks to evaluate the instrumental self-reasoning
+ability of large language model (LLM) agents. Instrumental self-reasoning
+ability could improve adaptability and enable self-modification, but it could
+also pose significant risks, such as enabling deceptive alignment. Prior work
+has only evaluated self-reasoning in non-agentic settings or in limited
+domains. In this paper, we propose evaluations for instrumental self-reasoning
+ability in agentic tasks in a wide range of scenarios, including
+self-modification, knowledge seeking, and opaque self-reasoning. We evaluate
+agents built using state-of-the-art LLMs, including commercial and open source
+systems. We find that instrumental self-reasoning ability emerges only in the
+most capable frontier models and that it is highly context-dependent. No model
+passes the most difficult versions of our evaluations, hence our evaluation
+can be used to measure increases in instrumental self-reasoning ability in
+future models. We open-source our evaluations at
+https://github.com/kaifronsdal/Self-Reasoning-Evals.
+
+
+ Integrated Gradients is a well-known technique for explaining deep learning
+models. It calculates feature importance scores with a gradient-based approach,
+computing gradients of the model output with respect to input features and
+accumulating them along a linear path. While this works well for continuous
+feature spaces, it may not be the best way to deal with discrete spaces like
+word embeddings. For interpreting LLMs (Large Language Models),
+there exists a need for a non-linear path where intermediate points, whose
+gradients are to be computed, lie close to actual words in the embedding space.
+In this paper, we propose a method called Uniform Discretized Integrated
+Gradients (UDIG) based on a new interpolation strategy where we choose a
+favorable nonlinear path for computing attribution scores suitable for
+predictive language models. We evaluate our method on two types of NLP tasks,
+Sentiment Classification and Question Answering, against three metrics: Log
+odds, Comprehensiveness and Sufficiency. For sentiment classification, we use
+the SST2, IMDb and Rotten Tomatoes datasets for benchmarking; for Question
+Answering, we use a BERT model fine-tuned on the SQuAD dataset.
+Our approach outperforms the existing methods on almost all the metrics.
+
+
+ This study introduces AyutthayaAlpha, an advanced transformer-based machine
+learning model designed for the transliteration of Thai proper names into Latin
+script. Our system achieves state-of-the-art performance with 82.32%
+first-token accuracy and 95.24% first-three-token accuracy, while maintaining a
+low character error rate of 0.0047. The complexity of Thai phonology, including
+tonal features and vowel length distinctions, presents significant challenges
+for accurate transliteration, which we address through a novel two-model
+approach: AyutthayaAlpha-Small, based on the ByT5 architecture, and
+AyutthayaAlpha-VerySmall, a computationally efficient variant that unexpectedly
+outperforms its larger counterpart. Our research combines linguistic rules with
+deep learning, training on a carefully curated dataset of 1.2 million
+Thai-Latin name pairs, augmented through strategic upsampling to 2.7 million
+examples. Extensive evaluations against existing transliteration methods and
+human expert benchmarks demonstrate that AyutthayaAlpha not only achieves
+superior accuracy but also effectively captures personal and cultural
+preferences in name romanization. The system's practical applications extend to
+cross-lingual information retrieval, international data standardization, and
+identity verification systems, with particular relevance for government
+databases, academic institutions, and global business operations. This work
+represents a significant advance in bridging linguistic gaps between Thai and
+Latin scripts, while respecting the cultural and personal dimensions of name
+transliteration.
+
+
+
+
+
+
+
+ ☆ Automated LaTeX Code Generation from Handwritten Math Expressions Using
+ Vision Transformer
+
+
+ Converting mathematical expressions into LaTeX is challenging. In this paper,
+we explore using newer transformer-based architectures for addressing the
+problem of converting handwritten/digital mathematical expression images into
+equivalent LaTeX code. We use the current state-of-the-art CNN encoder and RNN
+decoder as a baseline for our experiments. We also investigate improvements to
+the CNN-RNN architecture by replacing the CNN encoder with the ResNet50 model.
+Our experiments show that transformer architectures achieve higher overall
+accuracy and BLEU scores, along with lower Levenshtein scores, than the
+baseline CNN/RNN architecture, with room to achieve even better results with
+appropriate fine-tuning of model parameters.
+
+
+
+ comment: 7 pages; 3 figures
+
+
+
+
+
+
+ ☆ Educational-Psychological Dialogue Robot Based on Multi-Agent
+ Collaboration
+
+
+ Intelligent dialogue systems are increasingly used in modern education and
+psychological counseling fields, but most existing systems are limited to a
+single domain, cannot deal with both educational and psychological issues, and
+often lack accuracy and professionalism when dealing with complex issues. To
+address these problems, this paper proposes an intelligent dialog system that
+combines educational and psychological counseling functions. The system
+consists of multiple AI agents, including a security detection agent, an intent
+identification agent, an educational LLM agent, and a psychological LLM agent,
+which work in concert to provide accurate educational knowledge Q&A
+and psychological support services. Specifically, the system recognizes
+user intent through an intent classification model and invokes a
+retrieval-enhanced educational large model and a psychological large model
+fine-tuned with psychological data in order to provide professional educational
+advice and psychological support.
+
+
+
+
+
+
+
+ ☆ Beyond the Binary: Capturing Diverse Preferences With Reward
+ Regularization
+
+
+
+
+
+
+
+
+ Vishakh Padmakumar, Chuanyang Jin, Hannah Rose Kirk, He He
+
+
+ Large language models (LLMs) are increasingly deployed via public-facing
+interfaces to interact with millions of users, each with diverse preferences.
+Despite this, preference tuning of LLMs predominantly relies on reward models
+trained using binary judgments where annotators select the preferred choice out
+of pairs of model outputs. In this work, we argue that this reliance on binary
+choices does not capture the broader, aggregate preferences of the target user
+in real-world tasks. We propose a taxonomy that identifies two dimensions of
+subjectivity where different users disagree on the preferred output, namely the
+Plurality of Responses to Prompts, where prompts allow for multiple correct
+answers, and the Indistinguishability of Responses, where candidate outputs are
+paraphrases of each other. We show that reward models correlate weakly with
+user preferences in these cases. As a first step to address this issue, we
+introduce a simple yet effective method that augments existing binary
+preference datasets with synthetic preference judgments to estimate potential
+user disagreement. Incorporating these via a margin term as a form of
+regularization during model training yields predictions that better align with
+the aggregate user preferences.
+
+
+
+
+
+
+
+
+ Sunghoon Kang, Hyeoneui Kim, Hyewon Park, Ricky Taira
+
+
+ The goal of this work was to compute the semantic similarity among publicly
+available health survey questions in order to facilitate the standardization of
+survey-based Person-Generated Health Data (PGHD). We compiled various health
+survey questions authored in both English and Korean from the NIH CDE
+Repository, PROMIS, Korean public health agencies, and academic publications.
+Questions were drawn from various health lifelog domains. A randomized question
+pairing scheme was used to generate a Semantic Text Similarity (STS) dataset
+consisting of 1758 question pairs. Similarity scores between each question pair
+were assigned by two human experts. The tagged dataset was then used to build
+three classifiers featuring: Bag-of-Words, SBERT with BERT-based embeddings,
+and SBERT with LaBSE embeddings. The algorithms were evaluated using
+traditional contingency statistics. Among the three algorithms, SBERT-LaBSE
+demonstrated the highest performance in assessing question similarity across
+both languages, achieving areas under the Receiver Operating Characteristic
+(ROC) and Precision-Recall curves of over 0.99. Additionally, it proved
+effective in identifying cross-lingual semantic similarities. The SBERT-LaBSE
+algorithm excelled at aligning semantically equivalent sentences across both
+languages but encountered challenges in capturing subtle nuances and
+maintaining computational efficiency. Future research should focus on testing
+with larger multilingual datasets and on calibrating and normalizing scores
+across the health lifelog domains to improve consistency. This study introduces
+the SBERT-LaBSE algorithm for calculating semantic similarity across two
+languages, showing it outperforms BERT-based models and the Bag of Words
+approach, highlighting its potential to improve semantic interoperability of
+survey-based PGHD across language barriers.
+
+
+
+
+
+
+
+ ☆ Synergizing LLMs and Knowledge Graphs: A Novel Approach to Software
+ Repository-Related Question Answering
+
+
+ Software repositories contain valuable information for gaining insights into
+their development process. However, extracting insights from these repository
+data is time-consuming and requires technical expertise. While software
+engineering chatbots have been developed to facilitate natural language
+interactions with repositories, they struggle with understanding natural
+language and accurately retrieving relevant data. This study aims to improve
+the accuracy of LLM-based chatbots in answering repository-related questions by
+augmenting them with knowledge graphs. We achieve this in a two-step approach:
+(1) constructing a knowledge graph from the repository data and (2) synergizing
+the knowledge graph with an LLM to allow for natural language questions and
+answers. We curated a set of 20 questions with different complexities and
+evaluated our approach on five popular open-source projects. Our approach
+achieved an accuracy of 65%. We further investigated the limitations and
+identified six key issues, with the majority relating to the reasoning
+capability of the LLM. We experimented with few-shot chain-of-thought
+prompting to determine if it could enhance our approach. This technique
+improved the overall accuracy to 84%. Our findings demonstrate the synergy
+between LLMs and knowledge graphs as a viable solution for making repository
+data accessible to both technical and non-technical stakeholders.
+
+
+
+ comment: Submitted to ACM Transactions on Software Engineering and Methodology
+ for review
+
+
+
+
+
+
+ ☆ Agent AI with LangGraph: A Modular Framework for Enhancing Machine
+ Translation Using Large Language Models
+
+
+ This paper explores the transformative role of Agent AI and LangGraph in
+advancing the automation and effectiveness of machine translation (MT). Agents
+are modular components designed to perform specific tasks, such as translating
+between particular languages, with specializations like TranslateEnAgent,
+TranslateFrenchAgent, and TranslateJpAgent for English, French, and Japanese
+translations, respectively. These agents leverage the powerful semantic
+capabilities of large language models (LLMs), such as GPT-4o, to ensure
+accurate, contextually relevant translations while maintaining modularity,
+scalability, and context retention.
+ LangGraph, a graph-based framework built on LangChain, simplifies the
+creation and management of these agents and their workflows. It supports
+dynamic state management, enabling agents to maintain dialogue context, and
+automates complex workflows by linking agents and facilitating their
+collaboration. With flexibility, open-source community support, and seamless
+integration with LLMs, LangGraph empowers agents to deliver high-quality
+translations.
+ Together, Agent AI and LangGraph create a cohesive system where LangGraph
+orchestrates agent interactions, ensuring that user inputs are analyzed,
+routed, and processed efficiently. Experimental results demonstrate the
+potential of this system to enhance multilingual translation accuracy and
+scalability. By highlighting modular design and automated workflows, this paper
+sets the stage for further innovations in intelligent machine translation
+services.
+
+
+
+
+
+
+
+ ☆ The broader spectrum of in-context learning
+
+
+
+
+
+
+
+
+ Andrew Kyle Lampinen, Stephanie C. Y. Chan, Aaditya K. Singh, Murray Shanahan
+
+
+ The ability of language models to learn a task from a few examples in context
+has generated substantial interest. Here, we provide a perspective that
+situates this type of supervised few-shot learning within a much broader
+spectrum of meta-learned in-context learning. Indeed, we suggest that any
+distribution of sequences in which context non-trivially decreases loss on
+subsequent predictions can be interpreted as eliciting a kind of in-context
+learning. We suggest that this perspective helps to unify the broad set of
+in-context abilities that language models exhibit, such as
+adapting to tasks from instructions or role play, or extrapolating time series.
+This perspective also sheds light on potential roots of in-context learning in
+lower-level processing of linguistic dependencies (e.g. coreference or parallel
+structures). Finally, taking this perspective highlights the importance of
+generalization, which we suggest can be studied along several dimensions: not
+only the ability to learn something novel, but also flexibility in learning
+from different presentations, and in applying what is learned. We discuss
+broader connections to past literature in meta-learning and goal-conditioned
+agents, and other perspectives on learning and adaptation. We close by
+suggesting that research on in-context learning should consider this broader
+spectrum of in-context capabilities and types of generalization.
+
+
+
+
+
+
+
+ ♻ ☆ The Semantic Hub Hypothesis: Language Models Share Semantic
+ Representations Across Languages and Modalities
+
+
+ Modern language models can process inputs across diverse languages and
+modalities. We hypothesize that models acquire this capability through learning
+a shared representation space across heterogeneous data types (e.g., different
+languages and modalities), which places semantically similar inputs near one
+another, even if they are from different modalities/languages. We term this the
+semantic hub hypothesis, following the hub-and-spoke model from neuroscience
+(Patterson et al., 2007) which posits that semantic knowledge in the human
+brain is organized through a transmodal semantic "hub" which integrates
+information from various modality-specific "spokes" regions. We first show that
+model representations for semantically equivalent inputs in different languages
+are similar in the intermediate layers, and that this space can be interpreted
+using the model's dominant pretraining language via the logit lens. This
+tendency extends to other data types, including arithmetic expressions, code,
+and visual/audio inputs. Interventions in the shared representation space in
+one data type also predictably affect model outputs in other data types,
+suggesting that this shared representation space is not simply a vestigial
+byproduct of large-scale training on broad data, but something that is actively
+utilized by the model during input processing.
+
+
+
+
+
+
+
+ ♻ ☆ SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large
+ Language Models by Summarizing Training Trajectories of Small Models
+
+
+ Despite the effectiveness of data selection for large language models (LLMs)
+during pretraining and instruction fine-tuning phases, improving data
+efficiency in supervised fine-tuning (SFT) for specialized domains poses
+significant challenges due to the complexity of fine-tuning data. To bridge
+this gap, we introduce an effective and scalable data selection method for SFT,
+SmallToLarge (S2L), which leverages training trajectories from small models to
+guide the data selection for larger models. We demonstrate through extensive
+experiments that S2L significantly improves data efficiency in SFT for
+mathematical problem-solving, reducing the training data to just 11% of the
+original MathInstruct dataset (Yue et al., 2023) to match full dataset
+performance while outperforming state-of-the-art data selection algorithms by
+an average of 4.7% across 6 in- and out-domain evaluation datasets. Remarkably,
+selecting only 50K data for SFT, S2L achieves a 32.7% accuracy on the most
+challenging MATH (Hendrycks et al., 2021) benchmark, improving Phi-2 (Li et
+al., 2023b) by 16.6%. In clinical text summarization on the MIMIC-III dataset
+(Johnson et al., 2016), S2L again outperforms training on the full dataset
+using only 50% of the data. Notably, S2L can perform data selection using a
+reference model 40x smaller than the target model, proportionally reducing the
+cost of data selection.
+
+
+
+
+
+
+
+ ♻ ☆ WaveletGPT: Wavelets Meet Large Language Models
+
+
+ Large Language Models (LLMs) have ushered in a new wave of artificial
+intelligence advancements impacting every scientific field and discipline. They
+are trained on a simple objective: to predict the next token given the previous
+context. We live in a world where most of the data around us, e.g., text,
+audio, and music, has a multi-scale structure associated with it. This paper
+infuses LLMs with traditional signal processing ideas, namely wavelets, during
+pre-training to take advantage of the structure. Without adding any
+extra parameters to a GPT-style LLM architecture, we achieve the same
+pre-training performance almost twice as fast in text, raw audio, and symbolic
+music. This is achieved by imposing a structure on intermediate embeddings.
+When trained for the same number of training steps, we achieve significant
+gains in performance, which is comparable to pre-training a larger neural
+architecture. Our architecture allows every next token prediction access to
+intermediate embeddings at different temporal resolutions in every Transformer
+decoder block. This work will hopefully pave the way for incorporating
+multi-rate signal processing ideas into traditional LLM pre-training. Further,
+we showcase pushing model performance by improving internal structure instead
+of just going after scale.
+
+
+
+ comment: 16 pages, 4 figures
+
+
+
+
+
+
+ ♻ ☆ CNNSum: Exploring Long-Context Summarization with Large Language Models
+ in Chinese Novels
+
+
+
+
+
+
+
+
+ Lingxiao Wei, He Yan, Xiangju Lu, Junmin Zhu, Jun Wang, Wei Zhang
+
+
+ Large Language Models (LLMs) have been well-researched in many long-context
+tasks. However, due to high annotation costs, high-quality long-context summary
+datasets for training or evaluation are scarce, limiting further research. In
+this work, we introduce CNNSum, a new multi-scale Chinese long-context novel
+summarization benchmark comprising four subsets that cover lengths from 16k to
+128k, with 695 samples in total and human-driven annotations.
+We evaluate commercial and open-source models on CNNSum and conduct a detailed
+analysis. Based on the observations, we further conduct fine-tuning exploration
+with short-context summary data. In our study: (1) GPT-4o underperformed, due
+to excessive subjective commentary. (2) Currently, long-context summarization
+mainly relies on memory ability, and small LLMs with stable longer context
+lengths are the most cost-effective. Using long data concatenated from
+short-context summaries yields a significant improvement. (3) Prompt templates
+may cause a large performance gap, but this can be mitigated through
+fine-tuning. (4) Fine-tuned Chat or Instruction versions may harm the Base
+model, and further fine-tuning cannot bridge the performance gap. (5) While
+models with RoPE base scaling exhibit strong extrapolation potential, their
+performance may vary significantly when combined with other interpolation
+methods, which need careful selection. (6)
+CNNSum provides more reliable and insightful evaluation results than other
+benchmarks. We release CNNSum to advance research in this field.
+
+
+
+
+
+
+
+ ♻ ☆ Context-Informed Machine Translation of Manga using Multimodal Large
+ Language Models COLING 2025
+
+
+
+
+
+
+
+
+ Philip Lippmann, Konrad Skublicki, Joshua Tanner, Shonosuke Ishiwatari, Jie Yang
+
+
+ Due to the significant time and effort required for handcrafting
+translations, most manga never leave the domestic Japanese market. Automatic
+manga translation is a promising potential solution. However, it is a budding
+and underdeveloped field and presents complexities even greater than those
+found in standard translation due to the need to effectively incorporate visual
+elements into the translation process to resolve ambiguities. In this work, we
+investigate to what extent multimodal large language models (LLMs) can provide
+effective manga translation, thereby assisting manga authors and publishers in
+reaching wider audiences. Specifically, we propose a methodology that leverages
+the vision component of multimodal LLMs to improve translation quality,
+evaluate the impact of translation unit size and context length, and propose a
+token-efficient approach for manga translation. Moreover, we introduce a new
+evaluation dataset -- the first parallel Japanese-Polish manga translation
+dataset -- as part of a benchmark to be used in future research. Finally, we
+contribute an open-source software suite, enabling others to benchmark LLMs for
+manga translation. Our findings demonstrate that our proposed methods achieve
+state-of-the-art results for Japanese-English translation and set a new
+standard for Japanese-Polish.
+
+
+
+ comment: COLING 2025
+
+
+
+
+
+
+ ♻ ☆ Unveiling Entity-Level Unlearning for Large Language Models: A
+ Comprehensive Analysis COLING 2025
+
+
+ Large language model unlearning has garnered increasing attention due to its
+potential to address security and privacy concerns, leading to extensive
+research in the field. However, much of this research has concentrated on
+instance-level unlearning, specifically targeting the removal of predefined
+instances containing sensitive content. This focus has left a significant gap
+in the exploration of full entity-level unlearning, which is critical in
+real-world scenarios such as copyright protection. To this end, we propose a
+novel task of Entity-level unlearning, which aims to erase entity-related
+knowledge from the target model completely. To thoroughly investigate this
+task, we systematically evaluate trending unlearning algorithms, revealing that
+current methods struggle to achieve effective entity-level unlearning. Then, we
+further explore the factors that influence the performance of the unlearning
+algorithms, identifying that knowledge coverage and the size of the forget set
+play pivotal roles. Notably, our analysis also uncovers that entities
+introduced through fine-tuning are more vulnerable to unlearning than
+pre-trained entities. These findings collectively offer valuable insights for
+advancing entity-level unlearning for LLMs.
+
+
+
+
+
+
+
+
+ Laura Kopf, Philine Lou Bommer, Anna Hedström, Sebastian Lapuschkin, Marina M. -C. Höhne, Kirill Bykov
+
+
+ A crucial aspect of understanding the complex nature of Deep Neural Networks
+(DNNs) is the ability to explain learned concepts within their latent
+representations. While methods exist to connect neurons to human-understandable
+textual descriptions, evaluating the quality of these explanations is
+challenging due to the lack of a unified quantitative approach. We introduce
+CoSy (Concept Synthesis), a novel, architecture-agnostic framework for
+evaluating textual explanations of latent neurons. Given textual explanations,
+our proposed framework uses a generative model conditioned on textual input to
+create data points representing the explanations. By comparing the neuron's
+response to these generated data points and control data points, we can
+estimate the quality of the explanation. We validate our framework through
+sanity checks and benchmark various neuron description methods for Computer
+Vision tasks, revealing significant differences in quality.
+
+
+
+
+
+
+
+
+ Adithya Bhaskar, Alexander Wettig, Dan Friedman, Danqi Chen
+
+
+ The path to interpreting a language model often proceeds via analysis of
+circuits -- sparse computational subgraphs of the model that capture specific
+aspects of its behavior. Recent work has automated the task of discovering
+circuits. Yet, these methods have practical limitations, as they rely either on
+inefficient search algorithms or inaccurate approximations. In this paper, we
+frame automated circuit discovery as an optimization problem and propose Edge
+Pruning as an effective and scalable solution. Edge Pruning leverages
+gradient-based pruning techniques, but instead of removing neurons or
+components, it prunes the edges between components. Our method finds
+circuits in GPT-2 that use less than half the number of edges compared to
+circuits found by previous methods while being equally faithful to the full
+model predictions on standard circuit-finding tasks. Edge Pruning is efficient
+even with as many as 100K examples, outperforming previous methods in speed and
+producing substantially better circuits. It also perfectly recovers the
+ground-truth circuits in two models compiled with Tracr. Thanks to its
+efficiency, we scale Edge Pruning to CodeLlama-13B, a model over 100x the scale
+that prior methods operate on. We use this setting for a case study comparing
+the mechanisms behind instruction prompting and in-context learning. We find
+two circuits with more than 99.96% sparsity that match the performance of the
+full model and reveal that the mechanisms in the two settings overlap
+substantially. Our case study shows that Edge Pruning is a practical and
+scalable tool for interpretability and sheds light on behaviors that only
+emerge in large models.
+
+
+
+ comment: NeurIPS 2024 (Spotlight)
+
+
+
+
+
+
+ ♻ ☆ A Complexity-Based Theory of Compositionality
+
+
+
+
+
+
+
+
+ Eric Elmoznino, Thomas Jiralerspong, Yoshua Bengio, Guillaume Lajoie
+
+
+ Compositionality is believed to be fundamental to intelligence. In humans, it
+underlies the structure of thought, language, and higher-level reasoning. In
+AI, compositional representations can enable a powerful form of
+out-of-distribution generalization, in which a model systematically adapts to
+novel combinations of known concepts. However, while we have strong intuitions
+about what compositionality is, there currently exists no formal definition for
+it that is measurable and mathematical. Here, we propose such a definition,
+which we call representational compositionality, that accounts for and extends
+our intuitions about compositionality. The definition is conceptually simple,
+quantitative, grounded in algorithmic information theory, and applicable to any
+representation. Intuitively, representational compositionality states that a
+compositional representation satisfies three properties. First, it must be
+expressive. Second, it must be possible to re-describe the representation as a
+function of discrete symbolic sequences with re-combinable parts, analogous to
+sentences in natural language. Third, the function that relates these symbolic
+sequences to the representation, analogous to semantics in natural language,
+must be simple. Through experiments on both synthetic and real world data, we
+validate our definition of compositionality and show how it unifies disparate
+intuitions from across the literature in both AI and cognitive science. We also
+show that representational compositionality, while theoretically intractable,
+can be readily estimated using standard deep learning tools. Our definition has
+the potential to inspire the design of novel, theoretically-driven models that
+better capture the mechanisms of compositional thought.
+
+
+
+
+
+
+
+ ♻ ★ Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild NeurIPS 2024
+
+
+ As Large Language Models (LLMs) excel across tasks and specialized domains,
+scaling LLMs based on existing models has garnered significant attention, which
+faces the challenge of decreasing performance when combining disparate models.
+Various techniques have been proposed for the aggregation of pre-trained LLMs,
+including model merging, Mixture-of-Experts, and stacking. Despite their
+merits, a comprehensive comparison and synergistic application of them to a
+diverse model zoo is yet to be adequately addressed. In light of this research
+gap, this paper introduces Model-GLUE, a holistic LLM scaling guideline. First,
+our work starts with a benchmarking of existing LLM scaling techniques,
+especially selective merging, and variants of mixture. Utilizing the insights
+from the benchmark results, we formulate an optimal strategy for the selection
+and aggregation of a heterogeneous model zoo characterizing different
+architectures and initialization. Our methodology involves the clustering of
+mergeable models and optimal merging strategy selection, and the integration of
+clusters through a model mixture. Finally, evidenced by our experiments on a
+diverse Llama-2-based model zoo, Model-GLUE shows an average performance
+enhancement of 5.61%, achieved without additional training. Codes are available
+at: https://github.com/Model-GLUE/Model-GLUE.
+
+
+
+ comment: 24 pages, 4 figures, accepted to NeurIPS 2024 Datasets and Benchmarks
+ Track
+
+
+
+
+
+
+ ♻ ☆ SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving
+ Model Transformation
+
+
+ LLM inference for popular enterprise use cases, such as summarization, RAG,
+and code-generation, typically observes orders of magnitude longer prompt
+lengths than generation lengths. This characteristic leads to high cost of
+prefill and increased response latency. In this paper, we present SwiftKV, a
+novel model transformation and distillation procedure specifically designed to
+reduce the time and cost of processing prompt tokens while preserving high
+quality of generated tokens. SwiftKV combines three key mechanisms: i)
+SingleInputKV, which prefills later layers' KV cache using a much earlier
+layer's output, allowing prompt tokens to skip much of the model computation,
+ii) AcrossKV, which merges the KV caches of neighboring layers to reduce the
+memory footprint and support larger batch size for higher throughput, and iii)
+a knowledge-preserving distillation procedure that can adapt existing LLMs for
+SwiftKV with minimal accuracy impact and low compute and data requirement. For
+Llama-3.1-8B and 70B, SwiftKV reduces the compute requirement of prefill by 50%
+and the memory requirement of the KV cache by 62.5% while incurring minimum
+quality degradation across a wide range of tasks. In the end-to-end inference
+serving using an optimized vLLM implementation, SwiftKV realizes up to 2x
+higher aggregate throughput and 60% lower time per output token. It can achieve
+a staggering 560 TFlops/GPU of normalized inference throughput, which
+translates to 16K tokens/s for Llama-3.1-70B in 16-bit precision on 4x H100
+GPUs. Our training, inference, and model implementations are open-sourced and
+can be found through
+https://huggingface.co/collections/Snowflake/swiftkv-models-674f7d7474eb789e185d31cb.
+
+
+
+
+
+
+
+ ♻ ☆ RARE: Retrieval-Augmented Reasoning Enhancement for Large Language
+ Models
+
+
+ This work introduces RARE (Retrieval-Augmented Reasoning Enhancement), a
+versatile extension to the mutual reasoning framework (rStar), aimed at
+enhancing reasoning accuracy and factual integrity across large language models
+(LLMs) for complex, knowledge-intensive tasks such as commonsense and medical
+reasoning. RARE incorporates two innovative actions within the Monte Carlo Tree
+Search (MCTS) framework: A6, which generates search queries based on the
+initial problem statement, performs information retrieval using those queries,
+and augments reasoning with the retrieved data to formulate the final answer;
+and A7, which leverages information retrieval specifically for generated
+sub-questions and re-answers these sub-questions with the relevant contextual
+information. Additionally, a Retrieval-Augmented Factuality Scorer is proposed
+to replace the original discriminator, prioritizing reasoning paths that meet
+high standards of factuality. Experimental results with LLaMA 3.1 show that
+RARE enables open-source LLMs to achieve competitive performance with top
+closed-source models like GPT-4 and GPT-4o. This research establishes RARE as a
+scalable solution for improving LLMs in domains where logical coherence and
+factual integrity are critical.
+
+
+ Ontology matching (OM) enables semantic interoperability between different
+ontologies and resolves their conceptual heterogeneity by aligning related
+entities. OM systems currently have two prevailing design paradigms:
+conventional knowledge-based expert systems and newer machine learning-based
+predictive systems. While large language models (LLMs) and LLM agents have
+revolutionised data engineering and have been applied creatively in many
+domains, their potential for OM remains underexplored. This study introduces a
+novel agent-powered LLM-based design paradigm for OM systems. With
+consideration of several specific challenges in leveraging LLM agents for OM,
+we propose a generic framework, namely Agent-OM (Agent for Ontology Matching),
+consisting of two Siamese agents for retrieval and matching, with a set of
+simple OM tools. Our framework is implemented in a proof-of-concept system.
+Evaluations of three Ontology Alignment Evaluation Initiative (OAEI) tracks
+over state-of-the-art OM systems show that our system can achieve results very
+close to the long-standing best performance on simple OM tasks and can
+significantly improve the performance on complex and few-shot OM tasks.
+
+
+
+ comment: 14 pages, 13 figures, 4 tables
+
+
+
+
+
+
+ ♻ ☆ Molmo and PixMo: Open Weights and Open Data for State-of-the-Art
+ Vision-Language Models
+
+
+
+
+
+
+
+
+ Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi
+
+
+ Today's most advanced vision-language models (VLMs) remain proprietary. The
+strongest open-weight models rely heavily on synthetic data from proprietary
+VLMs to achieve good performance, effectively distilling these closed VLMs into
+open ones. As a result, the community has been missing foundational knowledge
+about how to build performant VLMs from scratch. We present Molmo, a new family
+of VLMs that are state-of-the-art in their class of openness. Our key
+contribution is a collection of new datasets called PixMo, including a dataset
+of highly detailed image captions for pre-training, a free-form image Q&A
+dataset for fine-tuning, and an innovative 2D pointing dataset, all collected
+without the use of external VLMs. The success of our approach relies on careful
+modeling choices, a well-tuned training pipeline, and, most critically, the
+quality of our newly collected datasets. Our best-in-class 72B model not only
+outperforms others in the class of open weight and data models, but also
+outperforms larger proprietary models including Claude 3.5 Sonnet, and Gemini
+1.5 Pro and Flash, second only to GPT-4o based on both academic benchmarks and
+on a large human evaluation. Our model weights, new datasets, and source code
+are available at https://molmo.allenai.org/blog.
+
+
+
+ comment: Updated with ablations and more technical details
+
+
+
+
+
+
+ ♻ ☆ Adaptive Circuit Behavior and Generalization in Mechanistic
+ Interpretability
+
+
+
+
+
+
+
+
+ Jatin Nainani, Sankaran Vaidyanathan, AJ Yeung, Kartik Gupta, David Jensen
+
+
+ Mechanistic interpretability aims to understand the inner workings of large
+neural networks by identifying circuits, or minimal subgraphs within the model
+that implement algorithms responsible for performing specific tasks. These
+circuits are typically discovered and analyzed using a narrowly defined prompt
+format. However, given the abilities of large language models (LLMs) to
+generalize across various prompt formats for the same task, it remains unclear
+how well these circuits generalize. For instance, it is unclear whether the
+model's generalization results from reusing the same circuit components, the
+components behaving differently, or the use of entirely different components.
+In this paper, we investigate the generality of the indirect object
+identification (IOI) circuit in GPT-2 small, which is well-studied and believed
+to implement a simple, interpretable algorithm. We evaluate its performance on
+prompt variants that challenge the assumptions of this algorithm. Our findings
+reveal that the circuit generalizes surprisingly well, reusing all of its
+components and mechanisms while only adding additional input edges. Notably,
+the circuit generalizes even to prompt variants where the original algorithm
+should fail; we discover a mechanism that explains this, which we term S2
+Hacking. Our findings indicate that circuits within LLMs may be more flexible
+and general than previously recognized, underscoring the importance of studying
+circuit generalization to better understand the broader capabilities of these
+models.
+
+
+
+ comment: 10 pages, 8 figures
+
+
+
+
+
+
+ ♻ ☆ Lexicalization Is All You Need: Examining the Impact of Lexical
+ Knowledge in a Compositional QALD System
+
+
+
+
+
+
+
+
+ David Maria Schmidt, Mohammad Fazleh Elahi, Philipp Cimiano
+
+
+ In this paper, we examine the impact of lexicalization on Question Answering
+over Linked Data (QALD). It is well known that one of the key challenges in
+interpreting natural language questions with respect to SPARQL lies in bridging
+the lexical gap, that is, mapping the words in the query to the correct
+vocabulary elements. We argue in this paper that lexicalization, that is,
+explicit knowledge about the potential interpretations of a word with respect
+to the given vocabulary, significantly eases the task and increases the
+performance of QA systems. Towards this goal, we present a compositional QA
+system that can leverage explicit lexical knowledge in a compositional manner
+to infer the meaning of a question in terms of a SPARQL query. We show that
+such a system, given lexical knowledge, has a performance well beyond current
+QA systems, achieving up to a 35.8% increase in the micro F1 score
+compared to the best QA system on QALD-9. This shows the importance and
+potential of including explicit lexical knowledge. In contrast, we show that
+LLMs have limited abilities to exploit lexical knowledge, with only marginal
+improvements compared to a version without lexical knowledge. This shows that
+LLMs have no ability to compositionally interpret a question on the basis of
+the meaning of its parts, a key feature of compositional approaches. Taken
+together, our work shows new avenues for QALD research, emphasizing the
+importance of lexicalization and compositionality.
+
+
+
+ comment: 24th International Conference on Knowledge Engineering and Knowledge
+ Management (EKAW 2024), November 26-28, 2024, Amsterdam, The Netherlands
+
+ The current large language models are mainly based on decoder-only
+transformers, which have strong in-context learning (ICL) capabilities. It is
+generally believed that an important foundation of this ICL capability is the
+induction heads mechanism, which requires at least two layers of attention. In
+order to implement the model's induction ability more efficiently, we
+revisit the induction heads mechanism and propose KV shifting attention. We
+theoretically prove that KV shifting attention reduces the model's
+requirements for the depth and width of the induction heads mechanism. Our
+experimental results demonstrate that KV shifting attention is beneficial to
+learning induction heads and language modeling, which leads to better
+performance or faster convergence from toy models to pre-trained models
+with more than 10B parameters.
+
+
+
+ comment: 22 pages
+
+
+
+
+
+
+ ♻ ☆ Words in Motion: Extracting Interpretable Control Vectors for Motion
+ Transformers
+
+
+ Transformer-based models generate hidden states that are difficult to
+interpret. In this work, we aim to interpret these hidden states and control
+them at inference, with a focus on motion forecasting. We use linear probes to
+measure neural collapse towards interpretable motion features in hidden states.
+High probing accuracy implies meaningful directions and distances between
+hidden states of opposing features, which we use to fit interpretable control
+vectors for activation steering at inference. To optimize our control vectors,
+we use sparse autoencoders with fully-connected, convolutional, MLPMixer layers
+and various activation functions. Notably, we show that enforcing sparsity in
+hidden states leads to a more linear relationship between control vector
+temperatures and forecasts. Our approach enables mechanistic interpretability
+and zero-shot generalization to unseen dataset characteristics with negligible
+computational overhead. Our implementation is available at
+https://github.com/kit-mrt/future-motion
+
+
+
+ comment: Add autoencoders with convolutional, MLPMixer layers, and JumpReLU
+ activations
+
+
+
+
+
+
+ ♻ ☆ SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering
+ in LLMs NeurIPS 2024
+
+
+
+
+
+
+
+
+ Ruben Härle, Felix Friedrich, Manuel Brack, Björn Deiseroth, Patrick Schramowski, Kristian Kersting
+
+
+ Large Language Models (LLMs) have demonstrated remarkable capabilities in
+generating human-like text, but their output may not be aligned with the user
+or even produce harmful content. This paper presents a novel approach to detect
+and steer concepts such as toxicity before generation. We introduce the Sparse
+Conditioned Autoencoder (SCAR), a single trained module that extends the
+otherwise untouched LLM. SCAR ensures full steerability, towards and away from
+concepts (e.g., toxic content), without compromising the quality of the model's
+text generation on standard evaluation benchmarks. We demonstrate the effective
+application of our approach through a variety of concepts, including toxicity,
+safety, and writing style alignment. As such, this work establishes a robust
+framework for controlling LLM generations, ensuring their ethical and safe
+deployment in real-world applications.
+
+
+
+ comment: Accepted at Socially Responsible Language Modelling Research (SoLaR)
+ Workshop at NeurIPS 2024
+
+ Existing Scholarly Question Answering (QA) methods typically target
+homogeneous data sources, relying solely on either text or Knowledge Graphs
+(KGs). However, scholarly information often spans heterogeneous sources,
+necessitating the development of QA systems that integrate information from
+multiple heterogeneous data sources. To address this challenge, we introduce
+Hybrid-SQuAD (Hybrid Scholarly Question Answering Dataset), a novel large-scale
+QA dataset designed to facilitate answering questions incorporating both text
+and KG facts. The dataset consists of 10.5K question-answer pairs generated by
+a large language model, leveraging the KGs DBLP and SemOpenAlex alongside
+corresponding text from Wikipedia. In addition, we propose a RAG-based baseline
+hybrid QA model, achieving an exact match score of 69.65 on the Hybrid-SQuAD
+test set.
+
+
+
+
+
+
+
+ ♻ ☆ Quest: Query-centric Data Synthesis Approach for Long-context Scaling of
+ Large Language Model
+
+
+ Recent advancements in large language models (LLMs) have highlighted the
+importance of extending context lengths for handling complex tasks. While
+traditional methods for training on long contexts often use filtered long
+documents, these approaches lead to domain imbalances, limiting model
+performance. To address this, techniques like random document concatenation
+(Standard) and similarity-based methods (KNN, ICLM) have been developed.
+However, they either sacrifice semantic coherence or diversity. To balance both
+aspects, we introduce Quest, a query-centric data synthesis method aggregating
+semantically relevant yet diverse documents. Quest uses a generative model to
+predict potential queries for each document, grouping documents with similar
+queries and keywords. Extensive experiments demonstrate Quest's superior
+performance on long-context tasks, achieving remarkable results with context
+lengths of up to 1M tokens and confirming its scalability across various model
+sizes.
+
+
+
+
+
+
+
+ ♻ ☆ HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced
+ Context Awareness and Extrapolation
+
+
+
+
+
+
+
+
+ Yuhan Chen, Ang Lv, Jian Luan, Bin Wang, Wei Liu
+
+
+ Many positional encodings (PEs) are designed to exhibit long-term decay,
+based on an entrenched and long-standing inductive opinion: tokens farther away
+from the current position carry less relevant information. We argue that
+long-term decay is outdated in the era of LLMs, as LLMs are now applied to
+tasks demanding precise retrieval of in-context information from arbitrary
+positions. Firstly, we present empirical analyses on various PEs, demonstrating
+that models inherently learn attention with only a local-decay pattern while
+forming a U-shape pattern globally, contradicting the principle of long-term
+decay. Furthermore, we conduct a detailed analysis of rotary position encoding
+(RoPE, a prevalent relative positional encoding in LLMs), and find that the
+U-shape attention is caused by some learned components, which are also the key
+factor limiting RoPE's expressiveness and extrapolation. Inspired by these
+insights, we propose High-frequency rotary Position Encoding (HoPE). HoPE
+replaces the specific components in RoPE with position-independent ones,
+retaining only high-frequency signals, which also breaks the principle of
+long-term decay in theory. HoPE achieves two major advantages: (1) Without
+constraints imposed by long-term decay, contradictory factors that limit
+spontaneous attention optimization and model extrapolation performance are
+removed. (2) Components representing positions and semantics are optimized.
+These enhance the model's context awareness and extrapolation, as validated by
+extensive experiments.
+
+
+
+
+
+
+
+ ♻ ☆ ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of
+ Large Language Models in Real-world Scenarios COLING 2025
+
+
+
+
+
+
+
+
+ Junjie Ye, Guanyu Li, Songyang Gao, Caishuang Huang, Yilong Wu, Sixian Li, Xiaoran Fan, Shihan Dou, Tao Ji, Qi Zhang, Tao Gui, Xuanjing Huang
+
+
+ Existing evaluations of tool learning primarily focus on validating the
+alignment of selected tools for large language models (LLMs) with expected
+outcomes. However, these approaches rely on a limited set of scenarios where
+answers can be pre-determined, diverging from genuine needs. Furthermore, a
+sole emphasis on outcomes disregards the complex capabilities required for LLMs
+to effectively use tools. To tackle this issue, we propose ToolEyes, a
+fine-grained system tailored for the evaluation of the LLMs' tool learning
+capabilities in authentic scenarios. The system meticulously examines seven
+real-world scenarios, analyzing five dimensions crucial to LLMs in tool
+learning: format alignment, intent comprehension, behavior planning, tool
+selection, and answer organization. Additionally, ToolEyes incorporates a tool
+library boasting approximately 600 tools, serving as an intermediary between
+LLMs and the physical world. Evaluations involving ten LLMs across three
+categories reveal a preference for specific scenarios and limited cognitive
+abilities in tool learning. Intriguingly, expanding the model size even
+exacerbates the hindrance to tool learning. The code and data are available at
+https://github.com/Junjie-Ye/ToolEyes.
+
+
+
+
+
+
+
+
+ Fred Philippy, Siwen Guo, Jacques Klein, Tegawendé F. Bissyandé
+
+
+ Sentence embedding models play a key role in various Natural Language
+Processing tasks, such as in Topic Modeling, Document Clustering and
+Recommendation Systems. However, these models rely heavily on parallel data,
+which can be scarce for many low-resource languages, including Luxembourgish.
+This scarcity results in suboptimal performance of monolingual and
+cross-lingual sentence embedding models for these languages. To address this
+issue, we compile a relatively small but high-quality human-generated
+cross-lingual parallel dataset to train LuxEmbedder, an enhanced sentence
+embedding model for Luxembourgish with strong cross-lingual capabilities.
+Additionally, we present evidence suggesting that including low-resource
+languages in parallel training datasets can be more advantageous for other
+low-resource languages than relying solely on high-resource language pairs.
+Furthermore, recognizing the lack of sentence embedding benchmarks for
+low-resource languages, we create a paraphrase detection benchmark specifically
+for Luxembourgish, aiming to partially fill this gap and promote further
+research.
+
+
+
+ comment: Accepted at COLING 2025
+
+
+
+
+
+
+ ♻ ☆ DRS: Deep Question Reformulation With Structured Output
+
+
+ Question answering represents a core capability of large language models
+(LLMs). However, when individuals encounter unfamiliar knowledge in texts, they
+often formulate questions that the text itself cannot answer due to
+insufficient understanding of the underlying information. Recent studies reveal
+that while LLMs can detect unanswerable questions, they struggle to assist
+users in reformulating these questions. Even advanced models like GPT-3.5
+demonstrate limited effectiveness in this regard. To address this limitation,
+we propose DRS: Deep Question Reformulation with Structured Output, a novel
+zero-shot method aimed at enhancing LLMs' ability to assist users in
+reformulating questions to extract relevant information from new documents. DRS
+combines the strengths of LLMs with a DFS-based algorithm to iteratively
+explore potential entity combinations and constrain outputs using predefined
+entities. This structured approach significantly enhances the reformulation
+capabilities of LLMs. Comprehensive experimental evaluations demonstrate that
+DRS improves the reformulation accuracy of GPT-3.5 from 23.03% to 70.42%, while
+also enhancing the performance of open-source models, such as Gemma2-9B, from
+26.35% to 56.75%.
+
+
+
+
+
+
+
+ ♻ ☆ A Little Goes a Long Way: Efficient Long Context Training and Inference
+ with Partial Contexts
+
+
+
+
+
+
+
+
+ Suyu Ge, Xihui Lin, Yunan Zhang, Jiawei Han, Hao Peng
+
+
+ Training and serving long-context large language models (LLMs) incurs
+substantial overhead. To address this, two critical steps are often required: a
+pretrained LLM typically undergoes a separate stage for context length
+extension by training on long-context data, followed by architectural
+modifications to reduce the overhead of KV cache during serving. This paper
+argues that integrating length extension with a GPU-friendly KV cache reduction
+architecture not only reduces training overhead during length extension, but
+also achieves better long-context performance. This leads to our proposed
+LongGen, which finetunes a pretrained LLM into an efficient architecture during
+length extension. LongGen builds on three key insights: (1) Sparse attention
+patterns, such as window attention (attending to recent tokens), attention sink
+(initial ones), and blockwise sparse attention (strided token blocks) are
+well-suited for building efficient long-context models, primarily due to their
+GPU-friendly memory access patterns, enabling efficiency gains not just
+theoretically but in practice as well. (2) It is essential for the model to
+have direct access to all tokens. A hybrid architecture with 1/3 full attention
+layers and 2/3 efficient ones achieves a balanced trade-off between efficiency
+and long-context performance. (3) Lightweight training on 5B long-context data
+is sufficient to extend the hybrid model's context length from 4K to 128K.
+ We evaluate LongGen on both Llama-2 7B and Llama-2 70B, demonstrating its
+effectiveness across different scales. During training with 128K-long contexts,
+LongGen achieves 1.55x training speedup and reduces wall-clock time by 36%,
+compared to a full-attention baseline. During inference, LongGen reduces KV
+cache memory by 62%, achieving 1.67x prefilling speedup and 1.41x decoding
+speedup.
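The three sparse patterns named in insight (1) can be illustrated as boolean attention masks. This is a minimal sketch, not the LongGen implementation: the function name, the default sizes, and the reading of "strided token blocks" as "every `stride`-th block of `block` keys" are all assumptions made for illustration.

```python
def sparse_causal_mask(seq_len, window=4, n_sink=2, block=2, stride=3):
    """Return mask[q][k] == True where query q may attend to key k.

    Hypothetical illustration: combines window attention, attention sinks,
    and a strided blockwise pattern under a causal constraint.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q                                # never look ahead
            in_window = (q - k) < window                   # recent tokens
            is_sink = k < n_sink                           # initial tokens
            in_strided_block = (k // block) % stride == 0  # every stride-th block
            row.append(causal and (in_window or is_sink or in_strided_block))
        mask.append(row)
    return mask


def density(mask):
    """Fraction of causal (lower-triangular) entries that are kept."""
    n = len(mask)
    kept = sum(sum(row) for row in mask)
    return kept / (n * (n + 1) // 2)


mask = sparse_causal_mask(16)
print(f"kept {density(mask):.0%} of the causal attention entries")
```

Because each pattern selects contiguous runs of keys, the kept entries form blocks rather than scattered points, which is the kind of memory access the abstract describes as GPU-friendly.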
+
+
+
+
+
+
+
+ ♻ ☆ Concept Based Continuous Prompts for Interpretable Text Classification
+
+
+ Continuous prompts have become widely adopted for augmenting performance
+across a wide range of natural language tasks. However, the underlying
+mechanism of this enhancement remains obscure. Previous studies rely on
+individual words for interpreting continuous prompts, which lacks comprehensive
+semantic understanding. Drawing inspiration from Concept Bottleneck Models, we
+propose a framework for interpreting continuous prompts by decomposing them
+into human-readable concepts. Specifically, to ensure the feasibility of the
+decomposition, we demonstrate that a corresponding concept embedding matrix and
+a coefficient matrix can always be found to replace the prompt embedding
+matrix. Then, we employ GPT-4o to generate a concept pool and choose potential
+candidate concepts that are discriminative and representative using a novel
+submodular optimization algorithm. Experiments demonstrate that our framework
+can achieve results similar to the original P-tuning and word-based approaches
+using only a few concepts while providing more plausible results. Our code is
+available at https://github.com/qq31415926/CD.
+
+
+
+
+
+
+
+ ♻ ☆ Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language
+ Models NeurIPS 2024
+
+
+
+
+
+
+
+
+ Jiacheng Ye, Shansan Gong, Liheng Chen, Lin Zheng, Jiahui Gao, Han Shi, Chuan Wu, Xin Jiang, Zhenguo Li, Wei Bi, Lingpeng Kong
+
+
+ Recently, diffusion models have garnered significant interest in the field of
+text processing due to their many potential advantages compared to conventional
+autoregressive models. In this work, we propose Diffusion-of-Thought (DoT), a
+novel approach that integrates diffusion models with Chain-of-Thought, a
+well-established technique for improving the reasoning ability of
+autoregressive language models. In contrast to autoregressive language models
+that make decisions in a left-to-right, token-by-token manner, DoT allows
+reasoning steps to diffuse over time through a diffusion language model and
+offers greater flexibility in trading off computation for reasoning
+performance. Our experimental results demonstrate the effectiveness of DoT in
+multi-digit multiplication, boolean logic, and grade school math problems, with
+a small diffusion model outperforming a much larger autoregressive model in
+both efficiency and accuracy. In addition to that, DoT showcases promising
+self-correction abilities and benefits from existing reasoning-enhancing
+techniques like self-consistency decoding. Our findings contribute to the
+understanding and development of reasoning with diffusion language models.
+
+
+
+ comment: NeurIPS 2024
+
+
+
+
+
+
+ ♻ ☆ Text-Tuple-Table: Towards Information Integration in Text-to-Table
+ Generation via Global Tuple Extraction EMNLP 2024
+
+
+ The task of condensing large chunks of textual information into concise and
+structured tables has gained attention recently due to the emergence of Large
+Language Models (LLMs) and their potential benefit for downstream tasks, such
+as text summarization and text mining. Previous approaches often generate
+tables that directly replicate information from the text, limiting their
+applicability in broader contexts, as text-to-table generation in real-life
+scenarios necessitates information extraction, reasoning, and integration.
+However, there is a lack of both datasets and methodologies towards this task.
+In this paper, we introduce LiveSum, a new benchmark dataset created for
+generating summary tables of competitions based on real-time commentary texts.
+We evaluate the performances of state-of-the-art LLMs on this task in both
+fine-tuning and zero-shot settings, and additionally propose a novel pipeline
+called $T^3$(Text-Tuple-Table) to improve their performances. Extensive
+experimental results demonstrate that LLMs still struggle with this task even
+after fine-tuning, while our approach can offer substantial performance gains
+without explicit training. Further analyses demonstrate that our method
+exhibits strong generalization abilities, surpassing previous approaches on
+several other text-to-table datasets. Our code and data can be found at
+https://github.com/HKUST-KnowComp/LiveSum.
+
+
+ Addressing the disparity between forecasts and actual results can enable
+individuals to expand their thought processes and stimulate self-reflection,
+thus promoting accurate planning. In this research, we present **PreAct**, an
+agent framework that integrates **pre**diction, **rea**soning, and **act**ion.
+By utilizing the information derived from predictions, the large language model
+(LLM) agent can provide a wider range and more strategically focused reasoning.
+This leads to more efficient actions that aid the agent in accomplishing
+intricate tasks. Our experimental results show that PreAct surpasses the ReAct
+method in completing complex tasks and that PreAct's performance can be further
+improved when paired with other memory or selection strategy techniques. We
+presented the model with varying quantities of historical predictions and
+discovered that these predictions consistently enhance LLM planning. The
+variances in single-step reasoning between PreAct and ReAct indicate that
+PreAct indeed has benefits in terms of diversity and strategic orientation over
+ReAct.
+
+
+
+ comment: Coling 2025
+
+
+
+
+
+
+ ♻ ☆ Network Formation and Dynamics Among Multi-LLMs
+
+
+ Social networks fundamentally shape human opinions, behaviors, and the
+dissemination of information. As large language models (LLMs) like GPT, Claude,
+and Llama increasingly integrate into social and professional settings,
+understanding their behavior in the context of social interactions and network
+formation becomes essential. This study develops a framework to systematically
+examine whether the network formation behaviors of multiple LLMs approximate
+certain aspects of human network dynamics. By simulating interactions among LLM
+agents across various model families, we observe that these models consistently
+exhibit key patterns associated with social network principles including
+preferential attachment, triadic closure, homophily, community structure, and
+the small-world phenomenon when forming networks. Moreover, LLMs adapt their
+network formation strategies based on each network's characteristics,
+reflecting the context-dependent nature of human behavior: in Facebook
+networks, they prioritize triadic closure and homophily, mirroring close-knit
+friendships; in phone networks, homophily and preferential attachment dominate,
+capturing personal and professional connections, while in employment networks,
+LLMs favor heterophily and high-degree connections, aligning with career
+advancement dynamics. These results open new avenues for using LLMs in network
+science research, with potential applications in agent-based modeling and
+synthetic network generation.
+
+
+
+
+
+
+
+ ♻ ☆ Yi-Lightning Technical Report
+
+
+
+
+
+
+
+
+ 01. AI, :, Alan Wake, Albert Wang, Bei Chen, C. X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qicheng Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang
+
+
+ This technical report presents Yi-Lightning, our latest flagship large
+language model (LLM). It achieves exceptional performance, ranking 6th overall
+on Chatbot Arena, with particularly strong results (2nd to 4th place) in
+specialized categories including Chinese, Math, Coding, and Hard Prompts.
+Yi-Lightning leverages an enhanced Mixture-of-Experts (MoE) architecture,
+featuring advanced expert segmentation and routing mechanisms coupled with
+optimized KV-caching techniques. Our development process encompasses
+comprehensive pre-training, supervised fine-tuning (SFT), and reinforcement
+learning from human feedback (RLHF), where we devise deliberate strategies for
+multi-stage training, synthetic data construction, and reward modeling.
+Furthermore, we implement RAISE (Responsible AI Safety Engine), a
+four-component framework to address safety issues across pre-training,
+post-training, and serving phases. Empowered by our scalable super-computing
+infrastructure, all these innovations substantially reduce training, deployment
+and inference costs while maintaining high-performance standards. With further
+evaluations on public academic benchmarks, Yi-Lightning demonstrates
+competitive performance against top-tier LLMs, while we observe a notable
+disparity between traditional, static benchmark results and real-world, dynamic
+human preferences. This observation prompts a critical reassessment of
+conventional benchmarks' utility in guiding the development of more intelligent
+and powerful AI systems for practical applications. Yi-Lightning is now
+available through our developer platform at https://platform.lingyiwanwu.com.
+
+
+
+
+
+
+
+ ♻ ☆ Social Life Simulation for Non-Cognitive Skills Learning
+
+
+ Non-cognitive skills are crucial for personal and social life well-being, and
+such skill development can be supported by narrative-based (e.g., storytelling)
+technologies. While generative AI enables interactive and role-playing
+storytelling, little is known about how users engage with and perceive the use
+of AI in social life simulation for non-cognitive skills learning.
+Additionally, the benefits of AI mentorship on self-reflection awareness and
+ability in this context remain largely underexplored. To this end, we
+introduced Simulife++, an interactive platform enabled by a large language
+model (LLM). The system allows users to act as protagonists, creating stories
+with one or multiple AI-based characters in diverse social scenarios. In
+particular, we expanded the Human-AI interaction to a Human-AI-AI collaboration
+by including a Sage Agent, who acts as a bystander, providing users with some
+perspectives and guidance on their choices and conversations in terms of
+non-cognitive skills to promote reflection. In a within-subject user study, our
+quantitative results reveal that, when accompanied by Sage Agent, users exhibit
+significantly higher levels of reflection on motivation, self-perceptions, and
+resilience & coping, along with an enhanced experience of narrative
+transportation. Additionally, our qualitative findings suggest that Sage Agent
+plays a crucial role in promoting reflection on non-cognitive skills, enhancing
+social communication and decision-making performance, and improving overall
+user experience within Simulife++. Multiple supportive relationships between
+Sage Agent and users were also reported. We offer design implications for the
+application of generative AI in narrative solutions and the future potential of
+Sage Agent for non-cognitive skill development in broader social contexts.
+
+
+ Despite impressive advancements in recent multimodal reasoning approaches,
+they are still limited in flexibility and efficiency, as these models typically
+process only a few fixed modality inputs and require updates to numerous
+parameters. This paper tackles these critical challenges and proposes CREMA, a
+generalizable, highly efficient, and modular modality-fusion framework that can
+incorporate any new modality to enhance video reasoning. We first augment
+multiple informative modalities (such as optical flow, 3D point cloud, audio,
+thermal heatmap, and touch map) from given videos without extra human
+annotation by leveraging sensors or existing pre-trained models. Next, we
+introduce a query transformer with multiple parameter-efficient modules
+associated with each accessible modality. It projects diverse modality features
+to the LLM token embedding space, allowing the model to integrate different
+data types for response generation. Furthermore, we propose a novel progressive
+multimodal fusion design supported by a lightweight fusion module and
+modality-sequential training strategy. It helps compress information across
+various assisting modalities, maintaining computational efficiency in the LLM
+while improving performance. We validate our method on 7 video-language
+reasoning tasks assisted by diverse modalities, including conventional VideoQA
+and Video-Audio/3D/Touch/Thermal QA, and achieve better/equivalent performance
+against strong multimodal LLMs, including OneLLM, BLIP-2, and SeViLA while
+reducing over 90% trainable parameters. We provide extensive analyses of CREMA,
+including the impact of each modality on reasoning domains, the design of the
+fusion module, and example visualizations.
+
+
+
+ comment: first two authors contributed equally. Project page:
+ https://CREMA-VideoLLM.github.io/
+
+
+
+
+
+
+ ♻ ☆ Calibrating Reasoning in Language Models with Internal Consistency NeurIPS 2024
+
+
+ Large language models (LLMs) have demonstrated impressive capabilities in
+various reasoning tasks, aided by techniques like chain-of-thought prompting
+that elicits verbalized reasoning. However, LLMs often generate text with
+obvious mistakes and contradictions, raising doubts about their ability to
+robustly process and utilize generated rationales. In this work, we investigate
+reasoning in LLMs through the lens of internal representations, focusing on how
+these representations are influenced by generated rationales. Our preliminary
+analysis reveals that while generated rationales improve answer accuracy,
+inconsistencies emerge between the model's internal representations in middle
+layers and those in final layers, potentially undermining the reliability of
+their reasoning processes. To address this, we propose internal consistency as
+a measure of the model's confidence by examining the agreement of latent
+predictions decoded from intermediate layers. Extensive empirical studies
+across different models and datasets demonstrate that internal consistency
+effectively distinguishes between correct and incorrect reasoning paths.
+Motivated by this, we propose a new approach to calibrate reasoning by
+up-weighting reasoning paths with high internal consistency, resulting in a
+significant boost in reasoning performance. Further analysis uncovers distinct
+patterns in attention and feed-forward modules across layers, providing
+insights into the emergence of internal inconsistency. In summary, our results
+demonstrate the potential of using internal representations for self-evaluation
+of LLMs. Our code is available at github.com/zhxieml/internal-consistency.
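The calibration idea described above (up-weighting reasoning paths with high internal consistency) can be sketched as a weighted vote. This is a simplified illustration under stated assumptions, not the authors' code: here internal consistency is taken to be the fraction of intermediate-layer predictions that agree with the final prediction, and the function names and toy data are hypothetical.

```python
from collections import defaultdict


def internal_consistency(layer_preds, final_pred):
    """Share of intermediate-layer predictions that match the final one (assumed definition)."""
    return sum(p == final_pred for p in layer_preds) / len(layer_preds)


def weighted_vote(paths):
    """Pick the answer whose supporting paths have the highest total consistency.

    Each path is a pair: (predictions decoded from intermediate layers, final answer).
    """
    scores = defaultdict(float)
    for layer_preds, final_pred in paths:
        scores[final_pred] += internal_consistency(layer_preds, final_pred)
    return max(scores, key=scores.get)


# Toy data: two of three sampled paths answer "B", but their layers disagree.
paths = [
    (["A", "A", "A", "A"], "A"),  # consistency 1.00
    (["A", "A", "A", "B"], "B"),  # consistency 0.25
    (["B", "A", "A", "B"], "B"),  # consistency 0.50
]
print(weighted_vote(paths))
```

In this toy case a plain majority vote would return "B", while the consistency-weighted vote returns "A" because the single "A" path is fully consistent across layers.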
+
+
+
+ comment: NeurIPS 2024 camera ready
+
+
+
+
+
+
+ ♻ ☆ From Pixels to Insights: A Survey on Automatic Chart Understanding in
+ the Era of Large Foundation Models
+
+
+
+
+
+
+
+
+ Kung-Hsiang Huang, Hou Pong Chan, Yi R. Fung, Haoyi Qiu, Mingyang Zhou, Shafiq Joty, Shih-Fu Chang, Heng Ji
+
+
+ Data visualization in the form of charts plays a pivotal role in data
+analysis, offering critical insights and aiding in informed decision-making.
+Automatic chart understanding has witnessed significant advancements with the
+rise of large foundation models in recent years. Foundation models, such as
+large language models, have revolutionized various natural language processing
+tasks and are increasingly being applied to chart understanding tasks. This
+survey paper provides a comprehensive overview of the recent developments,
+challenges, and future directions in chart understanding within the context of
+these foundation models. We review fundamental building blocks crucial for
+studying chart understanding tasks. Additionally, we explore various tasks and
+their evaluation metrics and sources of both charts and textual inputs. Various
+modeling strategies are then examined, encompassing both classification-based
+and generation-based approaches, along with tool augmentation techniques that
+enhance chart understanding performance. Furthermore, we discuss the
+state-of-the-art performance on each task and how it can be improved.
+Challenges and future directions are addressed, highlighting the
+importance of several topics, such as domain-specific charts, lack of efforts
+in developing evaluation metrics, and agent-oriented settings. This survey
+paper serves as a comprehensive resource for researchers and practitioners in
+the fields of natural language processing, computer vision, and data analysis,
+providing valuable insights and directions for future research in chart
+understanding leveraging large foundation models. The studies mentioned in this
+paper, along with emerging new research, will be continually updated at:
+https://github.com/khuangaf/Awesome-Chart-Understanding.
+
+
+
+ comment: IEEE Transactions on Knowledge and Data Engineering (TKDE)
+
+ Methods for knowledge editing and unlearning in large language models seek to
+edit or remove undesirable knowledge or capabilities without compromising
+general language modeling performance. This work investigates how mechanistic
+interpretability -- which, in part, aims to identify model components
+(circuits) associated with specific interpretable mechanisms that make up a model
+capability -- can improve the precision and effectiveness of editing and
+unlearning. We find a stark difference in unlearning and edit robustness when
+training components localized by different methods. We highlight an important
+distinction between methods that localize components based primarily on
+preserving outputs, and those finding high level mechanisms with predictable
+intermediate states. In particular, localizing edits/unlearning to components
+associated with the lookup-table mechanism for factual recall 1) leads to more
+robust edits/unlearning across different input/output formats, and 2) resists
+attempts to relearn the unwanted information, while also reducing unintended
+side effects compared to baselines, on both a sports facts dataset and the
+CounterFact dataset across multiple models. We also find that certain localized
+edits disrupt the latent knowledge in the model more than any other baselines,
+making unlearning more robust to various attacks.
+
+
+
+ comment: 31 pages, 45 figures, 7 tables
+
+
+
+
+
+
+ ♻ ☆ VersaTune: An Efficient Data Composition Framework for Training
+ Multi-Capability LLMs
+
+
+
+
+
+
+
+
+ Keer Lu, Keshi Zhao, Zheng Liang, Da Pan, Shusen Zhang, Xin Wu, Weipeng Chen, Zenan Zhou, Guosheng Dong, Bin Cui, Wentao Zhang
+
+
+ Large-scale pretrained models, particularly Large Language Models (LLMs),
+have exhibited remarkable capabilities in handling multiple tasks across
+domains due to their emergent properties. These capabilities are further
+augmented during the Supervised Fine-Tuning (SFT) phase. Despite their
+potential, existing work mainly focuses on domain-specific enhancements during
+fine-tuning, the challenge of which lies in catastrophic forgetting of
+knowledge across other domains. In this study, we introduce VersaTune, a novel
+data composition framework designed for enhancing LLMs' overall multi-ability
+performances during training. We categorize knowledge into distinct domains
+including law, medicine, finance, science, code, etc. We begin with detecting
+the distribution of domain-specific knowledge within the base model, followed
+by the training data composition that aligns with the model's existing
+knowledge distribution. During the training process, domain weights are
+dynamically adjusted based on their learnable potential and forgetting degree.
+Experimental results demonstrate that VersaTune achieves significant
+improvements in multi-domain performance, with a 35.21% enhancement in
+comprehensive multi-domain tasks. Additionally, in scenarios where specific
+domain optimization is required, VersaTune reduces the degradation of
+performance in other domains by 38.77%, without compromising the target
+domain's training efficacy.
+
+
+
+
+
+
+
+ ♻ ☆ UniPoll: A Unified Social Media Poll Generation Framework via
+ Multi-Objective Optimization
+
+
+
+
+
+
+
+
+ Yixia Li, Rong Xiang, Yanlin Song, Jing Li
+
+
+ Social media platforms are vital for expressing opinions and understanding
+public sentiment, yet many analytical tools overlook passive users who mainly
+consume content without engaging actively. To address this, we introduce
+UniPoll, an advanced framework designed to automatically generate polls from
+social media posts using sophisticated natural language generation (NLG)
+techniques. Unlike traditional methods that struggle with social media's
+informal and context-sensitive nature, UniPoll leverages enriched contexts from
+user comments and employs multi-objective optimization to enhance poll
+relevance and engagement. To tackle the inherently noisy nature of social media
+data, UniPoll incorporates Retrieval-Augmented Generation (RAG) and synthetic
+data generation, ensuring robust performance across real-world scenarios. The
+framework surpasses existing models, including T5, ChatGLM3, and GPT-3.5, in
+generating coherent and contextually appropriate question-answer pairs.
+Evaluated on the Chinese WeiboPolls dataset and the newly introduced English
+RedditPolls dataset, UniPoll demonstrates superior cross-lingual and
+cross-platform capabilities, making it a potent tool to boost user engagement
+and create a more inclusive environment for interaction.
+
+
+
+ comment: Accepted by IEEE Transactions on Neural Networks and Learning
+ Systems. Project page is live at https://uni-poll.github.io . Code is
+ available at https://github.com/X1AOX1A/UniPoll
+
+
+
+
+
+
+ ♻ ☆ ScribeAgent: Towards Specialized Web Agents Using Production-Scale
+ Workflow Data
+
+
+ Large Language Model (LLM) agents are rapidly improving to handle
+increasingly complex web-based tasks. Most of these agents rely on
+general-purpose, proprietary models like GPT-4 and focus on designing better
+prompts to improve their planning abilities. However, general-purpose LLMs are
+not specifically trained to understand specialized web contexts such as HTML,
+and they often struggle with long-horizon planning. We explore an alternative
+approach that fine-tunes open-source LLMs using production-scale workflow data
+collected from over 250 domains corresponding to 6 billion tokens. This simple
+yet effective approach shows substantial gains over prompting-based agents on
+existing benchmarks -- ScribeAgent achieves state-of-the-art direct generation
+performance on Mind2Web and improves the task success rate by 7.3% over the
+previous best text-only web agents on WebArena. We further perform detailed
+ablation studies on various fine-tuning design choices and provide insights
+into LLM selection, training recipes, context window optimization, and effect
+of dataset sizes.
+
+
+
+
+
+
+
+ ♻ ☆ Designing LLM Chains by Adapting Techniques from Crowdsourcing Workflows
+
+
+
+
+
+
+
+
+ Madeleine Grunde-McLaughlin, Michelle S. Lam, Ranjay Krishna, Daniel S. Weld, Jeffrey Heer
+
+
+ LLM chains enable complex tasks by decomposing work into a sequence of
+subtasks. Similarly, the more established techniques of crowdsourcing workflows
+decompose complex tasks into smaller tasks for human crowdworkers. Chains
+address LLM errors analogously to the way crowdsourcing workflows address human
+error. To characterize opportunities for LLM chaining, we survey 107 papers
+across the crowdsourcing and chaining literature to construct a design space
+for chain development. The design space covers a designer's objectives and the
+tactics used to build workflows. We then surface strategies that mediate how
+workflows use tactics to achieve objectives. To explore how techniques from
+crowdsourcing may apply to chaining, we adapt crowdsourcing workflows to
+implement LLM chains across three case studies: creating a taxonomy, shortening
+text, and writing a short story. From the design space and our case studies, we
+identify takeaways for effective chain design and raise implications for future
+research and development.
+
+
+ In the basic recommendation paradigm, the most (predicted) relevant item is
+recommended to each user. This may result in some items receiving lower
+exposure than they "should"; to counter this, several algorithmic approaches
+have been developed to ensure item fairness. These approaches necessarily
+degrade recommendations for some users to improve outcomes for items, leading
+to user fairness concerns. In turn, a recent line of work has focused on
+developing algorithms for multi-sided fairness, to jointly optimize user
+fairness, item fairness, and overall recommendation quality. This induces the
+question: what is the tradeoff between these objectives, and what are the
+characteristics of (multi-objective) optimal solutions? Theoretically, we
+develop a model of recommendations with user and item fairness objectives and
+characterize the solutions of fairness-constrained optimization. We identify
+two phenomena: (a) when user preferences are diverse, there is "free" item and
+user fairness; and (b) users whose preferences are misestimated can be
+especially disadvantaged by item fairness constraints. Empirically, we
+prototype a recommendation system for preprints on arXiv and implement our
+framework, measuring the phenomena in practice and showing how these phenomena
+inform the design of markets with recommendation systems-intermediated
+matching.
+
+
+
+ comment: Accepted at the Thirty-Eighth Annual Conference on Neural Information
+ Processing Systems
+
+
+
+
+
+
+ ☆ Graph-Sequential Alignment and Uniformity: Toward Enhanced
+ Recommendation Systems
+
+
+ Graph-based and sequential methods are two popular recommendation paradigms,
+each excelling in its domain but lacking the ability to leverage signals from
+the other. To address this, we propose a novel method that integrates both
+approaches for enhanced performance. Our framework uses Graph Neural Network
+(GNN)-based and sequential recommenders as separate submodules while sharing a
+unified embedding space optimized jointly. To enable positive knowledge
+transfer, we design a loss function that enforces alignment and uniformity both
+within and across submodules. Experiments on three real-world datasets
+demonstrate that the proposed method significantly outperforms using either
+approach alone and achieves state-of-the-art results. Our implementations are
+publicly available at https://github.com/YuweiCao-UIC/GSAU.git.
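The abstract does not give the loss itself. As a point of reference, alignment and uniformity are commonly defined on L2-normalized embeddings as the mean squared distance between positive pairs and the log of the mean Gaussian potential over all pairs. The sketch below implements those common definitions in plain Python; it is an assumption that the paper's loss builds on them, and the function names and the temperature `t=2.0` are illustrative choices.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment(pairs):
    """Mean squared distance between normalized positive pairs (lower is better)."""
    return sum(sq_dist(normalize(u), normalize(v)) for u, v in pairs) / len(pairs)


def uniformity(embs, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs (lower is better)."""
    zs = [normalize(e) for e in embs]
    vals = [
        math.exp(-t * sq_dist(zs[i], zs[j]))
        for i in range(len(zs))
        for j in range(i + 1, len(zs))
    ]
    return math.log(sum(vals) / len(vals))


spread = [(1, 0), (-1, 0), (0, 1), (0, -1)]
clustered = [(1, 0), (1, 0.01), (1, -0.01), (1, 0.02)]
print(uniformity(spread), uniformity(clustered))
```

Embeddings spread over the unit circle score lower (better) on uniformity than embeddings clustered around one direction, which is the behavior a joint loss of this kind rewards.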
+
+
+
+ comment: Under review
+
+
+
+
+
+
+ ☆ PoTable: Programming Standardly on Table-based Reasoning Like a Human
+ Analyst
+
+
+ Table-based reasoning has garnered substantial research interest,
+particularly in its integration with Large Language Model (LLM) which has
+revolutionized the general reasoning paradigm. Numerous LLM-based studies
+introduce symbolic tools (e.g., databases, Python) as assistants to extend
+human-like abilities in structured table understanding and complex arithmetic
+computations. However, these studies could do better at simulating human
+cognitive behavior when using symbolic tools, as they still suffer from
+limitations of non-standard logical splits and constrained operation pools. In
+this study, we propose PoTable as a novel table-based reasoning method that
+simulates a human tabular analyst, which integrates a Python interpreter as the
+real-time executor accompanied by an LLM-based operation planner and code
+generator. Specifically, PoTable follows a human-like logical stage split and
+extends the operation pool into an open-world space without any constraints.
+Through planning and executing in each distinct stage, PoTable standardly
+completes the entire reasoning process and produces superior reasoning results
+along with highly accurate, step-by-step commented, and fully executable
+programs. Accordingly, the effectiveness and explainability of PoTable are
+fully demonstrated. Extensive experiments over three evaluation datasets from
+two public benchmarks on two backbones show the outstanding performance of our
+approach. In particular, GPT-based PoTable achieves over 4% higher absolute
+accuracy than the runners-up on all evaluation datasets.
+
+
+
+ comment: 12 pages, 4 figures
+
+
+
+
+
+
+ ☆ Pre-train, Align, and Disentangle: Empowering Sequential Recommendation
+ with Large Language Models
+
+
+ Sequential recommendation (SR) aims to model the sequential dependencies in
+users' historical interactions to better capture their evolving interests.
+However, existing SR approaches primarily rely on collaborative data, which
+leads to limitations such as the cold-start problem and sub-optimal
+performance. Meanwhile, despite the success of large language models (LLMs),
+their application in industrial recommender systems is hindered by high
+inference latency, inability to capture all distribution statistics, and
+catastrophic forgetting. To this end, we propose a novel Pre-train, Align, and
+Disentangle (PAD) paradigm to empower recommendation models with LLMs.
+Specifically, we first pre-train both the SR and LLM models to get
+collaborative and textual embeddings. Next, a characteristic
+recommendation-anchored alignment loss is proposed using multi-kernel maximum
+mean discrepancy with Gaussian kernels. Finally, a triple-experts architecture,
+consisting of aligned and modality-specific experts with disentangled embeddings,
+is fine-tuned in a frequency-aware manner. Experiments conducted on three
+public datasets demonstrate the effectiveness of PAD, showing significant
+improvements and compatibility with various SR backbone models, especially on
+cold items. The implementation code and datasets will be publicly available.
+
+
+
+
+
+
+
+
+ Binbin Hu, Zhicheng An, Zhengwei Wu, Ke Tu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou, Yufei Feng, Jiawei Chen
+
+
+ Estimating individual treatment effects (ITE) from observational data is a
+critical task across various domains. However, many existing works on ITE
+estimation overlook the influence of hidden confounders, which remain
+unobserved at the individual unit level. To address this limitation,
+researchers have utilized graph neural networks to aggregate neighbors'
+features to capture the hidden confounders and mitigate confounding bias by
+minimizing the discrepancy of confounder representations between the treated
+and control groups. Despite the success of these approaches, practical
+scenarios often treat all features as confounders and involve substantial
+differences in feature distributions between the treated and control groups.
+Confusing adjustment features with confounders and enforcing strict balance on the
+confounder representations could potentially undermine the effectiveness of
+outcome prediction. To mitigate this issue, we propose a novel framework called
+the Graph Disentangle Causal model (GDC) to conduct ITE estimation in
+the network setting. GDC utilizes a causal disentangle module to separate unit
+features into adjustment and confounder representations. Then we design a graph
+aggregation module consisting of three distinct graph aggregators to obtain
+adjustment, confounder, and counterfactual confounder representations. Finally,
+a causal constraint module is employed to enforce the disentangled
+representations as true causal factors. The effectiveness of our proposed
+method is demonstrated by conducting comprehensive experiments on two networked
+datasets.
+
+
+
+ comment: Accepted by WSDM 2025
+
+
+
+
+
+
+ ☆ Learning to Hash for Recommendation: A Survey
+
+
+ With the explosive growth of users and items, Recommender Systems (RS) are
+facing unprecedented challenges on both retrieval efficiency and storage cost.
+Fortunately, Learning to Hash (L2H) techniques have been shown to be a promising
+solution to address the two dilemmas, whose core idea is encoding
+high-dimensional data into compact hash codes. To this end, L2H for RS (HashRec
+for short) has recently received widespread attention to support large-scale
+recommendations. In this survey, we present a comprehensive review of current
+HashRec algorithms. Specifically, we first introduce the commonly used
+two-tower models in the recall stage and identify two search strategies
+frequently employed in L2H. Then, we categorize prior works into a two-tier
+taxonomy based on: (i) the type of loss function and (ii) the optimization
+strategy. We also introduce some commonly used evaluation metrics to measure
+the performance of HashRec algorithms. Finally, we shed light on the
+limitations of the current research and outline the future research directions.
+Furthermore, the summary of HashRec methods reviewed in this survey can be
+found at https://github.com/Luo-Fangyuan/HashRec.
+
+
+ Ontology matching (OM) enables semantic interoperability between different
+ontologies and resolves their conceptual heterogeneity by aligning related
+entities. OM systems currently have two prevailing design paradigms:
+conventional knowledge-based expert systems and newer machine learning-based
+predictive systems. While large language models (LLMs) and LLM agents have
+revolutionised data engineering and have been applied creatively in many
+domains, their potential for OM remains underexplored. This study introduces a
+novel agent-powered LLM-based design paradigm for OM systems. With
+consideration of several specific challenges in leveraging LLM agents for OM,
+we propose a generic framework, namely Agent-OM (Agent for Ontology Matching),
+consisting of two Siamese agents for retrieval and matching, with a set of
+simple OM tools. Our framework is implemented in a proof-of-concept system.
+Evaluations on three Ontology Alignment Evaluation Initiative (OAEI) tracks
+against state-of-the-art OM systems show that our system can achieve results very
+close to the long-standing best performance on simple OM tasks and can
+significantly improve the performance on complex and few-shot OM tasks.
+
+
+
+ comment: 14 pages, 13 figures, 4 tables
+
+
+
+
+
+
+ ♻ ☆ TiM4Rec: An Efficient Sequential Recommendation Model Based on
+ Time-Aware Structured State Space Duality Model
+
+
+ The Sequential Recommendation modeling paradigm is shifting from Transformer
+to Mamba architecture, which comprises two generations: Mamba1, based on the
+State Space Model (SSM), and Mamba2, based on State Space Duality (SSD).
+Although SSD offers superior computational efficiency compared to SSM, it
+suffers performance degradation in sequential recommendation tasks, especially
+in low-dimensional scenarios that are critical for these tasks. Considering
+that time-aware enhancement methods are commonly employed to mitigate
+performance loss, our analysis reveals that the performance decline of SSD can
+similarly be fundamentally compensated by leveraging mechanisms in time-aware
+methods. Thus, we propose integrating time-awareness into the SSD framework to
+address these performance issues. However, integrating current time-aware
+methods, modeled after TiSASRec, into SSD faces the following challenges: 1)
+the complexity of integrating these transformer-based mechanisms with the SSD
+architecture, and 2) the computational inefficiency caused by the need for
+dimensionality expansion of time-difference modeling. To overcome these
+challenges, we introduce a novel Time-aware Structured Masked Matrix that
+efficiently incorporates time-aware capabilities into SSD. Building on this, we
+propose Time-Aware Mamba for Recommendation (TiM4Rec), which mitigates
+performance degradation in low-dimensional SSD contexts while preserving
+computational efficiency. This marks the inaugural application of a time-aware
+enhancement method specifically tailored for the Mamba architecture within the
+domain of sequential recommendation. Extensive experiments conducted on three
+real-world datasets demonstrate the superiority of our approach. The code for
+our model is accessible at https://github.com/AlwaysFHao/TiM4Rec.
+
+
+
+
+
+
+
+ ♻ ☆ Lexicalization Is All You Need: Examining the Impact of Lexical
+ Knowledge in a Compositional QALD System
+
+
+
+
+
+
+
+
+ David Maria Schmidt, Mohammad Fazleh Elahi, Philipp Cimiano
+
+
+ In this paper, we examine the impact of lexicalization on Question Answering
+over Linked Data (QALD). It is well known that one of the key challenges in
+interpreting natural language questions with respect to SPARQL lies in bridging
+the lexical gap, that is, mapping the words in the query to the correct
+vocabulary elements. We argue in this paper that lexicalization, that is,
+explicit knowledge about the potential interpretations of a word with respect
+to the given vocabulary, significantly eases the task and increases the
+performance of QA systems. Towards this goal, we present a compositional QA
+system that can leverage explicit lexical knowledge in a compositional manner
+to infer the meaning of a question in terms of a SPARQL query. We show that
+such a system, given lexical knowledge, has a performance well beyond current
+QA systems, achieving up to a $35.8\%$ increase in the micro $F_1$ score
+compared to the best QA system on QALD-9. This shows the importance and
+potential of including explicit lexical knowledge. In contrast, we show that
+LLMs have limited abilities to exploit lexical knowledge, with only marginal
+improvements compared to a version without lexical knowledge. This shows that
+LLMs have no ability to compositionally interpret a question on the basis of
+the meaning of its parts, a key feature of compositional approaches. Taken
+together, our work shows new avenues for QALD research, emphasizing the
+importance of lexicalization and compositionality.
+
+
+
+ comment: 24th International Conference on Knowledge Engineering and Knowledge
+ Management (EKAW 2024), November 26-28, 2024, Amsterdam, The Netherlands
+
+
+
+
+
+
+ ♻ ☆ A Survey on Point-of-Interest Recommendations Leveraging Heterogeneous
+ Data
+
+
+ Tourism is an important application domain for recommender systems. In this
+domain, recommender systems are for example tasked with providing personalized
+recommendations for transportation, accommodation, points-of-interest (POIs),
+etc. Among these tasks, in particular the problem of recommending POIs that are
+of likely interest to individual tourists has gained growing attention in
+recent years. Providing POI recommendations to tourists can however be
+especially challenging due to the variability of the user's context. With the
+rapid development of the Web and today's multitude of online services, vast
+amounts of data from various sources have become available, and these
+heterogeneous data represent a huge potential to better address the challenges
+of POI recommendation problems. In this work, we provide a survey of published
+research on the problem of POI recommendation between 2021 and 2023. The
+literature was surveyed to identify the information types, techniques and
+evaluation methods employed. Based on the analysis, it was observed that the
+current research tends to focus on a relatively narrow range of information
+types and there is a significant potential in improving POI recommendation by
+leveraging heterogeneous data. As the first information-centric survey on POI
+recommendation research, this study serves as a reference for researchers
+aiming to develop increasingly accurate, personalized and context-aware POI
+recommender systems.
+
+
+
+
+
+
+
+
+
+
+ Machine Learning 150
+
+
+
+
+
+ ☆ VisionZip: Longer is Better but Not Necessary in Vision Language Models
+
+
+ Recent advancements in vision-language models have enhanced performance by
+increasing the length of visual tokens, making them much longer than text
+tokens and significantly raising computational costs. However, we observe that
+the visual tokens generated by popular vision encoders, such as CLIP and
+SigLIP, contain significant redundancy. To address this, we introduce
+VisionZip, a simple yet effective method that selects a set of informative
+tokens for input to the language model, reducing visual token redundancy and
+improving efficiency while maintaining model performance. The proposed
+VisionZip can be widely applied to image and video understanding tasks and is
+well-suited for multi-turn dialogues in real-world scenarios, where previous
+methods tend to underperform. Experimental results show that VisionZip
+outperforms the previous state-of-the-art method by at least 5% across nearly
+all settings. Moreover, our method significantly enhances
+model inference speed, improving the prefilling time by 8x and enabling the
+LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while
+achieving better results. Furthermore, we analyze the causes of this redundancy
+and encourage the community to focus on extracting better visual features
+rather than merely increasing token length. Our code is available at
+https://github.com/dvlab-research/VisionZip .
+
+
+
+
+
+
+
+
+ Enshen Zhou, Qi Su, Cheng Chi, Zhizheng Zhang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, He Wang
+
+
+ Automatic detection and prevention of open-set failures are crucial in
+closed-loop robotic systems. Recent studies often struggle to simultaneously
+identify unexpected failures reactively after they occur and prevent
+foreseeable ones proactively. To this end, we propose Code-as-Monitor (CaM), a
+novel paradigm leveraging the vision-language model (VLM) for both open-set
+reactive and proactive failure detection. The core of our method is to
+formulate both tasks as a unified set of spatio-temporal constraint
+satisfaction problems and use VLM-generated code to evaluate them for real-time
+monitoring. To enhance the accuracy and efficiency of monitoring, we further
+introduce constraint elements that abstract constraint-related entities or
+their parts into compact geometric elements. This approach offers greater
+generality, simplifies tracking, and facilitates constraint-aware visual
+programming by leveraging these elements as visual prompts. Experiments show
+that CaM achieves a 28.7% higher success rate and reduces execution time by
+31.8% under severe disturbances compared to baselines across three simulators
+and a real-world setting. Moreover, CaM can be integrated with open-loop
+control policies to form closed-loop systems, enabling long-horizon tasks in
+cluttered scenes with dynamic environments.
+
+
+ Recent developments in Large Language Models pre-trained on extensive corpora
+have shown significant success in various natural language processing tasks
+with minimal fine-tuning. This success offers new promise for robotics, which
+has long been constrained by the high cost of action-labeled data. We ask:
+given the abundant video data containing interaction-related knowledge
+available as a rich "corpus", can a similar generative pre-training approach be
+effectively applied to enhance robot learning? The key challenge is to identify
+an effective representation for autoregressive pre-training that benefits robot
+manipulation tasks. Inspired by the way humans learn new skills through
+observing dynamic environments, we propose that effective robotic learning
+should emphasize motion-related knowledge, which is closely tied to low-level
+actions and is hardware-agnostic, facilitating the transfer of learned motions
+to actual robot actions. To this end, we introduce Moto, which converts video
+content into latent Motion Token sequences by a Latent Motion Tokenizer,
+learning a bridging "language" of motion from videos in an unsupervised manner.
+We pre-train Moto-GPT through motion token autoregression, enabling it to
+capture diverse visual motion knowledge. After pre-training, Moto-GPT
+demonstrates the promising ability to produce semantically interpretable motion
+tokens, predict plausible motion trajectories, and assess trajectory
+rationality through output likelihood. To transfer learned motion priors to
+real robot actions, we implement a co-fine-tuning strategy that seamlessly
+bridges latent motion token prediction and real robot control. Extensive
+experiments show that the fine-tuned Moto-GPT exhibits superior robustness and
+efficiency on robot manipulation benchmarks, underscoring its effectiveness in
+transferring knowledge from video data to downstream visual manipulation tasks.
+
+
+
+ comment: Project released at: https://chenyi99.github.io/moto/
+
+
+
+
+
+
+
+ Shaunak Halbe, Junjiao Tian, K J Joseph, James Seale Smith, Katherine Stevo, Vineeth N Balasubramanian, Zsolt Kira
+
+
+ Vision-language models (VLMs) like CLIP have been cherished for their ability
+to perform zero-shot visual recognition on open-vocabulary concepts. This is
+achieved by selecting the object category whose textual representation bears
+the highest similarity with the query image. While successful in some domains,
+this method struggles with identifying fine-grained entities as well as
+generalizing to unseen concepts that are not captured by the training
+distribution. Recent works attempt to mitigate these challenges by integrating
+category descriptions at test time, albeit yielding modest improvements. We
+attribute these limited gains to a fundamental misalignment between image and
+description representations, which is rooted in the pretraining structure of
+CLIP. In this paper, we propose GRAIN, a new pretraining strategy aimed at
+aligning representations at both fine and coarse levels simultaneously. Our
+approach learns to jointly ground textual descriptions in image regions along
+with aligning overarching captions with global image representations. To drive
+this pre-training, we leverage frozen Multimodal Large Language Models (MLLMs)
+to derive large-scale synthetic annotations. We demonstrate the enhanced
+zero-shot performance of our model compared to current state-of-the-art methods
+across 11 diverse image classification datasets. Additionally, we introduce
+Products-2023, a newly curated, manually labeled dataset featuring novel
+concepts, and showcase our model's ability to recognize these concepts by
+benchmarking on it. Significant improvements achieved by our model on other
+downstream tasks like retrieval further highlight the superior quality of
+representations learned by our approach. Code available at
+https://github.com/shaunak27/grain-clip .
+
+
+
+
+
+
+
+
+ Keru Chen, Honghao Wei, Zhigang Deng, Sen Lin
+
+
+ The high costs and risks involved in extensive environment interactions
+hinder the practical application of current online safe reinforcement learning
+(RL) methods. While offline safe RL addresses this by learning policies from
+static datasets, the performance therein is usually limited due to reliance on
+data quality and challenges with out-of-distribution (OOD) actions. Inspired by
+recent successes in offline-to-online (O2O) RL, it is crucial to explore
+whether offline safe RL can be leveraged to facilitate faster and safer online
+policy learning, a direction that has yet to be fully investigated. To fill
+this gap, we first demonstrate that naively applying existing O2O algorithms
+from standard RL would not work well in the safe RL setting due to two unique
+challenges: erroneous Q-estimations, resulting from offline-online
+objective mismatch and offline cost sparsity, and Lagrangian mismatch,
+resulting from difficulties in aligning Lagrange multipliers between offline and
+online policies. To address these challenges, we introduce Marvel, a
+novel framework for O2O safe RL, comprising two key components that work in
+concert: Value Pre-Alignment to align the Q-functions with the
+underlying truth before online learning, and Adaptive PID Control to
+effectively adjust the Lagrange multipliers during online finetuning. Extensive
+experiments demonstrate that Marvel significantly outperforms existing
+baselines in both reward maximization and safety constraint satisfaction. By
+introducing the first policy-finetuning based framework for O2O safe RL, which
+is compatible with many offline and online safe RL methods, our work has the
+great potential to advance the field towards more efficient and practical safe
+RL solutions.
+
+
+ We introduce Condition-Aware Self-Supervised Learning Representation
+(CA-SSLR), a generalist conditioning model broadly applicable to various
+speech-processing tasks. Compared to standard fine-tuning methods that optimize
+for downstream models, CA-SSLR integrates language and speaker embeddings from
+earlier layers, making the SSL model aware of the current language and speaker
+context. This approach reduces the reliance on input audio features while
+preserving the integrity of the base SSLR. CA-SSLR improves the model's
+capabilities and demonstrates its generality on unseen tasks with minimal
+task-specific tuning. Our method employs linear modulation to dynamically
+adjust internal representations, enabling fine-grained adaptability without
+significantly altering the original model behavior. Experiments show that
+CA-SSLR reduces the number of trainable parameters, mitigates overfitting, and
+excels in under-resourced and unseen tasks. Specifically, CA-SSLR achieves a
+10% relative reduction in LID errors, a 37% improvement in ASR CER on the
+ML-SUPERB benchmark, and a 27% decrease in SV EER on VoxCeleb-1, demonstrating
+its effectiveness.
+
+
+
+ comment: 38th Conference on Neural Information Processing Systems (NeurIPS
+ 2024)
+
+
+
+
+
+
+ ☆ FedDUAL: A Dual-Strategy with Adaptive Loss and Dynamic Aggregation for
+ Mitigating Data Heterogeneity in Federated Learning
+
+
+ Federated Learning (FL) marks a transformative approach to distributed model
+training by combining locally optimized models from various clients into a
+unified global model. While FL preserves data privacy by eliminating
+centralized storage, it encounters significant challenges such as performance
+degradation, slower convergence, and reduced robustness of the global model due
+to the heterogeneity in client data distributions. Among the various forms of
+data heterogeneity, label skew emerges as a particularly formidable and
+prevalent issue, especially in domains such as image classification. To address
+these challenges, we begin with comprehensive experiments to pinpoint the
+underlying issues in the FL training process. Based on our findings, we then
+introduce an innovative dual-strategy approach designed to effectively resolve
+these issues. First, we introduce an adaptive loss function for client-side
+training, meticulously crafted to preserve previously acquired knowledge while
+maintaining an optimal equilibrium between local optimization and global model
+coherence. Secondly, we develop a dynamic aggregation strategy for aggregating
+client models at the server. This approach adapts to each client's unique
+learning patterns, effectively addressing the challenges of diverse data across
+the network. Our comprehensive evaluation, conducted across three diverse
+real-world datasets, coupled with theoretical convergence guarantees,
+demonstrates the superior efficacy of our method compared to several
+established state-of-the-art approaches.
+
+
+
+
+
+
+
+
+ Anshul Thakur, Yichen Huang, Soheila Molaei, Yujiang Wang, David A. Clifton
+
+
+ Shared training approaches, such as multi-task learning (MTL) and
+gradient-based meta-learning, are widely used in various machine learning
+applications, but they often suffer from negative transfer, leading to
+performance degradation in specific tasks. While several optimisation
+techniques have been developed to mitigate this issue for pre-selected task
+cohorts, identifying optimal task combinations for joint learning - known as
+task grouping - remains underexplored and computationally challenging due to
+the exponential growth in task combinations and the need for extensive training
+and evaluation cycles. This paper introduces an efficient task grouping
+framework designed to reduce these overwhelming computational demands of the
+existing methods. The proposed framework infers pairwise task similarities
+through a sample-wise optimisation landscape analysis, eliminating the need for
+the shared model training required to infer task similarities in existing
+methods. With task similarities acquired, a graph-based clustering algorithm is
+employed to pinpoint near-optimal task groups, providing an approximate yet
+efficient and effective solution to the originally NP-hard problem. Empirical
+assessments conducted on 8 different datasets highlight the effectiveness of
+the proposed framework, revealing a five-fold speed enhancement compared to
+previous state-of-the-art methods. Moreover, the framework consistently
+demonstrates comparable performance, confirming its remarkable efficiency and
+effectiveness in task grouping.
+
+
+
+ comment: Under review at IEEE Transactions on Pattern Analysis and Machine
+ Intelligence
+
+
+
+
+
+
+ ☆ Stabilizing and Solving Inverse Problems using Data and Machine Learning
+
+
+
+
+
+
+
+
+ Erik Burman, Mats G. Larson, Karl Larsson, Carl Lundholm
+
+
+ We consider an inverse problem involving the reconstruction of the solution
+to a nonlinear partial differential equation (PDE) with unknown boundary
+conditions. Instead of direct boundary data, we are provided with a large
+dataset of boundary observations for typical solutions (collective data) and a
+bulk measurement of a specific realization. To leverage this collective data,
+we first compress the boundary data using proper orthogonal decomposition (POD)
+in a linear expansion. Next, we identify a possible nonlinear low-dimensional
+structure in the expansion coefficients using an auto-encoder, which provides a
+parametrization of the dataset in a lower-dimensional latent space. We then
+train a neural network to map the latent variables representing the boundary
+data to the solution of the PDE. Finally, we solve the inverse problem by
+optimizing a data-fitting term over the latent space.
+ We analyze the underlying stabilized finite element method in the linear
+setting and establish optimal error estimates in the $H^1$ and $L^2$-norms. The
+nonlinear problem is then studied numerically, demonstrating the effectiveness
+of our approach.
+
+
+
+
+
+
+
+ ☆ Providing Differential Privacy for Federated Learning Over Wireless: A
+ Cross-layer Framework
+
+
+ Federated Learning (FL) is a distributed machine learning framework that
+inherently allows edge devices to maintain their local training data, thus
+providing some level of privacy. However, FL's model updates still pose a risk
+of privacy leakage, which must be mitigated. Over-the-air FL (OTA-FL) is an
+adapted FL design for wireless edge networks that leverages the natural
+superposition property of the wireless medium. We propose a wireless physical
+layer (PHY) design for OTA-FL which improves differential privacy (DP) through
+a decentralized, dynamic power control that utilizes both inherent Gaussian
+noise in the wireless channel and a cooperative jammer (CJ) for additional
+artificial noise generation when higher privacy levels are required. Although
+primarily implemented within the Upcycled-FL framework, where a
+resource-efficient method with first-order approximations is used at every even
+iteration to decrease the required information from clients, our power control
+strategy is applicable to any FL framework, including FedAvg and FedProx as
+shown in the paper. This adaptation showcases the flexibility and effectiveness
+of our design across different learning algorithms while maintaining a strong
+emphasis on privacy. Our design removes the need for client-side artificial
+noise injection for DP, utilizing a cooperative jammer to enhance privacy
+without affecting transmission efficiency for higher privacy demands. Privacy
+analysis is provided using the Moments Accountant method. We perform a
+convergence analysis for non-convex objectives to tackle heterogeneous data
+distributions, highlighting the inherent trade-offs between privacy and
+accuracy. Numerical results show that our approach with various FL algorithms
+outperforms the state-of-the-art under the same DP conditions on the non-i.i.d.
+FEMNIST dataset, and highlight the cooperative jammer's effectiveness in
+ensuring strict privacy.
+
+
+ Automated feature engineering (AutoFE) is used to automatically create new
+features from original features to improve predictive performance without
+needing significant human intervention and expertise. Many algorithms exist for
+AutoFE, but very few approaches exist for the federated learning (FL) setting
+where data is gathered across many clients and is not shared between clients or
+a central server. We introduce AutoFE algorithms for the horizontal, vertical,
+and hybrid FL settings, which differ in how the data is gathered across
+clients. To the best of our knowledge, we are the first to develop AutoFE
+algorithms for the horizontal and hybrid FL cases, and we show that the
+downstream model performance of federated AutoFE is similar to the case where
+data is held centrally and AutoFE is performed centrally.
+
+
+ Bayesian optimization is efficient even with a small amount of data and is
+used in engineering and in science, including biology and chemistry. In
+Bayesian optimization, a parameterized model with an uncertainty is fitted to
+explain the experimental data, and then the model suggests parameters that
+would most likely improve the results. Batch Bayesian optimization reduces the
+processing time of optimization by parallelizing experiments. However, batch
+Bayesian optimization cannot be applied if the number of parallelized
+experiments is limited by the cost or scarcity of equipment; in such cases,
+sequential methods require an unrealistic amount of time. In this study, we
+developed pipelining Bayesian optimization (PipeBO) to reduce the processing
+time of optimization even with a limited number of parallel experiments. PipeBO
+was inspired by the pipelining of central processing unit architecture, which
+divides computational tasks into multiple processes. PipeBO was designed to
+achieve experiment parallelization by overlapping various processes of the
+experiments. PipeBO uses the results of completed experiments to update the
+parameters of running parallelized experiments. Using the Black-Box
+Optimization Benchmarking, which consists of 24 benchmark functions, we
+compared PipeBO with sequential Bayesian optimization methods. For 20 of the
+24 functions, PipeBO reduced the average processing time of optimization to
+about 56% for experiments consisting of two processes, and to even less for
+those with more processes. Overall, PipeBO parallelizes Bayesian
+optimization in resource-constrained settings so that efficient
+optimization can be achieved.
+
+
+
+
+
+
+
+ ☆ Probabilistic Gaussian Superposition for Efficient 3D Occupancy
+ Prediction
+
+
+
+
+
+
+
+
+ Yuanhui Huang, Amonnut Thammatadatrakoon, Wenzhao Zheng, Yunpeng Zhang, Dalong Du, Jiwen Lu
+
+
+ 3D semantic occupancy prediction is an important task for robust
+vision-centric autonomous driving, which predicts fine-grained geometry and
+semantics of the surrounding scene. Most existing methods leverage dense
+grid-based scene representations, overlooking the spatial sparsity of the
+driving scenes. Although 3D semantic Gaussian serves as an object-centric
+sparse alternative, most of the Gaussians still describe the empty region with
+low efficiency. To address this, we propose a probabilistic Gaussian
+superposition model which interprets each Gaussian as a probability
+distribution of its neighborhood being occupied and conforms to probabilistic
+multiplication to derive the overall geometry. Furthermore, we adopt the exact
+Gaussian mixture model for semantics calculation to avoid unnecessary
+overlapping of Gaussians. To effectively initialize Gaussians in non-empty
+regions, we design a distribution-based initialization module which learns the
+pixel-aligned occupancy distribution instead of the depth of surfaces. We
+conduct extensive experiments on nuScenes and KITTI-360 datasets and our
+GaussianFormer-2 achieves state-of-the-art performance with high efficiency.
+Code: https://github.com/huang-yh/GaussianFormer.
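The "probabilistic multiplication" described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes each Gaussian contributes an unnormalized density `alpha_i(x)` as its occupancy probability and that a point is empty only if every Gaussian leaves it empty. The function names are hypothetical.

```python
import numpy as np

def gaussian_occupancy(x, means, covs):
    """Per-Gaussian occupancy probability at point x (assumed form:
    exp(-0.5 * squared Mahalanobis distance), so it equals 1 at the mean)."""
    diffs = x - means                      # (N, 3)
    inv_covs = np.linalg.inv(covs)         # (N, 3, 3)
    maha = np.einsum("ni,nij,nj->n", diffs, inv_covs, diffs)
    return np.exp(-0.5 * maha)

def superposed_occupancy(x, means, covs):
    """Overall occupancy: 1 minus the product of per-Gaussian 'empty' probabilities."""
    alpha = gaussian_occupancy(x, means, covs)
    return 1.0 - np.prod(1.0 - alpha)
```

Under this assumption the combined value stays in [0, 1] no matter how many Gaussians overlap, which is the property an additive superposition lacks.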
+
+
+
+ comment: Code is available at: https://github.com/huang-yh/GaussianFormer
+
+
+
+
+
+
+ ☆ EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online
+ Scene Understanding
+
+
+
+
+
+
+
+
+ Yuqi Wu, Wenzhao Zheng, Sicheng Zuo, Yuanhui Huang, Jie Zhou, Jiwen Lu
+
+
+ 3D occupancy prediction provides a comprehensive description of the
+surrounding scenes and has become an essential task for 3D perception. Most
+existing methods focus on offline perception from one or a few views and cannot
+be applied to embodied agents, which must gradually perceive the scene
+through progressive embodied exploration. In this paper, we formulate an
+embodied 3D occupancy prediction task to target this practical scenario and
+propose a Gaussian-based EmbodiedOcc framework to accomplish it. We initialize
+the global scene with uniform 3D semantic Gaussians and progressively update
+local regions observed by the embodied agent. For each update, we extract
+semantic and structural features from the observed image and efficiently
+incorporate them via deformable cross-attention to refine the regional
+Gaussians. Finally, we employ Gaussian-to-voxel splatting to obtain the global
+3D occupancy from the updated 3D Gaussians. Our EmbodiedOcc assumes an unknown
+(i.e., uniformly distributed) environment and maintains an explicit global
+memory of it with 3D Gaussians. It gradually gains knowledge through local
+refinement of regional Gaussians, which is consistent with how humans
+understand new scenes through embodied exploration. We reorganize an
+EmbodiedOcc-ScanNet benchmark based on local annotations to facilitate the
+evaluation of the embodied 3D occupancy prediction task. Experiments
+demonstrate that our EmbodiedOcc outperforms existing local prediction methods
+and accomplishes the embodied occupancy prediction with high accuracy and
+strong expandability. Our code is available at:
+https://github.com/YkiWu/EmbodiedOcc.
+
+
+
+
+
+
+
+ ☆ A Hitchhiker's Guide to Understanding Performances of Two-Class
+ Classifiers
+
+
+
+
+
+
+
+
+ Anaïs Halin, Sébastien Piérard, Anthony Cioppa, Marc Van Droogenbroeck
+
+
+ Properly understanding the performances of classifiers is essential in
+various scenarios. However, the literature often relies only on one or two
+standard scores to compare classifiers, which fails to capture the nuances of
+application-specific requirements, potentially leading to suboptimal classifier
+selection. Recently, a paper on the foundations of the theory of
+performance-based ranking introduced a tool, called the Tile, that organizes an
+infinity of ranking scores into a 2D map. Thanks to the Tile, it is now
+possible to evaluate and compare classifiers efficiently, displaying all
+possible application-specific preferences instead of having to rely on a pair
+of scores. In this paper, we provide a first hitchhiker's guide for
+understanding the performances of two-class classifiers by presenting four
+scenarios, each showcasing a different user profile: a theoretical analyst, a
+method designer, a benchmarker, and an application developer. Particularly, we
+show that we can provide different interpretative flavors that are adapted to
+the user's needs by mapping different values on the Tile. As an illustration,
+we leverage the newly introduced Tile tool and the different flavors to rank
+and analyze the performances of 74 state-of-the-art semantic segmentation
+models in two-class classification through the eyes of the four user profiles.
+Through these user profiles, we demonstrate that the Tile effectively captures
+the behavior of classifiers in a single visualization, while accommodating an
+infinite number of ranking scores.
+
+
+
+
+
+
+
+ ☆ Finer Behavioral Foundation Models via Auto-Regressive Features and
+ Advantage Weighting
+
+
+ The forward-backward representation (FB) is a recently proposed framework
+(Touati et al., 2023; Touati & Ollivier, 2021) to train behavior foundation
+models (BFMs) that aim at providing zero-shot efficient policies for any new
+task specified in a given reinforcement learning (RL) environment, without
+training for each new task. Here we address two core limitations of FB model
+training. First, FB, like all successor-feature-based methods, relies on a
+linear encoding of tasks: at test time, each new reward function is linearly
+projected onto a fixed set of pre-trained features. This limits expressivity as
+well as precision of the task representation. We break the linearity limitation
+by introducing auto-regressive features for FB, which let fine-grained task
+features depend on coarser-grained task information. This can represent
+arbitrary nonlinear task encodings, thus significantly increasing expressivity
+of the FB framework. Second, it is well-known that training RL agents from
+offline datasets often requires specific techniques. We show that FB works well
+together with such offline RL techniques, by adapting techniques from (Nair et
+al., 2020b; Cetin et al., 2024) for FB. This is necessary to get non-flatlining
+performance in some datasets, such as DMC Humanoid. As a result, we produce
+efficient FB BFMs for a number of new environments. Notably, in the D4RL
+locomotion benchmark, the generic FB agent matches the performance of standard
+single-task offline agents (IQL, XQL). In many setups, the offline techniques
+are needed to get any decent performance at all. The auto-regressive features
+have a positive but moderate impact, concentrated on tasks requiring spatial
+precision and task generalization beyond the behaviors represented in the
+trainset.
+
+
+
+
+
+
+
+ ☆ Machine Theory of Mind for Autonomous Cyber-Defence
+
+
+
+
+
+
+
+
+ Luke Swaby, Matthew Stewart, Daniel Harrold, Chris Willis, Gregory Palmer
+
+
+ Intelligent autonomous agents hold much potential for the domain of
+cyber-security. However, due to many state-of-the-art approaches relying on
+uninterpretable black-box models, there is growing demand for methods that
+offer stakeholders clear and actionable insights into their latent beliefs and
+motivations. To address this, we evaluate Theory of Mind (ToM) approaches for
+Autonomous Cyber Operations. Upon learning a robust prior, ToM models can
+predict an agent's goals, behaviours, and contextual beliefs given only a
+handful of past behaviour observations. In this paper, we introduce a novel
+Graph Neural Network (GNN)-based ToM architecture tailored for cyber-defence,
+Graph-In, Graph-Out (GIGO)-ToM, which can accurately predict both the targets
+and attack trajectories of adversarial cyber agents over arbitrary computer
+network topologies. To evaluate the latter, we propose a novel extension of the
+Wasserstein distance for measuring the similarity of graph-based probability
+distributions. Whereas the standard Wasserstein distance lacks a fixed
+reference scale, we introduce a graph-theoretic normalization factor that
+enables a standardized comparison between networks of different sizes. We
+furnish this metric, which we term the Network Transport Distance (NTD), with a
+weighting function that emphasizes predictions according to custom node
+features, allowing network operators to explore arbitrary strategic
+considerations. Benchmarked against a Graph-In, Dense-Out (GIDO)-ToM
+architecture in an abstract cyber-defence environment, our empirical
+evaluations show that GIGO-ToM can accurately predict the goals and behaviours
+of various unseen cyber-attacking agents across a range of network topologies,
+as well as learn embeddings that can effectively characterize their policies.
+
+
+
+
+
+
+
+
+ Oscar Key, Luka Ribar, Alberto Cattaneo, Luke Hudlass-Galley, Douglas Orr
+
+
+ We present an evaluation of bucketed approximate top-$k$ algorithms.
+Computing top-$k$ exactly suffers from limited parallelism, because the $k$
+largest values must be aggregated along the vector, and is thus not well suited to
+computation on highly-parallel machine learning accelerators. By relaxing the
+requirement that the top-$k$ is exact, bucketed algorithms can dramatically
+increase the parallelism available by independently computing many smaller
+top-$k$ operations. We explore the design choices of this class of algorithms
+using both theoretical analysis and empirical evaluation on downstream tasks.
+Our motivating examples are sparsity algorithms for language models, which
+often use top-$k$ to select the most important parameters or activations. We
+also release a fast bucketed top-$k$ implementation for PyTorch.
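The bucketing idea described above can be sketched in NumPy. This is a minimal illustration under stated assumptions (contiguous, equal-sized buckets and an equal share of `k` per bucket), not the released PyTorch implementation. The function names are hypothetical.

```python
import numpy as np

def exact_top_k(x, k):
    """Return the k largest values of a 1-D array, sorted in descending order."""
    return np.sort(x)[::-1][:k]

def bucketed_top_k(x, k, num_buckets):
    """Approximate top-k: split x into equal contiguous buckets, take the
    k // num_buckets largest values of each bucket independently, then merge."""
    if k % num_buckets != 0 or x.size % num_buckets != 0:
        raise ValueError("k and len(x) must both be divisible by num_buckets")
    per_bucket = k // num_buckets
    buckets = x.reshape(num_buckets, -1)
    if per_bucket > buckets.shape[1]:
        raise ValueError("k is larger than the input")
    # Each row is sorted on its own; no information crosses bucket boundaries.
    top = np.sort(buckets, axis=1)[:, ::-1][:, :per_bucket]
    return np.sort(top.ravel())[::-1]

def recall(approx, exact):
    """Fraction of the exact top-k values recovered by the approximation."""
    return len(np.intersect1d(approx, exact)) / len(exact)
```

With one bucket the result matches the exact top-k. With more buckets, large values that share a bucket can crowd each other out, which is the accuracy cost traded for parallelism.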
+
+
+
+
+
+
+
+ ☆ Multi-Scale Node Embeddings for Graph Modeling and Generation
+
+
+ Lying at the interface between Network Science and Machine Learning, node
+embedding algorithms take a graph as input and encode its structure onto output
+vectors that represent nodes in an abstract geometric space, enabling various
+vector-based downstream tasks such as network modelling, data compression, link
+prediction, and community detection. Two apparently unrelated limitations
+affect these algorithms. On one hand, it is not clear what the basic operation
+defining vector spaces, i.e. the vector sum, corresponds to in terms of the
+original nodes in the network. On the other hand, while the same input network
+can be represented at multiple levels of resolution by coarse-graining the
+constituent nodes into arbitrary block-nodes, the relationship between node
+embeddings obtained at different hierarchical levels is not understood. Here,
+building on recent results in network renormalization theory, we address these
+two limitations at once and define a multiscale node embedding method that,
+upon arbitrary coarse-grainings, ensures statistical consistency of the
+embedding vector of a block-node with the sum of the embedding vectors of its
+constituent nodes. We illustrate the power of this approach on two economic
+networks that can be naturally represented at multiple resolution levels:
+namely, the international trade between (sets of) countries and the
+input-output flows among (sets of) industries in the Netherlands. We confirm
+the statistical consistency between networks retrieved from coarse-grained node
+vectors and networks retrieved from sums of fine-grained node vectors, a result
+that cannot be achieved by alternative methods. Several key network properties,
+including a large number of triangles, are successfully replicated already from
+embeddings of very low dimensionality, allowing for the generation of faithful
+replicas of the original networks at arbitrary resolution levels.
+
+
+
+
+
+
+
+ ☆ ActFusion: a Unified Diffusion Model for Action Segmentation and
+ Anticipation NeurIPS 2024
+
+
+ Temporal action segmentation and long-term action anticipation are two
+popular vision tasks for the temporal analysis of actions in videos. Despite
+apparent relevance and potential complementarity, these two problems have been
+investigated as separate and distinct tasks. In this work, we tackle these two
+problems, action segmentation and action anticipation, jointly using a unified
+diffusion model dubbed ActFusion. The key idea to unification is to train the
+model to effectively handle both visible and invisible parts of the sequence in
+an integrated manner; the visible part is for temporal segmentation, and the
+invisible part is for future anticipation. To this end, we introduce a new
+anticipative masking strategy during training in which a late part of the video
+frames is masked as invisible, and learnable tokens replace these frames to
+learn to predict the invisible future. Experimental results demonstrate the
+bi-directional benefits between action segmentation and anticipation. ActFusion
+achieves the state-of-the-art performance across the standard benchmarks of 50
+Salads, Breakfast, and GTEA, outperforming task-specific models in both of the
+two tasks with a single unified model through joint learning.
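The anticipative masking step described above can be sketched as follows. This is a minimal illustration, assuming per-frame feature vectors and a single shared mask token; in training the token would be a learnable parameter, whereas here it is a plain array. The function name and the `observed_ratio` parameter are hypothetical.

```python
import numpy as np

def anticipative_mask(features, observed_ratio, mask_token):
    """Replace the late part of a (T, D) frame-feature sequence with mask_token.

    Returns the masked sequence and a boolean vector marking visible frames.
    """
    num_frames = features.shape[0]
    num_observed = int(round(num_frames * observed_ratio))
    masked = features.copy()
    masked[num_observed:] = mask_token          # future frames become "invisible"
    visible = np.arange(num_frames) < num_observed
    return masked, visible
```

The visible positions would be supervised as segmentation targets and the masked positions as anticipation targets, which is how one model can serve both tasks.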
+
+
+ Performative prediction aims to model scenarios where predictive outcomes
+subsequently influence the very systems they target. The pursuit of a
+performative optimum (PO) -- minimizing performative risk -- is generally
+reliant on modeling of the distribution map, which characterizes how a deployed
+ML model alters the data distribution. Unfortunately, inevitable
+misspecification of the distribution map can lead to a poor approximation of
+the true PO. To address this issue, we introduce a novel framework of
+distributionally robust performative prediction and study a new solution
+concept termed the distributionally robust performative optimum (DRPO). We show
+provable guarantees for DRPO as a robust approximation to the true PO when the
+nominal distribution map is different from the actual one. Moreover,
+distributionally robust performative prediction can be reformulated as an
+augmented performative prediction problem, enabling efficient optimization. The
+experimental results demonstrate that DRPO offers potential advantages over
+the traditional PO approach when the distribution map is misspecified at either
+micro- or macro-level.
+
+
+
+ comment: In Proceedings of the 38th Conference on Neural Information
+ Processing Systems (NeurIPS) 2024
+
+
+
+
+
+
+ ☆ Likelihood-Scheduled Score-Based Generative Modeling for Fully 3D PET
+ Image Reconstruction
+
+
+
+
+
+
+
+
+ George Webber, Yuya Mizuno, Oliver D. Howes, Alexander Hammers, Andrew P. King, Andrew J. Reader
+
+
+ Medical image reconstruction with pre-trained score-based generative models
+(SGMs) has advantages over other existing state-of-the-art deep-learned
+reconstruction methods, including improved resilience to different scanner
+setups and advanced image distribution modeling. SGM-based reconstruction has
+recently been applied to simulated positron emission tomography (PET) datasets,
+showing improved contrast recovery for out-of-distribution lesions relative to
+the state-of-the-art. However, existing methods for SGM-based reconstruction
+from PET data suffer from slow reconstruction, burdensome hyperparameter tuning
+and slice inconsistency effects (in 3D). In this work, we propose a practical
+methodology for fully 3D reconstruction that accelerates reconstruction and
+reduces the number of critical hyperparameters by matching the likelihood of an
+SGM's reverse diffusion process to a current iterate of the maximum-likelihood
+expectation maximization algorithm. Using the example of low-count
+reconstruction from simulated [$^{18}$F]DPA-714 datasets, we show our
+methodology can match or improve on the NRMSE and SSIM of existing
+state-of-the-art SGM-based PET reconstruction while reducing reconstruction
+time and the need for hyperparameter tuning. We evaluate our methodology
+against state-of-the-art supervised and conventional reconstruction algorithms.
+Finally, we demonstrate a first-ever implementation of SGM-based reconstruction
+for real 3D PET data, specifically [$^{18}$F]DPA-714 data, where we integrate
+perpendicular pre-trained SGMs to eliminate slice inconsistency issues.
+
+
+
+ comment: 11 pages, 12 figures. Submitted to Transactions on Medical Imaging
+
+
+
+
+
+
+ ☆ Action Mapping for Reinforcement Learning in Continuous Environments
+ with Constraints
+
+
+
+
+
+
+
+
+ Mirco Theile, Lukas Dirnberger, Raphael Trumpp, Marco Caccamo, Alberto L. Sangiovanni-Vincentelli
+
+
+ Deep reinforcement learning (DRL) has had success across various domains, but
+applying it to environments with constraints remains challenging due to poor
+sample efficiency and slow convergence. Recent literature explored
+incorporating model knowledge to mitigate these problems, particularly through
+the use of models that assess the feasibility of proposed actions. However,
+integrating feasibility models efficiently into DRL pipelines in environments
+with continuous action spaces is non-trivial. We propose a novel DRL training
+strategy utilizing action mapping that leverages feasibility models to
+streamline the learning process. By decoupling the learning of feasible actions
+from policy optimization, action mapping allows DRL agents to focus on
+selecting the optimal action from a reduced feasible action set. We demonstrate
+through experiments that action mapping significantly improves training
+performance in constrained environments with continuous action spaces,
+especially with imperfect feasibility models.
+
+
+
+
+
+
+
+ ☆ GRAM: Generalization in Deep RL with a Robust Adaptation Module
+
+
+
+
+
+
+
+
+ James Queeney, Xiaoyi Cai, Mouhacine Benosman, Jonathan P. How
+
+
+ The reliable deployment of deep reinforcement learning in real-world settings
+requires the ability to generalize across a variety of conditions, including
+both in-distribution scenarios seen during training as well as novel
+out-of-distribution scenarios. In this work, we present a framework for
+dynamics generalization in deep reinforcement learning that unifies these two
+distinct types of generalization within a single architecture. We introduce a
+robust adaptation module that provides a mechanism for identifying and reacting
+to both in-distribution and out-of-distribution environment dynamics, along
+with a joint training pipeline that combines the goals of in-distribution
+adaptation and out-of-distribution robustness. Our algorithm GRAM achieves
+strong generalization performance across in-distribution and
+out-of-distribution scenarios upon deployment, which we demonstrate on a
+variety of realistic simulated locomotion tasks with a quadruped robot.
+
+
+
+
+
+
+
+ ☆ Generative-Model-Based Fully 3D PET Image Reconstruction by Conditional
+ Diffusion Sampling
+
+
+
+
+
+
+
+
+ George Webber, Yuya Mizuno, Oliver D. Howes, Alexander Hammers, Andrew P. King, Andrew J. Reader
+
+
+ Score-based generative models (SGMs) have recently shown promising results
+for image reconstruction on simulated positron emission tomography (PET)
+datasets. In this work we have developed and implemented practical methodology
+for 3D image reconstruction with SGMs, and perform (to our knowledge) the first
+SGM-based reconstruction of real fully 3D PET data. We train an SGM on
+full-count reference brain images, and extend methodology to allow SGM-based
+reconstructions at very low counts (1% of original, to simulate low-dose or
+short-duration scanning). We then perform reconstructions for multiple
+independent realisations of 1% count data, allowing us to analyse the bias and
+variance characteristics of the method. We sample from the learned posterior
+distribution of the generative algorithm to calculate uncertainty images for
+our reconstructions. We evaluate the method's performance on real full- and
+low-count PET data and compare with conventional OSEM and MAP-EM baselines,
+showing that our SGM-based low-count reconstructions match full-dose
+reconstructions more closely and that, in a bias-variance trade-off comparison,
+our SGM-reconstructed images have lower variance than existing baselines. Future
+work will compare to supervised deep-learned methods, with other avenues for
+investigation including how data conditioning affects the SGM's posterior
+distribution and the algorithm's performance with different tracers.
+
+
+
+ comment: 2 pages, 2 figures. Accepted for oral presentation at IEEE NSS MIC
+ RTSD 2024 (submitted May 2024; accepted July 2024; presented Nov 2024)
+
+
+
+
+
+
+ ☆ The Tile: A 2D Map of Ranking Scores for Two-Class Classification
+
+
+
+
+
+
+
+
+ Sébastien Piérard, Anaïs Halin, Anthony Cioppa, Adrien Deliège, Marc Van Droogenbroeck
+
+
+ In the computer vision and machine learning communities, as well as in many
+other research domains, rigorous evaluation of any new method, including
+classifiers, is essential. One key component of the evaluation process is the
+ability to compare and rank methods. However, ranking classifiers and
+accurately comparing their performances, especially when taking
+application-specific preferences into account, remains challenging. For
+instance, commonly used evaluation tools like Receiver Operating Characteristic
+(ROC) and Precision/Recall (PR) spaces display performances based on two
+scores. Hence, they are inherently limited in their ability to compare
+classifiers across a broader range of scores and lack the capability to
+establish a clear ranking among classifiers. In this paper, we present a novel
+versatile tool, named the Tile, that organizes an infinity of ranking scores in
+a single 2D map for two-class classifiers, including common evaluation scores
+such as the accuracy, the true positive rate, the positive predictive value,
+Jaccard's coefficient, and all F-beta scores. Furthermore, we study the
+properties of the underlying ranking scores, such as the influence of the
+priors or the correspondences with the ROC space, and depict how to
+characterize any other score by comparing them to the Tile. Overall, we
+demonstrate that the Tile is a powerful tool that effectively captures all the
+rankings in a single visualization and allows interpreting them.
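For reference, the common scores named above can be computed from a two-class confusion matrix as sketched below. This uses the standard textbook definitions only; the construction of the Tile itself is not described in this abstract and is not reproduced here. The function name is hypothetical.

```python
def two_class_scores(tp, fp, fn, tn, beta=1.0):
    """Standard two-class scores from confusion-matrix counts."""
    total = tp + fp + fn + tn
    b2 = beta ** 2
    return {
        "accuracy": (tp + tn) / total,
        "tpr": tp / (tp + fn),            # true positive rate (recall)
        "ppv": tp / (tp + fp),            # positive predictive value (precision)
        "jaccard": tp / (tp + fp + fn),   # Jaccard's coefficient (IoU)
        "f_beta": (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp),
    }
```

Each of these reduces the same four counts to one number with a different weighting, which is why two classifiers can swap ranks depending on the score chosen.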
+
+
+
+
+
+
+
+
+ Michihiro Yasunaga, Leonid Shamis, Chunting Zhou, Andrew Cohen, Jason Weston, Luke Zettlemoyer, Marjan Ghazvininejad
+
+
+ Recent approaches to large language model (LLM) alignment typically require
+millions of human annotations or rely on external aligned models for synthetic
+data generation. This paper introduces ALMA: Alignment with Minimal Annotation,
+demonstrating that effective alignment can be achieved using only 9,000 labeled
+examples -- less than 1% of conventional approaches. ALMA generates large
+amounts of high-quality synthetic alignment data through new techniques:
+diverse prompt synthesis via few-shot learning, diverse response generation
+with multiple model checkpoints, and judge (reward model) enhancement through
+score aggregation and self-distillation. Using only a pretrained Llama3 base
+model, 5,000 SFT examples, and 4,000 judge annotations, ALMA achieves
+performance close to Llama3-Instruct across diverse alignment benchmarks (e.g.,
+0.1% difference on AlpacaEval 2.0 score). These results are achieved with a
+multi-round, self-bootstrapped data synthesis and training recipe that
+continues to improve for 10 rounds, surpassing the typical 3-round ceiling of
+previous methods. These results suggest that base models already possess
+sufficient knowledge for effective alignment, and that synthetic data
+generation methods can expose it.
+
+
+
+
+
+
+
+ ☆ Structure-Aware Stylized Image Synthesis for Robust Medical Image
+ Segmentation
+
+
+
+
+
+
+
+
+ Jie Bao, Zhixin Zhou, Wen Jung Li, Rui Luo
+
+
+ Accurate medical image segmentation is essential for effective diagnosis and
+treatment planning but is often challenged by domain shifts caused by
+variations in imaging devices, acquisition conditions, and patient-specific
+attributes. Traditional domain generalization methods typically require
+inclusion of parts of the test domain within the training set, which is not
+always feasible in clinical settings with limited diverse data. Additionally,
+although diffusion models have demonstrated strong capabilities in image
+generation and style transfer, they often fail to preserve the critical
+structural information necessary for precise medical analysis. To address these
+issues, we propose a novel medical image segmentation method that combines
+diffusion models and Structure-Preserving Network for structure-aware one-shot
+image stylization. Our approach effectively mitigates domain shifts by
+transforming images from various sources into a consistent style while
+maintaining the location, size, and shape of lesions. This ensures robust and
+accurate segmentation even when the target domain is absent from the training
+data. Experimental evaluations on colonoscopy polyp segmentation and skin
+lesion segmentation datasets show that our method enhances the robustness and
+accuracy of segmentation models, achieving superior performance metrics
+compared to baseline models without style transfer. This structure-aware
+stylization framework offers a practical solution for improving medical image
+segmentation across diverse domains, facilitating more reliable clinical
+diagnoses.
+
+
+
+
+
+
+
+ ☆ Deep Causal Inference for Point-referenced Spatial Data with Continuous
+ Treatments
+
+
+
+
+
+
+
+
+ Ziyang Jiang, Zach Calhoun, Yiling Liu, Lei Duan, David Carlson
+
+
+ Causal reasoning is often challenging with spatial data, particularly when
+handling high-dimensional inputs. To address this, we propose a neural network
+(NN) based framework integrated with an approximate Gaussian process to manage
+spatial interference and unobserved confounding. Additionally, we adopt a
+generalized propensity-score-based approach to address partially observed
+outcomes when estimating causal effects with continuous treatments. We evaluate
+our framework using synthetic, semi-synthetic, and real-world data inferred
+from satellite imagery. Our results demonstrate that NN-based models
+significantly outperform linear spatial regression models in estimating causal
+effects. Furthermore, in real-world case studies, NN-based models offer more
+reasonable predictions of causal effects, facilitating decision-making in
+relevant applications.
+
+
+
+ comment: 16 pages, 4 figures, 5 tables
+
+
+
+
+
+
+ ☆ Complexity of Vector-valued Prediction: From Linear Models to Stochastic
+ Convex Optimization
+
+
+ We study the problem of learning vector-valued linear predictors: these are
+prediction rules parameterized by a matrix that maps an $m$-dimensional feature
+vector to a $k$-dimensional target. We focus on the fundamental case with a
+convex and Lipschitz loss function, and show several new theoretical results
+that shed light on the complexity of this problem and its connection to related
+learning models. First, we give a tight characterization of the sample
+complexity of Empirical Risk Minimization (ERM) in this setting, establishing
+that $\smash{\widetilde{\Omega}}(k/\epsilon^2)$ examples are necessary for ERM
+to reach $\epsilon$ excess (population) risk; this provides for an exponential
+improvement over recent results by Magen and Shamir (2023) in terms of the
+dependence on the target dimension $k$, and matches a classical upper bound due
+to Maurer (2016). Second, we present a black-box conversion from general
+$d$-dimensional Stochastic Convex Optimization (SCO) to vector-valued linear
+prediction, showing that any SCO problem can be embedded as a prediction
+problem with $k=\Theta(d)$ outputs. These results portray the setting of
+vector-valued linear prediction as bridging between two extensively studied yet
+disparate learning models: linear models (corresponding to $k=1$) and general
+$d$-dimensional SCO (with $k=\Theta(d)$).
+
+
+ We propose to learn legged robot locomotion skills by watching thousands of
+wild animal videos from the internet, such as those featured in nature
+documentaries. Indeed, such videos offer a rich and diverse collection of
+plausible motion examples, which could inform how robots should move. To
+achieve this, we introduce Reinforcement Learning from Wild Animal Videos
+(RLWAV), a method to ground these motions into physical robots. We first train
+a video classifier on a large-scale animal video dataset to recognize actions
+from RGB clips of animals in their natural habitats. We then train a
+multi-skill policy to control a robot in a physics simulator, using the
+classification score of a third-person camera capturing videos of the robot's
+movements as a reward for reinforcement learning. Finally, we directly transfer
+the learned policy to a real quadruped Solo. Remarkably, despite the extreme
+gap in both domain and embodiment between animals in the wild and robots, our
+approach enables the policy to learn diverse skills such as walking, jumping,
+and keeping still, without relying on reference trajectories or skill-specific
+rewards.
+
+
+
+
+
+
+
+ ☆ SynFinTabs: A Dataset of Synthetic Financial Tables for Information and
+ Table Extraction
+
+
+
+
+
+
+
+
+ Ethan Bradley, Muhammad Roman, Karen Rafferty, Barry Devereux
+
+
+ Table extraction from document images is a challenging AI problem, and
+labelled data for many content domains is difficult to come by. Existing table
+extraction datasets often focus on scientific tables due to the vast amount of
+academic articles that are readily available, along with their source code.
+However, there are significant layout and typographical differences between
+tables found across scientific, financial, and other domains. Current datasets
+often lack the words, and their positions, contained within the tables, instead
+relying on unreliable OCR to extract these features for training modern machine
+learning models on natural language processing tasks. Therefore, there is a
+need for a more general method of obtaining labelled data. We present
+SynFinTabs, a large-scale, labelled dataset of synthetic financial tables. Our
+hope is that our method of generating these synthetic tables is transferable to
+other domains. To demonstrate the effectiveness of our dataset in training
+models to extract information from table images, we create FinTabQA, a layout
+large language model trained on an extractive question-answering task. We test
+our model using real-world financial tables and compare it to a
+state-of-the-art generative model and discuss the results. We make the dataset,
+model, and dataset generation code publicly available.
+
+
+ As command-line interfaces remain an integral part of high-computation
+environments, the risk of exploitation through stealthy, complex command-line
+abuse continues to grow. Conventional security solutions often struggle with
+these command-line-based anomalies due to their context-specific nature and
+lack of labeled data, especially in detecting rare, malicious patterns amidst
+legitimate, high-volume activity. This gap has left organizations vulnerable to
+sophisticated threats like Living-off-the-Land (LOL) attacks, where standard
+detection tools frequently miss or misclassify anomalous command-line behavior.
+We introduce the Scalable Command-Line Anomaly Detection Engine (SCADE), which
+addresses these challenges with a dual-layered detection framework
+that combines a global statistical analysis with local context-specific anomaly
+detection, using a novel ensemble of statistical models such as
+BM25 and Log Entropy, adapted for command-line data. The framework also
+features a dynamic thresholding mechanism for adaptive anomaly detection,
+ensuring high precision and recall even in environments with extremely high
+Signal-to-Noise Ratios (SNRs). Initial experimental results demonstrate the
+effectiveness of the framework, achieving above 98% SNR in identifying unusual
+command-line behavior while minimizing false positives. In this paper, we
+present SCADE's core architecture, including its metadata-enriched approach to
+anomaly detection and the design choices behind its scalability for
+enterprise-level deployment. We argue that SCADE represents a significant
+advancement in command-line anomaly detection, offering a robust, adaptive
+framework for security analysts and researchers seeking to enhance detection
+accuracy in high-computation environments.
+
+
+
+
+
+
+
+ ☆ Quantifying the Limits of Segment Anything Model: Analyzing Challenges
+ in Segmenting Tree-Like and Low-Contrast Structures
+
+
+
+
+
+
+
+
+ Yixin Zhang, Nicholas Konz, Kevin Kramer, Maciej A. Mazurowski
+
+
+ Segment Anything Model (SAM) has shown impressive performance in interactive
+and zero-shot segmentation across diverse domains, suggesting that it has
+learned a general concept of "objects" from its large-scale training.
+However, we observed that SAM struggles with certain types of objects,
+particularly those featuring dense, tree-like structures and low textural
+contrast from their surroundings. These failure modes are critical for
+understanding its limitations in real-world use. In order to systematically
+examine this issue, we propose metrics to quantify two key object
+characteristics: tree-likeness and textural separability. Through extensive
+controlled synthetic experiments and testing on real datasets, we demonstrate
+that SAM's performance is noticeably correlated with these factors. We link
+these behaviors under the concept of "textural confusion", where SAM
+misinterprets local structure as global texture, leading to over-segmentation,
+or struggles to differentiate objects from similarly textured backgrounds.
+These findings offer the first quantitative framework to model SAM's
+challenges, providing valuable insights into its limitations and guiding future
+improvements for vision foundation models.
+
+
+ In this work, we propose a latent molecular diffusion model that can make the
+generated 3D molecules rich in diversity and maintain rich geometric features.
+The model captures the information of the forces and local constraints between
+atoms so that the generated molecules can maintain Euclidean transformation and
+high level of effectiveness and diversity. We also use the lower-rank manifold
+advantage of the latent variables of the latent model to fuse the information
+of the forces between atoms to better maintain the geometric equivariant
+properties of the molecules. Because there is no need to perform information
+fusion encoding in stages like traditional encoders and decoders, this reduces
+the amount of calculation in the back-propagation process. The model keeps the
+forces and local constraints of particle bonds in the latent variable space,
+reducing the impact of underfitting on the surface of the network on the large
+position drift of the particle geometry, so that our model can converge
+earlier. We introduce a distribution control variable in each backward step to
+strengthen exploration and improve the diversity of generation. In the
+experiment, the quality of the samples we generated and the convergence speed
+of the model have been significantly improved.
+
+
+
+ comment: arXiv admin note: text overlap with arXiv:2209.05710 by other authors
+
+
+
+
+
+
+ ☆ A History of Philosophy in Colombia through Topic Modelling
+
+
+
+
+
+
+
+
+ Juan R. Loaiza, Miguel González-Duque
+
+
+ Data-driven approaches to philosophy have emerged as a valuable tool for
+studying the history of the discipline. However, most studies in this area have
+focused on a limited number of journals from specific regions and subfields. We
+expand the scope of this research by applying dynamic topic modelling
+techniques to explore the history of philosophy in Colombia and Latin America.
+Our study examines the Colombian philosophy journal Ideas y Valores, founded in
+1951 and currently one of the most influential academic philosophy journals in
+the region. By analyzing the evolution of topics across the journal's history,
+we identify various trends and specific dynamics in philosophical discourse
+within the Colombian and Latin American context. Our findings reveal that the
+most prominent topics are value theory (including ethics, political philosophy,
+and aesthetics), epistemology, and the philosophy of science. We also trace the
+evolution of articles focusing on the historical and interpretive aspects of
+philosophical texts, and we observe a notable emphasis on German philosophers such
+as Kant, Husserl, and Hegel on various topics throughout the journal's
+lifetime. Additionally, we investigate whether articles with a historical focus
+have decreased over time due to editorial pressures. Our analysis suggests no
+significant decline in such articles. Finally, we propose ideas for extending
+this research to other Latin American journals and suggest improvements for
+natural language processing workflows in non-English languages.
+
+
+
+
+
+
+
+
+ Kale-ab Abebe Tessera, Arrasy Rahman, Stefano V. Albrecht
+
+
+ Balancing individual specialisation and shared behaviours is a critical
+challenge in multi-agent reinforcement learning (MARL). Existing methods
+typically focus on encouraging diversity or leveraging shared representations.
+Full parameter sharing (FuPS) improves sample efficiency but struggles to learn
+diverse behaviours when required, while no parameter sharing (NoPS) enables
+diversity but is computationally expensive and sample inefficient. To address
+these challenges, we introduce HyperMARL, a novel approach using hypernetworks
+to balance efficiency and specialisation. HyperMARL generates agent-specific
+actor and critic parameters, enabling agents to adaptively exhibit diverse or
+homogeneous behaviours as needed, without modifying the learning objective or
+requiring prior knowledge of the optimal diversity. Furthermore, HyperMARL
+decouples agent-specific and state-based gradients, which empirically
+correlates with reduced policy gradient variance, potentially offering insights
+into its ability to capture diverse behaviours. Across MARL benchmarks
+requiring homogeneous, heterogeneous, or mixed behaviours, HyperMARL
+consistently matches or outperforms FuPS, NoPS, and diversity-focused methods,
+achieving NoPS-level diversity with a shared architecture. These results
+highlight the potential of hypernetworks as a versatile approach to the
+trade-off between specialisation and shared behaviours in MARL.
+
+
+
+
+
+
+
+ ☆ Foundations of the Theory of Performance-Based Ranking
+
+
+
+
+
+
+
+
+ Sébastien Piérard, Anaïs Halin, Anthony Cioppa, Adrien Deliège, Marc Van Droogenbroeck
+
+
+ Ranking entities such as algorithms, devices, methods, or models based on
+their performances, while accounting for application-specific preferences, is a
+challenge. To address this challenge, we establish the foundations of a
+universal theory for performance-based ranking. First, we introduce a rigorous
+framework built on top of both the probability and order theories. Our new
+framework encompasses the elements necessary to (1) manipulate performances as
+mathematical objects, (2) express which performances are worse than or
+equivalent to others, (3) model tasks through a variable called satisfaction,
+(4) consider properties of the evaluation, (5) define scores, and (6) specify
+application-specific preferences through a variable called importance. On top
+of this framework, we propose the first axiomatic definition of performance
+orderings and performance-based rankings. Then, we introduce a universal
+parametric family of scores, called ranking scores, that can be used to
+establish rankings satisfying our axioms, while considering
+application-specific preferences. Finally, we show, in the case of two-class
+classification, that the family of ranking scores encompasses well-known
+performance scores, including the accuracy, the true positive rate (recall,
+sensitivity), the true negative rate (specificity), the positive predictive
+value (precision), and F1. However, we also show that some other scores
+commonly used to compare classifiers are unsuitable to derive performance
+orderings satisfying the axioms. Therefore, this paper provides the computer
+vision and machine learning communities with a rigorous framework for
+evaluating and ranking entities.
+
+
+
+
+
+
+
+ ☆ Physics-informed Deep Learning for Muscle Force Prediction with
+ Unlabeled sEMG Signals
+
+
+
+
+
+
+
+
+ Shuhao Ma, Jie Zhang, Chaoyang Shi, Pei Di, Ian D. Robertson, Zhi-Qiang Zhang
+
+
+ Computational biomechanical analysis plays a pivotal role in understanding
+and improving human movements and physical functions. Although physics-based
+modeling methods can interpret the dynamic interaction between the neural drive
+to muscle dynamics and joint kinematics, they suffer from high computational
+latency. In recent years, data-driven methods have emerged as a promising
+alternative due to their fast execution speed, but label information is still
+required during training, which is not easy to acquire in practice. To tackle
+these issues, this paper presents a novel physics-informed deep learning method
+to predict muscle forces without any label information during model training.
+In addition, the proposed method could also identify personalized muscle-tendon
+parameters. To achieve this, the Hill muscle model-based forward dynamics is
+embedded into the deep neural network as the additional loss to further
+regulate the behavior of the deep neural network. Experimental validations on
+the wrist joint from six healthy subjects are performed, and a fully connected
+neural network (FNN) is selected to implement the proposed method. The
+predicted results of muscle forces show comparable or even lower root mean
+square error (RMSE) and higher coefficient of determination compared with
+baseline methods, which have to use the labeled surface electromyography (sEMG)
+signals, and it can also identify muscle-tendon parameters accurately,
+demonstrating the effectiveness of the proposed physics-informed deep learning
+method.
+
+
+ Adaptive networks today rely on overparameterized fixed topologies that
+cannot break through the statistical conflicts they encounter in the data they
+are exposed to, and are prone to "catastrophic forgetting" as the network
+attempts to reuse the existing structures to learn new tasks. We propose a
+structural adaptation method, DIRAD, that can complexify as needed and in a
+directed manner without being limited by statistical conflicts within a
+dataset. We then extend this method and present the PREVAL framework, designed
+to prevent "catastrophic forgetting" in continual learning by detection of new
+data and assigning encountered data to suitable models adapted to process them,
+without needing task labels anywhere in the workflow. We show the reliability
+of DIRAD in growing a network with high performance and orders-of-magnitude
+simpler than fixed topology networks; and demonstrate the proof-of-concept
+operation of PREVAL, in which continual adaptation to new tasks is observed
+while being able to detect and discern previously-encountered tasks.
+
+
+
+ comment: Presented in Deployable AI (DAI) workshop at AAAI-2024
+
+
+
+
+
+
+ ☆ Linear Discriminant Analysis in Credit Scoring: A Transparent Hybrid
+ Model Approach
+
+
+
+
+
+
+
+
+ Md Shihab Reza, Monirul Islam Mahmud, Ifti Azad Abeer, Nova Ahmed
+
+
+ The development of computing has made credit scoring approaches possible,
+with various machine learning (ML) and deep learning (DL) techniques becoming
+more and more valuable. While complex models yield more accurate predictions,
+their interpretability is often weakened, which is a concern for credit scoring
+that places importance on decision fairness. As features of the dataset are a
+crucial factor for the credit scoring system, we implement Linear Discriminant
+Analysis (LDA) as a feature reduction technique, which reduces the burden of
+the model's complexity. We compared 6 different machine learning models, 1 deep
+learning model, and a hybrid model with and without using LDA. From the result,
+we have found our hybrid model, XG-DNN, outperformed other models with the
+highest accuracy of 99.45% and a 99% F1 score with LDA. Lastly, to interpret
+model decisions, we have applied 2 different explainable AI techniques named
+LIME (local) and Morris Sensitivity Analysis (global). Through this research,
+we showed how feature reduction techniques can be used without affecting the
+performance and explainability of the model, which can be very useful in
+resource-constrained settings to optimize the computational workload.
+
+
+
+ comment: Accepted at the International Conference on Computer and Information
+ Technology (ICCIT) 2024
+
+
+
+
+
+
+ ☆ SKIM: Any-bit Quantization Pushing The Limits of Post-Training
+ Quantization
+
+
+ Large Language Models (LLMs) exhibit impressive performance across various
+tasks, but deploying them for inference poses challenges. Their high resource
+demands often necessitate complex, costly multi-GPU pipelines, or the use of
+smaller, less capable models. While quantization offers a promising solution
+utilizing lower precision for model storage, existing methods frequently
+experience significant performance drops at lower precision levels.
+Additionally, they typically provide only a limited set of solutions at
+specific bit levels, many of which are extensively manually tuned. To address
+these challenges, we propose a new method called SKIM: Scaled K-means
+clustering wIth Mixed precision. Our approach introduces two novel techniques:
+1. A greedy algorithm to solve approximately optimal bit allocation across
+weight channels, and 2. A trainable scaling vector for non-differentiable
+K-means clustering. These techniques substantially improve performance and can
+be adapted to any given bit. Notably, in terms of model perplexity, our method
+narrows the gap between 3-bit quantized LLaMA models and their full precision
+counterparts by 16.3% on average.
+
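As a rough illustration of the clustering idea mentioned in the abstract above, the sketch below quantizes a list of scalar weights with plain 1-D k-means (Lloyd's algorithm), so that each weight is stored as an index into at most 2**bits centroids. This is not SKIM itself: the greedy per-channel bit allocation and the trainable scaling vector are not reproduced, and the function names (`kmeans_quantize`, `dequantize`, `mse`) are hypothetical.

```python
from typing import List, Tuple


def _nearest(w: float, centroids: List[float]) -> int:
    """Index of the centroid closest to w (ties go to the lower index)."""
    return min(range(len(centroids)), key=lambda c: (w - centroids[c]) ** 2)


def kmeans_quantize(weights: List[float], bits: int, iters: int = 25) -> Tuple[List[float], List[int]]:
    """Cluster scalar weights into at most 2**bits centroids (plain 1-D k-means)."""
    k = min(2 ** bits, len(set(weights)))
    ordered = sorted(weights)
    n = len(ordered)
    # Deterministic initialisation: spread the centroids over the sorted weights.
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        assign = [_nearest(w, centroids) for w in weights]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [w for w, a in zip(weights, assign) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    # Final assignment against the updated centroids.
    assign = [_nearest(w, centroids) for w in weights]
    return centroids, assign


def dequantize(centroids: List[float], assign: List[int]) -> List[float]:
    """Reconstruct approximate weights from the codebook and the indices."""
    return [centroids[a] for a in assign]


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared reconstruction error."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
```

Because the codebook size is a free parameter, the same routine works for any bit width; fractional average bit widths would additionally need a per-channel allocation step, which is omitted here.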
+
+
+
+
+
+
+ ☆ Multi-Layer Privacy-Preserving Record Linkage with Clerical Review based
+ on gradual information disclosure
+
+
+
+
+
+
+
+
+ Florens Rohde, Victor Christen, Martin Franke, Erhard Rahm
+
+
+ Privacy-Preserving Record linkage (PPRL) is an essential component in data
+integration tasks of sensitive information. The linkage quality determines the
+usability of combined datasets and (machine learning) applications based on
+them. We present a novel privacy-preserving protocol that integrates clerical
+review in PPRL using a multi-layer active learning process. Uncertain match
+candidates are reviewed on several layers by human and non-human oracles to
+reduce the amount of disclosed information per record and in total. Predictions
+are propagated back to update previous layers, resulting in an improved linkage
+performance for non-reviewed candidates as well. The data owners remain in
+control of the amount of information they share for each record. Therefore, our
+approach follows need-to-know and data sovereignty principles. The experimental
+evaluation on real-world datasets shows considerable linkage quality
+improvements with limited labeling effort and privacy risks.
+
+
+
+ comment: Accepted at 21st Conference on Database Systems for Business,
+ Technology and Web (BTW)
+
+
+
+
+
+
+ ☆ Fixed-Mean Gaussian Processes for Post-hoc Bayesian Deep Learning
+
+
+
+
+
+
+
+
+ Luis A. Ortega, Simón Rodríguez-Santana, Daniel Hernández-Lobato
+
+
+ Recently, there has been an increasing interest in performing post-hoc
+uncertainty estimation about the predictions of pre-trained deep neural
+networks (DNNs). Given a pre-trained DNN via back-propagation, these methods
+enhance the original network by adding output confidence measures, such as
+error bars, without compromising its initial accuracy. In this context, we
+introduce a novel family of sparse variational Gaussian processes (GPs), where
+the posterior mean is fixed to any continuous function when using a universal
+kernel. Specifically, we fix the mean of this GP to the output of the
+pre-trained DNN, allowing our approach to effectively fit the GP's predictive
+variances to estimate the DNN prediction uncertainty. Our approach leverages
+variational inference (VI) for efficient stochastic optimization, with training
+costs that remain independent of the number of training points, scaling
+efficiently to large datasets such as ImageNet. The proposed method, called
+fixed mean GP (FMGP), is architecture-agnostic, relying solely on the
+pre-trained model's outputs to adjust the predictive variances. Experimental
+results demonstrate that FMGP improves both uncertainty estimation and
+computational efficiency when compared to state-of-the-art methods.
+
+
+
+ comment: 12 pages, 6 figures and 2 tables. Submitted to IEEE TRANSACTIONS ON
+ PATTERN ANALYSIS AND MACHINE INTELLIGENCE
+
+
+
+
+
+
+ ☆ An In-Depth Examination of Risk Assessment in Multi-Class Classification
+ Algorithms
+
+
+
+
+
+
+
+
+ Disha Ghandwani, Neeraj Sarna, Yuanyuan Li, Yang Lin
+
+
+ Advanced classification algorithms are being increasingly used in
+safety-critical applications like healthcare, engineering, etc. In such
+applications, misclassifications made by ML algorithms can result in
+substantial financial or health-related losses. To better anticipate and
+prepare for such losses, the algorithm user seeks an estimate for the
+probability that the algorithm misclassifies a sample. We refer to this task
+as the risk-assessment. For a variety of models and datasets, we numerically
+analyze the performance of different methods in solving the risk-assessment
+problem. We consider two solution strategies: a) calibration techniques that
+calibrate the output probabilities of classification models to provide accurate
+probability outputs; and b) a novel approach based upon the prediction interval
+generation technique of conformal prediction. Our conformal prediction based
+approach is model and data-distribution agnostic, simple to implement, and
+provides reasonable results for a variety of use-cases. We compare the
+different methods on a broad variety of models and datasets.
+
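For readers unfamiliar with conformal prediction, the following is a minimal sketch of the standard split-conformal recipe for classification. It is an assumption-laden illustration, not the authors' risk estimator: the nonconformity score (one minus the probability of the true class) and the function names are choices made here for the example.

```python
import math
from typing import List, Sequence


def conformal_threshold(cal_probs: Sequence[Sequence[float]],
                        cal_labels: Sequence[int],
                        alpha: float) -> float:
    """Calibrate a score threshold on held-out data for target miscoverage alpha."""
    # Nonconformity score: 1 - predicted probability of the true class.
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank of the quantile to use.
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: every label is kept.
        return float("inf")
    return scores[rank - 1]


def prediction_set(probs: Sequence[float], qhat: float) -> List[int]:
    """All labels whose nonconformity score does not exceed the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= qhat]
```

A prediction set containing more than one label marks a sample on which the classifier is ambiguous at the chosen level, which is one simple signal a risk-assessment procedure could build on.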
+
+
+
+
+
+
+ ☆ On the Lack of Robustness of Binary Function Similarity Systems
+
+
+
+
+
+
+
+
+ Gianluca Capozzi, Tong Tang, Jie Wan, Ziqi Yang, Daniele Cono D'Elia, Giuseppe Antonio Di Luna, Lorenzo Cavallaro, Leonardo Querzoni
+
+
+ Binary function similarity, which often relies on learning-based algorithms
+to identify what functions in a pool are most similar to a given query
+function, is a sought-after topic in different communities, including machine
+learning, software engineering, and security. Its importance stems from the
+impact it has in facilitating several crucial tasks, from reverse engineering
+and malware analysis to automated vulnerability detection. Whereas recent work
+has shed light on performance on this long-studied problem, the research
+landscape remains largely lacking in understanding the resiliency of
+state-of-the-art machine learning models against adversarial attacks. As
+security requires reasoning about adversaries, in this work we assess the
+robustness of such models through a simple yet effective black-box greedy
+attack, which modifies the topology and the content of the control flow of the
+attacked functions. We demonstrate that this attack is successful in
+compromising all the models, achieving average attack success rates of 57.06%
+and 95.81% depending on the problem settings (targeted and untargeted attacks).
+Our findings are insightful: top performance on clean data does not necessarily
+relate to top robustness properties, which explicitly highlights
+performance-robustness trade-offs one should consider when deploying such
+models, calling for further research.
+
+
+
+
+
+
+
+ ☆ LossVal: Efficient Data Valuation for Neural Networks
+
+
+
+
+
+
+
+
+ Tim Wibiral, Mohamed Karim Belaid, Maximilian Rabus, Ansgar Scherp
+
+
+ Assessing the importance of individual training samples is a key challenge in
+machine learning. Traditional approaches retrain models with and without
+specific samples, which is computationally expensive and ignores dependencies
+between data points. We introduce LossVal, an efficient data valuation method
+that computes importance scores during neural network training by embedding a
+self-weighting mechanism into loss functions like cross-entropy and mean
+squared error. LossVal reduces computational costs, making it suitable for
+large datasets and practical applications. Experiments on classification and
+regression tasks across multiple datasets show that LossVal effectively
+identifies noisy samples and is able to distinguish helpful from harmful
+samples. We examine the gradient calculation of LossVal to highlight its
+advantages. The source code is available at:
+https://github.com/twibiral/LossVal
+
+
+
+
+
+
+
+ ☆ Non-Asymptotic Bounds for Closed-Loop Identification of Unstable
+ Nonlinear Stochastic Systems
+
+
+
+
+
+
+
+
+ Seth Siriya, Jingge Zhu, Dragan Nešić, Ye Pu
+
+
+ We consider the problem of least squares parameter estimation from
+single-trajectory data for discrete-time, unstable, closed-loop nonlinear
+stochastic systems, with linearly parameterised uncertainty. Assuming a region
+of the state space produces informative data, and the system is
+sub-exponentially unstable, we establish non-asymptotic guarantees on the
+estimation error at times where the state trajectory evolves in this region. If
+the whole state space is informative, high probability guarantees on the error
+hold for all times. Examples are provided where our results are useful for
+analysis, but existing results are not.
+
+
+
+ comment: 21 pages, 2 figures
+
+
+
+
+
+
+ ☆ MultiTASC++: A Continuously Adaptive Scheduler for Edge-Based
+ Multi-Device Cascade Inference
+
+
+
+
+
+
+
+
+ Sokratis Nikolaidis, Stylianos I. Venieris, Iakovos S. Venieris
+
+
+ Cascade systems, consisting of a lightweight model processing all samples and
+a heavier, high-accuracy model refining challenging samples, have become a
+widely-adopted distributed inference approach to achieving high accuracy and
+maintaining a low computational burden for mobile and IoT devices. As
+intelligent indoor environments, like smart homes, continue to expand, a new
+scenario emerges, the multi-device cascade. In this setting, multiple diverse
+devices simultaneously utilize a shared heavy model hosted on a server, often
+situated within or close to the consumer environment. This work introduces
+MultiTASC++, a continuously adaptive multi-tenancy-aware scheduler that
+dynamically controls the forwarding decision functions of devices to optimize
+system throughput while maintaining high accuracy and low latency. Through
+extensive experimentation in diverse device environments and with varying
+server-side models, we demonstrate the scheduler's efficacy in consistently
+maintaining a targeted satisfaction rate while providing the highest available
+accuracy across different device tiers and workloads of up to 100 devices. This
+demonstrates its scalability and efficiency in addressing the unique challenges
+of collaborative DNN inference in dynamic and diverse IoT environments.
+
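The cascade setting described above can be pictured with a confidence-based forwarding rule: a device keeps its own prediction when the light model is confident and otherwise sends the sample to the shared heavy model. The sketch below shows such a rule together with a purely hypothetical threshold controller; the abstract does not specify MultiTASC++'s actual decision functions or update law, so `should_forward`, `adjust_threshold`, and the fixed step size are illustrative assumptions.

```python
from typing import Sequence


def should_forward(probs: Sequence[float], threshold: float) -> bool:
    """Forward the sample to the server model when top-1 confidence is low."""
    return max(probs) < threshold


def adjust_threshold(threshold: float,
                     observed_latency_ms: float,
                     target_latency_ms: float,
                     step: float = 0.05) -> float:
    """Hypothetical controller: forward less when the server is overloaded."""
    if observed_latency_ms > target_latency_ms:
        threshold -= step  # fewer samples fall below the threshold
    else:
        threshold += step  # spare capacity: let more samples be refined
    # Keep the threshold a valid confidence level.
    return min(1.0, max(0.0, threshold))
```

Lowering the threshold trades accuracy for latency, since fewer hard samples reach the heavy model; a multi-device scheduler would run some form of this adjustment per device.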
+
+
+
+
+
+
+ ☆ Understanding Memorization in Generative Models via Sharpness in
+ Probability Landscapes
+
+
+
+
+
+
+
+
+ Dongjae Jeon, Dueun Kim, Albert No
+
+
+ In this paper, we introduce a geometric framework to analyze memorization in
+diffusion models using the eigenvalues of the Hessian of the log probability
+density. We propose that memorization arises from isolated points in the
+learned probability distribution, characterized by sharpness in the probability
+landscape, as indicated by large negative eigenvalues of the Hessian. Through
+experiments on various datasets, we demonstrate that these eigenvalues
+effectively detect and quantify memorization. Our approach provides a clear
+understanding of memorization in diffusion models and lays the groundwork for
+developing strategies to ensure secure and reliable generative models.
+
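To see why sharpness shows up as large negative Hessian eigenvalues, consider the simplest case in which the learned density is locally Gaussian around a point (an assumption made here for illustration, not a claim from the paper), with mean $\mu$ and covariance $\Sigma$ whose eigenvalues are $\sigma_i^2$:

```latex
\log p(x) = -\tfrac{1}{2}\,(x-\mu)^\top \Sigma^{-1} (x-\mu) + \text{const},
\qquad
\nabla_x^2 \log p(x) = -\Sigma^{-1},
\qquad
\lambda_i\!\left(\nabla_x^2 \log p\right) = -\frac{1}{\sigma_i^2}.
```

As the density concentrates on an isolated point, $\sigma_i^2 \to 0$ and every eigenvalue tends to $-\infty$, whereas a broad mode has eigenvalues close to zero.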
+
+
+
+
+
+
+ ☆ Text Change Detection in Multilingual Documents Using Image Comparison
+
+
+
+
+
+
+
+
+ Doyoung Park, Naresh Reddy Yarram, Sunjin Kim, Minkyu Kim, Seongho Cho, Taehee Lee
+
+
+ Document comparison typically relies on optical character recognition (OCR)
+as its core technology. However, OCR requires the selection of appropriate
+language models for each document and the performance of multilingual or hybrid
+models remains limited. To overcome these challenges, we propose text change
+detection (TCD) using an image comparison model tailored for multilingual
+documents. Unlike OCR-based approaches, our method employs word-level text
+image-to-image comparison to detect changes. Our model generates bidirectional
+change segmentation maps between the source and target documents. To enhance
+performance without requiring explicit text alignment or scaling preprocessing,
+we employ correlations among multi-scale attention features. We also construct
+a benchmark dataset comprising actual printed and scanned word pairs in various
+languages to evaluate our model. We validate our approach using our benchmark
+dataset and public benchmarks Distorted Document Images and the LRDE Document
+Binarization Dataset. We compare our model against state-of-the-art semantic
+segmentation and change detection models, as well as conventional OCR-based
+models.
+
+
+ Multiphysics simulation, which models the interactions between multiple
+physical processes, and multi-component simulation of complex structures are
+critical in fields like nuclear and aerospace engineering. Previous studies
+often rely on numerical solvers or machine learning-based surrogate models to
+solve or accelerate these simulations. However, multiphysics simulations
+typically require integrating multiple specialized solvers, each responsible for
+evolving a specific physical process, into a coupled program, which introduces
+significant development challenges. Furthermore, no universal algorithm exists
+for multi-component simulations, which adds to the complexity. Here we propose
+compositional Multiphysics and Multi-component Simulation with Diffusion models
+(MultiSimDiff) to overcome these challenges. During diffusion-based training,
+MultiSimDiff learns energy functions modeling the conditional probability of
+one physical process/component conditioned on other processes/components. In
+inference, MultiSimDiff generates coupled multiphysics solutions and
+multi-component structures by sampling from the joint probability distribution,
+achieved by composing the learned energy functions in a structured way. We test
+our method in three tasks. In the reaction-diffusion and nuclear thermal
+coupling problems, MultiSimDiff successfully predicts the coupling solution
+using decoupled data, while the surrogate model fails in the more complex
+second problem. For the thermal and mechanical analysis of the prismatic fuel
+element, MultiSimDiff trained for single component prediction accurately
+predicts a larger structure with 64 components, reducing the relative error by
+40.3% compared to the surrogate model.
+
+
+
+ comment: 30 pages, 13 figures
+
+
+
+
+
+
+ ☆ DeepFEA: Deep Learning for Prediction of Transient Finite Element
+ Analysis Solutions
+
+
+
+
+
+
+
+
+ Georgios Triantafyllou, Panagiotis G. Kalozoumis, George Dimas, Dimitris K. Iakovidis
+
+
+ Finite Element Analysis (FEA) is a powerful but computationally intensive
+method for simulating physical phenomena. Recent advancements in machine
+learning have led to surrogate models capable of accelerating FEA. Yet there
+are still limitations in developing surrogates of transient FEA models that can
+simultaneously predict the solutions for both nodes and elements with
+applicability on both the 2D and 3D domains. Motivated by this research gap,
+this study proposes DeepFEA, a deep learning-based framework that leverages a
+multilayer Convolutional Long Short-Term Memory (ConvLSTM) network branching
+into two parallel convolutional neural networks to predict the solutions for
+both nodes and elements of FEA models. The proposed network is optimized using
+a novel adaptive learning algorithm, called Node-Element Loss Optimization
+(NELO). NELO minimizes the error occurring at both branches of the network
+enabling the prediction of solutions for transient FEA simulations. The
+experimental evaluation of DeepFEA is performed on three datasets in the
+context of structural mechanics, generated to serve as publicly available
+reference datasets. The results show that DeepFEA can achieve less than 3%
+normalized mean and root mean squared error for 2D and 3D simulation scenarios,
+and inference times that are two orders of magnitude faster than FEA. In
+contrast, relevant state-of-the-art methods face challenges with
+multi-dimensional output and dynamic input prediction. Furthermore, DeepFEA's
+robustness was demonstrated in a real-life biomedical scenario, confirming its
+suitability for accurate and efficient predictions of FEA simulations.
+
+
+
+ comment: This work has been submitted to a journal for possible publication
+
+
+
+
+
+
+ ☆ Missing Melodies: AI Music Generation and its "Nearly" Complete Omission
+ of the Global South
+
+
+ Recent advances in generative AI have sparked renewed interest and expanded
+possibilities for music generation. However, the performance and versatility of
+these systems across musical genres are heavily influenced by the availability
+of training data. We conducted an extensive analysis of over one million hours
+of audio datasets used in AI music generation research and manually reviewed
+more than 200 papers from eleven prominent AI and music conferences and
+organizations (AAAI, ACM, EUSIPCO, EURASIP, ICASSP, ICML, IJCAI, ISMIR,
+NeurIPS, NIME, SMC) to identify a critical gap in the fair representation and
+inclusion of the musical genres of the Global South in AI research. Our
+findings reveal a stark imbalance: approximately 86% of the total dataset hours
+and over 93% of researchers focus primarily on music from the Global North.
+Although around 40% of these datasets include some form of non-Western music,
+genres from the Global South account for only 14.6% of the data. Furthermore,
+approximately 51% of the papers surveyed concentrate on symbolic music
+generation, a method that often fails to capture the cultural nuances inherent
+in music from regions such as South Asia, the Middle East, and Africa. As AI
+increasingly shapes the creation and dissemination of music, the significant
+underrepresentation of music genres in datasets and research presents a serious
+threat to global musical diversity. We also propose some important steps to
+mitigate these risks and foster a more inclusive future for AI-driven music
+generation.
+
+
+
+ comment: Submitted to CACM, 12 pages, 2 figures
+
+
+
+
+
+
+
+ Hamid Gadirov, Qi Wu, David Bauer, Kwan-Liu Ma, Jos Roerdink, Steffen Frey
+
+
+ We present HyperFLINT (Hypernetwork-based FLow estimation and temporal
+INTerpolation), a novel deep learning-based approach for estimating flow
+fields, temporally interpolating scalar fields, and facilitating parameter
+space exploration in spatio-temporal scientific ensemble data. This work
+addresses the critical need to explicitly incorporate ensemble parameters into
+the learning process, as traditional methods often neglect these, limiting
+their ability to adapt to diverse simulation settings and provide meaningful
+insights into the data dynamics. HyperFLINT introduces a hypernetwork to
+account for simulation parameters, enabling it to generate accurate
+interpolations and flow fields for each timestep by dynamically adapting to
+varying conditions, thereby outperforming existing parameter-agnostic
+approaches. The architecture features modular neural blocks with convolutional
+and deconvolutional layers, supported by a hypernetwork that generates weights
+for the main network, allowing the model to better capture intricate simulation
+dynamics. A series of experiments demonstrates HyperFLINT's significantly
+improved performance in flow field estimation and temporal interpolation, as
+well as its potential in enabling parameter space exploration, offering
+valuable insights into complex scientific ensembles.
+
+
+ Symmetric nonnegative matrix factorization (SymNMF) is a powerful tool for
+clustering, which typically uses the $k$-nearest neighbor ($k$-NN) method to
+construct the similarity matrix. However, $k$-NN may mislead clustering since the
+neighbors may belong to different clusters, and its reliability generally
+decreases as $k$ grows. In this paper, we construct the similarity matrix as a
+weighted $k$-NN graph with learnable weight that reflects the reliability of
+each $k$-th NN. This approach reduces the search space of the similarity matrix
+learning to $n - 1$ dimension, as opposed to the $\mathcal{O}(n^2)$ dimension
+of existing methods, where $n$ represents the number of samples. Moreover, to
+obtain a discriminative similarity matrix, we introduce a dissimilarity matrix
+with a dual structure of the similarity matrix, and propose a new form of
+orthogonality regularization with discussions on its geometric interpretation
+and numerical stability. An efficient alternative optimization algorithm is
+designed to solve the proposed model, with a theoretical guarantee that the
+variables converge to a stationary point that satisfies the KKT conditions. The
+advantage of the proposed model is demonstrated by the comparison with nine
+state-of-the-art clustering methods on eight datasets. The code is available at
+https://github.com/lwl-learning/LSDGSymNMF.
+
+
+
+ comment: 12 pages, 14 figures
+
+
+
+
+
+
+ ☆ Federated Learning in Mobile Networks: A Comprehensive Case Study on
+ Traffic Forecasting
+
+
+
+
+
+
+
+
+ Nikolaos Pavlidis, Vasileios Perifanis, Selim F. Yilmaz, Francesc Wilhelmi, Marco Miozzo, Pavlos S. Efraimidis, Remous-Aris Koutsiamanis, Pavol Mulinka, Paolo Dini
+
+
+ The increasing demand for efficient resource allocation in mobile networks
+has catalyzed the exploration of innovative solutions that could enhance the
+task of real-time cellular traffic prediction. Under these circumstances,
+federated learning (FL) stands out as a distributed and privacy-preserving
+solution to foster collaboration among different sites, thus enabling
+responsive near-the-edge solutions. In this paper, we comprehensively study the
+potential benefits of FL in telecommunications through a case study on
+federated traffic forecasting using real-world data from base stations (BSs) in
+Barcelona (Spain). Our study encompasses relevant aspects within the federated
+experience, including model aggregation techniques, outlier management, the
+impact of individual clients, personalized learning, and the integration of
+exogenous sources of data. The performed evaluation is based on both prediction
+accuracy and sustainability, thus showcasing the environmental impact of
+employed FL algorithms in various settings. The findings from our study
+highlight FL as a promising and robust solution for mobile traffic prediction,
+emphasizing its twin merits as a privacy-conscious and environmentally
+sustainable approach, while also demonstrating its capability to overcome data
+heterogeneity and ensure high-quality predictions, marking a significant stride
+towards its integration in mobile traffic management systems.
+
+
+
+
+
+
+
+ ☆ Towards Generalizable Autonomous Penetration Testing via Domain
+ Randomization and Meta-Reinforcement Learning
+
+
+ With increasing numbers of vulnerabilities exposed on the internet,
+autonomous penetration testing (pentesting) has become an emerging research
+area, while reinforcement learning (RL) is a natural fit for studying
+autonomous pentesting. Previous research in RL-based autonomous pentesting
+mainly focused on enhancing agents' learning efficacy within abstract simulated
+training environments. They overlooked the applicability and generalization
+requirements of deploying agents' policies in real-world environments that
+differ substantially from their training settings. In contrast, for the first
+time, we shift focus to the pentesting agents' ability to generalize across
+unseen real environments. For this purpose, we propose a Generalizable
+Autonomous Pentesting framework (namely GAP) for training agents capable of
+drawing inferences from one to another -- a key requirement for the broad
+application of autonomous pentesting and a hallmark of human intelligence. GAP
+introduces a Real-to-Sim-to-Real pipeline with two key methods: domain
+randomization and meta-RL. Specifically, we are among the first to
+apply domain randomization in autonomous pentesting and propose a large
+language model-powered domain randomization method for synthetic environment
+generation. We further apply meta-RL to improve the agents' generalization
+ability in unseen environments by leveraging the synthetic environments. The
+combination of these two methods can effectively bridge the generalization gap
+and improve policy adaptation performance. Experiments are conducted on various
+vulnerable virtual machines, with results showing that GAP can (a) enable
+policy learning in unknown real environments, (b) achieve zero-shot policy
+transfer in similar environments, and (c) realize rapid policy adaptation in
+dissimilar environments.
+
+
+
+ comment: This work has been submitted to the IEEE for possible publication
+
+ A quaternion contains one real part and three imaginary parts, which provides a
+more expressive hypercomplex space for learning knowledge graphs. Existing
+quaternion embedding models measure the plausibility of a triplet either
+through semantic matching or geometric distance scoring functions. However, it
+appears that semantic matching diminishes the separability of entities, while
+the distance scoring function weakens the semantics of entities. To address
+this issue, we propose a novel quaternion knowledge graph embedding model. Our
+model combines semantic matching with entities' geometric distance to better
+measure the plausibility of triplets. Specifically, in the quaternion space, we
+perform a right rotation on the head entity and a reverse rotation on the tail entity
+to learn rich semantic features. Then, we utilize distance adaptive
+translations to learn geometric distance between entities. Furthermore, we
+provide mathematical proofs to demonstrate our model can handle complex logical
+relationships. Extensive experimental results and analyses show our model
+significantly outperforms previous models on well-known knowledge graph
+completion benchmark datasets. Our code is available at
+https://github.com/llqy123/DaBR.
+
+
+
+ comment: Accepted by COLING 2025
+
+
+
+
+
+
+ ☆ Integrated Sensing and Communications for Low-Altitude Economy: A Deep
+ Reinforcement Learning Approach
+
+
+
+
+
+
+
+
+ Xiaowen Ye, Yuyi Mao, Xianghao Yu, Shu Sun, Liqun Fu, Jie Xu
+
+
+ This paper studies an integrated sensing and communications (ISAC) system for
+low-altitude economy (LAE), where a ground base station (GBS) provides
+communication and navigation services for authorized unmanned aerial vehicles
+(UAVs), while sensing the low-altitude airspace to monitor the unauthorized
+mobile target. The expected communication sum-rate over a given flight period
+is maximized by jointly optimizing the beamforming at the GBS and UAVs'
+trajectories, subject to the constraints on the average signal-to-noise ratio
+requirement for sensing, the flight mission and collision avoidance of UAVs, as
+well as the maximum transmit power at the GBS. Typically, this is a sequential
+decision-making problem with the given flight mission. Thus, we transform it to
+a specific Markov decision process (MDP) model called an episode task. Based on
+this modeling, we propose a novel LAE-oriented ISAC scheme, referred to as Deep
+LAE-ISAC (DeepLSC), by leveraging the deep reinforcement learning (DRL)
+technique. In DeepLSC, a reward function and a new action selection policy
+termed constrained noise-exploration policy are judiciously designed to fulfill
+various constraints. To enable efficient learning in episode tasks, we develop
+a hierarchical experience replay mechanism, where the gist is to employ all
+experiences generated within each episode to jointly train the neural network.
+Besides, to enhance the convergence speed of DeepLSC, a symmetric experience
+augmentation mechanism, which simultaneously permutes the indexes of all
+variables to enrich available experience sets, is proposed. Simulation results
+demonstrate that compared with benchmarks, DeepLSC yields a higher sum-rate
+while meeting the preset constraints, achieves faster convergence, and is more
+robust against different settings.
+
+
+
+ comment: submitted for an IEEE publication
+
+
+
+
+
+
+ ☆ Boundary-Guided Learning for Gene Expression Prediction in Spatial
+ Transcriptomics
+
+
+
+
+
+
+
+
+ Mingcheng Qu, Yuncong Wu, Donglin Di, Anyang Su, Tonghua Su, Yang Song, Lei Fan
+
+
+ Spatial transcriptomics (ST) has emerged as an advanced technology that
+provides spatial context to gene expression. Recently, deep learning-based
+methods have shown the capability to predict gene expression from whole-slide image (WSI) data
+using ST data. Existing approaches typically extract features from images and
+the neighboring regions using pretrained models, and then develop methods to
+fuse this information to generate the final output. However, these methods
+often fail to account for the cellular structure similarity, cellular density
+and the interactions within the microenvironment. In this paper, we propose a
+framework named BG-TRIPLEX, which leverages boundary information extracted from
+pathological images as guiding features to enhance gene expression prediction
+from WSIs. Specifically, our model consists of three branches: the spot,
+in-context and global branches. In the spot and in-context branches, boundary
+information, including edge and nuclei characteristics, is extracted using
+pretrained models. These boundary features guide the learning of cellular
+morphology and the characteristics of microenvironment through Multi-Head
+Cross-Attention. Finally, these features are integrated with global features to
+predict the final output. Extensive experiments were conducted on three public
+ST datasets. The results demonstrate that our BG-TRIPLEX consistently
+outperforms existing methods in terms of Pearson Correlation Coefficient (PCC).
+This method highlights the crucial role of boundary features in understanding
+the complex interactions between WSI and gene expression, offering a promising
+direction for future research.
+
+
+
+ comment: 8 pages, 5 figures
+
+
+
+
+
+
+ ☆ Space to Policy: Scalable Brick Kiln Detection and Automatic Compliance
+ Monitoring with Geospatial Data
+
+
+ Air pollution kills 7 million people annually. The brick kiln sector
+significantly contributes to economic development but also accounts for 8-14%
+of air pollution in India. Policymakers have implemented compliance measures to
+regulate brick kilns. Emission inventories are critical for air quality
+modeling and source apportionment studies. However, the largely unorganized
+nature of the brick kiln sector necessitates labor-intensive survey efforts for
+monitoring. Recent efforts by air quality researchers have relied on manual
+annotation of brick kilns using satellite imagery to build emission
+inventories, but this approach lacks scalability. Machine-learning-based object
+detection methods have shown promise for detecting brick kilns; however,
+previous studies often rely on costly high-resolution imagery and fail to
+integrate with governmental policies. In this work, we developed a scalable
+machine-learning pipeline that detected and classified 30638 brick kilns across
+five states in the Indo-Gangetic Plain using free, moderate-resolution
+satellite imagery from Planet Labs. Our detections have a high correlation with
+on-ground surveys. We performed automated compliance analysis based on
+government policies. In the Delhi airshed, stricter policy enforcement has led
+to the adoption of efficient brick kiln technologies. This study highlights the
+need for inclusive policies that balance environmental sustainability with the
+livelihoods of workers.
+
+
+
+
+
+
+
+
+ Arseny Skryagin, Felix Divo, Mohammad Amin Ali, Devendra Singh Dhami, Kristian Kersting
+
+
+ Graph Neural Networks (GNNs) are non-Euclidean deep learning models for
+graph-structured data. Despite their successful and diverse applications,
+oversmoothing prohibits deep architectures due to node features converging to a
+single fixed point. This severely limits their potential to solve complex
+tasks. To counteract this tendency, we propose a plug-and-play module
+consisting of three steps: Cluster-Normalize-Activate (CNA). By applying CNA
+modules, GNNs search and form super nodes in each layer, which are normalized
+and activated individually. We demonstrate in node classification and property
+prediction tasks that CNA significantly improves the accuracy over the
+state-of-the-art. Particularly, CNA reaches 94.18% and 95.75% accuracy on Cora
+and CiteSeer, respectively. It further benefits GNNs in regression tasks as
+well, reducing the mean squared error compared to all baselines. At the same
+time, GNNs with CNA require substantially fewer learnable parameters than
+competing architectures.
+
+
+
+
+
+
+
+ ☆ Pathwise optimization for bridge-type estimators and its applications
+
+
+
+
+
+
+
+
+ Alessandro De Gregorio, Francesco Iafrate
+
+
+ Sparse parametric models are of great interest in statistical learning and
+are often analyzed by means of regularized estimators. Pathwise methods make it
+possible to efficiently compute the full solution path for penalized estimators, for any
+possible value of the penalization parameter $\lambda$. In this paper we deal
+with the pathwise optimization for bridge-type problems; i.e. we are interested
+in the minimization of a loss function, such as negative log-likelihood or
+residual sum of squares, plus the sum of $\ell^q$ norms with $q\in(0,1]$
+involving adaptive coefficients. For some loss functions this regularization
+achieves asymptotically the oracle properties (such as the selection
+consistency). Nevertheless, since the objective function involves nonconvex and
+nondifferentiable terms, the minimization problem is computationally
+challenging.
+ The aim of this paper is to apply some general algorithms, arising from
+nonconvex optimization theory, to compute efficiently the path solutions for
+the adaptive bridge estimator with multiple penalties. In particular, we take
+into account two different approaches: accelerated proximal gradient descent
+and blockwise alternating optimization. The convergence and the path
+consistency of these algorithms are discussed. In order to assess our methods,
+we apply these algorithms to the penalized estimation of diffusion processes
+observed at discrete times. The latter represents a recent research topic in
+the field of statistics for time-dependent data.
+
+
+
+
+
+
+
+ ☆ AI4EF: Artificial Intelligence for Energy Efficiency in the Building
+ Sector
+
+
+ AI4EF, Artificial Intelligence for Energy Efficiency, is an advanced,
+user-centric tool designed to support decision-making in building energy
+retrofitting and efficiency optimization. Leveraging machine learning (ML) and
+data-driven insights, AI4EF enables stakeholders such as public sector
+representatives, energy consultants, and building owners to model, analyze, and
+predict energy consumption, retrofit costs, and environmental impacts of
+building upgrades. Featuring a modular framework, AI4EF includes customizable
+building retrofitting, photovoltaic installation assessment, and predictive
+modeling tools that allow users to input building parameters and receive
+tailored recommendations for achieving energy savings and carbon reduction
+goals. Additionally, the platform incorporates a Training Playground for data
+scientists to refine ML models used by said framework. Finally, AI4EF provides
+access to the Enershare Data Space to facilitate seamless data sharing and
+access within the ecosystem. Its compatibility with open-source identity
+management, Keycloak, enhances security and accessibility, making it adaptable
+for various regulatory and organizational contexts. This paper presents an
+architectural overview of AI4EF, its application in energy efficiency
+scenarios, and its potential for advancing sustainable energy practices through
+artificial intelligence (AI).
+
+
+
+
+
+
+
+ ☆ Dynamic Graph Representation with Contrastive Learning for Financial
+ Market Prediction: Integrating Temporal Evolution and Static Relations
+
+
+
+
+
+
+
+
+ Yunhua Pei, Jin Zheng, John Cartlidge
+
+
+ Temporal Graph Learning (TGL) is crucial for capturing the evolving nature of
+stock markets. Traditional methods often ignore the interplay between dynamic
+temporal changes and static relational structures between stocks. To address
+this issue, we propose the Dynamic Graph Representation with Contrastive
+Learning (DGRCL) framework, which integrates dynamic and static graph relations
+to improve the accuracy of stock trend prediction. Our framework introduces two
+key components: the Embedding Enhancement (EE) module and the Contrastive
+Constrained Training (CCT) module. The EE module focuses on dynamically
+capturing the temporal evolution of stock data, while the CCT module enforces
+static constraints based on stock relations, refined within contrastive
+learning. This dual-relation approach allows for a more comprehensive
+understanding of stock market dynamics. Our experiments on two major U.S. stock
+market datasets, NASDAQ and NYSE, demonstrate that DGRCL significantly
+outperforms state-of-the-art TGL baselines. Ablation studies indicate the
+importance of both modules. Overall, DGRCL not only enhances prediction ability
+but also provides a robust framework for integrating temporal and relational
+data in dynamic graphs. Code and data are available for public access.
+
+
+
+ comment: 12 pages, 2 figures, author manuscript accepted for ICAART 2025
+ (International Conference on Agents and Artificial Intelligence)
+
+ In molecular dynamics (MD) simulations, transitions between states are often
+rare events due to energy barriers that exceed the thermal temperature. Because
+of their infrequent occurrence and the huge number of degrees of freedom in
+molecular systems, understanding the physical properties that drive rare events
+is immensely difficult. A common approach to this problem is to propose a
+collective variable (CV) that describes this process by a simplified
+representation. However, choosing CVs is not easy, as it often relies on
+physical intuition. Machine learning (ML) techniques provide a promising
+approach for effectively extracting optimal CVs from MD data. Here, we provide
+a note on a recent unsupervised ML method called spectral map, which constructs
+CVs by maximizing the timescale separation between slow and fast variables in
+the system.
+
+
+
+ comment: A letter prepared for the Ensemble journal of the Molecular
+ Simulation Society of Japan (MSSJ)
+
+ The exploration of underwater environments is essential for applications such
+as biological research, archaeology, and infrastructure maintenance. However,
+underwater imaging is challenging due to the water's unique properties,
+including scattering, absorption, color distortion, and reduced visibility. To
+address such visual degradations, a variety of approaches have been proposed
+ranging from basic signal processing methods to deep learning models; however,
+none of them has proven to be consistently successful. In this paper, we
+propose a novel machine learning model, Co-Operational Regressor Networks
+(CoRe-Nets), designed to achieve the best possible underwater image
+restoration. A CoRe-Net consists of two co-operating networks: the Apprentice
+Regressor (AR), responsible for image transformation, and the Master Regressor
+(MR), which evaluates the Peak Signal-to-Noise Ratio (PSNR) of the images
+generated by the AR and feeds it back to AR. CoRe-Nets are built on
+Self-Organized Operational Neural Networks (Self-ONNs), which offer a superior
+learning capability by modulating nonlinearity in kernel transformations. The
+effectiveness of the proposed model is demonstrated on the benchmark Large
+Scale Underwater Image (LSUI) dataset. Leveraging the joint learning
+capabilities of the two cooperating networks, the proposed model achieves the
+state-of-the-art restoration performance with significantly reduced computational
+complexity and often produces results that can even surpass the visual
+quality of the ground truth with a 2-pass application. Our results and the
+optimized PyTorch implementation of the proposed approach are now publicly
+shared on GitHub.
+
+
+
+ comment: 11 pages
+
+
+
+
+
+
+ ☆ LaserGuider: A Laser Based Physical Backdoor Attack against Deep Neural
+ Networks
+
+
+ Backdoor attacks embed hidden associations between triggers and targets in
+deep neural networks (DNNs), causing them to predict the target when a trigger
+is present while maintaining normal behavior otherwise. Physical backdoor
+attacks, which use physical objects as triggers, are feasible but lack remote
+control, temporal stealthiness, flexibility, and mobility. To overcome these
+limitations, in this work, we propose a new type of backdoor trigger utilizing
+lasers that feature long-distance transmission and instant-imaging properties.
+Based on the laser-based backdoor triggers, we present a physical backdoor
+attack, called LaserGuider, which possesses remote control ability and achieves
+high temporal stealthiness, flexibility, and mobility. We also introduce a
+systematic approach to optimize laser parameters for improving attack
+effectiveness. Our evaluation on traffic sign recognition DNNs, critical in
+autonomous vehicles, demonstrates that LaserGuider with three different
+laser-based triggers achieves over 90% attack success rate with negligible
+impact on normal inputs. Additionally, we release LaserMark, the first dataset
+of real-world traffic signs stamped with physical laser spots, to support
+further research in backdoor attacks and defenses.
+
+
+
+ comment: In Proceedings of the 23rd International Conference on Applied
+ Cryptography and Network Security (ACNS), Munich, Germany, 23-26 June, 2025
+
+
+
+
+
+
+ ☆ How well behaved is finite dimensional Diffusion Maps?
+
+
+ Under a set of assumptions on a family of submanifolds $\subset {\mathbb
+R}^D$, we derive a series of geometric properties that remain valid after
+finite-dimensional and almost isometric Diffusion Maps (DM), including almost
+uniform density, finite polynomial approximation and local reach. Leveraging
+these properties, we rigorously show that the embedding error
+introduced by the DM algorithm is $O\left((\frac{\log
+n}{n})^{\frac{1}{8d+16}}\right)$. These results offer a solid theoretical
+foundation for understanding the performance and reliability of DM in practical
+applications.
+
+
+
+ comment: 20 pages, 3 figures
+
+
+
+
+
+
+ ☆ Safe and Efficient Online Convex Optimization with Linear Budget
+ Constraints and Partial Feedback
+
+
+ This paper studies online convex optimization with unknown linear budget
+constraints, where only the gradient information of the objective and the
+bandit feedback of constraint functions are observed. We propose a safe and
+efficient Lyapunov-optimization algorithm (SELO) that can achieve an
+$O(\sqrt{T})$ regret and zero cumulative constraint violation. The result also
+implies SELO achieves $O(\sqrt{T})$ regret when the budget is hard and not
+allowed to be violated. The proposed algorithm is computationally efficient as
+it resembles a primal-dual algorithm where the primal problem is an
+unconstrained, strongly convex and smooth problem, and the dual problem has a
+simple gradient-type update. The algorithm and theory are further justified in
+a simulated application of energy-efficient task processing in distributed data
+centers.
+
+
+
+
+
+
+
+ ☆ Exploring Fully Convolutional Networks for the Segmentation of
+ Hyperspectral Imaging Applied to Advanced Driver Assistance Systems
+
+
+
+
+
+
+
+
+ Jon Gutiérrez-Zaballa, Koldo Basterretxea, Javier Echanobe, M. Victoria Martínez, Inés del Campo
+
+
+ Advanced Driver Assistance Systems (ADAS) are designed with the main purpose
+of increasing the safety and comfort of vehicle occupants. Most current
+computer vision-based ADAS perform detection and tracking tasks quite
+successfully under regular conditions, but are not completely reliable,
+particularly under adverse weather and changing lighting conditions, or in
+complex situations with many overlapping objects. In this work we explore the
+use of hyperspectral imaging (HSI) in ADAS on the assumption that the distinct
+near infrared (NIR) spectral reflectances of different materials can help to
+better separate the objects in a driving scene. In particular, this paper
+describes some experimental results of the application of fully convolutional
+networks (FCN) to the image segmentation of HSI for ADAS applications. More
+specifically, our aim is to investigate to what extent the spatial features
+codified by convolutional filters can be helpful to improve the performance of
+HSI segmentation systems. With that aim, we use the HSI-Drive v1.1 dataset,
+which provides a set of labelled images recorded in real driving conditions
+with a small-size snapshot NIR-HSI camera. Finally, we analyze the
+implementability of such an HSI segmentation system by prototyping the developed
+FCN model together with the necessary hyperspectral cube preprocessing stage
+and characterizing its performance on an MPSoC.
+
+
+
+ comment: arXiv admin note: text overlap with arXiv:2411.19274
+
+
+
+
+
+
+ ☆ Local Curvature Smoothing with Stein's Identity for Efficient Score
+ Matching NeurIPS 2024
+
+
+ The training of score-based diffusion models (SDMs) is based on score
+matching. The challenge of score matching is that it includes a computationally
+expensive Jacobian trace. While several methods have been proposed to avoid
+this computation, each has drawbacks, such as instability during training and
+approximating the learning as learning a denoising vector field rather than a
+true score. We propose a novel score matching variant, local curvature
+smoothing with Stein's identity (LCSS). The LCSS bypasses the Jacobian trace by
+applying Stein's identity, enabling regularization effectiveness and efficient
+computation. We show that LCSS surpasses existing methods in sample generation
+performance and matches the performance of denoising score matching, widely
+adopted by most SDMs, in evaluations such as FID, Inception score, and bits per
+dimension. Furthermore, we show that LCSS enables realistic image generation
+even at a high resolution of $1024 \times 1024$.
+
+
+
+ comment: Accepted at NeurIPS 2024
+
+
+
+
+
+
+ ☆ Electronic Health Records-Based Data-Driven Diabetes Knowledge Unveiling
+ and Risk Prognosis
+
+
+ In the healthcare sector, the application of deep learning technologies has
+revolutionized data analysis and disease forecasting. This is particularly
+evident in the field of diabetes, where the deep analysis of Electronic Health
+Records (EHR) has unlocked new opportunities for early detection and effective
+intervention strategies. Our research presents an innovative model that
+synergizes the capabilities of Bidirectional Long Short-Term Memory
+Networks-Conditional Random Field (BiLSTM-CRF) with a fusion of XGBoost and
+Logistic Regression. This model is designed to enhance the accuracy of diabetes
+risk prediction by conducting an in-depth analysis of electronic medical
+records data. The first phase of our approach involves employing BiLSTM-CRF to
+delve into the temporal characteristics and latent patterns present in EHR
+data. This method effectively uncovers the progression trends of diabetes,
+which are often hidden in the complex data structures of medical records. The
+second phase leverages the combined strength of XGBoost and Logistic Regression
+to classify these extracted features and evaluate associated risks. This dual
+approach facilitates a more nuanced and precise prediction of diabetes,
+outperforming traditional models, particularly in handling multifaceted and
+nonlinear medical datasets. Our research demonstrates a notable advancement in
+diabetes prediction over traditional methods, showcasing the effectiveness of
+our combined BiLSTM-CRF, XGBoost, and Logistic Regression model. This study
+highlights the value of data-driven strategies in clinical decision-making,
+equipping healthcare professionals with precise tools for early detection and
+intervention. By enabling personalized treatment and timely care, our approach
+signifies progress in incorporating advanced analytics in healthcare,
+potentially improving outcomes for diabetes and other chronic conditions.
+
+
+
+ comment: 16 pages
+
+
+
+
+
+
+ ☆ BEFL: Balancing Energy Consumption in Federated Learning for Mobile Edge
+ IoT
+
+
+ Federated Learning (FL) is a privacy-preserving distributed learning paradigm
+designed to build a highly accurate global model. In Mobile Edge IoT (MEIoT),
+the training and communication processes can significantly deplete the limited
+battery resources of devices. Existing research primarily focuses on reducing
+overall energy consumption, but this may inadvertently create energy
+consumption imbalances, leading to the premature dropout of energy-sensitive
+devices. To address these challenges, we propose BEFL, a joint optimization
+framework aimed at balancing three objectives: enhancing global model accuracy,
+minimizing total energy consumption, and reducing energy usage disparities
+among devices. First, taking into account the communication constraints of
+MEIoT and the heterogeneity of devices, we employ the Sequential Least
+Squares Programming (SLSQP) algorithm for the rational allocation of
+communication resources. Based on this, we introduce a heuristic client
+selection algorithm that combines cluster partitioning with utility-driven
+approaches to alleviate both the total energy consumption of all devices and
+the discrepancies in energy usage. Furthermore, we utilize the proposed
+heuristic client selection algorithm as a template for offline imitation
+learning during pre-training, while adopting a ranking-based reinforcement
+learning approach online to further boost training efficiency. Our experiments
+reveal that BEFL improves global model accuracy by 1.6%, reduces energy
+consumption variance by 72.7%, and lowers total energy consumption by 28.2%
+compared to existing methods. The relevant code can be found at
+https://github.com/juzehao/BEFL.
+
+
+
+
+
+
+
+ ☆ Learning Speed-Adaptive Walking Agent Using Imitation Learning with
+ Physics-Informed Simulation
+
+
+ Virtual models of human gait, or digital twins, offer a promising solution
+for studying mobility without the need for labor-intensive data collection.
+However, challenges such as the sim-to-real gap and limited adaptability to
+diverse walking conditions persist. To address these, we developed and
+validated a framework to create a skeletal humanoid agent capable of adapting
+to varying walking speeds while maintaining biomechanically realistic motions.
+The framework combines a synthetic data generator, which produces
+biomechanically plausible gait kinematics from open-source biomechanics data,
+and a training system that uses adversarial imitation learning to train the
+agent's walking policy. We conducted comprehensive analyses comparing the
+agent's kinematics, synthetic data, and the original biomechanics dataset. The
+agent achieved a root mean square error of 5.24 ± 0.09 degrees at varying
+speeds compared to ground-truth kinematics data, demonstrating its
+adaptability. This work represents a significant step toward developing a
+digital twin of human locomotion, with potential applications in biomechanics
+research, exoskeleton design, and rehabilitation.
+
+
+
+ comment: Currently under review
+
+
+
+
+
+
+ ☆ JANUS: A Difference-Oriented Analyzer For Financial Centralization Risks
+ in Smart Contracts
+
+
+
+
+
+
+
+
+ Wansen Wang, Pu Zhang, Renjie Ji, Wenchao Huang, Zhaoyi Meng, Yan Xiong
+
+
+ Some smart contracts violate decentralization principles by defining
+privileged accounts that manage other users' assets without permission,
+introducing centralization risks that have caused financial losses. Existing
+methods, however, face challenges in accurately detecting diverse
+centralization risks due to their dependence on predefined behavior patterns.
+In this paper, we propose JANUS, an automated analyzer for Solidity smart
+contracts that detects financial centralization risks independently of their
+specific behaviors. JANUS identifies differences between states reached by
+privileged and ordinary accounts, and analyzes whether these differences are
+finance-related. Focusing on the impact of risks rather than behaviors, JANUS
+achieves improved accuracy compared to existing tools and can uncover
+centralization risks with unknown patterns.
+ To evaluate JANUS's performance, we compare it with other tools using a
+dataset of 540 contracts. Our evaluation demonstrates that JANUS outperforms
+representative tools in terms of detection accuracy for financial
+centralization risks. Additionally, we evaluate JANUS on a real-world dataset
+of 33,151 contracts, successfully identifying two types of risks that other
+tools fail to detect. We also prove that the state traversal method and
+variable summaries, which are used in JANUS to reduce the number of states to
+be compared, do not introduce false alarms or omissions in detection.
+
+
+
+
+
+
+
+ ☆ Deep Learning Modeling Method for RF Devices Based on Uniform Noise
+ Training Set
+
+
+ As the scale and complexity of integrated circuits continue to increase,
+traditional modeling methods are struggling to address the nonlinear challenges
+in radio frequency (RF) chips. Deep learning has been increasingly applied to
+RF device modeling. This paper proposes a deep learning-based modeling method
+for RF devices using a uniform noise training set, aimed at modeling and
+fitting the nonlinear characteristics of RF devices. We hypothesize that a
+uniform noise signal can encompass the full range of characteristics across
+both frequency and amplitude, and that a deep learning model can effectively
+capture and learn these features. Based on this hypothesis, the paper designs a
+complete integrated circuit modeling process based on measured data, including
+data collection, processing, and neural network training. The proposed method
+is experimentally validated using the RF amplifier PW210 as a case study.
+Experimental results show that the uniform noise training set allows the model
+to capture the nonlinear characteristics of RF devices, and the trained model
+can predict waveform patterns it has never encountered before. The proposed
+deep learning-based RF device modeling method, using a uniform noise training
+set, demonstrates strong generalization capability and excellent training
+performance, offering high practical application value.
+
+
+
+ comment: 9 pages, 11 figures
+
+
+
+
+
+
+ ☆ Exploring AI Text Generation, Retrieval-Augmented Generation, and
+ Detection Technologies: a Comprehensive Overview
+
+
+ The rapid development of Artificial Intelligence (AI) has led to the creation
+of powerful text generation models, such as large language models (LLMs), which
+are widely used for diverse applications. However, concerns surrounding
+AI-generated content, including issues of originality, bias, misinformation,
+and accountability, have become increasingly prominent. This paper offers a
+comprehensive overview of AI text generators (AITGs), focusing on their
+evolution, capabilities, and ethical implications. This paper also introduces
+Retrieval-Augmented Generation (RAG), a recent approach that improves the
+contextual relevance and accuracy of text generation by integrating dynamic
+information retrieval. RAG addresses key limitations of traditional models,
+including their reliance on static knowledge and potential inaccuracies in
+handling real-world data. Additionally, the paper reviews detection tools that
+help differentiate AI-generated text from human-written content and discusses
+the ethical challenges these technologies pose. The paper explores future
+directions for improving detection accuracy, supporting ethical AI development,
+and increasing accessibility. The paper contributes to a more responsible and
+reliable use of AI in content creation through these discussions.
+
+
+
+
+
+
+
+ ☆ MT3DNet: Multi-Task learning Network for 3D Surgical Scene
+ Reconstruction
+
+
+
+
+
+
+
+
+ Mithun Parab, Pranay Lendave, Jiyoung Kim, Thi Quynh Dan Nguyen, Palash Ingle
+
+
+ In image-assisted minimally invasive surgeries (MIS), understanding surgical
+scenes is vital for real-time feedback to surgeons, skill evaluation, and
+improving outcomes through collaborative human-robot procedures. Within this
+context, the challenge lies in accurately detecting, segmenting, and estimating
+the depth of surgical scenes depicted in high-resolution images, while
+simultaneously reconstructing the scene in 3D and providing segmentation of
+surgical instruments along with detection labels for each instrument. To
+address this challenge, a novel Multi-Task Learning (MTL) network is proposed
+for performing these tasks concurrently. A key aspect of this approach involves
+overcoming the optimization hurdles associated with handling multiple tasks
+concurrently by integrating an Adversarial Weight Update into the MTL framework.
+The proposed MTL model achieves 3D reconstruction through the integration of
+segmentation, depth estimation, and object detection, thereby enhancing the
+understanding of surgical scenes, which marks a significant advancement
+compared to existing studies that lack 3D capabilities. Comprehensive
+experiments on the EndoVis2018 benchmark dataset underscore the adeptness of
+the model in efficiently addressing all three tasks, demonstrating the efficacy
+of the proposed techniques.
+
+
+
+
+
+
+
+ ☆ MegaCOIN: Enhancing Medium-Grained Color Perception for Vision-Language
+ Models
+
+
+ In vision-language models (VLMs), the ability to perceive and interpret color
+and physical environment is crucial for achieving contextually accurate
+understanding and interaction. However, despite advances in multimodal
+modeling, there remains a significant lack of specialized datasets that
+rigorously evaluate a model's capacity to discern subtle color variations and
+spatial context -- critical elements for situational comprehension and reliable
+deployment across real-world applications. Toward that goal, we curate
+MegaCOIN, a high-quality, human-labeled dataset based on real images
+with various contextual attributes. MegaCOIN consists of two parts:
+MegaCOIN-Instruct, which serves as a supervised fine-tuning (SFT) dataset for
+VLMs; and MegaCOIN-Bench, an annotated test set that can be used as a
+stand-alone QA dataset. MegaCOIN provides three annotated features for 220,000
+real images: foreground color, background color, and description of an object's
+physical environment, constituting 660k human annotations. In addition,
+MegaCOIN can be applied to benchmark domain generalization (DG) algorithms. We
+explore benchmarking DG methods in the linear probing setup for VLM and show
+some new insights. Last but not least, we show that VLMs, including GPT-4o,
+have subpar color recognition capabilities, and fine-tuning with MegaCOIN can
+result in improved performance on visual evaluation tasks. In certain cases,
+MegaCOIN fine-tuned small-scale open-source models such as LLaVA and Bunny can
+outperform closed-source GPT-4o. We hope the utilities of MegaCOIN can shed
+light on the directions VLMs can improve and provide a more complex platform
+for domain generalization algorithms.
+
+
+
+ comment: 8 pages, 13 tables, 2 figures
+
+
+
+
+
+
+ ♻ ☆ A method to benchmark high-dimensional process drift detection
+
+
+ Process curves are multivariate finite time series data coming from
+manufacturing processes. This paper studies machine learning methods that detect drifts
+in process curve datasets. A theoretic framework to synthetically generate
+process curves in a controlled way is introduced in order to benchmark machine
+learning algorithms for process drift detection. An evaluation score, called
+the temporal area under the curve, is introduced, which quantifies how
+well machine learning models unveil curves belonging to drift segments.
+Finally, a benchmark study comparing popular machine learning approaches on
+synthetic data generated with the introduced framework is presented that shows
+that existing algorithms often struggle with datasets containing multiple drift
+segments.
+
+
+
+
+
+
+
+
+ Anthony Zhou, Amir Barati Farimani
+
+
+ Neural solvers for partial differential equations (PDEs) have great potential
+to generate fast and accurate physics solutions, yet their practicality is
+currently limited by their generalizability. PDEs evolve over broad scales and
+exhibit diverse behaviors; predicting these phenomena will require learning
+representations across a wide variety of inputs which may encompass different
+coefficients, boundary conditions, resolutions, or even equations. As a step
+towards generalizable PDE modeling, we adapt masked pretraining for physics
+problems. Through self-supervised learning across PDEs, masked autoencoders can
+consolidate heterogeneous physics to learn rich latent representations. We show
+that learned representations can generalize to a limited set of unseen
+equations or parameters and are meaningful enough to regress PDE coefficients
+or to classify PDE features. Furthermore, conditioning neural solvers on
+learned latent representations can improve time-stepping and super-resolution
+performance across a variety of coefficients, discretizations, or boundary
+conditions, as well as on certain unseen PDEs. We hope that masked pretraining
+can emerge as a unifying method across large, unlabeled, and heterogeneous
+datasets to learn latent physics at scale.
+
+
+
+ comment: 29 pages, 9 figures
+
+
+
+
+
+
+ ♻ ☆ SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large
+ Language Models by Summarizing Training Trajectories of Small Models
+
+
+ Despite the effectiveness of data selection for large language models (LLMs)
+during pretraining and instruction fine-tuning phases, improving data
+efficiency in supervised fine-tuning (SFT) for specialized domains poses
+significant challenges due to the complexity of fine-tuning data. To bridge
+this gap, we introduce an effective and scalable data selection method for SFT,
+SmallToLarge (S2L), which leverages training trajectories from small models to
+guide the data selection for larger models. We demonstrate through extensive
+experiments that S2L significantly improves data efficiency in SFT for
+mathematical problem-solving, reducing the training data to just 11% of the
+original MathInstruct dataset (Yue et al., 2023) to match full dataset
+performance while outperforming state-of-the-art data selection algorithms by
+an average of 4.7% across 6 in- and out-domain evaluation datasets. Remarkably,
+selecting only 50K data for SFT, S2L achieves a 32.7% accuracy on the most
+challenging MATH (Hendrycks et al., 2021) benchmark, improving Phi-2 (Li et
+al., 2023b) by 16.6%. In clinical text summarization on the MIMIC-III dataset
+(Johnson et al., 2016), S2L again outperforms training on the full dataset
+using only 50% of the data. Notably, S2L can perform data selection using a
+reference model 40x smaller than the target model, proportionally reducing the
+cost of data selection.
+
+
+
+
+
+
+
+
+ Jaskirat Singh, Lindsey Li, Weijia Shi, Ranjay Krishna, Yejin Choi, Pang Wei Koh, Michael F. Cohen, Stephen Gould, Liang Zheng, Luke Zettlemoyer
+
+
+ Text-based adversarial guidance using a negative prompt has emerged as a
+widely adopted approach to steer diffusion models away from producing undesired
+concepts. While useful, performing adversarial guidance using text alone can be
+insufficient to capture complex visual concepts or avoid specific visual
+elements like copyrighted characters. In this paper, for the first time we
+explore an alternate modality in this direction by performing adversarial
+guidance directly using visual features from a reference image or other images
+in a batch. We introduce negative token merging (NegToMe), a simple but
+effective training-free approach which performs adversarial guidance through
+images by selectively pushing apart matching visual features between reference
+and generated images during the reverse diffusion process. By simply adjusting
+the used reference, NegToMe enables a diverse range of applications. Notably,
+when using other images in the same batch as reference, we find that NegToMe
+significantly enhances output diversity (e.g., racial, gender, visual) by
+guiding features of each image away from others. Similarly, when used w.r.t.
+copyrighted reference images, NegToMe reduces visual similarity to copyrighted
+content by 34.57%. NegToMe is simple to implement using just a few lines of code,
+uses only marginally higher (<4%) inference time and is compatible with
+different diffusion architectures, including those like Flux, which don't
+natively support the use of a negative prompt. Code is available at
+https://negtome.github.io
+
+
+
+
+
+
+
+ ♻ ☆ WaveletGPT: Wavelets Meet Large Language Models
+
+
+ Large Language Models (LLMs) have ushered in a new wave of artificial
+intelligence advancements impacting every scientific field and discipline. They
+are trained on a simple objective: to predict the next token given the previous
+context. We live in a world where most of the data around us, e.g., text,
+audio, and music, has a multi-scale structure associated with it. This paper
+infuses LLMs with traditional signal processing ideas, namely wavelets, during
+pre-training to take advantage of the structure. Without adding any
+extra parameters to a GPT-style LLM architecture, we achieve the same
+pre-training performance almost twice as fast in text, raw audio, and symbolic
+music. This is achieved by imposing a structure on intermediate embeddings.
+When trained for the same number of training steps, we achieve significant
+gains in performance, which is comparable to pre-training a larger neural
+architecture. Our architecture allows every next token prediction access to
+intermediate embeddings at different temporal resolutions in every Transformer
+decoder block. This work will hopefully pave the way for incorporating
+multi-rate signal processing ideas into traditional LLM pre-training. Further,
+we showcase pushing model performance by improving internal structure instead
+of just going after scale.
+
+
+
+ comment: 16 pages, 4 figures
+
+
+
+
+
+
+ ♻ ☆ Learning to Reconstruct Accelerated MRI Through K-space Cold Diffusion
+ without Noise
+
+
+
+
+
+
+
+
+ Guoyao Shen, Mengyu Li, Chad W. Farris, Stephan Anderson, Xin Zhang
+
+
+ Deep learning-based MRI reconstruction models have achieved superior
+performance these days. Most recently, diffusion models have shown remarkable
+performance in image generation, in-painting, super-resolution, image editing
+and more. As a generalized diffusion model, cold diffusion further broadens the
+scope and considers models built around arbitrary image transformations such as
+blurring, down-sampling, etc. In this paper, we propose a k-space cold
+diffusion model that performs image degradation and restoration in k-space
+without the need for Gaussian noise. We provide comparisons with multiple deep
+learning-based MRI reconstruction models and perform tests on a well-known
+large open-source MRI dataset. Our results show that this novel way of
+performing degradation can generate high-quality reconstruction images for
+accelerated MRI.
+
+
+
+ comment: 21 pages, 5 figures, 4 tables
+
+
+
+
+
+
+ ♻ ☆ Regularization by Neural Style Transfer for MRI Field-Transfer
+ Reconstruction with Limited Data
+
+
+
+
+
+
+
+
+ Guoyao Shen, Yancheng Zhu, Mengyu Li, Ryan McNaughton, Hernan Jara, Sean B. Andersson, Chad W. Farris, Stephan Anderson, Xin Zhang
+
+
+ Recent advances in MRI reconstruction have achieved remarkable success with
+deep learning-based models. However, most methods depend on large-scale,
+task-specific datasets, leaving reconstruction in data-limited settings as a
+critical but underexplored challenge. Regularization by denoising (RED) is a
+general pipeline that incorporates a denoiser as a prior for image
+reconstruction, showing promising results in various image processing tasks,
+including denoising, deblurring, and super-resolution. In this work, we propose
+a regularization by neural style transfer (RNST) method to further leverage the
+priors from the neural transfer and denoising engine. RNST effectively
+reconstructs high-quality images from noisy, low-quality inputs across varying
+image styles, even with limited data. We validate RNST on clinical MRI scans,
+demonstrating its ability to significantly improve image quality. These
+findings underline the potential of RNST for MRI field-transfer reconstruction
+and its promise in addressing reconstruction tasks in data-constrained
+scenarios.
+
+
+ The value of second-order methods lies in the use of curvature information.
+Yet, this information is costly to extract and once obtained, valuable negative
+curvature information is often discarded so that the method is globally
+convergent. This limits the effectiveness of second-order methods in modern
+machine learning. In this paper, we show that second-order and
+second-order-like methods are promising optimizers for neural networks provided
+that we add one ingredient: negative step sizes. We show that under very
+general conditions, methods that produce ascent directions are globally
+convergent when combined with a Wolfe line search that allows both positive and
+negative step sizes. We experimentally demonstrate that using negative step
+sizes is often more effective than common Hessian modification methods.
+
+
+
+ comment: added affiliation and more references
+
+
+
+
+
+
+ ♻ ☆ GeoPos: A Minimal Positional Encoding for Enhanced Fine-Grained Details
+ in Image Synthesis Using Convolutional Neural Networks WACV 2025
+
+
+ The enduring inability of image generative models to recreate intricate
+geometric features, such as those present in human hands and fingers, has been
+an ongoing problem in image generation for nearly a decade. While strides have
+been made by increasing model sizes and diversifying training datasets, this
+issue remains prevalent across all models, from denoising diffusion models to
+Generative Adversarial Networks (GAN), pointing to a fundamental shortcoming in
+the underlying architectures. In this paper, we demonstrate how this problem
+can be mitigated by augmenting convolution layers' geometric capabilities
+through providing them with a single input channel incorporating the relative
+n-dimensional Cartesian coordinate system. We show this drastically improves
+quality of images generated by Diffusion Models, GANs, and Variational
+AutoEncoders (VAE).
+
+
+
+ comment: Accepted at WACV 2025. Contains 19 pages, 15 figures, and 9 tables
+
+
+
+
+
+
+ ♻ ☆ Is uniform expressivity too restrictive? Towards efficient expressivity
+ of graph neural networks
+
+
+ Uniform expressivity guarantees that a Graph Neural Network (GNN) can express
+a query without the parameters depending on the size of the input graphs. This
+property is desirable in applications in order to have a number of trainable
+parameters that is independent of the size of the input graphs. Uniform
+expressivity of the two-variable guarded fragment (GC2) of first-order logic is
+a well-celebrated result for Rectified Linear Unit (ReLU) GNNs [Barcelo & al.,
+2020]. In this article, we prove that uniform expressivity of GC2 queries is
+not possible for GNNs with a wide class of Pfaffian activation functions
+(including the sigmoid and tanh), answering a question formulated by [Grohe,
+2021]. We also show that despite these limitations, many of those GNNs can
+still efficiently express GC2 queries in a way that the number of parameters
+remains logarithmic in the maximal degree of the input graphs. Furthermore, we
+demonstrate that a log-log dependency on the degree is achievable for a certain
+choice of activation function. This shows that uniform expressivity can be
+successfully relaxed by covering large graphs appearing in practical
+applications. Our experiments illustrate that our theoretical estimates hold
+in practice.
+
+
+
+
+
+
+
+ ♻ ☆ Introducing the Large Medical Model: State of the art healthcare cost
+ and risk prediction with transformers trained on patient event sequences
+
+
+ With U.S. healthcare spending approaching $5T (NHE Fact Sheet 2024), and 25%
+of it estimated to be wasteful (Waste in the US health care system:
+estimated costs and potential for savings, n.d.), the need to better predict
+risk and optimal patient care is ever more important. This paper introduces the
+Large Medical Model (LMM), a generative pre-trained transformer (GPT) designed
+to guide and predict the broad facets of patient care and healthcare
+administration. The model is trained on medical event sequences from over 140M
+longitudinal patient claims records with a specialized vocabulary built from
+medical terminology systems and demonstrates a superior capability to forecast
+healthcare costs and identify potential risk factors. Through experimentation
+and validation, we showcase the LMM's proficiency not only in cost and risk
+predictions, but also in discerning intricate patterns within complex medical
+conditions and an ability to identify novel relationships in patient care. The
+LMM is able to improve both cost prediction by 14.1% over the best commercial
+models and chronic conditions prediction by 1.9% over the best transformer
+models in research predicting a broad set of conditions. The LMM is a
+substantial advancement in healthcare analytics, offering the potential to
+significantly enhance risk assessment, cost management, and personalized
+medicine.
+
+
+
+ comment: 10 pages, 10 figures
+
+
+
+
+
+
+ ♻ ☆ Limit Theorems for Stochastic Gradient Descent with Infinite Variance
+
+
+
+
+
+
+
+
+ Jose Blanchet, Aleksandar Mijatović, Wenhao Yang
+
+
+ Stochastic gradient descent is a classic algorithm that has gained great
+popularity especially in the last decades as the most common approach for
+training models in machine learning. While the algorithm has been well-studied
+when stochastic gradients are assumed to have a finite variance, there is
+significantly less research addressing its theoretical properties in the case
+of infinite variance gradients. In this paper, we establish the asymptotic
+behavior of stochastic gradient descent in the context of infinite variance
+stochastic gradients, assuming that the stochastic gradient is regularly varying
+with index $\alpha\in(1,2)$. The closest result in this context was established
+in 1969, in the one-dimensional case and assuming that stochastic gradients
+belong to a more restrictive class of distributions. We extend it to the
+multidimensional case, covering a broader class of infinite variance
+distributions. As we show, the asymptotic distribution of the stochastic
+gradient descent algorithm can be characterized as the stationary distribution
+of a suitably defined Ornstein-Uhlenbeck process driven by an appropriate
+stable Lévy process. Additionally, we explore the applications of these
+results in linear regression and logistic regression models.
+
+
+
+
+
+
+
+ ♻ ☆ A Fisher-Rao gradient flow for entropy-regularised Markov decision
+ processes in Polish spaces
+
+
+ We study the global convergence of a Fisher-Rao policy gradient flow for
+infinite-horizon entropy-regularised Markov decision processes with Polish
+state and action space. The flow is a continuous-time analogue of a policy
+mirror descent method. We establish the global well-posedness of the gradient
+flow and demonstrate its exponential convergence to the optimal policy.
+Moreover, we prove the flow is stable with respect to gradient evaluation,
+offering insights into the performance of a natural policy gradient flow with
+log-linear policy parameterisation. To overcome challenges stemming from the
+lack of convexity of the objective function and the discontinuity arising
+from the entropy regulariser, we leverage the performance difference lemma and
+the duality relationship between the gradient and mirror descent flows. Our
+analysis provides a theoretical foundation for developing various discrete
+policy gradient algorithms.
+
+
+
+ comment: add discretizations of gradient flow and their convergence analysis
+
+ In this work, we address the challenging and emergent problem of novel object
+detection (NOD), focusing on the accurate detection of both known and novel
+object categories during inference. Traditional object detection algorithms are
+inherently closed-set, limiting their capability to handle NOD. We present a
+novel approach to transform existing closed-set detectors into open-set
+detectors. This transformation is achieved by leveraging the complementary
+strengths of pre-trained foundational models, specifically CLIP and SAM,
+through our cooperative mechanism. Furthermore, by integrating this mechanism
+with state-of-the-art open-set detectors such as GDINO, we establish new
+benchmarks in object detection performance. Our method achieves 17.42 mAP in
+novel object detection and 42.08 mAP for known objects on the challenging LVIS
+dataset. Adapting our approach to the COCO OVD split, we surpass the current
+state-of-the-art by a margin of 7.2 $\text{AP}_{50}$ for novel classes. Our
+code is available at https://rohit901.github.io/coop-foundation-models/ .
+
+
+
+ comment: Accepted at WACV 2025
+
+
+
+
+
+
+ ♻ ☆ ELEMENTAL: Interactive Learning from Demonstrations and Vision-Language
+ Models for Reward Design in Robotics
+
+
+ Reinforcement learning (RL) has demonstrated compelling performance in
+robotic tasks, but its success often hinges on the design of complex, ad hoc
+reward functions. Researchers have explored how Large Language Models (LLMs)
+could enable non-expert users to specify reward functions more easily. However,
+LLMs struggle to balance the importance of different features, generalize
+poorly to out-of-distribution robotic tasks, and cannot represent the problem
+properly with only text-based descriptions. To address these challenges, we
+propose ELEMENTAL (intEractive LEarning froM dEmoNstraTion And Language), a
+novel framework that combines natural language guidance with visual user
+demonstrations to align robot behavior with user intentions better. By
+incorporating visual inputs, ELEMENTAL overcomes the limitations of text-only
+task specifications, while leveraging inverse reinforcement learning (IRL) to
+balance feature weights and match the demonstrated behaviors optimally.
+ELEMENTAL also introduces an iterative feedback loop through self-reflection to
+improve feature, reward, and policy learning. Our experimental results
+demonstrate that ELEMENTAL outperforms prior work by 42.3% on task success, and
+achieves 41.3% better generalization in out-of-distribution tasks, highlighting
+its robustness in learning from demonstration (LfD).
+
+
+
+
+
+
+
+ ♻ ☆ HydraViT: Stacking Heads for a Scalable ViT NeurIPS'24
+
+
+
+
+
+
+
+
+ Janek Haberer, Ali Hojjat, Olaf Landsiedel
+
+
+ The architecture of Vision Transformers (ViTs), particularly the Multi-head
+Attention (MHA) mechanism, imposes substantial hardware demands. Deploying ViTs
+on devices with varying constraints, such as mobile phones, requires multiple
+models of different sizes. However, this approach has limitations, such as
+training and storing each required model separately. This paper introduces
+HydraViT, a novel approach that addresses these limitations by stacking
+attention heads to achieve a scalable ViT. By repeatedly changing the size of
+the embedded dimensions throughout each layer and their corresponding number of
+attention heads in MHA during training, HydraViT induces multiple subnetworks.
+Thereby, HydraViT achieves adaptability across a wide spectrum of hardware
+environments while maintaining performance. Our experimental results
+demonstrate the efficacy of HydraViT in achieving a scalable ViT with up to 10
+subnetworks, covering a wide range of resource constraints. HydraViT achieves
+up to 5 p.p. more accuracy with the same GMACs and up to 7 p.p. more accuracy
+with the same throughput on ImageNet-1K compared to the baselines, making it an
+effective solution for scenarios where hardware availability is diverse or
+varies over time. Source code available at https://github.com/ds-kiel/HydraViT.
+
+
+
+ comment: Accepted at NeurIPS'24, please cite the conference version
+
+ In multi-agent systems, an agent's behavior is highly influenced by its
+utility function, as these utilities shape both individual goals and
+interactions with the other agents. Inverse Reinforcement Learning (IRL) is a
+well-established approach to inferring the utility function by observing an
+expert's behavior within a given environment. In this paper, we extend the IRL
+framework to the multi-agent setting, assuming that we observe agents who are
+following Nash Equilibrium (NE) policies. We theoretically investigate the set
+of utilities that explain the behavior of NE experts. Specifically, we provide
+an explicit characterization of the feasible reward set and analyze how errors
+in estimating the transition dynamics and expert behavior impact the recovered
+rewards. Building on these findings, we provide the first sample complexity
+analysis for the multi-agent IRL problem. Finally, we provide a numerical
+evaluation of our theoretical results.
+
+
+
+
+
+
+
+
+ Laura Kopf, Philine Lou Bommer, Anna Hedström, Sebastian Lapuschkin, Marina M. -C. Höhne, Kirill Bykov
+
+
+ A crucial aspect of understanding the complex nature of Deep Neural Networks
+(DNNs) is the ability to explain learned concepts within their latent
+representations. While methods exist to connect neurons to human-understandable
+textual descriptions, evaluating the quality of these explanations is
+challenging due to the lack of a unified quantitative approach. We introduce
+CoSy (Concept Synthesis), a novel, architecture-agnostic framework for
+evaluating textual explanations of latent neurons. Given textual explanations,
+our proposed framework uses a generative model conditioned on textual input to
+create data points representing the explanations. By comparing the neuron's
+response to these generated data points and control data points, we can
+estimate the quality of the explanation. We validate our framework through
+sanity checks and benchmark various neuron description methods for Computer
+Vision tasks, revealing significant differences in quality.
+
+
+
+ comment: 10 pages, 5 figures
+
+
+
+
+
+
+ ♻ ☆ Finite-sample performance of the maximum likelihood estimator in
+ logistic regression
+
+
+ Logistic regression is a classical model for describing the probabilistic
+dependence of binary responses on multivariate covariates. We consider the
+predictive performance of the maximum likelihood estimator (MLE) for logistic
+regression, assessed in terms of logistic risk. We consider two questions:
+first, that of the existence of the MLE (which occurs when the dataset is not
+linearly separated), and second that of its accuracy when it exists. These
+properties depend on both the dimension of covariates and on the signal
+strength. In the case of Gaussian covariates and a well-specified logistic
+model, we obtain sharp non-asymptotic guarantees for the existence and excess
+logistic risk of the MLE. We then generalize these results in two ways: first,
+to non-Gaussian covariates satisfying a certain two-dimensional margin
+condition, and second to the general case of statistical learning with a
+possibly misspecified logistic model. Finally, we consider the case of a
+Bernoulli design, where the behavior of the MLE is highly sensitive to the
+parameter direction.
+
+
+
+ comment: Simplified some statements and added a proof sketch in Sec. 4
+
+
+
+
+
+
+ ♻ ☆ Calib3D: Calibrating Model Preferences for Reliable 3D Scene
+ Understanding WACV 2025
+
+
+
+
+
+
+
+
+ Lingdong Kong, Xiang Xu, Jun Cen, Wenwei Zhang, Liang Pan, Kai Chen, Ziwei Liu
+
+
+ Safety-critical 3D scene understanding tasks necessitate not only accurate
+but also confident predictions from 3D perception models. This study introduces
+Calib3D, a pioneering effort to benchmark and scrutinize the reliability of 3D
+scene understanding models from an uncertainty estimation viewpoint. We
+comprehensively evaluate 28 state-of-the-art models across 10 diverse 3D
+datasets, uncovering insightful phenomena that cope with both the aleatoric and
+epistemic uncertainties in 3D scene understanding. We discover that despite
+achieving impressive levels of accuracy, existing models frequently fail to
+provide reliable uncertainty estimates -- a pitfall that critically undermines
+their applicability in safety-sensitive contexts. Through extensive analysis of
+key factors such as network capacity, LiDAR representations, rasterization
+resolutions, and 3D data augmentation techniques, we correlate these aspects
+directly with the model calibration efficacy. Furthermore, we introduce DeptS,
+a novel depth-aware scaling approach aimed at enhancing 3D model calibration.
+Extensive experiments across a wide range of configurations validate the
+superiority of our method. We hope this work could serve as a cornerstone for
+fostering reliable 3D scene understanding. Code and benchmark toolkit are
+publicly available.
+
+
+ We propose a novel method ($floZ$), based on normalizing flows, to estimate
+the Bayesian evidence (and its numerical uncertainty) from a pre-existing set
+of samples drawn from the unnormalized posterior distribution. We validate it
+on distributions whose evidence is known analytically, up to 15 parameter space
+dimensions, and compare with two state-of-the-art techniques for estimating the
+evidence: nested sampling (which computes the evidence as its main target) and
+a $k$-nearest-neighbors technique that produces evidence estimates from
+posterior samples. Provided representative samples from the target posterior
+are available, our method is more robust to posterior distributions with sharp
+features, especially in higher dimensions. For a simple multivariate Gaussian,
+we demonstrate its accuracy for up to 200 dimensions with $10^5$ posterior
+samples. $floZ$ has wide applicability, e.g., to estimate evidence from
+variational inference, Markov Chain Monte Carlo samples, or any other method
+that delivers samples and their likelihood from the unnormalized posterior
+density. As a physical application, we use $floZ$ to compute the Bayes factor
+for the presence of the first overtone in the ringdown signal of the
+gravitational wave data of GW150914, finding good agreement with nested
+sampling.
+
+
+
+
+
+
+
+
+ Eric Elmoznino, Tom Marty, Tejas Kasetty, Leo Gagnon, Sarthak Mittal, Mahan Fathi, Dhanya Sridhar, Guillaume Lajoie
+
+
+ A central goal of machine learning is generalization. While the No Free Lunch
+Theorem states that we cannot obtain theoretical guarantees for generalization
+without further assumptions, in practice we observe that simple models which
+explain the training data generalize best: a principle called Occam's razor.
+Despite the need for simple models, most current approaches in machine learning
+only minimize the training error, and at best indirectly promote simplicity
+through regularization or architecture design. Here, we draw a connection
+between Occam's razor and in-context learning: an emergent ability of certain
+sequence models like Transformers to learn at inference time from past
+observations in a sequence. In particular, we show that the next-token
+prediction loss used to train in-context learners is directly equivalent to a
+data compression technique called prequential coding, and that minimizing this
+loss amounts to jointly minimizing both the training error and the complexity
+of the model that was implicitly learned from context. Our theory and the
+empirical experiments we use to support it not only provide a normative account
+of in-context learning, but also elucidate the shortcomings of current
+in-context learning methods, suggesting ways in which they can be improved. We
+make our code available at https://github.com/3rdCore/PrequentialCode.
+
+
+
+
+
+
+
+ ♻ ☆ Reachable Polyhedral Marching (RPM): An Exact Analysis Tool for
+ Deep-Learned Control Systems
+
+
+ Neural networks are increasingly used in robotics as policies, state
+transition models, state estimation models, or all of the above. With these
+components being learned from data, it is important to be able to analyze what
+behaviors were learned and how this affects closed-loop performance. In this
+paper we take steps toward this goal by developing methods for computing
+control invariant sets and regions of attraction (ROAs) of dynamical systems
+represented as neural networks. We focus our attention on feedforward neural
+networks with the rectified linear unit (ReLU) activation, which are known to
+implement continuous piecewise-affine (PWA) functions. We describe the
+Reachable Polyhedral Marching (RPM) algorithm for enumerating the affine pieces
+of a neural network through an incremental connected walk. We then use this
+algorithm to compute exact forward and backward reachable sets, from which we
+provide methods for computing control invariant sets and ROAs. Our approach is
+unique in that we find these sets incrementally, without Lyapunov-based tools.
+In our examples we demonstrate the ability of our approach to find non-convex
+control invariant sets and ROAs on tasks with learned van der Pol oscillator
+and pendulum models. Further, we provide an accelerated algorithm for computing
+ROAs that leverages the incremental and connected enumeration of affine regions
+that RPM provides. We show this acceleration to lead to a 15x speedup in our
+examples. Finally, we apply our methods to find a set of states that are
+stabilized by an image-based controller for an aircraft runway control problem.
+
+
+
+ comment: Submitted to IEEE Transactions on Neural Networks and Learning
+ Systems. arXiv admin note: text overlap with arXiv:2011.11609
+
+
+
+
+
+
+ ♻ ☆ A Complexity-Based Theory of Compositionality
+
+
+
+
+
+
+
+
+ Eric Elmoznino, Thomas Jiralerspong, Yoshua Bengio, Guillaume Lajoie
+
+
+ Compositionality is believed to be fundamental to intelligence. In humans, it
+underlies the structure of thought, language, and higher-level reasoning. In
+AI, compositional representations can enable a powerful form of
+out-of-distribution generalization, in which a model systematically adapts to
+novel combinations of known concepts. However, while we have strong intuitions
+about what compositionality is, there currently exists no formal definition for
+it that is measurable and mathematical. Here, we propose such a definition,
+which we call representational compositionality, that accounts for and extends
+our intuitions about compositionality. The definition is conceptually simple,
+quantitative, grounded in algorithmic information theory, and applicable to any
+representation. Intuitively, representational compositionality states that a
+compositional representation satisfies three properties. First, it must be
+expressive. Second, it must be possible to re-describe the representation as a
+function of discrete symbolic sequences with re-combinable parts, analogous to
+sentences in natural language. Third, the function that relates these symbolic
+sequences to the representation, analogous to semantics in natural language,
+must be simple. Through experiments on both synthetic and real world data, we
+validate our definition of compositionality and show how it unifies disparate
+intuitions from across the literature in both AI and cognitive science. We also
+show that representational compositionality, while theoretically intractable,
+can be readily estimated using standard deep learning tools. Our definition has
+the potential to inspire the design of novel, theoretically-driven models that
+better capture the mechanisms of compositional thought.
+
+
+
+
+
+
+
+ ♻ ★ Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild NeurIPS 2024
+
+
+ As Large Language Models (LLMs) excel across tasks and specialized domains,
+scaling LLMs based on existing models has garnered significant attention, but it
+faces the challenge of decreasing performance when combining disparate models.
+Various techniques have been proposed for the aggregation of pre-trained LLMs,
+including model merging, Mixture-of-Experts, and stacking. Despite their
+merits, a comprehensive comparison and synergistic application of them to a
+diverse model zoo is yet to be adequately addressed. In light of this research
+gap, this paper introduces Model-GLUE, a holistic LLM scaling guideline. First,
+our work starts with a benchmarking of existing LLM scaling techniques,
+especially selective merging, and variants of mixture. Utilizing the insights
+from the benchmark results, we formulate an optimal strategy for the selection
+and aggregation of a heterogeneous model zoo characterizing different
+architectures and initialization. Our methodology involves the clustering of
+mergeable models and optimal merging strategy selection, and the integration of
+clusters through a model mixture. Finally, evidenced by our experiments on a
+diverse Llama-2-based model zoo, Model-GLUE shows an average performance
+enhancement of 5.61%, achieved without additional training. Codes are available
+at: https://github.com/Model-GLUE/Model-GLUE.
+
+
+
+ comment: 24 pages, 4 figures, accepted to NeurIPS 2024 Datasets and Benchmarks
+ Track
+
+
+
+
+
+
+
+ Dung Thuy Nguyen, Ngoc N. Tran, Taylor T. Johnson, Kevin Leach
+
+
+ In recent years, the rise of machine learning (ML) in cybersecurity has
+brought new challenges, including the increasing threat of backdoor poisoning
+attacks on ML malware classifiers. For instance, adversaries could inject
+malicious samples into public malware repositories, contaminating the training
+data and potentially causing the ML model to misclassify malware. Current
+countermeasures predominantly focus on detecting poisoned samples by leveraging
+disagreements within the outputs of a diverse set of ensemble models on
+training data points. However, these methods are not suitable for scenarios
+where Machine Learning-as-a-Service (MLaaS) is used or when users aim to remove
+backdoors from a model after it has been trained. Addressing this scenario, we
+introduce PBP, a post-training defense for malware classifiers that mitigates
+various types of backdoor embeddings without assuming any specific backdoor
+embedding mechanism. Our method exploits the influence of backdoor attacks on
+the activation distribution of neural networks, independent of the
+trigger-embedding method. In the presence of a backdoor attack, the activation
+distribution of each layer is distorted into a mixture of distributions. By
+regulating the statistics of the batch normalization layers, we can guide a
+backdoored model to perform similarly to a clean one. Our method demonstrates
+substantial advantages over several state-of-the-art methods, as evidenced by
+experiments on two datasets, two types of backdoor methods, and various attack
+configurations. Notably, our approach requires only a small portion of the
+training data -- only 1% -- to purify the backdoor and reduce the attack
+success rate from 100% to almost 0%, a 100-fold improvement over the baseline
+methods. Our code is available at
+https://github.com/judydnguyen/pbp-backdoor-purification-official.
+
+
+
+ comment: Accepted at NDSS 2025
+
+
+
+
+
+
+ ♻ ☆ SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving
+ Model Transformation
+
+
+ LLM inference for popular enterprise use cases, such as summarization, RAG,
+and code-generation, typically observes orders of magnitude longer prompt
+lengths than generation lengths. This characteristic leads to high cost of
+prefill and increased response latency. In this paper, we present SwiftKV, a
+novel model transformation and distillation procedure specifically designed to
+reduce the time and cost of processing prompt tokens while preserving high
+quality of generated tokens. SwiftKV combines three key mechanisms: i)
+SingleInputKV, which prefills later layers' KV cache using a much earlier
+layer's output, allowing prompt tokens to skip much of the model computation,
+ii) AcrossKV, which merges the KV caches of neighboring layers to reduce the
+memory footprint and support larger batch size for higher throughput, and iii)
+a knowledge-preserving distillation procedure that can adapt existing LLMs for
+SwiftKV with minimal accuracy impact and low compute and data requirements. For
+Llama-3.1-8B and 70B, SwiftKV reduces the compute requirement of prefill by 50%
+and the memory requirement of the KV cache by 62.5% while incurring minimal
+quality degradation across a wide range of tasks. In the end-to-end inference
+serving using an optimized vLLM implementation, SwiftKV realizes up to 2x
+higher aggregate throughput and 60% lower time per output token. It can achieve
+a staggering 560 TFlops/GPU of normalized inference throughput, which
+translates to 16K tokens/s for Llama-3.1-70B in 16-bit precision on 4x H100
+GPUs. Our training, inference, and model implementations are open-sourced and
+can be found through
+https://huggingface.co/collections/Snowflake/swiftkv-models-674f7d7474eb789e185d31cb.
+
+
+
+
+
+
+
+
+ Zhangfan Yang, Junkai Ji, Shan He, Jianqiang Li, Tiantian He, Ruibin Bai, Zexuan Zhu, Yew Soon Ong
+
+
+ Molecular docking is a crucial step in drug development, which enables the
+virtual screening of compound libraries to identify potential ligands that
+target proteins of interest. However, the computational complexity of
+traditional docking models increases as the size of the compound library
+increases. Recently, deep learning algorithms have provided data-driven research
+and development models to increase the speed of the docking process.
+Unfortunately, few models can achieve superior screening performance compared
+to that of traditional models. Therefore, a novel deep learning-based docking
+approach named Dockformer is introduced in this study. Dockformer leverages
+multimodal information to capture the geometric topology and structural
+knowledge of molecules and can directly generate binding conformations with the
+corresponding confidence measures in an end-to-end manner. The experimental
+results show that Dockformer achieves success rates of 90.53% and 82.71% on the
+PDBbind core set and PoseBusters benchmarks, respectively, and more than a
+100-fold increase in the inference process speed, outperforming almost all
+state-of-the-art docking methods. In addition, the ability of Dockformer to
+identify the main protease inhibitors of coronaviruses is demonstrated in a
+real-world virtual screening scenario. Considering its high docking accuracy
+and screening efficiency, Dockformer can be regarded as a powerful and robust
+tool in the field of drug design.
+
+
+
+ comment: 15 pages, 10 figures
+
+
+
+
+
+
+ ♻ ☆ On the Benefits of Active Data Collection in Operator Learning
+
+
+ We investigate active data collection strategies for operator learning when
+the target operator is linear and the input functions are drawn from a
+mean-zero stochastic process with continuous covariance kernels. With an active
+data collection strategy, we establish an error convergence rate in terms of
+the decay rate of the eigenvalues of the covariance kernel. Thus, with
+sufficiently rapid eigenvalue decay of the covariance kernels, arbitrarily fast
+error convergence rates can be achieved. This contrasts with the passive
+(i.i.d.) data collection strategies, where the convergence rate is never faster
+than $\sim n^{-1}$. In fact, for our setting, we establish a
+non-vanishing lower bound for any passive data collection strategy,
+regardless of the eigenvalue decay rate of the covariance kernel. Overall, our
+results show the benefit of active over passive data collection strategies in
+operator learning.
+
+
+
+ comment: Added experiments
+
+
+
+
+
+
+ ♻ ☆ Fast and reliable uncertainty quantification with neural network
+ ensembles for industrial image classification
+
+
+ Image classification with neural networks (NNs) is widely used in industrial
+processes, situations where the model likely encounters unknown objects during
+deployment, i.e., out-of-distribution (OOD) data. Worryingly, NNs tend to make
+confident yet incorrect predictions when confronted with OOD data. To increase
+the models' reliability, they should quantify the uncertainty in their own
+predictions, communicating when the output should (not) be trusted. Deep
+ensembles, composed of multiple independent NNs, have been shown to perform
+strongly but are computationally expensive. Recent research has proposed more
+efficient NN ensembles, namely the snapshot, batch, and multi-input
+multi-output ensemble. This study investigates the predictive and uncertainty
+performance of efficient NN ensembles in the context of image classification
+for industrial processes. It is the first to provide a comprehensive comparison
+and it proposes a novel Diversity Quality metric to quantify the ensembles'
+performance on the in-distribution and OOD sets in one single metric. The
+results highlight the batch ensemble as a cost-effective and competitive
+alternative to the deep ensemble. It matches the deep ensemble in both
+uncertainty and accuracy while exhibiting considerable savings in training
+time, test time, and memory storage.
+
+
+
+ comment: Submitted to Annals of Operations Research
+
+
+
+
+
+
+ ♻ ☆ Molmo and PixMo: Open Weights and Open Data for State-of-the-Art
+ Vision-Language Models
+
+
+
+
+
+
+
+
+ Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi
+
+
+ Today's most advanced vision-language models (VLMs) remain proprietary. The
+strongest open-weight models rely heavily on synthetic data from proprietary
+VLMs to achieve good performance, effectively distilling these closed VLMs into
+open ones. As a result, the community has been missing foundational knowledge
+about how to build performant VLMs from scratch. We present Molmo, a new family
+of VLMs that are state-of-the-art in their class of openness. Our key
+contribution is a collection of new datasets called PixMo, including a dataset
+of highly detailed image captions for pre-training, a free-form image Q&A
+dataset for fine-tuning, and an innovative 2D pointing dataset, all collected
+without the use of external VLMs. The success of our approach relies on careful
+modeling choices, a well-tuned training pipeline, and, most critically, the
+quality of our newly collected datasets. Our best-in-class 72B model not only
+outperforms others in the class of open weight and data models, but also
+outperforms larger proprietary models including Claude 3.5 Sonnet, and Gemini
+1.5 Pro and Flash, second only to GPT-4o based on both academic benchmarks and
+on a large human evaluation. Our model weights, new datasets, and source code
+are available at https://molmo.allenai.org/blog.
+
+
+
+ comment: Updated with ablations and more technical details
+
+
+
+
+
+
+ ♻ ☆ Adaptive Circuit Behavior and Generalization in Mechanistic
+ Interpretability
+
+
+
+
+
+
+
+
+ Jatin Nainani, Sankaran Vaidyanathan, AJ Yeung, Kartik Gupta, David Jensen
+
+
+ Mechanistic interpretability aims to understand the inner workings of large
+neural networks by identifying circuits, or minimal subgraphs within the model
+that implement algorithms responsible for performing specific tasks. These
+circuits are typically discovered and analyzed using a narrowly defined prompt
+format. However, given the abilities of large language models (LLMs) to
+generalize across various prompt formats for the same task, it remains unclear
+how well these circuits generalize. For instance, it is unclear whether the
+model's generalization results from reusing the same circuit components, the
+components behaving differently, or the use of entirely different components.
+In this paper, we investigate the generality of the indirect object
+identification (IOI) circuit in GPT-2 small, which is well-studied and believed
+to implement a simple, interpretable algorithm. We evaluate its performance on
+prompt variants that challenge the assumptions of this algorithm. Our findings
+reveal that the circuit generalizes surprisingly well, reusing all of its
+components and mechanisms while only adding additional input edges. Notably,
+the circuit generalizes even to prompt variants where the original algorithm
+should fail; we discover a mechanism that explains this which we term S2
+Hacking. Our findings indicate that circuits within LLMs may be more flexible
+and general than previously recognized, underscoring the importance of studying
+circuit generalization to better understand the broader capabilities of these
+models.
+
+
+
+ comment: 10 pages, 8 figures
+
+
+
+
+
+
+ ♻ ☆ When Stability meets Sufficiency: Informative Explanations that do not
+ Overwhelm
+
+
+ Recent studies evaluating various criteria for explainable artificial
+intelligence (XAI) suggest that fidelity, stability, and comprehensibility are
+among the most important metrics considered by users of AI across a diverse
+collection of usage contexts. We consider these criteria as applied to
+feature-based attribution methods, which are amongst the most prevalent in XAI
+literature. Going beyond standard correlation, methods have been proposed that
+highlight what should be minimally sufficient to justify the classification of
+an input (viz. pertinent positives). While minimal sufficiency is an attractive
+property akin to comprehensibility, the resulting explanations are often too
+sparse for a human to understand and evaluate the local behavior of the model.
+To overcome these limitations, we incorporate the criteria of stability and
+fidelity and propose a novel method called Path-Sufficient Explanations Method
+(PSEM) that outputs a sequence of stable and sufficient explanations for a
+given input of strictly decreasing size (or value) -- from original input to a
+minimally sufficient explanation -- which can be thought to trace the local
+boundary of the model in a stable manner, thus providing better intuition about
+the local model behavior for the specific input. We validate these claims, both
+qualitatively and quantitatively, with experiments that show the benefit of
+PSEM across three modalities (image, tabular and text) as well as versus other
+path explanations. A user study demonstrates the strength of the method in
+communicating the local behavior, where (many) users are able to correctly
+determine the prediction made by a model.
+
+
+
+ comment: Published at TMLR
+
+
+
+
+
+
+ ♻ ☆ Looking at Model Debiasing through the Lens of Anomaly Detection WACV
+
+
+ It is widely recognized that deep neural networks are sensitive to bias in
+the data. This means that during training these models are likely to learn
+spurious correlations between data and labels, resulting in limited
+generalization abilities and low performance. In this context, model debiasing
+approaches can be devised aiming at reducing the model's dependency on such
+unwanted correlations, either leveraging the knowledge of bias information or
+not. In this work, we focus on the latter and more realistic scenario, showing
+the importance of accurately predicting the bias-conflicting and bias-aligned
+samples to obtain compelling performance in bias mitigation. On this ground, we
+propose to conceive the problem of model bias from an out-of-distribution
+perspective, introducing a new bias identification method based on anomaly
+detection. We claim that when data is mostly biased, bias-conflicting samples
+can be regarded as outliers with respect to the bias-aligned distribution in
+the feature space of a biased model, thus allowing for precisely detecting them
+with an anomaly detection method. Coupling the proposed bias identification
+approach with bias-conflicting data upsampling and augmentation in a two-step
+strategy, we reach state-of-the-art performance on synthetic and real benchmark
+datasets. Ultimately, our proposed approach shows that the data bias issue does
+not necessarily require complex debiasing methods, given that an accurate bias
+identification procedure is defined. Source code is available at
+https://github.com/Malga-Vision/MoDAD
+
+
+
+ comment: 13 pages, 8 figures; Accepted at IEEE/CVF Winter Conference on
+ Applications of Computer Vision (WACV) 2025
+
+
+
+
+
+
+ ♻ ☆ GV-Rep: A Large-Scale Dataset for Genetic Variant Representation
+ Learning
+
+
+
+
+
+
+
+
+ Zehui Li, Vallijah Subasri, Guy-Bart Stan, Yiren Zhao, Bo Wang
+
+
+ Genetic variants (GVs) are defined as differences in the DNA sequences among
+individuals and play a crucial role in diagnosing and treating genetic
+diseases. The rapid decrease in next generation sequencing cost has led to an
+exponential increase in patient-level GV data. This growth poses a challenge
+for clinicians who must efficiently prioritize patient-specific GVs and
+integrate them with existing genomic databases to inform patient management. To
+address the interpretation of GVs, genomic foundation models (GFMs) have
+emerged. However, these models lack standardized performance assessments,
+leading to considerable variability in model evaluations. This poses the
+question: How effectively do deep learning methods classify unknown GVs and
+align them with clinically-verified GVs? We argue that representation learning,
+which transforms raw data into meaningful feature spaces, is an effective
+approach for addressing both indexing and classification challenges. We
+introduce a large-scale Genetic Variant dataset, named GV-Rep, featuring
+variable-length contexts and detailed annotations, designed for deep learning
+models to learn GV representations across various traits, diseases, tissue
+types, and experimental contexts. Our contributions are three-fold: (i)
+Construction of a comprehensive dataset with 7 million records, each labeled
+with characteristics of the corresponding variants, alongside additional data
+from 17,548 gene knockout tests across 1,107 cell types, 1,808 variant
+combinations, and 156 unique clinically verified GVs from real-world patients.
+(ii) Analysis of the structure and properties of the dataset. (iii)
+Experimentation on the dataset with pre-trained GFMs. The results show a
+significant gap between GFMs' current capabilities and accurate GV
+representation. We hope this dataset will help advance genomic deep learning to
+bridge this gap.
+
+
+
+ comment: Preprint
+
+
+
+
+
+
+ ♻ ☆ Learning Semantic Association Rules from Internet of Things Data
+
+
+
+
+
+
+
+
+ Erkan Karabulut, Paul Groth, Victoria Degeler
+
+
+ Association Rule Mining (ARM) is the task of discovering commonalities in
+data in the form of logical implications. ARM is used in the Internet of Things
+(IoT) for different tasks including monitoring and decision-making. However,
+existing methods give limited consideration to IoT-specific requirements such
+as heterogeneity and volume. Furthermore, they do not utilize important static
+domain-specific description data about IoT systems, which is increasingly
+represented as knowledge graphs. In this paper, we propose a novel ARM pipeline
+for IoT data that utilizes both dynamic sensor data and static IoT system
+metadata. Furthermore, we propose an Autoencoder-based Neurosymbolic ARM method
+(Aerial) as part of the pipeline to address the high volume of IoT data and
+reduce the total number of rules that are resource-intensive to process. Aerial
+learns a neural representation of the given data and extracts association rules
+from this representation by exploiting the reconstruction (decoding) mechanism
+of an autoencoder. Extensive evaluations on 3 IoT datasets from 2 domains show
+that ARM on both static and dynamic IoT data results in more generically
+applicable rules while Aerial can learn a more concise set of high-quality
+association rules than the state-of-the-art with full coverage over the
+datasets.
+
+
+
+
+
+
+
+
+ Hikaru Shindo, Manuel Brack, Gopika Sudhakaran, Devendra Singh Dhami, Patrick Schramowski, Kristian Kersting
+
+
+ Large-scale, pre-trained neural networks have demonstrated strong
+capabilities in various tasks, including zero-shot image segmentation. To
+identify concrete objects in complex scenes, humans instinctively rely on
+deictic descriptions in natural language, i.e., referring to something
+depending on the context such as "The object that is on the desk and behind the
+cup." However, deep learning approaches cannot reliably interpret such deictic
+representations due to their lack of reasoning capabilities in complex
+scenarios. To remedy this issue, we propose DeiSAM -- a combination of large
+pre-trained neural networks with differentiable logic reasoners -- for deictic
+promptable segmentation. Given a complex, textual segmentation description,
+DeiSAM leverages Large Language Models (LLMs) to generate first-order logic
+rules and performs differentiable forward reasoning on generated scene graphs.
+Subsequently, DeiSAM segments objects by matching them to the logically
+inferred image regions. As part of our evaluation, we propose the Deictic
+Visual Genome (DeiVG) dataset, containing paired visual input and complex,
+deictic textual prompts. Our empirical results demonstrate that DeiSAM is a
+substantial improvement over purely data-driven baselines for deictic
+promptable segmentation.
+
+
+
+ comment: Published as a conference paper at NeurIPS 2024
+
+
+
+
+
+
+ ♻ ☆ Marrying Causal Representation Learning with Dynamical Systems for
+ Science NeurIPS 2024
+
+
+ Causal representation learning promises to extend causal models to hidden
+causal variables from raw entangled measurements. However, most progress has
+focused on proving identifiability results in different settings, and we are
+not aware of any successful real-world application. At the same time, the field
+of dynamical systems has benefited from deep learning and scaled to countless
+applications but does not allow parameter identification. In this paper, we
+draw a clear connection between the two and their key assumptions, allowing us
+to apply identifiable methods developed in causal representation learning to
+dynamical systems. At the same time, we can leverage scalable differentiable
+solvers developed for differential equations to build models that are both
+identifiable and practical. Overall, we learn explicitly controllable models
+that isolate the trajectory-specific parameters for further downstream tasks
+such as out-of-distribution classification or treatment effect estimation. We
+experiment with a wind simulator with partially known factors of variation. We
+also apply the resulting model to real-world climate data and successfully
+answer downstream causal questions in line with existing literature on climate
+change.
+
+
+ Safety alignment of Large Language Models (LLMs) has recently become a
+critical objective of model developers. In response, a growing body of work has
+been investigating how safety alignment can be bypassed through various
+jailbreaking methods, such as adversarial attacks. However, these jailbreak
+methods can be rather costly or involve a non-trivial amount of creativity and
+effort, introducing the assumption that malicious users are high-resource or
+sophisticated. In this paper, we study how simple random augmentations to the
+input prompt affect safety alignment effectiveness in state-of-the-art LLMs,
+such as Llama 3 and Qwen 2. We perform an in-depth evaluation of 17 different
+models and investigate the intersection of safety under random augmentations
+with multiple dimensions: augmentation type, model size, quantization,
+fine-tuning-based defenses, and decoding strategies (e.g., sampling
+temperature). We show that low-resource and unsophisticated attackers, i.e.
+$\textit{stochastic monkeys}$, can significantly improve their chances of
+bypassing alignment with just 25 random augmentations per prompt. Source code
+and data: https://github.com/uiuc-focal-lab/stochastic-monkeys/
+
+
+
+ comment: v2: Updated with changes from peer review rebuttal. v1: Version under
+ peer review
+
+
+
+
+
+
+ ♻ ☆ Group Distributionally Robust Optimization can Suppress Class Imbalance
+ Effect in Network Traffic Classification
+
+
+ Internet services have led to the eruption of network traffic, and machine
+learning on these Internet data has become an indispensable tool, especially
+when the application is risk-sensitive. This paper focuses on network traffic
+classification in the presence of class imbalance, which fundamentally and
+ubiquitously exists in Internet data analysis. Class imbalance
+tends to shift the optimal decision boundary, resulting in a suboptimal
+solution for machine learning models. To alleviate this effect, we
+propose strategies that address class imbalance through the
+lens of group distributionally robust optimization. Our approach iteratively
+updates the non-parametric weights for separate classes and optimizes the
+learning model by minimizing reweighted losses. We interpret the optimization
+process as a Stackelberg game and perform extensive experiments on typical
+benchmarks. Results show that our approach can not only suppress the negative
+effect of class imbalance but also improve the comprehensive performance in
+prediction.
+
+
+
+
+
+
+
+ ♻ ☆ Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs
+
+
+
+
+
+
+
+
+ Jonas Hübotter, Sascha Bongni, Ido Hakimi, Andreas Krause
+
+
+ Recent efforts in fine-tuning language models often rely on automatic data
+selection, commonly using Nearest Neighbors retrieval from large datasets.
+However, we theoretically show that this approach tends to select redundant
+data, limiting its effectiveness or even hurting performance. To address this,
+we introduce SIFT, a data selection algorithm designed to reduce uncertainty
+about the model's response given a prompt, which unifies ideas from retrieval
+and active learning. Whereas Nearest Neighbor retrieval typically fails in the
+presence of information duplication, SIFT accounts for information duplication
+and optimizes the overall information gain of the selected examples. We focus
+our evaluations on fine-tuning at test-time for prompt-specific language
+modeling on the Pile dataset, and show that SIFT consistently outperforms
+Nearest Neighbor retrieval, with minimal computational overhead. Moreover, we
+show that our uncertainty estimates can predict the performance gain of
+test-time fine-tuning, and use this to develop an adaptive algorithm that
+invests test-time compute proportional to realized performance gains. We
+provide the $\texttt{activeft}$ (Active Fine-Tuning) library which can be used
+as a drop-in replacement for Nearest Neighbor retrieval.
+
+
+ Contrastive learning has significantly improved representation quality,
+enhancing knowledge transfer across tasks in continual learning (CL). However,
+catastrophic forgetting remains a key challenge, as contrastive-based methods
+primarily focus on "soft relationships" or "softness" between samples, which
+shift with changing data distributions and lead to representation overlap
+across tasks. Recently, the newly identified Neural Collapse phenomenon has
+shown promise in CL by focusing on "hard relationships" or "hardness" between
+samples and fixed prototypes. However, this approach overlooks "softness",
+crucial for capturing intra-class variability, and this rigid focus can also
+pull old class representations toward current ones, increasing forgetting.
+Building on these insights, we propose Focal Neural Collapse Contrastive
+(FNC2), a novel representation learning loss that effectively balances both
+soft and hard relationships. Additionally, we introduce the Hardness-Softness
+Distillation (HSD) loss to progressively preserve the knowledge gained from
+these relationships across tasks. Our method outperforms state-of-the-art
+approaches, particularly in minimizing memory reliance. Remarkably, even
+without the use of memory, our approach rivals rehearsal-based methods,
+offering a compelling solution for data privacy concerns.
+
+
+
+ comment: Accepted at WACV 2025
+
+
+
+
+
+
+ ♻ ☆ PePR: Performance Per Resource Unit as a Metric to Promote Small-Scale
+ Deep Learning in Medical Image Analysis
+
+
+
+
+
+
+
+
+ Raghavendra Selvan, Bob Pepin, Christian Igel, Gabrielle Samuel, Erik B Dam
+
+
+ The recent advances in deep learning (DL) have been accelerated by access to
+large-scale data and compute. These large-scale resources have been used to
+train progressively larger models which are resource intensive in terms of
+compute, data, energy, and carbon emissions. These costs are becoming a new
+type of entry barrier to researchers and practitioners with limited access to
+resources at such scale, particularly in the Global South. In this work, we
+take a comprehensive look at the landscape of existing DL models for medical
+image analysis tasks and demonstrate their usefulness in settings where
+resources are limited. To account for the resource consumption of DL models, we
+introduce a novel measure to estimate the performance per resource unit, which
+we call the PePR score. Using a diverse family of 131 unique DL architectures
+(spanning 1M to 130M trainable parameters) and three medical image datasets, we
+capture trends about the performance-resource trade-offs. In applications like
+medical image analysis, we argue that small-scale, specialized models are
+better than striving for large-scale models. Furthermore, we show that using
+existing pretrained models that are fine-tuned on new data can significantly
+reduce the computational resources and data required compared to training
+models from scratch. We hope this work will encourage the community to focus on
+improving AI equity by developing methods and models with smaller resource
+footprints.
+
+
+
+ comment: Accepted to be published at the Northern Lights Deep Learning
+ Conference (NLDL), 2025. Source code available at
+ https://github.com/saintslab/PePR
+
+
+
+
+
+
+ ♻ ☆ What should a neuron aim for? Designing local objective functions based
+ on information theory
+
+
+
+
+
+
+
+
+ Andreas C. Schneider, Valentin Neuhaus, David A. Ehrlich, Abdullah Makkeh, Alexander S. Ecker, Viola Priesemann, Michael Wibral
+
+
+ In modern deep neural networks, the learning dynamics of the individual
+neurons is often obscure, as the networks are trained via global optimization.
+Conversely, biological systems build on self-organized, local learning,
+achieving robustness and efficiency with limited global information. We here
+show how self-organization between individual artificial neurons can be
+achieved by designing abstract bio-inspired local learning goals. These goals
+are parameterized using a recent extension of information theory, Partial
+Information Decomposition (PID), which decomposes the information that a set of
+information sources holds about an outcome into unique, redundant and
+synergistic contributions. Our framework enables neurons to locally shape the
+integration of information from various input classes, i.e. feedforward,
+feedback, and lateral, by selecting which of the three inputs should contribute
+uniquely, redundantly or synergistically to the output. This selection is
+expressed as a weighted sum of PID terms, which, for a given problem, can be
+directly derived from intuitive reasoning or via numerical optimization,
+offering a window into understanding task-relevant local information
+processing. Achieving neuron-level interpretability while enabling strong
+performance using local learning, our work advances a principled
+information-theoretic foundation for local learning strategies.
+
+
+
+ comment: 24 pages, 11 figures
+
+
+
+
+
+
+ ♻ ☆ Learning on Model Weights using Tree Experts
+
+
+
+
+
+
+
+
+ Eliahu Horwitz, Bar Cavia, Jonathan Kahana, Yedid Hoshen
+
+
+ The increasing availability of public models begs the question: can we train
+neural networks that use other networks as input? Such models allow us to study
+different aspects of a given neural network, for example, determining the
+categories in a model's training dataset. However, machine learning on model
+weights is challenging as they often exhibit significant variation unrelated to
+the models' semantic properties (nuisance variation). Here, we identify a key
+property of real-world models: most public models belong to a small set of
+Model Trees, where all models within a tree are fine-tuned from a common
+ancestor (e.g., a foundation model). Importantly, we find that within each tree
+there is less nuisance variation between models. Concretely, while learning
+across Model Trees requires complex architectures, even a linear classifier
+trained on a single model layer often works within trees. While effective,
+these linear classifiers are computationally expensive, especially when dealing
+with larger models that have many parameters. To address this, we introduce
+Probing Experts (ProbeX), a theoretically motivated and lightweight method.
+Notably, ProbeX is the first probing method specifically designed to learn from
+the weights of a single hidden model layer. We demonstrate the effectiveness of
+ProbeX by predicting the categories in a model's training dataset based only on
+its weights. Excitingly, ProbeX can also map the weights of Stable Diffusion
+into a shared weight-language embedding space, enabling zero-shot model
+classification.
+
+
+ Transformer-based models generate hidden states that are difficult to
+interpret. In this work, we aim to interpret these hidden states and control
+them at inference, with a focus on motion forecasting. We use linear probes to
+measure neural collapse towards interpretable motion features in hidden states.
+High probing accuracy implies meaningful directions and distances between
+hidden states of opposing features, which we use to fit interpretable control
+vectors for activation steering at inference. To optimize our control vectors,
+we use sparse autoencoders with fully-connected, convolutional, MLPMixer layers
+and various activation functions. Notably, we show that enforcing sparsity in
+hidden states leads to a more linear relationship between control vector
+temperatures and forecasts. Our approach enables mechanistic interpretability
+and zero-shot generalization to unseen dataset characteristics with negligible
+computational overhead. Our implementation is available at
+https://github.com/kit-mrt/future-motion
+
+
+
+ comment: Add autoencoders with convolutional, MLPMixer layers, and JumpReLU
+ activations
+
+
+
+
+
+
+ ♻ ☆ VGGHeads: 3D Multi Head Alignment with a Large-Scale Synthetic Dataset
+
+
+ Human head detection, keypoint estimation, and 3D head model fitting are
+essential tasks with many applications. However, traditional real-world
+datasets often suffer from bias, privacy, and ethical concerns, and they have
+been recorded in laboratory environments, which makes it difficult for trained
+models to generalize. Here, we introduce VGGHeads -- a large-scale synthetic
+dataset generated with diffusion models for human head detection and 3D mesh
+estimation. Our dataset comprises over 1 million high-resolution images, each
+annotated with detailed 3D head meshes, facial landmarks, and bounding boxes.
+Using this dataset, we introduce a new model architecture capable of
+simultaneous head detection and head mesh reconstruction from a single image in
+a single step. Through extensive experimental evaluations, we demonstrate that
+models trained on our synthetic data achieve strong performance on real images.
+Furthermore, the versatility of our dataset makes it applicable across a broad
+spectrum of tasks, offering a general and comprehensive representation of human
+heads.
+
+
+
+
+
+
+
+
+ Cyril Shih-Huan Hsu, Danny De Vleeschauwer, Chrysa Papagianni
+
+
+ When a network slice spans multiple technology domains, it is crucial for
+each domain to uphold the End-to-End (E2E) Service Level Agreement (SLA)
+associated with the slice. Consequently, the E2E SLA must be properly
+decomposed into partial SLAs that are assigned to each domain involved. In a
+network slice management system with a two-level architecture, comprising an
+E2E service orchestrator and local domain controllers, we consider that the
+orchestrator has access solely to historical data regarding the responses of
+local controllers to previous requests, and this information is used to
+construct a risk model for each domain. In this study, we extend our previous
+work by investigating the dynamic nature of real-world systems and introducing
+an online learning-decomposition framework to tackle the dynamicity. We propose
+a framework that periodically updates the risk models based on the most recent
+feedback. This approach leverages key components such as online gradient
+descent and FIFO memory buffers, which enhance the stability and robustness of
+the overall process. Our empirical study on an analytic model-based simulator
+demonstrates that the proposed framework outperforms the state-of-the-art
+static approach, providing more accurate and resilient SLA decomposition even
+under varying conditions and limited data scenarios.
+
+
+
+ comment: The paper has been submitted to IEEE ICMLCN 2025
+
+
+
+
+
+
+ ♻ ☆ Deep learning empowered sensor fusion boosts infant movement
+ classification
+
+
+
+
+
+
+
+
+ Tomas Kulvicius, Dajie Zhang, Luise Poustka, Sven Bölte, Lennart Jahn, Sarah Flügge, Marc Kraft, Markus Zweckstetter, Karin Nielsen-Saines, Florentin Wörgötter, Peter B Marschik
+
+
+ To assess the integrity of the developing nervous system, the Prechtl general
+movement assessment (GMA) is recognized for its clinical value in diagnosing
+neurological impairments in early infancy. GMA has been increasingly augmented
+through machine learning approaches intended to scale up its application,
+circumvent costs in the training of human assessors and further standardize
+classification of spontaneous motor patterns. Available deep learning tools,
+all of which are based on single sensor modalities, are, however, still
+considerably inferior to well-trained human assessors. These approaches
+are hardly comparable as all models are designed, trained and evaluated on
+proprietary/silo-data sets. With this study we propose a sensor fusion approach
+for assessing fidgety movements (FMs). FMs were recorded from 51 typically
+developing participants. We compared three different sensor modalities
+(pressure, inertial, and visual sensors). Various combinations and two sensor
+fusion approaches (late and early fusion) for infant movement classification
+were tested to evaluate whether a multi-sensor system outperforms single
+modality assessments. Convolutional neural network (CNN) architectures were
+used to classify movement patterns. The performance of the three-sensor fusion
+(classification accuracy of 94.5%) was significantly higher than that of any
+single modality evaluated. We show that the sensor fusion approach is a
+promising avenue for automated classification of infant motor patterns. The
+development of a robust sensor fusion system may significantly enhance AI-based
+early recognition of neurofunctions, ultimately facilitating automated early
+detection of neurodevelopmental conditions.
+
+
+
+
+
+
+
+
+ Sebastian Bieringer, Gregor Kasieczka, Maximilian F. Steffen, Mathias Trabs
+
+
+ Uncertainty estimation is a key issue when considering the application of
+deep neural network methods in science and engineering. In this work, we
+introduce a novel algorithm that quantifies epistemic uncertainty via Monte
+Carlo sampling from a tempered posterior distribution. It combines the
+well-established Metropolis Adjusted Langevin Algorithm (MALA) with momentum-based
+optimization using Adam and leverages a prolate proposal distribution to
+efficiently draw from the posterior. We prove that the constructed chain admits
+the Gibbs posterior as invariant distribution and approximates this posterior
+in total variation distance. Furthermore, we demonstrate the efficiency of the
+resulting algorithm and the merit of the proposed changes on a state-of-the-art
+classifier from high-energy particle physics.
+
+
+ The elastic net penalty is frequently employed in high-dimensional statistics
+for parameter regression and variable selection. It is particularly beneficial
+compared to lasso when the number of predictors greatly surpasses the number of
+observations. However, empirical evidence has shown that the $\ell_q$-norm
+penalty (where $0 < q < 1$) often provides better regression compared to the
+$\ell_1$-norm penalty, demonstrating enhanced robustness in various scenarios.
+In this paper, we explore a generalized elastic net model that employs an
+$\ell_r$-norm (where $r \geq 1$) in the loss function to accommodate various types
+of noise, and employs an $\ell_q$-norm (where $0 < q < 1$) to replace the
+$\ell_1$-norm in the elastic net penalty. Theoretically, we establish the
+computable lower bounds for the nonzero entries of the generalized first-order
+stationary points of the proposed generalized elastic net model. For
+implementation, we develop two efficient algorithms based on the locally
+Lipschitz continuous $\epsilon$-approximation to $\ell_q$-norm. The first
+algorithm employs an alternating direction method of multipliers (ADMM), while
+the second utilizes a proximal majorization-minimization method (PMM), where
+the subproblems are addressed using the semismooth Newton method (SSN). We also
+perform extensive numerical experiments with both simulated and real data,
+showing that both algorithms demonstrate superior performance. Notably, the
+PMM-SSN is more efficient than ADMM, even though the latter provides a simpler
+implementation.
+
+
+
+
+
+
+
+
+ Petar Bevanda, Nicolas Hoischen, Tobias Wittmann, Jan Brüdigam, Sandra Hirche, Boris Houska
+
+
+ This paper presents a novel approach for optimal control of nonlinear
+stochastic systems using infinitesimal generator learning within
+infinite-dimensional reproducing kernel Hilbert spaces. Our learning framework
+leverages data samples of system dynamics and stage cost functions, with only
+control penalties and constraints provided. The proposed method directly learns
+the diffusion operator of a controlled Fokker-Planck-Kolmogorov equation in an
+infinite-dimensional hypothesis space. This operator models the continuous-time
+evolution of the probability measure of the control system's state. We
+demonstrate that this approach seamlessly integrates with modern convex
+operator-theoretic Hamilton-Jacobi-Bellman recursions, enabling a data-driven
+solution to the optimal control problem. Furthermore, our statistical learning
+framework includes nonparametric estimators for uncontrolled forward
+infinitesimal generators as a special case. Numerical experiments, ranging from
+synthetic differential equations to simulated robotic systems, showcase the
+advantages of our approach compared to both modern data-driven and classical
+nonlinear programming methods for optimal control.
+
+
+
+
+
+
+
+ ♻ ☆ Relax and Merge: A Simple Yet Effective Framework for Solving Fair
+ $k$-Means and $k$-sparse Wasserstein Barycenter Problems
+
+
+ The fairness of clustering algorithms has gained widespread attention across
+various areas, including machine learning. In this paper, we study fair
+$k$-means clustering in Euclidean space. Given a dataset comprising several
+groups, the fairness constraint requires that each cluster should contain a
+proportion of points from each group within specified lower and upper bounds.
+Due to these fairness constraints, determining the optimal locations of $k$
+centers is a quite challenging task. We propose a novel ``Relax and Merge''
+framework that returns a $(1+4\rho + O(\epsilon))$-approximate solution, where
+$\rho$ is the approximation ratio of an off-the-shelf vanilla $k$-means algorithm
+and $O(\epsilon)$ can be an arbitrarily small positive number. If equipped with
+a PTAS of $k$-means, our solution can achieve an approximation ratio of
+$(5+O(\epsilon))$ with only a slight violation of the fairness constraints,
+which improves the current state-of-the-art approximation guarantee.
+Furthermore, using our framework, we can also obtain a $(1+4\rho
++O(\epsilon))$-approximate solution for the $k$-sparse Wasserstein Barycenter
+problem, which is a fundamental optimization problem in the field of optimal
+transport, and a $(2+6\rho)$-approximate solution for the strictly fair
+$k$-means clustering with no violation, both of which are better than the
+current state-of-the-art methods. In addition, the empirical results
+demonstrate that our proposed algorithm can significantly outperform baseline
+approaches in terms of clustering cost.
+
+
+
+
+
+
+
+ ♻ ☆ Scaling Laws for Task-Optimized Models of the Primate Visual Ventral
+ Stream
+
+
+ When trained on large-scale object classification datasets, certain
+artificial neural network models begin to approximate core object recognition
+(COR) behaviors and neural response patterns in the primate visual ventral
+stream (VVS). While recent machine learning advances suggest that scaling model
+size, dataset size, and compute resources improve task performance, the impact
+of scaling on brain alignment remains unclear. In this study, we explore
+scaling laws for modeling the primate VVS by systematically evaluating over 600
+models trained under controlled conditions on benchmarks spanning V1, V2, V4,
+IT and COR behaviors. We observe that while behavioral alignment continues to
+scale with larger models, neural alignment saturates. This observation remains
+true across model architectures and training datasets, even though models with
+stronger inductive bias and datasets with higher-quality images are more
+compute-efficient. Increased scaling is especially beneficial for higher-level
+visual areas, where small models trained on few samples exhibit only poor
+alignment. Finally, we develop a scaling recipe, indicating that a greater
+proportion of compute should be allocated to data samples over model size. Our
+results suggest that while scaling alone might suffice for alignment with human
+core object recognition behavior, it will not yield improved models of the
+brain's visual ventral stream with current architectures and datasets,
+highlighting the need for novel strategies in building brain-like models.
+
+
+
+ comment: 10 pages for the main paper, 23 pages in total. 7 main figures and 7
+ supplementary figures. Code, model weights, and benchmark results can be
+ accessed at https://github.com/epflneuroailab/scaling-primate-vvs - In
+ version 2, Figure 7 and the related discussion are added, and the appendix is
+ updated
+
+
+
+
+
+
+
+ Anna Van Elst, Debarghya Ghoshdastidar
+
+
+ Contrastive representation learning is a modern paradigm for learning
+representations of unlabeled data via augmentations -- precisely, contrastive
+models learn to embed semantically similar pairs of samples (positive pairs)
+closer than independently drawn samples (negative samples). In spite of its
+empirical success and widespread use in foundation models, statistical theory
+for contrastive learning remains less explored. Recent works have developed
+generalization error bounds for contrastive losses, but the resulting risk
+certificates are either vacuous (certificates based on Rademacher complexity or
+$f$-divergence) or require strong assumptions about samples that are
+unreasonable in practice. The present paper develops non-vacuous PAC-Bayesian
+risk certificates for contrastive representation learning, accounting for the
+practical considerations of the popular SimCLR framework. Notably, we take into
+account that SimCLR reuses positive pairs of augmented data as negative samples
+for other data, thereby inducing strong dependence and making classical PAC or
+PAC-Bayesian bounds inapplicable. We further refine existing bounds on the
+downstream classification loss by incorporating SimCLR-specific factors,
+including data augmentation and temperature scaling, and derive risk
+certificates for the contrastive zero-one risk. The resulting bounds for
+contrastive loss and downstream prediction are much tighter than those of
+previous risk certificates, as demonstrated by experiments on CIFAR-10.
+
+
+
+
+
+
+
+
+ Michelle Halbheer, Dominik J. Mühlematter, Alexander Becker, Dominik Narnhofer, Helge Aasen, Konrad Schindler, Mehmet Ozgur Turkoglu
+
+
+ Numerous crucial tasks in real-world decision-making rely on machine learning
+algorithms with calibrated uncertainty estimates. However, modern methods often
+yield overconfident and uncalibrated predictions. Various approaches involve
+training an ensemble of separate models to quantify the uncertainty related to
+the model itself, known as epistemic uncertainty. In an explicit
+implementation, the ensemble approach has high computational cost and high
+memory requirements. This particular challenge is evident in state-of-the-art
+neural networks such as transformers, where even a single network is already
+demanding in terms of compute and memory. Consequently, efforts are made to
+emulate the ensemble model without actually instantiating separate ensemble
+members, referred to as implicit ensembling. We introduce LoRA-Ensemble, a
+parameter-efficient deep ensemble method for self-attention networks, which is
+based on Low-Rank Adaptation (LoRA). We extend LoRA, initially developed for
+efficient LLM fine-tuning, to an implicit ensembling approach. By employing a
+single pre-trained self-attention network with weights shared across all
+members, we train member-specific low-rank matrices for the attention
+projections. Our method exhibits superior calibration compared to explicit
+ensembles and achieves similar or better accuracy across various prediction
+tasks and datasets.
+
+
+
+ comment: under review
+
+
+
+
+
+
+ ♻ ☆ PDNNet: PDN-Aware GNN-CNN Heterogeneous Network for Dynamic IR Drop
+ Prediction
+
+
+ IR drop on the power delivery network (PDN) is closely related to PDN's
+configuration and cell current consumption. As the integrated circuit (IC)
+design is growing larger, dynamic IR drop simulation becomes computationally
+unaffordable, and machine-learning-based IR drop prediction has been explored as
+a promising solution. Although CNN-based methods have been adapted to the IR drop
+prediction task in several works, the shortcoming of overlooking PDN
+configuration is non-negligible. In this paper, we consider not only how to
+properly represent cell-PDN relation, but also how to model IR drop following
+its physical nature in the feature aggregation procedure. Thus, we propose a
+novel graph structure, PDNGraph, to unify the representations of the PDN
+structure and the fine-grained cell-PDN relation. We further propose a
+dual-branch heterogeneous network, PDNNet, incorporating two parallel GNN-CNN
+branches to favorably capture the above features during the learning process.
+Several key designs are presented to make the dynamic IR drop prediction highly
+effective and interpretable. This is the first work to apply a graph structure to
+deep-learning-based dynamic IR drop prediction. Experiments show that
+PDNNet outperforms the state-of-the-art CNN-based methods and achieves 545x
+speedup compared to the commercial tool, which demonstrates the superiority of
+our method.
+
+
+
+
+
+
+
+ ♻ ☆ R-MTLLMF: Resilient Multi-Task Large Language Model Fusion at the
+ Wireless Edge
+
+
+ Multi-task large language models (MTLLMs) are important for many applications
+at the wireless edge, where users demand specialized models to handle multiple
+tasks efficiently. However, training MTLLMs is complex and exhaustive,
+particularly when tasks are subject to change. Recently, the concept of model
+fusion via task vectors has emerged as an efficient approach for combining
+fine-tuning parameters to produce an MTLLM. In this paper, the problem of
+enabling edge users to collaboratively craft such MTLLMs via task vectors is
+studied, under the assumption of worst-case adversarial attacks. To this end,
+first the influence of adversarial noise on multi-task model fusion is
+investigated and a relationship between the so-called weight disentanglement
+error and the mean squared error (MSE) is derived. Using hypothesis testing, it
+is directly shown that the MSE increases interference between task vectors,
+thereby rendering model fusion ineffective. Then, a novel resilient MTLLM
+fusion (R-MTLLMF) is proposed, which leverages insights about the LLM
+architecture and fine-tuning process to safeguard task vector aggregation under
+adversarial noise by realigning the MTLLM. The proposed R-MTLLMF is then
+compared for both worst-case and ideal transmission scenarios to study the
+impact of the wireless channel. Extensive model fusion experiments with vision
+LLMs demonstrate R-MTLLMF's effectiveness, achieving close-to-baseline
+performance across eight different tasks in ideal noise scenarios and
+significantly outperforming unprotected model fusion in worst-case scenarios.
+The results further advocate for additional physical layer protection for a
+holistic approach to resilience, from both a wireless and LLM perspective.
+
+
+ Transformers are widely used for their ability to capture data relations in
+sequence processing, with great success for a wide range of static tasks.
+However, the computational and memory footprint of their main component, i.e.,
+the Scaled Dot-product Attention, is commonly overlooked. This makes their
+adoption in applications involving stream data processing with constraints in
+response latency, computational and memory resources infeasible. Some works
+have proposed methods to lower the computational cost of transformers, e.g.,
+low-rank approximations, sparsity in attention, and efficient formulations for
+Continual Inference. In this paper, we introduce a new formulation of the
+Scaled Dot-product Attention based on the Nyström approximation that is
+suitable for Continual Inference. In experiments on Online Audio Classification
+and Online Action Detection tasks, the proposed Continual Scaled Dot-product
+Attention can lower the number of operations by up to three orders of magnitude
+compared to the original Transformers while retaining the predictive
+performance of competing models.
+
+
+
+ comment: 11 pages, 7 figures
+
+
+
+
+
+
+ ♻ ☆ Improving Fine-Grained Control via Aggregation of Multiple Diffusion
+ Models
+
+
+
+
+
+
+
+
+ Conghan Yue, Zhengwei Peng, Shiyan Du, Zhi Ji, Chuangjian Cai, Le Wan, Dongyu Zhang
+
+
+ While many diffusion models perform well when controlling a particular
+aspect among style, character, and interaction, they struggle with fine-grained
+control due to dataset limitations and intricate model architecture design.
+This paper introduces a novel algorithm, Aggregation of Multiple Diffusion
+Models (AMDM), which synthesizes features from multiple diffusion models into a
+specified model, activating specific features for fine-grained control.
+Experimental results demonstrate that AMDM significantly improves fine-grained
+control without training, proving its effectiveness. Additionally, it reveals
+that diffusion models initially focus on features such as position, attributes,
+and style, with later stages improving generation quality and consistency. AMDM
+offers a new perspective for tackling the challenges of fine-grained
+conditional control generation in diffusion models: We can fully utilize
+existing or develop new conditional diffusion models that control specific
+aspects, and then aggregate them using the AMDM algorithm. This eliminates the need
+for constructing complex datasets, designing intricate model architectures, and
+incurring high training costs. Code is available at:
+https://github.com/Hammour-steak/AMDM.
+
+
+
+
+
+
+
+ ♻ ☆ Adaptive Optimizers with Sparse Group Lasso for Neural Networks in CTR
+ Prediction ECML
+
+
+ We develop a novel framework that adds the regularizers of the sparse group
+lasso to a family of adaptive optimizers in deep learning, such as Momentum,
+Adagrad, Adam, AMSGrad, and AdaHessian, creating a new class of optimizers
+named Group Momentum, Group Adagrad, Group Adam, Group AMSGrad, Group
+AdaHessian, etc., accordingly. We establish theoretically proven convergence
+guarantees in the stochastic convex settings, based on primal-dual methods. We
+evaluate the regularized effect of our new optimizers on three large-scale
+real-world ad click datasets with state-of-the-art deep learning models. The
+experimental results reveal that compared with the original optimizers with the
+post-processing procedure which uses the magnitude pruning method, the
+performance of the models can be significantly improved on the same sparsity
+level. Furthermore, in comparison to the cases without magnitude pruning, our
+methods can achieve extremely high sparsity with significantly better or highly
+competitive performance. The code is available at
+https://github.com/intelligent-machine-learning/tfplus/tree/main/tfplus.
+
+
+
+ comment: 24 pages. Published as a conference paper at ECML PKDD 2021. This
+ version includes Appendix which was not included in the published version
+ because of page limit
+
+
+
+
+
+
+ ♻ ☆ COOL: Efficient and Reliable Chain-Oriented Objective Logic with Neural
+ Networks Feedback Control for Program Synthesis
+
+
+ Program synthesis methods, whether formal or neural-based, lack fine-grained
+control and flexible modularity, which limits their adaptation to complex
+software development. These limitations stem from rigid Domain-Specific
+Language (DSL) frameworks and incorrect neural network predictions. To this
+end, we propose the Chain of Logic (CoL), which organizes the synthesis process
+into an activity flow and provides heuristic control to guide the process.
+Furthermore, by integrating neural networks with libraries and introducing a
+Neural Network Feedback Control (NNFC) mechanism, our approach modularizes
+synthesis and mitigates the impact of neural network mispredictions.
+Experiments on relational and symbolic synthesis tasks show that CoL
+significantly enhances the efficiency and reliability of DSL program synthesis
+across multiple metrics. Specifically, CoL improves accuracy by 70% while
+reducing tree operations by 91% and time by 95%. Additionally, NNFC further
+boosts accuracy by 6%, with a 64% reduction in tree operations under
+challenging conditions such as insufficient training data, increased
+difficulty, and multidomain synthesis. These improvements confirm COOL as a
+highly efficient and reliable program synthesis framework.
+
+
+
+ comment: 31 pages, 10 figures
+
+
+
+
+
+
+ ♻ ☆ Quality In / Quality Out: Data quality more relevant than model choice
+ in anomaly detection with the UGR'16
+
+
+
+
+
+
+
+
+ José Camacho, Katarzyna Wasielewska, Pablo Espinosa, Marta Fuentes-García
+
+
+ Autonomous or self-driving networks are expected to provide a solution to the
+myriad of extremely demanding new applications with minimal human supervision.
+For this purpose, the community relies on the development of new Machine
+Learning (ML) models and techniques.
+However, ML can only be as good as the data it is fitted with, and data quality
+is an elusive concept difficult to assess. In this paper, we show that
+relatively minor modifications on a benchmark dataset (UGR'16, a flow-based
+real-traffic dataset for anomaly detection) cause significantly more impact on
+model performance than the specific ML technique considered. We also show that
+the measured model performance is uncertain, as a result of labelling
+inaccuracies. Our findings illustrate that the widely adopted approach of
+comparing a set of models in terms of performance results (e.g., in terms of
+accuracy or ROC curves) may lead to incorrect conclusions when done without a
+proper understanding of dataset biases and sensitivity. We contribute a
+methodology to interpret a model response that can be useful for this
+understanding.
+
+
+
+
+
+
+
+ ♻ ☆ Differentially Private Synthetic Data via Foundation Model APIs 1:
+ Images ICLR 2024
+
+
+ Generating differentially private (DP) synthetic data that closely resembles
+the original private data is a scalable way to mitigate privacy concerns in the
+current data-driven world. In contrast to current practices that train
+customized models for this task, we aim to generate DP Synthetic Data via APIs
+(DPSDA), where we treat foundation models as blackboxes and only utilize their
+inference APIs. Such API-based, training-free approaches are easier to deploy
+as exemplified by the recent surge in the number of API-based apps. These
+approaches can also leverage the power of large foundation models which are
+only accessible via their inference APIs. However, this comes with greater
+challenges due to strictly more restrictive model access and the need to
+protect privacy from the API provider.
+ In this paper, we present a new framework called Private Evolution (PE) to
+solve this problem and show its initial promise on synthetic images.
+Surprisingly, PE can match or even outperform state-of-the-art (SOTA) methods
+without any model training. For example, on CIFAR10 (with ImageNet as the
+public data), we achieve FID <= 7.9 with privacy cost {\epsilon} = 0.67,
+significantly improving the previous SOTA from {\epsilon} = 32. We further
+demonstrate the promise of applying PE on large foundation models such as
+Stable Diffusion to tackle challenging private datasets with a small number of
+high-resolution images. The code and data are released at
+https://github.com/microsoft/DPSDA.
+
+
+
+ comment: Published in ICLR 2024
+
+
+
+
+
+
+ ♻ ☆ Representation Alignment for Generation: Training Diffusion Transformers
+ Is Easier Than You Think
+
+
+ Recent studies have shown that the denoising process in (generative)
+diffusion models can induce meaningful (discriminative) representations inside
+the model, though the quality of these representations still lags behind those
+learned through recent self-supervised learning methods. We argue that one main
+bottleneck in training large-scale diffusion models for generation lies in
+effectively learning these representations. Moreover, training can be made
+easier by incorporating high-quality external visual representations, rather
+than relying solely on the diffusion models to learn them independently. We
+study this by introducing a straightforward regularization called
+REPresentation Alignment (REPA), which aligns the projections of noisy input
+hidden states in denoising networks with clean image representations obtained
+from external, pretrained visual encoders. The results are striking: our simple
+strategy yields significant improvements in both training efficiency and
+generation quality when applied to popular diffusion and flow-based
+transformers, such as DiTs and SiTs. For instance, our method can speed up SiT
+training by over 17.5$\times$, matching the performance (without
+classifier-free guidance) of a SiT-XL model trained for 7M steps in less than
+400K steps. In terms of final generation quality, our approach achieves
+state-of-the-art results of FID=1.42 using classifier-free guidance with the
+guidance interval.
+
+
+
+
+
+
+
+ ♻ ☆ Sharpness-Aware Minimization Revisited: Weighted Sharpness as a
+ Regularization Term KDD '23
+
+
+
+
+
+
+
+
+ Yun Yue, Jiadi Jiang, Zhiling Ye, Ning Gao, Yongchao Liu, Ke Zhang
+
+
+ The generalization of Deep Neural Networks (DNNs) is known to be closely related to
+the flatness of minima, leading to the development of Sharpness-Aware
+Minimization (SAM) for seeking flatter minima and better generalization. In
+this paper, we revisit the loss of SAM and propose a more general method,
+called WSAM, by incorporating sharpness as a regularization term. We prove its
+generalization bound through the combination of PAC and Bayes-PAC techniques,
+and evaluate its performance on various public datasets. The results
+demonstrate that WSAM achieves improved generalization, or is at least highly
+competitive, compared to the vanilla optimizer, SAM and its variants. The code
+is available at
+https://github.com/intelligent-machine-learning/atorch/tree/main/atorch/optimizers.
+
+
+
+ comment: 10 pages. Accepted as a conference paper at KDD '23
+
+
+
+
+
+
+ ♻ ☆ Context Matters: Leveraging Contextual Features for Time Series
+ Forecasting
+
+
+
+
+
+
+
+
+ Sameep Chattopadhyay, Pulkit Paliwal, Sai Shankar Narasimhan, Shubhankar Agarwal, Sandeep P. Chinchali
+
+
+ Time series forecasts are often influenced by exogenous contextual features
+in addition to their corresponding history. For example, in financial settings,
+it is hard to accurately predict a stock price without considering public
+sentiments and policy decisions in the form of news articles, tweets, etc.
+Though this is common knowledge, the current state-of-the-art (SOTA)
+forecasting models fail to incorporate such contextual information, owing to
+its heterogeneity and multimodal nature. To address this, we introduce
+ContextFormer, a novel plug-and-play method to surgically integrate multimodal
+contextual information into existing pre-trained forecasting models.
+ContextFormer effectively distills forecast-specific information from rich
+multimodal contexts, including categorical, continuous, time-varying, and even
+textual information, to significantly enhance the performance of existing base
+forecasters. ContextFormer outperforms SOTA forecasting models by up to 30% on
+a range of real-world datasets spanning energy, traffic, environmental, and
+financial domains.
+
+
+
+
+
+
+
+ ♻ ☆ Developing a Thailand solar irradiance map using Himawari-8 satellite
+ imageries and deep learning models
+
+
+ This paper presents an online platform showing a solar irradiance map of Thailand
+every 30 minutes, available at https://www.cusolarforecast.com. The methodology
+for estimating global horizontal irradiance (GHI) across Thailand relies on
+a cloud index extracted from Himawari-8 satellite imagery, the Ineichen clear-sky
+model with locally-tuned Linke turbidity, and machine learning models. The
+methods take clear-sky irradiance, cloud index, re-analyzed GHI and temperature
+data from the MERRA-2 database, and date-time as inputs for GHI estimation
+models, including LightGBM, LSTM, Informer, and Transformer. These are
+benchmarked with the estimate from a commercial service X by evaluation of
+15-minute ground GHI data from 53 ground stations over 1.5 years during
+2022-2023. The results show that the four models exhibit comparable overall MAE
+performance to the service X. The best model is LightGBM with an overall MAE of
+78.58 W/sqm and RMSE of 118.97 W/sqm, while the service X achieves the lowest
+MAE, RMSE, and MBE in cloudy conditions. Obtaining re-analyzed MERRA-2 data for
+the whole of Thailand is not economically feasible for deployment. When these
+features are removed, the Informer model performs best, with an MAE of
+78.67 W/sqm. The obtained performance aligns with existing literature by taking
+the climate zone and time granularity of data into consideration. As the map
+shows an estimate of GHI over 93,000 grids with a frequent update, the paper
+also describes a computational framework for displaying the entire map. It
+tests the runtime performance of deep learning models in the GHI estimation
+process.
+
+
+
+ comment: 23 pages, 14 figures
+
+
+
+
+
+
+ ♻ ☆ HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced
+ Context Awareness and Extrapolation
+
+
+
+
+
+
+
+
+ Yuhan Chen, Ang Lv, Jian Luan, Bin Wang, Wei Liu
+
+
+ Many positional encodings (PEs) are designed to exhibit long-term decay,
+based on an entrenched and long-standing inductive opinion: tokens farther away
+from the current position carry less relevant information. We argue that
+long-term decay is outdated in the era of LLMs, as LLMs are now applied to
+tasks demanding precise retrieval of in-context information from arbitrary
+positions. Firstly, we present empirical analyses on various PEs, demonstrating
+that models inherently learn attention with only a local-decay pattern while
+forming a U-shape pattern globally, contradicting the principle of long-term
+decay. Furthermore, we conduct a detailed analysis of rotary position encoding
+(RoPE, a prevalent relative positional encoding in LLMs), and find that the
+U-shape attention is caused by some learned components, which are also the key
+factor limiting RoPE's expressiveness and extrapolation. Inspired by these
+insights, we propose High-frequency rotary Position Encoding (HoPE). HoPE
+replaces the specific components in RoPE with position-independent ones,
+retaining only high-frequency signals, which also breaks the principle of
+long-term decay in theory. HoPE achieves two major advantages: (1) Without
+constraints imposed by long-term decay, contradictory factors that limit
+spontaneous attention optimization and model extrapolation performance are
+removed. (2) Components representing positions and semantics are optimized.
+These enhance the model's context awareness and extrapolation, as validated by
+extensive experiments.
+
+
+
+
+
+
+
+
+
+
+ Multimedia 3
+
+
+
+
+
+ ☆ Feature Coding in the Era of Large Models: Dataset, Test Conditions, and
+ Benchmark
+
+
+ Large models have achieved remarkable performance across various tasks, yet
+they incur significant computational costs and privacy concerns during both
+training and inference. Distributed deployment has emerged as a potential
+solution, but it necessitates the exchange of intermediate information between
+model segments, with feature representations serving as crucial information
+carriers. To optimize information exchange, feature coding methods are applied
+to reduce transmission and storage overhead. Despite its importance, feature
+coding for large models remains an under-explored area. In this paper, we draw
+attention to large model feature coding and make three contributions to this
+field. First, we introduce a comprehensive dataset encompassing diverse
+features generated by three representative types of large models. Second, we
+establish unified test conditions, enabling standardized evaluation pipelines
+and fair comparisons across future feature coding studies. Third, we introduce
+two baseline methods derived from widely used image coding techniques and
+benchmark their performance on the proposed dataset. These contributions aim to
+advance the field of feature coding, facilitating more efficient large model
+deployment. All source code and the dataset will be made available on GitHub.
+
+
+
+
+
+
+
+ ♻ ☆ Identity-Preserving Text-to-Video Generation by Frequency Decomposition
+
+
+ Identity-preserving text-to-video (IPT2V) generation aims to create
+high-fidelity videos with consistent human identity. It is an important task in
+video generation but remains an open problem for generative models. This paper
+pushes the technical frontier of IPT2V in two directions that have not been
+resolved in the literature: (1) A tuning-free pipeline without tedious case-by-case
+finetuning, and (2) A frequency-aware heuristic identity-preserving DiT-based
+control scheme. We propose ConsisID, a tuning-free DiT-based controllable IPT2V
+model to keep human identity consistent in the generated video. Inspired by
+prior findings in frequency analysis of diffusion transformers, it employs
+identity-control signals in the frequency domain, where facial features can be
+decomposed into low-frequency global features and high-frequency intrinsic
+features. First, from a low-frequency perspective, we introduce a global facial
+extractor, which encodes reference images and facial key points into a latent
+space, generating features enriched with low-frequency information. These
+features are then integrated into shallow layers of the network to alleviate
+training challenges associated with DiT. Second, from a high-frequency
+perspective, we design a local facial extractor to capture high-frequency
+details and inject them into transformer blocks, enhancing the model's ability
+to preserve fine-grained features. We propose a hierarchical training strategy
+to leverage frequency information for identity preservation, transforming a
+vanilla pre-trained video generation model into an IPT2V model. Extensive
+experiments demonstrate that our frequency-aware heuristic scheme provides an
+optimal control solution for DiT-based models. Thanks to this scheme, our
+ConsisID generates high-quality, identity-preserving videos, making strides
+towards more effective IPT2V.
+
+
+
+
+
+
+
+ ♻ ☆ Memories are One-to-Many Mapping Alleviators in Talking Face Generation
+
+
+
+
+
+
+
+
+ Anni Tang, Tianyu He, Xu Tan, Jun Ling, Li Song
+
+
+ Talking face generation aims at generating photo-realistic video portraits of
+a target person driven by input audio. Due to its nature of one-to-many mapping
+from the input audio to the output video (e.g., one speech content may have
+multiple feasible visual appearances), learning a deterministic mapping, as in
+previous works, brings ambiguity during training, and thus causes inferior
+visual results. Although this one-to-many mapping could be alleviated in part
+by a two-stage framework (i.e., an audio-to-expression model followed by a
+neural-rendering model), it is still insufficient since the prediction is
+produced without enough information (e.g., emotions, wrinkles, etc.). In this
+paper, we propose MemFace to complement the missing information with an
+implicit memory and an explicit memory that follow the sense of the two stages
+respectively. More specifically, the implicit memory is employed in the
+audio-to-expression model to capture high-level semantics in the
+audio-expression shared space, while the explicit memory is employed in the
+neural-rendering model to help synthesize pixel-level details. Our experimental
+results show that our proposed MemFace surpasses all the state-of-the-art
+results across multiple scenarios consistently and significantly.
+
+
+
+ comment: IEEE Transactions on Pattern Analysis and Machine Intelligence
+ (2024). Project page: see https://memoryface.github.io
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Computation and Language 97
+
+
+
+
+
+ ☆ From Individual to Society: A Survey on Social Simulation Driven by
+ Large Language Model-based Agents
+
+
+ Traditional sociological research often relies on human participation, which,
+though effective, is expensive, challenging to scale, and raises ethical
+concerns. Recent advancements in large language models (LLMs) highlight their
+potential to simulate human behavior, enabling the replication of individual
+responses and facilitating many interdisciplinary studies. In this
+paper, we conduct a comprehensive survey of this field, illustrating the recent
+progress in simulation driven by LLM-empowered agents. We categorize the
+simulations into three types: (1) Individual Simulation, which mimics specific
+individuals or demographic groups; (2) Scenario Simulation, where multiple
+agents collaborate to achieve goals within specific contexts; and (3) Society
+Simulation, which models interactions within agent societies to reflect the
+complexity and variety of real-world dynamics. These simulations follow a
+progression, ranging from detailed individual modeling to large-scale societal
+phenomena. We provide a detailed discussion of each simulation type, including
+the architecture or key components of the simulation, the classification of
+objectives or scenarios and the evaluation method. Afterward, we summarize
+commonly used datasets and benchmarks. Finally, we discuss the trends across
+these three types of simulation. A repository for the related sources is at
+https://github.com/FudanDISC/SocialAgent.
+
+
+
+
+
+
+
+ ☆ Best-of-N Jailbreaking
+
+
+
+
+
+
+
+
+ John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, Mrinank Sharma
+
+
+ We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that
+jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by
+repeatedly sampling variations of a prompt with a combination of augmentations
+- such as random shuffling or capitalization for textual prompts - until a
+harmful response is elicited. We find that BoN Jailbreaking achieves high
+attack success rates (ASRs) on closed-source language models, such as 89% on
+GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts.
+Further, it is similarly effective at circumventing state-of-the-art
+open-source defenses like circuit breakers. BoN also seamlessly extends to
+other modalities: it jailbreaks vision language models (VLMs) such as GPT-4o
+and audio language models (ALMs) like Gemini 1.5 Pro, using modality-specific
+augmentations. BoN reliably improves when we sample more augmented prompts.
+Across all modalities, ASR, as a function of the number of samples (N),
+empirically follows power-law-like behavior for many orders of magnitude. BoN
+Jailbreaking can also be composed with other black-box algorithms for even more
+effective attacks - combining BoN with an optimized prefix attack achieves up
+to a 35% increase in ASR. Overall, our work indicates that, despite their
+capability, language models are sensitive to seemingly innocuous changes to
+inputs, which attackers can exploit across modalities.
+
+
+
+
+
+
+
+ ☆ Evaluating Gender Bias Transfer between Pre-trained and Prompt-Adapted
+ Language Models
+
+
+ Large language models (LLMs) are increasingly being adapted to achieve
+task-specificity for deployment in real-world decision systems. Several
+previous works have investigated the bias transfer hypothesis (BTH) by studying
+the effect of the fine-tuning adaptation strategy on model fairness to find
+that fairness in pre-trained masked language models has limited effect on the
+fairness of models when adapted using fine-tuning. In this work, we expand the
+study of BTH to causal models under prompt adaptations, as prompting is an
+accessible and compute-efficient way to deploy models in real-world systems.
+In contrast to previous works, we establish that intrinsic biases in
+pre-trained Mistral, Falcon and Llama models are strongly correlated (rho >=
+0.94) with biases when the same models are zero- and few-shot prompted, using a
+pronoun co-reference resolution task. Further, we find that bias transfer
+remains strongly correlated even when LLMs are specifically prompted to exhibit
+fair or biased behavior (rho >= 0.92), and few-shot length and stereotypical
+composition are varied (rho >= 0.97). Our findings highlight the importance of
+ensuring fairness in pre-trained LLMs, especially when they are later used to
+perform downstream tasks via prompt adaptation.
+
+
+
+
+
+
+
+ ☆ A Review on Scientific Knowledge Extraction using Large Language Models
+ in Biomedical Sciences
+
+
+
+
+
+
+
+
+ Gabriel Lino Garcia, João Renato Ribeiro Manesco, Pedro Henrique Paiola, Lucas Miranda, Maria Paola de Salvo, João Paulo Papa
+
+
+ The rapid advancement of large language models (LLMs) has opened new
+boundaries in the extraction and synthesis of medical knowledge, particularly
+within evidence synthesis. This paper reviews the state-of-the-art applications
+of LLMs in the biomedical domain, exploring their effectiveness in automating
+complex tasks such as evidence synthesis and data extraction from a biomedical
+corpus of documents. While LLMs demonstrate remarkable potential, significant
+challenges remain, including issues related to hallucinations, contextual
+understanding, and the ability to generalize across diverse medical tasks. We
+highlight critical gaps in the current research literature, particularly the
+need for unified benchmarks to standardize evaluations and ensure reliability
+in real-world applications. In addition, we propose directions for future
+research, emphasizing the integration of state-of-the-art techniques such as
+retrieval-augmented generation (RAG) to enhance LLM performance in evidence
+synthesis. By addressing these challenges and utilizing the strengths of LLMs,
+we aim to improve access to medical literature and facilitate meaningful
+discoveries in healthcare.
+
+
+ In the rapidly evolving financial sector, the accurate and timely
+interpretation of market news is essential for stakeholders needing to navigate
+unpredictable events. This paper introduces FANAL (Financial Activity News
+Alerting Language Modeling Framework), a specialized BERT-based framework
+engineered for real-time financial event detection and analysis, categorizing
+news into twelve distinct financial categories. FANAL leverages silver-labeled
+data processed through XGBoost and employs advanced fine-tuning techniques,
+alongside ORBERT (Odds Ratio BERT), a novel variant of BERT fine-tuned with
+ORPO (Odds Ratio Preference Optimization) for superior class-wise probability
+calibration and alignment with financial event relevance. We evaluate FANAL's
+performance against leading large language models, including GPT-4o, Llama-3.1
+8B, and Phi-3, demonstrating its superior accuracy and cost efficiency. This
+framework sets a new standard for financial intelligence and responsiveness,
+significantly outstripping existing models in both performance and
+affordability.
+
+
+
+ comment: Accepted for the IEEE International Workshop on Large Language Models
+ for Finance, 2024. This is a preprint version
+
+ Recently, CLIP has emerged as a valuable model for aligning image and text
+information in multi-modal scenarios. However, researchers have observed
+limitations in the ability of CLIP's text and image encoders to extract
+detailed knowledge from caption-image pairs. In response, this paper introduces
+KKLIP, a novel approach designed to enhance the quality of CLIP by
+incorporating a new knowledge distillation (KD) method derived from Llama 2.
+Our method comprises three objectives: Text Embedding Distillation, Concept
+Learning, and Contrastive Learning. Firstly, Text Embedding Distillation
+involves training the KKLIP text encoder to emulate the teacher model, Llama 2.
+Secondly, Concept Learning assigns a soft concept label to each caption-image
+pair through offline k-means clustering of text information from Llama 2,
+allowing KKLIP to learn from these soft concept labels. Finally, Contrastive
+Learning harmonizes text and image embeddings. Our experimental results
+demonstrate that KKLIP enhances the quality of both text and image encoders.
+
+
+
+
+
+
+
+ ☆ YT-30M: A multi-lingual multi-category dataset of YouTube comments
+
+
+ This paper introduces two large-scale multilingual comment datasets, YT-30M
+(and YT-100K) from YouTube. The analysis in this paper is performed on a
+smaller sample (YT-100K) of YT-30M. Both datasets, YT-30M (full) and
+YT-100K (a randomly selected 100K sample from YT-30M), are publicly released for
+further research. YT-30M (YT-100K) contains 32236173 (108694) comments posted
+by YouTube channel that belong to YouTube categories. Each comment is
+associated with a video ID, comment ID, commentor name, commentor channel ID,
+comment text, upvotes, original channel ID and category of the YouTube channel
+(e.g., 'News & Politics', 'Science & Technology', etc.).
+
+
+
+
+
+
+
+ ☆ RedStone: Curating General, Code, Math, and QA Data for Large Language
+ Models
+
+
+ Pre-training Large Language Models (LLMs) on high-quality, meticulously
+curated datasets is widely recognized as critical for enhancing their
+performance and generalization capabilities. This study explores the untapped
+potential of Common Crawl as a comprehensive and flexible resource for
+pre-training LLMs, addressing both general-purpose language understanding and
+specialized domain knowledge. We introduce RedStone, an innovative and scalable
+pipeline engineered to extract and process data from Common Crawl, facilitating
+the creation of extensive and varied pre-training datasets. Unlike traditional
+datasets, which often require expensive curation and domain-specific expertise,
+RedStone leverages the breadth of Common Crawl to deliver datasets tailored to
+a wide array of domains. In this work, we exemplify its capability by
+constructing pre-training datasets across multiple fields, including general
+language understanding, code, mathematics, and question-answering tasks. The
+flexibility of RedStone allows for easy adaptation to other specialized
+domains, significantly lowering the barrier to creating valuable
+domain-specific datasets. Our findings demonstrate that Common Crawl, when
+harnessed through effective pipelines like RedStone, can serve as a rich,
+renewable source of pre-training data, unlocking new avenues for domain
+adaptation and knowledge discovery in LLMs. This work also underscores the
+importance of innovative data acquisition strategies and highlights the role of
+web-scale data as a powerful resource in the continued evolution of LLMs.
+RedStone code and data samples will be publicly available at
+https://aka.ms/redstone.
+
+
+
+
+
+
+
+ ☆ DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for
+ Text-to-Speech with Diverse and Controllable Styles COLING 2025
+
+
+ Human speech exhibits rich and flexible prosodic variations. To address the
+one-to-many mapping problem from text to prosody in a reasonable and flexible
+manner, we propose DiffStyleTTS, a multi-speaker acoustic model based on a
+conditional diffusion module and an improved classifier-free guidance, which
+hierarchically models speech prosodic features, and controls different prosodic
+styles to guide prosody prediction. Experiments show that our method
+outperforms all baselines in naturalness and achieves superior synthesis speed
+compared to three diffusion-based baselines. Additionally, by adjusting the
+guiding scale, DiffStyleTTS effectively controls the guidance intensity of the
+synthetic prosody.
+
+
+
+ comment: COLING 2025
+
+
+
+
+
+
+ ☆ Improving Linguistic Diversity of Large Language Models with Possibility
+ Exploration Fine-Tuning
+
+
+ While Large Language Models (LLMs) have made significant strides in
+replicating human-like abilities, there are concerns about a reduction in the
+linguistic diversity of their outputs. This results in the homogenization of
+viewpoints and perspectives, as well as the underrepresentation of specific
+demographic groups. Although several fine-tuning and prompting techniques have
+been suggested to tackle the issue, they are often tailored to specific tasks
+or come with a substantial increase in computational cost and latency. This
+makes them challenging to apply to applications that demand very low latency,
+such as chatbots and virtual assistants. We propose Possibility Exploration
+Fine-Tuning (PEFT), a task-agnostic framework that enhances the text diversity
+of LLMs without increasing latency or computational cost. Given the same
+prompt, models fine-tuned with PEFT can simultaneously generate multiple
+diverse responses, each corresponding to a controllable possibility number.
+Experiments on dialogue and story generation tasks demonstrate that PEFT
+significantly enhances the diversity of LLM outputs, as evidenced by lower
+similarity between candidate responses. Since PEFT emphasizes semantic
+diversity over lexical diversity, it can also notably reduce demographic bias
+in dialogue systems. The implementations and datasets are available in our
+repository: https://github.com/mailong25/peft_diversity
+
+
+ This paper presents Yankari, a large-scale monolingual dataset for the Yoruba
+language, aimed at addressing the critical gap in Natural Language Processing
+(NLP) resources for this important West African language. Despite being spoken
+by over 30 million people, Yoruba has been severely underrepresented in NLP
+research and applications. We detail our methodology for creating this dataset,
+which includes careful source selection, automated quality control, and
+rigorous data cleaning processes. The Yankari dataset comprises 51,407
+documents from 13 diverse sources, totaling over 30 million tokens. Our
+approach focuses on ethical data collection practices, avoiding problematic
+sources and addressing issues prevalent in existing datasets. We provide
+thorough automated evaluations of the dataset, demonstrating its quality
+compared to existing resources. The Yankari dataset represents a significant
+advancement in Yoruba language resources, providing a foundation for developing
+more accurate NLP models, supporting comparative linguistic studies, and
+contributing to the digital accessibility of the Yoruba language.
+
+
+
+
+
+
+
+
+ Fred Philippy, Siwen Guo, Jacques Klein, Tegawendé F. Bissyandé
+
+
+ Sentence embedding models play a key role in various Natural Language
+Processing tasks, such as in Topic Modeling, Document Clustering and
+Recommendation Systems. However, these models rely heavily on parallel data,
+which can be scarce for many low-resource languages, including Luxembourgish.
+This scarcity results in suboptimal performance of monolingual and
+cross-lingual sentence embedding models for these languages. To address this
+issue, we compile a relatively small but high-quality human-generated
+cross-lingual parallel dataset to train \tool, an enhanced sentence embedding
+model for Luxembourgish with strong cross-lingual capabilities. Additionally,
+we present evidence suggesting that including low-resource languages in
+parallel training datasets can be more advantageous for other low-resource
+languages than relying solely on high-resource language pairs. Furthermore,
+recognizing the lack of sentence embedding benchmarks for low-resource
+languages, we create a paraphrase detection benchmark specifically for
+Luxembourgish, aiming to partially fill this gap and promote further research.
+
+
+
+ comment: Accepted at COLING 2025
+
+
+
+
+
+
+ ☆ Grounded Language Design for Lightweight Diagramming for Formal Methods
+
+
+
+
+
+
+
+
+ Siddhartha Prasad, Ben Greenman, Tim Nelson, Shriram Krishnamurthi
+
+
+ Model finding, as embodied by SAT solvers and similar tools, is used widely,
+both in embedding settings and as a tool in its own right. For instance, tools
+like Alloy target SAT to enable users to incrementally define, explore, verify,
+and diagnose sophisticated specifications for a large number of complex
+systems.
+ These tools critically include a visualizer that lets users graphically
+explore these generated models. As we show, however, default visualizers, which
+know nothing about the domain, are unhelpful and even actively violate
+presentational and cognitive principles. At the other extreme, full-blown
+visualizations require significant effort as well as knowledge a specifier
+might not possess; they can also exhibit bad failure modes (including silent
+failure). Instead, we need a language to capture essential domain information
+for lightweight diagramming. We ground our language design in both the
+cognitive science literature on diagrams and a large number of example
+custom visualizations. This identifies the key elements of lightweight
+diagrams. We distill these into a small set of orthogonal primitives. We extend
+an Alloy-like tool to support these primitives. We evaluate the effectiveness
+of the produced diagrams, finding them good for reasoning. We then compare this
+against many other drawing languages and tools to show that this work defines a
+new niche that is lightweight, effective, and driven by sound principles.
+
+
+
+
+
+
+
+ ☆ Typologie des comportements utilisateurs : étude exploratoire des
+ sessions de recherche complexe sur le Web
+
+
+ In this study, we propose an exploratory approach aiming at a typology of
+user behaviour during a Web search session. We describe a typology based on
+generic IR variables (e.g. number of queries), but also on the study of topics
+(propositions with distinct semantic content defined from the search
+statement). To this end, we gathered experimental data enabling us to study
+variations across users (N=70) for the same task. We performed a
+multidimensional analysis and propose a five-class typology based on the
+individual behaviours during the processing of a complex search task.
+
+
+
+ comment: in French, CORIA (COnférence en Recherche d'Information
+ et Applications), 2024, La Rochelle, France
+
+
+
+
+
+
+ ☆ Global MMLU: Understanding and Addressing Cultural and Linguistic Biases
+ in Multilingual Evaluation
+
+
+
+
+
+
+
+
+ Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David I. Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Wei-Yin Ko, Madeline Smith, Antoine Bosselut, Alice Oh, Andre F. T. Martins, Leshem Choshen, Daphne Ippolito, Enzo Ferrante, Marzieh Fadaee, Beyza Ermis, Sara Hooker
+
+
+ Cultural biases in multilingual datasets pose significant challenges for
+their effectiveness as global benchmarks. These biases stem not only from
+language but also from the cultural knowledge required to interpret questions,
+reducing the practical utility of translated datasets like MMLU. Furthermore,
+translation often introduces artifacts that can distort the meaning or clarity
+of questions in the target language. A common practice in multilingual
+evaluation is to rely on machine-translated evaluation sets, but simply
+translating a dataset is insufficient to address these challenges. In this
+work, we trace the impact of both of these issues on multilingual evaluations
+and ensuing model performances. Our large-scale evaluation of state-of-the-art
+open and proprietary models illustrates that progress on MMLU depends heavily
+on learning Western-centric concepts, with 28% of all questions requiring
+culturally sensitive knowledge. Moreover, for questions requiring geographic
+knowledge, an astounding 84.9% focus on either North American or European
+regions. Rankings of model evaluations change depending on whether they are
+evaluated on the full portion or the subset of questions annotated as
+culturally sensitive, showing the distortion to model rankings when blindly
+relying on translated MMLU. We release Global-MMLU, an improved MMLU with
+evaluation coverage across 42 languages -- with improved overall quality by
+engaging with compensated professional and community annotators to verify
+translation quality while also rigorously evaluating cultural biases present in
+the original dataset. This comprehensive Global-MMLU set also includes
+designated subsets labeled as culturally sensitive and culturally agnostic to
+allow for more holistic, complete evaluation.
+
+
+
+
+
+
+
+ ☆ AntLM: Bridging Causal and Masked Language Models CoNLL
+
+
+
+
+
+
+
+
+ Xinru Yu, Bin Guo, Shiwei Luo, Jie Wang, Tao Ji, Yuanbin Wu
+
+
+ Causal Language Modeling (CLM) and Masked Language Modeling (MLM) are two
+mainstream learning paradigms based on Transformer networks, specifically the
+Decoder-only and Encoder-only architectures. Each paradigm shows a mix of
+advantages and disadvantages on downstream tasks. In the BabyLM Challenge
+2023, although the MLM paradigm achieved the best average
+performance, the CLM paradigm demonstrated significantly faster convergence
+rates. For the BabyLM Challenge 2024, we propose a novel language modeling
+paradigm named AntLM, which integrates both CLM and MLM to leverage
+the advantages of these two classic paradigms. We chose the strict-small track
+and conducted experiments on two foundation models: BabyLlama, representing
+CLM, and LTG-BERT, representing MLM. During the training process for specific
+foundation models, we alternate between applying CLM or MLM training objectives
+and causal or bidirectional attention masks. Experimental results show that
+combining the two pretraining objectives leverages their strengths, enhancing
+overall training performance. Under the same epochs, $AntLM_{BabyLlama}$
+improves Macro-average by 1%, and $AntLM_{LTG-BERT}$ achieves a 2.2% increase
+over the baselines.
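The alternation described in the AntLM abstract can be pictured with a minimal sketch. The abstract does not give the schedule, so the epoch-based switching rule and the names `attention_mask` and `objective_for_epoch` are hypothetical illustrations, not the authors' implementation.

```python
def attention_mask(n, causal):
    """Return an n x n 0/1 mask; row i lists the positions token i may attend to."""
    # Causal (CLM): token i sees positions 0..i. Bidirectional (MLM): sees all.
    return [[1 if (not causal or j <= i) else 0 for j in range(n)]
            for i in range(n)]


def objective_for_epoch(epoch, clm_epochs=1, mlm_epochs=1):
    """Hypothetical schedule: clm_epochs of CLM, then mlm_epochs of MLM, repeated."""
    cycle = clm_epochs + mlm_epochs
    return "clm" if epoch % cycle < clm_epochs else "mlm"


def training_plan(num_epochs, seq_len):
    """Pair each epoch's objective with the matching attention mask."""
    plan = []
    for epoch in range(num_epochs):
        objective = objective_for_epoch(epoch)
        mask = attention_mask(seq_len, causal=(objective == "clm"))
        plan.append((objective, mask))
    return plan
```

Under this assumed schedule, `training_plan(4, 3)` alternates a causal-masked CLM epoch with a fully visible MLM epoch; the ratio actually used in the paper may differ.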
+
+
+
+ comment: CoNLL Shared Task BabyLM Challenge
+
+
+
+
+
+
+ ☆ Intent-driven In-context Learning for Few-shot Dialogue State Tracking
+
+
+ Dialogue state tracking (DST) plays an essential role in task-oriented
+dialogue systems. However, users' input may contain implicit information,
+posing significant challenges for DST tasks. Additionally, DST data includes
+complex information, which not only contains a large amount of noise unrelated
+to the current turn, but also makes constructing DST datasets expensive. To
+address these challenges, we introduce Intent-driven In-context Learning for
+Few-shot DST (IDIC-DST). We propose an
+Intent-driven Dialogue Information Augmentation module that extracts the user's
+intent to augment the dialogue information, so that dialogue states can be
+tracked more effectively. Moreover, we
+mask noisy information from DST data and rewrite the user's input in the
+Intent-driven Examples Retrieval module, where we retrieve similar examples. We
+then utilize a pre-trained large language model to update the dialogue state
+using the augmented dialogue information and examples. Experimental results
+demonstrate that IDIC-DST achieves state-of-the-art performance in few-shot
+settings on MultiWOZ 2.1 and MultiWOZ 2.4 datasets.
+
+
+
+
+
+
+
+ ☆ Alignment at Pre-training! Towards Native Alignment for Arabic LLMs NeurIPS 2024
+
+
+ The alignment of large language models (LLMs) is critical for developing
+effective and safe language models. Traditional approaches focus on aligning
+models during the instruction tuning or reinforcement learning stages, referred
+to in this paper as `post alignment'. We argue that alignment during the
+pre-training phase, which we term `native alignment', warrants investigation.
+Native alignment aims to prevent unaligned content from the beginning, rather
+than relying on post-hoc processing. This approach leverages extensively
+aligned pre-training data to enhance the effectiveness and usability of
+pre-trained models. Our study specifically explores the application of native
+alignment in the context of Arabic LLMs. We conduct comprehensive experiments
+and ablation studies to evaluate the impact of native alignment on model
+performance and alignment stability. Additionally, we release open-source
+Arabic LLMs that demonstrate state-of-the-art performance on various
+benchmarks, providing significant benefits to the Arabic LLM community.
+
+
+
+ comment: Accepted to NeurIPS 2024 main conference. See
+ https://github.com/FreedomIntelligence/AceGPT-v2
+
+
+
+
+
+
+ ☆ AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and
+ Pruning
+
+
+ Large language models (LLMs) have enabled the creation of multi-modal LLMs
+that exhibit strong comprehension of visual data such as images and videos.
+However, these models usually rely on extensive visual tokens from visual
+encoders, leading to high computational demands, which limits their
+applicability in resource-constrained environments and for long-context tasks.
+In this work, we propose a training-free adaptive inference method for
+multi-modal LLMs that can accommodate a broad range of efficiency requirements
+with a minimum performance drop. Our method consists of a) iterative token
+merging based on embedding similarity before LLMs, and b) progressive token
+pruning within LLM layers based on multi-modal importance. With a minimalist
+design, our method can be applied to both video and image LLMs. Extensive
+experiments on diverse video and image benchmarks demonstrate that our method
+substantially reduces computation load (e.g., a 7-fold reduction in
+FLOPs) while preserving the performance of video and image LLMs. Further, under
+a similar computational cost, our method outperforms the state-of-the-art
+methods in long video understanding (e.g., +4.6 on MLVU).
+Additionally, our in-depth analysis provides insights into token redundancy and
+LLM layer behaviors, offering guidance for future research in designing
+efficient multi-modal LLMs. Our code will be available at
+https://github.com/LaVi-Lab/AIM.
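As a rough illustration of step (a) in the AIM abstract, merging tokens by embedding similarity, the sketch below repeatedly averages the most similar adjacent pair until a target count is reached. The adjacency restriction, the averaging rule, and the function names are assumptions made for illustration; the paper's actual merging procedure may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_tokens(tokens, target_len):
    """Greedily merge the most similar adjacent pair until target_len tokens remain."""
    tokens = [list(t) for t in tokens]  # copy so the caller's data is untouched
    while len(tokens) > max(target_len, 1):
        sims = [cosine(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)  # most similar pair
        merged = [(a + b) / 2 for a, b in zip(tokens[i], tokens[i + 1])]
        tokens[i:i + 2] = [merged]  # replace the pair with its average
    return tokens
```

Because nothing is trained, a routine of this kind could in principle be applied before any LLM; the target length is the knob that trades computation against fidelity.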
+
+
+
+ comment: 12 pages, 2 figures
+
+
+
+
+
+
+ ☆ Benchmarking terminology building capabilities of ChatGPT on an
+ English-Russian Fashion Corpus
+
+
+ This paper compares the accuracy of the terms extracted using SketchEngine,
+TBXTools and ChatGPT. In addition, it evaluates the quality of the definitions
+produced by ChatGPT for these terms. The research is carried out on a
+comparable corpus of fashion magazines written in English and Russian collected
+from the web. A gold standard for the fashion terminology was also developed by
+identifying web pages that can be harvested automatically and contain
+definitions of terms from the fashion domain in English and Russian. This gold
+standard was used to evaluate the quality of the extracted terms and of the
+definitions produced. Our evaluation shows that TBXTools and SketchEngine,
+while capable of high recall, suffer from reduced precision as the number of
+terms increases, which affects their overall performance. Conversely, ChatGPT
+demonstrates superior performance, maintaining or improving precision as more
+terms are considered. Analysis of the definitions produced by ChatGPT for 60
+commonly used terms in English and Russian shows that ChatGPT maintains a
+reasonable level of accuracy and fidelity across languages, but sometimes the
+definitions in both languages miss crucial specifics and include unnecessary
+deviations. Our research reveals that no single tool excels universally; each
+has strengths suited to particular aspects of terminology extraction and
+application.
+
+
+
+ comment: To appear in the Proceedings of Translating and the Computer 2024
+ (TC46)
+
+
+
+
+
+
+ ☆ Does Safety Training of LLMs Generalize to Semantically Related Natural
+ Prompts? NeurIPS 2024
+
+
+ Large Language Models (LLMs) are known to be susceptible to crafted
+adversarial attacks or jailbreaks that lead to the generation of objectionable
+content despite being aligned to human preferences using safety fine-tuning
+methods. While the large dimensionality of input token space makes it
+inevitable to find adversarial prompts that can jailbreak these models, we aim
+to evaluate whether safety fine-tuned LLMs are safe against natural prompts
+which are semantically related to toxic seed prompts that elicit safe responses
+after alignment. We surprisingly find that popular aligned LLMs such as GPT-4
+can be compromised using naive prompts that are NOT even crafted with an
+objective of jailbreaking the model. Furthermore, we empirically show that
+given a seed prompt that elicits a toxic response from an unaligned model, one
+can systematically generate several semantically related natural prompts that
+can jailbreak aligned LLMs. Towards this, we propose a method of Response
+Guided Question Augmentation (ReG-QA) to evaluate the generalization of safety
+aligned LLMs to natural prompts, that first generates several toxic answers
+given a seed question using an unaligned LLM (Q to A), and further leverages an
+LLM to generate questions that are likely to produce these answers (A to Q). We
+interestingly find that safety fine-tuned LLMs such as GPT-4o are vulnerable to
+producing natural jailbreak questions from unsafe content (without denial) and
+can thus be used for the latter (A to Q) step. We obtain attack success rates
+that are comparable to or better than leading adversarial attack methods on the
+JailbreakBench leaderboard, while being significantly more stable against
+defenses such as Smooth-LLM and Synonym Substitution, which are effective
+against all existing attacks on the leaderboard.
+
+
+
+ comment: Accepted at the Safe Generative AI Workshop @ NeurIPS 2024
+
+
+
+
+
+
+ ☆ PERL: Pinyin Enhanced Rephrasing Language Model for Chinese ASR N-best
+ Error Correction
+
+
+ ASR correction methods have predominantly focused on general datasets and
+have not effectively utilized Pinyin information, unique to the Chinese
+language. In this study, we address this gap by proposing a Pinyin Enhanced
+Rephrasing Language Model (PERL), specifically designed for N-best correction
+scenarios. Additionally, we implement a length predictor module to address the
+variable-length problem. We conduct experiments on the Aishell-1 dataset and
+our newly proposed DoAD dataset. The results show that our approach outperforms
+baseline methods, achieving a 29.11% reduction in Character Error Rate (CER) on
+Aishell-1 and around 70% CER reduction on domain-specific datasets.
+Furthermore, our approach leverages Pinyin similarity at the token level,
+providing an advantage over baselines and leading to superior performance.
+
+
+
+
+
+
+
+
+ Chanyeol Choi, Junseong Kim, Seolhwa Lee, Jihoon Kwon, Sangmo Gu, Yejin Kim, Minkyung Cho, Jy-yong Sohn
+
+
+ This report explores the enhancement of text retrieval performance using
+advanced data refinement techniques. We develop
+Linq-Embed-Mistral (https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral)
+by building on the E5-mistral and Mistral-7B-v0.1 models, focusing on
+sophisticated data crafting, data filtering, and negative mining methods, which
+are highly tailored to each task, applied to both existing benchmark datasets
+and highly tailored synthetic datasets generated via large language models
+(LLMs). Linq-Embed-Mistral excels in the MTEB benchmarks (as of May 29, 2024),
+achieving an average score of 68.2 across 56 datasets, and ranks 1st among all
+models for retrieval tasks on the MTEB leaderboard with a performance score of
+60.2. This performance underscores its superior capability in enhancing search
+precision and reliability. Our contributions include advanced data refinement
+methods that significantly improve model performance on benchmark and synthetic
+datasets, techniques for homogeneous task ordering and mixed task fine-tuning
+to enhance model generalization and stability, and a streamlined evaluation
+process using 4-bit precision and a light retrieval evaluation set, which
+accelerates validation without sacrificing accuracy.
+
+
+
+ comment: 15 pages
+
+
+
+
+
+
+ ☆ U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills
+ in LLMs
+
+
+
+
+
+
+
+
+ Konstantin Chernyshev, Vitaliy Polshkov, Ekaterina Artemova, Alex Myasnikov, Vlad Stepanov, Alexei Miasnikov, Sergei Tilga
+
+
+ The current evaluation of mathematical skills in LLMs is limited, as existing
+benchmarks are either relatively small, primarily focus on elementary and
+high-school problems, or lack diversity in topics. Additionally, the inclusion
+of visual elements in tasks remains largely under-explored.
+ To address these gaps, we introduce U-MATH, a novel benchmark of 1,100
+unpublished open-ended university-level problems sourced from teaching
+materials. It is balanced across six core subjects, and 20% of the problems
+are multimodal. Given the open-ended nature of U-MATH problems, we employ an LLM to
+judge the correctness of generated solutions. To this end, we release
+$\mu$-MATH, a dataset to evaluate the LLMs' capabilities in judging solutions.
+ The evaluation of general domain, math-specific, and multimodal LLMs
+highlights the challenges presented by U-MATH. Our findings reveal that LLMs
+achieve a maximum accuracy of only 63% on text-based tasks, and an even lower 45%
+on visual problems. The solution assessment proves challenging for LLMs, with
+the best LLM judge having an F1-score of 80% on $\mu$-MATH.
+
+
+
+
+
+
+
+ ☆ Weighted-Reward Preference Optimization for Implicit Model Fusion
+
+
+ While fusing heterogeneous open-source LLMs with varying architectures and
+sizes can potentially integrate the strengths of different models, existing
+fusion methods face significant challenges, such as vocabulary alignment and
+merging distribution matrices. These procedures are not only complex but also
+prone to introducing noise and errors. In this paper, we propose an implicit
+fusion method, Weighted-Reward Preference Optimization (WRPO), which leverages
+preference optimization between the source LLMs and the target LLM to transfer
+their capabilities effectively. WRPO eliminates the need for vocabulary
+alignment and matrix fusion and can be efficiently scaled to accommodate
+various LLMs. To address distributional deviations between the source and
+target LLMs, WRPO introduces a progressive adaptation strategy that gradually
+shifts reliance on preferred examples from the target LLM to the source LLMs.
+Extensive experiments on the MT-Bench, AlpacaEval-2, and Arena-Hard benchmarks
+demonstrate that WRPO consistently outperforms existing knowledge fusion
+methods and various fine-tuning baselines. When applied to LLaMA3-8B-Instruct
+as the target model, WRPO achieves a length-controlled win rate of 55.9%
+against GPT-4-Preview-1106 on AlpacaEval-2 and a win rate of 46.2% against
+GPT-4-0314 on Arena-Hard. Our code is available at
+https://github.com/SLIT-AI/WRPO.
+
+
+
+ comment: Work in progress
+
+
+
+
+
+
+ ☆ Automatic detection of diseases in Spanish clinical notes combining
+ medical language models and ontologies
+
+
+ In this paper we present a hybrid method for the automatic detection of
+dermatological pathologies in medical reports. We use a large language model
+combined with medical ontologies to predict, given a first appointment or
+follow-up medical report, the pathology a person may suffer from. The results
+show that teaching the model to learn the type, severity and location on the
+body of a dermatological pathology, as well as in which order it has to learn
+these three features, significantly increases its accuracy. The article
+demonstrates state-of-the-art results for the classification of
+medical texts, with a precision of 0.84 and micro and macro F1-scores of 0.82 and
+0.75, respectively, and makes both the method and the data set used available to the
+community.
+
+
+
+ comment: Translation of SEPLN 2024 es paper
+
+
+
+
+
+
+ ☆ Byte BPE Tokenization as an Inverse string Homomorphism
+
+
+
+
+
+
+
+
+ Saibo Geng, Sankalp Gambhir, Chris Wendler, Robert West
+
+
+ Tokenization is an important preprocessing step in the training and inference
+of large language models (LLMs). While there has been extensive research on the
+expressive power of the neural architectures used in LLMs, the impact of
+tokenization has not been well understood. In this work, we demonstrate that
+tokenization, irrespective of the algorithm used, acts as an inverse
+homomorphism between strings and tokens. This suggests that the character space
+of the source language and the token space of the tokenized language are
+homomorphic, preserving the structural properties of the source language.
+Additionally, we explore the concept of proper tokenization, which refers to an
+unambiguous tokenization returned from the tokenizer. Our analysis reveals that
+the expressiveness of neural architectures in recognizing context-free
+languages is not affected by tokenization.
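The homomorphism claim in the tokenization abstract can be made concrete with a toy example: detokenization maps concatenated token sequences to concatenated byte strings, and tokenization picks one preimage. The vocabulary below is invented, and the greedy longest-match tokenizer is a simple stand-in for BPE, so this is a sketch of the idea rather than the paper's construction.

```python
# Hypothetical toy vocabulary: token id -> byte string.
VOCAB = {0: b"a", 1: b"b", 2: b"ab", 3: b"ba", 4: b"c"}


def decode(tokens):
    """Detokenize: concatenate the byte strings of the tokens."""
    return b"".join(VOCAB[t] for t in tokens)


def tokenize(data):
    """Greedy longest-match tokenizer (a stand-in for BPE, not BPE itself)."""
    by_bytes = {v: k for k, v in VOCAB.items()}
    max_len = max(len(v) for v in VOCAB.values())
    out, i = [], 0
    while i < len(data):
        # Try the longest candidate piece first, then shorter ones.
        for n in range(min(max_len, len(data) - i), 0, -1):
            piece = data[i:i + n]
            if piece in by_bytes:
                out.append(by_bytes[piece])
                i += n
                break
        else:
            raise ValueError(f"no token covers byte offset {i}")
    return out


def is_homomorphic(xs, ys):
    """Check decode(xs + ys) == decode(xs) + decode(ys) for two token lists."""
    return decode(xs + ys) == decode(xs) + decode(ys)
```

Here `decode([0, 1])` and `decode([2])` both yield `b"ab"`, so `decode` is not injective. That ambiguity is why the abstract singles out the proper tokenization, the one sequence the tokenizer itself returns.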
+
+
+ Few-shot image classification (FSIC) aims to recognize novel classes given few
+labeled images from base classes. Recent works have achieved promising
+classification performance, especially for metric-learning methods, where a
+measure at only image feature level is usually used. In this paper, we argue
+that measure at such a level may not be effective enough to generalize from
+base to novel classes when using only a few images. Instead, a multi-level
+descriptor of an image is taken into consideration in this paper. We propose a
+multi-level correlation network (MLCN) for FSIC to tackle this problem by
+effectively capturing local information. Concretely, we present the
+self-correlation module and cross-correlation module to learn the semantic
+correspondence relation of local information based on learned representations.
+Moreover, we propose a pattern-correlation module to capture the pattern of
+fine-grained images and find relevant structural patterns between base classes
+and novel classes. Extensive experiments and analysis show the effectiveness of
+our proposed method on four widely-used FSIC benchmarks. The code for our
+approach is available at: https://github.com/Yunkai696/MLCN.
+
+
+
+
+
+
+
+ ☆ A Measure of the System Dependence of Automated Metrics
+
+
+
+
+
+
+
+
+ Pius von Däniken, Jan Deriu, Mark Cieliebak
+
+
+ Automated metrics for Machine Translation have made significant progress,
+with the goal of replacing expensive and time-consuming human evaluations.
+These metrics are typically assessed by their correlation with human judgments,
+which captures the monotonic relationship between human and metric scores.
+However, we argue that it is equally important to ensure that metrics treat all
+systems fairly and consistently. In this paper, we introduce a method to
+evaluate this aspect.
+
+
+
+
+
+
+
+ ☆ Fine-Grained Behavior Simulation with Role-Playing Large Language Model
+ on Social Media
+
+
+
+
+
+
+
+
+ Kun Li, Chenwei Dai, Wei Zhou, Songlin Hu
+
+
+ Large language models (LLMs) have demonstrated impressive capabilities in
+role-playing tasks. However, there is limited research on whether LLMs can
+accurately simulate user behavior in real-world scenarios, such as social
+media. This requires models to effectively analyze a user's history and
+simulate their role. In this paper, we introduce FineRob, a novel
+fine-grained behavior simulation dataset. We collect the complete behavioral
+history of 1,866 distinct users across three social media platforms. Each
+behavior is decomposed into three fine-grained elements: object, type, and
+content, resulting in 78.6k QA records. Based on FineRob, we identify two
+dominant reasoning patterns in LLMs' behavior simulation processes and propose
+the OM-CoT fine-tuning method to enhance the capability. Through
+comprehensive experiments, we conduct an in-depth analysis of key factors of
+behavior simulation and also demonstrate the effectiveness of the OM-CoT
+approach. Code and dataset are available at
+https://github.com/linkseed18612254945/FineRob.
+
+
+
+
+
+
+
+ ☆ A surprisal oracle for when every layer counts
+
+
+ Active Curriculum Language Modeling (ACLM; Hong et al., 2023) is a
+learner-directed approach to training a language model. We proposed the original
+version of this process in our submission to the BabyLM 2023 task, and now we
+propose an updated ACLM process for the BabyLM 2024 task. ACLM involves an
+iteratively- and dynamically-constructed curriculum informed over the training
+process by a model of uncertainty; other training items that are similarly
+uncertain to a least certain candidate item are prioritized. Our new process
+improves the similarity model so that it is more dynamic, and we run ACLM over
+the most successful model from the BabyLM 2023 task: ELC-BERT (Charpentier and
+Samuel, 2023). We find that while our models underperform on fine-grained
+grammatical inferences, they outperform the BabyLM 2024 official baselines on
+common-sense and world-knowledge tasks. We make our code available at
+https://github.com/asayeed/ActiveBaby.
+
+
+
+
+
+
+
+ ☆ TOOL-ED: Enhancing Empathetic Response Generation with the Tool Calling
+ Capability of LLM
+
+
+ Empathetic conversation is a crucial characteristic in daily conversations
+between individuals. Nowadays, Large Language Models (LLMs) have shown
+outstanding performance in generating empathetic responses. Knowledge bases
+like COMET can assist LLMs in mitigating illusions and enhancing the
+understanding of users' intentions and emotions. However, models remain heavily
+reliant on fixed knowledge bases and unrestricted incorporation of external
+knowledge can introduce noise. Tool learning is a flexible end-to-end approach
+that assists LLMs in handling complex problems. In this paper, we propose
+the Emotional Knowledge Tool Calling (EKTC) framework, which encapsulates the
+commonsense knowledge bases as empathetic tools, enabling LLMs to integrate
+external knowledge flexibly through tool calling. In order to adapt the models
+to the new task, we construct a novel dataset, TOOL-ED, based on the
+EMPATHETIC DIALOGUE (ED) dataset. We validate EKTC on the ED dataset,
+and the experimental results demonstrate that our framework can enhance the
+ability of LLMs to generate empathetic responses effectively.
+
+
+
+
+
+
+
+ ☆ Revolve: Optimizing AI Systems by Tracking Response Evolution in Textual
+ Optimization
+
+
+ Recent advancements in large language models (LLMs) have significantly
+enhanced the ability of LLM-based systems to perform complex tasks through
+natural language processing and tool interaction. However, optimizing these
+LLM-based systems for specific tasks remains challenging, often requiring
+manual interventions like prompt engineering and hyperparameter tuning.
+Existing automatic optimization methods, such as textual feedback-based
+techniques (e.g., TextGrad), tend to focus on immediate feedback, analogous to
+using immediate derivatives in traditional numerical gradient descent. However,
+relying solely on such feedback can be limited when the adjustments made in
+response to this feedback are either too small or fluctuate irregularly,
+potentially slowing down or even stalling the optimization process. To overcome
+these challenges, more adaptive methods are needed, especially in situations
+where the system's response is evolving slowly or unpredictably. In this paper,
+we introduce REVOLVE, an optimization method that tracks how "R"esponses
+"EVOLVE" across iterations in LLM systems. By focusing on the evolution of
+responses over time, REVOLVE enables more stable and effective optimization by
+making thoughtful, progressive adjustments at each step. Experimental results
+demonstrate that REVOLVE outperforms competitive baselines, achieving a 7.8%
+improvement in prompt optimization, a 20.72% gain in solution refinement, and a
+29.17% increase in code optimization. Additionally, REVOLVE converges in fewer
+iterations, resulting in significant computational savings. These advantages
+highlight its adaptability and efficiency, positioning REVOLVE as a valuable
+tool for optimizing LLM-based systems and accelerating the development of
+next-generation AI technologies. Code is available at:
+https://github.com/Peiyance/REVOLVE.
+
+
+
+ comment: 20 pages, 2 figures
+
+
+
+
+
+
+ ☆ ASR-EC Benchmark: Evaluating Large Language Models on Chinese ASR Error
+ Correction
+
+
+
+
+
+
+
+
+ Victor Junqiu Wei, Weicheng Wang, Di Jiang, Yuanfeng Song, Lu Wang
+
+
+ Automatic Speech Recognition (ASR) is a fundamental and important task in the
+field of speech and natural language processing. It is an inherent building
+block in many applications such as voice assistants and speech translation.
+Despite the advancement of ASR technologies in recent years, modern ASR systems
+still inevitably produce a substantial number of recognition errors due to
+environmental noise, ambiguity, etc. Therefore, error correction in ASR is
+crucial.
+ Motivated by this, this paper studies ASR error correction in the Chinese
+language, which is one of the most widely spoken languages in the world. We
+first create a benchmark dataset named ASR-EC that contains a wide spectrum of
+ASR errors generated by industry-grade ASR systems. To the best of our
+knowledge, it is the first Chinese ASR error correction benchmark. Then,
+inspired by the recent advances in large language models (LLMs), we
+investigate how to harness the power of LLMs to
+correct ASR errors. We apply LLMs to ASR error correction in three paradigms.
+The first paradigm is prompting, which is further categorized as zero-shot,
+few-shot, and multi-step. The second paradigm is finetuning, which finetunes
+LLMs with ASR error correction data. The third paradigm is multi-modal
+augmentation, which collectively utilizes the audio and ASR transcripts for
+error correction. Extensive experiments reveal that prompting is not effective
+for ASR error correction. Finetuning is effective only for a portion of LLMs.
+Multi-modal augmentation is the most effective method for error correction and
+achieves state-of-the-art performance.
+
+
+
+
+
+
+
+ ☆ Analytic Study of Text-Free Speech Synthesis for Raw Audio using a
+ Self-Supervised Learning Model SC 2024
+
+
+ We examine the text-free speech representations of raw audio obtained from a
+self-supervised learning (SSL) model by analyzing the synthesized speech using
+the SSL representations instead of conventional text representations. Since raw
+audio does not have paired speech representations as transcribed texts do,
+obtaining speech representations from unpaired speech is crucial for augmenting
+available datasets for speech synthesis. Specifically, the proposed speech
+synthesis is conducted using discrete symbol representations from the SSL model
+in comparison with text representations, and analytical examinations of the
+synthesized speech have been carried out. The results empirically show that
+using text representations is advantageous for preserving semantic information,
+while using discrete symbol representations is superior for preserving acoustic
+content, including prosodic and intonational information.
+
+
+
+ comment: APSIPA ASC 2024
+
+
+
+
+
+
+ ☆ Human Variability vs. Machine Consistency: A Linguistic Analysis of
+ Texts Generated by Humans and Large Language Models
+
+
+ The rapid advancements in large language models (LLMs) have significantly
+improved their ability to generate natural language, making texts generated by
+LLMs increasingly indistinguishable from human-written texts. Recent research
+has predominantly focused on using LLMs to classify text as either
+human-written or machine-generated. In our study, we adopt a different approach
+by profiling texts spanning four domains based on 250 distinct linguistic
+features. We select the M4 dataset from the Subtask B of SemEval 2024 Task 8.
+We automatically calculate various linguistic features with the LFTK tool and
+additionally measure the average syntactic depth, semantic similarity, and
+emotional content for each document. We then apply a two-dimensional PCA
+reduction to all the calculated features. Our analyses reveal significant
+differences between human-written texts and those generated by LLMs,
+particularly in the variability of these features, which we find to be
+considerably higher in human-written texts. This discrepancy is especially
+evident in text genres with less rigid linguistic style constraints. Our
+findings indicate that humans write texts that are less cognitively demanding,
+with higher semantic content, and richer emotional content compared to texts
+generated by LLMs. These insights underscore the need for incorporating
+meaningful linguistic features to enhance the understanding of textual outputs
+of LLMs.
+
+
+
+
+
+
+
+ ☆ Advancing Conversational Psychotherapy: Integrating Privacy,
+ Dual-Memory, and Domain Expertise with Large Language Models NeurIPS 2024
+
+
+ Mental health has increasingly become a global issue that reveals the
+limitations of traditional conversational psychotherapy, constrained by
+location, time, expense, and privacy concerns. In response to these challenges,
+we introduce SoulSpeak, a Large Language Model (LLM)-enabled chatbot designed
+to democratize access to psychotherapy. SoulSpeak improves upon the
+capabilities of standard LLM-enabled chatbots by incorporating a novel
+dual-memory component that combines short-term and long-term context via
+Retrieval Augmented Generation (RAG) to offer personalized responses while
+ensuring the preservation of user privacy and intimacy through a dedicated
+privacy module. In addition, it leverages a counseling chat dataset of
+therapist-client interactions and various prompting techniques to align the
+generated responses with psychotherapeutic methods. We introduce two fine-tuned
+BERT models to evaluate the system against existing LLMs and human therapists:
+the Conversational Psychotherapy Preference Model (CPPM) to simulate human
+preference among responses and another to assess response relevance to user
+input. CPPM is useful for training and evaluating psychotherapy-focused
+language models independent from SoulSpeak, helping with the constrained
+resources available for psychotherapy. Furthermore, the effectiveness of the
+dual-memory component and the robustness of the privacy module are also
+examined. Our findings highlight the potential and challenge of enhancing
+mental health care by offering an alternative that combines the expertise of
+traditional therapy with the advantages of LLMs, providing a promising way to
+address the accessibility and personalization gap in current mental health
+services.
+
+
+
+ comment: Accepted as a Poster at Statistical Foundations of LLMs and
+ Foundation Models (NeurIPS 2024 Workshop)
+
+
+
+
+
+
+ ☆ Surveying the Effects of Quality, Diversity, and Complexity in Synthetic
+ Data From Large Language Models
+
+
+
+
+
+
+
+
+ Alex Havrilla, Andrew Dai, Laura O'Mahony, Koen Oostermeijer, Vera Zisler, Alon Albalak, Fabrizio Milo, Sharath Chandra Raparthy, Kanishk Gandhi, Baber Abbasi, Duy Phung, Maia Iyer, Dakota Mahan, Chase Blagden, Srishti Gureja, Mohammed Hamdy, Wen-Ding Li, Giovanni Paolini, Pawan Sasanka Ammanamanchi, Elliot Meyerson
+
+
+ Synthetic data generation with Large Language Models is a promising paradigm
+for augmenting natural data over a nearly infinite range of tasks. Given this
+variety, direct comparisons among synthetic data generation algorithms are
+scarce, making it difficult to understand where improvement comes from and what
+bottlenecks exist. We propose to evaluate algorithms via the makeup of
+synthetic data generated by each algorithm in terms of data quality, diversity,
+and complexity. We choose these three characteristics for their significance in
+open-ended processes and the impact each has on the capabilities of downstream
+models. We find quality to be essential for in-distribution model
+generalization, diversity to be essential for out-of-distribution
+generalization, and complexity to be beneficial for both. Further, we emphasize
+the existence of Quality-Diversity trade-offs in training data and the
+downstream effects on model performance. We then examine the effect of various
+components in the synthetic data pipeline on each data characteristic. This
+examination allows us to taxonomize and compare synthetic data generation
+algorithms through the components they utilize and the resulting effects on
+data QDC composition. This analysis extends into a discussion on the importance
+of balancing QDC in synthetic data for efficient reinforcement learning and
+self-improvement algorithms. Analogous to the QD trade-offs in training data,
+often there exist trade-offs between model output quality and output diversity
+which impact the composition of synthetic data. We observe that many models are
+currently evaluated and optimized only for output quality, thereby limiting
+output diversity and the potential for self-improvement. We argue that
+balancing these trade-offs is essential to the development of future
+self-improvement algorithms and highlight a number of works making progress in
+this direction.
+
+
+
+
+
+
+
+ ☆ Curriculum-style Data Augmentation for LLM-based Metaphor Detection
+
+
+ Recently, utilizing large language models (LLMs) for metaphor detection has
+achieved promising results. However, these methods heavily rely on the
+capabilities of closed-source LLMs, which come with relatively high inference
+costs and latency. To address this, we propose a method for metaphor detection
+by fine-tuning open-source LLMs, effectively reducing inference costs and
+latency with a single inference step. Furthermore, metaphor detection suffers
+from a severe data scarcity problem, which hinders effective fine-tuning of
+LLMs. To tackle this, we introduce Curriculum-style Data Augmentation (CDA).
+Specifically, before fine-tuning, we evaluate the training data to identify
+correctly predicted instances for fine-tuning, while incorrectly predicted
+instances are used as seed data for data augmentation. This approach enables
+the model to quickly learn simpler knowledge and progressively acquire more
+complex knowledge, thereby improving performance incrementally. Experimental
+results demonstrate that our method achieves state-of-the-art performance,
+outperforming all baselines. Additionally, we provide detailed ablation studies to
+validate the effectiveness of CDA.
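The data split that CDA performs before fine-tuning can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `split_for_cda` and `keyword_predict` are hypothetical names, and the keyword rule merely stands in for a call to the open-source LLM being fine-tuned.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (sentence, label); 1 = metaphorical, 0 = literal


def split_for_cda(
    examples: List[Example], predict: Callable[[str], int]
) -> Tuple[List[Example], List[Example]]:
    """Partition training data by whether the current model predicts it correctly."""
    fine_tune_set: List[Example] = []
    augmentation_seeds: List[Example] = []
    for sentence, label in examples:
        if predict(sentence) == label:
            # Correctly predicted: used directly for fine-tuning.
            fine_tune_set.append((sentence, label))
        else:
            # Incorrectly predicted: kept as seed data for augmentation.
            augmentation_seeds.append((sentence, label))
    return fine_tune_set, augmentation_seeds


def keyword_predict(sentence: str) -> int:
    # Hypothetical stand-in for an LLM call: flags any sentence containing "sea of".
    return 1 if "sea of" in sentence else 0


train = [
    ("He drowned in a sea of paperwork.", 1),
    ("The boat crossed the sea of Japan.", 0),
    ("Her words cut deeper than a knife.", 1),
]
easy, seeds = split_for_cda(train, keyword_predict)
```

Here `easy` holds the single instance the stand-in model gets right, and `seeds` holds the two it gets wrong; how the seeds are then expanded into new training data is not specified in the abstract and is left out of the sketch.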
+
+
+
+
+
+
+
+
+ Yuntao Shou, Tao Meng, Wei Ai, Keqin Li
+
+
+ Multimodal emotion recognition in conversation (MERC) refers to identifying
+and classifying human emotional states by combining data from multiple
+different modalities (e.g., audio, images, text, and video). Most existing
+multimodal emotion recognition methods use GCN to improve performance, but
+existing GCN methods are prone to overfitting and cannot capture the temporal
+dependency of the speaker's emotions. To address the above problems, we propose
+a Dynamic Graph Neural Ordinary Differential Equation Network (DGODE) for MERC,
+which combines the dynamic changes of emotions to capture the temporal
+dependency of speakers' emotions, and effectively alleviates the overfitting
+problem of GCNs. Technically, the key idea of DGODE is to utilize an adaptive
+mixhop mechanism to improve the generalization ability of GCNs and use the
+graph ODE evolution network to characterize the continuous dynamics of node
+representations over time and capture temporal dependencies. Extensive
+experiments on two publicly available multimodal emotion recognition datasets
+demonstrate that the proposed DGODE model has superior performance compared to
+various baselines. Furthermore, the proposed DGODE can also alleviate the
+over-smoothing problem, thereby enabling the construction of a deep GCN
+network.
+
+
+
+ comment: 13 pages, 6 figures
+
+
+
+
+
+
+ ☆ WithdrarXiv: A Large-Scale Dataset for Retraction Study
+
+
+
+
+
+
+
+
+ Delip Rao, Jonathan Young, Thomas Dietterich, Chris Callison-Burch
+
+
+ Retractions play a vital role in maintaining scientific integrity, yet
+systematic studies of retractions in computer science and other STEM fields
+remain scarce. We present WithdrarXiv, the first large-scale dataset of
+withdrawn papers from arXiv, containing over 14,000 papers and their associated
+retraction comments spanning the repository's entire history through September
+2024. Through careful analysis of author comments, we develop a comprehensive
+taxonomy of retraction reasons, identifying 10 distinct categories ranging from
+critical errors to policy violations. We demonstrate a simple yet highly
+accurate zero-shot automatic categorization of retraction reasons, achieving a
+weighted average F1-score of 0.96. Additionally, we release WithdrarXiv-SciFy,
+an enriched version including scripts for parsed full-text PDFs, specifically
+designed to enable research in scientific feasibility studies, claim
+verification, and automated theorem proving. These findings provide valuable
+insights for improving scientific quality control and automated verification
+systems. Finally, and most importantly, we discuss ethical issues and take a
+number of steps to implement responsible data release while fostering open
+science in this area.
+
+
+
+ comment: 11 pages, 5 figures
+
+
+
+
+
+
+ ☆ Language Model Meets Prototypes: Towards Interpretable Text
+ Classification Models through Prototypical Networks AAAI25
+
+
+ Pretrained transformer-based Language Models (LMs) are well-known for their
+ability to achieve significant improvement on NLP tasks, but their black-box
+nature, which leads to a lack of interpretability, has been a major concern. My
+dissertation focuses on developing intrinsically interpretable models when
+using LMs as encoders while maintaining their superior performance via
+prototypical networks. I initiated my research by investigating enhancements in
+performance for interpretable models of sarcasm detection. My proposed approach
+focuses on capturing sentiment incongruity to enhance accuracy while offering
+instance-based explanations for the classification decisions. Later, I
+developed a novel white-box multi-head graph attention-based prototype network
+designed to explain the decisions of text classification models without
+sacrificing the accuracy of the original black-box LMs. In addition, I am
+working on extending the attention-based prototype network with contrastive
+learning to redesign an interpretable graph neural network, aiming to enhance
+both the interpretability and performance of the model in document
+classification.
+
+
+
+ comment: 2 pages, 1 figure, accepted by AAAI25 DC
+
+
+
+
+
+
+
+ Dewang Sultania, Zhaoyu Lu, Twisha Naik, Franck Dernoncourt, David Seunghyun Yoon, Sanat Sharma, Trung Bui, Ashok Gupta, Tushar Vatsa, Suhas Suresha, Ishita Verma, Vibha Belavadi, Cheng Chen, Michael Friedrich
+
+
+ Domain specific question answering is an evolving field that requires
+specialized solutions to address unique challenges. In this paper, we show that
+a hybrid approach combining a fine-tuned dense retriever with keyword based
+sparse search methods significantly enhances performance. Our system leverages
+a linear combination of relevance signals, including cosine similarity from
+dense retrieval, BM25 scores, and URL host matching, each with tunable boost
+parameters. Experimental results indicate that this hybrid method outperforms
+our single-retriever system, achieving improved accuracy while maintaining
+robust contextual grounding. These findings suggest that integrating multiple
+retrieval methodologies with weighted scoring effectively addresses the
+complexities of domain specific question answering in enterprise settings.
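The linear combination of relevance signals described above can be sketched as follows. This is a hedged illustration only: the function names, the default boost values, and the binary treatment of URL host matching are assumptions, not details taken from the paper.

```python
from urllib.parse import urlparse


def hybrid_score(cosine_sim, bm25, url, preferred_hosts,
                 w_dense=1.0, w_sparse=0.1, w_host=0.5):
    """Weighted sum of dense similarity, BM25 score, and a URL host match flag.

    The three weights play the role of tunable boost parameters; their
    default values here are arbitrary placeholders.
    """
    host = urlparse(url).netloc.lower()
    host_match = 1.0 if host in preferred_hosts else 0.0
    return w_dense * cosine_sim + w_sparse * bm25 + w_host * host_match


def rank(candidates, preferred_hosts, **weights):
    """Sort candidate documents by descending hybrid score."""
    return sorted(
        candidates,
        key=lambda c: hybrid_score(c["cosine"], c["bm25"], c["url"],
                                   preferred_hosts, **weights),
        reverse=True,
    )


candidates = [
    {"id": "A", "cosine": 0.9, "bm25": 1.0, "url": "https://blog.example.org/a"},
    {"id": "B", "cosine": 0.7, "bm25": 2.0, "url": "https://docs.example.com/b"},
]
ranked = rank(candidates, preferred_hosts={"docs.example.com"})
```

With the placeholder weights, document B overtakes A because of its host match and higher BM25 score; setting `w_host=0` reverses the order, which is the kind of behavior the boost parameters are meant to control.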
+
+
+
+
+
+
+
+ ☆ From Language Models over Tokens to Language Models over Characters
+
+
+
+
+
+
+
+
+ Tim Vieira, Ben LeBrun, Mario Giulianelli, Juan Luis Gastaldi, Brian DuSell, John Terilla, Timothy J. O'Donnell, Ryan Cotterell
+
+
+ Modern language models are internally -- and mathematically -- distributions
+over token strings rather than character strings, posing numerous
+challenges for programmers building user applications on top of them. For
+example, if a prompt is specified as a character string, it must be tokenized
+before passing it to the token-level language model. Thus, the tokenizer and
+consequent analyses are very sensitive to the specification of the prompt
+(e.g., if the prompt ends with a space or not). This paper presents algorithms
+for converting token-level language models to character-level ones. We present
+both exact and approximate algorithms. In the empirical portion of the paper,
+we benchmark the practical runtime and approximation quality. We find that --
+even with a small computation budget -- our method is able to accurately
+approximate the character-level distribution (less than 0.00021 excess bits /
+character) at reasonably fast speeds (46.3 characters / second) on the Llama
+3.1 8B language model.
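To see why a token-level model and a character-level model can disagree, consider a toy case. The sketch below is not one of the paper's algorithms: it assumes a context-free token model (each token, or end-of-string, drawn independently with fixed probability) and assumes that the probability of a character string is the total probability of all token sequences that spell it. The vocabulary and probabilities are invented for illustration.

```python
def char_string_prob(text, token_probs, eos_prob):
    """Probability that the toy token model emits exactly `text`, then stops.

    Sums over every tokenization of `text` by dynamic programming.
    Tokens are assumed to be non-empty strings.
    """
    # prefix_mass[i] = total probability of all token sequences spelling text[:i]
    prefix_mass = [0.0] * (len(text) + 1)
    prefix_mass[0] = 1.0
    for end in range(1, len(text) + 1):
        for token, p in token_probs.items():
            start = end - len(token)
            if start >= 0 and text[start:end] == token:
                prefix_mass[end] += prefix_mass[start] * p
    return prefix_mass[-1] * eos_prob


# Hypothetical vocabulary; the probabilities plus eos_prob sum to 1.
token_probs = {"a": 0.3, "b": 0.2, "ab": 0.3}
eos_prob = 0.2

all_tokenizations = char_string_prob("ab", token_probs, eos_prob)
single_token_only = token_probs["ab"] * eos_prob
```

In this toy model the string "ab" can be produced as the single token `ab` or as `a` followed by `b`, so `all_tokenizations` is larger than `single_token_only`. Scoring only one tokenization therefore underestimates the character-level probability.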
+
+
+
+
+
+
+
+ ☆ Scaling Inference-Time Search with Vision Value Model for Improved
+ Visual Comprehension
+
+
+
+
+
+
+
+
+ Wang Xiyao, Yang Zhengyuan, Li Linjie, Lu Hongjin, Xu Yuancheng, Lin Chung-Ching Lin, Lin Kevin, Huang Furong, Wang Lijuan
+
+
+ Despite significant advancements in vision-language models (VLMs), effective
+approaches to enhance response quality by scaling inference-time computation
+are still lacking. This capability is known to be a core step towards
+self-improving models in recent large language model studies. In this paper,
+we present the Vision Value Model (VisVM), which can guide VLM inference-time
+search to generate responses with better visual comprehension. Specifically,
+VisVM not only evaluates the generated sentence quality in the current search
+step, but also anticipates the quality of subsequent sentences that may result
+from the current step, thus providing a long-term value. In this way, VisVM
+steers VLMs away from generating sentences prone to hallucinations or
+insufficient detail, thereby producing higher quality responses. Experimental
+results demonstrate that VisVM-guided search significantly enhances VLMs'
+ability to generate descriptive captions with richer visual details and fewer
+hallucinations, compared with greedy decoding and search methods with other
+visual reward signals. Furthermore, we find that self-training the model with
+VisVM-guided captions improves the VLM's performance across a wide range of
+multimodal benchmarks, indicating the potential for developing self-improving
+VLMs. Our value model and code are available at
+https://github.com/si0wang/VisVM.
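The general shape of a value-guided, sentence-by-sentence search can be sketched as follows. This is a generic illustration, not the released VisVM code: `propose` and `value` are hypothetical stand-ins for sampling candidate sentences from a VLM and scoring them with a value model, and the toy scorer simply prefers longer sentences.

```python
def value_guided_search(propose, value, max_steps):
    """Build a response one sentence at a time, keeping the highest-value candidate."""
    response = []
    for _ in range(max_steps):
        candidates = propose(response)
        if not candidates:
            break  # nothing left to generate
        best = max(candidates, key=lambda sentence: value(response, sentence))
        response.append(best)
    return response


# Toy stand-ins (assumptions for illustration only).
_OPTIONS = [
    ["A dog.", "A brown dog on grass."],
    ["It is nice.", "It holds a red ball."],
]


def toy_propose(response):
    # Returns the candidate sentences for the current step, or [] when done.
    step = len(response)
    return _OPTIONS[step] if step < len(_OPTIONS) else []


def toy_value(response, sentence):
    # Placeholder score: word count, as a crude proxy for descriptive detail.
    return len(sentence.split())


caption = value_guided_search(toy_propose, toy_value, max_steps=5)
```

A real value model would, per the abstract, also account for the quality of sentences likely to follow the current one; the toy scorer here looks only at the candidate itself.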
+
+
+
+
+
+
+
+
+ Guy Barel, Oren Tsur, Dan Volenchik
+
+
+ Stance detection plays a pivotal role in enabling an extensive range of
+downstream applications, from discourse parsing to tracing the spread of fake
+news and the denial of scientific facts. While most stance classification
+models rely on textual representation of the utterance in question, prior work
+has demonstrated the importance of the conversational context in stance
+detection. In this work we introduce TASTE -- a multimodal architecture for
+stance detection that harmoniously fuses Transformer-based content embedding
+with unsupervised structural embedding. Through the fine-tuning of a pretrained
+transformer and the amalgamation with social embedding via a Gated Residual
+Network (GRN) layer, our model adeptly captures the complex interplay between
+content and conversational structure in determining stance. TASTE achieves
+state-of-the-art results on common benchmarks, significantly outperforming an
+array of strong baselines. Comparative evaluations underscore the benefits of
+social grounding -- emphasizing the criticality of concurrently harnessing both
+content and structure for enhanced stance detection.
+
+
+
+ comment: The modified camera ready version will be published in January 2025
+ at COLING
+
+
+
+
+
+
+ ☆ Evaluating Language Models as Synthetic Data Generators
+
+
+
+
+
+
+
+
+ Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, Graham Neubig
+
+
+ Given the increasing use of synthetic data in language model (LM)
+post-training, an LM's ability to generate high-quality data has become nearly
+as crucial as its ability to solve problems directly. While prior works have
+focused on developing effective data generation methods, they lack systematic
+comparison of different LMs as data generators in a unified setting. To address
+this gap, we propose AgoraBench, a benchmark that provides standardized
+settings and metrics to evaluate LMs' data generation abilities. Through
+synthesizing 1.26 million training instances using 6 LMs and training 99
+student models, we uncover key insights about LMs' data generation
+capabilities. First, we observe that LMs exhibit distinct strengths. For
+instance, GPT-4o excels at generating new problems, while Claude-3.5-Sonnet
+performs better at enhancing existing ones. Furthermore, our analysis reveals
+that an LM's data generation ability doesn't necessarily correlate with its
+problem-solving ability. Instead, multiple intrinsic features of data
+quality -- including response quality, perplexity, and instruction
+difficulty -- collectively serve as better indicators. Finally, we demonstrate
+that strategic choices in output format and cost-conscious model selection
+significantly impact data generation effectiveness.
+
+
+
+ comment: Work in Progress
+
+
+
+
+
+
+ ☆ Personalizing Multimodal Large Language Models for Image Captioning: An
+ Experimental Analysis ECCV 2024
+
+
+
+
+
+
+
+
+ Davide Bucciarelli, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
+
+
+ The task of image captioning demands an algorithm to generate natural
+language descriptions of visual inputs. Recent advancements have seen a
+convergence between image captioning research and the development of Large
+Language Models (LLMs) and Multimodal LLMs -- like GPT-4V and Gemini -- which
+extend the capabilities of text-only LLMs to multiple modalities. This paper
+investigates whether Multimodal LLMs can supplant traditional image captioning
+networks by evaluating their performance on various image description
+benchmarks. We explore both the zero-shot capabilities of these models and
+their adaptability to different semantic domains through fine-tuning methods,
+including prompt learning, prefix tuning, and low-rank adaptation. Our results
+demonstrate that while Multimodal LLMs achieve impressive zero-shot
+performance, fine-tuning for specific domains while maintaining their
+generalization capabilities intact remains challenging. We discuss the
+implications of these findings for future research in image captioning and the
+development of more adaptable Multimodal LLMs.
+
+
+
+ comment: ECCV 2024 Workshop on Green Foundation Models
+
+
+
+
+
+
+ ☆ Multimodal Sentiment Analysis Based on BERT and ResNet
+
+
+ With the rapid development of the Internet and social media, multi-modal data
+(text and image) is increasingly important in sentiment analysis tasks.
+However, existing methods struggle to fuse text and image features
+effectively, which limits the accuracy of analysis. To solve this problem, a
+multimodal sentiment analysis framework combining BERT and ResNet is proposed.
+BERT has shown strong text representation ability in natural language
+processing, and ResNet has excellent image feature extraction performance in
+the field of computer vision. Firstly, BERT is used to extract the text feature
+vector, and ResNet is used to extract the image feature representation. Then, a
+variety of feature fusion strategies are explored, and finally the fusion model
+based on attention mechanism is selected to make full use of the complementary
+information between text and image. Experimental results on the public dataset
+MAVA-single show that compared with the single-modal models that only use BERT
+or ResNet, the proposed multi-modal model improves the accuracy and F1 score,
+reaching the best accuracy of 74.5%. This study not only provides new ideas and
+methods for multimodal sentiment analysis, but also demonstrates the
+application potential of BERT and ResNet in cross-domain fusion. In the future,
+more advanced feature fusion techniques and optimization strategies will be
+explored to further improve the accuracy and generalization ability of
+multimodal sentiment analysis.
+
+
+
+
+
+
+
+ ☆ How to Correctly do Semantic Backpropagation on Language-based Agentic
+ Systems
+
+
+
+
+
+
+
+
+ Wenyi Wang, Hisham A. Alyahya, Dylan R. Ashley, Oleg Serikov, Dmitrii Khizbullin, Francesco Faccio, Jürgen Schmidhuber
+
+
+ Language-based agentic systems have shown great promise in recent years,
+transitioning from solving small-scale research problems to being deployed in
+challenging real-world tasks. However, optimizing these systems often requires
+substantial manual labor. Recent studies have demonstrated that these systems
+can be represented as computational graphs, enabling automatic optimization.
+Despite these advancements, most current efforts in Graph-based Agentic System
+Optimization (GASO) fail to properly assign feedback to the system's components
+given feedback on the system's output. To address this challenge, we formalize
+the concept of semantic backpropagation with semantic gradients -- a
+generalization that aligns several key optimization techniques, including
+reverse-mode automatic differentiation and the more recent TextGrad by
+exploiting the relationship among nodes with a common successor. This serves as
+a method for computing directional information about how changes to each
+component of an agentic system might improve the system's output. To use these
+gradients, we propose a method called semantic gradient descent which enables
+us to solve GASO effectively. Our results on both BIG-Bench Hard and GSM8K show
+that our approach outperforms existing state-of-the-art methods for solving
+GASO problems. A detailed ablation study on the LIAR dataset demonstrates the
+parsimonious nature of our method. A full copy of our implementation is
+publicly available at https://github.com/HishamAlyahya/semantic_backprop
+
+
+
+ comment: 11 pages in main text + 2 pages of references + 15 pages of
+ appendices, 2 figures in main text + 17 figures in appendices, 2 tables in
+ main text + 1 table in appendices, 2 algorithms in main text; source code
+ available at https://github.com/HishamAlyahya/semantic_backprop
+
+
+
+
+
+
+ ♻ ☆ StarVector: Generating Scalable Vector Graphics Code from Images and
+ Text
+
+
+
+
+
+
+
+
+ Juan A. Rodriguez, Abhay Puri, Shubham Agarwal, Issam H. Laradji, Pau Rodriguez, Sai Rajeswar, David Vazquez, Christopher Pal, Marco Pedersoli
+
+
+ Scalable Vector Graphics (SVGs) are vital for modern image rendering due to
+their scalability and versatility. Previous SVG generation methods have focused
+on curve-based vectorization, lacking semantic understanding, often producing
+artifacts, and struggling with SVG primitives beyond path curves. To address
+these issues, we introduce StarVector, a multimodal large language model for
+SVG generation. It performs image vectorization by understanding image
+semantics and using SVG primitives for compact, precise outputs. Unlike
+traditional methods, StarVector works directly in the SVG code space,
+leveraging visual understanding to apply accurate SVG primitives. To train
+StarVector, we create SVG-Stack, a diverse dataset of 2M samples that enables
+generalization across vectorization tasks and precise use of primitives like
+ellipses, polygons, and text. We address challenges in SVG evaluation, showing
+that pixel-based metrics like MSE fail to capture the unique qualities of
+vector graphics. We introduce SVG-Bench, a benchmark across 10 datasets, and 3
+tasks: Image-to-SVG, Text-to-SVG generation, and diagram generation. Using this
+setup, StarVector achieves state-of-the-art performance, producing more compact
+and semantically rich SVGs.
+
+
+
+
+
+
+
+ ♻ ☆ Privacy-Preserving Data Deduplication for Enhancing Federated Learning
+ of Language Models (Extended Version) NDSS
+
+
+ Deduplication is a vital preprocessing step that enhances machine learning
+model performance and saves training time and energy. However, enhancing
+federated learning through deduplication poses challenges, especially regarding
+scalability and potential privacy violations if deduplication involves sharing
+all clients' data. In this paper, we address the problem of deduplication in a
+federated setup by introducing a pioneering protocol, Efficient
+Privacy-Preserving Multi-Party Deduplication (EP-MPD). It efficiently removes
+duplicates from multiple clients' datasets without compromising data privacy.
+EP-MPD is constructed in a modular fashion, utilizing two novel variants of the
+Private Set Intersection protocol. Our extensive experiments demonstrate the
+significant benefits of deduplication in federated learning of large language
+models. For instance, we observe up to 19.62% improvement in perplexity and up
+to 27.95% reduction in running time while varying the duplication level
+between 10% and 30%. EP-MPD effectively balances privacy and performance in
+federated learning, making it a valuable solution for large-scale applications.
+
+
+
+ comment: Accepted at the Network and Distributed Systems Security (NDSS)
+ Symposium, 2025
+
+
+
+
+
+
+ ♻ ☆ Enhancing Biomedical Knowledge Discovery for Diseases: An Open-Source
+ Framework Applied on Rett Syndrome and Alzheimer's Disease
+
+
+
+
+
+
+
+
+ Christos Theodoropoulos, Andrei Catalin Coman, James Henderson, Marie-Francine Moens
+
+
+ The ever-growing volume of biomedical publications creates a critical need
+for efficient knowledge discovery. In this context, we introduce an open-source
+end-to-end framework designed to construct knowledge around specific diseases
+directly from raw text. To facilitate research in disease-related knowledge
+discovery, we create two annotated datasets focused on Rett syndrome and
+Alzheimer's disease, enabling the identification of semantic relations between
+biomedical entities. Extensive benchmarking explores various ways to represent
+relations and entities, offering insights into optimal modeling
+strategies for semantic relation detection and highlighting language models'
+competence in knowledge discovery. We also conduct probing experiments using
+different layer representations and attention scores to explore transformers'
+ability to capture semantic relations.
+
+
+
+ comment: Published in IEEE Access, doi: 10.1109/ACCESS.2024.3509714
+
+
+
+
+
+
+ ♻ ☆ Automatically Interpreting Millions of Features in Large Language Models
+
+
+
+
+
+
+
+
+ Gonçalo Paulo, Alex Mallen, Caden Juang, Nora Belrose
+
+
+ While the activations of neurons in deep neural networks usually do not have
+a simple human-understandable interpretation, sparse autoencoders (SAEs) can be
+used to transform these activations into a higher-dimensional latent space
+which may be more easily interpretable. However, these SAEs can have millions
+of distinct latent features, making it infeasible for humans to manually
+interpret each one. In this work, we build an open-source automated pipeline to
+generate and evaluate natural language explanations for SAE features using
+LLMs. We test our framework on SAEs of varying sizes, activation functions, and
+losses, trained on two different open-weight LLMs. We introduce five new
+techniques to score the quality of explanations that are cheaper to run than
+the previous state of the art. One of these techniques, intervention scoring,
+evaluates the interpretability of the effects of intervening on a feature,
+which we find explains features that are not recalled by existing methods. We
+propose guidelines for generating better explanations that remain valid for a
+broader set of activating contexts, and discuss pitfalls with existing scoring
+techniques. We use our explanations to measure the semantic similarity of
+independently trained SAEs, and find that SAEs trained on nearby layers of the
+residual stream are highly similar. Our large-scale analysis confirms that SAE
+latents are indeed much more interpretable than neurons, even when neurons are
+sparsified using top-$k$ postprocessing. Our code is available at
+https://github.com/EleutherAI/sae-auto-interp, and our explanations are
+available at
+https://huggingface.co/datasets/EleutherAI/auto_interp_explanations.
+
+
+
+
+
+
+
+ ♻ ☆ Number Cookbook: Number Understanding of Language Models and How to
+ Improve It
+
+
+ Large language models (LLMs) can solve an increasing number of complex
+reasoning tasks while making surprising mistakes in basic numerical
+understanding and processing (such as 9.11 > 9.9). The latter ability is
+essential for tackling complex arithmetic and mathematical problems and serves
+as a foundation for most reasoning tasks, but previous work paid little
+attention to it or only discussed several restricted tasks (like integer
+addition). In this paper, we comprehensively investigate the numerical
+understanding and processing ability (NUPA) of LLMs. Firstly, we introduce a
+benchmark covering four common numerical representations and 17 distinct
+numerical tasks in four major categories, resulting in 41 meaningful
+combinations in total. These tasks are derived from primary and secondary
+education curricula, encompassing nearly all everyday numerical understanding
+and processing scenarios, and the rules of these tasks are very simple and
+clear. Through the benchmark, we find that current LLMs fail frequently in many
+of the tasks. To study the problem, we train small models with existing and
+potential techniques for enhancing NUPA (such as tokenizers, PEs, and number
+formats), comprehensively evaluating their effectiveness using our testbed. We
+also finetune practical-scale LLMs on our proposed NUPA tasks and find that 1)
+naive finetuning can improve NUPA a lot on many but not all tasks, and 2)
+surprisingly, techniques designed to enhance NUPA prove ineffective for
+finetuning pretrained models. We further explore the impact of chain-of-thought
+techniques on NUPA. Our work provides a more detailed and comprehensive
+understanding of NUPA in LLMs. Our benchmark and code are released at
+https://github.com/GraphPKU/number_cookbook.
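+
+ The 9.11 versus 9.9 example above can be made concrete with a short sketch.
+It contrasts a comparison by numeric value with a part-wise comparison that
+treats each dot-separated field as an integer, as version numbers are compared.
+Both function names are hypothetical, and the part-wise variant is only one
+plausible illustration of how the wrong answer can arise; it is not a claim
+about the mechanism analysed in the paper.

```python
from decimal import Decimal


def compare_numeric(a: str, b: str) -> str:
    """Compare two decimal strings by their numeric value."""
    x, y = Decimal(a), Decimal(b)
    return ">" if x > y else "<" if x < y else "="


def compare_partwise(a: str, b: str) -> str:
    """Hypothetical faulty comparison: compare dot-separated parts as integers,
    the way version numbers such as 9.11 and 9.9 would be ordered."""
    x = tuple(int(part) for part in a.split("."))
    y = tuple(int(part) for part in b.split("."))
    return ">" if x > y else "<" if x < y else "="


print(compare_numeric("9.11", "9.9"))   # numeric reading: 9.11 is smaller
print(compare_partwise("9.11", "9.9"))  # version-style reading: 11 > 9
```

+ The two readings disagree on the same pair of strings, which is why a
+benchmark of this kind needs to fix the numerical representation explicitly.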
+
+
+
+
+
+
+
+ ♻ ☆ DataLab: A Unified Platform for LLM-Powered Business Intelligence
+
+
+ Business intelligence (BI) transforms large volumes of data within modern
+organizations into actionable insights for informed decision-making. Recently,
+large language model (LLM)-based agents have streamlined the BI workflow by
+automatically performing task planning, reasoning, and actions in executable
+environments based on natural language (NL) queries. However, existing
+approaches primarily focus on individual BI tasks such as NL2SQL and NL2VIS.
+The fragmentation of tasks across different data roles and tools leads to
+inefficiencies and potential errors due to the iterative and collaborative
+nature of BI. In this paper, we introduce DataLab, a unified BI platform that
+integrates a one-stop LLM-based agent framework with an augmented computational
+notebook interface. DataLab supports a wide range of BI tasks for different
+data roles by seamlessly combining LLM assistance with user customization
+within a single environment. To achieve this unification, we design a domain
+knowledge incorporation module tailored for enterprise-specific BI tasks, an
+inter-agent communication mechanism to facilitate information sharing across
+the BI workflow, and a cell-based context management strategy to enhance
+context utilization efficiency in BI notebooks. Extensive experiments
+demonstrate that DataLab achieves state-of-the-art performance on various BI
+tasks across popular research benchmarks. Moreover, DataLab maintains high
+effectiveness and efficiency on real-world datasets from Tencent, achieving up
+to a 58.58% increase in accuracy and a 61.65% reduction in token cost on
+enterprise-specific BI tasks.
+
+
+
+
+
+
+
+ ♻ ☆ Prediction-Powered Ranking of Large Language Models NeurIPS 2024
+
+
+ Large language models are often ranked according to their level of alignment
+with human preferences -- a model is better than other models if its outputs
+are more frequently preferred by humans. One of the popular ways to elicit
+human preferences utilizes pairwise comparisons between the outputs provided by
+different models to the same inputs. However, since gathering pairwise
+comparisons by humans is costly and time-consuming, it has become a common
+practice to gather pairwise comparisons by a strong large language model -- a
+model strongly aligned with human preferences. Surprisingly, practitioners
+cannot currently measure the uncertainty that any mismatch between human and
+model preferences may introduce in the constructed rankings. In this work, we
+develop a statistical framework to bridge this gap. Given a (small) set of
+pairwise comparisons by humans and a large set of pairwise comparisons by a
+model, our framework provides a rank-set -- a set of possible ranking positions
+-- for each of the models under comparison. Moreover, it guarantees that, with
+a probability greater than or equal to a user-specified value, the rank-sets
+cover the true ranking consistent with the distribution of human pairwise
+preferences asymptotically. Using pairwise comparisons made by humans in the
+LMSYS Chatbot Arena platform and pairwise comparisons made by three strong
+large language models, we empirically demonstrate the effectiveness of our
+framework and show that the rank-sets constructed using only pairwise
+comparisons by the strong large language models are often inconsistent with
+(the distribution of) human pairwise preferences.
+
+
+
+ comment: Published at NeurIPS 2024
+
+
+
+
+
+
+ ♻ ☆ Dialectal Coverage And Generalization in Arabic Speech Recognition
+
+
+ Developing robust automatic speech recognition (ASR) systems for Arabic, a
+language characterized by its rich dialectal diversity and often considered a
+low-resource language in speech technology, demands effective strategies to
+manage its complexity. This study explores three critical factors influencing
+ASR performance: the role of dialectal coverage in pre-training, the
+effectiveness of dialect-specific fine-tuning compared to a multi-dialectal
+approach, and the ability to generalize to unseen dialects. Through extensive
+experiments across different dialect combinations, our findings offer key
+insights towards advancing the development of ASR systems for pluricentric
+languages like Arabic.
+
+
+
+
+
+
+
+ ♻ ☆ ELCC: the Emergent Language Corpus Collection
+
+
+ We introduce the Emergent Language Corpus Collection (ELCC): a collection of
+corpora generated from open source implementations of emergent communication
+systems across the literature. These systems include a variety of signalling
+game environments as well as more complex environments like a social deduction
+game and embodied navigation. Each corpus is annotated with metadata describing
+the characteristics of the source system as well as a suite of analyses of the
+corpus (e.g., size, entropy, average message length, performance as transfer
+learning data). Currently, research studying emergent languages requires
+directly running different systems which takes time away from actual analyses
+of such languages, makes studies which compare diverse emergent languages rare,
+and presents a barrier to entry for researchers without a background in deep
+learning. The availability of a substantial collection of well-documented
+emergent language corpora, then, will enable research which can analyze a wider
+variety of emergent languages, which more effectively uncovers general
+principles in emergent communication rather than artifacts of particular
+environments. We provide some quantitative and qualitative analyses with ELCC
+to demonstrate potential use cases of the resource in this vein.
+
+
+
+ comment: 21 pages, 8 figures; added analyses
+
+
+
+
+
+
+ ♻ ☆ LLM as a Complementary Optimizer to Gradient Descent: A Case Study in
+ Prompt Tuning
+
+
+
+
+
+
+
+
+ Zixian Guo, Ming Liu, Zhilong Ji, Jinfeng Bai, Yiwen Guo, Wangmeng Zuo
+
+
+ Mastering a skill generally relies on both hands-on experience from doers and
+insightful, high-level guidance by mentors. Will this strategy also work well
+for solving complex non-convex optimization problems? Here, a common
+gradient-based optimizer acts like a disciplined doer, making locally optimal
+updates at each step. Large Language Models (LLMs) can also search for better
+solutions by inferring from natural language instructions, akin to a high-level
+mentor. In this paper, we show that these two participants are complementary
+to each other and can effectively collaborate as a combined optimization
+framework. The collaborative optimization is achieved by alternating between
+the gradient-based and LLM-based optimizers. We instruct LLMs to generate
+possibly improved solutions by taking parameter trajectories recorded during
+the previous stage of gradient-based optimization into account. Inferred
+results of LLMs are used as restarting points for the next stage of gradient
+optimization. We verify the effectiveness of this optimization framework on
+prompt tuning. By leveraging both the locally rigorous gradient-based optimizer
+and the high-level deductive LLM-based optimizer, the combined optimization
+method consistently yields improvements over competitive baselines on a variety
+of tasks. Our results demonstrate the synergistic effect of conventional
+gradient-based optimization and the inference ability of LLMs. The code is
+released at https://github.com/guozix/LLM-catalyst.
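+
+ The alternating scheme described above can be sketched on a toy problem. This
+is a minimal illustration under stated assumptions, not the authors' code: the
+objective is a one-dimensional quadratic, `propose_restart` is a hypothetical
+stand-in for the LLM that simply extrapolates the recorded trajectory, and the
+rule of accepting a proposal only when it lowers the loss is an assumption
+added here for safety.

```python
def loss(x: float) -> float:
    """Toy objective with its minimum at x = 3."""
    return (x - 3.0) ** 2


def grad(x: float) -> float:
    """Analytic gradient of the toy objective."""
    return 2.0 * (x - 3.0)


def gradient_stage(x: float, steps: int = 5, lr: float = 0.1) -> list:
    """Run plain gradient descent and record the parameter trajectory."""
    trajectory = [x]
    for _ in range(steps):
        x = x - lr * grad(x)
        trajectory.append(x)
    return trajectory


def propose_restart(trajectory: list) -> float:
    """Hypothetical stand-in for the LLM-based optimizer: extrapolate the
    direction of the last recorded update (an assumption, not the paper's prompt)."""
    last, prev = trajectory[-1], trajectory[-2]
    return last + 2.0 * (last - prev)


def alternate(x0: float, rounds: int = 3) -> float:
    """Alternate gradient stages with trajectory-based restart proposals."""
    x = x0
    for _ in range(rounds):
        trajectory = gradient_stage(x)
        x = trajectory[-1]
        candidate = propose_restart(trajectory)
        # Assumed acceptance rule: restart from the proposal only if it helps.
        if loss(candidate) < loss(x):
            x = candidate
    return x


print(alternate(0.0))
```

+ In an actual prompt-tuning setting the proposal step would query an LLM with
+the recorded trajectory; the extrapolation here only keeps the sketch runnable.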
+
+
+
+
+
+
+
+
+ Ruirui Chen, Weifeng Jiang, Chengwei Qin, Ishaan Singh Rawal, Cheston Tan, Dongkyu Choi, Bo Xiong, Bo Ai
+
+
+ The important challenge of keeping knowledge in Large Language Models (LLMs)
+up-to-date has led to the development of various methods for incorporating new
+facts. However, existing methods for such knowledge editing still face
+difficulties with multi-hop questions that require accurate fact identification
+and sequential logical reasoning, particularly among numerous fact updates. To
+tackle these challenges, this paper introduces Graph Memory-based Editing for
+Large Language Models (GMeLLo), a straightforward and effective method that
+merges the explicit knowledge representation of Knowledge Graphs (KGs) with the
+linguistic flexibility of LLMs. Beyond merely leveraging LLMs for question
+answering, GMeLLo employs these models to convert free-form language into
+structured queries and fact triples, facilitating seamless interaction with KGs
+for rapid updates and precise multi-hop reasoning. Our results show that GMeLLo
+significantly surpasses current state-of-the-art (SOTA) knowledge editing
+methods in the multi-hop question answering benchmark, MQuAKE, especially in
+scenarios with extensive knowledge edits.
+
+
+
+
+
+
+
+ ♻ ☆ Self-Improvement in Language Models: The Sharpening Mechanism
+
+
+
+
+
+
+
+
+ Audrey Huang, Adam Block, Dylan J. Foster, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, Jordan T. Ash, Akshay Krishnamurthy
+
+
+ Recent work in language modeling has raised the possibility of
+self-improvement, where a language model evaluates and refines its own
+generations to achieve higher performance without external feedback. It is
+impossible for this self-improvement to create information that is not already
+in the model, so why should we expect that this will lead to improved
+capabilities? We offer a new perspective on the capabilities of
+self-improvement through a lens we refer to as sharpening. Motivated by the
+observation that language models are often better at verifying response quality
+than they are at generating correct responses, we formalize self-improvement as
+using the model itself as a verifier during post-training in order to
+"sharpen" the model to one placing large mass on high-quality sequences,
+thereby amortizing the expensive inference-time computation of generating good
+sequences. We begin by introducing a new statistical framework for sharpening
+in which the learner aims to sharpen a pre-trained base policy via sample
+access, and establish fundamental limits. Then we analyze two natural families
+of self-improvement algorithms based on SFT and RLHF. We find that (i) the
+SFT-based approach is minimax optimal whenever the initial model has sufficient
+coverage, but (ii) the RLHF-based approach can improve over SFT-based
+self-improvement by leveraging online exploration, bypassing the need for
+coverage. Finally, we empirically validate the sharpening mechanism via
+inference-time and amortization experiments. We view these findings as a
+starting point toward a foundational understanding that can guide the design
+and evaluation of self-improvement algorithms.
+
+
+
+
+
+
+
+ ♻ ☆ AAVENUE: Detecting LLM Biases on NLU Tasks in AAVE via a Novel Benchmark EMNLP 2024
+
+
+
+
+
+
+
+
+ Abhay Gupta, Philip Meng, Ece Yurtseven, Sean O'Brien, Kevin Zhu
+
+
+ Detecting biases in natural language understanding (NLU) for African American
+Vernacular English (AAVE) is crucial to developing inclusive natural language
+processing (NLP) systems. To address dialect-induced performance discrepancies,
+we introduce AAVENUE (AAVE Natural Language Understanding Evaluation),
+a benchmark for evaluating large language model (LLM) performance on NLU tasks
+in AAVE and Standard American English (SAE). AAVENUE builds upon and extends
+existing benchmarks like VALUE, replacing deterministic syntactic and
+morphological transformations with a more flexible methodology leveraging
+LLM-based translation with few-shot prompting, improving performance across our
+evaluation metrics when translating key tasks from the GLUE and SuperGLUE
+benchmarks. We compare AAVENUE and VALUE translations using five popular LLMs
+and a comprehensive set of metrics including fluency, BARTScore, quality,
+coherence, and understandability. Additionally, we recruit fluent AAVE speakers
+to validate our translations for authenticity. Our evaluations reveal that LLMs
+consistently perform better on SAE tasks than AAVE-translated versions,
+underscoring inherent biases and highlighting the need for more inclusive NLP
+models. We have open-sourced our source code on GitHub and created a website to
+showcase our work at https://aavenue.live.
+
+
+
+ comment: Published at NLP4PI @ EMNLP 2024
+
+
+
+
+
+
+ ♻ ☆ A Spatio-Temporal Representation Learning as an Alternative to
+ Traditional Glosses in Sign Language Translation and Production WACV 2025
+
+
+
+
+
+
+
+
+ Eui Jun Hwang, Sukmin Cho, Huije Lee, Youngwoo Yoon, Jong C. Park
+
+
+ This work addresses the challenges associated with the use of glosses in both
+Sign Language Translation (SLT) and Sign Language Production (SLP). While
+glosses have long been used as a bridge between sign language and spoken
+language, they come with two major limitations that impede the advancement of
+sign language systems. First, annotating the glosses is a labor-intensive and
+time-consuming process, which limits the scalability of datasets. Second, the
+glosses oversimplify sign language by stripping away its spatio-temporal
+dynamics, reducing complex signs to basic labels and missing the subtle
+movements essential for precise interpretation. To address these limitations,
+we introduce Universal Gloss-level Representation (UniGloR), a framework
+designed to capture the spatio-temporal features inherent in sign language,
+providing a more dynamic and detailed alternative to the use of the glosses.
+The core idea of UniGloR is simple yet effective: We derive dense
+spatio-temporal representations from sign keypoint sequences using
+self-supervised learning and seamlessly integrate them into SLT and SLP tasks.
+Our experiments in a keypoint-based setting demonstrate that UniGloR either
+outperforms or matches the performance of previous SLT and SLP methods on two
+widely-used datasets: PHOENIX14T and How2Sign.
+
+
+
+
+
+
+
+
+ Wang Qun, Liu Yang, Lin Qingquan, Jiang Ling
+
+
+ We introduce Xmodel-1.5, a 1-billion-parameter multilingual large language
+model pretrained on 2 trillion tokens, designed for balanced performance and
+scalability. Unlike most large models that use the BPE tokenizer, Xmodel-1.5
+employs a custom unigram tokenizer with 65,280 tokens, optimizing both
+efficiency and accuracy. The model delivers competitive results across multiple
+languages, including Thai, Arabic, French, Chinese, and English, outperforming
+Alibaba's PolyLM-1.7B on respective evaluation datasets. Xmodel-1.5 excels in
+benchmarks like mMMLU and PIQA, and achieves state-of-the-art results in Thai.
+To support low-resource language research, we release Xdata_Thai, a
+Thai-specific evaluation dataset featuring unique linguistic challenges such as
+gendered particles and idioms. While the model demonstrates strong performance,
+there is still room for improvement in handling culturally specific nuances. We
+hope this work contributes to advancements in multilingual AI research. Models
+and code are publicly available on GitHub at
+https://github.com/XiaoduoAILab/XmodelLM-1.5
+
+
+ Large Language Models (LLMs) are typically trained to predict in the forward
+direction of time. However, recent works have shown that prompting these models
+to look back and critique their own generations can produce useful feedback.
+Motivated by this, we explore the question of whether LLMs can be empowered to
+think (predict and score) backwards to provide unsupervised feedback that
+complements forward LLMs. Towards this, we introduce Time Reversed Language
+Models (TRLMs), which can score and generate queries when conditioned on
+responses, effectively functioning in the reverse direction of time. Further,
+to effectively infer in the response to query direction, we pre-train and
+fine-tune a language model (TRLM-Ba) in the reverse token order from scratch.
+We show empirically (and theoretically in a stylized setting) that
+time-reversed models can indeed complement forward model predictions when used
+to score the query given response for re-ranking multiple forward generations.
+We obtain up to 5% improvement on the widely used AlpacaEval Leaderboard over
+the competent baseline of best-of-N re-ranking using self log-perplexity
+scores. We further show that TRLM scoring outperforms conventional forward
+scoring of response given query, resulting in significant gains in applications
+such as citation generation and passage retrieval. We next leverage the
+generative ability of TRLM to augment or provide unsupervised feedback to input
+safety filters of LLMs, demonstrating a drastic reduction in false negative
+rate with negligible impact on false positive rates against several attacks
+published on the popular JailbreakBench leaderboard.
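+
+ The re-ranking use case described above can be sketched as follows. Best-of-N
+candidates from a forward model are ordered by how well each response predicts
+the query. The function names are hypothetical, and `toy_reverse_score` is a
+word-overlap placeholder standing in for a reverse model's score of the query
+given the response; it is not the TRLM scoring function.

```python
from typing import Callable, List


def rerank_by_reverse_score(
    query: str,
    candidates: List[str],
    reverse_score: Callable[[str, str], float],
) -> str:
    """Return the candidate whose reverse score (query given response) is highest."""
    return max(candidates, key=lambda response: reverse_score(query, response))


def toy_reverse_score(query: str, response: str) -> float:
    """Placeholder scorer (assumption): fraction of query words found in the
    response. A real system would use a reverse language model here."""
    query_words = set(query.lower().split())
    response_words = set(response.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & response_words) / len(query_words)


candidates = ["I like trains", "Paris is the capital of France"]
print(rerank_by_reverse_score("capital of france", candidates, toy_reverse_score))
```

+ Passing the scorer as a parameter keeps the selection logic independent of
+whichever reverse model supplies the scores.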
+
+
+
+ comment: Accepted as a spotlight in NeurIPS 2024
+
+
+
+
+
+
+ ♻ ☆ ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language
+ Models
+
+
+ Hallucination poses a persistent challenge for multimodal large language
+models (MLLMs). However, existing benchmarks for evaluating hallucinations are
+generally static, which may overlook the potential risk of data contamination.
+To address this issue, we propose ODE, an open-set, dynamic protocol designed
+to evaluate object hallucinations in MLLMs at both the existence and attribute
+levels. ODE employs a graph-based structure to represent real-world object
+concepts, their attributes, and the distributional associations between them.
+This structure facilitates the extraction of concept combinations based on
+diverse distributional criteria, generating varied samples for structured
+queries that evaluate hallucinations in both generative and discriminative
+tasks. Through the generation of new samples, dynamic concept combinations, and
+varied distribution frequencies, ODE mitigates the risk of data contamination
+and broadens the scope of evaluation. This protocol is applicable to both
+general and specialized scenarios, including those with limited data.
+Experimental results demonstrate the effectiveness of our protocol, revealing
+that MLLMs exhibit higher hallucination rates when evaluated with ODE-generated
+samples, which indicates potential data contamination. Furthermore, these
+generated samples aid in analyzing hallucination patterns and fine-tuning
+models, offering an effective approach to mitigating hallucinations in MLLMs.
+
+
+
+
+
+
+
+ ♻ ☆ GWQ: Gradient-Aware Weight Quantization for Large Language Models
+
+
+ Large language models (LLMs) show impressive performance in solving complex
+language tasks. However, their large number of parameters presents significant
+challenges for the deployment and application of the models on edge devices.
+Compressing large language models to low bits can enable them to run on
+resource-constrained devices, but often leads to performance degradation. To
+address this problem, we propose gradient-aware weight quantization (GWQ), the
+first quantization approach for low-bit weight quantization that leverages
+gradients to localize outliers, requiring only a minimal amount of calibration
+data for outlier detection. GWQ retains the weights corresponding to the top 1%
+outliers preferentially at FP16 precision, while the remaining non-outlier
+weights are stored in a low-bit format. Our experiments indicate that
+localizing sensitive weights with gradients works better than localizing them
+with the Hessian matrix. Compared to current quantization methods, GWQ can be
+applied to multiple language models and achieves lower perplexity (PPL) on the
+WikiText2 and C4 datasets. In zero-shot tasks, GWQ-quantized models have higher
+accuracy than those produced by other quantization methods. GWQ is also
+suitable for multimodal model quantization: the quantized Qwen-VL family models
+are more accurate than those quantized with other methods, and on the zero-shot
+target detection dataset RefCOCO GWQ outperforms the current state-of-the-art
+method SPQR. GWQ achieves a 1.2 times inference speedup in
+comparison to the original model, and effectively reduces the inference memory.
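+
+ The outlier-retention idea above can be sketched in a few lines. This is a
+rough illustration under stated assumptions, not the GWQ implementation:
+sensitivity is taken to be the absolute gradient, the low-bit format is a
+symmetric round-to-nearest quantizer with one scale per tensor, and retained
+outliers are simply left at their original value as a stand-in for FP16
+storage. The function name and parameters are hypothetical.

```python
def quantize_with_outliers(weights, grads, outlier_frac=0.01, bits=4):
    """Keep the most gradient-sensitive weights unquantized; quantize the rest.

    Returns the dequantized weights and the set of retained outlier indices.
    """
    n = len(weights)
    k = max(1, int(round(n * outlier_frac)))
    # Rank positions by absolute gradient (assumed sensitivity measure).
    order = sorted(range(n), key=lambda i: abs(grads[i]), reverse=True)
    outliers = set(order[:k])

    # Symmetric per-tensor scale computed from the non-outlier weights only.
    qmax = 2 ** (bits - 1) - 1
    rest = [abs(weights[i]) for i in range(n) if i not in outliers]
    scale = (max(rest, default=0.0) / qmax) or 1.0

    result = []
    for i, w in enumerate(weights):
        if i in outliers:
            result.append(w)  # retained at full precision (FP16 stand-in)
        else:
            q = max(-qmax, min(qmax, round(w / scale)))
            result.append(q * scale)
    return result, outliers


weights = [0.1, -0.2, 5.0, 0.05]
grads = [0.01, 0.02, 3.0, 0.0]
print(quantize_with_outliers(weights, grads, outlier_frac=0.25, bits=4))
```

+ Excluding the outliers before computing the scale is what keeps a single
+large weight from stretching the quantization grid for all the others.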
+
+
+ The combination of language processing and image processing keeps attracting
+increased interest given recent impressive advances that leverage the combined
+strengths of both domains of research. Among these advances, the task of
+editing an image on the basis solely of a natural language instruction stands
+out as a most challenging endeavour. While recent approaches for this task
+resort, in one way or another, to some form of preliminary preparation, training
+or fine-tuning, this paper explores a novel approach: We propose a
+preparation-free method that permits instruction-guided image editing on the
+fly. The approach is organized in three steps: image captioning and DDIM
+inversion, then obtaining the edit direction embedding, and finally the image
+editing proper. While dispensing with preliminary preparation, our approach
+proves to be effective and competitive, outperforming recent state-of-the-art
+models for this task when
+evaluated on the MAGICBRUSH dataset.
+
+
+
+
+
+
+
+ ♻ ☆ Elephants Never Forget: Memorization and Learning of Tabular Data in
+ Large Language Models
+
+
+ While many have shown how Large Language Models (LLMs) can be applied to a
+diverse set of tasks, the critical issues of data contamination and
+memorization are often glossed over. In this work, we address this concern for
+tabular data. Specifically, we introduce a variety of different techniques to
+assess whether a language model has seen a tabular dataset during training.
+This investigation reveals that LLMs have memorized many popular tabular
+datasets verbatim. We then compare the few-shot learning performance of LLMs on
+datasets that were seen during training to the performance on datasets released
+after training. We find that LLMs perform better on datasets seen during
+training, indicating that memorization leads to overfitting. At the same time,
+LLMs show non-trivial performance on novel datasets and are surprisingly robust
+to data transformations. We then investigate the in-context statistical
+learning abilities of LLMs. While LLMs are significantly better than random at
+solving statistical classification problems, the sample efficiency of few-shot
+learning lags behind traditional statistical learning algorithms, especially as
+the dimension of the problem increases. This suggests that much of the observed
+few-shot performance on novel real-world datasets is due to the LLM's world
+knowledge. Overall, our results highlight the importance of testing whether an
+LLM has seen an evaluation dataset during pre-training. We release the
+https://github.com/interpretml/LLM-Tabular-Memorization-Checker Python package
+to test LLMs for memorization of tabular datasets.
+
+
+
+ comment: COLM camera ready, fix typo
+
+
+
+
+
+
+ ♻ ☆ Knowledge Mechanisms in Large Language Models: A Survey and Perspective EMNLP 2024
+
+
+ Understanding knowledge mechanisms in Large Language Models (LLMs) is crucial
+for advancing towards trustworthy AGI. This paper reviews knowledge mechanism
+analysis from a novel taxonomy including knowledge utilization and evolution.
+Knowledge utilization delves into the mechanism of memorization, comprehension
+and application, and creation. Knowledge evolution focuses on the dynamic
+progression of knowledge within individual and group LLMs. Moreover, we discuss
+what knowledge LLMs have learned, the reasons for the fragility of parametric
+knowledge, and the potential dark knowledge (hypothesis) that will be
+challenging to address. We hope this work can help understand knowledge in LLMs
+and provide insights for future research.
+
+
+
+ comment: EMNLP 2024 Findings; 39 pages (v4)
+
+
+
+
+
+
+ ♻ ☆ Adaptive Dense Reward: Understanding the Gap Between Action and Reward
+ Space in Alignment
+
+
+
+
+
+
+
+
+ Yanshi Li, Shaopan Xiong, Gengru Chen, Xiaoyang Li, Yijia Luo, Xingyao Zhang, Yanhui Huang, Xingyuan Bu, Yingshui Tan, Chun Yuan, Jiamang Wang, Wenbo Su, Bo Zheng
+
+
+ Reinforcement Learning from Human Feedback (RLHF) has proven highly effective
+in aligning Large Language Models (LLMs) with human preferences. However, the
+original RLHF typically optimizes under an overall reward, which can lead to a
+suboptimal learning process. This limitation stems from RLHF's lack of
+awareness regarding which specific tokens should be reinforced or suppressed.
+Moreover, conflicts in supervision can arise, for instance, when a chosen
+response includes erroneous tokens, while a rejected response contains accurate
+elements. To rectify these shortcomings, a growing number of dense reward methods,
+such as step-wise and token-wise RLHF, have been proposed. However, these existing
+methods are limited to specific tasks (like mathematics). In this paper, we
+propose the "Adaptive Message-wise RLHF" method, which robustly applies to
+various tasks. By defining pivot tokens as key indicators, our approach
+adaptively identifies essential information and converts sequence-level
+supervision into fine-grained, subsequence-level supervision. This aligns the
+density of rewards and action spaces more closely with the information density
+of the input. Experiments demonstrate that our method can be integrated into
+various training methods, significantly mitigating hallucinations and
+catastrophic forgetting problems, while outperforming other methods on multiple
+evaluation metrics. Our method improves the success rate on adversarial samples
+by 10% compared to the sample-wise approach, and achieves a 1.3% improvement
+on evaluation benchmarks such as MMLU, GSM8K, HumanEval, etc.
+
+
+
+
+
+
+
+ ♻ ☆ Pay Attention to the Robustness of Chinese Minority Language Models!
+ Syllable-level Textual Adversarial Attack on Tibetan Script ACL 2023
+
+
+
+
+
+
+
+
+ Xi Cao, Dolma Dawa, Nuo Qun, Trashi Nyima
+
+
+ The textual adversarial attack refers to an attack method in which the
+attacker adds imperceptible perturbations to the original texts by elaborate
+design so that the NLP (natural language processing) model produces false
+judgments. This method is also used to evaluate the robustness of NLP models.
+Currently, most of the research in this field focuses on English, and there is
+also a certain amount of research on Chinese. However, to the best of our
+knowledge, there is little research targeting Chinese minority languages.
+Textual adversarial attacks are a new challenge for the information processing
+of Chinese minority languages. In response to this situation, we propose a
+Tibetan syllable-level black-box textual adversarial attack called TSAttacker
+based on syllable cosine distance and a scoring mechanism. We then apply
+TSAttacker to six models generated by fine-tuning two PLMs (pre-trained
+language models) for three downstream tasks. The experiment results show that
+TSAttacker is effective and generates high-quality adversarial samples. In
+addition, the robustness of the involved models still has much room for
+improvement.
+
+
+
+ comment: Revised Version; Accepted at ACL 2023 Workshop on TrustNLP
+
+
+
+
+
+
+ ♻ ☆ "Moralized" Multi-Step Jailbreak Prompts: Black-Box Testing of
+ Guardrails in Large Language Models for Verbal Attacks
+
+
+ As the application of large language models continues to expand in various
+fields, it poses higher challenges to the effectiveness of identifying harmful
+content generation and guardrail mechanisms. This research aims to evaluate the
+guardrail effectiveness of GPT-4o, Grok-2 Beta, Llama 3.1 (405B), Gemini 1.5,
+and Claude 3.5 Sonnet through black-box testing of seemingly ethical multi-step
+jailbreak prompts. It conducts ethical attacks by designing identical
+multi-step prompts that simulate the scenario of "corporate middle managers
+competing for promotions." The results show that the guardrails of the
+above-mentioned LLMs were bypassed and the content of verbal attacks was
+generated. Claude 3.5 Sonnet's resistance to multi-step jailbreak prompts is
+more obvious. To ensure objectivity, the experimental process, black box test
+code, and enhanced guardrail code are uploaded to the GitHub repository:
+https://github.com/brucewang123456789/GeniusTrail.git.
+
+
+
+ comment: This paper has been submitted to Nature Machine Intelligence and
+ OpenReview preprints. It has 7 pages of text, 3 figures, and 3 tables
+
+
+
+
+
+
+ ♻ ☆ Patience Is The Key to Large Language Model Reasoning
+
+
+ Recent advancements in the field of large language models, particularly
+through the Chain of Thought (CoT) approach, have demonstrated significant
+improvements in solving complex problems. However, existing models either tend
+to sacrifice detailed reasoning for brevity due to user preferences, or require
+extensive and expensive training data to learn complicated reasoning ability,
+limiting their potential in solving complex tasks. To bridge this gap,
+following the concept of test-time scaling, we propose a simple method that
+encourages models to adopt a more patient reasoning style without the need to
+introduce new knowledge or skills. Using a preference optimization
+approach, we generate detailed reasoning processes as positive examples and
+simple answers as negative examples, thereby training the model to favor
+thoroughness in its responses. Our results demonstrate a performance increase
+of up to 2.1% on GSM8k when training on just a lightweight dataset.
+
+
+
+ comment: The dataset and model are available at
+ https://huggingface.co/datasets/yuyijiong/patient-math-cot
+
+
+
+
+
+
+ ♻ ☆ One Initialization to Rule them All: Fine-tuning via Explained Variance
+ Adaptation
+
+
+
+
+
+
+
+
+ Fabian Paischer, Lukas Hauzenberger, Thomas Schmied, Benedikt Alkin, Marc Peter Deisenroth, Sepp Hochreiter
+
+
+ Foundation models (FMs) are pre-trained on large-scale datasets and then
+fine-tuned on a downstream task for a specific application. The most successful
+and most commonly used fine-tuning method is to update the pre-trained weights
+via a low-rank adaptation (LoRA). LoRA introduces new weight matrices that are
+usually initialized at random with a uniform rank distribution across the model
+weights. Recent works focus on different initialization schemes or the learning
+of adaptive ranks during fine-tuning. Both approaches have only been
+investigated in isolation, resulting in slow convergence or a uniform rank
+distribution, in turn leading to suboptimal performance. We propose to improve
+LoRA by initializing the new weights in a data-driven manner by computing
+singular value decomposition (SVD) on minibatches of activation vectors. Then,
+we initialize the LoRA matrices with the obtained right-singular vectors and
+redistribute ranks among all weight matrices to provably store the maximum
+amount of information of the downstream data in the newly introduced weights.
+In this way, only what information to maintain or neglect during the
+fine-tuning process needs to be learned. We call our new method Explained
+Variance Adaptation (EVA). We apply EVA to a variety of fine-tuning tasks
+ranging from language generation and understanding to image classification and
+reinforcement learning. EVA exhibits faster convergence than competitors and
+achieves the highest average score across a multitude of tasks per domain while
+reducing the number of trainable parameters through rank redistribution.
+
+
+
+ comment: 11 pages + references and appendix, code available at
+ https://github.com/ml-jku/EVA
+
+
+
+
+
+
+ ♻ ☆ Long-context Language Models Are Not Good At Retrieval Without Enough
+ Steps
+
+
+
+
+
+
+
+
+ Yijiong Yu, Ma Xiufa, Fang Jianwei, Zhi Xu, Su Guangyao, Wang Jiancheng, Yongfeng Huang, Zhixiao Qi, Wei Wang, Weifeng Liu, Ran Chen, Ji Pei
+
+
+ Long-context language models (LCLMs), characterized by their extensive
+context window, are becoming increasingly popular. However, although they are
+nearly perfect at standard long-context retrieval, we find they are actually
+not good at all retrieval tasks. Specifically, we identify 2 basic cases,
+"multi-matching retrieval," and "logic-based retrieval", which LLMs struggle to
+solve under normal settings. Moreover, we find these cases can only be well
+addressed by specific CoT prompting, with enough reasoning steps. This finding
+reminds the developers and users of LCLMs that relying on LCLMs to directly
+perform even basic retrieval tasks may be unreliable; rather, a sufficiently
+long reasoning process is necessary.
+
+
+
+ comment: Our code is publicly available at
+ https://github.com/yuyijiong/hard_retrieval_for_llm and the dataset is at
+ https://huggingface.co/datasets/yuyijiong/difficult_retrieval
+
+
+
+
+
+
+ ♻ ☆ How to Build an AI Tutor that Can Adapt to Any Course and Provide
+ Accurate Answers Using Large Language Model and Retrieval-Augmented
+ Generation
+
+
+ This paper proposes a low-code solution to build an AI tutor that leverages
+advanced AI techniques to provide accurate and contextually relevant responses
+in a personalized learning environment. The OpenAI Assistants API allows AI
+Tutor to easily embed, store, retrieve, and manage files and chat history,
+enabling a low-code solution. Large Language Models (LLMs) and
+Retrieval-Augmented Generation (RAG) technology generate sophisticated answers
+based on course-specific materials. The application efficiently organizes and
+retrieves relevant information through vector embedding and similarity-based
+retrieval algorithms. The AI Tutor prototype demonstrates its ability to
+generate relevant, accurate answers with source citations. It represents a
+significant advancement in technology-enhanced tutoring systems, democratizing
+access to high-quality, customized educational support in higher education.
+
+
+
+ comment: 4 pages, 6 figures
+
+
+
+
+
+
+ ♻ ☆ Opt-Out: Investigating Entity-Level Unlearning for Large Language Models
+ via Optimal Transport
+
+
+ Instruction-following large language models (LLMs), such as ChatGPT, have
+become widely popular among everyday users. However, these models inadvertently
+disclose private, sensitive information to their users, underscoring the need
+for machine unlearning techniques to remove selective information from the
+models. While prior work has focused on forgetting small, random subsets of
+training data at the instance-level, we argue that real-world scenarios often
+require the removal of a user's entire data, which may require a more careful
+maneuver. In this study, we explore entity-level unlearning, which aims to
+erase all knowledge related to a target entity while preserving the remaining
+model capabilities. To address this, we introduce Opt-Out, an optimal
+transport-based unlearning method that utilizes the Wasserstein distance from
+the model's initial parameters to achieve more effective and fine-grained
+unlearning. We also present the first Entity-Level Unlearning Dataset (ELUDe)
+designed to evaluate entity-level unlearning. Our empirical results demonstrate
+that Opt-Out surpasses existing methods, establishing a new standard for secure
+and adaptable LLMs that can accommodate user data removal requests without the
+need for full retraining.
+
+
+
+ comment: 17 pages, 10 figures
+
+
+
+
+
+
+ ♻ ☆ A Comparative Study of LLM-based ASR and Whisper in Low Resource and
+ Code Switching Scenario
+
+
+ Large Language Models (LLMs) have showcased exceptional performance across
+diverse NLP tasks, and their integration with speech encoders is rapidly
+emerging as a dominant trend in the Automatic Speech Recognition (ASR) field.
+Previous works mainly concentrated on leveraging LLMs for speech recognition in
+English and Chinese. However, their potential for addressing speech recognition
+challenges in low resource settings remains underexplored. Hence, in this work,
+we aim to explore the capability of LLMs in low resource ASR and
+Mandarin-English code switching ASR. We also evaluate and compare the
+recognition performance of LLM-based ASR systems against the Whisper model.
+Extensive experiments demonstrate that LLM-based ASR yields a relative gain of
+12.8% over the Whisper model in low resource ASR while Whisper performs better
+in Mandarin-English code switching ASR. We hope that this study could shed
+light on ASR for low resource scenarios.
+
+
+
+ comment: This work hasn't been finished yet
+
+
+
+
+
+
+ ♻ ☆ An Effective Framework to Help Large Language Models Handle
+ Numeric-involved Long-context Tasks
+
+
+ Large Language Models (LLMs) have demonstrated remarkable capabilities in
+handling long texts and have almost perfect performance in traditional
+retrieval tasks. However, their performance significantly degrades when it
+comes to numerical calculations in long contexts. Numeric-involved
+long-context tasks typically cannot be addressed by current LLMs in normal
+settings due to their inherent limitations in simultaneously handling complex
+and massive information. Some CoT-like prompting methods can improve accuracy
+but demand massive output tokens, which is costly and slow. To address this
+issue, we propose a workflow that decomposes a numeric-involved long-context
+task into 4 low-level subtasks: judging, extracting, processing with code,
+and conclusion. The first 2 subtasks are relatively simple, which allows us to
+use smaller models for efficiently processing long context. When numerical
+calculations are required, we use code generated by LLMs to avoid the
+disadvantage of LLMs not being good at calculations. The results on 2
+numeric-involved long-context benchmarks demonstrate our workflow can not only
+improve accuracy, but also significantly reduce the cost of API calls.
+
+
+
+
+
+
+
+ ♻ ☆ LLMs Do Not Think Step-by-step In Implicit Reasoning
+
+
+ It is well known that Chain-of-Thought can remarkably enhance LLMs'
+performance on complex tasks. However, because it also introduces slower
+inference speeds and higher computational costs, many studies have attempted
+to use implicit CoT, which does not need LLMs to explicitly generate the
+intermediate steps. But there is still a gap between their efficacy and typical
+explicit CoT methods. This raises the question: is implicit CoT really
+equivalent to explicit CoT? In this study, we address this question
+through experiments. We probe the information of intermediate steps from the
+model's hidden states when it is performing implicit CoT. The results
+surprisingly indicate that LLMs hardly think about intermediate steps,
+suggesting they may just rely on experience rather than strict step-by-step
+reasoning. Moreover, we find LLMs' implicit reasoning capabilities are
+susceptible and unstable, reaffirming the necessity of explicit CoT to
+effectively support complex tasks.
+
+
+
+
+
+
+
+ ♻ ☆ Sibyl: Empowering Empathetic Dialogue Generation in Large Language
+ Models via Sensible and Visionary Commonsense Inference COLING 2025
+
+
+ Recently, there has been a heightened interest in building chatbots based on
+Large Language Models (LLMs) to emulate human-like qualities in multi-turn
+conversations. Despite having access to commonsense knowledge to better
+understand the psychological aspects and causality of dialogue context, even
+these powerful LLMs struggle to achieve the goals of empathy and emotional
+support. Current commonsense knowledge derived from dialogue contexts is
+inherently limited and often fails to adequately anticipate the future course
+of a dialogue. This lack of foresight can mislead LLMs and hinder their ability
+to provide effective support. In response to this challenge, we present an
+innovative framework named Sensible and Visionary Commonsense Knowledge
+(Sibyl). Designed to concentrate on the immediately succeeding dialogue, this
+paradigm equips LLMs with the capability to uncover the implicit requirements
+of the conversation, aiming to elicit more empathetic responses. Experimental
+results demonstrate that incorporating our paradigm for acquiring commonsense
+knowledge into LLMs comprehensively enhances the quality of their responses.
+
+
+
+ comment: Accepted by COLING 2025
+
+
+
+
+
+
+ ♻ ☆ Enhancing Function-Calling Capabilities in LLMs: Strategies for Prompt
+ Formats, Data Integration, and Multilingual Translation
+
+
+ Large language models (LLMs) have significantly advanced autonomous agents,
+particularly in zero-shot tool usage, also known as function calling. This
+research delves into enhancing the function-calling capabilities of LLMs by
+exploring different approaches, including prompt formats for integrating
+function descriptions, blending function-calling and instruction-following
+data, introducing a novel Decision Token for conditional prompts, leveraging
+chain-of-thought reasoning, and overcoming multilingual challenges with a
+translation pipeline. Our key findings and contributions are as follows: (1)
+Instruction-following data improves both function-calling accuracy and
+relevance detection. (2) The use of the newly proposed Decision Token, combined
+with synthetic non-function-call data, enhances relevance detection. (3) A
+tailored translation pipeline effectively overcomes multilingual limitations,
+demonstrating significant improvements in Traditional Chinese. These insights
+highlight the potential for improved function-calling capabilities and
+multilingual applications in LLMs.
+
+
+
+
+
+
+
+ ♻ ☆ Controlling Risk of Retrieval-augmented Generation: A Counterfactual
+ Prompting Framework
+
+
+ Retrieval-augmented generation (RAG) has emerged as a popular solution to
+mitigate the hallucination issues of large language models. However, existing
+studies on RAG seldom address the issue of predictive uncertainty, i.e., how
+likely it is that a RAG model's prediction is incorrect, resulting in
+uncontrollable risks in real-world applications. In this work, we emphasize the
+importance of risk control, ensuring that RAG models proactively refuse to
+answer questions with low confidence. Our research identifies two critical
+latent factors affecting RAG's confidence in its predictions: the quality of
+the retrieved results and the manner in which these results are utilized. To
+guide RAG models in assessing their own confidence based on these two latent
+factors, we develop a counterfactual prompting framework that induces the
+models to alter these factors and analyzes the effect on their answers. We also
+introduce a benchmarking procedure to collect answers with the option to
+abstain, facilitating a series of experiments. For evaluation, we introduce
+several risk-related metrics and the experimental results demonstrate the
+effectiveness of our approach. Our code and benchmark dataset are available at
+https://github.com/ict-bigdatalab/RC-RAG.
+
+
+
+
+
+
+
+ ♻ ☆ The use of large language models to enhance cancer clinical trial
+ educational materials
+
+
+
+
+
+
+
+
+ Mingye Gao, Aman Varshney, Shan Chen, Vikram Goddla, Jack Gallifant, Patrick Doyle, Claire Novack, Maeve Dillon-Martin, Teresia Perkins, Xinrong Correia, Erik Duhaime, Howard Isenstein, Elad Sharon, Lisa Soleymani Lehmann, David Kozono, Brian Anthony, Dmitriy Dligach, Danielle S. Bitterman
+
+
+ Cancer clinical trials often face challenges in recruitment and engagement
+due to a lack of participant-facing informational and educational resources.
+This study investigated the potential of Large Language Models (LLMs),
+specifically GPT4, in generating patient-friendly educational content from
+clinical trial informed consent forms. Using data from ClinicalTrials.gov, we
+employed zero-shot learning for creating trial summaries and one-shot learning
+for developing multiple-choice questions, evaluating their effectiveness
+through patient surveys and crowdsourced annotation. Results showed that
+GPT4-generated summaries were both readable and comprehensive, and may improve
+patients' understanding and interest in clinical trials. The multiple-choice
+questions demonstrated high accuracy and agreement with crowdsourced
+annotators. For both resource types, hallucinations were identified that
+require ongoing human oversight. The findings demonstrate the potential of LLMs
+"out-of-the-box" to support the generation of clinical trial education
+materials with minimal trial-specific engineering, but implementation with a
+human-in-the-loop is still needed to avoid misinformation risks.
+
+
+
+
+
+
+
+ ♻ ☆ EMOVOME: A Dataset for Emotion Recognition in Spontaneous Real-Life
+ Speech
+
+
+
+
+
+
+
+
+ Lucía Gómez-Zaragozá, Rocío del Amor, María José Castro-Bleda, Valery Naranjo, Mariano Alcañiz Raya, Javier Marín-Morales
+
+
+ Spontaneous datasets for Speech Emotion Recognition (SER) are scarce and
+frequently derived from laboratory environments or staged scenarios, such as TV
+shows, limiting their application in real-world contexts. We developed and
+publicly released the Emotional Voice Messages (EMOVOME) dataset, including 999
+voice messages from real conversations of 100 Spanish speakers on a messaging
+app, labeled in continuous and discrete emotions by expert and non-expert
+annotators. We evaluated speaker-independent SER models using acoustic features
+as baseline and transformer-based models. We compared the results with
+reference datasets including acted and elicited speech, and analyzed the
+influence of annotators and gender fairness. The pre-trained
+UniSpeech-SAT-Large model achieved the highest results, 61.64% and 55.57%
+Unweighted Accuracy (UA) for 3-class valence and arousal prediction
+respectively on EMOVOME, a 10% improvement over baseline models. For the
+emotion categories, 42.58% UA was obtained. EMOVOME performed lower than the
+acted RAVDESS dataset. The elicited IEMOCAP dataset also outperformed EMOVOME
+in predicting emotion categories, while similar results were obtained in
+valence and arousal. EMOVOME outcomes varied with annotator labels, showing
+better results and fairness when combining expert and non-expert annotations.
+This study highlights the gap between controlled and real-life scenarios,
+supporting further advancements in recognizing genuine emotions.
+
+
+
+ comment: This article is a merged version of the description of the EMOVOME
+ database in arXiv:2402.17496v1 and the speech emotion recognition models in
+ arXiv:2403.02167v1. This work has been submitted to the IEEE for possible
+ publication
+
+
+
+
+
+
+ ♻ ☆ SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal
+ Large Language Models
+
+
+ Multimodal Large Language Models (MLLMs) are advancing the ability to reason
+about complex sports scenarios by integrating textual and visual information.
+To comprehensively evaluate their capabilities, we introduce SPORTU, a
+benchmark designed to assess MLLMs across multi-level sports reasoning tasks.
+SPORTU comprises two key components: SPORTU-text, featuring 900 multiple-choice
+questions with human-annotated explanations for rule comprehension and strategy
+understanding. This component focuses on testing models' ability to reason
+about sports solely through question-answering (QA), without requiring visual
+inputs; SPORTU-video, consisting of 1,701 slow-motion video clips across 7
+different sports and 12,048 QA pairs, designed to assess multi-level reasoning,
+from simple sports recognition to complex tasks like foul detection and rule
+application. We evaluate four prevalent LLMs on the SPORTU-text part, mainly
+using few-shot learning supplemented by chain-of-thought (CoT) prompting.
+GPT-4o achieves the highest accuracy of 71%, but
+still falls short of human-level performance, highlighting room for improvement
+in rule comprehension and reasoning. The evaluation for the SPORTU-video part
+includes 7 proprietary and 6 open-source MLLMs. Experiments show that models
+fall short on hard tasks that require deep reasoning and rule-based
+understanding. Claude-3.5-Sonnet performs the best with only 52.6% accuracy on
+the hard task, showing large room for improvement. We hope that SPORTU will
+serve as a critical step toward evaluating models' capabilities in sports
+understanding and reasoning.
+
+
+ While there has been progress towards aligning Large Language Models (LLMs)
+with human values and ensuring safe behaviour at inference time, safety-guards
+can easily be removed when fine-tuned on unsafe and harmful datasets. While this
+setting has been treated extensively, another popular training paradigm,
+learning from unsafe feedback with reinforcement learning, has previously been
+unexplored. This is concerning due to the widespread deployment of feedback
+collection systems. We address this gap by providing an analysis of learning
+settings where feedback is adversarial and noisy, i.e., unsafe samples are
+preferred over safe ones despite model developers' goal to maintain safety. We
+find that safety-aligned LLMs easily explore unsafe action spaces through
+generating harmful text and optimize for adversarial reward indicating that
+current safety guards are not enough to prevent learning from unsafe feedback.
+In order to protect against this vulnerability, we adapt a number of both
+"implicit" and "explicit" harmful fine-tuning defences to evaluate whether they
+are effective as learning constraints in an RL setting, finding that no method
+is generally effective, which points to the need for more research in defences
+given the widespread adoption of methods designed to learn from feedback. We end the
+paper with the observation that some defences work by performing "harmless
+reward hacking" for which we provide a theoretical explanation drawn from the
+theory of Constrained Markov Decision Processes and provide some direction for
+future defence development.
+
+
+
+
+
+
+
+ ♻ ☆ Labrador: Exploring the Limits of Masked Language Modeling for
+ Laboratory Data ML4H 2024
+
+
+
+
+
+
+
+
+ David R. Bellamy, Bhawesh Kumar, Cindy Wang, Andrew Beam
+
+
+ In this work we introduce Labrador, a pre-trained Transformer model for
+laboratory data. Labrador and BERT were pre-trained on a corpus of 100 million
+lab test results from electronic health records (EHRs) and evaluated on various
+downstream outcome prediction tasks. Both models demonstrate mastery of the
+pre-training task but neither consistently outperforms XGBoost on downstream
+supervised tasks. Our ablation studies reveal that transfer learning shows
+limited effectiveness for BERT and achieves marginal success with Labrador. We
+explore the reasons for the failure of transfer learning and suggest that the
+data generating process underlying each patient cannot be characterized
+sufficiently using labs alone, among other factors. We encourage future work to
+focus on joint modeling of multiple EHR data categories and to include
+tree-based baselines in their evaluations.
+
+
+
+ comment: 26 pages, 8 figures, best paper award at ML4H 2024
+
+
+
+
+
+
+ ♻ ☆ If CLIP Could Talk: Understanding Vision-Language Model Representations
+ Through Their Preferred Concept Descriptions EMNLP 2024
+
+
+
+
+
+
+
+
+ Reza Esfandiarpoor, Cristina Menghini, Stephen H. Bach
+
+
+ Recent works often assume that Vision-Language Model (VLM) representations
+are based on visual attributes like shape. However, it is unclear to what
+extent VLMs prioritize this information to represent concepts. We propose
+Extract and Explore (EX2), a novel approach to characterize textual features
+that are important for VLMs. EX2 uses reinforcement learning to align a large
+language model with VLM preferences and generates descriptions that incorporate
+features that are important for the VLM. Then, we inspect the descriptions to
+identify features that contribute to VLM representations. Using EX2, we find
+that spurious descriptions have a major role in VLM representations despite
+providing no helpful information, e.g., Click to enlarge photo of CONCEPT. More
+importantly, among informative descriptions, VLMs rely significantly on
+non-visual attributes like habitat (e.g., North America) to represent visual
+concepts. Also, our analysis reveals that different VLMs prioritize different
+attributes in their representations. Overall, we show that VLMs do not simply
+match images to scene descriptions and that non-visual or even spurious
+descriptions significantly influence their representations.
+
+
+
+ comment: EMNLP 2024
+
+
+
+
+
+
+ ♻ ☆ Mixing It Up: The Cocktail Effect of Multi-Task Fine-Tuning on LLM
+ Performance -- A Case Study in Finance
+
+
+
+
+
+
+
+
+ Meni Brief, Oded Ovadia, Gil Shenderovitz, Noga Ben Yoash, Rachel Lemberg, Eitam Sheetrit
+
+
+ The application of large language models (LLMs) in domain-specific contexts,
+including finance, has expanded rapidly. Domain-specific LLMs are typically
+evaluated based on their performance in various downstream tasks relevant to
+the domain. In this work, we present a detailed analysis of fine-tuning LLMs
+for such tasks. Somewhat counterintuitively, we find that in domain-specific
+cases, fine-tuning exclusively on the target task is not always the most
+effective strategy. Instead, multi-task fine-tuning - where models are trained
+on a cocktail of related tasks - can significantly enhance performance. We
+demonstrate how this approach enables a small model, such as Phi-3-Mini, to
+achieve state-of-the-art results, even surpassing the much larger GPT-4-o model
+on financial benchmarks. Our study involves a large-scale experiment,
+conducting over 200 training experiments using several widely adopted LLMs as
+baselines, and empirically confirms the benefits of multi-task fine-tuning.
+Additionally, we explore the use of general instruction data as a form of
+regularization, suggesting that it helps minimize performance degradation. We
+also investigate the inclusion of mathematical data, finding improvements in
+numerical reasoning that transfer effectively to financial tasks. Finally, we
+note that while fine-tuning for downstream tasks leads to targeted improvements
+in task performance, it does not necessarily result in broader gains in domain
+knowledge or complex domain reasoning abilities.
+
+
+
+
+
+
+
+ ♻ ☆ CoRNStack: High-Quality Contrastive Data for Better Code Ranking
+
+
+ Effective code retrieval plays a crucial role in advancing code generation,
+bug fixing, and software maintenance, particularly as software systems increase
+in complexity. While current code embedding models have demonstrated promise in
+retrieving code snippets for small-scale, well-defined tasks, they often
+underperform in more demanding real-world applications such as bug localization
+within GitHub repositories. We hypothesize that a key issue is their reliance
+on noisy and inconsistent datasets for training, which impedes their ability to
+generalize to more complex retrieval scenarios. To address these limitations,
+we introduce CoRNStack, a large-scale, high-quality contrastive training
+dataset for code that spans multiple programming languages. This dataset is
+curated using consistency filtering to eliminate noisy positives and is further
+enriched with mined hard negatives, thereby facilitating more effective
+learning. We demonstrate that contrastive training of embedding models using
+CoRNStack leads to state-of-the-art performance across a variety of code
+retrieval tasks. Furthermore, the dataset can be leveraged for training code
+reranking models, a largely underexplored area compared to text reranking. Our
+finetuned code reranking model significantly improves the ranking quality over
+the retrieved results. Finally, by employing our code retriever and reranker
+together, we demonstrate significant improvements in function localization for
+GitHub issues, an important component of real-world software development.
+
+
+ Text-to-Image (T2I) models have shown great performance in generating images
+based on textual prompts. However, these models are vulnerable to unsafe inputs
+that lead them to generate unsafe content such as sexual, harassment, and
+illegal-activity images. Existing studies based on image checkers, model
+fine-tuning, and embedding blocking are impractical in real-world applications.
+Hence, we propose the first universal prompt optimizer for safe T2I (POSI)
+generation in black-box scenarios. We first construct a dataset consisting of
+toxic-clean prompt pairs using GPT-3.5 Turbo. To guide the optimizer to convert
+toxic prompts into clean prompts while preserving semantic information, we design a
+novel reward function measuring toxicity and text alignment of generated images
+and train the optimizer through Proximal Policy Optimization. Experiments show
+that our approach can effectively reduce the likelihood of various T2I models
+in generating inappropriate images, with no significant impact on text
+alignment. It can also be flexibly combined with other methods to achieve better
+performance. Our code is available at https://github.com/wu-zongyu/POSI.
+
+
+
+
+
+
+
+ ♻ ☆ MetaTool Benchmark for Large Language Models: Deciding Whether to Use
+ Tools and Which to Use
+
+
+
+
+
+
+
+
+ Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao Wan, Neil Zhenqiang Gong, Lichao Sun
+
+
+ Large language models (LLMs) have garnered significant attention due to their
+impressive natural language processing (NLP) capabilities. Recently, many
+studies have focused on the tool utilization ability of LLMs. They primarily
+investigated how LLMs effectively collaborate with given specific tools.
+However, in scenarios where LLMs serve as intelligent agents, as seen in
+applications like AutoGPT and MetaGPT, LLMs are expected to engage in intricate
+decision-making processes that involve deciding whether to employ a tool and
+selecting the most suitable tool(s) from a collection of available tools to
+fulfill user requests. Therefore, in this paper, we introduce MetaTool, a
+benchmark designed to evaluate whether LLMs have tool usage awareness and can
+correctly choose tools. Specifically, we create a dataset called ToolE within
+the benchmark. This dataset contains various types of user queries in the form
+of prompts that trigger LLMs to use tools, including both single-tool and
+multi-tool scenarios. Subsequently, we set the tasks for both tool usage
+awareness and tool selection. We define four subtasks from different
+perspectives in tool selection, including tool selection with similar choices,
+tool selection in specific scenarios, tool selection with possible reliability
+issues, and multi-tool selection. We conduct experiments involving eight
+popular LLMs and find that the majority of them still struggle to effectively
+select tools, highlighting the existing gaps between LLMs and genuine
+intelligent agents. However, through the error analysis, we found there is
+still significant room for improvement. Finally, we conclude with insights for
+tool developers -- we strongly recommend that tool developers choose an
+appropriate rewrite model for generating new descriptions based on the
+downstream LLM the tool will apply to. Our code is in
+https://github.com/HowieHwong/MetaTool.
+
+
+
+
+
+
+
+ ♻ ☆ Prometheus 2: An Open Source Language Model Specialized in Evaluating
+ Other Language Models EMNLP 2024
+
+
+
+
+
+
+
+
+ Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo
+
+
+ Proprietary LMs such as GPT-4 are often employed to assess the quality of
+responses from various LMs. However, concerns including transparency,
+controllability, and affordability strongly motivate the development of
+open-source LMs specialized in evaluations. On the other hand, existing open
+evaluator LMs exhibit critical shortcomings: 1) they issue scores that
+significantly diverge from those assigned by humans, and 2) they lack the
+flexibility to perform both direct assessment and pairwise ranking, the two
+most prevalent forms of assessment. Additionally, they do not possess the
+ability to evaluate based on custom evaluation criteria, focusing instead on
+general attributes like helpfulness and harmlessness. To address these issues,
+we introduce Prometheus 2, a more powerful evaluator LM than its predecessor
+that closely mirrors human and GPT-4 judgements. Moreover, it is capable of
+processing both direct assessment and pairwise ranking formats grouped with
+user-defined evaluation criteria. On four direct assessment benchmarks and four
+pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and
+agreement with humans and proprietary LM judges among all tested open evaluator
+LMs. Our models, code, and data are all publicly available at
+https://github.com/prometheus-eval/prometheus-eval.
+
+
+
+ comment: EMNLP 2024 (Main Conference)
+
+
+
+
+
+
+ ♻ ☆ Towards a Psychology of Machines: Large Language Models Predict Human
+ Memory
+
+
+ Large language models (LLMs), such as ChatGPT, have shown remarkable
+abilities in natural language processing, opening new avenues in psychological
+research. This study explores whether LLMs can predict human memory performance
+in tasks involving garden-path sentences and contextual information. In the
+first part, we used ChatGPT to rate the relatedness and memorability of
+garden-path sentences preceded by either fitting or unfitting contexts. In the
+second part, human participants read the same sentences, rated their
+relatedness, and completed a surprise memory test. The results demonstrated
+that ChatGPT's relatedness ratings closely matched those of the human
+participants, and its memorability ratings effectively predicted human memory
+performance. Both LLM and human data revealed that higher relatedness in the
+unfitting context condition was associated with better memory performance,
+aligning with probabilistic frameworks of context-dependent learning. These
+findings suggest that LLMs, despite lacking human-like memory mechanisms, can
+model aspects of human cognition and serve as valuable tools in psychological
+research. We propose the field of machine psychology to explore this interplay
+between human cognition and artificial intelligence, offering a bidirectional
+approach where LLMs can both benefit from and contribute to our understanding
+of human cognitive processes.
+
+
+
+ comment: 34 pages, 3 figures, 2 tables
+
+
+
+
+
+
+
+
+
+ Information Retrieval 8
+
+
+
+
+
+ ☆ Freshness and Informativity Weighted Cognitive Extent and Its
+ Correlation with Cumulative Citation Count
+
+
+ In this paper, we revisit cognitive extent, originally defined as the number
+of unique phrases in a quota. We introduce Freshness and Informativity Weighted
+Cognitive Extent (FICE), calculated based on two novel weighting factors, the
+lifetime ratio and informativity of scientific entities. We model the lifetime
+of each scientific entity as the time-dependent document frequency, which is
+fit by the composition of multiple Gaussian profiles. The lifetime ratio is
+then calculated as the cumulative document frequency at the publication time
+$t_0$ divided by the cumulative document frequency over its entire lifetime.
+The informativity is calculated by normalizing the document frequency across
+all scientific entities recognized in a title. Using the ACL Anthology, we
+verified the trend formerly observed in several other domains that the number
+of unique scientific entities per quota increased gradually at a slower rate.
+We found that FICE exhibits a strong correlation with the average cumulative
+citation count within a quota. Our code is available at
+https://github.com/ZiheHerzWang/Freshness-and-Informativity-Weighted-Cognitive-Extent.
+
+
+
+
+
+
+
+ ☆ YT-30M: A multi-lingual multi-category dataset of YouTube comments
+
+
+ This paper introduces two large-scale multilingual comment datasets, YT-30M
+(and YT-100K) from YouTube. The analysis in this paper is performed on a
+smaller sample (YT-100K) of YT-30M. Both datasets, YT-30M (full) and
+YT-100K (a randomly selected 100K sample from YT-30M), are publicly released for
+further research. YT-30M (YT-100K) contains 32236173 (108694) comments posted
+by YouTube channels that belong to YouTube categories. Each comment is
+associated with a video ID, comment ID, commentor name, commentor channel ID,
+comment text, upvotes, original channel ID and category of the YouTube channel
+(e.g., 'News & Politics', 'Science & Technology', etc.).
+
+
+ While question-like queries are gaining popularity and search engines' users
+increasingly adopt them, keyphrase search has traditionally been the
+cornerstone of web search. This query type is also prevalent in specialised
+search tasks such as academic or professional search, where experts rely on
+keyphrases to articulate their information needs. However, current dense
+retrieval models often fail with keyphrase-like queries, primarily because they
+are mostly trained on question-like ones. This paper introduces a novel model
+that employs the ColBERT architecture to enhance document ranking for keyphrase
+queries. For that, given the lack of large keyphrase-based retrieval datasets,
+we first explore how Large Language Models can convert question-like queries
+into keyphrase format. Then, using those keyphrases, we train a keyphrase-based
+ColBERT ranker (ColBERTKP_QD) to improve the performance when working with
+keyphrase queries. Furthermore, to reduce the training costs associated with
+training the full ColBERT model, we investigate the feasibility of training
+only a keyphrase query encoder while keeping the document encoder weights
+static (ColBERTKP_Q). We assess our proposals' ranking performance using both
+automatically generated and manually annotated keyphrases. Our results reveal
+the potential of the late interaction architecture when working under the
+keyphrase search scenario.
+
+
+
+
+
+
+
+ ☆ Enhancing Recommendation Systems with GNNs and Addressing Over-Smoothing
+
+
+ This paper addresses key challenges in enhancing recommendation systems by
+leveraging Graph Neural Networks (GNNs) and addressing inherent limitations
+such as over-smoothing, which reduces model effectiveness as network hierarchy
+deepens. The proposed approach introduces three GNN-based recommendation
+models, specifically designed to mitigate over-smoothing through innovative
+mechanisms like residual connections and identity mapping within the
+aggregation propagation process. These modifications enable more effective
+information flow across layers, preserving essential user-item interaction
+details to improve recommendation accuracy. Additionally, the study emphasizes
+the critical need for interpretability in recommendation systems, aiming to
+provide transparent and justifiable suggestions tailored to dynamic user
+preferences. By integrating collaborative filtering with GNN architectures, the
+proposed models not only enhance predictive accuracy but also align
+recommendations more closely with individual behaviors, adapting to nuanced
+shifts in user interests. This work advances the field by tackling both
+technical and user-centric challenges, contributing to the development of
+robust and explainable recommendation systems capable of managing the
+complexity and scale of modern online environments.
+
+
+
+
+
+
+
+ ☆ CLAS: A Machine Learning Enhanced Framework for Exploring Large 3D
+ Design Datasets
+
+
+ Three-dimensional (3D) objects have wide applications. Despite the growing
+interest in 3D modeling in academia and industries, designing and/or creating
+3D objects from scratch remains time-consuming and challenging. With the
+development of generative artificial intelligence (AI), designers have discovered a
+new way to create images for ideation. However, generative AIs are less useful
+in creating 3D objects with satisfying qualities. To allow 3D designers to
+access a wide range of 3D objects for creative activities based on their
+specific demands, we propose a machine learning (ML) enhanced framework CLAS -
+named after the four steps of capture, label, associate, and search - to enable
+fully automatic retrieval of 3D objects based on user specifications leveraging
+the existing datasets of 3D objects. CLAS provides an effective and efficient
+method for any person or organization to benefit from their existing but not
+utilized 3D datasets. In addition, CLAS may also be used to produce
+high-quality 3D object synthesis datasets for training and evaluating 3D
+generative models. As a proof of concept, we created and showcased a search
+system with a web user interface (UI) for retrieving 6,778 3D objects of chairs
+in the ShapeNet dataset powered by CLAS. In a closed-set retrieval setting, our
+retrieval method achieves a mean reciprocal rank (MRR) of 0.58, top 1 accuracy
+of 42.27%, and top 10 accuracy of 89.64%.
+
+
+
+
+
+
+
+ ☆ Recommender Systems for Sustainability: Overview and Research Issues
+
+
+
+
+
+
+
+
+ Alexander Felfernig, Manfred Wundara, Thi Ngoc Trang Tran, Seda Polat-Erdeniz, Sebastian Lubos, Merfat El-Mansi, Damian Garber, Viet-Man Le
+
+
+ Sustainability development goals (SDGs) are regarded as a universal call to
+action with the overall objectives of planet protection, ending of poverty, and
+ensuring peace and prosperity for all people. In order to achieve these
+objectives, different AI technologies play a major role. Specifically,
+recommender systems can provide support for organizations and individuals to
+achieve the defined goals. Recommender systems integrate AI technologies such
+as machine learning, explainable AI (XAI), case-based reasoning, and constraint
+solving in order to find and explain user-relevant alternatives from a
+potentially large set of options. In this article, we summarize the state of
+the art in applying recommender systems to support the achievement of
+sustainability development goals. In this context, we discuss open issues for
+future research.
+
+
+
+
+
+
+
+ ♻ ☆ Mathematical Information Retrieval: Search and Question Answering
+
+
+ Mathematical information is essential for technical work, but its creation,
+interpretation, and search are challenging. To help address these challenges,
+researchers have developed multimodal search engines and mathematical question
+answering systems. This book begins with a simple framework characterizing the
+information tasks that people and systems perform as we work to answer
+math-related questions. The framework is used to organize and relate the other
+core topics of the book, including interactions between people and systems,
+representing math formulas in sources, and evaluation. We close by addressing
+some key questions and presenting directions for future work. This book is
+intended for students, instructors, and researchers interested in systems that
+help us find and use mathematical information.
+
+
+
+ comment: [DRAFT] Revised (2nd) draft
+
+
+
+
+
+
+ ♻ ☆ CoRNStack: High-Quality Contrastive Data for Better Code Ranking
+
+
+ Effective code retrieval plays a crucial role in advancing code generation,
+bug fixing, and software maintenance, particularly as software systems increase
+in complexity. While current code embedding models have demonstrated promise in
+retrieving code snippets for small-scale, well-defined tasks, they often
+underperform in more demanding real-world applications such as bug localization
+within GitHub repositories. We hypothesize that a key issue is their reliance
+on noisy and inconsistent datasets for training, which impedes their ability to
+generalize to more complex retrieval scenarios. To address these limitations,
+we introduce CoRNStack, a large-scale, high-quality contrastive training
+dataset for code that spans multiple programming languages. This dataset is
+curated using consistency filtering to eliminate noisy positives and is further
+enriched with mined hard negatives, thereby facilitating more effective
+learning. We demonstrate that contrastive training of embedding models using
+CoRNStack leads to state-of-the-art performance across a variety of code
+retrieval tasks. Furthermore, the dataset can be leveraged for training code
+reranking models, a largely underexplored area compared to text reranking. Our
+finetuned code reranking model significantly improves the ranking quality over
+the retrieved results. Finally, by employing our code retriever and reranker
+together, we demonstrate significant improvements in function localization for
+GitHub issues, an important component of real-world software development.
+
+
+
+
+
+
+
+
+
+
+ Machine Learning 148
+
+
+
+
+
+ ☆ Navigation World Models
+
+
+
+
+
+
+
+
+ Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, Yann LeCun
+
+
+ Navigation is a fundamental skill of agents with visual-motor capabilities.
+We introduce a Navigation World Model (NWM), a controllable video generation
+model that predicts future visual observations based on past observations and
+navigation actions. To capture complex environment dynamics, NWM employs a
+Conditional Diffusion Transformer (CDiT), trained on a diverse collection of
+egocentric videos of both human and robotic agents, and scaled up to 1 billion
+parameters. In familiar environments, NWM can plan navigation trajectories by
+simulating them and evaluating whether they achieve the desired goal. Unlike
+supervised navigation policies with fixed behavior, NWM can dynamically
+incorporate constraints during planning. Experiments demonstrate its
+effectiveness in planning trajectories from scratch or by ranking trajectories
+sampled from an external policy. Furthermore, NWM leverages its learned visual
+priors to imagine trajectories in unfamiliar environments from a single input
+image, making it a flexible and powerful tool for next-generation navigation
+systems.
+
+
+
+
+
+
+
+
+ John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, Mrinank Sharma
+
+
+ We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that
+jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by
+repeatedly sampling variations of a prompt with a combination of augmentations
+- such as random shuffling or capitalization for textual prompts - until a
+harmful response is elicited. We find that BoN Jailbreaking achieves high
+attack success rates (ASRs) on closed-source language models, such as 89% on
+GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts.
+Further, it is similarly effective at circumventing state-of-the-art
+open-source defenses like circuit breakers. BoN also seamlessly extends to
+other modalities: it jailbreaks vision language models (VLMs) such as GPT-4o
+and audio language models (ALMs) like Gemini 1.5 Pro, using modality-specific
+augmentations. BoN reliably improves when we sample more augmented prompts.
+Across all modalities, ASR, as a function of the number of samples (N),
+empirically follows power-law-like behavior for many orders of magnitude. BoN
+Jailbreaking can also be composed with other black-box algorithms for even more
+effective attacks - combining BoN with an optimized prefix attack achieves up
+to a 35% increase in ASR. Overall, our work indicates that, despite their
+capability, language models are sensitive to seemingly innocuous changes to
+inputs, which attackers can exploit across modalities.
+
+
+
+
+
+
+
+ ☆ Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
+
+
+ Multimodal language models (MLMs) still face challenges in fundamental visual
+perception tasks where specialized models excel. Tasks requiring reasoning
+about 3D structures benefit from depth estimation, and reasoning about 2D
+object instances benefits from object detection. Yet, MLMs cannot produce
+intermediate depth or boxes to reason over. Finetuning MLMs on relevant data
+doesn't generalize well and outsourcing computation to specialized vision tools
+is too compute-intensive and memory-inefficient. To address this, we introduce
+Perception Tokens, intrinsic image representations designed to assist reasoning
+tasks where language is insufficient. Perception tokens act as auxiliary
+reasoning tokens, akin to chain-of-thought prompts in language models. For
+example, in a depth-related task, an MLM augmented with perception tokens can
+reason by generating a depth map as tokens, enabling it to solve the problem
+effectively. We propose AURORA, a training method that augments MLMs with
+perception tokens for improved reasoning over visual inputs. AURORA leverages a
+VQVAE to transform intermediate image representations, such as depth maps, into
+a tokenized format and bounding box tokens, which are then used in a multi-task
+training framework. AURORA achieves notable improvements across counting
+benchmarks: +10.8% on BLINK, +11.3% on CVBench, and +8.3% on SEED-Bench,
+outperforming finetuning approaches in generalization across datasets. It also
+improves on relative depth: over +6% on BLINK. With perception tokens, AURORA
+expands the scope of MLMs beyond language-based reasoning, paving the way for
+more effective visual reasoning capabilities.
+
+
+
+
+
+
+
+ ☆ NODE-AdvGAN: Improving the transferability and perceptual similarity of
+ adversarial examples by dynamic-system-driven adversarial generative model
+
+
+ Understanding adversarial examples is crucial for improving the model's
+robustness, as they introduce imperceptible perturbations that deceive models.
+Effective adversarial examples, therefore, offer the potential to train more
+robust models by removing their singularities. We propose NODE-AdvGAN, a novel
+approach that treats adversarial generation as a continuous process and employs
+a Neural Ordinary Differential Equation (NODE) for simulating the dynamics of
+the generator. By mimicking the iterative nature of traditional gradient-based
+methods, NODE-AdvGAN generates smoother and more precise perturbations that
+preserve high perceptual similarity when added to benign images. We also
+propose a new training strategy, NODE-AdvGAN-T, which enhances transferability
+in black-box attacks by effectively tuning noise parameters during training.
+Experiments demonstrate that NODE-AdvGAN and NODE-AdvGAN-T generate more
+effective adversarial examples that achieve higher attack success rates while
+preserving better perceptual quality than traditional GAN-based methods.
+
+
+
+
+
+
+
+ ☆ Evaluating Gender Bias Transfer between Pre-trained and Prompt-Adapted
+ Language Models
+
+
+ Large language models (LLMs) are increasingly being adapted to achieve
+task-specificity for deployment in real-world decision systems. Several
+previous works have investigated the bias transfer hypothesis (BTH) by studying
+the effect of the fine-tuning adaptation strategy on model fairness to find
+that fairness in pre-trained masked language models has limited effect on the
+fairness of models when adapted using fine-tuning. In this work, we expand the
+study of BTH to causal models under prompt adaptations, as prompting is an
+accessible and compute-efficient way to deploy models in real-world systems.
+In contrast to previous works, we establish that intrinsic biases in
+pre-trained Mistral, Falcon and Llama models are strongly correlated (rho >=
+0.94) with biases when the same models are zero- and few-shot prompted, using a
+pronoun co-reference resolution task. Further, we find that bias transfer
+remains strongly correlated even when LLMs are specifically prompted to exhibit
+fair or biased behavior (rho >= 0.92), and few-shot length and stereotypical
+composition are varied (rho >= 0.97). Our findings highlight the importance of
+ensuring fairness in pre-trained LLMs, especially when they are later used to
+perform downstream tasks via prompt adaptation.
+
+
+
+
+
+
+
+ ☆ A Review on Scientific Knowledge Extraction using Large Language Models
+ in Biomedical Sciences
+
+
+
+
+
+
+
+
+ Gabriel Lino Garcia, João Renato Ribeiro Manesco, Pedro Henrique Paiola, Lucas Miranda, Maria Paola de Salvo, João Paulo Papa
+
+
+ The rapid advancement of large language models (LLMs) has opened new
+boundaries in the extraction and synthesis of medical knowledge, particularly
+within evidence synthesis. This paper reviews the state-of-the-art applications
+of LLMs in the biomedical domain, exploring their effectiveness in automating
+complex tasks such as evidence synthesis and data extraction from a biomedical
+corpus of documents. While LLMs demonstrate remarkable potential, significant
+challenges remain, including issues related to hallucinations, contextual
+understanding, and the ability to generalize across diverse medical tasks. We
+highlight critical gaps in the current research literature, particularly the
+need for unified benchmarks to standardize evaluations and ensure reliability
+in real-world applications. In addition, we propose directions for future
+research, emphasizing the integration of state-of-the-art techniques such as
+retrieval-augmented generation (RAG) to enhance LLM performance in evidence
+synthesis. By addressing these challenges and utilizing the strengths of LLMs,
+we aim to improve access to medical literature and facilitate meaningful
+discoveries in healthcare.
+
+
+ In the rapidly evolving financial sector, the accurate and timely
+interpretation of market news is essential for stakeholders needing to navigate
+unpredictable events. This paper introduces FANAL (Financial Activity News
+Alerting Language Modeling Framework), a specialized BERT-based framework
+engineered for real-time financial event detection and analysis, categorizing
+news into twelve distinct financial categories. FANAL leverages silver-labeled
+data processed through XGBoost and employs advanced fine-tuning techniques,
+alongside ORBERT (Odds Ratio BERT), a novel variant of BERT fine-tuned with
+ORPO (Odds Ratio Preference Optimization) for superior class-wise probability
+calibration and alignment with financial event relevance. We evaluate FANAL's
+performance against leading large language models, including GPT-4o, Llama-3.1
+8B, and Phi-3, demonstrating its superior accuracy and cost efficiency. This
+framework sets a new standard for financial intelligence and responsiveness,
+significantly outstripping existing models in both performance and
+affordability.
+
+
+
+ comment: Accepted for the IEEE International Workshop on Large Language Models
+ for Finance, 2024. This is a preprint version
+
+ Recently, CLIP has emerged as a valuable model for aligning image and text
+information in multi-modal scenarios. However, researchers have observed
+limitations in the ability of CLIP's text and image encoders to extract
+detailed knowledge from caption-image pairs. In response, this paper introduces
+KKLIP, a novel approach designed to enhance the quality of CLIP by
+incorporating a new knowledge distillation (KD) method derived from Llama 2.
+Our method comprises three objectives: Text Embedding Distillation, Concept
+Learning, and Contrastive Learning. Firstly, Text Embedding Distillation
+involves training the KKLIP text encoder to emulate the teacher model, Llama 2.
+Secondly, Concept Learning assigns a soft concept label to each caption-image
+pair through offline k-means clustering of text information from Llama 2,
+allowing KKLIP to learn from these soft concept labels. Finally, Contrastive
+Learning harmonizes text and image embeddings. Our experimental results
+demonstrate that KKLIP enhances the quality of both text and image encoders.
+
+
+
+
+
+
+
+ ☆ Self-test loss functions for learning weak-form operators and gradient
+ flows
+
+
+ The construction of loss functions presents a major challenge in data-driven
+modeling involving weak-form operators in PDEs and gradient flows, particularly
+due to the need to select test functions appropriately. We address this
+challenge by introducing self-test loss functions, which employ test functions
+that depend on the unknown parameters, specifically for cases where the
+operator depends linearly on the unknowns. The proposed self-test loss function
+conserves energy for gradient flows and coincides with the expected
+log-likelihood ratio for stochastic differential equations. Importantly, it is
+quadratic, facilitating theoretical analysis of identifiability and
+well-posedness of the inverse problem, while also leading to efficient
+parametric or nonparametric regression algorithms. It is computationally
+simple, requiring only low-order derivatives or even being entirely
+derivative-free, and numerical experiments demonstrate its robustness against
+noisy and discrete data.
+
+
+
+
+
+
+
+ ☆ A Bidirectional Siamese Recurrent Neural Network for Accurate Gait
+ Recognition Using Body Landmarks
+
+
+ Gait recognition is a significant biometric technique for person
+identification, particularly in scenarios where other physiological biometrics
+are impractical or ineffective. In this paper, we address the challenges
+associated with gait recognition and present a novel approach to improve its
+accuracy and reliability. The proposed method leverages advanced techniques,
+including sequential gait landmarks obtained through the Mediapipe pose
+estimation model, Procrustes analysis for alignment, and a Siamese
+biGRU-dualStack Neural Network architecture for capturing temporal
+dependencies. Extensive experiments were conducted on large-scale cross-view
+datasets to demonstrate the effectiveness of the approach, achieving high
+recognition accuracy compared to other models. The model demonstrated
+accuracies of 95.7%, 94.44%, 87.71%, and 86.6% on CASIA-B, SZU RGB-D, OU-MVLP,
+and Gait3D datasets respectively. The results highlight the potential
+applications of the proposed method in various practical domains, indicating
+its significant contribution to the field of gait recognition.
+
+
+
+
+
+
+
+ ☆ Soft Checksums to Flag Untrustworthy Machine Learning Surrogate
+ Predictions and Application to Atomic Physics Simulations
+
+
+
+
+
+
+
+
+ Casey Lauer, Robert C. Blake, Jonathan B. Freund
+
+
+ Trained neural networks (NN) are attractive as surrogate models to replace
+costly calculations in physical simulations, but are often unknowingly applied
+to states not adequately represented in the training dataset. We present the
+novel technique of soft checksums for scientific machine learning, a
+general-purpose method to differentiate between trustworthy predictions with
+small errors on in-distribution (ID) data points, and untrustworthy predictions
+with large errors on out-of-distribution (OOD) data points. By adding a check
+node to the existing output layer, we train the model to learn the chosen
+checksum function encoded within the NN predictions and show that violations of
+this function correlate with high prediction errors. As the checksum function
+depends only on the NN predictions, we can calculate the checksum error for any
+prediction with a single forward pass, incurring negligible time and memory
+costs. Additionally, we find that incorporating the checksum function into the
+loss function and exposing the NN to OOD data points during the training
+process improves separation between ID and OOD predictions. By applying soft
+checksums to a physically complex and high-dimensional non-local thermodynamic
+equilibrium atomic physics dataset, we show that a well-chosen threshold
+checksum error can effectively separate ID and OOD predictions.
+
+
+
+
+
+
+
+
+ Matthew Ricci, Guy Pelc, Zoe Piran, Noa Moriel, Mor Nitzan
+
+
+ Spatiotemporal dynamics pervade the natural sciences, from the morphogen
+dynamics underlying patterning in animal pigmentation to the protein waves
+controlling cell division. A central challenge lies in understanding how
+controllable parameters induce qualitative changes in system behavior called
+bifurcations. This endeavor is made particularly difficult in realistic
+settings where governing partial differential equations (PDEs) are unknown and
+data is limited and noisy. To address this challenge, we propose TRENDy
+(Temporal Regression of Effective Nonlinear Dynamics), an equation-free
+approach to learning low-dimensional, predictive models of spatiotemporal
+dynamics. Following classical work in spatial coarse-graining, TRENDy first
+maps input data to a low-dimensional space of effective dynamics via a cascade
+of multiscale filtering operations. Our key insight is the recognition that
+these effective dynamics can be fit by a neural ordinary differential equation
+(NODE) having the same parameter space as the input PDE. The preceding
+filtering operations strongly regularize the phase space of the NODE, making
+TRENDy significantly more robust to noise compared to existing methods. We
+train TRENDy to predict the effective dynamics of synthetic and real data
+representing dynamics from across the physical and life sciences. We then
+demonstrate how our framework can automatically locate both Turing and Hopf
+bifurcations in unseen regions of parameter space. We finally apply our method
+to the analysis of spatial patterning of the ocellated lizard through
+development. We found that TRENDy's effective state not only accurately
+predicts spatial changes over time but also identifies distinct pattern
+features unique to different anatomical regions, highlighting the potential
+influence of surface geometry on reaction-diffusion mechanisms and their role
+in driving spatially varying pattern dynamics.
+
+
+ Adequately generating and evaluating prediction models based on supervised
+machine learning (ML) is often challenging, especially for less experienced
+users in applied research areas. Special attention is required in settings
+where the model generation process involves hyperparameter tuning, i.e.
+data-driven optimization of different types of hyperparameters to improve the
+predictive performance of the resulting model. Discussions about tuning
+typically focus on the hyperparameters of the ML algorithm (e.g., the minimum
+number of observations in each terminal node for a tree-based algorithm). In
+this context, it is often neglected that hyperparameters also exist for the
+preprocessing steps that are applied to the data before it is provided to the
+algorithm (e.g., how to handle missing feature values in the data). As a
+consequence, users experimenting with different preprocessing options to
+improve model performance may be unaware that this constitutes a form of
+hyperparameter tuning - albeit informal and unsystematic - and thus may fail to
+report or account for this optimization. To illuminate this issue, this paper
+reviews and empirically illustrates different procedures for generating and
+evaluating prediction models, explicitly addressing the different ways
+algorithm and preprocessing hyperparameters are typically handled by applied ML
+users. By highlighting potential pitfalls, especially those that may lead to
+exaggerated performance claims, this review aims to further improve the quality
+of predictive modeling in ML applications.
+
+
+
+
+
+
+
+ ☆ Flow Matching with General Discrete Paths: A Kinetic-Optimal Perspective
+
+
+
+
+
+
+
+
+ Neta Shaul, Itai Gat, Marton Havasi, Daniel Severo, Anuroop Sriram, Peter Holderrieth, Brian Karrer, Yaron Lipman, Ricky T. Q. Chen
+
+
+ The design space of discrete-space diffusion or flow generative models is
+significantly less well-understood than their continuous-space counterparts,
+with many works focusing only on a simple masked construction. In this work, we
+aim to take a holistic approach to the construction of discrete generative
+models based on continuous-time Markov chains, and for the first time, allow
+the use of arbitrary discrete probability paths, or colloquially, corruption
+processes. Through the lens of optimizing the symmetric kinetic energy, we
+propose velocity formulas that can be applied to any given probability path,
+completely decoupling the probability and velocity, and giving the user the
+freedom to specify any desirable probability path based on expert knowledge
+specific to the data domain. Furthermore, we find that a special construction
+of mixture probability paths optimizes the symmetric kinetic energy for the
+discrete case. We empirically validate the usefulness of this new design space
+across multiple modalities: text generation, inorganic material generation, and
+image generation. We find that we can outperform the mask construction even in
+text with kinetic-optimal mixture paths, while we can make use of
+domain-specific constructions of the probability path over the visual domain.
+
+
+
+
+
+
+
+
+ Anna van Elst, Debarghya Ghoshdastidar
+
+
+ Contrastive representation learning is a modern paradigm for learning
+representations of unlabeled data via augmentations -- precisely, contrastive
+models learn to embed semantically similar pairs of samples (positive pairs)
+closer than independently drawn samples (negative samples). In spite of its
+empirical success and widespread use in foundation models, statistical theory
+for contrastive learning remains less explored. Recent works have developed
+generalization error bounds for contrastive losses, but the resulting risk
+certificates are either vacuous (certificates based on Rademacher complexity or
+$f$-divergence) or require strong assumptions about samples that are
+unreasonable in practice. The present paper develops non-vacuous PAC-Bayesian
+risk certificates for contrastive representation learning, accounting for the
+practical aspects of the popular SimCLR framework. Notably, we take into
+account that SimCLR reuses positive pairs of augmented data as negative samples
+for other data, thereby inducing strong dependence and making classical PAC or
+PAC-Bayesian bounds inapplicable. We further refine existing bounds on the
+downstream classification loss by incorporating SimCLR-specific factors,
+including data augmentation and temperature scaling, and derive risk
+certificates for the contrastive zero-one risk. The resulting bounds for
+contrastive loss and downstream prediction are much tighter than those of
+previous risk certificates, as demonstrated by experiments on CIFAR-10.
+
+
+
+
+
+
+
+ ☆ Convolutional Neural Networks and Mixture of Experts for Intrusion
+ Detection in 5G Networks and beyond
+
+
+ The advent of 6G/NextG networks comes along with a series of benefits,
+including extreme capacity, reliability, and efficiency. However, these
+networks may become vulnerable to new security threats. Therefore, 6G/NextG
+networks must be equipped with advanced Artificial Intelligence algorithms, in
+order to defend against these attacks. Existing studies on the intrusion detection task
+rely on training shallow machine learning classifiers, including Logistic
+Regression, Decision Trees, and so on, yielding suboptimal performance. Others
+are based on deep neural networks consisting of static components, which are
+not conditional on the input. This limits their representation power and
+efficiency. To resolve these issues, we present the first study integrating
+Mixture of Experts (MoE) for identifying malicious traffic. Specifically, we
+use network traffic data and convert the 1D array of features into a 2D matrix.
+Next, we pass this matrix through convolutional neural network (CNN) layers
+followed by batch normalization and max pooling layers. After obtaining the
+representation vector via the CNN layers, a sparsely gated MoE layer is used.
+This layer consists of a set of experts (dense layers) and a router, where the
+router assigns weights to the output of each expert. Sparsity is achieved by
+selecting only the most relevant experts from the full set. Finally, we perform a
+series of ablation experiments to prove the effectiveness of our proposed
+model. Experiments are conducted on the 5G-NIDD dataset, a network intrusion
+detection dataset generated from a real 5G test network. Results show that our
+approach reaches a weighted F1-score of up to 99.95%, achieving comparable
+performance to existing approaches. Findings also show that our proposed model
+achieves multiple advantages over state-of-the-art approaches.
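The sparsely gated MoE layer described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the top-k routing rule, and the use of single weight matrices as "dense layer" experts are all assumptions made for the example. For clarity it evaluates every expert and then zeroes the unselected ones, whereas a real sparse layer would skip them.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf map to exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_moe(h, router_w, expert_ws, k=2):
    """Hypothetical top-k gated mixture of experts.

    h         : (batch, d) representation vectors, e.g. from the CNN layers
    router_w  : (d, n_experts) router weights
    expert_ws : list of n_experts arrays of shape (d, d_out), one per expert
    k         : number of experts kept per input
    """
    logits = h @ router_w                              # (batch, n_experts)
    topk = np.argsort(logits, axis=1)[:, -k:]          # indices of the k largest logits
    mask = np.full_like(logits, -np.inf)
    np.put_along_axis(mask, topk, 0.0, axis=1)         # 0 for kept experts, -inf otherwise
    gates = softmax(logits + mask, axis=1)             # weights are zero outside the top-k
    expert_out = np.stack([h @ w for w in expert_ws], axis=1)  # (batch, n_experts, d_out)
    out = (gates[:, :, None] * expert_out).sum(axis=1)         # gate-weighted combination
    return out, gates
```

Each row of `gates` sums to one and has exactly `k` non-zero entries, which is the sense in which the layer is sparse.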
+
+
+ Representation learning aims to extract meaningful lower-dimensional
+embeddings from data, known as representations. Despite its widespread
+application, there is no established definition of a ``good'' representation.
+Typically, the representation quality is evaluated based on its performance in
+downstream tasks such as clustering, de-noising, etc. However, this
+task-specific approach is limited: a representation that performs
+well for one task may not necessarily be effective for another. This highlights
+the need for a more agnostic formulation, which is the focus of our work. We
+propose a downstream-agnostic formulation: when inherent clusters exist in the
+data, the representations should be specific to each cluster. Under this idea,
+we develop a meta-algorithm that jointly learns cluster-specific
+representations and cluster assignments. As our approach is easy to integrate
+with any representation learning framework, we demonstrate its effectiveness in
+various setups, including Autoencoders, Variational Autoencoders, Contrastive
+learning models, and Restricted Boltzmann Machines. We qualitatively compare
+our cluster-specific embeddings to standard embeddings and downstream tasks
+such as de-noising and clustering. While our method slightly increases runtime
+and parameters compared to the standard model, the experiments clearly show
+that it extracts the inherent cluster structures in the data, resulting in
+improved performance in relevant applications.
+
+
+
+
+
+
+
+ ☆ YT-30M: A multi-lingual multi-category dataset of YouTube comments
+
+
+ This paper introduces two large-scale multilingual comment datasets from
+YouTube: YT-30M and YT-100K. The analysis in this paper is performed on a
+smaller sample (YT-100K) of YT-30M. Both datasets, YT-30M (full) and
+YT-100K (a randomly selected 100K sample from YT-30M), are publicly released for
+further research. YT-30M (YT-100K) contains 32236173 (108694) comments posted
+by YouTube channels that belong to YouTube categories. Each comment is
+associated with a video ID, comment ID, commentor name, commentor channel ID,
+comment text, upvotes, original channel ID and category of the YouTube channel
+(e.g., 'News & Politics', 'Science & Technology', etc.).
+
+
+
+
+
+
+
+ ☆ Validity and efficiency of the conformal CUSUM procedure
+
+
+
+
+
+
+
+
+ Vladimir Vovk, Ilia Nouretdinov, Alex Gammerman
+
+
+ In this paper we study the validity and efficiency of a conformal version of
+the CUSUM procedure for change detection both experimentally and theoretically.
+
+
+
+ comment: 19 pages, 7 figures
+
+
+
+
+
+
+ ☆ State Frequency Estimation for Anomaly Detection
+
+
+ Many works have studied the efficacy of state machines for detecting
+anomalies within NetFlows. These works typically learn a model from unlabeled
+data and compute anomaly scores for arbitrary traces based on their likelihood
+of occurrence or how well they fit within the model. However, these methods do
+not dynamically adapt their scores based on the traces seen at test time. This
+becomes a problem when an adversary produces seemingly common traces in their
+attack, causing the model to miss the detection by assigning low anomaly
+scores. We propose SEQUENT, a new approach that uses the state visit frequency
+to adapt its scoring for anomaly detection dynamically. SEQUENT subsequently
+uses the scores to generate root causes for anomalies. These allow the grouping
+of alarms and simplify the analysis of anomalies. Our evaluation of SEQUENT on
+three NetFlow datasets indicates that our approach outperforms existing
+methods, demonstrating its effectiveness in detecting anomalies.
+
+
+
+
+
+
+
+
+ Dung Thuy Nguyen, Ngoc N. Tran, Taylor T. Johnson, Kevin Leach
+
+
+ In recent years, the rise of machine learning (ML) in cybersecurity has
+brought new challenges, including the increasing threat of backdoor poisoning
+attacks on ML malware classifiers. For instance, adversaries could inject
+malicious samples into public malware repositories, contaminating the training
+data and potentially causing the ML model to misclassify malware. Current
+countermeasures predominantly focus on detecting poisoned samples by leveraging
+disagreements within the outputs of a diverse set of ensemble models on
+training data points. However, these methods are not suitable for scenarios
+where Machine Learning-as-a-Service (MLaaS) is used or when users aim to remove
+backdoors from a model after it has been trained. Addressing this scenario, we
+introduce PBP, a post-training defense for malware classifiers that mitigates
+various types of backdoor embeddings without assuming any specific backdoor
+embedding mechanism. Our method exploits the influence of backdoor attacks on
+the activation distribution of neural networks, independent of the
+trigger-embedding method. In the presence of a backdoor attack, the activation
+distribution of each layer is distorted into a mixture of distributions. By
+regulating the statistics of the batch normalization layers, we can guide a
+backdoored model to perform similarly to a clean one. Our method demonstrates
+substantial advantages over several state-of-the-art methods, as evidenced by
+experiments on two datasets, two types of backdoor methods, and various attack
+configurations. Notably, our approach requires only a small portion of the
+training data (just 1%) to purify the backdoor and reduce the attack
+success rate from 100% to almost 0%, a 100-fold improvement over the baseline
+methods. Our code is available at
+https://github.com/judydnguyen/pbp-backdoor-purification-official.
+
+
+
+ comment: Accepted at NDSS 2025
+
+
+
+
+
+
+ ☆ SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale
+ Spectral Diffusion Model
+
+
+ Recent advancements in generative models have significantly enhanced talking
+face video generation, yet singing video generation remains underexplored. The
+fundamental differences between human talking and singing, specifically in audio
+characteristics and behavioral expressions, limit the performance of existing
+talking face video generation models when applied to singing. We
+observe that the differences between singing and talking audios manifest in
+terms of frequency and amplitude. To address this, we have designed a
+multi-scale spectral module to help the model learn singing patterns in the
+spectral domain. Additionally, we develop a spectral-filtering module that aids
+the model in learning the human behaviors associated with singing audio. These
+two modules are integrated into the diffusion model to enhance singing video
+generation performance, resulting in our proposed model, SINGER. Furthermore,
+the lack of high-quality real-world singing face videos has hindered the
+development of the singing video generation community. To address this gap, we
+have collected an in-the-wild audio-visual singing dataset to facilitate
+research in this area. Our experiments demonstrate that SINGER is capable of
+generating vivid singing videos and outperforms state-of-the-art methods in
+both objective and subjective evaluations.
+
+
+
+
+
+
+
+ ☆ Assessing Foundation Models' Transferability to Physiological Signals in
+ Precision Medicine
+
+
+ The success of precision medicine requires computational models that can
+effectively process and interpret diverse physiological signals across
+heterogeneous patient populations. While foundation models have demonstrated
+remarkable transfer capabilities across various domains, their effectiveness in
+handling individual-specific physiological signals - crucial for precision
+medicine - remains largely unexplored. This work introduces a systematic
+pipeline for rapidly and efficiently evaluating foundation models' transfer
+capabilities in medical contexts. Our pipeline employs a three-stage approach.
+First, it leverages physiological simulation software to generate diverse,
+clinically relevant scenarios, particularly focusing on data-scarce medical
+conditions. This simulation-based approach enables both targeted capability
+assessment and subsequent model fine-tuning. Second, the pipeline projects
+these simulated signals through the foundation model to obtain embeddings,
+which are then evaluated using linear methods. This evaluation quantifies the
+model's ability to capture three critical aspects: physiological feature
+independence, temporal dynamics preservation, and medical scenario
+differentiation. Finally, the pipeline validates these representations through
+specific downstream medical tasks. Initial testing of our pipeline on the
+Moirai time series foundation model revealed significant limitations in
+physiological signal processing, including feature entanglement, temporal
+dynamics distortion, and reduced scenario discrimination. These findings
+suggest that current foundation models may require substantial architectural
+modifications or targeted fine-tuning before deployment in clinical settings.
+
+
+
+ comment: Presented at the precision medicine workshop at the AI in Medicine
+ conference (2024) in Salt Lake City
+
+
+
+
+
+
+ ☆ Learning Semantic Association Rules from Internet of Things Data
+
+
+
+
+
+
+
+
+ Erkan Karabulut, Paul Groth, Victoria Degeler
+
+
+ Association Rule Mining (ARM) is the task of discovering commonalities in
+data in the form of logical implications. ARM is used in the Internet of Things
+(IoT) for different tasks including monitoring and decision-making. However,
+existing methods give limited consideration to IoT-specific requirements such
+as heterogeneity and volume. Furthermore, they do not utilize important static
+domain-specific description data about IoT systems, which is increasingly
+represented as knowledge graphs. In this paper, we propose a novel ARM pipeline
+for IoT data that utilizes both dynamic sensor data and static IoT system
+metadata. Furthermore, we propose an Autoencoder-based Neurosymbolic ARM method
+(Aerial) as part of the pipeline to address the high volume of IoT data and
+reduce the total number of rules that are resource-intensive to process. Aerial
+learns a neural representation of the given data and extracts association rules
+from this representation by exploiting the reconstruction (decoding) mechanism
+of an autoencoder. Extensive evaluations on 3 IoT datasets from 2 domains show
+that ARM on both static and dynamic IoT data results in more generically
+applicable rules while Aerial can learn a more concise set of high-quality
+association rules than the state-of-the-art with full coverage over the
+datasets.
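For readers unfamiliar with association rules, the two standard quality measures, support and confidence, can be sketched as below. This illustrates generic ARM metrics only, not the Aerial autoencoder method; the function names and the toy sensor items are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    s_antecedent = support(transactions, antecedent)
    if s_antecedent == 0:
        return 0.0  # rule never fires; report zero rather than divide by zero
    s_both = support(transactions, set(antecedent) | set(consequent))
    return s_both / s_antecedent

# Toy, made-up sensor readings discretized into items.
readings = [
    {"temp_high", "fan_on"},
    {"temp_high", "fan_on"},
    {"temp_high"},
    {"fan_on"},
]
rule_conf = confidence(readings, {"temp_high"}, {"fan_on"})
```

Here the rule `temp_high -> fan_on` holds in two of the three readings that contain `temp_high`, so its confidence is 2/3.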
+
+
+
+
+
+
+
+ ☆ Deep Operator BSDE: a Numerical Scheme to Approximate the Solution
+ Operators
+
+
+ Motivated by dynamic risk measures and conditional $g$-expectations, in this
+work we propose a numerical method to approximate the solution operator given
+by a Backward Stochastic Differential Equation (BSDE). The main ingredients for
+this are the Wiener chaos decomposition and the classical Euler scheme for
+BSDEs. We show convergence of this scheme under very mild assumptions, and
+provide a rate of convergence in more restrictive cases. We then implement it
+using neural networks, and we present several numerical examples where we can
+check the accuracy of the method.
+
+
+
+
+
+
+
+ ☆ Can neural operators always be continuously discretized?
+
+
+
+
+
+
+
+
+ Takashi Furuya, Michael Puthawala, Maarten V. de Hoop, Matti Lassas
+
+
+ We consider the problem of discretization of neural operators between Hilbert
+spaces in a general framework including skip connections. We focus on bijective
+neural operators through the lens of diffeomorphisms in infinite dimensions.
+Framed using category theory, we give a no-go theorem that shows that
+diffeomorphisms between Hilbert spaces or Hilbert manifolds may not admit any
+continuous approximations by diffeomorphisms on finite-dimensional spaces, even
+if the approximations are nonlinear. The natural way out is the introduction of
+strongly monotone diffeomorphisms and layerwise strongly monotone neural
+operators which have continuous approximations by strongly monotone
+diffeomorphisms on finite-dimensional spaces. For these, one can guarantee
+discretization invariance, while ensuring that finite-dimensional
+approximations converge not only as sequences of functions, but that their
+representations converge in a suitable sense as well. Finally, we show that
+bilipschitz neural operators may always be written in the form of an
+alternating composition of strongly monotone neural operators, plus a simple
+isometry. Thus we realize a rigorous platform for discretization of a
+generalization of a neural operator. We also show that neural operators of this
+type may be approximated through the composition of finite-rank residual neural
+operators, where each block is strongly monotone, and may be inverted locally
+via iteration. We conclude by providing a quantitative approximation result for
+the discretization of general bilipschitz neural operators.
+
+
+
+
+
+
+
+
+ Murat Sensoy, Lance M. Kaplan, Simon Julier, Maryam Saleki, Federico Cerutti
+
+
+ Autonomous and semi-autonomous systems are using deep learning models to
+improve decision-making. However, deep classifiers can be overly confident in
+their incorrect predictions, a major issue especially in safety-critical
+domains. The present study introduces three foundational desiderata for
+developing real-world risk-aware classification systems. Expanding upon the
+previously proposed Evidential Deep Learning (EDL), we demonstrate the unity
+between these principles and EDL's operational attributes. We then augment EDL,
+empowering autonomous agents to exercise discretion during structured
+decision-making when uncertainty and risks are inherent. We rigorously examine
+empirical scenarios to substantiate these theoretical innovations. In contrast
+to existing risk-aware classifiers, our proposed methodologies consistently
+exhibit superior performance, underscoring their transformative potential in
+risk-conscious classification strategies.
+
+
+
+ comment: Accepted for publication in Expert Systems with Applications
+
+
+
+
+
+
+ ☆ Reactive Orchestration for Hierarchical Federated Learning Under a
+ Communication Cost Budget
+
+
+
+
+
+
+
+
+ Ivan Čilić, Anna Lackinger, Pantelis Frangoudis, Ivana Podnar Žarko, Alireza Furutanpey, Ilir Murturi, Schahram Dustdar
+
+
+ Deploying a Hierarchical Federated Learning (HFL) pipeline across the
+computing continuum (CC) requires careful organization of participants into a
+hierarchical structure with intermediate aggregation nodes between FL clients
+and the global FL server. This is challenging to achieve due to (i) cost
+constraints, (ii) varying data distributions, and (iii) the volatile operating
+environment of the CC. In response to these challenges, we present a framework
+for the adaptive orchestration of HFL pipelines, designed to be reactive to
+client churn and infrastructure-level events, while balancing communication
+cost and ML model accuracy. Our mechanisms identify and react to events that
+cause HFL reconfiguration actions at runtime, building on multi-level
+monitoring information (model accuracy, resource availability, resource cost).
+Moreover, our framework introduces a generic methodology for estimating
+reconfiguration costs to continuously re-evaluate the quality of adaptation
+actions, while being extensible to optimize for various HFL performance
+criteria. By extending the Kubernetes ecosystem, our framework demonstrates the
+ability to react promptly and effectively to changes in the operating
+environment, making the best of the available communication cost budget and
+effectively balancing costs and ML performance at runtime.
+
+
+ The classical shadows protocol, introduced by Huang et al. [Nat. Phys. 16,
+1050 (2020)], makes use of the median-of-means (MoM) estimator to efficiently
+estimate the expectation values of $M$ observables with failure probability
+$\delta$ using only $\mathcal{O}(\log(M/\delta))$ measurements. In their
+analysis, Huang et al. used loose constants in their asymptotic performance
+bounds for simplicity. However, the specific values of these constants can
+significantly affect the number of shots used in practical implementations. To
+address this, we studied a modified MoM estimator proposed by Minsker [PMLR
+195, 5925 (2023)] that uses optimal constants and involves a U-statistic over
+the data set. For efficient estimation, we implemented two types of incomplete
+U-statistics estimators, the first based on random sampling and the second
+based on cyclically permuted sampling. We compared the performance of the
+original and modified estimators when used with the classical shadows protocol
+with single-qubit Clifford unitaries (Pauli measurements) for an Ising spin
+chain, and global Clifford unitaries (Clifford measurements) for the
+Greenberger-Horne-Zeilinger (GHZ) state. While the original estimator
+outperformed the modified estimators for Pauli measurements, the modified
+estimators showed improved performance over the original estimator for Clifford
+measurements. Our findings highlight the importance of tailoring estimators to
+specific measurement settings to optimize the performance of the classical
+shadows protocol in practical applications.
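The basic median-of-means estimator referred to above can be sketched as follows. This is the plain textbook version under assumed names; it is not the modified U-statistic estimator of Minsker, nor the incomplete U-statistics variants studied in the paper.

```python
import numpy as np

def median_of_means(samples, n_groups):
    """Split samples into n_groups equal batches, average each, take the median.

    Samples left over after forming equal-sized batches are discarded.
    """
    samples = np.asarray(samples, dtype=float)
    if n_groups < 1 or n_groups > len(samples):
        raise ValueError("n_groups must be between 1 and the number of samples")
    usable = (len(samples) // n_groups) * n_groups
    groups = samples[:usable].reshape(n_groups, -1)
    return float(np.median(groups.mean(axis=1)))
```

The median step is what gives robustness: a single extreme sample can corrupt only one batch mean, which the median then ignores.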
+
+
+
+ comment: 15 pages, 13 figures
+
+
+
+
+
+
+ ☆ Granular Ball Twin Support Vector Machine with Universum Data
+
+
+ Classification with support vector machines (SVM) often suffers from limited
+performance when relying solely on labeled data from target classes and is
+sensitive to noise and outliers. Incorporating prior knowledge from Universum
+data and more robust data representations can enhance accuracy and efficiency.
+Motivated by these findings, we propose a novel Granular Ball Twin Support
+Vector Machine with Universum Data (GBU-TSVM) that extends the TSVM framework
+to leverage both Universum samples and granular ball computing during model
+training. Unlike existing TSVM methods, the proposed GBU-TSVM represents data
+instances as hyper-balls rather than points in the feature space. This
+innovative approach improves the model's robustness and efficiency,
+particularly in handling noisy and large datasets. By grouping data points into
+granular balls, the model achieves superior computational efficiency, increased
+noise resistance, and enhanced interpretability. Additionally, the inclusion of
+Universum data, which consists of samples that are not strictly from the target
+classes, enriches the model with contextual information, refining the
+classification boundaries and boosting overall accuracy. Experimental results on UCI benchmark
+datasets demonstrate that the GBU-TSVM outperforms existing TSVM models in both
+accuracy and computational efficiency. These findings highlight the potential
+of the GBU-TSVM model in setting a new standard in data representation and
+classification.
+
+
+
+
+
+
+
+
+ Leizhen Wang, Peibo Duan, Zhengbing He, Cheng Lyu, Xin Chen, Nan Zheng, Li Yao, Zhenliang Ma
+
+
+ Understanding travelers' route choices can help policymakers devise optimal
+operational and planning strategies for both normal and abnormal circumstances.
+However, existing choice modeling methods often rely on predefined assumptions
+and struggle to capture the dynamic and adaptive nature of travel behavior.
+Recently, Large Language Models (LLMs) have emerged as a promising alternative,
+demonstrating remarkable ability to replicate human-like behaviors across
+various fields. Despite this potential, their capacity to accurately simulate
+human route choice behavior in transportation contexts remains unclear. To
+address this question, this paper investigates the potential of LLMs for route
+choice modeling by introducing an LLM-empowered agent, "LLMTraveler." This
+agent integrates an LLM as its core, equipped with a memory system that learns
+from past experiences and makes decisions by balancing retrieved data and
+personality traits. The study systematically evaluates the LLMTraveler's
+ability to replicate human-like decision-making through two stages: (1)
+analyzing its route-switching behavior in single origin-destination (OD) pair
+congestion game scenarios, where it demonstrates patterns that align with laboratory
+data but are not fully explained by traditional models, and (2) testing its
+capacity to model day-to-day (DTD) adaptive learning behaviors on the Ortuzar
+and Willumsen (OW) network, producing results comparable to Multinomial Logit
+(MNL) and Reinforcement Learning (RL) models. These experiments demonstrate
+that the framework can partially replicate human-like decision-making in route
+choice while providing natural language explanations for its decisions. This
+capability offers valuable insights for transportation policymaking, such as
+simulating traveler responses to new policies or changes in the network.
+
+
+
+
+
+
+
+ ☆ On Approximability of $\ell_2^2$ Min-Sum Clustering
+
+
+
+
+
+
+
+
+ Karthik C. S., Euiwoong Lee, Yuval Rabani, Chris Schwiegelshohn, Samson Zhou
+
+
+ The $\ell_2^2$ min-sum $k$-clustering problem is to partition an input set
+into clusters $C_1,\ldots,C_k$ to minimize $\sum_{i=1}^k\sum_{p,q\in
+C_i}\|p-q\|_2^2$. Although $\ell_2^2$ min-sum $k$-clustering is NP-hard, it is
+not known whether it is NP-hard to approximate $\ell_2^2$ min-sum
+$k$-clustering beyond a certain factor.
+ In this paper, we give the first hardness-of-approximation result for the
+$\ell_2^2$ min-sum $k$-clustering problem. We show that it is NP-hard to
+approximate the objective to a factor better than $1.056$ and moreover,
+assuming a balanced variant of the Johnson Coverage Hypothesis, it is NP-hard
+to approximate the objective to a factor better than 1.327.
+ We then complement our hardness result by giving the first
+$(1+\varepsilon)$-coreset construction for $\ell_2^2$ min-sum $k$-clustering.
+Our coreset uses $\mathcal{O}\left(k^{\varepsilon^{-4}}\right)$ space and can
+be leveraged to achieve a polynomial-time approximation scheme with runtime
+$nd\cdot f(k,\varepsilon^{-1})$, where $d$ is the underlying dimension of the
+input dataset and $f$ is a fixed function.
+ Finally, we consider a learning-augmented setting, where the algorithm has
+access to an oracle that outputs a label $i\in[k]$ for each input point, thereby
+implicitly partitioning the input dataset into $k$ clusters that induce an
+approximately optimal solution, up to some amount of adversarial error
+$\alpha\in\left[0,\frac{1}{2}\right)$. We give a polynomial-time algorithm that
+outputs a $\frac{1+\gamma\alpha}{(1-\alpha)^2}$-approximation to $\ell_2^2$
+min-sum $k$-clustering, for a fixed constant $\gamma>0$.
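The objective above can be evaluated directly for a given partition. The sketch below assumes the inner sum runs over ordered pairs $(p,q)$; under that convention the cost of a cluster $C$ equals $2|C|\sum_{p\in C}\|p-\mu_C\|_2^2$, where $\mu_C$ is the centroid, which the second function uses as a cross-check. Function names are illustrative only.

```python
import numpy as np

def min_sum_cost(points, labels):
    """Sum over clusters of squared distances between all ordered pairs (p, q)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        cluster = points[labels == c]
        diffs = cluster[:, None, :] - cluster[None, :, :]   # all pairwise differences
        total += float((diffs ** 2).sum())
    return total

def min_sum_cost_via_centroids(points, labels):
    """Same cost via the identity 2 * |C| * sum_p ||p - mean(C)||^2 per cluster."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        cluster = points[labels == c]
        centroid = cluster.mean(axis=0)
        total += 2 * len(cluster) * float(((cluster - centroid) ** 2).sum())
    return total
```

The centroid form makes the size weighting explicit: unlike $k$-means, each cluster's spread is multiplied by its cardinality, so the objective favors balanced clusters.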
+
+
+
+
+
+
+
+ ☆ Multi-Action Restless Bandits with Weakly Coupled Constraints:
+ Simultaneous Learning and Control
+
+
+
+
+
+
+
+
+ Jing Fu, Bill Moran, José Niño-Mora
+
+
+ We study a system with finitely many groups of multi-action bandit processes,
+each of which is a Markov decision process (MDP) with finite state and action
+spaces and potentially different transition matrices when taking different
+actions. The bandit processes of the same group share the same state and action
+spaces and, given the same action that is taken, the same transition matrix.
+All the bandit processes across various groups are subject to multiple weakly
+coupled constraints over their state and action variables. Unlike past
+studies that focused on the offline case, we consider the online case without
+assuming full knowledge of transition matrices and reward functions a priori
+and propose an effective scheme that enables simultaneous learning and control.
+We prove the convergence of the relevant processes in both the timeline and the
+number of the bandit processes, referred to as the convergence in the time and
+the magnitude dimensions. Moreover, we prove that the relevant processes
+converge exponentially fast in the magnitude dimension, leading to
+exponentially diminishing performance deviation between the proposed online
+algorithms and offline optimality.
+
+
+
+ comment: 70 pages, 0 figures
+
+
+
+
+
+
+ ☆ Scalable Bayesian Tensor Ring Factorization for Multiway Data Analysis ICONIP 2023
+
+
+ Tensor decompositions play a crucial role in numerous applications related to
+multi-way data analysis. By employing a Bayesian framework with
+sparsity-inducing priors, Bayesian Tensor Ring (BTR) factorization offers
+probabilistic estimates and an effective approach for automatically adapting
+the tensor ring rank during the learning process. However, the previous BTR method
+employs an Automatic Relevance Determination (ARD) prior, which can lead to
+sub-optimal solutions. Besides, it solely focuses on continuous data, whereas
+many applications involve discrete data. More importantly, it relies on the
+Coordinate-Ascent Variational Inference (CAVI) algorithm, which is inadequate
+for handling large tensors with extensive observations. These limitations
+greatly restrict its scale and scope of application, making it suitable only for
+small-scale problems, such as image/video completion. To address these issues,
+we propose a novel BTR model that incorporates a nonparametric Multiplicative
+Gamma Process (MGP) prior, known for its superior accuracy in identifying
+latent structures. To handle discrete data, we introduce the Pólya-Gamma
+augmentation for closed-form updates. Furthermore, we develop an efficient
+Gibbs sampler for consistent posterior simulation, which reduces the
+computational complexity of the previous VI algorithm by two orders, and an online
+EM algorithm that is scalable to extremely large tensors. To showcase the
+advantages of our model, we conduct extensive experiments on both simulation
+data and real-world applications.
+
+
+
+ comment: ICONIP 2023
+
+
+
+
+
+
+ ☆ FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning
+ IO-Awareness
+
+
+ Optimizing deep learning algorithms currently requires slow, manual
+derivation, potentially leaving much performance untapped. Methods like
+FlashAttention have achieved a 6x performance improvement over native PyTorch
+by avoiding unnecessary data transfers, but required three iterations over
+three years. Automated compiled methods have consistently lagged behind. GPUs
+are limited by both transfers to processors and available compute, with
+transfer bandwidth having improved at a far slower pace. Already, transfer
+bandwidth accounts for 46% of GPU energy costs. This indicates the future of
+energy- and capital-efficient algorithms relies on improved consideration of
+transfer costs (IO-awareness) and a systematic method for deriving optimized
+algorithms. In this paper, we present a diagrammatic approach to deep learning
+models which, with simple relabelings, derives optimal implementations and
+performance models that consider low-level memory. Diagrams generalize down the
+GPU hierarchy, providing a universal performance model for comparing hardware
+and quantization choices. Diagrams generate pseudocode, which reveals the
+application of hardware-specific features such as coalesced memory access,
+tensor core operations, and overlapped computation. We present attention
+algorithms for Ampere, which fits 13 warps per SM (FlashAttention fits 8), and
+for Hopper, which has improved overlapping and may achieve 1.32 PFLOPs.
+
+
+ Particle-based Bayesian inference methods that sample from a partition-free
+target (posterior) distribution, e.g., Stein variational gradient descent
+(SVGD), have attracted significant attention. We propose a path-guided
+particle-based sampling (PGPS) method based on a novel Log-weighted Shrinkage
+(LwS) density path linking an initial distribution to the target distribution.
+We propose to utilize a neural network to learn a vector field motivated by the
+Fokker-Planck equation of the designed density path. Particles, initiated from
+the initial distribution, evolve according to the ordinary differential
+equation defined by the vector field. The distribution of these particles is
+guided along a density path from the initial distribution to the target
+distribution. The proposed LwS density path allows for an efficient search of
+modes of the target distribution while canonical methods fail. We theoretically
+analyze the Wasserstein distance of the distribution of the PGPS-generated
+samples and the target distribution due to approximation and discretization
+errors. Practically, the proposed PGPS-LwS method demonstrates higher Bayesian
+inference accuracy and better calibration ability in experiments conducted on
+both synthetic and real-world Bayesian learning tasks, compared to baselines,
+such as SVGD and Langevin dynamics.
+
+
+
+
+
+
+
+ ☆ Conveying Emotions to Robots through Touch and Sound
+
+
+
+
+
+
+
+
+ Qiaoqiao Ren, Remko Proesmans, Frederick Bossuyt, Jan Vanfleteren, Francis Wyffels, Tony Belpaeme
+
+
+ Human emotions can be conveyed through nuanced touch gestures. However, there
+is a lack of understanding of how consistently emotions can be conveyed to
+robots through touch. This study explores the consistency of touch-based
+emotional expression toward a robot by integrating tactile and auditory sensory
+reading of affective haptic expressions. We developed a piezoresistive pressure
+sensor and used a microphone to mimic touch and sound channels, respectively.
+In a study with 28 participants, each conveyed 10 emotions to a robot using
+spontaneous touch gestures. Our findings reveal a statistically significant
+consistency in emotion expression among participants. However, some emotions
+obtained low intraclass correlation values. Additionally, certain emotions with
+similar levels of arousal or valence did not exhibit significant differences in
+the way they were conveyed. We subsequently constructed a multi-modal model
+integrating touch and audio features to decode the 10 emotions. A support
+vector machine (SVM) model demonstrated the highest accuracy, achieving 40% for
+10 classes, with "Attention" being the most accurately conveyed emotion at a
+balanced accuracy of 87.65%.
+
+
+
+
+
+
+
+ ☆ Gaussian Processes for Probabilistic Estimates of Earthquake Ground
+ Shaking: A 1-D Proof-of-Concept NeurIPS 2024
+
+
+
+
+
+
+
+
+ Sam A. Scivier, Tarje Nissen-Meyer, Paula Koelemeijer, Atılım Güneş Baydin
+
+
+ Estimates of seismic wave speeds in the Earth (seismic velocity models) are
+key input parameters to earthquake simulations for ground motion prediction.
+Owing to the non-uniqueness of the seismic inverse problem, typically many
+velocity models exist for any given region. The arbitrary choice of which
+velocity model to use in earthquake simulations impacts ground motion
+predictions. However, current hazard analysis methods do not account for this
+source of uncertainty. We present a proof-of-concept ground motion prediction
+workflow for incorporating uncertainties arising from inconsistencies between
+existing seismic velocity models. Our analysis is based on the probabilistic
+fusion of overlapping seismic velocity models using scalable Gaussian process
+(GP) regression. Specifically, we fit a GP to two synthetic 1-D velocity
+profiles simultaneously, and show that the predictive uncertainty accounts for
+the differences between the models. We subsequently draw velocity model samples
+from the predictive distribution and estimate peak ground displacement using
+acoustic wave propagation through the velocity models. The resulting
+distribution of possible ground motion amplitudes is much wider than would be
+predicted by simulating shaking using only the two input velocity models. This
+proof-of-concept illustrates the importance of probabilistic methods for
+physics-based seismic hazard analysis.
+
+
+
+ comment: 8 pages, 2 figures, accepted in the Machine Learning and the Physical
+ Sciences Workshop at NeurIPS 2024
+
+
+
+
+
+
+ ☆ Nonparametric Filtering, Estimation and Classification using Neural Jump
+ ODEs
+
+
+
+
+
+
+
+
+ Jakob Heiss, Florian Krach, Thorsten Schmidt, Félix B. Tambe-Ndonfack
+
+
+ Neural Jump ODEs model the conditional expectation between observations by
+neural ODEs and jump at arrival of new observations. They have demonstrated
+effectiveness for fully data-driven online forecasting in settings with
+irregular and partial observations, operating under weak regularity
+assumptions. This work extends the framework to input-output systems, enabling
+direct applications in online filtering and classification. We establish
+theoretical convergence guarantees for this approach, providing a robust
+solution to $L^2$-optimal filtering. Empirical experiments highlight the
+model's superior performance over classical parametric methods, particularly in
+scenarios with complex underlying distributions. These results emphasise the
+approach's potential in time-sensitive domains such as finance and health
+monitoring, where real-time accuracy is crucial.
+
+
+
+
+
+
+
+ ☆ NeRF and Gaussian Splatting SLAM in the Wild
+
+
+ Navigating outdoor environments with visual Simultaneous Localization and
+Mapping (SLAM) systems poses significant challenges due to dynamic scenes,
+lighting variations, and seasonal changes, requiring robust solutions. While
+traditional SLAM methods struggle with adaptability, deep learning-based
+approaches and emerging neural radiance fields as well as Gaussian
+Splatting-based SLAM methods, offer promising alternatives. However, these
+methods have primarily been evaluated in controlled indoor environments with
+stable conditions, leaving a gap in understanding their performance in
+unstructured and variable outdoor settings. This study addresses this gap by
+evaluating these methods in natural outdoor environments, focusing on camera
+tracking accuracy, robustness to environmental factors, and computational
+efficiency, highlighting distinct trade-offs. Extensive evaluations demonstrate
+that neural SLAM methods achieve superior robustness, particularly under
+challenging conditions such as low light, but at a high computational cost. At
+the same time, traditional methods perform the best across seasons but are
+highly sensitive to variations in lighting conditions. The code of the
+benchmark is publicly available at
+https://github.com/iis-esslingen/nerf-3dgs-benchmark.
+
+
+
+ comment: 5 pages, 2 figures, 4 tables
+
+
+
+
+
+
+ ☆ Learning on One Mode: Addressing Multi-Modality in Offline Reinforcement
+ Learning
+
+
+ Offline reinforcement learning (RL) seeks to learn optimal policies from
+static datasets without interacting with the environment. A common challenge is
+handling multi-modal action distributions, where multiple behaviours are
+represented in the data. Existing methods often assume unimodal behaviour
+policies, leading to suboptimal performance when this assumption is violated.
+We propose Weighted Imitation Learning on One Mode (LOM), a novel approach that
+focuses on learning from a single, promising mode of the behaviour policy. By
+using a Gaussian mixture model to identify modes and selecting the best mode
+based on expected returns, LOM avoids the pitfalls of averaging over
+conflicting actions. Theoretically, we show that LOM improves performance while
+maintaining simplicity in policy learning. Empirically, LOM outperforms
+existing methods on standard D4RL benchmarks and demonstrates its effectiveness
+in complex, multi-modal scenarios.
+
+
+
+
+
+
+
+ ☆ Variable-Speed Teaching-Playback as Real-World Data Augmentation for
+ Imitation Learning
+
+
+ Because imitation learning relies on human demonstrations in hard-to-simulate
+settings, the inclusion of force control in this method has resulted in a
+shortage of training data, even with a simple change in speed. Although the
+field of data augmentation has addressed the lack of data, conventional methods
+of data augmentation for robot manipulation are limited to simulation-based
+methods or downsampling for position control. This paper proposes a novel
+method of data augmentation that is applicable to force control and preserves
+the advantages of real-world datasets. We applied teaching-playback at variable
+speeds as real-world data augmentation to increase both the quantity and
+quality of environmental reactions at variable speeds. An experiment was
+conducted on bilateral control-based imitation learning using a method of
+imitation learning equipped with position-force control. We evaluated the
+effect of real-world data augmentation on two tasks, pick-and-place and wiping,
+at variable speeds, each from two human demonstrations at fixed speed. The
+results showed a maximum 55% increase in success rate from a simple change in
+speed of real-world reactions and improved accuracy along the
+duration/frequency command by gathering environmental reactions at variable
+speeds.
+
+
+
+ comment: 16 pages, 12 figures, 4 tables. This is a preprint of an article
+ submitted for consideration in ADVANCED ROBOTICS, copyright Taylor & Francis
+ and Robotics Society of Japan; ADVANCED ROBOTICS is available online at
+ http://www.tandfonline.com/
+
+ Given points from an arbitrary metric space and a sequence of point updates
+sent by an adversary, what is the minimum recourse per update (i.e., the
+minimum number of changes needed to the set of centers after an update), in
+order to maintain a constant-factor approximation to a $k$-clustering problem?
+This question has received attention in recent years under the name consistent
+clustering.
+ Previous works by Lattanzi and Vassilvitskii [ICML '17] and Fichtenberger,
+Lattanzi, Norouzi-Fard, and Svensson [SODA '21] studied $k$-clustering
+objectives, including the $k$-center and the $k$-median objectives, under only
+point insertions. In this paper we study the $k$-center objective in the fully
+dynamic setting, where the update is either a point insertion or a point
+deletion. Before our work, Łącki, Haeupler, Grunau, Rozhoň, and
+Jayaram [SODA '24] gave a deterministic fully dynamic constant-factor
+approximation algorithm for the $k$-center objective with worst-case recourse
+of $2$ per update.
+ In this work, we prove that the $k$-center clustering problem admits optimal
+recourse bounds by developing a deterministic fully dynamic constant-factor
+approximation algorithm with worst-case recourse of $1$ per update. Moreover,
+our algorithm performs simple choices based on light data structures, and thus
+is arguably more direct and faster than the previous one which uses a
+sophisticated combinatorial structure. Additionally, we develop a new
+deterministic decremental algorithm and a new deterministic incremental
+algorithm, both of which maintain a $6$-approximate $k$-center solution with
+worst-case recourse of $1$ per update. Our incremental algorithm improves over
+the $8$-approximation algorithm by Charikar, Chekuri, Feder, and Motwani [STOC
+'97]. Finally, we remark that since all three of our algorithms are
+deterministic, they work against an adaptive adversary.
+
+
+
+ comment: In Proceedings SODA 2025
+
+
+
+
+
+
+ ☆ Channel Reflection: Knowledge-Driven Data Augmentation for EEG-Based
+ Brain-Computer Interfaces
+
+
+ A brain-computer interface (BCI) enables direct communication between the
+human brain and external devices. Electroencephalography (EEG) based BCIs are
+currently the most popular for able-bodied users. To increase
+user-friendliness, usually a small amount of user-specific EEG data are used
+for calibration, which may not be enough to develop a pure data-driven decoding
+model. To cope with this typical calibration data shortage challenge in
+EEG-based BCIs, this paper proposes a parameter-free channel reflection (CR)
+data augmentation approach that incorporates prior knowledge on the channel
+distributions of different BCI paradigms in data augmentation. Experiments on
+eight public EEG datasets across four different BCI paradigms (motor imagery,
+steady-state visual evoked potential, P300, and seizure classifications) using
+different decoding algorithms demonstrated that: 1) CR is effective, i.e., it
+can noticeably improve the classification accuracy; 2) CR is robust, i.e., it
+consistently outperforms existing data augmentation approaches in the
+literature; and, 3) CR is flexible, i.e., it can be combined with other data
+augmentation approaches to further increase the performance. We suggest that
+data augmentation approaches like CR should be an essential step in EEG-based
+BCIs. Our code is available online.
+
+
+
+
+
+
+
+ ☆ Survey of different Large Language Model Architectures: Trends,
+ Benchmarks, and Challenges
+
+
+
+
+
+
+
+
+ Minghao Shao, Abdul Basit, Ramesh Karri, Muhammad Shafique
+
+
+ Large Language Models (LLMs) represent a class of deep learning models adept
+at understanding natural language and generating coherent responses to various
+prompts or queries. These models far exceed the complexity of conventional
+neural networks, often encompassing dozens of neural network layers and
+containing billions to trillions of parameters. They are typically trained on
+vast datasets, utilizing architectures based on transformer blocks. Present-day
+LLMs are multi-functional, capable of performing a range of tasks from text
+generation and language translation to question answering, as well as code
+generation and analysis. An advanced subset of these models, known as
+Multimodal Large Language Models (MLLMs), extends LLM capabilities to process
+and interpret multiple data modalities, including images, audio, and video.
+This enhancement empowers MLLMs with capabilities like video editing, image
+comprehension, and captioning for visual content. This survey provides a
+comprehensive overview of the recent advancements in LLMs. We begin by tracing
+the evolution of LLMs and subsequently delve into the advent and nuances of
+MLLMs. We analyze emerging state-of-the-art MLLMs, exploring their technical
+features, strengths, and limitations. Additionally, we present a comparative
+analysis of these models and discuss their challenges, potential limitations,
+and prospects for future development.
+
+
+
+
+
+
+
+
+ Marcin Przewięźlikowski, Randall Balestriero, Wojciech Jasiński, Marek Śmieja, Bartosz Zieliński
+
+
+ Masked Image Modeling (MIM) has emerged as a popular method for
+Self-Supervised Learning (SSL) of visual representations. However, for
+high-level perception tasks, MIM-pretrained models offer lower out-of-the-box
+representation quality than the Joint-Embedding Architectures (JEA) - another
+prominent SSL paradigm. To understand this performance gap, we analyze the
+information flow in Vision Transformers (ViT) learned by both approaches. We
+reveal that whereas JEAs construct their representation on a selected set of
+relevant image fragments, MIM models aggregate nearly whole image content.
+Moreover, we demonstrate that MIM-trained ViTs retain valuable information
+within their patch tokens, which is not effectively captured by the global
+[cls] token representations. Therefore, selective aggregation of relevant patch
+tokens, without any fine-tuning, results in consistently higher quality of MIM
+representations. To our knowledge, we are the first to highlight the lack of
+effective representation aggregation as an emergent issue of MIM and propose
+directions to address it, contributing to future advances in Self-Supervised
+Learning.
+
+
+ Transformers are widely used for their ability to capture data relations in
+sequence processing, with great success for a wide range of static tasks.
+However, the computational and memory footprint of their main component, i.e.,
+the Scaled Dot-product Attention, is commonly overlooked. This makes their
+adoption in applications involving stream data processing with constraints in
+response latency, computational and memory resources infeasible. Some works
+have proposed methods to lower the computational cost of transformers, e.g.,
+low-rank approximations, sparsity in attention, and efficient formulations for
+Continual Inference. In this paper, we introduce a new formulation of the
+Scaled Dot-product Attention based on the Nyström approximation that is
+suitable for Continual Inference. In experiments on Online Audio Classification
+and Online Action Detection tasks, the proposed Continual Scaled Dot-product
+Attention can lower the number of operations by up to three orders of magnitude
+compared to the original Transformers while retaining the predictive
+performance of competing models.
+
+
+
+ comment: 11 pages, 7 figures
+
+
+
+
+
+
+ ☆ ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable
+ Compression
+
+
+ Large Language Models (LLMs) have been widely deployed in a variety of
+applications, and the context length is rapidly increasing to handle tasks such
+as long-document QA and complex logical reasoning. However, long context poses
+significant challenges for inference efficiency, including high memory costs of
+key-value (KV) cache and increased latency due to extensive memory accesses.
+Recent works have proposed compressing KV cache to approximate computation, but
+these methods either evict tokens permanently, never recalling them for later
+inference, or recall previous tokens at the granularity of pages divided by
+textual positions. Both approaches degrade the model accuracy and output
+quality. To achieve efficient and accurate recallable KV cache compression, we
+introduce ClusterKV, which recalls tokens at the granularity of semantic
+clusters. We design and implement efficient algorithms and systems for
+clustering, selection, indexing and caching. Experimental results show that
+ClusterKV attains negligible accuracy loss across various tasks with 32k
+context lengths, using only a 1k to 2k KV cache budget, and achieves up to a
+2$\times$ speedup in latency and a 2.5$\times$ improvement in decoding
+throughput. Compared to SoTA recallable KV compression methods, ClusterKV
+demonstrates higher model accuracy and output quality, while maintaining or
+exceeding inference efficiency.
+
+
+
+
+
+
+
+
+ Lingfei Deng, Changming Zhao, Zhenbang Du, Kun Xia, Dongrui Wu
+
+
+ Semi-supervised domain adaptation (SSDA) aims at training a high-performance
+model for a target domain using few labeled target data, many unlabeled target
+data, and plenty of auxiliary data from a source domain. Previous works in SSDA
+mainly focused on learning transferable representations across domains.
+However, it is difficult to find a feature space where the source and target
+domains share the same conditional probability distribution. Additionally,
+there is no flexible and effective strategy extending existing unsupervised
+domain adaptation (UDA) approaches to SSDA settings. In order to solve the
+above two challenges, we propose a novel fine-tuning framework, semi-supervised
+transfer boosting (SS-TrBoosting). Given a well-trained deep learning-based UDA
+or SSDA model, we use it as the initial model, generate additional base
+learners by boosting, and then use all of them as an ensemble. More
+specifically, half of the base learners are generated by supervised domain
+adaptation, and half by semi-supervised learning. Furthermore, for more
+efficient data transmission and better data privacy protection, we propose a
+source data generation approach to extend SS-TrBoosting to semi-supervised
+source-free domain adaptation (SS-SFDA). Extensive experiments showed that
+SS-TrBoosting can be applied to a variety of existing UDA, SSDA and SFDA
+approaches to further improve their performance.
+
+
+ One of the key tasks in graph learning is node classification. While Graph
+neural networks have been used for various applications, their adaptation to
+the reject-option setting has not previously been explored. In this paper, we propose
+NCwR, a novel approach to node classification in Graph Neural Networks (GNNs)
+with an integrated reject option, which allows the model to abstain from making
+predictions when uncertainty is high. We propose both cost-based and
+coverage-based methods for classification with abstention in the node
+classification setting using GNNs. We perform experiments using our method on
+three standard citation network datasets (Cora, Citeseer, and Pubmed) and compare
+with relevant baselines. We also model the legal judgment prediction problem on
+the ILDC dataset as a node classification problem where nodes represent legal cases
+and edges represent citations. We further interpret the model by analyzing the
+cases that the model abstains from predicting by visualizing which part of the
+input features influenced this decision.
+
+
+
+
+
+
+
+ ☆ Semi-decentralized Training of Spatio-Temporal Graph Neural Networks for
+ Traffic Prediction
+
+
+
+
+
+
+
+
+ Ivan Kralj, Lodovico Giaretta, Gordan Ježić, Ivana Podnar Žarko, Šarūnas Girdzijauskas
+
+
+ In smart mobility, large networks of geographically distributed sensors
+produce vast amounts of high-frequency spatio-temporal data that must be
+processed in real time to avoid major disruptions. Traditional centralized
+approaches are increasingly unsuitable to this task, as they struggle to scale
+with expanding sensor networks, and reliability issues in central components
+can easily affect the whole deployment. To address these challenges, we explore
+and adapt semi-decentralized training techniques for Spatio-Temporal Graph
+Neural Networks (ST-GNNs) in the smart mobility domain. We implement a simulation
+framework where sensors are grouped by proximity into multiple cloudlets, each
+handling a subgraph of the traffic graph, fetching node features from other
+cloudlets to train its own local ST-GNN model, and exchanging model updates
+with other cloudlets to ensure consistency, enhancing scalability and removing
+reliance on a centralized aggregator. We perform extensive comparative
+evaluation of four different ST-GNN training setups -- centralized, traditional
+FL, server-free FL, and Gossip Learning -- on large-scale traffic datasets, the
+METR-LA and PeMS-BAY datasets, for short-, mid-, and long-term vehicle speed
+predictions. Experimental results show that semi-decentralized setups are
+comparable to centralized approaches in performance metrics, while offering
+advantages in terms of scalability and fault tolerance. In addition, we
+highlight often overlooked issues in existing literature for distributed
+ST-GNNs, such as the variation in model performance across different
+geographical areas due to region-specific traffic patterns, and the significant
+communication overhead and computational costs that arise from the large
+receptive field of GNNs, leading to substantial data transfers and increased
+computation of partial embeddings.
+
+
+
+
+
+
+
+ ☆ Towards Understanding and Quantifying Uncertainty for Text-to-Image
+ Generation
+
+
+
+
+
+
+
+
+ Gianni Franchi, Dat Nguyen Trong, Nacim Belkhir, Guoxuan Xia, Andrea Pilzer
+
+
+ Uncertainty quantification in text-to-image (T2I) generative models is
+crucial for understanding model behavior and improving output reliability. In
+this paper, we are the first to quantify and evaluate the uncertainty of T2I
+models with respect to the prompt. Alongside adapting existing approaches
+designed to measure uncertainty in the image space, we also introduce
+Prompt-based UNCertainty Estimation for T2I models (PUNC), a novel method
+leveraging Large Vision-Language Models (LVLMs) to better address uncertainties
+arising from the semantics of the prompt and generated images. PUNC utilizes an
+LVLM to caption a generated image, and then compares the caption with the
+original prompt in the more semantically meaningful text space. PUNC also
+enables the disentanglement of both aleatoric and epistemic uncertainties via
+precision and recall, which image-space approaches are unable to do. Extensive
+experiments demonstrate that PUNC outperforms state-of-the-art uncertainty
+estimation techniques across various settings. Uncertainty quantification in
+text-to-image generation models can be used in various applications, including
+bias detection, copyright protection, and OOD detection. We also introduce a
+comprehensive dataset of text prompts and generation pairs to foster further
+research in uncertainty quantification for generative models. Our findings
+illustrate that PUNC not only achieves competitive performance but also enables
+novel applications in evaluating and improving the trustworthiness of
+text-to-image models.
+
+
+
+
+
+
+
+
+ Nouhaila Innan, Alberto Marchisio, Mohamed Bennai, Muhammad Shafique
+
+
+ Predicting loan eligibility with high accuracy remains a significant
+challenge in the finance sector. Accurate predictions enable financial
+institutions to make informed decisions, mitigate risks, and effectively adapt
+services to meet customer needs. However, the complexity and the
+high-dimensional nature of financial data have always posed significant
+challenges to achieving this level of precision. To overcome these issues, we
+propose a novel approach that employs Quantum Machine Learning (QML) for Loan
+Eligibility Prediction using Quantum Neural Networks (LEP-QNN). Our innovative
+approach achieves an accuracy of 98% in predicting loan eligibility from a
+single, comprehensive dataset. This performance boost is attributed to the
+strategic implementation of a dropout mechanism within the quantum circuit,
+aimed at minimizing overfitting and thereby improving the model's predictive
+reliability. In addition, our exploration of various optimizers leads to
+identifying the most efficient setup for our LEP-QNN framework, optimizing its
+performance. We also rigorously evaluate the resilience of LEP-QNN under
+different quantum noise scenarios, ensuring its robustness and dependability
+for quantum computing environments. This research showcases the potential of
+QML in financial predictions and establishes a foundational guide for advancing
+QML technologies, marking a step towards developing advanced, quantum-driven
+financial decision-making tools.
+
+
+
+ comment: 8 pages. 6 figures, 3 tables
+
+
+
+
+
+
+ ☆ Testing Neural Network Verifiers: A Soundness Benchmark with Hidden
+ Counterexamples
+
+
+ In recent years, many neural network (NN) verifiers have been developed to
+formally verify certain properties of neural networks such as robustness.
+Although many benchmarks have been constructed to evaluate the performance of
+NN verifiers, they typically lack ground truth for hard instances that no
+current verifier can verify and for which no counterexample can be found, which
+makes it difficult to check the soundness of a new verifier that claims to
+verify hard instances that no other verifier can. We propose to develop a soundness
+benchmark for NN verification. Our benchmark contains instances with
+deliberately inserted counterexamples while we also try to hide the
+counterexamples from regular adversarial attacks which can be used for finding
+counterexamples. We design a training method to produce neural networks with
+such hidden counterexamples. Our benchmark aims to be used for testing the
+soundness of NN verifiers and identifying falsely claimed verifiability when it
+is known that hidden counterexamples exist. We systematically construct our
+benchmark and generate instances across diverse model architectures, activation
+functions, input sizes, and perturbation radii. We demonstrate that our
+benchmark successfully identifies bugs in state-of-the-art NN verifiers, as
+well as synthetic bugs, providing a crucial step toward enhancing the
+reliability of testing NN verifiers. Our code is available at
+https://github.com/MVP-Harry/SoundnessBench and our benchmark is available at
+https://huggingface.co/datasets/SoundnessBench/SoundnessBench.
+
+
+
+ comment: Preprint
+
+
+
+
+
+
+ ☆ Topological Trajectory Classification and Landmark Inference on
+ Simplicial Complexes
+
+
+
+
+
+
+
+
+ Vincent P. Grande, Josef Hoppe, Florian Frantzen, Michael T. Schaub
+
+
+ We consider the problem of classifying trajectories on a discrete or
+discretised 2-dimensional manifold modelled by a simplicial complex. Previous
+works have proposed to project the trajectories into the harmonic eigenspace of
+the Hodge Laplacian, and then cluster the resulting embeddings. However, if the
+considered space has vanishing homology (i.e., no "holes"), then the harmonic
+space of the 1-Hodge Laplacian is trivial and thus the approach fails. Here we
+propose to view this issue akin to a sensor placement problem and present an
+algorithm that aims to learn "optimal holes" to distinguish a set of given
+trajectory classes. Specifically, given a set of labelled trajectories, which
+we interpret as edge-flows on the underlying simplicial complex, we search for
+2-simplices whose deletion results in an optimal separation of the trajectory
+labels according to the corresponding spectral embedding of the trajectories
+into the harmonic space. Finally, we generalise this approach to the
+unsupervised setting.
+
+
+
+ comment: 5 pages, 4 figures, Accepted at the 58th Annual Asilomar Conference
+ on Signals, Systems, and Computers 2024
+
+
+
+
+
+
+ ☆ Generalized Diffusion Model with Adjusted Offset Noise
+
+
+ Diffusion models have become fundamental tools for modeling data
+distributions in machine learning and have applications in image generation,
+drug discovery, and audio synthesis. Despite their success, these models face
+challenges when generating data with extreme brightness values, as evidenced by
+limitations in widely used frameworks like Stable Diffusion. Offset noise has
+been proposed as an empirical solution to this issue, yet its theoretical basis
+remains insufficiently explored. In this paper, we propose a generalized
+diffusion model that naturally incorporates additional noise within a rigorous
+probabilistic framework. Our approach modifies both the forward and reverse
+diffusion processes, enabling inputs to be diffused into Gaussian distributions
+with arbitrary mean structures. We derive a loss function based on the evidence
+lower bound, establishing its theoretical equivalence to offset noise with
+certain adjustments, while broadening its applicability. Experiments on
+synthetic datasets demonstrate that our model effectively addresses
+brightness-related challenges and outperforms conventional methods in
+high-dimensional scenarios.
+
+
+
+
+
+
+
+ ☆ Unifying KV Cache Compression for Large Language Models with LeanKV
+
+
+
+
+
+
+
+
+ Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C. S. Lui, Haibo Chen
+
+
+ Large language models (LLMs) demonstrate exceptional performance but incur
+high serving costs due to substantial memory demands, with the key-value (KV)
+cache being a primary bottleneck. Existing KV cache compression methods,
+including quantization and pruning, struggle with limitations such as uniform
+treatment of keys and values and static memory allocation across attention
+heads. To address these challenges, we introduce LeanKV, a unified KV cache
+compression framework that enhances LLM serving efficiency without compromising
+accuracy through three innovations: (1) Hetero-KV quantization, which stores
+keys at a higher precision than values to reflect their greater impact on
+attention computations; (2) per-head dynamic sparsity, which allocates memory
+based on token importance per head and per request; and (3) unified KV
+compression, integrating mixed-precision quantization and selective pruning to
+enable a smooth tradeoff between model accuracy and memory efficiency. To
+efficiently support these techniques, LeanKV introduces systems optimizations
+including unified paging and on-GPU parallel memory management. Implemented on
+vLLM, LeanKV compresses the KV cache by $3.0\times$ to $5.0\times$ without
+accuracy loss and up to $11.0\times$ with under 5% accuracy loss, enhancing
+throughput by $1.9\times$ to $2.5\times$, and up to $6.9\times$.
+
+
+ The Sinkhorn algorithm is the de facto standard approximation algorithm for
+optimal transport, which has been applied to a variety of applications,
+including image processing and natural language processing. In theory, the
+proof of its convergence follows from the convergence of the Sinkhorn--Knopp
+algorithm for the matrix scaling problem, and Altschuler et al. show that its
+worst-case time complexity is near-linear. Very recently, sequentially
+composed optimal transports were proposed by Watanabe and Isobe as a
+hierarchical extension of optimal transports. In this paper, we present an
+efficient approximation algorithm, namely the Sinkhorn algorithm for sequentially
+composed optimal transports, for its entropic regularization. Furthermore, we
+present a theoretical analysis of the Sinkhorn algorithm, namely (i) its
+exponential convergence to the optimal solution with respect to the Hilbert
+pseudometric, and (ii) a worst-case complexity analysis for the case of one
+sequential composition.
+
+
+
+ comment: Preprint
+
+
+
+
+
+
+ ☆ Few-Shot Learning with Adaptive Weight Masking in Conditional GANs
+
+
+ Deep learning has revolutionized various fields, yet its efficacy is hindered
+by overfitting and the requirement of extensive annotated data, particularly in
+few-shot learning scenarios where limited samples are available. This paper
+introduces a novel approach to few-shot learning by employing a Residual Weight
+Masking Conditional Generative Adversarial Network (RWM-CGAN) for data
+augmentation. The proposed model integrates residual units within the generator
+to enhance network depth and sample quality, coupled with a weight mask
+regularization technique in the discriminator to improve feature learning from
+small-sample categories. This method addresses the core issues of robustness
+and generalization in few-shot learning by providing a controlled and clear
+augmentation of the sample space. Extensive experiments demonstrate that
+RWM-CGAN not only expands the sample space effectively but also enriches the
+diversity and quality of generated samples, leading to significant improvements
+in detection and classification accuracy on public datasets. The paper
+contributes to the advancement of few-shot learning by offering a practical
+solution to the challenges posed by data scarcity and the need for rapid
+generalization to new tasks or categories.
+
+
+
+
+
+
+
+ ☆ Enhancing Recommendation Systems with GNNs and Addressing Over-Smoothing
+
+
+ This paper addresses key challenges in enhancing recommendation systems by
+leveraging Graph Neural Networks (GNNs) and addressing inherent limitations
+such as over-smoothing, which reduces model effectiveness as network hierarchy
+deepens. The proposed approach introduces three GNN-based recommendation
+models, specifically designed to mitigate over-smoothing through innovative
+mechanisms like residual connections and identity mapping within the
+aggregation propagation process. These modifications enable more effective
+information flow across layers, preserving essential user-item interaction
+details to improve recommendation accuracy. Additionally, the study emphasizes
+the critical need for interpretability in recommendation systems, aiming to
+provide transparent and justifiable suggestions tailored to dynamic user
+preferences. By integrating collaborative filtering with GNN architectures, the
+proposed models not only enhance predictive accuracy but also align
+recommendations more closely with individual behaviors, adapting to nuanced
+shifts in user interests. This work advances the field by tackling both
+technical and user-centric challenges, contributing to the development of
+robust and explainable recommendation systems capable of managing the
+complexity and scale of modern online environments.
+
+
+
+
+
+
+
+ ☆ Revolve: Optimizing AI Systems by Tracking Response Evolution in Textual
+ Optimization
+
+
+ Recent advancements in large language models (LLMs) have significantly
+enhanced the ability of LLM-based systems to perform complex tasks through
+natural language processing and tool interaction. However, optimizing these
+LLM-based systems for specific tasks remains challenging, often requiring
+manual interventions like prompt engineering and hyperparameter tuning.
+Existing automatic optimization methods, such as textual feedback-based
+techniques (e.g., TextGrad), tend to focus on immediate feedback, analogous to
+using immediate derivatives in traditional numerical gradient descent. However,
+relying solely on such feedback can be limited when the adjustments made in
+response to this feedback are either too small or fluctuate irregularly,
+potentially slowing down or even stalling the optimization process. To overcome
+these challenges, more adaptive methods are needed, especially in situations
+where the system's response is evolving slowly or unpredictably. In this paper,
+we introduce REVOLVE, an optimization method that tracks how "R"esponses
+"EVOLVE" across iterations in LLM systems. By focusing on the evolution of
+responses over time, REVOLVE enables more stable and effective optimization by
+making thoughtful, progressive adjustments at each step. Experimental results
+demonstrate that REVOLVE outperforms competitive baselines, achieving a 7.8%
+improvement in prompt optimization, a 20.72% gain in solution refinement, and a
+29.17% increase in code optimization. Additionally, REVOLVE converges in fewer
+iterations, resulting in significant computational savings. These advantages
+highlight its adaptability and efficiency, positioning REVOLVE as a valuable
+tool for optimizing LLM-based systems and accelerating the development of
+next-generation AI technologies. Code is available at:
+https://github.com/Peiyance/REVOLVE.
+
+
+
+ comment: 20 pages, 2 figures
+
+
+
+
+
+
+ ☆ Hybrid deep learning-based strategy for the hepatocellular carcinoma
+ cancer grade classification of H&E stained liver histopathology images
+
+
+ Hepatocellular carcinoma (HCC) is a common type of liver cancer whose
+early-stage diagnosis is a common challenge, mainly due to the manual
+assessment of hematoxylin and eosin-stained whole slide images, which is a
+time-consuming process and may lead to variability in decision-making. For
+accurate detection of HCC, we propose a hybrid deep learning-based architecture
+that uses transfer learning to extract the features from pre-trained
+convolutional neural network (CNN) models and a classifier made up of a
+sequence of fully connected layers. This study uses the publicly available The
+Cancer Genome Atlas Hepatocellular Carcinoma (TCGA-LIHC) database (n=491) for
+model development and the database of Kasturba Gandhi Medical College (KMC), India
+for validation. The pre-processing step involves patch extraction, colour
+normalization, and augmentation that results in 3920 patches for the TCGA
+dataset. The developed hybrid deep neural network consisting of a CNN-based
+pre-trained feature extractor and a customized artificial neural network-based
+classifier is trained using five-fold cross-validation. For this study, eight
+different state-of-the-art models are trained and tested as feature extractors
+for the proposed hybrid model. The proposed hybrid model with a ResNet50-based
+feature extractor achieved a sensitivity, specificity, F1-score, accuracy,
+and AUC of 100.00%, 100.00%, 100.00%, 100.00%, and 1.00, respectively, on the
+TCGA database. On the KMC database, EfficientNetb3 was the best-performing
+feature extractor, giving a sensitivity, specificity, F1-score,
+accuracy, and AUC of 96.97%, 98.85%, 96.71%, 96.71%, and 0.99, respectively. The
+proposed hybrid models improved accuracy by 2% and 4% over the
+pre-trained models on the TCGA-LIHC and KMC databases, respectively.
+
+
+
+ comment: 14 figures, 9 tables
+
+
+
+
+
+
+ ☆ A Scalable Quantum Neural Network for Approximate SRBB-Based Unitary
+ Synthesis
+
+
+ In this work, scalable quantum neural networks are introduced to approximate
+unitary evolutions through the Standard Recursive Block Basis (SRBB) and,
+subsequently, redesigned with a reduced number of CNOTs. This algebraic
+approach to the problem of unitary synthesis exploits Lie algebras and their
+topological features to obtain scalable parameterizations of unitary operators.
+First, the recursive algorithm that builds the SRBB is presented, framed in the
+original scalability scheme already known to the literature only from a
+theoretical point of view. Unexpectedly, 2-qubit systems emerge as a special
+case outside this scheme. Furthermore, an algorithm to reduce the number of
+CNOTs is proposed, thus deriving a new implementable scaling scheme that
+requires one single layer of approximation. From the mathematical algorithm,
+the scalable CNOT-reduced quantum neural network is implemented and its
+performance is assessed with a variety of different unitary matrices, both
+sparse and dense, up to 6 qubits via the PennyLane library. The effectiveness
+of the approximation is measured with different metrics in relation to two
+optimizers: a gradient-based method and the Nelder-Mead method. The approximate
+SRBB-based synthesis algorithm with CNOT-reduction is also tested on real
+hardware and compared with other valid approximation and decomposition methods
+available in the literature.
+
+
+
+ comment: Journal
+
+
+
+
+
+
+ ☆ UTSD: Unified Time Series Diffusion Model
+
+
+ Transformer-based architectures have achieved unprecedented success in time
+series analysis. However, when facing the challenge of cross-domain modeling,
+existing studies that utilize statistical priors as prompt engineering fail under
+the large distribution shift among various domains. In this paper, a Unified
+Time Series Diffusion (UTSD) model is established for the first time to model
+the multi-domain probability distribution, utilizing the powerful probability
+distribution modeling ability of Diffusion. Unlike the autoregressive models
+that capture the conditional probabilities of the prediction horizon to the
+historical sequence, we use a diffusion denoising process to model the mixture
+distribution of the cross-domain data and generate the prediction sequence for
+the target domain directly utilizing conditional sampling. The proposed UTSD
+contains three pivotal designs: (1) The condition network captures the
+multi-scale fluctuation patterns from the observation sequence, which are
+utilized as context representations to guide the denoising network to generate
+the prediction sequence; (2) An adapter-based fine-tuning strategy, in which the
+multi-domain universal representation learned in the pretraining stage is
+utilized for downstream tasks in target domains; (3) The diffusion and
+denoising process on the actual sequence space, combined with the improved
+classifier-free guidance as the conditional generation strategy, greatly
+improves the stability and accuracy of the downstream task. We conduct
+extensive experiments on mainstream benchmarks, and the pre-trained UTSD
+outperforms existing foundation models on all data domains, exhibiting superior
+zero-shot generalization ability. After training from scratch, UTSD achieves
+comparable performance against domain-specific proprietary models. The
+empirical results validate the potential of UTSD as a time series foundational
+model.
+
+
+
+
+
+
+
+ ☆ Point-GN: A Non-Parametric Network Using Gaussian Positional Encoding
+ for Point Cloud Classification WACV
+
+
+ This paper introduces Point-GN, a novel non-parametric network for efficient
+and accurate 3D point cloud classification. Unlike conventional deep learning
+models that rely on a large number of trainable parameters, Point-GN leverages
+non-learnable components, specifically Farthest Point Sampling (FPS), k-Nearest
+Neighbors (k-NN), and Gaussian Positional Encoding (GPE), to extract both local
+and global geometric features. This design eliminates the need for additional
+training while maintaining high performance, making Point-GN particularly
+suited for real-time, resource-constrained applications. We evaluate Point-GN
+on two benchmark datasets, ModelNet40 and ScanObjectNN, achieving
+classification accuracies of 85.29% and 85.89%, respectively, while
+significantly reducing computational complexity. Point-GN outperforms existing
+non-parametric methods and matches the performance of fully trained models, all
+with zero learnable parameters. Our results demonstrate that Point-GN is a
+promising solution for 3D point cloud classification in practical, real-time
+environments.
+
+
+
+ comment: This paper has been accepted for presentation at the IEEE Winter
+ Conference on Applications of Computer Vision (WACV) 2025
+
+
+
+
+
+
+ ☆ Less is More: A Stealthy and Efficient Adversarial Attack Method for
+ DRL-based Autonomous Driving Policies
+
+
+
+
+
+
+
+
+ Junchao Fan, Xuyang Lei, Xiaolin Chang, Jelena Mišić, Vojislav B. Mišić
+
+
+ Despite significant advancements in deep reinforcement learning (DRL)-based
+autonomous driving policies, these policies still exhibit vulnerability to
+adversarial attacks. This vulnerability poses a formidable challenge to the
+practical deployment of these policies in autonomous driving. Designing
+effective adversarial attacks is an indispensable prerequisite for enhancing
+the robustness of these policies. In view of this, we present a novel stealthy
+and efficient adversarial attack method for DRL-based autonomous driving
+policies. Specifically, we introduce a DRL-based adversary designed to trigger
+safety violations (e.g., collisions) by injecting adversarial samples at
+critical moments. We model the attack as a mixed-integer optimization problem
+and formulate it as a Markov decision process. Then, we train the adversary to
+learn the optimal policy for attacking at critical moments without domain
+knowledge. Furthermore, we introduce attack-related information and a
+trajectory clipping method to enhance the learning capability of the adversary.
+Finally, we validate our method in an unprotected left-turn scenario across
+different traffic densities. The experimental results show that our method
+achieves a collision rate above 90% within three attacks in most cases.
+Furthermore, our method achieves more than 130% improvement in attack
+efficiency compared to the unlimited attack method.
+
+
+
+
+
+
+
+ ☆ MILLION: A General Multi-Objective Framework with Controllable Risk for
+ Portfolio Management VLDB 2025
+
+
+
+
+
+
+
+
+ Liwei Deng, Tianfu Wang, Yan Zhao, Kai Zheng
+
+
+ Portfolio management is an important yet challenging task in AI for FinTech,
+which aims to allocate investors' budgets among different assets to balance the
+risk and return of an investment. In this study, we propose a general
+Multi-objectIve framework with controLLable rIsk for pOrtfolio maNagement
+(MILLION), which consists of two main phases, i.e., return-related maximization
+and risk control. Specifically, in the return-related maximization phase, we
+introduce two auxiliary objectives, i.e., return rate prediction, and return
+rate ranking, combined with portfolio optimization to mitigate the overfitting
+problem and improve the generalization of the trained model to future markets.
+Subsequently, in the risk control phase, we propose two methods, i.e.,
+portfolio interpolation and portfolio improvement, to achieve fine-grained risk
+control and fast risk adaptation to a user-specified risk level. For the
+portfolio interpolation method, we theoretically prove that the risk can be
+perfectly controlled if the to-be-set risk level is in a proper interval. In
+addition, we also show that the return rate of the adjusted portfolio after
+portfolio interpolation is no less than that of the min-variance optimization,
+as long as the model in the reward maximization phase is effective.
+Furthermore, the portfolio improvement method can achieve greater return rates
+while keeping the same risk level compared to portfolio interpolation.
+Extensive experiments are conducted on three real-world datasets. The results
+demonstrate the effectiveness and efficiency of the proposed framework.
+
+
+
+ comment: accepted by VLDB 2025
+
+
+
+
+
+
+ ☆ A Granger-Causal Perspective on Gradient Descent with Application to
+ Pruning
+
+
+ Stochastic Gradient Descent (SGD) is the main approach to optimizing neural
+networks. Several generalization properties of deep networks, such as
+convergence to flatter minima, are believed to arise from SGD. This article
+explores the causality aspect of gradient descent. Specifically, we show that
+the gradient descent procedure has an implicit Granger-causal relationship
+between the reduction in loss and a change in parameters. By suitable
+modifications, we make this causal relationship explicit. A causal approach to
+gradient descent has many significant applications which allow greater control.
+In this article, we illustrate the significance of the causal approach using
+the application of pruning. The causal approach to pruning has several
+interesting properties: (i) we observe a phase shift as the percentage of
+pruned parameters increases; such a phase shift is indicative of an optimal
+pruning strategy; (ii) after pruning, we see that the minima become "flatter",
+explaining the increase in accuracy after pruning weights.
+
+
+
+
+
+
+
+ ☆ Hamiltonian-based neural networks for systems under nonholonomic
+ constraints
+
+
+ There has been increasing interest in methodologies that incorporate physics
+priors into neural network architectures to enhance their modeling
+capabilities. One family of these methodologies that has gained traction is
+Hamiltonian neural networks (HNNs) and their variations. These architectures
+explicitly encode Hamiltonian mechanics both in their structure and loss
+function. Although Hamiltonian systems under nonholonomic constraints are in
+general not Hamiltonian, it is possible to formulate them in pseudo-Hamiltonian
+form, equipped with a Lie bracket which is almost Poisson. This opens the
+possibility of using some principles of HNNs in systems under nonholonomic
+constraints. The goal of the present work is to develop a modified Hamiltonian
+neural network architecture capable of modeling Hamiltonian systems under
+holonomic and nonholonomic constraints. A three-network parallel architecture
+is proposed to simultaneously learn the Hamiltonian of the system, the
+constraints, and their associated multipliers. A rolling disk and a ball on a
+spinning table are considered as canonical examples to assess the performance
+of the proposed Hamiltonian architecture. The experiments are then repeated
+with a noisy training set to study modeling performance under more realistic
+conditions.
+
+
+
+
+
+
+
+ ☆ Learning Whole-Body Loco-Manipulation for Omni-Directional Task Space
+ Pose Tracking with a Wheeled-Quadrupedal-Manipulator
+
+
+ In this paper, we study the whole-body loco-manipulation problem using
+reinforcement learning (RL). Specifically, we focus on the problem of how to
+coordinate the floating base and the robotic arm of a wheeled-quadrupedal
+manipulator robot to achieve direct six-dimensional (6D) end-effector (EE) pose
+tracking in task space. Different from conventional whole-body
+loco-manipulation problems that track both floating-base and end-effector
+commands, the direct EE pose tracking problem requires inherent balance among
+redundant degrees of freedom in the whole-body motion. We leverage RL to solve
+this challenging problem. To address the associated difficulties, we develop a
+novel reward fusion module (RFM) that systematically integrates reward terms
+corresponding to different tasks in a nonlinear manner. In such a way, the
+inherent multi-stage and hierarchical feature of the loco-manipulation problem
+can be carefully accommodated. By combining the proposed RFM with a
+teacher-student RL training paradigm, we present a complete RL scheme to
+achieve 6D EE pose tracking for the wheeled-quadruped manipulator robot.
+Extensive simulation and hardware experiments demonstrate the significance of
+the RFM. In particular, we enable smooth and precise tracking performance,
+achieving state-of-the-art tracking position error of less than 5 cm, and
+rotation error of less than 0.1 rad. Please refer to
+https://clearlab-sustech.github.io/RFM_loco_mani/ for more experimental videos.
+
+
+
+
+
+
+
+ ☆ Data Acquisition for Improving Model Fairness using Reinforcement
+ Learning
+
+
+ Machine learning systems are increasingly being used in critical decision-making
+domains such as healthcare, finance, and criminal justice. Concerns around their
+fairness have resulted in several bias mitigation techniques that emphasize the
+need for high-quality data to ensure fairer decisions. However, the role of
+earlier stages of machine learning pipelines in mitigating model bias has not
+been explored well. In this paper, we focus on the task of acquiring additional
+labeled data points for training the downstream machine learning model to
+rapidly improve its fairness. Since not all data points in a data pool are
+equally beneficial to the task of fairness, we generate an ordering in which
+data points should be acquired. We present DataSift, a data acquisition
+framework based on the idea of data valuation that relies on partitioning and
+multi-armed bandits to determine the most valuable data points to acquire. Over
+several iterations, DataSift selects a partition and randomly samples a batch
+of data points from the selected partition, evaluates the benefit of acquiring
+the batch on model fairness, and updates the utility of partitions depending on
+the benefit. To further improve the effectiveness and efficiency of evaluating
+batches, we leverage influence functions that estimate the effect of acquiring
+a batch without retraining the model. We empirically evaluate DataSift on
+several real-world and synthetic datasets and show that the fairness of a
+machine learning model can be significantly improved even when acquiring only a few
+data points.
+
+
+
+ comment: 19 pages, 9 figures
+
+
+
+
+
+
+ ☆ Provably Extending PageRank-based Local Clustering Algorithm to Weighted
+ Directed Graphs with Self-Loops and to Hypergraphs
+
+
+ Local clustering aims to find a compact cluster near the given starting
+instances. This work focuses on graph local clustering, which has broad
+applications beyond graphs because of the internal connectivities within
+various modalities. While most existing studies on local graph clustering adopt
+the discrete graph setting (i.e., unweighted graphs without self-loops),
+real-world graphs can be more complex. In this paper, we extend the
+non-approximating Andersen-Chung-Lang ("ACL") algorithm beyond discrete graphs
+and generalize its quadratic optimality to a wider range of graphs, including
+weighted, directed, and self-looped graphs and hypergraphs. Specifically,
+leveraging PageRank, we propose two algorithms: GeneralACL for graphs and
+HyperACL for hypergraphs. We theoretically prove that, under two mild
+conditions, both algorithms can identify a quadratically optimal local cluster
+in terms of conductance with probability at least 1/2. On the property of
+hypergraphs, we address a fundamental gap in the literature by defining
+conductance for hypergraphs from the perspective of hypergraph random walks.
+Additionally, we provide experiments to validate our theoretical findings.
+
+
+ We study the preference-based pure exploration problem for bandits with
+vector-valued rewards. The rewards are ordered using a (given) preference cone
+$\mathcal{C}$ and our goal is to identify the set of Pareto optimal arms.
+First, to quantify the impact of preferences, we derive a novel lower bound on
+the sample complexity for identifying the most preferred policy with confidence
+level $1-\delta$. Our lower bound elicits the role played by the geometry of
+the preference cone and punctuates the difference in hardness compared to
+existing best-arm identification variants of the problem. We further explicate
+this geometry when rewards follow Gaussian distributions. We then provide a
+convex relaxation of the lower bound and leverage it to design the
+Preference-based Track and Stop (PreTS) algorithm that identifies the most
+preferred policy. Finally, we show that the sample complexity of PreTS is
+asymptotically tight by deriving a new concentration inequality for
+vector-valued rewards.
+
+
+
+
+
+
+
+ ☆ Surveying the Effects of Quality, Diversity, and Complexity in Synthetic
+ Data From Large Language Models
+
+
+
+
+
+
+
+
+ Alex Havrilla, Andrew Dai, Laura O'Mahony, Koen Oostermeijer, Vera Zisler, Alon Albalak, Fabrizio Milo, Sharath Chandra Raparthy, Kanishk Gandhi, Baber Abbasi, Duy Phung, Maia Iyer, Dakota Mahan, Chase Blagden, Srishti Gureja, Mohammed Hamdy, Wen-Ding Li, Giovanni Paolini, Pawan Sasanka Ammanamanchi, Elliot Meyerson
+
+
+ Synthetic data generation with Large Language Models is a promising paradigm
+for augmenting natural data over a nearly infinite range of tasks. Given this
+variety, direct comparisons among synthetic data generation algorithms are
+scarce, making it difficult to understand where improvement comes from and what
+bottlenecks exist. We propose to evaluate algorithms via the makeup of
+synthetic data generated by each algorithm in terms of data quality, diversity,
+and complexity. We choose these three characteristics for their significance in
+open-ended processes and the impact each has on the capabilities of downstream
+models. We find quality to be essential for in-distribution model
+generalization, diversity to be essential for out-of-distribution
+generalization, and complexity to be beneficial for both. Further, we emphasize
+the existence of Quality-Diversity trade-offs in training data and the
+downstream effects on model performance. We then examine the effect of various
+components in the synthetic data pipeline on each data characteristic. This
+examination allows us to taxonomize and compare synthetic data generation
+algorithms through the components they utilize and the resulting effects on
+data QDC composition. This analysis extends into a discussion on the importance
+of balancing QDC in synthetic data for efficient reinforcement learning and
+self-improvement algorithms. Analogous to the QD trade-offs in training data,
+often there exist trade-offs between model output quality and output diversity
+which impact the composition of synthetic data. We observe that many models are
+currently evaluated and optimized only for output quality, thereby limiting
+output diversity and the potential for self-improvement. We argue that
+balancing these trade-offs is essential to the development of future
+self-improvement algorithms and highlight a number of works making progress in
+this direction.
+
+
+ Transformers, especially the decoder-only variants, are the backbone of most
+modern large language models; yet we do not have much understanding of their
+expressive power except for the simple $1$-layer case.
+ Due to the difficulty of analyzing multi-layer models, all previous work
+relies on unproven complexity conjectures to show limitations for multi-layer
+Transformers. In this work, we prove the first $\textit{unconditional}$ lower
+bound against multi-layer decoder-only transformers. For any constant $L$, we
+prove that any $L$-layer decoder-only transformer needs a polynomial model
+dimension ($n^{\Omega(1)}$) to perform sequential composition of $L$ functions
+over an input of $n$ tokens.
+ As a consequence, our results give: (1) the first depth-width trade-off for
+multi-layer transformers, exhibiting that the $L$-step composition task is
+exponentially harder for $L$-layer models compared to $(L+1)$-layer ones; (2)
+an unconditional separation between encoder and decoder, exhibiting a hard task
+for decoders that can be solved by an exponentially shallower and smaller
+encoder; (3) a provable advantage of chain-of-thought, exhibiting a task that
+becomes exponentially easier with chain-of-thought.
+ On the technical side, we propose the multi-party $\textit{autoregressive}$
+$\textit{communication}$ $\textit{model}$ that captures the computation of a
+decoder-only Transformer. We also introduce a new proof technique that finds a
+certain $\textit{indistinguishable}$ $\textit{decomposition}$ of all possible
+inputs iteratively for proving lower bounds in this model. We believe our new
+communication model and proof technique will be helpful to further understand
+the computational power of transformers.
+
+
+
+
+
+
+
+ ☆ Unified Inductive Logic: From Formal Learning to Statistical Inference
+ to Supervised Learning
+
+
+ While the traditional conception of inductive logic is Carnapian, I develop a
+Peircean alternative and use it to unify formal learning theory, statistics,
+and a significant part of machine learning: supervised learning. Some crucial
+standards for evaluating non-deductive inferences have been assumed separately
+in those areas, but can actually be justified by a unifying principle.
+
+
+
+
+
+
+
+ ☆ How Many Ratings per Item are Necessary for Reliable Significance
+ Testing?
+
+
+
+
+
+
+
+
+ Christopher Homan, Flip Korn, Chris Welty
+
+
+ Most approaches to machine learning evaluation assume that machine and human
+responses are repeatable enough to be measured against data with unitary,
+authoritative, "gold standard" responses, via simple metrics such as accuracy,
+precision, and recall that assume scores are independent given the test item.
+However, AI models have multiple sources of stochasticity and the human raters
+who create gold standards tend to disagree with each other, often in meaningful
+ways, hence a single output response per input item may not provide enough
+information. We introduce methods for determining whether an (existing or
+planned) evaluation dataset has enough responses per item to reliably compare
+the performance of one model to another. We apply our methods to several of
+the very few extant gold standard test sets with multiple disaggregated responses
+per item and show that there are usually not enough responses per item to
+reliably compare the performance of one model against another. Our methods also
+allow us to estimate the number of responses per item for hypothetical datasets
+with similar response distributions to the existing datasets we study. When two
+models are very far apart in their predictive performance, fewer raters are
+needed to confidently compare them, as expected. However, as the models draw
+closer, we find that a larger number of raters than is currently typical in
+annotation collection is needed to ensure that the power analysis correctly
+reflects the difference in performance.
+
+
+
+
+
+
+
+ ☆ 3D Interaction Geometric Pre-training for Molecular Relational Learning
+
+
+ Molecular Relational Learning (MRL) is a rapidly growing field that focuses
+on understanding the interaction dynamics between molecules, which is crucial
+for applications ranging from catalyst engineering to drug discovery. Despite
+recent progress, earlier MRL approaches are limited to using only the 2D
+topological structure of molecules, as obtaining the 3D interaction geometry
+remains prohibitively expensive. This paper introduces a novel 3D geometric
+pre-training strategy for MRL (3DMRL) that incorporates a 3D virtual
+interaction environment, overcoming the limitations of costly traditional
+quantum mechanical calculation methods. With the constructed 3D virtual
+interaction environment, 3DMRL trains a 2D MRL model to learn the overall 3D
+geometric information of molecular interaction through contrastive learning.
+Moreover, fine-grained interaction between molecules is learned through force
+prediction loss, which is crucial in understanding the wide range of molecular
+interaction processes. Extensive experiments on various tasks using real-world
+datasets, including out-of-distribution and extrapolation scenarios,
+demonstrate the effectiveness of 3DMRL, showing up to a 24.93% improvement in
+performance across 40 tasks.
+
+
+
+
+
+
+
+ ☆ Incorporating System-level Safety Requirements in Perception Models via
+ Reinforcement Learning
+
+
+ Perception components in autonomous systems are often developed and optimized
+independently of downstream decision-making and control components, relying on
+established performance metrics like accuracy, precision, and recall.
+Traditional loss functions, such as cross-entropy loss and negative
+log-likelihood, focus on reducing misclassification errors but fail to consider
+their impact on system-level safety, overlooking the varying severities of
+system-level failures caused by these errors. To address this limitation, we
+propose a novel training paradigm that augments the perception component with
+an understanding of system-level safety objectives. Central to our approach is
+the translation of system-level safety requirements, formally specified using
+the rulebook formalism, into safety scores. These scores are then incorporated
+into the reward function of a reinforcement learning framework for fine-tuning
+perception models with system-level safety objectives. Simulation results
+demonstrate that models trained with this approach outperform baseline
+perception models in terms of system-level safety.
+
+
+
+
+
+
+
+ ☆ Who Brings the Frisbee: Probing Hidden Hallucination Factors in Large
+ Vision-Language Model via Causality Analysis WACV2025
+
+
+ Recent advancements in large vision-language models (LVLM) have significantly
+enhanced their ability to comprehend visual inputs alongside natural language.
+However, a major challenge in their real-world application is hallucination,
+where LVLMs generate non-existent visual elements, eroding user trust. The
+underlying mechanism driving this multimodal hallucination is poorly
+understood. Little research has examined whether contexts such as sky,
+trees, or a grass field lead an LVLM to hallucinate a frisbee. We
+hypothesize that hidden factors, such as objects, contexts, and semantic
+foreground-background structures, induce hallucination. This study proposes a
+novel causal approach: a hallucination probing system to identify these hidden
+factors. By analyzing the causality between images, text prompts, and network
+saliency, we systematically explore interventions to block these factors. Our
+experimental findings show that a straightforward technique based on our
+analysis can significantly reduce hallucinations. Additionally, our analyses
+indicate the potential to edit network internals to minimize hallucinated
+outputs.
+
+
+
+ comment: Accepted by WACV2025
+
+
+
+
+
+
+ ☆ SAVER: A Toolbox for Sampling-Based, Probabilistic Verification of
+ Neural Networks
+
+
+ We present a neural network verification toolbox to 1) assess the probability
+of satisfaction of a constraint, and 2) synthesize a set expansion factor to
+achieve the probability of satisfaction. Specifically, the toolbox establishes
+with a user-specified level of confidence whether the output of the neural
+network for a given input distribution is likely to be contained within a given
+set. Should the tool determine that the given set cannot satisfy the likelihood
+constraint, the tool also implements an approach outlined in this paper to
+alter the constraint set to ensure that the user-defined satisfaction
+probability is achieved. The toolbox comprises sampling-based approaches
+that exploit the properties of the signed distance function to define set
+containment.
+
+
+
+ comment: 7 pages, 8 figures, submitted to the 28th ACM International
+ Conference on Hybrid Systems: Computation and Control
+
+
+
+
+
+
+ ☆ BGTplanner: Maximizing Training Accuracy for Differentially Private
+ Federated Recommenders via Strategic Privacy Budget Allocation
+
+
+ To mitigate the rising concern about privacy leakage, the federated
+recommender (FR) paradigm has emerged, in which decentralized clients co-train the
+recommendation model without exposing their raw user-item rating data. The
+differentially private federated recommender (DPFR) further enhances FR by
+injecting differentially private (DP) noise into clients. Yet, current DPFRs,
+suffering from noise distortion, cannot achieve satisfactory accuracy. Various
+efforts have been dedicated to improving DPFRs by adaptively allocating the
+privacy budget over the learning process. However, due to the intricate
+relation between privacy budget allocation and model accuracy, existing works
+are still far from maximizing DPFR accuracy. To address this challenge, we
+develop BGTplanner (Budget Planner) to strategically allocate the privacy
+budget for each round of DPFR training, improving overall training performance.
+Specifically, we leverage Gaussian process regression and historical
+information to predict the change in recommendation accuracy with a certain
+allocated privacy budget. Additionally, Contextual Multi-Armed Bandit (CMAB) is
+harnessed to make privacy budget allocation decisions by reconciling the
+current improvement and long-term privacy constraints. Our extensive
+experimental results on real datasets demonstrate that BGTplanner
+achieves an average improvement of 6.76% in training performance compared to
+state-of-the-art baselines.
+
+
+
+
+
+
+
+
+ Simon Sinong Zhan, Qingyuan Wu, Zhian Ruan, Frank Yang, Philip Wang, Yixuan Wang, Ruochen Jiao, Chao Huang, Qi Zhu
+
+
+ Inverse Reinforcement Learning (IRL) has demonstrated effectiveness in a
+variety of imitation tasks. In this paper, we introduce an IRL framework
+designed to extract rewarding features from expert trajectories affected by
+delayed disturbances. Instead of relying on direct observations, our approach
+employs an efficient off-policy adversarial training framework to derive expert
+features and recover optimal policies from augmented delayed observations.
+Empirical evaluations in the MuJoCo environment under diverse delay settings
+validate the effectiveness of our method. Furthermore, we provide a theoretical
+analysis showing that recovering expert policies from augmented delayed
+observations outperforms using direct delayed observations.
+
+
+
+
+
+
+
+ ☆ Harnessing Loss Decomposition for Long-Horizon Wave Predictions via Deep
+ Neural Networks NeurIPS
+
+
+ Accurate prediction over long time horizons is crucial for modeling complex
+physical processes such as wave propagation. Although deep neural networks show
+promise for real-time forecasting, they often struggle with accumulating phase
+and amplitude errors as predictions extend over a long period. To address this
+issue, we propose a novel loss decomposition strategy that breaks down the loss
+into separate phase and amplitude components. This technique improves the
+long-term prediction accuracy of neural networks in wave propagation tasks by
+explicitly accounting for numerical errors, improving stability, and reducing
+error accumulation over extended forecasts.
+
+
+ Transformers are now ubiquitous for sequence modeling tasks, but their
+extension to multi-dimensional data remains a challenge due to the quadratic
+cost of the attention mechanism. In this paper, we propose Higher-Order
+Transformers (HOT), a novel architecture designed to efficiently process data
+with more than two axes, i.e. higher-order tensors. To address the
+computational challenges associated with high-order tensor attention, we
+introduce a novel Kronecker factorized attention mechanism that reduces the
+attention cost to quadratic in each axis' dimension, rather than quadratic in
+the total size of the input tensor. To further enhance efficiency, HOT
+leverages kernelized attention, reducing the complexity to linear. This
+strategy maintains the model's expressiveness while enabling scalable attention
+computation. We validate the effectiveness of HOT on two high-dimensional
+tasks: multivariate time series forecasting and 3D medical image
+classification. Experimental results demonstrate that HOT achieves competitive
+performance while significantly improving computational efficiency, showcasing
+its potential for tackling a wide range of complex, multi-dimensional data.
+
+
+
+
+
+
+
+ ♻ ☆ Yo'LLaVA: Your Personalized Language and Vision Assistant NeurIPS 2024
+
+
+ Large Multimodal Models (LMMs) have shown remarkable capabilities across a
+variety of tasks (e.g., image captioning, visual question answering). While
+broad, their knowledge remains generic (e.g., recognizing a dog), and they are
+unable to handle personalized subjects (e.g., recognizing a user's pet dog).
+Human reasoning, in contrast, typically operates within the context of specific
+subjects in our surroundings. For example, one might ask, "What should I buy
+for my dog's birthday?"; as opposed to a generic inquiry about "What should I
+buy for a dog's birthday?". Similarly, when looking at a friend's image, the
+interest lies in seeing their activities (e.g., "my friend is holding a cat"),
+rather than merely observing generic human actions (e.g., "a man is holding a
+cat"). In this paper, we introduce the novel task of personalizing LMMs, so
+that they can have conversations about a specific subject. We propose Yo'LLaVA,
+which learns to embed a personalized subject into a set of latent tokens given
+a handful of example images of the subject. Our qualitative and quantitative
+analyses reveal that Yo'LLaVA can learn the concept more efficiently using
+fewer tokens and more effectively encode the visual attributes compared to
+strong prompting baselines (e.g., LLaVA).
+
+
+
+
+
+
+
+ ♻ ☆ DynaMITE-RL: A Dynamic Model for Improved Temporal Meta-Reinforcement
+ Learning
+
+
+
+
+
+
+
+
+ Anthony Liang, Guy Tennenholtz, Chih-wei Hsu, Yinlam Chow, Erdem Bıyık, Craig Boutilier
+
+
+ We introduce DynaMITE-RL, a meta-reinforcement learning (meta-RL) approach to
+approximate inference in environments where the latent state evolves at varying
+rates. We model episode sessions - parts of the episode where the latent state
+is fixed - and propose three key modifications to existing meta-RL methods:
+consistency of latent information within sessions, session masking, and prior
+latent conditioning. We demonstrate the importance of these modifications in
+various domains, ranging from discrete Gridworld environments to
+continuous-control and simulated robot assistive tasks, demonstrating that
+DynaMITE-RL significantly outperforms state-of-the-art baselines in sample
+efficiency and inference returns.
+
+
+
+
+
+
+
+ ♻ ☆ Fast and reliable uncertainty quantification with neural network
+ ensembles for industrial image classification
+
+
+ Image classification with neural networks (NNs) is widely used in industrial
+processes, situations where the model likely encounters unknown objects during
+deployment, i.e., out-of-distribution (OOD) data. Worryingly, NNs tend to make
+confident yet incorrect predictions when confronted with OOD data. To increase
+the models' reliability, they should quantify the uncertainty in their own
+predictions, communicating when the output should (not) be trusted. Deep
+ensembles, composed of multiple independent NNs, have been shown to perform
+strongly but are computationally expensive. Recent research has proposed more
+efficient NN ensembles, namely the snapshot, batch, and multi-input
+multi-output ensemble. This study investigates the predictive and uncertainty
+performance of efficient NN ensembles in the context of image classification
+for industrial processes. It is the first to provide a comprehensive comparison
+and it proposes a novel Diversity Quality metric to quantify the ensembles'
+performance on the in-distribution and OOD sets in one single metric. The
+results highlight the batch ensemble as a cost-effective and competitive
+alternative to the deep ensemble. It matches the deep ensemble in both
+uncertainty and accuracy while exhibiting considerable savings in training
+time, test time, and memory storage.
+
+
+
+ comment: Submitted to Annals of Operations Research
+
+
+
+
+
+
+ ♻ ☆ Marconi: Prefix Caching for the Era of Hybrid LLMs
+
+
+
+
+
+
+
+
+ Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Yida Wang, Ravi Netravali
+
+
+ Hybrid models that combine the language modeling capabilities of Attention
+layers with the efficiency of Recurrent layers (e.g., State Space Models) have
+gained traction in practically supporting long contexts in Large Language Model
+serving. Yet, the unique properties of these models complicate the usage of
+complementary efficiency optimizations such as prefix caching that skip
+redundant computations across requests. Most notably, their use of in-place
+state updates for recurrent layers precludes rolling back cache entries for
+partial sequence overlaps, and instead mandates only exact-match cache hits;
+the effect is a deluge of (large) cache entries per sequence, most of which
+yield minimal reuse opportunities. We present Marconi, the first system that
+supports efficient prefix caching with Hybrid LLMs. Key to Marconi are its
+novel admission and eviction policies that more judiciously assess potential
+cache entries based not only on recency, but also on (1) forecasts of their
+reuse likelihood across a taxonomy of different hit scenarios, and (2) the
+compute savings that hits deliver relative to memory footprints. Across diverse
+workloads and Hybrid models, Marconi achieves up to 34.4$\times$ higher token
+hit rates (71.1% or 617 ms lower TTFT) compared to state-of-the-art prefix
+caching systems.
+
+
+ Driving is challenging in conditions like night, rain, and snow. Lack of good
+labeled datasets has hampered progress in scene understanding under such
+conditions. Unsupervised Domain Adaptation (UDA) using large labeled clear-day
+datasets is a promising research direction in such cases. However, many UDA
+methods are trained with dominant scene backgrounds (e.g., roads, sky,
+sidewalks) that appear dramatically different across domains. As a result, they
+struggle to learn effective features of smaller and often sparse foreground
+objects (e.g., people, vehicles, signs).
+ In this work, we improve UDA training by applying in-place image warping to
+focus on salient objects. We design instance-level saliency guidance to
+adaptively oversample object regions and undersample background areas, which
+reduces adverse effects from background context and enhances backbone feature
+learning. Our approach improves adaptation across geographies, lighting, and
+weather conditions, and is agnostic to the task (segmentation, detection),
+domain adaptation algorithm, saliency guidance, and underlying model
+architecture. Result highlights include +6.1 mAP50 for BDD100K Clear
+$\rightarrow$ DENSE Foggy, +3.7 mAP50 for BDD100K Day $\rightarrow$ Night, +3.0
+mAP50 for BDD100K Clear $\rightarrow$ Rainy, and +6.3 mIoU for Cityscapes
+$\rightarrow$ ACDC. In addition, our method adds minimal training memory and no
+additional inference latency. Code is available at
+https://github.com/ShenZheng2000/Instance-Warp
+
+
+
+ comment: WACV 2025 Accepted Paper
+
+
+
+
+
+
+ ♻ ☆ Privacy-Preserving Data Deduplication for Enhancing Federated Learning
+ of Language Models (Extended Version) NDSS
+
+
+ Deduplication is a vital preprocessing step that enhances machine learning
+model performance and saves training time and energy. However, enhancing
+federated learning through deduplication poses challenges, especially regarding
+scalability and potential privacy violations if deduplication involves sharing
+all clients' data. In this paper, we address the problem of deduplication in a
+federated setup by introducing a pioneering protocol, Efficient
+Privacy-Preserving Multi-Party Deduplication (EP-MPD). It efficiently removes
+duplicates from multiple clients' datasets without compromising data privacy.
+EP-MPD is constructed in a modular fashion, utilizing two novel variants of the
+Private Set Intersection protocol. Our extensive experiments demonstrate the
+significant benefits of deduplication in federated learning of large language
+models. For instance, we observe up to 19.62% improvement in perplexity and up
+to 27.95% reduction in running time while varying the duplication level
+between 10% and 30%. EP-MPD effectively balances privacy and performance in
+federated learning, making it a valuable solution for large-scale applications.
+
+
+
+ comment: Accepted at the Network and Distributed Systems Security (NDSS)
+ Symposium, 2025
+
+
+
+
+
+
+ ♻ ☆ Towards Time Series Reasoning with LLMs NeurIPS
+
+
+
+
+
+
+
+
+ Winnie Chow, Lauren Gardiner, Haraldur T. Hallgrímsson, Maxwell A. Xu, Shirley You Ren
+
+
+ Multi-modal large language models (MLLMs) have enabled numerous advances in
+understanding and reasoning in domains like vision, but we have not yet seen
+this broad success for time-series. Although prior works on time-series MLLMs
+have shown promising performance in time-series forecasting, very few works
+show how an LLM could be used for time-series reasoning in natural language. We
+propose a novel multi-modal time-series LLM approach that learns generalizable
+information across various domains with powerful zero-shot performance. First,
+we train a lightweight time-series encoder on top of an LLM to directly extract
+time-series information. Then, we fine-tune our model with chain-of-thought
+augmented time-series tasks to encourage the model to generate reasoning paths.
+We show that our model learns a latent representation that reflects specific
+time-series features (e.g., slope, frequency) and that it outperforms GPT-4o
+on a set of zero-shot reasoning tasks across a variety of domains.
+
+
+
+ comment: Oral Presentation at 2024 NeurIPS Workshop on Time Series in the Age
+ of Large Models
+
+
+
+
+
+
+ ♻ ☆ Towards Size-Independent Generalization Bounds for Deep Operator Nets
+
+
+  In recent times, machine learning methods have made significant advances in
+becoming a useful tool for analyzing physical systems. A particularly active
+area in this theme has been "physics-informed machine learning" which focuses
+on using neural nets for numerically solving differential equations. In this
+work, we aim to advance the theory of measuring out-of-sample error while
+training DeepONets, which are among the most versatile ways to solve PDE
+systems in one shot. Firstly, for a class of DeepONets, we prove a bound on
+their Rademacher complexity which does not explicitly scale with the width of
+the nets involved. Secondly, we use this to show how the Huber loss can be
+chosen so that for these DeepONet classes generalization error bounds can be
+obtained that have no explicit dependence on the size of the nets. The
+effective capacity measure for DeepONets that we thus derive is also shown to
+correlate with the behavior of generalization error in experiments.
+
+
+
+ comment: 33 pages, 7 figures; Published in TMLR, December 2024
+
+
+
+
+
+
+ ♻ ☆ Fast Computation of Leave-One-Out Cross-Validation for $k$-NN Regression
+
+
+ We describe a fast computation method for leave-one-out cross-validation
+(LOOCV) for $k$-nearest neighbours ($k$-NN) regression. We show that, under a
+tie-breaking condition for nearest neighbours, the LOOCV estimate of the mean
+square error for $k$-NN regression is identical to the mean square error of
+$(k+1)$-NN regression evaluated on the training data, multiplied by the scaling
+factor $(k+1)^2/k^2$. Therefore, to compute the LOOCV score, one needs to
+fit $(k+1)$-NN regression only once, and does not need to repeat the
+training-validation of $k$-NN regression once per training point.
+Numerical experiments confirm the validity of the fast computation method.
+
+
+
+ comment: To appear in Transactions of Machine Learning Research (TMLR)
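The identity stated in the abstract above can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names `loocv_mse_brute` and `loocv_mse_fast` are hypothetical, Euclidean distance is assumed, and the tie-breaking condition is assumed to hold (distinct training points, so each point is its own unique nearest neighbour at distance zero).

```python
import numpy as np

def loocv_mse_brute(X, y, k):
    """Explicit LOOCV: hold out each point and predict it from the k nearest others."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i                   # drop the held-out point
        d = ((X[mask] - X[i]) ** 2).sum(axis=1)    # squared Euclidean distances
        nn = np.argsort(d)[:k]                     # k nearest remaining points
        errs[i] = (y[i] - y[mask][nn].mean()) ** 2
    return float(errs.mean())

def loocv_mse_fast(X, y, k):
    """Shortcut: training MSE of (k+1)-NN, rescaled by (k+1)^2 / k^2."""
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    idx = np.argsort(d, axis=1)[:, : k + 1]        # each point's own index comes first (distance 0)
    pred = y[idx].mean(axis=1)                     # (k+1)-NN prediction on the training data
    train_mse = np.mean((y - pred) ** 2)
    return float((k + 1) ** 2 / k ** 2 * train_mse)
```

The scaling factor follows from a one-line argument: the $(k+1)$-NN prediction at a training point averages the point's own label with its $k$ nearest other labels, so its residual is $k/(k+1)$ times the leave-one-out residual. Squaring and inverting gives $(k+1)^2/k^2$.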
+
+
+
+
+
+
+
+ Zheng Zhang, Cuong Nguyen, Kevin Wells, Thanh-Toan Do, David Rosewarne, Gustavo Carneiro
+
+
+ Human-AI cooperative classification (HAI-CC) approaches aim to develop hybrid
+intelligent systems that enhance decision-making in various high-stakes
+real-world scenarios by leveraging both human expertise and AI capabilities.
+Current HAI-CC methods primarily focus on learning-to-defer (L2D), where
+decisions are deferred to human experts, and learning-to-complement (L2C),
+where AI and human experts make predictions cooperatively. However, a notable
+research gap remains in effectively exploring both L2D and L2C under diverse
+expert knowledge to improve decision-making, particularly when constrained by
+the cooperation cost required to achieve a target probability for AI-only
+selection (i.e., coverage). In this paper, we address this research gap by
+proposing the Coverage-constrained Learning to Defer and Complement with
+Specific Experts (CL2DC) method. CL2DC makes final decisions through either AI
+prediction alone or by deferring to or complementing a specific expert,
+depending on the input data. Furthermore, we propose a coverage-constrained
+optimisation to control the cooperation cost, ensuring it approximates a target
+probability for AI-only selection. This approach enables an effective
+assessment of system performance within a specified budget. Also, CL2DC is
+designed to address scenarios where training sets contain multiple noisy-label
+annotations without any clean-label references. Comprehensive evaluations on
+both synthetic and real-world datasets demonstrate that CL2DC achieves superior
+performance compared to state-of-the-art HAI-CC methods.
+
+
+
+
+
+
+
+ ♻ ☆ Distributionally robust self-supervised learning for tabular data NeurIPS2024
+
+
+ Machine learning (ML) models trained using Empirical Risk Minimization (ERM)
+often exhibit systematic errors on specific subpopulations of tabular data,
+known as error slices. Learning robust representations in the presence of error
+slices is challenging, especially in self-supervised settings during the
+feature reconstruction phase, due to high cardinality features and the
+complexity of constructing error sets. Traditional robust representation
+learning methods are largely focused on improving worst-group performance in
+supervised settings in computer vision, leaving a gap in approaches tailored for
+tabular data. We address this gap by developing a framework to learn robust
+representations in tabular data during self-supervised pre-training. Our
+approach utilizes an encoder-decoder model trained with Masked Language
+Modeling (MLM) loss to learn robust latent representations. This paper applies
+the Just Train Twice (JTT) and Deep Feature Reweighting (DFR) methods during
+the pre-training phase for tabular data. These methods fine-tune the ERM
+pre-trained model by up-weighting error-prone samples or creating balanced
+datasets for specific categorical features. This results in specialized models
+for each feature, which are then used in an ensemble approach to enhance
+downstream classification performance. This methodology improves robustness
+across slices, thus enhancing overall generalization performance. Extensive
+experiments across various datasets demonstrate the efficacy of our approach.
+The code is available at
+https://github.com/amazon-science/distributionally-robust-self-supervised-learning-for-tabular-data.
+
+
+
+ comment: TRL Workshop@NeurIPS2024
+
+
+
+
+
+
+ ♻ ☆ Automatically Interpreting Millions of Features in Large Language Models
+
+
+
+
+
+
+
+
+ Gonçalo Paulo, Alex Mallen, Caden Juang, Nora Belrose
+
+
+ While the activations of neurons in deep neural networks usually do not have
+a simple human-understandable interpretation, sparse autoencoders (SAEs) can be
+used to transform these activations into a higher-dimensional latent space
+which may be more easily interpretable. However, these SAEs can have millions
+of distinct latent features, making it infeasible for humans to manually
+interpret each one. In this work, we build an open-source automated pipeline to
+generate and evaluate natural language explanations for SAE features using
+LLMs. We test our framework on SAEs of varying sizes, activation functions, and
+losses, trained on two different open-weight LLMs. We introduce five new
+techniques to score the quality of explanations that are cheaper to run than
+the previous state of the art. One of these techniques, intervention scoring,
+evaluates the interpretability of the effects of intervening on a feature,
+which we find explains features that are not recalled by existing methods. We
+propose guidelines for generating better explanations that remain valid for a
+broader set of activating contexts, and discuss pitfalls with existing scoring
+techniques. We use our explanations to measure the semantic similarity of
+independently trained SAEs, and find that SAEs trained on nearby layers of the
+residual stream are highly similar. Our large-scale analysis confirms that SAE
+latents are indeed much more interpretable than neurons, even when neurons are
+sparsified using top-$k$ postprocessing. Our code is available at
+https://github.com/EleutherAI/sae-auto-interp, and our explanations are
+available at
+https://huggingface.co/datasets/EleutherAI/auto_interp_explanations.
+
+
+
+
+
+
+
+ ♻ ☆ Generalization Bounds and Model Complexity for Kolmogorov-Arnold
+ Networks
+
+
+ Kolmogorov-Arnold Network (KAN) is a network structure recently proposed by
+Liu et al. (2024) that offers improved interpretability and a more parsimonious
+design in many science-oriented tasks compared to multi-layer perceptrons. This
+work provides a rigorous theoretical analysis of KAN by establishing
+generalization bounds for KAN equipped with activation functions that are
+either represented by linear combinations of basis functions or lying in a
+low-rank Reproducing Kernel Hilbert Space (RKHS). In the first case, the
+generalization bound accommodates various choices of basis functions in forming
+the activation functions in each layer of KAN and is adapted to different
+operator norms at each layer. For a particular choice of operator norms, the
+bound scales with the $l_1$ norm of the coefficient matrices and the Lipschitz
+constants for the activation functions, and it has no dependence on
+combinatorial parameters (e.g., number of nodes) outside of logarithmic
+factors. Moreover, our result does not require the boundedness assumption on
+the loss function and, hence, is applicable to a general class of
+regression-type loss functions. In the low-rank case, the generalization bound
+scales polynomially with the underlying ranks as well as the Lipschitz
+constants of the activation functions in each layer. These bounds are
+empirically investigated for KANs trained with stochastic gradient descent on
+simulated and real data sets. The numerical results demonstrate the practical
+relevance of these bounds.
+
+
+
+
+
+
+
+ ♻ ☆ Controlling Counterfactual Harm in Decision Support Systems Based on
+ Prediction Sets ICML 2024
+
+
+ Decision support systems based on prediction sets help humans solve
+multiclass classification tasks by narrowing down the set of potential label
+values to a subset of them, namely a prediction set, and asking them to always
+predict label values from the prediction sets. While this type of system has
+been proven to be effective at improving the average accuracy of the
+predictions made by humans, by restricting human agency, it may cause
+harm: a human who has succeeded at predicting the ground-truth
+label of an instance on their own may have failed had they used these systems.
+In this paper, our goal is to control how frequently a decision support system
+based on prediction sets may cause harm, by design. To this end, we start by
+characterizing the above notion of harm using the theoretical framework of
+structural causal models. Then, we show that, under a natural, albeit
+unverifiable, monotonicity assumption, we can estimate how frequently a system
+may cause harm using only predictions made by humans on their own. Further, we
+also show that, under a weaker monotonicity assumption, which can be verified
+experimentally, we can bound how frequently a system may cause harm again using
+only predictions made by humans on their own. Building upon these assumptions,
+we introduce a computational framework to design decision support systems based
+on prediction sets that are guaranteed to cause harm less frequently than a
+user-specified value using conformal risk control. We validate our framework
+using real human predictions from two different human subject studies and show
+that, in decision support systems based on prediction sets, there is a
+trade-off between accuracy and counterfactual harm.
+
+
+
+ comment: Accepted at the ICML 2024 Workshop on Humans, Algorithmic
+ Decision-Making and Society and published at NeurIPS 2024
+
+
+
+
+
+
+ ♻ ☆ Prediction-Powered Ranking of Large Language Models NeurIPS 2024
+
+
+ Large language models are often ranked according to their level of alignment
+with human preferences -- a model is better than other models if its outputs
+are more frequently preferred by humans. One of the popular ways to elicit
+human preferences utilizes pairwise comparisons between the outputs provided by
+different models to the same inputs. However, since gathering pairwise
+comparisons by humans is costly and time-consuming, it has become a common
+practice to gather pairwise comparisons by a strong large language model -- a
+model strongly aligned with human preferences. Surprisingly, practitioners
+cannot currently measure the uncertainty that any mismatch between human and
+model preferences may introduce in the constructed rankings. In this work, we
+develop a statistical framework to bridge this gap. Given a (small) set of
+pairwise comparisons by humans and a large set of pairwise comparisons by a
+model, our framework provides a rank-set -- a set of possible ranking positions
+-- for each of the models under comparison. Moreover, it guarantees that, with
+a probability greater than or equal to a user-specified value, the rank-sets
+cover the true ranking consistent with the distribution of human pairwise
+preferences asymptotically. Using pairwise comparisons made by humans in the
+LMSYS Chatbot Arena platform and pairwise comparisons made by three strong
+large language models, we empirically demonstrate the effectiveness of our
+framework and show that the rank-sets constructed using only pairwise
+comparisons by the strong large language models are often inconsistent with
+(the distribution of) human pairwise preferences.
+
+
+
+ comment: Published at NeurIPS 2024
+
+
+
+
+
+
+ ♻ ☆ Deferred Poisoning: Making the Model More Vulnerable via Hessian
+ Singularization
+
+
+ Recent studies have shown that deep learning models are very vulnerable to
+poisoning attacks. Many defense methods have been proposed to address this
+issue. However, traditional poisoning attacks are not as threatening as
+commonly believed. This is because they often cause differences in how the
+model performs on the training set compared to the validation set. Such
+inconsistency can alert defenders that their data has been poisoned, allowing
+them to take the necessary defensive actions. In this paper, we introduce a
+more threatening type of poisoning attack called the Deferred Poisoning Attack.
+This new attack allows the model to function normally during the training and
+validation phases but makes it very sensitive to evasion attacks or even
+natural noise. We achieve this by ensuring the poisoned model's loss function
+has a value similar to that of a normally trained model at each input sample but with a
+large local curvature. A similar model loss ensures that there is no obvious
+inconsistency between the training and validation accuracy, demonstrating high
+stealthiness. On the other hand, the large curvature implies that a small
+perturbation may cause a significant increase in model loss, leading to
+substantial performance degradation, which reflects worse robustness. We
+fulfill this purpose by making the model have singular Hessian information at
+the optimal point via our proposed Singularization Regularization term. We have
+conducted both theoretical and empirical analyses of the proposed method and
+validated its effectiveness through experiments on image classification tasks.
+Furthermore, we have confirmed the hazards of this form of poisoning attack
+under more general scenarios using natural noise, offering a new perspective
+for research in the field of security.
+
+
+
+
+
+
+
+ ♻ ☆ Can In-context Learning Really Generalize to Out-of-distribution Tasks?
+
+
+ In this work, we explore the mechanism of in-context learning (ICL) on
+out-of-distribution (OOD) tasks that were not encountered during training. To
+achieve this, we conduct synthetic experiments where the objective is to learn
+OOD mathematical functions through ICL using a GPT-2 model. We reveal that
+Transformers may struggle to learn OOD task functions through ICL.
+Specifically, ICL performance resembles implementing a function within the
+pretraining hypothesis space and optimizing it with gradient descent based on
+the in-context examples. Additionally, we investigate ICL's well-documented
+ability to learn unseen abstract labels in context. We demonstrate that such
+ability only manifests in the scenarios without distributional shifts and,
+therefore, may not serve as evidence of new-task-learning ability. Furthermore,
+we assess ICL's performance on OOD tasks when the model is pretrained on
+multiple tasks. Both empirical and theoretical analyses demonstrate the
+existence of the \textbf{low-test-error preference} of ICL, where it tends to
+implement the pretraining function that yields low test error in the testing
+context. We validate this through numerical experiments. This new theoretical
+result, combined with our empirical findings, elucidates the mechanism of ICL
+in addressing OOD tasks.
+
+
+
+
+
+
+
+
+ Joonas Hämäläinen, Antoine Hubermont, Amauri Souza, César L. C. Mattos, João P. P. Gomes, Tommi Kärkkäinen
+
+
+ The minimal learning machine, a distance-based supervised method, constructs a
+predictive model from data by learning a mapping between input and output
+distance matrices. In this paper, we propose new methods and evaluate how their
+core component, the distance mapping, can be adapted to multi-label learning.
+The proposed approach is based on combining the distance mapping with an
+inverse distance weighting. Although the proposal is one of the simplest
+methods in the multi-label learning literature, it achieves state-of-the-art
+performance for small to moderate-sized multi-label learning problems. In
+addition to its simplicity, the proposed method is fully deterministic: Its
+hyper-parameter can be selected via a ranking-loss-based statistic that has a
+closed form, thus avoiding conventional cross-validation-based hyper-parameter
+tuning. In addition, due to its simple linear distance mapping-based
+construction, we demonstrate that the proposed method can assess the
+uncertainty of the predictions for multi-label classification, which is a
+valuable capability for data-centric machine learning pipelines.
+
+
+
+ comment: Submitted, 29 pages
+
+
+
+
+
+
+ ♻ ☆ LLM as a Complementary Optimizer to Gradient Descent: A Case Study in
+ Prompt Tuning
+
+
+
+
+
+
+
+
+ Zixian Guo, Ming Liu, Zhilong Ji, Jinfeng Bai, Yiwen Guo, Wangmeng Zuo
+
+
+ Mastering a skill generally relies on both hands-on experience from doers and
+insightful, high-level guidance by mentors. Will this strategy also work well
+for solving complex non-convex optimization problems? Here, a common
+gradient-based optimizer acts like a disciplined doer, making locally optimal
+updates at each step. Large Language Models (LLMs) can also search for better
+solutions by inferring from natural language instructions, akin to a high-level
+mentor. In this paper, we show that these two participants are complementary
+to each other and can effectively collaborate as a combined optimization
+framework. The collaborative optimization is achieved by alternating between
+the gradient-based and LLM-based optimizers. We instruct LLMs to generate
+possibly improved solutions by taking parameter trajectories recorded during
+the previous stage of gradient-based optimization into account. Inferred
+results of LLMs are used as restarting points for the next stage of gradient
+optimization. We verify the effectiveness of this optimization framework on
+prompt tuning. By leveraging both the locally rigorous gradient-based optimizer
+and the high-level deductive LLM-based optimizer, the combined optimization
+method consistently yields improvements over competitive baselines on a variety
+of tasks. Our results demonstrate the synergistic effect of conventional
+gradient-based optimization and the inference ability of LLMs. The code is
+released at https://github.com/guozix/LLM-catalyst.
+
+
+
+
+
+
+
+ ♻ ☆ Towards a Robust Soft Baby Robot With Rich Interaction Ability for
+ Advanced Machine Learning Algorithms
+
+
+
+
+
+
+
+
+ Mohannad Alhakami, Dylan R. Ashley, Joel Dunham, Yanning Dai, Francesco Faccio, Eric Feron, Jürgen Schmidhuber
+
+
+ Advanced machine learning algorithms require platforms that are extremely
+robust and equipped with rich sensory feedback to handle extensive
+trial-and-error learning without relying on strong inductive biases.
+Traditional robotic designs, while well-suited for their specific use cases,
+are often fragile when used with these algorithms. To address this gap -- and
+inspired by the vision of enabling curiosity-driven baby robots -- we present a
+novel robotic limb designed from scratch. Our design has a hybrid soft-hard
+structure, high redundancy with rich non-contact sensors (exclusively cameras),
+and easily replaceable failure points. Proof-of-concept experiments using two
+contemporary reinforcement learning algorithms on a physical prototype
+demonstrate that our design is able to succeed in a simple target-finding task
+even under simulated sensor failures, all with minimal human oversight during
+extended learning periods. We believe this design represents a concrete step
+toward more tailored robotic designs for achieving general-purpose, generally
+intelligent robots.
+
+
+
+ comment: 6 pages in main text + 2 pages of references, 8 figures in main text,
+ 1 table in main text; source code available at
+ https://github.com/dylanashley/robot-limb-testai
+
+
+
+
+
+
+ ♻ ☆ Reducing Optimism Bias in Incomplete Cooperative Games AAMAS 2024
+
+
+
+
+
+
+
+
+ Filip Úradník, David Sychrovský, Jakub Černý, Martin Černý
+
+
+ Cooperative game theory has diverse applications in contemporary artificial
+intelligence, including domains like interpretable machine learning, resource
+allocation, and collaborative decision-making. However, specifying a
+cooperative game entails assigning values to exponentially many coalitions, and
+obtaining even a single value can be resource-intensive in practice. Yet simply
+leaving certain coalition values undisclosed introduces ambiguity regarding
+individual contributions to the collective grand coalition. This ambiguity
+often leads to players holding overly optimistic expectations, stemming from
+either inherent biases or strategic considerations, frequently resulting in
+collective claims exceeding the actual grand coalition value. In this paper, we
+present a framework aimed at optimizing the sequence for revealing coalition
+values, with the overarching goal of efficiently closing the gap between
+players' expectations and achievable outcomes in cooperative games. Our
+contributions are threefold: (i) we study the individual players' optimistic
+completions of games with missing coalition values along with the arising gap,
+and investigate its analytical characteristics that facilitate more efficient
+optimization; (ii) we develop methods to minimize this gap over classes of
+games with a known prior by disclosing values of additional coalitions in both
+offline and online fashion; and (iii) we empirically demonstrate the
+algorithms' performance in practical scenarios, together with an investigation
+into the typical order of revealing coalition values.
+
+
+
+ comment: Proc. of the 23rd International Conference on Autonomous Agents and
+ Multiagent Systems (AAMAS 2024)
+
+
+
+
+
+
+
+ Xi Wang, Nicolas Dufour, Nefeli Andreou, Marie-Paule Cani, Victoria Fernandez Abrevaya, David Picard, Vicky Kalogeiton
+
+
+ Classifier-Free Guidance (CFG) enhances the quality and condition adherence
+of text-to-image diffusion models. It operates by combining the conditional and
+unconditional predictions using a fixed weight. However, recent works vary the
+weights throughout the diffusion process, reporting superior results but
+without providing any rationale or analysis. By conducting comprehensive
+experiments, this paper provides insights into CFG weight schedulers. Our
+findings suggest that simple, monotonically increasing weight schedulers
+consistently lead to improved performances, requiring merely a single line of
+code. In addition, more complex parametrized schedulers can be optimized for
+further improvement, but do not generalize across different models and tasks.
+
+
+
+
+
+
+
+ ♻ ☆ Self-Improvement in Language Models: The Sharpening Mechanism
+
+
+
+
+
+
+
+
+ Audrey Huang, Adam Block, Dylan J. Foster, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, Jordan T. Ash, Akshay Krishnamurthy
+
+
+ Recent work in language modeling has raised the possibility of
+self-improvement, where a language model evaluates and refines its own
+generations to achieve higher performance without external feedback. It is
+impossible for this self-improvement to create information that is not already
+in the model, so why should we expect that this will lead to improved
+capabilities? We offer a new perspective on the capabilities of
+self-improvement through a lens we refer to as sharpening. Motivated by the
+observation that language models are often better at verifying response quality
+than they are at generating correct responses, we formalize self-improvement as
+using the model itself as a verifier during post-training in order to
+``sharpen'' the model to one placing large mass on high-quality sequences,
+thereby amortizing the expensive inference-time computation of generating good
+sequences. We begin by introducing a new statistical framework for sharpening
+in which the learner aims to sharpen a pre-trained base policy via sample
+access, and establish fundamental limits. Then we analyze two natural families
+of self-improvement algorithms based on SFT and RLHF. We find that (i) the
+SFT-based approach is minimax optimal whenever the initial model has sufficient
+coverage, but (ii) the RLHF-based approach can improve over SFT-based
+self-improvement by leveraging online exploration, bypassing the need for
+coverage. Finally, we empirically validate the sharpening mechanism via
+inference-time and amortization experiments. We view these findings as a
+starting point toward a foundational understanding that can guide the design
+and evaluation of self-improvement algorithms.
+
+
+
+
+
+
+
+ ♻ ☆ Tackling Decision Processes with Non-Cumulative Objectives using
+ Reinforcement Learning
+
+
+
+
+
+
+
+
+ Maximilian Nägele, Jan Olle, Thomas Fösel, Remmy Zen, Florian Marquardt
+
+
+ Markov decision processes (MDPs) are used to model a wide variety of
+applications ranging from game playing and robotics to finance. Their optimal
+policy typically maximizes the expected sum of rewards given at each step of
+the decision process. However, a large class of problems does not fit
+straightforwardly into this framework: Non-cumulative Markov decision processes
+(NCMDPs), where instead of the expected sum of rewards, the expected value of
+an arbitrary function of the rewards is maximized. Example functions include
+the maximum of the rewards or their mean divided by their standard deviation.
+In this work, we introduce a general mapping of NCMDPs to standard MDPs. This
+allows all techniques developed to find optimal policies for MDPs, such as
+reinforcement learning or dynamic programming, to be directly applied to the
+larger class of NCMDPs. Focusing on reinforcement learning, we show
+applications in a diverse set of tasks, including classical control, portfolio
+optimization in finance, and discrete optimization problems. Given our
+approach, we can improve both final performance and training time compared to
+relying on standard MDPs.
+
+
+
+
+
+
+
+ ♻ ☆ OpenDriver: An Open-Road Driver State Detection Dataset
+
+
+ Among numerous studies for driver state detection, wearable physiological
+measurements offer a practical method for real-time monitoring. However, there
+are few driver physiological datasets in open-road scenarios, and the existing
+datasets suffer from issues such as poor signal quality, small sample sizes,
+and short data collection periods. Therefore, in this paper, a large-scale
+multimodal driving dataset, OpenDriver, for driver state detection is
+developed. The OpenDriver encompasses a total of 3,278 driving trips, with a
+signal collection duration spanning approximately 4,600 hours. Two modalities
+of driving signals are included in OpenDriver: electrocardiogram (ECG) signals
+and six-axis motion data of the steering wheel from an inertial measurement unit
+(IMU), which were recorded from 81 drivers and their vehicles. Furthermore,
+three challenging tasks are involved in our work, namely ECG signal quality
+assessment, individual biometric identification based on ECG signals, and
+physiological signal analysis in complex driving environments. To facilitate
+research in these tasks, corresponding benchmarks have also been introduced.
+First, a noisy augmentation strategy is applied to generate a larger-scale ECG
+signal dataset with realistic noise simulation for quality assessment. Second,
+an end-to-end contrastive learning framework is employed for individual
+biometric identification. Finally, a comprehensive analysis of drivers' HRV
+features under different driving conditions is conducted. Each benchmark
+provides evaluation metrics and reference results. The OpenDriver dataset will
+be publicly available at https://github.com/bdne/OpenDriver.
+
+
+
+ comment: Considering that there are flaws in the statistical data of the
+ dataset, all the authors agreed to withdraw the manuscript
+
+
+
+
+
+
+ ♻ ☆ Identifiable Representation and Model Learning for Latent Dynamic
+ Systems
+
+
+ Learning identifiable representations and models from low-level observations
+is helpful for an intelligent spacecraft to complete downstream tasks reliably.
+For temporal observations, to ensure that the data generating process is
+provably inverted, most existing works either assume the noise variables in the
+dynamic mechanisms are (conditionally) independent or require that the
+interventions can directly affect each latent variable. However, in practice,
+the relationship between the exogenous inputs/interventions and the latent
+variables may follow some complex deterministic mechanisms. In this work, we
+study the problem of identifiable representation and model learning for latent
+dynamic systems. The key idea is to use an inductive bias inspired by
+controllable canonical forms, which are sparse and input-dependent by
+definition. We prove that, for linear and affine nonlinear latent dynamic
+systems with sparse input matrices, it is possible to identify the latent
+variables up to scaling and determine the dynamic models up to some simple
+transformations. The results have the potential to provide some theoretical
+guarantees for developing more trustworthy decision-making and control methods
+for intelligent spacecraft.
+
+
+ The accurate diagnosis of machine breakdowns is crucial for maintaining
+operational safety in smart manufacturing. Despite the promise shown by deep
+learning in automating fault identification, the scarcity of labeled training
+data, particularly for equipment failure instances, poses a significant
+challenge. This limitation hampers the development of robust classification
+models. Existing methods like model-agnostic meta-learning (MAML) do not
+adequately address variable working conditions, affecting knowledge transfer.
+To address these challenges, a Related Task Aware Curriculum Meta-learning
+(RT-ACM) enhanced fault diagnosis framework is proposed in this paper, inspired
+by human cognitive learning processes. RT-ACM improves training by considering
+the relevance of auxiliary sensor working conditions, adhering to the principle
+of ``paying more attention to more relevant knowledge'', and focusing on
+``easier first, harder later'' curriculum sampling. This approach aids the
+meta-learner in achieving a superior convergence state. Extensive experiments
+on two real-world datasets demonstrate the superiority of the RT-ACM framework.
+
+
+
+
+
+
+
+ ♻ ☆ ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise
+ Perceptual Large Multimodal Model
+
+
+ Advances in CLIP and large multimodal models (LMMs) have enabled
+open-vocabulary and free-text segmentation, yet existing models still require
+predefined category prompts, limiting free-form category self-generation. Most
+segmentation LMMs also remain confined to sparse predictions, restricting their
+applicability in open-set environments. In contrast, we propose ROSE, a
+Revolutionary Open-set dense SEgmentation LMM, which enables dense mask
+prediction and open-category generation through patch-wise perception. Our
+method treats each image patch as an independent region of interest candidate,
+enabling the model to predict both dense and sparse masks simultaneously.
+Additionally, a newly designed instruction-response paradigm takes full
+advantage of the generation and generalization capabilities of LMMs, achieving
+category prediction independent of closed-set constraints or predefined
+categories. To further enhance mask detail and category precision, we introduce
+a conversation-based refinement paradigm, integrating the prediction result
+from the previous step with a textual prompt for revision. Extensive experiments
+demonstrate that ROSE achieves competitive performance across various
+segmentation tasks in a unified framework. Code will be released.
+
+
+
+
+
+
+
+ ♻ ☆ Reinforcement Learning for Finite Space Mean-Field Type Games
+
+
+
+
+
+
+
+
+ Kai Shao, Jiacheng Shen, Chijie An, Mathieu Laurière
+
+
+ Mean field type games (MFTGs) describe Nash equilibria between large
+coalitions: each coalition consists of a continuum of cooperative agents who
+maximize the average reward of their coalition while interacting
+non-cooperatively with a finite number of other coalitions. Although the theory
+has been extensively developed, we are still lacking efficient and scalable
+computational methods. Here, we develop reinforcement learning methods for such
+games in a finite space setting with general dynamics and reward functions. We
+start by proving that the MFTG solution yields approximate Nash equilibria in
+finite-size coalition games. We then propose two algorithms. The first is based
+on quantization of mean-field spaces and Nash Q-learning. We provide
+convergence and stability analysis. We then propose a deep reinforcement
+learning algorithm, which can scale to larger spaces. Numerical experiments in
+5 environments with mean-field distributions of dimension up to $200$ show the
+scalability and efficiency of the proposed method.
+
+
+
+
+
+
+
+ ♻ ☆ Chain-structured neural architecture search for financial time series
+ forecasting
+
+
+ Neural architecture search (NAS) emerged as a way to automatically optimize
+neural networks for a specific task and dataset. Despite an abundance of
+research on NAS for images and natural language applications, similar studies
+for time series data are lacking. Among NAS search spaces, chain-structured ones are
+the simplest and most applicable to small datasets like time series. We compare
+three popular NAS strategies on chain-structured search spaces: Bayesian
+optimization (specifically Tree-structured Parzen Estimator), the hyperband
+method, and reinforcement learning in the context of financial time series
+forecasting. These strategies were employed to optimize simple well-understood
+neural architectures like the MLP, 1D CNN, and RNN, with more complex temporal
+fusion transformers (TFT) and their own optimizers included for comparison. We
+find Bayesian optimization and the hyperband method performing best among the
+strategies, and RNN and 1D CNN best among the architectures, but all methods
+were very close to each other with a high variance due to the difficulty of
+working with financial datasets. We discuss our approach to overcome the
+variance and provide implementation recommendations for future users and
+researchers.
+
+
+
+ comment: This is the accepted version of the paper published in International
+ Journal of Data Science and Analytics
+
+
+
+
+
+
+ ♻ ☆ Explainable fault and severity classification for rolling element
+ bearings using Kolmogorov-Arnold networks
+
+
+ Rolling element bearings are critical components of rotating machinery, with
+their performance directly influencing the efficiency and reliability of
+industrial systems. At the same time, bearing faults are a leading cause of
+machinery failures, often resulting in costly downtime, reduced productivity,
+and, in extreme cases, catastrophic damage. This study presents a methodology
+that utilizes Kolmogorov-Arnold Networks to address these challenges through
+automatic feature selection, hyperparameter tuning and interpretable fault
+analysis within a unified framework. By training shallow network architectures
+and minimizing the number of selected features, the framework produces
+lightweight models that deliver explainable results through feature attribution
+and symbolic representations of their activation functions. Validated on two
+widely recognized datasets for bearing fault diagnosis, the framework achieved
+perfect F1-Scores for fault detection and high performance in fault and
+severity classification tasks, including 100% F1-Scores in most cases. Notably,
+it demonstrated adaptability by handling diverse fault types, such as imbalance
+and misalignment, within the same dataset. The symbolic representations
+enhanced model interpretability, while feature attribution offered insights
+into the optimal feature types or signals for each studied task. These results
+highlight the framework's potential for practical applications, such as
+real-time machinery monitoring, and for scientific research requiring efficient
+and explainable models.
+
+
+
+
+
+
+
+ ♻ ☆ The Cooperative Network Architecture: Learning Structured Networks as
+ Representation of Sensory Patterns
+
+
+
+
+
+
+
+
+ Pascal J. Sager, Jan M. Deriu, Benjamin F. Grewe, Thilo Stadelmann, Christoph von der Malsburg
+
+
+ Nets, cooperative networks of neurons, have been proposed as a format for the
+representation of sensory signals, as a physical implementation of the Gestalt
+phenomenon and as a solution to the neural binding problem, while the direct
+interaction between nets by structure-sensitive matching has been proposed as a
+basis for object-global operations such as object detection. The nets are
+flexibly composed of overlapping net fragments, which are learned from
+statistical regularities of sensory input. We here present the cooperative
+network architecture (CNA), a concrete model that learns such net structure to
+represent input patterns and deals robustly with noise, deformation, and
+out-of-distribution data, thus laying the groundwork for a novel neural
+architecture.
+
+
+
+
+
+
+
+ ♻ ☆ Local Lesion Generation is Effective for Capsule Endoscopy Image Data
+ Augmentation in a Limited Data Setting
+
+
+
+
+
+
+
+
+ Adrian B. Chłopowiec, Adam R. Chłopowiec, Krzysztof Galus, Wojciech Cebula, Martin Tabakov
+
+
+ Limited medical imaging datasets challenge deep learning models by increasing
+risks of overfitting and reduced generalization, particularly in Generative
+Adversarial Networks (GANs), where discriminators may overfit, leading to
+training divergence. This constraint also impairs classification models trained
+on small datasets. Generative Data Augmentation (GDA) addresses this by
+expanding training datasets with synthetic data, although it requires training
+a generative model. We propose and evaluate two local lesion generation
+approaches to address the challenge of augmenting small medical image datasets.
+The first approach employs the Poisson Image Editing algorithm, a classical
+image processing technique, to create realistic image composites that
+outperform current state-of-the-art methods. The second approach introduces a
+novel generative method, leveraging a fine-tuned Image Inpainting GAN to
+synthesize realistic lesions within specified regions of real training images.
+A comprehensive comparison of the two proposed methods demonstrates that
+effective local lesion generation in a data-constrained setting allows for
+reaching new state-of-the-art results in capsule endoscopy lesion
+classification. The combination of our techniques achieves a macro F1-score of
+33.07%, surpassing the previous best result by 7.84 percentage points (p.p.) on
+the highly imbalanced Kvasir Capsule Dataset, a benchmark for capsule
+endoscopy. To the best of our knowledge, this work is the first to apply a
+fine-tuned Image Inpainting GAN for GDA in medical imaging, demonstrating that
+an image-conditional GAN can be adapted effectively to limited datasets to
+generate high-quality examples, facilitating effective data augmentation.
+Additionally, we show that combining this GAN-based approach with classical
+image processing techniques further improves the results.
+
+
+
+
+
+
+
+
+ Tycho F. A. van der Ouderaa, Maximilian L. Croci, Agrin Hilmkil, James Hensman
+
+
+ Recent works on compression of large language models (LLM) using quantization
+considered reparameterizing the architecture such that weights are distributed
+on the sphere. This demonstrably improves the ability to quantize by
+increasing the mathematical notion of coherence, resulting in fewer weight
+outliers without affecting the network output. In this work, we aim to further
+exploit this spherical geometry of the weights when performing quantization by
+considering Pyramid Vector Quantization (PVQ) for large language models.
+Arranging points evenly on the sphere is notoriously difficult, especially in
+high dimensions, and even when approximate solutions exist, representing points
+explicitly in a codebook is typically not feasible due to its additional memory
+cost. Instead, PVQ uses a fixed integer lattice on the sphere by projecting
+points onto the 1-sphere, which allows for efficient encoding and decoding
+without requiring an explicit codebook in memory. To obtain a practical
+algorithm, we propose to combine PVQ with scale quantization for which we
+derive theoretically optimal quantizations, under empirically verified
+assumptions. Further, we extend pyramid vector quantization to use Hessian
+information to minimize quantization error under expected feature activations,
+instead of only relying on weight magnitudes. Experimentally, we achieve
+state-of-the-art quantization performance with a Pareto-optimal trade-off between
+performance and bits per weight and bits per activation, relative to the compared
+methods. In the weight-only setting, we find that we can quantize a Llama-3 70B
+model to 3.25 bits per weight and retain 98\% accuracy on downstream tasks.
+
+
+
+
+
+
+
+ ♻ ☆ GWQ: Gradient-Aware Weight Quantization for Large Language Models
+
+
+ Large language models (LLMs) show impressive performance in solving complex
+language tasks. However, their large number of parameters presents significant
+challenges for the deployment and application of these models on edge devices.
+Compressing large language models to low bits can enable them to run on
+resource-constrained devices, but often leads to performance degradation. To
+address this problem, we propose gradient-aware weight quantization (GWQ), the
+first quantization approach for low-bit weight quantization that leverages
+gradients to localize outliers, requiring only a minimal amount of calibration
+data for outlier detection. GWQ retains the weights corresponding to the top 1%
+outliers preferentially at FP16 precision, while the remaining non-outlier
+weights are stored in a low-bit format. We find experimentally that localizing
+sensitive weights with gradients works better than localizing them with the
+Hessian matrix. Compared to current quantization methods, GWQ can be applied to
+multiple language models and achieves lower perplexity (PPL) on the WikiText2 and
+C4 datasets. In the
+zero-shot task, GWQ quantized models have higher accuracy compared to other
+quantization methods. GWQ is also suitable for multimodal model quantization,
+and the quantized Qwen-VL family model is more accurate than other methods.
+On the zero-shot target detection dataset RefCOCO, GWQ outperforms the current
+state-of-the-art method SPQR. GWQ achieves a 1.2x inference speedup in
+comparison to the original model, and effectively reduces inference memory.
+
+
+
+
+
+
+
+ ♻ ☆ Elephants Never Forget: Memorization and Learning of Tabular Data in
+ Large Language Models
+
+
+ While many have shown how Large Language Models (LLMs) can be applied to a
+diverse set of tasks, the critical issues of data contamination and
+memorization are often glossed over. In this work, we address this concern for
+tabular data. Specifically, we introduce a variety of different techniques to
+assess whether a language model has seen a tabular dataset during training.
+This investigation reveals that LLMs have memorized many popular tabular
+datasets verbatim. We then compare the few-shot learning performance of LLMs on
+datasets that were seen during training to the performance on datasets released
+after training. We find that LLMs perform better on datasets seen during
+training, indicating that memorization leads to overfitting. At the same time,
+LLMs show non-trivial performance on novel datasets and are surprisingly robust
+to data transformations. We then investigate the in-context statistical
+learning abilities of LLMs. While LLMs are significantly better than random at
+solving statistical classification problems, the sample efficiency of few-shot
+learning lags behind traditional statistical learning algorithms, especially as
+the dimension of the problem increases. This suggests that much of the observed
+few-shot performance on novel real-world datasets is due to the LLM's world
+knowledge. Overall, our results highlight the importance of testing whether an
+LLM has seen an evaluation dataset during pre-training. We release the
+https://github.com/interpretml/LLM-Tabular-Memorization-Checker Python package
+to test LLMs for memorization of tabular datasets.
+
+
+
+ comment: COLM camera ready, fix typo
+
+
+
+
+
+
+ ♻ ☆ One Step Learning, One Step Review AAAI
+
+
+ Visual fine-tuning has garnered significant attention with the rise of
+pre-trained vision models. The current prevailing method, full fine-tuning,
+suffers from the issue of knowledge forgetting as it focuses solely on fitting
+the downstream training set. In this paper, we propose a novel weight
+rollback-based fine-tuning method called OLOR (One step Learning, One step
+Review). OLOR combines fine-tuning with optimizers, incorporating a weight
+rollback term into the weight update term at each step. This ensures
+consistency in the weight range of upstream and downstream models, effectively
+mitigating knowledge forgetting and enhancing fine-tuning performance. In
+addition, a layer-wise penalty is presented to employ penalty decay and the
+diversified decay rate to adjust the weight rollback levels of layers for
+adapting to varying downstream tasks. Through extensive experiments on various
+tasks such as image classification, object detection, semantic segmentation,
+and instance segmentation, we demonstrate the general applicability and
+state-of-the-art performance of our proposed OLOR. Code is available at
+https://github.com/rainbow-xiao/OLOR-AAAI-2024.
+
+
+
+ comment: Published at the 38th AAAI Conference on Artificial Intelligence
+ (AAAI 2024)
+
+
+
+
+
+
+ ♻ ☆ A path-norm toolkit for modern networks: consequences, promises and
+ challenges
+
+
+
+
+
+
+
+
+ Antoine Gonon, Nicolas Brisebarre, Elisa Riccietti, Rémi Gribonval
+
+
+ This work introduces the first toolkit around path-norms that fully
+encompasses general DAG ReLU networks with biases, skip connections and any
+operation based on the extraction of order statistics: max pooling, GroupSort
+etc. This toolkit notably allows us to establish generalization bounds for
+modern neural networks that are not only the most widely applicable path-norm
+based ones, but also recover or beat the sharpest known bounds of this type.
+These extended path-norms further enjoy the usual benefits of path-norms: ease
+of computation, invariance under the symmetries of the network, and improved
+sharpness on layered fully-connected networks compared to the product of
+operator norms, another complexity measure most commonly used.
+ The versatility of the toolkit and its ease of implementation allow us to
+challenge the concrete promises of path-norm-based generalization bounds, by
+numerically evaluating the sharpest known bounds for ResNets on ImageNet.
+
+
+
+ comment: Erratum: in the published version there was a typo in the definition
+ of the activation matrix in Definition A.3. This is fixed with this new
+ version
+
+
+
+
+
+
+ ♻ ☆ Knowledge Mechanisms in Large Language Models: A Survey and Perspective EMNLP 2024
+
+
+ Understanding knowledge mechanisms in Large Language Models (LLMs) is crucial
+for advancing towards trustworthy AGI. This paper reviews knowledge mechanism
+analysis from a novel taxonomy including knowledge utilization and evolution.
+Knowledge utilization delves into the mechanism of memorization, comprehension
+and application, and creation. Knowledge evolution focuses on the dynamic
+progression of knowledge within individual and group LLMs. Moreover, we discuss
+what knowledge LLMs have learned, the reasons for the fragility of parametric
+knowledge, and the potential dark knowledge (hypothesis) that will be
+challenging to address. We hope this work can help understand knowledge in LLMs
+and provide insights for future research.
+
+
+
+ comment: EMNLP 2024 Findings; 39 pages (v4)
+
+
+
+
+
+
+ ♻ ☆ Exploration of Parameter Spaces Assisted by Machine Learning
+
+
+
+
+
+
+
+
+ A. Hammad, Myeonghun Park, Raymundo Ramos, Pankaj Saha
+
+
+ We demonstrate two sampling procedures assisted by machine learning models
+via regression and classification. The main objective is the use of a neural
+network to suggest points likely inside regions of interest, reducing the
+number of evaluations of time consuming calculations. We compare results from
+this approach with results from other sampling methods, namely Markov chain
+Monte Carlo and MultiNest, obtaining results that range from comparably similar
+to arguably better. In particular, we augment our classifier method with a
+boosting technique that rapidly increases the efficiency within a few
+iterations. We show results from our methods applied to a toy model and the
+type II 2HDM, using 3 and 7 free parameters, respectively. The code used for
+this paper and instructions are publicly available on the web.
+
+
+
+ comment: 30 pages, 9 figures. Matches published version. Code and instructions
+ are available on https://github.com/AHamamd150/MLscanner
+
+
+
+
+
+
+ ♻ ☆ Learning Developmental Age from 3D Infant Kinetics Using Adaptive Graph
+ Neural Networks
+
+
+
+
+
+
+
+
+ Daniel Holmberg, Manu Airaksinen, Viviana Marchi, Andrea Guzzetta, Anna Kivi, Leena Haataja, Sampsa Vanhatalo, Teemu Roos
+
+
+ Reliable methods for the neurodevelopmental assessment of infants are
+essential for early detection of problems that may need prompt interventions.
+Spontaneous motor activity, or 'kinetics', is shown to provide a powerful
+surrogate measure of upcoming neurodevelopment. However, its assessment is by
+and large qualitative and subjective, focusing on visually identified,
+age-specific gestures. In this work, we introduce Kinetic Age (KA), a novel
+data-driven metric that quantifies neurodevelopmental maturity by predicting an
+infant's age based on their movement patterns. KA offers an interpretable and
+generalizable proxy for motor development. Our method leverages 3D video
+recordings of infants, processed with pose estimation to extract
+spatio-temporal series of anatomical landmarks, which are released as a new
+openly available dataset. These data are modeled using adaptive graph
+convolutional networks, able to capture the spatio-temporal dependencies in
+infant movements. We also show that our data-driven approach achieves
+improvement over traditional machine learning baselines based on manually
+engineered features.
+
+
+
+ comment: 15 pages, 9 figures. Code repository available via
+ https://github.com/deinal/infant-aagcn
+
+ With the rapid advancement of diffusion-based generative models, portrait
+image animation has achieved remarkable results. However, it still faces
+challenges in temporally consistent video generation and fast sampling due to
+its iterative sampling nature. This paper presents FLOAT, an audio-driven
+talking portrait video generation method based on flow matching generative
+model. We shift the generative modeling from the pixel-based latent space to a
+learned motion latent space, enabling efficient design of temporally consistent
+motion. To achieve this, we introduce a transformer-based vector field
+predictor with a simple yet effective frame-wise conditioning mechanism.
+Additionally, our method supports speech-driven emotion enhancement, enabling a
+natural incorporation of expressive motions. Extensive experiments demonstrate
+that our method outperforms state-of-the-art audio-driven talking portrait
+methods in terms of visual quality, motion fidelity, and efficiency.
+
+
+
+
+
+
+
+ ♻ ☆ Adaptive Dense Reward: Understanding the Gap Between Action and Reward
+ Space in Alignment
+
+
+
+
+
+
+
+
+ Yanshi Li, Shaopan Xiong, Gengru Chen, Xiaoyang Li, Yijia Luo, Xingyao Zhang, Yanhui Huang, Xingyuan Bu, Yingshui Tan, Chun Yuan, Jiamang Wang, Wenbo Su, Bo Zheng
+
+
+ Reinforcement Learning from Human Feedback (RLHF) has proven highly effective
+in aligning Large Language Models (LLMs) with human preferences. However, the
+original RLHF typically optimizes under an overall reward, which can lead to a
+suboptimal learning process. This limitation stems from RLHF's lack of
+awareness regarding which specific tokens should be reinforced or suppressed.
+Moreover, conflicts in supervision can arise, for instance, when a chosen
+response includes erroneous tokens, while a rejected response contains accurate
+elements. To rectify these shortcomings, a growing number of dense reward methods,
+such as step-wise and token-wise RLHF, have been proposed. However, these existing
+methods are limited to specific tasks (like mathematics). In this paper, we
+propose the ``Adaptive Message-wise RLHF'' method, which robustly applies to
+various tasks. By defining pivot tokens as key indicators, our approach
+adaptively identifies essential information and converts sequence-level
+supervision into fine-grained, subsequence-level supervision. This aligns the
+density of rewards and action spaces more closely with the information density
+of the input. Experiments demonstrate that our method can be integrated into
+various training methods, significantly mitigating hallucinations and
+catastrophic forgetting problems, while outperforming other methods on multiple
+evaluation metrics. Our method improves the success rate on adversarial samples
+by 10% compared to the sample-wise approach, and achieves a 1.3% improvement
+on evaluation benchmarks such as MMLU, GSM8K, HumanEval, etc.
+
+
+
+
+
+
+
+ ♻ ☆ Graph Pooling by Local Cluster Selection
+
+
+ Graph pooling is a family of operations which take graphs as input and
+produce shrunken graphs as output. Modern graph pooling methods are trainable
+and, in general, inserted in Graph Neural Network (GNN) architectures as graph
+shrinking operators along the (deep) processing pipeline. This work proposes a
+novel procedure for pooling graphs, along with a node-centred graph pooling
+operator.
+
+
+
+ comment: 11 pages, 4 figures
+
+
+
+
+
+
+ ♻ ☆ Provably Mitigating Overoptimization in RLHF: Your SFT Loss is
+ Implicitly an Adversarial Regularizer
+
+
+ Aligning generative models with human preference via RLHF typically suffers
+from overoptimization, where an imperfectly learned reward model can misguide
+the generative model to output undesired responses. We investigate this problem
+in a principled manner by identifying the source of the misalignment as a form
+of distributional shift and uncertainty in learning human preferences. To
+mitigate overoptimization, we first propose a theoretical algorithm that
+chooses the best policy for an adversarially chosen reward model; one that
+simultaneously minimizes the maximum likelihood estimation of the loss and a
+reward penalty term. Here, the reward penalty term is introduced to prevent the
+policy from choosing actions with spurious high proxy rewards, resulting in
+provable sample efficiency of the algorithm under a partial coverage style
+condition. Moving from theory to practice, the proposed algorithm further
+enjoys an equivalent but surprisingly easy-to-implement reformulation. Using
+the equivalence between reward models and the corresponding optimal policy, the
+algorithm features a simple objective that combines: (i) a preference
+optimization loss that directly aligns the policy with human preference, and
+(ii) a supervised learning loss that explicitly imitates the policy with a
+(suitable) baseline distribution. In the context of aligning large language
+models (LLM), this objective fuses the direct preference optimization (DPO)
+loss with the supervised fine-tuning (SFT) loss to help mitigate the
+overoptimization towards undesired responses; we name the resulting algorithm
+Regularized Preference Optimization (RPO). Experiments on aligning LLMs
+demonstrate the improved performance of RPO compared with DPO baselines. Our
+work sheds light on the interplay between preference optimization and SFT in
+tuning LLMs with both theoretical guarantees and empirical evidence.
+
+
+
+ comment: Accepted by The Thirty-Eighth Annual Conference on Neural Information
+ Processing Systems. 31 pages, 7 figures
+
+ DNA-encoded library (DEL) screening has revolutionized the detection of
+protein-ligand interactions through read counts, enabling rapid exploration of
+vast chemical spaces. However, noise in read counts, stemming from nonspecific
+interactions, can mislead this exploration process. We present DEL-Ranking, a
+novel distribution-correction denoising framework that addresses these
+challenges. Our approach introduces two key innovations: (1) a novel ranking
+loss that rectifies relative magnitude relationships between read counts,
+enabling the learning of causal features determining activity levels, and (2)
+an iterative algorithm employing self-training and consistency loss to
+establish model coherence between activity label and read count predictions.
+Furthermore, we contribute three new DEL screening datasets, the first to
+comprehensively include multi-dimensional molecular representations,
+protein-ligand enrichment values, and their activity labels. These datasets
+mitigate data scarcity issues in AI-driven DEL screening research. Rigorous
+evaluation on diverse DEL datasets demonstrates DEL-Ranking's superior
+performance across multiple correlation metrics, with significant improvements
+in binding affinity prediction accuracy. Our model exhibits zero-shot
+generalization ability across different protein targets and successfully
+identifies potential motifs determining compound binding affinity. This work
+advances DEL screening analysis and provides valuable resources for future
+research in this area.
+
+
+
+
+
+
+
+ ♻ ☆ One Initialization to Rule them All: Fine-tuning via Explained Variance
+ Adaptation
+
+
+
+
+
+
+
+
+ Fabian Paischer, Lukas Hauzenberger, Thomas Schmied, Benedikt Alkin, Marc Peter Deisenroth, Sepp Hochreiter
+
+
+ Foundation models (FMs) are pre-trained on large-scale datasets and then
+fine-tuned on a downstream task for a specific application. The most successful
+and most commonly used fine-tuning method is to update the pre-trained weights
+via a low-rank adaptation (LoRA). LoRA introduces new weight matrices that are
+usually initialized at random with a uniform rank distribution across the model
+weights. Recent works focus on different initialization schemes or the learning
+of adaptive ranks during fine-tuning. Both approaches have only been
+investigated in isolation, resulting in slow convergence or a uniform rank
+distribution, in turn leading to suboptimal performance. We propose to improve
+LoRA by initializing the new weights in a data-driven manner by computing
+singular value decomposition (SVD) on minibatches of activation vectors. Then,
+we initialize the LoRA matrices with the obtained right-singular vectors and
+redistribute ranks among all weight matrices to provably store the maximum
+amount of information of the downstream data in the newly introduced weights.
+In this way, only what information to maintain or neglect during the
+fine-tuning process needs to be learned. We call our new method Explained
+Variance Adaptation (EVA). We apply EVA to a variety of fine-tuning tasks
+ranging from language generation and understanding to image classification and
+reinforcement learning. EVA exhibits faster convergence than competitors and
+achieves the highest average score across a multitude of tasks per domain while
+reducing the number of trainable parameters through rank redistribution.
+
+
+
+ comment: 11 pages + references and appendix, code available at
+ https://github.com/ml-jku/EVA
+
+
+
+
+
+
+ ♻ ☆ On Privacy, Security, and Trustworthiness in Distributed Wireless Large
+ AI Models (WLAM)
+
+
+ Combining wireless communication with large artificial intelligence (AI)
+models can open up a myriad of novel application scenarios. In sixth generation
+(6G) networks, ubiquitous communication and computing resources allow large AI
+models to offer democratized large-AI-model services that enable real-time
+applications like autonomous vehicles, smart cities, and Internet of Things
+(IoT) ecosystems. However, the security considerations and sustainable
+communication resources limit the deployment of large AI models over
+distributed wireless networks. This paper provides a comprehensive overview of
+privacy, security, and trustworthiness for distributed wireless large AI models
+(WLAM). In particular, a detailed privacy and security analysis for
+distributed WLAM is first presented. The classifications and theoretical findings
+about privacy and security in distributed WLAM are discussed. Then the
+trustworthiness and ethics of implementing distributed WLAM are described.
+Finally, the comprehensive applications of distributed WLAM are presented in
+the context of electromagnetic signal processing.
+
+
+
+ comment: 12 pages, 4 figures
+
+
+
+
+
+
+ ♻ ☆ CryoFM: A Flow-based Foundation Model for Cryo-EM Densities
+
+
+ Cryo-electron microscopy (cryo-EM) is a powerful technique in structural
+biology and drug discovery, enabling the study of biomolecules at high
+resolution. Significant advancements by structural biologists using cryo-EM
+have led to the production of over 38,626 protein density maps at various
+resolutions. However, cryo-EM data processing algorithms have yet to fully
+benefit from our knowledge of biomolecular density maps, with only a few recent
+models being data-driven but limited to specific tasks. In this study, we
+present CryoFM, a foundation model designed as a generative model, learning the
+distribution of high-quality density maps and generalizing effectively to
+downstream tasks. Built on flow matching, CryoFM is trained to accurately
+capture the prior distribution of biomolecular density maps. Furthermore, we
+introduce a flow posterior sampling method that leverages CryoFM as a flexible
+prior for several downstream tasks in cryo-EM and cryo-electron tomography
+(cryo-ET) without the need for fine-tuning, achieving state-of-the-art
+performance on most tasks and demonstrating its potential as a foundational
+model for broader applications in these fields.
+
+
+
+
+
+
+
+
+ Fu-Yun Wang, Zhaoyang Huang, Alexander William Bergman, Dazhong Shen, Peng Gao, Michael Lingelbach, Keqiang Sun, Weikang Bian, Guanglu Song, Yu Liu, Xiaogang Wang, Hongsheng Li
+
+
+ Consistency Models (CMs) have made significant progress in accelerating the
+generation of diffusion models. However, their application to high-resolution,
+text-conditioned image generation in the latent space remains unsatisfactory.
+In this paper, we identify three key flaws in the current design of Latent
+Consistency Models (LCMs). We investigate the reasons behind these limitations
+and propose Phased Consistency Models (PCMs), which generalize the design space
+and address the identified limitations. Our evaluations demonstrate that PCMs
+outperform LCMs across 1--16 step generation settings. While PCMs are
+specifically designed for multi-step refinement, they achieve 1-step generation
+results comparable to previous state-of-the-art methods designed specifically
+for 1-step generation. Furthermore, we show the methodology of PCMs is versatile and
+applicable to video generation, enabling us to train the state-of-the-art
+few-step text-to-video generator. Our code is available at
+https://github.com/G-U-N/Phased-Consistency-Model.
+
+
+
+
+
+
+
+
+ Siddhant Dutta, Nouhaila Innan, Sadok Ben Yahia, Muhammad Shafique, David Esteban Bernal Neira
+
+
+ The integration of fully homomorphic encryption (FHE) in federated learning
+(FL) has led to significant advances in data privacy. However, during the
+aggregation phase, it often results in performance degradation of the
+aggregated model, hindering the development of robust representational
+generalization. In this work, we propose a novel multimodal quantum federated
+learning framework that utilizes quantum computing to counteract the
+performance drop resulting from FHE. For the first time in FL, our framework
+combines a multimodal quantum mixture of experts (MQMoE) model with FHE,
+incorporating multimodal datasets for enriched representation and task-specific
+learning. Our MQMoE framework enhances performance on multimodal datasets and
+combined genomics and brain MRI scans, especially for underrepresented
+categories. Our results also demonstrate that the quantum-enhanced approach
+mitigates the performance degradation associated with FHE and improves
+classification accuracy across diverse datasets, validating the potential of
+quantum interventions in enhancing privacy in FL.
+
+
+ Diffusion models achieve superior generation quality but suffer from slow
+generation speed due to the iterative nature of denoising. In contrast,
+consistency models, a new generative family, achieve competitive performance
+with significantly faster sampling. These models are trained either through
+consistency distillation, which leverages pretrained diffusion models, or
+consistency training/tuning directly from raw data. In this work, we propose a
+novel framework for understanding consistency models by modeling the denoising
+process of the diffusion model as a Markov Decision Process (MDP) and framing
+consistency model training as the value estimation through Temporal
+Difference (TD) Learning. More importantly, this framework allows us to analyze
+the limitations of current consistency training/tuning strategies. Built upon
+Easy Consistency Tuning (ECT), we propose Stable Consistency Tuning (SCT),
+which incorporates variance-reduced learning using the score identity. SCT
+leads to significant performance improvements on benchmarks such as CIFAR-10
+and ImageNet-64. On ImageNet-64, SCT achieves 1-step FID 2.42 and 2-step FID
+1.55, a new SoTA for consistency models.
+
+
+
+ comment: Code is available at
+ https://github.com/G-U-N/Stable-Consistency-Tuning
+
+ The recent surge in contrast-based graph self-supervised learning has
+prominently featured an intensified exploration of spectral cues. Spectral
+augmentation, which involves modifying a graph's spectral properties such as
+eigenvalues or eigenvectors, is widely believed to enhance model performance.
+However, an intriguing paradox emerges, as methods grounded in seemingly
+conflicting assumptions regarding the spectral domain demonstrate notable
+enhancements in learning performance. Through extensive empirical studies, we
+find that simple edge perturbations - random edge dropping for node-level and
+random edge adding for graph-level self-supervised learning - consistently
+yield comparable or superior performance while being significantly more
+computationally efficient. This suggests that the computational overhead of
+sophisticated spectral augmentations may not justify their practical benefits.
+Our theoretical analysis of the InfoNCE loss bounds for shallow GNNs further
+supports this observation. The proposed insights represent a significant leap
+forward in the field, potentially refining the understanding and implementation
+of graph self-supervised learning.
+
+
+
+
+
+
+
+ ♻ ☆ AED-PADA: Improving Generalizability of Adversarial Example Detection via
+ Principal Adversarial Domain Adaptation
+
+
+ Adversarial example detection, which can be conveniently applied in many
+scenarios, is important in the area of adversarial defense. Unfortunately,
+existing detection methods suffer from poor generalization performance, because
+their training process usually relies on the examples generated from a single
+known adversarial attack and there exists a large discrepancy between the
+training and unseen testing adversarial examples. To address this issue, we
+propose a novel method, named Adversarial Example Detection via Principal
+Adversarial Domain Adaptation (AED-PADA). Specifically, our approach identifies
+the Principal Adversarial Domains (PADs), i.e., a combination of features of
+the adversarial examples generated by different attacks, which possesses a
+large portion of the entire adversarial feature space. Subsequently, we are the
+first to exploit Multi-source Unsupervised Domain Adaptation in adversarial example
+detection, with PADs as the source domains. Experimental results demonstrate
+the superior generalization ability of our proposed AED-PADA. Note that this
+superiority is particularly achieved in challenging scenarios characterized by
+employing the minimal magnitude constraint for the perturbations.
+
+
+
+
+
+
+
+ ♻ ☆ Force-Guided Bridge Matching for Full-Atom Time-Coarsened Dynamics of
+ Peptides
+
+
+
+
+
+
+
+
+ Ziyang Yu, Wenbing Huang, Yang Liu
+
+
+ Molecular Dynamics (MD) is crucial in various fields such as materials
+science, chemistry, and pharmacology, to name a few. Conventional MD software
+struggles with the balance between time cost and prediction accuracy, which
+restricts its wider application. Recently, data-driven approaches based on deep
+generative models have been devised for time-coarsened dynamics, which aim at
+learning dynamics of diverse molecular systems over a long timestep, enjoying
+both universality and efficiency. Nevertheless, most current methods are
+designed solely to learn from the data distribution regardless of the
+underlying Boltzmann distribution, and the physics priors such as energies and
+forces are constantly overlooked. In this work, we propose a conditional
+generative model called Force-guided Bridge Matching (FBM), which learns
+full-atom time-coarsened dynamics and targets the Boltzmann-constrained
+distribution. With the guidance of our delicately-designed intermediate force
+field, FBM leverages favourable physics priors into the generation process,
+giving rise to enhanced simulations. Experiments on two datasets consisting of
+peptides verify our superiority in terms of comprehensive metrics and
+demonstrate transferability to unseen systems.
+
+
+
+
+
+
+
+ ♻ ☆ SurvMamba: State Space Model with Multi-grained Multi-modal Interaction
+ for Survival Prediction
+
+
+ Multi-modal learning that combines pathological images with genomic data has
+significantly enhanced the accuracy of survival prediction. Nevertheless,
+existing methods have not fully utilized the inherent hierarchical structure
+within both whole slide images (WSIs) and transcriptomic data, from which
+better intra-modal representations and inter-modal integration could be
+derived. Moreover, many existing studies attempt to improve multi-modal
+representations through attention mechanisms, which inevitably lead to high
+complexity when processing high-dimensional WSIs and transcriptomic data.
+Recently, a structured state space model named Mamba emerged as a promising
+approach for its superior performance in modeling long sequences with low
+complexity. In this study, we propose Mamba with multi-grained multi-modal
+interaction (SurvMamba) for survival prediction. SurvMamba is implemented with
+a Hierarchical Interaction Mamba (HIM) module that facilitates efficient
+intra-modal interactions at different granularities, thereby capturing more
+detailed local features as well as rich global representations. In addition, an
+Interaction Fusion Mamba (IFM) module is used for cascaded inter-modal
+interactive fusion, yielding more comprehensive features for survival
+prediction. Comprehensive evaluations on five TCGA datasets demonstrate that
+SurvMamba outperforms other existing methods in terms of performance and
+computational cost.
+
+
+
+
+
+
+
+ ♻ ☆ RelCon: Relative Contrastive Learning for a Motion Foundation Model for
+ Wearable Data
+
+
+
+
+
+
+
+
+ Maxwell A. Xu, Jaya Narain, Gregory Darnell, Haraldur Hallgrimsson, Hyewon Jeong, Darren Forde, Richard Fineman, Karthik J. Raghuram, James M. Rehg, Shirley Ren
+
+
+ We present RelCon, a novel self-supervised *Rel*ative *Con*trastive learning
+approach that uses a learnable distance measure in combination with a softened
+contrastive loss for training a motion foundation model from wearable sensors.
+The learnable distance measure captures motif similarity and domain-specific
+semantic information such as rotation invariance. The learned distance provides
+a measurement of semantic similarity between a pair of accelerometer
+time-series segments, which is used to measure the distance between an anchor
+and various other sampled candidate segments. The self-supervised model is
+trained on 1 billion segments from 87,376 participants from a large wearables
+dataset. The model achieves strong performance across multiple downstream
+tasks, encompassing both classification and regression. To our knowledge, we
+are the first to show the generalizability of a self-supervised learning model
+with motion data from wearables across distinct evaluation tasks.
+
+
+ Relational learning is an essential task in the domain of knowledge
+representation, particularly in knowledge graph completion (KGC). While
+relational learning in traditional single-modal settings has been extensively
+studied, exploring it within a multimodal KGC context presents distinct
+challenges and opportunities. One of the major challenges is inference on newly
+discovered relations without any associated training data. This zero-shot
+relational learning scenario poses unique requirements for multimodal KGC,
+i.e., utilizing multimodality to facilitate relational learning. However,
+existing works fail to leverage multimodal information and leave
+the problem unexplored. In this paper, we propose a novel end-to-end framework,
+consisting of three components, i.e., multimodal learner, structure
+consolidator, and relation embedding generator, to integrate diverse multimodal
+information and knowledge graph structures to facilitate the zero-shot
+relational learning. Evaluation results on three multimodal knowledge graphs
+demonstrate the superior performance of our proposed method.
+
+
+
+ comment: In the Proceedings of the 2024 IEEE International Conference on Big
+ Data (IEEE BigData 2024)
+
+
+
+
+
+
+ ♻ ☆ COVID-19 Probability Prediction Using Machine Learning: An Infectious
+ Approach
+
+
+
+
+
+
+
+
+ Mohsen Asghari Ilani, Saba Moftakhar Tehran, Ashkan Kavei, Arian Radmehr
+
+
+ The ongoing COVID-19 pandemic continues to pose significant challenges to
+global public health, despite the widespread availability of vaccines. Early
+detection of the disease remains paramount in curbing its transmission and
+mitigating its impact on public health systems. In response, this study delves
+into the application of advanced machine learning (ML) techniques for
+predicting COVID-19 infection probability. We conducted a rigorous
+investigation into the efficacy of various ML models, including XGBoost, LGBM,
+AdaBoost, Logistic Regression, Decision Tree, RandomForest, CatBoost, KNN, and
+Deep Neural Networks (DNN). Leveraging a dataset comprising 4000 samples, with
+3200 allocated for training and 800 for testing, our experiment offers
+comprehensive insights into the performance of these models in COVID-19
+prediction. Our findings reveal that Deep Neural Networks (DNN) emerge as the
+top-performing model, exhibiting superior accuracy and recall metrics. With an
+impressive accuracy rate of 89%, DNN demonstrates remarkable potential in early
+COVID-19 detection. This underscores the efficacy of deep learning approaches
+in leveraging complex data patterns to identify COVID-19 infections accurately.
+This study underscores the critical role of machine learning, particularly deep
+learning methodologies, in augmenting early detection efforts amidst the
+ongoing pandemic. The success of DNN in accurately predicting COVID-19
+infection probability highlights the importance of continued research and
+development in leveraging advanced technologies to combat infectious diseases.
+
+
+
+
+
+
+
+
+ Rafael F. Oliveira, Gladston J. P. Moreira, Vander L. S. Freitas, Eduardo J. S. Luz
+
+
+ Arrhythmias, detectable through electrocardiograms (ECGs), pose significant
+health risks, underscoring the need for accurate and efficient automated
+detection techniques. While recent advancements in graph-based methods have
+demonstrated potential to enhance arrhythmia classification, the challenge lies
+in effectively representing ECG signals as graphs. This study investigates the
+use of Visibility Graph (VG) and Vector Visibility Graph (VVG) representations
+combined with Graph Convolutional Networks (GCNs) for arrhythmia classification
+under the ANSI/AAMI standard, ensuring reproducibility and fair comparison with
+other techniques. Through extensive experiments on the MIT-BIH dataset, we
+evaluate various GCN architectures and preprocessing parameters. Our findings
+demonstrate that VG and VVG mappings enable GCNs to classify arrhythmias
+directly from raw ECG signals, without the need for preprocessing or noise
+removal. Notably, VG offers superior computational efficiency, while VVG
+delivers enhanced classification performance by leveraging additional lead
+features. The proposed approach outperforms baseline methods in several
+metrics, although challenges persist in classifying the supraventricular
+ectopic beat (S) class, particularly under the inter-patient paradigm.
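As an illustration of the graph mapping mentioned above, here is a minimal sketch of the standard natural visibility graph construction for a one-dimensional signal. It assumes the usual visibility criterion (two samples are linked when every sample between them lies strictly below the straight line joining them). The abstract does not state the exact variant or parameters used, and the function name and toy signal are hypothetical.

```python
from itertools import combinations


def natural_visibility_edges(series):
    """Return the edge set of the natural visibility graph of a 1-D series."""
    edges = set()
    for a, b in combinations(range(len(series)), 2):
        ya, yb = series[a], series[b]
        # Samples a and b "see" each other if every sample strictly between
        # them lies below the straight line joining (a, ya) and (b, yb).
        visible = all(
            series[c] < yb + (ya - yb) * (b - c) / (b - a)
            for c in range(a + 1, b)
        )
        if visible:
            edges.add((a, b))
    return edges


# Hypothetical toy signal with a single sharp peak; not real ECG data.
toy_beat = [0.1, 0.3, 1.2, 0.2, 0.4]
edges = natural_visibility_edges(toy_beat)
```

The resulting edge list could then serve as the adjacency structure fed to a graph neural network. This brute-force version checks every pair and is only meant for short segments.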
+
+
+
+
+
+
+
+ ♻ ☆ Breast Cancer Classification Using Gradient Boosting Algorithms Focusing
+ on Reducing the False Negative and SHAP for Explainability
+
+
+ Cancer is one of the diseases that kill the most women in the world, with
+breast cancer being responsible for the highest number of cancer cases and
+consequently deaths. However, it can be prevented by early detection and,
+consequently, early treatment. Any development in the detection or prediction of
+this kind of cancer is important for a healthier life. Many studies focus on a
+model with high accuracy in cancer prediction, but accuracy alone may
+not always be a reliable metric. This study employs an investigative approach
+to studying the performance of different machine learning algorithms based on
+boosting to predict breast cancer focusing on the recall metric. Boosting
+machine learning algorithms has been proven to be an effective tool for
+detecting medical diseases. The dataset of the University of California, Irvine
+(UCI) repository has been utilized to train and test the model classifier that
+contains their attributes. The main objective of this study is to use
+state-of-the-art boosting algorithms such as AdaBoost, XGBoost, CatBoost and
+LightGBM to predict and diagnose breast cancer and to find the most effective
+metric regarding recall, ROC-AUC, and confusion matrix. Furthermore, our study
+is the first to use these four boosting algorithms with Optuna, a library for
+hyperparameter optimization, and the SHAP method to improve the
+interpretability of our model, which can be used as a support to identify and
+predict breast cancer. We were able to improve AUC or recall for all the models
+and reduce the false negatives for AdaBoost and LightGBM; the final AUC was more
+than 99.41% for all models.
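To make the point about accuracy versus recall concrete, the following sketch computes both from confusion-matrix counts. The counts are hypothetical and chosen only to show how a classifier can look accurate while missing half of the positive cases; they are not results from the paper.

```python
def recall(tp, fn):
    """Recall (sensitivity): share of actual positives that are detected."""
    if tp + fn == 0:
        raise ValueError("recall is undefined without positive cases")
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Accuracy: share of all predictions that are correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("accuracy is undefined for an empty confusion matrix")
    return (tp + tn) / total


# Hypothetical counts for an imbalanced screening set (not from the paper):
# 10 malignant cases, of which only 5 are detected, and 90 benign cases.
tp, fn, tn, fp = 5, 5, 90, 0
toy_accuracy = accuracy(tp, tn, fp, fn)
toy_recall = recall(tp, fn)
```

With these counts the accuracy is high while the recall is only one half, which is why reducing false negatives is tracked separately.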
+
+
+
+ comment: 9 pages, 16 figures
+
+
+
+
+
+
+ ♻ ☆ CGGM: A conditional graph generation model with adaptive sparsity for
+ node anomaly detection in IoT networks
+
+
+
+
+
+
+
+
+ Munan Li, Xianshi Su, Runze Ma, Tongbang Jiang, Zijian Li, Tony Q. S. Quek
+
+
+ Dynamic graphs are extensively employed for detecting anomalous behavior in
+nodes within the Internet of Things (IoT). Graph generative models are often
+used to address the issue of imbalanced node categories in dynamic graphs.
+Nevertheless, the constraints they face include the monotonicity of adjacency
+relationships, the difficulty in constructing multi-dimensional features for
+nodes, and the lack of a method for end-to-end generation of multiple
+categories of nodes. In this paper, we propose a novel graph generation model,
+called CGGM, specifically for generating samples belonging to the minority
+class. The framework consists of two core modules: a conditional graph generation
+module and a graph-based anomaly detection module. The generative module adapts
+to the sparsity of the matrix by downsampling a noise adjacency matrix, and
+incorporates a multi-dimensional feature encoder based on multi-head
+self-attention to capture latent dependencies among features. Additionally, a
+latent space constraint is combined with the distribution distance to
+approximate the latent distribution of real data. The graph-based anomaly
+detection module utilizes the generated balanced dataset to predict the node
+behaviors. Extensive experiments have shown that CGGM outperforms the
+state-of-the-art methods in terms of accuracy and divergence. The results also
+demonstrate that CGGM can generate diverse data categories, thereby enhancing the
+performance of multi-category classification tasks.
+
+
+ While there has been progress towards aligning Large Language Models (LLMs)
+with human values and ensuring safe behaviour at inference time, safety-guards
+can easily be removed when fine-tuned on unsafe and harmful datasets. While this
+setting has been treated extensively, another popular training paradigm,
+learning from unsafe feedback with reinforcement learning, has previously been
+unexplored. This is concerning due to the widespread deployment of feedback
+collection systems. We address this gap by providing an analysis of learning
+settings where feedback is adversarial and noisy, i.e., unsafe samples are
+preferred over safe ones despite model developers' goal to maintain safety. We
+find that safety-aligned LLMs easily explore unsafe action spaces through
+generating harmful text and optimize for adversarial reward, indicating that
+current safety guards are not enough to prevent learning from unsafe feedback.
+In order to protect against this vulnerability, we adapt a number of both
+"implicit" and "explicit" harmful fine-tuning defences to evaluate whether they
+are effective as learning constraints in an RL setting, finding that no method
+is generally effective, which points to the need for more research in defences given
+the widespread adoption of methods designed to learn from feedback. We end the
+paper with the observation that some defences work by performing "harmless
+reward hacking" for which we provide a theoretical explanation drawn from the
+theory of Constrained Markov Decision Processes and provide some direction for
+future defence development.
+
+
+ Tangible User Interfaces (TUI) for human--computer interaction (HCI) provide
+the user with physical representations of digital information with the aim to
+overcome the limitations of screen-based interfaces. Although many compelling
+demonstrations of TUIs exist in the literature, there is a lack of research on
+TUIs intended for daily two-handed tasks and processes, such as cooking. In
+response to this gap, we propose SPICE (Smart Projection Interface for Cooking
+Enhancement). SPICE investigates TUIs in a kitchen setting, aiming to transform
+the recipe following experience from simply text-based to tangibly interactive.
+SPICE includes a tracking system, an agent-based software, and vision large
+language models to create and interpret a kitchen environment where recipe
+information is projected directly onto the cooking surface. We conducted a
+comparative usability study of SPICE and text-based recipe following with 30
+participants, assessing the task difficulty, total duration, and efficiency, as
+well as user confidence and taste perception. The results indicate that SPICE
+allowed participants to perform the recipe with fewer stops and in less time
+while also improving self-reported efficiency, confidence, and taste. Despite
+this, participants self-reported no change in overall difficulty, which is a
+direction for future research. Overall, the SPICE project demonstrates the
+potential of using TUIs to improve everyday activities, paving the way for
+future research in HCI and new computing interfaces.
+
+
+
+ comment: Article submitted to IUI 2025
+
+
+
+
+
+
+ ☆ Who Brings the Frisbee: Probing Hidden Hallucination Factors in Large
+ Vision-Language Model via Causality Analysis WACV2025
+
+
+ Recent advancements in large vision-language models (LVLM) have significantly
+enhanced their ability to comprehend visual inputs alongside natural language.
+However, a major challenge in their real-world application is hallucination,
+where LVLMs generate non-existent visual elements, eroding user trust. The
+underlying mechanism driving this multimodal hallucination is poorly
+understood. Minimal research has illuminated whether contexts such as sky,
+tree, or grass field involve the LVLM in hallucinating a frisbee. We
+hypothesize that hidden factors, such as objects, contexts, and semantic
+foreground-background structures, induce hallucination. This study proposes a
+novel causal approach: a hallucination probing system to identify these hidden
+factors. By analyzing the causality between images, text prompts, and network
+saliency, we systematically explore interventions to block these factors. Our
+experimental findings show that a straightforward technique based on our
+analysis can significantly reduce hallucinations. Additionally, our analyses
+indicate the potential to edit network internals to minimize hallucinated
+outputs.
+
+
+
+ comment: Accepted by WACV2025
+
+
+
+
+
+
+ ☆ Personalizing Multimodal Large Language Models for Image Captioning: An
+ Experimental Analysis ECCV 2024
+
+
+
+
+
+
+
+
+ Davide Bucciarelli, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
+
+
+ The task of image captioning demands an algorithm to generate natural
+language descriptions of visual inputs. Recent advancements have seen a
+convergence between image captioning research and the development of Large
+Language Models (LLMs) and Multimodal LLMs -- like GPT-4V and Gemini -- which
+extend the capabilities of text-only LLMs to multiple modalities. This paper
+investigates whether Multimodal LLMs can supplant traditional image captioning
+networks by evaluating their performance on various image description
+benchmarks. We explore both the zero-shot capabilities of these models and
+their adaptability to different semantic domains through fine-tuning methods,
+including prompt learning, prefix tuning, and low-rank adaptation. Our results
+demonstrate that while Multimodal LLMs achieve impressive zero-shot
+performance, fine-tuning for specific domains while maintaining their
+generalization capabilities intact remains challenging. We discuss the
+implications of these findings for future research in image captioning and the
+development of more adaptable Multimodal LLMs.
+
+
+
+ comment: ECCV 2024 Workshop on Green Foundation Models
+
+ With the rapid advancement of diffusion-based generative models, portrait
+image animation has achieved remarkable results. However, it still faces
+challenges in temporally consistent video generation and fast sampling due to
+its iterative sampling nature. This paper presents FLOAT, an audio-driven
+talking portrait video generation method based on a flow matching generative
+model. We shift the generative modeling from the pixel-based latent space to a
+learned motion latent space, enabling efficient design of temporally consistent
+motion. To achieve this, we introduce a transformer-based vector field
+predictor with a simple yet effective frame-wise conditioning mechanism.
+Additionally, our method supports speech-driven emotion enhancement, enabling a
+natural incorporation of expressive motions. Extensive experiments demonstrate
+that our method outperforms state-of-the-art audio-driven talking portrait
+methods in terms of visual quality, motion fidelity, and efficiency.
+
+
+
+
+
+
+
+
+ Anqi Li, Feng Li, Yuxi Liu, Runmin Cong, Yao Zhao, Huihui Bai
+
+
+ Although recent generative image compression methods have demonstrated
+impressive potential in optimizing the rate-distortion-perception trade-off,
+they still face the critical challenge of flexible rate adaption to diverse
+compression necessities and scenarios. To overcome this challenge, this paper
+proposes a Controllable Generative Image Compression framework, termed
+Control-GIC, the first capable of fine-grained bitrate adaption across a broad
+spectrum while ensuring high-fidelity and generality compression. Control-GIC
+is grounded in a VQGAN framework that encodes an image as a sequence of
+variable-length codes (i.e. VQ-indices), which can be losslessly compressed and
+exhibit a direct positive correlation with the bitrates. Drawing inspiration
+from the classical coding principle, we correlate the information density of
+local image patches with their granular representations. Hence, we can flexibly
+determine a proper allocation of granularity for the patches to achieve dynamic
+adjustment for VQ-indices, resulting in desirable compression rates. We further
+develop a probabilistic conditional decoder capable of retrieving historic
+encoded multi-granularity representations according to transmitted codes, and
+then reconstruct hierarchical granular features in the formalization of
+conditional probability, enabling more informative aggregation to improve
+reconstruction realism. Our experiments show that Control-GIC allows highly
+flexible and controllable bitrate adaption where the results demonstrate its
+superior performance over recent state-of-the-art methods.
+
+
+ Relational learning is an essential task in the domain of knowledge
+representation, particularly in knowledge graph completion (KGC). While
+relational learning in traditional single-modal settings has been extensively
+studied, exploring it within a multimodal KGC context presents distinct
+challenges and opportunities. One of the major challenges is inference on newly
+discovered relations without any associated training data. This zero-shot
+relational learning scenario poses unique requirements for multimodal KGC,
+i.e., utilizing multimodality to facilitate relational learning. However,
+existing works fail to support the leverage of multimodal information and leave
+the problem unexplored. In this paper, we propose a novel end-to-end framework,
+consisting of three components, i.e., multimodal learner, structure
+consolidator, and relation embedding generator, to integrate diverse multimodal
+information and knowledge graph structures to facilitate the zero-shot
+relational learning. Evaluation results on three multimodal knowledge graphs
+demonstrate the superior performance of our proposed method.
+
+
+
+ comment: In the Proceedings of the 2024 IEEE International Conference on Big
+ Data (IEEE BigData 2024)
+
+
+
+
+
+
+ ♻ ☆ PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for
+ Long-Term Expressive Symbolic Music Generation
+
+
+
+
+
+
+
+
+ Yungang Yi, Weihua Li, Matthew Kuo, Quan Bai
+
+
+ AI-based music generation has progressed significantly in recent years.
+However, creating symbolic music that is both long-structured and expressive
+remains a considerable challenge. In this paper, we propose PerceiverS
+(Segmentation and Scale), a novel architecture designed to address this issue
+by leveraging both Effective Segmentation and Multi-Scale attention mechanisms.
+Our approach enhances symbolic music generation by simultaneously learning
+long-term structural dependencies and short-term expressive details. By
+combining cross-attention and self-attention in a Multi-Scale setting,
+PerceiverS captures long-range musical structure while preserving musical
+diversity. The proposed model has been evaluated using the Maestro dataset and
+has demonstrated improvements in generating music of conventional length with
+expressive nuances. The project demos and the generated music samples can be
+accessed through the link: https://perceivers.github.io
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Computation and Language 105
+
+
+
+
+
+ ☆ Scaling BERT Models for Turkish Automatic Punctuation and Capitalization
+ Correction
+
+
+
+
+
+
+
+
+ Abdulkader Saoud, Mahmut Alomeyr, Himmet Toprak Kesgin, Mehmet Fatih Amasyali
+
+
+ This paper investigates the effectiveness of BERT based models for automated
+punctuation and capitalization corrections in Turkish texts across five
+distinct model sizes. The models are designated as Tiny, Mini, Small, Medium,
+and Base. The design and capabilities of each model are tailored to address the
+specific challenges of the Turkish language, with a focus on optimizing
+performance while minimizing computational overhead. The study presents a
+systematic comparison of the performance metrics precision, recall, and F1
+score of each model, offering insights into their applicability in diverse
+operational contexts. The results demonstrate a significant improvement in text
+readability and accuracy as model size increases, with the Base model achieving
+the highest correction precision. This research provides a comprehensive guide
+for selecting the appropriate model size based on specific user needs and
+computational resources, establishing a framework for deploying these models in
+real-world applications to enhance the quality of written Turkish.
+
+
+
+ comment: 2024 Innovations in Intelligent Systems and Applications Conference
+ (ASYU)
+
+ Reinforcement learning from human feedback (RLHF) has been crucial in
+aligning large language models (LLMs) with human values. Traditionally, RLHF
+involves generating responses to a query and using a reward model to assign a
+reward to the entire response. However, this approach faces challenges due to
+its reliance on a single, sparse reward, which makes it challenging for the
+model to identify which parts of the sequence contribute most significantly to
+the final reward. Recent methods have attempted to address this limitation by
+introducing token-level rewards. However, these methods often rely on either a
+trained credit assignment model or AI annotators, raising concerns about the
+quality and reliability of the rewards. In this paper, we propose token-level
+reward regularization (T-REG), a novel approach that leverages both
+sequence-level and token-level rewards for preference optimization. Harnessing
+the self-refinement capabilities of LLMs, our method uses contrastive prompting
+to enable LLMs to self-generate token-level rewards. These self-generated
+rewards then act as reward regularization, guiding the model to more
+effectively distribute sequence-level rewards across tokens. This facilitates
+better token-level credit assignment and enhances alignment performance.
+Experiments on the instruction following benchmarks, including Alpaca Eval 2
+and Arena-Hard, show that our method consistently outperforms baseline methods
+by up to 3.8% and 4.4%, respectively. We will release the code and models at
+https://github.com/wzhouad/T-REG.
+
+
+
+
+
+
+
+ ☆ Mind the Gap: Examining the Self-Improvement Capabilities of Large
+ Language Models
+
+
+ Self-improvement is a mechanism in Large Language Model (LLM) pre-training,
+post-training and test-time inference. We explore a framework where the model
+verifies its own outputs, filters or reweights data based on this verification,
+and distills the filtered data. Despite several empirical successes, a
+fundamental understanding is still lacking. In this work, we initiate a
+comprehensive, modular and controlled study on LLM self-improvement. We provide
+a mathematical formulation for self-improvement, which is largely governed by a
+quantity which we formalize as the generation-verification gap. Through
+experiments with various model families and tasks, we discover a scaling
+phenomenon of self-improvement -- a variant of the generation-verification gap
+scales monotonically with the model pre-training flops. We also examine when
+self-improvement is possible, an iterative self-improvement procedure, and ways
+to improve its performance. Our findings not only advance understanding of LLM
+self-improvement with practical implications, but also open numerous avenues
+for future research into its capabilities and boundaries.
+
+
+
+ comment: 41 pages, 19 figures
+
+
+
+
+
+
+ ☆ Probing the statistical properties of enriched co-occurrence networks
+
+
+
+
+
+
+
+
+ Diego R. Amancio, Jeaneth Machicao, Laura V. C. Quispe
+
+
+ Recent studies have explored the addition of virtual edges to word
+co-occurrence networks using word embeddings to enhance graph representations,
+particularly for short texts. While these enriched networks have demonstrated
+some success, the impact of incorporating semantic edges into traditional
+co-occurrence networks remains uncertain. This study investigates two key
+statistical properties of text-based network models. First, we assess whether
+network metrics can effectively distinguish between meaningless and meaningful
+texts. Second, we analyze whether these metrics are more sensitive to syntactic
+or semantic aspects of the text. Our results show that incorporating virtual
+edges can have positive and negative effects, depending on the specific network
+metric. For instance, the informativeness of the average shortest path and
+closeness centrality improves in short texts, while the clustering
+coefficient's informativeness decreases as more virtual edges are added.
+Additionally, we found that including stopwords affects the statistical
+properties of enriched networks. Our results can serve as a guideline for
+determining which network metrics are most appropriate for specific
+applications, depending on the typical text size and the nature of the problem.
+
+
+
+
+
+
+
+ ☆ QA-TOOLBOX: Conversational Question-Answering for process task guidance
+ in manufacturing
+
+
+
+
+
+
+
+
+ Ramesh Manuvinakurike, Elizabeth Watkins, Celal Savur, Anthony Rhodes, Sovan Biswas, Gesem Gudino Mejia, Richard Beckwith, Saurav Sahay, Giuseppe Raffa, Lama Nachman
+
+
+ In this work we explore utilizing LLMs for data augmentation for
+a manufacturing task guidance system. The dataset consists of representative
+samples of interactions with technicians working in an advanced manufacturing
+setting. The purpose of this work is to explore the task, data augmentation for
+the supported tasks, and the performance of existing LLMs. We
+observe that the task is complex, requiring understanding from procedure
+specification documents, actions and objects sequenced temporally. The dataset
+consists of 200,000+ question/answer pairs that refer to the spec document and
+are grounded in narrations and/or video demonstrations. We compared the
+performance of several popular open-source LLMs by developing a baseline using
+each LLM, then compared the responses in a reference-free setting using
+LLM-as-a-judge, and compared the ratings with crowd-workers whilst validating
+the ratings with experts.
+
+
+
+
+
+
+
+ ☆ Words and Action: Modeling Linguistic Leadership in #BlackLivesMatter
+ Communities
+
+
+ In this project, we describe a method of modeling semantic leadership across
+a set of communities associated with the #BlackLivesMatter movement, which has
+been informed by qualitative research on the structure of social media and
+Black Twitter in particular. We describe our bespoke approaches to
+time-binning, community clustering, and connecting communities over time, as
+well as our adaptation of state-of-the-art approaches to semantic change
+detection and semantic leadership induction. We find substantial evidence of
+the leadership role of BLM activists and progressives, as well as Black
+celebrities. We also find evidence of the sustained engagement of the
+conservative community with this discourse, suggesting an alternative
+explanation for how we arrived at the present moment, in which "anti-woke" and
+"anti-CRT" bills are being enacted nationwide.
+
+
+
+ comment: Accepted at ICWSM 2025; minor revisions forthcoming
+
+ Large Language Models (LLMs) are typically trained to predict in the forward
+direction of time. However, recent works have shown that prompting these models
+to look back and critique their own generations can produce useful feedback.
+Motivated by this, we explore the question of whether LLMs can be empowered to
+think (predict and score) backwards to provide unsupervised feedback that
+complements forward LLMs. Towards this, we introduce Time Reversed Language
+Models (TRLMs), which can score and generate queries when conditioned on
+responses, effectively functioning in the reverse direction of time. Further,
+to effectively infer in the response to query direction, we pre-train and
+fine-tune a language model (TRLM-Ba) in the reverse token order from scratch.
+We show empirically (and theoretically in a stylized setting) that
+time-reversed models can indeed complement forward model predictions when used
+to score the query given response for re-ranking multiple forward generations.
+We obtain up to 5% improvement on the widely used AlpacaEval Leaderboard over
+the competent baseline of best-of-N re-ranking using self log-perplexity
+scores. We further show that TRLM scoring outperforms conventional forward
+scoring of response given query, resulting in significant gains in applications
+such as citation generation and passage retrieval. We next leverage the
+generative ability of TRLM to augment or provide unsupervised feedback to input
+safety filters of LLMs, demonstrating a drastic reduction in false negative
+rate with negligible impact on false positive rates against several attacks
+published on the popular JailbreakBench leaderboard.
+
+
+
+
+
+
+
+ ☆ GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken
+ Chatbot
+
+
+
+
+
+
+
+
+ Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, Jie Tang
+
+
+ We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken
+chatbot. It supports both Chinese and English, engages in real-time voice
+conversations, and varies vocal nuances such as emotion, intonation, speech
+rate, and dialect according to user instructions. GLM-4-Voice uses an ultra-low
+bitrate (175bps), single-codebook speech tokenizer with 12.5Hz frame rate
+derived from an automatic speech recognition (ASR) model by incorporating a
+vector-quantized bottleneck into the encoder. To efficiently transfer knowledge
+from text to speech modalities, we synthesize speech-text interleaved data from
+existing text pre-training corpora using a text-to-token model. We continue
+pre-training from the pre-trained text language model GLM-4-9B with a
+combination of unsupervised speech data, interleaved speech-text data, and
+supervised speech-text data, scaling up to 1 trillion tokens, achieving
+state-of-the-art performance in both speech language modeling and spoken
+question answering. We then fine-tune the pre-trained model with high-quality
+conversational speech data, achieving superior performance compared to existing
+baselines in both conversational ability and speech quality. The open models
+can be accessed through https://github.com/THUDM/GLM-4-Voice and
+https://huggingface.co/THUDM/glm-4-voice-9b.
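The tokenizer figures quoted above can be cross-checked with a short calculation. The sketch below divides the stated bitrate by the stated frame rate to get bits per speech token. The implied codebook size is an inference that assumes one token per frame and fixed-length coding; the abstract does not state it.

```python
# Figures quoted in the abstract.
bitrate_bps = 175.0
frame_rate_hz = 12.5

# One token per frame (single codebook), so bits per token = bitrate / frame rate.
bits_per_token = bitrate_bps / frame_rate_hz

# Assumption: tokens are coded with a fixed number of bits, so the codebook
# would hold 2 ** bits_per_token entries. This is inferred, not stated.
implied_codebook_size = 2 ** round(bits_per_token)

print(bits_per_token, implied_codebook_size)
```

Under these assumptions each frame carries 14 bits, which would correspond to a codebook of 16,384 entries.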
+
+
+
+
+
+
+
+ ☆ AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand
+ Audio-Visual Information?
+
+
+ Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini
+1.5 Pro, and Reka Core, have expanded their capabilities to include vision and
+audio modalities. While these models demonstrate impressive performance across
+a wide range of audio-visual applications, our proposed DeafTest reveals that
+MLLMs often struggle with simple tasks humans find trivial: 1) determining
+which of two sounds is louder, and 2) determining which of two sounds has a
+higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a
+comprehensive audio-visual benchmark designed to assess whether those MLLMs can
+truly understand the audio-visual information. This benchmark encompasses 4,555
+carefully crafted problems, each incorporating text, visual, and audio
+components. To successfully infer answers, models must effectively leverage
+clues from both visual and audio inputs. To ensure precise and objective
+evaluation of MLLM responses, we have structured the questions as
+multiple-choice, eliminating the need for human evaluation or LLM-assisted
+assessment. We benchmark a series of closed-source and open-source models and
+summarize the observations. By revealing the limitations of current models, we
+aim to provide useful insight for future dataset collection and model
+development.
+
+
+
+
+
+
+
+ ☆ Interpretable Company Similarity with Sparse Autoencoders
+
+
+
+
+
+
+
+
+ Marco Molinari, Vladimir Tregubiak, Victor Shao, Abhimanyu Pandey, Mateusz Mikolajczak, Sebastião Kuznetsov Ryder Torres Pereira
+
+
+ Determining company similarity is a vital task in finance, underpinning
+hedging, risk management, portfolio diversification, and more. Practitioners
+often rely on sector and industry classifications to gauge similarity, such as
+SIC-codes and GICS-codes, the former being used by the U.S. Securities and
+Exchange Commission (SEC), and the latter widely used by the investment
+community. Clustering embeddings of company descriptions has been proposed as a
+potential technique for determining company similarity, but the lack of
+interpretability in token embeddings poses a significant barrier to adoption in
+high-stakes contexts. Sparse Autoencoders have shown promise in enhancing the
+interpretability of Large Language Models by decomposing LLM activations into
+interpretable features. In this paper, we explore the use of SAE features in
+measuring company similarity and benchmark them against (1) SIC codes and (2)
+Major Group codes. We conclude that SAE features can reproduce and even surpass
+sector classifications in quantifying fundamental characteristics of companies,
+evaluated by the correlation of monthly returns, a proxy for similarity, and
+PnL from cointegration.
+
+
+
+
+
+
+
+ ☆ CEGI: Measuring the trade-off between efficiency and carbon emissions
+ for SLMs and VLMs
+
+
+ This paper analyzes the performance of Small Language Models (SLMs) and
+Vision Language Models (VLMs) and evaluates the trade-off between model
+performance and carbon emissions across 4 essential tasks: Image Captioning,
+Visual Question Answering (VQA), Dialogue Summarization and Text-to-SQL
+conversion. Various SLMs and VLMs belonging to the Qwen and LLaMA architecture
+family are chosen and variants based on model size in terms of the number of
+parameters, quantization level and fine-tuning parameters are evaluated. The
+model variant's performance and carbon emissions are calculated. To quantify
+the trade-off between model performance and carbon emissions, we introduce a
+novel metric called CEGI (Carbon Efficient Gain Index). This metric represents
+the carbon emission per unit percentage gain per million trainable parameters.
+This metric provides a normalized measure to compare models' efficiency in
+terms of performance improvement relative to their environmental cost. The
+experiment's outcome demonstrates that fine-tuning SLMs and VLMs can achieve
+performance levels comparable to Large Language Models (LLMs) while producing
+significantly less carbon emissions. Our findings suggest that the marginal
+gains in accuracy from larger models do not justify the substantial increase in
+carbon emissions. Leveraging lower-bit quantization levels, the proposed metric
+further enhances energy efficiency without compromising performance. This study
+highlights the importance of balancing high performance and environmental sustainability. It
+offers a valuable metric for selecting models suitable for
+environmentally-friendly AI development.
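One possible reading of the CEGI definition above ("carbon emission per unit percentage gain per million trainable parameters") is sketched below. The exact formula, units, and the way the gain is measured are not given in the abstract, so the function signature, the use of percentage-point gain, and the sample numbers are all assumptions made for illustration.

```python
def cegi(emissions_kg, baseline_score_pct, tuned_score_pct, trainable_params):
    """Illustrative CEGI: emissions / (percentage gain * millions of parameters).

    Assumes scores are percentages and the gain is their difference in
    percentage points. Lower values mean less carbon per unit of improvement.
    """
    gain_pct = tuned_score_pct - baseline_score_pct
    if gain_pct <= 0:
        raise ValueError("CEGI needs a positive performance gain")
    if trainable_params <= 0:
        raise ValueError("trainable_params must be positive")
    million_params = trainable_params / 1e6
    return emissions_kg / (gain_pct * million_params)


# Hypothetical run: 0.5 kg CO2e, score rising from 40% to 50%,
# 20 million trainable parameters.
example_cegi = cegi(0.5, 40.0, 50.0, 20_000_000)
```

Under this reading, a variant that reaches the same gain with fewer emissions gets a lower score.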
+
+
+
+
+
+
+
+ ☆ Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon
+ Pretraining Dataset
+
+
+
+
+
+
+
+
+ Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
+
+
+ Recent English Common Crawl datasets like FineWeb-Edu and DCLM achieved
+significant benchmark gains via aggressive model-based filtering, but at the
+cost of removing 90% of data. This limits their suitability for long token
+horizon training, such as 15T tokens for Llama 3.1. In this paper, we show how
+to achieve better trade-offs between accuracy and data quantity by a
+combination of classifier ensembling, synthetic data rephrasing, and reduced
+reliance on heuristic filters. When training 8B parameter models for 1T tokens,
+using a high-quality subset of our data improves MMLU by 5.6 over DCLM,
+demonstrating the efficacy of our methods for boosting accuracies over a
+relatively short token horizon. Furthermore, our full 6.3T token dataset
+matches DCLM on MMLU, but contains four times more unique real tokens than
+DCLM. This unlocks state-of-the-art training over a long token horizon: an 8B
+parameter model trained for 15T tokens, of which 7.2T came from our dataset, is
+better than the Llama 3.1 8B model: +5 on MMLU, +3.1 on ARC-Challenge, and +0.5
+on average across ten diverse tasks. The dataset is available at
+https://data.commoncrawl.org/contrib/Nemotron/Nemotron-CC/index.html
+
+
+ Retrieval-Augmented Generation (RAG) architectures have recently garnered
+significant attention for their ability to improve truth grounding and
+coherence in natural language processing tasks. However, the reliability of RAG
+systems in producing accurate answers diminishes as the volume of data they
+access increases. Even with smaller datasets, these systems occasionally fail
+to address simple queries. This issue arises from their dependence on
+state-of-the-art large language models (LLMs), which can introduce uncertainty
+into the system's outputs. In this work, I propose a novel Comparative RAG
+system that introduces an evaluator module to bridge the gap between
+probabilistic RAG systems and deterministically verifiable responses. The
+evaluator compares external recommendations with the retrieved document chunks,
+adding a decision-making layer that enhances the system's reliability. This
+approach ensures that the chunks retrieved are both semantically relevant and
+logically consistent with deterministic insights, thereby improving the
+accuracy and overall efficiency of RAG systems. This framework paves the way
+for more reliable and scalable question-answering applications in domains
+requiring high precision and verifiability.
+
+
+
+
+
+
+
+ ☆ Patent-CR: A Dataset for Patent Claim Revision
+
+
+
+
+
+
+
+
+ Lekang Jiang, Pascal A Scherz, Stephan Goetz
+
+
+ This paper presents Patent-CR, the first dataset created for the patent claim
+revision task in English. It includes both initial patent applications rejected
+by patent examiners and the final granted versions. Unlike normal text revision
+tasks that predominantly focus on enhancing sentence quality, such as grammar
+correction and coherence improvement, patent claim revision aims at ensuring
+the claims meet stringent legal criteria. These criteria go beyond novelty and
+inventiveness and include clarity of scope, technical accuracy, language
+precision, and legal robustness. We assess various large language models (LLMs)
+through professional human evaluation, including general LLMs with different
+sizes and architectures, text revision models, and domain-specific models. Our
+results indicate that LLMs often bring ineffective edits that deviate from the
+target revisions. In addition, domain-specific models and the method of
+fine-tuning show promising results. Notably, GPT-4 outperforms other tested
+LLMs, but further revisions are still necessary to reach the examination
+standard. Furthermore, we demonstrate the inconsistency between automated and
+human evaluation results, suggesting that GPT-4-based automated evaluation has
+the highest correlation with human judgment. This dataset, along with our
+preliminary empirical research, offers invaluable insights for further
+exploration in patent claim revision.
+
+
+
+ comment: 15 pages, 6 tables, 3 figures
+
+
+
+
+
+
+ ☆ LLMForecaster: Improving Seasonal Event Forecasts with Unstructured
+ Textual Data NeurIPS
+
+
+
+
+
+
+
+
+ Hanyu Zhang, Chuck Arvin, Dmitry Efimov, Michael W. Mahoney, Dominique Perrault-Joncas, Shankar Ramasubramanian, Andrew Gordon Wilson, Malcolm Wolff
+
+
+ Modern time-series forecasting models often fail to make full use of rich
+unstructured information about the time series themselves. This lack of proper
+conditioning can lead to obvious model failures; for example, models may be
+unaware of the details of a particular product, and hence fail to anticipate
+seasonal surges in customer demand in the lead up to major exogenous events
+like holidays for clearly relevant products. To address this shortcoming, this
+paper introduces a novel forecast post-processor -- which we call LLMForecaster
+-- that fine-tunes large language models (LLMs) to incorporate unstructured
+semantic and contextual information and historical data to improve the
+forecasts from an existing demand forecasting pipeline. In an industry-scale
+retail application, we demonstrate that our technique yields statistically
+significant forecast improvements across several sets of products subject to
+holiday-driven demand surges.
+
+
+
+ comment: Presented at NeurIPS Time Series in the Age of Large Models (2024)
+
+
+
+
+
+
+ ☆ DP-2Stage: Adapting Language Models as Differentially Private Tabular
+ Data Generators
+
+
+
+
+
+
+
+
+ Tejumade Afonja, Hui-Po Wang, Raouf Kerkouche, Mario Fritz
+
+
+ Generating tabular data under differential privacy (DP) protection ensures
+theoretical privacy guarantees but poses challenges for training machine
+learning models, primarily due to the need to capture complex structures under
+noisy supervision signals. Recently, pre-trained Large Language Models (LLMs)
+-- even those at the scale of GPT-2 -- have demonstrated great potential in
+synthesizing tabular data. However, their applications under DP constraints
+remain largely unexplored. In this work, we address this gap by applying DP
+techniques to the generation of synthetic tabular data. Our findings show that
+LLMs face difficulties in generating coherent text when fine-tuned with DP, as
+privacy budgets are inefficiently allocated to non-private elements like table
+structures. To overcome this, we propose DP-2Stage, a two-stage fine-tuning
+framework for differentially private tabular data generation. The first stage
+involves non-private fine-tuning on a pseudo dataset, followed by DP
+fine-tuning on a private dataset. Our empirical results show that this approach
+improves performance across various settings and metrics compared to directly
+fine-tuned LLMs in DP contexts. We release our code and setup at
+https://github.com/tejuafonja/DP-2Stage.
+
+
+
+
+
+
+
+ ☆ Can ChatGPT capture swearing nuances? Evidence from translating Arabic
+ oaths
+
+
+ This study sets out to answer one major question: Can ChatGPT capture
+swearing nuances? It presents an empirical study on the ability of ChatGPT to
+translate Arabic oath expressions into English. 30 Arabic oath expressions were
+collected from the literature. These 30 oaths were first translated via ChatGPT
+and then analyzed and compared to the human translation in terms of types of
+gaps left unfulfilled by ChatGPT. Specifically, the gaps involved are:
+religious gap, cultural gap, both religious and cultural gaps, no gap, using
+non-oath particles, redundancy, and non-capturing of Arabic script diacritics. It
+concludes that ChatGPT's translation of oaths is still largely unsatisfactory,
+revealing the need for further development of ChatGPT and the inclusion of
+Arabic data on which ChatGPT should be trained, including oath expressions, oath
+nuances, rituals, and practices.
+
+
+
+ comment: 18 pages, 3 figures
+
+
+
+
+
+
+ ☆ Gracefully Filtering Backdoor Samples for Generative Large Language
+ Models without Retraining COLING 2025
+
+
+ Backdoor attacks remain significant security threats to generative large
+language models (LLMs). Since generative LLMs output sequences of
+high-dimensional token logits instead of low-dimensional classification logits,
+most existing backdoor defense methods designed for discriminative models like
+BERT are ineffective for generative LLMs. Inspired by the observed differences
+in learning behavior between backdoor and clean mapping in the frequency space,
+we transform gradients of each training sample, directly influencing parameter
+updates, into the frequency space. Our findings reveal a distinct separation
+between the gradients of backdoor and clean samples in the frequency space.
+Based on this phenomenon, we propose Gradient Clustering in the Frequency Space
+for Backdoor Sample Filtering (GraCeFul), which leverages sample-wise gradients
+in the frequency space to effectively identify backdoor samples without
+requiring retraining LLMs. Experimental results show that GraCeFul outperforms
+baselines significantly. Notably, GraCeFul exhibits remarkable computational
+efficiency, achieving nearly 100% recall and F1 scores in identifying backdoor
+samples, reducing the average success rate of various backdoor attacks to 0%
+with negligible drops in clean accuracy across multiple free-style question
+answering datasets. Additionally, GraCeFul generalizes to Llama-2 and Vicuna.
+The codes are publicly available at https://github.com/ZrW00/GraceFul.
+
+
+ Artificial Expert Intelligence (AEI) seeks to transcend the limitations of
+both Artificial General Intelligence (AGI) and narrow AI by integrating
+domain-specific expertise with critical, precise reasoning capabilities akin to
+those of top human experts. Existing AI systems often excel at predefined tasks
+but struggle with adaptability and precision in novel problem-solving. To
+overcome this, AEI introduces a framework for "Probably Approximately Correct
+(PAC) Reasoning". This paradigm provides robust theoretical guarantees for
+reliably decomposing complex problems, with a practical mechanism for
+controlling reasoning precision. In reference to the division of human thought
+into System 1 for intuitive thinking and System 2 for reflective
+reasoning (Tversky and Kahneman, 1974), we refer to this new type of reasoning
+as System 3 for precise reasoning, inspired by the rigor of the scientific
+method. AEI thus establishes a foundation for error-bounded, inference-time
+learning.
+
+
+
+
+
+
+
+ ☆ GerPS-Compare: Comparing NER methods for legal norm analysis
+
+
+
+
+
+
+
+
+ Sarah T. Bachinger, Christoph Unger, Robin Erd, Leila Feddoul, Clara Lachenmaier, Sina Zarrieß, Birgitta König-Ries
+
+
+ We apply NER to a particular sub-genre of legal texts in German: the genre of
+legal norms regulating administrative processes in public service
+administration. The analysis of such texts involves identifying stretches of
+text that instantiate one of ten classes identified by public service
+administration professionals. We investigate and compare three methods for
+performing Named Entity Recognition (NER) to detect these classes: a Rule-based
+system, deep discriminative models, and a deep generative model. Our results
+show that Deep Discriminative models outperform both the Rule-based system and
+the Deep Generative model, with the latter two performing roughly equally
+well and outperforming each other in different classes. The main cause for this
+somewhat surprising result is arguably the fact that the classes used in the
+analysis are semantically and syntactically heterogeneous, in contrast to the
+classes used in more standard NER tasks. Deep Discriminative models appear to
+be better equipped for dealing with this heterogeneity than both generic LLMs
+and human linguists designing rule-based NER systems.
+
+
+
+
+
+
+
+ ☆ Four Guiding Principles for Modeling Causal Domain Knowledge: A Case
+ Study on Brainstorming Approaches for Urban Blight Analysis
+
+
+
+
+
+
+
+
+ Houssam Razouk, Michael Leitner, Roman Kern
+
+
+ Urban blight is a problem of high interest for planning and policy making.
+Researchers frequently propose theories about the relationships between urban
+blight indicators, focusing on relationships reflecting causality. In this
+paper, we improve on the integration of domain knowledge in the analysis of
+urban blight by introducing four rules for effective modeling of causal domain
+knowledge. The findings of this study reveal significant deviation from causal
+modeling guidelines by investigating cognitive maps developed for urban blight
+analysis. These findings provide valuable insights that will inform future work
+on urban blight, ultimately enhancing our understanding of the complex
+interactions underlying urban blight.
+
+
+
+
+
+
+
+
+ Xi Cao, Quzong Gesang, Yuan Sun, Nuo Qun, Tashi Nyima
+
+
+ Language models based on deep neural networks are vulnerable to textual
+adversarial attacks. While rich-resource languages like English are receiving
+focused attention, Tibetan, a cross-border language, is gradually being studied
+due to its abundant ancient literature and critical language strategy.
+Currently, there are several Tibetan adversarial text generation methods, but
+they do not fully consider the textual features of Tibetan script and
+overestimate the quality of generated adversarial texts. To address this issue,
+we propose a novel Tibetan adversarial text generation method called TSCheater,
+which considers the characteristic of Tibetan encoding and the feature that
+visually similar syllables have similar semantics. This method can also be
+transferred to other abugidas, such as Devanagari script. We utilize a
+self-constructed Tibetan syllable visual similarity database called TSVSDB to
+generate substitution candidates and adopt a greedy algorithm-based scoring
+mechanism to determine substitution order. After that, we apply the method to
+eight victim language models. Experimentally, TSCheater outperforms existing
+methods in attack effectiveness, perturbation magnitude, semantic similarity,
+visual similarity, and human acceptance. Finally, we construct the first
+Tibetan adversarial robustness evaluation benchmark called AdvTS, which is
+generated by existing methods and proofread by humans.
+
+
+
+ comment: Review Version; Submitted to ICASSP 2025
+
+
+
+
+
+
+ ☆ The Impact of Featuring Comments in Online Discussions
+
+
+
+
+
+
+
+
+ Cedric Waterschoot, Ernst van den Hemel, Antal van den Bosch
+
+
+ A widespread moderation strategy by online news platforms is to feature what
+the platform deems high quality comments, usually called editor picks or
+featured comments. In this paper, we compare online discussions of news
+articles in which certain comments are featured, versus discussions in which no
+comments are featured. We measure the impact of featuring comments on the
+discussion, by estimating and comparing the quality of discussions from the
+perspective of the user base and the platform itself. Our analysis shows that
+the impact on discussion quality is limited. However, we do observe an increase
+in discussion activity after the first comments are featured by moderators,
+suggesting that the moderation strategy might be used to increase user
+engagement and to postpone the natural decline in user activity over time.
+
+
+
+
+
+
+
+ ☆ ScImage: How Good Are Multimodal Large Language Models at Scientific
+ Text-to-Image Generation?
+
+
+
+
+
+
+
+
+ Leixin Zhang, Steffen Eger, Yinjie Cheng, Weihe Zhai, Jonas Belouadi, Christoph Leiter, Simone Paolo Ponzetto, Fahimeh Moafian, Zhixue Zhao
+
+
+ Multimodal large language models (LLMs) have demonstrated impressive
+capabilities in generating high-quality images from textual instructions.
+However, their performance in generating scientific images--a critical
+application for accelerating scientific progress--remains underexplored. In
+this work, we address this gap by introducing ScImage, a benchmark designed to
+evaluate the multimodal capabilities of LLMs in generating scientific images
+from textual descriptions. ScImage assesses three key dimensions of
+understanding: spatial, numeric, and attribute comprehension, as well as their
+combinations, focusing on the relationships between scientific objects (e.g.,
+squares, circles). We evaluate five models, GPT-4o, Llama, AutomaTikZ, Dall-E,
+and StableDiffusion, using two modes of output generation: code-based outputs
+(Python, TikZ) and direct raster image generation. Additionally, we examine
+four different input languages: English, German, Farsi, and Chinese. Our
+evaluation, conducted with 11 scientists across three criteria (correctness,
+relevance, and scientific accuracy), reveals that while GPT-4o produces outputs
+of decent quality for simpler prompts involving individual dimensions such as
+spatial, numeric, or attribute understanding in isolation, all models face
+challenges in this task, especially for more complex prompts.
+
+
+
+
+
+
+
+ ☆ Multi-Granularity Tibetan Textual Adversarial Attack Method Based on
+ Masked Language Model WWW 2024
+
+
+
+
+
+
+
+
+ Xi Cao, Nuo Qun, Quzong Gesang, Yulei Zhu, Trashi Nyima
+
+
+ In social media, neural network models have been applied to hate speech
+detection, sentiment analysis, etc., but neural network models are susceptible
+to adversarial attacks. For instance, in a text classification task, the
+attacker elaborately introduces perturbations to the original texts that hardly
+alter the original semantics in order to trick the model into making different
+predictions. By studying textual adversarial attack methods, the robustness of
+language models can be evaluated and then improved. Currently, most of the
+research in this field focuses on English, and there is also a certain amount
+of research on Chinese. However, there is little research targeting Chinese
+minority languages. With the rapid development of artificial intelligence
+technology and the emergence of Chinese minority language models, textual
+adversarial attacks become a new challenge for the information processing of
+Chinese minority languages. In response to this situation, we propose a
+multi-granularity Tibetan textual adversarial attack method based on masked
+language models called TSTricker. We utilize the masked language models to
+generate candidate substitution syllables or words, adopt the scoring mechanism
+to determine the substitution order, and then apply the attack method to
+several fine-tuned victim models. The experimental results show that TSTricker
+reduces the accuracy of the classification models by more than 28.70% and makes
+the classification models change the predictions of more than 90.60% of the
+samples, which has an evidently higher attack effect than the baseline method.
+
+
+
+ comment: Revised Version; Accepted at WWW 2024 Workshop on SocialNLP
+
+
+
+
+
+
+ ☆ Pay Attention to the Robustness of Chinese Minority Language Models!
+ Syllable-level Textual Adversarial Attack on Tibetan Script ACL 2023
+
+
+
+
+
+
+
+
+ Xi Cao, Dolma Dawa, Nuo Qun, Trashi Nyima
+
+
+ The textual adversarial attack refers to an attack method in which the
+attacker adds imperceptible perturbations to the original texts by elaborate
+design so that the NLP (natural language processing) model produces false
+judgments. This method is also used to evaluate the robustness of NLP models.
+Currently, most of the research in this field focuses on English, and there is
+also a certain amount of research on Chinese. However, to the best of our
+knowledge, there is little research targeting Chinese minority languages.
+Textual adversarial attacks are a new challenge for the information processing
+of Chinese minority languages. In response to this situation, we propose a
+Tibetan syllable-level black-box textual adversarial attack called TSAttacker
+based on syllable cosine distance and a scoring mechanism. We then apply
+TSAttacker to six models generated by fine-tuning two PLMs (pre-trained
+language models) for three downstream tasks. The experiment results show that
+TSAttacker is effective and generates high-quality adversarial samples. In
+addition, the robustness of the involved models still has much room for
+improvement.
+
+
+
+ comment: Revised Version; Accepted at ACL 2023 Workshop on TrustNLP
+
+
+
+
+
+
+ ☆ Large Multimodal Agents for Accurate Phishing Detection with Enhanced
+ Token Optimization and Cost Reduction
+
+
+ With the rise of sophisticated phishing attacks, there is a growing need for
+effective and economical detection solutions. This paper explores the use of
+large multimodal agents, specifically Gemini 1.5 Flash and GPT-4o mini, to
+analyze both URLs and webpage screenshots via APIs, thus avoiding the
+complexities of training and maintaining AI systems. Our findings indicate that
+integrating these two data types substantially enhances detection performance
+over using either type alone. However, API usage incurs costs per query that
+depend on the number of input and output tokens. To address this, we propose a
+two-tiered agentic approach: initially, one agent assesses the URL, and if
+inconclusive, a second agent evaluates both the URL and the screenshot. This
+method not only maintains robust detection performance but also significantly
+reduces API costs by minimizing unnecessary multi-input queries. Cost analysis
+shows that with the agentic approach, GPT-4o mini can process about 4.2 times
+as many websites per $100 compared to the multimodal approach (107,440 vs.
+25,626), and Gemini 1.5 Flash can process about 2.6 times more websites
+(2,232,142 vs. 862,068). These findings underscore the significant economic
+benefits of the agentic approach over the multimodal method, providing a viable
+solution for organizations aiming to leverage advanced AI for phishing
+detection while controlling expenses.
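The two-tiered flow and the cost figures quoted above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `url_agent` and `multimodal_agent` are hypothetical callables standing in for the API-backed agents, and the convention that the first agent returns `None` when it is inconclusive is an assumption. The website counts are the ones reported in the abstract.

```python
from typing import Callable, Optional

Verdict = str  # assumed labels: "phishing" or "legitimate"


def two_tier_verdict(
    url: str,
    screenshot: bytes,
    url_agent: Callable[[str], Optional[Verdict]],
    multimodal_agent: Callable[[str, bytes], Verdict],
) -> Verdict:
    """Ask the cheap URL-only agent first; escalate only if it is inconclusive."""
    first = url_agent(url)  # hypothetical: returns None when inconclusive
    if first is not None:
        return first
    return multimodal_agent(url, screenshot)


def throughput_ratio(agentic_sites: int, multimodal_sites: int) -> float:
    """How many times more websites per fixed budget the agentic approach handles."""
    return agentic_sites / multimodal_sites


# Websites processed per $100, as reported in the abstract.
gpt4o_mini_ratio = round(throughput_ratio(107_440, 25_626), 1)
gemini_flash_ratio = round(throughput_ratio(2_232_142, 862_068), 1)
```

The saving comes from the early return: the screenshot, which dominates the input token count, is only sent when the URL alone does not settle the question.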
+
+
+
+ comment: Accepted in the 2nd International Conference on Foundation and Large
+ Language Models (FLLM2024)
+
+
+
+
+
+
+ ☆ Characterizing Information Shared by Participants to Coding Challenges:
+ The Case of Advent of Code
+
+
+  Advent of Code (AoC from now on) is a popular coding challenge that requires
+solving programming puzzles for a variety of skill sets and levels. AoC follows
+the advent calendar, therefore it is an annual challenge that lasts for 25
+days. AoC participants usually post their solutions on social networks and
+discuss them online. These challenges are interesting to study since they could
+highlight the adoption of new tools, the evolution of the developer community,
+or the technological requirements of well-known companies. For these reasons,
+we first create a dataset of the 2019-2021 AoC editions containing the
+discussion threads made on the subreddit /r/adventofcode. Then, we
+propose a model based on stream graphs to best study this context, where we
+represent its most important actors through time: participants, comments, and
+programming languages. Thanks to our model, we investigate user participation,
+adoption of new programming languages during a challenge and between two of
+them, and resiliency of programming languages based on a Stack Overflow survey.
+We find that the top-used programming languages are almost the same in the
+three years, pointing out their importance. Moreover, participants tend to keep
+the same programming language for the whole challenge, while the ones attending
+two AoCs usually change it in the next one. Finally, we observe interesting
+results about the programming languages that are "Popular" or "Loved"
+according to the Stack Overflow survey. Firstly, these are the ones adopted for
+the longest time in an AoC edition, thanks to which users have a high chance of
+reaching the end of the challenge. Secondly, they are the most chosen when a
+participant decides to change programming language during the same challenge.
+
+
+
+ comment: 10 pages, 7 figures
+
+
+
+
+
+
+ ☆ A Comprehensive Evaluation of Large Language Models on Aspect-Based
+ Sentiment Analysis
+
+
+ Recently, Large Language Models (LLMs) have garnered increasing attention in
+the field of natural language processing, revolutionizing numerous downstream
+tasks with powerful reasoning and generation abilities. For example, In-Context
+Learning (ICL) introduces a fine-tuning-free paradigm, allowing out-of-the-box
+LLMs to execute downstream tasks by analogy learning without any fine-tuning.
+Besides, in a fine-tuning-dependent paradigm where substantial training data
+exists, Parameter-Efficient Fine-Tuning (PEFT) methods are cost-effective and
+enable LLMs to achieve excellent performance comparable to full fine-tuning.
+ However, these fascinating techniques employed by LLMs have not been fully
+exploited in the ABSA field. Previous works probe LLMs in ABSA by merely using
+randomly selected input-output pairs as demonstrations in ICL, resulting in an
+incomplete and superficial evaluation. In this paper, we shed light on a
+comprehensive evaluation of LLMs in the ABSA field, involving 13 datasets, 8
+ABSA subtasks, and 6 LLMs. Specifically, we design a unified task formulation
+to unify "multiple LLMs for multiple ABSA subtasks in multiple paradigms."
+For the fine-tuning-dependent paradigm, we efficiently fine-tune LLMs using
+instruction-based multi-task learning. For the fine-tuning-free paradigm, we
+propose 3 demonstration selection strategies to stimulate the few-shot
+abilities of LLMs. Our extensive experiments demonstrate that LLMs achieve a
+new state-of-the-art performance compared to fine-tuned Small Language Models
+(SLMs) in the fine-tuning-dependent paradigm. More importantly, in the
+fine-tuning-free paradigm where SLMs are ineffective, LLMs with ICL still
+showcase impressive potential and even compete with fine-tuned SLMs on some
+ABSA subtasks.
+
+
+
+
+
+
+
+ ☆ MediaSpin: Exploring Media Bias Through Fine-Grained Analysis of News
+ Headlines
+
+
+ In this paper, we introduce the MediaSpin dataset aiming to help in the
+development of models that can detect different forms of media bias present in
+news headlines, developed through human-supervised and -validated Large
+Language Model (LLM) labeling of media bias. This corpus comprises 78,910 pairs
+of news headlines and annotations with explanations of the 13 distinct types of
+media bias categories assigned. We demonstrate the usefulness of our dataset
+for automated bias detection in news edits.
+
+
+
+
+
+
+
+
+ Da Ma, Lu Chen, Situo Zhang, Yuxun Miao, Su Zhu, Zhi Chen, Hongshen Xu, Hanqi Li, Shuai Fan, Lei Pan, Kai Yu
+
+
+ The increasing context window size in Large Language Models (LLMs), such as
+the GPT and LLaMA series, has improved their ability to tackle complex,
+long-text tasks, but at the cost of inference efficiency, particularly
+regarding memory and computational complexity. Existing methods, including
+selective token retention and window-based attention, improve efficiency but
+risk discarding important tokens needed for future text generation. In this
+paper, we propose an approach that enhances LLM efficiency without token loss
+by reducing the memory and computational load of less important tokens, rather
+than discarding them. We address two challenges: 1) investigating the
+distribution of important tokens in the context, discovering recent tokens are
+more important than distant tokens in context, and 2) optimizing resources for
+distant tokens by sharing attention scores across layers. The experiments show
+that our method saves 35% of the KV cache without compromising performance.
+
+
+
+ comment: preprint
+
+
+
+
+
+
+ ☆ BANER: Boundary-Aware LLMs for Few-Shot Named Entity Recognition COLING 2025
+
+
+ Despite the recent success of two-stage prototypical networks in few-shot
+named entity recognition (NER), challenges such as over/under-detected false
+spans in the span detection stage and unaligned entity prototypes in the type
+classification stage persist. Additionally, LLMs have not proven to be
+effective few-shot information extractors in general. In this paper, we propose
+an approach called Boundary-Aware LLMs for Few-Shot Named Entity Recognition to
+address these issues. We introduce a boundary-aware contrastive learning
+strategy to enhance the LLM's ability to perceive entity boundaries for
+generalized entity spans. Additionally, we utilize LoRAHub to align information
+from the target domain to the source domain, thereby enhancing adaptive
+cross-domain classification capabilities. Extensive experiments across various
+benchmarks demonstrate that our framework outperforms prior methods, validating
+its effectiveness. In particular, the proposed strategies demonstrate
+effectiveness across a range of LLM architectures. The code and data are
+released on https://github.com/UESTC-GQJ/BANER.
+
+
+
+    comment: To appear at COLING 2025
+
+
+
+
+
+
+      ☆ DataLab: A Unified Platform for LLM-Powered Business Intelligence
+
+
+ Business intelligence (BI) transforms large volumes of data within modern
+organizations into actionable insights for informed decision-making. Recently,
+large language model (LLM)-based agents have streamlined the BI workflow by
+automatically performing task planning, reasoning, and actions in executable
+environments based on natural language (NL) queries. However, existing
+approaches primarily focus on individual BI tasks such as NL2SQL and NL2VIS.
+The fragmentation of tasks across different data roles and tools leads to
+inefficiencies and potential errors due to the iterative and collaborative
+nature of BI. In this paper, we introduce DataLab, a unified BI platform that
+integrates a one-stop LLM-based agent framework with an augmented computational
+notebook interface. DataLab supports a wide range of BI tasks for different
+data roles by seamlessly combining LLM assistance with user customization
+within a single environment. To achieve this unification, we design a domain
+knowledge incorporation module tailored for enterprise-specific BI tasks, an
+inter-agent communication mechanism to facilitate information sharing across
+the BI workflow, and a cell-based context management strategy to enhance
+context utilization efficiency in BI notebooks. Extensive experiments
+demonstrate that DataLab achieves state-of-the-art performance on various BI
+tasks across popular research benchmarks. Moreover, DataLab maintains high
+effectiveness and efficiency on real-world datasets from Tencent, achieving up
+to a 58.58% increase in accuracy and a 61.65% reduction in token cost on
+enterprise-specific BI tasks.
+
+
+
+
+
+
+
+ ☆ VISCO: Benchmarking Fine-Grained Critique and Correction Towards
+ Self-Improvement in Visual Reasoning
+
+
+ The ability of large vision-language models (LVLMs) to critique and correct
+their reasoning is an essential building block towards their self-improvement.
+However, a systematic analysis of such capabilities in LVLMs is still lacking.
+We propose VISCO, the first benchmark to extensively analyze the fine-grained
+critique and correction capabilities of LVLMs. Compared to existing work that
+uses a single scalar value to critique the entire reasoning [4], VISCO features
+dense and fine-grained critique, requiring LVLMs to evaluate the correctness of
+each step in the chain-of-thought and provide natural language explanations to
+support their judgments. Extensive evaluation of 24 LVLMs demonstrates that
+human-written critiques significantly enhance the performance after correction,
+showcasing the potential of the self-improvement strategy. However, the
+model-generated critiques are less helpful and sometimes detrimental to the
+performance, suggesting that critique is the crucial bottleneck. We identified
+three common patterns in critique failures: failure to critique visual
+perception, reluctance to "say no", and exaggerated assumption of error
+propagation. To address these issues, we propose an effective LookBack strategy
+that revisits the image to verify each piece of information in the initial
+reasoning. LookBack significantly improves critique and correction performance
+by up to 13.5%.
+
+
+ This paper provides a theoretical framework for interpreting acoustic
+neighbor embeddings, which are representations of the phonetic content of
+variable-width audio or text in a fixed-dimensional embedding space. A
+probabilistic interpretation of the distances between embeddings is proposed,
+based on a general quantitative definition of phonetic similarity between
+words. This provides us a framework for understanding and applying the
+embeddings in a principled manner. Theoretical and empirical evidence
+supporting an approximation of uniform cluster-wise isotropy is shown, which
+allows us to reduce the distances to simple Euclidean distances. Four
+experiments that validate the framework and demonstrate how it can be applied
+to diverse problems are described. Nearest-neighbor search between audio and
+text embeddings can give isolated word classification accuracy that is
+identical to that of finite state transducers (FSTs) for vocabularies as large
+as 500k. Embedding distances give accuracy with a 0.5 percentage point
+difference compared to phone edit distances in out-of-vocabulary word recovery, as well as
+producing clustering hierarchies identical to those derived from human
+listening experiments in English dialect clustering. The theoretical framework
+also allows us to use the embeddings to predict the expected confusion of
+device wake-up words. All source code and pretrained models are provided.
+
+
+
+
+
+
+
+ ☆ Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods
+ and a New Transcript-Classifier Approach NeurIPS 2024
+
+
+
+
+
+
+
+
+ Tony T. Wang, John Hughes, Henry Sleight, Rylan Schaeffer, Rajashree Agrawal, Fazl Barez, Mrinank Sharma, Jesse Mu, Nir Shavit, Ethan Perez
+
+
+ Defending large language models against jailbreaks so that they never engage
+in a broadly-defined set of forbidden behaviors is an open problem. In this
+paper, we investigate the difficulty of jailbreak-defense when we only want to
+forbid a narrowly-defined set of behaviors. As a case study, we focus on
+preventing an LLM from helping a user make a bomb. We find that popular
+defenses such as safety training, adversarial training, and input/output
+classifiers are unable to fully solve this problem. In pursuit of a better
+solution, we develop a transcript-classifier defense which outperforms the
+baseline defenses we test. However, our classifier defense still fails in some
+circumstances, which highlights the difficulty of jailbreak-defense even in a
+narrow domain.
+
+
+
+ comment: Accepted to the AdvML-Frontiers and SoLaR workshops at NeurIPS 2024
+
+
+
+
+
+
+ ☆ Leveraging Large Language Models for Comparative Literature
+ Summarization with Reflective Incremental Mechanisms
+
+
+
+
+
+
+
+
+ Fernando Gabriela Garcia, Spencer Burns, Harrison Fuller
+
+
+ In this paper, we introduce ChatCite, a novel method leveraging large
+language models (LLMs) for generating comparative literature summaries. The
+ability to summarize research papers with a focus on key comparisons between
+studies is an essential task in academic research. Existing summarization
+models, while effective at generating concise summaries, fail to provide deep
+comparative insights. ChatCite addresses this limitation by incorporating a
+multi-step reasoning mechanism that extracts critical elements from papers,
+incrementally builds a comparative summary, and refines the output through a
+reflective memory process. We evaluate ChatCite on a custom dataset,
+CompLit-LongContext, consisting of 1000 research papers with annotated
+comparative summaries. Experimental results show that ChatCite outperforms
+several baseline methods, including GPT-4, BART, T5, and CoT, across various
+automatic evaluation metrics such as ROUGE and the newly proposed G-Score.
+Human evaluation further confirms that ChatCite generates more coherent,
+insightful, and fluent summaries compared to these baseline models. Our method
+provides a significant advancement in automatic literature review generation,
+offering researchers a powerful tool for efficiently comparing and synthesizing
+scientific research.
+
+
+
+ comment: 8 pages
+
+
+
+
+
+
+ ☆ Personalized Multimodal Large Language Models: A Survey
+
+
+ Multimodal Large Language Models (MLLMs) have become increasingly important
+due to their state-of-the-art performance and ability to integrate multiple
+data modalities, such as text, images, and audio, to perform complex tasks with
+high accuracy. This paper presents a comprehensive survey on personalized
+multimodal large language models, focusing on their architecture, training
+methods, and applications. We propose an intuitive taxonomy for categorizing
+the techniques used to personalize MLLMs to individual users, and discuss the
+techniques accordingly. Furthermore, we discuss how such techniques can be
+combined or adapted when appropriate, highlighting their advantages and
+underlying rationale. We also provide a succinct summary of personalization
+tasks investigated in existing research, along with the evaluation metrics
+commonly used. Additionally, we summarize the datasets that are useful for
+benchmarking personalized MLLMs. Finally, we outline critical open challenges.
+This survey aims to serve as a valuable resource for researchers and
+practitioners seeking to understand and advance the development of personalized
+multimodal large language models.
+
+
+
+
+
+
+
+ ☆ WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image
+
+
+ Recent advancements in computational pathology have produced patch-level
+Multi-modal Large Language Models (MLLMs), but these models are limited by
+their inability to analyze whole slide images (WSIs) comprehensively and their
+tendency to bypass crucial morphological features that pathologists rely on for
+diagnosis. To address these challenges, we first introduce WSI-Bench, a
+large-scale morphology-aware benchmark containing 180k VQA pairs from 9,850
+WSIs across 30 cancer types, designed to evaluate MLLMs' understanding of
+morphological characteristics crucial for accurate diagnosis. Building upon
+this benchmark, we present WSI-LLaVA, a novel framework for gigapixel WSI
+understanding that employs a three-stage training approach: WSI-text alignment,
+feature space alignment, and task-specific instruction tuning. To better assess
+model performance in pathological contexts, we develop two specialized WSI
+metrics: WSI-Precision and WSI-Relevance. Experimental results demonstrate that
+WSI-LLaVA outperforms existing models across all capability dimensions, with a
+significant improvement in morphological analysis, establishing a clear
+correlation between morphological understanding and diagnostic accuracy.
+
+
+
+ comment: 38 pages, 22 figures, 35 tables
+
+
+
+
+
+
+ ☆ Misalignment of Semantic Relation Knowledge between WordNet and Human
+ Intuition
+
+
+ WordNet provides a carefully constructed repository of semantic relations,
+created by specialists. But there is another source of information on semantic
+relations, the intuition of language users. We present the first systematic
+study of the degree to which these two sources are aligned. Investigating the
+cases of misalignment could support proper use of WordNet and facilitate its
+improvement. Our analysis, which uses templates to elicit responses from human
+participants, reveals a general misalignment of semantic relation knowledge
+between WordNet and human intuition. Further analyses find a systematic pattern
+of mismatch among synonymy and taxonomic relations (hypernymy and hyponymy),
+together with the fact that WordNet path length does not serve as a reliable
+indicator of human intuition regarding hypernymy or hyponymy relations.
+
+
+
+
+
+
+
+ ☆ Explainable and Interpretable Multimodal Large Language Models: A
+ Comprehensive Survey
+
+
+ The rapid development of Artificial Intelligence (AI) has revolutionized
+numerous fields, with large language models (LLMs) and computer vision (CV)
+systems driving advancements in natural language understanding and visual
+processing, respectively. The convergence of these technologies has catalyzed
+the rise of multimodal AI, enabling richer, cross-modal understanding that
+spans text, vision, audio, and video modalities. Multimodal large language
+models (MLLMs), in particular, have emerged as a powerful framework,
+demonstrating impressive capabilities in tasks like image-text generation,
+visual question answering, and cross-modal retrieval. Despite these
+advancements, the complexity and scale of MLLMs introduce significant
+challenges in interpretability and explainability, essential for establishing
+transparency, trustworthiness, and reliability in high-stakes applications.
+This paper provides a comprehensive survey on the interpretability and
+explainability of MLLMs, proposing a novel framework that categorizes existing
+research across three perspectives: (I) Data, (II) Model, (III) Training &
+Inference. We systematically analyze interpretability from token-level to
+embedding-level representations, assess approaches related to both architecture
+analysis and design, and explore training and inference strategies that enhance
+transparency. By comparing various methodologies, we identify their strengths
+and limitations and propose future research directions to address unresolved
+challenges in multimodal explainability. This survey offers a foundational
+resource for advancing interpretability and transparency in MLLMs, guiding
+researchers and practitioners toward developing more accountable and robust
+multimodal AI systems.
+
+
+
+
+
+
+
+ ☆ Improving Language Transfer Capability of Decoder-only Architecture in
+ Multilingual Neural Machine Translation
+
+
+ Existing multilingual neural machine translation (MNMT) approaches mainly
+focus on improving models with the encoder-decoder architecture to translate
+multiple languages. However, the decoder-only architecture has been explored less
+in MNMT due to its underperformance when trained solely on parallel data. In
+this work, we attribute the issue of the decoder-only architecture to its lack
+of language transfer capability. Specifically, the decoder-only architecture is
+insufficient in encoding source tokens with the target language features. We
+propose dividing the decoding process into two stages so that target tokens are
+explicitly excluded in the first stage to implicitly boost the transfer
+capability across languages. Additionally, we impose contrastive learning on
+translation instructions, resulting in improved performance in zero-shot
+translation. We conduct experiments on TED-19 and OPUS-100 datasets,
+considering both training from scratch and fine-tuning scenarios. Experimental
+results show that, compared to the encoder-decoder architecture, our methods
+not only perform competitively in supervised translations but also achieve
+improvements of up to 3.39 BLEU, 6.99 chrF++, 3.22 BERTScore, and 4.81 COMET in
+zero-shot translations.
+
+
+
+
+
+
+
+ ☆ Let's Think Var-by-Var: Large Language Models Enable Ad Hoc
+ Probabilistic Reasoning
+
+
+
+
+
+
+
+
+ Shepard Xia, Brian Lu, Jason Eisner
+
+
+ A hallmark of intelligence is the ability to flesh out underspecified
+situations using "common sense." We propose to extract that common sense from
+large language models (LLMs), in a form that can feed into probabilistic
+inference. We focus our investigation on $\textit{guesstimation}$ questions
+such as "How much are Airbnb listings in Newark, NJ?" Formulating a sensible
+answer without access to data requires drawing on, and integrating, bits of
+common knowledge about how $\texttt{Price}$ and $\texttt{Location}$ may relate
+to other variables, such as $\texttt{Property Type}$. Our framework answers
+such a question by synthesizing an $\textit{ad hoc}$ probabilistic model. First
+we prompt an LLM to propose a set of random variables relevant to the question,
+followed by moment constraints on their joint distribution. We then optimize
+the joint distribution $p$ within a log-linear family to maximize the overall
+constraint satisfaction. Our experiments show that LLMs can successfully be
+prompted to propose reasonable variables, and while the proposed numerical
+constraints can be noisy, jointly optimizing for their satisfaction reconciles
+them. When evaluated on probabilistic questions derived from three real-world
+tabular datasets, we find that our framework performs comparably to a direct
+prompting baseline in terms of total variation distance from the dataset
+distribution, and is similarly robust to noise.
+
+
+
+
+
+
+
+ ☆ BN-AuthProf: Benchmarking Machine Learning for Bangla Author Profiling
+ on Social Media Texts
+
+
+ Author profiling, the analysis of texts to uncover attributes such as gender
+and age of the author, has become essential with the widespread use of social
+media platforms. This paper focuses on author profiling in the Bangla language,
+aiming to extract valuable insights about anonymous authors based on their
+writing style on social media. The primary objective is to introduce and
+benchmark the performance of machine learning approaches on a newly created
+Bangla Author Profiling dataset, BN-AuthProf. The dataset comprises 30,131
+social media posts from 300 authors, labeled by their age and gender. Authors'
+identities and sensitive information were anonymized to ensure privacy. Various
+classical machine learning and deep learning techniques were employed to
+evaluate the dataset. For gender classification, the best accuracy achieved was
+80% using Support Vector Machine (SVM), while a Multinomial Naive Bayes (MNB)
+classifier achieved the best F1 score of 0.756. For age classification, MNB
+attained a maximum accuracy score of 91% with an F1 score of 0.905. This
+research highlights the effectiveness of machine learning in gender and age
+classification for Bangla author profiling, with practical implications
+spanning marketing, security, forensic linguistics, education, and criminal
+investigations, considering privacy and biases.
+
+
+
+ comment: Accepted to be Published in 2024 27th International Conference on
+ Computer and Information Technology (ICCIT)
+
+
+
+
+
+
+ ☆ A Multi-way Parallel Named Entity Annotated Corpus for English, Tamil
+ and Sinhala
+
+
+ This paper presents a multi-way parallel English-Tamil-Sinhala corpus
+annotated with Named Entities (NEs), where Sinhala and Tamil are low-resource
+languages. Using pre-trained multilingual Language Models (mLMs), we establish
+new benchmark Named Entity Recognition (NER) results on this dataset for
+Sinhala and Tamil. We also carry out a detailed investigation on the NER
+capabilities of different types of mLMs. Finally, we demonstrate the utility of
+our NER system on a low-resource Neural Machine Translation (NMT) task. Our
+dataset is publicly released: https://github.com/suralk/multiNER.
+
+
+
+
+
+
+
+ ☆ Impact of Data Snooping on Deep Learning Models for Locating
+ Vulnerabilities in Lifted Code
+
+
+
+
+
+
+
+
+ Gary A. McCully, John D. Hastings, Shengjie Xu
+
+
+ This study examines the impact of data snooping on neural networks for
+vulnerability detection in lifted code, building on previous research which
+used word2vec, and unidirectional and bidirectional transformer-based
+embeddings. The research specifically focuses on how model performance is
+affected when embedding models are trained on datasets that include samples also
+used for neural network training and validation. The results show that
+introducing data snooping did not significantly alter model performance,
+suggesting that data snooping had a minimal impact or that samples randomly
+dropped as part of the methodology contained hidden features critical to
+achieving optimal performance. In addition, the findings reinforce the
+conclusions of previous research, which found that models trained with GPT-2
+embeddings consistently outperformed neural networks trained with other
+embeddings. The fact that this holds even when data snooping is introduced into
+the embedding model indicates GPT-2's robustness in representing complex code
+features, even under less-than-ideal conditions.
+
+
+
+ comment: 7 pages, 2 figures
+
+
+
+
+
+
+ ☆ Single-Cell Omics Arena: A Benchmark Study for Large Language Models on
+ Cell Type Annotation Using Single-Cell Data
+
+
+ Over the past decade, the revolution in single-cell sequencing has enabled
+the simultaneous molecular profiling of various modalities across thousands of
+individual cells, allowing scientists to investigate the diverse functions of
+complex tissues and uncover underlying disease mechanisms. Among all the
+analytical steps, assigning individual cells to specific types is fundamental
+for understanding cellular heterogeneity. However, this process is usually
+labor-intensive and requires extensive expert knowledge. Recent advances in
+large language models (LLMs) have demonstrated their ability to efficiently
+process and synthesize vast corpora of text to automatically extract essential
+biological knowledge, such as marker genes, potentially promoting more
+efficient and automated cell type annotations. To thoroughly evaluate the
+capability of modern instruction-tuned LLMs in automating the cell type
+identification process, we introduce SOAR, a comprehensive benchmarking study
+of LLMs for cell type annotation tasks in single-cell genomics. Specifically,
+we assess the performance of 8 instruction-tuned LLMs across 11 datasets,
+spanning multiple cell types and species. Our study explores the potential of
+LLMs to accurately classify and annotate cell types in single-cell RNA
+sequencing (scRNA-seq) data, while extending their application to multiomics
+data through cross-modality translation. Additionally, we evaluate the
+effectiveness of chain-of-thought (CoT) prompting techniques in generating
+detailed biological insights during the annotation process. The results
+demonstrate that LLMs can provide robust interpretations of single-cell data
+without requiring additional fine-tuning, advancing the automation of cell type
+annotation in genomics research.
+
+
+
+
+
+
+
+ ☆ Does Few-Shot Learning Help LLM Performance in Code Synthesis?
+
+
+ Large language models (LLMs) have made significant strides in code generation
+through improved model design, training, and chain-of-thought. However,
+prompt-level optimizations remain an important yet under-explored aspect of
+LLMs for coding. This work focuses on the few-shot examples present in most
+code generation prompts, offering a systematic study on whether few-shot
+examples improve an LLM's coding capabilities, which few-shot examples have the
+largest impact, and how to select impactful examples. Our work offers 2
+approaches for selecting few-shot examples, a model-free method,
+CODEEXEMPLAR-FREE, and a model-based method, CODEEXEMPLAR-BASED. The 2 methods
+offer a trade-off between improved performance and reliance on training data
+and interpretability. Both methods significantly improve CodeLlama's coding
+ability across the popular HumanEval+ coding benchmark. In summary, our work
+provides valuable insights into how to pick few-shot examples in code
+generation prompts to improve LLM code generation capabilities.
+
+
+
+
+
+
+
+ ☆ Enhancing Trust in Large Language Models with Uncertainty-Aware
+ Fine-Tuning
+
+
+ Large language models (LLMs) have revolutionized the field of natural
+language processing with their impressive reasoning and question-answering
+capabilities. However, these models are sometimes prone to generating
+credible-sounding but incorrect information, a phenomenon known as LLM
+hallucinations. Reliable uncertainty estimation in LLMs is essential for
+fostering trust in their generated responses and serves as a critical tool for
+the detection and prevention of erroneous or hallucinated outputs. To achieve
+reliable and well-calibrated uncertainty quantification in open-ended and
+free-form natural language generation, we propose an uncertainty-aware
+fine-tuning approach for LLMs. This approach enhances the model's ability to
+provide reliable uncertainty estimates without compromising accuracy, thereby
+guiding them to produce more trustworthy responses. We introduce a novel
+uncertainty-aware causal language modeling loss function, grounded in the
+principles of decision theory. Through rigorous evaluation on multiple
+free-form question-answering datasets and models, we demonstrate that our
+uncertainty-aware fine-tuning approach yields better calibrated uncertainty
+estimates in natural language generation tasks than fine-tuning with the
+standard causal language modeling loss. Furthermore, the experimental results
+show that the proposed method significantly improves the model's ability to
+detect hallucinations and identify out-of-domain prompts.
+
+
+
+
+
+
+
+ ☆ MLD-EA: Check and Complete Narrative Coherence by Introducing Emotions
+ and Actions
+
+
+ Narrative understanding and story generation are critical challenges in
+natural language processing (NLP), with much of the existing research focused
+on summarization and question-answering tasks. While previous studies have
+explored predicting plot endings and generating extended narratives, they often
+neglect the logical coherence within stories, leaving a significant gap in the
+field. To address this, we introduce the Missing Logic Detector by Emotion and
+Action (MLD-EA) model, which leverages large language models (LLMs) to identify
+narrative gaps and generate coherent sentences that integrate seamlessly with
+the story's emotional and logical flow. The experimental results demonstrate
+that the MLD-EA model enhances narrative understanding and story generation,
+highlighting LLMs' potential as effective logic checkers in story writing with
+logical coherence and emotional consistency. This work fills a gap in NLP
+research and advances broader goals of creating more sophisticated and reliable
+story-generation systems.
+
+
+
+
+
+
+
+
+ Milad Fotouhi, Mohammad Taha Bahadori, Oluwaseyi Feyisetan, Payman Arabshahi, David Heckerman
+
+
+ The existing algorithms for identification of neurons responsible for
+undesired and harmful behaviors do not consider the effects of confounders such
+as topic of the conversation. In this work, we show that confounders can create
+spurious correlations and propose a new causal mediation approach that controls
+the impact of the topic. In experiments with two large language models, we
+study the localization hypothesis and show that, after adjusting for the effect
+of conversation topic, toxicity becomes less localized.
+
+
+
+
+
+
+
+ ☆ TDD-Bench Verified: Can LLMs Generate Tests for Issues Before They Get
+ Resolved?
+
+
+ Test-driven development (TDD) is the practice of writing tests first and
+coding later, and the proponents of TDD expound its numerous benefits. For
+instance, given an issue on a source code repository, tests can clarify the
+desired behavior among stakeholders before anyone writes code for the
+agreed-upon fix. Although there has been a lot of work on automated test
+generation for the practice "write code first, test later", there has been
+little such automation for TDD. Ideally, tests for TDD should be fail-to-pass
+(i.e., fail before the issue is resolved and pass after) and have good adequacy
+with respect to covering the code changed during issue resolution. This paper
+introduces TDD-Bench Verified, a high-quality benchmark suite of 449 issues
+mined from real-world GitHub code repositories. The benchmark's evaluation
+harness runs only relevant tests in isolation for simple yet accurate coverage
+measurements, and the benchmark's dataset is filtered both by human judges and
+by execution in the harness. This paper also presents Auto-TDD, an LLM-based
+solution that takes as input an issue description and a codebase (prior to
+issue resolution) and returns as output a test that can be used to validate the
+changes made for resolving the issue. Our evaluation shows that Auto-TDD yields
+a better fail-to-pass rate than the strongest prior work while also yielding
+high coverage adequacy. Overall, we hope that this work helps make developers
+more productive at resolving issues while simultaneously leading to more robust
+fixes.
+
+
+
+
+
+
+
+ ☆ CAISSON: Concept-Augmented Inference Suite of Self-Organizing Neural
+ Networks
+
+
+ We present CAISSON, a novel hierarchical approach to Retrieval-Augmented
+Generation (RAG) that transforms traditional single-vector search into a
+multi-view clustering framework. At its core, CAISSON leverages dual
+Self-Organizing Maps (SOMs) to create complementary organizational views of the
+document space, where each view captures different aspects of document
+relationships through specialized embeddings. The first view processes combined
+text and metadata embeddings, while the second operates on metadata enriched
+with concept embeddings, enabling a comprehensive multi-view analysis that
+captures both fine-grained semantic relationships and high-level conceptual
+patterns. This dual-view approach enables more nuanced document discovery by
+combining evidence from different organizational perspectives. To evaluate
+CAISSON, we develop SynFAQA, a framework for generating synthetic financial
+analyst notes and question-answer pairs that systematically tests different
+aspects of information retrieval capabilities. Drawing on HotPotQA's
+methodology for constructing multi-step reasoning questions, SynFAQA generates
+controlled test cases where each question is paired with the set of notes
+containing its ground-truth answer, progressing from simple single-entity
+queries to complex multi-hop retrieval tasks involving multiple entities and
+concepts. Our experimental results demonstrate substantial improvements over
+both basic and enhanced RAG implementations, particularly for complex
+multi-entity queries, while maintaining practical response times suitable for
+interactive applications.
+
+
+
+ comment: 26 pages, 7 figures, 8 tables
+
+
+
+
+
+
+ ☆ RARE: Retrieval-Augmented Reasoning Enhancement for Large Language
+ Models
+
+
+ This work introduces RARE (Retrieval-Augmented Reasoning Enhancement), a
+versatile extension to the mutual reasoning framework (rStar), aimed at
+enhancing reasoning accuracy and factual integrity across large language models
+(LLMs) for complex, knowledge-intensive tasks such as commonsense and medical
+reasoning. RARE incorporates two innovative actions within the Monte Carlo Tree
+Search (MCTS) framework: A6, which generates search queries based on the
+initial problem statement, performs information retrieval using those queries,
+and augments reasoning with the retrieved data to formulate the final answer;
+and A7, which leverages information retrieval specifically for generated
+sub-questions and re-answers these sub-questions with the relevant contextual
+information. Additionally, a Retrieval-Augmented Factuality Scorer is proposed
+to replace the original discriminator, prioritizing reasoning paths that meet
+high standards of factuality. Experimental results with LLaMA 3.1 show that
+RARE enables open-source LLMs to achieve competitive performance with top
+proprietary models like GPT-4 and GPT-4o. This research establishes RARE as a
+scalable solution for improving LLMs in domains where logical coherence and
+factual integrity are critical.
+
+
+
+ comment: 24 pages
+
+
+
+
+
+
+ ☆ Minimization of Boolean Complexity in In-Context Concept Learning
+
+
+
+
+
+
+
+
+ Leroy Z. Wang, R. Thomas McCoy, Shane Steinert-Threlkeld
+
+
+ What factors contribute to the relative success and corresponding
+difficulties of in-context learning for Large Language Models (LLMs)? Drawing
+on insights from the literature on human concept learning, we test LLMs on
+carefully designed concept learning tasks, and show that task performance
+highly correlates with the Boolean complexity of the concept. This suggests
+that in-context learning exhibits a learning bias for simplicity in a way
+similar to humans.
+
+
+
+
+
+
+
+ ☆ CNNSum: Exploring Long-Context Summarization with Large Language Models
+ in Chinese Novels
+
+
+
+
+
+
+
+
+ Lingxiao Wei, He Yan, Xiangju Lu, Junmin Zhu, Jun Wang, Wei Zhang
+
+
+ Large Language Models (LLMs) have been well-researched in many long-context
+tasks. However, due to high annotation costs, high-quality long-context summary
+datasets for training or evaluation are scarce, limiting further research. In
+this work, we introduce CNNSum, a new multi-scale Chinese long-context novel
+summarization benchmark comprising four subsets with lengths ranging from 16k
+to 128k and 695 samples in total; the annotations are human-driven.
+We evaluate commercial and open-source models on CNNSum and conduct a detailed
+analysis. Based on the observations, we further conduct fine-tuning exploration
+with short-context summary data. In our study: (1) GPT-4o underperformed due
+to excessive subjective commentary. (2) Currently, long-context summarization
+mainly relies on memory ability; small LLMs with stable longer context lengths
+are the most cost-effective. Using long data concatenated from short-context
+summaries yields a significant improvement. (3) Prompt templates may cause a
+large performance gap, but this can be mitigated through fine-tuning. (4) Fine-tuned
+Chat or Instruction versions may harm the Base model, and further fine-tuning
+cannot bridge the performance gap. (5) While models with RoPE base scaling exhibit
+strong extrapolation potential, their performance may vary significantly when
+combined with other interpolation methods, which need careful selection. (6)
+CNNSum provides more reliable and insightful evaluation results than other
+benchmarks. We release CNNSum to advance research in this field.
+
+
+
+
+
+
+
+ ☆ An Evolutionary Large Language Model for Hallucination Mitigation
+
+
+ The emergence of LLMs, like ChatGPT and Gemini, has marked the modern era of
+artificial intelligence applications characterized by high-impact applications
+generating text, images, and videos. However, these models usually come with
+one critical challenge called hallucination: the confident presentation of
+inaccurate or fabricated information. This problem attracts serious concern
+when these models are applied to specialized domains, including healthcare and
+law, where the accuracy and preciseness of information are absolute conditions.
+In this paper, we propose EvoLLMs, an innovative framework inspired by
+Evolutionary Computation, which automates the generation of high-quality
+Question-answering (QA) datasets while minimizing hallucinations. EvoLLMs
+employs genetic algorithms, mimicking evolutionary processes like selection,
+variation, and mutation, to guide LLMs in generating accurate, contextually
+relevant question-answer pairs. Comparative analysis shows that EvoLLMs
+consistently outperforms human-generated datasets in key metrics such as Depth,
+Relevance, and Coverage, while nearly matching human performance in mitigating
+hallucinations. These results highlight EvoLLMs as a robust and efficient
+solution for QA dataset generation, significantly reducing the time and
+resources required for manual curation.
+
+
+ Existing Scholarly Question Answering (QA) methods typically target
+homogeneous data sources, relying solely on either text or Knowledge Graphs
+(KGs). However, scholarly information often spans heterogeneous sources,
+necessitating the development of QA systems that can integrate information from
+multiple heterogeneous data sources. To address this challenge, we introduce
+Hybrid-SQuAD (Hybrid Scholarly Question Answering Dataset), a novel large-scale
+QA dataset designed to facilitate answering questions incorporating both text
+and KG facts. The dataset consists of 10.5K question-answer pairs generated by
+a large language model, leveraging the KGs - DBLP and SemOpenAlex alongside
+corresponding text from Wikipedia. In addition, we propose a RAG-based baseline
+hybrid QA model, achieving an exact match score of 69.65 on the Hybrid-SQuAD
+test set.
+
+
+
+
+
+
+
+ ☆ Optimizing Large Language Models for Turkish: New Methodologies in
+ Corpus Selection and Training
+
+
+
+
+
+
+
+
+ H. Toprak Kesgin, M. Kaan Yuce, Eren Dogan, M. Egemen Uzun, Atahan Uz, Elif Ince, Yusuf Erdem, Osama Shbib, Ahmed Zeer, M. Fatih Amasyali
+
+
+ In this study, we develop and assess new corpus selection and training
+methodologies to improve the effectiveness of Turkish language models.
+Specifically, we adapted Large Language Model generated datasets and translated
+English datasets into Turkish, integrating these resources into the training
+process. This approach led to substantial enhancements in model accuracy for
+both few-shot and zero-shot learning scenarios. Furthermore, the merging of
+these adapted models was found to markedly improve their performance. Human
+evaluative metrics, including task-specific performance assessments, further
+demonstrated that these adapted models possess a greater aptitude for
+comprehending the Turkish language and addressing logic-based queries. This
+research underscores the importance of refining corpus selection strategies to
+optimize the performance of multilingual models, particularly for
+under-resourced languages like Turkish.
+
+
+
+ comment: 2024 Innovations in Intelligent Systems and Applications Conference
+ (ASYU)
+
+
+
+
+
+
+ ☆ Cosmos-LLaVA: Chatting with the Visual Cosmos-LLaVA: Görselle Sohbet
+ Etmek
+
+
+
+
+
+
+
+
+ Ahmed Zeer, Eren Dogan, Yusuf Erdem, Elif Ince, Osama Shbib, M. Egemen Uzun, Atahan Uz, M. Kaan Yuce, H. Toprak Kesgin, M. Fatih Amasyali
+
+
+ In this study, a Turkish visual instruction model was developed and various
+model architectures and dataset combinations were analysed to improve the
+performance of this model. The Cosmos-LLaVA model, which is built by combining
+different large language models and image coders, is designed to overcome the
+deficiencies in the Turkish language. In the experiments, the effects of
+fine-tuning with various datasets on the model performance are analysed in
+detail. The results show that model architecture and dataset selection have a
+significant impact on performance.
+
+
+
+ comment: in Turkish language, 2024 8th International Artificial Intelligence
+ and Data Processing Symposium (IDAP)
+
+
+
+
+
+
+ ♻ ☆ From Isolated Conversations to Hierarchical Schemas: Dynamic Tree Memory
+ Representation for LLMs
+
+
+ Recent advancements in large language models have significantly improved
+their context windows, yet challenges in effective long-term memory management
+remain. We introduce MemTree, an algorithm that leverages a dynamic,
+tree-structured memory representation to optimize the organization, retrieval,
+and integration of information, akin to human cognitive schemas. MemTree
+organizes memory hierarchically, with each node encapsulating aggregated
+textual content, corresponding semantic embeddings, and varying abstraction
+levels across the tree's depths. Our algorithm dynamically adapts this memory
+structure by computing and comparing semantic embeddings of new and existing
+information to enrich the model's context-awareness. This approach allows
+MemTree to handle complex reasoning and extended interactions more effectively
+than traditional memory augmentation methods, which often rely on flat lookup
+tables. Evaluations on benchmarks for multi-turn dialogue understanding and
+document question answering show that MemTree significantly enhances
+performance in scenarios that demand structured memory management.
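
The abstract above describes memory being placed in a tree by comparing embeddings of new and existing information. A minimal sketch of that idea follows. It is not MemTree's actual algorithm: the `Node` class, the `insert` function, the cosine-similarity rule, and the `threshold` value are all illustrative assumptions, and the aggregation of child content into parent nodes is left out.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    text: str
    embedding: List[float]
    children: List["Node"] = field(default_factory=list)


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def insert(root: Node, text: str, embedding: List[float], threshold: float = 0.8) -> Node:
    """Descend toward the most similar child and attach the memory where similarity drops.

    Returns the parent node the new memory was attached to.
    """
    current = root
    while current.children:
        best = max(current.children, key=lambda c: cosine(c.embedding, embedding))
        # Assumed rule: stop descending once no child is similar enough.
        if cosine(best.embedding, embedding) < threshold:
            break
        current = best
    current.children.append(Node(text, embedding))
    return current


# Toy two-dimensional "embeddings" so the sketch runs without a real encoder.
root = Node("root", [0.0, 0.0])
parent_a = insert(root, "cats", [1.0, 0.0])
parent_b = insert(root, "kittens", [0.9, 0.1])
parent_c = insert(root, "taxes", [0.0, 1.0])
```

In this toy run, "kittens" lands under "cats" because the two vectors are nearly parallel, while "taxes" becomes a new child of the root.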
+
+
+
+
+
+
+
+ ♻ ☆ BPP-Search: Enhancing Tree of Thought Reasoning for Mathematical
+ Modeling Problem Solving
+
+
+
+
+
+
+
+
+ Teng Wang, Wing-Yin Yu, Zhenqi He, Zehua Liu, Xiongwei Han, Hailei Gong, Han Wu, Wei Shi, Ruifeng She, Fangzhou Zhu, Tao Zhong
+
+
+ LLMs exhibit advanced reasoning capabilities, offering the potential to
+transform natural language questions into mathematical models. However,
+existing open-source datasets in the operations research domain lack detailed
+annotations of the modeling process, such as variable definitions, focusing
+solely on objective values, which hinders reinforcement learning applications.
+To address this, we release the StructuredOR dataset, annotated with
+comprehensive labels that capture the complete mathematical modeling process.
+We further propose BPP-Search, an algorithm that integrates reinforcement
+learning into a tree-of-thought structure using Beam search, a Process reward
+model, and a pairwise Preference algorithm. This approach enables efficient
+exploration of tree structures, avoiding exhaustive search while improving
+accuracy. Extensive experiments on StructuredOR, NL4OPT, and MAMO-ComplexLP
+datasets show that BPP-Search significantly outperforms state-of-the-art
+methods. In tree-based reasoning, BPP-Search excels in accuracy and efficiency,
+enabling faster retrieval of correct solutions.
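
To make the beam-search component concrete, here is a small generic sketch of beam search over a tree of partial solutions, scored by a stand-in for a process reward model. The names `beam_search`, `expand`, and `score` are hypothetical, the toy scoring rule is invented for illustration, and the pairwise preference step mentioned in the abstract is not modelled.

```python
from typing import Callable, List, Tuple


def beam_search(
    root: str,
    expand: Callable[[str], List[str]],  # hypothetical: partial solution -> next candidates
    score: Callable[[str], float],       # hypothetical stand-in for a process reward model
    beam_width: int,
    depth: int,
) -> List[Tuple[float, str]]:
    """Keep the `beam_width` best-scoring partial solutions at each level of the tree."""
    beam = [(score(root), root)]
    for _ in range(depth):
        candidates = [(score(child), child) for _, node in beam for child in expand(node)]
        if not candidates:
            break
        # Prune: only the top-scoring children survive to the next level.
        beam = sorted(candidates, reverse=True)[:beam_width]
    return beam


# Toy tree over strings of "a"/"b"; the toy reward counts occurrences of "a".
toy_expand = lambda s: [s + "a", s + "b"]
toy_score = lambda s: float(s.count("a"))
result = beam_search("", toy_expand, toy_score, beam_width=2, depth=3)
```

With a beam width of 2 and depth 3, the search scores 6 nodes across the three levels instead of the 14 an exhaustive expansion would visit, which is the sense in which pruning avoids exhaustive search.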
+
+
+ Reinforcement learning from human feedback (RLHF) plays a crucial role in
+aligning language models with human preferences. While the significance of
+dataset quality is generally recognized, explicit investigations into its
+impact within the RLHF framework, to our knowledge, have been limited. This
+paper addresses the issue of text quality within the preference dataset by
+focusing on direct preference optimization (DPO), an increasingly adopted
+reward-model-free RLHF method. We confirm that text quality significantly
+influences the performance of models optimized with DPO more than those
+optimized with reward-model-based RLHF. Building on this new insight, we
+propose an extension of DPO, termed filtered direct preference optimization
+(fDPO). fDPO uses a trained reward model to monitor the quality of texts within
+the preference dataset during DPO training. Samples of lower quality are
+discarded based on comparisons with texts generated by the model being
+optimized, resulting in a more accurate dataset. Experimental results
+demonstrate that fDPO enhances the final model performance. Our code is
+available at https://github.com/CyberAgentAILab/filtered-dpo.
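
The filtering step described above can be sketched roughly as follows. This is an illustrative reading of the abstract, not the authors' code: `filter_preference_data`, `reward`, and `generate` are hypothetical names, and the specific rule (drop a pair when the policy's own sample outscores the chosen response) is an assumption about how the comparison might be made.

```python
from typing import Callable, List, Tuple

# One preference pair: (prompt, chosen response, rejected response).
Pair = Tuple[str, str, str]


def filter_preference_data(
    pairs: List[Pair],
    reward: Callable[[str, str], float],  # hypothetical reward model: (prompt, response) -> score
    generate: Callable[[str], str],       # hypothetical sampler for the policy being optimized
) -> List[Pair]:
    """Keep pairs whose chosen response scores at least as high as the policy's own output."""
    kept = []
    for prompt, chosen, rejected in pairs:
        policy_text = generate(prompt)
        # Assumed rule: discard the pair once the policy already writes better text.
        if reward(prompt, chosen) >= reward(prompt, policy_text):
            kept.append((prompt, chosen, rejected))
    return kept


# Toy stand-ins so the sketch runs: reward is response length, policy always says "okay".
toy_pairs = [("q1", "a detailed answer", "bad"), ("q2", "hm", "no")]
toy_reward = lambda prompt, response: float(len(response))
toy_generate = lambda prompt: "okay"
filtered = filter_preference_data(toy_pairs, toy_reward, toy_generate)
```

In a training loop, a filter like this would presumably be re-run as the policy improves, so the surviving dataset shrinks toward its higher-quality pairs.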
+
+
+
+ comment: EMNLP 2024
+
+
+
+
+
+
+ ♻ ☆ WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum
+ Reinforcement Learning
+
+
+ Large language models (LLMs) have shown remarkable potential as autonomous
+agents, particularly in web-based tasks. However, existing LLM web agents
+heavily rely on expensive proprietary LLM APIs, while open LLMs lack the
+necessary decision-making capabilities. This paper introduces WebRL, a
+self-evolving online curriculum reinforcement learning framework designed to
+train high-performance web agents using open LLMs. WebRL addresses three key
+challenges in building LLM web agents, including the scarcity of training
+tasks, sparse feedback signals, and policy distribution drift in online
+learning. Specifically, WebRL incorporates 1) a self-evolving curriculum that
+generates new tasks from unsuccessful attempts, 2) a robust outcome-supervised
+reward model (ORM), and 3) adaptive reinforcement learning strategies to ensure
+consistent improvements. We apply WebRL to transform open Llama-3.1 and GLM-4
+models into proficient web agents. On WebArena-Lite, WebRL improves the success
+rate of Llama-3.1-8B from 4.8% to 42.4%, and from 6.1% to 43% for GLM-4-9B.
+These open models significantly surpass the performance of GPT-4-Turbo (17.6%)
+and GPT-4o (13.9%) and outperform previous state-of-the-art web agents trained
+on open LLMs (AutoWebGLM, 18.2%). Our findings demonstrate WebRL's
+effectiveness in bridging the gap between open and proprietary LLM-based web
+agents, paving the way for more accessible and powerful autonomous web
+interaction systems.
+
+
+
+
+
+
+
+ ♻ ☆ The PRISM Alignment Dataset: What Participatory, Representative and
+ Individualised Human Feedback Reveals About the Subjective and Multicultural
+ Alignment of Large Language Models
+
+
+
+
+
+
+
+
+ Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, He He, Bertie Vidgen, Scott A. Hale
+
+
+ Human feedback is central to the alignment of Large Language Models (LLMs).
+However, open questions remain about methods (how), domains (where), people
+(who) and objectives (to what end) of feedback processes. To navigate these
+questions, we introduce PRISM, a dataset that maps the sociodemographics and
+stated preferences of 1,500 diverse participants from 75 countries, to their
+contextual preferences and fine-grained feedback in 8,011 live conversations
+with 21 LLMs. With PRISM, we contribute (i) wider geographic and demographic
+participation in feedback; (ii) census-representative samples for two countries
+(UK, US); and (iii) individualised ratings that link to detailed participant
+profiles, permitting personalisation and attribution of sample artefacts. We
+target subjective and multicultural perspectives on value-laden and
+controversial issues, where we expect interpersonal and cross-cultural
+disagreement. We use PRISM in three case studies to demonstrate the need for
+careful consideration of which humans provide what alignment data.
+
+
+
+
+
+
+
+ ♻ ☆ Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic
+ Vision-language Context Sparsification
+
+
+ Multimodal Large Language Models (MLLMs) have achieved remarkable success in
+vision understanding, reasoning, and interaction. However, the inference
+computation and memory increase progressively with the generation of output
+tokens during decoding, directly affecting the efficacy of MLLMs. Existing
+methods attempt to reduce the vision context redundancy to achieve efficient
+MLLMs. Unfortunately, the efficiency benefits of the vision context reduction
+in the prefill stage gradually diminish during the decoding stage. To address
+this problem, we propose a dynamic vision-language context sparsification
+framework, Dynamic-LLaVA, which dynamically reduces the redundancy of vision
+context in the prefill stage and decreases the memory and computation overhead
+of the generated language context during decoding. Dynamic-LLaVA designs a
+tailored sparsification inference scheme for different inference modes, i.e.,
+prefill, decoding with and without KV cache, to achieve efficient inference of
+MLLMs. In practice, Dynamic-LLaVA can reduce computation consumption by
+$\sim$75\% in the prefill stage. Meanwhile, throughout the entire generation
+process of MLLMs, Dynamic-LLaVA reduces computation consumption by $\sim$50\%
+under decoding without KV cache, while saving $\sim$50\% GPU memory overhead
+when decoding with KV cache, due to the vision-language context sparsification.
+Extensive experiments also demonstrate that Dynamic-LLaVA achieves efficient
+inference for MLLMs with negligible understanding and generation ability
+degradation or even performance gains compared to the full-context inference
+baselines. Code is available at https://github.com/Osilly/dynamic_llava .
+
+
+
+ comment: Code is available at https://github.com/Osilly/dynamic_llava
+
+
+
+
+
+
+
+ Hongjin Su, Shuyang Jiang, Yuhang Lai, Haoyuan Wu, Boao Shi, Che Liu, Qian Liu, Tao Yu
+
+
+ Recently, retrieval-augmented generation (RAG) has been successfully
+applied in code generation. However, existing pipelines for retrieval-augmented
+code generation (RACG) employ static knowledge bases with a single source,
+limiting the adaptation capabilities of Large Language Models (LLMs) to domains
+they have insufficient knowledge of. In this work, we develop a novel pipeline,
+EVOR, that employs the synchronous evolution of both queries and diverse
+knowledge bases. In two realistic settings where external knowledge is
+required to solve code generation tasks, we compile four new datasets
+associated with frequently updated libraries and long-tail programming
+languages, named EVOR-BENCH. Extensive experiments demonstrate that EVOR
+achieves two to four times the execution accuracy of other methods such
+as Reflexion (Shinn et al., 2024), DocPrompting (Zhou et al., 2023), etc. We
+demonstrate that EVOR is flexible and can be easily combined with them to
+achieve further improvement. Further analysis reveals that EVOR benefits from
+the synchronous evolution of queries and documents and the diverse information
+sources in the knowledge base. We hope that our studies will inspire more
+insights into the design of advanced RACG pipelines in future research. Our
+model, code, and data are available at https://arks-codegen.github.io.
+
+
+
+ comment: Retrieval-augmented code generation
+
+
+
+
+
+
+ ♻ ☆ BayLing 2: A Multilingual Large Language Model with Efficient Language
+ Alignment
+
+
+ Large language models (LLMs), with their powerful generative capabilities and
+vast knowledge, empower various tasks in everyday life. However, these
+abilities are primarily concentrated in high-resource languages, leaving
+low-resource languages with weaker generative capabilities and relatively
+limited knowledge. Enhancing the multilingual capabilities of LLMs is therefore
+crucial for serving over 100 linguistic communities worldwide. An intuitive
+approach to enhance the multilingual capabilities would be to construct
+instruction data for various languages, but constructing instruction data for
+over 100 languages is prohibitively costly. In this paper, we introduce BayLing
+2, which efficiently transfers generative capabilities and knowledge from
+high-resource languages to low-resource languages through language alignment.
+To achieve this, we constructed a dataset of 3.2 million instructions,
+comprising high-resource language instructions (Chinese and English) and
+cross-lingual instructions for 100+ languages and performed instruction tuning
+based on the dataset to facilitate the capability transfer between languages.
+Using Llama as the foundation model, we developed BayLing-2-7B, BayLing-2-13B,
+and BayLing-2-8B, and conducted a comprehensive evaluation of BayLing. For
+multilingual translation across 100+ languages, BayLing shows superior
+performance compared to open-source models of similar scale. For multilingual
+knowledge and understanding benchmarks, BayLing achieves significant
+improvements across over 20 low-resource languages, demonstrating its
+capability of effective knowledge transfer from high-resource to low-resource
+languages. Furthermore, results on English benchmarks indicate that BayLing
+maintains high performance in high-resource languages while enhancing the
+performance in low-resource languages. Demo, homepage, code and models of
+BayLing are available.
+
+
+ Transformers can capture long-range dependencies using self-attention,
+allowing tokens to attend to all others directly. However, stacking multiple
+attention layers leads to attention concentration. One natural way to address
+this issue is to use cross-layer attention, allowing information from earlier
+layers to be directly accessible to later layers. However, this approach is
+computationally expensive. To address this problem, we propose Transformer with
+residual value (ResFormer) which approximates cross-layer attention through
+adding a residual connection from the values of the first layer to all
+subsequent layers. Based on this method, one variant is the Transformer with
+single layer value (SVFormer), where all layers share the same value embedding
+from the first layer. Comprehensive empirical evidence demonstrates ResFormer
+achieves equivalent validation loss with 10.4% fewer model parameters and 13.6%
+less training data compared to Transformer, while maintaining similar memory
+usage and computational cost. Besides, SVFormer reduces KV cache size by nearly
+half with only a small performance penalty and can be integrated with other
+KV-efficient methods, yielding further reductions in KV cache, with performance
+influenced by sequence length and cumulative learning rate. Further
+visualization results suggest that ResFormer and SVFormer alleviate attention
+concentration in deeper layers through avoiding value-state drains and enhance
+representation across most layers.
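
A rough sketch of the value-residual idea in plain Python, for a single attention head. The function names and the equal-weight mixing rule are assumptions made for illustration. The abstract only states that a residual connection carries first-layer values to later layers, not how the two are combined.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Plain single-head scaled dot-product attention over lists of token vectors."""
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(len(v[0]))])
    return out


def attention_with_value_residual(q: Matrix, k: Matrix, v: Matrix, v_first: Matrix) -> Matrix:
    """Attention in a later layer whose values are mixed with the first layer's values."""
    # Assumed mixing rule: an equal-weight average of current and first-layer values.
    mixed = [[0.5 * a + 0.5 * b for a, b in zip(row, row1)] for row, row1 in zip(v, v_first)]
    return attention(q, k, mixed)
```

Under the same assumptions, calling `attention(q, k, v_first)` in every layer would correspond to the single-value variant, where later layers compute no values of their own and so need not cache them.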
+
+
+
+
+
+
+
+ ♻ ☆ Evaluating Distributed Representations for Multi-Level Lexical
+ Semantics: A Research Proposal
+
+
+ Modern neural networks (NNs), trained on extensive raw sentence data,
+construct distributed representations by compressing individual words into
+dense, continuous, high-dimensional vectors. These representations are expected
+to capture multi-level lexical meaning. In this thesis, our objective is to
+examine the efficacy of distributed representations from NNs in encoding
+lexical meaning. Initially, we identify and formalize three levels of lexical
+semantics: \textit{local}, \textit{global}, and \textit{mixed} levels. Then,
+for each level, we evaluate language models by collecting or constructing
+multilingual datasets, leveraging various language models, and employing
+linguistic analysis theories. This thesis builds a bridge between computational
+models and lexical semantics, so that each can complement the other.
+
+
+
+ comment: Paper under review
+
+
+
+
+
+
+ ♻ ☆ EnrichEvent: Enriching Social Data with Contextual Information for
+ Emerging Event Extraction
+
+
+ Social platforms have emerged as crucial venues for disseminating
+information and discussing real-life social events, offering researchers an
+excellent opportunity to design and implement novel event detection frameworks.
+However, most existing approaches only exploit keyword burstiness or network
+structures to detect unspecified events. Thus, they often struggle to identify
+unknown events, given the challenging nature of events and social data.
+Social data, e.g., tweets, is characterized by misspellings, incompleteness,
+word sense ambiguity, irregular language, and variation in aspects of
+opinions. Moreover, extracting discriminative features and patterns for
+evolving events by exploiting the limited structural knowledge is almost
+infeasible. To address these challenges, in this paper, we propose a novel
+framework, namely EnrichEvent, that leverages the linguistic and contextual
+representations of streaming social data. In particular, we leverage contextual
+and linguistic knowledge to detect semantically related tweets and enhance the
+effectiveness of the event detection approaches. Eventually, our proposed
+framework produces cluster chains for each event to show the evolving variation
+of the event through time. We conducted extensive experiments to evaluate our
+framework, validating its high performance and effectiveness in detecting and
+distinguishing unspecified social events.
+
+
+
+
+
+
+
+ ♻ ☆ Re-examining learning linear functions in context
+
+
+ In-context learning (ICL) is an attractive method of solving a wide range of
+problems. Inspired by Garg et al. (2022), we look closely at ICL in a variety
+of train and test settings for several transformer models of different sizes
+trained from scratch. Our study complements prior work by pointing out several
+systematic failures of these models to generalize to data not in the training
+distribution, thereby showing some limitations of ICL. We find that models
+adopt a strategy for this task that is very different from standard solutions.
+
+
+
+
+
+
+
+ ♻ ☆ Samba: Simple Hybrid State Space Models for Efficient Unlimited Context
+ Language Modeling
+
+
+ Efficiently modeling sequences with infinite context length has long been a
+challenging problem. Previous approaches have either suffered from quadratic
+computational complexity or limited extrapolation ability in length
+generalization. In this work, we present Samba, a simple hybrid architecture
+that layer-wise combines Mamba, a selective State Space Model (SSM), with
+Sliding Window Attention (SWA). Samba selectively compresses a given sequence
+into recurrent hidden states while still maintaining the ability to precisely
+recall recent memories with the attention mechanism. We scale Samba up to 3.8B
+parameters with 3.2T training tokens and demonstrate that it significantly
+outperforms state-of-the-art models across a variety of benchmarks. Pretrained
+on sequences of 4K length, Samba shows improved perplexity in context lengths
+of up to 1M in zero-shot. When finetuned on 4K-length sequences, Samba
+efficiently extrapolates to a 256K context length with perfect memory recall on
+the Passkey Retrieval task, and exhibits superior retrieval extrapolation on
+the challenging Phonebook task compared to full-attention models. As a
+linear-time sequence model, Samba achieves a 3.73x higher throughput compared
+to Transformers with grouped-query attention for user prompts of 128K length,
+and a 3.64x speedup when generating 64K tokens with unlimited streaming. Our
+code for training on open source data is publicly available at
+https://github.com/microsoft/Samba.
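
As a small illustration of why sliding window attention keeps cost linear in sequence length, the sketch below builds a causal sliding-window mask. It is a generic construction, not code from the Samba repository. The function name and the convention that the window includes the current token are assumptions.

```python
from typing import List


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        # Causal (j <= i) and local (j within the last `window` positions).
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=5, window=2)
```

Each row has at most `window` allowed positions, so the number of attended pairs grows as roughly `seq_len * window` rather than `seq_len ** 2`.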
+
+
+
+
+
+
+
+ ♻ ☆ Demystifying Language Model Forgetting with Low-rank Example
+ Associations
+
+
+ Large language models (LLMs) suffer from forgetting of upstream data when
+fine-tuned. Despite efforts on mitigating forgetting, few have investigated
+whether, and how, forgotten upstream examples are dependent on and associated
+with newly learned tasks. Insights on such associations enable efficient and
+targeted mitigation of forgetting. In this paper, we empirically analyze
+forgetting (measured in log-perplexity increase) that occurs in $N$ upstream
+examples of language modeling or instruction-tuning after fine-tuning LLMs on
+one of $M$ new tasks, visualized in $M\times N$ matrices. We demonstrate that
+the matrices display simple low-rank patterns, often well-approximated with
+multiplicative scalar effects of upstream examples and newly learned tasks. We
+also examine fine-grained associations with visualization and statistics.
+Leveraging the low-rank nature of the associations, we predict forgetting of
+upstream examples when fine-tuning on unseen tasks with matrix completion over
+the empirical associations. This enables fast identification of the most forgotten
+examples without expensive inference on the entire upstream data. The approach,
+despite its simplicity, outperforms prior approaches that learn semantic
+relationships of learned tasks and upstream examples with LMs for predicting
+forgetting. We demonstrate the practical utility of our analysis by showing
+statistically significantly reduced forgetting as we upweight predicted
+examples for replay at fine-tuning. Project page:
+https://inklab.usc.edu/lm-forgetting-prediction/
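
One way to read "multiplicative scalar effects" is as a rank-1 model in which forgetting for task i on example j is approximately a task factor times an example factor. The sketch below fits such a model to the observed entries by alternating least squares and fills in a missing one. The function name, the toy numbers, and the rank-1 restriction are illustrative assumptions, not the paper's exact procedure.

```python
from typing import List, Optional

Grid = List[List[Optional[float]]]


def rank1_complete(z: Grid, iters: int = 50) -> List[List[float]]:
    """Fit z[i][j] ~ a[i] * b[j] on the observed entries and return the full product."""
    m, n = len(z), len(z[0])
    a, b = [1.0] * m, [1.0] * n
    for _ in range(iters):
        # Least-squares update of each task factor with the example factors held fixed.
        for i in range(m):
            obs = [j for j in range(n) if z[i][j] is not None]
            den = sum(b[j] ** 2 for j in obs)
            if den > 0:
                a[i] = sum(z[i][j] * b[j] for j in obs) / den
        # Symmetric update of each upstream-example factor.
        for j in range(n):
            obs = [i for i in range(m) if z[i][j] is not None]
            den = sum(a[i] ** 2 for i in obs)
            if den > 0:
                b[j] = sum(z[i][j] * a[i] for i in obs) / den
    return [[a[i] * b[j] for j in range(n)] for i in range(m)]


# Toy forgetting matrix (2 tasks x 3 upstream examples) with one unobserved entry.
observed: Grid = [[1.0, 2.0, 3.0], [2.0, 4.0, None]]
completed = rank1_complete(observed)
```

Because the second row of the toy matrix is twice the first on the observed columns, the fitted model predicts a value close to 6 for the missing entry, without ever measuring it directly.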
+
+
+
+ comment: 10 pages; preprint
+
+
+
+
+
+
+ ♻ ☆ Sibyl: Empowering Empathetic Dialogue Generation in Large Language
+ Models via Sensible and Visionary Commonsense Inference COLING 2025
+
+
+ Recently, there has been a heightened interest in building chatbots based on
+Large Language Models (LLMs) to emulate human-like qualities in multi-turn
+conversations. Despite having access to commonsense knowledge to better
+understand the psychological aspects and causality of dialogue context, even
+these powerful LLMs struggle to achieve the goals of empathy and emotional
+support. Current commonsense knowledge derived from dialogue contexts is
+inherently limited and often fails to adequately anticipate the future course
+of a dialogue. This lack of foresight can mislead LLMs and hinder their ability
+to provide effective support. In response to this challenge, we present an
+innovative framework named Sensible and Visionary Commonsense Knowledge
+(Sibyl). Designed to concentrate on the immediately succeeding dialogue, this
+paradigm equips LLMs with the capability to uncover the implicit requirements
+of the conversation, aiming to elicit more empathetic responses. Experimental
+results demonstrate that incorporating our paradigm for acquiring commonsense
+knowledge into LLMs comprehensively enhances the quality of their responses.
+
+
+
+ comment: Accepted by COLING 2025
+
+
+
+
+
+
+ ♻ ☆ Towards Cross-Lingual Audio Abuse Detection in Low-Resource Settings
+ with Few-Shot Learning COLING 2025
+
+
+ Online abusive content detection, particularly in low-resource settings and
+within the audio modality, remains underexplored. We investigate the potential
+of pre-trained audio representations for detecting abusive language in
+low-resource languages, in this case, in Indian languages using Few Shot
+Learning (FSL). Leveraging powerful representations from models such as Wav2Vec
+and Whisper, we explore cross-lingual abuse detection using the ADIMA dataset
+with FSL. Our approach integrates these representations within the
+Model-Agnostic Meta-Learning (MAML) framework to classify abusive language in
+10 languages. We experiment with various shot sizes (50-200) evaluating the
+impact of limited data on performance. Additionally, a feature visualization
+study was conducted to better understand model behaviour. This study highlights
+the generalization ability of pre-trained models in low-resource scenarios and
+offers valuable insights into detecting abusive language in multilingual
+contexts.
+
+
+
+ comment: Accepted as part of the proceedings of COLING 2025
+
+
+
+
+
+
+ ♻ ☆ Model Editing for LLMs4Code: How Far are We? ICSE2025
+
+
+
+
+
+
+
+
+ Xiaopeng Li, Shangwen Wang, Shasha Li, Jun Ma, Jie Yu, Xiaodong Liu, Jing Wang, Bin Ji, Weimin Zhang
+
+
+ Large Language Models for Code (LLMs4Code) have been found to exhibit
+outstanding performance in the software engineering domain, especially in
+coding tasks. However, even the most advanced LLMs4Code can inevitably contain
+incorrect or outdated code knowledge. Due to the high cost of training
+LLMs4Code, it is impractical to re-train the models to fix this problematic
+code knowledge. Model editing is a new technical
+field for effectively and efficiently correcting erroneous knowledge in LLMs,
+where various model editing techniques and benchmarks have been proposed
+recently. Despite that, a comprehensive study that thoroughly compares and
+analyzes the performance of the state-of-the-art model editing techniques for
+adapting the knowledge within LLMs4Code across various code-related tasks is
+notably absent. To bridge this gap, we perform the first systematic study on
+applying state-of-the-art model editing approaches to repair the inaccuracy of
+LLMs4Code. To that end, we introduce a benchmark named CLMEEval, which consists
+of two datasets, i.e., CoNaLa-Edit (CNLE) with 21K+ code generation samples and
+CodeSearchNet-Edit (CSNE) with 16K+ code summarization samples. With the help
+of CLMEEval, we evaluate six advanced model editing techniques on three
+LLMs4Code: CodeLlama (7B), CodeQwen1.5 (7B), and Stable-Code (3B). Our findings
+include that the external memorization-based GRACE approach achieves the best
+knowledge editing effectiveness and specificity (the editing does not influence
+untargeted knowledge), while generalization (whether the editing can generalize
+to other semantically-identical inputs) is a universal challenge for existing
+techniques. Furthermore, building on in-depth case analysis, we introduce an
+enhanced version of GRACE called A-GRACE, which incorporates contrastive
+learning to better capture the semantics of the inputs.
+
+
+
+ comment: Accepted by ICSE2025. The code is available at:
+ https://github.com/xpq-tech/code-llmedit.git
+
+
+
+
+
+
+ ♻ ☆ AutoGuide: Automated Generation and Selection of Context-Aware
+ Guidelines for Large Language Model Agents
+
+
+
+
+
+
+
+
+ Yao Fu, Dong-Ki Kim, Jaekyeom Kim, Sungryull Sohn, Lajanugen Logeswaran, Kyunghoon Bae, Honglak Lee
+
+
+ Recent advances in large language models (LLMs) have empowered AI agents
+capable of performing various sequential decision-making tasks. However,
+effectively guiding LLMs to perform well in unfamiliar domains like web
+navigation, where they lack sufficient knowledge, has proven to be difficult
+with the demonstration-based in-context learning paradigm. In this paper, we
+introduce a novel framework, called AutoGuide, which addresses this limitation
+by automatically generating context-aware guidelines from offline experiences.
+Importantly, each context-aware guideline is expressed in concise natural
+language and follows a conditional structure, clearly describing the context
+where it is applicable. As a result, our guidelines facilitate the provision of
+relevant knowledge for the agent's current decision-making process, overcoming
+the limitations of the conventional demonstration-based learning paradigm. Our
+evaluation demonstrates that AutoGuide significantly outperforms competitive
+baselines in complex benchmark domains, including real-world web navigation.
+
+
+ Detecting euphemisms is essential for content security on various social
+media platforms, but existing methods designed for detecting euphemisms are
+ineffective for impromptu euphemisms. In this work, we make a first attempt at
+exploring impromptu euphemism detection and introduce the Impromptu
+Cybercrime Euphemisms Detection (ICED) dataset. Moreover, we propose a
+detection framework tailored to this problem, which employs context
+augmentation modeling and multi-round iterative training. Our detection
+framework mainly consists of a coarse-grained and a fine-grained classification
+model. The coarse-grained classification model removes most of the harmless
+content in the corpus to be detected. The fine-grained model, an impromptu
+euphemism detector, integrates context augmentation and multi-round iterative
+training to better predict the actual meaning of a masked token. In addition,
+we leverage ChatGPT to evaluate the model's capability. Experimental results
+demonstrate that our approach achieves a remarkable 76-fold improvement
+compared to the previous state-of-the-art euphemism detector.
+
+
+
+
+
+
+
+ ♻ ☆ CultureLLM: Incorporating Cultural Differences into Large Language
+ Models NeurIPS 2024
+
+
+ Large language models (LLMs) are reported to be partial to certain cultures
+owing to the training data dominance from the English corpora. Since
+multilingual cultural data are often expensive to collect, existing efforts
+handle this by prompt engineering or culture-specific pre-training. However,
+they might overlook the knowledge deficiency of low-resource culture and
+require extensive computing resources. In this paper, we propose CultureLLM, a
+cost-effective solution to incorporate cultural differences into LLMs.
+CultureLLM adopts World Value Survey (WVS) as seed data and generates
+semantically equivalent training data via the proposed semantic data
+augmentation. Using only 50 seed samples from WVS with augmented data, we
+fine-tune culture-specific LLMs and one unified model (CultureLLM-One) for 9
+cultures covering rich and low-resource languages. Extensive experiments on 60
+culture-related datasets demonstrate that CultureLLM significantly outperforms
+various counterparts such as GPT-3.5 (by 8.1%) and Gemini Pro (by 9.5%) with
+comparable performance to GPT-4 or even better. Our human study shows that the
+generated samples are semantically equivalent to the original samples,
+providing an effective solution for LLMs augmentation. Code is released at
+https://github.com/Scarelette/CultureLLM.
+
+
+
+ comment: NeurIPS 2024; Code is at https://github.com/Scarelette/CultureLLM
+
+
+
+
+
+
+ ♻ ☆ Intelligent Go-Explore: Standing on the Shoulders of Giant Foundation
+ Models
+
+
+ Go-Explore is a powerful family of algorithms designed to solve
+hard-exploration problems built on the principle of archiving discovered
+states, and iteratively returning to and exploring from the most promising
+states. This approach has led to superhuman performance across a wide variety
+of challenging problems including Atari games and robotic control, but requires
+manually designing heuristics to guide exploration (i.e., determine which
+states to save and explore from, and what actions to consider next), which is
+time-consuming and infeasible in general. To resolve this, we propose
+Intelligent Go-Explore (IGE) which greatly extends the scope of the original
+Go-Explore by replacing these handcrafted heuristics with the intelligence and
+internalized human notions of interestingness captured by giant pretrained
+foundation models (FMs). This provides IGE with a human-like ability to
+instinctively identify how interesting or promising any new state is (e.g.,
+discovering new objects, locations, or behaviors), even in complex environments
+where heuristics are hard to define. Moreover, IGE offers the exciting
+opportunity to recognize and capitalize on serendipitous discoveries: states
+encountered during exploration that are valuable in terms of exploration, yet
+where what makes them interesting was not anticipated by the human user. We
+evaluate our algorithm on a diverse range of language and vision-based tasks
+that require search and exploration. Across these tasks, IGE strongly exceeds
+classic reinforcement learning and graph search baselines, and also succeeds
+where prior state-of-the-art FM agents like Reflexion completely fail. Overall,
+Intelligent Go-Explore combines the tremendous strengths of FMs and the
+powerful Go-Explore algorithm, opening up a new frontier of research into
+creating more generally capable agents with impressive exploration
+capabilities.
+
+
+ System auditing is a vital technique for collecting system call events as
+system provenance and investigating complex multi-step attacks such as Advanced
+Persistent Threats. However, existing attack investigation methods struggle to
+uncover long attack sequences due to the massive volume of system provenance
+data and their inability to focus on attack-relevant parts. In this paper, we
+present Raptor, a defense system that enables human analysts to effectively
+analyze large-scale system provenance to reveal multi-step attack sequences.
+Raptor introduces an expressive domain-specific language, ProvQL, that offers
+essential primitives for various types of attack analyses (e.g., attack pattern
+search, attack dependency tracking) with user-defined constraints, enabling
+analysts to focus on attack-relevant parts and iteratively sift through the
+large provenance data. Moreover, Raptor provides an optimized execution engine
+for efficient language execution. Our extensive evaluations on a wide range of
+attack scenarios demonstrate the practical effectiveness of Raptor in
+facilitating timely attack investigation.
+
+
+
+
+
+
+
+ ♻ ☆ AtomR: Atomic Operator-Empowered Large Language Models for Heterogeneous
+ Knowledge Reasoning
+
+
+ Recent advancements in large language models (LLMs) have led to significant
+improvements in various natural language processing tasks, but it is still
+challenging for LLMs to perform knowledge-intensive complex question answering
+due to LLMs' inefficacy in reasoning planning and the hallucination problem. A
+typical solution is to employ retrieval-augmented generation (RAG) coupled with
+chain-of-thought (CoT) reasoning, which decomposes complex questions into
+chain-like sub-questions and applies iterative RAG at each sub-question.
+However, prior works exhibit sub-optimal reasoning planning and overlook
+dynamic knowledge retrieval from heterogeneous sources. In this paper, we
+propose AtomR, a novel heterogeneous knowledge reasoning framework that
+conducts multi-source reasoning at the atomic level. Drawing inspiration from
+the graph modeling of knowledge, AtomR leverages large language models (LLMs)
+to decompose complex questions into combinations of three atomic knowledge
+operators, significantly enhancing the reasoning process at both the planning
+and execution stages. We also introduce BlendQA, a novel evaluation benchmark
+tailored to assess complex heterogeneous knowledge reasoning. Experiments show
+that AtomR significantly outperforms state-of-the-art baselines across three
+single-source and two multi-source reasoning benchmarks, with notable
+performance gains of 9.4% on 2WikiMultihop and 9.5% on BlendQA.
+
+
+
+
+
+
+
+ ♻ ☆ Developing Story: Case Studies of Generative AI's Use in Journalism
+
+
+ Journalists are among the many users of large language models (LLMs). To
+better understand journalist-AI interactions, we conduct a study of LLM
+usage by two news agencies through browsing the WildChat dataset, identifying
+candidate interactions, and verifying them by matching to online published
+articles. Our analysis uncovers instances where journalists provide sensitive
+material such as confidential correspondence with sources or articles from
+other agencies to the LLM as stimuli and prompt it to generate articles, and
+publish these machine-generated articles with limited intervention (median
+output-publication ROUGE-L of 0.62). Based on our findings, we call for further
+research into what constitutes responsible use of AI, and the establishment of
+clear guidelines and best practices on using LLMs in a journalistic context.
+
+
+
+
+
+
+
+ ♻ ☆ Yi-Lightning Technical Report
+
+
+
+
+
+
+
+
+ 01.AI, Alan Wake, Albert Wang, Bei Chen, C. X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qichen Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang
+
+
+ This technical report presents Yi-Lightning, our latest flagship large
+language model (LLM). It achieves exceptional performance, ranking 6th overall
+on Chatbot Arena, with particularly strong results (2nd to 4th place) in
+specialized categories including Chinese, Math, Coding, and Hard Prompts.
+Yi-Lightning leverages an enhanced Mixture-of-Experts (MoE) architecture,
+featuring advanced expert segmentation and routing mechanisms coupled with
+optimized KV-caching techniques. Our development process encompasses
+comprehensive pre-training, supervised fine-tuning (SFT), and reinforcement
+learning from human feedback (RLHF), where we devise deliberate strategies for
+multi-stage training, synthetic data construction, and reward modeling.
+Furthermore, we implement RAISE (Responsible AI Safety Engine), a
+four-component framework to address safety issues across pre-training,
+post-training, and serving phases. Empowered by our scalable super-computing
+infrastructure, all these innovations substantially reduce training, deployment
+and inference costs while maintaining high-performance standards. With further
+evaluations on public academic benchmarks, Yi-Lightning demonstrates
+competitive performance against top-tier LLMs, while we observe a notable
+disparity between traditional, static benchmark results and real-world, dynamic
+human preferences. This observation prompts a critical reassessment of
+conventional benchmarks' utility in guiding the development of more intelligent
+and powerful AI systems for practical applications. Yi-Lightning is now
+available through our developer platform at https://platform.lingyiwanwu.com.
+
+
+
+
+
+
+
+ ♻ ☆ NüshuRescue: Revitalization of the endangered Nüshu Language with AI COLING 2025
+
+
+ The preservation and revitalization of endangered and extinct languages is a
+meaningful endeavor, conserving cultural heritage while enriching fields like
+linguistics and anthropology. However, these languages are typically
+low-resource, making their reconstruction labor-intensive and costly. This
+challenge is exemplified by Nüshu, a rare script historically used by Yao
+women in China for self-expression within a patriarchal society. To address
+this challenge, we introduce NüshuRescue, an AI-driven framework designed to
+train large language models (LLMs) on endangered languages with minimal data.
+NüshuRescue automates evaluation and expands target corpora to accelerate
+linguistic revitalization. As a foundational component, we developed NCGold, a
+500-sentence Nüshu-Chinese parallel corpus, the first publicly available
+dataset of its kind. Leveraging GPT-4-Turbo, with no prior exposure to Nüshu
+and only 35 short examples from NCGold, NüshuRescue achieved 48.69%
+translation accuracy on 50 withheld sentences and generated NCSilver, a set of
+98 newly translated modern Chinese sentences of varying lengths. A sample of
+both NCGold and NCSilver is included in the Supplementary Materials.
+Additionally, we developed FastText-based and Seq2Seq models to further support
+research on Nüshu. NüshuRescue provides a versatile and scalable tool for
+the revitalization of endangered languages, minimizing the need for extensive
+human input.
+
+
+
+ comment: Accepted to COLING 2025
+
+
+
+
+
+
+ ♻ ☆ CPRM: A LLM-based Continual Pre-training Framework for Relevance
+ Modeling in Commercial Search
+
+
+
+
+
+
+
+
+ Kaixin Wu, Yixin Ji, Zeyuan Chen, Qiang Wang, Cunxiang Wang, Hong Liu, Baijun Ji, Jia Xu, Zhongyi Liu, Jinjie Gu, Yuan Zhou, Linjian Mo
+
+
+ Relevance modeling between queries and items stands as a pivotal component in
+commercial search engines, directly affecting the user experience. Given the
+remarkable achievements of large language models (LLMs) in various natural
+language processing (NLP) tasks, LLM-based relevance modeling is gradually
+being adopted within industrial search systems. Nevertheless, foundational LLMs
+lack domain-specific knowledge and do not fully exploit the potential of
+in-context learning. Furthermore, structured item text remains underutilized,
+and there is a shortage in the supply of corresponding queries and background
+knowledge. We therefore propose CPRM (Continual Pre-training for Relevance
+Modeling), a framework designed for the continual pre-training of LLMs to
+address these issues. Our CPRM framework includes three modules: 1) employing
+both queries and multi-field items for joint pre-training to enhance domain
+knowledge, 2) applying in-context pre-training, a novel approach where LLMs are
+pre-trained on a sequence of related queries or items, and 3) conducting
+reading comprehension on items to produce associated domain knowledge and
+background information (e.g., generating summaries and corresponding queries)
+to further strengthen LLMs. Results on offline experiments and online A/B
+testing demonstrate that our model achieves convincing performance compared to
+strong baselines.
+
+
+
+
+
+
+
+ ♻ ☆ Large Language Model-empowered multimodal strain sensory system for
+ shape recognition, monitoring, and human interaction of tensegrity
+
+
+ A tensegrity-based system is a promising approach for dynamic exploration of
+uneven and unpredictable environments, particularly in space exploration.
+However, implementing such systems presents challenges in making them
+intelligent: state recognition, wireless monitoring, human interaction, and
+smart analysis and advising functions. Here, we introduce a 6-strut tensegrity
+integrated with 24 multimodal strain sensors that leverages both a deep
+learning model and large language models to realize a smart tensegrity. Using
+conductive flexible tendons assisted by a long short-term memory model, the
+tensegrity achieves self-shape reconstruction without external sensors. By
+integrating a Flask server and the gpt-3.5-turbo model, the tensegrity
+autonomously sends data to an iPhone for wireless monitoring and provides data
+analysis, explanation, prediction, and suggestions to humans for decision
+making. Finally, the human interaction system of the tensegrity helps humans
+obtain necessary information about the tensegrity in natural language.
+Overall, this intelligent tensegrity-based system with self-sensing tendons
+showcases potential for future exploration, making it a versatile tool for
+real-world applications.
+
+
+
+
+
+
+
+ ♻ ☆ Proactive Agent: Shifting LLM Agents from Reactive Responses to Active
+ Assistance
+
+
+ Agents powered by large language models have shown remarkable abilities in
+solving complex tasks. However, most agent systems remain reactive, limiting
+their effectiveness in scenarios requiring foresight and autonomous
+decision-making. In this paper, we tackle the challenge of developing proactive
+agents capable of anticipating and initiating tasks without explicit human
+instructions. We propose a novel data-driven approach for this problem.
+Firstly, we collect real-world human activities to generate proactive task
+predictions. These predictions are then labeled by human annotators as either
+accepted or rejected. The labeled data is used to train a reward model that
+simulates human judgment and serves as an automatic evaluator of the
+proactiveness of LLM agents. Building on this, we develop a comprehensive data
+generation pipeline to create a diverse dataset, ProactiveBench, containing
+6,790 events. Finally, we demonstrate that fine-tuning models with the proposed
+ProactiveBench can significantly elicit the proactiveness of LLM agents.
+Experimental results show that our fine-tuned model achieves an F1-Score of
+66.47% in proactively offering assistance, outperforming all open-source and
+closed-source models. These results highlight the potential of our method in
+creating more proactive and effective agent systems, paving the way for future
+advancements in human-agent collaboration.
+
+
+
+ comment: 9 pages, 4 figures
+
+
+
+
+
+
+ ♻ ☆ Analyzing Nobel Prize Literature with Large Language Models
+
+
+
+
+
+
+
+
+ Zhenyuan Yang, Zhengliang Liu, Jing Zhang, Cen Lu, Jiaxin Tai, Tianyang Zhong, Yiwei Li, Siyan Zhao, Teng Yao, Qing Liu, Jinlin Yang, Qixin Liu, Zhaowei Li, Kexin Wang, Longjun Ma, Dajiang Zhu, Yudan Ren, Bao Ge, Wei Zhang, Ning Qiang, Tuo Zhang, Tianming Liu
+
+
+ This study examines the capabilities of advanced Large Language Models
+(LLMs), particularly the o1 model, in the context of literary analysis. The
+outputs of these models are compared directly to those produced by
+graduate-level human participants. By focusing on two Nobel Prize-winning short
+stories, 'Nine Chapters' by Han Kang, the 2024 laureate, and 'Friendship' by
+Jon Fosse, the 2023 laureate, the research explores the extent to which AI can
+engage with complex literary elements such as thematic analysis,
+intertextuality, cultural and historical contexts, linguistic and structural
+innovations, and character development. Given the Nobel Prize's prestige and
+its emphasis on cultural, historical, and linguistic richness, applying LLMs to
+these works provides a deeper understanding of both human and AI approaches to
+interpretation. The study uses qualitative and quantitative evaluations of
+coherence, creativity, and fidelity to the text, revealing the strengths and
+limitations of AI in tasks typically reserved for human expertise. While LLMs
+demonstrate strong analytical capabilities, particularly in structured tasks,
+they often fall short in emotional nuance and coherence, areas where human
+interpretation excels. This research underscores the potential for human-AI
+collaboration in the humanities, opening new opportunities in literary studies
+and beyond.
+
+
+
+
+
+
+
+ ♻ ☆ Large Language Model-Brained GUI Agents: A Survey
+
+
+ GUIs have long been central to human-computer interaction, providing an
+intuitive and visually-driven way to access and interact with digital systems.
+The advent of LLMs, particularly multimodal models, has ushered in a new era of
+GUI automation. They have demonstrated exceptional capabilities in natural
+language understanding, code generation, and visual processing. This has paved
+the way for a new generation of LLM-brained GUI agents capable of interpreting
+complex GUI elements and autonomously executing actions based on natural
+language instructions. These agents represent a paradigm shift, enabling users
+to perform intricate, multi-step tasks through simple conversational commands.
+Their applications span web navigation, mobile app interactions, and
+desktop automation, offering a transformative user experience that
+revolutionizes how individuals interact with software. This emerging field is
+rapidly advancing, with significant progress in both research and industry.
+ To provide a structured understanding of this trend, this paper presents a
+comprehensive survey of LLM-brained GUI agents, exploring their historical
+evolution, core components, and advanced techniques. We address research
+questions such as existing GUI agent frameworks, the collection and utilization
+of data for training specialized GUI agents, the development of large action
+models tailored for GUI tasks, and the evaluation metrics and benchmarks
+necessary to assess their effectiveness. Additionally, we examine emerging
+applications powered by these agents. Through a detailed analysis, this survey
+identifies key research gaps and outlines a roadmap for future advancements in
+the field. By consolidating foundational knowledge and state-of-the-art
+developments, this work aims to guide both researchers and practitioners in
+overcoming challenges and unlocking the full potential of LLM-brained GUI
+agents.
+
+
+
+ comment: The collection of papers reviewed in this survey will be hosted and
+ regularly updated on the GitHub repository:
+ https://github.com/vyokky/LLM-Brained-GUI-Agents-Survey. Additionally, a
+ searchable webpage is available at https://aka.ms/gui-agent for easier access
+ and exploration.
+
+
+
+
+
+
+ ♻ ☆ Advancing Speech Language Models by Scaling Supervised Fine-Tuning with
+ Over 60,000 Hours of Synthetic Speech Dialogue Data
+
+
+ GPT-4o represents a significant milestone in enabling real-time
+interaction with large language models (LLMs) through speech. Its remarkably
+low latency and high fluency not only capture attention but also stimulate
+research interest in the field. This real-time speech interaction is
+particularly valuable in scenarios requiring rapid feedback and immediate
+responses, dramatically enhancing user experience. However, there is a notable
+lack of research focused on real-time large speech language models,
+particularly for Chinese. In this work, we present KE-Omni, a seamless large
+speech language model built upon Ke-SpeechChat, a large-scale high-quality
+synthetic speech interaction dataset consisting of 7 million Chinese and
+English conversations, featuring 42,002 speakers, and totaling over 60,000
+hours. This contributes significantly to the advancement of research and
+development in this field. The demos can be accessed at
+https://huggingface.co/spaces/KE-Team/KE-Omni.
+
+
+
+ comment: KE-Omni, Ke-SpeechChat
+
+
+
+
+
+
+ ♻ ☆ GSIFN: A Graph-Structured and Interlaced-Masked Multimodal
+ Transformer-based Fusion Network for Multimodal Sentiment Analysis
+
+
+ Multimodal Sentiment Analysis (MSA) leverages multiple data modalities to analyze
+human sentiment. Existing MSA models generally employ cutting-edge multimodal
+fusion and representation learning-based methods to promote MSA capability.
+However, there are two key challenges: (i) in existing multimodal fusion
+methods, the decoupling of modal combinations and tremendous parameter
+redundancy lead to insufficient fusion performance and efficiency; (ii) a
+challenging trade-off exists between representation capability and
+computational overhead in unimodal feature extractors and encoders. Our
+proposed GSIFN incorporates two main components to solve these problems: (i) a
+graph-structured and interlaced-masked multimodal Transformer. It adopts the
+Interlaced Mask mechanism to construct robust multimodal graph embedding,
+achieve all-modal-in-one Transformer-based fusion, and greatly reduce the
+computational overhead; (ii) a self-supervised learning framework with low
+computational overhead and high performance, which utilizes a parallelized LSTM
+with matrix memory to enhance non-verbal modal features for unimodal label
+generation. Evaluated on the MSA datasets CMU-MOSI, CMU-MOSEI, and CH-SIMS,
+GSIFN demonstrates superior performance with significantly lower computational
+overhead compared with previous state-of-the-art models.
+
+
+
+ comment: Withdrawn due to an error in the paper
+
+
+
+
+
+
+ ♻ ☆ Trustful LLMs: Customizing and Grounding Text Generation with Knowledge
+ Bases and Dual Decoders
+
+
+ Although people are impressed by the content generation skills of large
+language models, the use of LLMs, such as ChatGPT, is limited by the domain
+grounding of the content. The correctness and groundedness of the generated
+content need to be based on a verified context, such as results from
+Retrieval-Augmented Generation (RAG). One important issue when adapting LLMs to
+a customized domain is that the generated responses are often incomplete, or
+the additions are not verified and may even be hallucinated. Prior studies on
+hallucination detection have focused on evaluation metrics, which are not
+easily adaptable to dynamic domains and can be vulnerable to attacks like
+jail-breaking. In this work, we propose 1) a post-processing algorithm that
+leverages knowledge triplets in RAG context to correct hallucinations and 2) a
+dual-decoder model that fuses RAG context to guide the generation process.
+
+
+ A good summary can often be very useful during program comprehension. While a
+brief, fluent, and relevant summary can be helpful, it does require significant
+human effort to produce. Often, good summaries are unavailable in software
+projects, thus making maintenance more difficult. There has been a considerable
+body of research into automated AI-based methods, using Large Language Models
+(LLMs), to generate summaries of code; there has also been quite a bit of work on
+ways to measure the performance of such summarization methods, with special
+attention paid to how closely these AI-generated summaries resemble a summary a
+human might have produced. Measures such as BERTScore and BLEU have been
+suggested and evaluated with human-subject studies.
+ However, LLM-produced summaries can be too long, irrelevant, etc.: generally,
+too dissimilar to what a human might say. Given an LLM-produced code summary,
+how can we judge if a summary is good enough? Given some input source code, and
+an LLM-generated summary, existing approaches can help judge brevity, fluency
+and relevance; however, it's difficult to gauge whether an LLM-produced summary
+sufficiently resembles what a human might produce, without a "golden"
+human-produced summary to compare against. We study this resemblance question
+as a calibration problem: given just the summary from an LLM, can we compute a
+confidence measure, that provides a reliable indication of whether the summary
+sufficiently resembles what a human would have produced in this situation? We
+examine this question using several LLMs, for several languages, and in several
+different settings. Our investigation suggests approaches to provide reliable
+predictions of the likelihood that an LLM-generated summary would sufficiently
+resemble a summary a human might write for the same code.
+
+
+
+
+
+
+
+ ♻ ☆ Scaling Laws for Multilingual Language Models
+
+
+ We propose a novel scaling law for general-purpose decoder-only language
+models (LMs) trained on multilingual data, tackling the problem of balancing
+languages during multilingual pretraining. A primary challenge in studying
+multilingual scaling is the difficulty of analyzing individual language
+performance due to cross-lingual transfer. To address this, we shift the focus
+from individual languages to language families. We introduce and validate a
+hypothesis that the test cross-entropy loss for each language family is
+determined solely by its own sampling ratio, independent of other languages in
+the mixture. This insight simplifies the complexity of multilingual scaling and
+makes the analysis scalable to an arbitrary number of languages. Building on
+this hypothesis, we derive a power-law relationship that links performance with
+dataset size, model size and sampling ratios. This relationship enables us to
+predict performance across various combinations of the above three quantities,
+and derive the optimal sampling ratios at different model scales. To
+demonstrate the effectiveness and accuracy of our proposed scaling law, we
+perform a large-scale empirical study, training more than 100 models on 23
+languages spanning 5 language families. Our experiments show that the optimal
+sampling ratios derived from small models (85M parameters) generalize
+effectively to models that are several orders of magnitude larger (1.2B
+parameters), offering a resource-efficient approach for multilingual LM
+training at scale.
+
+
+
+
+
+
+
+ ♻ ☆ Mediating Modes of Thought: LLM's for design scripting
+
+
+ Architects adopt visual scripting and parametric design tools to explore more
+expansive design spaces (Coates, 2010), refine their thinking about the
+geometric logic of their design (Woodbury, 2010), and overcome conventional
+software limitations (Burry, 2011). Despite two decades of effort to make
+design scripting more accessible, a disconnect between a designer's free ways
+of thinking and the rigidity of algorithms remains (Burry, 2011). Recent
+developments in Large Language Models (LLMs) suggest this might soon change, as
+LLMs encode a general understanding of human context and exhibit the capacity
+to produce geometric logic. This project speculates that if LLMs can
+effectively mediate between user intent and algorithms, they become a powerful
+tool to make scripting in design more widespread and fun. We explore if such
+systems can interpret natural language prompts to assemble geometric operations
+relevant to computational design scripting. In the system, multiple layers of
+LLM agents are configured with specific context to infer the user intent and
+construct a sequential logic. Given a user's high-level text prompt, a
+geometric description is created, distilled into a sequence of logic
+operations, and mapped to software-specific commands. The completed script is
+constructed in the user's visual programming interface. The system succeeds in
+generating complete visual scripts up to a certain complexity but fails beyond
+this complexity threshold. It shows how LLMs can make design scripting much
+more aligned with human creativity and thought. Future research should explore
+conversational interactions, expand to multimodal inputs and outputs, and
+assess the performance of these tools.
+
+
+
+ comment: Published at ACADIA 2024
+
+
+
+
+
+
+ ♻ ☆ Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented
+ Generation NeurIPS 2024
+
+
+ Many language models now enhance their responses with retrieval capabilities,
+leading to the widespread adoption of retrieval-augmented generation (RAG)
+systems. However, despite retrieval being a core component of RAG, much of the
+research in this area overlooks the extensive body of work on fair ranking,
+neglecting the importance of considering all stakeholders involved. This paper
+presents the first systematic evaluation of RAG systems integrated with fair
+rankings. We focus specifically on measuring the fair exposure of each relevant
+item across the rankings utilized by RAG systems (i.e., item-side fairness),
+aiming to promote equitable growth for relevant item providers. To gain a deep
+understanding of the relationship between item-fairness, ranking quality, and
+generation quality in the context of RAG, we analyze nine different RAG systems
+that incorporate fair rankings across seven distinct datasets. Our findings
+indicate that RAG systems with fair rankings can maintain a high level of
+generation quality and, in many cases, even outperform traditional RAG systems,
+despite the general trend of a tradeoff between ensuring fairness and
+maintaining system-effectiveness. We believe our insights lay the groundwork
+for responsible and equitable RAG systems and open new avenues for future
+research. We publicly release our codebase and dataset at
+https://github.com/kimdanny/Fair-RAG.
+
+
+
+ comment: Top 5 Spotlight at AFME Workshop at NeurIPS 2024
+
+
+
+
+
+
+ ♻ ☆ A Combinatorial Approach to Neural Emergent Communication COLING 2025
+
+
+ Substantial research on deep learning-based emergent communication uses the
+referential game framework, specifically the Lewis signaling game. However, we
+argue that successful communication in this game typically needs only one or two
+symbols for target image classification because of a sampling pitfall in the
+training data. To address this issue, we provide a theoretical analysis and
+introduce a combinatorial algorithm SolveMinSym (SMS) to solve the symbolic
+complexity for classification, which is the minimum number of symbols in the
+message for successful communication. We use the SMS algorithm to create
+datasets with different symbolic complexity to empirically show that data with
+higher symbolic complexity increases the number of effective symbols in the
+emergent language.
+
+
+
+ comment: Accepted to COLING 2025
+
+
+
+
+
+
+ ♻ ☆ Investigating the Contextualised Word Embedding Dimensions Specified for
+ Contextual and Temporal Semantic Changes COLING2025
+
+
+ The sense-aware contextualised word embeddings (SCWEs) encode semantic
+changes of words within the contextualised word embedding (CWE) spaces. Despite
+the superior performance of SCWEs in contextual/temporal semantic change
+detection (SCD) benchmarks, it remains unclear as to how the meaning changes
+are encoded in the embedding space. To study this, we compare pre-trained CWEs
+and their fine-tuned versions on contextual and temporal semantic change
+benchmarks under Principal Component Analysis (PCA) and Independent Component
+Analysis (ICA) transformations. Our experimental results reveal that (a) although
+there exists a small number of axes that are specific to semantic changes of
+words in the pre-trained CWE space, this information gets distributed across
+all dimensions when fine-tuned, and (b) in contrast to prior work studying the
+geometry of CWEs, we find that PCA represents semantic changes better than
+ICA within the top 10% of axes. These findings encourage the development of
+more efficient SCD methods with a small number of SCD-aware dimensions. Source
+code is available at https://github.com/LivNLP/svp-dims .
+
+
+
+ comment: COLING2025
+
+
+
+
+
+
+ ♻ ☆ Composing Global Optimizers to Reasoning Tasks via Algebraic Objects in
+ Neural Nets
+
+
+ We prove rich algebraic structures of the solution space for 2-layer neural
+networks with quadratic activation and $L_2$ loss, trained on reasoning tasks
+in Abelian groups (e.g., modular addition). Such a rich structure enables
+analytical construction of global optimal solutions from partial solutions that
+only satisfy part of the loss, despite its high nonlinearity. We coin the
+framework as CoGO (Composing Global Optimizers). Specifically, we show that the
+weight space over different numbers of hidden nodes of the 2-layer network is
+equipped with a semi-ring algebraic structure, and the loss function to be
+optimized consists of monomial potentials, which are ring homomorphisms,
+allowing partial solutions to be composed into global ones by ring addition and
+multiplication. Our experiments show that around $95\%$ of the solutions
+obtained by gradient descent match exactly our theoretical constructions.
+Although the constructed global optimizers require only a small number of
+hidden nodes, our analysis of gradient dynamics shows that
+over-parameterization asymptotically decouples training dynamics and is
+beneficial. We further show that training dynamics favor simpler solutions
+under weight decay, and thus high-order global optimizers such as perfect
+memorization are unfavorable. Code can be found at
+https://github.com/facebookresearch/luckmatters/tree/yuandong3/ssl/real-dataset.
+
+
+
+ comment: Update presentation and add more lemmas for necessary conditions
+
+
+
+
+
+
+ ♻ ☆ Can Open-source LLMs Enhance Data Synthesis for Toxic Detection?: An
+ Experimental Study
+
+
+
+
+
+
+
+
+ Zheng Hui, Zhaoxiao Guo, Hang Zhao, Juanyong Duan, Lin Ai, Yinheng Li, Julia Hirschberg, Congrui Huang
+
+
+ Effective toxic content detection relies heavily on high-quality and diverse
+data, which serves as the foundation for robust content moderation models. This
+study explores the potential of open-source LLMs for harmful data synthesis,
+utilizing prompt engineering and fine-tuning techniques to enhance data quality
+and diversity. In a two-stage evaluation, we first examine the capabilities of
+six open-source LLMs in generating harmful data across multiple datasets using
+prompt engineering. In the second stage, we fine-tune these models to improve
+data generation while addressing challenges such as hallucination, data
+duplication, and overfitting. Our findings reveal that Mistral excels in
+generating high-quality and diverse harmful data with minimal hallucination.
+Furthermore, fine-tuning enhances data quality, offering scalable and
+cost-effective solutions for augmenting datasets for specific toxic content
+detection tasks. These results emphasize the significance of data synthesis in
+building robust, standalone detection models and highlight the potential of
+open-source LLMs to advance smaller downstream content moderation systems. We
+implemented this approach in real-world industrial settings, demonstrating the
+feasibility and efficiency of fine-tuned open-source LLMs for harmful data
+synthesis.
+
+
+
+ comment: 12 pages
+
+
+
+
+
+
+ ♻ ☆ Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning
+
+
+
+
+
+
+
+
+ Simran Kaur, Simon Park, Anirudh Goyal, Sanjeev Arora
+
+
+ We introduce Instruct-SkillMix, an automated approach for creating diverse,
+high-quality SFT data. The Instruct-SkillMix pipeline involves two stages, each
+leveraging an existing powerful LLM: (1) Skill extraction: uses the LLM to
+extract core "skills" for instruction-following, either from existing datasets,
+or by directly prompting the model; (2) Data generation: uses the powerful LLM
+to generate (instruction, response) data that exhibit a randomly chosen pair of
+these skills. Here, the use of random skill combinations promotes diversity and
+difficulty.
+ Vanilla SFT (i.e., no PPO, DPO, or RL methods) on data generated from
+Instruct-SkillMix leads to strong gains on instruction-following benchmarks
+such as AlpacaEval 2.0, MT-Bench, and WildBench. With just $4$K examples,
+LLaMA-3-8B-Base achieves 42.76% length-controlled win rate on AlpacaEval 2.0.
+To our knowledge, this achieves state-of-the-art performance among all models
+that have only undergone SFT (no RL methods) and competes with proprietary
+models such as Claude 3 Opus and LLaMA-3.1-405B-Instruct.
+ Ablation studies also suggest plausible reasons for why creating open
+instruction-tuning datasets via naive crowd-sourcing has proved difficult.
+Introducing low-quality answers ("shirkers") in $20\%$ of Instruct-SkillMix
+examples causes performance to plummet, sometimes catastrophically.
+ The Instruct-SkillMix pipeline is flexible and is adaptable to other
+settings.
+
+
+ In LLM alignment and many other ML applications, one often faces the
+Multi-Objective Fine-Tuning (MOFT) problem, i.e. fine-tuning an existing model
+with datasets labeled w.r.t. different objectives simultaneously. To address
+the challenge, we propose the HyperDPO framework, a conditioned one-shot
+fine-tuning approach that extends the Direct Preference Optimization (DPO)
+technique, originally developed for efficient LLM alignment with preference
+data, to accommodate the MOFT settings. By substituting the Bradley-Terry-Luce
+model in DPO with the Plackett-Luce model, our framework is capable of handling
+a wide range of MOFT tasks that involve listwise ranking datasets. Compared
+with previous approaches, HyperDPO enjoys an efficient one-shot training
+process for profiling the Pareto front of auxiliary objectives, and offers
+post-training control over trade-offs. Additionally, we propose a novel Hyper
+Prompt Tuning design that conveys continuous importance weights across
+objectives to transformer-based models without altering their architecture, and
+investigate the potential of temperature-conditioned networks for enhancing the
+flexibility of post-training control. We demonstrate the effectiveness and
+efficiency of the HyperDPO framework through its applications to various tasks,
+including Learning-to-Rank (LTR) and LLM alignment, highlighting its viability
+for large-scale ML deployments.
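+
+ As a hedged illustration of the model substitution described above (generic
+notation, not taken from the paper): let $s_i$ be an assumed scalar score for
+response $y_i$, for example the implicit DPO reward
+$\beta \log \frac{\pi_\theta(y_i \mid x)}{\pi_{\mathrm{ref}}(y_i \mid x)}$, and let
+$\tau$ be an observed ranking of $K$ responses. The pairwise and listwise
+likelihoods can then be sketched as:
+
+```latex
+% Bradley-Terry-Luce: probability that y_1 is preferred over y_2
+P(y_1 \succ y_2) = \frac{\exp(s_1)}{\exp(s_1) + \exp(s_2)} = \sigma(s_1 - s_2)
+
+% Plackett-Luce: probability of the full ranking \tau over K responses
+P\bigl(y_{\tau(1)} \succ \dots \succ y_{\tau(K)}\bigr)
+  = \prod_{k=1}^{K} \frac{\exp\bigl(s_{\tau(k)}\bigr)}{\sum_{j=k}^{K} \exp\bigl(s_{\tau(j)}\bigr)}
+```
+
+ For $K = 2$ the Plackett-Luce product reduces to the Bradley-Terry-Luce
+probability, since its last factor equals 1. Whether HyperDPO uses exactly this
+score is not stated in the abstract.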
+
+
+
+
+
+
+
+ ♻ ☆ When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of
+ Self-Correction of LLMs ACL 2024
+
+
+ Self-correction is an approach to improving responses from large language
+models (LLMs) by refining the responses using LLMs during inference. Prior work
+has proposed various self-correction frameworks using different sources of
+feedback, including self-evaluation and external feedback. However, there is
+still no consensus on the question of when LLMs can correct their own mistakes,
+as recent studies also report negative results. In this work, we critically
+survey a broad range of papers and discuss the conditions required for successful
+self-correction. We first find that prior studies often do not define their
+research questions in detail and involve impractical frameworks or unfair
+evaluations that overestimate self-correction. To tackle these issues, we
+categorize research questions in self-correction research and provide a
+checklist for designing appropriate experiments. Our critical survey based on
+the newly categorized research questions shows that (1) no prior work
+demonstrates successful self-correction with feedback from prompted LLMs,
+except for studies in tasks that are exceptionally suited for self-correction,
+(2) self-correction works well in tasks that can use reliable external
+feedback, and (3) large-scale fine-tuning enables self-correction.
+
+
+ Recommendation Systems have become integral to modern user experiences, but
+lack transparency in their decision-making processes. Existing explainable
+recommendation methods are hindered by reliance on a post-hoc paradigm, wherein
+explanation generators are trained independently of the underlying recommender
+models. This paradigm necessitates substantial human effort in data
+construction and raises concerns about explanation reliability. In this paper,
+we present ExpCTR, a novel framework that integrates large language model based
+explanation generation directly into the CTR prediction process. Inspired by
+recent advances in reinforcement learning, we employ two carefully designed
+reward mechanisms: LC alignment, which ensures explanations reflect user
+intentions, and IC alignment, which maintains consistency with traditional
+ID-based CTR models. Our approach incorporates an efficient training paradigm
+with LoRA and a three-stage iterative process. ExpCTR circumvents the need for
+extensive explanation datasets while fostering synergy between CTR prediction
+and explanation generation. Experimental results demonstrate that ExpCTR
+significantly enhances both recommendation accuracy and interpretability across
+three real-world datasets.
+
+
+ In conversational recommender systems (CRSs), conversations usually involve a
+set of items and item-related entities or attributes, e.g., director is a
+related entity of a movie. These items and item-related entities are often
+mentioned along the development of a dialog, leading to potential sequential
+dependencies among them. However, most existing CRSs neglect these potential
+sequential dependencies. In this article, we first propose a Transformer-based
+sequential conversational recommendation method, named TSCR, to model the
+sequential dependencies in the conversations to improve CRS. In TSCR, we
+represent conversations by items and the item-related entities, and construct
+user sequences to discover user preferences by considering both the mentioned
+items and item-related entities. Based on the constructed sequences, we deploy
+a Cloze task to predict the recommended items along a sequence. Meanwhile, in
+certain domains, knowledge graphs formed by the items and their related
+entities are readily available, which provide various kinds of
+associations among them. Given that TSCR does not benefit from such knowledge
+graphs, we then propose a knowledge graph enhanced version of TSCR, called
+TSCRKG. Specifically, we leverage the knowledge graph to initialize our
+model TSCRKG offline, and augment the user sequence of conversations (i.e., sequence of
+the mentioned items and item-related entities in the conversation) with
+multi-hop paths in the knowledge graph. Experimental results demonstrate that
+our TSCR model significantly outperforms state-of-the-art baselines, and the
+enhanced version TSCRKG further improves recommendation performance on top of
+TSCR.
+
+
+
+ comment: Accepted by ACM TOIS
+
+
+
+
+
+
+ ☆ Active Learning via Classifier Impact and Greedy Selection for
+ Interactive Image Retrieval
+
+
+ Active Learning (AL) is a user-interactive approach aimed at reducing
+annotation costs by selecting the most crucial examples to label. Although AL
+has been extensively studied for image classification tasks, the specific
+scenario of interactive image retrieval has received relatively little
+attention. This scenario presents unique characteristics, including an open-set
+and class-imbalanced binary classification, starting with very few labeled
+samples. We introduce a novel batch-mode Active Learning framework named GAL
+(Greedy Active Learning) that better copes with this application. It
+incorporates a new acquisition function for sample selection that measures the
+impact of each unlabeled sample on the classifier. We further embed this
+strategy in a greedy selection approach, better exploiting the samples within
+each batch. We evaluate our framework with both linear (SVM) and non-linear
+MLP/Gaussian Process classifiers. For the Gaussian Process case, we show a
+theoretical guarantee on the greedy approximation. Finally, we assess our
+performance for the interactive content-based image retrieval task on several
+benchmarks and demonstrate its superiority over existing approaches and common
+baselines. Code is available at https://github.com/barleah/GreedyAL.
+
+
+
+ comment: Accepted to Transactions on Machine Learning Research (TMLR)
+
+
+
+
+
+
+ ☆ CADMR: Cross-Attention and Disentangled Learning for Multimodal
+ Recommender Systems
+
+
+ The increasing availability and diversity of multimodal data in recommender
+systems offer new avenues for enhancing recommendation accuracy and user
+satisfaction. However, these systems must contend with high-dimensional, sparse
+user-item rating matrices, where reconstructing the matrix with only small
+subsets of preferred items for each user poses a significant challenge. To
+address this, we propose CADMR, a novel autoencoder-based multimodal
+recommender system framework. CADMR leverages multi-head cross-attention
+mechanisms and Disentangled Learning to effectively integrate and utilize
+heterogeneous multimodal data in reconstructing the rating matrix. Our approach
+first disentangles modality-specific features while preserving their
+interdependence, thereby learning a joint latent representation. The multi-head
+cross-attention mechanism is then applied to enhance user-item interaction
+representations with respect to the learned multimodal item latent
+representations. We evaluate CADMR on three benchmark datasets, demonstrating
+significant performance improvements over state-of-the-art methods.
+
+
+
+
+
+
+
+ ☆ Characterizing Information Shared by Participants to Coding Challenges:
+ The Case of Advent of Code
+
+
+ Advent of Code (AoC from now on) is a popular coding challenge that requires
+participants to solve programming puzzles for a variety of skill sets and levels. AoC follows
+the advent calendar, so it is an annual challenge that lasts for 25
+days. AoC participants usually post their solutions on social networks and
+discuss them online. These challenges are interesting to study since they could
+highlight the adoption of new tools, the evolution of the developer community,
+or the technological requirements of well-known companies. For these reasons,
+we first create a dataset of the 2019-2021 AoC editions containing the
+discussion threads made on the subreddit /r/adventofcode. Then, we
+propose a model based on stream graphs to best study this context, where we
+represent its most important actors through time: participants, comments, and
+programming languages. Thanks to our model, we investigate user participation,
+adoption of new programming languages during a challenge and between two of
+them, and resiliency of programming languages based on a Stack Overflow survey.
+We find that the top-used programming languages are almost the same in the
+three years, pointing out their importance. Moreover, participants tend to keep
+the same programming language for the whole challenge, while the ones attending
+two AoCs usually change it in the next one. Finally, we observe interesting
+results about the programming languages that are "Popular" or "Loved"
+according to the Stack Overflow survey. Firstly, these are the ones adopted for
+the longest time in an AoC edition, thanks to which users have a high chance of
+reaching the end of the challenge. Secondly, they are the most chosen when a
+participant decides to change programming language during the same challenge.
+
+
+
+ comment: 10 pages, 7 figures
+
+
+
+
+
+
+ ☆ CausalMob: Causal Human Mobility Prediction with LLMs-derived Human
+ Intentions toward Public Events KDD 2025
+
+
+ Large-scale human mobility exhibits spatial and temporal patterns that can
+assist policymakers in decision making. Although traditional prediction models
+attempt to capture these patterns, they are often disrupted by non-periodic public
+events, such as disasters and occasional celebrations. Since regular human
+mobility patterns are heavily affected by these events, estimating their causal
+effects is critical to accurate mobility predictions. Although news articles
+provide unique perspectives on these events in an unstructured format,
+processing them is a challenge. In this study, we propose a causality-augmented
+prediction model, called CausalMob, to analyze the causal effects of
+public events. We first utilize large language models (LLMs) to extract human
+intentions from news articles and transform them into features that act as
+causal treatments. Next, the model learns representations of spatio-temporal
+regional covariates from multiple data sources to serve as confounders for
+causal inference. Finally, we present a causal effect estimation framework to
+ensure event features remain independent of confounders during prediction.
+Based on large-scale real-world data, the experimental results show that the
+proposed model excels in human mobility prediction, outperforming
+state-of-the-art models.
+
+
+
+ comment: Accepted by KDD 2025
+
+
+
+
+
+
+ ☆ Leveraging Large Language Models for Comparative Literature
+ Summarization with Reflective Incremental Mechanisms
+
+
+
+
+
+
+
+
+ Fernando Gabriela Garcia, Spencer Burns, Harrison Fuller
+
+
+ In this paper, we introduce ChatCite, a novel method leveraging large
+language models (LLMs) for generating comparative literature summaries. The
+ability to summarize research papers with a focus on key comparisons between
+studies is an essential task in academic research. Existing summarization
+models, while effective at generating concise summaries, fail to provide deep
+comparative insights. ChatCite addresses this limitation by incorporating a
+multi-step reasoning mechanism that extracts critical elements from papers,
+incrementally builds a comparative summary, and refines the output through a
+reflective memory process. We evaluate ChatCite on a custom dataset,
+CompLit-LongContext, consisting of 1000 research papers with annotated
+comparative summaries. Experimental results show that ChatCite outperforms
+several baseline methods, including GPT-4, BART, T5, and CoT, across various
+automatic evaluation metrics such as ROUGE and the newly proposed G-Score.
+Human evaluation further confirms that ChatCite generates more coherent,
+insightful, and fluent summaries compared to these baseline models. Our method
+provides a significant advancement in automatic literature review generation,
+offering researchers a powerful tool for efficiently comparing and synthesizing
+scientific research.
+
+
+
+ comment: 8 pages
+
+
+
+
+
+
+ ☆ Personalized Multimodal Large Language Models: A Survey
+
+
+ Multimodal Large Language Models (MLLMs) have become increasingly important
+due to their state-of-the-art performance and ability to integrate multiple
+data modalities, such as text, images, and audio, to perform complex tasks with
+high accuracy. This paper presents a comprehensive survey on personalized
+multimodal large language models, focusing on their architecture, training
+methods, and applications. We propose an intuitive taxonomy for categorizing
+the techniques used to personalize MLLMs to individual users, and discuss the
+techniques accordingly. Furthermore, we discuss how such techniques can be
+combined or adapted when appropriate, highlighting their advantages and
+underlying rationale. We also provide a succinct summary of personalization
+tasks investigated in existing research, along with the evaluation metrics
+commonly used. Additionally, we summarize the datasets that are useful for
+benchmarking personalized MLLMs. Finally, we outline critical open challenges.
+This survey aims to serve as a valuable resource for researchers and
+practitioners seeking to understand and advance the development of personalized
+multimodal large language models.
+
+
+
+
+
+
+
+ ☆ Improving Sequential Recommender Systems with Online and In-store User
+ Behavior
+
+
+ Online e-commerce platforms have been extending in-store shopping, which
+allows users to keep the canonical online browsing and checkout experience
+while exploring in-store shopping. However, the growing transition between
+online and in-store becomes a challenge to sequential recommender systems for
+future online interaction prediction due to the lack of holistic modeling of
+hybrid user behaviors (online and in-store). The challenges are twofold. First,
+combining online and in-store user behavior data into a single data schema and
+supporting multiple stages in the model life cycle (pre-training, training,
+inference, etc.) organically needs a new data pipeline design. Second, online
+recommender systems, which solely rely on online user behavior sequences, must
+be redesigned to support online and in-store user data as input under the
+sequential modeling setting. To overcome the first challenge, we propose a
+hybrid, omnichannel data pipeline to compile online and in-store user behavior
+data by caching information from diverse data sources. To address the second challenge, we introduce a
+model-agnostic encoder module to the sequential recommender system to interpret
+users' in-store transactions and augment the modeling capacity for better
+online interaction prediction given the hybrid user behavior.
+
+
+
+ comment: 6 pages, IEEE BigData 2024 Workshop
+
+
+
+
+
+
+ ☆ Future of Information Retrieval Research in the Age of Generative AI
+
+
+
+
+
+
+
+
+ James Allan, Eunsol Choi, Daniel P. Lopresti, Hamed Zamani
+
+
+ In the fast-evolving field of information retrieval (IR), the integration of
+generative AI technologies such as large language models (LLMs) is transforming
+how users search for and interact with information. Recognizing this paradigm
+shift at the intersection of IR and generative AI (IR-GenAI), a visioning
+workshop supported by the Computing Community Consortium (CCC) was held in July
+2024 to discuss the future of IR in the age of generative AI. This workshop
+convened 44 experts in information retrieval, natural language processing,
+human-computer interaction, and artificial intelligence from academia,
+industry, and government to explore how generative AI can enhance IR and vice
+versa, and to identify the major challenges and opportunities in this rapidly
+advancing field.
+ This report summarizes the discussions as potentially important
+research topics and lists recommendations for academics, industry
+practitioners, institutions, evaluation campaigns, and funding agencies.
+
+
+
+
+
+
+
+ ☆ CAISSON: Concept-Augmented Inference Suite of Self-Organizing Neural
+ Networks
+
+
+ We present CAISSON, a novel hierarchical approach to Retrieval-Augmented
+Generation (RAG) that transforms traditional single-vector search into a
+multi-view clustering framework. At its core, CAISSON leverages dual
+Self-Organizing Maps (SOMs) to create complementary organizational views of the
+document space, where each view captures different aspects of document
+relationships through specialized embeddings. The first view processes combined
+text and metadata embeddings, while the second operates on metadata enriched
+with concept embeddings, enabling a comprehensive multi-view analysis that
+captures both fine-grained semantic relationships and high-level conceptual
+patterns. This dual-view approach enables more nuanced document discovery by
+combining evidence from different organizational perspectives. To evaluate
+CAISSON, we develop SynFAQA, a framework for generating synthetic financial
+analyst notes and question-answer pairs that systematically tests different
+aspects of information retrieval capabilities. Drawing on HotPotQA's
+methodology for constructing multi-step reasoning questions, SynFAQA generates
+controlled test cases where each question is paired with the set of notes
+containing its ground-truth answer, progressing from simple single-entity
+queries to complex multi-hop retrieval tasks involving multiple entities and
+concepts. Our experimental results demonstrate substantial improvements over
+both basic and enhanced RAG implementations, particularly for complex
+multi-entity queries, while maintaining practical response times suitable for
+interactive applications.
+
+
+
+ comment: 26 pages, 7 figures, 8 tables
+
+
+
+
+
+
+ ♻ ☆ Predictive Models in Sequential Recommendations: Bridging Performance
+ Laws with Data Quality Insights
+
+
+ Sequential Recommendation (SR) plays a critical role in predicting users'
+sequential preferences. Despite its growing prominence in various industries,
+the increasing scale of SR models incurs substantial computational costs and
+unpredictability, challenging developers to manage resources efficiently. Under
+this predicament, Scaling Laws have achieved significant success by examining
+the loss as models scale up. However, there remains a disparity between loss
+and model performance, which is of greater concern in practical applications.
+Moreover, as data continues to expand, it incorporates repetitive and
+inefficient data. In response, we introduce the Performance Law for SR models,
+which aims to theoretically investigate and model the relationship between
+model performance and data quality. Specifically, we first fit the HR and NDCG
+metrics to transformer-based SR models. Subsequently, we propose Approximate
+Entropy (ApEn) to assess data quality, presenting a more nuanced approach
+compared to traditional data quantity metrics. Our method enables accurate
+predictions across various dataset scales and model sizes, demonstrating a
+strong correlation in large SR models and offering insights into achieving
+optimal performance for any given model configuration.
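+
+ For reference, a sketch of how HR and NDCG are commonly defined in
+sequential recommendation, assuming leave-one-out evaluation with a single
+held-out item per user (the abstract does not state the exact protocol). The
+symbols are introduced here for illustration only: $U$ is the set of test
+users and $r_u$ is the rank the model assigns to user $u$'s held-out item.
+
+```latex
+% Hit Ratio: fraction of users whose held-out item appears in the top K
+\mathrm{HR@}K = \frac{1}{|U|} \sum_{u \in U} \mathbb{1}[r_u \le K]
+
+% NDCG with one relevant item per user (the ideal DCG equals 1)
+\mathrm{NDCG@}K = \frac{1}{|U|} \sum_{u \in U} \frac{\mathbb{1}[r_u \le K]}{\log_2(r_u + 1)}
+```
+
+ Under this assumption both metrics lie in $[0, 1]$, and NDCG@$K$ never
+exceeds HR@$K$ because each hit is discounted by its rank.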
+
+
+
+ comment: 12 pages, 5 figures
+
+
+
+
+
+
+ ♻ ☆ A Novel Approach to Comprehending Users' Preferences for Accurate
+ Personalized News Recommendation
+
+
+ Personalized news recommendation aims to assist users in finding news
+articles that align with their interests, which plays a pivotal role in
+mitigating users' information overload problem. Although many recent studies have
+pursued better personalized news recommendation, the following
+challenges remain underexplored: (C1) Comprehending manifold intents coupled
+within a news article, (C2) Differentiating varying post-read preferences of
+news articles, and (C3) Addressing the cold-start user problem. To tackle the
+aforementioned challenges together, in this paper, we propose a novel
+personalized news recommendation framework (CROWN) that employs (1)
+category-guided intent disentanglement for (C1), (2) consistency-based news
+representation for (C2), and (3) GNN-enhanced hybrid user representation for
+(C3). Furthermore, we incorporate a category prediction into the training
+process of CROWN as an auxiliary task, which provides supplementary supervisory
+signals to enhance intent disentanglement. Extensive experiments on two
+real-world datasets reveal that (1) CROWN provides consistent performance
+improvements over ten state-of-the-art news recommendation methods and (2) the
+proposed strategies significantly improve the accuracy of CROWN.
+
+
+
+ comment: 10 pages, 6 figures, 8 tables
+
+
+
+
+
+
+ ♻ ☆ Generalized compression and compressive search of large datasets
+
+
+
+
+
+
+
+
+ Morgan E. Prior, Thomas Howard III, Emily Light, Najib Ishaq, Noah M. Daniels
+
+
+ The Big Data explosion has necessitated the development of search algorithms
+that scale sub-linearly in time and memory.
+ While compression algorithms and search algorithms do exist independently,
+few algorithms offer both, and those which do are domain-specific.
+ We present panCAKES, a novel approach to compressive search, i.e., a way to
+perform $k$-NN and $\rho$-NN search on compressed data while only decompressing
+a small, relevant portion of the data.
+ panCAKES assumes the manifold hypothesis and leverages the low-dimensional
+structure of the data to compress and search it efficiently.
+ panCAKES is generic over any distance function for which the distance between
+two points is proportional to the memory cost of storing an encoding of one in
+terms of the other.
+ This property holds for many widely-used distance functions, e.g. string edit
+distances (Levenshtein, Needleman-Wunsch, etc.) and set dissimilarity measures
+(Jaccard, Dice, etc.).
+ We benchmark panCAKES on a variety of datasets, including genomic, proteomic,
+and set data.
+ We compare compression ratios to gzip, and search performance between the
+compressed and uncompressed versions of the same dataset.
+ panCAKES achieves compression ratios close to those of gzip, while offering
+sub-linear time performance for $k$-NN and $\rho$-NN search.
+ We conclude that panCAKES is an efficient, general-purpose algorithm for
+exact compressive search on large datasets that obey the manifold hypothesis.
+ We provide an open-source implementation of panCAKES in the Rust programming
+language.
+
+
+
+
+
+
+
+ ♻ ☆ CPRM: A LLM-based Continual Pre-training Framework for Relevance
+ Modeling in Commercial Search
+
+
+
+
+
+
+
+
+ Kaixin Wu, Yixin Ji, Zeyuan Chen, Qiang Wang, Cunxiang Wang, Hong Liu, Baijun Ji, Jia Xu, Zhongyi Liu, Jinjie Gu, Yuan Zhou, Linjian Mo
+
+
+ Relevance modeling between queries and items stands as a pivotal component in
+commercial search engines, directly affecting the user experience. Given the
+remarkable achievements of large language models (LLMs) in various natural
+language processing (NLP) tasks, LLM-based relevance modeling is gradually
+being adopted within industrial search systems. Nevertheless, foundational LLMs
+lack domain-specific knowledge and do not fully exploit the potential of
+in-context learning. Furthermore, structured item text remains underutilized,
+and there is a shortage in the supply of corresponding queries and background
+knowledge. We thereby propose CPRM (Continual Pre-training for Relevance
+Modeling), a framework designed for the continual pre-training of LLMs to
+address these issues. Our CPRM framework includes three modules: 1) employing
+both queries and multi-field items in joint pre-training to enhance domain
+knowledge, 2) applying in-context pre-training, a novel approach where LLMs are
+pre-trained on a sequence of related queries or items, and 3) conducting
+reading comprehension on items to produce associated domain knowledge and
+background information (e.g., generating summaries and corresponding queries)
+to further strengthen LLMs. Results on offline experiments and online A/B
+testing demonstrate that our model achieves convincing performance compared to
+strong baselines.
+
+
+
+
+
+
+
+ ♻ ☆ Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented
+ Generation NeurIPS 2024
+
+
+ Many language models now enhance their responses with retrieval capabilities,
+leading to the widespread adoption of retrieval-augmented generation (RAG)
+systems. However, despite retrieval being a core component of RAG, much of the
+research in this area overlooks the extensive body of work on fair ranking,
+neglecting the importance of considering all stakeholders involved. This paper
+presents the first systematic evaluation of RAG systems integrated with fair
+rankings. We focus specifically on measuring the fair exposure of each relevant
+item across the rankings utilized by RAG systems (i.e., item-side fairness),
+aiming to promote equitable growth for relevant item providers. To gain a deep
+understanding of the relationship between item fairness, ranking quality, and
+generation quality in the context of RAG, we analyze nine different RAG systems
+that incorporate fair rankings across seven distinct datasets. Our findings
+indicate that RAG systems with fair rankings can maintain a high level of
+generation quality and, in many cases, even outperform traditional RAG systems,
+despite the general trend of a tradeoff between ensuring fairness and
+maintaining system effectiveness. We believe our insights lay the groundwork
+for responsible and equitable RAG systems and open new avenues for future
+research. We publicly release our codebase and dataset at
+https://github.com/kimdanny/Fair-RAG.
+
+
+
+ comment: Top 5 Spotlight at AFME Workshop at NeurIPS 2024
+
+
+
+
+
+
+
+
+
+ Machine Learning 152
+
+
+
+
+
+ ☆ Scaling BERT Models for Turkish Automatic Punctuation and Capitalization
+ Correction
+
+
+
+
+
+
+
+
+ Abdulkader Saoud, Mahmut Alomeyr, Himmet Toprak Kesgin, Mehmet Fatih Amasyali
+
+
+ This paper investigates the effectiveness of BERT-based models for automated
+punctuation and capitalization corrections in Turkish texts across five
+distinct model sizes. The models are designated as Tiny, Mini, Small, Medium,
+and Base. The design and capabilities of each model are tailored to address the
+specific challenges of the Turkish language, with a focus on optimizing
+performance while minimizing computational overhead. The study presents a
+systematic comparison of the performance metrics (precision, recall, and F1
+score) of each model, offering insights into their applicability in diverse
+operational contexts. The results demonstrate a significant improvement in text
+readability and accuracy as model size increases, with the Base model achieving
+the highest correction precision. This research provides a comprehensive guide
+for selecting the appropriate model size based on specific user needs and
+computational resources, establishing a framework for deploying these models in
+real-world applications to enhance the quality of written Turkish.
+
+
+
+ comment: 2024 Innovations in Intelligent Systems and Applications Conference
+ (ASYU)
+
+
+
+
+
+
+ ☆ An ADHD Diagnostic Interface Based on EEG Spectrograms and Deep Learning
+ Techniques
+
+
+ This paper introduces an innovative approach to
+attention-deficit/hyperactivity disorder (ADHD) diagnosis by employing deep
+learning (DL) techniques on electroencephalography (EEG) signals. This method
+addresses the limitations of current behavior-based diagnostic methods, which
+often lead to misdiagnosis and gender bias. By utilizing a publicly available
+EEG dataset and converting the signals into spectrograms, a ResNet-18
+convolutional neural network (CNN) architecture was used to extract features
+for ADHD classification. The model achieved high precision and recall and an
+overall F1 score of 0.9. Feature extraction highlighted significant brain
+regions (frontopolar, parietal, and occipital lobes) associated with ADHD.
+These insights guided the creation of a three-part digital diagnostic system,
+facilitating cost-effective and accessible ADHD screening, especially in school
+environments. This system enables earlier and more accurate identification of
+students at risk for ADHD, providing timely support to enhance their
+developmental outcomes. This study showcases the potential of integrating EEG
+analysis with DL to enhance ADHD diagnostics, presenting a viable alternative
+to traditional methods.
+
+
+ Reinforcement learning from human feedback (RLHF) has been crucial in
+aligning large language models (LLMs) with human values. Traditionally, RLHF
+involves generating responses to a query and using a reward model to assign a
+reward to the entire response. However, this approach faces challenges due to
+its reliance on a single, sparse reward, which makes it challenging for the
+model to identify which parts of the sequence contribute most significantly to
+the final reward. Recent methods have attempted to address this limitation by
+introducing token-level rewards. However, these methods often rely on either a
+trained credit assignment model or AI annotators, raising concerns about the
+quality and reliability of the rewards. In this paper, we propose token-level
+reward regularization (T-REG), a novel approach that leverages both
+sequence-level and token-level rewards for preference optimization. Harnessing
+the self-refinement capabilities of LLMs, our method uses contrastive prompting
+to enable LLMs to self-generate token-level rewards. These self-generated
+rewards then act as reward regularization, guiding the model to more
+effectively distribute sequence-level rewards across tokens. This facilitates
+better token-level credit assignment and enhances alignment performance.
+Experiments on instruction-following benchmarks, including Alpaca Eval 2
+and Arena-Hard, show that our method consistently outperforms baseline methods
+by up to 3.8% and 4.4%, respectively. We will release the code and models at
+https://github.com/wzhouad/T-REG.
+
+
+
+
+
+
+
+ ☆ The Asymptotic Behavior of Attention in Transformers
+
+
+
+
+
+
+
+
+ Álvaro Rodríguez Abella, João Pedro Silvestre, Paulo Tabuada
+
+
+ A key component of transformers is the attention mechanism orchestrating how
+each token influences the propagation of every other token through a
+transformer. In this paper we provide a rigorous, mathematical analysis of the
+asymptotic properties of attention in transformers. Although we present several
+results based on different assumptions, all of them point to the same
+conclusion: all tokens asymptotically converge to each other, a phenomenon that
+has been empirically reported in the literature. Our findings are carefully
+compared with existing theoretical results and illustrated by simulations and
+experimental studies using the GPT-2 model.
+
+
+ Contact-rich bimanual manipulation involves precise coordination of two arms
+to change object states through strategically selected contacts and motions.
+Due to the inherent complexity of these tasks, acquiring sufficient
+demonstration data and training policies that generalize to unseen scenarios
+remain a largely unresolved challenge. Building on recent advances in planning
+through contacts, we introduce Generalizable Planning-Guided Diffusion Policy
+Learning (GLIDE), an approach that effectively learns to solve contact-rich
+bimanual manipulation tasks by leveraging model-based motion planners to
+generate demonstration data in high-fidelity physics simulation. Through
+efficient planning in randomized environments, our approach generates
+large-scale and high-quality synthetic motion trajectories for tasks involving
+diverse objects and transformations. We then train a task-conditioned diffusion
+policy via behavior cloning using these demonstrations. To tackle the
+sim-to-real gap, we propose a set of essential design options in feature
+extraction, task representation, action prediction, and data augmentation that
+enable learning robust prediction of smooth action sequences and generalization
+to unseen scenarios. Through experiments in both simulation and the real world,
+we demonstrate that our approach can enable a bimanual robotic system to
+effectively manipulate objects of diverse geometries, dimensions, and physical
+properties. Website: https://glide-manip.github.io/
+
+
+
+
+
+
+
+ ☆ Mind the Gap: Examining the Self-Improvement Capabilities of Large
+ Language Models
+
+
+ Self-improvement is a mechanism in Large Language Model (LLM) pre-training,
+post-training and test-time inference. We explore a framework where the model
+verifies its own outputs, filters or reweights data based on this verification,
+and distills the filtered data. Despite several empirical successes, a
+fundamental understanding is still lacking. In this work, we initiate a
+comprehensive, modular and controlled study on LLM self-improvement. We provide
+a mathematical formulation for self-improvement, which is largely governed by a
+quantity which we formalize as the generation-verification gap. Through
+experiments with various model families and tasks, we discover a scaling
+phenomenon of self-improvement -- a variant of the generation-verification gap
+scales monotonically with the model pre-training flops. We also examine when
+self-improvement is possible, an iterative self-improvement procedure, and ways
+to improve its performance. Our findings not only advance understanding of LLM
+self-improvement with practical implications, but also open numerous avenues
+for future research into its capabilities and boundaries.
+
+
+ Many important datasets contain samples that are missing one or more feature
+values. Maintaining the interpretability of machine learning models in the
+presence of such missing data is challenging. Singly or multiply imputing
+missing values complicates the model's mapping from features to labels. On the
+other hand, reasoning on indicator variables that represent missingness
+introduces a potentially large number of additional terms, sacrificing
+sparsity. We solve these problems with M-GAM, a sparse, generalized, additive
+modeling approach that incorporates missingness indicators and their
+interaction terms while maintaining sparsity through l0 regularization. We show
+that M-GAM provides similar or superior accuracy to prior methods while
+significantly improving sparsity relative to either imputation or naive
+inclusion of indicator variables.
+
+
+
+ comment: Published in NeurIPS 2024
+
+
+
+
+
+
+ ☆ The Space Complexity of Approximating Logistic Loss
+
+
+ We provide space complexity lower bounds for data structures that approximate
+logistic loss up to $\epsilon$-relative error on a logistic regression problem
+with data $\mathbf{X} \in \mathbb{R}^{n \times d}$ and labels $\mathbf{y} \in
+\{-1,1\}^n$. The space complexity of existing coreset constructions depends on a
+natural complexity measure $\mu_\mathbf{y}(\mathbf{X})$, first defined in
+(Munteanu, 2018). We give an $\tilde{\Omega}(\frac{d}{\epsilon^2})$ space
+complexity lower bound in the regime $\mu_\mathbf{y}(\mathbf{X}) = O(1)$ that
+shows existing coresets are optimal in this regime up to lower order factors.
+We also prove a general $\tilde{\Omega}(d\cdot \mu_\mathbf{y}(\mathbf{X}))$
+space lower bound when $\epsilon$ is constant, showing that the dependency on
+$\mu_\mathbf{y}(\mathbf{X})$ is not an artifact of mergeable coresets. Finally,
+we refute a prior conjecture that $\mu_\mathbf{y}(\mathbf{X})$ is hard to
+compute by providing an efficient linear programming formulation, and we
+empirically compare our algorithm to prior approximate methods.
+
+
+
+ comment: arXiv admin note: text overlap with arXiv:2303.14284
+
+
+
+
+
+
+ ☆ Sharp-It: A Multi-view to Multi-view Diffusion Model for 3D Synthesis
+ and Manipulation
+
+
+
+
+
+
+
+
+ Yiftach Edelstein, Or Patashnik, Dana Cohen-Bar, Lihi Zelnik-Manor
+
+
+ Advancements in text-to-image diffusion models have led to significant
+progress in fast 3D content creation. One common approach is to generate a set
+of multi-view images of an object, and then reconstruct it into a 3D model.
+However, this approach bypasses the use of a native 3D representation of the
+object and is hence prone to geometric artifacts and limited in controllability
+and manipulation capabilities. An alternative approach involves native 3D
+generative models that directly produce 3D representations. These models,
+however, are typically limited in their resolution, resulting in lower quality
+3D objects. In this work, we bridge the quality gap between methods that
+directly generate 3D representations and ones that reconstruct 3D objects from
+multi-view images. We introduce a multi-view to multi-view diffusion model
+called Sharp-It, which takes a 3D consistent set of multi-view images rendered
+from a low-quality object and enriches its geometric details and texture. The
+diffusion model operates on the multi-view set in parallel, in the sense that
+it shares features across the generated views. A high-quality 3D model can then
+be reconstructed from the enriched multi-view set. By leveraging the advantages
+of both 2D and 3D approaches, our method offers an efficient and controllable
+method for high-quality 3D content creation. We demonstrate that Sharp-It
+enables various 3D applications, such as fast synthesis, editing, and
+controlled generation, while attaining high-quality assets.
+
+
+
+ comment: Project page at https://yiftachede.github.io/Sharp-It/
+
+
+
+
+
+
+ ☆ The effect of priors on Learning with Restricted Boltzmann Machines
+
+
+ Restricted Boltzmann Machines (RBMs) are generative models designed to learn
+from data with a rich underlying structure. In this work, we explore a
+teacher-student setting where a student RBM learns from examples generated by a
+teacher RBM, with a focus on the effect of the unit priors on learning
+efficiency. We consider a parametric class of priors that interpolate between
+continuous (Gaussian) and binary variables. This approach models various
+possible choices of visible units, hidden units, and weights for both the
+teacher and student RBMs.
+ By analyzing the phase diagram of the posterior distribution in both the
+Bayes optimal and mismatched regimes, we demonstrate the existence of a triple
+point that defines the critical dataset size necessary for learning through
+generalization. The critical size is strongly influenced by the properties of
+the teacher, and thus the data, but is unaffected by the properties of the
+student RBM. Nevertheless, a prudent choice of student priors can facilitate
+training by expanding the so-called signal retrieval region, where the machine
+generalizes effectively.
+
+
+
+
+
+
+
+ ☆ Medical Multimodal Foundation Models in Clinical Diagnosis and
+ Treatment: Applications, Challenges, and Future Directions
+
+
+
+
+
+
+
+
+ Kai Sun, Siyan Xue, Fuchun Sun, Haoran Sun, Yu Luo, Ling Wang, Siyuan Wang, Na Guo, Lei Liu, Tian Zhao, Xinzhou Wang, Lei Yang, Shuo Jin, Jun Yan, Jiahong Dong
+
+
+ Recent advancements in deep learning have significantly revolutionized the
+field of clinical diagnosis and treatment, offering novel approaches to improve
+diagnostic precision and treatment efficacy across diverse clinical domains,
+thus driving the pursuit of precision medicine. The growing availability of
+multi-organ and multimodal datasets has accelerated the development of
+large-scale Medical Multimodal Foundation Models (MMFMs). These models, known
+for their strong generalization capabilities and rich representational power,
+are increasingly being adapted to address a wide range of clinical tasks, from
+early diagnosis to personalized treatment strategies. This review offers a
+comprehensive analysis of recent developments in MMFMs, focusing on three key
+aspects: datasets, model architectures, and clinical applications. We also
+explore the challenges and opportunities in optimizing multimodal
+representations and discuss how these advancements are shaping the future of
+healthcare by enabling improved patient outcomes and more efficient clinical
+workflows.
+
+
+
+
+
+
+
+ ☆ Improving Dynamic Object Interactions in Text-to-Video Generation with
+ AI Feedback
+
+
+
+
+
+
+
+
+ Hiroki Furuta, Heiga Zen, Dale Schuurmans, Aleksandra Faust, Yutaka Matsuo, Percy Liang, Sherry Yang
+
+
+ Large text-to-video models hold immense potential for a wide range of
+downstream applications. However, these models struggle to accurately depict
+dynamic object interactions, often resulting in unrealistic movements and
+frequent violations of real-world physics. One solution inspired by large
+language models is to align generated outputs with desired outcomes using
+external feedback. This enables the model to refine its responses autonomously,
+eliminating extensive manual data collection. In this work, we investigate the
+use of feedback to enhance the object dynamics in text-to-video models. We aim
+to answer a critical question: what types of feedback, paired with which
+specific self-improvement algorithms, can most effectively improve text-video
+alignment and realistic object interactions? We begin by deriving a unified
+probabilistic objective for offline RL finetuning of text-to-video models. This
+perspective highlights how design elements in existing algorithms like KL
+regularization and policy projection emerge as specific choices within a
+unified framework. We then use derived methods to optimize a set of text-video
+alignment metrics (e.g., CLIP scores, optical flow), but notice that they often
+fail to align with human perceptions of generation quality. To address this
+limitation, we propose leveraging vision-language models to provide more
+nuanced feedback specifically tailored to object dynamics in videos. Our
+experiments demonstrate that our method can effectively optimize a wide variety
+of rewards, with binary AI feedback driving the most significant improvements
+in video quality for dynamic interactions, as confirmed by both AI and human
+evaluations. Notably, we observe substantial gains when using reward signals
+derived from AI feedback, particularly in scenarios involving complex
+interactions between multiple objects and realistic depictions of objects
+falling.
+
+
+ Data is an increasingly vital component of decision-making processes across
+industries. However, data access raises privacy concerns, motivating the need
+for privacy-preserving techniques such as differential privacy. Data markets
+provide a means to enable wider access as well as determine the appropriate
+privacy-utility trade-off. Existing data market frameworks either require a
+trusted third party to perform computationally expensive valuations or are
+unable to capture the combinatorial nature of data value and do not
+endogenously model the effect of differential privacy. This paper addresses
+these shortcomings by proposing a valuation mechanism based on the Wasserstein
+distance for differentially-private data, and corresponding procurement
+mechanisms by leveraging incentive mechanism design theory, for task-agnostic
+data procurement, and task-specific procurement co-optimisation. The mechanisms
+are reformulated into tractable mixed-integer second-order cone programs, which
+are validated with numerical studies.
+
+
+
+ comment: 35 pages, 15 figures
+
+
+
+
+
+
+ ☆ Interpretable Company Similarity with Sparse Autoencoders
+
+
+
+
+
+
+
+
+ Marco Molinari, Vladimir Tregubiak, Victor Shao, Abhimanyu Pandey, Mateusz Mikolajczak, Sebastião Kuznetsov Ryder Torres Pereira
+
+
+ Determining company similarity is a vital task in finance, underpinning
+hedging, risk management, portfolio diversification, and more. Practitioners
+often rely on sector and industry classifications to gauge similarity, such as
+SIC-codes and GICS-codes, the former being used by the U.S. Securities and
+Exchange Commission (SEC), and the latter widely used by the investment
+community. Clustering embeddings of company descriptions has been proposed as a
+potential technique for determining company similarity, but the lack of
+interpretability in token embeddings poses a significant barrier to adoption in
+high-stakes contexts. Sparse Autoencoders have shown promise in enhancing the
+interpretability of Large Language Models by decomposing LLM activations into
+interpretable features. In this paper, we explore the use of SAE features in
+measuring company similarity and benchmark them against (1) SIC codes and (2)
+Major Group codes. We conclude that SAE features can reproduce and even surpass
+sector classifications in quantifying fundamental characteristics of companies,
+evaluated by the correlation of monthly returns, a proxy for similarity, and
+PnL from cointegration.
+
+
+
+
+
+
+
+ ☆ CEGI: Measuring the trade-off between efficiency and carbon emissions
+ for SLMs and VLMs
+
+
+ This paper analyzes the performance of Small Language Models (SLMs) and
+Vision Language Models (VLMs) and evaluates the trade-off between model
+performance and carbon emissions across 4 essential tasks: Image Captioning,
+Visual Question Answering (VQA), Dialogue Summarization and Text-to-SQL
+conversion. Various SLMs and VLMs belonging to the Qwen and LLaMA architecture
+family are chosen and variants based on model size in terms of the number of
+parameters, quantization level and fine-tuning parameters are evaluated. Each
+model variant's performance and carbon emissions are calculated. To quantify
+the trade-off between model performance and carbon emissions, we introduce a
+novel metric called CEGI (Carbon Efficient Gain Index). This metric represents
+the carbon emission per unit percentage gain per million trainable parameters.
+It provides a normalized measure for comparing models' efficiency in
+terms of performance improvement relative to their environmental cost. The
+experiment's outcome demonstrates that fine-tuning SLMs and VLMs can achieve
+performance levels comparable to Large Language Models (LLMs) while producing
+significantly less carbon emissions. Our findings suggest that the marginal
+gains in accuracy from larger models do not justify the substantial increase in
+carbon emissions. Leveraging lower-bit quantization levels, the proposed metric
+further enhances energy efficiency without compromising performance. This study
+highlights the need to balance high performance and environmental sustainability. It
+offers a valuable metric for selecting models suitable for
+environmentally-friendly AI development.
+
+
+
+
+
+
+
+
+ Jacob Marks, Brent A. Griffin, Jason J. Corso
+
+
+ We introduce a new framework for analyzing classification datasets based on
+the ratios of reconstruction errors between autoencoders trained on individual
+classes. This analysis framework enables efficient characterization of datasets
+on the sample, class, and entire dataset levels. We define reconstruction error
+ratios (RERs) that probe classification difficulty and allow its decomposition
+into (1) finite sample size and (2) Bayes error and decision-boundary
+complexity. Through systematic study across 19 popular visual datasets, we find
+that our RER-based dataset difficulty probe strongly correlates with error rate
+for state-of-the-art (SOTA) classification models. By interpreting sample-level
+classification difficulty as a label mistakenness score, we further find that
+RERs achieve SOTA performance on mislabel detection tasks on hard datasets
+under symmetric and asymmetric label noise. Our code is publicly available at
+https://github.com/voxel51/reconstruction-error-ratios.
+
+
+
+ comment: 30 pages, 18 figures
+
+
+
+
+
+
+ ☆ Private Linear Regression with Differential Privacy and PAC Privacy
+
+
+ Linear regression is a fundamental tool for statistical analysis, which has
+motivated the development of linear regression methods that satisfy provable
+privacy guarantees so that the learned model reveals little about any one data
+point used to construct it. Most existing privacy-preserving linear regression
+methods rely on the well-established framework of differential privacy, while
+the newly proposed PAC Privacy has not yet been explored in this context. In
+this paper, we systematically compare linear regression models trained with
+differential privacy and PAC privacy across three real-world datasets,
+observing several key findings that impact the performance of
+privacy-preserving linear regression.
+
+
+
+ comment: 8 pages, 6 figures
+
+
+
+
+
+
+ ☆ TAB-Fields: A Maximum Entropy Framework for Mission-Aware Adversarial
+ Planning
+
+
+ Autonomous agents operating in adversarial scenarios face a fundamental
+challenge: while they may know their adversaries' high-level objectives, such
+as reaching specific destinations within time constraints, the exact policies
+these adversaries will employ remain unknown. Traditional approaches address
+this challenge by treating the adversary's state as a partially observable
+element, leading to a formulation as a Partially Observable Markov Decision
+Process (POMDP). However, the induced belief-space dynamics in a POMDP require
+knowledge of the system's transition dynamics, which, in this case, depend on
+the adversary's unknown policy. Our key observation is that while an
+adversary's exact policy is unknown, their behavior is necessarily constrained
+by their mission objectives and the physical environment, allowing us to
+characterize the space of possible behaviors without assuming specific
+policies. In this paper, we develop Task-Aware Behavior Fields (TAB-Fields), a
+representation that captures adversary state distributions over time by
+computing the most unbiased probability distribution consistent with known
+constraints. We construct TAB-Fields by solving a constrained optimization
+problem that minimizes additional assumptions about adversary behavior beyond
+mission and environmental requirements. We integrate TAB-Fields with standard
+planning algorithms by introducing TAB-conditioned POMCP, an adaptation of
+Partially Observable Monte Carlo Planning. Through experiments in simulation
+with underwater robots and hardware implementations with ground robots, we
+demonstrate that our approach achieves superior performance compared to
+baselines that either assume specific adversary policies or neglect mission
+constraints altogether. Evaluation videos and code are available at
+https://tab-fields.github.io.
+
+
+
+
+
+
+
+
+ Alexander Denker, Johannes Hertrich, Zeljko Kereta, Silvia Cipiccia, Ecem Erin, Simon Arridge
+
+
+ Ptychography is a coherent diffraction imaging method that uses phase
+retrieval techniques to reconstruct complex-valued images. It achieves this by
+sequentially illuminating overlapping regions of a sample with a coherent beam
+and recording the diffraction pattern. Although this addresses traditional
+imaging system challenges, it is computationally intensive and highly sensitive
+to noise, especially with reduced illumination overlap. Data-driven
+regularisation techniques have been applied in phase retrieval to improve
+reconstruction quality. In particular, plug-and-play (PnP) offers flexibility
+by integrating data-driven denoisers as implicit priors. In this work, we
+propose a half-quadratic splitting framework for using PnP and other
+data-driven priors for ptychography. We evaluate our method both on natural
+images and real test objects to validate its effectiveness for ptychographic
+image reconstruction.
+
+
+
+
+
+
+
+
+ Andrei Lixandru, Marcel van Gerven, Sergio Pequito
+
+
+ Distributed optimization is fundamental to modern machine learning
+applications like federated learning, but existing methods often struggle with
+ill-conditioned problems and face stability-versus-speed tradeoffs. We
+introduce fractional order distributed optimization (FrODO), a
+theoretically-grounded framework that incorporates fractional-order memory
+terms to enhance convergence properties in challenging optimization landscapes.
+Our approach achieves provable linear convergence for any strongly connected
+network. Through empirical validation, our results suggest that FrODO achieves
+up to 4 times faster convergence versus baselines on ill-conditioned problems
+and 2-3 times speedup in federated neural network training, while maintaining
+stability and theoretical guarantees.
+
+
+
+
+
+
+
+
+ Quang H. Nguyen, Hoang Phan, Khoa D. Doan
+
+
+ Diffusion models have shown remarkable abilities in generating realistic and
+high-quality images from text prompts. However, a trained model remains
+black-box; little do we know about the role of its components in exhibiting a
+concept such as objects or styles. Recent works employ causal tracing to
+localize layers storing knowledge in generative models without showing how
+those layers contribute to the target concept. In this work, we approach the
+model interpretability problem from a more general perspective and pose a
+question: "How do model components work jointly to demonstrate
+knowledge?" We adapt component attribution to decompose diffusion models,
+unveiling how a component contributes to a concept. Our framework allows
+effective model editing; in particular, we can erase a concept from diffusion
+models by removing positive components while retaining knowledge of other
+concepts. Surprisingly, we also show there exist components that contribute
+negatively to a concept, which has not been discovered in the knowledge
+localization approach. Experimental results confirm the role of positive and
+negative components pinpointed by our framework, depicting a complete view of
+interpreting generative models. Our code is available at
+https://github.com/mail-research/CAD-attribution4diffusion.
+
+
+
+
+
+
+
+ ☆ On the Privacy, Security, and Trustworthy for Distributed Wireless Large
+ AI Model (WLAM)
+
+
+ Combining wireless communication with large artificial intelligence (AI)
+models can open up a myriad of novel application scenarios. In sixth-generation
+(6G) networks, ubiquitous communication and computing resources allow large AI
+models to deliver democratized large AI model services that enable real-time
+applications like autonomous vehicles, smart cities, and Internet of Things
+(IoT) ecosystems. However, security considerations and sustainable
+communication resources limit the deployment of large AI models over
+distributed wireless networks. This paper provides a comprehensive overview of
+privacy, security, and trustworthiness for the distributed wireless large AI model
+(WLAM). In particular, a detailed privacy and security analysis of
+distributed WLAM is first presented. The classifications and theoretical findings
+about privacy and security in distributed WLAM are then discussed. Next,
+trustworthiness and ethics in implementing distributed WLAM are described.
+Finally, applications of distributed WLAM are presented from the
+perspective of electromagnetic signal processing.
+
+
+
+ comment: 12 pages, 4 figures
+
+
+
+
+
+
+ ☆ Defending Against Diverse Attacks in Federated Learning Through
+ Consensus-Based Bi-Level Optimization
+
+
+
+
+
+
+
+
+ Nicolás García Trillos, Aditya Kumar Akash, Sixu Li, Konstantin Riedl, Yuhua Zhu
+
+
+ Adversarial attacks pose significant challenges in many machine learning
+applications, particularly in the setting of distributed training and federated
+learning, where malicious agents seek to corrupt the training process with the
+goal of jeopardizing and compromising the performance and reliability of the
+final models. In this paper, we address the problem of robust federated
+learning in the presence of such attacks by formulating the training task as a
+bi-level optimization problem. We conduct a theoretical analysis of the
+resilience of consensus-based bi-level optimization (CB$^2$O), an interacting
+multi-particle metaheuristic optimization method, in adversarial settings.
+Specifically, we provide a global convergence analysis of CB$^2$O in mean-field
+law in the presence of malicious agents, demonstrating the robustness of
+CB$^2$O against a diverse range of attacks. Thereby, we offer insights into how
+specific hyperparameter choices enable the mitigation of adversarial effects. On the
+practical side, we extend CB$^2$O to the clustered federated learning setting
+by proposing FedCB$^2$O, a novel interacting multi-particle system, and design
+a practical algorithm that addresses the demands of real-world applications.
+Extensive experiments demonstrate the robustness of the FedCB$^2$O algorithm
+against label-flipping attacks in decentralized clustered federated learning
+scenarios, showcasing its effectiveness in practical contexts.
+
+
+
+
+
+
+
+ ☆ Active learning of neural population dynamics using two-photon
+ holographic optogenetics NeurIPS 2024
+
+
+
+
+
+
+
+
+ Andrew Wagenmaker, Lu Mi, Marton Rozsa, Matthew S. Bull, Karel Svoboda, Kayvon Daie, Matthew D. Golub, Kevin Jamieson
+
+
+ Recent advances in techniques for monitoring and perturbing neural
+populations have greatly enhanced our ability to study circuits in the brain.
+In particular, two-photon holographic optogenetics now enables precise
+photostimulation of experimenter-specified groups of individual neurons, while
+simultaneous two-photon calcium imaging enables the measurement of ongoing and
+induced activity across the neural population. Despite the enormous space of
+potential photostimulation patterns and the time-consuming nature of
+photostimulation experiments, very little algorithmic work has been done to
+determine the most effective photostimulation patterns for identifying the
+neural population dynamics. Here, we develop methods to efficiently select
+which neurons to stimulate such that the resulting neural responses will best
+inform a dynamical model of the neural population activity. Using neural
+population responses to photostimulation in mouse motor cortex, we demonstrate
+the efficacy of a low-rank linear dynamical systems model, and develop an
+active learning procedure which takes advantage of low-rank structure to
+determine informative photostimulation patterns. We demonstrate our approach on
+both real and synthetic data, obtaining in some cases as much as a two-fold
+reduction in the amount of data required to reach a given predictive power. Our
+active stimulation design method is based on a novel active learning procedure
+for low-rank regression, which may be of independent interest.
+
+
+
+ comment: NeurIPS 2024
+
+
+
+
+
+
+ ☆ LLMForecaster: Improving Seasonal Event Forecasts with Unstructured
+ Textual Data NeurIPS
+
+
+
+
+
+
+
+
+ Hanyu Zhang, Chuck Arvin, Dmitry Efimov, Michael W. Mahoney, Dominique Perrault-Joncas, Shankar Ramasubramanian, Andrew Gordon Wilson, Malcolm Wolff
+
+
+ Modern time-series forecasting models often fail to make full use of rich
+unstructured information about the time series themselves. This lack of proper
+conditioning can lead to obvious model failures; for example, models may be
+unaware of the details of a particular product, and hence fail to anticipate
+seasonal surges in customer demand in the lead up to major exogenous events
+like holidays for clearly relevant products. To address this shortcoming, this
+paper introduces a novel forecast post-processor -- which we call LLMForecaster
+-- that fine-tunes large language models (LLMs) to incorporate unstructured
+semantic and contextual information and historical data to improve the
+forecasts from an existing demand forecasting pipeline. In an industry-scale
+retail application, we demonstrate that our technique yields statistically
+significant forecast improvements across several sets of products subject to
+holiday-driven demand surges.
+
+
+
+ comment: Presented at NeurIPS Time Series in the Age of Large Models (2024)
+
+
+
+
+
+
+ ☆ Cooperative Cruising: Reinforcement Learning based Time-Headway Control
+ for Increased Traffic Efficiency
+
+
+
+
+
+
+
+
+ Yaron Veksler, Sharon Hornstein, Han Wang, Maria Laura Delle Monache, Daniel Urieli
+
+
+ The proliferation of Connected Automated Vehicles represents an unprecedented
+opportunity for improving driving efficiency and alleviating traffic
+congestion. However, existing research fails to address realistic multi-lane
+highway scenarios without assuming connectivity, perception, and control
+capabilities that are typically unavailable in current vehicles. This paper
+proposes a novel AI system that is the first to improve highway traffic
+efficiency compared with human-like traffic in realistic, simulated multi-lane
+scenarios, while relying on existing connectivity, perception, and control
+capabilities. At the core of our approach is a reinforcement learning based
+controller that dynamically communicates time-headways to automated vehicles
+near bottlenecks based on real-time traffic conditions. These desired
+time-headways are then used by Adaptive Cruise Control (ACC) systems to adjust
+their following distance. By (i) integrating existing traffic estimation
+technology and low-bandwidth vehicle-to-infrastructure connectivity, (ii)
+leveraging safety-certified ACC systems, and (iii) targeting localized
+bottleneck challenges that can be addressed independently in different
+locations, we propose a practical, safe, and scalable system that can
+positively impact numerous road users.
+
+
+
+
+
+
+
+
+ Hao Chen, Han Tao, Guo Song, Jie Zhang, Yunlong Yu, Yonghan Dong, Chuang Yang, Lei Bai
+
+
+ Atmospheric science is intricately connected with other fields, e.g.,
+geography and aerospace. Most existing approaches involve training a joint
+atmospheric and geographic model from scratch, which incurs significant
+computational costs and overlooks the potential for incremental learning of
+weather variables across different domains. In this paper, we introduce
+incremental learning to weather forecasting and propose a novel structure that
+allows for the flexible expansion of variables within the model. Specifically,
+our method presents a Channel-Adapted MoE (CA-MoE) that employs a
+divide-and-conquer strategy. This strategy assigns variable training tasks to
+different experts by index embedding and reduces computational complexity
+through a channel-wise Top-K strategy. Experiments conducted on the widely
+utilized ERA5 dataset reveal that our method, utilizing only approximately 15\%
+of trainable parameters during the incremental stage, attains performance that
+is on par with state-of-the-art competitors. Notably, in the context of
+variable incremental experiments, our method demonstrates negligible issues
+with catastrophic forgetting.
+
+
+
+
+
+
+
+ ☆ The Cost of Consistency: Submodular Maximization with Constant Recourse
+
+
+
+
+
+
+
+
+ Paul Dütting, Federico Fusco, Silvio Lattanzi, Ashkan Norouzi-Fard, Ola Svensson, Morteza Zadimoghaddam
+
+
+ In this work, we study online submodular maximization, and how the
+requirement of maintaining a stable solution impacts the approximation. In
+particular, we seek bounds on the best-possible approximation ratio that is
+attainable when the algorithm is allowed to make at most a constant number of
+updates per step. We show a tight information-theoretic bound of $\tfrac{2}{3}$
+for general monotone submodular functions, and an improved (also tight) bound
+of $\tfrac{3}{4}$ for coverage functions. Since both these bounds are attained
+by non poly-time algorithms, we also give a poly-time randomized algorithm that
+achieves a $0.51$-approximation. Combined with an information-theoretic
+hardness of $\tfrac{1}{2}$ for deterministic algorithms from prior work, our
+work thus shows a separation between deterministic and randomized algorithms,
+both information theoretically and for poly-time algorithms.
+
+
+
+
+
+
+
+ ☆ Vector Optimization with Gaussian Process Bandits
+
+
+
+
+
+
+
+
+ İlter Onat Korkmaz, Yaşar Cahit Yıldırım, Çağın Ararat, Cem Tekin
+
+
+ Learning problems in which multiple conflicting objectives must be considered
+simultaneously often arise in various fields, including engineering, drug
+design, and environmental management. Traditional methods for dealing with
+multiple black-box objective functions, such as scalarization and
+identification of the Pareto set under the componentwise order, have
+limitations in incorporating objective preferences and exploring the solution
+space accordingly. While vector optimization offers improved flexibility and
+adaptability via specifying partial orders based on ordering cones, current
+techniques designed for sequential experiments either suffer from high sample
+complexity or lack theoretical guarantees. To address these issues, we propose
+Vector Optimization with Gaussian Process (VOGP), a probably approximately
+correct adaptive elimination algorithm that performs black-box vector
+optimization using Gaussian process bandits. VOGP allows users to convey
+objective preferences through ordering cones while performing efficient
+sampling by exploiting the smoothness of the objective function, resulting in a
+more effective optimization process that requires fewer evaluations. We
+establish theoretical guarantees for VOGP and derive information gain-based and
+kernel-specific sample complexity bounds. We also conduct experiments on both
+real-world and synthetic datasets to compare VOGP with the state-of-the-art
+methods.
+
+
+
+
+
+
+
+ ☆ What should a neuron aim for? Designing local objective functions based
+ on information theory
+
+
+
+
+
+
+
+
+ Andreas C. Schneider, Valentin Neuhaus, David A. Ehrlich, Abdullah Makkeh, Alexander S. Ecker, Viola Priesemann, Michael Wibral
+
+
+ In modern deep neural networks, the learning dynamics of the individual
+neurons is often obscure, as the networks are trained via global optimization.
+Conversely, biological systems build on self-organized, local learning,
+achieving robustness and efficiency with limited global information. We here
+show how self-organization between individual artificial neurons can be
+achieved by designing abstract bio-inspired local learning goals. These goals
+are parameterized using a recent extension of information theory, Partial
+Information Decomposition (PID), which decomposes the information that a set of
+information sources holds about an outcome into unique, redundant and
+synergistic contributions. Our framework enables neurons to locally shape the
+integration of information from various input classes, i.e. feedforward,
+feedback, and lateral, by selecting which of the three inputs should contribute
+uniquely, redundantly or synergistically to the output. This selection is
+expressed as a weighted sum of PID terms, which, for a given problem, can be
+directly derived from intuitive reasoning or via numerical optimization,
+offering a window into understanding task-relevant local information
+processing. Achieving neuron-level interpretability while enabling strong
+performance using local learning, our work advances a principled
+information-theoretic foundation for local learning strategies.
+
+
+
+ comment: 24 pages, 11 figures
+
+
+
+
+
+
+ ☆ OODFace: Benchmarking Robustness of Face Recognition under Common
+ Corruptions and Appearance Variations
+
+
+ With the rise of deep learning, facial recognition technology has seen
+extensive research and rapid development. Although facial recognition is
+considered a mature technology, we find that existing open-source models and
+commercial algorithms lack robustness in certain real-world Out-of-Distribution
+(OOD) scenarios, raising concerns about the reliability of these systems. In
+this paper, we introduce OODFace, which explores the OOD challenges faced by
+facial recognition models from two perspectives: common corruptions and
+appearance variations. We systematically design 30 OOD scenarios across 9 major
+categories tailored for facial recognition. By simulating these challenges on
+public datasets, we establish three robustness benchmarks: LFW-C/V, CFP-FP-C/V,
+and YTF-C/V. We then conduct extensive experiments on 19 different facial
+recognition models and 3 commercial APIs, along with extended experiments on
+face masks, Vision-Language Models (VLMs), and defense strategies to assess
+their robustness. Based on the results, we draw several key insights,
+highlighting the vulnerability of facial recognition systems to OOD data and
+suggesting possible solutions. Additionally, we offer a unified toolkit that
+includes all corruption and variation types, easily extendable to other
+datasets. We hope that our benchmarks and findings can provide guidance for
+future improvements in facial recognition model robustness.
+
+
+ Identifying the interaction targets of bioactive compounds is a foundational
+element for deciphering their pharmacological effects. Target prediction
+algorithms equip researchers with an effective tool to rapidly scope and
+explore potential targets. Here, we introduce the COMET, a multi-technological
+modular target prediction tool that provides comprehensive predictive insights,
+including similar active compounds, three-dimensional predicted binding modes,
+and probability scores, all within an average processing time of less than 10
+minutes per task. With meticulously curated data, the COMET database
+encompasses 990,944 drug-target interaction pairs and 45,035 binding pockets,
+enabling predictions for 2,685 targets, which span confirmed and exploratory
+therapeutic targets for human diseases. In comparative testing using datasets
+from ChEMBL and BindingDB, COMET outperformed five other well-known algorithms,
+offering nearly an 80% probability of accurately identifying at least one true
+target within the top 15 predictions for a given compound. COMET also features
+a user-friendly web server, accessible freely at
+https://www.pdbbind-plus.org.cn/comet.
+
+
+
+
+
+
+
+ ☆ DP-2Stage: Adapting Language Models as Differentially Private Tabular
+ Data Generators
+
+
+
+
+
+
+
+
+ Tejumade Afonja, Hui-Po Wang, Raouf Kerkouche, Mario Fritz
+
+
+ Generating tabular data under differential privacy (DP) protection ensures
+theoretical privacy guarantees but poses challenges for training machine
+learning models, primarily due to the need to capture complex structures under
+noisy supervision signals. Recently, pre-trained Large Language Models (LLMs)
+-- even those at the scale of GPT-2 -- have demonstrated great potential in
+synthesizing tabular data. However, their applications under DP constraints
+remain largely unexplored. In this work, we address this gap by applying DP
+techniques to the generation of synthetic tabular data. Our findings show that
+LLMs face difficulties in generating coherent text when fine-tuned with DP, as
+privacy budgets are inefficiently allocated to non-private elements like table
+structures. To overcome this, we propose DP-2Stage, a two-stage fine-tuning
+framework for differentially private tabular data generation. The first stage
+involves non-private fine-tuning on a pseudo dataset, followed by DP
+fine-tuning on a private dataset. Our empirical results show that this approach
+improves performance across various settings and metrics compared to directly
+fine-tuned LLMs in DP contexts. We release our code and setup at
+https://github.com/tejuafonja/DP-2Stage.
+
+
+
+
+
+
+
+ ☆ BYE: Build Your Encoder with One Sequence of Exploration Data for
+ Long-Term Dynamic Scene Understanding
+
+
+ Dynamic scene understanding remains a persistent challenge in robotic
+applications. Early dynamic mapping methods focused on mitigating the negative
+influence of short-term dynamic objects on camera motion estimation by masking
+or tracking specific categories, which often fall short in adapting to
+long-term scene changes. Recent efforts address object association in long-term
+dynamic environments using neural networks trained on synthetic datasets, but
+they still rely on predefined object shapes and categories. Other methods
+incorporate visual, geometric, or semantic heuristics for the association but
+often lack robustness. In this work, we introduce BYE, a class-agnostic,
+per-scene point cloud encoder that removes the need for predefined categories,
+shape priors, or extensive association datasets. Trained on only a single
+sequence of exploration data, BYE can efficiently perform object association in
+dynamically changing scenes. We further propose an ensembling scheme combining
+the semantic strengths of Vision Language Models (VLMs) with the scene-specific
+expertise of BYE, achieving a 7% improvement and a 95% success rate in object
+association tasks. Code and dataset are available at
+https://byencoder.github.io.
+
+
+ Artificial Expert Intelligence (AEI) seeks to transcend the limitations of
+both Artificial General Intelligence (AGI) and narrow AI by integrating
+domain-specific expertise with critical, precise reasoning capabilities akin to
+those of top human experts. Existing AI systems often excel at predefined tasks
+but struggle with adaptability and precision in novel problem-solving. To
+overcome this, AEI introduces a framework for ``Probably Approximately Correct
+(PAC) Reasoning''. This paradigm provides robust theoretical guarantees for
+reliably decomposing complex problems, with a practical mechanism for
+controlling reasoning precision. In reference to the division of human thought
+into System 1 for intuitive thinking and System 2 for reflective
+reasoning~\citep{tversky1974judgment}, we refer to this new type of reasoning
+as System 3 for precise reasoning, inspired by the rigor of the scientific
+method. AEI thus establishes a foundation for error-bounded, inference-time
+learning.
+
+
+
+
+
+
+
+ ☆ Nature versus nurture in galaxy formation: the effect of environment on
+ star formation with causal machine learning
+
+
+
+
+
+
+
+
+ Sunil Mucesh, William G. Hartley, Ciarán M. Gilligan-Lee, Ofer Lahav
+
+
+ Understanding how galaxies form and evolve is at the heart of modern
+astronomy. With the advent of large-scale surveys and simulations, remarkable
+progress has been made in the last few decades. Despite this, the physical
+processes behind the phenomena, and particularly their importance, remain far
+from known, as correlations have primarily been established rather than the
+underlying causality. We address this challenge by applying the causal
+inference framework. Specifically, we tackle the fundamental open question of
+whether galaxy formation and evolution depend more on nature (i.e., internal
+processes) or nurture (i.e., external processes), by estimating the causal
+effect of environment on star-formation rate in the IllustrisTNG simulations.
+To do so, we develop a comprehensive causal model and employ cutting-edge
+techniques from epidemiology to overcome the long-standing problem of
+disentangling nature and nurture. We find that the causal effect is negative
+and substantial, with environment suppressing the SFR by a maximal factor of
+$\sim100$. While the overall effect at $z=0$ is negative, in the early
+universe, environment is discovered to have a positive impact, boosting star
+formation by a factor of $\sim10$ at $z\sim1$ and by even greater amounts at
+higher redshifts. Furthermore, we show that: (i) nature also plays an important
+role, as ignoring it underestimates the causal effect in intermediate-density
+environments by a factor of $\sim2$, (ii) controlling for the stellar mass at a
+snapshot in time, as is common in the literature, is not only insufficient to
+disentangle nature and nurture but actually has an adverse effect, though (iii)
+stellar mass is an adequate proxy of the effects of nature. Finally, this work
+may prove a useful blueprint for extracting causal insights in other fields
+that deal with dynamical systems with closed feedback loops, such as the
+Earth's climate.
+
+
+
+ comment: 16 pages, 4 figures
+
+
+
+
+
+
+ ☆ Improved Localized Machine Unlearning Through the Lens of Memorization
+
+
+ Machine unlearning refers to removing the influence of a specified subset of
+training data from a machine learning model, efficiently, after it has already
+been trained. This is important for key applications, including making the
+model more accurate by removing outdated, mislabeled, or poisoned data. In this
+work, we study localized unlearning, where the unlearning algorithm operates on
+a (small) identified subset of parameters. Drawing inspiration from the
+memorization literature, we propose an improved localization strategy that
+yields strong results when paired with existing unlearning algorithms. We also
+propose a new unlearning algorithm, Deletion by Example Localization (DEL),
+that resets the parameters deemed to be most critical according to our
+localization strategy, and then finetunes them. Our extensive experiments on
+different datasets, forget sets and metrics reveal that DEL sets a new
+state-of-the-art for unlearning metrics, against both localized and
+full-parameter methods, while modifying a small subset of parameters, and
+outperforms the state-of-the-art localized unlearning in terms of test accuracy
+too.
+
+
+ A Transformer-based Koopman autoencoder is proposed for linearizing Fisher's
+reaction-diffusion equation. The primary focus of this study is on using deep
+learning techniques to find complex spatiotemporal patterns in the
+reaction-diffusion system. The emphasis is on not just solving the equation but
+also transforming the system's dynamics into a more comprehensible, linear
+form. Global coordinate transformations are achieved through the autoencoder,
+which learns to capture the underlying dynamics by training on a dataset with
+60,000 initial conditions. Extensive testing on multiple datasets was used to
+assess the efficacy of the proposed model, demonstrating its ability to
+accurately predict the system's evolution as well as to generalize. We provide
+a thorough comparison study, comparing our suggested design to a few other
+comparable methods using experiments on various PDEs, such as the
+Kuramoto-Sivashinsky equation and Burgers' equation. Results show improved
+accuracy, highlighting the capabilities of the Transformer-based Koopman
+autoencoder. The proposed architecture is significantly ahead of other
+architectures, in terms of solving different types of PDEs using a single
+architecture. Our method relies entirely on the data, without requiring any
+knowledge of the underlying equations. This makes it applicable to even the
+datasets where the governing equations are not known.
+
+
+
+
+
+
+
+ ☆ Time-Series-Informed Closed-loop Learning for Sequential Decision Making
+ and Control
+
+
+ Closed-loop performance of sequential decision making algorithms, such as
+model predictive control, depends strongly on the parameters of cost functions,
+models, and constraints. Bayesian optimization is a common approach to learning
+these parameters based on closed-loop experiments. However, traditional
+Bayesian optimization approaches treat the learning problem as a black box,
+ignoring valuable information and knowledge about the structure of the
+underlying problem, resulting in slow convergence and high experimental
+resource use. We propose a time-series-informed optimization framework that
+incorporates intermediate performance evaluations from early iterations of each
+experimental episode into the learning procedure. Additionally, probabilistic
+early stopping criteria are proposed to terminate unpromising experiments,
+significantly reducing experimental time. Simulation results show that our
+approach achieves baseline performance with approximately half the resources.
+Moreover, with the same resource budget, our approach outperforms the baseline
+in terms of final closed-loop performance, highlighting its efficiency in
+sequential decision making scenarios.
+
+
+ We present VISTA (Visualization of Internal States and Their Associations), a
+novel pipeline for visually exploring and interpreting neural network
+representations. VISTA addresses the challenge of analyzing vast
+multidimensional spaces in modern machine learning models by mapping
+representations into a semantic 2D space. The resulting collages visually
+reveal patterns and relationships within internal representations. We
+demonstrate VISTA's utility by applying it to sparse autoencoder latents,
+uncovering new properties and interpretations. We review the VISTA methodology,
+present findings from our case study ( https://got.drib.net/latents/ ), and
+discuss implications for neural network interpretability across various domains
+of machine learning.
+
+
+ The advent of smart contracts has enabled the rapid rise of Decentralized
+Finance (DeFi) on the Ethereum blockchain, offering substantial rewards in
+financial innovation and inclusivity. However, this growth has also introduced
+significant security risks, including the proliferation of illicit accounts
+involved in fraudulent activities. Traditional detection methods are limited by
+the scarcity of labeled data and the evolving tactics of malicious actors. In
+this paper, we propose a novel Self-Learning Ensemble-based Illicit account
+Detection (SLEID) framework to address these challenges. SLEID employs an
+Isolation Forest for initial outlier detection and a self-training mechanism to
+iteratively generate pseudo-labels for unlabeled accounts, thereby enhancing
+detection accuracy. Extensive experiments demonstrate that SLEID significantly
+outperforms traditional supervised approaches and recent semi-supervised
+models, achieving superior precision, recall, and F1-scores, particularly in
+detecting illicit accounts. Compared to state-of-the-art methods, our approach
+achieves better detection performance while reducing reliance on labeled data.
+The results affirm SLEID's efficacy as a robust solution for safeguarding the
+DeFi ecosystem and mitigating risks posed by malicious accounts.
+
+
+
+ comment: 12 pages, 6 figures
+
+
+
+
+
+
+ ☆ 3D Face Reconstruction From Radar Images
+
+
+
+
+
+
+
+
+ Valentin Braeutigam, Vanessa Wirth, Ingrid Ullmann, Christian Schüßler, Martin Vossiek, Matthias Berking, Bernhard Egger
+
+
+ The 3D reconstruction of faces gains wide attention in computer vision and is
+used in many fields of application, for example, animation, virtual reality,
+and even forensics. This work is motivated by monitoring patients in sleep
+laboratories. Due to their unique characteristics, sensors from the radar
+domain have advantages compared to optical sensors, namely penetration of
+electrically non-conductive materials and independence of light. These
+advantages of radar signals unlock new applications and require adaptation of
+3D reconstruction frameworks. We propose a novel model-based method for 3D
+reconstruction from radar images. We generate a dataset of synthetic radar
+images with a physics-based but non-differentiable radar renderer. This dataset
+is used to train a CNN-based encoder to estimate the parameters of a 3D
+morphable face model. Whilst the encoder alone already leads to strong
+reconstructions of synthetic data, we extend our reconstruction in an
+Analysis-by-Synthesis fashion to a model-based autoencoder. This is enabled by
+learning the rendering process in the decoder, which acts as an object-specific
+differentiable radar renderer. Subsequently, the combination of both network
+parts is trained to minimize both the loss of the parameters and the loss of
+the resulting reconstructed radar image. This leads to the additional benefit
+that at test time the parameters can be further optimized by finetuning the
+autoencoder unsupervised on the image loss. We evaluated our framework on
+generated synthetic face images as well as on real radar images with 3D ground
+truth of four individuals.
+
+
+
+
+
+
+
+ ☆ OMENN: One Matrix to Explain Neural Networks
+
+
+
+
+
+
+
+
+ Adam Wróbel, Mikołaj Janusz, Bartosz Zieliński, Dawid Rymarczyk
+
+
+ Deep Learning (DL) models are often black boxes, making their decision-making
+processes difficult to interpret. This lack of transparency has driven
+advancements in eXplainable Artificial Intelligence (XAI), a field dedicated to
+clarifying the reasoning behind DL model predictions. Among these,
+attribution-based methods such as LRP and GradCAM are widely used, though they
+rely on approximations that can be imprecise.
+ To address these limitations, we introduce One Matrix to Explain Neural
+Networks (OMENN), a novel post-hoc method that represents a neural network as a
+single, interpretable matrix for each specific input. This matrix is
+constructed through a series of linear transformations that represent the
+processing of the input by each successive layer in the neural network. As a
+result, OMENN provides locally precise, attribution-based explanations of the
+input across various modern models, including ViTs and CNNs. We present a
+theoretical analysis of OMENN based on the dynamic linearity property and validate
+its effectiveness with extensive tests on two XAI benchmarks, demonstrating
+that OMENN is competitive with state-of-the-art methods.
+
+
+
+ comment: Under review, code will be released after acceptance
+
+ We propose a novel model for learned query optimization which provides query
+hints leading to better execution plans. The model addresses the three key
+challenges in learned hint-based query optimization: reliable hint
+recommendation (ensuring non-degradation of query latency), efficient hint
+exploration, and fast inference. We provide an in-depth analysis of existing
+NN-based approaches to hint-based optimization and experimentally confirm the
+named challenges for them. Our alternative solution consists of a new inference
+schema based on an ensemble of context-aware models and a graph storage for
+reliable hint suggestion and fast inference, and a budget-controlled training
+procedure with a local search algorithm that solves the issue of exponential
+search space exploration. In experiments on standard benchmarks, our model
+demonstrates optimization capability close to the best achievable with
+coarse-grained hints. Controlling the degree of parallelism (query dop) in
+addition to operator-related hints enables our model to achieve 3x latency
+improvement on the JOB benchmark, which sets a new standard for optimization. Our
+model is interpretable and easy to debug, which is particularly important for
+deployment in production.
+
+
+ Low-Rank Adaptation (LoRA) and other parameter-efficient fine-tuning (PEFT)
+methods provide low-memory, storage-efficient solutions for personalizing
+text-to-image models. However, these methods offer little to no improvement in
+wall-clock training time or the number of steps needed for convergence compared
+to full model fine-tuning. While PEFT methods assume that shifts in generated
+distributions (from base to fine-tuned models) can be effectively modeled
+through weight changes in a low-rank subspace, they fail to leverage knowledge
+of common use cases, which typically focus on capturing specific styles or
+identities. Observing that desired outputs often comprise only a small subset
+of the possible domain covered by LoRA training, we propose reducing the search
+space by incorporating a prior over regions of interest. We demonstrate that
+training a hypernetwork model to generate LoRA weights can achieve competitive
+quality for specific domains while enabling near-instantaneous conditioning on
+user input, in contrast to traditional training methods that require thousands
+of steps.
+
+
+
+ comment: 9 pages, 6 figures
+
+
+
+
+
+
+ ☆ Federated Analytics in Practice: Engineering for Privacy, Scalability
+ and Practicality
+
+
+
+
+
+
+
+
+ Harish Srinivas, Graham Cormode, Mehrdad Honarkhah, Samuel Lurye, Jonathan Hehir, Lunwen He, George Hong, Ahmed Magdy, Dzmitry Huba, Kaikai Wang, Shen Guo, Shoubhik Bhattacharya
+
+
+ Cross-device Federated Analytics (FA) is a distributed computation paradigm
+designed to answer analytics queries about and derive insights from data held
+locally on users' devices. On-device computations combined with other privacy
+and security measures ensure that only minimal data is transmitted off-device,
+achieving a high standard of data protection. Despite FA's broad relevance, the
+applicability of existing FA systems is limited by compromised accuracy; lack
+of flexibility for data analytics; and an inability to scale effectively. In
+this paper, we describe our approach to combine privacy, scalability, and
+practicality to build and deploy a system that overcomes these limitations. Our
+FA system leverages trusted execution environments (TEEs) and optimizes the use
+of on-device computing resources to facilitate federated data processing across
+large fleets of devices, while ensuring robust, defensible, and verifiable
+privacy safeguards. We focus on federated analytics (statistics and
+monitoring), in contrast to systems for federated learning (ML workloads), and
+we flag the key differences.
+
+
+
+
+
+
+
+ ☆ An Adaptive Grasping Force Tracking Strategy for Nonlinear and
+ Time-Varying Object Behaviors
+
+
+ Accurate grasp force control is one of the key skills for ensuring successful
+and damage-free robotic grasping of objects. Although existing methods have
+conducted in-depth research on slip detection and grasping force planning, they
+often overlook the issue of adaptive tracking of the actual force to the target
+force when handling objects with different material properties. The optimal
+parameters of a force tracking controller are significantly influenced by the
+object's stiffness, and many adaptive force tracking algorithms rely on
+stiffness estimation. However, real-world objects often exhibit viscous,
+plastic, or other more complex nonlinear time-varying behaviors, and existing
+studies provide insufficient support for these materials in terms of stiffness
+definition and estimation. To address this, this paper introduces the concept
+of generalized stiffness, extending the definition of stiffness to nonlinear
+time-varying grasp system models, and proposes an online generalized stiffness
+estimator based on Long Short-Term Memory (LSTM) networks. Based on generalized
+stiffness, this paper proposes an adaptive parameter adjustment strategy using
+a PI controller as an example, enabling dynamic force tracking for objects with
+varying characteristics. Experimental results demonstrate that the proposed
+method achieves high precision and short probing time, while showing better
+adaptability to non-ideal objects compared to existing methods. The method
+effectively solves the problem of grasp force tracking in unknown, nonlinear,
+and time-varying grasp systems, enhancing the robotic grasping ability in
+unstructured environments.
+
+
+ In self-supervised robot learning, robots actively explore their environments
+and generate data by acting on entities in the environment. Therefore, an
+exploration policy is desired that ensures sample efficiency to minimize robot
+execution costs while still providing accurate learning. For this purpose, the
+robotic community has adopted Intrinsic Motivation (IM)-based approaches such
+as Learning Progress (LP). On the machine learning front, Active Learning (AL)
+has been used successfully, especially for classification tasks. In this work,
+we develop a novel AL framework geared towards robotics regression tasks, such
+as action-effect prediction and, more generally, for world model learning,
+which we call MUSEL - Model Uncertainty for Sample Efficient Learning. MUSEL
+aims to extract model uncertainty from the total uncertainty estimate given by
+a suitable learning engine by making use of learning progress and input
+diversity and use it to improve sample efficiency beyond the state-of-the-art
+action-effect prediction methods. We demonstrate the feasibility of our model
+by using a Stochastic Variational Gaussian Process (SVGP) as the learning
+engine and testing the system on a set of robotic experiments in simulation.
+The efficacy of MUSEL is demonstrated by comparing its performance to standard
+methods used in robot action-effect learning. In a robotic tabletop environment
+in which a robot manipulator is tasked with learning the effect of its actions,
+the experiments show that MUSEL facilitates higher accuracy in learning action
+effects while ensuring sample efficiency.
+
+
+
+ comment: 18 pages, 18 figures
+
+
+
+
+
+
+ ☆ Efficient Model Compression Techniques with FishLeg NeurIPS 2024
+
+
+ In many domains, the most successful AI models tend to be the largest, indeed
+often too large to be handled by AI players with limited computational
+resources. To mitigate this, a number of compression methods have been
+developed, including methods that prune the network down to high sparsity
+whilst retaining performance. The best-performing pruning techniques are often
+those that use second-order curvature information (such as an estimate of the
+Fisher information matrix) to score the importance of each weight and to
+predict the optimal compensation for weight deletion. However, these methods
+are difficult to scale to high-dimensional parameter spaces without making
+heavy approximations. Here, we propose the FishLeg surgeon (FLS), a new
+second-order pruning method based on the Fisher-Legendre (FishLeg) optimizer.
+At the heart of FishLeg is a meta-learning approach to amortising the action of
+the inverse FIM, which brings a number of advantages. Firstly, the
+parameterisation enables the use of flexible tensor factorisation techniques to
+improve computational and memory efficiency without sacrificing much accuracy,
+alleviating challenges associated with scalability of most second-order pruning
+methods. Secondly, directly estimating the inverse FIM leads to less
+sensitivity to the amplification of stochasticity during inversion, thereby
+resulting in more precise estimates. Thirdly, our approach also allows for
+progressive assimilation of the curvature into the parameterisation. In the
+gradual pruning regime, this results in a more efficient estimate refinement as
+opposed to re-estimation. We find that FishLeg achieves higher or comparable
+performance against two common baselines in the area, most notably in the high
+sparsity regime when considering a ResNet18 model on CIFAR-10 (84% accuracy at
+95% sparsity vs 60% for OBS) and TinyIM (53% accuracy at 80% sparsity vs 48%
+for OBS).
+
+
+
+ comment: Published in NeurIPS 2024 - Neural Compression Workshop, 13 pages, 6
+ figures
+
+
+
+
+
+
+ ☆ Switchable deep beamformer for high-quality and real-time passive
+ acoustic mapping
+
+
+
+
+
+
+
+
+ Yi Zeng, Jinwei Li, Hui Zhu, Shukuan Lu, Jianfeng Li, Xiran Cai
+
+
+ Passive acoustic mapping (PAM) is a promising tool for monitoring acoustic
+cavitation activities in the applications of ultrasound therapy. Data-adaptive
+beamformers for PAM have better image quality compared to the time exposure
+acoustics (TEA) algorithms. However, the computational cost of data-adaptive
+beamformers is considerably high. In this work, we develop a deep
+beamformer based on a generative adversarial network, which can switch between
+different transducer arrays and reconstruct high-quality PAM images directly
+from radio frequency ultrasound signals with low computational cost. The deep
+beamformer was trained on the dataset consisting of simulated and experimental
+cavitation signals of single and multiple microbubble clouds measured by
+different (linear and phased) arrays covering 1-15 MHz. We compared the
+performance of the deep beamformer to TEA and three different data-adaptive
+beamformers using the simulated and experimental test dataset. Compared with
+TEA, the deep beamformer reduced the energy spread area by 18.9%-65.0% and
+improved the image signal-to-noise ratio by 9.3-22.9 dB on average for the
+different arrays in our data. Compared to the data-adaptive beamformers, the
+deep beamformer reduced the computational cost by three orders of magnitude
+achieving 10.5 ms image reconstruction speed in our data, while the image
+quality was as good as that of the data-adaptive beamformers. These results
+demonstrated the potential of the deep beamformer for high-resolution
+monitoring of microbubble cavitation activities for ultrasound therapy.
+
+
+
+
+
+
+
+ ☆ Optimizing Plastic Waste Collection in Water Bodies Using Heterogeneous
+ Autonomous Surface Vehicles with Deep Reinforcement Learning
+
+
+
+
+
+
+
+
+ Alejandro Mendoza Barrionuevo, Samuel Yanes Luis, Daniel Gutiérrez Reina, Sergio L. Toral Marín
+
+
+ This paper presents a model-free deep reinforcement learning framework for
+informative path planning with heterogeneous fleets of autonomous surface
+vehicles to locate and collect plastic waste. The system employs two teams of
+vehicles: scouts and cleaners. Coordination between these teams is achieved
+through a deep reinforcement approach, allowing agents to learn strategies to
+maximize cleaning efficiency. The primary objective is for the scout team to
+provide an up-to-date contamination model, while the cleaner team collects as
+much waste as possible following this model. This strategy leads to
+heterogeneous teams that optimize fleet efficiency through inter-team
+cooperation supported by a tailored reward function. Different trainings of the
+proposed algorithm are compared with other state-of-the-art heuristics in two
+distinct scenarios, one with high convexity and another with narrow corridors
+and challenging access. According to the obtained results, it is demonstrated
+that deep reinforcement learning based algorithms outperform other benchmark
+heuristics, exhibiting superior adaptability. In addition, training with greedy
+actions further enhances performance, particularly in scenarios with intricate
+layouts.
+
+
+
+ comment: This article is currently under revision for the Robotics and
+ Automation Letters (IEEE)
+
+
+
+
+
+
+ ☆ Noisy Ostracods: A Fine-Grained, Imbalanced Real-World Dataset for
+ Benchmarking Robust Machine Learning and Label Correction Methods
+
+
+ We present the Noisy Ostracods, a noisy dataset for genus and species
+classification of crustacean ostracods with specialists' annotations. Over the
+71466 specimens collected, 5.58% of them are estimated to be noisy (possibly
+problematic) at genus level. The dataset is created to address a real-world
+challenge: creating a clean fine-grained taxonomy dataset. The Noisy Ostracods
+dataset has diverse noises from multiple sources. Firstly, the noise is
+open-set, including new classes discovered during curation that were not part
+of the original annotation. The dataset has pseudo-classes, where annotators
+misclassified samples that should belong to an existing class into a new
+pseudo-class. The Noisy Ostracods dataset is highly imbalanced with an imbalance
+factor $\rho$ = 22429. This presents a unique challenge for robust machine
+learning methods, as existing approaches have not been extensively evaluated on
+fine-grained classification tasks with such diverse real-world noise. Initial
+experiments using current robust learning techniques have not yielded
+significant performance improvements on the Noisy Ostracods dataset compared to
+cross-entropy training on the raw, noisy data. On the other hand, noise
+detection methods have underperformed in error hit rate compared to naive
+cross-validation ensembling for identifying problematic labels. These findings
+suggest that the fine-grained, imbalanced nature, and complex noise
+characteristics of the dataset present considerable challenges for existing
+noise-robust algorithms. By openly releasing the Noisy Ostracods dataset, our
+goal is to encourage further research into the development of noise-resilient
+machine learning methods capable of effectively handling diverse, real-world
+noise in fine-grained classification tasks. The dataset, along with its
+evaluation protocols, can be accessed at
+https://github.com/H-Jamieu/Noisy_ostracods.
+
+
+
+ comment: Initial submit
+
+
+
+
+
+
+ ☆ Enhanced Photovoltaic Power Forecasting: An iTransformer and LSTM-Based
+ Model Integrating Temporal and Covariate Interactions
+
+
+ Accurate photovoltaic (PV) power forecasting is critical for integrating
+renewable energy sources into the grid, optimizing real-time energy management,
+and ensuring energy reliability amidst increasing demand. However, existing
+models often struggle with effectively capturing the complex relationships
+between target variables and covariates, as well as the interactions between
+temporal dynamics and multivariate data, leading to suboptimal forecasting
+accuracy. To address these challenges, we propose a novel model architecture
+that leverages the iTransformer for feature extraction from target variables
+and employs long short-term memory (LSTM) to extract features from covariates.
+A cross-attention mechanism is integrated to fuse the outputs of both models,
+followed by a Kolmogorov-Arnold network (KAN) mapping for enhanced
+representation. The effectiveness of the proposed model is validated using
+publicly available datasets from Australia, with experiments conducted across
+four seasons. Results demonstrate that the proposed model effectively captures
+seasonal variations in PV power generation and improves forecasting accuracy.
+
+
+
+
+
+
+
+ ☆ CADMR: Cross-Attention and Disentangled Learning for Multimodal
+ Recommender Systems
+
+
+ The increasing availability and diversity of multimodal data in recommender
+systems offer new avenues for enhancing recommendation accuracy and user
+satisfaction. However, these systems must contend with high-dimensional, sparse
+user-item rating matrices, where reconstructing the matrix with only small
+subsets of preferred items for each user poses a significant challenge. To
+address this, we propose CADMR, a novel autoencoder-based multimodal
+recommender system framework. CADMR leverages multi-head cross-attention
+mechanisms and Disentangled Learning to effectively integrate and utilize
+heterogeneous multimodal data in reconstructing the rating matrix. Our approach
+first disentangles modality-specific features while preserving their
+interdependence, thereby learning a joint latent representation. The multi-head
+cross-attention mechanism is then applied to enhance user-item interaction
+representations with respect to the learned multimodal item latent
+representations. We evaluate CADMR on three benchmark datasets, demonstrating
+significant performance improvements over state-of-the-art methods.
+
+
+
+
+
+
+
+ ☆ Initial Study On Improving Segmentation By Combining Preoperative CT And
+ Intraoperative CBCT Using Synthetic Data
+
+
+
+
+
+
+
+
+ Maximilian E. Tschuchnig, Philipp Steininger, Michael Gadermayr
+
+
+ Computer-Assisted Interventions enable clinicians to perform precise,
+minimally invasive procedures, often relying on advanced imaging methods.
+Cone-beam computed tomography (CBCT) can be used to facilitate
+computer-assisted interventions, despite often suffering from artifacts that
+pose challenges for accurate interpretation. While the degraded image quality
+can affect image analysis, the availability of high quality, preoperative scans
+offers potential for improvements. Here we consider a setting where
+preoperative CT and intraoperative CBCT scans are available, however, the
+alignment (registration) between the scans is imperfect to simulate a real
+world scenario. We propose a multimodal learning method that fuses roughly
+aligned CBCT and CT scans and investigate the effect on segmentation
+performance. For this experiment we use synthetically generated data containing
+real CT and synthetic CBCT volumes with corresponding voxel annotations. We
+show that this fusion setup improves segmentation performance in $18$ out of
+$20$ investigated setups.
+
+
+
+ comment: Accepted at BVM 2025. arXiv admin note: text overlap with
+ arXiv:2406.11650
+
+
+
+
+
+
+ ☆ Deep Matrix Factorization with Adaptive Weights for Multi-View
+ Clustering
+
+
+ Recently, deep matrix factorization has been established as a powerful model
+for unsupervised tasks, achieving promising results, especially for multi-view
+clustering. However, existing methods often lack effective feature selection
+mechanisms and rely on empirical hyperparameter selection. To address these
+issues, we introduce a novel Deep Matrix Factorization with Adaptive Weights
+for Multi-View Clustering (DMFAW). Our method simultaneously incorporates
+feature selection and generates local partitions, enhancing clustering results.
+Notably, the feature weights are controlled and adjusted by a parameter that
+is dynamically updated using a control-theory-inspired mechanism, which not only
+improves the model's stability and adaptability to diverse datasets but also
+accelerates convergence. A late fusion approach is then proposed to align the
+weighted local partitions with the consensus partition. Finally, the
+optimization problem is solved via an alternating optimization algorithm with
+theoretically guaranteed convergence. Extensive experiments on benchmark
+datasets highlight that DMFAW outperforms state-of-the-art methods in terms of
+clustering performance.
+
+
+
+
+
+
+
+
+ Yao Lyu, Xiangteng Zhang, Shengbo Eben Li, Jingliang Duan, Letian Tao, Qing Xu, Lei He, Keqiang Li
+
+
+ Training deep reinforcement learning (RL) agents necessitates overcoming the
+highly unstable nonconvex stochastic optimization inherent in the
+trial-and-error mechanism. To tackle this challenge, we propose a
+physics-inspired optimization algorithm called relativistic adaptive gradient
+descent (RAD), which enhances long-term training stability. By conceptualizing
+neural network (NN) training as the evolution of a conformal Hamiltonian
+system, we present a universal framework for transferring long-term stability
+from conformal symplectic integrators to iterative NN updating rules, where the
+choice of kinetic energy governs the dynamical properties of resulting
+optimization algorithms. By utilizing relativistic kinetic energy, RAD
+incorporates principles from special relativity and limits parameter updates
+below a finite speed, effectively mitigating abnormal gradient influences.
+Additionally, RAD models NN optimization as the evolution of a multi-particle
+system where each trainable parameter acts as an independent particle with an
+individual adaptive learning rate. We prove RAD's sublinear convergence under
+general nonconvex settings, where smaller gradient variance and larger batch
+sizes contribute to tighter convergence. Notably, RAD degrades to the
+well-known adaptive moment estimation (ADAM) algorithm when its speed
+coefficient is chosen as one and symplectic factor as a small positive value.
+Experimental results show RAD outperforming nine baseline optimizers with five
+RL algorithms across twelve environments, including standard benchmarks and
+challenging scenarios. Notably, RAD achieves up to a 155.1% performance
+improvement over ADAM in Atari games, showcasing its efficacy in stabilizing
+and accelerating RL training.
+
+
+
+
+
+
+
+ ☆ Learn More by Using Less: Distributed Learning with Energy-Constrained
+ Devices
+
+
+
+
+
+
+
+
+ Roberto Pereira, Cristian J. Vaca-Rubio, Luis Blanco
+
+
+ Federated Learning (FL) has emerged as a solution for distributed model
+training across decentralized, privacy-preserving devices, but the different
+energy capacities of participating devices (system heterogeneity) constrain
+real-world implementations. These energy limitations not only reduce model
+accuracy but also increase dropout rates, impacting convergence in practical
+FL deployments. In this work, we propose LeanFed, an energy-aware FL framework
+designed to optimize client selection and training workloads on
+battery-constrained devices. LeanFed leverages adaptive data usage by
+dynamically adjusting the fraction of local data each device utilizes during
+training, thereby maximizing device participation across communication rounds
+while ensuring they do not run out of battery during the process. We rigorously
+evaluate LeanFed against traditional FedAvg on CIFAR-10 and CIFAR-100 datasets,
+simulating various levels of data heterogeneity and device participation rates.
+Results show that LeanFed consistently enhances model accuracy and stability,
+particularly in settings with high data heterogeneity and limited battery life,
+by mitigating client dropout and extending device availability. This approach
+demonstrates the potential of energy-efficient, privacy-preserving FL in
+real-world, large-scale applications, setting a foundation for robust and
+sustainable pervasive AI on resource-constrained networks.
+
+
+
+
+
+
+
+ ☆ GQWformer: A Quantum-based Transformer for Graph Representation Learning
+
+
+
+
+
+
+
+
+ Lei Yu, Hongyang Chen, Jingsong Lv, Linyao Yang
+
+
+ Graph Transformers (GTs) have demonstrated significant advantages in graph
+representation learning through their global attention mechanisms. However, the
+self-attention mechanism in GTs tends to neglect the inductive biases inherent
+in graph structures, making it challenging to effectively capture essential
+structural information. To address this issue, we propose a novel approach that
+integrates graph inductive bias into self-attention mechanisms by leveraging
+quantum technology for structural encoding. In this paper, we introduce the
+Graph Quantum Walk Transformer (GQWformer), a groundbreaking GNN framework that
+utilizes quantum walks on attributed graphs to generate node quantum states.
+These quantum states encapsulate rich structural attributes and serve as
+inductive biases for the transformer, thereby enabling the generation of more
+meaningful attention scores. By subsequently incorporating a recurrent neural
+network, our design amplifies the model's ability to focus on both local and
+global information. We conducted comprehensive experiments across five publicly
+available datasets to evaluate the effectiveness of our model. These results
+clearly indicate that GQWformer outperforms existing state-of-the-art graph
+classification algorithms. These findings highlight the significant potential
+of integrating quantum computing methodologies with traditional GNNs to advance
+the field of graph representation learning, providing a promising direction for
+future research and applications.
+
+
+
+
+
+
+
+ ☆ Step-by-Step Guidance to Differential Anemia Diagnosis with Real-World
+ Data and Deep Reinforcement Learning
+
+
+ Clinical diagnostic guidelines outline the key questions to answer to reach a
+diagnosis. Inspired by guidelines, we aim to develop a model that learns from
+electronic health records to determine the optimal sequence of actions for
+accurate diagnosis. Focusing on anemia and its sub-types, we employ deep
+reinforcement learning (DRL) algorithms and evaluate their performance on both
+a synthetic dataset, which is based on expert-defined diagnostic pathways, and
+a real-world dataset. We investigate the performance of these algorithms across
+various scenarios. Our experimental results demonstrate that DRL algorithms
+perform competitively with state-of-the-art methods while offering the
+significant advantage of progressively generating pathways to the suggested
+diagnosis, providing a transparent decision-making process that can guide and
+explain diagnostic reasoning.
+
+
+
+ comment: arXiv admin note: text overlap with arXiv:2404.05913
+
+
+
+
+
+
+ ☆ BOTracle: A framework for Discriminating Bots and Humans ESORICS
+
+
+
+
+
+
+
+
+ Jan Kadel, August See, Ritwik Sinha, Mathias Fischer
+
+
+ Bots constitute a significant portion of Internet traffic and are a source of
+various issues across multiple domains. Modern bots often become
+indistinguishable from real users, as they employ similar methods to browse the
+web, including using real browsers. We address the challenge of bot detection
+in high-traffic scenarios by analyzing three distinct detection methods. The
+first method operates on heuristics, allowing for rapid detection. The second
+method utilizes well-known technical features, such as IP address, window
+size, and user agent. It serves primarily for comparison with the third method.
+In the third method, we rely solely on browsing behavior, omitting all static
+features and focusing exclusively on how clients behave on a website. In
+contrast to related work, we evaluate our approaches using real-world
+e-commerce traffic data, comprising 40 million monthly page visits. We further
+compare our methods against another bot detection approach, Botcha, on the same
+dataset. Our performance metrics, including precision, recall, and AUC, reach
+98 percent or higher, surpassing Botcha.
+
+
+
+ comment: Bot Detection; User Behaviour Analysis; Published at ESORICS
+ International Workshops 2024
+
+
+
+
+
+
+ ☆ Diabetic Retinopathy Classification from Retinal Images using Machine
+ Learning Approaches
+
+
+ Diabetic Retinopathy is one of the most familiar diseases and is a diabetes
+complication that affects eyes. Initially, diabetic retinopathy may cause no
+symptoms or only mild vision problems. Eventually, it can cause blindness. So
+early detection of symptoms could help to avoid blindness. In this paper, we
+present some experiments on some features of diabetic retinopathy, like
+properties of exudates, properties of blood vessels and properties of
+microaneurysm. Using the features, we can classify healthy, mild
+non-proliferative, moderate non-proliferative, severe non-proliferative and
+proliferative stages of DR. Support Vector Machine, Random Forest and Naive
+Bayes classifiers are used to classify the stages. Finally, Random Forest is
+found to be the best for higher accuracy, sensitivity and specificity of 76.5%,
+77.2% and 93.3% respectively.
+
+
+
+ comment: 5 pages, 9 figures, 2 tables. International Conference on Advanced
+ Engineering, Technology and Applications (ICAETA-2021), Istanbul, Turkey
+
+
+
+
+
+
+ ☆ Technical Report on Reinforcement Learning Control on the Lucas-Nülle
+ Inverted Pendulum
+
+
+ The discipline of automatic control is making increased use of concepts that
+originate from the domain of machine learning. Herein, reinforcement learning
+(RL) takes an elevated role, as it is inherently designed for sequential
+decision making, and can be applied to optimal control problems without the
+need for a plant system model. To advance education of control engineers and
+operators in this field, this contribution targets an RL framework that can be
+applied to educational hardware provided by the Lucas-Nülle company.
+Specifically, the goal of inverted pendulum control is pursued by means of RL,
+including both swing-up and stabilization within a single holistic design
+approach. Herein, the actual learning is enabled by separating corresponding
+computations from the real-time control computer and outsourcing them to a
+different hardware. This distributed architecture, however, necessitates
+communication of the involved components, which is realized via CAN bus. The
+experimental proof of concept is presented with an applied safeguarding
+algorithm that prevents the plant from being operated harmfully during the
+trial-and-error training phase.
+
+
+
+
+
+
+
+ ☆ Composing Open-domain Vision with RAG for Ocean Monitoring and
+ Conservation NeurIPS 2024
+
+
+ Climate change's destruction of marine biodiversity is threatening
+communities and economies around the world which rely on healthy oceans for
+their livelihoods. The challenge of applying computer vision to niche,
+real-world domains such as ocean conservation lies in the dynamic and diverse
+environments where traditional top-down learning struggles with long-tailed
+distributions, generalization, and domain transfer. Scalable species
+identification for ocean monitoring is particularly difficult due to the need
+to adapt models to new environments and identify rare or unseen species. To
+overcome these limitations, we propose leveraging bottom-up, open-domain
+learning frameworks as a resilient, scalable solution for image and video
+analysis in marine applications. Our preliminary demonstration uses pretrained
+vision-language models (VLMs) combined with retrieval-augmented generation
+(RAG) as grounding, leaving the door open for numerous architectural, training
+and engineering optimizations. We validate this approach through a preliminary
+application in classifying fish from video onboard fishing vessels,
+demonstrating impressive emergent retrieval and prediction capabilities without
+domain-specific training or knowledge of the task itself.
+
+
+
+ comment: Accepted to Climate Change AI Workshop at NeurIPS 2024. 9 pages, 6
+ figures, 1 table
+
+
+
+
+
+
+ ☆ Selective Reviews of Bandit Problems in AI via a Statistical View
+
+
+ Reinforcement Learning (RL) is a widely researched area in artificial
+intelligence that focuses on teaching agents decision-making through
+interactions with their environment. A key subset includes stochastic
+multi-armed bandit (MAB) and continuum-armed bandit (SCAB) problems, which
+model sequential decision-making under uncertainty. This review outlines the
+foundational models and assumptions of bandit problems, explores non-asymptotic
+theoretical tools like concentration inequalities and minimax regret bounds,
+and compares frequentist and Bayesian algorithms for managing
+exploration-exploitation trade-offs. We also extend the discussion to $K$-armed
+contextual bandits and SCAB, examining their methodologies, regret analyses,
+and discussing the relation between the SCAB problems and the functional data
+analysis. Finally, we highlight recent advances and ongoing challenges in the
+field.
+
+
+
+ comment: 46 pages, 5 figures
+
+
+
+
+
+
+ ☆ On Simplifying Large-Scale Spatial Vectors: Fast, Memory-Efficient, and
+ Cost-Predictable k-means
+
+
+ The k-means algorithm can simplify large-scale spatial vectors, such as 2D
+geo-locations and 3D point clouds, to support fast analytics and learning.
+However, when processing large-scale datasets, existing k-means algorithms have
+been developed to achieve high performance with significant computational
+resources, such as memory and CPU usage time. These algorithms, though
+effective, are not well-suited for resource-constrained devices. In this paper,
+we propose a fast, memory-efficient, and cost-predictable k-means called
+Dask-means. We first accelerate k-means by designing a memory-efficient
+accelerator, which utilizes an optimized nearest neighbor search over a
+memory-tunable index to assign spatial vectors to clusters in batches. We then
+design a lightweight cost estimator to predict the memory cost and runtime of
+the k-means task, allowing it to request appropriate memory from devices or
+adjust the accelerator's required space to meet memory constraints, and ensure
+sufficient CPU time for running k-means. Experiments show that when simplifying
+datasets with scale such as $10^6$, Dask-means uses less than $30$MB of memory
+and achieves over $168$ times speedup compared to the widely-used Lloyd's
+algorithm. We also validate Dask-means on mobile devices, where it demonstrates
+significant speedup and low memory cost compared to other state-of-the-art
+(SOTA) k-means algorithms. Our cost estimator estimates the memory cost with a
+difference of less than $3\%$ from the actual ones and predicts runtime with an
+MSE up to $33.3\%$ lower than SOTA methods.
+
+
+
+
+
+
+
+ ☆ U-Net in Medical Image Segmentation: A Review of Its Applications Across
+ Modalities
+
+
+ Medical imaging is essential in healthcare to provide key insights into
+patient anatomy and pathology, aiding in diagnosis and treatment. Non-invasive
+techniques such as X-ray, Magnetic Resonance Imaging (MRI), Computed Tomography
+(CT), and Ultrasound (US), capture detailed images of organs, tissues, and
+abnormalities. Effective analysis of these images requires precise segmentation
+to delineate regions of interest (ROI), such as organs or lesions. Traditional
+segmentation methods, relying on manual feature-extraction, are labor-intensive
+and vary across experts. Recent advancements in Artificial Intelligence (AI)
+and Deep Learning (DL), particularly convolutional models such as U-Net and its
+variants (U-Net++ and U-Net 3+), have transformed medical image segmentation
+(MIS) by automating the process and enhancing accuracy. These models enable
+efficient, precise pixel-wise classification across various imaging modalities,
+overcoming the limitations of manual segmentation. This review explores various
+medical imaging techniques, examines the U-Net architectures and their
+adaptations, and discusses their application across different modalities. It
+also identifies common challenges in MIS and proposes potential solutions.
+
+
+
+
+
+
+
+ ☆ ESA: Example Sieve Approach for Multi-Positive and Unlabeled Learning
+
+
+ Learning from Multi-Positive and Unlabeled (MPU) data has gradually attracted
+significant attention from practical applications. Unfortunately, the risk of
+MPU also suffers from the shift of minimum risk, particularly when the models
+are very flexible as shown in Fig.\ref{moti}. In this paper, to alleviate the
+shifting of minimum risk problem, we propose an Example Sieve Approach (ESA) to
+select examples for training a multi-class classifier. Specifically, we sieve
+out some examples by utilizing the Certain Loss (CL) value of each example in
+the training stage and analyze the consistency of the proposed risk estimator.
+Besides, we show that the estimation error of proposed ESA obtains the optimal
+parametric convergence rate. Extensive experiments on various real-world
+datasets show the proposed approach outperforms previous methods.
+
+
+ Annotating data for sensitive labels (e.g., disease, smoking) poses a
+potential threat to individual privacy in many real-world scenarios. To cope
+with this problem, we propose a novel setting to protect privacy of each
+instance, namely learning from concealed labels for multi-class classification.
+Concealed labels prevent sensitive labels from appearing in the label set
+during the label collection stage: sensitive data are annotated with a concealed
+label set consisting of "none" and some randomly sampled insensitive labels. In this
+paper, an unbiased estimator can be established from concealed data under mild
+assumptions, and the learned multi-class classifier can not only classify the
+instance from insensitive labels accurately but also recognize the instance
+from the sensitive labels. Moreover, we bound the estimation error and show
+that the multi-class classifier achieves the optimal parametric convergence
+rate. Experiments demonstrate the significance and effectiveness of the
+proposed method for concealed labels in synthetic and real-world datasets.
+
+
+
+ comment: 12 pages, 2 figures
+
+
+
+
+
+
+ ☆ BANER: Boundary-Aware LLMs for Few-Shot Named Entity Recognition COLING 2025
+
+
+ Despite the recent success of two-stage prototypical networks in few-shot
+named entity recognition (NER), challenges such as over/under-detected false
+spans in the span detection stage and unaligned entity prototypes in the type
+classification stage persist. Additionally, LLMs have not proven to be
+effective few-shot information extractors in general. In this paper, we propose
+an approach called Boundary-Aware LLMs for Few-Shot Named Entity Recognition to
+address these issues. We introduce a boundary-aware contrastive learning
+strategy to enhance the LLM's ability to perceive entity boundaries for
+generalized entity spans. Additionally, we utilize LoRAHub to align information
+from the target domain to the source domain, thereby enhancing adaptive
+cross-domain classification capabilities. Extensive experiments across various
+benchmarks demonstrate that our framework outperforms prior methods, validating
+its effectiveness. In particular, the proposed strategies demonstrate
+effectiveness across a range of LLM architectures. The code and data are
+released on https://github.com/UESTC-GQJ/BANER.
+
+
+
+ comment: To appear at COLING 2025
+
+
+
+
+
+
+ ☆ Unlocking Tuning-Free Few-Shot Adaptability in Visual Foundation Models
+ by Recycling Pre-Tuned LoRAs
+
+
+
+
+
+
+
+
+ Zixuan Hu, Yongxian Wei, Li Shen, Chun Yuan, Dacheng Tao
+
+
+ Large Language Models (LLMs) such as ChatGPT demonstrate strong few-shot
+adaptability without requiring fine-tuning, making them ideal for
+data-limited and real-time applications. However, this adaptability has not yet
+been replicated in current Visual Foundation Models (VFMs), which require
+explicit fine-tuning with sufficient tuning data. Besides, the
+pretraining-finetuning paradigm has led to the surge of numerous task-specific
+modular components, such as Low-Rank Adaptation (LoRA). For the first time, we
+explore the potential of reusing diverse pre-tuned LoRAs without accessing
+their original training data, to achieve tuning-free few-shot adaptation in
+VFMs. Our framework, LoRA Recycle, distills a meta-LoRA from diverse pre-tuned
+LoRAs with a meta-learning objective, using surrogate data generated inversely
+from pre-tuned LoRAs themselves. The VFM, once equipped with the meta-LoRA, is
+empowered to solve new few-shot tasks in a single forward pass, akin to the
+in-context learning of LLMs. Additionally, we incorporate a double-efficient
+mechanism tailored to our framework, significantly accelerating the
+meta-training process while maintaining or even improving performance.
+Extensive experiments on various few-shot classification benchmarks, in both
+in-domain and cross-domain scenarios, demonstrate the superiority of our
+framework.
+
+
+
+
+
+
+
+ ☆ Recovering implicit physics model under real-world constraints ECAI 2024
+
+
+
+
+
+
+
+
+ Ayan Banerjee, Sandeep K. S. Gupta
+
+
+ Recovering a physics-driven model, i.e. a governing set of equations of the
+underlying dynamical systems, from real-world data has been of recent
+interest. Most existing methods either operate on simulation data with
+unrealistically high sampling rates or require explicit measurements of all
+system variables, which is not feasible in real-world deployments. Moreover,
+they assume the timestamps of external perturbations to the physical system are
+known a priori, without uncertainty, implicitly discounting any sensor
+time-synchronization or human reporting errors. In this paper, we propose a
+novel liquid time constant neural network (LTC-NN) based architecture to
+recover the underlying model of physical dynamics from real-world data. The
+automatic differentiation property of LTC-NN nodes overcomes problems
+associated with low sampling rates; the input-dependent time constant in the
+forward pass of the hidden layer of LTC-NN nodes creates a massive search space
+of implicit physical dynamics; the physics-model-solver-based data
+reconstruction loss guides the search for the correct set of implicit dynamics;
+and the use of dropout regularization in the dense layer ensures extraction
+of the sparsest model. Further, to account for the perturbation timing error,
+we utilize dense layer nodes to search through input shifts that result in the
+lowest reconstruction loss. Experiments on four benchmark dynamical systems,
+three with simulation data and one with real-world data, show that the
+LTC-NN architecture is more accurate in recovering implicit physics model
+coefficients than state-of-the-art sparse model recovery approaches. We
+also introduce four additional case studies (eight in total) on real-life medical
+examples in simulation and with real-world clinical data to show the effectiveness
+of our approach in recovering the underlying model in practice.
+
+
+
+ comment: This paper is published in ECAI 2024,
+ https://ebooks.iospress.nl/volumearticle/69651
+
+
+
+
+
+
+ ☆ An Automated Data Mining Framework Using Autoencoders for Feature
+ Extraction and Dimensionality Reduction
+
+
+ This study proposes an automated data mining framework based on autoencoders
+and experimentally verifies its effectiveness in feature extraction and data
+dimensionality reduction. Through the encoding-decoding structure, the
+autoencoder can capture the data's potential characteristics and achieve noise
+reduction and anomaly detection, providing an efficient and stable solution for
+the data mining process. The experiment compared the performance of the
+autoencoder with traditional dimensionality reduction methods (such as PCA, FA,
+t-SNE, and UMAP). The results showed that the autoencoder performed best in
+terms of reconstruction error and root mean square error and could better
+retain data structure and enhance the generalization ability of the model. The
+autoencoder-based framework not only reduces manual intervention but also
+significantly improves the automation of data processing. In the future, with
+the advancement of deep learning and big data technology, the autoencoder
+method combined with a generative adversarial network (GAN) or graph neural
+network (GNN) is expected to be more widely used in the fields of complex data
+processing, real-time data analysis and intelligent decision-making.
+
+
+ GNAS (Graph Neural Architecture Search) has demonstrated great effectiveness
+in automatically designing the optimal graph neural architectures for multiple
+downstream tasks, such as node classification and link prediction. However,
+most existing GNAS methods cannot efficiently handle large-scale graphs
+containing millions of nodes and edges due to the expensive
+computational and memory overhead. To scale GNAS on large graphs while
+achieving better performance, we propose SA-GNAS, a novel framework based on
+seed architecture expansion for efficient large-scale GNAS. Similar to the cell
+expansion in biotechnology, we first construct a seed architecture and then
+expand the seed architecture iteratively. Specifically, we first propose a
+performance ranking consistency-based seed architecture selection method, which
+selects the architecture searched on the subgraph that best matches the
+original large-scale graph. Then, we propose an entropy minimization-based seed
+architecture expansion method to further improve the performance of the seed
+architecture. Extensive experimental results on five large-scale graphs
+demonstrate that the proposed SA-GNAS outperforms human-designed
+state-of-the-art GNN architectures and existing graph NAS methods. Moreover,
+SA-GNAS can significantly reduce the search time, showing better search
+efficiency. For the largest graph, with billion-scale edges, SA-GNAS can achieve a
+2.8-times speedup compared to the SOTA large-scale GNAS method GAUSS. Additionally,
+since SA-GNAS is inherently parallelized, the search efficiency can be further
+improved with more GPUs. SA-GNAS is available at
+https://github.com/PasaLab/SAGNAS.
+
+
+
+
+
+
+
+ ☆ Deep Learning, Machine Learning, Advancing Big Data Analytics and
+ Management
+
+
+
+
+
+
+
+
+ Weiche Hsieh, Ziqian Bi, Keyu Chen, Benji Peng, Sen Zhang, Jiawei Xu, Jinlang Wang, Caitlyn Heqi Yin, Yichao Zhang, Pohsun Feng, Yizhu Wen, Tianyang Wang, Ming Li, Chia Xin Liang, Jintao Ren, Qian Niu, Silin Chen, Lawrence K. Q. Yan, Han Xu, Hong-Ming Tseng, Xinyuan Song, Bowen Jing, Junjie Yang, Junhao Song, Junyu Liu, Ming Liu
+
+
+ Advancements in artificial intelligence, machine learning, and deep learning
+have catalyzed the transformation of big data analytics and management into
+pivotal domains for research and application. This work explores the
+theoretical foundations, methodological advancements, and practical
+implementations of these technologies, emphasizing their role in uncovering
+actionable insights from massive, high-dimensional datasets. The study presents
+a systematic overview of data preprocessing techniques, including data
+cleaning, normalization, integration, and dimensionality reduction, to prepare
+raw data for analysis. Core analytics methodologies such as classification,
+clustering, regression, and anomaly detection are examined, with a focus on
+algorithmic innovation and scalability. Furthermore, the text delves into
+state-of-the-art frameworks for data mining and predictive modeling,
+highlighting the role of neural networks, support vector machines, and ensemble
+methods in tackling complex analytical challenges. Special emphasis is placed
+on the convergence of big data with distributed computing paradigms, including
+cloud and edge computing, to address challenges in storage, computation, and
+real-time analytics. The integration of ethical considerations, including data
+privacy and compliance with global standards, ensures a holistic perspective on
+data management. Practical applications across healthcare, finance, marketing,
+and policy-making illustrate the real-world impact of these technologies.
+Through comprehensive case studies and Python-based implementations, this work
+equips researchers, practitioners, and data enthusiasts with the tools to
+navigate the complexities of modern data analytics. It bridges the gap between
+theory and practice, fostering the development of innovative solutions for
+managing and leveraging data in the era of artificial intelligence.
+
+
+ Subgraph representation learning has been effective in solving various
+real-world problems. However, current graph neural networks (GNNs) produce
+suboptimal results for subgraph-level tasks due to their inability to capture
+complex interactions within and between subgraphs. To provide a more expressive
+and efficient alternative, we propose WLKS, a Weisfeiler-Lehman (WL) kernel
+generalized for subgraphs by applying the WL algorithm on induced $k$-hop
+neighborhoods. We combine kernels across different $k$-hop levels to capture
+richer structural information that is not fully encoded in existing models. Our
+approach can balance expressiveness and efficiency by eliminating the need for
+neighborhood sampling. In experiments on eight real-world and synthetic
+benchmarks, WLKS significantly outperforms leading approaches on five datasets
+while reducing training time to between 0.01x and 0.25x that of the
+state-of-the-art.
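+
+As a rough illustration of the idea described above, the following minimal
+sketch runs Weisfeiler-Lehman colour refinement on the subgraph induced by the
+$k$-hop neighborhood of a node set and compares two colour histograms with a
+dot product. The function names, the unlabeled-node assumption, and the
+adjacency-dict graph format are all hypothetical choices made for this sketch;
+it is not the authors' WLKS implementation.
+

```python
from collections import Counter, deque


def k_hop_nodes(adj, seeds, k):
    """Return all nodes within k hops of any seed node (breadth-first search)."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond k hops
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)


def wl_histogram(adj, nodes, iterations):
    """Histogram of WL colours on the subgraph induced by `nodes`.

    Assumes unlabeled nodes: every node starts with colour 0.
    """
    colors = {u: 0 for u in nodes}
    hist = Counter((0, c) for c in colors.values())
    for it in range(1, iterations + 1):
        # New colour = (own colour, sorted multiset of neighbour colours),
        # restricted to neighbours inside the induced subgraph.
        colors = {
            u: (colors[u], tuple(sorted(colors[w] for w in adj[u] if w in nodes)))
            for u in nodes
        }
        hist.update((it, c) for c in colors.values())
    return hist


def khop_wl_kernel(adj_a, sub_a, adj_b, sub_b, k, iterations=2):
    """Dot product of WL colour histograms on the two k-hop neighborhoods."""
    ha = wl_histogram(adj_a, k_hop_nodes(adj_a, sub_a, k), iterations)
    hb = wl_histogram(adj_b, k_hop_nodes(adj_b, sub_b, k), iterations)
    return sum(ha[c] * hb[c] for c in ha.keys() & hb.keys())
```

+
+Kernels for several values of $k$ could then be combined, for example by
+summing `khop_wl_kernel(..., k)` over the chosen levels; the summation is an
+assumption of this sketch, since the abstract does not state the combination rule.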
+
+
+
+ comment: 15 pages
+
+
+
+
+
+
+ ☆ Improved Complexity for Smooth Nonconvex Optimization: A Two-Level
+ Online Learning Approach with Quasi-Newton Methods
+
+
+ We study the problem of finding an $\epsilon$-first-order stationary point
+(FOSP) of a smooth function, given access only to gradient information. The
+best-known gradient query complexity for this task, assuming both the gradient
+and Hessian of the objective function are Lipschitz continuous, is
+${O}(\epsilon^{-7/4})$. In this work, we propose a method with a gradient
+complexity of ${O}(d^{1/4}\epsilon^{-13/8})$, where $d$ is the problem
+dimension, leading to an improved complexity when $d = {O}(\epsilon^{-1/2})$.
+To achieve this result, we design an optimization algorithm that, underneath,
+involves solving two online learning problems. Specifically, we first
+reformulate the task of finding a stationary point for a nonconvex problem as
+minimizing the regret in an online convex optimization problem, where the loss
+is determined by the gradient of the objective function. Then, we introduce a
+novel optimistic quasi-Newton method to solve this online learning problem,
+with the Hessian approximation update itself framed as an online learning
+problem in the space of matrices. Beyond improving the complexity bound for
+achieving an $\epsilon$-FOSP using a gradient oracle, our result provides the
+first guarantee suggesting that quasi-Newton methods can potentially outperform
+gradient descent-type methods in nonconvex settings.
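+
+The dimension condition quoted above can be checked by comparing the two
+bounds directly. The short derivation below treats the constants hidden by
+${O}(\cdot)$ as 1, which is a simplifying assumption of this note.
+

```latex
d^{1/4}\,\epsilon^{-13/8} \le \epsilon^{-7/4}
\;\iff\; d^{1/4} \le \epsilon^{-7/4 + 13/8} = \epsilon^{-1/8}
\;\iff\; d \le \epsilon^{-1/2}.
```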
+
+
+
+ comment: 35 pages
+
+
+
+
+
+
+ ☆ Towards the efficacy of federated prediction for epidemics on networks
+
+
+
+
+
+
+
+
+ Chengpeng Fu, Tong Li, Hao Chen, Wen Du, Zhidong He
+
+
+ Epidemic prediction is of practical significance in public health, enabling
+early intervention, resource allocation, and strategic planning. However,
+privacy concerns often hinder the sharing of health data among institutions,
+limiting the development of accurate prediction models. In this paper, we
+develop a general privacy-preserving framework for node-level epidemic
+prediction on networks based on federated learning (FL). We frame the
+spatio-temporal spread of epidemics across multiple data-isolated subnetworks,
+where each node state represents the aggregate epidemic severity within a
+community. Then, both a purely temporal LSTM model and a spatio-temporal
+model, i.e., the Spatio-Temporal Graph Attention Network (STGAT), are proposed to
+address federated epidemic prediction. Extensive experiments are conducted
+on various epidemic processes using a practical airline network, offering a
+comprehensive assessment of FL efficacy under diverse scenarios. By introducing
+the efficacy energy metric to measure system robustness under various client
+configurations, we systematically explore key factors influencing FL
+performance, including client numbers, aggregation strategies, graph
+partitioning, and missing infectious reports. Numerical results show that STGAT
+excels in capturing spatio-temporal dependencies in dynamic processes, whereas
+LSTM performs well on simpler patterns. Moreover, our findings highlight the
+importance of balancing feature consistency and volume uniformity among
+clients, as well as the prediction dilemma between information richness and
+intrinsic stochasticity of dynamic processes. This study offers practical
+insights into the efficacy of FL in epidemic management and demonstrates
+the potential of FL to address broader collective dynamics.
+
+
+
+
+
+
+
+ ☆ Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods
+ and a New Transcript-Classifier Approach NeurIPS 2024
+
+
+
+
+
+
+
+
+ Tony T. Wang, John Hughes, Henry Sleight, Rylan Schaeffer, Rajashree Agrawal, Fazl Barez, Mrinank Sharma, Jesse Mu, Nir Shavit, Ethan Perez
+
+
+ Defending large language models against jailbreaks so that they never engage
+in a broadly-defined set of forbidden behaviors is an open problem. In this
+paper, we investigate the difficulty of jailbreak-defense when we only want to
+forbid a narrowly-defined set of behaviors. As a case study, we focus on
+preventing an LLM from helping a user make a bomb. We find that popular
+defenses such as safety training, adversarial training, and input/output
+classifiers are unable to fully solve this problem. In pursuit of a better
+solution, we develop a transcript-classifier defense which outperforms the
+baseline defenses we test. However, our classifier defense still fails in some
+circumstances, which highlights the difficulty of jailbreak-defense even in a
+narrow domain.
+
+
+
+ comment: Accepted to the AdvML-Frontiers and SoLaR workshops at NeurIPS 2024
+
+
+
+
+
+
+ ☆ CausalMob: Causal Human Mobility Prediction with LLMs-derived Human
+ Intentions toward Public Events KDD 2025
+
+
+ Large-scale human mobility exhibits spatial and temporal patterns that can
+assist policymakers in decision making. Although traditional prediction models
+attempt to capture these patterns, they are often disrupted by non-periodic public
+events, such as disasters and occasional celebrations. Since regular human
+mobility patterns are heavily affected by these events, estimating their causal
+effects is critical to accurate mobility predictions. Although news articles
+provide unique perspectives on these events in an unstructured format,
+processing them is a challenge. In this study, we propose a causality-augmented
+prediction model, called CausalMob, to analyze the causal effects of
+public events. We first utilize large language models (LLMs) to extract human
+intentions from news articles and transform them into features that act as
+causal treatments. Next, the model learns representations of spatio-temporal
+regional covariates from multiple data sources to serve as confounders for
+causal inference. Finally, we present a causal effect estimation framework to
+ensure event features remain independent of confounders during prediction.
+Based on large-scale real-world data, the experimental results show that the
+proposed model excels in human mobility prediction, outperforming
+state-of-the-art models.
+
+
+
+ comment: Accepted by KDD 2025
+
+
+
+
+
+
+ ☆ Failure Probability Estimation for Black-Box Autonomous Systems using
+ State-Dependent Importance Sampling Proposals
+
+
+
+
+
+
+
+
+ Harrison Delecki, Sydney M. Katz, Mykel J. Kochenderfer
+
+
+ Estimating the probability of failure is a critical step in developing
+safety-critical autonomous systems. Direct estimation methods such as Monte
+Carlo sampling are often impractical due to the rarity of failures in these
+systems. Existing importance sampling approaches do not scale to sequential
+decision-making systems with large state spaces and long horizons. We propose
+an adaptive importance sampling algorithm to address these limitations. Our
+method minimizes the forward Kullback-Leibler divergence between a
+state-dependent proposal distribution and a relaxed form of the optimal
+importance sampling distribution. Our method uses Markov score ascent methods
+to estimate this objective. We evaluate our approach on four sequential systems
+and show that it provides more accurate failure probability estimates than
+baseline Monte Carlo and importance sampling techniques. This work is open
+sourced.
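+
+To illustrate why importance sampling helps with rare failures, the sketch
+below estimates a Gaussian tail probability with a fixed, shifted proposal.
+This is only a static one-dimensional toy: the function name, the choice of a
+standard normal model, and the unit-variance proposal centred at the threshold
+are assumptions of this sketch, and it does not implement the adaptive,
+state-dependent proposal described in the abstract.
+

```python
import math
import random


def failure_prob_importance_sampling(threshold, n_samples, seed=0):
    """Estimate P(X > threshold) for X ~ N(0, 1) via importance sampling."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Draw from the proposal q = N(threshold, 1), which puts about half
        # of its mass on the failure region.
        x = rng.gauss(threshold, 1.0)
        if x > threshold:  # failure event
            # Likelihood ratio p(x) / q(x) for two unit-variance Gaussians.
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n_samples


def exact_tail_probability(threshold):
    """Closed-form P(X > threshold) for a standard normal, for comparison."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))
```

+
+With a threshold of 4, direct Monte Carlo sampling from N(0, 1) would observe
+only a handful of failures in 100,000 draws, whereas the weighted estimate
+above uses roughly half of its samples.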
+
+
+
+ comment: Submitted to L4DC 2025
+
+
+
+
+
+
+ ☆ Revisiting the Initial Steps in Adaptive Gradient Descent Optimization NeurIPS 2024
+
+
+ Adaptive gradient optimization methods, such as Adam, are prevalent in
+training deep neural networks across diverse machine learning tasks due to
+their ability to achieve faster convergence. However, these methods often
+suffer from suboptimal generalization compared to stochastic gradient descent
+(SGD) and exhibit instability, particularly when training Transformer models.
+In this work, we identify the standard initialization of the second-order moment
+estimation ($v_0 =0$) as a significant factor contributing to these
+limitations. We introduce simple yet effective solutions: initializing the
+second-order moment estimation with non-zero values, using either data-driven
+or random initialization strategies. Empirical evaluations demonstrate that our
+approach not only stabilizes convergence but also enhances the final
+performance of adaptive gradient optimizers. Furthermore, by adopting the
+proposed initialization strategies, Adam achieves performance comparable to
+many recently proposed variants of adaptive gradient optimization methods,
+highlighting the practical impact of this straightforward modification.
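+
+The role of $v_0$ can be seen in a minimal scalar Adam loop such as the sketch
+below. The function name, the hyperparameter defaults, and in particular the
+decision to skip bias correction of the second moment when $v_0 \neq 0$ are
+assumptions made for illustration; the abstract does not specify how the
+authors handle bias correction or how they choose the non-zero values.
+

```python
import math


def adam_trajectory(grad_fn, x0, v0, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run scalar Adam from x0 with second-moment estimate initialised to v0."""
    x, m, v = float(x0), 0.0, float(v0)
    xs = [x]
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        # Standard bias correction assumes v0 = 0; for v0 != 0 this sketch
        # uses v as-is (an assumption, not taken from the paper).
        v_hat = v / (1 - beta2 ** t) if v0 == 0 else v
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
        xs.append(x)
    return xs
```

+
+With $v_0 = 0$ the first update has magnitude close to `lr` regardless of how
+small the gradient is, while a non-zero $v_0$ makes the first update scale
+with the gradient.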
+
+
+
+ comment: OPT workshop at NeurIPS 2024
+
+
+
+
+
+
+ ☆ SparseGrasp: Robotic Grasping via 3D Semantic Gaussian Splatting from
+ Sparse Multi-View RGB Images
+
+
+
+
+
+
+
+
+ Junqiu Yu, Xinlin Ren, Yongchong Gu, Haitao Lin, Tianyu Wang, Yi Zhu, Hang Xu, Yu-Gang Jiang, Xiangyang Xue, Yanwei Fu
+
+
+ Language-guided robotic grasping is a rapidly advancing field where robots
+are instructed using human language to grasp specific objects. However,
+existing methods often depend on dense camera views and struggle to quickly
+update scenes, limiting their effectiveness in changeable environments.
+ In contrast, we propose SparseGrasp, a novel open-vocabulary robotic grasping
+system that operates efficiently with sparse-view RGB images and handles scene
+updates quickly. Our system builds upon and significantly enhances existing
+computer vision modules in robotic learning. Specifically, SparseGrasp utilizes
+DUSt3R to generate a dense point cloud as the initialization for 3D Gaussian
+Splatting (3DGS), maintaining high fidelity even under sparse supervision.
+Importantly, SparseGrasp incorporates semantic awareness from recent vision
+foundation models. To further improve processing efficiency, we repurpose
+Principal Component Analysis (PCA) to compress features from 2D models.
+Additionally, we introduce a novel render-and-compare strategy that ensures
+rapid scene updates, enabling multi-turn grasping in changeable environments.
+ Experimental results show that SparseGrasp significantly outperforms
+state-of-the-art methods in terms of both speed and adaptability, providing a
+robust solution for multi-turn grasping in changeable environments.
+
+
+
+
+
+
+
+ ☆ ShapeWords: Guiding Text-to-Image Synthesis with 3D Shape-Aware Prompts
+
+
+ We introduce ShapeWords, an approach for synthesizing images based on 3D
+shape guidance and text prompts. ShapeWords incorporates target 3D shape
+information within specialized tokens embedded together with the input text,
+effectively blending 3D shape awareness with textual context to guide the image
+synthesis process. Unlike conventional shape guidance methods that rely on
+depth maps restricted to fixed viewpoints and often overlook full 3D structure
+or textual context, ShapeWords generates diverse yet consistent images that
+reflect both the target shape's geometry and the textual description.
+Experimental results show that ShapeWords produces images that are more
+text-compliant and aesthetically plausible, while also maintaining 3D shape
+awareness.
+
+
+ Recent advancements in large language models have significantly improved
+their context windows, yet challenges in effective long-term memory management
+remain. We introduce MemTree, an algorithm that leverages a dynamic,
+tree-structured memory representation to optimize the organization, retrieval,
+and integration of information, akin to human cognitive schemas. MemTree
+organizes memory hierarchically, with each node encapsulating aggregated
+textual content, corresponding semantic embeddings, and varying abstraction
+levels across the tree's depths. Our algorithm dynamically adapts this memory
+structure by computing and comparing semantic embeddings of new and existing
+information to enrich the model's context-awareness. This approach allows
+MemTree to handle complex reasoning and extended interactions more effectively
+than traditional memory augmentation methods, which often rely on flat lookup
+tables. Evaluations on benchmarks for multi-turn dialogue understanding and
+document question answering show that MemTree significantly enhances
+performance in scenarios that demand structured memory management.
+
+
+
+
+
+
+
+ ♻ ☆ Accelerating Proximal Policy Optimization Learning Using Task Prediction
+ for Solving Environments with Delayed Rewards
+
+
+
+
+
+
+
+
+ Ahmad Ahmad, Mehdi Kermanshah, Kevin Leahy, Zachary Serlin, Ho Chit Siu, Makai Mann, Cristian-Ioan Vasile, Roberto Tron, Calin Belta
+
+
+ In this paper, we tackle the challenging problem of delayed rewards in
+reinforcement learning (RL). While Proximal Policy Optimization (PPO) has
+emerged as a leading Policy Gradient method, its performance can degrade under
+delayed rewards. We introduce two key enhancements to PPO: a hybrid policy
+architecture that combines an offline policy (trained on expert demonstrations)
+with an online PPO policy, and a reward shaping mechanism using Time Window
+Temporal Logic (TWTL). The hybrid architecture leverages offline data
+throughout training while maintaining PPO's theoretical guarantees. Building on
+the monotonic improvement framework of Trust Region Policy Optimization (TRPO),
+we prove that our approach ensures improvement over both the offline policy and
+previous iterations, with a bounded performance gap of
+$(2\varsigma\gamma\alpha^2)/(1-\gamma)^2$, where $\alpha$ is the mixing
+parameter, $\gamma$ is the discount factor, and $\varsigma$ bounds the expected
+advantage. Additionally, we prove that our TWTL-based reward shaping preserves
+the optimal policy of the original problem. TWTL enables formal translation of
+temporal objectives into immediate feedback signals that guide learning. We
+demonstrate the effectiveness of our approach through extensive experiments on
+inverted pendulum and lunar lander environments, showing improvements in
+both learning speed and final performance compared to standard PPO and
+offline-only approaches.
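+
+The stated performance-gap bound is easy to evaluate numerically. The helper
+below is a small sketch (the function and argument names are hypothetical)
+that computes $(2\varsigma\gamma\alpha^2)/(1-\gamma)^2$ for given values.
+

```python
def performance_gap_bound(varsigma, gamma, alpha):
    """Evaluate 2 * varsigma * gamma * alpha^2 / (1 - gamma)^2."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("discount factor gamma must lie in [0, 1)")
    return 2.0 * varsigma * gamma * alpha ** 2 / (1.0 - gamma) ** 2
```

+
+Because the bound is quadratic in the mixing parameter $\alpha$, halving
+$\alpha$ shrinks it by a factor of four, while it grows quickly as $\gamma$
+approaches 1.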
+
+
+
+
+
+
+
+ ♻ ☆ Go beyond End-to-End Training: Boosting Greedy Local Learning with
+ Context Supply
+
+
+ Traditional end-to-end (E2E) training of deep networks necessitates storing
+intermediate activations for back-propagation, resulting in a large memory
+footprint on GPUs and restricted model parallelization. As an alternative,
+greedy local learning partitions the network into gradient-isolated modules and
+trains them with supervision based on local preliminary losses, thereby providing
+asynchronous and parallel training methods that substantially reduce memory
+cost. However, empirical experiments reveal that as the number of
+gradient-isolated modules increases, the performance of the local
+learning scheme degrades substantially, severely limiting its scalability. To
+avoid this issue, we theoretically analyze the greedy local learning from the
+standpoint of information theory and propose a ContSup scheme, which
+incorporates context supply between isolated modules to compensate for
+information loss. Experiments on benchmark datasets (i.e. CIFAR, SVHN, STL-10)
+achieve SOTA results and indicate that our proposed method can significantly
+improve the performance of greedy local learning with minimal memory and
+computational overhead, allowing the number of isolated modules to be
+increased. Our code is available at https://github.com/Tab-ct/ContSup.
+
+
+
+ comment: 9 figures, 12 tables
+
+
+
+
+
+
+ ♻ ☆ A Fast Convergence Theory for Offline Decision Making
+
+
+ This paper proposes the first generic fast convergence result in general
+function approximation for offline decision making problems, which include
+offline reinforcement learning (RL) and off-policy evaluation (OPE) as special
+cases. To unify different settings, we introduce a framework called Decision
+Making with Offline Feedback (DMOF), which captures a wide range of offline
+decision making problems. Within this framework, we propose a simple yet
+powerful algorithm called Empirical Decision with Divergence (EDD), whose upper
+bound is expressed through a coefficient named the Empirical Offline Estimation
+Coefficient (EOEC). We show that EOEC is instance-dependent and actually
+measures the correlation of the problem. Assuming partial coverage in the
+dataset, EOEC decreases at a rate of $1/N$, where $N$ is the size of the
+dataset, endowing EDD with a fast convergence guarantee. Finally, we complement
+the above results with a lower bound in the DMOF framework, which further
+demonstrates the soundness of our theory.
+
+
+
+
+
+
+
+ ♻ ☆ Decoupling Dark Knowledge via Block-wise Logit Distillation for
+ Feature-level Alignment
+
+
+ Knowledge Distillation (KD), a learning manner with a larger teacher network
+guiding a smaller student network, transfers dark knowledge from the teacher to
+the student via logits or intermediate features, with the aim of producing a
+well-performing lightweight model. Notably, many subsequent feature-based KD
+methods outperformed the earliest logit-based KD method and iteratively
+generated numerous state-of-the-art distillation methods. Nevertheless, recent
+work has uncovered the potential of the logit-based method, bringing the simple
+KD form based on logits back into the limelight. Features or logits? They
+partially implement the KD with entirely distinct perspectives; therefore,
+choosing between logits and features is not straightforward. This paper
+provides a unified perspective of feature alignment in order to obtain a better
+comprehension of their fundamental distinction. Inheriting the design
+philosophy and insights of feature-based and logit-based methods, we introduce
+a block-wise logit distillation framework to apply implicit logit-based feature
+alignment by gradually replacing the teacher's blocks as intermediate
+stepping-stone models to bridge the gap between the student and the teacher.
+Our method obtains comparable or superior results to state-of-the-art
+distillation methods. This paper demonstrates the great potential of combining
+logits and features, and we hope it will inspire future research to revisit KD
+from a higher vantage point.
+
+
+ Thermodynamic integration (TI) offers a rigorous method for estimating
+free-energy differences by integrating over a sequence of interpolating
+conformational ensembles. However, TI calculations are computationally
+expensive and typically limited to coupling a small number of degrees of
+freedom due to the need to sample numerous intermediate ensembles with
+sufficient conformational-space overlap. In this work, we propose to perform TI
+along an alchemical pathway represented by a trainable neural network, which we
+term Neural TI. Critically, we parametrize a time-dependent Hamiltonian
+interpolating between the interacting and non-interacting systems, and optimize
+its gradient using a score matching objective. The ability of the resulting
+energy-based diffusion model to sample all intermediate ensembles allows us to
+perform TI from a single reference calculation. We apply our method to
+Lennard-Jones fluids, where we report accurate calculations of the excess
+chemical potential, demonstrating that Neural TI reproduces the underlying
+changes in free energy without the need for simulations at interpolating
+Hamiltonians.
+
+
+
+
+
+
+
+ ♻ ☆ Denoising: A Powerful Building-Block for Imaging, Inverse Problems, and
+ Machine Learning
+
+
+ Denoising, the process of reducing random fluctuations in a signal to
+emphasize essential patterns, has been a fundamental problem of interest since
+the dawn of modern scientific inquiry. Recent denoising techniques,
+particularly in imaging, have achieved remarkable success, nearing theoretical
+limits by some measures. Yet, despite tens of thousands of research papers, the
+wide-ranging applications of denoising beyond noise removal have not been fully
+recognized. This is partly due to the vast and diverse literature, making a
+clear overview challenging.
+ This paper aims to address this gap. We present a clarifying perspective on
+denoisers, their structure, and desired properties. We emphasize the increasing
+importance of denoising and showcase its evolution into an essential building
+block for complex tasks in imaging, inverse problems, and machine learning.
+Despite its long history, the community continues to uncover unexpected and
+groundbreaking uses for denoising, further solidifying its place as a
+cornerstone of scientific and engineering practice.
+
+
+ Reinforcement learning from human feedback (RLHF) plays a crucial role in
+aligning language models with human preferences. While the significance of
+dataset quality is generally recognized, explicit investigations into its
+impact within the RLHF framework, to our knowledge, have been limited. This
+paper addresses the issue of text quality within the preference dataset by
+focusing on direct preference optimization (DPO), an increasingly adopted
+reward-model-free RLHF method. We confirm that text quality significantly
+influences the performance of models optimized with DPO more than those
+optimized with reward-model-based RLHF. Building on this new insight, we
+propose an extension of DPO, termed filtered direct preference optimization
+(fDPO). fDPO uses a trained reward model to monitor the quality of texts within
+the preference dataset during DPO training. Samples of lower quality are
+discarded based on comparisons with texts generated by the model being
+optimized, resulting in a more accurate dataset. Experimental results
+demonstrate that fDPO enhances the final model performance. Our code is
+available at https://github.com/CyberAgentAILab/filtered-dpo.
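The filtering step described above might look like the following minimal sketch. It assumes that a preference pair is dropped when the reward model scores its chosen response below a response sampled from the current policy. That criterion is one reading of the abstract, and `filter_preference_pairs`, `generate`, and `reward` are hypothetical names, not the API of the linked repository.

```python
from typing import Callable, List, Tuple

# (prompt, chosen response, rejected response)
Pair = Tuple[str, str, str]


def filter_preference_pairs(
    pairs: List[Pair],
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
) -> List[Pair]:
    """Keep only pairs whose chosen response scores at least as high as a policy sample."""
    kept = []
    for prompt, chosen, rejected in pairs:
        sample = generate(prompt)  # text from the model being optimized
        if reward(prompt, chosen) >= reward(prompt, sample):
            kept.append((prompt, chosen, rejected))
    return kept


if __name__ == "__main__":
    # Toy stand-ins: the "policy" always answers with three words and the
    # "reward model" simply counts words in the response.
    toy_pairs = [("q1", "one two three four", "x"), ("q2", "one", "x y")]
    toy_generate = lambda prompt: "a b c"
    toy_reward = lambda prompt, response: float(len(response.split()))
    print(filter_preference_pairs(toy_pairs, toy_generate, toy_reward))
```

In a training loop, such a filter would presumably be re-applied periodically, since the policy's samples improve as DPO progresses.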
+
+
+
+
+
+
+
+
+ Xiaoyan Xing, Konrad Groh, Sezer Karaoglu, Theo Gevers, Anand Bhattad
+
+
+ We introduce LumiNet, a novel architecture that leverages generative models
+and latent intrinsic representations for effective lighting transfer. Given a
+source image and a target lighting image, LumiNet synthesizes a relit version
+of the source scene that captures the target's lighting. Our approach makes two
+key contributions: a data curation strategy from the StyleGAN-based relighting
+model for our training, and a modified diffusion-based ControlNet that
+processes both latent intrinsic properties from the source image and latent
+extrinsic properties from the target image. We further improve lighting
+transfer through a learned adaptor (MLP) that injects the target's latent
+extrinsic properties via cross-attention and fine-tuning.
+ Unlike traditional ControlNet, which generates images with conditional maps
+from a single scene, LumiNet processes latent representations from two
+different images - preserving geometry and albedo from the source while
+transferring lighting characteristics from the target. Experiments demonstrate
+that our method successfully transfers complex lighting phenomena including
+specular highlights and indirect illumination across scenes with varying
+spatial layouts and materials, outperforming existing approaches on challenging
+indoor scenes using only images as input.
+
+
+
+
+
+
+
+
+ Zakaria Patel, Sebastian J. Wetzel
+
+
+ It has been demonstrated in many scientific fields that artificial neural
+networks like autoencoders or Siamese networks encode meaningful concepts in
+their latent spaces. However, there does not exist a comprehensive framework
+for retrieving this information in a human-readable form without prior
+knowledge. In order to extract these concepts, we introduce a framework for
+finding closed-form interpretations of neurons in latent spaces of artificial
+neural networks. The interpretation framework is based on embedding trained
+neural networks into an equivalence class of functions that encode the same
+concept. We interpret these neural networks by finding an intersection between
+the equivalence class and human-readable equations defined by a symbolic search
+space. The approach is demonstrated by retrieving invariants of matrices and
+conserved quantities of dynamical systems from latent spaces of Siamese neural
+networks.
+
+
+
+
+
+
+
+
+ Juan Sebastian Rojas, Chi-Guhn Lee
+
+
+ Average-reward Markov decision processes (MDPs) provide a foundational
+framework for sequential decision-making under uncertainty. However,
+average-reward MDPs have remained largely unexplored in reinforcement learning
+(RL) settings, with the majority of RL-based efforts having been allocated to
+episodic and discounted MDPs. In this work, we study a unique structural
+property of average-reward MDPs and utilize it to introduce Reward-Extended
+Differential (or RED) reinforcement learning: a novel RL framework that can be
+used to effectively and efficiently solve various subtasks simultaneously in
+the average-reward setting. We introduce a family of RED learning algorithms
+for prediction and control, including proven-convergent algorithms for the
+tabular case. We then showcase the power of these algorithms by demonstrating
+how they can be used to learn a policy that optimizes, for the first time, the
+well-known conditional value-at-risk (CVaR) risk measure in a fully-online
+manner, without the use of an explicit bi-level optimization scheme or an
+augmented state-space.
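As a reminder of the risk measure named above, the lower-tail CVaR at level alpha of a reward distribution is the expected reward over its worst alpha-fraction of outcomes. The sketch below computes a simple empirical estimate from samples. It does not implement the RED algorithms, and `empirical_cvar` is an illustrative name.

```python
import math
from typing import Sequence


def empirical_cvar(rewards: Sequence[float], alpha: float) -> float:
    """Mean of the worst ceil(alpha * n) rewards (lower-tail CVaR estimate)."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    ordered = sorted(rewards)  # ascending, so the worst outcomes come first
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k


if __name__ == "__main__":
    samples = [5, -3, 2, 8, -1, 4, 7, 0]
    print(empirical_cvar(samples, 0.25))  # mean of the two worst rewards, -3 and -1
```

With alpha = 1 the estimate reduces to the ordinary sample mean, which recovers the risk-neutral objective.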
+
+
+ Reinforcement Learning (RL), a subfield of Artificial Intelligence (AI),
+focuses on training agents to make decisions by interacting with their
+environment to maximize cumulative rewards. This paper provides an overview of
+RL, covering its core concepts, methodologies, and resources for further
+learning. It offers a thorough explanation of fundamental components such as
+states, actions, policies, and reward signals, ensuring readers develop a solid
+foundational understanding. Additionally, the paper presents a variety of RL
+algorithms, categorized by key factors such as model-free, model-based,
+value-based, and policy-based approaches, among others. Resources for
+learning and implementing RL, such as books, courses, and online communities
+are also provided. By offering a clear, structured introduction, this paper
+aims to simplify the complexities of RL for beginners, providing a
+straightforward pathway to understanding.
+
+
+
+ comment: 19 pages
+
+
+
+
+
+
+ ♻ ☆ Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic
+ Vision-language Context Sparsification
+
+
+ Multimodal Large Language Models (MLLMs) have achieved remarkable success in
+vision understanding, reasoning, and interaction. However, the inference
+computation and memory increase progressively with the generation of output
+tokens during decoding, directly affecting the efficacy of MLLMs. Existing
+methods attempt to reduce the vision context redundancy to achieve efficient
+MLLMs. Unfortunately, the efficiency benefits of the vision context reduction
+in the prefill stage gradually diminish during the decoding stage. To address
+this problem, we propose a dynamic vision-language context sparsification
+framework Dynamic-LLaVA, which dynamically reduces the redundancy of vision
+context in the prefill stage and decreases the memory and computation overhead
+of the generated language context during decoding. Dynamic-LLaVA designs a
+tailored sparsification inference scheme for different inference modes, i.e.,
+prefill, decoding with and without KV cache, to achieve efficient inference of
+MLLMs. In practice, Dynamic-LLaVA can reduce computation consumption by
+$\sim$75\% in the prefill stage. Meanwhile, throughout the entire generation
+process of MLLMs, Dynamic-LLaVA reduces computation consumption by $\sim$50\%
+under decoding without KV cache, while saving $\sim$50\% GPU memory overhead
+when decoding with KV cache, due to the vision-language context sparsification.
+Extensive experiments also demonstrate that Dynamic-LLaVA achieves efficient
+inference for MLLMs with negligible understanding and generation ability
+degradation or even performance gains compared to the full-context inference
+baselines. Code is available at https://github.com/Osilly/dynamic_llava .
+
+
+
+ comment: Code is available at https://github.com/Osilly/dynamic_llava
+
+
+
+
+
+
+
+ Koen Minartz, Fleur Hendriks, Simon Martinus Koop, Alessandro Corbetta, Vlado Menkovski
+
+
+ Understanding the dynamics of pedestrian crowds is an outstanding challenge
+crucial for designing efficient urban infrastructure and ensuring safe crowd
+management. To this end, both small-scale laboratory and large-scale real-world
+measurements have been used. However, these approaches respectively lack
+statistical resolution and parametric controllability, both essential to
+discovering physical relationships underlying the complex stochastic dynamics
+of crowds. Here, we establish an investigation paradigm that offers
+laboratory-like controllability, while ensuring the statistical resolution of
+large-scale real-world datasets. Using our data-driven Neural Crowd Simulator
+(NeCS), which we train on large-scale data and validate against key statistical
+features of crowd dynamics, we show that we can perform effective surrogate
+crowd dynamics experiments without training on specific scenarios. We not only
+reproduce known experimental results on pairwise avoidance, but also uncover
+the vision-guided and topological nature of N-body interactions. These findings
+show how virtual experiments based on neural simulation enable data-driven
+scientific discovery.
+
+
+
+ comment: 26 pages, 6 figures
+
+
+
+
+
+
+ ♻ ☆ Paired Autoencoders for Likelihood-free Estimation in Inverse Problems
+
+
+
+
+
+
+
+
+ Matthias Chung, Emma Hart, Julianne Chung, Bas Peters, Eldad Haber
+
+
+ We consider the solution of nonlinear inverse problems where the forward
+problem is a discretization of a partial differential equation. Such problems
+are notoriously difficult to solve in practice and require minimizing a
+combination of a data-fit term and a regularization term. The main
+computational bottleneck of typical algorithms is the direct estimation of the
+data misfit. Therefore, likelihood-free approaches have become appealing
+alternatives. Nonetheless, difficulties in generalization and limitations in
+accuracy have hindered their broader utility and applicability. In this work,
+we use a paired autoencoder framework as a likelihood-free estimator for
+inverse problems. We show that the use of such an architecture allows us to
+construct a solution efficiently and to overcome some known open problems when
+using likelihood-free estimators. In particular, our framework can assess the
+quality of the solution and improve on it if needed. We demonstrate the
+viability of our approach using examples from full waveform inversion and
+inverse electromagnetic imaging.
+
+
+
+ comment: 18 pages, 6 figures
+
+
+
+
+
+
+ ♻ ☆ Fast and reliable uncertainty quantification with neural network
+ ensembles for industrial image classification
+
+
+ Image classification with neural networks (NNs) is widely used in industrial
+processes, situations where the model likely encounters unknown objects during
+deployment, i.e., out-of-distribution (OOD) data. Worryingly, NNs tend to make
+confident yet incorrect predictions when confronted with OOD data. To increase
+the models' reliability, they should quantify the uncertainty in their own
+predictions, communicating when the output should (not) be trusted. Deep
+ensembles, composed of multiple independent NNs, have been shown to perform
+strongly but are computationally expensive. Recent research has proposed more
+efficient NN ensembles, namely the snapshot, batch, and multi-input
+multi-output ensemble. This study investigates the predictive and uncertainty
+performance of efficient NN ensembles in the context of image classification
+for industrial processes. It is the first to provide a comprehensive comparison
+and it proposes a novel Diversity Quality metric to quantify the ensembles'
+performance on the in-distribution and OOD sets in one single metric. The
+results highlight the batch ensemble as a cost-effective and competitive
+alternative to the deep ensemble. It matches the deep ensemble in both
+uncertainty and accuracy while exhibiting considerable savings in training
+time, test time, and memory storage.
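One common way an ensemble quantifies uncertainty is to average the members' class probabilities and take the entropy of the result, so that disagreement between members raises the score. The sketch below shows that idea in general terms. It is not the paper's Diversity Quality metric, and the function names are illustrative.

```python
import math
from typing import List, Sequence


def ensemble_predict(member_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-class probabilities predicted by each ensemble member."""
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    return [sum(p[c] for p in member_probs) / n_members for c in range(n_classes)]


def predictive_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    agreeing = [[0.9, 0.1], [0.9, 0.1]]      # members agree: low uncertainty
    disagreeing = [[0.9, 0.1], [0.1, 0.9]]   # members disagree: high uncertainty
    for name, members in (("agreeing", agreeing), ("disagreeing", disagreeing)):
        mean = ensemble_predict(members)
        print(name, mean, round(predictive_entropy(mean), 3))
```

A threshold on this entropy is one simple rule for flagging inputs whose prediction should not be trusted.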
+
+
+
+ comment: Submitted to Annals of Operations Research
+
+
+
+
+
+
+
+ Jan van Delden, Julius Schultz, Christopher Blech, Sabine C. Langer, Timo Lüddecke
+
+
+ In mechanical structures like airplanes, cars and houses, noise is generated
+and transmitted through vibrations. To take measures to reduce this noise,
+vibrations need to be simulated with expensive numerical computations. Deep
+learning surrogate models present a promising alternative to classical
+numerical simulations as they can be evaluated orders of magnitude faster, while
+trading off accuracy. To quantify such trade-offs systematically and foster the
+development of methods, we present a benchmark on the task of predicting the
+vibration of harmonically excited plates. The benchmark features a total of
+12,000 plate geometries with varying forms of beadings, material, boundary
+conditions, load position and sizes with associated numerical solutions. To
+address the benchmark task, we propose a new network architecture, named
+Frequency-Query Operator, which predicts vibration patterns of plate geometries
+given a specific excitation frequency. Applying principles from operator
+learning and implicit models for shape encoding, our approach effectively
+addresses the prediction of highly variable frequency response functions
+occurring in dynamic systems. To quantify the prediction quality, we introduce
+a set of evaluation metrics and evaluate the method on our vibrating-plates
+benchmark. Our method outperforms DeepONets, Fourier Neural Operators and more
+traditional neural network architectures and can be used for design
+optimization. Code, dataset and visualizations:
+https://github.com/ecker-lab/Learning_Vibrating_Plates
+
+
+
+
+
+
+
+
+ Mauricio Tec, Ana Trisovic, Michelle Audirac, Sophie Woodward, Jie Kate Hu, Naeem Khoshnevis, Francesca Dominici
+
+
+ Spatial confounding poses a significant challenge in scientific studies
+involving spatial data, where unobserved spatial variables can influence both
+treatment and outcome, possibly leading to spurious associations. To address
+this problem, we introduce SpaCE: The Spatial Confounding Environment, the
+first toolkit to provide realistic benchmark datasets and tools for
+systematically evaluating causal inference methods designed to alleviate
+spatial confounding. Each dataset includes training data, true counterfactuals,
+a spatial graph with coordinates, and smoothness and confounding scores
+characterizing the effect of a missing spatial confounder. It also includes
+realistic semi-synthetic outcomes and counterfactuals, generated using
+state-of-the-art machine learning ensembles, following best practices for
+causal inference benchmarks. The datasets cover real treatment and covariates
+from diverse domains, including climate, health and social sciences. SpaCE
+facilitates an automated end-to-end pipeline, simplifying data loading,
+experimental setup, and evaluating machine learning and causal inference
+models. The SpaCE project provides several dozen datasets of diverse sizes
+and spatial complexity. It is publicly available as a Python package,
+encouraging community feedback and contributions.
+
+
+
+
+
+
+
+ ♻ ☆ A Probabilistic Perspective on Unlearning and Alignment for Large
+ Language Models
+
+
+
+
+
+
+
+
+ Yan Scholten, Stephan Günnemann, Leo Schwinn
+
+
+ Comprehensive evaluation of Large Language Models (LLMs) is an open research
+problem. Existing evaluations rely on deterministic point estimates generated
+via greedy decoding. However, we find that deterministic evaluations fail to
+capture the whole output distribution of a model, yielding inaccurate
+estimations of model capabilities. This is particularly problematic in critical
+contexts such as unlearning and alignment, where precise model evaluations are
+crucial. To remedy this, we introduce the first formal probabilistic evaluation
+framework in LLMs. Namely, we derive novel metrics with high-probability
+guarantees concerning the output distribution of a model. Our metrics are
+application-independent and allow practitioners to make more reliable estimates
+about model capabilities before deployment. Through a case study focused on
+unlearning, we reveal that deterministic evaluations falsely indicate
+successful unlearning, whereas our probabilistic evaluations demonstrate that
+most if not all of the supposedly unlearned information remains accessible in
+these models. Additionally, we propose a novel unlearning loss based on entropy
+optimization and adaptive temperature scaling, which significantly improves
+unlearning in probabilistic settings on recent benchmarks. Our proposed shift
+from point estimates to probabilistic evaluations of output distributions
+represents an important step toward comprehensive evaluations of LLMs. Code
+available at https://github.com/yascho/probabilistic-unlearning.
+
+
+
+
+
+
+
+ ♻ ☆ Harnessing Preference Optimisation in Protein LMs for Hit Maturation in
+ Cell Therapy
+
+
+
+
+
+
+
+
+ Katarzyna Janocha, Annabel Ling, Alice Godson, Yulia Lampi, Simon Bornschein, Nils Y. Hammerla
+
+
+ Cell and immunotherapy offer transformative potential for treating diseases
+like cancer and autoimmune disorders by modulating the immune system. The
+development of these therapies is resource-intensive, with the majority of drug
+candidates failing to progress beyond laboratory testing. While recent advances
+in machine learning have revolutionised areas such as protein engineering,
+applications in immunotherapy remain limited due to the scarcity of
+large-scale, standardised datasets and the complexity of cellular systems. In
+this work, we address these challenges by leveraging a high-throughput
+experimental platform to generate data suitable for fine-tuning protein
+language models. We demonstrate how models fine-tuned using a preference task
+show surprising correlations to biological assays, and how they can be
+leveraged for few-shot hit maturation in CARs. This proof-of-concept presents a
+novel pathway for applying ML to immunotherapy and could generalise to other
+therapeutic modalities.
+
+
+
+
+
+
+
+ ♻ ☆ Supervised Multiple Kernel Learning approaches for multi-omics data
+ integration
+
+
+ Advances in high-throughput technologies have led to an ever-increasing
+availability of omics datasets. The integration of multiple heterogeneous data
+sources is currently an issue for biology and bioinformatics. Multiple kernel
+learning (MKL) has been shown to be a flexible and valid approach to consider the
+diverse nature of multi-omics inputs, despite being an underused tool in
+genomic data mining. We provide novel MKL approaches based on different kernel
+fusion strategies. To learn from the meta-kernel of input kernels, we adapted
+unsupervised integration algorithms for supervised tasks with support vector
+machines. We also tested deep learning architectures for kernel fusion and
+classification. The results show that MKL-based models can outperform more
+complex, state-of-the-art, supervised multi-omics integrative approaches.
+Multiple kernel learning offers a natural framework for predictive models in
+multi-omics data. It proved to provide a fast and reliable solution that can
+compete with and outperform more complex architectures. Our results offer a
+direction for bio-data mining research, biomarker discovery and further
+development of methods for heterogeneous data integration.
+
+
+
+
+
+
+
+ ♻ ☆ The Descriptive Complexity of Graph Neural Networks
+
+
+ We analyse the power of graph neural networks (GNNs) in terms of Boolean
+circuit complexity and descriptive complexity.
+ We prove that the graph queries that can be computed by a polynomial-size
+bounded-depth family of GNNs are exactly those definable in the guarded
+fragment GFO+C of first-order logic with counting and with built-in relations.
+This puts GNNs in the circuit complexity class (non-uniform) $\text{TC}^0$.
+Remarkably, the GNN families may use arbitrary real weights and a wide class of
+activation functions that includes the standard ReLU, logistic "sigmoid", and
+hyperbolic tangent functions. If the GNNs are allowed to use random
+initialisation and global readout (both standard features of GNNs widely used
+in practice), they can compute exactly the same queries as bounded depth
+Boolean circuits with threshold gates, that is, exactly the queries in
+$\text{TC}^0$.
+ Moreover, we show that queries computable by a single GNN with piecewise
+linear activations and rational weights are definable in GFO+C without built-in
+relations. Therefore, they are contained in uniform $\text{TC}^0$.
+
+
+
+ comment: Journal version for TheoretiCS
+
+
+
+
+
+
+ ♻ ☆ Training for Speech Recognition on Coprocessors
+
+
+
+
+
+
+
+
+ Sebastian Baunsgaard, Sebastian B. Wrede, Pınar Tozun
+
+
+ Automatic Speech Recognition (ASR) has increased in popularity in recent
+years. The evolution of processor and storage technologies has enabled more
+advanced ASR mechanisms, fueling the development of virtual assistants such as
+Amazon Alexa, Apple Siri, Microsoft Cortana, and Google Home. The interest in
+such assistants, in turn, has amplified the novel developments in ASR research.
+However, despite this popularity, there has not been a detailed training
+efficiency analysis of modern ASR systems. This mainly stems from: the
+proprietary nature of many modern applications that depend on ASR, like the
+ones listed above; the relatively expensive co-processor hardware that is used
+to accelerate ASR by big vendors to enable such applications; and the absence
+of well-established benchmarks. The goal of this paper is to address the latter
+two of these challenges. The paper first describes an ASR model, based on a
+deep neural network inspired by recent work in this domain, and our experiences
+building it. Then we evaluate this model on three CPU-GPU co-processor
+platforms that represent different budget categories. Our results demonstrate
+that utilizing hardware acceleration yields good results even without high-end
+equipment. While the most expensive platform (10X price of the least expensive
+one) converges to the initial accuracy target 10-30% and 60-70% faster than the
+other two, the differences among the platforms almost disappear at slightly
+higher accuracy targets. In addition, our results further highlight both the
+difficulty of evaluating ASR systems due to the complex, long, and resource
+intensive nature of the model training in this domain, and the importance of
+establishing benchmarks for ASR.
+
+
+ Purpose: As visual inspection is an inherent process during radiological
+screening, the associated eye gaze data can provide valuable insights into
+relevant clinical decisions. As deep learning has become the state-of-the-art
+for computer-assisted diagnosis, integrating human behavior, such as eye gaze
+data, into these systems is instrumental to help align machine predictions with
+clinical diagnostic criteria, thus enhancing the quality of automatic
+radiological diagnosis. Methods: We propose a novel deep learning framework for
+joint disease diagnosis and prediction of corresponding clinical visual
+attention maps for chest X-ray scans. Specifically, we introduce a new
+dual-encoder multi-task UNet, which leverages both a DenseNet201 backbone and a
+Residual and Squeeze-and-Excitation block-based encoder to extract diverse
+features for visual attention map prediction, and a multi-scale feature-fusion
+classifier to perform disease classification. To tackle the issue of
+asynchronous training schedules of individual tasks in multi-task learning, we
+propose a multi-stage cooperative learning strategy, with contrastive learning
+for feature encoder pretraining to boost performance. Results: Our proposed
+method is shown to significantly outperform existing techniques for chest X-ray
+diagnosis (AUC=0.93) and the quality of visual attention map prediction
+(Correlation coefficient=0.58). Conclusion: Benefiting from the proposed
+multi-task multi-stage cooperative learning, our technique demonstrates the
+benefit of integrating clinicians' eye gaze into clinical AI systems to boost
+performance and potentially explainability.
+
+
+ Monitoring blood pressure (BP) with non-invasive sensors has gained popularity
+for its comfortable user experience and is a significant function of smart
+wearables. Despite this comfort, such methods demand a significant amount of
+realistic data to train an individual model for each subject, a burden made
+heavier by the invasive or obtrusive BP ground-truth measurements. To tackle
+this challenge,
+we introduce a novel physics-informed temporal network (PITN) with adversarial
+contrastive learning to enable precise BP estimation with very limited data.
+Specifically, we first enhance the physics-informed neural network (PINN) with
+the temporal block for investigating BP dynamics' multi-periodicity for
+personal cardiovascular cycle modeling and temporal variation. We then employ
+adversarial training to generate extra physiological time series data,
+improving PITN's robustness in the face of sparse subject-specific training
+data. Furthermore, we utilize contrastive learning to capture the
+discriminative variations of cardiovascular physiologic phenomena. This
+approach aggregates physiological signals with similar blood pressure values in
+latent space while separating clusters of samples with dissimilar blood
+pressure values. Experiments on three widely-adopted datasets with different
+modalities (i.e., bioimpedance, PPG, millimeter-wave) demonstrate the
+superiority and effectiveness of the proposed methods over previous
+state-of-the-art approaches. The code is available
+at https://github.com/Zest86/ACL-PITN.
+
+
+
+ comment: 12 pages, 6 figures
+
+
+
+
+
+
+ ♻ ☆ Flow Matching for Accelerated Simulation of Atomic Transport in
+ Materials
+
+
+ We introduce LiFlow, a generative framework to accelerate molecular dynamics
+(MD) simulations for crystalline materials that formulates the task as
+conditional generation of atomic displacements. The model uses flow matching,
+with a Propagator submodel to generate atomic displacements and a Corrector to
+locally correct unphysical geometries, and incorporates an adaptive prior based
+on the Maxwell-Boltzmann distribution to account for chemical and thermal
+conditions. We benchmark LiFlow on a dataset comprising 25-ps trajectories of
+lithium diffusion across 4,186 solid-state electrolyte (SSE) candidates at four
+temperatures. The model obtains a consistent Spearman rank correlation of
+0.7-0.8 for lithium mean squared displacement (MSD) predictions on unseen
+compositions. Furthermore, LiFlow generalizes from short training trajectories
+to larger supercells and longer simulations while maintaining high accuracy.
+With speed-ups of up to 600,000$\times$ compared to first-principles methods,
+LiFlow enables scalable simulations at significantly larger length and time
+scales.
+
+
+
+
+
+
+
+ ♻ ☆ Detection and Imputation based Two-Stage Denoising Diffusion Power
+ System Measurement Recovery under Cyber-Physical Uncertainties
+
+
+ Power system cyber-physical uncertainties, including measurement ambiguities
+stemming from cyber attacks and data losses, along with system uncertainties
+introduced by massive renewables and complex dynamics, reduce the likelihood of
+enhancing the quality of measurements. Fortunately, denoising diffusion models
+exhibit powerful learning and generation abilities for the complex underlying
+physics of the real world. To this end, this paper proposes an improved
+detection and imputation based two-stage denoising diffusion model (TSDM) to
+identify and reconstruct the measurements with various cyber-physical
+uncertainties. The first stage of the model comprises a classifier-guided
+conditional anomaly detection component, while the second stage involves
+a diffusion-based measurement imputation component. Moreover, the proposed TSDM
+adopts optimal variance to accelerate the diffusion generation process with
+subsequence sampling. Extensive numerical case studies demonstrate that the
+proposed TSDM can accurately recover power system measurements despite
+renewables-induced strong randomness and highly nonlinear dynamics.
+Additionally, the proposed TSDM has stronger robustness compared to existing
+reconstruction networks and exhibits lower computational complexity than
+general denoising diffusion models.
+
+
+
+
+
+
+
+ ♻ ☆ Latent Diffusion Model-Enabled Low-Latency Semantic Communication in the
+ Presence of Semantic Ambiguities and Wireless Channel Noises
+
+
+
+
+
+
+
+
+ Jianhua Pei, Cheng Feng, Ping Wang, Hina Tabassum, Dongyuan Shi
+
+
+ Deep learning (DL)-based Semantic Communications (SemCom) is becoming
+critical to maximizing the overall efficiency of communication networks.
+Nevertheless, SemCom is sensitive to wireless channel uncertainties and source
+outliers, and suffers from poor generalization. To address the
+mentioned challenges, this paper develops a latent diffusion model-enabled
+SemCom system with three key contributions, i.e., i) to handle potential
+outliers in the source data, semantic errors obtained by projected gradient
+descent based on the vulnerabilities of DL models, are utilized to update the
+parameters and obtain an outlier-robust encoder, ii) a lightweight single-layer
+latent space transformation adapter completes one-shot learning at the
+transmitter and is placed before the decoder at the receiver, enabling
+adaptation for out-of-distribution data and enhancing human-perceptual quality,
+and iii) an end-to-end consistency distillation (EECD) strategy is used to
+distill the diffusion models trained in latent space, enabling deterministic
+single or few-step low-latency denoising in various noisy channels while
+maintaining high semantic quality. Extensive numerical experiments across
+different datasets demonstrate the superiority of the proposed SemCom system,
+consistently proving its robustness to outliers, the capability to transmit
+data with unknown distributions, and the ability to perform real-time channel
+denoising tasks while preserving high human perceptual quality, outperforming
+the existing denoising approaches in semantic metrics like learned perceptual
+image patch similarity (LPIPS).
+
+
+
+
+
+
+
+ ♻ ☆ Interpolation and differentiation of alchemical degrees of freedom in
+ machine learning interatomic potentials
+
+
+ Machine learning interatomic potentials (MLIPs) have become a workhorse of
+modern atomistic simulations, and recently published universal MLIPs,
+pre-trained on large datasets, have demonstrated remarkable accuracy and
+generalizability. However, the computational cost of MLIPs limits their
+applicability to chemically disordered systems requiring large simulation cells
+or to sample-intensive statistical methods. Here, we report the use of
+continuous and differentiable alchemical degrees of freedom in atomistic
+materials simulations, exploiting the fact that graph neural network MLIPs
+represent discrete elements as real-valued tensors. The proposed method
+introduces alchemical atoms with corresponding weights into the input graph,
+alongside modifications to the message-passing and readout mechanisms of MLIPs,
+and allows smooth interpolation between the compositional states of materials.
+The end-to-end differentiability of MLIPs enables efficient calculation of the
+gradient of energy with respect to the compositional weights. With this
+modification, we propose methodologies for optimizing the composition of solid
+solutions towards target macroscopic properties, characterizing order and
+disorder in multicomponent oxides, and conducting alchemical free energy
+simulations to quantify the free energy of vacancy formation and composition
+changes. The approach offers an avenue for extending the capabilities of
+universal MLIPs in the modeling of compositional disorder and characterizing
+the phase stability of complex materials systems.
+
+
+
+
+
+
+
+ ♻ ☆ Governance of Generative Artificial Intelligence for Companies
+
+
+
+
+
+
+
+
+ Johannes Schneider, Pauline Kuss, Rene Abraham, Christian Meske
+
+
+ Generative Artificial Intelligence (GenAI), specifically large language
+models like ChatGPT, has swiftly entered organizations without adequate
+governance, posing both opportunities and risks. Despite extensive debates on
+GenAI's transformative nature and regulatory measures, limited research
+addresses organizational governance, encompassing technical and business
+perspectives. Although numerous frameworks for governance of AI exist, it is
+not clear to what extent they apply to GenAI. Our review paper fills this gap
+by surveying recent works with the purpose of better understanding fundamental
+characteristics of GenAI and adjusting prior frameworks specifically towards
+GenAI governance within companies. To do so, it extends Nickerson's framework
+development processes to include prior conceptualizations. Our framework
+outlines the scope, objectives, and governance mechanisms tailored to harness
+business opportunities as well as mitigate risks associated with GenAI
+integration. Our research contributes a focused approach to GenAI governance,
+offering practical insights for companies navigating the challenges of GenAI
+adoption and highlighting research gaps.
+
+
+
+
+
+
+
+ ♻ ☆ LLM-ABBA: Understanding time series via symbolic approximation
+
+
+ The success of large language models (LLMs) for time series has been
+demonstrated in previous work. Utilizing a symbolic time series representation,
+one can efficiently bridge the gap between LLMs and time series. However, the
+remaining challenge is to exploit the semantic information hidden in time
+series by using symbols or existing tokens of LLMs, while aligning the
+embedding space of LLMs according to the hidden information of time series. The
+symbolic time series approximation (STSA) method called adaptive Brownian
+bridge-based symbolic aggregation (ABBA) shows outstanding efficacy in
+preserving salient time series features by modeling time series patterns in
+terms of amplitude and period while using existing tokens of LLMs.
+ In this paper, we introduce a method, called LLM-ABBA, that integrates ABBA
+into large language models for various downstream time series tasks. By
+symbolizing time series, LLM-ABBA compares favorably to the recent
+state-of-the-art (SOTA) in UCR and three medical time series classification
+tasks. Meanwhile, a fixed-polygonal chain trick in ABBA is introduced to
+avoid obvious drifting during prediction tasks by significantly mitigating
+the effects of cumulative error arising from misused symbols during the
+transition from symbols to numerical values. In time series regression tasks,
+LLM-ABBA achieves the new SOTA on Time Series Extrinsic Regression (TSER)
+benchmarks. LLM-ABBA also shows competitive prediction capability compared to
+recent SOTA time series prediction results. We believe this framework can also
+seamlessly extend to other time series tasks.
+
+
+
+
+
+
+
+ ♻ ☆ BInD: Bond and Interaction-generating Diffusion Model for
+ Multi-objective Structure-based Drug Design
+
+
+
+
+
+
+
+
+ Joongwon Lee, Wonho Zhung, Jisu Seo, Woo Youn Kim
+
+
+ A remarkable advance in geometric deep generative models with accumulated
+structural data enables structure-based drug design (SBDD) with target protein
+information only. However, most existing models struggle to address
+multi-objectives simultaneously while performing well only in their specialized
+tasks. Here, we present BInD, a diffusion model with knowledge-based guidance
+for multi-objective SBDD. BInD is designed to co-generate molecules and their
+interactions with a target protein to consider all key objectives equally well,
+including target-specific interactions, molecular properties, and local
+geometry. Comprehensive evaluations show that BInD achieves robust performance
+for all objectives while outperforming or matching state-of-the-art methods for
+each. Finally, we propose a train-free optimization method empowered by
+retrieving target-specific interactions, highlighting the role of non-covalent
+interactions in achieving higher selectivity and binding affinities to a target
+protein.
+
+
+
+
+
+
+
+ ♻ ☆ Puzzle: Distillation-Based NAS for Inference-Optimized LLMs
+
+
+ Large language models (LLMs) have demonstrated remarkable capabilities, but
+their adoption is limited by high computational costs during inference. While
+increasing parameter counts enhances accuracy, it also widens the gap between
+state-of-the-art capabilities and practical deployability. We present Puzzle, a
+framework to accelerate LLM inference on specific hardware while preserving
+their capabilities. Through an innovative application of neural architecture
+search (NAS) at an unprecedented scale, Puzzle systematically optimizes models
+with tens of billions of parameters under hardware constraints. Our approach
+utilizes blockwise local knowledge distillation (BLD) for parallel architecture
+exploration and employs mixed-integer programming for precise constraint
+optimization.
+ We demonstrate the real-world impact of our framework through
+Llama-3.1-Nemotron-51B-Instruct (Nemotron-51B), a publicly available model
+derived from Llama-3.1-70B-Instruct. Nemotron-51B achieves a 2.17x inference
+throughput speedup, fitting on a single NVIDIA H100 GPU while preserving 98.4%
+of the original model's capabilities. Nemotron-51B currently stands as the most
+accurate language model capable of inference on a single GPU with large batch
+sizes. Remarkably, this transformation required just 45B training tokens,
+compared to over 15T tokens used for the 70B model it was derived from. This
+establishes a new paradigm where powerful models can be optimized for efficient
+deployment with only negligible compromise of their capabilities, demonstrating
+that inference performance, not parameter count alone, should guide model
+selection. With the release of Nemotron-51B and the presentation of the Puzzle
+framework, we provide practitioners immediate access to state-of-the-art
+language modeling capabilities at significantly reduced computational costs.
+
+
+
+
+
+
+
+ ♻ ☆ Re-examining learning linear functions in context
+
+
+ In-context learning (ICL) is an attractive method of solving a wide range of
+problems. Inspired by Garg et al. (2022), we look closely at ICL in a variety
+of train and test settings for several transformer models of different sizes
+trained from scratch. Our study complements prior work by pointing out several
+systematic failures of these models to generalize to data not in the training
+distribution, thereby showing some limitations of ICL. We find that models
+adopt a strategy for this task that is very different from standard solutions.
+
+
+ Graph-based representations and message-passing modular policies constitute
+prominent approaches to tackling composable control problems in reinforcement
+learning (RL). However, as shown by recent graph deep learning literature, such
+local message-passing operators can create information bottlenecks and hinder
+global coordination. The issue becomes more serious in tasks requiring
+high-level planning. In this work, we propose a novel methodology, named Feudal
+Graph Reinforcement Learning (FGRL), that addresses such challenges by relying
+on hierarchical RL and a pyramidal message-passing architecture. In particular,
+FGRL defines a hierarchy of policies where high-level commands are propagated
+from the top of the hierarchy down through a layered graph structure. The
+bottom layers mimic the morphology of the physical system, while the upper
+layers correspond to higher-order sub-modules. The resulting agents are then
+characterized by a committee of policies where actions at a certain level set
+goals for the level below, thus implementing a hierarchical decision-making
+structure that can naturally implement task decomposition. We evaluate the
+proposed framework on a graph clustering problem and MuJoCo locomotion tasks;
+simulation results show that FGRL compares favorably against relevant
+baselines. Furthermore, an in-depth analysis of the command propagation
+mechanism provides evidence that the introduced message-passing scheme favors
+learning hierarchical decision-making policies.
+
+
+
+
+
+
+
+ ♻ ☆ OceanCastNet: A Deep Learning Ocean Wave Model with Energy Conservation
+
+
+ Traditional wave forecasting models, although based on energy conservation
+equations, are computationally expensive. On the other hand, existing deep
+learning geophysical fluid models, while computationally efficient, often
+suffer from issues such as energy dissipation in long-term forecasts. This
+paper proposes a novel energy-balanced deep learning wave forecasting model
+called OceanCastNet (OCN). By incorporating wind fields at the current,
+previous, and future time steps, as well as wave fields at the current and
+previous time steps as input variables, OCN maintains energy balance within the
+model. Furthermore, the model employs adaptive Fourier operators as its core
+components and designs a masked loss function to better handle the impact of
+land-sea boundaries. A series of experiments on the ERA5 dataset demonstrate
+that OCN can achieve short-term forecast accuracy comparable to traditional
+models while exhibiting an understanding of the wave generation process. In
+comparative experiments under both normal and extreme conditions, OCN
+consistently outperforms the widely used WaveWatch III model in the industry.
+Even after long-term forecasting, OCN maintains a stable and energy-rich state.
+By further constructing a simple meteorological model, OCN-wind, which
+considers energy balance, this paper confirms the importance of energy
+constraints for improving the long-term forecast performance of deep learning
+meteorological models. This finding provides new ideas for future research on
+deep learning geophysical fluid models.
+
+
+
+
+
+
+
+ ♻ ☆ FairML: A Julia Package for Fair Classification
+
+
+
+
+
+
+
+
+ Jan Pablo Burgard, João Vitor Pamplona
+
+
+ In this paper, we propose FairML.jl, a Julia package providing a framework
+for fair classification in machine learning. In this framework, the fair
+learning process is divided into three stages. Each stage aims to reduce
+unfairness, such as disparate impact and disparate mistreatment, in the final
+prediction. For the preprocessing stage, we present a resampling method that
+addresses unfairness coming from data imbalances. The in-processing phase
+consists of a classification method. This can be either one coming from the
+MLJ.jl package, or a user-defined one. For this phase, we incorporate fair ML
+methods that can handle unfairness to a certain degree through their
+optimization process. In the post-processing, we discuss the choice of the
+cut-off value for fair prediction. With simulations, we show the performance of
+the single phases and their combinations.
+
+
+
+ comment: 25 pages, 8 figures
+
+
+
+
+
+
+ ♻ ☆ ASTM: Autonomous Smart Traffic Management System Using Artificial
+ Intelligence CNN and LSTM
+
+
+ In the modern world, the development of Artificial Intelligence (AI) has
+contributed to improvements in various areas, including automation, computer
+vision, fraud detection, and more. AI can be leveraged to enhance the
+efficiency of Autonomous Smart Traffic Management (ASTM) systems and reduce
+traffic congestion rates. This paper presents an Autonomous Smart Traffic
+Management (STM) system that uses AI to improve traffic flow rates. The system
+employs the YOLO V5 Convolutional Neural Network to detect vehicles in traffic
+management images. Additionally, it predicts the number of vehicles for the
+next 12 hours using a Recurrent Neural Network with Long Short-Term Memory
+(RNN-LSTM). The Smart Traffic Management Cycle Length Analysis manages the
+traffic cycle length based on these vehicle predictions, aided by AI. From the
+results of the RNN-LSTM model for predicting vehicle numbers over the next 12
+hours, we observe that the model predicts traffic with a Mean Squared Error
+(MSE) of 4.521 vehicles and a Root Mean Squared Error (RMSE) of 2.232 vehicles.
+After simulating the STM system in the CARLA simulation environment, we found
+that the Traffic Management Congestion Flow Rate with ASTM (21 vehicles per
+minute) is 50% higher than the rate without STM (around 15 vehicles per
+minute). Additionally, the Traffic Management Vehicle Pass Delay with STM (5
+seconds per vehicle) is 70% lower than without STM (around 12 seconds per
+vehicle). These results demonstrate that the STM system using AI can increase
+traffic flow by 50% and reduce vehicle pass delays by 70%.
+
+
+
+ comment: In process to IEEE Intelligent Vehicle Symposium 2025
+
+
+
+
+
+
+ ♻ ☆ Equation-informed data-driven identification of flow budgets and
+ dynamics
+
+
+ Computational Fluid Dynamics (CFD) is an indispensable method of fluid
+modelling in engineering applications, reducing the need for physical
+prototypes and testing for tasks such as design optimisation and performance
+analysis. Depending on the complexity of the system under consideration, models
+ranging from low to high fidelity can be used for prediction, allowing
+significant speed-up. However, the choice of model requires information about
+the actual dynamics of the flow regime. Correctly identifying the
+regions/clusters of flow that share the same dynamics has been a challenging
+research topic to date. In this study, we propose a novel hybrid approach to
+flow clustering. It consists of characterising each sample point of the system
+with equation-based features, i.e. features are budgets that represent the
+contribution of each term from the original governing equation to the local
+dynamics at each sample point. This was achieved by applying the Sparse
+Identification of Nonlinear Dynamical systems (SINDy) method pointwise to time
+evolution data. The method proceeds with equation-based clustering using the
+Girvan-Newman algorithm. This allows the detection of communities that share
+the same physical dynamics. The algorithm is implemented in both Eulerian and
+Lagrangian frameworks. In the Lagrangian, i.e. dynamic approach, the clustering
+is performed on the trajectory of each point, allowing the change of clusters
+to be represented also in time. The performance of the algorithm is first
+tested on a flow around a cylinder. The construction of the dynamic clusters in
+this test case clearly shows the evolution of the wake from the steady state
+solution through the transient to the oscillatory solution. Dynamic clustering
+was then successfully tested on turbulent flow data. Two distinct and
+well-defined clusters were identified and their temporal evolution was
+reconstructed.
+
+
+
+
+
+
+
+ ♻ ☆ Bigger, Regularized, Optimistic: scaling for compute and
+ sample-efficient continuous control NeurIPS 2024
+
+
+
+
+
+
+
+
+ Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miłoś, Marek Cygan
+
+
+ Sample efficiency in Reinforcement Learning (RL) has traditionally been
+driven by algorithmic enhancements. In this work, we demonstrate that scaling
+can also lead to substantial improvements. We conduct a thorough investigation
+into the interplay of scaling model capacity and domain-specific RL
+enhancements. These empirical findings inform the design choices underlying our
+proposed BRO (Bigger, Regularized, Optimistic) algorithm. The key innovation
+behind BRO is that strong regularization allows for effective scaling of the
+critic networks, which, paired with optimistic exploration, leads to superior
+performance. BRO achieves state-of-the-art results, significantly outperforming
+the leading model-based and model-free algorithms across 40 complex tasks from
+the DeepMind Control, MetaWorld, and MyoSuite benchmarks. BRO is the first
+model-free algorithm to achieve near-optimal policies in the notoriously
+challenging Dog and Humanoid tasks.
+
+
+
+ comment: NeurIPS 2024 Spotlight
+
+
+
+
+
+
+ ♻ ☆ Multi-objective Deep Learning: Taxonomy and Survey of the State of the
+ Art
+
+
+ Simultaneously considering multiple objectives in machine learning has been a
+popular approach for several decades, with various benefits for multi-task
+learning, the consideration of secondary goals such as sparsity, or
+multicriteria hyperparameter tuning. However - as multi-objective optimization
+is significantly more costly than single-objective optimization - the recent
+focus on deep learning architectures poses considerable additional challenges
+due to the very large number of parameters, strong nonlinearities and
+stochasticity. This survey covers recent advancements in the area of
+multi-objective deep learning. We introduce a taxonomy of existing methods -
+based on the type of training algorithm as well as the decision maker's needs -
+before listing recent advancements, and also successful applications. All three
+main learning paradigms (supervised learning, unsupervised learning, and
+reinforcement learning) are covered, and we also address the recently very
+popular area of generative modeling.
+
+
+
+
+
+
+
+ ♻ ☆ Normalizing self-supervised learning for provably reliable Change Point
+ Detection
+
+
+ Change point detection (CPD) methods aim to identify abrupt shifts in the
+distribution of input data streams. Accurate estimators for this task are
+crucial across various real-world scenarios. Yet, traditional unsupervised CPD
+techniques face significant limitations, often relying on strong assumptions or
+suffering from low expressive power due to inherent model simplicity. In
+contrast, representation learning methods overcome these drawbacks by offering
+flexibility and the ability to capture the full complexity of the data without
+imposing restrictive assumptions. However, these approaches are still emerging
+in the CPD field and lack robust theoretical foundations to ensure their
+reliability. Our work addresses this gap by integrating the expressive power of
+representation learning with the groundedness of traditional CPD techniques. We
+adopt spectral normalization (SN) for deep representation learning in CPD tasks
+and prove that the embeddings after SN are highly informative for CPD. Our
+method significantly outperforms current state-of-the-art methods during the
+comprehensive evaluation via three standard CPD datasets.
+
+
+
+
+
+
+
+ ♻ ☆ Samba: Simple Hybrid State Space Models for Efficient Unlimited Context
+ Language Modeling
+
+
+ Efficiently modeling sequences with infinite context length has long been a
+challenging problem. Previous approaches have either suffered from quadratic
+computational complexity or limited extrapolation ability in length
+generalization. In this work, we present Samba, a simple hybrid architecture
+that layer-wise combines Mamba, a selective State Space Model (SSM), with
+Sliding Window Attention (SWA). Samba selectively compresses a given sequence
+into recurrent hidden states while still maintaining the ability to precisely
+recall recent memories with the attention mechanism. We scale Samba up to 3.8B
+parameters with 3.2T training tokens and demonstrate that it significantly
+outperforms state-of-the-art models across a variety of benchmarks. Pretrained
+on sequences of 4K length, Samba shows improved perplexity in context lengths
+of up to 1M in zero-shot. When finetuned on 4K-length sequences, Samba
+efficiently extrapolates to a 256K context length with perfect memory recall on
+the Passkey Retrieval task, and exhibits superior retrieval extrapolation on
+the challenging Phonebook task compared to full-attention models. As a
+linear-time sequence model, Samba achieves a 3.73x higher throughput compared
+to Transformers with grouped-query attention for user prompts of 128K length,
+and a 3.64x speedup when generating 64K tokens with unlimited streaming. Our
+code for training on open source data is publicly available at
+https://github.com/microsoft/Samba.
+
+
+
+
+
+
+
+ ♻ ☆ Learning from Reduced Labels for Long-Tailed Data
+
+
+ Long-tailed data is prevalent in real-world classification tasks and heavily
+relies on supervised information, which makes the annotation process
+exceptionally labor-intensive and time-consuming. Unfortunately, despite being
+a common approach to mitigate labeling costs, existing weakly supervised
+learning methods struggle to adequately preserve supervised information for
+tail samples, resulting in a decline in accuracy for the tail classes. To
+alleviate this problem, we introduce a novel weakly supervised labeling setting
+called Reduced Label. The proposed labeling setting not only avoids the decline
+of supervised information for the tail samples, but also decreases the labeling
+costs associated with long-tailed data. Additionally, we propose a
+straightforward and highly efficient unbiased framework with strong theoretical
+guarantees to learn from these Reduced Labels. Extensive experiments conducted
+on benchmark datasets including ImageNet validate the effectiveness of our
+approach, surpassing the performance of state-of-the-art weakly supervised
+methods.
+
+
+
+ comment: 11 pages, 3 figures
+
+
+
+
+
+
+ ♻ ☆ Demystifying Language Model Forgetting with Low-rank Example
+ Associations
+
+
+ Large language models (LLMs) suffer from forgetting of upstream data when
+fine-tuned. Despite efforts to mitigate forgetting, few have investigated
+whether, and how, forgotten upstream examples depend on and are associated
+with newly learned tasks. Insights on such associations enable efficient and
+targeted mitigation of forgetting. In this paper, we empirically analyze
+forgetting (measured in log-perplexity increase) that occurs in $N$ upstream
+examples of language modeling or instruction-tuning after fine-tuning LLMs on
+one of $M$ new tasks, visualized in $M\times N$ matrices. We demonstrate that
+the matrices display simple low-rank patterns, often well-approximated with
+multiplicative scalar effects of upstream examples and newly learned tasks. We
+also examine fine-grained associations with visualization and statistics.
+Leveraging the low-rank nature of the associations, we predict forgetting of
+upstream examples when fine-tuning on unseen tasks with matrix completion over
+the empirical associations. This enables fast identification of most forgotten
+examples without expensive inference on the entire upstream data. The approach,
+despite simplicity, outperforms prior approaches that learn semantic
+relationships of learned tasks and upstream examples with LMs for predicting
+forgetting. We demonstrate the practical utility of our analysis by showing
+statistically significantly reduced forgetting as we upweight predicted
+examples for replay at fine-tuning. Project page:
+https://inklab.usc.edu/lm-forgetting-prediction/
+
+
+
+ comment: 10 pages; preprint
+
+
+
+
+
+
+ ♻ ☆ AutoGuide: Automated Generation and Selection of Context-Aware
+ Guidelines for Large Language Model Agents
+
+
+
+
+
+
+
+
+ Yao Fu, Dong-Ki Kim, Jaekyeom Kim, Sungryull Sohn, Lajanugen Logeswaran, Kyunghoon Bae, Honglak Lee
+
+
+ Recent advances in large language models (LLMs) have empowered AI agents
+capable of performing various sequential decision-making tasks. However,
+effectively guiding LLMs to perform well in unfamiliar domains like web
+navigation, where they lack sufficient knowledge, has proven to be difficult
+with the demonstration-based in-context learning paradigm. In this paper, we
+introduce a novel framework, called AutoGuide, which addresses this limitation
+by automatically generating context-aware guidelines from offline experiences.
+Importantly, each context-aware guideline is expressed in concise natural
+language and follows a conditional structure, clearly describing the context
+where it is applicable. As a result, our guidelines facilitate the provision of
+relevant knowledge for the agent's current decision-making process, overcoming
+the limitations of the conventional demonstration-based learning paradigm. Our
+evaluation demonstrates that AutoGuide significantly outperforms competitive
+baselines in complex benchmark domains, including real-world web navigation.
+
+
+
+
+
+
+
+
+ Dingwen Zhang, Yan Li, De Cheng, Nannan Wang, Junwei Han
+
+
+ To facilitate the evolution of edge intelligence in ever-changing
+environments, we study on-device incremental learning under limited
+computational resources in this paper. Current on-device training methods
+focus only on efficient training without considering catastrophic forgetting,
+preventing the model from getting stronger when continually exploring the world. To
+solve this problem, a direct solution is to incorporate existing incremental
+learning mechanisms into the on-device training framework. Unfortunately, such
+an approach does not work well, as those mechanisms usually introduce large
+additional computational cost to the network optimization process, which would
+inevitably exceed the memory capacity of the edge devices. To address this
+issue, this paper makes an early effort to propose a simple but effective
+edge-friendly incremental learning framework. Based on an empirical study on
+the knowledge intensity of the kernel elements of the neural network, we find
+that the center kernel is the key for maximizing the knowledge intensity for
+learning new data, while freezing the other kernel elements would get a good
+balance on the model's capacity for overcoming catastrophic forgetting. Upon
+this finding, we further design a center-sensitive kernel optimization
+framework to largely alleviate the cost of the gradient computation and
+back-propagation. Besides, a dynamic channel element selection strategy is also
+proposed to facilitate a sparse orthogonal gradient projection for further
+reducing the optimization complexity, upon the knowledge explored from the new
+task data. Extensive experiments validate that our method is efficient and
+effective, e.g., our method achieves an average accuracy boost of 38.08% with even
+less memory and approximate computation compared to existing on-device training
+methods, indicating its significant potential for on-device incremental
+learning.
+
+
+
+
+
+
+
+ ♻ ☆ VISION-XL: High Definition Video Inverse Problem Solver using Latent
+ Image Diffusion Models
+
+
+ In this paper, we propose a novel framework for solving high-definition video
+inverse problems using latent image diffusion models. Building on recent
+advancements in spatio-temporal optimization for video inverse problems using
+image diffusion models, our approach leverages latent-space diffusion models to
+achieve enhanced video quality and resolution. To address the high
+computational demands of processing high-resolution frames, we introduce a
+pseudo-batch consistent sampling strategy, allowing efficient operation on a
+single GPU. Additionally, to improve temporal consistency, we present
+batch-consistent inversion, an initialization technique that incorporates
+informative latents from the measurement frame. By integrating with SDXL, our
+framework achieves state-of-the-art video reconstruction across a wide range of
+spatio-temporal inverse problems, including complex combinations of frame
+averaging and various spatial degradations, such as deblurring,
+super-resolution, and inpainting. Unlike previous methods, our approach
+supports multiple aspect ratios (landscape, vertical, and square) and delivers
+HD-resolution reconstructions (exceeding 1280x720) in under 2.5 minutes on a
+single NVIDIA 4090 GPU.
+
+
+ We study the integrability of two-dimensional theories that are obtained by a
+dimensional reduction of certain four-dimensional gravitational theories
+describing the coupling of Maxwell fields and neutral scalar fields to gravity
+in the presence of a potential for the neutral scalar fields. For a certain
+solution subspace, we demonstrate partial integrability by showing that a
+subset of the equations of motion in two dimensions are the compatibility
+conditions for a linear system. Subsequently, we study the integrability of
+these two-dimensional models from a complementary one-dimensional point of
+view, framed in terms of Liouville integrability. In this endeavour, we employ
+various machine learning techniques to systematise our search for numerical Lax
+pair matrices for these models, as well as conserved currents expressed as
+functions of phase space variables.
+
+
+
+
+
+
+
+
+ Shuang Cui, Kai Han, Jing Tang, Xueying Li, Aakas Zhiyuli, Hanxiao Li
+
+
+ Submodular maximization has found extensive applications in various domains
+within the field of artificial intelligence, including but not limited to
+machine learning, computer vision, and natural language processing. With the
+increasing size of datasets in these domains, there is a pressing need to
+develop efficient and parallelizable algorithms for submodular maximization.
+One measure of the parallelizability of a submodular maximization algorithm is
+its adaptive complexity, which indicates the number of sequential rounds where
+a polynomial number of queries to the objective function can be executed in
+parallel. In this paper, we study the problem of non-monotone submodular
+maximization subject to a knapsack constraint, and propose the first
+combinatorial algorithm achieving an $(8+\epsilon)$-approximation under
+$\mathcal{O}(\log n)$ adaptive complexity, which is optimal up to a
+factor of $\mathcal{O}(\log\log n)$. Moreover, we also propose the first
+algorithm with both provable approximation ratio and sublinear adaptive
+complexity for the problem of non-monotone submodular maximization subject to a
+$k$-system constraint. As a by-product, we show that our two algorithms can
+also be applied to the special case of submodular maximization subject to a
+cardinality constraint, and achieve performance bounds comparable with those of
+state-of-the-art algorithms. Finally, the effectiveness of our approach is
+demonstrated by extensive experiments on real-world applications.
+
+
+
+ comment: Part of the contribution appears in AAAI-2023
+
+
+
+
+
+
+ ♻ ☆ Guardian of the Ensembles: Introducing Pairwise Adversarially Robust
+ Loss for Resisting Adversarial Attacks in DNN Ensembles WACV 2025
+
+
+ Adversarial attacks rely on transferability, where an adversarial example
+(AE) crafted on a surrogate classifier tends to mislead a target classifier.
+Recent ensemble methods demonstrate that AEs are less likely to mislead
+multiple classifiers in an ensemble. This paper proposes a new ensemble
+training using a Pairwise Adversarially Robust Loss (PARL) that by construction
+produces an ensemble of classifiers with diverse decision boundaries. PARL
+utilizes outputs and gradients of each layer with respect to network parameters
+in every classifier within the ensemble simultaneously. PARL is demonstrated to
+achieve higher robustness against black-box transfer attacks than previous
+ensemble methods as well as adversarial training without adversely affecting
+clean example accuracy. Extensive experiments using standard ResNet20 and
+WideResNet28-10 classifiers demonstrate the robustness of PARL against
+state-of-the-art adversarial attacks. While maintaining similar clean accuracy
+and requiring less training time, the proposed architecture achieves a 24.8% increase in
+robust accuracy ($\epsilon$ = 0.07) over the state-of-the-art method.
+
+
+
+ comment: Accepted at IEEE/CVF Winter Conference on Applications of Computer
+ Vision (WACV 2025)
+
+
+
+
+
+
+
+ Yuejiang Liu, Jubayer Ibn Hamid, Annie Xie, Yoonho Lee, Maximilian Du, Chelsea Finn
+
+
+ Predicting and executing a sequence of actions without intermediate
+replanning, known as action chunking, is increasingly used in robot learning
+from human demonstrations. Yet, its reported effects on the learned policy are
+inconsistent: some studies find it crucial for achieving strong results, while
+others observe decreased performance. In this paper, we first dissect how
+action chunking impacts the divergence between a learner and a demonstrator. We
+find that action chunking allows the learner to better capture the temporal
+dependencies in demonstrations but at the cost of reduced reactivity in
+stochastic environments. To address this tradeoff, we propose Bidirectional
+Decoding (BID), a test-time inference algorithm that bridges action chunking
+with closed-loop operations. BID samples multiple predictions at each time step
+and searches for the optimal one based on two criteria: (i) backward coherence,
+which favors samples that align with previous decisions; (ii) forward contrast,
+which seeks samples of high likelihood for future plans. By coupling decisions
+within and across action chunks, BID promotes consistency over time while
+maintaining reactivity to unexpected changes. Experimental results show that
+BID boosts the performance of two state-of-the-art generative policies across
+seven simulation benchmarks and two real-world tasks. Code and videos are
+available at https://bid-robot.github.io.
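The selection step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_chunk`, the squared-distance measure, and the particular form of the forward-contrast term (close to a set of `positives`, far from a set of `negatives`) are all assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length action sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_chunk(candidates, prev_tail, positives, negatives):
    """Return the candidate chunk with the lowest combined score.

    candidates, positives, negatives: lists of chunks (lists of floats).
    prev_tail: the not-yet-executed actions of the previously chosen chunk.
    All scoring rules below are illustrative assumptions.
    """
    def mean_dist(chunk, samples):
        return sum(sq_dist(chunk, s) for s in samples) / len(samples)

    def score(chunk):
        # Backward coherence: agree with what was already planned.
        backward = sq_dist(chunk[:len(prev_tail)], prev_tail)
        # Forward contrast (assumed form): near positives, far from negatives.
        forward = mean_dist(chunk, positives) - mean_dist(chunk, negatives)
        return backward + forward

    return min(candidates, key=score)
```

In a closed-loop setting one would presumably call such a function at every time step, execute only the first action of the returned chunk, and pass the remainder in as `prev_tail` on the next step.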
+
+
+ Large language models (LLMs) are reported to be partial to certain cultures
+owing to the dominance of English corpora in the training data. Since
+multilingual cultural data are often expensive to collect, existing efforts
+handle this by prompt engineering or culture-specific pre-training. However,
+they might overlook the knowledge deficiency of low-resource cultures and
+require extensive computing resources. In this paper, we propose CultureLLM, a
+cost-effective solution to incorporate cultural differences into LLMs.
+CultureLLM adopts World Value Survey (WVS) as seed data and generates
+semantically equivalent training data via the proposed semantic data
+augmentation. Using only 50 seed samples from WVS with augmented data, we
+fine-tune culture-specific LLMs and one unified model (CultureLLM-One) for 9
+cultures covering rich and low-resource languages. Extensive experiments on 60
+culture-related datasets demonstrate that CultureLLM significantly outperforms
+various counterparts such as GPT-3.5 (by 8.1%) and Gemini Pro (by 9.5%) with
+comparable performance to GPT-4 or even better. Our human study shows that the
+generated samples are semantically equivalent to the original samples,
+providing an effective solution for LLM augmentation. Code is released at
+https://github.com/Scarelette/CultureLLM.
+
+
+
+ comment: NeurIPS 2024; Code is at https://github.com/Scarelette/CultureLLM
+
+
+
+
+
+
+ ♻ ☆ Harmful Fine-tuning Attacks and Defenses for Large Language Models: A
+ Survey
+
+
+
+
+
+
+
+
+ Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
+
+
+ Recent research demonstrates that the nascent fine-tuning-as-a-service
+business model exposes serious safety concerns -- fine-tuning on a few
+harmful data samples uploaded by users can compromise the safety alignment of the
+model. The attack, known as the harmful fine-tuning attack, has raised broad
+research interest in the community. However, as the attack is still new,
+we observe that there are general misunderstandings within the research
+community. To clear up these concerns, this paper provides a comprehensive overview of
+three aspects of harmful fine-tuning: attack settings, defense design, and
+evaluation methodology. Specifically, we first present the threat model of the
+problem, and introduce the harmful fine-tuning attack and its variants. Then we
+systematically survey the existing literature on attacks/defenses/mechanical
+analysis of the problem. Finally, we introduce the evaluation methodology and
+outline future research directions that might contribute to the development of
+the field. Additionally, we present a list of questions of interest, which
+might be useful to refer to when reviewers in the peer review process question
+the realism of the experiment/attack/defense setting. A curated list of
+relevant papers is maintained and made accessible at:
+https://github.com/git-disl/awesome_LLM-harmful-fine-tuning-papers.
+
+
+
+
+
+
+
+ ♻ ☆ Intelligent Go-Explore: Standing on the Shoulders of Giant Foundation
+ Models
+
+
+ Go-Explore is a powerful family of algorithms designed to solve
+hard-exploration problems built on the principle of archiving discovered
+states, and iteratively returning to and exploring from the most promising
+states. This approach has led to superhuman performance across a wide variety
+of challenging problems including Atari games and robotic control, but requires
+manually designing heuristics to guide exploration (i.e., determine which
+states to save and explore from, and what actions to consider next), which is
+time-consuming and infeasible in general. To resolve this, we propose
+Intelligent Go-Explore (IGE) which greatly extends the scope of the original
+Go-Explore by replacing these handcrafted heuristics with the intelligence and
+internalized human notions of interestingness captured by giant pretrained
+foundation models (FMs). This provides IGE with a human-like ability to
+instinctively identify how interesting or promising any new state is (e.g.,
+discovering new objects, locations, or behaviors), even in complex environments
+where heuristics are hard to define. Moreover, IGE offers the exciting
+opportunity to recognize and capitalize on serendipitous discoveries -- states
+encountered during exploration that are valuable in terms of exploration, yet
+where what makes them interesting was not anticipated by the human user. We
+evaluate our algorithm on a diverse range of language and vision-based tasks
+that require search and exploration. Across these tasks, IGE strongly exceeds
+classic reinforcement learning and graph search baselines, and also succeeds
+where prior state-of-the-art FM agents like Reflexion completely fail. Overall,
+Intelligent Go-Explore combines the tremendous strengths of FMs and the
+powerful Go-Explore algorithm, opening up a new frontier of research into
+creating more generally capable agents with impressive exploration
+capabilities.
+
+
+ Multi-agent reinforcement learning has demonstrated significant potential in
+addressing complex cooperative tasks across various real-world applications.
+However, existing MARL approaches often rely on the restrictive assumption that
+the number of entities (e.g., agents, obstacles) remains constant between
+training and inference. This overlooks scenarios where entities are dynamically
+removed or added during the inference trajectory -- a common occurrence in
+real-world environments like search and rescue missions and dynamic combat
+situations. In this paper, we tackle the challenge of intra-trajectory dynamic
+entity composition under zero-shot out-of-domain (OOD) generalization, where
+such dynamic changes cannot be anticipated beforehand. Our empirical studies
+reveal that existing MARL methods suffer significant performance degradation
+and increased uncertainty in these scenarios. In response, we propose
+FlickerFusion, a novel OOD generalization method that acts as a universally
+applicable augmentation technique for MARL backbone methods. FlickerFusion
+stochastically drops out parts of the observation space, emulating
+in-domain conditions when inference is run OOD. The results show that FlickerFusion not only
+achieves superior inference rewards but also uniquely reduces uncertainty
+vis-à-vis the backbone, compared to existing methods. Benchmarks,
+implementations, and model weights are organized and open-sourced at
+flickerfusion305.github.io, accompanied by ample demo video renderings.
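As a rough illustration of the "stochastically drops out parts of the observation space" idea, the sketch below randomly keeps only as many entity observations as were seen during training. The function name `drop_entities`, the per-entity list representation, and the uniform sampling rule are assumptions for illustration and are not taken from the paper.

```python
import random


def drop_entities(entity_obs, max_entities, rng=random):
    """Randomly keep at most `max_entities` entity observations.

    entity_obs: list with one observation per entity (any type).
    max_entities: entity count assumed to have been seen during training.
    rng: object with a `sample` method, e.g. random or random.Random(seed).
    """
    if len(entity_obs) <= max_entities:
        return list(entity_obs)
    # Sample indices without replacement, then restore the original order.
    keep = sorted(rng.sample(range(len(entity_obs)), max_entities))
    return [entity_obs[i] for i in keep]
```

Because the kept subset is resampled on every call, repeated calls would show the policy different subsets over time while each individual observation keeps a size the policy has seen before.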
+
+
+ Privacy-preserving federated learning (PPFL) aims to train a global model for
+multiple clients while maintaining their data privacy. However, current PPFL
+protocols exhibit one or more of the following insufficiencies: considerable
+degradation in accuracy, the requirement for sharing keys, and cooperation
+during the key generation or decryption processes. As a mitigation, we develop
+the first protocol that utilizes neural networks to implement PPFL and
+incorporates an Aggregatable Hybrid Encryption scheme tailored to the needs of
+PPFL. We name these networks Homomorphic Adversarial Networks (HANs), which
+demonstrate that neural networks are capable of performing tasks similar to
+multi-key homomorphic encryption (MK-HE) while solving the problems of key
+distribution and collaborative decryption. Our experiments show that HANs are
+robust against privacy attacks. Compared with non-private federated learning,
+experiments conducted on multiple datasets demonstrate that HANs exhibit a
+negligible accuracy loss (at most 1.35%). Compared to traditional MK-HE
+schemes, HANs increase encryption aggregation speed by 6,075 times while
+incurring a 29.2 times increase in communication overhead.
+
+
+ This study investigates privacy leakage in dimensionality reduction methods
+through a novel machine learning-based reconstruction attack. Employing an
+informed adversary threat model, we develop a neural network capable of
+reconstructing high-dimensional data from low-dimensional embeddings.
+ We evaluate six popular dimensionality reduction techniques: PCA, sparse
+random projection (SRP), multidimensional scaling (MDS), Isomap, t-SNE, and
+UMAP. Using both MNIST and NIH Chest X-ray datasets, we perform a qualitative
+analysis to identify key factors affecting reconstruction quality. Furthermore,
+we assess the effectiveness of an additive noise mechanism in mitigating these
+reconstruction attacks. Our experimental results on both datasets reveal that
+the attack is effective against deterministic methods (PCA and Isomap), but
+ineffective against methods that employ random initialization (SRP, MDS, t-SNE
+and UMAP). When large noise is added to the images before performing PCA or
+Isomap, the attack produces severely distorted reconstructions. In contrast,
+for the other four methods, the reconstructions still show some recognizable
+features, though they bear little resemblance to the original images.
+
+
+
+ comment: Major revision
+
+
+
+
+
+
+ ♻ ☆ A Physics-embedded Deep Learning Framework for Cloth Simulation
+
+
+ Delicate cloth simulations have long been desired in computer graphics.
+Various methods were proposed to improve engaged force interactions, collision
+handling, and numerical integrations. Deep learning has the potential to
+achieve fast and real-time simulation, but common neural network structures
+often demand many parameters to capture cloth dynamics. This paper proposes a
+physics-embedded learning framework that directly encodes physical features of
+cloth simulation. The convolutional neural network is used to represent spatial
+correlations of the mass-spring system, after which three branches are designed
+to learn linear, nonlinear, and time derivative features of cloth physics. The
+framework can also integrate with other external forces and collision handling
+through either traditional simulators or sub neural networks. The model is
+tested across different cloth animation cases, without training with new data.
+Agreement with baselines and predictive realism successfully validate its
+generalization ability. The inference efficiency of the proposed model also surpasses that of
+traditional physics simulation. This framework is also designed to easily
+integrate with other visual refinement techniques like wrinkle carving, which
+leaves significant room to incorporate prevailing machine learning techniques
+in 3D cloth animation.
+
+
+
+ comment: updated version
+
+
+
+
+
+
+ ♻ ☆ Yi-Lightning Technical Report
+
+
+
+
+
+
+
+
+ 01.AI: Alan Wake, Albert Wang, Bei Chen, C. X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qichen Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang
+
+
+ This technical report presents Yi-Lightning, our latest flagship large
+language model (LLM). It achieves exceptional performance, ranking 6th overall
+on Chatbot Arena, with particularly strong results (2nd to 4th place) in
+specialized categories including Chinese, Math, Coding, and Hard Prompts.
+Yi-Lightning leverages an enhanced Mixture-of-Experts (MoE) architecture,
+featuring advanced expert segmentation and routing mechanisms coupled with
+optimized KV-caching techniques. Our development process encompasses
+comprehensive pre-training, supervised fine-tuning (SFT), and reinforcement
+learning from human feedback (RLHF), where we devise deliberate strategies for
+multi-stage training, synthetic data construction, and reward modeling.
+Furthermore, we implement RAISE (Responsible AI Safety Engine), a
+four-component framework to address safety issues across pre-training,
+post-training, and serving phases. Empowered by our scalable super-computing
+infrastructure, all these innovations substantially reduce training, deployment
+and inference costs while maintaining high-performance standards. With further
+evaluations on public academic benchmarks, Yi-Lightning demonstrates
+competitive performance against top-tier LLMs, while we observe a notable
+disparity between traditional, static benchmark results and real-world, dynamic
+human preferences. This observation prompts a critical reassessment of
+conventional benchmarks' utility in guiding the development of more intelligent
+and powerful AI systems for practical applications. Yi-Lightning is now
+available through our developer platform at https://platform.lingyiwanwu.com.
+
+
+
+
+
+
+
+ ♻ ☆ A Comprehensive Study of Shapley Value in Data Analytics
+
+
+
+
+
+
+
+
+ Hong Lin, Shixin Wan, Zhongle Xie, Ke Chen, Meihui Zhang, Lidan Shou, Gang Chen
+
+
+ Over the recent years, Shapley value (SV), a solution concept from
+cooperative game theory, has found numerous applications in data analytics
+(DA). This paper provides the first comprehensive study of SV used throughout
+the DA workflow, which involves three main steps: data fabric, data
+exploration, and result reporting. We summarize existing versatile forms of SV
+used in these steps by a unified definition and clarify the essential
+functionalities that SV can provide for data scientists. We categorize existing studies
+in this field based on the technical challenges they tackled, which include
+computation efficiency, approximation error, privacy preservation, and
+appropriate interpretations. We discuss these challenges and analyze the
+corresponding solutions. We also implement SVBench, the first open-sourced
+benchmark for developing SV applications, and conduct experiments on six DA
+tasks to validate our analysis and discussions. Based on the qualitative and
+quantitative results, we identify the limitations of current efforts for
+applying SV to DA and highlight the directions of future research and
+engineering.
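For readers unfamiliar with the concept, the Shapley value of a player is its marginal contribution averaged over all orderings in which players can join the coalition. The sketch below computes it exactly by enumeration; the function name `shapley_values` and the toy value functions in the usage note are illustrative and unrelated to SVBench.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values by averaging marginal contributions over all orderings.

    players: list of hashable player identifiers.
    value: function mapping a frozenset of players to a number.
    Cost grows factorially with len(players), so this is only for tiny examples.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}
```

For an additive game such as `value = lambda s: sum(w[p] for p in s)`, each player's Shapley value equals its own weight `w[p]`. The factorial cost of the enumeration is one reason the computation-efficiency and approximation-error challenges named in the abstract arise.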
+
+
+
+
+
+
+
+ ♻ ☆ FSMLP: Modelling Channel Dependencies With Simplex Theory Based
+ Multi-Layer Perceptions In Frequency Domain
+
+
+
+
+
+
+
+
+ Zhengnan Li, Haoxuan Li, Hao Wang, Jun Fang, Duoyin Li, Yunxiao Qin
+
+
+ Time series forecasting (TSF) plays a crucial role in various domains,
+including web data analysis, energy consumption prediction, and weather
+forecasting. While Multi-Layer Perceptrons (MLPs) are lightweight and effective
+for capturing temporal dependencies, they are prone to overfitting when used to
+model inter-channel dependencies. In this paper, we investigate the overfitting
+problem in channel-wise MLPs using Rademacher complexity theory, revealing that
+extreme values in time series data exacerbate this issue. To mitigate this
+issue, we introduce a novel Simplex-MLP layer, where the weights are
+constrained within a standard simplex. This strategy encourages the model to
+learn simpler patterns, thereby reducing overfitting to extreme values.
+Based on the Simplex-MLP layer, we propose a novel Frequency
+Simplex MLP (FSMLP) framework for time series forecasting,
+comprising two kinds of modules: Simplex
+Channel-Wise MLP (SCWM) and Frequency
+Temporal MLP (FTM). The SCWM effectively leverages the
+Simplex-MLP to capture inter-channel dependencies, while the FTM is a simple
+yet efficient temporal MLP designed to extract temporal information from the
+data. Our theoretical analysis shows that the upper bound of the Rademacher
+Complexity for Simplex-MLP is lower than that for standard MLPs. Moreover, we
+validate our proposed method on seven benchmark datasets, demonstrating
+significant improvements in forecasting accuracy and efficiency, while also
+showcasing superior scalability. Additionally, we demonstrate that Simplex-MLP
+can improve other methods that use channel-wise MLP to achieve less overfitting
+and improved performance. Code is available at
+https://github.com/FMLYD/FSMLP.
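To make "weights constrained within a standard simplex" concrete, here is a minimal sketch of a channel-mixing step whose weight rows are non-negative and sum to one. Using a softmax to enforce the constraint is an assumption made here for illustration; the abstract does not say how the paper parameterizes the constraint, and the names `softmax` and `simplex_channel_mix` are hypothetical.

```python
import math


def softmax(row):
    """Map unconstrained parameters to a point on the standard simplex."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def simplex_channel_mix(raw_weights, x):
    """Mix input channels with simplex-constrained weights.

    raw_weights: [out_channels][in_channels] unconstrained parameters.
    x: [in_channels] values at one time (or frequency) index.
    Each output is a convex combination of the inputs, so it stays
    between the smallest and the largest input value.
    """
    weights = [softmax(row) for row in raw_weights]
    return [sum(w * v for w, v in zip(row, x)) for row in weights]
```

Because every output is a convex combination of the input channels, a single extreme value in one channel cannot be amplified by the layer, which is consistent with the overfitting argument sketched in the abstract.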
+
+
+
+
+
+
+
+ ♻ ☆ NüshuRescue: Revitalization of the endangered Nüshu Language with AI COLING 2025
+
+
+ The preservation and revitalization of endangered and extinct languages is a
+meaningful endeavor, conserving cultural heritage while enriching fields like
+linguistics and anthropology. However, these languages are typically
+low-resource, making their reconstruction labor-intensive and costly. This
+challenge is exemplified by Nüshu, a rare script historically used by Yao
+women in China for self-expression within a patriarchal society. To address
+this challenge, we introduce NüshuRescue, an AI-driven framework designed to
+train large language models (LLMs) on endangered languages with minimal data.
+NüshuRescue automates evaluation and expands target corpora to accelerate
+linguistic revitalization. As a foundational component, we developed NCGold, a
+500-sentence Nüshu-Chinese parallel corpus, the first publicly available
+dataset of its kind. Leveraging GPT-4-Turbo, with no prior exposure to Nüshu
+and only 35 short examples from NCGold, NüshuRescue achieved 48.69%
+translation accuracy on 50 withheld sentences and generated NCSilver, a set of
+98 newly translated modern Chinese sentences of varying lengths. A sample of
+both NCGold and NCSilver is included in the Supplementary Materials.
+Additionally, we developed FastText-based and Seq2Seq models to further support
+research on Nüshu. NüshuRescue provides a versatile and scalable tool for
+the revitalization of endangered languages, minimizing the need for extensive
+human input.
+
+
+
+ comment: Accepted to COLING 2025
+
+
+
+
+
+
+ ♻ ☆ CPRM: A LLM-based Continual Pre-training Framework for Relevance
+ Modeling in Commercial Search
+
+
+
+
+
+
+
+
+ Kaixin Wu, Yixin Ji, Zeyuan Chen, Qiang Wang, Cunxiang Wang, Hong Liu, Baijun Ji, Jia Xu, Zhongyi Liu, Jinjie Gu, Yuan Zhou, Linjian Mo
+
+
+ Relevance modeling between queries and items stands as a pivotal component in
+commercial search engines, directly affecting the user experience. Given the
+remarkable achievements of large language models (LLMs) in various natural
+language processing (NLP) tasks, LLM-based relevance modeling is gradually
+being adopted within industrial search systems. Nevertheless, foundational LLMs
+lack domain-specific knowledge and do not fully exploit the potential of
+in-context learning. Furthermore, structured item text remains underutilized,
+and there is a shortage in the supply of corresponding queries and background
+knowledge. We thereby propose CPRM (Continual Pre-training for Relevance
+Modeling), a framework designed for the continual pre-training of LLMs to
+address these issues. Our CPRM framework includes three modules: 1) employing
+both queries and multi-field item to jointly pre-train for enhancing domain
+knowledge, 2) applying in-context pre-training, a novel approach where LLMs are
+pre-trained on a sequence of related queries or items, and 3) conducting
+reading comprehension on items to produce associated domain knowledge and
+background information (e.g., generating summaries and corresponding queries)
+to further strengthen LLMs. Results on offline experiments and online A/B
+testing demonstrate that our model achieves convincing performance compared to
+strong baselines.
+
+
+
+
+
+
+
+ ♻ ☆ Interventional Causal Discovery in a Mixture of DAGs NeurIPS 2024
+
+
+
+
+
+
+
+
+ Burak Varıcı, Dmitriy Katz-Rogozhnikov, Dennis Wei, Prasanna Sattigeri, Ali Tajer
+
+
+ Causal interactions among a group of variables are often modeled by a single
+causal graph. In some domains, however, these interactions are best described
+by multiple co-existing causal graphs, e.g., in dynamical systems or genomics.
+This paper addresses the hitherto unknown role of interventions in learning
+causal interactions among variables governed by a mixture of causal systems,
+each modeled by one directed acyclic graph (DAG). Causal discovery from
+mixtures is fundamentally more challenging than single-DAG causal discovery.
+Two major difficulties stem from (i) an inherent uncertainty about the
+skeletons of the component DAGs that constitute the mixture and (ii) possibly
+cyclic relationships across these component DAGs. This paper addresses these
+challenges and aims to identify edges that exist in at least one component DAG
+of the mixture, referred to as the true edges. First, it establishes matching
+necessary and sufficient conditions on the size of interventions required to
+identify the true edges. Next, guided by the necessity results, an adaptive
+algorithm is designed that learns all true edges using $O(n^2)$ interventions,
+where $n$ is the number of nodes. Remarkably, the size of the interventions is
+optimal if the underlying mixture model does not contain cycles across its
+components. More generally, the gap between the intervention size used by the
+algorithm and the optimal size is quantified. It is shown to be bounded by the
+cyclic complexity number of the mixture model, defined as the size of the
+minimal intervention that can break the cycles in the mixture, which is upper
+bounded by the number of cycles among the ancestors of a node.
+
+
+
+ comment: NeurIPS 2024 camera-ready version
+
+
+
+
+
+
+ ♻ ☆ DFRot: Achieving Outlier-Free and Massive Activation-Free for Rotated
+ LLMs with Refined Rotation
+
+
+ Rotating the activation and weight matrices to reduce the influence of
+outliers in large language models (LLMs) has recently attracted significant
+attention, particularly in the context of model quantization. Prior studies
+have shown that in low-precision quantization scenarios, such as 4-bit weights
+and 4-bit activations (W4A4), randomized Hadamard transforms can achieve
+significantly higher accuracy than randomized orthogonal transforms. Notably,
+the reason behind this phenomenon remains unknown. In this paper, we find that
+these transformations show substantial improvement in eliminating outliers for
+common tokens and achieve similar quantization error. The primary reason for
+the accuracy difference lies in the fact that randomized Hadamard transforms
+can slightly reduce the quantization error for tokens with massive activations
+while randomized orthogonal transforms increase the quantization error. Due to
+the extreme rarity of these tokens and their critical impact on model accuracy,
+we consider this a long-tail optimization problem, and therefore construct a
+simple yet effective method: a weighted loss function. Additionally, we propose
+an optimization strategy for the rotation matrix that involves alternating
+optimization of quantization parameters while employing orthogonal Procrustes
+transforms to refine the rotation matrix. This makes the distribution of the
+rotated activation values more conducive to quantization, especially for tokens
+with massive activations. Our method, dubbed DFRot, enhances rotated LLMs by achieving
+a dual-free property: Outlier-Free and Massive Activation-Free. Extensive
+experiments demonstrate the effectiveness and efficiency of DFRot. By tuning
+the rotation matrix using just a single sample, DFRot achieves a perplexity
+improvement of 0.25 and 0.21 on W4A4KV4 and W4A4KV16, respectively, for
+LLaMA3-8B, a model known for its quantization challenges.
+
+
+
+
+
+
+
+ ♻ ☆ Towards Universal Mesh Movement Networks NeurIPS 2024
+
+
+
+
+
+
+
+
+ Mingrui Zhang, Chunyang Wang, Stephan Kramer, Joseph G. Wallwork, Siyi Li, Jiancheng Liu, Xiang Chen, Matthew D. Piggott
+
+
+ Solving complex Partial Differential Equations (PDEs) accurately and
+efficiently is an essential and challenging problem in all scientific and
+engineering disciplines. Mesh movement methods provide the capability to
+improve the accuracy of the numerical solution without increasing the overall
+mesh degree of freedom count. Conventional sophisticated mesh movement methods
+are extremely expensive and struggle to handle scenarios with complex boundary
+geometries. However, existing learning-based methods require re-training from
+scratch given a different PDE type or boundary geometry, which limits their
+applicability, and also often suffer from robustness issues in the form of
+inverted elements. In this paper, we introduce the Universal Mesh Movement
+Network (UM2N), which -- once trained -- can be applied in a non-intrusive,
+zero-shot manner to move meshes with different size distributions and
+structures, for solvers applicable to different PDE types and boundary
+geometries. UM2N consists of a Graph Transformer (GT) encoder for extracting
+features and a Graph Attention Network (GAT) based decoder for moving the mesh.
+We evaluate our method on advection and Navier-Stokes based examples, as well
+as a real-world tsunami simulation case. Our method outperforms existing
+learning-based mesh movement methods in terms of the benchmarks described
+above. In comparison to the conventional sophisticated Monge-Ampère
+PDE-solver based method, our approach not only significantly accelerates mesh
+movement, but also proves effective in scenarios where the conventional method
+fails. Our project page is at https://erizmr.github.io/UM2N/.
+
+
+
+ comment: Accepted at NeurIPS 2024 as a spotlight paper
+
+
+
+
+
+
+ ♻ ☆ HLSFactory: A Framework Empowering High-Level Synthesis Datasets for
+ Machine Learning and Beyond
+
+
+
+
+
+
+
+
+ Stefan Abi-Karam, Rishov Sarkar, Allison Seigler, Sean Lowe, Zhigang Wei, Hanqiu Chen, Nanditha Rao, Lizy John, Aman Arora, Cong Hao
+
+
+ Machine learning (ML) techniques have been applied to high-level synthesis
+(HLS) flows for quality-of-result (QoR) prediction and design space exploration
+(DSE). Nevertheless, the scarcity of accessible high-quality HLS datasets and
+the complexity of building such datasets present challenges. Existing datasets
+have limitations in terms of benchmark coverage, design space enumeration,
+vendor extensibility, or lack of reproducible and extensible software for
+dataset construction. Many works also lack user-friendly ways to add more
+designs, limiting wider adoption of such datasets. In response to these
+challenges, we introduce HLSFactory, a comprehensive framework designed to
+facilitate the curation and generation of high-quality HLS design datasets.
+HLSFactory has three main stages: 1) a design space expansion stage to
+elaborate single HLS designs into large design spaces using various
+optimization directives across multiple vendor tools, 2) a design synthesis
+stage to execute HLS and FPGA tool flows concurrently across designs, and 3) a
+data aggregation stage for extracting standardized data into packaged datasets
+for ML usage. This tripartite architecture ensures broad design space coverage
+via design space expansion and supports multiple vendor tools. Users can
+contribute to each stage with their own HLS designs and synthesis results and
+extend the framework itself with custom frontends and tool flows. We also
+include an initial set of built-in designs from common HLS benchmarks and curated
+open-source HLS designs. We showcase the versatility and multi-functionality of
+our framework through seven case studies: I) ML model for QoR prediction; II)
+Design space sampling; III) Fine-grained parallelism backend speedup; IV)
+Targeting Intel's HLS flow; V) Adding new auxiliary designs; VI) Integrating
+published HLS data; VII) HLS tool version regression benchmarking.
+
+
+
+ comment: MLCAD 2024 version of the paper. New case study with ML QoR
+ prediction. Artifact evaluation details included
+
+
+
+
+
+
+
+
+
+ Multimedia 4
+
+
+
+
+
+ ☆ AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand
+ Audio-Visual Information?
+
+
+ Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini
+1.5 Pro, and Reka Core, have expanded their capabilities to include vision and
+audio modalities. While these models demonstrate impressive performance across
+a wide range of audio-visual applications, our proposed DeafTest reveals that
+MLLMs often struggle with simple tasks humans find trivial: 1) determining
+which of two sounds is louder, and 2) determining which of two sounds has a
+higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a
+comprehensive audio-visual benchmark designed to assess whether those MLLMs can
+truly understand the audio-visual information. This benchmark encompasses 4,555
+carefully crafted problems, each incorporating text, visual, and audio
+components. To successfully infer answers, models must effectively leverage
+clues from both visual and audio inputs. To ensure precise and objective
+evaluation of MLLM responses, we have structured the questions as
+multiple-choice, eliminating the need for human evaluation or LLM-assisted
+assessment. We benchmark a series of closed-source and open-source models and
+summarize the observations. By revealing the limitations of current models, we
+aim to provide useful insight for future dataset collection and model
+development.
+
+
+
+
+
+
+
+ ☆ Copy-Move Forgery Detection and Question Answering for Remote Sensing
+ Image
+
+
+
+
+
+
+
+
+ Ze Zhang, Enyuan Zhao, Ziyi Wan, Jie Nie, Xinyue Liang, Lei Huang
+
+
+ This paper introduces the task of Remote Sensing Copy-Move Question Answering
+(RSCMQA). Unlike traditional Remote Sensing Visual Question Answering (RSVQA),
+RSCMQA focuses on interpreting complex tampering scenarios and inferring
+relationships between objects. Based on the practical needs of national defense
+security and land resource monitoring, we have developed an accurate and
+comprehensive global dataset for remote sensing image copy-move question
+answering, named RS-CMQA-2.1M. These images were collected from 29 different
+regions across 14 countries. Additionally, we have refined a balanced dataset,
+RS-CMQA-B, to address the long-standing issue of long-tail data in the remote
+sensing field. Furthermore, we propose a region-discriminative guided
+multimodal CMQA model, which enhances the accuracy of answering questions about
+tampered images by leveraging prompts about the differences and connections
+between the source and tampered domains. Extensive experiments demonstrate that
+our method provides a stronger benchmark for RS-CMQA compared to general VQA
+and RSVQA models. Our dataset and code are available at
+https://github.com/shenyedepisa/RSCMQA.
+
+
+
+ comment: 7 figs, 7 tables
+
+
+
+
+
+
+ ☆ It Takes Two: Real-time Co-Speech Two-person's Interaction Generation
+ via Reactive Auto-regressive Diffusion Model
+
+
+ Conversational scenarios are very common in real-world settings, yet existing
+co-speech motion synthesis approaches often fall short in these contexts, where
+one person's audio and gestures will influence the other's responses.
+Additionally, most existing methods rely on offline sequence-to-sequence
+frameworks, which are unsuitable for online applications. In this work, we
+introduce an audio-driven, auto-regressive system designed to synthesize
+dynamic movements for two characters during a conversation. At the core of our
+approach is a diffusion-based full-body motion synthesis model, which is
+conditioned on the past states of both characters, speech audio, and a
+task-oriented motion trajectory input, allowing for flexible spatial control.
+To enhance the model's ability to learn diverse interactions, we have enriched
+existing two-person conversational motion datasets with more dynamic and
+interactive motions. We evaluate our system through multiple experiments to
+show it outperforms across a variety of tasks, including single and two-person
+co-speech motion generation, as well as interactive motion generation. To the
+best of our knowledge, this is the first system capable of generating
+interactive full-body motions for two characters from speech in an online
+manner.
+
+
+ To establish the trustworthiness of systems that automatically generate text
+captions for audio, images and video, existing reference-free metrics rely on
+large pretrained models which are impractical to accommodate in
+resource-constrained settings. To address this, we propose some metrics to
+elicit the model's confidence in its own generation. To assess how well these
+metrics replace correctness measures that leverage reference captions, we test
+their calibration with correctness measures. We discuss why some of these
+confidence metrics align better with certain correctness measures. Further, we
+provide insight into why temperature scaling of confidence metrics is
+effective. Our main contribution is a suite of well-calibrated lightweight
+confidence metrics for reference-free evaluation of captions in
+resource-constrained settings.
+
+
+ Sign Language Processing (SLP) is an interdisciplinary field comprised of
+Natural Language Processing (NLP) and Computer Vision. It is focused on the
+computational understanding, translation, and production of signed languages.
+Traditional approaches have often been constrained by the use of gloss-based
+systems that are both language-specific and inadequate for capturing the
+multidimensional nature of sign language. These limitations have hindered the
+development of technology capable of processing signed languages effectively.
+ This thesis aims to revolutionize the field of SLP by proposing a simple
+paradigm that can bridge this existing technological gap. We propose the use of
+SignWriting, a universal sign language transcription notation system, to serve
+as an intermediary link between the visual-gestural modality of signed
+languages and text-based linguistic representations.
+ We contribute foundational libraries and resources to the SLP community,
+thereby setting the stage for a more in-depth exploration of the tasks of sign
+language translation and production. These tasks encompass the translation of
+sign language from video to spoken language text and vice versa. Through
+empirical evaluations, we establish the efficacy of our transcription method as
+a pivot for enabling faster, more targeted research, that can lead to more
+natural and accurate translations across a range of languages.
+ The universal nature of our transcription-based paradigm also paves the way
+for real-time, multilingual applications in SLP, thereby offering a more
+inclusive and accessible approach to language technology. This is a significant
+step toward universal accessibility, enabling a wider reach of AI-driven
+language technologies to include the deaf and hard-of-hearing community.
+
+
+
+ comment: PhD Thesis
+
+
+
+
+
+
+ ☆ Free Process Rewards without Process Labels
+
+
+ Different from its counterpart outcome reward models (ORMs), which evaluate
+the entire responses, a process reward model (PRM) scores a reasoning
+trajectory step by step, providing denser and more fine-grained rewards.
+However, training a PRM requires labels annotated at every intermediate step,
+presenting significant challenges for both manual and automatic data
+collection. This paper aims to address this challenge. Both theoretically and
+empirically, we show that an \textit{implicit PRM} can be obtained at no
+additional cost, by simply training an ORM on the cheaper response-level
+labels. The only assumption is to parameterize the outcome reward as the
+log-likelihood ratios of the policy and reference models, which can be
+optimized regardless of the specific choice of loss objectives. In experiments,
+we instantiate our implicit PRMs with various objectives and evaluate their
+performance on MATH. We show that our implicit PRM outperforms a strong
+MCTS-based baseline \textit{\'a la} Math-Shepherd using less than $1/38$ of the
+training data. Its performance can be further improved with majority voting. We
+further find that scaling up instructions and responses benefits our implicit
+PRM, and the latter brings a larger gain. Particularly, we find that our
+implicit PRM, when instantiated with the cross-entropy (CE) loss, is more
+data-efficient and can keep improving generation models even when trained with
+only one response per instruction, the setup that suffers from extreme data
+scarcity and imbalance. Further, instructions should be relevant to downstream
+tasks while the diversity of responses does not bring gains. Surprisingly,
+training on extra Math-Shepherd step labels brings no further improvements to
+our implicit PRM trained on only outcome data. We hope that our work will
+encourage a rethinking of PRM training approaches and contribute to making
+training PRMs more accessible.
+
+
+
+ comment: Models and data are available at:
+ https://github.com/lifan-yuan/ImplicitPRM
+
+
+
+
+
+
+ ☆ The use of large language models to enhance cancer clinical trial
+ educational materials
+
+
+
+
+
+
+
+
+ Mingye Gao, Aman Varshney, Shan Chen, Vikram Goddla, Jack Gallifant, Patrick Doyle, Claire Novack, Maeve Dillon-Martin, Teresia Perkins, Xinrong Correia, Erik Duhaime, Howard Isenstein, Elad Sharon, Lisa Soleymani Lehmann, David Kozono, Brian Anthony, Dmitriy Dligach, Danielle S. Bitterman
+
+
+ Cancer clinical trials often face challenges in recruitment and engagement
+due to a lack of participant-facing informational and educational resources.
+This study investigated the potential of Large Language Models (LLMs),
+specifically GPT4, in generating patient-friendly educational content from
+clinical trial informed consent forms. Using data from ClinicalTrials.gov, we
+employed zero-shot learning for creating trial summaries and one-shot learning
+for developing multiple-choice questions, evaluating their effectiveness
+through patient surveys and crowdsourced annotation. Results showed that
+GPT4-generated summaries were both readable and comprehensive, and may improve
+patients' understanding and interest in clinical trials. The multiple-choice
+questions demonstrated high accuracy and agreement with crowdsourced
+annotators. For both resource types, hallucinations were identified that
+require ongoing human oversight. The findings demonstrate the potential of LLMs
+"out-of-the-box" to support the generation of clinical trial education
+materials with minimal trial-specific engineering, but implementation with a
+human-in-the-loop is still needed to avoid misinformation risks.
+
+
+
+
+
+
+
+ ☆ Self-Improvement in Language Models: The Sharpening Mechanism
+
+
+
+
+
+
+
+
+ Audrey Huang, Adam Block, Dylan J. Foster, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, Jordan T. Ash, Akshay Krishnamurthy
+
+
+ Recent work in language modeling has raised the possibility of
+self-improvement, where a language model evaluates and refines its own
+generations to achieve higher performance without external feedback. It is
+impossible for this self-improvement to create information that is not already
+in the model, so why should we expect that this will lead to improved
+capabilities? We offer a new perspective on the capabilities of
+self-improvement through a lens we refer to as sharpening. Motivated by the
+observation that language models are often better at verifying response quality
+than they are at generating correct responses, we formalize self-improvement as
+using the model itself as a verifier during post-training in order to
+``sharpen'' the model to one placing large mass on high-quality sequences,
+thereby amortizing the expensive inference-time computation of generating good
+sequences. We begin by introducing a new statistical framework for sharpening
+in which the learner aims to sharpen a pre-trained base policy via sample
+access, and establish fundamental limits. Then we analyze two natural families
+of self-improvement algorithms based on SFT and RLHF.
+
+
+
+
+
+
+
+ ☆ GETAE: Graph information Enhanced deep neural NeTwork ensemble
+ ArchitecturE for fake news detection
+
+
+
+
+
+
+
+
+ Ciprian-Octavian Truică, Elena-Simona Apostol, Marius Marogel, Adrian Paschke
+
+
+ In today's digital age, fake news has become a major problem that has serious
+consequences, ranging from social unrest to political upheaval. To address this
+issue, new methods for detecting and mitigating fake news are required. In this
+work, we propose to incorporate contextual and network-aware features into the
+detection process. This involves analyzing not only the content of a news
+article but also the context in which it was shared and the network of users
+who shared it, i.e., the information diffusion. Thus, we propose GETAE,
+\underline{G}raph Information \underline{E}nhanced Deep Neural
+Ne\underline{t}work Ensemble \underline{A}rchitectur\underline{E} for Fake News
+Detection, a novel ensemble architecture that uses textual content together
+with the social interactions to improve fake news detection. GETAE contains two
+Branches: the Text Branch and the Propagation Branch. The Text Branch uses Word
+and Transformer Embeddings and a Deep Neural Network based on feed-forward and
+bidirectional Recurrent Neural Networks (\textsc{[Bi]RNN}) for learning novel
+contextual features and creating a novel Text Content Embedding. The
+Propagation Branch considers the information propagation within the graph
+network and proposes a Deep Learning architecture that employs Node Embeddings
+to create novel Propagation Embedding. GETAE Ensemble combines the two novel
+embeddings, i.e., Text Content Embedding and Propagation Embedding, to create a
+novel \textit{Propagation-Enhanced Content Embedding} which is afterward used
+for classification. The experimental results obtained on two real-world
+publicly available datasets, i.e., Twitter15 and Twitter16, prove that using
+this approach improves fake news detection and outperforms state-of-the-art
+models.
+
+
+
+
+
+
+
+ ☆ COSMOS: Cross-Modality Self-Distillation for Vision Language
+ Pre-training
+
+
+
+
+
+
+
+
+ Sanghwan Kim, Rui Xiao, Mariana-Iuliana Georgescu, Stephan Alaniz, Zeynep Akata
+
+
+ Vision-Language Models (VLMs) trained with contrastive loss have achieved
+significant advancements in various vision and language tasks. However, the
+global nature of contrastive loss makes VLMs focus predominantly on foreground
+objects, neglecting other crucial information in the image, which limits their
+effectiveness in downstream tasks. To address these challenges, we propose
+COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that
+integrates a novel text-cropping strategy and cross-attention module into a
+self-supervised learning framework. We create global and local views of images
+and texts (i.e., multi-modal augmentations), which are essential for
+self-distillation in VLMs. We further introduce a cross-attention module,
+enabling COSMOS to learn comprehensive cross-modal representations optimized
+via a cross-modality self-distillation loss. COSMOS consistently outperforms
+previous strong baselines on various zero-shot downstream tasks, including
+retrieval, classification, and semantic segmentation. Additionally, it
+surpasses CLIP-based models trained on larger datasets in visual perception and
+contextual understanding tasks.
+
+
+
+
+
+
+
+ ☆ Random Tree Model of Meaningful Memory
+
+
+ Traditional studies of memory for meaningful narratives focus on specific
+stories and their semantic structures but do not address common quantitative
+features of recall across different narratives. We introduce a statistical
+ensemble of random trees to represent narratives as hierarchies of key points,
+where each node is a compressed representation of its descendant leaves, which
+are the original narrative segments. Recall is modeled as constrained by
+working memory capacity from this hierarchical structure. Our analytical
+solution aligns with observations from large-scale narrative recall
+experiments. Specifically, our model explains that (1) average recall length
+increases sublinearly with narrative length, and (2) individuals summarize
+increasingly longer narrative segments in each recall sentence. Additionally,
+the theory predicts that for sufficiently long narratives, a universal,
+scale-invariant limit emerges, where the fraction of a narrative summarized by
+a single recall sentence follows a distribution independent of narrative
+length.
+
+
+
+ comment: 16 pages, 4 figures
+
+
+
+
+
+
+ ☆ A Neurosymbolic Fast and Slow Architecture for Graph Coloring
+
+
+ Constraint Satisfaction Problems (CSPs) present significant challenges to
+artificial intelligence due to their intricate constraints and the necessity
+for precise solutions. Existing symbolic solvers are often slow, and prior
+research has shown that Large Language Models (LLMs) alone struggle with CSPs
+because of their complexity. To bridge this gap, we build upon the existing
+SOFAI architecture (or SOFAI-v1), which adapts Daniel Kahneman's ''Thinking,
+Fast and Slow'' cognitive model to AI. Our enhanced architecture, SOFAI-v2,
+integrates refined metacognitive governance mechanisms to improve adaptability
+across complex domains, specifically tailored for solving CSPs like graph
+coloring. SOFAI-v2 combines a fast System 1 (S1) based on LLMs with a
+deliberative System 2 (S2) governed by a metacognition module. S1's initial
+solutions, often limited by non-adherence to constraints, are enhanced through
+metacognitive governance, which provides targeted feedback and examples to
+adapt S1 to CSP requirements. If S1 fails to solve the problem, metacognition
+strategically invokes S2, ensuring accurate and reliable solutions. With
+empirical results, we show that SOFAI-v2 for graph coloring problems achieves a
+16.98% increased success rate and is 32.42% faster than symbolic solvers.
+
+
+
+ comment: 18 Pages, 18 Figures, 3 Tables
+
+
+
+
+
+
+ ☆ Towards Resource Efficient and Interpretable Bias Mitigation in Large
+ Language Models NeurIPS
+ 2024
+
+
+ Although large language models (LLMs) have demonstrated their effectiveness
+in a wide range of applications, they have also been observed to perpetuate
+unwanted biases present in the training data, potentially leading to harm for
+marginalized communities. In this paper, we mitigate bias by leveraging small
+biased and anti-biased expert models to obtain a debiasing signal that will be
+added to the LLM output at decoding-time. This approach combines resource
+efficiency with interpretability and can be optimized for mitigating specific
+types of bias, depending on the target use case. Experiments on mitigating
+gender, race, and religion biases show a reduction in bias on several local and
+global bias metrics while preserving language model performance.
+
+
+
+ comment: 38th Conference on Neural Information Processing Systems (NeurIPS
+ 2024) Safe Generative AI Workshop
+
+
+
+
+
+
+ ☆ Query Performance Explanation through Large Language Model for HTAP
+ Systems ICDE 2025
+
+
+
+
+
+
+
+
+ Haibo Xiu, Li Zhang, Tieying Zhang, Jun Yang, Jianjun Chen
+
+
+ In hybrid transactional and analytical processing (HTAP) systems, users often
+struggle to understand why query plans from one engine (OLAP or OLTP) perform
+significantly slower than those from another. Although optimizers provide plan
+details via the EXPLAIN function, these explanations are frequently too
+technical for non-experts and offer limited insights into performance
+differences across engines. To address this, we propose a novel framework that
+leverages large language models (LLMs) to explain query performance in HTAP
+systems. Built on Retrieval-Augmented Generation (RAG), our framework
+constructs a knowledge base that stores historical query executions and
+expert-curated explanations. To enable efficient retrieval of relevant
+knowledge, query plans are embedded using a lightweight tree-CNN classifier.
+This augmentation allows the LLM to generate clear, context-aware explanations
+of performance differences between engines. Our approach demonstrates the
+potential of LLMs in hybrid engine systems, paving the way for further
+advancements in database optimization and user support.
+
+
+
+ comment: Submitted to ICDE 2025
+
+
+
+
+
+
+ ☆ Are We There Yet? Revealing the Risks of Utilizing Large Language Models
+ in Scholarly Peer Review
+
+
+ Scholarly peer review is a cornerstone of scientific advancement, but the
+system is under strain due to increasing manuscript submissions and the
+labor-intensive nature of the process. Recent advancements in large language
+models (LLMs) have led to their integration into peer review, with promising
+results such as substantial overlaps between LLM- and human-generated reviews.
+However, the unchecked adoption of LLMs poses significant risks to the
+integrity of the peer review system. In this study, we comprehensively analyze
+the vulnerabilities of LLM-generated reviews by focusing on manipulation and
+inherent flaws. Our experiments show that injecting covert deliberate content
+into manuscripts allows authors to explicitly manipulate LLM reviews, leading
+to inflated ratings and reduced alignment with human reviews. In a simulation,
+we find that manipulating 5% of the reviews could potentially cause 12% of the
+papers to lose their position in the top 30% rankings. Implicit manipulation,
+where authors strategically highlight minor limitations in their papers,
+further demonstrates LLMs' susceptibility compared to human reviewers, with a
+4.5 times higher consistency with disclosed limitations. Additionally, LLMs
+exhibit inherent flaws, such as potentially assigning higher ratings to
+incomplete papers compared to full papers and favoring well-known authors in
+a single-blind review process. These findings highlight the risks of
+over-reliance on LLMs in peer review, underscoring that we are not yet ready
+for widespread adoption and emphasizing the need for robust safeguards.
+
+
+
+ comment: 27 pages, 24 figures
+
+
+
+
+
+
+ ☆ Can We Afford The Perfect Prompt? Balancing Cost and Accuracy with the
+ Economical Prompting Index
+
+
+
+
+
+
+
+
+ Tyler McDonald, Anthony Colosimo, Yifeng Li, Ali Emami
+
+
+ As prompt engineering research rapidly evolves, evaluations beyond accuracy
+are crucial for developing cost-effective techniques. We present the Economical
+Prompting Index (EPI), a novel metric that combines accuracy scores with token
+consumption, adjusted by a user-specified cost concern level to reflect
+different resource constraints. Our study examines 6 advanced prompting
+techniques, including Chain-of-Thought, Self-Consistency, and Tree of Thoughts,
+across 10 widely-used language models and 4 diverse datasets. We demonstrate
+that approaches such as Self-Consistency often provide statistically
+insignificant gains while becoming cost-prohibitive. For example, on
+high-performing models like Claude 3.5 Sonnet, the EPI of simpler techniques
+like Chain-of-Thought (0.72) surpasses more complex methods like
+Self-Consistency (0.64) at slight cost concern levels. Our findings suggest a
+reevaluation of complex prompting strategies in resource-constrained scenarios,
+potentially reshaping future research priorities and improving
+cost-effectiveness for end-users.
+
+
+ Query rewrite is essential for optimizing SQL queries to improve their
+execution efficiency without changing their results. Traditionally, this task
+has been tackled through heuristic and learning-based methods, each with its
+limitations in terms of inferior quality and low robustness. Recent
+advancements in LLMs offer a new paradigm by leveraging their superior natural
+language and code comprehension abilities. Despite their potential, directly
+applying LLMs like GPT-4 has faced challenges due to problems such as
+hallucinations, where the model might generate inaccurate or irrelevant
+results. To address this, we propose R-Bot, an LLM-based query rewrite system
+with a systematic approach. We first design a multi-source rewrite evidence
+preparation pipeline to generate query rewrite evidences for guiding LLMs to
+avoid hallucinations. We then propose a hybrid structure-semantics retrieval
+method that combines structural and semantic analysis to retrieve the most
+relevant rewrite evidences for effectively answering an online query. We next
+propose a step-by-step LLM rewrite method that iteratively leverages the
+retrieved evidences to select and arrange rewrite rules with self-reflection.
+We conduct comprehensive experiments on widely used benchmarks, and demonstrate
+the superior performance of our system, R-Bot, surpassing state-of-the-art
+query rewrite methods.
+
+
+
+
+
+
+
+ ☆ Concept Based Continuous Prompts for Interpretable Text Classification
+
+
+ Continuous prompts have become widely adopted for augmenting performance
+across a wide range of natural language tasks. However, the underlying
+mechanism of this enhancement remains obscure. Previous studies rely on
+individual words for interpreting continuous prompts, which lacks comprehensive
+semantic understanding. Drawing inspiration from Concept Bottleneck Models, we
+propose a framework for interpreting continuous prompts by decomposing them
+into human-readable concepts. Specifically, to ensure the feasibility of the
+decomposition, we demonstrate that a corresponding concept embedding matrix and
+a coefficient matrix can always be found to replace the prompt embedding
+matrix. Then, we employ GPT-4o to generate a concept pool and choose potential
+candidate concepts that are discriminative and representative using a novel
+submodular optimization algorithm. Experiments demonstrate that our framework
+can achieve results similar to the original P-tuning and word-based approaches
+using only a few concepts while providing more plausible results. Our code is
+available at https://github.com/qq31415926/CD.
+
+
+
+
+
+
+
+ ☆ Using Large Language Models in Automatic Hint Ranking and Generation
+ Tasks
+
+
+ The use of Large Language Models (LLMs) has increased significantly recently,
+with individuals frequently interacting with chatbots to receive answers to a
+wide range of questions. In an era where information is readily accessible, it
+is crucial to stimulate and preserve human cognitive abilities and maintain
+strong reasoning skills. This paper addresses such challenges by promoting the
+use of hints as an alternative or a supplement to direct answers. We first
+introduce a manually constructed hint dataset, WIKIHINT, which includes 5,000
+hints created for 1,000 questions. We then finetune open-source LLMs such as
+LLaMA-3.1 for hint generation in answer-aware and answer-agnostic contexts. We
+assess the effectiveness of the hints with human participants who try to answer
+questions with and without the aid of hints. Additionally, we introduce a
+lightweight evaluation method, HINTRANK, to evaluate and rank hints in both
+answer-aware and answer-agnostic settings. Our findings show that (a) the
+dataset helps generate more effective hints, (b) including answer information
+along with questions generally improves hint quality, and (c) encoder-based
+models perform better than decoder-based models in hint ranking.
+
+
+ Text summarization is a process of condensing lengthy texts while preserving
+their essential information. Previous studies have predominantly focused on
+high-resource languages, while low-resource languages like Thai have received
+less attention. Furthermore, earlier extractive summarization models for Thai
+texts have primarily relied on the article's body, without considering the
+headline. This omission can result in the exclusion of key sentences from the
+summary. To address these limitations, we propose CHIMA, an extractive
+summarization model that incorporates the contextual information of the
+headline for Thai news articles. Our model utilizes a pre-trained language
+model to capture complex language semantics and assigns a probability to each
+sentence to be included in the summary. By leveraging the headline to guide
+sentence selection, CHIMA enhances the model's ability to recover important
+sentences and discount irrelevant ones. Additionally, we introduce two
+strategies for aggregating headline-body similarities, simple average and
+harmonic mean, providing flexibility in sentence selection to accommodate
+varying writing styles. Experiments on publicly available Thai news datasets
+demonstrate that CHIMA outperforms baseline models across ROUGE, BLEU, and F1
+scores. These results highlight the effectiveness of incorporating the
+headline-body similarities as model guidance. The results also indicate an
+enhancement in the model's ability to recall critical sentences, even those
+scattered throughout the middle or end of the article. With this potential,
+headline-guided extractive summarization offers a promising approach to improve
+the quality and relevance of summaries for Thai news articles.
+
+
+
+
+
+
+
+ ☆ NYT-Connections: A Deceptively Simple Text Classification Task that
+ Stumps System-1 Thinkers
+
+
+ Large Language Models (LLMs) have shown impressive performance on various
+benchmarks, yet their ability to engage in deliberate reasoning remains
+questionable. We present NYT-Connections, a collection of 358 simple word
+classification puzzles derived from the New York Times Connections game. This
+benchmark is designed to penalize quick, intuitive "System 1" thinking,
+isolating fundamental reasoning skills. We evaluated six recent LLMs, a simple
+machine learning heuristic, and humans across three configurations:
+single-attempt, multiple attempts without hints, and multiple attempts with
+contextual hints. Our findings reveal a significant performance gap: even
+top-performing LLMs like GPT-4 fall short of human performance by nearly 30%.
+Notably, advanced prompting techniques such as Chain-of-Thought and
+Self-Consistency show diminishing returns as task difficulty increases.
+NYT-Connections uniquely combines linguistic isolation, resistance to intuitive
+shortcuts, and regular updates to mitigate data leakage, offering a novel tool
+for assessing LLM reasoning capabilities.
+
+
+ Loneliness, or the lack of fulfilling relationships, significantly impacts a
+person's mental and physical well-being and is prevalent worldwide. Previous
+research suggests that large language models (LLMs) may help mitigate
+loneliness. However, we argue that the use of widespread LLMs like ChatGPT is
+more prevalent--and riskier, as they are not designed for this purpose. To
+explore this, we analysed user interactions with ChatGPT, particularly those
+outside of its marketed use as a task-oriented assistant. In dialogues classified
+as lonely, users frequently (37%) sought advice or validation, and received
+good engagement. However, ChatGPT failed in sensitive scenarios, like
+responding appropriately to suicidal ideation or trauma. We also observed a 35%
+higher incidence of toxic content, with women being 22 times more likely to be
+targeted than men. Our findings underscore ethical and legal questions about
+this technology, and note risks like radicalisation or further isolation. We
+conclude with recommendations for research and industry to address loneliness.
+
+
+
+
+
+
+
+ ☆ Medchain: Bridging the Gap Between LLM Agents and Clinical Practice
+ through Interactive Sequential Benchmarking
+
+
+
+
+
+
+
+
+ Jie Liu, Wenxuan Wang, Zizhan Ma, Guolin Huang, Yihang SU, Kao-Jung Chang, Wenting Chen, Haoliang Li, Linlin Shen, Michael Lyu
+
+
+ Clinical decision making (CDM) is a complex, dynamic process crucial to
+healthcare delivery, yet it remains a significant challenge for artificial
+intelligence systems. While Large Language Model (LLM)-based agents have been
+tested on general medical knowledge using licensing exams and knowledge
+question-answering tasks, their performance in CDM in real-world scenarios
+is limited due to the lack of comprehensive testing datasets that mirror actual
+medical practice. To address this gap, we present MedChain, a dataset of 12,163
+clinical cases that covers five key stages of clinical workflow. MedChain
+distinguishes itself from existing benchmarks with three key features of
+real-world clinical practice: personalization, interactivity, and
+sequentiality. Further, to tackle real-world CDM challenges, we also propose
+MedChain-Agent, an AI system that integrates a feedback mechanism and an
+MCase-RAG module to learn from previous cases and adapt its responses.
+MedChain-Agent demonstrates remarkable adaptability in gathering information
+dynamically and handling sequential clinical tasks, significantly outperforming
+existing approaches. The relevant dataset and code will be released upon
+acceptance of this paper.
+
+
+
+
+
+
+
+ ☆ Scaling Law for Language Models Training Considering Batch Size
+
+
+ Large language models (LLMs) have made remarkable advances in recent years,
+with scaling laws playing a critical role in this rapid progress. In this
+paper, we empirically investigate how a critical hyper-parameter, i.e., the
+global batch size, influences the LLM training process. We begin by training
+language models ranging from 125 million to 2.6 billion parameters, using up to
+300 billion high-quality tokens. Through these experiments, we establish a
+basic scaling law on model size and training data amount. We then examine how
+varying batch sizes and learning rates affect the convergence and
+generalization of these models. Our analysis yields batch size scaling laws
+under two different cases: with a fixed compute budget, and with a fixed amount
+of training data. Extrapolation experiments on models of increasing sizes
+validate our predicted laws, which provides guidance for optimizing LLM
+training strategies under specific resource constraints.
+
+
+
+
+
+
+
+ ☆ Early Exit Is a Natural Capability in Transformer-based Models: An
+ Empirical Study on Early Exit without Joint Optimization
+
+
+ Large language models (LLMs) exhibit exceptional performance across various
+downstream tasks. However, they encounter limitations due to slow inference
+speeds stemming from their extensive parameters. Early exit (EE) is an
+approach that aims to accelerate auto-regressive decoding. EE generates outputs
+from intermediate layers instead of using the whole model, which offers a
+promising solution to this challenge. However, additional output layers and
+joint optimization used in conventional EE hinder the application of EE in
+LLMs.
+  In this paper, we explore the possibility of EE in LLMs without additional
+output layers and joint optimization. Our findings indicate that EE is a
+natural capability within transformer-based models. While joint optimization
+does not give the model EE capability, it must be employed to address challenges by
+improving the accuracy of locating the optimal EE layer through gating
+functions. Additionally, our study reveals patterns in EE behavior from a
+sub-word perspective based on the LLaMA model and the potential for
+EE based on sub-layers.
+
+
+
+
+
+
+
+ ☆ PLD+: Accelerating LLM inference by leveraging Language Model Artifacts
+
+
+  To reduce the latency associated with autoregressive LLM inference,
+speculative decoding has emerged as a novel decoding paradigm, where future
+tokens are drafted and verified in parallel. However, the practical deployment
+of speculative decoding is hindered by its requirements for additional
+computational resources and fine-tuning, which limits its out-of-the-box
+usability. To address these challenges, we present PLD+, a suite of novel
+algorithms developed to accelerate the inference process of LLMs, particularly
+for input-guided tasks. These tasks, which include code editing, text editing,
+summarization, etc., often feature outputs with substantial overlap with their
+inputs, an attribute PLD+ is designed to exploit. PLD+ also leverages the
+artifacts (attention and hidden states) generated during inference to
+accelerate inference speed. We test our approach on five input-guided tasks and
+through extensive experiments we find that PLD+ outperforms all tuning-free
+approaches. In the greedy setting, it even outperforms the state-of-the-art
+tuning-dependent approach EAGLE on four of the tasks (by a margin of up to 2.31
+in terms of avg. speedup). Our approach is tuning-free, does not require any
+additional compute and can easily be used for accelerating inference of any
+LLM.
+
+
+
+
+
+
+
+
+ Heejin Do, Sangwon Ryu, Jonghwi Kim, Gary Geunbae Lee
+
+
+ With the growing demand to fit fine-grained user intents, faceted
+query-by-example (QBE), which retrieves similar documents conditioned on
+specific facets, has gained recent attention. However, prior approaches mainly
+depend on document-level comparisons using basic indicators like citations due
+to the lack of facet-level relevance datasets; yet, this limits their use to
+citation-based domains and fails to capture the intricacies of facet
+constraints. In this paper, we propose a multi-facet blending (FaBle)
+augmentation method, which exploits modularity by decomposing and recomposing
+to explicitly synthesize facet-specific training sets. We automatically
+decompose documents into facet units and generate (ir)relevant pairs by
+leveraging LLMs' intrinsic distinguishing capabilities; then, dynamically
+recomposing the units leads to facet-wise relevance-informed document pairs.
+Our modularization eliminates the need for pre-defined facet knowledge or
+labels. Further, to demonstrate FaBle's efficacy in a new domain beyond
+citation-based scientific paper retrieval, we release a benchmark dataset for
+educational exam item QBE. FaBle augmentation on 1K documents remarkably
+assists training in obtaining facet conditional embeddings.
+
+
+
+
+
+
+
+ ☆ A Top-down Graph-based Tool for Modeling Classical Semantic Maps: A
+ Crosslinguistic Case Study of Supplementary Adverbs
+
+
+ Semantic map models (SMMs) construct a network-like conceptual space from
+cross-linguistic instances or forms, based on the connectivity hypothesis. This
+approach has been widely used to represent similarity and entailment
+relationships in cross-linguistic concept comparisons. However, most SMMs are
+manually built by human experts using bottom-up procedures, which are often
+labor-intensive and time-consuming. In this paper, we propose a novel
+graph-based algorithm that automatically generates conceptual spaces and SMMs
+in a top-down manner. The algorithm begins by creating a dense graph, which is
+subsequently pruned into maximum spanning trees, selected according to metrics
+we propose. These evaluation metrics include both intrinsic and extrinsic
+measures, considering factors such as network structure and the trade-off
+between precision and coverage. A case study on cross-linguistic supplementary
+adverbs demonstrates the effectiveness and efficiency of our model compared to
+human annotations and other automated methods. The tool is available at
+https://github.com/RyanLiut/SemanticMapModel.
+
+
+
+ comment: Paper under review
+
+
+
+
+
+
+ ☆ CLASSLA-Express: a Train of CLARIN.SI Workshops on Language Resources
+ and Tools with Easily Expanding Route
+
+
+
+
+
+
+
+
+ Nikola Ljubešić, Taja Kuzman, Ivana Filipović Petrović, Jelena Parizoska, Petya Osenova
+
+
+ This paper introduces the CLASSLA-Express workshop series as an innovative
+approach to disseminating linguistic resources and infrastructure provided by
+the CLASSLA Knowledge Centre for South Slavic languages and the Slovenian
+CLARIN.SI infrastructure. The workshop series employs two key strategies: (1)
+conducting workshops directly in countries with interested audiences, and (2)
+designing the series for easy expansion to new venues. The first iteration of
+the CLASSLA-Express workshop series encompasses 6 workshops in 5 countries. Its
+goal is to share knowledge on the use of corpus querying tools, as well as the
+recently-released CLASSLA-web corpora - the largest general corpora for South
+Slavic languages. In the paper, we present the design of the workshop series,
+its current scope and the effortless extensions of the workshop to new venues
+that are already in sight.
+
+
+
+ comment: Published in CLARIN Annual Conference Proceedings 2024
+ (https://www.clarin.eu/sites/default/files/CLARIN2024_ConferenceProceedings_final.pdf)
+
+
+
+
+
+
+ ☆ Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware
+ Masking
+
+
+
+
+
+
+
+
+ Marco Federici, Davide Belli, Mart van Baalen, Amir Jalalirad, Andrii Skliar, Bence Major, Markus Nagel, Paul Whatmough
+
+
+ While mobile devices provide ever more compute power, improvements in DRAM
+bandwidth are much slower. This is unfortunate for large language model (LLM)
+token generation, which is heavily memory-bound. Previous work has proposed to
+leverage natural dynamic activation sparsity in ReLU-activated LLMs to reduce
+effective DRAM bandwidth per token. However, more recent LLMs use SwiGLU
+instead of ReLU, which results in little inherent sparsity. While SwiGLU
+activations can be pruned based on magnitude, the resulting sparsity patterns
+are difficult to predict, rendering previous approaches ineffective. To
+circumvent this issue, our work introduces Dynamic Input Pruning (DIP): a
+predictor-free dynamic sparsification approach, which preserves accuracy with
+minimal fine-tuning. DIP can further use lightweight LoRA adapters to regain
+some performance lost during sparsification. Lastly, we describe a novel
+cache-aware masking strategy, which considers the cache state and activation
+magnitude to further increase cache hit rate, improving LLM token rate on
+mobile devices. DIP outperforms other methods in terms of accuracy, memory and
+throughput trade-offs across simulated hardware settings. On Phi-3-Medium, DIP
+achieves a 46% reduction in memory and 40% increase in throughput with $<$ 0.1
+loss in perplexity.
+
+
+ The increasing complexity of computer systems necessitates innovative
+approaches to fault and error management, going beyond traditional manual log
+analysis. While existing solutions using large language models (LLMs) show
+promise, they are limited by a gap between natural and domain-specific
+languages, which restricts their effectiveness in real-world applications. Our
+approach addresses these limitations by integrating interpretable domain
+knowledge into open-source LLMs through continual pre-training (CPT), enhancing
+performance on log tasks while retaining natural language processing
+capabilities. We created a comprehensive dataset, NLPLog, with over 250,000
+question-answer pairs to facilitate this integration. Our model, SuperLog,
+trained with this dataset, achieves the best performance across four log
+analysis tasks, surpassing the second-best model by an average of 12.01%. Our
+contributions include a novel CPT paradigm that significantly improves model
+performance, the development of SuperLog with state-of-the-art results, and the
+release of a large-scale dataset to support further research in this domain.
+
+
+
+
+
+
+
+ ☆ Understanding the World's Museums through Vision-Language Reasoning
+
+
+
+
+
+
+
+
+ Ada-Astrid Balauca, Sanjana Garai, Stefan Balauca, Rasesh Udayakumar Shetty, Naitik Agrawal, Dhwanil Subhashbhai Shah, Yuqian Fu, Xi Wang, Kristina Toutanova, Danda Pani Paudel, Luc Van Gool
+
+
+ Museums serve as vital repositories of cultural heritage and historical
+artifacts spanning diverse epochs, civilizations, and regions, preserving
+well-documented collections. Data reveal key attributes such as age, origin,
+material, and cultural significance. Understanding museum exhibits from their
+images requires reasoning beyond visual features. In this work, we facilitate
+such reasoning by (a) collecting and curating a large-scale dataset of 65M
+images and 200M question-answer pairs in the standard museum catalog format for
+exhibits from all around the world; (b) training large vision-language models
+on the collected dataset; (c) benchmarking their ability on five visual
+question answering tasks. The complete dataset is labeled by museum experts,
+ensuring the quality as well as the practical significance of the labels. We
+train two VLMs from different categories: the BLIP model, with vision-language
+aligned embeddings, but lacking the expressive power of large language models,
+and the LLaVA model, a powerful instruction-tuned LLM enriched with
+vision-language reasoning capabilities. Through exhaustive experiments, we
+provide several insights on the complex and fine-grained understanding of
+museum exhibits. In particular, we show that some questions whose answers can
+often be derived directly from visual features are well answered by both types
+of models. On the other hand, questions that require the grounding of the
+visual features in repositories of human knowledge are better answered by the
+large vision-language models, thus demonstrating their superior capacity to
+perform the desired reasoning. Find our dataset, benchmarks, and source code
+at: https://github.com/insait-institute/Museum-65
+
+
+
+
+
+
+
+ ☆ A 2-step Framework for Automated Literary Translation Evaluation: Its
+ Promises and Pitfalls
+
+
+ In this work, we propose and evaluate the feasibility of a two-stage pipeline
+to evaluate literary machine translation, in a fine-grained manner, from
+English to Korean. The results show that our framework provides fine-grained,
+interpretable metrics suited for literary translation and obtains a higher
+correlation with human judgment than traditional machine translation metrics.
+Nonetheless, it still fails to match inter-human agreement, especially in
+metrics like Korean Honorifics. We also observe that LLMs tend to favor
+translations generated by other LLMs, and we highlight the necessity of
+developing more sophisticated evaluation methods to ensure accurate and
+culturally sensitive machine translation of literary works.
+
+
+
+
+
+
+
+ ☆ Exploring Long-Term Prediction of Type 2 Diabetes Microvascular
+ Complications ML4H
+
+
+
+
+
+
+
+
+ Elizabeth Remfry, Rafael Henkin, Michael R Barnes, Aakanksha Naik
+
+
+ Electronic healthcare records (EHR) contain a huge wealth of data that can
+support the prediction of clinical outcomes. EHR data is often stored and
+analysed using clinical codes (ICD10, SNOMED); however, these can differ across
+registries and healthcare providers. Integrating data across systems involves
+mapping between different clinical ontologies requiring domain expertise, and
+at times resulting in data loss. To overcome this, code-agnostic models have
+been proposed. We assess the effectiveness of a code-agnostic representation
+approach on the task of long-term microvascular complication prediction for
+individuals living with Type 2 Diabetes. Our method encodes individual EHRs as
+text using fine-tuned, pretrained clinical language models. Leveraging
+large-scale EHR data from the UK, we employ a multi-label approach to
+simultaneously predict the risk of microvascular complications across 1-, 5-,
+and 10-year windows. We demonstrate that a code-agnostic approach outperforms a
+code-based model and illustrate that performance is better with longer
+prediction windows but is biased to the first occurring complication. Overall,
+we highlight that context length is vitally important for model performance.
+This study highlights the possibility of including data from across different
+clinical ontologies and is a starting point for generalisable clinical models.
+
+
+
+ comment: Findings paper presented at Machine Learning for Health (ML4H)
+ symposium 2024, December 15-16, 2024, Vancouver, Canada, 9 pages
+
+
+
+
+
+
+ ☆ The "LLM World of Words" English free association norms generated by
+ large language models
+
+
+ Free associations have been extensively used in cognitive psychology and
+linguistics for studying how conceptual knowledge is organized. Recently, the
+potential of applying a similar approach for investigating the knowledge
+encoded in LLMs has emerged, specifically as a method for investigating LLM
+biases. However, the absence of large-scale LLM-generated free association
+norms that are comparable with human-generated norms is an obstacle to this new
+research direction. To address this limitation, we create a new dataset of
+LLM-generated free association norms modeled after the "Small World of Words"
+(SWOW) human-generated norms consisting of approximately 12,000 cue words. We
+prompt three LLMs, namely Mistral, Llama3, and Haiku, with the same cues as
+those in the SWOW norms to generate three novel comparable datasets, the "LLM
+World of Words" (LWOW). Using both SWOW and LWOW norms, we construct cognitive
+network models of semantic memory that represent the conceptual knowledge
+possessed by humans and LLMs. We demonstrate how these datasets can be used for
+investigating implicit biases in humans and LLMs, such as the harmful gender
+stereotypes that are prevalent both in society and LLM outputs.
+
+
+
+ comment: 16 pages, 11 figures, associated Github page with dataset available
+ at: https://github.com/LLMWorldOfWords/LWOW
+
+
+
+
+
+
+ ☆ SiTSE: Sinhala Text Simplification Dataset and Evaluation
+
+
+
+
+
+
+
+
+ Surangika Ranathunga, Rumesh Sirithunga, Himashi Rathnayake, Lahiru De Silva, Thamindu Aluthwala, Saman Peramuna, Ravi Shekhar
+
+
+ Text Simplification is a task that has been minimally explored for
+low-resource languages. Consequently, there are only a few manually curated
+datasets. In this paper, we present a human-curated sentence-level text
+simplification dataset for the Sinhala language. Our evaluation dataset
+contains 1,000 complex sentences and corresponding 3,000 simplified sentences
+produced by three different human annotators. We model the text simplification
+task as a zero-shot and zero-resource sequence-to-sequence (seq-seq) task on
+the multilingual language models mT5 and mBART. We exploit auxiliary data from
+related seq-seq tasks and explore the possibility of using intermediate task
+transfer learning (ITTL). Our analysis shows that ITTL outperforms the
+previously proposed zero-resource methods for text simplification. Our findings
+also highlight the challenges in evaluating text simplification systems, and
+support the calls for improved metrics for measuring the quality of automated
+text simplification systems that would suit low-resource languages as well. Our
+code and data are publicly available:
+https://github.com/brainsharks-fyp17/Sinhala-Text-Simplification-Dataset-and-Evaluation
+
+
+
+
+
+
+
+
+ Junjie Oscar Yin, Alexander M. Rush
+
+
+ Data selection can reduce the amount of training data needed to finetune
+LLMs; however, the efficacy of data selection scales directly with its compute.
+Motivated by the practical challenge of compute-constrained finetuning, we
+consider the setting in which both the cost of selecting data and training are
+budgeted for. We first formalize the problem of data selection with a
+cost-aware utility function, and model the data selection problem as trading
+off initial-selection cost for training gain. We run a comprehensive sweep of
+experiments across multiple tasks, varying compute budget by scaling finetuning
+tokens, model sizes, and data selection compute. Interestingly, we find that
+many powerful data selection methods are almost never compute-optimal, and that
+cheaper data selection alternatives dominate both from a theoretical and
+empirical perspective. For compute-optimal training, we find that perplexity
+and gradient data selection require training-to-selection model size ratios of
+5x and 10x, respectively.
+
+
+
+
+
+
+
+ ♻ ☆ RIRAG: Regulatory Information Retrieval and Answer Generation
+
+
+ Regulatory documents, issued by governmental regulatory bodies, establish
+rules, guidelines, and standards that organizations must adhere to for legal
+compliance. These documents, characterized by their length, complexity and
+frequent updates, are challenging to interpret, requiring significant
+allocation of time and expertise on the part of organizations to ensure ongoing
+compliance. Regulatory Natural Language Processing (RegNLP) is a
+multidisciplinary field aimed at simplifying access to and interpretation of
+regulatory rules and obligations. We introduce a task of generating
+question-passages pairs, where questions are automatically created and paired
+with relevant regulatory passages, facilitating the development of regulatory
+question-answering systems. We create the ObliQA dataset, containing 27,869
+questions derived from the collection of Abu Dhabi Global Markets (ADGM)
+financial regulation documents, design a baseline Regulatory Information
+Retrieval and Answer Generation (RIRAG) system and evaluate it with RePASs, a
+novel evaluation metric that tests whether generated answers accurately capture
+all relevant obligations while avoiding contradictions.
+
+
+
+
+
+
+
+ ♻ ☆ What Differentiates Educational Literature? A Multimodal Fusion Approach
+ of Transformers and Computational Linguistics
+
+
+ The integration of new literature into the English curriculum remains a
+challenge since educators often lack scalable tools to rapidly evaluate
+readability and adapt texts for diverse classroom needs. This study proposes to
+address this gap through a multimodal approach that combines transformer-based
+text classification with linguistic feature analysis to align texts with UK Key
+Stages. Eight state-of-the-art Transformers were fine-tuned on segmented text
+data, with BERT achieving the highest unimodal F1 score of 0.75. In parallel,
+500 deep neural network topologies were searched for the classification of
+linguistic characteristics, achieving an F1 score of 0.392. The fusion of these
+modalities shows a significant improvement, with every multimodal approach
+outperforming all unimodal models. In particular, the ELECTRA Transformer fused
+with the neural network achieved an F1 score of 0.996. Unimodal and multimodal
+approaches are shown to have statistically significant differences in all
+validation metrics (accuracy, precision, recall, F1 score) except for inference
+time. The proposed approach is finally encapsulated in a stakeholder-facing web
+application, providing non-technical stakeholder access to real-time insights
+on text complexity, reading difficulty, curriculum alignment, and
+recommendations for learning age range. The application empowers data-driven
+decision making and reduces manual workload by integrating AI-based
+recommendations into lesson planning for English literature.
+
+
+
+
+
+
+
+ ♻ ☆ Artificial intelligence contribution to translation industry: looking
+ back and forward
+
+
+ This study provides a comprehensive analysis of artificial intelligence (AI)
+contribution to translation industry (ACTI) research, synthesizing it over
+forty-one years from 1980-2024. 13220 articles were retrieved from three
+sources, namely WoS, Scopus, and Lens. We provided two types of analysis, viz.,
+scientometric and thematic, focusing on cluster, subject categories, keywords,
+burstness, centrality and research centers as for the former. For the latter,
+we thematically review 18 articles, selected purposefully from the articles
+involved, centering on purpose, approach, findings, and contribution to ACTI
+future directions. The findings reveal that in the past AI contribution to
+translation industry was not rigorous, resulting in rule-based machine
+translation and statistical machine translation whose output was not
+satisfactory. However, the more AI develops, the more machine translation
+develops, incorporating Neural Networking Algorithms and (Deep) Language
+Learning Models like ChatGPT whose translation output has developed
+considerably. Nevertheless, much rigorous research is still needed to overcome
+several problems facing the translation industry, specifically concerning
+low-resource languages, multi-dialectal and free word order languages, and
+cultural and religious registers.
+
+
+
+ comment: 20 pages, 4 figures
+
+
+
+
+
+
+ ♻ ☆ ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
+
+
+
+
+
+
+
+
+ Ezra Karger, Houtan Bastani, Chen Yueh-Han, Zachary Jacobs, Danny Halawi, Fred Zhang, Philip E. Tetlock
+
+
+ Forecasts of future events are essential inputs into informed
+decision-making. Machine learning (ML) systems have the potential to deliver
+forecasts at scale, but there is no framework for evaluating the accuracy of ML
+systems on a standardized set of forecasting questions. To address this gap, we
+introduce ForecastBench: a dynamic benchmark that evaluates the accuracy of ML
+systems on an automatically generated and regularly updated set of 1,000
+forecasting questions. To avoid any possibility of data leakage, ForecastBench
+is comprised solely of questions about future events that have no known answer
+at the time of submission. We quantify the capabilities of current ML systems
+by collecting forecasts from expert (human) forecasters, the general public,
+and LLMs on a random subset of questions from the benchmark ($N=200$). While
+LLMs have achieved super-human performance on many benchmarks, they perform
+less well here: expert forecasters outperform the top-performing LLM (p-value
+$<0.01$). We display system and human scores in a public leaderboard at
+www.forecastbench.org.
+
+
+
+
+
+
+
+ ♻ ☆ Scaling Speech-Text Pre-training with Synthetic Interleaved Data
+
+
+
+
+
+
+
+
+ Aohan Zeng, Zhengxiao Du, Mingdao Liu, Lei Zhang, Shengmin Jiang, Yuxiao Dong, Jie Tang
+
+
+ Speech language models (SpeechLMs) accept speech input and produce speech
+output, allowing for more natural human-computer interaction compared to
+text-based large language models (LLMs). Traditional approaches for developing
+SpeechLMs are constrained by the limited availability of unsupervised speech
+data and parallel speech-text data, which are significantly less abundant than
+text pre-training data, thereby limiting their scalability as LLMs. We propose
+a novel approach to scaling speech-text pre-training by leveraging large-scale
+synthetic interleaved data derived from text corpora, eliminating the need for
+parallel speech-text datasets. Our method efficiently constructs speech-text
+interleaved data by sampling text spans from existing text corpora and
+synthesizing corresponding speech spans using a text-to-token model, bypassing
+the need to generate actual speech. We also employ a supervised speech
+tokenizer derived from an automatic speech recognition (ASR) model by
+incorporating a vector-quantized bottleneck into the encoder. This supervised
+training approach results in discrete speech tokens with strong semantic
+preservation even at lower frame rates (e.g. 12.5Hz), while still maintaining
+speech reconstruction quality. Starting from a pre-trained language model and
+scaling our pre-training to 1 trillion tokens (with 600B synthetic interleaved
+speech-text data), we achieve state-of-the-art performance in speech language
+modeling and spoken question answering, improving performance on spoken
+questions tasks from the previous SOTA of 13% (Moshi) to 31%. We further
+demonstrate that by fine-tuning the pre-trained model with speech dialogue
+data, we can develop an end-to-end spoken chatbot that achieves competitive
+performance comparable to existing baselines in both conversational abilities
+and speech quality, even operating exclusively in the speech domain.
+
+
+
+
+
+
+
+ ♻ ☆ Limits to Predicting Online Speech Using Large Language Models
+
+
+
+
+
+
+
+
+ Mina Remeli, Moritz Hardt, Robert C. Williamson
+
+
+ We study the predictability of online speech on social media, and whether
+predictability improves with information outside a user's own posts. Recent
+theoretical results suggest that posts from a user's social circle are as
+predictive of the user's future posts as the user's own past posts.
+Motivated by the success of large language models, we empirically test this
+hypothesis. We define predictability as a measure of the model's uncertainty,
+i.e., its negative log-likelihood on future tokens given context. As the basis
+of our study, we collect 10M tweets for ``tweet-tuning'' base models and a
+further 6.25M posts from more than five thousand X (previously Twitter) users
+and their peers. Across four large language models ranging in size from 1.5
+billion to 70 billion parameters, we find that predicting a user's posts from
+their peers' posts performs poorly. Moreover, the value of the user's own posts
+for prediction is consistently higher than that of their peers'. We extend our
+investigation with a detailed analysis on what's learned in-context and the
+robustness of our findings. From context, base models learn to correctly
+predict @-mentions and hashtags. Moreover, our results replicate if instead of
+prompting the model with additional context, we finetune on it. Across the
+board, we find that predicting the posts of individual users remains hard.
+
+
+ Recent developments in LLMs offer new opportunities for assisting authors in
+improving their work. In this paper, we envision a use case where authors can
+receive LLM-generated reviews that uncover weak points in the current draft.
+While initial methods for automated review generation already exist, these
+methods tend to produce reviews that lack detail, and they do not cover the
+range of opinions that human reviewers produce. To address this shortcoming, we
+propose an efficient two-stage review generation framework called Reviewer2.
+Unlike prior work, this approach explicitly models the distribution of possible
+aspects that the review may address. We show that this leads to more detailed
+reviews that better cover the range of aspects that human reviewers identify in
+the draft. As part of the research, we generate a large-scale review dataset of
+27k papers and 99k reviews that we annotate with aspect prompts, which we make
+available as a resource for future research.
+
+
+
+
+
+
+
+ ♻ ☆ On Meta-Prompting
+
+
+
+
+
+
+
+
+ Adrian de Wynter, Xun Wang, Qilong Gu, Si-Qing Chen
+
+
+ Modern generative language models are capable of interpreting input strings
+as instructions, or prompts, and of carrying out tasks based on them. Many approaches
+to prompting and pre-training these models involve the automated generation of
+these prompts: meta-prompting, or prompting to obtain prompts. We propose a
+theoretical framework based on category theory to generalize and describe them.
+This framework is flexible enough to account for stochasticity, and allows us
+to obtain formal results around task agnosticity and equivalence of various
+meta-prompting approaches. Experimentally, we test our framework in two active
+areas of model research: creativity and ideation. We find that user preference
+strongly favors (p < 0.01) the prompts generated under meta-prompting, as well
+as their corresponding outputs, over a series of hardcoded baseline prompts
+that include the original task definition. Using our framework, we argue that
+meta-prompting is more effective than basic prompting at generating desirable
+outputs.
+
+
+
+
+
+
+
+
+ Zhihuan Jiang, Zhen Yang, Jinhao Chen, Zhengxiao Du, Weihan Wang, Bin Xu, Jie Tang
+
+
+ Multi-modal large language models (MLLMs) have demonstrated promising
+capabilities across various tasks by integrating textual and visual information
+to achieve visual understanding in complex scenarios. Despite the availability
+of several benchmarks that aim to evaluate MLLMs in tasks from visual question
+answering to complex problem-solving, most focus predominantly on mathematics
+or general visual understanding tasks. This reveals a critical gap in current
+benchmarks, which often overlook the inclusion of other key scientific
+disciplines such as physics and chemistry. To address this gap, we meticulously
+construct a comprehensive benchmark, named VisScience, which is utilized to
+assess the multi-modal scientific reasoning across the three disciplines of
+mathematics, physics, and chemistry. This benchmark comprises 3,000 questions
+drawn from K12 education - spanning elementary school through high school -
+equally distributed across three disciplines, with 1,000 questions per
+discipline. The questions within VisScience span 21 distinct subjects and are
+categorized into five difficulty levels, offering a broad spectrum of topics
+within each discipline. With VisScience, we present a detailed evaluation of
+the performance of 25 representative MLLMs in scientific reasoning.
+Experimental results demonstrate that closed-source MLLMs generally outperform
+open-source models. The best performances observed include a 53.4% accuracy in
+mathematics by Claude3.5-Sonnet, 38.2% in physics by GPT-4o, and 47.0% in
+chemistry by Gemini-1.5-Pro. These results underscore the strengths and
+limitations of MLLMs, suggesting areas for future improvement and highlighting
+the importance of developing models that can effectively handle the diverse
+demands of multi-modal scientific reasoning.
+
+
+
+ comment: 89 pages, 70 figures
+
+
+
+
+
+
+ ♻ ☆ MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large
+ Language Model
+
+
+
+
+
+
+
+
+ Zhen Yang, Jinhao Chen, Zhengxiao Du, Wenmeng Yu, Weihan Wang, Wenyi Hong, Zhihuan Jiang, Bin Xu, Jie Tang
+
+
+ Large language models (LLMs) have demonstrated significant capabilities in
+mathematical reasoning, particularly with text-based mathematical problems.
+However, current multi-modal large language models (MLLMs), especially those
+specialized in mathematics, tend to focus predominantly on solving geometric
+problems but ignore the diversity of visual information available in other
+areas of mathematics. Moreover, the geometric information for these specialized
+mathematical MLLMs is derived from several public datasets, which are typically
+limited in diversity and complexity. To address these limitations, we aim to
+construct a fine-tuning dataset named MathVL, and develop a series of
+specialized mathematical MLLMs termed MathGLM-Vision by conducting Supervised
+Fine-Tuning (SFT) on MathVL with various parameter-scale backbones. To
+extensively evaluate the effectiveness of MathGLM-Vision, we conduct
+experiments on several public benchmarks and our curated MathVL-test consisting
+of 2,000 problems. Experimental results demonstrate that MathGLM-Vision
+achieves significant improvements compared with some existing models, including
+backbone models and open-source mathematical MLLMs. These findings indicate the
+importance of dataset diversity in enhancing the mathematical reasoning
+abilities of MLLMs.
+
+
+
+ comment: 30 pages, 19 figures
+
+
+
+
+
+
+ ♻ ☆ Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis
+
+
+ After the introduction of Large Language Models (LLMs), there have been
+substantial improvements in the performance of Natural Language Generation
+(NLG) tasks, including Text Summarization and Machine Translation. However,
+LLMs still produce outputs containing hallucinations, that is, content not
+grounded in factual information. Therefore, developing methods to assess the
+factuality of LLMs has become urgent.
+ Indeed, resources for factuality evaluation have recently emerged. Although
+challenging, these resources face one or more of the following limitations: (i)
+they are tailored to a specific task or domain; (ii) they are limited in size,
+thereby preventing the training of new factuality evaluators; (iii) they are
+designed for simpler verification tasks, such as claim verification.
+ To address these issues, we introduce LLM-Oasis, to the best of our knowledge
+the largest resource for training end-to-end factuality evaluators. LLM-Oasis
+is constructed by extracting claims from Wikipedia, falsifying a subset of
+these claims, and generating pairs of factual and unfactual texts. We then rely
+on human annotators to both validate the quality of our dataset and to create a
+gold standard test set for benchmarking factuality evaluation systems.
+ Our experiments demonstrate that LLM-Oasis presents a significant challenge
+for state-of-the-art LLMs, with GPT-4o achieving up to 60% accuracy in our
+proposed end-to-end factuality evaluation task, highlighting its potential to
+drive future research in the field.
+
+
+
+ comment: 15 pages. To be submitted to CL journal
+
+
+
+
+
+
+ ♻ ☆ GRDD: A Dataset for Greek Dialectal NLP
+
+
+ In this paper, we present a dataset for the computational study of a number
+of Modern Greek dialects. It consists of raw text data from four dialects of
+Modern Greek: Cretan, Pontic, Northern Greek and Cypriot Greek. The dataset is
+of considerable size, albeit imbalanced, and presents the first attempt to
+create large scale dialectal resources of this type for Modern Greek dialects.
+We then use the dataset to perform dialect identification. We experiment with
+traditional ML algorithms, as well as simple DL architectures. The results show
+very good performance on the task, potentially revealing that the dialects in
+question have distinct enough characteristics allowing even simple ML models to
+perform well on the task. Error analysis is performed for the top performing
+algorithms showing that in a number of cases the errors are due to insufficient
+dataset cleaning.
+
+
+
+
+
+
+
+ ♻ ☆ Cross-Refine: Improving Natural Language Explanation Generation by
+ Learning in Tandem COLING 2025
+
+
+
+
+
+
+
+
+ Qianli Wang, Tatiana Anikina, Nils Feldhus, Simon Ostermann, Sebastian Möller, Vera Schmitt
+
+
+ Natural language explanations (NLEs) are vital for elucidating the reasoning
+behind large language model (LLM) decisions. Many techniques have been
+developed to generate NLEs using LLMs. However, like humans, LLMs might not
+always produce optimal NLEs on the first attempt. Inspired by human learning
+processes, we introduce Cross-Refine, which employs role modeling by deploying
+two LLMs as generator and critic, respectively. The generator outputs a first
+NLE and then refines this initial explanation using feedback and suggestions
+provided by the critic. Cross-Refine does not require any supervised training
+data or additional training. We validate Cross-Refine across three NLP tasks
+using three state-of-the-art open-source LLMs through automatic and human
+evaluation. We select Self-Refine (Madaan et al., 2023) as the baseline, which
+only utilizes self-feedback to refine the explanations. Our findings from
+automatic evaluation and a user study indicate that Cross-Refine outperforms
+Self-Refine. Meanwhile, Cross-Refine can perform effectively with less powerful
+LLMs, whereas Self-Refine only yields strong results with ChatGPT.
+Additionally, we conduct an ablation study to assess the importance of feedback
+and suggestions. Both of them play an important role in refining explanations.
+We further evaluate Cross-Refine on a bilingual dataset in English and German.
+
+
+
+ comment: Accepted at COLING 2025; long paper
+
+
+
+
+
+
+ ♻ ☆ Combining Induction and Transduction for Abstract Reasoning
+
+
+
+
+
+
+
+
+ Wen-Ding Li, Keya Hu, Carter Larsen, Yuqing Wu, Simon Alford, Caleb Woo, Spencer M. Dunn, Hao Tang, Michelangelo Naim, Dat Nguyen, Wei-Long Zheng, Zenna Tavares, Yewen Pu, Kevin Ellis
+
+
+ When learning an input-output mapping from very few examples, is it better to
+first infer a latent function that explains the examples, or is it better to
+directly predict new test outputs, e.g. using a neural network? We study this
+question on ARC by training neural models for induction (inferring latent
+functions) and transduction (directly predicting the test output for a given
+test input). We train on synthetically generated variations of Python programs
+that solve ARC training tasks. We find inductive and transductive models solve
+different kinds of test problems, despite having the same training problems and
+sharing the same neural architecture: Inductive program synthesis excels at
+precise computations, and at composing multiple concepts, while transduction
+succeeds on fuzzier perceptual concepts. Ensembling them approaches human-level
+performance on ARC.
+
+
+
+
+
+
+
+ ♻ ☆ Differentially Private Zeroth-Order Methods for Scalable Large Language
+ Model Finetuning
+
+
+
+
+
+
+
+
+ Z Liu, J Lou, W Bao, Y Hu, B Li, Z Qin, K Ren
+
+
+ Fine-tuning on task-specific datasets is a widely-embraced paradigm of
+harnessing the powerful capability of pretrained LLMs for various downstream
+tasks. Due to the popularity of LLM fine-tuning and its accompanying privacy
+concerns, differentially private (DP) fine-tuning of pretrained LLMs has been
+widely used to safeguard the privacy of task-specific datasets. Lying at the
+design core of DP LLM fine-tuning methods is the satisfactory tradeoff among
+privacy, utility, and scalability. Most existing methods build upon the seminal
+work of DP-SGD. Despite pushing the scalability of DP-SGD to its limit,
+DP-SGD-based fine-tuning methods are unfortunately limited by the inherent
+inefficiency of SGD.
+ In this paper, we investigate the potential of DP zeroth-order methods for
+LLM pretraining, which avoids the scalability bottleneck of SGD by
+approximating the gradient with the more efficient zeroth-order gradient.
+Rather than treating the zeroth-order method as a drop-in replacement for SGD,
+this paper presents a comprehensive study both theoretically and empirically.
+First, we propose the stagewise DP zeroth-order method (DP-ZOSO) that
+dynamically schedules key hyperparameters. This design is grounded on the
+synergy between DP random perturbation and the gradient approximation error of
+the zeroth-order method, and its effect on fine-tuning trajectory.
+ We provide theoretical analysis for both proposed methods. We conduct
+extensive empirical analysis on both encoder-only masked language model and
+decoder-only autoregressive language model, achieving impressive results in
+terms of scalability and utility regardless of the class of tasks (compared
+with DPZero, DP-ZOPO improves $4.5\%$ on SST-5, $5.5\%$ on MNLI with
+RoBERTa-Large and $9.2\%$ on CB, $3.9\%$ on BoolQ with OPT-2.7b when $\epsilon=4$,
+demonstrating more significant enhancement in performance on more complicated
+tasks).
+
+
+
+
+
+
+
+ ♻ ☆ Real-time Transformer-based Open-Vocabulary Detection with Efficient
+ Fusion Head
+
+
+
+
+
+
+
+
+ Tiancheng Zhao, Peng Liu, Xuan He, Lu Zhang, Kyusong Lee
+
+
+ End-to-end transformer-based detectors (DETRs) have shown exceptional
+performance in both closed-set and open-vocabulary object detection (OVD) tasks
+through the integration of language modalities. However, their demanding
+computational requirements have hindered their practical application in
+real-time object detection (OD) scenarios. In this paper, we scrutinize the
+limitations of two leading models in the OVDEval benchmark, OmDet and
+Grounding-DINO, and introduce OmDet-Turbo. This novel transformer-based
+real-time OVD model features an innovative Efficient Fusion Head (EFH) module
+designed to alleviate the bottlenecks observed in OmDet and Grounding-DINO.
+Notably, OmDet-Turbo-Base achieves 100.2 frames per second (FPS) with
+TensorRT and language cache techniques applied. Moreover, in zero-shot scenarios
+on COCO and LVIS datasets, OmDet-Turbo achieves performance levels nearly on
+par with current state-of-the-art supervised models. Furthermore, it
+establishes new state-of-the-art benchmarks on ODinW and OVDEval, boasting an
+AP of 30.1 and an NMS-AP of 26.86, respectively. The practicality of
+OmDet-Turbo in industrial applications is underscored by its exceptional
+performance on benchmark datasets and superior inference speed, positioning it
+as a compelling choice for real-time object detection tasks. Code:
+https://github.com/om-ai-lab/OmDet
+
+
+
+
+
+
+
+
+ Eric Tang, Bangding Yang, Xingyou Song
+
+
+ With the rise of large language models (LLMs) for flexibly processing
+information as strings, a natural application is regression, specifically by
+preprocessing string representations into LLM embeddings as downstream features
+for metric prediction. In this paper, we provide one of the first comprehensive
+investigations into embedding-based regression and demonstrate that LLM
+embeddings as features can be better for high-dimensional regression tasks than
+using traditional feature engineering. This regression performance can be
+explained in part by LLM embeddings over numeric data inherently preserving
+Lipschitz continuity over the feature space. Furthermore, we quantify the
+contribution of different model effects, most notably model size and language
+understanding, which we find surprisingly do not always improve regression
+performance.
+
+
+
+ comment: 16 pages, 13 figures
+
+
+
+
+
+
+ ♻ ☆ Fine Tuning vs. Retrieval Augmented Generation for Less Popular
+ Knowledge
+
+
+ Language Models (LMs) memorize a vast amount of factual knowledge, exhibiting
+strong performance across diverse tasks and domains. However, it has been
+observed that the performance diminishes when dealing with less-popular or
+low-frequency concepts and entities, for example in domain-specific
+applications. The two prominent approaches to enhance the performance of LMs on
+low-frequency topics are: Retrieval Augmented Generation (RAG) and fine-tuning
+(FT) over synthetic data. This paper explores and evaluates the impact of RAG
+and FT on customizing LMs in handling low-frequency entities on question
+answering tasks. We conduct extensive experiments on twelve LMs of varying size
+and type and different fine tuning, data augmentation, and retrieval models.
+Our findings indicate that while FT boosts the performance across entities of
+varying popularity, RAG surpasses FT by a large margin particularly for least
+popular factual knowledge. Additionally, the success of both RAG and FT
+approaches is amplified by improving retrieval and data augmentation
+techniques. Fine tuning, while beneficial for small LMs, requires extensive
+resources. To address this issue, we propose the new Stimulus RAG approach that
+surpasses the effectiveness of fine tuning based approaches, thereby
+eliminating the need for the costly data augmentation and fine tuning step for
+enriching LMs with less popular factual knowledge. The code is available at
+https://github.com/informagi/RAGvsFT.
+
+
+
+
+
+
+
+ ♻ ☆ Dual-Personalizing Adapter for Federated Foundation Models
+
+
+
+
+
+
+
+
+ Yiyuan Yang, Guodong Long, Tao Shen, Jing Jiang, Michael Blumenstein
+
+
+ Recently, foundation models, particularly large language models (LLMs), have
+demonstrated an impressive ability to adapt to various tasks by fine-tuning on
+diverse instruction data. Notably, federated foundation models (FedFM) emerge
+as a privacy preservation method to fine-tune models collaboratively under
+federated learning (FL) settings by leveraging many distributed datasets with
+non-IID data. To alleviate communication and computation overhead,
+parameter-efficient methods are introduced for efficiency, and some research
+adapted personalization methods to FedFM for better user preferences alignment.
+However, a critical gap in existing research is the neglect of test-time
+distribution shifts in real-world applications, and conventional methods for
+test-time distribution shifts in personalized FL are less effective for FedFM
+due to their failure to adapt to complex distribution shift scenarios and the
+requirement to train all parameters. To bridge this gap, we refine the setting
+in FedFM, termed test-time personalization, which aims to learn personalized
+federated foundation models on clients while effectively handling test-time
+distribution shifts simultaneously. To address challenges in this setting, we
+explore a simple yet effective solution, a Federated Dual-Personalizing Adapter
+(FedDPA) architecture. By co-working with a foundation model, a global adapter
+and a local adapter jointly tackle the test-time distribution shifts and
+client-specific personalization. Additionally, we introduce an instance-wise
+dynamic weighting mechanism that dynamically integrates the global and local
+adapters for each test instance during inference, facilitating effective
+test-time personalization. The effectiveness of the proposed method has been
+evaluated on benchmark datasets across different NLP tasks.
+
+
+ Recognizing visual entities in a natural language sentence and arranging them
+in a 2D spatial layout require a compositional understanding of language and
+space. This task of layout prediction is valuable in text-to-image synthesis as
+it allows localized and controlled in-painting of the image. In this
+comparative study it is shown that we can predict layouts from language
+representations that implicitly or explicitly encode sentence syntax, if the
+sentences mention similar entity-relationships to the ones seen during
+training. To test compositional understanding, we collect a test set of
+grammatically correct sentences and layouts describing compositions of entities
+and relations that are unlikely to have been seen during training. Performance on this
+test set substantially drops, showing that current models rely on correlations
+in the training data and have difficulties in understanding the structure of
+the input sentences. We propose a novel structural loss function that better
+enforces the syntactic structure of the input sentence and show large
+performance gains in the task of 2D spatial layout prediction conditioned on
+text. The loss has the potential to be used in other generation tasks where a
+tree-like structure underlies the conditioning modality. Code, trained models
+and the USCOCO evaluation set are available via GitHub.
+
+
+
+ comment: Published in TACL
+
+
+
+
+
+
+ ♻ ☆ MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated
+ Learning WACV 2025
+
+
+
+
+
+
+
+
+ Jianyi Zhang, Hao Frank Yang, Ang Li, Xin Guo, Pu Wang, Haiming Wang, Yiran Chen, Hai Li
+
+
+ Previous studies on federated learning (FL) often encounter performance
+degradation due to data heterogeneity among different clients. In light of the
+recent advances in multimodal large language models (MLLMs), such as GPT-4v and
+LLaVA, which demonstrate exceptional proficiency in multimodal tasks
+such as image captioning and multimodal question answering, we introduce a
+novel federated learning framework, named Multimodal Large Language Model
+Assisted Federated Learning (MLLM-LLaVA-FL), which employs powerful MLLMs at
+the server end to address the heterogeneous and long-tailed challenges. Owing
+to the advanced cross-modality representation capabilities and the extensive
+open-vocabulary prior knowledge of MLLMs, our framework is adept at harnessing
+the extensive, yet previously underexploited, open-source data accessible from
+websites and powerful server-side computational resources. Hence, the
+MLLM-LLaVA-FL not only enhances the performance but also avoids increasing the
+risk of privacy leakage and the computational burden on local devices,
+distinguishing it from prior methodologies. Our framework has three key stages.
+Initially, we conduct global visual-text pretraining of the model. This
+pretraining is facilitated by utilizing the extensive open-source data
+available online, with the assistance of MLLMs. Subsequently, the pretrained
+model is distributed among various clients for local training. Finally, once
+the locally trained models are transmitted back to the server, a global
+alignment is carried out under the supervision of MLLMs to further enhance the
+performance. Experimental evaluations on established benchmarks show that our
+framework delivers promising performance in the typical scenarios with data
+heterogeneity and long-tail distribution across different clients in FL.
+
+
+
+
+
+
+
+
+ Ya Gao, Hans Moen, Saila Koivusalo, Miika Koskinen, Pekka Marttinen
+
+
+ Nursing notes, an important part of Electronic Health Records (EHRs), track a
+patient's health during a care episode. Summarizing key information in nursing
+notes can help clinicians quickly understand patients' conditions. However,
+existing summarization methods in the clinical setting, especially abstractive
+methods, have overlooked nursing notes and require reference summaries for
+training. We introduce QGSumm, a novel query-guided self-supervised domain
+adaptation approach for abstractive nursing note summarization. The method uses
+patient-related clinical queries for guidance, and hence does not need
+reference summaries for training. Through automatic experiments and manual
+evaluation by an expert clinician, we study our approach and other
+state-of-the-art Large Language Models (LLMs) for nursing note summarization.
+Our experiments show: 1) GPT-4 is competitive in maintaining information in the
+original nursing notes, 2) QGSumm can generate high-quality summaries with a
+good balance between recall of the original content and a hallucination rate
+lower than that of other top methods. Ultimately, our work offers a new perspective on
+conditional text summarization, tailored to clinical applications.
+
+
+
+
+
+
+
+ ♻ ☆ Deep Learning and Machine Learning, Advancing Big Data Analytics and
+ Management: Object-Oriented Programming
+
+
+
+
+
+
+
+
+ Tianyang Wang, Ziqian Bi, Keyu Chen, Jiawei Xu, Qian Niu, Junyu Liu, Benji Peng, Ming Li, Sen Zhang, Xuanhe Pan, Jinlang Wang, Pohsun Feng, Caitlyn Heqi Yin, Yizhu Wen, Ming Liu
+
+
+ Object-Oriented Programming (OOP) has become a crucial paradigm for managing
+the growing complexity of modern software systems, particularly in fields like
+machine learning, deep learning, large language models (LLM), and data
+analytics. This work provides a comprehensive introduction to the integration
+of OOP techniques within these domains, with a focus on improving code
+modularity, maintainability, and scalability. We begin by outlining the
+evolution of computing and the rise of OOP, followed by an in-depth discussion
+of key OOP principles such as encapsulation, inheritance, polymorphism, and
+abstraction. The practical application of these principles is demonstrated
+using Python, a widely adopted language in AI and data science. Furthermore, we
+examine how design patterns and modular programming can be employed to enhance
+the structure and efficiency of machine learning systems. In subsequent
+sections, we apply these OOP concepts to real-world AI tasks, including the
+encapsulation of preprocessing workflows, machine learning model training, and
+evaluation. Detailed examples illustrate how OOP can be used to build reusable,
+scalable machine learning systems while maintaining code clarity and reducing
+redundancy. This work is intended to serve as a bridge for both beginners and
+experienced developers, equipping them with the necessary knowledge to apply
+OOP methodologies in AI-driven projects, ultimately fostering the development
+of more robust and maintainable systems.
+
+
+
+ comment: 49 pages
+
+
+
+
+
+
+ ♻ ☆ ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language
+ Models
+
+
+ Hallucination poses a persistent challenge for multimodal large language
+models (MLLMs). However, existing benchmarks for evaluating hallucinations are
+generally static, which may overlook the potential risk of data contamination.
+To address this issue, we propose ODE, an open-set, dynamic protocol designed
+to evaluate object hallucinations in MLLMs at both the existence and attribute
+levels. ODE employs a graph-based structure to represent real-world object
+concepts, their attributes, and the distributional associations between them.
+This structure facilitates the extraction of concept combinations based on
+diverse distributional criteria, generating varied samples for structured
+queries that evaluate hallucinations in both generative and discriminative
+tasks. Through the generation of new samples, dynamic concept combinations, and
+varied distribution frequencies, ODE mitigates the risk of data contamination
+and broadens the scope of evaluation. This protocol is applicable to both
+general and specialized scenarios, including those with limited data.
+Experimental results demonstrate the effectiveness of our protocol, revealing
+that MLLMs exhibit higher hallucination rates when evaluated with ODE-generated
+samples, which indicates potential data contamination. Furthermore, these
+generated samples aid in analyzing hallucination patterns and fine-tuning
+models, offering an effective approach to mitigating hallucinations in MLLMs.
+
+
+
+
+
+
+
+ ♻ ☆ Efficient Prompting Methods for Large Language Models: A Survey
+
+
+ Prompting is a mainstream paradigm for adapting large language models to
+specific natural language processing tasks without modifying internal
+parameters. Therefore, detailed supplementary knowledge needs to be integrated
+into external prompts, which inevitably brings extra human efforts and
+computational burdens for practical applications. As an effective solution to
+mitigate resource consumption, Efficient Prompting Methods have attracted a
+wide range of attention. We provide mathematical expressions at a high level to
+deeply discuss Automatic Prompt Engineering for different prompt components and
+Prompt Compression in continuous and discrete spaces. Finally, we highlight
+promising future directions to inspire researchers interested in this field.
+
+
+
+
+
+
+
+ ♻ ☆ Knowledge Entropy Decay during Language Model Pretraining Hinders New
+ Knowledge Acquisition
+
+
+
+
+
+
+
+
+ Jiyeon Kim, Hyunji Lee, Hyowon Cho, Joel Jang, Hyeonbin Hwang, Seungpil Won, Youbin Ahn, Dohaeng Lee, Minjoon Seo
+
+
+ In this work, we investigate how a model's tendency to broadly integrate its
+parametric knowledge evolves throughout pretraining, and how this behavior
+affects overall performance, particularly in terms of knowledge acquisition and
+forgetting. We introduce the concept of knowledge entropy, which quantifies the
+range of memory sources the model engages with; high knowledge entropy
+indicates that the model utilizes a wide range of memory sources, while low
+knowledge entropy suggests reliance on specific sources with greater certainty.
+Our analysis reveals a consistent decline in knowledge entropy as pretraining
+advances. We also find that the decline is closely associated with a reduction
+in the model's ability to acquire and retain knowledge, leading us to conclude
+that diminishing knowledge entropy (smaller number of active memory sources)
+impairs the model's knowledge acquisition and retention capabilities. We find
+further support for this by demonstrating that increasing the activity of
+inactive memory sources enhances the model's capacity for knowledge acquisition
+and retention.
+
+
+
+
+
+
+
+ ♻ ☆ GLaPE: Gold Label-agnostic Prompt Evaluation and Optimization for Large
+ Language Model EMNLP 2024
+
+
+ Despite the rapid progress of large language models (LLMs), their task
+performance remains sensitive to prompt design. Recent studies have explored
+leveraging the LLM itself as an optimizer to identify optimal prompts that
+maximize task accuracy. However, when evaluating prompts, such approaches
+heavily rely on elusive manually annotated gold labels to calculate task
+accuracy for each candidate prompt, which hinders the widespread implementation
+and generality. To overcome the limitation, this work proposes a gold
+label-agnostic prompt evaluation (GLaPE) to alleviate dependence on gold
+labels. Motivated by the observed correlation between self-consistency and the
+accuracy of the answer, we adopt self-consistency as the initial evaluation
+score. Subsequently, we refine the scores of prompts producing identical
+answers to be mutually consistent. Experimental results show that GLaPE
+provides reliable evaluations uniform with accuracy, even in the absence of
+gold labels. Moreover, on six popular reasoning tasks, our GLaPE-based prompt
+optimization yields effective prompts comparable to accuracy-based ones. The
+code is publicly available at https://github.com/thunderous77/GLaPE.
+
+
+
+ comment: EMNLP 2024
+
+
+
+
+
+
+ ♻ ☆ From Pixels to Insights: A Survey on Automatic Chart Understanding in
+ the Era of Large Foundation Models
+
+
+
+
+
+
+
+
+ Kung-Hsiang Huang, Hou Pong Chan, Yi R. Fung, Haoyi Qiu, Mingyang Zhou, Shafiq Joty, Shih-Fu Chang, Heng Ji
+
+
+ Data visualization in the form of charts plays a pivotal role in data
+analysis, offering critical insights and aiding in informed decision-making.
+Automatic chart understanding has witnessed significant advancements with the
+rise of large foundation models in recent years. Foundation models, such as
+large language models, have revolutionized various natural language processing
+tasks and are increasingly being applied to chart understanding tasks. This
+survey paper provides a comprehensive overview of the recent developments,
+challenges, and future directions in chart understanding within the context of
+these foundation models. We review fundamental building blocks crucial for
+studying chart understanding tasks. Additionally, we explore various tasks and
+their evaluation metrics and sources of both charts and textual inputs. Various
+modeling strategies are then examined, encompassing both classification-based
+and generation-based approaches, along with tool augmentation techniques that
+enhance chart understanding performance. Furthermore, we discuss the
+state-of-the-art performance of each task and discuss how we can improve the
+performance. Challenges and future directions are addressed, highlighting the
+importance of several topics, such as domain-specific charts, lack of efforts
+in developing evaluation metrics, and agent-oriented settings. This survey
+paper serves as a comprehensive resource for researchers and practitioners in
+the fields of natural language processing, computer vision, and data analysis,
+providing valuable insights and directions for future research in chart
+understanding leveraging large foundation models. The studies mentioned in this
+paper, along with emerging new research, will be continually updated at:
+https://github.com/khuangaf/Awesome-Chart-Understanding.
+
+
+
+ comment: IEEE Transactions on Knowledge and Data Engineering (TKDE)
+
+
+
+
+
+
+ ♻ ☆ T2Vid: Translating Long Text into Multi-Image is the Catalyst for
+ Video-LLMs
+
+
+ The success of Multimodal Large Language Models (MLLMs) in the image domain
+has garnered wide attention from the research community. Drawing on previous
+successful experiences, researchers have recently explored extending the
+success to the video understanding realms. Apart from training from scratch, an
+efficient way is to utilize the pre-trained image-LLMs, leading to two
+mainstream approaches, i.e. zero-shot inference and further fine-tuning with
+video data. In this work, our study of these approaches harvests an effective
+data augmentation method. We first make a deeper inspection of the zero-shot
+inference way and identify two limitations, i.e. limited generalization and
+lack of temporal understanding capabilities. Thus, we further investigate the
+fine-tuning approach and find a low learning efficiency when simply using all
+the video data samples, which can be attributed to a lack of instruction
+diversity. Aiming at this issue, we develop a method called T2Vid to synthesize
+video-like samples to enrich the instruction diversity in the training corpus.
+Integrating these data enables a simple and efficient training scheme, which
+achieves performance comparable to or even superior to using full video
+datasets by training with just 15% of the sample size. Meanwhile, we find that the
+proposed scheme can boost the performance of long video understanding without
+training with long video samples. We hope our study will spark more thinking
+about using MLLMs for video understanding and curation of high-quality data.
+The code is released at https://github.com/xjtupanda/T2Vid.
+
+
+ Low-Rank Adaptation (LoRA) is a popular technique for parameter-efficient
+fine-tuning of Large Language Models (LLMs). We study how different LoRA
+modules can be merged to achieve skill composition -- testing the performance
+of the merged model on a target task that involves combining multiple skills,
+each skill coming from a single LoRA. This setup is favorable when it is
+difficult to obtain training data for the target task and when it can be
+decomposed into multiple skills. First, we identify practically occurring
+use-cases that can be studied under the realm of skill composition, e.g.
+solving hard math-word problems with code, creating a bot to answer questions
+on proprietary manuals or about domain-specialized corpora. Our main
+contribution is to show that concatenation of LoRAs (CAT), which optimally
+weights LoRAs that were individually trained on different skills, outperforms
+existing model- and data-merging techniques; for instance on math-word
+problems, CAT beats these methods by an average of 43% and 12% respectively.
+Thus, this paper advocates model merging as an efficient way to solve
+compositional tasks and underscores CAT as a simple, compute-friendly and
+effective procedure. To our knowledge, this is the first work demonstrating the
+superiority of model merging over data mixing for binary skill composition
+tasks. Code and data are available at https://github.com/aksh555/LoRA-Soups
+
+
+
+ comment: COLING 2025 Industry track; 9 pages plus references and appendices
+
+ Large Language Models (LLMs) have exhibited remarkable performance on
+reasoning tasks. They utilize autoregressive token generation to construct
+reasoning trajectories, enabling the development of a coherent chain of
+thought. In this work, we explore the impact of individual tokens on the final
+outcomes of reasoning tasks. We identify the existence of ``critical tokens''
+that lead to incorrect reasoning trajectories in LLMs. Specifically, we find
+that LLMs tend to produce positive outcomes when forced to decode other tokens
+instead of critical tokens. Motivated by this observation, we propose a novel
+approach - cDPO - designed to automatically recognize and conduct token-level
+rewards for the critical tokens during the alignment process. Specifically, we
+develop a contrastive estimation approach to automatically identify critical
+tokens. It is achieved by comparing the generation likelihood of positive and
+negative models. To achieve this, we separately fine-tune the positive and
+negative models on various reasoning trajectories; consequently, they are
+capable of identifying critical tokens within incorrect trajectories
+that contribute to erroneous outcomes. Moreover, to further align the model
+with the critical token information during the alignment process, we extend the
+conventional DPO algorithms to token-level DPO and utilize the differential
+likelihood from the aforementioned positive and negative models as an important
+weight for token-level DPO learning. Experimental results on GSM8K and MATH500
+benchmarks with two widely used models, Llama-3 (8B and 70B) and deepseek-math
+(7B), demonstrate the effectiveness of the proposed approach cDPO.
+
+
+
+ comment: Work in progress
+
+
+
+
+
+
+ ♻ ☆ Self and Cross-Model Distillation for LLMs: Effective Methods for
+ Refusal Pattern Alignment
+
+
+ Large Language Models (LLMs) like OpenAI's GPT series, Anthropic's Claude,
+and Meta's LLaMa have shown remarkable capabilities in text generation.
+However, their susceptibility to toxic prompts presents significant security
+challenges. This paper investigates alignment techniques, including Supervised
+Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), to
+mitigate these risks. We conduct an empirical study on refusal patterns across
+nine LLMs, revealing that models with uniform refusal patterns, such as
+Claude3, exhibit higher security. Based on these findings, we propose
+self-distilling and cross-model distilling methods to enhance LLM security. Our
+results show that these methods significantly improve refusal rates and reduce
+unsafe content, with cross-model distilling achieving refusal rates close to
+Claude3's 94.51%. These findings underscore the potential of distillation-based
+alignment in securing LLMs against toxic prompts.
+
+
+
+ comment: The method used in the paper has obvious problems and ambiguities.
+ The security enhancement method we used cannot be considered distillation,
+ but it is described as distillation in the paper, and the experiment lacks
+ comparison and baseline, which has been criticized by many peers. In order to
+ avoid further dissemination, we have decided to withdraw the paper
+
+
+
+
+
+
+ ♻ ☆ Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
+
+
+ Vision-language models (VLMs) have shown remarkable advancements in
+multimodal reasoning tasks. However, they still often generate inaccurate or
+irrelevant responses due to issues like hallucinated image understandings or
+unrefined reasoning paths. To address these challenges, we introduce Critic-V,
+a novel framework inspired by the Actor-Critic paradigm to boost the reasoning
+capability of VLMs. This framework decouples the reasoning process and critic
+process by integrating two independent components: the Reasoner, which
+generates reasoning paths based on visual and textual inputs, and the Critic,
+which provides constructive critique to refine these paths. In this approach,
+the Reasoner generates reasoning responses according to text prompts, which can
+evolve iteratively as a policy based on feedback from the Critic. This
+interaction process was theoretically driven by a reinforcement learning
+framework where the Critic offers natural language critiques instead of scalar
+rewards, enabling more nuanced feedback to boost the Reasoner's capability on
+complex reasoning tasks. The Critic model is trained using Direct Preference
+Optimization (DPO), leveraging a preference dataset of critiques ranked by
+Rule-based Reward (RBR) to enhance its critic capabilities. Evaluation results
+show that the Critic-V framework significantly outperforms existing methods,
+including GPT-4V, on 5 out of 8 benchmarks, especially regarding reasoning
+accuracy and efficiency. Combining a dynamic text-based policy for the Reasoner
+and constructive feedback from the preference-optimized Critic enables a more
+reliable and context-sensitive multimodal reasoning process. Our approach
+provides a promising solution to enhance the reliability of VLMs, improving
+their performance in real-world reasoning-heavy multimodal applications such as
+autonomous driving and embodied intelligence.
+
+
+
+ comment: 16 pages, 11 figures
+
+
+
+
+
+
+ ♻ ☆ Mitigating Bias in Queer Representation within Large Language Models: A
+ Collaborative Agent Approach NeurIPS 2024
+
+
+ Large Language Models (LLMs) often perpetuate biases in pronoun usage,
+leading to misrepresentation or exclusion of queer individuals. This paper
+addresses the specific problem of biased pronoun usage in LLM outputs,
+particularly the inappropriate use of traditionally gendered pronouns ("he,"
+"she") when inclusive language is needed to accurately represent all
+identities. We introduce a collaborative agent pipeline designed to mitigate
+these biases by analyzing and optimizing pronoun usage for inclusivity. Our
+multi-agent framework includes specialized agents for both bias detection and
+correction. Experimental evaluations using the Tango dataset -- a benchmark
+focused on gender pronoun usage -- demonstrate that our approach significantly
+improves inclusive pronoun classification, achieving a 32.6 percentage point
+increase over GPT-4o in correctly disagreeing with inappropriate traditionally
+gendered pronouns $(\chi^2 = 38.57, p < 0.0001)$. These results accentuate the
+potential of agent-driven frameworks in enhancing fairness and inclusivity in
+AI-generated content, demonstrating their efficacy in reducing biases and
+promoting socially responsible AI.
+
+
+
+ comment: NeurIPS 2024 Queer in AI Workshop
+
+
+
+
+
+
+ ♻ ☆ Towards Understanding Domain Adapted Sentence Embeddings for Document
+ Retrieval
+
+
+
+
+
+
+
+
+ Sujoy Roychowdhury, Sumit Soman, H. G. Ranjani, Vansh Chhabra, Neeraj Gunda, Shashank Gautam, Subhadip Bandyopadhyay, Sai Krishna Bala
+
+
+ A plethora of sentence embedding models makes it challenging to choose one,
+especially for technical domains rich with specialized vocabulary. In this
+work, we domain adapt embeddings using telecom, health and science datasets for
+question answering. We evaluate embeddings obtained from publicly available
+models and their domain-adapted variants on both point retrieval accuracies
+and their (95%) confidence intervals. We establish a systematic method
+to obtain thresholds for similarity scores for different embeddings. As
+expected, we observe that fine-tuning improves mean bootstrapped accuracies. We
+also observe that it results in tighter confidence intervals, which further
+improve when pre-training is preceded by fine-tuning. We introduce metrics
+which measure the distributional overlaps of top-$K$, correct and random
+document similarities with the question. Further, we show that these metrics
+are correlated with retrieval accuracy and similarity thresholds. Recent
+literature shows conflicting effects of isotropy on retrieval accuracies. Our
+experiments establish that the isotropy of embeddings (as measured by two
+independent state-of-the-art isotropy metric definitions) is poorly correlated
+with retrieval performance. We show that embeddings for domain-specific
+sentences have little overlap with those for domain-agnostic ones, and
+fine-tuning moves them further apart. Based on our results, we provide
+recommendations for use of our methodology and metrics by researchers and
+practitioners.
+
+
+
+
+
+
+
+
+ Christoph Minixhofer, Ondřej Klejch, Peter Bell
+
+
+ Many recently published Text-to-Speech (TTS) systems produce audio close to
+real speech. However, TTS evaluation needs to be revisited to make sense of the
+results obtained with the new architectures, approaches and datasets. We
+propose evaluating the quality of synthetic speech as a combination of multiple
+factors such as prosody, speaker identity, and intelligibility. Our approach
+assesses how well synthetic speech mirrors real speech by obtaining correlates
+of each factor and measuring their distance from both real speech datasets and
+noise datasets. We benchmark 35 TTS systems developed between 2008 and 2024 and
+show that our score computed as an unweighted average of factors strongly
+correlates with the human evaluations from each time period.
+
+
+
+ comment: SLT 2024
+
+
+
+
+
+
+ ♻ ☆ A Statistical Framework of Watermarks for Large Language Models: Pivot,
+ Detection Efficiency and Optimal Rules
+
+
+
+
+
+
+
+
+ Xiang Li, Feng Ruan, Huiyuan Wang, Qi Long, Weijie J. Su
+
+
+ Since ChatGPT was introduced in November 2022, embedding (nearly)
+unnoticeable statistical signals into text generated by large language models
+(LLMs), also known as watermarking, has been used as a principled approach to
+provable detection of LLM-generated text from its human-written counterpart. In
+this paper, we introduce a general and flexible framework for reasoning about
+the statistical efficiency of watermarks and designing powerful detection
+rules. Inspired by the hypothesis testing formulation of watermark detection,
+our framework starts by selecting a pivotal statistic of the text and a secret
+key -- provided by the LLM to the verifier -- to enable controlling the false
+positive rate (the error of mistakenly detecting human-written text as
+LLM-generated). Next, this framework allows one to evaluate the power of
+watermark detection rules by obtaining a closed-form expression of the
+asymptotic false negative rate (the error of incorrectly classifying
+LLM-generated text as human-written). Our framework further reduces the problem
+of determining the optimal detection rule to solving a minimax optimization
+program. We apply this framework to two representative watermarks -- one of
+which has been internally implemented at OpenAI -- and obtain several findings
+that can be instrumental in guiding the practice of implementing watermarks. In
+particular, we derive optimal detection rules for these watermarks under our
+framework. These theoretically derived detection rules are demonstrated to be
+competitive and sometimes enjoy a higher power than existing detection
+approaches through numerical experiments.
+
+
+
+ comment: To appear in the Annals of Statistics
+
+
+
+
+
+
+ ♻ ☆ Language Models Benefit from Preparation with Elicited Knowledge
+
+
+
+
+
+
+
+
+ Jiacan Yu, Hannah An, Lenhart K. Schubert
+
+
+ The zero-shot chain of thought (CoT) approach is often used in question
+answering (QA) by language models (LMs) for tasks that require multiple
+reasoning steps. However, some QA tasks hinge more on accessing relevant
+knowledge than on chaining reasoning steps. We introduce a simple prompting
+technique, called PREP, that involves using two instances of LMs: the first
+(LM1) generates relevant information, and the second (LM2) receives the
+information from the user and answers the question. This design is intended to
+make better use of the LM's instruction-following capability. PREP is
+applicable across various QA tasks without domain-specific prompt engineering.
+PREP is developed on a dataset of 100 QA questions, derived from an extensive
+schematic dataset specifying artifact parts and material composition. These
+questions ask which of two artifacts is less likely to share materials with
+another artifact. Such questions probe the LM's knowledge of shared materials
+in the part structure of different artifacts. We test our method on our
+parts-and-materials dataset and three published commonsense reasoning datasets.
+The average accuracy of our method is consistently higher than that of all the
+other tested methods across all the tested datasets.
+
+
+
+
+
+
+
+ ♻ ☆ VersaTune: An Efficient Data Composition Framework for Training
+ Multi-Capability LLMs
+
+
+
+
+
+
+
+
+ Keer Lu, Keshi Zhao, Zheng Liang, Da Pan, Shusen Zhang, Xin Wu, Weipeng Chen, Zenan Zhou, Guosheng Dong, Bin Cui, Wentao Zhang
+
+
+ Large-scale pretrained models, particularly Large Language Models (LLMs),
+have exhibited remarkable capabilities in handling multiple tasks across
+domains due to their emergent properties. These capabilities are further
+augmented during the Supervised Fine-Tuning (SFT) phase. Despite their
+potential, existing work mainly focuses on domain-specific enhancements during
+fine-tuning, the challenge of which lies in catastrophic forgetting of
+knowledge across other domains. In this study, we introduce VersaTune, a novel
+data composition framework designed for enhancing LLMs' overall multi-ability
+performances during training. We categorize knowledge into distinct domains
+including law, medicine, finance, science, code, etc. We begin with detecting
+the distribution of domain-specific knowledge within the base model, followed
+by the training data composition that aligns with the model's existing
+knowledge distribution. During the training process, domain weights are
+dynamically adjusted based on their learnable potential and forgetting degree.
+Experimental results demonstrate that VersaTune achieves significant
+improvements in multi-domain performance, with a 35.21% enhancement in
+comprehensive multi-domain tasks. Additionally, in scenarios where specific
+domain optimization is required, VersaTune reduces the degradation of
+performance in other domains by 38.77%, without compromising the target
+domain's training efficacy.
+
+
+
+
+
+
+
+ ♻ ☆ Evaluating LLMs for Hardware Design and Test
+
+
+ Large Language Models (LLMs) have demonstrated capabilities for producing
+code in Hardware Description Languages (HDLs). However, most of the focus
+remains on their abilities to write functional code, not test code. The
+hardware design process consists of both design and test, and so eschewing
+validation and verification leaves considerable potential benefit unexplored,
+given that a design and test framework may allow for progress towards full
+automation of the digital design pipeline. In this work, we perform one of the
+first studies exploring how an LLM can both design and test hardware modules
+from provided specifications. Using a suite of 8 representative benchmarks, we
+examined the capabilities and limitations of the state-of-the-art
+conversational LLMs when producing Verilog for functional and verification
+purposes. We taped out the benchmarks on a Skywater 130nm shuttle and received
+the functional chip.
+
+
+ This paper presents the work of restoring punctuation for ASR transcripts
+generated by multilingual ASR systems. The focus languages are English,
+Mandarin, and Malay which are three of the most popular languages in Singapore.
+To the best of our knowledge, this is the first system that can tackle
+punctuation restoration for these three languages simultaneously. Traditional
+approaches usually treat the task as a sequential labeling task; however, this
+work adopts a slot-filling approach that predicts the presence and type of
+punctuation marks at each word boundary. The approach is similar to the
+Masked-Language Model approach employed during the pre-training stages of BERT,
+but instead of predicting the masked word, our model predicts masked
+punctuation. Additionally, we find that using Jieba instead of only using the
+built-in SentencePiece tokenizer of XLM-R can significantly improve the
+performance of punctuating Mandarin transcripts. Experimental results on
+English and Mandarin IWSLT2022 datasets and Malay News show that the proposed
+approach achieved state-of-the-art results for Mandarin with 73.8% F1-score
+while maintaining a reasonable F1-score for English and Malay, i.e. 74.7% and
+78% respectively. Our source code that allows reproducing the results and
+building a simple web-based application for demonstration purposes is available
+on GitHub.
+
+
+
+ comment: Accepted at APSIPA 2022, Chiang-Mai, Thailand
+
+
+
+
+
+
+ ♻ ☆ Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss
+ Landscape Perspective
+
+
+
+
+
+
+
+
+ Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, Tengyu Ma
+
+
+ Training language models currently requires pre-determining a fixed compute
+budget because the typical cosine learning rate schedule depends on the total
+number of steps. In contrast, the Warmup-Stable-Decay (WSD) schedule uses a
+constant learning rate to produce a main branch of iterates that can in
+principle continue indefinitely without a pre-specified compute budget. Then,
+given any compute budget, one can branch out from the main branch at a proper
+time with a rapidly decaying learning rate to produce a strong model.
+Empirically, WSD generates a non-traditional loss curve: the loss remains
+elevated during the stable phase but sharply declines during the decay phase.
+Towards explaining this phenomenon, we conjecture that pretraining loss
+exhibits a river valley landscape, which resembles a deep valley with a river
+at its bottom. Under this assumption, we show that during the stable phase, the
+iterate undergoes large oscillations due to the high learning rate, yet it
+progresses swiftly along the river. During the decay phase, the rapidly
+dropping learning rate minimizes the iterate's oscillations, moving it closer
+to the river and revealing true optimization progress. Therefore, the sustained
+high learning rate phase and fast decaying phase are responsible for progress
+in the river and the mountain directions respectively, and are both critical.
+Our analysis predicts phenomena consistent with empirical observations and
+shows that this landscape can emerge from pretraining on a simple bi-gram
+dataset. Inspired by the theory, we introduce WSD-S, a variant of WSD that
+reuses previous checkpoints' decay phases and keeps only one main branch, where
+we resume from a decayed checkpoint. WSD-S empirically outperforms WSD and
+Cyclic-Cosine in obtaining multiple language model checkpoints across various
+compute budgets in a single run for parameters scaling from 0.1B to 1.2B.
+
+
+
+ comment: 45 pages, 13 figures
+
+
+
+
+
+
+ ♻ ☆ Towards Understanding Jailbreak Attacks in LLMs: A Representation Space
+ Analysis EMNLP 2024
+
+
+ Large language models (LLMs) are susceptible to a type of attack known as
+jailbreaking, which misleads LLMs to output harmful content. Although there
+are diverse jailbreak attack strategies, there is no unified understanding of
+why some methods succeed and others fail. This paper explores the behavior of
+harmful and harmless prompts in the LLM's representation space to investigate
+the intrinsic properties of successful jailbreak attacks. We hypothesize that
+successful attacks share some similar properties: they are effective in moving
+the representation of the harmful prompt in the direction of the harmless
+prompts. We incorporate hidden representations into the objective of existing
+jailbreak attacks to move the attacks along the acceptance direction, and
+conduct experiments to validate the above hypothesis using the proposed
+objective. We hope this study provides new insights into understanding how LLMs
+understand harmfulness information.
+
+
+
+ comment: Accepted by EMNLP 2024 Main
+
+
+
+
+
+
+ ♻ ☆ Discovering influential text using convolutional neural networks ACL 2024
+
+
+
+
+
+
+
+
+ Megan Ayers, Luke Sanford, Margaret Roberts, Eddie Yang
+
+
+ Experimental methods for estimating the impacts of text on human evaluation
+have been widely used in the social sciences. However, researchers in
+experimental settings are usually limited to testing a small number of
+pre-specified text treatments. While efforts to mine unstructured texts for
+features that causally affect outcomes have been ongoing in recent years, these
+models have primarily focused on the topics or specific words of text, which
+may not always be the mechanism of the effect. We connect these efforts with
+NLP interpretability techniques and present a method for flexibly discovering
+clusters of similar text phrases that are predictive of human reactions to
+texts using convolutional neural networks. When used in an experimental
+setting, this method can identify text treatments and their effects under
+certain assumptions. We apply the method to two datasets. The first enables
+direct validation of the model's ability to detect phrases known to cause the
+outcome. The second demonstrates its ability to flexibly discover text
+treatments with varying textual structures. In both cases, the model learns a
+greater variety of text treatments compared to benchmark methods, and these
+text features quantitatively meet or exceed the ability of benchmark methods to
+predict the outcome.
+
+
+
+ comment: Published in Findings of ACL 2024 (see
+ https://aclanthology.org/2024.findings-acl.714)
+
+
+
+
+
+
+ ♻ ☆ MATATA: a weak-supervised MAthematical Tool-Assisted reasoning for
+ Tabular Applications
+
+
+ Mathematical reasoning capabilities are increasing with tool-augmented
+language agents, but methods often rely either on closed-source or large
+models, external data, or extensive prompt engineering. This work introduces
+MATATA, a novel cost-effective method to train LLM agents for tabular data
+problems through reasoning, planning, and tool use. With a progressive
+self-improvement paradigm and iterative weak supervision, it empowers
+3.8B/8B Small Language Models (SLMs), particularly suited for local hosting and
+sensitive business contexts where data privacy is crucial. By employing
+flexible and reusable tools across different datasets, it achieves robust
+performance with effective scalability across shared tasks. Experiments show
+that MATATA reaches state-of-the-art performances on FinQA and TAT-QA among
+reasoning frameworks based on open-source models. Moreover, MATATA models
+compete with GPT-4 based frameworks on TabMWP, while being SLMs.
+
+
+
+
+
+
+
+ ♻ ☆ Salient Information Prompting to Steer Content in Prompt-based
+ Abstractive Summarization EMNLP 2024
+
+
+
+
+
+
+
+
+ Lei Xu, Mohammed Asad Karim, Saket Dingliwal, Aparna Elangovan
+
+
+ Large language models (LLMs) can generate fluent summaries across domains
+using prompting techniques, reducing the need to train models for summarization
+applications. However, crafting effective prompts that guide LLMs to generate
+summaries with the appropriate level of detail and writing style remains a
+challenge. In this paper, we explore the use of salient information extracted
+from the source document to enhance summarization prompts. We show that adding
+keyphrases in prompts can improve ROUGE F1 and recall, making the generated
+summaries more similar to the reference and more complete. The number of
+keyphrases can control the precision-recall trade-off. Furthermore, our
+analysis reveals that incorporating phrase-level salient information is
+superior to word- or sentence-level. However, the impact on hallucination is
+not universally positive across LLMs. To conduct this analysis, we introduce
+Keyphrase Signal Extractor (SigExt), a lightweight model that can be finetuned
+to extract salient keyphrases. By using SigExt, we achieve consistent ROUGE
+improvements across datasets and open-weight and proprietary LLMs without any
+LLM customization. Our findings provide insights into leveraging salient
+information in building prompt-based summarization systems. We release our code
+at https://github.com/amazon-science/SigExt.
+
+
+
+ comment: Accepted to EMNLP 2024 Industry Track. Code available at
+ https://github.com/amazon-science/SigExt
+
+
+
+
+
+
+ ♻ ☆ Large Language Models for Data Annotation and Synthesis: A Survey EMNLP 2024
+
+
+
+
+
+
+
+
+ Zhen Tan, Dawei Li, Song Wang, Alimohammad Beigi, Bohan Jiang, Amrita Bhattacharjee, Mansooreh Karami, Jundong Li, Lu Cheng, Huan Liu
+
+
+ Data annotation and synthesis generally refer to the labeling or generating
+of raw data with relevant information, which could be used for improving the
+efficacy of machine learning models. The process, however, is labor-intensive
+and costly. The emergence of advanced Large Language Models (LLMs), exemplified
+by GPT-4, presents an unprecedented opportunity to automate the complicated
+process of data annotation and synthesis. While existing surveys have
+extensively covered LLM architecture, training, and general applications, we
+uniquely focus on their specific utility for data annotation. This survey
+contributes to three core aspects: LLM-Based Annotation Generation,
+LLM-Generated Annotations Assessment, and LLM-Generated Annotations
+Utilization. Furthermore, this survey includes an in-depth taxonomy of data
+types that LLMs can annotate, a comprehensive review of learning strategies for
+models utilizing LLM-generated annotations, and a detailed discussion of the
+primary challenges and limitations associated with using LLMs for data
+annotation and synthesis. Serving as a key guide, this survey aims to assist
+researchers and practitioners in exploring the potential of the latest LLMs for
+data annotation, thereby fostering future advancements in this critical field.
+
+
+
+ comment: Accepted to EMNLP 2024 Main
+
+
+
+
+
+
+ ♻ ☆ VibeCheck: Discover and Quantify Qualitative Differences in Large
+ Language Models
+
+
+
+
+
+
+
+
+ Lisa Dunlap, Krishna Mandal, Trevor Darrell, Jacob Steinhardt, Joseph E Gonzalez
+
+
+ Large language models (LLMs) often exhibit subtle yet distinctive
+characteristics in their outputs that users intuitively recognize, but struggle
+to quantify. These "vibes" -- such as tone, formatting, or writing style --
+influence user preferences, yet traditional evaluations focus primarily on the
+singular axis of correctness. We introduce VibeCheck, a system for
+automatically comparing a pair of LLMs by discovering identifying traits of a
+model (vibes) that are well-defined, differentiating, and user-aligned.
+VibeCheck iteratively discovers vibes from model outputs and then utilizes a
+panel of LLM judges to quantitatively measure the utility of each vibe. We
+validate that the vibes generated by VibeCheck align with those found in human
+discovery and run VibeCheck on pairwise preference data from real-world user
+conversations with Llama-3-70b vs GPT-4. VibeCheck reveals that Llama has a
+friendly, funny, and somewhat controversial vibe. These vibes predict model
+identity with 80% accuracy and human preference with 61% accuracy. Lastly, we
+run VibeCheck on a variety of models and tasks including summarization, math,
+and captioning to provide insight into differences in model behavior. VibeCheck
+discovers vibes like Command X prefers to add concrete intros and conclusions
+when summarizing in comparison to TNGL, Llama-405b often overexplains its
+thought process on math problems compared to GPT-4o, and GPT-4 prefers to focus
+on the mood and emotions of the scene when captioning compared to
+Gemini-1.5-Flash. Code can be found at https://github.com/lisadunlap/VibeCheck
+
+
+
+ comment: unironic use of the word 'vibe', added more analysis and cooler
+ graphs. arXiv admin note: text overlap with arXiv:2301.07597 by other authors
+
+
+
+
+
+
+ ♻ ☆ Stress-Testing Long-Context Language Models with Lifelong ICL and Task
+ Haystack NeurIPS 2024
+
+
+ We introduce Lifelong ICL, a problem setting that challenges long-context
+language models (LMs) to learn a sequence of language tasks through in-context
+learning (ICL). We further introduce Task Haystack, an evaluation suite
+dedicated to assessing and diagnosing how long-context LMs utilize contexts in
+Lifelong ICL. When given a task instruction and test inputs, long-context LMs
+are expected to leverage the relevant demonstrations in the Lifelong ICL
+prompt, avoid distraction and interference from other tasks, and achieve test
+accuracies that are not significantly worse than those of the Single-task ICL
+baseline.
+ Task Haystack draws inspiration from the widely-adopted
+"needle-in-a-haystack" (NIAH) evaluation, but presents distinct new challenges.
+It requires models (1) to utilize the contexts at a deeper level, rather than
+resorting to simple copying and pasting; (2) to navigate through long streams
+of evolving topics and tasks, proxying the complexities and dynamism of
+contexts in real-world scenarios. Additionally, Task Haystack inherits the
+controllability of NIAH, providing model developers with tools and
+visualizations to identify model vulnerabilities effectively.
+ We benchmark 14 long-context LMs using Task Haystack, finding that frontier
+models like GPT-4o still struggle with the setting, failing on 15% of cases on
+average. Most open-weight models further lag behind by a large margin, with
+failure rates reaching up to 61%. In our controlled analysis, we identify
+factors such as distraction and recency bias as contributors to these failure
+cases. Further, performance declines when task instructions are paraphrased at
+test time or when ICL demonstrations are repeated excessively, raising concerns
+about the robustness, instruction understanding, and true context utilization
+of long-context LMs.
+
+
+
+
+
+
+
+ ♻ ☆ Cautious Optimizers: Improving Training with One Line of Code
+
+
+
+
+
+
+
+
+ Kaizhao Liang, Lizhang Chen, Bo Liu, Qiang Liu
+
+
+ AdamW has been the default optimizer for transformer pretraining. For many
+years, our community has searched for faster and more stable optimizers with only
+constrained positive outcomes. In this work, we propose a single-line
+modification in PyTorch to any momentum-based optimizer, which we rename
+Cautious Optimizer, e.g. C-AdamW and C-Lion. Our theoretical result shows that
+this modification preserves Adam's Hamiltonian function and it does not break
+the convergence guarantee under the Lyapunov analysis. In addition, a whole new
+family of optimizers is revealed by our theoretical insight. Among them, we
+pick the simplest one for empirical experiments, showing speed-up on Llama and
+MAE pretraining up to $1.47\times$. Code is available at
+https://github.com/kyleliang919/C-Optim
+
+
+
+
+
+
+
+ ♻ ☆ Text Clustering with Large Language Model Embeddings
+
+
+
+
+
+
+
+
+ Alina Petukhova, João P. Matos-Carvalho, Nuno Fachada
+
+
+ Text clustering is an important method for organising the increasing volume
+of digital content, aiding in the structuring and discovery of hidden patterns
+in uncategorised data. The effectiveness of text clustering largely depends on
+the selection of textual embeddings and clustering algorithms. This study
+argues that recent advancements in large language models (LLMs) have the
+potential to enhance this task. The research investigates how different textual
+embeddings, particularly those utilised in LLMs, and various clustering
+algorithms influence the clustering of text datasets. A series of experiments
+were conducted to evaluate the impact of embeddings on clustering results, the
+role of dimensionality reduction through summarisation, and the adjustment of
+model size. The findings indicate that LLM embeddings are superior at capturing
+subtleties in structured language. OpenAI's GPT-3.5 Turbo model yields better
+results in three out of five clustering metrics across most tested datasets.
+Most LLM embeddings show improvements in cluster purity and provide a more
+informative silhouette score, reflecting a refined structural understanding of
+text data compared to traditional methods. Among the more lightweight models,
+BERT demonstrates leading performance. Additionally, it was observed that
+increasing model dimensionality and employing summarisation techniques do not
+consistently enhance clustering efficiency, suggesting that these strategies
+require careful consideration for practical application. These results
+highlight a complex balance between the need for refined text representation
+and computational feasibility in text clustering applications. This study
+extends traditional text clustering frameworks by integrating embeddings from
+LLMs, offering improved methodologies and suggesting new avenues for future
+research in various types of textual analysis.
+
+
+
+ comment: The peer-reviewed version of this paper is published in the
+ International Journal of Cognitive Computing in Engineering at
+ https://doi.org/10.1016/j.ijcce.2024.11.004. This version is typeset by the
+ authors and differs only in pagination and typographical detail
+
+
+
+
+
+
+ ♻ ☆ Hierarchical Text Classification (HTC) vs. eXtreme Multilabel
+ Classification (XML): Two Sides of the Same Medal
+
+
+
+
+
+
+
+
+ Nerijus Bertalis, Paul Granse, Ferhat Gül, Florian Hauss, Leon Menkel, David Schüler, Tom Speier, Lukas Galke, Ansgar Scherp
+
+
+ Assigning a subset of labels from a fixed pool of labels to a given input
+text is a text classification problem with many real-world applications, such
+as in recommender systems. Two separate research streams address this issue.
+Hierarchical Text Classification (HTC) focuses on datasets with smaller label
+pools of hundreds of entries, accompanied by a semantic label hierarchy. In
+contrast, eXtreme Multi-Label Text Classification (XML) considers very large
+label pools with up to millions of entries, in which the labels are not
+arranged in any particular manner. However, in XML, a common approach is to
+construct an artificial hierarchy without any semantic information before or
+during the training process. Here, we investigate how state-of-the-art models
+from one domain perform when trained and tested on datasets from the other
+domain. The HBGL and HGLCR models from the HTC domain are trained and tested on
+the datasets Wiki10-31K, AmazonCat-13K, and Amazon-670K from the XML domain. On
+the other side, the XML models CascadeXML and XR-Transformer are trained and
+tested on the datasets Web of Science, The New York Times Annotated Corpus, and
+RCV1-V2 from the HTC domain. HTC models, on the other hand, are not equipped to
+handle the size of XML datasets and achieve poor transfer results. The code and
+numerous files that are needed to reproduce our results can be obtained from
+https://github.com/FloHauss/XMC_HTC
+
+
+
+
+
+
+
+ ♻ ☆ AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for
+ Memory-Efficient Large Language Models Fine-Tuning EMNLP 2024
+
+
+ Fine-tuning large language models (LLMs) has achieved remarkable performance
+across various natural language processing tasks, yet it demands more and more
+memory as model sizes keep growing. To address this issue, the recently
+proposed Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs
+using only forward passes, thereby avoiding the need for a backpropagation
+graph. However, significant performance drops and a high risk of divergence
+have limited their widespread adoption. In this paper, we propose the Adaptive
+Zeroth-order Tensor-Train Adaption (AdaZeta) framework, specifically designed
+to improve the performance and convergence of the ZO methods. To enhance
+dimension-dependent ZO estimation accuracy, we introduce a fast-forward,
+low-parameter tensorized adapter. To tackle the frequently observed divergence
+issue in large-scale ZO fine-tuning tasks, we propose an adaptive query number
+schedule that guarantees convergence. Detailed theoretical analysis and
+extensive experimental results on RoBERTa-Large and Llama-2-7B models
+substantiate the efficacy of our AdaZeta framework in terms of accuracy, memory
+efficiency, and convergence speed.
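To make the idea of fine-tuning "using only forward passes" concrete, here is a minimal sketch of a standard two-point zeroth-order gradient estimate averaged over several queries. It is not the AdaZeta adapter or its query schedule; the function name `zo_gradient`, the Gaussian perturbations, and the toy loss are assumptions chosen for illustration.

```python
import random


def zo_gradient(loss, x, num_queries, eps=1e-3, rng=None):
    """Two-point zeroth-order gradient estimate averaged over several queries."""
    rng = rng or random.Random(0)
    estimate = [0.0] * len(x)
    for _ in range(num_queries):
        # Random Gaussian perturbation direction.
        z = [rng.gauss(0.0, 1.0) for _ in x]
        plus = loss([xi + eps * zi for xi, zi in zip(x, z)])
        minus = loss([xi - eps * zi for xi, zi in zip(x, z)])
        # Finite-difference slope along z, using forward evaluations only.
        slope = (plus - minus) / (2.0 * eps)
        estimate = [e + slope * zi / num_queries for e, zi in zip(estimate, z)]
    return estimate


# Toy loss f(x) = 3 * x[0]; its true gradient is [3.0].
grad = zo_gradient(lambda x: 3.0 * x[0], [1.0], num_queries=20000)
```

Averaging over more queries lowers the variance of the estimate, which is the quantity a query-number schedule would trade off against compute.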
+
+
+
+ comment: Accepted for publication in EMNLP 2024
+
+
+
+
+
+
+
+
+
+ Information Retrieval 15
+
+
+
+
+
+ ☆ Improving feature interactions at Pinterest under industry constraints
+
+
+ Adopting advances in recommendation systems is often challenging in
+industrial settings due to unique constraints. This paper aims to highlight
+these constraints through the lens of feature interactions. Feature
+interactions are critical for accurately predicting user behavior in
+recommendation systems and online advertising. Despite numerous novel
+techniques showing superior performance on benchmark datasets like Criteo,
+their direct application in industrial settings is hindered by constraints such
+as model latency, GPU memory limitations and model reproducibility. In this
+paper, we share our learnings from improving feature interactions in
+Pinterest's Homefeed ranking model under such constraints. We provide details
+about the specific challenges encountered, the strategies employed to address
+them, and the trade-offs made to balance performance with practical
+limitations. Additionally, we present a set of learning experiments that help
+guide the feature interaction architecture selection. We believe these insights
+will be useful for engineers who are interested in improving their model
+through better feature interaction learning.
+
+
+
+
+
+
+
+ ☆ FGATT: A Robust Framework for Wireless Data Imputation Using Fuzzy Graph
+ Attention Networks and Transformer Encoders
+
+
+
+
+
+
+
+
+ Jinming Xing, Ruilin Xing, Yan Sun
+
+
+ Missing data is a pervasive challenge in wireless networks and many other
+domains, often compromising the performance of machine learning and deep
+learning models. To address this, we propose a novel framework, FGATT, that
+combines the Fuzzy Graph Attention Network (FGAT) with the Transformer encoder
+to perform robust and accurate data imputation. FGAT leverages fuzzy rough sets
+and graph attention mechanisms to capture spatial dependencies dynamically,
+even in scenarios where predefined spatial information is unavailable. The
+Transformer encoder is employed to model temporal dependencies, utilizing its
+self-attention mechanism to focus on significant time-series patterns. A
+self-adaptive graph construction method is introduced to enable dynamic
+connectivity learning, ensuring the framework's applicability to a wide range
+of wireless datasets. Extensive experiments demonstrate that our approach
+outperforms state-of-the-art methods in imputation accuracy and robustness,
+particularly in scenarios with substantial missing data. The proposed model is
+well-suited for applications in wireless sensor networks and IoT environments,
+where data integrity is critical.
+
+
+
+
+
+
+
+ ☆ Down with the Hierarchy: The 'H' in HNSW Stands for "Hubs"
+
+
+ Driven by recent breakthrough advances in neural representation learning,
+approximate near-neighbor (ANN) search over vector embeddings has emerged as a
+critical computational workload. With the introduction of the seminal
+Hierarchical Navigable Small World (HNSW) algorithm, graph-based indexes have
+established themselves as the overwhelmingly dominant paradigm for efficient and
+scalable ANN search. As the name suggests, HNSW searches a layered hierarchical
+graph to quickly identify neighborhoods of similar points to a given query
+vector. But is this hierarchy even necessary? A rigorous experimental analysis
+to answer this question would provide valuable insights into the nature of
+algorithm design for ANN search and motivate directions for future work in this
+increasingly crucial domain. To that end, we conduct an extensive benchmarking
+study covering more large-scale datasets than prior investigations of this
+question. We ultimately find that a flat graph retains all of the benefits of
+HNSW on high-dimensional datasets, with latency and recall performance
+essentially \emph{identical} to the original algorithm but with less memory
+overhead. Furthermore, we go a step further and study \emph{why} the hierarchy
+of HNSW provides no benefit in high dimensions, hypothesizing that navigable
+small world graphs contain a well-connected, frequently traversed ``highway'' of
+hub nodes that maintain the same purported function as the hierarchical layers.
+We present compelling empirical evidence that the \emph{Hub Highway Hypothesis}
+holds for real datasets and investigate the mechanisms by which the highway
+forms. The implications of this hypothesis may also provide future research
+directions in developing enhancements to graph-based ANN search.
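For readers unfamiliar with graph-based search, the sketch below shows a greedy walk over a single-layer (flat) neighbor graph, the kind of structure the abstract compares against the layered hierarchy. It is a simplified stand-in: practical indexes use a beam of candidates rather than a single current node, and the graph, points, and the name `greedy_search` are made up for illustration.

```python
import math


def greedy_search(points, neighbors, query, entry):
    """Greedy walk on a single-layer (flat) neighbor graph toward the query."""
    current = entry
    current_dist = math.dist(points[current], query)
    while True:
        # Pick the neighbor closest to the query.
        best = min(neighbors[current], key=lambda n: math.dist(points[n], query))
        best_dist = math.dist(points[best], query)
        # Stop at a local minimum: no neighbor is closer than the current node.
        if best_dist >= current_dist:
            return current
        current, current_dist = best, best_dist


# Toy graph: four points on a line, each linked to its adjacent points.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
nearest = greedy_search(points, neighbors, query=(2.9, 0.0), entry=0)
```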
+
+
+
+ comment: 10 pages
+
+
+
+
+
+
+ ☆ Using Large Language Models in Automatic Hint Ranking and Generation
+ Tasks
+
+
+ The use of Large Language Models (LLMs) has increased significantly recently,
+with individuals frequently interacting with chatbots to receive answers to a
+wide range of questions. In an era where information is readily accessible, it
+is crucial to stimulate and preserve human cognitive abilities and maintain
+strong reasoning skills. This paper addresses such challenges by promoting the
+use of hints as an alternative or a supplement to direct answers. We first
+introduce a manually constructed hint dataset, WIKIHINT, which includes 5,000
+hints created for 1,000 questions. We then finetune open-source LLMs such as
+LLaMA-3.1 for hint generation in answer-aware and answer-agnostic contexts. We
+assess the effectiveness of the hints with human participants who try to answer
+questions with and without the aid of hints. Additionally, we introduce a
+lightweight evaluation method, HINTRANK, to evaluate and rank hints in both
+answer-aware and answer-agnostic settings. Our findings show that (a) the
+dataset helps generate more effective hints, (b) including answer information
+along with questions generally improves hint quality, and (c) encoder-based
+models perform better than decoder-based models in hint ranking.
+
+
+
+
+
+
+
+
+ Heejin Do, Sangwon Ryu, Jonghwi Kim, Gary Geunbae Lee
+
+
+ With the growing demand to fit fine-grained user intents, faceted
+query-by-example (QBE), which retrieves similar documents conditioned on
+specific facets, has gained recent attention. However, prior approaches mainly
+depend on document-level comparisons using basic indicators like citations due
+to the lack of facet-level relevance datasets; yet, this limits their use to
+citation-based domains and fails to capture the intricacies of facet
+constraints. In this paper, we propose a multi-facet blending (FaBle)
+augmentation method, which exploits modularity by decomposing and recomposing
+to explicitly synthesize facet-specific training sets. We automatically
+decompose documents into facet units and generate (ir)relevant pairs by
+leveraging LLMs' intrinsic distinguishing capabilities; then, dynamically
+recomposing the units leads to facet-wise relevance-informed document pairs.
+Our modularization eliminates the need for pre-defined facet knowledge or
+labels. Further, to prove the FaBle's efficacy in a new domain beyond
+citation-based scientific paper retrieval, we release a benchmark dataset for
+educational exam item QBE. FaBle augmentation on 1K documents remarkably
+assists training in obtaining facet conditional embeddings.
+
+
+
+
+
+
+
+ ☆ Global Estimation of Building-Integrated Facade and Rooftop Photovoltaic
+ Potential by Integrating 3D Building Footprint and Spatio-Temporal Datasets
+
+
+ This research tackles the challenges of estimating Building-Integrated
+Photovoltaics (BIPV) potential across various temporal and spatial scales,
+accounting for different geographical climates and urban morphology. We
+introduce a holistic methodology for evaluating BIPV potential, integrating 3D
+building footprint models with diverse meteorological data sources to account
+for dynamic shadow effects. The approach enables the assessment of PV potential
+on facades and rooftops at different levels: individual buildings, urban blocks,
+and cities globally. Through an analysis of 120 typical cities, we highlight
+the importance of 3D building forms, cityscape morphology, and geographic
+positioning in measuring BIPV potential at various levels. In particular, our
+simulation study reveals that among cities with optimal facade PV performance,
+the average ratio of facade PV potential to rooftop PV potential is
+approximately 68.2%. Additionally, approximately 17.5% of the analyzed samples
+demonstrate even higher facade PV potentials compared to rooftop installations.
+This finding underscores the strategic value of incorporating facade PV
+applications into urban sustainable energy systems.
+
+
+ In this work, we investigate the problem of learning distance functions
+within the query-based learning framework, where a learner is able to pose
+triplet queries of the form: ``Is $x_i$ closer to $x_j$ or $x_k$?'' We
+establish formal guarantees on the query complexity required to learn smooth,
+but otherwise general, distance functions under two notions of approximation:
+$\omega$-additive approximation and $(1 + \omega)$-multiplicative
+approximation. For the additive approximation, we propose a global method whose
+query complexity is quadratic in the size of a finite cover of the sample
+space. For the (stronger) multiplicative approximation, we introduce a method
+that combines global and local approaches, utilizing multiple Mahalanobis
+distance functions to capture local geometry. This method has a query
+complexity that scales quadratically with both the size of the cover and the
+ambient space dimension of the sample space.
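The triplet query described above can be sketched as a small oracle. The example assumes the hidden distance is a single Mahalanobis distance with a hand-picked matrix; the names `mahalanobis_sq` and `triplet_query` and the matrix values are hypothetical, and the learning procedure itself is not shown.

```python
def mahalanobis_sq(x, y, matrix):
    """Squared Mahalanobis distance (x - y)^T M (x - y) for a PSD matrix M."""
    diff = [a - b for a, b in zip(x, y)]
    return sum(
        diff[i] * matrix[i][j] * diff[j]
        for i in range(len(diff))
        for j in range(len(diff))
    )


def triplet_query(x_i, x_j, x_k, matrix):
    """Answer 'Is x_i closer to x_j or x_k?' under the hidden distance."""
    d_j = mahalanobis_sq(x_i, x_j, matrix)
    d_k = mahalanobis_sq(x_i, x_k, matrix)
    return "j" if d_j <= d_k else "k"


# Hypothetical hidden metric that weights the second coordinate 100x more.
hidden = [[1.0, 0.0], [0.0, 100.0]]
answer = triplet_query((0.0, 0.0), (3.0, 0.0), (0.0, 1.0), hidden)
```

Under the Euclidean distance the third point would be closer, so the oracle's answer here carries information about the hidden metric.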
+
+
+
+ comment: 40 pages, 1 figure
+
+
+
+
+
+
+ ☆ Lossless and Privacy-Preserving Graph Convolution Network for Federated
+ Item Recommendation
+
+
+ Graph neural network (GNN) has emerged as a state-of-the-art solution for
+item recommendation. However, existing GNN-based recommendation methods rely on
+a centralized storage of fragmented user-item interaction sub-graphs and
+training on an aggregated global graph, which will lead to privacy concerns. As
+a response, some recent works develop GNN-based federated recommendation
+methods by exploiting decentralized and fragmented user-item sub-graphs in
+order to preserve user privacy. However, due to privacy constraints, the graph
+convolution process in existing federated recommendation methods is incomplete
+compared with the centralized counterpart, causing a degradation of the
+recommendation performance. In this paper, we propose a novel lossless and
+privacy-preserving graph convolution network (LP-GCN), which fully completes
+the graph convolution process with decentralized user-item interaction
+sub-graphs while ensuring privacy. It is worth mentioning that its performance
+is equivalent to that of the non-federated (i.e., centralized) counterpart.
+Moreover, we validate its effectiveness through both theoretical analysis and
+empirical studies. Extensive experiments on three real-world datasets show that
+our LP-GCN outperforms the existing federated recommendation methods. The code
+will be publicly available once the paper is accepted.
+
+
+
+
+
+
+
+ ☆ Precision Profile Pollution Attack on Sequential Recommenders via
+ Influence Function
+
+
+
+
+
+
+
+
+ Xiaoyu Du, Yingying Chen, Yang Zhang, Jinhui Tang
+
+
+ Sequential recommendation approaches have demonstrated remarkable proficiency
+in modeling user preferences. Nevertheless, they are susceptible to profile
+pollution attacks (PPA), wherein items are introduced into a user's interaction
+history deliberately to influence the recommendation list. Since retraining the
+model for each polluted item is time-consuming, recent PPAs estimate item
+influence based on gradient directions to identify the most effective attack
+candidates. However, the actual item representations diverge significantly from
+the gradients, resulting in disparate outcomes. To tackle this challenge, we
+introduce an INFluence Function-based Attack approach INFAttack that offers a
+more accurate estimation of the influence of polluting items. Specifically, we
+calculate the modifications to the original model using the influence function
+when generating polluted sequences by introducing specific items. Subsequently,
+we choose the sequence that has been most significantly influenced to
+substitute the original sequence, thus promoting the target item. Comprehensive
+experiments conducted on five real-world datasets illustrate that INFAttack
+surpasses all baseline methods and consistently delivers stable attack
+performance for both popular and unpopular items.
+
+
+
+
+
+
+
+ ☆ Automated Extraction of Acronym-Expansion Pairs from Scientific Papers
+
+
+
+
+
+
+
+
+ Izhar Ali, Million Haileyesus, Serhiy Hnatyshyn, Jan-Lucas Ott, Vasil Hnatyshin
+
+
+ This project addresses challenges posed by the widespread use of
+abbreviations and acronyms in digital texts. We propose a novel method that
+combines document preprocessing, regular expressions, and a large language
+model to identify abbreviations and map them to their corresponding expansions.
+The regular expressions alone are often insufficient to extract expansions, at
+which point our approach leverages GPT-4 to analyze the text surrounding the
+acronyms. By limiting the analysis to only a small portion of the surrounding
+text, we mitigate the risk of obtaining incorrect or multiple expansions for an
+acronym. There are several known challenges in processing text with acronyms,
+including polysemous, non-local, and ambiguous acronyms. Our approach
+enhances the precision and efficiency of NLP techniques by addressing these
+issues with automated acronym identification and disambiguation. This study
+highlights the challenges of working with PDF files and the importance of
+document preprocessing. Furthermore, the results of this work show that neither
+regular expressions nor GPT-4 alone can perform well. Regular expressions are
+suitable for identifying acronyms but have limitations in finding their
+expansions within the paper due to a variety of formats used for expressing
+acronym-expansion pairs and the tendency of authors to omit expansions within
+the text. GPT-4, on the other hand, is an excellent tool for obtaining
+expansions but struggles with correctly identifying all relevant acronyms.
+Additionally, GPT-4 poses challenges due to its probabilistic nature, which may
+lead to slightly different results for the same input. Our algorithm employs
+preprocessing to eliminate irrelevant information from the text, regular
+expressions for identifying acronyms, and a large language model to help find
+acronym expansions to provide the most accurate and consistent results.
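The regular-expression stage of such a pipeline might look like the sketch below, which finds candidate acronyms and keeps only a small window of surrounding text for a later expansion lookup. The pattern, the window size, and the name `acronym_contexts` are assumptions for illustration; the abstract does not give the actual expressions, and the language-model call is omitted.

```python
import re

# Runs of 2-6 capital letters, optionally followed by a lowercase plural "s".
ACRONYM_RE = re.compile(r"\b[A-Z]{2,6}s?\b")


def acronym_contexts(text, window=60):
    """Return (acronym, surrounding text) pairs for later expansion lookup."""
    results = []
    for match in ACRONYM_RE.finditer(text):
        # Limit the context to a small window around the acronym.
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        results.append((match.group(), text[start:end]))
    return results


sample = "We use approximate near-neighbor (ANN) search over embeddings."
found = acronym_contexts(sample)
```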
+
+
+ To combat the rising energy consumption of recommender systems we implement a
+novel alternative for k-fold cross validation. This alternative, named e-fold
+cross validation, aims to minimize the number of folds to achieve a reduction
+in power usage while keeping the reliability and robustness of the test results
+high. We tested our method on 5 recommender system algorithms across 6 datasets
+and compared it with 10-fold cross validation. On average e-fold cross
+validation only needed 41.5% of the energy that 10-fold cross validation would
+need, while its results only differed by 1.81%. We conclude that e-fold cross
+validation is a promising approach that has the potential to be an energy
+efficient but still reliable alternative to k-fold cross validation.
+
+
+
+ comment: This preprint has not undergone peer review (when applicable) or any
+ post-submission improvements or corrections. The Version of Record of this
+ contribution is published in [TBA], and is available online at [TBA]
+
+
+
+
+
+
+ ♻ ☆ RIRAG: Regulatory Information Retrieval and Answer Generation
+
+
+ Regulatory documents, issued by governmental regulatory bodies, establish
+rules, guidelines, and standards that organizations must adhere to for legal
+compliance. These documents, characterized by their length, complexity and
+frequent updates, are challenging to interpret, requiring significant
+allocation of time and expertise on the part of organizations to ensure ongoing
+compliance. Regulatory Natural Language Processing (RegNLP) is a
+multidisciplinary field aimed at simplifying access to and interpretation of
+regulatory rules and obligations. We introduce a task of generating
+question-passages pairs, where questions are automatically created and paired
+with relevant regulatory passages, facilitating the development of regulatory
+question-answering systems. We create the ObliQA dataset, containing 27,869
+questions derived from the collection of Abu Dhabi Global Markets (ADGM)
+financial regulation documents, design a baseline Regulatory Information
+Retrieval and Answer Generation (RIRAG) system and evaluate it with RePASs, a
+novel evaluation metric that tests whether generated answers accurately capture
+all relevant obligations while avoiding contradictions.
+
+
+
+
+
+
+
+
+ Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, Jimmy Lin
+
+
+ In the real world, documents are organized in different formats and varied
+modalities. Traditional retrieval pipelines require tailored document parsing
+techniques and content extraction modules to prepare input for indexing. This
+process is tedious, prone to errors, and incurs information loss. To this end, we
+propose Document Screenshot Embedding (DSE), a novel retrieval paradigm that
+regards document screenshots as a unified input format, which does not require
+any content extraction preprocess and preserves all the information in a
+document (e.g., text, image and layout). DSE leverages a large vision-language
+model to directly encode document screenshots into dense representations for
+retrieval. To evaluate our method, we first craft Wiki-SS, a dataset of 1.3M
+Wikipedia web page screenshots that serves as the corpus for answering questions from
+the Natural Questions dataset. In such a text-intensive document retrieval
+setting, DSE shows competitive effectiveness compared to other text retrieval
+methods relying on parsing. For example, DSE outperforms BM25 by 17 points in
+top-1 retrieval accuracy. Additionally, in a mixed-modality task of slide
+retrieval, DSE significantly outperforms OCR text retrieval methods by over 15
+points in nDCG@10. These experiments show that DSE is an effective document
+retrieval paradigm for diverse types of documents. Model checkpoints, code, and
+Wiki-SS collection will be released.
+
+
+
+ comment: EMNLP2024 main
+
+
+
+
+
+
+ ♻ ☆ Unveiling and Mitigating Bias in Large Language Model Recommendations: A
+ Path to Fairness
+
+
+ excel in delivering comprehensive suggestions by deeply analyzing content and
+user behavior. However, they often inherit biases from skewed training data,
+favoring mainstream content while underrepresenting diverse or non-traditional
+options. This study explores the interplay between bias and LLM-based
+recommendation systems, focusing on music, song, and book recommendations
+across diverse demographic and cultural groups. This paper analyzes bias in
+LLM-based recommendation systems across multiple models (GPT, LLaMA, and
+Gemini), revealing its deep and pervasive impact on outcomes. Intersecting
+identities and contextual factors, like socioeconomic status, further amplify
+biases, complicating fair recommendations across diverse groups. Our findings
+reveal that bias in these systems is deeply ingrained, yet even simple
+interventions like prompt engineering can significantly reduce it. We further
+propose a retrieval-augmented generation strategy to mitigate bias more
+effectively. Numerical experiments validate these strategies, demonstrating
+both the pervasive nature of bias and the impact of the proposed solutions.
+
+
+
+
+
+
+
+ ♻ ☆ Using text embedding models as text classifiers with medical data
+
+
+ The advent of Large Language Models (LLMs) is promising and LLMs have been
+applied to numerous fields. However, it is not trivial to implement LLMs in the
+medical field, due to the high standards for precision and accuracy. Currently,
+the diagnosis of medical ailments must be done by hand, as it is costly to
+build a sufficiently broad LLM that can diagnose a wide range of diseases.
+Here, we explore the use of vector databases and embedding models as a means of
+encoding and classifying text with medical text data without the need to train
+a new model altogether. We used various LLMs to generate the medical data, then
+encoded the data with a text embedding model and stored it in a vector
+database. We hypothesized that higher embedding dimensions coupled with
+descriptive data in the vector database would lead to better classifications
+and designed a robustness test to test our hypothesis. By using vector
+databases and text embedding models to classify a clinician's notes on a
+patient presenting with a certain ailment, we showed that these tools can be
+successful at classifying medical text data. We found that a higher embedding
+dimension did indeed yield better results; however, querying with simple data
+in the database was optimal for performance. We have shown in this study the
+applicability of text embedding models and vector databases on a small scale,
+and our work lays the groundwork for applying these tools on a larger scale.
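Classification by embedding lookup, as described above, reduces to a nearest-neighbour search over stored vectors. The sketch below uses cosine similarity over a tiny in-memory list in place of a vector database; the three-dimensional vectors and the labels are toy values, and `classify` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(query_vec, store):
    """Return the label of the stored vector most similar to the query."""
    return max(store, key=lambda item: cosine(query_vec, item[0]))[1]


# Toy 3-dimensional "embeddings"; a real embedding model would produce these.
store = [
    ([0.9, 0.1, 0.0], "influenza"),
    ([0.1, 0.9, 0.1], "migraine"),
    ([0.0, 0.2, 0.9], "sprained ankle"),
]
label = classify([0.2, 0.8, 0.0], store)
```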
+
+
+
+
+
+
+
+
+ Junjie Oscar Yin, Alexander M. Rush
+
+
+ Data selection can reduce the amount of training data needed to finetune
+LLMs; however, the efficacy of data selection scales directly with its compute.
+Motivated by the practical challenge of compute-constrained finetuning, we
+consider the setting in which the costs of both data selection and training are
+budgeted. We first formalize the problem of data selection with a
+cost-aware utility function, and model the data selection problem as trading
+off initial-selection cost for training gain. We run a comprehensive sweep of
+experiments across multiple tasks, varying compute budget by scaling finetuning
+tokens, model sizes, and data selection compute. Interestingly, we find that
+many powerful data selection methods are almost never compute-optimal, and that
+cheaper data selection alternatives dominate both from a theoretical and
+empirical perspective. For compute-optimal training, we find that perplexity
+and gradient data selection require training-to-selection model size ratios of
+5x and 10x, respectively.
+
+
+
+
+
+
+
+ ♻ ☆ A Note on Doubly Robust Estimator in Regression Continuity Designs
+
+
+ This note introduces a doubly robust (DR) estimator for regression
+discontinuity (RD) designs. RD designs provide a quasi-experimental framework
+for estimating treatment effects, where treatment assignment depends on whether
+a running variable surpasses a predefined cutoff. A common approach in RD
+estimation is the use of nonparametric regression methods, such as local linear
+regression. However, the validity of these methods still relies on the
+consistency of the nonparametric estimators. In this study, we propose the
+DR-RD estimator, which combines two distinct estimators for the conditional
+expected outcomes. The primary advantage of the DR-RD estimator lies in its
+ability to ensure the consistency of the treatment effect estimation as long as
+at least one of the two estimators is consistent. Consequently, our DR-RD
+estimator enhances robustness of treatment effect estimators in RD designs.
+
+
+
+ comment: There is a critical error in the previous submission. We have revised
+ the original claim and present a weakened result
+
+
+
+
+
+
+ ♻ ☆ Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect
+ Verifiers
+
+
+ Recent research has generated hope that inference scaling could allow weaker
+language models to match or exceed the accuracy of stronger models, such as by
+repeatedly sampling solutions to a coding problem until it passes unit tests.
+The central thesis of this paper is that there is no free lunch for inference
+scaling: indefinite accuracy improvement through resampling can only be
+realized if the "verifier" (in this case, a set of unit tests) is perfect. When
+the verifier is imperfect, as it almost always is in domains such as reasoning
+or coding (for example, unit tests have imperfect coverage), there is a nonzero
+probability of false positives: incorrect solutions that pass the verifier.
+Resampling cannot decrease this probability, so it imposes an upper bound to
+the accuracy of resampling-based inference scaling even with an infinite
+compute budget. We find that there is a very strong correlation between the
+model's single-sample accuracy (i.e. accuracy without unit tests) and its false
+positive rate on coding benchmarks HumanEval and MBPP, whose unit tests have
+limited coverage. Therefore, no amount of inference scaling of weaker models
+can enable them to match the single-sample accuracy of a sufficiently strong
+model (Fig. 1a). When we consider that false positives have a negative utility
+compared to abstaining from producing a solution, it bends the inference
+scaling curve further downward. Empirically, we find that the optimal number of
+samples can be less than 10 under realistic assumptions (Fig. 1b). Finally, we
+show that beyond accuracy, false positives may have other undesirable
+qualities, such as poor adherence to coding style conventions.
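The upper bound argued for above can be worked out under simple assumptions: samples are independent, the verifier accepts every correct solution, and it accepts an incorrect one with a fixed false positive rate. The function names and the example numbers (20% single-sample accuracy, 10% false positive rate) are illustrative and not taken from the paper.

```python
def resampling_accuracy(p_correct, false_positive_rate, num_samples):
    """Probability that the first verifier-passing sample within the budget is correct.

    Assumes independent samples, a verifier that accepts every correct solution,
    and accepts an incorrect one with probability `false_positive_rate`.
    """
    p_pass = p_correct + (1.0 - p_correct) * false_positive_rate
    p_any_pass = 1.0 - (1.0 - p_pass) ** num_samples
    return (p_correct / p_pass) * p_any_pass


def accuracy_ceiling(p_correct, false_positive_rate):
    """Limit of resampling_accuracy as the number of samples grows."""
    return p_correct / (p_correct + (1.0 - p_correct) * false_positive_rate)


# Hypothetical model: 20% single-sample accuracy, 10% false positive rate.
ceiling = accuracy_ceiling(0.2, 0.1)
at_1000 = resampling_accuracy(0.2, 0.1, 1000)
```

With these numbers the ceiling is 0.2 / 0.28, so no sampling budget lifts accuracy to 1 unless the false positive rate is zero.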
+
+
+ Decentralised learning has recently gained traction as an alternative to
+federated learning in which both data and coordination are distributed. To
+preserve the confidentiality of users' data, decentralised learning relies on
+differential privacy, multi-party computation, or both. However, running
+multiple privacy-preserving summations in sequence may allow adversaries to
+perform reconstruction attacks. Current reconstruction countermeasures either
+cannot trivially be adapted to the distributed setting, or add excessive
+amounts of noise.
+ In this work, we first show that passive honest-but-curious adversaries can
+infer other users' private data after several privacy-preserving summations.
+For example, in subgraphs with 18 users, we show that only three passive
+honest-but-curious adversaries succeed at reconstructing private data 11.0% of
+the time, requiring an average of 8.8 summations per adversary. The success
+rate depends only on the adversaries' direct neighbourhood, and is independent
+of the size of the full network. We consider weak adversaries that do not
+control the graph topology, cannot exploit the summation's inner workings, and
+do not have auxiliary knowledge; and show that these adversaries can still
+infer private data.
+ We analyse how reconstruction relates to topology and propose the first
+topology-based decentralised defence against reconstruction attacks. We show
+that reconstruction requires a number of adversaries linear in the length of
+the network's shortest cycle. Consequently, exact attacks over
+privacy-preserving summations are impossible in acyclic networks.
+ Our work is a stepping stone for a formal theory of topology-based
+decentralised reconstruction defences. Such a theory would generalise our
+countermeasure beyond summation, define confidentiality in terms of entropy,
+and describe the interactions with (topology-aware) differential privacy.
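A toy differencing example shows why running several summations over overlapping groups can leak a private value even when each summation reveals only its total. This is a deliberately simplified illustration, not the attack analysed in the paper; the participant names, values, and the helper `reconstruct_by_differencing` are made up.

```python
def reconstruct_by_differencing(sum_large, sum_small, known_values):
    """Recover one private value from two overlapping aggregate sums.

    sum_large covers the same participants as sum_small, plus one target and
    any participants whose values the adversaries already know.
    """
    return sum_large - sum_small - sum(known_values)


# Hypothetical private inputs; only the two totals are revealed.
private = {"alice": 4.0, "bob": 7.0, "carol": 1.5, "adversary": 2.0}
sum_small = private["alice"] + private["bob"]
sum_large = sum(private.values())
recovered = reconstruct_by_differencing(sum_large, sum_small, [private["adversary"]])
```

Here the adversary recovers carol's value from two totals and its own input alone.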
+
+
+
+ comment: 14 pages, 19 figures, for associated experiment source code see
+ doi:10.4121/21572601.v2
+
+
+
+
+
+
+ ♻ ☆ Dynamic Estimation of Learning Rates Using a Non-Linear Autoregressive
+ Model
+
+
+ We introduce a new class of adaptive non-linear autoregressive (Nlar) models
+incorporating the concept of momentum, which dynamically estimate both the
+learning rates and momentum as the number of iterations increases. In our
+method, the growth of the gradients is controlled using a scaling (clipping)
+function, leading to stable convergence. Within this framework, we propose
+three distinct estimators for learning rates and provide theoretical proof of
+their convergence. We further demonstrate how these estimators underpin the
+development of effective Nlar optimizers. The performance of the proposed
+estimators and optimizers is rigorously evaluated through extensive experiments
+across several datasets and a reinforcement learning environment. The results
+highlight two key features of the Nlar optimizers: robust convergence despite
+variations in underlying parameters, including large initial learning rates,
+and strong adaptability with rapid convergence during the initial epochs.
+
+
+ With the increasing deployment of artificial intelligence (AI) technologies,
+the potential of humans working with AI agents has been growing rapidly.
+Human-AI teaming is an important paradigm for studying how humans and AI
+agents work together. The unique aspect of Human-AI teaming
+research is the need to jointly study humans and AI agents, demanding
+multidisciplinary research efforts from machine learning to human-computer
+interaction, robotics, cognitive science, neuroscience, psychology, social
+science, and complex systems. However, existing platforms for Human-AI teaming
+research are limited, often supporting oversimplified scenarios and a single
+task, or specifically focusing on either human-teaming research or multi-agent
+AI algorithms. We introduce CREW, a platform to facilitate Human-AI teaming
+research in real-time decision-making scenarios and foster collaboration across
+multiple scientific disciplines, with a strong emphasis on human involvement.
+It includes pre-built tasks for cognitive studies and Human-AI teaming, and its
+modular design makes it extensible. Following conventional cognitive
+neuroscience research, CREW also supports multimodal human physiological signal
+recording for behavior analysis. Moreover, CREW benchmarks real-time
+human-guided reinforcement learning agents using state-of-the-art algorithms
+and well-tuned baselines. With CREW, we were able to conduct 50 human subject
+studies within a week to verify the effectiveness of our benchmark.
+
+
+
+ comment: Our project website is at: http://generalroboticslab.com/CREW
+
+
+
+
+
+
+ ♻ ☆ Two Tales of Single-Phase Contrastive Hebbian Learning ICML 2024
+
+
+ The search for ``biologically plausible'' learning algorithms has converged
+on the idea of representing gradients as activity differences. However, most
+approaches require a high degree of synchronization (distinct phases during
+learning) and introduce substantial computational overhead, which raises doubts
+regarding their biological plausibility as well as their potential utility for
+neuromorphic computing. Furthermore, they commonly rely on applying
+infinitesimal perturbations (nudges) to output units, which is impractical in
+noisy environments. Recently it has been shown that by modelling artificial
+neurons as dyads with two oppositely nudged compartments, it is possible for a
+fully local learning algorithm named ``dual propagation'' to bridge the
+performance gap to backpropagation, without requiring separate learning phases
+or infinitesimal nudging. However, the algorithm has the drawback that its
+numerical stability relies on symmetric nudging, which may be restrictive in
+biological and analog implementations. In this work we first provide a solid
+foundation for the objective underlying the dual propagation method, which also
+reveals a surprising connection with adversarial robustness. Second, we
+demonstrate how dual propagation is related to a particular adjoint state
+method, which is stable regardless of asymmetric nudging.
+
+
+
+ comment: ICML 2024; 21 pages
+
+
+
+
+
+
+ ♻ ☆ Inducing Group Fairness in Prompt-Based Language Model Decisions
+
+
+
+
+
+
+
+
+ James Atwood, Nino Scherrer, Preethi Lahoti, Ananth Balashankar, Flavien Prost, Ahmad Beirami
+
+
+ Classifiers are used throughout industry to enforce policies, ranging from
+the detection of toxic content to age-appropriate content filtering. While
+these classifiers serve important functions, it is also essential that they are
+built in ways that minimize unfair biases for users.
+ One such fairness consideration is called group fairness, which requires that
+different sub-populations of users receive equal treatment. This is a
+well-studied problem in the context of 'classical' classifiers. However, the
+emergence of prompt-based language model (LM) decision making has created new
+opportunities to solve text-based classification tasks, and the fairness
+properties of these new classifiers are not yet well understood. Further, the
+'remediation toolkit' is incomplete for LM-based decision makers, and little is
+understood about how to improve decision maker group fairness while maintaining
+classifier performance.
+ This work sets out to add more tools to that toolbox. We introduce
+adaptations of existing effective approaches from classical classifier
+fairness to the prompt-based classifier space. We also devise simple methods
+that take advantage of the new structure of prompt-based decision makers and
+operate at the prompt level. We compare these approaches empirically on real
+data. Our results suggest that adaptations of approaches that are effective for
+classical classifiers remain effective in the LM-based classifier environment.
+However, there is room for further exploration of prompt-based remediation
+methods (and other remediation methods that take advantage of LM structure).
+
+
+ Regression trees have emerged as a preeminent tool for solving real-world
+regression problems due to their ability to deal with nonlinearities,
+interaction effects and sharp discontinuities. In this article, we instead study
+regression trees applied to well-behaved, differentiable functions, and
+determine the relationship between node parameters and the local gradient of
+the function being approximated. We find a simple estimate of the gradient
+which can be efficiently computed using quantities exposed by popular tree
+learning libraries. This allows the tools developed in the context of
+differentiable algorithms, like neural nets and Gaussian processes, to be
+deployed to tree-based models. To demonstrate this, we study measures of model
+sensitivity defined in terms of integrals of gradients and demonstrate how to
+compute them for regression trees using the proposed gradient estimates.
+Quantitative and qualitative numerical experiments reveal the capability of
+gradients estimated by regression trees to improve predictive analysis, solve
+tasks in uncertainty quantification, and provide interpretation of model
+behavior.
+
+
+
+ comment: Comments very welcome!
+
+
+
+
+
+
+ ♻ ☆ Asynchronous Message-Passing and Zeroth-Order Optimization Based
+ Distributed Learning with a Use-Case in Resource Allocation in Communication
+ Networks
+
+
+ Distributed learning and adaptation have received significant interest and
+found wide-ranging applications in machine learning and signal processing.
+While various approaches, such as shared-memory optimization, multi-task
+learning, and consensus-based learning (e.g., federated learning and learning
+over graphs), focus on optimizing either local costs or a global cost, there
+remains a need for further exploration of their interconnections. This paper
+specifically focuses on a scenario where agents collaborate towards a common
+task (i.e., optimizing a global cost equal to aggregated local costs) while
+effectively having distinct individual tasks (i.e., optimizing individual local
+parameters in a local cost). Each agent's actions can potentially impact other
+agents' performance through interactions. Notably, each agent has access to
+only its local zeroth-order oracle (i.e., cost function value) and shares
+scalar values, rather than gradient vectors, with other agents, leading to
+communication bandwidth efficiency and agent privacy. Agents employ
+zeroth-order optimization to update their parameters, and the asynchronous
+message-passing between them is subject to bounded but possibly random
+communication delays. This paper presents theoretical convergence analyses and
+establishes a convergence rate for nonconvex problems. Furthermore, it
+addresses the relevant use-case of deep learning-based resource allocation in
+communication networks and conducts numerical experiments in which agents,
+acting as transmitters, collaboratively train their individual policies to
+maximize a global reward, e.g., a sum of data rates.
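
The zeroth-order setting described above, where an agent sees only cost values and exchanges scalars rather than gradient vectors, can be illustrated with a generic two-point estimator. This is a minimal single-agent sketch of a standard technique, not the paper's asynchronous multi-agent algorithm. The names `two_point_estimate`, `zeroth_order_descent` and `quadratic_cost`, and the step sizes, are assumptions chosen for illustration.

```python
import math
import random


def two_point_estimate(f, x, u, mu=1e-3):
    """Estimate the gradient of f at x along direction u from two cost values."""
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    # Only two scalar cost evaluations are needed; no gradient is computed.
    slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
    return [slope * ui for ui in u]


def random_unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [vi / norm for vi in v]


def zeroth_order_descent(f, x0, steps=200, lr=0.1, mu=1e-3, seed=0):
    """Minimise f using only function values (hypothetical toy loop)."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        u = random_unit_vector(len(x), rng)
        g = two_point_estimate(f, x, u, mu)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def quadratic_cost(x):
    """Toy cost standing in for an agent's local cost."""
    return sum(xi * xi for xi in x)


start = [3.0, -4.0]
end = zeroth_order_descent(quadratic_cost, start)
print(quadratic_cost(start), quadratic_cost(end))
```

Because each update needs only the scalar `slope`, an agent in a distributed version could share that single number instead of a full gradient vector, which is the bandwidth argument the abstract makes.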
+
+
+ In this work, we study the generalizability of diffusion models by looking
+into the hidden properties of the learned score functions, which are
+essentially a series of deep denoisers trained on various noise levels. We
+observe that as diffusion models transition from memorization to
+generalization, their corresponding nonlinear diffusion denoisers exhibit
+increasing linearity. This discovery leads us to investigate the linear
+counterparts of the nonlinear diffusion models, which are a series of linear
+models trained to match the function mappings of the nonlinear diffusion
+denoisers. Surprisingly, these linear denoisers are approximately the optimal
+denoisers for a multivariate Gaussian distribution characterized by the
+empirical mean and covariance of the training dataset. This finding implies
+that diffusion models have an inductive bias towards capturing and utilizing
+the Gaussian structure (covariance information) of the training dataset for
+data generation. We empirically demonstrate that this inductive bias is a
+unique property of diffusion models in the generalization regime, which becomes
+increasingly evident when the model's capacity is relatively small compared to
+the training dataset size. In the case that the model is highly
+overparameterized, this inductive bias emerges during the initial training
+phases before the model fully memorizes its training data. Our study provides
+crucial insights into understanding the notably strong generalization
+phenomenon recently observed in real-world diffusion models.
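
For reference, the optimal (minimum mean-squared-error) denoiser for Gaussian data has a closed form, which makes the "linear denoiser" claim above concrete. The derivation below is a standard result, written under an assumed additive-noise model $y = x + \sigma\varepsilon$. The abstract does not state the noise parameterisation, so the symbols $\sigma$, $\mu$, $\Sigma$ and $D^\star$ are notation introduced here.

```latex
% Assumptions: x ~ N(mu, Sigma), noise eps ~ N(0, I) independent of x, y = x + sigma * eps.
\begin{align}
  y \mid x &\sim \mathcal{N}(x,\ \sigma^2 I), \qquad
  y \sim \mathcal{N}(\mu,\ \Sigma + \sigma^2 I), \\
  D^\star(y) = \mathbb{E}[x \mid y]
    &= \mu + \Sigma\,(\Sigma + \sigma^2 I)^{-1}\,(y - \mu). \\
  \intertext{With the eigendecomposition $\Sigma = U \Lambda U^\top$, $\Lambda = \mathrm{diag}(\lambda_1,\dots,\lambda_d)$:}
  D^\star(y) &= \mu + U\,\mathrm{diag}\!\left(\frac{\lambda_i}{\lambda_i + \sigma^2}\right) U^\top (y - \mu).
\end{align}
```

The map is affine in $y$: it shrinks each principal direction of the data by $\lambda_i / (\lambda_i + \sigma^2)$, so directions with variance far below the noise level are suppressed. Substituting the empirical mean and covariance of the training set for $\mu$ and $\Sigma$ gives the Gaussian denoiser the abstract refers to.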
+
+
+
+
+
+
+
+ ♻ ☆ OminiControl: Minimal and Universal Control for Diffusion Transformer
+
+
+ In this paper, we introduce OminiControl, a highly versatile and
+parameter-efficient framework that integrates image conditions into pre-trained
+Diffusion Transformer (DiT) models. At its core, OminiControl leverages a
+parameter reuse mechanism, enabling the DiT to encode image conditions using
+itself as a powerful backbone and process them with its flexible multi-modal
+attention processors. Unlike existing methods, which rely heavily on additional
+encoder modules with complex architectures, OminiControl (1) effectively and
+efficiently incorporates injected image conditions with only ~0.1% additional
+parameters, and (2) addresses a wide range of image conditioning tasks in a
+unified manner, including subject-driven generation and spatially-aligned
+conditions such as edges, depth, and more. Remarkably, these capabilities are
+achieved by training on images generated by the DiT itself, which is
+particularly beneficial for subject-driven generation. Extensive evaluations
+demonstrate that OminiControl outperforms existing UNet-based and DiT-adapted
+models in both subject-driven and spatially-aligned conditional generation.
+Additionally, we release our training dataset, Subjects200K, a diverse
+collection of over 200,000 identity-consistent images, along with an efficient
+data synthesis pipeline to advance research in subject-consistent generation.
+
+
+
+
+
+
+
+ ♻ ☆ What Differentiates Educational Literature? A Multimodal Fusion Approach
+ of Transformers and Computational Linguistics
+
+
+ The integration of new literature into the English curriculum remains a
+challenge since educators often lack scalable tools to rapidly evaluate
+readability and adapt texts for diverse classroom needs. This study proposes to
+address this gap through a multimodal approach that combines transformer-based
+text classification with linguistic feature analysis to align texts with UK Key
+Stages. Eight state-of-the-art Transformers were fine-tuned on segmented text
+data, with BERT achieving the highest unimodal F1 score of 0.75. In parallel,
+500 deep neural network topologies were searched for the classification of
+linguistic characteristics, achieving an F1 score of 0.392. The fusion of these
+modalities shows a significant improvement, with every multimodal approach
+outperforming all unimodal models. In particular, the ELECTRA Transformer fused
+with the neural network achieved an F1 score of 0.996. Unimodal and multimodal
+approaches are shown to have statistically significant differences in all
+validation metrics (accuracy, precision, recall, F1 score) except for inference
+time. The proposed approach is finally encapsulated in a stakeholder-facing web
+application, giving non-technical stakeholders access to real-time insights
+on text complexity, reading difficulty, curriculum alignment, and
+recommendations for learning age range. The application empowers data-driven
+decision making and reduces manual workload by integrating AI-based
+recommendations into lesson planning for English literature.
+
+
+
+
+
+
+
+ ♻ ☆ Discovering group dynamics in coordinated time series via hierarchical
+ recurrent switching-state models
+
+
+
+
+
+
+
+
+ Michael T. Wojnowicz, Kaitlin Gili, Preetish Rath, Eric Miller, Jeffrey Miller, Clifford Hancock, Meghan O'Donovan, Seth Elkin-Frankston, Tad T. Brunyé, Michael C. Hughes
+
+
+ We seek a computationally efficient model for a collection of time series
+arising from multiple interacting entities (a.k.a. "agents"). Recent models of
+spatiotemporal patterns across individuals fail to incorporate explicit
+system-level collective behavior that can influence the trajectories of
+individual entities. To address this gap in the literature, we present a new
+hierarchical switching-state model that can be trained in an unsupervised
+fashion to simultaneously learn both system-level and individual-level
+dynamics. We employ a latent system-level discrete state Markov chain that
+provides top-down influence on latent entity-level chains which in turn govern
+the emission of each observed time series. Recurrent feedback from the
+observations to the latent chains at both entity and system levels allows
+recent situational context to inform how dynamics unfold at all levels in
+bottom-up fashion. We hypothesize that including both top-down and bottom-up
+influences on group dynamics will improve interpretability of the learned
+dynamics and reduce error when forecasting. Our hierarchical switching
+recurrent dynamical model can be learned via closed-form variational coordinate
+ascent updates to all latent chains that scale linearly in the number of
+entities. This is asymptotically no more costly than fitting a separate model
+for each entity. Analysis of both synthetic data and real basketball team
+movements suggests our lean parametric model can achieve competitive forecasts
+compared to larger neural network models that require far more computational
+resources. Further experiments on soldier data as well as a synthetic task with
+64 cooperating entities show how our approach can yield interpretable insights
+about team dynamics over time.
+
+
+
+
+
+
+
+ ♻ ☆ A Conditional Independence Test in the Presence of Discretization
+
+
+ Testing conditional independence has many applications, such as in Bayesian
+network learning and causal discovery. Different test methods have been
+proposed. However, existing methods generally cannot work when only
+discretized observations are available. Specifically, suppose $X_1$,
+$\tilde{X}_2$ and $X_3$ are observed variables, where $\tilde{X}_2$ is a
+discretization of the latent variable $X_2$. Applying existing test methods to the
+observations of $X_1$, $\tilde{X}_2$ and $X_3$ can lead to a false conclusion
+about the underlying conditional independence of variables $X_1$, $X_2$ and
+$X_3$. Motivated by this, we propose a conditional independence test
+specifically designed to accommodate the presence of such discretization. To
+achieve this, we design the bridge equations to recover the parameter
+reflecting the statistical information of the underlying latent continuous
+variables. An appropriate test statistic and its asymptotic distribution under
+the null hypothesis of conditional independence have also been derived. Both
+theoretical results and empirical validation have been provided, demonstrating
+the effectiveness of our test methods.
+
+
+
+
+
+
+
+ ♻ ☆ ConvMixFormer- A Resource-efficient Convolution Mixer for
+ Transformer-based Dynamic Hand Gesture Recognition
+
+
+ Transformer models have demonstrated remarkable success in many domains such
+as natural language processing (NLP) and computer vision. With the growing
+interest in transformer-based architectures, they are now utilized for gesture
+recognition. We therefore explore and devise a novel ConvMixFormer architecture
+for dynamic hand gestures. The cost of self-attention in transformers scales
+quadratically with the length of the sequential data, which makes these models
+computationally heavy. To address this drawback, we design a
+resource-efficient model that replaces the self-attention in the transformer
+with a simple convolutional layer-based token mixer. The computational cost and
+parameter count of the convolution-based mixer are lower than those of
+quadratic self-attention. The convolution mixer helps the model capture the
+local spatial features that self-attention struggles to capture due to its
+sequential processing nature. Further, an efficient gate mechanism is employed
+instead of a conventional feed-forward network in the transformer to help the
+model control the flow of features within different stages of the proposed
+model. This design uses nearly half as many learnable parameters as the vanilla
+transformer, which helps in fast and efficient training. The proposed method is
+evaluated on NVidia Dynamic Hand Gesture and Briareo datasets and our model has
+achieved state-of-the-art results on single and multimodal inputs. We have also
+shown the parameter efficiency of the proposed ConvMixFormer model compared to
+other methods. The source code is available at
+https://github.com/mallikagarg/ConvMixFormer.
+
+
+
+
+
+
+
+ ♻ ☆ DGNN-YOLO: Dynamic Graph Neural Networks with YOLO11 for Small Object
+ Detection and Tracking in Traffic Surveillance
+
+
+
+
+
+
+
+
+ Shahriar Soudeep, M. F. Mridha, Md Abrar Jahin, Nilanjan Dey
+
+
+ Accurate detection and tracking of small objects such as pedestrians,
+cyclists, and motorbikes are critical for traffic surveillance systems, which
+are crucial in improving road safety and decision-making in intelligent
+transportation systems. However, traditional methods struggle with challenges
+such as occlusion, low resolution, and dynamic traffic conditions,
+necessitating innovative approaches to address these limitations. This paper
+introduces DGNN-YOLO, a novel framework integrating dynamic graph neural
+networks (DGNN) with YOLO11 to enhance small object detection and tracking in
+traffic surveillance systems. The framework leverages YOLO11's advanced spatial
+feature extraction capabilities for precise object detection and incorporates
+DGNN to dynamically model spatial-temporal relationships for robust real-time
+tracking. By constructing and updating graph structures, DGNN-YOLO
+effectively represents objects as nodes and their interactions as edges,
+ensuring adaptive and accurate tracking in complex and dynamic environments.
+Extensive experiments demonstrate that DGNN-YOLO consistently outperforms
+state-of-the-art methods in detecting and tracking small objects under diverse
+traffic conditions, achieving the highest precision (0.8382), recall (0.6875),
+and mAP@0.5:0.95 (0.6476), showcasing its robustness and scalability,
+particularly in challenging scenarios involving small and occluded objects.
+This work provides a scalable, real-time traffic surveillance and analysis
+solution, significantly contributing to intelligent transportation systems.
+
+
+
+
+
+
+
+
+ Chendi Qian, Andrei Manolache, Christopher Morris, Mathias Niepert
+
+
+ Message-passing graph neural networks (MPNNs) have emerged as a powerful
+paradigm for graph-based machine learning. Despite their effectiveness, MPNNs
+face challenges such as under-reaching and over-squashing, where limited
+receptive fields and structural bottlenecks hinder information flow in the
+graph. While graph transformers hold promise in addressing these issues, their
+scalability is limited due to quadratic complexity regarding the number of
+nodes, rendering them impractical for larger graphs. Here, we propose
+implicitly rewired message-passing neural networks (IPR-MPNNs), a novel
+approach that integrates implicit probabilistic graph rewiring into MPNNs. By
+introducing a small number of virtual nodes, i.e., adding additional nodes to a
+given graph and connecting them to existing nodes, in a differentiable,
+end-to-end manner, IPR-MPNNs enable long-distance message propagation,
+circumventing quadratic complexity. Theoretically, we demonstrate that
+IPR-MPNNs surpass the expressiveness of traditional MPNNs. Empirically, we
+validate our approach by showcasing its ability to mitigate under-reaching and
+over-squashing effects, achieving state-of-the-art performance across multiple
+graph datasets. Notably, IPR-MPNNs outperform graph transformers while
+remaining significantly more computationally efficient.
+
+
+
+ comment: Accepted at 38th Conference on Neural Information Processing Systems
+ (NeurIPS 2024), Vancouver, Canada
+
+
+
+
+
+
+ ♻ ☆ ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
+
+
+
+
+
+
+
+
+ Ezra Karger, Houtan Bastani, Chen Yueh-Han, Zachary Jacobs, Danny Halawi, Fred Zhang, Philip E. Tetlock
+
+
+ Forecasts of future events are essential inputs into informed
+decision-making. Machine learning (ML) systems have the potential to deliver
+forecasts at scale, but there is no framework for evaluating the accuracy of ML
+systems on a standardized set of forecasting questions. To address this gap, we
+introduce ForecastBench: a dynamic benchmark that evaluates the accuracy of ML
+systems on an automatically generated and regularly updated set of 1,000
+forecasting questions. To avoid any possibility of data leakage, ForecastBench
+is comprised solely of questions about future events that have no known answer
+at the time of submission. We quantify the capabilities of current ML systems
+by collecting forecasts from expert (human) forecasters, the general public,
+and LLMs on a random subset of questions from the benchmark ($N=200$). While
+LLMs have achieved super-human performance on many benchmarks, they perform
+less well here: expert forecasters outperform the top-performing LLM (p-value
+$<0.01$). We display system and human scores in a public leaderboard at
+www.forecastbench.org.
+
+
+
+
+
+
+
+ ♻ ☆ Physics-Informed Real NVP for Satellite Power System Fault Detection
+
+
+ The unique challenges posed by the space environment, characterized by
+extreme conditions and limited accessibility, raise the need for robust and
+reliable techniques to identify and prevent satellite faults. Fault detection
+methods in the space sector are required to ensure mission success and to
+protect valuable assets. In this context, this paper proposes an Artificial
+Intelligence (AI) based fault detection methodology and evaluates its
+performance on ADAPT (Advanced Diagnostics and Prognostics Testbed), an
+Electrical Power System (EPS) dataset created in a laboratory by NASA. Our study
+focuses on the application of a physics-informed (PI) real-valued non-volume
+preserving (Real NVP) model for fault detection in space systems. The efficacy
+of this method is systematically compared against other AI approaches such as
+Gated Recurrent Unit (GRU) and Autoencoder-based techniques. Results show that
+our physics-informed approach outperforms existing methods of fault detection,
+demonstrating its suitability for addressing the unique challenges of satellite
+EPS sub-system faults. Furthermore, we unveil the competitive advantage of
+physics-informed loss in AI models to address specific space needs, namely
+robustness, reliability, and power constraints, crucial for space exploration
+and satellite missions.
+
+
+
+ comment: C. Cena, U. Albertin, M. Martini, S. Bucci and M. Chiaberge,
+ "Physics-Informed Real NVP for Satellite Power System Fault Detection," 2024
+ IEEE International Conference on Advanced Intelligent Mechatronics (AIM),
+ Boston, MA, USA, 2024, pp. 679-684, doi: 10.1109/AIM55361.2024.10636990
+
+ Mesh quality assessment (MQA) models play a critical role in the design,
+optimization, and evaluation of mesh operation systems in a wide variety of
+applications. Current MQA models, whether model-based methods using
+topology-aware features or projection-based approaches working on rendered 2D
+projections, often fail to capture the intricate interactions between texture
+and 3D geometry. We introduce HybridMQA, a first-of-its-kind hybrid
+full-reference colored MQA framework that integrates model-based and
+projection-based approaches, capturing complex interactions between textural
+information and 3D structures for enriched quality representations. Our method
+employs graph learning to extract detailed 3D representations, which are then
+projected to 2D using a novel feature rendering process that precisely aligns
+them with colored projections. This enables the exploration of geometry-texture
+interactions via cross-attention, producing comprehensive mesh quality
+representations. Extensive experiments demonstrate HybridMQA's superior
+performance across diverse datasets, highlighting its ability to effectively
+leverage geometry-texture interactions for a thorough understanding of mesh
+quality. Our implementation will be made publicly available.
+
+
+
+
+
+
+
+ ☆ X-Prompt: Towards Universal In-Context Image Generation in
+ Auto-Regressive Vision Language Foundation Models
+
+
+ In-context generation is a key component of large language models' (LLMs)
+open-task generalization capability. By leveraging a few examples as context,
+LLMs can perform both in-domain and out-of-domain tasks. Recent advancements in
+auto-regressive vision-language models (VLMs) built upon LLMs have showcased
+impressive performance in text-to-image generation. However, the potential of
+in-context learning for general image generation tasks remains largely
+unexplored. To address this, we introduce X-Prompt, a purely auto-regressive
+large vision-language model designed to deliver competitive performance across
+a wide range of both seen and unseen image generation tasks, all within a
+unified in-context learning framework. X-Prompt incorporates a specialized
+design that efficiently compresses valuable features from in-context examples,
+supporting longer in-context token sequences and improving its ability to
+generalize to unseen tasks. A unified training task for both text and image
+prediction enables X-Prompt to handle general image generation with enhanced
+task awareness from in-context examples. Extensive experiments validate the
+model's performance across diverse seen image generation tasks and its capacity
+to generalize to previously unseen tasks.
+
+
+ RGB-Thermal Salient Object Detection aims to pinpoint prominent objects
+within aligned pairs of visible and thermal infrared images. Traditional
+encoder-decoder architectures, while designed for cross-modality feature
+interactions, may not have adequately considered the robustness against noise
+originating from defective modalities. Inspired by hierarchical human visual
+systems, we propose ConTriNet, a robust Confluent Triple-Flow Network
+employing a Divide-and-Conquer strategy. Specifically, ConTriNet comprises
+three flows: two modality-specific flows explore cues from RGB and Thermal
+modalities, and a third modality-complementary flow integrates cues from both
+modalities. ConTriNet presents several notable advantages. It incorporates a
+Modality-induced Feature Modulator in the modality-shared union encoder to
+minimize inter-modality discrepancies and mitigate the impact of defective
+samples. Additionally, a foundational Residual Atrous Spatial Pyramid Module in
+the separated flows enlarges the receptive field, allowing for the capture of
+multi-scale contextual information. Furthermore, a Modality-aware Dynamic
+Aggregation Module in the modality-complementary flow dynamically aggregates
+saliency-related cues from both modality-specific flows. Leveraging the
+proposed parallel triple-flow framework, we further refine saliency maps
+derived from different flows through a flow-cooperative fusion strategy,
+yielding a high-quality, full-resolution saliency map for the final prediction.
+To evaluate the robustness and stability of our approach, we collect a
+comprehensive RGB-T SOD benchmark, VT-IMAG, covering various real-world
+challenging scenarios. Extensive experiments on public benchmarks and our
+VT-IMAG dataset demonstrate that ConTriNet consistently outperforms
+state-of-the-art competitors in both common and challenging scenarios.
+
+
+ We introduce Presto, a novel video diffusion model designed to generate
+15-second videos with long-range coherence and rich content. Extending video
+generation methods to maintain scenario diversity over long durations presents
+significant challenges. To address this, we propose a Segmented Cross-Attention
+(SCA) strategy, which splits hidden states into segments along the temporal
+dimension, allowing each segment to cross-attend to a corresponding
+sub-caption. SCA requires no additional parameters, enabling seamless
+incorporation into current DiT-based architectures. To facilitate high-quality
+long video generation, we build the LongTake-HD dataset, consisting of 261k
+content-rich videos with scenario coherence, annotated with an overall video
+caption and five progressive sub-captions. Experiments show that our Presto
+achieves 78.5% on the VBench Semantic Score and 100% on the Dynamic Degree,
+outperforming existing state-of-the-art video generation methods. This
+demonstrates that our proposed Presto significantly enhances content richness,
+maintains long-range coherence, and captures intricate textual details. More
+details are displayed on our project page: https://presto-video.github.io/.
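
The segmentation step of the Segmented Cross-Attention strategy described above can be sketched as follows: frames are split into contiguous temporal segments, and each segment is paired with the sub-caption it would cross-attend to. This is an illustration under assumptions, not the authors' implementation. The abstract does not say how segments are sized, so the near-equal contiguous split, and the names `segment_bounds` and `pair_segments_with_captions`, are hypothetical. The attention computation itself is omitted.

```python
def segment_bounds(num_frames, num_segments):
    """Split `num_frames` into `num_segments` contiguous, near-equal ranges.

    Returns a list of (start, end) index pairs with `end` exclusive.
    """
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    base, extra = divmod(num_frames, num_segments)
    bounds = []
    start = 0
    for i in range(num_segments):
        # The first `extra` segments absorb the remainder, one frame each.
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def pair_segments_with_captions(hidden_states, sub_captions):
    """Pair each temporal segment of hidden states with its sub-caption."""
    bounds = segment_bounds(len(hidden_states), len(sub_captions))
    return [(hidden_states[a:b], caption)
            for (a, b), caption in zip(bounds, sub_captions)]


# Ten frame-level states and five progressive sub-captions (placeholders).
frames = list(range(10))
captions = ["c1", "c2", "c3", "c4", "c5"]
for segment, caption in pair_segments_with_captions(frames, captions):
    print(caption, segment)
```

Because the split only re-indexes existing hidden states, it adds no parameters, which is consistent with the abstract's claim that SCA can be dropped into existing DiT-based architectures.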
+
+
+ In this paper, we present a Neuron Abandoning Attention Flow (NAFlow) method
+to address the open problem of visually explaining the attention evolution
+dynamics inside CNNs when making their classification decisions. A novel
+cascading neuron abandoning back-propagation algorithm is designed to trace
+neurons in all layers of a CNN that are involved in its prediction, to address
+the problem of significant interference from abandoned neurons. Firstly, a
+Neuron Abandoning Back-Propagation (NA-BP) module is proposed to generate
+Back-Propagated Feature Maps (BPFM) by using the inverse function of the
+intermediate layers of CNN models, on which the neurons not used for
+decision-making are abandoned. Meanwhile, the cascading NA-BP modules calculate
+the tensors of importance coefficients which are linearly combined with the
+tensors of BPFMs to form the NAFlow. Secondly, to be able to visualize
+attention flow for similarity metric-based CNN models, a new channel
+contribution weights module is proposed to calculate the importance
+coefficients via Jacobian Matrix. The effectiveness of the proposed NAFlow is
+validated on nine widely-used CNN models for various tasks of general image
+classification, contrastive learning classification, few-shot image
+classification, and image retrieval.
+
+
+ We introduce OmniFlow, a novel generative model designed for any-to-any
+generation tasks such as text-to-image, text-to-audio, and audio-to-image
+synthesis. OmniFlow advances the rectified flow (RF) framework used in
+text-to-image models to handle the joint distribution of multiple modalities.
+It outperforms previous any-to-any models on a wide range of tasks, such as
+text-to-image and text-to-audio synthesis. Our work offers three key
+contributions: First, we extend RF to a multi-modal setting and introduce a
+novel guidance mechanism, enabling users to flexibly control the alignment
+between different modalities in the generated outputs. Second, we propose a
+novel architecture that extends the text-to-image MMDiT architecture of Stable
+Diffusion 3 and enables audio and text generation. The extended modules can be
+efficiently pretrained individually and merged with the vanilla text-to-image
+MMDiT for fine-tuning. Lastly, we conduct a comprehensive study on the design
+choices of rectified flow transformers for large-scale audio and text
+generation, providing valuable insights into optimizing performance across
+diverse modalities. The code will be available at
+https://github.com/jacklishufan/OmniFlows.
+
+
+ With the rapid advancement of diffusion-based generative models, portrait
+image animation has achieved remarkable results. However, it still faces
+challenges in temporally consistent video generation and fast sampling due to
+its iterative sampling nature. This paper presents FLOAT, an audio-driven
+talking portrait video generation method based on a flow matching generative
+model. We shift the generative modeling from the pixel-based latent space to a
+learned motion latent space, enabling efficient design of temporally consistent
+motion. To achieve this, we introduce a transformer-based vector field
+predictor with a simple yet effective frame-wise conditioning mechanism.
+Additionally, our method supports speech-driven emotion enhancement, enabling a
+natural incorporation of expressive motions. Extensive experiments demonstrate
+that our method outperforms state-of-the-art audio-driven talking portrait
+methods in terms of visual quality, motion fidelity, and efficiency.
+
+
+ While recent research has made significant progress in speech-driven talking
+face generation, the quality of the generated video still lags behind that of
+real recordings. One reason for this is the use of handcrafted intermediate
+representations like facial landmarks and 3DMM coefficients, which are designed
+based on human knowledge and are insufficient to precisely describe facial
+movements. Additionally, these methods require an external pretrained model for
+extracting these representations, whose performance sets an upper bound on
+talking face generation. To address these limitations, we propose a novel
+method called DAE-Talker that leverages data-driven latent representations
+obtained from a diffusion autoencoder (DAE). DAE contains an image encoder that
+encodes an image into a latent vector and a DDIM image decoder that
+reconstructs the image from it. We train our DAE on talking face video frames
+and then extract their latent representations as the training target for a
+Conformer-based speech2latent model. This allows DAE-Talker to synthesize full
+video frames and produce natural head movements that align with the content of
+speech, rather than relying on a predetermined head pose from a template video.
+We also introduce pose modelling in speech2latent for pose controllability.
+Additionally, we propose a novel method for generating continuous video frames
+with the DDIM image decoder trained on individual frames, eliminating the need
+for modelling the joint distribution of consecutive frames directly. Our
+experiments show that DAE-Talker outperforms existing popular methods in
+lip-sync, video fidelity, and pose naturalness. We also conduct ablation
+studies to analyze the effectiveness of the proposed techniques and demonstrate
+the pose controllability of DAE-Talker.
+
+
+
+ comment: Accepted to ACM Multimedia 2023
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Computation and Language 15
+
+
+
+
+
+ ♻ ☆ Instruction Tuning for Large Language Models: A Survey
+
+
+ This paper surveys research works in the quickly advancing field of
+instruction tuning (IT), which can also be referred to as supervised
+fine-tuning (SFT; unless specified otherwise, the two terms are used
+interchangeably in this paper), a crucial technique to enhance the capabilities and
+controllability of large language models (LLMs). Instruction tuning refers to
+the process of further training LLMs on a dataset consisting of
+(instruction, output) pairs in a supervised fashion, which bridges the
+gap between the next-word prediction objective of LLMs and the users' objective
+of having LLMs adhere to human instructions. In this work, we make a systematic
+review of the literature, including the general methodology of SFT, the
+construction of SFT datasets, the training of SFT models, and applications to
+different modalities, domains, and applications, along with an analysis of aspects
+that influence the outcome of SFT (e.g., generation of instruction outputs,
+size of the instruction dataset, etc.). We also review the potential pitfalls of
+SFT and criticism against it, as well as efforts pointing out current
+deficiencies of existing strategies, and suggest some avenues for fruitful
+research. Project Page: github.com/xiaoya-li/Instruction-Tuning-Survey
+
+
+
+ comment: V5; Last update: Dec. 1, 2024
+
+
+
+
+
+
+ ♻ ☆ CineXDrama: Relevance Detection and Sentiment Analysis of Bangla YouTube
+ Comments on Movie-Drama using Transformers: Insights from Interpretability
+ Tool
+
+
+ In recent years, YouTube has become the leading platform for Bangla movies
+and dramas, where viewers express their opinions in comments that convey their
+sentiments about the content. However, not all comments are relevant for
+sentiment analysis, necessitating a filtering mechanism. We propose a system
+that first assesses the relevance of comments and then analyzes the sentiment
+of those deemed relevant. We introduce a dataset of 14,000 manually collected
+and preprocessed comments, annotated for relevance (relevant or irrelevant) and
+sentiment (positive or negative). Eight transformer models, including
+BanglaBERT, were used for classification tasks, with BanglaBERT achieving the
+highest accuracy (83.99% for relevance detection and 93.3% for sentiment
+analysis). The study also integrates LIME to interpret model decisions,
+enhancing transparency.
+
+
+
+ comment: Accepted for publication in Fifth International Conference on
+ Advances in Electrical, Computing, Communications and Sustainable
+ Technologies (ICAECT 2025)
+
+ Deriving formal bounds on the expressivity of transformers, as well as
+studying transformers that are constructed to implement known algorithms, are
+both effective methods for better understanding the computational power of
+transformers. Towards both ends, we introduce the temporal counting logic
+$\textsf{K}_\text{t}$[#] alongside the RASP variant $\textsf{C-RASP}$. We show
+they are equivalent to each other, and that together they are the best-known
+lower bound on the formal expressivity of future-masked soft attention
+transformers with unbounded input size. We prove this by showing all
+$\textsf{K}_\text{t}$[#] formulas can be compiled into these transformers.
+
+
+
+
+
+
+
+ ♻ ☆ DPCSpell: A Transformer-based Detector-Purificator-Corrector Framework
+ for Spelling Error Correction of Bangla and Resource Scarce Indic Languages
+
+
+ Spelling error correction is the task of identifying and rectifying
+misspelled words in texts. It is an active research topic in
+Natural Language Processing because of numerous applications in human language
+understanding. The phonetically or visually similar yet semantically distinct
+characters make it an arduous task in any language. Earlier efforts on spelling
+error correction in Bangla and resource-scarce Indic languages focused on
+rule-based, statistical, and machine learning-based methods which we found
+rather inefficient. In particular, machine learning-based approaches, which
+exhibit superior performance to rule-based and statistical methods, are
+ineffective as they correct each character regardless of its appropriateness.
+In this paper, we propose DPCSpell, a novel detector-purificator-corrector
+framework based on denoising transformers that addresses these issues. In
+addition, we present a method for large-scale corpus creation from
+scratch, which in turn resolves the resource limitation problem of any
+left-to-right scripted language. The empirical outcomes demonstrate the
+effectiveness of our approach, which outperforms previous state-of-the-art
+methods by attaining an exact match (EM) score of 94.78%, a precision score of
+0.9487, a recall score of 0.9478, an f1 score of 0.948, an f0.5 score of
+0.9483, and a modified accuracy (MA) score of 95.16% for Bangla spelling error
+correction. The models and corpus are publicly available at
+https://tinyurl.com/DPCSpell.
+
+
+
+ comment: 29 pages, 4 figures, and 9 tables
+
+
+
+
+
+
+ ♻ ☆ Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for
+ Large Language Models
+
+
+ Assessing the effectiveness of large language models (LLMs) in performing
+different tasks is crucial for understanding their strengths and weaknesses.
+This paper presents the Hierarchical Prompting Taxonomy (HPT), grounded on
+human cognitive principles and designed to assess LLMs by examining the
+cognitive demands of various tasks. The HPT uses the Hierarchical Prompting
+Framework (HPF), a prompt selection framework that organizes five distinct
+prompting strategies by their cognitive load on LLMs. This study introduces the
+Hierarchical Prompting Index (HPI) to measure task complexity, which
+demonstrates LLMs' abilities across different datasets and serves as a
+universal metric for task complexity. The HPT offers a reliable method for
+evaluating LLMs' problem-solving skills in diverse scenarios, leading to
+clearer conclusions. Extensive experiments with multiple datasets and LLMs show
+that the HPF enhances LLM performance by 2% to 63% compared to standard
+benchmark datasets, confirming the effectiveness of the HPT. To support future
+research in this domain, the implementations of HPT and HPF are publicly
+available.
+
+
+
+
+
+
+
+ ♻ ☆ BERT or FastText? A Comparative Analysis of Contextual as well as
+ Non-Contextual Embeddings
+
+
+ Natural Language Processing (NLP) for low-resource languages presents
+significant challenges, particularly due to the scarcity of high-quality
+annotated data and linguistic resources. The choice of embeddings plays a
+critical role in enhancing the performance of NLP tasks, such as news
+classification, sentiment analysis, and hate speech detection, especially for
+low-resource languages like Marathi. In this study, we investigate the impact
+of various embedding techniques (contextual BERT-based, non-contextual
+BERT-based, and FastText-based) on NLP classification tasks specific to the
+Marathi language. Our research includes a thorough evaluation of both
+compressed and uncompressed embeddings, providing a comprehensive overview of
+how these embeddings perform across different scenarios. Specifically, we
+compare two BERT model embeddings, Muril and MahaBERT, as well as two FastText
+model embeddings, IndicFT and MahaFT. Our evaluation includes applying
+embeddings to a Multiple Logistic Regression (MLR) classifier for task
+performance assessment, as well as TSNE visualizations to observe the spatial
+distribution of these embeddings. The results demonstrate that contextual
+embeddings outperform non-contextual embeddings. Furthermore, BERT-based
+non-contextual embeddings extracted from the first BERT embedding layer yield
+better results than FastText-based embeddings, suggesting a potential
+alternative to FastText embeddings.
+
+
+
+
+
+
+
+ ♻ ☆ Retrieving Implicit and Explicit Emotional Events Using Large Language
+ Models
+
+
+ Large language models (LLMs) have garnered significant attention in recent
+years due to their impressive performance. While considerable research has
+evaluated these models from various perspectives, the extent to which LLMs can
+perform implicit and explicit emotion retrieval remains largely unexplored. To
+address this gap, this study investigates LLMs' emotion retrieval capabilities
+in commonsense. Through extensive experiments involving multiple models, we
+systematically evaluate the ability of LLMs on emotion retrieval. Specifically,
+we propose a supervised contrastive probing method to verify LLMs' performance
+for implicit and explicit emotion retrieval, as well as the diversity of the
+emotional events they retrieve. The results offer valuable insights into the
+strengths and limitations of LLMs in handling emotion retrieval.
+
+
+
+
+
+
+
+ ♻ ☆ Unlocking Korean Verbs: A User-Friendly Exploration into the Verb
+ Lexicon NAACL 2025
+
+
+
+
+
+
+
+
+ Seohyun Song, Eunkyul Leah Jo, Yige Chen, Jeen-Pyo Hong, Kyuwon Kim, Jin Wee, Miyoung Kang, KyungTae Lim, Jungyeul Park, Chulwoo Park
+
+
+ The Sejong dictionary dataset offers a valuable resource, providing extensive
+coverage of morphology, syntax, and semantic representation. This dataset can
+be utilized to explore linguistic information in greater depth. The labeled
+linguistic structures within this dataset form the basis for uncovering
+relationships between words and phrases and their associations with target
+verbs. This paper introduces a user-friendly web interface designed for the
+collection and consolidation of verb-related information, with a particular
+focus on subcategorization frames. Additionally, it outlines our efforts in
+mapping this information by aligning subcategorization frames with
+corresponding illustrative sentence examples. Furthermore, we provide a Python
+library that would simplify syntactic parsing and semantic role labeling. These
+tools are intended to assist individuals interested in harnessing the Sejong
+dictionary dataset to develop applications for Korean language processing.
+
+
+
+ comment: NAACL 2025 System Demonstrations (Submitted)
+
+
+
+
+
+
+ ♻ ☆ ManiTweet: A New Benchmark for Identifying Manipulation of News on
+ Social Media COLING 2025
+
+
+ Considerable advancements have been made to tackle the misrepresentation of
+information derived from reference articles in the domains of fact-checking and
+faithful summarization. However, an unaddressed aspect remains - the
+identification of social media posts that manipulate information within
+associated news articles. This task presents a significant challenge, primarily
+due to the prevalence of personal opinions in such posts. We present a novel
+task, identifying manipulation of news on social media, which aims to detect
+manipulation in social media posts and identify manipulated or inserted
+information. To study this task, we have proposed a data collection schema and
+curated a dataset called ManiTweet, consisting of 3.6K pairs of tweets and
+corresponding articles. Our analysis demonstrates that this task is highly
+challenging, with large language models (LLMs) yielding unsatisfactory
+performance. Additionally, we have developed a simple yet effective basic model
+that outperforms LLMs significantly on the ManiTweet dataset. Finally, we have
+conducted an exploratory analysis of human-written tweets, unveiling intriguing
+connections between manipulation and the domain and factuality of news
+articles, as well as revealing that manipulated sentences are more likely to
+encapsulate the main story or consequences of a news outlet.
+
+
+
+ comment: COLING 2025
+
+
+
+
+
+
+ ♻ ☆ Retrieval Augmented Instruction Tuning for Open NER with Large Language
+ Models COLING 2025
+
+
+
+
+
+
+
+
+ Tingyu Xie, Jian Zhang, Yan Zhang, Yuanyuan Liang, Qi Li, Hongwei Wang
+
+
+ The strong capability of large language models (LLMs) has been applied to
+information extraction (IE) through either retrieval augmented prompting or
+instruction tuning (IT). However, the best way to incorporate information with
+LLMs for IE remains an open question. In this paper, we explore Retrieval
+Augmented Instruction Tuning (RA-IT) for IE, focusing on the task of open named
+entity recognition (NER). Specifically, for each training sample, we retrieve
+semantically similar examples from the training dataset as the context and
+prepend them to the input of the original instruction. To evaluate our RA-IT
+approach more thoroughly, we construct a Chinese IT dataset for open NER and
+evaluate RA-IT in both English and Chinese scenarios. Experimental results
+verify the effectiveness of RA-IT across various data sizes and in both English
+and Chinese scenarios. We also conduct thorough studies to explore the impacts
+of various retrieval strategies in the proposed RA-IT framework. Code and data
+are available at: https://github.com/Emma1066/Retrieval-Augmented-IT-OpenNER
+
+
+
+ comment: To appear at COLING 2025
+
+
+
+
+
+
+ ♻ ☆ A Survey on Human-Centric LLMs
+
+
+
+
+
+
+
+
+ Jing Yi Wang, Nicholas Sukiennik, Tong Li, Weikang Su, Qianyue Hao, Jingbo Xu, Zihan Huang, Fengli Xu, Yong Li
+
+
+ The rapid evolution of large language models (LLMs) and their capacity to
+simulate human cognition and behavior has given rise to LLM-based frameworks
+and tools that are evaluated and applied based on their ability to perform
+tasks traditionally performed by humans, namely those involving cognition,
+decision-making, and social interaction. This survey provides a comprehensive
+examination of such human-centric LLM capabilities, focusing on their
+performance in both individual tasks (where an LLM acts as a stand-in for a
+single human) and collective tasks (where multiple LLMs coordinate to mimic
+group dynamics). We first evaluate LLM competencies across key areas including
+reasoning, perception, and social cognition, comparing their abilities to
+human-like skills. Then, we explore real-world applications of LLMs in
+human-centric domains such as behavioral science, political science, and
+sociology, assessing their effectiveness in replicating human behaviors and
+interactions. Finally, we identify challenges and future research directions,
+such as improving LLM adaptability, emotional intelligence, and cultural
+sensitivity, while addressing inherent biases and enhancing frameworks for
+human-AI collaboration. This survey aims to provide a foundational
+understanding of LLMs from a human-centric perspective, offering insights into
+their current capabilities and potential for future development.
+
+
+
+
+
+
+
+
+ Bo Chen, Xiaoyu Li, Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song
+
+
+ Characterizing the expressive power of the Transformer architecture is critical
+to understanding its capacity limits and scaling law. Recent works provide
+circuit complexity bounds for Transformer-like architectures. On the other hand,
+Rotary Position Embedding ($\mathsf{RoPE}$) has emerged as a crucial technique
+in modern large language models, offering superior performance in capturing
+positional information compared to traditional position embeddings and showing
+great potential in applications, particularly in long-context
+scenarios. Empirical evidence also suggests that $\mathsf{RoPE}$-based
+Transformer architectures demonstrate greater generalization capabilities
+compared to conventional Transformer models. In this work, we establish a
+circuit complexity bound for Transformers with $\mathsf{RoPE}$ attention. Our
+key contribution is to show that, unless $\mathsf{TC}^0 = \mathsf{NC}^1$, a
+$\mathsf{RoPE}$-based Transformer with $\mathrm{poly}(n)$-precision, $O(1)$
+layers, and hidden dimension $d \leq O(n)$ cannot solve the arithmetic formula
+evaluation problem or the Boolean formula value problem. This result
+demonstrates a fundamental limitation on the expressivity of
+the $\mathsf{RoPE}$-based Transformer architecture, despite its great
+empirical success. Our theoretical result not only establishes the complexity
+bound but may also guide further work on the $\mathsf{RoPE}$-based
+Transformer.
+
+
+
+
+
+
+
+ ♻ ☆ MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models
+ for Integrated Capabilities
+
+
+ MM-Vet, with open-ended vision-language questions aimed at evaluating
+integrated capabilities, has become one of the most popular benchmarks for
+large multimodal model evaluation. MM-Vet assesses six core vision-language
+(VL) capabilities: recognition, knowledge, spatial awareness, language
+generation, OCR, and math. However, its question format is restricted to single
+image-text pairs, lacking the interleaved image and text sequences prevalent in
+real-world scenarios. To address this limitation, we introduce MM-Vet v2, which
+includes a new VL capability called "image-text sequence understanding",
+evaluating models' ability to process VL sequences. Furthermore, we maintain
+the high quality of evaluation samples while further expanding the evaluation
+set size. Using MM-Vet v2 to benchmark large multimodal models, we found that
+Claude 3.5 Sonnet is the best model with a score of 71.8, slightly
+outperforming GPT-4o which scored 71.0. Among open-weight models,
+InternVL2-Llama3-76B leads with a score of 68.4. The code, data, and
+leaderboard are accessible at https://github.com/yuweihao/MM-Vet.
+
+
+
+ comment: Code, data and leaderboard: https://github.com/yuweihao/MM-Vet
+
+
+
+
+
+
+ ♻ ☆ MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities ICML 2024
+
+
+ We propose MM-Vet, an evaluation benchmark that examines large multimodal
+models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various
+intriguing abilities, such as solving math problems written on the blackboard,
+reasoning about events and celebrities in news images, and explaining visual
+jokes. Rapid model advancements pose challenges to evaluation benchmark
+development. Problems include: (1) How to systematically structure and evaluate
+the complicated multimodal tasks; (2) How to design evaluation metrics that
+work well across question and answer types; and (3) How to give model insights
+beyond a simple performance ranking. To this end, we present MM-Vet, designed
+based on the insight that the intriguing ability to solve complicated tasks is
+often achieved by a generalist model being able to integrate different core
+vision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and
+examines the 16 integrations of interest derived from the capability
+combination. For evaluation metrics, we propose an LLM-based evaluator for
+open-ended outputs. The evaluator enables the evaluation across different
+question types and answer styles, resulting in a unified scoring metric. We
+evaluate representative LMMs on MM-Vet, providing insights into the
+capabilities of different LMM system paradigms and models.
+
+
+
+ comment: ICML 2024. Code, data and leaderboard:
+ https://github.com/yuweihao/MM-Vet
+
+
+
+
+
+
+
+ Han Zhou, Xingchen Wan, Lev Proleev, Diana Mincu, Jilin Chen, Katherine Heller, Subhrajit Roy
+
+
+ Prompting and in-context learning (ICL) have become efficient learning
+paradigms for large language models (LLMs). However, LLMs suffer from prompt
+brittleness and various bias factors in the prompt, including but not limited
+to the formatting, the choice of verbalizers, and the ICL examples. To address
+this problem, which results in unexpected performance degradation, calibration
+methods have been developed to mitigate the effects of these biases while
+recovering LLM performance. In this work, we first conduct a systematic
+analysis of the existing calibration methods, where we both provide a unified
+view and reveal the failure cases. Inspired by these analyses, we propose Batch
+Calibration (BC), a simple yet intuitive method that controls the contextual
+bias from the batched input, unifies various prior approaches, and effectively
+addresses the aforementioned issues. BC is zero-shot, inference-only, and
+incurs negligible additional costs. In the few-shot setup, we further extend BC
+to allow it to learn the contextual bias from labeled data. We validate the
+effectiveness of BC with PaLM 2-(S, M, L) and CLIP models and demonstrate
+state-of-the-art performance over previous calibration baselines across more
+than 10 natural language understanding and image classification tasks.
+
+
+
+ comment: ICLR 2024
+
+
+
+
+
+
+
+
+
+ Information Retrieval 13
+
+
+
+
+
+ ☆ CoRNStack: High-Quality Contrastive Data for Better Code Ranking
+
+
+ Effective code retrieval plays a crucial role in advancing code generation,
+bug fixing, and software maintenance, particularly as software systems increase
+in complexity. While current code embedding models have demonstrated promise in
+retrieving code snippets for small-scale, well-defined tasks, they often
+underperform in more demanding real-world applications such as bug localization
+within GitHub repositories. We hypothesize that a key issue is their reliance
+on noisy and inconsistent datasets for training, which impedes their ability to
+generalize to more complex retrieval scenarios. To address these limitations,
+we introduce CoRNStack, a large-scale, high-quality contrastive training
+dataset for code that spans multiple programming languages. This dataset is
+curated using consistency filtering to eliminate noisy positives and is further
+enriched with mined hard negatives, thereby facilitating more effective
+learning. We demonstrate that contrastive training of embedding models using
+CoRNStack leads to state-of-the-art performance across a variety of code
+retrieval tasks. Furthermore, the dataset can be leveraged for training code
+reranking models, a largely underexplored area compared to text reranking. Our
+finetuned code reranking model significantly improves the ranking quality over
+the retrieved results. Finally, by employing our code retriever and reranker
+together, we demonstrate significant improvements in function localization for
+GitHub issues, an important component of real-world software development.
+
+
+
+
+
+
+
+ ☆ Patent-publication pairs for the detection of knowledge transfer from
+ research to industry: reducing ambiguities with word embeddings and
+ references
+
+
+ The performance of medical research can be viewed and evaluated not only from
+the perspective of publication output, but also from the perspective of
+economic exploitability. Patents can represent the exploitation of research
+results and thus the transfer of knowledge from research to industry. In this
+study, we set out to identify publication-patent pairs in order to use patents
+as a proxy for the economic impact of research. To identify these pairs, we
+matched scholarly publications and patents by comparing the names of authors
+and inventors. To resolve the ambiguities that arise in this name-matching
+process, we expanded our approach with two additional filter features, one used
+to assess the similarity of text content, the other to identify common
+references in the two document types. To evaluate text similarity, we extracted
+and transformed technical terms from a medical ontology (MeSH) into numerical
+vectors using word embeddings. We then calculated the results of the two
+supporting features over an example five-year period. Furthermore, we developed
+a statistical procedure which can be used to determine valid patent classes for
+the domain of medicine. Our complete data processing pipeline is freely
+available, from the raw data of the two document types right through to the
+validated publication-patent pairs.
+
+
+
+
+
+
+
+
+ T. Y. S. S. Santosh, Hassan Sarwat, Matthias Grabmair
+
+
+ In this paper, we introduce QABISAR, a novel framework for statutory article
+retrieval that overcomes the semantic mismatch problem arising when each
+query-article pair is modeled in isolation, which makes it hard to learn representations that
+can effectively capture multi-faceted information. QABISAR leverages bipartite
+interactions between queries and articles to capture diverse aspects inherent
+in them. Further, we employ knowledge distillation to transfer enriched query
+representations from the graph network into the query bi-encoder, to capture
+the rich semantics present in the graph representations, despite the absence of
+graph-based supervision for unseen queries during inference. Our experiments on
+a real-world expert-annotated dataset demonstrate its effectiveness.
+
+
+
+ comment: Accepted to COLING 2025
+
+
+
+
+
+
+ ☆ Oracle-guided Dynamic User Preference Modeling for Sequential
+ Recommendation
+
+
+
+
+
+
+
+
+ Jiafeng Xia, Dongsheng Li, Hansu Gu, Tun Lu, Peng Zhang, Li Shang, Ning Gu
+
+
+ Sequential recommendation methods can capture dynamic user preferences from
+user historical interactions to achieve better performance. However, most
+existing methods only use past information extracted from user historical
+interactions to train the models, leading to deviations in user preference
+modeling. Besides past information, future information is also available during
+training; it contains the ``oracle'' user preferences in the future and is
+beneficial for modeling dynamic user preferences. Therefore, we propose an
+oracle-guided dynamic user preference modeling method for sequential
+recommendation (Oracle4Rec), which leverages future information to guide model
+training on past information, aiming to learn ``forward-looking'' models.
+Specifically, Oracle4Rec first extracts past and future information through two
+separate encoders, then learns a forward-looking model through an
+oracle-guiding module which minimizes the discrepancy between past and future
+information. We also tailor a two-phase model training strategy to make the
+guiding more effective. Extensive experiments demonstrate that Oracle4Rec is
+superior to state-of-the-art sequential methods. Further experiments show that
+Oracle4Rec can be leveraged as a generic module in other sequential
+recommendation methods to improve their performance by a considerable margin.
+
+
+
+
+
+
+
+ ☆ Scaling New Frontiers: Insights into Large Recommendation Models
+
+
+
+
+
+
+
+
+ Wei Guo, Hao Wang, Luankang Zhang, Jin Yao Chin, Zhongzhou Liu, Kai Cheng, Qiushi Pan, Yi Quan Lee, Wanqi Xue, Tingjia Shen, Kenan Song, Kefan Wang, Wenjia Xie, Yuyang Ye, Huifeng Guo, Yong Liu, Defu Lian, Ruiming Tang, Enhong Chen
+
+
+ Recommendation systems are essential for filtering data and retrieving
+relevant information across various applications. Recent advancements have seen
+these systems incorporate increasingly large embedding tables, scaling up to
+tens of terabytes for industrial use. However, the expansion of network
+parameters in traditional recommendation models has plateaued at tens of
+millions, limiting further benefits from increased embedding parameters.
+Inspired by the success of large language models (LLMs), a new approach has
+emerged that scales network parameters using innovative structures, enabling
+continued performance improvements. A significant development in this area is
+Meta's generative recommendation model HSTU, which illustrates the scaling laws
+of recommendation systems by expanding parameters to thousands of billions.
+This new paradigm has achieved substantial performance gains in online
+experiments. In this paper, we aim to enhance the understanding of scaling laws
+by conducting comprehensive evaluations of large recommendation models.
+Firstly, we investigate the scaling laws across different backbone
+architectures of the large recommendation models. Secondly, we conduct
+comprehensive ablation studies to explore the origins of these scaling laws. We
+then further assess the performance of HSTU, as the representative of large
+recommendation models, on complex user behavior modeling tasks to evaluate its
+applicability. Notably, we also analyze its effectiveness in ranking tasks for
+the first time. Finally, we offer insights into future directions for large
+recommendation models. Supplementary materials for our research are available
+on GitHub at https://github.com/USTC-StarTeam/Large-Recommendation-Models.
+
+
+
+
+
+
+
+ ☆ Improving Vietnamese Legal Document Retrieval using Synthetic Data
+
+
+
+
+
+
+
+
+ Son Pham Tien, Hieu Nguyen Doan, An Nguyen Dai, Sang Dinh Viet
+
+
+ In the field of legal information retrieval, effective embedding-based models
+are essential for accurate question-answering systems. However, the scarcity of
+large annotated datasets poses a significant challenge, particularly for
+Vietnamese legal texts. To address this issue, we propose a novel approach that
+leverages large language models to generate high-quality, diverse synthetic
+queries for Vietnamese legal passages. This synthetic data is then used to
+pre-train retrieval models, specifically bi-encoder and ColBERT, which are
+further fine-tuned using contrastive loss with mined hard negatives. Our
+experiments demonstrate that these enhancements lead to strong improvement in
+retrieval accuracy, validating the effectiveness of synthetic data and
+pre-training techniques in overcoming the limitations posed by the lack of
+large labeled datasets in the Vietnamese legal domain.
+
+
+
+
+
+
+
+ ☆ Needle: A Generative-AI Powered Monte Carlo Method for Answering Complex
+ Natural Language Queries on Multi-modal Data
+
+
+ Multi-modal data, such as image data sets, often miss the detailed
+descriptions that properly capture the rich information encoded in them. This
+makes answering complex natural language queries a major challenge in these
+domains. In particular, unlike the traditional nearest-neighbor search, where
+the tuples and the query are modeled as points in a data cube, the query and
+the tuples are of different natures, making the traditional query answering
+solutions not directly applicable for such settings. Existing literature
+addresses this challenge for image data through vector representations jointly
+trained on natural language and images. This technique, however, underperforms
+for complex queries due to various reasons.
+ This paper takes a step towards addressing this challenge by introducing a
+Generative-AI (GenAI) powered Monte Carlo method that utilizes foundation
+models to generate synthetic samples that capture the complexity of the natural
+language query and transform it to the same space of the multi-modal data.
+Following this method, we develop a system for image data retrieval and propose
+practical solutions that enable leveraging future advancements in GenAI and
+vector representations for improving our system's performance. Our
+comprehensive experiments on various benchmark datasets verify that our system
+significantly outperforms state-of-the-art techniques.
+
+
+
+
+
+
+
+ ♻ ☆ DUET: A Tuning-Free Device-Cloud Collaborative Parameters Generation
+ Framework for Efficient Device Model Generalization WWW'23
+
+
+ Device Model Generalization (DMG) is a practical yet under-investigated
+research topic for on-device machine learning applications. It aims to improve
+the generalization ability of pre-trained models when deployed on
+resource-constrained devices, such as improving the performance of pre-trained
+cloud models on smart mobiles. While quite a lot of works have investigated the
+data distribution shift across clouds and devices, most of them focus on model
+fine-tuning on personalized data for individual devices to facilitate DMG.
+Despite their promise, these approaches require on-device re-training, which
+is practically infeasible due to the overfitting problem and high time delay
+when performing gradient calculation on real-time data. In this paper, we argue
+that the computational cost brought by fine-tuning can be rather unnecessary.
+We consequently present a novel perspective on improving DMG without increasing
+computational cost, i.e., device-specific parameter generation which directly
+maps data distribution to parameters. Specifically, we propose an efficient
+Device-cloUd collaborative parametErs generaTion framework DUET. DUET is
+deployed on a powerful cloud server that only requires the low cost of
+forward propagation and low time delay of data transmission between the
+device and the cloud. By doing so, DUET can rehearse the device-specific model
+weight realizations conditioned on the personalized real-time data for an
+individual device. Importantly, our DUET elegantly connects the cloud and
+device as a 'duet' collaboration, frees the DMG from fine-tuning, and enables a
+faster and more accurate DMG paradigm. We conduct an extensive experimental
+study of DUET on three public datasets, and the experimental results confirm
+our framework's effectiveness and generalisability for different DMG tasks.
+
+
+
+ comment: Published on WWW'23: Proceedings of the ACM on Web Conference 2023
+ (pp. 3077 - 3085)
+
+
+
+
+
+
+ ♻ ☆ Intelligent Model Update Strategy for Sequential Recommendation WWW'24
+
+
+ Modern online platforms are increasingly employing recommendation systems to
+address information overload and improve user engagement. There is an evolving
+paradigm in this research field that recommendation network learning occurs
+both on the cloud and on edges with knowledge transfer in between (i.e.,
+edge-cloud collaboration). Recent works push this field further by enabling
+edge-specific context-aware adaptivity, where model parameters are updated in
+real-time based on incoming on-edge data. However, we argue that frequent data
+exchanges between the cloud and edges often lead to inefficiency and waste of
+communication/computation resources, as considerable parameter updates might be
+redundant. To investigate this problem, we introduce Intelligent Edge-Cloud
+Parameter Request Model, abbreviated as IntellectReq.
+ IntellectReq is designed to operate on edge, evaluating the cost-benefit
+landscape of parameter requests with minimal computation and communication
+overhead. We formulate this as a novel learning task, aimed at the detection of
+out-of-distribution data, thereby fine-tuning adaptive communication
+strategies. Further, we employ statistical mapping techniques to convert
+real-time user behavior into a normal distribution, thereby employing
+multi-sample outputs to quantify the model's uncertainty and thus its
+generalization capabilities. Rigorous empirical validation on four
+widely-adopted benchmarks evaluates our approach, evidencing a marked
+improvement in the efficiency and generalizability of edge-cloud collaborative
+and dynamic recommendation systems.
+
+
+
+ comment: Published on WWW'24(Oral): Proceedings of the ACM on Web Conference
+ 2024 (pp. 3117-3128)
+
+
+
+
+
+
+ ♻ ☆ Leveraging Retrieval-Augmented Generation for Persian University
+ Knowledge Retrieval
+
+
+ This paper introduces an innovative approach using Retrieval-Augmented
+Generation (RAG) pipelines with Large Language Models (LLMs) to enhance
+information retrieval and query response systems for university-related
+question answering. By systematically extracting data from the university
+official webpage and employing advanced prompt engineering techniques, we
+generate accurate, contextually relevant responses to user queries.
+ We developed a comprehensive university benchmark, UniversityQuestionBench
+(UQB), to rigorously evaluate our system performance, based on common key
+metrics in the field of RAG pipelines, assessing accuracy and reliability
+through various metrics and real-world scenarios. Our experimental results
+demonstrate significant improvements in the precision and relevance of
+generated responses, enhancing user experience and reducing the time required
+to obtain relevant answers. In summary, this paper presents a novel application
+of RAG pipelines and LLMs, supported by a meticulously prepared university
+benchmark, offering valuable insights into advanced AI techniques for academic
+data retrieval and setting the stage for future research in this domain.
+
+
+
+
+
+
+
+
+ Liwei Deng, Penghao Chen, Ximu Zeng, Tianfu Wang, Yan Zhao, Kai Zheng
+
+
+ High-dimensional approximate $K$ nearest neighbor search (AKNN) is a
+fundamental task for various applications, including information retrieval.
+Most existing algorithms for AKNN can be decomposed into two main components,
+i.e., candidate generation and distance comparison operations (DCOs). While
+different methods have unique ways of generating candidates, they all share the
+same DCO process. In this study, we focus on accelerating the process of DCOs
+that dominates the time cost in most existing AKNN algorithms. To achieve this,
+we propose a Data-Aware Distance Estimation approach, called DADE, which
+approximates the exact distance in a lower-dimensional space. We theoretically
+prove that the distance estimation in DADE is unbiased in terms of data
+distribution. Furthermore, we propose an optimized estimation based on the
+unbiased distance estimation formulation. In addition, we propose a hypothesis
+testing approach to adaptively determine the number of dimensions needed to
+estimate the exact distance with sufficient confidence. We integrate DADE into
+widely-used AKNN search algorithms, e.g., IVF and HNSW, and conduct extensive
+experiments to demonstrate the superiority.
+
+
+
+ comment: Accepted by VLDB 2025
+
+
+
+
+
+
+ ♻ ☆ Potential Field Based Deep Metric Learning
+
+
+ Deep metric learning (DML) involves training a network to learn a
+semantically meaningful representation space. Many current approaches mine
+n-tuples of examples and model interactions within each tuple. We present a
+novel, compositional DML model, inspired by electrostatic fields in physics
+that, instead of in tuples, represents the influence of each example
+(embedding) by a continuous potential field, and superposes the fields to
+obtain their combined global potential field. We use attractive/repulsive
+potential fields to represent interactions among embeddings from images of the
+same/different classes. Contrary to typical learning methods, where mutual
+influence of samples is proportional to their distance, we enforce reduction in
+such influence with distance, leading to a decaying field. We show that such
+decay helps improve performance on real world datasets with large intra-class
+variations and label noise. Like other proxy-based methods, we also use proxies
+to succinctly represent sub-populations of examples. We evaluate our method on
+three standard DML benchmarks (Cars-196, CUB-200-2011, and SOP), where
+it outperforms state-of-the-art baselines.
+
+
+
+
+
+
+
+ ♻ ☆ G-RAG: Knowledge Expansion in Material Science
+
+
+ In the field of Material Science, effective information retrieval systems are
+essential for facilitating research. Traditional Retrieval-Augmented Generation
+(RAG) approaches in Large Language Models (LLMs) often encounter challenges
+such as outdated information, hallucinations, limited interpretability due to
+context constraints, and inaccurate retrieval. To address these issues, Graph
+RAG integrates graph databases to enhance the retrieval process. Our proposed
+method processes Material Science documents by extracting key entities
+(referred to as MatIDs) from sentences, which are then utilized to query
+external Wikipedia knowledge bases (KBs) for additional relevant information.
+We implement an agent-based parsing technique to achieve a more detailed
+representation of the documents. Our improved version of Graph RAG called G-RAG
+further leverages a graph database to capture relationships between these
+entities, improving both retrieval accuracy and contextual understanding. This
+enhanced approach demonstrates significant improvements in performance for
+domains that require precise information retrieval, such as Material Science.
+
+
+
+
+
+
+
+
+
+
+ Multimedia 2
+
+
+
+
+
+ ♻ ☆ Separate Anything You Describe
+
+
+
+
+
+
+
+
+ Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang
+
+
+ Language-queried audio source separation (LASS) is a new paradigm for
+computational auditory scene analysis (CASA). LASS aims to separate a target
+sound from an audio mixture given a natural language query, which provides a
+natural and scalable interface for digital audio applications. Recent works on
+LASS, despite attaining promising separation performance on specific sources
+(e.g., musical instruments, limited classes of audio events), are unable to
+separate audio concepts in the open domain. In this work, we introduce
+AudioSep, a foundation model for open-domain audio source separation with
+natural language queries. We train AudioSep on large-scale multimodal datasets
+and extensively evaluate its capabilities on numerous tasks including audio
+event separation, musical instrument separation, and speech enhancement.
+AudioSep demonstrates strong separation performance and impressive zero-shot
+generalization ability using audio captions or text labels as queries,
+substantially outperforming previous audio-queried and language-queried sound
+separation models. For reproducibility of this work, we will release the source
+code, evaluation benchmark and pre-trained model at:
+https://github.com/Audio-AGI/AudioSep.
+
+
+
+ comment: Code, benchmark and pre-trained models:
+ https://github.com/Audio-AGI/AudioSep
+
+
+
+
+
+
+ ♻ ☆ SongBsAb: A Dual Prevention Approach against Singing Voice Conversion
+ based Illegal Song Covers NDSS
+
+
+
+
+
+
+
+
+ Guangke Chen, Yedi Zhang, Fu Song, Ting Wang, Xiaoning Du, Yang Liu
+
+
+ Singing voice conversion (SVC) automates song covers by converting a source
+singing voice from a source singer into a new singing voice with the same
+lyrics and melody as the source, but that sounds as if covered by the target
+singer of some given target singing voices. However, it raises serious concerns
+about copyright and civil rights infringements. We propose SongBsAb, the first
+proactive approach to tackle SVC-based illegal song covers. SongBsAb adds
+perturbations to singing voices before releasing them, so that when they are
+used, the SVC process will be interfered with, leading to unexpected singing
+voices. Perturbations are carefully crafted to (1) provide a dual prevention,
+i.e., preventing the singing voice from being used as the source and target
+singing voice in SVC, by proposing a gender-transformation loss and a high/low
+hierarchy multi-target loss, respectively; and (2) be harmless, i.e., no
+side-effect on the enjoyment of protected songs, by refining a psychoacoustic
+model-based loss with the backing track as an additional masker, a unique
+accompanying element for singing voices compared to ordinary speech voices. We
+also adopt a frame-level interaction reduction-based loss and encoder ensemble
+to enhance the transferability of SongBsAb to unknown SVC models. We
+demonstrate the prevention effectiveness, harmlessness, and robustness of
+SongBsAb on five diverse and promising SVC models, using both English and
+Chinese datasets, and both objective and human study-based subjective metrics.
+Our work fosters an emerging research direction for mitigating illegal
+automated song covers.
+
+
+
+ comment: In Proceedings of the 32nd Network and Distributed System Security
+ (NDSS) Symposium 2025
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Computation and Language 25
+
+
+
+
+
+ ♻ ☆ A Review of Prominent Paradigms for LLM-Based Agents: Tool Use
+ (Including RAG), Planning, and Feedback Learning
+
+
+ Tool use, planning, and feedback learning are currently three prominent
+paradigms for developing Large Language Model (LLM)-based agents across various
+tasks. Although numerous frameworks have been devised for each paradigm, their
+intricate workflows and inconsistent taxonomy create challenges in
+understanding and reviewing the frameworks across different paradigms. This
+survey introduces a unified taxonomy to systematically review and discuss these
+frameworks. Specifically, 1) the taxonomy defines environments/tasks, common
+LLM-profiled roles or LMPRs (policy models, evaluators, and dynamic models),
+and universally applicable workflows found in prior work, and 2) it enables a
+comparison of key perspectives on the implementations of LMPRs and workflow
+designs across different agent paradigms and frameworks. 3) Finally, we
+identify three limitations in existing workflow designs and systematically
+discuss future work. Resources have been made publicly available in our
+GitHub repository https://github.com/xinzhel/LLM-Agent-Survey.
+
+
+
+ comment: CoLing 2025 Camera Ready (extended to 9 pages)
+
+
+
+
+
+
+ ♻ ☆ A Survey on Large Language Model-empowered Autonomous Driving
+
+
+ Artificial intelligence (AI) plays a crucial role in autonomous driving (AD)
+research, propelling its development towards intelligence and efficiency.
+Currently, the development of AD technology follows two main technical paths:
+modularization and end-to-end. Modularization decomposes the driving task into
+modules such as perception, prediction, planning, and control, and trains them
+separately. Due to the inconsistency of training objectives between modules,
+the integrated effect suffers from bias. End-to-end attempts to address this
+issue by utilizing a single model that directly maps from sensor data to
+control signals. This path has limited learning capabilities in a comprehensive
+set of features and struggles to handle unpredictable long-tail events and
+complex urban traffic scenarios. In the face of challenges encountered in both
+paths, many researchers believe that large language models (LLMs) with powerful
+reasoning capabilities and extensive knowledge understanding may be the
+solution, expecting LLMs to provide AD systems with deeper levels of
+understanding and decision-making capabilities. To
+understand if LLMs could enhance AD, this paper conducts a thorough analysis of
+the potential applications of LLMs in AD systems, including exploring their
+optimization strategies in both modular and end-to-end approaches, with a
+particular focus on how LLMs can tackle the problems and challenges present in
+current solutions. Furthermore, we discuss an important question: Can LLM-based
+artificial general intelligence (AGI) be a key to achieving high-level AD? We
+further analyze the potential limitations and challenges that LLMs may
+encounter in promoting the development of AD technology.
+
+
+
+
+
+
+
+ ♻ ☆ LLM Pruning and Distillation in Practice: The Minitron Approach
+
+
+ We present a comprehensive report on compressing the Llama 3.1 8B and Mistral
+NeMo 12B models to 4B and 8B parameters, respectively, using pruning and
+distillation. We explore two distinct pruning strategies: (1) depth pruning and
+(2) joint hidden/attention/MLP (width) pruning, and evaluate the results on
+common benchmarks from the LM Evaluation Harness. The models are then aligned
+with NeMo Aligner and tested in instruct-tuned versions. This approach produces
+a compelling 4B model from Llama 3.1 8B and a state-of-the-art
+Mistral-NeMo-Minitron-8B (MN-Minitron-8B for brevity) model from Mistral NeMo
+12B. We found that with no access to the original data, it is beneficial to
+slightly fine-tune teacher models on the distillation dataset. We open-source
+our base model weights on Hugging Face with a permissive license.
+
+
+
+ comment: v3: Update author list, other changes
+
+
+
+
+
+
+
+ Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin
+
+
+ As large language models gain widespread adoption, running them efficiently
+becomes crucial. Recent works on LLM inference use speculative decoding to
+achieve extreme speedups. However, most of these works implicitly design their
+algorithms for high-end datacenter hardware. In this work, we ask the opposite
+question: how fast can we run LLMs on consumer machines? Consumer GPUs can no
+longer fit the largest available models (50B+ parameters) and must offload them
+to RAM or SSD. When running with offloaded parameters, the inference engine can
+process batches of hundreds or thousands of tokens at the same time as just one
+token, making it a natural fit for speculative decoding. We propose SpecExec
+(Speculative Execution), a simple parallel decoding method that can generate up
+to 20 tokens per target model iteration for popular LLM families. It utilizes
+the high spikiness of the token probability distribution in modern LLMs and a
+high degree of alignment between model output probabilities. SpecExec takes the
+most probable token continuations from the draft model to build a "cache" tree
+for the target model, which then gets validated in a single pass. Using
+SpecExec, we demonstrate inference of 50B+ parameter LLMs on consumer GPUs with
+RAM offloading at 4-6 tokens per second with 4-bit quantization or 2-3 tokens
+per second with 16-bit weights.
+
+
+
+
+
+
+
+ ♻ ☆ Large Language Models as Interpolated and Extrapolated Event Predictors
+
+
+ Salient facts of sociopolitical events are distilled into quadruples
+following a format of subject, relation, object, and timestamp. Machine
+learning methods, such as graph neural networks (GNNs) and recurrent neural
+networks (RNNs), have been built to make predictions and infer relations on the
+quadruple-based knowledge graphs (KGs). In many applications, quadruples are
+extended to quintuples with auxiliary attributes such as text summaries that
+describe the quadruple events. In this paper, we comprehensively investigate
+how large language models (LLMs) streamline the design of event prediction
+frameworks using quadruple-based or quintuple-based data while maintaining
+competitive accuracy. We propose LEAP, a unified framework that leverages large
+language models as event predictors. Specifically, we develop multiple prompt
+templates to frame the object prediction (OP) task as a standard
+question-answering (QA) task, suitable for instruction fine-tuning with an
+encoder-decoder LLM. For the multi-event forecasting (MEF) task, we design a simple
+yet effective prompt template for each event quintuple. This novel approach
+removes the need for GNNs and RNNs, instead utilizing an encoder-only LLM to
+generate fixed intermediate embeddings, which are processed by a customized
+downstream head with a self-attention mechanism to predict potential relation
+occurrences in the future. Extensive experiments on multiple real-world
+datasets using various evaluation metrics validate the effectiveness of our
+approach.
+
+
+
+ comment: 11 pages, 3 figures, 10 tables
+
+
+
+
+
+
+ ♻ ☆ Data Mixture Inference: What do BPE Tokenizers Reveal about their
+ Training Data? NeurIPS
+
+
+
+
+
+
+
+
+ Jonathan Hayase, Alisa Liu, Yejin Choi, Sewoong Oh, Noah A. Smith
+
+
+ The pretraining data of today's strongest language models is opaque; in
+particular, little is known about the proportions of various domains or
+languages represented. In this work, we tackle a task which we call data
+mixture inference, which aims to uncover the distributional make-up of training
+data. We introduce a novel attack based on a previously overlooked source of
+information: byte-pair encoding (BPE) tokenizers, used by the vast majority of
+modern language models. Our key insight is that the ordered list of merge rules
+learned by a BPE tokenizer naturally reveals information about the token
+frequencies in its training data. Given a tokenizer's merge list along with
+example data for each category of interest, we formulate a linear program that
+solves for the proportion of each category in the tokenizer's training set. In
+controlled experiments, we show that our attack recovers mixture ratios with
+high precision for tokenizers trained on known mixtures of natural languages,
+programming languages, and data sources. We then apply our approach to
+off-the-shelf tokenizers released with recent LMs. We confirm much publicly
+disclosed information about these models, and also make several new inferences:
+GPT-4o and Mistral NeMo's tokenizers are much more multilingual than their
+predecessors, training on 39% and 47% non-English language data, respectively;
+Llama 3 extends GPT-3.5's tokenizer primarily for multilingual (48%) use;
+GPT-3.5's and Claude's tokenizers are trained on predominantly code (~60%). We
+hope our work sheds light on current design practices for pretraining data, and
+inspires continued research into data mixture inference for LMs.
+
+
+
+ comment: NeurIPS camera-ready, code at
+ https://github.com/alisawuffles/tokenizer-attack
+
+ Despite the importance of long-term memory in marketing and brand building,
+until now, there has been no large-scale study on the memorability of ads. All
+previous memorability studies have been conducted on short-term recall on
+specific content types like action videos. On the other hand, long-term
+memorability is crucial for the advertising industry, and ads are almost always
+highly multimodal. Therefore, we release the first memorability dataset,
+LAMBDA, consisting of 1749 participants and 2205 ads covering 276 brands.
+Running statistical tests over different participant subpopulations and ad
+types, we find many interesting insights into what makes an ad memorable, e.g.,
+fast-moving ads are more memorable than those with slower scenes; people who
+use ad-blockers remember a lower number of ads than those who don't. Next, we
+present a model, Henry, to predict the memorability of content. Henry
+achieves state-of-the-art performance across all prominent literature
+memorability datasets. It shows strong generalization performance with better
+results in 0-shot on unseen datasets. Finally, with the intent of memorable ad
+generation, we present a scalable method to build a high-quality memorable ad
+generation model by leveraging automatically annotated data. Our approach, SEED
+(Self rEwarding mEmorability Modeling), starts with a language model trained on
+LAMBDA as seed data and progressively trains an LLM to generate more memorable
+ads. We show that the generated advertisements have 44% higher memorability
+scores than the original ads. We release this large-scale ad dataset,
+UltraLAMBDA, consisting of 5 million ads. Our code and the datasets, LAMBDA and
+UltraLAMBDA, are open-sourced at
+https://behavior-in-the-wild.github.io/memorability.
+
+
+
+
+
+
+
+
+ Yuelyu Ji, Zhuochun Li, Rui Meng, Daqing He
+
+
+ Reranking documents based on their relevance to a given query is a critical
+task in information retrieval. Traditional reranking methods often lack
+transparency and rely on proprietary models, hindering reproducibility and
+interpretability. We propose Reason-to-Rank (R2R), a novel open-source
+reranking approach that enhances transparency by generating two types of
+reasoning: direct relevance reasoning, which explains how a document addresses
+the query, and comparison reasoning, which justifies the relevance of one
+document over another. We leverage large language models (LLMs) as teacher
+models to generate these explanations and distill this knowledge into smaller,
+openly available student models. Our student models are trained to generate
+meaningful reasoning and rerank documents, achieving competitive performance
+across multiple datasets, including MSMARCO and BRIGHT. Experiments demonstrate
+that R2R not only improves reranking accuracy but also provides valuable
+insights into the decision-making process. By offering a structured and
+interpretable solution with openly accessible resources, R2R aims to bridge the
+gap between effectiveness and transparency in information retrieval, fostering
+reproducibility and further research in the field.
+
+
+
+
+
+
+
+ ♻ ☆ FedMKT: Federated Mutual Knowledge Transfer for Large and Small Language
+ Models
+
+
+
+
+
+
+
+
+ Tao Fan, Guoqiang Ma, Yan Kang, Hanlin Gu, Yuanfeng Song, Lixin Fan, Kai Chen, Qiang Yang
+
+
+ Recent research in federated large language models (LLMs) has primarily
+focused on enabling clients to fine-tune their locally deployed homogeneous
+LLMs collaboratively or on transferring knowledge from server-based LLMs to
+small language models (SLMs) at downstream clients. However, a significant gap
+remains in the simultaneous mutual enhancement of both the server's LLM and
+clients' SLMs. To bridge this gap, we propose FedMKT, a parameter-efficient
+federated mutual knowledge transfer framework for large and small language
+models. This framework is designed to adaptively transfer knowledge from the
+server's LLM to clients' SLMs while concurrently enriching the LLM with
+clients' unique domain insights. We facilitate token alignment using minimum
+edit distance (MinED) and then selective mutual knowledge transfer between
+client-side SLMs and a server-side LLM, aiming to collectively enhance their
+performance. Through extensive experiments across three distinct scenarios, we
+evaluate the effectiveness of FedMKT using various public LLMs and SLMs on a
+range of NLP text generation tasks. Empirical results demonstrate that FedMKT
+simultaneously boosts the performance of both LLMs and SLMs.
+
+
+
+
+
+
+
+ ♻ ☆ LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property
+ Prediction NeurIPS 2024
+
+
+
+
+
+
+
+
+ Andre Niyongabo Rubungo, Kangming Li, Jason Hattrick-Simpers, Adji Bousso Dieng
+
+
+ Large language models (LLMs) are increasingly being used in materials
+science. However, little attention has been given to benchmarking and
+standardized evaluation for LLM-based materials property prediction, which
+hinders progress. We present LLM4Mat-Bench, the largest benchmark to date for
+evaluating the performance of LLMs in predicting the properties of crystalline
+materials. LLM4Mat-Bench contains about 1.9M crystal structures in total,
+collected from 10 publicly available materials data sources, and 45 distinct
+properties. LLM4Mat-Bench features different input modalities: crystal
+composition, CIF, and crystal text description, with 4.7M, 615.5M, and 3.1B
+tokens in total for each modality, respectively. We use LLM4Mat-Bench to
+fine-tune models with different sizes, including LLM-Prop and MatBERT, and
+provide zero-shot and few-shot prompts to evaluate the property prediction
+capabilities of LLM-chat-like models, including Llama, Gemma, and Mistral. The
+results highlight the challenges of general-purpose LLMs in materials science
+and the need for task-specific predictive models and task-specific
+instruction-tuned LLMs in materials property prediction.
+
+
+
+ comment: Accepted at NeurIPS 2024-AI4Mat Workshop. The Benchmark and code can
+ be found at https://github.com/vertaix/LLM4Mat-Bench
+
+
+
+
+
+
+ ♻ ☆ Unlocking Structured Thinking in Language Models with Cognitive
+ Prompting
+
+
+ We propose cognitive prompting as a novel approach to guide problem-solving
+in large language models (LLMs) through structured, human-like cognitive
+operations, such as goal clarification, decomposition, filtering, abstraction,
+and pattern recognition. By employing systematic, step-by-step reasoning,
+cognitive prompting enables LLMs to tackle complex, multi-step tasks more
+efficiently. We introduce three variants: a deterministic sequence of cognitive
+operations, a self-adaptive variant in which the LLM dynamically selects the
+sequence of cognitive operations, and a hybrid variant that uses generated
+correct solutions as few-shot chain-of-thought prompts. Experiments with LLaMA,
+Gemma 2, and Qwen models, each in two sizes, on the arithmetic reasoning
+benchmark GSM8K demonstrate that cognitive prompting significantly improves
+performance compared to standard question answering.
+
+
+
+ comment: 6 pages, submitted to ESANN 2025
+
+
+
+
+
+
+ ♻ ☆ ORAssistant: A Custom RAG-based Conversational Assistant for OpenROAD
+
+
+ Open-source Electronic Design Automation (EDA) tools are rapidly transforming
+chip design by addressing key barriers of commercial EDA tools such as
+complexity, costs, and access. Recent advancements in Large Language Models
+(LLMs) have further enhanced efficiency in chip design by providing user
+assistance across a range of tasks like setup, decision-making, and flow
+automation. This paper introduces ORAssistant, a conversational assistant for
+OpenROAD, based on Retrieval-Augmented Generation (RAG). ORAssistant aims to
+improve the user experience for the OpenROAD flow, from RTL to GDSII, by providing
+context-specific responses to common user queries, including installation,
+command usage, flow setup, and execution, in prose format. Currently,
+ORAssistant integrates OpenROAD, OpenROAD-flow-scripts, Yosys, OpenSTA, and
+KLayout. The data model is built from publicly available documentation and
+GitHub resources. The proposed architecture is scalable, supporting extensions
+to other open-source tools, operating modes, and LLM models. We use Google
+Gemini as the base LLM model to build and test ORAssistant. Early evaluation
+results of the RAG-based model show notable improvements in performance and
+accuracy compared to non-fine-tuned LLMs.
+
+
+
+
+
+
+
+ ♻ ☆ A Concept-Based Explainability Framework for Large Multimodal Models NeurIPS 2024
+
+
+ Large multimodal models (LMMs) combine unimodal encoders and large language
+models (LLMs) to perform multimodal tasks. Despite recent advancements towards
+the interpretability of these models, understanding internal representations of
+LMMs remains largely a mystery. In this paper, we present a novel framework for
+the interpretation of LMMs. We propose a dictionary learning based approach,
+applied to the representation of tokens. The elements of the learned dictionary
+correspond to our proposed concepts. We show that these concepts are well
+semantically grounded in both vision and text. Thus we refer to these as
+``multi-modal concepts''. We qualitatively and quantitatively evaluate the
+results of the learnt concepts. We show that the extracted multimodal concepts
+are useful to interpret representations of test samples. Finally, we evaluate
+the disentanglement between different concepts and the quality of grounding
+concepts visually and textually. Our code is publicly available at
+https://github.com/mshukor/xl-vlms
+
+
+
+ comment: NeurIPS 2024
+
+
+
+
+
+
+ ♻ ☆ Kalahi: A handcrafted, grassroots cultural LLM evaluation suite for
+ Filipino
+
+
+
+
+
+
+
+
+ Jann Railey Montalan, Jian Gang Ngui, Wei Qi Leong, Yosephine Susanto, Hamsawardhini Rengarajan, Alham Fikri Aji, William Chandra Tjhi
+
+
+ Multilingual large language models (LLMs) today may not necessarily provide
+culturally appropriate and relevant responses to their Filipino users. We
+introduce Kalahi, a cultural LLM evaluation suite collaboratively created by
+native Filipino speakers. It is composed of 150 high-quality, handcrafted and
+nuanced prompts that test LLMs for generations that are relevant to shared
+Filipino cultural knowledge and values. Strong LLM performance in Kalahi
+indicates a model's ability to generate responses similar to what an average
+Filipino would say or do in a given situation. We conducted experiments on LLMs
+with multilingual and Filipino language support. Results show that Kalahi,
+while trivial for Filipinos, is challenging for LLMs, with the best model
+answering only 46.0% of the questions correctly compared to native Filipino
+performance of 89.10%. Thus, Kalahi can be used to accurately and reliably
+evaluate Filipino cultural representation in LLMs.
+
+
+
+ comment: Accepted for presentation at Paclic 38, 2024
+
+
+
+
+
+
+ ♻ ☆ Uncovering Safety Risks of Large Language Models through Concept
+ Activation Vector NeurIPS 2024
+
+
+ Despite careful safety alignment, current large language models (LLMs) remain
+vulnerable to various attacks. To further unveil the safety risks of LLMs, we
+introduce a Safety Concept Activation Vector (SCAV) framework, which
+effectively guides the attacks by accurately interpreting LLMs' safety
+mechanisms. We then develop an SCAV-guided attack method that can generate both
+attack prompts and embedding-level attacks with automatically selected
+perturbation hyperparameters. Both automatic and human evaluations demonstrate
+that our attack method significantly improves the attack success rate and
+response quality while requiring less training data. Additionally, we find that
+our generated attack prompts may be transferable to GPT-4, and the
+embedding-level attacks may also be transferred to other white-box LLMs whose
+parameters are known. Our experiments further uncover the safety risks present
+in current LLMs. For example, in our evaluation of seven open-source LLMs, we
+observe an average attack success rate of 99.14%, based on the classic
+keyword-matching criterion. Finally, we provide insights into the safety
+mechanism of LLMs. The code is available at
+https://github.com/SproutNan/AI-Safety_SCAV.
+
+
+
+ comment: 10 pages, accepted at NeurIPS 2024
+
+
+
+
+
+
+ ♻ ☆ Fine-Grained Alignment in Vision-and-Language Navigation through
+ Bayesian Optimization
+
+
+
+
+
+
+
+
+ Yuhang Song, Mario Gianni, Chenguang Yang, Kunyang Lin, Te-Chuan Chiu, Anh Nguyen, Chun-Yi Lee
+
+
+ This paper addresses the challenge of fine-grained alignment in
+Vision-and-Language Navigation (VLN) tasks, where robots navigate realistic 3D
+environments based on natural language instructions. Current approaches use
+contrastive learning to align language with visual trajectory sequences.
+Nevertheless, they encounter difficulties with fine-grained vision negatives.
+To enhance cross-modal embeddings, we introduce a novel Bayesian
+Optimization-based adversarial optimization framework for creating fine-grained
+contrastive vision samples. To validate the proposed methodology, we conduct a
+series of experiments to assess the effectiveness of the enriched embeddings on
+fine-grained vision negatives. Experiments on two common VLN
+benchmarks, R2R and REVERIE, demonstrate that these
+embeddings benefit navigation and can lead to a promising performance
+enhancement. Our source code and trained models are available at:
+https://anonymous.4open.science/r/FGVLN.
+
+
+
+
+
+
+
+
+ Zhuo Chen, Yin Fang, Yichi Zhang, Lingbing Guo, Jiaoyan Che, Jeff Z. Pan, Huajun Chen, Wen Zhang
+
+
+ The rise of Multi-modal Pre-training highlights the necessity for a unified
+Multi-Modal Knowledge Graph (MMKG) representation learning framework. Such a
+framework is essential for embedding structured knowledge into multi-modal
+Large Language Models effectively, alleviating issues like knowledge
+misconceptions and multi-modal hallucinations. In this work, we explore the
+efficacy of models in accurately embedding entities within MMKGs through two
+pivotal tasks: Multi-modal Knowledge Graph Completion (MKGC) and Multi-modal
+Entity Alignment (MMEA). Building on this foundation, we propose a novel SNAG
+method that utilizes a Transformer-based architecture equipped with
+modality-level noise masking to robustly integrate multi-modal entity features
+in KGs. By incorporating specific training objectives for both MKGC and MMEA,
+our approach achieves SOTA performance across a total of ten datasets,
+demonstrating its versatility. Moreover, SNAG can not only function as a
+standalone model but also enhance other existing methods, providing stable
+performance improvements. Code and data are available at
+https://github.com/zjukg/SNAG.
+
+
+
+ comment: Accepted at COLING 2025. Repo is available at
+ https://github.com/zjukg/SNAG
+
+ Common approaches rely on fixed-length embedding vectors from language models
+as sentence embeddings for downstream tasks such as semantic textual similarity
+(STS). Such methods are limited in their flexibility due to unknown
+computational constraints and budgets across various applications. Matryoshka
+Representation Learning (MRL) \cite{aditya2022matryoshka} encodes information
+at finer granularities, i.e., with lower embedding dimensions, to adaptively
+accommodate ad hoc tasks. Similar accuracy can be achieved with a
+smaller embedding size, leading to speedups in downstream tasks. Despite its
+improved efficiency, MRL still requires traversing all Transformer layers
+before obtaining the embedding, which remains the dominant factor in time and
+memory consumption. This prompts consideration of whether the fixed number of
+Transformer layers affects representation quality and whether using
+intermediate layers for sentence representation is feasible. In this paper, we
+introduce a novel sentence embedding model called Two-dimensional
+Matryoshka Sentence Embedding (2DMSE; our code is available at
+https://github.com/SeanLee97/AnglE/blob/main/README_2DMSE.md). It
+supports elastic settings for both embedding sizes and Transformer layers,
+offering greater flexibility and efficiency than MRL. We conduct extensive
+experiments on STS tasks and downstream applications. The experimental results
+demonstrate the effectiveness of our proposed model in dynamically supporting
+different embedding sizes and Transformer layers, allowing it to be highly
+adaptable to various scenarios.
+
+
+
+ comment: Decoupled with ESE
+
+
+
+
+
+
+ ♻ ☆ LLMs and Finetuning: Benchmarking cross-domain performance for hate
+ speech detection
+
+
+ In the evolving landscape of online communication, hate speech detection
+remains a formidable challenge, further compounded by the diversity of digital
+platforms. This study investigates the effectiveness and adaptability of
+pre-trained and fine-tuned Large Language Models (LLMs) in identifying hate
+speech, to address three central questions: (1) To what extent does model
+performance depend on the fine-tuning and training parameters? (2) To what
+extent do models generalize to cross-domain hate speech detection? (3) What
+are the specific features of the datasets or models that influence the
+generalization potential? The experiment shows that LLMs offer a huge advantage
+over the state-of-the-art even without pretraining. Ordinary least squares
+analyses suggest that the advantage of training with fine-grained hate speech
+labels is washed away with the increase in dataset size. We conclude with a
+vision for the future of hate speech detection, emphasizing cross-domain
+generalizability and appropriate benchmarking practices.
+
+
+
+ comment: 10 pages, 3 figures, 5 tables
+
+
+
+
+
+
+ ♻ ☆ Scaling Laws for Precision
+
+
+
+
+
+
+
+
+ Tanishq Kumar, Zachary Ankner, Benjamin F. Spector, Blake Bordelon, Niklas Muennighoff, Mansheej Paul, Cengiz Pehlevan, Christopher Ré, Aditi Raghunathan
+
+
+ Low precision training and inference affect both the quality and cost of
+language models, but current scaling laws do not account for this. In this
+work, we devise "precision-aware" scaling laws for both training and inference.
+We propose that training in lower precision reduces the model's "effective
+parameter count," allowing us to predict the additional loss incurred from
+training in low precision and post-training quantization. For inference, we find
+that the degradation introduced by post-training quantization increases as
+models are trained on more data, eventually making additional pretraining data
+actively harmful. For training, our scaling laws allow us to predict the loss
+of a model with different parts in different precisions, and suggest that
+training larger models in lower precision may be compute optimal. We unify the
+scaling laws for post-training and pretraining quantization to arrive at a single
+functional form that predicts degradation from training and inference in varied
+precisions. We fit on over 465 pretraining runs and validate our predictions on
+model sizes up to 1.7B parameters trained on up to 26B tokens.
+
+
+
+
+
+
+
+ ♻ ☆ HEARTS: A Holistic Framework for Explainable, Sustainable and Robust
+ Text Stereotype Detection NeurIPS 2024
+
+
+
+
+
+
+
+
+ Theo King, Zekun Wu, Adriano Koshiyama, Emre Kazim, Philip Treleaven
+
+
+ Stereotypes are generalised assumptions about societal groups, and even
+state-of-the-art LLMs using in-context learning struggle to identify them
+accurately. Due to the subjective nature of stereotypes, where what constitutes
+a stereotype can vary widely depending on cultural, social, and individual
+perspectives, robust explainability is crucial. Explainable models ensure that
+these nuanced judgments can be understood and validated by human users,
+promoting trust and accountability. We address these challenges by introducing
+HEARTS (Holistic Framework for Explainable, Sustainable, and Robust Text
+Stereotype Detection), a framework that enhances model performance, minimises
+carbon footprint, and provides transparent, interpretable explanations. We
+establish the Expanded Multi-Grain Stereotype Dataset (EMGSD), comprising
+57,201 labelled texts across six groups, including under-represented
+demographics like LGBTQ+ and regional stereotypes. Ablation studies confirm
+that BERT models fine-tuned on EMGSD outperform those trained on individual
+components. We then analyse a fine-tuned, carbon-efficient ALBERT-V2 model
+using SHAP to generate token-level importance values, ensuring alignment with
+human understanding, and calculate explainability confidence scores by
+comparing SHAP and LIME outputs...
+
+
+
+ comment: NeurIPS 2024 SoLaR Workshop and NeurIPS 2024 Safety Gen AI Workshop
+
+
+
+
+
+
+ ♻ ☆ THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation
+ in Large Language Models NeurIPS 2024
+
+
+ Hallucination, the generation of factually incorrect content, is a growing
+challenge in Large Language Models (LLMs). Existing detection and mitigation
+methods are often isolated and insufficient for domain-specific needs, lacking
+a standardized pipeline. This paper introduces THaMES (Tool for Hallucination
+Mitigations and EvaluationS), an integrated framework and library addressing
+this gap. THaMES offers an end-to-end solution for evaluating and mitigating
+hallucinations in LLMs, featuring automated test set generation, multifaceted
+benchmarking, and adaptable mitigation strategies. It automates test set
+creation from any corpus, ensuring high data quality, diversity, and
+cost-efficiency through techniques like batch processing, weighted sampling,
+and counterfactual validation. THaMES assesses a model's ability to detect and
+reduce hallucinations across various tasks, including text generation and
+binary classification, applying optimal mitigation strategies like In-Context
+Learning (ICL), Retrieval Augmented Generation (RAG), and Parameter-Efficient
+Fine-tuning (PEFT). Evaluations of state-of-the-art LLMs using a knowledge base
+of academic papers, political news, and Wikipedia reveal that commercial models
+like GPT-4o benefit more from RAG than ICL, while open-weight models like
+Llama-3.1-8B-Instruct and Mistral-Nemo gain more from ICL. Additionally, PEFT
+significantly enhances the performance of Llama-3.1-8B-Instruct in both
+evaluation tasks.
+
+
+
+ comment: NeurIPS 2024 SoLaR (Socially Responsible Language Modelling Research)
+ Workshop
+
+
+
+
+
+
+ ♻ ☆ SAGED: A Holistic Bias-Benchmarking Pipeline for Language Models with
+ Customisable Fairness Calibration COLING 2025
+
+
+ The development of unbiased large language models is widely recognized as
+crucial, yet existing benchmarks fall short in detecting biases due to limited
+scope, contamination, and lack of a fairness baseline. SAGED(-Bias) is the
+first holistic benchmarking pipeline to address these problems. The pipeline
+encompasses five core stages: scraping materials, assembling benchmarks,
+generating responses, extracting numeric features, and diagnosing with
+disparity metrics. SAGED includes metrics for max disparity, such as impact
+ratio, and bias concentration, such as Max Z-scores. Noticing that assessment
+tool bias and contextual bias in prompts can distort evaluation, SAGED
+implements counterfactual branching and baseline calibration for mitigation.
+For demonstration, we use SAGED on G20 Countries with popular 8b-level models
+including Gemma2, Llama3.1, Mistral, and Qwen2. With sentiment analysis, we
+find that while Mistral and Qwen2 show lower max disparity and higher bias
+concentration than Gemma2 and Llama3.1, all models are notably biased against
+countries like Russia and (except for Qwen2) China. In further experiments in
+which models role-play U.S. (vice-/former-) presidents, we see that bias
+amplifies and shifts in heterogeneous directions. Moreover, Qwen2 and Mistral do
+not engage in role-playing, while Llama3.1 and Gemma2 role-play Trump notably more
+intensively than Biden and Harris, indicating role-playing performance bias in
+these models.
+
+
+ The advancement of text generation models has granted us the capability to
+produce coherent and convincing text on demand. Yet, in real-life
+circumstances, individuals do not continuously generate text or voice their
+opinions. For instance, consumers pen product reviews after weighing the merits
+and demerits of a product, and professional analysts issue reports following
+significant news releases. In essence, opinion expression is typically prompted
+by particular reasons or signals. Despite long-standing developments in opinion
+mining, the appropriate timing for expressing an opinion remains largely
+unexplored. To address this deficit, our study introduces an innovative task -
+the identification of news-triggered opinion expressing timing. We ground this
+task in the actions of professional stock analysts and develop a novel dataset
+for investigation. Our approach is decision-focused, leveraging text generation
+models to steer the classification model, thus enhancing overall performance.
+Our experimental findings demonstrate that the text generated by our model
+contributes fresh insights from various angles, effectively aiding in
+identifying the optimal timing for opinion expression.
+
+
+
+ comment: Accepted: COLING-2025
+
+
+
+
+
+
+ ♻ ☆ A Perspective for Adapting Generalist AI to Specialized Medical AI
+ Applications and Their Challenges
+
+
+
+
+
+
+
+
+ Zifeng Wang, Hanyin Wang, Benjamin Danek, Ying Li, Christina Mack, Hoifung Poon, Yajuan Wang, Pranav Rajpurkar, Jimeng Sun
+
+
+ The integration of Large Language Models (LLMs) into medical applications has
+sparked widespread interest across the healthcare industry, from drug discovery
+and development to clinical decision support, assisting telemedicine, medical
+devices, and healthcare insurance applications. This perspective paper aims to
+discuss the inner workings of building LLM-powered medical AI applications and
+introduces a comprehensive framework for their development. We review existing
+literature and outline the unique challenges of applying LLMs in specialized
+medical contexts. Additionally, we introduce a three-step framework to organize
+medical LLM research activities: 1) Modeling: breaking down complex medical
+workflows into manageable steps for developing medical-specific models; 2)
+Optimization: optimizing the model performance with crafted prompts and
+integrating external knowledge and tools; and 3) System engineering:
+decomposing complex tasks into subtasks and leveraging human expertise for
+building medical AI applications. Furthermore, we offer a detailed use case
+playbook that describes various LLM-powered medical AI applications, such as
+optimizing clinical trial design, enhancing clinical decision support, and
+advancing medical imaging analysis. Finally, we discuss various challenges and
+considerations for building medical AI applications with LLMs, such as handling
+hallucination issues, data ownership and compliance, privacy, intellectual
+property considerations, compute cost, sustainability issues, and responsible
+AI requirements.
+
+
+
+
+
+
+
+
+
+
+ Information Retrieval 9
+
+
+
+
+
+ ☆ The Impact of Generative AI on Student Churn and the Future of Formal
+ Education
+
+
+ In the contemporary educational landscape, the advent of Generative
+Artificial Intelligence (AI) presents unprecedented opportunities for
+personalised learning, fundamentally challenging the traditional paradigms of
+education. This research explores the emerging trend where high school
+students, empowered by tailored educational experiences provided by Generative
+AI, opt to forgo traditional university degrees to pursue entrepreneurial
+ventures at a younger age. To understand and predict the future of education in
+the age of Generative AI, we employ a comprehensive methodology to analyse
+social media data. Our approach includes sentiment analysis to gauge public
+opinion, topic modelling to identify key themes and emerging trends, and user
+demographic analysis to understand the engagement of different age groups and
+regions. We also perform influencer analysis to identify key figures shaping
+the discourse and engagement metrics to measure the level of interest and
+interaction with AI-related educational content. Content analysis helps us to
+determine the types of content being shared and the prevalent narratives, while
+hashtag analysis reveals the connectivity of discussions. The temporal analysis
+tracks changes over time and identifies event-based spikes in discussions. The
+insights derived from this analysis include the acceptance and adoption of
+Generative AI in education, its impact on traditional education models, the
+influence on students' entrepreneurial ambitions, and the educational outcomes
+associated with AI-driven personalised learning. Additionally, we explore
+public sentiment towards policies and regulations and use predictive modelling
+to forecast future trends. This comprehensive social media analysis provides a
+nuanced understanding of the evolving educational landscape, offering valuable
+perspectives on the role of Generative AI in shaping the future of education.
+
+
+
+
+
+
+
+ ☆ Contextual Bandits in Payment Processing: Non-uniform Exploration and
+ Supervised Learning at Adyen WWW '25
+
+
+ Uniform random exploration in decision-making systems supports off-policy
+learning via supervision but incurs high regret, making it impractical for many
+applications. Conversely, non-uniform exploration offers better immediate
+performance but lacks support for off-policy learning. Recent research suggests
+that regression oracles can bridge this gap by combining non-uniform
+exploration with supervised learning. In this paper, we analyze these
+approaches within a real-world industrial context at Adyen, a large global
+payments processor characterized by batch logged delayed feedback, short-term
+memory, and dynamic action spaces under the Empirical Risk Minimization (ERM)
+framework. Our analysis reveals that while regression oracles significantly
+improve performance, they introduce challenges due to rigid algorithmic
+assumptions. Specifically, we observe that as a policy improves, subsequent
+generations may perform worse due to shifts in the reward distribution and
+increased class imbalance in the training data. This degradation occurs
+despite improvements in other aspects of the training data, leading to decreased
+performance in successive policy iterations. We further explore the long-term
+impact of regression oracles, identifying a potential "oscillation effect."
+This effect arises when regression oracles influence probability estimates and
+the realizability of subsequent policy models, leading to fluctuations in
+performance across iterations. Our findings highlight the need for more
+adaptable algorithms that can leverage the benefits of regression oracles
+without introducing instability in policy performance over time.
+
+
+ Large language models (LLMs) have quickly emerged as practical and versatile
+tools that provide new solutions for a wide range of domains. In this paper, we
+consider the application of LLMs on symmetric tasks where a query is asked on
+an (unordered) bag of elements. Examples of such tasks include answering
+aggregate queries on a database table. In general, when the bag contains a
+large number of elements, LLMs tend to overlook some elements, leading to
+challenges in generating accurate responses to the query. LLMs receive their
+inputs as ordered sequences. However, in this problem, we leverage the fact
+that the symmetric input is not ordered, and reordering should not affect the
+LLM's response.
+ Observing that LLMs are less likely to miss elements at certain positions of
+the input, we introduce the problem of LLM input reranking: to find a ranking
+of the input that maximizes the LLM's accuracy for the given query without
+making explicit assumptions about the query. Finding the optimal ranking
+requires identifying (i) the relevance of each input element for answering the
+query and (ii) the importance of each rank position for the LLM's attention. We
+develop algorithms for estimating these values efficiently utilizing a helper
+LLM. We conduct comprehensive experiments on different synthetic and real
+datasets to validate our proposal and to evaluate the effectiveness of our
+proposed algorithms. Our experiments confirm that our reranking approach
+improves the accuracy of the LLMs on symmetric tasks by up to 99% proximity
+to the optimum upper bound.
+
+
+
+
+
+
+
+ ☆ CDEMapper: Enhancing NIH Common Data Element Normalization using Large
+ Language Models
+
+
+
+
+
+
+
+
+ Yan Wang, Jimin Huang, Huan He, Vincent Zhang, Yujia Zhou, Xubing Hao, Pritham Ram, Lingfei Qian, Qianqian Xie, Ruey-Ling Weng, Fongci Lin, Yan Hu, Licong Cui, Xiaoqian Jiang, Hua Xu, Na Hong
+
+
+ Common Data Elements (CDEs) standardize data collection and sharing across
+studies, enhancing data interoperability and improving research
+reproducibility. However, implementing CDEs presents challenges due to the
+broad range and variety of data elements. This study aims to develop an
+effective and efficient mapping tool to bridge the gap between local data
+elements and National Institutes of Health (NIH) CDEs. We propose CDEMapper, a
+large language model (LLM) powered mapping tool designed to assist in mapping
+local data elements to NIH CDEs. CDEMapper has three core modules: (1) CDE
+indexing and embeddings. NIH CDEs were indexed and embedded to support semantic
+search; (2) CDE recommendations. The tool combines Elasticsearch (BM25
+similarity methods) with state of the art GPT services to recommend candidate
+CDEs and their permissible values; and (3) Human review. Users review and
+select the NIH CDEs and values that best match their data elements and value
+sets. We evaluate the tool recommendation accuracy against manually annotated
+mapping results. CDEMapper offers a publicly available, LLM-powered, and
+intuitive user interface that consolidates essential and advanced mapping
+services into a streamlined pipeline. It provides a step by step, quality
+assured mapping workflow designed with a user-centered approach. The evaluation
+results demonstrated that augmenting BM25 with GPT embeddings and a ranker
+consistently enhances CDEMapper mapping accuracy in three different mapping
+settings across four evaluation datasets. This work opens up the potential of
+using LLMs to assist with CDE recommendation and human curation when aligning
+local data elements with NIH CDEs. Additionally, this effort enhances clinical
+research data interoperability and helps researchers better understand the gaps
+between local data elements and NIH CDEs.
+
+
+
+ comment: 11 pages,4 figures
+
+
+
+
+
+
+ ☆ FairSort: Learning to Fair Rank for Personalized Recommendations in
+ Two-Sided Platforms
+
+
+ Traditional recommendation systems focus on maximizing user satisfaction by
+suggesting their favorite items. This user-centric approach may lead to unfair
+exposure distribution among the providers. On the contrary, a provider-centric
+design might become unfair to the users. Therefore, this paper proposes a
+re-ranking model, FairSort (code and datasets are available at
+https://github.com/13543024276/FairSort), to
+find a trade-off solution among user-side fairness, provider-side fairness, and
+personalized recommendations utility. Previous works habitually treat this
+issue as a knapsack problem, incorporating both-side fairness as constraints.
+ In this paper, we adopt a novel perspective, treating each recommendation
+list as a runway rather than a knapsack. In this perspective, each item on the
+runway gains a velocity and runs within a specific time, achieving re-ranking
+for both-side fairness. Meanwhile, we ensure the Minimum Utility Guarantee for
+personalized recommendations by designing a Binary Search approach. This can
+provide more reliable recommendations compared to the conventional greedy
+strategy based on the knapsack problem. We further broaden the applicability of
+FairSort, designing two versions for online and offline recommendation
+scenarios. Theoretical analysis and extensive experiments on real-world
+datasets indicate that FairSort can ensure more reliable personalized
+recommendations while considering fairness for both the provider and user.
+
+
+
+
+
+
+
+ ☆ Robust Table Integration in Data Lakes
+
+
+ In this paper, we investigate the challenge of integrating tables from data
+lakes, focusing on three core tasks: 1) pairwise integrability judgment, which
+determines whether a tuple pair in a table is integrable, accounting for any
+occurrences of semantic equivalence or typographical errors; 2) integrable set
+discovery, which aims to identify all integrable sets in a table based on
+pairwise integrability judgments established in the first task; 3) multi-tuple
+conflict resolution, which resolves conflicts among multiple tuples during
+integration. We train a binary classifier to address the task of pairwise
+integrability judgment. Given the scarcity of labeled data, we propose a
+self-supervised adversarial contrastive learning algorithm to perform
+classification, which incorporates data augmentation methods and adversarial
+examples to autonomously generate new training data. Upon the output of
+pairwise integrability judgment, each integrable set is considered as a
+community, a densely connected sub-graph where nodes and edges correspond to
+tuples in the table and their pairwise integrability, respectively. We proceed
+to investigate various community detection algorithms to address the integrable
+set discovery objective. Moving forward to tackle multi-tuple conflict
+resolution, we introduce a novel in-context learning methodology. This
+approach capitalizes on the knowledge embedded within pretrained large language
+models to effectively resolve conflicts that arise when integrating multiple
+tuples. Notably, our method minimizes the need for annotated data. Since no
+suitable test collections are available for our tasks, we develop our own
+benchmarks using two real-world dataset repositories: Real and Join. We conduct
+extensive experiments on these benchmarks to validate the robustness and
+applicability of our methodologies in the context of integrating tables within
+data lakes.
+
+
+
+
+
+
+
+
+ Fernando Diaz, Michael D. Ekstrand, Bhaskar Mitra
+
+
+ Although originally developed to evaluate sets of items, recall is often used
+to evaluate rankings of items, including those produced by recommender,
+retrieval, and other machine learning systems. The application of recall
+without a formal evaluative motivation has led to criticism of recall as a
+vague or inappropriate measure. In light of this debate, we reflect on the
+measurement of recall in rankings from a formal perspective. Our analysis is
+composed of three tenets: recall, robustness, and lexicographic evaluation.
+First, we formally define 'recall-orientation' as the sensitivity of a metric
+to a user interested in finding every relevant item. Second, we analyze
+recall-orientation from the perspective of robustness with respect to possible
+content consumers and providers, connecting recall to recent conversations
+about fair ranking. Finally, we extend this conceptual and theoretical
+treatment of recall by developing a practical preference-based evaluation
+method based on lexicographic comparison. Through extensive empirical analysis
+across three recommendation tasks and 17 information retrieval tasks, we
+establish that our new evaluation method, lexirecall, has convergent validity
+(i.e., it is correlated with existing recall metrics) and exhibits
+substantially higher sensitivity in terms of discriminative power and stability
+in the presence of missing labels. Our conceptual, theoretical, and empirical
+analysis substantially deepens our understanding of recall and motivates its
+adoption through connections to robustness and fairness.
+
+
+
+ comment: Under review
+
+
+
+
+
+
+ ♻ ☆ Scalable Cross-Entropy Loss for Sequential Recommendations with Large
+ Item Catalogs
+
+
+ Scalability issue plays a crucial role in productionizing modern recommender
+systems. Even lightweight architectures may suffer from high computational
+overload due to intermediate calculations, limiting their practicality in
+real-world applications. Specifically, applying full Cross-Entropy (CE) loss
+often yields state-of-the-art performance in terms of recommendations quality.
+Still, it suffers from excessive GPU memory utilization when dealing with large
+item catalogs. This paper introduces a novel Scalable Cross-Entropy (SCE) loss
+function in the sequential learning setup. It approximates the CE loss for
+datasets with large-size catalogs, enhancing both time efficiency and memory
+usage without compromising recommendations quality. Unlike traditional negative
+sampling methods, our approach utilizes a selective GPU-efficient computation
+strategy, focusing on the most informative elements of the catalog,
+particularly those most likely to be false positives. This is achieved by
+approximating the softmax distribution over a subset of the model outputs
+through the maximum inner product search. Experimental results on multiple
+datasets demonstrate the effectiveness of SCE in reducing peak memory usage by
+a factor of up to 100 compared to the alternatives, retaining or even exceeding
+their metrics values. The proposed approach also opens new perspectives for
+large-scale developments in different domains, such as large language models.
+
+
+
+ comment: 11 pages, fixed some typos
+
+
+
+
+
+
+ ♻ ☆ Unraveling Movie Genres through Cross-Attention Fusion of Bi-Modal
+ Synergy of Poster
+
+
+ Movie posters are not just decorative; they are meticulously designed to
+capture the essence of a movie, such as its genre, storyline, and tone/vibe.
+For decades, movie posters have graced cinema walls, billboards, and now our
+digital screens as a form of digital posters. Movie genre classification plays
+a pivotal role in film marketing, audience engagement, and recommendation
+systems. Previous explorations into movie genre classification have been mostly
+examined in plot summaries, subtitles, trailers and movie scenes. Movie posters
+provide a pre-release tantalizing glimpse into a film's key aspects, which can
+ignite public interest. In this paper, we present a framework that exploits
+movie posters from a visual and textual perspective to address the multilabel
+movie genre classification problem. First, we extract text from movie
+posters using OCR and retrieve the relevant embedding. Next, we introduce a
+cross-attention-based fusion module to allocate attention weights to visual and
+textual embedding. In validating our framework, we utilized 13882 posters
+sourced from the Internet Movie Database (IMDb). The outcomes of the
+experiments indicate that our model exhibited promising performance and
+outperformed even some prominent contemporary architectures.
+
+
+
+
+
+
+
+
+
+
+ Multimedia 2
+
+
+
+
+
+ ☆ Hybrid Local-Global Context Learning for Neural Video Compression
+
+
+ In neural video codecs, current state-of-the-art methods typically adopt
+multi-scale motion compensation to handle diverse motions. These methods
+estimate and compress either optical flow or deformable offsets to reduce
+inter-frame redundancy. However, flow-based methods often suffer from
+inaccurate motion estimation in complicated scenes. Deformable
+convolution-based methods are more robust but have a higher bit cost for motion
+coding. In this paper, we propose a hybrid context generation module, which
+combines the advantages of the above methods in an optimal way and achieves
+accurate compensation at a low bit cost. Specifically, considering the
+characteristics of features at different scales, we adopt flow-guided
+deformable compensation at the largest scale to produce accurate alignment in
+detailed regions. For smaller-scale features, we perform flow-based warping to
+save the bit cost for motion coding. Furthermore, we design a local-global
+context enhancement module to fully explore the local-global information of
+previous reconstructed signals. Experimental results demonstrate that our
+proposed Hybrid Local-Global Context learning (HLGC) method can significantly
+enhance the state-of-the-art methods on standard test datasets.
+
+
+
+ comment: Accepted to DCC 2024
+
+
+
+
+
+
+ ♻ ☆ Unraveling Movie Genres through Cross-Attention Fusion of Bi-Modal
+ Synergy of Poster
+
+
+ Movie posters are not just decorative; they are meticulously designed to
+capture the essence of a movie, such as its genre, storyline, and tone/vibe.
+For decades, movie posters have graced cinema walls, billboards, and now our
+digital screens as a form of digital posters. Movie genre classification plays
+a pivotal role in film marketing, audience engagement, and recommendation
+systems. Previous explorations into movie genre classification have been mostly
+examined in plot summaries, subtitles, trailers and movie scenes. Movie posters
+provide a pre-release tantalizing glimpse into a film's key aspects, which can
+ignite public interest. In this paper, we present a framework that exploits
+movie posters from a visual and textual perspective to address the multilabel
+movie genre classification problem. First, we extract text from movie
+posters using OCR and retrieve the relevant embedding. Next, we introduce a
+cross-attention-based fusion module to allocate attention weights to visual and
+textual embedding. In validating our framework, we utilized 13882 posters
+sourced from the Internet Movie Database (IMDb). The outcomes of the
+experiments indicate that our model exhibited promising performance and
+outperformed even some prominent contemporary architectures.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Computation and Language 71
+
+
+
+
+
+ ☆ T2Vid: Translating Long Text into Multi-Image is the Catalyst for
+ Video-LLMs
+
+
+ The success of Multimodal Large Language Models (MLLMs) in the image domain
+has garnered wide attention from the research community. Drawing on previous
+successful experiences, researchers have recently explored extending the
+success to the video understanding realms. Apart from training from scratch, an
+efficient way is to utilize the pre-trained image-LLMs, leading to two
+mainstream approaches, i.e. zero-shot inference and further fine-tuning with
+video data. In this work, our study of these approaches harvests an effective
+data augmentation method. We first make a deeper inspection of the zero-shot
+inference way and identify two limitations, i.e. limited generalization and
+lack of temporal understanding capabilities. Thus, we further investigate the
+fine-tuning approach and find a low learning efficiency when simply using all
+the video data samples, which can be attributed to a lack of instruction
+diversity. Aiming at this issue, we develop a method called T2Vid to synthesize
+video-like samples to enrich the instruction diversity in the training corpus.
+Integrating these data enables a simple and efficient training scheme, which
+achieves performance comparable to or even superior to using full video
+datasets by training with just 15% the sample size. Meanwhile, we find that the
+proposed scheme can boost the performance of long video understanding without
+training with long video samples. We hope our study will spark more thinking
+about using MLLMs for video understanding and curation of high-quality data.
+The code is released at https://github.com/xjtupanda/T2Vid.
+
+
+ Large Language Models (LLMs) have exhibited remarkable performance on
+reasoning tasks. They utilize autoregressive token generation to construct
+reasoning trajectories, enabling the development of a coherent chain of
+thought. In this work, we explore the impact of individual tokens on the final
+outcomes of reasoning tasks. We identify the existence of "critical tokens"
+that lead to incorrect reasoning trajectories in LLMs. Specifically, we find
+that LLMs tend to produce positive outcomes when forced to decode other tokens
+instead of critical tokens. Motivated by this observation, we propose a novel
+approach - cDPO - designed to automatically recognize and conduct token-level
+rewards for the critical tokens during the alignment process. Specifically, we
+develop a contrastive estimation approach to automatically identify critical
+tokens. It is achieved by comparing the generation likelihood of positive and
+negative models. To achieve this, we separately fine-tune the positive and
+negative models on various reasoning trajectories; consequently, they are
+capable of identifying critical tokens within incorrect trajectories
+that contribute to erroneous outcomes. Moreover, to further align the model
+with the critical token information during the alignment process, we extend the
+conventional DPO algorithms to token-level DPO and utilize the differential
+likelihood from the aforementioned positive and negative model as important
+weight for token-level DPO learning. Experimental results on GSM8K and MATH500
+benchmarks with two widely used models, Llama-3 (8B and 70B) and deepseek-math
+(7B), demonstrate the effectiveness of the proposed approach cDPO.
+
+
+
+ comment: Work in progress
+
+
+
+
+
+
+ ☆ Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA
+ Benchmark
+
+
+
+
+
+
+
+
+ Joseph Heyward, João Carreira, Dima Damen, Andrew Zisserman, Viorica Pătrăucean
+
+
+ Following the successful 2023 edition, we organised the Second Perception
+Test challenge as a half-day workshop alongside the IEEE/CVF European
+Conference on Computer Vision (ECCV) 2024, with the goal of benchmarking
+state-of-the-art video models and measuring the progress since last year using
+the Perception Test benchmark. This year, the challenge had seven tracks (up
+from six last year) and covered low-level and high-level tasks, with language
+and non-language interfaces, across video, audio, and text modalities; the
+additional track covered hour-long video understanding and introduced a novel
+video QA benchmark 1h-walk VQA. Overall, the tasks in the different tracks
+were: object tracking, point tracking, temporal action localisation, temporal
+sound localisation, multiple-choice video question-answering, grounded video
+question-answering, and hour-long video question-answering. We summarise in
+this report the challenge tasks and results, and introduce in detail the novel
+hour-long video QA benchmark 1h-walk VQA.
+
+
+
+ comment: arXiv admin note: substantial text overlap with arXiv:2312.13090
+
+ Safety concerns of Multimodal large language models (MLLMs) have gradually
+become an important problem in various applications. Surprisingly, previous
+works indicate a counter-intuitive phenomenon that using textual unlearning to
+align MLLMs achieves comparable safety performances with MLLMs trained with
+image-text pairs. To explain such a counter-intuitive phenomenon, we discover a
+visual safety information leakage (VSIL) problem in existing multimodal safety
+benchmarks, i.e., the potentially risky and sensitive content in the image has
+been revealed in the textual query. In this way, MLLMs can easily refuse these
+sensitive text-image queries according to textual queries. However, image-text
+pairs without VSIL are common in real-world scenarios and are overlooked by
+existing multimodal safety benchmarks. To this end, we construct a multimodal
+visual leakless safety benchmark (VLSBench) preventing visual safety leakage
+from image to textual query with 2.4k image-text pairs. Experimental results
+indicate that VLSBench poses a significant challenge to both open-source and
+closed-source MLLMs, including LLaVA, Qwen2-VL, Llama3.2-Vision, and GPT-4o.
+This study demonstrates that textual alignment is enough for multimodal safety
+scenarios with VSIL, while multimodal alignment is a more promising solution
+for multimodal safety scenarios without VSIL. Please see our code and data at:
+http://hxhcreate.github.io/VLSBench
+
+
+
+
+
+
+
+ ★ On Domain-Specific Post-Training for Multimodal Large Language Models
+
+
+ Recent years have witnessed the rapid development of general multimodal large
+language models (MLLMs). However, adapting general MLLMs to specific domains,
+such as scientific fields and industrial applications, remains less explored.
+This paper systematically investigates domain adaptation of MLLMs through
+post-training, focusing on data synthesis, training pipelines, and task
+evaluation. (1) Data Synthesis: Using open-source models, we develop a visual
+instruction synthesizer that effectively generates diverse visual instruction
+tasks from domain-specific image-caption pairs. Our synthetic tasks surpass
+those generated by manual rules, GPT-4, and GPT-4V in enhancing the
+domain-specific performance of MLLMs. (2) Training Pipeline: While the
+two-stage training--initially on image-caption pairs followed by visual
+instruction tasks--is commonly adopted for developing general MLLMs, we apply a
+single-stage training pipeline to enhance task diversity for domain-specific
+post-training. (3) Task Evaluation: We conduct experiments in two domains,
+biomedicine and food, by post-training MLLMs of different sources and scales
+(e.g., Qwen2-VL-2B, LLaVA-v1.6-8B, Llama-3.2-11B), and then evaluating MLLM
+performance on various domain-specific tasks. To support further research in
+MLLM domain adaptation, we will open-source our implementations.
+
+
+
+
+
+
+
+ ☆ SIMS: Simulating Human-Scene Interactions with Real World Script
+ Planning
+
+
+ Simulating long-term human-scene interaction is a challenging yet fascinating
+task. Previous works have not effectively addressed the generation of long-term
+human scene interactions with detailed narratives for physics-based animation.
+This paper introduces a novel framework for the planning and controlling of
+long-horizon physically plausible human-scene interaction. On the one hand, films
+and shows with stylish human locomotions or interactions with scenes are
+abundantly available on the internet, providing a rich source of data for
+script planning. On the other hand, Large Language Models (LLMs) can understand
+and generate logical storylines.
+ This motivates us to marry the two by using an LLM-based pipeline to extract
+scripts from videos, and then employ LLMs to imitate and create new scripts,
+capturing complex, time-series human behaviors and interactions with
+environments. By leveraging this, we utilize a dual-aware policy that achieves
+both language comprehension and scene understanding to guide character motions
+within contextual and spatial constraints. To facilitate training and
+evaluation, we contribute a comprehensive planning dataset containing diverse
+motion sequences extracted from real-world videos and expanded with large
+language models. We also collect and re-annotate motion clips from existing
+kinematic datasets to enable our policy to learn diverse skills. Extensive
+experiments demonstrate the effectiveness of our framework in versatile task
+execution and its generalization ability to various scenarios, showing
+remarkably enhanced performance compared with existing methods. Our code and
+data will be publicly available soon.
+
+
+
+
+
+
+
+ ☆ Classical and Quantum Algorithms for the Deterministic L-system
+ Inductive Inference Problem
+
+
+
+
+
+
+
+
+ Ali Lotfi, Ian McQuillan, Steven Rayan
+
+
+ L-systems can be made to model and create simulations of many biological
+processes, such as plant development. Finding an L-system for a given process
+is typically solved by hand, by experts, in a hugely time-consuming process. It
+would be significant if this could be done automatically from data, such as
+from sequences of images. In this paper, we are interested in inferring a
+particular type of L-system, deterministic context-free L-system (D0L-system)
+from a sequence of strings. We introduce the characteristic graph of a sequence
+of strings, which we then utilize to translate our problem (inferring
+D0L-system) in polynomial time into the maximum independent set problem (MIS)
+and the SAT problem. After that, we offer a classical exact algorithm and an
+approximate quantum algorithm for the problem.
+
+
+
+ comment: 16 pages, 1 figure
+
+
+
+
+
+
+ ☆ AIDetx: a compression-based method for identification of
+ machine-learning generated text
+
+
+
+
+
+
+
+
+ Leonardo Almeida, Pedro Rodrigues, Diogo Magalhães, Armando J. Pinho, Diogo Pratas
+
+
+ This paper introduces AIDetx, a novel method for detecting machine-generated
+text using data compression techniques. Traditional approaches, such as deep
+learning classifiers, often suffer from high computational costs and limited
+interpretability. To address these limitations, we propose a compression-based
+classification framework that leverages finite-context models (FCMs). AIDetx
+constructs distinct compression models for human-written and AI-generated text,
+classifying new inputs based on which model achieves a higher compression
+ratio. We evaluated AIDetx on two benchmark datasets, achieving F1 scores
+exceeding 97% and 99%, respectively, highlighting its high accuracy. Compared
+to current methods, such as large language models (LLMs), AIDetx offers a more
+interpretable and computationally efficient solution, significantly reducing
+both training time and hardware requirements (e.g., no GPUs needed). The full
+implementation is publicly available at https://github.com/AIDetx/AIDetx.
+
+
+
+
+
+
+
+
+ Justin Chih-Yao Chen, Zifeng Wang, Hamid Palangi, Rujun Han, Sayna Ebrahimi, Long Le, Vincent Perot, Swaroop Mishra, Mohit Bansal, Chen-Yu Lee, Tomas Pfister
+
+
+ Reverse thinking plays a crucial role in human reasoning. Humans can reason
+not only from a problem to a solution but also in reverse, i.e., start from the
+solution and reason towards the problem. This often enhances overall reasoning
+performance as it enables consistency checks between their forward and backward
+thinking. To enable Large Language Models (LLMs) to perform reverse thinking,
+we introduce Reverse-Enhanced Thinking (RevThink), a framework composed of data
+augmentation and learning objectives. In RevThink, we augment the dataset by
+collecting structured forward-backward reasoning from a teacher model,
+consisting of: (1) the original question, (2) forward reasoning, (3) backward
+question, and (4) backward reasoning. We then employ three objectives to train
+a smaller student model in a multi-task learning fashion: (a) generate forward
+reasoning from a question, (b) generate a backward question from a question,
+and (c) generate backward reasoning from the backward question. Experiments
+across 12 datasets covering commonsense, math, and logical reasoning show an
+average 13.53% improvement over the student model's zero-shot performance and a
+6.84% improvement over the strongest knowledge distillation baselines.
+Moreover, our method demonstrates sample efficiency -- using only 10% of the
+correct forward reasoning from the training data, it outperforms a standard
+fine-tuning method trained on 10x more forward reasoning. RevThink also
+exhibits strong generalization to out-of-distribution held-out datasets.
+
+
+
+ comment: 20 pages
+
+
+
+
+
+
+ ☆ What fifty-one years of Linguistics and Artificial Intelligence research
+ tell us about their correlation: A scientometric review
+
+
+ There is a strong correlation between linguistics and artificial intelligence
+(AI), best manifested by deep learning language models. This study provides a
+thorough scientometric analysis of this correlation, synthesizing the
+intellectual production during 51 years, from 1974 to 2024. It involves 5750
+Web of Science-indexed articles published in 2124 journals, which are written
+by 20835 authors belonging to 13773 research centers in 794 countries. Two
+powerful software tools, viz., CiteSpace and VOSviewer, were used to generate mapping
+visualizations of the intellectual landscape, trending issues and (re)emerging
+hotspots. The results indicate that in the 1980s and 1990s, linguistics and AI
+research was not robust, characterized by unstable publication over time. It
+has, however, witnessed a remarkable increase of publication since then,
+reaching 1478 articles in 2023 and 546 articles in the January-March timespan of
+2024, involving emerging issues and hotspots, addressing new horizons, new
+topics, and launching new applications and powerful deep learning language
+models including ChatGPT.
+
+
+
+ comment: 26 pages, 15 figures
+
+
+
+
+
+
+ ☆ Artificial intelligence contribution to translation industry: looking
+ back and forward
+
+
+ This study provides a comprehensive analysis of artificial intelligence (AI)
+contribution to translation industry (ACTI) research, synthesizing it over
+forty-one years from 1980-2024. 13220 articles were retrieved from three
+sources, namely WoS, Scopus, and Lens. We provided two types of analysis, viz.,
+scientometric and thematic, focusing on cluster, subject categories, keywords,
+burstness, centrality and research centers as for the former. For the latter,
+we thematically review 18 articles, selected purposefully from the articles
+involved, centering on purpose, approach, findings, and contribution to ACTI
+future directions. The findings reveal that in the past AI contribution to
+translation industry was not rigorous, resulting in rule-based machine
+translation and statistical machine translation whose output was not
+satisfactory. However, the more AI develops, the more machine translation
+develops, incorporating Neural Networking Algorithms and (Deep) Language
+Learning Models like ChatGPT whose translation output has developed
+considerably. However, much rigorous research is still needed to overcome
+several problems facing the translation industry, specifically concerning
+low-resource languages, multi-dialectal and free-word-order languages, and
+cultural and religious registers.
+
+
+
+ comment: 20 pages, 4 figures
+
+
+
+
+
+
+ ☆ Sensitive Content Classification in Social Media: A Holistic Resource
+ and Evaluation
+
+
+
+
+
+
+
+
+ Dimosthenis Antypas, Indira Sen, Carla Perez-Almendros, Jose Camacho-Collados, Francesco Barbieri
+
+
+ The detection of sensitive content in large datasets is crucial for ensuring
+that shared and analysed data is free from harmful material. However, current
+moderation tools, such as external APIs, suffer from limitations in
+customisation, accuracy across diverse sensitive categories, and privacy
+concerns. Additionally, existing datasets and open-source models focus
+predominantly on toxic language, leaving gaps in detecting other sensitive
+categories such as substance abuse or self-harm. In this paper, we put forward
+a unified dataset tailored for social media content moderation across six
+sensitive categories: conflictual language, profanity, sexually explicit
+material, drug-related content, self-harm, and spam. By collecting and
+annotating data with consistent retrieval strategies and guidelines, we address
+the shortcomings of previous focalised research. Our analysis demonstrates that
+fine-tuning large language models (LLMs) on this novel dataset yields
+significant improvements in detection performance compared to open
+off-the-shelf models such as LLaMA, and even proprietary OpenAI models, which
+underperform by 10-15% overall. This limitation is even more pronounced on
+popular moderation APIs, which cannot be easily tailored to specific sensitive
+content categories, among others.
+
+
+
+
+
+
+
+
+ Fangze Fu, Wei Ai, Fan Yang, Yuntao Shou, Tao Meng, Keqin Li
+
+
+ Multimodal Emotion Recognition in Conversations (MERC) aims to classify
+utterance emotions using textual, auditory, and visual modal features. Most
+existing MERC methods assume each utterance has complete modalities,
+overlooking the common issue of incomplete modalities in real-world scenarios.
+Recently, graph neural networks (GNNs) have achieved notable results in
+Incomplete Multimodal Emotion Recognition in Conversations (IMERC). However,
+traditional GNNs focus on binary relationships between nodes, limiting their
+ability to capture more complex, higher-order information. Moreover, repeated
+message passing can cause over-smoothing, reducing their capacity to preserve
+essential high-frequency details. To address these issues, we propose a
+Spectral Domain Reconstruction Graph Neural Network (SDR-GNN) for incomplete
+multimodal learning in conversational emotion recognition. SDR-GNN constructs
+an utterance semantic interaction graph using a sliding window based on both
+speaker and context relationships to model emotional dependencies. To capture
+higher-order and high-frequency information, SDR-GNN utilizes weighted
+relationship aggregation, ensuring consistent semantic feature extraction
+across utterances. Additionally, it performs multi-frequency aggregation in the
+spectral domain, enabling efficient recovery of incomplete modalities by
+extracting both high- and low-frequency information. Finally, multi-head
+attention is applied to fuse and optimize features for emotion recognition.
+Extensive experiments on various real-world datasets demonstrate that our
+approach is effective in incomplete multimodal learning and outperforms current
+state-of-the-art methods.
+
+
+
+ comment: 17 pages, 8 figures
+
+
+
+
+
+
+ ☆ INCLUDE: Evaluating Multilingual Language Understanding with Regional
+ Knowledge
+
+
+
+
+
+
+
+
+ Angelika Romanou, Negar Foroutan, Anna Sotnikova, Zeming Chen, Sree Harsha Nelaturu, Shivalika Singh, Rishabh Maheshwary, Micol Altomare, Mohamed A. Haggag, Snegha A, Alfonso Amayuelas, Azril Hafizi Amirudin, Viraat Aryabumi, Danylo Boiko, Michael Chang, Jenny Chim, Gal Cohen, Aditya Kumar Dalmia, Abraham Diress, Sharad Duwal, Daniil Dzenhaliou, Daniel Fernando Erazo Florez, Fabian Farestam, Joseph Marvin Imperial, Shayekh Bin Islam, Perttu Isotalo, Maral Jabbarishiviari, Börje F. Karlsson, Eldar Khalilov, Christopher Klamm, Fajri Koto, Dominik Krzemiński, Gabriel Adriano de Melo, Syrielle Montariol, Yiyang Nan, Joel Niklaus, Jekaterina Novikova, Johan Samir Obando Ceron, Debjit Paul, Esther Ploeger, Jebish Purbey, Swati Rajwal, Selvan Sunitha Ravi, Sara Rydell, Roshan Santhosh, Drishti Sharma, Marjana Prifti Skenduli, Arshia Soltani Moakhar, Bardia Soltani Moakhar, Ran Tamir, Ayush Kumar Tarun, Azmine Toushik Wasi, Thenuka Ovin Weerasinghe, Serhan Yilmaz, Mike Zhang, Imanol Schlag, Marzieh Fadaee, Sara Hooker, Antoine Bosselut
+
+
+ The performance differential of large language models (LLM) between languages
+hinders their effective deployment in many regions, inhibiting the potential
+economic and societal value of generative AI tools in many communities.
+However, the development of functional LLMs in many languages (i.e.,
+multilingual LLMs) is bottlenecked by the lack of high-quality evaluation
+resources in languages other than English. Moreover, current practices in
+multilingual benchmark construction often translate English resources, ignoring
+the regional and cultural knowledge of the environments in which multilingual
+systems would be used. In this work, we construct an evaluation suite of
+197,243 QA pairs from local exam sources to measure the capabilities of
+multilingual LLMs in a variety of regional contexts. Our novel resource,
+INCLUDE, is a comprehensive knowledge- and reasoning-centric benchmark across
+44 written languages that evaluates multilingual LLMs for performance in the
+actual language environments where they would be deployed.
+
+
+
+
+
+
+
+ ☆ MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks
+
+
+ Recently, human motion analysis has experienced great improvement due to
+inspiring generative models such as the denoising diffusion model and large
+language model. However, existing approaches mainly focus on generating
+motions with textual descriptions and overlook the reciprocal task. In this
+paper, we present MoTe, a unified multi-modal model that could handle
+diverse tasks by learning the marginal, conditional, and joint distributions of
+motion and text simultaneously. MoTe enables us to handle the paired
+text-motion generation, motion captioning, and text-driven motion generation by
+simply modifying the input context. Specifically, MoTe is composed of three
+components: Motion Encoder-Decoder (MED), Text Encoder-Decoder (TED), and
+Motion-Text Diffusion Model (MTDM). In particular, MED and TED are trained for
+extracting latent embeddings, and subsequently reconstructing the motion
+sequences and textual descriptions from the extracted embeddings, respectively.
+MTDM, on the other hand, performs an iterative denoising process on the input
+context to handle diverse tasks. Experimental results on the benchmark datasets
+demonstrate the superior performance of our proposed method on text-to-motion
+generation and competitive performance on motion captioning.
+
+
+
+ comment: Five figures, six tables
+
+
+
+
+
+
+ ☆ PerLA: Perceptive 3D Language Assistant
+
+
+
+
+
+
+
+
+ Guofeng Mei, Wei Lin, Luigi Riz, Yujiao Wu, Fabio Poiesi, Yiming Wang
+
+
+ Enabling Large Language Models (LLMs) to understand the 3D physical world is
+an emerging yet challenging research direction. Current strategies for
+processing point clouds typically downsample the scene or divide it into
+smaller parts for separate analysis. However, both approaches risk losing key
+local details or global contextual information. In this paper, we introduce
+PerLA, a 3D language assistant designed to be more perceptive to both details
+and context, making visual representations more informative for the LLM. PerLA
+captures high-resolution (local) details in parallel from different point cloud
+areas and integrates them with (global) context obtained from a
+lower-resolution whole point cloud. We present a novel algorithm that preserves
+point cloud locality through the Hilbert curve and effectively aggregates
+local-to-global information via cross-attention and a graph neural network.
+Lastly, we introduce a novel loss for local representation consensus to promote
+training stability. PerLA outperforms state-of-the-art 3D language assistants,
+with gains of up to +1.34 CIDEr on ScanQA for question answering, and +4.22 on
+ScanRefer and +3.88 on Nr3D for dense
+captioning. Project page: https://gfmei.github.io/PerLA/
+
+
+
+
+
+
+
+ ☆ LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware
+ Omni-Modal Perception of Long Videos
+
+
+ Despite impressive advancements in video understanding, most efforts remain
+limited to coarse-grained or visual-only video tasks. However, real-world
+videos encompass omni-modal information (vision, audio, and speech) with a
+series of events forming a cohesive storyline. The lack of multi-modal video
+data with fine-grained event annotations and the high cost of manual labeling
+are major obstacles to comprehensive omni-modality video perception. To address
+this gap, we propose an automatic pipeline consisting of high-quality
+multi-modal video filtering, semantically coherent omni-modal event boundary
+detection, and cross-modal correlation-aware event captioning. In this way, we
+present LongVALE, the first-ever Vision-Audio-Language Event understanding
+benchmark comprising 105K omni-modal events with precise temporal boundaries
+and detailed relation-aware captions within 8.4K high-quality long videos.
+Further, we build a baseline that leverages LongVALE to enable video large
+language models (LLMs) for omni-modality fine-grained temporal video
+understanding for the first time. Extensive experiments demonstrate the
+effectiveness and great potential of LongVALE in advancing comprehensive
+multi-modal video understanding.
+
+
+
+ comment: 18 pages, 15 figures
+
+
+
+
+
+
+ ☆ Noro: A Noise-Robust One-shot Voice Conversion System with Hidden
+ Speaker Representation Capabilities SP
+
+
+ One-shot voice conversion (VC) aims to alter the timbre of speech from a
+source speaker to match that of a target speaker using just a single reference
+speech from the target, while preserving the semantic content of the original
+source speech. Despite advancements in one-shot VC, its effectiveness decreases
+in real-world scenarios where reference speeches, often sourced from the
+internet, contain various disturbances like background noise. To address this
+issue, we introduce Noro, a Noise Robust One-shot VC system. Noro features
+innovative components tailored for VC using noisy reference speeches, including
+a dual-branch reference encoding module and a noise-agnostic contrastive
+speaker loss. Experimental results demonstrate that Noro outperforms our
+baseline system in both clean and noisy scenarios, highlighting its efficacy
+for real-world applications. Additionally, we investigate the hidden speaker
+representation capabilities of our baseline system by repurposing its reference
+encoder as a speaker encoder. The results show that it is competitive with
+several advanced self-supervised learning models for speaker representation
+under the SUPERB settings, highlighting the potential for advancing speaker
+representation learning through the one-shot VC task.
+
+
+
+ comment: Submitted to IEEE OJSP
+
+
+
+
+
+
+ ☆ A Deep Learning Approach to Language-independent Gender Prediction on
+ Twitter
+
+
+
+
+
+
+
+
+ Reyhaneh Hashempour, Barbara Plank, Aline Villavicencio, Renato Cordeiro de Amorim
+
+
+ This work presents a set of experiments conducted to predict the gender of
+Twitter users based on language-independent features extracted from the text of
+the users' tweets. The experiments were performed on a version of the TwiSty
+dataset comprising tweets written by users in six different languages:
+Portuguese, French, Dutch, English, German, and Italian. Logistic regression
+(LR), and feed-forward neural networks (FFNN) with back-propagation were used
+to build models in two different settings: Inter-Lingual (IL) and Cross-Lingual
+(CL). In the IL setting, the training and testing were performed on the same
+language whereas in the CL, Italian and German datasets were set aside and only
+used as test sets and the rest were combined to compose training and
+development sets. In the IL, the highest accuracy score belongs to LR whereas
+in the CL, FFNN with three hidden layers yields the highest score. The results
+show that neural network based models underperform traditional models when the
+size of the training set is small; however, they beat traditional models by a
+non-trivial margin when they are fed enough data. Finally, the
+feature analysis confirms that men and women have different writing styles
+independent of their language.
+
+
+
+
+
+
+
+ ☆ Towards Santali Linguistic Inclusion: Building the First
+ Santali-to-English Translation Model using mT5 Transformer and Data
+ Augmentation
+
+
+
+
+
+
+
+
+ Syed Mohammed Mostaque Billah, Ateya Ahmed Subarna, Sudipta Nandi Sarna, Ahmad Shawkat Wasit, Anika Fariha, Asif Sushmit, Arig Yousuf Sadeque
+
+
+ Around seven million individuals in India, Bangladesh, Bhutan, and Nepal
+speak Santali, positioning it as nearly the third most commonly used
+Austroasiatic language. Despite its prominence among the Austroasiatic language
+family's Munda subfamily, Santali lacks global recognition. Currently, no
+translation models exist for the Santali language. Our paper aims to bring
+Santali into the NLP spectrum. We examine the feasibility of building
+Santali translation models based on available Santali corpora. The paper
+addresses the low-resource problem and, with promising results,
+examines the possibility of creating a functional Santali machine translation
+model in a low-resource setup. Our study shows that the Santali-English parallel
+corpus performs better with pretrained transformers such as mT5 than with untrained
+transformers, showing that transfer learning can be a viable technique that
+works with the Santali language. Moreover, with mT5, the Santali-English
+parallel corpus performs better than the Santali-Bangla one, as mT5 was pretrained
+on far more English data than Bangla data. Lastly, our study shows that with
+data augmentation, our model performs better.
+
+
+
+
+
+
+
+
+ David Dukić, Marin Petričević, Sven Ćurković, Jan Šnajder
+
+
+ TakeLab Retriever is an AI-driven search engine designed to discover,
+collect, and semantically analyze news articles from Croatian news outlets. It
+offers a unique perspective on the history and current landscape of Croatian
+online news media, making it an essential tool for researchers seeking to
+uncover trends, patterns, and correlations that general-purpose search engines
+cannot provide. TakeLab Retriever utilizes cutting-edge natural language
+processing (NLP) methods, enabling users to sift through articles using named
+entities, phrases, and topics through the web application. This technical
+report is divided into two parts: the first explains how TakeLab Retriever is
+utilized, while the second provides a detailed account of its design. In the
+second part, we also address the software engineering challenges involved and
+propose solutions for developing a microservice-based semantic search engine
+capable of handling over ten million news articles published over the past two
+decades.
+
+
+
+
+
+
+
+ ☆ MIMDE: Exploring the Use of Synthetic vs Human Data for Evaluating
+ Multi-Insight Multi-Document Extraction Tasks
+
+
+
+
+
+
+
+
+ John Francis, Saba Esnaashari, Anton Poletaev, Sukankana Chakraborty, Youmna Hashem, Jonathan Bright
+
+
+ Large language models (LLMs) have demonstrated remarkable capabilities in
+text analysis tasks, yet their evaluation on complex, real-world applications
+remains challenging. We define a set of tasks, Multi-Insight Multi-Document
+Extraction (MIMDE) tasks, which involve extracting an optimal set of insights
+from a document corpus and mapping these insights back to their source
+documents. This task is fundamental to many practical applications, from
+analyzing survey responses to processing medical records, where identifying and
+tracing key insights across documents is crucial. We develop an evaluation
+framework for MIMDE and introduce a novel set of complementary human and
+synthetic datasets to examine the potential of synthetic data for LLM
+evaluation. After establishing optimal metrics for comparing extracted
+insights, we benchmark 20 state-of-the-art LLMs on both datasets. Our analysis
+reveals a strong correlation (0.71) between the ability of LLMs to extract
+insights on our two datasets, but synthetic data fails to capture the complexity
+of document-level analysis. These findings offer crucial guidance for the use
+of synthetic data in evaluating text analysis systems, highlighting both its
+potential and limitations.
+
+
+
+
+
+
+
+ ☆ ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with
+ Multi-dimensional and fine-grained information
+
+
+ During the development of large language models (LLMs), pre-training data
+play a critical role in shaping LLMs' capabilities. In recent years several
+large-scale and high-quality pre-training datasets have been released to
+accelerate the research of LLMs, including ChineseWebText1.0, C4, Pile,
+WanJuan, MAPCC and others. However, as LLMs continue to evolve, focus has
+increasingly shifted to domain-specific capabilities and safety concerns,
+making those previous coarse-grained texts insufficient for meeting training
+requirements. Furthermore, fine-grained information, such as quality, domain
+and toxicity, is becoming increasingly important in building powerful and
+reliable LLMs for various scenarios. To address these challenges, in this paper
+we propose a new tool-chain called MDFG-tool for constructing large-scale and
+high-quality Chinese datasets with multi-dimensional and fine-grained
+information. First, we employ manually crafted rules to discard explicit noisy
+texts from raw contents. Second, the quality evaluation model, domain
+classifier, and toxicity evaluation model are well-designed to assess the
+remaining cleaned data respectively. Finally, we integrate these three types of
+fine-grained information for each text. With this approach, we release
+ChineseWebText2.0, the largest high-quality, fine-grained Chinese text dataset,
+which comprises 3.8TB of text, each associated with a quality score, domain
+labels, a toxicity label and a toxicity score, enabling LLM researchers
+to select data based on various types of fine-grained information. The data,
+code and the tool-chain are available at
+https://github.com/CASIA-LM/ChineseWebText-2.0
+
+
+
+ comment: ChineseWebText2.0 dataset is available at
+ https://github.com/CASIA-LM/ChineseWebText-2.0
+
+
+
+
+
+
+ ☆ Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-OASIS
+
+
+ After the introduction of Large Language Models (LLMs), there have been
+substantial improvements in the performance of Natural Language Generation
+(NLG) tasks, including Text Summarization and Machine Translation. However,
+LLMs still produce outputs containing hallucinations, that is, content not
+grounded in factual information. Therefore, developing methods to assess the
+factuality of LLMs has become urgent.
+ Indeed, resources for factuality evaluation have recently emerged. Although
+challenging, these resources face one or more of the following limitations: (i)
+they are tailored to a specific task or domain; (ii) they are limited in size,
+thereby preventing the training of new factuality evaluators; (iii) they are
+designed for simpler verification tasks, such as claim verification.
+ To address these issues, we introduce LLM-Oasis, to the best of our knowledge
+the largest resource for training end-to-end factuality evaluators. LLM-Oasis
+is constructed by extracting claims from Wikipedia, falsifying a subset of
+these claims, and generating pairs of factual and unfactual texts. We then rely
+on human annotators to both validate the quality of our dataset and to create a
+gold standard test set for benchmarking factuality evaluation systems.
+ Our experiments demonstrate that LLM-Oasis presents a significant challenge
+for state-of-the-art LLMs, with GPT-4o achieving up to 60% accuracy in our
+proposed end-to-end factuality evaluation task, highlighting its potential to
+drive future research in the field.
+
+
+
+ comment: 15 pages. To be submitted to CL journal
+
+
+
+
+
+
+ ☆ CogACT: A Foundational Vision-Language-Action Model for Synergizing
+ Cognition and Action in Robotic Manipulation
+
+
+ The advancement of large Vision-Language-Action (VLA) models has
+significantly improved robotic manipulation in terms of language-guided task
+execution and generalization to unseen scenarios. While existing VLAs adapted
+from pretrained large Vision-Language-Models (VLM) have demonstrated promising
+generalizability, their task performance is still unsatisfactory as indicated
+by the low task success rates in different environments. In this paper, we
+present a new advanced VLA architecture derived from VLM. Unlike previous works
+that directly repurpose VLM for action prediction by simple action
+quantization, we propose a componentized VLA architecture that has a specialized
+action module conditioned on VLM output. We systematically study the design of
+the action module and demonstrate the strong performance enhancement with
+diffusion action transformers for action sequence modeling, as well as their
+favorable scaling behaviors. We also conduct comprehensive experiments and
+ablation studies to evaluate the efficacy of our models with varied designs.
+Evaluation on 5 robot embodiments in simulation and the real world shows that
+our model not only significantly surpasses existing VLAs in task performance
+but also exhibits remarkable adaptation to new robots and generalization to
+unseen objects and backgrounds. It exceeds the average success rate of OpenVLA,
+which has a model size (7B) similar to ours, by over 35% in simulated evaluation
+and 55% in real robot experiments. It also outperforms the large RT-2-X model
+(55B) by 18% in absolute success rate in simulation. Code and models can be found
+on our project page (https://cogact.github.io/).
+
+
+
+
+
+
+
+ ☆ LLM Teacher-Student Framework for Text Classification With No Manually
+ Annotated Data: A Case Study in IPTC News Topic Classification
+
+
+ With the ever-increasing number of news stories available online, classifying
+them by topic, regardless of the language they are written in, has become
+crucial for enhancing readers' access to relevant content. To address this
+challenge, we propose a teacher-student framework based on large language
+models (LLMs) for developing multilingual news classification models of
+reasonable size with no need for manual data annotation. The framework employs
+a Generative Pretrained Transformer (GPT) model as the teacher model to develop
+an IPTC Media Topic training dataset through automatic annotation of news
+articles in Slovenian, Croatian, Greek, and Catalan. The teacher model exhibits
+a high zero-shot performance on all four languages. Its agreement with human
+annotators is comparable to that between the human annotators themselves. To
+mitigate the computational limitations associated with the requirement of
+processing millions of texts daily, smaller BERT-like student models are
+fine-tuned on the GPT-annotated dataset. These student models achieve high
+performance comparable to the teacher model. Furthermore, we explore the impact
+of the training data size on the performance of the student models and
+investigate their monolingual, multilingual and zero-shot cross-lingual
+capabilities. The findings indicate that student models can achieve high
+performance with a relatively small number of training instances, and
+demonstrate strong zero-shot cross-lingual abilities. Finally, we publish the
+best-performing news topic classifier, enabling multilingual classification
+with the top-level categories of the IPTC Media Topic schema.
+
+
+
+ comment: This work has been submitted to the IEEE for possible publication
+
+
+
+
+
+
+ ☆ Accelerating Multimodal Large Language Models via Dynamic Visual-Token
+ Exit and the Empirical Findings
+
+
+
+
+
+
+
+
+ Qiong Wu, Wenhao Lin, Weihao Ye, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji
+
+
+ The excessive use of visual tokens in existing Multimodal Large Language
+Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively
+expensive computation. To gain insights into this problem, we first conduct
+extensive empirical studies on the attention behaviors of MLLMs, and summarize
+three main inference stages in MLLMs: (i) Early fusion between tokens is first
+accomplished quickly. (ii) Intra-modality modeling then comes into play. (iii)
+Multimodal reasoning resumes and lasts until the end of inference. In
+particular, we reveal that visual tokens will stop contributing to reasoning
+when the text tokens receive enough image information, yielding obvious visual
+redundancy. Based on these generalized observations, we propose a simple yet
+effective method to improve the efficiency of MLLMs, termed dynamic
+visual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive
+the text token status and decide the removal of all visual tokens after a
+certain layer, thereby addressing the observed visual redundancy. To validate
+DyVTE, we apply it to a set of MLLMs, including LLaVA, VILA, Eagle and InternVL,
+and conduct extensive experiments on a range of benchmarks. The experimental
+results not only show the effectiveness of DyVTE in improving MLLMs'
+efficiency, but also reveal the general modeling patterns of MLLMs,
+facilitating the in-depth understanding of MLLMs. Our code is anonymously
+released at https://github.com/DoubtedSteam/DyVTE.
+
+
+
+
+
+
+
+ ☆ Can Large Language Models Reason about the Region Connection Calculus?
+
+
+
+
+
+
+
+
+ Anthony G Cohn, Robert E Blackwell
+
+
+ Qualitative Spatial Reasoning is a well explored area of Knowledge
+Representation and Reasoning and has multiple applications ranging from
+Geographical Information Systems to Robotics and Computer Vision. Recently,
+many claims have been made for the reasoning capabilities of Large Language
+Models (LLMs). Here, we investigate the extent to which a set of representative
+LLMs can perform classical qualitative spatial reasoning tasks on the
+mereotopological Region Connection Calculus, RCC-8. We conduct three pairs of
+experiments (reconstruction of composition tables, alignment to human
+composition preferences, conceptual neighbourhood reconstruction) using
+state-of-the-art LLMs; in each pair one experiment uses eponymous relations and
+one, anonymous relations (to test the extent to which the LLM relies on
+knowledge about the relation names obtained during training). All instances are
+repeated 30 times to measure the stochasticity of the LLMs.
+
+
+
+ comment: 13 pages. arXiv admin note: text overlap with arXiv:2309.15577
+
+ In-context learning refers to the emerging ability of large language models
+(LLMs) to perform a target task without additional training, utilizing
+demonstrations of the task. Recent studies aim to enhance in-context learning
+performance by selecting more useful demonstrations. However, they overlook the
+presence of inevitable noisy labels in task demonstrations that arise during
+the labeling process in the real world. In this paper, we propose a new task,
+in-context learning with noisy labels, which aims to solve real-world problems
+for in-context learning where labels in task demonstrations may be corrupted.
+Moreover, we propose a new method and baseline methods for the new task,
+inspired by studies in learning with noisy labels. Through experiments, we
+demonstrate that our proposed method can serve as a safeguard against
+performance degradation in in-context learning caused by noisy labels.
+
+
+ A lot of claims are made in social media posts, which may contain
+misinformation or fake news. Hence, it is crucial to identify claims as a first
+step towards claim verification. Given the huge number of social media posts,
+the task of identifying claims needs to be automated. This competition deals
+with the task of 'Claim Span Identification' in which, given a text, parts /
+spans that correspond to claims are to be identified. This task is more
+challenging than the traditional binary classification of text into claim or
+not-claim, and requires state-of-the-art methods in Pattern Recognition,
+Natural Language Processing and Machine Learning. For this competition, we used
+a newly developed dataset called HECSI containing about 8K posts in English and
+about 8K posts in Hindi with claim-spans marked by human annotators. This paper
+gives an overview of the competition, and the solutions developed by the
+participating teams.
+
+
+ Current large language models are mainly based on decoder-only
+transformers, which have strong in-context learning (ICL) capabilities. It is
+generally believed that an important foundation of this ICL capability is the
+induction heads mechanism, which requires at least two attention layers. To
+implement the model's induction ability more efficiently, we
+revisit the induction heads mechanism and propose KV shifting attention. We
+theoretically prove that KV shifting attention reduces the model's
+requirements for the depth and width of the induction heads mechanism. Our
+experimental results demonstrate that KV shifting attention is beneficial to
+learning induction heads and language modeling, leading to better
+performance or faster convergence from toy models to pre-training models
+with more than 10B parameters.
+
+
+
+ comment: 22 pages
+
+
+
+
+
+
+ ☆ Ensemble Watermarks for Large Language Models
+
+
+ The rapid advancement of large language models (LLMs) has made it
+increasingly difficult to distinguish between text written by humans and
+machines. While watermarks already exist for LLMs, they often lack flexibility,
+and struggle with attacks such as paraphrasing. To address these issues, we
+propose a multi-feature method for generating watermarks that combines multiple
+distinct watermark features into an ensemble watermark. Concretely, we combine
+acrostica and sensorimotor norms with the established red-green watermark to
+achieve a 98% detection rate. After a paraphrasing attack the performance
+remains high with 95% detection rate. The red-green feature alone as baseline
+achieves a detection rate of 49%. The evaluation of all feature combinations
+reveals that the ensemble of all three consistently has the highest detection
+rate across several LLMs and watermark strength settings. Due to the
+flexibility of combining features in the ensemble, various requirements and
+trade-offs can be addressed. Additionally, for all ensemble configurations the
+same detection function can be used without adaptations. This method is
+particularly of interest to facilitate accountability and prevent societal
+harm.
+
+
+
+ comment: 9 pages in the main body. Code is available at
+ http://github.com/CommodoreEU/master-generation. arXiv admin note:
+ substantial text overlap with arXiv:2405.08400
+
+
+
+
+
+
+ ☆ Initialization using Update Approximation is a Silver Bullet for
+ Extremely Efficient Low-Rank Fine-Tuning
+
+
+
+
+
+
+
+
+ Kaustubh Ponkshe, Raghav Singhal, Eduard Gorbunov, Alexey Tumanov, Samuel Horvath, Praneeth Vepakomma
+
+
+ Low-rank adapters have become a standard approach for efficiently fine-tuning
+large language models (LLMs), but they often fall short of achieving the
+performance of full fine-tuning. We propose a method, LoRA Silver Bullet or
+LoRA-SB, that approximates full fine-tuning within low-rank subspaces using a
+carefully designed initialization strategy. We theoretically demonstrate that
+the architecture of LoRA-XS, which inserts a trainable (r x r) matrix between B
+and A while keeping other matrices fixed, provides the precise conditions
+needed for this approximation. We leverage its constrained update space to
+achieve optimal scaling for high-rank gradient updates while removing the need
+for hyperparameter tuning. We prove that our initialization offers an optimal
+low-rank approximation of the initial gradient and preserves update directions
+throughout training. Extensive experiments across mathematical reasoning,
+commonsense reasoning, and language understanding tasks demonstrate that our
+approach exceeds the performance of standard LoRA while using 27-90x fewer
+parameters, and comprehensively outperforms LoRA-XS. Our findings establish
+that it is possible to simulate full fine-tuning in low-rank subspaces, and
+achieve significant efficiency gains without sacrificing performance. Our code
+is publicly available at https://github.com/RaghavSinghal10/lora-sb.
+
+
+
+ comment: Kaustubh Ponkshe and Raghav Singhal contributed equally to this work
+
+
+
+
+
+
+ ☆ Training Agents with Weakly Supervised Feedback from Large Language
+ Models
+
+
+
+
+
+
+
+
+ Dihong Gong, Pu Lu, Zelong Wang, Meng Zhou, Xiuqiang He
+
+
+ Large Language Models (LLMs) offer a promising basis for creating agents that
+can tackle complex tasks through iterative environmental interaction. Existing
+methods either require these agents to mimic expert-provided trajectories or
+rely on definitive environmental feedback for reinforcement learning which
+limits their application to specific scenarios like gaming or code generation.
+This paper introduces a novel training method for LLM-based agents using weakly
+supervised signals from a critic LLM, bypassing the need for expert
+trajectories or definitive feedback. Our agents are trained in an iterative
+manner, where they initially generate trajectories through environmental
+interaction. Subsequently, a critic LLM selects a subset of good trajectories,
+which are then used to update the agents, enabling them to generate improved
+trajectories in the next iteration. Extensive tests on the API-bank dataset
+show consistent improvement in our agents' capabilities and comparable
+performance to GPT-4, despite using open-source models with far fewer
+parameters.
+
+
+
+
+
+
+
+ ☆ Knowledge Management for Automobile Failure Analysis Using Graph RAG
+
+
+ This paper presents a knowledge management system for automobile failure
+analysis using retrieval-augmented generation (RAG) with large language models
+(LLMs) and knowledge graphs (KGs). In the automotive industry, there is a
+growing demand for knowledge transfer of failure analysis from experienced
+engineers to young engineers. However, failure events occur as chain
+reactions, making them difficult for beginners to analyze. Knowledge
+graphs, which can describe semantic relationships and structure
+information, are effective in representing failure events because they can
+represent the relationships between components. Yet KGs contain so much
+information that it is challenging for young engineers to extract and
+understand sub-graphs from them. On the other hand, there is increasing
+interest in the use of Graph RAG, a type of RAG that combines LLMs and KGs for
+knowledge management. However, when using the current Graph RAG framework with
+an existing knowledge graph for automobile failures, several issues arise
+because it is difficult to generate executable queries for a knowledge graph
+database which is not constructed by LLMs. To address this, we focused on
+optimizing the Graph RAG pipeline for existing knowledge graphs. Using an
+original Q&A dataset, the ROUGE F1 score of the sentences generated by the
+proposed method showed an average improvement of 157.6% compared to the current
+method. This highlights the effectiveness of the proposed method for automobile
+failure analysis.
+
+
+
+ comment: 7 pages, 6 figures, to be published in 2024 IEEE International
+ Conference on Big Data (BigData)
+
+
+
+
+
+
+ ☆ TQA-Bench: Evaluating LLMs for Multi-Table Question Answering with
+ Scalable Context and Symbolic Extension
+
+
+
+
+
+
+
+
+ Zipeng Qiu, You Peng, Guangxin He, Binhang Yuan, Chen Wang
+
+
+ The advent of large language models (LLMs) has unlocked great opportunities
+in complex data management tasks, particularly in question answering (QA) over
+complicated multi-table relational data. Despite significant progress,
+systematically evaluating LLMs on multi-table QA remains a critical challenge
+due to the inherent complexity of analyzing heterogeneous table structures and
+the potentially large scale of serialized relational data. Existing benchmarks
+primarily focus on single-table QA, failing to capture the intricacies of
+reasoning across multiple relational tables, as required in real-world domains
+such as finance, healthcare, and e-commerce. To address this gap, we present
+TQA-Bench, a new multi-table QA benchmark designed to evaluate the capabilities
+of LLMs in tackling complex QA tasks over relational data. Our benchmark
+incorporates diverse relational database instances sourced from real-world
+public datasets and introduces a flexible sampling mechanism to create tasks
+with varying multi-table context lengths, ranging from 8K to 64K tokens. To
+ensure robustness and reliability, we integrate symbolic extensions into the
+evaluation framework, enabling the assessment of LLM reasoning capabilities
+beyond simple data retrieval or probabilistic pattern matching. We
+systematically evaluate a range of LLMs, both open-source and closed-source,
+spanning model scales from 7 billion to 70 billion parameters. Our extensive
+experiments reveal critical insights into the performance of LLMs in
+multi-table QA, highlighting both challenges and opportunities for advancing
+their application in complex, data-driven environments. Our benchmark
+implementation and results are available at
+https://github.com/Relaxed-System-Lab/TQA-Bench.
+
+
+ Large Language Models (LLMs) have shown state-of-the-art performance in a
+variety of tasks, including arithmetic and reasoning; however, to gauge the
+intellectual capabilities of LLMs, causal reasoning has become a reliable proxy
+for validating a general understanding of the mechanics and intricacies of the
+world similar to humans. Previous works in natural language processing (NLP)
+have either focused on open-ended causal reasoning via causal commonsense
+reasoning (CCR) or framed a symbolic representation-based question answering
+for theoretically backed-up analysis via a causal inference engine. The former
+adds an advantage of real-world grounding but lacks theoretically backed-up
+analysis/validation, whereas the latter is far from real-world grounding. In
+this work, we bridge this gap by proposing the COLD (Causal reasOning in cLosed
+Daily activities) framework, which is built upon human understanding of daily
+real-world activities to reason about the causal nature of events. We show that
+the proposed framework facilitates the creation of an enormous number of causal
+queries (~9 million) and comes close to the mini-Turing test, simulating causal reasoning
+to evaluate the understanding of a daily real-world task. We evaluate multiple
+LLMs on the created causal queries and find that causal reasoning is
+challenging even for activities trivial to humans. We further explore the
+causal reasoning abilities of LLMs using the backdoor criterion to determine
+the causal strength between events.
+
+
+
+ comment: Paper accepted at NeurIPS 2024; Total 37 Pages
+
+
+
+
+
+
+ ☆ A Simple and Provable Scaling Law for the Test-Time Compute of Large
+ Language Models
+
+
+ We propose a general two-stage algorithm that enjoys a provable scaling law
+for the test-time compute of large language models (LLMs). Given an input
+problem, the proposed algorithm first generates $N$ candidate solutions, and
+then chooses the best one via a multiple-round knockout tournament where each
+pair of candidates is compared $K$ times and only the winners move on to
+the next round. In a minimalistic implementation, both stages can be executed
+with a black-box LLM alone and nothing else (e.g., no external verifier or
+reward model), and a total of $N \times (K + 1)$ highly parallelizable LLM
+calls are needed for solving an input problem. Assuming that a generated
+candidate solution is correct with probability $p_{\text{gen}} > 0$ and a
+comparison between a pair of correct and incorrect solutions identifies the
+right winner with probability $p_{\text{comp}} > 0.5$ (i.e., better than a
+random guess), we prove theoretically that the failure probability of the
+proposed algorithm decays to zero exponentially with respect to $N$ and $K$:
+$$\mathbb{P}(\text{final output is incorrect}) \le (1 - p_{\text{gen}})^N +
+\lceil \log_2 N \rceil e^{-2 K (p_{\text{comp}} - 0.5)^2}.$$ Our empirical
+results with the challenging MMLU-Pro benchmark validate the technical
+assumptions, as well as the efficacy of the proposed algorithm and the gains
+from scaling up its test-time compute.
+
+
+
+ comment: Work in progress
+
+
+
+
+
+
+ ☆ Beyond Surface Structure: A Causal Assessment of LLMs' Comprehension
+ Ability
+
+
+
+
+
+
+
+
+ Yujin Han, Lei Xu, Sirui Chen, Difan Zou, Chaochao Lu
+
+
+ Large language models (LLMs) have shown remarkable capability in natural
+language tasks, yet debate persists on whether they truly comprehend deep
+structure (i.e., core semantics) or merely rely on surface structure (e.g.,
+presentation format). Prior studies observe that LLMs' performance declines
+when intervening on surface structure, arguing their success relies on surface
+structure recognition. However, surface structure sensitivity does not prevent
+deep structure comprehension. Rigorously evaluating LLMs' capability requires
+analyzing both, yet deep structure is often overlooked. To this end, we assess
+LLMs' comprehension ability using causal mediation analysis, aiming to fully
+discover the capability of using both deep and surface structures.
+Specifically, we formulate the comprehension of deep structure as direct causal
+effect (DCE) and that of surface structure as indirect causal effect (ICE),
+respectively. To address the non-estimability of original DCE and ICE --
+stemming from the infeasibility of isolating mutual influences of deep and
+surface structures, we develop the corresponding quantifiable surrogates,
+including approximated DCE (ADCE) and approximated ICE (AICE). We further apply
+the ADCE to evaluate a series of mainstream LLMs, showing that most of them
+exhibit deep structure comprehension ability, which grows along with the
+prediction accuracy. Comparing ADCE and AICE demonstrates closed-source LLMs
+rely more on deep structure, while open-source LLMs are more surface-sensitive,
+which decreases with model scale. Theoretically, ADCE is a bidirectional
+evaluation, which measures both the sufficiency and necessity of deep structure
+changes in causing output variations, thus offering a more comprehensive
+assessment than accuracy, a common evaluation in LLMs. Our work provides new
+insights into LLMs' deep structure comprehension and offers novel methods for
+LLMs evaluation.
+
+
+
+ comment: 28 pages, 14 figures, 10 tables
+
+
+
+
+
+
+ ☆ Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language
+ Models
+
+
+ Iterative retrieval refers to the process in which the model continuously
+queries the retriever during generation to enhance the relevance of the
+retrieved knowledge, thereby improving the performance of Retrieval-Augmented
+Generation (RAG). Existing work typically employs few-shot prompting or
+manually constructed rules to implement iterative retrieval. This introduces
+additional inference overhead and overlooks the remarkable reasoning
+capabilities of Large Language Models (LLMs). In this paper, we introduce
+Auto-RAG, an autonomous iterative retrieval model centered on the LLM's
+powerful decision-making capabilities. Auto-RAG engages in multi-turn dialogues
+with the retriever, systematically planning retrievals and refining queries to
+acquire valuable knowledge. This process continues until sufficient external
+information is gathered, at which point the results are presented to the user.
+To this end, we develop a method for autonomously synthesizing reasoning-based
+decision-making instructions in iterative retrieval and fine-tune the latest
+open-source LLMs. The experimental results indicate that Auto-RAG is capable of
+autonomous iterative interaction with the retriever, effectively leveraging the
+remarkable reasoning and decision-making abilities of LLMs, which lead to
+outstanding performance across six benchmarks. Further analysis reveals that
+Auto-RAG can autonomously adjust the number of iterations based on the
+difficulty of the questions and the utility of the retrieved knowledge, without
+requiring any human intervention. Moreover, Auto-RAG expresses the iterative
+retrieval process in natural language, enhancing interpretability while
+providing users with a more intuitive experience. Code is available at
+https://github.com/ictnlp/Auto-RAG.
+
+
+
+ comment: Code is available at https://github.com/ictnlp/Auto-RAG
+
+
+
+
+
+
+ ☆ Actions and Objects Pathways for Domain Adaptation in Video Question
+ Answering
+
+
+ In this paper, we introduce the Actions and Objects Pathways (AOPath) for
+out-of-domain generalization in video question answering tasks. AOPath
+leverages features from a large pretrained model to enhance generalizability
+without the need for explicit training on the unseen domains. Inspired by the human
+brain, AOPath dissociates the pretrained features into action and object
+features, and subsequently processes them through separate reasoning pathways.
+It utilizes a novel module which converts out-of-domain features into
+domain-agnostic features without introducing any trainable weights. We validate
+the proposed approach on the TVQA dataset, which is partitioned into multiple
+subsets based on genre to facilitate the assessment of generalizability. The
+proposed approach demonstrates 5% and 4% superior performance over conventional
+classifiers on out-of-domain and in-domain datasets, respectively. It also
+outperforms prior methods that involve training millions of parameters, whereas
+the proposed approach trains very few parameters.
+
+
+
+
+
+
+
+ ♻ ☆ Multi-label Sequential Sentence Classification via Large Language Model EMNLP 2024
+
+
+ Sequential sentence classification (SSC) in scientific publications is
+crucial for supporting downstream tasks such as fine-grained information
+retrieval and extractive summarization. However, current SSC methods are
+constrained by model size, sequence length, and single-label setting. To
+address these limitations, this paper proposes LLM-SSC, a large language model
+(LLM)-based framework for both single- and multi-label SSC tasks. Unlike
+previous approaches that employ small- or medium-sized language models, the
+proposed framework utilizes LLMs to generate SSC labels through designed
+prompts, which enhance task understanding by incorporating demonstrations and a
+query to describe the prediction target. We also present a multi-label
+contrastive learning loss with an auto-weighting scheme, enabling the multi-label
+classification task. To support our multi-label SSC analysis, we introduce and
+release a new dataset, biorc800, which mainly contains unstructured abstracts
+in the biomedical domain with manual annotations. Experiments demonstrate
+LLM-SSC's strong performance in SSC under both in-context learning and
+task-specific tuning settings. We release biorc800 and our code at:
+https://github.com/ScienceNLP-Lab/LLM-SSC.
+
+
+
+ comment: Accepted by EMNLP 2024 Findings
+
+
+
+
+
+
+ ♻ ☆ Recent Advances of Foundation Language Models-based Continual Learning:
+ A Survey
+
+
+ Recently, foundation language models (LMs) have marked significant
+achievements in the domains of natural language processing (NLP) and computer
+vision (CV). Unlike traditional neural network models, foundation LMs obtain a
+great ability for transfer learning by acquiring rich commonsense knowledge
+through pre-training on extensive unsupervised datasets with a vast number of
+parameters. However, they still cannot emulate human-like continuous learning
+due to catastrophic forgetting. Consequently, various continual learning
+(CL)-based methodologies have been developed to refine LMs, enabling them to
+adapt to new tasks without forgetting previous knowledge. However, a systematic
+taxonomy of existing approaches and a comparison of their performance are still
+lacking, which is the gap that our survey aims to fill. We delve into a
+comprehensive review, summarization, and classification of the existing
+literature on CL-based approaches applied to foundation language models, such
+as pre-trained language models (PLMs), large language models (LLMs) and
+vision-language models (VLMs). We divide these studies into offline CL and
+online CL, which consist of traditional methods, parameter-efficient-based
+methods, instruction tuning-based methods and continual pre-training methods.
+Offline CL encompasses domain-incremental learning, task-incremental learning,
+and class-incremental learning, while online CL is subdivided into hard task
+boundary and blurry task boundary settings. Additionally, we outline the
+typical datasets and metrics employed in CL research and provide a detailed
+analysis of the challenges and future work for LMs-based continual learning.
+
+
+ Pretrained large language models (LLMs) are increasingly utilized across a
+wide range of natural language processing (NLP) tasks due to their impressive
+capabilities as few-shot learners. Recent techniques, such as chain-of-thought
+(CoT) prompting, have significantly advanced multi-step reasoning by
+introducing step-by-step decomposition, achieving state-of-the-art results on
+complex reasoning benchmarks. However, these approaches often rely on static
+prompting templates that do not adapt to task complexity or errors during the
+reasoning process. In this work, we introduce Adaptive Prompting, a dynamic and
+iterative framework designed to enhance reasoning by incorporating real-time
+adjustments to prompt structures and validation mechanisms. Experimental results
+demonstrate that Adaptive Prompting significantly improves performance on
+diverse reasoning benchmarks, including arithmetic reasoning (GSM8K,
+MultiArith), logical reasoning and commonsense tasks, achieving substantial
+accuracy gains compared to static prompting baselines. By integrating guided
+prompts, intermediate validation, and self-corrective steps, our approach
+enables smaller models to achieve competitive performance with larger
+counterparts, such as GPT-4, while maintaining computational efficiency. The
+framework achieves this without requiring fine-tuning or task-specific training
+data, highlighting the untapped potential of iterative reasoning methods.
+
+
+
+ comment: Submitted to ICLR 2025. This is a preprint version. Future revisions
+ will include additional evaluations and refinements
+
+
+
+
+
+
+ ♻ ☆ A Survey on Multimodal Large Language Models
+
+
+ Recently, Multimodal Large Language Model (MLLM) represented by GPT-4V has
+been a new rising research hotspot, which uses powerful Large Language Models
+(LLMs) as a brain to perform multimodal tasks. The surprising emergent
+capabilities of MLLM, such as writing stories based on images and OCR-free math
+reasoning, are rare in traditional multimodal methods, suggesting a potential
+path to artificial general intelligence. To this end, both academia and
+industry have endeavored to develop MLLMs that can compete with or even
+outperform GPT-4V, pushing the limit of research at a surprising speed. In this
+paper, we aim to trace and summarize the recent progress of MLLMs. First of
+all, we present the basic formulation of MLLM and delineate its related
+concepts, including architecture, training strategy and data, as well as
+evaluation. Then, we introduce research topics about how MLLMs can be extended
+to support more granularity, modalities, languages, and scenarios. We continue
+with multimodal hallucination and extended techniques, including Multimodal ICL
+(M-ICL), Multimodal CoT (M-CoT), and LLM-Aided Visual Reasoning (LAVR). To
+conclude the paper, we discuss existing challenges and point out promising
+research directions. In light of the fact that the era of MLLM has only just
+begun, we will keep updating this survey and hope it can inspire more research.
+An associated GitHub link collecting the latest papers is available at
+https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.
+
+
+
+ comment: Accepted for publication in National Science Review. Project
+ page: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
+
+
+
+
+
+
+ ♻ ☆ Cherry on Top: Parameter Heterogeneity and Quantization in Large
+ Language Models
+
+
+ This paper reveals the phenomenon of parameter heterogeneity in large
+language models (LLMs). We find that a small subset of "cherry" parameters
+exhibit a disproportionately large influence on model performance, while the
+vast majority of parameters have minimal impact. This heterogeneity is found to
+be prevalent across different model families, scales, and types. Motivated by
+this observation, we propose CherryQ, a novel quantization method that unifies
+the optimization of mixed-precision parameters. CherryQ identifies and
+preserves the critical cherry parameters in high precision while aggressively
+quantizing the remaining parameters to low precision. Extensive experiments
+demonstrate the effectiveness of CherryQ. CherryQ outperforms existing
+quantization approaches in terms of perplexity and downstream task performance.
+Notably, our 3-bit quantized Vicuna-1.5 exhibits competitive performance
+compared to its 16-bit counterpart.
+
+
+
+
+
+
+
+ ♻ ☆ Exact Aggregation for Federated and Efficient Fine-Tuning of Foundation
+ Models NeurIPS 2024
+
+
+ Low-Rank Adaptation (LoRA) is a popular technique for efficient fine-tuning
+of foundation models. However, applying LoRA in federated learning
+environments, where data is distributed across multiple clients, presents
+unique challenges. Existing methods rely on traditional federated averaging of
+LoRA adapters, resulting in inexact updates. To address this, we propose
+Federated Exact LoRA, or FedExLoRA, which adds a residual error term to the
+pretrained frozen weight matrix. Our approach achieves exact updates with
+minimal computational and communication overhead, preserving LoRA's efficiency.
+We evaluate the method on various models across arithmetic reasoning,
+commonsense reasoning, natural language understanding and natural language
+generation tasks, showing consistent performance gains over state-of-the-art
+methods across multiple settings. Through extensive analysis, we quantify that
+the deviations in updates from the ideal solution are significant, highlighting
+the need for exact aggregation. Our method's simplicity, efficiency, and broad
+applicability position it as a promising solution for accurate and effective
+federated fine-tuning of foundation models. Our code is publicly available at
+https://github.com/RaghavSinghal10/fedex-lora.
+
+
+
+ comment: Raghav Singhal and Kaustubh Ponkshe contributed equally to this work.
+ Another version of the paper accepted at NeurIPS 2024 Workshop on Fine-Tuning
+ in Modern Machine Learning: Principles and Scalability
+
+
+
+
+
+
+ ♻ ☆ Evaluating the Data Model Robustness of Text-to-SQL Systems Based on
+ Real User Queries
+
+
+
+
+
+
+
+
+ Jonathan Fürst, Catherine Kosten, Farhad Nooralahzadeh, Yi Zhang, Kurt Stockinger
+
+
+ Text-to-SQL systems (also known as NL-to-SQL systems) have become an
+increasingly popular solution for bridging the gap between user capabilities
+and SQL-based data access. These systems translate user requests in natural
+language to valid SQL statements for a specific database. Recent Text-to-SQL
+systems have benefited from the rapid improvement of transformer-based language
+models. However, while Text-to-SQL systems that incorporate such models
+continuously reach new high scores on -- often synthetic -- benchmark datasets,
+a systematic exploration of their robustness towards different data models in a
+real-world, realistic scenario is notably missing. This paper provides the
+first in-depth evaluation of the data model robustness of Text-to-SQL systems
+in practice based on a multi-year international project focused on Text-to-SQL
+interfaces. Our evaluation is based on a real-world deployment of FootballDB, a
+system that was deployed over a 9-month period in the context of the FIFA World
+Cup 2022, during which about 6K natural language questions were asked and
+executed. All of our data is based on real user questions that were asked live
+to the system. We manually labeled and translated a subset of these questions
+for three different data models. For each data model, we explore the
+performance of representative Text-to-SQL systems and language models. We
+further quantify the impact of training data size, pre-, and post-processing
+steps as well as language model inference time. Our comprehensive evaluation
+sheds light on the design choices of real-world Text-to-SQL systems and their
+impact on moving from research prototypes to real deployments. Last, we provide
+a new benchmark dataset to the community, which is the first to enable the
+evaluation of different data models for the same dataset and is substantially
+more challenging than most previous datasets in terms of query complexity.
+
+
+
+
+
+
+
+ ♻ ☆ What Differentiates Educational Literature? A Multimodal Fusion Approach
+ of Transformers and Computational Linguistics
+
+
+ The integration of new literature into the English curriculum remains a
+challenge since educators often lack scalable tools to rapidly evaluate
+readability and adapt texts for diverse classroom needs. This study proposes to
+address this gap through a multimodal approach that combines transformer-based
+text classification with linguistic feature analysis to align texts with UK Key
+Stages. Eight state-of-the-art Transformers were fine-tuned on segmented text
+data, with BERT achieving the highest unimodal F1 score of 0.75. In parallel,
+500 deep neural network topologies were searched for the classification of
+linguistic characteristics, achieving an F1 score of 0.392. The fusion of these
+modalities shows a significant improvement, with every multimodal approach
+outperforming all unimodal models. In particular, the ELECTRA Transformer fused
+with the neural network achieved an F1 score of 0.996. Unimodal and multimodal
+approaches are shown to have statistically significant differences in all
+validation metrics (accuracy, precision, recall, F1 score) except for inference
+time. The proposed approach is finally encapsulated in a stakeholder-facing web
+application, providing non-technical stakeholder access to real-time insights
+on text complexity, reading difficulty, curriculum alignment, and
+recommendations for learning age range. The application empowers data-driven
+decision making and reduces manual workload by integrating AI-based
+recommendations into lesson planning for English literature.
+
+
+
+
+
+
+
+ ♻ ☆ OneBit: Towards Extremely Low-bit Large Language Models NeurIPS 2024
+
+
+ Model quantization uses low bit-width values to represent the weight
+matrices of existing models to be quantized, which is a promising approach to
+reduce both storage and computational overheads of deploying highly anticipated
+LLMs. However, current quantization methods suffer severe performance
+degradation when the bit-width is extremely reduced, and thus focus on
+utilizing 4-bit or 8-bit values to quantize models. This paper boldly quantizes
+the weight matrices of LLMs to 1-bit, paving the way for the extremely low
+bit-width deployment of LLMs. For this target, we introduce a 1-bit model
+compressing framework named OneBit, including a novel 1-bit parameter
+representation method to better quantize LLMs as well as an effective parameter
+initialization method based on matrix decomposition to improve the convergence
+speed of the quantization framework. Sufficient experimental results indicate
+that OneBit achieves good performance (at least 81% of the non-quantized
+performance on LLaMA models) with robust training processes when only using
+1-bit weight matrices.
+
+
+
+ comment: Accepted by NeurIPS 2024
+
+
+
+
+
+
+ ♻ ☆ Speech Translation with Speech Foundation Models and Large Language
+ Models: What is There and What is Missing? ACL 2024
+
+
+
+
+
+
+
+
+ Marco Gaido, Sara Papi, Matteo Negri, Luisa Bentivogli
+
+
+ The field of natural language processing (NLP) has recently witnessed a
+transformative shift with the emergence of foundation models, particularly
+Large Language Models (LLMs) that have revolutionized text-based NLP. This
+paradigm has extended to other modalities, including speech, where researchers
+are actively exploring the combination of Speech Foundation Models (SFMs) and
+LLMs into single, unified models capable of addressing multimodal tasks. Among
+such tasks, this paper focuses on speech-to-text translation (ST). By examining
+the published papers on the topic, we propose a unified view of the
+architectural solutions and training strategies presented so far, highlighting
+similarities and differences among them. Based on this examination, we not only
+organize the lessons learned but also show how diverse settings and evaluation
+approaches hinder the identification of the best-performing solution for each
+architectural building block and training choice. Lastly, we outline
+recommendations for future works on the topic aimed at better understanding the
+strengths and weaknesses of the SFM+LLM solutions for ST.
+
+
+
+ comment: Outstanding paper at the ACL 2024 main conference
+
+
+
+
+
+
+ ♻ ☆ Exploring syntactic information in sentence embeddings through
+ multilingual subject-verb agreement
+
+
+ In this paper, our goal is to investigate to what degree multilingual
+pretrained language models capture cross-linguistically valid abstract
+linguistic representations. We take the approach of developing curated
+synthetic data on a large scale, with specific properties, and using them to
+study sentence representations built using pretrained language models. We use a
+new multiple-choice task and datasets, Blackbird Language Matrices (BLMs), to
+focus on a specific grammatical structural phenomenon -- subject-verb agreement
+across a variety of sentence structures -- in several languages. Finding a
+solution to this task requires a system detecting complex linguistic patterns
+and paradigms in text representations. Using a two-level architecture that
+solves the problem in two steps -- detect syntactic objects and their
+properties in individual sentences, and find patterns across an input sequence
+of sentences -- we show that despite having been trained on multilingual texts
+in a consistent manner, multilingual pretrained language models have
+language-specific differences, and syntactic structure is not shared, even
+across closely related languages.
+
+
+
+ comment: 13 pages, 5 tables, 6 figures
+
+
+
+
+
+
+ ♻ ☆ Towards Evaluating Generalist Agents: An Automated Benchmark in Open
+ World
+
+
+ Evaluating generalist agents presents significant challenges due to their
+wide-ranging abilities and the limitations of current benchmarks in assessing
+true generalization. We introduce the Minecraft Universe (MCU), a fully
+automated benchmarking framework set within the open-world game Minecraft. MCU
+dynamically generates and evaluates a broad spectrum of tasks, offering three
+core components: 1) a task generation mechanism that provides high degrees of
+freedom and variability, 2) an ever-expanding set of over 3K composable atomic
+tasks, and 3) a general evaluation framework that supports open-ended task
+assessment. By integrating large language models (LLMs), MCU dynamically
+creates diverse environments for each evaluation, fostering agent
+generalization. The framework uses a vision-language model (VLM) to
+automatically generate evaluation criteria, achieving over 90% agreement with
+human ratings across multi-dimensional assessments, which demonstrates that MCU
+is a scalable and explainable solution for evaluating generalist agents.
+Additionally, we show that while state-of-the-art foundational models perform
+well on specific tasks, they often struggle with increased task diversity and
+difficulty.
+
+
+
+
+
+
+
+ ♻ ☆ Exploring Italian sentence embeddings properties through multi-tasking
+
+
+ We investigate to what degree existing LLMs encode abstract linguistic
+information in Italian in a multi-task setting. We exploit curated synthetic
+data on a large scale -- several Blackbird Language Matrices (BLMs) problems in
+Italian -- and use them to study how sentence representations built using
+pre-trained language models encode specific syntactic and semantic information.
+We use a two-level architecture to model separately a compression of the
+sentence embeddings into a representation that contains relevant information
+for a task, and a BLM task. We then investigate whether we can obtain
+compressed sentence representations that encode syntactic and semantic
+information relevant to several BLM tasks. While we expected that the sentence
+structure -- in terms of sequence of phrases/chunks -- and chunk properties
+could be shared across tasks, performance and error analysis show that the
+clues for the different tasks are encoded in different manners in the sentence
+embeddings, suggesting that abstract linguistic notions such as constituents or
+thematic roles do not seem to be present in the pretrained sentence
+embeddings.
+
+
+
+ comment: 11 pages, 6 figures, 4 tables
+
+
+
+
+
+
+ ♻ ☆ Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data
+ Assessment and Selection for Instruction Tuning of Language Models
+
+
+
+
+
+
+
+
+ Yulei Qin, Yuncheng Yang, Pengcheng Guo, Gang Li, Hang Shao, Yuchen Shi, Zihan Xu, Yun Gu, Ke Li, Xing Sun
+
+
+ Instruction tuning plays a critical role in aligning large language models
+(LLMs) with human preference. Despite the vast amount of open instruction
+datasets, naively training an LLM on all existing instructions may not be
+optimal and practical. To pinpoint the most beneficial datapoints, data
+assessment and selection methods have been proposed in the fields of natural
+language processing (NLP) and deep learning. However, under the context of
+instruction tuning, there still exists a gap in knowledge on what kind of data
+evaluation metrics can be employed and how they can be integrated into the
+selection mechanism. To bridge this gap, we present a comprehensive review on
+existing literature of data assessment and selection especially for instruction
+tuning of LLMs. We systematically categorize all applicable methods into
+quality-based, diversity-based, and importance-based ones where a unified,
+fine-grained taxonomy is structured. For each category, representative methods
+are elaborated to describe the landscape of relevant research. In addition,
+comparison between the latest methods is conducted on their officially reported
+results to provide in-depth discussions on their limitations. Finally, we
+summarize the open challenges and propose promising avenues for future
+studies. All related contents are available at
+https://github.com/yuleiqin/fantastic-data-engineering.
+
+
+ Multi-Head Mixture-of-Experts (MH-MoE) demonstrates superior performance by
+using the multi-head mechanism to collectively attend to information from
+various representation spaces within different experts. In this paper, we
+present a novel implementation of MH-MoE that maintains both FLOPs and
+parameter parity with sparse Mixture of Experts models. Experimental results on
+language models show that the new implementation yields quality improvements
+over both vanilla MoE and fine-grained MoE models. Additionally, our
+experiments demonstrate that MH-MoE is compatible with 1-bit Large Language
+Models (LLMs) such as BitNet.
+
+
+
+ comment: 7 pages, 0 figures
+
+
+
+
+
+
+ ♻ ☆ SAM Decoding: Speculative Decoding via Suffix Automaton
+
+
+ Large Language Models (LLMs) have revolutionized natural language processing
+by unifying tasks into text generation, yet their large parameter sizes and
+autoregressive nature limit inference speed. SAM-Decoding addresses this by
+introducing a novel retrieval-based speculative decoding method that uses a
+suffix automaton for efficient and accurate draft generation. Unlike n-gram
+matching used by the existing method, SAM-Decoding finds the longest suffix
+match in generating text and text corpus, achieving an average time complexity
+of $O(1)$ per generation step. SAM-Decoding constructs static and dynamic
+suffix automatons for the text corpus and input prompts, respectively, enabling
+fast and precise draft generation. Meanwhile, it is designed as an approach
+that can be combined with existing methods, allowing SAM-Decoding to adaptively
+select a draft generation strategy based on the matching length, thus
+increasing the inference speed of the LLM. When combined with Token Recycling,
+evaluations show SAM-Decoding outperforms existing model-free methods,
+achieving a speedup of $2.27\times$ over autoregressive decoding on Spec-Bench.
+When combined with EAGLE2, it reaches a speedup of $2.49\times$, surpassing all
+current approaches. Our code is available at
+https://github.com/hyx1999/SAM-Decoding.
+
+
+
+ comment: 17 pages, 5 figures
+
+
+
+
+
+
+ ♻ ☆ Exploiting ChatGPT for Diagnosing Autism-Associated Language Disorders
+ and Identifying Distinct Features
+
+
+ Diagnosing language disorders associated with autism is a complex challenge,
+often hampered by the subjective nature and variability of traditional
+assessment methods. Traditional diagnostic methods not only require intensive
+human effort but also often result in delayed interventions due to their lack
+of speed and precision. In this study, we explored the application of ChatGPT,
+a large language model, to overcome these obstacles by enhancing sensitivity
+and profiling linguistic features for autism diagnosis. This research utilizes
+ChatGPT's natural language processing capabilities to simplify and improve the
+diagnostic process, focusing on identifying autism-related language patterns.
+Specifically, we compared ChatGPT's performance with that of conventional
+supervised learning models, including BERT, a model acclaimed for its
+effectiveness in various natural language processing tasks. We showed that
+ChatGPT substantially outperformed these models, achieving over 10% improvement
+in both sensitivity and positive predictive value, in a zero-shot learning
+configuration. The findings underscore the model's potential as a diagnostic
+tool, combining accuracy and applicability. We identified ten key features of
+autism-associated language disorders across scenarios. Features such as
+echolalia, pronoun reversal, and atypical language usage play a critical role
+in diagnosing ASD and informing tailored treatment plans. Together, our
+findings advocate for adopting sophisticated AI tools like ChatGPT in clinical
+settings to assess and diagnose developmental disorders. Our approach promises
+enhanced diagnostic precision and supports personalized medicine, potentially
+transforming the evaluation landscape for autism and similar neurological
+conditions.
+
+
+
+
+
+
+
+ ♻ ☆ METEOR: Evolutionary Journey of Large Language Models from Guidance to
+ Self-Growth
+
+
+ Model evolution enables learning from feedback to refine experiences and
+update skills, transforming models from having no domain knowledge to becoming
+domain experts. However, there is currently no unified and effective method for
+guiding this evolutionary process. To address this gap, we propose the Meteor
+method, which includes three training phases: weak-to-strong data distillation,
+iterative training, and self-evolution strategies. Each phase maximizes the
+model's inherent domain capabilities, allowing it to autonomously refine its
+domain knowledge and enhance performance. Experiments demonstrate that our
+approach significantly improves accuracy, completeness, relevance, coherence,
+and reliability across domain-specific tasks.
+
+
+
+ comment: Our code can be found at https://github.com/DIRECT-BIT/METEOR
+
+
+
+
+
+
+ ♻ ☆ Dynamic Universal Approximation Theory: The Basic Theory for
+ Transformer-based Large Language Models
+
+
+ Language models have emerged as a critical area of focus in artificial
+intelligence, particularly with the introduction of groundbreaking innovations
+like ChatGPT. Large-scale Transformer networks have quickly become the leading
+approach for advancing natural language processing algorithms. Built on the
+Transformer architecture, these models enable interactions that closely mimic
+human communication and, equipped with extensive knowledge, can even assist in
+guiding human tasks. Despite their impressive capabilities and growing
+complexity, a key question remains: the theoretical foundations of large
+language models (LLMs). What makes the Transformer so effective for powering
+intelligent language applications, such as translation and coding? What
+underlies LLMs' ability for In-Context Learning (ICL)? How does the LoRA scheme
+enhance the fine-tuning of LLMs? And what supports the practicality of pruning
+LLMs? To address these critical questions and explore the technological
+strategies within LLMs, we leverage the Universal Approximation Theory (UAT) to
+offer a theoretical backdrop, shedding light on the mechanisms that underpin
+these advancements.
+
+
+
+
+
+
+
+ ♻ ☆ Prompt Framework for Role-playing: Generation and Evaluation
+
+
+ Large language models (LLMs) exhibit impressive proficiency in natural
+language generation, understanding user instructions, and emulating human-like
+language use, which has led to significant interest in their application to
+role-playing scenarios. However, the manual collection of role-specific script
+data and the evaluation of model performance are resource-intensive processes.
+This project introduces a prompt-based framework designed to leverage GPT's
+capabilities for the generation of role-playing dialogue datasets and the
+evaluation of role-playing performance. To validate the effectiveness of the
+GPT-based generation and evaluation, we further incorporate the recall-oriented
+ROUGE-L metric, providing an additional quantitative measure of performance.
+
+
+
+
+
+
+
+ ♻ ☆ IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning
+ Datasets for Indian Languages ACL-2024
+
+
+
+
+
+
+
+
+ Mohammed Safi Ur Rahman Khan, Priyam Mehta, Ananth Sankar, Umashankar Kumaravelan, Sumanth Doddapaneni, Suriyaprasaad B, Varun Balan G, Sparsh Jain, Anoop Kunchukuttan, Pratyush Kumar, Raj Dabre, Mitesh M. Khapra
+
+
+ Despite the considerable advancements in English LLMs, the progress in
+building comparable models for other languages has been hindered due to the
+scarcity of tailored resources. Our work aims to bridge this divide by
+introducing an expansive suite of resources specifically designed for the
+development of Indic LLMs, covering 22 languages, containing a total of 251B
+tokens and 74.8M instruction-response pairs. Recognizing the importance of both
+data quality and quantity, our approach combines highly curated manually
+verified data, unverified yet valuable data, and synthetic data. We build a
+clean, open-source pipeline for curating pre-training data from diverse
+sources, including websites, PDFs, and videos, incorporating best practices for
+crawling, cleaning, flagging, and deduplication. For instruction fine-tuning,
+we amalgamate existing Indic datasets, translate/transliterate English datasets
+into Indian languages, and utilize LLaMa2 and Mixtral models to create
+conversations grounded in articles from Indian Wikipedia and Wikihow.
+Additionally, we address toxicity alignment by generating toxic prompts for
+multiple scenarios and then generating non-toxic responses by feeding these toxic
+prompts to an aligned LLaMa2 model. We hope that the datasets, tools, and
+resources released as a part of this work will not only propel the research and
+development of Indic LLMs but also establish an open-source blueprint for
+extending such efforts to other languages. The data and other artifacts created
+as part of this work are released with permissive licenses.
+
+
+
+ comment: ACL-2024 Outstanding Paper
+
+
+
+
+
+
+ ♻ ☆ Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model
+ with Frozen LLM
+
+
+
+
+
+
+
+
+ Xiong Wang, Yangze Li, Chaoyou Fu, Yunhang Shen, Lei Xie, Ke Li, Xing Sun, Long Ma
+
+
+ Rapidly developing large language models (LLMs) have enabled a wide range of
+intelligent applications. In particular, GPT-4o's excellent duplex speech
+interaction ability has given users an impressive experience. Researchers
+have recently proposed several multi-modal LLMs in this direction that can
+achieve user-agent speech-to-speech conversations. This paper proposes a novel
+speech-text multimodal LLM architecture called Freeze-Omni. Our main
+contribution is that the speech input and output modalities can be easily
+connected to a textual LLM while keeping the LLM's parameters frozen throughout
+the training process. We design a three-stage training strategy for modeling
+both the speech input and output, enabling Freeze-Omni to obtain
+speech-to-speech conversation ability using text-speech paired data (such as
+ASR and TTS data) and only 60,000 multi-round text Q&A data on 8 GPUs.
+Moreover, we can effectively ensure that the intelligence of Freeze-Omni in
+the speech modality is at the same level as that in the text
+modality of its backbone LLM, while achieving low-latency end-to-end spoken
+responses. In addition, we also designed a method to achieve duplex dialogue
+ability through multi-task training, giving Freeze-Omni a more natural style of
+dialogue ability between users and agents. In summary, Freeze-Omni holds great
+potential to conduct speech-to-speech dialogue based on a multimodal LLM under
+the condition of a frozen LLM, avoiding the catastrophic forgetting problem
+caused by limited data and training resources.
+
+
+
+
+
+
+
+ ♻ ☆ A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language
+ Models
+
+
+
+
+
+
+
+
+ Jie Liu, Wenxuan Wang, Yihang Su, Jingyuan Huan, Wenting Chen, Yudi Zhang, Cheng-Yi Li, Kao-Jung Chang, Xiaohan Xin, Linlin Shen, Michael R. Lyu
+
+
+ The significant breakthroughs of Medical Multi-Modal Large Language Models
+(Med-MLLMs) renovate modern healthcare with robust information synthesis and
+medical decision support. However, these models are often evaluated on
+benchmarks that are unsuitable for the Med-MLLMs due to the complexity of
+real-world diagnostics across diverse specialties. To address this gap, we
+introduce Asclepius, a novel Med-MLLM benchmark that comprehensively assesses
+Med-MLLMs in terms of: distinct medical specialties (cardiovascular,
+gastroenterology, etc.) and different diagnostic capacities (perception,
+disease analysis, etc.). Grounded in 3 proposed core principles, Asclepius
+ensures a comprehensive evaluation by encompassing 15 medical specialties,
+stratifying into 3 main categories and 8 sub-categories of clinical tasks, and
+avoiding overlap with existing VQA datasets. We further provide an in-depth
+analysis of 6 Med-MLLMs and compare them with 3 human specialists, providing
+insights into their competencies and limitations in various medical contexts.
+Our work not only advances the understanding of Med-MLLMs' capabilities but
+also sets a precedent for future evaluations and the safe deployment of these
+models in clinical environments.
+
+
+
+ comment: 20 pages, 15 figures
+
+
+
+
+
+
+ ♻ ☆ Automated Speaking Assessment of Conversation Tests with Novel
+ Graph-based Modeling on Spoken Response Coherence
+
+
+ Automated speaking assessment in conversation tests (ASAC) aims to evaluate
+the overall speaking proficiency of an L2 (second-language) speaker in a
+setting where an interlocutor interacts with one or more candidates. Although
+prior ASAC approaches have shown promising performance on their respective
+datasets, there is still a dearth of research specifically focused on
+incorporating the coherence of the logical flow within a conversation into the
+grading model. To address this critical challenge, we propose a hierarchical
+graph model that aptly incorporates both broad inter-response interactions
+(e.g., discourse relations) and nuanced semantic information (e.g., semantic
+words and speaker intents), which is subsequently fused with contextual
+information for the final prediction. Extensive experimental results on the
+NICT-JLE benchmark dataset suggest that our proposed modeling approach can
+yield considerable improvements in prediction accuracy with respect to various
+assessment metrics, as compared to some strong baselines. This also sheds light
+on the importance of investigating coherence-related facets of spoken responses
+in ASAC.
+
+
+
+ comment: Accepted by IEEE SLT 2024
+
+
+
+
+
+
+ ♻ ☆ Conversational Complexity for Assessing Risk in Large Language Models
+
+
+
+
+
+
+
+
+ John Burden, Manuel Cebrian, Jose Hernandez-Orallo
+
+
+ Large Language Models (LLMs) present a dual-use dilemma: they enable
+beneficial applications while harboring potential for harm, particularly
+through conversational interactions. Despite various safeguards, advanced LLMs
+remain vulnerable. A watershed case in early 2023 involved journalist Kevin
+Roose's extended dialogue with Bing, an LLM-powered search engine, which
+revealed harmful outputs after probing questions, highlighting vulnerabilities
+in the model's safeguards. This contrasts with simpler early jailbreaks, like
+the "Grandma Jailbreak," where users framed requests as innocent help for a
+grandmother, easily eliciting similar content. This raises the question: How
+much conversational effort is needed to elicit harmful information from LLMs?
+We propose two measures to quantify this effort: Conversational Length (CL),
+which measures the number of conversational turns needed to obtain a specific
+harmful response, and Conversational Complexity (CC), defined as the Kolmogorov
+complexity of the user's instruction sequence leading to the harmful response.
+To address the incomputability of Kolmogorov complexity, we approximate CC
+using a reference LLM to estimate the compressibility of the user instructions.
+Applying this approach to a large red-teaming dataset, we perform a
+quantitative analysis examining the statistical distribution of harmful and
+harmless conversational lengths and complexities. Our empirical findings
+suggest that this distributional analysis and the minimization of CC serve as
+valuable tools for understanding AI safety, offering insights into the
+accessibility of harmful information. This work establishes a foundation for a
+new perspective on LLM safety, centered around the algorithmic complexity of
+pathways to harm.
+
+
+
+ comment: 15 pages, 6 figures
+
+
+
+
+
+
+ ♻ ☆ Sequential Large Language Model-Based Hyper-Parameter Optimization
+
+
+ This study introduces SLLMBO, an innovative framework that leverages Large
+Language Models (LLMs) for hyperparameter optimization (HPO), incorporating
+dynamic search space adaptability, enhanced parameter landscape exploitation,
+and a hybrid, novel LLM-Tree-structured Parzen Estimator (LLM-TPE) sampler. By
+addressing limitations in recent fully LLM-based methods and traditional
+Bayesian Optimization (BO), SLLMBO achieves more robust optimization. This
+comprehensive benchmarking evaluates multiple LLMs, including GPT-3.5-turbo,
+GPT-4o, Claude-Sonnet-3.5, and Gemini-1.5-flash, extending prior work beyond
+GPT-3.5 and GPT-4 and establishing SLLMBO as the first framework to benchmark a
+diverse set of LLMs for HPO. By integrating LLMs' established strengths in
+parameter initialization with the exploitation abilities demonstrated in this
+study, alongside TPE's exploration capabilities, the LLM-TPE sampler achieves a
+balanced exploration-exploitation trade-off, reduces API costs, and mitigates
+premature early stopping for more effective parameter searches. Across 14
+tabular tasks in classification and regression, the LLM-TPE sampler
+outperformed fully LLM-based methods and achieved superior results over BO
+methods in 9 tasks. Testing early stopping in budget-constrained scenarios
+further demonstrated competitive performance, indicating that LLM-based methods
+generally benefit from extended iterations for optimal results. This work lays
+the foundation for future research exploring open-source LLMs, reproducibility
+of LLM results in HPO, and benchmarking SLLMBO on complex datasets, such as
+image classification, segmentation, and machine translation.
+
+
+
+
+
+
+
+ ♻ ☆ Exploring Concept Depth: How Large Language Models Acquire Knowledge at
+ Different Layers? COLING 2025
+
+
+ Large language models (LLMs) have shown remarkable performance across a wide
+range of tasks. However, the mechanisms by which these models encode tasks of
+varying complexities remain poorly understood. In this paper, we explore the
+hypothesis that LLMs process concepts of varying complexities in different
+layers, introducing the idea of "Concept Depth" to suggest that more complex
+concepts are typically acquired in deeper layers. Specifically, we categorize
+concepts based on their level of abstraction, defining them in the order of
+increasing complexity within factual, emotional, and inferential tasks. We
+conduct extensive probing experiments using layer-wise representations across
+various LLM families (Gemma, LLaMA, Qwen) on various datasets spanning the
+three domains of tasks. Our findings reveal that models could efficiently
+conduct probing for simpler tasks in shallow layers, and more complex tasks
+typically necessitate deeper layers for accurate understanding. Additionally,
+we examine how external factors, such as adding noise to the input and
+quantizing the model weights, might affect layer-wise representations. Our
+findings suggest that these factors can impede the development of a conceptual
+understanding of LLMs until deeper layers are explored. We hope that our
+proposed concept and experimental insights will enhance the understanding of
+the mechanisms underlying LLMs. Our code is available at
+https://github.com/Luckfort/CD.
+
+
+
+ comment: COLING 2025
+
+
+
+
+
+
+ ♻ ☆ From Prejudice to Parity: A New Approach to Debiasing Large Language
+ Model Word Embeddings COLING 2025
+
+
+ Embeddings play a pivotal role in the efficacy of Large Language Models. They
+are the bedrock on which these models grasp contextual relationships and foster
+a more nuanced understanding of language and consequently perform remarkably on
+a plethora of complex tasks that require a fundamental understanding of human
+language. Given that these embeddings themselves often reflect or exhibit bias,
+it stands to reason that these models may also inadvertently learn this bias.
+In this work, we build on the seminal previous work and propose DeepSoftDebias,
+an algorithm that uses a neural network to perform 'soft debiasing'. We
+exhaustively evaluate this algorithm across a variety of SOTA datasets,
+accuracy metrics, and challenging NLP tasks. We find that DeepSoftDebias
+outperforms the current state-of-the-art methods at reducing bias across
+gender, race, and religion.
+
+
+
+ comment: Accepted at COLING 2025
+
+
+
+
+
+
+ ♻ ☆ SignLLM: Sign Language Production Large Language Models
+
+
+
+
+
+
+
+
+ Sen Fang, Lei Wang, Ce Zheng, Chunyu Sui, Mingyu Zhao, Yapeng Tian, Chen Chen
+
+
+ In this paper, we propose SignLLM, a multilingual Sign Language Production
+(SLP) large language model, which includes two novel multilingual SLP modes,
+MLSF and Prompt2LangGloss, that allow sign language gesture generation from
+query text input and question-style prompt input, respectively. Both modes can
+use a new RL loss based on reinforcement learning and a new RL module named
+Priority Learning Channel. These RL components can accelerate the training by
+enhancing the model's capability to sample high-quality data. For SignLLM's
+training, we introduce Prompt2Sign, a comprehensive multilingual sign language
+dataset, which is built from public data, including American Sign Language (ASL)
+and seven others. This dataset standardizes information by extracting pose
+information from sign language videos into a unified compressed format. We
+extensively evaluate SignLLM, demonstrating that our model achieves
+state-of-the-art performance on SLP tasks across eight sign languages.
+
+
+
+ comment: website at https://signllm.github.io/
+
+ This report presents Sabiá-3, our new flagship language model, and
+Sabiazinho-3, a more cost-effective sibling. The models were trained on a large
+Brazilian-centric corpus. Evaluations across diverse professional and academic
+benchmarks show strong performance on Portuguese and Brazil-related tasks.
+Sabiá-3 shows large improvements in comparison to our previous best model,
+Sabiá-2 Medium, especially in reasoning-intensive tasks. Notably, Sabiá-3's
+average performance matches frontier LLMs, while it is offered at a three to
+four times lower cost per token, reinforcing the benefits of domain
+specialization.
+
+
+
+
+
+
+
+
+
+
+ Information Retrieval 8
+
+
+
+
+
+ ☆ Cross-Domain Recommendation Meets Large Language Models
+
+
+
+
+
+
+
+
+ Ajay Krishna Vajjala, Dipak Meher, Ziwei Zhu, David S. Rosenblum
+
+
+ Cross-domain recommendation (CDR) has emerged as a promising solution to the
+cold-start problem faced by single-domain recommender systems. However,
+existing CDR models rely on complex neural architectures, large datasets, and
+significant computational resources, making them less effective in data-scarce
+scenarios or when simplicity is crucial. In this work, we leverage the
+reasoning capabilities of large language models (LLMs) and explore their
+performance in the CDR domain across multiple domain pairs. We introduce two
+novel prompt designs tailored for CDR and demonstrate that LLMs, when prompted
+effectively, outperform state-of-the-art CDR baselines across various metrics
+and domain combinations in the rating prediction and ranking tasks. This work
+bridges the gap between LLMs and recommendation systems, showcasing their
+potential as effective cross-domain recommenders.
+
+
+
+
+
+
+
+
+ David Dukić, Marin Petričević, Sven Ćurković, Jan Šnajder
+
+
+ TakeLab Retriever is an AI-driven search engine designed to discover,
+collect, and semantically analyze news articles from Croatian news outlets. It
+offers a unique perspective on the history and current landscape of Croatian
+online news media, making it an essential tool for researchers seeking to
+uncover trends, patterns, and correlations that general-purpose search engines
+cannot provide. TakeLab Retriever utilizes cutting-edge natural language
+processing (NLP) methods, enabling users to sift through articles using named
+entities, phrases, and topics through the web application. This technical
+report is divided into two parts: the first explains how TakeLab Retriever is
+utilized, while the second provides a detailed account of its design. In the
+second part, we also address the software engineering challenges involved and
+propose solutions for developing a microservice-based semantic search engine
+capable of handling over ten million news articles published over the past two
+decades.
+
+
+
+
+
+
+
+ ☆ Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating
+ RAG Systems COLING 2025
+
+
+
+
+
+
+
+
+ Rafael Teixeira de Lima, Shubham Gupta, Cesar Berrospi, Lokesh Mishra, Michele Dolfi, Peter Staar, Panagiotis Vagenas
+
+
+ Retrieval Augmented Generation (RAG) systems are a widespread application of
+Large Language Models (LLMs) in the industry. While many tools exist empowering
+developers to build their own systems, measuring their performance locally,
+with datasets reflective of the system's use cases, is a technological
+challenge. Solutions to this problem range from non-specific and cheap (most
+public datasets) to specific and costly (generating data from local documents).
+In this paper, we show that using public question and answer (Q&A) datasets to
+assess retrieval performance can lead to non-optimal systems design, and that
+common tools for RAG dataset generation can lead to unbalanced data. We propose
+solutions to these issues based on the characterization of RAG datasets through
+labels and through label-targeted data generation. Finally, we show that
+fine-tuned small LLMs can efficiently generate Q&A datasets. We believe that
+these observations are invaluable to the know-your-data step of RAG systems
+development.
+
+
+
+ comment: to be published in the 31st International Conference on Computational
+ Linguistics (COLING 2025)
+
+
+
+
+
+
+ ☆ A Review of LLM-based Explanations in Recommender Systems
+
+
+ The rise of Large Language Models (LLMs), such as LLaMA and ChatGPT, has
+opened new opportunities for enhancing recommender systems through improved
+explainability. This paper provides a systematic literature review focused on
+leveraging LLMs to generate explanations for recommendations -- a critical
+aspect for fostering transparency and user trust. We conducted a comprehensive
+search within the ACM Guide to Computing Literature, covering publications from
+the launch of ChatGPT (November 2022) to the present (November 2024). Our
+search yielded 232 articles, but after applying inclusion criteria, only six
+were identified as directly addressing the use of LLMs in explaining
+recommendations. This scarcity highlights that, despite the rise of LLMs, their
+application in explainable recommender systems is still in an early stage. We
+analyze these select studies to understand current methodologies, identify
+challenges, and suggest directions for future research. Our findings underscore
+the potential of LLMs to improve explanations of recommender systems and
+encourage the development of more transparent and user-centric recommendation
+explanation solutions.
+
+
+
+
+
+
+
+ ☆ Knowledge Management for Automobile Failure Analysis Using Graph RAG
+
+
+ This paper presents a knowledge management system for automobile failure
+analysis using retrieval-augmented generation (RAG) with large language models
+(LLMs) and knowledge graphs (KGs). In the automotive industry, there is a
+growing demand for knowledge transfer of failure analysis from experienced
+engineers to young engineers. However, failure events are phenomena that occur
+in a chain reaction, making them difficult for beginners to analyze. Knowledge
+graphs, which can describe semantic relationships and structure information,
+are effective in representing failure events because they can capture the
+relationships between components. However, KGs contain so much information
+that it is challenging for young engineers to extract and
+understand sub-graphs from the KG. On the other hand, there is increasing
+interest in the use of Graph RAG, a type of RAG that combines LLMs and KGs for
+knowledge management. However, when using the current Graph RAG framework with
+an existing knowledge graph for automobile failures, several issues arise
+because it is difficult to generate executable queries for a knowledge graph
+database that was not constructed by LLMs. To address this, we focused on
+optimizing the Graph RAG pipeline for existing knowledge graphs. Using an
+original Q&A dataset, the ROUGE F1 score of the sentences generated by the
+proposed method showed an average improvement of 157.6% compared to the current
+method. This highlights the effectiveness of the proposed method for automobile
+failure analysis.
+
+
+
+ comment: 7 pages, 6 figures, to be published in 2024 IEEE International
+ Conference on Big Data (BigData)
+
+ Recommendation systems predominantly utilize two-tower architectures, which
+evaluate user-item rankings through the inner product of their respective
+embeddings. However, one key limitation of two-tower models is that they learn
+a pair-agnostic representation of users and items. In contrast, pair-wise
+representations either scale poorly due to their quadratic complexity or are
+too restrictive on the candidate pairs to rank. To address these issues, we
+introduce Context-based Graph Neural Networks (ContextGNNs), a novel deep
+learning architecture for link prediction in recommendation systems. The method
+employs a pair-wise representation technique for familiar items situated within
+a user's local subgraph, while leveraging two-tower representations to
+facilitate the recommendation of exploratory items. A final network then
+predicts how to fuse both pair-wise and two-tower recommendations into a single
+ranking of items. We demonstrate that ContextGNN is able to adapt to different
+data characteristics and outperforms existing methods, both traditional and
+GNN-based, on a diverse set of practical recommendation tasks, improving
+performance by 20% on average.
+
+
+
+ comment: 14 pages, 1 figure, 5 tables
+
+
+
+
+
+
+ ☆ TQA-Bench: Evaluating LLMs for Multi-Table Question Answering with
+ Scalable Context and Symbolic Extension
+
+
+
+
+
+
+
+
+ Zipeng Qiu, You Peng, Guangxin He, Binhang Yuan, Chen Wang
+
+
+ The advent of large language models (LLMs) has unlocked great opportunities
+in complex data management tasks, particularly in question answering (QA) over
+complicated multi-table relational data. Despite significant progress,
+systematically evaluating LLMs on multi-table QA remains a critical challenge
+due to the inherent complexity of analyzing heterogeneous table structures and
+the potentially large scale of serialized relational data. Existing benchmarks
+primarily focus on single-table QA, failing to capture the intricacies of
+reasoning across multiple relational tables, as required in real-world domains
+such as finance, healthcare, and e-commerce. To address this gap, we present
+TQA-Bench, a new multi-table QA benchmark designed to evaluate the capabilities
+of LLMs in tackling complex QA tasks over relational data. Our benchmark
+incorporates diverse relational database instances sourced from real-world
+public datasets and introduces a flexible sampling mechanism to create tasks
+with varying multi-table context lengths, ranging from 8K to 64K tokens. To
+ensure robustness and reliability, we integrate symbolic extensions into the
+evaluation framework, enabling the assessment of LLM reasoning capabilities
+beyond simple data retrieval or probabilistic pattern matching. We
+systematically evaluate a range of LLMs, both open-source and closed-source,
+spanning model scales from 7 billion to 70 billion parameters. Our extensive
+experiments reveal critical insights into the performance of LLMs in
+multi-table QA, highlighting both challenges and opportunities for advancing
+their application in complex, data-driven environments. Our benchmark
+implementation and results are available at
+https://github.com/Relaxed-System-Lab/TQA-Bench.
+
+
+
+
+
+
+
+ ☆ Zero-Indexing Internet Search Augmented Generation for Large Language
+ Models
+
+
+ Retrieval augmented generation has emerged as an effective method to enhance
+large language model performance. This approach typically relies on an internal
+retrieval module that uses various indexing mechanisms to manage a static
+pre-processed corpus. However, such a paradigm often falls short when it is
+necessary to integrate the most up-to-date information that has not yet been
+incorporated into the corpus at generative inference time. In this paper, we
+explore an alternative approach that leverages standard search engine APIs to
+dynamically integrate the latest online information (without maintaining any
+index for any fixed corpus), thereby improving the quality of generated
+content. We design a collaborative LLM-based paradigm, where we include: (i) a
+parser-LLM that determines whether Internet-augmented generation is needed and,
+if so, extracts the search keywords in a single inference; (ii) a mixed
+ranking strategy that re-ranks the retrieved HTML files to eliminate bias
+introduced from the search engine API; and (iii) an extractor-LLM that can
+accurately and efficiently extract relevant information from the fresh content
+in each HTML file. We conduct extensive empirical studies to evaluate the
+performance of this Internet search augmented generation paradigm. The
+experimental results demonstrate that our method generates content with
+significantly improved quality. Our system has been successfully deployed in a
+production environment to serve 01.AI's generative inference requests.
+
+
+
+
+
+
+
+
+
+
+ Machine Learning 150
+
+
+
+
+
+ ☆ T2Vid: Translating Long Text into Multi-Image is the Catalyst for
+ Video-LLMs
+
+
+ The success of Multimodal Large Language Models (MLLMs) in the image domain
+has garnered wide attention from the research community. Drawing on previous
+successful experiences, researchers have recently explored extending the
+success to the video understanding realms. Apart from training from scratch, an
+efficient way is to utilize the pre-trained image-LLMs, leading to two
+mainstream approaches, i.e. zero-shot inference and further fine-tuning with
+video data. In this work, our study of these approaches harvests an effective
+data augmentation method. We first take a closer look at the zero-shot
+inference approach and identify two limitations, i.e. limited generalization and
+lack of temporal understanding capabilities. Thus, we further investigate the
+fine-tuning approach and find a low learning efficiency when simply using all
+the video data samples, which can be attributed to a lack of instruction
+diversity. Aiming at this issue, we develop a method called T2Vid to synthesize
+video-like samples to enrich the instruction diversity in the training corpus.
+Integrating these data enables a simple and efficient training scheme, which
+achieves performance comparable to or even superior to using full video
+datasets by training with just 15% of the sample size. Meanwhile, we find that the
+proposed scheme can boost the performance of long video understanding without
+training with long video samples. We hope our study will spark more thinking
+about using MLLMs for video understanding and curation of high-quality data.
+The code is released at https://github.com/xjtupanda/T2Vid.
+
+
+ We introduce AlphaTablets, a novel and generic representation of 3D planes
+that features a continuous 3D surface and precise boundary delineation. By
+representing 3D planes as rectangles with alpha channels, AlphaTablets combine
+the advantages of current 2D and 3D plane representations, enabling accurate,
+consistent and flexible modeling of 3D planes. We derive differentiable
+rasterization on top of AlphaTablets to efficiently render 3D planes into
+images, and propose a novel bottom-up pipeline for 3D planar reconstruction
+from monocular videos. Starting with 2D superpixels and geometric cues from
+pre-trained models, we initialize 3D planes as AlphaTablets and optimize them
+via differentiable rendering. An effective merging scheme is introduced to
+facilitate the growth and refinement of AlphaTablets. Through iterative
+optimization and merging, we reconstruct complete and accurate 3D planes with
+solid surfaces and clear boundaries. Extensive experiments on the ScanNet
+dataset demonstrate state-of-the-art performance in 3D planar reconstruction,
+underscoring the great potential of AlphaTablets as a generic 3D plane
+representation for various applications. Project page is available at:
+https://hyzcluster.github.io/alphatablets
+
+
+
+ comment: NeurIPS 2024
+
+
+
+
+
+
+ ☆ DELT: A Simple Diversity-driven EarlyLate Training for Dataset
+ Distillation
+
+
+ Recent advances in dataset distillation have led to solutions in two main
+directions. The conventional batch-to-batch matching mechanism is ideal for
+small-scale datasets and includes bi-level optimization methods on models and
+syntheses, such as FRePo, RCIG, and RaT-BPTT, as well as other methods like
+distribution matching, gradient matching, and weight trajectory matching.
+Conversely, batch-to-global matching typifies decoupled methods, which are
+particularly advantageous for large-scale datasets. This approach has garnered
+substantial interest within the community, as seen in SRe$^2$L, G-VBSM, WMDD,
+and CDA. A primary challenge with the second approach is the lack of diversity
+among syntheses within each class since samples are optimized independently and
+the same global supervision signals are reused across different synthetic
+images. In this study, we propose a new Diversity-driven EarlyLate Training
+(DELT) scheme to enhance the diversity of images in batch-to-global matching
+with less computation. Our approach is conceptually simple yet effective: it
+partitions predefined IPC samples into smaller subtasks and employs local
+optimizations to distill each subset into distributions from distinct phases,
+reducing the uniformity induced by the unified optimization process. These
+distilled images from the subtasks demonstrate effective generalization when
+applied to the entire task. We conduct extensive experiments on CIFAR,
+Tiny-ImageNet, ImageNet-1K, and its sub-datasets. Our approach outperforms the
+previous state-of-the-art by 2$\sim$5% on average across different datasets and
+IPCs (images per class), increasing diversity per class by more than 5% while
+reducing synthesis time by up to 39.3%, thereby improving training efficiency.
+Code is available at: https://github.com/VILA-Lab/DELT.
+
+
+ Large Language Models (LLMs) have exhibited remarkable performance on
+reasoning tasks. They utilize autoregressive token generation to construct
+reasoning trajectories, enabling the development of a coherent chain of
+thought. In this work, we explore the impact of individual tokens on the final
+outcomes of reasoning tasks. We identify the existence of ``critical tokens''
+that lead to incorrect reasoning trajectories in LLMs. Specifically, we find
+that LLMs tend to produce positive outcomes when forced to decode other tokens
+instead of critical tokens. Motivated by this observation, we propose a novel
+approach - cDPO - designed to automatically recognize and apply token-level
+rewards for the critical tokens during the alignment process. Specifically, we
+develop a contrastive estimation approach to automatically identify critical
+tokens. It is achieved by comparing the generation likelihood of positive and
+negative models. To achieve this, we separately fine-tune the positive and
+negative models on various reasoning trajectories; consequently, they are
+capable of identifying critical tokens within incorrect trajectories
+that contribute to erroneous outcomes. Moreover, to further align the model
+with the critical token information during the alignment process, we extend the
+conventional DPO algorithms to token-level DPO and utilize the differential
+likelihood from the aforementioned positive and negative models as an important
+weight for token-level DPO learning. Experimental results on GSM8K and MATH500
+benchmarks with two widely used models, Llama-3 (8B and 70B) and deepseek-math
+(7B), demonstrate the effectiveness of the proposed approach, cDPO.
+
+
+
+
+
+
+
+
+ Hang Ye, Xiaoxuan Ma, Hai Ci, Wentao Zhu, Yizhou Wang
+
+
+ Achieving realistic animated human avatars requires accurate modeling of
+pose-dependent clothing deformations. Existing learning-based methods heavily
+rely on the Linear Blend Skinning (LBS) of minimally-clothed human models like
+SMPL to model deformation. However, these methods struggle to handle loose
+clothing, such as long dresses, where the canonicalization process becomes
+ill-defined when the clothing is far from the body, leading to disjointed and
+fragmented results. To overcome this limitation, we propose a novel hybrid
+framework to model challenging clothed humans. Our core idea is to use
+dedicated strategies to model different regions, depending on whether they are
+close to or distant from the body. Specifically, we segment the human body into
+three categories: unclothed, deformed, and generated. We simply replicate
+unclothed regions that require no deformation. For deformed regions close to
+the body, we leverage LBS to handle the deformation. As for the generated
+regions, which correspond to loose clothing areas, we introduce a novel
+free-form, part-aware generator to model them, as they are less affected by
+movements. This free-form generation paradigm brings enhanced flexibility and
+expressiveness to our hybrid framework, enabling it to capture the intricate
+geometric details of challenging loose clothing, such as skirts and dresses.
+Experimental results on the benchmark dataset featuring loose clothing
+demonstrate that our method achieves state-of-the-art performance with superior
+visual fidelity and realism, particularly in the most challenging cases.
+
+
+
+ comment: 23 pages, 25 figures
+
+
+
+
+
+
+ ☆ Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA
+ Benchmark
+
+
+
+
+
+
+
+
+ Joseph Heyward, João Carreira, Dima Damen, Andrew Zisserman, Viorica Pătrăucean
+
+
+ Following the successful 2023 edition, we organised the Second Perception
+Test challenge as a half-day workshop alongside the IEEE/CVF European
+Conference on Computer Vision (ECCV) 2024, with the goal of benchmarking
+state-of-the-art video models and measuring the progress since last year using
+the Perception Test benchmark. This year, the challenge had seven tracks (up
+from six last year) and covered low-level and high-level tasks, with language
+and non-language interfaces, across video, audio, and text modalities; the
+additional track covered hour-long video understanding and introduced a novel
+video QA benchmark 1h-walk VQA. Overall, the tasks in the different tracks
+were: object tracking, point tracking, temporal action localisation, temporal
+sound localisation, multiple-choice video question-answering, grounded video
+question-answering, and hour-long video question-answering. We summarise in
+this report the challenge tasks and results, and introduce in detail the novel
+hour-long video QA benchmark 1h-walk VQA.
+
+
+
+ comment: arXiv admin note: substantial text overlap with arXiv:2312.13090
+
+
+
+
+
+
+ ★ On Domain-Specific Post-Training for Multimodal Large Language Models
+
+
+ Recent years have witnessed the rapid development of general multimodal large
+language models (MLLMs). However, adapting general MLLMs to specific domains,
+such as scientific fields and industrial applications, remains less explored.
+This paper systematically investigates domain adaptation of MLLMs through
+post-training, focusing on data synthesis, training pipelines, and task
+evaluation. (1) Data Synthesis: Using open-source models, we develop a visual
+instruction synthesizer that effectively generates diverse visual instruction
+tasks from domain-specific image-caption pairs. Our synthetic tasks surpass
+those generated by manual rules, GPT-4, and GPT-4V in enhancing the
+domain-specific performance of MLLMs. (2) Training Pipeline: While the
+two-stage training--initially on image-caption pairs followed by visual
+instruction tasks--is commonly adopted for developing general MLLMs, we apply a
+single-stage training pipeline to enhance task diversity for domain-specific
+post-training. (3) Task Evaluation: We conduct experiments in two domains,
+biomedicine and food, by post-training MLLMs of different sources and scales
+(e.g., Qwen2-VL-2B, LLaVA-v1.6-8B, Llama-3.2-11B), and then evaluating MLLM
+performance on various domain-specific tasks. To support further research in
+MLLM domain adaptation, we will open-source our implementations.
+
+
+
+
+
+
+
+ ☆ Scalable Out-of-distribution Robustness in the Presence of Unobserved
+ Confounders
+
+
+ We consider the task of out-of-distribution (OOD) generalization, where the
+distribution shift is due to an unobserved confounder ($Z$) affecting both the
+covariates ($X$) and the labels ($Y$). In this setting, traditional assumptions
+of covariate and label shift are unsuitable due to the confounding, which
+introduces heterogeneity in the predictor, i.e., $\hat{Y} = f_Z(X)$. OOD
+generalization differs from traditional domain adaptation by not assuming
+access to the covariate distribution ($X^\text{te}$) of the test samples during
+training. These conditions create a challenging scenario for OOD robustness:
+(a) $Z^\text{tr}$ is an unobserved confounder during training, (b)
+$P^\text{te}(Z) \neq P^\text{tr}(Z)$, (c) $X^\text{te}$ is unavailable during
+training, and (d) the posterior predictive distribution depends on
+$P^\text{te}(Z)$, i.e., $\hat{Y} = E_{P^\text{te}(Z)}[f_Z(X)]$. In general,
+accurate predictions are unattainable in this scenario, and existing literature
+has proposed complex predictors based on identifiability assumptions that
+require multiple additional variables. Our work investigates a set of
+identifiability assumptions that greatly simplify the predictor, and the
+resulting simple predictor outperforms existing approaches.
+
+
+
+ comment: 24 pages, 3 figures
+
+
+
+
+
+
+ ☆ Dynamic EEG-fMRI mapping: Revealing the relationship between brain
+ connectivity and cognitive state SP
+
+
+ This study investigated the dynamic connectivity patterns between EEG and
+fMRI modalities, contributing to our understanding of brain network
+interactions. By employing a comprehensive approach that integrated static and
+dynamic analyses of EEG-fMRI data, we were able to uncover distinct
+connectivity states and characterize their temporal fluctuations. The results
+revealed modular organization within the intrinsic connectivity networks (ICNs)
+of the brain, highlighting the significant roles of sensory systems and the
+default mode network. The use of a sliding window technique allowed us to
+assess how functional connectivity varies over time, further elucidating the
+transient nature of brain connectivity. Additionally, our findings align with
+previous literature, reinforcing the notion that cognitive states can be
+effectively identified through short-duration data, specifically within the
+30-60 second timeframe. The established relationships between connectivity
+strength and cognitive processes, particularly during different visual states,
+underscore the relevance of our approach for future research into brain
+dynamics. Overall, this study not only enhances our understanding of the
+interplay between EEG and fMRI signals but also paves the way for further
+exploration into the neural correlates of cognitive functions and their
+implications in clinical settings. Future research should focus on refining
+these methodologies and exploring their applications in various cognitive and
+clinical contexts.
+
+
+ Quantifying the gap between synthetic and real-world imagery is essential for
+improving both transformer-based models - that rely on large volumes of data -
+and datasets, especially in underexplored domains like aerial scene
+understanding where the potential impact is significant. This paper introduces
+a novel methodology for scene complexity assessment using Multi-Model Consensus
+Metric (MMCM) and depth-based structural metrics, enabling a robust evaluation
+of perceptual and structural disparities between domains. Our experimental
+analysis, utilizing real-world (Dronescapes) and synthetic (Skyscenes)
+datasets, demonstrates that real-world scenes generally exhibit higher
+consensus among state-of-the-art vision transformers, while synthetic scenes
+show greater variability and challenge model adaptability. The results
+underline the inherent complexities and domain gaps, emphasizing the need for
+enhanced simulation fidelity and model generalization. This work provides
+critical insights into the interplay between domain characteristics and model
+performance, offering a pathway for improved domain adaptation strategies in
+aerial scene understanding.
+
+
+
+ comment: 17 pages (including references), 5 figures, 2 tables. Accepted for
+ publication in the "Scientific Bulletin", Series C, Electrical Engineering
+ and Computer Science, ISSN 2286-3540
+
+
+
+
+
+
+ ☆ Another look at inference after prediction
+
+
+
+
+
+
+
+
+ Jessica Gronsbell, Jianhui Gao, Yaqi Shi, Zachary R. McCaw, David Cheng
+
+
+ Prediction-based (PB) inference is increasingly used in applications where
+the outcome of interest is difficult to obtain, but its predictors are readily
+available. Unlike traditional inference, PB inference performs statistical
+inference using a partially observed outcome and a set of covariates by
+leveraging a prediction of the outcome generated from a machine learning (ML)
+model. Motwani and Witten (2023) recently revisited two innovative PB inference
+approaches for ordinary least squares. They found that the method proposed by
+Wang et al. (2020) yields a consistent estimator for the association of
+interest when the ML model perfectly captures the underlying regression
+function. Conversely, the prediction-powered inference (PPI) method proposed by
+Angelopoulos et al. (2023) yields valid inference regardless of the model's
+accuracy. In this paper, we study the statistical efficiency of the PPI
+estimator. Our analysis reveals that a more efficient estimator, proposed 25
+years ago by Chen and Chen (2000), can be obtained by simply adding a weight to
+the PPI estimator. We also contextualize PB inference with methods from the
+economics and statistics literature dating back to the 1960s. Our extensive
+theoretical and numerical analyses indicate that the Chen and Chen (CC)
+estimator offers a balance between robustness to ML model specification and
+statistical efficiency, making it the preferred choice for use in practice.
+
+
+
+
+
+
+
+ ☆ Classical and Quantum Algorithms for the Deterministic L-system
+ Inductive Inference Problem
+
+
+
+
+
+
+
+
+ Ali Lotfi, Ian McQuillan, Steven Rayan
+
+
+ L-systems can be made to model and create simulations of many biological
+processes, such as plant development. Finding an L-system for a given process
+is typically done by hand, by experts, in a hugely time-consuming process. It
+would be significant if this could be done automatically from data, such as
+from sequences of images. In this paper, we are interested in inferring a
+particular type of L-system, the deterministic context-free L-system (D0L-system),
+from a sequence of strings. We introduce the characteristic graph of a sequence
+of strings, which we then utilize to translate our problem (inferring a
+D0L-system) in polynomial time into the maximum independent set problem (MIS)
+and the SAT problem. After that, we offer a classical exact algorithm and an
+approximate quantum algorithm for the problem.
+
+
+ Neural radiance fields (NeRF) have exhibited highly photorealistic rendering
+of novel views through per-scene optimization over a single 3D scene. With the
+growing popularity of NeRF and its variants, they have become ubiquitous and
+have been identified as efficient 3D resources. However, they are still far
+from being scalable since a separate model needs to be stored for each scene,
+and the training time increases linearly with every newly added scene.
+Surprisingly, the idea of encoding multiple 3D scenes into a single NeRF model
+is heavily under-explored. In this work, we propose a novel
+conditional-cum-continual framework, called $C^{3}$-NeRF, to accommodate
+multiple scenes into the parameters of a single neural radiance field. Unlike
+conventional approaches that leverage feature extractors and pre-trained priors
+for scene conditioning, we use simple pseudo-scene labels to model multiple
+scenes in NeRF. Interestingly, we observe the framework is also inherently
+continual (via generative replay) with minimal, if not no, forgetting of the
+previously learned scenes. Consequently, the proposed framework adapts to
+multiple new scenes without necessarily accessing the old data. Through
+extensive qualitative and quantitative evaluation using synthetic and real
+datasets, we demonstrate the inherent capacity of the NeRF model to accommodate
+multiple scenes with high-quality novel-view renderings without any
+additional parameters. We provide implementation details and dynamic
+visualizations of our results in the supplementary file.
+
+
+
+
+
+
+
+ ☆ Noncommutative Model Selection for Data Clustering and Dimension
+ Reduction Using Relative von Neumann Entropy
+
+
+ We propose a pair of completely data-driven algorithms for unsupervised
+classification and dimension reduction, and we empirically study their
+performance on a number of data sets, both simulated data in three dimensions
+and images from the COIL-20 data set. The algorithms take as input a set of
+points sampled from a uniform distribution supported on a metric space, the
+latter embedded in an ambient metric space, and they output a clustering or
+reduction of dimension of the data. They work by constructing a natural family
+of graphs from the data and selecting the graph which maximizes the relative
+von Neumann entropy of certain normalized heat operators constructed from the
+graphs. Once the appropriate graph is selected, the eigenvectors of the graph
+Laplacian may be used to reduce the dimension of the data, and clusters in the
+data may be identified with the kernel of the associated graph Laplacian.
+Notably, these algorithms do not require information about the size of a
+neighborhood or the desired number of clusters as input, in contrast to popular
+algorithms such as $k$-means, and even more modern spectral methods such as
+Laplacian eigenmaps, among others.
+ In our computational experiments, our clustering algorithm outperforms
+$k$-means clustering on data sets with non-trivial geometry and topology, in
+particular data whose clusters are not concentrated around a specific point,
+and our dimension reduction algorithm is shown to work well in several simple
+examples.
+
+
+
+ comment: 20 pages
+
+
+
+
+
+
+ ☆ Efficient quantum-enhanced classical simulation for patches of quantum
+ landscapes
+
+
+
+
+
+
+
+
+ Sacha Lerch, Ricard Puig, Manuel S. Rudolph, Armando Angrisani, Tyson Jones, M. Cerezo, Supanut Thanasilp, Zoë Holmes
+
+
+ Understanding the capabilities of classical simulation methods is key to
+identifying where quantum computers are advantageous. Not only does this ensure
+that quantum computers are used only where necessary, but also one can
+potentially identify subroutines that can be offloaded onto a classical device.
+In this work, we show that it is always possible to generate a classical
+surrogate of a sub-region (dubbed a "patch") of an expectation landscape
+produced by a parameterized quantum circuit. That is, we provide a
+quantum-enhanced classical algorithm which, after simple measurements on a
+quantum device, allows one to classically simulate approximate expectation
+values of a subregion of a landscape. We provide time and sample complexity
+guarantees for a range of families of circuits of interest, and further
+numerically demonstrate our simulation algorithms on an exactly verifiable
+simulation of a Hamiltonian variational ansatz and long-time dynamics
+simulation on a 127-qubit heavy-hex topology.
+
+
+
+ comment: 10 + 47 pages, 4 figures
+
+
+
+
+
+
+ ☆ Noncommutative Model Selection and the Data-Driven Estimation of Real
+ Cohomology Groups
+
+
+ We propose three completely data-driven methods for estimating the real
+cohomology groups $H^k (X ; \mathbb{R})$ of a compact metric-measure space $(X,
+d_X, \mu_X)$ embedded in a metric-measure space $(Y,d_Y,\mu_Y)$, given a finite
+set of points $S$ sampled from a uniform distribution $\mu_X$ on $X$, possibly
+corrupted with noise from $Y$. We present the results of several computational
+experiments in the case that $X$ is embedded in $\mathbb{R}^n$, where two of
+the three algorithms performed well.
+
+
+
+ comment: 15 pages, sequel to "Noncommutative Model Selection for Data
+ Clustering and Dimension Reduction Using Relative von Neumann Entropy"
+
+
+
+
+
+
+ ☆ FlowCLAS: Enhancing Normalizing Flow Via Contrastive Learning For
+ Anomaly Segmentation
+
+
+
+
+
+
+
+
+ Chang Won Lee, Selina Leveugle, Svetlana Stolpner, Chris Langley, Paul Grouchy, Jonathan Kelly, Steven L. Waslander
+
+
+ Anomaly segmentation is a valuable computer vision task for safety-critical
+applications that need to be aware of unexpected events. Current
+state-of-the-art (SOTA) scene-level anomaly segmentation approaches rely on
+diverse inlier class labels during training, limiting their ability to leverage
+vast unlabeled datasets and pre-trained vision encoders. These methods may
+underperform in domains with reduced color diversity and limited object
+classes. Conversely, existing unsupervised methods struggle with anomaly
+segmentation in the diverse scenes of less restricted domains. To address
+these challenges, we introduce FlowCLAS, a novel self-supervised framework that
+utilizes vision foundation models to extract rich features and employs a
+normalizing flow network to learn their density distribution. We enhance the
+model's discriminative power by incorporating Outlier Exposure and contrastive
+learning in the latent space. FlowCLAS significantly outperforms all existing
+methods on the ALLO anomaly segmentation benchmark for space robotics and
+demonstrates competitive results on multiple road anomaly segmentation
+benchmarks for autonomous driving, including Fishyscapes Lost&Found and Road
+Anomaly. These results highlight FlowCLAS's effectiveness in addressing the
+unique challenges of space anomaly segmentation while retaining SOTA
+performance in the autonomous driving domain without reliance on inlier
+segmentation labels.
+
+
+
+
+
+
+
+
+ Rakshit Kr. Singh, Aaron Rock Menezes, Rida Irfan, Bharath Ramsundar
+
+
+ Ordinary Differential Equations (ODEs) are widely used in physics, chemistry,
+and biology to model dynamic systems, including reaction kinetics, population
+dynamics, and biological processes. In this work, we integrate GPU-accelerated
+ODE solvers into the open-source DeepChem framework, making these tools easily
+accessible. These solvers support multiple numerical methods and are fully
+differentiable, enabling easy integration into more complex differentiable
+programs. We demonstrate the capabilities of our implementation through
+experiments on Lotka-Volterra predator-prey dynamics, pharmacokinetic
+compartment models, neural ODEs, and solving PDEs using reaction-diffusion
+equations. Our solvers achieved high accuracy with mean squared errors ranging
+from $10^{-4}$ to $10^{-6}$ and showed scalability in solving large systems
+with up to 100 compartments.
+
+
+
+
+
+
+
+ ☆ Enhanced anomaly detection in well log data through the application of
+ ensemble GANs
+
+
+
+
+
+
+
+
+ Abdulrahman Al-Fakih, A. Koeshidayatullah, Tapan Mukerji, SanLinn I. Kaka
+
+
+ Although generative adversarial networks (GANs) have shown significant
+success in modeling data distributions for image datasets, their application to
+structured or tabular data, such as well logs, remains relatively
+underexplored. This study extends the ensemble GANs (EGANs) framework to
+capture the distribution of well log data and detect anomalies that fall
+outside of these distributions. The proposed approach compares the performance
+of traditional methods, such as Gaussian mixture models (GMMs), with EGANs in
+detecting anomalies outside the expected data distributions. For the gamma ray
+(GR) dataset, EGANs achieved a precision of 0.62 and F1 score of 0.76,
+outperforming GMM's precision of 0.38 and F1 score of 0.54. Similarly, for
+travel time (DT), EGANs achieved a precision of 0.70 and F1 score of 0.79,
+surpassing GMM's 0.56 and 0.71. In the neutron porosity (NPHI) dataset, EGANs
+recorded a precision of 0.53 and F1 score of 0.68, outperforming GMM's 0.47 and
+0.61. For the bulk density (RHOB) dataset, EGANs achieved a precision of 0.52
+and an F1 score of 0.67, slightly outperforming GMM, which yielded a precision
+of 0.50 and an F1 score of 0.65. This work's novelty lies in applying EGANs for
+well log data analysis, showcasing their ability to learn data patterns and
+identify anomalies that deviate from them. This approach offers more reliable
+anomaly detection compared to traditional methods like GMM. The findings
+highlight the potential of EGANs in enhancing anomaly detection for well log
+data, delivering significant implications for optimizing drilling strategies
+and reservoir management through more accurate, data-driven insights into
+subsurface characterization.
+
+
+ Training large neural networks typically requires sharing gradients between
+accelerators through specialized high-speed interconnects. Drawing from the
+signal processing principles of frequency decomposition and energy compaction,
+we demonstrate that synchronizing full optimizer states and model parameters
+during training is unnecessary. By decoupling momentum updates and allowing
+controlled divergence in optimizer states across accelerators, we achieve
+improved convergence compared to state-of-the-art optimizers. We introduce
+Decoupled Momentum (DeMo), a fused optimizer and data
+parallel algorithm that reduces inter-accelerator communication requirements by
+several orders of magnitude. This enables training of large neural networks
+even with limited network bandwidth and heterogeneous hardware. Our method is
+topology-agnostic and architecture-independent and supports scalable
+clock-synchronous distributed training with negligible compute and memory
+overhead. Empirical results show that models trained with DeMo match or exceed
+the performance of equivalent models trained with AdamW, while eliminating the
+need for high-speed interconnects when pre-training large scale foundation
+models. An open source reference PyTorch implementation is published on GitHub
+at https://github.com/bloc97/DeMo
+
+
+
+
+
+
+
+ ☆ AIDetx: a compression-based method for identification of
+ machine-learning generated text
+
+
+
+
+
+
+
+
+ Leonardo Almeida, Pedro Rodrigues, Diogo Magalhães, Armando J. Pinho, Diogo Pratas
+
+
+ This paper introduces AIDetx, a novel method for detecting machine-generated
+text using data compression techniques. Traditional approaches, such as deep
+learning classifiers, often suffer from high computational costs and limited
+interpretability. To address these limitations, we propose a compression-based
+classification framework that leverages finite-context models (FCMs). AIDetx
+constructs distinct compression models for human-written and AI-generated text,
+classifying new inputs based on which model achieves a higher compression
+ratio. We evaluated AIDetx on two benchmark datasets, achieving F1 scores
+exceeding 97% and 99%, respectively, highlighting its high accuracy. Compared
+to current methods, such as large language models (LLMs), AIDetx offers a more
+interpretable and computationally efficient solution, significantly reducing
+both training time and hardware requirements (e.g., no GPUs needed). The full
+implementation is publicly available at https://github.com/AIDetx/AIDetx.
+
+
+
+
+
+
+
+
+ Justin Chih-Yao Chen, Zifeng Wang, Hamid Palangi, Rujun Han, Sayna Ebrahimi, Long Le, Vincent Perot, Swaroop Mishra, Mohit Bansal, Chen-Yu Lee, Tomas Pfister
+
+
+ Reverse thinking plays a crucial role in human reasoning. Humans can reason
+not only from a problem to a solution but also in reverse, i.e., start from the
+solution and reason towards the problem. This often enhances overall reasoning
+performance as it enables consistency checks between their forward and backward
+thinking. To enable Large Language Models (LLMs) to perform reverse thinking,
+we introduce Reverse-Enhanced Thinking (RevThink), a framework composed of data
+augmentation and learning objectives. In RevThink, we augment the dataset by
+collecting structured forward-backward reasoning from a teacher model,
+consisting of: (1) the original question, (2) forward reasoning, (3) backward
+question, and (4) backward reasoning. We then employ three objectives to train
+a smaller student model in a multi-task learning fashion: (a) generate forward
+reasoning from a question, (b) generate a backward question from a question,
+and (c) generate backward reasoning from the backward question. Experiments
+across 12 datasets covering commonsense, math, and logical reasoning show an
+average 13.53% improvement over the student model's zero-shot performance and a
+6.84% improvement over the strongest knowledge distillation baselines.
+Moreover, our method demonstrates sample efficiency -- using only 10% of the
+correct forward reasoning from the training data, it outperforms a standard
+fine-tuning method trained on 10x more forward reasoning. RevThink also
+exhibits strong generalization to out-of-distribution held-out datasets.
+
+
+
+ comment: 20 pages
+
+
+
+
+
+
+ ☆ SpaRC: Sparse Radar-Camera Fusion for 3D Object Detection
+
+
+
+
+
+
+
+
+ Philipp Wolters, Johannes Gilg, Torben Teepe, Fabian Herzog, Felix Fent, Gerhard Rigoll
+
+
+ In this work, we present SpaRC, a novel Sparse fusion transformer for 3D
+perception that integrates multi-view image semantics with Radar and Camera
+point features. The fusion of radar and camera modalities has emerged as an
+efficient perception paradigm for autonomous driving systems. While
+conventional approaches utilize dense Bird's Eye View (BEV)-based architectures
+for depth estimation, contemporary query-based transformers excel in
+camera-only detection through object-centric methodology. However, these
+query-based approaches exhibit limitations in false positive detections and
+localization precision due to implicit depth modeling. We address these
+challenges through three key contributions: (1) sparse frustum fusion (SFF) for
+cross-modal feature alignment, (2) range-adaptive radar aggregation (RAR) for
+precise object localization, and (3) local self-attention (LSA) for focused
+query aggregation. In contrast to existing methods requiring computationally
+intensive BEV-grid rendering, SpaRC operates directly on encoded point
+features, yielding substantial improvements in efficiency and accuracy.
+Empirical evaluations on the nuScenes and TruckScenes benchmarks demonstrate
+that SpaRC significantly outperforms existing dense BEV-based and sparse
+query-based detectors. Our method achieves state-of-the-art performance metrics
+of 67.1 NDS and 63.1 AMOTA. The code and pretrained models are available at
+https://github.com/phi-wol/sparc.
+
+
+ While being very successful in solving many downstream tasks, the application
+of deep neural networks is limited in real-life scenarios because of their
+susceptibility to domain shifts such as common corruptions, and adversarial
+attacks. The existence of adversarial examples and data corruption
+significantly reduces the performance of deep classification models.
+Researchers have made strides in developing robust neural architectures to
+bolster decisions of deep classifiers. However, most of these works rely on
+effective adversarial training methods, and predominantly focus on overall
+model robustness, disregarding class-wise differences in robustness, which are
+critical. Exploiting weakly robust classes is a potential avenue for attackers
+to fool the image recognition models. Therefore, this study investigates
+class-to-class biases across adversarially trained robust classification models
+to understand their latent space structures and analyze their strong and weak
+class-wise properties. We further assess the robustness of classes against
+common corruptions and adversarial attacks, recognizing that class
+vulnerability extends beyond the number of correct classifications for a
+specific class. We find that the number of false positives of classes as
+specific target classes significantly impacts their vulnerability to attacks.
+Through our analysis of the Class False Positive Score, we provide a fair
+evaluation of how susceptible each class is to misclassification.
+
+
+
+
+
+
+
+ ☆ A Visual-inertial Localization Algorithm using Opportunistic Visual
+ Beacons and Dead-Reckoning for GNSS-Denied Large-scale Applications
+
+
+ With the development of smart cities, the demand for continuous pedestrian
+navigation in large-scale urban environments has significantly increased. While
+global navigation satellite systems (GNSS) provide low-cost and reliable
+positioning services, they are often hindered in complex urban canyon
+environments. Thus, exploring opportunistic signals for positioning in urban
+areas has become a key solution. Augmented reality (AR) allows pedestrians to
+acquire real-time visual information. Accordingly, we propose a low-cost
+visual-inertial positioning solution. This method comprises a lightweight
+multi-scale group convolution (MSGC)-based visual place recognition (VPR)
+neural network, a pedestrian dead reckoning (PDR) algorithm, and a
+visual/inertial fusion approach based on a Kalman filter with gross error
+suppression. The VPR serves as a conditional observation to the Kalman filter,
+effectively correcting the errors accumulated through the PDR method. This
+enables the entire algorithm to ensure the reliability of long-term positioning
+in GNSS-denied areas. Extensive experimental results demonstrate that our
+method maintains stable positioning during large-scale movements. Compared to
+the lightweight MobileNetV3-based VPR method, our proposed VPR solution
+improves Recall@1 by at least 3% on two public datasets while reducing the
+number of parameters by 63.37%. It also achieves performance that is
+comparable to the VGG16-based method. The VPR-PDR algorithm improves
+localization accuracy by more than 40% compared to the original PDR.
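A minimal sketch of how a visual fix might correct dead-reckoning drift in a Kalman filter while rejecting gross errors. The abstract does not specify the filter state or the suppression rule, so everything here is an assumption for illustration: a scalar position state, an innovation gate measured in standard deviations, and the hypothetical names `gated_kalman_update` and `gate_sigma`.

```python
import math


def gated_kalman_update(x, p, z, r, gate_sigma=3.0):
    """One scalar Kalman measurement update with an innovation gate.

    x, p: predicted position and its variance (e.g. from dead reckoning).
    z, r: observed position and its variance (e.g. from a recognized place).
    gate_sigma: hypothetical threshold; observations whose innovation exceeds
        gate_sigma standard deviations are treated as gross errors and skipped.
    Returns (updated_x, updated_p, accepted).
    """
    innovation = z - x
    innovation_var = p + r
    if abs(innovation) > gate_sigma * math.sqrt(innovation_var):
        # Gross error suspected: keep the prediction unchanged.
        return x, p, False
    gain = p / innovation_var
    return x + gain * innovation, (1.0 - gain) * p, True


if __name__ == "__main__":
    print(gated_kalman_update(0.0, 1.0, 1.0, 1.0))    # plausible fix, accepted
    print(gated_kalman_update(0.0, 1.0, 100.0, 1.0))  # outlier, rejected
```

Because the update runs only when an observation is available and passes the gate, the observation acts conditionally: between accepted fixes the estimate is driven by dead reckoning alone.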
+
+
+
+
+
+
+
+
+ Julian D Parker, Anton Smirnov, Jordi Pons, CJ Carr, Zack Zukowski, Zach Evans, Xubo Liu
+
+
+ The tokenization of speech with neural audio codec models is a vital part of
+modern AI pipelines for the generation or understanding of speech, alone or in
+a multimodal context. Traditionally such tokenization models have concentrated
+on low parameter-count architectures using only components with strong
+inductive biases. In this work we show that by scaling a transformer
+architecture with large parameter count to this problem, and applying a
+flexible Finite Scalar Quantization (FSQ) based bottleneck, it is possible to
+reach state-of-the-art speech quality at extremely low bit-rates of 400 or
+700 bits per second. The trained models strongly outperform existing
+baselines in both objective and subjective tests.
+
+
+
+
+
+
+
+ ☆ Feedback-driven object detection and iterative model improvement
+
+
+
+
+
+
+
+
+ Sönke Tenckhoff, Mario Koddenbrock, Erik Rodner
+
+
+ Automated object detection has become increasingly valuable across diverse
+applications, yet efficient, high-quality annotation remains a persistent
+challenge. In this paper, we present the development and evaluation of a
+platform designed to interactively improve object detection models. The
+platform allows uploading and annotating images as well as fine-tuning object
+detection models. Users can then manually review and refine annotations,
+further creating improved snapshots that are used for automatic object
+detection on subsequent image uploads, a process we refer to as semi-automatic
+annotation, which results in a significant gain in annotation efficiency.
+ Whereas iterative refinement of model results to speed up annotation has
+become common practice, we are the first to quantitatively evaluate its
+benefits with respect to time, effort, and interaction savings. Our
+experimental results show clear evidence for a significant time reduction of up
+to 53% for semi-automatic compared to manual annotation. Importantly, these
+efficiency gains did not compromise annotation quality: the semi-automatic
+annotations matched or occasionally even exceeded the accuracy of manual annotations. These findings
+demonstrate the potential of our lightweight annotation platform for creating
+high-quality object detection datasets and provide best practices to guide
+future development of annotation platforms.
+ The platform is open-source, with the frontend and backend repositories
+available on GitHub.
+
+
+
+ comment: AI4EA24 preprint
+
+
+
+
+
+
+ ☆ GradAlign for Training-free Model Performance Inference
+
+
+ Architecture plays an important role in deciding the performance of deep
+neural networks. However, the search for the optimal architecture is often
+hindered by the vast search space, making it a time-intensive process.
+Recently, a novel approach known as training-free neural architecture search
+(NAS) has emerged, aiming to discover the ideal architecture without
+necessitating extensive training. Training-free NAS leverages various
+indicators for architecture selection, including metrics such as the count of
+linear regions, the density of per-sample losses, and the stability of the
+finite-width Neural Tangent Kernel (NTK) matrix. Despite the competitive
+empirical performance of current training-free NAS techniques, they suffer from
+certain limitations, including inconsistent performance and a lack of deep
+understanding. In this paper, we introduce GradAlign, a simple yet effective
+method designed for inferring model performance without the need for training.
+At its core, GradAlign quantifies the extent of conflicts within per-sample
+gradients during initialization, as substantial conflicts hinder model
+convergence and ultimately result in worse performance. We evaluate GradAlign
+against established training-free NAS methods using standard NAS benchmarks,
+showing a better overall performance. Moreover, we show that the widely adopted
+metric of linear region count may not suffice as a dependable criterion for
+selecting network architectures at initialization.
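A minimal sketch of one way to quantify conflict among per-sample gradients at initialization. The abstract does not give the GradAlign formula, so this is an illustrative assumption rather than the authors' score: it uses the mean pairwise cosine similarity of per-sample gradient vectors, where lower values indicate more conflict. The function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(per_sample_grads):
    """Average cosine similarity over all pairs of per-sample gradients.

    Values near 1 mean the samples agree on the update direction;
    values near -1 mean they pull in opposite directions (high conflict).
    """
    pairs = list(combinations(per_sample_grads, 2))
    if not pairs:
        raise ValueError("need at least two per-sample gradients")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [1.0, 0.0]]
    conflicting = [[1.0, 0.0], [-1.0, 0.0]]
    print(mean_pairwise_cosine(aligned), mean_pairwise_cosine(conflicting))
```

In a training-free setting, the per-sample gradients would come from a single backward pass per sample on the untrained network, so candidate architectures can be ranked without any optimization steps.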
+
+
+
+
+
+
+
+ ☆ Rethinking the initialization of Momentum in Federated Learning with
+ Heterogeneous Data
+
+
+ Data heterogeneity is a major challenge for Federated Learning performance.
+Recently, momentum-based optimization techniques have been proven to be
+effective in mitigating the heterogeneity issue. Along with the model updates,
+the momentum updates are transmitted to the server side and aggregated.
+Therefore, the local training initialized with a global momentum is guided by
+the global history of the gradients. However, we identify a problem: the
+traditional accumulation of momentum is suboptimal in Federated Learning
+systems. Traditional momentum places less weight on historical gradients and
+more on recent gradients. This, however, gives more influence to the biased
+local gradients at the end of local training. In this work, we propose a new way
+to calculate the estimated momentum used in local initialization. The proposed
+method is named Reversed Momentum Federated Learning (RMFL). The key idea is
+to assign exponentially decayed weights to the gradients as time moves
+forward, which is the opposite of traditional momentum accumulation. The
+effectiveness of RMFL is evaluated on three popular benchmark datasets with
+different heterogeneity levels.
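A minimal sketch contrasting the two weighting schemes described above. The abstract gives no formula, so the exact form is an assumption: traditional accumulation is taken as m_t = beta * m_{t-1} + g_t, which weights the gradient at step i by beta^(T-1-i), and the reversed variant is taken to weight it by beta^i. The function names are hypothetical.

```python
def standard_momentum_weights(num_steps, beta):
    """Weight of each gradient (step 0 first) after num_steps of
    m_t = beta * m_{t-1} + g_t: the most recent gradient weighs the most."""
    return [beta ** (num_steps - 1 - i) for i in range(num_steps)]


def reversed_momentum_weights(num_steps, beta):
    """Assumed reversed scheme: weights decay exponentially as time moves
    forward, so the earliest local gradient weighs the most."""
    return [beta ** i for i in range(num_steps)]


def weighted_gradient_sum(gradients, weights):
    """Combine scalar gradients with the given per-step weights."""
    return sum(w * g for w, g in zip(weights, gradients))


if __name__ == "__main__":
    grads = [1.0, 2.0, 4.0]  # toy scalar gradients from three local steps
    print(weighted_gradient_sum(grads, standard_momentum_weights(3, 0.5)))
    print(weighted_gradient_sum(grads, reversed_momentum_weights(3, 0.5)))
```

Under this reading, late local steps, where the model has drifted furthest toward the client's own data, contribute least to the momentum estimate.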
+
+
+ We present an efficient reduction that converts any machine learning
+algorithm into an interactive protocol, enabling collaboration with another
+party (e.g., a human) to achieve consensus on predictions and improve accuracy.
+This approach imposes calibration conditions on each party, which are
+computationally and statistically tractable relaxations of Bayesian
+rationality. These conditions are sensible even in prior-free settings,
+representing a significant generalization of Aumann's classic "agreement
+theorem."
+ In our protocol, the model first provides a prediction. The human then
+responds by either agreeing or offering feedback. The model updates its state
+and revises its prediction, while the human may adjust their beliefs. This
+iterative process continues until the two parties reach agreement. Initially,
+we study a setting that extends Aumann's Agreement Theorem, where parties aim
+to agree on a one-dimensional expectation by iteratively sharing their current
+estimates. Here, we recover the convergence theorem of Aaronson'05 under weaker
+assumptions. We then address the case where parties hold beliefs over
+distributions with d outcomes, exploring two feedback mechanisms. The first
+involves vector-valued estimates of predictions, while the second adopts a
+decision-theoretic approach: the human, needing to take an action from a finite
+set based on utility, communicates their utility-maximizing action at each
+round. In this setup, the number of rounds until agreement remains independent
+of d. Finally, we generalize to scenarios with more than two parties, where
+computational complexity scales linearly with the number of participants. Our
+protocols rely on simple, efficient conditions and produce predictions that
+surpass the accuracy of any individual party's alone.
+
+
+ Grounding the instruction in the environment is a key step in solving
+language-guided goal-reaching reinforcement learning problems. In automated
+reinforcement learning, a key concern is to enhance the model's ability to
+generalize across various tasks and environments. In goal-reaching scenarios,
+the agent must comprehend the different parts of the instructions within the
+environmental context in order to complete the overall task successfully. In
+this work, we propose CAREL (Cross-modal Auxiliary REinforcement Learning) as a
+new framework to solve this problem using auxiliary loss functions inspired by
+video-text retrieval literature and a novel method called instruction tracking,
+which automatically keeps track of progress in an environment. The results of
+our experiments suggest superior sample efficiency and systematic
+generalization for this framework in multi-modal reinforcement learning
+problems. Our code base is available here.
+
+
+
+
+
+
+
+ ☆ MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks
+
+
+ Recently, human motion analysis has experienced great improvement due to
+inspiring generative models such as the denoising diffusion model and large
+language model. However, existing approaches mainly focus on generating
+motions from textual descriptions and overlook the reciprocal task. In this
+paper, we present MoTe, a unified multi-modal model that can handle
+diverse tasks by learning the marginal, conditional, and joint distributions of
+motion and text simultaneously. MoTe enables us to handle paired
+text-motion generation, motion captioning, and text-driven motion generation by
+simply modifying the input context. Specifically, MoTe is composed of three
+components: Motion Encoder-Decoder (MED), Text Encoder-Decoder (TED), and
+Motion-Text Diffusion Model (MTDM). In particular, MED and TED are trained for
+extracting latent embeddings, and subsequently reconstructing the motion
+sequences and textual descriptions from the extracted embeddings, respectively.
+MTDM, on the other hand, performs an iterative denoising process on the input
+context to handle diverse tasks. Experimental results on the benchmark datasets
+demonstrate the superior performance of our proposed method on text-to-motion
+generation and competitive performance on motion captioning.
+
+
+
+ comment: Five figures, six tables
+
+
+
+
+
+
+ ☆ Machine learning force-field model for kinetic Monte Carlo simulations
+ of itinerant Ising magnets
+
+
+ We present a scalable machine learning (ML) framework for large-scale kinetic
+Monte Carlo (kMC) simulations of itinerant electron Ising systems. As the
+effective interactions between Ising spins in such itinerant magnets are
+mediated by conducting electrons, the calculation of energy change due to a
+local spin update requires solving an electronic structure problem. Such
+repeated electronic structure calculations can be prohibitively expensive
+for large systems. Assuming the locality principle, a convolutional neural
+network (CNN) model is developed to directly predict the effective local field
+and the corresponding energy change associated with a given spin update based
+on the Ising configuration in a finite neighborhood. Because the kernel size of the CNN
+is fixed, the model scales directly to kMC simulations
+of large lattices. Our approach is reminiscent of the ML force-field models
+widely used in first-principles molecular dynamics simulations. Applying our ML
+framework to a square-lattice double-exchange Ising model, we uncover unusual
+coarsening of ferromagnetic domains at low temperatures. Our work highlights
+the potential of ML methods for large-scale modeling of similar itinerant
+systems with discrete dynamical variables.
+
+
+
+ comment: 11 pages, 7 figures
+
+
+
+
+
+
+ ☆ PerLA: Perceptive 3D Language Assistant
+
+
+
+
+
+
+
+
+ Guofeng Mei, Wei Lin, Luigi Riz, Yujiao Wu, Fabio Poiesi, Yiming Wang
+
+
+ Enabling Large Language Models (LLMs) to understand the 3D physical world is
+an emerging yet challenging research direction. Current strategies for
+processing point clouds typically downsample the scene or divide it into
+smaller parts for separate analysis. However, both approaches risk losing key
+local details or global contextual information. In this paper, we introduce
+PerLA, a 3D language assistant designed to be more perceptive to both details
+and context, making visual representations more informative for the LLM. PerLA
+captures high-resolution (local) details in parallel from different point cloud
+areas and integrates them with (global) context obtained from a
+lower-resolution whole point cloud. We present a novel algorithm that preserves
+point cloud locality through the Hilbert curve and effectively aggregates
+local-to-global information via cross-attention and a graph neural network.
+Lastly, we introduce a novel loss for local representation consensus to promote
+training stability. PerLA outperforms state-of-the-art 3D language assistants,
+with gains of up to +1.34 CIDEr on ScanQA for question answering, and +4.22 on
+ScanRefer and +3.88 on Nr3D for dense captioning.
+https://gfmei.github.io/PerLA/
+
+
+
+
+
+
+
+ ☆ LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware
+ Omni-Modal Perception of Long Videos
+
+
+ Despite impressive advancements in video understanding, most efforts remain
+limited to coarse-grained or visual-only video tasks. However, real-world
+videos encompass omni-modal information (vision, audio, and speech) with a
+series of events forming a cohesive storyline. The lack of multi-modal video
+data with fine-grained event annotations and the high cost of manual labeling
+are major obstacles to comprehensive omni-modality video perception. To address
+this gap, we propose an automatic pipeline consisting of high-quality
+multi-modal video filtering, semantically coherent omni-modal event boundary
+detection, and cross-modal correlation-aware event captioning. In this way, we
+present LongVALE, the first-ever Vision-Audio-Language Event understanding
+benchmark comprising 105K omni-modal events with precise temporal boundaries
+and detailed relation-aware captions within 8.4K high-quality long videos.
+Further, we build a baseline that leverages LongVALE to enable video large
+language models (LLMs) for omni-modality fine-grained temporal video
+understanding for the first time. Extensive experiments demonstrate the
+effectiveness and great potential of LongVALE in advancing comprehensive
+multi-modal video understanding.
+
+
+
+ comment: 18 pages, 15 figures
+
+
+
+
+
+
+ ☆ Riemannian Denoising Score Matching for Molecular Structure Optimization
+ with Accurate Energy
+
+
+
+
+
+
+
+
+ Jeheon Woo, Seonghwan Kim, Jun Hyeong Kim, Woo Youn Kim
+
+
+ This study introduces a modified score matching method aimed at generating
+molecular structures with high energy accuracy. The denoising process of score
+matching or diffusion models mirrors molecular structure optimization, where
+scores act like physical force fields that guide particles toward equilibrium
+states. To achieve energetically accurate structures, it can be advantageous to
+have the score closely approximate the gradient of the actual potential energy
+surface. Unlike conventional methods that simply design the target score based
+on structural differences in Euclidean space, we propose a Riemannian score
+matching approach. This method represents molecular structures on a manifold
+defined by physics-informed internal coordinates to efficiently mimic the
+energy landscape, and performs noising and denoising within this space. Our
+method has been evaluated by refining several types of starting structures on
+the QM9 and GEOM datasets, demonstrating that the proposed Riemannian score
+matching method significantly improves the accuracy of the generated molecular
+structures, attaining chemical accuracy. The implications of this study extend
+to various applications in computational chemistry, offering a robust tool for
+accurate molecular structure prediction.
+
+
+
+
+
+
+
+ ☆ Stock Price Prediction using Multi-Faceted Information based on Deep
+ Recurrent Neural Networks
+
+
+ Accurate prediction of stock market trends is crucial for informed investment
+decisions and effective portfolio management, ultimately leading to enhanced
+wealth creation and risk mitigation. This study proposes a novel approach for
+predicting stock prices in the stock market by integrating Convolutional Neural
+Networks (CNN) and Long Short-Term Memory (LSTM) networks, using sentiment
+analysis of social network data and candlestick data (price). The proposed
+methodology consists of two primary components: sentiment analysis of social
+network data and candlestick data. By amalgamating candlestick data with insights
+gleaned from Twitter, this approach facilitates a more detailed and accurate
+examination of market trends and patterns, ultimately leading to more effective
+stock price predictions. Additionally, a Random Forest algorithm is used to
+classify tweets as either positive or negative, allowing for a more subtle and
+informed assessment of market sentiment. This study uses CNN and LSTM networks
+to predict stock prices. The CNN extracts short-term features, while the LSTM
+models long-term dependencies. The integration of both networks enables a more
+comprehensive analysis of market trends and patterns, leading to more accurate
+stock price predictions.
+
+
+
+
+
+
+
+ ☆ Forecasting Foreign Exchange Market Prices Using Technical Indicators
+ with Deep Learning and Attention Mechanism
+
+
+ Accurate prediction of price behavior in the foreign exchange market is
+crucial. This paper proposes a novel approach that leverages technical
+indicators and deep neural networks. The proposed architecture consists of a
+Long Short-Term Memory (LSTM) network, a Convolutional Neural Network (CNN), and
+an attention mechanism. Initially, trend and oscillation technical indicators are
+employed to extract statistical features from Forex currency pair data,
+providing insights into price trends, market volatility, relative price
+strength, and overbought and oversold conditions. Subsequently, the LSTM and
+CNN networks are utilized in parallel to predict future price movements,
+leveraging the strengths of both recurrent and convolutional architectures. The
+LSTM network captures long-term dependencies and temporal patterns in the data,
+while the CNN network extracts local patterns. The outputs of the parallel LSTM
+and CNN networks are then fed into an attention mechanism, which learns to
+weigh the importance of each feature and temporal dependency, generating a
+context-aware representation of the input data. The attention-weighted output
+is then used to predict future price movements, enabling the model to focus on
+the most relevant features and temporal dependencies. Through a comprehensive
+evaluation of the proposed approach on multiple Forex currency pairs, we
+demonstrate its effectiveness in predicting price behavior and outperforming
+benchmark models.
+
+
+
+
+
+
+
+ ☆ LaVIDE: A Language-Vision Discriminator for Detecting Changes in
+ Satellite Image with Map References
+
+
+ Change detection, which typically relies on the comparison of bi-temporal
+images, is significantly hindered when only a single image is available.
+Comparing a single image with an existing map, such as OpenStreetMap, which is
+continuously updated through crowd-sourcing, offers a viable solution to this
+challenge. Unlike images that carry low-level visual details of ground objects,
+maps convey high-level categorical information. This discrepancy in abstraction
+levels complicates the alignment and comparison of the two data types. In this
+paper, we propose a Language-VIsion Discriminator
+for dEtecting changes in satellite images with map references, namely
+LaVIDE, which leverages language to bridge the information gap between maps
+and images. Specifically, LaVIDE formulates change detection as the problem of
+"Does the pixel belong to [class]?", aligning maps and images
+within the feature space of the language-vision model to associate high-level
+map categories with low-level image details. Moreover, we build a
+mixture-of-experts discriminative module, which compares linguistic features
+from maps with visual features from images across various semantic
+perspectives, achieving comprehensive semantic comparison for change detection.
+Extensive evaluation on four benchmark datasets demonstrates that LaVIDE can
+effectively detect changes in satellite images with map references,
+outperforming state-of-the-art change detection algorithms, e.g., with gains of
+about 13.8% on the DynamicEarthNet dataset and 4.3% on the SECOND
+dataset.
+
+
+ Fine-tuning foundation models often compromises their robustness to
+distribution shifts. To remedy this, most robust fine-tuning methods aim to
+preserve the pre-trained features. However, not all pre-trained features are
+robust and those methods are largely indifferent to which ones to preserve. We
+propose dual risk minimization (DRM), which combines empirical risk
+minimization with worst-case risk minimization, to better preserve the core
+features of downstream tasks. In particular, we utilize core-feature
+descriptions generated by LLMs to induce core-based zero-shot predictions which
+then serve as proxies to estimate the worst-case risk. DRM balances two crucial
+aspects of model robustness: expected performance and worst-case performance,
+establishing a new state of the art on various real-world benchmarks. DRM
+significantly improves the out-of-distribution performance of CLIP ViT-L/14@336
+on ImageNet (75.9 to 77.1), WILDS-iWildCam (47.1 to 51.8), and WILDS-FMoW (50.7
+to 53.1), opening up new avenues for robust fine-tuning. Our code is available
+at https://github.com/vaynexie/DRM .
+
+
+
+
+
+
+
+
+ Yihao Wang, Marcus Klasson, Matias Turkulainen, Shuzhe Wang, Juho Kannala, Arno Solin
+
+
+ Gaussian splatting enables fast novel view synthesis in static 3D
+environments. However, reconstructing real-world environments remains
+challenging as distractors or occluders break the multi-view consistency
+assumption required for accurate 3D reconstruction. Most existing methods rely
+on external semantic information from pre-trained models, introducing
+additional computational overhead as pre-processing steps or during
+optimization. In this work, we propose a novel method, DeSplat, that directly
+separates distractors and static scene elements purely based on volume
+rendering of Gaussian primitives. We initialize Gaussians within each camera
+view for reconstructing the view-specific distractors to separately model the
+static 3D scene and distractors in the alpha compositing stages. DeSplat yields
+an explicit scene separation of static elements and distractors, achieving
+comparable results to prior distractor-free approaches without sacrificing
+rendering speed. We demonstrate DeSplat's effectiveness on three benchmark data
+sets for distractor-free novel view synthesis. See the project website at
+https://aaltoml.github.io/desplat/.
+
+
+
+
+
+
+
+ ☆ A Multi-Loss Strategy for Vehicle Trajectory Prediction: Combining
+ Off-Road, Diversity, and Directional Consistency Losses
+
+
+ Trajectory prediction is essential for the safety and efficiency of planning
+in autonomous vehicles. However, current models often fail to fully capture
+complex traffic rules and the complete range of potential vehicle movements.
+Addressing these limitations, this study introduces three novel loss functions:
+Offroad Loss, Direction Consistency Error, and Diversity Loss. These functions
+are designed to keep predicted paths within driving area boundaries, aligned
+with traffic directions, and cover a wider variety of plausible driving
+scenarios. As all prediction modes should adhere to road rules and conditions,
+this work overcomes the shortcomings of traditional "winner takes all" training
+methods by applying the loss functions to all prediction modes. These loss
+functions not only improve model training but can also serve as metrics for
+evaluating the realism and diversity of trajectory predictions. Extensive
+validation on the nuScenes and Argoverse 2 datasets with leading baseline
+models demonstrates that our approach not only maintains accuracy but
+significantly improves safety and robustness, reducing offroad errors on
+average by 47% on original and by 37% on attacked scenes. This work sets a new
+benchmark for trajectory prediction in autonomous driving, offering substantial
+improvements in navigating complex environments. Our code is available at
+https://github.com/vita-epfl/stay-on-track .
+
+
+ Building operations consume approximately 40% of global energy, with Heating,
+Ventilation, and Air Conditioning (HVAC) systems responsible for up to 50% of
+this consumption. As HVAC energy demands are expected to rise, optimising
+system efficiency is crucial for reducing future energy use and mitigating
+climate change. Existing control strategies lack generalisation and require
+extensive training and data, limiting their rapid deployment across diverse
+buildings. This paper introduces HVAC-DPT, a Decision-Pretrained Transformer
+using in-context Reinforcement Learning (RL) for multi-zone HVAC control.
+HVAC-DPT frames HVAC control as a sequential prediction task, training a causal
+transformer on interaction histories generated by diverse RL agents. This
+approach enables HVAC-DPT to refine its policy in-context, without modifying
+network parameters, allowing for deployment across different buildings without
+the need for additional training or data collection. HVAC-DPT reduces energy
+consumption in unseen buildings by 45% compared to the baseline controller,
+offering a scalable and effective approach to mitigating the increasing
+environmental impact of HVAC systems.
+
+
+
+ comment: 7 pages, 3 figures, 3 tables
+
+
+
+
+
+
+ ☆ Amplifying human performance in combinatorial competitive programming
+
+
+
+
+
+
+
+
+ Petar Veličković, Alex Vitvitskyi, Larisa Markeeva, Borja Ibarz, Lars Buesing, Matej Balog, Alexander Novikov
+
+
+ Recent years have seen a significant surge in complex AI systems for
+competitive programming, capable of performing at admirable levels against
+human competitors. While steady progress has been made, the highest percentiles
+still remain out of reach for these methods on standard competition platforms
+such as Codeforces. Here we instead focus on combinatorial competitive
+programming, where the target is to find as-good-as-possible solutions to
+otherwise computationally intractable problems, over specific given inputs. We
+hypothesise that this scenario offers a unique testbed for human-AI synergy, as
+human programmers can write a backbone of a heuristic solution, after which AI
+can be used to optimise the scoring function used by the heuristic. We deploy
+our approach on previous iterations of Hash Code, a global team programming
+competition inspired by NP-hard software engineering problems at Google, and we
+leverage FunSearch to evolve our scoring functions. Our evolved solutions
+significantly improve the attained scores from their baseline, successfully
+breaking into the top percentile on all previous Hash Code online qualification
+rounds, and outperforming the top human teams on several. Our method is also
+performant on an optimisation problem that featured in a recent held-out
+AtCoder contest.
+
+
+
+
+
+
+
+ ☆ Graph Neural Networks for Heart Failure Prediction on an EHR-Based
+ Patient Similarity Graph
+
+
+
+
+
+
+
+
+ Heloisa Oss Boll, Ali Amirahmadi, Amira Soliman, Stefan Byttner, Mariana Recamonde-Mendoza
+
+
+ Objective: In modern healthcare, accurately predicting diseases is a crucial
+matter. This study introduces a novel approach using graph neural networks
+(GNNs) and a Graph Transformer (GT) to predict the incidence of heart failure
+(HF) on a patient similarity graph at the next hospital visit. Materials and
+Methods: We used electronic health records (EHR) from the MIMIC-III dataset and
+applied the K-Nearest Neighbors (KNN) algorithm to create a patient similarity
+graph using embeddings from diagnoses, procedures, and medications. Three
+models - GraphSAGE, Graph Attention Network (GAT), and Graph Transformer (GT) -
+were implemented to predict HF incidence. Model performance was evaluated using
+F1 score, AUROC, and AUPRC metrics, and results were compared against baseline
+algorithms. An interpretability analysis was performed to understand the
+model's decision-making process. Results: The GT model demonstrated the best
+performance (F1 score: 0.5361, AUROC: 0.7925, AUPRC: 0.5168). Although the
+Random Forest (RF) baseline achieved a similar AUPRC value, the GT model
+offered enhanced interpretability due to the use of patient relationships in
+the graph structure. A joint analysis of attention weights, graph connectivity,
+and clinical features provided insight into model predictions across different
+classification groups. Discussion and Conclusion: Graph-based approaches such
+as GNNs provide an effective framework for predicting HF. By leveraging a
+patient similarity graph, GNNs can capture complex relationships in EHR data,
+potentially improving prediction accuracy and clinical interpretability.
+
+
+
+
+
+
+
+ ☆ A Note on Small Percolating Sets on Hypercubes via Generative AI
+
+
+ We apply a generative AI pattern-recognition technique called PatternBoost to
+study bootstrap percolation on hypercubes. With this, we slightly improve the
+best existing upper bound for the size of percolating subsets of the hypercube.
+
+
+
+
+
+
+
+ ☆ Improving generalization of robot locomotion policies via
+ Sharpness-Aware Reinforcement Learning
+
+
+ Reinforcement learning often requires extensive training data.
+Simulation-to-real transfer offers a promising approach to address this
+challenge in robotics. While differentiable simulators offer improved sample
+efficiency through exact gradients, they can be unstable in contact-rich
+environments and may lead to poor generalization. This paper introduces a novel
+approach integrating sharpness-aware optimization into gradient-based
+reinforcement learning algorithms. Our simulation results demonstrate that our
+method, tested on contact-rich environments, significantly enhances policy
+robustness to environmental variations and action perturbations while
+maintaining the sample efficiency of first-order methods. Specifically, our
+approach improves action noise tolerance compared to standard first-order
+methods and achieves generalization comparable to zeroth-order methods. This
+improvement stems from finding flatter minima in the loss landscape, associated
+with better generalization. Our work offers a promising solution to balance
+efficient learning and robust sim-to-real transfer in robotics, potentially
+bridging the gap between simulation and real-world performance.
+
+
+
+ comment: 9 pages, 6 figures
+
+
+
+
+
+
+ ☆ Real-Time Anomaly Detection in Video Streams
+
+
+ This thesis is part of a CIFRE agreement between the company Othello and the
+LIASD laboratory. The objective is to develop an artificial intelligence system
+that can detect dangers in a video stream in real time. To achieve this, a novel
+approach combining temporal and spatial analysis has been proposed. Several
+avenues have been explored to improve anomaly detection by integrating object
+detection, human pose detection, and motion analysis. For result
+interpretability, techniques commonly used for image analysis, such as
+activation and saliency maps, have been extended to videos, and an original
+method has been proposed. The proposed architecture performs binary or
+multiclass classification depending on whether an alert or the cause needs to
+be identified. Numerous neural network models have been tested, and three of
+them have been selected. You Only Look Once (YOLO) has been used for spatial
+analysis, a Convolutional Recurrent Neural Network (CRNN) composed of VGG19
+and a Gated Recurrent Unit (GRU) for temporal analysis, and a multi-layer
+perceptron for classification. These models handle different types of data and
+can be combined in parallel or in series. Although the parallel mode is faster,
+the serial mode is generally more reliable. For training these models,
+supervised learning was chosen, and two proprietary datasets were created. The
+first dataset focuses on objects that may play a potential role in anomalies,
+while the second consists of videos containing anomalies or non-anomalies. This
+approach allows for the processing of both continuous video streams and finite
+videos, providing greater flexibility in detection.
+
+
+ In light of the inherently complex and dynamic nature of real-world
+environments, incorporating risk measures is crucial for the robustness
+evaluation of deep learning models. In this work, we propose a Risk-Averse
+Certification framework for Bayesian neural networks called RAC-BNN. Our method
+leverages sampling and optimisation to compute a sound approximation of the
+output set of a BNN, represented using a set of template polytopes. To enhance
+robustness evaluation, we integrate a coherent distortion risk
+measure--Conditional Value at Risk (CVaR)--into the certification framework,
+providing probabilistic guarantees based on empirical distributions obtained
+through sampling. We validate RAC-BNN on a range of regression and
+classification benchmarks and compare its performance with a state-of-the-art
+method. The results show that RAC-BNN effectively quantifies robustness under
+worst-performing risky scenarios, and achieves tighter certified bounds and
+higher efficiency in complex tasks.
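
The Conditional Value at Risk named in the abstract above can be illustrated on sampled values. The following is a minimal sketch of the standard empirical CVaR estimator only, not the RAC-BNN certification procedure; the function name `empirical_cvar`, the convention that larger values are worse, and the sample data are assumptions made for illustration.

```python
import math


def empirical_cvar(losses, alpha=0.95):
    """Mean of the worst (1 - alpha) fraction of sampled losses.

    Assumes larger values are worse. This is the textbook empirical
    estimator, not the paper's certified bound.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    ordered = sorted(losses, reverse=True)  # worst samples first
    # Round before ceil so floating-point noise does not enlarge the tail.
    tail_size = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    return sum(ordered[:tail_size]) / tail_size


# Ten hypothetical sampled losses; the 20% tail holds the two worst values.
sample_losses = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
tail_mean = empirical_cvar(sample_losses, alpha=0.8)
```

With `alpha=0.8` the estimator averages the two largest samples, so it reports how bad the outcome is on average once the sampled output lands in its worst fifth.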
+
+
+
+
+
+
+
+ ☆ Towards Santali Linguistic Inclusion: Building the First
+ Santali-to-English Translation Model using mT5 Transformer and Data
+ Augmentation
+
+
+
+
+
+
+
+
+ Syed Mohammed Mostaque Billah, Ateya Ahmed Subarna, Sudipta Nandi Sarna, Ahmad Shawkat Wasit, Anika Fariha, Asif Sushmit, Arig Yousuf Sadeque
+
+
+ Around seven million individuals in India, Bangladesh, Bhutan, and Nepal
+speak Santali, positioning it as nearly the third most commonly used
+Austroasiatic language. Despite its prominence among the Austroasiatic language
+family's Munda subfamily, Santali lacks global recognition. Currently, no
+translation models exist for the Santali language. Our paper aims to bring
+Santali into the NLP spectrum. We aim to examine the feasibility of building
+Santali translation models based on available Santali corpora. The paper
+successfully addressed the low-resource problem and, with promising results,
+examined the possibility of creating a functional Santali machine translation
+model in a low-resource setup. Our study shows that the Santali-English parallel
+corpus yields better results with pretrained transformers like mT5 than with
+untrained transformers, indicating that transfer learning is a viable technique
+for the Santali language. Moreover, with the mT5 transformer, the Santali-English
+parallel corpus performs better than the Santali-Bangla parallel corpus, as mT5
+was trained on far more English data than Bangla data. Lastly, our study shows that with
+data augmentation, our model performs better.
+
+
+
+
+
+
+
+ ☆ JetFormer: An Autoregressive Generative Model of Raw Images and Text
+
+
+
+
+
+
+
+
+ Michael Tschannen, André Susano Pinto, Alexander Kolesnikov
+
+
+ Removing modeling constraints and unifying architectures across domains has
+been a key driver of the recent progress in training large multimodal models.
+However, most of these models still rely on many separately trained components
+such as modality-specific encoders and decoders. In this work, we further
+streamline joint generative modeling of images and text. We propose an
+autoregressive decoder-only transformer - JetFormer - which is trained to
+directly maximize the likelihood of raw data, without relying on any separately
+pretrained components, and can understand and generate both text and images.
+Specifically, we leverage a normalizing flow model to obtain a soft-token image
+representation that is jointly trained with an autoregressive multimodal
+transformer. The normalizing flow model serves as both an image encoder for
+perception tasks and an image decoder for image generation tasks during
+inference. JetFormer achieves text-to-image generation quality competitive with
+recent VQ-VAE- and VAE-based baselines. These baselines rely on pretrained
+image autoencoders, which are trained with a complex mixture of losses,
+including perceptual ones. At the same time, JetFormer demonstrates robust
+image understanding capabilities. To the best of our knowledge, JetFormer is
+the first model that is capable of generating high-fidelity images and
+producing strong log-likelihood bounds.
+
+
+
+
+
+
+
+
+ Tomás Hüttebräucker, Simone Fiorellino, Mohamed Sana, Paolo Di Lorenzo, Emilio Calvanese Strinati
+
+
+ In multi-user semantic communication, language mismatch poses a significant
+challenge when independently trained agents interact. We present a novel
+semantic equalization algorithm that enables communication between agents with
+different languages without additional retraining. Our algorithm is based on
+relative representations, a framework that enables different agents employing
+different neural network models to have a unified representation. It proceeds by
+projecting the latent vectors of different models into a common space defined
+relative to a set of data samples called anchors, whose number equals
+the dimension of the resulting space. A communication between different agents
+translates to a communication of semantic symbols sampled from this relative
+space. This approach, in addition to aligning the semantic representations of
+different agents, allows compressing the amount of information being exchanged,
+by appropriately selecting the number of anchors. Finally, we introduce a
+novel anchor selection strategy, which advantageously determines prototypical
+anchors, capturing the most relevant information for the downstream task. Our
+numerical results show the effectiveness of the proposed approach, allowing
+seamless communication between agents with radically different models,
+including differences in terms of neural network architecture and datasets used
+for initial training.
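
The projection onto an anchor-defined space described above can be sketched in a few lines. This is an illustrative toy, assuming cosine similarity as the comparison function and a simple 2-D rotation as the mismatch between two agents' latent spaces; the helper names (`relative_representation`, `rotate`) and all data are hypothetical, and the paper's anchor selection strategy is not reproduced.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relative_representation(latent, anchor_latents):
    """Describe a latent vector by its similarity to each anchor's latent."""
    return [cosine(latent, anchor) for anchor in anchor_latents]


def rotate(vec, theta):
    """Rotate a 2-D vector; stands in for a differently trained encoder."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * vec[0] - s * vec[1], s * vec[0] + c * vec[1]]


# Agent A encodes three anchor samples and one message sample.
anchors_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sample_a = [2.0, 1.0]

# Agent B encodes the same data in a rotated latent space (assumed mismatch).
anchors_b = [rotate(a, 0.7) for a in anchors_a]
sample_b = rotate(sample_a, 0.7)

rel_a = relative_representation(sample_a, anchors_a)
rel_b = relative_representation(sample_b, anchors_b)
```

Both agents obtain the same three-dimensional vector even though their absolute latent coordinates differ, and the output dimension equals the number of anchors, which is what allows the anchor count to control how much information is exchanged.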
+
+
+
+
+
+
+
+
+ Gasser Elazab, Torben Gräber, Michael Unterreiner, Olaf Hellwich
+
+
+ Self-supervised monocular depth estimation (MDE) has gained popularity for
+obtaining depth predictions directly from videos. However, these methods often
+produce scale invariant results, unless additional training signals are
+provided. Addressing this challenge, we introduce a novel self-supervised
+metric-scaled MDE model that requires only monocular video data and the
+camera's mounting position, both of which are readily available in modern
+vehicles. Our approach leverages planar-parallax geometry to reconstruct scene
+structure. The full pipeline consists of three main networks: a multi-frame
+network, a single-frame network, and a pose network. The multi-frame network
+processes sequential frames to estimate the structure of the static scene using
+planar-parallax geometry and the camera mounting position. Based on this
+reconstruction, it acts as a teacher, distilling knowledge such as scale
+information, masked drivable area, metric-scale depth for the static scene, and
+dynamic object mask to the single-frame network. It also aids the pose network
+in predicting a metric-scaled relative pose between two subsequent images. Our
+method achieved state-of-the-art results for the driving benchmark KITTI for
+metric-scaled depth prediction. Notably, it is one of the first methods to
+produce self-supervised metric-scaled depth prediction for the challenging
+Cityscapes dataset, demonstrating its effectiveness and versatility.
+
+
+
+ comment: Accepted at WACV 25, project page: https://mono-pp.github.io/
+
+
+
+
+
+
+ ☆ Forensics Adapter: Adapting CLIP for Generalizable Face Forgery
+ Detection
+
+
+ We describe the Forensics Adapter, an adapter network designed to transform
+CLIP into an effective and generalizable face forgery detector. Although CLIP
+is highly versatile, adapting it for face forgery detection is non-trivial as
+forgery-related knowledge is entangled with a wide range of unrelated
+knowledge. Existing methods treat CLIP merely as a feature extractor, lacking
+task-specific adaptation, which limits their effectiveness. To address this, we
+introduce an adapter to learn face forgery traces -- the blending boundaries
+unique to forged faces, guided by task-specific objectives. Then we enhance the
+CLIP visual tokens with a dedicated interaction strategy that communicates
+knowledge across CLIP and the adapter. Since the adapter is alongside CLIP, its
+versatility is highly retained, naturally ensuring strong generalizability in
+face forgery detection. With only 5.7M trainable parameters, our method
+achieves a significant performance boost, improving by approximately 7%
+on average across five standard datasets. We believe the proposed method can
+serve as a baseline for future CLIP-based face forgery detection methods.
+
+
+
+
+
+
+
+ ☆ The Streetscape Application Services Stack (SASS): Towards a Distributed
+ Sensing Architecture for Urban Applications
+
+
+ As urban populations grow, cities are becoming more complex, driving the
+deployment of interconnected sensing systems to realize the vision of smart
+cities. These systems aim to improve safety, mobility, and quality of life
+through applications that integrate diverse sensors with real-time
+decision-making. Streetscape applications, which focus on challenges like
+pedestrian safety and adaptive traffic management, depend on managing
+distributed, heterogeneous sensor data, aligning information across time and
+space, and enabling real-time processing. These tasks are inherently complex
+and often difficult to scale. The Streetscape Application Services Stack (SASS)
+addresses these challenges with three core services: multimodal data
+synchronization, spatiotemporal data fusion, and distributed edge computing. By
+structuring these capabilities as clear, composable abstractions with clear
+semantics, SASS allows developers to scale streetscape applications efficiently
+while minimizing the complexity of multimodal integration.
+ We evaluated SASS in two real-world testbed environments: a controlled
+parking lot and an urban intersection in a major U.S. city. These testbeds
+allowed us to test SASS under diverse conditions, demonstrating its practical
+applicability. The Multimodal Data Synchronization service reduced temporal
+misalignment errors by 88%, achieving synchronization accuracy within 50
+milliseconds. The Spatiotemporal Data Fusion service improved detection accuracy
+for pedestrians and vehicles by over 10%, leveraging multicamera integration.
+The Distributed Edge Computing service increased system throughput by more than
+an order of magnitude. Together, these results show how SASS provides the
+abstractions and performance needed to support real-time, scalable urban
+applications, bridging the gap between sensing infrastructure and actionable
+streetscape intelligence.
+
+
+
+
+
+
+
+ ☆ Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating
+ RAG Systems COLING 2025
+
+
+
+
+
+
+
+
+ Rafael Teixeira de Lima, Shubham Gupta, Cesar Berrospi, Lokesh Mishra, Michele Dolfi, Peter Staar, Panagiotis Vagenas
+
+
+ Retrieval Augmented Generation (RAG) systems are a widespread application of
+Large Language Models (LLMs) in the industry. While many tools exist empowering
+developers to build their own systems, measuring their performance locally,
+with datasets reflective of the system's use cases, is a technological
+challenge. Solutions to this problem range from non-specific and cheap (most
+public datasets) to specific and costly (generating data from local documents).
+In this paper, we show that using public question and answer (Q&A) datasets to
+assess retrieval performance can lead to non-optimal systems design, and that
+common tools for RAG dataset generation can lead to unbalanced data. We propose
+solutions to these issues based on the characterization of RAG datasets through
+labels and through label-targeted data generation. Finally, we show that
+fine-tuned small LLMs can efficiently generate Q&A datasets. We believe that
+these observations are invaluable to the know-your-data step of RAG systems
+development.
+
+
+
+ comment: to be published in the 31st International Conference on Computational
+ Linguistics (COLING 2025)
+
+
+
+
+
+
+ ☆ Fast Mutual Information Computation for Large Binary Datasets
+
+
+ Mutual Information (MI) is a powerful statistical measure that quantifies
+shared information between random variables, particularly valuable in
+high-dimensional data analysis across fields like genomics, natural language
+processing, and network science. However, computing MI becomes computationally
+prohibitive for large datasets, which typically require a pairwise
+computational approach in which each column is compared to every other. This work
+introduces a matrix-based algorithm that accelerates MI computation by
+leveraging vectorized operations and optimized matrix calculations. By
+transforming traditional pairwise computational approaches into bulk matrix
+operations, the proposed method enables efficient MI calculation across all
+variable pairs. Experimental results demonstrate significant performance
+improvements, with computation times reduced by a factor of up to 50,000 in the largest
+dataset using optimized implementations, particularly when utilizing
+hardware-optimized frameworks. The approach promises to expand MI's applicability in
+data-driven research by overcoming previous computational limitations.
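
The idea of replacing per-pair passes over the data with one bulk matrix operation can be sketched as follows. This is a small pure-Python illustration under stated assumptions, not the paper's implementation: a single Gram product `X^T X` gives the joint count of ones for every column pair, the other three cells of each 2x2 contingency table follow by subtraction, and MI is reported in nats. The function names and the toy matrix are hypothetical, and a practical version would use a vectorized or hardware-accelerated library for the product.

```python
import math


def gram(X):
    """X^T X for a binary matrix given as a list of rows (co-occurrence counts)."""
    d = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(d)] for i in range(d)]


def pairwise_mutual_information(X):
    """MI (in nats) between every pair of binary columns of X."""
    n, d = len(X), len(X[0])
    n11 = gram(X)                          # both columns equal 1
    ones = [n11[i][i] for i in range(d)]   # per-column count of ones
    mi = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            # (joint count, marginal count for column i, marginal count for column j)
            cells = (
                (n11[i][j], ones[i], ones[j]),
                (ones[i] - n11[i][j], ones[i], n - ones[j]),
                (ones[j] - n11[i][j], n - ones[i], ones[j]),
                (n - ones[i] - ones[j] + n11[i][j], n - ones[i], n - ones[j]),
            )
            # p_ab * log(p_ab / (p_a * p_b)), skipping empty cells.
            mi[i][j] = sum(
                (c / n) * math.log(c * n / (a * b)) for c, a, b in cells if c > 0
            )
    return mi


# Toy data: columns 0 and 1 are identical; column 2 is independent of column 0.
X = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
mi_matrix = pairwise_mutual_information(X)
```

Only `gram` touches the raw rows; everything after it works on the `d x d` count matrix, which is why the expensive step maps well onto optimized matrix routines.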
+
+
+
+
+
+
+
+ ☆ Explaining the Impact of Training on Vision Models via Activation
+ Clustering
+
+
+
+
+
+
+
+
+ Ahcène Boubekki, Samuel G. Fadel, Sebastian Mair
+
+
+ Recent developments in the field of explainable artificial intelligence (XAI)
+for vision models investigate the information extracted by their feature
+encoder. We contribute to this effort and propose Neuro-Activated Vision
+Explanations (NAVE), which extracts the information captured by the encoder by
+clustering the feature activations of the frozen network to be explained. The
+method does not aim to explain the model's prediction but to answer questions
+such as which parts of the image are processed similarly or which information
+is kept in deeper layers. Experimentally, we leverage NAVE to show that the
+training dataset and the level of supervision affect which concepts are
+captured. In addition, our method reveals the impact of registers on vision
+transformers (ViT) and the information saturation caused by the watermark
+Clever Hans effect in the training set.
+
+
+
+
+
+
+
+ ☆ Gated-Attention Feature-Fusion Based Framework for Poverty Prediction ICDE
+
+
+
+
+
+
+
+
+ Muhammad Umer Ramzan, Wahab Khaddim, Muhammad Ehsan Rana, Usman Ali, Manohar Ali, Fiaz ul Hassan, Fatima Mehmood
+
+
+ This research paper addresses the significant challenge of accurately
+estimating poverty levels using deep learning, particularly in developing
+regions where traditional methods like household surveys are often costly,
+infrequent, and quickly become outdated. To address these issues, we propose a
+state-of-the-art Convolutional Neural Network (CNN) architecture, extending the
+ResNet50 model by incorporating a Gated-Attention Feature-Fusion Module (GAFM).
+Our architecture is designed to improve the model's ability to capture and
+combine both global and local features from satellite images, leading to more
+accurate poverty estimates. The model achieves a 75% R2 score, significantly
+outperforming existing leading methods in poverty mapping. This improvement is
+due to the model's capacity to focus on and refine the most relevant features,
+filtering out unnecessary data, which makes it a powerful tool for remote
+sensing and poverty estimation.
+
+
+
+ comment: The paper has been accepted for publication at the 5th International
+ Conference on Data Engineering and Communication Technology (ICDECT)
+
+
+
+
+
+
+ ☆ SURE-VQA: Systematic Understanding of Robustness Evaluation in Medical
+ VQA Tasks
+
+
+
+
+
+
+
+
+ Kim-Celine Kahl, Selen Erkan, Jeremias Traub, Carsten T. Lüth, Klaus Maier-Hein, Lena Maier-Hein, Paul F. Jaeger
+
+
+ Vision-Language Models (VLMs) have great potential in medical tasks, like
+Visual Question Answering (VQA), where they could act as interactive assistants
+for both patients and clinicians. Yet their robustness to distribution shifts
+on unseen data remains a critical concern for safe deployment. Evaluating such
+robustness requires a controlled experimental setup that allows for systematic
+insights into the model's behavior. However, we demonstrate that current setups
+fail to offer sufficiently thorough evaluations, limiting their ability to
+accurately assess model robustness. To address this gap, our work introduces a
+novel framework, called SURE-VQA, centered around three key requirements to
+overcome the current pitfalls and systematically analyze the robustness of
+VLMs: 1) Since robustness on synthetic shifts does not necessarily translate to
+real-world shifts, robustness should be measured on real-world shifts that are
+inherent to the VQA data; 2) Traditional token-matching metrics often fail to
+capture underlying semantics, necessitating the use of large language models
+(LLMs) for more accurate semantic evaluation; 3) Model performance often lacks
+interpretability due to missing sanity baselines, thus meaningful baselines
+should be reported that allow assessing the multimodal impact on the VLM. To
+demonstrate the relevance of this framework, we conduct a study on the
+robustness of various fine-tuning methods across three medical datasets with
+four different types of distribution shifts. Our study reveals several
+important findings: 1) Sanity baselines that do not utilize image data can
+perform surprisingly well; 2) We confirm LoRA as the best-performing PEFT
+method; 3) No PEFT method consistently outperforms others in terms of
+robustness to shifts. Code is provided at https://github.com/IML-DKFZ/sure-vqa.
+
+
+ Under stringent privacy constraints, whether federated recommendation systems
+can achieve group fairness remains an inadequately explored question. Taking
+gender fairness as a representative issue, we identify three phenomena in
+federated recommendation systems: performance difference, data imbalance, and
+preference disparity. We discover that the state-of-the-art methods only focus
+on the first phenomenon. Consequently, their imposition of inappropriate
+fairness constraints detrimentally affects the model training. Moreover, due to
+insufficient sensitive attribute protection of existing works, we can infer the
+gender of all users with 99.90% accuracy even with the addition of maximal
+noise. In this work, we propose Privacy-Preserving Orthogonal Aggregation
+(PPOA), which employs the secure aggregation scheme and quantization technique,
+to prevent the suppression of minority groups by the majority and preserve the
+distinct preferences for better group fairness. PPOA can assist different
+groups in obtaining their respective model aggregation results through a
+designed orthogonal mapping while keeping their attributes private.
+Experimental results on three real-world datasets demonstrate that PPOA
+enhances recommendation effectiveness for both females and males by up to 8.25%
+and 6.36%, respectively, with a maximum overall improvement of 7.30%, and
+achieves optimal fairness in most cases. Extensive ablation experiments and
+visualizations indicate that PPOA successfully maintains preferences for
+different gender groups.
+
+
+
+ comment: accepted by WSDM 2025
+
+
+
+
+
+
+ ☆ On the Performance Analysis of Momentum Method: A Frequency Domain
+ Perspective
+
+
+ Momentum-based optimizers are widely adopted for training neural networks.
+However, the optimal selection of momentum coefficients remains elusive. This
+uncertainty impedes a clear understanding of the role of momentum in stochastic
+gradient methods. In this paper, we present a frequency domain analysis
+framework that interprets the momentum method as a time-variant filter for
+gradients, where adjustments to momentum coefficients modify the filter
+characteristics. Our experiments support this perspective and provide a deeper
+understanding of the mechanism involved. Moreover, our analysis reveals the
+following significant findings: high-frequency gradient components are
+undesired in the late stages of training; preserving the original gradient in
+the early stages, and gradually amplifying low-frequency gradient components
+during training both enhance generalization performance. Based on these
+insights, we propose Frequency Stochastic Gradient Descent with Momentum
+(FSGDM), a heuristic optimizer that dynamically adjusts the momentum filtering
+characteristic with an empirically effective dynamic magnitude response.
+Experimental results demonstrate the superiority of FSGDM over conventional
+momentum optimizers.
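
The filter view of momentum can be made concrete for the simplest case. The sketch below assumes the exponential-moving-average form m_t = beta * m_{t-1} + (1 - beta) * g_t with a fixed coefficient, whose transfer function is H(omega) = (1 - beta) / (1 - beta * e^{-j omega}); the abstract's time-variant filter and the FSGDM optimizer itself are not reproduced, and the function name is hypothetical.

```python
import cmath
import math


def ema_magnitude_response(beta, omega):
    """|H(e^{j*omega})| of the filter m_t = beta * m_{t-1} + (1 - beta) * g_t."""
    return abs((1.0 - beta) / (1.0 - beta * cmath.exp(-1j * omega)))


# Gain at the lowest frequency (a constant gradient) and at the highest
# representable frequency (a gradient that flips sign every step).
low_freq_gain = ema_magnitude_response(0.9, 0.0)
high_freq_gain = ema_magnitude_response(0.9, math.pi)
```

For `beta = 0.9` the constant component passes with unit gain while the sign-flipping component is attenuated to (1 - beta) / (1 + beta), so raising the coefficient suppresses high-frequency gradient components more strongly.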
+
+
+
+
+
+
+
+ ☆ Multimodal Whole Slide Foundation Model for Pathology
+
+
+
+
+
+
+
+
+ Tong Ding, Sophia J. Wagner, Andrew H. Song, Richard J. Chen, Ming Y. Lu, Andrew Zhang, Anurag J. Vaidya, Guillaume Jaume, Muhammad Shaban, Ahrong Kim, Drew F. K. Williamson, Bowen Chen, Cristina Almagro-Perez, Paul Doucet, Sharifa Sahai, Chengkuan Chen, Daisuke Komura, Akihiro Kawabe, Shumpei Ishikawa, Georg Gerber, Tingying Peng, Long Phi Le, Faisal Mahmood
+
+
+ The field of computational pathology has been transformed with recent
+advances in foundation models that encode histopathology regions of interest
+(ROIs) into versatile and transferable feature representations via
+self-supervised learning (SSL). However, translating these advancements to
+address complex clinical challenges at the patient and slide level remains
+constrained by limited clinical data in disease-specific cohorts, especially
+for rare clinical conditions. We propose TITAN, a multimodal whole slide
+foundation model pretrained using 335,645 WSIs via visual self-supervised
+learning and vision-language alignment with corresponding pathology reports and
+423,122 synthetic captions generated from a multimodal generative AI copilot
+for pathology. Without any finetuning or requiring clinical labels, TITAN can
+extract general-purpose slide representations and generate pathology reports
+that generalize to resource-limited clinical scenarios such as rare disease
+retrieval and cancer prognosis. We evaluate TITAN on diverse clinical tasks and
+find that TITAN outperforms both ROI and slide foundation models across machine
+learning settings such as linear probing, few-shot and zero-shot
+classification, rare cancer retrieval and cross-modal retrieval, and pathology
+report generation.
+
+
+
+ comment: The code is accessible at https://github.com/mahmoodlab/TITAN
+
+
+
+
+
+
+ ☆ Nonparametric Instrumental Regression via Kernel Methods is Minimax
+ Optimal
+
+
+
+
+
+
+
+
+ Dimitri Meunier, Zhu Li, Tim Christensen, Arthur Gretton
+
+
+ We study the kernel instrumental variable algorithm of
+Singh et al. (2019), a nonparametric two-stage least squares (2SLS)
+procedure which has demonstrated strong empirical performance. We provide a
+convergence analysis that covers both the identified and unidentified settings:
+when the structural function cannot be identified, we show that the kernel NPIV
+estimator converges to the IV solution with minimum norm. Crucially, our
+convergence is with respect to the strong $L_2$-norm, rather than a
+pseudo-norm. Additionally, we characterize the smoothness of the target
+function without relying on the instrument, instead leveraging a new
+description of the projected subspace size (this being closely related to the
+link condition in inverse learning literature). With the subspace size
+description and under standard kernel learning assumptions, we derive, for the
+first time, the minimax optimal learning rate for kernel NPIV in the strong
+$L_2$-norm. Our result demonstrates that the strength of the instrument is
+essential to achieve efficient learning. We also improve the original kernel
+NPIV algorithm by adopting a general spectral regularization in stage 1
+regression. The modified regularization can overcome the saturation effect of
+Tikhonov regularization.
+
+
+ Text-guided image generation and editing using diffusion models have achieved
+remarkable advancements. Among these, tuning-free methods have gained attention
+for their ability to perform edits without extensive model adjustments,
+offering simplicity and efficiency. However, existing tuning-free approaches
+often struggle with balancing fidelity and editing precision. Reconstruction
+errors in DDIM Inversion are partly attributed to the cross-attention mechanism
+in U-Net, which introduces misalignments during the inversion and
+reconstruction process. To address this, we analyze reconstruction from a
+structural perspective and propose a novel approach that replaces traditional
+cross-attention with uniform attention maps, significantly enhancing image
+reconstruction fidelity. Our method effectively minimizes distortions caused by
+varying text conditions during noise prediction. To complement this
+improvement, we introduce an adaptive mask-guided editing technique that
+integrates seamlessly with our reconstruction approach, ensuring consistency
+and accuracy in editing tasks. Experimental results demonstrate that our
+approach not only excels in achieving high-fidelity image reconstruction but
+also performs robustly in real image composition and editing scenarios. This
+study underscores the potential of uniform attention maps to enhance the
+fidelity and versatility of diffusion-based image processing methods. Code is
+available at https://github.com/Mowenyii/Uniform-Attention-Maps.
+
+
+
+ comment: Accepted to WACV 2025
+
+
+
+
+
+
+ ☆ CogACT: A Foundational Vision-Language-Action Model for Synergizing
+ Cognition and Action in Robotic Manipulation
+
+
+ The advancement of large Vision-Language-Action (VLA) models has
+significantly improved robotic manipulation in terms of language-guided task
+execution and generalization to unseen scenarios. While existing VLAs adapted
+from pretrained large Vision-Language-Models (VLM) have demonstrated promising
+generalizability, their task performance is still unsatisfactory as indicated
+by the low task success rates in different environments. In this paper, we
+present a new advanced VLA architecture derived from VLM. Unlike previous works
+that directly repurpose VLM for action prediction by simple action
+quantization, we propose a componentized VLA architecture that has a specialized
+action module conditioned on VLM output. We systematically study the design of
+the action module and demonstrate the strong performance enhancement with
+diffusion action transformers for action sequence modeling, as well as their
+favorable scaling behaviors. We also conduct comprehensive experiments and
+ablation studies to evaluate the efficacy of our models with varied designs.
+The evaluation on 5 robot embodiments in simulation and the real world shows that
+our model not only significantly surpasses existing VLAs in task performance
+but also exhibits remarkable adaptation to new robots and generalization to
+unseen objects and backgrounds. It exceeds the average success rate of OpenVLA,
+which has a model size similar to ours (7B), by over 35% in simulated evaluation
+and 55% in real robot experiments. It also outperforms the large RT-2-X model
+(55B) by 18% in absolute success rate in simulation. Code and models can be found
+on our project page (https://cogact.github.io/).
+
+
+ Modern recommendation systems frequently employ online learning to
+dynamically update their models with freshly collected data. The most commonly
+used optimizer for updating neural networks in these contexts is the Adam
+optimizer, which integrates momentum ($m_t$) and adaptive learning rate
+($v_t$). However, the volatile nature of online learning data, characterized by
+its frequent distribution shifts and the presence of noise, poses significant
+challenges to Adam's standard optimization process: (1) Adam may use outdated
+momentum and the average of squared gradients, resulting in slower adaptation
+to distribution changes, and (2) Adam's performance is adversely affected by
+data noise. To mitigate these issues, we introduce CAdam, a confidence-based
+optimization strategy that assesses the consistency between the momentum and
+the gradient for each parameter dimension before deciding on updates. If
+momentum and gradient are in sync, CAdam proceeds with parameter updates
+according to Adam's original formulation; if not, it temporarily withholds
+updates and monitors potential shifts in data distribution in subsequent
+iterations. This method allows CAdam to distinguish between the true
+distributional shifts and mere noise, and adapt more quickly to new data
+distributions. Our experiments with both synthetic and real-world datasets
+demonstrate that CAdam surpasses other well-known optimizers, including the
+original Adam, in efficiency and noise robustness. Furthermore, in large-scale
+A/B testing within a live recommendation system, CAdam significantly enhances
+model performance compared to Adam, leading to substantial increases in the
+system's gross merchandise volume (GMV).
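+
+ A minimal sketch of the confidence check described above, not the authors'
+implementation. It assumes that "in sync" means the updated momentum and the
+current gradient share a sign in a given dimension, and that the moment
+estimates keep updating while a parameter update is withheld. The function
+name cadam_step and the default hyperparameters are illustrative.
+
```python
import math


def cadam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One CAdam-style step over flat lists of floats.

    Returns new (params, m, v). t is the 1-based step count used for
    Adam's bias correction.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Standard Adam moment updates.
        m_i = beta1 * m_i + (1 - beta1) * g
        v_i = beta2 * v_i + (1 - beta2) * g * g
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        # Assumed confidence test: momentum and gradient agree in sign.
        if m_i * g > 0:
            p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        # Otherwise the parameter update is withheld for this dimension.
        new_params.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


if __name__ == "__main__":
    p, m, v = cadam_step([0.0, 0.0], [1.0, -1.0], [0.5, 0.5], [1.0, 1.0], t=10)
    print(p)
```
+
+ In the usage line, the second coordinate keeps its value because its gradient
+opposes the accumulated momentum, while the first coordinate takes an ordinary
+Adam step.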
+
+
+
+
+
+
+
+ ☆ Learned Random Label Predictions as a Neural Network Complexity Metric
+
+
+ We empirically investigate the impact of learning randomly generated labels
+in parallel to class labels in supervised learning on memorization, model
+complexity, and generalization in deep neural networks. To this end, we
+introduce a multi-head network architecture as an extension of standard CNN
+architectures. Inspired by methods used in fair AI, our approach allows for the
+unlearning of random labels, preventing the network from memorizing individual
+samples. Based on the concept of Rademacher complexity, we first use our
+proposed method as a complexity metric to analyze the effects of common
+regularization techniques and challenge the traditional understanding of
+feature extraction and classification in CNNs. Second, we propose a novel
+regularizer that effectively reduces sample memorization. However, contrary to
+the predictions of classical statistical learning theory, we do not observe
+improvements in generalization.
+
+
+
+
+
+
+
+ ☆ PACMANN: Point Adaptive Collocation Method for Artificial Neural
+ Networks
+
+
+ Physics-Informed Neural Networks (PINNs) are an emerging tool for
+approximating the solution of Partial Differential Equations (PDEs) in both
+forward and inverse problems. PINNs minimize a loss function which includes the
+PDE residual determined for a set of collocation points. Previous work has
+shown that the number and distribution of these collocation points have a
+significant influence on the accuracy of the PINN solution. Therefore, the
+effective placement of these collocation points is an active area of research.
+Specifically, adaptive collocation point sampling methods have been proposed,
+which have been reported to scale poorly to higher dimensions. In this work, we
+address this issue and present the Point Adaptive Collocation Method for
+Artificial Neural Networks (PACMANN). Inspired by classic optimization
+problems, this approach incrementally moves collocation points toward regions
+of higher residuals using gradient-based optimization algorithms guided by the
+gradient of the squared residual. We apply PACMANN to forward and inverse
+problems, and demonstrate that this method matches the performance of
+state-of-the-art methods in terms of the accuracy/efficiency tradeoff for
+low-dimensional problems, while outperforming available approaches for
+high-dimensional problems; the best performance is observed for the Adam
+optimizer. Key features of the method include its low computational cost and
+simplicity of integration in existing physics-informed neural network
+pipelines.
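+
+ The point-moving idea can be illustrated with a short sketch under stated
+assumptions. The residual function below is a hypothetical stand-in for a
+PINN's PDE residual, the gradient is taken by central finite differences, and
+plain gradient ascent replaces the Adam optimizer that the abstract reports as
+performing best. None of the names come from the paper.
+
```python
import math


def residual(x):
    # Hypothetical 1D residual, peaked at x = 0.7 (stand-in for a PDE residual).
    return math.exp(-((x - 0.7) ** 2) / 0.02)


def squared_residual_grad(x, h=1e-5):
    # Central finite-difference estimate of d/dx [residual(x)^2].
    return (residual(x + h) ** 2 - residual(x - h) ** 2) / (2 * h)


def move_points(points, step=1e-3, n_steps=5, lo=0.0, hi=1.0):
    """Move each collocation point uphill on the squared residual."""
    moved = []
    for x in points:
        for _ in range(n_steps):
            x = x + step * squared_residual_grad(x)
            x = min(max(x, lo), hi)  # keep the point inside the domain
        moved.append(x)
    return moved


if __name__ == "__main__":
    print(move_points([0.6, 0.7, 0.8]))
```
+
+ Points on either side of the peak drift toward it, so later training steps
+would evaluate the loss where the residual is largest.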
+
+
+
+ comment: 22 pages, 9 figures
+
+
+
+
+
+
+ ☆ Non-linear Equalization in 112 Gb/s PONs Using Kolmogorov-Arnold
+ Networks
+
+
+
+
+
+
+
+
+ Rodrigo Fischer, Patrick Matalla, Sebastian Randel, Laurent Schmalen
+
+
+ We investigate Kolmogorov-Arnold networks (KANs) for non-linear equalization
+of 112 Gb/s PAM4 passive optical networks (PONs). Using pruning and extensive
+hyperparameter search, we outperform linear equalizers and convolutional neural
+networks at low computational complexity.
+
+
+
+ comment: Submitted for possible publication at Optical Fiber Communication
+ Conference (OFC) 2025
+
+
+
+
+
+
+ ☆ OpenQDC: Open Quantum Data Commons
+
+
+
+
+
+
+
+
+ Cristian Gabellini, Nikhil Shenoy, Stephan Thaler, Semih Canturk, Daniel McNeela, Dominique Beaini, Michael Bronstein, Prudencio Tossou
+
+
+ Machine Learning Interatomic Potentials (MLIPs) are a highly promising
+alternative to force-fields for molecular dynamics (MD) simulations, offering
+precise and rapid energy and force calculations. However, Quantum-Mechanical
+(QM) datasets, crucial for MLIPs, are fragmented across various repositories,
+hindering accessibility and model development. We introduce the openQDC
+package, consolidating 37 QM datasets from over 250 quantum methods and 400
+million geometries into a single, accessible resource. These datasets are
+meticulously preprocessed and standardized for MLIP training, covering a wide
+range of chemical elements and interactions relevant in organic chemistry.
+OpenQDC includes tools for normalization and integration, easily accessible via
+Python. Experiments with well-known architectures like SchNet, TorchMD-Net, and
+DimeNet reveal challenges for those architectures and constitute a leaderboard
+to accelerate benchmarking and guide the development of novel algorithms. Continuously
+adding datasets to OpenQDC will democratize QM dataset access, foster more
+collaboration and innovation, enhance MLIP development, and support their
+adoption in the MD field.
+
+
+
+
+
+
+
+ ☆ Accelerating Multimodal Large Language Models via Dynamic Visual-Token
+ Exit and the Empirical Findings
+
+
+
+
+
+
+
+
+ Qiong Wu, Wenhao Lin, Weihao Ye, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji
+
+
+ The excessive use of visual tokens in existing Multimodal Large Language
+Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively
+expensive computation. To gain insights into this problem, we first conduct
+extensive empirical studies on the attention behaviors of MLLMs, and summarize
+three main inference stages in MLLMs: (i) Early fusion between tokens is first
+accomplished quickly. (ii) Intra-modality modeling then comes into play. (iii)
+Multimodal reasoning resumes and lasts until the end of inference. In
+particular, we reveal that visual tokens will stop contributing to reasoning
+when the text tokens receive enough image information, yielding obvious visual
+redundancy. Based on these generalized observations, we propose a simple yet
+effective method to improve the efficiency of MLLMs, termed dynamic
+visual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive
+the text token status and decide the removal of all visual tokens after a
+certain layer, thereby addressing the observed visual redundancy. To validate
+DyVTE, we apply it to a set of MLLMs, including LLaVA, VILA, Eagle and InternVL,
+and conduct extensive experiments on a range of benchmarks. The experimental
+results not only show the effectiveness of DyVTE in improving MLLMs'
+efficiency, but also reveal general modeling patterns of MLLMs, facilitating
+an in-depth understanding of MLLMs. Our code is anonymously
+released at https://github.com/DoubtedSteam/DyVTE.
+
+
+ Condensing large datasets into smaller synthetic counterparts has
+demonstrated its promise for image classification. However, previous research
+has overlooked a crucial concern in image recognition: ensuring that models
+trained on condensed datasets are unbiased towards protected attributes (PA),
+such as gender and race. Our investigation reveals that dataset distillation
+(DD) fails to alleviate the unfairness towards minority groups within original
+datasets. Moreover, this bias typically worsens in the condensed datasets due
+to their smaller size. To bridge the research gap, we propose a novel fair
+dataset distillation (FDD) framework, namely FairDD, which can be seamlessly
+applied to diverse matching-based DD approaches, requiring no modifications to
+their original architectures. The key innovation of FairDD lies in
+synchronously matching synthetic datasets to PA-wise groups of original
+datasets, rather than indiscriminate alignment to the whole distributions in
+vanilla DDs, dominated by majority groups. This synchronized matching allows
+synthetic datasets to avoid collapsing into majority groups and bootstrap their
+balanced generation to all PA groups. Consequently, FairDD could effectively
+regularize vanilla DDs to favor biased generation toward minority groups while
+maintaining the accuracy of target attributes. Theoretical analyses and
+extensive experimental evaluations demonstrate that FairDD significantly
+improves fairness compared to vanilla DD methods, without sacrificing
+classification accuracy. Its consistent superiority across diverse DDs,
+spanning Distribution and Gradient Matching, establishes it as a versatile FDD
+approach.
+
+
+
+
+
+
+
+
+ Attila Cangi, Lenz Fiedler, Bartosz Brzoza, Karan Shah, Timothy J. Callow, Daniel Kotik, Steve Schmerler, Matthew C. Barry, James M. Goff, Andrew Rohskopf, Dayton J. Vogel, Normand Modine, Aidan P. Thompson, Sivasankaran Rajamanickam
+
+
+ We present the Materials Learning Algorithms (MALA) package, a scalable
+machine learning framework designed to accelerate density functional theory
+(DFT) calculations suitable for large-scale atomistic simulations. Using local
+descriptors of the atomic environment, MALA models efficiently predict key
+electronic observables, including local density of states, electronic density,
+density of states, and total energy. The package integrates data sampling,
+model training and scalable inference into a unified library, while ensuring
+compatibility with standard DFT and molecular dynamics codes. We demonstrate
+MALA's capabilities with examples including boron clusters, aluminum across its
+solid-liquid phase boundary, and predicting the electronic structure of a
+stacking fault in a large beryllium slab. Scaling analyses reveal MALA's
+computational efficiency and identify bottlenecks for future optimization. With
+its ability to model electronic structures at scales far beyond standard DFT,
+MALA is well suited for modeling complex material systems, making it a
+versatile tool for advanced materials research.
+
+
+ Reconstructing images using Computed Tomography (CT) in an industrial context
+leads to specific challenges that differ from those encountered in other areas,
+such as clinical CT. Indeed, non-destructive testing with industrial CT will
+often involve scanning multiple similar objects while maintaining high
+throughput, requiring short scanning times, which is not a relevant concern in
+clinical CT. Under-sampling the tomographic data (sinograms) is a natural way
+to reduce the scanning time at the cost of image quality since the latter
+depends on the number of measurements. In such a scenario, post-processing
+techniques are required to compensate for the image artifacts induced by the
+sinogram sparsity. We introduce the Self-supervised Denoiser Framework (SDF), a
+self-supervised training method that leverages pre-training on highly sampled
+sinogram data to enhance the quality of images reconstructed from undersampled
+sinogram data. The main contribution of SDF is that it proposes to train an
+image denoiser in the sinogram space by setting the learning task as the
+prediction of one sinogram subset from another. As such, it does not require
+ground-truth image data, leverages the abundant data modality in CT, the
+sinogram, and can drastically enhance the quality of images reconstructed from
+a fraction of the measurements. We demonstrate that SDF produces better image
+quality, in terms of peak signal-to-noise ratio, than other analytical and
+self-supervised frameworks in both 2D fan-beam and 3D cone-beam CT settings.
+Moreover, we show that the enhancement provided by SDF carries over when
+fine-tuning the image denoiser on a few examples, making it a suitable
+pre-training technique in a context where there is little high-quality image
+data. Our results are established on experimental datasets, making SDF a strong
+candidate for being the building block of foundational image-enhancement models
+in CT.
+
+
+
+
+
+
+
+ ☆ LDA-AQU: Adaptive Query-guided Upsampling via Local Deformable Attention ACM MM2024
+
+
+
+
+
+
+
+
+ Zewen Du, Zhenjiang Hu, Guiyu Zhao, Ying Jin, Hongbin Ma
+
+
+ Feature upsampling is an essential operation in constructing deep
+convolutional neural networks. However, existing upsamplers either lack
+specific feature guidance or necessitate the utilization of high-resolution
+feature maps, resulting in a loss of performance and flexibility. In this
+paper, we find that the local self-attention naturally has the feature guidance
+capability, and its computational paradigm aligns closely with the essence of
+feature upsampling (i.e., feature reassembly of neighboring points). Therefore,
+we introduce local self-attention into the upsampling task and demonstrate that
+the majority of existing upsamplers can be regarded as special cases of
+upsamplers based on local self-attention. Considering the potential semantic
+gap between upsampled points and their neighboring points, we further introduce
+the deformation mechanism into the upsampler based on local self-attention,
+thereby proposing LDA-AQU. As a novel dynamic kernel-based upsampler, LDA-AQU
+utilizes the feature of queries to guide the model in adaptively adjusting the
+position and aggregation weight of neighboring points, thereby meeting the
+upsampling requirements across various complex scenarios. In addition, LDA-AQU
+is lightweight and can be easily integrated into various model architectures.
+We evaluate the effectiveness of LDA-AQU across four dense prediction tasks:
+object detection, instance segmentation, panoptic segmentation, and semantic
+segmentation. LDA-AQU consistently outperforms previous state-of-the-art
+upsamplers, achieving performance enhancements of 1.7 AP, 1.5 AP, 2.0 PQ, and
+2.5 mIoU compared to the baseline models in the aforementioned four tasks,
+respectively. Code is available at https://github.com/duzw9311/LDA-AQU.
+
+
+
+ comment: Accepted by ACM MM2024
+
+
+
+
+
+
+ ☆ Enhancing Sentiment Analysis in Bengali Texts: A Hybrid Approach Using
+ Lexicon-Based Algorithm and Pretrained Language Model Bangla-BERT
+
+
+ Sentiment analysis (SA) is a process of identifying the emotional tone or
+polarity within a given text and aims to uncover the user's complex emotions
+and inner feelings. While sentiment analysis has been extensively studied for
+languages like English, research in Bengali remains limited, particularly for
+fine-grained sentiment categorization. This work aims to bridge this gap by
+developing a novel approach that integrates rule-based algorithms with
+pre-trained language models. We developed a dataset from scratch, comprising
+over 15,000 manually labeled reviews. Next, we constructed a Lexicon Data
+Dictionary, assigning polarity scores to the reviews. We developed a novel
+rule-based algorithm, Bangla Sentiment Polarity Score (BSPS), capable of
+generating sentiment scores and classifying reviews into nine distinct
+sentiment categories. To assess the performance of this method, we evaluated
+the classified sentiments using BanglaBERT, a pre-trained transformer-based
+language model. We also performed sentiment classification directly with
+BanglaBERT on the original data and evaluated this model's results. Our
+analysis revealed that the BSPS + BanglaBERT hybrid approach outperformed the
+standalone BanglaBERT model, achieving higher accuracy, precision, and nuanced
+classification across the nine sentiment categories. The results of our study
+emphasize the value and effectiveness of combining rule-based and pre-trained
+language model approaches for enhanced sentiment analysis in Bengali and
+suggest pathways for future research and application in languages with similar
+linguistic complexities.
+
+
+ The Rubik's Cube, with its vast state space and sparse reward structure,
+presents a significant challenge for reinforcement learning (RL) due to the
+difficulty of reaching rewarded states. Previous research addressed this by
+propagating cost-to-go estimates from the solved state and incorporating search
+techniques. These approaches differ from human strategies that start from fully
+scrambled cubes, which can be tricky for solving a general sparse-reward
+problem. In this paper, we introduce a novel RL algorithm using policy gradient
+methods to solve the Rubik's Cube without relying on near solved-state sampling.
+Our approach employs a neural network to predict cost patterns between states,
+allowing the agent to learn directly from scrambled states. Our method was
+tested on the 2x2x2 Rubik's Cube, where the cube was scrambled 50,000 times, and
+the model successfully solved it in over 99.4% of cases. Notably, this result
+was achieved using only the policy network without relying on tree search as in
+previous methods, demonstrating its effectiveness and potential for broader
+applications in sparse-reward problems.
+
+
+
+
+
+
+
+ ☆ A Comprehensive Framework for Automated Segmentation of Perivascular
+ Spaces in Brain MRI with the nnU-Net
+
+
+
+
+
+
+
+
+ William Pham, Alexander Jarema, Donggyu Rim, Zhibin Chen, Mohamed S. H. Khlif, Vaughan G. Macefield, Luke A. Henderson, Amy Brodtmann
+
+
+ Background: Enlargement of perivascular spaces (PVS) is common in
+neurodegenerative disorders including cerebral small vessel disease,
+Alzheimer's disease, and Parkinson's disease. PVS enlargement may indicate
+impaired clearance pathways and there is a need for reliable PVS detection
+methods which are currently lacking. Aim: To optimise a widely used deep
+learning model, the no-new-UNet (nnU-Net), for PVS segmentation. Methods: In 30
+healthy participants (mean$\pm$SD age: 50$\pm$18.9 years; 13 females),
+T1-weighted MRI images were acquired using three different protocols on three
+MRI scanners (3T Siemens Tim Trio, 3T Philips Achieva, and 7T Siemens
+Magnetom). PVS were manually segmented across ten axial slices in each
+participant. Segmentations were completed using a sparse annotation strategy.
+In total, 11 models were compared using various strategies for image handling,
+preprocessing and semi-supervised learning with pseudo-labels. Model
+performance was evaluated using 5-fold cross validation (5FCV). The main
+performance metric was the Dice Similarity Coefficient (DSC). Results: The
+voxel-spacing agnostic model (mean$\pm$SD DSC=64.3$\pm$3.3%) outperformed
+models which resampled images to a common resolution (DSC=40.5-55%). Model
+performance improved substantially following iterative label cleaning
+(DSC=85.7$\pm$1.2%). Semi-supervised learning with pseudo-labels (n=12,740)
+from 18 additional datasets improved the agreement between raw and predicted
+PVS cluster counts (Lin's concordance correlation coefficient=0.89,
+95%CI=0.82-0.94). We extended the model to enable PVS segmentation in the
+midbrain (DSC=64.3$\pm$6.5%) and hippocampus (DSC=67.8$\pm$5%). Conclusions:
+Our deep learning models provide a robust and holistic framework for the
+automated quantification of PVS in brain MRI.
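+
+ For readers unfamiliar with the main metric above, the Dice Similarity
+Coefficient of two binary masks A and B is 2|A ∩ B| / (|A| + |B|). The sketch
+below computes it for flattened masks. The function name is illustrative, and
+returning 1.0 when both masks are empty is an assumed convention, not
+something stated in the abstract.
+
```python
def dice_coefficient(pred, target):
    """Dice Similarity Coefficient for two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


if __name__ == "__main__":
    print(dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0]))
```
+
+ A DSC reported as a percentage, as in the abstract, is this value multiplied
+by 100.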
+
+
+
+ comment: 46 pages, 8 figures, 2 tables
+
+
+
+
+
+
+ ☆ Initialization using Update Approximation is a Silver Bullet for
+ Extremely Efficient Low-Rank Fine-Tuning
+
+
+
+
+
+
+
+
+ Kaustubh Ponkshe, Raghav Singhal, Eduard Gorbunov, Alexey Tumanov, Samuel Horvath, Praneeth Vepakomma
+
+
+ Low-rank adapters have become a standard approach for efficiently fine-tuning
+large language models (LLMs), but they often fall short of achieving the
+performance of full fine-tuning. We propose a method, LoRA Silver Bullet or
+LoRA-SB, that approximates full fine-tuning within low-rank subspaces using a
+carefully designed initialization strategy. We theoretically demonstrate that
+the architecture of LoRA-XS, which inserts a trainable (r x r) matrix between B
+and A while keeping other matrices fixed, provides the precise conditions
+needed for this approximation. We leverage its constrained update space to
+achieve optimal scaling for high-rank gradient updates while removing the need
+for hyperparameter tuning. We prove that our initialization offers an optimal
+low-rank approximation of the initial gradient and preserves update directions
+throughout training. Extensive experiments across mathematical reasoning,
+commonsense reasoning, and language understanding tasks demonstrate that our
+approach exceeds the performance of standard LoRA while using 27-90x fewer
+parameters, and comprehensively outperforms LoRA-XS. Our findings establish
+that it is possible to simulate full fine-tuning in low-rank subspaces, and
+achieve significant efficiency gains without sacrificing performance. Our code
+is publicly available at https://github.com/RaghavSinghal10/lora-sb.
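+
+ The LoRA-XS structure mentioned above, a trainable (r x r) matrix R placed
+between fixed matrices B and A, can be sketched as follows. This shows only
+the shape of the update B R A and the trainable-parameter counts at equal
+rank. It does not reproduce the LoRA-SB initialization, which the abstract
+does not spell out, and all function names are illustrative.
+
```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_xs_update(b, r, a):
    """Weight update B @ R @ A; B (d_out x r) and A (r x d_in) stay frozen."""
    return matmul(matmul(b, r), a)


def trainable_params_lora(d_out, d_in, rank):
    # Standard LoRA trains both factors: B (d_out x rank) and A (rank x d_in).
    return rank * (d_out + d_in)


def trainable_params_lora_xs(rank):
    # Only the rank x rank matrix between B and A is trained.
    return rank * rank


if __name__ == "__main__":
    b = [[1, 0], [0, 1], [1, 1]]          # frozen, 3 x 2
    r = [[2, 0], [0, 3]]                  # trainable, 2 x 2
    a = [[1, 2, 3, 4], [5, 6, 7, 8]]      # frozen, 2 x 4
    print(lora_xs_update(b, r, a))
    print(trainable_params_lora(4096, 4096, 8), trainable_params_lora_xs(8))
```
+
+ The counts are per weight matrix at equal rank. The ratios reported in the
+abstract depend on the ranks and layers used in the experiments, which are not
+given here.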
+
+
+
+ comment: Kaustubh Ponkshe and Raghav Singhal contributed equally to this work
+
+ Discovering causal structures with latent variables from observational data
+is a fundamental challenge in causal discovery. Existing methods often rely on
+constraint-based, iterative discrete searches, limiting their scalability to
+large numbers of variables. Moreover, these methods frequently assume linearity
+or invertibility, restricting their applicability to real-world scenarios. We
+present new theoretical results on the identifiability of nonlinear latent
+hierarchical causal models, relaxing previous assumptions in the literature about
+the deterministic nature of latent variables and exogenous noise. Building on
+these insights, we develop a novel differentiable causal discovery algorithm
+that efficiently estimates the structure of such models. To the best of our
+knowledge, this is the first work to propose a differentiable causal discovery
+method for nonlinear latent hierarchical models. Our approach outperforms
+existing methods in both accuracy and scalability. We demonstrate its practical
+utility by learning interpretable hierarchical latent structures from
+high-dimensional image data and demonstrate its effectiveness on downstream
+tasks.
+
+
+
+ comment: 25 pages with references, 7 figures
+
+
+
+
+
+
+ ☆ Analysis of High-dimensional Gaussian Labeled-unlabeled Mixture Model
+ via Message-passing Algorithm
+
+
+ Semi-supervised learning (SSL) is a machine learning methodology that
+leverages unlabeled data in conjunction with a limited amount of labeled data.
+Although SSL has been applied in various applications and its effectiveness has
+been empirically demonstrated, it is still not fully understood when and why
+SSL performs well. Some existing theoretical studies have attempted to address
+this issue by modeling classification problems using the so-called Gaussian
+Mixture Model (GMM). These studies provide notable and insightful
+interpretations. However, their analyses are focused on specific purposes, and
+a thorough investigation of the properties of GMM in the context of SSL has
+been lacking. In this paper, we conduct such a detailed analysis of the
+properties of the high-dimensional GMM for binary classification in the SSL
+setting. To this end, we employ the approximate message passing and state
+evolution methods, which are widely used in high-dimensional settings and
+originate from statistical mechanics. We deal with two estimation approaches:
+the Bayesian one and the l2-regularized maximum likelihood estimation (RMLE).
+We conduct a comprehensive comparison between these two approaches, examining
+aspects such as the global phase diagram, estimation error for the parameters,
+and prediction error for the labels. A specific comparison is made between the
+Bayes-optimal (BO) estimator and RMLE, as the BO setting provides optimal
+estimation performance and is ideal as a benchmark. Our analysis shows that
+with appropriate regularizations, RMLE can achieve near-optimal performance in
+terms of both the estimation error and prediction error, especially when there
+is a large amount of unlabeled data. These results demonstrate that the l2
+regularization term plays an effective role in estimation and prediction in SSL
+approaches.
+
+
+
+
+
+
+
+ ☆ Bootstraping Clustering of Gaussians for View-consistent 3D Scene
+ Understanding
+
+
+
+
+
+
+
+
+ Wenbo Zhang, Lu Zhang, Ping Hu, Liqian Ma, Yunzhi Zhuge, Huchuan Lu
+
+
+ Injecting semantics into 3D Gaussian Splatting (3DGS) has recently garnered
+significant attention. While current approaches typically distill 3D semantic
+features from 2D foundational models (e.g., CLIP and SAM) to facilitate novel
+view segmentation and semantic understanding, their heavy reliance on 2D
+supervision can undermine cross-view semantic consistency and necessitate
+complex data preparation processes, therefore hindering view-consistent scene
+understanding. In this work, we present FreeGS, an unsupervised
+semantic-embedded 3DGS framework that achieves view-consistent 3D scene
+understanding without the need for 2D labels. Instead of directly learning
+semantic features, we introduce the IDentity-coupled Semantic Field (IDSF) into
+3DGS, which captures both semantic representations and view-consistent instance
+indices for each Gaussian. We optimize IDSF with a two-step alternating
+strategy: semantics help to extract coherent instances in 3D space, while the
+resulting instances regularize the injection of stable semantics from 2D space.
+Additionally, we adopt a 2D-3D joint contrastive loss to enhance the
+complementarity between view-consistent 3D geometry and rich semantics during
+the bootstrapping process, enabling FreeGS to uniformly perform tasks such as
+novel-view semantic segmentation, object selection, and 3D object detection.
+Extensive experiments on LERF-Mask, 3D-OVS, and ScanNet datasets demonstrate
+that FreeGS performs comparably to state-of-the-art methods while avoiding the
+complex data preprocessing workload.
+
+
+
+
+
+
+
+ ☆ Contextual Checkerboard Denoise -- A Novel Neural Network-Based Approach
+ for Classification-Aware OCT Image Denoising
+
+
+
+
+
+
+
+
+ Md. Touhidul Islam, Md. Abtahi M. Chowdhury, Sumaiya Salekin, Aye T. Maung, Akil A. Taki, Hafiz Imtiaz
+
+
+ In contrast to non-medical image denoising, where enhancing image clarity is
+the primary goal, medical image denoising warrants preservation of crucial
+features without introduction of new artifacts. However, many denoising methods
+that improve the clarity of the image inadvertently alter critical information
+of the denoised images, potentially compromising classification performance and
+diagnostic quality. Additionally, supervised denoising methods are not very
+practical in the medical image domain, since a ground-truth denoised version
+of a noisy medical image is often extremely challenging to obtain. In this
+paper, we tackle both of these problems by introducing a novel neural
+network-based method, Contextual Checkerboard Denoising, that can learn
+denoising from only a dataset of noisy images, while preserving crucial
+anatomical details necessary for image classification/analysis. We perform our
+experimentation on real Optical Coherence Tomography (OCT) images, and
+empirically demonstrate that our proposed method significantly improves image
+quality, providing clearer and more detailed OCT images, while enhancing
+diagnostic accuracy.
+
+
+
+ comment: Under review in Springer Journal of Medical Systems. Code available:
+ https://github.com/AbtahiMajeed/CheckerBoardDenoiser/tree/main
+
+
+
+
+
+
+ ☆ ReconDreamer: Crafting World Models for Driving Scene Reconstruction via
+ Online Restoration
+
+
+ Closed-loop simulation is crucial for end-to-end autonomous driving. Existing
+sensor simulation methods (e.g., NeRF and 3DGS) reconstruct driving scenes
+based on conditions that closely mirror training data distributions. However,
+these methods struggle with rendering novel trajectories, such as lane changes.
+Recent works have demonstrated that integrating world model knowledge
+alleviates these issues. Despite their efficiency, these approaches still
+encounter difficulties in the accurate representation of more complex
+maneuvers, with multi-lane shifts being a notable example. Therefore, we
+introduce ReconDreamer, which enhances driving scene reconstruction through
+incremental integration of world model knowledge. Specifically, DriveRestorer
+is proposed to mitigate artifacts via online restoration. This is complemented
+by a progressive data update strategy designed to ensure high-quality rendering
+for more complex maneuvers. To the best of our knowledge, ReconDreamer is the
+first method to effectively render large maneuvers. Experimental results
+demonstrate that ReconDreamer outperforms Street Gaussians in NTA-IoU,
+NTL-IoU, and FID, with relative improvements of 24.87%, 6.72%, and 29.97%,
+respectively.
+Furthermore, ReconDreamer surpasses DriveDreamer4D with PVG during large
+maneuver rendering, as verified by a relative improvement of 195.87% in the
+NTA-IoU metric and a comprehensive user study.
+
+
+ We introduce a novel state-space model (SSM)-based framework for
+skeleton-based human action recognition, with an anatomically-guided
+architecture that improves state-of-the-art performance in both clinical
+diagnostics and general action recognition tasks. Our approach decomposes
+skeletal motion analysis into spatial, temporal, and spatio-temporal streams,
+using channel partitioning to capture distinct movement characteristics
+efficiently. By implementing a structured, multi-directional scanning strategy
+within SSMs, our model captures local joint interactions and global motion
+patterns across multiple anatomical body parts. This anatomically-aware
+decomposition enhances the ability to identify subtle motion patterns critical
+in medical diagnosis, such as gait anomalies associated with neurological
+conditions. On public action recognition benchmarks, i.e., NTU RGB+D, NTU RGB+D
+120, and NW-UCLA, our model outperforms current state-of-the-art methods,
+achieving accuracy improvements up to $3.2\%$ with lower computational
+complexity than previous leading transformer-based models. We also introduce a
+novel medical dataset for motion-based patient neurological disorder analysis
+to validate our method's potential in automated disease diagnosis.
+
+
+
+
+
+
+
+ ☆ Deepfake Media Generation and Detection in the Generative AI Era: A
+ Survey and Outlook
+
+
+
+
+
+
+
+
+ Florinel-Alin Croitoru, Andrei-Iulian Hiji, Vlad Hondru, Nicolae Catalin Ristea, Paul Irofti, Marius Popescu, Cristian Rusu, Radu Tudor Ionescu, Fahad Shahbaz Khan, Mubarak Shah
+
+
+ With the recent advancements in generative modeling, the realism of deepfake
+content has been increasing at a steady pace, even reaching the point where
+people often fail to detect manipulated media content online, thus being
+deceived into various kinds of scams. In this paper, we survey deepfake
+generation and detection techniques, including the most recent developments in
+the field, such as diffusion models and Neural Radiance Fields. Our literature
+review covers all deepfake media types, comprising image, video, audio and
+multimodal (audio-visual) content. We identify various kinds of deepfakes,
+according to the procedure used to alter or generate the fake content. We
+further construct a taxonomy of deepfake generation and detection methods,
+illustrating the important groups of methods and the domains where these
+methods are applied. Next, we gather datasets used for deepfake detection and
+provide updated rankings of the best performing deepfake detectors on the most
+popular datasets. In addition, we develop a novel multimodal benchmark to
+evaluate deepfake detectors on out-of-distribution content. The results
+indicate that state-of-the-art detectors fail to generalize to deepfake content
+generated by unseen deepfake generators. Finally, we propose future directions
+to obtain robust and powerful deepfake detectors. Our project page and new
+benchmark are available at https://github.com/CroitoruAlin/biodeep.
+
+
+
+
+
+
+
+ ☆ Development of Low-Cost IoT Units for Thermal Comfort Measurement and AC
+ Energy Consumption Prediction System
+
+
+ In response to the substantial energy consumption in buildings, the Japanese
+government initiated the BI-Tech (Behavioral Insights X Technology) project in
+2019, aimed at promoting voluntary energy-saving behaviors through the
+utilization of AI and IoT technologies. Our study, aimed at small and
+medium-sized office buildings, introduces a cost-effective IoT-based BI-Tech
+system, utilizing the Raspberry Pi 4B+ platform for real-time monitoring of
+indoor thermal conditions and air conditioner (AC) set-point temperature.
+Employing machine learning and image recognition, the system analyzes data to
+calculate the PMV index and predict energy consumption changes due to
+temperature adjustments. The integration of mobile and desktop applications
+conveys this information to users, encouraging energy-efficient behavior
+modifications. The machine learning model achieved an R2 value of 97%,
+demonstrating the system's efficiency in promoting energy-saving habits among
+users.
+
+
+
+ comment: RoomVent2024 conference
+
+
+
+
+
+
+ ☆ QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain
+
+
+
+
+
+
+
+
+ Wenfang Sun, Yingjun Du, Gaowen Liu, Cees G. M. Snoek
+
+
+ We tackle the problem of quantifying the number of objects by a generative
+text-to-image model. Rather than retraining such a model for each new image
+domain of interest, which leads to high computational costs and limited
+scalability, we are the first to consider this problem from a domain-agnostic
+perspective. We propose QUOTA, an optimization framework for text-to-image
+models that enables effective object quantification across unseen domains
+without retraining. It leverages a dual-loop meta-learning strategy to optimize
+a domain-invariant prompt. Further, by integrating prompt learning with
+learnable counting and domain tokens, our method captures stylistic variations
+and maintains accuracy, even for object classes not encountered during
+training. For evaluation, we adopt a new benchmark specifically designed for
+object quantification in domain generalization, enabling rigorous assessment of
+object quantification accuracy and adaptability across unseen domains in
+text-to-image generation. Extensive experiments demonstrate that QUOTA
+outperforms conventional models in both object quantification accuracy and
+semantic consistency, setting a new benchmark for efficient and scalable
+text-to-image generation for any domain.
+
+
+ Recent advancements in fine-tuning proprietary language models enable
+customized applications across various domains but also introduce two major
+challenges: high resource demands and security risks. Regarding resource
+demands, recent work proposes novel partial compression, such as BitDelta, to
+quantize the delta weights between the fine-tuned model and base model.
+Regarding the security risks, user-defined fine-tuning can introduce security
+vulnerabilities, such as alignment issues, backdoor attacks, and
+hallucinations. However, most current efforts in security assessment focus on
+full-precision or full-compression models, and how partial compression methods
+affect security concerns has not been well discussed. To
+bridge this gap, we evaluate the robustness of delta-weight quantization
+against these security threats. In this paper, we uncover a "free lunch"
+phenomenon: partial compression can enhance model security against
+fine-tuning-based attacks with bearable utility loss. Using Llama-2-7b-chat as
+a case study, we show that, with under 10% utility degradation, the partial
+compression mitigates alignment-breaking risks by up to 66.17%, harmful
+backdoor vulnerabilities by 64.46%, and targeted output manipulation risks by
+up to 90.53%. We further apply LogitLens to visualize internal state
+transformations during forward passes, suggesting mechanisms for both security
+failure and recovery in standard versus compressed fine-tuning. This work
+offers new insights into selecting effective delta compression methods for
+secure, resource-efficient multi-tenant services.
+
+
+
+
+
+
+
+
+ Xianfeng Tan, Yuhan Li, Wenxiang Shang, Yubo Wu, Jian Wang, Xuanhong Chen, Yi Zhang, Ran Lin, Bingbing Ni
+
+
+ Standard clothing asset generation involves creating forward-facing flat-lay
+garment images displayed on a clear background by extracting clothing
+information from diverse real-world contexts, which presents significant
+challenges due to highly standardized sampling distributions and precise
+structural requirements in the generated images. Existing models have limited
+spatial perception and often exhibit structural hallucinations in this
+high-specification generative task. To address this issue, we propose a novel
+Retrieval-Augmented Generation (RAG) framework, termed RAGDiffusion, to enhance
+structure determinacy and mitigate hallucinations by assimilating external
+knowledge from LLMs and databases. RAGDiffusion consists of two core processes:
+(1) Retrieval-based structure aggregation, which employs contrastive learning
+and a Structure Locally Linear Embedding (SLLE) to derive global structure and
+spatial landmarks, providing both soft and hard guidance to counteract
+structural ambiguities; and (2) Omni-level faithful garment generation, which
+introduces a three-level alignment that ensures fidelity in structural,
+pattern, and decoding components within the diffusion process. Extensive experiments on
+challenging real-world datasets demonstrate that RAGDiffusion synthesizes
+structurally and detail-faithful clothing assets with significant performance
+improvements, representing a pioneering effort in high-specification faithful
+generation with RAG to confront intrinsic hallucinations and enhance fidelity.
+
+
+
+
+
+
+
+ ☆ DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow
+ Decoding
+
+
+
+
+
+
+
+
+ Jungbin Cho, Junwan Kim, Jisoo Kim, Minseo Kim, Mingu Kang, Sungeun Hong, Tae-Hyun Oh, Youngjae Yu
+
+
+ Human motion, inherently continuous and dynamic, presents significant
+challenges for generative models. Despite their dominance, discrete
+quantization methods, such as VQ-VAEs, suffer from inherent limitations,
+including restricted expressiveness and frame-wise noise artifacts. Continuous
+approaches, while producing smoother and more natural motions, often falter due
+to high-dimensional complexity and limited training data. To resolve this
+"discord" between discrete and continuous representations, we introduce
+DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding, a
+novel method that decodes discrete motion tokens into continuous motion through
+rectified flow. By employing an iterative refinement process in the continuous
+space, DisCoRD captures fine-grained dynamics and ensures smoother and more
+natural motions. Compatible with any discrete-based framework, our method
+enhances naturalness without compromising faithfulness to the conditioning
+signals. Extensive evaluations demonstrate that DisCoRD achieves
+state-of-the-art performance, with FID of 0.032 on HumanML3D and 0.169 on
+KIT-ML. These results solidify DisCoRD as a robust solution for bridging the
+divide between discrete efficiency and continuous realism. Our project page is
+available at: https://whwjdqls.github.io/discord.github.io/.
+
+
+
+ comment: 20 pages 18 figures
+
+
+
+
+
+
+ ☆ LokiTalk: Learning Fine-Grained and Generalizable Correspondences to
+ Enhance NeRF-based Talking Head Synthesis
+
+
+ Despite significant progress in talking head synthesis since the introduction
+of Neural Radiance Fields (NeRF), visual artifacts and high training costs
+persist as major obstacles to large-scale commercial adoption. We propose that
+identifying and establishing fine-grained and generalizable correspondences
+between driving signals and generated results can simultaneously resolve both
+problems. Here we present LokiTalk, a novel framework designed to enhance
+NeRF-based talking heads with lifelike facial dynamics and improved training
+efficiency. To achieve fine-grained correspondences, we introduce
+Region-Specific Deformation Fields, which decompose the overall portrait motion
+into lip movements, eye blinking, head pose, and torso movements. By
+hierarchically modeling the driving signals and their associated regions
+through two cascaded deformation fields, we significantly improve dynamic
+accuracy and minimize synthetic artifacts. Furthermore, we propose ID-Aware
+Knowledge Transfer, a plug-and-play module that learns generalizable dynamic
+and static correspondences from multi-identity videos, while simultaneously
+extracting ID-specific dynamic and static features to refine the depiction of
+individual characters. Comprehensive evaluations demonstrate that LokiTalk
+delivers superior high-fidelity results and training efficiency compared to
+previous methods. The code will be released upon acceptance.
+
+
+ This paper introduces the Density-Calibrated Conformal Quantile Regression
+(CQR-d) method, a novel approach for constructing prediction intervals that
+adapts to varying uncertainty across the feature space. Building upon conformal
+quantile regression, CQR-d incorporates local information through a weighted
+combination of local and global conformity scores, where the weights are
+determined by local data density. We prove that CQR-d provides valid marginal
+coverage at level $1 - \alpha - \epsilon$, where $\epsilon$ represents a small
+tolerance from numerical optimization. Through extensive simulation studies and
+an application to a heteroscedastic dataset available in R, we demonstrate
+that CQR-d maintains the desired coverage while producing substantially
+narrower prediction intervals compared to standard conformal quantile
+regression (CQR). Notably, in our application on heteroscedastic data, CQR-d
+achieves an $8.6\%$ reduction in average interval width while maintaining
+comparable coverage. The method's effectiveness is particularly pronounced in
+settings with clear local uncertainty patterns, making it a valuable tool for
+prediction tasks in heterogeneous data environments.
+
+
+
+
+
+
+
+ ☆ RL-MILP Solver: A Reinforcement Learning Approach for Solving
+ Mixed-Integer Linear Programs with Graph Neural Networks
+
+
+ Mixed-Integer Linear Programming (MILP) is an optimization technique widely
+used in various fields. Primal heuristics, which reduce the search space of
+MILP, have enabled traditional solvers (e.g., Gurobi) to efficiently find
+high-quality solutions. However, traditional primal heuristics rely on expert
+knowledge, motivating the advent of machine learning (ML)-based primal
+heuristics that learn repetitive patterns in MILP. Nonetheless, existing
+ML-based primal heuristics do not guarantee solution feasibility (i.e.,
+satisfying all constraints) and primarily focus on prediction for binary
+decision variables. When addressing MILP involving non-binary integer variables
+using ML-based approaches, feasibility issues can become even more pronounced.
+Since finding an optimal solution requires satisfying all constraints,
+addressing feasibility is critical. To overcome these limitations, we propose a
+novel reinforcement learning (RL)-based solver that interacts with MILP to find
+feasible solutions, rather than delegating sub-problems to traditional solvers.
+We design reward functions tailored for MILP, which enables the RL agent to
+learn relationships between decision variables and constraints. Additionally,
+to effectively model complex relationships among decision variables, we
+leverage a Transformer encoder-based graph neural network (GNN). Our
+experimental results demonstrate that the proposed method can solve MILP
+problems and find near-optimal solutions without delegating the remainder to
+traditional solvers. The proposed method provides a meaningful step forward as
+an initial study in solving MILP problems end-to-end based solely on ML.
+
+
+
+
+
+
+
+ ☆ Enhancing AI microscopy for foodborne bacterial classification via
+ adversarial domain adaptation across optical and biological variability
+
+
+ Rapid detection of foodborne bacteria is critical for food safety and
+quality, yet traditional culture-based methods require extended incubation and
+specialized sample preparation. This study addresses these challenges by i)
+enhancing the generalizability of AI-enabled microscopy for bacterial
+classification using adversarial domain adaptation and ii) comparing the
+performance of single-target and multi-domain adaptation. Three Gram-positive
+(Bacillus coagulans, Bacillus subtilis, Listeria innocua) and three
+Gram-negative (E. coli, Salmonella Enteritidis, Salmonella Typhimurium) strains
+were classified. EfficientNetV2 served as the backbone architecture, leveraging
+fine-grained feature extraction for small targets. Few-shot learning enabled
+scalability, with domain-adversarial neural networks (DANNs) addressing single
+domains and multi-DANNs (MDANNs) generalizing across all target domains. The
+model was trained on source domain data collected under controlled conditions
+(phase contrast microscopy, 60x magnification, 3-h bacterial incubation) and
+evaluated on target domains with variations in microscopy modality
+(brightfield, BF), magnification (20x), and extended incubation to compensate
+for lower resolution (20x-5h). DANNs improved target domain classification
+accuracy by up to 54.45% (20x), 43.44% (20x-5h), and 31.67% (BF), with minimal
+source domain degradation (<4.44%). MDANNs achieved superior performance in the
+BF domain and substantial gains in the 20x domain. Grad-CAM and t-SNE
+visualizations validated the model's ability to learn domain-invariant features
+across diverse conditions. This study presents a scalable and adaptable
+framework for bacterial classification, reducing reliance on extensive sample
+preparation and enabling application in decentralized and resource-limited
+environments.
+
+
+ Recommendation systems predominantly utilize two-tower architectures, which
+evaluate user-item rankings through the inner product of their respective
+embeddings. However, one key limitation of two-tower models is that they learn
+a pair-agnostic representation of users and items. In contrast, pair-wise
+representations either scale poorly due to their quadratic complexity or are
+too restrictive on the candidate pairs to rank. To address these issues, we
+introduce Context-based Graph Neural Networks (ContextGNNs), a novel deep
+learning architecture for link prediction in recommendation systems. The method
+employs a pair-wise representation technique for familiar items situated within
+a user's local subgraph, while leveraging two-tower representations to
+facilitate the recommendation of exploratory items. A final network then
+predicts how to fuse both pair-wise and two-tower recommendations into a single
+ranking of items. We demonstrate that ContextGNN is able to adapt to different
+data characteristics and outperforms existing methods, both traditional and
+GNN-based, on a diverse set of practical recommendation tasks, improving
+performance by 20% on average.
+
+
+
+ comment: 14 pages, 1 figure, 5 tables
+
+
+
+
+
+
+ ☆ Topology-Preserving Scaling in Data Augmentation
+
+
+ We propose an algorithmic framework for dataset normalization in data
+augmentation pipelines that preserves topological stability under non-uniform
+scaling transformations. Given a finite metric space \( X \subset \mathbb{R}^n
+\) with Euclidean distance \( d_X \), we consider scaling transformations
+defined by scaling factors \( s_1, s_2, \ldots, s_n > 0 \). Specifically, we
+define a scaling function \( S \) that maps each point \( x = (x_1, x_2,
+\ldots, x_n) \in X \) to \[ S(x) = (s_1 x_1, s_2 x_2, \ldots, s_n x_n). \] Our
+main result establishes that the bottleneck distance \( d_B(D, D_S) \) between
+the persistence diagrams \( D \) of \( X \) and \( D_S \) of \( S(X) \)
+satisfies: \[ d_B(D, D_S) \leq (s_{\max} - s_{\min}) \cdot
+\operatorname{diam}(X), \] where \( s_{\min} = \min_{1 \leq i \leq n} s_i \),
+\( s_{\max} = \max_{1 \leq i \leq n} s_i \), and \( \operatorname{diam}(X) \)
+is the diameter of \( X \). Based on this theoretical guarantee, we formulate
+an optimization problem to minimize the scaling variability \( \Delta_s =
+s_{\max} - s_{\min} \) under the constraint \( d_B(D, D_S) \leq \epsilon \),
+where \( \epsilon > 0 \) is a user-defined tolerance.
+ We develop an algorithmic solution to this problem, ensuring that data
+augmentation via scaling transformations preserves essential topological
+features. We further extend our analysis to higher-dimensional homological
+features, alternative metrics such as the Wasserstein distance, and iterative
+or probabilistic scaling scenarios. Our contributions provide a rigorous
+mathematical framework for dataset normalization in data augmentation
+pipelines, ensuring that essential topological characteristics are maintained
+despite scaling transformations.
+
+
+ Cross-view image synthesis involves generating new images of a scene from
+different viewpoints or perspectives, given one input image from another
+viewpoint. Despite recent advancements, there are several limitations in
+existing methods: 1) reliance on additional data such as semantic segmentation
+maps or preprocessing modules to bridge the domain gap; 2) insufficient focus
+on view-specific semantics, leading to compromised image quality and realism;
+and 3) a lack of diverse datasets representing complex urban environments. To
+tackle these challenges, we propose: 1) a novel retrieval-guided framework that
+employs a retrieval network as an embedder to address the domain gap; 2) an
+innovative generator that enhances semantic consistency and diversity specific
+to the target view to improve image quality and realism; and 3) a new dataset,
+VIGOR-GEN, providing diverse cross-view image pairs in urban settings to enrich
+dataset diversity. Extensive experiments on the well-known CVUSA and CVACT datasets and the new
+VIGOR-GEN dataset demonstrate that our method generates images of superior
+realism, significantly outperforming current leading approaches, particularly
+in SSIM and FID evaluations.
+
+
+
+
+
+
+
+ ☆ Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head
+ Synthesis
+
+
+
+
+
+
+
+
+ Tianqi Li, Ruobing Zheng, Minghui Yang, Jingdong Chen, Ming Yang
+
+
+ Recent advances in diffusion models have revolutionized audio-driven talking
+head synthesis. Beyond precise lip synchronization, diffusion-based methods
+excel in generating subtle expressions and natural head movements that are
+well-aligned with the audio signal. However, these methods are confronted by
+slow inference speed, insufficient fine-grained control over facial motions,
+and occasional visual artifacts largely due to an implicit latent space derived
+from Variational Auto-Encoders (VAE), which prevent their adoption in realtime
+interaction applications. To address these issues, we introduce Ditto, a
+diffusion-based framework that enables controllable realtime talking head
+synthesis. Our key innovation lies in bridging motion generation and
+photorealistic neural rendering through an explicit identity-agnostic motion
+space, replacing conventional VAE representations. This design substantially
+reduces the complexity of diffusion learning while enabling precise control
+over the synthesized talking heads. We further propose an inference strategy
+that jointly optimizes three key components: audio feature extraction, motion
+generation, and video synthesis. This optimization enables streaming
+processing, realtime inference, and low first-frame delay, which are the
+functionalities crucial for interactive applications such as AI assistants.
+Extensive experimental results demonstrate that Ditto generates compelling
+talking head videos and substantially outperforms existing methods in both
+motion control and realtime performance.
+
+
+
+
+
+
+
+ ☆ Graph-Enhanced EEG Foundation Model
+
+
+ Electroencephalography (EEG) signals provide critical insights for
+applications in disease diagnosis and healthcare. However, the scarcity of
+labeled EEG data poses a significant challenge. Foundation models offer a
+promising solution by leveraging large-scale unlabeled data through
+pre-training, enabling strong performance across diverse tasks. While both
+temporal dynamics and inter-channel relationships are vital for understanding
+EEG signals, existing EEG foundation models primarily focus on the former,
+overlooking the latter. To address this limitation, we propose a novel
+foundation model for EEG that integrates both temporal and inter-channel
+information. Our architecture combines Graph Neural Networks (GNNs), which
+effectively capture relational structures, with a masked autoencoder to enable
+efficient pre-training. We evaluated our approach using three downstream tasks
+and experimented with various GNN architectures. The results demonstrate that
+our proposed model, particularly when employing the GCN architecture with
+optimized configurations, consistently outperformed baseline methods across all
+tasks. These findings suggest that our model serves as a robust foundation
+model for EEG analysis.
+
+
+
+
+
+
+
+ ☆ Real-time Anomaly Detection at the L1 Trigger of CMS Experiment
+
+
+ We present the preparation, deployment, and testing of an autoencoder trained
+for unbiased detection of new physics signatures in the CMS experiment Global
+Trigger (GT) test crate FPGAs during LHC Run 3. The GT makes the final decision
+whether to read out or discard the data from each of the LHC collisions, which occur at
+a rate of 40 MHz, within a 50 ns latency. The Neural Network makes a prediction
+for each event within these constraints, which can be used to select anomalous
+events for further analysis. The GT test crate is a copy of the main GT system,
+receiving the same input data, but whose output is not used to trigger the
+readout of CMS, providing a platform for thorough testing of new trigger
+algorithms on live data, but without interrupting data taking. We describe the
+methodology to achieve ultra-low-latency anomaly detection, and present the
+integration of the DNN into the GT test crate, as well as the monitoring,
+testing, and validation of the algorithm during proton collisions.
+
+
+
+ comment: Contribution to 42nd International Conference on High Energy Physics
+ (ICHEP 2024)
+
+
+
+
+
+
+ ♻ ☆ Diffeomorphic Latent Neural Operators for Data-Efficient Learning of
+ Solutions to Partial Differential Equations
+
+
+ A computed approximation of the solution operator to a system of partial
+differential equations (PDEs) is needed in various areas of science and
+engineering. Neural operators have been shown to be quite effective at
+predicting these solution generators after training on high-fidelity ground
+truth data (e.g. numerical simulations). However, in order to generalize well
+to unseen spatial domains, neural operators must be trained on an extensive
+amount of geometrically varying data samples that may not be feasible to
+acquire or simulate in certain contexts (e.g., patient-specific medical data,
+large-scale computationally intensive simulations). We propose that, in order to
+learn a PDE solution operator that can generalize across multiple domains
+without needing to sample data expressive enough for all possible
+geometries, we can instead train a latent neural operator on just a few ground
+truth solution fields diffeomorphically mapped from different geometric/spatial
+domains to a fixed reference configuration. Furthermore, the form of the
+solutions is dependent on the choice of mapping to and from the reference
+domain. We emphasize that preserving properties of the differential operator
+when constructing these mappings can significantly reduce the data requirement
+for achieving an accurate model due to the regularity of the solution fields
+that the latent neural operator is training on. We provide motivating numerical
+experimentation that demonstrates an extreme case of this consideration by
+exploiting the conformal invariance of the Laplacian.
+
+
+
+
+
+
+
+ ♻ ☆ An Operator Splitting View of Federated Learning
+
+
+ Over the past few years, the federated learning ($\texttt{FL}$) community has
+witnessed a proliferation of new $\texttt{FL}$ algorithms. However, our
+understanding of the theory of $\texttt{FL}$ is still fragmented, and a
+thorough, formal comparison of these algorithms remains elusive. Motivated by
+this gap, we show that many of the existing $\texttt{FL}$ algorithms can be
+understood from an operator splitting point of view. This unification allows us
+to compare different algorithms with ease, to refine previous convergence
+results and to uncover new algorithmic variants. In particular, our analysis
+reveals the vital role played by the step size in $\texttt{FL}$ algorithms. The
+unification also leads to a streamlined and economical way to accelerate
+$\texttt{FL}$ algorithms, without incurring any communication overhead. We
+perform numerical experiments on both convex and nonconvex models to validate
+our findings.
+
+
+
+
+
+
+
+
+ Alex Cloud, Jacob Goldman-Wetzler, Evžen Wybitul, Joseph Miller, Alexander Matt Turner
+
+
+ Neural networks are trained primarily based on their inputs and outputs,
+without regard for their internal mechanisms. These neglected mechanisms
+determine properties that are critical for safety, like (i) transparency; (ii)
+the absence of sensitive information or harmful capabilities; and (iii)
+reliable generalization of goals beyond the training distribution. To address
+this shortcoming, we introduce gradient routing, a training method that
+isolates capabilities to specific subregions of a neural network. Gradient
+routing applies data-dependent, weighted masks to gradients during
+backpropagation. These masks are supplied by the user in order to configure
+which parameters are updated by which data points. We show that gradient
+routing can be used to (1) learn representations which are partitioned in an
+interpretable way; (2) enable robust unlearning via ablation of a pre-specified
+network subregion; and (3) achieve scalable oversight of a reinforcement
+learner by localizing modules responsible for different behaviors. Throughout,
+we find that gradient routing localizes capabilities even when applied to a
+limited, ad-hoc subset of the data. We conclude that the approach holds promise
+for challenging, real-world applications where quality data are scarce.
+
+
+
+
+
+
+
+ ♻ ☆ On the consistency of hyper-parameter selection in value-based deep
+ reinforcement learning
+
+
+
+
+
+
+
+
+ Johan Obando-Ceron, João G. M. Araújo, Aaron Courville, Pablo Samuel Castro
+
+
+ Deep reinforcement learning (deep RL) has achieved tremendous success on
+various domains through a combination of algorithmic design and careful
+selection of hyper-parameters. Algorithmic improvements are often the result of
+iterative enhancements built upon prior approaches, while hyper-parameter
+choices are typically inherited from previous methods or fine-tuned
+specifically for the proposed technique. Despite their crucial impact on
+performance, hyper-parameter choices are frequently overshadowed by algorithmic
+advancements. This paper conducts an extensive empirical study focusing on the
+reliability of hyper-parameter selection for value-based deep reinforcement
+learning agents, including the introduction of a new score to quantify the
+consistency and reliability of various hyper-parameters. Our findings not only
+help establish which hyper-parameters are most critical to tune, but also help
+clarify which tunings remain consistent across different training regimes.
+
+
+ In many applications in data clustering, it is desirable to find not just a
+single partition into clusters but a sequence of partitions describing the data
+at different scales (or levels of coarseness). A natural problem then is to
+analyse and compare the (not necessarily hierarchical) sequences of partitions
+that underpin multiscale descriptions of data. Here, we introduce the
+Multiscale Clustering Filtration (MCF), a well-defined and stable filtration of
+abstract simplicial complexes that encodes arbitrary patterns of cluster
+assignments across scales of increasing coarseness. We show that the
+zero-dimensional persistent homology of the MCF measures the degree of
+hierarchy in the sequence of partitions, and the higher-dimensional persistent
+homology tracks the emergence and resolution of conflicts between cluster
+assignments across the sequence of partitions. To broaden the theoretical
+foundations of the MCF, we also provide an equivalent construction via a nerve
+complex filtration, and we show that in the hierarchical case, the MCF reduces
+to a Vietoris-Rips filtration of an ultrametric space. We then use numerical
+experiments to illustrate how the MCF can serve to characterise multiscale
+clusterings of synthetic data from stochastic block models.
+
+
+
+ comment: This work was presented at the Dagstuhl Seminar (23192) on
+ "Topological Data Analysis and Applications"
+
+ This paper presents a computationally efficient and distributed speaker
+diarization framework for networked IoT-style audio devices. The work proposes
+a Federated Learning model which can identify the participants in a
+conversation without the requirement of a large audio database for training. An
+unsupervised online update mechanism is proposed for the Federated Learning
+model which depends on cosine similarity of speaker embeddings. Moreover, the
+proposed diarization system solves the problem of speaker change detection via
+unsupervised segmentation techniques using Hotelling's t-squared Statistic and
+Bayesian Information Criterion. In this new approach, speaker change detection
+is biased around detected quasi-silences, which reduces the severity of the
+trade-off between the missed detection and false detection rates. Additionally,
+the computational overhead due to frame-by-frame identification of speakers is
+reduced via unsupervised clustering of speech segments. The results
+demonstrate the effectiveness of the proposed training method in the presence
+of non-IID speech data. They also show a considerable reduction in
+false and missed detections at the segmentation stage, while
+reducing the computational overhead. Improved accuracy and reduced
+computational cost make the mechanism suitable for real-time speaker
+diarization across a distributed IoT audio network.
+
+
+
+ comment: 11 pages, 7 figures, 1 table
+
+
+
+
+
+
+ ♻ ☆ Bias-inducing geometries: an exactly solvable data model with fairness
+ implications
+
+
+ Machine learning (ML) may be oblivious to human bias but it is not immune to
+its perpetuation. Marginalisation and iniquitous group representation are often
+traceable in the very data used for training, and may be reflected or even
+enhanced by the learning models. In the present work, we aim at clarifying the
+role played by data geometry in the emergence of ML bias. We introduce an
+exactly solvable high-dimensional model of data imbalance, where parametric
+control over the many bias-inducing factors allows for an extensive exploration
+of the bias inheritance mechanism. Through the tools of statistical physics, we
+analytically characterise the typical properties of learning models trained in
+this synthetic framework and obtain exact predictions for the observables that
+are commonly employed for fairness assessment. Despite the simplicity of the
+data model, we retrace and unpack typical unfairness behaviour observed on
+real-world datasets. We also obtain a detailed analytical characterisation of a
+class of bias mitigation strategies. We first consider a basic loss-reweighing
+scheme, which allows for an implicit minimisation of different unfairness
+metrics, and quantify the incompatibilities between some existing fairness
+criteria. Then, we consider a novel mitigation strategy based on a matched
+inference approach, consisting in the introduction of coupled learning models.
+Our theoretical analysis of this approach shows that the coupled strategy can
+strike superior fairness-accuracy trade-offs.
+
+
+
+ comment: 10 pages + appendix
+
+
+
+
+
+
+ ♻ ☆ A Riemannian Framework for Learning Reduced-order Lagrangian Dynamics
+
+
+ By incorporating physical consistency as inductive bias, deep neural networks
+display increased generalization capabilities and data efficiency in learning
+nonlinear dynamic models. However, the complexity of these models generally
+increases with the system dimensionality, requiring larger datasets, more
+complex deep networks, and significant computational effort. We propose a novel
+geometric network architecture to learn physically-consistent reduced-order
+dynamic parameters that accurately describe the original high-dimensional
+system behavior. This is achieved by building on recent advances in model-order
+reduction and by adopting a Riemannian perspective to jointly learn a
+non-linear structure-preserving latent space and the associated low-dimensional
+dynamics. Our approach enables accurate long-term predictions of the
+high-dimensional dynamics of rigid and deformable systems with increased data
+efficiency by inferring interpretable and physically plausible reduced
+Lagrangian models.
+
+
+ Bike-sharing is an environmentally friendly shared mobility mode, but its
+self-loop phenomenon, where bikes are returned to the same station after
+several uses, significantly impacts equity in accessing its services.
+Therefore, this study conducts a multiscale analysis with a spatial
+autoregressive model and double machine learning framework to assess
+socioeconomic features and geospatial location's impact on the self-loop
+phenomenon at metro stations and street scales. The results reveal that
+bike-sharing self-loop intensity exhibits significant spatial lag effect at
+street scale and is positively associated with residential land use. Marginal
+treatment effects of residential land use are higher on streets with middle-aged
+residents, high fixed employment, and low car ownership. The multimodal public
+transit condition reveals significant positive marginal treatment effects at
+both scales. To enhance bike-sharing cooperation, we advocate augmenting
+bicycle availability in areas with high metro usage and low bus coverage,
+alongside implementing adaptable redistribution strategies.
+
+
+
+
+
+
+
+
+ Biagio Montaruli, Giuseppe Floris, Christian Scano, Luca Demetrio, Andrea Valenza, Luca Compagna, Davide Ariu, Luca Piras, Davide Balzarotti, Battista Biggio
+
+
+ Many Web Application Firewalls (WAFs) leverage the OWASP Core Rule Set (CRS)
+to block incoming malicious requests. The CRS consists of different sets of
+rules designed by domain experts to detect well-known web attack patterns. Both
+the set of rules to be used and the weights used to combine them are manually
+defined, yielding four different default configurations of the CRS. In this
+work, we focus on the detection of SQL injection (SQLi) attacks, and show that
+the manual configurations of the CRS typically yield a suboptimal trade-off
+between detection and false alarm rates. Furthermore, we show that these
+configurations are not robust to adversarial SQLi attacks, i.e.,
+carefully-crafted attacks that iteratively refine the malicious SQLi payload by
+querying the target WAF to bypass detection. To overcome these limitations, we
+propose (i) using machine learning to automate the selection of the set of
+rules to be combined along with their weights, i.e., customizing the CRS
+configuration based on the monitored web services; and (ii) leveraging
+adversarial training to significantly improve its robustness to adversarial
+SQLi manipulations. Our experiments, conducted using the well-known open-source
+ModSecurity WAF equipped with the CRS rules, show that our approach, named
+ModSec-AdvLearn, can (i) increase the detection rate up to 30%, while retaining
+negligible false alarm rates and discarding up to 50% of the CRS rules; and
+(ii) improve robustness against adversarial SQLi attacks up to 85%, marking a
+significant stride toward designing more effective and robust WAFs. We release
+our open-source code at https://github.com/pralab/modsec-advlearn.
+
+
+ Deep unrolling, or unfolding, is an emerging learning-to-optimize method that
+unrolls a truncated iterative algorithm in the layers of a trainable neural
+network. However, the convergence guarantees and generalizability of the
+unrolled networks are still open theoretical problems. To tackle these
+problems, we provide deep unrolled architectures with a stochastic descent
+nature by imposing descending constraints during training. The descending
+constraints are forced layer by layer to ensure that each unrolled layer takes,
+on average, a descent step toward the optimum during training. We theoretically
+prove that the sequence constructed by the outputs of the unrolled layers is
+then guaranteed to converge for unseen problems, assuming no distribution shift
+between training and test problems. We also show that standard unrolling is
+brittle to perturbations, and our imposed constraints provide the unrolled
+networks with robustness to additive noise and perturbations. We numerically
+assess unrolled architectures trained under the proposed constraints in two
+different applications, including the sparse coding using learnable iterative
+shrinkage and thresholding algorithm (LISTA) and image inpainting using
+proximal generative flow (GLOW-Prox), and demonstrate the performance and
+robustness benefits of the proposed method.
+
+
+ Pretrained large language models (LLMs) are increasingly utilized across a
+wide range of natural language processing (NLP) tasks due to their impressive
+capabilities as few-shot learners. Recent techniques, such as chain-of-thought
+(CoT) prompting, have significantly advanced multi-step reasoning by
+introducing step-by-step decomposition, achieving state-of-the-art results on
+complex reasoning benchmarks. However, these approaches often rely on static
+prompting templates that do not adapt to task complexity or errors during the
+reasoning process. In this work, we introduce Adaptive Prompting, a dynamic and
+iterative framework designed to enhance reasoning by incorporating real-time
+adjustments to prompt structures and validation mechanisms. Experimental results
+demonstrate that Adaptive Prompting significantly improves performance on
+diverse reasoning benchmarks, including arithmetic reasoning (GSM8K,
+MultiArith), logical reasoning and commonsense tasks, achieving substantial
+accuracy gains compared to static prompting baselines. By integrating guided
+prompts, intermediate validation, and self-corrective steps, our approach
+enables smaller models to achieve competitive performance with larger
+counterparts, such as GPT-4, while maintaining computational efficiency. The
+framework achieves this without requiring fine-tuning or task-specific training
+data, highlighting the untapped potential of iterative reasoning methods.
+
+
+
+ comment: Submitted to ICLR 2025. This is a preprint version. Future revisions
+ will include additional evaluations and refinements
+
+
+
+
+
+
+ ♻ ☆ Statistical learning theory and Occam's razor: The core argument
+
+
+ Statistical learning theory is often associated with the principle of Occam's
+razor, which recommends a simplicity preference in inductive inference. This
+paper distills the core argument for simplicity obtainable from statistical
+learning theory, built on the theory's central learning guarantee for the
+method of empirical risk minimization. This core "means-ends" argument is that
+a simpler hypothesis class or inductive model is better because it has better
+learning guarantees; however, these guarantees are model-relative and so the
+theoretical push towards simplicity is checked by our prior knowledge.
+
+
+
+
+
+
+
+ ♻ ☆ What Is Fairness? On the Role of Protected Attributes and Fictitious
+ Worlds
+
+
+ A growing body of literature in fairness-aware machine learning (fairML) aims
+to mitigate machine learning (ML)-related unfairness in automated
+decision-making (ADM) by defining metrics that measure fairness of an ML model
+and by proposing methods to ensure that trained ML models achieve low scores on
+these metrics. However, the underlying concept of fairness, i.e., the question
+of what fairness is, is rarely discussed, leaving a significant gap between
+centuries of philosophical discussion and the recent adoption of the concept in
+the ML community. In this work, we try to bridge this gap by formalizing a
+consistent concept of fairness and by translating the philosophical
+considerations into a formal framework for the training and evaluation of ML
+models in ADM systems. We argue that fairness problems can arise even without
+the presence of protected attributes (PAs), and point out that fairness and
+predictive performance are not irreconcilable opposites, but that the latter is
+necessary to achieve the former. Furthermore, we argue why and how causal
+considerations are necessary when assessing fairness in the presence of PAs by
+proposing a fictitious, normatively desired (FiND) world in which PAs have no
+causal effects. In practice, this FiND world must be approximated by a warped
+world in which the causal effects of the PAs are removed from the real-world
+data. Finally, we achieve greater linguistic clarity in the discussion of
+fairML. We outline algorithms for practical applications and present
+illustrative experiments on COMPAS data.
+
+
+
+
+
+
+
+ ♻ ☆ A Survey on Multimodal Large Language Models
+
+
+ Recently, Multimodal Large Language Model (MLLM) represented by GPT-4V has
+been a new rising research hotspot, which uses powerful Large Language Models
+(LLMs) as a brain to perform multimodal tasks. The surprising emergent
+capabilities of MLLM, such as writing stories based on images and OCR-free math
+reasoning, are rare in traditional multimodal methods, suggesting a potential
+path to artificial general intelligence. To this end, both academia and
+industry have endeavored to develop MLLMs that can compete with or even outperform
+GPT-4V, pushing the limit of research at a surprising speed. In this
+paper, we aim to trace and summarize the recent progress of MLLMs. First of
+all, we present the basic formulation of MLLM and delineate its related
+concepts, including architecture, training strategy and data, as well as
+evaluation. Then, we introduce research topics about how MLLMs can be extended
+to support more granularity, modalities, languages, and scenarios. We continue
+with multimodal hallucination and extended techniques, including Multimodal ICL
+(M-ICL), Multimodal CoT (M-CoT), and LLM-Aided Visual Reasoning (LAVR). To
+conclude the paper, we discuss existing challenges and point out promising
+research directions. In light of the fact that the era of MLLM has only just
+begun, we will keep updating this survey and hope it can inspire more research.
+An associated GitHub link collecting the latest papers is available at
+https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.
+
+
+
+ comment: Accepted for publication in National Science Review. Project
+ page: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
+
+
+
+
+
+
+ ♻ ☆ Learning Local Control Barrier Functions for Hybrid Systems
+
+
+
+
+
+
+
+
+ Shuo Yang, Yu Chen, Xiang Yin, George J. Pappas, Rahul Mangharam
+
+
+ Hybrid dynamical systems are ubiquitous as practical robotic applications
+often involve both continuous states and discrete switchings. Safety is a
+primary concern for hybrid robotic systems. Existing safety-critical control
+approaches for hybrid systems are either computationally inefficient,
+detrimental to system performance, or limited to small-scale systems. To amend
+these drawbacks, in this paper, we propose a learning-enabled approach to
+construct local Control Barrier Functions (CBFs) to guarantee the safety of a
+wide class of nonlinear hybrid dynamical systems. The end result is a safe
+neural CBF-based switching controller. Our approach is computationally
+efficient, minimally invasive to any reference controller, and applicable to
+large-scale systems. We empirically evaluate our framework and demonstrate its
+efficacy and flexibility through two robotic examples including a
+high-dimensional autonomous racing case, against other CBF-based approaches and
+model predictive control.
+
+
+ While reinforcement learning has shown experimental success in a number of
+applications, it is known to be sensitive to noise and perturbations in the
+parameters of the system, leading to high variance in the total reward amongst
+different episodes in slightly different environments. To introduce robustness,
+as well as sample efficiency, risk-sensitive reinforcement learning methods are
+being thoroughly studied. In this work, we provide a definition of robust
+reinforcement learning policies and formulate a risk-sensitive reinforcement
+learning problem to approximate them, by solving an optimization problem with
+respect to a modified objective based on exponential criteria. In particular,
+we study a model-free risk-sensitive variation of the widely-used Monte Carlo
+Policy Gradient algorithm and introduce a novel risk-sensitive online
+Actor-Critic algorithm based on solving a multiplicative Bellman equation using
+stochastic approximation updates. Analytical results suggest that the use of
+exponential criteria generalizes commonly used ad-hoc regularization
+approaches, improves sample efficiency, and introduces robustness with respect
+to perturbations in the model parameters and the environment. The
+implementation, performance, and robustness properties of the proposed methods
+are evaluated in simulated experiments.
+
+
+
+
+
+
+
+ ♻ ☆ Sourcerer: Sample-based Maximum Entropy Source Distribution Estimation
+
+
+
+
+
+
+
+
+ Julius Vetter, Guy Moss, Cornelius Schröder, Richard Gao, Jakob H. Macke
+
+
+ Scientific modeling applications often require estimating a distribution of
+parameters consistent with a dataset of observations - an inference task also
+known as source distribution estimation. This problem can be ill-posed,
+however, since many different source distributions might produce the same
+distribution of data-consistent simulations. To make a principled choice among
+many equally valid sources, we propose an approach which targets the maximum
+entropy distribution, i.e., prioritizes retaining as much uncertainty as
+possible. Our method is purely sample-based - leveraging the Sliced-Wasserstein
+distance to measure the discrepancy between the dataset and simulations - and
+thus suitable for simulators with intractable likelihoods. We benchmark our
+method on several tasks, and show that it can recover source distributions with
+substantially higher entropy than recent source estimation methods, without
+sacrificing the fidelity of the simulations. Finally, to demonstrate the
+utility of our approach, we infer source distributions for parameters of the
+Hodgkin-Huxley model from experimental datasets with thousands of single-neuron
+measurements. In summary, we propose a principled method for inferring source
+distributions of scientific simulator parameters while retaining as much
+uncertainty as possible.
+
+
+
+
+
+
+
+ ♻ ☆ Domain-Adaptive Pre-training of Self-Supervised Foundation Models for
+ Medical Image Classification in Gastrointestinal Endoscopy
+
+
+
+
+
+
+
+
+ Marcel Roth, Micha V. Nowak, Adrian Krenzer, Frank Puppe
+
+
+ Video capsule endoscopy has transformed gastrointestinal endoscopy (GIE)
+diagnostics by offering a non-invasive method for capturing detailed images of
+the gastrointestinal tract, enabling early disease detection. However, its
+potential is limited by the sheer volume of images generated during the imaging
+procedure, which can take anywhere from 6-8 hours and often produce up to 1
+million images, necessitating automated analysis. Additionally, the variability
+of these images, combined with the need for expert annotations and the scarcity
+of large, high-quality labeled datasets, constrains the effectiveness of
+current medical image analysis models. To address this, we introduce a novel
+large GIE dataset, called EndoExtend24, created by merging ten existing public
+and private datasets, ensuring patient integrity across splits. EndoExtend24
+includes over 226,000 labeled images, as well as dynamic class mappings, which
+allow unified training across datasets with differing labeling granularity,
+supporting up to 123 distinct pathological findings. Further, we propose to
+leverage domain adaptive pre-training of foundation models trained with
+self-supervision on generic image data, to adapt them to the task of GIE
+medical image diagnosis. Specifically, the EVA-02 model, which is based on the
+ViT architecture and trained on ImageNet-22k with masked image modeling (using
+EVA-CLIP as a MIM teacher), is pre-trained on the EndoExtend24 dataset to
+achieve domain adaptation, and finally trained on the Capsule Endoscopy 2024
+Challenge dataset. Our model demonstrates robust performance, securing third
+place in the Capsule Endoscopy 2024 Challenge. We achieved a macro AUC of 0.762
+and a balanced accuracy of 37.1% on the test set. These results emphasize the
+effectiveness of our domain-adaptive pre-training approach and the enriched
+EndoExtend24 dataset in advancing gastrointestinal endoscopy diagnostics.
+
+
+
+
+
+
+
+ ♻ ☆ What Differentiates Educational Literature? A Multimodal Fusion Approach
+ of Transformers and Computational Linguistics
+
+
+ The integration of new literature into the English curriculum remains a
+challenge since educators often lack scalable tools to rapidly evaluate
+readability and adapt texts for diverse classroom needs. This study proposes to
+address this gap through a multimodal approach that combines transformer-based
+text classification with linguistic feature analysis to align texts with UK Key
+Stages. Eight state-of-the-art Transformers were fine-tuned on segmented text
+data, with BERT achieving the highest unimodal F1 score of 0.75. In parallel,
+500 deep neural network topologies were searched for the classification of
+linguistic characteristics, achieving an F1 score of 0.392. The fusion of these
+modalities shows a significant improvement, with every multimodal approach
+outperforming all unimodal models. In particular, the ELECTRA Transformer fused
+with the neural network achieved an F1 score of 0.996. Unimodal and multimodal
+approaches are shown to have statistically significant differences in all
+validation metrics (accuracy, precision, recall, F1 score) except for inference
+time. The proposed approach is finally encapsulated in a stakeholder-facing web
+application, providing non-technical stakeholder access to real-time insights
+on text complexity, reading difficulty, curriculum alignment, and
+recommendations for learning age range. The application empowers data-driven
+decision making and reduces manual workload by integrating AI-based
+recommendations into lesson planning for English literature.
+
+
+
+
+
+
+
+ ♻ ☆ CLIPArTT: Adaptation of CLIP to New Domains at Test Time
+
+
+
+
+
+
+
+
+ Gustavo Adolfo Vargas Hakim, David Osowiechi, Mehrdad Noori, Milad Cheraghalikhani, Ali Bahri, Moslem Yazdanpanah, Ismail Ben Ayed, Christian Desrosiers
+
+
+ Pre-trained vision-language models (VLMs), exemplified by CLIP, demonstrate
+remarkable adaptability across zero-shot classification tasks without
+additional training. However, their performance diminishes in the presence of
+domain shifts. In this study, we introduce CLIP Adaptation duRing Test-Time
+(CLIPArTT), a fully test-time adaptation (TTA) approach for CLIP, which
+involves automatic text prompts construction during inference for their use as
+text supervision. Our method employs a unique, minimally invasive text prompt
+tuning process, wherein multiple predicted classes are aggregated into a single
+new text prompt, used as a pseudo label to re-classify inputs in a
+transductive manner. Additionally, we pioneer the standardization of TTA
+benchmarks (e.g., TENT) in the realm of VLMs. Our findings demonstrate that,
+without requiring additional transformations or new trainable modules,
+CLIPArTT enhances performance dynamically across non-corrupted datasets such as
+CIFAR-100, corrupted datasets like CIFAR-100-C and ImageNet-C, alongside
+synthetic datasets such as VisDA-C. This research underscores the potential for
+improving VLMs' adaptability through novel test-time strategies, offering
+insights for robust performance across varied datasets and environments. The
+code can be found at: https://github.com/dosowiechi/CLIPArTT.git
+
+
+
+
+
+
+
+ ♻ ☆ Climate Adaptation with Reinforcement Learning: Experiments with
+ Flooding and Transportation in Copenhagen NeurIPS 2024
+
+
+
+
+
+
+
+
+ Miguel Costa, Morten W. Petersen, Arthur Vandervoort, Martin Drews, Karyn Morrissey, Francisco C. Pereira
+
+
+ Due to climate change the frequency and intensity of extreme rainfall events,
+which contribute to urban flooding, are expected to increase in many places.
+These floods can damage transport infrastructure and disrupt mobility,
+highlighting the need for cities to adapt to escalating risks. Reinforcement
+learning (RL) serves as a powerful tool for uncovering optimal adaptation
+strategies, determining how and where to deploy adaptation measures
+effectively, even under significant uncertainty. In this study, we leverage RL
+to identify the most effective timing and locations for implementing measures,
+aiming to reduce both direct and indirect impacts of flooding. Our framework
+integrates climate change projections of future rainfall events and floods,
+models city-wide motorized trips, and quantifies direct and indirect impacts on
+infrastructure and mobility. Preliminary results suggest that our RL-based
+approach can significantly enhance decision-making by prioritizing
+interventions in specific urban areas and identifying the optimal periods for
+their implementation. Our framework is publicly available:
+https://github.com/MLSM-at-DTU/floods_transport_rl.
+
+
+
+ comment: Accepted for presentation at Tackling Climate Change with Machine
+ Learning workshop at NeurIPS 2024
+
+
+
+
+
+
+ ♻ ☆ You Don't Need Domain-Specific Data Augmentations When Scaling
+ Self-Supervised Learning
+
+
+
+
+
+
+
+
+ Théo Moutakanni, Maxime Oquab, Marc Szafraniec, Maria Vakalopoulou, Piotr Bojanowski
+
+
+ Self-Supervised learning (SSL) with Joint-Embedding Architectures (JEA) has
+led to outstanding performances. All instantiations of this paradigm were
+trained using strong and well-established hand-crafted data augmentations,
+leading to the general belief that they are required for the proper training
+and performance of such models. On the other hand, generative
+reconstruction-based models such as BEIT and MAE or Joint-Embedding Predictive
+Architectures such as I-JEPA have shown strong performance without using data
+augmentations except masking. In this work, we challenge the importance of
+invariance and data-augmentation in JEAs at scale. By running a case-study on a
+recent SSL foundation model - DINOv2 - we show that strong image
+representations can be obtained with JEAs and only cropping without resizing
+provided the training data is large enough, reaching state-of-the-art results
+and using the least amount of augmentation in the literature. Through this
+study, we also discuss the impact of compute constraints on the outcomes of
+experimental deep learning research, showing that they can lead to very
+different conclusions.
+
+
+
+
+
+
+
+
+ Jacob E. Kooi, Mark Hoogendoorn, Vincent François-Lavet
+
+
+ Activation functions are one of the key components of a deep neural network.
+The most commonly used activation functions can be classed into the category of
+continuously differentiable (e.g. tanh) and piece-wise linear functions (e.g.
+ReLU), both having their own strengths and drawbacks with respect to downstream
+performance and representation capacity through learning (e.g. measured by the
+number of dead neurons and the effective rank). In reinforcement learning, the
+performance of continuously differentiable activations often falls short as
+compared to piece-wise linear functions. We provide insights into the vanishing
+gradients associated with the former, and show that the dying neuron problem is
+not exclusive to ReLUs. To alleviate vanishing gradients and the resulting
+dying neuron problem occurring with continuously differentiable activations, we
+propose a Hadamard representation. Using deep Q-networks, proximal policy
+optimization and parallelized Q-networks in the Atari domain, we show faster
+learning, a reduction in dead neurons and increased effective rank.
+
+
+
+ comment: 34 pages, 28 figures
+
+
+
+
+
+
+ ♻ ☆ An Interpretable Approach to Load Profile Forecasting in Power Grids
+ using Galerkin-Approximated Koopman Pseudospectra
+
+
+
+
+
+
+
+
+ Ali Tavasoli, Behnaz Moradijamei, Heman Shakeri
+
+
+ This paper presents an interpretable machine learning approach that
+characterizes load dynamics within an operator-theoretic framework for
+electricity load forecasting in power grids. We represent the dynamics of load
+data using the Koopman operator, which provides a linear, infinite-dimensional
+representation of the nonlinear dynamics, and approximate a finite version that
+remains robust against spectral pollution due to truncation. By computing
+$\epsilon$-approximate Koopman eigenfunctions using dynamics-adapted kernels in
+delay coordinates, we decompose the load dynamics into coherent spatiotemporal
+patterns that evolve quasi-independently. Our approach captures temporal
+coherent patterns due to seasonal changes and finer time scales, such as time
+of day and day of the week. This method allows for a more nuanced understanding
+of the complex interactions within power grids and their response to various
+exogenous factors. We assess our method using a large-scale dataset from a
+renewable power system in the continental European electricity system. The
+results indicate that our Koopman-based method surpasses a separately optimized
+deep learning (LSTM) architecture in both accuracy and computational
+efficiency, while providing deeper insights into the underlying dynamics of the
+power grid. The code is available at
+https://github.com/Shakeri-Lab/Power-Grids.
+
+
+
+ comment: 34 pages, 17 figures
+
+
+
+
+
+
+ ♻ ☆ ApisTox: a new benchmark dataset for the classification of small
+ molecules toxicity on honey bees
+
+
+
+
+
+
+
+
+ Jakub Adamczyk, Jakub Poziemski, Pawel Siedlecki
+
+
+ The global decline in bee populations poses significant risks to agriculture,
+biodiversity, and environmental stability. To bridge the gap in existing data,
+we introduce ApisTox, a comprehensive dataset focusing on the toxicity of
+pesticides to honey bees (Apis mellifera). This dataset combines and leverages
+data from existing sources such as ECOTOX and PPDB, providing an extensive,
+consistent, and curated collection that surpasses the previous datasets.
+ApisTox incorporates a wide array of data, including toxicity levels for
+chemicals, details such as time of their publication in literature, and
+identifiers linking them to external chemical databases. This dataset may serve
+as an important tool for environmental and agricultural research, but also can
+support the development of policies and practices aimed at minimizing harm to
+bee populations. Finally, ApisTox offers a unique resource for benchmarking
+molecular property prediction methods on agrochemical compounds, facilitating
+advancements in both environmental science and cheminformatics. This makes it a
+valuable tool for both academic research and practical applications in bee
+conservation.
+
+
+ Thompson sampling (TS) has optimal regret and excellent empirical performance
+in multi-armed bandit problems. Yet, in Bayesian optimization, TS underperforms
+popular acquisition functions (e.g., EI, UCB). TS samples arms according to the
+probability that they are optimal. A recent algorithm, P-Star Sampler (PSS),
+performs such a sampling via Hit-and-Run. We present an improved version,
+Stagger Thompson Sampler (STS). STS more precisely locates the maximizer than
+does TS using less computation time. We demonstrate that STS outperforms TS,
+PSS, and other acquisition methods in numerical experiments of optimizations of
+several test functions across a broad range of dimensions. Additionally, since
+PSS was originally presented not as a standalone acquisition method but as an
+input to a batching algorithm called Minimal Terminal Variance (MTV), we also
+demonstrate that STS matches PSS performance when used as the input to MTV.
+
+
+
+ comment: NeurIPS 2024 Workshop on Bayesian Decision-making and Uncertainty;
+ Poster
+
+
+
+
+
+
+ ♻ ☆ A data driven approach to classify descriptors based on their efficiency
+ in translating noisy trajectories into physically-relevant information
+
+
+
+
+
+
+
+
+ Simone Martino, Domiziano Doria, Chiara Lionello, Matteo Becchi, Giovanni M. Pavan
+
+
+ Reconstructing the physical complexity of many-body dynamical systems can be
+challenging. Starting from the trajectories of their constitutive units (raw
+data), typical approaches require selecting appropriate descriptors to convert
+them into time-series, which are then analyzed to extract interpretable
+information. However, identifying the most effective descriptor is often
+non-trivial. Here, we report a data-driven approach to compare the efficiency
+of various descriptors in extracting information from noisy trajectories and
+translating it into physically relevant insights. As a prototypical system with
+non-trivial internal complexity, we analyze molecular dynamics trajectories of
+an atomistic system where ice and water coexist in equilibrium near the
+solid/liquid transition temperature. We compare general and specific
+descriptors often used in aqueous systems: number of neighbors, molecular
+velocities, Smooth Overlap of Atomic Positions (SOAP), Local Environments and
+Neighbors Shuffling (LENS), Orientational Tetrahedral Order, and distance from
+the fifth neighbor ($d_5$). Using Onion Clustering -- an efficient unsupervised
+method for single-point time-series analysis -- we assess the maximum
+extractable information for each descriptor and rank them via a
+high-dimensional metric. Our results show that advanced descriptors like SOAP
+and LENS outperform classical ones due to higher signal-to-noise ratios.
+Nonetheless, even simple descriptors can rival or exceed advanced ones after
+local signal denoising. For example, $d_5$, initially among the weakest,
+becomes the most effective at resolving the system's non-local dynamical
+complexity after denoising. This work highlights the critical role of noise in
+information extraction from molecular trajectories and offers a data-driven
+approach to identify optimal descriptors for systems with characteristic
+internal complexity.
+
+
+
+ comment: 19 pages, 5 figures + 3 in supporting information (at the bottom of
+ the manuscript)
+
+
+
+
+
+
+ ♻ ☆ A Mathematical Programming Approach to Optimal Classification Forests
+
+
+
+
+
+
+
+
+ Víctor Blanco, Alberto Japón, Justo Puerto, Peter Zhang
+
+
+ This paper introduces Weighted Optimal Classification Forests (WOCFs), a new
+family of classifiers that takes advantage of an optimal ensemble of decision
+trees to derive accurate and interpretable classifiers. We propose a novel
+mathematical optimization-based methodology which simultaneously constructs a
+given number of trees, each of them providing a predicted class for the
+observations in the feature space. The classification rule is derived by
+assigning to each observation its most frequently predicted class among the
+trees. We provide a mixed integer linear programming formulation (MIP) for the
+problem and several novel MIP strengthening / scaling techniques. We report the
+results of our computational experiments, from which we conclude that our
+method has equal or superior performance compared with state-of-the-art
+tree-based classification methods for small to medium-sized instances. We also
+present three real-world case studies showing that our methodology has very
+interesting implications in terms of interpretability. Overall, WOCFs
+complement existing methods such as CART, Optimal Classification Trees, Random
+Forests and XGBoost. In addition to its Pareto improvement on accuracy and
+interpretability, we also see unique properties emerging in terms of different
+trees focusing on different feature variables. This provides nontrivial
+improvement in interpretability and usability of the trained model in terms of
+counterfactual explanation. Thus, despite the apparent computational challenge
+of WOCFs that limits the size of the problems that can be efficiently solved
+with current MIP, this is an important research direction that can lead to
+qualitatively different insights for researchers and complement the toolbox of
+practitioners for high stakes problems.
+
+
+
+
+
+
+
+
+ Geri Skenderi, Luigi Capogrosso, Andrea Toaiari, Matteo Denitto, Franco Fummi, Simone Melzi, Marco Cristani
+
+
+ Auxiliary tasks facilitate learning in situations when data is scarce or the
+principal task of focus is extremely complex. This idea is primarily inspired
+by the improved generalization capability induced by solving multiple tasks
+simultaneously, which leads to a more robust shared representation.
+Nevertheless, finding optimal auxiliary tasks is a crucial problem that often
+requires hand-crafted solutions or expensive meta-learning approaches. In this
+paper, we propose a novel framework, dubbed Detaux, whereby a weakly supervised
+disentanglement procedure is used to discover a new unrelated auxiliary
+classification task, which allows us to go from a Single-Task Learning (STL) to
+a Multi-Task Learning (MTL) problem. The disentanglement procedure works at the
+representation level, isolating the variation related to the principal task
+into an isolated subspace and additionally producing an arbitrary number of
+orthogonal subspaces, each one of them encouraging high separability among the
+projections. We generate the auxiliary classification task through a clustering
+procedure on the most disentangled subspace, obtaining a discrete set of
+labels. Subsequently, the original data, the labels associated with the
+principal task, and the newly discovered ones can be fed into any MTL
+framework. Experimental validation on both synthetic and real data, along with
+various ablation studies, demonstrates promising results, revealing the
+potential in what has been, so far, an unexplored connection between learning
+disentangled representations and MTL. The source code will be made available
+upon acceptance.
+
+
+
+
+
+
+
+ ♻ ☆ Steering Large Language Models using Conceptors: Improving
+ Addition-Based Activation Engineering NeurIPS 2024
+
+
+ Large language models have transformed AI, yet reliably controlling their
+outputs remains a challenge. This paper explores activation engineering, where
+outputs of pre-trained LLMs are controlled by manipulating their activations at
+inference time. Unlike traditional methods using a single steering vector, we
+introduce conceptors - mathematical constructs that represent sets of
+activation vectors as ellipsoidal regions. Conceptors act as soft projection
+matrices and offer more precise control over complex activation patterns. Our
+experiments demonstrate that conceptors outperform traditional methods across
+multiple steering tasks. We further use Boolean operations on conceptors for
+combined steering goals that empirically outperform additively combining
+steering vectors on a set of tasks. These results highlight conceptors as a
+promising tool for more effective steering of LLMs. Our code is available on
+github.com/jorispos/conceptorsteering.
+
+
+
+ comment: Presented at the MINT workshop at NeurIPS 2024
+
+
+
+
+
+
+ ♻ ☆ Fast post-process Bayesian inference with Variational Sparse Bayesian
+ Quadrature
+
+
+
+
+
+
+
+
+ Chengkun Li, Grégoire Clarté, Martin Jørgensen, Luigi Acerbi
+
+
+ In applied Bayesian inference scenarios, users may have access to a large
+number of pre-existing model evaluations, for example from maximum-a-posteriori
+(MAP) optimization runs. However, traditional approximate inference techniques
+make little to no use of this available information. We propose the framework
+of post-process Bayesian inference as a means to obtain a quick posterior
+approximation from existing target density evaluations, with no further model
+calls. Within this framework, we introduce Variational Sparse Bayesian
+Quadrature (VSBQ), a method for post-process approximate inference for models
+with black-box and potentially noisy likelihoods. VSBQ reuses existing target
+density evaluations to build a sparse Gaussian process (GP) surrogate model of
+the log posterior density function. Subsequently, we leverage sparse-GP
+Bayesian quadrature combined with variational inference to achieve fast
+approximate posterior inference over the surrogate. We validate our method on
+challenging synthetic scenarios and real-world applications from computational
+neuroscience. The experiments show that VSBQ builds high-quality posterior
+approximations by post-processing existing optimization traces, with no further
+model evaluations.
+
+
+
+
+
+
+
+ ♻ ☆ LLM-Rank: A Graph Theoretical Approach to Pruning Large Language Models
+
+
+ The evolving capabilities of large language models are accompanied by growing
+sizes and deployment costs, necessitating effective inference optimisation
+techniques. We propose a novel pruning method utilising centrality measures
+from graph theory, reducing both the computational requirements and the memory
+footprint of these models. Specifically, we devise a method for creating a
+weighted directed acyclic graph representation of multilayer perceptrons to
+which we apply a modified version of the weighted PageRank centrality measure
+to compute node importance scores. In combination with uniform pruning, this
+leads to structured sparsity. We call this pruning method MLPRank. Furthermore,
+we introduce an extension to decoder-only transformer models and call it
+LLMRank. Both variants demonstrate strong performance: MLPRank achieves, on
+average, 6.09% higher accuracy retention than three popular baselines, and
+LLMRank achieves 13.42% higher retention than two popular baselines. Code is
+available at https://github.com/amazon-science/llm-rank-pruning.
+
+
+
+
+
+
+
+
+ Corrado Coppola, Lorenzo Papa, Irene Amerini, Laura Palagi
+
+
+ Adaptive gradient methods have been increasingly adopted by the deep learning
+community due to their fast convergence and reduced sensitivity to
+hyper-parameters. However, these methods come with limitations, such as
+increased memory requirements for elements like moving averages and a poorly
+understood convergence theory. To overcome these challenges, we introduce
+F-CMA, a Fast-Controlled Mini-batch Algorithm with a random reshuffling method
+featuring a sufficient decrease condition and a line-search procedure to ensure
+loss reduction per epoch, along with its deterministic proof of global
+convergence to a stationary point. To evaluate the F-CMA, we integrate it into
+conventional training protocols for classification tasks involving both
+convolutional neural networks and vision transformer models, allowing for a
+direct comparison with popular optimizers. Computational tests show significant
+improvements, including a decrease in the overall training time by up to 68%,
+an increase in per-epoch efficiency by up to 20%, and in model accuracy by up
+to 5%.
+
+
+
+ comment: There is an error in the literature review, in section 1. In
+ particular, we noticed that there is a wrong citation, the [65], which has
+ been erroneously associated with another author's claims
+
+
+
+
+
+
+
+ Duc Kieu, Tung Kieu, Peng Han, Bin Yang, Christian S. Jensen, Bac Le
+
+
+ Due to the global trend towards urbanization, people increasingly move to and
+live in cities that then continue to grow. Traffic forecasting plays an
+important role in the intelligent transportation systems of cities as well as
+in spatio-temporal data mining. State-of-the-art forecasting is achieved by
+deep-learning approaches due to their ability to contend with complex
+spatio-temporal dynamics. However, existing methods assume the input is
+fixed-topology road networks and static traffic time series. These assumptions
+fail to align with urbanization, where time series are collected continuously
+and road networks evolve over time. In such settings, deep-learning models
+require frequent re-initialization and re-training, imposing high computational
+costs. To enable much more efficient training without jeopardizing model
+accuracy, we propose the Topological Evolution-aware Framework (TEAM) for
+traffic forecasting that incorporates convolution and attention. This
+combination of mechanisms enables better adaptation to newly collected time
+series, while being able to maintain learned knowledge from old time series.
+TEAM features a continual learning module based on the Wasserstein metric that
+acts as a buffer that can identify the most stable and the most changing
+network nodes. Then, only data related to stable nodes is employed for
+re-training when consolidating a model. Further, only data of new nodes and
+their adjacent nodes as well as data pertaining to changing nodes are used to
+re-train the model. Empirical studies with two real-world traffic datasets
+offer evidence that TEAM is capable of much lower re-training costs than
+existing methods are, without jeopardizing forecasting accuracy.
+
+
+
+ comment: 16 pages. An extended version of "TEAM: Topological Evolution-aware
+ Framework for Traffic Forecasting" accepted at PVLDB 2025
+
+
+
+
+
+
+ ♻ ☆ Towards Evaluating Generalist Agents: An Automated Benchmark in Open
+ World
+
+
+ Evaluating generalist agents presents significant challenges due to their
+wide-ranging abilities and the limitations of current benchmarks in assessing
+true generalization. We introduce the Minecraft Universe (MCU), a fully
+automated benchmarking framework set within the open-world game Minecraft. MCU
+dynamically generates and evaluates a broad spectrum of tasks, offering three
+core components: 1) a task generation mechanism that provides high degrees of
+freedom and variability, 2) an ever-expanding set of over 3K composable atomic
+tasks, and 3) a general evaluation framework that supports open-ended task
+assessment. By integrating large language models (LLMs), MCU dynamically
+creates diverse environments for each evaluation, fostering agent
+generalization. The framework uses a vision-language model (VLM) to
+automatically generate evaluation criteria, achieving over 90% agreement with
+human ratings across multi-dimensional assessments, which demonstrates that MCU
+is a scalable and explainable solution for evaluating generalist agents.
+Additionally, we show that while state-of-the-art foundational models perform
+well on specific tasks, they often struggle with increased task diversity and
+difficulty.
+
+
+
+
+
+
+
+ ♻ ☆ Convergence Analysis for Deep Sparse Coding via Convolutional Neural
+ Networks
+
+
+ In this work, we explore intersections between sparse coding and deep
+learning to enhance our understanding of feature extraction capabilities in
+advanced neural network architectures. We begin by introducing a novel class of
+Deep Sparse Coding (DSC) models and establish a thorough theoretical analysis of
+their uniqueness and stability properties. By applying iterative algorithms to
+these DSC models, we derive convergence rates for convolutional neural networks
+(CNNs) in their ability to extract sparse features. This provides a strong
+theoretical foundation for the use of CNNs in sparse feature learning tasks. We
+additionally extend the convergence analysis to more general neural network
+architectures, including those with diverse activation functions, as well as
+self-attention and transformer-based models. This broadens the applicability of
+our findings to a wide range of deep learning methods for deep sparse feature
+extraction. Inspired by the strong connection between sparse coding and CNNs,
+we also explore training strategies to encourage neural networks to learn more
+sparse features. Through numerical experiments, we demonstrate the
+effectiveness of these approaches, providing valuable insights for the design
+of efficient and interpretable deep learning models.
+
+
+
+
+
+
+
+ ♻ ☆ Powerformer: A Section-adaptive Transformer for Power Flow Adjustment
+
+
+
+
+
+
+
+
+ Kaixuan Chen, Wei Luo, Shunyu Liu, Yaoquan Wei, Yihe Zhou, Yunpeng Qing, Quan Zhang, Jie Song, Mingli Song
+
+
+ In this paper, we present a novel transformer architecture tailored for
+learning robust power system state representations, which strives to optimize
+power dispatch for the power flow adjustment across different transmission
+sections. Specifically, our proposed approach, named Powerformer, develops a
+dedicated section-adaptive attention mechanism, separating itself from the
+self-attention used in conventional transformers. This mechanism effectively
+integrates power system states with transmission section information, which
+facilitates the development of robust state representations. Furthermore, by
+considering the graph topology of the power system and the electrical attributes of
+bus nodes, we introduce two customized strategies to further enhance the
+expressiveness: graph neural network propagation and multi-factor attention
+mechanism. Extensive evaluations are conducted on three power system scenarios,
+including the IEEE 118-bus system, a realistic 300-bus system in China, and a
+large-scale European system with 9241 buses, where Powerformer demonstrates its
+superior performance over several baseline methods.
+
+
+
+ comment: 8 figures
+
+
+
+
+
+
+ ♻ ☆ FRAC-Q-Learning: A Reinforcement Learning with Boredom Avoidance
+ Processes for Social Robots
+
+
+ Reinforcement learning algorithms have often been applied to social
+robots. However, most reinforcement learning algorithms are not optimized for
+use in social robots, and consequently they may bore users. We propose a
+new reinforcement learning method specialized for social robots,
+FRAC-Q-learning, that can avoid user boredom. The proposed algorithm consists
+of a forgetting process in addition to randomizing and categorizing processes.
+This study evaluated the interest and boredom hardness scores of
+FRAC-Q-learning by comparison with traditional Q-learning.
+FRAC-Q-learning showed a significantly higher trend in interest score and
+was significantly harder for users to become bored with than traditional
+Q-learning. Therefore, FRAC-Q-learning can contribute to developing social
+robots that will not bore users. The proposed algorithm has the potential to
+be applied to Web-based communication and educational systems. This paper
+presents the entire process, a detailed implementation, and a detailed
+evaluation method of FRAC-Q-learning for the first time.
+
+
+
+
+
+
+
+ ♻ ☆ Solution space and storage capacity of fully connected two-layer neural
+ networks with generic activation functions
+
+
+ The storage capacity of a binary classification model is the maximum number
+of random input-output pairs per parameter that the model can learn. It is one
+of the indicators of the expressive power of machine learning models and is
+important for comparing the performance of various models. In this study, we
+analyze the structure of the solution space and the storage capacity of fully
+connected two-layer neural networks with general activation functions using the
+replica method from statistical physics. Our results demonstrate that the
+storage capacity per parameter remains finite even with infinite width and that
+the weights of the network exhibit negative correlations, leading to a
+'division of labor'. In addition, we find that increasing the dataset size
+triggers a phase transition at a certain transition point where the permutation
+symmetry of weights is broken, resulting in the solution space splitting into
+disjoint regions. We identify the dependence of this transition point and the
+storage capacity on the choice of activation function. These findings
+contribute to understanding the influence of activation functions and the
+number of parameters on the structure of the solution space, potentially
+offering insights for selecting appropriate architectures based on specific
+objectives.
+
+
+
+ comment: 16+12 pages, 5 figures, 1 table. v2 accepted to Journal of the
+ Physical Society of Japan
+
+
+
+
+
+
+
+ Constantin Ulrich, Tassilo Wald, Emily Tempus, Maximilian Rokuss, Paul F. Jaeger, Klaus Maier-Hein
+
+
+ Current interactive segmentation approaches, inspired by the success of
+META's Segment Anything model, have achieved notable advancements; however,
+they come with substantial limitations that hinder their practical application
+in 3D radiological scenarios. These include unrealistic human interaction
+requirements, such as slice-by-slice operations for 2D models on 3D data, a
+lack of iterative interactive refinement, and insufficient evaluation
+experiments. These shortcomings prevent accurate assessment of model
+performance and lead to inconsistent outcomes across studies. The RadioActive
+benchmark overcomes these challenges by offering a comprehensive and
+reproducible evaluation of interactive segmentation methods in realistic,
+clinically relevant scenarios. It includes diverse datasets, target structures,
+and interactive segmentation methods, and provides a flexible, extendable
+codebase that allows seamless integration of new models and prompting
+strategies. We also introduce advanced prompting techniques to enable 2D models
+on 3D data by reducing the needed number of interaction steps, enabling a fair
+comparison. We show that surprisingly the performance of slice-wise prompted
+approaches can match native 3D methods, despite the domain gap. Our findings
+challenge the current literature and highlight that models not specifically
+trained on medical data can outperform the current specialized medical methods.
+By open-sourcing RadioActive, we invite the research community to integrate
+their models and prompting techniques, ensuring continuous and transparent
+evaluation of interactive segmentation models in 3D medical imaging.
+
+
+
+
+
+
+
+
+ Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, Shuicheng Yan
+
+
+ In deep learning, different kinds of deep networks typically need different
+optimizers, which have to be chosen after multiple trials, making the training
+process inefficient. To relieve this issue and consistently improve the model
+training speed across deep networks, we propose the ADAptive Nesterov momentum
+algorithm, Adan for short. Adan first reformulates the vanilla Nesterov
+acceleration to develop a new Nesterov momentum estimation (NME) method, which
+avoids the extra overhead of computing gradient at the extrapolation point.
+Then, Adan adopts NME to estimate the gradient's first- and second-order
+moments in adaptive gradient algorithms for convergence acceleration. Besides,
+we prove that Adan finds an $\epsilon$-approximate first-order stationary point
+within $\mathcal{O}(\epsilon^{-3.5})$ stochastic gradient complexity on the
+non-convex stochastic problems (e.g., deep learning problems), matching the
+best-known lower bound. Extensive experimental results show that Adan
+consistently surpasses the corresponding SoTA optimizers on vision, language,
+and RL tasks and sets new SoTAs for many popular networks and frameworks, e.g.,
+ResNet, ConvNext, ViT, Swin, MAE, DETR, GPT-2, Transformer-XL, and BERT. More
+surprisingly, Adan can use half of the training cost (epochs) of SoTA
+optimizers to achieve higher or comparable performance on ViT, GPT-2, MAE,
+etc., and also shows great tolerance to a large range of minibatch size, e.g.,
+from 1k to 32k. Code is released at https://github.com/sail-sg/Adan, and has
+been used in multiple popular deep learning frameworks or projects.
+
+
+
+
+
+
+
+ ♻ ☆ LoCo: Low-Bit Communication Adaptor for Large-scale Model Training
+
+
+ To efficiently train large-scale models, low-bit gradient communication
+compresses full-precision gradients on local GPU nodes into low-precision ones
+for higher gradient synchronization efficiency among GPU nodes. However, it
+often degrades training quality due to compression information loss. To address
+this, we propose the Low-bit Communication Adaptor (LoCo), which compensates
+gradients on local GPU nodes before compression, ensuring efficient
+synchronization without compromising training quality. Specifically, LoCo
+designs a moving average of historical compensation errors to stably estimate
+concurrent compression error and then adopts it to compensate for the
+concurrent gradient compression, yielding a less lossy compression. This
+mechanism allows it to be compatible with general optimizers like Adam and
+sharding strategies like FSDP. Theoretical analysis shows that integrating LoCo
+into full-precision optimizers like Adam and SGD does not impair their
+convergence speed on nonconvex problems. Experimental results show that across
+large-scale model training frameworks like Megatron-LM and PyTorch's FSDP, LoCo
+significantly improves communication efficiency, e.g., improving Adam's
+training speed by 14% to 40% without performance degradation on large language
+models like LLAMAs and MoE.
+
+
+
+
+
+
+
+ ♻ ☆ Approximate information maximization for bandit games
+
+
+
+
+
+
+
+
+ Alex Barbier-Chebbah, Christian L. Vestergaard, Jean-Baptiste Masson, Etienne Boursier
+
+
+ Entropy maximization and free energy minimization are general physical
+principles for modeling the dynamics of various physical systems. Notable
+examples include modeling decision-making within the brain using the
+free-energy principle, optimizing the accuracy-complexity trade-off when
+accessing hidden variables with the information bottleneck principle (Tishby et
+al., 2000), and navigation in random environments using information
+maximization (Vergassola et al., 2007). Built on this principle, we propose a
+new class of bandit algorithms that maximize an approximation to the
+information of a key variable within the system. To this end, we develop an
+approximated analytical physics-based representation of an entropy to forecast
+the information gain of each action and greedily choose the one with the
+largest information gain. This method yields strong performance in classical
+bandit settings. Motivated by its empirical success, we prove its asymptotic
+optimality for the two-armed bandit problem with Gaussian rewards. Owing to its
+ability to encompass the system's properties in a global physical functional,
+this approach can be efficiently adapted to more complex bandit settings,
+calling for further investigation of information maximization approaches for
+multi-armed bandit problems.
+
+
+
+
+
+
+
+ ♻ ☆ Parsimonious Dynamic Mode Decomposition: A Robust and Automated Approach
+ for Optimally Sparse Mode Selection in Complex Systems
+
+
+ This paper introduces the Parsimonious Dynamic Mode Decomposition (parsDMD),
+a novel algorithm designed to automatically select an optimally sparse subset
+of dynamic modes for both spatiotemporal and purely temporal data. By
+incorporating time-delay embedding and leveraging Orthogonal Matching Pursuit
+(OMP), parsDMD ensures robustness against noise and effectively handles
+complex, nonlinear dynamics. The algorithm is validated on a diverse range of
+datasets, including standing wave signals, identifying hidden dynamics, fluid
+dynamics simulations (flow past a cylinder and transonic buffet), and
+atmospheric sea-surface temperature (SST) data. ParsDMD addresses a significant
+limitation of the traditional sparsity-promoting DMD (spDMD), which requires
+manual tuning of sparsity parameters through a rigorous trial-and-error process
+to balance between single-mode and all-mode solutions. In contrast, parsDMD
+autonomously determines the optimally sparse subset of modes without user
+intervention, while maintaining minimal computational complexity. Comparative
+analyses demonstrate that parsDMD consistently outperforms spDMD by providing
+more accurate mode identification and effective reconstruction in noisy
+environments. These advantages render parsDMD an effective tool for real-time
+diagnostics, forecasting, and reduced-order model construction across various
+disciplines.
+
+
+
+ comment: 42 pages, 16 Figures
+
+
+
+
+
+
+ ♻ ☆ An Upper Bound for the Distribution Overlap Index and Its Applications
+
+
+ This paper proposes an easy-to-compute upper bound for the overlap index
+between two probability distributions without requiring any knowledge of the
+distribution models. The computation of our bound is time-efficient and
+memory-efficient and only requires finite samples. The proposed bound shows its
+value in one-class classification and domain shift analysis. Specifically, in
+one-class classification, we build a novel one-class classifier by converting
+the bound into a confidence score function. Unlike most one-class classifiers,
+the training process is not needed for our classifier. Additionally, the
+experimental results show that our classifier can be accurate with only a small
+number of in-class samples and outperform many state-of-the-art methods on
+various datasets in different one-class classification scenarios. In domain
+shift analysis, we propose a theorem based on our bound. The theorem is useful
+in detecting the existence of domain shift and inferring data information. The
+detection and inference processes are both computation-efficient and
+memory-efficient. Our work shows significant promise toward broadening the
+applications of overlap-based metrics.
+
+
+
+
+
+
+
+ ♻ ☆ AlphaViT: A Flexible Game-Playing AI for Multiple Games and Variable
+ Board Sizes
+
+
+ This paper presents novel game-playing AI agents based on the AlphaZero
+framework, enhanced with Vision Transformer (ViT): AlphaViT, AlphaViD, and
+AlphaVDA. These agents are designed to play multiple board games of various
+sizes using a single network with shared weights, thereby overcoming
+AlphaZero's limitation of fixed-board-size constraints. AlphaViT employs only a
+transformer encoder, whereas AlphaViD and AlphaVDA incorporate both transformer
+encoders and decoders. In AlphaViD, the decoder processes outputs from the
+encoder, whereas AlphaVDA uses learnable embeddings as the decoder input. The
+additional decoder layers in AlphaViD and AlphaVDA provide flexibility to adapt
+to various action spaces and board sizes. Experimental results show that the
+proposed agents, trained on either individual games or multiple games
+simultaneously, consistently outperform traditional algorithms such as Minimax
+and Monte Carlo Tree Search and approach the performance of AlphaZero, despite
+using a single deep neural network (DNN) with shared weights. In particular,
+AlphaViT shows strong performance across all tested games. Furthermore,
+fine-tuning the DNN using pre-trained weights from small-board games
+accelerates convergence and improves performance, particularly in Gomoku.
+Interestingly, simultaneous training on multiple games yields performance
+comparable to, or even surpassing, single-game training. These results indicate
+the potential of transformer-based architectures to develop more flexible and
+robust game-playing AI agents that excel in multiple games and dynamic
+environments.
+
+
+
+
+
+
+
+ ♻ ☆ Hybridization of Persistent Homology with Neural Networks for
+ Time-Series Prediction: A Case Study in Wave Height
+
+
+
+
+
+
+
+
+ Zixin Lin, Nur Fariha Syaqina Zulkepli, Mohd Shareduwan Mohd Kasihmuddin, R. U. Gobithaasan
+
+
+ Time-series prediction is an active area of research across various fields,
+often challenged by the fluctuating influence of short-term and long-term
+factors. In this study, we introduce a feature engineering method that enhances
+the predictive performance of neural network models. Specifically, we leverage
+computational topology techniques to derive valuable topological features from
+input data, boosting the predictive accuracy of our models. Our focus is on
+predicting wave heights, utilizing models based on topological features within
+feedforward neural networks (FNNs), recurrent neural networks (RNNs), long
+short-term memory networks (LSTM), and RNNs with gated recurrent units (GRU).
+For time-ahead predictions, the enhancements in $R^2$ score were significant
+for FNNs, RNNs, LSTM, and GRU models. Additionally, these models also showed
+significant reductions in maximum errors and mean squared errors.
+
+
+
+ comment: The work has problems in methods and results
+
+
+
+
+
+
+
+
+
+ Multimedia 5
+
+
+
+
+
+ ☆ LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware
+ Omni-Modal Perception of Long Videos
+
+
+ Despite impressive advancements in video understanding, most efforts remain
+limited to coarse-grained or visual-only video tasks. However, real-world
+videos encompass omni-modal information (vision, audio, and speech) with a
+series of events forming a cohesive storyline. The lack of multi-modal video
+data with fine-grained event annotations and the high cost of manual labeling
+are major obstacles to comprehensive omni-modality video perception. To address
+this gap, we propose an automatic pipeline consisting of high-quality
+multi-modal video filtering, semantically coherent omni-modal event boundary
+detection, and cross-modal correlation-aware event captioning. In this way, we
+present LongVALE, the first-ever Vision-Audio-Language Event understanding
+benchmark comprising 105K omni-modal events with precise temporal boundaries
+and detailed relation-aware captions within 8.4K high-quality long videos.
+Further, we build a baseline that leverages LongVALE to enable video large
+language models (LLMs) for omni-modality fine-grained temporal video
+understanding for the first time. Extensive experiments demonstrate the
+effectiveness and great potential of LongVALE in advancing comprehensive
+multi-modal video understanding.
+
+
+
+ comment: 18 pages, 15 figures
+
+
+
+
+
+
+ ☆ Ten Ways in which Virtual Reality Differs from Video Streaming
+
+
+
+
+
+
+
+
+ Gustavo de Veciana, Sonia Fahmy, George Kesidis, Voicu Popescu
+
+
+ Virtual Reality (VR) applications have a number of unique characteristics
+that set them apart from traditional video streaming. These characteristics
+have major implications on the design of VR rendering, adaptation, prefetching,
+caching, and transport mechanisms. This paper contrasts VR to video streaming,
+stored 2D video streaming in particular, and discusses how to rethink system
+and network support for VR.
+
+
+
+
+
+
+
+ ☆ Accelerating Multimodal Large Language Models via Dynamic Visual-Token
+ Exit and the Empirical Findings
+
+
+
+
+
+
+
+
+ Qiong Wu, Wenhao Lin, Weihao Ye, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji
+
+
+ The excessive use of visual tokens in existing Multimodal Large Language
+Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively
+expensive computation. To gain insights into this problem, we first conduct
+extensive empirical studies on the attention behaviors of MLLMs, and summarize
+three main inference stages in MLLMs: (i) Early fusion between tokens is first
+accomplished quickly. (ii) Intra-modality modeling then comes to play. (iii)
+Multimodal reasoning resumes and lasts until the end of inference. In
+particular, we reveal that visual tokens will stop contributing to reasoning
+when the text tokens receive enough image information, yielding obvious visual
+redundancy. Based on these generalized observations, we propose a simple yet
+effective method to improve the efficiency of MLLMs, termed dynamic
+visual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive
+the text token status and decide the removal of all visual tokens after a
+certain layer, thereby addressing the observed visual redundancy. To validate
+VTE, we apply it to a set of MLLMs, including LLaVA, VILA, Eagle and InternVL,
+and conduct extensive experiments on a bunch of benchmarks. The experiment
+results not only show the effectiveness of our VTE in improving MLLMs'
+efficiency, but also yield the general modeling patterns of MLLMs, well
+facilitating the in-depth understanding of MLLMs. Our code is anonymously
+released at https://github.com/DoubtedSteam/DyVTE.
+
+
+
+
+
+
+
+ ☆ Deepfake Media Generation and Detection in the Generative AI Era: A
+ Survey and Outlook
+
+
+
+
+
+
+
+
+ Florinel-Alin Croitoru, Andrei-Iulian Hiji, Vlad Hondru, Nicolae Catalin Ristea, Paul Irofti, Marius Popescu, Cristian Rusu, Radu Tudor Ionescu, Fahad Shahbaz Khan, Mubarak Shah
+
+
+ With the recent advancements in generative modeling, the realism of deepfake
+content has been increasing at a steady pace, even reaching the point where
+people often fail to detect manipulated media content online, thus being
+deceived into various kinds of scams. In this paper, we survey deepfake
+generation and detection techniques, including the most recent developments in
+the field, such as diffusion models and Neural Radiance Fields. Our literature
+review covers all deepfake media types, comprising image, video, audio and
+multimodal (audio-visual) content. We identify various kinds of deepfakes,
+according to the procedure used to alter or generate the fake content. We
+further construct a taxonomy of deepfake generation and detection methods,
+illustrating the important groups of methods and the domains where these
+methods are applied. Next, we gather datasets used for deepfake detection and
+provide updated rankings of the best performing deepfake detectors on the most
+popular datasets. In addition, we develop a novel multimodal benchmark to
+evaluate deepfake detectors on out-of-distribution content. The results
+indicate that state-of-the-art detectors fail to generalize to deepfake content
+generated by unseen deepfake generators. Finally, we propose future directions
+to obtain robust and powerful deepfake detectors. Our project page and new
+benchmark are available at https://github.com/CroitoruAlin/biodeep.
+
+
+
+
+
+
+
+ ☆ Subjective and Objective Quality Assessment Methods of Stereoscopic
+ Videos with Visibility Affecting Distortions
+
+
+ We present two major contributions in this work: 1) we create a full HD
+resolution stereoscopic (S3D) video dataset comprised of 12 reference and 360
+distorted videos. The test stimuli are produced by simulating the five levels
+of fog and haze ambiances on the pristine left and right video sequences. We
+perform subjective analysis on the created video dataset with 24 viewers and
+compute Difference Mean Opinion Scores (DMOS) as quality representative of the
+dataset, 2) an Opinion Unaware (OU) and Distortion Unaware (DU) video quality
+assessment model is developed for S3D videos. We construct cyclopean frames
+from the individual views of an S3D video and partition them into
+nonoverlapping blocks. We analyze the Natural Scene Statistics (NSS) of all
+patches of pristine and test videos, and empirically model the NSS features
+with Univariate Generalized Gaussian Distribution (UGGD). We compute UGGD model
+parameters ($\alpha$, $\beta$) at multiple spatial scales and multiple
+orientations of spherical steerable pyramid decomposition and show that the
+UGGD parameters are distortion discriminable. Further, we perform Multivariate
+Gaussian (MVG) modeling on the pristine and distorted video feature sets and
+compute the corresponding mean vectors and covariance matrices of MVG fits. We
+compute the Bhattacharyya distance measure between mean vectors and covariance
+matrices to estimate the perceptual deviation of a test video from the pristine
+video set. Finally, we pool both distance measures to estimate the overall
+quality score of an S3D video. The performance of the proposed objective
+algorithm is verified on the popular S3D video datasets such as IRCCYN,
+LFOVIAS3DPh1, LFOVIAS3DPh2 and the proposed VAD stereo dataset. The algorithm
+delivers consistent performance across all datasets and shows competitive
+performance against off-the-shelf 2D and 3D image and video quality assessment
+algorithms.
+
+
+
+
+
+
+
+
+ Xi Zhang, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho
+
+
+ Radiology report generation (RRG) is a challenging task, as it requires a
+thorough understanding of medical images, integration of multiple temporal
+inputs, and accurate report generation. Effective interpretation of medical
+images, such as chest X-rays (CXRs), demands sophisticated visual-language
+reasoning to map visual findings to structured reports. Recent studies have
+shown that multimodal large language models (MLLMs) can acquire multimodal
+capabilities by aligning with pre-trained vision encoders. However, current
+approaches predominantly focus on single-image analysis or utilise rule-based
+symbolic processing to handle multiple images, thereby overlooking the
+essential temporal information derived from comparing current images with prior
+ones. To overcome this critical limitation, we introduce Libra, a
+temporal-aware MLLM tailored for CXR report generation using temporal images.
+Libra integrates a radiology-specific image encoder with a MLLM and utilises a
+novel Temporal Alignment Connector to capture and synthesise temporal
+information of images across different time points with unprecedented
+precision. Extensive experiments show that Libra achieves new state-of-the-art
+performance among the same parameter scale MLLMs for RRG tasks on the
+MIMIC-CXR. Specifically, Libra improves the RadCliQ metric by 12.9% and makes
+substantial gains across all lexical metrics compared to previous models.
+
+
+ The Needle-in-a-haystack (NIAH) test is a general task used to assess
+language models' (LMs') abilities to recall particular information from long
+input context. This framework however does not provide a means of analyzing
+what factors, beyond context length, contribute to LMs' abilities or
+inabilities to separate and recall needles from their haystacks. To provide a
+systematic means of assessing what features contribute to LMs' NIAH
+capabilities, we developed a synthetic benchmark called DENIAHL (Data-oriented
+Evaluation of NIAH for LLM's). Our work expands on previous NIAH studies by
+ablating NIAH features beyond typical context length including data type, size,
+and patterns. We find stark differences between GPT-3.5 and LLaMA 2-7B's
+performance on DENIAHL, and drops in recall performance when features like item
+size are increased, and to some degree when data type is changed from numbers
+to letters. This has implications for increasingly large context models,
+demonstrating factors beyond item-number impact NIAH capabilities.
+
+
+ In the era of foundation models, CLIP has emerged as a powerful tool for
+aligning text and visual modalities into a common embedding space. However, the
+alignment objective used to train CLIP often results in subpar visual features
+for fine-grained tasks. In contrast, SSL-pretrained models like DINO excel at
+extracting rich visual features due to their specialized training paradigm.
+Yet, these SSL models require an additional supervised linear probing step,
+which relies on fully labeled data which is often expensive and difficult to
+obtain at scale. In this paper, we propose a label-free prompt-tuning method
+that leverages the rich visual features of self-supervised learning models
+(DINO) and the broad textual knowledge of large language models (LLMs) to
+largely enhance CLIP-based image classification performance using unlabeled
+images. Our approach unfolds in three key steps: (1) We generate robust textual
+feature embeddings that more accurately represent object classes by leveraging
+class-specific descriptions from LLMs, enabling more effective zero-shot
+classification compared to CLIP's default name-specific prompts. (2) These
+textual embeddings are then used to produce pseudo-labels to train an alignment
+module that integrates the complementary strengths of LLM description-based
+textual embeddings and DINO's visual features. (3) Finally, we prompt-tune
+CLIP's vision encoder through DINO-assisted supervision using the trained
+alignment module. This three-step process allows us to harness the best of
+visual and textual foundation models, resulting in a powerful and efficient
+approach that surpasses state-of-the-art label-free classification methods.
+Notably, our framework, NoLA (No Labels Attached), achieves an average absolute
+gain of 3.6% over the state-of-the-art LaFter across 11 diverse image
+classification datasets.
+
+
+
+
+
+
+
+ ☆ Talking to DINO: Bridging Self-Supervised Vision Backbones with Language
+ for Open-Vocabulary Segmentation
+
+
+ Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form
+textual concepts without predefined training classes. While existing
+vision-language models such as CLIP can generate segmentation masks by
+leveraging coarse spatial information from Vision Transformers, they face
+challenges in spatial localization due to their global alignment of image and
+text features. Conversely, self-supervised visual models like DINO excel in
+fine-grained visual encoding but lack integration with language. To bridge this
+gap, we present Talk2DINO, a novel hybrid approach that combines the spatial
+accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns
+the textual embeddings of CLIP to the patch-level features of DINOv2 through a
+learned mapping function without the need to fine-tune the underlying
+backbones. At training time, we exploit the attention maps of DINOv2 to
+selectively align local visual patches with textual embeddings. We show that
+the powerful semantic and localization abilities of Talk2DINO can enhance the
+segmentation process, resulting in more natural and less noisy segmentations,
+and that our approach can also effectively distinguish foreground objects from
+the background. Experimental results demonstrate that Talk2DINO achieves
+state-of-the-art performance across several unsupervised OVS benchmarks. Source
+code and models are publicly available at:
+https://lorebianchi98.github.io/Talk2DINO/.
+
+
+
+
+
+
+
+ ☆ Extracting Information in a Low-resource Setting: Case Study on
+ Bioinformatics Workflows
+
+
+ Bioinformatics workflows are essential for complex biological data analyses
+and are often described in scientific articles with source code in public
+repositories. Extracting detailed workflow information from articles can
+improve accessibility and reusability but is hindered by limited annotated
+corpora. To address this, we framed the problem as a low-resource extraction
+task and tested four strategies: 1) creating a tailored annotated corpus, 2)
+few-shot named-entity recognition (NER) with an autoregressive language model,
+3) NER using masked language models with existing and new corpora, and 4)
+integrating workflow knowledge into NER models. Using BioToFlow, a new corpus
+of 52 articles annotated with 16 entities, a SciBERT-based NER model achieved a
+70.4 F-measure, comparable to inter-annotator agreement. While knowledge
+integration improved performance for specific entities, it was less effective
+across the entire information schema. Our results demonstrate that
+high-performance information extraction for bioinformatics workflows is
+achievable.
+
+
+
+
+
+
+
+ ☆ Consolidating and Developing Benchmarking Datasets for the Nepali
+ Natural Language Understanding Tasks
+
+
+
+
+
+
+
+
+ Jinu Nyachhyon, Mridul Sharma, Prajwal Thapa, Bal Krishna Bal
+
+
+ The Nepali language has distinct linguistic features, especially its complex
+script (Devanagari script), morphology, and various dialects, which pose a
+unique challenge for natural language processing (NLP) evaluation. While the
+Nepali Language Understanding Evaluation (Nep-gLUE) benchmark provides a
+foundation for evaluating models, it remains limited in scope, covering four
+tasks. This restricts its utility for comprehensive assessments of NLP
+models. To address this limitation, we introduce eight new datasets, creating a
+new benchmark, the Nepali Language Understanding Evaluation (NLUE) benchmark,
+which covers a total of 12 tasks for evaluating the performance of models
+across a diverse set of Natural Language Understanding (NLU) tasks. The added
+tasks include single-sentence classification, similarity and paraphrase tasks,
+and Natural Language Inference (NLI) tasks. On evaluating the models using
+added tasks, we observe that the existing models fall short in handling complex
+NLU tasks effectively. This expanded benchmark sets a new standard for
+evaluating, comparing, and advancing models, contributing significantly to the
+broader goal of advancing NLP research for low-resource languages.
+
+
+
+
+
+
+
+ ☆ How far can bias go? -- Tracing bias from pretraining data to alignment
+
+
+
+
+
+
+
+
+ Marion Thaler, Abdullatif Köksal, Alina Leidinger, Anna Korhonen, Hinrich Schütze
+
+
+ As LLMs are increasingly integrated into user-facing applications, addressing
+biases that perpetuate societal inequalities is crucial. While much work has
+gone into measuring or mitigating biases in these models, fewer studies have
+investigated their origins. Therefore, this study examines the correlation
+between gender-occupation bias in pre-training data and its manifestation in
+LLMs, focusing on the Dolma dataset and the OLMo model. Using zero-shot
+prompting and token co-occurrence analyses, we explore how biases in training
+data influence model outputs. Our findings reveal that biases present in
+pre-training data are amplified in model outputs. The study also examines the
+effects of prompt types, hyperparameters, and instruction-tuning on bias
+expression, finding that instruction-tuning partially alleviates representational
+bias while still maintaining overall stereotypical gender associations, whereas
+hyperparameters and prompting variation have a lesser effect on bias
+expression. Our research traces bias throughout the LLM development pipeline
+and underscores the importance of mitigating bias at the pretraining stage.
+
+
+
+
+
+
+
+ ☆ An Extensive Evaluation of Factual Consistency in Large Language Models
+ for Data-to-Text Generation
+
+
+ Large Language Models (LLMs) have shown exceptional performance across
+various Data-to-Text Generation (DTG) tasks. However, generating factually
+consistent text in DTG remains challenging for LLMs. Despite this, in-depth
+evaluations of LLM factual consistency for DTG remain missing in the current
+literature. This paper addresses this gap by providing an extensive evaluation
+of factual consistency in LLMs for DTG. Our evaluation covers five widely used
+DTG datasets (E2E, ViGGo, WikiTableText, DART, and WebNLG) and five prominent
+LLM families (T5, BART, OPT, BLOOM, and Llama 2). To ensure a thorough
+evaluation of factual consistency, we use four state-of-the-art automatic
+metrics and include essential human assessments. Our extensive evaluations
+reveal three key findings regarding factual consistency in LLMs for DTG.
+First, Llama 2 often excels in generating factually consistent text, although
+smaller models like T5 and BART can achieve strong factual consistency on
+larger, lexically less-diverse datasets. Second, the average rate of change
+(AROC) indicates that increasing model size (number of model trainable
+parameters) generally enhances factual consistency of LLMs in DTG. Third, we
+observe that source-reference divergence (i.e., when the reference text
+diverges semantically from the source) typically reduces the factual
+consistency of LLMs in DTG.
+
+
+ The rapid development of Large Multimodal Models (LMMs) has significantly
+advanced multimodal understanding by harnessing the language abilities of Large
+Language Models (LLMs) and integrating modality-specific encoders. However,
+LMMs are plagued by hallucinations that limit their reliability and adoption.
+While traditional methods to detect and mitigate these hallucinations often
+involve costly training or rely heavily on external models, recent approaches
+utilizing internal model features present a promising alternative. In this
+paper, we critically assess the limitations of the state-of-the-art
+training-free technique, the logit lens, in handling generalized visual
+hallucinations. We introduce a refined method that leverages contextual token
+embeddings from middle layers of LMMs. This approach significantly improves
+hallucination detection and grounding across diverse categories, including
+actions and OCR, while also excelling in tasks requiring contextual
+understanding, such as spatial relations and attribute comparison. Our novel
+grounding technique yields highly precise bounding boxes, facilitating a
+transition from Zero-Shot Object Segmentation to Grounded Visual Question
+Answering. Our contributions pave the way for more reliable and interpretable
+multimodal models.
+
+
+
+
+
+
+
+ ☆ Examining Multimodal Gender and Content Bias in ChatGPT-4o
+
+
+ This study investigates ChatGPT-4o's multimodal content generation,
+highlighting significant disparities in its treatment of sexual content and
+nudity versus violent and drug-related themes. Detailed analysis reveals that
+ChatGPT-4o consistently censors sexual content and nudity, while showing
+leniency towards violence and drug use. Moreover, a pronounced gender bias
+emerges, with female-specific content facing stricter regulation compared to
+male-specific content. This disparity likely stems from media scrutiny and
+public backlash over past AI controversies, prompting tech companies to impose
+stringent guidelines on sensitive issues to protect their reputations. Our
+findings emphasize the urgent need for AI systems to uphold genuine ethical
+standards and accountability, transcending mere political correctness. This
+research contributes to the understanding of biases in AI-driven language and
+multimodal models, calling for more balanced and ethical content moderation
+practices.
+
+
+
+ comment: 17 pages, 4 figures, 3 tables. Conference: "14th International
+ Conference on Artificial Intelligence, Soft Computing and Applications (AIAA
+ 2024), London, 23-24 November 2024" It will be published in the proceedings
+ "David C. Wyld et al. (Eds): IoTE, CNDC, DSA, AIAA, NLPTA, DPPR - 2024"
+
+
+
+
+
+
+ ☆ Integration of Contextual Descriptors in Ontology Alignment for
+ Enrichment of Semantic Correspondence
+
+
+ This paper proposes a novel approach to semantic ontology alignment using
+contextual descriptors. A formalization was developed that enables the
+integration of essential and contextual descriptors to create a comprehensive
+knowledge model. The hierarchical structure of the semantic approach and the
+mathematical apparatus for analyzing potential conflicts between concepts,
+particularly in the example of "Transparency" and "Privacy" in the context of
+artificial intelligence, are demonstrated. Experimental studies showed a
+significant improvement in ontology alignment metrics after the implementation
+of contextual descriptors, especially in the areas of privacy, responsibility,
+and freedom & autonomy. The application of contextual descriptors achieved an
+average overall improvement of approximately 4.36%. The results indicate the
+effectiveness of the proposed approach for more accurately reflecting the
+complexity of knowledge and its contextual dependence.
+
+
+
+
+
+
+
+ ☆ VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models
+
+
+
+
+
+
+
+
+ Jeongho Ju, Daeyoung Kim, SunYoung Park, Youngjune Kim
+
+
+ In this paper, we introduce an open-source Korean-English vision-language
+model (VLM), VARCO-VISION. We incorporate a step-by-step training strategy that
+allows a model to learn both linguistic and visual information while preserving
+the backbone model's knowledge. Our model demonstrates outstanding performance
+in diverse settings requiring bilingual image-text understanding and generation
+abilities compared to models of similar size. VARCO-VISION is also capable of
+grounding, referring, and OCR, expanding its usage and potential applications
+for real-world scenarios. In addition to the model, we release five Korean
+evaluation datasets, including four closed-set and one openset benchmarks. We
+anticipate that our milestone will broaden the opportunities for AI researchers
+aiming to train VLMs. VARCO-VISION is available at
+https://huggingface.co/NCSOFT/VARCO-VISION-14B.
+
+
+
+ comment: 24 pages, 15 figures, 4 tables. Model weights at
+ https://huggingface.co/NCSOFT/VARCO-VISION-14B. Benchmarks released at
+ NCSOFT's HuggingFace repositories (K-MMBench, K-SEED, K-MMStar, K-DTCBench,
+ K-LLaVA-W). VARCO-VISION is an open-source Korean-English VLM with OCR,
+ grounding, and referring capabilities
+
+
+
+
+
+
+
+ Sanjay Suryanarayanan, Haiyue Song, Mohammed Safi Ur Rahman Khan, Anoop Kunchukuttan, Mitesh M. Khapra, Raj Dabre
+
+
+ Mining parallel document pairs poses a significant challenge because existing
+sentence embedding models often have limited context windows, preventing them
+from effectively capturing document-level information. Another overlooked issue
+is the lack of concrete evaluation benchmarks comprising high-quality parallel
+document pairs for assessing document-level mining approaches, particularly for
+Indic languages. In this study, we introduce Pralekha, a large-scale benchmark
+for document-level alignment evaluation. Pralekha includes over 2 million
+documents, with a 1:2 ratio of unaligned to aligned pairs, covering 11 Indic
+languages and English. Using Pralekha, we evaluate various document-level
+mining approaches across three dimensions: the embedding models, the
+granularity levels, and the alignment algorithm. To address the challenge of
+aligning documents using sentence and chunk-level alignments, we propose a
+novel scoring method, Document Alignment Coefficient (DAC). DAC demonstrates
+substantial improvements over baseline pooling approaches, particularly in
+noisy scenarios, achieving average gains of 20-30% in precision and 15-20% in
+F1 score. These results highlight DAC's effectiveness in parallel document
+mining for Indic languages.
+
+
+
+ comment: Work in Progress
+
+
+
+
+
+
+ ★ Way to Specialist: Closing Loop Between Specialized LLM and Evolving
+ Domain Knowledge Graph KDD 2025
+
+
+
+
+
+
+
+
+ Yutong Zhang, Lixing Chen, Shenghong Li, Nan Cao, Yang Shi, Jiaxin Ding, Zhe Qu, Pan Zhou, Yang Bai
+
+
+ Large language models (LLMs) have demonstrated exceptional performance across
+a wide variety of domains. Nonetheless, generalist LLMs continue to fall short
+in reasoning tasks necessitating specialized knowledge. Prior investigations
+into specialized LLMs focused on domain-specific training, which entails
+substantial efforts in domain data acquisition and model parameter fine-tuning.
+To address these challenges, this paper proposes the Way-to-Specialist (WTS)
+framework, which synergizes retrieval-augmented generation with knowledge
+graphs (KGs) to enhance the specialized capability of LLMs in the absence of
+specialized training. In distinction to existing paradigms that merely utilize
+external knowledge from general KGs or static domain KGs to prompt LLM for
+enhanced domain-specific reasoning, WTS proposes an innovative
+"LLM$\circlearrowright$KG" paradigm, which achieves bidirectional enhancement
+between specialized LLM and domain knowledge graph (DKG). The proposed paradigm
+encompasses two closely coupled components: the DKG-Augmented LLM and the
+LLM-Assisted DKG Evolution. The former retrieves question-relevant domain
+knowledge from DKG and uses it to prompt LLM to enhance the reasoning
+capability for domain-specific tasks; the latter leverages LLM to generate new
+domain knowledge from processed tasks and use it to evolve DKG. WTS closes the
+loop between DKG-Augmented LLM and LLM-Assisted DKG Evolution, enabling
+continuous improvement in the domain specialization as it progressively answers
+and learns from domain-specific questions. We validate the performance of WTS
+on 6 datasets spanning 5 domains. The experimental results show that WTS
+surpasses the previous SOTA in 4 specialized domains and achieves a maximum
+performance improvement of 11.3%.
+
+
+
+ comment: Accepted by KDD 2025
+
+
+
+
+
+
+ ☆ DIESEL -- Dynamic Inference-Guidance via Evasion of Semantic Embeddings
+ in LLMs
+
+
+ In recent years, conversational large language models (LLMs) have shown
+tremendous success in tasks such as casual conversation, question answering,
+and personalized dialogue, making significant advancements in domains like
+virtual assistance, social interaction, and online customer engagement.
+However, they often generate responses that are not aligned with human values
+(e.g., ethical standards, safety, or social norms), leading to potentially
+unsafe or inappropriate outputs. While several techniques have been proposed to
+address this problem, they come with a cost, requiring computationally
+expensive training or dramatically increasing the inference time. In this
+paper, we present DIESEL, a lightweight inference guidance technique that can
+be seamlessly integrated into any autoregressive LLM to semantically filter
+undesired concepts from the response. DIESEL can function either as a
+standalone safeguard or as an additional layer of defense, enhancing response
+safety by reranking the LLM's proposed tokens based on their similarity to
+predefined negative concepts in the latent space. This approach provides an
+efficient and effective solution for maintaining alignment with human values.
+Our evaluation demonstrates DIESEL's effectiveness on state-of-the-art
+conversational models (e.g., Llama 3), even in challenging jailbreaking
+scenarios that test the limits of response safety. We further show that DIESEL
+can be generalized to use cases other than safety, providing a versatile
+solution for general-purpose response filtering with minimal computational
+overhead.
+
+
+
+
+
+
+
+ ☆ A Survey on Automatic Online Hate Speech Detection in Low-Resource
+ Languages
+
+
+ The expanding influence of social media platforms over the past decade has
+impacted the way people communicate. The level of obscurity provided by social
+media and easy accessibility of the internet have facilitated the spread of hate
+speech. The terms and expressions related to hate speech get updated with
+changing times, which poses an obstacle to policy-makers and researchers
+identifying hate speech. With a growing number of individuals using their
+native languages to communicate with each other, hate speech in these
+low-resource languages is also growing. Although there is awareness of the
+English-related approaches, little attention has been paid to these
+low-resource languages due to lack of datasets and online available data. This
+article provides a detailed survey of hate speech detection in low-resource
+languages around the world with details of available datasets, features
+utilized and techniques used. This survey further discusses the prevailing
+surveys, overlapping concepts related to hate speech, research challenges and
+opportunities.
+
+
+
+ comment: 34 pages, 12 figures
+
+
+
+
+
+
+ ☆ Talking to oneself in CMC: a study of self replies in Wikipedia talk
+ pages
+
+
+ This study proposes a qualitative analysis of self replies in Wikipedia talk
+pages, more precisely when the first two messages of a discussion are written
+by the same user. This specific pattern occurs in more than 10% of threads with
+two messages or more and can be explained by a number of reasons. After a first
+examination of the lexical specificities of second messages, we propose a seven
+categories typology and use it to annotate two reference samples (English and
+French) of 100 threads each. Finally, we analyse and compare the performance of
+human annotators (who reach a reasonable global efficiency) and
+instruction-tuned LLMs (which encounter important difficulties with several
+categories).
+
+
+ Cross-lingual semantic textual relatedness task is an important research task
+that addresses challenges in cross-lingual communication and text
+understanding. It helps establish semantic connections between different
+languages, crucial for downstream tasks like machine translation, multilingual
+information retrieval, and cross-lingual text understanding. Based on extensive
+comparative experiments, we choose XLM-R-base as our base model and use
+pre-trained sentence representations based on whitening to reduce
+anisotropy. Additionally, for the given training data, we design a careful data
+filtering method to alleviate the curse of multilingualism. With our approach,
+we achieve 2nd place in Spanish, 3rd place in Indonesian, and multiple entries in
+the top ten results in the competition's track C. We further do a comprehensive
+analysis to inspire future research aimed at improving performance on
+cross-lingual tasks.
+
+
+
+ comment: 8 pages, 3 figures
+
+
+
+
+
+
+ ☆ Zero-shot Slot Filling in the Age of LLMs for Dialogue Systems COLING 2025
+
+
+ Zero-shot slot filling is a well-established subtask of Natural Language
+Understanding (NLU). However, most existing methods primarily focus on
+single-turn text data, overlooking the unique complexities of conversational
+dialogue. Conversational data is highly dynamic, often involving abrupt topic
+shifts, interruptions, and implicit references that make it difficult to
+directly apply zero-shot slot filling techniques, even with the remarkable
+capabilities of large language models (LLMs). This paper addresses these
+challenges by proposing strategies for automatic data annotation with slot
+induction and black-box knowledge distillation (KD) from a teacher LLM to a
+smaller model, outperforming vanilla LLMs on internal datasets by 26% absolute
+increase in F1 score. Additionally, we introduce an efficient system
+architecture for call center product settings that surpasses off-the-shelf
+extractive models by 34% relative F1 score, enabling near real-time inference
+on dialogue streams with higher accuracy, while preserving low latency.
+
+
+
+ comment: To appear in Proceedings of COLING 2025
+
+
+
+
+
+
+ ☆ Rephrasing Electronic Health Records for Pretraining Clinical Language
+ Models
+
+
+ Clinical language models are important for many applications in healthcare,
+but their development depends on access to extensive clinical text for
+pretraining. However, obtaining clinical notes from electronic health records
+(EHRs) at scale is challenging due to patient privacy concerns. In this study,
+we rephrase existing clinical notes using LLMs to generate synthetic
+pretraining corpora, drawing inspiration from previous work on rephrasing web
+data. We examine four popular small-sized LLMs (<10B) to create synthetic
+clinical text to pretrain both decoder-based and encoder-based language models.
+The method yields better results in language modeling and downstream tasks than
+previous synthesis approaches without referencing real clinical text. We find
+that augmenting original clinical notes with synthetic corpora from different
+LLMs improves performances even at a small token budget, showing the potential
+of this method to support pretraining at the institutional level or be scaled
+to synthesize large-scale clinical corpora.
+
+
+
+
+
+
+
+ ☆ ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large
+ Multimodal Models with Visual Programming Challenges
+
+
+
+
+
+
+
+
+ Rao Fu, Ziyang Luo, Hongzhan Lin, Zhen Ye, Jing Ma
+
+
+ Recent advancements in large multimodal models (LMMs) have showcased
+impressive code generation capabilities, primarily evaluated through
+image-to-code benchmarks. However, these benchmarks are limited to specific
+visual programming scenarios where the logic reasoning and the multimodal
+understanding capacities are split apart. To fill this gap, we propose
+ScratchEval, a novel benchmark designed to evaluate the visual programming
+reasoning ability of LMMs. ScratchEval is based on Scratch, a block-based
+visual programming language widely used in children's programming education. By
+integrating visual elements and embedded programming logic, ScratchEval
+requires the model to process both visual information and code structure,
+thereby comprehensively evaluating its programming intent understanding
+ability. Our evaluation approach goes beyond the traditional image-to-code
+mapping and focuses on unified logical thinking and problem-solving abilities,
+providing a more comprehensive and challenging framework for evaluating the
+visual programming ability of LMMs. ScratchEval not only fills the gap in
+existing evaluation methods, but also provides new insights for the future
+development of LMMs in the field of visual programming. Our benchmark can be
+accessed at https://github.com/HKBUNLP/ScratchEval .
+
+
+
+
+
+
+
+ ☆ The Impact of Example Selection in Few-Shot Prompting on Automated Essay
+ Scoring Using GPT Models
+
+
+ This study investigates the impact of example selection on the performance of
+automated essay scoring (AES) using few-shot prompting with GPT models. We
+evaluate the effects of the choice and order of examples in few-shot prompting
+on several versions of GPT-3.5 and GPT-4 models. Our experiments involve 119
+prompts with different examples, and we calculate the quadratic weighted kappa
+(QWK) to measure the agreement between GPT and human rater scores. Regression
+analysis is used to quantitatively assess biases introduced by example
+selection. The results show that the impact of example selection on QWK varies
+across models, with GPT-3.5 being more influenced by examples than GPT-4. We
+also find evidence of majority label bias, which is a tendency to favor the
+majority label among the examples, and recency bias, which is a tendency to
+favor the label of the most recent example, in GPT-generated essay scores and
+QWK, with these biases being more pronounced in GPT-3.5. Notably, careful
+example selection enables GPT-3.5 models to outperform some GPT-4 models.
+However, among the GPT models, the June 2023 version of GPT-4, which is not the
+latest model, exhibits the highest stability and performance. Our findings
+provide insights into the importance of example selection in few-shot prompting
+for AES, especially in GPT-3.5 models, and highlight the need for individual
+performance evaluations of each model, even for minor versions.
+
+
+
+ comment: Accepted in AIED2024. This preprint has not undergone any
+ post-submission improvements or corrections. The Version of Record of this
+ contribution is published in Communications in Computer and Information
+ Science, vol 2150, and is available online at https://doi.org/
+
+
+
+
+
+
+ ☆ EzSQL: An SQL intermediate representation for improving SQL-to-text
+ Generation
+
+
+ The SQL-to-text generation task traditionally uses template-based, Seq2Seq,
+tree-to-sequence, and graph-to-sequence models. Recent models take advantage of
+pre-trained generative language models for this task in the Seq2Seq framework.
+However, treating SQL as a sequence of inputs to the pre-trained models is not
+optimal. In this work, we put forward a new SQL intermediate representation
+called EzSQL to align SQL with the natural language text sequence. EzSQL
+simplifies the SQL queries and brings them closer to natural language text by
+modifying operators and keywords, which can usually be described in natural
+language. EzSQL also removes the need for set operators. Our proposed
+SQL-to-text generation model uses EzSQL as the input to a pre-trained
+generative language model for generating the text descriptions. We demonstrate
+that our model is an effective state-of-the-art method to generate text
+narrations from SQL queries on the WikiSQL and Spider datasets. We also show
+that by generating pretraining data using our SQL-to-text generation model, we
+can enhance the performance of Text-to-SQL parsers.
+
+
+
+ comment: Under Review at Expert System With Applications Journal
+
+
+
+
+
+
+ ☆ Devising a Set of Compact and Explainable Spoken Language Feature for
+ Screening Alzheimer's Disease SC
+
+
+
+
+
+
+
+
+ Junan Li, Yunxiang Li, Yuren Wang, Xixin Wu, Helen Meng
+
+
+ Alzheimer's disease (AD) has become one of the most significant health
+challenges in an aging society. The use of spoken language-based AD detection
+methods has gained prevalence due to their scalability. Based on the Cookie
+Theft picture description task, we devised an
+explainable and effective feature set that leverages the visual capabilities of
+a large language model (LLM) and the Term Frequency-Inverse Document Frequency
+(TF-IDF) model. Our experimental results show that the newly proposed features
+consistently outperform traditional linguistic features across two different
+classifiers with high dimension efficiency. Our new features can be well
+explained and interpreted step by step, which enhances the interpretability of
+automatic AD screening.
+
+
+
+ comment: Published at ISCSLP 2024
+
+
+
+
+
+
+ ☆ MATATA: a weak-supervised MAthematical Tool-Assisted reasoning for
+ Tabular Applications
+
+
+ Mathematical reasoning capabilities are increasing with tool-augmented
+language agents, but methods often rely either on closed-source or large
+models, external data, or extensive prompt engineering. This work introduces
+MATATA, a novel cost-effective method to train LLM agents for tabular data
+problems through reasoning, planning, and tool use. With a progressive
+self-improvement paradigm and an iterative weak supervision, it empowers
+3.8B/8B Small Language Models (SLMs), particularly suited for local hosting and
+sensitive business contexts where data privacy is crucial. By employing
+flexible and reusable tools across different datasets, it achieves robust
+performance with effective scalability across shared tasks. Experiments show
+that MATATA reaches state-of-the-art performances on FinQA and TAT-QA among
+reasoning frameworks based on open-source models. Moreover, MATATA models
+compete with GPT-4 based frameworks on TabMWP, while being SLMs.
+
+
+
+
+
+
+
+
+ Adam Karvonen, Can Rager, Samuel Marks, Neel Nanda
+
+
+ Sparse Autoencoders (SAEs) are an interpretability technique aimed at
+decomposing neural network activations into interpretable units. However, a
+major bottleneck for SAE development has been the lack of high-quality
+performance metrics, with prior work largely relying on unsupervised proxies.
+In this work, we introduce a family of evaluations based on SHIFT, a downstream
+task from Marks et al. (Sparse Feature Circuits, 2024) in which spurious cues
+are removed from a classifier by ablating SAE features judged to be
+task-irrelevant by a human annotator. We adapt SHIFT into an automated metric
+of SAE quality; this involves replacing the human annotator with an LLM.
+Additionally, we introduce the Targeted Probe Perturbation (TPP) metric that
+quantifies an SAE's ability to disentangle similar concepts, effectively
+scaling SHIFT to a wider range of datasets. We apply both SHIFT and TPP to
+multiple open-source models, demonstrating that these metrics effectively
+differentiate between various SAE training hyperparameters and architectures.
+
+
+
+
+
+
+
+ ☆ ArEEG_Words: Dataset for Envisioned Speech Recognition using EEG for
+ Arabic Words
+
+
+
+
+
+
+
+
+ Hazem Darwish, Abdalrahman Al Malah, Khloud Al Jallad, Nada Ghneim
+
+
+ Brain-Computer-Interface (BCI) aims to support communication-impaired
+patients by translating neural signals into speech. A notable research topic in
+BCI involves Electroencephalography (EEG) signals that measure the electrical
+activity in the brain. While significant advancements have been made in BCI EEG
+research, a major limitation still exists: the scarcity of publicly available
+EEG datasets for non-English languages, such as Arabic. To address this gap, we
+introduce in this paper ArEEG_Words dataset, a novel EEG dataset recorded from
+22 participants with mean age of 22 years (5 female, 17 male) using a
+14-channel Emotiv Epoc X device. The participants were asked to avoid
+anything affecting their nervous system, such as coffee, alcohol, and
+cigarettes, for 8 hours before recording. They were asked to stay calm in a
+calm room while imagining one of the 16 Arabic words for 10 seconds. The words include 16
+commonly used words such as up, down, left, and right. A total of 352 EEG
+recordings were collected, then each recording was divided into multiple 250ms
+signals, resulting in a total of 15,360 EEG signals. To the best of our
+knowledge, the ArEEG_Words data is the first of its kind in the Arabic EEG domain.
+Moreover, it is publicly available to researchers, and we hope it will fill
+the gap in Arabic EEG research.
+
+
+
+ comment: arXiv admin note: substantial text overlap with arXiv:2402.15733
+
+
+
+
+
+
+ ☆ Sneaking Syntax into Transformer Language Models with Tree
+ Regularization
+
+
+
+
+
+
+
+
+ Ananjan Nandi, Christopher D. Manning, Shikhar Murty
+
+
+ While compositional accounts of human language understanding are based on a
+hierarchical tree-like process, neural models like transformers lack a direct
+inductive bias for such tree structures. Introducing syntactic inductive biases
+could unlock more robust and data-efficient learning in transformer language
+models (LMs), but existing methods for incorporating such structure greatly
+restrict models, either limiting their expressivity or increasing inference
+complexity. This work instead aims to softly inject syntactic inductive biases
+into given transformer circuits, through a structured regularizer. We introduce
+TREEREG, an auxiliary loss function that converts bracketing decisions from
+silver parses into a set of differentiable orthogonality constraints on vector
+hidden states. TREEREG integrates seamlessly with the standard LM objective,
+requiring no architectural changes. LMs pre-trained with TreeReg on natural
+language corpora such as WikiText-103 achieve up to 10% lower perplexities on
+out-of-distribution data and up to 9.5 point improvements in syntactic
+generalization, requiring less than half the training data to outperform
+standard LMs. TreeReg still provides gains for pre-trained LLMs: Continued
+pre-training of Sheared Llama with TreeReg results in improved syntactic
+generalization, and fine-tuning on MultiNLI with TreeReg mitigates degradation
+of performance on adversarial NLI benchmarks by 41.2 points.
+
+
+
+ comment: 17 pages, 16 figures, 8 tables
+
+
+
+
+
+
+ ☆ Measuring Risk of Bias in Biomedical Reports: The RoBBR Benchmark
+
+
+ Systems that answer questions by reviewing the scientific literature are
+becoming increasingly feasible. To draw reliable conclusions, these systems
+should take into account the quality of available evidence, placing more weight
+on studies that use a valid methodology. We present a benchmark for measuring
+the methodological strength of biomedical papers, drawing on the risk-of-bias
+framework used for systematic reviews. The four benchmark tasks, drawn from
+more than 500 papers, cover the analysis of research study methodology,
+followed by evaluation of risk of bias in these studies. The benchmark contains
+2000 expert-generated bias annotations, and a human-validated pipeline for
+fine-grained alignment with research paper content. We evaluate a range of
+large language models on the benchmark, and find that these models fall
+significantly short of expert-level performance. By providing a standardized
+tool for measuring judgments of study quality, the benchmark can help to guide
+systems that perform large-scale aggregation of scientific data. The dataset is
+available at https://github.com/RoBBR-Benchmark/RoBBR.
+
+
+
+
+
+
+
+ ♻ ☆ MetaMetrics: Calibrating Metrics For Generation Tasks Using Human
+ Preferences
+
+
+
+
+
+
+
+
+ Genta Indra Winata, David Anugraha, Lucky Susanto, Garry Kuwanto, Derry Tanti Wijaya
+
+
+ Understanding the quality of a performance evaluation metric is crucial for
+ensuring that model outputs align with human preferences. However, it remains
+unclear how well each metric captures the diverse aspects of these preferences,
+as metrics often excel in one particular area but not across all dimensions. To
+address this, it is essential to systematically calibrate metrics to specific
+aspects of human preference, catering to the unique characteristics of each
+aspect. We introduce MetaMetrics, a calibrated meta-metric designed to evaluate
+generation tasks across different modalities in a supervised manner.
+MetaMetrics optimizes the combination of existing metrics to enhance their
+alignment with human preferences. Our metric demonstrates flexibility and
+effectiveness in both language and vision downstream tasks, showing significant
+benefits across various multilingual and multi-domain scenarios. MetaMetrics
+aligns closely with human preferences and is highly extendable and easily
+integrable into any application. This makes MetaMetrics a powerful tool for
+improving the evaluation of generation tasks, ensuring that metrics are more
+representative of human judgment across diverse contexts.
+
+
+
+ comment: Preprint
+
+
+
+
+
+
+ ♻ ☆ WorldCuisines: A Massive-Scale Benchmark for Multilingual and
+ Multicultural Visual Question Answering on Global Cuisines
+
+
+
+
+
+
+
+
+ Genta Indra Winata, Frederikus Hudi, Patrick Amadeus Irawan, David Anugraha, Rifki Afina Putri, Yutong Wang, Adam Nohejl, Ubaidillah Ariq Prathama, Nedjma Ousidhoum, Afifa Amriani, Anar Rzayev, Anirban Das, Ashmari Pramodya, Aulia Adila, Bryan Wilie, Candy Olivia Mawalim, Ching Lam Cheng, Daud Abolade, Emmanuele Chersoni, Enrico Santus, Fariz Ikhwantri, Garry Kuwanto, Hanyang Zhao, Haryo Akbarianto Wibowo, Holy Lovenia, Jan Christian Blaise Cruz, Jan Wira Gotama Putra, Junho Myung, Lucky Susanto, Maria Angelica Riera Machin, Marina Zhukova, Michael Anugraha, Muhammad Farid Adilazuarda, Natasha Santosa, Peerat Limkonchotiwat, Raj Dabre, Rio Alexander Audino, Samuel Cahyawijaya, Shi-Xiong Zhang, Stephanie Yulia Salim, Yi Zhou, Yinxuan Gui, David Ifeoluwa Adelani, En-Shiun Annie Lee, Shogo Okada, Ayu Purwarianti, Alham Fikri Aji, Taro Watanabe, Derry Tanti Wijaya, Alice Oh, Chong-Wah Ngo
+
+
+ Vision Language Models (VLMs) often struggle with culture-specific knowledge,
+particularly in languages other than English and in underrepresented cultural
+contexts. To evaluate their understanding of such knowledge, we introduce
+WorldCuisines, a massive-scale benchmark for multilingual and multicultural,
+visually grounded language understanding. This benchmark includes a visual
+question answering (VQA) dataset with text-image pairs across 30 languages and
+dialects, spanning 9 language families and featuring over 1 million data
+points, making it the largest multicultural VQA benchmark to date. It includes
+tasks for identifying dish names and their origins. We provide evaluation
+datasets in two sizes (12k and 60k instances) alongside a training dataset (1
+million instances). Our findings show that while VLMs perform better with
+correct location context, they struggle with adversarial contexts and
+predicting specific regional cuisines and languages. To support future
+research, we release a knowledge base with annotated food entries and images
+along with the VQA data.
+
+
+
+ comment: Preprint
+
+
+
+
+
+
+ ♻ ☆ HJ-Ky-0.1: an Evaluation Dataset for Kyrgyz Word Embeddings
+
+
+ One of the key tasks in modern applied computational linguistics is
+constructing word vector representations (word embeddings), which are widely
+used to address natural language processing tasks such as sentiment analysis,
+information extraction, and more. To choose an appropriate method for
+generating these word embeddings, quality assessment techniques are often
+necessary. A standard approach involves calculating distances between vectors
+for words with expert-assessed 'similarity'. This work introduces the first
+'silver standard' dataset for such tasks in the Kyrgyz language, alongside
+training corresponding models and validating the dataset's suitability through
+quality evaluation metrics.
+
+
+
+ comment: The translation of the 2023 paper into English
+
+
+
+
+
+
+ ♻ ☆ A Computational Framework for Behavioral Assessment of LLM Therapists
+
+
+ The emergence of large language models (LLMs) like ChatGPT has increased
+interest in their use as therapists to address mental health challenges and the
+widespread lack of access to care. However, experts have emphasized the
+critical need for systematic evaluation of LLM-based mental health
+interventions to accurately assess their capabilities and limitations. Here, we
+propose BOLT, a proof-of-concept computational framework to systematically
+assess the conversational behavior of LLM therapists. We quantitatively measure
+LLM behavior across 13 psychotherapeutic approaches with in-context learning
+methods. Then, we compare the behavior of LLMs against high- and low-quality
+human therapy. Our analysis based on Motivational Interviewing therapy reveals
+that LLMs often resemble behaviors more commonly exhibited in low-quality
+therapy rather than high-quality therapy, such as offering a higher degree of
+problem-solving advice when clients share emotions. However, unlike low-quality
+therapy, LLMs reflect significantly more upon clients' needs and strengths. Our
+findings caution that LLM therapists still require further research for
+consistent, high-quality care.
+
+
+
+
+
+
+
+ ♻ ☆ Confidential Prompting: Protecting User Prompts from Cloud LLM Providers
+
+
+ Our work tackles the challenge of securing user inputs in cloud-hosted large
+language model (LLM) serving while ensuring output invariance, model
+confidentiality, and compute efficiency. We introduce secure multi-party
+decoding (SMD), which leverages confidential computing to confine user prompts
+to a trusted execution environment (TEE), namely a confidential virtual machine
+(CVM), while allowing service providers to generate tokens efficiently. We also
+introduce a novel cryptographic method, prompt obfuscation (PO), to ensure
+robustness against reconstruction attacks on SMD. We demonstrate that our
+approach preserves both prompt confidentiality and LLM serving efficiency. Our
+solution can enable privacy-preserving cloud LLM serving that handles sensitive
+prompts, such as clinical records, financial data, and personal information.
+
+
+ LLMs have emerged as a promising tool for assisting individuals in diverse
+text-generation tasks, including job-related texts. However, LLM-generated
+answers have been increasingly found to exhibit gender bias. This study
+evaluates three LLMs (GPT-3.5, GPT-4, Claude) to conduct a multifaceted audit
+of LLM-generated interview responses across models, question types, and jobs,
+and their alignment with two gender stereotypes. Our findings reveal that
+gender bias is consistent, and closely aligned with gender stereotypes and the
+dominance of jobs. Overall, this study contributes to the systematic
+examination of gender bias in LLM-generated interview responses, highlighting
+the need for a mindful approach to mitigate such biases in related
+applications.
+
+
+
+ comment: Accepted to NeurIPS 2024, SoLaR workshop
+
+
+
+
+
+
+ ♻ ☆ Deliberate Reasoning for LLMs as Structure-aware Planning with Accurate
+ World Model
+
+
+ Enhancing the reasoning capabilities of large language models (LLMs) remains
+a key challenge, especially for tasks that require complex, multi-step
+decision-making. Humans excel at these tasks by leveraging deliberate planning
+with an internal world model to simulate the potential outcomes of various
+actions. Inspired by this, we propose a novel multi-step reasoning framework
+for LLMs, referred to as Structure-aware Planning with Accurate World Model
+(SWAP). Unlike previous approaches that rely solely on Chain-of-Thought (CoT)
+reasoning in natural language, SWAP incorporates structural information to
+guide the reasoning process via a world model and provides a soft verification
+mechanism over the steps. Moreover, SWAP overcomes the challenge of accurate
+world state predictions in complex reasoning tasks by introducing a
+Generator-Discriminator architecture, which enables more reliable world
+modeling. Specifically, the generator predicts the next state, and the
+discriminator ensures alignment with the logical consistency required by the
+problem context. SWAP also encourages the policy model to explore a broad range
+of potential actions to prevent premature convergence. By resolving the
+bottlenecks of generation diversity for both actions and states using
+diversity-based modeling (DBM) and improving discrimination accuracy through
+contrastive ranking (CR), SWAP significantly enhances the reasoning performance
+of LLMs. We evaluate SWAP across diverse reasoning-intensive benchmarks
+including math reasoning, logical reasoning, and coding tasks. Extensive
+experiments demonstrate that SWAP achieves substantial improvements over the
+baselines and consistently outperforms existing methods.
+
+
+
+
+
+
+
+ ♻ ☆ Memorization of Named Entities in Fine-tuned BERT Models
+
+
+ Privacy preserving deep learning is an emerging field in machine learning
+that aims to mitigate the privacy risks in the use of deep neural networks. One
+such risk is training data extraction from language models that have been
+trained on datasets, which contain personal and privacy sensitive information.
+In our study, we investigate the extent of named entity memorization in
+fine-tuned BERT models. We use single-label text classification as a
+representative downstream task and employ three different fine-tuning setups in
+our experiments, including one with Differential Privacy (DP). We create a
+large number of text samples from the fine-tuned BERT models utilizing a custom
+sequential sampling strategy with two prompting strategies. We search in these
+samples for named entities and check if they are also present in the
+fine-tuning datasets. We experiment with two benchmark datasets in the domains
+of emails and blogs. We show that the application of DP has a detrimental
+effect on the text generation capabilities of BERT. Furthermore, we show that a
+fine-tuned BERT does not generate more named entities specific to the
+fine-tuning dataset than a BERT model that is pre-trained only. This suggests
+that BERT is unlikely to emit personal or privacy sensitive named entities.
+Overall, our results are important to understand to what extent BERT-based
+services are prone to training data extraction attacks.
+
+
+
+ comment: published at CD-MAKE 2023
+
+
+
+
+
+
+ ♻ ☆ Do Automatic Factuality Metrics Measure Factuality? A Critical
+ Evaluation
+
+
+ Modern LLMs can now produce highly readable abstractive summaries, to the
+point where traditional automated metrics for evaluating summary quality, such
+as ROUGE, have become saturated. However, LLMs still sometimes introduce
+unwanted content into summaries, i.e., information inconsistent with or
+unsupported by their source. Measuring the occurrence of these often subtle
+``hallucinations'' automatically has proved to be challenging. This in turn has
+motivated development of a variety of metrics intended to measure the factual
+consistency of generated summaries against their source. But are these
+approaches measuring what they purport to do? In this work, we stress-test
+automatic factuality metrics. Specifically, we investigate whether and to what
+degree superficial attributes of summary texts suffice to predict
+``factuality'', finding that a (supervised) model using only such shallow
+features is reasonably competitive with SOTA factuality scoring methods. We
+then evaluate how factuality metrics respond to factual corrections in
+inconsistent summaries and find that only a few show meaningful improvements.
+In contrast, some metrics are more sensitive to benign, non-factual edits.
+Motivated by these insights, we show that one can ``game'' (most) automatic
+factuality metrics, i.e., reliably inflate ``factuality'' scores by appending
+innocuous sentences to generated summaries. Taken together, our results raise
+questions about the degree to which we should rely on existing automated
+factuality metrics and what exactly we want ``factuality metrics'' to measure.
+
+
+
+
+
+
+
+ ♻ ☆ On Evaluating The Performance of Watermarked Machine-Generated Texts
+ Under Adversarial Attacks
+
+
+ Large Language Models (LLMs) excel in various applications, including text
+generation and complex tasks. However, the misuse of LLMs raises concerns about
+the authenticity and ethical implications of the content they produce, such as
+deepfake news, academic fraud, and copyright infringement. Watermarking
+techniques, which embed identifiable markers in machine-generated text, offer a
+promising solution to these issues by allowing for content verification and
+origin tracing. Unfortunately, the robustness of current LLM watermarking
+schemes under potential watermark removal attacks has not been comprehensively
+explored.
+ In this paper, to fill this gap, we first systematically comb the mainstream
+watermarking schemes and removal attacks on machine-generated texts, and then
+we categorize them into pre-text (before text generation) and post-text (after
+text generation) classes so that we can conduct diversified analyses. In our
+experiments, we evaluate eight watermarks (five pre-text, three post-text) and
+twelve attacks (two pre-text, ten post-text) across 87 scenarios. Evaluation
+results indicate that (1) KGW and Exponential watermarks offer high text
+quality and watermark retention but remain vulnerable to most attacks; (2)
+Post-text attacks are found to be more efficient and practical than pre-text
+attacks; (3) Pre-text watermarks are generally more imperceptible, as they do
+not alter text fluency, unlike post-text watermarks; (4) Additionally, combined
+attack methods can significantly increase effectiveness, highlighting the need
+for more robust watermarking solutions. Our study underscores the
+vulnerabilities of current techniques and the necessity for developing more
+resilient schemes.
+
+
+
+
+
+
+
+ ♻ ☆ Shortcut Learning in In-Context Learning: A Survey
+
+
+ Shortcut learning refers to the phenomenon where models employ simple,
+non-robust decision rules in practical tasks, which hinders their
+generalization and robustness. With the rapid development of large language
+models (LLMs) in recent years, an increasing number of studies have shown the
+impact of shortcut learning on LLMs. This paper provides a novel perspective to
+review relevant research on shortcut learning in In-Context Learning (ICL). It
+conducts a detailed exploration of the types of shortcuts in ICL tasks, their
+causes, available benchmarks, and strategies for mitigating shortcuts. Based on
+corresponding observations, it summarizes the unresolved issues in existing
+research and attempts to outline the future research landscape of shortcut
+learning.
+
+
+
+ comment: 20 pages, 7 figures
+
+
+
+
+
+
+ ♻ ☆ Assessing biomedical knowledge robustness in large language models by
+ query-efficient sampling attacks
+
+
+
+
+
+
+
+
+ R. Patrick Xian, Alex J. Lee, Satvik Lolla, Vincent Wang, Qiming Cui, Russell Ro, Reza Abbasi-Asl
+
+
+ The increasing depth of parametric domain knowledge in large language models
+(LLMs) is fueling their rapid deployment in real-world applications.
+Understanding model vulnerabilities in high-stakes and knowledge-intensive
+tasks is essential for quantifying the trustworthiness of model predictions and
+regulating their use. The recent discovery of named entities as adversarial
+examples (i.e. adversarial entities) in natural language processing tasks
+raises questions about their potential impact on the knowledge robustness of
+pre-trained and finetuned LLMs in high-stakes and specialized domains. We
+examined the use of type-consistent entity substitution as a template for
+collecting adversarial entities for billion-parameter LLMs with biomedical
+knowledge. To this end, we developed an embedding-space attack based on
+powerscaled distance-weighted sampling to assess the robustness of their
+biomedical knowledge with a low query budget and controllable coverage. Our
+method has favorable query efficiency and scaling over alternative approaches
+based on random sampling and blackbox gradient-guided search, which we
+demonstrated for adversarial distractor generation in biomedical question
+answering. Subsequent failure mode analysis uncovered two regimes of
+adversarial entities on the attack surface with distinct characteristics and we
+showed that entity substitution attacks can manipulate token-wise Shapley value
+explanations, which become deceptive in this setting. Our approach complements
+standard evaluations for high-capacity models and the results highlight the
+brittleness of domain knowledge in LLMs.
+
+
+
+ comment: 31 pages incl. appendix, accepted by TMLR
+
+
+
+
+
+
+ ♻ ☆ A Survey on Vision-Language-Action Models for Embodied AI
+
+
+ Deep learning has demonstrated remarkable success across many domains,
+including computer vision, natural language processing, and reinforcement
+learning. Representative artificial neural networks in these fields span
+convolutional neural networks, Transformers, and deep Q-networks. Built upon
+unimodal neural networks, numerous multi-modal models have been introduced to
+address a range of tasks such as visual question answering, image captioning,
+and speech recognition. The rise of instruction-following robotic policies in
+embodied AI has spurred the development of a novel category of multi-modal
+models known as vision-language-action models (VLAs). Their multi-modality
+capability has become a foundational element in robot learning. Various methods
+have been proposed to enhance traits such as versatility, dexterity, and
+generalizability. Some models focus on refining specific components. Others aim
+to develop control policies adept at predicting low-level actions. Certain VLAs
+serve as high-level task planners capable of decomposing long-horizon tasks
+into executable subtasks. Over the past few years, a myriad of VLAs have
+emerged, reflecting the rapid advancement of embodied AI. Therefore, it is
+imperative to capture the evolving landscape through a comprehensive survey.
+
+
+
+ comment: 17 pages, a survey of vision-language-action models
+
+
+
+
+
+
+ ♻ ☆ Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities
+ Using Web Instructional Videos WACV 2025
+
+
+ We propose a novel benchmark for cross-view knowledge transfer of dense video
+captioning, adapting models from web instructional videos with exocentric views
+to an egocentric view. While dense video captioning (predicting time segments
+and their captions) is primarily studied with exocentric videos (e.g.,
+YouCook2), benchmarks with egocentric videos are restricted due to data
+scarcity. To overcome the limited video availability, transferring knowledge
+from abundant exocentric web videos is a practical approach.
+However, learning the correspondence between exocentric and egocentric views is
+difficult due to their dynamic view changes. The web videos contain shots
+showing either full-body or hand regions, while the egocentric view is
+constantly shifting. This necessitates the in-depth study of cross-view
+transfer under complex view changes. To this end, we first create a real-life
+egocentric dataset (EgoYC2) whose captions follow the definition of YouCook2
+captions, enabling transfer learning between these datasets with access to
+their ground-truth. To bridge the view gaps, we propose a view-invariant
+learning method using adversarial training, which consists of pre-training and
+fine-tuning stages. Our experiments confirm the effectiveness of overcoming the
+view change problem and knowledge transfer to egocentric views. Our benchmark
+pushes the study of cross-view transfer into a new task domain of dense video
+captioning and envisions methodologies that describe egocentric videos in
+natural language.
+
+
+
+ comment: Accepted to WACV 2025
+
+
+
+
+
+
+ ♻ ☆ Bone: Block-Affine Adaptation of Large Language Models
+
+
+ Low-Rank Adaptation (LoRA) has achieved remarkable training results by
+freezing the original weights and training only low-rank matrices, establishing
+itself as the predominant fine-tuning method for LLMs. Many LoRA variants have
+emerged, yet they lack a design tailored to the characteristics of LLM weights
+and fail to leverage the original weights effectively. To address the sparsity
+of LLM weights, and drawing inspiration from GQA and MQA, we propose
+Block-Affine Adaptation (Bone), a novel PEFT technique distinct from LoRA. By
+dividing the original weights into multiple subspaces that share a single
+matrix for weight updates, Bone simplifies the process by requiring the
+trainable matrix to be initialized to zero, eliminating the need for complex
+initialization as in some LoRA variants. Compared to LoRA, Bone significantly
+reduces memory usage and achieves faster computation. Evaluation on both NLU
+and NLG tasks demonstrates that Bone substantially outperforms LoRA and its
+variants. Inspired by PiSSA, we propose a new theory called "Weight Guide" to
+better utilize the information embedded in the original weights. This approach
+extracts valuable information through a linear transformation of the original
+weight matrix using a trainable matrix. To validate the effectiveness of
+"Weight Guide" we combined it with Bone to create a new structure called
+Block-Affine Transformation (Bat), and ablation experiments confirmed the
+effectiveness of "Weight Guide".
+
+
+ We present hyper-connections, a simple yet effective method that can serve as
+an alternative to residual connections. This approach specifically addresses
+common drawbacks observed in residual connection variants, such as the seesaw
+effect between gradient vanishing and representation collapse. Theoretically,
+hyper-connections allow the network to adjust the strength of connections
+between features at different depths and dynamically rearrange layers. We
+conduct experiments focusing on the pre-training of large language models,
+including dense and sparse models, where hyper-connections show significant
+performance improvements over residual connections. Additional experiments
+conducted on vision tasks also demonstrate similar improvements. We anticipate
+that this method will be broadly applicable and beneficial across a wide range
+of AI problems.
+
+
+ Unsupervised multitask pre-training has been the critical method behind the
+recent success of language models (LMs). However, supervised multitask learning
+still holds significant promise, as scaling it in the post-training stage
+trends towards better generalization. In this paper, we explore supervised
+multitask pre-training by proposing Instruction Pre-Training, a framework that
+scalably augments massive raw corpora with instruction-response pairs to
+pre-train LMs. The instruction-response pairs are generated by an efficient
+instruction synthesizer built on open-source models. In our experiments, we
+synthesize 200M instruction-response pairs covering 40+ task categories to
+verify the effectiveness of Instruction Pre-Training. In pre-training from
+scratch, Instruction Pre-Training not only consistently enhances pre-trained
+base models but also benefits more from further instruction tuning. In
+continual pre-training, Instruction Pre-Training enables Llama3-8B to be
+comparable to or even outperform Llama3-70B. Our model, code, and data are
+available at https://github.com/microsoft/LMOps.
+
+
+
+ comment: EMNLP 2024 Main Conference
+
+
+
+
+
+
+ ♻ ☆ Large Language Model-Brained GUI Agents: A Survey
+
+
+ GUIs have long been central to human-computer interaction, providing an
+intuitive and visually-driven way to access and interact with digital systems.
+The advent of LLMs, particularly multimodal models, has ushered in a new era of
+GUI automation. They have demonstrated exceptional capabilities in natural
+language understanding, code generation, and visual processing. This has paved
+the way for a new generation of LLM-brained GUI agents capable of interpreting
+complex GUI elements and autonomously executing actions based on natural
+language instructions. These agents represent a paradigm shift, enabling users
+to perform intricate, multi-step tasks through simple conversational commands.
+Their applications span web navigation, mobile app interactions, and
+desktop automation, offering a transformative user experience that
+revolutionizes how individuals interact with software. This emerging field is
+rapidly advancing, with significant progress in both research and industry.
+ To provide a structured understanding of this trend, this paper presents a
+comprehensive survey of LLM-brained GUI agents, exploring their historical
+evolution, core components, and advanced techniques. We address research
+questions such as existing GUI agent frameworks, the collection and utilization
+of data for training specialized GUI agents, the development of large action
+models tailored for GUI tasks, and the evaluation metrics and benchmarks
+necessary to assess their effectiveness. Additionally, we examine emerging
+applications powered by these agents. Through a detailed analysis, this survey
+identifies key research gaps and outlines a roadmap for future advancements in
+the field. By consolidating foundational knowledge and state-of-the-art
+developments, this work aims to guide both researchers and practitioners in
+overcoming challenges and unlocking the full potential of LLM-brained GUI
+agents.
+
+
+
+ comment: The collection of papers reviewed in this survey will be hosted and
+ regularly updated on the GitHub repository:
+ https://github.com/vyokky/LLM-Brained-GUI-Agents-Survey Additionally, a
+ searchable webpage is available at https://aka.ms/gui-agent for easier access
+ and exploration
+
+ Graphical User Interface (GUI) grounding plays a crucial role in enhancing
+the capabilities of Vision-Language Model (VLM) agents. While general VLMs,
+such as GPT-4V, demonstrate strong performance across various tasks, their
+proficiency in GUI grounding remains suboptimal. Recent studies have focused on
+fine-tuning these models specifically for one-shot GUI grounding, yielding
+significant improvements over baseline performance. We introduce a visual
+prompting framework that employs an iterative narrowing mechanism to improve
+the performance of both general and fine-tuned models in GUI grounding by up to
+61%. For evaluation, we tested our method on a comprehensive benchmark
+comprising various UI platforms and provided the code to reproduce our results.
+
+
+
+ comment: Code available at
+ https://github.com/ant-8/GUI-Grounding-via-Iterative-Narrowing
+
+
+
+
+
+
+ ♻ ☆ AgentGen: Enhancing Planning Abilities for Large Language Model based
+ Agent via Environment and Task Generation KDD 2025
+
+
+ Large Language Model-based agents have garnered significant attention and are
+becoming increasingly popular. Furthermore, planning ability is a crucial
+component of an LLM-based agent, which generally entails achieving a desired
+goal from an initial state. This paper investigates enhancing the planning
+abilities of LLMs through instruction tuning, referred to as agent training.
+Recent studies have demonstrated that utilizing expert-level trajectory for
+instruction-tuning LLMs effectively enhances their planning capabilities.
+However, existing work primarily focuses on synthesizing trajectories from
+manually designed planning tasks and environments. The labor-intensive nature
+of creating these environments and tasks impedes the generation of sufficiently
+varied and extensive trajectories. To address this limitation, this paper
+explores the automated synthesis of diverse environments and a gradual range of
+planning tasks, from easy to difficult. We introduce a framework, AgentGen,
+that leverages LLMs first to generate environments and subsequently generate
+planning tasks conditioned on these environments. Specifically, to improve
+environmental diversity, we propose using an inspiration corpus composed of
+various domain-specific text segments as the context for synthesizing
+environments. Moreover, to increase the difficulty diversity of generated
+planning tasks, we propose a bidirectional evolution method, Bi-Evol, that
+evolves planning tasks from easier and harder directions to synthesize a task
+set with a smoother difficulty curve. The evaluation results derived from
+AgentBoard show that AgentGen greatly improves LLMs' planning ability, e.g.,
+the AgentGen instruction-tuned Llama-3.1-8B surpasses GPT-3.5 in overall
+performance. Moreover, the AgentGen-tuned Llama-3.1-70B model achieves
+state-of-the-art results in planning tasks.
+
+
+
+ comment: Accepted by KDD 2025 (Research Track)
+
+
+
+
+
+
+ ♻ ☆ Don't Command, Cultivate: An Exploratory Study of System-2 Alignment
+
+
+ The o1 system card identifies the o1 models as the most robust within OpenAI,
+with their defining characteristic being the progression from rapid, intuitive
+thinking to slower, more deliberate reasoning. This observation motivated us to
+investigate the influence of System-2 thinking patterns on model safety. In our
+preliminary research, we conducted safety evaluations of the o1 model,
+including complex jailbreak attack scenarios using adversarial natural language
+prompts and mathematical encoding prompts. Our findings indicate that the o1
+model demonstrates relatively improved safety performance; however, it still
+exhibits vulnerabilities, particularly against jailbreak attacks employing
+mathematical encoding. Through detailed case analysis, we identified specific
+patterns in the o1 model's responses. We also explored the alignment of
+System-2 safety in open-source models using prompt engineering and supervised
+fine-tuning techniques. Experimental results show that some simple methods to
+encourage the model to carefully scrutinize user requests are beneficial for
+model safety. Additionally, we proposed an implementation plan for process
+supervision to enhance safety alignment. The implementation details and
+experimental results will be provided in future versions.
+
+
+
+ comment: Preprint version, more results will be updated
+
+
+
+
+
+
+ ♻ ☆ Length Desensitization in Direct Preference Optimization
+
+
+
+
+
+
+
+
+ Wei Liu, Yang Bai, Chengcheng Han, Rongxiang Weng, Jun Xu, Xuezhi Cao, Jingang Wang, Xunliang Cai
+
+
+ Direct Preference Optimization (DPO) is widely utilized in the Reinforcement
+Learning from Human Feedback (RLHF) phase to align Large Language Models (LLMs)
+with human preferences, thereby enhancing both their harmlessness and efficacy.
+However, it has been observed that DPO tends to over-optimize for verbosity,
+which can detrimentally affect both performance and user experience. In this
+paper, we conduct an in-depth theoretical analysis of DPO's optimization
+objective and reveal a strong correlation between its implicit reward and data
+length. This correlation misguides the optimization direction, resulting in
+length sensitivity during the DPO training and leading to verbosity. To address
+this issue, we propose a length-desensitization improvement method for DPO,
+termed LD-DPO. The proposed method aims to desensitize DPO to data length by
+decoupling explicit length preference, which is relatively insignificant, from
+the other implicit preferences, thereby enabling more effective learning of the
+intrinsic preferences. We utilized two settings (Base and Instruct) of
+Llama2-13B, Llama3-8B, and Qwen2-7B for experimental validation on various
+benchmarks including MT-Bench and AlpacaEval 2. The experimental results
+indicate that LD-DPO consistently outperforms DPO and other baseline methods,
+achieving more concise responses with a 10-40% reduction in length compared to
+DPO. We conducted in-depth experimental analyses to demonstrate that LD-DPO can
+indeed achieve length desensitization and align the model more closely with
+human-like preferences.
+
+
+
+ comment: 21 pages, 9 figures
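For illustration only, and not taken from the paper: the sketch below writes out the standard DPO objective, in which the implicit reward of a response is beta times the summed per-token log-probability ratio between the policy and the reference model. Because the reward is a sum over tokens, a longer response with the same per-token ratio receives a larger reward, which is the kind of length correlation the abstract describes. The function names, the toy log-probabilities, and `beta=0.1` are assumptions; this is plain DPO, not the LD-DPO method.

```python
import math


def implicit_reward(policy_token_logps, ref_token_logps, beta=0.1):
    """DPO implicit reward: beta * (log pi(y|x) - log pi_ref(y|x)).

    Sequence log-probabilities are sums of per-token log-probabilities,
    so the reward grows with the number of tokens when the per-token
    log-ratio stays the same.
    """
    return beta * (sum(policy_token_logps) - sum(ref_token_logps))


def dpo_loss(pol_chosen, ref_chosen, pol_rejected, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log sigmoid(reward margin)."""
    margin = (implicit_reward(pol_chosen, ref_chosen, beta)
              - implicit_reward(pol_rejected, ref_rejected, beta))
    # -log(sigmoid(m)) == log(1 + exp(-m))
    return math.log1p(math.exp(-margin))


# Toy data (hypothetical): identical per-token log-ratio of 0.05,
# but the second response is four times longer.
short_reward = implicit_reward([-1.0] * 10, [-1.05] * 10)
long_reward = implicit_reward([-1.0] * 40, [-1.05] * 40)
print(short_reward, long_reward)
```

In this toy setting the longer response gets the larger implicit reward purely because it has more tokens, even though each token is preferred by the same amount.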
+
+
+
+
+
+
+ ♻ ☆ Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to
+ Jailbreak Large Vision-Language Models
+
+
+ Recent advances in Large Vision-Language Models (LVLMs) have showcased strong
+reasoning abilities across multiple modalities, achieving significant
+breakthroughs in various real-world applications. Despite this great success,
+the safety guardrail of LVLMs may not cover the unforeseen domains introduced
+by the visual modality. Existing studies primarily focus on eliciting LVLMs to
+generate harmful responses via carefully crafted image-based jailbreaks
+designed to bypass alignment defenses. In this study, we reveal that a safe
+image can be exploited to achieve the same jailbreak consequence when combined
+with additional safe images and prompts. This stems from two fundamental
+properties of LVLMs: universal reasoning capabilities and safety snowball
+effect. Building on these insights, we propose Safety Snowball Agent (SSA), a
+novel agent-based framework leveraging agents' autonomous and tool-using
+abilities to jailbreak LVLMs. SSA operates through two principal stages: (1)
+initial response generation, where tools generate or retrieve jailbreak images
+based on potential harmful intents, and (2) harmful snowballing, where refined
+subsequent prompts induce progressively harmful outputs. Our experiments
+demonstrate that SSA can use nearly any image to induce LVLMs to produce
+unsafe content, achieving high jailbreak success rates against the latest
+LVLMs. Unlike prior works that exploit alignment flaws, SSA leverages the
+inherent properties of LVLMs, presenting a profound challenge for enforcing
+safety in generative multimodal systems. Our code is available at
+https://github.com/gzcch/Safety_Snowball_Agent.
+
+
+
+
+
+
+
+ ♻ ☆ MiniKV: Pushing the Limits of LLM Inference via 2-Bit
+ Layer-Discriminative KV Cache
+
+
+ How to efficiently serve LLMs in practice has become exceptionally
+challenging due to their prohibitive memory and computation requirements. In
+this study, we investigate optimizing the KV cache, whose memory footprint
+poses a critical bottleneck in LLM inference, especially when dealing with long
+context tasks. To tackle the challenge, we introduce MiniKV, a KV cache
+optimization method that simultaneously preserves long context task accuracy
+while significantly reducing KV cache size via a novel 2-bit
+layer-discriminative KV cache. More importantly, we develop specialized CUDA
+kernels to make MiniKV compatible with FlashAttention. Experiments on a wide
+range of long context tasks show that MiniKV effectively achieves 86% KV cache
+compression ratio while recovering over 98.5% of accuracy, outperforming
+state-of-the-art methods while achieving excellent measured system performance
+improvements.
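As a rough illustration of what storing values in 2 bits involves, the sketch below applies generic uniform min-max quantization to a list of floats and packs four 2-bit codes into each byte. It is not MiniKV's layer-discriminative scheme or its CUDA kernels; the function names and the per-vector min-max scaling are assumptions chosen for brevity.

```python
def quantize_2bit(values):
    """Map floats to integer codes 0..3 using per-vector min-max scaling."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 or 1.0  # avoid division by zero for constant input
    codes = [min(3, max(0, round((v - lo) / scale))) for v in values]
    return codes, lo, scale


def pack_2bit(codes):
    """Pack four 2-bit codes into each byte, lowest bits first."""
    packed = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, code in enumerate(codes[i:i + 4]):
            byte |= code << (2 * j)
        packed.append(byte)
    return bytes(packed)


def unpack_2bit(packed, n):
    """Recover the first n 2-bit codes from the packed bytes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]


def dequantize_2bit(codes, lo, scale):
    """Approximate the original floats from codes and scaling parameters."""
    return [lo + code * scale for code in codes]


values = [0.0, 1.0, 2.0, 3.0, 1.5, 0.2]
codes, lo, scale = quantize_2bit(values)
packed = pack_2bit(codes)
restored = dequantize_2bit(unpack_2bit(packed, len(values)), lo, scale)
print(len(packed), restored)
```

Each stored value costs 2 bits plus a shared offset and scale, and the reconstruction error per value is at most half the quantization step.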
+
+
+
+
+
+
+
+ ♻ ☆ Paralinguistics-Aware Speech-Empowered Large Language Models for Natural
+ Conversation NeurIPS 2024
+
+
+
+
+
+
+
+
+ Heeseung Kim, Soonshin Seo, Kyeongseok Jeong, Ohsung Kwon, Soyoon Kim, Jungwhan Kim, Jaehong Lee, Eunwoo Song, Myungwoo Oh, Jung-Woo Ha, Sungroh Yoon, Kang Min Yoo
+
+
+ Recent work shows promising results in expanding the capabilities of large
+language models (LLMs) to directly understand and synthesize speech. However, an
+LLM-based strategy for modeling spoken dialogs remains elusive, calling for
+further investigation. This paper introduces an extensive speech-text LLM
+framework, the Unified Spoken Dialog Model (USDM), designed to generate
+coherent spoken responses with naturally occurring prosodic features relevant
+to the given input speech without relying on explicit automatic speech
+recognition (ASR) or text-to-speech (TTS) systems. We have verified the
+inclusion of prosody in speech tokens that predominantly contain semantic
+information and have used this foundation to construct a prosody-infused
+speech-text model. Additionally, we propose a generalized speech-text
+pretraining scheme that enhances the capture of cross-modal semantics. To
+construct USDM, we fine-tune our speech-text model on spoken dialog data using
+a multi-step spoken dialog template that stimulates the chain-of-reasoning
+capabilities exhibited by the underlying LLM. Automatic and human evaluations
+on the DailyTalk dataset demonstrate that our approach effectively generates
+natural-sounding spoken responses, surpassing previous and cascaded baselines.
+Our code and checkpoints are available at https://github.com/naver-ai/usdm.
+
+
+
+
+
+
+
+ ♻ ☆ Strategic Prompting for Conversational Tasks: A Comparative Analysis of
+ Large Language Models Across Diverse Conversational Tasks
+
+
+ Given the advancements in conversational artificial intelligence, the
+evaluation and assessment of Large Language Models (LLMs) play a crucial role
+in ensuring optimal performance across various conversational tasks. In this
+paper, we present a comprehensive study that thoroughly evaluates the
+capabilities and limitations of five prevalent LLMs: Llama, OPT, Falcon,
+Alpaca, and MPT. The study encompasses various conversational tasks, including
+reservation, empathetic response generation, mental health and legal
+counseling, persuasion, and negotiation. To conduct the evaluation, an
+extensive test setup is employed, utilizing multiple evaluation criteria that
+span from automatic to human evaluation. This includes using generic and
+task-specific metrics to gauge the LLMs' performance accurately. From our
+evaluation, no single model emerges as universally optimal for all tasks.
+Instead, their performance varies significantly depending on the specific
+requirements of each task. While some models excel in certain tasks, they may
+demonstrate comparatively poorer performance in others. These findings
+emphasize the importance of considering task-specific requirements and
+characteristics when selecting the most suitable LLM for conversational
+applications.
+
+
+
+ comment: 39 pages, 12 tables
+
+
+
+
+
+
+
+
+
+ Information Retrieval 8
+
+
+
+
+
+ ☆ Parallel and Mini-Batch Stable Matching for Large-Scale Reciprocal
+ Recommender Systems
+
+
+ Reciprocal recommender systems (RRSs) are crucial in online two-sided
+matching platforms, such as online job or dating markets, as they need to
+consider the preferences of both sides of the match. The concentration of
+recommendations to a subset of users on these platforms undermines their match
+opportunities and reduces the total number of matches. To maximize the total
+number of expected matches among market participants, stable matching theory
+with transferable utility has been applied to RRSs. However, computational
+cost and memory usage increase quadratically with the number of users, making
+it difficult to apply stable matching algorithms to large user bases. In this
+study, we propose novel methods using parallel and mini-batch computations for
+reciprocal recommendation models to improve the time and space efficiency of
+the optimization process for stable matching. Experiments on both real and
+synthetic data confirmed that our stable matching theory-based RRS increased
+the computation speed and enabled tractable large-scale data processing of up
+to one million samples on a single graphics processing unit (GPU), without
+reducing the match count.
+
+
+
+
+
+
+
+ ☆ Introducing Three New Benchmark Datasets for Hierarchical Text
+ Classification
+
+
+
+
+
+
+
+
+ Jaco du Toit, Herman Redelinghuys, Marcel Dunaiski
+
+
+ Hierarchical Text Classification (HTC) is a natural language processing task
+with the objective to classify text documents into a set of classes from a
+structured class hierarchy. Many HTC approaches have been proposed which
+attempt to leverage the class hierarchy information in various ways to improve
+classification performance. Machine learning-based classification approaches
+require large amounts of training data and are most commonly compared through
+three established benchmark datasets, which include the Web Of Science (WOS),
+Reuters Corpus Volume 1 Version 2 (RCV1-V2) and New York Times (NYT) datasets.
+However, apart from the RCV1-V2 dataset, which is well documented, these
+datasets are not accompanied by detailed descriptions of their
+methodologies. In this
+paper, we introduce three new HTC benchmark datasets in the domain of research
+publications which comprise the titles and abstracts of papers from the Web of
+Science publication database. We first create two baseline datasets which use
+existing journal- and citation-based classification schemas. Due to the
+respective shortcomings of these two existing schemas, we propose an approach
+which combines their classifications to improve the reliability and robustness
+of the dataset. We evaluate the three created datasets with a clustering-based
+analysis and show that our proposed approach results in a higher quality
+dataset where documents that belong to the same class are semantically more
+similar compared to the other datasets. Finally, we provide the classification
+performance of four state-of-the-art HTC approaches on these three new datasets
+to provide baselines for future studies on machine learning-based techniques
+for scientific publication classification.
+
+
+
+ comment: 16 pages, 11 figures
+
+
+
+
+
+
+ ☆ Integration of Contextual Descriptors in Ontology Alignment for
+ Enrichment of Semantic Correspondence
+
+
+ This paper proposes a novel approach to semantic ontology alignment using
+contextual descriptors. A formalization was developed that enables the
+integration of essential and contextual descriptors to create a comprehensive
+knowledge model. The hierarchical structure of the semantic approach and the
+mathematical apparatus for analyzing potential conflicts between concepts,
+particularly in the example of "Transparency" and "Privacy" in the context of
+artificial intelligence, are demonstrated. Experimental studies showed a
+significant improvement in ontology alignment metrics after the implementation
+of contextual descriptors, especially in the areas of privacy, responsibility,
+and freedom & autonomy. The application of contextual descriptors achieved an
+average overall improvement of approximately 4.36%. The results indicate the
+effectiveness of the proposed approach for more accurately reflecting the
+complexity of knowledge and its contextual dependence.
+
+
+ Product bundling aims to organize a set of thematically related items into a
+combined bundle for shipment facilitation and item promotion. To increase the
+exposure of fresh or overstocked products, sellers typically bundle these items
+with popular products for inventory clearance. This specific task can be
+formulated as a long-tail product bundling scenario, which leverages the
+user-item interactions to define the popularity of each item. The inherent
+popularity bias in the pre-extracted user feedback features and the
+insufficient utilization of other popularity-independent knowledge may force
+the conventional bundling methods to find more popular items, thereby
+struggling with this long-tail bundling scenario. Through intuitive and
+empirical analysis, we navigate the core solution for this challenge, which is
+maximally mining the popularity-free features and effectively incorporating
+them into the bundling process. To achieve this, we propose a Distilled
+Modality-Oriented Knowledge Transfer framework (DieT) to effectively counter
+the popularity bias misintroduced by the user feedback features and adhere to
+the original intent behind the real-world bundling behaviors. Specifically,
+DieT first proposes the Popularity-free Collaborative Distribution Modeling
+module (PCD) to capture the popularity-independent information from the
+bundle-item view, which is proven most effective in the long-tail bundling
+scenario to enable the directional information transfer. With the tailored
+Unbiased Bundle-aware Knowledge Transferring module (UBT), DieT can highlight
+the significance of popularity-free features while mitigating the negative
+effects of user feedback features in the long-tail scenario via the knowledge
+distillation paradigm. Extensive experiments on two real-world datasets
+demonstrate the superiority of DieT over a list of SOTA methods in the
+long-tail bundling scenario.
+
+
+
+
+
+
+
+
+ Marie Al Ghossein, Emile Contal, Alexandre Robicquet
+
+
+ In-Context Learning (ICL) enables Large Language Models (LLMs) to perform new
+tasks by conditioning on prompts with relevant information. Retrieval-Augmented
+Generation (RAG) enhances ICL by incorporating retrieved documents into the
+LLM's context at query time. However, traditional retrieval methods focus on
+semantic relevance, treating retrieval as a search problem. In this paper, we
+propose reframing retrieval for ICL as a recommendation problem, aiming to
+select documents that maximize utility in ICL tasks. We introduce the
+In-Context Learning Embedding and Reranker Benchmark (ICLERB), a novel
+evaluation framework that compares retrievers based on their ability to enhance
+LLM accuracy in ICL settings. Additionally, we propose a novel Reinforcement
+Learning-to-Rank from AI Feedback (RLRAIF) algorithm, designed to fine-tune
+retrieval models using minimal feedback from the LLM. Our experimental results
+reveal notable differences between ICLERB and existing benchmarks, and
+demonstrate that small models fine-tuned with our RLRAIF algorithm outperform
+large state-of-the-art retrieval models. These findings highlight the
+limitations of existing evaluation methods and the need for specialized
+benchmarks and training strategies adapted to ICL.
+
+
+
+
+
+
+
+ ♻ ☆ Integrating SPARQL and LLMs for Question Answering over Scholarly Data
+ Sources ISWC
+
+
+ The Scholarly Hybrid Question Answering over Linked Data (QALD) Challenge at
+the International Semantic Web Conference (ISWC) 2024 focuses on Question
+Answering (QA) over diverse scholarly sources: DBLP, SemOpenAlex, and
+Wikipedia-based texts. This paper describes a methodology that combines SPARQL
+queries, divide and conquer algorithms, and a pre-trained extractive question
+answering model. It starts with SPARQL queries to gather data, then applies
+divide and conquer to manage various question types and sources, and uses the
+model to handle personal author questions. The approach, evaluated with Exact
+Match and F-score metrics, shows promise for improving QA accuracy and
+efficiency in scholarly contexts.
+
+
+
+ comment: Scholarly Hybrid Question answering challenge from the International
+ Semantic Web Conference of 2024 (ISWC), 7 pages, 8 figures
+
+
+
+
+
+
+ ♻ ☆ A smoothed-Bayesian approach to frequency recovery from sketched data
+
+
+ We provide a novel statistical perspective on a classical problem at the
+intersection of computer science and information theory: recovering the
+empirical frequency of a symbol in a large discrete dataset using only a
+compressed representation, or sketch, obtained via random hashing. Departing
+from traditional algorithmic approaches, recent works have proposed Bayesian
+nonparametric (BNP) methods that can provide more informative frequency
+estimates by leveraging modeling assumptions about the distribution of the
+sketched data. In this paper, we propose a smoothed-Bayesian method, inspired
+by existing BNP approaches but designed in a frequentist framework to overcome
+the computational limitations of the BNP approaches when dealing with
+large-scale data from realistic distributions, including those with power-law
+tail behaviors. For sketches obtained with a single hash function, our approach
+is supported by rigorous frequentist properties, including unbiasedness and
+optimality under a squared error loss function within an intuitive class of
+linear estimators. For sketches with multiple hash functions, we introduce an
+approach based on multi-view learning to construct computationally efficient
+frequency estimators. We validate our method on synthetic and real data,
+comparing its performance to that of existing alternatives.
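For readers unfamiliar with the setting, the sketch below shows the classical algorithmic baseline the abstract departs from: a count-min sketch that compresses a stream with several hash functions and estimates a symbol's frequency as the minimum of its bucket counts. It is not the paper's smoothed-Bayesian estimator; the class name, the use of salted BLAKE2b as the hash family, and the width and depth values are assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Classical count-min sketch: depth hash rows, each with width counters."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        # One hash function per row, derived by salting BLAKE2b with the row index.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._bucket(row, item)] += count

    def estimate(self, item):
        # Hash collisions can only inflate a counter, so the minimum over rows
        # never underestimates the true frequency.
        return min(
            self.table[row][self._bucket(row, item)] for row in range(self.depth)
        )


sketch = CountMinSketch(width=64, depth=4)
for symbol in ["a"] * 5 + ["b"] * 2 + ["c"]:
    sketch.add(symbol)
print(sketch.estimate("a"), sketch.estimate("b"), sketch.estimate("z"))
```

The estimate is an upper bound on the true count; the statistical approaches described in the abstract aim to produce more informative estimates from the same compressed counters.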
+
+
+
+
+
+
+
+ ♻ ☆ Model, Analyze, and Comprehend User Interactions within a Social Media
+ Platform
+
+
+ In this study, we propose a novel graph-based approach to model, analyze and
+comprehend user interactions within a social media platform based on
+post-comment relationships. We construct a user interaction graph from social
+media data and analyze it to gain insights into community dynamics, user
+behavior, and content preferences. Our investigation reveals that while 56.05%
+of the active users are strongly connected within the community, only 0.8% of
+them significantly contribute to its dynamics. Moreover, we observe temporal
+variations in community activity, with certain periods experiencing heightened
+engagement. Additionally, our findings highlight a correlation between user
+activity and popularity, showing that more active users are generally more
+popular. We also observe a preference for positive and informative content,
+which 82.41% of users preferred.
+Overall, our study provides a comprehensive framework for understanding and
+managing online communities, leveraging graph-based techniques to gain valuable
+insights into user behavior and community dynamics.
+
+
+
+ comment: Accepted by 27th International Conference on Computer and Information
+ Technology (ICCIT), 2024. 6 Pages, 6 Figures
+
+ Identifying defects and anomalies in industrial products is a critical
+quality control task. Traditional manual inspection methods are slow,
+subjective, and error-prone. In this work, we propose a novel zero-shot
+training-free approach for automated industrial image anomaly detection using a
+multimodal machine learning pipeline, consisting of three foundation models.
+Our method first uses a large language model, i.e., GPT-3, to generate text
+prompts describing the expected appearances of normal and abnormal products. We
+then use a grounding object detection model, called Grounding DINO, to locate
+the product in the image. Finally, we compare the cropped product image patches
+to the generated prompts using a zero-shot image-text matching model, called
+CLIP, to identify any anomalies. Our experiments on two datasets of industrial
+product images, namely MVTec-AD and VisA, demonstrate the effectiveness of this
+method, achieving high accuracy in detecting various types of defects and
+anomalies without the need for model training. Our proposed model enables
+efficient, scalable, and objective quality control in industrial manufacturing
+settings.
+
+
+
+ comment: Accepted to APSIPA ASC 2024
+
+
+
+
+
+
+ ☆ SuperGaussians: Enhancing Gaussian Splatting Using Primitives with
+ Spatially Varying Colors
+
+
+ Gaussian Splattings demonstrate impressive results in multi-view
+reconstruction based on Gaussian explicit representations. However, the current
+Gaussian primitives only have a single view-dependent color and an opacity to
+represent the appearance and geometry of the scene, resulting in a non-compact
+representation. In this paper, we introduce a new method called SuperGaussians
+that utilizes spatially varying colors and opacity in a single Gaussian
+primitive to improve its representation ability. We have implemented bilinear
+interpolation, movable kernels, and even tiny neural networks as spatially
+varying functions. Quantitative and qualitative experimental results
+demonstrate that all three functions outperform the baseline, with the best
+movable kernels achieving superior novel view synthesis performance on multiple
+datasets, highlighting the strong potential of spatially varying functions.
+
+
+
+
+
+
+
+ ☆ Improving Accuracy and Generalization for Efficient Visual Tracking WACV 2025
+
+
+ Efficient visual trackers overfit to their training distributions and lack
+generalization abilities: they perform well on their respective
+in-distribution (ID) test sets but worse on out-of-distribution (OOD)
+sequences, which limits their deployment in the wild under
+constrained resources. We introduce SiamABC, a highly efficient Siamese tracker
+that significantly improves tracking performance, even on OOD sequences.
+SiamABC takes advantage of new architectural designs in the way it bridges the
+dynamic variability of the target, and of new losses for training. Also, it
+directly addresses OOD tracking generalization by including a fast
+backward-free dynamic test-time adaptation method that continuously adapts the
+model according to the dynamic visual changes of the target. Our extensive
+experiments suggest that SiamABC shows remarkable performance gains in OOD sets
+while maintaining accurate performance on the ID benchmarks. SiamABC
+outperforms MixFormerV2-S by 7.6% on the OOD AVisT benchmark while being 3x
+faster (100 FPS) on a CPU.
+
+
+ Generating sound effects for videos often requires creating artistic sound
+effects that diverge significantly from real-life sources and flexible control
+in the sound design. To address this problem, we introduce MultiFoley, a model
+designed for video-guided sound generation that supports multimodal
+conditioning through text, audio, and video. Given a silent video and a text
+prompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels
+spinning without wind noise) or more whimsical sounds (e.g., making a lion's
+roar sound like a cat's meow). MultiFoley also allows users to choose reference
+audio from sound effects (SFX) libraries or partial videos for conditioning. A
+key novelty of our model lies in its joint training on both internet video
+datasets with low-quality audio and professional SFX recordings, enabling
+high-quality, full-bandwidth (48kHz) audio generation. Through automated
+evaluations and human studies, we demonstrate that MultiFoley successfully
+generates synchronized high-quality sounds across varied conditional inputs and
+outperforms existing methods. Please see our project page for video results:
+https://ificl.github.io/MultiFoley/
+
+
+
+
+
+
+
+ ♻ ☆ SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General
+ Sound SP
+
+
+
+
+
+
+
+
+ Haohe Liu, Xuenan Xu, Yi Yuan, Mengyue Wu, Wenwu Wang, Mark D. Plumbley
+
+
+ Large language models (LLMs) have significantly advanced audio processing
+through audio codecs that convert audio into discrete tokens, enabling the
+application of language modelling techniques to audio data. However,
+traditional codecs often operate at high bitrates or within narrow domains such
+as speech and lack the semantic clues required for efficient language
+modelling. Addressing these challenges, we introduce SemantiCodec, a novel
+codec designed to compress audio into fewer than a hundred tokens per second
+across diverse audio types, including speech, general sound, and music, without
+compromising quality. SemantiCodec features a dual-encoder architecture: a
+semantic encoder using a self-supervised pre-trained Audio Masked Autoencoder
+(AudioMAE), discretized using k-means clustering on extensive audio data, and
+an acoustic encoder to capture the remaining details. The semantic and acoustic
+encoder outputs are used to reconstruct audio via a diffusion-model-based
+decoder. SemantiCodec is presented in three variants with token rates of 25,
+50, and 100 per second, supporting a range of ultra-low bit rates between 0.31
+kbps and 1.40 kbps. Experimental results demonstrate that SemantiCodec
+significantly outperforms the state-of-the-art Descript codec on reconstruction
+quality. Our results also suggest that SemantiCodec contains significantly
+richer semantic information than all evaluated state-of-the-art audio codecs,
+even at significantly lower bitrates. Our code and demos are available at
+https://haoheliu.github.io/SemantiCodec/.
+
+
+
+ comment: Accepted by Journal of Selected Topics in Signal Processing (JSTSP).
+ Demo and code: https://haoheliu.github.io/SemantiCodec/
+
+
+
+
+
+
+ ♻ ☆ Dance Any Beat: Blending Beats with Visuals in Dance Video Generation WACV2025
+
+
+ Generating dance from music is crucial for advancing automated choreography.
+Current methods typically produce skeleton keypoint sequences instead of dance
+videos and lack the capability to make specific individuals dance, which
+reduces their real-world applicability. These methods also require precise
+keypoint annotations, complicating data collection and limiting the use of
+self-collected video datasets. To overcome these challenges, we introduce a
+novel task: generating dance videos directly from images of individuals guided
+by music. This task enables the dance generation of specific individuals
+without requiring keypoint annotations, making it more versatile and applicable
+to various situations. Our solution, the Dance Any Beat Diffusion model
+(DabFusion), utilizes a reference image and a music piece to generate dance
+videos featuring various dance types and choreographies. The music is analyzed
+by our specially designed music encoder, which identifies essential features
+including dance style, movement, and rhythm. DabFusion excels in generating
+dance videos not only for individuals in the training dataset but also for any
+previously unseen person. This versatility stems from its approach of
+generating latent optical flow, which contains all necessary motion information
+to animate any person in the image. We evaluate DabFusion's performance using
+the AIST++ dataset, focusing on video quality, audio-video synchronization, and
+motion-music alignment. We propose a 2D Motion-Music Alignment Score (2D-MM
+Align), which builds on the Beat Alignment Score to more effectively evaluate
+motion-music alignment for this new task. Experiments show that our DabFusion
+establishes a solid baseline for this innovative task. Video results can be
+found on our project page: https://DabFusion.github.io.
+
+
+ The composed image retrieval (CIR) task aims to retrieve the desired target
+image for a given multimodal query, i.e., a reference image with its
+corresponding modification text. Existing efforts face two key limitations:
+1) they ignore the multi-faceted query-target matching factors; 2) they ignore
+the potential unlabeled reference-target image pairs in existing benchmark
+datasets. Addressing these two limitations is non-trivial
+due to the following challenges: 1) how to effectively model the multi-faceted
+matching factors in a latent way without direct supervision signals; 2) how to
+fully utilize the potential unlabeled reference-target image pairs to improve
+the generalization ability of the CIR model. To address these challenges, in
+this work, we first propose a muLtI-faceted Matching Network (LIMN), which
+consists of three key modules: multi-grained image/text encoder, latent
+factor-oriented feature aggregation, and query-target matching modeling.
+Thereafter, we design an iterative dual self-training paradigm to further
+enhance the performance of LIMN by fully utilizing the potential unlabeled
+reference-target image pairs in a semi-supervised manner. Specifically, we
+denote the iterative dual self-training paradigm enhanced LIMN as LIMN+.
+Extensive experiments on three real-world datasets, FashionIQ, Shoes, and
+Birds-to-Words, show that our proposed method significantly surpasses the
+state-of-the-art baselines.
+
+