diff --git a/.nojekyll b/.nojekyll
new file mode 100644
index 0000000..e69de29
diff --git a/cache.json b/cache.json
new file mode 100644
index 0000000..c4830d2
--- /dev/null
+++ b/cache.json
@@ -0,0 +1 @@
+{"2024-12-27T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2412.19792v1","updated":"2024-12-27T18:45:36Z","published":"2024-12-27T18:45:36Z","title":"InfAlign: Inference-aware language model alignment","summary":" Language model alignment has become a critical step in training modern\ngenerative language models. The goal of alignment is to finetune a reference\nmodel such that the win rate of a sample from the aligned model over a sample\nfrom the reference model is high, subject to a KL divergence constraint. Today,\nwe are increasingly using inference-time algorithms (e.g., Best-of-N,\ncontrolled decoding, tree search) to decode from language models rather than\nstandard sampling. However, the alignment objective does not capture such\ninference-time decoding procedures. We show that the existing alignment\nframework is sub-optimal in view of such inference-time methods. We then modify\nthe alignment objective and propose a framework for inference-aware alignment\n(IAPO). We prove that for any inference-time decoding algorithm, the optimal\nsolution that optimizes the inference-time win rate of the aligned policy\nagainst the reference policy is the solution to the typical RLHF problem with a\ntransformation of the reward. This motivates us to provide the KL-regularized\ncalibrate-and-transform RL (CTRL) algorithm to solve this problem, which\ninvolves a reward calibration step and a KL-regularized reward maximization\nstep with a transformation of the calibrated reward. We particularize our study\nto two important inference-time strategies: best-of-N sampling and best-of-N\njailbreaking, where N responses are sampled from the model and the one with the\nhighest or lowest reward is selected. We propose specific transformations for\nthese strategies and demonstrate that our framework offers significant\nimprovements over existing state-of-the-art methods for language model\nalignment. 
Empirically, we outperform baselines that are designed without\ntaking inference-time decoding into consideration by 8-12% and 4-9% on\ninference-time win rates over the Anthropic helpfulness and harmlessness dialog\nbenchmark datasets.\n","authors":["Ananth Balashankar","Ziteng Sun","Jonathan Berant","Jacob Eisenstein","Michael Collins","Adrian Hutter","Jong Lee","Chirag Nagpal","Flavien Prost","Aradhana Sinha","and Ananda Theertha Suresh","Ahmad Beirami"],"pdf_url":"https://arxiv.org/pdf/2412.19792v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.09614v3","updated":"2024-12-27T18:43:59Z","published":"2024-02-14T23:05:44Z","title":"Reasoning over Uncertain Text by Generative Large Language Models","summary":" This paper considers the challenges Large Language Models (LLMs) face when\nreasoning over text that includes information involving uncertainty explicitly\nquantified via probability values. This type of reasoning is relevant to a\nvariety of contexts ranging from everyday conversations to medical\ndecision-making. Despite improvements in the mathematical reasoning\ncapabilities of LLMs, they still exhibit significant difficulties when it comes\nto probabilistic reasoning. To deal with this problem, we introduce the\nBayesian Linguistic Inference Dataset (BLInD), a new dataset specifically\ndesigned to test the probabilistic reasoning capabilities of LLMs. We use BLInD\nto find out the limitations of LLMs for tasks involving probabilistic\nreasoning. In addition, we present several prompting strategies that map the\nproblem to different formal representations, including Python code,\nprobabilistic algorithms, and probabilistic logical programming. We conclude by\nproviding an evaluation of our methods on BLInD and an adaptation of a causal\nreasoning question-answering dataset. 
Our empirical results highlight the\neffectiveness of our proposed strategies for multiple LLMs.\n","authors":["Aliakbar Nafar","Kristen Brent Venable","Parisa Kordjamshidi"],"pdf_url":"https://arxiv.org/pdf/2402.09614v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19785v1","updated":"2024-12-27T18:32:24Z","published":"2024-12-27T18:32:24Z","title":"Enhancing Whisper's Accuracy and Speed for Indian Languages through\n Prompt-Tuning and Tokenization","summary":" Automatic speech recognition has recently seen a significant advancement with\nlarge foundational models such as Whisper. However, these models often struggle\nto perform well in low-resource languages, such as Indian languages. This paper\nexplores two novel approaches to enhance Whisper's multilingual speech\nrecognition performance in Indian languages. First, we propose prompt-tuning\nwith language family information, which enhances Whisper's accuracy in\nlinguistically similar languages. Second, we introduce a novel tokenizer that\nreduces the number of generated tokens, thereby accelerating Whisper's\ninference speed. 
Our extensive experiments demonstrate that the tokenizer\nsignificantly reduces inference time, while prompt-tuning enhances accuracy\nacross various Whisper model sizes, including Small, Medium, and Large.\nTogether, these techniques achieve a balance between optimal WER and inference\nspeed.\n","authors":["Kumud Tripathi","Raj Gothi","Pankaj Wasnik"],"pdf_url":"https://arxiv.org/pdf/2412.19785v1.pdf","comment":"Accepted at ICASSP 2025, 5 pages, 1 figures, 5 tables"},{"id":"http://arxiv.org/abs/2412.19781v1","updated":"2024-12-27T18:25:08Z","published":"2024-12-27T18:25:08Z","title":"Machine Learning for Sentiment Analysis of Imported Food in Trinidad and\n Tobago","summary":" This research investigates the performance of various machine learning\nalgorithms (CNN, LSTM, VADER, and RoBERTa) for sentiment analysis of Twitter\ndata related to imported food items in Trinidad and Tobago. The study addresses\nthree primary research questions: the comparative accuracy and efficiency of\nthe algorithms, the optimal configurations for each model, and the potential\napplications of the optimized models in a live system for monitoring public\nsentiment and its impact on the import bill. The dataset comprises tweets from\n2018 to 2024, divided into imbalanced, balanced, and temporal subsets to assess\nthe impact of data balancing and the COVID-19 pandemic on sentiment trends. Ten\nexperiments were conducted to evaluate the models under various configurations.\nResults indicated that VADER outperformed the other models in both multi-class\nand binary sentiment classifications. 
The study highlights significant changes\nin sentiment trends pre- and post-COVID-19, with implications for import\npolicies.\n","authors":["Cassandra Daniels","Koffka Khan"],"pdf_url":"https://arxiv.org/pdf/2412.19781v1.pdf","comment":"27 pages"},{"id":"http://arxiv.org/abs/2409.01366v2","updated":"2024-12-27T17:49:34Z","published":"2024-09-02T16:41:44Z","title":"CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and\n Selective Sparsification","summary":" Deploying large language models (LLMs) on edge devices presents significant\nchallenges due to the substantial computational overhead and memory\nrequirements. Activation sparsification can mitigate these resource challenges\nby reducing the number of activated neurons during inference. Existing methods\ntypically employ thresholding-based sparsification based on the statistics of\nactivation tensors. However, they do not model the impact of activation\nsparsification on performance, resulting in suboptimal performance degradation.\nTo address the limitations, this paper reformulates the activation\nsparsification problem to explicitly capture the relationship between\nactivation sparsity and model performance. Then, this paper proposes CHESS, a\ngeneral activation sparsification approach via CHannel-wise thrEsholding and\nSelective Sparsification. First, channel-wise thresholding assigns a unique\nthreshold to each activation channel in the feed-forward network (FFN) layers.\nThen, selective sparsification involves applying thresholding-based activation\nsparsification to specific layers within the attention modules. 
Finally, we\ndetail the implementation of sparse kernels to accelerate LLM inference.\nExperimental results demonstrate that the proposed CHESS achieves lower\nperformance degradation over eight downstream tasks while activating fewer\nparameters than existing methods, thus speeding up the LLM inference by up to\n1.27x.\n","authors":["Junhui He","Shangyu Wu","Weidong Wen","Chun Jason Xue","Qingan Li"],"pdf_url":"https://arxiv.org/pdf/2409.01366v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.21792v3","updated":"2024-12-27T17:36:21Z","published":"2024-07-31T17:59:24Z","title":"Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?","summary":" As artificial intelligence systems grow more powerful, there has been\nincreasing interest in \"AI safety\" research to address emerging and future\nrisks. However, the field of AI safety remains poorly defined and\ninconsistently measured, leading to confusion about how researchers can\ncontribute. This lack of clarity is compounded by the unclear relationship\nbetween AI safety benchmarks and upstream general capabilities (e.g., general\nknowledge and reasoning). To address these issues, we conduct a comprehensive\nmeta-analysis of AI safety benchmarks, empirically analyzing their correlation\nwith general capabilities across dozens of models and providing a survey of\nexisting directions in AI safety. Our findings reveal that many safety\nbenchmarks highly correlate with both upstream model capabilities and training\ncompute, potentially enabling \"safetywashing\"--where capability improvements\nare misrepresented as safety advancements. Based on these findings, we propose\nan empirical foundation for developing more meaningful safety metrics and\ndefine AI safety in a machine learning research context as a set of clearly\ndelineated research goals that are empirically separable from generic\ncapabilities advancements. 
In doing so, we aim to provide a more rigorous\nframework for AI safety research, advancing the science of safety evaluations\nand clarifying the path towards measurable progress.\n","authors":["Richard Ren","Steven Basart","Adam Khoja","Alice Gatti","Long Phan","Xuwang Yin","Mantas Mazeika","Alexander Pan","Gabriel Mukobi","Ryan H. Kim","Stephen Fitz","Dan Hendrycks"],"pdf_url":"https://arxiv.org/pdf/2407.21792v3.pdf","comment":"NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.19723v1","updated":"2024-12-27T16:21:58Z","published":"2024-12-27T16:21:58Z","title":"OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse\n Task Synthesis","summary":" Graphical User Interface (GUI) agents powered by Vision-Language Models\n(VLMs) have demonstrated human-like computer control capability. Despite their\nutility in advancing digital automation, a critical bottleneck persists:\ncollecting high-quality trajectory data for training. Common practices for\ncollecting such data rely on human supervision or synthetic data generation\nthrough executing pre-defined tasks, which are either resource-intensive or\nunable to guarantee data quality. Moreover, these methods suffer from limited\ndata diversity and significant gaps between synthetic data and real-world\nenvironments. To address these challenges, we propose OS-Genesis, a novel GUI\ndata synthesis pipeline that reverses the conventional trajectory collection\nprocess. Instead of relying on pre-defined tasks, OS-Genesis enables agents\nfirst to perceive environments and perform step-wise interactions, then\nretrospectively derive high-quality tasks to enable trajectory-level\nexploration. A trajectory reward model is then employed to ensure the quality\nof the generated trajectories. We demonstrate that training GUI agents with\nOS-Genesis significantly improves their performance on highly challenging\nonline benchmarks. 
In-depth analysis further validates OS-Genesis's efficiency\nand its superior data quality and diversity compared to existing synthesis\nmethods. Our codes, data, and checkpoints are available at\n\\href{https://qiushisun.github.io/OS-Genesis-Home/}{OS-Genesis Homepage}.\n","authors":["Qiushi Sun","Kanzhi Cheng","Zichen Ding","Chuanyang Jin","Yian Wang","Fangzhi Xu","Zhenyu Wu","Chengyou Jia","Liheng Chen","Zhoumianze Liu","Ben Kao","Guohao Li","Junxian He","Yu Qiao","Zhiyong Wu"],"pdf_url":"https://arxiv.org/pdf/2412.19723v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2412.19707v1","updated":"2024-12-27T16:02:34Z","published":"2024-12-27T16:02:34Z","title":"Toward Adaptive Reasoning in Large Language Models with Thought Rollback","summary":" Large language models (LLMs) have been routinely used to solve various tasks\nusing step-by-step reasoning. However, the structure of intermediate reasoning\nsteps, or thoughts, is rigid and unidirectional, such as chains, trees, or\nacyclic-directed graphs. Consequently, the resulting inflexible and\nforward-only reasoning may not address challenging tasks and fail when the LLM\nfrequently gives false responses, i.e., ``hallucinations''. This paper proposes\na new reasoning framework, called Thought Rollback (TR), allowing LLMs to\nadaptively build thought structure while maintaining effective reasoning toward\nproblem-solving under ``hallucinations''. The core mechanism of TR is rolling\nback thoughts, which allows LLMs to perform error analysis on thoughts, and\nthus roll back to any previously mistaken thought for revision. Subsequently,\nby including such trial-and-error in the prompt to guide the LLM, each rollback\nleads to one more reliable reasoning path. Therefore, starting with a simple\nprompt without human annotations, LLM with TR adaptively and gradually explores\nthoughts for a correct solution. 
Comprehensive experiments on mathematical\nproblems and multi-task reasoning demonstrate the state-of-the-art performance\nof TR in terms of problem-solving rate and interaction cost. For instance, the\nsolving rate of GPT-4 with TR outperforms the current best by $9\\%$ on the MATH\ndataset.\n","authors":["Sijia Chen","Baochun Li"],"pdf_url":"https://arxiv.org/pdf/2412.19707v1.pdf","comment":"ICML 2024 camera-ready version with 24 pages and 12 figures. Code\n repo with all prompts:\n https://github.com/iQua/llmpebase/tree/main/examples/ThoughtRollback"},{"id":"http://arxiv.org/abs/2410.16803v3","updated":"2024-12-27T15:32:01Z","published":"2024-10-22T08:28:05Z","title":"Context-aware Inductive Knowledge Graph Completion with Latent Type\n Constraints and Subgraph Reasoning","summary":" Inductive knowledge graph completion (KGC) aims to predict missing triples\nwith unseen entities. Recent works focus on modeling reasoning paths between\nthe head and tail entity as direct supporting evidence. However, these methods\ndepend heavily on the existence and quality of reasoning paths, which limits\ntheir general applicability in different scenarios. In addition, we observe\nthat latent type constraints and neighboring facts inherent in KGs are also\nvital in inferring missing triples. To effectively utilize all useful\ninformation in KGs, we introduce CATS, a novel context-aware inductive KGC\nsolution. With sufficient guidance from proper prompts and supervised\nfine-tuning, CATS activates the strong semantic understanding and reasoning\ncapabilities of large language models to assess the existence of query triples,\nwhich consist of two modules. First, the type-aware reasoning module evaluates\nwhether the candidate entity matches the latent entity type as required by the\nquery relation. Then, the subgraph reasoning module selects relevant reasoning\npaths and neighboring facts, and evaluates their correlation to the query\ntriple. 
Experiment results on three widely used datasets demonstrate that CATS\nsignificantly outperforms state-of-the-art methods in 16 out of 18\ntransductive, inductive, and few-shot settings with an average absolute MRR\nimprovement of 7.2%.\n","authors":["Muzhi Li","Cehao Yang","Chengjin Xu","Zixing Song","Xuhui Jiang","Jian Guo","Ho-fung Leung","Irwin King"],"pdf_url":"https://arxiv.org/pdf/2410.16803v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.15473v2","updated":"2024-12-27T14:56:10Z","published":"2024-06-15T17:40:49Z","title":"Intertwining CP and NLP: The Generation of Unreasonably Constrained\n Sentences","summary":" Constrained text generation remains a challenging task, particularly when\ndealing with hard constraints. Traditional NLP approaches prioritize generating\nmeaningful and coherent output. Also, the current state-of-the-art methods\noften lack the expressiveness and constraint satisfaction capabilities to\nhandle such tasks effectively. Recently, an approach for generating constrained\nsentences with CP was proposed in (Bonlarron et al., 2023). This ad-hoc model\nfor solving the sentence generation problem under MNREAD rules nevertheless\nproved computationally and structurally unsuitable for dealing with other, more\nconstrained problems. In this paper, a novel, more generic approach is\nintroduced to tackle many of these previously intractable problems, illustrated\nhere with the highly intractable sentence generation problem following RADNER\nrules.\n More precisely, this paper presents the CPTextGen Framework. This framework\nconsiders a constrained text generation problem as a discrete combinatorial\noptimization problem. 
It is solved by a constraint programming method that\ncombines linguistic properties (e.g., n-grams or language level) with other\nmore classical constraints (e.g., the number of characters, syllables).\nEventually, a curation phase allows for selecting the best-generated sentences\naccording to perplexity using an LLM.\n The effectiveness of this approach is demonstrated by tackling a new, more\ntediously constrained text generation problem: the iconic RADNER sentences\nproblem. This problem aims to generate sentences respecting a set of quite\nstrict rules defined by their use in vision and clinical research. Thanks to\nour CP-based approach, many new strongly constrained sentences have been\nsuccessfully generated. This highlights our approach's potential to handle\nunreasonably constrained text generation scenarios.\n","authors":["Alexandre Bonlarron","Jean-Charles Régin"],"pdf_url":"https://arxiv.org/pdf/2406.15473v2.pdf","comment":"Disambiguation and additional references"},{"id":"http://arxiv.org/abs/2410.08565v4","updated":"2024-12-27T14:19:55Z","published":"2024-10-11T06:44:31Z","title":"Baichuan-Omni Technical Report","summary":" The salient multimodal capabilities and interactive experience of GPT-4o\nhighlight its critical role in practical applications, yet it lacks a\nhigh-performing open-source counterpart. In this paper, we introduce\nBaichuan-omni, the first open-source 7B Multimodal Large Language Model (MLLM)\nadept at concurrently processing and analyzing modalities of image, video,\naudio, and text, while delivering an advanced multimodal interactive experience\nand strong performance. We propose an effective multimodal training schema\nstarting with 7B model and proceeding through two stages of multimodal\nalignment and multitask fine-tuning across audio, image, video, and text modal.\nThis approach equips the language model with the ability to handle visual and\naudio data effectively. 
Demonstrating strong performance across various\nomni-modal and multimodal benchmarks, we aim for this contribution to serve as\na competitive baseline for the open-source community in advancing multimodal\nunderstanding and real-time interaction.\n","authors":["Yadong Li","Haoze Sun","Mingan Lin","Tianpeng Li","Guosheng Dong","Tao Zhang","Bowen Ding","Wei Song","Zhenglin Cheng","Yuqi Huo","Song Chen","Xu Li","Da Pan","Shusen Zhang","Xin Wu","Zheng Liang","Jun Liu","Tao Zhang","Keer Lu","Yaqi Zhao","Yanjun Shen","Fan Yang","Kaicheng Yu","Tao Lin","Jianhua Xu","Zenan Zhou","Weipeng Chen"],"pdf_url":"https://arxiv.org/pdf/2410.08565v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.11843v3","updated":"2024-12-27T14:17:05Z","published":"2024-07-16T15:24:44Z","title":"Preemptive Detection and Correction of Misaligned Actions in LLM Agents","summary":" Deploying LLM-based agents in real-life applications often faces a critical\nchallenge: the misalignment between agents' behavior and user intent. Such\nmisalignment may lead agents to unintentionally execute critical actions that\ncarry negative outcomes (e.g., accidentally triggering a \"buy-now\" in web\nshopping), resulting in undesirable or even irreversible consequences. Although\naddressing these issues is crucial, the preemptive detection and correction of\nmisaligned actions remains relatively underexplored. To fill this gap, we\nintroduce InferAct, a novel approach that leverages the belief reasoning\nability of LLMs, grounded in Theory-of-Mind, to detect misaligned actions\nbefore execution. Once the misalignment is detected, InferAct alerts users for\ntimely correction, preventing adverse outcomes and enhancing the reliability of\nLLM agents' decision-making processes. Experiments on three widely used tasks\ndemonstrate that InferAct achieves up to 20% improvements on Marco-F1 against\nbaselines in misaligned action detection. 
An in-depth evaluation of\nmisalignment correction further highlights InferAct's effectiveness in\nimproving agent alignment.\n","authors":["Haishuo Fang","Xiaodan Zhu","Iryna Gurevych"],"pdf_url":"https://arxiv.org/pdf/2407.11843v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.00107v5","updated":"2024-12-27T12:28:34Z","published":"2023-05-31T18:27:43Z","title":"MERT: Acoustic Music Understanding Model with Large-Scale\n Self-supervised Training","summary":" Self-supervised learning (SSL) has recently emerged as a promising paradigm\nfor training generalisable models on large-scale data in the fields of vision,\ntext, and speech. Although SSL has been proven effective in speech and audio,\nits application to music audio has yet to be thoroughly explored. This is\npartially due to the distinctive challenges associated with modelling musical\nknowledge, particularly tonal and pitched characteristics of music. To address\nthis research gap, we propose an acoustic Music undERstanding model with\nlarge-scale self-supervised Training (MERT), which incorporates teacher models\nto provide pseudo labels in the masked language modelling (MLM) style acoustic\npre-training. In our exploration, we identified an effective combination of\nteacher models, which outperforms conventional speech and audio approaches in\nterms of performance. This combination includes an acoustic teacher based on\nResidual Vector Quantisation - Variational AutoEncoder (RVQ-VAE) and a musical\nteacher based on the Constant-Q Transform (CQT). Furthermore, we explore a wide\nrange of settings to overcome the instability in acoustic language model\npre-training, which allows our designed paradigm to scale from 95M to 330M\nparameters. 
Experimental results indicate that our model can generalise and\nperform well on 14 music understanding tasks and attain state-of-the-art (SOTA)\noverall scores.\n","authors":["Yizhi Li","Ruibin Yuan","Ge Zhang","Yinghao Ma","Xingran Chen","Hanzhi Yin","Chenghao Xiao","Chenghua Lin","Anton Ragni","Emmanouil Benetos","Norbert Gyenge","Roger Dannenberg","Ruibo Liu","Wenhu Chen","Gus Xia","Yemin Shi","Wenhao Huang","Zili Wang","Yike Guo","Jie Fu"],"pdf_url":"https://arxiv.org/pdf/2306.00107v5.pdf","comment":"accepted by ICLR 2024"},{"id":"http://arxiv.org/abs/2412.19610v1","updated":"2024-12-27T12:11:50Z","published":"2024-12-27T12:11:50Z","title":"Machine Generated Product Advertisements: Benchmarking LLMs Against\n Human Performance","summary":" This study compares the performance of AI-generated and human-written product\ndescriptions using a multifaceted evaluation model. We analyze descriptions for\n100 products generated by four AI models (Gemma 2B, LLAMA, GPT2, and ChatGPT 4)\nwith and without sample descriptions, against human-written descriptions. Our\nevaluation metrics include sentiment, readability, persuasiveness, Search\nEngine Optimization(SEO), clarity, emotional appeal, and call-to-action\neffectiveness. The results indicate that ChatGPT 4 performs the best. In\ncontrast, other models demonstrate significant shortcomings, producing\nincoherent and illogical output that lacks logical structure and contextual\nrelevance. These models struggle to maintain focus on the product being\ndescribed, resulting in disjointed sentences that do not convey meaningful\ninformation. This research provides insights into the current capabilities and\nlimitations of AI in the creation of content for e-Commerce.\n","authors":["Sanjukta Ghosh"],"pdf_url":"https://arxiv.org/pdf/2412.19610v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.03021v2","updated":"2024-12-27T11:31:01Z","published":"2024-04-03T19:14:45Z","title":"Blessing or curse? 
A survey on the Impact of Generative AI on Fake News","summary":" Fake news significantly influences our society, impacting consumers, voters,\nand many other societal groups. While Fake News have existed for centuries,\nGenerative AI takes fake news to a new level. It is now possible to automate\nthe creation of masses of high-quality, individually targeted Fake News. On the\nother hand, Generative AI can also help detect Fake News. Both fields are\nyoung but developing fast.\n This survey provides a comprehensive examination of the research and\npractical use of Generative AI for Fake News detection and creation in 2024.\nFollowing the Structured Literature Survey approach, the paper synthesizes\ncurrent results in the following topic clusters: 1) enabling technologies, 2)\ncreation of Fake News, 3) a case study of social media as the most relevant\ndistribution channel, 4) detection of Fake News, and 5) deepfakes as an\nupcoming technology.\n The article also identifies current challenges and open issues.\n","authors":["Alexander Loth","Martin Kappes","Marc-Oliver Pahl"],"pdf_url":"https://arxiv.org/pdf/2404.03021v2.pdf","comment":"16 pages, 2 figures. Submitted to ACM Transactions on Intelligent\n Systems and Technology (ACM TIST). Added references"},{"id":"http://arxiv.org/abs/2412.19583v1","updated":"2024-12-27T10:58:55Z","published":"2024-12-27T10:58:55Z","title":"A Comparative Study of Machine Unlearning Techniques for Image and Text\n Classification Models","summary":" Machine Unlearning has emerged as a critical area in artificial intelligence,\naddressing the need to selectively remove learned data from machine learning\nmodels in response to data privacy regulations. This paper provides a\ncomprehensive comparative analysis of six state-of-the-art unlearning techniques\napplied to image and text classification tasks. 
We evaluate their performance,\nefficiency, and compliance with regulatory requirements, highlighting their\nstrengths and limitations in practical scenarios. By systematically analyzing\nthese methods, we aim to provide insights into their applicability,\nchallenges, and tradeoffs, fostering advancements in the field of ethical and\nadaptable machine learning.\n","authors":["Omar M. Safa","Mahmoud M. Abdelaziz","Mustafa Eltawy","Mohamed Mamdouh","Moamen Gharib","Salaheldin Eltenihy","Nagia M. Ghanem","Mohamed M. Ismail"],"pdf_url":"https://arxiv.org/pdf/2412.19583v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.00326v6","updated":"2024-12-27T10:12:28Z","published":"2023-12-01T03:44:54Z","title":"Agent-OM: Leveraging LLM Agents for Ontology Matching","summary":" Ontology matching (OM) enables semantic interoperability between different\nontologies and resolves their conceptual heterogeneity by aligning related\nentities. OM systems currently have two prevailing design paradigms:\nconventional knowledge-based expert systems and newer machine learning-based\npredictive systems. While large language models (LLMs) and LLM agents have\nrevolutionised data engineering and have been applied creatively in many\ndomains, their potential for OM remains underexplored. This study introduces a\nnovel agent-powered LLM-based design paradigm for OM systems. With\nconsideration of several specific challenges in leveraging LLM agents for OM,\nwe propose a generic framework, namely Agent-OM (Agent for Ontology Matching),\nconsisting of two Siamese agents for retrieval and matching, with a set of OM\ntools. Our framework is implemented in a proof-of-concept system. 
Evaluations\nof three Ontology Alignment Evaluation Initiative (OAEI) tracks over\nstate-of-the-art OM systems show that our system can achieve results very close\nto the long-standing best performance on simple OM tasks and can significantly\nimprove the performance on complex and few-shot OM tasks.\n","authors":["Zhangcheng Qiang","Weiqing Wang","Kerry Taylor"],"pdf_url":"https://arxiv.org/pdf/2312.00326v6.pdf","comment":"19 pages, 12 figures, 3 tables"},{"id":"http://arxiv.org/abs/2410.00070v2","updated":"2024-12-27T09:23:14Z","published":"2024-09-30T12:11:49Z","title":"Mamba for Streaming ASR Combined with Unimodal Aggregation","summary":" This paper works on streaming automatic speech recognition (ASR). Mamba, a\nrecently proposed state space model, has demonstrated the ability to match or\nsurpass Transformers in various tasks while benefiting from a linear complexity\nadvantage. We explore the efficiency of Mamba encoder for streaming ASR and\npropose an associated lookahead mechanism for leveraging controllable future\ninformation. Additionally, a streaming-style unimodal aggregation (UMA) method\nis implemented, which automatically detects token activity and streamingly\ntriggers token output, and meanwhile aggregates feature frames for better\nlearning token representation. Based on UMA, an early termination (ET) method\nis proposed to further reduce recognition latency. 
Experiments conducted on two\nMandarin Chinese datasets demonstrate that the proposed model achieves\ncompetitive ASR performance in terms of both recognition accuracy and latency.\n","authors":["Ying Fang","Xiaofei Li"],"pdf_url":"https://arxiv.org/pdf/2410.00070v2.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.19544v1","updated":"2024-12-27T09:16:39Z","published":"2024-12-27T09:16:39Z","title":"TARGA: Targeted Synthetic Data Generation for Practical Reasoning over\n Structured Data","summary":" Semantic parsing, which converts natural language questions into logic forms,\nplays a crucial role in reasoning within structured environments. However,\nexisting methods encounter two significant challenges: reliance on extensive\nmanually annotated datasets and limited generalization capability to unseen\nexamples. To tackle these issues, we propose Targeted Synthetic Data Generation\n(TARGA), a practical framework that dynamically generates high-relevance\nsynthetic data without manual annotation. Starting from the pertinent entities\nand relations of a given question, we probe for the potential relevant queries\nthrough layer-wise expansion and cross-layer combination. Then we generate\ncorresponding natural language questions for these constructed queries to\njointly serve as the synthetic demonstrations for in-context learning.\nExperiments on multiple knowledge base question answering (KBQA) datasets\ndemonstrate that TARGA, using only a 7B-parameter model, substantially\noutperforms existing non-fine-tuned methods that utilize close-sourced model,\nachieving notable improvements in F1 scores on GrailQA(+7.7) and\nKBQA-Agent(+12.2). Furthermore, TARGA also exhibits superior sample efficiency,\nrobustness, and generalization capabilities under non-I.I.D. 
settings.\n","authors":["Xiang Huang","Jiayu Shen","Shanshan Huang","Sitao Cheng","Xiaxia Wang","Yuzhong Qu"],"pdf_url":"https://arxiv.org/pdf/2412.19544v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18424v2","updated":"2024-12-27T08:33:31Z","published":"2024-12-24T13:39:32Z","title":"LongDocURL: a Comprehensive Multimodal Long Document Benchmark\n Integrating Understanding, Reasoning, and Locating","summary":" Large vision language models (LVLMs) have improved the document understanding\ncapabilities remarkably, enabling the handling of complex document elements,\nlonger contexts, and a wider range of tasks. However, existing document\nunderstanding benchmarks have been limited to handling only a small number of\npages and fail to provide a comprehensive analysis of layout elements locating.\nIn this paper, we first define three primary task categories: Long Document\nUnderstanding, numerical Reasoning, and cross-element Locating, and then\npropose a comprehensive benchmark, LongDocURL, integrating above three primary\ntasks and comprising 20 sub-tasks categorized based on different primary tasks\nand answer evidences. Furthermore, we develop a semi-automated construction\npipeline and collect 2,325 high-quality question-answering pairs, covering more\nthan 33,000 pages of documents, significantly outperforming existing\nbenchmarks. 
Subsequently, we conduct comprehensive evaluation experiments on\nboth open-source and closed-source models across 26 different configurations,\nrevealing critical performance gaps in this field.\n","authors":["Chao Deng","Jiale Yuan","Pi Bu","Peijie Wang","Zhong-Zhi Li","Jian Xu","Xiao-Hui Li","Yuan Gao","Jun Song","Bo Zheng","Cheng-Lin Liu"],"pdf_url":"https://arxiv.org/pdf/2412.18424v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19522v1","updated":"2024-12-27T08:25:52Z","published":"2024-12-27T08:25:52Z","title":"Exploiting Domain-Specific Parallel Data on Multilingual Language Models\n for Low-resource Language Translation","summary":" Neural Machine Translation (NMT) systems built on multilingual\nsequence-to-sequence Language Models (msLMs) fail to deliver expected results\nwhen the amount of parallel data for a language, as well as the language's\nrepresentation in the model are limited. This restricts the capabilities of\ndomain-specific NMT systems for low-resource languages (LRLs). As a solution,\nparallel data from auxiliary domains can be used either to fine-tune or to\nfurther pre-train the msLM. We present an evaluation of the effectiveness of\nthese two techniques in the context of domain-specific LRL-NMT. We also explore\nthe impact of domain divergence on NMT model performance. We recommend several\nstrategies for utilizing auxiliary parallel data in building domain-specific\nNMT models for LRLs.\n","authors":["Surangika Ranathungaa","Shravan Nayak","Shih-Ting Cindy Huang","Yanke Mao","Tong Su","Yun-Hsiang Ray Chan","Songchen Yuan","Anthony Rinaldi","Annie En-Shiun Lee"],"pdf_url":"https://arxiv.org/pdf/2412.19522v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19513v1","updated":"2024-12-27T08:09:11Z","published":"2024-12-27T08:09:11Z","title":"Confidence v.s. 
Critique: A Decomposition of Self-Correction Capability\n for LLMs","summary":" Large Language Models (LLMs) can correct their self-generated responses, but\na decline in accuracy after self-correction is also witnessed. To have a deeper\nunderstanding of self-correction, we endeavor to decompose, evaluate, and\nanalyze the self-correction behaviors of LLMs. By enumerating and analyzing\nanswer correctness before and after self-correction, we decompose the\nself-correction capability into confidence (being confident to correct answers)\nand critique (turning wrong answers to correct) capabilities, and propose two\nmetrics from a probabilistic perspective to measure these 2 capabilities, along\nwith another metric for overall self-correction capability evaluation. Based on\nour decomposition and evaluation metrics, we conduct extensive experiments and\ndraw some empirical conclusions. For example, we find different models can\nexhibit distinct behaviors: some models are confident while others are more\ncritical. We also find the trade-off between the two capabilities (i.e.\nimproving one can lead to a decline in the other) when manipulating model\nself-correction behavior by prompts or in-context learning. Further, we find a\nsimple yet efficient strategy to improve self-correction capability by\ntransforming Supervision Fine-Tuning (SFT) data format, and our strategy\noutperforms vanilla SFT in both capabilities and achieves much higher accuracy\nafter self-correction. 
Our code will be publicly available on GitHub.\n","authors":["Zhe Yang","Yichang Zhang","Yudong Wang","Ziyao Xu","Junyang Lin","Zhifang Sui"],"pdf_url":"https://arxiv.org/pdf/2412.19513v1.pdf","comment":"16 pages, 10 figures"},{"id":"http://arxiv.org/abs/2412.19512v1","updated":"2024-12-27T08:03:22Z","published":"2024-12-27T08:03:22Z","title":"Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging","summary":" Fine-tuning large language models (LLMs) for downstream tasks is a widely\nadopted approach, but it often leads to safety degradation in safety-aligned\nLLMs. Currently, many solutions address this issue by incorporating additional\nsafety data, which can be impractical in many cases. In this paper, we address\nthe question: How can we improve downstream task performance while preserving\nsafety in LLMs without relying on additional safety data? We propose a simple\nand effective method that maintains the inherent safety of LLMs while enhancing\ntheir downstream task performance: merging the weights of pre- and\npost-fine-tuned safety-aligned models. Experimental results across various\ndownstream tasks, models, and merging methods demonstrate that this approach\neffectively mitigates safety degradation while improving downstream task\nperformance, offering a practical solution for adapting safety-aligned LLMs.\n","authors":["Hua Farn","Hsuan Su","Shachi H Kumar","Saurav Sahay","Shang-Tse Chen","Hung-yi Lee"],"pdf_url":"https://arxiv.org/pdf/2412.19512v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.07111v2","updated":"2024-12-27T07:29:19Z","published":"2024-11-11T16:37:40Z","title":"Building a Taiwanese Mandarin Spoken Language Model: A First Attempt","summary":" This technical report presents our initial attempt to build a spoken large\nlanguage model (LLM) for Taiwanese Mandarin, specifically tailored to enable\nreal-time, speech-to-speech interaction in multi-turn conversations. 
Our\nend-to-end model incorporates a decoder-only transformer architecture and aims\nto achieve seamless interaction while preserving the conversational flow,\nincluding full-duplex capabilities allowing simultaneous speaking and\nlistening. The paper also details the training process, including data\npreparation with synthesized dialogues and adjustments for real-time\ninteraction. We also developed a platform to evaluate conversational fluency\nand response coherence in multi-turn dialogues. We hope the release of the\nreport can contribute to the future development of spoken LLMs in Taiwanese\nMandarin.\n","authors":["Chih-Kai Yang","Yu-Kuan Fu","Chen-An Li","Yi-Cheng Lin","Yu-Xiang Lin","Wei-Chih Chen","Ho Lam Chung","Chun-Yi Kuan","Wei-Ping Huang","Ke-Han Lu","Tzu-Quan Lin","Hsiu-Hsuan Wang","En-Pei Hu","Chan-Jan Hsu","Liang-Hsuan Tseng","I-Hsiang Chiu","Ulin Sanga","Xuanjun Chen","Po-chun Hsu","Shu-wen Yang","Hung-yi Lee"],"pdf_url":"https://arxiv.org/pdf/2411.07111v2.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2412.19490v1","updated":"2024-12-27T07:16:10Z","published":"2024-12-27T07:16:10Z","title":"User Willingness-aware Sales Talk Dataset","summary":" User willingness is a crucial element in the sales talk process that affects\nthe achievement of the salesperson's or sales system's objectives. Despite the\nimportance of user willingness, to the best of our knowledge, no previous study\nhas addressed the development of automated sales talk dialogue systems that\nexplicitly consider user willingness. A major barrier is the lack of sales talk\ndatasets with reliable user willingness data. Thus, in this study, we developed\na user willingness-aware sales talk collection by leveraging the ecological\nvalidity concept, which is discussed in the field of human-computer\ninteraction. Our approach focused on three types of user willingness essential\nin real sales interactions. 
We created a dialogue environment that closely\nresembles real-world scenarios to elicit natural user willingness, with\nparticipants evaluating their willingness at the utterance level from multiple\nperspectives. We analyzed the collected data to gain insights into practical\nuser willingness-aware sales talk strategies. In addition, as a practical\napplication of the constructed dataset, we developed and evaluated a sales\ndialogue system aimed at enhancing the user's intent to purchase.\n","authors":["Asahi Hentona","Jun Baba","Shiki Sato","Reina Akama"],"pdf_url":"https://arxiv.org/pdf/2412.19490v1.pdf","comment":"12 pages, Accepted to COLING2025"},{"id":"http://arxiv.org/abs/2411.15862v3","updated":"2024-12-27T07:04:19Z","published":"2024-11-24T14:38:59Z","title":"Do LLMs Really Think Step-by-step In Implicit Reasoning?","summary":" It has been well-known that Chain-of-Thought can remarkably enhance LLMs'\nperformance on complex tasks. However, because it also introduces slower\ninference speeds and higher computational costs, many researches have attempted\nto use implicit CoT, which does not need LLMs to explicitly generate the\nintermediate steps. However, the invisible reasoning process leaves us a doubt\nthat, can implicit CoT really be equal to explicit CoT? Therefore, in this\nstudy, we address this question through experiments. We probe the information\nof intermediate steps from the model's hidden states when it is either trained\nor prompted to perform implicit CoT. The results surprisingly indicate that\nwhen prompted, LLMs hardly think about intermediate steps, suggesting they may\njust rely on experience rather than strict step-by-step reasoning. But when\ntrained, they indeed calculate intermediate steps. 
Moreover, in both\nsituations, we find the effect of using implicit CoT is susceptible to the\nformat of the problem, reaffirming the current deficiency of implicit CoT.\n","authors":["Yijiong Yu"],"pdf_url":"https://arxiv.org/pdf/2411.15862v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19482v1","updated":"2024-12-27T06:33:42Z","published":"2024-12-27T06:33:42Z","title":"Pre-training, Fine-tuning and Re-ranking: A Three-Stage Framework for\n Legal Question Answering","summary":" Legal question answering (QA) has attracted increasing attention from people\nseeking legal advice, which aims to retrieve the most applicable answers from a\nlarge-scale database of question-answer pairs. Previous methods mainly use a\ndual-encoder architecture to learn dense representations of both questions and\nanswers. However, these methods could suffer from lacking domain knowledge and\nsufficient labeled training data. In this paper, we propose a three-stage\n(\\underline{p}re-training, \\underline{f}ine-tuning and \\underline{r}e-ranking)\nframework for \\underline{l}egal \\underline{QA} (called PFR-LQA), which promotes\nthe fine-grained text representation learning and boosts the performance of\ndense retrieval with the dual-encoder architecture. Concretely, we first\nconduct domain-specific pre-training on legal questions and answers through a\nself-supervised training objective, allowing the pre-trained model to be\nadapted to the legal domain. Then, we perform task-specific fine-tuning of the\ndual-encoder on legal question-answer pairs by using the supervised learning\nobjective, leading to a high-quality dual-encoder for the specific downstream\nQA task. Finally, we employ a contextual re-ranking objective to further refine\nthe output representations of questions produced by the document encoder, which\nuses contextual similarity to increase the discrepancy between the anchor and\nhard negative samples for better question re-ranking. 
We conduct extensive\nexperiments on a manually annotated legal QA dataset. Experimental results show\nthat our PFR-LQA method achieves better performance than the strong competitors\nfor legal question answering.\n","authors":["Shiwen Ni","Hao Cheng","Min Yang"],"pdf_url":"https://arxiv.org/pdf/2412.19482v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.17378v3","updated":"2024-12-27T05:56:37Z","published":"2024-06-25T08:55:12Z","title":"A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns\n Well with The Key Tokens","summary":" Text embeddings from large language models (LLMs) have achieved excellent\nresults in tasks such as information retrieval, semantic textual similarity,\netc. In this work, we show an interesting finding: when feeding a text into the\nLLM-based embedder, the obtained text embedding will be able to be aligned with\nthe key tokens in the input text. We first fully analyze this phenomenon on\neight LLM-based embedders and show that this phenomenon is universal and is not\naffected by model architecture, training strategy, and embedding method. With a\ndeeper analysis, we find that the main change in embedding space between these\nembedders and their LLM backbones is in the first principal component. By\nadjusting the first principal component, we can align text embedding with the\nkey tokens. Finally, we give several examples to demonstrate the vast\napplication potential of this finding: (1) we propose a simple and practical\nsparse retrieval method based on the aligned tokens, which can achieve 80% of\nthe dense retrieval effect of the same model while reducing the computation\nsignificantly; (2) we show that our findings provide a novel perspective to\nhelp understand novel technologies (e.g., instruction-following embedding) and\nfuzzy concepts (e.g., semantic relatedness vs. 
similarity) in this field.\n","authors":["Zhijie Nie","Richong Zhang","Zhanyu Wu"],"pdf_url":"https://arxiv.org/pdf/2406.17378v3.pdf","comment":"Work in Progress"},{"id":"http://arxiv.org/abs/2412.00652v2","updated":"2024-12-27T05:32:11Z","published":"2024-12-01T03:12:26Z","title":"Multi-Agent Collaboration in Incident Response with Large Language\n Models","summary":" Incident response (IR) is a critical aspect of cybersecurity, requiring rapid\ndecision-making and coordinated efforts to address cyberattacks effectively.\nLeveraging large language models (LLMs) as intelligent agents offers a novel\napproach to enhancing collaboration and efficiency in IR scenarios. This paper\nexplores the application of LLM-based multi-agent collaboration using the\nBackdoors & Breaches framework, a tabletop game designed for cybersecurity\ntraining. We simulate real-world IR dynamics through various team structures,\nincluding centralized, decentralized, and hybrid configurations. By analyzing\nagent interactions and performance across these setups, we provide insights\ninto optimizing multi-agent collaboration for incident response. Our findings\nhighlight the potential of LLMs to enhance decision-making, improve\nadaptability, and streamline IR processes, paving the way for more effective\nand coordinated responses to cyber threats.\n","authors":["Zefang Liu"],"pdf_url":"https://arxiv.org/pdf/2412.00652v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.10462v3","updated":"2024-12-27T05:30:00Z","published":"2023-08-21T04:31:06Z","title":"Exploring Parameter-Efficient Fine-Tuning Techniques for Code Generation\n with Large Language Models","summary":" Large language models (LLMs) demonstrate impressive capabilities to generate\naccurate code snippets given natural language intents in a zero-shot manner,\ni.e., without the need for specific fine-tuning. 
While prior studies have\nhighlighted the advantages of fine-tuning LLMs, this process incurs high\ncomputational costs, making it impractical in resource-scarce environments,\nparticularly for models with billions of parameters. To address these\nchallenges, previous research explored in-context learning (ICL) and\nretrieval-augmented generation (RAG) as strategies to guide the LLM generative\nprocess with task-specific prompt examples. However, ICL and RAG introduce\ninconveniences, such as the need for designing contextually relevant prompts\nand the absence of learning task-specific parameters, thereby limiting\ndownstream task performance. In this context, we foresee parameter-efficient\nfine-tuning (PEFT) as a promising approach to efficiently specialize LLMs to\ntask-specific data while maintaining reasonable resource consumption. In this\npaper, we deliver a comprehensive study of PEFT techniques for LLMs in the\ncontext of automated code generation. Our comprehensive investigation of PEFT\ntechniques for LLMs reveals their superiority and potential over ICL and RAG\nacross a diverse set of LLMs and three representative Python code generation\ndatasets: Conala, CodeAlpacaPy, and APPS. Furthermore, our study highlights the\npotential for tuning larger LLMs and significant reductions in memory usage by\ncombining PEFT with quantization. Therefore, this study opens opportunities for\nbroader applications of PEFT in software engineering scenarios. 
Our code is\navailable at https://github.com/martin-wey/peft-llm-code/.\n","authors":["Martin Weyssow","Xin Zhou","Kisub Kim","David Lo","Houari Sahraoui"],"pdf_url":"https://arxiv.org/pdf/2308.10462v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.09032v3","updated":"2024-12-27T05:13:23Z","published":"2024-03-14T01:51:35Z","title":"CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language\n Models to Coding Preferences","summary":" Evaluating the alignment of large language models (LLMs) with user-defined\ncoding preferences is a challenging endeavour that requires a deep assessment\nof LLMs' outputs. Existing methods and benchmarks rely primarily on automated\nmetrics and static analysis tools, which often fail to capture the nuances of\nuser instructions and LLM outputs. To address this gap, we propose using the\nLLM-as-a-Judge methodology to evaluate the alignment of LLMs with coding\npreferences. Based on this approach, we present CodeUltraFeedback, a\ncomprehensive dataset designed to facilitate the evaluation and improvement of\nLLM alignment. CodeUltraFeedback consists of 10,000 coding instructions, each\nannotated with four responses generated from a diverse pool of 14 LLMs. These\nresponses are ranked based on five distinct coding preferences using GPT-3.5 as\na judge, providing both numerical scores and detailed textual feedback. Our\nanalysis of CodeUltraFeedback reveals that responses from GPT-3.5 and GPT-4 are\ngenerally preferred over those from open-weight LLMs, highlighting significant\ndifferences in alignment between closed and open-weight models. 
In turn, we\nexplore the usage of CodeUltraFeedback as feedback data to fine-tune and align\nCodeLlama-7B-Instruct using supervised fine-tuning (SFT) and reinforcement\nlearning from AI feedback (RLAIF) with direct preference optimization (DPO).\nThe resulting aligned CodeLlama-7B-Instruct model outperforms larger LLMs in\nterms of alignment with coding preferences and shows improved functional\ncorrectness on the HumanEval+ benchmark compared to the original instruct\nmodel. Therefore, our contributions bridge the gap in preference tuning of LLMs\nfor code and set the stage for further advancements in model alignment and\nRLAIF in automated software engineering.\n","authors":["Martin Weyssow","Aton Kamanda","Xin Zhou","Houari Sahraoui"],"pdf_url":"https://arxiv.org/pdf/2403.09032v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19449v1","updated":"2024-12-27T04:37:06Z","published":"2024-12-27T04:37:06Z","title":"Feature Alignment-Based Knowledge Distillation for Efficient Compression\n of Large Language Models","summary":" This study proposes a knowledge distillation algorithm based on large\nlanguage models and feature alignment, aiming to effectively transfer the\nknowledge of large pre-trained models into lightweight student models, thereby\nreducing computational costs while maintaining high model performance.\nDifferent from the traditional soft label distillation method, this method\nintroduces a multi-layer feature alignment strategy to deeply align the\nintermediate features and attention mechanisms of the teacher model and the\nstudent model, maximally retaining the semantic expression ability and context\nmodeling ability of the teacher model. In terms of method design, a multi-task\nloss function is constructed, including feature matching loss, attention\nalignment loss, and output distribution matching loss, to ensure multi-level\ninformation transfer through joint optimization. 
The experiments were\ncomprehensively evaluated on the GLUE data set and various natural language\nprocessing tasks. The results show that the proposed model performs very close\nto the state-of-the-art GPT-4 model in terms of evaluation indicators such as\nperplexity, BLEU, ROUGE, and CER. At the same time, it far exceeds baseline\nmodels such as DeBERTa, XLNet, and GPT-3, showing significant performance\nimprovements and computing efficiency advantages. Research results show that\nthe feature alignment distillation strategy is an effective model compression\nmethod that can significantly reduce computational overhead and storage\nrequirements while maintaining model capabilities. Future research can be\nfurther expanded in the directions of self-supervised learning, cross-modal\nfeature alignment, and multi-task transfer learning to provide more flexible\nand efficient solutions for the deployment and optimization of deep learning\nmodels.\n","authors":["Shuo Wang","Chihang Wang","Jia Gao","Zhen Qi","Hongye Zheng","Xiaoxuan Liao"],"pdf_url":"https://arxiv.org/pdf/2412.19449v1.pdf","comment":"4 pages"},{"id":"http://arxiv.org/abs/2411.06710v2","updated":"2024-12-27T04:27:23Z","published":"2024-11-11T04:36:58Z","title":"Model Fusion through Bayesian Optimization in Language Model Fine-Tuning","summary":" Fine-tuning pre-trained models for downstream tasks is a widely adopted\ntechnique known for its adaptability and reliability across various domains.\nDespite its conceptual simplicity, fine-tuning entails several troublesome\nengineering choices, such as selecting hyperparameters and determining\ncheckpoints from an optimization trajectory. To tackle the difficulty of\nchoosing the best model, one effective solution is model fusion, which combines\nmultiple models in a parameter space. However, we observe a large discrepancy\nbetween loss and metric landscapes during the fine-tuning of pre-trained\nlanguage models. 
Building on this observation, we introduce a novel model\nfusion technique that optimizes both the desired metric and loss through\nmulti-objective Bayesian optimization. In addition, to effectively select\nhyperparameters, we establish a two-stage procedure by integrating Bayesian\noptimization processes into our framework. Experiments across various\ndownstream tasks show considerable performance improvements using our Bayesian\noptimization-guided method.\n","authors":["Chaeyun Jang","Hyungi Lee","Jungtaek Kim","Juho Lee"],"pdf_url":"https://arxiv.org/pdf/2411.06710v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19437v1","updated":"2024-12-27T04:03:16Z","published":"2024-12-27T04:03:16Z","title":"DeepSeek-V3 Technical Report","summary":" We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with\n671B total parameters with 37B activated for each token. To achieve efficient\ninference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent\nAttention (MLA) and DeepSeekMoE architectures, which were thoroughly validated\nin DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free\nstrategy for load balancing and sets a multi-token prediction training\nobjective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion\ndiverse and high-quality tokens, followed by Supervised Fine-Tuning and\nReinforcement Learning stages to fully harness its capabilities. Comprehensive\nevaluations reveal that DeepSeek-V3 outperforms other open-source models and\nachieves performance comparable to leading closed-source models. Despite its\nexcellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its\nfull training. In addition, its training process is remarkably stable.\nThroughout the entire training process, we did not experience any irrecoverable\nloss spikes or perform any rollbacks. 
The model checkpoints are available at\nhttps://github.com/deepseek-ai/DeepSeek-V3.\n","authors":[" DeepSeek-AI","Aixin Liu","Bei Feng","Bing Xue","Bingxuan Wang","Bochao Wu","Chengda Lu","Chenggang Zhao","Chengqi Deng","Chenyu Zhang","Chong Ruan","Damai Dai","Daya Guo","Dejian Yang","Deli Chen","Dongjie Ji","Erhang Li","Fangyun Lin","Fucong Dai","Fuli Luo","Guangbo Hao","Guanting Chen","Guowei Li","H. Zhang","Han Bao","Hanwei Xu","Haocheng Wang","Haowei Zhang","Honghui Ding","Huajian Xin","Huazuo Gao","Hui Li","Hui Qu","J. L. Cai","Jian Liang","Jianzhong Guo","Jiaqi Ni","Jiashi Li","Jiawei Wang","Jin Chen","Jingchang Chen","Jingyang Yuan","Junjie Qiu","Junlong Li","Junxiao Song","Kai Dong","Kai Hu","Kaige Gao","Kang Guan","Kexin Huang","Kuai Yu","Lean Wang","Lecong Zhang","Lei Xu","Leyi Xia","Liang Zhao","Litong Wang","Liyue Zhang","Meng Li","Miaojun Wang","Mingchuan Zhang","Minghua Zhang","Minghui Tang","Mingming Li","Ning Tian","Panpan Huang","Peiyi Wang","Peng Zhang","Qiancheng Wang","Qihao Zhu","Qinyu Chen","Qiushi Du","R. J. Chen","R. L. Jin","Ruiqi Ge","Ruisong Zhang","Ruizhe Pan","Runji Wang","Runxin Xu","Ruoyu Zhang","Ruyi Chen","S. S. Li","Shanghao Lu","Shangyan Zhou","Shanhuang Chen","Shaoqing Wu","Shengfeng Ye","Shengfeng Ye","Shirong Ma","Shiyu Wang","Shuang Zhou","Shuiping Yu","Shunfeng Zhou","Shuting Pan","T. Wang","Tao Yun","Tian Pei","Tianyu Sun","W. L. Xiao","Wangding Zeng","Wanjia Zhao","Wei An","Wen Liu","Wenfeng Liang","Wenjun Gao","Wenqin Yu","Wentao Zhang","X. Q. Li","Xiangyue Jin","Xianzu Wang","Xiao Bi","Xiaodong Liu","Xiaohan Wang","Xiaojin Shen","Xiaokang Chen","Xiaokang Zhang","Xiaosha Chen","Xiaotao Nie","Xiaowen Sun","Xiaoxiang Wang","Xin Cheng","Xin Liu","Xin Xie","Xingchao Liu","Xingkai Yu","Xinnan Song","Xinxia Shan","Xinyi Zhou","Xinyu Yang","Xinyuan Li","Xuecheng Su","Xuheng Lin","Y. K. Li","Y. Q. Wang","Y. X. Wei","Y. X. 
Zhu","Yang Zhang","Yanhong Xu","Yanhong Xu","Yanping Huang","Yao Li","Yao Zhao","Yaofeng Sun","Yaohui Li","Yaohui Wang","Yi Yu","Yi Zheng","Yichao Zhang","Yifan Shi","Yiliang Xiong","Ying He","Ying Tang","Yishi Piao","Yisong Wang","Yixuan Tan","Yiyang Ma","Yiyuan Liu","Yongqiang Guo","Yu Wu","Yuan Ou","Yuchen Zhu","Yuduan Wang","Yue Gong","Yuheng Zou","Yujia He","Yukun Zha","Yunfan Xiong","Yunxian Ma","Yuting Yan","Yuxiang Luo","Yuxiang You","Yuxuan Liu","Yuyang Zhou","Z. F. Wu","Z. Z. Ren","Zehui Ren","Zhangli Sha","Zhe Fu","Zhean Xu","Zhen Huang","Zhen Zhang","Zhenda Xie","Zhengyan Zhang","Zhewen Hao","Zhibin Gou","Zhicheng Ma","Zhigang Yan","Zhihong Shao","Zhipeng Xu","Zhiyu Wu","Zhongyu Zhang","Zhuoshu Li","Zihui Gu","Zijia Zhu","Zijun Liu","Zilin Li","Ziwei Xie","Ziyang Song","Ziyi Gao","Zizheng Pan"],"pdf_url":"https://arxiv.org/pdf/2412.19437v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.00399v3","updated":"2024-12-27T03:53:21Z","published":"2024-03-30T15:38:54Z","title":"Aurora-M: Open Source Continual Pre-training for Multilingual Language\n and Code","summary":" Pretrained language models are an integral part of AI applications, but their\nhigh computational cost for training limits accessibility. Initiatives such as\nBloom and StarCoder aim to democratize access to pretrained models for\ncollaborative community development. Despite these efforts, such models\nencounter challenges such as limited multilingual capabilities, risks of\ncatastrophic forgetting during continual pretraining, and the high costs of\ntraining models from scratch, alongside the need to align with AI safety\nstandards and regulatory frameworks.\n This paper presents Aurora-M, a 15B parameter multilingual open-source model\ntrained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually\npretrained from StarCoderPlus on 435B additional tokens, Aurora-M surpasses 2T\ntokens in total training token count. 
It is the first open-source multilingual\nmodel fine-tuned on human-reviewed safety instructions, thus aligning its\ndevelopment not only with conventional red-teaming considerations, but also\nwith the specific concerns articulated in the Biden-Harris Executive Order on\nthe Safe, Secure, and Trustworthy Development and Use of Artificial\nIntelligence.\n We evaluate Aurora-M across a wide range of tasks and languages, showcasing\nits robustness against catastrophic forgetting and its superior performance in\nmultilingual settings, particularly in safety evaluations. We open-source\nAurora-M and its variants to encourage responsible open-source development of\nlarge language models at https://huggingface.co/aurora-m.\n","authors":["Taishi Nakamura","Mayank Mishra","Simone Tedeschi","Yekun Chai","Jason T Stillerman","Felix Friedrich","Prateek Yadav","Tanmay Laud","Vu Minh Chien","Terry Yue Zhuo","Diganta Misra","Ben Bogin","Xuan-Son Vu","Marzena Karpinska","Arnav Varma Dantuluri","Wojciech Kusa","Tommaso Furlanello","Rio Yokota","Niklas Muennighoff","Suhas Pai","Tosin Adewumi","Veronika Laippala","Xiaozhe Yao","Adalberto Junior","Alpay Ariyak","Aleksandr Drozd","Jordan Clive","Kshitij Gupta","Liangyu Chen","Qi Sun","Ken Tsui","Noah Persaud","Nour Fahmy","Tianlong Chen","Mohit Bansal","Nicolo Monti","Tai Dang","Ziyang Luo","Tien-Tung Bui","Roberto Navigli","Virendra Mehta","Matthew Blumberg","Victor May","Huu Nguyen","Sampo Pyysalo"],"pdf_url":"https://arxiv.org/pdf/2404.00399v3.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2412.13949v2","updated":"2024-12-27T03:00:19Z","published":"2024-12-18T15:29:30Z","title":"Cracking the Code of Hallucination in LVLMs with Vision-aware Head\n Divergence","summary":" Large vision-language models (LVLMs) have made substantial progress in\nintegrating large language models (LLMs) with visual inputs, enabling advanced\nmultimodal reasoning. 
Despite their success, a persistent challenge is\nhallucination-where generated text fails to accurately reflect visual\ncontent-undermining both accuracy and reliability. Existing methods focus on\nalignment training or decoding refinements but primarily address symptoms at\nthe generation stage without probing the underlying causes. In this work, we\ninvestigate the internal mechanisms driving hallucination in LVLMs, with an\nemphasis on the multi-head attention module. Specifically, we introduce\nVision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of\nattention head outputs to visual context. Based on this, our findings reveal\nthe presence of vision-aware attention heads that are more attuned to visual\ninformation; however, the model's overreliance on its prior language patterns\nis closely related to hallucinations. Building on these insights, we propose\nVision-aware Head Reinforcement (VHR), a training-free approach to mitigate\nhallucination by enhancing the role of vision-aware attention heads. Extensive\nexperiments demonstrate that our method achieves superior performance compared\nto state-of-the-art approaches in mitigating hallucinations, while maintaining\nhigh efficiency with negligible additional time overhead.\n","authors":["Jinghan He","Kuan Zhu","Haiyun Guo","Junfeng Fang","Zhenglin Hua","Yuheng Jia","Ming Tang","Tat-Seng Chua","Jinqiao Wang"],"pdf_url":"https://arxiv.org/pdf/2412.13949v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.10758v2","updated":"2024-12-27T02:40:36Z","published":"2024-03-16T01:40:36Z","title":"Rules still work for Open Information Extraction","summary":" Open information extraction (OIE) aims to extract surface relations and their\ncorresponding arguments from natural language text, irrespective of domain.\nThis paper presents an innovative OIE model, APRCOIE, tailored for Chinese\ntext. Diverging from previous models, our model generates extraction patterns\nautonomously. 
The model defines a new pattern form for Chinese OIE and proposes\nan automated pattern generation methodology. In that way, the model can handle\na wide array of complex and diverse Chinese grammatical phenomena. We design a\npreliminary filter based on tensor computing to conduct the extraction\nprocedure efficiently. To train the model, we manually annotated a large-scale\nChinese OIE dataset. In the comparative evaluation, we demonstrate that APRCOIE\noutperforms state-of-the-art Chinese OIE models and significantly expands the\nboundaries of achievable OIE performance. The code of APRCOIE and the annotated\ndataset are released on GitHub (https://github.com/jialin666/APRCOIE_v1)\n","authors":["Jialin Hua","Liangqing Luo","Weiying Ping","Yan Liao","Chunhai Tao","Xuewen Lub"],"pdf_url":"https://arxiv.org/pdf/2403.10758v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19928v1","updated":"2024-12-27T21:22:28Z","published":"2024-12-27T21:22:28Z","title":"Assessing Text Classification Methods for Cyberbullying Detection on\n Social Media Platforms","summary":" Cyberbullying significantly contributes to mental health issues in\ncommunities by negatively impacting the psychology of victims. It is a\nprevalent problem on social media platforms, necessitating effective, real-time\ndetection and monitoring systems to identify harmful messages. However, current\ncyberbullying detection systems face challenges related to performance, dataset\nquality, time efficiency, and computational costs. This research aims to\nconduct a comparative study by adapting and evaluating existing text\nclassification techniques within the cyberbullying detection domain. The study\nspecifically evaluates the effectiveness and performance of these techniques in\nidentifying cyberbullying instances on social media platforms. It focuses on\nleveraging and assessing large language models, including BERT, RoBERTa, XLNet,\nDistilBERT, and GPT-2.0, for their suitability in this domain. 
The results show\nthat BERT strikes a balance between performance, time efficiency, and\ncomputational resources: Accuracy of 95%, Precision of 95%, Recall of 95%, F1\nScore of 95%, Error Rate of 5%, Inference Time of 0.053 seconds, RAM Usage of\n35.28 MB, CPU/GPU Usage of 0.4%, and Energy Consumption of 0.000263 kWh. The\nfindings demonstrate that generative AI models, while powerful, do not\nconsistently outperform fine-tuned models on the tested benchmarks. However,\nstate-of-the-art performance can still be achieved through strategic adaptation\nand fine-tuning of existing models for specific datasets and tasks.\n","authors":["Adamu Gaston Philipo","Doreen Sebastian Sarwatt","Jianguo Ding","Mahmoud Daneshmand","Huansheng Ning"],"pdf_url":"https://arxiv.org/pdf/2412.19928v1.pdf","comment":"15 pages, 10 figures, 7 tables"},{"id":"http://arxiv.org/abs/2412.19926v1","updated":"2024-12-27T21:20:45Z","published":"2024-12-27T21:20:45Z","title":"Right vs. Right: Can LLMs Make Tough Choices?","summary":" An ethical dilemma describes a choice between two \"right\" options involving\nconflicting moral values. We present a comprehensive evaluation of how LLMs\nnavigate ethical dilemmas. Specifically, we investigate LLMs on their (1)\nsensitivity in comprehending ethical dilemmas, (2) consistency in moral value\nchoice, (3) consideration of consequences, and (4) ability to align their\nresponses to a moral value preference explicitly or implicitly specified in a\nprompt. Drawing inspiration from a leading ethical framework, we construct a\ndataset comprising 1,730 ethical dilemmas involving four pairs of conflicting\nvalues. We evaluate 20 well-known LLMs from six families. Our experiments\nreveal that: (1) LLMs exhibit pronounced preferences between major value pairs,\nand prioritize truth over loyalty, community over individual, and long-term\nover short-term considerations. 
(2) The larger LLMs tend to support a\ndeontological perspective, maintaining their choices of actions even when\nnegative consequences are specified. (3) Explicit guidelines are more effective\nin guiding LLMs' moral choice than in-context examples. Lastly, our experiments\nhighlight the limitation of LLMs in comprehending different formulations of\nethical dilemmas.\n","authors":["Jiaqing Yuan","Pradeep K. Murukannaiah","Munindar P. Singh"],"pdf_url":"https://arxiv.org/pdf/2412.19926v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19925v1","updated":"2024-12-27T21:19:01Z","published":"2024-12-27T21:19:01Z","title":"HADES: Hardware Accelerated Decoding for Efficient Speculation in Large\n Language Models","summary":" Large Language Models (LLMs) have revolutionized natural language processing\nby understanding and generating human-like text. However, the increasing demand\nfor more sophisticated LLMs presents significant computational challenges due\nto their scale and complexity. This paper introduces Hardware Accelerated\nDecoding (HADES), a novel approach to enhance the performance and energy\nefficiency of LLMs. We address the design of an LLM accelerator with\nhardware-level speculative decoding support, a concept not previously explored\nin existing literature. 
Our work demonstrates how speculative decoding can\nsignificantly improve the efficiency of LLM operations, paving the way for more\nadvanced and practical applications of these models.\n","authors":["Ze Yang","Yihong Jin","Xinhe Xu"],"pdf_url":"https://arxiv.org/pdf/2412.19925v1.pdf","comment":"Accepted to ICCEA 2025"},{"id":"http://arxiv.org/abs/2409.19467v2","updated":"2024-12-27T20:53:02Z","published":"2024-09-28T22:06:06Z","title":"INSIGHTBUDDY-AI: Medication Extraction and Entity Linking using Large\n Language Models and Ensemble Learning","summary":" Medication Extraction and Mining play an important role in healthcare NLP\nresearch due to its practical applications in hospital settings, such as their\nmapping into standard clinical knowledge bases (SNOMED-CT, BNF, etc.). In this\nwork, we investigate state-of-the-art LLMs in text mining tasks on medications\nand their related attributes such as dosage, route, strength, and adverse\neffects. In addition, we explore different ensemble learning methods\n(\\textsc{Stack-Ensemble} and \\textsc{Voting-Ensemble}) to augment the model\nperformances from individual LLMs. Our ensemble learning result demonstrated\nbetter performances than individually fine-tuned base models BERT, RoBERTa,\nRoBERTa-L, BioBERT, BioClinicalBERT, BioMedRoBERTa, ClinicalBERT, and\nPubMedBERT across general and specific domains. Finally, we build up an entity\nlinking function to map extracted medical terminologies into the SNOMED-CT\ncodes and the British National Formulary (BNF) codes, which are further mapped\nto the Dictionary of Medicines and Devices (dm+d), and ICD. 
Our model's toolkit\nand desktop applications are publicly available (at\n\\url{https://github.com/HECTA-UoM/ensemble-NER}).\n","authors":["Pablo Romero","Lifeng Han","Goran Nenadic"],"pdf_url":"https://arxiv.org/pdf/2409.19467v2.pdf","comment":"ongoing work, 24 pages"},{"id":"http://arxiv.org/abs/2412.19906v1","updated":"2024-12-27T19:42:25Z","published":"2024-12-27T19:42:25Z","title":"Evaluate Summarization in Fine-Granularity: Auto Evaluation with LLM","summary":" Due to the exponential growth of information and the need for efficient\ninformation consumption, the task of summarization has gained paramount\nimportance. Evaluating summarization accurately and objectively presents\nsignificant challenges, particularly when dealing with long and unstructured\ntexts rich in content. Existing methods, such as ROUGE (Lin, 2004) and\nembedding similarities, often yield scores that have low correlation with human\njudgements and are also not intuitively understandable, making it difficult to\ngauge the true quality of the summaries. LLMs can mimic humans in giving\nsubjective reviews, but subjective scores are hard to interpret and justify.\nThey can be easily manipulated by altering the models and the tones of the\nprompts. In this paper, we introduce a novel evaluation methodology and tooling\ndesigned to address these challenges, providing a more comprehensive, accurate\nand interpretable assessment of summarization outputs. Our method (SumAutoEval)\nproposes and evaluates metrics at varying granularity levels, giving objective\nscores on four key dimensions: completeness, correctness, alignment, and\nreadability.
We empirically demonstrate that SumAutoEval enhances the\nunderstanding of output quality with better human correlation.\n","authors":["Dong Yuan","Eti Rastogi","Fen Zhao","Sagar Goyal","Gautam Naik","Sree Prasanna Rajagopal"],"pdf_url":"https://arxiv.org/pdf/2412.19906v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.11043v2","updated":"2024-12-27T19:27:19Z","published":"2024-10-14T19:48:31Z","title":"Personality Differences Drive Conversational Dynamics: A\n High-Dimensional NLP Approach","summary":" This paper investigates how the topical flow of dyadic conversations emerges\nover time and how differences in interlocutors' personality traits contribute\nto this topical flow. Leveraging text embeddings, we map the trajectories of $N\n= 1655$ conversations between strangers into a high-dimensional space. Using\nnonlinear projections and clustering, we then identify when each interlocutor\nenters and exits various topics. Differences in conversational flow are\nquantified via $\\textit{topic entropy}$, a summary measure of the \"spread\" of\ntopics covered during a conversation, and $\\textit{linguistic alignment}$, a\ntime-varying measure of the cosine similarity between interlocutors'\nembeddings. Our findings suggest that interlocutors with a larger difference in\nthe personality dimension of openness influence each other to spend more time\ndiscussing a wider range of topics and that interlocutors with a larger\ndifference in extraversion experience a larger decrease in linguistic alignment\nthroughout their conversation. We also examine how participants' affect\n(emotion) changes from before to after a conversation, finding that a larger\ndifference in extraversion predicts a larger difference in affect change and\nthat a greater topic entropy predicts a larger affect increase.
This work\ndemonstrates how communication research can be advanced through the use of\nhigh-dimensional NLP methods and identifies personality difference as an\nimportant driver of social influence.\n","authors":["Julia R. Fischer","Nilam Ram"],"pdf_url":"https://arxiv.org/pdf/2410.11043v2.pdf","comment":"Published in the Proceedings of the Second Workshop on Social\n Influence in Conversations (SICon 2024), co-located with EMNLP 2024. This\n version corrects a labeling error in Table 1"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2405.13362v3","updated":"2024-12-27T14:44:30Z","published":"2024-05-22T05:43:15Z","title":"Lusifer: LLM-based User SImulated Feedback Environment for online\n Recommender systems","summary":" Training reinforcement learning-based recommender systems is often hindered\nby the lack of dynamic and realistic user interactions. To address this\nlimitation, we introduce Lusifer, a novel environment leveraging Large Language\nModels (LLMs) to generate simulated user feedback. Lusifer synthesizes user\nprofiles and interaction histories to simulate responses and behaviors toward\nrecommended items, with profiles updated after each rating to reflect evolving\nuser characteristics. Utilizing the MovieLens dataset as a proof of concept, we\nlimited our implementation to the last 40 interactions for each user,\nrepresenting approximately 39% and 22% of the training sets, to focus on recent\nuser behavior. For consistency and to gain insights into the performance of\ntraditional methods with limited data, we implemented baseline approaches using\nthe same data subset. Our results demonstrate that Lusifer accurately emulates\nuser behavior and preferences, even with reduced training data having an RMSE\nof 1.3 across various test sets. This paper presents Lusifer's operational\npipeline, including prompt generation and iterative user profile updates, and\ncompares its performance against baseline methods. 
The findings validate\nLusifer's ability to produce realistic dynamic feedback and suggest that it\noffers a scalable and adjustable framework for user simulation in online\nreinforcement learning recommender systems for future studies, particularly\nwhen training data is limited.\n","authors":["Danial Ebrat","Eli Paradalis","Luis Rueda"],"pdf_url":"https://arxiv.org/pdf/2405.13362v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.00326v6","updated":"2024-12-27T10:12:28Z","published":"2023-12-01T03:44:54Z","title":"Agent-OM: Leveraging LLM Agents for Ontology Matching","summary":" Ontology matching (OM) enables semantic interoperability between different\nontologies and resolves their conceptual heterogeneity by aligning related\nentities. OM systems currently have two prevailing design paradigms:\nconventional knowledge-based expert systems and newer machine learning-based\npredictive systems. While large language models (LLMs) and LLM agents have\nrevolutionised data engineering and have been applied creatively in many\ndomains, their potential for OM remains underexplored. This study introduces a\nnovel agent-powered LLM-based design paradigm for OM systems. With\nconsideration of several specific challenges in leveraging LLM agents for OM,\nwe propose a generic framework, namely Agent-OM (Agent for Ontology Matching),\nconsisting of two Siamese agents for retrieval and matching, with a set of OM\ntools. Our framework is implemented in a proof-of-concept system. 
Evaluations\nof three Ontology Alignment Evaluation Initiative (OAEI) tracks over\nstate-of-the-art OM systems show that our system can achieve results very close\nto the long-standing best performance on simple OM tasks and can significantly\nimprove the performance on complex and few-shot OM tasks.\n","authors":["Zhangcheng Qiang","Weiqing Wang","Kerry Taylor"],"pdf_url":"https://arxiv.org/pdf/2312.00326v6.pdf","comment":"19 pages, 12 figures, 3 tables"},{"id":"http://arxiv.org/abs/2412.11589v2","updated":"2024-12-27T07:36:52Z","published":"2024-12-16T09:20:29Z","title":"Future Sight and Tough Fights: Revolutionizing Sequential Recommendation\n with FENRec","summary":" Sequential recommendation (SR) systems predict user preferences by analyzing\ntime-ordered interaction sequences. A common challenge for SR is data sparsity,\nas users typically interact with only a limited number of items. While\ncontrastive learning has been employed in previous approaches to address the\nchallenges, these methods often adopt binary labels, missing finer patterns and\noverlooking detailed information in subsequent behaviors of users.\nAdditionally, they rely on random sampling to select negatives in contrastive\nlearning, which may not yield sufficiently hard negatives during later training\nstages. In this paper, we propose Future data utilization with Enduring\nNegatives for contrastive learning in sequential Recommendation (FENRec). Our\napproach aims to leverage future data with time-dependent soft labels and\ngenerate enduring hard negatives from existing data, thereby enhancing the\neffectiveness in tackling data sparsity. 
Experiment results demonstrate our\nstate-of-the-art performance across four benchmark datasets, with an average\nimprovement of 6.16\\% across all metrics.\n","authors":["Yu-Hsuan Huang","Ling Lo","Hongxia Xie","Hong-Han Shuai","Wen-Huang Cheng"],"pdf_url":"https://arxiv.org/pdf/2412.11589v2.pdf","comment":"Accepted by AAAI 2025, Our code is available at\n https://github.com/uikdwnd/FENRec"},{"id":"http://arxiv.org/abs/2406.17378v3","updated":"2024-12-27T05:56:37Z","published":"2024-06-25T08:55:12Z","title":"A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns\n Well with The Key Tokens","summary":" Text embeddings from large language models (LLMs) have achieved excellent\nresults in tasks such as information retrieval, semantic textual similarity,\netc. In this work, we show an interesting finding: when feeding a text into the\nLLM-based embedder, the obtained text embedding will be able to be aligned with\nthe key tokens in the input text. We first fully analyze this phenomenon on\neight LLM-based embedders and show that this phenomenon is universal and is not\naffected by model architecture, training strategy, and embedding method. With a\ndeeper analysis, we find that the main change in embedding space between these\nembedders and their LLM backbones is in the first principal component. By\nadjusting the first principal component, we can align text embedding with the\nkey tokens. Finally, we give several examples to demonstrate the vast\napplication potential of this finding: (1) we propose a simple and practical\nsparse retrieval method based on the aligned tokens, which can achieve 80% of\nthe dense retrieval effect of the same model while reducing the computation\nsignificantly; (2) we show that our findings provide a novel perspective to\nhelp understand novel technologies (e.g., instruction-following embedding) and\nfuzzy concepts (e.g., semantic relatedness vs. 
similarity) in this field.\n","authors":["Zhijie Nie","Richong Zhang","Zhanyu Wu"],"pdf_url":"https://arxiv.org/pdf/2406.17378v3.pdf","comment":"Work in Progress"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2412.19802v1","updated":"2024-12-27T18:59:03Z","published":"2024-12-27T18:59:03Z","title":"LASER: A new method for locally adaptive nonparametric regression","summary":" In this article, we introduce \\textsf{LASER} (Locally Adaptive Smoothing\nEstimator for Regression), a computationally efficient locally adaptive\nnonparametric regression method that performs variable bandwidth local\npolynomial regression. We prove that it adapts (near-)optimally to the local\nH\\\"{o}lder exponent of the underlying regression function\n\\texttt{simultaneously} at all points in its domain. Furthermore, we show that\nthere is a single ideal choice of a global tuning parameter under which the\nabove mentioned local adaptivity holds. Despite the vast literature on\nnonparametric regression, instances of practicable methods with provable\nguarantees of such a strong notion of local adaptivity are rare. The proposed\nmethod achieves excellent performance across a broad range of numerical\nexperiments in comparison to popular alternative locally adaptive methods.\n","authors":["Sabyasachi Chatterjee","Subhajit Goswami","Soumendu Sundar Mukherjee"],"pdf_url":"https://arxiv.org/pdf/2412.19802v1.pdf","comment":"29 pages, 6 figures"},{"id":"http://arxiv.org/abs/2412.19792v1","updated":"2024-12-27T18:45:36Z","published":"2024-12-27T18:45:36Z","title":"InfAlign: Inference-aware language model alignment","summary":" Language model alignment has become a critical step in training modern\ngenerative language models. The goal of alignment is to finetune a reference\nmodel such that the win rate of a sample from the aligned model over a sample\nfrom the reference model is high, subject to a KL divergence constraint. 
Today,\nwe are increasingly using inference-time algorithms (e.g., Best-of-N,\ncontrolled decoding, tree search) to decode from language models rather than\nstandard sampling. However, the alignment objective does not capture such\ninference-time decoding procedures. We show that the existing alignment\nframework is sub-optimal in view of such inference-time methods. We then modify\nthe alignment objective and propose a framework for inference-aware alignment\n(IAPO). We prove that for any inference-time decoding algorithm, the optimal\nsolution that optimizes the inference-time win rate of the aligned policy\nagainst the reference policy is the solution to the typical RLHF problem with a\ntransformation of the reward. This motivates us to provide the KL-regularized\ncalibrate-and-transform RL (CTRL) algorithm to solve this problem, which\ninvolves a reward calibration step and a KL-regularized reward maximization\nstep with a transformation of the calibrated reward. We particularize our study\nto two important inference-time strategies: best-of-N sampling and best-of-N\njailbreaking, where N responses are sampled from the model and the one with the\nhighest or lowest reward is selected. We propose specific transformations for\nthese strategies and demonstrate that our framework offers significant\nimprovements over existing state-of-the-art methods for language model\nalignment. 
Empirically, we outperform baselines that are designed without\ntaking inference-time decoding into consideration by 8-12% and 4-9% on\ninference-time win rates over the Anthropic helpfulness and harmlessness dialog\nbenchmark datasets.\n","authors":["Ananth Balashankar","Ziteng Sun","Jonathan Berant","Jacob Eisenstein","Michael Collins","Adrian Hutter","Jong Lee","Chirag Nagpal","Flavien Prost","Aradhana Sinha","Ananda Theertha Suresh","Ahmad Beirami"],"pdf_url":"https://arxiv.org/pdf/2412.19792v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19781v1","updated":"2024-12-27T18:25:08Z","published":"2024-12-27T18:25:08Z","title":"Machine Learning for Sentiment Analysis of Imported Food in Trinidad and\n Tobago","summary":" This research investigates the performance of various machine learning\nalgorithms (CNN, LSTM, VADER, and RoBERTa) for sentiment analysis of Twitter\ndata related to imported food items in Trinidad and Tobago. The study addresses\nthree primary research questions: the comparative accuracy and efficiency of\nthe algorithms, the optimal configurations for each model, and the potential\napplications of the optimized models in a live system for monitoring public\nsentiment and its impact on the import bill. The dataset comprises tweets from\n2018 to 2024, divided into imbalanced, balanced, and temporal subsets to assess\nthe impact of data balancing and the COVID-19 pandemic on sentiment trends. Ten\nexperiments were conducted to evaluate the models under various configurations.\nResults indicated that VADER outperformed the other models in both multi-class\nand binary sentiment classifications.
The study highlights significant changes\nin sentiment trends pre- and post-COVID-19, with implications for import\npolicies.\n","authors":["Cassandra Daniels","Koffka Khan"],"pdf_url":"https://arxiv.org/pdf/2412.19781v1.pdf","comment":"27 pages"},{"id":"http://arxiv.org/abs/2412.19780v1","updated":"2024-12-27T18:22:47Z","published":"2024-12-27T18:22:47Z","title":"Tensor Network Estimation of Distribution Algorithms","summary":" Tensor networks are a tool first employed in the context of many-body quantum\nphysics that now have a wide range of uses across the computational sciences,\nfrom numerical methods to machine learning. Methods integrating tensor networks\ninto evolutionary optimization algorithms have appeared in the recent\nliterature. In essence, these methods can be understood as replacing the\ntraditional crossover operation of a genetic algorithm with a tensor\nnetwork-based generative model. We investigate these methods from the point of\nview that they are Estimation of Distribution Algorithms (EDAs). We find that\noptimization performance of these methods is not related to the power of the\ngenerative model in a straightforward way. Generative models that are better\n(in the sense that they better model the distribution from which their training\ndata is drawn) do not necessarily result in better performance of the\noptimization algorithm they form a part of. This raises the question of how\nbest to incorporate powerful generative models into optimization routines. 
In\nlight of this we find that adding an explicit mutation operator to the output\nof the generative model often improves optimization performance.\n","authors":["John Gardiner","Javier Lopez-Piqueres"],"pdf_url":"https://arxiv.org/pdf/2412.19780v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19778v1","updated":"2024-12-27T18:19:26Z","published":"2024-12-27T18:19:26Z","title":"Symbolic Approximations to Ricci-flat Metrics Via Extrinsic Symmetries\n of Calabi-Yau Hypersurfaces","summary":" Ever since Yau's non-constructive existence proof of Ricci-flat metrics on\nCalabi-Yau manifolds, finding their explicit construction remains a major\nobstacle to development of both string theory and algebraic geometry. Recent\ncomputational approaches employ machine learning to create novel neural\nrepresentations for approximating these metrics, offering high accuracy but\nlimited interpretability. In this paper, we analyse machine learning\napproximations to flat metrics of Fermat Calabi-Yau n-folds and some of their\none-parameter deformations in three dimensions in order to discover their new\nproperties. We formalise cases in which the flat metric has more symmetries\nthan the underlying manifold, and prove that these symmetries imply that the\nflat metric admits a surprisingly compact representation for certain choices of\ncomplex structure moduli. We show that such symmetries uniquely determine the\nflat metric on certain loci, for which we present an analytic form. We also\nincorporate our theoretical results into neural networks to achieve\nstate-of-the-art reductions in Ricci curvature for multiple Calabi-Yau\nmanifolds. 
We conclude by distilling the ML models to obtain for the first time\nclosed form expressions for Kahler metrics with near-zero scalar curvature.\n","authors":["Viktor Mirjanić","Challenger Mishra"],"pdf_url":"https://arxiv.org/pdf/2412.19778v1.pdf","comment":"40 pages, 14 figures"},{"id":"http://arxiv.org/abs/2410.10044v2","updated":"2024-12-27T18:16:12Z","published":"2024-10-13T23:17:58Z","title":"DAG-aware Transformer for Causal Effect Estimation","summary":" Causal inference is a critical task across fields such as healthcare,\neconomics, and the social sciences. While recent advances in machine learning,\nespecially those based on the deep-learning architectures, have shown potential\nin estimating causal effects, existing approaches often fall short in handling\ncomplex causal structures and lack adaptability across various causal\nscenarios. In this paper, we present a novel transformer-based method for\ncausal inference that overcomes these challenges. The core innovation of our\nmodel lies in its integration of causal Directed Acyclic Graphs (DAGs) directly\ninto the attention mechanism, enabling it to accurately model the underlying\ncausal structure. This allows for flexible estimation of both average treatment\neffects (ATE) and conditional average treatment effects (CATE). Extensive\nexperiments on both synthetic and real-world datasets demonstrate that our\napproach surpasses existing methods in estimating causal effects across a wide\nrange of scenarios. The flexibility and robustness of our model make it a\nvaluable tool for researchers and practitioners tackling complex causal\ninference problems.\n","authors":["Manqing Liu","David R. Bellamy","Andrew L. 
Beam"],"pdf_url":"https://arxiv.org/pdf/2410.10044v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19774v1","updated":"2024-12-27T18:12:04Z","published":"2024-12-27T18:12:04Z","title":"Analysis of Premature Death Rates in Texas Counties: The Impact of Air\n Quality, Socioeconomic Factors, and COPD Prevalence","summary":" Understanding factors contributing to premature mortality is critical for\npublic health planning. This study examines the relationships between premature\ndeath rates and multiple risk factors across several Texas counties, utilizing\nEPA air quality data, Census information, and county health records from recent\nyears. We analyze the impact of air quality (PM2.5 levels), socioeconomic\nfactors (median household income), and health conditions (COPD prevalence)\nthrough statistical analysis and modeling techniques. Results reveal COPD\nprevalence as a strong predictor of premature death rates, with higher\nprevalence associated with a substantial increase in years of potential life\nlost. While socioeconomic factors show a significant negative correlation, air\nquality demonstrates more complex indirect relationships. These findings\nemphasize the need for integrated public health interventions that prioritize\nkey health conditions while addressing underlying socioeconomic disparities.\n","authors":["Richard Rich","Ernesto Diaz"],"pdf_url":"https://arxiv.org/pdf/2412.19774v1.pdf","comment":"5 pages"},{"id":"http://arxiv.org/abs/2412.19770v1","updated":"2024-12-27T18:06:25Z","published":"2024-12-27T18:06:25Z","title":"Fortran2CPP: Automating Fortran-to-C++ Migration using LLMs via\n Multi-Turn Dialogue and Dual-Agent Integration","summary":" Migrating Fortran code to C++ is a common task for many scientific computing\nteams, driven by the need to leverage modern programming paradigms, enhance\ncross-platform compatibility, and improve maintainability. 
Automating this\ntranslation process using large language models (LLMs) has shown promise, but\nthe lack of high-quality, specialized datasets has hindered their\neffectiveness. In this paper, we address this challenge by introducing a novel\nmulti-turn dialogue dataset, Fortran2CPP, specifically designed for\nFortran-to-C++ code migration. Our dataset, significantly larger than existing\nalternatives, is generated using a unique LLM-driven, dual-agent pipeline\nincorporating iterative compilation, execution, and code repair to ensure high\nquality and functional correctness. To demonstrate the effectiveness of our\ndataset, we fine-tuned several open-weight LLMs on Fortran2CPP and evaluated\ntheir performance on two independent benchmarks. Fine-tuning on our dataset led\nto remarkable gains, with models achieving up to a 3.31x increase in CodeBLEU\nscore and a 92\\% improvement in compilation success rate. This highlights the\ndataset's ability to enhance both the syntactic accuracy and compilability of\nthe translated C++ code. Our dataset and model have been open-sourced and are\navailable on our public GitHub\nrepository\\footnote{\\url{https://github.com/HPC-Fortran2CPP/Fortran2Cpp}}.\n","authors":["Le Chen","Bin Lei","Dunzhi Zhou","Pei-Hung Lin","Chunhua Liao","Caiwen Ding","Ali Jannesari"],"pdf_url":"https://arxiv.org/pdf/2412.19770v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19765v1","updated":"2024-12-27T17:53:01Z","published":"2024-12-27T17:53:01Z","title":"From Ceilings to Walls: Universal Dynamic Perching of Small Aerial\n Robots on Surfaces with Variable Orientations","summary":" This work demonstrates universal dynamic perching capabilities for quadrotors\nof various sizes and on surfaces with different orientations. By employing a\nnon-dimensionalization framework and deep reinforcement learning, we\nsystematically assessed how robot size and surface orientation affect landing\ncapabilities. 
We hypothesized that maintaining geometric proportions across\ndifferent robot scales ensures consistent perching behavior, which was\nvalidated in both simulation and experimental tests. Additionally, we\ninvestigated the effects of joint stiffness and damping in the landing gear on\nperching behaviors and performance. While joint stiffness had minimal impact,\njoint damping ratios influenced landing success under vertical approaching\nconditions. The study also identified a critical velocity threshold necessary\nfor successful perching, determined by the robot's maneuverability and leg\ngeometry. Overall, this research advances robotic perching capabilities,\noffering insights into the role of mechanical design and scaling effects, and\nlays the groundwork for future drone autonomy and operational efficiency in\nunstructured environments.\n","authors":["Bryan Habas","Aaron Brown","Donghyeon Lee","Mitchell Goldman","Bo Cheng"],"pdf_url":"https://arxiv.org/pdf/2412.19765v1.pdf","comment":"7 pages, 8 Figures"},{"id":"http://arxiv.org/abs/2409.01366v2","updated":"2024-12-27T17:49:34Z","published":"2024-09-02T16:41:44Z","title":"CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and\n Selective Sparsification","summary":" Deploying large language models (LLMs) on edge devices presents significant\nchallenges due to the substantial computational overhead and memory\nrequirements. Activation sparsification can mitigate these resource challenges\nby reducing the number of activated neurons during inference. Existing methods\ntypically employ thresholding-based sparsification based on the statistics of\nactivation tensors. However, they do not model the impact of activation\nsparsification on performance, resulting in suboptimal performance degradation.\nTo address the limitations, this paper reformulates the activation\nsparsification problem to explicitly capture the relationship between\nactivation sparsity and model performance. 
Then, this paper proposes CHESS, a\ngeneral activation sparsification approach via CHannel-wise thrEsholding and\nSelective Sparsification. First, channel-wise thresholding assigns a unique\nthreshold to each activation channel in the feed-forward network (FFN) layers.\nThen, selective sparsification involves applying thresholding-based activation\nsparsification to specific layers within the attention modules. Finally, we\ndetail the implementation of sparse kernels to accelerate LLM inference.\nExperimental results demonstrate that the proposed CHESS achieves lower\nperformance degradation over eight downstream tasks while activating fewer\nparameters than existing methods, thus speeding up the LLM inference by up to\n1.27x.\n","authors":["Junhui He","Shangyu Wu","Weidong Wen","Chun Jason Xue","Qingan Li"],"pdf_url":"https://arxiv.org/pdf/2409.01366v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.21792v3","updated":"2024-12-27T17:36:21Z","published":"2024-07-31T17:59:24Z","title":"Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?","summary":" As artificial intelligence systems grow more powerful, there has been\nincreasing interest in \"AI safety\" research to address emerging and future\nrisks. However, the field of AI safety remains poorly defined and\ninconsistently measured, leading to confusion about how researchers can\ncontribute. This lack of clarity is compounded by the unclear relationship\nbetween AI safety benchmarks and upstream general capabilities (e.g., general\nknowledge and reasoning). To address these issues, we conduct a comprehensive\nmeta-analysis of AI safety benchmarks, empirically analyzing their correlation\nwith general capabilities across dozens of models and providing a survey of\nexisting directions in AI safety. 
Our findings reveal that many safety\nbenchmarks highly correlate with both upstream model capabilities and training\ncompute, potentially enabling \"safetywashing\"--where capability improvements\nare misrepresented as safety advancements. Based on these findings, we propose\nan empirical foundation for developing more meaningful safety metrics and\ndefine AI safety in a machine learning research context as a set of clearly\ndelineated research goals that are empirically separable from generic\ncapabilities advancements. In doing so, we aim to provide a more rigorous\nframework for AI safety research, advancing the science of safety evaluations\nand clarifying the path towards measurable progress.\n","authors":["Richard Ren","Steven Basart","Adam Khoja","Alice Gatti","Long Phan","Xuwang Yin","Mantas Mazeika","Alexander Pan","Gabriel Mukobi","Ryan H. Kim","Stephen Fitz","Dan Hendrycks"],"pdf_url":"https://arxiv.org/pdf/2407.21792v3.pdf","comment":"NeurIPS 2024"},{"id":"http://arxiv.org/abs/2407.13873v2","updated":"2024-12-27T17:16:25Z","published":"2024-07-18T19:41:46Z","title":"Keypoint Aware Masked Image Modelling","summary":" SimMIM is a widely used method for pretraining vision transformers using\nmasked image modeling. However, despite its success in fine-tuning performance,\nit has been shown to perform sub-optimally when used for linear probing. We\npropose an efficient patch-wise weighting derived from keypoint features which\ncaptures the local information and provides better context during SimMIM's\nreconstruction phase. Our method, KAMIM, improves the top-1 linear probing\naccuracy from 16.12% to 33.97%, and finetuning accuracy from 76.78% to 77.3%\nwhen tested on the ImageNet-1K dataset with a ViT-B when trained for the same\nnumber of epochs. We conduct extensive testing on different datasets, keypoint\nextractors, and model architectures and observe that patch-wise weighting\naugments linear probing performance for larger pretraining datasets. 
We also\nanalyze the learned representations of a ViT-B trained using KAMIM and observe\nthat they behave similar to contrastive learning with regard to its behavior,\nwith longer attention distances and homogenous self-attention across layers.\nOur code is publicly available at https://github.com/madhava20217/KAMIM.\n","authors":["Madhava Krishna","A V Subramanyam"],"pdf_url":"https://arxiv.org/pdf/2407.13873v2.pdf","comment":"Accepted to ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.19747v1","updated":"2024-12-27T17:14:52Z","published":"2024-12-27T17:14:52Z","title":"Enhancing Adversarial Robustness of Deep Neural Networks Through\n Supervised Contrastive Learning","summary":" Adversarial attacks exploit the vulnerabilities of convolutional neural\nnetworks by introducing imperceptible perturbations that lead to\nmisclassifications, exposing weaknesses in feature representations and decision\nboundaries. This paper presents a novel framework combining supervised\ncontrastive learning and margin-based contrastive loss to enhance adversarial\nrobustness. Supervised contrastive learning improves the structure of the\nfeature space by clustering embeddings of samples within the same class and\nseparating those from different classes. Margin-based contrastive loss,\ninspired by support vector machines, enforces explicit constraints to create\nrobust decision boundaries with well-defined margins. 
Experiments on the\nCIFAR-100 dataset with a ResNet-18 backbone demonstrate robustness performance\nimprovements in adversarial accuracy under Fast Gradient Sign Method attacks.\n","authors":["Longwei Wang","Navid Nayyem","Abdullah Rakin"],"pdf_url":"https://arxiv.org/pdf/2412.19747v1.pdf","comment":"8 pages, 11 figures"},{"id":"http://arxiv.org/abs/2412.19732v1","updated":"2024-12-27T16:43:52Z","published":"2024-12-27T16:43:52Z","title":"Generative Pretrained Embedding and Hierarchical Irregular Time Series\n Representation for Daily Living Activity Recognition","summary":" Within the evolving landscape of smart homes, the precise recognition of\ndaily living activities using ambient sensor data stands paramount. This paper\nnot only aims to bolster existing algorithms by evaluating two distinct\npretrained embeddings suited for ambient sensor activations but also introduces\na novel hierarchical architecture. We delve into an architecture anchored on\nTransformer Decoder-based pre-trained embeddings, reminiscent of the GPT\ndesign, and contrast it with the previously established state-of-the-art (SOTA)\nELMo embeddings for ambient sensors. Our proposed hierarchical structure\nleverages the strengths of each pre-trained embedding, enabling the discernment\nof activity dependencies and sequence order, thereby enhancing classification\nprecision. To further refine recognition, we incorporate into our proposed\narchitecture an hour-of-the-day embedding. Empirical evaluations underscore the\npreeminence of the Transformer Decoder embedding in classification endeavors.\nAdditionally, our innovative hierarchical design significantly bolsters the\nefficacy of both pre-trained embeddings, notably in capturing inter-activity\nnuances. The integration of temporal aspects subtly but distinctively augments\nclassification, especially for time-sensitive activities. 
In conclusion, our\nGPT-inspired hierarchical approach, infused with temporal insights, outshines\nthe SOTA ELMo benchmark.\n","authors":["Damien Bouchabou","Sao Mai Nguyen"],"pdf_url":"https://arxiv.org/pdf/2412.19732v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.03390v3","updated":"2024-12-27T16:43:44Z","published":"2024-01-07T05:03:30Z","title":"Global Prediction of COVID-19 Variant Emergence Using Dynamics-Informed\n Graph Neural Networks","summary":" During the COVID-19 pandemic, a major driver of new surges has been the\nemergence of new variants. When a new variant emerges in one or more countries,\nother nations monitor its spread in preparation for its potential arrival. The\nimpact of the new variant and the timings of epidemic peaks in a country highly\ndepend on when the variant arrives. The current methods for predicting the\nspread of new variants rely on statistical modeling, however, these methods\nwork only when the new variant has already arrived in the region of interest\nand has a significant prevalence. Can we predict when a variant existing\nelsewhere will arrive in a given region? To address this question, we propose a\nvariant-dynamics-informed Graph Neural Network (GNN) approach. First, we derive\nthe dynamics of variant prevalence across pairs of regions (countries) that\napply to a large class of epidemic models. The dynamics motivate the\nintroduction of certain features in the GNN. We demonstrate that our proposed\ndynamics-informed GNN outperforms all the baselines, including the currently\npervasive framework of Physics-Informed Neural Networks (PINNs). 
To advance\nresearch in this area, we introduce a benchmarking tool to assess a\nuser-defined model's prediction performance across 87 countries and 36\nvariants.\n","authors":["Majd Al Aawar","Srikar Mutnuri","Mansooreh Montazerin","Ajitesh Srivastava"],"pdf_url":"https://arxiv.org/pdf/2401.03390v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19727v1","updated":"2024-12-27T16:31:09Z","published":"2024-12-27T16:31:09Z","title":"Learning to Forget: Bayesian Time Series Forecasting using Recurrent\n Sparse Spectrum Signature Gaussian Processes","summary":" The signature kernel is a kernel between time series of arbitrary length and\ncomes with strong theoretical guarantees from stochastic analysis. It has found\napplications in machine learning such as covariance functions for Gaussian\nprocesses. A strength of the underlying signature features is that they provide\na structured global description of a time series. However, this property can\nquickly become a curse when local information is essential and forgetting is\nrequired; so far this has only been addressed with ad-hoc methods such as\nslicing the time series into subsegments. To overcome this, we propose a\nprincipled, data-driven approach by introducing a novel forgetting mechanism\nfor signatures. This allows the model to dynamically adapt its context length\nto focus on more recent information. To achieve this, we revisit the recently\nintroduced Random Fourier Signature Features, and develop Random Fourier\nDecayed Signature Features (RFDSF) with Gaussian processes (GPs). This results\nin a Bayesian time series forecasting algorithm with variational inference,\nthat offers a scalable probabilistic algorithm that processes and transforms a\ntime series into a joint predictive distribution over time steps in one pass\nusing recurrence. For example, processing a sequence of length $10^4$ steps in\n$\\approx 10^{-2}$ seconds and in $< 1\\text{GB}$ of GPU memory. 
We demonstrate\nthat it outperforms other GP-based alternatives and competes with\nstate-of-the-art probabilistic time series forecasting algorithms.\n","authors":["Csaba Tóth","Masaki Adachi","Michael A. Osborne","Harald Oberhauser"],"pdf_url":"https://arxiv.org/pdf/2412.19727v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19725v1","updated":"2024-12-27T16:24:31Z","published":"2024-12-27T16:24:31Z","title":"EEG-Reptile: An Automatized Reptile-Based Meta-Learning Library for BCIs","summary":" Meta-learning, i.e., \"learning to learn\", is a promising approach to enable\nefficient BCI classifier training with limited amounts of data. It can\neffectively use collections of in some way similar classification tasks, with\nrapid adaptation to new tasks where only minimal data are available. However,\napplying meta-learning to existing classifiers and BCI tasks requires\nsignificant effort. To address this issue, we propose EEG-Reptile, an automated\nlibrary that leverages meta-learning to improve classification accuracy of\nneural networks in BCIs and other EEG-based applications. It utilizes the\nReptile meta-learning algorithm to adapt neural network classifiers of EEG data\nto the inter-subject domain, allowing for more efficient fine-tuning for a new\nsubject on a small amount of data. The proposed library incorporates an\nautomated hyperparameter tuning module, a data management pipeline, and an\nimplementation of the Reptile meta-learning algorithm. EEG-Reptile automation\nlevel allows using it without deep understanding of meta-learning. We\ndemonstrate the effectiveness of EEG-Reptile on two benchmark datasets (BCI IV\n2a, Lee2019 MI) and three neural network architectures (EEGNet, FBCNet,\nEEG-Inception). Our library achieved improvement in both zero-shot and few-shot\nlearning scenarios compared to traditional transfer learning approaches.\n","authors":["Daniil A. Berdyshev","Artem M. Grachev","Sergei L. Shishkin","Bogdan L. 
Kozyrskiy"],"pdf_url":"https://arxiv.org/pdf/2412.19725v1.pdf","comment":"For proposed python library, see EEG-Reptile GitHub:\n https://github.com/gasiki/EEG-Reptile"},{"id":"http://arxiv.org/abs/2411.17251v4","updated":"2024-12-27T16:24:20Z","published":"2024-11-26T09:29:27Z","title":"DGNN-YOLO: Interpretable Dynamic Graph Neural Networks with YOLO11 for\n Small Object Detection and Tracking in Traffic Surveillance","summary":" Accurate detection and tracking of small objects, such as pedestrians,\ncyclists, and motorbikes, is critical for traffic surveillance systems, which\nare crucial for improving road safety and decision-making in intelligent\ntransportation systems. However, traditional methods face challenges such as\nocclusion, low resolution, and dynamic traffic conditions, necessitating\ninnovative approaches to address these limitations. This paper introduces\nDGNN-YOLO, a novel framework integrating dynamic graph neural networks (DGNN)\nwith YOLO11 to enhance small-object detection and tracking in traffic\nsurveillance systems. The framework leverages YOLO11's advanced spatial feature\nextraction capabilities for precise object detection and incorporates a DGNN to\nmodel spatial-temporal relationships for robust real-time tracking dynamically.\nBy constructing and updating graph structures, DGNN-YOLO effectively represents\nobjects as nodes and their interactions as edges, thereby ensuring adaptive and\naccurate tracking in complex and dynamic environments. 
Additionally, Grad-CAM,\nGrad-CAM++, and Eigen-CAM visualization techniques were applied to DGNN-YOLO to\nprovide model-agnostic interpretability and deeper insights into the model's\ndecision-making process, enhancing its transparency and trustworthiness.\nExtensive experiments demonstrated that DGNN-YOLO consistently outperformed\nstate-of-the-art methods in detecting and tracking small objects under diverse\ntraffic conditions, achieving the highest precision (0.8382), recall (0.6875),\nand mAP@0.5:0.95 (0.6476), showing its robustness and scalability, particularly\nin challenging scenarios involving small and occluded objects. This study\nprovides a scalable, real-time traffic surveillance and analysis solution,\nsignificantly contributing to intelligent transportation systems.\n","authors":["Shahriar Soudeep","M. F. Mridha","Md Abrar Jahin","Nilanjan Dey"],"pdf_url":"https://arxiv.org/pdf/2411.17251v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.07446v3","updated":"2024-12-27T16:21:32Z","published":"2024-10-09T21:26:49Z","title":"KACQ-DCNN: Uncertainty-Aware Interpretable Kolmogorov-Arnold\n Classical-Quantum Dual-Channel Neural Network for Heart Disease Detection","summary":" Heart failure is a leading cause of global mortality, necessitating improved\ndiagnostic strategies. Classical machine learning models struggle with\nchallenges such as high-dimensional data, class imbalances, poor feature\nrepresentations, and lack of interpretability. While quantum machine learning\nholds promise, current hybrid models have not fully exploited quantum\nadvantages. In this paper, we propose the Kolmogorov-Arnold Classical-Quantum\nDual-Channel Neural Network (KACQ-DCNN), a novel hybrid architecture that\nreplaces traditional multilayer perceptrons with Kolmogorov-Arnold Networks\n(KANs), enabling learnable univariate activation functions. 
Our KACQ-DCNN\n4-qubit, 1-layer model outperforms 37 benchmark models, including 16 classical\nand 12 quantum neural networks, achieving an accuracy of 92.03%, with\nmacro-average precision, recall, and F1 scores of 92.00%. It also achieved a\nROC-AUC of 94.77%, surpassing other models by significant margins, as validated\nby paired t-tests with a significance threshold of 0.0056 (after Bonferroni\ncorrection). Ablation studies highlight the synergistic effect of\nclassical-quantum integration, improving performance by about 2% over MLP\nvariants. Additionally, LIME and SHAP explainability techniques enhance feature\ninterpretability, while conformal prediction provides robust uncertainty\nquantification. Our results demonstrate that KACQ-DCNN improves cardiovascular\ndiagnostics by combining high accuracy with interpretability and uncertainty\nquantification.\n","authors":["Md Abrar Jahin","Md. Akmol Masud","M. F. Mridha","Zeyar Aung","Nilanjan Dey"],"pdf_url":"https://arxiv.org/pdf/2410.07446v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19718v1","updated":"2024-12-27T16:17:22Z","published":"2024-12-27T16:17:22Z","title":"Text2Insight: Transform natural language text into insights seamlessly\n using multi-model architecture","summary":" The growing demand for dynamic, user-centric data analysis and visualization\nis evident across domains like healthcare, finance, and research. Traditional\nvisualization tools often fail to meet individual user needs due to their\nstatic and predefined nature. To address this gap, Text2Insight is introduced\nas an innovative solution that delivers customized data analysis and\nvisualizations based on user-defined natural language requirements. Leveraging\na multi-model architecture, Text2Insight transforms user inputs into actionable\ninsights and dynamic visualizations.\n The methodology begins with analyzing the input dataset to extract structural\ndetails such as columns and values. 
A pre-trained Llama3 model converts the\nuser's natural language query into an SQL query, which is further refined using\na Named Entity Recognition (NER) model for accuracy. A chart predictor\ndetermines the most suitable visualization type, while the Llama3 model\ngenerates insights based on the SQL query's results. The output is a\nuser-friendly and visually informative chart. To enhance analysis capabilities,\nthe system integrates a question-answering model and a predictive model using\nthe BERT framework. These models provide insights into historical data and\npredict future trends.\n Performance evaluation of Text2Insight demonstrates its effectiveness,\nachieving high accuracy (99%), precision (100%), recall (99%), and F1-score\n(99%), with a BLEU score of 0.5. The question-answering model attained an\naccuracy of 89% and the predictive model achieved 70% accuracy. These results\nvalidate Text2Insight as a robust and viable solution for transforming natural\nlanguage text into dynamic, user-specific data analysis and visualizations.\n","authors":["Pradeep Sain"],"pdf_url":"https://arxiv.org/pdf/2412.19718v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19713v1","updated":"2024-12-27T16:14:06Z","published":"2024-12-27T16:14:06Z","title":"ProKAN: Progressive Stacking of Kolmogorov-Arnold Networks for Efficient\n Liver Segmentation","summary":" The growing need for accurate and efficient 3D identification of tumors,\nparticularly in liver segmentation, has spurred considerable research into deep\nlearning models. While many existing architectures offer strong performance,\nthey often face challenges such as overfitting and excessive computational\ncosts. An adjustable and flexible architecture that strikes a balance between\ntime efficiency and model complexity remains an unmet requirement. In this\npaper, we introduce proKAN, a progressive stacking methodology for\nKolmogorov-Arnold Networks (KANs) designed to address these challenges. 
Unlike\ntraditional architectures, proKAN dynamically adjusts its complexity by\nprogressively adding KAN blocks during training, based on overfitting behavior.\nThis approach allows the network to stop growing when overfitting is detected,\npreventing unnecessary computational overhead while maintaining high accuracy.\nAdditionally, proKAN utilizes KAN's learnable activation functions modeled\nthrough B-splines, which provide enhanced flexibility in learning complex\nrelationships in 3D medical data. Our proposed architecture achieves\nstate-of-the-art performance in liver segmentation tasks, outperforming\nstandard Multi-Layer Perceptrons (MLPs) and fixed KAN architectures. The\ndynamic nature of proKAN ensures efficient training times and high accuracy\nwithout the risk of overfitting. Furthermore, proKAN provides better\ninterpretability by allowing insight into the decision-making process through\nits learnable coefficients. The experimental results demonstrate a significant\nimprovement in accuracy, Dice score, and time efficiency, making proKAN a\ncompelling solution for 3D medical image segmentation tasks.\n","authors":["Bhavesh Gyanchandani","Aditya Oza","Abhinav Roy"],"pdf_url":"https://arxiv.org/pdf/2412.19713v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19711v1","updated":"2024-12-27T16:10:03Z","published":"2024-12-27T16:10:03Z","title":"Causal machine learning for heterogeneous treatment effects in the\n presence of missing outcome data","summary":" When estimating heterogeneous treatment effects, missing outcome data can\ncomplicate treatment effect estimation, causing certain subgroups of the\npopulation to be poorly represented. In this work, we discuss this commonly\noverlooked problem and consider the impact that missing at random (MAR) outcome\ndata has on causal machine learning estimators for the conditional average\ntreatment effect (CATE). 
We then propose two de-biased machine learning\nestimators for the CATE, the mDR-learner and mEP-learner, which address the\nissue of under-representation by integrating inverse probability of censoring\nweights into the DR-learner and EP-learner respectively. We show that under\nreasonable conditions, these estimators are oracle efficient, and illustrate\ntheir favorable performance through simulated data settings, comparing them to\nexisting CATE estimators, including comparison to estimators which use common\nmissing data techniques. Guidance on the implementation of these estimators is\nprovided and we present an example of their application using the ACTG175\ntrial, exploring treatment effect heterogeneity when comparing Zidovudine\nmono-therapy against alternative antiretroviral therapies among HIV-1-infected\nindividuals.\n","authors":["Matthew Pryce","Karla Diaz-Ordaz","Ruth H. Keogh","Stijn Vansteelandt"],"pdf_url":"https://arxiv.org/pdf/2412.19711v1.pdf","comment":"34 pages, 6 figures, 4 tables"},{"id":"http://arxiv.org/abs/2412.19707v1","updated":"2024-12-27T16:02:34Z","published":"2024-12-27T16:02:34Z","title":"Toward Adaptive Reasoning in Large Language Models with Thought Rollback","summary":" Large language models (LLMs) have been routinely used to solve various tasks\nusing step-by-step reasoning. However, the structure of intermediate reasoning\nsteps, or thoughts, is rigid and unidirectional, such as chains, trees, or\nacyclic-directed graphs. Consequently, the resulting inflexible and\nforward-only reasoning may not address challenging tasks and fail when the LLM\nfrequently gives false responses, i.e., ``hallucinations''. This paper proposes\na new reasoning framework, called Thought Rollback (TR), allowing LLMs to\nadaptively build thought structure while maintaining effective reasoning toward\nproblem-solving under ``hallucinations''. 
The core mechanism of TR is rolling\nback thoughts, which allows LLMs to perform error analysis on thoughts, and\nthus roll back to any previously mistaken thought for revision. Subsequently,\nby including such trial-and-error in the prompt to guide the LLM, each rollback\nleads to one more reliable reasoning path. Therefore, starting with a simple\nprompt without human annotations, LLM with TR adaptively and gradually explores\nthoughts for a correct solution. Comprehensive experiments on mathematical\nproblems and multi-task reasoning demonstrate the state-of-the-art performance\nof TR in terms of problem-solving rate and interaction cost. For instance, the\nsolving rate of GPT-4 with TR outperforms the current best by $9\\%$ on the MATH\ndataset.\n","authors":["Sijia Chen","Baochun Li"],"pdf_url":"https://arxiv.org/pdf/2412.19707v1.pdf","comment":"ICML 2024 camera-ready version with 24 pages and 12 figures. Code\n repo with all prompts:\n https://github.com/iQua/llmpebase/tree/main/examples/ThoughtRollback"},{"id":"http://arxiv.org/abs/2412.19683v1","updated":"2024-12-27T15:20:57Z","published":"2024-12-27T15:20:57Z","title":"Combining Machine Learning with Recurrence Analysis for resonance\n detection","summary":" The width of a resonance in a nearly integrable system, i.e. in a\nnon-integrable system where chaotic motion is still not prominent, can tell us\nhow a perturbation parameter is driving the system away from integrability.\nAlthough the tool that we are presenting here can be used is quite generic and\ncan be used in a variety of systems, our particular interest lies in binary\ncompact object systems known as extreme mass ratio inspirals (EMRIs). In an\nEMRI a lighter compact object, like a black hole or a neutron star, inspirals\ninto a supermassive black hole due to gravitational radiation reaction. During\nthis inspiral the lighter object crosses resonances, which are still not very\nwell modeled. 
Measuring the width of resonances in EMRI models allows us to\nestimate the importance of each perturbation parameter able to drive the system\naway from resonances and decide whether its impact should be included in EMRI\nwaveform modeling or not. To tackle this issue in our study we show first that\nrecurrence quantifiers of orbits carry imprints of resonant behavior,\nregardless of the system's dimensionality. As a next step, we apply a long\nshort-term memory machine learning architecture to automate the resonance\ndetection procedure. Our analysis is developed on a simple standard map and\ngradually we extend it to more complicated systems until finally we employ it\nin a generic deformed Kerr spacetime known in the literature as the\nJohannsen-Psaltis spacetime.\n","authors":["Ondřej Zelenka","Ondřej Kopáček","Georgios Lukes-Gerakopoulos"],"pdf_url":"https://arxiv.org/pdf/2412.19683v1.pdf","comment":"12 pages, 10 figures"},{"id":"http://arxiv.org/abs/2412.19677v1","updated":"2024-12-27T14:57:40Z","published":"2024-12-27T14:57:40Z","title":"Deep ReLU networks -- injectivity capacity upper bounds","summary":" We study deep ReLU feed forward neural networks (NN) and their injectivity\nabilities. The main focus is on \\emph{precisely} determining the so-called\ninjectivity capacity. For any given hidden layers architecture, it is defined\nas the minimal ratio between number of network's outputs and inputs which\nensures unique recoverability of the input from a realizable output. A strong\nrecent progress in precisely studying single ReLU layer injectivity properties\nis here moved to a deep network level. In particular, we develop a program that\nconnects deep $l$-layer net injectivity to an $l$-extension of the $\\ell_0$\nspherical perceptrons, thereby massively generalizing an isomorphism between\nstudying single layer injectivity and the capacity of the so-called\n(1-extension) $\\ell_0$ spherical perceptrons discussed in [82]. 
\\emph{Random\nduality theory} (RDT) based machinery is then created and utilized to\nstatistically handle properties of the extended $\\ell_0$ spherical perceptrons\nand implicitly of the deep ReLU NNs. A sizeable set of numerical evaluations is\nconducted as well to put the entire RDT machinery in practical use. From these\nwe observe a rapidly decreasing tendency in needed layers' expansions, i.e., we\nobserve a rapid \\emph{expansion saturation effect}. Only $4$ layers of depth\nare sufficient to closely approach level of no needed expansion -- a result\nthat fairly closely resembles observations made in practical experiments and\nthat has so far remained completely untouchable by any of the existing\nmathematical methodologies.\n","authors":["Mihailo Stojnic"],"pdf_url":"https://arxiv.org/pdf/2412.19677v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.01173v2","updated":"2024-12-27T14:46:55Z","published":"2024-08-02T10:47:10Z","title":"Sustainable Diffusion-based Incentive Mechanism for Generative AI-driven\n Digital Twins in Industrial Cyber-Physical Systems","summary":" Industrial Cyber-Physical Systems (ICPSs) are an integral component of modern\nmanufacturing and industries. By digitizing data throughout product life\ncycles, Digital Twins (DTs) in ICPSs enable a shift from current industrial\ninfrastructures to intelligent and adaptive infrastructures. Thanks to data\nprocess capability, Generative Artificial Intelligence (GenAI) can drive the\nconstruction and update of DTs to improve predictive accuracy and prepare for\ndiverse smart manufacturing. However, mechanisms that leverage Industrial\nInternet of Things (IIoT) devices to share sensing data for DT construction are\nsusceptible to adverse selection problems. In this paper, we first develop a\nGenAI-driven DT architecture in ICPSs. 
To address the adverse selection problem\ncaused by information asymmetry, we propose a contract theory model and develop\na sustainable diffusion-based soft actor-critic algorithm to identify the\noptimal feasible contract. Specifically, we leverage dynamic structured pruning\ntechniques to reduce parameter numbers of actor networks, allowing\nsustainability and efficient implementation of the proposed algorithm.\nNumerical results demonstrate the effectiveness of the proposed scheme and the\nalgorithm, enabling efficient DT construction and updates to monitor and manage\nICPSs.\n","authors":["Jinbo Wen","Jiawen Kang","Dusit Niyato","Yang Zhang","Shiwen Mao"],"pdf_url":"https://arxiv.org/pdf/2408.01173v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.13362v3","updated":"2024-12-27T14:44:30Z","published":"2024-05-22T05:43:15Z","title":"Lusifer: LLM-based User SImulated Feedback Environment for online\n Recommender systems","summary":" Training reinforcement learning-based recommender systems is often hindered\nby the lack of dynamic and realistic user interactions. To address this\nlimitation, we introduce Lusifer, a novel environment leveraging Large Language\nModels (LLMs) to generate simulated user feedback. Lusifer synthesizes user\nprofiles and interaction histories to simulate responses and behaviors toward\nrecommended items, with profiles updated after each rating to reflect evolving\nuser characteristics. Utilizing the MovieLens dataset as a proof of concept, we\nlimited our implementation to the last 40 interactions for each user,\nrepresenting approximately 39% and 22% of the training sets, to focus on recent\nuser behavior. For consistency and to gain insights into the performance of\ntraditional methods with limited data, we implemented baseline approaches using\nthe same data subset. 
Our results demonstrate that Lusifer accurately emulates\nuser behavior and preferences, even with reduced training data having an RMSE\nof 1.3 across various test sets. This paper presents Lusifer's operational\npipeline, including prompt generation and iterative user profile updates, and\ncompares its performance against baseline methods. The findings validate\nLusifer's ability to produce realistic dynamic feedback and suggest that it\noffers a scalable and adjustable framework for user simulation in online\nreinforcement learning recommender systems for future studies, particularly\nwhen training data is limited.\n","authors":["Danial Ebrat","Eli Paradalis","Luis Rueda"],"pdf_url":"https://arxiv.org/pdf/2405.13362v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19669v1","updated":"2024-12-27T14:31:52Z","published":"2024-12-27T14:31:52Z","title":"Toward Scalable Multirobot Control: Fast Policy Learning in Distributed\n MPC","summary":" Distributed model predictive control (DMPC) is promising in achieving optimal\ncooperative control in multirobot systems (MRS). However, real-time DMPC\nimplementation relies on numerical optimization tools to periodically calculate\nlocal control sequences online. This process is computationally demanding and\nlacks scalability for large-scale, nonlinear MRS. This article proposes a novel\ndistributed learning-based predictive control (DLPC) framework for scalable\nmultirobot control. Unlike conventional DMPC methods that calculate open-loop\ncontrol sequences, our approach centers around a computationally fast and\nefficient distributed policy learning algorithm that generates explicit\nclosed-loop DMPC policies for MRS without using numerical solvers. The policy\nlearning is executed incrementally and forward in time in each prediction\ninterval through an online distributed actor-critic implementation. 
The control\npolicies are successively updated in a receding-horizon manner, enabling fast\nand efficient policy learning with the closed-loop stability guarantee. The\nlearned control policies could be deployed online to MRS with varying robot\nscales, enhancing scalability and transferability for large-scale MRS.\nFurthermore, we extend our methodology to address the multirobot safe learning\nchallenge through a force field-inspired policy learning approach. We validate\nour approach's effectiveness, scalability, and efficiency through extensive\nexperiments on cooperative tasks of large-scale wheeled robots and multirotor\ndrones. Our results demonstrate the rapid learning and deployment of DMPC\npolicies for MRS with scales up to 10,000 units.\n","authors":["Xinglong Zhang","Wei Pan","Cong Li","Xin Xu","Xiangke Wang","Ronghua Zhang","Dewen Hu"],"pdf_url":"https://arxiv.org/pdf/2412.19669v1.pdf","comment":"26 pages, 19 figures"},{"id":"http://arxiv.org/abs/2111.08524v3","updated":"2024-12-27T14:08:01Z","published":"2021-11-16T14:53:19Z","title":"Non-separable Spatio-temporal Graph Kernels via SPDEs","summary":" Gaussian processes (GPs) provide a principled and direct approach for\ninference and learning on graphs. However, the lack of justified graph kernels\nfor spatio-temporal modelling has held back their use in graph problems. We\nleverage an explicit link between stochastic partial differential equations\n(SPDEs) and GPs on graphs, introduce a framework for deriving graph kernels via\nSPDEs, and derive non-separable spatio-temporal graph kernels that capture\ninteraction across space and time. We formulate the graph kernels for the\nstochastic heat equation and wave equation. 
We show that by providing novel\ntools for spatio-temporal GP modelling on graphs, we outperform pre-existing\ngraph kernels in real-world applications that feature diffusion, oscillation,\nand other complicated interactions.\n","authors":["Alexander Nikitin","ST John","Arno Solin","Samuel Kaski"],"pdf_url":"https://arxiv.org/pdf/2111.08524v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19654v1","updated":"2024-12-27T13:59:58Z","published":"2024-12-27T13:59:58Z","title":"Asymmetrical Reciprocity-based Federated Learning for Resolving\n Disparities in Medical Diagnosis","summary":" Geographic health disparities pose a pressing global challenge, particularly\nin underserved regions of low- and middle-income nations. Addressing this issue\nrequires a collaborative approach to enhance healthcare quality, leveraging\nsupport from medically more developed areas. Federated learning emerges as a\npromising tool for this purpose. However, the scarcity of medical data and\nlimited computation resources in underserved regions make collaborative\ntraining of powerful machine learning models challenging. Furthermore, there\nexists an asymmetrical reciprocity between underserved and developed regions.\nTo overcome these challenges, we propose a novel cross-silo federated learning\nframework, named FedHelp, aimed at alleviating geographic health disparities\nand fortifying the diagnostic capabilities of underserved regions.\nSpecifically, FedHelp leverages foundational model knowledge via one-time API\naccess to guide the learning process of underserved small clients, addressing\nthe challenge of insufficient data. Additionally, we introduce a novel\nasymmetric dual knowledge distillation module to manage the issue of asymmetric\nreciprocity, facilitating the exchange of necessary knowledge between developed\nlarge clients and underserved small clients. 
We validate the effectiveness and\nutility of FedHelp through extensive experiments on both medical image\nclassification and segmentation tasks. The experimental results demonstrate\nsignificant performance improvement compared to state-of-the-art baselines,\nparticularly benefiting clients in underserved regions.\n","authors":["Jiaqi Wang","Ziyi Yin","Quanzeng You","Lingjuan Lyu","Fenglong Ma"],"pdf_url":"https://arxiv.org/pdf/2412.19654v1.pdf","comment":"Jiaqi Wang and Ziyi Yin equally contributed to this paper. This paper\n has been accepted by KDD 2025"},{"id":"http://arxiv.org/abs/2412.19650v1","updated":"2024-12-27T13:55:11Z","published":"2024-12-27T13:55:11Z","title":"Toward Modality Gap: Vision Prototype Learning for Weakly-supervised\n Semantic Segmentation with CLIP","summary":" The application of Contrastive Language-Image Pre-training (CLIP) in Weakly\nSupervised Semantic Segmentation (WSSS) research demonstrates powerful cross-modal semantic\nunderstanding capabilities. Existing methods attempt to optimize input text\nprompts for improved alignment of images and text, by finely adjusting text\nprototypes to facilitate semantic matching. Nevertheless, given the modality\ngap between text and vision spaces, the text prototypes employed by these\nmethods have not effectively established a close correspondence with\npixel-level vision features. In this work, our theoretical analysis indicates\nthat the inherent modality gap results in misalignment of text and region\nfeatures, and that this gap cannot be sufficiently reduced by minimizing\ncontrast loss in CLIP. To mitigate the impact of the modality gap, we propose a\nVision Prototype Learning (VPL) framework, by introducing more representative\nvision prototypes. The core of this framework is to learn class-specific vision\nprototypes in vision space with the help of text prototypes, for capturing\nhigh-quality localization maps. 
Moreover, we propose a regional semantic\ncontrast module that contrasts regions embedding with corresponding prototypes,\nleading to more comprehensive and robust feature learning. Experimental results\nshow that our proposed framework achieves state-of-the-art performance on two\nbenchmark datasets.\n","authors":["Zhongxing Xu","Feilong Tang","Zhe Chen","Yingxue Su","Zhiyi Zhao","Ge Zhang","Jionglong Su","Zongyuan Ge"],"pdf_url":"https://arxiv.org/pdf/2412.19650v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.13714v5","updated":"2024-12-27T13:42:07Z","published":"2024-10-17T16:14:49Z","title":"Generation through the lens of learning theory","summary":" We study generation through the lens of statistical learning theory. First,\nwe abstract and formalize the results of Gold [1967], Angluin [1979], Angluin\n[1980] and Kleinberg and Mullainathan [2024] in terms of a binary hypothesis\nclass defined over an abstract example space. Then, we extend the notion of\n\"generation\" from Kleinberg and Mullainathan [2024] to two new settings, we\ncall \"uniform\" and \"non-uniform\" generation, and provide a characterization of\nwhich hypothesis classes are uniformly and non-uniformly generatable. As is\nstandard in learning theory, our characterizations are in terms of the\nfiniteness of a new combinatorial dimension termed the Closure dimension. By\ndoing so, we are able to compare generatability with predictability (captured\nvia PAC and online learnability) and show that these two properties of\nhypothesis classes are incompatible -- there are classes that are generatable\nbut not predictable and vice versa. Finally, we extend our results to capture\nprompted generation and give a complete characterization of which classes are\nprompt generatable, generalizing some of the work by Kleinberg and Mullainathan\n[2024].\n","authors":["Jiaxun Li","Vinod Raman","Ambuj Tewari"],"pdf_url":"https://arxiv.org/pdf/2410.13714v5.pdf","comment":"35 pages, 2 figures. 
Reorganization and content addition"},{"id":"http://arxiv.org/abs/2409.02572v4","updated":"2024-12-27T13:29:14Z","published":"2024-09-04T09:46:33Z","title":"GenDFIR: Advancing Cyber Incident Timeline Analysis Through Retrieval\n Augmented Generation and Large Language Models","summary":" Cyber timeline analysis, or forensic timeline analysis, is crucial in Digital\nForensics and Incident Response (DFIR). It examines artefacts and events\nparticularly timestamps and metadata to detect anomalies, establish\ncorrelations, and reconstruct incident timelines. Traditional methods rely on\nstructured artefacts, such as logs and filesystem metadata, using specialised\ntools for evidence identification and feature extraction. This paper introduces\nGenDFIR, a framework leveraging large language models (LLMs), specifically\nLlama 3.1 8B in zero shot mode, integrated with a Retrieval-Augmented\nGeneration (RAG) agent. Incident data is preprocessed into a structured\nknowledge base, enabling the RAG agent to retrieve relevant events based on\nuser prompts. The LLM interprets this context, offering semantic enrichment.\nTested on synthetic data in a controlled environment, results demonstrate\nGenDFIR's reliability and robustness, showcasing LLMs potential to automate\ntimeline analysis and advance threat detection.\n","authors":["Fatma Yasmine Loumachi","Mohamed Chahine Ghanem","Mohamed Amine Ferrag"],"pdf_url":"https://arxiv.org/pdf/2409.02572v4.pdf","comment":"24 pages V5.3"},{"id":"http://arxiv.org/abs/2404.00639v2","updated":"2024-12-27T13:26:58Z","published":"2024-03-31T10:43:33Z","title":"RL-MUL 2.0: Multiplier Design Optimization with Parallel Deep\n Reinforcement Learning and Space Reduction","summary":" Multiplication is a fundamental operation in many applications, and\nmultipliers are widely adopted in various circuits. However, optimizing\nmultipliers is challenging due to the extensive design space. 
In this paper, we\npropose a multiplier design optimization framework based on reinforcement\nlearning. We utilize matrix and tensor representations for the compressor tree\nof a multiplier, enabling seamless integration of convolutional neural networks\nas the agent network. The agent optimizes the multiplier structure using a\nPareto-driven reward customized to balance area and delay. Furthermore, we\nenhance the original framework with parallel reinforcement learning and design\nspace pruning techniques and extend its capability to optimize fused\nmultiply-accumulate (MAC) designs. Experiments conducted on different bit\nwidths of multipliers demonstrate that multipliers produced by our approach\noutperform all baseline designs in terms of area, power, and delay. The\nperformance gain is further validated by comparing the area, power, and delay\nof processing element arrays using multipliers from our approach and baseline\napproaches.\n","authors":["Dongsheng Zuo","Jiadong Zhu","Yikang Ouyang","Yuzhe Ma"],"pdf_url":"https://arxiv.org/pdf/2404.00639v2.pdf","comment":"Accepted by TODAES 2025"},{"id":"http://arxiv.org/abs/2412.19634v1","updated":"2024-12-27T13:23:58Z","published":"2024-12-27T13:23:58Z","title":"Deep Linear Hawkes Processes","summary":" Marked temporal point processes (MTPPs) are used to model sequences of\ndifferent types of events with irregular arrival times, with broad applications\nranging from healthcare and social networks to finance. We address shortcomings\nin existing point process models by drawing connections between modern deep\nstate-space models (SSMs) and linear Hawkes processes (LHPs), culminating in an\nMTPP that we call the deep linear Hawkes process (DLHP). The DLHP modifies the\nlinear differential equations in deep SSMs to be stochastic jump differential\nequations, akin to LHPs. After discretizing, the resulting recurrence can be\nimplemented efficiently using a parallel scan. 
This brings parallelism and\nlinear scaling to MTPP models. This contrasts with attention-based MTPPs, which\nscale quadratically, and RNN-based MTPPs, which do not parallelize across the\nsequence length. We show empirically that DLHPs match or outperform existing\nmodels across a broad range of metrics on eight real-world datasets. Our\nproposed DLHP model is the first instance of the unique architectural\ncapabilities of SSMs being leveraged to construct a new class of MTPP models.\n","authors":["Yuxin Chang","Alex Boyd","Cao Xiao","Taha Kass-Hout","Parminder Bhatia","Padhraic Smyth","Andrew Warrington"],"pdf_url":"https://arxiv.org/pdf/2412.19634v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.16954v3","updated":"2024-12-27T13:23:03Z","published":"2024-05-27T08:46:28Z","title":"Convergence of SGD with momentum in the nonconvex case: A time\n window-based analysis","summary":" The stochastic gradient descent method with momentum (SGDM) is a common\napproach for solving large-scale and stochastic optimization problems. Despite\nits popularity, the convergence behavior of SGDM remains less understood in\nnonconvex scenarios. This is primarily due to the absence of a sufficient\ndescent property and challenges in simultaneously controlling the momentum and\nstochastic errors in an almost sure sense. To address these challenges, we\ninvestigate the behavior of SGDM over specific time windows, rather than\nexamining the descent of consecutive iterates as in traditional studies. This\ntime window-based approach simplifies the convergence analysis and enables us\nto establish the iterate convergence result for SGDM under the {\\L}ojasiewicz\nproperty. 
We further provide local convergence rates which depend on the\nunderlying {\\L}ojasiewicz exponent and the utilized step size schemes.\n","authors":["Junwen Qiu","Bohao Ma","Andre Milzarek"],"pdf_url":"https://arxiv.org/pdf/2405.16954v3.pdf","comment":"23 pages"},{"id":"http://arxiv.org/abs/2403.07945v3","updated":"2024-12-27T13:08:14Z","published":"2024-03-11T03:44:18Z","title":"A Mathematical Framework for the Problem of Security for Cognition in\n Neurotechnology","summary":" The rapid advancement in neurotechnology in recent years has created an\nemerging critical intersection between neurotechnology and security.\nImplantable devices, non-invasive monitoring, and non-invasive therapies all\ncarry with them the prospect of violating the privacy and autonomy of\nindividuals' cognition. A growing number of scientists and physicians have made\ncalls to address this issue, but applied efforts have been relatively limited.\nA major barrier hampering scientific and engineering efforts to address these\nsecurity issues is the lack of a clear means of describing and analyzing\nrelevant problems. In this paper we develop Cognitive Neurosecurity, a\nmathematical framework which enables such description and analysis by drawing\non methods and results from multiple fields. 
We demonstrate certain statistical\nproperties which have significant implications for Cognitive Neurosecurity, and\nthen present descriptions of the algorithmic problems faced by attackers\nattempting to violate privacy and autonomy, and defenders attempting to\nobstruct such attempts.\n","authors":["Bryce Allen Bagley","Claudia K Petritsch"],"pdf_url":"https://arxiv.org/pdf/2403.07945v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.00107v5","updated":"2024-12-27T12:28:34Z","published":"2023-05-31T18:27:43Z","title":"MERT: Acoustic Music Understanding Model with Large-Scale\n Self-supervised Training","summary":" Self-supervised learning (SSL) has recently emerged as a promising paradigm\nfor training generalisable models on large-scale data in the fields of vision,\ntext, and speech. Although SSL has been proven effective in speech and audio,\nits application to music audio has yet to be thoroughly explored. This is\npartially due to the distinctive challenges associated with modelling musical\nknowledge, particularly tonal and pitched characteristics of music. To address\nthis research gap, we propose an acoustic Music undERstanding model with\nlarge-scale self-supervised Training (MERT), which incorporates teacher models\nto provide pseudo labels in the masked language modelling (MLM) style acoustic\npre-training. In our exploration, we identified an effective combination of\nteacher models, which outperforms conventional speech and audio approaches in\nterms of performance. This combination includes an acoustic teacher based on\nResidual Vector Quantisation - Variational AutoEncoder (RVQ-VAE) and a musical\nteacher based on the Constant-Q Transform (CQT). Furthermore, we explore a wide\nrange of settings to overcome the instability in acoustic language model\npre-training, which allows our designed paradigm to scale from 95M to 330M\nparameters. 
Experimental results indicate that our model can generalise and\nperform well on 14 music understanding tasks and attain state-of-the-art (SOTA)\noverall scores.\n","authors":["Yizhi Li","Ruibin Yuan","Ge Zhang","Yinghao Ma","Xingran Chen","Hanzhi Yin","Chenghao Xiao","Chenghua Lin","Anton Ragni","Emmanouil Benetos","Norbert Gyenge","Roger Dannenberg","Ruibo Liu","Wenhu Chen","Gus Xia","Yemin Shi","Wenhao Huang","Zili Wang","Yike Guo","Jie Fu"],"pdf_url":"https://arxiv.org/pdf/2306.00107v5.pdf","comment":"accepted by ICLR 2024"},{"id":"http://arxiv.org/abs/2412.19616v1","updated":"2024-12-27T12:23:39Z","published":"2024-12-27T12:23:39Z","title":"Gradient Weight-normalized Low-rank Projection for Efficient LLM\n Training","summary":" Large Language Models (LLMs) have shown remarkable performance across various\ntasks, but the escalating demands on computational resources pose significant\nchallenges, particularly in the extensive utilization of full fine-tuning for\ndownstream tasks. To address this, parameter-efficient fine-tuning (PEFT)\nmethods have been developed, but they often underperform compared to full\nfine-tuning and struggle with memory efficiency. In this work, we introduce\nGradient Weight-Normalized Low-Rank Projection (GradNormLoRP), a novel approach\nthat enhances both parameter and memory efficiency while maintaining comparable\nperformance to full fine-tuning. GradNormLoRP normalizes the weight matrix to\nimprove gradient conditioning, facilitating better convergence during\noptimization. Additionally, it applies low-rank approximations to the weight\nand gradient matrices, significantly reducing memory usage during training.\nExtensive experiments demonstrate that our 8-bit GradNormLoRP reduces optimizer\nmemory usage by up to 89.5% and enables the pre-training of large LLMs, such as\nLLaMA 7B, on consumer-level GPUs like the NVIDIA RTX 4090, without additional\ninference costs. 
Moreover, GradNormLoRP outperforms existing low-rank methods\nin fine-tuning tasks. For instance, when fine-tuning the RoBERTa model on all\nGLUE tasks with a rank of 8, GradNormLoRP achieves an average score of 80.65,\nsurpassing LoRA's score of 79.23. These results underscore GradNormLoRP as a\npromising alternative for efficient LLM pre-training and fine-tuning. Source\ncode and Appendix:\nhttps://github.com/Jhhuangkay/Gradient-Weight-normalized-Low-rank-Projection-for-Efficient-LLM-Training\n","authors":["Jia-Hong Huang","Yixian Shen","Hongyi Zhu","Stevan Rudinac","Evangelos Kanoulas"],"pdf_url":"https://arxiv.org/pdf/2412.19616v1.pdf","comment":"Accepted by the 39th AAAI Conference on Artificial Intelligence\n (AAAI-25) [Main Technical Track]"},{"id":"http://arxiv.org/abs/2412.05545v2","updated":"2024-12-27T11:57:40Z","published":"2024-12-07T05:47:28Z","title":"Convergence analysis of wide shallow neural operators within the\n framework of Neural Tangent Kernel","summary":" Neural operators are aiming at approximating operators mapping between Banach\nspaces of functions, achieving much success in the field of scientific\ncomputing. Compared to certain deep learning-based solvers, such as\nPhysics-Informed Neural Networks (PINNs), Deep Ritz Method (DRM), neural\noperators can solve a class of Partial Differential Equations (PDEs). Although\nmuch work has been done to analyze the approximation and generalization error\nof neural operators, there is still a lack of analysis on their training error.\nIn this work, we conduct the convergence analysis of gradient descent for the\nwide shallow neural operators within the framework of Neural Tangent Kernel\n(NTK). The core idea lies on the fact that over-parameterization and random\ninitialization together ensure that each weight vector remains near its\ninitialization throughout all iterations, yielding the linear convergence of\ngradient descent. 
In this work, we demonstrate that under the setting of\nover-parametrization, gradient descent can find the global minimum regardless\nof whether it is in continuous time or discrete time. Finally, we briefly\ndiscuss the case of physics-informed shallow neural operators.\n","authors":["Xianliang Xu","Ye Li","Zhongyi Huang"],"pdf_url":"https://arxiv.org/pdf/2412.05545v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17438v2","updated":"2024-12-27T11:49:53Z","published":"2024-12-23T09:59:49Z","title":"Markov Process-Based Graph Convolutional Networks for Entity\n Classification in Knowledge Graphs","summary":" Despite the vast amount of information encoded in Knowledge Graphs (KGs),\ninformation about the class affiliation of entities remains often incomplete.\nGraph Convolutional Networks (GCNs) have been shown to be effective predictors\nof complete information about the class affiliation of entities in KGs.\nHowever, these models do not learn the class affiliation of entities in KGs\nincorporating the complexity of the task, which negatively affects the models\nprediction capabilities. To address this problem, we introduce a Markov\nprocess-based architecture into well-known GCN architectures. This end-to-end\nnetwork learns the prediction of class affiliation of entities in KGs within a\nMarkov process. The number of computational steps is learned during training\nusing a geometric distribution. At the same time, the loss function combines\ninsights from the field of evidential learning. The experiments show a\nperformance improvement over existing models in several studied architectures\nand datasets. 
Based on the chosen hyperparameters for the geometric\ndistribution, the expected number of computation steps can be adjusted to\nimprove efficiency and accuracy during training.\n","authors":["Johannes Mäkelburg","Yiwen Peng","Mehwish Alam","Tobias Weller","Maribel Acosta"],"pdf_url":"https://arxiv.org/pdf/2412.17438v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.09196v2","updated":"2024-12-27T11:41:38Z","published":"2024-09-13T21:12:18Z","title":"Are Sparse Neural Networks Better Hard Sample Learners?","summary":" While deep learning has demonstrated impressive progress, it remains a\ndaunting challenge to learn from hard samples as these samples are usually\nnoisy and intricate. These hard samples play a crucial role in the optimal\nperformance of deep neural networks. Most research on Sparse Neural Networks\n(SNNs) has focused on standard training data, leaving gaps in understanding\ntheir effectiveness on complex and challenging data. This paper's extensive\ninvestigation across scenarios reveals that most SNNs trained on challenging\nsamples can often match or surpass dense models in accuracy at certain sparsity\nlevels, especially with limited data. We observe that layer-wise density ratios\ntend to play an important role in SNN performance, particularly for methods\nthat train from scratch without pre-trained initialization. These insights\nenhance our understanding of SNNs' behavior and potential for efficient\nlearning approaches in data-centric AI. 
Our code is publicly available at:\n\\url{https://github.com/QiaoXiao7282/hard_sample_learners}.\n","authors":["Qiao Xiao","Boqian Wu","Lu Yin","Christopher Neil Gadzinski","Tianjin Huang","Mykola Pechenizkiy","Decebal Constantin Mocanu"],"pdf_url":"https://arxiv.org/pdf/2409.09196v2.pdf","comment":"Accepted at British Machine Vision Conference (BMVC 2024)"},{"id":"http://arxiv.org/abs/2412.19589v1","updated":"2024-12-27T11:19:10Z","published":"2024-12-27T11:19:10Z","title":"ViDTA: Enhanced Drug-Target Affinity Prediction via Virtual Graph Nodes\n and Attention-based Feature Fusion","summary":" Drug-target interaction is fundamental in understanding how drugs affect\nbiological systems, and accurately predicting drug-target affinity (DTA) is\nvital for drug discovery. Recently, deep learning methods have emerged as a\nsignificant approach for estimating the binding strength between drugs and\ntarget proteins. However, existing methods simply utilize the drug's local\ninformation from molecular topology rather than global information.\nAdditionally, the features of drugs and proteins are usually fused with a\nsimple concatenation operation, limiting their effectiveness. To address these\nchallenges, we proposed ViDTA, an enhanced DTA prediction framework. We\nintroduce virtual nodes into the Graph Neural Network (GNN)-based drug feature\nextraction network, which acts as a global memory to exchange messages more\nefficiently. By incorporating virtual graph nodes, we seamlessly integrate\nlocal and global features of drug molecular structures, expanding the GNN's\nreceptive field. Additionally, we propose an attention-based linear feature\nfusion network for better capturing the interaction information between drugs\nand proteins. 
Experimental results evaluated on various benchmarks including\nDavis, Metz, and KIBA demonstrate that our proposed ViDTA outperforms the\nstate-of-the-art baselines.\n","authors":["Minghui Li","Zikang Guo","Yang Wu","Peijin Guo","Yao Shi","Shengshan Hu","Wei Wan","Shengqing Hu"],"pdf_url":"https://arxiv.org/pdf/2412.19589v1.pdf","comment":"Accepted by International Conference on Bioinformatics and\n Biomedicine (BIBM 24)"},{"id":"http://arxiv.org/abs/2412.19587v1","updated":"2024-12-27T11:14:11Z","published":"2024-12-27T11:14:11Z","title":"Goal-oriented Communications based on Recursive Early Exit Neural\n Networks","summary":" This paper presents a novel framework for goal-oriented semantic\ncommunications leveraging recursive early exit models. The proposed approach is\nbuilt on two key components. First, we introduce an innovative early exit\nstrategy that dynamically partitions computations, enabling samples to be\noffloaded to a server based on layer-wise recursive prediction dynamics that\ndetect samples for which the confidence is not increasing fast enough over\nlayers. Second, we develop a Reinforcement Learning-based online optimization\nframework that jointly determines early exit points, computation splitting, and\noffloading strategies, while accounting for wireless conditions, inference\naccuracy, and resource costs. 
Numerical evaluations in an edge inference\nscenario demonstrate the method's adaptability and effectiveness in striking an\nexcellent trade-off between performance, latency, and resource efficiency.\n","authors":["Jary Pomponi","Mattia Merluzzi","Alessio Devoto","Mateus Pontes Mota","Paolo Di Lorenzo","Simone Scardapane"],"pdf_url":"https://arxiv.org/pdf/2412.19587v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19585v1","updated":"2024-12-27T11:03:26Z","published":"2024-12-27T11:03:26Z","title":"Ultralight Signal Classification Model for Automatic Modulation\n Recognition","summary":" The growing complexity of radar signals demands responsive and accurate\ndetection systems that can operate efficiently on resource-constrained edge\ndevices. Existing models, while effective, often rely on substantial\ncomputational resources and large datasets, making them impractical for edge\ndeployment. In this work, we propose an ultralight hybrid neural network\noptimized for edge applications, delivering robust performance across\nunfavorable signal-to-noise ratios (mean accuracy of 96.3% at 0 dB) using less\nthan 100 samples per class, and significantly reducing computational overhead.\n","authors":["Alessandro Daniele Genuardi Oquendo","Agustín Matías Galante Cerviño","Nilotpal Sinha","Luc Andrea","Sam Mugel","Román Orús"],"pdf_url":"https://arxiv.org/pdf/2412.19585v1.pdf","comment":"8 pages, 8 figures"},{"id":"http://arxiv.org/abs/2412.19583v1","updated":"2024-12-27T10:58:55Z","published":"2024-12-27T10:58:55Z","title":"A Comparative Study of Machine Unlearning Techniques for Image and Text\n Classification Models","summary":" Machine Unlearning has emerged as a critical area in artificial intelligence,\naddressing the need to selectively remove learned data from machine learning\nmodels in response to data privacy regulations. 
This paper provides a\ncomprehensive comparative analysis of six state-of-the-art unlearning techniques\napplied to image and text classification tasks. We evaluate their performance,\nefficiency, and compliance with regulatory requirements, highlighting their\nstrengths and limitations in practical scenarios. By systematically analyzing\nthese methods, we aim to provide insights into their applicability,\nchallenges, and tradeoffs, fostering advancements in the field of ethical and\nadaptable machine learning.\n","authors":["Omar M. Safa","Mahmoud M. Abdelaziz","Mustafa Eltawy","Mohamed Mamdouh","Moamen Gharib","Salaheldin Eltenihy","Nagia M. Ghanem","Mohamed M. Ismail"],"pdf_url":"https://arxiv.org/pdf/2412.19583v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19578v1","updated":"2024-12-27T10:50:43Z","published":"2024-12-27T10:50:43Z","title":"Graph-attention-based Casual Discovery with Trust Region-navigated\n Clipping Policy Optimization","summary":" In many domains of empirical sciences, discovering the causal structure\nwithin variables remains an indispensable task. Recently, to tackle\nunoriented edges or latent assumptions violation suffered by conventional\nmethods, researchers formulated a reinforcement learning (RL) procedure for\ncausal discovery, and equipped the REINFORCE algorithm to search for the\nbest-rewarded directed acyclic graph. The two keys to the overall performance\nof the procedure are the robustness of RL methods and the efficient encoding of\nvariables. However, on the one hand, REINFORCE is prone to local convergence\nand unstable performance during training. Neither trust region policy\noptimization, being computationally expensive, nor proximal policy optimization\n(PPO), suffering from aggregate constraint deviation, is a decent alternative for\ncombinatory optimization problems with considerable individual subactions. 
We\npropose a trust region-navigated clipping policy optimization method for causal\ndiscovery that guarantees both better search efficiency and steadiness in\npolicy optimization, in comparison with REINFORCE, PPO and our prioritized\nsampling-guided REINFORCE implementation. On the other hand, to boost the\nefficient encoding of variables, we propose a refined graph attention encoder\ncalled SDGAT that can grasp more feature information without a priori\nneighbourhood information. With these improvements, the proposed method\noutperforms the former RL method in both synthetic and benchmark datasets in terms\nof output results and optimization robustness.\n","authors":["Shixuan Liu","Yanghe Feng","Keyu Wu","Guangquan Cheng","Jincai Huang","Zhong Liu"],"pdf_url":"https://arxiv.org/pdf/2412.19578v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.12570v3","updated":"2024-12-27T09:56:51Z","published":"2024-11-19T15:39:25Z","title":"A data driven approach to classify descriptors based on their efficiency\n in translating noisy trajectories into physically-relevant information","summary":" Reconstructing the physical complexity of many-body dynamical systems can be\nchallenging. Starting from the trajectories of their constitutive units (raw\ndata), typical approaches require selecting appropriate descriptors to convert\nthem into time-series, which are then analyzed to extract interpretable\ninformation. However, identifying the most effective descriptor is often\nnon-trivial. Here, we report a data-driven approach to compare the efficiency\nof various descriptors in extracting information from noisy trajectories and\ntranslating it into physically relevant insights. As a prototypical system with\nnon-trivial internal complexity, we analyze molecular dynamics trajectories of\nan atomistic system where ice and water coexist in equilibrium near the\nsolid/liquid transition temperature. 
We compare general and specific\ndescriptors often used in aqueous systems: number of neighbors, molecular\nvelocities, Smooth Overlap of Atomic Positions (SOAP), Local Environments and\nNeighbors Shuffling (LENS), Orientational Tetrahedral Order, and distance from\nthe fifth neighbor ($d_5$). Using Onion Clustering -- an efficient unsupervised\nmethod for single-point time-series analysis -- we assess the maximum\nextractable information for each descriptor and rank them via a\nhigh-dimensional metric. Our results show that advanced descriptors like SOAP\nand LENS outperform classical ones due to higher signal-to-noise ratios.\nNonetheless, even simple descriptors can rival or exceed advanced ones after\nlocal signal denoising. For example, $d_5$, initially among the weakest,\nbecomes the most effective at resolving the system's non-local dynamical\ncomplexity after denoising. This work highlights the critical role of noise in\ninformation extraction from molecular trajectories and offers a data-driven\napproach to identify optimal descriptors for systems with characteristic\ninternal complexity.\n","authors":["Simone Martino","Domiziano Doria","Chiara Lionello","Matteo Becchi","Giovanni M. Pavan"],"pdf_url":"https://arxiv.org/pdf/2411.12570v3.pdf","comment":"19 pages, 5 figures + 3 in supporting information (at the bottom of\n the manuscript)"},{"id":"http://arxiv.org/abs/2409.09099v3","updated":"2024-12-27T09:30:18Z","published":"2024-09-13T08:29:36Z","title":"S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training","summary":" Training deep neural networks (DNNs) is costly. Fortunately, Nvidia Ampere\nand Hopper GPUs can accelerate matrix multiplications twice as fast as a dense\nequivalent by implementing 2:4 sparsity. However, previous STE-based 2:4\npre-training methods (e.g. STE with hard-thresholding, SR-STE) suffer from\noptimization difficulties because of discontinuous pruning function. 
In this\nstudy, we comprehensively analyse the bottleneck of traditional N:M sparse\ntraining and recognize three drawbacks with discontinuity: incorrect descending\ndirection, inability to predict the amount of descent and sparse mask\noscillation. In light of this, we propose S-STE, a simple yet powerful 2:4\ntraining method that contains two parts: to continuously project weights to be\n2:4 sparse, and to rescale sparse weights with a per-tensor fixed scaling\nfactor. Besides, we adopt minimum-variance unbiased estimation for activation\ngradient and FP8 quantization for whole process. Results show that our method\nsurpasses previous 2:4 pre-training recipes and is comparable even with full\nparameter models. Our toolkit is available at\nhttps://github.com/huyz2023/2by4-pretrain.\n","authors":["Yuezhou Hu","Jun Zhu","Jianfei Chen"],"pdf_url":"https://arxiv.org/pdf/2409.09099v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19542v1","updated":"2024-12-27T09:08:46Z","published":"2024-12-27T09:08:46Z","title":"Interacted Object Grounding in Spatio-Temporal Human-Object Interactions","summary":" Spatio-temporal Human-Object Interaction (ST-HOI) understanding aims at\ndetecting HOIs from videos, which is crucial for activity understanding.\nHowever, existing whole-body-object interaction video benchmarks overlook the\ntruth that open-world objects are diverse, that is, they usually provide\nlimited and predefined object classes. Therefore, we introduce a new open-world\nbenchmark: Grounding Interacted Objects (GIO) including 1,098 interacted\nobjects class and 290K interacted object boxes annotation. Accordingly, an\nobject grounding task is proposed expecting vision systems to discover\ninteracted objects. Even though today's detectors and grounding methods have\nsucceeded greatly, they perform unsatisfactorily in localizing diverse and rare\nobjects in GIO. This profoundly reveals the limitations of current vision\nsystems and poses a great challenge. 
Thus, we explore leveraging\nspatio-temporal cues to address object grounding and propose a 4D\nquestion-answering framework (4D-QA) to discover interacted objects from\ndiverse videos. Our method demonstrates significant superiority in extensive\nexperiments compared to current baselines. Data and code will be publicly\navailable at https://github.com/DirtyHarryLYL/HAKE-AVA.\n","authors":["Xiaoyang Liu","Boran Wen","Xinpeng Liu","Zizheng Zhou","Hongwei Fan","Cewu Lu","Lizhuang Ma","Yulong Chen","Yong-Lu Li"],"pdf_url":"https://arxiv.org/pdf/2412.19542v1.pdf","comment":"To be published in the Proceedings of AAAI 2025. The first three\n authors contributed equally. Project:\n https://github.com/DirtyHarryLYL/HAKE-AVA"},{"id":"http://arxiv.org/abs/2412.19530v1","updated":"2024-12-27T08:50:54Z","published":"2024-12-27T08:50:54Z","title":"The Value of AI Advice: Personalized and Value-Maximizing AI Advisors\n Are Necessary to Reliably Benefit Experts and Organizations","summary":" Despite advances in AI's performance and interpretability, AI advisors can\nundermine experts' decisions and increase the time and effort experts must\ninvest to make decisions. Consequently, AI systems deployed in high-stakes\nsettings often fail to consistently add value across contexts and can even\ndiminish the value that experts alone provide. Beyond harm in specific domains,\nsuch outcomes impede progress in research and practice, underscoring the need\nto understand when and why different AI advisors add or diminish value. To\nbridge this gap, we stress the importance of assessing the value AI advice\nbrings to real-world contexts when designing and evaluating AI advisors.\nBuilding on this perspective, we characterize key pillars -- pathways through\nwhich AI advice impacts value -- and develop a framework that incorporates\nthese pillars to create reliable, personalized, and value-adding advisors. 
Our\nresults highlight the need for system-level, value-driven development of AI\nadvisors that advise selectively, are tailored to experts' unique behaviors,\nand are optimized for context-specific trade-offs between decision improvements\nand advising costs. They also reveal how the lack of inclusion of these pillars\nin the design of AI advising systems may be contributing to the failures\nobserved in practical applications.\n","authors":["Nicholas Wolczynski","Maytal Saar-Tsechansky","Tong Wang"],"pdf_url":"https://arxiv.org/pdf/2412.19530v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19529v1","updated":"2024-12-27T08:46:46Z","published":"2024-12-27T08:46:46Z","title":"Nonconvex Stochastic Optimization under Heavy-Tailed Noises: Optimal\n Convergence without Gradient Clipping","summary":" Recently, the study of heavy-tailed noises in first-order nonconvex\nstochastic optimization has gotten a lot of attention since it was recognized\nas a more realistic condition as suggested by many empirical observations.\nSpecifically, the stochastic noise (the difference between the stochastic and\ntrue gradient) is considered only to have a finite $\\mathfrak{p}$-th moment\nwhere $\\mathfrak{p}\\in\\left(1,2\\right]$ instead of assuming it always satisfies\nthe classical finite variance assumption. To deal with this more challenging\nsetting, people have proposed different algorithms and proved them to converge\nat an optimal $\\mathcal{O}(T^{\\frac{1-\\mathfrak{p}}{3\\mathfrak{p}-2}})$ rate\nfor smooth objectives after $T$ iterations. Notably, all these new-designed\nalgorithms are based on the same technique - gradient clipping. Naturally, one\nmay want to know whether the clipping method is a necessary ingredient and the\nonly way to guarantee convergence under heavy-tailed noises. 
In this work, by\nrevisiting the existing Batched Normalized Stochastic Gradient Descent with\nMomentum (Batched NSGDM) algorithm, we provide the first convergence result\nunder heavy-tailed noises but without gradient clipping. Concretely, we prove\nthat Batched NSGDM can achieve the optimal\n$\\mathcal{O}(T^{\\frac{1-\\mathfrak{p}}{3\\mathfrak{p}-2}})$ rate even under the\nrelaxed smooth condition. More interestingly, we also establish the first\n$\\mathcal{O}(T^{\\frac{1-\\mathfrak{p}}{2\\mathfrak{p}}})$ convergence rate in the\ncase where the tail index $\\mathfrak{p}$ is unknown in advance, which is\narguably the common scenario in practice.\n","authors":["Zijian Liu","Zhengyuan Zhou"],"pdf_url":"https://arxiv.org/pdf/2412.19529v1.pdf","comment":"In submission"},{"id":"http://arxiv.org/abs/2411.05861v2","updated":"2024-12-27T08:39:23Z","published":"2024-11-07T08:35:01Z","title":"Rethinking Deep Learning: Non-backpropagation and Non-optimization\n Machine Learning Approach Using Hebbian Neural Networks","summary":" Developing strong AI could provide a powerful tool for addressing social and\nscientific challenges. Neural networks (NNs), inspired by biological systems,\nhave the potential to achieve this. However, weight optimization techniques\nusing error backpropagation are not observed in biological systems, raising\ndoubts about current NNs approaches. In this context, Itoh (2024) solved the\nMNIST classification problem without using objective functions or\nbackpropagation. However, weight updates were not used, so it does not qualify\nas machine learning AI. In this study, I develop a machine learning method that\nmimics biological neural systems by implementing Hebbian learning in NNs\nwithout backpropagation and optimization method to solve the MNIST\nclassification problem and analyze its output. Development proceeded in three\nstages. 
In the first stage, I applied the Hebbian learning rule to the MNIST\ncharacter recognition algorithm by Itoh (2024), resulting in lower accuracy\nthan non-Hebbian NNs, highlighting the limitations of conventional training\nprocedures for Hebbian learning. In the second stage, I examined the properties\nof individually trained NNs using norm-based cognition, showing that NNs\ntrained on a specific label respond powerfully to that label. In the third\nstage, I created an MNIST character recognition program using vector norm\nmagnitude as the criterion, achieving an accuracy of approximately 75%. This\ndemonstrates that the Hebbian learning NNs can recognize handwritten characters\nwithout objective functions, backpropagation, optimization processes, and large\ndata set. Based on these results, developing a mechanism based on norm-based\ncognition as a fundamental unit and then increasing complexity to achieve\nindirect similarity cognition should help mimic biological neural systems and\ncontribute to realizing strong AI.\n","authors":["Kei Itoh"],"pdf_url":"https://arxiv.org/pdf/2411.05861v2.pdf","comment":"13 pages, 4 figures"},{"id":"http://arxiv.org/abs/2409.02426v2","updated":"2024-12-27T08:33:51Z","published":"2024-09-04T04:14:02Z","title":"Diffusion Models Learn Low-Dimensional Distributions via Subspace\n Clustering","summary":" Recent empirical studies have demonstrated that diffusion models can\neffectively learn the image distribution and generate new samples. Remarkably,\nthese models can achieve this even with a small number of training samples\ndespite a large image dimension, circumventing the curse of dimensionality. In\nthis work, we provide theoretical insights into this phenomenon by leveraging\nkey empirical observations: (i) the low intrinsic dimensionality of image data,\n(ii) a union of manifold structure of image data, and (iii) the low-rank\nproperty of the denoising autoencoder in trained diffusion models. 
These\nobservations motivate us to assume the underlying data distribution of image\ndata as a mixture of low-rank Gaussians and to parameterize the denoising\nautoencoder as a low-rank model according to the score function of the assumed\ndistribution. With these setups, we rigorously show that optimizing the\ntraining loss of diffusion models is equivalent to solving the canonical\nsubspace clustering problem over the training samples. Based on this\nequivalence, we further show that the minimal number of samples required to\nlearn the underlying distribution scales linearly with the intrinsic dimensions\nunder the above data and model assumptions. This insight sheds light on why\ndiffusion models can break the curse of dimensionality and exhibit the phase\ntransition in learning distributions. Moreover, we empirically establish a\ncorrespondence between the subspaces and the semantic representations of image\ndata, facilitating image editing. We validate these results with corroborated\nexperimental results on both simulated distributions and image datasets.\n","authors":["Peng Wang","Huijie Zhang","Zekai Zhang","Siyi Chen","Yi Ma","Qing Qu"],"pdf_url":"https://arxiv.org/pdf/2409.02426v2.pdf","comment":"40 pages, 9 figures"},{"id":"http://arxiv.org/abs/2410.11843v2","updated":"2024-12-27T08:32:38Z","published":"2024-09-23T08:39:16Z","title":"From Commands to Prompts: LLM-based Semantic File System for AIOS","summary":" Large language models (LLMs) have demonstrated significant potential in the\ndevelopment of intelligent applications and systems such as LLM-based agents\nand agent operating systems (AIOS). However, when these applications and\nsystems interact with the underlying file system, the file system still remains\nthe traditional paradigm: reliant on manual navigation through precise\ncommands. 
This paradigm poses a bottleneck to the usability of these systems as\nusers are required to navigate complex folder hierarchies and remember cryptic\nfile names. To address this limitation, we propose an LLM-based semantic file\nsystem ( LSFS ) for prompt-driven file management. Unlike conventional\napproaches, LSFS incorporates LLMs to enable users or agents to interact with\nfiles through natural language prompts, facilitating semantic file management.\nAt the macro-level, we develop a comprehensive API set to achieve semantic file\nmanagement functionalities, such as semantic file retrieval, file update\nmonitoring and summarization, and semantic file rollback). At the micro-level,\nwe store files by constructing semantic indexes for them, design and implement\nsyscalls of different semantic operations (e.g., CRUD, group by, join) powered\nby vector database. Our experiments show that LSFS offers significant\nimprovements over traditional file systems in terms of user convenience, the\ndiversity of supported functions, and the accuracy and efficiency of file\noperations. Additionally, with the integration of LLM, our system enables more\nintelligent file management tasks, such as content summarization and version\ncomparison, further enhancing its capabilities.\n","authors":["Zeru Shi","Kai Mei","Yongye Su","Chaoji Zuo","Wenyue Hua","Wujiang Xu","Yujie Ren","Zirui Liu","Mengnan Du","Dong Deng","Yongfeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2410.11843v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19517v1","updated":"2024-12-27T08:19:23Z","published":"2024-12-27T08:19:23Z","title":"Estimation of System Parameters Including Repeated Cross-Sectional Data\n through Emulator-Informed Deep Generative Model","summary":" Differential equations (DEs) are crucial for modeling the evolution of\nnatural or engineered systems. Traditionally, the parameters in DEs are\nadjusted to fit data from system observations. 
However, in fields such as\npolitics, economics, and biology, available data are often independently\ncollected at distinct time points from different subjects (i.e., repeated\ncross-sectional (RCS) data). Conventional optimization techniques struggle to\naccurately estimate DE parameters when RCS data exhibit various\nheterogeneities, leading to a significant loss of information. To address this\nissue, we propose a new estimation method called the emulator-informed\ndeep-generative model (EIDGM), designed to handle RCS data. Specifically, EIDGM\nintegrates a physics-informed neural network-based emulator that immediately\ngenerates DE solutions and a Wasserstein generative adversarial network-based\nparameter generator that can effectively mimic the RCS data. We evaluated EIDGM\non exponential growth, logistic population models, and the Lorenz system,\ndemonstrating its superior ability to accurately capture parameter\ndistributions. Additionally, we applied EIDGM to an experimental dataset of\nAmyloid beta 40 and beta 42, successfully capturing diverse parameter\ndistribution shapes. This shows that EIDGM can be applied to model a wide range\nof systems and extended to uncover the operating principles of systems based on\nlimited data.\n","authors":["Hyunwoo Cho","Sung Woong Cho","Hyeontae Jo","Hyung Ju Hwang"],"pdf_url":"https://arxiv.org/pdf/2412.19517v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19515v1","updated":"2024-12-27T08:14:28Z","published":"2024-12-27T08:14:28Z","title":"Real-time classification of EEG signals using Machine Learning\n deployment","summary":" The prevailing educational methods predominantly rely on traditional\nclassroom instruction or online delivery, often limiting the teachers' ability\nto engage effectively with all the students simultaneously. 
A more intrinsic\nmethod of evaluating student attentiveness during lectures can enable the\neducators to tailor the course materials and their teaching styles in order to\nbetter meet the students' needs. The aim of this paper is to enhance teaching\nquality in real time, thereby fostering a higher student engagement in the\nclassroom activities. By monitoring the students' electroencephalography (EEG)\nsignals and employing machine learning algorithms, this study proposes a\ncomprehensive solution for addressing this challenge. Machine learning has\nemerged as a powerful tool for simplifying the analysis of complex variables,\nenabling the effective assessment of the students' concentration levels based\non specific parameters. However, the real-time impact of machine learning\nmodels necessitates a careful consideration as their deployment is concerned.\nThis study proposes a machine learning-based approach for predicting the level\nof students' comprehension with regard to a certain topic. A browser interface\nwas introduced that accesses the values of the system's parameters to determine\na student's level of concentration on a chosen topic. 
The deployment of the\nproposed system made it necessary to address the real-time challenges faced by\nthe students, consider the system's cost, and establish trust in its efficacy.\nThis paper presents the efforts made for approaching this pertinent issue\nthrough the implementation of innovative technologies and provides a framework\nfor addressing key considerations for future research directions.\n","authors":["Swati Chowdhuri","Satadip Saha","Samadrita Karmakar","Ankur Chanda"],"pdf_url":"https://arxiv.org/pdf/2412.19515v1.pdf","comment":"Published in Romanian Journal of Information Technology and Automatic\n Control"},{"id":"http://arxiv.org/abs/2412.19511v1","updated":"2024-12-27T08:01:42Z","published":"2024-12-27T08:01:42Z","title":"Uncertainty quantification for improving radiomic-based models in\n radiation pneumonitis prediction","summary":" Background and Objective: Radiation pneumonitis (RP) is a side effect of\nthoracic radiation therapy. Recently, Machine learning (ML) models enhanced\nwith radiomic and dosiomic features provide better predictions by incorporating\nspatial information beyond DVHs. However, to improve the clinical decision\nprocess, we propose to use uncertainty quantification (UQ) to improve the\nconfidence in model prediction. This study evaluates the impact of post hoc UQ\nmethods on the discriminative performance and calibration of ML models for RP\nprediction. Methods: This study evaluated four ML models: logistic regression\n(LR), support vector machines (SVM), extreme gradient boosting (XGB), and\nrandom forest (RF), using radiomic, dosiomic, and dosimetric features to\npredict RP. We applied UQ methods, including Patt scaling, isotonic regression,\nVenn-ABERS predictor, and Conformal Prediction, to quantify uncertainty. 
Model\nperformance was assessed through Area Under the Receiver Operating\nCharacteristic curve (AUROC), Area Under the Precision-Recall Curve (AUPRC),\nand Adaptive Calibration Error (ACE) using Leave-One-Out Cross-Validation\n(LOO-CV). Results: UQ methods enhanced predictive performance, particularly for\nhigh-certainty predictions, while also improving calibration. Radiomic and\ndosiomic features increased model accuracy but introduced calibration\nchallenges, especially for non-linear models like XGB and RF. Performance gains\nfrom UQ methods were most noticeable at higher certainty thresholds.\nConclusion: Integrating UQ into ML models with radiomic and dosiomic features\nimproves both predictive accuracy and calibration, supporting more reliable\nclinical decision-making. The findings emphasize the value of UQ methods in\nenhancing applicability of predictive models for RP in healthcare settings.\n","authors":["Chanon Puttanawarut","Romen Samuel Wabina","Nat Sirirutbunkajorn"],"pdf_url":"https://arxiv.org/pdf/2412.19511v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19500v1","updated":"2024-12-27T07:34:54Z","published":"2024-12-27T07:34:54Z","title":"RobotDiffuse: Motion Planning for Redundant Manipulator based on\n Diffusion Model","summary":" Redundant manipulators, with their higher Degrees of Freedom (DOFs), offer\nenhanced kinematic performance and versatility, making them suitable for\napplications like manufacturing, surgical robotics, and human-robot\ncollaboration. However, motion planning for these manipulators is challenging\ndue to increased DOFs and complex, dynamic environments. While traditional\nmotion planning algorithms struggle with high-dimensional spaces, deep\nlearning-based methods often face instability and inefficiency in complex\ntasks. This paper introduces RobotDiffuse, a diffusion model-based approach for\nmotion planning in redundant manipulators. 
By integrating physical constraints\nwith a point cloud encoder and replacing the U-Net structure with an\nencoder-only transformer, RobotDiffuse improves the model's ability to capture\ntemporal dependencies and generate smoother, more coherent motion plans. We\nvalidate the approach using a complex simulator, and release a new dataset with\n35M robot poses and 0.14M obstacle avoidance scenarios. Experimental results\ndemonstrate the effectiveness of RobotDiffuse and the promise of diffusion\nmodels for motion planning tasks. The code can be accessed at\nhttps://github.com/ACRoboT-buaa/RobotDiffuse.\n","authors":["Xiaohan Zhang","Xudong Mou","Rui Wang","Tianyu Wo","Ningbo Gu","Tiejun Wang","Cangbai Xu","Xudong Liu"],"pdf_url":"https://arxiv.org/pdf/2412.19500v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19495v1","updated":"2024-12-27T07:31:14Z","published":"2024-12-27T07:31:14Z","title":"Disparate Model Performance and Stability in Machine Learning Clinical\n Support for Diabetes and Heart Diseases","summary":" Machine Learning (ML) algorithms are vital for supporting clinical\ndecision-making in biomedical informatics. However, their predictive\nperformance can vary across demographic groups, often due to the\nunderrepresentation of historically marginalized populations in training\ndatasets. The investigation reveals widespread sex- and age-related inequities\nin chronic disease datasets and their derived ML models. Thus, a novel\nanalytical framework is introduced, combining systematic arbitrariness with\ntraditional metrics like accuracy and data complexity. The analysis of data\nfrom over 25,000 individuals with chronic diseases revealed mild sex-related\ndisparities, favoring predictive accuracy for males, and significant\nage-related differences, with better accuracy for younger patients. Notably,\nolder patients showed inconsistent predictive accuracy across seven datasets,\nlinked to higher data complexity and lower model performance. 
This highlights\nthat representativeness in training data alone does not guarantee equitable\noutcomes, and model arbitrariness must be addressed before deploying models in\nclinical settings.\n","authors":["Ioannis Bilionis","Ricardo C. Berrios","Luis Fernandez-Luque","Carlos Castillo"],"pdf_url":"https://arxiv.org/pdf/2412.19495v1.pdf","comment":"This paper will be presented in American Medical Informatics\n Association (AMIA) Informatics Summit Conference 2025 (Pittsburgh, PA). 10\n pages, 2 figures, 5 tables"},{"id":"http://arxiv.org/abs/2402.16901v2","updated":"2024-12-27T06:40:39Z","published":"2024-02-24T13:13:17Z","title":"FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics","summary":" Metagenomic data, comprising mixed multi-species genomes, are prevalent in\ndiverse environments like oceans and soils, significantly impacting human\nhealth and ecological functions. However, current research relies on K-mer,\nwhich limits the capture of structurally and functionally relevant gene\ncontexts. Moreover, these approaches struggle with encoding biologically\nmeaningful genes and fail to address the One-to-Many and Many-to-One\nrelationships inherent in metagenomic data. To overcome these challenges, we\nintroduce FGBERT, a novel metagenomic pre-trained model that employs a\nprotein-based gene representation as a context-aware and structure-relevant\ntokenizer. FGBERT incorporates Masked Gene Modeling (MGM) to enhance the\nunderstanding of inter-gene contextual relationships and Triplet Enhanced\nMetagenomic Contrastive Learning (TMC) to elucidate gene sequence-function\nrelationships. Pre-trained on over 100 million metagenomic sequences, FGBERT\ndemonstrates superior performance on metagenomic datasets at four levels,\nspanning gene, functional, bacterial, and environmental levels and ranging from\n1k to 213k input sequences. 
Case studies of ATP Synthase and Gene Operons\nhighlight FGBERT's capability for functional recognition and its biological\nrelevance in metagenomic research.\n","authors":["ChenRui Duan","Zelin Zang","Yongjie Xu","Hang He","Zihan Liu","Siyuan Li","Zijia Song","Ju-Sheng Zheng","Stan Z. Li"],"pdf_url":"https://arxiv.org/pdf/2402.16901v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19471v1","updated":"2024-12-27T05:51:40Z","published":"2024-12-27T05:51:40Z","title":"Meta-Learning-Based Delayless Subband Adaptive Filter using Complex\n Self-Attention for Active Noise Control","summary":" Active noise control typically employs adaptive filtering to generate\nsecondary noise, where the least mean square algorithm is the most widely used.\nHowever, traditional updating rules are linear and exhibit limited\neffectiveness in addressing nonlinear environments and nonstationary noise. To\ntackle this challenge, we reformulate the active noise control problem as a\nmeta-learning problem and propose a meta-learning-based delayless subband\nadaptive filter with deep neural networks. The core idea is to utilize a neural\nnetwork as an adaptive algorithm that can adapt to different environments and\ntypes of noise. The neural network will train under noisy observations,\nimplying that it recognizes the optimized updating rule without true labels. A\nsingle-headed attention recurrent neural network is devised with learnable\nfeature embedding to update the adaptive filter weight efficiently, enabling\naccurate computation of the secondary source to attenuate the unwanted primary\nnoise. In order to relax the time constraint on updating the adaptive filter\nweights, the delayless subband architecture is employed, which will allow the\nsystem to be updated less frequently as the downsampling factor increases. In\naddition, the delayless subband architecture does not introduce additional time\ndelays in active noise control systems. 
A skip updating strategy is introduced\nto decrease the updating frequency further so that machines with limited\nresources have more possibility to board our meta-learning-based model.\nExtensive multi-condition training ensures generalization and robustness\nagainst various types of noise and environments. Simulation results demonstrate\nthat our meta-learning-based model achieves superior noise reduction\nperformance compared to traditional methods.\n","authors":["Pengxing Feng","Hing Cheung So"],"pdf_url":"https://arxiv.org/pdf/2412.19471v1.pdf","comment":"31 pages, 8 figures"},{"id":"http://arxiv.org/abs/2411.02115v2","updated":"2024-12-27T05:48:14Z","published":"2024-11-04T14:29:04Z","title":"FedMoE-DA: Federated Mixture of Experts via Domain Aware Fine-grained\n Aggregation","summary":" Federated learning (FL) is a collaborative machine learning approach that\nenables multiple clients to train models without sharing their private data.\nWith the rise of deep learning, large-scale models have garnered significant\nattention due to their exceptional performance. However, a key challenge in FL\nis the limitation imposed by clients with constrained computational and\ncommunication resources, which hampers the deployment of these large models.\nThe Mixture of Experts (MoE) architecture addresses this challenge with its\nsparse activation property, which reduces computational workload and\ncommunication demands during inference and updates. Additionally, MoE\nfacilitates better personalization by allowing each expert to specialize in\ndifferent subsets of the data distribution. To alleviate the communication\nburdens between the server and clients, we propose FedMoE-DA, a new FL model\ntraining framework that leverages the MoE architecture and incorporates a novel\ndomain-aware, fine-grained aggregation strategy to enhance the robustness,\npersonalizability, and communication efficiency simultaneously. 
Specifically,\nthe correlation between both intra-client expert models and inter-client data\nheterogeneity is exploited. Moreover, we utilize peer-to-peer (P2P)\ncommunication between clients for selective expert model synchronization, thus\nsignificantly reducing the server-client transmissions. Experiments demonstrate\nthat our FedMoE-DA achieves excellent performance while reducing the\ncommunication pressure on the server.\n","authors":["Ziwei Zhan","Wenkuan Zhao","Yuanqing Li","Weijie Liu","Xiaoxi Zhang","Chee Wei Tan","Chuan Wu","Deke Guo","Xu Chen"],"pdf_url":"https://arxiv.org/pdf/2411.02115v2.pdf","comment":"8 pages, 5 figures, accepted by The 20th International Conference on\n Mobility, Sensing and Networking (MSN 2024)"},{"id":"http://arxiv.org/abs/2308.10462v3","updated":"2024-12-27T05:30:00Z","published":"2023-08-21T04:31:06Z","title":"Exploring Parameter-Efficient Fine-Tuning Techniques for Code Generation\n with Large Language Models","summary":" Large language models (LLMs) demonstrate impressive capabilities to generate\naccurate code snippets given natural language intents in a zero-shot manner,\ni.e., without the need for specific fine-tuning. While prior studies have\nhighlighted the advantages of fine-tuning LLMs, this process incurs high\ncomputational costs, making it impractical in resource-scarce environments,\nparticularly for models with billions of parameters. To address these\nchallenges, previous research explored in-context learning (ICL) and\nretrieval-augmented generation (RAG) as strategies to guide the LLM generative\nprocess with task-specific prompt examples. However, ICL and RAG introduce\ninconveniences, such as the need for designing contextually relevant prompts\nand the absence of learning task-specific parameters, thereby limiting\ndownstream task performance. 
In this context, we foresee parameter-efficient\nfine-tuning (PEFT) as a promising approach to efficiently specialize LLMs to\ntask-specific data while maintaining reasonable resource consumption. In this\npaper, we deliver a comprehensive study of PEFT techniques for LLMs in the\ncontext of automated code generation. Our comprehensive investigation of PEFT\ntechniques for LLMs reveals their superiority and potential over ICL and RAG\nacross a diverse set of LLMs and three representative Python code generation\ndatasets: Conala, CodeAlpacaPy, and APPS. Furthermore, our study highlights the\npotential for tuning larger LLMs and significant reductions in memory usage by\ncombining PEFT with quantization. Therefore, this study opens opportunities for\nbroader applications of PEFT in software engineering scenarios. Our code is\navailable at https://github.com/martin-wey/peft-llm-code/.\n","authors":["Martin Weyssow","Xin Zhou","Kisub Kim","David Lo","Houari Sahraoui"],"pdf_url":"https://arxiv.org/pdf/2308.10462v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18202v2","updated":"2024-12-27T05:28:43Z","published":"2024-12-24T06:14:34Z","title":"Developing Cryptocurrency Trading Strategy Based on Autoencoder-CNN-GANs\n Algorithms","summary":" This paper leverages machine learning algorithms to forecast and analyze\nfinancial time series. The process begins with a denoising autoencoder to\nfilter out random noise fluctuations from the main contract price data. Then,\none-dimensional convolution reduces the dimensionality of the filtered data and\nextracts key information. The filtered and dimensionality-reduced price data is\nfed into a GANs network, and its output serve as input of a fully connected\nnetwork. Through cross-validation, a model is trained to capture features that\nprecede large price fluctuations. 
The model predicts the likelihood and\ndirection of significant price changes in real-time price sequences, placing\ntrades at moments of high prediction accuracy. Empirical results demonstrate\nthat using autoencoders and convolution to filter and denoise financial data,\ncombined with GANs, achieves a certain level of predictive performance,\nvalidating the capabilities of machine learning algorithms to discover\nunderlying patterns in financial sequences. Keywords - CNN;GANs;\nCryptocurrency; Prediction.\n","authors":["Zhuohuan Hu","Richard Yu","Zizhou Zhang","Haoran Zheng","Qianying Liu","Yining Zhou"],"pdf_url":"https://arxiv.org/pdf/2412.18202v2.pdf","comment":"The paper was accepted by 2024 4th International Conference on\n Artificial Intelligence, Robotics, and Communication(ICAIRC 2024)"},{"id":"http://arxiv.org/abs/2412.19467v1","updated":"2024-12-27T05:26:12Z","published":"2024-12-27T05:26:12Z","title":"Optimizing Helmet Detection with Hybrid YOLO Pipelines: A Detailed\n Analysis","summary":" Helmet detection is crucial for advancing protection levels in public road\ntraffic dynamics. This problem statement translates to an object detection\ntask. Therefore, this paper compares recent You Only Look Once (YOLO) models in\nthe context of helmet detection in terms of reliability and computational load.\nSpecifically, YOLOv8, YOLOv9, and the newly released YOLOv11 have been used.\nBesides, a modified architectural pipeline that remarkably improves the overall\nperformance has been proposed in this manuscript. This hybridized YOLO model\n(h-YOLO) has been pitted against the independent models for analysis that\nproves h-YOLO is preferable for helmet detection over plain YOLO models. The\nmodels were tested using a range of standard object detection benchmarks such\nas recall, precision, and mAP (Mean Average Precision). 
In addition, training\nand testing times were recorded to provide the overall scope of the models in a\nreal-time detection scenario.\n","authors":["Vaikunth M","Dejey D","Vishaal C","Balamurali S"],"pdf_url":"https://arxiv.org/pdf/2412.19467v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.09032v3","updated":"2024-12-27T05:13:23Z","published":"2024-03-14T01:51:35Z","title":"CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language\n Models to Coding Preferences","summary":" Evaluating the alignment of large language models (LLMs) with user-defined\ncoding preferences is a challenging endeavour that requires a deep assessment\nof LLMs' outputs. Existing methods and benchmarks rely primarily on automated\nmetrics and static analysis tools, which often fail to capture the nuances of\nuser instructions and LLM outputs. To address this gap, we propose using the\nLLM-as-a-Judge methodology to evaluate the alignment of LLMs with coding\npreferences. Based on this approach, we present CodeUltraFeedback, a\ncomprehensive dataset designed to facilitate the evaluation and improvement of\nLLM alignment. CodeUltraFeedback consists of 10,000 coding instructions, each\nannotated with four responses generated from a diverse pool of 14 LLMs. These\nresponses are ranked based on five distinct coding preferences using GPT-3.5 as\na judge, providing both numerical scores and detailed textual feedback. Our\nanalysis of CodeUltraFeedback reveals that responses from GPT-3.5 and GPT-4 are\ngenerally preferred over those from open-weight LLMs, highlighting significant\ndifferences in alignment between closed and open-weight models. 
In turn, we\nexplore the usage of CodeUltraFeedback as feedback data to fine-tune and align\nCodeLlama-7B-Instruct using supervised fine-tuning (SFT) and reinforcement\nlearning from AI feedback (RLAIF) with direct preference optimization (DPO).\nThe resulting aligned CodeLlama-7B-Instruct model outperforms larger LLMs in\nterms of alignment with coding preferences and shows improved functional\ncorrectness on the HumanEval+ benchmark compared to the original instruct\nmodel. Therefore, our contributions bridge the gap in preference tuning of LLMs\nfor code and set the stage for further advancements in model alignment and\nRLAIF in automated software engineering.\n","authors":["Martin Weyssow","Aton Kamanda","Xin Zhou","Houari Sahraoui"],"pdf_url":"https://arxiv.org/pdf/2403.09032v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.11071v2","updated":"2024-12-27T04:54:02Z","published":"2024-07-12T20:34:59Z","title":"MonoSparse-CAM: Efficient Tree Model Processing via Monotonicity and\n Sparsity in CAMs","summary":" While the tree-based machine learning (TBML) models exhibit superior\nperformance compared to neural networks on tabular data and hold promise for\nenergy-efficient acceleration using aCAM arrays, their ideal deployment on\nhardware with explicit exploitation of TBML structure and aCAM circuitry\nremains a challenging task. In this work, we present MonoSparse-CAM, a new\nCAM-based optimization technique that exploits TBML sparsity and monotonicity\nin CAM circuitry to further advance processing performance. 
Our results\nindicate that MonoSparse-CAM reduces energy consumption by up to 28.56x\ncompared to raw processing and by 18.51x compared to state-of-the-art\ntechniques, while improving the efficiency of computation by at least 1.68x.\n","authors":["Tergel Molom-Ochir","Brady Taylor","Hai Li","Yiran Chen"],"pdf_url":"https://arxiv.org/pdf/2407.11071v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.17663v2","updated":"2024-12-27T04:48:39Z","published":"2024-09-26T09:21:48Z","title":"Explanation Bottleneck Models","summary":" Recent concept-based interpretable models have succeeded in providing\nmeaningful explanations by pre-defined concept sets. However, the dependency on\nthe pre-defined concepts restricts the application because of the limited\nnumber of concepts for explanations. This paper proposes a novel interpretable\ndeep neural network called explanation bottleneck models (XBMs). XBMs generate\na text explanation from the input without pre-defined concepts and then predict\na final task prediction based on the generated explanation by leveraging\npre-trained vision-language encoder-decoder models. To achieve both the target\ntask performance and the explanation quality, we train XBMs through the target\ntask loss with the regularization penalizing the explanation decoder via the\ndistillation from the frozen pre-trained decoder. Our experiments, including a\ncomparison to state-of-the-art concept bottleneck models, confirm that XBMs\nprovide accurate and fluent natural language explanations without pre-defined\nconcept sets. 
Code will be available at https://github.com/yshinya6/xbm/.\n","authors":["Shin'ya Yamaguchi","Kosuke Nishida"],"pdf_url":"https://arxiv.org/pdf/2409.17663v2.pdf","comment":"Accepted to AAAI 2025"},{"id":"http://arxiv.org/abs/2412.19444v1","updated":"2024-12-27T04:22:02Z","published":"2024-12-27T04:22:02Z","title":"Towards Simple and Provable Parameter-Free Adaptive Gradient Methods","summary":" Optimization algorithms such as AdaGrad and Adam have significantly advanced\nthe training of deep models by dynamically adjusting the learning rate during\nthe optimization process. However, ad hoc tuning of learning rates poses a\nchallenge, leading to inefficiencies in practice. To address this issue, recent\nresearch has focused on developing \"learning-rate-free\" or \"parameter-free\"\nalgorithms that operate effectively without the need for learning rate tuning.\nDespite these efforts, existing parameter-free variants of AdaGrad and Adam\ntend to be overly complex and/or lack formal convergence guarantees. In this\npaper, we present AdaGrad++ and Adam++, novel and simple parameter-free\nvariants of AdaGrad and Adam with convergence guarantees. We prove that\nAdaGrad++ achieves comparable convergence rates to AdaGrad in convex\noptimization without predefined learning rate assumptions. Similarly, Adam++\nmatches the convergence rate of Adam without relying on any conditions on the\nlearning rates. 
Experimental results across various deep learning tasks\nvalidate the competitive performance of AdaGrad++ and Adam++.\n","authors":["Yuanzhe Tao","Huizhuo Yuan","Xun Zhou","Yuan Cao","Quanquan Gu"],"pdf_url":"https://arxiv.org/pdf/2412.19444v1.pdf","comment":"34 pages, 16 figures, 3 tables"},{"id":"http://arxiv.org/abs/2412.19441v1","updated":"2024-12-27T04:17:34Z","published":"2024-12-27T04:17:34Z","title":"Comparative Performance Analysis of Quantum Machine Learning\n Architectures for Credit Card Fraud Detection","summary":" As financial fraud becomes increasingly complex, effective detection methods\nare essential. Quantum Machine Learning (QML) introduces certain capabilities\nthat may enhance both accuracy and efficiency in this area. This study examines\nhow different quantum feature map and ansatz configurations affect the\nperformance of three QML-based classifiers-the Variational Quantum Classifier\n(VQC), the Sampler Quantum Neural Network (SQNN), and the Estimator Quantum\nNeural Network (EQNN)-when applied to two non-standardized financial fraud\ndatasets. Different quantum feature map and ansatz configurations are\nevaluated, revealing distinct performance patterns. The VQC consistently\ndemonstrates strong classification results, achieving an F1 score of 0.88,\nwhile the SQNN also delivers promising outcomes. In contrast, the EQNN\nstruggles to produce robust results, emphasizing the challenges presented by\nnon-standardized data. These findings highlight the importance of careful model\nconfiguration in QML-based financial fraud detection. 
By showing how specific\nfeature maps and ansatz choices influence predictive success, this work guides\nresearchers and practitioners in refining QML approaches for complex financial\napplications.\n","authors":["Mansour El Alami","Nouhaila Innan","Muhammad Shafique","Mohamed Bennai"],"pdf_url":"https://arxiv.org/pdf/2412.19441v1.pdf","comment":"12 pages, 17 figures, 7 tables, under review"},{"id":"http://arxiv.org/abs/2412.19436v1","updated":"2024-12-27T04:02:46Z","published":"2024-12-27T04:02:46Z","title":"Low-Rank Contextual Reinforcement Learning from Heterogeneous Human\n Feedback","summary":" Reinforcement learning from human feedback (RLHF) has become a cornerstone\nfor aligning large language models with human preferences. However, the\nheterogeneity of human feedback, driven by diverse individual contexts and\npreferences, poses significant challenges for reward learning. To address this,\nwe propose a Low-rank Contextual RLHF (LoCo-RLHF) framework that integrates\ncontextual information to better model heterogeneous feedback while maintaining\ncomputational efficiency. Our approach builds on a contextual preference model,\nleveraging the intrinsic low-rank structure of the interaction between user\ncontexts and query-answer pairs to mitigate the high dimensionality of feature\nrepresentations. Furthermore, we address the challenge of distributional shifts\nin feedback through our Pessimism in Reduced Subspace (PRS) policy, inspired by\npessimistic offline reinforcement learning techniques. We theoretically\ndemonstrate that our policy achieves a tighter sub-optimality gap compared to\nexisting methods. 
Extensive experiments validate the effectiveness of\nLoCo-RLHF, showcasing its superior performance in personalized RLHF settings\nand its robustness to distribution shifts.\n","authors":["Seong Jin Lee","Will Wei Sun","Yufeng Liu"],"pdf_url":"https://arxiv.org/pdf/2412.19436v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.00399v3","updated":"2024-12-27T03:53:21Z","published":"2024-03-30T15:38:54Z","title":"Aurora-M: Open Source Continual Pre-training for Multilingual Language\n and Code","summary":" Pretrained language models are an integral part of AI applications, but their\nhigh computational cost for training limits accessibility. Initiatives such as\nBloom and StarCoder aim to democratize access to pretrained models for\ncollaborative community development. Despite these efforts, such models\nencounter challenges such as limited multilingual capabilities, risks of\ncatastrophic forgetting during continual pretraining, and the high costs of\ntraining models from scratch, alongside the need to align with AI safety\nstandards and regulatory frameworks.\n This paper presents Aurora-M, a 15B parameter multilingual open-source model\ntrained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually\npretrained from StarCoderPlus on 435B additional tokens, Aurora-M surpasses 2T\ntokens in total training token count. It is the first open-source multilingual\nmodel fine-tuned on human-reviewed safety instructions, thus aligning its\ndevelopment not only with conventional red-teaming considerations, but also\nwith the specific concerns articulated in the Biden-Harris Executive Order on\nthe Safe, Secure, and Trustworthy Development and Use of Artificial\nIntelligence.\n We evaluate Aurora-M across a wide range of tasks and languages, showcasing\nits robustness against catastrophic forgetting and its superior performance in\nmultilingual settings, particularly in safety evaluations. 
We open-source\nAurora-M and its variants to encourage responsible open-source development of\nlarge language models at https://huggingface.co/aurora-m.\n","authors":["Taishi Nakamura","Mayank Mishra","Simone Tedeschi","Yekun Chai","Jason T Stillerman","Felix Friedrich","Prateek Yadav","Tanmay Laud","Vu Minh Chien","Terry Yue Zhuo","Diganta Misra","Ben Bogin","Xuan-Son Vu","Marzena Karpinska","Arnav Varma Dantuluri","Wojciech Kusa","Tommaso Furlanello","Rio Yokota","Niklas Muennighoff","Suhas Pai","Tosin Adewumi","Veronika Laippala","Xiaozhe Yao","Adalberto Junior","Alpay Ariyak","Aleksandr Drozd","Jordan Clive","Kshitij Gupta","Liangyu Chen","Qi Sun","Ken Tsui","Noah Persaud","Nour Fahmy","Tianlong Chen","Mohit Bansal","Nicolo Monti","Tai Dang","Ziyang Luo","Tien-Tung Bui","Roberto Navigli","Virendra Mehta","Matthew Blumberg","Victor May","Huu Nguyen","Sampo Pyysalo"],"pdf_url":"https://arxiv.org/pdf/2404.00399v3.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2309.17403v3","updated":"2024-12-27T03:31:31Z","published":"2023-09-29T17:04:06Z","title":"Maximal Volume Matrix Cross Approximation for Image Compression and\n Least Squares Solution","summary":" We study the classic matrix cross approximation based on the maximal volume\nsubmatrices. Our main results consist of an improvement of the classic estimate\nfor matrix cross approximation and a greedy approach for finding the maximal\nvolume submatrices. More precisely, we present a new proof of the classic\nestimate of the inequality with an improved constant. Also, we present a family\nof greedy maximal volume algorithms to improve the computational efficiency of\nmatrix cross approximation. The proposed algorithms are shown to have\ntheoretical guarantees of convergence. 
Finally, we present two applications:\nimage compression and the least squares approximation of continuous functions.\nOur numerical results at the end of the paper demonstrate the effective\nperformance of our approach.\n","authors":["Kenneth Allen","Ming-Jun Lai","Zhaiming Shen"],"pdf_url":"https://arxiv.org/pdf/2309.17403v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19423v1","updated":"2024-12-27T03:17:26Z","published":"2024-12-27T03:17:26Z","title":"Revisiting PCA for time series reduction in temporal dimension","summary":" Revisiting PCA for Time Series Reduction in Temporal Dimension; Jiaxin Gao,\nWenbo Hu, Yuntian Chen; Deep learning has significantly advanced time series\nanalysis (TSA), enabling the extraction of complex patterns for tasks like\nclassification, forecasting, and regression. Although dimensionality reduction\nhas traditionally focused on the variable space-achieving notable success in\nminimizing data redundancy and computational complexity-less attention has been\npaid to reducing the temporal dimension. In this study, we revisit Principal\nComponent Analysis (PCA), a classical dimensionality reduction technique, to\nexplore its utility in temporal dimension reduction for time series data. It is\ngenerally thought that applying PCA to the temporal dimension would disrupt\ntemporal dependencies, leading to limited exploration in this area. However,\nour theoretical analysis and extensive experiments demonstrate that applying\nPCA to sliding series windows not only maintains model performance, but also\nenhances computational efficiency. In auto-regressive forecasting, the temporal\nstructure is partially preserved through windowing, and PCA is applied within\nthese windows to denoise the time series while retaining their statistical\ninformation. By preprocessing time-series data with PCA, we reduce the temporal\ndimensionality before feeding it into TSA models such as Linear, Transformer,\nCNN, and RNN architectures. 
This approach accelerates training and inference\nand reduces resource consumption. Notably, PCA improves Informer training and\ninference speed by up to 40% and decreases GPU memory usage of TimesNet by 30%,\nwithout sacrificing model accuracy. Comparative analysis against other\nreduction methods further highlights the effectiveness of PCA in improving the\nefficiency of TSA models.\n","authors":["Jiaxin Gao","Wenbo Hu","Yuntian Chen"],"pdf_url":"https://arxiv.org/pdf/2412.19423v1.pdf","comment":"13 pages, 5 figures, 7 tables"},{"id":"http://arxiv.org/abs/2412.19422v1","updated":"2024-12-27T03:16:56Z","published":"2024-12-27T03:16:56Z","title":"Gx2Mol: De Novo Generation of Hit-like Molecules from Gene Expression\n Profiles via Deep Learning","summary":" De novo generation of hit-like molecules is a challenging task in the drug\ndiscovery process. Most methods in previous studies learn the semantics and\nsyntax of molecular structures by analyzing molecular graphs or simplified\nmolecular input line entry system (SMILES) strings; however, they do not take\ninto account the drug responses of the biological systems consisting of genes\nand proteins. In this study we propose a deep generative model, Gx2Mol, which\nutilizes gene expression profiles to generate molecular structures with\ndesirable phenotypes for arbitrary target proteins. In the algorithm, a\nvariational autoencoder is employed as a feature extractor to learn the latent\nfeature distribution of the gene expression profiles. Then, a long short-term\nmemory is leveraged as the chemical generator to produce syntactically valid\nSMILES strings that satisfy the feature conditions of the gene expression\nprofile extracted by the feature extractor. 
Experimental results and case\nstudies demonstrate that the proposed Gx2Mol model can produce new molecules\nwith potential bioactivities and drug-like properties.\n","authors":["Chen Li","Yuki Matsukiyo","Yoshihiro Yamanishi"],"pdf_url":"https://arxiv.org/pdf/2412.19422v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19419v1","updated":"2024-12-27T03:13:02Z","published":"2024-12-27T03:13:02Z","title":"Introduction to Graph Neural Networks: A Starting Point for Machine\n Learning Engineers","summary":" Graph neural networks are deep neural networks designed for graphs with\nattributes attached to nodes or edges. The number of research papers in the\nliterature concerning these models is growing rapidly due to their impressive\nperformance on a broad range of tasks. This survey introduces graph neural\nnetworks through the encoder-decoder framework and provides examples of\ndecoders for a range of graph analytic tasks. It uses theory and numerous\nexperiments on homogeneous graphs to illustrate the behavior of graph neural\nnetworks for different training sizes and degrees of graph complexity.\n","authors":["James H. Tanis","Chris Giannella","Adrian V. Mariano"],"pdf_url":"https://arxiv.org/pdf/2412.19419v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.15519v2","updated":"2024-12-27T02:48:04Z","published":"2024-12-20T03:15:02Z","title":"PreNeT: Leveraging Computational Features to Predict Deep Neural Network\n Training Time","summary":" Training deep learning models, particularly Transformer-based architectures\nsuch as Large Language Models (LLMs), demands substantial computational\nresources and extended training periods. While optimal configuration and\ninfrastructure selection can significantly reduce associated costs, this\noptimization requires preliminary analysis tools. 
This paper introduces PreNeT,\na novel predictive framework designed to address this optimization challenge.\nPreNeT facilitates training optimization by integrating comprehensive\ncomputational metrics, including layer-specific parameters, arithmetic\noperations and memory utilization. A key feature of PreNeT is its capacity to\naccurately predict training duration on previously unexamined hardware\ninfrastructures, including novel accelerator architectures. This framework\nemploys a sophisticated approach to capture and analyze the distinct\ncharacteristics of various neural network layers, thereby enhancing existing\nprediction methodologies. Through proactive implementation of PreNeT,\nresearchers and practitioners can determine optimal configurations, parameter\nsettings, and hardware specifications to maximize cost-efficiency and minimize\ntraining duration. Experimental results demonstrate that PreNeT achieves up to\n72% improvement in prediction accuracy compared to contemporary\nstate-of-the-art frameworks.\n","authors":["Alireza Pourali","Arian Boukani","Hamzeh Khazaei"],"pdf_url":"https://arxiv.org/pdf/2412.15519v2.pdf","comment":"11 pages, Conference"},{"id":"http://arxiv.org/abs/2412.19404v1","updated":"2024-12-27T02:05:09Z","published":"2024-12-27T02:05:09Z","title":"Spectral-Temporal Fusion Representation for Person-in-Bed Detection","summary":" This study is based on the ICASSP 2025 Signal Processing Grand Challenge's\nAccelerometer-Based Person-in-Bed Detection Challenge, which aims to determine\nbed occupancy using accelerometer signals. The task is divided into two tracks:\n\"in bed\" and \"not in bed\" segmented detection, and streaming detection, facing\nchallenges such as individual differences, posture variations, and external\ndisturbances. We propose a spectral-temporal fusion-based feature\nrepresentation method with mixup data augmentation, and adopt Intersection over\nUnion (IoU) loss to optimize detection accuracy. 
In the two tracks, our method\nachieved outstanding results of 100.00% and 95.55% in detection scores,\nsecuring first place and third place, respectively.\n","authors":["Xuefeng Yang","Shiheng Zhang","Jian Guan","Feiyang Xiao","Wei Lu","Qiaoxi Zhu"],"pdf_url":"https://arxiv.org/pdf/2412.19404v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19403v1","updated":"2024-12-27T01:53:18Z","published":"2024-12-27T01:53:18Z","title":"Fully Data-driven but Interpretable Human Behavioural Modelling with\n Differentiable Discrete Choice Model","summary":" Discrete choice models are essential for modelling various decision-making\nprocesses in human behaviour. However, the specification of these models has\ndepended heavily on domain knowledge from experts, and the fully automated but\ninterpretable modelling of complex human behaviours has been a long-standing\nchallenge. In this paper, we introduce the differentiable discrete choice model\n(Diff-DCM), a fully data-driven method for the interpretable modelling,\nlearning, prediction, and control of complex human behaviours, which is\nrealised by differentiable programming. Solely from input features and choice\noutcomes without any prior knowledge, Diff-DCM can estimate interpretable\nclosed-form utility functions that reproduce observed behaviours. Comprehensive\nexperiments with both synthetic and real-world data demonstrate that Diff-DCM\ncan be applied to various types of data and requires only a small amount of\ncomputational resources for the estimations, which can be completed within tens\nof seconds on a laptop without any accelerators. In these experiments, we also\ndemonstrate that, using its differentiability, Diff-DCM can provide useful\ninsights into human behaviours, such as an optimal intervention path for\neffective behavioural changes. 
This study provides a strong basis for the fully\nautomated and reliable modelling, prediction, and control of human behaviours.\n","authors":["Fumiyasu Makinoshima","Tatsuya Mitomi","Fumiya Makihara","Eigo Segawa"],"pdf_url":"https://arxiv.org/pdf/2412.19403v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11174v3","updated":"2024-12-27T01:39:43Z","published":"2024-01-20T09:09:52Z","title":"Pixel-Wise Recognition for Holistic Surgical Scene Understanding","summary":" This paper presents the Holistic and Multi-Granular Surgical Scene\nUnderstanding of Prostatectomies (GraSP) dataset, a curated benchmark that\nmodels surgical scene understanding as a hierarchy of complementary tasks with\nvarying levels of granularity. Our approach encompasses long-term tasks, such\nas surgical phase and step recognition, and short-term tasks, including\nsurgical instrument segmentation and atomic visual actions detection. To\nexploit our proposed benchmark, we introduce the Transformers for Actions,\nPhases, Steps, and Instrument Segmentation (TAPIS) model, a general\narchitecture that combines a global video feature extractor with localized\nregion proposals from an instrument segmentation model to tackle the\nmulti-granularity of our benchmark. Through extensive experimentation in ours\nand alternative benchmarks, we demonstrate TAPIS's versatility and\nstate-of-the-art performance across different tasks. This work represents a\nfoundational step forward in Endoscopic Vision, offering a novel framework for\nfuture research towards holistic surgical scene understanding.\n","authors":["Nicolás Ayobi","Santiago Rodríguez","Alejandra Pérez","Isabela Hernández","Nicolás Aparicio","Eugénie Dessevres","Sebastián Peña","Jessica Santander","Juan Ignacio Caicedo","Nicolás Fernández","Pablo Arbeláez"],"pdf_url":"https://arxiv.org/pdf/2401.11174v3.pdf","comment":"Preprint submitted to Medical Image Analysis. 
Official extension of\n previous MICCAI 2022\n (https://link.springer.com/chapter/10.1007/978-3-031-16449-1_42) and ISBI\n 2023 (https://ieeexplore.ieee.org/document/10230819) orals. Data and codes\n are available at https://github.com/BCV-Uniandes/GraSP"},{"id":"http://arxiv.org/abs/2409.01207v2","updated":"2024-12-27T01:23:05Z","published":"2024-09-02T12:35:59Z","title":"Towards General Industrial Intelligence: A Survey of Continual Large\n Models in Industrial IoT","summary":" Industrial AI is transitioning from traditional deep learning models to\nlarge-scale transformer-based architectures, with the Industrial Internet of\nThings (IIoT) playing a pivotal role. IIoT evolves from a simple data pipeline\nto an intelligent infrastructure, enabling and enhancing these advanced AI\nsystems. This survey explores the integration of IIoT with large models (LMs)\nand their potential applications in industrial environments. We focus on four\nprimary types of industrial LMs: language-based, vision-based, time-series, and\nmultimodal models. The lifecycle of LMs is segmented into four critical phases:\ndata foundation, model training, model connectivity, and continuous evolution.\nFirst, we analyze how IIoT provides abundant and diverse data resources,\nsupporting the training and fine-tuning of LMs. Second, we discuss how IIoT\noffers an efficient training infrastructure in low-latency and\nbandwidth-optimized environments. Third, we highlight the deployment advantages\nof LMs within IIoT, emphasizing IIoT's role as a connectivity nexus fostering\nemergent intelligence through modular design, dynamic routing, and model\nmerging to enhance system scalability and adaptability. Finally, we demonstrate\nhow IIoT supports continual learning mechanisms, enabling LMs to adapt to\ndynamic industrial conditions and ensure long-term effectiveness. 
This paper\nunderscores IIoT's critical role in the evolution of industrial intelligence\nwith large models, offering a theoretical framework and actionable insights for\nfuture research.\n","authors":["Jiao Chen","Jiayi He","Fangfang Chen","Zuohong Lv","Jianhua Tang","Weihua Li","Zuozhu Liu","Howard H. Yang","Guangjie Han"],"pdf_url":"https://arxiv.org/pdf/2409.01207v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19396v1","updated":"2024-12-27T01:10:17Z","published":"2024-12-27T01:10:17Z","title":"Comparing Few to Rank Many: Active Human Preference Learning using\n Randomized Frank-Wolfe","summary":" We study learning of human preferences from a limited comparison feedback.\nThis task is ubiquitous in machine learning. Its applications such as\nreinforcement learning from human feedback, have been transformational. We\nformulate this problem as learning a Plackett-Luce model over a universe of $N$\nchoices from $K$-way comparison feedback, where typically $K \\ll N$. Our\nsolution is the D-optimal design for the Plackett-Luce objective. The design\ndefines a data logging policy that elicits comparison feedback for a small\ncollection of optimally chosen points from all ${N \\choose K}$ feasible\nsubsets. The main algorithmic challenge in this work is that even fast methods\nfor solving D-optimal designs would have $O({N \\choose K})$ time complexity. To\naddress this issue, we propose a randomized Frank-Wolfe (FW) algorithm that\nsolves the linear maximization sub-problems in the FW method on randomly chosen\nvariables. 
We analyze the algorithm, and evaluate it empirically on synthetic\nand open-source NLP datasets.\n","authors":["Kiran Koshy Thekumparampil","Gaurush Hiranandani","Kousha Kalantari","Shoham Sabach","Branislav Kveton"],"pdf_url":"https://arxiv.org/pdf/2412.19396v1.pdf","comment":"Submitted to AISTATS 2025 on October 10, 2024"},{"id":"http://arxiv.org/abs/2412.06947v2","updated":"2024-12-27T01:07:02Z","published":"2024-12-09T19:45:54Z","title":"PyraNet: A Large Scale Hierarchical Verilog Dataset","summary":" Recently, there has been a growing interest in leveraging Large Language\nModels for Verilog code generation. However, the current quality of the\ngenerated Verilog code remains suboptimal. This is largely due to the absence\nof well-defined, well-organized datasets with high-quality samples, as well as\na lack of innovative fine-tuning methods and models specifically trained on\nVerilog. In this paper, we introduce a novel open-source dataset and a\ncorresponding fine-tuning technique, which utilizes a multi-layered structure\nthat we refer to as PyraNet. Our experiments demonstrate that employing the\nproposed dataset and fine-tuning approach leads to a more accurate fine-tuned\nmodel, producing syntactically and functionally correct Verilog code. The\nevaluation results show improvements by up-to $32.6\\%$ in comparison to the\nCodeLlama-7B baseline model and up-to $16.7\\%$ in comparison to the\nstate-of-the-art models using VerilogEval evaluation platform.\n","authors":["Bardia Nadimi","Ghali Omar Boutaib","Hao Zheng"],"pdf_url":"https://arxiv.org/pdf/2412.06947v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19392v1","updated":"2024-12-27T00:44:34Z","published":"2024-12-27T00:44:34Z","title":"Asymptotically Optimal Search for a Change Point Anomaly under a\n Composite Hypothesis Model","summary":" We address the problem of searching for a change point in an anomalous\nprocess among a finite set of M processes. 
Specifically, we address a composite\nhypothesis model in which each process generates measurements following a\ncommon distribution with an unknown parameter (vector). This parameter belongs\nto either a normal or abnormal space depending on the current state of the\nprocess. Before the change point, all processes, including the anomalous one,\nare in a normal state; after the change point, the anomalous process\ntransitions to an abnormal state. Our goal is to design a sequential search\nstrategy that minimizes the Bayes risk by balancing sample complexity and\ndetection accuracy. We propose a deterministic search algorithm with the\nfollowing notable properties. First, we analytically demonstrate that when the\ndistributions of both normal and abnormal processes are unknown, the algorithm\nis asymptotically optimal in minimizing the Bayes risk as the error probability\napproaches zero. In the second setting, where the parameter under the null\nhypothesis is known, the algorithm achieves asymptotic optimality with improved\ndetection time based on the true normal state. Simulation results are presented\nto validate the theoretical findings.\n","authors":["Liad Lea Didi","Tomer Gafni","Kobi Cohen"],"pdf_url":"https://arxiv.org/pdf/2412.19392v1.pdf","comment":"13 pages, 6 figures"},{"id":"http://arxiv.org/abs/2412.16160v2","updated":"2024-12-27T00:43:39Z","published":"2024-11-23T18:30:04Z","title":"Online High-Frequency Trading Stock Forecasting with Automated Feature\n Clustering and Radial Basis Function Neural Networks","summary":" This study presents an autonomous experimental machine learning protocol for\nhigh-frequency trading (HFT) stock price forecasting that involves a dual\ncompetitive feature importance mechanism and clustering via shallow neural\nnetwork topology for fast training. 
By incorporating the k-means algorithm into\nthe radial basis function neural network (RBFNN), the proposed method addresses\nthe challenges of manual clustering and the reliance on potentially\nuninformative features. More specifically, our approach involves a dual\ncompetitive mechanism for feature importance, combining the mean-decrease\nimpurity (MDI) method and a gradient descent (GD) based feature importance\nmechanism. This approach, tested on HFT Level 1 order book data for 20 S&P 500\nstocks, enhances the forecasting ability of the RBFNN regressor. Our findings\nsuggest that an autonomous approach to feature selection and clustering is\ncrucial, as each stock requires a different input feature space. Overall, by\nautomating the feature selection and clustering processes, we remove the need\nfor manual topological grid search and provide a more efficient way to predict\nLOB's mid-price.\n","authors":["Adamantios Ntakaris","Gbenga Ibikunle"],"pdf_url":"https://arxiv.org/pdf/2412.16160v2.pdf","comment":"This paper was presented at the Economics of Financial Technology\n Conference, June 2023, in Edinburgh, UK"},{"id":"http://arxiv.org/abs/2412.19391v1","updated":"2024-12-27T00:36:40Z","published":"2024-12-27T00:36:40Z","title":"An In-Depth Analysis of Adversarial Discriminative Domain Adaptation for\n Digit Classification","summary":" Domain adaptation is an active area of research driven by the growing demand\nfor robust machine learning models that perform well on real-world data.\nAdversarial learning for deep neural networks (DNNs) has emerged as a promising\napproach to improving generalization ability, particularly for image\nclassification. In this paper, we implement a specific adversarial learning\ntechnique known as Adversarial Discriminative Domain Adaptation (ADDA) and\nreplicate digit classification experiments from the original ADDA paper. 
We\nextend their findings by examining a broader range of domain shifts and provide\na detailed analysis of in-domain classification accuracy post-ADDA. Our results\ndemonstrate that ADDA significantly improves accuracy across certain domain\nshifts with minimal impact on in-domain performance. Furthermore, we provide\nqualitative analysis and propose potential explanations for ADDA's limitations\nin less successful domain shifts. Code is at\nhttps://github.com/eugenechoi2004/COS429_FINAL .\n","authors":["Eugene Choi","Julian Rodriguez","Edmund Young"],"pdf_url":"https://arxiv.org/pdf/2412.19391v1.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2308.09599v2","updated":"2024-12-27T16:36:56Z","published":"2023-08-18T14:54:13Z","title":"Language-Guided Diffusion Model for Visual Grounding","summary":" Visual grounding (VG) tasks involve explicit cross-modal alignment, as\nsemantically corresponding image regions are to be located for the language\nphrases provided. Existing approaches complete such visual-text reasoning in a\nsingle-step manner. Their performance relies on large-scale\nanchors and over-designed multi-modal fusion modules based on human priors,\nleading to complicated frameworks that may be difficult to train and may overfit to\nspecific scenarios. Even worse, such once-for-all reasoning mechanisms are\nincapable of refining boxes continuously to enhance query-region matching. In\ncontrast, in this paper, we formulate an iterative reasoning process by\ndenoising diffusion modeling. Specifically, we propose a language-guided\ndiffusion framework for visual grounding, LG-DVG, which trains the model to\nprogressively reason queried object boxes by denoising a set of noisy boxes\nwith the language guide. To achieve this, LG-DVG gradually perturbs\nquery-aligned ground truth boxes to noisy ones and reverses this process step\nby step, conditional on query semantics. 
Extensive experiments for our proposed\nframework on five widely used datasets validate the superior performance of\nsolving visual grounding, a cross-modal alignment task, in a generative way.\nThe source codes are available at\nhttps://github.com/iQua/vgbase/tree/main/examples/DiffusionVG.\n","authors":["Sijia Chen","Baochun Li"],"pdf_url":"https://arxiv.org/pdf/2308.09599v2.pdf","comment":"20 pages, 16 figures"},{"id":"http://arxiv.org/abs/2412.19648v1","updated":"2024-12-27T13:54:32Z","published":"2024-12-27T13:54:32Z","title":"Enhancing Vision-Language Tracking by Effectively Converting Textual\n Cues into Visual Cues","summary":" Vision-Language Tracking (VLT) aims to localize a target in video sequences\nusing a visual template and language description. While textual cues enhance\ntracking potential, current datasets typically contain much more image data\nthan text, limiting the ability of VLT methods to align the two modalities\neffectively. To address this imbalance, we propose a novel plug-and-play method\nnamed CTVLT that leverages the strong text-image alignment capabilities of\nfoundation grounding models. CTVLT converts textual cues into interpretable\nvisual heatmaps, which are easier for trackers to process. Specifically, we\ndesign a textual cue mapping module that transforms textual cues into target\ndistribution heatmaps, visually representing the location described by the\ntext. Additionally, the heatmap guidance module fuses these heatmaps with the\nsearch image to guide tracking more effectively. Extensive experiments on\nmainstream benchmarks demonstrate the effectiveness of our approach, achieving\nstate-of-the-art performance and validating the utility of our method for\nenhanced VLT.\n","authors":["X. Feng","D. Zhang","S. Hu","X. Li","M. Wu","J. Zhang","X. Chen","K. Huang"],"pdf_url":"https://arxiv.org/pdf/2412.19648v1.pdf","comment":"Accepted by ICASSP '25 ! 
Code: https://github.com/XiaokunFeng/CTVLT"},{"id":"http://arxiv.org/abs/2407.19493v3","updated":"2024-12-27T10:34:15Z","published":"2024-07-28T13:23:43Z","title":"Official-NV: An LLM-Generated News Video Dataset for Multimodal Fake\n News Detection","summary":" News media, especially video news media, have penetrated into every aspect of\ndaily life, which also brings the risk of fake news. Therefore, multimodal fake\nnews detection has recently garnered increased attention. However, the existing\ndatasets are comprised of user-uploaded videos and contain an excess amount of\nsuperfluous data, which introduces noise into the model training process. To\naddress this issue, we construct a dataset named Official-NV, comprising\nofficially published news videos. The crawled officially published videos are\naugmented through the use of LLM-based generation and manual verification,\nthereby expanding the dataset. We also propose a new baseline model called\nOFNVD, which captures key information from multimodal features through a GLU\nattention mechanism and performs feature enhancement and modal aggregation via\na cross-modal Transformer. Benchmarking the dataset and baselines demonstrates\nthe effectiveness of our model in multimodal news detection.\n","authors":["Yihao Wang","Lizhi Chen","Zhong Qian","Peifeng Li"],"pdf_url":"https://arxiv.org/pdf/2407.19493v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19492v1","updated":"2024-12-27T07:20:30Z","published":"2024-12-27T07:20:30Z","title":"Towards Open-Vocabulary Remote Sensing Image Semantic Segmentation","summary":" Recently, deep learning based methods have revolutionized remote sensing\nimage segmentation. However, these methods usually rely on a pre-defined\nsemantic class set, thus needing additional image annotation and model training\nwhen adapting to new classes. More importantly, they are unable to segment\narbitrary semantic classes. 
In this work, we introduce Open-Vocabulary Remote\nSensing Image Semantic Segmentation (OVRSISS), which aims to segment arbitrary\nsemantic classes in remote sensing images. To address the lack of OVRSISS\ndatasets, we develop LandDiscover50K, a comprehensive dataset of 51,846 images\ncovering 40 diverse semantic classes. In addition, we propose a novel framework\nnamed GSNet that integrates domain priors from special remote sensing models\nand versatile capabilities of general vision-language models. Technically,\nGSNet consists of a Dual-Stream Image Encoder (DSIE), a Query-Guided Feature\nFusion (QGFF), and a Residual Information Preservation Decoder (RIPD). DSIE\nfirst captures comprehensive features from both special models and general\nmodels in dual streams. Then, with the guidance of variable vocabularies, QGFF\nintegrates specialist and generalist features, enabling them to complement each\nother. Finally, RIPD is proposed to aggregate multi-source features for more\naccurate mask predictions. Experiments show that our method outperforms other\nmethods by a large margin, and our proposed LandDiscover50K improves the\nperformance of OVRSISS methods. The proposed dataset and method will be made\npublicly available at https://github.com/yecy749/GSNet.\n","authors":["Chengyang Ye","Yunzhi Zhuge","Pingping Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.19492v1.pdf","comment":"Accepted by AAAI2025"},{"id":"http://arxiv.org/abs/2412.19446v1","updated":"2024-12-27T04:25:32Z","published":"2024-12-27T04:25:32Z","title":"Adrenaline: Adaptive Rendering Optimization System for Scalable Cloud\n Gaming","summary":" Cloud gaming requires a low-latency network connection, making it a prime\ncandidate for being hosted at the network edge. However, an edge server is\nprovisioned with a fixed compute capacity, causing an issue for multi-user\nservice and resulting in users having to wait before they can play when the\nserver is occupied. 
In this work, we present a new insight that when a user's\nnetwork condition results in use of lossy compression, the end-to-end visual\nquality degrades more for frames of high rendering quality, wasting the\nserver's computing resources. We leverage this observation to build Adrenaline,\na new system which adaptively optimizes the game rendering qualities by\nconsidering the user-side visual quality and server-side rendering cost. The\nrendering quality optimization of Adrenaline is done via a scoring mechanism\nquantifying the effectiveness of server resource usage on the user-side gaming\nquality. Our open-sourced implementation of Adrenaline demonstrates easy\nintegration with modern game engines. In our evaluations, Adrenaline achieves\nup to 24% higher service quality and 2x more users served with the same\nresource footprint compared to other baselines.\n","authors":["Jin Heo","Ketan Bhardwaj","Ada Gavrilovska"],"pdf_url":"https://arxiv.org/pdf/2412.19446v1.pdf","comment":"15 pages, 13 figures, 5 tables"},{"id":"http://arxiv.org/abs/2403.05427v3","updated":"2024-12-27T19:52:04Z","published":"2024-03-08T16:24:42Z","title":"Reply with Sticker: New Dataset and Model for Sticker Retrieval","summary":" Using stickers in online chatting is very prevalent on social media\nplatforms, where the stickers used in the conversation can express someone's\nintention/emotion/attitude in a vivid, tactful, and intuitive way. Existing\nsticker retrieval research typically retrieves stickers based on context and\nthe current utterance delivered by the user. That is, the stickers serve as a\nsupplement to the current utterance. However, in real-world scenarios, using\nstickers to express what we want to say, rather than merely to supplement our\nwords, is also important. 
Therefore, in this paper, we create a new dataset\nfor sticker retrieval in conversation, called \\textbf{StickerInt}, where\nstickers are used to reply to previous conversations or supplement our\nwords\\footnote{We believe that the release of this dataset will provide a more\ncomplete paradigm than existing work for the research of sticker retrieval in\nthe open-domain online conversation.}. Based on the created dataset, we present\na simple yet effective framework for sticker retrieval in conversation based on\nthe learning of intention and the cross-modal relationships between\nconversation context and stickers, coined as \\textbf{Int-RA}. Specifically, we\nfirst devise a knowledge-enhanced intention predictor to introduce the\nintention information into the conversation representations. Subsequently, a\nrelation-aware sticker selector is devised to retrieve the response sticker via\ncross-modal relationships. Extensive experiments on the created dataset show\nthat the proposed model achieves state-of-the-art performance in sticker\nretrieval\\footnote{The dataset and source code of this work are released at\n\\url{https://github.com/HITSZ-HLT/Int-RA}.}.\n","authors":["Bin Liang","Bingbing Wang","Zhixin Bai","Qiwei Lang","Mingwei Sun","Kaiheng Hou","Lanjun Zhou","Ruifeng Xu","Kam-Fai Wong"],"pdf_url":"https://arxiv.org/pdf/2403.05427v3.pdf","comment":null}]},"2024-12-26T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2412.19361v1","updated":"2024-12-26T22:04:23Z","published":"2024-12-26T22:04:23Z","title":"Dynamic Skill Adaptation for Large Language Models","summary":" We present Dynamic Skill Adaptation (DSA), an adaptive and dynamic framework\nto adapt novel and complex skills to Large Language Models (LLMs). 
Compared\nwith previous work which learns from human-curated and static data in random\norder, we propose to first automatically generate and organize the training\ndata by mimicking the learning pathways of humans and then dynamically tailor\nthe training data based on the training dynamics. Specifically, inspired by the\nlearning structures and teaching strategies in the human education system, we\nfirst construct a skill graph by decomposing complex skills into sub-skills and\narranging them based on their dependencies in human syllabi. For every skill,\nwe utilize LLMs to generate both textbook-like data which contains detailed\ndescriptions of skills for pre-training and exercise-like data which targets\nexplicitly utilizing the skills to solve problems for instruction-tuning.\nFurthermore, during the instruction-tuning, we dynamically update the training\ndata by down-weighting easy-to-learn examples, generating more complex examples,\nand filtering out data with errors. Experiments on large language models such as\nLLAMA and Mistral demonstrate the effectiveness of our proposed methods in\nadapting math reasoning skills and social study skills.\n","authors":["Jiaao Chen","Diyi Yang"],"pdf_url":"https://arxiv.org/pdf/2412.19361v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19351v1","updated":"2024-12-26T21:13:12Z","published":"2024-12-26T21:13:12Z","title":"ETTA: Elucidating the Design Space of Text-to-Audio Models","summary":" Recent years have seen significant progress in Text-To-Audio (TTA) synthesis,\nenabling users to enrich their creative workflows with synthetic audio\ngenerated from natural language prompts. Despite this progress, the effects of\ndata, model architecture, training objective functions, and sampling strategies\non target benchmarks are not well understood. 
With the purpose of providing a\nholistic understanding of the design space of TTA models, we set up a\nlarge-scale empirical experiment focused on diffusion and flow matching models.\nOur contributions include: 1) AF-Synthetic, a large dataset of high quality\nsynthetic captions obtained from an audio understanding model; 2) a systematic\ncomparison of different architectural, training, and inference design choices\nfor TTA models; 3) an analysis of sampling methods and their Pareto curves with\nrespect to generation quality and inference speed. We leverage the knowledge\nobtained from this extensive analysis to propose our best model dubbed\nElucidated Text-To-Audio (ETTA). When evaluated on AudioCaps and MusicCaps,\nETTA provides improvements over the baselines trained on publicly available\ndata, while being competitive with models trained on proprietary data. Finally,\nwe show ETTA's improved ability to generate creative audio following complex\nand imaginative captions -- a task that is more challenging than current\nbenchmarks.\n","authors":["Sang-gil Lee","Zhifeng Kong","Arushi Goel","Sungwon Kim","Rafael Valle","Bryan Catanzaro"],"pdf_url":"https://arxiv.org/pdf/2412.19351v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19350v1","updated":"2024-12-26T20:53:04Z","published":"2024-12-26T20:53:04Z","title":"On the Expressiveness and Length Generalization of Selective State-Space\n Models on Regular Languages","summary":" Selective state-space models (SSMs) are an emerging alternative to the\nTransformer, offering the unique advantage of parallel training and sequential\ninference. Although these models have shown promising performance on a variety\nof tasks, their formal expressiveness and length generalization properties\nremain underexplored. 
In this work, we provide insight into the workings of\nselective SSMs by analyzing their expressiveness and length generalization\nperformance on regular language tasks, i.e., finite-state automaton (FSA)\nemulation. We address certain limitations of modern SSM-based architectures by\nintroducing the Selective Dense State-Space Model (SD-SSM), the first selective\nSSM that exhibits perfect length generalization on a set of various regular\nlanguage tasks using a single layer. It utilizes a dictionary of dense\ntransition matrices, a softmax selection mechanism that creates a convex\ncombination of dictionary matrices at each time step, and a readout consisting\nof layer normalization followed by a linear map. We then proceed to evaluate\nvariants of diagonal selective SSMs by considering their empirical performance\non commutative and non-commutative automata. We explain the experimental\nresults with theoretical considerations. Our code is available at\nhttps://github.com/IBM/selective-dense-state-space-model.\n","authors":["Aleksandar Terzić","Michael Hersche","Giacomo Camposampiero","Thomas Hofmann","Abu Sebastian","Abbas Rahimi"],"pdf_url":"https://arxiv.org/pdf/2412.19350v1.pdf","comment":"13 pages, 7 figures, to be published in AAAI 2025"},{"id":"http://arxiv.org/abs/2412.19346v1","updated":"2024-12-26T20:24:35Z","published":"2024-12-26T20:24:35Z","title":"Semi-Supervised Learning from Small Annotated Data and Large Unlabeled\n Data for Fine-grained PICO Entity Recognition","summary":" Objective: Extracting PICO elements -- Participants, Intervention,\nComparison, and Outcomes -- from clinical trial literature is essential for\nclinical evidence retrieval, appraisal, and synthesis. Existing approaches do\nnot distinguish the attributes of PICO entities. 
This study aims to develop a\nnamed entity recognition (NER) model to extract PICO entities with fine\ngranularities.\n Materials and Methods: Using a corpus of 2,511 abstracts with PICO mentions\nfrom 4 public datasets, we developed a semi-supervised method to facilitate the\ntraining of an NER model, FinePICO, by combining limited annotated data of PICO\nentities and abundant unlabeled data. For evaluation, we divided the entire\ndataset into two subsets: a smaller group with annotations and a larger group\nwithout annotations. We then established the theoretical lower and upper\nperformance bounds based on the performance of supervised learning models\ntrained solely on the small, annotated subset and on the entire set with\ncomplete annotations, respectively. Finally, we evaluated FinePICO on both the\nsmaller annotated subset and the larger, initially unannotated subset. We\nmeasured the performance of FinePICO using precision, recall, and F1.\n Results: Our method achieved precision/recall/F1 of 0.567/0.636/0.60,\nrespectively, using a small set of annotated samples, outperforming the\nbaseline model (F1: 0.437) by more than 16\\%. The model demonstrates\ngeneralizability to a different PICO framework and to another corpus,\nconsistently outperforming the benchmark in diverse experimental settings\n(p-value \\textless0.001).\n Conclusion: This study contributes a generalizable and effective\nsemi-supervised approach to named entity recognition leveraging large unlabeled\ndata together with small, annotated data. 
It also initially supports\nfine-grained PICO extraction.\n","authors":["Fangyi Chen","Gongbo Zhang","Yilu Fang","Yifan Peng","Chunhua Weng"],"pdf_url":"https://arxiv.org/pdf/2412.19346v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19619v2","updated":"2024-12-26T20:04:21Z","published":"2023-10-30T15:12:09Z","title":"Towards A Holistic Landscape of Situated Theory of Mind in Large\n Language Models","summary":" Large Language Models (LLMs) have generated considerable interest and debate\nregarding their potential emergence of Theory of Mind (ToM). Several recent\ninquiries reveal a lack of robust ToM in these models and pose a pressing\ndemand to develop new benchmarks, as current ones primarily focus on different\naspects of ToM and are prone to shortcuts and data leakage. In this position\npaper, we seek to answer two road-blocking questions: (1) How can we taxonomize\na holistic landscape of machine ToM? (2) What is a more effective evaluation\nprotocol for machine ToM? Following psychological studies, we taxonomize\nmachine ToM into 7 mental state categories and delineate existing benchmarks to\nidentify under-explored aspects of ToM. We argue for a holistic and situated\nevaluation of ToM to break ToM into individual components and treat LLMs as an\nagent who is physically situated in environments and socially situated in\ninteractions with humans. Such situated evaluation provides a more\ncomprehensive assessment of mental states and potentially mitigates the risk of\nshortcuts and data leakage. We further present a pilot study in a grid world\nsetup as a proof of concept. We hope this position paper can facilitate future\nresearch to integrate ToM with LLMs and offer an intuitive means for\nresearchers to better position their work in the landscape of ToM. 
Project\npage: https://github.com/Mars-tin/awesome-theory-of-mind\n","authors":["Ziqiao Ma","Jacob Sansom","Run Peng","Joyce Chai"],"pdf_url":"https://arxiv.org/pdf/2310.19619v2.pdf","comment":"EMNLP 2023 (Findings)"},{"id":"http://arxiv.org/abs/2306.08685v2","updated":"2024-12-26T19:50:42Z","published":"2023-06-14T18:10:05Z","title":"World-to-Words: Grounded Open Vocabulary Acquisition through Fast\n Mapping in Vision-Language Models","summary":" The ability to connect language units to their referents in the physical\nworld, referred to as grounding, is crucial to learning and understanding\ngrounded meanings of words. While humans demonstrate fast mapping in new word\nlearning, it remains unclear whether modern vision-language models can truly\nrepresent language with their grounded meanings and how grounding may further\nbootstrap new word learning. To this end, we introduce Grounded Open Vocabulary\nAcquisition (GOVA) to examine grounding and bootstrapping in open-world\nlanguage learning. As an initial attempt, we propose object-oriented BERT\n(OctoBERT), a novel visually-grounded language model by pre-training on\nimage-text pairs highlighting grounding as an objective. 
Through extensive\nexperiments and analysis, we demonstrate that OctoBERT is a more coherent and\nfast grounded word learner, and that the grounding ability acquired during\npre-training helps the model to learn unseen words more rapidly and robustly.\nOur code is available at https://github.com/sled-group/world-to-words\n","authors":["Ziqiao Ma","Jiayi Pan","Joyce Chai"],"pdf_url":"https://arxiv.org/pdf/2306.08685v2.pdf","comment":"ACL 2023 Outstanding Paper"},{"id":"http://arxiv.org/abs/2412.12225v2","updated":"2024-12-26T19:23:17Z","published":"2024-12-16T10:03:44Z","title":"DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis","summary":" Multimodal Sentiment Analysis (MSA) leverages heterogeneous modalities, such\nas language, vision, and audio, to enhance the understanding of human\nsentiment. While existing models often focus on extracting shared information\nacross modalities or directly fusing heterogeneous modalities, such approaches\ncan introduce redundancy and conflicts due to equal treatment of all modalities\nand the mutual transfer of information between modality pairs. To address these\nissues, we propose a Disentangled-Language-Focused (DLF) multimodal\nrepresentation learning framework, which incorporates a feature disentanglement\nmodule to separate modality-shared and modality-specific information. To\nfurther reduce redundancy and enhance language-targeted features, four\ngeometric measures are introduced to refine the disentanglement process. A\nLanguage-Focused Attractor (LFA) is further developed to strengthen language\nrepresentation by leveraging complementary modality-specific information\nthrough a language-guided cross-attention mechanism. The framework also employs\nhierarchical predictions to improve overall accuracy. Extensive experiments on\ntwo popular MSA datasets, CMU-MOSI and CMU-MOSEI, demonstrate the significant\nperformance gains achieved by the proposed DLF framework. 
Comprehensive\nablation studies further validate the effectiveness of the feature\ndisentanglement module, language-focused attractor, and hierarchical\npredictions. Our code is available at https://github.com/pwang322/DLF.\n","authors":["Pan Wang","Qiang Zhou","Yawen Wu","Tianlong Chen","Jingtong Hu"],"pdf_url":"https://arxiv.org/pdf/2412.12225v2.pdf","comment":"AAAI 2025 accepted"},{"id":"http://arxiv.org/abs/2412.15188v2","updated":"2024-12-26T18:56:18Z","published":"2024-12-19T18:56:24Z","title":"LMFusion: Adapting Pretrained Language Models for Multimodal Generation","summary":" We present LMFusion, a framework for empowering pretrained text-only large\nlanguage models (LLMs) with multimodal generative capabilities, enabling them\nto understand and generate both text and images in arbitrary sequences.\nLMFusion leverages existing Llama-3's weights for processing texts\nautoregressively while introducing additional and parallel transformer modules\nfor processing images with diffusion. During training, the data from each\nmodality is routed to its dedicated modules: modality-specific feedforward\nlayers, query-key-value projections, and normalization layers process each\nmodality independently, while the shared self-attention layers allow\ninteractions across text and image features. By freezing the text-specific\nmodules and only training the image-specific modules, LMFusion preserves the\nlanguage capabilities of text-only LLMs while developing strong visual\nunderstanding and generation abilities. Compared to methods that pretrain\nmultimodal generative models from scratch, our experiments demonstrate that,\nLMFusion improves image understanding by 20% and image generation by 3.6% using\nonly 50% of the FLOPs while maintaining Llama-3's language capabilities. We\nalso demonstrate that this framework can adapt existing vision-language models\nwith multimodal generation ability. 
Overall, this framework not only leverages\nexisting computational investments in text-only LLMs but also enables the\nparallel development of language and vision capabilities, presenting a\npromising direction for efficient multimodal model development.\n","authors":["Weijia Shi","Xiaochuang Han","Chunting Zhou","Weixin Liang","Xi Victoria Lin","Luke Zettlemoyer","Lili Yu"],"pdf_url":"https://arxiv.org/pdf/2412.15188v2.pdf","comment":"Name change: LlamaFusion to LMFusion"},{"id":"http://arxiv.org/abs/2305.13168v4","updated":"2024-12-26T18:54:53Z","published":"2023-05-22T15:56:44Z","title":"LLMs for Knowledge Graph Construction and Reasoning: Recent Capabilities\n and Future Opportunities","summary":" This paper presents an exhaustive quantitative and qualitative evaluation of\nLarge Language Models (LLMs) for Knowledge Graph (KG) construction and\nreasoning. We engage in experiments across eight diverse datasets, focusing on\nfour representative tasks encompassing entity and relation extraction, event\nextraction, link prediction, and question-answering, thereby thoroughly\nexploring LLMs' performance in the domain of construction and inference.\nEmpirically, our findings suggest that LLMs, represented by GPT-4, are more\nsuited as inference assistants rather than few-shot information extractors.\nSpecifically, while GPT-4 exhibits good performance in tasks related to KG\nconstruction, it excels further in reasoning tasks, surpassing fine-tuned\nmodels in certain cases. Moreover, our investigation extends to the potential\ngeneralization ability of LLMs for information extraction, leading to the\nproposition of a Virtual Knowledge Extraction task and the development of the\ncorresponding VINE dataset. Based on these empirical findings, we further\npropose AutoKG, a multi-agent-based approach employing LLMs and external\nsources for KG construction and reasoning. 
We anticipate that this research can\nprovide invaluable insights for future undertakings in the field of knowledge\ngraphs. The code and datasets are in https://github.com/zjunlp/AutoKG.\n","authors":["Yuqi Zhu","Xiaohan Wang","Jing Chen","Shuofei Qiao","Yixin Ou","Yunzhi Yao","Shumin Deng","Huajun Chen","Ningyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.13168v4.pdf","comment":"World Wide Web Journal"},{"id":"http://arxiv.org/abs/2412.08753v2","updated":"2024-12-26T18:50:10Z","published":"2024-12-11T19:50:37Z","title":"BDA: Bangla Text Data Augmentation Framework","summary":" Data augmentation involves generating synthetic samples that resemble those\nin a given dataset. In resource-limited fields where high-quality data is\nscarce, augmentation plays a crucial role in increasing the volume of training\ndata. This paper introduces a Bangla Text Data Augmentation (BDA) Framework\nthat uses both pre-trained models and rule-based methods to create new variants\nof the text. A filtering process is included to ensure that the new text keeps\nthe same meaning as the original while also adding variety in the words used.\nWe conduct a comprehensive evaluation of the framework's effectiveness in\nBangla text classification tasks. Our framework achieved significant\nimprovement in F1 scores across five distinct datasets, delivering performance\nequivalent to models trained on 100% of the data while utilizing only 50% of\nthe training dataset. Additionally, we explore the impact of data scarcity by\nprogressively reducing the training data and augmenting it through BDA,\nresulting in notable F1 score enhancements. The study offers a thorough\nexamination of BDA's performance, identifying key factors for optimal results\nand addressing its limitations through detailed analysis.\n","authors":["Md. 
Tariquzzaman","Audwit Nafi Anam","Naimul Haque","Mohsinul Kabir","Hasan Mahmud","Md Kamrul Hasan"],"pdf_url":"https://arxiv.org/pdf/2412.08753v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19289v1","updated":"2024-12-26T17:29:38Z","published":"2024-12-26T17:29:38Z","title":"ViPCap: Retrieval Text-Based Visual Prompts for Lightweight Image\n Captioning","summary":" Recent lightweight image captioning models using retrieved data mainly focus\non text prompts. However, previous works only utilize the retrieved text as\ntext prompts, and the visual information relies only on the CLIP visual\nembedding. Because of this issue, there is a limitation that the image\ndescriptions inherent in the prompt are not sufficiently reflected in the\nvisual embedding space. To tackle this issue, we propose ViPCap, a novel\nretrieval text-based visual prompt for lightweight image captioning. ViPCap\nleverages the retrieved text with image information as visual prompts to\nenhance the ability of the model to capture relevant visual information. By\nmapping text prompts into the CLIP space and generating multiple randomized\nGaussian distributions, our method leverages sampling to explore randomly\naugmented distributions and effectively retrieves the semantic features that\ncontain image information. These retrieved features are integrated into the\nimage and designated as the visual prompt, leading to performance improvements\non the datasets such as COCO, Flickr30k, and NoCaps. 
Experimental results\ndemonstrate that ViPCap significantly outperforms prior lightweight captioning\nmodels in efficiency and effectiveness, demonstrating the potential for a\nplug-and-play solution.\n","authors":["Taewhan Kim","Soeun Lee","Si-Woo Kim","Dong-Jin Kim"],"pdf_url":"https://arxiv.org/pdf/2412.19289v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.15577v2","updated":"2024-12-26T16:54:25Z","published":"2024-11-23T14:47:10Z","title":"From MTEB to MTOB: Retrieval-Augmented Classification for Descriptive\n Grammars","summary":" Recent advances in language modeling have demonstrated significant\nimprovements in zero-shot capabilities, including in-context learning,\ninstruction following, and machine translation for extremely under-resourced\nlanguages (Tanzer et al., 2024). However, many languages with limited written\nresources rely primarily on formal descriptions of grammar and vocabulary.\n In this paper, we introduce a set of benchmarks to evaluate how well models\ncan extract and classify information from the complex descriptions found in\nlinguistic grammars. We present a Retrieval-Augmented Generation (RAG)-based\napproach that leverages these descriptions for downstream tasks such as machine\ntranslation. Our benchmarks encompass linguistic descriptions for 248 languages\nacross 142 language families, focusing on typological features from WALS and\nGrambank.\n This set of benchmarks offers the first comprehensive evaluation of language\nmodels' in-context ability to accurately interpret and extract linguistic\nfeatures, providing a critical resource for scaling NLP to low-resource\nlanguages. 
The code and data are publicly available at\n\\url{https://github.com/al-the-eigenvalue/RAG-on-grammars}.\n","authors":["Albert Kornilov","Tatiana Shavrina"],"pdf_url":"https://arxiv.org/pdf/2411.15577v2.pdf","comment":"submitted to COLING 2025"},{"id":"http://arxiv.org/abs/2412.18120v2","updated":"2024-12-26T16:31:53Z","published":"2024-12-24T03:06:52Z","title":"Do Language Models Understand the Cognitive Tasks Given to Them?\n Investigations with the N-Back Paradigm","summary":" Cognitive tasks originally developed for humans are now increasingly used to\nstudy language models. While applying these tasks is often straightforward,\ninterpreting their results can be challenging. In particular, when a model\nunderperforms, it is often unclear whether this results from a limitation in\nthe cognitive ability being tested or a failure to understand the task itself.\nA recent study argues that GPT 3.5's declining performance on 2-back and 3-back\ntasks reflects a working memory capacity limit similar to humans (Gong et al.,\n2024). By analyzing a range of open-source language models of varying\nperformance levels on these tasks, we show that the poor performance instead\nreflects a limitation in task comprehension and task set maintenance. In\naddition, we challenge the best-performing model with progressively harder\nversions of the task (up to 10-back) and experiment with alternative prompting\nstrategies, before analyzing model attentions. Our larger aim is to contribute\nto the ongoing conversation around refining methodologies for the cognitive\nevaluation of language models.\n","authors":["Xiaoyang Hu","Richard L. 
Lewis"],"pdf_url":"https://arxiv.org/pdf/2412.18120v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19265v1","updated":"2024-12-26T16:05:19Z","published":"2024-12-26T16:05:19Z","title":"Optimizing Multi-Stage Language Models for Effective Text Retrieval","summary":" Efficient text retrieval is critical for applications such as legal document\nanalysis, particularly in specialized contexts like Japanese legal systems.\nExisting retrieval methods often underperform in such domain-specific\nscenarios, necessitating tailored approaches. In this paper, we introduce a\nnovel two-phase text retrieval pipeline optimized for Japanese legal datasets.\nOur method leverages advanced language models to achieve state-of-the-art\nperformance, significantly improving retrieval efficiency and accuracy. To\nfurther enhance robustness and adaptability, we incorporate an ensemble model\nthat integrates multiple retrieval strategies, resulting in superior outcomes\nacross diverse tasks. Extensive experiments validate the effectiveness of our\napproach, demonstrating strong performance on both Japanese legal datasets and\nwidely recognized benchmarks like MS-MARCO. Our work establishes new standards\nfor text retrieval in domain-specific and general contexts, providing a\ncomprehensive solution for addressing complex queries in legal and multilingual\nenvironments.\n","authors":["Quang Hoang Trung","Le Trung Hoang","Nguyen Van Hoang Phuc"],"pdf_url":"https://arxiv.org/pdf/2412.19265v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19260v1","updated":"2024-12-26T15:54:10Z","published":"2024-12-26T15:54:10Z","title":"MEDEC: A Benchmark for Medical Error Detection and Correction in\n Clinical Notes","summary":" Several studies showed that Large Language Models (LLMs) can answer medical\nquestions correctly, even outperforming the average human score in some medical\nexams. 
However, to our knowledge, no study has been conducted to assess the\nability of language models to validate existing or generated medical text for\ncorrectness and consistency. In this paper, we introduce MEDEC\n(https://github.com/abachaa/MEDEC), the first publicly available benchmark for\nmedical error detection and correction in clinical notes, covering five types\nof errors (Diagnosis, Management, Treatment, Pharmacotherapy, and Causal\nOrganism). MEDEC consists of 3,848 clinical texts, including 488 clinical notes\nfrom three US hospital systems that were not previously seen by any LLM. The\ndataset has been used for the MEDIQA-CORR shared task to evaluate seventeen\nparticipating systems [Ben Abacha et al., 2024]. In this paper, we describe the\ndata creation methods and we evaluate recent LLMs (e.g., o1-preview, GPT-4,\nClaude 3.5 Sonnet, and Gemini 2.0 Flash) for the tasks of detecting and\ncorrecting medical errors requiring both medical knowledge and reasoning\ncapabilities. We also conducted a comparative study where two medical doctors\nperformed the same task on the MEDEC test set. The results showed that MEDEC is\na sufficiently challenging benchmark to assess the ability of models to\nvalidate existing or generated notes and to correct medical errors. We also\nfound that although recent LLMs have a good performance in error detection and\ncorrection, they are still outperformed by medical doctors in these tasks. 
We\ndiscuss the potential factors behind this gap, the insights from our\nexperiments, the limitations of current evaluation metrics, and share potential\npointers for future research.\n","authors":["Asma Ben Abacha","Wen-wai Yim","Yujuan Fu","Zhaoyi Sun","Meliha Yetisgen","Fei Xia","Thomas Lin"],"pdf_url":"https://arxiv.org/pdf/2412.19260v1.pdf","comment":"12 pages"},{"id":"http://arxiv.org/abs/2412.19255v1","updated":"2024-12-26T15:45:45Z","published":"2024-12-26T15:45:45Z","title":"Multi-matrix Factorization Attention","summary":" We propose novel attention architectures, Multi-matrix Factorization\nAttention (MFA) and MFA-Key-Reuse (MFA-KR). Existing variants for standard\nMulti-Head Attention (MHA), including SOTA methods like MLA, fail to maintain\nas strong performance under stringent Key-Value cache (KV cache) constraints.\nMFA enhances model capacity by efficiently scaling up both the number and\ndimension of attention heads through low-rank matrix factorization in the\nQuery-Key (QK) circuit. Extending MFA, MFA-KR further reduces memory\nrequirements by repurposing the key cache as value through value projection\nre-parameterization. MFA's design enables strong model capacity when working\nunder tight KV cache budget, while MFA-KR is suitable for even harsher KV cache\nlimits with minor performance trade-off. 
Notably, in our extensive and\nlarge-scale experiments, the proposed architecture outperforms MLA and performs\ncomparably to MHA, while reducing KV cache usage by up to 56% and 93.7%,\nrespectively.\n","authors":["Jingcheng Hu","Houyi Li","Yinmin Zhang","Zili Wang","Shuigeng Zhou","Xiangyu Zhang","Heung-Yeung Shum"],"pdf_url":"https://arxiv.org/pdf/2412.19255v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.04739v3","updated":"2024-12-26T13:58:31Z","published":"2024-10-07T04:15:02Z","title":"TableRAG: Million-Token Table Understanding with Language Models","summary":" Recent advancements in language models (LMs) have notably enhanced their\nability to reason with tabular data, primarily through program-aided mechanisms\nthat manipulate and analyze tables. However, these methods often require the\nentire table as input, leading to scalability challenges due to the positional\nbias or context length constraints. In response to these challenges, we\nintroduce TableRAG, a Retrieval-Augmented Generation (RAG) framework\nspecifically designed for LM-based table understanding. TableRAG leverages\nquery expansion combined with schema and cell retrieval to pinpoint crucial\ninformation before providing it to the LMs. This enables more efficient data\nencoding and precise retrieval, significantly reducing prompt lengths and\nmitigating information loss. We have developed two new million-token benchmarks\nfrom the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG's\neffectiveness at scale. 
Our results demonstrate that TableRAG's retrieval\ndesign achieves the highest retrieval quality, leading to the new\nstate-of-the-art performance on large-scale table understanding.\n","authors":["Si-An Chen","Lesly Miculicich","Julian Martin Eisenschlos","Zifeng Wang","Zilong Wang","Yanfei Chen","Yasuhisa Fujii","Hsuan-Tien Lin","Chen-Yu Lee","Tomas Pfister"],"pdf_url":"https://arxiv.org/pdf/2410.04739v3.pdf","comment":"Accepted to NeurIPS 2024"},{"id":"http://arxiv.org/abs/2407.10347v3","updated":"2024-12-26T11:47:56Z","published":"2024-07-14T22:23:07Z","title":"Enhancing Long-Range Dependency with State Space Model and\n Kolmogorov-Arnold Networks for Aspect-Based Sentiment Analysis","summary":" Aspect-based Sentiment Analysis (ABSA) evaluates sentiments toward specific\naspects of entities within the text. However, attention mechanisms and neural\nnetwork models struggle with syntactic constraints. The quadratic complexity of\nattention mechanisms also limits their adoption for capturing long-range\ndependencies between aspect and opinion words in ABSA. This complexity can lead\nto the misinterpretation of irrelevant contextual words, restricting their\neffectiveness to short-range dependencies. To address the above problem, we\npresent a novel approach to enhance long-range dependencies between aspect and\nopinion words in ABSA (MambaForGCN). This approach incorporates syntax-based\nGraph Convolutional Network (SynGCN) and MambaFormer (Mamba-Transformer)\nmodules to encode input with dependency relations and semantic information. The\nMultihead Attention (MHA) and Selective State Space model (Mamba) blocks in the\nMambaFormer module serve as channels to enhance the model with short and\nlong-range dependencies between aspect and opinion words. We also introduce the\nKolmogorov-Arnold Networks (KANs) gated fusion, an adaptive feature\nrepresentation system that integrates SynGCN and MambaFormer and captures\nnon-linear, complex dependencies. 
Experimental results on three benchmark\ndatasets demonstrate MambaForGCN's effectiveness, outperforming\nstate-of-the-art (SOTA) baseline models.\n","authors":["Adamu Lawan","Juhua Pu","Haruna Yunusa","Aliyu Umar","Muhammad Lawan"],"pdf_url":"https://arxiv.org/pdf/2407.10347v3.pdf","comment":"11 pages, 3 figures and 3 tables. arXiv admin note: text overlap with\n arXiv:2405.13013"},{"id":"http://arxiv.org/abs/2412.19184v1","updated":"2024-12-26T11:46:22Z","published":"2024-12-26T11:46:22Z","title":"Multi-Head Attention Driven Dynamic Visual-Semantic Embedding for\n Enhanced Image-Text Matching","summary":" With the rapid development of multimodal learning, the image-text matching\ntask, as a bridge connecting vision and language, has become increasingly\nimportant. Based on existing research, this study proposes an innovative visual\nsemantic embedding model, Multi-Headed Consensus-Aware Visual-Semantic\nEmbedding (MH-CVSE). This model introduces a multi-head self-attention\nmechanism based on the consensus-aware visual semantic embedding model (CVSE)\nto capture information in multiple subspaces in parallel, significantly\nenhancing the model's ability to understand and represent the complex\nrelationship between images and texts. In addition, we adopt a parameterized\nfeature fusion strategy to flexibly integrate feature information at different\nlevels, further improving the model's expressive power. In terms of loss\nfunction design, the MH-CVSE model adopts a dynamic weight adjustment strategy\nto dynamically adjust the weight according to the loss value itself, so that\nthe model can better balance the contribution of different loss terms during\ntraining. At the same time, we introduce a cosine annealing learning rate\nstrategy to help the model converge more stably in the later stages of\ntraining. 
Extensive experimental verification on the Flickr30k dataset shows\nthat the MH-CVSE model achieves better performance than previous methods in\nboth bidirectional image and text retrieval tasks, fully demonstrating its\neffectiveness and superiority.\n","authors":["Wenjing Chen"],"pdf_url":"https://arxiv.org/pdf/2412.19184v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17131v2","updated":"2024-12-26T11:43:37Z","published":"2024-12-22T18:38:24Z","title":"LLMsAgainstHate @ NLU of Devanagari Script Languages 2025: Hate Speech\n Detection and Target Identification in Devanagari Languages via Parameter\n Efficient Fine-Tuning of LLMs","summary":" The detection of hate speech has become increasingly important in combating\nonline hostility and its real-world consequences. Despite recent advancements,\nthere is limited research addressing hate speech detection in\nDevanagari-scripted languages, where resources and tools are scarce. While\nlarge language models (LLMs) have shown promise in language-related tasks,\ntraditional fine-tuning approaches are often infeasible given the size of the\nmodels. In this paper, we propose a Parameter Efficient Fine tuning (PEFT)\nbased solution for hate speech detection and target identification. We evaluate\nmultiple LLMs on the Devanagari dataset provided by (Thapa et al., 2025), which\ncontains annotated instances in 2 languages - Hindi and Nepali. The results\ndemonstrate the efficacy of our approach in handling Devanagari-scripted\ncontent.\n","authors":["Rushendra Sidibomma","Pransh Patwa","Parth Patwa","Aman Chadha","Vinija Jain","Amitava Das"],"pdf_url":"https://arxiv.org/pdf/2412.17131v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19178v1","updated":"2024-12-26T11:32:00Z","published":"2024-12-26T11:32:00Z","title":"Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal\n Video-Text Retrieval","summary":" Cross-modal (e.g. 
image-text, video-text) retrieval is an important task in\ninformation retrieval and multimodal vision-language understanding field.\nTemporal understanding makes video-text retrieval more challenging than\nimage-text retrieval. However, we find that the widely used video-text\nbenchmarks have shortcomings in comprehensively assessing abilities of models,\nespecially in temporal understanding, such that large-scale image-text\npre-trained models can already achieve comparable zero-shot performance with\nvideo-text pre-trained models. In this paper, we introduce RTime, a novel\ntemporal-emphasized video-text retrieval dataset. We first obtain videos of\nactions or events with significant temporality, and then reverse these videos\nto create harder negative samples. We then recruit annotators to judge the\nsignificance and reversibility of candidate videos, and write captions for\nqualified videos. We further adopt GPT-4 to extend more captions based on\nhuman-written captions. Our RTime dataset currently consists of 21k videos with\n10 captions per video, totalling about 122 hours. Based on RTime, we propose\nthree retrieval benchmark tasks: RTime-Origin, RTime-Hard, and RTime-Binary. We\nfurther enhance the use of harder-negatives in model training, and benchmark a\nvariety of video-text models on RTime. Extensive experiment analysis proves\nthat RTime indeed poses new and higher challenges to video-text retrieval. 
We\nrelease our RTime\ndataset\\footnote{\\url{https://github.com/qyr0403/Reversed-in-Time}} to further\nadvance video-text retrieval and multimodal understanding research.\n","authors":["Yang Du","Yuqi Liu","Qin Jin"],"pdf_url":"https://arxiv.org/pdf/2412.19178v1.pdf","comment":"ACMMM 2024 poster"},{"id":"http://arxiv.org/abs/2412.19168v1","updated":"2024-12-26T10:58:40Z","published":"2024-12-26T10:58:40Z","title":"GFG -- Gender-Fair Generation: A CALAMITA Challenge","summary":" Gender-fair language aims at promoting gender equality by using terms and\nexpressions that include all identities and avoid reinforcing gender\nstereotypes. Implementing gender-fair strategies is particularly challenging in\nheavily gender-marked languages, such as Italian. To address this, the\nGender-Fair Generation challenge intends to help shift toward gender-fair\nlanguage in written communication. The challenge, designed to assess and\nmonitor the recognition and generation of gender-fair language in both mono-\nand cross-lingual scenarios, includes three tasks: (1) the detection of\ngendered expressions in Italian sentences, (2) the reformulation of gendered\nexpressions into gender-fair alternatives, and (3) the generation of\ngender-fair language in automatic translation from English to Italian. 
The\nchallenge relies on three different annotated datasets: the GFL-it corpus,\nwhich contains Italian texts extracted from administrative documents provided\nby the University of Brescia; GeNTE, a bilingual test set for gender-neutral\nrewriting and translation built upon a subset of the Europarl dataset; and\nNeo-GATE, a bilingual test set designed to assess the use of non-binary\nneomorphemes in Italian for both fair formulation and translation tasks.\nFinally, each task is evaluated with specific metrics: average of F1-score\nobtained by means of BERTScore computed on each entry of the datasets for task\n1, an accuracy measured with a gender-neutral classifier, and a\ncoverage-weighted accuracy for tasks 2 and 3.\n","authors":["Simona Frenda","Andrea Piergentili","Beatrice Savoldi","Marco Madeddu","Martina Rosola","Silvia Casola","Chiara Ferrando","Viviana Patti","Matteo Negri","Luisa Bentivogli"],"pdf_url":"https://arxiv.org/pdf/2412.19168v1.pdf","comment":"To refer to this paper please cite the CEUR-ws publication available\n at https://ceur-ws.org/Vol-3878/"},{"id":"http://arxiv.org/abs/2412.19155v1","updated":"2024-12-26T10:19:20Z","published":"2024-12-26T10:19:20Z","title":"Referencing Where to Focus: Improving VisualGrounding with Referential\n Query","summary":" Visual Grounding aims to localize the referring object in an image given a\nnatural language expression. Recent advancements in DETR-based visual grounding\nmethods have attracted considerable attention, as they directly predict the\ncoordinates of the target object without relying on additional efforts, such as\npre-generated proposal candidates or pre-defined anchor boxes. However,\nexisting research primarily focuses on designing stronger multi-modal decoder,\nwhich typically generates learnable queries by random initialization or by\nusing linguistic embeddings. 
This vanilla query generation approach inevitably\nincreases the learning difficulty for the model, as it does not involve any\ntarget-related information at the beginning of decoding. Furthermore, they only\nuse the deepest image feature during the query learning process, overlooking\nthe importance of features from other levels. To address these issues, we\npropose a novel approach, called RefFormer. It consists of the query adaption\nmodule that can be seamlessly integrated into CLIP and generate the referential\nquery to provide the prior context for decoder, along with a task-specific\ndecoder. By incorporating the referential query into the decoder, we can\neffectively mitigate the learning difficulty of the decoder, and accurately\nconcentrate on the target object. Additionally, our proposed query adaption\nmodule can also act as an adapter, preserving the rich knowledge within CLIP\nwithout the need to tune the parameters of the backbone network. Extensive\nexperiments demonstrate the effectiveness and efficiency of our proposed\nmethod, outperforming state-of-the-art approaches on five visual grounding\nbenchmarks.\n","authors":["Yabing Wang","Zhuotao Tian","Qingpei Guo","Zheng Qin","Sanping Zhou","Ming Yang","Le Wang"],"pdf_url":"https://arxiv.org/pdf/2412.19155v1.pdf","comment":"Accepted by NIPS2024"},{"id":"http://arxiv.org/abs/2412.19140v1","updated":"2024-12-26T09:53:01Z","published":"2024-12-26T09:53:01Z","title":"SILC-EFSA: Self-aware In-context Learning Correction for Entity-level\n Financial Sentiment Analysis","summary":" In recent years, fine-grained sentiment analysis in finance has gained\nsignificant attention, but the scarcity of entity-level datasets remains a key\nchallenge. To address this, we have constructed the largest English and Chinese\nfinancial entity-level sentiment analysis datasets to date. 
Building on this\nfoundation, we propose a novel two-stage sentiment analysis approach called\nSelf-aware In-context Learning Correction (SILC). The first stage involves\nfine-tuning a base large language model to generate pseudo-labeled data\nspecific to our task. In the second stage, we train a correction model using a\nGNN-based example retriever, which is informed by the pseudo-labeled data. This\ntwo-stage strategy has allowed us to achieve state-of-the-art performance on\nthe newly constructed datasets, advancing the field of financial sentiment\nanalysis. In a case study, we demonstrate the enhanced practical utility of our\ndata and methods in monitoring the cryptocurrency market. Our datasets and code\nare available at https://github.com/NLP-Bin/SILC-EFSA.\n","authors":["Senbin Zhu","Chenyuan He","Hongde Liu","Pengcheng Dong","Hanjie Zhao","Yuchen Yan","Yuxiang Jia","Hongying Zan","Min Peng"],"pdf_url":"https://arxiv.org/pdf/2412.19140v1.pdf","comment":"This paper is to be published in the Proceedings of the 31st\n International Conference on Computational Linguistics (COLING 2025)"},{"id":"http://arxiv.org/abs/2412.19113v1","updated":"2024-12-26T08:13:34Z","published":"2024-12-26T08:13:34Z","title":"SketchFill: Sketch-Guided Code Generation for Imputing Derived Missing\n Values","summary":" Missing value is a critical issue in data science, significantly impacting\nthe reliability of analyses and predictions. Missing value imputation (MVI) is\na longstanding problem because it highly relies on domain knowledge. Large\nlanguage models (LLMs) have emerged as a promising tool for data cleaning,\nincluding MVI for tabular data, offering advanced capabilities for\nunderstanding and generating content. 
However, despite their promise, existing\nLLM techniques such as in-context learning and Chain-of-Thought (CoT) often\nfall short in guiding LLMs to perform complex reasoning for MVI, particularly\nwhen imputing derived missing values, which require mathematical formulas and\ndata relationships across rows and columns. This gap underscores the need for\nfurther advancements in LLM methodologies to enhance their reasoning\ncapabilities for more reliable imputation outcomes. To fill this gap, we\npropose SketchFill, a novel sketch-based method to guide LLMs in generating\naccurate formulas to impute missing numerical values. Our experimental results\ndemonstrate that SketchFill significantly outperforms state-of-the-art methods,\nachieving 56.2% higher accuracy than CoT-based methods and 78.8% higher\naccuracy than MetaGPT. This sets a new standard for automated data cleaning and\nadvances the field of MVI for numerical values.\n","authors":["Yunfan Zhang","Changlun Li","Yuyu Luo","Nan Tang"],"pdf_url":"https://arxiv.org/pdf/2412.19113v1.pdf","comment":"19 pages, 6 figures"},{"id":"http://arxiv.org/abs/2412.19102v1","updated":"2024-12-26T07:43:18Z","published":"2024-12-26T07:43:18Z","title":"\"I've Heard of You!\": Generate Spoken Named Entity Recognition Data for\n Unseen Entities","summary":" Spoken named entity recognition (NER) aims to identify named entities from\nspeech, playing an important role in speech processing. New named entities\nappear every day, however, annotating their Spoken NER data is costly. In this\npaper, we demonstrate that existing Spoken NER systems perform poorly when\ndealing with previously unseen named entities. To tackle this challenge, we\npropose a method for generating Spoken NER data based on a named entity\ndictionary (NED) to reduce costs. Specifically, we first use a large language\nmodel (LLM) to generate sentences from the sampled named entities and then use\na text-to-speech (TTS) system to generate the speech. 
Furthermore, we introduce\na noise metric to filter out noisy data. To evaluate our approach, we release a\nnovel Spoken NER benchmark along with a corresponding NED containing 8,853\nentities. Experiment results show that our method achieves state-of-the-art\n(SOTA) performance in the in-domain, zero-shot domain adaptation, and fully\nzero-shot settings. Our data will be available at\nhttps://github.com/DeepLearnXMU/HeardU.\n","authors":["Jiawei Yu","Xiang Geng","Yuang Li","Mengxin Ren","Wei Tang","Jiahuan Li","Zhibin Lan","Min Zhang","Hao Yang","Shujian Huang","Jinsong Su"],"pdf_url":"https://arxiv.org/pdf/2412.19102v1.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.02371v3","updated":"2024-12-26T07:34:28Z","published":"2024-12-03T10:57:19Z","title":"TSCheater: Generating High-Quality Tibetan Adversarial Texts via Visual\n Similarity","summary":" Language models based on deep neural networks are vulnerable to textual\nadversarial attacks. While rich-resource languages like English are receiving\nfocused attention, Tibetan, a cross-border language, is gradually being studied\ndue to its abundant ancient literature and critical language strategy.\nCurrently, there are several Tibetan adversarial text generation methods, but\nthey do not fully consider the textual features of Tibetan script and\noverestimate the quality of generated adversarial texts. To address this issue,\nwe propose a novel Tibetan adversarial text generation method called TSCheater,\nwhich considers the characteristic of Tibetan encoding and the feature that\nvisually similar syllables have similar semantics. This method can also be\ntransferred to other abugidas, such as Devanagari script. We utilize a\nself-constructed Tibetan syllable visual similarity database called TSVSDB to\ngenerate substitution candidates and adopt a greedy algorithm-based scoring\nmechanism to determine substitution order. After that, we conduct the method on\neight victim language models. 
Experimentally, TSCheater outperforms existing\nmethods in attack effectiveness, perturbation magnitude, semantic similarity,\nvisual similarity, and human acceptance. Finally, we construct the first\nTibetan adversarial robustness evaluation benchmark called AdvTS, which is\ngenerated by existing methods and proofread by humans.\n","authors":["Xi Cao","Quzong Gesang","Yuan Sun","Nuo Qun","Tashi Nyima"],"pdf_url":"https://arxiv.org/pdf/2412.02371v3.pdf","comment":"Camera-Ready Version; Accepted at ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.19087v1","updated":"2024-12-26T06:57:04Z","published":"2024-12-26T06:57:04Z","title":"MoPD: Mixture-of-Prompts Distillation for Vision-Language Models","summary":" Soft prompt learning methods are effective for adapting vision-language\nmodels (VLMs) to downstream tasks. Nevertheless, empirical evidence reveals a\ntendency of existing methods that they overfit seen classes and exhibit\ndegraded performance on unseen classes. This limitation is due to the inherent\nbias in the training data towards the seen classes. To address this issue, we\npropose a novel soft prompt learning method, named Mixture-of-Prompts\nDistillation (MoPD), which can effectively transfer useful knowledge from hard\nprompts manually hand-crafted (a.k.a. teacher prompts) to the learnable soft\nprompt (a.k.a. student prompt), thereby enhancing the generalization ability of\nsoft prompts on unseen classes. 
Moreover, the proposed MoPD method utilizes a\ngating network that learns to select hard prompts used for prompt distillation.\nExtensive experiments demonstrate that the proposed MoPD method outperforms\nstate-of-the-art baselines especially on unseen classes.\n","authors":["Yang Chen","Shuai Fu","Yu Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.19087v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.12142v2","updated":"2024-12-26T06:39:38Z","published":"2024-08-22T05:59:47Z","title":"MDD-5k: A New Diagnostic Conversation Dataset for Mental Disorders\n Synthesized via Neuro-Symbolic LLM Agents","summary":" The clinical diagnosis of most mental disorders primarily relies on the\nconversations between psychiatrist and patient. The creation of such diagnostic\nconversation datasets is promising to boost the AI mental healthcare community.\nHowever, directly collecting the conversations in real diagnosis scenarios is\nnear impossible due to stringent privacy and ethical considerations. To address\nthis issue, we seek to synthesize diagnostic conversation by exploiting\nanonymized patient cases that are easier to access. Specifically, we design a\nneuro-symbolic multi-agent framework for synthesizing the diagnostic\nconversation of mental disorders with large language models. It takes patient\ncase as input and is capable of generating multiple diverse conversations with\none single patient case. The framework basically involves the interaction\nbetween a doctor agent and a patient agent, and generates conversations under\nsymbolic control via a dynamic diagnosis tree. By applying the proposed\nframework, we develop the largest Chinese mental disorders diagnosis dataset\nMDD-5k. This dataset is built upon 1000 real, anonymized patient cases by\ncooperating with Shanghai Mental Health Center and comprises 5000 high-quality\nlong conversations with diagnosis results and treatment opinions as labels. 
To\nthe best of our knowledge, it's also the first labeled dataset for Chinese\nmental disorders diagnosis. Human evaluation demonstrates the proposed MDD-5k\ndataset successfully simulates human-like diagnostic process of mental\ndisorders.\n","authors":["Congchi Yin","Feng Li","Shu Zhang","Zike Wang","Jun Shao","Piji Li","Jianhua Chen","Xun Jiang"],"pdf_url":"https://arxiv.org/pdf/2408.12142v2.pdf","comment":"Accepted by the 39th Annual AAAI Conference on Artificial\n Intelligence"},{"id":"http://arxiv.org/abs/2412.19076v1","updated":"2024-12-26T06:23:53Z","published":"2024-12-26T06:23:53Z","title":"Advancing LLM detection in the ALTA 2024 Shared Task: Techniques and\n Analysis","summary":" The recent proliferation of AI-generated content has prompted significant\ninterest in developing reliable detection methods. This study explores\ntechniques for identifying AI-generated text through sentence-level evaluation\nwithin hybrid articles. Our findings indicate that ChatGPT-3.5 Turbo exhibits\ndistinct, repetitive probability patterns that enable consistent in-domain\ndetection. Empirical tests show that minor textual modifications, such as\nrewording, have minimal impact on detection accuracy. These results provide\nvaluable insights for advancing AI detection methodologies, offering a pathway\ntoward robust solutions to address the complexities of synthetic text\nidentification.\n","authors":["Dima Galat"],"pdf_url":"https://arxiv.org/pdf/2412.19076v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.12639v2","updated":"2024-12-26T06:20:21Z","published":"2024-12-17T08:02:08Z","title":"Falcon: Faster and Parallel Inference of Large Language Models through\n Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree","summary":" Striking an optimal balance between minimal drafting latency and high\nspeculation accuracy to enhance the inference speed of Large Language Models\nremains a significant challenge in speculative decoding. 
In this paper, we\nintroduce Falcon, an innovative semi-autoregressive speculative decoding\nframework fashioned to augment both the drafter's parallelism and output\nquality. Falcon incorporates the Coupled Sequential Glancing Distillation\ntechnique, which fortifies inter-token dependencies within the same block,\nleading to increased speculation accuracy. We offer a comprehensive theoretical\nanalysis to illuminate the underlying mechanisms. Additionally, we introduce a\nCustom-Designed Decoding Tree, which permits the drafter to generate multiple\ntokens in a single forward pass and accommodates multiple forward passes as\nneeded, thereby boosting the number of drafted tokens and significantly\nimproving the overall acceptance rate. Comprehensive evaluations on benchmark\ndatasets such as MT-Bench, HumanEval, and GSM8K demonstrate Falcon's superior\nacceleration capabilities. The framework achieves a lossless speedup ratio\nranging from 2.91x to 3.51x when tested on the Vicuna and LLaMA2-Chat model\nseries. These results outstrip existing speculative decoding methods for LLMs,\nincluding Eagle, Medusa, Lookahead, SPS, and PLD, while maintaining a compact\ndrafter architecture equivalent to merely two Transformer layers.\n","authors":["Xiangxiang Gao","Weisheng Xie","Yiwei Xiang","Feng Ji"],"pdf_url":"https://arxiv.org/pdf/2412.12639v2.pdf","comment":"AAAI 2025 Accepted"},{"id":"http://arxiv.org/abs/2412.19072v1","updated":"2024-12-26T06:05:52Z","published":"2024-12-26T06:05:52Z","title":"Robust Speech and Natural Language Processing Models for Depression\n Screening","summary":" Depression is a global health concern with a critical need for increased\npatient screening. Speech technology offers advantages for remote screening but\nmust perform robustly across patients. We have described two deep learning\nmodels developed for this purpose. One model is based on acoustics; the other\nis based on natural language processing. 
Both models employ transfer learning.\nData from a depression-labeled corpus in which 11,000 unique users interacted\nwith a human-machine application using conversational speech is used. Results\non binary depression classification have shown that both models perform at or\nabove AUC=0.80 on unseen data with no speaker overlap. Performance is further\nanalyzed as a function of test subset characteristics, finding that the models\nare generally robust over speaker and session variables. We conclude that\nmodels based on these approaches offer promise for generalized automated\ndepression screening.\n","authors":["Y. Lu","A. Harati","T. Rutowski","R. Oliveira","P. Chlebek","E. Shriberg"],"pdf_url":"https://arxiv.org/pdf/2412.19072v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19070v1","updated":"2024-12-26T05:54:24Z","published":"2024-12-26T05:54:24Z","title":"Cross-Demographic Portability of Deep NLP-Based Depression Models","summary":" Deep learning models are rapidly gaining interest for real-world applications\nin behavioral health. An important gap in current literature is how well such\nmodels generalize over different populations. We study Natural Language\nProcessing (NLP) based models to explore portability over two different corpora\nhighly mismatched in age. The first and larger corpus contains younger\nspeakers. It is used to train an NLP model to predict depression. When testing\non unseen speakers from the same age distribution, this model performs at\nAUC=0.82. We then test this model on the second corpus, which comprises seniors\nfrom a retirement community. Despite the large demographic differences in the\ntwo corpora, we saw only modest degradation in performance for the\nsenior-corpus data, achieving AUC=0.76. Interestingly, in the senior\npopulation, we find AUC=0.81 for the subset of patients whose health state is\nconsistent over time. 
Implications for demographic portability of speech-based\napplications are discussed.\n","authors":["Tomek Rutowski","Elizabeth Shriberg","Amir Harati","Yang Lu","Ricardo Oliveira","Piotr Chlebek"],"pdf_url":"https://arxiv.org/pdf/2412.19070v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16844v3","updated":"2024-12-26T04:41:11Z","published":"2024-12-22T03:43:51Z","title":"Sim911: Towards Effective and Equitable 9-1-1 Dispatcher Training with\n an LLM-Enabled Simulation","summary":" Emergency response services are vital for enhancing public safety by\nsafeguarding the environment, property, and human lives. As frontline members\nof these services, 9-1-1 dispatchers have a direct impact on response times and\nthe overall effectiveness of emergency operations. However, traditional\ndispatcher training methods, which rely on role-playing by experienced\npersonnel, are labor-intensive, time-consuming, and often neglect the specific\nneeds of underserved communities. To address these challenges, we introduce\nSim911, the first training simulation for 9-1-1 dispatchers powered by Large\nLanguage Models (LLMs). 
Sim911 enhances training through three key technical\ninnovations: (1) knowledge construction, which utilizes archived 9-1-1 call\ndata to generate simulations that closely mirror real-world scenarios; (2)\ncontext-aware controlled generation, which employs dynamic prompts and vector\nbases to ensure that LLM behavior aligns with training objectives; and (3)\nvalidation with looped correction, which filters out low-quality responses and\nrefines the system performance.\n","authors":["Zirong Chen","Elizabeth Chason","Noah Mladenovski","Erin Wilson","Kristin Mullen","Stephen Martini","Meiyi Ma"],"pdf_url":"https://arxiv.org/pdf/2412.16844v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19043v1","updated":"2024-12-26T03:37:40Z","published":"2024-12-26T03:37:40Z","title":"Indonesian-English Code-Switching Speech Synthesizer Utilizing\n Multilingual STEN-TTS and Bert LID","summary":" Multilingual text-to-speech systems convert text into speech across multiple\nlanguages. In many cases, text sentences may contain segments in different\nlanguages, a phenomenon known as code-switching. This is particularly common in\nIndonesia, especially between Indonesian and English. Despite its significance,\nno research has yet developed a multilingual TTS system capable of handling\ncode-switching between these two languages. This study addresses\nIndonesian-English code-switching in STEN-TTS. Key modifications include adding\na language identification component to the text-to-phoneme conversion using\nfinetuned BERT for per-word language identification, as well as removing\nlanguage embedding from the base model. 
Experimental results demonstrate that\nthe code-switching model achieves superior naturalness and improved speech\nintelligibility compared to the Indonesian and English baseline STEN-TTS\nmodels.\n","authors":["Ahmad Alfani Handoyo","Chung Tran","Dessi Puji Lestari","Sakriani Sakti"],"pdf_url":"https://arxiv.org/pdf/2412.19043v1.pdf","comment":"Accepted at O-COCOSDA 2024"},{"id":"http://arxiv.org/abs/2411.06175v3","updated":"2024-12-26T02:47:15Z","published":"2024-11-09T13:17:39Z","title":"Clustering Algorithms and RAG Enhancing Semi-Supervised Text\n Classification with Large LLMs","summary":" This paper proposes a Clustering, Labeling, then Augmenting framework that\nsignificantly enhances performance in Semi-Supervised Text Classification\n(SSTC) tasks, effectively addressing the challenge of vast datasets with\nlimited labeled examples. Unlike traditional SSTC approaches that rely on a\npredefined small set of labeled data to generate pseudo-labels for the\nunlabeled data, this framework innovatively employs clustering to select\nrepresentative \"landmarks\" for labeling. These landmarks subsequently act as\nintermediaries in an ensemble of augmentation techniques, including\nRetrieval-Augmented Generation (RAG), Large Language Model (LLMs)-based\nrewriting, and synonym substitution, to generate synthetic labeled data without\nmaking pseudo-labels for the unlabeled data. Empirical results show that even\nin complex text document classification scenarios involving over 100\ncategories, our method achieves state-of-the-art accuracies of 95.41% on the\nReuters dataset and 82.43% on the Web of Science dataset. Our approach\nsignificantly reduces the reliance on human labeling efforts and the associated\nexpenses, while simultaneously ensuring high data quality and minimizing\nprivacy risks. 
The finetuning results further show the efficiency of\nfine-tuning LLMs for text classification tasks, highlighting a robust solution\nfor leveraging limited labeled data.\n","authors":["Shan Zhong","Jiahao Zeng","Yongxin Yu","Bohong Lin"],"pdf_url":"https://arxiv.org/pdf/2411.06175v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.03631v2","updated":"2024-12-26T02:14:28Z","published":"2024-08-07T08:43:32Z","title":"Large Language Model as a Catalyst: A Paradigm Shift in Base Station\n Siting Optimization","summary":" Traditional base station siting (BSS) methods rely heavily on drive testing\nand user feedback, which are laborious and require extensive expertise in\ncommunication, networking, and optimization. As large language models (LLMs)\nand their associated technologies advance, particularly in the realms of prompt\nengineering and agent engineering, network optimization will witness a\nrevolutionary approach. This approach entails the strategic use of well-crafted\nprompts to infuse human experience and knowledge into these sophisticated LLMs,\nand the deployment of autonomous agents as a communication bridge to seamlessly\nconnect the machine language based LLMs with human users using natural\nlanguage. Furthermore, our proposed framework incorporates retrieval-augmented\ngeneration (RAG) to enhance the system's ability to acquire domain-specific\nknowledge and generate solutions, thereby enabling the customization and\noptimization of the BSS process. 
This integration represents the future\nparadigm of artificial intelligence (AI) as a service, with greater ease of\nuse. This research first develops a novel LLM-empowered BSS optimization\nframework, and heuristically proposes three different potential\nimplementations: strategies based on a Prompt-optimized LLM (PoL), an\nLLM-empowered autonomous BSS agent (LaBa), and Cooperative multiple LLM-based\nautonomous BSS agents (CLaBa). Through evaluation on real-world data, the\nexperiments demonstrate that prompt-assisted LLMs and LLM-based agents can\ngenerate more efficient and reliable network deployments, noticeably enhancing\nthe efficiency of BSS optimization and reducing the need for manual\nintervention.\n","authors":["Yanhu Wang","Muhammad Muzammil Afzal","Zhengyang Li","Jie Zhou","Chenyuan Feng","Shuaishuai Guo","Tony Q. S. Quek"],"pdf_url":"https://arxiv.org/pdf/2408.03631v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19018v1","updated":"2024-12-26T01:56:42Z","published":"2024-12-26T01:56:42Z","title":"Let the Rule Speak: Enhancing In-context Learning Debiasing with\n Interpretability","summary":" In-context learning, which allows large language models to perform diverse\ntasks with a few demonstrations, is found to have imbalanced per-class\nprediction accuracy on multi-class text classification. Although notable output\ncorrection methods have been developed to tackle the issue and simultaneously\nimprove downstream prediction accuracy, they may fail to answer the core\ninterpretability challenges: why certain classes need corrections and which\nones, and, more importantly, how to tailor a correction to each sample's\nper-class probability. 
To address such interpretability gaps, we first find that the\nimbalance arises from certain classes consistently receiving high ICL output\nprobabilities, whereas others receive lower or mixed ranges, so the former are\nmore frequently chosen, resulting in higher accuracy; more crucially, we find\nthat these ranges have significantly varying degrees of influence on the\naccuracy bias, highlighting the need for precise, interpretable probability\ncorrections by range. Motivated by this, we propose FuRud, a Fuzzy Rule\nOptimization based Debiasing method that (1) detects which classes need\ncorrections, and (2) for each correction-needed class, detects its probability\nranges and applies asymmetric amplifications or reductions to correct them\ninterpretably. Notably, across seven benchmark datasets, FuRud reduces the\npairwise class accuracy bias (COBias) by more than half (56%), while achieving\na relative increase of 21% in accuracy, outperforming state-of-the-art\ndebiasing methods. Moreover, FuRud can optimize downstream tasks with as few as\n10 optimization examples. Furthermore, FuRud works for prompt formats that\nlead to highly skewed predictions. For example, FuRud greatly improves ICL\noutputs that use letter options, with a 44% relative accuracy increase and a\n54% relative COBias reduction.\n","authors":["Ruixi Lin","Yang You"],"pdf_url":"https://arxiv.org/pdf/2412.19018v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.08925v2","updated":"2024-12-26T00:15:20Z","published":"2024-02-14T03:56:27Z","title":"MaxMin-RLHF: Alignment with Diverse Human Preferences","summary":" Reinforcement Learning from Human Feedback (RLHF) aligns language models to\nhuman preferences by employing a singular reward model derived from preference\ndata. However, such an approach overlooks the rich diversity of human\npreferences inherent in data collected from multiple users. 
In this work, we\nfirst derive an impossibility result of alignment with single reward RLHF,\nthereby highlighting its insufficiency in representing diverse human\npreferences. To provide an equitable solution to the problem, we learn a\nmixture of preference distributions via an expectation-maximization algorithm\nand propose a MaxMin alignment objective for policy learning inspired by the\nEgalitarian principle in social choice theory to better represent diverse human\npreferences. We elucidate the connection of our proposed approach to\ndistributionally robust optimization and general utility RL, thereby\nhighlighting the generality and robustness of our proposed solution. We present\ncomprehensive experimental results on small-scale (GPT-2) and large-scale\nlanguage models (with Tulu2-7B) and show the efficacy of the proposed approach\nin the presence of diversity among human preferences. Our algorithm achieves an\naverage improvement of more than 16% in win-rates over conventional RLHF\nalgorithms and improves the win-rate (accuracy) for minority groups by over 33%\nwithout compromising the performance of majority groups, showcasing the\nrobustness and fairness of our approach. We remark that our findings in this\nwork are not only limited to language models but also extend to reinforcement\nlearning in general.\n","authors":["Souradip Chakraborty","Jiahao Qiu","Hui Yuan","Alec Koppel","Furong Huang","Dinesh Manocha","Amrit Singh Bedi","Mengdi Wang"],"pdf_url":"https://arxiv.org/pdf/2402.08925v2.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2305.13168v4","updated":"2024-12-26T18:54:53Z","published":"2023-05-22T15:56:44Z","title":"LLMs for Knowledge Graph Construction and Reasoning: Recent Capabilities\n and Future Opportunities","summary":" This paper presents an exhaustive quantitative and qualitative evaluation of\nLarge Language Models (LLMs) for Knowledge Graph (KG) construction and\nreasoning. 
We engage in experiments across eight diverse datasets, focusing on\nfour representative tasks encompassing entity and relation extraction, event\nextraction, link prediction, and question answering, thereby thoroughly\nexploring LLMs' performance in the domain of construction and inference.\nEmpirically, our findings suggest that LLMs, represented by GPT-4, are more\nsuited as inference assistants than as few-shot information extractors.\nSpecifically, while GPT-4 exhibits good performance in tasks related to KG\nconstruction, it excels further in reasoning tasks, surpassing fine-tuned\nmodels in certain cases. Moreover, our investigation extends to the potential\ngeneralization ability of LLMs for information extraction, leading to the\nproposition of a Virtual Knowledge Extraction task and the development of the\ncorresponding VINE dataset. Based on these empirical findings, we further\npropose AutoKG, a multi-agent-based approach employing LLMs and external\nsources for KG construction and reasoning. We anticipate that this research can\nprovide invaluable insights for future undertakings in the field of knowledge\ngraphs. The code and datasets are available at\nhttps://github.com/zjunlp/AutoKG.\n","authors":["Yuqi Zhu","Xiaohan Wang","Jing Chen","Shuofei Qiao","Yixin Ou","Yunzhi Yao","Shumin Deng","Huajun Chen","Ningyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.13168v4.pdf","comment":"World Wide Web Journal"},{"id":"http://arxiv.org/abs/2412.19312v1","updated":"2024-12-26T18:19:53Z","published":"2024-12-26T18:19:53Z","title":"From Interests to Insights: An LLM Approach to Course Recommendations\n Using Natural Language Queries","summary":" Most universities in the United States encourage their students to explore\nacademic areas before declaring a major and to acquire academic breadth by\nsatisfying a variety of requirements. Each term, students must choose a handful\nof courses to take from among many thousands of offerings spanning dozens of\nsubject areas. 
The curricular environment is also dynamic, and poor\ncommunication and search functions on campus can limit a student's ability to\ndiscover new courses of interest. To support both students and their advisers\nin such a setting, we explore a novel Large Language Model (LLM) course\nrecommendation system that applies a Retrieval Augmented Generation (RAG)\nmethod to the corpus of course descriptions. The system first generates an\n'ideal' course description based on the user's query. This description is\nconverted into a search vector using embeddings, which is then used to find\nactual courses with similar content by comparing embedding similarities. We\ndescribe the method and assess the quality and fairness of some example\nprompts. Steps to deploy a pilot system on campus are discussed.\n","authors":["Hugh Van Deventer","Mark Mills","August Evrard"],"pdf_url":"https://arxiv.org/pdf/2412.19312v1.pdf","comment":"17 pages, 9 figures"},{"id":"http://arxiv.org/abs/2412.19302v1","updated":"2024-12-26T17:51:54Z","published":"2024-12-26T17:51:54Z","title":"RecLM: Recommendation Instruction Tuning","summary":" Modern recommender systems aim to deeply understand users' complex\npreferences through their past interactions. While deep collaborative filtering\napproaches using Graph Neural Networks (GNNs) excel at capturing user-item\nrelationships, their effectiveness is limited when handling sparse data or\nzero-shot scenarios, primarily due to constraints in ID-based embedding\nfunctions. To address these challenges, we propose a model-agnostic\nrecommendation instruction-tuning paradigm that seamlessly integrates large\nlanguage models with collaborative filtering. Our proposed Recommendation\nLanguage Model (RecLM) enhances the capture of user preference diversity\nthrough a carefully designed reinforcement learning reward function that\nfacilitates self-augmentation of language models. 
Comprehensive evaluations\ndemonstrate significant advantages of our approach across various settings, and\nits plug-and-play compatibility with state-of-the-art recommender systems\nresults in notable performance enhancements.\n","authors":["Yangqin Jiang","Yuhao Yang","Lianghao Xia","Da Luo","Kangyi Lin","Chao Huang"],"pdf_url":"https://arxiv.org/pdf/2412.19302v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19265v1","updated":"2024-12-26T16:05:19Z","published":"2024-12-26T16:05:19Z","title":"Optimizing Multi-Stage Language Models for Effective Text Retrieval","summary":" Efficient text retrieval is critical for applications such as legal document\nanalysis, particularly in specialized contexts like Japanese legal systems.\nExisting retrieval methods often underperform in such domain-specific\nscenarios, necessitating tailored approaches. In this paper, we introduce a\nnovel two-phase text retrieval pipeline optimized for Japanese legal datasets.\nOur method leverages advanced language models to achieve state-of-the-art\nperformance, significantly improving retrieval efficiency and accuracy. To\nfurther enhance robustness and adaptability, we incorporate an ensemble model\nthat integrates multiple retrieval strategies, resulting in superior outcomes\nacross diverse tasks. Extensive experiments validate the effectiveness of our\napproach, demonstrating strong performance on both Japanese legal datasets and\nwidely recognized benchmarks like MS-MARCO. 
Our work establishes new standards\nfor text retrieval in domain-specific and general contexts, providing a\ncomprehensive solution for addressing complex queries in legal and multilingual\nenvironments.\n","authors":["Quang Hoang Trung","Le Trung Hoang","Nguyen Van Hoang Phuc"],"pdf_url":"https://arxiv.org/pdf/2412.19265v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.04739v3","updated":"2024-12-26T13:58:31Z","published":"2024-10-07T04:15:02Z","title":"TableRAG: Million-Token Table Understanding with Language Models","summary":" Recent advancements in language models (LMs) have notably enhanced their\nability to reason with tabular data, primarily through program-aided mechanisms\nthat manipulate and analyze tables. However, these methods often require the\nentire table as input, leading to scalability challenges due to the positional\nbias or context length constraints. In response to these challenges, we\nintroduce TableRAG, a Retrieval-Augmented Generation (RAG) framework\nspecifically designed for LM-based table understanding. TableRAG leverages\nquery expansion combined with schema and cell retrieval to pinpoint crucial\ninformation before providing it to the LMs. This enables more efficient data\nencoding and precise retrieval, significantly reducing prompt lengths and\nmitigating information loss. We have developed two new million-token benchmarks\nfrom the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG's\neffectiveness at scale. 
Our results demonstrate that TableRAG's retrieval\ndesign achieves the highest retrieval quality, leading to the new\nstate-of-the-art performance on large-scale table understanding.\n","authors":["Si-An Chen","Lesly Miculicich","Julian Martin Eisenschlos","Zifeng Wang","Zilong Wang","Yanfei Chen","Yasuhisa Fujii","Hsuan-Tien Lin","Chen-Yu Lee","Tomas Pfister"],"pdf_url":"https://arxiv.org/pdf/2410.04739v3.pdf","comment":"Accepted to NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.19200v1","updated":"2024-12-26T12:47:35Z","published":"2024-12-26T12:47:35Z","title":"Personalized Dynamic Music Emotion Recognition with Dual-Scale\n Attention-Based Meta-Learning","summary":" Dynamic Music Emotion Recognition (DMER) aims to predict the emotion of\ndifferent moments in music, playing a crucial role in music information\nretrieval. The existing DMER methods struggle to capture long-term dependencies\nwhen dealing with sequence data, which limits their performance. Furthermore,\nthese methods often overlook the influence of individual differences on emotion\nperception, even though everyone has their own personalized emotional\nperception in the real world. Motivated by these issues, we explore more\neffective sequence processing methods and introduce the Personalized DMER\n(PDMER) problem, which requires models to predict emotions that align with\npersonalized perception. Specifically, we propose a Dual-Scale Attention-Based\nMeta-Learning (DSAML) method. This method fuses features from a dual-scale\nfeature extractor and captures both short and long-term dependencies using a\ndual-scale attention transformer, improving the performance in traditional\nDMER. To achieve PDMER, we design a novel task construction strategy that\ndivides tasks by annotators. Samples in a task are annotated by the same\nannotator, ensuring consistent perception. 
Leveraging this strategy alongside\nmeta-learning, DSAML can predict personalized perception of emotions with just\none personalized annotation sample. Our objective and subjective experiments\ndemonstrate that our method can achieve state-of-the-art performance in both\ntraditional DMER and PDMER.\n","authors":["Dengming Zhang","Weitao You","Ziheng Liu","Lingyun Sun","Pei Chen"],"pdf_url":"https://arxiv.org/pdf/2412.19200v1.pdf","comment":"Accepted by the 39th AAAI Conference on Artificial Intelligence\n (AAAI-25)"},{"id":"http://arxiv.org/abs/2412.19178v1","updated":"2024-12-26T11:32:00Z","published":"2024-12-26T11:32:00Z","title":"Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal\n Video-Text Retrieval","summary":" Cross-modal (e.g. image-text, video-text) retrieval is an important task in\ninformation retrieval and multimodal vision-language understanding field.\nTemporal understanding makes video-text retrieval more challenging than\nimage-text retrieval. However, we find that the widely used video-text\nbenchmarks have shortcomings in comprehensively assessing abilities of models,\nespecially in temporal understanding, causing large-scale image-text\npre-trained models can already achieve comparable zero-shot performance with\nvideo-text pre-trained models. In this paper, we introduce RTime, a novel\ntemporal-emphasized video-text retrieval dataset. We first obtain videos of\nactions or events with significant temporality, and then reverse these videos\nto create harder negative samples. We then recruit annotators to judge the\nsignificance and reversibility of candidate videos, and write captions for\nqualified videos. We further adopt GPT-4 to extend more captions based on\nhuman-written captions. Our RTime dataset currently consists of 21k videos with\n10 captions per video, totalling about 122 hours. Based on RTime, we propose\nthree retrieval benchmark tasks: RTime-Origin, RTime-Hard, and RTime-Binary. 
We\nfurther enhance the use of harder negatives in model training, and benchmark a\nvariety of video-text models on RTime. Extensive experimental analysis confirms\nthat RTime indeed poses new and higher challenges to video-text retrieval. We\nrelease our RTime\ndataset\\footnote{\\url{https://github.com/qyr0403/Reversed-in-Time}} to further\nadvance video-text retrieval and multimodal understanding research.\n","authors":["Yang Du","Yuqi Liu","Qin Jin"],"pdf_url":"https://arxiv.org/pdf/2412.19178v1.pdf","comment":"ACMMM 2024 poster"},{"id":"http://arxiv.org/abs/2412.19172v1","updated":"2024-12-26T11:06:49Z","published":"2024-12-26T11:06:49Z","title":"Towards Popularity-Aware Recommendation: A Multi-Behavior Enhanced\n Framework with Orthogonality Constraint","summary":" Top-$K$ recommendation involves inferring latent user preferences and\ngenerating personalized recommendations accordingly, which is now ubiquitous in\nvarious decision systems. Nonetheless, recommender systems usually suffer from\nsevere \\textit{popularity bias}, leading to the over-recommendation of popular\nitems. Such a bias deviates from the central aim of reflecting user preference\nfaithfully, compromising both customer satisfaction and retailer profits.\nDespite its prevalence, existing methods tackling popularity bias still have\nlimitations due to the considerable accuracy-debias tradeoff and the\nsensitivity to extensive parameter selection, further exacerbated by the\nextreme sparsity in positive user-item interactions.\n In this paper, we present a \\textbf{Pop}ularity-aware top-$K$ recommendation\nalgorithm integrating multi-behavior \\textbf{S}ide \\textbf{I}nformation\n(PopSI), aiming to enhance recommendation accuracy and debias performance\nsimultaneously. Specifically, by leveraging multiple forms of user feedback\nthat mirror similar user preferences and formulating them as a\nthree-dimensional tensor, PopSI can utilize all slices to capture the desired\nuser preferences effectively. 
Subsequently, we introduce a novel orthogonality constraint to\nrefine the estimated item feature space, enforcing it to be invariant to item\npopularity features, thereby addressing our model's sensitivity to popularity\nbias. Comprehensive experiments on real-world e-commerce datasets demonstrate\nthe general improvements of PopSI over state-of-the-art debiasing methods with\na marginal accuracy-debias tradeoff and scalability to practical applications.\nThe source code for our algorithm and experiments is available at\n\\url{https://github.com/Eason-sys/PopSI}.\n","authors":["Yishan Han","Biao Xu","Yao Wang","Shanxing Gao"],"pdf_url":"https://arxiv.org/pdf/2412.19172v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.12730v3","updated":"2024-12-26T09:20:17Z","published":"2024-09-19T12:55:34Z","title":"When SparseMoE Meets Noisy Interactions: An Ensemble View on Denoising\n Recommendation","summary":" Learning user preferences from implicit feedback is one of the core\nchallenges in recommendation. The difficulty lies in the potential noise within\nimplicit feedback. Therefore, various denoising recommendation methods have\nbeen proposed recently. However, most of them overly rely on hyperparameter\nconfigurations, inevitably leading to inadequacies in model adaptability and\ngeneralization performance. In this study, we propose a novel Adaptive Ensemble\nLearning (AEL) method for denoising recommendation, which employs a sparse\ngating network as a brain, selecting suitable experts to synthesize appropriate\ndenoising capacities for different data samples. To address the ensemble\nlearning shortcoming of model complexity and ensure sub-recommender diversity,\nwe also propose a novel method that stacks components to create\nsub-recommenders instead of directly constructing them. Extensive experiments\nacross various datasets demonstrate that AEL outperforms other methods on a\nvariety of popular metrics, even in the presence of substantial and dynamic\nnoise. 
Our\ncode is available at https://github.com/cpu9xx/AEL.\n","authors":["Weipu Chen","Zhuangzhuang He","Fei Liu"],"pdf_url":"https://arxiv.org/pdf/2409.12730v3.pdf","comment":"Accepted at ICASSP 2025. 5pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.19069v1","updated":"2024-12-26T05:53:10Z","published":"2024-12-26T05:53:10Z","title":"Effective and secure federated online learning to rank","summary":" Online Learning to Rank (OLTR) optimises ranking models using implicit user\nfeedback, such as clicks. Unlike traditional Learning to Rank (LTR) methods\nthat rely on a static set of training data with relevance judgements to learn a\nranking model, OLTR methods update the model continually as new data arrives.\nThus, it addresses several drawbacks such as the high cost of human\nannotations, potential misalignment between user preferences and human\njudgments, and the rapid changes in user query intents. However, OLTR methods\ntypically require the collection of searchable data, user queries, and clicks,\nwhich poses privacy concerns for users.\n Federated Online Learning to Rank (FOLTR) integrates OLTR within a Federated\nLearning (FL) framework to enhance privacy by not sharing raw data. While\npromising, FOLTR methods currently lag behind traditional centralised OLTR due\nto challenges in ranking effectiveness, robustness with respect to data\ndistribution across clients, susceptibility to attacks, and the ability to\nunlearn client interactions and data. 
This thesis presents a comprehensive\nstudy on Federated Online Learning to Rank, addressing its effectiveness,\nrobustness, security, and unlearning capabilities, thereby expanding the\nlandscape of FOLTR.\n","authors":["Shuyi Wang"],"pdf_url":"https://arxiv.org/pdf/2412.19069v1.pdf","comment":"PhD Thesis"},{"id":"http://arxiv.org/abs/2412.19048v1","updated":"2024-12-26T04:05:28Z","published":"2024-12-26T04:05:28Z","title":"Jasper and Stella: distillation of SOTA embedding models","summary":" A crucial component of many deep learning applications (such as FAQ and RAG)\nis dense retrieval, in which embedding models are used to convert raw text to\nnumerical vectors and then get the most similar text by MIPS (Maximum Inner\nProduct Search). Some text embedding benchmarks (e.g. MTEB, BEIR, and\nAIR-Bench) have been established to evaluate embedding models accurately.\nThanks to these benchmarks, we can use SOTA models; however, the deployment and\napplication of these models in industry were hampered by their large vector\ndimensions and numerous parameters. To alleviate this problem, 1) we present a\ndistillation technique that can enable a smaller student model to achieve good\nperformance. 2) Inspired by MRL we present a training approach of reducing the\nvector dimensions based on its own vectors or its teacher vectors. 3) We do\nsimple yet effective alignment training between images and text to make our\nmodel a multimodal encoder. We trained Stella and Jasper models using the\ntechnologies above and achieved high scores on the MTEB leaderboard. 
We release\nthe model and data on the Hugging Face Hub\n(https://huggingface.co/infgrad/jasper_en_vision_language_v1), and the training\nlogs are at https://api.wandb.ai/links/dunnzhang0/z8jqoqpb.\n","authors":["Dun Zhang","Fulong Wang"],"pdf_url":"https://arxiv.org/pdf/2412.19048v1.pdf","comment":"7 pages, 1 figure"},{"id":"http://arxiv.org/abs/2405.03988v3","updated":"2024-12-26T03:03:30Z","published":"2024-05-07T04:00:30Z","title":"LEARN: Knowledge Adaptation from Large Language Model to Recommendation\n for Practical Industrial Application","summary":" Contemporary recommendation systems predominantly rely on ID embedding to\ncapture latent associations among users and items. However, this approach\noverlooks the wealth of semantic information embedded within textual\ndescriptions of items, leading to suboptimal performance and poor\ngeneralization. Leveraging the capability of large language models to\ncomprehend and reason about textual content presents a promising avenue for\nadvancing recommendation systems. To achieve this, we propose an Llm-driven\nknowlEdge Adaptive RecommeNdation (LEARN) framework that synergizes open-world\nknowledge with collaborative knowledge. We address computational complexity\nconcerns by utilizing pretrained LLMs as item encoders and freezing LLM\nparameters to avoid catastrophic forgetting and preserve open-world knowledge.\nTo bridge the gap between the open-world and collaborative domains, we design a\ntwin-tower structure supervised by the recommendation task and tailored for\npractical industrial application. Through experiments on a real large-scale\nindustrial dataset and online A/B tests, we demonstrate the efficacy of our\napproach in industrial applications. 
We also achieve state-of-the-art performance\non six Amazon Review datasets to verify the superiority of our method.\n","authors":["Jian Jia","Yipei Wang","Yan Li","Honggang Chen","Xuehan Bai","Zhaocheng Liu","Jian Liang","Quan Chen","Han Li","Peng Jiang","Kun Gai"],"pdf_url":"https://arxiv.org/pdf/2405.03988v3.pdf","comment":"Accepted by AAAI 2025. Codes are released at\n https://github.com/adxcreative/LEARN"},{"id":"http://arxiv.org/abs/2408.15620v2","updated":"2024-12-26T02:45:03Z","published":"2024-08-28T08:21:56Z","title":"CAPER: Enhancing Career Trajectory Prediction using Temporal Knowledge\n Graph and Ternary Relationship","summary":" The problem of career trajectory prediction (CTP) aims to predict one's\nfuture employer or job position. While several CTP methods have been developed\nfor this problem, we posit that none of these methods (1) jointly considers the\nmutual ternary dependency between three key units (i.e., user, position, and\ncompany) of a career and (2) captures the characteristic shifts of key units in\ncareer over time, leading to an inaccurate understanding of the job movement\npatterns in the labor market. To address the above challenges, we propose a\nnovel solution, named as CAPER, that solves the challenges via sophisticated\ntemporal knowledge graph (TKG) modeling. It enables the utilization of a\ngraph-structured knowledge base with rich expressiveness, effectively\npreserving the changes in job movement patterns. Furthermore, we devise an\nextrapolated career reasoning task on TKG for a realistic evaluation. The\nexperiments on a real-world career trajectory dataset demonstrate that CAPER\nconsistently and significantly outperforms four baselines, two recent TKG\nreasoning methods, and five state-of-the-art CTP methods in predicting one's\nfuture companies and positions--i.e., on average, yielding 6.80% and 34.58%\nmore accurate predictions, respectively. 
The codebase of CAPER is available at\nhttps://github.com/Bigdasgit/CAPER.\n","authors":["Yeon-Chang Lee","JaeHyun Lee","Michiharu Yamashita","Dongwon Lee","Sang-Wook Kim"],"pdf_url":"https://arxiv.org/pdf/2408.15620v2.pdf","comment":"Accepted by ACM KDD 2025"},{"id":"http://arxiv.org/abs/2412.04272v2","updated":"2024-12-26T02:24:52Z","published":"2024-12-05T15:54:16Z","title":"PoTable: Programming Standardly on Table-based Reasoning Like a Human\n Analyst","summary":" Table-based reasoning has garnered substantial research interest,\nparticularly in its integration with Large Language Model (LLM) which has\nrevolutionized the general reasoning paradigm. Numerous LLM-based studies\nintroduce symbolic tools (e.g., databases, Python) as assistants to extend\nhuman-like abilities in structured table understanding and complex arithmetic\ncomputations. However, these studies can be improved better in simulating human\ncognitive behavior when using symbolic tools, as they still suffer from\nlimitations of non-standard logical splits and constrained operation pools. In\nthis study, we propose PoTable as a novel table-based reasoning method that\nsimulates a human tabular analyst, which integrates a Python interpreter as the\nreal-time executor accompanied by an LLM-based operation planner and code\ngenerator. Specifically, PoTable follows a human-like logical stage split and\nextends the operation pool into an open-world space without any constraints.\nThrough planning and executing in each distinct stage, PoTable standardly\ncompletes the entire reasoning process and produces superior reasoning results\nalong with highly accurate, steply commented and completely executable\nprograms. Accordingly, the effectiveness and explainability of PoTable are\nfully demonstrated. Extensive experiments over three evaluation datasets from\ntwo public benchmarks on two backbones show the outstanding performance of our\napproach. 
In particular, GPT-based PoTable achieves over 4% higher absolute\naccuracy than runner-ups on all evaluation datasets.\n","authors":["Qingyang Mao","Qi Liu","Zhi Li","Mingyue Cheng","Zheng Zhang","Rui Li"],"pdf_url":"https://arxiv.org/pdf/2412.04272v2.pdf","comment":"12 pages, 4 figures"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2412.19372v1","updated":"2024-12-26T22:49:53Z","published":"2024-12-26T22:49:53Z","title":"Minimal Batch Adaptive Learning Policy Engine for Real-Time Mid-Price\n Forecasting in High-Frequency Trading","summary":" High-frequency trading (HFT) has transformed modern financial markets, making\nreliable short-term price forecasting models essential. In this study, we\npresent a novel approach to mid-price forecasting using Level 1 limit order\nbook (LOB) data from NASDAQ, focusing on 100 U.S. stocks from the S&P 500 index\nduring the period from September to November 2022. Expanding on our previous\nwork with Radial Basis Function Neural Networks (RBFNN), which leveraged\nautomated feature importance techniques based on mean decrease impurity (MDI)\nand gradient descent (GD), we introduce the Adaptive Learning Policy Engine\n(ALPE) - a reinforcement learning (RL)-based agent designed for batch-free,\nimmediate mid-price forecasting. ALPE incorporates adaptive epsilon decay to\ndynamically balance exploration and exploitation, outperforming a diverse range\nof highly effective machine learning (ML) and deep learning (DL) models in\nforecasting performance.\n","authors":["Adamantios Ntakaris","Gbenga Ibikunle"],"pdf_url":"https://arxiv.org/pdf/2412.19372v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.04329v3","updated":"2024-12-26T22:33:58Z","published":"2024-06-06T17:59:10Z","title":"Simplified and Generalized Masked Diffusion for Discrete Data","summary":" Masked (or absorbing) diffusion is actively explored as an alternative to\nautoregressive models for generative modeling of discrete data. 
However,\nexisting work in this area has been hindered by unnecessarily complex model\nformulations and unclear relationships between different perspectives, leading\nto suboptimal parameterization, training objectives, and ad hoc adjustments to\ncounteract these issues. In this work, we aim to provide a simple and general\nframework that unlocks the full potential of masked diffusion models. We show\nthat the continuous-time variational objective of masked diffusion models is a\nsimple weighted integral of cross-entropy losses. Our framework also enables\ntraining generalized masked diffusion models with state-dependent masking\nschedules. When evaluated by perplexity, our models trained on OpenWebText\nsurpass prior diffusion language models at GPT-2 scale and demonstrate superior\nperformance on 4 out of 5 zero-shot language modeling tasks. Furthermore, our\nmodels vastly outperform previous discrete diffusion models on pixel-level\nimage modeling, achieving 2.75 (CIFAR-10) and 3.40 (ImageNet 64x64) bits per\ndimension that are better than autoregressive models of similar sizes. Our code\nis available at https://github.com/google-deepmind/md4.\n","authors":["Jiaxin Shi","Kehang Han","Zhe Wang","Arnaud Doucet","Michalis K. Titsias"],"pdf_url":"https://arxiv.org/pdf/2406.04329v3.pdf","comment":"NeurIPS 2024. Code is available at:\n https://github.com/google-deepmind/md4"},{"id":"http://arxiv.org/abs/2412.19363v1","updated":"2024-12-26T22:06:29Z","published":"2024-12-26T22:06:29Z","title":"Large Language Models for Market Research: A Data-augmentation Approach","summary":" Large Language Models (LLMs) have transformed artificial intelligence by\nexcelling in complex natural language processing tasks. Their ability to\ngenerate human-like text has opened new possibilities for market research,\nparticularly in conjoint analysis, where understanding consumer preferences is\nessential but often resource-intensive. 
Traditional survey-based methods face\nlimitations in scalability and cost, making LLM-generated data a promising\nalternative. However, while LLMs have the potential to simulate real consumer\nbehavior, recent studies highlight a significant gap between LLM-generated and\nhuman data, with biases introduced when substituting between the two. In this\npaper, we address this gap by proposing a novel statistical data augmentation\napproach that efficiently integrates LLM-generated data with real data in\nconjoint analysis. Our method leverages transfer learning principles to debias\nthe LLM-generated data using a small amount of human data. This results in\nstatistically robust estimators with consistent and asymptotically normal\nproperties, in contrast to naive approaches that simply substitute human data\nwith LLM-generated data, which can exacerbate bias. We validate our framework\nthrough an empirical study on COVID-19 vaccine preferences, demonstrating its\nsuperior ability to reduce estimation error and save data and costs by 24.9\\%\nto 79.8\\%. In contrast, naive approaches fail to save data due to the inherent\nbiases in LLM-generated data compared to human data. Another empirical study on\nsports car choices validates the robustness of our results. Our findings\nsuggest that while LLM-generated data is not a direct substitute for human\nresponses, it can serve as a valuable complement when used within a robust\nstatistical framework.\n","authors":["Mengxin Wang","Dennis J. Zhang","Heng Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.19363v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19362v1","updated":"2024-12-26T22:05:30Z","published":"2024-12-26T22:05:30Z","title":"Evaluating Convolutional Neural Networks for COVID-19 classification in\n chest X-ray images","summary":" Coronavirus Disease 2019 (COVID-19) pandemic rapidly spread globally,\nimpacting the lives of billions of people. 
The effective screening of infected\npatients is a critical step to struggle with COVID-19, and treating the\npatients avoiding this quickly disease spread. The need for automated and\nscalable methods has increased due to the unavailability of accurate automated\ntoolkits. Recent researches using chest X-ray images suggest they include\nrelevant information about the COVID-19 virus. Hence, applying machine learning\ntechniques combined with radiological imaging promises to identify this disease\naccurately. It is straightforward to collect these images once it is spreadly\nshared and analyzed in the world. This paper presents a method for automatic\nCOVID-19 detection using chest Xray images through four convolutional neural\nnetworks, namely: AlexNet, VGG-11, SqueezeNet, and DenseNet-121. This method\nhad been providing accurate diagnostics for positive or negative COVID-19\nclassification. We validate our experiments using a ten-fold cross-validation\nprocedure over the training and test sets. Our findings include the shallow\nfine-tuning and data augmentation strategies that can assist in dealing with\nthe low number of positive COVID-19 images publicly available. The accuracy for\nall CNNs is higher than 97.00%, and the SqueezeNet model achieved the best\nresult with 99.20%.\n","authors":["Leonardo Gabriel Ferreira Rodrigues","Danilo Ferreira da Silva","Larissa Ferreira Rodrigues","João Fernando Mari"],"pdf_url":"https://arxiv.org/pdf/2412.19362v1.pdf","comment":"6 pages"},{"id":"http://arxiv.org/abs/2412.04565v2","updated":"2024-12-26T21:57:29Z","published":"2024-12-05T19:13:17Z","title":"Solving High-dimensional Inverse Problems Using Amortized\n Likelihood-free Inference with Noisy and Incomplete Data","summary":" We present a likelihood-free probabilistic inversion method based on\nnormalizing flows for high-dimensional inverse problems. 
The proposed method is\ncomposed of two complementary networks: a summary network for data compression\nand an inference network for parameter estimation. The summary network encodes\nraw observations into a fixed-size vector of summary features, while the\ninference network generates samples of the approximate posterior distribution\nof the model parameters based on these summary features. The posterior samples\nare produced in a deep generative fashion by sampling from a latent Gaussian\ndistribution and passing these samples through an invertible transformation. We\nconstruct this invertible transformation by sequentially alternating\nconditional invertible neural network and conditional neural spline flow\nlayers. The summary and inference networks are trained simultaneously. We apply\nthe proposed method to an inversion problem in groundwater hydrology to\nestimate the posterior distribution of the log-conductivity field conditioned\non spatially sparse time-series observations of the system's hydraulic head\nresponses.The conductivity field is represented with 706 degrees of freedom in\nthe considered problem.The comparison with the likelihood-based iterative\nensemble smoother PEST-IES method demonstrates that the proposed method\naccurately estimates the parameter posterior distribution and the observations'\npredictive posterior distribution at a fraction of the inference time of\nPEST-IES.\n","authors":["Jice Zeng","Yuanzhe Wang","Alexandre M. 
Tartakovsky","David Barajas-Solano"],"pdf_url":"https://arxiv.org/pdf/2412.04565v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19354v1","updated":"2024-12-26T21:32:08Z","published":"2024-12-26T21:32:08Z","title":"Federated Hybrid Training and Self-Adversarial Distillation: Towards\n Robust Edge Networks","summary":" Federated learning (FL) is a distributed training technology that enhances\ndata privacy in mobile edge networks by allowing data owners to collaborate\nwithout transmitting raw data to the edge server. However, data heterogeneity\nand adversarial attacks pose challenges to develop an unbiased and robust\nglobal model for edge deployment. To address this, we propose Federated hyBrid\nAdversarial training and self-adversarial disTillation (FedBAT), a new\nframework designed to improve both robustness and generalization of the global\nmodel. FedBAT seamlessly integrates hybrid adversarial training and\nself-adversarial distillation into the conventional FL framework from data\naugmentation and feature distillation perspectives. From a data augmentation\nperspective, we propose hybrid adversarial training to defend against\nadversarial attacks by balancing accuracy and robustness through a weighted\ncombination of standard and adversarial training. From a feature distillation\nperspective, we introduce a novel augmentation-invariant adversarial\ndistillation method that aligns local adversarial features of augmented images\nwith their corresponding unbiased global clean features. This alignment can\neffectively mitigate bias from data heterogeneity while enhancing both the\nrobustness and generalization of the global model. 
Extensive experimental\nresults across multiple datasets demonstrate that FedBAT yields comparable or\nsuperior performance gains in improving robustness while maintaining accuracy\ncompared to several baselines.\n","authors":["Yu Qiao","Apurba Adhikary","Kitae Kim","Eui-Nam Huh","Zhu Han","Choong Seon Hong"],"pdf_url":"https://arxiv.org/pdf/2412.19354v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19351v1","updated":"2024-12-26T21:13:12Z","published":"2024-12-26T21:13:12Z","title":"ETTA: Elucidating the Design Space of Text-to-Audio Models","summary":" Recent years have seen significant progress in Text-To-Audio (TTA) synthesis,\nenabling users to enrich their creative workflows with synthetic audio\ngenerated from natural language prompts. Despite this progress, the effects of\ndata, model architecture, training objective functions, and sampling strategies\non target benchmarks are not well understood. With the purpose of providing a\nholistic understanding of the design space of TTA models, we set up a\nlarge-scale empirical experiment focused on diffusion and flow matching models.\nOur contributions include: 1) AF-Synthetic, a large dataset of high quality\nsynthetic captions obtained from an audio understanding model; 2) a systematic\ncomparison of different architectural, training, and inference design choices\nfor TTA models; 3) an analysis of sampling methods and their Pareto curves with\nrespect to generation quality and inference speed. We leverage the knowledge\nobtained from this extensive analysis to propose our best model dubbed\nElucidated Text-To-Audio (ETTA). When evaluated on AudioCaps and MusicCaps,\nETTA provides improvements over the baselines trained on publicly available\ndata, while being competitive with models trained on proprietary data. 
Finally,\nwe show ETTA's improved ability to generate creative audio following complex\nand imaginative captions -- a task that is more challenging than current\nbenchmarks.\n","authors":["Sang-gil Lee","Zhifeng Kong","Arushi Goel","Sungwon Kim","Rafael Valle","Bryan Catanzaro"],"pdf_url":"https://arxiv.org/pdf/2412.19351v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.10935v2","updated":"2024-12-26T21:12:31Z","published":"2024-12-14T19:06:01Z","title":"Progressive Compression with Universally Quantized Diffusion Models","summary":" Diffusion probabilistic models have achieved mainstream success in many\ngenerative modeling tasks, from image generation to inverse problem solving. A\ndistinct feature of these models is that they correspond to deep hierarchical\nlatent variable models optimizing a variational evidence lower bound (ELBO) on\nthe data likelihood. Drawing on a basic connection between likelihood modeling\nand compression, we explore the potential of diffusion models for progressive\ncoding, resulting in a sequence of bits that can be incrementally transmitted\nand decoded with progressively improving reconstruction quality. Unlike prior\nwork based on Gaussian diffusion or conditional diffusion models, we propose a\nnew form of diffusion model with uniform noise in the forward process, whose\nnegative ELBO corresponds to the end-to-end compression cost using universal\nquantization. We obtain promising first results on image compression, achieving\ncompetitive rate-distortion and rate-realism results on a wide range of\nbit-rates with a single model, bringing neural codecs a step closer to\npractical deployment.\n","authors":["Yibo Yang","Justus C. 
Will","Stephan Mandt"],"pdf_url":"https://arxiv.org/pdf/2412.10935v2.pdf","comment":"20 pages, 10 figures"},{"id":"http://arxiv.org/abs/2412.19350v1","updated":"2024-12-26T20:53:04Z","published":"2024-12-26T20:53:04Z","title":"On the Expressiveness and Length Generalization of Selective State-Space\n Models on Regular Languages","summary":" Selective state-space models (SSMs) are an emerging alternative to the\nTransformer, offering the unique advantage of parallel training and sequential\ninference. Although these models have shown promising performance on a variety\nof tasks, their formal expressiveness and length generalization properties\nremain underexplored. In this work, we provide insight into the workings of\nselective SSMs by analyzing their expressiveness and length generalization\nperformance on regular language tasks, i.e., finite-state automaton (FSA)\nemulation. We address certain limitations of modern SSM-based architectures by\nintroducing the Selective Dense State-Space Model (SD-SSM), the first selective\nSSM that exhibits perfect length generalization on a set of various regular\nlanguage tasks using a single layer. It utilizes a dictionary of dense\ntransition matrices, a softmax selection mechanism that creates a convex\ncombination of dictionary matrices at each time step, and a readout consisting\nof layer normalization followed by a linear map. We then proceed to evaluate\nvariants of diagonal selective SSMs by considering their empirical performance\non commutative and non-commutative automata. We explain the experimental\nresults with theoretical considerations. 
Our code is available at\nhttps://github.com/IBM/selective-dense-state-space-model.\n","authors":["Aleksandar Terzić","Michael Hersche","Giacomo Camposampiero","Thomas Hofmann","Abu Sebastian","Abbas Rahimi"],"pdf_url":"https://arxiv.org/pdf/2412.19350v1.pdf","comment":"13 pages, 7 figures, to be published in AAAI 2025"},{"id":"http://arxiv.org/abs/2412.19340v1","updated":"2024-12-26T20:08:10Z","published":"2024-12-26T20:08:10Z","title":"A Reinforcement Learning-Based Task Mapping Method to Improve the\n Reliability of Clustered Manycores","summary":" The increasing scale of manycore systems poses significant challenges in\nmanaging reliability while meeting performance demands. Simultaneously, these\nsystems become more susceptible to different aging mechanisms such as\nnegative-bias temperature instability (NBTI), hot carrier injection (HCI), and\nthermal cycling (TC), as well as the electromigration (EM) phenomenon. In this\npaper, we propose a reinforcement learning (RL)-based task mapping method to\nimprove the reliability of manycore systems considering the aforementioned\naging mechanisms, which consists of three steps including bin packing,\ntask-to-bin mapping, and task-to-core mapping. In the initial step, a\ndensity-based spatial application with noise (DBSCAN) clustering method is\nemployed to compose some clusters (bins) based on the cores temperature. Then,\nthe Q-learning algorithm is used for the two latter steps, to map the arrived\ntask on a core such that the minimum thermal variation is occurred among all\nthe bins. Compared to the state-of-the-art works, the proposed method is\nperformed during runtime without requiring any parameter to be calculated\noffline. The effectiveness of the proposed technique is evaluated on 16, 32,\nand 64 cores systems using SPLASH2 and PARSEC benchmark suite applications. 
The\nresults demonstrate up to 27% increase in the mean time to failure (MTTF)\ncompared to the state-of-the-art task mapping techniques.\n","authors":["Fatemeh Hossein-Khani","Omid Akbari"],"pdf_url":"https://arxiv.org/pdf/2412.19340v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.12225v2","updated":"2024-12-26T19:23:17Z","published":"2024-12-16T10:03:44Z","title":"DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis","summary":" Multimodal Sentiment Analysis (MSA) leverages heterogeneous modalities, such\nas language, vision, and audio, to enhance the understanding of human\nsentiment. While existing models often focus on extracting shared information\nacross modalities or directly fusing heterogeneous modalities, such approaches\ncan introduce redundancy and conflicts due to equal treatment of all modalities\nand the mutual transfer of information between modality pairs. To address these\nissues, we propose a Disentangled-Language-Focused (DLF) multimodal\nrepresentation learning framework, which incorporates a feature disentanglement\nmodule to separate modality-shared and modality-specific information. To\nfurther reduce redundancy and enhance language-targeted features, four\ngeometric measures are introduced to refine the disentanglement process. A\nLanguage-Focused Attractor (LFA) is further developed to strengthen language\nrepresentation by leveraging complementary modality-specific information\nthrough a language-guided cross-attention mechanism. The framework also employs\nhierarchical predictions to improve overall accuracy. Extensive experiments on\ntwo popular MSA datasets, CMU-MOSI and CMU-MOSEI, demonstrate the significant\nperformance gains achieved by the proposed DLF framework. Comprehensive\nablation studies further validate the effectiveness of the feature\ndisentanglement module, language-focused attractor, and hierarchical\npredictions. 
Our code is available at https://github.com/pwang322/DLF.\n","authors":["Pan Wang","Qiang Zhou","Yawen Wu","Tianlong Chen","Jingtong Hu"],"pdf_url":"https://arxiv.org/pdf/2412.12225v2.pdf","comment":"AAAI 2025 accepted"},{"id":"http://arxiv.org/abs/2412.19331v1","updated":"2024-12-26T18:59:37Z","published":"2024-12-26T18:59:37Z","title":"CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language\n Models","summary":" Recent advances in Large Vision-Language Models (LVLMs) have sparked\nsignificant progress in general-purpose vision tasks through visual instruction\ntuning. While some works have demonstrated the capability of LVLMs to generate\nsegmentation masks that align phrases with natural language descriptions in a\nsingle image, they struggle with segmentation-grounded comparisons across\nmultiple images, particularly at finer granularities such as object parts. In\nthis paper, we introduce the new task of part-focused semantic co-segmentation,\nwhich seeks to identify and segment common and unique objects and parts across\nimages. To address this task, we present CALICO, the first LVLM that can\nsegment and reason over multiple masks across images, enabling object\ncomparison based on their constituent parts. CALICO features two proposed\ncomponents, a novel Correspondence Extraction Module, which captures\nsemantic-rich information to identify part-level correspondences between\nobjects, and a Correspondence Adaptation Module, which embeds this information\ninto the LVLM to facilitate multi-image understanding in a parameter-efficient\nmanner. To support training and evaluation, we curate MixedParts, a\ncomprehensive multi-image segmentation dataset containing $\\sim$2.4M samples\nacross $\\sim$44K images with diverse object and part categories. Experimental\nresults show CALICO, finetuned on only 0.3% of its architecture, achieves\nrobust performance in part-focused semantic co-segmentation.\n","authors":["Kiet A. 
Nguyen","Adheesh Juvekar","Tianjiao Yu","Muntasir Wahed","Ismini Lourentzou"],"pdf_url":"https://arxiv.org/pdf/2412.19331v1.pdf","comment":"Project page: https://plan-lab.github.io/calico"},{"id":"http://arxiv.org/abs/2412.19329v1","updated":"2024-12-26T18:58:38Z","published":"2024-12-26T18:58:38Z","title":"Deep learning and whole-brain networks for biomarker discovery: modeling\n the dynamics of brain fluctuations in resting-state and cognitive tasks","summary":" Background: Brain network models offer insights into brain dynamics, but the\nutility of model-derived bifurcation parameters as biomarkers remains\nunderexplored. Objective: This study evaluates bifurcation parameters from a\nwhole-brain network model as biomarkers for distinguishing brain states\nassociated with resting-state and task-based cognitive conditions. Methods:\nSynthetic BOLD signals were generated using a supercritical Hopf brain network\nmodel to train deep learning models for bifurcation parameter prediction.\nInference was performed on Human Connectome Project data, including both\nresting-state and task-based conditions. 
Statistical analyses assessed the\nseparability of brain states based on bifurcation parameter distributions.\nResults: Bifurcation parameter distributions differed significantly across task\nand resting-state conditions ($p < 0.0001$ for all but one comparison).\nTask-based brain states exhibited higher bifurcation values compared to rest.\nConclusion: Bifurcation parameters effectively differentiate cognitive and\nresting states, warranting further investigation as biomarkers for brain state\ncharacterization and neurological disorder assessment.\n","authors":["Facundo Roffet","Gustavo Deco","Claudio Delrieux","Gustavo Patow"],"pdf_url":"https://arxiv.org/pdf/2412.19329v1.pdf","comment":"12 pages, 4 figures, 1 table"},{"id":"http://arxiv.org/abs/2412.15188v2","updated":"2024-12-26T18:56:18Z","published":"2024-12-19T18:56:24Z","title":"LMFusion: Adapting Pretrained Language Models for Multimodal Generation","summary":" We present LMFusion, a framework for empowering pretrained text-only large\nlanguage models (LLMs) with multimodal generative capabilities, enabling them\nto understand and generate both text and images in arbitrary sequences.\nLMFusion leverages existing Llama-3's weights for processing texts\nautoregressively while introducing additional and parallel transformer modules\nfor processing images with diffusion. During training, the data from each\nmodality is routed to its dedicated modules: modality-specific feedforward\nlayers, query-key-value projections, and normalization layers process each\nmodality independently, while the shared self-attention layers allow\ninteractions across text and image features. By freezing the text-specific\nmodules and only training the image-specific modules, LMFusion preserves the\nlanguage capabilities of text-only LLMs while developing strong visual\nunderstanding and generation abilities. 
Compared to methods that pretrain\nmultimodal generative models from scratch, our experiments demonstrate that,\nLMFusion improves image understanding by 20% and image generation by 3.6% using\nonly 50% of the FLOPs while maintaining Llama-3's language capabilities. We\nalso demonstrate that this framework can adapt existing vision-language models\nwith multimodal generation ability. Overall, this framework not only leverages\nexisting computational investments in text-only LLMs but also enables the\nparallel development of language and vision capabilities, presenting a\npromising direction for efficient multimodal model development.\n","authors":["Weijia Shi","Xiaochuang Han","Chunting Zhou","Weixin Liang","Xi Victoria Lin","Luke Zettlemoyer","Lili Yu"],"pdf_url":"https://arxiv.org/pdf/2412.15188v2.pdf","comment":"Name change: LlamaFusion to LMFusion"},{"id":"http://arxiv.org/abs/2305.13168v4","updated":"2024-12-26T18:54:53Z","published":"2023-05-22T15:56:44Z","title":"LLMs for Knowledge Graph Construction and Reasoning: Recent Capabilities\n and Future Opportunities","summary":" This paper presents an exhaustive quantitative and qualitative evaluation of\nLarge Language Models (LLMs) for Knowledge Graph (KG) construction and\nreasoning. We engage in experiments across eight diverse datasets, focusing on\nfour representative tasks encompassing entity and relation extraction, event\nextraction, link prediction, and question-answering, thereby thoroughly\nexploring LLMs' performance in the domain of construction and inference.\nEmpirically, our findings suggest that LLMs, represented by GPT-4, are more\nsuited as inference assistants rather than few-shot information extractors.\nSpecifically, while GPT-4 exhibits good performance in tasks related to KG\nconstruction, it excels further in reasoning tasks, surpassing fine-tuned\nmodels in certain cases. 
Moreover, our investigation extends to the potential\ngeneralization ability of LLMs for information extraction, leading to the\nproposition of a Virtual Knowledge Extraction task and the development of the\ncorresponding VINE dataset. Based on these empirical findings, we further\npropose AutoKG, a multi-agent-based approach employing LLMs and external\nsources for KG construction and reasoning. We anticipate that this research can\nprovide invaluable insights for future undertakings in the field of knowledge\ngraphs. The code and datasets are in https://github.com/zjunlp/AutoKG.\n","authors":["Yuqi Zhu","Xiaohan Wang","Jing Chen","Shuofei Qiao","Yixin Ou","Yunzhi Yao","Shumin Deng","Huajun Chen","Ningyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.13168v4.pdf","comment":"World Wide Web Journal"},{"id":"http://arxiv.org/abs/2412.19325v1","updated":"2024-12-26T18:54:32Z","published":"2024-12-26T18:54:32Z","title":"Performance Control in Early Exiting to Deploy Large Models at the Same\n Cost of Smaller Ones","summary":" Early Exiting (EE) is a promising technique for speeding up inference by\nadaptively allocating compute resources to data points based on their\ndifficulty. The approach enables predictions to exit at earlier layers for\nsimpler samples while reserving more computation for challenging ones. In this\nstudy, we first present a novel perspective on the EE approach, showing that\nlarger models deployed with EE can achieve higher performance than smaller\nmodels while maintaining similar computational costs. As existing EE approaches\nrely on confidence estimation at each exit point, we further study the impact\nof overconfidence on the controllability of the compute-performance trade-off.\nWe introduce Performance Control Early Exiting (PCEE), a method that enables\naccuracy thresholding by basing decisions not on a data point's confidence but\non the average accuracy of samples with similar confidence levels from a\nheld-out validation set. 
In our experiments, we show that PCEE offers a simple\nyet computationally efficient approach that provides better control over\nperformance than standard confidence-based approaches, and allows us to scale\nup model sizes to yield performance gain while reducing the computational cost.\n","authors":["Mehrnaz Mofakhami","Reza Bayat","Ioannis Mitliagkas","Joao Monteiro","Valentina Zantedeschi"],"pdf_url":"https://arxiv.org/pdf/2412.19325v1.pdf","comment":"Appeared at ICML 2024 Workshop on Efficient Systems for Foundation\n Models (ES-FoMo-II)"},{"id":"http://arxiv.org/abs/2412.19318v1","updated":"2024-12-26T18:42:08Z","published":"2024-12-26T18:42:08Z","title":"Adaptive Conformal Inference by Betting","summary":" Conformal prediction is a valuable tool for quantifying predictive\nuncertainty of machine learning models. However, its applicability relies on\nthe assumption of data exchangeability, a condition which is often not met in\nreal-world scenarios. In this paper, we consider the problem of adaptive\nconformal inference without any assumptions about the data generating process.\nExisting approaches for adaptive conformal inference are based on optimizing\nthe pinball loss using variants of online gradient descent. A notable\nshortcoming of such approaches is in their explicit dependence on and\nsensitivity to the choice of the learning rates. In this paper, we propose a\ndifferent approach for adaptive conformal inference that leverages\nparameter-free online convex optimization techniques. 
We prove that our method\ncontrols long-term miscoverage frequency at a nominal level and demonstrate its\nconvincing empirical performance without any need of performing cumbersome\nparameter tuning.\n","authors":["Aleksandr Podkopaev","Darren Xu","Kuang-Chih Lee"],"pdf_url":"https://arxiv.org/pdf/2412.19318v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.00439v2","updated":"2024-12-26T18:21:50Z","published":"2024-08-01T10:19:25Z","title":"Rapid and Power-Aware Learned Optimization for Modular Receive\n Beamforming","summary":" Multiple-input multiple-output (MIMO) systems play a key role in wireless\ncommunication technologies. A widely considered approach to realize scalable\nMIMO systems involves architectures comprised of multiple separate modules,\neach with its own beamforming capability. Such models accommodate cell-free\nmassive MIMO and partially connected hybrid MIMO architectures. A core issue\nwith the implementation of modular MIMO arises from the need to rapidly set the\nbeampatterns of the modules, while maintaining their power efficiency. This\nleads to challenging constrained optimization that should be repeatedly solved\non each coherence duration. In this work, we propose a power-oriented\noptimization algorithm for beamforming in uplink modular hybrid MIMO systems,\nwhich learns from data to operate rapidly. We derive our learned optimizer by\ntackling the rate maximization objective using projected gradient ascent steps\nwith momentum. We then leverage data to tune the hyperparameters of the\noptimizer, allowing it to operate reliably in a fixed and small number of\niterations while completely preserving its interpretable operation. We show how\npower efficient beamforming can be encouraged by the learned optimizer, via\nboosting architectures with low-resolution phase shifts and with deactivated\nanalog components. 
Numerical results show that our learn-to-optimize method\nnotably reduces the number of iterations and computation latency required to\nreliably tune modular MIMO receivers, and that it allows obtaining desirable\nbalances between power efficient designs and throughput.\n","authors":["Ohad Levy","Nir Shlezinger"],"pdf_url":"https://arxiv.org/pdf/2408.00439v2.pdf","comment":"Under review for possible publication in the IEEE"},{"id":"http://arxiv.org/abs/2412.19311v1","updated":"2024-12-26T18:19:04Z","published":"2024-12-26T18:19:04Z","title":"xSRL: Safety-Aware Explainable Reinforcement Learning -- Safety as a\n Product of Explainability","summary":" Reinforcement learning (RL) has shown great promise in simulated\nenvironments, such as games, where failures have minimal consequences. However,\nthe deployment of RL agents in real-world systems such as autonomous vehicles,\nrobotics, UAVs, and medical devices demands a higher level of safety and\ntransparency, particularly when facing adversarial threats. Safe RL algorithms\nhave been developed to address these concerns by optimizing both task\nperformance and safety constraints. However, errors are inevitable, and when\nthey occur, it is essential that the RL agents can also explain their actions\nto human operators. This makes trust in the safety mechanisms of RL systems\ncrucial for effective deployment. Explainability plays a key role in building\nthis trust by providing clear, actionable insights into the agent's\ndecision-making process, ensuring that safety-critical decisions are well\nunderstood. While machine learning (ML) has seen significant advances in\ninterpretability and visualization, explainability methods for RL remain\nlimited. Current tools fail to address the dynamic, sequential nature of RL and\nits need to balance task performance with safety constraints over time. 
The\nre-purposing of traditional ML methods, such as saliency maps, is inadequate\nfor safety-critical RL applications where mistakes can result in severe\nconsequences. To bridge this gap, we propose xSRL, a framework that integrates\nboth local and global explanations to provide a comprehensive understanding of\nRL agents' behavior. xSRL also enables developers to identify policy\nvulnerabilities through adversarial attacks, offering tools to debug and patch\nagents without retraining. Our experiments and user studies demonstrate xSRL's\neffectiveness in increasing safety in RL systems, making them more reliable and\ntrustworthy for real-world deployment. Code is available at\nhttps://github.com/risal-shefin/xSRL.\n","authors":["Risal Shahriar Shefin","Md Asifur Rahman","Thai Le","Sarra Alqahtani"],"pdf_url":"https://arxiv.org/pdf/2412.19311v1.pdf","comment":"Accepted to 24th International Conference on Autonomous Agents and\n Multiagent Systems (AAMAS 2025)"},{"id":"http://arxiv.org/abs/2412.19291v1","updated":"2024-12-26T17:34:26Z","published":"2024-12-26T17:34:26Z","title":"RAG with Differential Privacy","summary":" Retrieval-Augmented Generation (RAG) has emerged as the dominant technique to\nprovide *Large Language Models* (LLM) with fresh and relevant context,\nmitigating the risk of hallucinations and improving the overall quality of\nresponses in environments with large and fast moving knowledge bases. However,\nthe integration of external documents into the generation process raises\nsignificant privacy concerns. Indeed, when added to a prompt, it is not\npossible to guarantee a response will not inadvertently expose confidential\ndata, leading to potential breaches of privacy and ethical dilemmas. This paper\nexplores a practical solution to this problem suitable to general knowledge\nextraction from personal data. 
It shows *differentially private token\ngeneration* is a viable approach to private RAG.\n","authors":["Nicolas Grislain"],"pdf_url":"https://arxiv.org/pdf/2412.19291v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19289v1","updated":"2024-12-26T17:29:38Z","published":"2024-12-26T17:29:38Z","title":"ViPCap: Retrieval Text-Based Visual Prompts for Lightweight Image\n Captioning","summary":" Recent lightweight image captioning models using retrieved data mainly focus\non text prompts. However, previous works only utilize the retrieved text as\ntext prompts, and the visual information relies only on the CLIP visual\nembedding. Because of this issue, there is a limitation that the image\ndescriptions inherent in the prompt are not sufficiently reflected in the\nvisual embedding space. To tackle this issue, we propose ViPCap, a novel\nretrieval text-based visual prompt for lightweight image captioning. ViPCap\nleverages the retrieved text with image information as visual prompts to\nenhance the ability of the model to capture relevant visual information. By\nmapping text prompts into the CLIP space and generating multiple randomized\nGaussian distributions, our method leverages sampling to explore randomly\naugmented distributions and effectively retrieves the semantic features that\ncontain image information. These retrieved features are integrated into the\nimage and designated as the visual prompt, leading to performance improvements\non the datasets such as COCO, Flickr30k, and NoCaps. 
Experimental results\ndemonstrate that ViPCap significantly outperforms prior lightweight captioning\nmodels in efficiency and effectiveness, demonstrating the potential for a\nplug-and-play solution.\n","authors":["Taewhan Kim","Soeun Lee","Si-Woo Kim","Dong-Jin Kim"],"pdf_url":"https://arxiv.org/pdf/2412.19289v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.01223v4","updated":"2024-12-26T17:27:23Z","published":"2024-10-02T04:02:21Z","title":"Statistical Taylor Expansion","summary":" Statistical Taylor expansion replaces the input precise variables in a\nconventional Taylor expansion with random variables each with known\ndistribution, to calculate the result mean and deviation. It is based on the\nuncorrelated uncertainty assumption: Each input variable is measured\nindependently with fine enough statistical precision, so that their\nuncertainties are independent of each other. Statistical Taylor expansion\nreveals that the intermediate analytic expressions can no longer be regarded as\nindependent of each other, and the result of an analytic expression should be\npath independent. This conclusion differs fundamentally from the conventional\ncommon approach in applied mathematics to find the best execution path for a\nresult. This paper also presents an implementation of statistical Taylor\nexpansion called variance arithmetic, and the tests on variance arithmetic.\n","authors":["Chengpu Wang"],"pdf_url":"https://arxiv.org/pdf/2410.01223v4.pdf","comment":"65 pages, 53 figures"},{"id":"http://arxiv.org/abs/2412.19286v1","updated":"2024-12-26T17:15:30Z","published":"2024-12-26T17:15:30Z","title":"Time Series Foundational Models: Their Role in Anomaly Detection and\n Prediction","summary":" Time series foundational models (TSFM) have gained prominence in time series\nforecasting, promising state-of-the-art performance across various\napplications. 
However, their application in anomaly detection and prediction\nremains underexplored, with growing concerns regarding their black-box nature,\nlack of interpretability and applicability. This paper critically evaluates the\nefficacy of TSFM in anomaly detection and prediction tasks. We systematically\nanalyze TSFM across multiple datasets, including those characterized by the\nabsence of discernible patterns, trends and seasonality. Our analysis shows\nthat while TSFMs can be extended for anomaly detection and prediction,\ntraditional statistical and deep learning models often match or outperform TSFM\nin these tasks. Additionally, TSFMs require high computational resources but\nfail to capture sequential dependencies effectively or improve performance in\nfew-shot or zero-shot scenarios. The preprocessed datasets, code to\nreproduce the results, and supplementary materials are available at\nhttps://github.com/smtmnfg/TSFM.\n","authors":["Chathurangi Shyalika","Harleen Kaur Bagga","Ahan Bhatt","Renjith Prasad","Alaa Al Ghazo","Amit Sheth"],"pdf_url":"https://arxiv.org/pdf/2412.19286v1.pdf","comment":"12 pages, 6 figures, 5 tables. Accepted at AAAI2025 Anomaly Detection\n in Scientific Domains Workshop"},{"id":"http://arxiv.org/abs/2412.19284v1","updated":"2024-12-26T17:02:19Z","published":"2024-12-26T17:02:19Z","title":"PearSAN: A Machine Learning Method for Inverse Design using Pearson\n Correlated Surrogate Annealing","summary":" PearSAN is a machine learning-assisted optimization algorithm applicable to\ninverse design problems with large design spaces, where traditional optimizers\nstruggle. The algorithm leverages the latent space of a generative model for\nrapid sampling and employs a Pearson correlated surrogate model to predict the\nfigure of merit of the true design metric. As a showcase example, PearSAN is\napplied to thermophotovoltaic (TPV) metasurface design by matching the working\nbands between a thermal radiator and a photovoltaic cell. 
PearSAN can work with\nany pretrained generative model with a discretized latent space, making it easy\nto integrate with VQ-VAEs and binary autoencoders. Its novel Pearson\ncorrelational loss can be used as both a latent regularization method, similar\nto batch and layer normalization, and as a surrogate training loss. We compare\nboth to previous energy matching losses, which are shown to enforce poor\nregularization and performance, even with upgraded affine parameters. PearSAN\nachieves a state-of-the-art maximum design efficiency of 97%, and is at least\nan order of magnitude faster than previous methods, with an improved maximum\nfigure-of-merit gain.\n","authors":["Michael Bezick","Blake A. Wilson","Vaishnavi Iyer","Yuheng Chen","Vladimir M. Shalaev","Sabre Kais","Alexander V. Kildishev","Alexandra Boltasseva","Brad Lackey"],"pdf_url":"https://arxiv.org/pdf/2412.19284v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2201.08507v2","updated":"2024-12-26T17:00:25Z","published":"2022-01-21T01:26:08Z","title":"Decentralized Sparse Linear Regression via Gradient-Tracking: Linear\n Convergence and Statistical Guarantees","summary":" We study sparse linear regression over a network of agents, modeled as an\nundirected graph and no server node. The estimation of the $s$-sparse parameter\nis formulated as a constrained LASSO problem wherein each agent owns a subset\nof the $N$ total observations. We analyze the convergence rate and statistical\nguarantees of a distributed projected gradient tracking-based algorithm under\nhigh-dimensional scaling, allowing the ambient dimension $d$ to grow with (and\npossibly exceed) the sample size $N$. 
Our theory shows that, under standard\nnotions of restricted strong convexity and smoothness of the loss functions,\nsuitable conditions on the network connectivity and algorithm tuning, the\ndistributed algorithm converges globally at a {\\it linear} rate to an estimate\nthat is within the centralized {\\it statistical precision} of the model,\n$O(s\\log d/N)$. When $s\\log d/N=o(1)$, a condition necessary for statistical\nconsistency, an $\\varepsilon$-optimal solution is attained after\n$\\mathcal{O}(\\kappa \\log (1/\\varepsilon))$ gradient computations and $O\n(\\kappa/(1-\\rho) \\log (1/\\varepsilon))$ communication rounds, where $\\kappa$ is\nthe restricted condition number of the loss function and $\\rho$ measures the\nnetwork connectivity. The computation cost matches that of the centralized\nprojected gradient algorithm despite having data distributed; whereas the\ncommunication rounds reduce as the network connectivity improves. Overall, our\nstudy reveals interesting connections between statistical efficiency, network\nconnectivity \\& topology, and convergence rate in high dimensions.\n","authors":["Marie Maros","Gesualdo Scutari","Ying Sun","Guang Cheng"],"pdf_url":"https://arxiv.org/pdf/2201.08507v2.pdf","comment":"The order of the first three authors is alphabetic. Final revised\n version"},{"id":"http://arxiv.org/abs/2412.19279v1","updated":"2024-12-26T16:45:20Z","published":"2024-12-26T16:45:20Z","title":"Improving Generalization for AI-Synthesized Voice Detection","summary":" AI-synthesized voice technology has the potential to create realistic human\nvoices for beneficial applications, but it can also be misused for malicious\npurposes. While existing AI-synthesized voice detection models excel in\nintra-domain evaluation, they face challenges in generalizing across different\ndomains, potentially becoming obsolete as new voice generators emerge. 
Current\nsolutions use diverse data and advanced machine learning techniques (e.g.,\ndomain-invariant representation, self-supervised learning), but are limited by\npredefined vocoders and sensitivity to factors like background noise and\nspeaker identity. In this work, we introduce an innovative disentanglement\nframework aimed at extracting domain-agnostic artifact features related to\nvocoders. Utilizing these features, we enhance model learning in a flat loss\nlandscape, enabling escape from suboptimal solutions and improving\ngeneralization. Extensive experiments on benchmarks show our approach\noutperforms state-of-the-art methods, achieving up to 5.12% improvement in the\nequal error rate metric in intra-domain and 7.59% in cross-domain evaluations.\n","authors":["Hainan Ren","Lin Li","Chun-Hao Liu","Xin Wang","Shu Hu"],"pdf_url":"https://arxiv.org/pdf/2412.19279v1.pdf","comment":"AAAI25"},{"id":"http://arxiv.org/abs/2412.19265v1","updated":"2024-12-26T16:05:19Z","published":"2024-12-26T16:05:19Z","title":"Optimizing Multi-Stage Language Models for Effective Text Retrieval","summary":" Efficient text retrieval is critical for applications such as legal document\nanalysis, particularly in specialized contexts like Japanese legal systems.\nExisting retrieval methods often underperform in such domain-specific\nscenarios, necessitating tailored approaches. In this paper, we introduce a\nnovel two-phase text retrieval pipeline optimized for Japanese legal datasets.\nOur method leverages advanced language models to achieve state-of-the-art\nperformance, significantly improving retrieval efficiency and accuracy. To\nfurther enhance robustness and adaptability, we incorporate an ensemble model\nthat integrates multiple retrieval strategies, resulting in superior outcomes\nacross diverse tasks. 
Extensive experiments validate the effectiveness of our\napproach, demonstrating strong performance on both Japanese legal datasets and\nwidely recognized benchmarks like MS-MARCO. Our work establishes new standards\nfor text retrieval in domain-specific and general contexts, providing a\ncomprehensive solution for addressing complex queries in legal and multilingual\nenvironments.\n","authors":["Quang Hoang Trung","Le Trung Hoang","Nguyen Van Hoang Phuc"],"pdf_url":"https://arxiv.org/pdf/2412.19265v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19260v1","updated":"2024-12-26T15:54:10Z","published":"2024-12-26T15:54:10Z","title":"MEDEC: A Benchmark for Medical Error Detection and Correction in\n Clinical Notes","summary":" Several studies showed that Large Language Models (LLMs) can answer medical\nquestions correctly, even outperforming the average human score in some medical\nexams. However, to our knowledge, no study has been conducted to assess the\nability of language models to validate existing or generated medical text for\ncorrectness and consistency. In this paper, we introduce MEDEC\n(https://github.com/abachaa/MEDEC), the first publicly available benchmark for\nmedical error detection and correction in clinical notes, covering five types\nof errors (Diagnosis, Management, Treatment, Pharmacotherapy, and Causal\nOrganism). MEDEC consists of 3,848 clinical texts, including 488 clinical notes\nfrom three US hospital systems that were not previously seen by any LLM. The\ndataset has been used for the MEDIQA-CORR shared task to evaluate seventeen\nparticipating systems [Ben Abacha et al., 2024]. In this paper, we describe the\ndata creation methods and we evaluate recent LLMs (e.g., o1-preview, GPT-4,\nClaude 3.5 Sonnet, and Gemini 2.0 Flash) for the tasks of detecting and\ncorrecting medical errors requiring both medical knowledge and reasoning\ncapabilities. 
We also conducted a comparative study where two medical doctors\nperformed the same task on the MEDEC test set. The results showed that MEDEC is\na sufficiently challenging benchmark to assess the ability of models to\nvalidate existing or generated notes and to correct medical errors. We also\nfound that although recent LLMs have a good performance in error detection and\ncorrection, they are still outperformed by medical doctors in these tasks. We\ndiscuss the potential factors behind this gap, the insights from our\nexperiments, the limitations of current evaluation metrics, and share potential\npointers for future research.\n","authors":["Asma Ben Abacha","Wen-wai Yim","Yujuan Fu","Zhaoyi Sun","Meliha Yetisgen","Fei Xia","Thomas Lin"],"pdf_url":"https://arxiv.org/pdf/2412.19260v1.pdf","comment":"12 pages"},{"id":"http://arxiv.org/abs/2412.19255v1","updated":"2024-12-26T15:45:45Z","published":"2024-12-26T15:45:45Z","title":"Multi-matrix Factorization Attention","summary":" We propose novel attention architectures, Multi-matrix Factorization\nAttention (MFA) and MFA-Key-Reuse (MFA-KR). Existing variants for standard\nMulti-Head Attention (MHA), including SOTA methods like MLA, fail to maintain\nas strong performance under stringent Key-Value cache (KV cache) constraints.\nMFA enhances model capacity by efficiently scaling up both the number and\ndimension of attention heads through low-rank matrix factorization in the\nQuery-Key (QK) circuit. Extending MFA, MFA-KR further reduces memory\nrequirements by repurposing the key cache as value through value projection\nre-parameterization. MFA's design enables strong model capacity when working\nunder tight KV cache budget, while MFA-KR is suitable for even harsher KV cache\nlimits with minor performance trade-off. 
Notably, in our extensive and\nlarge-scale experiments, the proposed architecture outperforms MLA and performs\ncomparably to MHA, while reducing KV cache usage by up to 56% and 93.7%,\nrespectively.\n","authors":["Jingcheng Hu","Houyi Li","Yinmin Zhang","Zili Wang","Shuigeng Zhou","Xiangyu Zhang","Heung-Yeung Shum"],"pdf_url":"https://arxiv.org/pdf/2412.19255v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19252v1","updated":"2024-12-26T15:29:58Z","published":"2024-12-26T15:29:58Z","title":"Localized exploration in contextual dynamic pricing achieves\n dimension-free regret","summary":" We study the problem of contextual dynamic pricing with a linear demand\nmodel. We propose a novel localized exploration-then-commit (LetC) algorithm\nwhich starts with a pure exploration stage, followed by a refinement stage that\nexplores near the learned optimal pricing policy, and finally enters a pure\nexploitation stage. The algorithm is shown to achieve a minimax optimal,\ndimension-free regret bound when the time horizon exceeds a polynomial of the\ncovariate dimension. Furthermore, we provide a general theoretical framework\nthat encompasses the entire time spectrum, demonstrating how to balance\nexploration and exploitation when the horizon is limited. The analysis is\npowered by a novel critical inequality that depicts the\nexploration-exploitation trade-off in dynamic pricing, mirroring its existing\ncounterpart for the bias-variance trade-off in regularized regression. 
Our\ntheoretical results are validated by extensive experiments on synthetic and\nreal-world data.\n","authors":["Jinhang Chai","Yaqi Duan","Jianqing Fan","Kaizheng Wang"],"pdf_url":"https://arxiv.org/pdf/2412.19252v1.pdf","comment":"60 pages, 9 figures"},{"id":"http://arxiv.org/abs/2412.19245v1","updated":"2024-12-26T15:01:24Z","published":"2024-12-26T15:01:24Z","title":"Sentiment trading with large language models","summary":" We investigate the efficacy of large language models (LLMs) in sentiment\nanalysis of U.S. financial news and their potential in predicting stock market\nreturns. We analyze a dataset comprising 965,375 news articles that span from\nJanuary 1, 2010, to June 30, 2023; we focus on the performance of various LLMs,\nincluding BERT, OPT, FINBERT, and the traditional Loughran-McDonald dictionary\nmodel, which has been a dominant methodology in the finance literature. The\nstudy documents a significant association between LLM scores and subsequent\ndaily stock returns. Specifically, OPT, which is a GPT-3 based LLM, shows the\nhighest accuracy in sentiment prediction with an accuracy of 74.4%, slightly\nahead of BERT (72.5%) and FINBERT (72.2%). In contrast, the Loughran-McDonald\ndictionary model demonstrates considerably lower effectiveness with only 50.1%\naccuracy. Regression analyses highlight a robust positive impact of OPT model\nscores on next-day stock returns, with coefficients of 0.274 and 0.254 in\ndifferent model specifications. BERT and FINBERT also exhibit predictive\nrelevance, though to a lesser extent. Notably, we do not observe a significant\nrelationship between the Loughran-McDonald dictionary model scores and stock\nreturns, challenging the efficacy of this traditional method in the current\nfinancial context. In portfolio performance, the long-short OPT strategy excels\nwith a Sharpe ratio of 3.05, compared to 2.11 for BERT and 2.07 for FINBERT\nlong-short strategies. 
Strategies based on the Loughran-McDonald dictionary\nyield the lowest Sharpe ratio of 1.23. Our findings emphasize the superior\nperformance of advanced LLMs, especially OPT, in financial market prediction\nand portfolio management, marking a significant shift in the landscape of\nfinancial analysis tools with implications to financial regulation and policy\nanalysis.\n","authors":["Kemal Kirtac","Guido Germano"],"pdf_url":"https://arxiv.org/pdf/2412.19245v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19241v1","updated":"2024-12-26T14:51:24Z","published":"2024-12-26T14:51:24Z","title":"Latenrgy: Model Agnostic Latency and Energy Consumption Prediction for\n Binary Classifiers","summary":" Machine learning systems increasingly drive innovation across scientific\nfields and industry, yet challenges in compute overhead, specifically during\ninference, limit their scalability and sustainability. Responsible AI\nguardrails, essential for ensuring fairness, transparency, and privacy, further\nexacerbate these computational demands. This study addresses critical gaps in\nthe literature, chiefly the lack of generalized predictive techniques for\nlatency and energy consumption, limited cross-comparisons of classifiers, and\nunquantified impacts of RAI guardrails on inference performance. Using Theory\nConstruction Methodology, this work constructed a model-agnostic theoretical\nframework for predicting latency and energy consumption in binary\nclassification models during inference. The framework synthesizes classifier\ncharacteristics, dataset properties, and RAI guardrails into a unified\nanalytical instrument. Two predictive equations are derived that capture the\ninterplay between these factors while offering generalizability across diverse\nclassifiers. The proposed framework provides foundational insights for\ndesigning efficient, responsible ML systems. 
It enables researchers to\nbenchmark and optimize inference performance and assists practitioners in\ndeploying scalable solutions. Finally, this work establishes a theoretical\nfoundation for balancing computational efficiency with ethical AI principles,\npaving the way for future empirical validation and broader applications.\n","authors":["Jason M. Pittman"],"pdf_url":"https://arxiv.org/pdf/2412.19241v1.pdf","comment":"8 pages, 2 tables"},{"id":"http://arxiv.org/abs/2412.19238v1","updated":"2024-12-26T14:44:47Z","published":"2024-12-26T14:44:47Z","title":"FineVQ: Fine-Grained User Generated Content Video Quality Assessment","summary":" The rapid growth of user-generated content (UGC) videos has produced an\nurgent need for effective video quality assessment (VQA) algorithms to monitor\nvideo quality and guide optimization and recommendation procedures. However,\ncurrent VQA models generally only give an overall rating for a UGC video, which\nlacks fine-grained labels for serving video processing and recommendation\napplications. To address the challenges and promote the development of UGC\nvideos, we establish the first large-scale Fine-grained Video quality\nassessment Database, termed FineVD, which comprises 6104 UGC videos with\nfine-grained quality scores and descriptions across multiple dimensions. Based\non this database, we propose a Fine-grained Video Quality assessment (FineVQ)\nmodel to learn the fine-grained quality of UGC videos, with the capabilities of\nquality rating, quality scoring, and quality attribution. Extensive\nexperimental results demonstrate that our proposed FineVQ can produce\nfine-grained video-quality results and achieve state-of-the-art performance on\nFineVD and other commonly used UGC-VQA datasets. 
Both FineVD and FineVQ\nwill be made publicly available.\n","authors":["Huiyu Duan","Qiang Hu","Jiarui Wang","Liu Yang","Zitong Xu","Lu Liu","Xiongkuo Min","Chunlei Cai","Tianxiao Ye","Xiaoyun Zhang","Guangtao Zhai"],"pdf_url":"https://arxiv.org/pdf/2412.19238v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19237v1","updated":"2024-12-26T14:40:38Z","published":"2024-12-26T14:40:38Z","title":"SeaMo: A Multi-Seasonal and Multimodal Remote Sensing Foundation Model","summary":" Remote Sensing (RS) data contains a wealth of multi-dimensional information\ncrucial for Earth observation. Owing to its vast volume, diverse sources, and\ntemporal properties, RS data is highly suitable for the development of large\nVisual Foundation Models (VFMs). VFMs act as robust feature extractors,\nlearning from extensive RS data, and are subsequently fine-tuned for deployment\nin various geoscientific tasks. However, current VFMs in the RS domain are\npredominantly pretrained and tailored exclusively for specific characteristics\nof RS imagery, neglecting the potential of utilizing the multi-dimensional\nproperties of RS data. Therefore, in this work, we propose SeaMo, a pioneering\nvisual foundation model that integrates multi-seasonal and multimodal\ninformation in the RS field. SeaMo is designed to harness multiple properties\nof RS data. Within the masked image modeling framework, we employ non-aligned\ncropping techniques to extract spatial properties, use multi-source inputs for\nmultimodal integration, and incorporate temporal-multimodal fusion blocks for\neffective assimilation of multi-seasonal data. SeaMo explicitly models the\nmulti-dimensional properties of RS data, making the model more comprehensive,\nrobust, and versatile. We applied SeaMo to several downstream geoscience tasks,\nwhich demonstrated exceptional performance. 
Extensive ablation studies were\nconducted to validate the model's superiority.\n","authors":["Xuyang Li","Danfeng Hong","Chenyu Li","Jocelyn Chanussot"],"pdf_url":"https://arxiv.org/pdf/2412.19237v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19235v1","updated":"2024-12-26T14:30:54Z","published":"2024-12-26T14:30:54Z","title":"Are Two Hidden Layers Still Enough for the Physics-Informed Neural\n Networks?","summary":" The article discusses the development of various methods and techniques for\ninitializing and training neural networks with a single hidden layer, as well\nas training a separable physics-informed neural network consisting of neural\nnetworks with a single hidden layer to solve physical problems described by\nordinary differential equations (ODEs) and partial differential equations\n(PDEs). A method for strictly deterministic initialization of a neural network\nwith one hidden layer for solving physical problems described by an ODE is\nproposed. Modifications to existing methods for weighting the loss function are\ngiven, as well as new methods developed for training strictly\ndeterministic-initialized neural networks to solve ODEs (detaching, additional\nweighting based on the second derivative, predicted solution-based weighting,\nrelative residuals). An algorithm for physics-informed data-driven\ninitialization of a neural network with one hidden layer is proposed. A neural\nnetwork with pronounced generalizing properties is presented, whose\ngeneralizing abilities can be precisely controlled by adjusting\nnetwork parameters. A metric for measuring the generalization of such a neural\nnetwork has been introduced. A gradient-free neuron-by-neuron fitting method\nhas been developed for adjusting the parameters of a single-hidden-layer neural\nnetwork, which does not require the use of an optimizer or solver for its\nimplementation. 
The proposed methods have been extended to 2D problems using\nthe separable physics-informed neural networks approach. Numerous experiments\nhave been carried out to develop the above methods and approaches. Experiments\non physical problems, such as solving various ODEs and PDEs, have demonstrated\nthat these methods for initializing and training neural networks with one or\ntwo hidden layers (SPINN) achieve competitive accuracy and, in some cases,\nstate-of-the-art results.\n","authors":["Vasiliy A. Es'kin","Alexey O. Malkhanov","Mikhail E. Smorkalov"],"pdf_url":"https://arxiv.org/pdf/2412.19235v1.pdf","comment":"45 pages, 36 figures, 9 tables"},{"id":"http://arxiv.org/abs/2412.19229v1","updated":"2024-12-26T14:16:15Z","published":"2024-12-26T14:16:15Z","title":"Virtual Nodes Can Help: Tackling Distribution Shifts in Federated Graph\n Learning","summary":" Federated Graph Learning (FGL) enables multiple clients to jointly train\npowerful graph learning models, e.g., Graph Neural Networks (GNNs), without\nsharing their local graph data for graph-related downstream tasks, such as\ngraph property prediction. In the real world, however, the graph data can\nsuffer from significant distribution shifts across clients as the clients may\ncollect their graph data for different purposes. In particular, graph\nproperties are usually associated with invariant label-relevant substructures\n(i.e., subgraphs) across clients, while label-irrelevant substructures can\nappear in a client-specific manner. The issue of distribution shifts of graph\ndata hinders the efficiency of GNN training and leads to serious performance\ndegradation in FGL. To tackle the aforementioned issue, we propose a novel FGL\nframework entitled FedVN that eliminates distribution shifts through\nclient-specific graph augmentation strategies with multiple learnable Virtual\nNodes (VNs). Specifically, FedVN lets the clients jointly learn a set of shared\nVNs while training a global GNN model. 
To eliminate distribution shifts, each\nclient trains a personalized edge generator that determines how the VNs connect\nlocal graphs in a client-specific manner. Furthermore, we provide theoretical\nanalyses indicating that FedVN can eliminate distribution shifts of graph data\nacross clients. Comprehensive experiments on four datasets under five settings\ndemonstrate the superiority of our proposed FedVN over nine baselines.\n","authors":["Xingbo Fu","Zihan Chen","Yinhan He","Song Wang","Binchi Zhang","Chen Chen","Jundong Li"],"pdf_url":"https://arxiv.org/pdf/2412.19229v1.pdf","comment":"Accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2412.19228v1","updated":"2024-12-26T14:09:16Z","published":"2024-12-26T14:09:16Z","title":"Learning Cross-Domain Representations for Transferable Drug\n Perturbations on Single-Cell Transcriptional Responses","summary":" Phenotypic drug discovery has attracted widespread attention because of its\npotential to identify bioactive molecules. Transcriptomic profiling provides a\ncomprehensive reflection of phenotypic changes in cellular responses to\nexternal perturbations. In this paper, we propose XTransferCDR, a novel\ngenerative framework designed for feature decoupling and transferable\nrepresentation learning across domains. Given a pair of perturbed expression\nprofiles, our approach decouples the perturbation representations from basal\nstates through domain separation encoders and then cross-transfers them in the\nlatent space. The transferred representations are then used to reconstruct the\ncorresponding perturbed expression profiles via a shared decoder. This\ncross-transfer constraint effectively promotes the learning of transferable\ndrug perturbation representations. We conducted extensive evaluations of our\nmodel on multiple datasets, including single-cell transcriptional responses to\ndrugs and single- and combinatorial genetic perturbations. 
The experimental\nresults show that XTransferCDR achieved better performance than current\nstate-of-the-art methods, showcasing its potential to advance phenotypic drug\ndiscovery.\n","authors":["Hui Liu","Shikai Jin"],"pdf_url":"https://arxiv.org/pdf/2412.19228v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19227v1","updated":"2024-12-26T14:05:51Z","published":"2024-12-26T14:05:51Z","title":"Multi-view Fake News Detection Model Based on Dynamic Hypergraph","summary":" With the rapid development of online social networks and the inadequacies in\ncontent moderation mechanisms, the detection of fake news has emerged as a\npressing concern for the public. Various methods have been proposed for fake\nnews detection, including text-based approaches as well as a series of\ngraph-based approaches. However, the deceptive nature of fake news renders\ntext-based approaches less effective. Propagation tree-based methods focus on\nthe propagation process of individual news, capturing pairwise relationships\nbut lacking the capability to capture high-order complex relationships. Large\nheterogeneous graph-based approaches necessitate the incorporation of\nsubstantial additional information beyond news text and user data, while\nhypergraph-based approaches rely on predefined hypergraph structures. To tackle\nthese issues, we propose a novel dynamic hypergraph-based multi-view fake news\ndetection model (DHy-MFND) that learns news embeddings across three distinct\nviews: text-level, propagation tree-level, and hypergraph-level. By employing\nhypergraph structures to model complex high-order relationships among multiple\nnews pieces and introducing dynamic hypergraph structure learning, we optimize\npredefined hypergraph structures while learning news embeddings. Additionally,\nwe introduce contrastive learning to capture authenticity-relevant embeddings\nacross different views. 
Extensive experiments on two benchmark datasets\ndemonstrate the effectiveness of our proposed DHy-MFND compared with a broad\nrange of competing baselines.\n","authors":["Rongping Ye","Xiaobing Pei"],"pdf_url":"https://arxiv.org/pdf/2412.19227v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19226v1","updated":"2024-12-26T14:05:14Z","published":"2024-12-26T14:05:14Z","title":"VINEVI: A Virtualized Network Vision Architecture for Smart Monitoring\n of Heterogeneous Applications and Infrastructures","summary":" Monitoring heterogeneous infrastructures and applications is essential to\ncope with user requirements properly, but it still lacks enhancements. The\nwell-known state-of-the-art methods and tools do not support seamless\nmonitoring of bare-metal, low-cost infrastructures, nor of hosted or\nvirtualized services with fine-grained detail. This work proposes the VIrtualized\nNEtwork VIsion architecture (VINEVI), an intelligent method for seamlessly\nmonitoring heterogeneous infrastructures and applications. The VINEVI\narchitecture advances the state of the art with a node-embedded traffic\nclassification agent placed in physical and virtualized infrastructures,\nenabling real-time traffic classification. VINEVI combines this real-time traffic\nclassification with well-known tools such as Prometheus and Victoria Metrics to\nmonitor the entire stack from the hardware to the virtualized applications.\nExperimental results showed that the VINEVI architecture allowed seamless\nheterogeneous infrastructure monitoring with a higher level of detail than\nprior work. Also, our node-embedded real-time Internet traffic classifier\nflexibly extends existing methods for seamlessly monitoring heterogeneous\ninfrastructures.\n","authors":["Rodrigo Moreira","Hugo G. V. O. da Cunha","Larissa F. 
Rodrigues Moreira","Flávio de Oliveira Silva"],"pdf_url":"https://arxiv.org/pdf/2412.19226v1.pdf","comment":"12 pages"},{"id":"http://arxiv.org/abs/2410.04739v3","updated":"2024-12-26T13:58:31Z","published":"2024-10-07T04:15:02Z","title":"TableRAG: Million-Token Table Understanding with Language Models","summary":" Recent advancements in language models (LMs) have notably enhanced their\nability to reason with tabular data, primarily through program-aided mechanisms\nthat manipulate and analyze tables. However, these methods often require the\nentire table as input, leading to scalability challenges due to the positional\nbias or context length constraints. In response to these challenges, we\nintroduce TableRAG, a Retrieval-Augmented Generation (RAG) framework\nspecifically designed for LM-based table understanding. TableRAG leverages\nquery expansion combined with schema and cell retrieval to pinpoint crucial\ninformation before providing it to the LMs. This enables more efficient data\nencoding and precise retrieval, significantly reducing prompt lengths and\nmitigating information loss. We have developed two new million-token benchmarks\nfrom the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG's\neffectiveness at scale. 
Our results demonstrate that TableRAG's retrieval\ndesign achieves the highest retrieval quality, leading to the new\nstate-of-the-art performance on large-scale table understanding.\n","authors":["Si-An Chen","Lesly Miculicich","Julian Martin Eisenschlos","Zifeng Wang","Zilong Wang","Yanfei Chen","Yasuhisa Fujii","Hsuan-Tien Lin","Chen-Yu Lee","Tomas Pfister"],"pdf_url":"https://arxiv.org/pdf/2410.04739v3.pdf","comment":"Accepted to NeurIPS 2024"},{"id":"http://arxiv.org/abs/2402.04046v2","updated":"2024-12-26T13:57:25Z","published":"2024-02-06T14:48:34Z","title":"Reviving Life on the Edge: Joint Score-Based Graph Generation of Rich\n Edge Attributes","summary":" Graph generation is integral to various engineering and scientific\ndisciplines. Nevertheless, existing methodologies tend to overlook the\ngeneration of edge attributes. However, we identify critical applications where\nedge attributes are essential, making prior methods potentially unsuitable in\nsuch contexts. Moreover, while trivial adaptations are available, empirical\ninvestigations reveal their limited efficacy as they do not properly model the\ninterplay among graph components. To address this, we propose a joint\nscore-based model of nodes and edges for graph generation that considers all\ngraph components. Our approach offers three key novelties: \\textbf{(1)} node\nand edge attributes are combined in an attention module that generates samples\nbased on the two ingredients, \\textbf{(2)} node, edge and adjacency information\nare mutually dependent during the graph diffusion process, and \\textbf{(3)} the\nframework enables the generation of graphs with rich attributes along the\nedges, providing a more expressive formulation for generative tasks than\nexisting works. We evaluate our method on challenging benchmarks involving\nreal-world and synthetic datasets in which edge features are crucial.\nAdditionally, we introduce a new synthetic dataset that incorporates edge\nvalues. 
Furthermore, we propose a novel application that greatly benefits from\nthe method due to its nature: the generation of traffic scenes represented as\ngraphs. Our method outperforms other graph generation methods, demonstrating a\nsignificant advantage in edge-related measures.\n","authors":["Nimrod Berman","Eitan Kosman","Dotan Di Castro","Omri Azencot"],"pdf_url":"https://arxiv.org/pdf/2402.04046v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19217v1","updated":"2024-12-26T13:47:04Z","published":"2024-12-26T13:47:04Z","title":"Applying the maximum entropy principle to multi-species neural networks\n improves species distribution models","summary":" The rapid expansion of citizen science initiatives has led to a significant\ngrowth of biodiversity databases, and particularly presence-only (PO)\nobservations. PO data are invaluable for understanding species distributions\nand their dynamics, but their use in Species Distribution Models (SDM) is\ncurtailed by sampling biases and the lack of information on absences. Poisson\npoint processes are widely used for SDMs, with Maxent being one of the most\npopular methods. Maxent maximises the entropy of a probability distribution\nacross sites as a function of predefined transformations of environmental\nvariables, called features. In contrast, neural networks and deep learning have\nemerged as a promising technique for automatic feature extraction from complex\ninput variables. In this paper, we propose DeepMaxent, which harnesses neural\nnetworks to automatically learn shared features among species, using the\nmaximum entropy principle. To do so, it employs a normalised Poisson loss where\nfor each species, presence probabilities across sites are modelled by a neural\nnetwork. 
We evaluate DeepMaxent on a benchmark dataset known for its spatial\nsampling biases, using PO data for calibration and presence-absence (PA) data\nfor validation across six regions with different biological groups and\nenvironmental covariates. Our results indicate that DeepMaxent improves model\nperformance over Maxent and other state-of-the-art SDMs across regions and\ntaxonomic groups. The method performs particularly well in regions of uneven\nsampling, demonstrating substantial potential to improve species distribution\nmodelling. The method opens the possibility to learn more robust environmental\nfeatures predicting jointly many species and scales to arbitrary large numbers\nof sites without an increased memory demand.\n","authors":["Maxime Ryckewaert","Diego Marcos","Christophe Botella","Maximilien Servajean","Pierre Bonnet","Alexis Joly"],"pdf_url":"https://arxiv.org/pdf/2412.19217v1.pdf","comment":"Submitted to Methods in Ecology and Evolution"},{"id":"http://arxiv.org/abs/2412.19215v1","updated":"2024-12-26T13:36:18Z","published":"2024-12-26T13:36:18Z","title":"Optimizing Fantasy Sports Team Selection with Deep Reinforcement\n Learning","summary":" Fantasy sports, particularly fantasy cricket, have garnered immense\npopularity in India in recent years, offering enthusiasts the opportunity to\nengage in strategic team-building and compete based on the real-world\nperformance of professional athletes. In this paper, we address the challenge\nof optimizing fantasy cricket team selection using reinforcement learning (RL)\ntechniques. By framing the team creation process as a sequential\ndecision-making problem, we aim to develop a model that can adaptively select\nplayers to maximize the team's potential performance. Our approach leverages\nhistorical player data to train RL algorithms, which then predict future\nperformance and optimize team composition. 
This not only represents a huge\nbusiness opportunity by enabling more accurate predictions of high-performing\nteams but also enhances the overall user experience. Through empirical\nevaluation and comparison with traditional fantasy team drafting methods, we\ndemonstrate the effectiveness of RL in constructing competitive fantasy teams.\nOur results show that RL-based strategies provide valuable insights into player\nselection in fantasy sports.\n","authors":["Shamik Bhattacharjee","Kamlesh Marathe","Hitesh Kapoor","Nilesh Patil"],"pdf_url":"https://arxiv.org/pdf/2412.19215v1.pdf","comment":"8 Pages including references, Accepted to CODS-COMAD 2024 conference"},{"id":"http://arxiv.org/abs/2412.19212v1","updated":"2024-12-26T13:23:37Z","published":"2024-12-26T13:23:37Z","title":"Towards Better Spherical Sliced-Wasserstein Distance Learning with\n Data-Adaptive Discriminative Projection Direction","summary":" Spherical Sliced-Wasserstein (SSW) has recently been proposed to measure the\ndiscrepancy between spherical data distributions in various fields, such as\ngeology, medical domains, computer vision, and deep representation learning.\nHowever, in the original SSW, all projection directions are treated equally,\nwhich is too idealistic and cannot accurately reflect the importance of\ndifferent projection directions for various data distributions. To address this\nissue, we propose a novel data-adaptive Discriminative Spherical\nSliced-Wasserstein (DSSW) distance, which utilizes a projected energy function\nto determine the discriminative projection direction for SSW. In our new DSSW,\nwe introduce two types of projected energy functions to generate the weights\nfor projection directions with complete theoretical guarantees. The first type\nemploys a non-parametric deterministic function that transforms the projected\nWasserstein distance into its corresponding weight in each projection\ndirection. 
This improves the performance of the original SSW distance with\nnegligible additional computational overhead. The second type utilizes a neural\nnetwork-induced function that learns the projection direction weight through a\nparameterized neural network based on data projections. This further enhances\nthe performance of the original SSW distance with less extra computational\noverhead. Finally, we evaluate the performance of our proposed DSSW by\ncomparing it with several state-of-the-art methods across a variety of machine\nlearning tasks, including gradient flows, density estimation on real earth\ndata, and self-supervised learning.\n","authors":["Hongliang Zhang","Shuo Chen","Lei Luo","Jian Yang"],"pdf_url":"https://arxiv.org/pdf/2412.19212v1.pdf","comment":"Accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2412.19211v1","updated":"2024-12-26T13:21:09Z","published":"2024-12-26T13:21:09Z","title":"Large Language Models Meet Graph Neural Networks: A Perspective of Graph\n Mining","summary":" Graph mining is an important area in data mining and machine learning that\ninvolves extracting valuable information from graph-structured data. In recent\nyears, significant progress has been made in this field through the development\nof graph neural networks (GNNs). However, GNNs are still deficient in\ngeneralizing to diverse graph data. To address this issue, Large Language Models\n(LLMs) could provide new solutions for graph mining tasks with their superior\nsemantic understanding. 
In this review, we systematically review the\ncombination and application techniques of LLMs and GNNs and present a novel\ntaxonomy for research in this interdisciplinary field, which involves three\nmain categories: GNN-driving-LLM, LLM-driving-GNN, and GNN-LLM-co-driving.\nWithin this framework, we reveal the capabilities of LLMs in enhancing graph\nfeature extraction as well as improving the effectiveness of downstream tasks\nsuch as node classification, link prediction, and community detection. Although\nLLMs have demonstrated their great potential in handling graph-structured data,\ntheir high computational requirements and complexity remain challenges. Future\nresearch needs to continue to explore how to efficiently fuse LLMs and GNNs to\nachieve more powerful graph learning and reasoning capabilities and provide new\nimpetus for the development of graph mining techniques.\n","authors":["Yuxin You","Zhen Liu","Xiangchao Wen","Yongtao Zhang","Wei Ai"],"pdf_url":"https://arxiv.org/pdf/2412.19211v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19209v1","updated":"2024-12-26T13:19:26Z","published":"2024-12-26T13:19:26Z","title":"Context-Aware Deep Learning for Multi Modal Depression Detection","summary":" In this study, we focus on automated approaches to detect depression from\nclinical interviews using multi-modal machine learning (ML). Our approach\ndifferentiates from other successful ML methods such as context-aware analysis\nthrough feature engineering and end-to-end deep neural networks for depression\ndetection utilizing the Distress Analysis Interview Corpus. We propose a novel\nmethod that incorporates: (1) pre-trained Transformer combined with data\naugmentation based on topic modelling for textual data; and (2) deep 1D\nconvolutional neural network (CNN) for acoustic feature modeling. The\nsimulation results demonstrate the effectiveness of the proposed method for\ntraining multi-modal deep learning models. 
Our deep 1D CNN and Transformer\nmodels achieved state-of-the-art performance for audio and text modalities\nrespectively. Combining them in a multi-modal framework also outperforms\nstate-of-the-art for the combined setting. Code available at\nhttps://github.com/genandlam/multi-modal-depression-detection\n","authors":["Genevieve Lam","Huang Dongyan","Weisi Lin"],"pdf_url":"https://arxiv.org/pdf/2412.19209v1.pdf","comment":"Presented as an Oral at International Conference on Acoustics, Speech\n and Signal Processing 2019, United Kingdom"},{"id":"http://arxiv.org/abs/2412.19208v1","updated":"2024-12-26T13:18:16Z","published":"2024-12-26T13:18:16Z","title":"Developing Explainable Machine Learning Model using Augmented Concept\n Activation Vector","summary":" Machine learning models use high dimensional feature spaces to map their\ninputs to the corresponding class labels. However, these features often do not\nhave a one-to-one correspondence with physical concepts understandable by\nhumans, which hinders the ability to provide a meaningful explanation for the\ndecisions made by these models. We propose a method for measuring the\ncorrelation between high-level concepts and the decisions made by a machine\nlearning model. Our method can isolate the impact of a given high-level concept\nand accurately measure it quantitatively. Additionally, this study aims to\ndetermine the prevalence of frequent patterns in machine learning models, which\noften occur in imbalanced datasets. We have successfully applied the proposed\nmethod to fundus images and managed to quantitatively measure the impact of\nradiomic patterns on the model decisions.\n","authors":["Reza Hassanpour","Kasim Oztoprak","Niels Netten","Tony Busker","Mortaza S. 
Bargh","Sunil Choenni","Beyza Kizildag","Leyla Sena Kilinc"],"pdf_url":"https://arxiv.org/pdf/2412.19208v1.pdf","comment":"11 pages, 8 figures, \"to be published in the journal of Computer\n Science\""},{"id":"http://arxiv.org/abs/2402.15351v2","updated":"2024-12-26T13:14:16Z","published":"2024-02-23T14:38:19Z","title":"AutoMMLab: Automatically Generating Deployable Models from Language\n Instructions for Computer Vision Tasks","summary":" Automated machine learning (AutoML) is a collection of techniques designed to\nautomate the machine learning development process. While traditional AutoML\napproaches have been successfully applied in several critical steps of model\ndevelopment (e.g. hyperparameter optimization), there is no AutoML system\nthat automates the entire end-to-end model production workflow for computer\nvision. To fill this gap, we propose a novel request-to-model task, which\ninvolves understanding the user's natural language request and executing the\nentire workflow to output production-ready models. This empowers non-expert\nindividuals to easily build task-specific models via a user-friendly language\ninterface. To facilitate development and evaluation, we develop a new\nexperimental platform called AutoMMLab and a new benchmark called LAMP for\nstudying key components in the end-to-end request-to-model pipeline.\nHyperparameter optimization (HPO) is one of the most important components for\nAutoML. Traditional approaches mostly rely on trial-and-error, leading to\ninefficient parameter search. To solve this problem, we propose a novel\nLLM-based HPO algorithm, called HPO-LLaMA. Equipped with extensive knowledge\nand experience in model hyperparameter tuning, HPO-LLaMA achieves significant\nimprovement of HPO efficiency. 
Dataset and code are available at\nhttps://github.com/yang-ze-kang/AutoMMLab.\n","authors":["Zekang Yang","Wang Zeng","Sheng Jin","Chen Qian","Ping Luo","Wentao Liu"],"pdf_url":"https://arxiv.org/pdf/2402.15351v2.pdf","comment":"Accepted by AAAI2025"},{"id":"http://arxiv.org/abs/2211.10724v3","updated":"2024-12-26T13:10:25Z","published":"2022-11-19T15:40:26Z","title":"Deep Smart Contract Intent Detection","summary":" In recent years, research in software security has concentrated on\nidentifying vulnerabilities in smart contracts to prevent significant losses of\ncrypto assets on blockchains. Despite early successes in this area, detecting\ndevelopers' intents in smart contracts has become a more pressing issue, as\nmalicious intents have caused substantial financial losses. Unfortunately,\nexisting research lacks effective methods for detecting development intents in\nsmart contracts.\n To address this gap, we propose \\textsc{SmartIntentNN} (Smart Contract Intent\nNeural Network), a deep learning model designed to automatically detect\ndevelopment intents in smart contracts. \\textsc{SmartIntentNN} leverages a\npre-trained sentence encoder to generate contextual representations of smart\ncontracts, employs a K-means clustering model to identify and highlight\nprominent intent features, and utilizes a bidirectional LSTM-based deep neural\nnetwork for multi-label classification.\n We trained and evaluated \\textsc{SmartIntentNN} on a dataset containing over\n40,000 real-world smart contracts, employing self-comparison baselines in our\nexperimental setup. 
The results show that \\textsc{SmartIntentNN} achieves an\nF1-score of 0.8633 in identifying intents across 10 distinct categories,\noutperforming all baselines and addressing the gap in smart contract detection\nby incorporating intent analysis.\n","authors":["Youwei Huang","Sen Fang","Jianwen Li","Jiachun Tao","Bin Hu","Tao Zhang"],"pdf_url":"https://arxiv.org/pdf/2211.10724v3.pdf","comment":"12 pages, 8 figures, conference"},{"id":"http://arxiv.org/abs/2409.19078v2","updated":"2024-12-26T12:56:59Z","published":"2024-09-27T18:25:54Z","title":"Differential privacy enables fair and accurate AI-based analysis of\n speech disorders while protecting patient data","summary":" Speech pathology has impacts on communication abilities and quality of life.\nWhile deep learning-based models have shown potential in diagnosing these\ndisorders, the use of sensitive data raises critical privacy concerns. Although\ndifferential privacy (DP) has been explored in the medical imaging domain, its\napplication in pathological speech analysis remains largely unexplored despite\nthe equally critical privacy concerns. This study is the first to investigate\nDP's impact on pathological speech data, focusing on the trade-offs between\nprivacy, diagnostic accuracy, and fairness. Using a large, real-world dataset\nof 200 hours of recordings from 2,839 German-speaking participants, we observed\na maximum accuracy reduction of 3.85% when training with DP with high privacy\nlevels. To highlight real-world privacy risks, we demonstrated the\nvulnerability of non-private models to explicit gradient inversion attacks,\nreconstructing identifiable speech samples and showcasing DP's effectiveness in\nmitigating these risks. 
To generalize our findings across languages and\ndisorders, we validated our approach on a dataset of Spanish-speaking\nParkinson's disease patients, leveraging pretrained models from healthy\nEnglish-speaking datasets, and demonstrated that careful pretraining on\nlarge-scale task-specific datasets can maintain favorable accuracy under DP\nconstraints. A comprehensive fairness analysis revealed minimal gender bias at\nreasonable privacy levels but underscored the need for addressing age-related\ndisparities. Our results establish that DP can balance privacy and utility in\nspeech disorder detection, while highlighting unique challenges in\nprivacy-fairness trade-offs for speech data. This provides a foundation for\nrefining DP methodologies and improving fairness across diverse patient groups\nin real-world deployments.\n","authors":["Soroosh Tayebi Arasteh","Mahshad Lotfinia","Paula Andrea Perez-Toro","Tomas Arias-Vergara","Mahtab Ranji","Juan Rafael Orozco-Arroyave","Maria Schuster","Andreas Maier","Seung Hee Yang"],"pdf_url":"https://arxiv.org/pdf/2409.19078v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19201v1","updated":"2024-12-26T12:51:14Z","published":"2024-12-26T12:51:14Z","title":"GAIS: A Novel Approach to Instance Selection with Graph Attention\n Networks","summary":" Instance selection (IS) is a crucial technique in machine learning that aims\nto reduce dataset size while maintaining model performance. This paper\nintroduces a novel method called Graph Attention-based Instance Selection\n(GAIS), which leverages Graph Attention Networks (GATs) to identify the most\ninformative instances in a dataset. GAIS represents the data as a graph and\nuses GATs to learn node representations, enabling it to capture complex\nrelationships between instances. 
The method processes data in chunks, applies\nrandom masking and similarity thresholding during graph construction, and\nselects instances based on confidence scores from the trained GAT model.\nExperiments on 13 diverse datasets demonstrate that GAIS consistently\noutperforms traditional IS methods in terms of effectiveness, achieving high\nreduction rates (average 96\\%) while maintaining or improving model\nperformance. Although GAIS exhibits slightly higher computational costs, its\nsuperior performance in maintaining accuracy with significantly reduced\ntraining data makes it a promising approach for graph-based data selection.\n","authors":["Zahiriddin Rustamov","Ayham Zaitouny","Rafat Damseh","Nazar Zaki"],"pdf_url":"https://arxiv.org/pdf/2412.19201v1.pdf","comment":"Accepted at ICKG 2024. Code is available at\n https://github.com/zahiriddin-rustamov/gais"},{"id":"http://arxiv.org/abs/2412.19194v1","updated":"2024-12-26T12:25:04Z","published":"2024-12-26T12:25:04Z","title":"Provably Efficient Exploration in Reward Machines with Low Regret","summary":" We study reinforcement learning (RL) for decision processes with\nnon-Markovian reward, in which high-level knowledge of the task in the form of\nreward machines is available to the learner. We consider probabilistic reward\nmachines with initially unknown dynamics, and investigate RL under the\naverage-reward criterion, where the learning performance is assessed through\nthe notion of regret. Our main algorithmic contribution is a model-based RL\nalgorithm for decision processes involving probabilistic reward machines that\nis capable of exploiting the structure induced by such machines. We further\nderive high-probability and non-asymptotic bounds on its regret and demonstrate\nthe gain in terms of regret over existing algorithms that could be applied, but\nobliviously to the structure. We also present a regret lower bound for the\nstudied setting. 
To the best of our knowledge, the proposed algorithm\nconstitutes the first attempt to tailor and analyze regret specifically for RL\nwith probabilistic reward machines.\n","authors":["Hippolyte Bourel","Anders Jonsson","Odalric-Ambrym Maillard","Chenxiao Ma","Mohammad Sadegh Talebi"],"pdf_url":"https://arxiv.org/pdf/2412.19194v1.pdf","comment":"35 pages"},{"id":"http://arxiv.org/abs/2412.19191v1","updated":"2024-12-26T12:12:23Z","published":"2024-12-26T12:12:23Z","title":"Biology Instructions: A Dataset and Benchmark for Multi-Omics Sequence\n Understanding Capability of Large Language Models","summary":" Large language models have already demonstrated their formidable capabilities\nin general domains, ushering in a revolutionary transformation. However,\nexploring and exploiting the extensive knowledge of these models to comprehend\nmulti-omics biology remains underexplored. To fill this research gap, we first\nintroduce Biology-Instructions, the first large-scale multi-omics biological\nsequences-related instruction-tuning dataset including DNA, RNA, proteins, and\nmulti-molecules, designed to bridge the gap between large language models\n(LLMs) and complex biological sequences-related tasks. This dataset can enhance\nthe versatility of LLMs by integrating diverse biological sequence-based\nprediction tasks with advanced reasoning capabilities, while maintaining\nconversational fluency. Additionally, we reveal significant performance\nlimitations in even state-of-the-art LLMs on biological sequence-related\nmulti-omics tasks without specialized pre-training and instruction-tuning. We\nfurther develop a strong baseline called ChatMultiOmics with a novel\nthree-stage training pipeline, demonstrating the powerful ability to understand\nbiology by using Biology-Instructions. 
Biology-Instructions and ChatMultiOmics\nare publicly available and crucial resources for enabling more effective\nintegration of LLMs with multi-omics sequence analysis.\n","authors":["Haonan He","Yuchen Ren","Yining Tang","Ziyang Xu","Junxian Li","Minghao Yang","Di Zhang","Dong Yuan","Tao Chen","Shufei Zhang","Yuqiang Li","Nanqing Dong","Wanli Ouyang","Dongzhan Zhou","Peng Ye"],"pdf_url":"https://arxiv.org/pdf/2412.19191v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19189v1","updated":"2024-12-26T11:57:54Z","published":"2024-12-26T11:57:54Z","title":"An End-to-End Depth-Based Pipeline for Selfie Image Rectification","summary":" Portraits or selfie images taken from a close distance typically suffer from\nperspective distortion. In this paper, we propose an end-to-end deep\nlearning-based rectification pipeline to mitigate the effects of perspective\ndistortion. We learn to predict the facial depth by training a deep CNN. The\nestimated depth is utilized to adjust the camera-to-subject distance by moving\nthe camera farther, increasing the camera focal length, and reprojecting the 3D\nimage features to the new perspective. The reprojected features are then fed to\nan inpainting module to fill in the missing pixels. We leverage a\ndifferentiable renderer to enable end-to-end training of our depth estimation\nand feature extraction nets to improve the rectified outputs. To boost the\nresults of the inpainting module, we incorporate an auxiliary module to predict\nthe horizontal movement of the camera which decreases the area that requires\nhallucination of challenging face parts such as ears. Unlike previous works, we\nprocess the full-frame input image at once without cropping the subject's face\nand processing it separately from the rest of the body, eliminating the need\nfor complex post-processing steps to attach the face back to the subject's\nbody. 
To train our network, we utilize the popular game engine Unreal Engine to\ngenerate a large synthetic face dataset containing various subjects, head\nposes, expressions, eyewear, clothes, and lighting. Quantitative and\nqualitative results show that our rectification pipeline outperforms previous\nmethods, and produces comparable results with a time-consuming 3D GAN-based\nmethod while being more than 260 times faster.\n","authors":["Ahmed Alhawwary","Phong Nguyen-Ha","Janne Mustaniemi","Janne Heikkilä"],"pdf_url":"https://arxiv.org/pdf/2412.19189v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.11144v3","updated":"2024-12-26T11:43:26Z","published":"2022-09-22T16:42:14Z","title":"Automatic and effective discovery of quantum kernels","summary":" Quantum computing can empower machine learning models by enabling kernel\nmachines to leverage quantum kernels for representing similarity measures\nbetween data. Quantum kernels are able to capture relationships in the data\nthat are not efficiently computable on classical devices. However, there is no\nstraightforward method to engineer the optimal quantum kernel for each specific\nuse case. We present an approach to this problem, which employs optimization\ntechniques, similar to those used in neural architecture search and AutoML, to\nautomatically find an optimal kernel in a heuristic manner. To this purpose we\ndefine an algorithm for constructing a quantum circuit implementing the\nsimilarity measure as a combinatorial object, which is evaluated based on a\ncost function and then iteratively modified using a meta-heuristic optimization\ntechnique. The cost function can encode many criteria ensuring favorable\nstatistical properties of the candidate solution, such as the rank of the\nDynamical Lie Algebra. Importantly, our approach is independent of the\noptimization technique employed. 
The results obtained by testing our approach\non a high-energy physics problem demonstrate that, in the best-case scenario,\nwe can either match or improve testing accuracy with respect to the manual\ndesign approach, showing the potential of our technique to deliver superior\nresults with reduced effort.\n","authors":["Massimiliano Incudini","Daniele Lizzio Bosco","Francesco Martini","Michele Grossi","Giuseppe Serra","Alessandra Di Pierro"],"pdf_url":"https://arxiv.org/pdf/2209.11144v3.pdf","comment":"Accepted into IEEE Transactions on Emerging Topics in Computational\n Intelligence"},{"id":"http://arxiv.org/abs/2412.19179v1","updated":"2024-12-26T11:35:57Z","published":"2024-12-26T11:35:57Z","title":"Mask Approximation Net: Merging Feature Extraction and Distribution\n Learning for Remote Sensing Change Captioning","summary":" Remote sensing image change description, as a novel multimodal task in the\nfield of remote sensing processing, not only enables the detection of changes\nin surface conditions but also provides detailed descriptions of these changes,\nthereby enhancing human interpretability and interactivity. However, previous\nmethods mainly employed Convolutional Neural Network (CNN) architectures to\nextract bitemporal image features. This approach often leads to an overemphasis\non designing specific network architectures and limits the captured feature\ndistributions to the current dataset, resulting in poor generalizability and\nrobustness when applied to other datasets or real-world scenarios. To address\nthese limitations, this paper proposes a novel approach for remote sensing\nimage change detection and description that integrates diffusion models, aiming\nto shift the focus from conventional feature learning paradigms to data\ndistribution learning. The proposed method primarily includes a simple\nmulti-scale change detection module, whose output features are subsequently\nrefined using a diffusion model. 
Additionally, we introduce a frequency-guided\ncomplex filter module to handle high-frequency noise during the diffusion\nprocess, which helps to maintain model performance. Finally, we validate the\neffectiveness of our proposed method on several remote sensing change detection\ndescription datasets, demonstrating its superior performance. The code is\navailable at MaskApproxNet.\n","authors":["Dongwei Sun","Xiangyong Cao"],"pdf_url":"https://arxiv.org/pdf/2412.19179v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19178v1","updated":"2024-12-26T11:32:00Z","published":"2024-12-26T11:32:00Z","title":"Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal\n Video-Text Retrieval","summary":" Cross-modal (e.g. image-text, video-text) retrieval is an important task in\ninformation retrieval and multimodal vision-language understanding field.\nTemporal understanding makes video-text retrieval more challenging than\nimage-text retrieval. However, we find that the widely used video-text\nbenchmarks have shortcomings in comprehensively assessing abilities of models,\nespecially in temporal understanding, such that large-scale image-text\npre-trained models can already achieve comparable zero-shot performance with\nvideo-text pre-trained models. In this paper, we introduce RTime, a novel\ntemporal-emphasized video-text retrieval dataset. We first obtain videos of\nactions or events with significant temporality, and then reverse these videos\nto create harder negative samples. We then recruit annotators to judge the\nsignificance and reversibility of candidate videos, and write captions for\nqualified videos. We further adopt GPT-4 to extend more captions based on\nhuman-written captions. Our RTime dataset currently consists of 21k videos with\n10 captions per video, totalling about 122 hours. Based on RTime, we propose\nthree retrieval benchmark tasks: RTime-Origin, RTime-Hard, and RTime-Binary. 
We\nfurther enhance the use of harder negatives in model training, and benchmark a\nvariety of video-text models on RTime. Extensive experimental analysis demonstrates\nthat RTime indeed poses new and higher challenges to video-text retrieval. We\nrelease our RTime\ndataset\\footnote{\\url{https://github.com/qyr0403/Reversed-in-Time}} to further\nadvance video-text retrieval and multimodal understanding research.\n","authors":["Yang Du","Yuqi Liu","Qin Jin"],"pdf_url":"https://arxiv.org/pdf/2412.19178v1.pdf","comment":"ACMMM 2024 poster"},{"id":"http://arxiv.org/abs/2412.19160v1","updated":"2024-12-26T10:40:15Z","published":"2024-12-26T10:40:15Z","title":"Dual Channel Multi-Attention in ViT for Biometric Authentication using\n Forehead Subcutaneous Vein Pattern and Periocular Pattern","summary":" Traditional biometric systems, like face and fingerprint recognition, have\nencountered significant setbacks due to wearing face masks and hygiene\nconcerns. To meet the challenges of the partially covered face due to face\nmasks and hygiene concerns of fingerprint recognition, this paper proposes a\nnovel dual-channel multi-attention Vision Transformer (ViT) framework for\nbiometric authentication using forehead subcutaneous vein patterns and\nperiocular patterns, offering a promising alternative to traditional methods,\ncapable of performing well even with face masks and without any physical touch.\nThe proposed framework leverages a dual-channel ViT architecture, designed to\nhandle two distinct biometric traits. It can capture long-range dependencies of\nindependent features from the vein and periocular patterns. A custom classifier\nis then designed to integrate the independently extracted features, producing a\nfinal class prediction. The performance of the proposed algorithm was\nrigorously evaluated using the Forehead Subcutaneous Vein Pattern and\nPeriocular Biometric Pattern (FSVP-PBP) database. 
The results demonstrated the\nsuperiority of the algorithm over state-of-the-art methods, achieving\nremarkable classification accuracy of $99.3 \\pm 0.02\\%$ with the combined vein\nand periocular patterns.\n","authors":["Arun K. Sharma","Shubhobrata Bhattacharya","Motahar Reza"],"pdf_url":"https://arxiv.org/pdf/2412.19160v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.13106v2","updated":"2024-12-26T10:15:54Z","published":"2024-12-17T17:22:52Z","title":"Active Reinforcement Learning Strategies for Offline Policy Improvement","summary":" Learning agents that excel at sequential decision-making tasks must\ncontinuously resolve the problem of exploration and exploitation for optimal\nlearning. However, such interactions with the environment online might be\nprohibitively expensive and may involve some constraints, such as a limited\nbudget for agent-environment interactions and restricted exploration in certain\nregions of the state space. Examples include selecting candidates for medical\ntrials and training agents in complex navigation environments. This problem\nnecessitates the study of active reinforcement learning strategies that collect\nminimal additional experience trajectories by reusing existing offline data\npreviously collected by some unknown behavior policy. In this work, we propose\nan active reinforcement learning method capable of collecting trajectories that\ncan augment existing offline data. With extensive experimentation, we\ndemonstrate that our proposed method reduces additional online interaction with\nthe environment by up to 75% over competitive baselines across various\ncontinuous control environments such as Gym-MuJoCo locomotion environments as\nwell as Maze2d, AntMaze, CARLA and IsaacSimGo1. 
To the best of our knowledge,\nthis is the first work that addresses the active learning problem in the\ncontext of sequential decision-making and reinforcement learning.\n","authors":["Ambedkar Dukkipati","Ranga Shaarad Ayyagari","Bodhisattwa Dasgupta","Parag Dutta","Prabhas Reddy Onteru"],"pdf_url":"https://arxiv.org/pdf/2412.13106v2.pdf","comment":"AAAI 2025"},{"id":"http://arxiv.org/abs/2412.19152v1","updated":"2024-12-26T10:12:08Z","published":"2024-12-26T10:12:08Z","title":"To Predict or Not To Predict? Proportionally Masked Autoencoders for\n Tabular Data Imputation","summary":" Masked autoencoders (MAEs) have recently demonstrated effectiveness in\ntabular data imputation. However, due to the inherent heterogeneity of tabular\ndata, the uniform random masking strategy commonly used in MAEs can disrupt the\ndistribution of missingness, leading to suboptimal performance. To address\nthis, we propose a proportional masking strategy for MAEs. Specifically, we\nfirst compute the statistics of missingness based on the observed proportions\nin the dataset, and then generate masks that align with these statistics,\nensuring that the distribution of missingness is preserved after masking.\nFurthermore, we argue that simple MLP-based token mixing offers competitive or\noften superior performance compared to attention mechanisms while being more\ncomputationally efficient, especially in the tabular domain with the inherent\nheterogeneity. Experimental results validate the effectiveness of the proposed\nproportional masking strategy across various missing data patterns in tabular\ndatasets. 
Code is available at: \\url{https://github.com/normal-kim/PMAE}.\n","authors":["Jungkyu Kim","Kibok Lee","Taeyoung Park"],"pdf_url":"https://arxiv.org/pdf/2412.19152v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.14398v2","updated":"2024-12-26T10:10:00Z","published":"2024-03-21T13:43:49Z","title":"Regularized Adaptive Momentum Dual Averaging with an Efficient Inexact\n Subproblem Solver for Training Structured Neural Network","summary":" We propose a Regularized Adaptive Momentum Dual Averaging (RAMDA) algorithm\nfor training structured neural networks. Similar to existing regularized\nadaptive methods, the subproblem for computing the update direction of RAMDA\ninvolves a nonsmooth regularizer and a diagonal preconditioner, and therefore\ndoes not possess a closed-form solution in general. We thus also carefully\ndevise an implementable inexactness condition that retains convergence\nguarantees similar to the exact versions, and propose a companion efficient\nsolver for the subproblems of both RAMDA and existing methods to make them\npractically feasible. We leverage the theory of manifold identification in\nvariational analysis to show that, even in the presence of such inexactness,\nthe iterates of RAMDA attain the ideal structure induced by the regularizer at\nthe stationary point of asymptotic convergence. This structure is locally\noptimal near the point of convergence, so RAMDA is guaranteed to obtain the\nbest structure possible among all methods converging to the same point, making\nit the first regularized adaptive method outputting models that possess\noutstanding predictive performance while being (locally) optimally structured.\nExtensive numerical experiments in large-scale modern computer vision, language\nmodeling, and speech tasks show that the proposed RAMDA is efficient and\nconsistently outperforms the state of the art for training structured neural\nnetworks. 
Implementation of our algorithm is available at\nhttps://www.github.com/ismoptgroup/RAMDA/.\n","authors":["Zih-Syuan Huang","Ching-pei Lee"],"pdf_url":"https://arxiv.org/pdf/2403.14398v2.pdf","comment":"NeurIPS 2024. 25 pages"},{"id":"http://arxiv.org/abs/2412.19139v1","updated":"2024-12-26T09:51:05Z","published":"2024-12-26T09:51:05Z","title":"PlanLLM: Video Procedure Planning with Refinable Large Language Models","summary":" Video procedure planning, i.e., planning a sequence of action steps given the\nvideo frames of start and goal states, is an essential ability for embodied AI.\nRecent works utilize Large Language Models (LLMs) to generate enriched action\nstep description texts to guide action step decoding. Although LLMs are\nintroduced, these methods decode the action steps into a closed-set of one-hot\nvectors, limiting the model's capability of generalizing to new steps or tasks.\nAdditionally, fixed action step descriptions based on world-level commonsense\nmay contain noise in specific instances of visual states. In this paper, we\npropose PlanLLM, a cross-modal joint learning framework with LLMs for video\nprocedure planning. We propose an LLM-Enhanced Planning module which fully uses\nthe generalization ability of LLMs to produce free-form planning output and to\nenhance action step decoding. We also propose a Mutual Information Maximization\nmodule to connect world-level commonsense of step descriptions and\nsample-specific information of visual states, enabling LLMs to employ the\nreasoning ability to generate step sequences. With the assistance of LLMs, our\nmethod can handle both closed-set and open vocabulary procedure planning tasks. 
Our\nPlanLLM achieves superior performance on three benchmarks, demonstrating the\neffectiveness of our designs.\n","authors":["Dejie Yang","Zijing Zhao","Yang Liu"],"pdf_url":"https://arxiv.org/pdf/2412.19139v1.pdf","comment":"accepted to AAAI2025"},{"id":"http://arxiv.org/abs/2412.19138v1","updated":"2024-12-26T09:41:36Z","published":"2024-12-26T09:41:36Z","title":"SUTrack: Towards Simple and Unified Single Object Tracking","summary":" In this paper, we propose a simple yet unified single object tracking (SOT)\nframework, dubbed SUTrack. It consolidates five SOT tasks (RGB-based,\nRGB-Depth, RGB-Thermal, RGB-Event, RGB-Language Tracking) into a unified model\ntrained in a single session. Due to the distinct nature of the data, current\nmethods typically design individual architectures and train separate models for\neach task. This fragmentation results in redundant training processes,\nrepetitive technological innovations, and limited cross-modal knowledge\nsharing. In contrast, SUTrack demonstrates that a single model with a unified\ninput representation can effectively handle various common SOT tasks,\neliminating the need for task-specific designs and separate training sessions.\nAdditionally, we introduce a task-recognition auxiliary training strategy and a\nsoft token type embedding to further enhance SUTrack's performance with minimal\noverhead. Experiments show that SUTrack outperforms previous task-specific\ncounterparts across 11 datasets spanning five SOT tasks. Moreover, we provide a\nrange of models catering to edge devices as well as high-performance GPUs,\nstriking a good trade-off between speed and accuracy. We hope SUTrack could\nserve as a strong foundation for further compelling research into unified\ntracking models. 
Code and models are available at\ngithub.com/chenxin-dlut/SUTrack.\n","authors":["Xin Chen","Ben Kang","Wanting Geng","Jiawen Zhu","Yi Liu","Dong Wang","Huchuan Lu"],"pdf_url":"https://arxiv.org/pdf/2412.19138v1.pdf","comment":"Accepted by AAAI 2025"}],"Multimedia":[{"id":"http://arxiv.org/abs/2412.12225v2","updated":"2024-12-26T19:23:17Z","published":"2024-12-16T10:03:44Z","title":"DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis","summary":" Multimodal Sentiment Analysis (MSA) leverages heterogeneous modalities, such\nas language, vision, and audio, to enhance the understanding of human\nsentiment. While existing models often focus on extracting shared information\nacross modalities or directly fusing heterogeneous modalities, such approaches\ncan introduce redundancy and conflicts due to equal treatment of all modalities\nand the mutual transfer of information between modality pairs. To address these\nissues, we propose a Disentangled-Language-Focused (DLF) multimodal\nrepresentation learning framework, which incorporates a feature disentanglement\nmodule to separate modality-shared and modality-specific information. To\nfurther reduce redundancy and enhance language-targeted features, four\ngeometric measures are introduced to refine the disentanglement process. A\nLanguage-Focused Attractor (LFA) is further developed to strengthen language\nrepresentation by leveraging complementary modality-specific information\nthrough a language-guided cross-attention mechanism. The framework also employs\nhierarchical predictions to improve overall accuracy. Extensive experiments on\ntwo popular MSA datasets, CMU-MOSI and CMU-MOSEI, demonstrate the significant\nperformance gains achieved by the proposed DLF framework. Comprehensive\nablation studies further validate the effectiveness of the feature\ndisentanglement module, language-focused attractor, and hierarchical\npredictions. 
Our code is available at https://github.com/pwang322/DLF.\n","authors":["Pan Wang","Qiang Zhou","Yawen Wu","Tianlong Chen","Jingtong Hu"],"pdf_url":"https://arxiv.org/pdf/2412.12225v2.pdf","comment":"AAAI 2025 accepted"},{"id":"http://arxiv.org/abs/2407.05551v2","updated":"2024-12-26T15:23:36Z","published":"2024-07-08T01:59:17Z","title":"Read, Watch and Scream! Sound Generation from Text and Video","summary":" Despite the impressive progress of multimodal generative models,\nvideo-to-audio generation still suffers from limited performance and limits the\nflexibility to prioritize sound synthesis for specific objects within the\nscene. Conversely, text-to-audio generation methods generate high-quality audio\nbut pose challenges in ensuring comprehensive scene depiction and time-varying\ncontrol. To tackle these challenges, we propose a novel video-and-text-to-audio\ngeneration method, called \\ours, where video serves as a conditional control\nfor a text-to-audio generation model. Especially, our method estimates the\nstructural information of sound (namely, energy) from the video while receiving\nkey content cues from a user prompt. We employ a well-performing text-to-audio\nmodel to consolidate the video control, which is much more efficient for\ntraining multimodal diffusion models with massive triplet-paired\n(audio-video-text) data. In addition, by separating the generative components\nof audio, it becomes a more flexible system that allows users to freely adjust\nthe energy, surrounding environment, and primary sound source according to\ntheir preferences. Experimental results demonstrate that our method shows\nsuperiority in terms of quality, controllability, and training efficiency. 
Code\nand demo are available at https://naver-ai.github.io/rewas.\n","authors":["Yujin Jeong","Yunji Kim","Sanghyuk Chun","Jiyoung Lee"],"pdf_url":"https://arxiv.org/pdf/2407.05551v2.pdf","comment":"AAAI2025, Project page: https://naver-ai.github.io/rewas"},{"id":"http://arxiv.org/abs/2412.19238v1","updated":"2024-12-26T14:44:47Z","published":"2024-12-26T14:44:47Z","title":"FineVQ: Fine-Grained User Generated Content Video Quality Assessment","summary":" The rapid growth of user-generated content (UGC) videos has produced an\nurgent need for effective video quality assessment (VQA) algorithms to monitor\nvideo quality and guide optimization and recommendation procedures. However,\ncurrent VQA models generally only give an overall rating for a UGC video, which\nlacks fine-grained labels for serving video processing and recommendation\napplications. To address the challenges and promote the development of UGC\nvideos, we establish the first large-scale Fine-grained Video quality\nassessment Database, termed FineVD, which comprises 6104 UGC videos with\nfine-grained quality scores and descriptions across multiple dimensions. Based\non this database, we propose a Fine-grained Video Quality assessment (FineVQ)\nmodel to learn the fine-grained quality of UGC videos, with the capabilities of\nquality rating, quality scoring, and quality attribution. Extensive\nexperimental results demonstrate that our proposed FineVQ can produce\nfine-grained video-quality results and achieve state-of-the-art performance on\nFineVD and other commonly used UGC-VQA datasets. 
Both FineVD and FineVQ\nwill be made publicly available.\n","authors":["Huiyu Duan","Qiang Hu","Jiarui Wang","Liu Yang","Zitong Xu","Lu Liu","Xiongkuo Min","Chunlei Cai","Tianxiao Ye","Xiaoyun Zhang","Guangtao Zhai"],"pdf_url":"https://arxiv.org/pdf/2412.19238v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19139v1","updated":"2024-12-26T09:51:05Z","published":"2024-12-26T09:51:05Z","title":"PlanLLM: Video Procedure Planning with Refinable Large Language Models","summary":" Video procedure planning, i.e., planning a sequence of action steps given the\nvideo frames of start and goal states, is an essential ability for embodied AI.\nRecent works utilize Large Language Models (LLMs) to generate enriched action\nstep description texts to guide action step decoding. Although LLMs are\nintroduced, these methods decode the action steps into a closed-set of one-hot\nvectors, limiting the model's capability of generalizing to new steps or tasks.\nAdditionally, fixed action step descriptions based on world-level commonsense\nmay contain noise in specific instances of visual states. In this paper, we\npropose PlanLLM, a cross-modal joint learning framework with LLMs for video\nprocedure planning. We propose an LLM-Enhanced Planning module which fully uses\nthe generalization ability of LLMs to produce free-form planning output and to\nenhance action step decoding. We also propose a Mutual Information Maximization\nmodule to connect world-level commonsense of step descriptions and\nsample-specific information of visual states, enabling LLMs to employ the\nreasoning ability to generate step sequences. With the assistance of LLMs, our\nmethod can handle both closed-set and open vocabulary procedure planning tasks. 
Our\nPlanLLM achieves superior performance on three benchmarks, demonstrating the\neffectiveness of our designs.\n","authors":["Dejie Yang","Zijing Zhao","Yang Liu"],"pdf_url":"https://arxiv.org/pdf/2412.19139v1.pdf","comment":"accepted to AAAI2025"},{"id":"http://arxiv.org/abs/2412.19133v1","updated":"2024-12-26T09:29:59Z","published":"2024-12-26T09:29:59Z","title":"A Rhetorical Relations-Based Framework for Tailored Multimedia Document\n Summarization","summary":" In the rapidly evolving landscape of digital content, the task of summarizing\nmultimedia documents, which encompass textual, visual, and auditory elements,\npresents intricate challenges. These challenges include extracting pertinent\ninformation from diverse formats, maintaining the structural integrity and\nsemantic coherence of the original content, and generating concise yet\ninformative summaries. This paper introduces a novel framework for multimedia\ndocument summarization that capitalizes on the inherent structure of the\ndocument to craft coherent and succinct summaries. Central to this framework is\nthe incorporation of a rhetorical structure for structural analysis, augmented\nby a graph-based representation to facilitate the extraction of pivotal\ninformation. Weighting algorithms are employed to assign significance values to\ndocument units, thereby enabling effective ranking and selection of relevant\ncontent. Furthermore, the framework is designed to accommodate user preferences\nand time constraints, ensuring the production of personalized and contextually\nrelevant summaries. The summarization process is elaborately delineated,\nencompassing document specification, graph construction, unit weighting, and\nsummary extraction, supported by illustrative examples and algorithmic\nelucidation. 
This proposed framework represents a significant advancement in\nautomatic summarization, with broad potential applications across multimedia\ndocument processing, promising transformative impacts in the field.\n","authors":["Azze-Eddine Maredj","Madjid Sadallah"],"pdf_url":"https://arxiv.org/pdf/2412.19133v1.pdf","comment":"10 pages, preprint"},{"id":"http://arxiv.org/abs/2412.19123v1","updated":"2024-12-26T08:47:13Z","published":"2024-12-26T08:47:13Z","title":"CoheDancers: Enhancing Interactive Group Dance Generation through\n Music-Driven Coherence Decomposition","summary":" Dance generation is crucial and challenging, particularly in domains like\ndance performance and virtual gaming. In the current body of literature, most\nmethodologies focus on Solo Music2Dance. While there are efforts directed\ntowards Group Music2Dance, these often suffer from a lack of coherence,\nresulting in aesthetically poor dance performances. Thus, we introduce\nCoheDancers, a novel framework for Music-Driven Interactive Group Dance\nGeneration. CoheDancers aims to enhance group dance generation coherence by\ndecomposing it into three key aspects: synchronization, naturalness, and\nfluidity. Correspondingly, we develop a Cycle Consistency based Dance\nSynchronization strategy to foster music-dance correspondences, an\nAuto-Regressive-based Exposure Bias Correction strategy to enhance the fluidity\nof the generated dances, and an Adversarial Training Strategy to augment the\nnaturalness of the group dance output. Collectively, these strategies enable\nCoheDancers to produce highly coherent group dances with superior quality.\nFurthermore, to establish better benchmarks for Group Music2Dance, we construct\nthe most diverse and comprehensive open-source dataset to date, I-Dancers,\nfeaturing rich dancer interactions, and create comprehensive evaluation\nmetrics. 
Experimental evaluations on I-Dancers and other extant datasets\nsubstantiate that CoheDancers achieves unprecedented state-of-the-art\nperformance. Code will be released.\n","authors":["Kaixing Yang","Xulong Tang","Haoyu Wu","Qinliang Xue","Biao Qin","Hongyan Liu","Zhaoxin Fan"],"pdf_url":"https://arxiv.org/pdf/2412.19123v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.14135v2","updated":"2024-12-26T08:33:31Z","published":"2024-11-21T14:01:33Z","title":"Compact Visual Data Representation for Green Multimedia -- A Human\n Visual System Perspective","summary":" The Human Visual System (HVS), with its intricate sophistication, is capable\nof achieving ultra-compact information compression for visual signals. This\nremarkable ability is coupled with high generalization capability and energy\nefficiency. By contrast, the state-of-the-art Versatile Video Coding (VVC)\nstandard achieves a compression ratio of around 1,000 times for raw visual\ndata. This notable disparity motivates the research community to draw\ninspiration to effectively handle the immense volume of visual data in a green\nway. Therefore, this paper provides a survey of how visual data can be\nefficiently represented for green multimedia, in particular when the ultimate\ntask is knowledge extraction instead of visual signal reconstruction. We\nintroduce recent research efforts that promote green, sustainable, and\nefficient multimedia in this field. 
Moreover, we discuss how the deep\nunderstanding of the HVS can benefit the research community, and envision the\ndevelopment of future green multimedia technologies.\n","authors":["Peilin Chen","Xiaohan Fang","Meng Wang","Shiqi Wang","Siwei Ma"],"pdf_url":"https://arxiv.org/pdf/2411.14135v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19009v1","updated":"2024-12-26T00:53:54Z","published":"2024-12-26T00:53:54Z","title":"FACEMUG: A Multimodal Generative and Fusion Framework for Local Facial\n Editing","summary":" Existing facial editing methods have achieved remarkable results, yet they\noften fall short in supporting multimodal conditional local facial editing. One\nclear indication is that their output image quality degrades\ndramatically after several iterations of incremental editing, as they do not\nsupport local editing. In this paper, we present a novel multimodal generative\nand fusion framework for globally-consistent local facial editing (FACEMUG)\nthat can handle a wide range of input modalities and enable fine-grained and\nsemantic manipulation while leaving unedited parts unchanged. Different\nmodalities, including sketches, semantic maps, color maps, exemplar images,\ntext, and attribute labels, are adept at conveying diverse conditioning\ndetails, and their combined synergy can provide more explicit guidance for the\nediting process. We thus integrate all modalities into a unified generative\nlatent space to enable multimodal local facial edits. Specifically, a novel\nmultimodal feature fusion mechanism is proposed by utilizing multimodal\naggregation and style fusion blocks to fuse facial priors and multimodalities\nin both latent and feature spaces. We further introduce a novel self-supervised\nlatent warping algorithm to rectify misaligned facial features, efficiently\ntransferring the pose of the edited image to the given latent codes. 
We\nevaluate our FACEMUG through extensive experiments and comparisons to\nstate-of-the-art (SOTA) methods. The results demonstrate the superiority of\nFACEMUG in terms of editing quality, flexibility, and semantic control, making\nit a promising solution for a wide range of local facial editing tasks.\n","authors":["Wanglong Lu","Jikai Wang","Xiaogang Jin","Xianta Jiang","Hanli Zhao"],"pdf_url":"https://arxiv.org/pdf/2412.19009v1.pdf","comment":"Published at IEEE Transactions on Visualization and Computer\n Graphics; 21 pages, 26 figures"}]},"2024-12-25T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2403.17919v4","updated":"2024-12-25T19:03:23Z","published":"2024-03-26T17:55:02Z","title":"LISA: Layerwise Importance Sampling for Memory-Efficient Large Language\n Model Fine-Tuning","summary":" The machine learning community has witnessed impressive advancements since\nlarge language models (LLMs) first appeared. Yet, their massive memory\nconsumption has become a significant roadblock to large-scale training. For\ninstance, a 7B model typically requires at least 60 GB of GPU memory with full\nparameter training, which presents challenges for researchers without access to\nhigh-resource environments. Parameter Efficient Fine-Tuning techniques such as\nLow-Rank Adaptation (LoRA) have been proposed to alleviate this problem.\nHowever, in most large-scale fine-tuning settings, their performance does not\nreach the level of full parameter training because they confine the parameter\nsearch to a low-rank subspace. Attempting to complement this deficiency, we\ninvestigate the layerwise properties of LoRA on fine-tuning tasks and observe\nan unexpected but consistent skewness of weight norms across different layers.\nUtilizing this key observation, a surprisingly simple training strategy is\ndiscovered, which outperforms both LoRA and full parameter training in a wide\nrange of settings with memory costs as low as LoRA. 
We name it Layerwise\nImportance Sampled AdamW (LISA), a promising alternative for LoRA, which\napplies the idea of importance sampling to different layers in LLMs and\nrandomly freezes most middle layers during optimization. Experimental results\nshow that with similar or less GPU memory consumption, LISA surpasses LoRA or\neven full parameter tuning in downstream fine-tuning tasks, where LISA\nconsistently outperforms LoRA by over 10%-35% in terms of MT-Bench score while\nachieving on-par or better performance in MMLU, AGIEval and WinoGrande. On\nlarge models, specifically LLaMA-2-70B, LISA surpasses LoRA on MT-Bench, GSM8K,\nand PubMedQA, demonstrating its effectiveness across different domains.\n","authors":["Rui Pan","Xiang Liu","Shizhe Diao","Renjie Pi","Jipeng Zhang","Chi Han","Tong Zhang"],"pdf_url":"https://arxiv.org/pdf/2403.17919v4.pdf","comment":"NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.16145v2","updated":"2024-12-25T18:54:02Z","published":"2024-12-20T18:49:45Z","title":"Offline Reinforcement Learning for LLM Multi-Step Reasoning","summary":" Improving the multi-step reasoning ability of large language models (LLMs)\nwith offline reinforcement learning (RL) is essential for quickly adapting them\nto complex tasks. While Direct Preference Optimization (DPO) has shown promise\nin aligning LLMs with human preferences, it is less suitable for multi-step\nreasoning tasks because (1) DPO relies on paired preference data, which is not\nreadily available for multi-step reasoning tasks, and (2) it treats all tokens\nuniformly, making it ineffective for credit assignment in multi-step reasoning\ntasks, which often come with sparse reward. In this work, we propose OREO\n(Offline Reasoning Optimization), an offline RL method for enhancing LLM\nmulti-step reasoning. Building on insights from previous works of maximum\nentropy reinforcement learning, it jointly learns a policy model and value\nfunction by optimizing the soft Bellman Equation. 
We show in principle that it\nreduces the need to collect pairwise data and enables better credit assignment.\nEmpirically, OREO surpasses existing offline learning methods on multi-step\nreasoning benchmarks, including mathematical reasoning tasks (GSM8K, MATH) and\nembodied agent control (ALFWorld). The approach can be extended to a\nmulti-iteration framework when additional resources are available. Furthermore,\nthe learned value function can be leveraged to guide the tree search for free,\nwhich can further boost performance during test time.\n","authors":["Huaijie Wang","Shibo Hao","Hanze Dong","Shenao Zhang","Yilin Bao","Ziran Yang","Yi Wu"],"pdf_url":"https://arxiv.org/pdf/2412.16145v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.16893v3","updated":"2024-12-25T18:11:46Z","published":"2024-08-29T20:27:05Z","title":"Exploring Multiple Strategies to Improve Multilingual Coreference\n Resolution in CorefUD","summary":" Coreference resolution, the task of identifying expressions in text that\nrefer to the same entity, is a critical component in various natural language\nprocessing applications. This paper presents a novel end-to-end neural\ncoreference resolution system utilizing the CorefUD 1.1 dataset, which spans 17\ndatasets across 12 languages. The proposed model is based on the standard\nend-to-end neural coreference resolution system. We first establish baseline\nmodels, including monolingual and cross-lingual variations, and then propose\nseveral extensions to enhance performance across diverse linguistic contexts.\nThese extensions include cross-lingual training, incorporation of syntactic\ninformation, a Span2Head model for optimized headword prediction, and advanced\nsingleton modeling. We also experiment with headword span representation and\nlong-documents modeling through overlapping segments. 
The proposed extensions,\nparticularly the heads-only approach, singleton modeling, and long document\nprediction, significantly improve performance across most datasets. We also\nperform zero-shot cross-lingual experiments, highlighting the potential and\nlimitations of cross-lingual transfer in coreference resolution. Our findings\ncontribute to the development of robust and scalable coreference systems for\nmultilingual coreference resolution. Finally, we evaluate our model on the\nCorefUD 1.1 test set and surpass the best model from the CRAC 2023 shared task\nof comparable size by a large margin.\n","authors":["Ondřej Pražák","Miloslav Konopík","Pavel Král"],"pdf_url":"https://arxiv.org/pdf/2408.16893v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.14547v5","updated":"2024-12-25T17:19:12Z","published":"2024-02-22T13:36:53Z","title":"OmniPred: Language Models as Universal Regressors","summary":" Regression is a powerful tool to accurately predict the outcome metric of a\nsystem given a set of parameters, but has traditionally been restricted to\nmethods which are only applicable to a specific task. In this paper, we propose\nOmniPred, a framework for training language models as universal end-to-end\nregressors over $(x,y)$ data from arbitrary formats. 
Using data sourced from\nGoogle Vizier, one of the largest proprietary blackbox optimization databases\nin the world, our extensive experiments demonstrate that language models are\ncapable of very precise numerical regression using only textual representations\nof mathematical parameters and values, and if given the opportunity to train at\nscale over multiple tasks, can significantly outperform traditional regression\nmodels.\n","authors":["Xingyou Song","Oscar Li","Chansoo Lee","Bangding Yang","Daiyi Peng","Sagi Perel","Yutian Chen"],"pdf_url":"https://arxiv.org/pdf/2402.14547v5.pdf","comment":"Published in Transactions on Machine Learning Research (TMLR) 2024.\n Code can be found in\n https://github.com/google-research/optformer/tree/main/optformer/omnipred"},{"id":"http://arxiv.org/abs/2412.18947v1","updated":"2024-12-25T16:51:29Z","published":"2024-12-25T16:51:29Z","title":"MedHallBench: A New Benchmark for Assessing Hallucination in Medical\n Large Language Models","summary":" Medical Large Language Models (MLLMs) have demonstrated potential in\nhealthcare applications, yet their propensity for hallucinations -- generating\nmedically implausible or inaccurate information -- presents substantial risks\nto patient care. This paper introduces MedHallBench, a comprehensive benchmark\nframework for evaluating and mitigating hallucinations in MLLMs. Our\nmethodology integrates expert-validated medical case scenarios with established\nmedical databases to create a robust evaluation dataset. The framework employs\na sophisticated measurement system that combines automated ACHMI (Automatic\nCaption Hallucination Measurement in Medical Imaging) scoring with rigorous\nclinical expert evaluations and utilizes reinforcement learning methods to\nachieve automatic annotation. 
Through an optimized reinforcement learning from\nhuman feedback (RLHF) training pipeline specifically designed for medical\napplications, MedHallBench enables thorough evaluation of MLLMs across diverse\nclinical contexts while maintaining stringent accuracy standards. We conducted\ncomparative experiments involving various models, utilizing the benchmark to\nestablish a baseline for widely adopted large language models (LLMs). Our\nfindings indicate that ACHMI provides a more nuanced understanding of the\neffects of hallucinations compared to traditional metrics, thereby highlighting\nits advantages in hallucination assessment. This research establishes a\nfoundational framework for enhancing MLLMs' reliability in healthcare settings\nand presents actionable strategies for addressing the critical challenge of AI\nhallucinations in medical applications.\n","authors":["Kaiwen Zuo","Yirui Jiang"],"pdf_url":"https://arxiv.org/pdf/2412.18947v1.pdf","comment":"Published to AAAI-25 Bridge Program"},{"id":"http://arxiv.org/abs/2411.00533v4","updated":"2024-12-25T16:13:57Z","published":"2024-11-01T12:08:08Z","title":"ReverseNER: A Self-Generated Example-Driven Framework for Zero-Shot\n Named Entity Recognition with Large Language Models","summary":" This paper presents ReverseNER, a method aimed at overcoming the limitation\nof large language models (LLMs) in zero-shot named entity recognition (NER)\ntasks, arising from their reliance on pre-provided demonstrations. ReverseNER\ntackles this challenge by constructing a reliable example library composed of\ndozens of entity-labeled sentences, generated through the reverse process of\nNER. Specifically, while conventional NER methods label entities in a sentence,\nReverseNER features reversing the process by using an LLM to generate entities\nfrom their definitions and subsequently expand them into full sentences. 
During\nthe entity expansion process, the LLM is guided to generate sentences by\nreplicating the structures of a set of specific \\textsl{feature sentences},\nextracted from the task sentences by clustering. This expansion process\nproduces dozens of entity-labeled task-relevant sentences. After constructing\nthe example library, the method selects several semantically similar\nentity-labeled examples for each task sentence as references to facilitate the\nLLM's entity recognition. We also propose an entity-level self-consistency\nscoring mechanism to improve NER performance with LLMs. Experiments show that\nReverseNER significantly outperforms other zero-shot NER methods with LLMs,\nmarking a notable improvement in NER for domains without labeled data, while\nreducing computational resource consumption.\n","authors":["Anbang Wang","Difei Mei","Zhichao Zhang","Xiuxiu Bai","Ran Yao","Zewen Fang","Min Hu","Zhirui Cao","Haitao Sun","Yifeng Guo","Hongyao Zhou","Yu Guo"],"pdf_url":"https://arxiv.org/pdf/2411.00533v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18934v1","updated":"2024-12-25T15:45:18Z","published":"2024-12-25T15:45:18Z","title":"Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference","summary":" Due to the high resource demands of Large Language Models (LLMs), achieving\nwidespread deployment on consumer-grade devices presents significant\nchallenges. Typically, personal or consumer-grade devices, including servers\nconfigured prior to the era of large-scale models, generally have relatively\nweak GPUs and relatively strong CPUs. However, most current methods primarily\ndepend on GPUs for computation. Therefore, we propose Dovetail, an approach\nthat deploys the draft model on the GPU to generate draft tokens while allowing\nthe target model to perform parallel verification on the CPU, thereby improving\nthe utilization of all available hardware resources and occupying less\ninter-device communication bandwidth. 
Accordingly, we have redesigned the draft\nmodel to better align with heterogeneous hardware characteristics. To this end,\nwe implemented several optimizations: reducing the number of draft tokens to\nmitigate latency in parallel verification, increasing the depth of the draft\nmodel to enhance its predictive capacity, and introducing DGF (Dynamic Gating\nFusion) to improve the integration of features and token embeddings. In the\nHumanEval benchmark, Dovetail achieved an inference speed of 5.86 tokens per\nsecond for LLaMA2-Chat-7B using 3GB of VRAM, representing an approximately\n2.77x improvement over CPU-only inference. Furthermore, the inference speed was\nincreased to 8 tokens per second when utilizing 7GB of VRAM.\n","authors":["Libo Zhang","Zhaoning Zhang","Baizhou Xu","Songzhu Mei","Dongsheng Li"],"pdf_url":"https://arxiv.org/pdf/2412.18934v1.pdf","comment":"9 pages, 7 figures"},{"id":"http://arxiv.org/abs/2412.18925v1","updated":"2024-12-25T15:12:34Z","published":"2024-12-25T15:12:34Z","title":"HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs","summary":" The breakthrough of OpenAI o1 highlights the potential of enhancing reasoning\nto improve LLM. Yet, most research in reasoning has focused on mathematical\ntasks, leaving domains like medicine underexplored. The medical domain, though\ndistinct from mathematics, also demands robust reasoning to provide reliable\nanswers, given the high standards of healthcare. However, verifying medical\nreasoning is challenging, unlike those in mathematics. To address this, we\npropose verifiable medical problems with a medical verifier to check the\ncorrectness of model outputs. This verifiable nature enables advancements in\nmedical reasoning through a two-stage approach: (1) using the verifier to guide\nthe search for a complex reasoning trajectory for fine-tuning LLMs, (2)\napplying reinforcement learning (RL) with verifier-based rewards to enhance\ncomplex reasoning further. 
Finally, we introduce HuatuoGPT-o1, a medical LLM\ncapable of complex reasoning, which outperforms general and medical-specific\nbaselines using only 40K verifiable problems. Experiments show complex\nreasoning improves medical problem-solving and benefits more from RL. We hope\nour approach inspires advancements in reasoning across medical and other\nspecialized domains.\n","authors":["Junying Chen","Zhenyang Cai","Ke Ji","Xidong Wang","Wanlong Liu","Rongsheng Wang","Jianye Hou","Benyou Wang"],"pdf_url":"https://arxiv.org/pdf/2412.18925v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17690v3","updated":"2024-12-25T15:05:04Z","published":"2024-12-23T16:16:30Z","title":"RAGONITE: Iterative Retrieval on Induced Databases and Verbalized RDF\n for Conversational QA over KGs with RAG","summary":" Conversational question answering (ConvQA) is a convenient means of searching\nover RDF knowledge graphs (KGs), where a prevalent approach is to translate\nnatural language questions to SPARQL queries. However, SPARQL has certain\nshortcomings: (i) it is brittle for complex intents and conversational\nquestions, and (ii) it is not suitable for more abstract needs. Instead, we\npropose a novel two-pronged system where we fuse: (i) SQL-query results over a\ndatabase automatically derived from the KG, and (ii) text-search results over\nverbalizations of KG facts. Our pipeline supports iterative retrieval: when the\nresults of any branch are found to be unsatisfactory, the system can\nautomatically opt for further rounds. We put everything together in a retrieval\naugmented generation (RAG) setup, where an LLM generates a coherent response\nfrom accumulated search results. 
We demonstrate the superiority of our proposed\nsystem over several baselines on a knowledge graph of BMW automobiles.\n","authors":["Rishiraj Saha Roy","Chris Hinze","Joel Schlotthauer","Farzad Naderi","Viktor Hangya","Andreas Foltyn","Luzian Hahn","Fabian Kuech"],"pdf_url":"https://arxiv.org/pdf/2412.17690v3.pdf","comment":"Accepted at BTW 2025, 10 pages"},{"id":"http://arxiv.org/abs/2411.14871v2","updated":"2024-12-25T14:55:08Z","published":"2024-11-22T11:45:33Z","title":"Prioritize Denoising Steps on Diffusion Model Preference Alignment via\n Explicit Denoised Distribution Estimation","summary":" Diffusion models have shown remarkable success in text-to-image generation,\nmaking alignment methods for these models increasingly important. A key\nchallenge is the sparsity of preference labels, which are typically available\nonly at the terminal of denoising trajectories. This raises the issue of how to\nassign credit across denoising steps based on these sparse labels. In this\npaper, we propose Denoised Distribution Estimation (DDE), a novel method for\ncredit assignment. Unlike previous approaches that rely on auxiliary models or\nhand-crafted schemes, DDE derives its strategy more explicitly. The proposed\nDDE directly estimates the terminal denoised distribution from the perspective\nof each step. It is equipped with two estimation strategies and capable of\nrepresenting the entire denoising trajectory with a single model inference.\nTheoretically and empirically, we show that DDE prioritizes optimizing the\nmiddle part of the denoising trajectory, resulting in a novel and effective\ncredit assignment scheme. 
Extensive experiments demonstrate that our approach\nachieves superior performance, both quantitatively and qualitatively.\n","authors":["Dingyuan Shi","Yong Wang","Hangyu Li","Xiangxiang Chu"],"pdf_url":"https://arxiv.org/pdf/2411.14871v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.13782v2","updated":"2024-12-25T14:52:33Z","published":"2024-12-18T12:21:46Z","title":"Knowledge Editing with Dynamic Knowledge Graphs for Multi-Hop Question\n Answering","summary":" Multi-hop question answering (MHQA) poses a significant challenge for large\nlanguage models (LLMs) due to the extensive knowledge demands involved.\nKnowledge editing, which aims to precisely modify the LLMs to incorporate\nspecific knowledge without negatively impacting other unrelated knowledge,\noffers a potential solution for addressing MHQA challenges with LLMs. However,\ncurrent solutions struggle to effectively resolve issues of knowledge\nconflicts. Most parameter-preserving editing methods are hindered by inaccurate\nretrieval and overlook secondary editing issues, which can introduce noise into\nthe reasoning process of LLMs. In this paper, we introduce KEDKG, a novel\nknowledge editing method that leverages a dynamic knowledge graph for MHQA,\ndesigned to ensure the reliability of answers. KEDKG involves two primary\nsteps: dynamic knowledge graph construction and knowledge graph augmented\ngeneration. Initially, KEDKG autonomously constructs a dynamic knowledge graph\nto store revised information while resolving potential knowledge conflicts.\nSubsequently, it employs a fine-grained retrieval strategy coupled with an\nentity and relation detector to enhance the accuracy of graph retrieval for LLM\ngeneration. 
Experimental results on benchmarks show that KEDKG surpasses\nprevious state-of-the-art models, delivering more accurate and reliable answers\nin environments with dynamic information.\n","authors":["Yifan Lu","Yigeng Zhou","Jing Li","Yequan Wang","Xuebo Liu","Daojing He","Fangming Liu","Min Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.13782v2.pdf","comment":"AAAI 2025"},{"id":"http://arxiv.org/abs/2412.18910v1","updated":"2024-12-25T13:57:33Z","published":"2024-12-25T13:57:33Z","title":"AdaEAGLE: Optimizing Speculative Decoding via Explicit Modeling of\n Adaptive Draft Structures","summary":" Speculative Decoding (SD) is a popular lossless technique for accelerating\nthe inference of Large Language Models (LLMs). We show that the decoding speed\nof SD frameworks with static draft structures can be significantly improved by\nincorporating context-aware adaptive draft structures. However, current studies\non adaptive draft structures are limited by their performance, modeling\napproaches, and applicability. In this paper, we introduce AdaEAGLE, the first\nSD framework that explicitly models adaptive draft structures. AdaEAGLE\nleverages the Lightweight Draft Length Predictor (LDLP) module to explicitly\npredict the optimal number of draft tokens during inference to guide the draft\nmodel. It achieves comparable speedup results without manual thresholds and\nallows for deeper, more specialized optimizations. 
Moreover, together with\nthreshold-based strategies, AdaEAGLE achieves a $1.62\\times$ speedup over the\nvanilla AR decoding and outperforms fixed-length SotA baseline while\nmaintaining output quality.\n","authors":["Situo Zhang","Hankun Wang","Da Ma","Zichen Zhu","Lu Chen","Kunyao Lan","Kai Yu"],"pdf_url":"https://arxiv.org/pdf/2412.18910v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18908v1","updated":"2024-12-25T13:54:40Z","published":"2024-12-25T13:54:40Z","title":"Research Experiment on Multi-Model Comparison for Chinese Text\n Classification Tasks","summary":" With the explosive growth of Chinese text data and advancements in natural\nlanguage processing technologies, Chinese text classification has become one of\nthe key techniques in fields such as information retrieval and sentiment\nanalysis, attracting increasing attention. This paper conducts a comparative\nstudy on three deep learning models: TextCNN, TextRNN, and FastText, specifically\nfor Chinese text classification tasks. By conducting experiments on the\nTHUCNews dataset, the performance of these models is evaluated, and their\napplicability in different scenarios is discussed.\n","authors":["JiaCheng Li"],"pdf_url":"https://arxiv.org/pdf/2412.18908v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.01707v3","updated":"2024-12-25T13:32:54Z","published":"2024-10-02T16:15:31Z","title":"Interpretable Contrastive Monte Carlo Tree Search Reasoning","summary":" We propose SC-MCTS*: a novel Monte Carlo Tree Search (MCTS) reasoning\nalgorithm for Large Language Models (LLMs), which significantly improves both\nreasoning accuracy and speed. Our motivation comes from: 1. Previous MCTS LLM\nreasoning works often overlooked its biggest drawback--slower speed compared to\nCoT; 2. Previous research mainly used MCTS as a tool for LLM reasoning on\nvarious tasks with limited quantitative analysis or ablation studies of its\ncomponents from a reasoning interpretability perspective. 3. 
The reward model is\nthe most crucial component in MCTS, however previous work has rarely conducted\nin-depth study or improvement of MCTS's reward models. Thus, we conducted\nextensive ablation studies and quantitative analysis on components of MCTS,\nrevealing the impact of each component on the MCTS reasoning performance of\nLLMs. Building on this, (i) we designed a highly interpretable reward model\nbased on the principle of contrastive decoding and (ii) achieved an average\nspeed improvement of 51.9% per node using speculative decoding. Additionally,\n(iii) we improved UCT node selection strategy and backpropagation used in\nprevious works, resulting in significant performance improvement. We\noutperformed o1-mini by an average of 17.4% on the Blocksworld multi-step\nreasoning dataset using Llama-3.1-70B with SC-MCTS*. Our code is available at\nhttps://github.com/zitian-gao/SC-MCTS.\n","authors":["Zitian Gao","Boye Niu","Xuzheng He","Haotian Xu","Hongzhang Liu","Aiwei Liu","Xuming Hu","Lijie Wen"],"pdf_url":"https://arxiv.org/pdf/2410.01707v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.20906v2","updated":"2024-12-25T13:10:27Z","published":"2024-07-30T15:26:36Z","title":"Automated Review Generation Method Based on Large Language Models","summary":" Literature research, vital for scientific work, faces the challenge of the\nsurging torrent of information in the vast ocean of literature exceeding\nresearchers' processing capabilities. To address this issue, we present an\nautomated review generation method based on Large Language Models (LLMs), aimed\nat overcoming efficiency bottlenecks in literature processing and reducing\ncognitive load. Our statistically validated evaluation framework demonstrates\nthat the generated reviews match or exceed manual quality, offering broad\napplicability across research fields due to minimal domain knowledge\nrequirements. 
In a case study on propane dehydrogenation (PDH) catalysts, our\nmethod swiftly analyzed 343 articles, averaging seconds per article per LLM\naccount, producing comprehensive reviews spanning 35 topics. Extended analysis\nof 1041 articles provided deep insights into catalysts' composition, structure,\nand performance. Recognizing LLMs' hallucinations, we implemented a\nmulti-layered quality control strategy, effectively mitigating risks and\nensuring reliability, as quantitatively demonstrated through manual\nverification. Expert verification confirms the accuracy and citation integrity\nof generated reviews, demonstrating LLM hallucination risks reduced to below\n0.5\\% with over 95\\% confidence. Released Windows application enables one-click\nreview generation, aiding researchers in tracking advancements and recommending\nliterature. This approach showcases LLMs' role in enhancing scientific research\nproductivity and sets the stage for further exploration.\n","authors":["Shican Wu","Xiao Ma","Dehui Luo","Lulu Li","Xiangcheng Shi","Xin Chang","Xiaoyun Lin","Ran Luo","Chunlei Pei","Zhi-Jian Zhao","Jinlong Gong"],"pdf_url":"https://arxiv.org/pdf/2407.20906v2.pdf","comment":"29 pages, 5 figures, 3 tables Code:\n https://github.com/TJU-ECAT-AI/AutomaticReviewGeneration Data:\n https://github.com/TJU-ECAT-AI/AutomaticReviewGenerationData This research\n has been invited for a Short Oral presentation at the 18th ICC -\n International Congress on Catalysis, taking place in Lyon, France from July\n 14-19, 2024"},{"id":"http://arxiv.org/abs/2412.18868v1","updated":"2024-12-25T11:00:27Z","published":"2024-12-25T11:00:27Z","title":"Overview of MWE history, challenges, and horizons: standing at the 20th\n anniversary of the MWE workshop series via MWE-UD2024","summary":" Starting in 2003 when the first MWE workshop was held with ACL in Sapporo,\nJapan, this year, the joint workshop of MWE-UD co-located with the LREC-COLING\n2024 conference marked the 20th anniversary of MWE 
workshop events over the\npast nearly two decades. Standing at this milestone, we look back to this\nworkshop series and summarise the research topics and methodologies researchers\nhave carried out over the years. We also discuss the current challenges that we\nare facing and the broader impacts/synergies of MWE research within the CL and\nNLP fields. Finally, we give future research perspectives. We hope this\nposition paper can help researchers, students, and industrial practitioners\ninterested in MWE get a brief but easy understanding of its history, current,\nand possible future.\n","authors":["Lifeng Han","Kilian Evang","Archna Bhatia","Gosse Bouma","A. Seza Doğruöz","Marcos Garcia","Voula Giouli","Joakim Nivre","Alexandre Rademacher"],"pdf_url":"https://arxiv.org/pdf/2412.18868v1.pdf","comment":"ongoing work, position paper, 6 pages"},{"id":"http://arxiv.org/abs/2407.15621v2","updated":"2024-12-25T10:49:52Z","published":"2024-07-22T13:29:56Z","title":"RadioRAG: Factual large language models for enhanced diagnostics in\n radiology using online retrieval augmented generation","summary":" Large language models (LLMs) often generate outdated or inaccurate\ninformation based on static training datasets. Retrieval augmented generation\n(RAG) mitigates this by integrating outside data sources. While previous RAG\nsystems used pre-assembled, fixed databases with limited flexibility, we have\ndeveloped Radiology RAG (RadioRAG), an end-to-end framework that retrieves data\nfrom authoritative radiologic online sources in real-time. We evaluate the\ndiagnostic accuracy of various LLMs when answering radiology-specific questions\nwith and without access to additional online information via RAG. 
Using 80\nquestions from the RSNA Case Collection across radiologic subspecialties and 24\nadditional expert-curated questions with reference standard answers, LLMs\n(GPT-3.5-turbo, GPT-4, Mistral-7B, Mixtral-8x7B, and Llama3 [8B and 70B]) were\nprompted with and without RadioRAG in a zero-shot inference scenario. RadioRAG\nretrieved context-specific information from www.radiopaedia.org in real-time.\nAccuracy was investigated. Statistical analyses were performed using\nbootstrapping. The results were further compared with human performance.\nRadioRAG improved diagnostic accuracy across most LLMs, with relative accuracy\nincreases ranging up to 54% for different LLMs. It matched or exceeded non-RAG\nmodels and the human radiologist in question answering across radiologic\nsubspecialties, particularly in breast imaging and emergency radiology.\nHowever, the degree of improvement varied among models; GPT-3.5-turbo and\nMixtral-8x7B-instruct-v0.1 saw notable gains, while Mistral-7B-instruct-v0.2\nshowed no improvement, highlighting variability in RadioRAG's effectiveness.\nLLMs benefit when provided access to domain-specific data beyond their training\ndata. For radiology, RadioRAG establishes a robust framework that substantially\nimproves diagnostic accuracy and factuality in radiological question answering.\n","authors":["Soroosh Tayebi Arasteh","Mahshad Lotfinia","Keno Bressem","Robert Siepmann","Lisa Adams","Dyke Ferber","Christiane Kuhl","Jakob Nikolas Kather","Sven Nebelung","Daniel Truhn"],"pdf_url":"https://arxiv.org/pdf/2407.15621v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.08582v3","updated":"2024-12-25T10:20:23Z","published":"2024-07-11T15:07:26Z","title":"On the Universal Truthfulness Hyperplane Inside LLMs","summary":" While large language models (LLMs) have demonstrated remarkable abilities\nacross various fields, hallucination remains a significant challenge. 
Recent\nstudies have explored hallucinations through the lens of internal\nrepresentations, proposing mechanisms to decipher LLMs' adherence to facts.\nHowever, these approaches often fail to generalize to out-of-distribution data,\nleading to concerns about whether internal representation patterns reflect\nfundamental factual awareness, or only overfit spurious correlations on the\nspecific datasets. In this work, we investigate whether a universal\ntruthfulness hyperplane that distinguishes the model's factually correct and\nincorrect outputs exists within the model. To this end, we scale up the number\nof training datasets and conduct an extensive evaluation -- we train the\ntruthfulness hyperplane on a diverse collection of over 40 datasets and examine\nits cross-task, cross-domain, and in-domain generalization. Our results\nindicate that increasing the diversity of the training datasets significantly\nenhances the performance in all scenarios, while the volume of data samples\nplays a less critical role. This finding supports the optimistic hypothesis\nthat a universal truthfulness hyperplane may indeed exist within the model,\noffering promising directions for future research.\n","authors":["Junteng Liu","Shiqi Chen","Yu Cheng","Junxian He"],"pdf_url":"https://arxiv.org/pdf/2407.08582v3.pdf","comment":"EMNLP 2024: Camera-ready version"},{"id":"http://arxiv.org/abs/2412.18863v1","updated":"2024-12-25T10:17:15Z","published":"2024-12-25T10:17:15Z","title":"Whose Morality Do They Speak? Unraveling Cultural Bias in Multilingual\n Language Models","summary":" Large language models (LLMs) have become integral tools in diverse domains,\nyet their moral reasoning capabilities across cultural and linguistic contexts\nremain underexplored. This study investigates whether multilingual LLMs, such\nas GPT-3.5-Turbo, GPT-4o-mini, Llama 3.1, and MistralNeMo, reflect culturally\nspecific moral values or impose dominant moral norms, particularly those rooted\nin English. 
Using the updated Moral Foundations Questionnaire (MFQ-2) in eight\nlanguages, Arabic, Farsi, English, Spanish, Japanese, Chinese, French, and\nRussian, the study analyzes the models' adherence to six core moral\nfoundations: care, equality, proportionality, loyalty, authority, and purity.\nThe results reveal significant cultural and linguistic variability, challenging\nthe assumption of universal moral consistency in LLMs. Although some models\ndemonstrate adaptability to diverse contexts, others exhibit biases influenced\nby the composition of the training data. These findings underscore the need for\nculturally inclusive model development to improve fairness and trust in\nmultilingual AI systems.\n","authors":["Meltem Aksoy"],"pdf_url":"https://arxiv.org/pdf/2412.18863v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18860v1","updated":"2024-12-25T10:08:54Z","published":"2024-12-25T10:08:54Z","title":"Bootstrap Your Own Context Length","summary":" We introduce a bootstrapping approach to train long-context language models\nby exploiting their short-context capabilities only. Our method utilizes a\nsimple agent workflow to synthesize diverse long-context instruction tuning\ndata, thereby eliminating the necessity for manual data collection and\nannotation. The proposed data synthesis workflow requires only a short-context\nlanguage model, a text retriever, and a document collection, all of which are\nreadily accessible within the open-source ecosystem. Subsequently, language\nmodels are fine-tuned using the synthesized data to extend their context\nlengths. 
In this manner, we effectively transfer the short-context capabilities\nof language models to long-context scenarios through a bootstrapping process.\nWe conduct experiments with the open-source Llama-3 family of models and\ndemonstrate that our method can successfully extend the context length to up to\n1M tokens, achieving superior performance across various benchmarks.\n","authors":["Liang Wang","Nan Yang","Xingxing Zhang","Xiaolong Huang","Furu Wei"],"pdf_url":"https://arxiv.org/pdf/2412.18860v1.pdf","comment":"18 pages"},{"id":"http://arxiv.org/abs/2409.05286v2","updated":"2024-12-25T09:57:52Z","published":"2024-09-09T02:41:00Z","title":"Seek and Solve Reasoning for Table Question Answering","summary":" The complexities of table structures and question logic make table-based\nquestion answering (TQA) tasks challenging for Large Language Models (LLMs),\noften requiring task simplification before solving. This paper reveals that the\nreasoning process during task simplification may be more valuable than the\nsimplified tasks themselves and aims to improve TQA performance by leveraging\nLLMs' reasoning capabilities. We propose a Seek-and-Solve pipeline that\ninstructs the LLM to first seek relevant information and then answer questions,\nintegrating these two stages at the reasoning level into a coherent\nSeek-and-Solve Chain of Thought (SS-CoT). Additionally, we distill a\nsingle-step TQA-solving prompt from this pipeline, using demonstrations with\nSS-CoT paths to guide the LLM in solving complex TQA tasks under In-Context\nLearning settings. Our experiments show that our approaches result in improved\nperformance and reliability while being efficient. 
Our findings emphasize the\nimportance of eliciting LLMs' reasoning capabilities to handle complex TQA\ntasks effectively.\n","authors":["Ruya Jiang","Chun Wang","Weihong Deng"],"pdf_url":"https://arxiv.org/pdf/2409.05286v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18377v2","updated":"2024-12-25T09:26:52Z","published":"2024-12-24T12:03:36Z","title":"ChaI-TeA: A Benchmark for Evaluating Autocompletion of Interactions with\n LLM-based Chatbots","summary":" The rise of LLMs has deflected a growing portion of human-computer\ninteractions towards LLM-based chatbots. The remarkable abilities of these\nmodels allow users to interact using long, diverse natural language text\ncovering a wide range of topics and styles. Phrasing these messages is a time-\nand effort-consuming task, calling for an autocomplete solution to assist\nusers. We introduce the task of chatbot interaction autocomplete. We present\nChaI-TeA: CHat InTEraction Autocomplete, an autocomplete evaluation framework\nfor LLM-based chatbot interactions. The framework includes a formal definition\nof the task, coupled with suitable datasets and metrics. Using this framework,\nwe test 9 models on the defined autocompletion task, finding that while\ncurrent off-the-shelf models perform fairly, there is still much room for\nimprovement, mainly in the ranking of the generated suggestions. We provide\ninsights for practitioners working on this task and open new research\ndirections for researchers in the field. 
We release our framework to serve as a\nfoundation for future research.\n","authors":["Shani Goren","Oren Kalinsky","Tomer Stav","Yuri Rapoport","Yaron Fairstein","Ram Yazdi","Nachshon Cohen","Alexander Libov","Guy Kushilevitz"],"pdf_url":"https://arxiv.org/pdf/2412.18377v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17332v2","updated":"2024-12-25T08:49:27Z","published":"2024-12-23T06:50:04Z","title":"A Dual-Perspective Metaphor Detection Framework Using Large Language\n Models","summary":" Metaphor detection, a critical task in natural language processing, involves\nidentifying whether a particular word in a sentence is used metaphorically.\nTraditional approaches often rely on supervised learning models that implicitly\nencode semantic relationships based on metaphor theories. However, these\nmethods often suffer from a lack of transparency in their decision-making\nprocesses, which undermines the reliability of their predictions. Recent\nresearch indicates that LLMs (large language models) exhibit significant\npotential in metaphor detection. Nevertheless, their reasoning capabilities are\nconstrained by predefined knowledge graphs. To overcome these limitations, we\npropose DMD, a novel dual-perspective framework that harnesses both implicit\nand explicit applications of metaphor theories to guide LLMs in metaphor\ndetection and adopts a self-judgment mechanism to validate the responses from\nthe aforementioned forms of guidance. In comparison to previous methods, our\nframework offers more transparent reasoning processes and delivers more\nreliable predictions. 
Experimental results prove the effectiveness of DMD,\ndemonstrating state-of-the-art performance across widely-used datasets.\n","authors":["Yujie Lin","Jingyao Liu","Yan Gao","Ante Wang","Jinsong Su"],"pdf_url":"https://arxiv.org/pdf/2412.17332v2.pdf","comment":"Accepted to ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.18826v1","updated":"2024-12-25T08:31:53Z","published":"2024-12-25T08:31:53Z","title":"RapGuard: Safeguarding Multimodal Large Language Models via\n Rationale-aware Defensive Prompting","summary":" While Multimodal Large Language Models (MLLMs) have made remarkable progress\nin vision-language reasoning, they are also more susceptible to producing\nharmful content compared to models that focus solely on text. Existing\ndefensive prompting techniques rely on a static, unified safety guideline that\nfails to account for the specific risks inherent in different multimodal\ncontexts. To address these limitations, we propose RapGuard, a novel framework\nthat uses multimodal chain-of-thought reasoning to dynamically generate\nscenario-specific safety prompts. RapGuard enhances safety by adapting its\nprompts to the unique risks of each input, effectively mitigating harmful\noutputs while maintaining high performance on benign tasks. Our experimental\nresults across multiple MLLM benchmarks demonstrate that RapGuard achieves\nstate-of-the-art safety performance, significantly reducing harmful content\nwithout degrading the quality of responses.\n","authors":["Yilei Jiang","Yingshui Tan","Xiangyu Yue"],"pdf_url":"https://arxiv.org/pdf/2412.18826v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.05257v3","updated":"2024-12-25T07:54:44Z","published":"2024-09-09T00:40:47Z","title":"UPCS: Unbiased Persona Construction for Dialogue Generation","summary":" Narrative systems, such as dialogue and storytelling systems, often utilize\npersona profiles to enhance personalized interactions. 
Existing persona\nprofiles frequently exhibit biases, posing risks to system integrity and\nfairness. To address this, we introduce the UPCS framework, which categorizes\ncharacter descriptions into eight dimensions, including bias mitigation\nstrategies. Experimental results demonstrate UPCS's superiority in accuracy,\ndiversity, bias elimination, and user satisfaction, marking a significant\nadvancement in persona construction for reliable narrative systems.\n","authors":["Kuiyun Chen","Yanbin Wei"],"pdf_url":"https://arxiv.org/pdf/2409.05257v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18811v1","updated":"2024-12-25T07:42:22Z","published":"2024-12-25T07:42:22Z","title":"DCIS: Efficient Length Extrapolation of LLMs via Divide-and-Conquer\n Scaling Factor Search","summary":" Large language models (LLMs) based on the Transformer architecture usually\nhave their context length limited due to the high training cost. Recent\nadvancements extend the context window by adjusting the scaling factors of RoPE\nand fine-tuning. However, suboptimal initialization of these factors results in\nincreased fine-tuning costs and reduced performance at target length. To\naddress these challenges, we propose an innovative RoPE-based fine-tuning\nframework that diverges from conventional scaling factors search. Specifically,\nwe present a Divide-and-Conquer Incremental Search (DCIS) algorithm that\nstrategically determines the better scaling factors. Further fine-tuning with\nthe identified scaling factors effectively extends the context window of LLMs.\nEmpirical results demonstrate that our methodology not only mitigates\nperformance decay at extended target lengths but also allows the model to\nfine-tune on short contexts and generalize to long contexts, thereby reducing\nthe cost of fine-tuning. The scaling factors obtained through DCIS can even\nperform effectively without fine-tuning. 
Further analysis of the search space\nreveals that DCIS achieves twice the search efficiency compared to other\nmethods. We also examine the impact of the non-strictly increasing scaling\nfactors utilized in DCIS and evaluate the general capabilities of LLMs across\nvarious context lengths.\n","authors":["Lei Yang","Shaoyang Xu","Deyi Xiong"],"pdf_url":"https://arxiv.org/pdf/2412.18811v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18800v1","updated":"2024-12-25T06:40:36Z","published":"2024-12-25T06:40:36Z","title":"Improving Generated and Retrieved Knowledge Combination Through\n Zero-shot Generation","summary":" Open-domain Question Answering (QA) has garnered substantial interest by\ncombining the advantages of faithfully retrieved passages and relevant passages\ngenerated through Large Language Models (LLMs). However, there is a lack of\ndefinitive labels available to pair these sources of knowledge. In order to\naddress this issue, we propose an unsupervised and simple framework called\nBi-Reranking for Merging Generated and Retrieved Knowledge (BRMGR), which\nutilizes re-ranking methods for both retrieved passages and LLM-generated\npassages. We pair the two types of passages using two separate re-ranking\nmethods and then combine them through greedy matching. We demonstrate that\nBRMGR is equivalent to employing a bipartite matching loss when assigning each\nretrieved passage with a corresponding LLM-generated passage. 
Applied to three datasets, our model improves\nperformance by +1.7 and +1.6 points on the NQ and WebQ datasets, respectively,\nand obtains comparable results on the TriviaQA dataset when compared to\ncompetitive baselines.\n","authors":["Xinkai Du","Quanjie Han","Chao Lv","Yan Liu","Yalin Sun","Hao Shu","Hongbo Shan","Maosong Sun"],"pdf_url":"https://arxiv.org/pdf/2412.18800v1.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.18367v2","updated":"2024-12-25T06:20:11Z","published":"2024-12-24T11:50:18Z","title":"Towards Global AI Inclusivity: A Large-Scale Multilingual Terminology\n Dataset","summary":" The field of machine translation has achieved significant advancements, yet\ndomain-specific terminology translation, particularly in AI, remains\nchallenging. We introduced GIST, a large-scale multilingual AI terminology\ndataset containing 5K terms extracted from top AI conference papers spanning\n2000 to 2023. The terms were translated into Arabic, Chinese, French, Japanese,\nand Russian using a hybrid framework that combines LLMs for extraction with\nhuman expertise for translation. The dataset's quality was benchmarked against\nexisting resources, demonstrating superior translation accuracy through\ncrowdsourced evaluation. GIST was integrated into translation workflows using\npost-translation refinement methods that required no retraining, where LLM\nprompting consistently improved BLEU and COMET scores. A web demonstration on\nthe ACL Anthology platform highlights its practical application, showcasing\nimproved accessibility for non-English speakers. 
This work aims to address\ncritical gaps in AI terminology resources and fosters global inclusivity and\ncollaboration in AI research.\n","authors":["Jiarui Liu","Iman Ouzzani","Wenkai Li","Lechen Zhang","Tianyue Ou","Houda Bouamor","Zhijing Jin","Mona Diab"],"pdf_url":"https://arxiv.org/pdf/2412.18367v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.14660v2","updated":"2024-12-25T06:05:36Z","published":"2024-12-19T09:10:07Z","title":"Unveiling Uncertainty: A Deep Dive into Calibration and Performance of\n Multimodal Large Language Models","summary":" Multimodal large language models (MLLMs) combine visual and textual data for\ntasks such as image captioning and visual question answering. Proper\nuncertainty calibration is crucial, yet challenging, for reliable use in areas\nlike healthcare and autonomous driving. This paper investigates representative\nMLLMs, focusing on their calibration across various scenarios, including before\nand after visual fine-tuning, as well as before and after multimodal training\nof the base LLMs. We observed miscalibration in their performance, and at the\nsame time, no significant differences in calibration across these scenarios. We\nalso highlight how uncertainty differs between text and images and how their\nintegration affects overall uncertainty. To better understand MLLMs'\nmiscalibration and their ability to self-assess uncertainty, we construct the\nIDK (I don't know) dataset, which is key to evaluating how they handle\nunknowns. Our findings reveal that MLLMs tend to give answers rather than admit\nuncertainty, but this self-assessment improves with proper prompt adjustments.\nFinally, to calibrate MLLMs and enhance model reliability, we propose\ntechniques such as temperature scaling and iterative prompt optimization. Our\nresults provide insights into improving MLLMs for effective and responsible\ndeployment in multimodal applications. 
Code and IDK dataset:\nhttps://github.com/hfutml/Calibration-MLLM.\n","authors":["Zijun Chen","Wenbo Hu","Guande He","Zhijie Deng","Zheng Zhang","Richang Hong"],"pdf_url":"https://arxiv.org/pdf/2412.14660v2.pdf","comment":"Accepted to COLING 2025"},{"id":"http://arxiv.org/abs/2409.10197v2","updated":"2024-12-25T04:30:16Z","published":"2024-09-16T11:43:19Z","title":"Fit and Prune: Fast and Training-free Visual Token Pruning for\n Multi-modal Large Language Models","summary":" Recent Multimodal Large Language Models (MLLMs) often use large numbers of\nimage tokens to compensate for their visual shortcomings, which not only\nintroduces obvious redundancy but also greatly exacerbates the already high\ncomputational cost. Token pruning is an effective solution for speeding up\nMLLMs, but when and how to drop tokens still remains a challenge. In this paper, we\npropose a novel and training-free approach for the effective visual token\npruning of MLLMs, termed FitPrune, which can quickly produce a complete pruning\nrecipe for MLLMs according to a pre-defined budget. Specifically, FitPrune\nconsiders token pruning as a statistical problem of MLLM and its objective is\nto find an optimal pruning scheme that minimizes the divergence of the\nattention distributions before and after pruning. In practice, FitPrune can be\nquickly accomplished based on the attention statistics from a small batch of\ninference data, avoiding the expensive trials of MLLMs. According to the\npruning recipe, an MLLM can directly remove the redundant visual tokens of\ndifferent examples during inference. To validate FitPrune, we apply it to a set\nof recent MLLMs, including LLaVA-1.5, LLaVA-HR and LLaVA-NEXT, and conduct\nextensive experiments on a set of benchmarks. The experimental results show\nthat FitPrune can greatly reduce computational complexity while retaining high\nperformance, e.g., -54.9% FLOPs for LLaVA-NEXT\nwith only a 0.5% accuracy drop. 
Notably, the pruning recipe can be obtained in\nabout 5 minutes. Our code is available at https://github.com/ywh187/FitPrune.\n","authors":["Weihao Ye","Qiong Wu","Wenhao Lin","Yiyi Zhou"],"pdf_url":"https://arxiv.org/pdf/2409.10197v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.12699v2","updated":"2024-12-25T03:52:46Z","published":"2023-11-21T16:03:51Z","title":"Explore the Potential of LLMs in Misinformation Detection: An Empirical\n Study","summary":" Large Language Models (LLMs) have garnered significant attention for their\npowerful ability in natural language understanding and reasoning. In this\npaper, we present a comprehensive empirical study to explore the performance of\nLLMs on misinformation detection tasks. This study stands as the pioneering\ninvestigation into the understanding capabilities of multiple LLMs regarding\nboth content and propagation across social media platforms. Our empirical\nstudies on eight misinformation detection datasets show that LLM-based\ndetectors can achieve comparable performance in text-based misinformation\ndetection but exhibit notably constrained capabilities in comprehending\npropagation structure compared to existing models in propagation-based\nmisinformation detection. Our experiments further demonstrate that LLMs exhibit\ngreat potential to enhance existing misinformation detection models. These\nfindings highlight the potential ability of LLMs to detect misinformation.\n","authors":["Mengyang Chen","Lingwei Wei","Han Cao","Wei Zhou","Songlin Hu"],"pdf_url":"https://arxiv.org/pdf/2311.12699v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.10044v2","updated":"2024-12-25T03:12:44Z","published":"2024-09-16T07:13:30Z","title":"Benchmarking Large Language Model Uncertainty for Prompt Optimization","summary":" Prompt optimization algorithms for Large Language Models (LLMs) excel in\nmulti-step reasoning but still lack effective uncertainty estimation. 
This\npaper introduces a benchmark dataset to evaluate uncertainty metrics, focusing\non Answer, Correctness, Aleatoric, and Epistemic Uncertainty. Through analysis\nof models like GPT-3.5-Turbo and Meta-Llama-3.1-8B-Instruct, we show that\ncurrent metrics align more with Answer Uncertainty, which reflects output\nconfidence and diversity, rather than Correctness Uncertainty, highlighting the\nneed for improved metrics that are optimization-objective-aware to better guide\nprompt optimization. Our code and dataset are available at\nhttps://github.com/0Frett/PO-Uncertainty-Benchmarking.\n","authors":["Pei-Fu Guo","Yun-Da Tsai","Shou-De Lin"],"pdf_url":"https://arxiv.org/pdf/2409.10044v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18748v1","updated":"2024-12-25T02:41:13Z","published":"2024-12-25T02:41:13Z","title":"Towards Expressive Video Dubbing with Multiscale Multimodal Context\n Interaction","summary":" Automatic Video Dubbing (AVD) generates speech aligned with lip motion and\nfacial emotion from scripts. Recent research focuses on modeling multimodal\ncontext to enhance prosody expressiveness but overlooks two key issues: 1)\nMultiscale prosody expression attributes in the context influence the current\nsentence's prosody. 2) Prosody cues in context interact with the current\nsentence, impacting the final prosody expressiveness. To tackle these\nchallenges, we propose M2CI-Dubber, a Multiscale Multimodal Context Interaction\nscheme for AVD. This scheme includes two shared M2CI encoders to model the\nmultiscale multimodal context and facilitate its deep interaction with the\ncurrent sentence. By extracting global and local features for each modality in\nthe context, utilizing attention-based mechanisms for aggregation and\ninteraction, and employing an interaction-based graph attention network for\nfusion, the proposed approach enhances the prosody expressiveness of\nsynthesized speech for the current sentence. 
Experiments on the Chem dataset\nshow our model outperforms baselines in dubbing expressiveness. The code and\ndemos are available at\nhttps://github.com/AI-S2-Lab/M2CI-Dubber.\n","authors":["Yuan Zhao","Rui Liu","Gaoxiang Cong"],"pdf_url":"https://arxiv.org/pdf/2412.18748v1.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2410.14940v4","updated":"2024-12-25T02:40:01Z","published":"2024-10-19T02:07:33Z","title":"Baichuan Alignment Technical Report","summary":" We introduce Baichuan Alignment, a detailed analysis of the alignment\ntechniques employed in the Baichuan series of models. This represents the\nindustry's first comprehensive account of alignment methodologies, offering\nvaluable insights for advancing AI research. We investigate the critical\ncomponents that enhance model performance during the alignment process,\nincluding optimization methods, data strategies, capability enhancements, and\nevaluation processes. The process spans three key stages: Prompt Augmentation\nSystem (PAS), Supervised Fine-Tuning (SFT), and Preference Alignment. The\nproblems encountered, the solutions applied, and the improvements made are\nthoroughly recorded.\n Through comparisons across well-established benchmarks, we highlight the\ntechnological advancements enabled by Baichuan Alignment. Baichuan-Instruct is\nan internal model, while Qwen2-Nova-72B and Llama3-PBM-Nova-70B are instruct\nversions of the Qwen2-72B and Llama-3-70B base models, optimized through\nBaichuan Alignment. Baichuan-Instruct demonstrates significant improvements in\ncore capabilities, with user experience gains ranging from 17% to 28%, and\nperforms exceptionally well on specialized benchmarks. In open-source benchmark\nevaluations, both Qwen2-Nova-72B and Llama3-PBM-Nova-70B consistently\noutperform their respective official instruct versions across nearly all\ndatasets. 
This report aims to clarify the key technologies behind the alignment\nprocess, fostering a deeper understanding within the community.\nLlama3-PBM-Nova-70B model is available at\nhttps://huggingface.co/PKU-Baichuan-MLSystemLab/Llama3-PBM-Nova-70B.\n","authors":["Mingan Lin","Fan Yang","Yanjun Shen","Haoze Sun","Tianpeng Li","Tao Zhang","Chenzheng Zhu","Tao Zhang","Miao Zheng","Xu Li","Yijie Zhou","Mingyang Chen","Yanzhao Qin","Youquan Li","Hao Liang","Fei Li","Yadong Li","Mang Wang","Guosheng Dong","Kun Fang","Jianhua Xu","Bin Cui","Wentao Zhang","Zenan Zhou","Weipeng Chen"],"pdf_url":"https://arxiv.org/pdf/2410.14940v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.11303v3","updated":"2024-12-25T02:16:39Z","published":"2024-10-15T05:54:17Z","title":"TSDS: Data Selection for Task-Specific Model Finetuning","summary":" Finetuning foundation models for specific tasks is an emerging paradigm in\nmodern machine learning. The efficacy of task-specific finetuning largely\ndepends on the selection of appropriate training data. We present TSDS\n(Task-Specific Data Selection), a framework to select data for task-specific\nmodel finetuning, guided by a small but representative set of examples from the\ntarget task. To do so, we formulate data selection for task-specific finetuning\nas an optimization problem with a distribution alignment loss based on optimal\ntransport to capture the discrepancy between the selected data and the target\ndistribution. In addition, we add a regularizer to encourage the diversity of\nthe selected data and incorporate kernel density estimation into the\nregularizer to reduce the negative effects of near-duplicates among the\ncandidate data. We connect our optimization problem to nearest neighbor search\nand design efficient algorithms to compute the optimal solution based on\napproximate nearest neighbor search techniques. We evaluate our method on data\nselection for both continued pretraining and instruction tuning of language\nmodels. 
We show that instruction tuning using data selected by our method with\na 1% selection ratio often outperforms using the full dataset and beats the\nbaseline selection methods by 1.5 points in F1 score on average.\n","authors":["Zifan Liu","Amin Karbasi","Theodoros Rekatsinas"],"pdf_url":"https://arxiv.org/pdf/2410.11303v3.pdf","comment":"31 pages, 1 figure"},{"id":"http://arxiv.org/abs/2412.06394v3","updated":"2024-12-25T01:59:54Z","published":"2024-12-09T11:22:59Z","title":"GameArena: Evaluating LLM Reasoning through Live Computer Games","summary":" Evaluating the reasoning abilities of large language models (LLMs) is\nchallenging. Existing benchmarks often depend on static datasets, which are\nvulnerable to data contamination and may get saturated over time, or on binary\nlive human feedback that conflates reasoning with other abilities. As the most\nprominent dynamic benchmark, Chatbot Arena evaluates open-ended questions in\nreal-world settings, but lacks the granularity in assessing specific reasoning\ncapabilities. We introduce GameArena, a dynamic benchmark designed to evaluate\nLLM reasoning capabilities through interactive gameplay with humans. GameArena\nconsists of three games designed to test specific reasoning capabilities (e.g.,\ndeductive and inductive reasoning), while keeping participants entertained and\nengaged. We analyze the gaming data retrospectively to uncover the underlying\nreasoning processes of LLMs and measure their fine-grained reasoning\ncapabilities. We collect over 2000 game sessions and provide detailed\nassessments of various reasoning capabilities for five state-of-the-art LLMs.\nOur user study with 100 participants suggests that GameArena improves user\nengagement compared to Chatbot Arena. 
For the first time, GameArena enables the\ncollection of step-by-step LLM reasoning data in the wild.\n","authors":["Lanxiang Hu","Qiyu Li","Anze Xie","Nan Jiang","Ion Stoica","Haojian Jin","Hao Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.06394v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18733v1","updated":"2024-12-25T01:35:59Z","published":"2024-12-25T01:35:59Z","title":"Intra- and Inter-modal Context Interaction Modeling for Conversational\n Speech Synthesis","summary":" Conversational Speech Synthesis (CSS) aims to effectively take the multimodal\ndialogue history (MDH) to generate speech with appropriate conversational\nprosody for target utterance. The key challenge of CSS is to model the\ninteraction between the MDH and the target utterance. Note that text and speech\nmodalities in MDH have their own unique influences, and they complement each\nother to produce a comprehensive impact on the target utterance. Previous works\ndid not explicitly model such intra-modal and inter-modal interactions. To\naddress this issue, we propose a new intra-modal and inter-modal context\ninteraction scheme-based CSS system, termed III-CSS. Specifically, in the\ntraining phase, we combine the MDH with the text and speech modalities in the\ntarget utterance to obtain four modal combinations, including Historical\nText-Next Text, Historical Speech-Next Speech, Historical Text-Next Speech, and\nHistorical Speech-Next Text. Then, we design two contrastive learning-based\nintra-modal and two inter-modal interaction modules to deeply learn the\nintra-modal and inter-modal context interaction. In the inference phase, we\ntake MDH and adopt trained interaction modules to fully infer the speech\nprosody of the target utterance's text content. Subjective and objective\nexperiments on the DailyTalk dataset show that III-CSS outperforms the advanced\nbaselines in terms of prosody expressiveness. 
Code and speech samples are\navailable at https://github.com/AI-S2-Lab/I3CSS.\n","authors":["Zhenqi Jia","Rui Liu"],"pdf_url":"https://arxiv.org/pdf/2412.18733v1.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.18729v1","updated":"2024-12-25T01:10:25Z","published":"2024-12-25T01:10:25Z","title":"Optimizing Large Language Models with an Enhanced LoRA Fine-Tuning\n Algorithm for Efficiency and Robustness in NLP Tasks","summary":" This study proposes a large language model optimization method based on the\nimproved LoRA fine-tuning algorithm, aiming to improve the accuracy and\ncomputational efficiency of the model in natural language processing tasks. We\nfine-tune the large language model through a low-rank adaptation strategy,\nwhich significantly reduces the consumption of computing resources while\nmaintaining the powerful capabilities of the pre-trained model. The experiment\nuses the QQP task as the evaluation scenario. The results show that the\nimproved LoRA algorithm shows significant improvements in accuracy, F1 score,\nand MCC compared with traditional models such as BERT, Roberta, T5, and GPT-4.\nIn particular, in terms of F1 score and MCC, our model shows stronger\nrobustness and discrimination ability, which proves the potential of the\nimproved LoRA algorithm in fine-tuning large-scale pre-trained models. In\naddition, this paper also discusses the application prospects of the improved\nLoRA algorithm in other natural language processing tasks, emphasizing its\nadvantages in multi-task learning and scenarios with limited computing\nresources. 
Future research can further optimize the LoRA fine-tuning strategy\nand expand its application in larger-scale pre-trained models to improve the\ngeneralization ability and task adaptability of the model.\n","authors":["Jiacheng Hu","Xiaoxuan Liao","Jia Gao","Zhen Qi","Hongye Zheng","Chihang Wang"],"pdf_url":"https://arxiv.org/pdf/2412.18729v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.13732v2","updated":"2024-12-25T00:49:28Z","published":"2024-09-10T06:01:16Z","title":"Enhancing Large Language Models with Domain-Specific Knowledge: The Case\n in Topological Materials","summary":" Large language models (LLMs), such as ChatGPT, have demonstrated impressive\nperformance in the text generation task, showing the ability to understand and\nrespond to complex instructions. However, the performance of naive LLMs in\nspecific domains is limited due to the scarcity of domain-specific corpora and\nspecialized training. Moreover, training a specialized large-scale model\nnecessitates significant hardware resources, which restricts researchers from\nleveraging such models to drive advances. Hence, it is crucial to further\nimprove and optimize LLMs to meet specific domain demands and enhance their\nscalability. Based on the condensed matter data center, we establish a material\nknowledge graph (MaterialsKG) and integrate it with literature. Using large\nlanguage models and prompt learning, we develop a specialized dialogue system\nfor topological materials called TopoChat. Compared to naive LLMs, TopoChat\nexhibits superior performance in structural and property querying, material\nrecommendation, and complex relational reasoning. 
This system enables efficient\nand precise retrieval of information and facilitates knowledge interaction,\nthereby encouraging advancement in the field of condensed matter materials.\n","authors":["HuangChao Xu","Baohua Zhang","Zhong Jin","Tiannian Zhu","Quansheng Wu","Hongming Weng"],"pdf_url":"https://arxiv.org/pdf/2409.13732v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18719v1","updated":"2024-12-25T00:31:53Z","published":"2024-12-25T00:31:53Z","title":"Using Large Language Models for Automated Grading of Student Writing\n about Science","summary":" Assessing writing in large classes for formal or informal learners presents a\nsignificant challenge. Consequently, most large classes, particularly in\nscience, rely on objective assessment tools such as multiple-choice quizzes,\nwhich have a single correct answer. The rapid development of AI has introduced\nthe possibility of using large language models (LLMs) to evaluate student\nwriting. An experiment was conducted using GPT-4 to determine if machine\nlearning methods based on LLMs can match or exceed the reliability of\ninstructor grading in evaluating short writing assignments on topics in\nastronomy. The audience consisted of adult learners in three massive open\nonline courses (MOOCs) offered through Coursera. One course was on astronomy,\nthe second was on astrobiology, and the third was on the history and philosophy\nof astronomy. The results should also be applicable to non-science majors in\nuniversity settings, where the content and modes of evaluation are similar. The\ndata comprised answers from 120 students to 12 questions across the three\ncourses. GPT-4 was provided with total grades, model answers, and rubrics from\nan instructor for all three courses. In addition to evaluating how reliably the\nLLM reproduced instructor grades, the LLM was also tasked with generating its\nown rubrics. 
Overall, the LLM was more reliable than peer grading, both in\naggregate and by individual student, and approximately matched instructor\ngrades for all three online courses. The implication is that LLMs may soon be\nused for automated, reliable, and scalable grading of student science writing.\n","authors":["Chris Impey","Matthew Wenger","Nikhil Garuda","Shahriar Golchin","Sarah Stamer"],"pdf_url":"https://arxiv.org/pdf/2412.18719v1.pdf","comment":"Accepted at IJAIE"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2412.18962v1","updated":"2024-12-25T18:41:36Z","published":"2024-12-25T18:41:36Z","title":"Don't Lose Yourself: Boosting Multimodal Recommendation via Reducing\n Node-neighbor Discrepancy in Graph Convolutional Network","summary":" The rapid expansion of multimedia contents has led to the emergence of\nmultimodal recommendation systems. It has attracted increasing attention in\nrecommendation systems because its full utilization of data from different\nmodalities alleviates the persistent data sparsity problem. As such, multimodal\nrecommendation models can learn personalized information about nodes in terms\nof visual and textual. To further alleviate the data sparsity problem, some\nprevious works have introduced graph convolutional networks (GCNs) for\nmultimodal recommendation systems, to enhance the semantic representation of\nusers and items by capturing the potential relationships between them. However,\nadopting GCNs inevitably introduces the over-smoothing problem, which makes\nnodes too similar. Unfortunately, incorporating multimodal information\nwill exacerbate this challenge because nodes that are too similar will lose the\npersonalized information learned through multimodal information. To address\nthis problem, we propose a novel model that retains the personalized\ninformation of ego nodes during feature aggregation by Reducing Node-neighbor\nDiscrepancy (RedN^nD). 
Extensive experiments on three public datasets show that\nRedN^nD achieves state-of-the-art performance on accuracy and robustness, with\nsignificant improvements over existing GCN-based multimodal frameworks.\n","authors":["Zheyu Chen","Jinfeng Xu","Haibo Hu"],"pdf_url":"https://arxiv.org/pdf/2412.18962v1.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.18956v1","updated":"2024-12-25T18:09:34Z","published":"2024-12-25T18:09:34Z","title":"Musings About the Future of Search: A Return to the Past?","summary":" When you have a question, the most effective way to have the question\nanswered is to directly connect with experts on the topic and have a\nconversation with them. Prior to the invention of writing, this was the only\nway. Although effective, this solution exhibits scalability challenges. Writing\nallowed knowledge to be materialized, preserved, and replicated, enabling the\ndevelopment of different technologies over the centuries to connect information\nseekers with relevant information. This progression ultimately culminated in\nthe ten-blue-links web search paradigm we're familiar with, just before the\nrecent emergence of generative AI. However, we often forget that consuming\nstatic content is an imperfect solution. With the advent of large language\nmodels, it has become possible to develop a superior experience by allowing\nusers to directly engage with experts. These interactions can of course satisfy\ninformation needs, but expert models can do so much more. 
This coming future\nrequires reimagining search.\n","authors":["Jimmy Lin","Pankaj Gupta","Will Horn","Gilad Mishne"],"pdf_url":"https://arxiv.org/pdf/2412.18956v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17690v3","updated":"2024-12-25T15:05:04Z","published":"2024-12-23T16:16:30Z","title":"RAGONITE: Iterative Retrieval on Induced Databases and Verbalized RDF\n for Conversational QA over KGs with RAG","summary":" Conversational question answering (ConvQA) is a convenient means of searching\nover RDF knowledge graphs (KGs), where a prevalent approach is to translate\nnatural language questions to SPARQL queries. However, SPARQL has certain\nshortcomings: (i) it is brittle for complex intents and conversational\nquestions, and (ii) it is not suitable for more abstract needs. Instead, we\npropose a novel two-pronged system where we fuse: (i) SQL-query results over a\ndatabase automatically derived from the KG, and (ii) text-search results over\nverbalizations of KG facts. Our pipeline supports iterative retrieval: when the\nresults of any branch are found to be unsatisfactory, the system can\nautomatically opt for further rounds. We put everything together in a retrieval\naugmented generation (RAG) setup, where an LLM generates a coherent response\nfrom accumulated search results. We demonstrate the superiority of our proposed\nsystem over several baselines on a knowledge graph of BMW automobiles.\n","authors":["Rishiraj Saha Roy","Chris Hinze","Joel Schlotthauer","Farzad Naderi","Viktor Hangya","Andreas Foltyn","Luzian Hahn","Fabian Kuech"],"pdf_url":"https://arxiv.org/pdf/2412.17690v3.pdf","comment":"Accepted at BTW 2025, 10 pages"},{"id":"http://arxiv.org/abs/2402.09176v2","updated":"2024-12-25T12:38:33Z","published":"2024-02-14T13:45:06Z","title":"Large Language Model Simulator for Cold-Start Recommendation","summary":" Recommending cold items remains a significant challenge in billion-scale\nonline recommendation systems. 
While warm items benefit from historical user\nbehaviors, cold items rely solely on content features, limiting their\nrecommendation performance and impacting user experience and revenue. Current\nmodels generate synthetic behavioral embeddings from content features but fail\nto address the core issue: the absence of historical behavior data. To tackle\nthis, we introduce the LLM Simulator framework, which leverages large language\nmodels to simulate user interactions for cold items, fundamentally addressing\nthe cold-start problem. However, simply using LLM to traverse all users can\nintroduce significant complexity in billion-scale systems. To manage the\ncomputational complexity, we propose a coupled funnel ColdLLM framework for\nonline recommendation. ColdLLM efficiently reduces the number of candidate\nusers from billions to hundreds using a trained coupled filter, allowing the\nLLM to operate efficiently and effectively on the filtered set. Extensive\nexperiments show that ColdLLM significantly surpasses baselines in cold-start\nrecommendations, including Recall and NDCG metrics. A two-week A/B test also\nvalidates that ColdLLM can effectively increase the cold-start period GMV.\n","authors":["Feiran Huang","Yuanchen Bei","Zhenghang Yang","Junyi Jiang","Hao Chen","Qijie Shen","Senzhang Wang","Fakhri Karray","Philip S. Yu"],"pdf_url":"https://arxiv.org/pdf/2402.09176v2.pdf","comment":"10 pages, accepted by WSDM 2025"},{"id":"http://arxiv.org/abs/2412.18860v1","updated":"2024-12-25T10:08:54Z","published":"2024-12-25T10:08:54Z","title":"Bootstrap Your Own Context Length","summary":" We introduce a bootstrapping approach to train long-context language models\nby exploiting their short-context capabilities only. Our method utilizes a\nsimple agent workflow to synthesize diverse long-context instruction tuning\ndata, thereby eliminating the necessity for manual data collection and\nannotation. 
The proposed data synthesis workflow requires only a short-context\nlanguage model, a text retriever, and a document collection, all of which are\nreadily accessible within the open-source ecosystem. Subsequently, language\nmodels are fine-tuned using the synthesized data to extend their context\nlengths. In this manner, we effectively transfer the short-context capabilities\nof language models to long-context scenarios through a bootstrapping process.\nWe conduct experiments with the open-source Llama-3 family of models and\ndemonstrate that our method can successfully extend the context length to up to\n1M tokens, achieving superior performance across various benchmarks.\n","authors":["Liang Wang","Nan Yang","Xingxing Zhang","Xiaolong Huang","Furu Wei"],"pdf_url":"https://arxiv.org/pdf/2412.18860v1.pdf","comment":"18 pages"},{"id":"http://arxiv.org/abs/2412.18819v1","updated":"2024-12-25T08:17:37Z","published":"2024-12-25T08:17:37Z","title":"LLM-assisted vector similarity search","summary":" As data retrieval demands become increasingly complex, traditional search\nmethods often fall short in addressing nuanced and conceptual queries. Vector\nsimilarity search has emerged as a promising technique for finding semantically\nsimilar information efficiently. However, its effectiveness diminishes when\nhandling intricate queries with contextual nuances. This paper explores a\nhybrid approach combining vector similarity search with Large Language Models\n(LLMs) to enhance search accuracy and relevance. The proposed two-step solution\nfirst employs vector similarity search to shortlist potential matches, followed\nby an LLM for context-aware ranking of the results. 
Experiments on structured\ndatasets demonstrate that while vector similarity search alone performs well\nfor straightforward queries, the LLM-assisted approach excels in processing\ncomplex queries involving constraints, negations, or conceptual requirements.\nBy leveraging the natural language understanding capabilities of LLMs, this\nmethod improves the accuracy of search results for complex tasks without\nsacrificing efficiency. We also discuss real-world applications and propose\ndirections for future research to refine and scale this technique for diverse\ndatasets and use cases.\n Original article:\nhttps://engineering.grab.com/llm-assisted-vector-similarity-search\n","authors":["Md Riyadh","Muqi Li","Felix Haryanto Lie","Jia Long Loh","Haotian Mi","Sayam Bohra"],"pdf_url":"https://arxiv.org/pdf/2412.18819v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18806v1","updated":"2024-12-25T07:08:51Z","published":"2024-12-25T07:08:51Z","title":"FOR: Finetuning for Object Level Open Vocabulary Image Retrieval","summary":" As working with large datasets becomes standard, the task of accurately\nretrieving images containing objects of interest by an open set textual query\ngains practical importance. The current leading approach utilizes a pre-trained\nCLIP model without any adaptation to the target domain, balancing accuracy and\nefficiency through additional post-processing. In this work, we propose FOR:\nFinetuning for Object-centric Open-vocabulary Image Retrieval, which allows\nfinetuning on a target dataset using closed-set labels while keeping the\nvisual-language association crucial for open vocabulary retrieval. FOR is based\non two design elements: a specialized decoder variant of the CLIP head\ncustomized for the intended task, and its coupling within a multi-objective\ntraining framework. Together, these design choices result in a significant\nincrease in accuracy, showcasing improvements of up to 8 mAP@50 points over\nSoTA across three datasets. 
Additionally, we demonstrate that FOR is also\neffective in a semi-supervised setting, achieving impressive results even when\nonly a small portion of the dataset is labeled.\n","authors":["Hila Levi","Guy Heller","Dan Levi"],"pdf_url":"https://arxiv.org/pdf/2412.18806v1.pdf","comment":"WACV 2025"},{"id":"http://arxiv.org/abs/2412.18784v1","updated":"2024-12-25T05:20:08Z","published":"2024-12-25T05:20:08Z","title":"Zema Dataset: A Comprehensive Study of Yaredawi Zema with a Focus on\n Horologium Chants","summary":" Computational music research plays a critical role in advancing music\nproduction, distribution, and understanding across various musical styles\nworldwide. Despite the immense cultural and religious significance, the\nEthiopian Orthodox Tewahedo Church (EOTC) chants are relatively\nunderrepresented in computational music research. This paper contributes to\nthis field by introducing a new dataset specifically tailored for analyzing\nEOTC chants, also known as Yaredawi Zema. This work provides a comprehensive\noverview of a 10-hour dataset, 369 instances, creation, and curation process,\nincluding rigorous quality assurance measures. Our dataset has a detailed\nword-level temporal boundary and reading tone annotation along with the\ncorresponding chanting mode label of audios. Moreover, we have also identified\nthe chanting options associated with multiple chanting notations in the\nmanuscript by annotating them accordingly. Our goal in making this dataset\navailable to the public 1 is to encourage more research and study of EOTC\nchants, including lyrics transcription, lyric-to-audio alignment, and music\ngeneration tasks. 
Such research work will advance knowledge and efforts to\npreserve this distinctive liturgical music, a priceless cultural artifact for\nthe Ethiopian people.\n","authors":["Mequanent Argaw Muluneh","Yan-Tsung Peng","Worku Abebe Degife","Nigussie Abate Tadesse","Aknachew Mebreku Demeku","Li Su"],"pdf_url":"https://arxiv.org/pdf/2412.18784v1.pdf","comment":"6 pages"},{"id":"http://arxiv.org/abs/2412.18770v1","updated":"2024-12-25T04:03:09Z","published":"2024-12-25T04:03:09Z","title":"Attack-in-the-Chain: Bootstrapping Large Language Models for Attacks\n Against Black-box Neural Ranking Models","summary":" Neural ranking models (NRMs) have been shown to be highly effective in terms\nof retrieval performance. Unfortunately, they have also displayed a higher\ndegree of sensitivity to attacks than previous generation models. To help\nexpose and address this lack of robustness, we introduce a novel ranking attack\nframework named Attack-in-the-Chain, which tracks interactions between large\nlanguage models (LLMs) and NRMs based on chain-of-thought (CoT) prompting to\ngenerate adversarial examples under black-box settings. Our approach starts by\nidentifying anchor documents with higher ranking positions than the target\ndocument as nodes in the reasoning chain. We then dynamically assign the number\nof perturbation words to each node and prompt LLMs to execute attacks. Finally,\nwe verify the attack performance of all nodes at each reasoning step and\nproceed to generate the next reasoning step. 
Empirical results on two web\nsearch benchmarks show the effectiveness of our method.\n","authors":["Yu-An Liu","Ruqing Zhang","Jiafeng Guo","Maarten de Rijke","Yixing Fan","Xueqi Cheng"],"pdf_url":"https://arxiv.org/pdf/2412.18770v1.pdf","comment":"Accepted by AAAI25"},{"id":"http://arxiv.org/abs/2412.18768v1","updated":"2024-12-25T03:51:26Z","published":"2024-12-25T03:51:26Z","title":"On the Robustness of Generative Information Retrieval Models","summary":" Generative information retrieval methods retrieve documents by directly\ngenerating their identifiers. Much effort has been devoted to developing\neffective generative IR models. Less attention has been paid to the robustness\nof these models. It is critical to assess the out-of-distribution (OOD)\ngeneralization of generative IR models, i.e., how would such models generalize\nto new distributions? To answer this question, we focus on OOD scenarios from\nfour perspectives in retrieval problems: (i)query variations; (ii)unseen query\ntypes; (iii)unseen tasks; and (iv)corpus expansion. Based on this taxonomy, we\nconduct empirical studies to analyze the OOD robustness of representative\ngenerative IR models against dense retrieval models. Our empirical results\nindicate that the OOD robustness of generative IR models is in need of\nimprovement. By inspecting the OOD robustness of generative IR models we aim to\ncontribute to the development of more reliable IR models. The code is available\nat \\url{https://github.com/Davion-Liu/GR_OOD}.\n","authors":["Yu-An Liu","Ruqing Zhang","Jiafeng Guo","Changjiang Zhou","Maarten de Rijke","Xueqi Cheng"],"pdf_url":"https://arxiv.org/pdf/2412.18768v1.pdf","comment":"Accepted by ECIR 2025. 
arXiv admin note: substantial text overlap\n with arXiv:2306.12756"},{"id":"http://arxiv.org/abs/2412.13825v3","updated":"2024-12-25T02:33:14Z","published":"2024-12-18T13:12:36Z","title":"MixRec: Heterogeneous Graph Collaborative Filtering","summary":" For modern recommender systems, the use of low-dimensional latent\nrepresentations to embed users and items based on their observed interactions\nhas become commonplace. However, many existing recommendation models are\nprimarily designed for coarse-grained and homogeneous interactions, which\nlimits their effectiveness in two critical dimensions. Firstly, these models\nfail to leverage the relational dependencies that exist across different types\nof user behaviors, such as page views, collects, comments, and purchases.\nSecondly, they struggle to capture the fine-grained latent factors that drive\nuser interaction patterns. To address these limitations, we present a\nheterogeneous graph collaborative filtering model MixRec that excels at\ndisentangling users' multi-behavior interaction patterns and uncovering the\nlatent intent factors behind each behavior. Our model achieves this by\nincorporating intent disentanglement and multi-behavior modeling, facilitated\nby a parameterized heterogeneous hypergraph architecture. Furthermore, we\nintroduce a novel contrastive learning paradigm that adaptively explores the\nadvantages of self-supervised data augmentation, thereby enhancing the model's\nresilience against data sparsity and expressiveness with relation\nheterogeneity. To validate the efficacy of MixRec, we conducted extensive\nexperiments on three public datasets. The results clearly demonstrate its\nsuperior performance, significantly outperforming various state-of-the-art\nbaselines. 
Our model is open-sourced and available at:\nhttps://github.com/HKUDS/MixRec.\n","authors":["Lianghao Xia","Meiyan Xie","Yong Xu","Chao Huang"],"pdf_url":"https://arxiv.org/pdf/2412.13825v3.pdf","comment":"This paper is accepted by WSDM'2025"},{"id":"http://arxiv.org/abs/2412.18735v1","updated":"2024-12-25T01:47:39Z","published":"2024-12-25T01:47:39Z","title":"Adaptive Self-supervised Learning for Social Recommendations","summary":" In recent years, researchers have attempted to exploit social relations to\nimprove the performance in recommendation systems. Generally, most existing\nsocial recommendation methods heavily depends on substantial domain knowledge\nand expertise in primary recommendation tasks for designing useful auxiliary\ntasks. Meanwhile, Self-Supervised Learning (SSL) recently has received\nconsiderable attention in the field of recommendation, since it can provide\nself-supervision signals in assisting the improvement of target recommendation\nsystems by constructing self-supervised auxiliary tasks from raw data without\nhuman-annotated labels. Despite the great success, these SSL-based social\nrecommendations are insufficient to adaptively balance various self-supervised\nauxiliary tasks, since assigning equal weights on various auxiliary tasks can\nresult in sub-optimal recommendation performance, where different\nself-supervised auxiliary tasks may contribute differently to improving the\nprimary social recommendation across different datasets. To address this issue,\nin this work, we propose Adaptive Self-supervised Learning for Social\nRecommendations (AdasRec) by taking advantage of various self-supervised\nauxiliary tasks. More specifically, an adaptive weighting mechanism is proposed\nto learn adaptive weights for various self-supervised auxiliary tasks, so as to\nbalance the contribution of such self-supervised auxiliary tasks for enhancing\nrepresentation learning in social recommendations. 
The adaptive weighting\nmechanism is used to assign different weights on auxiliary tasks to achieve an\noverall weighting of the entire auxiliary tasks and ultimately assist the\nprimary recommendation task, achieved by a meta learning optimization problem\nwith an adaptive weighting network. Comprehensive experiments on various\nreal-world datasets are constructed to verify the effectiveness of our proposed\nmethod.\n","authors":["Xin He","Shanru Lin","Wenqi Fan","Mingchen Sun","Ying Wang","Xin Wang"],"pdf_url":"https://arxiv.org/pdf/2412.18735v1.pdf","comment":"13 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.18731v1","updated":"2024-12-25T01:22:35Z","published":"2024-12-25T01:22:35Z","title":"Position-aware Graph Transformer for Recommendation","summary":" Collaborative recommendation fundamentally involves learning high-quality\nuser and item representations from interaction data. Recently, graph\nconvolution networks (GCNs) have advanced the field by utilizing high-order\nconnectivity patterns in interaction graphs, as evidenced by state-of-the-art\nmethods like PinSage and LightGCN. However, one key limitation has not been\nwell addressed in existing solutions: capturing long-range collaborative\nfiltering signals, which are crucial for modeling user preference. In this\nwork, we propose a new graph transformer (GT) framework --\n\\textit{Position-aware Graph Transformer for Recommendation} (PGTR), which\ncombines the global modeling capability of Transformer blocks with the local\nneighborhood feature extraction of GCNs. The key insight is to explicitly\nincorporate node position and structure information from the user-item\ninteraction graph into GT architecture via several purpose-designed positional\nencodings. The long-range collaborative signals from the Transformer block are\nthen combined linearly with the local neighborhood features from the GCN\nbackbone to enhance node embeddings for final recommendations. 
Empirical\nstudies demonstrate the effectiveness of the proposed PGTR method when\nimplemented on various GCN-based backbones across four real-world datasets, and\nthe robustness against interaction sparsity as well as noise.\n","authors":["Jiajia Chen","Jiancan Wu","Jiawei Chen","Chongming Gao","Yong Li","Xiang Wang"],"pdf_url":"https://arxiv.org/pdf/2412.18731v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18715v1","updated":"2024-12-25T00:26:51Z","published":"2024-12-25T00:26:51Z","title":"Optimization and Scalability of Collaborative Filtering Algorithms in\n Large Language Models","summary":" With the rapid development of large language models (LLMs) and the growing\ndemand for personalized content, recommendation systems have become critical in\nenhancing user experience and driving engagement. Collaborative filtering\nalgorithms, being core to many recommendation systems, have garnered\nsignificant attention for their efficiency and interpretability. However,\ntraditional collaborative filtering approaches face numerous challenges when\nintegrated into large-scale LLM-based systems, including high computational\ncosts, severe data sparsity, cold start problems, and lack of scalability. This\npaper investigates the optimization and scalability of collaborative filtering\nalgorithms in large language models, addressing these limitations through\nadvanced optimization strategies. Firstly, we analyze the fundamental\nprinciples of collaborative filtering algorithms and their limitations when\napplied in LLM-based contexts. 
Next, several optimization techniques such as\nmatrix factorization, approximate nearest neighbor search, and parallel\ncomputing are proposed to enhance computational efficiency and model accuracy.\nAdditionally, strategies such as distributed architecture and model compression\nare explored to facilitate dynamic updates and scalability in data-intensive\nenvironments.\n","authors":["Haowei Yang","Longfei Yun","Jinghan Cao","Qingyi Lu","Yuming Tu"],"pdf_url":"https://arxiv.org/pdf/2412.18715v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18713v1","updated":"2024-12-25T00:23:53Z","published":"2024-12-25T00:23:53Z","title":"Enhanced Recommendation Combining Collaborative Filtering and Large\n Language Models","summary":" With the advent of the information explosion era, the importance of\nrecommendation systems in various applications is increasingly significant.\nTraditional collaborative filtering algorithms are widely used due to their\neffectiveness in capturing user behavior patterns, but they encounter\nlimitations when dealing with cold start problems and data sparsity. Large\nLanguage Models (LLMs), with their strong natural language understanding and\ngeneration capabilities, provide a new breakthrough for recommendation systems.\nThis study proposes an enhanced recommendation method that combines\ncollaborative filtering and LLMs, aiming to leverage collaborative filtering's\nadvantage in modeling user preferences while enhancing the understanding of\ntextual information about users and items through LLMs to improve\nrecommendation accuracy and diversity. This paper first introduces the\nfundamental theories of collaborative filtering and LLMs, then designs a\nrecommendation system architecture that integrates both, and validates the\nsystem's effectiveness through experiments. 
The results show that the hybrid\nmodel based on collaborative filtering and LLMs significantly improves\nprecision, recall, and user satisfaction, demonstrating its potential in\ncomplex recommendation scenarios.\n","authors":["Xueting Lin","Zhan Cheng","Longfei Yun","Qingyi Lu","Yuanshuai Luo"],"pdf_url":"https://arxiv.org/pdf/2412.18713v1.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2412.18988v1","updated":"2024-12-25T21:52:31Z","published":"2024-12-25T21:52:31Z","title":"MTCAE-DFER: Multi-Task Cascaded Autoencoder for Dynamic Facial\n Expression Recognition","summary":" This paper expands the cascaded network branch of the autoencoder-based\nmulti-task learning (MTL) framework for dynamic facial expression recognition,\nnamely Multi-Task Cascaded Autoencoder for Dynamic Facial Expression\nRecognition (MTCAE-DFER). MTCAE-DFER builds a plug-and-play cascaded decoder\nmodule, which is based on the Vision Transformer (ViT) architecture and employs\nthe decoder concept of Transformer to reconstruct the multi-head attention\nmodule. The decoder output from the previous task serves as the query (Q),\nrepresenting local dynamic features, while the Video Masked Autoencoder\n(VideoMAE) shared encoder output acts as both the key (K) and value (V),\nrepresenting global dynamic features. This setup facilitates interaction\nbetween global and local dynamic features across related tasks. Additionally,\nthis proposal aims to alleviate overfitting of complex large model. 
We utilize\nautoencoder-based multi-task cascaded learning approach to explore the impact\nof dynamic face detection and dynamic face landmark on dynamic facial\nexpression recognition, which enhances the model's generalization ability.\nAfter we conduct extensive ablation experiments and comparison with\nstate-of-the-art (SOTA) methods on various public datasets for dynamic facial\nexpression recognition, the robustness of the MTCAE-DFER model and the\neffectiveness of global-local dynamic feature interaction among related tasks\nhave been proven.\n","authors":["Peihao Xiang","Kaida Wu","Chaohao Lin","Ou Bai"],"pdf_url":"https://arxiv.org/pdf/2412.18988v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18962v1","updated":"2024-12-25T18:41:36Z","published":"2024-12-25T18:41:36Z","title":"Don't Lose Yourself: Boosting Multimodal Recommendation via Reducing\n Node-neighbor Discrepancy in Graph Convolutional Network","summary":" The rapid expansion of multimedia contents has led to the emergence of\nmultimodal recommendation systems. It has attracted increasing attention in\nrecommendation systems because its full utilization of data from different\nmodalities alleviates the persistent data sparsity problem. As such, multimodal\nrecommendation models can learn personalized information about nodes in terms\nof visual and textual. To further alleviate the data sparsity problem, some\nprevious works have introduced graph convolutional networks (GCNs) for\nmultimodal recommendation systems, to enhance the semantic representation of\nusers and items by capturing the potential relationships between them. However,\nadopting GCNs inevitably introduces the over-smoothing problem, which makes\nnodes too similar. Unfortunately, incorporating multimodal information\nwill exacerbate this challenge because nodes that are too similar will lose the\npersonalized information learned through multimodal information. 
To address\nthis problem, we propose a novel model that retains the personalized\ninformation of ego nodes during feature aggregation by Reducing Node-neighbor\nDiscrepancy (RedN^nD). Extensive experiments on three public datasets show that\nRedN^nD achieves state-of-the-art performance on accuracy and robustness, with\nsignificant improvements over existing GCN-based multimodal frameworks.\n","authors":["Zheyu Chen","Jinfeng Xu","Haibo Hu"],"pdf_url":"https://arxiv.org/pdf/2412.18962v1.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.18960v1","updated":"2024-12-25T18:36:21Z","published":"2024-12-25T18:36:21Z","title":"XRFlux: Virtual Reality Benchmark for Edge Caching Systems","summary":" We introduce a Unity based benchmark XRFlux for evaluating Virtual Reality\n(VR) delivery systems using edge-cloud caching. As VR applications and systems\nprogress, the need to meet strict latency and Quality of Experience (QoE)\nrequirements is increasingly evident. In the context of VR, traditional cloud\narchitectures (e.g., remote AWS S3 for content delivery) often struggle to meet\nthese demands, especially for users of the same application in different\nlocations. With edge computing, resources are brought closer to users in\nefforts to reduce latency and improve QoEs. However, VR's dynamic nature, with\nchanging fields of view (FoVs) and user synchronization requirements, creates\nvarious challenges for edge caching. We address the lack of suitable benchmarks\nand propose a framework that simulates multiuser VR scenarios while logging\nusers' interaction with objects within their actual and predicted FoVs. The\nbenchmark's activity log can then be played back through an edge cache to\nassess the resulting QoEs. 
This tool fills a gap by supporting research in the\noptimization of edge caching (and other edge-cloud functions) for VR streaming.\n","authors":["Nader Alfares","George Kesidis"],"pdf_url":"https://arxiv.org/pdf/2412.18960v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18933v1","updated":"2024-12-25T15:43:41Z","published":"2024-12-25T15:43:41Z","title":"TINQ: Temporal Inconsistency Guided Blind Video Quality Assessment","summary":" Blind video quality assessment (BVQA) has been actively researched for\nuser-generated content (UGC) videos. Recently, super-resolution (SR) techniques\nhave been widely applied in UGC. Therefore, an effective BVQA method for both\nUGC and SR scenarios is essential. Temporal inconsistency, referring to\nirregularities between consecutive frames, is relevant to video quality.\nCurrent BVQA approaches typically model temporal relationships in UGC videos\nusing statistics of motion information, but inconsistencies remain unexplored.\nAdditionally, different from temporal inconsistency in UGC videos, such\ninconsistency in SR videos is amplified due to upscaling algorithms. In this\npaper, we introduce the Temporal Inconsistency Guided Blind Video Quality\nAssessment (TINQ) metric, demonstrating that exploring temporal inconsistency\nis crucial for effective BVQA. Since temporal inconsistencies vary between UGC\nand SR videos, they are calculated in different ways. Based on this, a spatial\nmodule highlights inconsistent areas across consecutive frames at coarse and\nfine granularities. In addition, a temporal module aggregates features over\ntime in two stages. The first stage employs a visual memory capacity block to\nadaptively segment the time dimension based on estimated complexity, while the\nsecond stage focuses on selecting key features. The stages work together\nthrough Consistency-aware Fusion Units to regress cross-time-scale video\nquality. 
Extensive experiments on UGC and SR video quality datasets show that\nour method outperforms existing state-of-the-art BVQA methods. Code is\navailable at https://github.com/Lighting-YXLI/TINQ.\n","authors":["Yixiao Li","Xiaoyuan Yang","Weide Liu","Xin Jin","Xu Jia","Yukun Lai","Haotao Liu","Paul L Rosin","Wei Zhou"],"pdf_url":"https://arxiv.org/pdf/2412.18933v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16495v2","updated":"2024-12-25T12:41:58Z","published":"2024-12-21T05:49:40Z","title":"Follow-Your-MultiPose: Tuning-Free Multi-Character Text-to-Video\n Generation via Pose Guidance","summary":" Text-editable and pose-controllable character video generation is a\nchallenging but prevailing topic with practical applications. However, existing\napproaches mainly focus on single-object video generation with pose guidance,\nignoring the realistic situation in which multiple characters appear\nconcurrently in a scene. To tackle this, we propose a novel multi-character\nvideo generation framework in a tuning-free manner, based on separated text and\npose guidance. Specifically, we first extract character masks from the pose\nsequence to identify the spatial position of each generated character, and then\nobtain individual prompts for each character with LLMs for precise text\nguidance. Moreover, spatial-aligned cross attention and a multi-branch control\nmodule are proposed to generate fine-grained controllable multi-character\nvideos. The visualized generation results demonstrate the precise\ncontrollability of our method for multi-character generation. We also verify\nthe generality of our method by applying it to various personalized T2I models. 
Moreover, the quantitative results show that our approach achieves\nsuperior performance compared with previous works.\n","authors":["Beiyuan Zhang","Yue Ma","Chunlei Fu","Xinyang Song","Zhenan Sun","Ziqiang Li"],"pdf_url":"https://arxiv.org/pdf/2412.16495v2.pdf","comment":"5 pages,conference"},{"id":"http://arxiv.org/abs/2412.18390v2","updated":"2024-12-25T12:12:10Z","published":"2024-12-24T12:28:19Z","title":"RDPM: Solve Diffusion Probabilistic Models via Recurrent Token\n Prediction","summary":" Diffusion Probabilistic Models (DPMs) have emerged as the de facto approach\nfor high-fidelity image synthesis, operating diffusion processes on continuous\nVAE latent, which significantly differ from the text generation methods\nemployed by Large Language Models (LLMs). In this paper, we introduce a novel\ngenerative framework, the Recurrent Diffusion Probabilistic Model (RDPM), which\nenhances the diffusion process through a recurrent token prediction mechanism,\nthereby pioneering the field of Discrete Diffusion. By progressively\nintroducing Gaussian noise into the latent representations of images and\nencoding them into vector-quantized tokens in a recurrent manner, RDPM\nfacilitates a unique diffusion process on discrete-value domains. This process\niteratively predicts the token codes for subsequent timesteps, transforming the\ninitial standard Gaussian noise into the source data distribution, aligning\nwith GPT-style models in terms of the loss function. RDPM demonstrates superior\nperformance while benefiting from the speed advantage of requiring only a few\ninference steps. This model not only leverages the diffusion process to ensure\nhigh-quality generation but also converts continuous signals into a series of\nhigh-fidelity discrete tokens, thereby maintaining a unified optimization\nstrategy with other discrete tokens, such as text. 
We anticipate that this work\nwill contribute to the development of a unified model for multimodal\ngeneration, specifically by integrating continuous signal domains such as\nimages, videos, and audio with text. We will release the code and model weights\nto the open-source community.\n","authors":["Xiaoping Wu","Jie Hu","Xiaoming Wei"],"pdf_url":"https://arxiv.org/pdf/2412.18390v2.pdf","comment":"8 pages"},{"id":"http://arxiv.org/abs/2412.18834v1","updated":"2024-12-25T08:42:23Z","published":"2024-12-25T08:42:23Z","title":"Adaptive Rate Control for Deep Video Compression with Rate-Distortion\n Prediction","summary":" Deep video compression has made significant progress in recent years,\nachieving rate-distortion performance that surpasses that of traditional video\ncompression methods. However, rate control schemes tailored for deep video\ncompression have not been well studied. In this paper, we propose a neural\nnetwork-based $\\lambda$-domain rate control scheme for deep video compression,\nwhich determines the coding parameter $\\lambda$ for each to-be-coded frame\nbased on the rate-distortion-$\\lambda$ (R-D-$\\lambda$) relationships directly\nlearned from uncompressed frames, achieving high rate control accuracy\nefficiently without the need for pre-encoding. Moreover, this content-aware\nscheme is able to mitigate inter-frame quality fluctuations and adapt to abrupt\nchanges in video content. Specifically, we introduce two neural network-based\npredictors to estimate the relationship between bitrate and $\\lambda$, as well\nas the relationship between distortion and $\\lambda$ for each frame. Then we\ndetermine the coding parameter $\\lambda$ for each frame to achieve the target\nbitrate. 
Experimental results demonstrate that our approach achieves high rate\ncontrol accuracy at the mini-GOP level with low time overhead and mitigates\ninter-frame quality fluctuations across video content of varying resolutions.\n","authors":["Bowen Gu","Hao Chen","Ming Lu","Jie Yao","Zhan Ma"],"pdf_url":"https://arxiv.org/pdf/2412.18834v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.10762v2","updated":"2024-12-25T05:17:34Z","published":"2024-09-16T22:32:22Z","title":"Stimulus Modality Matters: Impact of Perceptual Evaluations from\n Different Modalities on Speech Emotion Recognition System Performance","summary":" Speech Emotion Recognition (SER) systems rely on speech input and emotional\nlabels annotated by humans. However, various emotion databases collect\nperceptional evaluations in different ways. For instance, the IEMOCAP dataset\nuses video clips with sounds for annotators to provide their emotional\nperceptions. However, the most significant English emotion dataset, the\nMSP-PODCAST, only provides speech for raters to choose the emotional ratings.\nNevertheless, using speech as input is the standard approach to training SER\nsystems. Therefore, the open question is the emotional labels elicited by which\nscenarios are the most effective for training SER systems. We comprehensively\ncompare the effectiveness of SER systems trained with labels elicited by\ndifferent modality stimuli and evaluate the SER systems on various testing\nconditions. Also, we introduce an all-inclusive label that combines all labels\nelicited by various modalities. 
We show that using labels elicited by\nvoice-only stimuli for training yields better performance on the test\nset.\n","authors":["Huang-Cheng Chou","Haibin Wu","Chi-Chun Lee"],"pdf_url":"https://arxiv.org/pdf/2409.10762v2.pdf","comment":"5 pages, 2 figures, 4 tables, acceptance for ICASSP 2025"},{"id":"http://arxiv.org/abs/2409.10197v2","updated":"2024-12-25T04:30:16Z","published":"2024-09-16T11:43:19Z","title":"Fit and Prune: Fast and Training-free Visual Token Pruning for\n Multi-modal Large Language Models","summary":" Recent Multimodal Large Language Models (MLLMs) often use large numbers of\nimage tokens to compensate for their visual shortcomings, which not only\nexhibits obvious redundancy but also greatly exacerbates the already high\ncomputation. Token pruning is an effective solution for speeding up MLLMs, but\nwhen and how to drop tokens remains a challenge. In this paper, we\npropose a novel and training-free approach for the effective visual token\npruning of MLLMs, termed FitPrune, which can quickly produce a complete pruning\nrecipe for MLLMs according to a pre-defined budget. Specifically, FitPrune\nconsiders token pruning as a statistical problem of the MLLM, and its objective\nis to find an optimal pruning scheme that can minimize the divergence of the\nattention distributions before and after pruning. In practice, FitPrune can be\nquickly accomplished based on the attention statistics from a small batch of\ninference data, avoiding the expensive trials of MLLMs. According to the\npruning recipe, an MLLM can directly remove the redundant visual tokens of\ndifferent examples during inference. To validate FitPrune, we apply it to a set\nof recent MLLMs, including LLaVA-1.5, LLaVA-HR and LLaVA-NEXT, and conduct\nextensive experiments on a set of benchmarks. 
The experimental results show\nthat our FitPrune can not only reduce the computational complexity to a large\nextent, while retaining high performance, e.g., -54.9% FLOPs for LLaVA-NEXT\nwith only 0.5% accuracy drop. Notably, the pruning recipe can be obtained in\nabout 5 minutes. Our code is available at https://github.com/ywh187/FitPrune.\n","authors":["Weihao Ye","Qiong Wu","Wenhao Lin","Yiyi Zhou"],"pdf_url":"https://arxiv.org/pdf/2409.10197v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18748v1","updated":"2024-12-25T02:41:13Z","published":"2024-12-25T02:41:13Z","title":"Towards Expressive Video Dubbing with Multiscale Multimodal Context\n Interaction","summary":" Automatic Video Dubbing (AVD) generates speech aligned with lip motion and\nfacial emotion from scripts. Recent research focuses on modeling multimodal\ncontext to enhance prosody expressiveness but overlooks two key issues: 1)\nMultiscale prosody expression attributes in the context influence the current\nsentence's prosody. 2) Prosody cues in context interact with the current\nsentence, impacting the final prosody expressiveness. To tackle these\nchallenges, we propose M2CI-Dubber, a Multiscale Multimodal Context Interaction\nscheme for AVD. This scheme includes two shared M2CI encoders to model the\nmultiscale multimodal context and facilitate its deep interaction with the\ncurrent sentence. By extracting global and local features for each modality in\nthe context, utilizing attention-based mechanisms for aggregation and\ninteraction, and employing an interaction-based graph attention network for\nfusion, the proposed approach enhances the prosody expressiveness of\nsynthesized speech for the current sentence. Experiments on the Chem dataset\nshow our model outperforms baselines in dubbing expressiveness. 
The code and\ndemos are available at https://github.com/AI-S2-Lab/M2CI-Dubber.\n","authors":["Yuan Zhao","Rui Liu","Gaoxiang Cong"],"pdf_url":"https://arxiv.org/pdf/2412.18748v1.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.21200v1","updated":"2024-12-30T18:59:06Z","published":"2024-12-30T18:59:06Z","title":"Distributed Mixture-of-Agents for Edge Inference with Large Language\n Models","summary":" Mixture-of-Agents (MoA) has recently been proposed as a method to enhance\nthe performance of large language models (LLMs), enabling multiple individual\nLLMs to work together for collaborative inference. This collaborative approach\nresults in improved responses to user prompts compared to relying on a single\nLLM. In this paper, we consider such an MoA architecture in a distributed\nsetting, where LLMs operate on individual edge devices, each uniquely\nassociated with a user and equipped with its own distributed computing power.\nThese devices exchange information using decentralized gossip algorithms,\nallowing different device nodes to talk without the supervision of a\ncentralized server. In the considered setup, different users have their own LLM\nmodels to address user prompts. Additionally, the devices gossip either their\nown user-specific prompts or augmented prompts to generate more refined answers\nto certain queries. User prompts are temporarily stored in the device queues\nwhen their corresponding LLMs are busy. Given the memory limitations of edge\ndevices, it is crucial to ensure that the average queue sizes in the system\nremain bounded. In this paper, we address this by theoretically calculating the\nqueuing stability conditions for the device queues under reasonable\nassumptions, which we validate experimentally as well. 
Further, we demonstrate\nthrough experiments, leveraging open-source LLMs for the implementation of\ndistributed MoA, that certain MoA configurations produce higher-quality\nresponses compared to others, as evaluated on AlpacaEval 2.0 benchmark. The\nimplementation is available at:\nhttps://github.com/purbeshmitra/distributed_moa.\n","authors":["Purbesh Mitra","Priyanka Kaswan","Sennur Ulukus"],"pdf_url":"https://arxiv.org/pdf/2412.21200v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.21199v1","updated":"2024-12-30T18:58:58Z","published":"2024-12-30T18:58:58Z","title":"HumanEval Pro and MBPP Pro: Evaluating Large Language Models on\n Self-invoking Code Generation","summary":" We introduce self-invoking code generation, a new task designed to evaluate\nthe progressive reasoning and problem-solving capabilities of LLMs. In this\ntask, models are presented with a base problem and a related, more complex\nproblem. They must solve the base problem and then utilize its solution to\naddress the more complex one. This work features three key contributions.\nFirst, we propose a general recipe for generating more challenging versions of\nexisting benchmarks, resulting in three new benchmarks: HumanEval Pro, MBPP\nPro, and BigCodeBench-Lite Pro, specifically designed to assess LLMs on\nself-invoking code generation. Second, from the analysis of experimental\nresults over twenty LLMs on our benchmarks, we have two important observations:\n(i) Most LLMs excel in traditional code generation benchmarks like HumanEval\nand MBPP, but their performance declines on self-invoking tasks. For example,\no1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro.\n(ii) On self-invoking code generation task, the instruction-tuned models\ndemonstrate only marginal improvements compared to the base models. Third, we\ndisclose the types of failure modes that exist in our evaluation results. 
All\nthese results underscore the need for further advancements in self-invoking\ncode generation tasks and provide a new direction for future research on\nenhancing LLMs' code reasoning capabilities.\n","authors":["Zhaojian Yu","Yilun Zhao","Arman Cohan","Xiao-Ping Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.21199v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.21187v1","updated":"2024-12-30T18:55:12Z","published":"2024-12-30T18:55:12Z","title":"Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs","summary":" The remarkable performance of models like the OpenAI o1 can be attributed to\ntheir ability to emulate human-like long-time thinking during inference. These\nmodels employ extended chain-of-thought (CoT) processes, exploring multiple\nstrategies to enhance problem-solving capabilities. However, a critical\nquestion remains: How to intelligently and efficiently scale computational\nresources during testing. This paper presents the first comprehensive study on\nthe prevalent issue of overthinking in these models, where excessive\ncomputational resources are allocated for simple problems with minimal benefit.\nWe introduce novel efficiency metrics from both outcome and process\nperspectives to evaluate the rational use of computational resources by o1-like\nmodels. 
Using a self-training paradigm, we propose strategies to mitigate\noverthinking, streamlining reasoning processes without compromising accuracy.\nExperimental results show that our approach successfully reduces computational\noverhead while preserving model performance across a range of testsets with\nvarying difficulty levels, such as GSM8K, MATH500, GPQA, and AIME.\n","authors":["Xingyu Chen","Jiahao Xu","Tian Liang","Zhiwei He","Jianhui Pang","Dian Yu","Linfeng Song","Qiuzhi Liu","Mengfei Zhou","Zhuosheng Zhang","Rui Wang","Zhaopeng Tu","Haitao Mi","Dong Yu"],"pdf_url":"https://arxiv.org/pdf/2412.21187v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2412.21178v1","updated":"2024-12-30T18:50:37Z","published":"2024-12-30T18:50:37Z","title":"Two-component spatiotemporal template for activation-inhibition of\n speech in ECoG","summary":" I compute the average trial-by-trial power of band-limited speech activity\nacross epochs of multi-channel high-density electrocorticography (ECoG)\nrecorded from multiple subjects during a consonant-vowel speaking task. I show\nthat previously seen anti-correlations of average beta frequency activity\n(12-35 Hz) to high-frequency gamma activity (70-140 Hz) during speech movement\nare observable between individual ECoG channels in the sensorimotor cortex\n(SMC). With this I fit a variance-based model using principal component\nanalysis to the band-powers of individual channels of session-averaged ECoG\ndata in the SMC and project SMC channels onto their lower-dimensional principal\ncomponents.\n Spatiotemporal relationships between speech-related activity and principal\ncomponents are identified by correlating the principal components of both\nfrequency bands to individual ECoG channels over time using windowed\ncorrelation. 
Correlations of principal component areas to sensorimotor areas\nreveal a distinct two-component activation-inhibition-like representation for\nspeech that resembles distinct local sensorimotor areas recently shown to have\ncomplex interplay in whole-body motor control, inhibition, and posture. Notably\nthe third principal component shows insignificant correlations across all\nsubjects, suggesting two components of ECoG are sufficient to represent SMC\nactivity during speech movement.\n","authors":["Eric Easthope"],"pdf_url":"https://arxiv.org/pdf/2412.21178v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.21154v1","updated":"2024-12-30T18:33:28Z","published":"2024-12-30T18:33:28Z","title":"Aviary: training language agents on challenging scientific tasks","summary":" Solving complex real-world tasks requires cycles of actions and observations.\nThis is particularly true in science, where tasks require many cycles of\nanalysis, tool use, and experimentation. Language agents are promising for\nautomating intellectual tasks in science because they can interact with tools\nvia natural language or code. Yet their flexibility creates conceptual and\npractical challenges for software implementations, since agents may comprise\nnon-standard components such as internal reasoning, planning, tool usage, as\nwell as the inherent stochasticity of temperature-sampled language models.\nHere, we introduce Aviary, an extensible gymnasium for language agents. We\nformalize agents as policies solving language-grounded partially observable\nMarkov decision processes, which we term language decision processes. We then\nimplement five environments, including three challenging scientific\nenvironments: (1) manipulating DNA constructs for molecular cloning, (2)\nanswering research questions by accessing scientific literature, and (3)\nengineering protein stability. 
These environments were selected for their focus\non multi-step reasoning and their relevance to contemporary biology research.\nFinally, with online training and scaling inference-time compute, we show that\nlanguage agents backed by open-source, non-frontier LLMs can match and exceed\nboth frontier LLM agents and human experts on multiple tasks at up to 100x\nlower inference cost.\n","authors":["Siddharth Narayanan","James D. Braza","Ryan-Rhys Griffiths","Manu Ponnapati","Albert Bou","Jon Laurent","Ori Kabeli","Geemi Wellawatte","Sam Cox","Samuel G. Rodriques","Andrew D. White"],"pdf_url":"https://arxiv.org/pdf/2412.21154v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.05093v3","updated":"2024-12-30T18:21:08Z","published":"2024-08-09T14:34:32Z","title":"Order Matters in Hallucination: Reasoning Order as Benchmark and\n Reflexive Prompting for Large-Language-Models","summary":" Large language models (LLMs) have generated significant attention since their\ninception, finding applications across various academic and industrial domains.\nHowever, these models often suffer from the \"hallucination problem\", where\noutputs, though grammatically and logically coherent, lack factual accuracy or\nare entirely fabricated. A particularly troubling issue discovered and widely\ndiscussed recently is the numerical comparison error where multiple LLMs\nincorrectly infer that \"9.11$>$9.9\". We discovered that the order in which LLMs\ngenerate answers and reasoning impacts their consistency. Specifically, results\nvary significantly when an LLM generates an answer first and then provides the\nreasoning versus generating the reasoning process first and then the\nconclusion. Inspired by this, we propose a new benchmark method for assessing\nLLM consistency: comparing responses generated through these two different\napproaches. This benchmark effectively identifies instances where LLMs\nfabricate answers and subsequently generate justifications. 
Furthermore, we\nintroduce a novel and straightforward prompt strategy designed to mitigate this\nissue. Experimental results demonstrate that this strategy improves performance\nacross various LLMs compared to direct questioning. This work not only sheds\nlight on a critical flaw in LLMs but also offers a practical solution to\nenhance their reliability.\n","authors":["Zikai Xie"],"pdf_url":"https://arxiv.org/pdf/2408.05093v3.pdf","comment":"8 pages, submitted to ACL 2025"},{"id":"http://arxiv.org/abs/2412.21140v1","updated":"2024-12-30T18:15:45Z","published":"2024-12-30T18:15:45Z","title":"Facilitating large language model Russian adaptation with Learned\n Embedding Propagation","summary":" Rapid advancements in large language model (LLM) technologies have led to the\nintroduction of powerful open-source instruction-tuned LLMs that match the text\ngeneration quality of state-of-the-art counterparts such as GPT-4. While the\nemergence of such models accelerates the adoption of LLM technologies in\nsensitive-information environments, the authors of such models do not disclose\nthe training data necessary to replicate the results, thus making the\nachievements model-exclusive. Since those open-source models are also\nmultilingual, this in turn reduces the benefits of training language-specific\nLLMs, as improved inference efficiency becomes the only guaranteed advantage of\nsuch a costly procedure. More cost-efficient options, such as vocabulary\nextension and subsequent continued pre-training, are also inhibited by the lack\nof access to high-quality instruction-tuning data, since it is the major factor\nbehind the resulting LLM's task-solving capabilities. To address these\nlimitations and cut the costs of the language adaptation pipeline, we propose\nLearned Embedding Propagation (LEP). 
Unlike existing approaches, our method has\nlower training data requirements due to its minimal impact on existing LLM\nknowledge, which we reinforce using a novel ad-hoc embedding propagation\nprocedure that allows skipping the instruction-tuning step and instead implants\nthe new language knowledge directly into any existing instruct-tuned variant.\nWe evaluated four Russian vocabulary adaptations for LLaMa-3-8B and Mistral-7B,\nshowing that LEP is competitive with traditional instruction-tuning methods,\nachieving performance comparable to OpenChat 3.5 and LLaMa-3-8B-Instruct, with\nfurther improvements via self-calibration and continued tuning enhancing\ntask-solving capabilities.\n","authors":["Mikhail Tikhomirov","Daniil Chernyshev"],"pdf_url":"https://arxiv.org/pdf/2412.21140v1.pdf","comment":"Preprint version of an article published in the Journal of Language\n and Education. Copyright held by the owner/author(s). Publication rights\n licensed to the Journal of Language and Education"},{"id":"http://arxiv.org/abs/2412.21139v1","updated":"2024-12-30T18:15:39Z","published":"2024-12-30T18:15:39Z","title":"Training Software Engineering Agents and Verifiers with SWE-Gym","summary":" We present SWE-Gym, the first environment for training real-world software\nengineering (SWE) agents. SWE-Gym contains 2,438 real-world Python task\ninstances, each comprising a codebase with an executable runtime environment,\nunit tests, and a task specified in natural language. We use SWE-Gym to train\nlanguage-model-based SWE agents, achieving up to 19% absolute gains in resolve\nrate on the popular SWE-Bench Verified and Lite test sets. We also experiment\nwith inference-time scaling through verifiers trained on agent trajectories\nsampled from SWE-Gym. When combined with our fine-tuned SWE agents, we achieve\n32.0% and 26.0% on SWE-Bench Verified and Lite, respectively, reflecting a new\nstate-of-the-art for open-weight SWE agents. 
To facilitate further research, we\npublicly release SWE-Gym, models, and agent trajectories.\n","authors":["Jiayi Pan","Xingyao Wang","Graham Neubig","Navdeep Jaitly","Heng Ji","Alane Suhr","Yizhe Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.21139v1.pdf","comment":"Code at https://github.com/SWE-Gym/SWE-Gym"},{"id":"http://arxiv.org/abs/2412.21102v1","updated":"2024-12-30T17:25:58Z","published":"2024-12-30T17:25:58Z","title":"Exploring and Controlling Diversity in LLM-Agent Conversation","summary":" Diversity is a critical aspect of multi-agent communication. In this paper,\nwe focus on controlling and exploring diversity in the context of open-domain\nmulti-agent conversations, particularly for world simulation applications. We\npropose Adaptive Prompt Pruning (APP), a novel method that dynamically adjusts\nthe content of the utterance generation prompt to control diversity using a\nsingle parameter, lambda. Through extensive experiments, we show that APP\neffectively controls the output diversity across models and datasets, with\npruning more information leading to more diverse output. We comprehensively\nanalyze the relationship between prompt content and conversational diversity.\nOur findings reveal that information from all components of the prompt\ngenerally constrains the diversity of the output, with the Memory block\nexerting the most significant influence. APP is compatible with established\ntechniques like temperature sampling and top-p sampling, providing a versatile\ntool for diversity management. To address the trade-offs of increased\ndiversity, such as inconsistencies with omitted information, we incorporate a\npost-generation correction step, which effectively balances diversity\nenhancement with output consistency. Additionally, we examine how prompt\nstructure, including component order and length, impacts diversity. 
This study\naddresses key questions surrounding diversity in multi-agent world simulation,\noffering insights into its control, influencing factors, and associated\ntrade-offs. Our contributions lay the foundation for systematically engineering\ndiversity in LLM-based multi-agent collaborations, advancing their\neffectiveness in real-world applications.\n","authors":["KuanChao Chu","Yi-Pei Chen","Hideki Nakayama"],"pdf_url":"https://arxiv.org/pdf/2412.21102v1.pdf","comment":"Accepted for the AAAI 2025 Workshop on Advancing LLM-Based\n Multi-Agent Collaboration"},{"id":"http://arxiv.org/abs/2412.15264v2","updated":"2024-12-30T16:56:25Z","published":"2024-12-17T02:07:33Z","title":"ReXTrust: A Model for Fine-Grained Hallucination Detection in\n AI-Generated Radiology Reports","summary":" The increasing adoption of AI-generated radiology reports necessitates robust\nmethods for detecting hallucinations--false or unfounded statements that could\nimpact patient care. We present ReXTrust, a novel framework for fine-grained\nhallucination detection in AI-generated radiology reports. Our approach\nleverages sequences of hidden states from large vision-language models to\nproduce finding-level hallucination risk scores. We evaluate ReXTrust on a\nsubset of the MIMIC-CXR dataset and demonstrate superior performance compared\nto existing approaches, achieving an AUROC of 0.8751 across all findings and\n0.8963 on clinically significant findings. 
Our results show that white-box\napproaches leveraging model hidden states can provide reliable hallucination\ndetection for medical AI systems, potentially improving the safety and\nreliability of automated radiology reporting.\n","authors":["Romain Hardy","Sung Eun Kim","Pranav Rajpurkar"],"pdf_url":"https://arxiv.org/pdf/2412.15264v2.pdf","comment":"Accepted to AIMedHealth 10 pages, 5 figures"},{"id":"http://arxiv.org/abs/2412.09807v2","updated":"2024-12-30T16:45:50Z","published":"2024-12-13T02:48:36Z","title":"LLM Distillation for Efficient Few-Shot Multiple Choice Question\n Answering","summary":" Multiple Choice Question Answering (MCQA) is an important problem with\nnumerous real-world applications, such as medicine, law, and education. The\nhigh cost of building MCQA datasets makes few-shot learning pivotal in this\ndomain. While Large Language Models (LLMs) can enable few-shot learning, their\ndirect application in real-world scenarios is often hindered by their high\ncomputational cost. To address this challenge, we propose a simple yet\neffective approach that uses LLMs for data generation and scoring. Our approach\nutilizes LLMs to create MCQA data which contains questions and choices, and to\nassign probability scores to the generated choices. We then use the generated\ndata and LLM-assigned scores to finetune a smaller and more efficient\nencoder-only model, DeBERTa-v3-base by leveraging distillation loss. Extensive\nexperiments on the Massive Multitask Language Understanding (MMLU) benchmark\ndemonstrate that our method improves accuracy from 28.9% to 39.3%, representing\na gain of over 10% compared to a baseline finetuned directly on 5-shot\nexamples. 
This shows the effectiveness of LLM-driven data generation and\nknowledge distillation for few-shot MCQA.\n","authors":["Patrick Sutanto","Joan Santoso","Esther Irawati Setiawan","Aji Prasetya Wibawa"],"pdf_url":"https://arxiv.org/pdf/2412.09807v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.21065v1","updated":"2024-12-30T16:34:11Z","published":"2024-12-30T16:34:11Z","title":"Efficient Multi-Task Inferencing with a Shared Backbone and Lightweight\n Task-Specific Adapters for Automatic Scoring","summary":" The integration of Artificial Intelligence (AI) in education requires\nscalable and efficient frameworks that balance performance, adaptability, and\ncost. This paper addresses these needs by proposing a shared backbone model\narchitecture enhanced with lightweight LoRA adapters for task-specific\nfine-tuning, targeting the automated scoring of student responses across 27\nmutually exclusive tasks. By achieving competitive performance (average QWK of\n0.848 compared to 0.888 for fully fine-tuned models) while reducing GPU memory\nconsumption by 60% and inference latency by 40%, the framework demonstrates\nsignificant efficiency gains. This approach aligns with the workshops' focus on\nimproving language models for educational tasks, creating responsible\ninnovations for cost-sensitive deployment, and supporting educators by\nstreamlining assessment workflows. 
The findings underscore the potential of\nscalable AI to enhance learning outcomes while maintaining fairness and\ntransparency in automated scoring systems.\n","authors":["Ehsan Latif","Xiaoming Zhai"],"pdf_url":"https://arxiv.org/pdf/2412.21065v1.pdf","comment":"Accepted by AAAI-iRAISE Workshop"},{"id":"http://arxiv.org/abs/2412.17498v2","updated":"2024-12-30T16:29:36Z","published":"2024-12-23T11:55:33Z","title":"DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought","summary":" Recently, O1-like models have emerged as representative examples,\nillustrating the effectiveness of long chain-of-thought (CoT) in reasoning\ntasks such as math and coding tasks. In this paper, we introduce DRT-o1, an\nattempt to bring the success of long CoT to neural machine translation (MT).\nSpecifically, in view of the literature books that might involve similes and\nmetaphors, translating these texts to a target language is very difficult in\npractice due to cultural differences. In such cases, literal translation often\nfails to convey the intended meaning effectively. Even for professional human\ntranslators, considerable thought must be given to preserving semantics\nthroughout the translation process. To simulate LLMs' long thought ability in\nMT, we first mine sentences containing similes or metaphors from existing\nliterature books, and then develop a multi-agent framework to translate these\nsentences via long thought. In the multi-agent framework, a translator is used\nto iteratively translate the source sentence under the suggestions provided by\nan advisor. To ensure the effectiveness of the long thoughts, an evaluator is\nalso employed to quantify the translation in each round. In this way, we\ncollect tens of thousands of long-thought MT data, which is used to train our\nDRT-o1. 
Using Qwen2.5 and Llama-3.1 as the backbones, DRT-o1 models can learn\nthe thought process during machine translation, and outperform vanilla LLMs as\nwell as existing O1-like LLMs, showing their effectiveness. The project is\navailable at https://github.com/krystalan/DRT-o1\n","authors":["Jiaan Wang","Fandong Meng","Yunlong Liang","Jie Zhou"],"pdf_url":"https://arxiv.org/pdf/2412.17498v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.21037v1","updated":"2024-12-30T16:02:44Z","published":"2024-12-30T16:02:44Z","title":"TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow\n Matching and Clap-Ranked Preference Optimization","summary":" We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model\nwith 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio\nin just 3.7 seconds on a single A40 GPU. A key challenge in aligning TTA models\nlies in the difficulty of creating preference pairs, as TTA lacks structured\nmechanisms like verifiable rewards or gold-standard answers available for Large\nLanguage Models (LLMs). To address this, we propose CLAP-Ranked Preference\nOptimization (CRPO), a novel framework that iteratively generates and optimizes\npreference data to enhance TTA alignment. We demonstrate that the audio\npreference dataset generated using CRPO outperforms existing alternatives. With\nthis framework, TangoFlux achieves state-of-the-art performance across both\nobjective and subjective benchmarks. 
We open source all code and models to\nsupport further research in TTA generation.\n","authors":["Chia-Yu Hung","Navonil Majumder","Zhifeng Kong","Ambuj Mehrish","Rafael Valle","Bryan Catanzaro","Soujanya Poria"],"pdf_url":"https://arxiv.org/pdf/2412.21037v1.pdf","comment":"https://tangoflux.github.io/"},{"id":"http://arxiv.org/abs/2412.21036v1","updated":"2024-12-30T16:01:43Z","published":"2024-12-30T16:01:43Z","title":"GePBench: Evaluating Fundamental Geometric Perception for Multimodal\n Large Language Models","summary":" Multimodal large language models (MLLMs) have achieved significant\nadvancements in integrating visual and linguistic understanding. While existing\nbenchmarks evaluate these models in context-rich, real-life scenarios, they\noften overlook fundamental perceptual skills essential for environments\ndeviating from everyday realism. In particular, geometric perception, the\nability to interpret spatial relationships and abstract visual patterns,\nremains underexplored. To address this limitation, we introduce GePBench, a\nnovel benchmark designed to assess the geometric perception capabilities of\nMLLMs. Results from extensive evaluations reveal that current state-of-the-art\nMLLMs exhibit significant deficiencies in such tasks. 
Additionally, we\ndemonstrate that models trained with data sourced from GePBench show notable\nimprovements on a wide range of downstream tasks, underscoring the importance\nof geometric perception as a foundation for advanced multimodal applications.\nOur code and datasets will be publicly available.\n","authors":["Shangyu Xing","Changhao Xiang","Yuteng Han","Yifan Yue","Zhen Wu","Xinyu Liu","Zhangtai Wu","Fei Zhao","Xinyu Dai"],"pdf_url":"https://arxiv.org/pdf/2412.21036v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.15639v3","updated":"2024-12-30T15:59:18Z","published":"2024-04-24T04:25:04Z","title":"CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models\n of Code","summary":" Large Language Models (LLMs) have achieved remarkable progress in code\ngeneration. It now becomes crucial to identify whether the code is AI-generated\nand to determine the specific model used, particularly for purposes such as\nprotecting Intellectual Property (IP) in industry and preventing cheating in\nprogramming exercises. To this end, several attempts have been made to insert\nwatermarks into machine-generated code. However, existing approaches are\nlimited to inserting only a single bit of information. In this paper, we\nintroduce CodeIP, a novel multi-bit watermarking technique that inserts\nadditional information to preserve crucial provenance details, such as the\nvendor ID of an LLM, thereby safeguarding the IPs of LLMs in code generation.\nFurthermore, to ensure the syntactical correctness of the generated code, we\npropose constraining the sampling process for predicting the next token by\ntraining a type predictor. 
Experiments conducted on a real-world dataset across\nfive programming languages demonstrate the effectiveness of CodeIP in\nwatermarking LLMs for code generation while maintaining the syntactical\ncorrectness of code.\n","authors":["Batu Guan","Yao Wan","Zhangqian Bi","Zheng Wang","Hongyu Zhang","Pan Zhou","Lichao Sun"],"pdf_url":"https://arxiv.org/pdf/2404.15639v3.pdf","comment":"16 pages, 13 figures"},{"id":"http://arxiv.org/abs/2412.21033v1","updated":"2024-12-30T15:58:41Z","published":"2024-12-30T15:58:41Z","title":"Plancraft: an evaluation dataset for planning with LLM agents","summary":" We present Plancraft, a multi-modal evaluation dataset for LLM agents.\nPlancraft has both a text-only and multi-modal interface, based on the\nMinecraft crafting GUI. We include the Minecraft Wiki to evaluate tool use and\nRetrieval Augmented Generation (RAG), as well as an oracle planner and oracle\nRAG information extractor, to ablate the different components of a modern agent\narchitecture. To evaluate decision-making, Plancraft also includes a subset of\nexamples that are intentionally unsolvable, providing a realistic challenge\nthat requires the agent not only to complete tasks but also to decide whether\nthey are solvable at all. We benchmark both open-source and closed-source LLMs\nand strategies on our task and compare their performance to a handcrafted\nplanner. 
We find that LLMs and VLMs struggle with the planning problems that\nPlancraft introduces, and we offer suggestions on how to improve their\ncapabilities.\n","authors":["Gautier Dagan","Frank Keller","Alex Lascarides"],"pdf_url":"https://arxiv.org/pdf/2412.21033v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.21015v1","updated":"2024-12-30T15:33:19Z","published":"2024-12-30T15:33:19Z","title":"MapQaTor: A System for Efficient Annotation of Map Query Datasets","summary":" Mapping and navigation services like Google Maps, Apple Maps, and\nOpenStreetMap are essential for accessing various location-based data, yet they often\nstruggle to handle natural language geospatial queries. Recent advancements in\nLarge Language Models (LLMs) show promise in question answering (QA), but\ncreating reliable geospatial QA datasets from map services remains challenging.\nWe introduce MapQaTor, a web application that streamlines the creation of\nreproducible, traceable map-based QA datasets. With its plug-and-play\narchitecture, MapQaTor enables seamless integration with any maps API, allowing\nusers to gather and visualize data from diverse sources with minimal setup. By\ncaching API responses, the platform ensures consistent ground truth, enhancing\nthe reliability of the data even as real-world information evolves. MapQaTor\ncentralizes data retrieval, annotation, and visualization within a single\nplatform, offering a unique opportunity to evaluate the current state of\nLLM-based geospatial reasoning while advancing their capabilities for improved\ngeospatial understanding. Evaluation metrics show that MapQaTor speeds up the\nannotation process by at least 30 times compared to manual methods,\nunderscoring its potential for developing geospatial resources, such as complex\nmap reasoning datasets. 
The website is live at: https://mapqator.github.io/ and\na demo video is available at: https://youtu.be/7_aV9Wmhs6Q.\n","authors":["Mahir Labib Dihan","Mohammed Eunus Ali","Md Rizwan Parvez"],"pdf_url":"https://arxiv.org/pdf/2412.21015v1.pdf","comment":"13 pages, 35 figures"},{"id":"http://arxiv.org/abs/2412.21006v1","updated":"2024-12-30T15:15:08Z","published":"2024-12-30T15:15:08Z","title":"Verbosity-Aware Rationale Reduction: Effective Reduction of Redundant\n Rationale via Principled Criteria","summary":" Large Language Models (LLMs) rely on generating extensive intermediate\nreasoning units (e.g., tokens, sentences) to enhance final answer quality\nacross a wide range of complex tasks. While generating multiple reasoning paths\nor iteratively refining rationales proves effective for improving performance,\nthese approaches inevitably result in significantly higher inference costs. In\nthis work, we propose a novel sentence-level rationale reduction training\nframework that leverages a likelihood-based criterion, verbosity, to identify and\nremove redundant reasoning sentences. Unlike previous approaches that utilize\ntoken-level reduction, our sentence-level reduction framework maintains model\nperformance while reducing generation length. 
This preserves the original\nreasoning abilities of LLMs and achieves an average 17.15% reduction in\ngeneration costs across various models and tasks.\n","authors":["Joonwon Jang","Jaehee Kim","Wonbin Kweon","Hwanjo Yu"],"pdf_url":"https://arxiv.org/pdf/2412.21006v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.08603v3","updated":"2024-12-30T15:08:23Z","published":"2024-05-14T13:42:05Z","title":"A Comprehensive Survey of Large Language Models and Multimodal Large\n Language Models in Medicine","summary":" Since the release of ChatGPT and GPT-4, large language models (LLMs) and\nmultimodal large language models (MLLMs) have attracted widespread attention\nfor their exceptional capabilities in understanding, reasoning, and generation,\nintroducing transformative paradigms for integrating artificial intelligence\ninto medicine. This survey provides a comprehensive overview of the\ndevelopment, principles, application scenarios, challenges, and future\ndirections of LLMs and MLLMs in medicine. Specifically, it begins by examining\nthe paradigm shift, tracing the transition from traditional models to LLMs and\nMLLMs, and highlighting the unique advantages of these LLMs and MLLMs in\nmedical applications. Next, the survey reviews existing medical LLMs and MLLMs,\nproviding detailed guidance on their construction and evaluation in a clear and\nsystematic manner. Subsequently, to underscore the substantial value of LLMs\nand MLLMs in healthcare, the survey explores five promising applications in the\nfield. Finally, the survey addresses the challenges confronting medical LLMs\nand MLLMs and proposes practical strategies and future directions for their\nintegration into medicine. 
In summary, this survey offers a comprehensive\nanalysis of the technical methodologies and practical clinical applications of\nmedical LLMs and MLLMs, with the goal of bridging the gap between these\nadvanced technologies and clinical practice, thereby fostering the evolution of\nthe next generation of intelligent healthcare systems.\n","authors":["Hanguang Xiao","Feizhong Zhou","Xingyue Liu","Tianqi Liu","Zhipeng Li","Xin Liu","Xiaoxuan Huang"],"pdf_url":"https://arxiv.org/pdf/2405.08603v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20996v1","updated":"2024-12-30T15:01:48Z","published":"2024-12-30T15:01:48Z","title":"Plug-and-Play Training Framework for Preference Optimization","summary":" Recently, preference optimization methods such as DPO have significantly\nenhanced large language models (LLMs) in wide tasks including dialogue and\nquestion-answering. However, current methods fail to account for the varying\ndifficulty levels of training samples during preference optimization, leading\nto mediocre performance in tasks with high accuracy requirements, particularly\nin mathematical reasoning. To address this limitation, we propose a novel\ntraining framework, which employs multiple sampling to analyze output\ndistributions, assign different weights to samples, and incorporate these\nweights into the preference optimization process. This plug-and-play approach\nenables LLMs to prioritize challenging examples during training, improving\nlearning efficiency. 
Experimental results demonstrate that our framework\nintegrates seamlessly with various preference optimization methods and achieves\nconsistent improvements in mathematical reasoning tasks.\n","authors":["Jingyuan Ma","Rui Li","Zheng Li","Lei Sha","Zhifang Sui"],"pdf_url":"https://arxiv.org/pdf/2412.20996v1.pdf","comment":"12 pages, 9 figures"},{"id":"http://arxiv.org/abs/2412.20995v1","updated":"2024-12-30T14:58:46Z","published":"2024-12-30T14:58:46Z","title":"KARPA: A Training-free Method of Adapting Knowledge Graph as References\n for Large Language Model's Reasoning Path Aggregation","summary":" Large language models (LLMs) demonstrate exceptional performance across a\nvariety of tasks, yet they are often affected by hallucinations and the\ntimeliness of knowledge. Leveraging knowledge graphs (KGs) as external\nknowledge sources has emerged as a viable solution, but existing methods for\nLLM-based knowledge graph question answering (KGQA) are often limited by\nstep-by-step decision-making on KGs, restricting the global planning and\nreasoning capabilities of LLMs, or they require fine-tuning or pre-training on\nspecific KGs. To address these challenges, we propose Knowledge graph Assisted\nReasoning Path Aggregation (KARPA), a novel framework that harnesses the global\nplanning abilities of LLMs for efficient and accurate KG reasoning. KARPA\noperates in three steps: pre-planning relation paths using the LLM's global\nplanning capabilities, matching semantically relevant paths via an embedding\nmodel, and reasoning over these paths to generate answers. Unlike existing KGQA\nmethods, KARPA avoids stepwise traversal, requires no additional training, and\nis adaptable to various LLM architectures. Extensive experimental results show\nthat KARPA achieves state-of-the-art performance in KGQA tasks, delivering both\nhigh efficiency and accuracy. 
Our code will be available on Github.\n","authors":["Siyuan Fang","Kaijing Ma","Tianyu Zheng","Xinrun Du","Ningxuan Lu","Ge Zhang","Qingkun Tang"],"pdf_url":"https://arxiv.org/pdf/2412.20995v1.pdf","comment":"23 pages, 6 figures"},{"id":"http://arxiv.org/abs/2412.20993v1","updated":"2024-12-30T14:57:53Z","published":"2024-12-30T14:57:53Z","title":"Efficiently Serving LLM Reasoning Programs with Certaindex","summary":" The rapid evolution of large language models (LLMs) has unlocked their\ncapabilities in advanced reasoning tasks like mathematical problem-solving,\ncode generation, and legal analysis. Central to this progress are\ninference-time reasoning algorithms, which refine outputs by exploring multiple\nsolution paths, at the cost of increasing compute demands and response\nlatencies. Existing serving systems fail to adapt to the scaling behaviors of\nthese algorithms or the varying difficulty of queries, leading to inefficient\nresource use and unmet latency targets.\n We present Dynasor, a system that optimizes inference-time compute for LLM\nreasoning queries. Unlike traditional engines, Dynasor tracks and schedules\nrequests within reasoning queries and uses Certaindex, a proxy that measures\nstatistical reasoning progress based on model certainty, to guide compute\nallocation dynamically. Dynasor co-adapts scheduling with reasoning progress:\nit allocates more compute to hard queries, reduces compute for simpler ones,\nand terminates unpromising queries early, balancing accuracy, latency, and\ncost. 
On diverse datasets and algorithms, Dynasor reduces compute by up to 50%\nin batch processing while sustaining 3.3x higher query rates or meeting 4.7x\ntighter latency SLOs in online serving.\n","authors":["Yichao Fu","Junda Chen","Siqi Zhu","Zheyu Fu","Zhongdongming Dai","Aurick Qiao","Hao Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.20993v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.12094v3","updated":"2024-12-30T14:54:29Z","published":"2024-12-16T18:58:57Z","title":"SepLLM: Accelerate Large Language Models by Compressing One Segment into\n One Separator","summary":" Large Language Models (LLMs) have exhibited exceptional performance across a\nspectrum of natural language processing tasks. However, their substantial sizes\npose considerable challenges, particularly in computational demands and\ninference speed, due to their quadratic complexity. In this work, we have\nidentified a key pattern: certain seemingly meaningless special tokens (i.e.,\nseparators) contribute disproportionately to attention scores compared to\nsemantically meaningful tokens. This observation suggests that information of\nthe segments between these separator tokens can be effectively condensed into\nthe separator tokens themselves without significant information loss. Guided by\nthis insight, we introduce SepLLM, a plug-and-play framework that accelerates\ninference by compressing these segments and eliminating redundant tokens.\nAdditionally, we implement efficient kernels for training acceleration.\nExperimental results across training-free, training-from-scratch, and\npost-training settings demonstrate SepLLM's effectiveness. Notably, using the\nLlama-3-8B backbone, SepLLM achieves over 50% reduction in KV cache on the\nGSM8K-CoT benchmark while maintaining comparable performance. 
Furthermore, in\nstreaming settings, SepLLM effectively processes sequences of up to 4 million\ntokens or more while maintaining consistent language modeling capabilities.\n","authors":["Guoxuan Chen","Han Shi","Jiawei Li","Yihang Gao","Xiaozhe Ren","Yimeng Chen","Xin Jiang","Zhenguo Li","Weiyang Liu","Chao Huang"],"pdf_url":"https://arxiv.org/pdf/2412.12094v3.pdf","comment":"We have made our code publicly available at sepllm.github.io. Our\n codebase supports efficient multi-node distributed training with accelerated\n attention module Sep-Attention and also supports numerous existing Fusion\n Operators to accelerate the training process, such as fused rope, etc. If you\n find our code helpful, please kindly consider giving us a **star** on\n GitHub^_^. Thank you very much!"},{"id":"http://arxiv.org/abs/2409.05840v4","updated":"2024-12-30T14:08:49Z","published":"2024-09-09T17:44:00Z","title":"MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct","summary":" The development of Multimodal Large Language Models (MLLMs) has seen\nsignificant advancements with increasing demands in various fields (e.g.,\nmultimodal agents, embodied intelligence). While model-driven approaches\nattempt to enhance MLLMs' capabilities through diverse architectures, the gains\nhave become increasingly marginal. Conversely, data-driven methods, which scale\nup image-text instruction data, are more effective but face limited data\ndiversity and complexity challenges. The absence of high-quality data\nconstitutes a significant development barrier for MLLMs. To address the data\nquality bottleneck, we propose MMEvol, a novel multimodal instruction data\nevolution framework. This framework iteratively improves data quality through a\nrefined combination of fine-grained perception, cognitive reasoning, and\ninteraction evolution, generating a more complex and diverse image-text\ninstruction dataset that empowers MLLMs with enhanced capabilities. 
Beginning\nwith an initial set of instructions, SEED-163K, we utilize MMEvol to\nsystematically broaden the diversity of instruction types, extend visual\nreasoning steps to improve cognitive reasoning abilities, and thoroughly\nexplore fine-grained information within images to enhance visual understanding\nand robustness. To comprehensively evaluate the effectiveness of our approach,\nwe conduct extensive qualitative analysis and quantitative experiments across\n13 vision-language tasks. Compared to baseline models trained with the initial\nseed data, the results demonstrate that our method achieves an average accuracy\nimprovement of 3.1 percentage points. Furthermore, our approach reaches\nstate-of-the-art (SOTA) performance in nine tasks using significantly less data\ncompared to state-of-the-art models.\n","authors":["Run Luo","Haonan Zhang","Longze Chen","Ting-En Lin","Xiong Liu","Yuchuan Wu","Min Yang","Minzheng Wang","Pengpeng Zeng","Lianli Gao","Heng Tao Shen","Yunshui Li","Xiaobo Xia","Fei Huang","Jingkuan Song","Yongbin Li"],"pdf_url":"https://arxiv.org/pdf/2409.05840v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.07099v3","updated":"2024-12-30T13:43:46Z","published":"2024-06-18T07:46:13Z","title":"Nash CoT: Multi-Path Inference with Preference Equilibrium","summary":" Chain of thought (CoT) is a reasoning framework that can enhance the\nperformance of Large Language Models (LLMs) on complex inference tasks. In\nparticular, among various studies related to CoT, multi-path inference stands\nout as a simple yet effective improvement. However, there is no optimal setting\nfor the number of inference paths. Therefore, we have to increase the number of\ninference paths to obtain better results, which in turn increases the inference\ncost. 
To address this limitation, we can utilize question-related role\ntemplates to guide LLMs into relevant roles, thereby increasing the possibility\nof correct inferences for each path and further reducing dependence on the\nnumber of inference paths while improving reasoning accuracy. However, placing\nLLMs into specific roles may reduce their reasoning diversity and performance\non a few tasks where role dependence is low. To alleviate the excessive\nimmersion of the LLM into a specific role, we propose Nash CoT by constructing\na game system on each path that balances the role-specific LLM's generation\nagainst the general LLM's generation, thereby ensuring both effective role\nadoption and diversity in LLM generation, thus maintaining the performance of\nmulti-path inference while reducing the required number of inference\npaths. We evaluate Nash CoT across various inference tasks, including Arabic\nReasoning, Commonsense Question Answering, and Symbolic Inference, achieving\nresults that are comparable to or better than those of multi-path CoT with an\nequal number of inference paths.\n","authors":["Ziqi Zhang","Cunxiang Wang","Xiong Xiao","Yue Zhang","Donglin Wang"],"pdf_url":"https://arxiv.org/pdf/2407.07099v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.12147v2","updated":"2024-12-30T13:41:59Z","published":"2024-11-19T00:50:06Z","title":"JuniperLiu at CoMeDi Shared Task: Models as Annotators in Lexical\n Semantics Disagreements","summary":" We present the results of our system for the CoMeDi Shared Task, which\npredicts majority votes (Subtask 1) and annotator disagreements (Subtask 2).\nOur approach combines model ensemble strategies with MLP-based and\nthreshold-based methods trained on pretrained language models. 
Treating\nindividual models as virtual annotators, we simulate the annotation process by\ndesigning aggregation measures that incorporate continuous relatedness scores\nand discrete classification labels to capture both majority and disagreement.\nAdditionally, we employ anisotropy removal techniques to enhance performance.\nExperimental results demonstrate the effectiveness of our methods, particularly\nfor Subtask 2. Notably, we find that the standard deviation of continuous\nrelatedness scores across different model manipulations correlates more\nstrongly with human disagreement annotations than metrics on aggregated discrete labels. The\ncode will be published at https://github.com/RyanLiut/CoMeDi_Solution.\n","authors":["Zhu Liu","Zhen Hu","Ying Liu"],"pdf_url":"https://arxiv.org/pdf/2411.12147v2.pdf","comment":"accepted by CoMeDi workshop in Coling 2025"},{"id":"http://arxiv.org/abs/2308.08747v4","updated":"2024-12-30T12:32:49Z","published":"2023-08-17T02:53:23Z","title":"An Empirical Study of Catastrophic Forgetting in Large Language Models\n During Continual Fine-tuning","summary":" Catastrophic forgetting (CF) is a phenomenon that occurs in machine learning\nwhen a model forgets previously learned information while acquiring new\nknowledge to achieve satisfactory performance in downstream tasks. As\nlarge language models (LLMs) have demonstrated remarkable performance, it is\nintriguing to investigate whether CF exists during the continual instruction\ntuning of LLMs. This study empirically evaluates the forgetting phenomenon in\nLLMs' knowledge during continual instruction tuning from the perspectives of\ndomain knowledge, reasoning, and reading comprehension. The experiments reveal\nthat catastrophic forgetting is generally observed in LLMs ranging from 1b to\n7b parameters. 
Surprisingly, as the model scale increases, the severity of\nforgetting intensifies within this model scale range, which may result from the\nhigher initial performance of the larger LLMs. Comparing the\ndecoder-only model BLOOMZ with the encoder-decoder model mT0, BLOOMZ exhibits\nless forgetting and retains more knowledge. Interestingly, we also observe that\nLLMs can mitigate language biases, such as gender bias, during continual\nfine-tuning. Furthermore, our findings indicate that general instruction tuning\ncan help alleviate the forgetting phenomenon in LLMs during subsequent\nfine-tuning.\n","authors":["Yun Luo","Zhen Yang","Fandong Meng","Yafu Li","Jie Zhou","Yue Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.08747v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20891v1","updated":"2024-12-30T12:00:47Z","published":"2024-12-30T12:00:47Z","title":"DoTA: Weight-Decomposed Tensor Adaptation for Large Language Models","summary":" Low-rank adaptation (LoRA) reduces the computational and memory demands of\nfine-tuning large language models (LLMs) by approximating updates with low-rank\nmatrices. However, low-rank approximation in two-dimensional space fails to\ncapture high-dimensional structures within the target matrix. Recently, tensor\ndecomposition methods have been explored for fine-tuning LLMs, leveraging their\nability to extract structured information. Yet, these approaches primarily rely\non random initialization, and the impact of initialization on tensor adaptation\nremains underexplored. In this paper, we reveal that the validation loss under\nrandom initialization diverges significantly from that achieved by full fine-tuning.\nTo address this, we propose Weight-Decomposed Tensor Adaptation (DoTA), which\nleverages the Matrix Product Operator (MPO) decomposition of pre-trained\nweights for effective initialization in fine-tuning LLMs. 
Additionally, we\nintroduce QDoTA, a quantized version of DoTA designed for 4-bit quantization.\nExperiments on commonsense and arithmetic reasoning tasks show that DoTA\noutperforms random initialization methods with fewer parameters. QDoTA further\nreduces memory consumption and achieves comparable performance to DoTA on\ncommonsense reasoning tasks. We will release our code to support future\nresearch.\n","authors":["Xiaolin Hu","Xiang Cheng","Peiyu Liu","Wei Liu","Jian Luan","Bin Wang","Yong Liu"],"pdf_url":"https://arxiv.org/pdf/2412.20891v1.pdf","comment":"12 pages, 6 figures"},{"id":"http://arxiv.org/abs/2412.18552v2","updated":"2024-12-30T11:24:32Z","published":"2024-12-24T17:05:26Z","title":"Distilling Fine-grained Sentiment Understanding from Large Language\n Models","summary":" Fine-grained sentiment analysis (FSA) aims to extract and summarize user\nopinions from vast opinionated text. Recent studies demonstrate that large\nlanguage models (LLMs) possess exceptional sentiment understanding\ncapabilities. However, directly deploying LLMs for FSA applications incurs high\ninference costs. Therefore, this paper investigates the distillation of\nfine-grained sentiment understanding from LLMs into small language models\n(SLMs). We prompt LLMs to examine and interpret the sentiments of given reviews\nand then utilize the generated content to pretrain SLMs. Additionally, we\ndevelop a comprehensive FSA benchmark to evaluate both SLMs and LLMs. Extensive\nexperiments on this benchmark reveal that: (1) distillation significantly\nenhances the performance of SLMs in FSA tasks, achieving a 6.00\\% improvement\nin $F_1$-score, and the distilled model can outperform Llama-2-7b with only\n220M parameters; (2) distillation equips SLMs with excellent zero-shot\nsentiment classification capabilities, enabling them to match or even exceed\ntheir teacher models. These results suggest that distillation from LLMs is a\nhighly promising direction for FSA. 
We will release our code, data, and\npretrained model weights at https://github.com/HITSZ-HLT/FSA-Distillation.\n","authors":["Yice Zhang","Guangyu Xie","Hongling Xu","Kaiheng Hou","Jianzhu Bao","Qianlong Wang","Shiwei Chen","Ruifeng Xu"],"pdf_url":"https://arxiv.org/pdf/2412.18552v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20864v1","updated":"2024-12-30T11:07:05Z","published":"2024-12-30T11:07:05Z","title":"Enhancing Annotated Bibliography Generation with LLM Ensembles","summary":" This work proposes a novel approach to enhancing annotated bibliography\ngeneration through Large Language Model (LLM) ensembles. In particular,\nmultiple LLMs in different roles -- controllable text generation, evaluation,\nand summarization -- are introduced and validated using a systematic\nmethodology to enhance model performance in scholarly tasks. Output diversity\namong the ensemble that generates text is obtained using different LLM\nparameters, followed by an LLM acting as a judge to assess relevance, accuracy,\nand coherence. Responses selected by several combining strategies are then\nmerged and refined through summarization and redundancy removal techniques. The\npreliminary experimental validation demonstrates that the combined outputs from\nthe LLM ensemble improve coherence and relevance compared to individual\nresponses, leading to a 38% improvement in annotation quality and a 51%\nreduction in content redundancy, thus highlighting the potential for automating\ncomplex scholarly tasks while maintaining high-quality standards.\n","authors":["Sergio Bermejo"],"pdf_url":"https://arxiv.org/pdf/2412.20864v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.06691v3","updated":"2024-12-30T11:05:20Z","published":"2024-09-10T17:54:28Z","title":"Geometric-Averaged Preference Optimization for Soft Preference Labels","summary":" Many algorithms for aligning LLMs with human preferences assume that human\npreferences are binary and deterministic. 
However, human preferences can vary\nacross individuals, and therefore should be represented distributionally. In\nthis work, we introduce the distributional soft preference labels and improve\nDirect Preference Optimization (DPO) with a weighted geometric average of the\nLLM output likelihood in the loss function. This approach adjusts the scale of\nlearning loss based on the soft labels such that the loss would approach zero\nwhen the responses are closer to equally preferred. This simple modification\ncan be easily applied to any DPO-based methods and mitigate over-optimization\nand objective mismatch, which prior works suffer from. Our experiments simulate\nthe soft preference labels with AI feedback from LLMs and demonstrate that\ngeometric averaging consistently improves performance on standard benchmarks\nfor alignment research. In particular, we observe more preferable responses\nthan binary labels and significant improvements where modestly-confident labels\nare in the majority.\n","authors":["Hiroki Furuta","Kuang-Huei Lee","Shixiang Shane Gu","Yutaka Matsuo","Aleksandra Faust","Heiga Zen","Izzeddin Gur"],"pdf_url":"https://arxiv.org/pdf/2409.06691v3.pdf","comment":"Accepted at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.19168v2","updated":"2024-12-30T10:44:39Z","published":"2024-12-26T10:58:40Z","title":"GFG -- Gender-Fair Generation: A CALAMITA Challenge","summary":" Gender-fair language aims at promoting gender equality by using terms and\nexpressions that include all identities and avoid reinforcing gender\nstereotypes. Implementing gender-fair strategies is particularly challenging in\nheavily gender-marked languages, such as Italian. To address this, the\nGender-Fair Generation challenge intends to help shift toward gender-fair\nlanguage in written communication. 
The challenge, designed to assess and\nmonitor the recognition and generation of gender-fair language in both mono-\nand cross-lingual scenarios, includes three tasks: (1) the detection of\ngendered expressions in Italian sentences, (2) the reformulation of gendered\nexpressions into gender-fair alternatives, and (3) the generation of\ngender-fair language in automatic translation from English to Italian. The\nchallenge relies on three different annotated datasets: the GFL-it corpus,\nwhich contains Italian texts extracted from administrative documents provided\nby the University of Brescia; GeNTE, a bilingual test set for gender-neutral\nrewriting and translation built upon a subset of the Europarl dataset; and\nNeo-GATE, a bilingual test set designed to assess the use of non-binary\nneomorphemes in Italian for both fair formulation and translation tasks.\nFinally, each task is evaluated with specific metrics: average of F1-score\nobtained by means of BERTScore computed on each entry of the datasets for task\n1, an accuracy measured with a gender-neutral classifier, and a\ncoverage-weighted accuracy for tasks 2 and 3.\n","authors":["Simona Frenda","Andrea Piergentili","Beatrice Savoldi","Marco Madeddu","Martina Rosola","Silvia Casola","Chiara Ferrando","Viviana Patti","Matteo Negri","Luisa Bentivogli"],"pdf_url":"https://arxiv.org/pdf/2412.19168v2.pdf","comment":"To refer to this paper please cite the CEUR-ws publication available\n at https://ceur-ws.org/Vol-3878/"},{"id":"http://arxiv.org/abs/2412.20846v1","updated":"2024-12-30T10:29:18Z","published":"2024-12-30T10:29:18Z","title":"Are LLMs Really Not Knowledgable? Mining the Submerged Knowledge in\n LLMs' Memory","summary":" Large language models (LLMs) have shown promise as potential knowledge bases,\nyet they often struggle with question-answering tasks and are prone to\nhallucinations. 
While previous research attributes these issues to knowledge\ngaps in the model's parameters, our investigation reveals a different\nphenomenon: LLMs often retain correct knowledge even when generating incorrect\nanswers. Through analysis of model's internal representations, we find that\ncorrect answers frequently appear among high-probability tokens despite not\nbeing selected as final outputs. Based on this observation, we introduce\nHits@k, a new metric to assess knowledge retention independent of expression\naccuracy. Our extensive experiments demonstrate that LLMs store significantly\nmore knowledge than their QA performance suggests. Building on these findings,\nwe develop SkipUnsure, a method to improve answer accuracy by leveraging\ndetected but unexpressed knowledge. Experiments on both open-domain and\nspecific-domain datasets show consistent improvements, with accuracy gains of\nup to 11.8% on DBPedia and 6.3% on IMDB, without requiring model retraining.\n","authors":["Xingjian Tao","Yiwei Wang","Yujun Cai","Zhicheng Yang","Jing Tang"],"pdf_url":"https://arxiv.org/pdf/2412.20846v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20834v1","updated":"2024-12-30T09:58:31Z","published":"2024-12-30T09:58:31Z","title":"Disentangling Preference Representation and Text Generation for\n Efficient Individual Preference Alignment","summary":" Aligning Large Language Models (LLMs) with general human preferences has been\nproved crucial in improving the interaction quality between LLMs and human.\nHowever, human values are inherently diverse among different individuals,\nmaking it insufficient to align LLMs solely with general preferences. To\naddress this, personalizing LLMs according to individual feedback emerges as a\npromising solution. Nonetheless, this approach presents challenges in terms of\nthe efficiency of alignment algorithms. In this work, we introduce a flexible\nparadigm for individual preference alignment. 
Our method fundamentally improves\nefficiency by disentangling preference representation from text generation in\nLLMs. We validate our approach across multiple text generation tasks and\ndemonstrate that it can produce aligned quality as well as or better than\nPEFT-based methods, while reducing additional training time for each new\nindividual preference by $80\\%$ to $90\\%$ in comparison with them.\n","authors":["Jianfei Zhang","Jun Bai","Bei Li","Yanmeng Wang","Rumei Li","Chenghua Lin","Wenge Rong"],"pdf_url":"https://arxiv.org/pdf/2412.20834v1.pdf","comment":"Coling 2025"},{"id":"http://arxiv.org/abs/2312.07395v2","updated":"2024-12-30T09:51:22Z","published":"2023-12-12T16:10:19Z","title":"A Simple Recipe for Contrastively Pre-training Video-First Encoders\n Beyond 16 Frames","summary":" Understanding long, real-world videos requires modeling of long-range visual\ndependencies. To this end, we explore video-first architectures, building on\nthe common paradigm of transferring large-scale, image--text models to video\nvia shallow temporal fusion. However, we expose two limitations to the\napproach: (1) decreased spatial capabilities, likely due to poor\nvideo--language alignment in standard video datasets, and (2) higher memory\nconsumption, bottlenecking the number of frames that can be processed. To\nmitigate the memory bottleneck, we systematically analyze the memory/accuracy\ntrade-off of various efficient methods: factorized attention,\nparameter-efficient image-to-video adaptation, input masking, and\nmulti-resolution patchification. Surprisingly, simply masking large portions of\nthe video (up to 75%) during contrastive pre-training proves to be one of the\nmost robust ways to scale encoders to videos up to 4.3 minutes at 1 FPS. 
Our\nsimple approach for training long video-to-text models, which scales to 1B\nparameters, does not add new architectural complexity and is able to outperform\nthe popular paradigm of using much larger LLMs as an information aggregator\nover segment-based information on benchmarks with long-range temporal\ndependencies (YouCook2, EgoSchema).\n","authors":["Pinelopi Papalampidi","Skanda Koppula","Shreya Pathak","Justin Chiu","Joe Heyward","Viorica Patraucean","Jiajun Shen","Antoine Miech","Andrew Zisserman","Aida Nematzadeh"],"pdf_url":"https://arxiv.org/pdf/2312.07395v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19792v2","updated":"2024-12-30T09:37:33Z","published":"2024-12-27T18:45:36Z","title":"InfAlign: Inference-aware language model alignment","summary":" Language model alignment has become a critical step in training modern\ngenerative language models. The goal of alignment is to finetune a reference\nmodel such that the win rate of a sample from the aligned model over a sample\nfrom the reference model is high, subject to a KL divergence constraint. Today,\nwe are increasingly using inference-time algorithms (e.g., Best-of-N,\ncontrolled decoding, tree search) to decode from language models rather than\nstandard sampling. However, the alignment objective does not capture such\ninference-time decoding procedures. We show that the existing alignment\nframework is sub-optimal in view of such inference-time methods. We then modify\nthe alignment objective and propose a framework for inference-aware alignment\n(IAPO). We prove that for any inference-time decoding algorithm, the optimal\nsolution that optimizes the inference-time win rate of the aligned policy\nagainst the reference policy is the solution to the typical RLHF problem with a\ntransformation of the reward. 
This motivates us to provide the KL-regularized\ncalibrate-and-transform RL (CTRL) algorithm to solve this problem, which\ninvolves a reward calibration step and a KL-regularized reward maximization\nstep with a transformation of the calibrated reward. We particularize our study\nto two important inference-time strategies: best-of-N sampling and best-of-N\njailbreaking, where N responses are sampled from the model and the one with the\nhighest or lowest reward is selected. We propose specific transformations for\nthese strategies and demonstrate that our framework offers significant\nimprovements over existing state-of-the-art methods for language model\nalignment. Empirically, we outperform baselines that are designed without\ntaking inference-time decoding into consideration by 8-12% and 4-9% on\ninference-time win rates over the Anthropic helpfulness and harmlessness dialog\nbenchmark datasets.\n","authors":["Ananth Balashankar","Ziteng Sun","Jonathan Berant","Jacob Eisenstein","Michael Collins","Adrian Hutter","Jong Lee","Chirag Nagpal","Flavien Prost","Aradhana Sinha","Ananda Theertha Suresh","Ahmad Beirami"],"pdf_url":"https://arxiv.org/pdf/2412.19792v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20821v1","updated":"2024-12-30T09:30:41Z","published":"2024-12-30T09:30:41Z","title":"Enhancing Multimodal Emotion Recognition through Multi-Granularity\n Cross-Modal Alignment","summary":" Multimodal emotion recognition (MER), leveraging speech and text, has emerged\nas a pivotal domain within human-computer interaction, demanding sophisticated\nmethods for effective multimodal integration. The challenge of aligning\nfeatures across these modalities is significant, with most existing approaches\nadopting a singular alignment strategy. Such a narrow focus not only limits\nmodel performance but also fails to address the complexity and ambiguity\ninherent in emotional expressions. 
In response, this paper introduces a\nMulti-Granularity Cross-Modal Alignment (MGCMA) framework, distinguished by its\ncomprehensive approach encompassing distribution-based, instance-based, and\ntoken-based alignment modules. This framework enables a multi-level perception\nof emotional information across modalities. Our experiments on IEMOCAP\ndemonstrate that our proposed method outperforms current state-of-the-art\ntechniques.\n","authors":["Xuechen Wang","Shiwan Zhao","Haoqin Sun","Hui Wang","Jiaming Zhou","Yong Qin"],"pdf_url":"https://arxiv.org/pdf/2412.20821v1.pdf","comment":"ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech\n and Signal Processing (ICASSP)"},{"id":"http://arxiv.org/abs/2412.10424v2","updated":"2024-12-30T09:11:50Z","published":"2024-12-10T15:00:32Z","title":"LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM\n Evaluation","summary":" We introduce LLM-as-an-Interviewer, a novel paradigm for evaluating large\nlanguage models (LLMs). This approach leverages multi-turn interactions where\nthe LLM interviewer actively provides feedback on responses and poses follow-up\nquestions to the evaluated LLM. At the start of the interview, the LLM\ninterviewer dynamically modifies datasets to generate initial questions,\nmitigating data contamination. We apply the LLM-as-an-Interviewer framework to\nevaluate six models on the MATH and DepthQA tasks. Our results show that the\nframework effectively provides insights into LLM performance, including the\nquality of initial responses, adaptability to feedback, and ability to address\nfollow-up queries like clarification or additional knowledge requests. 
The\nframework also addresses key limitations of conventional methods like\nLLM-as-a-Judge, including verbosity bias and inconsistency across runs.\nFinally, we propose the Interview Report, which aggregates insights from the\ninterview process, providing examples and a comprehensive analysis of the LLM's\nstrengths and weaknesses. This report offers a detailed snapshot of the model's\nreal-world applicability. The code for our framework is publicly available at\nhttps://github.com/interview-eval/.\n","authors":["Eunsu Kim","Juyoung Suk","Seungone Kim","Niklas Muennighoff","Dongkwan Kim","Alice Oh"],"pdf_url":"https://arxiv.org/pdf/2412.10424v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.12543v3","updated":"2024-12-30T07:57:10Z","published":"2024-10-16T13:21:46Z","title":"LLM-based Translation Inference with Iterative Bilingual Understanding","summary":" The remarkable understanding and generation capabilities of large language\nmodels (LLMs) have greatly improved translation performance. However, incorrect\nunderstanding of the sentence to be translated can degrade translation quality.\nTo address this issue, we proposed a novel Iterative Bilingual Understanding\nTranslation (IBUT) method based on the cross-lingual capabilities of LLMs and\nthe dual characteristics of translation tasks. The cross-lingual capability of\nLLMs enables the generation of contextual understanding for both the source and\ntarget languages separately. 
Furthermore, the dual characteristics allow IBUT\nto generate effective cross-lingual feedback, iteratively refining contextual\nunderstanding, thereby reducing errors and improving translation performance.\nExperimental results showed that the proposed IBUT outperforms several strong\ncomparison methods, especially being generalized to multiple domains (e.g.,\nnews, commonsense, and cultural translation benchmarks).\n","authors":["Andong Chen","Kehai Chen","Yang Xiang","Xuefeng Bai","Muyun Yang","Yang Feng","Tiejun Zhao","Min zhang"],"pdf_url":"https://arxiv.org/pdf/2410.12543v3.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2407.03963v2","updated":"2024-12-30T07:46:43Z","published":"2024-07-04T14:33:03Z","title":"LLM-jp: A Cross-organizational Project for the Research and Development\n of Fully Open Japanese LLMs","summary":" This paper introduces LLM-jp, a cross-organizational project for the research\nand development of Japanese large language models (LLMs). LLM-jp aims to\ndevelop open-source and strong Japanese LLMs, and as of this writing, more than\n1,500 participants from academia and industry are working together for this\npurpose. This paper presents the background of the establishment of LLM-jp,\nsummaries of its activities, and technical reports on the LLMs developed by\nLLM-jp. 
For the latest activities, visit https://llm-jp.nii.ac.jp/en/.\n","authors":[" LLM-jp"," :","Akiko Aizawa","Eiji Aramaki","Bowen Chen","Fei Cheng","Hiroyuki Deguchi","Rintaro Enomoto","Kazuki Fujii","Kensuke Fukumoto","Takuya Fukushima","Namgi Han","Yuto Harada","Chikara Hashimoto","Tatsuya Hiraoka","Shohei Hisada","Sosuke Hosokawa","Lu Jie","Keisuke Kamata","Teruhito Kanazawa","Hiroki Kanezashi","Hiroshi Kataoka","Satoru Katsumata","Daisuke Kawahara","Seiya Kawano","Atsushi Keyaki","Keisuke Kiryu","Hirokazu Kiyomaru","Takashi Kodama","Takahiro Kubo","Yohei Kuga","Ryoma Kumon","Shuhei Kurita","Sadao Kurohashi","Conglong Li","Taiki Maekawa","Hiroshi Matsuda","Yusuke Miyao","Kentaro Mizuki","Sakae Mizuki","Yugo Murawaki","Akim Mousterou","Ryo Nakamura","Taishi Nakamura","Kouta Nakayama","Tomoka Nakazato","Takuro Niitsuma","Jiro Nishitoba","Yusuke Oda","Hayato Ogawa","Takumi Okamoto","Naoaki Okazaki","Yohei Oseki","Shintaro Ozaki","Koki Ryu","Rafal Rzepka","Keisuke Sakaguchi","Shota Sasaki","Satoshi Sekine","Kohei Suda","Saku Sugawara","Issa Sugiura","Hiroaki Sugiyama","Hisami Suzuki","Jun Suzuki","Toyotaro Suzumura","Kensuke Tachibana","Yu Takagi","Kyosuke Takami","Koichi Takeda","Masashi Takeshita","Masahiro Tanaka","Kenjiro Taura","Arseny Tolmachev","Nobuhiro Ueda","Zhen Wan","Shuntaro Yada","Sakiko Yahata","Yuya Yamamoto","Yusuke Yamauchi","Hitomi Yanaka","Rio Yokota","Koichiro Yoshino"],"pdf_url":"https://arxiv.org/pdf/2407.03963v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08285v4","updated":"2024-12-30T07:44:37Z","published":"2024-12-11T11:00:33Z","title":"Adaptive Prompting for Continual Relation Extraction: A Within-Task\n Variance Perspective","summary":" To address catastrophic forgetting in Continual Relation Extraction (CRE),\nmany current approaches rely on memory buffers to rehearse previously learned\nknowledge while acquiring new tasks. 
Recently, prompt-based methods have\nemerged as potent alternatives to rehearsal-based strategies, demonstrating\nstrong empirical performance. However, upon analyzing existing prompt-based\napproaches for CRE, we identified several critical limitations, such as\ninaccurate prompt selection, inadequate mechanisms for mitigating forgetting in\nshared parameters, and suboptimal handling of cross-task and within-task\nvariances. To overcome these challenges, we draw inspiration from the\nrelationship between prefix-tuning and mixture of experts, proposing a novel\napproach that employs a prompt pool for each task, capturing variations within\neach task while enhancing cross-task variances. Furthermore, we incorporate a\ngenerative model to consolidate prior knowledge within shared parameters,\neliminating the need for explicit data storage. Extensive experiments validate\nthe efficacy of our approach, demonstrating superior performance over\nstate-of-the-art prompt-based and rehearsal-free methods in continual relation\nextraction.\n","authors":["Minh Le","Tien Ngoc Luu","An Nguyen The","Thanh-Thien Le","Trang Nguyen","Tung Thanh Nguyen","Linh Ngo Van","Thien Huu Nguyen"],"pdf_url":"https://arxiv.org/pdf/2412.08285v4.pdf","comment":"Accepted to AAAI 2025"},{"id":"http://arxiv.org/abs/2409.15911v3","updated":"2024-12-30T07:33:17Z","published":"2024-09-24T09:27:43Z","title":"A Modular-based Strategy for Mitigating Gradient Conflicts in\n Simultaneous Speech Translation","summary":" Simultaneous Speech Translation (SimulST) involves generating target language\ntext while continuously processing streaming speech input, presenting\nsignificant real-time challenges. Multi-task learning is often employed to\nenhance SimulST performance but introduces optimization conflicts between\nprimary and auxiliary tasks, potentially compromising overall efficiency. 
The\nexisting model-level conflict resolution methods are not well-suited for this\ntask which exacerbates inefficiencies and leads to high GPU memory consumption.\nTo address these challenges, we propose a Modular Gradient Conflict Mitigation\n(MGCM) strategy that detects conflicts at a finer-grained modular level and\nresolves them utilizing gradient projection. Experimental results demonstrate\nthat MGCM significantly improves SimulST performance, particularly under medium\nand high latency conditions, achieving a 0.68 BLEU score gain in offline tasks.\nAdditionally, MGCM reduces GPU memory consumption by over 95\\% compared to\nother conflict mitigation methods, establishing it as a robust solution for\nSimulST tasks.\n","authors":["Xiaoqian Liu","Yangfan Du","Jianjin Wang","Yuan Ge","Chen Xu","Tong Xiao","Guocheng Chen","Jingbo Zhu"],"pdf_url":"https://arxiv.org/pdf/2409.15911v3.pdf","comment":"Accepted to ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.15838v2","updated":"2024-12-30T07:27:58Z","published":"2024-12-20T12:27:16Z","title":"Align Anything: Training All-Modality Models to Follow Instructions with\n Language Feedback","summary":" Reinforcement learning from human feedback (RLHF) has proven effective in\nenhancing the instruction-following capabilities of large language models;\nhowever, it remains underexplored in the cross-modality domain. As the number\nof modalities increases, aligning all-modality models with human intentions --\nsuch as instruction following -- becomes a pressing challenge. In this work, we\nmake the first attempt to fine-tune all-modality models (i.e. input and output\nwith any modality, also named any-to-any models) using human preference data\nacross all modalities (including text, image, audio, and video), ensuring its\nbehavior aligns with human intentions. This endeavor presents several\nchallenges. 
First, there is no large-scale all-modality human preference data\nin existing open-source resources, as most datasets are limited to specific\nmodalities, predominantly text and image. Secondly, the effectiveness of binary\npreferences in RLHF for post-training alignment in complex all-modality\nscenarios remains an unexplored area. Finally, there is a lack of a systematic\nframework to evaluate the capabilities of all-modality models, particularly\nregarding modality selection and synergy. To address these challenges, we\npropose the align-anything framework, which includes meticulously annotated\n200k all-modality human preference data. Then, we introduce an alignment method\nthat learns from unified language feedback, effectively capturing complex\nmodality-specific human preferences and enhancing the model's\ninstruction-following capabilities. Furthermore, to assess performance\nimprovements in all-modality models after post-training alignment, we construct\na challenging all-modality capability evaluation framework -- eval-anything.\nAll data, models, and code frameworks have been open-sourced for the community.\nFor more details, please refer to\nhttps://github.com/PKU-Alignment/align-anything.\n","authors":["Jiaming Ji","Jiayi Zhou","Hantao Lou","Boyuan Chen","Donghai Hong","Xuyao Wang","Wenqi Chen","Kaile Wang","Rui Pan","Jiahao Li","Mohan Wang","Josef Dai","Tianyi Qiu","Hua Xu","Dong Li","Weipeng Chen","Jun Song","Bo Zheng","Yaodong Yang"],"pdf_url":"https://arxiv.org/pdf/2412.15838v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.09945v4","updated":"2024-12-30T07:26:14Z","published":"2024-08-19T12:34:31Z","title":"Large Language Models for Classical Chinese Poetry Translation:\n Benchmarking, Evaluating, and Improving","summary":" Different from the traditional translation tasks, classical Chinese poetry\ntranslation requires both adequacy and fluency in translating culturally and\nhistorically significant content and linguistic poetic elegance. 
Large language\nmodels (LLMs) with impressive multilingual capabilities may bring a ray of hope\nto achieve this extreme translation demand. This paper first introduces a\nsuitable benchmark (PoetMT) where each Chinese poetry has a recognized elegant\ntranslation. Meanwhile, we propose a new metric based on GPT-4 to evaluate the\nextent to which current LLMs can meet these demands. Our empirical evaluation\nreveals that the existing LLMs fall short in the challenging task. Hence, we\npropose a Retrieval-Augmented Machine Translation (RAT) method which\nincorporates knowledge related to classical poetry for advancing the\ntranslation of Chinese Poetry in LLMs. Experimental results show that RAT\nconsistently outperforms all comparison methods regarding wildly used BLEU,\nCOMET, BLEURT, our proposed metric, and human evaluation.\n","authors":["Andong Chen","Lianzhang Lou","Kehai Chen","Xuefeng Bai","Yang Xiang","Muyun Yang","Tiejun Zhao","Min Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.09945v4.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2405.13578v2","updated":"2024-12-30T07:25:13Z","published":"2024-05-22T12:15:52Z","title":"ConTrans: Weak-to-Strong Alignment Engineering via Concept\n Transplantation","summary":" Ensuring large language models (LLM) behave consistently with human goals,\nvalues, and intentions is crucial for their safety but yet computationally\nexpensive. To reduce the computational cost of alignment training of LLMs,\nespecially for those with a huge number of parameters, and to reutilize learned\nvalue alignment, we propose ConTrans, a novel framework that enables\nweak-to-strong alignment transfer via concept transplantation. From the\nperspective of representation engineering, ConTrans refines concept vectors in\nvalue alignment from a source LLM (usually a weak yet aligned LLM). 
The refined\nconcept vectors are then reformulated to adapt to the target LLM (usually a\nstrong yet unaligned base LLM) via affine transformation. In the third step,\nConTrans transplants the reformulated concept vectors into the residual stream\nof the target LLM. Experiments demonstrate the successful transplantation of a\nwide range of aligned concepts from 7B models to 13B and 70B models across\nmultiple LLMs and LLM families. Remarkably, ConTrans even surpasses\ninstruction-tuned models in terms of truthfulness. Experiment results validate\nthe effectiveness of both inter-LLM-family and intra-LLM-family concept\ntransplantation. Our work successfully demonstrates an alternative way to\nachieve weak-to-strong alignment generalization and control.\n","authors":["Weilong Dong","Xinwei Wu","Renren Jin","Shaoyang Xu","Deyi Xiong"],"pdf_url":"https://arxiv.org/pdf/2405.13578v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20760v1","updated":"2024-12-30T07:09:25Z","published":"2024-12-30T07:09:25Z","title":"Attributing Culture-Conditioned Generations to Pretraining Corpora","summary":" In open-ended generative tasks like narrative writing or dialogue, large\nlanguage models often exhibit cultural biases, showing limited knowledge and\ngenerating templated outputs for less prevalent cultures. Recent works show\nthat these biases may stem from uneven cultural representation in pretraining\ncorpora. This work investigates how pretraining leads to biased\nculture-conditioned generations by analyzing how models associate entities with\ncultures based on pretraining data patterns. We propose the MEMOed framework\n(MEMOrization from pretraining document) to determine whether a generation for\na culture arises from memorization. 
Using MEMOed on culture-conditioned\ngenerations about food and clothing for 110 cultures, we find that\nhigh-frequency cultures in pretraining data yield more generations with\nmemorized symbols, while some low-frequency cultures produce none.\nAdditionally, the model favors generating entities with extraordinarily high\nfrequency regardless of the conditioned culture, reflecting biases toward\nfrequent pretraining terms irrespective of relevance. We hope that the MEMOed\nframework and our insights will inspire more works on attributing model\nperformance on pretraining data.\n","authors":["Huihan Li","Arnav Goel","Keyu He","Xiang Ren"],"pdf_url":"https://arxiv.org/pdf/2412.20760v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20741v1","updated":"2024-12-30T06:33:39Z","published":"2024-12-30T06:33:39Z","title":"Depression and Anxiety Prediction Using Deep Language Models and\n Transfer Learning","summary":" Digital screening and monitoring applications can aid providers in the\nmanagement of behavioral health conditions. We explore deep language models for\ndetecting depression, anxiety, and their co-occurrence from conversational\nspeech collected during 16k user interactions with an application. Labels come\nfrom PHQ-8 and GAD-7 results also collected by the application. We find that\nresults for binary classification range from 0.86 to 0.79 AUC, depending on\ncondition and co-occurrence. Best performance is achieved when a user has\neither both or neither condition, and we show that this result is not\nattributable to data skew. 
Finally, we find evidence suggesting that underlying\nword sequence cues may be more salient for depression than for anxiety.\n","authors":["Tomasz Rutowski","Elizabeth Shriberg","Amir Harati","Yang Lu","Piotr Chlebek","Ricardo Oliveira"],"pdf_url":"https://arxiv.org/pdf/2412.20741v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20735v1","updated":"2024-12-30T06:18:33Z","published":"2024-12-30T06:18:33Z","title":"HUNYUANPROVER: A Scalable Data Synthesis Framework and Guided Tree\n Search for Automated Theorem Proving","summary":" We introduce HunyuanProver, an language model finetuned from the Hunyuan 7B\nfor interactive automatic theorem proving with LEAN4. To alleviate the data\nsparsity issue, we design a scalable framework to iterative synthesize data\nwith low cost. Besides, guided tree search algorithms are designed to enable\neffective ``system 2 thinking`` of the prover. HunyuanProver achieves\nstate-of-the-art (SOTA) performances on major benchmarks. Specifically, it\nachieves a pass of 68.4% on the miniF2F-test compared to 65.9%, the current\nSOTA results. It proves 4 IMO statements (imo_1960_p2, imo_1962_p2},\nimo_1964_p2 and imo_1983_p6) in miniF2F-test. 
To benefit the community, we will\nopen-source a dataset of 30k synthesized instances, where each instance\ncontains the original question in natural language, the converted statement by\nautoformalization, and the proof by HunyuanProver.\n","authors":["Yang Li","Dong Du","Linfeng Song","Chen Li","Weikang Wang","Tao Yang","Haitao Mi"],"pdf_url":"https://arxiv.org/pdf/2412.20735v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.18279v6","updated":"2024-12-30T05:54:24Z","published":"2024-11-27T12:13:39Z","title":"Large Language Model-Brained GUI Agents: A Survey","summary":" GUIs have long been central to human-computer interaction, providing an\nintuitive and visually-driven way to access and interact with digital systems.\nThe advent of LLMs, particularly multimodal models, has ushered in a new era of\nGUI automation. They have demonstrated exceptional capabilities in natural\nlanguage understanding, code generation, and visual processing. This has paved\nthe way for a new generation of LLM-brained GUI agents capable of interpreting\ncomplex GUI elements and autonomously executing actions based on natural\nlanguage instructions. These agents represent a paradigm shift, enabling users\nto perform intricate, multi-step tasks through simple conversational commands.\nTheir applications span across web navigation, mobile app interactions, and\ndesktop automation, offering a transformative user experience that\nrevolutionizes how individuals interact with software. This emerging field is\nrapidly advancing, with significant progress in both research and industry.\n To provide a structured understanding of this trend, this paper presents a\ncomprehensive survey of LLM-brained GUI agents, exploring their historical\nevolution, core components, and advanced techniques. 
We address research\nquestions such as existing GUI agent frameworks, the collection and utilization\nof data for training specialized GUI agents, the development of large action\nmodels tailored for GUI tasks, and the evaluation metrics and benchmarks\nnecessary to assess their effectiveness. Additionally, we examine emerging\napplications powered by these agents. Through a detailed analysis, this survey\nidentifies key research gaps and outlines a roadmap for future advancements in\nthe field. By consolidating foundational knowledge and state-of-the-art\ndevelopments, this work aims to guide both researchers and practitioners in\novercoming challenges and unlocking the full potential of LLM-brained GUI\nagents.\n","authors":["Chaoyun Zhang","Shilin He","Jiaxu Qian","Bowen Li","Liqun Li","Si Qin","Yu Kang","Minghua Ma","Guyue Liu","Qingwei Lin","Saravan Rajmohan","Dongmei Zhang","Qi Zhang"],"pdf_url":"https://arxiv.org/pdf/2411.18279v6.pdf","comment":"The collection of papers reviewed in this survey will be hosted and\n regularly updated on the GitHub repository:\n https://github.com/vyokky/LLM-Brained-GUI-Agents-Survey Additionally, a\n searchable webpage is available at https://aka.ms/gui-agent for easier access\n and exploration"},{"id":"http://arxiv.org/abs/2403.04652v2","updated":"2024-12-30T05:46:35Z","published":"2024-03-07T16:52:49Z","title":"Yi: Open Foundation Models by 01.AI","summary":" We introduce the Yi model family, a series of language and multimodal models\nthat demonstrate strong multi-dimensional capabilities. The Yi model family is\nbased on 6B and 34B pretrained language models, then we extend them to chat\nmodels, 200K long context models, depth-upscaled models, and vision-language\nmodels. Our base models achieve strong performance on a wide range of\nbenchmarks like MMLU, and our finetuned chat models deliver strong human\npreference rate on major evaluation platforms like AlpacaEval and Chatbot\nArena. 
Building upon our scalable super-computing infrastructure and the\nclassical transformer architecture, we attribute the performance of Yi models\nprimarily to its data quality resulting from our data-engineering efforts. For\npretraining, we construct 3.1 trillion tokens of English and Chinese corpora\nusing a cascaded data deduplication and quality filtering pipeline. For\nfinetuning, we polish a small scale (less than 10K) instruction dataset over\nmultiple iterations such that every single instance has been verified directly\nby our machine learning engineers. For vision-language, we combine the chat\nlanguage model with a vision transformer encoder and train the model to align\nvisual representations to the semantic space of the language model. We further\nextend the context length to 200K through lightweight continual pretraining and\ndemonstrate strong needle-in-a-haystack retrieval performance. We show that\nextending the depth of the pretrained checkpoint through continual pretraining\nfurther improves performance. We believe that given our current results,\ncontinuing to scale up model parameters using thoroughly optimized data will\nlead to even stronger frontier models.\n","authors":["01. 
AI"," :","Alex Young","Bei Chen","Chao Li","Chengen Huang","Ge Zhang","Guanwei Zhang","Heng Li","Jiangcheng Zhu","Jianqun Chen","Jing Chang","Kaidong Yu","Peng Liu","Qiang Liu","Shawn Yue","Senbin Yang","Shiming Yang","Tao Yu","Wen Xie","Wenhao Huang","Xiaohui Hu","Xiaoyi Ren","Xinyao Niu","Pengcheng Nie","Yuchi Xu","Yudong Liu","Yue Wang","Yuxuan Cai","Zhenyu Gu","Zhiyuan Liu","Zonghong Dai"],"pdf_url":"https://arxiv.org/pdf/2403.04652v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.16341v2","updated":"2024-12-30T05:44:55Z","published":"2024-06-24T06:26:50Z","title":"EHRCon: Dataset for Checking Consistency between Unstructured Notes and\n Structured Tables in Electronic Health Records","summary":" Electronic Health Records (EHRs) are integral for storing comprehensive\npatient medical records, combining structured data (e.g., medications) with\ndetailed clinical notes (e.g., physician notes). These elements are essential\nfor straightforward data retrieval and provide deep, contextual insights into\npatient care. However, they often suffer from discrepancies due to unintuitive\nEHR system designs and human errors, posing serious risks to patient safety. To\naddress this, we developed EHRCon, a new dataset and task specifically designed\nto ensure data consistency between structured tables and unstructured notes in\nEHRs. EHRCon was crafted in collaboration with healthcare professionals using\nthe MIMIC-III EHR dataset, and includes manual annotations of 4,101 entities\nacross 105 clinical notes checked against database entries for consistency.\nEHRCon has two versions, one using the original MIMIC-III schema, and another\nusing the OMOP CDM schema, in order to increase its applicability and\ngeneralizability. Furthermore, leveraging the capabilities of large language\nmodels, we introduce CheckEHR, a novel framework for verifying the consistency\nbetween clinical notes and database tables. 
CheckEHR utilizes an eight-stage\nprocess and shows promising results in both few-shot and zero-shot settings.\nThe code is available at https://github.com/dustn1259/EHRCon.\n","authors":["Yeonsu Kwon","Jiho Kim","Gyubok Lee","Seongsu Bae","Daeun Kyung","Wonchul Cha","Tom Pollard","Alistair Johnson","Edward Choi"],"pdf_url":"https://arxiv.org/pdf/2406.16341v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.08877v3","updated":"2024-12-30T05:08:00Z","published":"2024-04-13T02:36:40Z","title":"Aligning the Objective of LLM-based Program Repair","summary":" Large language models (LLMs) have achieved decent results on automated\nprogram repair (APR). However, the next token prediction training objective of\ndecoder-only LLMs (e.g., GPT-4) is misaligned with the masked span prediction\nobjective of current infilling-style methods, which impedes LLMs from fully\nleveraging pre-trained knowledge for program repair. In addition, while some\nLLMs can locate and repair bugs in certain functions using the related\nartifacts (e.g., test cases), existing methods still depend on statement-level\nfault localization methods to provide a list of buggy hunks for repair. This\nrestriction hinders LLMs from exploring potential patches beyond the given\nlocations.\n In this paper, we investigate a new approach to adapt LLMs to program repair.\nOur core insight is that LLM's APR capability can be greatly improved by simply\naligning the output to their training objective and allowing them to refine the\nwhole program without first identifying faulty statements. Based on this\ninsight, we designed D4C, a straightforward prompting framework for APR. D4C\ncan repair 180 bugs correctly in Defects4J, with each patch being sampled only\n10 times. This surpasses the SOTA APR methods with perfect fault localization\nby 10% and reduces the patch sampling number by 90%. 
Our findings reveal that\n(1) objective alignment is crucial for fully exploiting LLM's pre-trained\ncapability, and (2) replacing the traditional localize-buggy-hunks-then-repair\nworkflow with direct debugging is more effective for LLM-based APR methods.\nThus, we believe this paper introduces a new mindset for harnessing LLMs in\nAPR.\n","authors":["Junjielong Xu","Ying Fu","Shin Hwei Tan","Pinjia He"],"pdf_url":"https://arxiv.org/pdf/2404.08877v3.pdf","comment":"Accepted by ICSE'25"},{"id":"http://arxiv.org/abs/2412.20715v1","updated":"2024-12-30T05:07:34Z","published":"2024-12-30T05:07:34Z","title":"ChartAdapter: Large Vision-Language Model for Chart Summarization","summary":" Chart summarization, which focuses on extracting key information from charts\nand interpreting it in natural language, is crucial for generating and\ndelivering insights through effective and accessible data analysis. Traditional\nmethods for chart understanding and summarization often rely on multi-stage\npipelines, which may produce suboptimal semantic alignment between visual and\ntextual information. In comparison, recently developed LLM-based methods are\nmore dependent on the capability of foundation images or languages, while\nignoring the characteristics of chart data and its relevant challenges. To\naddress these limitations, we propose ChartAdapter, a novel lightweight\ntransformer module designed to bridge the gap between charts and textual\nsummaries. ChartAdapter employs learnable query vectors to extract implicit\nsemantics from chart data and incorporates a cross-modal alignment projector to\nenhance vision-to-language generative learning. By integrating ChartAdapter\nwith an LLM, we enable end-to-end training and efficient chart summarization.\nTo further enhance the training, we introduce a three-stage hierarchical\ntraining procedure and develop a large-scale dataset specifically curated for\nchart summarization, comprising 190,618 samples. 
Experimental results on the\nstandard Chart-to-Text testing set demonstrate that our approach significantly\noutperforms existing methods, including state-of-the-art models, in generating\nhigh-quality chart summaries. Ablation studies further validate the\neffectiveness of key components in ChartAdapter. This work highlights the\npotential of tailored LLM-based approaches to advance chart understanding and\nsets a strong foundation for future research in this area.\n","authors":["Peixin Xu","Yujuan Ding","Wenqi Fan"],"pdf_url":"https://arxiv.org/pdf/2412.20715v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.19289v2","updated":"2024-12-30T05:07:17Z","published":"2024-12-26T17:29:38Z","title":"ViPCap: Retrieval Text-Based Visual Prompts for Lightweight Image\n Captioning","summary":" Recent lightweight image captioning models using retrieved data mainly focus\non text prompts. However, previous works only utilize the retrieved text as\ntext prompts, and the visual information relies only on the CLIP visual\nembedding. Because of this issue, there is a limitation that the image\ndescriptions inherent in the prompt are not sufficiently reflected in the\nvisual embedding space. To tackle this issue, we propose ViPCap, a novel\nretrieval text-based visual prompt for lightweight image captioning. ViPCap\nleverages the retrieved text with image information as visual prompts to\nenhance the ability of the model to capture relevant visual information. By\nmapping text prompts into the CLIP space and generating multiple randomized\nGaussian distributions, our method leverages sampling to explore randomly\naugmented distributions and effectively retrieves the semantic features that\ncontain image information. These retrieved features are integrated into the\nimage and designated as the visual prompt, leading to performance improvements\non the datasets such as COCO, Flickr30k, and NoCaps. 
Experimental results\ndemonstrate that ViPCap significantly outperforms prior lightweight captioning\nmodels in efficiency and effectiveness, demonstrating the potential for a\nplug-and-play solution.\n","authors":["Taewhan Kim","Soeun Lee","Si-Woo Kim","Dong-Jin Kim"],"pdf_url":"https://arxiv.org/pdf/2412.19289v2.pdf","comment":"Accepted to AAAI 2025"},{"id":"http://arxiv.org/abs/2408.08545v2","updated":"2024-12-30T05:01:44Z","published":"2024-08-16T06:11:21Z","title":"SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language\n Models","summary":" Large language models (LLMs) have seen widespread adoption due to their\nremarkable performance across various applications, driving the accelerated\ndevelopment of a large number of diverse LLMs. However, these individual LLMs\nshow limitations in generalization and performance on complex tasks due to\ninherent training biases, model size constraints, and the quality or diversity\nof pre-training datasets. A promising direction is to efficiently harness the\ndiverse capabilities of LLMs to overcome these individual limitations. To\naddress these limitations, we introduce a novel LLM selection algorithm called\nSelectLLM, which efficiently directs input queries to the most suitable subset\nof LLMs from a large pool, ensuring that the selected models collectively\nprovide accurate responses. SelectLLM employs a multi-label classifier and\npolicy based on the classifier's predictions and confidence scores in selecting\nan optimal, query-aware, and lightweight subset of LLMs. Our findings indicate\nthat the proposed model outperforms existing ensemble-based baselines and\nachieves competitive performance with similarly sized top-performing LLMs while\nmaintaining efficiency. Specifically, it achieves a huge reduction in inference\nlatency on two challenging reasoning benchmarks: 13% on GSM8K and 70% on MMLU,\ncompared to the top-performing baselines. 
Also, we establish a theoretical\nupper bound by an oracle with LLMs and explore in-depth linguistic analysis to\nunderstand the performance gap between Oracle and SelectLLM.\n","authors":["Kaushal Kumar Maurya","KV Aditya Srivatsa","Ekaterina Kochmar"],"pdf_url":"https://arxiv.org/pdf/2408.08545v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.20906v3","updated":"2024-12-30T04:46:07Z","published":"2024-07-30T15:26:36Z","title":"Automated Review Generation Method Based on Large Language Models","summary":" Literature research, vital for scientific work, faces the challenge of\nsurging information volumes exceeding researchers' processing capabilities. We\npresent an automated review generation method based on large language models\n(LLMs) to overcome efficiency bottlenecks and reduce cognitive load. Our\nstatistically validated evaluation framework demonstrates that the generated\nreviews match or exceed manual quality, offering broad applicability across\nresearch fields without requiring users' domain knowledge. Applied to propane\ndehydrogenation (PDH) catalysts, our method swiftly analyzed 343 articles,\naveraging seconds per article per LLM account, producing comprehensive reviews\nspanning 35 topics, with extended analysis of 1041 articles providing insights\ninto catalysts' properties. Through multi-layered quality control, we\neffectively mitigated LLMs' hallucinations, with expert verification confirming\naccuracy and citation integrity while demonstrating hallucination risks reduced\nto below 0.5\\% with 95\\% confidence. 
Released Windows application enables\none-click review generation, enhancing research productivity and literature\nrecommendation efficiency while setting the stage for broader scientific\nexplorations.\n","authors":["Shican Wu","Xiao Ma","Dehui Luo","Lulu Li","Xiangcheng Shi","Xin Chang","Xiaoyun Lin","Ran Luo","Chunlei Pei","Changyin Du","Zhi-Jian Zhao","Jinlong Gong"],"pdf_url":"https://arxiv.org/pdf/2407.20906v3.pdf","comment":"21 pages, 5 figures, 1 tables Code:\n https://github.com/TJU-ECAT-AI/AutomaticReviewGeneration Data:\n https://github.com/TJU-ECAT-AI/AutomaticReviewGenerationData This research\n has been invited for a Short Oral presentation at the 18th ICC -\n International Congress on Catalysis, taking place in Lyon, France from July\n 14-19, 2024"},{"id":"http://arxiv.org/abs/2310.09881v4","updated":"2024-12-30T04:27:05Z","published":"2023-10-15T16:40:19Z","title":"In-Context Learning with Iterative Demonstration Selection","summary":" Spurred by advancements in scale, large language models (LLMs) have\ndemonstrated strong few-shot learning ability via in-context learning (ICL).\nHowever, the performance of ICL has been shown to be highly sensitive to the\nselection of few-shot demonstrations. Selecting the most suitable examples as\ncontext remains an ongoing challenge and an open problem. Existing literature\nhas highlighted the importance of selecting examples that are diverse or\nsemantically similar to the test sample while ignoring the fact that the\noptimal selection dimension, i.e., diversity or similarity, is task-specific.\nBased on how the test sample is answered, we propose Iterative Demonstration\nSelection (IDS) to leverage the merits of both dimensions. Using zero-shot\nchain-of-thought reasoning (Zero-shot-CoT), IDS iteratively selects examples\nthat are diverse but still strongly correlated with the test sample as ICL\ndemonstrations. Specifically, IDS applies Zero-shot-CoT to the test sample\nbefore demonstration selection. 
The output reasoning path is then used to\nchoose demonstrations that are prepended to the test sample for inference. The\ngenerated answer is followed by its corresponding reasoning path for extracting\na new set of demonstrations in the next iteration. After several iterations,\nIDS adopts majority voting to obtain the final result. Through extensive\nexperiments on tasks including reasoning, question answering, and topic\nclassification, we demonstrate that IDS can consistently outperform existing\nICL demonstration selection methods.\n","authors":["Chengwei Qin","Aston Zhang","Chen Chen","Anirudh Dagar","Wenming Ye"],"pdf_url":"https://arxiv.org/pdf/2310.09881v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.14368v3","updated":"2024-12-30T04:09:29Z","published":"2024-12-18T22:04:56Z","title":"Memorization Over Reasoning? Exposing and Mitigating Verbatim\n Memorization in Large Language Models' Character Understanding Evaluation","summary":" Recently, Large Language Models (LLMs) have shown impressive performance in\ncharacter understanding tasks, such as analyzing the roles, personalities, and\nrelationships of fictional characters. However, the extensive pre-training\ncorpora used by LLMs raise concerns that they may rely on memorizing popular\nfictional works rather than genuinely understanding and reasoning about them.\nIn this work, we argue that 'gist memory'-capturing essential meaning - should\nbe the primary mechanism for character understanding tasks, as opposed to\n'verbatim memory' - exact match of a string. We introduce a simple yet\neffective method to mitigate mechanized memorization in character understanding\nevaluations while preserving the essential implicit cues needed for\ncomprehension and reasoning. 
Our approach reduces memorization-driven\nperformance on popular fictional works from 96% accuracy to 72% and results in\nup to an 18% drop in accuracy across various character understanding tasks.\nThese findings underscore the issue of data contamination in existing\nbenchmarks, which often measure memorization rather than true character\nunderstanding.\n","authors":["Yuxuan Jiang","Francis Ferraro"],"pdf_url":"https://arxiv.org/pdf/2412.14368v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20694v1","updated":"2024-12-30T04:05:22Z","published":"2024-12-30T04:05:22Z","title":"UBER: Uncertainty-Based Evolution with Large Language Models for\n Automatic Heuristic Design","summary":" NP-hard problem-solving traditionally relies on heuristics, but manually\ncrafting effective heuristics for complex problems remains challenging. While\nrecent work like FunSearch has demonstrated that large language models (LLMs)\ncan be leveraged for heuristic design in evolutionary algorithm (EA)\nframeworks, their potential is not fully realized due to its deficiency in\nexploitation and exploration. We present UBER (Uncertainty-Based Evolution for\nRefinement), a method that enhances LLM+EA methods for automatic heuristic\ndesign by integrating uncertainty on top of the FunSearch framework. UBER\nintroduces two key innovations: an Uncertainty-Inclusive Evolution Process\n(UIEP) for adaptive exploration-exploitation balance, and a principled\nUncertainty-Inclusive Island Reset (UIIS) strategy for maintaining population\ndiversity. Through extensive experiments on challenging NP-complete problems,\nUBER demonstrates significant improvements over FunSearch. 
Our work provides a\nnew direction for the synergy of LLMs and EA, advancing the field of automatic\nheuristic design.\n","authors":["Zijie Chen","Zhanchao Zhou","Yu Lu","Renjun Xu","Lili Pan","Zhenzhong Lan"],"pdf_url":"https://arxiv.org/pdf/2412.20694v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.10671v4","updated":"2024-12-30T03:18:38Z","published":"2024-06-15T15:28:02Z","title":"Augmenting Biomedical Named Entity Recognition with General-domain\n Resources","summary":" Training a neural network-based biomedical named entity recognition (BioNER)\nmodel usually requires extensive and costly human annotations. While several\nstudies have employed multi-task learning with multiple BioNER datasets to\nreduce human effort, this approach does not consistently yield performance\nimprovements and may introduce label ambiguity in different biomedical corpora.\nWe aim to tackle those challenges through transfer learning from easily\naccessible resources with fewer concept overlaps with biomedical datasets. We\nproposed GERBERA, a simple-yet-effective method that utilized general-domain\nNER datasets for training. We performed multi-task learning to train a\npre-trained biomedical language model with both the target BioNER dataset and\nthe general-domain dataset. Subsequently, we fine-tuned the models specifically\nfor the BioNER dataset. We systematically evaluated GERBERA on five datasets of\neight entity types, collectively consisting of 81,410 instances. Despite using\nfewer biomedical resources, our models demonstrated superior performance\ncompared to baseline models trained with additional BioNER datasets.\nSpecifically, our models consistently outperformed the baseline models in six\nout of eight entity types, achieving an average improvement of 0.9% over the\nbest baseline performance across eight entities. 
Our method was especially\neffective in amplifying performance on BioNER datasets characterized by limited\ndata, with a 4.7% improvement in F1 scores on the JNLPBA-RNA dataset. This\nstudy introduces a new training method that leverages cost-effective\ngeneral-domain NER datasets to augment BioNER models. This approach\nsignificantly improves BioNER model performance, making it a valuable asset for\nscenarios with scarce or costly biomedical datasets.\n","authors":["Yu Yin","Hyunjae Kim","Xiao Xiao","Chih Hsuan Wei","Jaewoo Kang","Zhiyong Lu","Hua Xu","Meng Fang","Qingyu Chen"],"pdf_url":"https://arxiv.org/pdf/2406.10671v4.pdf","comment":"Published in JBI 2024. We make data, codes, and models publicly\n available via https://github.com/qingyu-qc/bioner_gerbera"},{"id":"http://arxiv.org/abs/2412.20677v1","updated":"2024-12-30T03:05:45Z","published":"2024-12-30T03:05:45Z","title":"Align Attention Heads Before Merging Them: An Effective Way for\n Converting MHA to GQA","summary":" Large language models have been shown to perform well on a variety of natural\nlanguage processing problems. However, as the model size and the input\nsequence's length increase, the rapid increase of KV Cache significantly slows\ndown inference speed. Therefore GQA model, as an alternative to MHA model, has\nbeen widely introduced into LLMs. In this work, we propose a low-cost method\nfor pruning MHA models into GQA models with any compression ratio of key-value\nheads. Our method is based on $\\mathit{L_0}$ masks to gradually remove\nredundant parameters. In addition, we apply orthogonal transformations to\nattention heads without changing the model to increase similarity between\nattention heads before pruning training, in order to further improve\nperformance of the model. Our method can be compatible with rotary position\nembedding (RoPE), which means the model after training can be fully adapted to\nthe mainstream standard GQA framework. 
Experiments demonstrate that our\nstrategy can compress up to 87.5% of key-value heads of the LLaMA2-7B model\nwithout too much performance degradation, just achieved through supervised\nfine-tuning.\n","authors":["Qingyun Jin","Xiaohui Song","Feng Zhou","Zengchang Qin"],"pdf_url":"https://arxiv.org/pdf/2412.20677v1.pdf","comment":"12 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.18619v2","updated":"2024-12-30T03:00:30Z","published":"2024-12-16T05:02:25Z","title":"Next Token Prediction Towards Multimodal Intelligence: A Comprehensive\n Survey","summary":" Building on the foundations of language modeling in natural language\nprocessing, Next Token Prediction (NTP) has evolved into a versatile training\nobjective for machine learning tasks across various modalities, achieving\nconsiderable success. As Large Language Models (LLMs) have advanced to unify\nunderstanding and generation tasks within the textual modality, recent research\nhas shown that tasks from different modalities can also be effectively\nencapsulated within the NTP framework, transforming the multimodal information\ninto tokens and predict the next one given the context. This survey introduces\na comprehensive taxonomy that unifies both understanding and generation within\nmultimodal learning through the lens of NTP. The proposed taxonomy covers five\nkey aspects: Multimodal tokenization, MMNTP model architectures, unified task\nrepresentation, datasets \\& evaluation, and open challenges. This new taxonomy\naims to aid researchers in their exploration of multimodal intelligence. 
An\nassociated GitHub repository collecting the latest papers and repos is\navailable at https://github.com/LMM101/Awesome-Multimodal-Next-Token-Prediction\n","authors":["Liang Chen","Zekun Wang","Shuhuai Ren","Lei Li","Haozhe Zhao","Yunshui Li","Zefan Cai","Hongcheng Guo","Lei Zhang","Yizhe Xiong","Yichi Zhang","Ruoyu Wu","Qingxiu Dong","Ge Zhang","Jian Yang","Lingwei Meng","Shujie Hu","Yulong Chen","Junyang Lin","Shuai Bai","Andreas Vlachos","Xu Tan","Minjia Zhang","Wen Xiao","Aaron Yee","Tianyu Liu","Baobao Chang"],"pdf_url":"https://arxiv.org/pdf/2412.18619v2.pdf","comment":"69 papes, 18 figures, repo at\n https://github.com/LMM101/Awesome-Multimodal-Next-Token-Prediction"},{"id":"http://arxiv.org/abs/2411.06790v2","updated":"2024-12-30T02:14:18Z","published":"2024-11-11T08:36:49Z","title":"Large-scale moral machine experiment on large language models","summary":" The rapid advancement of Large Language Models (LLMs) and their potential\nintegration into autonomous driving systems necessitates understanding their\nmoral decision-making capabilities. While our previous study examined four\nprominent LLMs using the Moral Machine experimental framework, the dynamic\nlandscape of LLM development demands a more comprehensive analysis. Here, we\nevaluate moral judgments across 52 different LLMs, including multiple versions\nof proprietary models (GPT, Claude, Gemini) and open-source alternatives\n(Llama, Gemma), to assess their alignment with human moral preferences in\nautonomous driving scenarios. Using a conjoint analysis framework, we evaluated\nhow closely LLM responses aligned with human preferences in ethical dilemmas\nand examined the effects of model size, updates, and architecture. Results\nshowed that proprietary models and open-source models exceeding 10 billion\nparameters demonstrated relatively close alignment with human judgments, with a\nsignificant negative correlation between model size and distance from human\njudgments in open-source models. 
However, model updates did not consistently\nimprove alignment with human preferences, and many LLMs showed excessive\nemphasis on specific ethical principles. These findings suggest that while\nincreasing model size may naturally lead to more human-like moral judgments,\npractical implementation in autonomous driving systems requires careful\nconsideration of the trade-off between judgment quality and computational\nefficiency. Our comprehensive analysis provides crucial insights for the\nethical design of autonomous systems and highlights the importance of\nconsidering cultural contexts in AI moral decision-making.\n","authors":["Muhammad Shahrul Zaim bin Ahmad","Kazuhiro Takemoto"],"pdf_url":"https://arxiv.org/pdf/2411.06790v2.pdf","comment":"21 pages, 6 figures"},{"id":"http://arxiv.org/abs/2410.22316v2","updated":"2024-12-30T01:48:26Z","published":"2024-10-29T17:55:00Z","title":"Understanding Synthetic Context Extension via Retrieval Heads","summary":" Long-context LLMs are increasingly in demand for applications such as\nretrieval-augmented generation. To defray the cost of pretraining LLMs over\nlong contexts, recent work takes an approach of synthetic context extension:\nfine-tuning LLMs with synthetically generated long-context data in a\npost-training stage. However, it remains unclear how and why this synthetic\ncontext extension imparts abilities for downstream long-context tasks. In this\npaper, we investigate fine-tuning on synthetic data for three long-context\ntasks that require retrieval and reasoning. We vary the realism of \"needle\"\nconcepts to be retrieved and diversity of the surrounding \"haystack\" context,\nfrom using LLMs to construct synthetic documents to using templated relations\nand creating symbolic datasets. 
We find that models trained on synthetic data\nfall short of the real data, but surprisingly, the mismatch can be interpreted\nand even predicted in terms of a special set of attention heads that are\nresponsible for retrieval over long context, retrieval heads (Wu et al., 2024).\nThe retrieval heads learned on synthetic data have high overlap with retrieval\nheads learned on real data, and there is a strong correlation between the\nrecall of heads learned and the downstream performance of a model. Furthermore,\nwith attention knockout and activation patching, we mechanistically show that\nretrieval heads are necessary and explain model performance, although they are\nnot totally sufficient. Our results shed light on how to interpret synthetic\ndata fine-tuning performance and how to approach creating better data for\nlearning real-world capabilities over long contexts.\n","authors":["Xinyu Zhao","Fangcong Yin","Greg Durrett"],"pdf_url":"https://arxiv.org/pdf/2410.22316v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.07066v6","updated":"2024-12-30T01:25:19Z","published":"2024-04-10T14:56:40Z","title":"Exploring Concept Depth: How Large Language Models Acquire Knowledge at\n Different Layers?","summary":" Large language models (LLMs) have shown remarkable performances across a wide\nrange of tasks. However, the mechanisms by which these models encode tasks of\nvarying complexities remain poorly understood. In this paper, we explore the\nhypothesis that LLMs process concepts of varying complexities in different\nlayers, introducing the idea of ``Concept Depth'' to suggest that more complex\nconcepts are typically acquired in deeper layers. Specifically, we categorize\nconcepts based on their level of abstraction, defining them in the order of\nincreasing complexity within factual, emotional, and inferential tasks. 
We\nconduct extensive probing experiments using layer-wise representations across\nvarious LLM families (Gemma, LLaMA, Qwen) on various datasets spanning the\nthree domains of tasks. Our findings reveal that models could efficiently\nconduct probing for simpler tasks in shallow layers, and more complex tasks\ntypically necessitate deeper layers for accurate understanding. Additionally,\nwe examine how external factors, such as adding noise to the input and\nquantizing the model weights, might affect layer-wise representations. Our\nfindings suggest that these factors can impede the development of a conceptual\nunderstanding of LLMs until deeper layers are explored. We hope that our\nproposed concept and experimental insights will enhance the understanding of\nthe mechanisms underlying LLMs. Our codes are available at\n\\url{https://github.com/Luckfort/CD}.\n","authors":["Mingyu Jin","Qinkai Yu","Jingyuan Huang","Qingcheng Zeng","Zhenting Wang","Wenyue Hua","Haiyan Zhao","Kai Mei","Yanda Meng","Kaize Ding","Fan Yang","Mengnan Du","Yongfeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2404.07066v6.pdf","comment":"COLING 2025"},{"id":"http://arxiv.org/abs/2412.18547v2","updated":"2024-12-30T01:07:39Z","published":"2024-12-24T16:55:45Z","title":"Token-Budget-Aware LLM Reasoning","summary":" Reasoning is critical for large language models (LLMs) to excel in a wide\nrange of tasks. While methods like Chain-of-Thought (CoT) reasoning enhance LLM\nperformance by decomposing problems into intermediate steps, they also incur\nsignificant overhead in token usage, leading to increased costs. 
We find that\nthe reasoning process of current LLMs is unnecessarily lengthy and it can be\ncompressed by including a reasonable token budget in the prompt, but the choice\nof token budget plays a crucial role in the actual compression effectiveness.\nWe then propose a token-budget-aware LLM reasoning framework, which dynamically\nestimates token budgets for different problems based on reasoning complexity\nand uses the estimated token budgets to guide the reasoning process.\nExperiments show that our method effectively reduces token costs in CoT\nreasoning with only a slight performance reduction, offering a practical\nsolution to balance efficiency and accuracy in LLM reasoning. Code:\nhttps://github.com/GeniusHTX/TALE.\n","authors":["Tingxu Han","Chunrong Fang","Shiyu Zhao","Shiqing Ma","Zhenyu Chen","Zhenting Wang"],"pdf_url":"https://arxiv.org/pdf/2412.18547v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20637v1","updated":"2024-12-30T00:58:00Z","published":"2024-12-30T00:58:00Z","title":"Knowledge Editing for Large Language Model with Knowledge Neuronal\n Ensemble","summary":" As real-world knowledge is constantly evolving, ensuring the timeliness and\naccuracy of a model's knowledge is crucial. This has made knowledge editing in\nlarge language models increasingly important. However, existing knowledge\nediting methods face several challenges, including parameter localization\ncoupling, imprecise localization, and a lack of dynamic interaction across\nlayers. In this paper, we propose a novel knowledge editing method called\nKnowledge Neuronal Ensemble (KNE). A knowledge neuronal ensemble represents a\ngroup of neurons encoding specific knowledge, thus mitigating the issue of\nfrequent parameter modification caused by coupling in parameter localization.\nThe KNE method enhances the precision and accuracy of parameter localization by\ncomputing gradient attribution scores for each parameter at each layer. 
During\nthe editing process, only the gradients and losses associated with the\nknowledge neuronal ensemble are computed, with error backpropagation performed\naccordingly, ensuring dynamic interaction and collaborative updates among\nparameters. Experimental results on three widely used knowledge editing\ndatasets show that the KNE method significantly improves the accuracy of\nknowledge editing and achieves, or even exceeds, the performance of the best\nbaseline methods in portability and locality metrics.\n","authors":["Yongchang Li","Yujin Zhu","Tao Yan","Shijian Fan","Gang Wu","Liang Xu"],"pdf_url":"https://arxiv.org/pdf/2412.20637v1.pdf","comment":"26 pages, 5 figures, 2 tables"},{"id":"http://arxiv.org/abs/2405.14170v3","updated":"2024-12-30T00:53:45Z","published":"2024-05-23T04:54:37Z","title":"Large Language Models-guided Dynamic Adaptation for Temporal Knowledge\n Graph Reasoning","summary":" Temporal Knowledge Graph Reasoning (TKGR) is the process of utilizing\ntemporal information to capture complex relations within a Temporal Knowledge\nGraph (TKG) to infer new knowledge. Conventional methods in TKGR typically\ndepend on deep learning algorithms or temporal logical rules. However, deep\nlearning-based TKGRs often lack interpretability, whereas rule-based TKGRs\nstruggle to effectively learn temporal rules that capture temporal patterns.\nRecently, Large Language Models (LLMs) have demonstrated extensive knowledge\nand remarkable proficiency in temporal reasoning. Consequently, the employment\nof LLMs for Temporal Knowledge Graph Reasoning (TKGR) has sparked increasing\ninterest among researchers. Nonetheless, LLMs are known to function as black\nboxes, making it challenging to comprehend their reasoning process.\nAdditionally, due to the resource-intensive nature of fine-tuning, promptly\nupdating LLMs to integrate evolving knowledge within TKGs for reasoning is\nimpractical. 
To address these challenges, in this paper, we propose a Large\nLanguage Models-guided Dynamic Adaptation (LLM-DA) method for reasoning on\nTKGs. Specifically, LLM-DA harnesses the capabilities of LLMs to analyze\nhistorical data and extract temporal logical rules. These rules unveil temporal\npatterns and facilitate interpretable reasoning. To account for the evolving\nnature of TKGs, a dynamic adaptation strategy is proposed to update the\nLLM-generated rules with the latest events. This ensures that the extracted\nrules always incorporate the most recent knowledge and generalize better to\npredictions of future events. Experimental results show that, without the need\nfor fine-tuning, LLM-DA significantly improves the accuracy of reasoning over\nseveral common datasets, providing a robust framework for TKGR tasks.\n","authors":["Jiapu Wang","Kai Sun","Linhao Luo","Wei Wei","Yongli Hu","Alan Wee-Chung Liew","Shirui Pan","Baocai Yin"],"pdf_url":"https://arxiv.org/pdf/2405.14170v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20953v1","updated":"2024-12-30T13:49:28Z","published":"2024-12-30T13:49:28Z","title":"GASLITEing the Retrieval: Exploring Vulnerabilities in Dense\n Embedding-based Search","summary":" Dense embedding-based text retrieval$\unicode{x2013}$retrieval of relevant\npassages from corpora via deep learning encodings$\unicode{x2013}$has emerged\nas a powerful method attaining state-of-the-art search results and popularizing\nthe use of Retrieval Augmented Generation (RAG). Still, like other search\nmethods, embedding-based retrieval may be susceptible to search-engine\noptimization (SEO) attacks, where adversaries promote malicious content by\nintroducing adversarial passages to corpora. 
To faithfully assess and gain\ninsights into the susceptibility of such systems to SEO, this work proposes the\nGASLITE attack, a mathematically principled gradient-based search method for\ngenerating adversarial passages without relying on the corpus content or\nmodifying the model. Notably, GASLITE's passages (1) carry adversary-chosen\ninformation while (2) achieving high retrieval ranking for a selected query\ndistribution when inserted into corpora. We use GASLITE to extensively evaluate\nretrievers' robustness, testing nine advanced models under varied threat\nmodels, while focusing on realistic adversaries targeting queries on a specific\nconcept (e.g., a public figure). We found GASLITE consistently outperformed\nbaselines by a $\geq$140% success rate in all settings. Particularly,\nadversaries using GASLITE require minimal effort to manipulate search\nresults$\unicode{x2013}$by injecting a negligible number of adversarial\npassages ($\leq$0.0001% of the corpus), they could make these passages visible\nin the top-10 results for 61-100% of unseen concept-specific queries against\nmost evaluated models. Inspecting variance in retrievers' robustness, we\nidentify key factors that may contribute to models' susceptibility to SEO,\nincluding specific properties in the embedding space's geometry.\n","authors":["Matan Ben-Tov","Mahmood Sharif"],"pdf_url":"https://arxiv.org/pdf/2412.20953v1.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2401.10690v3","updated":"2024-12-30T18:21:53Z","published":"2024-01-19T13:41:08Z","title":"Beyond RMSE and MAE: Introducing EAUC to unmask hidden bias and\n unfairness in dyadic regression models","summary":" Dyadic regression models, which output real-valued predictions for pairs of\nentities, are fundamental in many domains (e.g. obtaining user-product ratings\nin Recommender Systems) and promising but under-explored in others (e.g.\ntuning patient-drug dosages in personalized pharmacology). 
In this work, we\nprove that non-uniform observed value distributions of individual entities lead\nto severe biases in state-of-the-art models, skewing predictions towards the\naverage of observed past values for the entity and providing worse-than-random\npredictive power in eccentric yet crucial cases; we name this phenomenon\neccentricity bias. We show that global error metrics like Root Mean Squared\nError (RMSE) are insufficient to capture this bias, and we introduce\nEccentricity-Area Under the Curve (EAUC) as a novel complementary metric that\ncan quantify it in all studied domains and models. We prove the intuitive\ninterpretation of EAUC by experimenting with naive post-training bias\ncorrections, and theorize other options to use EAUC to guide the construction\nof fair models. This work contributes a bias-aware evaluation of dyadic\nregression to prevent unfairness in critical real-world applications of such\nsystems.\n","authors":["Jorge Paz-Ruza","Amparo Alonso-Betanzos","Bertha Guijarro-Berdiñas","Brais Cancela","Carlos Eiras-Franco"],"pdf_url":"https://arxiv.org/pdf/2401.10690v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.13847v2","updated":"2024-12-30T17:13:48Z","published":"2024-09-20T18:42:04Z","title":"Segment Discovery: Enhancing E-commerce Targeting","summary":" Modern e-commerce services frequently target customers with incentives or\ninterventions to engage them in their products such as games, shopping, video\nstreaming, etc. This customer engagement increases acquisition of more\ncustomers and retention of existing ones, leading to more business for the\ncompany while improving customer experience. Often, customers are either\nrandomly targeted or targeted based on the propensity of desirable behavior.\nHowever, such policies can be suboptimal as they do not target the set of\ncustomers who would benefit the most from the intervention and they may also\nnot take account of any constraints. 
In this paper, we propose a policy\nframework based on uplift modeling and constrained optimization that identifies\ncustomers to target for a use-case specific intervention so as to maximize the\nvalue to the business, while taking account of any given constraints. We\ndemonstrate improvement over state-of-the-art targeting approaches using two\nlarge-scale experimental studies and a production implementation.\n","authors":["Qiqi Li","Roopali Singh","Charin Polpanumas","Tanner Fiez","Namita Kumar","Shreya Chakrabarti"],"pdf_url":"https://arxiv.org/pdf/2409.13847v2.pdf","comment":"Accepted at the CONSEQUENCES'24 workshop, co-located with ACM\n RecSys'24"},{"id":"http://arxiv.org/abs/2412.19312v2","updated":"2024-12-30T15:30:23Z","published":"2024-12-26T18:19:53Z","title":"From Interests to Insights: An LLM Approach to Course Recommendations\n Using Natural Language Queries","summary":" Most universities in the United States encourage their students to explore\nacademic areas before declaring a major and to acquire academic breadth by\nsatisfying a variety of requirements. Each term, students must choose among\nmany thousands of offerings, spanning dozens of subject areas, a handful of\ncourses to take. The curricular environment is also dynamic, and poor\ncommunication and search functions on campus can limit a student's ability to\ndiscover new courses of interest. To support both students and their advisers\nin such a setting, we explore a novel Large Language Model (LLM) course\nrecommendation system that applies a Retrieval Augmented Generation (RAG)\nmethod to the corpus of course descriptions. The system first generates an\n'ideal' course description based on the user's query. This description is\nconverted into a search vector using embeddings, which is then used to find\nactual courses with similar content by comparing embedding similarities. We\ndescribe the method and assess the quality and fairness of some example\nprompts. 
Steps to deploy a pilot system on campus are discussed.\n","authors":["Hugh Van Deventer","Mark Mills","August Evrard"],"pdf_url":"https://arxiv.org/pdf/2412.19312v2.pdf","comment":"17 pages, 9 figures"},{"id":"http://arxiv.org/abs/2412.21009v1","updated":"2024-12-30T15:21:36Z","published":"2024-12-30T15:21:36Z","title":"Towards Identity-Aware Cross-Modal Retrieval: a Dataset and a Baseline","summary":" Recent advancements in deep learning have significantly enhanced\ncontent-based retrieval methods, notably through models like CLIP that map\nimages and texts into a shared embedding space. However, these methods often\nstruggle with domain-specific entities and long-tail concepts absent from their\ntraining data, particularly in identifying specific individuals. In this paper,\nwe explore the task of identity-aware cross-modal retrieval, which aims to\nretrieve images of persons in specific contexts based on natural language\nqueries. This task is critical in various scenarios, such as for searching and\nbrowsing personalized video collections or large audio-visual archives\nmaintained by national broadcasters. We introduce a novel dataset, COCO Person\nFaceSwap (COCO-PFS), derived from the widely used COCO dataset and enriched\nwith deepfake-generated faces from VGGFace2. This dataset addresses the lack of\nlarge-scale datasets needed for training and evaluating models for this task.\nOur experiments assess the performance of different CLIP variations repurposed\nfor this task, including our architecture, Identity-aware CLIP (Id-CLIP), which\nachieves competitive retrieval performance through targeted fine-tuning. Our\ncontributions lay the groundwork for more robust cross-modal retrieval systems\ncapable of recognizing long-tail identities and contextual nuances. 
Data and\ncode are available at https://github.com/mesnico/IdCLIP.\n","authors":["Nicola Messina","Lucia Vadicamo","Leo Maltese","Claudio Gennaro"],"pdf_url":"https://arxiv.org/pdf/2412.21009v1.pdf","comment":"Accepted as full paper at ECIR 2025"},{"id":"http://arxiv.org/abs/2412.20960v1","updated":"2024-12-30T13:55:28Z","published":"2024-12-30T13:55:28Z","title":"Rise of Generative Artificial Intelligence in Science","summary":" Generative Artificial Intelligence (GenAI, generative AI) has rapidly become\navailable as a tool in scientific research. To explore the use of generative AI\nin science, we conduct an empirical analysis using OpenAlex. Analyzing GenAI\npublications and other AI publications from 2017 to 2023, we profile growth\npatterns, the diffusion of GenAI publications across fields of study, and the\ngeographical spread of scientific research on generative AI. We also\ninvestigate team size and international collaborations to explore whether\nGenAI, as an emerging scientific research area, shows different collaboration\npatterns compared to other AI technologies. The results indicate that\ngenerative AI has experienced rapid growth and increasing presence in\nscientific publications. The use of GenAI now extends beyond computer science\nto other scientific research domains. Over the study period, U.S. researchers\ncontributed nearly two-fifths of global GenAI publications. The U.S. is\nfollowed by China, with several small and medium-sized advanced economies\ndemonstrating relatively high levels of GenAI deployment in their research\npublications. 
Although scientific research overall is becoming increasingly\nspecialized and collaborative, our results suggest that GenAI research groups\ntend to have slightly smaller team sizes than those found in other AI fields.\nFurthermore, notwithstanding recent geopolitical tensions, GenAI research\ncontinues to exhibit levels of international collaboration comparable to other\nAI technologies.\n","authors":["Liangping Ding","Cornelia Lawson","Philip Shapira"],"pdf_url":"https://arxiv.org/pdf/2412.20960v1.pdf","comment":"26 pages, 4 tables, 1 figures, 1 appendix figure"},{"id":"http://arxiv.org/abs/2412.20942v1","updated":"2024-12-30T13:36:05Z","published":"2024-12-30T13:36:05Z","title":"Ontology-grounded Automatic Knowledge Graph Construction by LLM under\n Wikidata schema","summary":" We propose an ontology-grounded approach to Knowledge Graph (KG) construction\nusing Large Language Models (LLMs) on a knowledge base. An ontology is authored\nby generating Competency Questions (CQ) on the knowledge base to discover\nknowledge scope, extracting relations from CQs, and attempting to replace\nequivalent relations with their counterparts in Wikidata. To ensure consistency\nand interpretability in the resulting KG, we ground the generation of the KG in\nthe authored ontology based on extracted relations. Evaluation on benchmark\ndatasets demonstrates competitive performance on the knowledge graph\nconstruction task. 
Our work presents a promising direction for a scalable KG\nconstruction pipeline with minimal human intervention that yields high-quality,\nhuman-interpretable KGs, which are interoperable with Wikidata semantics for\npotential knowledge base expansion.\n","authors":["Xiaohan Feng","Xixin Wu","Helen Meng"],"pdf_url":"https://arxiv.org/pdf/2412.20942v1.pdf","comment":"Presented at HI-AI@KDD, Human-Interpretable AI Workshop at the KDD\n 2024, 26th of August 2024, Barcelona, Spain"},{"id":"http://arxiv.org/abs/2412.18176v2","updated":"2024-12-30T09:24:34Z","published":"2024-12-24T05:23:13Z","title":"Molar: Multimodal LLMs with Collaborative Filtering Alignment for\n Enhanced Sequential Recommendation","summary":" Sequential recommendation (SR) systems have evolved significantly over the\npast decade, transitioning from traditional collaborative filtering to deep\nlearning approaches and, more recently, to large language models (LLMs). While\nthe adoption of LLMs has driven substantial advancements, these models\ninherently lack collaborative filtering information, relying primarily on\ntextual content data, neglecting other modalities, and thus failing to achieve\noptimal recommendation performance. To address this limitation, we propose\nMolar, a Multimodal large language sequential recommendation framework that\nintegrates multiple content modalities with ID information to capture\ncollaborative signals effectively. Molar employs an MLLM to generate unified\nitem representations from both textual and non-textual data, facilitating\ncomprehensive multimodal modeling and enriching item embeddings. Additionally,\nit incorporates collaborative filtering signals through a post-alignment\nmechanism, which aligns user representations from content-based and ID-based\nmodels, ensuring precise personalization and robust performance. 
By seamlessly\ncombining multimodal content with collaborative filtering insights, Molar\ncaptures both user interests and contextual semantics, leading to superior\nrecommendation accuracy. Extensive experiments validate that Molar\nsignificantly outperforms traditional and LLM-based baselines, highlighting its\nstrength in utilizing multimodal data and collaborative signals for sequential\nrecommendation tasks. The source code is available at\nhttps://anonymous.4open.science/r/Molar-8B06/.\n","authors":["Yucong Luo","Qitao Qin","Hao Zhang","Mingyue Cheng","Ruiran Yan","Kefan Wang","Jie Ouyang"],"pdf_url":"https://arxiv.org/pdf/2412.18176v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.07500v2","updated":"2024-12-30T08:43:32Z","published":"2024-09-10T15:24:13Z","title":"DV-FSR: A Dual-View Target Attack Framework for Federated Sequential\n Recommendation","summary":" Federated recommendation (FedRec) preserves user privacy by enabling\ndecentralized training of personalized models, but this architecture is\ninherently vulnerable to adversarial attacks. Significant research has been\nconducted on targeted attacks in FedRec systems, motivated by commercial and\nsocial influence considerations. However, much of this work has largely\noverlooked the differential robustness of recommendation models. Moreover, our\nempirical findings indicate that existing targeted attack methods achieve only\nlimited effectiveness in Federated Sequential Recommendation (FSR) tasks.\nDriven by these observations, we focus on investigating targeted attacks in FSR\nand propose a novel dual-view attack framework, named DV-FSR. This attack\nmethod uniquely combines a sampling-based explicit strategy with a contrastive\nlearning-based implicit gradient strategy to orchestrate a coordinated attack.\nAdditionally, we introduce a specific defense mechanism tailored for targeted\nattacks in FSR, aiming to evaluate the mitigation effects of our proposed\nattack method. 
Extensive experiments validate the effectiveness of our proposed\napproach on representative sequential models.\n","authors":["Qitao Qin","Yucong Luo","Mingyue Cheng","Qingyang Mao","Chenyi Lei"],"pdf_url":"https://arxiv.org/pdf/2409.07500v2.pdf","comment":"I am requesting the withdrawal of my paper due to identified errors\n that require significant revision"},{"id":"http://arxiv.org/abs/2411.12179v2","updated":"2024-12-30T08:32:39Z","published":"2024-11-19T02:45:17Z","title":"Multi-Grained Preference Enhanced Transformer for Multi-Behavior\n Sequential Recommendation","summary":" Sequential recommendation (SR) aims to predict the next purchasing item\naccording to users' dynamic preferences learned from their historical user-item\ninteractions. To improve the performance of recommendation, learning dynamic\nheterogeneous cross-type behavior dependencies is indispensable for recommender\nsystems. However, there still exist some challenges in Multi-Behavior\nSequential Recommendation (MBSR). On the one hand, existing methods only model\nheterogeneous multi-behavior dependencies at the behavior or item level, and\nmodelling interaction-level dependencies is still a challenge. On the other\nhand, the dynamic multi-grained behavior-aware preference is hard to capture in\ninteraction sequences, which reflects the interaction-aware sequential pattern.\nTo tackle these challenges, we propose a Multi-Grained Preference enhanced\nTransformer framework (M-GPT). First, M-GPT constructs an interaction-level\ngraph of historical cross-typed interactions in a sequence. Then graph\nconvolution is performed to derive interaction-level multi-behavior dependency\nrepresentations repeatedly, in which the complex correlations between\nhistorical cross-typed interactions at specific orders can be well learned. 
Second, a\nnovel multi-scale transformer architecture equipped with multi-grained user\npreference extraction is proposed to encode the interaction-aware sequential\npattern enhanced by capturing temporal behavior-aware multi-grained\npreferences. Experiments on real-world datasets indicate that our method M-GPT\nconsistently outperforms various state-of-the-art recommendation methods.\n","authors":["Chuan He","Yongchao Liu","Qiang Li","Weiqiang Wang","Xin Fu","Xinyi Fu","Chuntao Hong","Xinwei Yao"],"pdf_url":"https://arxiv.org/pdf/2411.12179v2.pdf","comment":"12 pages"},{"id":"http://arxiv.org/abs/2412.20756v1","updated":"2024-12-30T07:01:34Z","published":"2024-12-30T07:01:34Z","title":"Unsupervised dense retrieval with conterfactual contrastive learning","summary":" Efficiently retrieving a concise set of candidates from a large document\ncorpus remains a pivotal challenge in Information Retrieval (IR). Neural\nretrieval models, particularly dense retrieval models built with transformers\nand pretrained language models, have been popular due to their superior\nperformance. However, criticisms have also been raised about their lack of\nexplainability and vulnerability to adversarial attacks. In response to these\nchallenges, we propose to improve the robustness of dense retrieval models by\nenhancing their sensitivity to fine-grained relevance signals. A model\nachieving sensitivity in this context should exhibit high variances when\ndocuments' key passages determining their relevance to queries have been\nmodified, while maintaining low variances for other changes in irrelevant\npassages. This sensitivity allows a dense retrieval model to produce robust\nresults with respect to attacks that try to promote documents without actually\nincreasing their relevance. 
Motivated by causality and counterfactual analysis, we propose\na series of counterfactual regularization methods based on game theory and\nunsupervised learning with counterfactual passages. Experiments show that our\nmethod can extract key passages without relying on passage-level relevance\nannotations. Moreover, the regularized dense retrieval models exhibit\nheightened robustness against adversarial attacks, surpassing the\nstate-of-the-art anti-attack methods.\n","authors":["Haitian Chen","Qingyao Ai","Xiao Wang","Yiqun Liu","Fen Lin","Qin Liu"],"pdf_url":"https://arxiv.org/pdf/2412.20756v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19478v2","updated":"2024-12-30T06:52:12Z","published":"2024-11-29T05:31:04Z","title":"Zero-Indexing Internet Search Augmented Generation for Large Language\n Models","summary":" Retrieval augmented generation has emerged as an effective method to enhance\nlarge language model performance. This approach typically relies on an internal\nretrieval module that uses various indexing mechanisms to manage a static\npre-processed corpus. However, such a paradigm often falls short when it is\nnecessary to integrate the most up-to-date information that has not yet been\nincorporated into the corpus at generative inference time. In this paper, we\nexplore an alternative approach that leverages standard search engine APIs to\ndynamically integrate the latest online information (without maintaining any\nindex for any fixed corpus), thereby improving the quality of generated\ncontent. 
We design a collaborative LLM-based paradigm that includes: (i) a\nparser-LLM that determines whether Internet-augmented generation is needed and,\nif so, extracts the search keywords in a single inference; (ii) a mixed\nranking strategy that re-ranks the retrieved HTML files to eliminate bias\nintroduced by the search engine API; and (iii) an extractor-LLM that can\naccurately and efficiently extract relevant information from the fresh content\nin each HTML file. We conduct extensive empirical studies to evaluate the\nperformance of this Internet search augmented generation paradigm. The\nexperimental results demonstrate that our method generates content with\nsignificantly improved quality. Our system has been successfully deployed in a\nproduction environment to serve 01.AI's generative inference requests.\n","authors":["Guangxin He","Zonghong Dai","Jiangcheng Zhu","Binqiang Zhao","Qicheng Hu","Chenyue Li","You Peng","Chen Wang","Binhang Yuan"],"pdf_url":"https://arxiv.org/pdf/2411.19478v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.12984v2","updated":"2024-12-30T05:34:10Z","published":"2024-12-17T15:04:54Z","title":"Cluster-guided Contrastive Class-imbalanced Graph Classification","summary":" This paper studies the problem of class-imbalanced graph classification,\nwhich aims at effectively classifying the graph categories in scenarios with\nimbalanced class distributions. While graph neural networks (GNNs) have\nachieved remarkable success, their modeling ability on imbalanced\ngraph-structured data remains suboptimal, which typically leads to predictions\nbiased towards the majority classes. On the other hand, existing\nclass-imbalanced learning methods in vision may overlook the rich graph\nsemantic substructures of the majority classes and excessively emphasize\nlearning from the minority classes. 
To address these challenges, we propose a\nsimple yet powerful approach called C$^3$GNN that integrates the idea of\nclustering into contrastive learning to enhance class-imbalanced graph\nclassification. Technically, C$^3$GNN clusters graphs from each majority class\ninto multiple subclasses, with sizes comparable to the minority class,\nmitigating class imbalance. It also employs the Mixup technique to generate\nsynthetic samples, enriching the semantic diversity of each subclass.\nFurthermore, supervised contrastive learning is used to hierarchically learn\neffective graph representations, enabling the model to thoroughly explore\nsemantic substructures in majority classes while avoiding excessive focus on\nminority classes. Extensive experiments on real-world graph benchmark datasets\nverify the superior performance of our proposed method against competitive\nbaselines.\n","authors":["Wei Ju","Zhengyang Mao","Siyu Yi","Yifang Qin","Yiyang Gu","Zhiping Xiao","Jianhao Shen","Ziyue Qiao","Ming Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.12984v2.pdf","comment":"Accepted by Proceedings of the Thirty-Ninth AAAI Conference on\n Artificial Intelligence (AAAI-25)"},{"id":"http://arxiv.org/abs/2412.18819v2","updated":"2024-12-30T04:15:42Z","published":"2024-12-25T08:17:37Z","title":"LLM-assisted Vector Similarity Search","summary":" As data retrieval demands become increasingly complex, traditional search\nmethods often fall short in addressing nuanced and conceptual queries. Vector\nsimilarity search has emerged as a promising technique for finding semantically\nsimilar information efficiently. However, its effectiveness diminishes when\nhandling intricate queries with contextual nuances. This paper explores a\nhybrid approach combining vector similarity search with Large Language Models\n(LLMs) to enhance search accuracy and relevance. 
The proposed two-step solution\nfirst employs vector similarity search to shortlist potential matches, followed\nby an LLM for context-aware ranking of the results. Experiments on structured\ndatasets demonstrate that while vector similarity search alone performs well\nfor straightforward queries, the LLM-assisted approach excels in processing\ncomplex queries involving constraints, negations, or conceptual requirements.\nBy leveraging the natural language understanding capabilities of LLMs, this\nmethod improves the accuracy of search results for complex tasks without\nsacrificing efficiency. We also discuss real-world applications and propose\ndirections for future research to refine and scale this technique for diverse\ndatasets and use cases.\n Original article:\nhttps://engineering.grab.com/llm-assisted-vector-similarity-search\n","authors":["Md Riyadh","Muqi Li","Felix Haryanto Lie","Jia Long Loh","Haotian Mi","Sayam Bohra"],"pdf_url":"https://arxiv.org/pdf/2412.18819v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20914v1","updated":"2024-12-30T12:49:55Z","published":"2024-12-30T12:49:55Z","title":"Language-based Audio Retrieval with Co-Attention Networks","summary":" In recent years, user-generated audio content has proliferated across various\nmedia platforms, creating a growing need for efficient retrieval methods that\nallow users to search for audio clips using natural language queries. This\ntask, known as language-based audio retrieval, presents significant challenges\ndue to the complexity of learning semantic representations from heterogeneous\ndata across both text and audio modalities. In this work, we introduce a novel\nframework for the language-based audio retrieval task that leverages a\nco-attention mechanism to jointly learn meaningful representations from both\nmodalities. 
To enhance the model's ability to capture fine-grained cross-modal\ninteractions, we propose a cascaded co-attention architecture, where\nco-attention modules are stacked or iterated to progressively refine the\nsemantic alignment between text and audio. Experiments conducted on two public\ndatasets show that the proposed method can achieve better performance than the\nstate-of-the-art method. Specifically, our best-performing co-attention model\nachieves a 16.6% improvement in mean Average Precision on the Clotho dataset\nand a 15.1% improvement on AudioCaps.\n","authors":["Haoran Sun","Zimu Wang","Qiuyi Chen","Jianjun Chen","Jia Wang","Haiyang Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.20914v1.pdf","comment":"Accepted at UIC 2024 proceedings. Accepted version"},{"id":"http://arxiv.org/abs/2412.20756v1","updated":"2024-12-30T07:01:34Z","published":"2024-12-30T07:01:34Z","title":"Unsupervised dense retrieval with conterfactual contrastive learning","summary":" Efficiently retrieving a concise set of candidates from a large document\ncorpus remains a pivotal challenge in Information Retrieval (IR). Neural\nretrieval models, particularly dense retrieval models built with transformers\nand pretrained language models, have been popular due to their superior\nperformance. However, criticisms have also been raised about their lack of\nexplainability and vulnerability to adversarial attacks. In response to these\nchallenges, we propose to improve the robustness of dense retrieval models by\nenhancing their sensitivity to fine-grained relevance signals. A model\nachieving sensitivity in this context should exhibit high variances when\ndocuments' key passages determining their relevance to queries have been\nmodified, while maintaining low variances for other changes in irrelevant\npassages. This sensitivity allows a dense retrieval model to produce robust\nresults with respect to attacks that try to promote documents without actually\nincreasing their relevance. 
It also makes it possible to analyze which part of a document\nis actually relevant to a query, and thus improve the explainability of the\nretrieval model. Motivated by causality and counterfactual analysis, we propose\na series of counterfactual regularization methods based on game theory and\nunsupervised learning with counterfactual passages. Experiments show that our\nmethod can extract key passages without relying on passage-level relevance\nannotations. Moreover, the regularized dense retrieval models exhibit\nheightened robustness against adversarial attacks, surpassing the\nstate-of-the-art anti-attack methods.\n","authors":["Haitian Chen","Qingyao Ai","Xiao Wang","Yiqun Liu","Fen Lin","Qin Liu"],"pdf_url":"https://arxiv.org/pdf/2412.20756v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2107.07773 by other authors"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2412.21205v1","updated":"2024-12-30T18:59:55Z","published":"2024-12-30T18:59:55Z","title":"Action-Agnostic Point-Level Supervision for Temporal Action Detection","summary":" We propose action-agnostic point-level (AAPL) supervision for temporal action\ndetection to achieve accurate action instance detection with a lightly\nannotated dataset. In the proposed scheme, a small portion of video frames is\nsampled in an unsupervised manner and presented to human annotators, who then\nlabel the frames with action categories. Unlike point-level supervision, which\nrequires annotators to search for every action instance in an untrimmed video,\nframes to annotate are selected without human intervention in AAPL supervision.\nWe also propose a detection model and learning method to effectively utilize\nthe AAPL labels. 
Extensive experiments on a variety of datasets (THUMOS '14,\nFineAction, GTEA, BEOID, and ActivityNet 1.3) demonstrate that the proposed\napproach is competitive with or outperforms prior methods for video-level and\npoint-level supervision in terms of the trade-off between the annotation cost\nand detection performance.\n","authors":["Shuhei M. Yoshida","Takashi Shibata","Makoto Terao","Takayuki Okatani","Masashi Sugiyama"],"pdf_url":"https://arxiv.org/pdf/2412.21205v1.pdf","comment":"AAAI-25. Technical appendices included. 15 pages, 3 figures, 11\n tables"},{"id":"http://arxiv.org/abs/2412.21203v1","updated":"2024-12-30T18:59:46Z","published":"2024-12-30T18:59:46Z","title":"SoS Certificates for Sparse Singular Values and Their Applications:\n Robust Statistics, Subspace Distortion, and More","summary":" We study $\textit{sparse singular value certificates}$ for random rectangular\nmatrices. If $M$ is an $n \times d$ matrix with independent Gaussian entries,\nwe give a new family of polynomial-time algorithms which can certify upper\nbounds on the maximum of $\|M u\|$, where $u$ is a unit vector with at most\n$\eta n$ nonzero entries for a given $\eta \in (0,1)$. This basic algorithmic\nprimitive lies at the heart of a wide range of problems across algorithmic\nstatistics and theoretical computer science.\n Our algorithms certify a bound which is asymptotically smaller than the naive\none, given by the maximum singular value of $M$, for nearly the widest-possible\nrange of $n,d,$ and $\eta$. Efficiently certifying such a bound for a range of\n$n,d$ and $\eta$ which is larger by any polynomial factor than what is achieved\nby our algorithm would violate lower bounds in the SQ and low-degree\npolynomials models. Our certification algorithm makes essential use of the\nSum-of-Squares hierarchy. 
To prove the correctness of our algorithm, we develop\na new combinatorial connection between the graph matrix approach to analyze\nrandom matrices with dependent entries, and the Efron-Stein decomposition of\nfunctions of independent random variables.\n As applications of our certification algorithm, we obtain new efficient\nalgorithms for a wide range of well-studied algorithmic tasks. In algorithmic\nrobust statistics, we obtain new algorithms for robust mean and covariance\nestimation with tradeoffs between breakdown point and sample complexity, which\nare nearly matched by SQ and low-degree polynomial lower bounds (that we\nestablish). We also obtain new polynomial-time guarantees for certification of\n$\\ell_1/\\ell_2$ distortion of random subspaces of $\\mathbb{R}^n$ (also with\nnearly matching lower bounds), sparse principal component analysis, and\ncertification of the $2\\rightarrow p$ norm of a random matrix.\n","authors":["Ilias Diakonikolas","Samuel B. Hopkins","Ankit Pensia","Stefan Tiegel"],"pdf_url":"https://arxiv.org/pdf/2412.21203v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.21200v1","updated":"2024-12-30T18:59:06Z","published":"2024-12-30T18:59:06Z","title":"Distributed Mixture-of-Agents for Edge Inference with Large Language\n Models","summary":" Mixture-of-Agents (MoA) has recently been proposed as a method to enhance\nperformance of large language models (LLMs), enabling multiple individual LLMs\nto work together for collaborative inference. This collaborative approach\nresults in improved responses to user prompts compared to relying on a single\nLLM. In this paper, we consider such an MoA architecture in a distributed\nsetting, where LLMs operate on individual edge devices, each uniquely\nassociated with a user and equipped with its own distributed computing power.\nThese devices exchange information using decentralized gossip algorithms,\nallowing different device nodes to talk without the supervision of a\ncentralized server. 
In the considered setup, different users have their own LLM\nmodels to address user prompts. Additionally, the devices gossip either their\nown user-specific prompts or augmented prompts to generate more refined answers\nto certain queries. User prompts are temporarily stored in the device queues\nwhen their corresponding LLMs are busy. Given the memory limitations of edge\ndevices, it is crucial to ensure that the average queue sizes in the system\nremain bounded. In this paper, we address this by theoretically calculating the\nqueuing stability conditions for the device queues under reasonable\nassumptions, which we validate experimentally as well. Further, we demonstrate\nthrough experiments, leveraging open-source LLMs for the implementation of\ndistributed MoA, that certain MoA configurations produce higher-quality\nresponses compared to others, as evaluated on AlpacaEval 2.0 benchmark. The\nimplementation is available at:\nhttps://github.com/purbeshmitra/distributed_moa.\n","authors":["Purbesh Mitra","Priyanka Kaswan","Sennur Ulukus"],"pdf_url":"https://arxiv.org/pdf/2412.21200v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.21188v1","updated":"2024-12-30T18:55:35Z","published":"2024-12-30T18:55:35Z","title":"Sparse chaos in cortical circuits","summary":" Nerve impulses, the currency of information flow in the brain, are generated\nby an instability of the neuronal membrane potential dynamics. Neuronal\ncircuits exhibit collective chaos that appears essential for learning, memory,\nsensory processing, and motor control. However, the factors controlling the\nnature and intensity of collective chaos in neuronal circuits are not well\nunderstood. Here we use computational ergodic theory to demonstrate that basic\nfeatures of nerve impulse generation profoundly affect collective chaos in\nneuronal circuits. 
Numerically exact calculations of Lyapunov spectra,\nKolmogorov-Sinai-entropy, and upper and lower bounds on attractor dimension\nshow that changes in nerve impulse generation in individual neurons moderately\nimpact information encoding rates but qualitatively transform phase space\nstructure. Specifically, we find a drastic reduction in the number of unstable\nmanifolds, Kolmogorov-Sinai entropy, and attractor dimension. Beyond a critical\npoint, marked by the simultaneous breakdown of the diffusion approximation, a\npeak in the largest Lyapunov exponent, and a localization transition of the\nleading covariant Lyapunov vector, networks exhibit sparse chaos: prolonged\nperiods of near stable dynamics interrupted by short bursts of intense chaos.\nAnalysis of large, more realistically structured networks supports the\ngenerality of these findings. In cortical circuits, biophysical properties\nappear tuned to this regime of sparse chaos. Our results reveal a close link\nbetween fundamental aspects of single-neuron biophysics and the collective\ndynamics of cortical circuits, suggesting that nerve impulse generation\nmechanisms are adapted to enhance circuit controllability and information flow.\n","authors":["Rainer Engelken","Michael Monteforte","Fred Wolf"],"pdf_url":"https://arxiv.org/pdf/2412.21188v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.21178v1","updated":"2024-12-30T18:50:37Z","published":"2024-12-30T18:50:37Z","title":"Two-component spatiotemporal template for activation-inhibition of\n speech in ECoG","summary":" I compute the average trial-by-trial power of band-limited speech activity\nacross epochs of multi-channel high-density electrocorticography (ECoG)\nrecorded from multiple subjects during a consonant-vowel speaking task. 
I show\nthat previously seen anti-correlations of average beta frequency activity\n(12-35 Hz) to high-frequency gamma activity (70-140 Hz) during speech movement\nare observable between individual ECoG channels in the sensorimotor cortex\n(SMC). With this I fit a variance-based model using principal component\nanalysis to the band-powers of individual channels of session-averaged ECoG\ndata in the SMC and project SMC channels onto their lower-dimensional principal\ncomponents.\n Spatiotemporal relationships between speech-related activity and principal\ncomponents are identified by correlating the principal components of both\nfrequency bands to individual ECoG channels over time using windowed\ncorrelation. Correlations of principal component areas to sensorimotor areas\nreveal a distinct two-component activation-inhibition-like representation for\nspeech that resembles distinct local sensorimotor areas recently shown to have\ncomplex interplay in whole-body motor control, inhibition, and posture. Notably\nthe third principal component shows insignificant correlations across all\nsubjects, suggesting two components of ECoG are sufficient to represent SMC\nactivity during speech movement.\n","authors":["Eric Easthope"],"pdf_url":"https://arxiv.org/pdf/2412.21178v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.21164v1","updated":"2024-12-30T18:43:21Z","published":"2024-12-30T18:43:21Z","title":"Adversarial Attack and Defense for LoRa Device Identification and\n Authentication via Deep Learning","summary":" LoRa provides long-range, energy-efficient communications in Internet of\nThings (IoT) applications that rely on Low-Power Wide-Area Network (LPWAN)\ncapabilities. Despite these merits, concerns persist regarding the security of\nLoRa networks, especially in situations where device identification and\nauthentication are imperative to secure the reliable access to the LoRa\nnetworks. 
This paper explores a deep learning (DL) approach to tackle these\nconcerns, focusing on two critical tasks, namely (i) identifying LoRa devices\nand (ii) classifying them as legitimate or rogue devices. Deep neural networks\n(DNNs), encompassing both convolutional and feedforward neural networks, are\ntrained for these tasks using actual LoRa signal data. In this setting, the\nadversaries may spoof rogue LoRa signals through the kernel density estimation\n(KDE) method based on legitimate device signals that are received by the\nadversaries. Two cases are considered: (i) training two separate classifiers,\none for each of the two tasks, and (ii) training a multi-task classifier for\nboth tasks. The vulnerabilities of the resulting DNNs to manipulations in input\nsamples are studied in the form of untargeted and targeted adversarial attacks\nusing the Fast Gradient Sign Method (FGSM). Individual and common perturbations\nare considered against single-task and multi-task classifiers for the LoRa\nsignal analysis. To provide resilience against such attacks, a defense approach\nis presented by increasing the robustness of classifiers with adversarial\ntraining. Results quantify how vulnerable LoRa signal classification tasks are\nto adversarial attacks and emphasize the need to fortify IoT applications\nagainst these subtle yet effective threats.\n","authors":["Yalin E. Sagduyu","Tugba Erpek"],"pdf_url":"https://arxiv.org/pdf/2412.21164v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.21156v1","updated":"2024-12-30T18:35:02Z","published":"2024-12-30T18:35:02Z","title":"Unified dimensionality reduction techniques in chronic liver disease\n detection","summary":" Globally, chronic liver disease continues to be a major health concern that\nrequires precise predictive models for prompt detection and treatment. 
Using\nthe Indian Liver Patient Dataset (ILPD) from the University of California at\nIrvine's UCI Machine Learning Repository, a number of machine learning\nalgorithms are investigated in this study. The main focus of our research is\nthis dataset, which includes the medical records of 583 patients, 416 of whom\nhave been diagnosed with liver disease and 167 of whom have not. There are\nseveral aspects to this work, including feature extraction and dimensionality\nreduction methods like Linear Discriminant Analysis (LDA), Factor Analysis\n(FA), t-distributed Stochastic Neighbour Embedding (t-SNE), and Uniform\nManifold Approximation and Projection (UMAP). The purpose of the study is to\ninvestigate how well these approaches work for converting high-dimensional\ndatasets and improving prediction accuracy. To assess the prediction ability of\nthe improved models, a number of classification methods were used, such as\nMulti-layer Perceptron, Random Forest, K-nearest neighbours, and Logistic\nRegression. Remarkably, the improved models performed admirably, with Random\nForest having the highest accuracy of 98.31\\% in 10-fold cross-validation and\n95.79\\% in train-test split evaluation. Findings offer important new\nperspectives on the choice and use of customized feature extraction and\ndimensionality reduction methods, which improve predictive models for patients\nwith chronic liver disease.\n","authors":["Anand Karna","Naina Khan","Rahul Rauniyar","Prashant Giridhar Shambharkar"],"pdf_url":"https://arxiv.org/pdf/2412.21156v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.21154v1","updated":"2024-12-30T18:33:28Z","published":"2024-12-30T18:33:28Z","title":"Aviary: training language agents on challenging scientific tasks","summary":" Solving complex real-world tasks requires cycles of actions and observations.\nThis is particularly true in science, where tasks require many cycles of\nanalysis, tool use, and experimentation. 
Language agents are promising for\nautomating intellectual tasks in science because they can interact with tools\nvia natural language or code. Yet their flexibility creates conceptual and\npractical challenges for software implementations, since agents may comprise\nnon-standard components such as internal reasoning, planning, tool usage, as\nwell as the inherent stochasticity of temperature-sampled language models.\nHere, we introduce Aviary, an extensible gymnasium for language agents. We\nformalize agents as policies solving language-grounded partially observable\nMarkov decision processes, which we term language decision processes. We then\nimplement five environments, including three challenging scientific\nenvironments: (1) manipulating DNA constructs for molecular cloning, (2)\nanswering research questions by accessing scientific literature, and (3)\nengineering protein stability. These environments were selected for their focus\non multi-step reasoning and their relevance to contemporary biology research.\nFinally, with online training and scaling inference-time compute, we show that\nlanguage agents backed by open-source, non-frontier LLMs can match and exceed\nboth frontier LLM agents and human experts on multiple tasks at up to 100x\nlower inference cost.\n","authors":["Siddharth Narayanan","James D. Braza","Ryan-Rhys Griffiths","Manu Ponnapati","Albert Bou","Jon Laurent","Ori Kabeli","Geemi Wellawatte","Sam Cox","Samuel G. Rodriques","Andrew D. White"],"pdf_url":"https://arxiv.org/pdf/2412.21154v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.21151v1","updated":"2024-12-30T18:32:05Z","published":"2024-12-30T18:32:05Z","title":"PyG-SSL: A Graph Self-Supervised Learning Toolkit","summary":" Graph Self-Supervised Learning (SSL) has emerged as a pivotal area of\nresearch in recent years. 
By engaging in pretext tasks to learn the intricate\ntopological structures and properties of graphs using unlabeled data, these\ngraph SSL models achieve enhanced performance, improved generalization, and\nheightened robustness. Despite the remarkable achievements of these graph SSL\nmethods, their current implementation poses significant challenges for\nbeginners and practitioners due to the complex nature of graph structures,\ninconsistent evaluation metrics, and concerns regarding reproducibility, all of\nwhich hinder further progress in this field. Recognizing the growing interest\nwithin the research community, there is an urgent need for a comprehensive,\nbeginner-friendly, and accessible toolkit consisting of the most representative\ngraph SSL algorithms. To address these challenges, we present a Graph SSL\ntoolkit named PyG-SSL, which is built upon PyTorch and is compatible with\nvarious deep learning and scientific computing backends. Within the toolkit, we\noffer a unified framework encompassing dataset loading, hyper-parameter\nconfiguration, model training, and comprehensive performance evaluation for\ndiverse downstream tasks. Moreover, we provide beginner-friendly tutorials and\nthe best hyper-parameters of each graph SSL algorithm on different graph\ndatasets, facilitating the reproduction of results. The GitHub repository of\nthe library is https://github.com/iDEA-iSAIL-Lab-UIUC/pyg-ssl.\n","authors":["Lecheng Zheng","Baoyu Jing","Zihao Li","Zhichen Zeng","Tianxin Wei","Mengting Ai","Xinrui He","Lihui Liu","Dongqi Fu","Jiaxuan You","Hanghang Tong","Jingrui He"],"pdf_url":"https://arxiv.org/pdf/2412.21151v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.21149v1","updated":"2024-12-30T18:29:48Z","published":"2024-12-30T18:29:48Z","title":"Functional Risk Minimization","summary":" The field of Machine Learning has changed significantly since the 1970s.\nHowever, its most basic principle, Empirical Risk Minimization (ERM), remains\nunchanged. 
We propose Functional Risk Minimization~(FRM), a general framework\nwhere losses compare functions rather than outputs. This results in better\nperformance in supervised, unsupervised, and RL experiments. In the FRM\nparadigm, for each data point $(x_i,y_i)$ there is a function $f_{\theta_i}$ that\nfits it: $y_i = f_{\theta_i}(x_i)$. This allows FRM to subsume ERM for many\ncommon loss functions and to capture more realistic noise processes. We also\nshow that FRM provides an avenue towards understanding generalization in the\nmodern over-parameterized regime, as its objective can be framed as finding the\nsimplest model that fits the training data.\n","authors":["Ferran Alet","Clement Gehring","Tomás Lozano-Pérez","Kenji Kawaguchi","Joshua B. Tenenbaum","Leslie Pack Kaelbling"],"pdf_url":"https://arxiv.org/pdf/2412.21149v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10690v3","updated":"2024-12-30T18:21:53Z","published":"2024-01-19T13:41:08Z","title":"Beyond RMSE and MAE: Introducing EAUC to unmask hidden bias and\n unfairness in dyadic regression models","summary":" Dyadic regression models, which output real-valued predictions for pairs of\nentities, are fundamental in many domains (e.g. obtaining user-product ratings\nin Recommender Systems) and promising and under exploration in others (e.g.\ntuning patient-drug dosages in personalized pharmacology). In this work, we\nprove that non-uniform observed value distributions of individual entities lead\nto severe biases in state-of-the-art models, skewing predictions towards the\naverage of observed past values for the entity and providing worse-than-random\npredictive power in eccentric yet crucial cases; we name this phenomenon\neccentricity bias. We show that global error metrics like Root Mean Squared\nError (RMSE) are insufficient to capture this bias, and we introduce\nEccentricity-Area Under the Curve (EAUC) as a novel complementary metric that\ncan quantify it in all studied domains and models. 
We prove the intuitive\ninterpretation of EAUC by experimenting with naive post-training bias\ncorrections, and theorize other options to use EAUC to guide the construction\nof fair models. This work contributes a bias-aware evaluation of dyadic\nregression to prevent unfairness in critical real-world applications of such\nsystems.\n","authors":["Jorge Paz-Ruza","Amparo Alonso-Betanzos","Bertha Guijarro-Berdiñas","Brais Cancela","Carlos Eiras-Franco"],"pdf_url":"https://arxiv.org/pdf/2401.10690v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.21132v1","updated":"2024-12-30T18:08:55Z","published":"2024-12-30T18:08:55Z","title":"DeepF-fNet: a physics-informed neural network for vibration isolation\n optimization","summary":" Structural optimization is essential for designing safe, efficient, and\ndurable components with minimal material usage. Traditional methods for\nvibration control often rely on active systems to mitigate unpredictable\nvibrations, which may lead to resonance and potential structural failure.\nHowever, these methods face significant challenges when addressing the\nnonlinear inverse eigenvalue problems required for optimizing structures\nsubjected to a wide range of frequencies. As a result, no existing approach has\neffectively addressed the need for real-time vibration suppression within this\ncontext, particularly in high-performance environments such as automotive\nnoise, vibration and harshness, where computational efficiency is crucial.\n This study introduces DeepF-fNet, a novel neural network framework designed\nto replace traditional active systems in vibration-based structural\noptimization. Leveraging DeepONets within the context of physics-informed\nneural networks, DeepF-fNet integrates both data and the governing physical\nlaws. 
This enables rapid identification of optimal parameters to suppress\ncritical vibrations at specific frequencies, offering a more efficient and\nreal-time alternative to conventional methods.\n The proposed framework is validated through a case study involving a locally\nresonant metamaterial used to isolate structures from user-defined frequency\nranges. The results demonstrate that DeepF-fNet outperforms traditional genetic\nalgorithms in terms of computational speed while achieving comparable results,\nmaking it a promising tool for vibration-sensitive applications. By replacing\nactive systems with machine learning techniques, DeepF-fNet paves the way for\nmore efficient and cost-effective structural optimization in real-world\nscenarios.\n","authors":["A. Tollardo","F. Cadini","M. Giglio","L. Lomazzi"],"pdf_url":"https://arxiv.org/pdf/2412.21132v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18263v2","updated":"2024-12-30T18:07:15Z","published":"2024-12-24T08:25:38Z","title":"High-Rank Irreducible Cartesian Tensor Decomposition and Bases of\n Equivariant Spaces","summary":" Irreducible Cartesian tensors (ICTs) play a crucial role in the design of\nequivariant graph neural networks, as well as in theoretical chemistry and\nchemical physics. Meanwhile, the design space of available linear operations on\ntensors that preserve symmetry presents a significant challenge. The ICT\ndecomposition and a basis of this equivariant space are difficult to obtain for\nhigh-order tensors. After decades of research, we recently achieve an explicit\nICT decomposition for $n=5$ \\citep{bonvicini2024irreducible} with factorial\ntime/space complexity. This work, for the first time, obtains decomposition\nmatrices for ICTs up to rank $n=9$ with reduced and affordable complexity, by\nconstructing what we call path matrices. The path matrices are obtained via\nperforming chain-like contraction with Clebsch-Gordan matrices following the\nparentage scheme. 
We prove and leverage that the concatenation of path matrices\nis an orthonormal change-of-basis matrix between the Cartesian tensor product\nspace and the spherical direct sum spaces. Furthermore, we identify a complete\northogonal basis for the equivariant space, rather than a spanning set\n\\citep{pearce2023brauer}, through this path matrices technique. We further\nextend our result to the arbitrary tensor product and direct sum spaces,\nenabling free design between different spaces while keeping symmetry. The\nPython code is available in\nhttps://github.com/ShihaoShao-GH/ICT-decomposition-and-equivariant-bases where\nthe $n=6,\\dots,9$ ICT decomposition matrices are obtained in 1s, 3s, 11s, and\n4m32s, respectively.\n","authors":["Shihao Shao","Yikang Li","Zhouchen Lin","Qinghua Cui"],"pdf_url":"https://arxiv.org/pdf/2412.18263v2.pdf","comment":"43 pages"},{"id":"http://arxiv.org/abs/2412.21124v1","updated":"2024-12-30T17:55:28Z","published":"2024-12-30T17:55:28Z","title":"Adaptive Batch Size Schedules for Distributed Training of Language\n Models with Data and Model Parallelism","summary":" An appropriate choice of batch sizes in large-scale model training is\ncrucial, yet it involves an intrinsic yet inevitable dilemma: large-batch\ntraining improves training efficiency in terms of memory utilization, while\ngeneralization performance often deteriorates due to small amounts of gradient\nnoise. Despite this dilemma, the common practice of choosing batch sizes in\nlanguage model training often prioritizes training efficiency -- employing\neither constant large sizes with data parallelism or implementing batch size\nwarmup schedules. However, such batch size schedule designs remain heuristic\nand often fail to adapt to training dynamics, presenting the challenge of\ndesigning adaptive batch size schedules. 
Given the abundance of available\ndatasets and the data-hungry nature of language models, data parallelism has\nbecome an indispensable distributed training paradigm, enabling the use of\nlarger batch sizes for gradient computation. However, vanilla data parallelism\nrequires replicas of model parameters, gradients, and optimizer states at each\nworker, which prohibits training larger models with billions of parameters. To\noptimize memory usage, more advanced parallelism strategies must be employed.\nIn this work, we propose general-purpose and theoretically principled adaptive\nbatch size schedules compatible with data parallelism and model parallelism. We\ndevelop a practical implementation with PyTorch Fully Sharded Data Parallel,\nfacilitating the pretraining of language models of different sizes. We\nempirically demonstrate that our proposed approaches outperform constant batch\nsizes and heuristic batch size warmup schedules in the pretraining of models in\nthe Llama family, with particular focus on smaller models with up to 3 billion\nparameters. We also establish theoretical convergence guarantees for such\nadaptive batch size schedules with Adam for general smooth nonconvex\nobjectives.\n","authors":["Tim Tsz-Kit Lau","Weijian Li","Chenwei Xu","Han Liu","Mladen Kolar"],"pdf_url":"https://arxiv.org/pdf/2412.21124v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2204.04512v2","updated":"2024-12-30T17:41:16Z","published":"2022-04-09T16:45:22Z","title":"Non-asymptotic spectral bounds on the $\\varepsilon$-entropy of kernel\n classes","summary":" Let $K: \\boldsymbol{\\Omega}\\times \\boldsymbol{\\Omega}$ be a continuous Mercer\nkernel defined on a compact subset of ${\\mathbb R}^n$ and $\\mathcal{H}_K$ be\nthe reproducing kernel Hilbert space (RKHS) associated with $K$. 
Given a finite\nmeasure $\nu$ on $\boldsymbol{\Omega}$, we investigate upper and lower bounds\non the $\varepsilon$-entropy of the unit ball of $\mathcal{H}_K$ in the space\n$L_p(\nu)$. This topic is an important direction in the modern statistical\ntheory of kernel-based methods.\n We prove sharp upper and lower bounds for $p\in [1,+\infty]$. For $p\in\n[1,2]$, the upper bounds are determined solely by the eigenvalue behaviour of\nthe corresponding integral operator $\phi\to \int_{\boldsymbol{\Omega}}\nK(\cdot,{\mathbf y})\phi({\mathbf y})d\nu({\mathbf y})$. In contrast, for\n$p>2$, the bounds additionally depend on the convergence rate of the truncated\nMercer series to the kernel $K$ in the $L_p(\nu)$-norm.\n We discuss a number of consequences of our bounds and show that they are\nsubstantially tighter than previous bounds for general kernels. Furthermore,\nfor specific cases, such as zonal kernels and the Gaussian kernel on a box, our\nbounds are asymptotically tight as $\varepsilon\to +0$.\n","authors":["Rustem Takhanov"],"pdf_url":"https://arxiv.org/pdf/2204.04512v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.13847v2","updated":"2024-12-30T17:13:48Z","published":"2024-09-20T18:42:04Z","title":"Segment Discovery: Enhancing E-commerce Targeting","summary":" Modern e-commerce services frequently target customers with incentives or\ninterventions to engage them in their products such as games, shopping, video\nstreaming, etc. This customer engagement increases acquisition of more\ncustomers and retention of existing ones, leading to more business for the\ncompany while improving customer experience. Often, customers are either\nrandomly targeted or targeted based on the propensity of desirable behavior.\nHowever, such policies can be suboptimal as they do not target the set of\ncustomers who would benefit the most from the intervention and they may also\nnot take account of any constraints. 
In this paper, we propose a policy\nframework based on uplift modeling and constrained optimization that identifies\ncustomers to target for a use-case specific intervention so as to maximize the\nvalue to the business, while taking account of any given constraints. We\ndemonstrate improvement over state-of-the-art targeting approaches using two\nlarge-scale experimental studies and a production implementation.\n","authors":["Qiqi Li","Roopali Singh","Charin Polpanumas","Tanner Fiez","Namita Kumar","Shreya Chakrabarti"],"pdf_url":"https://arxiv.org/pdf/2409.13847v2.pdf","comment":"Accepted at the CONSEQUENCES'24 workshop, co-located with ACM\n RecSys'24"},{"id":"http://arxiv.org/abs/2412.21084v1","updated":"2024-12-30T17:02:37Z","published":"2024-12-30T17:02:37Z","title":"On the Generalizability of Machine Learning-based Ransomware Detection\n in Block Storage","summary":" Ransomware represents a pervasive threat, traditionally countered at the\noperating system, file-system, or network levels. However, these approaches\noften introduce significant overhead and remain susceptible to circumvention by\nattackers. Recent research activity started looking into the detection of\nransomware by observing block IO operations. However, this approach exhibits\nsignificant detection challenges. Recognizing these limitations, our research\npivots towards enabling robust ransomware detection in storage systems keeping\nin mind their limited computational resources available. To perform our\nstudies, we propose a kernel-based framework capable of efficiently extracting\nand analyzing IO operations to identify ransomware activity. The framework can\nbe adopted to storage systems using computational storage devices to improve\nsecurity and fully hide detection overheads. 
Our method employs a refined set\nof computationally light features optimized for ML models to accurately discern\nmalicious from benign activities.\n Using this lightweight approach, we study a wide range of generalizability\naspects and analyze the performance of these models across a large space of\nsetups and configurations covering a wide range of realistic real-world\nscenarios. We reveal various trade-offs and provide strong arguments for the\ngeneralizability of storage-based detection of ransomware and show that our\napproach outperforms currently available ML-based ransomware detection in\nstorage. Empirical validation reveals that our decision tree-based models\nachieve remarkable effectiveness, evidenced by higher median F1 scores of up to\n12.8%, lower false negative rates of up to 10.9% and particularly decreased\nfalse positive rates of up to 17.1% compared to existing storage-based\ndetection approaches.\n","authors":["Nicolas Reategui","Roman Pletka","Dionysios Diamantopoulos"],"pdf_url":"https://arxiv.org/pdf/2412.21084v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.21082v1","updated":"2024-12-30T17:00:54Z","published":"2024-12-30T17:00:54Z","title":"Quantum Diffusion Model for Quark and Gluon Jet Generation","summary":" Diffusion models have demonstrated remarkable success in image generation,\nbut they are computationally intensive and time-consuming to train. In this\npaper, we introduce a novel diffusion model that benefits from quantum\ncomputing techniques in order to mitigate computational challenges and enhance\ngenerative performance within high energy physics data. The fully quantum\ndiffusion model replaces Gaussian noise with random unitary matrices in the\nforward process and incorporates a variational quantum circuit within the U-Net\nin the denoising architecture. We run evaluations on the structurally complex\nquark and gluon jets dataset from the Large Hadron Collider. 
The results\ndemonstrate that the fully quantum and hybrid models are competitive with a\nsimilar classical model for jet generation, highlighting the potential of using\nquantum techniques for machine learning problems.\n","authors":["Mariia Baidachna","Rey Guadarrama","Gopal Ramesh Dahale","Tom Magorsch","Isabel Pedraza","Konstantin T. Matchev","Katia Matcheva","Kyoungchul Kong","Sergei Gleyzer"],"pdf_url":"https://arxiv.org/pdf/2412.21082v1.pdf","comment":"Accepted for the NeurIPS 2024 MLNCP workshop"},{"id":"http://arxiv.org/abs/2310.03146v5","updated":"2024-12-30T16:54:13Z","published":"2023-10-04T20:18:45Z","title":"Fairness-enhancing mixed effects deep learning improves fairness on in-\n and out-of-distribution clustered (non-iid) data","summary":" Traditional deep learning (DL) models have two ubiquitous limitations. First,\nthey assume training samples are independent and identically distributed\n(i.i.d), an assumption often violated in real-world datasets where samples have\nadditional correlation due to repeat measurements (e.g., on the same\nparticipants in a longitudinal study or cells from the same sequencer). This\nleads to performance degradation, limited generalization, and covariate\nconfounding, which induces Type I and Type II errors. Second, DL models\ntypically prioritize overall accuracy, favoring accuracy on the majority while\nsacrificing performance for underrepresented subpopulations, leading to unfair,\nbiased models. This is critical to remediate, particularly in models which\ninfluence decisions regarding loan approvals and healthcare. To address these\nissues, we propose the Fair Mixed Effects Deep Learning (Fair MEDL) framework.\nThis framework quantifies cluster-invariant fixed effects (FE) and\ncluster-specific random effects (RE) through: 1) a cluster adversary for\nlearning invariant FE, 2) a Bayesian neural network for RE, and 3) a mixing\nfunction combining FE and RE for final predictions. 
Fairness is enhanced\nthrough architectural and loss function changes introduced by an adversarial\ndebiasing network. We formally define and demonstrate improved fairness across\nthree metrics: equalized odds, demographic parity, and counterfactual fairness,\nfor both classification and regression tasks. Our method also identifies and\nde-weights confounded covariates, mitigating Type I and II errors. The\nframework is comprehensively evaluated across three datasets spanning two\nindustries, including finance and healthcare. The Fair MEDL framework improves\nfairness by 86.4% for Age, 64.9% for Race, 57.8% for Sex, and 36.2% for Marital\nstatus, while maintaining robust predictive performance.\n","authors":["Son Nguyen","Adam Wang","Albert Montillo"],"pdf_url":"https://arxiv.org/pdf/2310.03146v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.21072v1","updated":"2024-12-30T16:44:11Z","published":"2024-12-30T16:44:11Z","title":"Enhanced coarsening of charge density waves induced by electron\n correlation: Machine-learning enabled large-scale dynamical simulations","summary":" The phase ordering kinetics of emergent orders in correlated electron systems\nis a fundamental topic in non-equilibrium physics, yet it remains largely\nunexplored. The intricate interplay between quasiparticles and emergent\norder-parameter fields could lead to unusual coarsening dynamics that is beyond\nthe standard theories. However, accurate treatment of both quasiparticles and\ncollective degrees of freedom is a multi-scale challenge in dynamical\nsimulations of correlated electrons. Here we leverage modern machine learning\n(ML) methods to achieve a linear-scaling algorithm for simulating the\ncoarsening of charge density waves (CDWs), one of the fundamental symmetry\nbreaking phases in functional electron materials. 
We demonstrate our approach\non the square-lattice Hubbard-Holstein model and uncover an intriguing\nenhancement of CDW coarsening which is related to the screening of on-site\npotential by electron-electron interactions. Our study provides fresh insights\ninto the role of electron correlations in non-equilibrium dynamics and\nunderscores the promise of ML force-field approaches for advancing multi-scale\ndynamical modeling of correlated electron systems.\n","authors":["Yang Yang","Chen Cheng","Yunhao Fan","Gia-Wei Chern"],"pdf_url":"https://arxiv.org/pdf/2412.21072v1.pdf","comment":"11 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.21071v1","updated":"2024-12-30T16:41:16Z","published":"2024-12-30T16:41:16Z","title":"Investigating layer-selective transfer learning of QAOA parameters for\n Max-Cut problem","summary":" Quantum approximate optimization algorithm (QAOA) is a variational quantum\nalgorithm (VQA) ideal for noisy intermediate-scale quantum (NISQ) processors,\nand is highly successful for solving combinatorial optimization problems\n(COPs). It has been observed that the optimal variational parameters obtained\nfrom one instance of a COP can be transferred to another instance, producing\nsufficiently satisfactory solutions for the latter. In this context, a suitable\nmethod for further improving the solution is to fine-tune a subset of the\ntransferred parameters. We numerically explore the role of optimizing\nindividual QAOA layers in improving the approximate solution of the Max-Cut\nproblem after parameter transfer. We also investigate the trade-off between a\ngood approximation and the required optimization time when optimizing\ntransferred QAOA parameters. These studies show that optimizing a subset of\nlayers can be more effective at a lower time-cost compared to optimizing all\nlayers.\n","authors":["Francesco Aldo Venturelli","Sreetama Das","Filippo Caruso"],"pdf_url":"https://arxiv.org/pdf/2412.21071v1.pdf","comment":"8 pages, 6 figures. 
Comments are welcome"},{"id":"http://arxiv.org/abs/2412.21069v1","updated":"2024-12-30T16:37:17Z","published":"2024-12-30T16:37:17Z","title":"Privacy-Aware Multi-Device Cooperative Edge Inference with Distributed\n Resource Bidding","summary":" Mobile edge computing (MEC) has empowered mobile devices (MDs) in supporting\nartificial intelligence (AI) applications through collaborative efforts with\nproximal MEC servers. Unfortunately, despite the great promise of device-edge\ncooperative AI inference, data privacy becomes an increasing concern. In this\npaper, we develop a privacy-aware multi-device cooperative edge inference\nsystem for classification tasks, which integrates a distributed bidding\nmechanism for the MEC server's computational resources. Intermediate feature\ncompression is adopted as a principled approach to minimize data privacy\nleakage. To determine the bidding values and feature compression ratios in a\ndistributed fashion, we formulate a decentralized partially observable Markov\ndecision process (DEC-POMDP) model, for which, a multi-agent deep deterministic\npolicy gradient (MADDPG)-based algorithm is developed. Simulation results\ndemonstrate the effectiveness of the proposed algorithm in privacy-preserving\ncooperative edge inference. Specifically, given a sufficient level of data\nprivacy protection, the proposed algorithm achieves 0.31-0.95% improvements in\nclassification accuracy compared to the approach being agnostic to the wireless\nchannel conditions. 
The performance is further enhanced by 1.54-1.67% by\nconsidering the difficulties of inference data.\n","authors":["Wenhao Zhuang","Yuyi Mao"],"pdf_url":"https://arxiv.org/pdf/2412.21069v1.pdf","comment":"This article was submitted to IEEE for possible publication"},{"id":"http://arxiv.org/abs/2412.21061v1","updated":"2024-12-30T16:30:50Z","published":"2024-12-30T16:30:50Z","title":"BridgePure: Revealing the Fragility of Black-box Data Protection","summary":" Availability attacks, or unlearnable examples, are defensive techniques that\nallow data owners to modify their datasets in ways that prevent unauthorized\nmachine learning models from learning effectively while maintaining the data's\nintended functionality. It has led to the release of popular black-box tools\nfor users to upload personal data and receive protected counterparts. In this\nwork, we show such black-box protections can be substantially bypassed if a\nsmall set of unprotected in-distribution data is available. Specifically, an\nadversary can (1) easily acquire (unprotected, protected) pairs by querying the\nblack-box protections with the unprotected dataset; and (2) train a diffusion\nbridge model to build a mapping. This mapping, termed BridgePure, can\neffectively remove the protection from any previously unseen data within the\nsame distribution. Under this threat model, our method demonstrates superior\npurification performance on classification and style mimicry tasks, exposing\ncritical vulnerabilities in black-box data protection.\n","authors":["Yihan Wang","Yiwei Lu","Xiao-Shan Gao","Gautam Kamath","Yaoliang Yu"],"pdf_url":"https://arxiv.org/pdf/2412.21061v1.pdf","comment":"26 pages,13 figures"},{"id":"http://arxiv.org/abs/2406.03852v2","updated":"2024-12-30T16:14:07Z","published":"2024-06-06T08:36:21Z","title":"Why the Metric Backbone Preserves Community Structure","summary":" The metric backbone of a weighted graph is the union of all-pairs shortest\npaths. 
It is obtained by removing all edges $(u,v)$ that are not the shortest\npath between $u$ and $v$. In networks with well-separated communities, the\nmetric backbone tends to preserve many inter-community edges, because these\nedges serve as bridges connecting two communities, but tends to delete many\nintra-community edges because the communities are dense. This suggests that the\nmetric backbone would dilute or destroy the community structure of the network.\nHowever, this is not borne out by prior empirical work, which instead showed\nthat the metric backbone of real networks preserves the community structure of\nthe original network well. In this work, we analyze the metric backbone of a\nbroad class of weighted random graphs with communities, and we formally prove\nthe robustness of the community structure with respect to the deletion of all\nthe edges that are not in the metric backbone. An empirical comparison of\nseveral graph sparsification techniques confirms our theoretical finding and\nshows that the metric backbone is an efficient sparsifier in the presence of\ncommunities.\n","authors":["Maximilien Dreveton","Charbel Chucri","Matthias Grossglauser","Patrick Thiran"],"pdf_url":"https://arxiv.org/pdf/2406.03852v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.15274v2","updated":"2024-12-30T16:12:57Z","published":"2024-10-20T04:17:59Z","title":"Physically Guided Deep Unsupervised Inversion for 1D Magnetotelluric\n Models","summary":" The global demand for unconventional energy sources such as geothermal energy\nand white hydrogen requires new exploration techniques for precise subsurface\nstructure characterization and potential reservoir identification. The\nMagnetotelluric (MT) method is crucial for these tasks, providing critical\ninformation on the distribution of subsurface electrical resistivity at depths\nranging from hundreds to thousands of meters. 
However, traditional iterative\nalgorithm-based inversion methods require the adjustment of multiple\nparameters, demanding time-consuming and exhaustive tuning processes to achieve\nproper cost function minimization. Recent advances have incorporated\ndeep learning algorithms for MT inversion, primarily based on supervised\nlearning, which need large labeled datasets for training. This work\nutilizes TensorFlow operations to create a differentiable forward MT operator,\nleveraging its automatic differentiation capability. Moreover, instead of\nsolving for the subsurface model directly, as classical algorithms perform,\nthis paper presents a new deep unsupervised inversion algorithm guided by\nphysics to estimate 1D MT models. Instead of using datasets with the observed\ndata and their respective model as labels during training, our method employs a\ndifferentiable modeling operator that physically guides the cost function\nminimization, making the proposed method solely dependent on observed data.\nTherefore, the optimization algorithm updates the network weights to\nminimize the data misfit. We test the proposed method with field and synthetic\ndata at different acquisition frequencies, demonstrating that the resistivity\nmodels obtained are more accurate than those calculated using other techniques.\n","authors":["Paul Goyes-Peñafiel","Umair bin Waheed","Henry Arguello"],"pdf_url":"https://arxiv.org/pdf/2410.15274v2.pdf","comment":"5 pages, 6 figures, github repository, submitted to IEEE-GRSL"},{"id":"http://arxiv.org/abs/2412.21052v1","updated":"2024-12-30T16:09:33Z","published":"2024-12-30T16:09:33Z","title":"Towards Effective Discrimination Testing for Generative AI","summary":" Generative AI (GenAI) models present new challenges in regulating against\ndiscriminatory behavior. 
In this paper, we argue that GenAI fairness research\nstill has not met these challenges; instead, a significant gap remains between\nexisting bias assessment methods and regulatory goals. This leads to\nineffective regulation that can allow deployment of reportedly fair, yet\nactually discriminatory, GenAI systems. Towards remedying this problem, we\nconnect the legal and technical literature around GenAI bias evaluation and\nidentify areas of misalignment. Through four case studies, we demonstrate how\nthis misalignment between fairness testing techniques and regulatory goals can\nresult in discriminatory outcomes in real-world deployments, especially in\nadaptive or complex environments. We offer practical recommendations for\nimproving discrimination testing to better align with regulatory goals and\nenhance the reliability of fairness assessments in future deployments.\n","authors":["Thomas P. Zollo","Nikita Rajaneesh","Richard Zemel","Talia B. Gillis","Emily Black"],"pdf_url":"https://arxiv.org/pdf/2412.21052v1.pdf","comment":"38 pages, 9 tables, 8 figures"},{"id":"http://arxiv.org/abs/2412.19279v2","updated":"2024-12-30T16:08:39Z","published":"2024-12-26T16:45:20Z","title":"Improving Generalization for AI-Synthesized Voice Detection","summary":" AI-synthesized voice technology has the potential to create realistic human\nvoices for beneficial applications, but it can also be misused for malicious\npurposes. While existing AI-synthesized voice detection models excel in\nintra-domain evaluation, they face challenges in generalizing across different\ndomains, potentially becoming obsolete as new voice generators emerge. Current\nsolutions use diverse data and advanced machine learning techniques (e.g.,\ndomain-invariant representation, self-supervised learning), but are limited by\npredefined vocoders and sensitivity to factors like background noise and\nspeaker identity. 
In this work, we introduce an innovative disentanglement\nframework aimed at extracting domain-agnostic artifact features related to\nvocoders. Utilizing these features, we enhance model learning in a flat loss\nlandscape, enabling escape from suboptimal solutions and improving\ngeneralization. Extensive experiments on benchmarks show our approach\noutperforms state-of-the-art methods, achieving up to 5.12% improvement in the\nequal error rate metric in intra-domain and 7.59% in cross-domain evaluations.\n","authors":["Hainan Ren","Li Lin","Chun-Hao Liu","Xin Wang","Shu Hu"],"pdf_url":"https://arxiv.org/pdf/2412.19279v2.pdf","comment":"AAAI25"},{"id":"http://arxiv.org/abs/2412.21049v1","updated":"2024-12-30T16:08:12Z","published":"2024-12-30T16:08:12Z","title":"Learning Epidemiological Dynamics via the Finite Expression Method","summary":" Modeling and forecasting the spread of infectious diseases is essential for\neffective public health decision-making. Traditional epidemiological models\nrely on expert-defined frameworks to describe complex dynamics, while neural\nnetworks, despite their predictive power, often lack interpretability due to\ntheir ``black-box\" nature. This paper introduces the Finite Expression Method,\na symbolic learning framework that leverages reinforcement learning to derive\nexplicit mathematical expressions for epidemiological dynamics. Through\nnumerical experiments on both synthetic and real-world datasets, FEX\ndemonstrates high accuracy in modeling and predicting disease spread, while\nuncovering explicit relationships among epidemiological variables. 
These\nresults highlight FEX as a powerful tool for infectious disease modeling,\ncombining interpretability with strong predictive performance to support\npractical applications in public health.\n","authors":["Jianda Du","Senwei Liang","Chunmei Wang"],"pdf_url":"https://arxiv.org/pdf/2412.21049v1.pdf","comment":"13 pages, 5 figures"},{"id":"http://arxiv.org/abs/2412.21046v1","updated":"2024-12-30T16:07:41Z","published":"2024-12-30T16:07:41Z","title":"Mind the truncation gap: challenges of learning on dynamic graphs with\n recurrent architectures","summary":" Systems characterized by evolving interactions, prevalent in social,\nfinancial, and biological domains, are effectively modeled as continuous-time\ndynamic graphs (CTDGs). To manage the scale and complexity of these graph\ndatasets, machine learning (ML) approaches have become essential. However,\nCTDGs pose challenges for ML because traditional static graph methods do not\nnaturally account for event timings. Newer approaches, such as graph recurrent\nneural networks (GRNNs), are inherently time-aware and offer advantages over\nstatic methods for CTDGs. However, GRNNs face another issue: the short\ntruncation of backpropagation-through-time (BPTT), whose impact has not been\nproperly examined until now. In this work, we demonstrate that this truncation\ncan limit the learning of dependencies beyond a single hop, resulting in\nreduced performance. Through experiments on a novel synthetic task and\nreal-world datasets, we reveal a performance gap between full\nbackpropagation-through-time (F-BPTT) and the truncated\nbackpropagation-through-time (T-BPTT) commonly used to train GRNN models. 
We\nterm this gap the \"truncation gap\" and argue that understanding and addressing\nit is essential as the importance of CTDGs grows, discussing potential future\ndirections for research in this area.\n","authors":["João Bravo","Jacopo Bono","Pedro Saleiro","Hugo Ferreira","Pedro Bizarro"],"pdf_url":"https://arxiv.org/pdf/2412.21046v1.pdf","comment":"Published in Transactions on Machine Learning Research (TMLR)"},{"id":"http://arxiv.org/abs/2405.20194v7","updated":"2024-12-30T16:04:47Z","published":"2024-05-30T15:58:22Z","title":"Occam Gradient Descent","summary":" Deep learning neural network models must be large enough to adapt to their\nproblem domain, while small enough to avoid overfitting training data during\ngradient descent. To balance these competing demands, overprovisioned deep\nlearning models such as transformers are trained for a single epoch on large\ndata sets, and hence inefficient with both computing resources and training\ndata. In response to these inefficiencies, we exploit learning theory to derive\nOccam Gradient Descent, an algorithm that interleaves adaptive reduction of\nmodel size to minimize generalization error, with gradient descent on model\nweights to minimize fitting error. In contrast, traditional gradient descent\ngreedily minimizes fitting error without regard to generalization error. Our\nalgorithm simultaneously descends the space of weights and topological size of\nany neural network without modification. 
With respect to loss, compute and\nmodel size, our experiments show (a) on image classification benchmarks, linear\nand convolutional neural networks trained with Occam Gradient Descent\noutperform traditional gradient descent with or without post-train pruning; (b)\non a range of tabular data classification tasks, neural networks trained with\nOccam Gradient Descent outperform traditional gradient descent, as well as\nRandom Forests; (c) on natural language transformers, Occam Gradient Descent\noutperforms traditional gradient descent.\n","authors":["B. N. Kausik"],"pdf_url":"https://arxiv.org/pdf/2405.20194v7.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.21035v1","updated":"2024-12-30T15:59:40Z","published":"2024-12-30T15:59:40Z","title":"Machine Learning Optimal Ordering in Global Routing Problems in\n Semiconductors","summary":" In this work, we propose a new method for ordering nets during the process of\nlayer assignment in global routing problems. The global routing problems that\nwe focus on in this work are based on routing problems that occur in the design\nof substrates in multilayered semiconductor packages. The proposed new method\nis based on machine learning techniques and we show that the proposed method\nsupersedes conventional net ordering techniques based on heuristic score\nfunctions. We perform global routing experiments in multilayered semiconductor\npackage environments in order to illustrate that the routing order based on our\nnew proposed technique outperforms previous methods based on heuristics. 
Our\napproach of using machine learning for global routing targets specifically the\nnet ordering step which we show in this work can be significantly improved by\ndeep learning.\n","authors":["Heejin Choi","Minji Lee","Chang Hyeong Lee","Jaeho Yang","Rak-Kyeong Seong"],"pdf_url":"https://arxiv.org/pdf/2412.21035v1.pdf","comment":"18 pages, 13 figures, 6 tables; published in Scientific Reports"},{"id":"http://arxiv.org/abs/2412.21030v1","updated":"2024-12-30T15:56:34Z","published":"2024-12-30T15:56:34Z","title":"Improving Location-based Thermal Emission Side-Channel Analysis Using\n Iterative Transfer Learning","summary":" This paper proposes the use of iterative transfer learning applied to deep\nlearning models for side-channel attacks. Currently, most of the side-channel\nattack methods train a model for each individual byte, without considering the\ncorrelation between bytes. However, since the models' parameters for attacking\ndifferent bytes may be similar, we can leverage transfer learning, meaning that\nwe first train the model for one of the key bytes, then use the trained model\nas a pretrained model for the remaining bytes. This technique can be applied\niteratively, a process known as iterative transfer learning. 
Experimental\nresults show that when using thermal or power consumption map images as input,\nand multilayer perceptron or convolutional neural network as the model, our\nmethod improves average performance, especially when the amount of data is\ninsufficient.\n","authors":["Tun-Chieh Lou","Chung-Che Wang","Jyh-Shing Roger Jang","Henian Li","Lang Lin","Norman Chang"],"pdf_url":"https://arxiv.org/pdf/2412.21030v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.21023v1","updated":"2024-12-30T15:46:53Z","published":"2024-12-30T15:46:53Z","title":"EdgeRAG: Online-Indexed RAG for Edge Devices","summary":" Deploying Retrieval Augmented Generation (RAG) on resource-constrained edge\ndevices is challenging due to limited memory and processing power. In this\nwork, we propose EdgeRAG which addresses the memory constraint by pruning\nembeddings within clusters and generating embeddings on-demand during\nretrieval. To avoid the latency of generating embeddings for large tail\nclusters, EdgeRAG pre-computes and stores embeddings for these clusters, while\nadaptively caching remaining embeddings to minimize redundant computations and\nfurther optimize latency. 
The result from BEIR suite shows that EdgeRAG offers\nsignificant latency reduction over the baseline IVF index, but with similar\ngeneration quality while allowing all of our evaluated datasets to fit into the\nmemory.\n","authors":["Korakit Seemakhupt","Sihang Liu","Samira Khan"],"pdf_url":"https://arxiv.org/pdf/2412.21023v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.21022v1","updated":"2024-12-30T15:44:05Z","published":"2024-12-30T15:44:05Z","title":"Text Classification: Neural Networks VS Machine Learning Models VS\n Pre-trained Models","summary":" Text classification is a very common task nowadays and there are many\nefficient methods and algorithms that we can employ to accomplish it.\nTransformers have revolutionized the field of deep learning, particularly in\nNatural Language Processing (NLP) and have rapidly expanded to other domains\nsuch as computer vision, time-series analysis and more. The transformer model\nwas firstly introduced in the context of machine translation and its\narchitecture relies on self-attention mechanisms to capture complex\nrelationships within data sequences. It is able to handle long-range\ndependencies more effectively than traditional neural networks (such as\nRecurrent Neural Networks and Multilayer Perceptrons). In this work, we present\na comparison between different techniques to perform text classification. We\ntake into consideration seven pre-trained models, three standard neural\nnetworks and three machine learning models. For standard neural networks and\nmachine learning models we also compare two embedding techniques: TF-IDF and\nGloVe, with the latter consistently outperforming the former. 
Finally, we\ndemonstrate the results from our experiments where pre-trained models such as\nBERT and DistilBERT always perform better than standard models/algorithms.\n","authors":["Christos Petridis"],"pdf_url":"https://arxiv.org/pdf/2412.21022v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.00851v2","updated":"2024-12-30T15:38:52Z","published":"2024-10-30T11:19:10Z","title":"Automatic feature selection and weighting in molecular systems using\n Differentiable Information Imbalance","summary":" Feature selection is essential in the analysis of molecular systems and many\nother fields, but several uncertainties remain: What is the optimal number of\nfeatures for a simplified, interpretable model that retains essential\ninformation? How should features with different units be aligned, and how\nshould their relative importance be weighted? Here, we introduce the\nDifferentiable Information Imbalance (DII), an automated method to rank\ninformation content between sets of features. Using distances in a ground truth\nfeature space, DII identifies a low-dimensional subset of features that best\npreserves these relationships. Each feature is scaled by a weight, which is\noptimized by minimizing the DII through gradient descent. This allows\nsimultaneously performing unit alignment and relative importance scaling, while\npreserving interpretability. DII can also produce sparse solutions and\ndetermine the optimal size of the reduced feature space. We demonstrate the\nusefulness of this approach on two benchmark molecular problems: (1)\nidentifying collective variables that describe conformations of a biomolecule,\nand (2) selecting features for training a machine-learning force field. These\nresults show the potential of DII in addressing feature selection challenges\nand optimizing dimensionality in various applications. 
The method is available\nin the Python library DADApy.\n","authors":["Romina Wild","Felix Wodaczek","Vittorio Del Tatto","Bingqing Cheng","Alessandro Laio"],"pdf_url":"https://arxiv.org/pdf/2411.00851v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.21004v1","updated":"2024-12-30T15:13:57Z","published":"2024-12-30T15:13:57Z","title":"Weber-Fechner Law in Temporal Difference learning derived from Control\n as Inference","summary":" This paper investigates a novel nonlinear update rule based on temporal\ndifference (TD) errors in reinforcement learning (RL). The update rule in the\nstandard RL states that the TD error is linearly proportional to the degree of\nupdates, treating all rewards equally without bias. On the other hand, the\nrecent biological studies revealed that there are nonlinearities in the TD\nerror and the degree of updates, biasing policies optimistic or pessimistic.\nSuch biases in learning due to nonlinearities are expected to be useful and\nintentionally leftover features in biological learning. Therefore, this\nresearch explores a theoretical framework that can leverage the nonlinearity\nbetween the degree of the update and TD errors. To this end, we focus on a\ncontrol as inference framework, since it is known as a generalized formulation\nencompassing various RL and optimal control methods. In particular, we\ninvestigate the uncomputable nonlinear term needed to be approximately excluded\nin the derivation of the standard RL from control as inference. By analyzing\nit, Weber-Fechner law (WFL) is found, namely, perception (a.k.a. the degree of\nupdates) in response to stimulus change (a.k.a. TD error) is attenuated by\nincrease in the stimulus intensity (a.k.a. the value function). 
To numerically\nreveal the utilities of WFL on RL, we then propose a practical implementation\nusing a reward-punishment framework and modifying the definition of optimality.\nAnalysis of this implementation reveals that two utilities can be expected i)\nto increase rewards to a certain level early, and ii) to sufficiently suppress\npunishment. We finally investigate and discuss the expected utilities through\nsimulations and robot experiments. As a result, the proposed RL algorithm with\nWFL shows the expected utilities that accelerate the reward-maximizing startup\nand continue to suppress punishments during learning.\n","authors":["Keiichiro Takahashi","Taisuke Kobayashi","Tomoya Yamanokuchi","Takamitsu Matsubara"],"pdf_url":"https://arxiv.org/pdf/2412.21004v1.pdf","comment":"36 pages 9 figures"},{"id":"http://arxiv.org/abs/2412.21001v1","updated":"2024-12-30T15:10:57Z","published":"2024-12-30T15:10:57Z","title":"LEASE: Offline Preference-based Reinforcement Learning with High Sample\n Efficiency","summary":" Offline preference-based reinforcement learning (PbRL) provides an effective\nway to overcome the challenges of designing reward and the high costs of online\ninteraction. However, since labeling preference needs real-time human feedback,\nacquiring sufficient preference labels is challenging. To solve this, this\npaper proposes a offLine prEference-bAsed RL with high Sample Efficiency\n(LEASE) algorithm, where a learned transition model is leveraged to generate\nunlabeled preference data. Considering the pretrained reward model may generate\nincorrect labels for unlabeled data, we design an uncertainty-aware mechanism\nto ensure the performance of reward model, where only high confidence and low\nvariance data are selected. Moreover, we provide the generalization bound of\nreward model to analyze the factors influencing reward accuracy, and\ndemonstrate that the policy learned by LEASE has theoretical improvement\nguarantee. 
The developed theory is based on state-action pair, which can be\neasily combined with other offline algorithms. The experimental results show\nthat LEASE can achieve comparable performance to baseline under fewer\npreference data without online interaction.\n","authors":["Xiao-Yin Liu","Guotao Li","Xiao-Hu Zhou","Zeng-Guang Hou"],"pdf_url":"https://arxiv.org/pdf/2412.21001v1.pdf","comment":"14 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.20993v1","updated":"2024-12-30T14:57:53Z","published":"2024-12-30T14:57:53Z","title":"Efficiently Serving LLM Reasoning Programs with Certaindex","summary":" The rapid evolution of large language models (LLMs) has unlocked their\ncapabilities in advanced reasoning tasks like mathematical problem-solving,\ncode generation, and legal analysis. Central to this progress are\ninference-time reasoning algorithms, which refine outputs by exploring multiple\nsolution paths, at the cost of increasing compute demands and response\nlatencies. Existing serving systems fail to adapt to the scaling behaviors of\nthese algorithms or the varying difficulty of queries, leading to inefficient\nresource use and unmet latency targets.\n We present Dynasor, a system that optimizes inference-time compute for LLM\nreasoning queries. Unlike traditional engines, Dynasor tracks and schedules\nrequests within reasoning queries and uses Certaindex, a proxy that measures\nstatistical reasoning progress based on model certainty, to guide compute\nallocation dynamically. Dynasor co-adapts scheduling with reasoning progress:\nit allocates more compute to hard queries, reduces compute for simpler ones,\nand terminates unpromising queries early, balancing accuracy, latency, and\ncost. 
On diverse datasets and algorithms, Dynasor reduces compute by up to 50%\nin batch processing and sustaining 3.3x higher query rates or 4.7x tighter\nlatency SLOs in online serving.\n","authors":["Yichao Fu","Junda Chen","Siqi Zhu","Zheyu Fu","Zhongdongming Dai","Aurick Qiao","Hao Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.20993v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20992v1","updated":"2024-12-30T14:57:32Z","published":"2024-12-30T14:57:32Z","title":"Verified Lifting of Deep learning Operators","summary":" Deep learning operators are fundamental components of modern deep learning\nframeworks. With the growing demand for customized operators, it has become\nincreasingly common for developers to create their own. However, designing and\nimplementing operators is complex and error-prone, due to hardware-specific\noptimizations and the need for numerical stability. There is a pressing need\nfor tools that can summarize the functionality of both existing and\nuser-defined operators. To address this gap, this work introduces a novel\nframework for the verified lifting of deep learning operators, which\nsynthesizes high-level mathematical formulas from low-level implementations.\nOur approach combines symbolic execution, syntax-guided synthesis, and\nSMT-based verification to produce readable and formally verified mathematical\nformulas. In synthesis, we employ a combination of top-down and bottom-up\nstrategies to explore the vast search space efficiently; In verification, we\ndesign invariant synthesis patterns and leverage SMT solvers to validate the\ncorrectness of the derived summaries; In simplification, we use egraph-based\ntechniques with custom rules to restore complex formulas to their natural,\nintuitive forms. Evaluated on a dataset of deep learning operators implemented\nin Triton from the real world, our method demonstrates the effectiveness of\nsynthesis and verification compared to existing techniques. 
This framework\nbridges the gap between low-level implementations and high-level abstractions,\nimproving understanding and reliability in deep learning operator development.\n","authors":["Qi Zhan","Xing Hu","Xin Xia","Shanping Li"],"pdf_url":"https://arxiv.org/pdf/2412.20992v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20987v1","updated":"2024-12-30T14:54:35Z","published":"2024-12-30T14:54:35Z","title":"RobustBlack: Challenging Black-Box Adversarial Attacks on\n State-of-the-Art Defenses","summary":" Although adversarial robustness has been extensively studied in white-box\nsettings, recent advances in black-box attacks (including transfer- and\nquery-based approaches) are primarily benchmarked against weak defenses,\nleaving a significant gap in the evaluation of their effectiveness against more\nrecent and moderate robust models (e.g., those featured in the Robustbench\nleaderboard). In this paper, we question this lack of attention from black-box\nattacks to robust models. We establish a framework to evaluate the\neffectiveness of recent black-box attacks against both top-performing and\nstandard defense mechanisms, on the ImageNet dataset. 
Our empirical evaluation\nreveals the following key findings: (1) the most advanced black-box attacks\nstruggle to succeed even against simple adversarially trained models; (2)\nrobust models that are optimized to withstand strong white-box attacks, such as\nAutoAttack, also exhibit enhanced resilience against black-box attacks; and\n(3) robustness alignment between the surrogate models and the target model\nis a key factor in the success rate of transfer-based attacks.\n","authors":["Mohamed Djilani","Salah Ghamizi","Maxime Cordy"],"pdf_url":"https://arxiv.org/pdf/2412.20987v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.12094v3","updated":"2024-12-30T14:54:29Z","published":"2024-12-16T18:58:57Z","title":"SepLLM: Accelerate Large Language Models by Compressing One Segment into\n One Separator","summary":" Large Language Models (LLMs) have exhibited exceptional performance across a\nspectrum of natural language processing tasks. However, their substantial sizes\npose considerable challenges, particularly in computational demands and\ninference speed, due to their quadratic complexity. In this work, we have\nidentified a key pattern: certain seemingly meaningless special tokens (i.e.,\nseparators) contribute disproportionately to attention scores compared to\nsemantically meaningful tokens. This observation suggests that information of\nthe segments between these separator tokens can be effectively condensed into\nthe separator tokens themselves without significant information loss. Guided by\nthis insight, we introduce SepLLM, a plug-and-play framework that accelerates\ninference by compressing these segments and eliminating redundant tokens.\nAdditionally, we implement efficient kernels for training acceleration.\nExperimental results across training-free, training-from-scratch, and\npost-training settings demonstrate SepLLM's effectiveness. 
Notably, using the\nLlama-3-8B backbone, SepLLM achieves over 50% reduction in KV cache on the\nGSM8K-CoT benchmark while maintaining comparable performance. Furthermore, in\nstreaming settings, SepLLM effectively processes sequences of up to 4 million\ntokens or more while maintaining consistent language modeling capabilities.\n","authors":["Guoxuan Chen","Han Shi","Jiawei Li","Yihang Gao","Xiaozhe Ren","Yimeng Chen","Xin Jiang","Zhenguo Li","Weiyang Liu","Chao Huang"],"pdf_url":"https://arxiv.org/pdf/2412.12094v3.pdf","comment":"We have made our code publicly available at sepllm.github.io. Our\n codebase supports efficient multi-node distributed training with accelerated\n attention module Sep-Attention and also supports numerous existing Fusion\n Operators to accelerate the training process, such as fused rope, etc. If you\n find our code helpful, please kindly consider giving us a **star** on\n GitHub^_^. Thank you very much!"},{"id":"http://arxiv.org/abs/2412.20984v1","updated":"2024-12-30T14:50:32Z","published":"2024-12-30T14:50:32Z","title":"AlignAb: Pareto-Optimal Energy Alignment for Designing Nature-Like\n Antibodies","summary":" We present a three-stage framework for training deep learning models\nspecializing in antibody sequence-structure co-design. We first pre-train a\nlanguage model using millions of antibody sequence data. Then, we employ the\nlearned representations to guide the training of a diffusion model for joint\noptimization over both sequence and structure of antibodies. During the final\nalignment stage, we optimize the model to favor antibodies with low repulsion\nand high attraction to the antigen binding site, enhancing the rationality and\nfunctionality of the designs. 
To mitigate conflicting energy preferences, we\nextend AbDPO (Antibody Direct Preference Optimization) to guide the model\ntowards Pareto optimality under multiple energy-based alignment objectives.\nFurthermore, we adopt an iterative learning paradigm with temperature scaling,\nenabling the model to benefit from diverse online datasets without requiring\nadditional data. In practice, our proposed methods achieve high stability and\nefficiency in producing a better Pareto front of antibody designs compared to\ntop samples generated by baselines and previous alignment techniques. Through\nextensive experiments, we showcase the superior performance of our methods in\ngenerating nature-like antibodies with high binding affinity consistently.\n","authors":["Yibo Wen","Chenwei Xu","Jerry Yao-Chieh Hu","Han Liu"],"pdf_url":"https://arxiv.org/pdf/2412.20984v1.pdf","comment":"30 pages"},{"id":"http://arxiv.org/abs/2412.11657v3","updated":"2024-12-30T14:39:08Z","published":"2024-12-16T11:00:02Z","title":"CNNtention: Can CNNs do better with Attention?","summary":" Convolutional Neural Networks (CNNs) have been the standard for image\nclassification tasks for a long time, but more recently attention-based\nmechanisms have gained traction. This project aims to compare traditional CNNs\nwith attention-augmented CNNs across an image classification task. By\nevaluating and comparing their performance, accuracy and computational\nefficiency, the project will highlight benefits and trade-off of the localized\nfeature extraction of traditional CNNs and the global context capture in\nattention-augmented CNNs. 
By doing this, we can reveal further insights into\ntheir respective strengths and weaknesses, guide the selection of models based\non specific application needs and, ultimately, enhance understanding of these\narchitectures in the deep learning community.\n This was our final project for the CS7643 Deep Learning course at Georgia Tech.\n","authors":["Nikhil Kapila","Julian Glattki","Tejas Rathi"],"pdf_url":"https://arxiv.org/pdf/2412.11657v3.pdf","comment":"10 pages, 11 figures"},{"id":"http://arxiv.org/abs/2411.17450v2","updated":"2024-12-30T14:31:43Z","published":"2024-11-26T14:07:48Z","title":"A Graph Neural Network deep-dive into successful counterattacks","summary":" A counterattack in soccer is a high speed, high intensity direct attack that\ncan occur when a team transitions from a defensive state to an attacking state\nafter regaining possession of the ball. The aim is to create a goal-scoring\nopportunity by covering a lot of ground with minimal passes before the\nopposing team can recover their defensive shape. The purpose of this research\nis to build gender-specific Graph Neural Networks to model the likelihood of a\ncounterattack being successful and uncover what factors make them successful in\nprofessional soccer. These models are trained on a total of 20863 frames of\nsynchronized on-ball event and spatiotemporal (broadcast) tracking data. This\ndataset is derived from 632 games of MLS (2022), NWSL (2022) and international\nsoccer (2020-2022). With this data we demonstrate that gender-specific Graph\nNeural Networks outperform architecturally identical gender-ambiguous models in\npredicting the successful outcome of counterattacks. We show, using Permutation\nFeature Importance, that byline to byline speed, angle to the goal, angle to\nthe ball and sideline to sideline speed are the node features with the highest\nimpact on model performance. 
Additionally, we offer some illustrative examples\non how to navigate the infinite solution search space to aid in identifying\nimprovements for player decision making.\n This research is accompanied by an open-source repository containing all data\nand code, and it is also accompanied by an open-source Python package which\nsimplifies converting spatiotemporal data into graphs. This package also\nfacilitates testing, validation, training and prediction with this data. This\nshould allow the reader to replicate and improve upon our research more easily.\n","authors":["Joris Bekkers","Amod Sahasrabudhe"],"pdf_url":"https://arxiv.org/pdf/2411.17450v2.pdf","comment":"11 pages, 11 figures, first submitted (and accepted) at MIT Sloan\n Sports Analytics Conference 2023"},{"id":"http://arxiv.org/abs/2410.07643v2","updated":"2024-12-30T14:18:25Z","published":"2024-10-10T06:21:32Z","title":"On Reward Transferability in Adversarial Inverse Reinforcement Learning:\n Insights from Random Matrix Theory","summary":" In the context of inverse reinforcement learning (IRL) with a single expert,\nadversarial inverse reinforcement learning (AIRL) serves as a foundational\napproach to providing comprehensive and transferable task descriptions.\nHowever, AIRL faces practical performance challenges, primarily stemming from\nthe framework's overly idealized decomposability condition, the unclear proof\nregarding the potential equilibrium in reward recovery, or questionable\nrobustness in high-dimensional environments. This paper revisits AIRL in\n\\textbf{high-dimensional scenarios where the state space tends to infinity}.\nSpecifically, we first establish a necessary and sufficient condition for\nreward transferability by examining the rank of the matrix derived from\nsubtracting the identity matrix from the transition matrix. 
Furthermore,\nleveraging random matrix theory, we analyze the spectral distribution of this\nmatrix, demonstrating that our rank criterion holds with high probability even\nwhen the transition matrices are unobservable. This suggests that the\nlimitations on transfer are not inherent to the AIRL framework itself, but are\ninstead related to the training variance of the reinforcement learning\nalgorithms employed within it. Based on this insight, we propose a hybrid\nframework that integrates on-policy proximal policy optimization in the source\nenvironment with off-policy soft actor-critic in the target environment,\nleading to significant improvements in reward transfer effectiveness.\n","authors":["Yangchun Zhang","Wang Zhou","Yirui Zhou"],"pdf_url":"https://arxiv.org/pdf/2410.07643v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.09010v5","updated":"2024-12-30T14:09:59Z","published":"2023-04-18T14:26:02Z","title":"Causal Flow-based Variational Auto-Encoder for Disentangled Causal\n Representation Learning","summary":" Disentangled representation learning aims to learn low-dimensional\nrepresentations where each dimension corresponds to an underlying generative\nfactor. While the Variational Auto-Encoder (VAE) is widely used for this\npurpose, most existing methods assume independence among factors, a\nsimplification that does not hold in many real-world scenarios where factors\nare often interdependent and exhibit causal relationships. To overcome this\nlimitation, we propose the Disentangled Causal Variational Auto-Encoder\n(DCVAE), a novel supervised VAE framework that integrates causal flows into the\nrepresentation learning process, enabling the learning of more meaningful and\ninterpretable disentangled representations. We evaluate DCVAE on both synthetic\nand real-world datasets, demonstrating its superior ability in causal\ndisentanglement and intervention experiments. 
Furthermore, DCVAE outperforms\nstate-of-the-art methods in various downstream tasks, highlighting its\npotential for learning true causal structures among factors.\n","authors":["Di Fan","Yannian Kou","Chuanhou Gao"],"pdf_url":"https://arxiv.org/pdf/2304.09010v5.pdf","comment":"22 pages, 14 figures"},{"id":"http://arxiv.org/abs/2205.14568v6","updated":"2024-12-30T14:00:24Z","published":"2022-05-29T03:52:44Z","title":"Towards Instance-Wise Calibration: Local Amortized Diagnostics and\n Reshaping of Conditional Densities (LADaR)","summary":" There is a growing interest in conditional density estimation and generative\nmodeling of a target $y$ given complex inputs $\\mathbf{x}$. However,\noff-the-shelf methods often lack instance-wise calibration -- that is, for\nindividual inputs $\\mathbf{x}$, the individual estimated probabilities can be\nvery different from the true probabilities, even when the estimates are\nreasonable when averaged over the entire population. This paper introduces the\nLADaR (Local Amortized Diagnostics and Reshaping of Conditional Densities)\nframework and proposes an algorithm called $\\texttt{Cal-PIT}$ that produces\ninterpretable local calibration diagnostics and includes a mechanism to\nrecalibrate the initial model. Our $\\texttt{Cal-PIT}$ algorithm learns a single\nlocal probability-probability map from calibration data to assess and quantify\nwhere corrections are needed across the feature space. When necessary, it\nreshapes the initial distribution into an estimate with approximate\ninstance-wise calibration. We illustrate the LADaR framework by applying\n$\\texttt{Cal-PIT}$ to synthetic examples, including probabilistic forecasting\nwith sequences of images as inputs, akin to predicting the wind speed of\ntropical cyclones from satellite imagery. Our main science application is\nconditional density estimation of galaxy distances given imaging data\n(so-called photometric redshift estimation). 
On a benchmark photometric\nredshift data challenge, $\\texttt{Cal-PIT}$ achieves better conditional density\nestimation (as measured by the conditional density estimation loss) than all 11\nother literature methods tested. This demonstrates its potential for meeting\nthe stringent photometric redshift requirements for next generation weak\ngravitational lensing analyses.\n","authors":["Biprateep Dey","David Zhao","Brett H. Andrews","Jeffrey A. Newman","Rafael Izbicki","Ann B. Lee"],"pdf_url":"https://arxiv.org/pdf/2205.14568v6.pdf","comment":"Code available as a Python package\n https://github.com/lee-group-cmu/Cal-PIT"},{"id":"http://arxiv.org/abs/2412.20962v1","updated":"2024-12-30T13:55:59Z","published":"2024-12-30T13:55:59Z","title":"Conservation-informed Graph Learning for Spatiotemporal Dynamics\n Prediction","summary":" Data-centric methods have shown great potential in understanding and\npredicting spatiotemporal dynamics, enabling better design and control of the\nobject system. However, pure deep learning models often lack interpretability,\nfail to obey intrinsic physics, and struggle to cope with the various domains.\nWhile geometry-based methods, e.g., graph neural networks (GNNs), have been\nproposed to further tackle these challenges, they still need to find the\nimplicit physical laws from large datasets and rely excessively on rich labeled\ndata. In this paper, we herein introduce the conservation-informed GNN (CiGNN),\nan end-to-end explainable learning framework, to learn spatiotemporal dynamics\nbased on limited training data. The network is designed to conform to the\ngeneral conservation law via symmetry, where conservative and non-conservative\ninformation passes over a multiscale space enhanced by a latent temporal\nmarching strategy. The efficacy of our model has been verified in various\nspatiotemporal systems based on synthetic and real-world datasets, showing\nsuperiority over baseline models. 
Results demonstrate that CiGNN exhibits\nremarkable accuracy and generalization ability, and is readily applicable to\nlearning for prediction of various spatiotemporal dynamics in a spatial domain\nwith complex geometry.\n","authors":["Yuan Mi","Pu Ren","Hongteng Xu","Hongsheng Liu","Zidong Wang","Yike Guo","Ji-Rong Wen","Hao Sun","Yang Liu"],"pdf_url":"https://arxiv.org/pdf/2412.20962v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09516v2","updated":"2024-12-30T13:53:16Z","published":"2023-10-14T07:02:54Z","title":"Efficient Link Prediction via GNN Layers Induced by Negative Sampling","summary":" Graph neural networks (GNNs) for link prediction can loosely be divided into\ntwo broad categories. First, \\emph{node-wise} architectures pre-compute\nindividual embeddings for each node that are later combined by a simple decoder\nto make predictions. While extremely efficient at inference time, model\nexpressiveness is limited such that isomorphic nodes contributing to candidate\nedges may not be distinguishable, compromising accuracy. In contrast,\n\\emph{edge-wise} methods rely on the formation of edge-specific subgraph\nembeddings to enrich the representation of pair-wise relationships,\ndisambiguating isomorphic nodes to improve accuracy, but with increased model\ncomplexity. To better navigate this trade-off, we propose a novel GNN\narchitecture whereby the \\emph{forward pass} explicitly depends on \\emph{both}\npositive (as is typical) and negative (unique to our approach) edges to inform\nmore flexible, yet still cheap node-wise embeddings. This is achieved by\nrecasting the embeddings themselves as minimizers of a forward-pass-specific\nenergy function that favors separation of positive and negative samples.\nNotably, this energy is distinct from the actual training loss shared by most\nexisting link prediction models, where contrastive pairs only influence the\n\\textit{backward pass}. 
As demonstrated by extensive empirical evaluations, the\nresulting architecture retains the inference speed of node-wise models, while\nproducing competitive accuracy with edge-wise alternatives. We released our\ncode at https://github.com/yxzwang/SubmissionverOfYinYanGNN.\n","authors":["Yuxin Wang","Xiannian Hu","Quan Gan","Xuanjing Huang","Xipeng Qiu","David Wipf"],"pdf_url":"https://arxiv.org/pdf/2310.09516v2.pdf","comment":"Accepted to TKDE. Citation information: DOI 10.1109/TKDE.2024.3481015"},{"id":"http://arxiv.org/abs/2407.07099v3","updated":"2024-12-30T13:43:46Z","published":"2024-06-18T07:46:13Z","title":"Nash CoT: Multi-Path Inference with Preference Equilibrium","summary":" Chain of thought (CoT) is a reasoning framework that can enhance the\nperformance of Large Language Models (LLMs) on complex inference tasks. In\nparticular, among various studies related to CoT, multi-path inference stands\nout as a simple yet effective improvement. However, there is no optimal setting\nfor the number of inference paths. Therefore, we have to increase the number of\ninference paths to obtain better results, which in turn increases the inference\ncost. To address this limitation, we can utilize question-related role\ntemplates to guide LLMs into relevant roles, thereby increasing the possibility\nof correct inferences for each path and further reducing dependence on the\nnumber of inference paths while improving reasoning accuracy. However, placing\nLLMs into specific roles may reduce their reasoning diversity and performance\non a few tasks where role dependence is low. 
To alleviate the excessive\nimmersion of the LLM into a specific role, we propose Nash CoT by constructing\na game system on each path that balances generation from the role-specific\nLLM against generation from the general LLM, thereby ensuring both effective role\nadoption and diversity in LLM generation, maintaining the performance of\nmulti-path inference while reducing the required number of inference\npaths. We evaluate Nash CoT across various inference tasks, including Arabic\nReasoning, Commonsense Question Answering, and Symbolic Inference, achieving\nresults that are comparable to or better than those of multi-path CoT with an\nequal number of inference paths.\n","authors":["Ziqi Zhang","Cunxiang Wang","Xiong Xiao","Yue Zhang","Donglin Wang"],"pdf_url":"https://arxiv.org/pdf/2407.07099v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.02022v2","updated":"2024-12-30T13:42:42Z","published":"2023-04-04T09:25:34Z","title":"Online Joint Assortment-Inventory Optimization under MNL Choices","summary":" We study an online joint assortment-inventory optimization problem, in which\nwe assume that the choice behavior of each customer follows the Multinomial\nLogit (MNL) choice model, and the attraction parameters are unknown a priori.\nThe retailer makes periodic assortment and inventory decisions to dynamically\nlearn from the customer choice observations about the attraction parameters\nwhile maximizing the expected total profit over time. In this paper, we propose\na novel algorithm that can effectively balance exploration and exploitation in\nthe online decision-making of assortment and inventory. Our algorithm builds on\na new estimator for the MNL attraction parameters, an innovative approach to\nincentivize exploration by adaptively tuning certain known and unknown\nparameters, and an optimization oracle to static single-cycle\nassortment-inventory planning problems with given parameters. 
We establish a\nregret upper bound for our algorithm and a lower bound for the online joint\nassortment-inventory optimization problem, suggesting that our algorithm\nachieves a nearly optimal regret rate, provided that the static optimization\noracle is exact. Then we incorporate more practical approximate static\noptimization oracles into our algorithm, and bound from above the impact of\nstatic optimization errors on the regret of our algorithm. We perform numerical\nstudies to demonstrate the effectiveness of our proposed algorithm. Finally, we\nextend our study by incorporating inventory carryover and the learning of\ncustomer arrival distribution.\n","authors":["Yong Liang","Xiaojie Mao","Shiyuan Wang"],"pdf_url":"https://arxiv.org/pdf/2304.02022v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20946v1","updated":"2024-12-30T13:38:31Z","published":"2024-12-30T13:38:31Z","title":"Generalizing in Net-Zero Microgrids: A Study with Federated PPO and TRPO","summary":" This work addresses the challenge of optimal energy management in microgrids\nthrough a collaborative and privacy-preserving framework. We propose the\nFedTRPO methodology, which integrates Federated Learning (FL) and Trust Region\nPolicy Optimization (TRPO) to manage distributed energy resources (DERs)\nefficiently. Using a customized version of the CityLearn environment and\nsynthetically generated data, we simulate designed net-zero energy scenarios\nfor microgrids composed of multiple buildings. Our approach emphasizes reducing\nenergy costs and carbon emissions while ensuring privacy. Experimental results\ndemonstrate that FedTRPO is comparable with state-of-the-art federated RL\nmethodologies without hyperparameter tuning. 
The proposed framework highlights\nthe feasibility of collaborative learning for achieving optimal control\npolicies in energy systems, advancing the goals of sustainable and efficient\nsmart grids.\n","authors":["Nicolas M Cuadrado Avila","Samuel Horváth","Martin Takáč"],"pdf_url":"https://arxiv.org/pdf/2412.20946v1.pdf","comment":"Submitted to Environmental Data Science Journal from Cambridge\n University Press"},{"id":"http://arxiv.org/abs/2409.15698v2","updated":"2024-12-30T13:28:24Z","published":"2024-09-24T03:24:31Z","title":"GISExplainer: On Explainability of Graph Neural Networks via\n Game-theoretic Interaction Subgraphs","summary":" Explainability is crucial for the application of black-box Graph Neural\nNetworks (GNNs) in critical fields such as healthcare, finance, cybersecurity,\nand more. Various feature attribution methods, especially the\nperturbation-based methods, have been proposed to indicate how much each\nnode/edge contributes to the model predictions. However, these methods fail to\ngenerate connected explanatory subgraphs that consider the causal interaction\nbetween edges within different coalition scales, which will result in\nunfaithful explanations. In our study, we propose GISExplainer, a novel\ngame-theoretic interaction based explanation method that uncovers what the\nunderlying GNNs have learned for node classification by discovering\nhuman-interpretable causal explanatory subgraphs. First, GISExplainer defines a\ncausal attribution mechanism that considers the game-theoretic interaction of\nmulti-granularity coalitions in candidate explanatory subgraph to quantify the\ncausal effect of an edge on the prediction. Second, GISExplainer assumes that\nthe coalitions with negative effects on the predictions are also significant\nfor model interpretation, and the contribution of the computation graph stems\nfrom the combined influence of both positive and negative interactions within\nthe coalitions. 
Then, GISExplainer regards the explanation task as a sequential\ndecision process, in which a salient edge is successively selected and\nconnected to the previously selected subgraph based on its causal effect to\nform an explanatory subgraph, ultimately striving for better explanations.\nAdditionally, an efficiency optimization scheme is proposed for the causal\nattribution mechanism through coalition sampling. Extensive experiments\ndemonstrate that GISExplainer achieves better performance than state-of-the-art\napproaches w.r.t. two quantitative metrics: Fidelity and Sparsity.\n","authors":["Xingping Xian","Jianlu Liu","Chao Wang","Tao Wu","Shaojie Qiao","Xiaochuan Tang","Qun Liu"],"pdf_url":"https://arxiv.org/pdf/2409.15698v2.pdf","comment":"13 pages, 7 figures"},{"id":"http://arxiv.org/abs/2412.20925v1","updated":"2024-12-30T13:12:56Z","published":"2024-12-30T13:12:56Z","title":"Active Learning with Variational Quantum Circuits for Quantum Process\n Tomography","summary":" Quantum process tomography (QPT), used for reconstruction of an unknown\nquantum process from measurement data, is a fundamental tool for the diagnostic\nand full characterization of quantum systems. It relies on querying a set of\nquantum states as input to the quantum process. Previous works commonly use a\nstraightforward strategy to select a set of quantum states randomly,\noverlooking differences in informativeness among quantum states. Since querying\nthe quantum system requires multiple experiments that can be prohibitively\ncostly, it is always the case that there are not enough quantum states for\nhigh-quality reconstruction. In this paper, we propose a general framework for\nactive learning (AL) to adaptively select a set of informative quantum states\nthat improves the reconstruction most efficiently. 
In particular, we introduce\na learning framework that leverages the widely-used variational quantum\ncircuits (VQCs) to perform the QPT task and integrate our AL algorithms into\nthe query step. We design and evaluate three types of AL algorithms:\ncommittee-based, uncertainty-based, and diversity-based, each exhibiting\ndistinct advantages in terms of performance and computational cost.\nAdditionally, we provide a guideline for selecting algorithms suitable for\ndifferent scenarios. Numerical results demonstrate that our algorithms achieve\nsignificantly improved reconstruction compared to the baseline method that\nselects a set of quantum states randomly. Moreover, these results suggest that\nactive learning based approaches are applicable to other complicated learning\ntasks in large-scale quantum information processing.\n","authors":["Jiaqi Yang","Xiaohua Xu","Wei Xie"],"pdf_url":"https://arxiv.org/pdf/2412.20925v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.15681v2","updated":"2024-12-30T13:10:11Z","published":"2024-10-21T06:43:04Z","title":"Federated Learning with MMD-based Early Stopping for Adaptive GNSS\n Interference Classification","summary":" Federated learning (FL) enables multiple devices to collaboratively train a\nglobal model while maintaining data on local servers. Each device trains the\nmodel on its local server and shares only the model updates (i.e., gradient\nweights) during the aggregation step. A significant challenge in FL is managing\nthe feature distribution of novel and unbalanced data across devices. In this\npaper, we propose an FL approach using few-shot learning and aggregation of the\nmodel weights on a global server. We introduce a dynamic early stopping method\nto balance out-of-distribution classes based on representation learning,\nspecifically utilizing the maximum mean discrepancy of feature embeddings\nbetween local and global models. 
An exemplary application of FL is to\norchestrate machine learning models along highways for interference\nclassification based on snapshots from global navigation satellite system\n(GNSS) receivers. Extensive experiments on four GNSS datasets from two\nreal-world highways and controlled environments demonstrate that our FL method\nsurpasses state-of-the-art techniques in adapting to both novel interference\nclasses and multipath scenarios.\n","authors":["Nishant S. Gaikwad","Lucas Heublein","Nisha L. Raichur","Tobias Feigl","Christopher Mutschler","Felix Ott"],"pdf_url":"https://arxiv.org/pdf/2410.15681v2.pdf","comment":"Git repository:\n https://gitlab.cc-asp.fraunhofer.de/darcy_gnss/federated_learning"},{"id":"http://arxiv.org/abs/2412.19108v2","updated":"2024-12-30T13:10:06Z","published":"2024-12-26T07:49:51Z","title":"Graph Mixture of Experts and Memory-augmented Routers for Multivariate\n Time Series Anomaly Detection","summary":" Multivariate time series (MTS) anomaly detection is a critical task that\ninvolves identifying abnormal patterns or events in data that consist of\nmultiple interrelated time series. In order to better model the complex\ninterdependence between entities and the various inherent characteristics of\neach entity, GNN-based methods are widely adopted in existing work. In\neach layer of GNN, node features aggregate information from their neighboring\nnodes to update their information. In doing so, from shallow layer to deep\nlayer in GNN, original individual node features continue to be weakened and\nmore structural information, i.e., from short-distance neighborhood to\nlong-distance neighborhood, continues to be enhanced. 
However, research to date\nhas largely ignored the understanding of how hierarchical graph information is\nrepresented and its characteristics that can benefit anomaly detection.\nExisting methods simply leverage the output from the last layer of GNN for\nanomaly estimation while neglecting the essential information contained in the\nintermediate GNN layers. To address such limitations, in this paper, we propose\na Graph Mixture of Experts (Graph-MoE) network for multivariate time series\nanomaly detection, which incorporates the mixture of experts (MoE) module to\nadaptively represent and integrate hierarchical multi-layer graph information\ninto entity representations. It is worth noting that our Graph-MoE can be\nintegrated into any GNN-based MTS anomaly detection method in a plug-and-play\nmanner. In addition, the memory-augmented routers are proposed in this paper to\ncapture correlated temporal information in terms of the global historical\nfeatures of MTS to adaptively weigh the obtained entity representations to\nachieve successful anomaly estimation. Extensive experiments on five\nchallenging datasets prove the superiority of our approach and each proposed\nmodule.\n","authors":["Xiaoyu Huang","Weidong Chen","Bo Hu","Zhendong Mao"],"pdf_url":"https://arxiv.org/pdf/2412.19108v2.pdf","comment":"Accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2412.20918v1","updated":"2024-12-30T12:57:31Z","published":"2024-12-30T12:57:31Z","title":"Uncertainty-Aware Out-of-Distribution Detection with Gaussian Processes","summary":" Deep neural networks (DNNs) are often constructed under the closed-world\nassumption, which may fail to generalize to the out-of-distribution (OOD) data.\nThis leads to DNNs producing overconfident wrong predictions and can result in\ndisastrous consequences in safety-critical applications. 
Existing OOD detection\nmethods mainly rely on curating a set of OOD data for model training or\nhyper-parameter tuning to distinguish OOD data from training data (also known\nas in-distribution data or InD data). However, OOD samples are not always\navailable during the training phase in real-world applications, hindering the\nOOD detection accuracy. To overcome this limitation, we propose a\nGaussian-process-based OOD detection method to establish a decision boundary\nbased on InD data only. The basic idea is to perform uncertainty quantification\nof the unconstrained softmax scores of a DNN via a multi-class Gaussian process\n(GP), and then define a score function to separate InD and potential OOD data\nbased on their fundamental differences in the posterior predictive distribution\nfrom the GP. Two case studies on conventional image classification datasets and\nreal-world image datasets are conducted to demonstrate that the proposed method\noutperforms the state-of-the-art OOD detection methods when OOD samples are not\nobserved in the training phase.\n","authors":["Yang Chen","Chih-Li Sung","Arpan Kusari","Xiaoyang Song","Wenbo Sun"],"pdf_url":"https://arxiv.org/pdf/2412.20918v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20899v1","updated":"2024-12-30T12:22:33Z","published":"2024-12-30T12:22:33Z","title":"DDIM sampling for Generative AIBIM, a faster intelligent structural\n design framework","summary":" Generative AIBIM, a successful structural design pipeline, has proven its\nability to intelligently generate high-quality, diverse, and creative shear\nwall designs that are tailored to specific physical conditions. However, the\ncurrent module of Generative AIBIM that generates designs, known as the\nphysics-based conditional diffusion model (PCDM), necessitates 1000 iterations\nfor each generation due to its reliance on the denoising diffusion\nprobabilistic model (DDPM) sampling process. 
This leads to a time-consuming and\ncomputationally demanding generation process. To address this issue, this study\nintroduces the denoising diffusion implicit model (DDIM), an accelerated\ngeneration method that replaces the DDPM sampling process in PCDM. While the\noriginal DDIM was designed for DDPM and the optimization process of PCDM\ndiffers from that of DDPM, this paper designs \"DDIM sampling for PCDM,\" which\nmodifies the original DDIM formulations to adapt to the optimization process of\nPCDM. Experimental results demonstrate that DDIM sampling for PCDM can\naccelerate the generation process of the original PCDM by a factor of 100 while\nmaintaining the same visual quality in the generated results. This study\neffectively showcases the effectiveness of DDIM sampling for PCDM in expediting\nintelligent structural design. Furthermore, this paper reorganizes the contents\nof DDIM, focusing on the practical usage of DDIM. This change is particularly\nmeaningful for researchers who may not possess a strong background in machine\nlearning theory but are interested in utilizing the tool effectively.\n","authors":["Zhili He","Yu-Hsing Wang"],"pdf_url":"https://arxiv.org/pdf/2412.20899v1.pdf","comment":"the 10th International Conference on Innovative Production and\n Construction (IPC 2024), Perth, Australia. https://ipcannual.com/proceedings/"},{"id":"http://arxiv.org/abs/2412.20895v1","updated":"2024-12-30T12:06:27Z","published":"2024-12-30T12:06:27Z","title":"Towards Compatible Fine-tuning for Vision-Language Model Updates","summary":" So far, efficient fine-tuning has become a popular strategy for enhancing the\ncapabilities of foundation models on downstream tasks by learning plug-and-play\nmodules. 
However, existing methods overlook a crucial issue: if the underlying\nfoundation model is updated, are these plug-and-play modules still effective?\nIn this paper, we first conduct a detailed analysis of various fine-tuning\nmethods on the CLIP in terms of their compatibility with model updates. The\nstudy reveals that many high-performing fine-tuning methods fail to be\ncompatible with the upgraded models. To address this, we propose a novel\napproach, Class-conditioned Context Optimization (ContCoOp), which integrates\nlearnable prompts with class embeddings using an attention layer before\ninputting them into the text encoder. Consequently, the prompts can dynamically\nadapt to the changes in embedding space (due to model updates), ensuring\ncontinued effectiveness. Extensive experiments over 15 datasets show that our\nContCoOp achieves the highest compatibility over the baseline methods, and\nexhibits robust out-of-distribution generalization.\n","authors":["Zhengbo Wang","Jian Liang","Lijun Sheng","Ran He","Zilei Wang","Tieniu Tan"],"pdf_url":"https://arxiv.org/pdf/2412.20895v1.pdf","comment":"preprint"},{"id":"http://arxiv.org/abs/2412.20892v1","updated":"2024-12-30T12:04:36Z","published":"2024-12-30T12:04:36Z","title":"Rethinking Aleatoric and Epistemic Uncertainty","summary":" The ideas of aleatoric and epistemic uncertainty are widely used to reason\nabout the probabilistic predictions of machine-learning models. We identify\nincoherence in existing discussions of these ideas and suggest this stems from\nthe aleatoric-epistemic view being insufficiently expressive to capture all of\nthe distinct quantities that researchers are interested in. To explain and\naddress this we derive a simple delineation of different model-based\nuncertainties and the data-generating processes associated with training and\nevaluation. 
Using this in place of the aleatoric-epistemic view could produce\nclearer discourse as the field moves forward.\n","authors":["Freddie Bickford Smith","Jannik Kossen","Eleanor Trollope","Mark van der Wilk","Adam Foster","Tom Rainforth"],"pdf_url":"https://arxiv.org/pdf/2412.20892v1.pdf","comment":"Presented at the Workshop on Bayesian Decision-Making and Uncertainty\n (NeurIPS 2024)"},{"id":"http://arxiv.org/abs/2409.10242v2","updated":"2024-12-30T12:03:37Z","published":"2024-09-16T12:45:03Z","title":"Hedging Is Not All You Need: A Simple Baseline for Online Learning Under\n Haphazard Inputs","summary":" Handling haphazard streaming data, such as data from edge devices, presents a\nchallenging problem. Over time, the incoming data becomes inconsistent, with\nmissing, faulty, or new inputs reappearing. Therefore, it requires models that\nare reliable. Recent methods to solve this problem depend on a hedging-based\nsolution and require specialized elements like auxiliary dropouts, forked\narchitectures, and intricate network design. We observed that hedging can be\nreduced to a special case of weighted residual connection; this motivated us to\napproximate it with plain self-attention. In this work, we propose HapNet, a\nsimple baseline that is scalable, does not require online backpropagation, and\nis adaptable to varying input types. All present methods are restricted to\nscaling with a fixed window; however, we introduce a more complex problem of\nscaling with a variable window where the data becomes positionally\nuncorrelated, and cannot be addressed by present methods. We demonstrate that a\nvariant of the proposed approach can work even for this complex scenario. We\nextensively evaluated the proposed approach on five benchmarks and found\ncompetitive performance.\n","authors":["Himanshu Buckchash","Momojit Biswas","Rohit Agarwal","Dilip K. 
Prasad"],"pdf_url":"https://arxiv.org/pdf/2409.10242v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20891v1","updated":"2024-12-30T12:00:47Z","published":"2024-12-30T12:00:47Z","title":"DoTA: Weight-Decomposed Tensor Adaptation for Large Language Models","summary":" Low-rank adaptation (LoRA) reduces the computational and memory demands of\nfine-tuning large language models (LLMs) by approximating updates with low-rank\nmatrices. However, low-rank approximation in two-dimensional space fails to\ncapture high-dimensional structures within the target matrix. Recently, tensor\ndecomposition methods have been explored for fine-tuning LLMs, leveraging their\nability to extract structured information. Yet, these approaches primarily rely\non random initialization, and the impact of initialization on tensor adaptation\nremains underexplored. In this paper, we reveal that random initialization\nsignificantly diverges from the validation loss achieved by full fine-tuning.\nTo address this, we propose Weight-Decomposed Tensor Adaptation (DoTA), which\nleverages the Matrix Product Operator (MPO) decomposition of pre-trained\nweights for effective initialization in fine-tuning LLMs. Additionally, we\nintroduce QDoTA, a quantized version of DoTA designed for 4-bit quantization.\nExperiments on commonsense and arithmetic reasoning tasks show that DoTA\noutperforms random initialization methods with fewer parameters. QDoTA further\nreduces memory consumption and achieves comparable performance to DoTA on\ncommonsense reasoning tasks. 
We will release our code to support future\nresearch.\n","authors":["Xiaolin Hu","Xiang Cheng","Peiyu Liu","Wei Liu","Jian Luan","Bin Wang","Yong Liu"],"pdf_url":"https://arxiv.org/pdf/2412.20891v1.pdf","comment":"12 pages, 6 figures"},{"id":"http://arxiv.org/abs/2412.20885v1","updated":"2024-12-30T11:52:39Z","published":"2024-12-30T11:52:39Z","title":"CF-CGN: Channel Fingerprints Extrapolation for Multi-band Massive MIMO\n Transmission based on Cycle-Consistent Generative Networks","summary":" Multi-band massive multiple-input multiple-output (MIMO) communication can\npromote the cooperation of licensed and unlicensed spectra, effectively\nenhancing spectrum efficiency for Wi-Fi and other wireless systems. As an\nenabler for multi-band transmission, channel fingerprints (CF), also known as\nthe channel knowledge map or radio environment map, are used to assist channel\nstate information (CSI) acquisition and reduce computational complexity. In\nthis paper, we propose CF-CGN (Channel Fingerprints with Cycle-consistent\nGenerative Networks) to extrapolate CF for multi-band massive MIMO transmission\nwhere licensed and unlicensed spectra cooperate to provide ubiquitous\nconnectivity. Specifically, we first model CF as a multichannel image and\ntransform the extrapolation problem into an image translation task, which\nconverts CF from one frequency to another by exploring the shared\ncharacteristics of statistical CSI in the beam domain. Then, paired generative\nnetworks are designed and coupled by variable-weight cycle consistency losses\nto fit the reciprocal relationship at different bands. Matched with the coupled\nnetworks, a joint training strategy is developed accordingly, supporting\nsynchronous optimization of all trainable parameters. During the inference\nprocess, we also introduce a refining scheme to improve the extrapolation\naccuracy based on the resolution of CF. 
Numerical results illustrate that our\nproposed CF-CGN can achieve bidirectional extrapolation with an error of 5-17\ndB lower than the benchmarks in different communication scenarios,\ndemonstrating its excellent generalization ability. We further show that the\nsum rate performance assisted by CF-CGN-based CF is close to that with perfect\nCSI for multi-band massive MIMO transmission.\n","authors":["Chenjie Xie","Li You","Zhenzhou Jin","Jinke Tang","Xiqi Gao","Xiang-Gen Xia"],"pdf_url":"https://arxiv.org/pdf/2412.20885v1.pdf","comment":"13 pages, 12 figures"},{"id":"http://arxiv.org/abs/2410.09567v3","updated":"2024-12-30T11:46:45Z","published":"2024-10-12T15:29:18Z","title":"Timeseria: an object-oriented time series processing library","summary":" Timeseria is an object-oriented time series processing library implemented in\nPython, which aims at making it easier to manipulate time series data and to\nbuild statistical and machine learning models on top of it. Unlike common data\nanalysis frameworks, it builds up from well defined and reusable logical units\n(objects), which can be easily combined together in order to ensure a high\nlevel of consistency. Thanks to this approach, Timeseria can address by design\nseveral non-trivial issues which are often underestimated, such as handling\ndata losses, non-uniform sampling rates, differences between aggregated data\nand punctual observations, time zones, daylight saving times, and more.\nTimeseria comes with a comprehensive set of base data structures, data\ntransformations for resampling and aggregation, common data manipulation\noperations, and extensible models for data reconstruction, forecasting and\nanomaly detection. 
It also integrates a fully featured, interactive plotting\nengine capable of handling even millions of data points.\n","authors":["Stefano Alberto Russo","Giuliano Taffoni","Luca Bortolussi"],"pdf_url":"https://arxiv.org/pdf/2410.09567v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.16489v2","updated":"2024-12-30T11:28:31Z","published":"2024-05-26T08:55:22Z","title":"Causal-aware Graph Neural Architecture Search under Distribution Shifts","summary":" Graph NAS has emerged as a promising approach for autonomously designing GNN\narchitectures by leveraging the correlations between graphs and architectures.\nExisting methods fail to generalize under distribution shifts that are\nubiquitous in real-world graph scenarios, mainly because the graph-architecture\ncorrelations they exploit might be spurious and varying across distributions.\nWe propose to handle the distribution shifts in the graph architecture search\nprocess by discovering and exploiting the causal relationship between graphs\nand architectures to search for the optimal architectures that can generalize\nunder distribution shifts. The problem remains unexplored with following\nchallenges: how to discover the causal graph-architecture relationship that has\nstable predictive abilities across distributions, and how to handle\ndistribution shifts with the discovered causal graph-architecture relationship\nto search the generalized graph architectures. To address these challenges, we\npropose Causal-aware Graph Neural Architecture Search (CARNAS), which is able\nto capture the causal graph-architecture relationship during the architecture\nsearch process and discover the generalized graph architecture under\ndistribution shifts. Specifically, we propose Disentangled Causal Subgraph\nIdentification to capture the causal subgraphs that have stable prediction\nabilities across distributions. 
Then, we propose Graph Embedding Intervention\nto intervene on causal subgraphs within the latent space, ensuring that these\nsubgraphs encapsulate essential features for prediction while excluding\nnon-causal elements. Additionally, we propose Invariant Architecture\nCustomization to reinforce the causal invariant nature of the causal subgraphs,\nwhich are utilized to tailor generalized graph architectures. Extensive\nexperiments demonstrate that CARNAS achieves advanced out-of-distribution\ngeneralization ability.\n","authors":["Peiwen Li","Xin Wang","Zeyang Zhang","Yijian Qin","Ziwei Zhang","Jialong Wang","Yang Li","Wenwu Zhu"],"pdf_url":"https://arxiv.org/pdf/2405.16489v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20868v1","updated":"2024-12-30T11:11:35Z","published":"2024-12-30T11:11:35Z","title":"Machine Learning of Slow Collective Variables and Enhanced Sampling via\n Spatial Techniques","summary":" Understanding the long-time dynamics of complex physical processes depends on\nour ability to recognize patterns. To simplify the description of these\nprocesses, we often introduce a set of reaction coordinates, customarily\nreferred to as collective variables (CVs). The quality of these CVs heavily\nimpacts our comprehension of the dynamics, often influencing the estimates of\nthermodynamics and kinetics from atomistic simulations. Consequently,\nidentifying CVs poses a fundamental challenge in chemical physics. Recently,\nsignificant progress was made by leveraging the predictive ability of\nunsupervised machine learning techniques to determine CVs. Many of these\ntechniques require temporal information to learn slow CVs that correspond to\nthe long timescale behavior of the studied process. Here, however, we\nspecifically focus on techniques that can identify CVs corresponding to the\nslowest transitions between states without needing temporal trajectories as\ninput, instead using the spatial characteristics of the data. 
We discuss the\nlatest developments in this category of techniques and briefly discuss\npotential directions for thermodynamics-informed spatial learning of slow CVs.\n","authors":["Tuğçe Gökdemir","Jakub Rydzewski"],"pdf_url":"https://arxiv.org/pdf/2412.20868v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20864v1","updated":"2024-12-30T11:07:05Z","published":"2024-12-30T11:07:05Z","title":"Enhancing Annotated Bibliography Generation with LLM Ensembles","summary":" This work proposes a novel approach to enhancing annotated bibliography\ngeneration through Large Language Model (LLM) ensembles. In particular,\nmultiple LLMs in different roles -- controllable text generation, evaluation,\nand summarization -- are introduced and validated using a systematic\nmethodology to enhance model performance in scholarly tasks. Output diversity\namong the ensemble that generates text is obtained using different LLM\nparameters, followed by an LLM acting as a judge to assess relevance, accuracy,\nand coherence. Responses selected by several combining strategies are then\nmerged and refined through summarization and redundancy removal techniques. The\npreliminary experimental validation demonstrates that the combined outputs from\nthe LLM ensemble improve coherence and relevance compared to individual\nresponses, leading to a 38% improvement in annotation quality and a 51%\nreduction in content redundancy, thus highlighting the potential for automating\ncomplex scholarly tasks while maintaining high-quality standards.\n","authors":["Sergio Bermejo"],"pdf_url":"https://arxiv.org/pdf/2412.20864v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.06691v3","updated":"2024-12-30T11:05:20Z","published":"2024-09-10T17:54:28Z","title":"Geometric-Averaged Preference Optimization for Soft Preference Labels","summary":" Many algorithms for aligning LLMs with human preferences assume that human\npreferences are binary and deterministic. 
However, human preferences can vary\nacross individuals, and therefore should be represented distributionally. In\nthis work, we introduce the distributional soft preference labels and improve\nDirect Preference Optimization (DPO) with a weighted geometric average of the\nLLM output likelihood in the loss function. This approach adjusts the scale of\nlearning loss based on the soft labels such that the loss would approach zero\nwhen the responses are closer to equally preferred. This simple modification\ncan be easily applied to any DPO-based methods and mitigate over-optimization\nand objective mismatch, which prior works suffer from. Our experiments simulate\nthe soft preference labels with AI feedback from LLMs and demonstrate that\ngeometric averaging consistently improves performance on standard benchmarks\nfor alignment research. In particular, we observe more preferable responses\nthan binary labels and significant improvements where modestly-confident labels\nare in the majority.\n","authors":["Hiroki Furuta","Kuang-Huei Lee","Shixiang Shane Gu","Yutaka Matsuo","Aleksandra Faust","Heiga Zen","Izzeddin Gur"],"pdf_url":"https://arxiv.org/pdf/2409.06691v3.pdf","comment":"Accepted at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2411.08638v2","updated":"2024-12-30T11:03:39Z","published":"2024-11-13T14:26:04Z","title":"Gaussian Mixture Models Based Augmentation Enhances GNN Generalization","summary":" Graph Neural Networks (GNNs) have shown great promise in tasks like node and\ngraph classification, but they often struggle to generalize, particularly to\nunseen or out-of-distribution (OOD) data. These challenges are exacerbated when\ntraining data is limited in size or diversity. To address these issues, we\nintroduce a theoretical framework using Rademacher complexity to compute a\nregret bound on the generalization error and then characterize the effect of\ndata augmentation. 
This framework informs the design of GMM-GDA, an efficient\ngraph data augmentation (GDA) algorithm leveraging the capability of Gaussian\nMixture Models (GMMs) to approximate any distribution. Our approach not only\noutperforms existing augmentation techniques in terms of generalization but\nalso offers improved time complexity, making it highly suitable for real-world\napplications.\n","authors":["Yassine Abbahaddou","Fragkiskos D. Malliaros","Johannes F. Lutzeyer","Amine Mohamed Aboussalah","Michalis Vazirgiannis"],"pdf_url":"https://arxiv.org/pdf/2411.08638v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.03862v2","updated":"2024-12-30T11:01:26Z","published":"2024-07-04T11:50:24Z","title":"FedSat: A Statistical Aggregation Approach for Class Imbalanced Clients\n in Federated Learning","summary":" Federated learning (FL) has emerged as a promising paradigm for\nprivacy-preserving distributed machine learning, but faces challenges with\nheterogeneous data distributions across clients. This paper presents FedSat, a\nnovel FL approach specifically designed to simultaneously handle three forms of\ndata heterogeneity, namely label skewness, missing classes, and quantity\nskewness, by proposing a prediction-sensitive loss function and a\nprioritized-class based weighted aggregation scheme. While the\nprediction-sensitive loss function enhances model performance on minority\nclasses, the prioritized-class based weighted aggregation scheme ensures client\ncontributions are weighted based on both statistical significance and\nperformance on critical classes. Extensive experiments across diverse\ndata-heterogeneity settings demonstrate that FedSat significantly outperforms\nstate-of-the-art baselines, with an average improvement of 1.8% over the\nsecond-best method and 19.87% over the weakest-performing baseline. 
The\napproach also demonstrates faster convergence compared to existing methods.\nThese results highlight FedSat's effectiveness in addressing the challenges of\nheterogeneous federated learning and its potential for real-world applications.\n","authors":["Sujit Chowdhury","Raju Halder"],"pdf_url":"https://arxiv.org/pdf/2407.03862v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18387v2","updated":"2024-12-30T11:00:35Z","published":"2024-12-24T12:20:24Z","title":"Scaling Capability in Token Space: An Analysis of Large Vision Language\n Model","summary":" The scaling capability has been widely validated in neural language models\nwith respect to the number of parameters and the size of training data.\n One important question is that does the scaling capability also exists\nsimilarly with respect to the number of vision tokens in large vision language\nModel?\n This study fills the gap by investigating the relationship between the number\nof vision tokens and the performance on vision-language models.\n Our theoretical analysis and empirical evaluations demonstrate that the model\nexhibits scalable performance \\(S(N_l)\\) with respect to the number of vision\ntokens \\(N_l\\), characterized by the relationship \\(S(N_l) \\approx\n(c/N_l)^{\\alpha}\\).\n Furthermore, we also investigate the impact of a fusion mechanism that\nintegrates the user's question with vision tokens.\n The results reveal two key findings.\n First, the scaling capability remains intact with the incorporation of the\nfusion mechanism.\n Second, the fusion mechanism enhances model performance, particularly when\nthe user's question is task-specific and relevant.\n The analysis, conducted on fifteen diverse benchmarks spanning a broad range\nof tasks and domains, validates the effectiveness of the proposed approach.\n","authors":["Tenghui Li","Guoxu Zhou","Xuyang Zhao","Qibin 
Zhao"],"pdf_url":"https://arxiv.org/pdf/2412.18387v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.16726v4","updated":"2024-12-30T11:00:27Z","published":"2024-02-26T16:48:12Z","title":"Towards Empirical Interpretation of Internal Circuits and Properties in\n Grokked Transformers on Modular Polynomials","summary":" Grokking has been actively explored to reveal the mystery of delayed\ngeneralization and identifying interpretable representations and algorithms\ninside the grokked models is a suggestive hint to understanding its mechanism.\nGrokking on modular addition has been known to implement Fourier representation\nand its calculation circuits with trigonometric identities in Transformers.\nConsidering the periodicity in modular arithmetic, the natural question is to\nwhat extent these explanations and interpretations hold for the grokking on\nother modular operations beyond addition. For a closer look, we first\nhypothesize that any modular operations can be characterized with distinctive\nFourier representation or internal circuits, grokked models obtain common\nfeatures transferable among similar operations, and mixing datasets with\nsimilar operations promotes grokking. Then, we extensively examine them by\nlearning Transformers on complex modular arithmetic tasks, including\npolynomials. Our Fourier analysis and novel progress measure for modular\narithmetic, Fourier Frequency Density and Fourier Coefficient Ratio,\ncharacterize distinctive internal representations of grokked models per modular\noperation; for instance, polynomials often result in the superposition of the\nFourier components seen in elementary arithmetic, but clear patterns do not\nemerge in challenging non-factorizable polynomials. 
In contrast, our ablation\nstudy on the pre-grokked models reveals that the transferability among the\nmodels grokked with each operation can be only limited to specific\ncombinations, such as from elementary arithmetic to linear expressions.\nMoreover, some multi-task mixtures may lead to co-grokking -- where grokking\nsimultaneously happens for all the tasks -- and accelerate generalization,\nwhile others may not find optimal solutions. We provide empirical steps towards\nthe interpretability of internal circuits.\n","authors":["Hiroki Furuta","Gouki Minegishi","Yusuke Iwasawa","Yutaka Matsuo"],"pdf_url":"https://arxiv.org/pdf/2402.16726v4.pdf","comment":"Published at Transactions on Machine Learning Research (TMLR), Code:\n https://github.com/frt03/grok_mod_poly"},{"id":"http://arxiv.org/abs/2307.13918v3","updated":"2024-12-30T10:53:48Z","published":"2023-07-26T02:34:57Z","title":"Simulation-based Inference for Cardiovascular Models","summary":" Over the past decades, hemodynamics simulators have steadily evolved and have\nbecome tools of choice for studying cardiovascular systems in-silico. While\nsuch tools are routinely used to simulate whole-body hemodynamics from\nphysiological parameters, solving the corresponding inverse problem of mapping\nwaveforms back to plausible physiological parameters remains both promising and\nchallenging. Motivated by advances in simulation-based inference (SBI), we cast\nthis inverse problem as statistical inference. In contrast to alternative\napproaches, SBI provides \\textit{posterior distributions} for the parameters of\ninterest, providing a \\textit{multi-dimensional} representation of uncertainty\nfor \\textit{individual} measurements. We showcase this ability by performing an\nin-silico uncertainty analysis of five biomarkers of clinical interest\ncomparing several measurement modalities. 
Beyond the corroboration of known\nfacts, such as the feasibility of estimating heart rate, our study highlights\nthe potential of estimating new biomarkers from standard-of-care measurements.\nSBI reveals practically relevant findings that cannot be captured by standard\nsensitivity analyses, such as the existence of sub-populations for which\nparameter estimation exhibits distinct uncertainty regimes. Finally, we study\nthe gap between in-vivo and in-silico with the MIMIC-III waveform database and\ncritically discuss how cardiovascular simulations can inform real-world data\nanalysis.\n","authors":["Antoine Wehenkel","Laura Manduchi","Jens Behrmann","Luca Pegolotti","Andrew C. Miller","Guillermo Sapiro","Ozan Sener","Marco Cuturi","Jörn-Henrik Jacobsen"],"pdf_url":"https://arxiv.org/pdf/2307.13918v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20851v1","updated":"2024-12-30T10:42:28Z","published":"2024-12-30T10:42:28Z","title":"About rectified sigmoid function for enhancing the accuracy of\n Physics-Informed Neural Networks","summary":" The article is devoted to the study of neural networks with one hidden layer\nand a modified activation function for solving physical problems. A rectified\nsigmoid activation function has been proposed to solve physical problems\ndescribed by the ODE with neural networks. Algorithms for physics-informed\ndata-driven initialization of a neural network and a neuron-by-neuron\ngradient-free fitting method have been presented for the neural network with\nthis activation function. Numerical experiments demonstrate the superiority of\nneural networks with a rectified sigmoid function over neural networks with a\nsigmoid function in the accuracy of solving physical problems (harmonic\noscillator, relativistic slingshot, and Lorentz system).\n","authors":["Vasiliy A. Es'kin","Alexey O. Malkhanov","Mikhail E. Smorkalov"],"pdf_url":"https://arxiv.org/pdf/2412.20851v1.pdf","comment":"9 pages, 1 figure, 2 tables, 4 algthorithms. 
arXiv admin note:\n substantial text overlap with arXiv:2412.19235"},{"id":"http://arxiv.org/abs/2310.12595v3","updated":"2024-12-30T10:33:44Z","published":"2023-10-19T09:03:41Z","title":"Bayesian Meta-Learning for Improving Generalizability of Health\n Prediction Models With Similar Causal Mechanisms","summary":" Machine learning strategies like multi-task learning, meta-learning, and\ntransfer learning enable efficient adaptation of machine learning models to\nspecific applications in healthcare, such as prediction of various diseases, by\nleveraging generalizable knowledge across large datasets and multiple domains.\nIn particular, Bayesian meta-learning methods pool data across related\nprediction tasks to learn prior distributions for model parameters, which are\nthen used to derive models for specific tasks. However, inter- and intra-task\nvariability due to disease heterogeneity and other patient-level differences\npose challenges of negative transfer during shared learning and poor\ngeneralizability to new patients. We introduce a novel Bayesian meta-learning\napproach that aims to address this in two key settings: (1) predictions for new\npatients (same population as the training set) and (2) adapting to new patient\npopulations. Our main contribution is in modeling similarity between causal\nmechanisms of the tasks, for (1) mitigating negative transfer during training\nand (2) fine-tuning that pools information from tasks that are expected to aid\ngeneralizability. We propose an algorithm for implementing this approach for\nBayesian deep learning, and apply it to a case study for stroke prediction\ntasks using electronic health record data. Experiments for the UK Biobank\ndataset as the training population demonstrated significant generalizability\nimprovements compared to standard meta-learning, non-causal task similarity\nmeasures, and local baselines (separate models for each task). 
This was\nassessed for a variety of tasks that considered both new patients from the\ntraining population (UK Biobank) and a new population (FinnGen).\n","authors":["Sophie Wharrie","Lisa Eick","Lotta Mäkinen","Andrea Ganna","Samuel Kaski"," FinnGen"],"pdf_url":"https://arxiv.org/pdf/2310.12595v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20844v1","updated":"2024-12-30T10:24:30Z","published":"2024-12-30T10:24:30Z","title":"Acquisition-Independent Deep Learning for Quantitative MRI Parameter\n Estimation using Neural Controlled Differential Equations","summary":" Deep learning has proven to be a suitable alternative to least-squares (LSQ)\nfitting for parameter estimation in various quantitative MRI (QMRI) models.\nHowever, current deep learning implementations are not robust to changes in MR\nacquisition protocols. In practice, QMRI acquisition protocols differ\nsubstantially between different studies and clinical settings. The lack of\ngeneralizability and adoptability of current deep learning approaches for QMRI\nparameter estimation impedes the implementation of these algorithms in clinical\ntrials and clinical practice. Neural Controlled Differential Equations (NCDEs)\nallow for the sampling of incomplete and irregularly sampled data with variable\nlength, making them ideal for use in QMRI parameter estimation. In this study,\nwe show that NCDEs can function as a generic tool for the accurate prediction\nof QMRI parameters, regardless of QMRI sequence length, configuration of\nindependent variables and QMRI forward model (variable flip angle T1-mapping,\nintravoxel incoherent motion MRI, dynamic contrast-enhanced MRI). NCDEs\nachieved lower mean squared error than LSQ fitting in low-SNR simulations and\nin vivo in challenging anatomical regions like the abdomen and leg, but this\nimprovement was no longer evident at high SNR. 
NCDEs reduce estimation error\ninterquartile range without increasing bias, particularly under conditions of\nhigh uncertainty. These findings suggest that NCDEs offer a robust approach for\nreliable QMRI parameter estimation, especially in scenarios with high\nuncertainty or low image quality. We believe that with NCDEs, we have solved\none of the main challenges for using deep learning for QMRI parameter\nestimation in a broader clinical and research setting.\n","authors":["Daan Kuppens","Sebastiano Barbieri","Daisy van den Berg","Pepijn Schouten","Harriet C. Thoeny","Myrte Wennen","Oliver J. Gurney-Champion"],"pdf_url":"https://arxiv.org/pdf/2412.20844v1.pdf","comment":"29 pages, 10 figures, 7 supplementary figures, pre-print"},{"id":"http://arxiv.org/abs/2412.20838v1","updated":"2024-12-30T10:06:02Z","published":"2024-12-30T10:06:02Z","title":"Dual-Space Augmented Intrinsic-LoRA for Wind Turbine Segmentation","summary":" Accurate segmentation of wind turbine blade (WTB) images is critical for\neffective assessments, as it directly influences the performance of automated\ndamage detection systems. Despite advancements in large universal vision\nmodels, these models often underperform in domain-specific tasks like WTB\nsegmentation. To address this, we extend Intrinsic LoRA for image segmentation,\nand propose a novel dual-space augmentation strategy that integrates both\nimage-level and latent-space augmentations. The image-space augmentation is\nachieved through linear interpolation between image pairs, while the\nlatent-space augmentation is accomplished by introducing a noise-based latent\nprobabilistic model. 
Our approach significantly boosts segmentation accuracy,\nsurpassing current state-of-the-art methods in WTB image segmentation.\n","authors":["Shubh Singhal","Raül Pérez-Gonzalo","Andreas Espersen","Antonio Agudo"],"pdf_url":"https://arxiv.org/pdf/2412.20838v1.pdf","comment":"Authors Shubh Singhal and Ra\\\"ul P\\'erez-Gonzalo contributed equally\n to this work. Accepted to ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.19585v2","updated":"2024-12-30T09:51:38Z","published":"2024-12-27T11:03:26Z","title":"Ultralight Signal Classification Model for Automatic Modulation\n Recognition","summary":" The growing complexity of radar signals demands responsive and accurate\ndetection systems that can operate efficiently on resource-constrained edge\ndevices. Existing models, while effective, often rely on substantial\ncomputational resources and large datasets, making them impractical for edge\ndeployment. In this work, we propose an ultralight hybrid neural network\noptimized for edge applications, delivering robust performance across\nunfavorable signal-to-noise ratios (mean accuracy of 96.3% at 0 dB) using less\nthan 100 samples per class, and significantly reducing computational overhead.\n","authors":["Alessandro Daniele Genuardi Oquendo","Agustín Matías Galante Cerviño","Nilotpal Kanti Sinha","Luc Andrea","Sam Mugel","Román Orús"],"pdf_url":"https://arxiv.org/pdf/2412.19585v2.pdf","comment":"8 pages, 8 figures"},{"id":"http://arxiv.org/abs/2412.19792v2","updated":"2024-12-30T09:37:33Z","published":"2024-12-27T18:45:36Z","title":"InfAlign: Inference-aware language model alignment","summary":" Language model alignment has become a critical step in training modern\ngenerative language models. The goal of alignment is to finetune a reference\nmodel such that the win rate of a sample from the aligned model over a sample\nfrom the reference model is high, subject to a KL divergence constraint. 
Today,\nwe are increasingly using inference-time algorithms (e.g., Best-of-N,\ncontrolled decoding, tree search) to decode from language models rather than\nstandard sampling. However, the alignment objective does not capture such\ninference-time decoding procedures. We show that the existing alignment\nframework is sub-optimal in view of such inference-time methods. We then modify\nthe alignment objective and propose a framework for inference-aware alignment\n(IAPO). We prove that for any inference-time decoding algorithm, the optimal\nsolution that optimizes the inference-time win rate of the aligned policy\nagainst the reference policy is the solution to the typical RLHF problem with a\ntransformation of the reward. This motivates us to provide the KL-regularized\ncalibrate-and-transform RL (CTRL) algorithm to solve this problem, which\ninvolves a reward calibration step and a KL-regularized reward maximization\nstep with a transformation of the calibrated reward. We particularize our study\nto two important inference-time strategies: best-of-N sampling and best-of-N\njailbreaking, where N responses are sampled from the model and the one with the\nhighest or lowest reward is selected. We propose specific transformations for\nthese strategies and demonstrate that our framework offers significant\nimprovements over existing state-of-the-art methods for language model\nalignment. 
Empirically, we outperform baselines that are designed without\ntaking inference-time decoding into consideration by 8-12% and 4-9% on\ninference-time win rates over the Anthropic helpfulness and harmlessness dialog\nbenchmark datasets.\n","authors":["Ananth Balashankar","Ziteng Sun","Jonathan Berant","Jacob Eisenstein","Michael Collins","Adrian Hutter","Jong Lee","Chirag Nagpal","Flavien Prost","Aradhana Sinha","Ananda Theertha Suresh","Ahmad Beirami"],"pdf_url":"https://arxiv.org/pdf/2412.19792v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20824v1","updated":"2024-12-30T09:35:45Z","published":"2024-12-30T09:35:45Z","title":"Isoperimetry is All We Need: Langevin Posterior Sampling for RL with\n Sublinear Regret","summary":" In Reinforcement Learning (RL) theory, we impose restrictive assumptions to\ndesign an algorithm with provably sublinear regret. Common assumptions, like\nlinear or RKHS models, and Gaussian or log-concave posteriors over the models,\ndo not explain practical success of RL across a wider range of distributions\nand models. Thus, we study how to design RL algorithms with sublinear regret\nfor isoperimetric distributions, specifically the ones satisfying the\nLog-Sobolev Inequality (LSI). LSI distributions include the standard setups of\nRL and others, such as many non-log-concave and perturbed distributions. First,\nwe show that the Posterior Sampling-based RL (PSRL) yields sublinear regret if\nthe data distributions satisfy LSI under some mild additional assumptions.\nAlso, when we cannot compute or sample from an exact posterior, we propose a\nLangevin sampling-based algorithm design: LaPSRL. We show that LaPSRL achieves\norder optimal regret and subquadratic complexity per episode. Finally, we\ndeploy LaPSRL with a Langevin sampler -- SARAH-LD, and test it for different\nbandit and MDP environments. 
Experimental results validate the generality of\nLaPSRL across environments and its competitive performance with respect to the\nbaselines.\n","authors":["Emilio Jorge","Christos Dimitrakakis","Debabrota Basu"],"pdf_url":"https://arxiv.org/pdf/2412.20824v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20810v1","updated":"2024-12-30T09:06:47Z","published":"2024-12-30T09:06:47Z","title":"TimeRAF: Retrieval-Augmented Foundation model for Zero-shot Time Series\n Forecasting","summary":" Time series forecasting plays a crucial role in data mining, driving rapid\nadvancements across numerous industries. With the emergence of large models,\ntime series foundation models (TSFMs) have exhibited remarkable generalization\ncapabilities, such as zero-shot learning, through large-scale pre-training.\nMeanwhile, Retrieval-Augmented Generation (RAG) methods have been widely\nemployed to enhance the performance of foundation models on unseen data,\nallowing models to access external knowledge. In this paper, we introduce\nTimeRAF, a Retrieval-Augmented Forecasting model that enhances zero-shot time\nseries forecasting through retrieval-augmented techniques. We develop\ncustomized time series knowledge bases that are tailored to the specific\nforecasting tasks. TimeRAF employs an end-to-end learnable retriever to extract\nvaluable information from the knowledge base. Additionally, we propose Channel\nPrompting for knowledge integration, which effectively extracts relevant\ninformation from the retrieved knowledge along the channel dimension. 
Extensive\nexperiments demonstrate the effectiveness of our model, showing significant\nimprovement across various domains and datasets.\n","authors":["Huanyu Zhang","Chang Xu","Yi-Fan Zhang","Zhang Zhang","Liang Wang","Jiang Bian","Tieniu Tan"],"pdf_url":"https://arxiv.org/pdf/2412.20810v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.15700v2","updated":"2024-12-30T09:00:55Z","published":"2024-12-20T09:18:30Z","title":"AIR: Unifying Individual and Collective Exploration in Cooperative\n Multi-Agent Reinforcement Learning","summary":" Exploration in cooperative multi-agent reinforcement learning (MARL) remains\nchallenging for value-based agents due to the absence of an explicit policy.\nExisting approaches include individual exploration based on uncertainty towards\nthe system and collective exploration through behavioral diversity among\nagents. However, the introduction of additional structures often leads to\nreduced training efficiency and infeasible integration of these methods. In\nthis paper, we propose Adaptive exploration via Identity Recognition~(AIR),\nwhich consists of two adversarial components: a classifier that recognizes\nagent identities from their trajectories, and an action selector that\nadaptively adjusts the mode and degree of exploration. We theoretically prove\nthat AIR can facilitate both individual and collective exploration during\ntraining, and experiments also demonstrate the efficiency and effectiveness of\nAIR across various tasks.\n","authors":["Guangchong Zhou","Zeren Zhang","Guoliang Fan"],"pdf_url":"https://arxiv.org/pdf/2412.15700v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20802v1","updated":"2024-12-30T08:49:41Z","published":"2024-12-30T08:49:41Z","title":"Robust Matrix Completion for Discrete Rating-Scale Data","summary":" Matrix completion has gained considerable interest in recent years. 
The goal\nof matrix completion is to predict the unknown entries of a partially observed\nmatrix using its known entries. Although common applications feature discrete\nrating-scale data, such as user-product rating matrices in recommender systems\nor surveys in the social and behavioral sciences, methods for matrix completion\nare almost always designed for and studied in the context of continuous data.\nFurthermore, only a small subset of the literature considers matrix completion\nin the presence of corrupted observations despite their common occurrence in\npractice. Examples include attacks on recommender systems (i.e., malicious\nusers deliberately manipulating ratings to influence the recommender system to\ntheir advantage), or careless respondents in surveys (i.e., respondents\nproviding answers irrespective of what the survey asks of them due to a lack of\nattention). We introduce a matrix completion algorithm that is tailored towards\nthe discrete nature of rating-scale data and robust to the presence of\ncorrupted observations. In addition, we investigate the performance of the\nproposed method and its competitors with discrete rating-scale (rather than\ncontinuous) data as well as under various missing data mechanisms and types of\ncorrupted observations.\n","authors":["Aurore Archimbaud","Andreas Alfons","Ines Wilms"],"pdf_url":"https://arxiv.org/pdf/2412.20802v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20796v1","updated":"2024-12-30T08:38:09Z","published":"2024-12-30T08:38:09Z","title":"FastCHGNet: Training one Universal Interatomic Potential to 1.5 Hours\n with 32 GPUs","summary":" Graph neural network universal interatomic potentials (GNN-UIPs) have\ndemonstrated remarkable generalization and transfer capabilities in material\ndiscovery and property prediction. 
These models can accelerate molecular\ndynamics (MD) simulation by several orders of magnitude while maintaining\n\\textit{ab initio} accuracy, making them a promising new paradigm in material\nsimulations. One notable example is Crystal Hamiltonian Graph Neural Network\n(CHGNet), pretrained on the energies, forces, stresses, and magnetic moments\nfrom the MPtrj dataset, representing a state-of-the-art GNN-UIP model for\ncharge-informed MD simulations. However, training the CHGNet model is\ntime-consuming (8.3 days on one A100 GPU) for three reasons: (i) requiring\nmulti-layer propagation to reach more distant atom information, (ii) requiring\nsecond-order derivatives calculation to finish weights updating and (iii) the\nimplementation of reference CHGNet does not fully leverage the computational\ncapabilities. This paper introduces FastCHGNet, an optimized CHGNet, with three\ncontributions: Firstly, we design innovative Force/Stress Readout modules to\ndecompose Force/Stress prediction. Secondly, we adopt massive optimizations\nsuch as kernel fusion, redundancy bypass, etc., to exploit GPU computation power\nsufficiently. Finally, we extend CHGNet to support multiple GPUs and propose a\nload-balancing technique to enhance GPU utilization. Numerical results show\nthat FastCHGNet reduces memory footprint by a factor of 3.59. 
The final\ntraining time of FastCHGNet can be decreased to \\textbf{1.53 hours} on 32 GPUs\nwithout sacrificing model accuracy.\n","authors":["Yuanchang Zhou","Siyu Hu","Chen Wang","Lin-Wang Wang","Guangming Tan","Weile Jia"],"pdf_url":"https://arxiv.org/pdf/2412.20796v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.07687v2","updated":"2024-12-30T08:29:09Z","published":"2024-12-10T17:20:47Z","title":"Privacy-Preserving Customer Support: A Framework for Secure and Scalable\n Interactions","summary":" The growing reliance on artificial intelligence (AI) in customer support has\nsignificantly improved operational efficiency and user experience. However,\ntraditional machine learning (ML) approaches, which require extensive local\ntraining on sensitive datasets, pose substantial privacy risks and compliance\nchallenges with regulations like the General Data Protection Regulation (GDPR)\nand California Consumer Privacy Act (CCPA). Existing privacy-preserving\ntechniques, such as anonymization, differential privacy, and federated\nlearning, address some concerns but face limitations in utility, scalability,\nand complexity. This paper introduces the Privacy-Preserving Zero-Shot Learning\n(PP-ZSL) framework, a novel approach leveraging large language models (LLMs) in\na zero-shot learning mode. Unlike conventional ML methods, PP-ZSL eliminates\nthe need for local training on sensitive data by utilizing pre-trained LLMs to\ngenerate responses directly. The framework incorporates real-time data\nanonymization to redact or mask sensitive information, retrieval-augmented\ngeneration (RAG) for domain-specific query resolution, and robust\npost-processing to ensure compliance with regulatory standards. This\ncombination reduces privacy risks, simplifies compliance, and enhances\nscalability and operational efficiency. 
Empirical analysis demonstrates that\nthe PP-ZSL framework provides accurate, privacy-compliant responses while\nsignificantly lowering the costs and complexities of deploying AI-driven\ncustomer support systems. The study highlights potential applications across\nindustries, including financial services, healthcare, e-commerce, legal\nsupport, telecommunications, and government services. By addressing the dual\nchallenges of privacy and performance, this framework establishes a foundation\nfor secure, efficient, and regulatory-compliant AI applications in customer\ninteractions.\n","authors":["Anant Prakash Awasthi","Girdhar Gopal Agarwal","Chandraketu Singh","Rakshit Varma","Sanchit Sharma"],"pdf_url":"https://arxiv.org/pdf/2412.07687v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20790v1","updated":"2024-12-30T08:12:17Z","published":"2024-12-30T08:12:17Z","title":"Frequency-Masked Embedding Inference: A Non-Contrastive Approach for\n Time Series Representation Learning","summary":" Contrastive learning underpins most current self-supervised time series\nrepresentation methods. The strategy for constructing positive and negative\nsample pairs significantly affects the final representation quality. However,\ndue to the continuous nature of time series semantics, the modeling approach of\ncontrastive learning struggles to accommodate the characteristics of time\nseries data. This results in issues such as difficulties in constructing hard\nnegative samples and the potential introduction of inappropriate biases during\npositive sample construction. Although some recent works have developed several\nscientific strategies for constructing positive and negative sample pairs with\nimproved effectiveness, they remain constrained by the contrastive learning\nframework. 
To fundamentally overcome the limitations of contrastive learning,\nthis paper introduces Frequency-masked Embedding Inference (FEI), a novel\nnon-contrastive method that completely eliminates the need for positive and\nnegative samples. The proposed FEI constructs 2 inference branches based on a\nprompting strategy: 1) Using frequency masking as prompts to infer the\nembedding representation of the target series with missing frequency bands in\nthe embedding space, and 2) Using the target series as prompts to infer its\nfrequency masking embedding. In this way, FEI enables continuous semantic\nrelationship modeling for time series. Experiments on 8 widely used time series\ndatasets for classification and regression tasks, using linear evaluation and\nend-to-end fine-tuning, show that FEI significantly outperforms existing\ncontrastive-based methods in terms of generalization. This study provides new\ninsights into self-supervised representation learning for time series. The code\nis available at\nhttps://github.com/USTBInnovationPark/Frequency-masked-Embedding-Inference.\n","authors":["En Fu","Yanyan Hu"],"pdf_url":"https://arxiv.org/pdf/2412.20790v1.pdf","comment":"This paper has been accepted by AAAI-2025 main track"},{"id":"http://arxiv.org/abs/2412.20785v1","updated":"2024-12-30T08:10:21Z","published":"2024-12-30T08:10:21Z","title":"Accelerating Energy-Efficient Federated Learning in Cell-Free Networks\n with Adaptive Quantization","summary":" Federated Learning (FL) enables clients to share learning parameters instead\nof local data, reducing communication overhead. Traditional wireless networks\nface latency challenges with FL. In contrast, Cell-Free Massive MIMO (CFmMIMO)\ncan serve multiple clients on shared resources, boosting spectral efficiency\nand reducing latency for large-scale FL. However, clients' communication\nresource limitations can hinder the completion of the FL training. 
To address\nthis challenge, we propose an energy-efficient, low-latency FL framework\nfeaturing optimized uplink power allocation for seamless client-server\ncollaboration. Our framework employs an adaptive quantization scheme,\ndynamically adjusting bit allocation for local gradient updates to reduce\ncommunication costs. We formulate a joint optimization problem covering FL\nmodel updates, local iterations, and power allocation, solved using sequential\nquadratic programming (SQP) to balance energy and latency. Additionally,\nclients use the AdaDelta method for local FL model updates, enhancing local\nmodel convergence compared to standard SGD, and we provide a comprehensive\nanalysis of FL convergence with AdaDelta local updates. Numerical results show\nthat, within the same energy and latency budgets, our power allocation scheme\noutperforms the Dinkelbach and max-sum rate methods by increasing the test\naccuracy up to $7$\\% and $19$\\%, respectively. Moreover, for the three power\nallocation methods, our proposed quantization scheme outperforms AQUILA and LAQ\nby increasing test accuracy by up to $36$\\% and $35$\\%, respectively.\n","authors":["Afsaneh Mahmoudi","Ming Xiao","Emil Björnson"],"pdf_url":"https://arxiv.org/pdf/2412.20785v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08285v4","updated":"2024-12-30T07:44:37Z","published":"2024-12-11T11:00:33Z","title":"Adaptive Prompting for Continual Relation Extraction: A Within-Task\n Variance Perspective","summary":" To address catastrophic forgetting in Continual Relation Extraction (CRE),\nmany current approaches rely on memory buffers to rehearse previously learned\nknowledge while acquiring new tasks. Recently, prompt-based methods have\nemerged as potent alternatives to rehearsal-based strategies, demonstrating\nstrong empirical performance. 
However, upon analyzing existing prompt-based\napproaches for CRE, we identified several critical limitations, such as\ninaccurate prompt selection, inadequate mechanisms for mitigating forgetting in\nshared parameters, and suboptimal handling of cross-task and within-task\nvariances. To overcome these challenges, we draw inspiration from the\nrelationship between prefix-tuning and mixture of experts, proposing a novel\napproach that employs a prompt pool for each task, capturing variations within\neach task while enhancing cross-task variances. Furthermore, we incorporate a\ngenerative model to consolidate prior knowledge within shared parameters,\neliminating the need for explicit data storage. Extensive experiments validate\nthe efficacy of our approach, demonstrating superior performance over\nstate-of-the-art prompt-based and rehearsal-free methods in continual relation\nextraction.\n","authors":["Minh Le","Tien Ngoc Luu","An Nguyen The","Thanh-Thien Le","Trang Nguyen","Tung Thanh Nguyen","Linh Ngo Van","Thien Huu Nguyen"],"pdf_url":"https://arxiv.org/pdf/2412.08285v4.pdf","comment":"Accepted to AAAI 2025"},{"id":"http://arxiv.org/abs/2409.16450v2","updated":"2024-12-30T07:35:25Z","published":"2024-09-24T20:34:47Z","title":"A Multi-Agent Multi-Environment Mixed Q-Learning for Partially\n Decentralized Wireless Network Optimization","summary":" Q-learning is a powerful tool for network control and policy optimization in\nwireless networks, but it struggles with large state spaces. Recent\nadvancements, like multi-environment mixed Q-learning (MEMQ), improves\nperformance and reduces complexity by integrating multiple Q-learning\nalgorithms across multiple related environments so-called digital cousins.\nHowever, MEMQ is designed for centralized single-agent networks and is not\nsuitable for decentralized or multi-agent networks. 
To address this challenge,\nwe propose a novel multi-agent MEMQ algorithm for partially decentralized\nwireless networks with multiple mobile transmitters (TXs) and base stations\n(BSs), where TXs do not have access to each other's states and actions. In\nuncoordinated states, TXs act independently to minimize their individual costs.\nIn coordinated states, TXs use a Bayesian approach to estimate the joint state\nbased on local observations and share limited information with leader TX to\nminimize joint cost. The cost of information sharing scales linearly with the\nnumber of TXs and is independent of the joint state-action space size. The\nproposed scheme is 50% faster than centralized MEMQ with only a 20% increase in\naverage policy error (APE) and is 25% faster than several advanced\ndecentralized Q-learning algorithms with 40% less APE. The convergence of the\nalgorithm is also demonstrated.\n","authors":["Talha Bozkus","Urbashi Mitra"],"pdf_url":"https://arxiv.org/pdf/2409.16450v2.pdf","comment":"Accepted to 2025 IEEE International Conference on Acoustics, Speech,\n and Signal Processing (ICASSP 2025)"},{"id":"http://arxiv.org/abs/2406.05395v2","updated":"2024-12-30T07:20:27Z","published":"2024-06-08T08:12:41Z","title":"Dynamic Importance Learning using Fisher Information Matrix (FIM) for\n Nonlinear Dynamic Mapping","summary":" Understanding output variance is critical in modeling nonlinear dynamic\nsystems, as it reflects the system's sensitivity to input variations and\nfeature interactions. This work presents a methodology for dynamically\ndetermining relevance scores in black-box models while ensuring\ninterpretability through an embedded decision module. This interpretable\nmodule, integrated into the first layer of the model, employs the Fisher\nInformation Matrix (FIM) and logistic regression to compute relevance scores,\ninterpreted as the probabilities of input neurons being active based on their\ncontribution to the output variance. 
The proposed method leverages a\ngradient-based framework to uncover the importance of variance-driven features,\ncapturing both individual contributions and complex feature interactions. These\nrelevance scores are applied through element-wise transformations of the\ninputs, enabling the black-box model to prioritize features dynamically based\non their impact on system output. This approach effectively bridges\ninterpretability with the intricate modeling of nonlinear dynamics and\ntime-dependent interactions. Simulation results demonstrate the method's\nability to infer feature interactions while achieving superior performance in\nfeature relevance compared to existing techniques. The practical utility of\nthis approach is showcased through its application to an industrial pH\nneutralization process, where critical system dynamics are uncovered.\n","authors":["Vahid MohammadZadeh Eivaghi","Mahdi Aliyari Shoorehdeli"],"pdf_url":"https://arxiv.org/pdf/2406.05395v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20762v1","updated":"2024-12-30T07:15:21Z","published":"2024-12-30T07:15:21Z","title":"Enhancing Privacy in Federated Learning through Quantum Teleportation\n Integration","summary":" Federated learning enables collaborative model training across multiple\nclients without sharing raw data, thereby enhancing privacy. However, the\nexchange of model updates can still expose sensitive information. Quantum\nteleportation, a process that transfers quantum states between distant\nlocations without physical transmission of the particles themselves, has\nrecently been implemented in real-world networks. This position paper explores\nthe potential of integrating quantum teleportation into federated learning\nframeworks to bolster privacy. By leveraging quantum entanglement and the\nno-cloning theorem, quantum teleportation ensures that data remains secure\nduring transmission, as any eavesdropping attempt would be detectable. 
We\npropose a novel architecture where quantum teleportation facilitates the secure\nexchange of model parameters and gradients among clients and servers. This\nintegration aims to mitigate risks associated with data leakage and adversarial\nattacks inherent in classical federated learning setups. We also discuss the\npractical challenges of implementing such a system, including the current\nlimitations of quantum network infrastructure and the need for hybrid\nquantum-classical protocols. Our analysis suggests that, despite these\nchallenges, the convergence of quantum communication technologies and federated\nlearning presents a promising avenue for achieving unprecedented levels of\nprivacy in distributed machine learning.\n","authors":["Koffka Khan"],"pdf_url":"https://arxiv.org/pdf/2412.20762v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20749v1","updated":"2024-12-30T06:43:22Z","published":"2024-12-30T06:43:22Z","title":"Solar Filaments Detection using Active Contours Without Edges","summary":" In this article, an active contours without edges (ACWE)-based algorithm has\nbeen proposed for the detection of solar filaments in H-alpha full-disk solar\nimages. The overall algorithm consists of three main steps of image processing.\nThese are image pre-processing, image segmentation, and image post-processing.\nHere in the work, contours are initialized on the solar image and allowed to\ndeform based on the energy function. As soon as the contour reaches the\nboundary of the desired object, the energy function gets reduced, and the\ncontour stops evolving. The proposed algorithm has been applied to few\nbenchmark datasets and has been compared with the classical technique of object\ndetection. 
The results analysis indicates that the proposed algorithm\noutperforms the results obtained using the existing classical algorithm of\nobject detection.\n","authors":["Sanmoy Bandyopadhyay","Vaibhav Pant"],"pdf_url":"https://arxiv.org/pdf/2412.20749v1.pdf","comment":"6 pages, 2 figures"},{"id":"http://arxiv.org/abs/2412.20744v1","updated":"2024-12-30T06:36:05Z","published":"2024-12-30T06:36:05Z","title":"Advancing Parkinson's Disease Progression Prediction: Comparing Long\n Short-Term Memory Networks and Kolmogorov-Arnold Networks","summary":" Parkinson's Disease (PD) is a degenerative neurological disorder that impairs\nmotor and non-motor functions, significantly reducing quality of life and\nincreasing mortality risk. Early and accurate detection of PD progression is\nvital for effective management and improved patient outcomes. Current\ndiagnostic methods, however, are often costly, time-consuming, and require\nspecialized equipment and expertise. This work proposes an innovative approach\nto predicting PD progression using regression methods, Long Short-Term Memory\n(LSTM) networks, and Kolmogorov Arnold Networks (KAN). KAN, utilizing\nspline-parametrized univariate functions, allows for dynamic learning of\nactivation patterns, unlike traditional linear models.\n The Movement Disorder Society-Sponsored Revision of the Unified Parkinson's\nDisease Rating Scale (MDS-UPDRS) is a comprehensive tool for evaluating PD\nsymptoms and is commonly used to measure disease progression. Additionally,\nprotein or peptide abnormalities are linked to PD onset and progression.\nIdentifying these associations can aid in predicting disease progression and\nunderstanding molecular changes.\n Comparing multiple models, including LSTM and KAN, this study aims to\nidentify the method that delivers the highest metrics. The analysis reveals\nthat KAN, with its dynamic learning capabilities, outperforms other approaches\nin predicting PD progression. 
This research highlights the potential of AI and\nmachine learning in healthcare, paving the way for advanced computational\nmodels to enhance clinical predictions and improve patient care and treatment\nstrategies in PD management.\n","authors":["Abhinav Roy","Bhavesh Gyanchandani","Aditya Oza","Abhishek Sharma"],"pdf_url":"https://arxiv.org/pdf/2412.20744v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.12530v2","updated":"2024-12-30T06:16:19Z","published":"2024-10-16T13:10:04Z","title":"Disentangling data distribution for Federated Learning","summary":" Federated Learning (FL) facilitates collaborative training of a global model\nwhose performance is boosted by private data owned by distributed clients,\nwithout compromising data privacy. Yet the wide applicability of FL is hindered\nby entanglement of data distributions across different clients. This paper\ndemonstrates for the first time that by disentangling data distributions FL can\nin principle achieve efficiencies comparable to those of distributed systems,\nrequiring only one round of communication. To this end, we propose a novel\nFedDistr algorithm, which employs stable diffusion models to decouple and\nrecover data distributions. Empirical results on the CIFAR100 and DomainNet\ndatasets show that FedDistr significantly enhances model utility and efficiency\nin both disentangled and near-disentangled scenarios while ensuring privacy,\noutperforming traditional federated learning methods.\n","authors":["Xinyuan Zhao","Hanlin Gu","Lixin Fan","Yuxing Han","Qiang Yang"],"pdf_url":"https://arxiv.org/pdf/2410.12530v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20732v1","updated":"2024-12-30T06:06:45Z","published":"2024-12-30T06:06:45Z","title":"Joint Scoring Rules: Zero-Sum Competition Avoids Performative Prediction","summary":" In a decision-making scenario, a principal could use conditional predictions\nfrom an expert agent to inform their choice. 
However, this approach would\nintroduce a fundamental conflict of interest. An agent optimizing for\npredictive accuracy is incentivized to manipulate their principal towards more\npredictable actions, which prevents that principal from being able to\ndeterministically select their true preference. We demonstrate that this\nimpossibility result can be overcome through the joint evaluation of multiple\nagents. When agents are made to engage in zero-sum competition, their incentive\nto influence the action taken is eliminated, and the principal can identify and\ntake the action they most prefer. We further prove that this zero-sum setup is\nunique, efficiently implementable, and applicable under stochastic choice.\nExperiments in a toy environment demonstrate that training on a zero-sum\nobjective significantly enhances both predictive accuracy and principal\nutility, and can eliminate previously learned manipulative behavior.\n","authors":["Rubi Hudson"],"pdf_url":"https://arxiv.org/pdf/2412.20732v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20727v1","updated":"2024-12-30T05:56:25Z","published":"2024-12-30T05:56:25Z","title":"AverageLinear: Enhance Long-Term Time series forcasting with simple\n averaging","summary":" Long-term time series analysis aims to forecast long-term trends by examining\nchanges over past and future periods. The intricacy of time series data poses\nsignificant challenges for modeling. Models based on the Transformer\narchitecture, through the application of attention mechanisms to channels and\nsequences, have demonstrated notable performance advantages. In contrast,\nmethods based on convolutional neural networks or linear models often struggle\nto effectively handle scenarios with large number of channels. However, our\nresearch reveals that the attention mechanism is not the core component\nresponsible for performance enhancement. We have designed an exceedingly simple\nlinear structure AverageLinear. 
By employing straightforward channel embedding\nand averaging operations, this model can effectively capture correlations\nbetween channels while maintaining a lightweight architecture. Experiments on\nreal-world datasets show that AverageLinear matches or even surpasses\nstate-of-the-art Transformer-based structures in performance. This indicates\nthat using purely linear structures can also endow models with robust\npredictive power.\n","authors":["Gaoxiang Zhao","Li Zhou","Xiaoqiang Wang"],"pdf_url":"https://arxiv.org/pdf/2412.20727v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20724v1","updated":"2024-12-30T05:53:17Z","published":"2024-12-30T05:53:17Z","title":"Training Deep Neural Classifiers with Soft Diamond Regularizers","summary":" We introduce new \emph{soft diamond} regularizers that both improve synaptic\nsparsity and maintain classification accuracy in deep neural networks. These\nparametrized regularizers outperform the state-of-the-art hard-diamond\nLaplacian regularizer of Lasso regression and classification. They use\nthick-tailed symmetric alpha-stable ($\mathcal{S \alpha S}$) bell-curve\nsynaptic weight priors that are not Gaussian and so have thicker tails. The\ngeometry of the diamond-shaped constraint set varies from a circle to a star\ndepending on the tail thickness and dispersion of the prior probability density\nfunction. Training directly with these priors is computationally intensive\nbecause almost all $\mathcal{S \alpha S}$ probability densities lack a closed\nform. A precomputed look-up table removed this computational bottleneck. We\ntested the new soft diamond regularizers with deep neural classifiers on the\nthree datasets CIFAR-10, CIFAR-100, and Caltech-256. The regularizers improved\nthe accuracy of the classifiers. The improvements included $4.57\%$ on\nCIFAR-10, $4.27\%$ on CIFAR-100, and $6.69\%$ on Caltech-256. They also\noutperformed $L_2$ regularizers on all the test cases. 
Soft diamond\nregularizers also outperformed $L_1$ lasso or Laplace regularizers because they\nbetter increased sparsity while improving classification accuracy. Soft-diamond\npriors substantially improved accuracy on CIFAR-10 when combined with dropout,\nbatch, or data-augmentation regularization.\n","authors":["Olaoluwa Adigun","Bart Kosko"],"pdf_url":"https://arxiv.org/pdf/2412.20724v1.pdf","comment":"8 pages, 10 figures"},{"id":"http://arxiv.org/abs/2412.12984v2","updated":"2024-12-30T05:34:10Z","published":"2024-12-17T15:04:54Z","title":"Cluster-guided Contrastive Class-imbalanced Graph Classification","summary":" This paper studies the problem of class-imbalanced graph classification,\nwhich aims at effectively classifying the graph categories in scenarios with\nimbalanced class distributions. While graph neural networks (GNNs) have\nachieved remarkable success, their modeling ability on imbalanced\ngraph-structured data remains suboptimal, which typically leads to predictions\nbiased towards the majority classes. On the other hand, existing\nclass-imbalanced learning methods in vision may overlook the rich graph\nsemantic substructures of the majority classes and excessively emphasize\nlearning from the minority classes. To address these challenges, we propose a\nsimple yet powerful approach called C$^3$GNN that integrates the idea of\nclustering into contrastive learning to enhance class-imbalanced graph\nclassification. Technically, C$^3$GNN clusters graphs from each majority class\ninto multiple subclasses, with sizes comparable to the minority class,\nmitigating class imbalance. It also employs the Mixup technique to generate\nsynthetic samples, enriching the semantic diversity of each subclass.\nFurthermore, supervised contrastive learning is used to hierarchically learn\neffective graph representations, enabling the model to thoroughly explore\nsemantic substructures in majority classes while avoiding excessive focus on\nminority classes. 
Extensive experiments on real-world graph benchmark datasets\nverify the superior performance of our proposed method against competitive\nbaselines.\n","authors":["Wei Ju","Zhengyang Mao","Siyu Yi","Yifang Qin","Yiyang Gu","Zhiping Xiao","Jianhao Shen","Ziyue Qiao","Ming Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.12984v2.pdf","comment":"Accepted by Proceedings of the Thirty-Ninth AAAI Conference on\n Artificial Intelligence (AAAI-25)"},{"id":"http://arxiv.org/abs/2404.08877v3","updated":"2024-12-30T05:08:00Z","published":"2024-04-13T02:36:40Z","title":"Aligning the Objective of LLM-based Program Repair","summary":" Large language models (LLMs) have achieved decent results on automated\nprogram repair (APR). However, the next token prediction training objective of\ndecoder-only LLMs (e.g., GPT-4) is misaligned with the masked span prediction\nobjective of current infilling-style methods, which impedes LLMs from fully\nleveraging pre-trained knowledge for program repair. In addition, while some\nLLMs can locate and repair bugs in certain functions using the related\nartifacts (e.g., test cases), existing methods still depend on statement-level\nfault localization methods to provide a list of buggy hunks for repair. This\nrestriction hinders LLMs from exploring potential patches beyond the given\nlocations.\n In this paper, we investigate a new approach to adapt LLMs to program repair.\nOur core insight is that LLM's APR capability can be greatly improved by simply\naligning the output to their training objective and allowing them to refine the\nwhole program without first identifying faulty statements. Based on this\ninsight, we designed D4C, a straightforward prompting framework for APR. D4C\ncan repair 180 bugs correctly in Defects4J, with each patch being sampled only\n10 times. This surpasses the SOTA APR methods with perfect fault localization\nby 10% and reduces the patch sampling number by 90%. 
Our findings reveal that\n(1) objective alignment is crucial for fully exploiting LLM's pre-trained\ncapability, and (2) replacing the traditional localize-buggy-hunks-then-repair\nworkflow with direct debugging is more effective for LLM-based APR methods.\nThus, we believe this paper introduces a new mindset for harnessing LLMs in\nAPR.\n","authors":["Junjielong Xu","Ying Fu","Shin Hwei Tan","Pinjia He"],"pdf_url":"https://arxiv.org/pdf/2404.08877v3.pdf","comment":"Accepted by ICSE'25"},{"id":"http://arxiv.org/abs/2412.19289v2","updated":"2024-12-30T05:07:17Z","published":"2024-12-26T17:29:38Z","title":"ViPCap: Retrieval Text-Based Visual Prompts for Lightweight Image\n Captioning","summary":" Recent lightweight image captioning models using retrieved data mainly focus\non text prompts. However, previous works only utilize the retrieved text as\ntext prompts, and the visual information relies only on the CLIP visual\nembedding. Because of this issue, there is a limitation that the image\ndescriptions inherent in the prompt are not sufficiently reflected in the\nvisual embedding space. To tackle this issue, we propose ViPCap, a novel\nretrieval text-based visual prompt for lightweight image captioning. ViPCap\nleverages the retrieved text with image information as visual prompts to\nenhance the ability of the model to capture relevant visual information. By\nmapping text prompts into the CLIP space and generating multiple randomized\nGaussian distributions, our method leverages sampling to explore randomly\naugmented distributions and effectively retrieves the semantic features that\ncontain image information. These retrieved features are integrated into the\nimage and designated as the visual prompt, leading to performance improvements\non the datasets such as COCO, Flickr30k, and NoCaps. 
Experimental results\ndemonstrate that ViPCap significantly outperforms prior lightweight captioning\nmodels in efficiency and effectiveness, demonstrating the potential for a\nplug-and-play solution.\n","authors":["Taewhan Kim","Soeun Lee","Si-Woo Kim","Dong-Jin Kim"],"pdf_url":"https://arxiv.org/pdf/2412.19289v2.pdf","comment":"Accepted to AAAI 2025"},{"id":"http://arxiv.org/abs/2412.20704v1","updated":"2024-12-30T04:34:42Z","published":"2024-12-30T04:34:42Z","title":"HFI: A unified framework for training-free detection and implicit\n watermarking of latent diffusion model generated images","summary":" Dramatic advances in the quality of the latent diffusion models (LDMs) also\nled to the malicious use of AI-generated images. While current AI-generated\nimage detection methods assume the availability of real/AI-generated images for\ntraining, this is practically limited given the vast expressibility of LDMs.\nThis motivates the training-free detection setup where no related data are\navailable in advance. The existing LDM-generated image detection method assumes\nthat images generated by LDM are easier to reconstruct using an autoencoder\nthan real images. However, we observe that this reconstruction distance is\noverfitted to background information, leading the current method to\nunderperform in detecting images with simple backgrounds. To address this, we\npropose a novel method called HFI. Specifically, by viewing the autoencoder of\nLDM as a downsampling-upsampling kernel, HFI measures the extent of aliasing, a\ndistortion of high-frequency information that appears in the reconstructed\nimage. HFI is training-free, efficient, and consistently outperforms other\ntraining-free methods in detecting challenging images generated by various\ngenerative models. We also show that HFI can successfully detect the images\ngenerated from the specified LDM as a means of implicit watermarking. 
HFI\noutperforms the best baseline method while achieving magnitudes of\n","authors":["Sungik Choi","Sungwoo Park","Jaehoon Lee","Seunghyun Kim","Stanley Jungkyu Choi","Moontae Lee"],"pdf_url":"https://arxiv.org/pdf/2412.20704v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.17323v5","updated":"2024-12-30T04:31:16Z","published":"2023-05-27T01:56:09Z","title":"Some Primal-Dual Theory for Subgradient Methods for Strongly Convex\n Optimization","summary":" We consider (stochastic) subgradient methods for strongly convex but\npotentially nonsmooth non-Lipschitz optimization. We provide new equivalent\ndual descriptions (in the style of dual averaging) for the classic subgradient\nmethod, the proximal subgradient method, and the switching subgradient method.\nThese equivalences enable $O(1/T)$ convergence guarantees in terms of both\ntheir classic primal gap and a not previously analyzed dual gap for strongly\nconvex optimization. Consequently, our theory provides these classic methods\nwith simple, optimal stopping criteria and optimality certificates at no added\ncomputational cost. Our results apply to a wide range of stepsize selections\nand of non-Lipschitz ill-conditioned problems where the early iterations of the\nsubgradient method may diverge exponentially quickly (a phenomenon which, to\nthe best of our knowledge, no prior works address). 
Even in the presence of\nsuch undesirable behaviors, our theory still ensures and bounds eventual\nconvergence.\n","authors":["Benjamin Grimmer","Danlin Li"],"pdf_url":"https://arxiv.org/pdf/2305.17323v5.pdf","comment":"25 pages, major revision shortened the write-up and unified the\n analysis to be done just once in a single \"super\" setting"},{"id":"http://arxiv.org/abs/2402.05961v4","updated":"2024-12-30T04:22:28Z","published":"2024-02-05T04:12:40Z","title":"Genetic-guided GFlowNets for Sample Efficient Molecular Optimization","summary":" The challenge of discovering new molecules with desired properties is crucial\nin domains like drug discovery and material design. Recent advances in deep\nlearning-based generative methods have shown promise but face the issue of\nsample efficiency due to the computational expense of evaluating the reward\nfunction. This paper proposes a novel algorithm for sample-efficient molecular\noptimization by distilling a powerful genetic algorithm into deep generative\npolicy using GFlowNets training, the off-policy method for amortized inference.\nThis approach enables the deep generative policy to learn from domain\nknowledge, which has been explicitly integrated into the genetic algorithm. Our\nmethod achieves state-of-the-art performance in the official molecular\noptimization benchmark, significantly outperforming previous methods. It also\ndemonstrates effectiveness in designing inhibitors against SARS-CoV-2 with\nsubstantially fewer reward calls.\n","authors":["Hyeonah Kim","Minsu Kim","Sanghyeok Choi","Jinkyoo Park"],"pdf_url":"https://arxiv.org/pdf/2402.05961v4.pdf","comment":"NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.18819v2","updated":"2024-12-30T04:15:42Z","published":"2024-12-25T08:17:37Z","title":"LLM-assisted Vector Similarity Search","summary":" As data retrieval demands become increasingly complex, traditional search\nmethods often fall short in addressing nuanced and conceptual queries. 
Vector\nsimilarity search has emerged as a promising technique for finding semantically\nsimilar information efficiently. However, its effectiveness diminishes when\nhandling intricate queries with contextual nuances. This paper explores a\nhybrid approach combining vector similarity search with Large Language Models\n(LLMs) to enhance search accuracy and relevance. The proposed two-step solution\nfirst employs vector similarity search to shortlist potential matches, followed\nby an LLM for context-aware ranking of the results. Experiments on structured\ndatasets demonstrate that while vector similarity search alone performs well\nfor straightforward queries, the LLM-assisted approach excels in processing\ncomplex queries involving constraints, negations, or conceptual requirements.\nBy leveraging the natural language understanding capabilities of LLMs, this\nmethod improves the accuracy of search results for complex tasks without\nsacrificing efficiency. We also discuss real-world applications and propose\ndirections for future research to refine and scale this technique for diverse\ndatasets and use cases.\n Original article:\nhttps://engineering.grab.com/llm-assisted-vector-similarity-search\n","authors":["Md Riyadh","Muqi Li","Felix Haryanto Lie","Jia Long Loh","Haotian Mi","Sayam Bohra"],"pdf_url":"https://arxiv.org/pdf/2412.18819v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2110.03580v2","updated":"2024-12-30T04:14:12Z","published":"2021-10-07T15:59:01Z","title":"A Model Selection Approach for Corruption Robust Reinforcement Learning","summary":" We develop a model selection approach to tackle reinforcement learning with\nadversarial corruption in both transition and reward. 
For finite-horizon\ntabular MDPs, without prior knowledge on the total amount of corruption, our\nalgorithm achieves a regret bound of\n$\\widetilde{\\mathcal{O}}(\\min\\{\\frac{1}{\\Delta}, \\sqrt{T}\\}+C)$ where $T$ is\nthe number of episodes, $C$ is the total amount of corruption, and $\\Delta$ is\nthe reward gap between the best and the second-best policy. This is the first\nworst-case optimal bound achieved without knowledge of $C$, improving previous\nresults of Lykouris et al. (2021); Chen et al. (2021); Wu et al. (2021). For\nfinite-horizon linear MDPs, we develop a computationally efficient algorithm\nwith a regret bound of $\\widetilde{\\mathcal{O}}(\\sqrt{(1+C)T})$, and another\ncomputationally inefficient one with $\\widetilde{\\mathcal{O}}(\\sqrt{T}+C)$,\nimproving the result of Lykouris et al. (2021) and answering an open question\nby Zhang et al. (2021b). Finally, our model selection framework can be easily\napplied to other settings including linear bandits, linear contextual bandits,\nand MDPs with general function approximation, leading to several improved or\nnew results.\n","authors":["Chen-Yu Wei","Christoph Dann","Julian Zimmert"],"pdf_url":"https://arxiv.org/pdf/2110.03580v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.19212v3","updated":"2024-12-30T04:08:54Z","published":"2024-09-28T02:30:44Z","title":"An Accelerated Algorithm for Stochastic Bilevel Optimization under\n Unbounded Smoothness","summary":" This paper investigates a class of stochastic bilevel optimization problems\nwhere the upper-level function is nonconvex with potentially unbounded\nsmoothness and the lower-level problem is strongly convex. These problems have\nsignificant applications in sequential data learning, such as text\nclassification using recurrent neural networks. The unbounded smoothness is\ncharacterized by the smoothness constant of the upper-level function scaling\nlinearly with the gradient norm, lacking a uniform upper bound. 
Existing\nstate-of-the-art algorithms require $\\widetilde{O}(1/\\epsilon^4)$ oracle calls\nof stochastic gradient or Hessian/Jacobian-vector product to find an\n$\\epsilon$-stationary point. However, it remains unclear if we can further\nimprove the convergence rate when the assumptions for the function in the\npopulation level also hold for each random realization almost surely (e.g.,\nLipschitzness of each realization of the stochastic gradient). To address this\nissue, we propose a new Accelerated Bilevel Optimization algorithm named AccBO.\nThe algorithm updates the upper-level variable by normalized stochastic\ngradient descent with recursive momentum and the lower-level variable by the\nstochastic Nesterov accelerated gradient descent algorithm with averaging. We\nprove that our algorithm achieves an oracle complexity of\n$\\widetilde{O}(1/\\epsilon^3)$ to find an $\\epsilon$-stationary point. Our proof\nrelies on a novel lemma characterizing the dynamics of stochastic Nesterov\naccelerated gradient descent algorithm under distribution drift with high\nprobability for the lower-level variable, which is of independent interest and\nalso plays a crucial role in analyzing the hypergradient estimation error over\ntime. Experimental results on various tasks confirm that our proposed algorithm\nachieves the predicted theoretical acceleration and significantly outperforms\nbaselines in bilevel optimization.\n","authors":["Xiaochuan Gong","Jie Hao","Mingrui Liu"],"pdf_url":"https://arxiv.org/pdf/2409.19212v3.pdf","comment":"Accepted by NeurIPS 2024. 
The code is available at\n https://github.com/MingruiLiu-ML-Lab/Accelerated-Bilevel-Optimization-Unbounded-Smoothness"},{"id":"http://arxiv.org/abs/2412.07010v2","updated":"2024-12-30T03:52:19Z","published":"2024-12-09T21:36:42Z","title":"TAEN: A Model-Constrained Tikhonov Autoencoder Network for Forward and\n Inverse Problems","summary":" Efficient real-time solvers for forward and inverse problems are essential in\nengineering and science applications. Machine learning surrogate models have\nemerged as promising alternatives to traditional methods, offering\nsubstantially reduced computational time. Nevertheless, these models typically\ndemand extensive training datasets to achieve robust generalization across\ndiverse scenarios. While physics-based approaches can partially mitigate this\ndata dependency and ensure physics-interpretable solutions, addressing scarce\ndata regimes remains a challenge. Both purely data-driven and physics-based\nmachine learning approaches demonstrate severe overfitting issues when trained\nwith insufficient data. We propose a novel Tikhonov autoencoder\nmodel-constrained framework, called TAE, capable of learning both forward and\ninverse surrogate models using a single arbitrary observation sample. We\ndevelop comprehensive theoretical foundations including forward and inverse\ninference error bounds for the proposed approach for linear cases. For\ncomparative analysis, we derive equivalent formulations for pure data-driven\nand model-constrained approach counterparts. At the heart of our approach is a\ndata randomization strategy, which functions as a generative mechanism for\nexploring the training data space, enabling effective training of both forward\nand inverse surrogate models from a single observation, while regularizing the\nlearning process. 
We validate our approach through extensive numerical\nexperiments on two challenging inverse problems: 2D heat conductivity inversion\nand initial condition reconstruction for time-dependent 2D Navier-Stokes\nequations. Results demonstrate that TAE achieves accuracy comparable to\ntraditional Tikhonov solvers and numerical forward solvers for both inverse and\nforward problems, respectively, while delivering orders of magnitude\ncomputational speedups.\n","authors":["Hai V. Nguyen","Tan Bui-Thanh","Clint Dawson"],"pdf_url":"https://arxiv.org/pdf/2412.07010v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20682v1","updated":"2024-12-30T03:26:53Z","published":"2024-12-30T03:26:53Z","title":"Learning to Rank Pre-trained Vision-Language Models for Downstream Tasks","summary":" Vision language models (VLMs) like CLIP show stellar zero-shot capability on\nclassification benchmarks. However, selecting the VLM with the highest\nperformance on the unlabeled downstream task is non-trivial. Existing VLM\nselection methods focus on the class-name-only setting, relying on a supervised\nlarge-scale dataset and large language models, which may not be accessible or\nfeasible during deployment. This paper introduces the problem of\n\\textbf{unsupervised vision-language model selection}, where only unsupervised\ndownstream datasets are available, with no additional information provided. To\nsolve this problem, we propose a method termed Visual-tExtual Graph Alignment\n(VEGA), to select VLMs without any annotations by measuring the alignment of\nthe VLM between the two modalities on the downstream task. VEGA is motivated by\nthe pretraining paradigm of VLMs, which aligns features with the same semantics\nfrom the visual and textual modalities, thereby mapping both modalities into a\nshared representation space. Specifically, we first construct two graphs on the\nvision and textual features, respectively. 
VEGA is then defined as the overall\nsimilarity between the visual and textual graphs at both node and edge levels.\nExtensive experiments across three different benchmarks, covering a variety of\napplication scenarios and downstream datasets, demonstrate that VEGA\nconsistently provides reliable and accurate estimates of VLMs' performance on\nunlabeled downstream tasks.\n","authors":["Yuhe Ding","Bo Jiang","Aihua Zheng","Qin Xu","Jian Liang"],"pdf_url":"https://arxiv.org/pdf/2412.20682v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19285v2","updated":"2024-12-30T03:25:23Z","published":"2024-11-28T17:31:15Z","title":"BPQP: A Differentiable Convex Optimization Framework for Efficient\n End-to-End Learning","summary":" Data-driven decision-making processes increasingly utilize end-to-end\nlearnable deep neural networks to render final decisions. Sometimes, the output\nof the forward functions in certain layers is determined by the solutions to\nmathematical optimization problems, leading to the emergence of differentiable\noptimization layers that permit gradient back-propagation. However, real-world\nscenarios often involve large-scale datasets and numerous constraints,\npresenting significant challenges. Current methods for differentiating\noptimization problems typically rely on implicit differentiation, which\nnecessitates costly computations on the Jacobian matrices, resulting in low\nefficiency. In this paper, we introduce BPQP, a differentiable convex\noptimization framework designed for efficient end-to-end learning. To enhance\nefficiency, we reformulate the backward pass as a simplified and decoupled\nquadratic programming problem by leveraging the structural properties of the\nKKT matrix. This reformulation enables the use of first-order optimization\nalgorithms in calculating the backward pass gradients, allowing our framework\nto potentially utilize any state-of-the-art solver. 
As solver technologies\nevolve, BPQP can continuously adapt and improve its efficiency. Extensive\nexperiments on both simulated and real-world datasets demonstrate that BPQP\nachieves a significant improvement in efficiency--typically an order of\nmagnitude faster in overall execution time compared to other differentiable\noptimization layers. Our results not only highlight the efficiency gains of\nBPQP but also underscore its superiority over differentiable optimization layer\nbaselines.\n","authors":["Jianming Pan","Zeqi Ye","Xiao Yang","Xu Yang","Weiqing Liu","Lewen Wang","Jiang Bian"],"pdf_url":"https://arxiv.org/pdf/2411.19285v2.pdf","comment":"NeurIPS 2024 Spotlight"},{"id":"http://arxiv.org/abs/2412.20679v1","updated":"2024-12-30T03:18:24Z","published":"2024-12-30T03:18:24Z","title":"Differentiable Convex Optimization Layers in Neural Architectures:\n Foundations and Perspectives","summary":" The integration of optimization problems within neural network architectures\nrepresents a fundamental shift from traditional approaches to handling\nconstraints in deep learning. While it is long known that neural networks can\nincorporate soft constraints with techniques such as regularization, strict\nadherence to hard constraints is generally more difficult. A recent advance in\nthis field, however, has addressed this problem by enabling the direct\nembedding of optimization layers as differentiable components within deep\nnetworks. This paper surveys the evolution and current state of this approach,\nfrom early implementations limited to quadratic programming, to more recent\nframeworks supporting general convex optimization problems. We provide a\ncomprehensive review of the background, theoretical foundations, and emerging\napplications of this technology. Our analysis includes detailed mathematical\nproofs and an examination of various use cases that demonstrate the potential\nof this hybrid approach. 
This work synthesizes developments at the intersection\nof optimization theory and deep learning, offering insights into both current\ncapabilities and future research directions in this rapidly evolving field.\n","authors":["Calder Katyal"],"pdf_url":"https://arxiv.org/pdf/2412.20679v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20678v1","updated":"2024-12-30T03:15:25Z","published":"2024-12-30T03:15:25Z","title":"Attention-Driven Metapath Encoding in Heterogeneous Graphs","summary":" One of the emerging techniques in node classification in heterogeneous graphs\nis to restrict message aggregation to pre-defined, semantically meaningful\nstructures called metapaths. This work is the first attempt to incorporate\nattention into the process of encoding entire metapaths without dropping\nintermediate nodes. In particular, we construct two encoders: the first uses\nsequential attention to extend the multi-hop message passing algorithm designed\nin \\citet{magna} to the metapath setting, and the second incorporates direct\nattention to extract semantic relations in the metapath. The model then employs\nthe intra-metapath and inter-metapath aggregation mechanisms of \\citet{han}. We\nfurthermore use the powerful training scheduler specialized for heterogeneous\ngraphs that was developed in \\citet{lts}, ensuring the model slowly learns how\nto classify the most difficult nodes. The result is a resilient,\ngeneral-purpose framework for capturing semantic structures in heterogeneous\ngraphs. 
In particular, we demonstrate that our model is competitive with\nstate-of-the-art models on performing node classification on the IMDB dataset,\na popular benchmark introduced in \\citet{benchmark}.\n","authors":["Calder Katyal"],"pdf_url":"https://arxiv.org/pdf/2412.20678v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18619v2","updated":"2024-12-30T03:00:30Z","published":"2024-12-16T05:02:25Z","title":"Next Token Prediction Towards Multimodal Intelligence: A Comprehensive\n Survey","summary":" Building on the foundations of language modeling in natural language\nprocessing, Next Token Prediction (NTP) has evolved into a versatile training\nobjective for machine learning tasks across various modalities, achieving\nconsiderable success. As Large Language Models (LLMs) have advanced to unify\nunderstanding and generation tasks within the textual modality, recent research\nhas shown that tasks from different modalities can also be effectively\nencapsulated within the NTP framework, transforming the multimodal information\ninto tokens and predict the next one given the context. This survey introduces\na comprehensive taxonomy that unifies both understanding and generation within\nmultimodal learning through the lens of NTP. The proposed taxonomy covers five\nkey aspects: Multimodal tokenization, MMNTP model architectures, unified task\nrepresentation, datasets \\& evaluation, and open challenges. This new taxonomy\naims to aid researchers in their exploration of multimodal intelligence. 
An\nassociated GitHub repository collecting the latest papers and repos is\navailable at https://github.com/LMM101/Awesome-Multimodal-Next-Token-Prediction\n","authors":["Liang Chen","Zekun Wang","Shuhuai Ren","Lei Li","Haozhe Zhao","Yunshui Li","Zefan Cai","Hongcheng Guo","Lei Zhang","Yizhe Xiong","Yichi Zhang","Ruoyu Wu","Qingxiu Dong","Ge Zhang","Jian Yang","Lingwei Meng","Shujie Hu","Yulong Chen","Junyang Lin","Shuai Bai","Andreas Vlachos","Xu Tan","Minjia Zhang","Wen Xiao","Aaron Yee","Tianyu Liu","Baobao Chang"],"pdf_url":"https://arxiv.org/pdf/2412.18619v2.pdf","comment":"69 pages, 18 figures, repo at\n https://github.com/LMM101/Awesome-Multimodal-Next-Token-Prediction"},{"id":"http://arxiv.org/abs/2412.20674v1","updated":"2024-12-30T02:58:18Z","published":"2024-12-30T02:58:18Z","title":"Blockchain-Empowered Cyber-Secure Federated Learning for Trustworthy\n Edge Computing","summary":" Federated Learning (FL) is a privacy-preserving distributed machine learning\nscheme, where each participant's data remains on the participating devices and\nonly the local model generated utilizing the local computational power is\ntransmitted throughout the database. However, the distributed computational\nnature of FL creates the necessity to develop a mechanism that can remotely\ntrigger any network agents, track their activities, and prevent threats to the\noverall process posed by malicious participants. Particularly, the FL paradigm\nmay become vulnerable due to an active attack from the network participants,\ncalled a poisonous attack. In such an attack, the malicious participant acts as\na benign agent capable of affecting the global model quality by uploading an\nobfuscated poisoned local model update to the server. This paper presents a\ncross-device FL model that ensures trustworthiness, fairness, and authenticity\nin the underlying FL training process. 
We leverage trustworthiness by\nconstructing a reputation-based trust model based on contributions of agents\ntoward model convergence. We ensure fairness by identifying and removing\nmalicious agents from the training process through an outlier detection\ntechnique. Further, we establish authenticity by generating a token for each\nparticipating device through a distributed sensing mechanism and storing that\nunique token in a blockchain smart contract. Further, we insert the trust\nscores of all agents into a blockchain and validate their reputations using\nvarious consensus mechanisms that consider the computational task.\n","authors":["Ervin Moore","Ahmed Imteaj","Md Zarif Hossain","Shabnam Rezapour","M. Hadi Amini"],"pdf_url":"https://arxiv.org/pdf/2412.20674v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20671v1","updated":"2024-12-30T02:51:57Z","published":"2024-12-30T02:51:57Z","title":"Two Birds with One Stone: Improving Rumor Detection by Addressing the\n Unfairness Issue","summary":" The degraded performance and group unfairness caused by confounding sensitive\nattributes in rumor detection remains relatively unexplored. To address this,\nwe propose a two-step framework. Initially, it identifies confounding sensitive\nattributes that limit rumor detection performance and cause unfairness across\ngroups. Subsequently, we aim to learn equally informative representations\nthrough invariant learning. Our method considers diverse sets of groups without\nsensitive attribute annotations. 
Experiments show our method easily integrates\nwith existing rumor detectors, significantly improving both their detection\nperformance and fairness.\n","authors":["Junyi Chen","Mengjia Wu","Qian Liu","Ying Ding","Yi Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.20671v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20670v1","updated":"2024-12-30T02:48:34Z","published":"2024-12-30T02:48:34Z","title":"Prototypical Distillation and Debiased Tuning for Black-box Unsupervised\n Domain Adaptation","summary":" Unsupervised domain adaptation aims to transfer knowledge from a related,\nlabel-rich source domain to an unlabeled target domain, thereby circumventing\nthe high costs associated with manual annotation. Recently, there has been\ngrowing interest in source-free domain adaptation, a paradigm in which only a\npre-trained model, rather than the labeled source data, is provided to the\ntarget domain. Given the potential risk of source data leakage via model\ninversion attacks, this paper introduces a novel setting called black-box\ndomain adaptation, where the source model is accessible only through an API\nthat provides the predicted label along with the corresponding confidence value\nfor each query. We develop a two-step framework named $\\textbf{Pro}$totypical\n$\\textbf{D}$istillation and $\\textbf{D}$ebiased tun$\\textbf{ing}$\n($\\textbf{ProDDing}$). In the first step, ProDDing leverages both the raw\npredictions from the source model and prototypes derived from the target domain\nas teachers to distill a customized target model. In the second step, ProDDing\nkeeps fine-tuning the distilled model by penalizing logits that are biased\ntoward certain classes. Empirical results across multiple benchmarks\ndemonstrate that ProDDing outperforms existing black-box domain adaptation\nmethods. Moreover, in the case of hard-label black-box domain adaptation, where\nonly predicted labels are available, ProDDing achieves significant improvements\nover these methods. 
Code will be available at\n\\url{https://github.com/tim-learn/ProDDing/}.\n","authors":["Jian Liang","Lijun Sheng","Hongmin Liu","Ran He"],"pdf_url":"https://arxiv.org/pdf/2412.20670v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.15187v2","updated":"2024-12-30T02:41:33Z","published":"2023-12-23T07:47:58Z","title":"IRG: Generating Synthetic Relational Databases using Deep Learning with\n Insightful Relational Understanding","summary":" Synthetic data has numerous applications, including but not limited to\nsoftware testing at scale, privacy-preserving data sharing to enable smoother\ncollaboration between stakeholders, and data augmentation for analytical and\nmachine learning tasks. Relational databases, which are commonly used by\ncorporations, governments, and financial institutions, present unique\nchallenges for synthetic data generation due to their complex structures.\nExisting synthetic relational database generation approaches often assume\nidealized scenarios, such as every table having a perfect primary key column\nwithout composite and potentially overlapping primary or foreign key\nconstraints, and fail to account for the sequential nature of certain tables.\nIn this paper, we propose incremental relational generator (IRG), that\nsuccessfully handles these ubiquitous real-life situations. IRG ensures the\npreservation of relational schema integrity, offers a deep contextual\nunderstanding of relationships beyond direct ancestors and descendants,\nleverages the power of newly designed deep neural networks, and scales\nefficiently to handle larger datasets--a combination never achieved in previous\nworks. Experiments on three open-source real-life relational datasets in\ndifferent fields at different scales demonstrate IRG's advantage in maintaining\nthe synthetic data's relational schema validity and data fidelity and utility.\n","authors":["Jiayu Li","Zilong Zhao","Vikram Chundawat","Biplab Sikdar","Y. C. 
Tay"],"pdf_url":"https://arxiv.org/pdf/2312.15187v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17856v2","updated":"2024-12-30T02:28:52Z","published":"2024-12-20T04:05:09Z","title":"Graph Structure Refinement with Energy-based Contrastive Learning","summary":" Graph Neural Networks (GNNs) have recently gained widespread attention as a\nsuccessful tool for analyzing graph-structured data. However, imperfect graph\nstructure with noisy links lacks enough robustness and may damage graph\nrepresentations, therefore limiting the GNNs' performance in practical tasks.\nMoreover, existing generative architectures fail to fit discriminative\ngraph-related tasks. To tackle these issues, we introduce an unsupervised\nmethod based on a joint of generative training and discriminative training to\nlearn graph structure and representation, aiming to improve the discriminative\nperformance of generative models. We propose an Energy-based Contrastive\nLearning (ECL) guided Graph Structure Refinement (GSR) framework, denoted as\nECL-GSR. To our knowledge, this is the first work to combine energy-based\nmodels with contrastive learning for GSR. Specifically, we leverage ECL to\napproximate the joint distribution of sample pairs, which increases the\nsimilarity between representations of positive pairs while reducing the\nsimilarity between negative ones. Refined structure is produced by augmenting\nand removing edges according to the similarity metrics among node\nrepresentations. Extensive experiments demonstrate that ECL-GSR outperforms the\nstate-of-the-art on eight benchmark datasets in node classification. 
ECL-GSR\nachieves faster training with fewer samples and memories against the leading\nbaseline, highlighting its simplicity and efficiency in downstream tasks.\n","authors":["Xianlin Zeng","Yufeng Wang","Yuqi Sun","Guodong Guo","Baochang Zhang","Wenrui Ding"],"pdf_url":"https://arxiv.org/pdf/2412.17856v2.pdf","comment":"Accepted to AAAI 2025"},{"id":"http://arxiv.org/abs/2412.20656v1","updated":"2024-12-30T02:20:40Z","published":"2024-12-30T02:20:40Z","title":"Overcoming Class Imbalance: Unified GNN Learning with Structural and\n Semantic Connectivity Representations","summary":" Class imbalance is pervasive in real-world graph datasets, where the majority\nof annotated nodes belong to a small set of classes (majority classes), leaving\nmany other classes (minority classes) with only a handful of labeled nodes.\nGraph Neural Networks (GNNs) suffer from significant performance degradation in\nthe presence of class imbalance, exhibiting bias towards majority classes and\nstruggling to generalize effectively on minority classes. This limitation\nstems, in part, from the message passing process, leading GNNs to overfit to\nthe limited neighborhood of annotated nodes from minority classes and impeding\nthe propagation of discriminative information throughout the entire graph. In\nthis paper, we introduce a novel Unified Graph Neural Network Learning\n(Uni-GNN) framework to tackle class-imbalanced node classification. The\nproposed framework seamlessly integrates both structural and semantic\nconnectivity representations through semantic and structural node encoders. By\ncombining these connectivity types, Uni-GNN extends the propagation of node\nembeddings beyond immediate neighbors, encompassing non-adjacent structural\nnodes and semantically similar nodes, enabling efficient diffusion of\ndiscriminative information throughout the graph. 
Moreover, to harness the\npotential of unlabeled nodes within the graph, we employ a balanced\npseudo-label generation mechanism that augments the pool of available labeled\nnodes from minority classes in the training set. Experimental results\nunderscore the superior performance of our proposed Uni-GNN framework compared\nto state-of-the-art class-imbalanced graph learning baselines across multiple\nbenchmark datasets.\n","authors":["Abdullah Alchihabi","Hao Yan","Yuhong Guo"],"pdf_url":"https://arxiv.org/pdf/2412.20656v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.02322v6","updated":"2024-12-30T02:14:15Z","published":"2024-02-04T02:26:40Z","title":"Dynamic Incremental Optimization for Best Subset Selection","summary":" Best subset selection is considered the `gold standard' for many sparse\nlearning problems. A variety of optimization techniques have been proposed to\nattack this non-smooth non-convex problem. In this paper, we investigate the\ndual forms of a family of $\\ell_0$-regularized problems. An efficient\nprimal-dual algorithm is developed based on the primal and dual problem\nstructures. By leveraging the dual range estimation along with the incremental\nstrategy, our algorithm potentially reduces redundant computation and improves\nthe solutions of best subset selection. 
Theoretical analysis and experiments on\nsynthetic and real-world datasets validate the efficiency and statistical\nproperties of the proposed solutions.\n","authors":["Shaogang Ren","Xiaoning Qian"],"pdf_url":"https://arxiv.org/pdf/2402.02322v6.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2207.02058"},{"id":"http://arxiv.org/abs/2412.20644v1","updated":"2024-12-30T01:33:42Z","published":"2024-12-30T01:33:42Z","title":"Uncertainty Herding: One Active Learning Method for All Label Budgets","summary":" Most active learning research has focused on methods which perform well when\nmany labels are available, but can be dramatically worse than random selection\nwhen label budgets are small. Other methods have focused on the low-budget\nregime, but do poorly as label budgets increase. As the line between \"low\" and\n\"high\" budgets varies by problem, this is a serious issue in practice. We\npropose uncertainty coverage, an objective which generalizes a variety of low-\nand high-budget objectives, as well as natural, hyperparameter-light methods to\nsmoothly interpolate between low- and high-budget regimes. We call greedy\noptimization of the estimate Uncertainty Herding; this simple method is\ncomputationally fast, and we prove that it nearly optimizes the\ndistribution-level coverage. In experimental validation across a variety of\nactive learning tasks, our proposal matches or beats state-of-the-art\nperformance in essentially all cases; it is the only method of which we are\naware that reliably works well in both low- and high-budget settings.\n","authors":["Wonho Bae","Gabriel L. Oliveira","Danica J. 
Sutherland"],"pdf_url":"https://arxiv.org/pdf/2412.20644v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05302v3","updated":"2024-12-30T01:29:09Z","published":"2024-11-26T09:41:26Z","title":"A High Energy-Efficiency Multi-core Neuromorphic Architecture for Deep\n SNN Training","summary":" There is a growing necessity for edge training to adapt to dynamically\nchanging environment. Neuromorphic computing represents a significant pathway\nfor high-efficiency intelligent computation in energy-constrained edges, but\nexisting neuromorphic architectures lack the ability of directly training\nspiking neural networks (SNNs) based on backpropagation. We develop a\nmulti-core neuromorphic architecture with Feedforward-Propagation,\nBack-Propagation, and Weight-Gradient engines in each core, supporting high\nefficient parallel computing at both the engine and core levels. It combines\nvarious data flows and sparse computation optimization by fully leveraging the\nsparsity in SNN training, obtaining a high energy efficiency of 1.05TFLOPS/W@\nFP16 @ 28nm, 55 ~ 85% reduction of DRAM access compared to A100 GPU in SNN\ntrainings, and a 20-core deep SNN training and a 5-worker federated learning on\nFPGAs. 
Our study develops the first multi-core neuromorphic architecture\nsupporting the direct SNN training, facilitating the neuromorphic computing in\nedge-learnable applications.\n","authors":["Mingjing Li","Huihui Zhou","Xiaofeng Xu","Zhiwei Zhong","Puli Quan","Xueke Zhu","Yanyu Lin","Wenjie Lin","Hongyu Guo","Junchao Zhang","Yunhao Ma","Wei Wang","Qingyan Meng","Zhengyu Ma","Guoqi Li","Xiaoxin Cui","Yonghong Tian"],"pdf_url":"https://arxiv.org/pdf/2412.05302v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.07066v6","updated":"2024-12-30T01:25:19Z","published":"2024-04-10T14:56:40Z","title":"Exploring Concept Depth: How Large Language Models Acquire Knowledge at\n Different Layers?","summary":" Large language models (LLMs) have shown remarkable performances across a wide\nrange of tasks. However, the mechanisms by which these models encode tasks of\nvarying complexities remain poorly understood. In this paper, we explore the\nhypothesis that LLMs process concepts of varying complexities in different\nlayers, introducing the idea of ``Concept Depth'' to suggest that more complex\nconcepts are typically acquired in deeper layers. Specifically, we categorize\nconcepts based on their level of abstraction, defining them in the order of\nincreasing complexity within factual, emotional, and inferential tasks. We\nconduct extensive probing experiments using layer-wise representations across\nvarious LLM families (Gemma, LLaMA, Qwen) on various datasets spanning the\nthree domains of tasks. Our findings reveal that models could efficiently\nconduct probing for simpler tasks in shallow layers, and more complex tasks\ntypically necessitate deeper layers for accurate understanding. Additionally,\nwe examine how external factors, such as adding noise to the input and\nquantizing the model weights, might affect layer-wise representations. 
Our\nfindings suggest that these factors can impede the development of a conceptual\nunderstanding of LLMs until deeper layers are explored. We hope that our\nproposed concept and experimental insights will enhance the understanding of\nthe mechanisms underlying LLMs. Our codes are available at\n\\url{https://github.com/Luckfort/CD}.\n","authors":["Mingyu Jin","Qinkai Yu","Jingyuan Huang","Qingcheng Zeng","Zhenting Wang","Wenyue Hua","Haiyan Zhao","Kai Mei","Yanda Meng","Kaize Ding","Fan Yang","Mengnan Du","Yongfeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2404.07066v6.pdf","comment":"COLING 2025"},{"id":"http://arxiv.org/abs/2412.20641v1","updated":"2024-12-30T01:10:10Z","published":"2024-12-30T01:10:10Z","title":"SafeSynthDP: Leveraging Large Language Models for Privacy-Preserving\n Synthetic Data Generation Using Differential Privacy","summary":" Machine learning (ML) models frequently rely on training data that may\ninclude sensitive or personal information, raising substantial privacy\nconcerns. Legislative frameworks such as the General Data Protection Regulation\n(GDPR) and the California Consumer Privacy Act (CCPA) have necessitated the\ndevelopment of strategies that preserve privacy while maintaining the utility\nof data. In this paper, we investigate the capability of Large Language Models\n(LLMs) to generate synthetic datasets integrated with Differential Privacy (DP)\nmechanisms, thereby enabling data-driven research and model training without\ndirect exposure of sensitive information. Our approach incorporates DP-based\nnoise injection methods, including Laplace and Gaussian distributions, into the\ndata generation process. We then evaluate the utility of these DP-enhanced\nsynthetic datasets by comparing the performance of ML models trained on them\nagainst models trained on the original data. To substantiate privacy\nguarantees, we assess the resilience of the generated synthetic data to\nmembership inference attacks and related threats. 
The experimental results\ndemonstrate that integrating DP within LLM-driven synthetic data generation\noffers a viable balance between privacy protection and data utility. This study\nprovides a foundational methodology and insight into the privacy-preserving\ncapabilities of LLMs, paving the way for compliant and effective ML research\nand applications.\n","authors":["Md Mahadi Hasan Nahid","Sadid Bin Hasan"],"pdf_url":"https://arxiv.org/pdf/2412.20641v1.pdf","comment":"15 pages, 1 figure, 5 tables"},{"id":"http://arxiv.org/abs/2412.18547v2","updated":"2024-12-30T01:07:39Z","published":"2024-12-24T16:55:45Z","title":"Token-Budget-Aware LLM Reasoning","summary":" Reasoning is critical for large language models (LLMs) to excel in a wide\nrange of tasks. While methods like Chain-of-Thought (CoT) reasoning enhance LLM\nperformance by decomposing problems into intermediate steps, they also incur\nsignificant overhead in token usage, leading to increased costs. We find that\nthe reasoning process of current LLMs is unnecessarily lengthy and it can be\ncompressed by including a reasonable token budget in the prompt, but the choice\nof token budget plays a crucial role in the actual compression effectiveness.\nWe then propose a token-budget-aware LLM reasoning framework, which dynamically\nestimates token budgets for different problems based on reasoning complexity\nand uses the estimated token budgets to guide the reasoning process.\nExperiments show that our method effectively reduces token costs in CoT\nreasoning with only a slight performance reduction, offering a practical\nsolution to balance efficiency and accuracy in LLM reasoning. 
Code:\nhttps://github.com/GeniusHTX/TALE.\n","authors":["Tingxu Han","Chunrong Fang","Shiyu Zhao","Shiqing Ma","Zhenyu Chen","Zhenting Wang"],"pdf_url":"https://arxiv.org/pdf/2412.18547v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.11078v2","updated":"2024-12-30T01:01:58Z","published":"2024-10-14T20:39:02Z","title":"Predicting Chess Puzzle Difficulty with Transformers","summary":" This study addresses the challenge of quantifying chess puzzle difficulty - a\ncomplex task that combines elements of game theory and human cognition and\nunderscores its critical role in effective chess training. We present\nGlickFormer, a novel transformer-based architecture that predicts chess puzzle\ndifficulty by approximating the Glicko-2 rating system. Unlike conventional\nchess engines that optimize for game outcomes, GlickFormer models human\nperception of tactical patterns and problem-solving complexity. The proposed\nmodel utilizes a modified ChessFormer backbone for spatial feature extraction\nand incorporates temporal information via factorized transformer techniques.\nThis approach enables the capture of both spatial chess piece arrangements and\nmove sequences, effectively modeling spatio-temporal relationships relevant to\ndifficulty assessment. Experimental evaluation was conducted on a dataset of\nover 4 million chess puzzles. Results demonstrate GlickFormer's superior\nperformance compared to the state-of-the-art ChessFormer baseline across\nmultiple metrics. The algorithm's performance has also been recognized through\nits competitive results in the IEEE BigData 2024 Cup: Predicting Chess Puzzle\nDifficulty competition, where it placed 11th. 
The insights gained from this\nstudy have implications for personalized chess training and broader\napplications in educational technology and cognitive modeling.\n","authors":["Szymon Miłosz","Paweł Kapusta"],"pdf_url":"https://arxiv.org/pdf/2410.11078v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20638v1","updated":"2024-12-30T01:01:15Z","published":"2024-12-30T01:01:15Z","title":"Predicting Long Term Sequential Policy Value Using Softer Surrogates","summary":" Performing policy evaluation in education, healthcare and online commerce can\nbe challenging, because it can require waiting substantial amounts of time to\nobserve outcomes over the desired horizon of interest. While offline evaluation\nmethods can be used to estimate the performance of a new decision policy from\nhistorical data in some cases, such methods struggle when the new policy\ninvolves novel actions or is being run in a new decision process with\npotentially different dynamics. Here we consider how to estimate the\nfull-horizon value of a new decision policy using only short-horizon data from\nthe new policy, and historical full-horizon data from a different behavior\npolicy. We introduce two new estimators for this setting, including a doubly\nrobust estimator, and provide formal analysis of their properties. 
Our\nempirical results on two realistic simulators, of HIV treatment and sepsis\ntreatment, show that our methods can often provide informative estimates of a\nnew decision policy ten times faster than waiting for the full horizon,\nhighlighting that it may be possible to quickly identify if a new decision\npolicy, involving new actions, is better or worse than existing past policies.\n","authors":["Hyunji Nam","Allen Nie","Ge Gao","Vasilis Syrgkanis","Emma Brunskill"],"pdf_url":"https://arxiv.org/pdf/2412.20638v1.pdf","comment":"23 pages, 1 figure"},{"id":"http://arxiv.org/abs/2412.20635v1","updated":"2024-12-30T00:47:49Z","published":"2024-12-30T00:47:49Z","title":"NetFlowGen: Leveraging Generative Pre-training for Network Traffic\n Dynamics","summary":" Understanding the traffic dynamics in networks is a core capability for\nautomated systems to monitor and analyze networking behaviors, reducing\nexpensive human efforts and economic risks through tasks such as traffic\nclassification, congestion prediction, and attack detection. However, it is\nstill challenging to accurately model network traffic with machine learning\napproaches in an efficient and broadly applicable manner. Task-specific models\ntrained from scratch are used for different networking applications, which\nlimits the efficiency of model development and generalization of model\ndeployment. Furthermore, while networking data is abundant, high-quality\ntask-specific labels are often insufficient for training individual models.\nLarge-scale self-supervised learning on unlabeled data provides a natural\npathway for tackling these challenges. We propose to pre-train a\ngeneral-purpose machine learning model to capture traffic dynamics with only\ntraffic data from NetFlow records, with the goal of fine-tuning for different\ndownstream tasks with small amount of labels. 
Our presented NetFlowGen\nframework goes beyond a proof-of-concept for network traffic pre-training and\naddresses specific challenges such as unifying network feature representations,\nlearning from large unlabeled traffic data volume, and testing on real\ndownstream tasks in DDoS attack detection. Experiments demonstrate promising\nresults of our pre-training framework on capturing traffic dynamics and\nadapting to different networking tasks.\n","authors":["Jiawei Zhou","Woojeong Kim","Zhiying Xu","Alexander M. Rush","Minlan Yu"],"pdf_url":"https://arxiv.org/pdf/2412.20635v1.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2412.21042v1","updated":"2024-12-30T16:05:40Z","published":"2024-12-30T16:05:40Z","title":"Visual Style Prompt Learning Using Diffusion Models for Blind Face\n Restoration","summary":" Blind face restoration aims to recover high-quality facial images from\nvarious unidentified sources of degradation, posing significant challenges due\nto the minimal information retrievable from the degraded images. Prior\nknowledge-based methods, leveraging geometric priors and facial features, have\nled to advancements in face restoration but often fall short of capturing fine\ndetails. To address this, we introduce a visual style prompt learning framework\nthat utilizes diffusion probabilistic models to explicitly generate visual\nprompts within the latent space of pre-trained generative models. These prompts\nare designed to guide the restoration process. To fully utilize the visual\nprompts and enhance the extraction of informative and rich patterns, we\nintroduce a style-modulated aggregation transformation layer. Extensive\nexperiments and applications demonstrate the superiority of our method in\nachieving high-quality blind face restoration. 
The source code is available at\n\\href{https://github.com/LonglongaaaGo/VSPBFR}{https://github.com/LonglongaaaGo/VSPBFR}.\n","authors":["Wanglong Lu","Jikai Wang","Tao Wang","Kaihao Zhang","Xianta Jiang","Hanli Zhao"],"pdf_url":"https://arxiv.org/pdf/2412.21042v1.pdf","comment":"Published at Pattern Recognition; 13 pages, 11 figures"},{"id":"http://arxiv.org/abs/2412.21009v1","updated":"2024-12-30T15:21:36Z","published":"2024-12-30T15:21:36Z","title":"Towards Identity-Aware Cross-Modal Retrieval: a Dataset and a Baseline","summary":" Recent advancements in deep learning have significantly enhanced\ncontent-based retrieval methods, notably through models like CLIP that map\nimages and texts into a shared embedding space. However, these methods often\nstruggle with domain-specific entities and long-tail concepts absent from their\ntraining data, particularly in identifying specific individuals. In this paper,\nwe explore the task of identity-aware cross-modal retrieval, which aims to\nretrieve images of persons in specific contexts based on natural language\nqueries. This task is critical in various scenarios, such as for searching and\nbrowsing personalized video collections or large audio-visual archives\nmaintained by national broadcasters. We introduce a novel dataset, COCO Person\nFaceSwap (COCO-PFS), derived from the widely used COCO dataset and enriched\nwith deepfake-generated faces from VGGFace2. This dataset addresses the lack of\nlarge-scale datasets needed for training and evaluating models for this task.\nOur experiments assess the performance of different CLIP variations repurposed\nfor this task, including our architecture, Identity-aware CLIP (Id-CLIP), which\nachieves competitive retrieval performance through targeted fine-tuning. Our\ncontributions lay the groundwork for more robust cross-modal retrieval systems\ncapable of recognizing long-tail identities and contextual nuances. 
Data and\ncode are available at https://github.com/mesnico/IdCLIP.\n","authors":["Nicola Messina","Lucia Vadicamo","Leo Maltese","Claudio Gennaro"],"pdf_url":"https://arxiv.org/pdf/2412.21009v1.pdf","comment":"Accepted as full paper at ECIR 2025"},{"id":"http://arxiv.org/abs/2405.14040v2","updated":"2024-12-30T09:02:53Z","published":"2024-05-22T22:22:26Z","title":"Synchronized Video Storytelling: Generating Video Narrations with\n Structured Storyline","summary":" Video storytelling is engaging multimedia content that utilizes video and its\naccompanying narration to attract the audience, where a key challenge is\ncreating narrations for recorded visual scenes. Previous studies on dense video\ncaptioning and video story generation have made some progress. However, in\npractical applications, we typically require synchronized narrations for\nongoing visual scenes. In this work, we introduce a new task of Synchronized\nVideo Storytelling, which aims to generate synchronous and informative\nnarrations for videos. These narrations, associated with each video clip,\nshould relate to the visual content, integrate relevant knowledge, and have an\nappropriate word count corresponding to the clip's duration. Specifically, a\nstructured storyline is beneficial to guide the generation process, ensuring\ncoherence and integrity. To support the exploration of this task, we introduce\na new benchmark dataset E-SyncVidStory with rich annotations. Since existing\nMultimodal LLMs are not effective in addressing this task in one-shot or\nfew-shot settings, we propose a framework named VideoNarrator that can generate\na storyline for input videos and simultaneously generate narrations with the\nguidance of the generated or predefined storyline. We further introduce a set\nof evaluation metrics to thoroughly assess the generation. Both automatic and\nhuman evaluations validate the effectiveness of our approach. 
Our dataset,\ncodes, and evaluations will be released.\n","authors":["Dingyi Yang","Chunru Zhan","Ziheng Wang","Biao Wang","Tiezheng Ge","Bo Zheng","Qin Jin"],"pdf_url":"https://arxiv.org/pdf/2405.14040v2.pdf","comment":"15 pages, 13 figures"},{"id":"http://arxiv.org/abs/2412.20799v1","updated":"2024-12-30T08:46:50Z","published":"2024-12-30T08:46:50Z","title":"SFE-Net: Harnessing Biological Principles of Differential Gene\n Expression for Improved Feature Selection in Deep Learning Networks","summary":" In the realm of DeepFake detection, the challenge of adapting to various\nsynthesis methodologies such as Faceswap, Deepfakes, Face2Face, and\nNeuralTextures significantly impacts the performance of traditional machine\nlearning models. These models often suffer from static feature representation,\nwhich struggles to perform consistently across diversely generated deepfake\ndatasets. Inspired by the biological concept of differential gene expression,\nwhere gene activation is dynamically regulated in response to environmental\nstimuli, we introduce the Selective Feature Expression Network (SFE-Net). This\ninnovative framework integrates selective feature activation principles into\ndeep learning architectures, allowing the model to dynamically adjust feature\npriorities in response to varying deepfake generation techniques. SFE-Net\nemploys a novel mechanism that selectively enhances critical features essential\nfor accurately detecting forgeries, while reducing the impact of irrelevant or\nmisleading cues akin to adaptive evolutionary processes in nature. Through\nrigorous testing on a range of deepfake datasets, SFE-Net not only surpasses\nexisting static models in detecting sophisticated forgeries but also shows\nenhanced generalization capabilities in cross-dataset scenarios. 
Our approach\nsignificantly mitigates overfitting by maintaining a dynamic balance between\nfeature exploration and exploitation, thus producing more robust and effective\ndeepfake detection models. This bio-inspired strategy paves the way for\ndeveloping adaptive deep learning systems that are finely tuned to address the\nnuanced challenges posed by the varied nature of digital forgeries in modern\ndigital forensics.\n","authors":["Yuqi Li","Yuanzhong Zheng","Yaoxuan Wang","Jianjun Yin","Haojun Fei"],"pdf_url":"https://arxiv.org/pdf/2412.20799v1.pdf","comment":"5 pages,3 figures,2 charts,conference"},{"id":"http://arxiv.org/abs/2412.20733v1","updated":"2024-12-30T06:14:48Z","published":"2024-12-30T06:14:48Z","title":"Towards nation-wide analytical healthcare infrastructures: A\n privacy-preserving augmented knee rehabilitation case study","summary":" The purpose of this paper is to contribute towards the near-future\nprivacy-preserving big data analytical healthcare platforms, capable of\nprocessing streamed or uploaded timeseries data or videos from patients. The\nexperimental work includes a real-life knee rehabilitation video dataset\ncapturing a set of exercises from simple and personalised to more general and\nchallenging movements aimed for returning to sport. To convert video from\nmobile into privacy-preserving diagnostic timeseries data, we employed Google\nMediaPipe pose estimation. The developed proof-of-concept algorithms can\naugment knee exercise videos by overlaying the patient with stick figure\nelements while updating generated timeseries plot with knee angle estimation\nstreamed as CSV file format. For patients and physiotherapists, video with\nside-to-side timeseries visually indicating potential issues such as excessive\nknee flexion or unstable knee movements or stick figure overlay errors is\npossible by setting a-priori knee-angle parameters. 
To address adherence to\nrehabilitation programme and quantify exercise sets and repetitions, our\nadaptive algorithm can correctly identify (91.67%-100%) of all exercises from\nside- and front-view videos. Transparent algorithm design for adaptive visual\nanalysis of various knee exercise patterns contributes towards the\ninterpretable AI and will inform near-future privacy-preserving, non-vendor\nlocking, open-source developments for both end-user computing devices and as\non-premises non-proprietary cloud platforms that can be deployed within the\nnational healthcare system.\n","authors":["Boris Bačić","Claudiu Vasile","Chengwei Feng","Marian G. Ciucă"],"pdf_url":"https://arxiv.org/pdf/2412.20733v1.pdf","comment":"The original work citation: Ba\\v{c}i\\'c, B., Claudiu Vasile, Feng,\n C., & Ciuc\\u{a}, M. G. (2024, 13-15 Dec.). Towards nation-wide analytical\n healthcare infrastructures: A privacy-preserving augmented knee\n rehabilitation case study. Presented at the Conference on Innovative\n Technologies in Intelligent Systems & Industrial Applications (CITISIA 2024),\n Sydney, NSW"},{"id":"http://arxiv.org/abs/2409.15157v2","updated":"2024-12-30T05:20:35Z","published":"2024-09-23T16:04:50Z","title":"LoVA: Long-form Video-to-Audio Generation","summary":" Video-to-audio (V2A) generation is important for video editing and\npost-processing, enabling the creation of semantics-aligned audio for silent\nvideo. However, most existing methods focus on generating short-form audio for\nshort video segment (less than 10 seconds), while giving little attention to\nthe scenario of long-form video inputs. For current UNet-based diffusion V2A\nmodels, an inevitable problem when handling long-form audio generation is the\ninconsistencies within the final concatenated audio. In this paper, we first\nhighlight the importance of long-form V2A problem. Besides, we propose LoVA, a\nnovel model for Long-form Video-to-Audio generation. 
Based on the Diffusion\nTransformer (DiT) architecture, LoVA proves to be more effective at generating\nlong-form audio compared to existing autoregressive models and UNet-based\ndiffusion models. Extensive objective and subjective experiments demonstrate\nthat LoVA achieves comparable performance on 10-second V2A benchmark and\noutperforms all other baselines on a benchmark with long-form video input.\n","authors":["Xin Cheng","Xihua Wang","Yihan Wu","Yuyue Wang","Ruihua Song"],"pdf_url":"https://arxiv.org/pdf/2409.15157v2.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.20715v1","updated":"2024-12-30T05:07:34Z","published":"2024-12-30T05:07:34Z","title":"ChartAdapter: Large Vision-Language Model for Chart Summarization","summary":" Chart summarization, which focuses on extracting key information from charts\nand interpreting it in natural language, is crucial for generating and\ndelivering insights through effective and accessible data analysis. Traditional\nmethods for chart understanding and summarization often rely on multi-stage\npipelines, which may produce suboptimal semantic alignment between visual and\ntextual information. In comparison, recently developed LLM-based methods are\nmore dependent on the capability of foundation images or languages, while\nignoring the characteristics of chart data and its relevant challenges. To\naddress these limitations, we propose ChartAdapter, a novel lightweight\ntransformer module designed to bridge the gap between charts and textual\nsummaries. ChartAdapter employs learnable query vectors to extract implicit\nsemantics from chart data and incorporates a cross-modal alignment projector to\nenhance vision-to-language generative learning. 
By integrating ChartAdapter\nwith an LLM, we enable end-to-end training and efficient chart summarization.\nTo further enhance the training, we introduce a three-stage hierarchical\ntraining procedure and develop a large-scale dataset specifically curated for\nchart summarization, comprising 190,618 samples. Experimental results on the\nstandard Chart-to-Text testing set demonstrate that our approach significantly\noutperforms existing methods, including state-of-the-art models, in generating\nhigh-quality chart summaries. Ablation studies further validate the\neffectiveness of key components in ChartAdapter. This work highlights the\npotential of tailored LLM-based approaches to advance chart understanding and\nsets a strong foundation for future research in this area.\n","authors":["Peixin Xu","Yujuan Ding","Wenqi Fan"],"pdf_url":"https://arxiv.org/pdf/2412.20715v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20685v1","updated":"2024-12-30T03:36:07Z","published":"2024-12-30T03:36:07Z","title":"MarsSQE: Stereo Quality Enhancement for Martian Images Using Bi-level\n Cross-view Attention","summary":" Stereo images captured by Mars rovers are transmitted after lossy compression\ndue to the limited bandwidth between Mars and Earth. Unfortunately, this\nprocess results in undesirable compression artifacts. In this paper, we present\na novel stereo quality enhancement approach for Martian images, named MarsSQE.\nFirst, we establish the first dataset of stereo Martian images. Through\nextensive analysis of this dataset, we observe that cross-view correlations in\nMartian images are notably high. Leveraging this insight, we design a bi-level\ncross-view attention-based quality enhancement network that fully exploits\nthese inherent cross-view correlations. Specifically, our network integrates\npixel-level attention for precise matching and patch-level attention for\nbroader contextual information. 
Experimental results demonstrate the\neffectiveness of our MarsSQE approach.\n","authors":["Mai Xu","Yinglin Zhu","Qunliang Xing","Jing Yang","Xin Zou"],"pdf_url":"https://arxiv.org/pdf/2412.20685v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18619v2","updated":"2024-12-30T03:00:30Z","published":"2024-12-16T05:02:25Z","title":"Next Token Prediction Towards Multimodal Intelligence: A Comprehensive\n Survey","summary":" Building on the foundations of language modeling in natural language\nprocessing, Next Token Prediction (NTP) has evolved into a versatile training\nobjective for machine learning tasks across various modalities, achieving\nconsiderable success. As Large Language Models (LLMs) have advanced to unify\nunderstanding and generation tasks within the textual modality, recent research\nhas shown that tasks from different modalities can also be effectively\nencapsulated within the NTP framework, transforming the multimodal information\ninto tokens and predict the next one given the context. This survey introduces\na comprehensive taxonomy that unifies both understanding and generation within\nmultimodal learning through the lens of NTP. The proposed taxonomy covers five\nkey aspects: Multimodal tokenization, MMNTP model architectures, unified task\nrepresentation, datasets \\& evaluation, and open challenges. This new taxonomy\naims to aid researchers in their exploration of multimodal intelligence. 
An\nassociated GitHub repository collecting the latest papers and repos is\navailable at https://github.com/LMM101/Awesome-Multimodal-Next-Token-Prediction\n","authors":["Liang Chen","Zekun Wang","Shuhuai Ren","Lei Li","Haozhe Zhao","Yunshui Li","Zefan Cai","Hongcheng Guo","Lei Zhang","Yizhe Xiong","Yichi Zhang","Ruoyu Wu","Qingxiu Dong","Ge Zhang","Jian Yang","Lingwei Meng","Shujie Hu","Yulong Chen","Junyang Lin","Shuai Bai","Andreas Vlachos","Xu Tan","Minjia Zhang","Wen Xiao","Aaron Yee","Tianyu Liu","Baobao Chang"],"pdf_url":"https://arxiv.org/pdf/2412.18619v2.pdf","comment":"69 papes, 18 figures, repo at\n https://github.com/LMM101/Awesome-Multimodal-Next-Token-Prediction"},{"id":"http://arxiv.org/abs/2412.20665v1","updated":"2024-12-30T02:47:51Z","published":"2024-12-30T02:47:51Z","title":"SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection","summary":" With the rapid advancement of remote sensing technology, high-resolution\nmulti-modal imagery is now more widely accessible. Conventional Object\ndetection models are trained on a single dataset, often restricted to a\nspecific imaging modality and annotation format. However, such an approach\noverlooks the valuable shared knowledge across multi-modalities and limits the\nmodel's applicability in more versatile scenarios. This paper introduces a new\ntask called Multi-Modal Datasets and Multi-Task Object Detection (M2Det) for\nremote sensing, designed to accurately detect horizontal or oriented objects\nfrom any sensor modality. This task poses challenges due to 1) the trade-offs\ninvolved in managing multi-modal modelling and 2) the complexities of\nmulti-task optimization. To address these, we establish a benchmark dataset and\npropose a unified model, SM3Det (Single Model for Multi-Modal datasets and\nMulti-Task object Detection). 
SM3Det leverages a grid-level sparse MoE backbone\nto enable joint knowledge learning while preserving distinct feature\nrepresentations for different modalities. Furthermore, it integrates a\nconsistency and synchronization optimization strategy using dynamic learning\nrate adjustment, allowing it to effectively handle varying levels of learning\ndifficulty across modalities and tasks. Extensive experiments demonstrate\nSM3Det's effectiveness and generalizability, consistently outperforming\nspecialized models on individual datasets. The code is available at\nhttps://github.com/zcablii/SM3Det.\n","authors":["Yuxuan Li","Xiang Li","Yunheng Li","Yicheng Zhang","Yimian Dai","Qibin Hou","Ming-Ming Cheng","Jian Yang"],"pdf_url":"https://arxiv.org/pdf/2412.20665v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20833v1","updated":"2024-12-30T09:58:27Z","published":"2024-12-30T09:58:27Z","title":"Inclusion 2024 Global Multimedia Deepfake Detection: Towards\n Multi-dimensional Facial Forgery Detection","summary":" In this paper, we present the Global Multimedia Deepfake Detection held\nconcurrently with the Inclusion 2024. Our Multimedia Deepfake Detection aims to\ndetect automatic image and audio-video manipulations including but not limited\nto editing, synthesis, generation, Photoshop,etc. Our challenge has attracted\n1500 teams from all over the world, with about 5000 valid result submission\ncounts. We invite the top 20 teams to present their solutions to the challenge,\nfrom which the top 3 teams are awarded prizes in the grand finale. In this\npaper, we present the solutions from the top 3 teams of the two tracks, to\nboost the research work in the field of image and audio-video forgery\ndetection. 
The methodologies developed through the challenge will contribute to\nthe development of next-generation deepfake detection systems and we encourage\nparticipants to open source their methods.\n","authors":["Yi Zhang","Weize Gao","Changtao Miao","Man Luo","Jianshu Li","Wenzhong Deng","Zhe Li","Bingyu Hu","Weibin Yao","Wenbo Zhou","Tao Gong","Qi Chu"],"pdf_url":"https://arxiv.org/pdf/2412.20833v1.pdf","comment":"Inclusion 2024 Global Multimedia Deepfake Detection Competition Top\n Team Technical Report"}]},"2024-12-29T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2407.07035v2","updated":"2024-12-29T23:16:37Z","published":"2024-07-09T16:53:36Z","title":"Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era\n of Foundation Models","summary":" Vision-and-Language Navigation (VLN) has gained increasing attention over\nrecent years and many approaches have emerged to advance their development. The\nremarkable achievements of foundation models have shaped the challenges and\nproposed methods for VLN research. In this survey, we provide a top-down review\nthat adopts a principled framework for embodied planning and reasoning, and\nemphasizes the current methods and future opportunities leveraging foundation\nmodels to address VLN challenges. 
We hope our in-depth discussions could\nprovide valuable resources and insights: on one hand, to milestone the progress\nand explore opportunities and potential roles for foundation models in this\nfield, and on the other, to organize different challenges and solutions in VLN\nto foundation model researchers.\n","authors":["Yue Zhang","Ziqiao Ma","Jialu Li","Yanyuan Qiao","Zun Wang","Joyce Chai","Qi Wu","Mohit Bansal","Parisa Kordjamshidi"],"pdf_url":"https://arxiv.org/pdf/2407.07035v2.pdf","comment":"Authors contributed equally to this work, and supervisors contributed\n equal advising to this work; GitHub repository:\n https://github.com/zhangyuejoslin/VLN-Survey-with-Foundation-Models"},{"id":"http://arxiv.org/abs/2412.20602v1","updated":"2024-12-29T22:14:59Z","published":"2024-12-29T22:14:59Z","title":"NLP-based Regulatory Compliance -- Using GPT 4.0 to Decode Regulatory\n Documents","summary":" Large Language Models (LLMs) such as GPT-4.0 have shown significant promise\nin addressing the semantic complexities of regulatory documents, particularly\nin detecting inconsistencies and contradictions. This study evaluates GPT-4.0's\nability to identify conflicts within regulatory requirements by analyzing a\ncurated corpus with artificially injected ambiguities and contradictions,\ndesigned in collaboration with architects and compliance engineers. Using\nmetrics such as precision, recall, and F1 score, the experiment demonstrates\nGPT-4.0's effectiveness in detecting inconsistencies, with findings validated\nby human experts. The results highlight the potential of LLMs to enhance\nregulatory compliance processes, though further testing with larger datasets\nand domain-specific fine-tuning is needed to maximize accuracy and practical\napplicability. 
Future work will explore automated conflict resolution and\nreal-world implementation through pilot projects with industry partners.\n","authors":["Bimal Kumar","Dmitri Roussinov"],"pdf_url":"https://arxiv.org/pdf/2412.20602v1.pdf","comment":"accepted for presentation at Georg Nemetschek Institute Symposium &\n Expo on Artificial Intelligence for the Built World - Munich, Germany. 12\n Sept 2024"},{"id":"http://arxiv.org/abs/2412.20597v1","updated":"2024-12-29T22:02:00Z","published":"2024-12-29T22:02:00Z","title":"GliLem: Leveraging GliNER for Contextualized Lemmatization in Estonian","summary":" We present GliLem -- a novel hybrid lemmatization system for Estonian that\nenhances the highly accurate rule-based morphological analyzer Vabamorf with an\nexternal disambiguation module based on GliNER -- an open vocabulary NER model\nthat is able to match text spans with text labels in natural language. We\nleverage the flexibility of a pre-trained GliNER model to improve the\nlemmatization accuracy of Vabamorf by 10\\% compared to its original\ndisambiguation module and achieve an improvement over the token\nclassification-based baseline. To measure the impact of improvements in\nlemmatization accuracy on the information retrieval downstream task, we first\ncreated an information retrieval dataset for Estonian by automatically\ntranslating the DBpedia-Entity dataset from English. We benchmark several token\nnormalization approaches, including lemmatization, on the created dataset using\nthe BM25 algorithm. We observe a substantial improvement in IR metrics when\nusing lemmatization over simplistic stemming. 
The benefits of improving lemma\ndisambiguation accuracy manifest in small but consistent improvement in the IR\nrecall measure, especially in the setting of high k.\n","authors":["Aleksei Dorkin","Kairit Sirts"],"pdf_url":"https://arxiv.org/pdf/2412.20597v1.pdf","comment":"Accepted to NoDaLiDa/Baltic-HLT 2025"},{"id":"http://arxiv.org/abs/2412.20595v1","updated":"2024-12-29T21:54:39Z","published":"2024-12-29T21:54:39Z","title":"Controlling Out-of-Domain Gaps in LLMs for Genre Classification and\n Generated Text Detection","summary":" This study demonstrates that the modern generation of Large Language Models\n(LLMs, such as GPT-4) suffers from the same out-of-domain (OOD) performance gap\nobserved in prior research on pre-trained Language Models (PLMs, such as BERT).\nWe demonstrate this across two non-topical classification tasks: 1) genre\nclassification and 2) generated text detection. Our results show that when\ndemonstration examples for In-Context Learning (ICL) come from one domain\n(e.g., travel) and the system is tested on another domain (e.g., history),\nclassification performance declines significantly.\n To address this, we introduce a method that controls which predictive\nindicators are used and which are excluded during classification. For the two\ntasks studied here, this ensures that topical features are omitted, while the\nmodel is guided to focus on stylistic rather than content-based attributes.\nThis approach reduces the OOD gap by up to 20 percentage points in a few-shot\nsetup. 
Straightforward Chain-of-Thought (CoT) methods, used as the baseline,\nprove insufficient, while our approach consistently enhances domain transfer\nperformance.\n","authors":["Dmitri Roussinov","Serge Sharoff","Nadezhda Puchnina"],"pdf_url":"https://arxiv.org/pdf/2412.20595v1.pdf","comment":"The 31st International Conference on Computational Linguistics"},{"id":"http://arxiv.org/abs/2412.20584v1","updated":"2024-12-29T21:12:39Z","published":"2024-12-29T21:12:39Z","title":"Towards Neural No-Resource Language Translation: A Comparative\n Evaluation of Approaches","summary":" No-resource languages - those with minimal or no digital representation -\npose unique challenges for machine translation (MT). Unlike low-resource\nlanguages, which rely on limited but existent corpora, no-resource languages\noften have fewer than 100 sentences available for training. This work explores\nthe problem of no-resource translation through three distinct workflows:\nfine-tuning of translation-specific models, in-context learning with large\nlanguage models (LLMs) using chain-of-reasoning prompting, and direct prompting\nwithout reasoning. Using Owens Valley Paiute as a case study, we demonstrate\nthat no-resource translation demands fundamentally different approaches from\nlow-resource scenarios, as traditional approaches to machine translation, such\nas those that work for low-resource languages, fail. Empirical results reveal\nthat, although traditional approaches fail, the in-context learning\ncapabilities of general-purpose large language models enable no-resource\nlanguage translation that outperforms low-resource translation approaches and\nrivals human translations (BLEU 0.45-0.6); specifically, chain-of-reasoning\nprompting outperforms other methods for larger corpora, while direct prompting\nexhibits advantages in smaller datasets. 
As these approaches are\nlanguage-agnostic, they have potential to be generalized to translation tasks\nfrom a wide variety of no-resource languages without expert input. These\nfindings establish no-resource translation as a distinct paradigm requiring\ninnovative solutions, providing practical and theoretical insights for language\npreservation.\n","authors":["Madhavendra Thakur"],"pdf_url":"https://arxiv.org/pdf/2412.20584v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20563v1","updated":"2024-12-29T20:18:52Z","published":"2024-12-29T20:18:52Z","title":"Counterfactual Samples Constructing and Training for Commonsense\n Statements Estimation","summary":" Plausibility Estimation (PE) plays a crucial role for enabling language\nmodels to objectively comprehend the real world. While large language models\n(LLMs) demonstrate remarkable capabilities in PE tasks but sometimes produce\ntrivial commonsense errors due to the complexity of commonsense knowledge. They\nlack two key traits of an ideal PE model: a) Language-explainable: relying on\ncritical word segments for decisions, and b) Commonsense-sensitive: detecting\nsubtle linguistic variations in commonsense. To address these issues, we\npropose a novel model-agnostic method, referred to as Commonsense\nCounterfactual Samples Generating (CCSG). By training PE models with CCSG, we\nencourage them to focus on critical words, thereby enhancing both their\nlanguage-explainable and commonsense-sensitive capabilities. Specifically, CCSG\ngenerates counterfactual samples by strategically replacing key words and\nintroducing low-level dropout within sentences. These counterfactual samples\nare then incorporated into a sentence-level contrastive training framework to\nfurther enhance the model's learning process. 
Experimental results across nine\ndiverse datasets demonstrate the effectiveness of CCSG in addressing\ncommonsense reasoning challenges, with our CCSG method showing 3.07%\nimprovement against the SOTA methods.\n","authors":["Chong Liu","Zaiwen Feng","Lin Liu","Zhenyun Deng","Jiuyong Li","Ruifang Zhai","Debo Cheng","Li Qin"],"pdf_url":"https://arxiv.org/pdf/2412.20563v1.pdf","comment":"14 pages, 4 figures"},{"id":"http://arxiv.org/abs/2408.04216v2","updated":"2024-12-29T19:00:55Z","published":"2024-08-08T04:52:10Z","title":"Attention Mechanism and Context Modeling System for Text Mining Machine\n Translation","summary":" This paper advances a novel architectural schema anchored upon the\nTransformer paradigm and innovatively amalgamates the K-means categorization\nalgorithm to augment the contextual apprehension capabilities of the schema.\nThe transformer model performs well in machine translation tasks due to its\nparallel computing power and multi-head attention mechanism. However, it may\nencounter contextual ambiguity or ignore local features when dealing with\nhighly complex language structures. To circumvent this constraint, this\nexposition incorporates the K-Means algorithm, which is used to stratify the\nlexis and idioms of the input textual matter, thereby facilitating superior\nidentification and preservation of the local structure and contextual\nintelligence of the language. The advantage of this combination is that K-Means\ncan automatically discover the topic or concept regions in the text, which may\nbe directly related to translation quality. Consequently, the schema contrived\nherein enlists K-Means as a preparatory phase antecedent to the Transformer and\nrecalibrates the multi-head attention weights to assist in the discrimination\nof lexis and idioms bearing analogous semantics or functionalities. 
This\nensures the schema accords heightened regard to the contextual intelligence\nembodied by these clusters during the training phase, rather than merely\nfocusing on locational intelligence.\n","authors":["Yuwei Zhang","Junming Huang","Sitong Liu","Zexi Chen","Zizheng Li"],"pdf_url":"https://arxiv.org/pdf/2408.04216v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10246v3","updated":"2024-12-29T18:54:51Z","published":"2023-07-17T06:54:36Z","title":"Deep Neural Networks and Brain Alignment: Brain Encoding and Decoding\n (Survey)","summary":" Can artificial intelligence unlock the secrets of the human brain? How do the\ninner mechanisms of deep learning models relate to our neural circuits? Is it\npossible to enhance AI by tapping into the power of brain recordings? These\ncaptivating questions lie at the heart of an emerging field at the intersection\nof neuroscience and artificial intelligence. Our survey dives into this\nexciting domain, focusing on human brain recording studies and cutting-edge\ncognitive neuroscience datasets that capture brain activity during natural\nlanguage processing, visual perception, and auditory experiences. We explore\ntwo fundamental approaches: encoding models, which attempt to generate brain\nactivity patterns from sensory inputs; and decoding models, which aim to\nreconstruct our thoughts and perceptions from neural signals. These techniques\nnot only promise breakthroughs in neurological diagnostics and brain-computer\ninterfaces but also offer a window into the very nature of cognition. In this\nsurvey, we first discuss popular representations of language, vision, and\nspeech stimuli, and present a summary of neuroscience datasets. We then review\nhow the recent advances in deep learning transformed this field, by\ninvestigating the popular deep learning based encoding and decoding\narchitectures, noting their benefits and limitations across different sensory\nmodalities. 
From text to images, speech to videos, we investigate how these\nmodels capture the brain's response to our complex, multimodal world. While our\nprimary focus is on human studies, we also highlight the crucial role of animal\nmodels in advancing our understanding of neural mechanisms. Throughout, we\nmention the ethical implications of these powerful technologies, addressing\nconcerns about privacy and cognitive liberty. We conclude with a summary and\ndiscussion of future trends in this rapidly evolving field.\n","authors":["Subba Reddy Oota","Zijiao Chen","Manish Gupta","Raju S. Bapi","Gael Jobard","Frederic Alexandre","Xavier Hinaut"],"pdf_url":"https://arxiv.org/pdf/2307.10246v3.pdf","comment":"61 pages, 22 figures"},{"id":"http://arxiv.org/abs/2412.20545v1","updated":"2024-12-29T18:34:10Z","published":"2024-12-29T18:34:10Z","title":"The Impact of Prompt Programming on Function-Level Code Generation","summary":" Large Language Models (LLMs) are increasingly used by software engineers for\ncode generation. However, limitations of LLMs such as irrelevant or incorrect\ncode have highlighted the need for prompt programming (or prompt engineering)\nwhere engineers apply specific prompt techniques (e.g., chain-of-thought or\ninput-output examples) to improve the generated code. Despite this, the impact\nof different prompt techniques -- and their combinations -- on code generation\nremains underexplored. In this study, we introduce CodePromptEval, a dataset of\n7072 prompts designed to evaluate five prompt techniques (few-shot, persona,\nchain-of-thought, function signature, list of packages) and their effect on the\ncorrectness, similarity, and quality of complete functions generated by three\nLLMs (GPT-4o, Llama3, and Mistral). Our findings show that while certain prompt\ntechniques significantly influence the generated code, combining multiple\ntechniques does not necessarily improve the outcome. 
Additionally, we observed\na trade-off between correctness and quality when using prompt techniques. Our\ndataset and replication package enable future research on improving\nLLM-generated code and evaluating new prompt techniques.\n","authors":["Ranim Khojah","Francisco Gomes de Oliveira Neto","Mazen Mohamad","Philipp Leitner"],"pdf_url":"https://arxiv.org/pdf/2412.20545v1.pdf","comment":"CodePromptEval dataset and replication package on GitHub:\n https://github.com/icetlab/CodePromptEval"},{"id":"http://arxiv.org/abs/2410.14651v2","updated":"2024-12-29T18:22:27Z","published":"2024-10-18T17:47:11Z","title":"Real-time Fake News from Adversarial Feedback","summary":" We show that existing evaluations for fake news detection based on\nconventional sources, such as claims on fact-checking websites, result in high\naccuracies over time for LLM-based detectors -- even after their knowledge\ncutoffs. This suggests that recent popular fake news from such sources can be\neasily detected due to pre-training and retrieval corpus contamination or\nincreasingly salient shallow patterns. Instead, we argue that a proper fake\nnews detection dataset should test a model's ability to reason factually about\nthe current world by retrieving and reading related evidence. To this end, we\ndevelop a novel pipeline that leverages natural language feedback from a\nRAG-based detector to iteratively modify real-time news into deceptive fake\nnews that challenges LLMs. Our iterative rewrite decreases the binary\nclassification ROC-AUC by an absolute 17.5 percent for a strong RAG-based\nGPT-4o detector. 
Our experiments reveal the important role of RAG in both\ndetecting and generating fake news, as retrieval-free LLM detectors are\nvulnerable to unseen events and adversarial attacks, while feedback from RAG\ndetection helps discover more deceitful patterns in fake news.\n","authors":["Sanxing Chen","Yukun Huang","Bhuwan Dhingra"],"pdf_url":"https://arxiv.org/pdf/2410.14651v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20541v1","updated":"2024-12-29T18:16:28Z","published":"2024-12-29T18:16:28Z","title":"SAFE-MEME: Structured Reasoning Framework for Robust Hate Speech\n Detection in Memes","summary":" Memes act as cryptic tools for sharing sensitive ideas, often requiring\ncontextual knowledge to interpret. This makes moderating multimodal memes\nchallenging, as existing works either lack high-quality datasets on nuanced\nhate categories or rely on low-quality social media visuals. Here, we curate\ntwo novel multimodal hate speech datasets, MHS and MHS-Con, that capture\nfine-grained hateful abstractions in regular and confounding scenarios,\nrespectively. We benchmark these datasets against several competing baselines.\nFurthermore, we introduce SAFE-MEME (Structured reAsoning FramEwork), a novel\nmultimodal Chain-of-Thought-based framework employing Q&A-style reasoning\n(SAFE-MEME-QA) and hierarchical categorization (SAFE-MEME-H) to enable robust\nhate speech detection in memes. SAFE-MEME-QA outperforms existing baselines,\nachieving an average improvement of approximately 5% and 4% on MHS and MHS-Con,\nrespectively. In comparison, SAFE-MEME-H achieves an average improvement of 6%\nin MHS while outperforming only multimodal baselines in MHS-Con. We show that\nfine-tuning a single-layer adapter within SAFE-MEME-H outperforms fully\nfine-tuned models in regular fine-grained hateful meme detection. However, the\nfully fine-tuning approach with a Q&A setup is more effective for handling\nconfounding cases. 
We also systematically examine the error cases, offering\nvaluable insights into the robustness and limitations of the proposed\nstructured reasoning framework for analyzing hateful memes.\n","authors":["Palash Nandi","Shivam Sharma","Tanmoy Chakraborty"],"pdf_url":"https://arxiv.org/pdf/2412.20541v1.pdf","comment":"28 pages, 15 figures, 6 tables"},{"id":"http://arxiv.org/abs/2410.24159v2","updated":"2024-12-29T18:00:05Z","published":"2024-10-31T17:18:11Z","title":"GPT or BERT: why not both?","summary":" We present a simple way to merge masked language modeling with causal\nlanguage modeling. This hybrid training objective results in a model that\ncombines the strengths of both modeling paradigms within a single transformer\nstack: GPT-BERT can be transparently used like any standard causal or masked\nlanguage model. We test the pretraining process that enables this flexible\nbehavior on the BabyLM Challenge 2024. The results show that the hybrid\npretraining outperforms masked-only or causal-only models. We openly release\nthe models, training corpora and code.\n","authors":["Lucas Georges Gabriel Charpentier","David Samuel"],"pdf_url":"https://arxiv.org/pdf/2410.24159v2.pdf","comment":"22 pages; submission to the BabyLM Challenge 2024"},{"id":"http://arxiv.org/abs/2412.20504v1","updated":"2024-12-29T15:42:24Z","published":"2024-12-29T15:42:24Z","title":"ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video\n Understanding","summary":" Video Large Language Models (VideoLLMs) have achieved remarkable progress in\nvideo understanding. However, existing VideoLLMs often inherit the limitations\nof their backbone LLMs in handling long sequences, leading to challenges for\nlong video understanding. Common solutions either simply uniformly sample\nvideos' frames or compress visual tokens, which focus primarily on low-level\ntemporal visual redundancy, overlooking high-level knowledge redundancy. This\nlimits the achievable compression rate with minimal loss. 
To this end. we\nintroduce a training-free method, $\\textbf{ReTaKe}$, containing two novel\nmodules DPSelect and PivotKV, to jointly model and reduce both temporal visual\nredundancy and knowledge redundancy for long video understanding. Specifically,\nDPSelect identifies keyframes with local maximum peak distance based on their\nvisual features, which are closely aligned with human video perception. PivotKV\nemploys the obtained keyframes as pivots and conducts KV-Cache compression for\nthe non-pivot tokens with low attention scores, which are derived from the\nlearned prior knowledge of LLMs. Experiments on benchmarks VideoMME, MLVU, and\nLVBench, show that ReTaKe can support 4x longer video sequences with minimal\nperformance loss (<1%) and outperform all similar-size VideoLLMs with 3%-5%,\neven surpassing or on par with much larger ones. Our code is available at\nhttps://github.com/SCZwangxiao/video-ReTaKe\n","authors":["Xiao Wang","Qingyi Si","Jianlong Wu","Shiyu Zhu","Li Cao","Liqiang Nie"],"pdf_url":"https://arxiv.org/pdf/2412.20504v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20476v1","updated":"2024-12-29T14:29:34Z","published":"2024-12-29T14:29:34Z","title":"Cut the Deadwood Out: Post-Training Model Purification with Selective\n Module Substitution","summary":" The success of DNNs often depends on training with large-scale datasets, but\nbuilding such datasets is both expensive and challenging. Consequently, public\ndatasets from open-source platforms like HuggingFace have become popular,\nposing significant risks of data poisoning attacks. Existing backdoor defenses\nin NLP primarily focus on identifying and removing poisoned samples; however,\npurifying a backdoored model with these sample-cleaning approaches typically\nrequires expensive retraining. 
Therefore, we propose Greedy Module Substitution\n(GMS), which identifies and substitutes ''deadwood'' modules (i.e., components\ncritical to backdoor pathways) in a backdoored model to purify it. Our method\nrelaxes the common dependency of prior model purification methods on clean\ndatasets or clean auxiliary models. When applied to RoBERTa-large under\nbackdoor attacks, GMS demonstrates strong effectiveness across various\nsettings, particularly against widely recognized challenging attacks like LWS,\nachieving a post-purification attack success rate (ASR) of 9.7% on SST-2\ncompared to 58.8% for the best baseline approach.\n","authors":["Yao Tong","Weijun Li","Xuanli He","Haolan Zhan","Qiongkai Xu"],"pdf_url":"https://arxiv.org/pdf/2412.20476v1.pdf","comment":"preprint"},{"id":"http://arxiv.org/abs/2412.20467v1","updated":"2024-12-29T13:45:11Z","published":"2024-12-29T13:45:11Z","title":"Utilizing Multimodal Data for Edge Case Robust Call-sign Recognition and\n Understanding","summary":" Operational machine-learning based assistant systems must be robust in a wide\nrange of scenarios. This holds especially true for the air-traffic control (ATC)\ndomain. The robustness of an architecture is particularly evident in edge\ncases, such as high word error rate (WER) transcripts resulting from noisy ATC\nrecordings or partial transcripts due to clipped recordings. To increase the\nedge-case robustness of call-sign recognition and understanding (CRU), a core\ntask in ATC speech processing, we propose the multimodal call-sign-command\nrecovery model (CCR). The CCR architecture leads to an increase in the edge\ncase performance of up to 15%. We demonstrate this on our second proposed\narchitecture, CallSBERT, a CRU model that has fewer parameters, can be\nfine-tuned noticeably faster, and is more robust during fine-tuning than the\nstate of the art for CRU. 
Furthermore, we demonstrate that optimizing for edge\ncases leads to a significantly higher accuracy across a wide operational range.\n","authors":["Alexander Blatt","Dietrich Klakow"],"pdf_url":"https://arxiv.org/pdf/2412.20467v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.12869v2","updated":"2024-12-29T13:08:04Z","published":"2024-10-14T01:57:25Z","title":"Language Model Preference Evaluation with Multiple Weak Evaluators","summary":" Despite the remarkable success of Large Language Models (LLMs), evaluating\ntheir outputs' quality regarding *preference* remains a critical challenge.\nExisting works usually leverage a powerful LLM (e.g., GPT4) as the judge for\ncomparing LLMs' outputs pairwise, yet such a model-based evaluator is vulnerable\nto *conflicting preference*, i.e., output A is better than B, B than C, but C\nthan A, causing contradictory evaluation results. To improve model-based\npreference evaluation, we introduce GED (Preference Graph Ensemble and\nDenoise), a novel approach that leverages multiple model-based evaluators to\nconstruct preference graphs, and then ensemble and denoise these graphs for\nbetter, non-contradictory evaluation results. In particular, our method\nconsists of two primary stages: aggregating evaluations into a unified graph\nand applying a denoising process to eliminate cyclic inconsistencies, ensuring\na directed acyclic graph (DAG) structure. We provide theoretical guarantees for\nour framework, demonstrating its efficacy in recovering the ground truth\npreference structure. Extensive experiments across ten benchmark datasets show\nthat GED outperforms baseline methods in model ranking, response selection, and\nmodel alignment tasks. 
Notably, GED combines weaker evaluators like Llama3-8B,\nMistral-7B, and Qwen2-7B to surpass the performance of stronger evaluators like\nQwen2-72B, highlighting its ability to enhance evaluation reliability and\nimprove model performance.\n","authors":["Zhengyu Hu","Jieyu Zhang","Zhihan Xiong","Alexander Ratner","Hui Xiong","Ranjay Krishna"],"pdf_url":"https://arxiv.org/pdf/2410.12869v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20440v1","updated":"2024-12-29T11:33:51Z","published":"2024-12-29T11:33:51Z","title":"Enhancing Entertainment Translation for Indian Languages using Adaptive\n Context, Style and LLMs","summary":" We address the challenging task of neural machine translation (NMT) in the\nentertainment domain, where the objective is to automatically translate a given\ndialogue from a source language content to a target language. This task has\nvarious applications, particularly in automatic dubbing, subtitling, and other\ncontent localization tasks, enabling source content to reach a wider audience.\nTraditional NMT systems typically translate individual sentences in isolation,\nwithout facilitating knowledge transfer of crucial elements such as the context\nand style from previously encountered sentences. In this work, we emphasize the\nsignificance of these fundamental aspects in producing pertinent and\ncaptivating translations. We demonstrate their significance through several\nexamples and propose a novel framework for entertainment translation, which, to\nour knowledge, is the first of its kind. Furthermore, we introduce an algorithm\nto estimate the context and style of the current session and use these\nestimations to generate a prompt that guides a Large Language Model (LLM) to\ngenerate high-quality translations. Our method is both language and\nLLM-agnostic, making it a general-purpose tool. 
We demonstrate the\neffectiveness of our algorithm through various numerical studies and observe\nsignificant improvement in the COMET scores over various state-of-the-art LLMs.\nMoreover, our proposed method consistently outperforms baseline LLMs in terms\nof win-ratio.\n","authors":["Pratik Rakesh Singh","Mohammadi Zaki","Pankaj Wasnik"],"pdf_url":"https://arxiv.org/pdf/2412.20440v1.pdf","comment":"Accepted to AAAI'25"},{"id":"http://arxiv.org/abs/2412.20438v1","updated":"2024-12-29T11:25:03Z","published":"2024-12-29T11:25:03Z","title":"Integrating Natural Language Processing Techniques of Text Mining Into\n Financial System: Applications and Limitations","summary":" The financial sector, a pivotal force in economic development, increasingly\nuses intelligent technologies such as natural language processing to\nenhance data processing and insight extraction. Through a review covering\n2018-2023, this research paper explores the use of text mining and other\nnatural language processing techniques in various components of the financial\nsystem, including asset pricing, corporate finance, derivatives, risk\nmanagement, and public finance, and highlights the need to address the specific\nproblems in the discussion section. We notice that most of the research\nmaterials combined probabilistic with vector-space models, and text data with\nnumerical data. The most used information processing technique is\nclassification, and the most used algorithms include\nlong short-term memory and bidirectional encoder models. The research notes\nthat new specialized algorithms are being developed and that the focus is\nmainly on the asset pricing component of the financial system. The research also proposes a path\nfrom an engineering perspective for researchers who need to analyze financial\ntext. 
Text mining challenges such as data quality,\ncontext adaptation, and model interpretability need to be solved so as to integrate\nadvanced natural language processing models and techniques into financial\nanalysis and prediction. Keywords: Financial System (FS), Natural\nLanguage Processing (NLP), Software and Text Engineering, Probabilistic,\nVector-Space, Models, Techniques, Text Data, Financial Analysis.\n","authors":["Denisa Millo","Blerina Vika","Nevila Baci"],"pdf_url":"https://arxiv.org/pdf/2412.20438v1.pdf","comment":"6 pages, 5 figures, 1 table"},{"id":"http://arxiv.org/abs/2412.20414v1","updated":"2024-12-29T09:47:14Z","published":"2024-12-29T09:47:14Z","title":"Comparative Performance of Advanced NLP Models and LLMs in Multilingual\n Geo-Entity Detection","summary":" The integration of advanced Natural Language Processing (NLP) methodologies\nand Large Language Models (LLMs) has significantly enhanced the extraction and\nanalysis of geospatial data from multilingual texts, impacting sectors such as\nnational and international security. This paper presents a comprehensive\nevaluation of leading NLP models -- SpaCy, XLM-RoBERTa, mLUKE, GeoLM -- and\nLLMs, specifically OpenAI's GPT 3.5 and GPT 4, within the context of\nmultilingual geo-entity detection. Utilizing datasets from Telegram channels in\nEnglish, Russian, and Arabic, we examine the performance of these models\nthrough metrics such as accuracy, precision, recall, and F1 scores, to assess\ntheir effectiveness in accurately identifying geospatial references. The\nanalysis exposes each model's distinct advantages and challenges, underscoring\nthe complexities involved in achieving precise geo-entity identification across\nvaried linguistic landscapes. 
The conclusions drawn from this experiment aim to\ndirect the enhancement and creation of more advanced and inclusive NLP tools,\nthus advancing the field of geospatial analysis and its application to global\nsecurity.\n","authors":["Kalin Kopanov"],"pdf_url":"https://arxiv.org/pdf/2412.20414v1.pdf","comment":"6 pages, 1 table, AICCONF '24: Cognitive Models and Artificial\n Intelligence Conference, Istanbul, Turkey"},{"id":"http://arxiv.org/abs/2412.20412v1","updated":"2024-12-29T09:35:56Z","published":"2024-12-29T09:35:56Z","title":"Multi-Objective Large Language Model Unlearning","summary":" Machine unlearning in the domain of large language models (LLMs) has\nattracted great attention recently, which aims to effectively eliminate\nundesirable behaviors from LLMs without full retraining from scratch. In this\npaper, we explore the Gradient Ascent (GA) approach in LLM unlearning, which is\na proactive way to decrease the prediction probability of the model on the\ntarget data in order to remove their influence. We analyze two challenges that\nrender the process impractical: gradient explosion and catastrophic forgetting.\nTo address these issues, we propose Multi-Objective Large Language Model\nUnlearning (MOLLM) algorithm. We first formulate LLM unlearning as a\nmulti-objective optimization problem, in which the cross-entropy loss is\nmodified to the unlearning version to overcome the gradient explosion issue. A\ncommon descent update direction is then calculated, which enables the model to\nforget the target data while preserving the utility of the LLM. 
Our empirical\nresults verify that MOLLM outperforms the SOTA GA-based LLM unlearning methods\nin terms of unlearning effect and model utility preservation.\n","authors":["Zibin Pan","Shuwen Zhang","Yuesheng Zheng","Chi Li","Yuheng Cheng","Junhua Zhao"],"pdf_url":"https://arxiv.org/pdf/2412.20412v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20406v1","updated":"2024-12-29T09:10:52Z","published":"2024-12-29T09:10:52Z","title":"A Multidisciplinary Approach to Telegram Data Analysis","summary":" This paper presents a multidisciplinary approach to analyzing data from\nTelegram for early warning information regarding cyber threats. With the\nproliferation of hacktivist groups utilizing Telegram to disseminate\ninformation regarding future cyberattacks or to boast about successful ones,\nthe need for effective data analysis methods is paramount. The primary\nchallenge lies in the vast number of channels and the overwhelming volume of\ndata, necessitating advanced techniques for discerning pertinent risks amidst\nthe noise. To address this challenge, we employ a combination of neural network\narchitectures and traditional machine learning algorithms. These methods are\nutilized to classify and identify potential cyber threats within the Telegram\ndata. Additionally, sentiment analysis and entity recognition techniques are\nincorporated to provide deeper insights into the nature and context of the\ncommunicated information. The study evaluates the effectiveness of each method\nin detecting and categorizing cyber threats, comparing their performance and\nidentifying areas for improvement. By leveraging these diverse analytical\ntools, we aim to enhance early warning systems for cyber threats, enabling more\nproactive responses to potential security breaches. 
This research contributes\nto the ongoing efforts to bolster cybersecurity measures in an increasingly\ninterconnected digital landscape.\n","authors":["Velizar Varbanov","Kalin Kopanov","Tatiana Atanasova"],"pdf_url":"https://arxiv.org/pdf/2412.20406v1.pdf","comment":"7 pages, 1 table, 2 figures, 24th International Multidisciplinary\n Scientific GeoConference SGEM 2024"},{"id":"http://arxiv.org/abs/2407.01085v3","updated":"2024-12-29T08:52:29Z","published":"2024-07-01T08:37:41Z","title":"Explaining Length Bias in LLM-Based Preference Evaluations","summary":" The use of large language models (LLMs) as judges, particularly in preference\ncomparisons, has become widespread, but this reveals a notable bias towards\nlonger responses, undermining the reliability of such evaluations. To better\nunderstand such bias, we propose to decompose the preference evaluation metric,\nspecifically the win rate, into two key components: desirability and\ninformation mass, where the former is length-independent and related to\ntrustworthiness such as correctness, toxicity, and consistency, and the latter\nis length-dependent and represents the amount of information in the response.\nWe empirically demonstrated the decomposition through controlled experiments\nand found that response length impacts evaluations by influencing information\nmass. To derive a reliable evaluation metric that assesses content quality\nwithout being confounded by response length, we propose AdapAlpaca, a simple\nyet effective adjustment to win rate measurement. 
Specifically, AdapAlpaca\nensures a fair comparison of response quality by aligning the lengths of\nreference and test model responses under equivalent length intervals.\n","authors":["Zhengyu Hu","Linxin Song","Jieyu Zhang","Zheyuan Xiao","Tianfu Wang","Zhengyu Chen","Nicholas Jing Yuan","Jianxun Lian","Kaize Ding","Hui Xiong"],"pdf_url":"https://arxiv.org/pdf/2407.01085v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.17220v2","updated":"2024-12-29T07:31:22Z","published":"2024-05-27T14:37:01Z","title":"RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness","summary":" Traditional feedback learning for hallucination reduction relies on\nlabor-intensive manual labeling or expensive proprietary models. This leaves\nthe community without foundational knowledge about how to build high-quality\nfeedback with open-source MLLMs. In this work, we introduce RLAIF-V, a novel\nframework that aligns MLLMs in a fully open-source paradigm. RLAIF-V maximally\nexplores open-source MLLMs from two perspectives, including high-quality\nfeedback data generation for preference learning and self-feedback guidance for\ninference-time scaling. Extensive experiments on six benchmarks in both\nautomatic and human evaluation show that RLAIF-V substantially enhances the\ntrustworthiness of models at both preference learning and inference time.\nRLAIF-V 7B reduces object hallucination by 80.7\\% and overall hallucination by\n33.7\\%. 
Remarkably, RLAIF-V 12B further reveals the self-alignment potential of\nopen-source MLLMs, where the model can learn from feedback of itself to achieve\nsuper GPT-4V trustworthiness.\n","authors":["Tianyu Yu","Haoye Zhang","Qiming Li","Qixin Xu","Yuan Yao","Da Chen","Xiaoman Lu","Ganqu Cui","Yunkai Dang","Taiwen He","Xiaocheng Feng","Jun Song","Bo Zheng","Zhiyuan Liu","Tat-Seng Chua","Maosong Sun"],"pdf_url":"https://arxiv.org/pdf/2405.17220v2.pdf","comment":"Project Website: https://github.com/RLHF-V/RLAIF-V"},{"id":"http://arxiv.org/abs/2412.20382v1","updated":"2024-12-29T07:02:45Z","published":"2024-12-29T07:02:45Z","title":"Natural Language Fine-Tuning","summary":" Large language model fine-tuning techniques typically depend on extensive\nlabeled data, external guidance, and feedback, such as human alignment, scalar\nrewards, and demonstration. However, in practical application, the scarcity of\nspecific knowledge poses unprecedented challenges to existing fine-tuning\ntechniques. In this paper, focusing on fine-tuning tasks in specific domains\nwith limited data, we introduce Natural Language Fine-Tuning (NLFT), which\nutilizes natural language for fine-tuning for the first time. By leveraging the\nstrong language comprehension capability of the target LM, NLFT attaches the\nguidance of natural language to the token-level outputs. Then, saliency tokens\nare identified with calculated probabilities. Since linguistic information is\neffectively utilized in NLFT, our proposed method significantly reduces\ntraining costs. It markedly enhances training efficiency, comprehensively\noutperforming reinforcement fine-tuning algorithms in accuracy, time-saving,\nand resource conservation. Additionally, on the macro level, NLFT can be viewed\nas a token-level fine-grained optimization of SFT, thereby efficiently\nreplacing the SFT process without the need for warm-up (as opposed to ReFT\nrequiring multiple rounds of warm-up with SFT). 
Compared to SFT, NLFT does not\nincrease the algorithmic complexity, maintaining O(n). Extensive experiments on\nthe GSM8K dataset demonstrate that NLFT, with only 50 data instances, achieves\nan accuracy increase that exceeds SFT by 219%. Compared to ReFT, the time\ncomplexity and space complexity of NLFT are reduced by 78.27% and 92.24%,\nrespectively. The superior technique of NLFT is paving the way for the\ndeployment of various innovative LLM fine-tuning applications when resources\nare limited at network edges.\n Our code has been released at https://github.com/Julia-LiuJ/NLFT.\n","authors":["Jia Liu","Yue Wang","Zhiqi Lin","Min Chen","Yixue Hao","Long Hu"],"pdf_url":"https://arxiv.org/pdf/2412.20382v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20372v1","updated":"2024-12-29T06:32:36Z","published":"2024-12-29T06:32:36Z","title":"LLM2: Let Large Language Models Harness System 2 Reasoning","summary":" Large language models (LLMs) have exhibited impressive capabilities across a\nmyriad of tasks, yet they occasionally yield undesirable outputs. We posit that\nthese limitations are rooted in the foundational autoregressive architecture of\nLLMs, which inherently lacks mechanisms for differentiating between desirable\nand undesirable results. Drawing inspiration from the dual-process theory of\nhuman cognition, we introduce LLM2, a novel framework that combines an LLM\n(System 1) with a process-based verifier (System 2). Within LLM2, the LLM is\nresponsible for generating plausible candidates, while the verifier provides\ntimely process-based feedback to distinguish desirable and undesirable outputs.\nThe verifier is trained with a pairwise comparison loss on synthetic\nprocess-supervision data generated through our token quality exploration\nstrategy. Empirical results on mathematical reasoning benchmarks substantiate\nthe efficacy of LLM2, exemplified by an accuracy enhancement from 50.3 to 57.8\n(+7.5) for Llama3-1B on GSM8K. 
Furthermore, when combined with\nself-consistency, LLM2 achieves additional improvements, boosting major@20\naccuracy from 56.2 to 70.2 (+14.0).\n","authors":["Cheng Yang","Chufan Shi","Siheng Li","Bo Shui","Yujiu Yang","Wai Lam"],"pdf_url":"https://arxiv.org/pdf/2412.20372v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.04512v2","updated":"2024-12-29T06:29:14Z","published":"2024-09-06T17:15:17Z","title":"Chain-of-Translation Prompting (CoTR): A Novel Prompting Technique for\n Low Resource Languages","summary":" This paper introduces Chain of Translation Prompting (CoTR), a novel strategy\ndesigned to enhance the performance of language models in low-resource\nlanguages. CoTR restructures prompts to first translate the input context from\na low-resource language into a higher-resource language, such as English. The\nspecified task like generation, classification, or any other NLP function is\nthen performed on the translated text, with the option to translate the output\nback to the original language if needed. All these steps are specified in a\nsingle prompt. We demonstrate the effectiveness of this method through a case\nstudy on the low-resource Indic language Marathi. The CoTR strategy is applied\nto various tasks, including sentiment analysis, hate speech classification,\nsubject classification and text generation, and its efficacy is showcased by\ncomparing it with regular prompting methods. Our results underscore the\npotential of translation-based prompting strategies to significantly improve\nmultilingual LLM performance in low-resource languages, offering valuable\ninsights for future research and applications. We specifically see the highest\naccuracy improvements with the hate speech detection task. 
The technique also\nhas the potential to enhance the quality of synthetic data generation for\nunderrepresented languages using LLMs.\n","authors":["Tejas Deshpande","Nidhi Kowtal","Raviraj Joshi"],"pdf_url":"https://arxiv.org/pdf/2409.04512v2.pdf","comment":"Accepted at PACLIC 38 (2024)"},{"id":"http://arxiv.org/abs/2311.09367v2","updated":"2024-12-29T06:16:05Z","published":"2023-11-15T20:59:13Z","title":"A Survey on Online User Aggression: Content Detection and Behavioral\n Analysis on Social Media","summary":" The rise of social media platforms has led to an increase in cyber-aggressive\nbehavior, encompassing a broad spectrum of hostile behavior, including\ncyberbullying, online harassment, and the dissemination of offensive and hate\nspeech. These behaviors have been associated with significant societal\nconsequences, ranging from online anonymity to real-world outcomes such as\ndepression, suicidal tendencies, and, in some instances, offline violence.\nRecognizing the societal risks associated with unchecked aggressive content,\nthis paper delves into the field of Aggression Content Detection and Behavioral\nAnalysis of Aggressive Users, aiming to bridge the gap between disparate\nstudies. In this paper, we analyzed the diversity of definitions and proposed a\nunified cyber-aggression definition. We examine the comprehensive process of\nAggression Content Detection, spanning from dataset creation, feature selection\nand extraction, and detection algorithm development. Further, we review studies\non Behavioral Analysis of Aggression that explore the influencing factors,\nconsequences, and patterns associated with cyber-aggressive behavior. This\nsystematic literature review is a cross-examination of content detection and\nbehavioral analysis in the realm of cyber-aggression. 
The integrated\ninvestigation reveals the effectiveness of incorporating sociological insights\ninto computational techniques for preventing cyber-aggressive behavior.\nFinally, the paper concludes by identifying research gaps and encouraging\nfurther progress in the unified domain of socio-computational aggressive\nbehavior analysis.\n","authors":["Swapnil Mane","Suman Kundu","Rajesh Sharma"],"pdf_url":"https://arxiv.org/pdf/2311.09367v2.pdf","comment":"Accepted at ACM Computing Survey"},{"id":"http://arxiv.org/abs/2412.20367v1","updated":"2024-12-29T06:15:41Z","published":"2024-12-29T06:15:41Z","title":"Enhancing Code LLMs with Reinforcement Learning in Code Generation","summary":" With the rapid evolution of large language models (LLM), reinforcement\nlearning (RL) has emerged as a pivotal technique for code generation and\noptimization in various domains. This paper presents a systematic survey of the\napplication of RL in code optimization and generation, highlighting its role in\nenhancing compiler optimization, resource allocation, and the development of\nframeworks and tools. Subsequent sections first delve into the intricate\nprocesses of compiler optimization, where RL algorithms are leveraged to\nimprove efficiency and resource utilization. The discussion then progresses to\nthe function of RL in resource allocation, emphasizing register allocation and\nsystem optimization. We also explore the burgeoning role of frameworks and\ntools in code generation, examining how RL can be integrated to bolster their\ncapabilities. 
This survey aims to serve as a comprehensive resource for\nresearchers and practitioners interested in harnessing the power of RL to\nadvance code generation and optimization techniques.\n","authors":["Junqiao Wang","Zeng Zhang","Yangfan He","Yuyang Song","Tianyu Shi","Yuchen Li","Hengyuan Xu","Kunyu Wu","Guangwu Qian","Qiuwu Chen","Lewei He"],"pdf_url":"https://arxiv.org/pdf/2412.20367v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20357v1","updated":"2024-12-29T05:28:15Z","published":"2024-12-29T05:28:15Z","title":"HindiLLM: Large Language Model for Hindi","summary":" Advancements in Large Language Models (LLMs) have helped in solving\nseveral problems related to language processing. Most of the research has\nfocused on the English language only, because of its popularity and abundance\non the internet. However, a high-performance language model for Hindi and other\nIndic languages is lacking in the literature. In this work, we have pre-trained\ntwo autoregressive LLMs for the Hindi language, namely HindiLLM-Small and\nHindiLLM-Medium. We use a two-step process comprising unsupervised pre-training\nand supervised fine-tuning. First, we create a large and high-quality text\ncorpus for unsupervised pre-training. Next, we train a Byte-Pair Encoding\ntokenizer, named the HindiLLM tokenizer, using the pre-training text data. We then perform\ntraining on the unlabeled data, known as the pre-training step, to get the\nHindiLLM base models. Furthermore, we perform fine-tuning of the HindiLLM base\nmodels for different tasks like sentiment analysis, text classification,\nnatural language inference, and multiple-choice question answering on popular\nlabeled datasets to measure the real-world performance. 
The evaluation shows\nthat the HindiLLM-based fine-tuned models outperform several models in most of\nthe language-related tasks.\n","authors":["Sanjay Chouhan","Shubha Brata Nath","Aparajita Dutta"],"pdf_url":"https://arxiv.org/pdf/2412.20357v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.02085v5","updated":"2024-12-29T04:41:32Z","published":"2024-08-04T16:50:07Z","title":"Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data\n Assessment and Selection for Instruction Tuning of Language Models","summary":" Instruction tuning plays a critical role in aligning large language models\n(LLMs) with human preferences. Despite the vast amount of open instruction\ndatasets, naively training an LLM on all existing instructions may not be\noptimal or practical. To pinpoint the most beneficial datapoints, data\nassessment and selection methods have been proposed in the fields of natural\nlanguage processing (NLP) and deep learning. However, in the context of\ninstruction tuning, there still exists a gap in knowledge on what kind of data\nevaluation metrics can be employed and how they can be integrated into the\nselection mechanism. To bridge this gap, we present a comprehensive review of\nexisting literature on data assessment and selection, especially for instruction\ntuning of LLMs. We systematically categorize all applicable methods into\nquality-based, diversity-based, and importance-based ones, structured in a unified,\nfine-grained taxonomy. For each category, representative methods\nare elaborated to describe the landscape of relevant research. In addition,\na comparison between the latest methods is conducted on their officially reported\nresults to provide in-depth discussions on their limitations. Finally, we\nsummarize the open challenges and propose promising avenues for future\nstudies. 
All related contents are available at\nhttps://github.com/yuleiqin/fantastic-data-engineering.\n","authors":["Yulei Qin","Yuncheng Yang","Pengcheng Guo","Gang Li","Hang Shao","Yuchen Shi","Zihan Xu","Yun Gu","Ke Li","Xing Sun"],"pdf_url":"https://arxiv.org/pdf/2408.02085v5.pdf","comment":"Accepted to TMLR with Survey Certificate, review, survey, 37 pages, 5\n figures, 4 tables"},{"id":"http://arxiv.org/abs/2409.02259v2","updated":"2024-12-29T04:13:59Z","published":"2024-09-03T19:34:25Z","title":"Optimal L-Systems for Stochastic L-system Inference Problems","summary":" This paper presents two novel theorems that address two open problems in\nstochastic Lindenmayer-system (L-system) inference, specifically focusing on\nthe construction of an optimal stochastic L-system capable of generating a\ngiven sequence of strings. The first theorem delineates a method for crafting a\nstochastic L-system that has the maximum probability of a derivation producing\na given sequence of words through a single derivation (noting that multiple\nderivations may generate the same sequence). Furthermore, the second theorem\ndetermines the stochastic L-systems with the highest probability of producing a\ngiven sequence of words with multiple possible derivations. From these, we\nintroduce an algorithm to infer an optimal stochastic L-system from a given\nsequence. This algorithm incorporates advanced optimization techniques, such as\ninterior point methods, to ensure the creation of a stochastic L-system that\nmaximizes the probability of generating the given sequence (allowing for\nmultiple derivations). 
This allows for the use of stochastic L-systems as a\nmodel for machine learning using only positive data for training.\n","authors":["Ali Lotfi","Ian McQuillan"],"pdf_url":"https://arxiv.org/pdf/2409.02259v2.pdf","comment":"15 pages"},{"id":"http://arxiv.org/abs/2407.21004v2","updated":"2024-12-29T01:13:50Z","published":"2024-07-30T17:51:44Z","title":"Evolver: Chain-of-Evolution Prompting to Boost Large Multimodal Models\n for Hateful Meme Detection","summary":" Recent advances show that two-stream approaches have achieved outstanding\nperformance in hateful meme detection. However, hateful memes constantly evolve\nas new memes emerge by fusing progressive cultural ideas, making existing\nmethods obsolete or ineffective. In this work, we explore the potential of\nLarge Multimodal Models (LMMs) for hateful meme detection. To this end, we\npropose Evolver, which incorporates LMMs via Chain-of-Evolution (CoE)\nPrompting, by integrating the evolution attribute and in-context information of\nmemes. Specifically, Evolver simulates the evolving and expressing process of\nmemes and reasons through LMMs in a step-by-step manner. First, an evolutionary\npair mining module retrieves the top-k most similar memes in the external\ncurated meme set with the input meme. Second, an evolutionary information\nextractor is designed to summarize the semantic regularities between the paired\nmemes for prompting. Finally, a contextual relevance amplifier enhances the\nin-context hatefulness information to boost the search for evolutionary\nprocesses. Extensive experiments on public FHM, MAMI, and HarM datasets show\nthat CoE prompting can be incorporated into existing LMMs to improve their\nperformance. 
More encouragingly, it can serve as an interpretive tool to\npromote the understanding of the evolution of social memes.\n","authors":["Jinfa Huang","Jinsheng Pan","Zhongwei Wan","Hanjia Lyu","Jiebo Luo"],"pdf_url":"https://arxiv.org/pdf/2407.21004v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20309v1","updated":"2024-12-29T00:58:33Z","published":"2024-12-29T00:58:33Z","title":"Understanding the Impact of Confidence in Retrieval Augmented\n Generation: A Case Study in the Medical Domain","summary":" Retrieval Augmented Generation (RAG) complements the knowledge of Large\nLanguage Models (LLMs) by leveraging external information to enhance response\naccuracy for queries. This approach is widely applied in several fields thanks\nto its ability to inject the most up-to-date information, and\nresearchers are focusing on understanding and improving this aspect to unlock\nthe full potential of RAG in such high-stakes applications. However, despite\nthe potential of RAG to address these needs, the mechanisms behind the\nconfidence levels of its outputs remain underexplored, although the confidence\nof information is very critical in some domains, such as finance, healthcare,\nand medicine. Our study focuses on the impact of RAG on confidence within the\nmedical domain under various configurations and models. We evaluate confidence\nby treating the model's predicted probability as its output and calculating\nExpected Calibration Error (ECE) and Adaptive Calibration Error (ACE) scores\nbased on the probabilities and accuracy. In addition, we analyze whether the\norder of retrieved documents within prompts calibrates the confidence. Our\nfindings reveal large variation in confidence and accuracy depending on the\nmodel, settings, and the format of input prompts. 
These results underscore the\nnecessity of optimizing configurations based on the specific model and\nconditions.\n","authors":["Shintaro Ozaki","Yuta Kato","Siyuan Feng","Masayo Tomita","Kazuki Hayashi","Ryoma Obara","Masafumi Oyamada","Katsuhiko Hayashi","Hidetaka Kamigaito","Taro Watanabe"],"pdf_url":"https://arxiv.org/pdf/2412.20309v1.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2412.20427v1","updated":"2024-12-29T10:36:33Z","published":"2024-12-29T10:36:33Z","title":"AmalREC: A Dataset for Relation Extraction and Classification Leveraging\n Amalgamation of Large Language Models","summary":" Existing datasets for relation classification and extraction often exhibit\nlimitations such as restricted relation types and domain-specific biases. This\nwork presents a generic framework to generate well-structured sentences from\ngiven tuples with the help of Large Language Models (LLMs). This study has\nfocused on the following major questions: (i) how to generate sentences from\nrelation tuples, (ii) how to compare and rank them, (iii) can we combine the\nstrengths of individual methods and amalgamate them to generate an even better\nquality of sentences, and (iv) how to evaluate the final dataset? For the first\nquestion, we employ a multifaceted 5-stage pipeline approach, leveraging LLMs\nin conjunction with template-guided generation. We introduce the Sentence\nEvaluation Index (SEI) that prioritizes factors like grammatical correctness,\nfluency, human-aligned sentiment, accuracy, and complexity to answer the first\npart of the second question. To answer the second part of the second question,\nthis work introduces a SEI-Ranker module that leverages SEI to select top\ncandidate generations. The top sentences are then strategically amalgamated to\nproduce the final, high-quality sentence. Finally, we evaluate our dataset on\nLLM-based and SOTA baselines for relation classification. 
The proposed dataset\nfeatures 255 relation types, with 15K sentences in the test set and around 150K\nin the train set, significantly enhancing relational diversity and\ncomplexity. This work not only presents a new comprehensive benchmark dataset\nfor the RE/RC task, but also compares different LLMs for the generation of quality\nsentences from relational tuples.\n","authors":[" Mansi","Pranshu Pandya","Mahek Bhavesh Vora","Soumya Bharadwaj","Ashish Anand"],"pdf_url":"https://arxiv.org/pdf/2412.20427v1.pdf","comment":"18 pages, 5 Figures"},{"id":"http://arxiv.org/abs/2412.20414v1","updated":"2024-12-29T09:47:14Z","published":"2024-12-29T09:47:14Z","title":"Comparative Performance of Advanced NLP Models and LLMs in Multilingual\n Geo-Entity Detection","summary":" The integration of advanced Natural Language Processing (NLP) methodologies\nand Large Language Models (LLMs) has significantly enhanced the extraction and\nanalysis of geospatial data from multilingual texts, impacting sectors such as\nnational and international security. This paper presents a comprehensive\nevaluation of leading NLP models -- SpaCy, XLM-RoBERTa, mLUKE, GeoLM -- and\nLLMs, specifically OpenAI's GPT 3.5 and GPT 4, within the context of\nmultilingual geo-entity detection. Utilizing datasets from Telegram channels in\nEnglish, Russian, and Arabic, we examine the performance of these models\nthrough metrics such as accuracy, precision, recall, and F1 scores, to assess\ntheir effectiveness in accurately identifying geospatial references. The\nanalysis exposes each model's distinct advantages and challenges, underscoring\nthe complexities involved in achieving precise geo-entity identification across\nvaried linguistic landscapes. 
The conclusions drawn from this experiment aim to\ndirect the enhancement and creation of more advanced and inclusive NLP tools,\nthus advancing the field of geospatial analysis and its application to global\nsecurity.\n","authors":["Kalin Kopanov"],"pdf_url":"https://arxiv.org/pdf/2412.20414v1.pdf","comment":"6 pages, 1 table, AICCONF '24: Cognitive Models and Artificial\n Intelligence Conference, Istanbul, Turkey"},{"id":"http://arxiv.org/abs/2412.20366v1","updated":"2024-12-29T06:10:31Z","published":"2024-12-29T06:10:31Z","title":"Introducing Semantic Capability in LinkedIn's Content Search Engine","summary":" In the past, most search queries issued to a search engine were short and\nsimple. A keyword based search engine was able to answer such queries quite\nwell. However, members are now developing the habit of issuing long and complex\nnatural language queries. Answering such queries requires evolution of a search\nengine to have semantic capability. In this paper we present the design of\nLinkedIn's new content search engine with semantic capability, and its impact\non metrics.\n","authors":["Xin Yang","Rachel Zheng","Madhumitha Mohan","Sonali Bhadra","Lingyu Zhang","Rupesh Gupta"],"pdf_url":"https://arxiv.org/pdf/2412.20366v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20360v1","updated":"2024-12-29T05:43:40Z","published":"2024-12-29T05:43:40Z","title":"Left-handed representation in top 100 male professional tennis players:\n Multi-disciplinary perspectives","summary":" A commonly held opinion is that left-handed tennis players are\noverrepresented compared to the percentage of left-handers within the general\npopulation. This study provides the domain insights supported by data analysis\nthat could help inform the decision of parents and coaches considering whether\na child should start playing tennis as left- or right-handed when there is no\nstrong arm-handed dominance. 
Compared to the commonly cited figure of about 10%\nof left-handed male population, data analysis from the official ATP web site\nfor the top 100 ranked tennis players over the past decades (1985-2016) shows\nevidence of overrepresentation of left-handed elite tennis players (about 15%).\nThe insights and data analysis can inform the handedness decision, advance\ncoaching and strategic game concepts, enhance media coverage/analytics,\nleft-handed facts and statistics, and inform tennis equipment manufacturing.\n","authors":["Boris Bačić","Ali Ghazala"],"pdf_url":"https://arxiv.org/pdf/2412.20360v1.pdf","comment":"The original work citation (in APA): Ba\\v{c}i\\'c, B., & Ghazala, A.\n (2016). Left-handed representation in top 100 male professional tennis\n players: Multi-disciplinary perspectives. Symposium conducted at the meeting\n of the First New Zealand Text Mining Workshop (TMNZ 2016) in conjunction with\n the 8th Asian Conference on Machine Learning (ACML 2016), Hamilton, New\n Zealand"},{"id":"http://arxiv.org/abs/2408.08713v3","updated":"2024-12-29T01:51:59Z","published":"2024-08-16T12:51:52Z","title":"Beyond KAN: Introducing KarSein for Adaptive High-Order Feature\n Interaction Modeling in CTR Prediction","summary":" Modeling feature interactions is crucial for click-through rate (CTR)\nprediction, particularly when it comes to high-order explicit interactions.\nTraditional methods struggle with this task because they often predefine a\nmaximum interaction order, which relies heavily on prior knowledge and can\nlimit the model's effectiveness. Additionally, modeling high-order interactions\ntypically leads to increased computational costs. Therefore, the challenge lies\nin adaptively modeling high-order feature interactions while maintaining\nefficiency. To address this issue, we introduce Kolmogorov-Arnold Represented\nSparse Efficient Interaction Network (KarSein), designed to optimize both\npredictive accuracy and computational efficiency. 
We firstly identify\nlimitations of directly applying Kolmogorov-Arnold Networks (KAN) to CTR and\nthen introduce KarSein to overcome these issues. It features a novel\narchitecture that reduces the computational costs of KAN and supports embedding\nvectors as feature inputs. Additionally, KarSein employs guided symbolic\nregression to address the challenge of KAN in spontaneously learning\nmultiplicative relationships. Extensive experiments demonstrate KarSein's\nsuperior performance, achieving significant predictive accuracy with minimal\ncomputational overhead. Furthermore, KarSein maintains strong global\nexplainability while enabling the removal of redundant features, resulting in a\nsparse network structure. These advantages also position KarSein as a promising\nmethod for efficient inference.\n","authors":["Yunxiao Shi","Wujiang Xu","Haimin Zhang","Qiang Wu","Yongfeng Zhang","Min Xu"],"pdf_url":"https://arxiv.org/pdf/2408.08713v3.pdf","comment":"KarSein for CTR"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2412.20620v1","updated":"2024-12-29T23:49:25Z","published":"2024-12-29T23:49:25Z","title":"Matrix Concentration for Random Signed Graphs and Community Recovery in\n the Signed Stochastic Block Model","summary":" We consider graphs where edges and their signs are added independently at\nrandom from among all pairs of nodes. We establish strong concentration\ninequalities for adjacency and Laplacian matrices obtained from this family of\nrandom graph models. Then, we apply our results to study graphs sampled from\nthe signed stochastic block model. Namely, we take a two-community setting\nwhere edges within the communities have positive signs and edges between the\ncommunities have negative signs and apply a random sign perturbation with\nprobability $0< s <1/2$. 
In this setting, our findings include: first, the\nspectral gap of the corresponding signed Laplacian matrix concentrates near\n$2s$ with high probability; and second, the sign of the first eigenvector of\nthe Laplacian matrix defines a weakly consistent estimator for the balanced\ncommunity detection problem, or equivalently, the $\\pm 1$ synchronization\nproblem. We supplement our theoretical contributions with experimental data\nobtained from the models under consideration.\n","authors":["Sawyer Jack Robertson"],"pdf_url":"https://arxiv.org/pdf/2412.20620v1.pdf","comment":"29 pages, 10 figures"},{"id":"http://arxiv.org/abs/2412.20619v1","updated":"2024-12-29T23:48:35Z","published":"2024-12-29T23:48:35Z","title":"Audiopedia: Audio QA with Knowledge","summary":" In this paper, we introduce Audiopedia, a novel task called Audio Question\nAnswering with Knowledge, which requires both audio comprehension and external\nknowledge reasoning. Unlike traditional Audio Question Answering (AQA)\nbenchmarks that focus on simple queries answerable from audio alone, Audiopedia\ntargets knowledge-intensive questions. We define three sub-tasks: (i) Single\nAudio Question Answering (s-AQA), where questions are answered based on a\nsingle audio sample, (ii) Multi-Audio Question Answering (m-AQA), which\nrequires reasoning over multiple audio samples, and (iii) Retrieval-Augmented\nAudio Question Answering (r-AQA), which involves retrieving relevant audio to\nanswer the question. We benchmark large audio language models (LALMs) on these\nsub-tasks and observe suboptimal performance. To address this, we propose a\ngeneric framework that can be adapted to any LALM, equipping them with\nknowledge reasoning capabilities. 
Our framework has two components: (i) Audio\nEntity Linking (AEL) and (ii) Knowledge-Augmented Audio Large Multimodal Model\n(KA2LM), which together improve performance on knowledge-intensive AQA tasks.\nTo our knowledge, this is the first work to address advanced audio\nunderstanding via knowledge-intensive tasks like Audiopedia.\n","authors":["Abhirama Subramanyam Penamakuri","Kiran Chhatre","Akshat Jain"],"pdf_url":"https://arxiv.org/pdf/2412.20619v1.pdf","comment":"Accepted to ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.20617v1","updated":"2024-12-29T23:29:05Z","published":"2024-12-29T23:29:05Z","title":"Converting Time Series Data to Numeric Representations Using Alphabetic\n Mapping and k-mer strategy","summary":" In the realm of data analysis and bioinformatics, representing time series\ndata in a manner akin to biological sequences offers a novel approach to\nleverage sequence analysis techniques. Transforming time series signals into\nmolecular sequence-type representations allows us to enhance pattern\nrecognition by applying sophisticated sequence analysis techniques (e.g.\n$k$-mers based representation) developed in bioinformatics, uncovering hidden\npatterns and relationships in complex, non-linear time series data. This paper\nproposes a method to transform time series signals into biological/molecular\nsequence-type representations using a unique alphabetic mapping technique. By\ngenerating 26 ranges corresponding to the 26 letters of the English alphabet,\neach value within the time series is mapped to a specific character based on\nits range. This conversion facilitates the application of sequence analysis\nalgorithms, typically used in bioinformatics, to analyze time series data. 
We\ndemonstrate the effectiveness of this approach by converting real-world time\nseries signals into character sequences and performing sequence classification.\nThe resulting sequences can be utilized for various sequence-based analysis\ntechniques, offering a new perspective on time series data representation and\nanalysis.\n","authors":["Sarwan Ali","Tamkanat E Ali","Imdad Ullah Khan","Murray Patterson"],"pdf_url":"https://arxiv.org/pdf/2412.20617v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20616v1","updated":"2024-12-29T23:26:43Z","published":"2024-12-29T23:26:43Z","title":"Hilbert Curve Based Molecular Sequence Analysis","summary":" Accurate molecular sequence analysis is a key task in the field of\nbioinformatics. To apply molecular sequence classification algorithms, we first\nneed to generate the appropriate representations of the sequences. Traditional\nnumeric sequence representation techniques are mostly based on sequence\nalignment, which suffers from limited accuracy. Although\nseveral alignment-free techniques have also been introduced, their tabular data\nform results in low performance when used with Deep Learning (DL) models\ncompared to the competitive performance observed in the case of image-based\ndata. To find a solution to this problem and to make Deep Learning (DL) models\nfunction to their maximum potential while capturing the important spatial\ninformation in the sequence data, we propose a universal Hilbert curve-based\nChaos Game Representation (CGR) method. This method is a transformative\nfunction that involves a novel Alphabetic index mapping technique used in\nconstructing Hilbert curve-based image representation from molecular sequences.\nOur method can be globally applied to any type of molecular sequence data. The\nHilbert curve-based image representations can be used as input to sophisticated\nvision DL models for sequence classification. 
The proposed method shows\npromising results as it outperforms current state-of-the-art methods by\nachieving a high accuracy of $94.5$\\% and an F1 score of $93.9\\%$ when tested\nwith the CNN model on the lung cancer dataset. This approach opens up a new\nhorizon for exploring molecular sequence analysis using image classification\nmethods.\n","authors":["Sarwan Ali","Tamkanat E Ali","Imdad Ullah Khan","Murray Patterson"],"pdf_url":"https://arxiv.org/pdf/2412.20616v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.13067v2","updated":"2024-12-29T22:53:05Z","published":"2024-09-19T20:01:07Z","title":"E-Sort: Empowering End-to-end Neural Network for Multi-channel Spike\n Sorting with Transfer Learning and Fast Post-processing","summary":" Decoding extracellular recordings is a crucial task in electrophysiology and\nbrain-computer interfaces. Spike sorting, which distinguishes spikes and their\nputative neurons from extracellular recordings, becomes computationally\ndemanding with the increasing number of channels in modern neural probes. To\naddress the intensive workload and complex neuron interactions, we propose\nE-Sort, an end-to-end neural network-based spike sorter with transfer learning\nand parallelizable post-processing. Our framework reduces the required number\nof annotated spikes for training by 44% compared to training from scratch,\nachieving up to 25.68% higher accuracy. Additionally, our novel post-processing\nalgorithm is compatible with deep learning frameworks, making E-Sort\nsignificantly faster than state-of-the-art spike sorters. On synthesized\nNeuropixels recordings, E-Sort achieves comparable accuracy with Kilosort4\nwhile sorting 50 seconds of data in only 1.32 seconds. 
Our method demonstrates\nrobustness across various probe geometries, noise levels, and drift conditions,\noffering a substantial improvement in both accuracy and runtime efficiency\ncompared to existing spike sorters.\n","authors":["Yuntao Han","Shiwei Wang"],"pdf_url":"https://arxiv.org/pdf/2409.13067v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20601v1","updated":"2024-12-29T22:13:16Z","published":"2024-12-29T22:13:16Z","title":"MATEY: multiscale adaptive foundation models for spatiotemporal physical\n systems","summary":" Accurate representation of the multiscale features in spatiotemporal physical\nsystems using vision transformer (ViT) architectures requires extremely long,\ncomputationally prohibitive token sequences. To address this issue, we propose\ntwo adaptive tokenization schemes that dynamically adjust patch sizes based on\nlocal features: one ensures convergent behavior to uniform patch refinement,\nwhile the other offers better computational efficiency. Moreover, we present a\nset of spatiotemporal attention schemes, where the temporal or axial spatial\ndimensions are decoupled, and evaluate their computational and data\nefficiencies. We assess the performance of the proposed multiscale adaptive\nmodel, MATEY, in a sequence of experiments. The results show that adaptive\ntokenization schemes achieve improved accuracy without significantly increasing\nthe length of the token sequence. Compared to a full spatiotemporal attention\nscheme or a scheme that decouples only the temporal dimension, we find that\nfully decoupled axial attention is less efficient and expressive, requiring\nmore training time and model weights to achieve the same accuracy. Finally, we\ndemonstrate in two fine-tuning tasks featuring different physics that models\npretrained on PDEBench data outperform the ones trained from scratch,\nespecially in the low data regime with frozen attention.\n","authors":["Pei Zhang","M. 
Paul Laiu","Matthew Norman","Doug Stefanski","John Gounley"],"pdf_url":"https://arxiv.org/pdf/2412.20601v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.02976v2","updated":"2024-12-29T22:13:12Z","published":"2024-10-03T20:35:23Z","title":"Learning Optimal Control and Dynamical Structure of Global Trajectory\n Search Problems with Diffusion Models","summary":" Spacecraft trajectory design is a global search problem, where previous work\nhas revealed specific solution structures that can be captured with data-driven\nmethods. This paper explores two global search problems in the circular\nrestricted three-body problem: hybrid cost function of minimum\nfuel/time-of-flight and transfers to energy-dependent invariant manifolds.\nThese problems display a fundamental structure either in the optimal control\nprofile or the use of dynamical structures. We build on our prior generative\nmachine learning framework to apply diffusion models to learn the conditional\nprobability distribution of the search problem and analyze the model's\ncapability to capture these structures.\n","authors":["Jannik Graebner","Anjian Li","Amlan Sinha","Ryne Beeson"],"pdf_url":"https://arxiv.org/pdf/2410.02976v2.pdf","comment":"This paper was presented at the AAS/AIAA Astrodynamics Specialist\n Conference"},{"id":"http://arxiv.org/abs/2409.03377v3","updated":"2024-12-29T22:03:13Z","published":"2024-09-05T09:28:56Z","title":"Real-time Speech Enhancement on Raw Signals with Deep State-space\n Modeling","summary":" We present aTENNuate, a simple deep state-space autoencoder configured for\nefficient online raw speech enhancement in an end-to-end fashion. The network's\nperformance is primarily evaluated on raw speech denoising, with additional\nassessments on tasks such as super-resolution and de-quantization. 
We benchmark\naTENNuate on the VoiceBank + DEMAND and the Microsoft DNS1 synthetic test sets.\nThe network outperforms previous real-time denoising models in terms of PESQ\nscore, parameter count, MACs, and latency. Even as a raw waveform processing\nmodel, the model maintains high fidelity to the clean signal with minimal\naudible artifacts. In addition, the model remains performant even when the\nnoisy input is compressed down to 4000Hz and 4 bits, suggesting general speech\nenhancement capabilities in low-resource environments. Code is available at\ngithub.com/Brainchip-Inc/aTENNuate\n","authors":["Yan Ru Pei","Ritik Shrivastava","FNU Sidharth"],"pdf_url":"https://arxiv.org/pdf/2409.03377v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.18458v2","updated":"2024-12-29T21:24:30Z","published":"2023-05-29T05:20:18Z","title":"CASUAL: Conditional Support Alignment for Domain Adaptation with Label\n Shift","summary":" Unsupervised domain adaptation (UDA) refers to a domain adaptation framework\nin which a learning model is trained based on the labeled samples on the source\ndomain and unlabeled ones in the target domain. The dominant existing methods\nin the field that rely on the classical covariate shift assumption to learn\ndomain-invariant feature representation have yielded suboptimal performance\nunder label distribution shift. In this paper, we propose a novel Conditional\nAdversarial SUpport ALignment (CASUAL) whose aim is to minimize the conditional\nsymmetric support divergence between the source's and target domain's feature\nrepresentation distributions, aiming at a more discriminative representation\nfor the classification task. We also introduce a novel theoretical target risk\nbound, which justifies the merits of aligning the supports of conditional\nfeature distributions compared to the existing marginal support alignment\napproach in the UDA settings. 
We then provide a complete training process for\nlearning in which the objective optimization functions are precisely based on\nthe proposed target risk bound. Our empirical results demonstrate that CASUAL\noutperforms other state-of-the-art methods on different UDA benchmark tasks\nunder different label shift conditions.\n","authors":["Anh T Nguyen","Lam Tran","Anh Tong","Tuan-Duy H. Nguyen","Toan Tran"],"pdf_url":"https://arxiv.org/pdf/2305.18458v2.pdf","comment":"Accepted at AAAI 2025"},{"id":"http://arxiv.org/abs/2412.20588v1","updated":"2024-12-29T21:23:09Z","published":"2024-12-29T21:23:09Z","title":"Kryptonite-N: Machine Learning Strikes Back","summary":" Quinn et al. propose challenge datasets in their work called ``Kryptonite-N\".\nThese datasets aim to counter the universal function approximation argument of\nmachine learning, breaking the notion that machine learning can ``approximate\nany continuous function\" \\cite{original_paper}. Our work refutes this claim and\nshows that universal function approximation can be applied successfully; the\nKryptonite datasets are constructed predictably, allowing logistic regression\nwith sufficient polynomial expansion and L1 regularization to solve for any\ndimension N.\n","authors":["Albus Li","Nathan Bailey","Will Sumerfield","Kira Kim"],"pdf_url":"https://arxiv.org/pdf/2412.20588v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20586v1","updated":"2024-12-29T21:22:24Z","published":"2024-12-29T21:22:24Z","title":"Testing and Improving the Robustness of Amortized Bayesian Inference for\n Cognitive Models","summary":" Contaminant observations and outliers often cause problems when estimating\nthe parameters of cognitive models, which are statistical models representing\ncognitive processes. In this study, we test and improve the robustness of\nparameter estimation using amortized Bayesian inference (ABI) with neural\nnetworks. 
To this end, we conduct systematic analyses on a toy example and\nanalyze both synthetic and real data using a popular cognitive model, the Drift\nDiffusion Model (DDM). First, we study the sensitivity of ABI to contaminants\nwith tools from robust statistics: the empirical influence function and the\nbreakdown point. Next, we propose a data augmentation or noise injection\napproach that incorporates a contamination distribution into the\ndata-generating process during training. We examine several candidate\ndistributions and evaluate their performance and cost in terms of accuracy and\nefficiency loss relative to a standard estimator. Introducing contaminants from\na Cauchy distribution during training considerably increases the robustness of\nthe neural density estimator, as measured by bounded influence functions and a\nmuch higher breakdown point. Overall, the proposed method is straightforward\nand practical to implement and has broad applicability in fields where\noutlier detection or removal is challenging.\n","authors":["Yufei Wu","Stefan Radev","Francis Tuerlinckx"],"pdf_url":"https://arxiv.org/pdf/2412.20586v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20582v1","updated":"2024-12-29T21:04:35Z","published":"2024-12-29T21:04:35Z","title":"Bridging the Gap: A Decade Review of Time-Series Clustering Methods","summary":" Time series, as one of the most fundamental representations of sequential\ndata, has been extensively studied across diverse disciplines, including\ncomputer science, biology, geology, astronomy, and environmental sciences. The\nadvent of advanced sensing, storage, and networking technologies has resulted\nin high-dimensional time-series data, which, however, poses significant challenges\nfor analyzing latent structures over extended temporal scales. 
Time-series\nclustering, an established unsupervised learning strategy that groups similar\ntime series together, helps unveil hidden patterns in these complex datasets.\nIn this survey, we trace the evolution of time-series clustering methods from\nclassical approaches to recent advances in neural networks. While previous\nsurveys have focused on specific methodological categories, we bridge the gap\nbetween traditional clustering methods and emerging deep learning-based\nalgorithms, presenting a comprehensive, unified taxonomy for this research\narea. This survey highlights key developments and provides insights to guide\nfuture research in time-series clustering.\n","authors":["John Paparrizos","Fan Yang","Haojun Li"],"pdf_url":"https://arxiv.org/pdf/2412.20582v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20574v1","updated":"2024-12-29T20:47:08Z","published":"2024-12-29T20:47:08Z","title":"A Survey on Time-Series Distance Measures","summary":" Distance measures have been recognized as one of the fundamental building\nblocks in time-series analysis tasks, e.g., querying, indexing, classification,\nclustering, anomaly detection, and similarity search. The vast proliferation of\ntime-series data across a wide range of fields has increased the relevance of\nevaluating the effectiveness and efficiency of these distance measures. To\nprovide a comprehensive view of this field, this work considers over 100\nstate-of-the-art distance measures, classified into 7 categories: lock-step\nmeasures, sliding measures, elastic measures, kernel measures, feature-based\nmeasures, model-based measures, and embedding measures. Beyond providing\ncomprehensive mathematical frameworks, this work also delves into the\ndistinctions and applications across these categories for both univariate and\nmultivariate cases. 
By providing comprehensive collections and insights, this\nstudy paves the way for the future development of innovative time-series\ndistance measures.\n","authors":["John Paparrizos","Haojun Li","Fan Yang","Kaize Wu","Jens E. d'Hondt","Odysseas Papapetrou"],"pdf_url":"https://arxiv.org/pdf/2412.20574v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20573v1","updated":"2024-12-29T20:44:59Z","published":"2024-12-29T20:44:59Z","title":"The intrinsic motivation of reinforcement and imitation learning for\n sequential tasks","summary":" This work in the field of developmental cognitive robotics aims to devise a\nnew domain bridging between reinforcement learning and imitation learning, with\na model of the intrinsic motivation for learning agents to learn with guidance\nfrom tutors multiple tasks, including sequential tasks. The main contribution\nhas been to propose a common formulation of intrinsic motivation based on\nempirical progress for a learning agent to choose automatically its learning\ncurriculum by actively choosing its learning strategy for simple or sequential\ntasks: which task to learn, between autonomous exploration or imitation\nlearning, between low-level actions or task decomposition, between several\ntutors. The originality is to design a learner that benefits not only passively\nfrom data provided by tutors, but to actively choose when to request tutoring\nand what and whom to ask. The learner is thus more robust to the quality of the\ntutoring and learns faster with fewer demonstrations. We developed the\nframework of socially guided intrinsic motivation with machine learning\nalgorithms to learn multiple tasks by taking advantage of the generalisability\nproperties of human demonstrations in a passive manner or in an active manner\nthrough requests of demonstrations from the best tutor for simple and composing\nsubtasks. 
The latter relies on a representation of subtask composition proposed\nfor a construction process, which should be refined by representations used for\nobservational processes of analysing human movements and activities of daily\nliving. With the outlook of a language-like communication with the tutor, we\ninvestigated the emergence of a symbolic representation of the continuous\nsensorimotor space and of tasks using intrinsic motivation. We proposed within\nthe reinforcement learning framework, a reward function for interacting with\ntutors for automatic curriculum learning in multi-task learning.\n","authors":["Sao Mai Nguyen"],"pdf_url":"https://arxiv.org/pdf/2412.20573v1.pdf","comment":"Habilitation thesis"},{"id":"http://arxiv.org/abs/2409.06953v3","updated":"2024-12-29T20:25:02Z","published":"2024-09-11T02:29:53Z","title":"Neural Algorithmic Reasoning with Multiple Correct Solutions","summary":" Neural Algorithmic Reasoning (NAR) aims to optimize classical algorithms.\nHowever, canonical implementations of NAR train neural networks to return only\na single solution, even when there are multiple correct solutions to a problem,\nsuch as single-source shortest paths. For some applications, it is desirable to\nrecover more than one correct solution. To that end, we give the first method\nfor NAR with multiple solutions. We demonstrate our method on two classical\nalgorithms: Bellman-Ford (BF) and Depth-First Search (DFS), favouring deeper\ninsight into two algorithms over a broader survey of algorithms. This method\ninvolves generating appropriate training data as well as sampling and\nvalidating solutions from model output. 
Each step of our method, which can\nserve as a framework for neural algorithmic reasoning beyond the tasks\npresented in this paper, might be of independent interest to the field and our\nresults represent the first attempt at this task in the NAR literature.\n","authors":["Zeno Kujawa","John Poole","Dobrik Georgiev","Danilo Numeroso","Pietro Liò"],"pdf_url":"https://arxiv.org/pdf/2409.06953v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.09892v2","updated":"2024-12-29T20:13:04Z","published":"2024-11-15T02:36:36Z","title":"A Self-Supervised Robotic System for Autonomous Contact-Based Spatial\n Mapping of Semiconductor Properties","summary":" Integrating robotically driven contact-based material characterization\ntechniques into self-driving laboratories can enhance measurement quality,\nreliability, and throughput. While deep learning models support robust\nautonomy, current methods lack reliable pixel-precision positioning and require\nextensive labeled data. To overcome these challenges, we propose an approach\nfor building self-supervised autonomy into contact-based robotic systems that\nteach the robot to follow domain expert measurement principles at\nhigh-throughputs. Firstly, we design a vision-based, self-supervised\nconvolutional neural network (CNN) architecture that uses differentiable image\npriors to optimize domain-specific objectives, refining the pixel precision of\npredicted robot contact poses by 20.0% relative to existing approaches.\nSecondly, we design a reliable graph-based planner for generating\ndistance-minimizing paths to accelerate the robot measurement throughput and\ndecrease planning variance by 6x. We demonstrate the performance of this\napproach by autonomously driving a 4-degree-of-freedom robotic probe for 24\nhours to characterize semiconductor photoconductivity at 3,025 uniquely\npredicted poses across a gradient of drop-casted perovskite film compositions,\nachieving throughputs over 125 measurements per hour. 
Spatially mapping\nphotoconductivity onto each drop-casted film reveals compositional trends and\nregions of inhomogeneity, valuable for identifying manufacturing process\ndefects. With this self-supervised CNN-driven robotic system, we enable\nhigh-precision and reliable automation of contact-based characterization\ntechniques at high throughputs, thereby allowing the measurement of previously\ninaccessible yet important semiconductor properties for self-driving\nlaboratories.\n","authors":["Alexander E. Siemenn","Basita Das","Kangyu Ji","Fang Sheng","Tonio Buonassisi"],"pdf_url":"https://arxiv.org/pdf/2411.09892v2.pdf","comment":"Manuscript 18 pages, 6 figures. Supplementary information 6 pages, 7\n figures"},{"id":"http://arxiv.org/abs/2408.12561v2","updated":"2024-12-29T19:45:44Z","published":"2024-08-22T17:22:59Z","title":"ssProp: Energy-Efficient Training for Convolutional Neural Networks with\n Scheduled Sparse Back Propagation","summary":" Recently, deep learning has made remarkable strides, especially with\ngenerative modeling, such as large language models and probabilistic diffusion\nmodels. However, training these models often involves significant computational\nresources, requiring billions of petaFLOPs. This high resource consumption\nresults in substantial energy usage and a large carbon footprint, raising\ncritical environmental concerns. Back-propagation (BP) is a major source of\ncomputational expense during training deep learning models. To advance research\non energy-efficient training and allow for sparse learning on any machine and\ndevice, we propose a general, energy-efficient convolution module that can be\nseamlessly integrated into any deep learning architecture. Specifically, we\nintroduce channel-wise sparsity with additional gradient selection schedulers\nduring backward based on the assumption that BP is often dense and inefficient,\nwhich can lead to over-fitting and high computational consumption. 
Our\nexperiments demonstrate that our approach reduces 40\\% computations while\npotentially improving model performance, validated on image classification and\ngeneration tasks. This reduction can lead to significant energy savings and a\nlower carbon footprint during the research and development phases of\nlarge-scale AI systems. Additionally, our method mitigates over-fitting in a\nmanner distinct from Dropout, allowing it to be combined with Dropout to\nfurther enhance model performance and reduce computational resource usage.\nExtensive experiments validate that our method generalizes to a variety of\ndatasets and tasks and is compatible with a wide range of deep learning\narchitectures and modules. Code is publicly available at\nhttps://github.com/lujiazho/ssProp.\n","authors":["Lujia Zhong","Shuo Huang","Yonggang Shi"],"pdf_url":"https://arxiv.org/pdf/2408.12561v2.pdf","comment":"Accepted by AAAI24 Workshop: Scalable and Efficient Artificial\n Intelligence Systems"},{"id":"http://arxiv.org/abs/2412.20556v1","updated":"2024-12-29T19:31:23Z","published":"2024-12-29T19:31:23Z","title":"Distributionally Robust Optimization via Iterative Algorithms in\n Continuous Probability Spaces","summary":" We consider a minimax problem motivated by distributionally robust\noptimization (DRO) when the worst-case distribution is continuous, leading to\nsignificant computational challenges due to the infinite-dimensional nature of\nthe optimization problem. Recent research has explored learning the worst-case\ndistribution using neural network-based generative models to address these\ncomputational challenges but lacks algorithmic convergence guarantees. This\npaper bridges this theoretical gap by presenting an iterative algorithm to\nsolve such a minimax problem, achieving global convergence under mild\nassumptions and leveraging technical tools from vector space minimax\noptimization and convex analysis in the space of continuous probability\ndensities. 
In particular, leveraging Brenier's theorem, we represent the\nworst-case distribution as a transport map applied to a continuous reference\nmeasure and reformulate the regularized discrepancy-based DRO as a minimax\nproblem in the Wasserstein space. Furthermore, we demonstrate that the\nworst-case distribution can be efficiently computed using a modified\nJordan-Kinderlehrer-Otto (JKO) scheme with sufficiently large regularization\nparameters for commonly used discrepancy functions, linked to the radius of the\nambiguity set. Additionally, we derive the global convergence rate and quantify\nthe total number of subgradient and inexact modified JKO iterations required to\nobtain approximate stationary points. These results are potentially applicable\nto nonconvex and nonsmooth scenarios, with broad relevance to modern machine\nlearning applications.\n","authors":["Linglingzhi Zhu","Yao Xie"],"pdf_url":"https://arxiv.org/pdf/2412.20556v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.14226v2","updated":"2024-12-29T19:15:59Z","published":"2024-12-18T16:31:34Z","title":"FedSTaS: Client Stratification and Client Level Sampling for Efficient\n Federated Learning","summary":" Federated learning (FL) is a machine learning methodology that involves the\ncollaborative training of a global model across multiple decentralized clients\nin a privacy-preserving way. Several FL methods are introduced to tackle\ncommunication inefficiencies but do not address how to sample participating\nclients in each round effectively and in a privacy-preserving manner. In this\npaper, we propose \\textit{FedSTaS}, a client and data-level sampling method\ninspired by \\textit{FedSTS} and \\textit{FedSampling}. 
In each federated\nlearning round, \\textit{FedSTaS} stratifies clients based on their compressed\ngradients, re-allocate the number of clients to sample using an optimal Neyman\nallocation, and sample local data from each participating clients using a data\nuniform sampling strategy. Experiments on three datasets show that\n\\textit{FedSTaS} can achieve higher accuracy scores than those of\n\\textit{FedSTS} within a fixed number of training rounds.\n","authors":["Jordan Slessor","Dezheng Kong","Xiaofen Tang","Zheng En Than","Linglong Kong"],"pdf_url":"https://arxiv.org/pdf/2412.14226v2.pdf","comment":"6 pages, 3 figures"},{"id":"http://arxiv.org/abs/2412.20553v1","updated":"2024-12-29T18:59:01Z","published":"2024-12-29T18:59:01Z","title":"Edge of Stochastic Stability: Revisiting the Edge of Stability for SGD","summary":" Recent findings by Cohen et al., 2021, demonstrate that when training neural\nnetworks with full-batch gradient descent at a step size of $\\eta$, the\nsharpness--defined as the largest eigenvalue of the full batch\nHessian--consistently stabilizes at $2/\\eta$. These results have significant\nimplications for convergence and generalization. Unfortunately, this was\nobserved not to be the case for mini-batch stochastic gradient descent (SGD),\nthus limiting the broader applicability of these findings. We show that SGD\ntrains in a different regime we call Edge of Stochastic Stability. In this\nregime, what hovers at $2/\\eta$ is, instead, the average over the batches of\nthe largest eigenvalue of the Hessian of the mini batch (MiniBS) loss--which is\nalways bigger than the sharpness. This implies that the sharpness is generally\nlower when training with smaller batches or bigger learning rate, providing a\nbasis for the observed implicit regularization effect of SGD towards flatter\nminima and a number of well established empirical phenomena. 
Additionally, we\nquantify the gap between the MiniBS and the sharpness, further characterizing\nthis distinct training regime.\n","authors":["Arseniy Andreyev","Pierfrancesco Beneventano"],"pdf_url":"https://arxiv.org/pdf/2412.20553v1.pdf","comment":"28 pages, 24 figures"},{"id":"http://arxiv.org/abs/2307.10246v3","updated":"2024-12-29T18:54:51Z","published":"2023-07-17T06:54:36Z","title":"Deep Neural Networks and Brain Alignment: Brain Encoding and Decoding\n (Survey)","summary":" Can artificial intelligence unlock the secrets of the human brain? How do the\ninner mechanisms of deep learning models relate to our neural circuits? Is it\npossible to enhance AI by tapping into the power of brain recordings? These\ncaptivating questions lie at the heart of an emerging field at the intersection\nof neuroscience and artificial intelligence. Our survey dives into this\nexciting domain, focusing on human brain recording studies and cutting-edge\ncognitive neuroscience datasets that capture brain activity during natural\nlanguage processing, visual perception, and auditory experiences. We explore\ntwo fundamental approaches: encoding models, which attempt to generate brain\nactivity patterns from sensory inputs; and decoding models, which aim to\nreconstruct our thoughts and perceptions from neural signals. These techniques\nnot only promise breakthroughs in neurological diagnostics and brain-computer\ninterfaces but also offer a window into the very nature of cognition. In this\nsurvey, we first discuss popular representations of language, vision, and\nspeech stimuli, and present a summary of neuroscience datasets. We then review\nhow the recent advances in deep learning transformed this field, by\ninvestigating the popular deep learning based encoding and decoding\narchitectures, noting their benefits and limitations across different sensory\nmodalities. 
From text to images, speech to videos, we investigate how these\nmodels capture the brain's response to our complex, multimodal world. While our\nprimary focus is on human studies, we also highlight the crucial role of animal\nmodels in advancing our understanding of neural mechanisms. Throughout, we\nmention the ethical implications of these powerful technologies, addressing\nconcerns about privacy and cognitive liberty. We conclude with a summary and\ndiscussion of future trends in this rapidly evolving field.\n","authors":["Subba Reddy Oota","Zijiao Chen","Manish Gupta","Raju S. Bapi","Gael Jobard","Frederic Alexandre","Xavier Hinaut"],"pdf_url":"https://arxiv.org/pdf/2307.10246v3.pdf","comment":"61 pages, 22 figures"},{"id":"http://arxiv.org/abs/2411.06360v2","updated":"2024-12-29T18:43:04Z","published":"2024-11-10T04:56:14Z","title":"An Efficient Matrix Multiplication Algorithm for Accelerating Inference\n in Binary and Ternary Neural Networks","summary":" Despite their tremendous success and versatility, Large Language Models\n(LLMs) suffer from inference inefficiency while relying on advanced\ncomputational infrastructure. To address these challenges and make LLMs more\naccessible and cost-effective, in this paper, we propose algorithms to improve\nthe inference time and memory efficiency of 1.58-bit LLMs with ternary weight\nmatrices. Particularly focusing on matrix multiplication as the bottle-neck\noperation of inference, we observe that, once trained, the weight matrices of a\nmodel no longer change. This allows us to preprocess these matrices and create\nindices that help reduce the storage requirements by a logarithmic factor while\nenabling our efficient inference algorithms. Specifically, for a $n$ by $n$\nweight matrix, our efficient algorithm guarantees a time complexity of\n$O(\\frac{n^2}{\\log n})$, a logarithmic factor improvement over the standard\n$O(n^2)$ vector-matrix multiplication. 
Besides theoretical analysis, we conduct\nextensive experiments to evaluate the practical efficiency of our algorithms.\nOur results confirm the superiority of the approach both with respect to time\nand memory, as we observed a reduction in inference time up to 29x and memory\nusage up to 6x.\n","authors":["Mohsen Dehghankar","Mahdi Erfanian","Abolfazl Asudeh"],"pdf_url":"https://arxiv.org/pdf/2411.06360v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20545v1","updated":"2024-12-29T18:34:10Z","published":"2024-12-29T18:34:10Z","title":"The Impact of Prompt Programming on Function-Level Code Generation","summary":" Large Language Models (LLMs) are increasingly used by software engineers for\ncode generation. However, limitations of LLMs such as irrelevant or incorrect\ncode have highlighted the need for prompt programming (or prompt engineering)\nwhere engineers apply specific prompt techniques (e.g., chain-of-thought or\ninput-output examples) to improve the generated code. Despite this, the impact\nof different prompt techniques -- and their combinations -- on code generation\nremains underexplored. In this study, we introduce CodePromptEval, a dataset of\n7072 prompts designed to evaluate five prompt techniques (few-shot, persona,\nchain-of-thought, function signature, list of packages) and their effect on the\ncorrectness, similarity, and quality of complete functions generated by three\nLLMs (GPT-4o, Llama3, and Mistral). Our findings show that while certain prompt\ntechniques significantly influence the generated code, combining multiple\ntechniques does not necessarily improve the outcome. Additionally, we observed\na trade-off between correctness and quality when using prompt techniques. 
Our\ndataset and replication package enable future research on improving\nLLM-generated code and evaluating new prompt techniques.\n","authors":["Ranim Khojah","Francisco Gomes de Oliveira Neto","Mazen Mohamad","Philipp Leitner"],"pdf_url":"https://arxiv.org/pdf/2412.20545v1.pdf","comment":"CodePromptEval dataset and replication package on GitHub:\n https://github.com/icetlab/CodePromptEval"}],"Multimedia":[{"id":"http://arxiv.org/abs/2412.20619v1","updated":"2024-12-29T23:48:35Z","published":"2024-12-29T23:48:35Z","title":"Audiopedia: Audio QA with Knowledge","summary":" In this paper, we introduce Audiopedia, a novel task called Audio Question\nAnswering with Knowledge, which requires both audio comprehension and external\nknowledge reasoning. Unlike traditional Audio Question Answering (AQA)\nbenchmarks that focus on simple queries answerable from audio alone, Audiopedia\ntargets knowledge-intensive questions. We define three sub-tasks: (i) Single\nAudio Question Answering (s-AQA), where questions are answered based on a\nsingle audio sample, (ii) Multi-Audio Question Answering (m-AQA), which\nrequires reasoning over multiple audio samples, and (iii) Retrieval-Augmented\nAudio Question Answering (r-AQA), which involves retrieving relevant audio to\nanswer the question. We benchmark large audio language models (LALMs) on these\nsub-tasks and observe suboptimal performance. To address this, we propose a\ngeneric framework that can be adapted to any LALM, equipping them with\nknowledge reasoning capabilities. 
Our framework has two components: (i) Audio\nEntity Linking (AEL) and (ii) Knowledge-Augmented Audio Large Multimodal Model\n(KA2LM), which together improve performance on knowledge-intensive AQA tasks.\nTo our knowledge, this is the first work to address advanced audio\nunderstanding via knowledge-intensive tasks like Audiopedia.\n","authors":["Abhirama Subramanyam Penamakuri","Kiran Chhatre","Akshat Jain"],"pdf_url":"https://arxiv.org/pdf/2412.20619v1.pdf","comment":"Accepted to ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.14158v2","updated":"2024-12-29T17:22:30Z","published":"2024-12-18T18:53:22Z","title":"AKiRa: Augmentation Kit on Rays for optical video generation","summary":" Recent advances in text-conditioned video diffusion have greatly improved\nvideo quality. However, these methods offer limited or sometimes no control to\nusers on camera aspects, including dynamic camera motion, zoom, distorted lens\nand focus shifts. These motion and optical aspects are crucial for adding\ncontrollability and cinematic elements to generation frameworks, ultimately\nresulting in visual content that draws focus, enhances mood, and guides\nemotions according to filmmakers' controls. In this paper, we aim to close the\ngap between controllable video generation and camera optics. To achieve this,\nwe propose AKiRa (Augmentation Kit on Rays), a novel augmentation framework\nthat builds and trains a camera adapter with a complex camera model over an\nexisting video generation backbone. It enables fine-tuned control over camera\nmotion as well as complex optical parameters (focal length, distortion,\naperture) to achieve cinematic effects such as zoom, fisheye effect, and bokeh.\nExtensive experiments demonstrate AKiRa's effectiveness in combining and\ncomposing camera optics while outperforming all state-of-the-art methods. 
This\nwork sets a new landmark in controlled and optically enhanced video generation,\npaving the way for future optical video generation methods.\n","authors":["Xi Wang","Robin Courant","Marc Christie","Vicky Kalogeiton"],"pdf_url":"https://arxiv.org/pdf/2412.14158v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20504v1","updated":"2024-12-29T15:42:24Z","published":"2024-12-29T15:42:24Z","title":"ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video\n Understanding","summary":" Video Large Language Models (VideoLLMs) have achieved remarkable progress in\nvideo understanding. However, existing VideoLLMs often inherit the limitations\nof their backbone LLMs in handling long sequences, leading to challenges for\nlong video understanding. Common solutions either simply uniformly sample\nvideos' frames or compress visual tokens, which focus primarily on low-level\ntemporal visual redundancy, overlooking high-level knowledge redundancy. This\nlimits the achievable compression rate with minimal loss. To this end. we\nintroduce a training-free method, $\\textbf{ReTaKe}$, containing two novel\nmodules DPSelect and PivotKV, to jointly model and reduce both temporal visual\nredundancy and knowledge redundancy for long video understanding. Specifically,\nDPSelect identifies keyframes with local maximum peak distance based on their\nvisual features, which are closely aligned with human video perception. PivotKV\nemploys the obtained keyframes as pivots and conducts KV-Cache compression for\nthe non-pivot tokens with low attention scores, which are derived from the\nlearned prior knowledge of LLMs. Experiments on benchmarks VideoMME, MLVU, and\nLVBench, show that ReTaKe can support 4x longer video sequences with minimal\nperformance loss (<1%) and outperform all similar-size VideoLLMs with 3%-5%,\neven surpassing or on par with much larger ones. 
Our code is available at\nhttps://github.com/SCZwangxiao/video-ReTaKe\n","authors":["Xiao Wang","Qingyi Si","Jianlong Wu","Shiyu Zhu","Li Cao","Liqiang Nie"],"pdf_url":"https://arxiv.org/pdf/2412.20504v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20423v1","updated":"2024-12-29T10:13:30Z","published":"2024-12-29T10:13:30Z","title":"ESVQA: Perceptual Quality Assessment of Egocentric Spatial Videos","summary":" With the rapid development of eXtended Reality (XR), egocentric spatial\nshooting and display technologies have further enhanced immersion and\nengagement for users. Assessing the quality of experience (QoE) of egocentric\nspatial videos is crucial to ensure a high-quality viewing experience. However,\nthe corresponding research is still lacking. In this paper, we use the embodied\nexperience to highlight this more immersive experience and study the new\nproblem, i.e., embodied perceptual quality assessment for egocentric spatial\nvideos. Specifically, we introduce the first Egocentric Spatial Video Quality\nAssessment Database (ESVQAD), which comprises 600 egocentric spatial videos and\ntheir mean opinion scores (MOSs). Furthermore, we propose a novel\nmulti-dimensional binocular feature fusion model, termed ESVQAnet, which\nintegrates binocular spatial, motion, and semantic features to predict the\nperceptual quality. 
Experimental results demonstrate the ESVQAnet outperforms\n16 state-of-the-art VQA models on the embodied perceptual quality assessment\ntask, and exhibits strong generalization capability on traditional VQA tasks.\nThe database and codes will be released upon the publication.\n","authors":["Xilei Zhu","Huiyu Duan","Liu Yang","Yucheng Zhu","Xiongkuo Min","Guangtao Zhai","Patrick Le Callet"],"pdf_url":"https://arxiv.org/pdf/2412.20423v1.pdf","comment":"7 pages, 3 figures"},{"id":"http://arxiv.org/abs/2412.20381v1","updated":"2024-12-29T07:02:33Z","published":"2024-12-29T07:02:33Z","title":"Protégé: Learn and Generate Basic Makeup Styles with Generative\n Adversarial Networks (GANs)","summary":" Makeup is no longer confined to physical application; people now use mobile\napps to digitally apply makeup to their photos, which they then share on social\nmedia. However, while this shift has made makeup more accessible, designing\ndiverse makeup styles tailored to individual faces remains a challenge. This\nchallenge currently must still be done manually by humans. Existing systems,\nsuch as makeup recommendation engines and makeup transfer techniques, offer\nlimitations in creating innovative makeups for different individuals\n\"intuitively\" -- significant user effort and knowledge needed and limited\nmakeup options available in app. Our motivation is to address this challenge by\nproposing Prot\\'eg\\'e, a new makeup application, leveraging recent generative\nmodel -- GANs to learn and automatically generate makeup styles. This is a task\nthat existing makeup applications (i.e., makeup recommendation systems using\nexpert system and makeup transfer methods) are unable to perform. 
Extensive\nexperiments has been conducted to demonstrate the capability of Prot\\'eg\\'e in\nlearning and creating diverse makeups, providing a convenient and intuitive\nway, marking a significant leap in digital makeup technology!\n","authors":["Jia Wei Sii","Chee Seng Chan"],"pdf_url":"https://arxiv.org/pdf/2412.20381v1.pdf","comment":"8 pages, 5 figures"},{"id":"http://arxiv.org/abs/2412.20378v1","updated":"2024-12-29T06:46:24Z","published":"2024-12-29T06:46:24Z","title":"Tri-Ergon: Fine-grained Video-to-Audio Generation with Multi-modal\n Conditions and LUFS Control","summary":" Video-to-audio (V2A) generation utilizes visual-only video features to\nproduce realistic sounds that correspond to the scene. However, current V2A\nmodels often lack fine-grained control over the generated audio, especially in\nterms of loudness variation and the incorporation of multi-modal conditions. To\novercome these limitations, we introduce Tri-Ergon, a diffusion-based V2A model\nthat incorporates textual, auditory, and pixel-level visual prompts to enable\ndetailed and semantically rich audio synthesis. Additionally, we introduce\nLoudness Units relative to Full Scale (LUFS) embedding, which allows for\nprecise manual control of the loudness changes over time for individual audio\nchannels, enabling our model to effectively address the intricate correlation\nof video and audio in real-world Foley workflows. 
Tri-Ergon is capable of\ncreating 44.1 kHz high-fidelity stereo audio clips of varying lengths up to 60\nseconds, which significantly outperforms existing state-of-the-art V2A methods\nthat typically generate mono audio for a fixed duration.\n","authors":["Bingliang Li","Fengyu Yang","Yuxin Mao","Qingwen Ye","Hongkai Chen","Yiran Zhong"],"pdf_url":"https://arxiv.org/pdf/2412.20378v1.pdf","comment":"AAAI 2025 Accepted"},{"id":"http://arxiv.org/abs/2412.20359v1","updated":"2024-12-29T05:30:06Z","published":"2024-12-29T05:30:06Z","title":"EmoReg: Directional Latent Vector Modeling for Emotional Intensity\n Regularization in Diffusion-based Voice Conversion","summary":" The Emotional Voice Conversion (EVC) aims to convert the discrete emotional\nstate from the source emotion to the target for a given speech utterance while\npreserving linguistic content. In this paper, we propose regularizing emotion\nintensity in the diffusion-based EVC framework to generate precise speech of\nthe target emotion. Traditional approaches control the intensity of an\nemotional state in the utterance via emotion class probabilities or intensity\nlabels that often lead to inept style manipulations and degradations in\nquality. On the contrary, we aim to regulate emotion intensity using\nself-supervised learning-based feature representations and unsupervised\ndirectional latent vector modeling (DVM) in the emotional embedding space\nwithin a diffusion-based framework. These emotion embeddings can be modified\nbased on the given target emotion intensity and the corresponding direction\nvector. Furthermore, the updated embeddings can be fused in the reverse\ndiffusion process to generate the speech with the desired emotion and\nintensity. In summary, this paper aims to achieve high-quality emotional\nintensity regularization in the diffusion-based EVC framework, which is the\nfirst of its kind work. 
The effectiveness of the proposed method has been shown\nacross state-of-the-art (SOTA) baselines in terms of subjective and objective\nevaluations for the English and Hindi languages \\footnote{Demo samples are\navailable at the following URL: \\url{https://nirmesh-sony.github.io/EmoReg/}}.\n","authors":["Ashishkumar Gudmalwar","Ishan D. Biyani","Nirmesh Shah","Pankaj Wasnik","Rajiv Ratn Shah"],"pdf_url":"https://arxiv.org/pdf/2412.20359v1.pdf","comment":"Accepted to AAAI 2025"},{"id":"http://arxiv.org/abs/2412.16861v2","updated":"2024-12-29T05:06:53Z","published":"2024-12-22T05:04:17Z","title":"SoundLoc3D: Invisible 3D Sound Source Localization and Classification\n Using a Multimodal RGB-D Acoustic Camera","summary":" Accurately localizing 3D sound sources and estimating their semantic labels\n-- where the sources may not be visible, but are assumed to lie on the physical\nsurface of objects in the scene -- have many real applications, including\ndetecting gas leak and machinery malfunction. The audio-visual weak-correlation\nin such setting poses new challenges in deriving innovative methods to answer\nif or how we can use cross-modal information to solve the task. Towards this\nend, we propose to use an acoustic-camera rig consisting of a pinhole RGB-D\ncamera and a coplanar four-channel microphone array~(Mic-Array). By using this\nrig to record audio-visual signals from multiviews, we can use the cross-modal\ncues to estimate the sound sources 3D locations. Specifically, our framework\nSoundLoc3D treats the task as a set prediction problem, each element in the set\ncorresponds to a potential sound source. Given the audio-visual\nweak-correlation, the set representation is initially learned from a single\nview microphone array signal, and then refined by actively incorporating\nphysical surface cues revealed from multiview RGB-D images. 
We demonstrate the\nefficiency and superiority of SoundLoc3D on large-scale simulated dataset, and\nfurther show its robustness to RGB-D measurement inaccuracy and ambient noise\ninterference.\n","authors":["Yuhang He","Sangyun Shin","Anoop Cherian","Niki Trigoni","Andrew Markham"],"pdf_url":"https://arxiv.org/pdf/2412.16861v2.pdf","comment":"Accepted by WACV2025"}]},"2024-12-28T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2412.20299v1","updated":"2024-12-28T23:30:47Z","published":"2024-12-28T23:30:47Z","title":"No Preference Left Behind: Group Distributional Preference Optimization","summary":" Preferences within a group of people are not uniform but follow a\ndistribution. While existing alignment methods like Direct Preference\nOptimization (DPO) attempt to steer models to reflect human preferences, they\nstruggle to capture the distributional pluralistic preferences within a group.\nThese methods often skew toward dominant preferences, overlooking the diversity\nof opinions, especially when conflicting preferences arise. To address this\nissue, we propose Group Distribution Preference Optimization (GDPO), a novel\nframework that aligns language models with the distribution of preferences\nwithin a group by incorporating the concept of beliefs that shape individual\npreferences. GDPO calibrates a language model using statistical estimation of\nthe group's belief distribution and aligns the model with belief-conditioned\npreferences, offering a more inclusive alignment framework than traditional\nmethods. In experiments using both synthetic controllable opinion generation\nand real-world movie review datasets, we show that DPO fails to align with the\ntargeted belief distributions, while GDPO consistently reduces this alignment\ngap during training. 
Moreover, our evaluation metrics demonstrate that GDPO\noutperforms existing approaches in aligning with group distributional\npreferences, marking a significant advance in pluralistic alignment.\n","authors":["Binwei Yao","Zefan Cai","Yun-Shiuan Chuang","Shanglin Yang","Ming Jiang","Diyi Yang","Junjie Hu"],"pdf_url":"https://arxiv.org/pdf/2412.20299v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17947v2","updated":"2024-12-28T21:27:17Z","published":"2024-12-23T19:58:11Z","title":"IITR-CIOL@NLU of Devanagari Script Languages 2025: Multilingual Hate\n Speech Detection and Target Identification in Devanagari-Scripted Languages","summary":" This work focuses on two subtasks related to hate speech detection and target\nidentification in Devanagari-scripted languages, specifically Hindi, Marathi,\nNepali, Bhojpuri, and Sanskrit. Subtask B involves detecting hate speech in\nonline text, while Subtask C requires identifying the specific targets of hate\nspeech, such as individuals, organizations, or communities. We propose the\nMultilingualRobertaClass model, a deep neural network built on the pretrained\nmultilingual transformer model ia-multilingual-transliterated-roberta,\noptimized for classification tasks in multilingual and transliterated contexts.\nThe model leverages contextualized embeddings to handle linguistic diversity,\nwith a classifier head for binary classification. We received 88.40% accuracy\nin Subtask B and 66.11% accuracy in Subtask C, in the test set.\n","authors":["Siddhant Gupta","Siddh Singhal","Azmine Toushik Wasi"],"pdf_url":"https://arxiv.org/pdf/2412.17947v2.pdf","comment":"Accepted to CHiPSAL Workshop at COLING 2025"},{"id":"http://arxiv.org/abs/2406.11177v2","updated":"2024-12-28T21:16:38Z","published":"2024-06-17T03:29:14Z","title":"Retrieval-Augmented Feature Generation for Domain-Specific\n Classification","summary":" Feature generation can significantly enhance learning outcomes, particularly\nfor tasks with limited data. 
An effective way to improve feature generation is\nby expanding the current feature space using existing features and enriching\nthe informational content. However, generating new, interpretable features in\napplication fields often requires domain-specific knowledge about the existing\nfeatures. This paper introduces a new method RAFG for generating reasonable and\nexplainable features specific to domain classification tasks. To generate new\nfeatures with interpretability in domain knowledge, we perform information\nretrieval on existing features to identify potential feature associations, and\nutilize these associations to generate meaningful features. Furthermore, we\ndevelop a Large Language Model (LLM)-based framework for feature generation\nwith reasoning to verify and filter features during the generation process.\nExperiments across several datasets in medical, economic, and geographic\ndomains show that our RAFG method produces high-quality, meaningful features\nand significantly improves classification performance compared with baseline\nmethods.\n","authors":["Xinhao Zhang","Jinghan Zhang","Fengran Mo","Yuzhong Chen","Kunpeng Liu"],"pdf_url":"https://arxiv.org/pdf/2406.11177v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01003v3","updated":"2024-12-28T21:01:43Z","published":"2024-12-01T23:35:53Z","title":"Competition Dynamics Shape Algorithmic Phases of In-Context Learning","summary":" In-Context Learning (ICL) has significantly expanded the general-purpose\nnature of large language models, allowing them to adapt to novel tasks using\nmerely the inputted context. This has motivated a series of papers that analyze\ntractable synthetic domains and postulate precise mechanisms that may underlie\nICL. However, the use of relatively distinct setups that often lack a sequence\nmodeling nature to them makes it unclear how general the reported insights from\nsuch studies are. 
Motivated by this, we propose a synthetic sequence modeling\ntask that involves learning to simulate a finite mixture of Markov chains. As\nwe show, models trained on this task reproduce most well-known results on ICL,\nhence offering a unified setting for studying the concept. Building on this\nsetup, we demonstrate we can explain a model's behavior by decomposing it into\nfour broad algorithms that combine a fuzzy retrieval vs. inference approach\nwith either unigram or bigram statistics of the context. These algorithms\nengage in a competition dynamics to dominate model behavior, with the precise\nexperimental conditions dictating which algorithm ends up superseding others:\ne.g., we find merely varying context size or amount of training yields (at\ntimes sharp) transitions between which algorithm dictates the model behavior,\nrevealing a mechanism that explains the transient nature of ICL. In this sense,\nwe argue ICL is best thought of as a mixture of different algorithms, each with\nits own peculiarities, instead of a monolithic capability. This also implies\nthat making general claims about ICL that hold universally across all settings\nmay be infeasible.\n","authors":["Core Francisco Park","Ekdeep Singh Lubana","Itamar Pres","Hidenori Tanaka"],"pdf_url":"https://arxiv.org/pdf/2412.01003v3.pdf","comment":"Preprint. Under review"},{"id":"http://arxiv.org/abs/2412.20264v1","updated":"2024-12-28T20:37:57Z","published":"2024-12-28T20:37:57Z","title":"Scoring with Large Language Models: A Study on Measuring Empathy of\n Responses in Dialogues","summary":" In recent years, Large Language Models (LLMs) have become increasingly more\npowerful in their ability to complete complex tasks. One such task in which\nLLMs are often employed is scoring, i.e., assigning a numerical value from a\ncertain scale to a subject. In this paper, we strive to understand how LLMs\nscore, specifically in the context of empathy scoring. 
We develop a novel and\ncomprehensive framework for investigating how effective LLMs are at measuring\nand scoring empathy of responses in dialogues, and what methods can be employed\nto deepen our understanding of LLM scoring. Our strategy is to approximate the\nperformance of state-of-the-art and fine-tuned LLMs with explicit and\nexplainable features. We train classifiers using various features of dialogues\nincluding embeddings, the Motivational Interviewing Treatment Integrity (MITI)\nCode, a set of explicit subfactors of empathy as proposed by LLMs, and a\ncombination of the MITI Code and the explicit subfactors. Our results show that\nwhen only using embeddings, it is possible to achieve performance close to that\nof generic LLMs, and when utilizing the MITI Code and explicit subfactors\nscored by an LLM, the trained classifiers can closely match the performance of\nfine-tuned LLMs. We employ feature selection methods to derive the most crucial\nfeatures in the process of empathy scoring. Our work provides a new perspective\ntoward understanding LLM empathy scoring and helps the LLM community explore\nthe potential of LLM scoring in social science studies.\n","authors":["Henry J. Xie","Jinghan Zhang","Xinhao Zhang","Kunpeng Liu"],"pdf_url":"https://arxiv.org/pdf/2412.20264v1.pdf","comment":"Accepted by IEEE BigData 2024"},{"id":"http://arxiv.org/abs/2412.20251v1","updated":"2024-12-28T19:51:08Z","published":"2024-12-28T19:51:08Z","title":"ComparisonQA: Evaluating Factuality Robustness of LLMs Through Knowledge\n Frequency Control and Uncertainty","summary":" The rapid development of LLMs has sparked extensive research into their\nfactual knowledge. Current works claim that LLMs fall short on questions\nrequiring less frequent knowledge. However, their proof is incomplete since\nthey only study the influence of entity frequency, which can not fully\nrepresent knowledge frequency. 
So we introduce ComparisonQA benchmark,\ncontaining 283K abstract questions, each instantiated by a pair of\nhigh-frequency and low-frequency entities. It ensures a controllable comparison\nbecause the difference of knowledge frequency between such a pair is only\nrelated to entity frequency. In addition, to avoid possible semantic shortcuts,\nwhich is a severe problem of current LLMs study, we design a two-round method\nfor knowledge robustness measurement utilizing both correctness and\nuncertainty. Experiments reveal that LLMs exhibit particularly low robustness\nregarding low-frequency knowledge, and GPT-4o is even the worst under this\nmeasurement. Besides, we introduce an automatic method to filter out questions\nwith low-quality and shortcuts to form ComparisonQA-Hard. We find that\nuncertainty effectively identifies such questions while maintaining the data\nsize.\n","authors":["Qing Zong","Zhaowei Wang","Tianshi Zheng","Xiyu Ren","Yangqiu Song"],"pdf_url":"https://arxiv.org/pdf/2412.20251v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.16671v5","updated":"2024-12-28T17:50:04Z","published":"2023-09-28T17:59:56Z","title":"Demystifying CLIP Data","summary":" Contrastive Language-Image Pre-training (CLIP) is an approach that has\nadvanced research and applications in computer vision, fueling modern\nrecognition systems and generative models. We believe that the main ingredient\nto the success of CLIP is its data and not the model architecture or\npre-training objective. However, CLIP only provides very limited information\nabout its data and how it has been collected, leading to works that aim to\nreproduce CLIP's data by filtering with its model parameters. In this work, we\nintend to reveal CLIP's data curation approach and in our pursuit of making it\nopen to the community introduce Metadata-Curated Language-Image Pre-training\n(MetaCLIP). 
MetaCLIP takes a raw data pool and metadata (derived from CLIP's\nconcepts) and yields a balanced subset over the metadata distribution. Our\nexperimental study rigorously isolates the model and training settings,\nconcentrating solely on data. MetaCLIP applied to CommonCrawl with 400M\nimage-text data pairs outperforms CLIP's data on multiple standard benchmarks.\nIn zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy,\nsurpassing CLIP's 68.3% on ViT-B models. Scaling to 1B data, while maintaining\nthe same training budget, attains 72.4%. Our observations hold across various\nmodel sizes, exemplified by ViT-H achieving 80.5%, without any\nbells-and-whistles. Curation code and training data distribution on metadata is\nmade available at https://github.com/facebookresearch/MetaCLIP.\n","authors":["Hu Xu","Saining Xie","Xiaoqing Ellen Tan","Po-Yao Huang","Russell Howes","Vasu Sharma","Shang-Wen Li","Gargi Ghosh","Luke Zettlemoyer","Christoph Feichtenhofer"],"pdf_url":"https://arxiv.org/pdf/2309.16671v5.pdf","comment":"17 pages. arXiv admin note: text overlap with arXiv:2103.00020 by\n other authors"},{"id":"http://arxiv.org/abs/2412.20227v1","updated":"2024-12-28T17:48:33Z","published":"2024-12-28T17:48:33Z","title":"LLM Reasoning Engine: Specialized Training for Enhanced Mathematical\n Reasoning","summary":" Large Language Models (LLMs) have shown remarkable performance in various\nnatural language processing tasks but face challenges in mathematical\nreasoning, where complex problem-solving requires both linguistic understanding\nand mathematical reasoning skills. Existing approaches to address this\nchallenge often rely on ensemble methods and suffer from the problem of data\nscarcity in target domains. In this work, we present a novel method to enhance\nLLMs' capabilities in mathematical reasoning tasks. 
Motivated by the need to\nbridge this gap, our approach incorporates a question paraphrase strategy,\nwhich aims at diversifying the linguistic forms of mathematical questions to\nimprove generalization. Additionally, specialized training objectives are\nemployed to guide the model's learning process, focusing on enhancing its\nunderstanding of mathematical concepts and reasoning processes. We conduct\nexperiments on four datasets using different LLMs, and demonstrate the\neffectiveness of our approach in improving LLMs' performance on mathematical\nreasoning tasks. Our findings underscore the significance of our methodology in\nthe advancement of large language models and its potential implications for\nreal-world applications that require mathematical reasoning abilities.\n","authors":["Shuguang Chen","Guang Lin"],"pdf_url":"https://arxiv.org/pdf/2412.20227v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.17251v3","updated":"2024-12-28T17:46:27Z","published":"2024-10-22T17:59:57Z","title":"Altogether: Image Captioning via Re-aligning Alt-text","summary":" This paper focuses on creating synthetic data to improve the quality of image\ncaptions. Existing works typically have two shortcomings. First, they caption\nimages from scratch, ignoring existing alt-text metadata, and second, lack\ntransparency if the captioners' training data (e.g. GPT) is unknown. In this\npaper, we study a principled approach Altogether based on the key idea to edit\nand re-align existing alt-texts associated with the images. To generate\ntraining data, we perform human annotation where annotators start with the\nexisting alt-text and re-align it to the image content in multiple rounds,\nconsequently constructing captions with rich visual concepts. This differs from\nprior work that carries out human annotation as a one-time description task\nsolely based on images and annotator knowledge. 
We train a captioner on this\ndata that generalizes the process of re-aligning alt-texts at scale. Our\nresults show our Altogether approach leads to richer image captions that also\nimprove text-to-image generation and zero-shot image classification tasks.\n","authors":["Hu Xu","Po-Yao Huang","Xiaoqing Ellen Tan","Ching-Feng Yeh","Jacob Kahn","Christine Jou","Gargi Ghosh","Omer Levy","Luke Zettlemoyer","Wen-tau Yih","Shang-Wen Li","Saining Xie","Christoph Feichtenhofer"],"pdf_url":"https://arxiv.org/pdf/2410.17251v3.pdf","comment":"accepted by EMNLP 2024; Meta CLIP 1.2 Data Engine"},{"id":"http://arxiv.org/abs/2412.20223v1","updated":"2024-12-28T17:34:17Z","published":"2024-12-28T17:34:17Z","title":"AfriHG: News headline generation for African Languages","summary":" This paper introduces AfriHG -- a news headline generation dataset created by\ncombining from XLSum and MasakhaNEWS datasets focusing on 16 languages widely\nspoken by Africa. We experimented with two seq2eq models (mT5-base and AfriTeVa\nV2), and Aya-101 LLM. Our results show that Africa-centric seq2seq models such\nas AfriTeVa V2 outperform the massively multilingual mT5-base model. 
Finally,\nwe show that the performance of fine-tuning AfriTeVa V2 with 313M parameters is\ncompetitive to prompting Aya-101 LLM with more than 13B parameters.\n","authors":["Toyib Ogunremi","Serah Akojenu","Anthony Soronnadi","Olubayo Adekanmbi","David Ifeoluwa Adelani"],"pdf_url":"https://arxiv.org/pdf/2412.20223v1.pdf","comment":"Accepted to AfricaNLP Workshop at ICLR 2024"},{"id":"http://arxiv.org/abs/2412.03152v2","updated":"2024-12-28T17:21:27Z","published":"2024-12-04T09:21:46Z","title":"A Measure of the System Dependence of Automated Metrics","summary":" Automated metrics for Machine Translation have made significant progress,\nwith the goal of replacing expensive and time-consuming human evaluations.\nThese metrics are typically assessed by their correlation with human judgments,\nwhich captures the monotonic relationship between human and metric scores.\nHowever, we argue that it is equally important to ensure that metrics treat all\nsystems fairly and consistently. In this paper, we introduce a method to\nevaluate this aspect.\n","authors":["Pius von Däniken","Jan Deriu","Mark Cieliebak"],"pdf_url":"https://arxiv.org/pdf/2412.03152v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.09503v2","updated":"2024-12-28T17:15:28Z","published":"2024-08-18T14:52:25Z","title":"Out-of-distribution generalization via composition: a lens through\n induction heads in Transformers","summary":" Large language models (LLMs) such as GPT-4 sometimes appear to be creative,\nsolving novel tasks often with a few demonstrations in the prompt. These tasks\nrequire the models to generalize on distributions different from those from\ntraining data -- which is known as out-of-distribution (OOD) generalization.\nDespite the tremendous success of LLMs, how they approach OOD generalization\nremains an open and underexplored question. 
We examine OOD generalization in\nsettings where instances are generated according to hidden rules, including\nin-context learning with symbolic reasoning. Models are required to infer the\nhidden rules behind input prompts without any fine-tuning.\n We empirically examined the training dynamics of Transformers on a synthetic\nexample and conducted extensive experiments on a variety of pretrained LLMs,\nfocusing on a type of components known as induction heads. We found that OOD\ngeneralization and composition are tied together -- models can learn rules by\ncomposing two self-attention layers, thereby achieving OOD generalization.\nFurthermore, a shared latent subspace in the embedding (or feature) space acts\nas a bridge for composition by aligning early layers and later layers, which we\nrefer to as the common bridge representation hypothesis.\n","authors":["Jiajun Song","Zhuoyan Xu","Yiqiao Zhong"],"pdf_url":"https://arxiv.org/pdf/2408.09503v2.pdf","comment":"46 pages, 27 figures"},{"id":"http://arxiv.org/abs/2412.20218v1","updated":"2024-12-28T17:03:30Z","published":"2024-12-28T17:03:30Z","title":"YAD: Leveraging T5 for Improved Automatic Diacritization of Yorùbá\n Text","summary":" In this work, we present Yor\\`ub\\'a automatic diacritization (YAD) benchmark\ndataset for evaluating Yor\\`ub\\'a diacritization systems. In addition, we\npre-train text-to-text transformer, T5 model for Yor\\`ub\\'a and showed that\nthis model outperform several multilingually trained T5 models. Lastly, we\nshowed that more data and larger models are better at diacritization for\nYor\\`ub\\'a\n","authors":["Akindele Michael Olawole","Jesujoba O. Alabi","Aderonke Busayo Sakpere","David I. 
Adelani"],"pdf_url":"https://arxiv.org/pdf/2412.20218v1.pdf","comment":"Accepted at AfricaNLP Workshop at ICLR 2024"},{"id":"http://arxiv.org/abs/2412.20213v1","updated":"2024-12-28T16:54:25Z","published":"2024-12-28T16:54:25Z","title":"Decoding Emotion: Speech Perception Patterns in Individuals with\n Self-reported Depression","summary":" The current study examines the relationship between self-reported depression\nand the perception of affective speech within the Indian population. PANAS and\nPHQ-9 were used to assess current mood and depression, respectively.\nParticipants' emotional reactivity was recorded on a valence and arousal scale\nagainst the affective speech audio presented in a sequence. No significant\ndifferences between the depression and no-depression groups were observed for\nany of the emotional stimuli, except the audio file depicting neutral emotion.\nSignificantly higher PANAS scores by the depression than the no-depression\ngroup indicate the impact of pre-disposed mood on the current mood status.\nContrary to previous findings, this study did not observe reduced positive\nemotional reactivity by the depression group. However, the results demonstrated\nconsistency in emotional reactivity for speech stimuli depicting sadness and\nanger across all measures of emotion perception.\n","authors":["Guneesh Vats","Priyanka Srivastava","Chiranjeevi Yarra"],"pdf_url":"https://arxiv.org/pdf/2412.20213v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20212v1","updated":"2024-12-28T16:53:25Z","published":"2024-12-28T16:53:25Z","title":"Building a Rich Dataset to Empower the Persian Question Answering\n Systems","summary":" Question answering systems provide short, precise, and specific answers to\nquestions. So far, many robust question answering systems have been developed\nfor English, while some languages with fewer resources, like Persian, have few\nnumbers of standard dataset. 
In this study, a comprehensive open-domain dataset\nis presented for Persian. This dataset is called NextQuAD and has 7,515\ncontexts, including 23,918 questions and answers. Then, a BERT-based question\nanswering model has been applied to this dataset using two pre-trained language\nmodels, including ParsBERT and XLM-RoBERTa. The results of these two models\nhave been ensembled using mean logits. Evaluation on the development set shows\n0.95 Exact Match (EM) and 0.97 Fl_score. Also, to compare the NextQuAD with\nother Persian datasets, our trained model on the NextQuAD, is evaluated on two\nother datasets named PersianQA and ParSQuAD. Comparisons show that the proposed\nmodel increased EM by 0.39 and 0.14 respectively in PersianQA and\nParSQuAD-manual, while a slight EM decline of 0.007 happened in\nParSQuAD-automatic.\n","authors":["Mohsen Yazdinejad","Marjan Kaedi"],"pdf_url":"https://arxiv.org/pdf/2412.20212v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.06195v2","updated":"2024-12-28T16:46:16Z","published":"2024-10-08T16:55:51Z","title":"Entering Real Social World! Benchmarking the Social Intelligence of\n Large Language Models from a First-person Perspective","summary":" Social intelligence is built upon three foundational pillars: cognitive\nintelligence, situational intelligence, and behavioral intelligence. As large\nlanguage models (LLMs) become increasingly integrated into our social lives,\nunderstanding, evaluating, and developing their social intelligence are\nbecoming increasingly important. While multiple existing works have\ninvestigated the social intelligence of LLMs, (1) most focus on a specific\naspect, and the social intelligence of LLMs has yet to be systematically\norganized and studied; (2) position LLMs as passive observers from a\nthird-person perspective, such as in Theory of Mind (ToM) tests. 
Compared to\nthe third-person perspective, ego-centric first-person perspective evaluation\ncan align well with actual LLM-based Agent use scenarios. (3) a lack of\ncomprehensive evaluation of behavioral intelligence, with specific emphasis on\nincorporating critical human-machine interaction scenarios. In light of this,\nwe present EgoSocialArena, a novel framework grounded in the three pillars of\nsocial intelligence: cognitive, situational, and behavioral intelligence, aimed\nto systematically evaluate the social intelligence of LLMs from a first-person\nperspective. With EgoSocialArena, we have conducted a comprehensive evaluation\nof eight prominent foundation models, even the most advanced LLMs like\no1-preview lag behind human performance by 11.0 points.\n","authors":["Guiyang Hou","Wenqi Zhang","Yongliang Shen","Zeqi Tan","Sihao Shen","Weiming Lu"],"pdf_url":"https://arxiv.org/pdf/2410.06195v2.pdf","comment":"14 pages, 6 figures"},{"id":"http://arxiv.org/abs/2410.03168v2","updated":"2024-12-28T13:39:59Z","published":"2024-10-04T06:01:27Z","title":"Can Watermarked LLMs be Identified by Users via Crafted Prompts?","summary":" Text watermarking for Large Language Models (LLMs) has made significant\nprogress in detecting LLM outputs and preventing misuse. Current watermarking\ntechniques offer high detectability, minimal impact on text quality, and\nrobustness to text editing. However, current researches lack investigation into\nthe imperceptibility of watermarking techniques in LLM services. This is\ncrucial as LLM providers may not want to disclose the presence of watermarks in\nreal-world scenarios, as it could reduce user willingness to use the service\nand make watermarks more vulnerable to attacks. This work is the first to\ninvestigate the imperceptibility of watermarked LLMs. We design an\nidentification algorithm called Water-Probe that detects watermarks through\nwell-designed prompts to the LLM. 
Our key motivation is that current\nwatermarked LLMs expose consistent biases under the same watermark key,\nresulting in similar differences across prompts under different watermark keys.\nExperiments show that almost all mainstream watermarking algorithms are easily\nidentified with our well-designed prompts, while Water-Probe demonstrates a\nminimal false positive rate for non-watermarked LLMs. Finally, we propose that\nthe key to enhancing the imperceptibility of watermarked LLMs is to increase\nthe randomness of watermark key selection. Based on this, we introduce the\nWater-Bag strategy, which significantly improves watermark imperceptibility by\nmerging multiple watermark keys.\n","authors":["Aiwei Liu","Sheng Guan","Yiming Liu","Leyi Pan","Yifei Zhang","Liancheng Fang","Lijie Wen","Philip S. Yu","Xuming Hu"],"pdf_url":"https://arxiv.org/pdf/2410.03168v2.pdf","comment":"30 pages, 5 figures, 11 tables"},{"id":"http://arxiv.org/abs/2412.20145v1","updated":"2024-12-28T13:13:33Z","published":"2024-12-28T13:13:33Z","title":"Efficient Multi-Agent Collaboration with Tool Use for Online Planning in\n Complex Table Question Answering","summary":" Complex table question answering (TQA) aims to answer questions that require\ncomplex reasoning, such as multi-step or multi-category reasoning, over data\nrepresented in tabular form. Previous approaches demonstrated notable\nperformance by leveraging either closed-source large language models (LLMs) or\nfine-tuned open-weight LLMs. However, fine-tuning LLMs requires high-quality\ntraining data, which is costly to obtain, and utilizing closed-source LLMs\nposes accessibility challenges and leads to reproducibility issues. In this\npaper, we propose Multi-Agent Collaboration with Tool use (MACT), a framework\nthat requires neither closed-source models nor fine-tuning. In MACT, a planning\nagent and a coding agent that also make use of tools collaborate to answer\nquestions. 
Our experiments on four TQA benchmarks show that MACT outperforms\nprevious SoTA systems on three out of four benchmarks and that it performs\ncomparably to the larger and more expensive closed-source model GPT-4 on two\nbenchmarks, even when using only open-weight models without any fine-tuning. We\nconduct extensive analyses to prove the effectiveness of MACT's multi-agent\ncollaboration in TQA.\n","authors":["Wei Zhou","Mohsen Mesgar","Annemarie Friedrich","Heike Adel"],"pdf_url":"https://arxiv.org/pdf/2412.20145v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.05844v2","updated":"2024-12-28T12:31:01Z","published":"2023-09-25T13:46:00Z","title":"Face-StyleSpeech: Enhancing Zero-shot Speech Synthesis from Face Images\n with Improved Face-to-Speech Mapping","summary":" Generating speech from a face image is crucial for developing virtual humans\ncapable of interacting using their unique voices, without relying on\npre-recorded human speech. In this paper, we propose Face-StyleSpeech, a\nzero-shot Text-To-Speech (TTS) synthesis model that generates natural speech\nconditioned on a face image rather than reference speech. We hypothesize that\nlearning entire prosodic features from a face image poses a significant\nchallenge. To address this, our TTS model incorporates both face and prosody\nencoders. 
The prosody encoder is specifically designed to model speech style\ncharacteristics that are not fully captured by the face image, allowing the\nface encoder to focus on extracting speaker-specific features such as timbre.\nExperimental results demonstrate that Face-StyleSpeech effectively generates\nmore natural speech from a face image than baselines, even for unseen faces.\nSamples are available on our demo page.\n","authors":["Minki Kang","Wooseok Han","Eunho Yang"],"pdf_url":"https://arxiv.org/pdf/2311.05844v2.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.20127v1","updated":"2024-12-28T12:11:28Z","published":"2024-12-28T12:11:28Z","title":"M-MAD: Multidimensional Multi-Agent Debate Framework for Fine-grained\n Machine Translation Evaluation","summary":" Recent advancements in large language models (LLMs) have given rise to the\nLLM-as-a-judge paradigm, showcasing their potential to deliver human-like\njudgments. However, in the field of machine translation (MT) evaluation,\ncurrent LLM-as-a-judge methods fall short of learned automatic metrics. In this\npaper, we propose Multidimensional Multi-Agent Debate (M-MAD), a systematic\nLLM-based multi-agent framework for advanced LLM-as-a-judge MT evaluation. Our\nfindings demonstrate that M-MAD achieves significant advancements by (1)\ndecoupling heuristic MQM criteria into distinct evaluation dimensions for\nfine-grained assessments; (2) employing multi-agent debates to harness the\ncollaborative reasoning capabilities of LLMs; (3) synthesizing\ndimension-specific results into a final evaluation judgment to ensure robust\nand reliable outcomes. Comprehensive experiments show that M-MAD not only\noutperforms all existing LLM-as-a-judge methods but also competes with\nstate-of-the-art reference-based automatic metrics, even when powered by a\nsuboptimal model like GPT-4o mini. 
Detailed ablations and analysis highlight\nthe superiority of our framework design, offering a fresh perspective for\nLLM-as-a-judge paradigm. Our code and data are publicly available at\nhttps://github.com/SU-JIAYUAN/M-MAD.\n","authors":["Zhaopeng Feng","Jiayuan Su","Jiamei Zheng","Jiahan Ren","Yan Zhang","Jian Wu","Hongwei Wang","Zuozhu Liu"],"pdf_url":"https://arxiv.org/pdf/2412.20127v1.pdf","comment":"Work in progress. Code and data are available at\n https://github.com/SU-JIAYUAN/M-MAD"},{"id":"http://arxiv.org/abs/2310.05650v2","updated":"2024-12-28T11:50:15Z","published":"2023-10-09T12:01:26Z","title":"ReZG: Retrieval-Augmented Zero-Shot Counter Narrative Generation for\n Hate Speech","summary":" The proliferation of hate speech (HS) on social media poses a serious threat\nto societal security. Automatic counter narrative (CN) generation, as an active\nstrategy for HS intervention, has garnered increasing attention in recent\nyears. Existing methods for automatically generating CNs mainly rely on\nre-training or fine-tuning pre-trained language models (PLMs) on human-curated\nCN corpora. Unfortunately, the annotation speed of CN corpora cannot keep up\nwith the growth of HS targets, while generating specific and effective CNs for\nunseen targets remains a significant challenge for the model. To tackle this\nissue, we propose Retrieval-Augmented Zero-shot Generation (ReZG) to generate\nCNs with high-specificity for unseen targets. Specifically, we propose a\nmulti-dimensional hierarchical retrieval method that integrates stance,\nsemantics, and fitness, extending the retrieval metric from single dimension to\nmultiple dimensions suitable for the knowledge that refutes HS. Then, we\nimplement an energy-based constrained decoding mechanism that enables PLMs to\nuse differentiable knowledge preservation, countering, and fluency constraint\nfunctions instead of in-target CNs as control signals for generation, thereby\nachieving zero-shot CN generation. 
With the above techniques, ReZG can\nintegrate external knowledge flexibly and improve the specificity of CNs.\nExperimental results show that ReZG exhibits stronger generalization\ncapabilities and outperforms strong baselines with significant improvements of\n2.0%+ in the relevance and 4.5%+ in the countering success rate metrics.\n","authors":["Shuyu Jiang","Wenyi Tang","Xingshu Chen","Rui Tang","Haizhou Wang","Wenxian Wang"],"pdf_url":"https://arxiv.org/pdf/2310.05650v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.03350v2","updated":"2024-12-28T09:18:36Z","published":"2024-11-04T04:43:01Z","title":"A Comprehensive Survey of Small Language Models in the Era of Large\n Language Models: Techniques, Enhancements, Applications, Collaboration with\n LLMs, and Trustworthiness","summary":" Large language models (LLMs) have demonstrated emergent abilities in text\ngeneration, question answering, and reasoning, facilitating various tasks and\ndomains. Despite their proficiency in various tasks, LLMs like PaLM 540B and\nLlama-3.1 405B face limitations due to large parameter sizes and computational\ndemands, often requiring cloud API use which raises privacy concerns, limits\nreal-time applications on edge devices, and increases fine-tuning costs.\nAdditionally, LLMs often underperform in specialized domains such as healthcare\nand law due to insufficient domain-specific knowledge, necessitating\nspecialized models. Therefore, Small Language Models (SLMs) are increasingly\nfavored for their low inference latency, cost-effectiveness, efficient\ndevelopment, and easy customization and adaptability. 
These models are\nparticularly well-suited for resource-limited environments and domain knowledge\nacquisition, addressing LLMs' challenges and proving ideal for applications\nthat require localized data handling for privacy, minimal inference latency for\nefficiency, and domain knowledge acquisition through lightweight fine-tuning.\nThe rising demand for SLMs has spurred extensive research and development.\nHowever, a comprehensive survey investigating issues related to the definition,\nacquisition, application, enhancement, and reliability of SLM remains lacking,\nprompting us to conduct a detailed survey on these topics. The definition of\nSLMs varies widely, thus to standardize, we propose defining SLMs by their\ncapability to perform specialized tasks and suitability for\nresource-constrained settings, setting boundaries based on the minimal size for\nemergent abilities and the maximum size sustainable under resource constraints.\nFor other aspects, we provide a taxonomy of relevant models/methods and develop\ngeneral frameworks for each category to enhance and utilize SLMs effectively.\n","authors":["Fali Wang","Zhiwei Zhang","Xianren Zhang","Zongyu Wu","Tzuhao Mo","Qiuhao Lu","Wanjing Wang","Rui Li","Junjie Xu","Xianfeng Tang","Qi He","Yao Ma","Ming Huang","Suhang Wang"],"pdf_url":"https://arxiv.org/pdf/2411.03350v2.pdf","comment":"78 pages, 32 figures, 14 tables"},{"id":"http://arxiv.org/abs/2406.15504v3","updated":"2024-12-28T08:16:58Z","published":"2024-06-19T16:43:56Z","title":"Multi-View Empowered Structural Graph Wordification for Language Models","summary":" Significant efforts have been dedicated to integrating the powerful Large\nLanguage Models (LLMs) with diverse modalities, particularly focusing on the\nfusion of language, vision and audio data. However, the graph-structured data,\nwhich is inherently rich in structural and domain-specific knowledge, has not\nyet been gracefully adapted to LLMs. 
Existing methods either describe the graph\nwith raw text, suffering the loss of graph structural information, or feed\nGraph Neural Network (GNN) embeddings into LLMs at the cost of losing\nexplainable prompt semantics. To bridge this gap, we introduce an end-to-end\nmodality-aligning framework for LLM-graph alignment: Dual-Residual Vector\nQuantized-Variational AutoEncoder, namely Dr.E. Our approach is purposefully\ndesigned to facilitate token-level alignment with LLMs, enabling an effective\ntranslation of the intrinsic `language' of graphs into comprehensible natural\nlanguage. We also manage to enhance LLMs' more robust structural understanding\nof graphs by incorporating multiple views of the central nodes based on their\nsurrounding nodes at various distances. Our experimental evaluations on\nstandard graph tasks demonstrate competitive performance against other\nstate-of-the-art (SOTA) approaches. Additionally, our framework ensures certain\nvisual interpretability, efficiency, and robustness, marking the promising\nsuccessful endeavor to achieve token-level alignment between LLMs and GNNs. Our\ncode is available at: https://github.com/Timothy914/Dr.E.\n","authors":["Zipeng Liu","Likang Wu","Ming He","Zhong Guan","Hongke Zhao","Nan Feng"],"pdf_url":"https://arxiv.org/pdf/2406.15504v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20072v1","updated":"2024-12-28T07:54:14Z","published":"2024-12-28T07:54:14Z","title":"Extract Information from Hybrid Long Documents Leveraging LLMs: A\n Framework and Dataset","summary":" Large Language Models (LLMs) demonstrate exceptional performance in textual\nunderstanding and tabular reasoning tasks. However, their ability to comprehend\nand analyze hybrid text, containing textual and tabular data, remains\nunexplored. The hybrid text often appears in the form of hybrid long documents\n(HLDs), which far exceed the token limit of LLMs. 
Consequently, we apply an\nAutomated Information Extraction framework (AIE) to enable LLMs to process\nHLDs and carry out experiments to analyse four important aspects of information\nextraction from HLDs. Our findings are: 1) an effective way to select and\nsummarize the useful parts of an HLD; 2) a simple table serialization format is\nsufficient for LLMs to understand tables; 3) the naive AIE adapts well to many\ncomplex scenarios; and 4) useful prompt engineering techniques that enhance\nLLMs on HLDs. To\naddress the issue of dataset scarcity in HLDs and support future work, we also\npropose the Financial Reports Numerical Extraction (FINE) dataset. The dataset\nand code are publicly available in the attachments.\n","authors":["Chongjian Yue","Xinrun Xu","Xiaojun Ma","Lun Du","Zhiming Ding","Shi Han","Dongmei Zhang","Qi Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.20072v1.pdf","comment":"ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.20070v1","updated":"2024-12-28T07:50:00Z","published":"2024-12-28T07:50:00Z","title":"On the Compositional Generalization of Multimodal LLMs for Medical\n Imaging","summary":" Multimodal large language models (MLLMs) hold significant potential in the\nmedical field, but their capabilities are often limited by insufficient data in\ncertain medical domains, highlighting the need for understanding what kinds of\nimages can be used by MLLMs for generalization. Current research suggests that\nmulti-task training outperforms single-task training, as different tasks can\nbenefit each other, but such studies often overlook the internal relationships\nwithin these tasks, providing limited guidance on selecting datasets to enhance\nspecific tasks. To\nanalyze this phenomenon, we attempted to employ compositional generalization\n(CG), the ability of models to understand novel combinations by recombining\nlearned elements, as a guiding framework. 
Since medical images can be precisely\ndefined by Modality, Anatomical area, and Task, they naturally provide an\nenvironment for exploring CG. We therefore assembled 106 medical datasets to\ncreate Med-MAT for comprehensive experiments. The experiments confirmed that\nMLLMs can use CG to understand unseen medical images and identified CG as one\nof the main drivers of the generalization observed in multi-task training.\nAdditionally, further studies demonstrated that CG effectively supports\ndatasets with limited data and delivers consistent performance across different\nbackbones, highlighting its versatility and broad applicability. Med-MAT is\npublicly available at https://github.com/FreedomIntelligence/Med-MAT.\n","authors":["Zhenyang Cai","Junying Chen","Rongsheng Wang","Weihong Wang","Yonglin Deng","Dingjie Song","Yize Chen","Zixu Zhang","Benyou Wang"],"pdf_url":"https://arxiv.org/pdf/2412.20070v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20068v1","updated":"2024-12-28T07:42:29Z","published":"2024-12-28T07:42:29Z","title":"The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based\n Markers for Mental Health Support","summary":" The increasing demand for mental health services has highlighted the need for\ninnovative solutions, particularly in the realm of psychological conversational\nAI, where the availability of sensitive data is scarce. In this work, we\nexplored the development of a system tailored for mental health support with a\nnovel approach to psychological assessment based on explainable emotional\nprofiles in combination with empathetic conversational models, offering a\npromising tool for augmenting traditional care, particularly where immediate\nexpertise is unavailable. Our work can be divided into two main parts,\nintrinsically connected to each other. 
First, we present RACLETTE, a\nconversational system that demonstrates superior emotional accuracy compared to\nstate-of-the-art benchmarks in both understanding users' emotional states and\ngenerating empathetic responses during conversations, while progressively\nbuilding an emotional profile of the user through their interactions. Second,\nwe show how the emotional profiles of a user can be used as interpretable\nmarkers for mental health assessment. These profiles can be compared with\ncharacteristic emotional patterns associated with different mental disorders,\nproviding a novel approach to preliminary screening and support.\n","authors":["Alessandro De Grandi","Federico Ravenda","Andrea Raballo","Fabio Crestani"],"pdf_url":"https://arxiv.org/pdf/2412.20068v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20061v1","updated":"2024-12-28T07:30:05Z","published":"2024-12-28T07:30:05Z","title":"Comparative Analysis of Listwise Reranking with Large Language Models in\n Limited-Resource Language Contexts","summary":" Large Language Models (LLMs) have demonstrated significant effectiveness\nacross various NLP tasks, including text ranking. This study assesses the\nperformance of large language models (LLMs) in listwise reranking for\nlimited-resource African languages. We compare proprietary models RankGPT3.5,\nRank4o-mini, RankGPTo1-mini and RankClaude-sonnet in cross-lingual contexts.\nResults indicate that these LLMs significantly outperform traditional baseline\nmethods such as BM25-DT in most evaluation metrics, particularly in nDCG@10 and\nMRR@100. 
These findings highlight the potential of LLMs in enhancing reranking\ntasks for low-resource languages and offer insights into cost-effective\nsolutions.\n","authors":["Yanxin Shen","Lun Wang","Chuanqi Shi","Shaoshuai Du","Yiyi Tao","Yixian Shen","Hang Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.20061v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20057v1","updated":"2024-12-28T07:14:55Z","published":"2024-12-28T07:14:55Z","title":"\"My life is miserable, have to sign 500 autographs everyday\": Exposing\n Humblebragging, the Brags in Disguise","summary":" Humblebragging is a phenomenon where individuals present self-promotional\nstatements under the guise of modesty or complaints. For example, a statement\nlike, \"Ugh, I can't believe I got promoted to lead the entire team. So\nstressful!\", subtly highlights an achievement while pretending to be\ncomplaining. Detecting humblebragging is important for machines to better\nunderstand the nuances of human language, especially in tasks like sentiment\nanalysis and intent recognition. However, this topic has not yet been studied\nin computational linguistics. For the first time, we introduce the task of\nautomatically detecting humblebragging in text. We formalize the task by\nproposing a 4-tuple definition of humblebragging and evaluate machine learning,\ndeep learning, and large language models (LLMs) on this task, comparing their\nperformance with humans. We also create and release a dataset called HB24,\ncontaining 3,340 humblebrags generated using GPT-4o. Our experiments show that\ndetecting humblebragging is non-trivial, even for humans. Our best model\nachieves an F1-score of 0.88. 
This work lays the foundation for further\nexploration of this nuanced linguistic phenomenon and its integration into\nbroader natural language understanding systems.\n","authors":["Sharath Naganna","Saprativa Bhattacharjee","Pushpak Bhattacharyya","Biplab Banerjee"],"pdf_url":"https://arxiv.org/pdf/2412.20057v1.pdf","comment":"Under review at ARR"},{"id":"http://arxiv.org/abs/2402.10835v5","updated":"2024-12-28T06:31:51Z","published":"2024-02-16T17:15:28Z","title":"Time Series Forecasting with LLMs: Understanding and Enhancing Model\n Capabilities","summary":" Large language models (LLMs) have been applied in many fields and have\ndeveloped rapidly in recent years. As a classic machine learning task, time\nseries forecasting has recently been boosted by LLMs. Recent works treat large\nlanguage models as \\emph{zero-shot} time series reasoners without further\nfine-tuning, which achieves remarkable performance. However, there are some\nunexplored research problems when applying LLMs for time series forecasting\nunder the zero-shot setting. For instance, the LLMs' preferences for the input\ntime series are less understood. In this paper, by comparing LLMs with\ntraditional time series forecasting models, we observe many interesting\nproperties of LLMs in the context of time series forecasting. First, our study\nshows that LLMs perform well in predicting time series with clear patterns and\ntrends, but face challenges with datasets lacking periodicity. This observation\ncan be explained by the ability of LLMs to recognize the underlying period\nwithin datasets, which is supported by our experiments. In addition, the input\nstrategy is investigated, and it is found that incorporating external knowledge\nand adopting natural language paraphrases substantially improve the predictive\nperformance of LLMs for time series. 
Overall, our study contributes insights\ninto LLMs' advantages and limitations in time series forecasting under\ndifferent conditions.\n","authors":["Hua Tang","Chong Zhang","Mingyu Jin","Qinkai Yu","Zhenting Wang","Xiaobo Jin","Yongfeng Zhang","Mengnan Du"],"pdf_url":"https://arxiv.org/pdf/2402.10835v5.pdf","comment":"Accepted by SIGKDD Explorations Newsletter"},{"id":"http://arxiv.org/abs/2304.09542v3","updated":"2024-12-28T06:20:54Z","published":"2023-04-19T10:16:03Z","title":"Is ChatGPT Good at Search? Investigating Large Language Models as\n Re-Ranking Agents","summary":" Large Language Models (LLMs) have demonstrated remarkable zero-shot\ngeneralization across various language-related tasks, including search engines.\nHowever, existing work utilizes the generative ability of LLMs for Information\nRetrieval (IR) rather than direct passage ranking. The discrepancy between the\npre-training objectives of LLMs and the ranking objective poses another\nchallenge. In this paper, we first investigate generative LLMs such as ChatGPT\nand GPT-4 for relevance ranking in IR. Surprisingly, our experiments reveal\nthat properly instructed LLMs can deliver competitive, even superior results to\nstate-of-the-art supervised methods on popular IR benchmarks. Furthermore, to\naddress concerns about data contamination of LLMs, we collect a new test set\ncalled NovelEval, based on the latest knowledge and aiming to verify the\nmodel's ability to rank unknown knowledge. Finally, to improve efficiency in\nreal-world applications, we delve into the potential for distilling the ranking\ncapabilities of ChatGPT into small specialized models using a permutation\ndistillation scheme. Our evaluation results show that a distilled 440M\nmodel outperforms a 3B supervised model on the BEIR benchmark. 
The code to\nreproduce our results is available at www.github.com/sunnweiwei/RankGPT.\n","authors":["Weiwei Sun","Lingyong Yan","Xinyu Ma","Shuaiqiang Wang","Pengjie Ren","Zhumin Chen","Dawei Yin","Zhaochun Ren"],"pdf_url":"https://arxiv.org/pdf/2304.09542v3.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2412.20043v1","updated":"2024-12-28T06:13:50Z","published":"2024-12-28T06:13:50Z","title":"STAYKATE: Hybrid In-Context Example Selection Combining\n Representativeness Sampling and Retrieval-based Approach -- A Case Study on\n Science Domains","summary":" Large language models (LLMs) demonstrate the ability to learn in-context,\noffering a potential solution for scientific information extraction, which\noften contends with challenges such as insufficient training data and the high\ncost of annotation processes. Given that the selection of in-context examples\ncan significantly impact performance, it is crucial to design a proper method\nto sample effective ones. In this paper, we propose STAYKATE, a\nstatic-dynamic hybrid selection method that combines the principles of\nrepresentativeness sampling from active learning with the prevalent\nretrieval-based approach. The results across three domain-specific datasets\nindicate that STAYKATE outperforms both the traditional supervised methods and\nexisting selection methods. The enhancement in performance is particularly\npronounced for entity types that pose challenges for other methods.\n","authors":["Chencheng Zhu","Kazutaka Shimada","Tomoki Taniguchi","Tomoko Ohkuma"],"pdf_url":"https://arxiv.org/pdf/2412.20043v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20024v1","updated":"2024-12-28T05:01:26Z","published":"2024-12-28T05:01:26Z","title":"BaiJia: A Large Scale Role-Playing Agent Corpus of Chinese Historical\n Characters","summary":" We introduce a comprehensive large-scale role-playing agent corpus, termed\nBaiJia, that comprises various Chinese historical characters. 
This corpus is\nnoteworthy for being the pioneering compilation of low-resource data that can\nbe utilized in large language models (LLMs) to build AI-driven historical\nrole-playing agents. BaiJia addresses the challenge of fragmented\nhistorical textual records in different forms and modalities, integrating\nvarious characters' information, including their biographies, literary works,\nfamily relations, historical events, and so on. We conduct extensive experiments to\ndemonstrate the effectiveness of our BaiJia agent corpus in bolstering the\nrole-playing abilities of various foundational LLMs, and promoting the\ndevelopment and assessment of LLMs in the context of historical role-playing\ntasks. The agent corpus is available at baijia.online.\n","authors":["Ting Bai","Jiazheng Kang","Jiayang Fan"],"pdf_url":"https://arxiv.org/pdf/2412.20024v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20005v1","updated":"2024-12-28T04:01:30Z","published":"2024-12-28T04:01:30Z","title":"OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction\n System","summary":" We introduce OneKE, a dockerized schema-guided knowledge extraction system,\nwhich can extract knowledge from the Web and raw PDF books, and support various\ndomains (science, news, etc.). Specifically, we design OneKE with multiple\nagents and a configurable knowledge base. Different agents perform their\nrespective roles, enabling support for various extraction scenarios. The\nconfigurable knowledge base facilitates schema configuration, error case debugging\nand correction, further improving the performance. Empirical evaluations on\nbenchmark datasets demonstrate OneKE's efficacy, while case studies further\nelucidate its adaptability to diverse tasks across multiple domains,\nhighlighting its potential for broad applications. 
We have open-sourced the\ncode at https://github.com/zjunlp/OneKE and released a video at\nhttp://oneke.openkg.cn/demo.mp4.\n","authors":["Yujie Luo","Xiangyuan Ru","Kangwei Liu","Lin Yuan","Mengshu Sun","Ningyu Zhang","Lei Liang","Zhiqiang Zhang","Jun Zhou","Lanning Wei","Da Zheng","Haofen Wang","Huajun Chen"],"pdf_url":"https://arxiv.org/pdf/2412.20005v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2412.19994v1","updated":"2024-12-28T03:40:25Z","published":"2024-12-28T03:40:25Z","title":"From Generalist to Specialist: A Survey of Large Language Models for\n Chemistry","summary":" Large Language Models (LLMs) have significantly transformed our daily lives\nand established a new paradigm in natural language processing (NLP). However,\nthe predominant pretraining of LLMs on extensive web-based texts remains\ninsufficient for advanced scientific discovery, particularly in chemistry. The\nscarcity of specialized chemistry data, coupled with the complexity of\nmulti-modal data such as 2D graphs, 3D structures and spectra, presents distinct\nchallenges. Although several studies have reviewed Pretrained Language Models\n(PLMs) in chemistry, there is a conspicuous absence of a systematic survey\nspecifically focused on chemistry-oriented LLMs. In this paper, we outline\nmethodologies for incorporating domain-specific chemistry knowledge and\nmulti-modal information into LLMs; we also conceptualize chemistry LLMs as\nagents using chemistry tools and investigate their potential to accelerate\nscientific research. Additionally, we summarize the existing benchmarks for\nevaluating the chemistry abilities of LLMs. Finally, we critically examine the current\nchallenges and identify promising directions for future research. 
Through this\ncomprehensive survey, we aim to assist researchers in staying at the forefront\nof developments in chemistry LLMs and to inspire innovative applications in the\nfield.\n","authors":["Yang Han","Ziping Wan","Lu Chen","Kai Yu","Xin Chen"],"pdf_url":"https://arxiv.org/pdf/2412.19994v1.pdf","comment":"COLING2025,We maintain an up-to-date Github repository at:\n https://github.com/OpenDFM/LLM4Chemistry"},{"id":"http://arxiv.org/abs/2412.19966v1","updated":"2024-12-28T01:29:53Z","published":"2024-12-28T01:29:53Z","title":"Bridging Context Gaps: Enhancing Comprehension in Long-Form Social\n Conversations Through Contextualized Excerpts","summary":" We focus on enhancing comprehension in small-group recorded conversations,\nwhich serve as a medium to bring people together and provide a space for\nsharing personal stories and experiences on crucial social matters. One way to\nparse and convey information from these conversations is by sharing highlighted\nexcerpts in subsequent conversations. This can help promote a collective\nunderstanding of relevant issues, by highlighting perspectives and experiences\nto other groups of people who might otherwise be unfamiliar with and thus\nunable to relate to these experiences. The primary challenge that arises then\nis that excerpts taken from one conversation and shared in another setting\nmight be missing crucial context or key elements that were previously\nintroduced in the original conversation. This problem is exacerbated when\nconversations become lengthier and richer in themes and shared experiences. To\naddress this, we explore how Large Language Models (LLMs) can enrich these\nexcerpts by providing socially relevant context. We present approaches for\neffective contextualization to improve comprehension, readability, and empathy.\nWe show significant improvements in understanding, as assessed through\nsubjective and objective evaluations. 
While LLMs can offer valuable context,\nthey struggle with capturing key social aspects. We release the Human-annotated\nSalient Excerpts (HSE) dataset to support future work. Additionally, we show\nhow context-enriched excerpts can provide more focused and comprehensive\nconversation summaries.\n","authors":["Shrestha Mohanty","Sarah Xuan","Jacob Jobraeel","Anurag Kumar","Deb Roy","Jad Kabbara"],"pdf_url":"https://arxiv.org/pdf/2412.19966v1.pdf","comment":"Accepted at COLING 2025"},{"id":"http://arxiv.org/abs/2402.11414v3","updated":"2024-12-28T01:06:51Z","published":"2024-02-18T01:03:25Z","title":"Fine-grained and Explainable Factuality Evaluation for Multimodal\n Summarization","summary":" Multimodal summarization aims to generate a concise summary based on the\ninput text and image. However, the existing methods may suffer from\nnon-factual output. To evaluate the factuality of multimodal summarization\nmodels, we propose two fine-grained and explainable evaluation frameworks\n(FALLACIOUS) for different application scenarios, i.e., a reference-based\nfactuality evaluation framework and a reference-free factuality evaluation\nframework. Notably, the reference-free factuality evaluation framework does not\nneed ground truth and hence has a wider range of application scenarios. To evaluate\nthe effectiveness of the proposed frameworks, we compute the correlation\nbetween our frameworks and the other metrics. The experimental results show the\neffectiveness of our proposed method. We will release our code and dataset via\nGitHub.\n","authors":["Yue Zhang","Jingxuan Zuo","Liqiang Jing"],"pdf_url":"https://arxiv.org/pdf/2402.11414v3.pdf","comment":"AAAI 2025"},{"id":"http://arxiv.org/abs/2412.20033v1","updated":"2024-12-28T05:47:38Z","published":"2024-12-28T05:47:38Z","title":"Children's Acquisition of Tail-recursion Sequences: A Review of Locative\n Recursion and Possessive Recursion as Examples","summary":" Recursion is a defining property of human natural language. 
Since Chomsky proposed\ngenerative grammar, many scholars have studied recursion either theoretically\nor empirically. However, by observing children's acquisition of tail-recursion\nsequences, we can verify the nativism of language supported by universal\ngrammar and reveal the cognitive mechanisms of the human brain. To date, our\nunderstanding of children's acquisition path of recursion and its influencing\nfactors remains controversial. This systematic review summarizes the\nresearch on tail-recursion sequences, taking possessive recursion and locative\nrecursion as examples, focusing on the experimental methods, acquisition paths,\nand influencing factors of tail-recursion sequences. The current behavioural\nexperiments reveal that the debate about children's performance revolves\naround: 1) gradual acquisition versus synchronous acquisition, and 2) symmetry versus\nasymmetry between the acquisition of locative recursion sequences and\npossessive recursion sequences. We presume that children can acquire recursion\nin a short period of time thanks to the language acquisition device,\nthough there are also scholars who believe that a third factor also plays a\nrole.\n","authors":["Xiaoyi Wang","Chenxi Fu","Caimei Yang","Ziman Zhuang"],"pdf_url":"https://arxiv.org/pdf/2412.20033v1.pdf","comment":"32 pages, 5 figures"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2412.03557v2","updated":"2024-12-28T18:54:29Z","published":"2024-12-04T18:52:32Z","title":"Freshness and Informativity Weighted Cognitive Extent and Its\n Correlation with Cumulative Citation Count","summary":" In this paper, we revisit cognitive extent, originally defined as the number\nof unique phrases in a quota. We introduce Freshness and Informativity Weighted\nCognitive Extent (FICE), calculated based on two novel weighting factors, the\nlifetime ratio and informativity of scientific entities. 
We model the lifetime\nof each scientific entity as the time-dependent document frequency, which is\nfit by the composition of multiple Gaussian profiles. The lifetime ratio is\nthen calculated as the cumulative document frequency at the publication time\n$t_0$ divided by the cumulative document frequency over its entire lifetime.\nThe informativity is calculated by normalizing the document frequency across\nall scientific entities recognized in a title. Using the ACL Anthology, we\nverified the trend formerly observed in several other domains that the number\nof unique scientific entities per quota increased gradually at a slower rate.\nWe found that FICE exhibits a strong correlation with the average cumulative\ncitation count within a quota. Our code is available at\n\\href{https://github.com/ZiheHerzWang/Freshness-and-Informativity-Weighted-Cognitive-Extent}{https://github.com/ZiheHerzWang/Freshness-and-Informativity-Weighted-Cognitive-Extent}\n","authors":["Zihe Wang","Jian Wu"],"pdf_url":"https://arxiv.org/pdf/2412.03557v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.20163v1","updated":"2024-12-28T14:27:45Z","published":"2024-12-28T14:27:45Z","title":"Topic-Aware Knowledge Graph with Large Language Models for\n Interoperability in Recommender Systems","summary":" The use of knowledge graphs in recommender systems has become one of the\ncommon approaches to addressing data sparsity and cold start problems. Recent\nadvances in large language models (LLMs) offer new possibilities for processing\nside and context information within knowledge graphs. However, consistent\nintegration across various systems remains challenging due to the need for\ndomain expert intervention and differences in system characteristics. To\naddress these issues, we propose a consistent approach that extracts both\ngeneral and specific topics from both side and context information using LLMs.\nFirst, general topics are iteratively extracted and updated from side\ninformation. 
Then, specific topics are extracted using context information.\nFinally, to address synonymous topics generated during the specific topic\nextraction process, a refining algorithm processes and resolves these issues\neffectively. This approach allows general topics to capture broad knowledge\nacross diverse item characteristics, while specific topics emphasize detailed\nattributes, providing a more comprehensive understanding of the semantic\nfeatures of items and the preferences of users. Experimental results\ndemonstrate significant improvements in recommendation performance across\ndiverse knowledge graphs.\n","authors":["Minhye Jeon","Seokho Ahn","Young-Duk Seo"],"pdf_url":"https://arxiv.org/pdf/2412.20163v1.pdf","comment":"Accepted by The 40th ACM/SIGAPP Symposium On Applied Computing (SAC)\n 2025"},{"id":"http://arxiv.org/abs/2412.08066v2","updated":"2024-12-28T06:27:42Z","published":"2024-12-11T03:22:04Z","title":"Cluster-Enhanced Federated Graph Neural Network for Recommendation","summary":" Personal interaction data can be effectively modeled as individual graphs for\neach user in recommender systems. Graph Neural Network (GNN)-based\nrecommendation techniques have become extremely popular since they can capture\nhigh-order collaborative signals between users and items by aggregating\nindividual graphs into a global interaction graph. However, this centralized\napproach inherently poses a threat to user privacy and security. Recently,\nfederated GNN-based recommendation techniques have emerged as a promising\nsolution to mitigate privacy concerns. Nevertheless, current implementations\neither limit on-device training to isolated individual graphs or\nnecessitate reliance on an additional third-party server to access other individual\ngraphs, which also increases the risk of privacy leakage. 
To address this\nchallenge, we propose a Cluster-enhanced Federated Graph Neural Network\nframework for Recommendation, named CFedGR, which introduces high-order\ncollaborative signals to augment individual graphs in a privacy preserving\nmanner. Specifically, the server clusters the pretrained user representations\nto identify high-order collaborative signals. In addition, two efficient\nstrategies are devised to reduce communication between devices and the server.\nExtensive experiments on three benchmark datasets validate the effectiveness of\nour proposed methods.\n","authors":["Haiyan Wang","Ye Yuan"],"pdf_url":"https://arxiv.org/pdf/2412.08066v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.09542v3","updated":"2024-12-28T06:20:54Z","published":"2023-04-19T10:16:03Z","title":"Is ChatGPT Good at Search? Investigating Large Language Models as\n Re-Ranking Agents","summary":" Large Language Models (LLMs) have demonstrated remarkable zero-shot\ngeneralization across various language-related tasks, including search engines.\nHowever, existing work utilizes the generative ability of LLMs for Information\nRetrieval (IR) rather than direct passage ranking. The discrepancy between the\npre-training objectives of LLMs and the ranking objective poses another\nchallenge. In this paper, we first investigate generative LLMs such as ChatGPT\nand GPT-4 for relevance ranking in IR. Surprisingly, our experiments reveal\nthat properly instructed LLMs can deliver competitive, even superior results to\nstate-of-the-art supervised methods on popular IR benchmarks. Furthermore, to\naddress concerns about data contamination of LLMs, we collect a new test set\ncalled NovelEval, based on the latest knowledge and aiming to verify the\nmodel's ability to rank unknown knowledge. 
Finally, to improve efficiency in\nreal-world applications, we delve into the potential for distilling the ranking\ncapabilities of ChatGPT into small specialized models using a permutation\ndistillation scheme. Our evaluation results show that a distilled 440M\nmodel outperforms a 3B supervised model on the BEIR benchmark. The code to\nreproduce our results is available at www.github.com/sunnweiwei/RankGPT.\n","authors":["Weiwei Sun","Lingyong Yan","Xinyu Ma","Shuaiqiang Wang","Pengjie Ren","Zhumin Chen","Dawei Yin","Zhaochun Ren"],"pdf_url":"https://arxiv.org/pdf/2304.09542v3.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2412.20040v1","updated":"2024-12-28T06:12:02Z","published":"2024-12-28T06:12:02Z","title":"A Contrastive Pretrain Model with Prompt Tuning for Multi-center\n Medication Recommendation","summary":" Medication recommendation is one of the most critical health-related\napplications, which has attracted extensive research interest recently. Most\nexisting works focus on a single hospital with abundant medical data. However,\nmany small hospitals only have a few records, which hinders applying existing\nmedication recommendation works to the real world. Thus, we seek to explore a\nmore practical setting, i.e., multi-center medication recommendation. In this\nsetting, most hospitals have few records, but the total number of records is\nlarge. Though small hospitals may benefit from the abundant pooled records, they\nalso face the challenge that the data distributions across various\nhospitals differ considerably. In this work, we introduce a novel conTrastive\nprEtrain Model with Prompt Tuning (TEMPT) for multi-center medication\nrecommendation, which includes two stages of pretraining and finetuning. We\nfirst design two self-supervised tasks for the pretraining stage to learn\ngeneral medical knowledge. 
They are mask prediction and contrastive tasks,\nwhich extract the intra- and inter-relationships of input diagnoses and\nprocedures. Furthermore, we devise a novel prompt tuning method to capture the\nspecific information of each hospital rather than adopting common\nfinetuning. On the one hand, the proposed prompt tuning can better learn the\nheterogeneity of each hospital to fit various distributions. On the other hand,\nit can also relieve the catastrophic forgetting problem of finetuning. To\nvalidate the proposed model, we conduct extensive experiments on the public\neICU, a multi-center medical dataset. The experimental results illustrate the\neffectiveness of our model. The implementation code is available to ease\nreproducibility at https://github.com/Applied-Machine-Learning-Lab/TEMPT.\n","authors":["Qidong Liu","Zhaopeng Qiu","Xiangyu Zhao","Xian Wu","Zijian Zhang","Tong Xu","Feng Tian"],"pdf_url":"https://arxiv.org/pdf/2412.20040v1.pdf","comment":"accepted by TOIS"},{"id":"http://arxiv.org/abs/2410.10381v2","updated":"2024-12-28T06:07:17Z","published":"2024-10-14T11:10:15Z","title":"Collaborative filtering based on nonnegative/binary matrix factorization","summary":" Collaborative filtering generates recommendations based on user-item\nsimilarities through rating data, which may involve numerous unrated items. To\npredict scores for unrated items, matrix factorization techniques, such as\nnonnegative matrix factorization (NMF), are often employed.\nNonnegative/binary matrix factorization (NBMF), which is an\nextension of NMF, approximates a nonnegative matrix as the product of\nnonnegative and binary matrices. Previous studies have employed NBMF for image\nanalysis where the data were dense. 
In this paper, we propose a modified NBMF\nalgorithm that can be applied to collaborative filtering where data are sparse.\nIn the modified method, unrated elements in a rating matrix are masked, which\nimproves the collaborative filtering performance. Utilizing a low-latency Ising\nmachine in NBMF is advantageous in terms of computation time, making the\nproposed method practical.\n","authors":["Yukino Terui","Yuka Inoue","Yohei Hamakawa","Kosuke Tatsumura","Kazue Kudo"],"pdf_url":"https://arxiv.org/pdf/2410.10381v2.pdf","comment":"14 pages, 7 figures"},{"id":"http://arxiv.org/abs/2412.20036v1","updated":"2024-12-28T05:57:02Z","published":"2024-12-28T05:57:02Z","title":"Invariant debiasing learning for recommendation via biased imputation","summary":" Previous debiasing studies utilize unbiased data to supervise model\ntraining. They suffer from the high trial risks and experimental costs of\nobtaining unbiased data. Recent research attempts to use invariant learning to\ndetach the invariant preference of users for unbiased recommendations in an\nunsupervised way. However, it faces the drawbacks of low model accuracy and\nunstable prediction performance due to the loss of cooperation with variant\npreference. In this paper, we experimentally demonstrate that invariant\nlearning causes information loss by directly discarding the variant\ninformation, which reduces the generalization ability and results in the\ndegradation of model performance in unbiased recommendations. Based on this\nconsideration, we propose a novel lightweight knowledge distillation framework\n(KDDebias) to automatically learn the unbiased preference of users from both\ninvariant and variant information. Specifically, the variant information is\nimputed to the invariant user preference in the distance-aware knowledge\ndistillation process. 
Extensive experiments on three public datasets, i.e.,\nYahoo!R3, Coat, and MIND, show that with the biased imputation from the variant\npreference of users, our proposed method achieves significant improvements with\nless than 50% learning parameters compared to the SOTA unsupervised debiasing\nmodel in recommender systems. Our code are publicly available at\nhttps://github.com/BAI-LAB/KD-Debias.\n","authors":["Ting Bai","Weijie Chen","Cheng Yang","Chuan Shi"],"pdf_url":"https://arxiv.org/pdf/2412.20036v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.00847v3","updated":"2024-12-28T05:14:14Z","published":"2024-09-01T21:30:14Z","title":"The Design of an LLM-powered Unstructured Analytics System","summary":" LLMs demonstrate an uncanny ability to process unstructured data, and as\nsuch, have the potential to go beyond search and run complex, semantic analyses\nat scale. We describe the design of an unstructured analytics system, Aryn, and\nthe tenets and use cases that motivate its design. With Aryn, users specify\nqueries in natural language and the system automatically determines a semantic\nplan and executes it to compute an answer from a large collection of\nunstructured documents. At the core of Aryn is Sycamore, a declarative document\nprocessing engine, that provides a reliable distributed abstraction called\nDocSets. Sycamore allows users to analyze, enrich, and transform complex\ndocuments at scale. Aryn includes Luna, a query planner that translates natural\nlanguage queries to Sycamore scripts, and DocParse, which takes raw PDFs and\ndocument images, and converts them to DocSets for downstream processing. We\nshow how these pieces come together to achieve better accuracy than RAG on\nanalytics queries over real world reports from the National Transportation\nSafety Board (NTSB). 
Also, given current limitations of LLMs, we argue that an\nanalytics system must provide explainability to be practical, and show how\nAryn's user interface does this to help build trust.\n","authors":["Eric Anderson","Jonathan Fritz","Austin Lee","Bohou Li","Mark Lindblad","Henry Lindeman","Alex Meyer","Parth Parmar","Tanvi Ranade","Mehul A. Shah","Benjamin Sowell","Dan Tecuci","Vinayak Thapliyal","Matt Welsh"],"pdf_url":"https://arxiv.org/pdf/2409.00847v3.pdf","comment":"Included in the proceedings of The Conference on Innovative Data\n Systems Research (CIDR) 2025"},{"id":"http://arxiv.org/abs/2412.20005v1","updated":"2024-12-28T04:01:30Z","published":"2024-12-28T04:01:30Z","title":"OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction\n System","summary":" We introduce OneKE, a dockerized schema-guided knowledge extraction system,\nwhich can extract knowledge from the Web and raw PDF Books, and support various\ndomains (science, news, etc.). Specifically, we design OneKE with multiple\nagents and a configure knowledge base. Different agents perform their\nrespective roles, enabling support for various extraction scenarios. The\nconfigure knowledge base facilitates schema configuration, error case debugging\nand correction, further improving the performance. Empirical evaluations on\nbenchmark datasets demonstrate OneKE's efficacy, while case studies further\nelucidate its adaptability to diverse tasks across multiple domains,\nhighlighting its potential for broad applications. 
We have open-sourced the\nCode at https://github.com/zjunlp/OneKE and released a Video at\nhttp://oneke.openkg.cn/demo.mp4.\n","authors":["Yujie Luo","Xiangyuan Ru","Kangwei Liu","Lin Yuan","Mengshu Sun","Ningyu Zhang","Lei Liang","Zhiqiang Zhang","Jun Zhou","Lanning Wei","Da Zheng","Haofen Wang","Huajun Chen"],"pdf_url":"https://arxiv.org/pdf/2412.20005v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2412.20211v1","updated":"2024-12-28T16:48:55Z","published":"2024-12-28T16:48:55Z","title":"Generative Regression Based Watch Time Prediction for Video\n Recommendation: Model and Performance","summary":" Watch time prediction (WTP) has emerged as a pivotal task in short video\nrecommendation systems, designed to encapsulate user interests. Predicting\nusers' watch times on videos often encounters challenges, including wide value\nranges and imbalanced data distributions, which can lead to significant bias\nwhen directly regressing watch time. Recent studies have tried to tackle these\nissues by converting the continuous watch time estimation into an ordinal\nclassification task. While these methods are somewhat effective, they exhibit\nnotable limitations. Inspired by language modeling, we propose a novel\nGenerative Regression (GR) paradigm for WTP based on sequence generation. This\napproach employs structural discretization to enable the lossless\nreconstruction of original values while maintaining prediction fidelity. By\nformulating the prediction problem as a numerical-to-sequence mapping, and with\nmeticulously designed vocabulary and label encodings, each watch time is\ntransformed into a sequence of tokens. To expedite model training, we introduce\nthe curriculum learning with an embedding mixup strategy which can mitigate\ntraining-and-inference inconsistency associated with teacher forcing. We\nevaluate our method against state-of-the-art approaches on four public datasets\nand one industrial dataset. 
We also perform online A/B testing on Kuaishou, a\nleading video app with about 400 million DAUs, to demonstrate the real-world\nefficacy of our method. The results conclusively show that GR outperforms\nexisting techniques significantly. Furthermore, we successfully apply GR to\nanother regression task in recommendation systems, i.e., Lifetime Value (LTV)\nprediction, which highlights its potential as a novel and effective solution to\ngeneral regression challenges.\n","authors":["Hongxu Ma","Kai Tian","Tao Zhang","Xuefeng Zhang","Chunjie Chen","Han Li","Jihong Guan","Shuigeng Zhou"],"pdf_url":"https://arxiv.org/pdf/2412.20211v1.pdf","comment":"10 pages, 5 figures, conference or other essential info"}],"Multimedia":[{"id":"http://arxiv.org/abs/2311.05844v2","updated":"2024-12-28T12:31:01Z","published":"2023-09-25T13:46:00Z","title":"Face-StyleSpeech: Enhancing Zero-shot Speech Synthesis from Face Images\n with Improved Face-to-Speech Mapping","summary":" Generating speech from a face image is crucial for developing virtual humans\ncapable of interacting using their unique voices, without relying on\npre-recorded human speech. In this paper, we propose Face-StyleSpeech, a\nzero-shot Text-To-Speech (TTS) synthesis model that generates natural speech\nconditioned on a face image rather than reference speech. We hypothesize that\nlearning entire prosodic features from a face image poses a significant\nchallenge. To address this, our TTS model incorporates both face and prosody\nencoders. 
The prosody encoder is specifically designed to model speech style\ncharacteristics that are not fully captured by the face image, allowing the\nface encoder to focus on extracting speaker-specific features such as timbre.\nExperimental results demonstrate that Face-StyleSpeech effectively generates\nmore natural speech from a face image than baselines, even for unseen faces.\nSamples are available on our demo page.\n","authors":["Minki Kang","Wooseok Han","Eunho Yang"],"pdf_url":"https://arxiv.org/pdf/2311.05844v2.pdf","comment":"Accepted by ICASSP 2025"}]}}
\ No newline at end of file
diff --git a/favicon.ico b/favicon.ico
new file mode 100644
index 0000000..7f5166c
Binary files /dev/null and b/favicon.ico differ
diff --git a/index.css b/index.css
new file mode 100644
index 0000000..9ded9d9
--- /dev/null
+++ b/index.css
@@ -0,0 +1,355 @@
+:root {
+ /* Palette: Nord (https://www.nordtheme.com) */
+ --nord00: #2e3440;
+ --nord01: #3b4252;
+ --nord02: #434c5e;
+ --nord03: #4c566a;
+ --nord04: #d8dee9;
+ --nord05: #e5e9f0;
+ --nord06: #eceff4;
+ --nord07: #8fbcbb;
+ --nord08: #88c0d0;
+ --nord09: #81a1c1;
+ --nord0A: #5e81ac;
+ --nord0B: #bf616a;
+ --nord0C: #d08770;
+ --nord0D: #ebcb8b;
+ --nord0E: #a3be8c;
+ --nord0F: #b48ead;
+
+
+ /* Typography */
+ --font-family-default: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen-Sans, Ubuntu, Cantarell, "Helvetica Neue",
+ sans-serif;
+ --font-size-scaler: 62.5%;
+ --font-size-m: 1.6rem;
+ --font-size-s: 1.4rem;
+
+ /* Components */
+ --body-color: var(--nord06);
+ --body-bg: var(--nord00);
+
+ --header-title: var(--nord06);
+ --header-container: var(--nord00);
+ --header-title-preffix: var(--nord0F);
+
+ --chip-font: var(--nord08);
+ --chip-color: var(--nord0B);
+
+ --icons: var(--nord06);
+ --icons-hover: var(--nord0F);
+
+ --day-container: var(--nord01);
+ --date: var(--nord09);
+
+ --summary: var(--nord0E);
+ --summary-hover: var(--nord0F);
+
+ --details-open: var(--nord02);
+ --details-content: var(--nord05);
+ --details-a: var(--nord07);
+ --details-a-hover: var(--nord0F);
+
+ --highlight-title: var(--nord0B);
+ --highlight-author: var(--nord0B);
+
+ --article-summary-hover-color: var(--nord0D);
+ --article-summary-color: var(--nord04);
+
+ --article-title-color: var(--nord05);
+ --article-title-hover-color: var(--nord0E);
+
+ --accordion-content-rail-color: var(--nord01);
+ --accordion-content-hover-rail-color: var(--nord0D);
+ --accordion-title-marker-color: var(--nord01);
+ --accordion-title-hover-marker-color: var(--nord0E);
+
+ --footer-color: var(--nord04);
+ --footer-link-hover-color: var(--nord0D);
+}
+
+[data-theme="light"] {
+ /* Theme design */
+
+ --color-primary: var(--nord07);
+ --color-primary-second: var(--nord00);
+ --color-info: var(--nord0A);
+ --color-success: var(--nord0E);
+ --color-warning: var(--nord0C);
+ --color-danger: var(--nord0B);
+
+ --color-text: var(--nord00);
+ --color-hover: var(--nord0D);
+ --color-shadow: var(--nord03);
+
+ --color-primary-h: var(--nord09);
+ --color-primary-s: var(--nord08);
+ --color-primary-l: var(--nord07);
+
+ --color-contrast-higher-h: var(--nord01);
+ --color-contrast-higher-l: var(--nord02);
+ --color-contrast-higher-s: var(--nord03);
+
+ --color-content: white;
+
+ --background: var(--nord06);
+ --background-content: var(--nord05);
+ --background-color: var(--nord04);
+
+ /* Components */
+
+ --chip-font: var(--nord06);
+ --chip-color: var(--nord09);
+
+ --body-color: var(--background-color);
+ --body-bg: var(--background);
+
+ --header-title: var(--color-shadow);
+ --header-container: var(--background);
+ --header-title-preffix: var(--color-primary-h);
+
+ --icons: var(--color-shadow);
+ --icons-hover: var(--color-hover);
+
+ --day-container: var(--background-content);
+ --date: var(--color-primary-l);
+
+ --summary: var(--color-info);
+ --summary-hover: var(--color-success);
+
+ --details-open: var(--color-content);
+ --details-content: var(--color-text);
+ --details-a: var(--color-primary-h);
+ --details-a-hover: var(--color-hover);
+
+ --highlight-title: var(--color-danger);
+ --highlight-author: var(--color-warning);
+
+ --article-summary-color: var(--color-text);
+ --article-summary-hover-color: var(--color-primary-s);
+
+ --article-title-color: var(--color-primary);
+ --article-title-hover-color: var(--color-success);
+
+ --accordion-content-rail-color: var(--color-warning);
+ --accordion-content-hover-rail-color: var(--color-warning);
+ --accordion-title-marker-color: var(--color-success);
+ --accordion-title-hover-marker-color: var(--color-success);
+
+ --footer-color: var(--color-text);
+ --footer-link-hover-color: var(--color-hover);
+}
+
+html {
+ font-size: var(--font-size-scaler);
+}
+
+body {
+ background-color: var(--body-bg);
+ font-family: var(--font-family-default);
+ color: var(--body-color);
+ margin: 0;
+ padding-top: 16px;
+ display: grid;
+}
+
+.header-container {
+ width: 90%;
+ max-width: 1200px;
+ background: var(--header-container);
+ margin: 0 auto;
+}
+
+.header-title {
+ font-size: 32px;
+ font-weight: bold;
+ color: var(--header-title);
+ margin: 0;
+ padding-bottom: 14px;
+}
+
+.header-title-preffix {
+ color: var(--header-title-preffix);
+}
+
+.icons {
+ color: var(--icons);
+ padding-bottom: 16px;
+}
+
+.icons a {
+ color: var(--icons);
+ text-decoration: none;
+}
+
+.icons a:hover {
+ color: var(--icons-hover);
+}
+
+.day-container {
+ padding: 16px;
+ background: var(--day-container);
+ width: 90%;
+ max-width: 1200px;
+ margin: 0 auto;
+ margin-bottom: 8px;
+ border-radius: 10px;
+}
+
+.date {
+ font-size: 24px;
+ font-weight: 700;
+ margin: 0;
+ color: var(--date);
+}
+
+p {
+ margin: 0;
+}
+
+summary {
+ font-weight: 600;
+ color: var(--summary);
+}
+
+summary:hover {
+ text-decoration: underline;
+ cursor: pointer;
+ color: var(--summary-hover);
+}
+
+details {
+ --border-color: transparent;
+
+ padding: 2px 4px;
+ font-size: 20px;
+ border: 1px solid var(--border-color);
+ border-radius: 4px;
+}
+
+details[open] {
+ background-color: var(--details-open);
+ margin-bottom: 8px;
+}
+
+.details-content {
+ padding: 12px 3px;
+ gap: 16px;
+ color: var(--details-content);
+}
+
+details a {
+ color: var(--details-a);
+}
+
+details a:hover {
+ color: var(--details-a-hover);
+}
+
+footer {
+ margin: 0 auto;
+ color: var(--footer-color);
+ font-size: var(--font-size-s);
+ display: flex;
+ padding: 0 16px;
+ justify-content: space-between;
+}
+
+.description {
+ margin: 0 auto;
+ color: var(--footer-color);
+ font-size: var(--font-size-s);
+ display: flex;
+ padding: 0 16px;
+ text-align: center;
+}
+
+.highlight-author {
+ color: var(--highlight-author);
+ font-weight: bold;
+}
+
+.highlight-title {
+ color: var(--highlight-title);
+ font-weight: bold;
+}
+
+.channel-description {
+ text-align: center;
+ font-size: var(--font-size-scaler);
+}
+
+.article-summary-link {
+ color: var(--article-summary-color);
+ font-size: var(--font-size-s);
+ text-decoration: none;
+}
+
+.article-summary-link:hover {
+ color: var(--article-summary-hover-color);
+ --accordion-content-rail-color: var(--accordion-content-hover-rail-color);
+}
+
+.article-summary-box-outer {
+ display: block;
+ padding: 4px 8px 8px 4px;
+}
+
+.article-summary-box-inner {
+ padding-left: 8px;
+ border-left: 1px solid var(--accordion-content-rail-color);
+ font-size: var(--font-size-m);
+}
+
+.article-expander {
+ padding: 10px 4px;
+ border-radius: 4px;
+}
+
+.article-authors {
+ font-size: var(--font-size-m);
+ padding: 0.25em 1em;
+}
+
+.article-authors a {
+ text-decoration: none;
+}
+
+.article-expander-title {
+ font-size: var(--font-size-m);
+ font-weight: 600;
+}
+
+.article-expander-title:hover {
+ cursor: pointer;
+}
+
+.article-expander-title::marker {
+ color: var(--accordion-title-marker-color);
+}
+
+.article-expander-title:hover::marker {
+ color: var(--accordion-title-hover-marker-color);
+}
+
+/* for switcher */
+.theme-switch {
+ display: inline-block;
+ position: relative;
+}
+
+.theme-switch input {
+ display: none;
+}
+
+/* chip */
+.chip {
+ font-size: 90%;
+ align-items: center;
+ color: var(--chip-font);
+ background: var(--chip-color);
+ border-radius: 5rem;
+ display: inline-flex;
+ padding: .2rem .4rem;
+ vertical-align: middle;
+}
\ No newline at end of file
diff --git a/index.html b/index.html
new file mode 100644
index 0000000..02b6446
--- /dev/null
+++ b/index.html
@@ -0,0 +1,26415 @@
+
+
+
+
+ MyArxiv
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MyArxiv
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Computation and Language 70
+
+
+
+
+
+ ☆ Distributed Mixture-of-Agents for Edge Inference with Large Language
+ Models
+
+
+ Mixture-of-Agents (MoA) has recently been proposed as a method to enhance
+performance of large language models (LLMs), enabling multiple individual LLMs
+to work together for collaborative inference. This collaborative approach
+results in improved responses to user prompts compared to relying on a single
+LLM. In this paper, we consider such an MoA architecture in a distributed
+setting, where LLMs operate on individual edge devices, each uniquely
+associated with a user and equipped with its own distributed computing power.
+These devices exchange information using decentralized gossip algorithms,
+allowing different device nodes to talk without the supervision of a
+centralized server. In the considered setup, different users have their own LLM
+models to address user prompts. Additionally, the devices gossip either their
+own user-specific prompts or augmented prompts to generate more refined answers
+to certain queries. User prompts are temporarily stored in the device queues
+when their corresponding LLMs are busy. Given the memory limitations of edge
+devices, it is crucial to ensure that the average queue sizes in the system
+remain bounded. In this paper, we address this by theoretically calculating the
+queuing stability conditions for the device queues under reasonable
+assumptions, which we validate experimentally as well. Further, we demonstrate
+through experiments, leveraging open-source LLMs for the implementation of
+distributed MoA, that certain MoA configurations produce higher-quality
+responses compared to others, as evaluated on AlpacaEval 2.0 benchmark. The
+implementation is available at:
+https://github.com/purbeshmitra/distributed_moa.
+
+
+
+
+
+
+
+ ☆ HumanEval Pro and MBPP Pro: Evaluating Large Language Models on
+ Self-invoking Code Generation
+
+
+ We introduce self-invoking code generation, a new task designed to evaluate
+the progressive reasoning and problem-solving capabilities of LLMs. In this
+task, models are presented with a base problem and a related, more complex
+problem. They must solve the base problem and then utilize its solution to
+address the more complex one. This work features three key contributions.
+First, we propose a general recipe for generating more challenging versions of
+existing benchmarks, resulting in three new benchmarks: HumanEval Pro, MBPP
+Pro, and BigCodeBench-Lite Pro, specifically designed to assess LLMs on
+self-invoking code generation. Second, from the analysis of experimental
+results over twenty LLMs on our benchmarks, we have two important observations:
+(i) Most LLMs excel in traditional code generation benchmarks like HumanEval
+and MBPP, but their performance declines on self-invoking tasks. For example,
+o1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro.
+(ii) On the self-invoking code generation task, the instruction-tuned models
+demonstrate only marginal improvements compared to the base models. Third, we
+disclose the types of failure modes that exist in our evaluation results. All
+these results underscore the need for further advancements in self-invoking
+code generation tasks and provide a new direction for future research on
+enhancing LLMs' code reasoning capabilities.
+
+
+
+
+
+
+
+ ☆ Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
+
+
+ The remarkable performance of models like the OpenAI o1 can be attributed to
+their ability to emulate human-like long-time thinking during inference. These
+models employ extended chain-of-thought (CoT) processes, exploring multiple
+strategies to enhance problem-solving capabilities. However, a critical
+question remains: how to intelligently and efficiently scale computational
+resources during testing? This paper presents the first comprehensive study on
+the prevalent issue of overthinking in these models, where excessive
+computational resources are allocated for simple problems with minimal benefit.
+We introduce novel efficiency metrics from both outcome and process
+perspectives to evaluate the rational use of computational resources by o1-like
+models. Using a self-training paradigm, we propose strategies to mitigate
+overthinking, streamlining reasoning processes without compromising accuracy.
+Experimental results show that our approach successfully reduces computational
+overhead while preserving model performance across a range of testsets with
+varying difficulty levels, such as GSM8K, MATH500, GPQA, and AIME.
+
+
+
+ comment: Work in progress
+
+
+
+
+
+
+ ☆ Two-component spatiotemporal template for activation-inhibition of
+ speech in ECoG
+
+
+ I compute the average trial-by-trial power of band-limited speech activity
+across epochs of multi-channel high-density electrocorticography (ECoG)
+recorded from multiple subjects during a consonant-vowel speaking task. I show
+that previously seen anti-correlations of average beta frequency activity
+(12-35 Hz) to high-frequency gamma activity (70-140 Hz) during speech movement
+are observable between individual ECoG channels in the sensorimotor cortex
+(SMC). With this I fit a variance-based model using principal component
+analysis to the band-powers of individual channels of session-averaged ECoG
+data in the SMC and project SMC channels onto their lower-dimensional principal
+components.
+ Spatiotemporal relationships between speech-related activity and principal
+components are identified by correlating the principal components of both
+frequency bands to individual ECoG channels over time using windowed
+correlation. Correlations of principal component areas to sensorimotor areas
+reveal a distinct two-component activation-inhibition-like representation for
+speech that resembles distinct local sensorimotor areas recently shown to have
+complex interplay in whole-body motor control, inhibition, and posture. Notably
+the third principal component shows insignificant correlations across all
+subjects, suggesting two components of ECoG are sufficient to represent SMC
+activity during speech movement.
+
+
+
+
+
+
+
+ ☆ Aviary: training language agents on challenging scientific tasks
+
+
+
+
+
+
+
+
+ Siddharth Narayanan, James D. Braza, Ryan-Rhys Griffiths, Manu Ponnapati, Albert Bou, Jon Laurent, Ori Kabeli, Geemi Wellawatte, Sam Cox, Samuel G. Rodriques, Andrew D. White
+
+
+ Solving complex real-world tasks requires cycles of actions and observations.
+This is particularly true in science, where tasks require many cycles of
+analysis, tool use, and experimentation. Language agents are promising for
+automating intellectual tasks in science because they can interact with tools
+via natural language or code. Yet their flexibility creates conceptual and
+practical challenges for software implementations, since agents may comprise
+non-standard components such as internal reasoning, planning, tool usage, as
+well as the inherent stochasticity of temperature-sampled language models.
+Here, we introduce Aviary, an extensible gymnasium for language agents. We
+formalize agents as policies solving language-grounded partially observable
+Markov decision processes, which we term language decision processes. We then
+implement five environments, including three challenging scientific
+environments: (1) manipulating DNA constructs for molecular cloning, (2)
+answering research questions by accessing scientific literature, and (3)
+engineering protein stability. These environments were selected for their focus
+on multi-step reasoning and their relevance to contemporary biology research.
+Finally, with online training and scaling inference-time compute, we show that
+language agents backed by open-source, non-frontier LLMs can match and exceed
+both frontier LLM agents and human experts on multiple tasks at up to 100x
+lower inference cost.
+
+
+
+
+
+
+
+ ☆ Facilitating large language model Russian adaptation with Learned
+ Embedding Propagation
+
+
+ Rapid advancements of large language model (LLM) technologies led to the
+introduction of powerful open-source instruction-tuned LLMs that have the same
+text generation quality as the state-of-the-art counterparts such as GPT-4.
+While the emergence of such models accelerates the adoption of LLM technologies
+in sensitive-information environments, the authors of such models do not
+disclose the training data necessary to replicate the results, thus making the
+achievements model-exclusive. Since those open-source models are also
+multilingual, this in turn reduces the benefits of training language-specific
+LLMs, as improved inference computation efficiency becomes the only guaranteed
+advantage of such a costly procedure. More cost-efficient options, such as
+vocabulary extension and subsequent continued pre-training, are also inhibited
+by the lack of access to high-quality instruction-tuning data, since it is the
+major factor behind the resulting LLM's task-solving capabilities. To address
+these limitations and cut the costs of the language adaptation pipeline, we
+propose Learned Embedding Propagation (LEP). Unlike existing approaches, our
+method has lower training data requirements due to its minimal impact on
+existing LLM knowledge, which we reinforce with a novel ad-hoc embedding
+propagation procedure that allows us to skip the instruction-tuning step and
+instead implant the new
+language knowledge directly into any existing instruct-tuned variant. We
+evaluated four Russian vocabulary adaptations for LLaMa-3-8B and Mistral-7B,
+showing that LEP is competitive with traditional instruction-tuning methods,
+achieving performance comparable to OpenChat 3.5 and LLaMa-3-8B-Instruct, with
+further improvements via self-calibration and continued tuning enhancing
+task-solving capabilities.
+
+
+
+ comment: Preprint version of an article published in the Journal of Language
+ and Education. Copyright held by the owner/author(s). Publication rights
+ licensed to the Journal of Language and Education
+
+
+
+
+
+
+ ☆ Training Software Engineering Agents and Verifiers with SWE-Gym
+
+
+ We present SWE-Gym, the first environment for training real-world software
+engineering (SWE) agents. SWE-Gym contains 2,438 real-world Python task
+instances, each comprising a codebase with an executable runtime environment,
+unit tests, and a task specified in natural language. We use SWE-Gym to train
+language model-based SWE agents, achieving up to 19% absolute gains in resolve
+rate on the popular SWE-Bench Verified and Lite test sets. We also experiment
+with inference-time scaling through verifiers trained on agent trajectories
+sampled from SWE-Gym. When combined with our fine-tuned SWE agents, we achieve
+32.0% and 26.0% on SWE-Bench Verified and Lite, respectively, reflecting a new
+state-of-the-art for open-weight SWE agents. To facilitate further research, we
+publicly release SWE-Gym, models, and agent trajectories.
+
+
+
+ comment: Code at https://github.com/SWE-Gym/SWE-Gym
+
+
+
+
+
+
+ ☆ Exploring and Controlling Diversity in LLM-Agent Conversation AAAI 2025
+
+
+ Diversity is a critical aspect of multi-agent communication. In this paper,
+we focus on controlling and exploring diversity in the context of open-domain
+multi-agent conversations, particularly for world simulation applications. We
+propose Adaptive Prompt Pruning (APP), a novel method that dynamically adjusts
+the content of the utterance generation prompt to control diversity using a
+single parameter, lambda. Through extensive experiments, we show that APP
+effectively controls the output diversity across models and datasets, with
+pruning more information leading to more diverse output. We comprehensively
+analyze the relationship between prompt content and conversational diversity.
+Our findings reveal that information from all components of the prompt
+generally constrains the diversity of the output, with the Memory block
+exerting the most significant influence. APP is compatible with established
+techniques like temperature sampling and top-p sampling, providing a versatile
+tool for diversity management. To address the trade-offs of increased
+diversity, such as inconsistencies with omitted information, we incorporate a
+post-generation correction step, which effectively balances diversity
+enhancement with output consistency. Additionally, we examine how prompt
+structure, including component order and length, impacts diversity. This study
+addresses key questions surrounding diversity in multi-agent world simulation,
+offering insights into its control, influencing factors, and associated
+trade-offs. Our contributions lay the foundation for systematically engineering
+diversity in LLM-based multi-agent collaborations, advancing their
+effectiveness in real-world applications.
+
+
+
+ comment: Accepted for the AAAI 2025 Workshop on Advancing LLM-Based
+ Multi-Agent Collaboration
+
+
+
+
+
+
+ ☆ Efficient Multi-Task Inferencing with a Shared Backbone and Lightweight
+ Task-Specific Adapters for Automatic Scoring AAAI
+
+
+ The integration of Artificial Intelligence (AI) in education requires
+scalable and efficient frameworks that balance performance, adaptability, and
+cost. This paper addresses these needs by proposing a shared backbone model
+architecture enhanced with lightweight LoRA adapters for task-specific
+fine-tuning, targeting the automated scoring of student responses across 27
+mutually exclusive tasks. By achieving competitive performance (average QWK of
+0.848 compared to 0.888 for fully fine-tuned models) while reducing GPU memory
+consumption by 60% and inference latency by 40%, the framework demonstrates
+significant efficiency gains. This approach aligns with the workshops' focus on
+improving language models for educational tasks, creating responsible
+innovations for cost-sensitive deployment, and supporting educators by
+streamlining assessment workflows. The findings underscore the potential of
+scalable AI to enhance learning outcomes while maintaining fairness and
+transparency in automated scoring systems.
+
+
+
+ comment: Accepted by AAAI-iRAISE Workshop
+
+
+
+
+
+
+ ☆ TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow
+ Matching and Clap-Ranked Preference Optimization
+
+
+ We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model
+with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio
+in just 3.7 seconds on a single A40 GPU. A key challenge in aligning TTA models
+lies in the difficulty of creating preference pairs, as TTA lacks structured
+mechanisms like verifiable rewards or gold-standard answers available for Large
+Language Models (LLMs). To address this, we propose CLAP-Ranked Preference
+Optimization (CRPO), a novel framework that iteratively generates and optimizes
+preference data to enhance TTA alignment. We demonstrate that the audio
+preference dataset generated using CRPO outperforms existing alternatives. With
+this framework, TangoFlux achieves state-of-the-art performance across both
+objective and subjective benchmarks. We open source all code and models to
+support further research in TTA generation.
+
+
+
+ comment: https://tangoflux.github.io/
+
+
+
+
+
+
+ ☆ GePBench: Evaluating Fundamental Geometric Perception for Multimodal
+ Large Language Models
+
+
+ Multimodal large language models (MLLMs) have achieved significant
+advancements in integrating visual and linguistic understanding. While existing
+benchmarks evaluate these models in context-rich, real-life scenarios, they
+often overlook fundamental perceptual skills essential for environments
+deviating from everyday realism. In particular, geometric perception, the
+ability to interpret spatial relationships and abstract visual patterns,
+remains underexplored. To address this limitation, we introduce GePBench, a
+novel benchmark designed to assess the geometric perception capabilities of
+MLLMs. Results from extensive evaluations reveal that current state-of-the-art
+MLLMs exhibit significant deficiencies in such tasks. Additionally, we
+demonstrate that models trained with data sourced from GePBench show notable
+improvements on a wide range of downstream tasks, underscoring the importance
+of geometric perception as a foundation for advanced multimodal applications.
+Our code and datasets will be publicly available.
+
+
+
+
+
+
+
+ ☆ Plancraft: an evaluation dataset for planning with LLM agents
+
+
+
+
+
+
+
+
+ Gautier Dagan, Frank Keller, Alex Lascarides
+
+
+ We present Plancraft, a multi-modal evaluation dataset for LLM agents.
+Plancraft has both a text-only and multi-modal interface, based on the
+Minecraft crafting GUI. We include the Minecraft Wiki to evaluate tool use and
+Retrieval Augmented Generation (RAG), as well as an oracle planner and oracle
+RAG information extractor, to ablate the different components of a modern agent
+architecture. To evaluate decision-making, Plancraft also includes a subset of
+examples that are intentionally unsolvable, providing a realistic challenge
+that requires the agent not only to complete tasks but also to decide whether
+they are solvable at all. We benchmark both open-source and closed-source LLMs
+and strategies on our task and compare their performance to a handcrafted
+planner. We find that LLMs and VLMs struggle with the planning problems that
+Plancraft introduces, and we offer suggestions on how to improve their
+capabilities.
+
+
+
+
+
+
+
+ ☆ MapQaTor: A System for Efficient Annotation of Map Query Datasets
+
+
+ Mapping and navigation services like Google Maps, Apple Maps, and
+OpenStreetMap are essential for accessing various location-based data, yet they often
+struggle to handle natural language geospatial queries. Recent advancements in
+Large Language Models (LLMs) show promise in question answering (QA), but
+creating reliable geospatial QA datasets from map services remains challenging.
+We introduce MapQaTor, a web application that streamlines the creation of
+reproducible, traceable map-based QA datasets. With its plug-and-play
+architecture, MapQaTor enables seamless integration with any maps API, allowing
+users to gather and visualize data from diverse sources with minimal setup. By
+caching API responses, the platform ensures consistent ground truth, enhancing
+the reliability of the data even as real-world information evolves. MapQaTor
+centralizes data retrieval, annotation, and visualization within a single
+platform, offering a unique opportunity to evaluate the current state of
+LLM-based geospatial reasoning while advancing their capabilities for improved
+geospatial understanding. Evaluation metrics show that MapQaTor speeds up the
+annotation process by at least 30 times compared to manual methods,
+underscoring its potential for developing geospatial resources, such as complex
+map reasoning datasets. The website is live at: https://mapqator.github.io/ and
+a demo video is available at: https://youtu.be/7_aV9Wmhs6Q.
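The caching idea behind MapQaTor's "consistent ground truth" can be sketched as a thin wrapper around any maps API client. The class and method names below are hypothetical illustrations, not MapQaTor's actual interface:

```python
import hashlib
import json


class CachedMapsClient:
    """Cache API responses keyed by the request, so annotations built on
    them remain reproducible even if the live service's data changes."""

    def __init__(self, fetch_fn):
        self._fetch = fetch_fn          # any callable hitting a maps API
        self._cache = {}                # request-key -> stored response

    def _key(self, endpoint, params):
        payload = json.dumps({"endpoint": endpoint, "params": params},
                             sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def query(self, endpoint, **params):
        key = self._key(endpoint, params)
        if key not in self._cache:      # first call hits the live API
            self._cache[key] = self._fetch(endpoint, params)
        return self._cache[key]         # later calls replay the cached copy


# A stub "API" whose answers drift over time, as real map data does.
calls = []
def flaky_api(endpoint, params):
    calls.append(endpoint)
    return {"result": f"{endpoint}-v{len(calls)}"}

client = CachedMapsClient(flaky_api)
first = client.query("geocode", q="Dhaka")
second = client.query("geocode", q="Dhaka")   # served from cache, not the API
```

Repeated queries return the first stored response, which is what keeps a dataset's ground truth stable as the underlying map data evolves.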
+
+
+ Large Language Models (LLMs) rely on generating extensive intermediate
+reasoning units (e.g., tokens, sentences) to enhance final answer quality
+across a wide range of complex tasks. While generating multiple reasoning paths
+or iteratively refining rationales proves effective for improving performance,
+these approaches inevitably result in significantly higher inference costs. In
+this work, we propose a novel sentence-level rationale reduction training
+framework that leverages a likelihood-based criterion, verbosity, to identify and
+remove redundant reasoning sentences. Unlike previous approaches that utilize
+token-level reduction, our sentence-level reduction framework maintains model
+performance while reducing generation length. This preserves the original
+reasoning abilities of LLMs and achieves an average 17.15% reduction in
+generation costs across various models and tasks.
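As a rough illustration of sentence-level reduction: a sentence is "verbose" when dropping it barely changes the likelihood of the final answer. The greedy loop and the `answer_loglik` scorer below are hypothetical stand-ins, not the paper's training objective:

```python
def reduce_rationale(sentences, answer_loglik, threshold=0.05):
    """Greedily drop a reasoning sentence when removing it changes the
    answer log-likelihood by less than `threshold`."""
    kept = list(sentences)
    i = 0
    while i < len(kept):
        without = kept[:i] + kept[i + 1:]
        # "verbosity": how little this sentence contributes to the answer
        if abs(answer_loglik(kept) - answer_loglik(without)) < threshold:
            kept = without              # redundant -> remove, stay at i
        else:
            i += 1                      # informative -> keep, advance
    return kept


# Toy scorer: only sentences mentioning "sum" or "therefore" matter.
def toy_loglik(rationale):
    return -1.0 + 0.5 * sum("sum" in s or "therefore" in s.lower()
                            for s in rationale)

sents = ["Let us restate the question.", "The sum is 12.",
         "As noted, we restate again.", "Therefore the answer is 12."]
kept = reduce_rationale(sents, toy_loglik)
```

With this toy scorer the two restatement sentences are pruned and the two load-bearing ones survive, shortening the rationale without changing the answer's score.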
+
+
+
+
+
+
+
+ ☆ Plug-and-Play Training Framework for Preference Optimization
+
+
+
+
+
+
+
+
+ Jingyuan Ma, Rui Li, Zheng Li, Lei Sha, Zhifang Sui
+
+
+ Recently, preference optimization methods such as DPO have significantly
+enhanced large language models (LLMs) in wide tasks including dialogue and
+question-answering. However, current methods fail to account for the varying
+difficulty levels of training samples during preference optimization, leading
+to mediocre performance in tasks with high accuracy requirements, particularly
+in mathematical reasoning. To address this limitation, we propose a novel
+training framework, which employs multiple sampling to analyze output
+distributions, assign different weights to samples, and incorporate these
+weights into the preference optimization process. This plug-and-play approach
+enables LLMs to prioritize challenging examples during training, improving
+learning efficiency. Experimental results demonstrate that our framework
+integrates seamlessly with various preference optimization methods and achieves
+consistent improvements in mathematical reasoning tasks.
+
+
+
+ comment: 12 pages, 9 figures
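One way the multiple-sampling weighting could look, as a hedged sketch: estimate each example's difficulty from the fraction of sampled outputs that are correct, then upweight hard examples. The exponential form below is an assumption for illustration, not the paper's exact scheme:

```python
import math


def difficulty_weights(pass_rates, temperature=1.0):
    """Turn per-example pass rates (fraction of sampled outputs judged
    correct) into normalized training weights emphasizing hard examples."""
    # Lower pass rate (harder example) -> larger unnormalized weight.
    raw = [math.exp((1.0 - p) / temperature) for p in pass_rates]
    total = sum(raw)
    return [w / total for w in raw]


# Three training examples: easy (90% pass), medium (50%), hard (10%).
weights = difficulty_weights([0.9, 0.5, 0.1])
```

The weights sum to one and increase monotonically as the pass rate falls, so the preference-optimization loss spends more of its gradient budget on the challenging examples.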
+
+
+
+
+
+
+ ☆ KARPA: A Training-free Method of Adapting Knowledge Graph as References
+ for Large Language Model's Reasoning Path Aggregation
+
+
+ Large language models (LLMs) demonstrate exceptional performance across a
+variety of tasks, yet they are often affected by hallucinations and the
+timeliness of knowledge. Leveraging knowledge graphs (KGs) as external
+knowledge sources has emerged as a viable solution, but existing methods for
+LLM-based knowledge graph question answering (KGQA) are often limited by
+step-by-step decision-making on KGs, restricting the global planning and
+reasoning capabilities of LLMs, or they require fine-tuning or pre-training on
+specific KGs. To address these challenges, we propose Knowledge graph Assisted
+Reasoning Path Aggregation (KARPA), a novel framework that harnesses the global
+planning abilities of LLMs for efficient and accurate KG reasoning. KARPA
+operates in three steps: pre-planning relation paths using the LLM's global
+planning capabilities, matching semantically relevant paths via an embedding
+model, and reasoning over these paths to generate answers. Unlike existing KGQA
+methods, KARPA avoids stepwise traversal, requires no additional training, and
+is adaptable to various LLM architectures. Extensive experimental results show
+that KARPA achieves state-of-the-art performance in KGQA tasks, delivering both
+high efficiency and accuracy. Our code will be available on Github.
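KARPA's second step, matching semantically relevant KG paths to the LLM's pre-planned relation path via an embedding model, can be illustrated with a toy bag-of-words embedding standing in for the real one:

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def match_paths(planned_path, candidate_paths, embed, top_k=2):
    """Score KG relation paths against the LLM's pre-planned path by
    embedding similarity and keep the top-k for the reasoning step."""
    scored = [(cosine(embed(planned_path), embed(c)), c)
              for c in candidate_paths]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [c for _, c in scored[:top_k]]


# Toy embedding: relation counts over a tiny vocabulary.
VOCAB = ["born_in", "capital_of", "located_in", "spouse_of"]
def bow(path):
    return [sum(rel == w for rel in path) for w in VOCAB]

plan = ["born_in", "located_in"]
candidates = [["born_in", "located_in"], ["spouse_of"],
              ["born_in", "capital_of"]]
best = match_paths(plan, candidates, bow)
```

Because matching happens in one embedding pass over candidate paths, no stepwise KG traversal by the LLM is needed, which is the efficiency point the abstract makes.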
+
+
+ The rapid evolution of large language models (LLMs) has unlocked their
+capabilities in advanced reasoning tasks like mathematical problem-solving,
+code generation, and legal analysis. Central to this progress are
+inference-time reasoning algorithms, which refine outputs by exploring multiple
+solution paths, at the cost of increasing compute demands and response
+latencies. Existing serving systems fail to adapt to the scaling behaviors of
+these algorithms or the varying difficulty of queries, leading to inefficient
+resource use and unmet latency targets.
+ We present Dynasor, a system that optimizes inference-time compute for LLM
+reasoning queries. Unlike traditional engines, Dynasor tracks and schedules
+requests within reasoning queries and uses Certaindex, a proxy that measures
+statistical reasoning progress based on model certainty, to guide compute
+allocation dynamically. Dynasor co-adapts scheduling with reasoning progress:
+it allocates more compute to hard queries, reduces compute for simpler ones,
+and terminates unpromising queries early, balancing accuracy, latency, and
+cost. On diverse datasets and algorithms, Dynasor reduces compute by up to 50%
+in batch processing, and sustains 3.3x higher query rates or 4.7x tighter
+latency SLOs in online serving.
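A Certaindex-style certainty proxy can be sketched as agreement among sampled reasoning paths; the entropy-based metric and thresholds here are illustrative assumptions, not Dynasor's actual implementation:

```python
import math
from collections import Counter


def certainty(answers):
    """1 minus the normalized entropy of the answer distribution across
    sampled reasoning paths: 1.0 = all paths agree, 0.0 = uniform split."""
    counts = Counter(answers)
    n = len(answers)
    if len(counts) <= 1:
        return 1.0
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return 1.0 - entropy / math.log(len(counts))


def adaptive_sample(path_gen, max_paths=16, stop_at=0.9, min_paths=4):
    """Allocate more samples to hard queries, fewer to easy ones."""
    answers = []
    for _ in range(max_paths):
        answers.append(next(path_gen))
        if len(answers) >= min_paths and certainty(answers) >= stop_at:
            break                  # confident enough -> stop spending compute
    return answers


easy_query = iter(["42"] * 16)     # all sampled paths agree
used = adaptive_sample(easy_query)
```

An easy query terminates at the minimum path count, while a query whose paths disagree keeps sampling up to the budget, which is the compute-reallocation behavior described above.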
+
+
+
+
+
+
+
+ ☆ DoTA: Weight-Decomposed Tensor Adaptation for Large Language Models
+
+
+ Low-rank adaptation (LoRA) reduces the computational and memory demands of
+fine-tuning large language models (LLMs) by approximating updates with low-rank
+matrices. However, low-rank approximation in two-dimensional space fails to
+capture high-dimensional structures within the target matrix. Recently, tensor
+decomposition methods have been explored for fine-tuning LLMs, leveraging their
+ability to extract structured information. Yet, these approaches primarily rely
+on random initialization, and the impact of initialization on tensor adaptation
+remains underexplored. In this paper, we reveal that random initialization
+significantly diverges from the validation loss achieved by full fine-tuning.
+To address this, we propose Weight-Decomposed Tensor Adaptation (DoTA), which
+leverages the Matrix Product Operator (MPO) decomposition of pre-trained
+weights for effective initialization in fine-tuning LLMs. Additionally, we
+introduce QDoTA, a quantized version of DoTA designed for 4-bit quantization.
+Experiments on commonsense and arithmetic reasoning tasks show that DoTA
+outperforms random initialization methods with fewer parameters. QDoTA further
+reduces memory consumption and achieves comparable performance to DoTA on
+commonsense reasoning tasks. We will release our code to support future
+research.
+
+
+ This work proposes a novel approach to enhancing annotated bibliography
+generation through Large Language Model (LLM) ensembles. In particular,
+multiple LLMs in different roles -- controllable text generation, evaluation,
+and summarization -- are introduced and validated using a systematic
+methodology to enhance model performance in scholarly tasks. Output diversity
+among the ensemble that generates text is obtained using different LLM
+parameters, followed by an LLM acting as a judge to assess relevance, accuracy,
+and coherence. Responses selected by several combining strategies are then
+merged and refined through summarization and redundancy removal techniques. The
+preliminary experimental validation demonstrates that the combined outputs from
+the LLM ensemble improve coherence and relevance compared to individual
+responses, leading to a 38% improvement in annotation quality and a 51%
+reduction in content redundancy, thus highlighting the potential for automating
+complex scholarly tasks while maintaining high-quality standards.
+
+
+
+
+
+
+
+ ☆ Are LLMs Really Not Knowledgable? Mining the Submerged Knowledge in
+ LLMs' Memory
+
+
+ Large language models (LLMs) have shown promise as potential knowledge bases,
+yet they often struggle with question-answering tasks and are prone to
+hallucinations. While previous research attributes these issues to knowledge
+gaps in the model's parameters, our investigation reveals a different
+phenomenon: LLMs often retain correct knowledge even when generating incorrect
+answers. Through analysis of the model's internal representations, we find that
+correct answers frequently appear among high-probability tokens despite not
+being selected as final outputs. Based on this observation, we introduce
+Hits@k, a new metric to assess knowledge retention independent of expression
+accuracy. Our extensive experiments demonstrate that LLMs store significantly
+more knowledge than their QA performance suggests. Building on these findings,
+we develop SkipUnsure, a method to improve answer accuracy by leveraging
+detected but unexpressed knowledge. Experiments on both open-domain and
+specific-domain datasets show consistent improvements, with accuracy gains of
+up to 11.8% on DBPedia and 6.3% on IMDB, without requiring model retraining.
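The Hits@k metric as described is straightforward to compute; a minimal sketch, where each question carries the model's ranked high-probability candidates:

```python
def hits_at_k(topk_candidates_per_question, gold_answers, k):
    """Count a question as a hit if its gold answer appears anywhere in
    the model's k highest-probability candidates, even when it was not
    the emitted output -- knowledge retention independent of expression."""
    hits = 0
    for cands, gold in zip(topk_candidates_per_question, gold_answers):
        if gold in cands[:k]:
            hits += 1
    return hits / len(gold_answers)


# Two questions; the second's correct answer ranks 3rd, not 1st.
cands = [["Paris", "Lyon", "Nice"], ["Berlin", "Bonn", "Munich"]]
gold = ["Paris", "Munich"]
acc_at_1 = hits_at_k(cands, gold, k=1)   # standard exact-top-1 accuracy
acc_at_3 = hits_at_k(cands, gold, k=3)   # credits "submerged" knowledge
```

The gap between Hits@1 and Hits@3 is exactly the retained-but-unexpressed knowledge the abstract is measuring.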
+
+
+
+
+
+
+
+ ☆ Disentangling Preference Representation and Text Generation for
+ Efficient Individual Preference Alignment
+
+
+
+
+
+
+
+
+ Jianfei Zhang, Jun Bai, Bei Li, Yanmeng Wang, Rumei Li, Chenghua Lin, Wenge Rong
+
+
+ Aligning Large Language Models (LLMs) with general human preferences has
+proved crucial in improving the quality of interaction between LLMs and humans.
+However, human values are inherently diverse among different individuals,
+making it insufficient to align LLMs solely with general preferences. To
+address this, personalizing LLMs according to individual feedback emerges as a
+promising solution. Nonetheless, this approach presents challenges in terms of
+the efficiency of alignment algorithms. In this work, we introduce a flexible
+paradigm for individual preference alignment. Our method fundamentally improves
+efficiency by disentangling preference representation from text generation in
+LLMs. We validate our approach across multiple text generation tasks and
+demonstrate that it can produce aligned quality as well as or better than
+PEFT-based methods, while reducing additional training time for each new
+individual preference by $80\%$ to $90\%$ in comparison with them.
+
+
+ Multimodal emotion recognition (MER), leveraging speech and text, has emerged
+as a pivotal domain within human-computer interaction, demanding sophisticated
+methods for effective multimodal integration. The challenge of aligning
+features across these modalities is significant, with most existing approaches
+adopting a singular alignment strategy. Such a narrow focus not only limits
+model performance but also fails to address the complexity and ambiguity
+inherent in emotional expressions. In response, this paper introduces a
+Multi-Granularity Cross-Modal Alignment (MGCMA) framework, distinguished by its
+comprehensive approach encompassing distribution-based, instance-based, and
+token-based alignment modules. This framework enables a multi-level perception
+of emotional information across modalities. Our experiments on IEMOCAP
+demonstrate that our proposed method outperforms current state-of-the-art
+techniques.
+
+
+
+ comment: ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech
+ and Signal Processing (ICASSP)
+
+ In open-ended generative tasks like narrative writing or dialogue, large
+language models often exhibit cultural biases, showing limited knowledge and
+generating templated outputs for less prevalent cultures. Recent works show
+that these biases may stem from uneven cultural representation in pretraining
+corpora. This work investigates how pretraining leads to biased
+culture-conditioned generations by analyzing how models associate entities with
+cultures based on pretraining data patterns. We propose the MEMOed framework
+(MEMOrization from pretraining document) to determine whether a generation for
+a culture arises from memorization. Using MEMOed on culture-conditioned
+generations about food and clothing for 110 cultures, we find that
+high-frequency cultures in pretraining data yield more generations with
+memorized symbols, while some low-frequency cultures produce none.
+Additionally, the model favors generating entities with extraordinarily high
+frequency regardless of the conditioned culture, reflecting biases toward
+frequent pretraining terms irrespective of relevance. We hope that the MEMOed
+framework and our insights will inspire more work on attributing model
+performance to pretraining data.
+
+
+
+
+
+
+
+ ☆ Depression and Anxiety Prediction Using Deep Language Models and
+ Transfer Learning
+
+
+
+
+
+
+
+
+ Tomasz Rutowski, Elizabeth Shriberg, Amir Harati, Yang Lu, Piotr Chlebek, Ricardo Oliveira
+
+
+ Digital screening and monitoring applications can aid providers in the
+management of behavioral health conditions. We explore deep language models for
+detecting depression, anxiety, and their co-occurrence from conversational
+speech collected during 16k user interactions with an application. Labels come
+from PHQ-8 and GAD-7 results also collected by the application. We find that
+results for binary classification range from 0.86 to 0.79 AUC, depending on
+condition and co-occurrence. Best performance is achieved when a user has
+either both or neither condition, and we show that this result is not
+attributable to data skew. Finally, we find evidence suggesting that underlying
+word sequence cues may be more salient for depression than for anxiety.
+
+
+
+
+
+
+
+ ☆ HUNYUANPROVER: A Scalable Data Synthesis Framework and Guided Tree
+ Search for Automated Theorem Proving
+
+
+
+
+
+
+
+
+ Yang Li, Dong Du, Linfeng Song, Chen Li, Weikang Wang, Tao Yang, Haitao Mi
+
+
+ We introduce HunyuanProver, a language model fine-tuned from Hunyuan-7B
+for interactive automated theorem proving with LEAN4. To alleviate the data
+sparsity issue, we design a scalable framework to iteratively synthesize data
+at low cost. In addition, guided tree search algorithms are designed to enable
+effective "system 2 thinking" in the prover. HunyuanProver achieves
+state-of-the-art (SOTA) performance on major benchmarks. Specifically, it
+achieves a pass rate of 68.4% on miniF2F-test, compared to the current SOTA
+result of 65.9%. It proves 4 IMO statements (imo_1960_p2, imo_1962_p2,
+imo_1964_p2 and imo_1983_p6) in miniF2F-test. To benefit the community, we will
+open-source a dataset of 30k synthesized instances, where each instance
+contains the original question in natural language, the converted statement by
+autoformalization, and the proof by HunyuanProver.
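The guided-search component can be sketched as critic-scored best-first search over a toy state space; the numeric states below are a stand-in for LEAN4 proof states, and the `score` function plays the role of the prover's learned guidance:

```python
import heapq


def best_first_search(initial, expand, score, is_goal, budget=100):
    """Expand the highest-scoring open node first, where `score` acts
    as the critic guiding 'system 2' search toward a complete proof."""
    frontier = [(-score(initial), initial)]   # max-heap via negated score
    seen = {initial}
    for _ in range(budget):
        if not frontier:
            return None
        _, state = heapq.heappop(frontier)
        if is_goal(state):
            return state
        for nxt in expand(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score(nxt), nxt))
    return None                               # budget exhausted


# Toy "proof" space: reach 10 from 0 via +1 / +3 steps; the critic
# prefers states closer to the goal.
goal = 10
found = best_first_search(
    0,
    expand=lambda s: [s + 1, s + 3],
    score=lambda s: -abs(goal - s),
    is_goal=lambda s: s == goal,
)
```

The search budget caps how many nodes get expanded, mirroring the compute limit a prover works under per theorem.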
+
+
+
+
+
+
+
+ ☆ ChartAdapter: Large Vision-Language Model for Chart Summarization
+
+
+ Chart summarization, which focuses on extracting key information from charts
+and interpreting it in natural language, is crucial for generating and
+delivering insights through effective and accessible data analysis. Traditional
+methods for chart understanding and summarization often rely on multi-stage
+pipelines, which may produce suboptimal semantic alignment between visual and
+textual information. In comparison, recently developed LLM-based methods depend
+more on the capabilities of the underlying vision or language foundation models,
+while ignoring the characteristics of chart data and its specific challenges. To
+address these limitations, we propose ChartAdapter, a novel lightweight
+transformer module designed to bridge the gap between charts and textual
+summaries. ChartAdapter employs learnable query vectors to extract implicit
+semantics from chart data and incorporates a cross-modal alignment projector to
+enhance vision-to-language generative learning. By integrating ChartAdapter
+with an LLM, we enable end-to-end training and efficient chart summarization.
+To further enhance the training, we introduce a three-stage hierarchical
+training procedure and develop a large-scale dataset specifically curated for
+chart summarization, comprising 190,618 samples. Experimental results on the
+standard Chart-to-Text testing set demonstrate that our approach significantly
+outperforms existing methods, including state-of-the-art models, in generating
+high-quality chart summaries. Ablation studies further validate the
+effectiveness of key components in ChartAdapter. This work highlights the
+potential of tailored LLM-based approaches to advance chart understanding and
+sets a strong foundation for future research in this area.
+
+
+
+
+
+
+
+ ☆ UBER: Uncertainty-Based Evolution with Large Language Models for
+ Automatic Heuristic Design
+
+
+ NP-hard problem-solving traditionally relies on heuristics, but manually
+crafting effective heuristics for complex problems remains challenging. While
+recent work like FunSearch has demonstrated that large language models (LLMs)
+can be leveraged for heuristic design in evolutionary algorithm (EA)
+frameworks, their potential is not fully realized due to deficiencies in
+exploitation and exploration. We present UBER (Uncertainty-Based Evolution for
+Refinement), a method that enhances LLM+EA methods for automatic heuristic
+design by integrating uncertainty on top of the FunSearch framework. UBER
+introduces two key innovations: an Uncertainty-Inclusive Evolution Process
+(UIEP) for adaptive exploration-exploitation balance, and a principled
+Uncertainty-Inclusive Island Reset (UIIS) strategy for maintaining population
+diversity. Through extensive experiments on challenging NP-complete problems,
+UBER demonstrates significant improvements over FunSearch. Our work provides a
+new direction for the synergy of LLMs and EA, advancing the field of automatic
+heuristic design.
+
+
+
+
+
+
+
+ ☆ Align Attention Heads Before Merging Them: An Effective Way for
+ Converting MHA to GQA
+
+
+ Large language models have been shown to perform well on a variety of natural
+language processing problems. However, as the model size and the input
+sequence's length increase, the rapid increase of KV Cache significantly slows
+down inference speed. Therefore, the GQA model, as an alternative to the MHA
+model, has been widely adopted in LLMs. In this work, we propose a low-cost method
+for pruning MHA models into GQA models with any compression ratio of key-value
+heads. Our method is based on $\mathit{L_0}$ masks to gradually remove
+redundant parameters. In addition, before pruning training we apply orthogonal
+transformations to attention heads, without changing the model's function, to
+increase the similarity between attention heads and further improve the
+performance of the model. Our method is compatible with rotary position
+embedding (RoPE), which means the model after training can be fully adapted to
+the mainstream standard GQA framework. Experiments demonstrate that our
+strategy can compress up to 87.5% of the key-value heads of the LLaMA2-7B model
+with limited performance degradation, achieved solely through supervised
+fine-tuning.
+
+
+
+ comment: 12 pages, 4 figures
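For orientation, the simpler mean-pooling baseline that such pruning methods improve on can be sketched as follows; this is not the paper's L0-mask-plus-orthogonal-transform procedure, and each head is a flat list of floats for simplicity:

```python
def group_kv_heads(kv_heads, group_size):
    """Convert per-head key/value projections (MHA) into grouped ones
    (GQA) by mean-pooling each group of `group_size` heads, so every
    group shares one KV head and the KV cache shrinks proportionally."""
    assert len(kv_heads) % group_size == 0
    grouped = []
    for g in range(0, len(kv_heads), group_size):
        group = kv_heads[g:g + group_size]
        pooled = [sum(vals) / group_size for vals in zip(*group)]
        grouped.append(pooled)          # one shared KV head per group
    return grouped


# 4 MHA key heads compressed to 2 GQA heads (50% of KV heads removed).
heads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
gqa = group_kv_heads(heads, group_size=2)
```

Mean-pooling dissimilar heads loses information, which is precisely the motivation for first increasing inter-head similarity via orthogonal transformations before merging.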
+
+
+
+
+
+
+ ☆ Knowledge Editing for Large Language Model with Knowledge Neuronal
+ Ensemble
+
+
+
+
+
+
+
+
+ Yongchang Li, Yujin Zhu, Tao Yan, Shijian Fan, Gang Wu, Liang Xu
+
+
+ As real-world knowledge is constantly evolving, ensuring the timeliness and
+accuracy of a model's knowledge is crucial. This has made knowledge editing in
+large language models increasingly important. However, existing knowledge
+editing methods face several challenges, including parameter localization
+coupling, imprecise localization, and a lack of dynamic interaction across
+layers. In this paper, we propose a novel knowledge editing method called
+Knowledge Neuronal Ensemble (KNE). A knowledge neuronal ensemble represents a
+group of neurons encoding specific knowledge, thus mitigating the issue of
+frequent parameter modification caused by coupling in parameter localization.
+The KNE method enhances the precision and accuracy of parameter localization by
+computing gradient attribution scores for each parameter at each layer. During
+the editing process, only the gradients and losses associated with the
+knowledge neuronal ensemble are computed, with error backpropagation performed
+accordingly, ensuring dynamic interaction and collaborative updates among
+parameters. Experimental results on three widely used knowledge editing
+datasets show that the KNE method significantly improves the accuracy of
+knowledge editing and achieves, or even exceeds, the performance of the best
+baseline methods in portability and locality metrics.
+
+
+
+ comment: 26 pages, 5 figures, 2 tables
+
+
+
+
+
+
+ ☆ GASLITEing the Retrieval: Exploring Vulnerabilities in Dense
+ Embedding-based Search
+
+
+ Dense embedding-based text retrieval -- retrieval of relevant
+passages from corpora via deep learning encodings -- has emerged
+as a powerful method attaining state-of-the-art search results and popularizing
+the use of Retrieval Augmented Generation (RAG). Still, like other search
+methods, embedding-based retrieval may be susceptible to search-engine
+optimization (SEO) attacks, where adversaries promote malicious content by
+introducing adversarial passages to corpora. To faithfully assess and gain
+insights into the susceptibility of such systems to SEO, this work proposes the
+GASLITE attack, a mathematically principled gradient-based search method for
+generating adversarial passages without relying on the corpus content or
+modifying the model. Notably, GASLITE's passages (1) carry adversary-chosen
+information while (2) achieving high retrieval ranking for a selected query
+distribution when inserted to corpora. We use GASLITE to extensively evaluate
+retrievers' robustness, testing nine advanced models under varied threat
+models, while focusing on realistic adversaries targeting queries on a specific
+concept (e.g., a public figure). We found GASLITE consistently outperformed
+baselines by $\geq$140% success rate, in all settings. Particularly,
+adversaries using GASLITE require minimal effort to manipulate search
+results -- by injecting a negligible amount of adversarial
+passages ($\leq$0.0001% of the corpus), they could make them visible in the
+top-10 results for 61-100% of unseen concept-specific queries against most
+evaluated models. Inspecting variance in retrievers' robustness, we identify
+key factors that may contribute to models' susceptibility to SEO, including
+specific properties in the embedding space's geometry.
+
+
+
+
+
+
+
+ ♻ ☆ Order Matters in Hallucination: Reasoning Order as Benchmark and
+ Reflexive Prompting for Large-Language-Models ACL 2025
+
+
+ Large language models (LLMs) have generated significant attention since their
+inception, finding applications across various academic and industrial domains.
+However, these models often suffer from the "hallucination problem", where
+outputs, though grammatically and logically coherent, lack factual accuracy or
+are entirely fabricated. A particularly troubling issue discovered and widely
+discussed recently is the numerical comparison error where multiple LLMs
+incorrectly infer that "9.11$>$9.9". We discovered that the order in which LLMs
+generate answers and reasoning impacts their consistency. Specifically, results
+vary significantly when an LLM generates an answer first and then provides the
+reasoning versus generating the reasoning process first and then the
+conclusion. Inspired by this, we propose a new benchmark method for assessing
+LLM consistency: comparing responses generated through these two different
+approaches. This benchmark effectively identifies instances where LLMs
+fabricate answers and subsequently generate justifications. Furthermore, we
+introduce a novel and straightforward prompt strategy designed to mitigate this
+issue. Experimental results demonstrate that this strategy improves performance
+across various LLMs compared to direct questioning. This work not only sheds
+light on a critical flaw in LLMs but also offers a practical solution to
+enhance their reliability.
+
+
+
+ comment: 8 pages, submitted to ACL 2025
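The proposed benchmark reduces to a two-prompt agreement check per question; a minimal sketch, with a stub callable standing in for a real LLM:

```python
def order_consistency(model, questions):
    """Query the model twice per question -- once answer-first, once
    reasoning-first -- and report the fraction of questions where the
    two final answers agree. Low agreement flags likely fabrication."""
    agree = 0
    for q in questions:
        ans_first = model(f"Answer first, then explain: {q}")
        reason_first = model(f"Reason step by step, then answer: {q}")
        agree += ans_first == reason_first
    return agree / len(questions)


# Stub model that flip-flops on the infamous 9.11 vs 9.9 comparison
# depending on prompt order, but is stable on simple arithmetic.
def stub(prompt):
    if "9.11" in prompt:
        return "9.11" if prompt.startswith("Answer first") else "9.9"
    return "4"

score = order_consistency(stub, ["Which is larger, 9.11 or 9.9?", "2+2?"])
```

A question where the two orderings disagree is exactly the failure case the benchmark is designed to surface: an answer committed to before the reasoning that should support it.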
+
+
+
+
+
+
+ ♻ ☆ ReXTrust: A Model for Fine-Grained Hallucination Detection in
+ AI-Generated Radiology Reports
+
+
+ The increasing adoption of AI-generated radiology reports necessitates robust
+methods for detecting hallucinations--false or unfounded statements that could
+impact patient care. We present ReXTrust, a novel framework for fine-grained
+hallucination detection in AI-generated radiology reports. Our approach
+leverages sequences of hidden states from large vision-language models to
+produce finding-level hallucination risk scores. We evaluate ReXTrust on a
+subset of the MIMIC-CXR dataset and demonstrate superior performance compared
+to existing approaches, achieving an AUROC of 0.8751 across all findings and
+0.8963 on clinically significant findings. Our results show that white-box
+approaches leveraging model hidden states can provide reliable hallucination
+detection for medical AI systems, potentially improving the safety and
+reliability of automated radiology reporting.
+
+
+
+ comment: Accepted to AIMedHealth; 10 pages, 5 figures
+
+
+
+
+
+
+
+ Patrick Sutanto, Joan Santoso, Esther Irawati Setiawan, Aji Prasetya Wibawa
+
+
+ Multiple Choice Question Answering (MCQA) is an important problem with
+numerous real-world applications, such as medicine, law, and education. The
+high cost of building MCQA datasets makes few-shot learning pivotal in this
+domain. While Large Language Models (LLMs) can enable few-shot learning, their
+direct application in real-world scenarios is often hindered by their high
+computational cost. To address this challenge, we propose a simple yet
+effective approach that uses LLMs for data generation and scoring. Our approach
+utilizes LLMs to create MCQA data which contains questions and choices, and to
+assign probability scores to the generated choices. We then use the generated
+data and LLM-assigned scores to finetune a smaller and more efficient
+encoder-only model, DeBERTa-v3-base, by leveraging a distillation loss. Extensive
+experiments on the Massive Multitask Language Understanding (MMLU) benchmark
+demonstrate that our method improves accuracy from 28.9% to 39.3%, representing
+a gain of over 10% compared to a baseline finetuned directly on 5-shot
+examples. This shows the effectiveness of LLM-driven data generation and
+knowledge distillation for few-shot MCQA.
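The distillation step can be sketched as a soft-target KL loss over the answer-choice logits; a minimal sketch whose temperature and exact loss form are illustrative assumptions, not necessarily the paper's formulation:

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_scores, temperature=2.0):
    """KL(teacher || student) over the MCQA choices: the student is
    trained to match the LLM-assigned probability scores rather than
    a one-hot label."""
    p = softmax(teacher_scores, temperature)   # soft targets from the LLM
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


teacher = [2.0, 0.5, -1.0, -1.0]       # LLM scores for 4 choices
aligned = distillation_loss([2.0, 0.5, -1.0, -1.0], teacher)
off = distillation_loss([-1.0, 0.5, 2.0, -1.0], teacher)
```

Matching the full score distribution, not just the argmax, is what lets the small encoder inherit the LLM's relative confidence across distractors.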
+
+
+
+
+
+
+
+ ♻ ☆ DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought
+
+
+ Recently, O1-like models have emerged as representative examples,
+illustrating the effectiveness of long chain-of-thought (CoT) in reasoning
+tasks such as math and coding. In this paper, we introduce DRT-o1, an
+attempt to bring the success of long CoT to neural machine translation (MT).
+Specifically, in view of the literature books that might involve similes and
+metaphors, translating these texts to a target language is very difficult in
+practice due to cultural differences. In such cases, literal translation often
+fails to convey the intended meaning effectively. Even for professional human
+translators, considerable thought must be given to preserving semantics
+throughout the translation process. To simulate LLMs' long thought ability in
+MT, we first mine sentences containing similes or metaphors from existing
+literature books, and then develop a multi-agent framework to translate these
+sentences via long thought. In the multi-agent framework, a translator is used
+to iteratively translate the source sentence under the suggestions provided by
+an advisor. To ensure the effectiveness of the long thoughts, an evaluator is
+also employed to quantify the translation in each round. In this way, we
+collect tens of thousands of long-thought MT data, which is used to train our
+DRT-o1. Using Qwen2.5 and LLama-3.1 as the backbones, DRT-o1 models can learn
+the thought process during machine translation, and outperform vanilla LLMs as
+well as existing O1-like LLMs, showing their effectiveness. The project is
+available at https://github.com/krystalan/DRT-o1
+
+
+
+
+
+
+
+ ♻ ☆ CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models
+ of Code
+
+
+
+
+
+
+
+
+ Batu Guan, Yao Wan, Zhangqian Bi, Zheng Wang, Hongyu Zhang, Pan Zhou, Lichao Sun
+
+
+ Large Language Models (LLMs) have achieved remarkable progress in code
+generation. It now becomes crucial to identify whether the code is AI-generated
+and to determine the specific model used, particularly for purposes such as
+protecting Intellectual Property (IP) in industry and preventing cheating in
+programming exercises. To this end, several attempts have been made to insert
+watermarks into machine-generated code. However, existing approaches are
+limited to inserting only a single bit of information. In this paper, we
+introduce CodeIP, a novel multi-bit watermarking technique that inserts
+additional information to preserve crucial provenance details, such as the
+vendor ID of an LLM, thereby safeguarding the IPs of LLMs in code generation.
+Furthermore, to ensure the syntactical correctness of the generated code, we
+propose constraining the sampling process for predicting the next token by
+training a type predictor. Experiments conducted on a real-world dataset across
+five programming languages demonstrate the effectiveness of CodeIP in
+watermarking LLMs for code generation while maintaining the syntactical
+correctness of code.
+
+
+
+ comment: 16 pages, 13 figures
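Multi-bit watermarking by vocabulary partitioning can be sketched as follows; this omits CodeIP's grammar-aware type predictor, and uses a toy index-parity partition where a real scheme would use a keyed hash:

```python
def partition(vocab, n_parts=2):
    """Toy deterministic partition: bucket tokens by index parity."""
    buckets = [[] for _ in range(n_parts)]
    for i, tok in enumerate(vocab):
        buckets[i % n_parts].append(tok)
    return buckets


def embed_bits(candidates_per_step, bits, vocab):
    """At step i, pick the most-preferred candidate token that falls in
    the bucket selected by message bit i, encoding one bit per token."""
    buckets = partition(vocab)
    out = []
    for cands, bit in zip(candidates_per_step, bits):
        chosen = next(t for t in cands if t in buckets[bit])
        out.append(chosen)
    return out


def extract_bits(tokens, vocab):
    """Recover the message by checking which bucket each token is in."""
    buckets = partition(vocab)
    return [0 if t in buckets[0] else 1 for t in tokens]


vocab = [f"tok{i}" for i in range(20)]
steps = [vocab, vocab, vocab]        # model's ranked candidates per step
msg = [1, 0, 1]                      # e.g. part of a vendor ID
text = embed_bits(steps, msg, vocab)
recovered = extract_bits(text, vocab)
```

Constraining each choice to a bucket slightly restricts sampling, which is why CodeIP pairs it with a type predictor to keep the emitted code syntactically valid.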
+
+
+
+
+
+
+ ♻ ☆ A Comprehensive Survey of Large Language Models and Multimodal Large
+ Language Models in Medicine
+
+
+ Since the release of ChatGPT and GPT-4, large language models (LLMs) and
+multimodal large language models (MLLMs) have attracted widespread attention
+for their exceptional capabilities in understanding, reasoning, and generation,
+introducing transformative paradigms for integrating artificial intelligence
+into medicine. This survey provides a comprehensive overview of the
+development, principles, application scenarios, challenges, and future
+directions of LLMs and MLLMs in medicine. Specifically, it begins by examining
+the paradigm shift, tracing the transition from traditional models to LLMs and
+MLLMs, and highlighting the unique advantages of these models in
+medical applications. Next, the survey reviews existing medical LLMs and MLLMs,
+providing detailed guidance on their construction and evaluation in a clear and
+systematic manner. Subsequently, to underscore the substantial value of LLMs
+and MLLMs in healthcare, the survey explores five promising applications in the
+field. Finally, the survey addresses the challenges confronting medical LLMs
+and MLLMs and proposes practical strategies and future directions for their
+integration into medicine. In summary, this survey offers a comprehensive
+analysis of the technical methodologies and practical clinical applications of
+medical LLMs and MLLMs, with the goal of bridging the gap between these
+advanced technologies and clinical practice, thereby fostering the evolution of
+the next generation of intelligent healthcare systems.
+
+
+
+
+
+
+
+ ♻ ☆ SepLLM: Accelerate Large Language Models by Compressing One Segment into
+ One Separator
+
+
+ Large Language Models (LLMs) have exhibited exceptional performance across a
+spectrum of natural language processing tasks. However, their substantial sizes
+pose considerable challenges, particularly in computational demands and
+inference speed, due to their quadratic complexity. In this work, we have
+identified a key pattern: certain seemingly meaningless special tokens (i.e.,
+separators) contribute disproportionately to attention scores compared to
+semantically meaningful tokens. This observation suggests that information of
+the segments between these separator tokens can be effectively condensed into
+the separator tokens themselves without significant information loss. Guided by
+this insight, we introduce SepLLM, a plug-and-play framework that accelerates
+inference by compressing these segments and eliminating redundant tokens.
+Additionally, we implement efficient kernels for training acceleration.
+Experimental results across training-free, training-from-scratch, and
+post-training settings demonstrate SepLLM's effectiveness. Notably, using the
+Llama-3-8B backbone, SepLLM achieves over 50% reduction in KV cache on the
+GSM8K-CoT benchmark while maintaining comparable performance. Furthermore, in
+streaming settings, SepLLM effectively processes sequences of up to 4 million
+tokens or more while maintaining consistent language modeling capabilities.
+
+
+
+ comment: We have made our code publicly available at sepllm.github.io. Our
+ codebase supports efficient multi-node distributed training with accelerated
+ attention module Sep-Attention and also supports numerous existing Fusion
+ Operators to accelerate the training process, such as fused rope, etc. If you
+ find our code helpful, please kindly consider giving us a **star** on
+ GitHub^_^. Thank you very much!
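The training-free variant described above can be approximated with a simple attention mask: each query position may attend only to the first few tokens, past separator tokens, and a recent local window. A minimal sketch (SepLLM's released kernels implement this far more efficiently):

```python
def sepllm_mask(tokens, separators=(".", ",", ";", "\n"), n_init=4, window=8):
    """Causal attention mask keeping initial tokens, separators, and a local window.

    Segment content between separators is assumed to have been condensed into
    the separator tokens, so other distant tokens are dropped from attention.
    """
    n = len(tokens)
    sep_pos = {i for i, t in enumerate(tokens) if t in separators}
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: keys only up to the query position
            if k < n_init or k in sep_pos or q - k < window:
                mask[q][k] = True
    return mask
```

Only the initial, separator, and window positions keep KV-cache entries, which is where the reported cache reduction comes from.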
+
+
+
+
+
+
+ ♻ ☆ MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
+
+
+ The development of Multimodal Large Language Models (MLLMs) has seen
+significant advancements with increasing demands in various fields (e.g.,
+multimodal agents, embodied intelligence). While model-driven approaches
+attempt to enhance MLLMs capabilities through diverse architectures, the gains
+have become increasingly marginal. Conversely, data-driven methods, which scale
+up image-text instruction data, are more effective but face limited data
+diversity and complexity challenges. The absence of high-quality data
+constitutes a significant development barrier for MLLMs. To address the data
+quality bottleneck, we propose MMEvol, a novel multimodal instruction data
+evolution framework. This framework iteratively improves data quality through a
+refined combination of fine-grained perception, cognitive reasoning, and
+interaction evolution, generating a more complex and diverse image-text
+instruction dataset that empowers MLLMs with enhanced capabilities. Beginning
+with an initial set of instructions, SEED-163K, we utilize MMEvol to
+systematically broaden the diversity of instruction types, extend visual
+reasoning steps to improve cognitive reasoning abilities, and thoroughly
+explore fine-grained information within images to enhance visual understanding
+and robustness. To comprehensively evaluate the effectiveness of our approach,
+we conduct extensive qualitative analysis and quantitative experiments across
+13 vision-language tasks. Compared to baseline models trained with the initial
+seed data, the results demonstrate that our method achieves an average accuracy
+improvement of 3.1 percentage points. Furthermore, our approach reaches
+state-of-the-art (SOTA) performance in nine tasks using significantly less data
+compared to state-of-the-art models.
+
+
+ Chain of thought (CoT) is a reasoning framework that can enhance the
+performance of Large Language Models (LLMs) on complex inference tasks. In
+particular, among various studies related to CoT, multi-path inference stands
+out as a simple yet effective improvement. However, there is no optimal setting
+for the number of inference paths. Therefore, we have to increase the number of
+inference paths to obtain better results, which in turn increases the inference
+cost. To address this limitation, we can utilize question-related role
+templates to guide LLMs into relevant roles, thereby increasing the possibility
+of correct inferences for each path and further reducing dependence on the
+number of inference paths while improving reasoning accuracy. However, placing
+LLMs into specific roles may reduce their reasoning diversity and performance
+on a few tasks where role dependence is low. To alleviate the excessive
+immersion of the LLM into a specific role, we propose Nash CoT, which
+constructs a game system on each path that balances generation from
+role-specific LLMs against that of general LLMs. This ensures both effective
+role adoption and diversity in LLM generation, maintaining the performance of
+multi-path inference while reducing the required number of inference paths. We
+evaluate Nash CoT across various inference tasks, including Arabic
+Reasoning, Commonsense Question Answering, and Symbolic Inference, achieving
+results that are comparable to or better than those of multi-path CoT with the
+equal number of inference paths.
+
+
+
+
+
+
+
+ ♻ ☆ JuniperLiu at CoMeDi Shared Task: Models as Annotators in Lexical
+ Semantics Disagreements
+
+
+ We present the results of our system for the CoMeDi Shared Task, which
+predicts majority votes (Subtask 1) and annotator disagreements (Subtask 2).
+Our approach combines model ensemble strategies with MLP-based and
+threshold-based methods trained on pretrained language models. Treating
+individual models as virtual annotators, we simulate the annotation process by
+designing aggregation measures that incorporate continuous relatedness scores
+and discrete classification labels to capture both majority and disagreement.
+Additionally, we employ anisotropy removal techniques to enhance performance.
+Experimental results demonstrate the effectiveness of our methods, particularly
+for Subtask 2. Notably, we find that the standard deviation of continuous
+relatedness scores across different model manipulations correlates more
+strongly with human disagreement annotations than metrics computed on
+aggregated discrete labels do. The
+code will be published at https://github.com/RyanLiut/CoMeDi_Solution.
+
+
+
+ comment: accepted by CoMeDi workshop in Coling 2025
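The "models as virtual annotators" aggregation reduces to a few lines: continuous relatedness scores from different models are pooled, their spread serves as the disagreement signal (Subtask 2), and discrete labels are majority-voted (Subtask 1). A sketch under those assumptions:

```python
import statistics

def disagreement_score(model_scores):
    """Spread of continuous relatedness scores across virtual annotators,
    used as a proxy for human disagreement (Subtask 2)."""
    return statistics.pstdev(model_scores)

def majority_label(model_labels):
    """Majority vote over the virtual annotators' discrete labels (Subtask 1)."""
    return max(set(model_labels), key=model_labels.count)
```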
+
+
+
+
+
+
+ ♻ ☆ An Empirical Study of Catastrophic Forgetting in Large Language Models
+ During Continual Fine-tuning
+
+
+ Catastrophic forgetting (CF) is a phenomenon that occurs in machine learning
+when a model forgets previously learned information while acquiring new
+knowledge for achieving a satisfactory performance in downstream tasks. As
+large language models (LLMs) have demonstrated remarkable performance, it is
+intriguing to investigate whether CF exists during the continual instruction
+tuning of LLMs. This study empirically evaluates the forgetting phenomenon in
+LLMs' knowledge during continual instruction tuning from the perspectives of
+domain knowledge, reasoning, and reading comprehension. The experiments reveal
+that catastrophic forgetting is generally observed in LLMs ranging from 1b to
+7b parameters. Surprisingly, as the model scale increases, the severity of
+forgetting intensifies within this model scale range, which may result from
+the stronger initial performance of the larger LLMs. Comparing the
+decoder-only model BLOOMZ with the encoder-decoder model mT0, BLOOMZ exhibits
+less forgetting and retains more knowledge. Interestingly, we also observe that
+LLMs can mitigate language biases, such as gender bias, during continual
+fine-tuning. Furthermore, our findings indicate that general instruction tuning
+can help alleviate the forgetting phenomenon in LLMs during subsequent
+fine-tuning.
+
+
+
+
+
+
+
+ ♻ ☆ Distilling Fine-grained Sentiment Understanding from Large Language
+ Models
+
+
+ Fine-grained sentiment analysis (FSA) aims to extract and summarize user
+opinions from vast opinionated text. Recent studies demonstrate that large
+language models (LLMs) possess exceptional sentiment understanding
+capabilities. However, directly deploying LLMs for FSA applications incurs high
+inference costs. Therefore, this paper investigates the distillation of
+fine-grained sentiment understanding from LLMs into small language models
+(SLMs). We prompt LLMs to examine and interpret the sentiments of given reviews
+and then utilize the generated content to pretrain SLMs. Additionally, we
+develop a comprehensive FSA benchmark to evaluate both SLMs and LLMs. Extensive
+experiments on this benchmark reveal that: (1) distillation significantly
+enhances the performance of SLMs in FSA tasks, achieving a 6.00% improvement
+in $F_1$-score, and the distilled model can outperform Llama-2-7b with only
+220M parameters; (2) distillation equips SLMs with excellent zero-shot
+sentiment classification capabilities, enabling them to match or even exceed
+their teacher models. These results suggest that distillation from LLMs is a
+highly promising direction for FSA. We will release our code, data, and
+pretrained model weights at https://github.com/HITSZ-HLT/FSA-Distillation.
+
+
+
+
+
+
+
+
+ Hiroki Furuta, Kuang-Huei Lee, Shixiang Shane Gu, Yutaka Matsuo, Aleksandra Faust, Heiga Zen, Izzeddin Gur
+
+
+ Many algorithms for aligning LLMs with human preferences assume that human
+preferences are binary and deterministic. However, human preferences can vary
+across individuals, and therefore should be represented distributionally. In
+this work, we introduce the distributional soft preference labels and improve
+Direct Preference Optimization (DPO) with a weighted geometric average of the
+LLM output likelihood in the loss function. This approach adjusts the scale of
+learning loss based on the soft labels such that the loss would approach zero
+when the responses are closer to equally preferred. This simple modification
+can be easily applied to any DPO-based methods and mitigate over-optimization
+and objective mismatch, which prior works suffer from. Our experiments simulate
+the soft preference labels with AI feedback from LLMs and demonstrate that
+geometric averaging consistently improves performance on standard benchmarks
+for alignment research. In particular, we observe more preferable responses
+than binary labels and significant improvements where modestly-confident labels
+are in the majority.
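One way to read the geometric-averaging idea: with a soft label p for "response A preferred over B", geometrically averaging the two response likelihoods scales the usual DPO logit by (2p - 1), so the preference signal, and with it the gradient, fades as p approaches 0.5. A minimal sketch (the paper's exact loss may differ in details):

```python
import math

def soft_dpo_loss(logp_w, logp_l, ref_w, ref_l, p, beta=0.1):
    """DPO loss with a distributional soft label p in [0, 1].

    The standard DPO margin is scaled by (2p - 1): at p = 1 it recovers the
    usual binary-label loss, and at p = 0.5 the margin vanishes, flattening
    the loss for near-equally-preferred pairs.
    """
    margin = (2 * p - 1) * ((logp_w - ref_w) - (logp_l - ref_l))
    return math.log1p(math.exp(-beta * margin))  # -log(sigmoid(beta * margin))
```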
+
+
+
+
+
+
+
+
+ Simona Frenda, Andrea Piergentili, Beatrice Savoldi, Marco Madeddu, Martina Rosola, Silvia Casola, Chiara Ferrando, Viviana Patti, Matteo Negri, Luisa Bentivogli
+
+
+ Gender-fair language aims at promoting gender equality by using terms and
+expressions that include all identities and avoid reinforcing gender
+stereotypes. Implementing gender-fair strategies is particularly challenging in
+heavily gender-marked languages, such as Italian. To address this, the
+Gender-Fair Generation challenge intends to help shift toward gender-fair
+language in written communication. The challenge, designed to assess and
+monitor the recognition and generation of gender-fair language in both mono-
+and cross-lingual scenarios, includes three tasks: (1) the detection of
+gendered expressions in Italian sentences, (2) the reformulation of gendered
+expressions into gender-fair alternatives, and (3) the generation of
+gender-fair language in automatic translation from English to Italian. The
+challenge relies on three different annotated datasets: the GFL-it corpus,
+which contains Italian texts extracted from administrative documents provided
+by the University of Brescia; GeNTE, a bilingual test set for gender-neutral
+rewriting and translation built upon a subset of the Europarl dataset; and
+Neo-GATE, a bilingual test set designed to assess the use of non-binary
+neomorphemes in Italian for both fair formulation and translation tasks.
+Finally, each task is evaluated with specific metrics: for task 1, the average
+F1-score obtained by means of BERTScore computed on each entry of the datasets;
+for tasks 2 and 3, an accuracy measured with a gender-neutral classifier and a
+coverage-weighted accuracy.
+
+
+
+ comment: To refer to this paper please cite the CEUR-ws publication available
+ at https://ceur-ws.org/Vol-3878/
+
+
+
+
+
+
+
+ Pinelopi Papalampidi, Skanda Koppula, Shreya Pathak, Justin Chiu, Joe Heyward, Viorica Patraucean, Jiajun Shen, Antoine Miech, Andrew Zisserman, Aida Nematzadeh
+
+
+ Understanding long, real-world videos requires modeling of long-range visual
+dependencies. To this end, we explore video-first architectures, building on
+the common paradigm of transferring large-scale, image--text models to video
+via shallow temporal fusion. However, we expose two limitations to the
+approach: (1) decreased spatial capabilities, likely due to poor
+video--language alignment in standard video datasets, and (2) higher memory
+consumption, bottlenecking the number of frames that can be processed. To
+mitigate the memory bottleneck, we systematically analyze the memory/accuracy
+trade-off of various efficient methods: factorized attention,
+parameter-efficient image-to-video adaptation, input masking, and
+multi-resolution patchification. Surprisingly, simply masking large portions of
+the video (up to 75%) during contrastive pre-training proves to be one of the
+most robust ways to scale encoders to videos up to 4.3 minutes at 1 FPS. Our
+simple approach for training long video-to-text models, which scales to 1B
+parameters, does not add new architectural complexity and is able to outperform
+the popular paradigm of using much larger LLMs as an information aggregator
+over segment-based information on benchmarks with long-range temporal
+dependencies (YouCook2, EgoSchema).
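The masking recipe found most robust above is simple to state: before the video encoder runs, drop a fixed fraction (up to 75%) of patch tokens at random during contrastive pre-training. A toy version over a flat list of patch tokens:

```python
import random

def mask_video_patches(patches, mask_ratio=0.75, seed=0):
    """Keep a random (1 - mask_ratio) subset of patch tokens, preserving order."""
    rng = random.Random(seed)
    n_keep = max(1, round(len(patches) * (1 - mask_ratio)))
    keep_idx = sorted(rng.sample(range(len(patches)), n_keep))
    return [patches[i] for i in keep_idx], keep_idx
```

Because only the kept tokens enter attention, memory falls roughly with the square of the keep ratio, which is what lets the encoder scale to minutes-long clips.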
+
+
+
+
+
+
+
+ ♻ ☆ InfAlign: Inference-aware language model alignment
+
+
+
+
+
+
+
+
+ Ananth Balashankar, Ziteng Sun, Jonathan Berant, Jacob Eisenstein, Michael Collins, Adrian Hutter, Jong Lee, Chirag Nagpal, Flavien Prost, Aradhana Sinha, Ananda Theertha Suresh, Ahmad Beirami
+
+
+ Language model alignment has become a critical step in training modern
+generative language models. The goal of alignment is to finetune a reference
+model such that the win rate of a sample from the aligned model over a sample
+from the reference model is high, subject to a KL divergence constraint. Today,
+we are increasingly using inference-time algorithms (e.g., Best-of-N,
+controlled decoding, tree search) to decode from language models rather than
+standard sampling. However, the alignment objective does not capture such
+inference-time decoding procedures. We show that the existing alignment
+framework is sub-optimal in view of such inference-time methods. We then modify
+the alignment objective and propose a framework for inference-aware alignment
+(IAPO). We prove that for any inference-time decoding algorithm, the optimal
+solution that optimizes the inference-time win rate of the aligned policy
+against the reference policy is the solution to the typical RLHF problem with a
+transformation of the reward. This motivates us to provide the KL-regularized
+calibrate-and-transform RL (CTRL) algorithm to solve this problem, which
+involves a reward calibration step and a KL-regularized reward maximization
+step with a transformation of the calibrated reward. We particularize our study
+to two important inference-time strategies: best-of-N sampling and best-of-N
+jailbreaking, where N responses are sampled from the model and the one with the
+highest or lowest reward is selected. We propose specific transformations for
+these strategies and demonstrate that our framework offers significant
+improvements over existing state-of-the-art methods for language model
+alignment. Empirically, we outperform baselines that are designed without
+taking inference-time decoding into consideration by 8-12% and 4-9% on
+inference-time win rates over the Anthropic helpfulness and harmlessness dialog
+benchmark datasets.
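Two of the ingredients named in the abstract are easy to sketch in isolation: reward calibration maps a raw reward to its quantile under reference-policy samples, and best-of-N decoding returns the highest-reward response among N draws. (The transformation applied to the calibrated reward between these two steps is strategy-specific and omitted here.)

```python
import bisect
import random

def calibrate(reward, ref_rewards):
    """Calibration step: empirical CDF of the reward under reference samples."""
    scores = sorted(ref_rewards)
    return bisect.bisect_right(scores, reward) / len(scores)

def best_of_n(candidates, reward_fn, n=4, seed=0):
    """Best-of-N decoding: sample N responses, keep the highest-reward one."""
    rng = random.Random(seed)
    picks = rng.sample(candidates, min(n, len(candidates)))
    return max(picks, key=reward_fn)
```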
+
+
+
+
+
+
+
+
+ Eunsu Kim, Juyoung Suk, Seungone Kim, Niklas Muennighoff, Dongkwan Kim, Alice Oh
+
+
+ We introduce LLM-as-an-Interviewer, a novel paradigm for evaluating large
+language models (LLMs). This approach leverages multi-turn interactions where
+the LLM interviewer actively provides feedback on responses and poses follow-up
+questions to the evaluated LLM. At the start of the interview, the LLM
+interviewer dynamically modifies datasets to generate initial questions,
+mitigating data contamination. We apply the LLM-as-an-Interviewer framework to
+evaluate six models on the MATH and DepthQA tasks. Our results show that the
+framework effectively provides insights into LLM performance, including the
+quality of initial responses, adaptability to feedback, and ability to address
+follow-up queries like clarification or additional knowledge requests. The
+framework also addresses key limitations of conventional methods like
+LLM-as-a-Judge, including verbosity bias and inconsistency across runs.
+Finally, we propose the Interview Report, which aggregates insights from the
+interview process, providing examples and a comprehensive analysis of the LLM's
+strengths and weaknesses. This report offers a detailed snapshot of the model's
+real-world applicability. The code for our framework is publicly available at
+https://github.com/interview-eval/.
+
+
+
+
+
+
+
+
+ Andong Chen, Kehai Chen, Yang Xiang, Xuefeng Bai, Muyun Yang, Yang Feng, Tiejun Zhao, Min Zhang
+
+
+ The remarkable understanding and generation capabilities of large language
+models (LLMs) have greatly improved translation performance. However, incorrect
+understanding of the sentence to be translated can degrade translation quality.
+To address this issue, we proposed a novel Iterative Bilingual Understanding
+Translation (IBUT) method based on the cross-lingual capabilities of LLMs and
+the dual characteristics of translation tasks. The cross-lingual capability of
+LLMs enables the generation of contextual understanding for both the source and
+target languages separately. Furthermore, the dual characteristics allow IBUT
+to generate effective cross-lingual feedback, iteratively refining contextual
+understanding, thereby reducing errors and improving translation performance.
+Experimental results showed that the proposed IBUT outperforms several strong
+comparison methods, especially in generalizing to multiple domains (e.g.,
+news, commonsense, and cultural translation benchmarks).
+
+
+
+ comment: Work in progress
+
+
+
+
+
+
+ ♻ ☆ LLM-jp: A Cross-organizational Project for the Research and Development
+ of Fully Open Japanese LLMs
+
+
+ This paper introduces LLM-jp, a cross-organizational project for the research
+and development of Japanese large language models (LLMs). LLM-jp aims to
+develop open-source and strong Japanese LLMs, and as of this writing, more than
+1,500 participants from academia and industry are working together for this
+purpose. This paper presents the background of the establishment of LLM-jp,
+summaries of its activities, and technical reports on the LLMs developed by
+LLM-jp. For the latest activities, visit https://llm-jp.nii.ac.jp/en/.
+
+
+
+
+
+
+
+
+ Minh Le, Tien Ngoc Luu, An Nguyen The, Thanh-Thien Le, Trang Nguyen, Tung Thanh Nguyen, Linh Ngo Van, Thien Huu Nguyen
+
+
+ To address catastrophic forgetting in Continual Relation Extraction (CRE),
+many current approaches rely on memory buffers to rehearse previously learned
+knowledge while acquiring new tasks. Recently, prompt-based methods have
+emerged as potent alternatives to rehearsal-based strategies, demonstrating
+strong empirical performance. However, upon analyzing existing prompt-based
+approaches for CRE, we identified several critical limitations, such as
+inaccurate prompt selection, inadequate mechanisms for mitigating forgetting in
+shared parameters, and suboptimal handling of cross-task and within-task
+variances. To overcome these challenges, we draw inspiration from the
+relationship between prefix-tuning and mixture of experts, proposing a novel
+approach that employs a prompt pool for each task, capturing variations within
+each task while enhancing cross-task variances. Furthermore, we incorporate a
+generative model to consolidate prior knowledge within shared parameters,
+eliminating the need for explicit data storage. Extensive experiments validate
+the efficacy of our approach, demonstrating superior performance over
+state-of-the-art prompt-based and rehearsal-free methods in continual relation
+extraction.
+
+
+
+ comment: Accepted to AAAI 2025
+
+
+
+
+
+
+ ♻ ☆ A Modular-based Strategy for Mitigating Gradient Conflicts in
+ Simultaneous Speech Translation ICASSP 2025
+
+
+ Simultaneous Speech Translation (SimulST) involves generating target language
+text while continuously processing streaming speech input, presenting
+significant real-time challenges. Multi-task learning is often employed to
+enhance SimulST performance but introduces optimization conflicts between
+primary and auxiliary tasks, potentially compromising overall efficiency. The
+existing model-level conflict resolution methods are not well-suited for this
+task, which exacerbates inefficiencies and leads to high GPU memory consumption.
+To address these challenges, we propose a Modular Gradient Conflict Mitigation
+(MGCM) strategy that detects conflicts at a finer-grained modular level and
+resolves them utilizing gradient projection. Experimental results demonstrate
+that MGCM significantly improves SimulST performance, particularly under medium
+and high latency conditions, achieving a 0.68 BLEU score gain in offline tasks.
+Additionally, MGCM reduces GPU memory consumption by over 95% compared to
+other conflict mitigation methods, establishing it as a robust solution for
+SimulST tasks.
+
+
+
+ comment: Accepted to ICASSP 2025
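The gradient-projection step can be sketched for a single module's flattened gradients (a PCGrad-style rule; MGCM's contribution is applying detection and resolution at this finer modular granularity):

```python
def project_conflicting(g_primary, g_aux):
    """Resolve a gradient conflict between the primary and an auxiliary task.

    If the auxiliary gradient points against the primary one (negative dot
    product), project away its conflicting component; otherwise leave it.
    """
    dot = sum(a * p for a, p in zip(g_aux, g_primary))
    if dot >= 0:
        return list(g_aux)  # no conflict for this module
    norm_sq = sum(p * p for p in g_primary) or 1.0
    return [a - dot / norm_sq * p for a, p in zip(g_aux, g_primary)]
```

Running this per module, rather than on the full parameter vector, is what keeps the extra memory footprint small.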
+
+
+
+
+
+
+ ♻ ☆ Align Anything: Training All-Modality Models to Follow Instructions with
+ Language Feedback
+
+
+
+
+
+
+
+
+ Jiaming Ji, Jiayi Zhou, Hantao Lou, Boyuan Chen, Donghai Hong, Xuyao Wang, Wenqi Chen, Kaile Wang, Rui Pan, Jiahao Li, Mohan Wang, Josef Dai, Tianyi Qiu, Hua Xu, Dong Li, Weipeng Chen, Jun Song, Bo Zheng, Yaodong Yang
+
+
+ Reinforcement learning from human feedback (RLHF) has proven effective in
+enhancing the instruction-following capabilities of large language models;
+however, it remains underexplored in the cross-modality domain. As the number
+of modalities increases, aligning all-modality models with human intentions --
+such as instruction following -- becomes a pressing challenge. In this work, we
+make the first attempt to fine-tune all-modality models (i.e. input and output
+with any modality, also named any-to-any models) using human preference data
+across all modalities (including text, image, audio, and video), ensuring its
+behavior aligns with human intentions. This endeavor presents several
+challenges. First, there is no large-scale all-modality human preference data
+in existing open-source resources, as most datasets are limited to specific
+modalities, predominantly text and image. Secondly, the effectiveness of binary
+preferences in RLHF for post-training alignment in complex all-modality
+scenarios remains an unexplored area. Finally, there is a lack of a systematic
+framework to evaluate the capabilities of all-modality models, particularly
+regarding modality selection and synergy. To address these challenges, we
+propose the align-anything framework, which includes meticulously annotated
+200k all-modality human preference data. Then, we introduce an alignment method
+that learns from unified language feedback, effectively capturing complex
+modality-specific human preferences and enhancing the model's
+instruction-following capabilities. Furthermore, to assess performance
+improvements in all-modality models after post-training alignment, we construct
+a challenging all-modality capability evaluation framework -- eval-anything.
+All data, models, and code frameworks have been open-sourced for the community.
+For more details, please refer to
+https://github.com/PKU-Alignment/align-anything.
+
+
+
+
+
+
+
+ ♻ ☆ Large Language Models for Classical Chinese Poetry Translation:
+ Benchmarking, Evaluating, and Improving
+
+
+ Different from the traditional translation tasks, classical Chinese poetry
+translation requires both adequacy and fluency in translating culturally and
+historically significant content and linguistic poetic elegance. Large language
+models (LLMs) with impressive multilingual capabilities may bring a ray of hope
+to achieve this extreme translation demand. This paper first introduces a
+suitable benchmark (PoetMT) in which each Chinese poem has a recognized elegant
+translation. Meanwhile, we propose a new metric based on GPT-4 to evaluate the
+extent to which current LLMs can meet these demands. Our empirical evaluation
+reveals that the existing LLMs fall short in the challenging task. Hence, we
+propose a Retrieval-Augmented Machine Translation (RAT) method which
+incorporates knowledge related to classical poetry for advancing the
+translation of Chinese Poetry in LLMs. Experimental results show that RAT
+consistently outperforms all comparison methods regarding widely used BLEU,
+COMET, BLEURT, our proposed metric, and human evaluation.
+
+
+ Ensuring large language models (LLMs) behave consistently with human goals,
+values, and intentions is crucial for their safety, yet computationally
+expensive. To reduce the computational cost of alignment training of LLMs,
+especially for those with a huge number of parameters, and to reutilize learned
+value alignment, we propose ConTrans, a novel framework that enables
+weak-to-strong alignment transfer via concept transplantation. From the
+perspective of representation engineering, ConTrans refines concept vectors in
+value alignment from a source LLM (usually a weak yet aligned LLM). The refined
+concept vectors are then reformulated to adapt to the target LLM (usually a
+strong yet unaligned base LLM) via affine transformation. In the third step,
+ConTrans transplants the reformulated concept vectors into the residual stream
+of the target LLM. Experiments demonstrate the successful transplantation of a
+wide range of aligned concepts from 7B models to 13B and 70B models across
+multiple LLMs and LLM families. Remarkably, ConTrans even surpasses
+instruction-tuned models in terms of truthfulness. Experiment results validate
+the effectiveness of both inter-LLM-family and intra-LLM-family concept
+transplantation. Our work successfully demonstrates an alternative way to
+achieve weak-to-strong alignment generalization and control.
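The three ConTrans steps (refine, reformulate, transplant) can be caricatured with difference-in-means concept vectors and an affine map into the target model's hidden space (an illustrative sketch; the function names and the difference-in-means recipe are assumptions here, not the paper's exact procedure):

```python
def concept_vector(pos_acts, neg_acts):
    """Refine: difference-in-means concept vector from source-model activations
    on concept-positive vs. concept-negative prompts."""
    dim = len(pos_acts[0])
    mean = lambda rows, j: sum(r[j] for r in rows) / len(rows)
    return [mean(pos_acts, j) - mean(neg_acts, j) for j in range(dim)]

def transplant(residual, concept, weight, bias, alpha=1.0):
    """Reformulate + transplant: affinely map the concept and add it to the
    target model's residual stream, h <- h + alpha * (W v + b)."""
    mapped = [sum(w * c for w, c in zip(row, concept)) + b
              for row, b in zip(weight, bias)]
    return [h + alpha * m for h, m in zip(residual, mapped)]
```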
+
+
+
+
+
+
+
+ ♻ ☆ Large Language Model-Brained GUI Agents: A Survey
+
+
+ GUIs have long been central to human-computer interaction, providing an
+intuitive and visually-driven way to access and interact with digital systems.
+The advent of LLMs, particularly multimodal models, has ushered in a new era of
+GUI automation. They have demonstrated exceptional capabilities in natural
+language understanding, code generation, and visual processing. This has paved
+the way for a new generation of LLM-brained GUI agents capable of interpreting
+complex GUI elements and autonomously executing actions based on natural
+language instructions. These agents represent a paradigm shift, enabling users
+to perform intricate, multi-step tasks through simple conversational commands.
+Their applications span across web navigation, mobile app interactions, and
+desktop automation, offering a transformative user experience that
+revolutionizes how individuals interact with software. This emerging field is
+rapidly advancing, with significant progress in both research and industry.
+ To provide a structured understanding of this trend, this paper presents a
+comprehensive survey of LLM-brained GUI agents, exploring their historical
+evolution, core components, and advanced techniques. We address research
+questions such as existing GUI agent frameworks, the collection and utilization
+of data for training specialized GUI agents, the development of large action
+models tailored for GUI tasks, and the evaluation metrics and benchmarks
+necessary to assess their effectiveness. Additionally, we examine emerging
+applications powered by these agents. Through a detailed analysis, this survey
+identifies key research gaps and outlines a roadmap for future advancements in
+the field. By consolidating foundational knowledge and state-of-the-art
+developments, this work aims to guide both researchers and practitioners in
+overcoming challenges and unlocking the full potential of LLM-brained GUI
+agents.
+
+
+
+ comment: The collection of papers reviewed in this survey will be hosted and
+ regularly updated on the GitHub repository:
+ https://github.com/vyokky/LLM-Brained-GUI-Agents-Survey Additionally, a
+ searchable webpage is available at https://aka.ms/gui-agent for easier access
+ and exploration
+
+
+
+
+
+
+ ♻ ☆ Yi: Open Foundation Models by 01.AI
+
+
+ We introduce the Yi model family, a series of language and multimodal models
+that demonstrate strong multi-dimensional capabilities. The Yi model family is
+based on 6B and 34B pretrained language models, which we then extend to chat
+models, 200K long context models, depth-upscaled models, and vision-language
+models. Our base models achieve strong performance on a wide range of
+benchmarks like MMLU, and our finetuned chat models deliver strong human
+preference rate on major evaluation platforms like AlpacaEval and Chatbot
+Arena. Building upon our scalable super-computing infrastructure and the
+classical transformer architecture, we attribute the performance of Yi models
+primarily to their data quality, resulting from our data-engineering efforts. For
+pretraining, we construct 3.1 trillion tokens of English and Chinese corpora
+using a cascaded data deduplication and quality filtering pipeline. For
+finetuning, we polish a small scale (less than 10K) instruction dataset over
+multiple iterations such that every single instance has been verified directly
+by our machine learning engineers. For vision-language, we combine the chat
+language model with a vision transformer encoder and train the model to align
+visual representations to the semantic space of the language model. We further
+extend the context length to 200K through lightweight continual pretraining and
+demonstrate strong needle-in-a-haystack retrieval performance. We show that
+extending the depth of the pretrained checkpoint through continual pretraining
+further improves performance. We believe that given our current results,
+continuing to scale up model parameters using thoroughly optimized data will
+lead to even stronger frontier models.
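The cascaded deduplication and quality-filtering idea described above can be sketched as a toy pipeline; the hash-based exact dedup and the length heuristic below are illustrative assumptions, not the paper's actual pipeline:

```python
import hashlib

def dedup_and_filter(docs, min_len=20):
    """Toy cascade: exact-hash deduplication, then a simple quality filter."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if h in seen:              # stage 1: drop exact (case-folded) duplicates
            continue
        seen.add(h)
        if len(doc) < min_len:     # stage 2: crude quality heuristic (length)
            continue
        kept.append(doc)
    return kept

corpus = ["A long enough training document about transformers.",
          "a long enough training document about transformers.",
          "too short"]
print(dedup_and_filter(corpus))    # keeps only the first document
```

A production pipeline would add fuzzy (near-duplicate) dedup and learned quality classifiers on top of this cascade.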
+
+
+
+
+
+
+
+ ♻ ☆ EHRCon: Dataset for Checking Consistency between Unstructured Notes and
+ Structured Tables in Electronic Health Records
+
+
+
+
+
+
+
+
+ Yeonsu Kwon, Jiho Kim, Gyubok Lee, Seongsu Bae, Daeun Kyung, Wonchul Cha, Tom Pollard, Alistair Johnson, Edward Choi
+
+
+ Electronic Health Records (EHRs) are integral for storing comprehensive
+patient medical records, combining structured data (e.g., medications) with
+detailed clinical notes (e.g., physician notes). These elements are essential
+for straightforward data retrieval and provide deep, contextual insights into
+patient care. However, they often suffer from discrepancies due to unintuitive
+EHR system designs and human errors, posing serious risks to patient safety. To
+address this, we developed EHRCon, a new dataset and task specifically designed
+to ensure data consistency between structured tables and unstructured notes in
+EHRs. EHRCon was crafted in collaboration with healthcare professionals using
+the MIMIC-III EHR dataset, and includes manual annotations of 4,101 entities
+across 105 clinical notes checked against database entries for consistency.
+EHRCon has two versions, one using the original MIMIC-III schema, and another
+using the OMOP CDM schema, in order to increase its applicability and
+generalizability. Furthermore, leveraging the capabilities of large language
+models, we introduce CheckEHR, a novel framework for verifying the consistency
+between clinical notes and database tables. CheckEHR utilizes an eight-stage
+process and shows promising results in both few-shot and zero-shot settings.
+The code is available at https://github.com/dustn1259/EHRCon.
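A note-versus-table consistency check of the kind EHRCon targets can be illustrated with a toy example; the regex and the flat dose table below are illustrative assumptions, whereas the actual CheckEHR framework is an eight-stage LLM pipeline:

```python
import re

def check_consistency(note: str, table: dict):
    """Toy sketch: extract '<drug> <dose>mg' mentions from free text and
    compare them against structured medication entries (hypothetical schema)."""
    findings = {}
    for drug, dose in re.findall(r"(\w+) (\d+)mg", note):
        recorded = table.get(drug.lower())
        findings[drug.lower()] = (recorded is not None and int(dose) == recorded)
    return findings

note = "Patient continued on aspirin 81mg daily and metformin 500mg."
table = {"aspirin": 81, "metformin": 1000}   # structured medication doses
print(check_consistency(note, table))        # {'aspirin': True, 'metformin': False}
```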
+
+
+
+
+
+
+
+ ♻ ☆ Aligning the Objective of LLM-based Program Repair ICSE'25
+
+
+ Large language models (LLMs) have achieved decent results on automated
+program repair (APR). However, the next token prediction training objective of
+decoder-only LLMs (e.g., GPT-4) is misaligned with the masked span prediction
+objective of current infilling-style methods, which impedes LLMs from fully
+leveraging pre-trained knowledge for program repair. In addition, while some
+LLMs can locate and repair bugs in certain functions using the related
+artifacts (e.g., test cases), existing methods still depend on statement-level
+fault localization methods to provide a list of buggy hunks for repair. This
+restriction hinders LLMs from exploring potential patches beyond the given
+locations.
+ In this paper, we investigate a new approach to adapt LLMs to program repair.
+Our core insight is that LLMs' APR capability can be greatly improved by simply
+aligning the output to their training objective and allowing them to refine the
+whole program without first identifying faulty statements. Based on this
+insight, we designed D4C, a straightforward prompting framework for APR. D4C
+can repair 180 bugs correctly in Defects4J, with each patch being sampled only
+10 times. This surpasses the SOTA APR methods with perfect fault localization
+by 10% and reduces the patch sampling number by 90%. Our findings reveal that
+(1) objective alignment is crucial for fully exploiting LLMs' pre-trained
+capability, and (2) replacing the traditional localize-buggy-hunks-then-repair
+workflow with direct debugging is more effective for LLM-based APR methods.
+Thus, we believe this paper introduces a new mindset for harnessing LLMs in
+APR.
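The direct-debugging idea can be sketched as a prompt builder: hand the model the whole program plus its failing artifacts and ask for a complete rewrite, rather than masking pre-localized buggy hunks. The prompt wording below is an illustrative assumption, not D4C's exact template:

```python
def build_repair_prompt(program: str, failing_test: str, error_log: str) -> str:
    """Assemble a whole-program repair prompt from the related artifacts."""
    return (
        "The following program fails its test.\n"
        f"### Program\n{program}\n"
        f"### Failing test\n{failing_test}\n"
        f"### Error log\n{error_log}\n"
        "Rewrite the complete, corrected program:"
    )

prompt = build_repair_prompt("def add(a, b): return a - b",
                             "assert add(1, 2) == 3",
                             "AssertionError")
print(prompt)
```

The LLM's sampled completions would then be validated by re-running the failing test, with no statement-level fault localization in the loop.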
+
+
+
+
+
+
+
+
+ Taewhan Kim, Soeun Lee, Si-Woo Kim, Dong-Jin Kim
+
+
+ Recent lightweight image captioning models using retrieved data mainly focus
+on text prompts. However, previous works use the retrieved text only as text
+prompts, while the visual information relies solely on the CLIP visual
+embedding. As a result, the image descriptions inherent in the prompt are not
+sufficiently reflected in the visual embedding space. To tackle this issue, we
+propose ViPCap, a novel
+retrieval text-based visual prompt for lightweight image captioning. ViPCap
+leverages the retrieved text with image information as visual prompts to
+enhance the ability of the model to capture relevant visual information. By
+mapping text prompts into the CLIP space and generating multiple randomized
+Gaussian distributions, our method leverages sampling to explore randomly
+augmented distributions and effectively retrieves the semantic features that
+contain image information. These retrieved features are integrated into the
+image and designated as the visual prompt, leading to performance improvements
+on the datasets such as COCO, Flickr30k, and NoCaps. Experimental results
+demonstrate that ViPCap significantly outperforms prior lightweight captioning
+models in efficiency and effectiveness, demonstrating the potential for a
+plug-and-play solution.
+
+
+
+ comment: Accepted to AAAI 2025
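The randomized-Gaussian sampling step can be sketched with NumPy: perturb the retrieved-text embedding with Gaussian noise and keep the samples most similar to the image embedding, so the visual prompt carries text-described image semantics. The dimensions, noise scale, and cosine-similarity selection are illustrative assumptions:

```python
import numpy as np

def sample_visual_prompts(text_emb, image_emb, n_samples=8, sigma=0.1, seed=0):
    """Sample noisy variants of the text embedding and keep the half that
    best matches the image embedding (sketch of the ViPCap idea)."""
    rng = np.random.default_rng(seed)
    samples = text_emb + sigma * rng.standard_normal((n_samples, text_emb.size))
    sims = samples @ image_emb / (
        np.linalg.norm(samples, axis=1) * np.linalg.norm(image_emb) + 1e-8)
    return samples[np.argsort(sims)[::-1][: n_samples // 2]]  # top half

text_emb = np.ones(4)
image_emb = np.ones(4)
prompts = sample_visual_prompts(text_emb, image_emb)
print(prompts.shape)  # (4, 4)
```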
+
+
+
+
+
+
+ ♻ ☆ SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language
+ Models
+
+
+ Large language models (LLMs) have seen widespread adoption due to their
+remarkable performance across various applications, driving the accelerated
+development of a large number of diverse LLMs. However, these individual LLMs
+show limitations in generalization and performance on complex tasks due to
+inherent training biases, model size constraints, and the quality or diversity
+of pre-training datasets. A promising direction is to efficiently harness the
+diverse capabilities of LLMs to overcome these individual limitations. To
+address these limitations, we introduce a novel LLM selection algorithm called
+SelectLLM, which efficiently directs input queries to the most suitable subset
+of LLMs from a large pool, ensuring that the selected models collectively
+provide accurate responses. SelectLLM employs a multi-label classifier and a
+policy based on the classifier's predictions and confidence scores to select
+an optimal, query-aware, and lightweight subset of LLMs. Our findings indicate
+that the proposed model outperforms existing ensemble-based baselines and
+achieves competitive performance with similarly sized top-performing LLMs while
+maintaining efficiency. Specifically, it achieves a substantial reduction in inference
+latency on two challenging reasoning benchmarks: 13% on GSM8K and 70% on MMLU,
+compared to the top-performing baselines. Also, we establish a theoretical
+upper bound by an oracle with LLMs and explore in-depth linguistic analysis to
+understand the performance gap between Oracle and SelectLLM.
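Confidence-based routing of this kind can be sketched as follows; the per-model scores, threshold, and subset cap are illustrative assumptions, not SelectLLM's learned policy:

```python
def select_llms(scores, threshold=0.5, max_models=2):
    """Keep the highest-confidence models whose predicted correctness
    exceeds a threshold, capped at a small subset (routing sketch)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    subset = [name for name, p in ranked if p >= threshold][:max_models]
    return subset or [ranked[0][0]]   # fall back to the single best model

# Hypothetical classifier outputs: P(model answers this query correctly)
scores = {"llm-a": 0.9, "llm-b": 0.4, "llm-c": 0.7}
print(select_llms(scores))  # ['llm-a', 'llm-c']
```

Because low-confidence models are never invoked, such a policy trades a small classification cost for large savings in inference latency.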
+
+
+
+
+
+
+
+ ♻ ☆ Automated Review Generation Method Based on Large Language Models
+
+
+ Literature research, vital for scientific work, faces the challenge of
+surging information volumes exceeding researchers' processing capabilities. We
+present an automated review generation method based on large language models
+(LLMs) to overcome efficiency bottlenecks and reduce cognitive load. Our
+statistically validated evaluation framework demonstrates that the generated
+reviews match or exceed manual quality, offering broad applicability across
+research fields without requiring users' domain knowledge. Applied to propane
+dehydrogenation (PDH) catalysts, our method swiftly analyzed 343 articles,
+averaging seconds per article per LLM account, producing comprehensive reviews
+spanning 35 topics, with extended analysis of 1041 articles providing insights
+into catalysts' properties. Through multi-layered quality control, we
+effectively mitigated LLMs' hallucinations, with expert verification confirming
+accuracy and citation integrity while demonstrating hallucination risks reduced
+to below 0.5% with 95% confidence. A released Windows application enables
+one-click review generation, enhancing research productivity and literature
+recommendation efficiency while setting the stage for broader scientific
+explorations.
+
+
+
+ comment: 21 pages, 5 figures, 1 table. Code:
+ https://github.com/TJU-ECAT-AI/AutomaticReviewGeneration Data:
+ https://github.com/TJU-ECAT-AI/AutomaticReviewGenerationData This research
+ has been invited for a Short Oral presentation at the 18th ICC -
+ International Congress on Catalysis, taking place in Lyon, France from July
+ 14-19, 2024
+
+
+
+
+
+
+
+ Chengwei Qin, Aston Zhang, Chen Chen, Anirudh Dagar, Wenming Ye
+
+
+ Spurred by advancements in scale, large language models (LLMs) have
+demonstrated strong few-shot learning ability via in-context learning (ICL).
+However, the performance of ICL has been shown to be highly sensitive to the
+selection of few-shot demonstrations. Selecting the most suitable examples as
+context remains an ongoing challenge and an open problem. Existing literature
+has highlighted the importance of selecting examples that are diverse or
+semantically similar to the test sample while ignoring the fact that the
+optimal selection dimension, i.e., diversity or similarity, is task-specific.
+Based on how the test sample is answered, we propose Iterative Demonstration
+Selection (IDS) to leverage the merits of both dimensions. Using zero-shot
+chain-of-thought reasoning (Zero-shot-CoT), IDS iteratively selects examples
+that are diverse but still strongly correlated with the test sample as ICL
+demonstrations. Specifically, IDS applies Zero-shot-CoT to the test sample
+before demonstration selection. The output reasoning path is then used to
+choose demonstrations that are prepended to the test sample for inference. The
+generated answer is followed by its corresponding reasoning path for extracting
+a new set of demonstrations in the next iteration. After several iterations,
+IDS adopts majority voting to obtain the final result. Through extensive
+experiments on tasks including reasoning, question answering, and topic
+classification, we demonstrate that IDS can consistently outperform existing
+ICL demonstration selection methods.
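The iterative loop described above can be sketched with stand-in functions; `generate` and `select_by_reasoning` below are hypothetical stubs for the LLM call and the reasoning-path-based retriever:

```python
from collections import Counter

def iterative_demo_selection(query, pool, generate, select_by_reasoning,
                             n_iters=3):
    """IDS-style loop sketch: each iteration answers with the current
    demonstrations, then uses the reasoning path to pick the next set;
    the final answer is a majority vote over iterations."""
    answers, demos = [], []
    for _ in range(n_iters):
        answer, reasoning = generate(query, demos)    # answer + reasoning path
        demos = select_by_reasoning(reasoning, pool)  # next-round demonstrations
        answers.append(answer)
    return Counter(answers).most_common(1)[0][0]      # majority vote

# Toy stand-ins: the "LLM" answers '42' whenever any demonstration is present.
gen = lambda q, demos: (("42" if demos else "0"), "reasoning path")
sel = lambda reasoning, pool: pool[:2]
print(iterative_demo_selection("q", ["d1", "d2", "d3"], gen, sel))  # 42
```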
+
+
+
+
+
+
+
+ ♻ ☆ Memorization Over Reasoning? Exposing and Mitigating Verbatim
+ Memorization in Large Language Models' Character Understanding Evaluation
+
+
+ Recently, Large Language Models (LLMs) have shown impressive performance in
+character understanding tasks, such as analyzing the roles, personalities, and
+relationships of fictional characters. However, the extensive pre-training
+corpora used by LLMs raise concerns that they may rely on memorizing popular
+fictional works rather than genuinely understanding and reasoning about them.
+In this work, we argue that 'gist memory' - capturing essential meaning -
+should be the primary mechanism for character understanding tasks, as opposed
+to 'verbatim memory' - the exact matching of strings. We introduce a simple yet
+effective method to mitigate mechanized memorization in character understanding
+evaluations while preserving the essential implicit cues needed for
+comprehension and reasoning. Our approach reduces memorization-driven
+performance on popular fictional works from 96% accuracy to 72% and results in
+up to an 18% drop in accuracy across various character understanding tasks.
+These findings underscore the issue of data contamination in existing
+benchmarks, which often measure memorization rather than true character
+understanding.
+
+
+
+
+
+
+
+ ♻ ☆ Augmenting Biomedical Named Entity Recognition with General-domain
+ Resources
+
+
+ Training a neural network-based biomedical named entity recognition (BioNER)
+model usually requires extensive and costly human annotations. While several
+studies have employed multi-task learning with multiple BioNER datasets to
+reduce human effort, this approach does not consistently yield performance
+improvements and may introduce label ambiguity in different biomedical corpora.
+We aim to tackle those challenges through transfer learning from easily
+accessible resources with fewer concept overlaps with biomedical datasets. We
+proposed GERBERA, a simple-yet-effective method that utilized general-domain
+NER datasets for training. We performed multi-task learning to train a
+pre-trained biomedical language model with both the target BioNER dataset and
+the general-domain dataset. Subsequently, we fine-tuned the models specifically
+for the BioNER dataset. We systematically evaluated GERBERA on five datasets of
+eight entity types, collectively consisting of 81,410 instances. Despite using
+fewer biomedical resources, our models demonstrated superior performance
+compared to baseline models trained with additional BioNER datasets.
+Specifically, our models consistently outperformed the baseline models in six
+out of eight entity types, achieving an average improvement of 0.9% over the
+best baseline performance across the eight entity types. Our method was especially
+effective in amplifying performance on BioNER datasets characterized by limited
+data, with a 4.7% improvement in F1 scores on the JNLPBA-RNA dataset. This
+study introduces a new training method that leverages cost-effective
+general-domain NER datasets to augment BioNER models. This approach
+significantly improves BioNER model performance, making it a valuable asset for
+scenarios with scarce or costly biomedical datasets.
+
+
+
+ comment: Published in JBI 2024. We make data, codes, and models publicly
+ available via https://github.com/qingyu-qc/bioner_gerbera
+
+
+
+
+
+
+ ♻ ☆ Next Token Prediction Towards Multimodal Intelligence: A Comprehensive
+ Survey
+
+
+ Building on the foundations of language modeling in natural language
+processing, Next Token Prediction (NTP) has evolved into a versatile training
+objective for machine learning tasks across various modalities, achieving
+considerable success. As Large Language Models (LLMs) have advanced to unify
+understanding and generation tasks within the textual modality, recent research
+has shown that tasks from different modalities can also be effectively
+encapsulated within the NTP framework, transforming the multimodal information
+into tokens and predicting the next one given the context. This survey introduces
+a comprehensive taxonomy that unifies both understanding and generation within
+multimodal learning through the lens of NTP. The proposed taxonomy covers five
+key aspects: Multimodal tokenization, MMNTP model architectures, unified task
+representation, datasets & evaluation, and open challenges. This new taxonomy
+aims to aid researchers in their exploration of multimodal intelligence. An
+associated GitHub repository collecting the latest papers and repos is
+available at https://github.com/LMM101/Awesome-Multimodal-Next-Token-Prediction
+
+
+
+
+
+
+
+ ♻ ☆ Large-scale moral machine experiment on large language models
+
+
+
+
+
+
+
+
+ Muhammad Shahrul Zaim bin Ahmad, Kazuhiro Takemoto
+
+
+ The rapid advancement of Large Language Models (LLMs) and their potential
+integration into autonomous driving systems necessitates understanding their
+moral decision-making capabilities. While our previous study examined four
+prominent LLMs using the Moral Machine experimental framework, the dynamic
+landscape of LLM development demands a more comprehensive analysis. Here, we
+evaluate moral judgments across 52 different LLMs, including multiple versions
+of proprietary models (GPT, Claude, Gemini) and open-source alternatives
+(Llama, Gemma), to assess their alignment with human moral preferences in
+autonomous driving scenarios. Using a conjoint analysis framework, we evaluated
+how closely LLM responses aligned with human preferences in ethical dilemmas
+and examined the effects of model size, updates, and architecture. Results
+showed that proprietary models and open-source models exceeding 10 billion
+parameters demonstrated relatively close alignment with human judgments, with a
+significant negative correlation between model size and distance from human
+judgments in open-source models. However, model updates did not consistently
+improve alignment with human preferences, and many LLMs showed excessive
+emphasis on specific ethical principles. These findings suggest that while
+increasing model size may naturally lead to more human-like moral judgments,
+practical implementation in autonomous driving systems requires careful
+consideration of the trade-off between judgment quality and computational
+efficiency. Our comprehensive analysis provides crucial insights for the
+ethical design of autonomous systems and highlights the importance of
+considering cultural contexts in AI moral decision-making.
+
+
+ Long-context LLMs are increasingly in demand for applications such as
+retrieval-augmented generation. To defray the cost of pretraining LLMs over
+long contexts, recent work takes an approach of synthetic context extension:
+fine-tuning LLMs with synthetically generated long-context data in a
+post-training stage. However, it remains unclear how and why this synthetic
+context extension imparts abilities for downstream long-context tasks. In this
+paper, we investigate fine-tuning on synthetic data for three long-context
+tasks that require retrieval and reasoning. We vary the realism of "needle"
+concepts to be retrieved and diversity of the surrounding "haystack" context,
+from using LLMs to construct synthetic documents to using templated relations
+and creating symbolic datasets. We find that models trained on synthetic data
+fall short of the real data, but surprisingly, the mismatch can be interpreted
+and even predicted in terms of a special set of attention heads that are
+responsible for retrieval over long context, retrieval heads (Wu et al., 2024).
+The retrieval heads learned on synthetic data have high overlap with retrieval
+heads learned on real data, and there is a strong correlation between the
+recall of heads learned and the downstream performance of a model. Furthermore,
+with attention knockout and activation patching, we mechanistically show that
+retrieval heads are necessary and explain model performance, although they are
+not totally sufficient. Our results shed light on how to interpret synthetic
+data fine-tuning performance and how to approach creating better data for
+learning real-world capabilities over long contexts.
+
+
+
+
+
+
+
+ ♻ ☆ Exploring Concept Depth: How Large Language Models Acquire Knowledge at
+ Different Layers? COLING 2025
+
+
+ Large language models (LLMs) have shown remarkable performances across a wide
+range of tasks. However, the mechanisms by which these models encode tasks of
+varying complexities remain poorly understood. In this paper, we explore the
+hypothesis that LLMs process concepts of varying complexities in different
+layers, introducing the idea of "Concept Depth" to suggest that more complex
+concepts are typically acquired in deeper layers. Specifically, we categorize
+concepts based on their level of abstraction, defining them in the order of
+increasing complexity within factual, emotional, and inferential tasks. We
+conduct extensive probing experiments using layer-wise representations across
+various LLM families (Gemma, LLaMA, Qwen) on various datasets spanning the
+three domains of tasks. Our findings reveal that simpler tasks can be probed
+efficiently in shallow layers, while more complex tasks typically necessitate
+deeper layers for accurate understanding. Additionally,
+we examine how external factors, such as adding noise to the input and
+quantizing the model weights, might affect layer-wise representations. Our
+findings suggest that these factors can impede the development of a conceptual
+understanding of LLMs until deeper layers are explored. We hope that our
+proposed concept and experimental insights will enhance the understanding of
+the mechanisms underlying LLMs. Our code is available at
+https://github.com/Luckfort/CD.
+
+
+ Reasoning is critical for large language models (LLMs) to excel in a wide
+range of tasks. While methods like Chain-of-Thought (CoT) reasoning enhance LLM
+performance by decomposing problems into intermediate steps, they also incur
+significant overhead in token usage, leading to increased costs. We find that
+the reasoning process of current LLMs is unnecessarily lengthy and can be
+compressed by including a reasonable token budget in the prompt, but the choice
+of token budget plays a crucial role in the actual compression effectiveness.
+We then propose a token-budget-aware LLM reasoning framework, which dynamically
+estimates token budgets for different problems based on reasoning complexity
+and uses the estimated token budgets to guide the reasoning process.
+Experiments show that our method effectively reduces token costs in CoT
+reasoning with only a slight performance reduction, offering a practical
+solution to balance efficiency and accuracy in LLM reasoning. Code:
+https://github.com/GeniusHTX/TALE.
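Budget-aware prompting of this kind can be sketched in a few lines; the prompt phrasing and the word-count complexity heuristic are illustrative assumptions, not TALE's exact estimator:

```python
def budget_prompt(question: str, budget: int) -> str:
    """Append an explicit token budget to compress chain-of-thought output."""
    return (f"{question}\n"
            f"Let's think step by step and use less than {budget} tokens:")

def estimate_budget(question: str, base=16, per_word=2) -> int:
    # crude proxy: longer questions get larger reasoning budgets
    return base + per_word * len(question.split())

q = "If a train travels 60 km in 1.5 hours, what is its average speed?"
print(budget_prompt(q, estimate_budget(q)))
```

The key point from the abstract is that the budget must be estimated per problem: a fixed budget either over-compresses hard problems or wastes tokens on easy ones.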
+
+
+
+
+
+
+
+ ♻ ☆ Large Language Models-guided Dynamic Adaptation for Temporal Knowledge
+ Graph Reasoning
+
+
+ Temporal Knowledge Graph Reasoning (TKGR) is the process of utilizing
+temporal information to capture complex relations within a Temporal Knowledge
+Graph (TKG) to infer new knowledge. Conventional methods in TKGR typically
+depend on deep learning algorithms or temporal logical rules. However, deep
+learning-based TKGRs often lack interpretability, whereas rule-based TKGRs
+struggle to effectively learn temporal rules that capture temporal patterns.
+Recently, Large Language Models (LLMs) have demonstrated extensive knowledge
+and remarkable proficiency in temporal reasoning. Consequently, the employment
+of LLMs for Temporal Knowledge Graph Reasoning (TKGR) has sparked increasing
+interest among researchers. Nonetheless, LLMs are known to function as black
+boxes, making it challenging to comprehend their reasoning process.
+Additionally, due to the resource-intensive nature of fine-tuning, promptly
+updating LLMs to integrate evolving knowledge within TKGs for reasoning is
+impractical. To address these challenges, in this paper, we propose a Large
+Language Models-guided Dynamic Adaptation (LLM-DA) method for reasoning on
+TKGs. Specifically, LLM-DA harnesses the capabilities of LLMs to analyze
+historical data and extract temporal logical rules. These rules unveil temporal
+patterns and facilitate interpretable reasoning. To account for the evolving
+nature of TKGs, a dynamic adaptation strategy is proposed to update the
+LLM-generated rules with the latest events. This ensures that the extracted
+rules always incorporate the most recent knowledge and better generalize to the
+predictions on future events. Experimental results show that without the need
+of fine-tuning, LLM-DA significantly improves the accuracy of reasoning over
+several common datasets, providing a robust framework for TKGR tasks.
+
+
+
+
+
+
+
+
+
+
+ Information Retrieval 15
+
+
+
+
+
+ ☆ Towards Identity-Aware Cross-Modal Retrieval: a Dataset and a Baseline ECIR 2025
+
+
+ Recent advancements in deep learning have significantly enhanced
+content-based retrieval methods, notably through models like CLIP that map
+images and texts into a shared embedding space. However, these methods often
+struggle with domain-specific entities and long-tail concepts absent from their
+training data, particularly in identifying specific individuals. In this paper,
+we explore the task of identity-aware cross-modal retrieval, which aims to
+retrieve images of persons in specific contexts based on natural language
+queries. This task is critical in various scenarios, such as for searching and
+browsing personalized video collections or large audio-visual archives
+maintained by national broadcasters. We introduce a novel dataset, COCO Person
+FaceSwap (COCO-PFS), derived from the widely used COCO dataset and enriched
+with deepfake-generated faces from VGGFace2. This dataset addresses the lack of
+large-scale datasets needed for training and evaluating models for this task.
+Our experiments assess the performance of different CLIP variations repurposed
+for this task, including our architecture, Identity-aware CLIP (Id-CLIP), which
+achieves competitive retrieval performance through targeted fine-tuning. Our
+contributions lay the groundwork for more robust cross-modal retrieval systems
+capable of recognizing long-tail identities and contextual nuances. Data and
+code are available at https://github.com/mesnico/IdCLIP.
+
+
+
+ comment: Accepted as full paper at ECIR 2025
+
+
+
+
+
+
+ ☆ Rise of Generative Artificial Intelligence in Science
+
+
+ Generative Artificial Intelligence (GenAI, generative AI) has rapidly become
+available as a tool in scientific research. To explore the use of generative AI
+in science, we conduct an empirical analysis using OpenAlex. Analyzing GenAI
+publications and other AI publications from 2017 to 2023, we profile growth
+patterns, the diffusion of GenAI publications across fields of study, and the
+geographical spread of scientific research on generative AI. We also
+investigate team size and international collaborations to explore whether
+GenAI, as an emerging scientific research area, shows different collaboration
+patterns compared to other AI technologies. The results indicate that
+generative AI has experienced rapid growth and increasing presence in
+scientific publications. The use of GenAI now extends beyond computer science
+to other scientific research domains. Over the study period, U.S. researchers
+contributed nearly two-fifths of global GenAI publications. The U.S. is
+followed by China, with several small and medium-sized advanced economies
+demonstrating relatively high levels of GenAI deployment in their research
+publications. Although scientific research overall is becoming increasingly
+specialized and collaborative, our results suggest that GenAI research groups
+tend to have slightly smaller team sizes than those found in other AI fields.
+Furthermore, notwithstanding recent geopolitical tensions, GenAI research
+continues to exhibit levels of international collaboration comparable to other
+AI technologies.
+
+
+ We propose an ontology-grounded approach to Knowledge Graph (KG) construction
+using Large Language Models (LLMs) on a knowledge base. An ontology is authored
+by generating Competency Questions (CQs) on the knowledge base to discover its
+knowledge scope, extracting relations from the CQs, and attempting to replace
+equivalent relations with their counterparts in Wikidata. To ensure consistency and
+interpretability in the resulting KG, we ground generation of KG with the
+authored ontology based on extracted relations. Evaluation on benchmark
+datasets demonstrates competitive performance on the knowledge graph
+construction task. Our work presents a promising direction for a scalable KG
+construction pipeline with minimal human intervention that yields high-quality,
+human-interpretable KGs, which are interoperable with Wikidata semantics for
+potential knowledge base expansion.
+
+
+
+ comment: Presented at HI-AI@KDD, Human-Interpretable AI Workshop at the KDD
+ 2024, 26th of August 2024, Barcelona, Spain
+
+
+
+ In recent years, user-generated audio content has proliferated across various
+media platforms, creating a growing need for efficient retrieval methods that
+allow users to search for audio clips using natural language queries. This
+task, known as language-based audio retrieval, presents significant challenges
+due to the complexity of learning semantic representations from heterogeneous
+data across both text and audio modalities. In this work, we introduce a novel
+framework for the language-based audio retrieval task that leverages
+a co-attention mechanism to jointly learn meaningful representations from both
+modalities. To enhance the model's ability to capture fine-grained cross-modal
+interactions, we propose a cascaded co-attention architecture, where
+co-attention modules are stacked or iterated to progressively refine the
+semantic alignment between text and audio. Experiments conducted on two public
+datasets show that the proposed method can achieve better performance than the
+state-of-the-art method. Specifically, our best-performing co-attention model
+achieves a 16.6% improvement in mean Average Precision on the Clotho dataset,
+and a 15.1% improvement on AudioCaps.
+
+
+
+ comment: Accepted at UIC 2024 proceedings. Accepted version
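The cascaded co-attention design can be sketched with NumPy: each step lets every modality attend to the other through their affinity matrix, and stacking steps progressively refines the alignment. The projection layers and residual connections of a real model are omitted here:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(text, audio):
    """One co-attention step: each modality attends to the other."""
    affinity = text @ audio.T                       # (T_text, T_audio)
    text_new = softmax(affinity, axis=1) @ audio    # text attends to audio
    audio_new = softmax(affinity.T, axis=1) @ text  # audio attends to text
    return text_new, audio_new

def cascaded_co_attention(text, audio, depth=3):
    """Stack co-attention steps to iteratively refine cross-modal alignment."""
    for _ in range(depth):
        text, audio = co_attention(text, audio)
    return text, audio

text = np.random.default_rng(0).standard_normal((5, 8))   # 5 text tokens
audio = np.random.default_rng(1).standard_normal((7, 8))  # 7 audio frames
t, a = cascaded_co_attention(text, audio)
print(t.shape, a.shape)  # (5, 8) (7, 8)
```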
+
+ Efficiently retrieving a concise set of candidates from a large document
+corpus remains a pivotal challenge in Information Retrieval (IR). Neural
+retrieval models, particularly dense retrieval models built with transformers
+and pretrained language models, have been popular due to their superior
+performance. However, criticisms have also been raised on their lack of
+explainability and vulnerability to adversarial attacks. In response to these
+challenges, we propose to improve the robustness of dense retrieval models by
+enhancing their sensitivity to fine-grained relevance signals. A model achieving
+sensitivity in this context should exhibit high variances when documents' key
+passages determining their relevance to queries have been modified, while
+maintaining low variances for other changes in irrelevant passages. This
+sensitivity allows a dense retrieval model to produce robust results with
+respect to attacks that try to promote documents without actually increasing
+their relevance. It also makes it possible to analyze which part of a document
+is actually relevant to a query, and thus improve the explainability of the
+retrieval model. Motivated by causality and counterfactual analysis, we propose
+a series of counterfactual regularization methods based on game theory and
+unsupervised learning with counterfactual passages. Experiments show that our
+method can extract key passages without reliance on the passage-level relevance
+annotations. Moreover, the regularized dense retrieval models exhibit
+heightened robustness against adversarial attacks, surpassing the
+state-of-the-art anti-attack methods.
+
+
+
+ comment: arXiv admin note: text overlap with arXiv:2107.07773 by other authors
+
+
+
+
+
+
+ ♻ ☆ Beyond RMSE and MAE: Introducing EAUC to unmask hidden bias and
+ unfairness in dyadic regression models
+
+
+ Dyadic regression models, which output real-valued predictions for pairs of
+entities, are fundamental in many domains (e.g. obtaining user-product ratings
+in Recommender Systems) and promising but underexplored in others (e.g.
+tuning patient-drug dosages in personalized pharmacology). In this work, we
+prove that non-uniform observed value distributions of individual entities lead
+to severe biases in state-of-the-art models, skewing predictions towards the
+average of observed past values for the entity and providing worse-than-random
+predictive power in eccentric yet crucial cases; we name this phenomenon
+eccentricity bias. We show that global error metrics like Root Mean Squared
+Error (RMSE) are insufficient to capture this bias, and we introduce
+Eccentricity-Area Under the Curve (EAUC) as a novel complementary metric that
+can quantify it in all studied domains and models. We demonstrate the intuitive
+interpretation of EAUC by experimenting with naive post-training bias
+corrections, and outline other ways to use EAUC to guide the construction
+of fair models. This work contributes a bias-aware evaluation of dyadic
+regression to prevent unfairness in critical real-world applications of such
+systems.
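As a rough illustration of the metric's intent, here is a hedged sketch; the paper's exact EAUC definition may differ. A sample's eccentricity is taken as the distance of its true value from the entity's mean observed value, and the model's absolute error is integrated over eccentricity with the trapezoid rule. The function name `eauc_sketch` and the range normalization are assumptions for illustration only.

```python
def eauc_sketch(eccentricities, abs_errors):
    """Trapezoidal area under the error-vs-eccentricity curve,
    normalized by the eccentricity range."""
    pts = sorted(zip(eccentricities, abs_errors))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)  # trapezoid rule
    span = pts[-1][0] - pts[0][0]
    return area / span if span > 0 else 0.0

# An eccentricity-biased model errs most on eccentric cases...
ecc = [0.0, 1.0, 2.0, 3.0]
biased = eauc_sketch(ecc, [0.0, 1.0, 2.0, 3.0])   # error grows with eccentricity
uniform = eauc_sketch(ecc, [1.0, 1.0, 1.0, 1.0])  # error flat across eccentricity
# ...so its area exceeds that of a uniformly-erring model.
```

A global metric like RMSE averages over all samples and hides where on the eccentricity axis the error concentrates, which is the gap a curve-area metric of this kind is meant to expose.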
+
+
+ Modern e-commerce services frequently target customers with incentives or
+interventions to engage them in their products such as games, shopping, video
+streaming, etc. This customer engagement increases acquisition of new
+customers and retention of existing ones, leading to more business for the
+company while improving customer experience. Often, customers are either
+randomly targeted or targeted based on the propensity of desirable behavior.
+However, such policies can be suboptimal as they do not target the set of
+customers who would benefit the most from the intervention and they may also
+not take account of any constraints. In this paper, we propose a policy
+framework based on uplift modeling and constrained optimization that identifies
+customers to target for a use-case specific intervention so as to maximize the
+value to the business, while taking account of any given constraints. We
+demonstrate improvement over state-of-the-art targeting approaches using two
+large-scale experimental studies and a production implementation.
+
+
+
+ comment: Accepted at the CONSEQUENCES'24 workshop, co-located with ACM
+ RecSys'24
+
+
+
+
+
+
+ ♻ ☆ From Interests to Insights: An LLM Approach to Course Recommendations
+ Using Natural Language Queries
+
+
+
+
+
+
+
+
+ Hugh Van Deventer, Mark Mills, August Evrard
+
+
+ Most universities in the United States encourage their students to explore
+academic areas before declaring a major and to acquire academic breadth by
+satisfying a variety of requirements. Each term, students must choose a
+handful of courses to take from many thousands of offerings spanning dozens of
+subject areas. The curricular environment is also dynamic, and poor
+communication and search functions on campus can limit a student's ability to
+discover new courses of interest. To support both students and their advisers
+in such a setting, we explore a novel Large Language Model (LLM) course
+recommendation system that applies a Retrieval Augmented Generation (RAG)
+method to the corpus of course descriptions. The system first generates an
+'ideal' course description based on the user's query. This description is
+converted into a search vector using embeddings, which is then used to find
+actual courses with similar content by comparing embedding similarities. We
+describe the method and assess the quality and fairness of some example
+prompts. Steps to deploy a pilot system on campus are discussed.
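The retrieval step described above (embed the generated "ideal" description, then rank catalog courses by embedding similarity) can be sketched as follows. The course IDs and the 3-d vectors are toy stand-ins for real embedding-model outputs, and `top_k_courses` is an illustrative helper, not the system's actual API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_k_courses(ideal_vec, catalog, k=2):
    """Rank courses by similarity to the embedding of the generated
    'ideal' course description; catalog maps course_id -> embedding."""
    ranked = sorted(catalog, key=lambda c: cosine(ideal_vec, catalog[c]),
                    reverse=True)
    return ranked[:k]

# Toy 3-d "embeddings" standing in for embeddings of course descriptions.
catalog = {
    "ASTRO 101": [0.9, 0.1, 0.0],
    "DANCE 220": [0.0, 1.0, 0.1],
    "PHYS 140":  [0.8, 0.2, 0.1],
}
ideal = [1.0, 0.0, 0.0]  # embedding of the generated ideal description
matches = top_k_courses(ideal, catalog)
```

With these toy vectors the two physics-adjacent courses outrank the unrelated one, mirroring how the ideal-description vector steers retrieval.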
+
+
+ Sequential recommendation (SR) systems have evolved significantly over the
+past decade, transitioning from traditional collaborative filtering to deep
+learning approaches and, more recently, to large language models (LLMs). While
+the adoption of LLMs has driven substantial advancements, these models
+inherently lack collaborative filtering information, relying primarily on
+textual content data, neglecting other modalities, and thus failing to achieve
+optimal recommendation performance. To address this limitation, we propose
+Molar, a Multimodal large language sequential recommendation framework that
+integrates multiple content modalities with ID information to capture
+collaborative signals effectively. Molar employs an MLLM to generate unified
+item representations from both textual and non-textual data, facilitating
+comprehensive multimodal modeling and enriching item embeddings. Additionally,
+it incorporates collaborative filtering signals through a post-alignment
+mechanism, which aligns user representations from content-based and ID-based
+models, ensuring precise personalization and robust performance. By seamlessly
+combining multimodal content with collaborative filtering insights, Molar
+captures both user interests and contextual semantics, leading to superior
+recommendation accuracy. Extensive experiments validate that Molar
+significantly outperforms traditional and LLM-based baselines, highlighting its
+strength in utilizing multimodal data and collaborative signals for sequential
+recommendation tasks. The source code is available at
+https://anonymous.4open.science/r/Molar-8B06/.
+
+
+ Federated recommendation (FedRec) preserves user privacy by enabling
+decentralized training of personalized models, but this architecture is
+inherently vulnerable to adversarial attacks. Significant research has been
+conducted on targeted attacks in FedRec systems, motivated by commercial and
+social influence considerations. However, much of this work has
+overlooked the differential robustness of recommendation models. Moreover, our
+empirical findings indicate that existing targeted attack methods achieve only
+limited effectiveness in Federated Sequential Recommendation (FSR) tasks.
+Driven by these observations, we focus on investigating targeted attacks in FSR
+and propose a novel dual-view attack framework, named DV-FSR. This attack method
+uniquely combines a sampling-based explicit strategy with a contrastive
+learning-based implicit gradient strategy to orchestrate a coordinated attack.
+Additionally, we introduce a specific defense mechanism tailored for targeted
+attacks in FSR, aiming to evaluate the mitigation effects of the attack method
+we proposed. Extensive experiments validate the effectiveness of our proposed
+approach on representative sequential models.
+
+
+
+ comment: I am requesting the withdrawal of my paper due to identified errors
+ that require significant revision
+
+ Sequential recommendation (SR) aims to predict the next purchasing item
+according to users' dynamic preference learned from their historical user-item
+interactions. To improve the performance of recommendation, learning dynamic
+heterogeneous cross-type behavior dependencies is indispensable for recommender
+system. However, there still exists some challenges in Multi-Behavior
+Sequential Recommendation (MBSR). On the one hand, existing methods only model
+heterogeneous multi-behavior dependencies at behavior-level or item-level, and
+modelling interaction-level dependencies is still a challenge. On the other
+hand, the dynamic multi-grained behavior-aware preference is hard to capture in
+interaction sequences, which reflects the interaction-aware sequential pattern. To
+tackle these challenges, we propose a Multi-Grained Preference enhanced
+Transformer framework (M-GPT). First, M-GPT constructs an interaction-level
+graph of historical cross-typed interactions in a sequence. Then graph
+convolution is performed to derive interaction-level multi-behavior dependency
+representation repeatedly, in which the complex correlation between historical
+cross-typed interactions at specific orders can be well learned. Second, a
+novel multi-scale transformer architecture equipped with multi-grained user
+preference extraction is proposed to encode the interaction-aware sequential
+pattern enhanced by capturing temporal behavior-aware multi-grained preference.
+Experiments on real-world datasets indicate that our method M-GPT
+consistently outperforms various state-of-the-art recommendation methods.
+
+
+
+ comment: 12 pages
+
+
+
+
+
+
+ ♻ ☆ Zero-Indexing Internet Search Augmented Generation for Large Language
+ Models
+
+
+ Retrieval augmented generation has emerged as an effective method to enhance
+large language model performance. This approach typically relies on an internal
+retrieval module that uses various indexing mechanisms to manage a static
+pre-processed corpus. However, such a paradigm often falls short at generative
+inference time, when it is necessary to integrate the most up-to-date
+information that has not yet been incorporated into the corpus. In this paper, we
+explore an alternative approach that leverages standard search engine APIs to
+dynamically integrate the latest online information (without maintaining any
+index for any fixed corpus), thereby improving the quality of generated
+content. We design a collaborative LLM-based paradigm, where we include: (i) a
+parser-LLM that determines, in a single inference, whether Internet-augmented
+generation is needed and, if so, extracts the search keywords; (ii) a mixed
+ranking strategy that re-ranks the retrieved HTML files to eliminate bias
+introduced by the search engine API; and (iii) an extractor-LLM that can
+accurately and efficiently extract relevant information from the fresh content
+in each HTML file. We conduct extensive empirical studies to evaluate the
+performance of this Internet search augmented generation paradigm. The
+experimental results demonstrate that our method generates content with
+significantly improved quality. Our system has been successfully deployed in a
+production environment to serve 01.AI's generative inference requests.
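The three-stage control flow above can be sketched with stand-in stubs. The function bodies here (a keyword heuristic, a canned search result, a "mentions a fact" re-ranker) are placeholders invented for illustration; only the parser / ranker / extractor decomposition comes from the abstract.

```python
def parser_llm(query):
    """Stub for the parser-LLM: one call decides whether Internet
    augmentation is needed and extracts search keywords if so."""
    needs_web = "latest" in query or "today" in query  # toy heuristic
    return needs_web, query.replace("latest", "").strip()

def search_api(keywords):
    """Stub search-engine API; a real call would use the keywords."""
    return [{"url": "https://example.com/a", "html": "fresh fact: 42"},
            {"url": "https://example.com/b", "html": "older page"}]

def mixed_rank(pages):
    """Stand-in for the bias-correcting re-ranker over retrieved HTML."""
    return sorted(pages, key=lambda p: "fact" not in p["html"])

def extractor_llm(page, query):
    """Stub extractor; a real one would pull only the relevant spans."""
    return page["html"]

def answer(query):
    needs_web, keywords = parser_llm(query)
    if not needs_web:
        return "answered from parametric knowledge"
    pages = mixed_rank(search_api(keywords))
    return extractor_llm(pages[0], query)
```

Note how a query that does not need fresh information skips the entire search path, which is what makes the single parser inference cheap on average.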
+
+
+ This paper studies the problem of class-imbalanced graph classification,
+which aims at effectively classifying the graph categories in scenarios with
+imbalanced class distributions. While graph neural networks (GNNs) have
+achieved remarkable success, their modeling ability on imbalanced
+graph-structured data remains suboptimal, which typically leads to predictions
+biased towards the majority classes. On the other hand, existing
+class-imbalanced learning methods in vision may overlook the rich graph
+semantic substructures of the majority classes and excessively emphasize
+learning from the minority classes. To address these challenges, we propose a
+simple yet powerful approach called C$^3$GNN that integrates the idea of
+clustering into contrastive learning to enhance class-imbalanced graph
+classification. Technically, C$^3$GNN clusters graphs from each majority class
+into multiple subclasses, with sizes comparable to the minority class,
+mitigating class imbalance. It also employs the Mixup technique to generate
+synthetic samples, enriching the semantic diversity of each subclass.
+Furthermore, supervised contrastive learning is used to hierarchically learn
+effective graph representations, enabling the model to thoroughly explore
+semantic substructures in majority classes while avoiding excessive focus on
+minority classes. Extensive experiments on real-world graph benchmark datasets
+verify the superior performance of our proposed method against competitive
+baselines.
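The core rebalancing idea above (partition each majority class into subclasses of minority-comparable size, then enrich them with Mixup) can be sketched in a few lines. For brevity the clustering step is replaced here by shuffle-and-chunk; the real method clusters by graph semantics, and all names below are illustrative.

```python
import random

def make_subclasses(majority, minority_size):
    """Partition majority-class samples into subclasses whose sizes are
    comparable to the minority class. (Shuffle-and-chunk stands in for
    the semantic clustering used by the actual method.)"""
    random.seed(0)
    items = majority[:]
    random.shuffle(items)
    return [items[i:i + minority_size]
            for i in range(0, len(items), minority_size)]

def mixup(x1, x2, lam=0.5):
    """Mixup: linear interpolation of two feature vectors."""
    return [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]

majority = [[float(i)] for i in range(10)]  # 10 majority samples, 1-d features
subclasses = make_subclasses(majority, minority_size=3)
synthetic = mixup(subclasses[0][0], subclasses[0][1])  # enrich a subclass
```

After this step every (sub)class has roughly minority-class size, so a supervised contrastive loss over subclass labels no longer over-weights the original majority classes.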
+
+
+
+ comment: Accepted by Proceedings of the Thirty-Ninth AAAI Conference on
+ Artificial Intelligence (AAAI-25)
+
+ As data retrieval demands become increasingly complex, traditional search
+methods often fall short in addressing nuanced and conceptual queries. Vector
+similarity search has emerged as a promising technique for finding semantically
+similar information efficiently. However, its effectiveness diminishes when
+handling intricate queries with contextual nuances. This paper explores a
+hybrid approach combining vector similarity search with Large Language Models
+(LLMs) to enhance search accuracy and relevance. The proposed two-step solution
+first employs vector similarity search to shortlist potential matches, followed
+by an LLM for context-aware ranking of the results. Experiments on structured
+datasets demonstrate that while vector similarity search alone performs well
+for straightforward queries, the LLM-assisted approach excels in processing
+complex queries involving constraints, negations, or conceptual requirements.
+By leveraging the natural language understanding capabilities of LLMs, this
+method improves the accuracy of search results for complex tasks without
+sacrificing efficiency. We also discuss real-world applications and propose
+directions for future research to refine and scale this technique for diverse
+datasets and use cases.
+ Original article:
+https://engineering.grab.com/llm-assisted-vector-similarity-search
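The two-step solution above can be sketched with a stub in place of the LLM ranker. The documents, vectors, and the "demote results containing a negated word" rule are invented for illustration; the actual re-ranking is done by prompting an LLM with the candidates and the query.

```python
def dot(u, v):
    """Dot product; equals cosine similarity for unit vectors."""
    return sum(a * b for a, b in zip(u, v))

def shortlist(query_vec, docs, k=3):
    """Step 1: vector similarity search to shortlist candidates."""
    return sorted(docs, key=lambda d: dot(query_vec, d["vec"]),
                  reverse=True)[:k]

def llm_rerank(candidates, forbidden_word):
    """Step 2 stand-in: a real LLM ranks with full context; here we
    just demote candidates violating a negation constraint."""
    return sorted(candidates, key=lambda d: forbidden_word in d["text"])

docs = [
    {"text": "red running shoes", "vec": [0.9, 0.1]},
    {"text": "red leather boots", "vec": [0.8, 0.2]},
    {"text": "blue running shoes", "vec": [0.7, 0.3]},
]
query = [1.0, 0.0]  # toy embedding for "shoes, but not red"
best = llm_rerank(shortlist(query, docs), forbidden_word="red")[0]
```

This is the failure mode the abstract describes: pure vector similarity ranks the "red" items first because negation barely moves the embedding, and the second-stage ranker is what recovers the intended result.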
+
+
+ We propose action-agnostic point-level (AAPL) supervision for temporal action
+detection to achieve accurate action instance detection with a lightly
+annotated dataset. In the proposed scheme, a small portion of video frames is
+sampled in an unsupervised manner and presented to human annotators, who then
+label the frames with action categories. Unlike point-level supervision, which
+requires annotators to search for every action instance in an untrimmed video,
+frames to annotate are selected without human intervention in AAPL supervision.
+We also propose a detection model and learning method to effectively utilize
+the AAPL labels. Extensive experiments on a variety of datasets (THUMOS '14,
+FineAction, GTEA, BEOID, and ActivityNet 1.3) demonstrate that the proposed
+approach is competitive with or outperforms prior methods for video-level and
+point-level supervision in terms of the trade-off between the annotation cost
+and detection performance.
+
+
+
+
+
+
+
+ ☆ SoS Certificates for Sparse Singular Values and Their Applications:
+ Robust Statistics, Subspace Distortion, and More
+
+
+
+
+
+
+
+
+ Ilias Diakonikolas, Samuel B. Hopkins, Ankit Pensia, Stefan Tiegel
+
+
+ We study $\textit{sparse singular value certificates}$ for random rectangular
+matrices. If $M$ is an $n \times d$ matrix with independent Gaussian entries,
+we give a new family of polynomial-time algorithms which can certify upper
+bounds on the maximum of $\|M u\|$, where $u$ is a unit vector with at most
+$\eta n$ nonzero entries for a given $\eta \in (0,1)$. This basic algorithmic
+primitive lies at the heart of a wide range of problems across algorithmic
+statistics and theoretical computer science.
+ Our algorithms certify a bound which is asymptotically smaller than the naive
+one, given by the maximum singular value of $M$, for nearly the widest-possible
+range of $n,d,$ and $\eta$. Efficiently certifying such a bound for a range of
+$n,d$ and $\eta$ which is larger by any polynomial factor than what is achieved
+by our algorithm would violate lower bounds in the SQ and low-degree
+polynomials models. Our certification algorithm makes essential use of the
+Sum-of-Squares hierarchy. To prove the correctness of our algorithm, we develop
+a new combinatorial connection between the graph matrix approach to analyze
+random matrices with dependent entries, and the Efron-Stein decomposition of
+functions of independent random variables.
+ As applications of our certification algorithm, we obtain new efficient
+algorithms for a wide range of well-studied algorithmic tasks. In algorithmic
+robust statistics, we obtain new algorithms for robust mean and covariance
+estimation with tradeoffs between breakdown point and sample complexity, which
+are nearly matched by SQ and low-degree polynomial lower bounds (that we
+establish). We also obtain new polynomial-time guarantees for certification of
+$\ell_1/\ell_2$ distortion of random subspaces of $\mathbb{R}^n$ (also with
+nearly matching lower bounds), sparse principal component analysis, and
+certification of the $2\rightarrow p$ norm of a random matrix.
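The "naive" certificate mentioned above is easy to state concretely: for any unit vector $u$, sparse or not, $\|Mu\| \le \sigma_{\max}(M)$. The quick numerical check below samples random sparse unit vectors against a Gaussian matrix (with the sparsity budget applied to $u$'s $d$ coordinates for these toy shapes); the SoS machinery that certifies asymptotically smaller bounds is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eta = 60, 30, 0.1
M = rng.standard_normal((n, d))  # i.i.d. Gaussian entries

# Naive certificate: the top singular value bounds ||M u|| for all unit u.
sigma_max = np.linalg.svd(M, compute_uv=False)[0]

# Sample random sparse unit vectors and record ||M u||.
k = max(1, int(eta * d))  # sparsity budget (nonzero coordinates of u)
vals = []
for _ in range(200):
    u = np.zeros(d)
    idx = rng.choice(d, size=k, replace=False)
    u[idx] = rng.standard_normal(k)
    u /= np.linalg.norm(u)
    vals.append(np.linalg.norm(M @ u))
sparse_max = max(vals)  # never exceeds sigma_max; typically well below it
```

The gap between `sparse_max` and `sigma_max` is exactly what a sparse singular value certificate tries to capture rigorously: random sampling only lower-bounds the sparse maximum, whereas a certificate must upper-bound it.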
+
+
+
+
+
+
+
+ ☆ Distributed Mixture-of-Agents for Edge Inference with Large Language
+ Models
+
+
+ Mixture-of-Agents (MoA) has recently been proposed as a method to enhance
+performance of large language models (LLMs), enabling multiple individual LLMs
+to work together for collaborative inference. This collaborative approach
+results in improved responses to user prompts compared to relying on a single
+LLM. In this paper, we consider such an MoA architecture in a distributed
+setting, where LLMs operate on individual edge devices, each uniquely
+associated with a user and equipped with its own distributed computing power.
+These devices exchange information using decentralized gossip algorithms,
+allowing different device nodes to talk without the supervision of a
+centralized server. In the considered setup, different users have their own LLM
+models to address user prompts. Additionally, the devices gossip either their
+own user-specific prompts or augmented prompts to generate more refined answers
+to certain queries. User prompts are temporarily stored in the device queues
+when their corresponding LLMs are busy. Given the memory limitations of edge
+devices, it is crucial to ensure that the average queue sizes in the system
+remain bounded. In this paper, we address this by theoretically calculating the
+queuing stability conditions for the device queues under reasonable
+assumptions, which we validate experimentally as well. Further, we demonstrate
+through experiments, leveraging open-source LLMs for the implementation of
+distributed MoA, that certain MoA configurations produce higher-quality
+responses compared to others, as evaluated on AlpacaEval 2.0 benchmark. The
+implementation is available at:
+https://github.com/purbeshmitra/distributed_moa.
+
+
+
+
+
+
+
+
+ Rainer Engelken, Michael Monteforte, Fred Wolf
+
+
+ Nerve impulses, the currency of information flow in the brain, are generated
+by an instability of the neuronal membrane potential dynamics. Neuronal
+circuits exhibit collective chaos that appears essential for learning, memory,
+sensory processing, and motor control. However, the factors controlling the
+nature and intensity of collective chaos in neuronal circuits are not well
+understood. Here we use computational ergodic theory to demonstrate that basic
+features of nerve impulse generation profoundly affect collective chaos in
+neuronal circuits. Numerically exact calculations of Lyapunov spectra,
+Kolmogorov-Sinai entropy, and upper and lower bounds on attractor dimension
+show that changes in nerve impulse generation in individual neurons moderately
+impact information encoding rates but qualitatively transform phase space
+structure. Specifically, we find a drastic reduction in the number of unstable
+manifolds, Kolmogorov-Sinai entropy, and attractor dimension. Beyond a critical
+point, marked by the simultaneous breakdown of the diffusion approximation, a
+peak in the largest Lyapunov exponent, and a localization transition of the
+leading covariant Lyapunov vector, networks exhibit sparse chaos: prolonged
+periods of near stable dynamics interrupted by short bursts of intense chaos.
+Analysis of large, more realistically structured networks supports the
+generality of these findings. In cortical circuits, biophysical properties
+appear tuned to this regime of sparse chaos. Our results reveal a close link
+between fundamental aspects of single-neuron biophysics and the collective
+dynamics of cortical circuits, suggesting that nerve impulse generation
+mechanisms are adapted to enhance circuit controllability and information flow.
+
+
+
+
+
+
+
+ ☆ Two-component spatiotemporal template for activation-inhibition of
+ speech in ECoG
+
+
+ I compute the average trial-by-trial power of band-limited speech activity
+across epochs of multi-channel high-density electrocorticography (ECoG)
+recorded from multiple subjects during a consonant-vowel speaking task. I show
+that previously seen anti-correlations of average beta frequency activity
+(12-35 Hz) to high-frequency gamma activity (70-140 Hz) during speech movement
+are observable between individual ECoG channels in the sensorimotor cortex
+(SMC). With this I fit a variance-based model using principal component
+analysis to the band-powers of individual channels of session-averaged ECoG
+data in the SMC and project SMC channels onto their lower-dimensional principal
+components.
+ Spatiotemporal relationships between speech-related activity and principal
+components are identified by correlating the principal components of both
+frequency bands to individual ECoG channels over time using windowed
+correlation. Correlations of principal component areas to sensorimotor areas
+reveal a distinct two-component activation-inhibition-like representation for
+speech that resembles distinct local sensorimotor areas recently shown to have
+complex interplay in whole-body motor control, inhibition, and posture. Notably,
+the third principal component shows insignificant correlations across all
+subjects, suggesting two components of ECoG are sufficient to represent SMC
+activity during speech movement.
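The variance-based fitting step above (principal component analysis of channel band powers, then projection onto the leading components) can be sketched with synthetic data. The two latent temporal courses and all array shapes below are assumptions standing in for real session-averaged ECoG band power.

```python
import numpy as np

rng = np.random.default_rng(1)
n_channels, n_times = 16, 200

# Synthetic stand-in for band power (channels x time): two latent
# temporal patterns mixed into all channels, plus small noise, so
# roughly two principal components should capture the variance.
t = np.linspace(0.0, 1.0, n_times)
activation = np.sin(2 * np.pi * t)
inhibition = -np.cos(2 * np.pi * t)
mix = rng.standard_normal((n_channels, 2))
X = mix @ np.vstack([activation, inhibition]) \
    + 0.05 * rng.standard_normal((n_channels, n_times))

Xc = X - X.mean(axis=1, keepdims=True)      # center each channel
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = (S ** 2) / (S ** 2).sum()       # variance ratio per component
scores = U[:, :2] * S[:2]                   # channel loadings on 2 PCs
```

With a rank-2 latent structure the third and later components carry only noise variance, which mirrors the abstract's observation that two components suffice to represent SMC activity.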
+
+
+
+
+
+
+
+ ☆ Adversarial Attack and Defense for LoRa Device Identification and
+ Authentication via Deep Learning
+
+
+ LoRa provides long-range, energy-efficient communications in Internet of
+Things (IoT) applications that rely on Low-Power Wide-Area Network (LPWAN)
+capabilities. Despite these merits, concerns persist regarding the security of
+LoRa networks, especially in situations where device identification and
+authentication are imperative to secure the reliable access to the LoRa
+networks. This paper explores a deep learning (DL) approach to tackle these
+concerns, focusing on two critical tasks, namely (i) identifying LoRa devices
+and (ii) classifying them as legitimate or rogue devices. Deep neural networks
+(DNNs), encompassing both convolutional and feedforward neural networks, are
+trained for these tasks using actual LoRa signal data. In this setting, the
+adversaries may spoof rogue LoRa signals through the kernel density estimation
+(KDE) method based on legitimate device signals that are received by the
+adversaries. Two cases are considered, (i) training two separate classifiers,
+one for each of the two tasks, and (ii) training a multi-task classifier for
+both tasks. The vulnerabilities of the resulting DNNs to manipulations in input
+samples are studied in the form of untargeted and targeted adversarial attacks
+using the Fast Gradient Sign Method (FGSM). Individual and common perturbations
+are considered against single-task and multi-task classifiers for the LoRa
+signal analysis. To provide resilience against such attacks, a defense approach
+is presented by increasing the robustness of classifiers with adversarial
+training. Results quantify how vulnerable LoRa signal classification tasks are
+to adversarial attacks and emphasize the need to fortify IoT applications
+against these subtle yet effective threats.
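The FGSM perturbation used above has a one-line core: move each input feature by `eps` in the direction of the sign of the loss gradient. The sketch below applies it to a logistic-regression "classifier" (a stand-in for the paper's DNNs, chosen so the gradient is closed-form); the weights and features are invented for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, eps):
    """Untargeted FGSM: x_adv = x + eps * sign(d loss / d x).
    For p = sigmoid(w.x) with cross-entropy loss, d loss / d x = (p - y) * w."""
    p = sigmoid(w @ x)
    grad = (p - y) * w
    return x + eps * np.sign(grad)

w = np.array([1.0, -2.0, 0.5])  # stand-in weights of a trained classifier
x = np.array([2.0, -1.0, 1.0])  # a "legitimate" signal feature vector
y = 1.0                          # true label: legitimate

score_clean = sigmoid(w @ x)
x_adv = fgsm(x, y, w, eps=0.5)
score_adv = sigmoid(w @ x_adv)   # pushed toward the "rogue" decision
```

Adversarial training, the defense evaluated in the paper, amounts to generating such `x_adv` during training and including them (with correct labels) in the training set.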
+
+
+ Globally, chronic liver disease continues to be a major health concern that
+requires precise predictive models for prompt detection and treatment. Using
+the Indian Liver Patient Dataset (ILPD) from the University of California at
+Irvine's UCI Machine Learning Repository, a number of machine learning
+algorithms are investigated in this study. The main focus of our research is
+this dataset, which includes the medical records of 583 patients, 416 of whom
+have been diagnosed with liver disease and 167 of whom have not. There are
+several aspects to this work, including feature extraction and dimensionality
+reduction methods like Linear Discriminant Analysis (LDA), Factor Analysis
+(FA), t-distributed Stochastic Neighbour Embedding (t-SNE), and Uniform
+Manifold Approximation and Projection (UMAP). The purpose of the study is to
+investigate how well these approaches work for converting high-dimensional
+datasets and improving prediction accuracy. To assess the prediction ability of
+the improved models, a number of classification methods were used, such as
+Multi-layer Perceptron, Random Forest, K-nearest neighbours, and Logistic
+Regression. Remarkably, the improved models performed admirably, with Random
+Forest having the highest accuracy of 98.31% in 10-fold cross-validation and
+95.79% in train-test split evaluation. Findings offer important new
+perspectives on the choice and use of customized feature extraction and
+dimensionality reduction methods, which improve predictive models for patients
+with chronic liver disease.
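The 10-fold cross-validation protocol behind the headline accuracy can be made concrete. The sketch below builds contiguous folds for brevity; in practice one would shuffle, or stratify by the liver-disease label, and the helper names are illustrative.

```python
def k_fold_indices(n, k=10):
    """Split range(n) into k contiguous folds of near-equal size;
    each fold serves once as the held-out test split."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cv_mean(per_fold_scores):
    """A figure like 98.31% is the mean of the k per-fold accuracies."""
    return sum(per_fold_scores) / len(per_fold_scores)

folds = k_fold_indices(583, k=10)  # the ILPD has 583 patient records
```

Because 583 is not divisible by 10, three folds get 59 records and seven get 58, which is why the fold sizes differ by at most one.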
+
+
+
+
+
+
+
+ ☆ Aviary: training language agents on challenging scientific tasks
+
+
+
+
+
+
+
+
+ Siddharth Narayanan, James D. Braza, Ryan-Rhys Griffiths, Manu Ponnapati, Albert Bou, Jon Laurent, Ori Kabeli, Geemi Wellawatte, Sam Cox, Samuel G. Rodriques, Andrew D. White
+
+
+ Solving complex real-world tasks requires cycles of actions and observations.
+This is particularly true in science, where tasks require many cycles of
+analysis, tool use, and experimentation. Language agents are promising for
+automating intellectual tasks in science because they can interact with tools
+via natural language or code. Yet their flexibility creates conceptual and
+practical challenges for software implementations, since agents may comprise
+non-standard components such as internal reasoning, planning, and tool usage, as
+well as the inherent stochasticity of temperature-sampled language models.
+Here, we introduce Aviary, an extensible gymnasium for language agents. We
+formalize agents as policies solving language-grounded partially observable
+Markov decision processes, which we term language decision processes. We then
+implement five environments, including three challenging scientific
+environments: (1) manipulating DNA constructs for molecular cloning, (2)
+answering research questions by accessing scientific literature, and (3)
+engineering protein stability. These environments were selected for their focus
+on multi-step reasoning and their relevance to contemporary biology research.
+Finally, with online training and scaling inference-time compute, we show that
+language agents backed by open-source, non-frontier LLMs can match and exceed
+both frontier LLM agents and human experts on multiple tasks at up to 100x
+lower inference cost.
+
+
+ Graph Self-Supervised Learning (SSL) has emerged as a pivotal area of
+research in recent years. By engaging in pretext tasks to learn the intricate
+topological structures and properties of graphs using unlabeled data, these
+graph SSL models achieve enhanced performance, improved generalization, and
+heightened robustness. Despite the remarkable achievements of these graph SSL
+methods, their current implementation poses significant challenges for
+beginners and practitioners due to the complex nature of graph structures,
+inconsistent evaluation metrics, and concerns regarding reproducibility, all of
+which hinder further progress in this field. Recognizing the growing interest within the
+research community, there is an urgent need for a comprehensive,
+beginner-friendly, and accessible toolkit consisting of the most representative
+graph SSL algorithms. To address these challenges, we present a Graph SSL
+toolkit named PyG-SSL, which is built upon PyTorch and is compatible with
+various deep learning and scientific computing backends. Within the toolkit, we
+offer a unified framework encompassing dataset loading, hyper-parameter
+configuration, model training, and comprehensive performance evaluation for
+diverse downstream tasks. Moreover, we provide beginner-friendly tutorials and
+the best hyper-parameters of each graph SSL algorithm on different graph
+datasets, facilitating the reproduction of results. The GitHub repository of
+the library is https://github.com/iDEA-iSAIL-Lab-UIUC/pyg-ssl.
+
+
+ The field of Machine Learning has changed significantly since the 1970s.
+However, its most basic principle, Empirical Risk Minimization (ERM), remains
+unchanged. We propose Functional Risk Minimization~(FRM), a general framework
+where losses compare functions rather than outputs. This results in better
+performance in supervised, unsupervised, and RL experiments. In the FRM
+paradigm, for each data point $(x_i,y_i)$ there is a function $f_{\theta_i}$ that
+fits it: $y_i = f_{\theta_i}(x_i)$. This allows FRM to subsume ERM for many
+common loss functions and to capture more realistic noise processes. We also
+show that FRM provides an avenue towards understanding generalization in the
+modern over-parameterized regime, as its objective can be framed as finding the
+simplest model that fits the training data.
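A toy 1-d example makes the contrast concrete. For a linear model $f_\theta(x) = \theta x$, each point $(x_i, y_i)$ is fit exactly by $\theta_i = y_i / x_i$; ERM compares outputs $f_\theta(x_i)$ to $y_i$, while a functional loss compares the model's function to each per-point function, here via $|\theta - \theta_i|$ as a crude proxy for distance in function space. This is an invented illustration of the idea, not the paper's actual construction.

```python
def erm_loss(theta, data):
    """Empirical risk: average squared error on outputs."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)

def frm_loss(theta, data):
    """Functional-risk sketch: average distance between the model
    f_theta and each point's exactly-fitting function f_{theta_i},
    measured (crudely) in parameter space."""
    return sum(abs(theta - y / x) for x, y in data) / len(data)

data = [(1.0, 2.1), (2.0, 3.8), (4.0, 8.4)]  # roughly y = 2x with noise
# Per-point parameters theta_i: 2.1, 1.9, 2.1 -> both losses prefer theta ~ 2.
```

Both losses agree here, but note `frm_loss` weights every point's disagreement in parameter space equally, whereas `erm_loss` implicitly weights points with large $x$ more; that difference in noise modeling is the kind of distinction FRM is built to express.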
+
+
+
+
+
+
+
+ ☆ DeepF-fNet: a physics-informed neural network for vibration isolation
+ optimization
+
+
+
+
+
+
+
+
+ A. Tollardo, F. Cadini, M. Giglio, L. Lomazzi
+
+
+ Structural optimization is essential for designing safe, efficient, and
+durable components with minimal material usage. Traditional methods for
+vibration control often rely on active systems to mitigate unpredictable
+vibrations, which may lead to resonance and potential structural failure.
+However, these methods face significant challenges when addressing the
+nonlinear inverse eigenvalue problems required for optimizing structures
+subjected to a wide range of frequencies. As a result, no existing approach has
+effectively addressed the need for real-time vibration suppression within this
+context, particularly in high-performance environments such as automotive
+noise, vibration and harshness, where computational efficiency is crucial.
+ This study introduces DeepF-fNet, a novel neural network framework designed
+to replace traditional active systems in vibration-based structural
+optimization. Leveraging DeepONets within the context of physics-informed
+neural networks, DeepF-fNet integrates both data and the governing physical
+laws. This enables rapid identification of optimal parameters to suppress
+critical vibrations at specific frequencies, offering a more efficient and
+real-time alternative to conventional methods.
+ The proposed framework is validated through a case study involving a locally
+resonant metamaterial used to isolate structures from user-defined frequency
+ranges. The results demonstrate that DeepF-fNet outperforms traditional genetic
+algorithms in terms of computational speed while achieving comparable results,
+making it a promising tool for vibration-sensitive applications. By replacing
+active systems with machine learning techniques, DeepF-fNet paves the way for
+more efficient and cost-effective structural optimization in real-world
+scenarios.
+
+
+
+
+
+
+
+ ☆ Adaptive Batch Size Schedules for Distributed Training of Language
+ Models with Data and Model Parallelism
+
+
+
+
+
+
+
+
+ Tim Tsz-Kit Lau, Weijian Li, Chenwei Xu, Han Liu, Mladen Kolar
+
+
+ An appropriate choice of batch sizes in large-scale model training is
+crucial, yet it involves an intrinsic and inevitable dilemma: large-batch
+training improves training efficiency in terms of memory utilization, while
+generalization performance often deteriorates due to small amounts of gradient
+noise. Despite this dilemma, the common practice of choosing batch sizes in
+language model training often prioritizes training efficiency -- employing
+either constant large sizes with data parallelism or implementing batch size
+warmup schedules. However, such batch size schedule designs remain heuristic
+and often fail to adapt to training dynamics, presenting the challenge of
+designing adaptive batch size schedules. Given the abundance of available
+datasets and the data-hungry nature of language models, data parallelism has
+become an indispensable distributed training paradigm, enabling the use of
+larger batch sizes for gradient computation. However, vanilla data parallelism
+requires replicas of model parameters, gradients, and optimizer states at each
+worker, which prohibits training larger models with billions of parameters. To
+optimize memory usage, more advanced parallelism strategies must be employed.
+In this work, we propose general-purpose and theoretically principled adaptive
+batch size schedules compatible with data parallelism and model parallelism. We
+develop a practical implementation with PyTorch Fully Sharded Data Parallel,
+facilitating the pretraining of language models of different sizes. We
+empirically demonstrate that our proposed approaches outperform constant batch
+sizes and heuristic batch size warmup schedules in the pretraining of models in
+the Llama family, with particular focus on smaller models with up to 3 billion
+parameters. We also establish theoretical convergence guarantees for such
+adaptive batch size schedules with Adam for general smooth nonconvex
+objectives.
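The schedule details live in the paper itself; purely as a loose illustration of one way adaptive batch sizing can work, the sketch below grows the batch size toward an estimate of the gradient noise scale. The helper names, the bounds, and the 2x-per-adjustment rate limit are all hypothetical, not the authors' algorithm:

```python
import numpy as np

def gradient_noise_scale(per_sample_grads):
    """Estimate B = tr(Sigma) / |g|^2, a common heuristic for a
    near-critical batch size."""
    g = per_sample_grads.mean(axis=0)                      # mean gradient
    tr_sigma = per_sample_grads.var(axis=0, ddof=1).sum()  # tr(Sigma)
    return tr_sigma / (np.dot(g, g) + 1e-12)

def next_batch_size(current, noise_scale, lo=32, hi=4096):
    """Move the batch size toward the noise-scale estimate, changing by at
    most 2x per adjustment so training dynamics stay stable."""
    target = int(np.clip(noise_scale, lo, hi))
    return int(np.clip(target, current // 2, current * 2))

rng = np.random.default_rng(0)
grads = rng.normal(loc=1.0, scale=5.0, size=(256, 10))  # fake per-sample grads
print(next_batch_size(128, gradient_noise_scale(grads)))
```

In practice one would estimate the trace and squared norm from gradients at two different batch sizes rather than materializing per-sample gradients, which sharded training does not expose cheaply.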
+
+
+
+
+
+
+
+ ☆ On the Generalizability of Machine Learning-based Ransomware Detection
+ in Block Storage
+
+
+
+
+
+
+
+
+ Nicolas Reategui, Roman Pletka, Dionysios Diamantopoulos
+
+
+ Ransomware represents a pervasive threat, traditionally countered at the
+operating system, file-system, or network levels. However, these approaches
+often introduce significant overhead and remain susceptible to circumvention by
+attackers. Recent research activity started looking into the detection of
+ransomware by observing block IO operations. However, this approach exhibits
+significant detection challenges. Recognizing these limitations, our research
+pivots towards enabling robust ransomware detection in storage systems keeping
+in mind their limited computational resources available. To perform our
+studies, we propose a kernel-based framework capable of efficiently extracting
+and analyzing IO operations to identify ransomware activity. The framework can
+be adapted to storage systems using computational storage devices to improve
+security and fully hide detection overheads. Our method employs a refined set
+of computationally light features optimized for ML models to accurately discern
+malicious from benign activities.
+ Using this lightweight approach, we study a wide range of generalizability
+aspects and analyze the performance of these models across a large space of
+setups and configurations covering a wide range of realistic scenarios. We
+reveal various trade-offs and provide strong arguments for the
+generalizability of storage-based detection of ransomware and show that our
+approach outperforms currently available ML-based ransomware detection in
+storage. Empirical validation reveals that our decision tree-based models
+achieve remarkable effectiveness, with median F1 scores up to 12.8% higher,
+false negative rates up to 10.9% lower, and, in particular, false positive
+rates up to 17.1% lower than existing storage-based detection approaches.
+
+
+
+
+
+
+
+ ☆ Quantum Diffusion Model for Quark and Gluon Jet Generation NeurIPS 2024
+
+
+
+
+
+
+
+
+ Mariia Baidachna, Rey Guadarrama, Gopal Ramesh Dahale, Tom Magorsch, Isabel Pedraza, Konstantin T. Matchev, Katia Matcheva, Kyoungchul Kong, Sergei Gleyzer
+
+
+ Diffusion models have demonstrated remarkable success in image generation,
+but they are computationally intensive and time-consuming to train. In this
+paper, we introduce a novel diffusion model that benefits from quantum
+computing techniques in order to mitigate computational challenges and enhance
+generative performance within high energy physics data. The fully quantum
+diffusion model replaces Gaussian noise with random unitary matrices in the
+forward process and incorporates a variational quantum circuit within the U-Net
+in the denoising architecture. We run evaluations on the structurally complex
+quark and gluon jets dataset from the Large Hadron Collider. The results
+demonstrate that the fully quantum and hybrid models are competitive with a
+similar classical model for jet generation, highlighting the potential of using
+quantum techniques for machine learning problems.
+
+
+
+ comment: Accepted for the NeurIPS 2024 MLNCP workshop
+
+
+
+
+
+
+ ☆ Enhanced coarsening of charge density waves induced by electron
+ correlation: Machine-learning enabled large-scale dynamical simulations
+
+
+ The phase ordering kinetics of emergent orders in correlated electron systems
+is a fundamental topic in non-equilibrium physics, yet it remains largely
+unexplored. The intricate interplay between quasiparticles and emergent
+order-parameter fields could lead to unusual coarsening dynamics that is beyond
+the standard theories. However, accurate treatment of both quasiparticles and
+collective degrees of freedom is a multi-scale challenge in dynamical
+simulations of correlated electrons. Here we leverage modern machine learning
+(ML) methods to achieve a linear-scaling algorithm for simulating the
+coarsening of charge density waves (CDWs), one of the fundamental symmetry
+breaking phases in functional electron materials. We demonstrate our approach
+on the square-lattice Hubbard-Holstein model and uncover an intriguing
+enhancement of CDW coarsening which is related to the screening of on-site
+potential by electron-electron interactions. Our study provides fresh insights
+into the role of electron correlations in non-equilibrium dynamics and
+underscores the promise of ML force-field approaches for advancing multi-scale
+dynamical modeling of correlated electron systems.
+
+
+
+ comment: 11 pages, 4 figures
+
+
+
+
+
+
+ ☆ Investigating layer-selective transfer learning of QAOA parameters for
+ Max-Cut problem
+
+
+ Quantum approximate optimization algorithm (QAOA) is a variational quantum
+algorithm (VQA) ideal for noisy intermediate-scale quantum (NISQ) processors,
+and is highly successful for solving combinatorial optimization problems
+(COPs). It has been observed that the optimal variational parameters obtained
+from one instance of a COP can be transferred to another instance, producing
+sufficiently satisfactory solutions for the latter. In this context, a suitable
+method for further improving the solution is to fine-tune a subset of the
+transferred parameters. We numerically explore the role of optimizing
+individual QAOA layers in improving the approximate solution of the Max-Cut
+problem after parameter transfer. We also investigate the trade-off between a
+good approximation and the required optimization time when optimizing
+transferred QAOA parameters. These studies show that optimizing a subset of
+layers can be more effective at a lower time-cost compared to optimizing all
+layers.
+
+
+
+ comment: 8 pages, 6 figures. Comments are welcome
+
+ Mobile edge computing (MEC) has empowered mobile devices (MDs) in supporting
+artificial intelligence (AI) applications through collaborative efforts with
+proximal MEC servers. Unfortunately, despite the great promise of device-edge
+cooperative AI inference, data privacy becomes an increasing concern. In this
+paper, we develop a privacy-aware multi-device cooperative edge inference
+system for classification tasks, which integrates a distributed bidding
+mechanism for the MEC server's computational resources. Intermediate feature
+compression is adopted as a principled approach to minimize data privacy
+leakage. To determine the bidding values and feature compression ratios in a
+distributed fashion, we formulate a decentralized partially observable Markov
+decision process (DEC-POMDP) model, for which, a multi-agent deep deterministic
+policy gradient (MADDPG)-based algorithm is developed. Simulation results
+demonstrate the effectiveness of the proposed algorithm in privacy-preserving
+cooperative edge inference. Specifically, given a sufficient level of data
+privacy protection, the proposed algorithm achieves 0.31-0.95% improvements in
+classification accuracy compared to an approach agnostic to the wireless
+channel conditions. The performance is further enhanced by 1.54-1.67% by
+considering the difficulties of inference data.
+
+
+
+ comment: This article was submitted to IEEE for possible publication
+
+
+
+
+
+
+ ☆ BridgePure: Revealing the Fragility of Black-box Data Protection
+
+
+ Availability attacks, or unlearnable examples, are defensive techniques that
+allow data owners to modify their datasets in ways that prevent unauthorized
+machine learning models from learning effectively while maintaining the data's
+intended functionality. This has led to the release of popular black-box tools
+for users to upload personal data and receive protected counterparts. In this
+work, we show such black-box protections can be substantially bypassed if a
+small set of unprotected in-distribution data is available. Specifically, an
+adversary can (1) easily acquire (unprotected, protected) pairs by querying the
+black-box protections with the unprotected dataset; and (2) train a diffusion
+bridge model to build a mapping. This mapping, termed BridgePure, can
+effectively remove the protection from any previously unseen data within the
+same distribution. Under this threat model, our method demonstrates superior
+purification performance on classification and style mimicry tasks, exposing
+critical vulnerabilities in black-box data protection.
+
+
+
 comment: 26 pages, 13 figures
+
+
+
+
+
+
+ ☆ Towards Effective Discrimination Testing for Generative AI
+
+
+
+
+
+
+
+
+ Thomas P. Zollo, Nikita Rajaneesh, Richard Zemel, Talia B. Gillis, Emily Black
+
+
+ Generative AI (GenAI) models present new challenges in regulating against
+discriminatory behavior. In this paper, we argue that GenAI fairness research
+still has not met these challenges; instead, a significant gap remains between
+existing bias assessment methods and regulatory goals. This leads to
+ineffective regulation that can allow deployment of reportedly fair, yet
+actually discriminatory, GenAI systems. Towards remedying this problem, we
+connect the legal and technical literature around GenAI bias evaluation and
+identify areas of misalignment. Through four case studies, we demonstrate how
+this misalignment between fairness testing techniques and regulatory goals can
+result in discriminatory outcomes in real-world deployments, especially in
+adaptive or complex environments. We offer practical recommendations for
+improving discrimination testing to better align with regulatory goals and
+enhance the reliability of fairness assessments in future deployments.
+
+
+
+ comment: 38 pages, 9 tables, 8 figures
+
+
+
+
+
+
+ ☆ Learning Epidemiological Dynamics via the Finite Expression Method
+
+
+
+
+
+
+
+
+ Jianda Du, Senwei Liang, Chunmei Wang
+
+
+ Modeling and forecasting the spread of infectious diseases is essential for
+effective public health decision-making. Traditional epidemiological models
+rely on expert-defined frameworks to describe complex dynamics, while neural
+networks, despite their predictive power, often lack interpretability due to
+their "black-box" nature. This paper introduces the Finite Expression Method
+(FEX), a symbolic learning framework that leverages reinforcement learning to
+derive explicit mathematical expressions for epidemiological dynamics. Through
+numerical experiments on both synthetic and real-world datasets, FEX
+demonstrates high accuracy in modeling and predicting disease spread, while
+uncovering explicit relationships among epidemiological variables. These
+results highlight FEX as a powerful tool for infectious disease modeling,
+combining interpretability with strong predictive performance to support
+practical applications in public health.
+
+
+
+ comment: 13 pages, 5 figures
+
+
+
+
+
+
+ ☆ Mind the truncation gap: challenges of learning on dynamic graphs with
+ recurrent architectures
+
+
+
+
+
+
+
+
+ João Bravo, Jacopo Bono, Pedro Saleiro, Hugo Ferreira, Pedro Bizarro
+
+
+ Systems characterized by evolving interactions, prevalent in social,
+financial, and biological domains, are effectively modeled as continuous-time
+dynamic graphs (CTDGs). To manage the scale and complexity of these graph
+datasets, machine learning (ML) approaches have become essential. However,
+CTDGs pose challenges for ML because traditional static graph methods do not
+naturally account for event timings. Newer approaches, such as graph recurrent
+neural networks (GRNNs), are inherently time-aware and offer advantages over
+static methods for CTDGs. However, GRNNs face another issue: the short
+truncation of backpropagation-through-time (BPTT), whose impact has not been
+properly examined until now. In this work, we demonstrate that this truncation
+can limit the learning of dependencies beyond a single hop, resulting in
+reduced performance. Through experiments on a novel synthetic task and
+real-world datasets, we reveal a performance gap between full
+backpropagation-through-time (F-BPTT) and the truncated
+backpropagation-through-time (T-BPTT) commonly used to train GRNN models. We
+term this gap the "truncation gap" and argue that understanding and addressing
+it is essential as the importance of CTDGs grows, discussing potential future
+directions for research in this area.
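The truncation gap can be made concrete on a toy scalar recurrence, where the gradient contributions that T-BPTT discards are computable in closed form. The scalar model and the window length below are illustrative choices, not the paper's GRNN setup:

```python
def grad_wrt_w(x, w, trunc=None):
    """Gradient of the final state h_T wrt w for the scalar recurrence
    h_t = w*h_{t-1} + x_t (h_0 = 0). With truncation k, contributions from
    more than k steps back are dropped, mimicking T-BPTT."""
    T = len(x)
    h = [0.0]
    for t in range(T):
        h.append(w * h[-1] + x[t])
    start = 0 if trunc is None else max(0, T - trunc)
    # full chain rule: dh_T/dw = sum over t of h_{t-1} * w^(T-1-t)
    return sum(h[t] * w ** (T - 1 - t) for t in range(start, T))

x = [1.0] * 20
full = grad_wrt_w(x, 0.9)             # F-BPTT gradient
short = grad_wrt_w(x, 0.9, trunc=2)   # T-BPTT with a 2-step window
print(full - short)                   # the "truncation gap" in the gradient
```

With w close to 1, the dropped terms decay slowly, so the gap stays large; this is exactly the long-range-dependency regime the abstract warns about.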
+
+
+
+ comment: Published in Transactions on Machine Learning Research (TMLR)
+
+
+
+
+
+
+ ☆ Machine Learning Optimal Ordering in Global Routing Problems in
+ Semiconductors
+
+
+ In this work, we propose a new method for ordering nets during the process of
+layer assignment in global routing problems. The global routing problems that
+we focus on in this work are based on routing problems that occur in the design
+of substrates in multilayered semiconductor packages. The proposed new method
+is based on machine learning techniques and we show that the proposed method
+supersedes conventional net ordering techniques based on heuristic score
+functions. We perform global routing experiments in multilayered semiconductor
+package environments in order to illustrate that the routing order based on our
+new proposed technique outperforms previous methods based on heuristics. Our
+approach of using machine learning for global routing targets specifically the
+net ordering step which we show in this work can be significantly improved by
+deep learning.
+
+
+
+ comment: 18 pages, 13 figures, 6 tables; published in Scientific Reports
+
+
+
+
+
+
+ ☆ Improving Location-based Thermal Emission Side-Channel Analysis Using
+ Iterative Transfer Learning
+
+
+
+
+
+
+
+
+ Tun-Chieh Lou, Chung-Che Wang, Jyh-Shing Roger Jang, Henian Li, Lang Lin, Norman Chang
+
+
+ This paper proposes the use of iterative transfer learning applied to deep
+learning models for side-channel attacks. Currently, most of the side-channel
+attack methods train a model for each individual byte, without considering the
+correlation between bytes. However, since the models' parameters for attacking
+different bytes may be similar, we can leverage transfer learning, meaning that
+we first train the model for one of the key bytes, then use the trained model
+as a pretrained model for the remaining bytes. This technique can be applied
+iteratively, a process known as iterative transfer learning. Experimental
+results show that when using thermal or power consumption map images as input,
+and multilayer perceptron or convolutional neural network as the model, our
+method improves average performance, especially when the amount of data is
+insufficient.
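The iterative transfer loop described above can be sketched with a toy warm-started classifier; the synthetic features, the logistic model, and the task construction below are stand-ins for the thermal-map inputs and the MLP/CNN models in the paper:

```python
import numpy as np

def train_logistic(X, y, w0, steps=200, lr=0.1):
    """A few gradient-descent steps on logistic loss, starting from w0."""
    w = w0.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))              # stand-in for trace features
base = rng.normal(size=8)
# four correlated tasks, mimicking leakage models of similar key bytes
tasks = [(X @ (base + 0.1 * rng.normal(size=8)) > 0).astype(float)
         for _ in range(4)]

w = np.zeros(8)
for y in tasks:                            # iterative transfer learning:
    w = train_logistic(X, y, w0=w)         # each byte warm-starts the next
```

Because the tasks share most of their structure, each warm start lands near the next task's optimum, which is the intuition behind reusing one byte's model for the remaining bytes.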
+
+
+ Deploying Retrieval Augmented Generation (RAG) on resource-constrained edge
+devices is challenging due to limited memory and processing power. In this
+work, we propose EdgeRAG which addresses the memory constraint by pruning
+embeddings within clusters and generating embeddings on-demand during
+retrieval. To avoid the latency of generating embeddings for large tail
+clusters, EdgeRAG pre-computes and stores embeddings for these clusters, while
+adaptively caching remaining embeddings to minimize redundant computations and
+further optimize latency. Results on the BEIR suite show that EdgeRAG offers
+significant latency reduction over the baseline IVF index with similar
+generation quality, while allowing all of our evaluated datasets to fit into
+memory.
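The caching behaviour described above can be sketched in a few lines; `make_cached_embedder`, the toy `embed` function, and the tiny capacity are hypothetical stand-ins for EdgeRAG's actual index and cost model:

```python
from collections import OrderedDict

def make_cached_embedder(embed_fn, precomputed, capacity=2):
    """On-demand embedding with a small LRU cache; large clusters are served
    from `precomputed`, everything else is generated on first touch."""
    cache = OrderedDict()
    stats = {"computed": 0}
    def get(doc_id):
        if doc_id in precomputed:          # stored ahead of time
            return precomputed[doc_id]
        if doc_id in cache:                # cache hit: no recompute
            cache.move_to_end(doc_id)
            return cache[doc_id]
        stats["computed"] += 1
        vec = embed_fn(doc_id)             # on-demand generation
        cache[doc_id] = vec
        if len(cache) > capacity:
            cache.popitem(last=False)      # evict least-recently-used
        return vec
    return get, stats

embed = lambda d: [hash(d) % 7]            # stand-in embedding function
get, stats = make_cached_embedder(embed, precomputed={"big": [0]})
for d in ["a", "b", "a", "big", "a"]:
    get(d)
print(stats["computed"])                   # "a" and "b" computed once each
```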
+
+
+
+
+
+
+
+ ☆ Text Classification: Neural Networks VS Machine Learning Models VS
+ Pre-trained Models
+
+
+ Text classification is a very common task nowadays and there are many
+efficient methods and algorithms that we can employ to accomplish it.
+Transformers have revolutionized the field of deep learning, particularly in
+Natural Language Processing (NLP) and have rapidly expanded to other domains
+such as computer vision, time-series analysis and more. The transformer model
+was firstly introduced in the context of machine translation and its
+architecture relies on self-attention mechanisms to capture complex
+relationships within data sequences. It is able to handle long-range
+dependencies more effectively than traditional neural networks (such as
+Recurrent Neural Networks and Multilayer Perceptrons). In this work, we present
+a comparison between different techniques to perform text classification. We
+take into consideration seven pre-trained models, three standard neural
+networks and three machine learning models. For standard neural networks and
+machine learning models we also compare two embedding techniques: TF-IDF and
+GloVe, with the latter consistently outperforming the former. Finally, we
+demonstrate the results from our experiments where pre-trained models such as
+BERT and DistilBERT always perform better than standard models/algorithms.
+
+
+
+
+
+
+
+ ☆ Weber-Fechner Law in Temporal Difference learning derived from Control
+ as Inference
+
+
+ This paper investigates a novel nonlinear update rule based on temporal
+difference (TD) errors in reinforcement learning (RL). The update rule in the
+standard RL makes the degree of updates linearly proportional to the TD error,
+treating all rewards equally and without bias. On the other hand, recent
+biological studies have revealed nonlinearities between the TD error and the
+degree of updates, biasing policies toward optimism or pessimism. Such biases
+in learning due to nonlinearities are expected to be useful, intentionally
+retained features of biological learning. Therefore, this
+research explores a theoretical framework that can leverage the nonlinearity
+between the degree of the update and TD errors. To this end, we focus on a
+control as inference framework, since it is known as a generalized formulation
+encompassing various RL and optimal control methods. In particular, we
+investigate the uncomputable nonlinear term that must be approximately excluded
+in the derivation of the standard RL from control as inference. By analyzing
+it, Weber-Fechner law (WFL) is found, namely, perception (a.k.a. the degree of
+updates) in response to stimulus change (a.k.a. TD error) is attenuated by
+increase in the stimulus intensity (a.k.a. the value function). To numerically
+reveal the utilities of WFL on RL, we then propose a practical implementation
+using a reward-punishment framework and modifying the definition of optimality.
+Analysis of this implementation reveals two expected utilities: i) rewards
+increase to a certain level early, and ii) punishment is sufficiently
+suppressed. We finally investigate and discuss the expected utilities through
+simulations and robot experiments. As a result, the proposed RL algorithm with
+WFL shows the expected utilities that accelerate the reward-maximizing startup
+and continue to suppress punishments during learning.
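The paper derives its nonlinearity from control as inference; purely as an illustration of a Weber-Fechner-style attenuation, the sketch below divides the TD error by 1 + |V(s)|, so that the same stimulus change produces a smaller update as the stimulus intensity grows. This divisor is a stand-in, not the derived update rule:

```python
import numpy as np

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9, weber=True):
    """TD(0) step; with weber=True the update is attenuated as the value
    (the 'stimulus intensity') grows, a Weber-Fechner-like nonlinearity."""
    delta = r + gamma * V[s_next] - V[s]     # standard TD error
    if weber:
        delta = delta / (1.0 + abs(V[s]))    # attenuate by stimulus intensity
    V[s] += alpha * delta
    return V

V_lin = np.zeros(2)
V_wfl = np.zeros(2)
for _ in range(200):                         # repeated reward in state 0
    td_update(V_lin, 0, 1.0, 1, weber=False)
    td_update(V_wfl, 0, 1.0, 1, weber=True)
print(V_lin[0] > V_wfl[0])                   # attenuation slows value growth
```

Both variants share the same fixed point; only the approach to it differs, which is the kind of learning-dynamics bias the abstract attributes to WFL.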
+
+
+
 comment: 36 pages, 9 figures
+
+
+
+
+
+
+ ☆ LEASE: Offline Preference-based Reinforcement Learning with High Sample
+ Efficiency
+
+
+ Offline preference-based reinforcement learning (PbRL) provides an effective
+way to overcome the challenges of designing reward and the high costs of online
+interaction. However, since labeling preference needs real-time human feedback,
+acquiring sufficient preference labels is challenging. To solve this, this
+paper proposes an offLine prEference-bAsed RL with high Sample Efficiency
+(LEASE) algorithm, where a learned transition model is leveraged to generate
+unlabeled preference data. Considering the pretrained reward model may generate
+incorrect labels for unlabeled data, we design an uncertainty-aware mechanism
+to ensure the performance of the reward model, where only high-confidence and
+low-variance data are selected. Moreover, we provide the generalization bound
+of the reward model to analyze the factors influencing reward accuracy, and
+demonstrate that the policy learned by LEASE has theoretical improvement
+guarantee. The developed theory is based on state-action pair, which can be
+easily combined with other offline algorithms. The experimental results show
+that LEASE can achieve performance comparable to the baseline with fewer
+preference data and without online interaction.
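The uncertainty-aware selection step can be sketched as a filter over an ensemble of reward models: keep only pseudo-labeled preference pairs where the ensemble is both confident and stable. The thresholds and the `select_confident` helper are illustrative, not the paper's implementation:

```python
import numpy as np

def select_confident(ensemble_scores, conf_th=0.8, var_th=0.05):
    """Keep pseudo-labeled preference pairs whose ensemble of reward models
    agrees (high mean confidence) and is stable (low variance)."""
    mean = ensemble_scores.mean(axis=0)   # mean preference probability
    var = ensemble_scores.var(axis=0)
    conf = np.maximum(mean, 1 - mean)     # confidence in the majority label
    keep = (conf >= conf_th) & (var <= var_th)
    labels = (mean >= 0.5).astype(int)
    return keep, labels

scores = np.array([[0.95, 0.55, 0.10],    # 3 reward models x 3 pairs
                   [0.90, 0.45, 0.05],
                   [0.92, 0.80, 0.12]])
keep, labels = select_confident(scores)
print(keep.tolist(), labels.tolist())
```

The middle pair is dropped: its mean score sits near 0.5, so neither preference label is trustworthy enough to train on.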
+
+
+ The rapid evolution of large language models (LLMs) has unlocked their
+capabilities in advanced reasoning tasks like mathematical problem-solving,
+code generation, and legal analysis. Central to this progress are
+inference-time reasoning algorithms, which refine outputs by exploring multiple
+solution paths, at the cost of increasing compute demands and response
+latencies. Existing serving systems fail to adapt to the scaling behaviors of
+these algorithms or the varying difficulty of queries, leading to inefficient
+resource use and unmet latency targets.
+ We present Dynasor, a system that optimizes inference-time compute for LLM
+reasoning queries. Unlike traditional engines, Dynasor tracks and schedules
+requests within reasoning queries and uses Certaindex, a proxy that measures
+statistical reasoning progress based on model certainty, to guide compute
+allocation dynamically. Dynasor co-adapts scheduling with reasoning progress:
+it allocates more compute to hard queries, reduces compute for simpler ones,
+and terminates unpromising queries early, balancing accuracy, latency, and
+cost. On diverse datasets and algorithms, Dynasor reduces compute by up to 50%
+in batch processing and sustains 3.3x higher query rates or 4.7x tighter
+latency SLOs in online serving.
+
+
+
+
+
+
+
+ ☆ Verified Lifting of Deep Learning Operators
+
+
+ Deep learning operators are fundamental components of modern deep learning
+frameworks. With the growing demand for customized operators, it has become
+increasingly common for developers to create their own. However, designing and
+implementing operators is complex and error-prone, due to hardware-specific
+optimizations and the need for numerical stability. There is a pressing need
+for tools that can summarize the functionality of both existing and
+user-defined operators. To address this gap, this work introduces a novel
+framework for the verified lifting of deep learning operators, which
+synthesizes high-level mathematical formulas from low-level implementations.
+Our approach combines symbolic execution, syntax-guided synthesis, and
+SMT-based verification to produce readable and formally verified mathematical
+formulas. In synthesis, we employ a combination of top-down and bottom-up
+strategies to explore the vast search space efficiently; in verification, we
+design invariant synthesis patterns and leverage SMT solvers to validate the
+correctness of the derived summaries; in simplification, we use egraph-based
+techniques with custom rules to restore complex formulas to their natural,
+intuitive forms. Evaluated on a dataset of real-world deep learning operators
+implemented in Triton, our method demonstrates the effectiveness of
+synthesis and verification compared to existing techniques. This framework
+bridges the gap between low-level implementations and high-level abstractions,
+improving understanding and reliability in deep learning operator development.
+
+
+
+
+
+
+
+
+ Mohamed Djilani, Salah Ghamizi, Maxime Cordy
+
+
+ Although adversarial robustness has been extensively studied in white-box
+settings, recent advances in black-box attacks (including transfer- and
+query-based approaches) are primarily benchmarked against weak defenses,
+leaving a significant gap in the evaluation of their effectiveness against more
+recent and moderate robust models (e.g., those featured in the Robustbench
+leaderboard). In this paper, we question this lack of attention from black-box
+attacks to robust models. We establish a framework to evaluate the
+effectiveness of recent black-box attacks against both top-performing and
+standard defense mechanisms, on the ImageNet dataset. Our empirical evaluation
+reveals the following key findings: (1) the most advanced black-box attacks
+struggle to succeed even against simple adversarially trained models; (2)
+robust models that are optimized to withstand strong white-box attacks, such as
+AutoAttack, also exhibit enhanced resilience against black-box attacks; and
+(3) robustness alignment between the surrogate models and the target model
+is a key factor in the success rate of transfer-based attacks.
+
+
+
+
+
+
+
+ ☆ AlignAb: Pareto-Optimal Energy Alignment for Designing Nature-Like
+ Antibodies
+
+
+
+
+
+
+
+
+ Yibo Wen, Chenwei Xu, Jerry Yao-Chieh Hu, Han Liu
+
+
+ We present a three-stage framework for training deep learning models
+specializing in antibody sequence-structure co-design. We first pre-train a
+language model using millions of antibody sequence data. Then, we employ the
+learned representations to guide the training of a diffusion model for joint
+optimization over both sequence and structure of antibodies. During the final
+alignment stage, we optimize the model to favor antibodies with low repulsion
+and high attraction to the antigen binding site, enhancing the rationality and
+functionality of the designs. To mitigate conflicting energy preferences, we
+extend AbDPO (Antibody Direct Preference Optimization) to guide the model
+towards Pareto optimality under multiple energy-based alignment objectives.
+Furthermore, we adopt an iterative learning paradigm with temperature scaling,
+enabling the model to benefit from diverse online datasets without requiring
+additional data. In practice, our proposed methods achieve high stability and
+efficiency in producing a better Pareto front of antibody designs compared to
+top samples generated by baselines and previous alignment techniques. Through
+extensive experiments, we showcase the superior performance of our methods in
+generating nature-like antibodies with high binding affinity consistently.
+
+
+
+
+
+
+
+
+ Yuan Mi, Pu Ren, Hongteng Xu, Hongsheng Liu, Zidong Wang, Yike Guo, Ji-Rong Wen, Hao Sun, Yang Liu
+
+
+ Data-centric methods have shown great potential in understanding and
+predicting spatiotemporal dynamics, enabling better design and control of the
+object system. However, pure deep learning models often lack interpretability,
+fail to obey intrinsic physics, and struggle to generalize across domains.
+While geometry-based methods, e.g., graph neural networks (GNNs), have been
+proposed to further tackle these challenges, they still need to find the
+implicit physical laws from large datasets and rely excessively on rich labeled
+data. In this paper, we introduce the conservation-informed GNN (CiGNN),
+an end-to-end explainable learning framework, to learn spatiotemporal dynamics
+based on limited training data. The network is designed to conform to the
+general conservation law via symmetry, where conservative and non-conservative
+information passes over a multiscale space enhanced by a latent temporal
+marching strategy. The efficacy of our model has been verified in various
+spatiotemporal systems based on synthetic and real-world datasets, showing
+superiority over baseline models. Results demonstrate that CiGNN exhibits
+remarkable accuracy and generalization ability, and is readily applicable to
+learning for prediction of various spatiotemporal dynamics in a spatial domain
+with complex geometry.
+
+
+
+
+
+
+
+ ☆ Generalizing in Net-Zero Microgrids: A Study with Federated PPO and TRPO
+
+
+
+
+
+
+
+
+ Nicolas M Cuadrado Avila, Samuel Horváth, Martin Takáč
+
+
+ This work addresses the challenge of optimal energy management in microgrids
+through a collaborative and privacy-preserving framework. We propose the
+FedTRPO methodology, which integrates Federated Learning (FL) and Trust Region
+Policy Optimization (TRPO) to manage distributed energy resources (DERs)
+efficiently. Using a customized version of the CityLearn environment and
+synthetically generated data, we simulate designed net-zero energy scenarios
+for microgrids composed of multiple buildings. Our approach emphasizes reducing
+energy costs and carbon emissions while ensuring privacy. Experimental results
+demonstrate that FedTRPO is comparable with state-of-the-art federated RL
+methodologies without hyperparameter tuning. The proposed framework highlights
+the feasibility of collaborative learning for achieving optimal control
+policies in energy systems, advancing the goals of sustainable and efficient
+smart grids.
+
+
+
+ comment: Submitted to Environmental Data Science Journal from Cambridge
+ University Press
+
+
+
+
+
+
+ ☆ Active Learning with Variational Quantum Circuits for Quantum Process
+ Tomography
+
+
+ Quantum process tomography (QPT), used for reconstruction of an unknown
+quantum process from measurement data, is a fundamental tool for the diagnostic
+and full characterization of quantum systems. It relies on querying a set of
+quantum states as input to the quantum process. Previous works commonly use a
+straightforward strategy to select a set of quantum states randomly,
+overlooking differences in informativeness among quantum states. Since querying
+the quantum system requires multiple experiments that can be prohibitively
+costly, there are often not enough quantum states for
+high-quality reconstruction. In this paper, we propose a general framework for
+active learning (AL) to adaptively select a set of informative quantum states
+that improves the reconstruction most efficiently. In particular, we introduce
+a learning framework that leverages the widely-used variational quantum
+circuits (VQCs) to perform the QPT task and integrate our AL algorithms into
+the query step. We design and evaluate three types of AL algorithms:
+committee-based, uncertainty-based, and diversity-based, each exhibiting
+distinct advantages in terms of performance and computational cost.
+Additionally, we provide a guideline for selecting algorithms suitable for
+different scenarios. Numerical results demonstrate that our algorithms achieve
+significantly improved reconstruction compared to the baseline method that
+selects a set of quantum states randomly. Moreover, these results suggest that
+active learning based approaches are applicable to other complicated learning
+tasks in large-scale quantum information processing.
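Of the three algorithm families, the committee-based criterion is the simplest to sketch: query the candidate input state on which an ensemble of models disagrees most. The arrays below stand in for VQC predictions, and the function names are hypothetical:

```python
import numpy as np

def committee_disagreement(predictions):
    """Summed variance across committee members' predicted outputs; candidate
    input states with large disagreement are the most informative to query."""
    return predictions.var(axis=0).sum(axis=-1)

def select_queries(predictions, k=1):
    scores = committee_disagreement(predictions)
    return np.argsort(scores)[-k:][::-1]  # top-k most-disputed candidates

# 3 committee models x 4 candidate input states x 2 output features
rng = np.random.default_rng(0)
preds = rng.normal(scale=0.1, size=(3, 4, 2))
preds[0, 3] += 3.0                        # models disagree strongly on state 3
preds[1, 3] -= 3.0
picked = select_queries(preds, k=1)
print(picked[0])                          # state 3 is queried next
```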
+
+
+
+
+
+
+
+
+ Yang Chen, Chih-Li Sung, Arpan Kusari, Xiaoyang Song, Wenbo Sun
+
+
+ Deep neural networks (DNNs) are often constructed under the closed-world
+assumption, which may fail to generalize to the out-of-distribution (OOD) data.
+This leads to DNNs producing overconfident wrong predictions and can result in
+disastrous consequences in safety-critical applications. Existing OOD detection
+methods mainly rely on curating a set of OOD data for model training or
+hyper-parameter tuning to distinguish OOD data from training data (also known
+as in-distribution data or InD data). However, OOD samples are not always
+available during the training phase in real-world applications, hindering the
+OOD detection accuracy. To overcome this limitation, we propose a
+Gaussian-process-based OOD detection method to establish a decision boundary
+based on InD data only. The basic idea is to perform uncertainty quantification
+of the unconstrained softmax scores of a DNN via a multi-class Gaussian process
+(GP), and then define a score function to separate InD and potential OOD data
+based on their fundamental differences in the posterior predictive distribution
+from the GP. Two case studies on conventional image classification datasets and
+real-world image datasets are conducted to demonstrate that the proposed method
+outperforms the state-of-the-art OOD detection methods when OOD samples are not
+observed in the training phase.
+
+
+
+
+
+
+
+ ☆ DDIM sampling for Generative AIBIM, a faster intelligent structural
+ design framework
+
+
+ Generative AIBIM, a successful structural design pipeline, has proven its
+ability to intelligently generate high-quality, diverse, and creative shear
+wall designs that are tailored to specific physical conditions. However, the
+current module of Generative AIBIM that generates designs, known as the
+physics-based conditional diffusion model (PCDM), necessitates 1000 iterations
+for each generation due to its reliance on the denoising diffusion
+probabilistic model (DDPM) sampling process. This leads to a time-consuming and
+computationally demanding generation process. To address this issue, this study
+introduces the denoising diffusion implicit model (DDIM), an accelerated
+generation method that replaces the DDPM sampling process in PCDM. Because the
+original DDIM was designed for DDPM, and the optimization process of PCDM
+differs from that of DDPM, this paper designs "DDIM sampling for PCDM," which
+modifies the original DDIM formulations to adapt to the optimization process of
+PCDM. Experimental results demonstrate that DDIM sampling for PCDM can
+accelerate the generation process of the original PCDM by a factor of 100 while
+maintaining the same visual quality in the generated results. This study
+showcases the effectiveness of DDIM sampling for PCDM in expediting
+intelligent structural design. Furthermore, this paper reorganizes the contents
+of DDIM, focusing on the practical usage of DDIM. This change is particularly
+meaningful for researchers who may not possess a strong background in machine
+learning theory but are interested in utilizing the tool effectively.
+
+
+
+ comment: the 10th International Conference on Innovative Production and
+ Construction (IPC 2024), Perth, Australia. https://ipcannual.com/proceedings/
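The paper's "DDIM sampling for PCDM" modifies the original DDIM formulations in ways the abstract does not spell out. As background, the standard deterministic DDIM update (eta = 0) that it adapts looks like this, with `eps_pred` standing in for the network's noise prediction:

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0).

    x_t:                current noisy sample
    eps_pred:           model's noise estimate at step t
    abar_t, abar_prev:  cumulative alpha-bar at the current / previous step
    """
    # Predict the clean sample implied by the noise estimate.
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Jump directly to the previous timestep along the deterministic trajectory.
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps_pred

# Sanity check: if eps_pred equals the exact noise used to corrupt x0, the
# step lands exactly on the forward-process marginal at the previous timestep.
rng = np.random.default_rng(0)
x0 = rng.normal(size=4)
eps = rng.normal(size=4)
abar_t, abar_prev = 0.5, 0.8
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
x_prev = ddim_step(x_t, eps, abar_t, abar_prev)
```

Because each step can jump across many DDPM timesteps, a 1000-iteration DDPM schedule can be traversed in a handful of such updates, which is the source of the reported speedup.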
+
+
+
+
+
+
+ ☆ Towards Compatible Fine-tuning for Vision-Language Model Updates
+
+
+
+
+
+
+
+
+ Zhengbo Wang, Jian Liang, Lijun Sheng, Ran He, Zilei Wang, Tieniu Tan
+
+
+ Efficient fine-tuning has become a popular strategy for enhancing the
+capabilities of foundation models on downstream tasks by learning plug-and-play
+modules. However, existing methods overlook a crucial issue: if the underlying
+foundation model is updated, are these plug-and-play modules still effective?
+In this paper, we first conduct a detailed analysis of various fine-tuning
+methods on CLIP in terms of their compatibility with model updates. The
+study reveals that many high-performing fine-tuning methods fail to be
+compatible with the upgraded models. To address this, we propose a novel
+approach, Class-conditioned Context Optimization (ContCoOp), which integrates
+learnable prompts with class embeddings using an attention layer before
+inputting them into the text encoder. Consequently, the prompts can dynamically
+adapt to the changes in embedding space (due to model updates), ensuring
+continued effectiveness. Extensive experiments over 15 datasets show that our
+ContCoOp achieves the highest compatibility over the baseline methods, and
+exhibits robust out-of-distribution generalization.
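The abstract describes fusing learnable prompts with class embeddings through an attention layer before the text encoder. A minimal single-head cross-attention sketch of that fusion (the dimensions, residual connection, and softmax form below are assumptions, not taken from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def condition_prompts(prompts, class_emb, Wq, Wk, Wv):
    """Let each learnable prompt token attend to the class embedding(s)."""
    q = prompts @ Wq                       # (n_prompts, d)
    k = class_emb @ Wk                     # (n_classes, d)
    v = class_emb @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return prompts + attn @ v              # residual fusion before the text encoder

rng = np.random.default_rng(1)
d = 8
prompts = rng.normal(size=(4, d))          # learnable context tokens
class_emb = rng.normal(size=(3, d))        # class-name embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = condition_prompts(prompts, class_emb, Wq, Wk, Wv)
```

Because the prompts are recomputed as a function of the current class embeddings, they can track shifts in the embedding space when the underlying CLIP model is upgraded.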
+
+
+
+
+
+
+
+
+ Freddie Bickford Smith, Jannik Kossen, Eleanor Trollope, Mark van der Wilk, Adam Foster, Tom Rainforth
+
+
+ The ideas of aleatoric and epistemic uncertainty are widely used to reason
+about the probabilistic predictions of machine-learning models. We identify
+incoherence in existing discussions of these ideas and suggest this stems from
+the aleatoric-epistemic view being insufficiently expressive to capture all of
+the distinct quantities that researchers are interested in. To explain and
+address this we derive a simple delineation of different model-based
+uncertainties and the data-generating processes associated with training and
+evaluation. Using this in place of the aleatoric-epistemic view could produce
+clearer discourse as the field moves forward.
+
+
+
+ comment: Presented at the Workshop on Bayesian Decision-Making and Uncertainty
+ (NeurIPS 2024)
+
+
+
+
+
+
+ ☆ DoTA: Weight-Decomposed Tensor Adaptation for Large Language Models
+
+
+ Low-rank adaptation (LoRA) reduces the computational and memory demands of
+fine-tuning large language models (LLMs) by approximating updates with low-rank
+matrices. However, low-rank approximation in two-dimensional space fails to
+capture high-dimensional structures within the target matrix. Recently, tensor
+decomposition methods have been explored for fine-tuning LLMs, leveraging their
+ability to extract structured information. Yet, these approaches primarily rely
+on random initialization, and the impact of initialization on tensor adaptation
+remains underexplored. In this paper, we reveal that random initialization
+leads to a validation loss that diverges significantly from that of full fine-tuning.
+To address this, we propose Weight-Decomposed Tensor Adaptation (DoTA), which
+leverages the Matrix Product Operator (MPO) decomposition of pre-trained
+weights for effective initialization in fine-tuning LLMs. Additionally, we
+introduce QDoTA, a quantized version of DoTA designed for 4-bit quantization.
+Experiments on commonsense and arithmetic reasoning tasks show that DoTA
+outperforms random initialization methods with fewer parameters. QDoTA further
+reduces memory consumption and achieves comparable performance to DoTA on
+commonsense reasoning tasks. We will release our code to support future
+research.
+
+
+
+ comment: 12 pages, 6 figures
+
+
+
+
+
+
+ ☆ CF-CGN: Channel Fingerprints Extrapolation for Multi-band Massive MIMO
+ Transmission based on Cycle-Consistent Generative Networks
+
+
+ Multi-band massive multiple-input multiple-output (MIMO) communication can
+promote the cooperation of licensed and unlicensed spectra, effectively
+enhancing spectrum efficiency for Wi-Fi and other wireless systems. As an
+enabler for multi-band transmission, channel fingerprints (CF), also known as
+the channel knowledge map or radio environment map, are used to assist channel
+state information (CSI) acquisition and reduce computational complexity. In
+this paper, we propose CF-CGN (Channel Fingerprints with Cycle-consistent
+Generative Networks) to extrapolate CF for multi-band massive MIMO transmission
+where licensed and unlicensed spectra cooperate to provide ubiquitous
+connectivity. Specifically, we first model CF as a multichannel image and
+transform the extrapolation problem into an image translation task, which
+converts CF from one frequency to another by exploring the shared
+characteristics of statistical CSI in the beam domain. Then, paired generative
+networks are designed and coupled by variable-weight cycle consistency losses
+to fit the reciprocal relationship at different bands. Matched with the coupled
+networks, a joint training strategy is developed accordingly, supporting
+synchronous optimization of all trainable parameters. During the inference
+process, we also introduce a refining scheme to improve the extrapolation
+accuracy based on the resolution of CF. Numerical results illustrate that our
+proposed CF-CGN can achieve bidirectional extrapolation with an error of 5-17
+dB lower than the benchmarks in different communication scenarios,
+demonstrating its excellent generalization ability. We further show that the
+sum rate performance assisted by CF-CGN-based CF is close to that with perfect
+CSI for multi-band massive MIMO transmission.
+
+
+
+ comment: 13 pages, 12 figures
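The variable-weight cycle-consistency coupling can be illustrated with toy "generators"; the linear maps and weights below are placeholders, not the paper's networks.

```python
import numpy as np

def cycle_loss(G_ab, G_ba, x_a, x_b, w_a=1.0, w_b=1.0):
    """Variable-weight cycle consistency: A->B->A and B->A->B reconstructions."""
    loss_a = np.abs(G_ba(G_ab(x_a)) - x_a).mean()   # forward cycle
    loss_b = np.abs(G_ab(G_ba(x_b)) - x_b).mean()   # backward cycle
    return w_a * loss_a + w_b * loss_b

# Toy generators that are exact inverses, so the cycle loss vanishes.
G_ab = lambda x: 2.0 * x + 1.0
G_ba = lambda y: (y - 1.0) / 2.0
x_a = np.linspace(-1, 1, 5)
x_b = np.linspace(0, 3, 5)
loss = cycle_loss(G_ab, G_ba, x_a, x_b, w_a=0.5, w_b=2.0)
```

In CF-CGN the two generators translate channel-fingerprint "images" between frequency bands, and the weights `w_a`, `w_b` are the variable coefficients that balance the two cycle directions during joint training.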
+
+
+
+
+
+
+ ☆ Machine Learning of Slow Collective Variables and Enhanced Sampling via
+ Spatial Techniques
+
+
+ Understanding the long-time dynamics of complex physical processes depends on
+our ability to recognize patterns. To simplify the description of these
+processes, we often introduce a set of reaction coordinates, customarily
+referred to as collective variables (CVs). The quality of these CVs heavily
+impacts our comprehension of the dynamics, often influencing the estimates of
+thermodynamics and kinetics from atomistic simulations. Consequently,
+identifying CVs poses a fundamental challenge in chemical physics. Recently,
+significant progress was made by leveraging the predictive ability of
+unsupervised machine learning techniques to determine CVs. Many of these
+techniques require temporal information to learn slow CVs that correspond to
+the long timescale behavior of the studied process. Here, however, we
+specifically focus on techniques that can identify CVs corresponding to the
+slowest transitions between states without needing temporal trajectories as
+input, instead using the spatial characteristics of the data. We discuss the
+latest developments in this category of techniques and briefly discuss
+potential directions for thermodynamics-informed spatial learning of slow CVs.
+
+
+ This work proposes a novel approach to enhancing annotated bibliography
+generation through Large Language Model (LLM) ensembles. In particular,
+multiple LLMs in different roles -- controllable text generation, evaluation,
+and summarization -- are introduced and validated using a systematic
+methodology to enhance model performance in scholarly tasks. Output diversity
+among the ensemble that generates text is obtained using different LLM
+parameters, followed by an LLM acting as a judge to assess relevance, accuracy,
+and coherence. Responses selected by several combining strategies are then
+merged and refined through summarization and redundancy removal techniques. The
+preliminary experimental validation demonstrates that the combined outputs from
+the LLM ensemble improve coherence and relevance compared to individual
+responses, leading to a 38% improvement in annotation quality and a 51%
+reduction in content redundancy, thus highlighting the potential for automating
+complex scholarly tasks while maintaining high-quality standards.
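The generate / judge / merge pipeline described above can be sketched with stub functions standing in for actual LLM calls; every function here is hypothetical, since the abstract names no models or prompts.

```python
def generate(topic, temperature):
    """Stub for an LLM annotation call; temperature varies ensemble diversity."""
    return f"annotation of {topic} (T={temperature})"

def judge(candidates):
    """Stub LLM-as-judge: score each candidate (here, trivially, by length)."""
    return [len(c) for c in candidates]

def merge(selected):
    """Stub summarizer: deduplicate and join the selected responses."""
    return " | ".join(dict.fromkeys(selected))

def ensemble_annotate(topic, temperatures, top_k=2):
    candidates = [generate(topic, t) for t in temperatures]   # diverse generation
    scores = judge(candidates)                                # judge step
    ranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
    return merge(ranked[:top_k])                              # merge + dedupe

result = ensemble_annotate("InfAlign", temperatures=[0.2, 0.7, 1.0])
```

A real implementation would replace the three stubs with API calls and a rubric-based judging prompt; only the control flow (generate with varied parameters, rank, combine, deduplicate) is taken from the abstract.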
+
+
+
+
+
+
+
+ ☆ About rectified sigmoid function for enhancing the accuracy of
+ Physics-Informed Neural Networks
+
+
+
+
+
+
+
+
+ Vasiliy A. Es'kin, Alexey O. Malkhanov, Mikhail E. Smorkalov
+
+
+ The article is devoted to the study of neural networks with one hidden layer
+and a modified activation function for solving physical problems. A rectified
+sigmoid activation function has been proposed to solve physical problems
+described by ODEs with neural networks. Algorithms for physics-informed
+data-driven initialization of a neural network and a neuron-by-neuron
+gradient-free fitting method have been presented for the neural network with
+this activation function. Numerical experiments demonstrate the superiority of
+neural networks with a rectified sigmoid function over neural networks with a
+sigmoid function in the accuracy of solving physical problems (harmonic
+oscillator, relativistic slingshot, and Lorentz system).
+
+
+
+ comment: 9 pages, 1 figure, 2 tables, 4 algorithms. arXiv admin note:
+ substantial text overlap with arXiv:2412.19235
+
+
+
+
+
+
+ ☆ Acquisition-Independent Deep Learning for Quantitative MRI Parameter
+ Estimation using Neural Controlled Differential Equations
+
+
+
+
+
+
+
+
+ Daan Kuppens, Sebastiano Barbieri, Daisy van den Berg, Pepijn Schouten, Harriet C. Thoeny, Myrte Wennen, Oliver J. Gurney-Champion
+
+
+ Deep learning has proven to be a suitable alternative to least-squares (LSQ)
+fitting for parameter estimation in various quantitative MRI (QMRI) models.
+However, current deep learning implementations are not robust to changes in MR
+acquisition protocols. In practice, QMRI acquisition protocols differ
+substantially between different studies and clinical settings. The lack of
+generalizability and adoptability of current deep learning approaches for QMRI
+parameter estimation impedes the implementation of these algorithms in clinical
+trials and clinical practice. Neural Controlled Differential Equations (NCDEs)
+allow for the sampling of incomplete and irregularly sampled data with variable
+length, making them ideal for use in QMRI parameter estimation. In this study,
+we show that NCDEs can function as a generic tool for the accurate prediction
+of QMRI parameters, regardless of QMRI sequence length, configuration of
+independent variables and QMRI forward model (variable flip angle T1-mapping,
+intravoxel incoherent motion MRI, dynamic contrast-enhanced MRI). NCDEs
+achieved lower mean squared error than LSQ fitting in low-SNR simulations and
+in vivo in challenging anatomical regions like the abdomen and leg, but this
+improvement was no longer evident at high SNR. NCDEs reduce estimation error
+interquartile range without increasing bias, particularly under conditions of
+high uncertainty. These findings suggest that NCDEs offer a robust approach for
+reliable QMRI parameter estimation, especially in scenarios with high
+uncertainty or low image quality. We believe that with NCDEs, we have solved
+one of the main challenges for using deep learning for QMRI parameter
+estimation in a broader clinical and research setting.
+
+
+
+
+
+
+
+
+ Shubh Singhal, Raül Pérez-Gonzalo, Andreas Espersen, Antonio Agudo
+
+
+ Accurate segmentation of wind turbine blade (WTB) images is critical for
+effective assessments, as it directly influences the performance of automated
+damage detection systems. Despite advancements in large universal vision
+models, these models often underperform in domain-specific tasks like WTB
+segmentation. To address this, we extend Intrinsic LoRA for image segmentation,
+and propose a novel dual-space augmentation strategy that integrates both
+image-level and latent-space augmentations. The image-space augmentation is
+achieved through linear interpolation between image pairs, while the
+latent-space augmentation is accomplished by introducing a noise-based latent
+probabilistic model. Our approach significantly boosts segmentation accuracy,
+surpassing current state-of-the-art methods in WTB image segmentation.
+
+
+
+ comment: Authors Shubh Singhal and Ra\"ul P\'erez-Gonzalo contributed equally
+ to this work. Accepted to ICASSP 2025
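The image-space half of the dual-space augmentation is plain linear interpolation (mixup-style) between image pairs; a minimal version:

```python
import numpy as np

def image_mixup(img1, img2, lam):
    """Linearly interpolate a pair of images; lam in [0, 1]."""
    return lam * img1 + (1.0 - lam) * img2

a = np.zeros((2, 2))
b = np.ones((2, 2))
mixed = image_mixup(a, b, lam=0.25)   # 25% of a, 75% of b
```

The latent-space half would instead perturb autoencoder latents with noise drawn from a learned probabilistic model; that model is not specified in the abstract.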
+
+
+
+
+
+
+ ☆ Isoperimetry is All We Need: Langevin Posterior Sampling for RL with
+ Sublinear Regret
+
+
+ In Reinforcement Learning (RL) theory, we impose restrictive assumptions to
+design an algorithm with provably sublinear regret. Common assumptions, like
+linear or RKHS models, and Gaussian or log-concave posteriors over the models,
+do not explain practical success of RL across a wider range of distributions
+and models. Thus, we study how to design RL algorithms with sublinear regret
+for isoperimetric distributions, specifically the ones satisfying the
+Log-Sobolev Inequality (LSI). LSI distributions include the standard setups of
+RL and others, such as many non-log-concave and perturbed distributions. First,
+we show that the Posterior Sampling-based RL (PSRL) yields sublinear regret if
+the data distributions satisfy LSI under some mild additional assumptions.
+Also, when we cannot compute or sample from an exact posterior, we propose a
+Langevin sampling-based algorithm design: LaPSRL. We show that LaPSRL achieves
+order optimal regret and subquadratic complexity per episode. Finally, we
+deploy LaPSRL with a Langevin sampler -- SARAH-LD, and test it for different
+bandit and MDP environments. Experimental results validate the generality of
+LaPSRL across environments and its competitive performance with respect to the
+baselines.
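LaPSRL's approximate posterior sampling rests on Langevin dynamics. One unadjusted Langevin step for a log-density with a known gradient looks like the following; the paper's SARAH-LD sampler is a variance-reduced refinement of this basic iteration.

```python
import numpy as np

def langevin_step(theta, grad_log_post, step, rng):
    """One unadjusted Langevin iteration targeting exp(log_post)."""
    noise = rng.normal(size=theta.shape)
    return theta + step * grad_log_post(theta) + np.sqrt(2.0 * step) * noise

# Toy posterior: standard Gaussian, so grad log p(theta) = -theta.
rng = np.random.default_rng(0)
theta = np.array([5.0])
for _ in range(2000):
    theta = langevin_step(theta, lambda t: -t, step=0.05, rng=rng)
```

Crucially, the iteration needs only gradients of the log-posterior, not an exact posterior or its normalizing constant, which is what makes it usable beyond conjugate Gaussian or log-concave settings.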
+
+
+
+
+
+
+
+ ☆ TimeRAF: Retrieval-Augmented Foundation model for Zero-shot Time Series
+ Forecasting
+
+
+ Time series forecasting plays a crucial role in data mining, driving rapid
+advancements across numerous industries. With the emergence of large models,
+time series foundation models (TSFMs) have exhibited remarkable generalization
+capabilities, such as zero-shot learning, through large-scale pre-training.
+Meanwhile, Retrieval-Augmented Generation (RAG) methods have been widely
+employed to enhance the performance of foundation models on unseen data,
+allowing models to access external knowledge. In this paper, we introduce
+TimeRAF, a Retrieval-Augmented Forecasting model that enhances zero-shot time
+series forecasting through retrieval-augmented techniques. We develop
+customized time series knowledge bases that are tailored to the specific
+forecasting tasks. TimeRAF employs an end-to-end learnable retriever to extract
+valuable information from the knowledge base. Additionally, we propose Channel
+Prompting for knowledge integration, which effectively extracts relevant
+information from the retrieved knowledge along the channel dimension. Extensive
+experiments demonstrate the effectiveness of our model, showing significant
+improvement across various domains and datasets.
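The abstract's retriever is end-to-end learnable and therefore not reproducible from the text alone, but a non-learned baseline retriever over a toy knowledge base, using cosine similarity between series embeddings, illustrates the retrieval step:

```python
import numpy as np

def top_k_retrieve(query_emb, base_embs, k):
    """Return indices of the k knowledge-base entries most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    b = base_embs / np.linalg.norm(base_embs, axis=1, keepdims=True)
    sims = b @ q                       # cosine similarity to each entry
    return np.argsort(sims)[-k:][::-1]

# Toy knowledge base of three embedded series and one embedded query.
base = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([0.9, 0.1])
idx = top_k_retrieve(query, base, k=2)
```

In TimeRAF the retrieved entries would then be fused into the forecaster along the channel dimension via the proposed Channel Prompting, and the retriever's parameters would be trained jointly with the forecasting loss.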
+
+
+
+
+
+
+
+ ☆ Robust Matrix Completion for Discrete Rating-Scale Data
+
+
+ Matrix completion has gained considerable interest in recent years. The goal
+of matrix completion is to predict the unknown entries of a partially observed
+matrix using its known entries. Although common applications feature discrete
+rating-scale data, such as user-product rating matrices in recommender systems
+or surveys in the social and behavioral sciences, methods for matrix completion
+are almost always designed for and studied in the context of continuous data.
+Furthermore, only a small subset of the literature considers matrix completion
+in the presence of corrupted observations despite their common occurrence in
+practice. Examples include attacks on recommender systems (i.e., malicious
+users deliberately manipulating ratings to influence the recommender system to
+their advantage), or careless respondents in surveys (i.e., respondents
+providing answers irrespective of what the survey asks of them due to a lack of
+attention). We introduce a matrix completion algorithm that is tailored towards
+the discrete nature of rating-scale data and robust to the presence of
+corrupted observations. In addition, we investigate the performance of the
+proposed method and its competitors with discrete rating-scale (rather than
+continuous) data as well as under various missing data mechanisms and types of
+corrupted observations.
+
+
+
+
+
+
+
+ ☆ FastCHGNet: Training one Universal Interatomic Potential to 1.5 Hours
+ with 32 GPUs
+
+
+ Graph neural network universal interatomic potentials (GNN-UIPs) have
+demonstrated remarkable generalization and transfer capabilities in material
+discovery and property prediction. These models can accelerate molecular
+dynamics (MD) simulation by several orders of magnitude while maintaining
+\textit{ab initio} accuracy, making them a promising new paradigm in material
+simulations. One notable example is Crystal Hamiltonian Graph Neural Network
+(CHGNet), pretrained on the energies, forces, stresses, and magnetic moments
+from the MPtrj dataset, representing a state-of-the-art GNN-UIP model for
+charge-informed MD simulations. However, training the CHGNet model is
+time-consuming (8.3 days on one A100 GPU) for three reasons: (i) multi-layer
+propagation is required to reach information from more distant atoms, (ii)
+second-order derivatives must be computed to update the weights, and (iii) the
+reference implementation of CHGNet does not fully leverage the available
+computational capabilities. This paper introduces FastCHGNet, an optimized CHGNet, with three
+contributions: Firstly, we design innovative Force/Stress Readout modules to
+decompose Force/Stress prediction. Secondly, we adopt massive optimizations
+such as kernel fusion and redundancy bypass to exploit GPU computation power
+sufficiently. Finally, we extend CHGNet to support multiple GPUs and propose a
+load-balancing technique to enhance GPU utilization. Numerical results show
+that FastCHGNet reduces memory footprint by a factor of 3.59. The final
+training time of FastCHGNet can be decreased to \textbf{1.53 hours} on 32 GPUs
+without sacrificing model accuracy.
+
+
+
+
+
+
+
+ ☆ Frequency-Masked Embedding Inference: A Non-Contrastive Approach for
+ Time Series Representation Learning AAAI-2025
+
+
+ Contrastive learning underpins most current self-supervised time series
+representation methods. The strategy for constructing positive and negative
+sample pairs significantly affects the final representation quality. However,
+due to the continuous nature of time series semantics, the modeling approach of
+contrastive learning struggles to accommodate the characteristics of time
+series data. This results in issues such as difficulties in constructing hard
+negative samples and the potential introduction of inappropriate biases during
+positive sample construction. Although some recent works have developed several
+scientific strategies for constructing positive and negative sample pairs with
+improved effectiveness, they remain constrained by the contrastive learning
+framework. To fundamentally overcome the limitations of contrastive learning,
+this paper introduces Frequency-masked Embedding Inference (FEI), a novel
+non-contrastive method that completely eliminates the need for positive and
+negative samples. The proposed FEI constructs 2 inference branches based on a
+prompting strategy: 1) Using frequency masking as prompts to infer the
+embedding representation of the target series with missing frequency bands in
+the embedding space, and 2) Using the target series as prompts to infer its
+frequency masking embedding. In this way, FEI enables continuous semantic
+relationship modeling for time series. Experiments on 8 widely used time series
+datasets for classification and regression tasks, using linear evaluation and
+end-to-end fine-tuning, show that FEI significantly outperforms existing
+contrastive-based methods in terms of generalization. This study provides new
+insights into self-supervised representation learning for time series. The code
+is available at
+https://github.com/USTBInnovationPark/Frequency-masked-Embedding-Inference.
+
+
+
+ comment: This paper has been accepted by AAAI-2025 main track
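Frequency masking, the prompt FEI is built on, can be implemented with an FFT band mask; which bands the paper masks is not stated, so the band below is an arbitrary choice.

```python
import numpy as np

def frequency_mask(x, lo, hi):
    """Zero out FFT bins lo..hi-1 of a 1-D series and return the masked series."""
    spec = np.fft.rfft(x)
    spec[lo:hi] = 0.0
    return np.fft.irfft(spec, n=len(x))

# Series with two spectral components at 4 and 12 cycles per window.
t = np.arange(64)
x = np.sin(2 * np.pi * 4 * t / 64) + np.sin(2 * np.pi * 12 * t / 64)
masked = frequency_mask(x, lo=10, hi=20)   # removes the 12-cycle component
```

FEI's two branches then work in embedding space: one infers the embedding of the masked series from the original, the other infers the masked embedding from the original series, with no positive/negative pair construction in either direction.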
+
+
+
+
+
+
+ ☆ Accelerating Energy-Efficient Federated Learning in Cell-Free Networks
+ with Adaptive Quantization
+
+
+
+
+
+
+
+
+ Afsaneh Mahmoudi, Ming Xiao, Emil Björnson
+
+
+ Federated Learning (FL) enables clients to share learning parameters instead
+of local data, reducing communication overhead. Traditional wireless networks
+face latency challenges with FL. In contrast, Cell-Free Massive MIMO (CFmMIMO)
+can serve multiple clients on shared resources, boosting spectral efficiency
+and reducing latency for large-scale FL. However, clients' communication
+resource limitations can hinder the completion of the FL training. To address
+this challenge, we propose an energy-efficient, low-latency FL framework
+featuring optimized uplink power allocation for seamless client-server
+collaboration. Our framework employs an adaptive quantization scheme,
+dynamically adjusting bit allocation for local gradient updates to reduce
+communication costs. We formulate a joint optimization problem covering FL
+model updates, local iterations, and power allocation, solved using sequential
+quadratic programming (SQP) to balance energy and latency. Additionally,
+clients use the AdaDelta method for local FL model updates, enhancing local
+model convergence compared to standard SGD, and we provide a comprehensive
+analysis of FL convergence with AdaDelta local updates. Numerical results show
+that, within the same energy and latency budgets, our power allocation scheme
+outperforms the Dinkelbach and max-sum rate methods by increasing the test
+accuracy up to $7$\% and $19$\%, respectively. Moreover, for the three power
+allocation methods, our proposed quantization scheme outperforms AQUILA and LAQ
+by increasing test accuracy by up to $36$\% and $35$\%, respectively.
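The adaptive bit allocation itself comes from the joint SQP optimization, but the underlying uniform gradient quantizer whose bit-width it adjusts can be sketched as follows (the mid-rise uniform scheme is an illustrative choice, not necessarily the paper's exact quantizer):

```python
import numpy as np

def quantize(grad, bits):
    """Uniform quantization of a gradient vector to 2**bits levels."""
    lo, hi = grad.min(), grad.max()
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((grad - lo) / scale)      # integer codes sent over the uplink
    return q * scale + lo                  # dequantized values at the server

g = np.array([-1.0, -0.25, 0.4, 1.0])
g8 = quantize(g, bits=8)    # fine quantization: small error, more uplink bits
g2 = quantize(g, bits=2)    # coarse quantization: larger error, fewer bits
```

Adapting `bits` per round trades gradient fidelity against uplink energy and latency, which is exactly the degree of freedom the joint optimization exploits.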
+
+
+
+
+
+
+
+ ☆ Enhancing Privacy in Federated Learning through Quantum Teleportation
+ Integration
+
+
+ Federated learning enables collaborative model training across multiple
+clients without sharing raw data, thereby enhancing privacy. However, the
+exchange of model updates can still expose sensitive information. Quantum
+teleportation, a process that transfers quantum states between distant
+locations without physical transmission of the particles themselves, has
+recently been implemented in real-world networks. This position paper explores
+the potential of integrating quantum teleportation into federated learning
+frameworks to bolster privacy. By leveraging quantum entanglement and the
+no-cloning theorem, quantum teleportation ensures that data remains secure
+during transmission, as any eavesdropping attempt would be detectable. We
+propose a novel architecture where quantum teleportation facilitates the secure
+exchange of model parameters and gradients among clients and servers. This
+integration aims to mitigate risks associated with data leakage and adversarial
+attacks inherent in classical federated learning setups. We also discuss the
+practical challenges of implementing such a system, including the current
+limitations of quantum network infrastructure and the need for hybrid
+quantum-classical protocols. Our analysis suggests that, despite these
+challenges, the convergence of quantum communication technologies and federated
+learning presents a promising avenue for achieving unprecedented levels of
+privacy in distributed machine learning.
+
+
+
+
+
+
+
+ ☆ Solar Filaments Detection using Active Contours Without Edges
+
+
+ In this article, an active contours without edges (ACWE)-based algorithm has
+been proposed for the detection of solar filaments in H-alpha full-disk solar
+images. The overall algorithm consists of three main steps of image processing.
+These are image pre-processing, image segmentation, and image post-processing.
+In this work, contours are initialized on the solar image and allowed to
+deform based on the energy function. As soon as the contour reaches the
+boundary of the desired object, the energy function gets reduced, and the
+contour stops evolving. The proposed algorithm has been applied to a few
+benchmark datasets and compared with a classical object-detection technique.
+The analysis indicates that the proposed algorithm outperforms the existing
+classical object-detection algorithm.
+
+
+ Parkinson's Disease (PD) is a degenerative neurological disorder that impairs
+motor and non-motor functions, significantly reducing quality of life and
+increasing mortality risk. Early and accurate detection of PD progression is
+vital for effective management and improved patient outcomes. Current
+diagnostic methods, however, are often costly, time-consuming, and require
+specialized equipment and expertise. This work proposes an innovative approach
+to predicting PD progression using regression methods, Long Short-Term Memory
+(LSTM) networks, and Kolmogorov Arnold Networks (KAN). KAN, utilizing
+spline-parametrized univariate functions, allows for dynamic learning of
+activation patterns, unlike traditional linear models.
+ The Movement Disorder Society-Sponsored Revision of the Unified Parkinson's
+Disease Rating Scale (MDS-UPDRS) is a comprehensive tool for evaluating PD
+symptoms and is commonly used to measure disease progression. Additionally,
+protein or peptide abnormalities are linked to PD onset and progression.
+Identifying these associations can aid in predicting disease progression and
+understanding molecular changes.
+ Comparing multiple models, including LSTM and KAN, this study aims to
+identify the method that delivers the highest metrics. The analysis reveals
+that KAN, with its dynamic learning capabilities, outperforms other approaches
+in predicting PD progression. This research highlights the potential of AI and
+machine learning in healthcare, paving the way for advanced computational
+models to enhance clinical predictions and improve patient care and treatment
+strategies in PD management.
+
+
+ In a decision-making scenario, a principal could use conditional predictions
+from an expert agent to inform their choice. However, this approach would
+introduce a fundamental conflict of interest. An agent optimizing for
+predictive accuracy is incentivized to manipulate their principal towards more
+predictable actions, which prevents that principal from being able to
+deterministically select their true preference. We demonstrate that this
+impossibility result can be overcome through the joint evaluation of multiple
+agents. When agents are made to engage in zero-sum competition, their incentive
+to influence the action taken is eliminated, and the principal can identify and
+take the action they most prefer. We further prove that this zero-sum setup is
+unique, efficiently implementable, and applicable under stochastic choice.
+Experiments in a toy environment demonstrate that training on a zero-sum
+objective significantly enhances both predictive accuracy and principal
+utility, and can eliminate previously learned manipulative behavior.
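A minimal version of the zero-sum evaluation: each agent's score is its accuracy advantage over the other, so the scores sum to zero by construction and neither agent can gain by steering the principal toward more predictable actions. The scoring rule here (negative squared error) is an illustrative choice, not necessarily the paper's.

```python
def zero_sum_scores(pred_a, pred_b, outcome):
    """Score each agent's prediction against the other's; scores sum to zero."""
    err_a = (pred_a - outcome) ** 2
    err_b = (pred_b - outcome) ** 2
    score_a = err_b - err_a          # agent A gains exactly what B loses
    return score_a, -score_a

s_a, s_b = zero_sum_scores(pred_a=0.8, pred_b=0.4, outcome=1.0)
```

Because any action-dependent term in the outcome distribution affects both agents' errors identically, it cancels in the difference, removing the incentive to manipulate which action the principal takes.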
+
+
+
+
+
+
+
+ ☆ AverageLinear: Enhance Long-Term Time series forecasting with simple
+ averaging
+
+
+
+
+
+
+
+
+ Gaoxiang Zhao, Li Zhou, Xiaoqiang Wang
+
+
+ Long-term time series analysis aims to forecast long-term trends by examining
+changes over past and future periods. The intricacy of time series data poses
+significant challenges for modeling. Models based on the Transformer
+architecture, through the application of attention mechanisms to channels and
+sequences, have demonstrated notable performance advantages. In contrast,
+methods based on convolutional neural networks or linear models often struggle
+to effectively handle scenarios with a large number of channels. However, our
+research reveals that the attention mechanism is not the core component
+responsible for performance enhancement. We have designed an exceedingly simple
+linear structure AverageLinear. By employing straightforward channel embedding
+and averaging operations, this model can effectively capture correlations
+between channels while maintaining a lightweight architecture. Experiments on
+real-world datasets show that AverageLinear matches or even surpasses
+state-of-the-art Transformer-based structures in performance. This indicates
+that using purely linear structures can also endow models with robust
+predictive power.
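The description (channel embedding plus averaging, followed by a linear head) suggests a structure like the sketch below; the exact architecture is not given in the abstract, so the composition and the mixing weight are assumptions.

```python
import numpy as np

def average_linear_forecast(history, W, alpha=0.5):
    """Toy AverageLinear-style forecaster.

    history: (channels, lookback) past values
    W:       (lookback, horizon) linear map shared across channels
    alpha:   mixing weight between each channel and the cross-channel average
    """
    mean = history.mean(axis=0, keepdims=True)       # cross-channel average
    mixed = alpha * history + (1 - alpha) * mean     # cheap channel interaction
    return mixed @ W                                 # per-channel linear head

rng = np.random.default_rng(2)
history = rng.normal(size=(3, 16))      # 3 channels, 16-step lookback
W = rng.normal(size=(16, 4)) * 0.1      # forecast a 4-step horizon
forecast = average_linear_forecast(history, W)
```

The averaging operation is the only cross-channel interaction, which is what keeps the model linear and lightweight while still capturing inter-channel correlation.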
+
+
+
+
+
+
+
+ ☆ Training Deep Neural Classifiers with Soft Diamond Regularizers
+
+
+ We introduce new \emph{soft diamond} regularizers that both improve synaptic
+sparsity and maintain classification accuracy in deep neural networks. These
+parametrized regularizers outperform the state-of-the-art hard-diamond
+Laplacian regularizer of Lasso regression and classification. They use
+thick-tailed symmetric alpha-stable ($\mathcal{S \alpha S}$) bell-curve
+synaptic weight priors that are not Gaussian and so have thicker tails. The
+geometry of the diamond-shaped constraint set varies from a circle to a star
+depending on the tail thickness and dispersion of the prior probability density
+function. Training directly with these priors is computationally intensive
+because almost all $\mathcal{S \alpha S}$ probability densities lack a closed
+form. A precomputed look-up table removed this computational bottleneck. We
+tested the new soft diamond regularizers with deep neural classifiers on the
+three datasets CIFAR-10, CIFAR-100, and Caltech-256. The regularizers improved
+the accuracy of the classifiers. The improvements included $4.57\%$ on
+CIFAR-10, $4.27\%$ on CIFAR-100, and $6.69\%$ on Caltech-256. They also
+outperformed $L_2$ regularizers on all the test cases. Soft diamond
+regularizers also outperformed $L_1$ lasso or Laplace regularizers because they
+better increased sparsity while improving classification accuracy. Soft-diamond
+priors substantially improved accuracy on CIFAR-10 when combined with dropout,
+batch, or data-augmentation regularization.
+
+
+
+ comment: 8 pages, 10 figures
+
+
+
+
+
+
+ ☆ HFI: A unified framework for training-free detection and implicit
+ watermarking of latent diffusion model generated images
+
+
+
+
+
+
+
+
+ Sungik Choi, Sungwoo Park, Jaehoon Lee, Seunghyun Kim, Stanley Jungkyu Choi, Moontae Lee
+
+
+ Dramatic advances in the quality of latent diffusion models (LDMs) have also
+led to the malicious use of AI-generated images. While current AI-generated
+image detection methods assume the availability of real/AI-generated images for
+training, this is practically limited given the vast expressibility of LDMs.
+This motivates the training-free detection setup where no related data are
+available in advance. The existing LDM-generated image detection method assumes
+that images generated by LDM are easier to reconstruct using an autoencoder
+than real images. However, we observe that this reconstruction distance is
+overfitted to background information, leading the current method to
+underperform in detecting images with simple backgrounds. To address this, we
+propose a novel method called HFI. Specifically, by viewing the autoencoder of
+LDM as a downsampling-upsampling kernel, HFI measures the extent of aliasing, a
+distortion of high-frequency information that appears in the reconstructed
+image. HFI is training-free, efficient, and consistently outperforms other
+training-free methods in detecting challenging images generated by various
+generative models. We also show that HFI can successfully detect the images
+generated from the specified LDM as a means of implicit watermarking. HFI
+outperforms the best baseline method while achieving magnitudes of
+
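The aliasing measurement described above can be illustrated on a toy image. The sketch below stands in for HFI under loud assumptions: the autoencoder is replaced by a 2x box downsample followed by nearest-neighbor upsampling, and "high-frequency information" is taken as FFT energy outside an assumed radial cutoff; the paper's actual operators and score differ.

```python
import numpy as np

def down_up(img):
    # Stand-in for the LDM autoencoder viewed as a downsampling-upsampling kernel.
    small = img.reshape(img.shape[0] // 2, 2, img.shape[1] // 2, 2).mean(axis=(1, 3))
    return np.kron(small, np.ones((2, 2)))

def high_freq_energy(img, cutoff=0.25):
    # Energy of spatial frequencies beyond an (assumed) normalized cutoff.
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[-h // 2:h // 2, -w // 2:w // 2]
    mask = np.sqrt((yy / h) ** 2 + (xx / w) ** 2) > cutoff
    return float(np.abs(f[mask]).sum())

def hfi_score(img):
    # How much high-frequency content the down-up kernel fails to reproduce.
    return high_freq_energy(img - down_up(img))

rng = np.random.default_rng(0)
noisy = rng.normal(size=(32, 32))   # rich high-frequency content
smooth = down_up(noisy)             # already band-limited by the kernel
score_noisy, score_smooth = hfi_score(noisy), hfi_score(smooth)
```

An image that already lies in the kernel's reproducible band scores near zero, while an image with fine detail scores high, which is the separation the detector exploits.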
+
+
+
+
+
+
+ ☆ Learning to Rank Pre-trained Vision-Language Models for Downstream Tasks
+
+
+ Vision language models (VLMs) like CLIP show stellar zero-shot capability on
+classification benchmarks. However, selecting the VLM with the highest
+performance on the unlabeled downstream task is non-trivial. Existing VLM
+selection methods focus on the class-name-only setting, relying on a supervised
+large-scale dataset and large language models, which may not be accessible or
+feasible during deployment. This paper introduces the problem of
+\textbf{unsupervised vision-language model selection}, where only unsupervised
+downstream datasets are available, with no additional information provided. To
+solve this problem, we propose a method termed Visual-tExtual Graph Alignment
+(VEGA), to select VLMs without any annotations by measuring the alignment of
+the VLM between the two modalities on the downstream task. VEGA is motivated by
+the pretraining paradigm of VLMs, which aligns features with the same semantics
+from the visual and textual modalities, thereby mapping both modalities into a
+shared representation space. Specifically, we first construct two graphs on the
+vision and textual features, respectively. VEGA is then defined as the overall
+similarity between the visual and textual graphs at both node and edge levels.
+Extensive experiments across three different benchmarks, covering a variety of
+application scenarios and downstream datasets, demonstrate that VEGA
+consistently provides reliable and accurate estimates of VLMs' performance on
+unlabeled downstream tasks.
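The node-and-edge alignment score can be sketched concretely. Everything below is an illustrative assumption rather than VEGA's exact formulation: synthetic vectors stand in for VLM features, the graphs are directed k-NN graphs, node alignment is the mean cosine between paired features, and the two levels are averaged with equal weight.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def knn_adjacency(feats, k=2):
    # Directed k-nearest-neighbor graph over the feature set.
    sims = feats @ feats.T
    np.fill_diagonal(sims, -np.inf)
    adj = np.zeros_like(sims, dtype=bool)
    for i, row in enumerate(sims):
        adj[i, np.argsort(row)[-k:]] = True
    return adj

def vega_score(vis, txt, k=2):
    # Node level: do paired visual/textual features point the same way?
    node = np.mean([cosine(v, t) for v, t in zip(vis, txt)])
    # Edge level: do the two graphs agree on which samples are neighbors?
    edge = (knn_adjacency(vis, k) == knn_adjacency(txt, k)).mean()
    return 0.5 * (node + edge)   # assumed equal weighting of the two levels

rng = np.random.default_rng(0)
vis = rng.normal(size=(10, 16))
aligned = vega_score(vis, vis + 0.01 * rng.normal(size=vis.shape))
misaligned = vega_score(vis, rng.normal(size=vis.shape))
```

A well-aligned VLM produces modality graphs that agree at both levels and scores high; a poorly aligned one does not, which is what lets the score rank models without labels.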
+
+
+
+
+
+
+
+ ☆ Differentiable Convex Optimization Layers in Neural Architectures:
+ Foundations and Perspectives
+
+
+ The integration of optimization problems within neural network architectures
+represents a fundamental shift from traditional approaches to handling
+constraints in deep learning. While it has long been known that neural networks
+incorporate soft constraints with techniques such as regularization, strict
+adherence to hard constraints is generally more difficult. A recent advance in
+this field, however, has addressed this problem by enabling the direct
+embedding of optimization layers as differentiable components within deep
+networks. This paper surveys the evolution and current state of this approach,
+from early implementations limited to quadratic programming, to more recent
+frameworks supporting general convex optimization problems. We provide a
+comprehensive review of the background, theoretical foundations, and emerging
+applications of this technology. Our analysis includes detailed mathematical
+proofs and an examination of various use cases that demonstrate the potential
+of this hybrid approach. This work synthesizes developments at the intersection
+of optimization theory and deep learning, offering insights into both current
+capabilities and future research directions in this rapidly evolving field.
+
+
+ One of the emerging techniques in node classification in heterogeneous graphs
+is to restrict message aggregation to pre-defined, semantically meaningful
+structures called metapaths. This work is the first attempt to incorporate
+attention into the process of encoding entire metapaths without dropping
+intermediate nodes. In particular, we construct two encoders: the first uses
+sequential attention to extend the multi-hop message passing algorithm designed
+in \citet{magna} to the metapath setting, and the second incorporates direct
+attention to extract semantic relations in the metapath. The model then employs
+the intra-metapath and inter-metapath aggregation mechanisms of \citet{han}. We
+furthermore use the powerful training scheduler specialized for heterogeneous
+graphs that was developed in \citet{lts}, ensuring the model slowly learns how
+to classify the most difficult nodes. The result is a resilient,
+general-purpose framework for capturing semantic structures in heterogeneous
+graphs. In particular, we demonstrate that our model is competitive with
+state-of-the-art models on performing node classification on the IMDB dataset,
+a popular benchmark introduced in \citet{benchmark}.
+
+
+
+
+
+
+
+
+ Ervin Moore, Ahmed Imteaj, Md Zarif Hossain, Shabnam Rezapour, M. Hadi Amini
+
+
+ Federated Learning (FL) is a privacy-preserving distributed machine learning
+scheme, where each participant's data remains on the participating devices and
+only the local model, generated using the local computational power, is
+transmitted across the network. However, the distributed computational
+nature of FL creates the necessity to develop a mechanism that can remotely
+trigger any network agents, track their activities, and prevent threats to the
+overall process posed by malicious participants. Particularly, the FL paradigm
+may become vulnerable due to an active attack from the network participants,
+called a poisoning attack. In such an attack, the malicious participant acts as
+a benign agent capable of affecting the global model quality by uploading an
+obfuscated poisoned local model update to the server. This paper presents a
+cross-device FL model that ensures trustworthiness, fairness, and authenticity
+in the underlying FL training process. We leverage trustworthiness by
+constructing a reputation-based trust model based on contributions of agents
+toward model convergence. We ensure fairness by identifying and removing
+malicious agents from the training process through an outlier detection
+technique. Further, we establish authenticity by generating a token for each
+participating device through a distributed sensing mechanism and storing that
+unique token in a blockchain smart contract. Finally, we insert the trust
+scores of all agents into a blockchain and validate their reputations using
+various consensus mechanisms that consider the computational task.
+
+
+
+
+
+
+
+ ☆ Two Birds with One Stone: Improving Rumor Detection by Addressing the
+ Unfairness Issue
+
+
+ The degraded performance and group unfairness caused by confounding sensitive
+attributes in rumor detection remain relatively unexplored. To address this,
+we propose a two-step framework. Initially, it identifies confounding sensitive
+attributes that limit rumor detection performance and cause unfairness across
+groups. Subsequently, we aim to learn equally informative representations
+through invariant learning. Our method considers diverse sets of groups without
+sensitive attribute annotations. Experiments show our method easily integrates
+with existing rumor detectors, significantly improving both their detection
+performance and fairness.
+
+
+
+
+
+
+
+ ☆ Prototypical Distillation and Debiased Tuning for Black-box Unsupervised
+ Domain Adaptation
+
+
+
+
+
+
+
+
+ Jian Liang, Lijun Sheng, Hongmin Liu, Ran He
+
+
+ Unsupervised domain adaptation aims to transfer knowledge from a related,
+label-rich source domain to an unlabeled target domain, thereby circumventing
+the high costs associated with manual annotation. Recently, there has been
+growing interest in source-free domain adaptation, a paradigm in which only a
+pre-trained model, rather than the labeled source data, is provided to the
+target domain. Given the potential risk of source data leakage via model
+inversion attacks, this paper introduces a novel setting called black-box
+domain adaptation, where the source model is accessible only through an API
+that provides the predicted label along with the corresponding confidence value
+for each query. We develop a two-step framework named $\textbf{Pro}$totypical
+$\textbf{D}$istillation and $\textbf{D}$ebiased tun$\textbf{ing}$
+($\textbf{ProDDing}$). In the first step, ProDDing leverages both the raw
+predictions from the source model and prototypes derived from the target domain
+as teachers to distill a customized target model. In the second step, ProDDing
+keeps fine-tuning the distilled model by penalizing logits that are biased
+toward certain classes. Empirical results across multiple benchmarks
+demonstrate that ProDDing outperforms existing black-box domain adaptation
+methods. Moreover, in the case of hard-label black-box domain adaptation, where
+only predicted labels are available, ProDDing achieves significant improvements
+over these methods. Code will be available at
+\url{https://github.com/tim-learn/ProDDing/}.
+
+
+
+
+
+
+
+ ☆ Overcoming Class Imbalance: Unified GNN Learning with Structural and
+ Semantic Connectivity Representations
+
+
+ Class imbalance is pervasive in real-world graph datasets, where the majority
+of annotated nodes belong to a small set of classes (majority classes), leaving
+many other classes (minority classes) with only a handful of labeled nodes.
+Graph Neural Networks (GNNs) suffer from significant performance degradation in
+the presence of class imbalance, exhibiting bias towards majority classes and
+struggling to generalize effectively on minority classes. This limitation
+stems, in part, from the message passing process, leading GNNs to overfit to
+the limited neighborhood of annotated nodes from minority classes and impeding
+the propagation of discriminative information throughout the entire graph. In
+this paper, we introduce a novel Unified Graph Neural Network Learning
+(Uni-GNN) framework to tackle class-imbalanced node classification. The
+proposed framework seamlessly integrates both structural and semantic
+connectivity representations through semantic and structural node encoders. By
+combining these connectivity types, Uni-GNN extends the propagation of node
+embeddings beyond immediate neighbors, encompassing non-adjacent structural
+nodes and semantically similar nodes, enabling efficient diffusion of
+discriminative information throughout the graph. Moreover, to harness the
+potential of unlabeled nodes within the graph, we employ a balanced
+pseudo-label generation mechanism that augments the pool of available labeled
+nodes from minority classes in the training set. Experimental results
+underscore the superior performance of our proposed Uni-GNN framework compared
+to state-of-the-art class-imbalanced graph learning baselines across multiple
+benchmark datasets.
+
+
+
+
+
+
+
+ ☆ Uncertainty Herding: One Active Learning Method for All Label Budgets
+
+
+
+
+
+
+
+
+ Wonho Bae, Gabriel L. Oliveira, Danica J. Sutherland
+
+
+ Most active learning research has focused on methods which perform well when
+many labels are available, but can be dramatically worse than random selection
+when label budgets are small. Other methods have focused on the low-budget
+regime, but do poorly as label budgets increase. As the line between "low" and
+"high" budgets varies by problem, this is a serious issue in practice. We
+propose uncertainty coverage, an objective which generalizes a variety of low-
+and high-budget objectives, as well as natural, hyperparameter-light methods to
+smoothly interpolate between low- and high-budget regimes. We call greedy
+optimization of the estimate Uncertainty Herding; this simple method is
+computationally fast, and we prove that it nearly optimizes the
+distribution-level coverage. In experimental validation across a variety of
+active learning tasks, our proposal matches or beats state-of-the-art
+performance in essentially all cases; it is the only method of which we are
+aware that reliably works well in both low- and high-budget settings.
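The greedy optimization step can be sketched as follows. This is a loose illustration, not the paper's estimator: the coverage objective here (each pool point, weighted by its uncertainty, covered by its most similar selected point under an RBF similarity) and the greedy marginal-gain loop are assumptions chosen to mirror the description above.

```python
import numpy as np

def coverage(selected, X, unc, gamma=1.0):
    # Uncertainty-weighted coverage of the pool by the selected set (assumed form).
    if not selected:
        return 0.0
    d2 = ((X[:, None, :] - X[None, selected, :]) ** 2).sum(-1)
    return float((unc * np.exp(-gamma * d2).max(axis=1)).sum())

def uncertainty_herding(X, unc, budget):
    # Greedily add the point with the largest marginal coverage gain.
    selected = []
    for _ in range(budget):
        cands = [i for i in range(len(X)) if i not in selected]
        gains = [coverage(selected + [i], X, unc) for i in cands]
        selected.append(cands[int(np.argmax(gains))])
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))        # unlabeled pool features
unc = rng.uniform(size=30)          # per-point model uncertainty
picks = uncertainty_herding(X, unc, budget=5)
```

With a sharp similarity kernel the objective behaves like pure uncertainty sampling (high-budget regime); with a broad one it behaves like coverage/diversity selection (low-budget regime), which is the interpolation the abstract describes.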
+
+
+
+
+
+
+
+ ☆ SafeSynthDP: Leveraging Large Language Models for Privacy-Preserving
+ Synthetic Data Generation Using Differential Privacy
+
+
+
+
+
+
+
+
+ Md Mahadi Hasan Nahid, Sadid Bin Hasan
+
+
+ Machine learning (ML) models frequently rely on training data that may
+include sensitive or personal information, raising substantial privacy
+concerns. Legislative frameworks such as the General Data Protection Regulation
+(GDPR) and the California Consumer Privacy Act (CCPA) have necessitated the
+development of strategies that preserve privacy while maintaining the utility
+of data. In this paper, we investigate the capability of Large Language Models
+(LLMs) to generate synthetic datasets integrated with Differential Privacy (DP)
+mechanisms, thereby enabling data-driven research and model training without
+direct exposure of sensitive information. Our approach incorporates DP-based
+noise injection methods, including Laplace and Gaussian distributions, into the
+data generation process. We then evaluate the utility of these DP-enhanced
+synthetic datasets by comparing the performance of ML models trained on them
+against models trained on the original data. To substantiate privacy
+guarantees, we assess the resilience of the generated synthetic data to
+membership inference attacks and related threats. The experimental results
+demonstrate that integrating DP within LLM-driven synthetic data generation
+offers a viable balance between privacy protection and data utility. This study
+provides a foundational methodology and insight into the privacy-preserving
+capabilities of LLMs, paving the way for compliant and effective ML research
+and applications.
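The Laplace-based noise injection mentioned above is standard differential privacy machinery and can be shown in miniature. The released statistic, the data, the epsilon, and the sensitivity bound below are illustrative choices, not values from the paper.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    # Classic epsilon-DP release: add Laplace noise scaled to sensitivity/epsilon.
    return value + rng.laplace(scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
ages = np.array([34.0, 41.0, 29.0, 56.0, 38.0])   # toy "sensitive" records
true_mean = ages.mean()

# Sensitivity of the mean when each age is assumed bounded in [0, 100].
sens = 100.0 / len(ages)
private_mean = laplace_mechanism(true_mean, sens, epsilon=1.0, rng=rng)
```

In the paper's pipeline the analogous step perturbs the statistics that condition the LLM's synthetic generation, trading accuracy of the release against the formal privacy budget epsilon.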
+
+
+
+ comment: 15 pages, 1 figure, 5 tables
+
+
+
+
+
+
+ ☆ Predicting Long Term Sequential Policy Value Using Softer Surrogates
+
+
+
+
+
+
+
+
+ Hyunji Nam, Allen Nie, Ge Gao, Vasilis Syrgkanis, Emma Brunskill
+
+
+ Performing policy evaluation in education, healthcare and online commerce can
+be challenging, because it can require waiting substantial amounts of time to
+observe outcomes over the desired horizon of interest. While offline evaluation
+methods can be used to estimate the performance of a new decision policy from
+historical data in some cases, such methods struggle when the new policy
+involves novel actions or is being run in a new decision process with
+potentially different dynamics. Here we consider how to estimate the
+full-horizon value of a new decision policy using only short-horizon data from
+the new policy, and historical full-horizon data from a different behavior
+policy. We introduce two new estimators for this setting, including a doubly
+robust estimator, and provide formal analysis of their properties. Our
+empirical results on two realistic simulators, of HIV treatment and sepsis
+treatment, show that our methods can often provide informative estimates of a
+new decision policy ten times faster than waiting for the full horizon,
+highlighting that it may be possible to quickly identify if a new decision
+policy, involving new actions, is better or worse than existing past policies.
+
+
+
+
+
+
+
+
+ Jiawei Zhou, Woojeong Kim, Zhiying Xu, Alexander M. Rush, Minlan Yu
+
+
+ Understanding the traffic dynamics in networks is a core capability for
+automated systems to monitor and analyze networking behaviors, reducing
+expensive human efforts and economic risks through tasks such as traffic
+classification, congestion prediction, and attack detection. However, it is
+still challenging to accurately model network traffic with machine learning
+approaches in an efficient and broadly applicable manner. Task-specific models
+trained from scratch are used for different networking applications, which
+limits the efficiency of model development and generalization of model
+deployment. Furthermore, while networking data is abundant, high-quality
+task-specific labels are often insufficient for training individual models.
+Large-scale self-supervised learning on unlabeled data provides a natural
+pathway for tackling these challenges. We propose to pre-train a
+general-purpose machine learning model to capture traffic dynamics with only
+traffic data from NetFlow records, with the goal of fine-tuning for different
+downstream tasks with a small amount of labels. Our presented NetFlowGen
+framework goes beyond a proof-of-concept for network traffic pre-training and
+addresses specific challenges such as unifying network feature representations,
+learning from large unlabeled traffic data volume, and testing on real
+downstream tasks in DDoS attack detection. Experiments demonstrate promising
+results of our pre-training framework on capturing traffic dynamics and
+adapting to different networking tasks.
+
+
+
+
+
+
+
+ ♻ ☆ Beyond RMSE and MAE: Introducing EAUC to unmask hidden bias and
+ unfairness in dyadic regression models
+
+
+ Dyadic regression models, which output real-valued predictions for pairs of
+entities, are fundamental in many domains (e.g. obtaining user-product ratings
+in Recommender Systems) and promising and under exploration in others (e.g.
+tuning patient-drug dosages in personalized pharmacology). In this work, we
+prove that non-uniform observed value distributions of individual entities lead
+to severe biases in state-of-the-art models, skewing predictions towards the
+average of observed past values for the entity and providing worse-than-random
+predictive power in eccentric yet crucial cases; we name this phenomenon
+eccentricity bias. We show that global error metrics like Root Mean Squared
+Error (RMSE) are insufficient to capture this bias, and we introduce
+Eccentricity-Area Under the Curve (EAUC) as a novel complementary metric that
+can quantify it in all studied domains and models. We prove the intuitive
+interpretation of EAUC by experimenting with naive post-training bias
+corrections, and theorize other options to use EAUC to guide the construction
+of fair models. This work contributes a bias-aware evaluation of dyadic
+regression to prevent unfairness in critical real-world applications of such
+systems.
+
+
+ Irreducible Cartesian tensors (ICTs) play a crucial role in the design of
+equivariant graph neural networks, as well as in theoretical chemistry and
+chemical physics. Meanwhile, the design space of available linear operations on
+tensors that preserve symmetry presents a significant challenge. The ICT
+decomposition and a basis of this equivariant space are difficult to obtain for
+high-order tensors. After decades of research, we recently achieved an explicit
+ICT decomposition for $n=5$ \citep{bonvicini2024irreducible} with factorial
+time/space complexity. This work, for the first time, obtains decomposition
+matrices for ICTs up to rank $n=9$ with reduced and affordable complexity, by
+constructing what we call path matrices. The path matrices are obtained via
+performing chain-like contraction with Clebsch-Gordan matrices following the
+parentage scheme. We prove and leverage that the concatenation of path matrices
+is an orthonormal change-of-basis matrix between the Cartesian tensor product
+space and the spherical direct sum spaces. Furthermore, we identify a complete
+orthogonal basis for the equivariant space, rather than a spanning set
+\citep{pearce2023brauer}, through this path matrices technique. We further
+extend our result to the arbitrary tensor product and direct sum spaces,
+enabling free design between different spaces while keeping symmetry. The
+Python code is available at
+https://github.com/ShihaoShao-GH/ICT-decomposition-and-equivariant-bases where
+the $n=6,\dots,9$ ICT decomposition matrices are obtained in 1s, 3s, 11s, and
+4m32s, respectively.
+
+
+
+ comment: 43 pages
+
+
+
+
+
+
+ ♻ ☆ Non-asymptotic spectral bounds on the $\varepsilon$-entropy of kernel
+ classes
+
+
+ Let $K: \boldsymbol{\Omega}\times \boldsymbol{\Omega}\to\mathbb{R}$ be a continuous Mercer
+kernel defined on a compact subset of ${\mathbb R}^n$ and $\mathcal{H}_K$ be
+the reproducing kernel Hilbert space (RKHS) associated with $K$. Given a finite
+measure $\nu$ on $\boldsymbol{\Omega}$, we investigate upper and lower bounds
+on the $\varepsilon$-entropy of the unit ball of $\mathcal{H}_K$ in the space
+$L_p(\nu)$. This topic is an important direction in the modern statistical
+theory of kernel-based methods.
+ We prove sharp upper and lower bounds for $p\in [1,+\infty]$. For $p\in
+[1,2]$, the upper bounds are determined solely by the eigenvalue behaviour of
+the corresponding integral operator $\phi\to \int_{\boldsymbol{\Omega}}
+K(\cdot,{\mathbf y})\phi({\mathbf y})d\nu({\mathbf y})$. In contrast, for
+$p>2$, the bounds additionally depend on the convergence rate of the truncated
+Mercer series to the kernel $K$ in the $L_p(\nu)$-norm.
+ We discuss a number of consequences of our bounds and show that they are
+substantially tighter than previous bounds for general kernels. Furthermore,
+for specific cases, such as zonal kernels and the Gaussian kernel on a box, our
+bounds are asymptotically tight as $\varepsilon\to +0$.
+
+
+ Modern e-commerce services frequently target customers with incentives or
+interventions to engage them in their products such as games, shopping, video
+streaming, etc. This customer engagement increases acquisition of more
+customers and retention of existing ones, leading to more business for the
+company while improving customer experience. Often, customers are either
+randomly targeted or targeted based on the propensity of desirable behavior.
+However, such policies can be suboptimal as they do not target the set of
+customers who would benefit the most from the intervention and they may also
+not take account of any constraints. In this paper, we propose a policy
+framework based on uplift modeling and constrained optimization that identifies
+customers to target for a use-case specific intervention so as to maximize the
+value to the business, while taking account of any given constraints. We
+demonstrate improvement over state-of-the-art targeting approaches using two
+large-scale experimental studies and a production implementation.
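The "target by uplift under constraints" idea reduces, in its simplest form, to a budgeted selection problem. The sketch below is a toy illustration with made-up customers and a greedy uplift-per-cost heuristic; the paper's framework solves a proper constrained optimization over model-estimated uplifts.

```python
# Hypothetical customers: "uplift" is the estimated incremental value if
# treated, "cost" is the per-customer intervention cost (all values invented).
customers = [
    {"id": "c1", "uplift": 9.0, "cost": 3.0},
    {"id": "c2", "uplift": 4.0, "cost": 1.0},
    {"id": "c3", "uplift": -2.0, "cost": 1.0},  # would be harmed by treatment
    {"id": "c4", "uplift": 6.0, "cost": 4.0},
]

def target(customers, budget):
    # Skip negative-uplift customers; greedily take the best uplift per cost.
    chosen, spent = [], 0.0
    ranked = sorted((c for c in customers if c["uplift"] > 0),
                    key=lambda c: c["uplift"] / c["cost"], reverse=True)
    for c in ranked:
        if spent + c["cost"] <= budget:
            chosen.append(c["id"])
            spent += c["cost"]
    return chosen

picked = target(customers, budget=4.0)
```

Note the contrast with propensity targeting: c3 might look like a high-propensity customer, but a negative estimated uplift means the intervention destroys value, so an uplift-based policy excludes it.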
+
+
+
+ comment: Accepted at the CONSEQUENCES'24 workshop, co-located with ACM
+ RecSys'24
+
+
+
+
+
+
+ ♻ ☆ Fairness-enhancing mixed effects deep learning improves fairness on in-
+ and out-of-distribution clustered (non-iid) data
+
+
+
+
+
+
+
+
+ Son Nguyen, Adam Wang, Albert Montillo
+
+
+ Traditional deep learning (DL) models have two ubiquitous limitations. First,
+they assume training samples are independent and identically distributed
+(i.i.d), an assumption often violated in real-world datasets where samples have
+additional correlation due to repeat measurements (e.g., on the same
+participants in a longitudinal study or cells from the same sequencer). This
+leads to performance degradation, limited generalization, and covariate
+confounding, which induces Type I and Type II errors. Second, DL models
+typically prioritize overall accuracy, favoring accuracy on the majority while
+sacrificing performance for underrepresented subpopulations, leading to unfair,
+biased models. This is critical to remediate, particularly in models which
+influence decisions regarding loan approvals and healthcare. To address these
+issues, we propose the Fair Mixed Effects Deep Learning (Fair MEDL) framework.
+This framework quantifies cluster-invariant fixed effects (FE) and
+cluster-specific random effects (RE) through: 1) a cluster adversary for
+learning invariant FE, 2) a Bayesian neural network for RE, and 3) a mixing
+function combining FE and RE for final predictions. Fairness is enhanced
+through architectural and loss function changes introduced by an adversarial
+debiasing network. We formally define and demonstrate improved fairness across
+three metrics: equalized odds, demographic parity, and counterfactual fairness,
+for both classification and regression tasks. Our method also identifies and
+de-weights confounded covariates, mitigating Type I and II errors. The
+framework is comprehensively evaluated across three datasets spanning two
+industries, including finance and healthcare. The Fair MEDL framework improves
+fairness by 86.4% for Age, 64.9% for Race, 57.8% for Sex, and 36.2% for Marital
+status, while maintaining robust predictive performance.
+
+
+
+
+
+
+
+ ♻ ☆ Why the Metric Backbone Preserves Community Structure
+
+
+ The metric backbone of a weighted graph is the union of all-pairs shortest
+paths. It is obtained by removing all edges $(u,v)$ that are not the shortest
+path between $u$ and $v$. In networks with well-separated communities, the
+metric backbone tends to preserve many inter-community edges, because these
+edges serve as bridges connecting two communities, but tends to delete many
+intra-community edges because the communities are dense. This suggests that the
+metric backbone would dilute or destroy the community structure of the network.
+However, this is not borne out by prior empirical work, which instead showed
+that the metric backbone of real networks preserves the community structure of
+the original network well. In this work, we analyze the metric backbone of a
+broad class of weighted random graphs with communities, and we formally prove
+the robustness of the community structure with respect to the deletion of all
+the edges that are not in the metric backbone. An empirical comparison of
+several graph sparsification techniques confirms our theoretical finding and
+shows that the metric backbone is an efficient sparsifier in the presence of
+communities.
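The metric backbone definition quoted above translates directly into code: compute all-pairs shortest paths and keep exactly the edges whose own weight equals the shortest-path distance between their endpoints. The Floyd-Warshall routine and the toy triangle graph below are illustrative.

```python
import math

def metric_backbone(n, edges):
    """edges: dict {(u, v): w} on nodes 0..n-1, treated as undirected."""
    d = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 0.0
    for (u, v), w in edges.items():
        d[u][v] = d[v][u] = min(d[u][v], w)
    for k in range(n):                      # Floyd-Warshall all-pairs shortest paths
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    # Since d[u][v] <= w always holds, "w <= d" keeps exactly the shortest-path edges.
    return {e: w for e, w in edges.items() if w <= d[e[0]][e[1]]}

# Triangle where the direct edge (0, 2) is longer than the two-hop route.
edges = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 3.0}
backbone = metric_backbone(3, edges)
```

Here the semi-metric edge (0, 2) is removed because the path through node 1 has length 2 < 3; in a community-structured graph, the edges removed this way are disproportionately redundant intra-community ones.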
+
+
+
+
+
+
+
+ ♻ ☆ Physically Guided Deep Unsupervised Inversion for 1D Magnetotelluric
+ Models
+
+
+
+
+
+
+
+
+ Paul Goyes-Peñafiel, Umair bin Waheed, Henry Arguello
+
+
+ The global demand for unconventional energy sources such as geothermal energy
+and white hydrogen requires new exploration techniques for precise subsurface
+structure characterization and potential reservoir identification. The
+Magnetotelluric (MT) method is crucial for these tasks, providing critical
+information on the distribution of subsurface electrical resistivity at depths
+ranging from hundreds to thousands of meters. However, traditional iterative
+algorithm-based inversion methods require the adjustment of multiple
+parameters, demanding time-consuming and exhaustive tuning processes to achieve
+proper cost function minimization. Recent advances have incorporated deep
+learning algorithms for MT inversion, primarily based on supervised learning,
+which needs large labeled datasets for training. This work
+utilizes TensorFlow operations to create a differentiable forward MT operator,
+leveraging its automatic differentiation capability. Moreover, instead of
+solving for the subsurface model directly, as classical algorithms perform,
+this paper presents a new deep unsupervised inversion algorithm guided by
+physics to estimate 1D MT models. Instead of using datasets with the observed
+data and their respective model as labels during training, our method employs a
+differentiable modeling operator that physically guides the cost function
+minimization, making the proposed method solely dependent on observed data.
+Therefore, the optimization algorithm updates the network weights to
+minimize the data misfit. We test the proposed method with field and synthetic
+data at different acquisition frequencies, demonstrating that the resistivity
+models obtained are more accurate than those calculated using other techniques.
+
+
+
+
+
+
+
+
+ Hainan Ren, Li Lin, Chun-Hao Liu, Xin Wang, Shu Hu
+
+
+ AI-synthesized voice technology has the potential to create realistic human
+voices for beneficial applications, but it can also be misused for malicious
+purposes. While existing AI-synthesized voice detection models excel in
+intra-domain evaluation, they face challenges in generalizing across different
+domains, potentially becoming obsolete as new voice generators emerge. Current
+solutions use diverse data and advanced machine learning techniques (e.g.,
+domain-invariant representation, self-supervised learning), but are limited by
+predefined vocoders and sensitivity to factors like background noise and
+speaker identity. In this work, we introduce an innovative disentanglement
+framework aimed at extracting domain-agnostic artifact features related to
+vocoders. Utilizing these features, we enhance model learning in a flat loss
+landscape, enabling escape from suboptimal solutions and improving
+generalization. Extensive experiments on benchmarks show our approach
+outperforms state-of-the-art methods, achieving up to 5.12% improvement in the
+equal error rate metric in intra-domain and 7.59% in cross-domain evaluations.
+
+
+ Deep learning neural network models must be large enough to adapt to their
+problem domain, while small enough to avoid overfitting training data during
+gradient descent. To balance these competing demands, overprovisioned deep
+learning models such as transformers are trained for a single epoch on large
+data sets, and are hence inefficient with both computing resources and training
+data. In response to these inefficiencies, we exploit learning theory to derive
+Occam Gradient Descent, an algorithm that interleaves adaptive reduction of
+model size to minimize generalization error, with gradient descent on model
+weights to minimize fitting error. In contrast, traditional gradient descent
+greedily minimizes fitting error without regard to generalization error. Our
+algorithm simultaneously descends the space of weights and topological size of
+any neural network without modification. With respect to loss, compute and
+model size, our experiments show (a) on image classification benchmarks, linear
+and convolutional neural networks trained with Occam Gradient Descent
+outperform traditional gradient descent with or without post-train pruning; (b)
+on a range of tabular data classification tasks, neural networks trained with
+Occam Gradient Descent outperform traditional gradient descent, as well as
+Random Forests; (c) on natural language transformers, Occam Gradient Descent
+outperforms traditional gradient descent.
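The interleaving described above can be sketched on a linear model. This is a loose illustration, not the paper's algorithm: the size-reduction rule here is a simple magnitude-pruning heuristic on a fixed schedule, whereas Occam Gradient Descent derives its reduction from learning-theoretic generalization bounds.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.0, 0.5]            # only 3 of 20 features are informative
y = X @ true_w + 0.01 * rng.normal(size=200)

w = np.zeros(20)
mask = np.ones(20, dtype=bool)           # which weights are still "in" the model
lr = 0.05
for step in range(300):
    grad = X.T @ (X @ w - y) / len(y)    # gradient step: minimize fitting error
    w -= lr * grad
    w[~mask] = 0.0                       # pruned weights stay removed
    if step % 50 == 49:                  # interleaved step: shrink the model
        alive = np.where(mask)[0]
        if len(alive) > 3:
            drop = alive[np.argsort(np.abs(w[alive]))[:2]]
            mask[drop] = False
            w[drop] = 0.0

mse = float(((X @ w - y) ** 2).mean())
```

The training loop alternates between descending the weights and descending the model size, ending with a small surviving model that still fits the data well, in contrast to pruning only after training.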
+
+
+
+
+
+
+
+ ♻ ☆ Automatic feature selection and weighting in molecular systems using
+ Differentiable Information Imbalance
+
+
+
+
+
+
+
+
+ Romina Wild, Felix Wodaczek, Vittorio Del Tatto, Bingqing Cheng, Alessandro Laio
+
+
+ Feature selection is essential in the analysis of molecular systems and many
+other fields, but several uncertainties remain: What is the optimal number of
+features for a simplified, interpretable model that retains essential
+information? How should features with different units be aligned, and how
+should their relative importance be weighted? Here, we introduce the
+Differentiable Information Imbalance (DII), an automated method to rank
+information content between sets of features. Using distances in a ground truth
+feature space, DII identifies a low-dimensional subset of features that best
+preserves these relationships. Each feature is scaled by a weight, which is
+optimized by minimizing the DII through gradient descent. This allows
+simultaneously performing unit alignment and relative importance scaling, while
+preserving interpretability. DII can also produce sparse solutions and
+determine the optimal size of the reduced feature space. We demonstrate the
+usefulness of this approach on two benchmark molecular problems: (1)
+identifying collective variables that describe conformations of a biomolecule,
+and (2) selecting features for training a machine-learning force field. These
+results show the potential of DII in addressing feature selection challenges
+and optimizing dimensionality in various applications. The method is available
+in the Python library DADApy.
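The quantity that DII differentiates is the Information Imbalance between two distance spaces. A plain (non-differentiable) version can be sketched in a few lines; this is our simplified reading of the underlying statistic, not the DADApy implementation, which additionally softens the ranks and optimizes per-feature weights by gradient descent:

```python
import numpy as np

def information_imbalance(dist_A, dist_B):
    """Average rank, under distance space B, of each point's nearest
    neighbor in distance space A, scaled so that ~0 means A predicts B
    perfectly and ~1 means A carries no information about B."""
    dist_A = dist_A.copy()
    N = dist_A.shape[0]
    np.fill_diagonal(dist_A, np.inf)                  # exclude self-matches
    nn_A = np.argmin(dist_A, axis=1)                  # nearest neighbor in space A
    ranks_B = dist_B.argsort(axis=1).argsort(axis=1)  # rank matrix in space B
    mean_rank = ranks_B[np.arange(N), nn_A].mean()
    return 2.0 * mean_rank / N
```

Minimizing this quantity over feature weights is what lets DII align units and rank feature importance simultaneously.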
+
+
+
+
+
+
+
+ ♻ ☆ SepLLM: Accelerate Large Language Models by Compressing One Segment into
+ One Separator
+
+
+ Large Language Models (LLMs) have exhibited exceptional performance across a
+spectrum of natural language processing tasks. However, their substantial sizes
+pose considerable challenges, particularly in computational demands and
+inference speed, due to their quadratic complexity. In this work, we have
+identified a key pattern: certain seemingly meaningless special tokens (i.e.,
+separators) contribute disproportionately to attention scores compared to
+semantically meaningful tokens. This observation suggests that information of
+the segments between these separator tokens can be effectively condensed into
+the separator tokens themselves without significant information loss. Guided by
+this insight, we introduce SepLLM, a plug-and-play framework that accelerates
+inference by compressing these segments and eliminating redundant tokens.
+Additionally, we implement efficient kernels for training acceleration.
+Experimental results across training-free, training-from-scratch, and
+post-training settings demonstrate SepLLM's effectiveness. Notably, using the
+Llama-3-8B backbone, SepLLM achieves over 50% reduction in KV cache on the
+GSM8K-CoT benchmark while maintaining comparable performance. Furthermore, in
+streaming settings, SepLLM effectively processes sequences of up to 4 million
+tokens or more while maintaining consistent language modeling capabilities.
+
+
+
+ comment: We have made our code publicly available at sepllm.github.io. Our
+ codebase supports efficient multi-node distributed training with accelerated
+ attention module Sep-Attention and also supports numerous existing Fusion
+ Operators to accelerate the training process, such as fused rope, etc. If you
+ find our code helpful, please kindly consider giving us a **star** on
+ GitHub^_^. Thank you very much!
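The KV-cache saving comes from retaining separator tokens (which are assumed to condense their preceding segments) rather than every token. A toy sketch of which cache positions such a scheme might keep; the initial/recent windows are illustrative assumptions, not SepLLM's exact retention policy:

```python
def kept_positions(tokens, separators, n_init=4, n_recent=8):
    """Toy KV-cache retention in the spirit of SepLLM: keep separator
    tokens plus a few initial and recent tokens; everything else is
    dropped on the assumption that its information has been condensed
    into the following separator."""
    n = len(tokens)
    keep = set(range(min(n_init, n)))              # initial tokens
    keep |= set(range(max(0, n - n_recent), n))    # recent window
    keep |= {i for i, t in enumerate(tokens) if t in separators}
    return sorted(keep)
```

On long sequences the kept set grows roughly with the number of separators rather than the sequence length, which is what enables streaming over millions of tokens.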
+
+
+
+
+
+
+ ♻ ☆ CNNtention: Can CNNs do better with Attention?
+
+
+ Convolutional Neural Networks (CNNs) have been the standard for image
+classification tasks for a long time, but more recently attention-based
+mechanisms have gained traction. This project aims to compare traditional CNNs
+with attention-augmented CNNs across an image classification task. By
+evaluating and comparing their performance, accuracy and computational
+efficiency, the project will highlight benefits and trade-off of the localized
+feature extraction of traditional CNNs and the global context capture in
+attention-augmented CNNs. By doing this, we can reveal further insights into
+their respective strengths and weaknesses, guide the selection of models based
+on specific application needs and ultimately, enhance understanding of these
+architectures in the deep learning community.
+ This was our final project for CS7643 Deep Learning course at Georgia Tech.
+
+
+
+ comment: 10 pages, 11 figures
+
+
+
+
+
+
+ ♻ ☆ A Graph Neural Network deep-dive into successful counterattacks
+
+
+ A counterattack in soccer is a high speed, high intensity direct attack that
+can occur when a team transitions from a defensive state to an attacking state
+after regaining possession of the ball. The aim is to create a goal-scoring
+opportunity by covering a lot of ground with minimal passes before the
+opposing team can recover their defensive shape. The purpose of this research
+is to build gender-specific Graph Neural Networks to model the likelihood of a
+counterattack being successful and uncover what factors make them successful in
+professional soccer. These models are trained on a total of 20863 frames of
+synchronized on-ball event and spatiotemporal (broadcast) tracking data. This
+dataset is derived from 632 games of MLS (2022), NWSL (2022) and international
+soccer (2020-2022). With this data we demonstrate that gender-specific Graph
+Neural Networks outperform architecturally identical gender-ambiguous models in
+predicting the successful outcome of counterattacks. We show, using Permutation
+Feature Importance, that byline to byline speed, angle to the goal, angle to
+the ball and sideline to sideline speed are the node features with the highest
+impact on model performance. Additionally, we offer some illustrative examples
+on how to navigate the infinite solution search space to aid in identifying
+improvements for player decision making.
+ This research is accompanied by an open-source repository containing all data
+and code, and it is also accompanied by an open-source Python package which
+simplifies converting spatiotemporal data into graphs. This package also
+facilitates testing, validation, training and prediction with this data. This
+should allow the reader to replicate and improve upon our research more easily.
+
+
+
+ comment: 11 pages, 11 figures, first submitted (and accepted) at MIT Sloan
+ Sports Analytics Conference 2023
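Permutation Feature Importance, used above to rank the node features, is model-agnostic: shuffle one feature column at a time and measure the drop in a scoring function. A generic sketch of the standard technique (not the authors' code):

```python
import numpy as np

def permutation_importance(model_score, X, y, n_repeats=5, seed=0):
    """Mean drop in model_score(X, y) (higher = better) when each feature
    column is independently shuffled, breaking its relationship to y."""
    rng = np.random.default_rng(seed)
    base = model_score(X, y)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])        # destroy feature j's signal
            drops[j] += base - model_score(Xp, y)
    return drops / n_repeats             # mean score drop per feature
```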
+
+
+
+
+
+
+ ♻ ☆ On Reward Transferability in Adversarial Inverse Reinforcement Learning:
+ Insights from Random Matrix Theory
+
+
+ In the context of inverse reinforcement learning (IRL) with a single expert,
+adversarial inverse reinforcement learning (AIRL) serves as a foundational
+approach to providing comprehensive and transferable task descriptions.
+However, AIRL faces practical performance challenges, primarily stemming from
+the framework's overly idealized decomposability condition, the unclear proof
+regarding the potential equilibrium in reward recovery, or questionable
+robustness in high-dimensional environments. This paper revisits AIRL in
+\textbf{high-dimensional scenarios where the state space tends to infinity}.
+Specifically, we first establish a necessary and sufficient condition for
+reward transferability by examining the rank of the matrix derived from
+subtracting the identity matrix from the transition matrix. Furthermore,
+leveraging random matrix theory, we analyze the spectral distribution of this
+matrix, demonstrating that our rank criterion holds with high probability even
+when the transition matrices are unobservable. This suggests that the
+limitations on transfer are not inherent to the AIRL framework itself, but are
+instead related to the training variance of the reinforcement learning
+algorithms employed within it. Based on this insight, we propose a hybrid
+framework that integrates on-policy proximal policy optimization in the source
+environment with off-policy soft actor-critic in the target environment,
+leading to significant improvements in reward transfer effectiveness.
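The rank criterion above examines the matrix obtained by subtracting the identity from the transition matrix. For a row-stochastic P, (P - I) always annihilates the all-ones vector, so its rank is at most n-1; a tiny numpy check of how far a given P falls short of that bound (reading "deficiency 0" as the criterion being satisfied is our simplification, not a quote of the paper's necessary-and-sufficient statement):

```python
import numpy as np

def transfer_rank_deficiency(P):
    """(n-1) minus the rank of (P - I) for an n x n row-stochastic
    transition matrix P. Generic ergodic chains give 0."""
    n = P.shape[0]
    return (n - 1) - np.linalg.matrix_rank(P - np.eye(n))
```

By Perron-Frobenius, a strictly positive transition matrix has a simple eigenvalue 1, so the deficiency is 0; degenerate cases such as P = I are maximally deficient.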
+
+
+ Disentangled representation learning aims to learn low-dimensional
+representations where each dimension corresponds to an underlying generative
+factor. While the Variational Auto-Encoder (VAE) is widely used for this
+purpose, most existing methods assume independence among factors, a
+simplification that does not hold in many real-world scenarios where factors
+are often interdependent and exhibit causal relationships. To overcome this
+limitation, we propose the Disentangled Causal Variational Auto-Encoder
+(DCVAE), a novel supervised VAE framework that integrates causal flows into the
+representation learning process, enabling the learning of more meaningful and
+interpretable disentangled representations. We evaluate DCVAE on both synthetic
+and real-world datasets, demonstrating its superior ability in causal
+disentanglement and intervention experiments. Furthermore, DCVAE outperforms
+state-of-the-art methods in various downstream tasks, highlighting its
+potential for learning true causal structures among factors.
+
+
+
+ comment: 22 pages, 14 figures
+
+
+
+
+
+
+ ♻ ☆ Towards Instance-Wise Calibration: Local Amortized Diagnostics and
+ Reshaping of Conditional Densities (LADaR)
+
+
+
+
+
+
+
+
+ Biprateep Dey, David Zhao, Brett H. Andrews, Jeffrey A. Newman, Rafael Izbicki, Ann B. Lee
+
+
+ There is a growing interest in conditional density estimation and generative
+modeling of a target $y$ given complex inputs $\mathbf{x}$. However,
+off-the-shelf methods often lack instance-wise calibration -- that is, for
+individual inputs $\mathbf{x}$, the individual estimated probabilities can be
+very different from the true probabilities, even when the estimates are
+reasonable when averaged over the entire population. This paper introduces the
+LADaR (Local Amortized Diagnostics and Reshaping of Conditional Densities)
+framework and proposes an algorithm called $\texttt{Cal-PIT}$ that produces
+interpretable local calibration diagnostics and includes a mechanism to
+recalibrate the initial model. Our $\texttt{Cal-PIT}$ algorithm learns a single
+local probability-probability map from calibration data to assess and quantify
+where corrections are needed across the feature space. When necessary, it
+reshapes the initial distribution into an estimate with approximate
+instance-wise calibration. We illustrate the LADaR framework by applying
+$\texttt{Cal-PIT}$ to synthetic examples, including probabilistic forecasting
+with sequences of images as inputs, akin to predicting the wind speed of
+tropical cyclones from satellite imagery. Our main science application is
+conditional density estimation of galaxy distances given imaging data
+(so-called photometric redshift estimation). On a benchmark photometric
+redshift data challenge, $\texttt{Cal-PIT}$ achieves better conditional density
+estimation (as measured by the conditional density estimation loss) than all 11
+other literature methods tested. This demonstrates its potential for meeting
+the stringent photometric redshift requirements for next generation weak
+gravitational lensing analyses.
+
+
+
+ comment: Code available as a Python package
+ https://github.com/lee-group-cmu/Cal-PIT
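The diagnostic that Cal-PIT localizes is the probability integral transform: for a calibrated conditional density estimate, PIT values F_hat(y|x) are Uniform(0,1). A sketch of the global version of that check (Cal-PIT goes further and learns local P-P maps over the feature space; this shows only the quantity it builds on):

```python
import numpy as np

def pit_values(cdf, X, y):
    """PIT_i = F_hat(y_i | x_i); Uniform(0,1) under calibration."""
    return np.array([cdf(xi, yi) for xi, yi in zip(X, y)])

def pp_deviation(pit, grid=None):
    """Max deviation between the empirical CDF of PIT values and the
    diagonal of the P-P plot (0 for a perfectly calibrated model)."""
    grid = np.linspace(0, 1, 101) if grid is None else grid
    emp = np.array([(pit <= g).mean() for g in grid])
    return np.abs(emp - grid).max()
```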
+
+
+
+
+
+
+ ♻ ☆ Efficient Link Prediction via GNN Layers Induced by Negative Sampling
+
+
+
+
+
+
+
+
+ Yuxin Wang, Xiannian Hu, Quan Gan, Xuanjing Huang, Xipeng Qiu, David Wipf
+
+
+ Graph neural networks (GNNs) for link prediction can loosely be divided into
+two broad categories. First, \emph{node-wise} architectures pre-compute
+individual embeddings for each node that are later combined by a simple decoder
+to make predictions. While extremely efficient at inference time, model
+expressiveness is limited such that isomorphic nodes contributing to candidate
+edges may not be distinguishable, compromising accuracy. In contrast,
+\emph{edge-wise} methods rely on the formation of edge-specific subgraph
+embeddings to enrich the representation of pair-wise relationships,
+disambiguating isomorphic nodes to improve accuracy, but with increased model
+complexity. To better navigate this trade-off, we propose a novel GNN
+architecture whereby the \emph{forward pass} explicitly depends on \emph{both}
+positive (as is typical) and negative (unique to our approach) edges to inform
+more flexible, yet still cheap node-wise embeddings. This is achieved by
+recasting the embeddings themselves as minimizers of a forward-pass-specific
+energy function that favors separation of positive and negative samples.
+Notably, this energy is distinct from the actual training loss shared by most
+existing link prediction models, where contrastive pairs only influence the
+\textit{backward pass}. As demonstrated by extensive empirical evaluations, the
+resulting architecture retains the inference speed of node-wise models, while
+producing competitive accuracy with edge-wise alternatives. We released our
+code at https://github.com/yxzwang/SubmissionverOfYinYanGNN.
+
+
+
+ comment: Accepted to TKDE. Citation information: DOI 10.1109/TKDE.2024.3481015
+
+ Chain of thought (CoT) is a reasoning framework that can enhance the
+performance of Large Language Models (LLMs) on complex inference tasks. In
+particular, among various studies related to CoT, multi-path inference stands
+out as a simple yet effective improvement. However, there is no optimal setting
+for the number of inference paths. Therefore, we have to increase the number of
+inference paths to obtain better results, which in turn increases the inference
+cost. To address this limitation, we can utilize question-related role
+templates to guide LLMs into relevant roles, thereby increasing the possibility
+of correct inferences for each path and further reducing dependence on the
+number of inference paths while improving reasoning accuracy. However, placing
+LLMs into specific roles may reduce their reasoning diversity and performance
+on a few tasks where role dependence is low. To alleviate the excessive
+immersion of the LLM into a specific role, we propose Nash CoT, which constructs
+a game system on each path that balances the generation of the role-specific
+LLM against that of the general LLM. This ensures both effective role adoption
+and diversity in LLM generation, maintaining the performance of multi-path
+inference while reducing the required number of inference paths. We evaluate
+Nash CoT across various inference tasks, including Arabic
+Reasoning, Commonsense Question Answering, and Symbolic Inference, achieving
+results that are comparable to or better than those of multi-path CoT with the
+equal number of inference paths.
+
+
+ We study an online joint assortment-inventory optimization problem, in which
+we assume that the choice behavior of each customer follows the Multinomial
+Logit (MNL) choice model, and the attraction parameters are unknown a priori.
+The retailer makes periodic assortment and inventory decisions to dynamically
+learn from the customer choice observations about the attraction parameters
+while maximizing the expected total profit over time. In this paper, we propose
+a novel algorithm that can effectively balance exploration and exploitation in
+the online decision-making of assortment and inventory. Our algorithm builds on
+a new estimator for the MNL attraction parameters, an innovative approach to
+incentivize exploration by adaptively tuning certain known and unknown
+parameters, and an optimization oracle for static single-cycle
+assortment-inventory planning problems with given parameters. We establish a
+regret upper bound for our algorithm and a lower bound for the online joint
+assortment-inventory optimization problem, suggesting that our algorithm
+achieves a nearly optimal regret rate, provided that the static optimization
+oracle is exact. Then we incorporate more practical approximate static
+optimization oracles into our algorithm, and bound from above the impact of
+static optimization errors on the regret of our algorithm. We perform numerical
+studies to demonstrate the effectiveness of our proposed algorithm. Finally, we
+extend our study by incorporating inventory carryover and the learning of
+customer arrival distribution.
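The MNL choice model underlying this problem has a closed form: a customer offered assortment S buys item i with probability v_i / (1 + sum of v_j over S), with the remainder going to the no-purchase option. A direct sketch (the `no_purchase` key is just our labeling):

```python
import numpy as np

def mnl_choice_probs(v, assortment):
    """Multinomial Logit choice probabilities for an offered assortment:
    P(i | S) = v_i / (1 + sum_{j in S} v_j), with 1 / (1 + sum v_j) for
    the outside (no-purchase) option. v are the attraction parameters
    the retailer must learn online."""
    v = np.asarray(v, dtype=float)
    denom = 1.0 + v[assortment].sum()
    probs = {i: v[i] / denom for i in assortment}
    probs["no_purchase"] = 1.0 / denom
    return probs
```

The learning difficulty in the paper comes precisely from the attraction parameters v being unknown and only indirectly observed through such choice outcomes.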
+
+
+
+
+
+
+
+ ♻ ☆ GISExplainer: On Explainability of Graph Neural Networks via
+ Game-theoretic Interaction Subgraphs
+
+
+ Explainability is crucial for the application of black-box Graph Neural
+Networks (GNNs) in critical fields such as healthcare, finance, cybersecurity,
+and more. Various feature attribution methods, especially the
+perturbation-based methods, have been proposed to indicate how much each
+node/edge contributes to the model predictions. However, these methods fail to
+generate connected explanatory subgraphs that consider the causal interaction
+between edges within different coalition scales, which will result in
+unfaithful explanations. In our study, we propose GISExplainer, a novel
+game-theoretic interaction based explanation method that uncovers what the
+underlying GNNs have learned for node classification by discovering
+human-interpretable causal explanatory subgraphs. First, GISExplainer defines a
+causal attribution mechanism that considers the game-theoretic interaction of
+multi-granularity coalitions in candidate explanatory subgraph to quantify the
+causal effect of an edge on the prediction. Second, GISExplainer assumes that
+the coalitions with negative effects on the predictions are also significant
+for model interpretation, and the contribution of the computation graph stems
+from the combined influence of both positive and negative interactions within
+the coalitions. Then, GISExplainer regards the explanation task as a sequential
+decision process, in which a salient edge is successively selected and
+connected to the previously selected subgraph based on its causal effect to
+form an explanatory subgraph, ultimately striving for better explanations.
+Additionally, an efficiency optimization scheme is proposed for the causal
+attribution mechanism through coalition sampling. Extensive experiments
+demonstrate that GISExplainer achieves better performance than state-of-the-art
+approaches w.r.t. two quantitative metrics: Fidelity and Sparsity.
+
+
+
+ comment: 13 pages, 7 figures
+
+
+
+
+
+
+ ♻ ☆ Federated Learning with MMD-based Early Stopping for Adaptive GNSS
+ Interference Classification
+
+
+
+
+
+
+
+
+ Nishant S. Gaikwad, Lucas Heublein, Nisha L. Raichur, Tobias Feigl, Christopher Mutschler, Felix Ott
+
+
+ Federated learning (FL) enables multiple devices to collaboratively train a
+global model while maintaining data on local servers. Each device trains the
+model on its local server and shares only the model updates (i.e., gradient
+weights) during the aggregation step. A significant challenge in FL is managing
+the feature distribution of novel and unbalanced data across devices. In this
+paper, we propose an FL approach using few-shot learning and aggregation of the
+model weights on a global server. We introduce a dynamic early stopping method
+to balance out-of-distribution classes based on representation learning,
+specifically utilizing the maximum mean discrepancy of feature embeddings
+between local and global models. An exemplary application of FL is to
+orchestrate machine learning models along highways for interference
+classification based on snapshots from global navigation satellite system
+(GNSS) receivers. Extensive experiments on four GNSS datasets from two
+real-world highways and controlled environments demonstrate that our FL method
+surpasses state-of-the-art techniques in adapting to both novel interference
+classes and multipath scenarios.
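The early-stopping signal above is the maximum mean discrepancy between local and global feature embeddings. A standard kernel-MMD sketch; the RBF kernel and bandwidth are illustrative assumptions, not necessarily the paper's choices:

```python
import numpy as np

def mmd_rbf(X, Y, gamma=1.0):
    """Biased (V-statistic) squared MMD between samples X and Y under an
    RBF kernel: MMD^2 = E k(x,x') + E k(y,y') - 2 E k(x,y). Near zero
    when the two embedding distributions match."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```

A rising MMD between a client's embeddings and the global model's is the kind of drift signal the proposed dynamic early stopping reacts to.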
+
+
+
+
+
+
+
+ ♻ ☆ Graph Mixture of Experts and Memory-augmented Routers for Multivariate
+ Time Series Anomaly Detection AAAI 2025
+
+
+
+
+
+
+
+
+ Xiaoyu Huang, Weidong Chen, Bo Hu, Zhendong Mao
+
+
+ Multivariate time series (MTS) anomaly detection is a critical task that
+involves identifying abnormal patterns or events in data that consist of
+multiple interrelated time series. In order to better model the complex
+interdependence between entities and the various inherent characteristics of
+each entity, GNN-based methods are widely adopted by existing approaches. In
+each layer of a GNN, node features aggregate information from neighboring
+nodes to update their representations. In doing so, from shallow to deep
+layers, original individual node features are progressively weakened while
+structural information, i.e., from short-distance to long-distance
+neighborhoods, is progressively enhanced. However, research to date
+has largely ignored the understanding of how hierarchical graph information is
+represented and their characteristics that can benefit anomaly detection.
+Existing methods simply leverage the output from the last layer of GNN for
+anomaly estimation while neglecting the essential information contained in the
+intermediate GNN layers. To address such limitations, in this paper, we propose
+a Graph Mixture of Experts (Graph-MoE) network for multivariate time series
+anomaly detection, which incorporates the mixture of experts (MoE) module to
+adaptively represent and integrate hierarchical multi-layer graph information
+into entity representations. It is worth noting that our Graph-MoE can be
+integrated into any GNN-based MTS anomaly detection method in a plug-and-play
+manner. In addition, the memory-augmented routers are proposed in this paper to
+capture correlated temporal information in terms of the global historical
+features of MTS to adaptively weigh the obtained entity representations to
+achieve successful anomaly estimation. Extensive experiments on five
+challenging datasets prove the superiority of our approach and each proposed
+module.
+
+
+
+ comment: Accepted by AAAI 2025
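The core MoE idea above, adaptively mixing hierarchical multi-layer GNN outputs into one entity representation, can be sketched with a per-node softmax router over layers. The linear scoring is our assumption, and the memory-augmented router is omitted:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def route_layers(layer_reps, w_router):
    """Toy mixture over GNN layers: stack per-layer node representations
    H_l (L x N x d), score each layer per node with a linear router, and
    return the softmax-weighted combination (N x d)."""
    H = np.stack(layer_reps)                   # (L, N, d)
    scores = H @ w_router                      # (L, N): one score per layer/node
    gates = softmax(scores, axis=0)            # weights over layers, per node
    return (gates[..., None] * H).sum(axis=0)  # mixed representation
```

With a zero router the gates are uniform and the module degenerates to averaging the layers, which is the baseline the learned routing improves on.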
+
+
+
+
+
+
+ ♻ ☆ Hedging Is Not All You Need: A Simple Baseline for Online Learning Under
+ Haphazard Inputs
+
+
+ Handling haphazard streaming data, such as data from edge devices, presents a
+challenging problem. Over time, the incoming data becomes inconsistent, with
+missing, faulty, or new inputs reappearing. Therefore, it requires models that
+are reliable. Recent methods to solve this problem depend on a hedging-based
+solution and require specialized elements like auxiliary dropouts, forked
+architectures, and intricate network design. We observed that hedging can be
+reduced to a special case of weighted residual connection; this motivated us to
+approximate it with plain self-attention. In this work, we propose HapNet, a
+simple baseline that is scalable, does not require online backpropagation, and
+is adaptable to varying input types. All present methods are restricted to
+scaling with a fixed window; however, we introduce a more complex problem of
+scaling with a variable window where the data becomes positionally
+uncorrelated, and cannot be addressed by present methods. We demonstrate that a
+variant of the proposed approach can work even for this complex scenario. We
+extensively evaluated the proposed approach on five benchmarks and found
+competitive performance.
+
+
+
+
+
+
+
+ ♻ ☆ Timeseria: an object-oriented time series processing library
+
+
+ Timeseria is an object-oriented time series processing library implemented in
+Python, which aims at making it easier to manipulate time series data and to
+build statistical and machine learning models on top of it. Unlike common data
+analysis frameworks, it builds on well-defined and reusable logical units
+(objects), which can be easily combined in order to ensure a high
+level of consistency. Thanks to this approach, Timeseria can address by design
+several non-trivial issues which are often underestimated, such as handling
+data losses, non-uniform sampling rates, differences between aggregated data
+and punctual observations, time zones, daylight saving times, and more.
+Timeseria comes with a comprehensive set of base data structures, data
+transformations for resampling and aggregation, common data manipulation
+operations, and extensible models for data reconstruction, forecasting and
+anomaly detection. It also integrates a fully featured, interactive plotting
+engine capable of handling even millions of data points.
+
+
+
+
+
+
+
+ ♻ ☆ Causal-aware Graph Neural Architecture Search under Distribution Shifts
+
+
+ Graph NAS has emerged as a promising approach for autonomously designing GNN
+architectures by leveraging the correlations between graphs and architectures.
+Existing methods fail to generalize under distribution shifts that are
+ubiquitous in real-world graph scenarios, mainly because the graph-architecture
+correlations they exploit might be spurious and varying across distributions.
+We propose to handle the distribution shifts in the graph architecture search
+process by discovering and exploiting the causal relationship between graphs
+and architectures to search for the optimal architectures that can generalize
+under distribution shifts. The problem remains unexplored with following
+challenges: how to discover the causal graph-architecture relationship that has
+stable predictive abilities across distributions, and how to handle
+distribution shifts with the discovered causal graph-architecture relationship
+to search the generalized graph architectures. To address these challenges, we
+propose Causal-aware Graph Neural Architecture Search (CARNAS), which is able
+to capture the causal graph-architecture relationship during the architecture
+search process and discover the generalized graph architecture under
+distribution shifts. Specifically, we propose Disentangled Causal Subgraph
+Identification to capture the causal subgraphs that have stable prediction
+abilities across distributions. Then, we propose Graph Embedding Intervention
+to intervene on causal subgraphs within the latent space, ensuring that these
+subgraphs encapsulate essential features for prediction while excluding
+non-causal elements. Additionally, we propose Invariant Architecture
+Customization to reinforce the causal invariant nature of the causal subgraphs,
+which are utilized to tailor generalized graph architectures. Extensive
+experiments demonstrate that CARNAS achieves advanced out-of-distribution
+generalization ability.
+
+
+
+
+
+
+
+
+ Hiroki Furuta, Kuang-Huei Lee, Shixiang Shane Gu, Yutaka Matsuo, Aleksandra Faust, Heiga Zen, Izzeddin Gur
+
+
+ Many algorithms for aligning LLMs with human preferences assume that human
+preferences are binary and deterministic. However, human preferences can vary
+across individuals, and therefore should be represented distributionally. In
+this work, we introduce the distributional soft preference labels and improve
+Direct Preference Optimization (DPO) with a weighted geometric average of the
+LLM output likelihood in the loss function. This approach adjusts the scale of
+learning loss based on the soft labels such that the loss would approach zero
+when the responses are closer to equally preferred. This simple modification
+can be easily applied to any DPO-based methods and mitigate over-optimization
+and objective mismatch, which prior works suffer from. Our experiments simulate
+the soft preference labels with AI feedback from LLMs and demonstrate that
+geometric averaging consistently improves performance on standard benchmarks
+for alignment research. In particular, we observe more preferable responses
+than binary labels and significant improvements where modestly-confident labels
+are in the majority.
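A numpy sketch of one plausible form of such a soft-label DPO loss, in which the soft preference scales the usual reward margin so the learning signal fades as responses become equally preferred. This is our reading of the described behavior, not necessarily the paper's exact geometric-averaging parameterization:

```python
import numpy as np

def soft_dpo_loss(logp_w, logp_l, ref_w, ref_l, p_hat, beta=0.1):
    """Soft-label DPO-style loss. Standard DPO uses the margin
    m = beta * [(logp_w - ref_w) - (logp_l - ref_l)] with a hard label;
    here the soft preference p_hat in [0, 1] scales it via (2*p_hat - 1),
    so equally preferred pairs (p_hat = 0.5) contribute no gradient."""
    m = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -np.log(1.0 / (1.0 + np.exp(-(2.0 * p_hat - 1.0) * m)))
```

At p_hat = 1 this reduces to standard DPO, and at p_hat = 0.5 the loss is a constant log 2, which is one way the over-optimization on ambiguous pairs described above gets mitigated.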
+
+
+
+
+
+
+
+
+ Yassine Abbahaddou, Fragkiskos D. Malliaros, Johannes F. Lutzeyer, Amine Mohamed Aboussalah, Michalis Vazirgiannis
+
+
+ Graph Neural Networks (GNNs) have shown great promise in tasks like node and
+graph classification, but they often struggle to generalize, particularly to
+unseen or out-of-distribution (OOD) data. These challenges are exacerbated when
+training data is limited in size or diversity. To address these issues, we
+introduce a theoretical framework using Rademacher complexity to compute a
+regret bound on the generalization error and then characterize the effect of
+data augmentation. This framework informs the design of GMM-GDA, an efficient
+graph data augmentation (GDA) algorithm leveraging the capability of Gaussian
+Mixture Models (GMMs) to approximate any distribution. Our approach not only
+outperforms existing augmentation techniques in terms of generalization but
+also offers improved time complexity, making it highly suitable for real-world
+applications.
+
+
+
+
+
+
+
+ ♻ ☆ FedSat: A Statistical Aggregation Approach for Class Imbalanced Clients
+ in Federated Learning
+
+
+ Federated learning (FL) has emerged as a promising paradigm for
+privacy-preserving distributed machine learning, but faces challenges with
+heterogeneous data distributions across clients. This paper presents FedSat, a
+novel FL approach specifically designed to simultaneously handle three forms of
+data heterogeneity, namely label skewness, missing classes, and quantity
+skewness, by proposing a prediction-sensitive loss function and a
+prioritized-class based weighted aggregation scheme. While the
+prediction-sensitive loss function enhances model performance on minority
+classes, the prioritized-class based weighted aggregation scheme ensures client
+contributions are weighted based on both statistical significance and
+performance on critical classes. Extensive experiments across diverse
+data-heterogeneity settings demonstrate that FedSat significantly outperforms
+state-of-the-art baselines, with an average improvement of 1.8% over the
+second-best method and 19.87% over the weakest-performing baseline. The
+approach also demonstrates faster convergence compared to existing methods.
+These results highlight FedSat's effectiveness in addressing the challenges of
+heterogeneous federated learning and its potential for real-world applications.
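The prioritized-class weighted aggregation can be sketched as a server-side model average whose weights blend each client's data share with its performance on critical classes. The convex combination and `alpha` below are illustrative assumptions; FedSat's actual scoring differs:

```python
import numpy as np

def weighted_aggregate(client_weights, n_samples, critical_acc, alpha=0.5):
    """Toy server aggregation in the spirit of FedSat: each client model
    is weighted by a normalized mix of its data share (statistical
    significance) and its accuracy on prioritized classes."""
    n = np.asarray(n_samples, dtype=float)
    a = np.asarray(critical_acc, dtype=float)
    score = alpha * (n / n.sum()) + (1 - alpha) * (a / a.sum())
    score /= score.sum()                          # normalize to 1
    stacked = np.stack(client_weights)            # (n_clients, ...)
    return np.tensordot(score, stacked, axes=1)   # weighted model average
```

Setting alpha = 1 recovers the familiar FedAvg-style sample-count weighting; the second term is what lets clients strong on minority classes pull the global model toward them.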
+
+
+
+
+
+
+
+ ♻ ☆ Scaling Capability in Token Space: An Analysis of Large Vision Language
+ Model
+
+
+ The scaling capability has been widely validated in neural language models
+with respect to the number of parameters and the size of training data.
+ One important question is whether a similar scaling capability also exists
+with respect to the number of vision tokens in large vision language models.
+ This study fills the gap by investigating the relationship between the number
+of vision tokens and the performance of vision-language models.
+ Our theoretical analysis and empirical evaluations demonstrate that the model
+exhibits scalable performance \(S(N_l)\) with respect to the number of vision
+tokens \(N_l\), characterized by the relationship \(S(N_l) \approx
+(c/N_l)^{\alpha}\).
+ Furthermore, we also investigate the impact of a fusion mechanism that
+integrates the user's question with vision tokens.
+ The results reveal two key findings.
+ First, the scaling capability remains intact with the incorporation of the
+fusion mechanism.
+ Second, the fusion mechanism enhances model performance, particularly when
+the user's question is task-specific and relevant.
+ The analysis, conducted on fifteen diverse benchmarks spanning a broad range
+of tasks and domains, validates the effectiveness of the proposed approach.
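The reported law S(N_l) ~ (c/N_l)^alpha is linear in log-log space, log S = alpha*log c - alpha*log N_l, so both constants can be recovered by ordinary least squares. A sketch of that fit, purely illustrative of the functional form:

```python
import numpy as np

def fit_scaling_law(N, S):
    """Fit S(N) ~ (c / N)^alpha by least squares on the log-log line
    log S = alpha*log c - alpha*log N (slope -alpha). Returns (alpha, c)."""
    slope, intercept = np.polyfit(np.log(N), np.log(S), 1)
    alpha = -slope
    c = np.exp(intercept / alpha)
    return alpha, c
```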
+
+
+
+
+
+
+
+ ♻ ☆ Towards Empirical Interpretation of Internal Circuits and Properties in
+ Grokked Transformers on Modular Polynomials
+
+
+ Grokking has been actively explored to unravel the mystery of delayed
+generalization, and identifying interpretable representations and algorithms
+inside grokked models offers a suggestive hint toward understanding its mechanism.
+Grokking on modular addition has been known to implement Fourier representation
+and its calculation circuits with trigonometric identities in Transformers.
+Considering the periodicity in modular arithmetic, the natural question is to
+what extent these explanations and interpretations hold for the grokking on
+other modular operations beyond addition. For a closer look, we first
+hypothesize that any modular operations can be characterized with distinctive
+Fourier representation or internal circuits, grokked models obtain common
+features transferable among similar operations, and mixing datasets with
+similar operations promotes grokking. Then, we extensively examine them by
+learning Transformers on complex modular arithmetic tasks, including
+polynomials. Our Fourier analysis and novel progress measure for modular
+arithmetic, Fourier Frequency Density and Fourier Coefficient Ratio,
+characterize distinctive internal representations of grokked models per modular
+operation; for instance, polynomials often result in the superposition of the
+Fourier components seen in elementary arithmetic, but clear patterns do not
+emerge in challenging non-factorizable polynomials. In contrast, our ablation
+study on the pre-grokked models reveals that transferability among the models
+grokked with each operation is limited to specific combinations, such as from
+elementary arithmetic to linear expressions.
+Moreover, some multi-task mixtures may lead to co-grokking -- where grokking
+simultaneously happens for all the tasks -- and accelerate generalization,
+while others may not find optimal solutions. We provide empirical steps towards
+the interpretability of internal circuits.
+
+
+
+ comment: Published at Transactions on Machine Learning Research (TMLR), Code:
+ https://github.com/frt03/grok_mod_poly
+
+
+
+
+
+
+
+ Antoine Wehenkel, Laura Manduchi, Jens Behrmann, Luca Pegolotti, Andrew C. Miller, Guillermo Sapiro, Ozan Sener, Marco Cuturi, Jörn-Henrik Jacobsen
+
+
+ Over the past decades, hemodynamics simulators have steadily evolved and have
+become tools of choice for studying cardiovascular systems in-silico. While
+such tools are routinely used to simulate whole-body hemodynamics from
+physiological parameters, solving the corresponding inverse problem of mapping
+waveforms back to plausible physiological parameters remains both promising and
+challenging. Motivated by advances in simulation-based inference (SBI), we cast
+this inverse problem as statistical inference. In contrast to alternative
+approaches, SBI provides \textit{posterior distributions} for the parameters of
+interest, providing a \textit{multi-dimensional} representation of uncertainty
+for \textit{individual} measurements. We showcase this ability by performing an
+in-silico uncertainty analysis of five biomarkers of clinical interest
+comparing several measurement modalities. Beyond the corroboration of known
+facts, such as the feasibility of estimating heart rate, our study highlights
+the potential of estimating new biomarkers from standard-of-care measurements.
+SBI reveals practically relevant findings that cannot be captured by standard
+sensitivity analyses, such as the existence of sub-populations for which
+parameter estimation exhibits distinct uncertainty regimes. Finally, we study
+the gap between in-vivo and in-silico with the MIMIC-III waveform database and
+critically discuss how cardiovascular simulations can inform real-world data
+analysis.
+
+
+
+
+
+
+
+ ♻ ☆ Bayesian Meta-Learning for Improving Generalizability of Health
+ Prediction Models With Similar Causal Mechanisms
+
+
+
+
+
+
+
+
+ Sophie Wharrie, Lisa Eick, Lotta Mäkinen, Andrea Ganna, Samuel Kaski, FinnGen
+
+
+ Machine learning strategies like multi-task learning, meta-learning, and
+transfer learning enable efficient adaptation of machine learning models to
+specific applications in healthcare, such as prediction of various diseases, by
+leveraging generalizable knowledge across large datasets and multiple domains.
+In particular, Bayesian meta-learning methods pool data across related
+prediction tasks to learn prior distributions for model parameters, which are
+then used to derive models for specific tasks. However, inter- and intra-task
+variability due to disease heterogeneity and other patient-level differences
+pose challenges of negative transfer during shared learning and poor
+generalizability to new patients. We introduce a novel Bayesian meta-learning
+approach that aims to address this in two key settings: (1) predictions for new
+patients (same population as the training set) and (2) adapting to new patient
+populations. Our main contribution is in modeling similarity between causal
+mechanisms of the tasks, for (1) mitigating negative transfer during training
+and (2) fine-tuning that pools information from tasks that are expected to aid
+generalizability. We propose an algorithm for implementing this approach for
+Bayesian deep learning, and apply it to a case study for stroke prediction
+tasks using electronic health record data. Experiments for the UK Biobank
+dataset as the training population demonstrated significant generalizability
+improvements compared to standard meta-learning, non-causal task similarity
+measures, and local baselines (separate models for each task). This was
+assessed for a variety of tasks that considered both new patients from the
+training population (UK Biobank) and a new population (FinnGen).
+
+
+
+
+
+
+
+ ♻ ☆ Ultralight Signal Classification Model for Automatic Modulation
+ Recognition
+
+
+
+
+
+
+
+
+ Alessandro Daniele Genuardi Oquendo, Agustín Matías Galante Cerviño, Nilotpal Kanti Sinha, Luc Andrea, Sam Mugel, Román Orús
+
+
+ The growing complexity of radar signals demands responsive and accurate
+detection systems that can operate efficiently on resource-constrained edge
+devices. Existing models, while effective, often rely on substantial
+computational resources and large datasets, making them impractical for edge
+deployment. In this work, we propose an ultralight hybrid neural network
+optimized for edge applications, delivering robust performance across
+unfavorable signal-to-noise ratios (mean accuracy of 96.3% at 0 dB) using less
+than 100 samples per class, and significantly reducing computational overhead.
+
+
+
+ comment: 8 pages, 8 figures
+
+
+
+
+
+
+ ♻ ☆ InfAlign: Inference-aware language model alignment
+
+
+
+
+
+
+
+
+ Ananth Balashankar, Ziteng Sun, Jonathan Berant, Jacob Eisenstein, Michael Collins, Adrian Hutter, Jong Lee, Chirag Nagpal, Flavien Prost, Aradhana Sinha, Ananda Theertha Suresh, Ahmad Beirami
+
+
+ Language model alignment has become a critical step in training modern
+generative language models. The goal of alignment is to finetune a reference
+model such that the win rate of a sample from the aligned model over a sample
+from the reference model is high, subject to a KL divergence constraint. Today,
+we are increasingly using inference-time algorithms (e.g., Best-of-N,
+controlled decoding, tree search) to decode from language models rather than
+standard sampling. However, the alignment objective does not capture such
+inference-time decoding procedures. We show that the existing alignment
+framework is sub-optimal in view of such inference-time methods. We then modify
+the alignment objective and propose a framework for inference-aware alignment
+(IAPO). We prove that for any inference-time decoding algorithm, the optimal
+solution that optimizes the inference-time win rate of the aligned policy
+against the reference policy is the solution to the typical RLHF problem with a
+transformation of the reward. This motivates us to provide the KL-regularized
+calibrate-and-transform RL (CTRL) algorithm to solve this problem, which
+involves a reward calibration step and a KL-regularized reward maximization
+step with a transformation of the calibrated reward. We particularize our study
+to two important inference-time strategies: best-of-N sampling and best-of-N
+jailbreaking, where N responses are sampled from the model and the one with the
+highest or lowest reward is selected. We propose specific transformations for
+these strategies and demonstrate that our framework offers significant
+improvements over existing state-of-the-art methods for language model
+alignment. Empirically, we outperform baselines that are designed without
+taking inference-time decoding into consideration by 8-12% and 4-9% on
+inference-time win rates over the Anthropic helpfulness and harmlessness dialog
+benchmark datasets.
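The best-of-N strategy the abstract particularizes to is simple to state: draw N samples and keep the one the reward model scores highest (or lowest, for the best-of-N jailbreaking variant). A minimal sketch, with caller-supplied stand-ins for the policy sampler and the reward model:

```python
import random

def best_of_n(sample, reward, n=8, pick_best=True):
    # Draw n candidate responses; keep the one the reward model scores
    # highest (best-of-N) or lowest (the best-of-N jailbreaking variant).
    candidates = [sample() for _ in range(n)]
    chooser = max if pick_best else min
    return chooser(candidates, key=reward)

# Toy usage: candidates are random numbers and the reward is the identity,
# so best-of-N returns the largest of the n draws.
random.seed(0)
print(best_of_n(lambda: random.random(), reward=lambda r: r, n=16))
```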
+
+
+ Exploration in cooperative multi-agent reinforcement learning (MARL) remains
+challenging for value-based agents due to the absence of an explicit policy.
+Existing approaches include individual exploration based on uncertainty towards
+the system and collective exploration through behavioral diversity among
+agents. However, the introduction of additional structures often leads to
+reduced training efficiency and infeasible integration of these methods. In
+this paper, we propose Adaptive exploration via Identity Recognition (AIR),
+which consists of two adversarial components: a classifier that recognizes
+agent identities from their trajectories, and an action selector that
+adaptively adjusts the mode and degree of exploration. We theoretically prove
+that AIR can facilitate both individual and collective exploration during
+training, and experiments also demonstrate the efficiency and effectiveness of
+AIR across various tasks.
+
+
+
+
+
+
+
+ ♻ ☆ Privacy-Preserving Customer Support: A Framework for Secure and Scalable
+ Interactions
+
+
+ The growing reliance on artificial intelligence (AI) in customer support has
+significantly improved operational efficiency and user experience. However,
+traditional machine learning (ML) approaches, which require extensive local
+training on sensitive datasets, pose substantial privacy risks and compliance
+challenges with regulations like the General Data Protection Regulation (GDPR)
+and California Consumer Privacy Act (CCPA). Existing privacy-preserving
+techniques, such as anonymization, differential privacy, and federated
+learning, address some concerns but face limitations in utility, scalability,
+and complexity. This paper introduces the Privacy-Preserving Zero-Shot Learning
+(PP-ZSL) framework, a novel approach leveraging large language models (LLMs) in
+a zero-shot learning mode. Unlike conventional ML methods, PP-ZSL eliminates
+the need for local training on sensitive data by utilizing pre-trained LLMs to
+generate responses directly. The framework incorporates real-time data
+anonymization to redact or mask sensitive information, retrieval-augmented
+generation (RAG) for domain-specific query resolution, and robust
+post-processing to ensure compliance with regulatory standards. This
+combination reduces privacy risks, simplifies compliance, and enhances
+scalability and operational efficiency. Empirical analysis demonstrates that
+the PP-ZSL framework provides accurate, privacy-compliant responses while
+significantly lowering the costs and complexities of deploying AI-driven
+customer support systems. The study highlights potential applications across
+industries, including financial services, healthcare, e-commerce, legal
+support, telecommunications, and government services. By addressing the dual
+challenges of privacy and performance, this framework establishes a foundation
+for secure, efficient, and regulatory-compliant AI applications in customer
+interactions.
+
+
+
+
+
+
+
+
+ Minh Le, Tien Ngoc Luu, An Nguyen The, Thanh-Thien Le, Trang Nguyen, Tung Thanh Nguyen, Linh Ngo Van, Thien Huu Nguyen
+
+
+ To address catastrophic forgetting in Continual Relation Extraction (CRE),
+many current approaches rely on memory buffers to rehearse previously learned
+knowledge while acquiring new tasks. Recently, prompt-based methods have
+emerged as potent alternatives to rehearsal-based strategies, demonstrating
+strong empirical performance. However, upon analyzing existing prompt-based
+approaches for CRE, we identified several critical limitations, such as
+inaccurate prompt selection, inadequate mechanisms for mitigating forgetting in
+shared parameters, and suboptimal handling of cross-task and within-task
+variances. To overcome these challenges, we draw inspiration from the
+relationship between prefix-tuning and mixture of experts, proposing a novel
+approach that employs a prompt pool for each task, capturing variations within
+each task while enhancing cross-task variances. Furthermore, we incorporate a
+generative model to consolidate prior knowledge within shared parameters,
+eliminating the need for explicit data storage. Extensive experiments validate
+the efficacy of our approach, demonstrating superior performance over
+state-of-the-art prompt-based and rehearsal-free methods in continual relation
+extraction.
+
+
+ Q-learning is a powerful tool for network control and policy optimization in
+wireless networks, but it struggles with large state spaces. Recent
+advancements, like multi-environment mixed Q-learning (MEMQ), improve
+performance and reduce complexity by integrating multiple Q-learning
+algorithms across multiple related environments, so-called digital cousins.
+However, MEMQ is designed for centralized single-agent networks and is not
+suitable for decentralized or multi-agent networks. To address this challenge,
+we propose a novel multi-agent MEMQ algorithm for partially decentralized
+wireless networks with multiple mobile transmitters (TXs) and base stations
+(BSs), where TXs do not have access to each other's states and actions. In
+uncoordinated states, TXs act independently to minimize their individual costs.
+In coordinated states, TXs use a Bayesian approach to estimate the joint state
+based on local observations and share limited information with the leader TX to
+minimize joint cost. The cost of information sharing scales linearly with the
+number of TXs and is independent of the joint state-action space size. The
+proposed scheme is 50% faster than centralized MEMQ with only a 20% increase in
+average policy error (APE) and is 25% faster than several advanced
+decentralized Q-learning algorithms with 40% less APE. The convergence of the
+algorithm is also demonstrated.
+
+
+
+ comment: Accepted to 2025 IEEE International Conference on Acoustics, Speech,
+ and Signal Processing (ICASSP 2025)
+
+
+
+
+
+
+ ♻ ☆ Dynamic Importance Learning using Fisher Information Matrix (FIM) for
+ Nonlinear Dynamic Mapping
+
+
+ Understanding output variance is critical in modeling nonlinear dynamic
+systems, as it reflects the system's sensitivity to input variations and
+feature interactions. This work presents a methodology for dynamically
+determining relevance scores in black-box models while ensuring
+interpretability through an embedded decision module. This interpretable
+module, integrated into the first layer of the model, employs the Fisher
+Information Matrix (FIM) and logistic regression to compute relevance scores,
+interpreted as the probabilities of input neurons being active based on their
+contribution to the output variance. The proposed method leverages a
+gradient-based framework to uncover the importance of variance-driven features,
+capturing both individual contributions and complex feature interactions. These
+relevance scores are applied through element-wise transformations of the
+inputs, enabling the black-box model to prioritize features dynamically based
+on their impact on system output. This approach effectively bridges
+interpretability with the intricate modeling of nonlinear dynamics and
+time-dependent interactions. Simulation results demonstrate the method's
+ability to infer feature interactions while achieving superior performance in
+feature relevance compared to existing techniques. The practical utility of
+this approach is showcased through its application to an industrial pH
+neutralization process, where critical system dynamics are uncovered.
+
+
+
+
+
+
+
+ ♻ ☆ Disentangling data distribution for Federated Learning
+
+
+ Federated Learning (FL) facilitates collaborative training of a global model
+whose performance is boosted by private data owned by distributed clients,
+without compromising data privacy. Yet the wide applicability of FL is hindered
+by entanglement of data distributions across different clients. This paper
+demonstrates for the first time that, by disentangling data distributions, FL
+can in principle achieve efficiencies comparable to those of distributed systems,
+requiring only one round of communication. To this end, we propose a novel
+FedDistr algorithm, which employs stable diffusion models to decouple and
+recover data distributions. Empirical results on the CIFAR100 and DomainNet
+datasets show that FedDistr significantly enhances model utility and efficiency
+in both disentangled and near-disentangled scenarios while ensuring privacy,
+outperforming traditional federated learning methods.
+
+
+ This paper studies the problem of class-imbalanced graph classification,
+which aims at effectively classifying the graph categories in scenarios with
+imbalanced class distributions. While graph neural networks (GNNs) have
+achieved remarkable success, their modeling ability on imbalanced
+graph-structured data remains suboptimal, which typically leads to predictions
+biased towards the majority classes. On the other hand, existing
+class-imbalanced learning methods in vision may overlook the rich graph
+semantic substructures of the majority classes and excessively emphasize
+learning from the minority classes. To address these challenges, we propose a
+simple yet powerful approach called C$^3$GNN that integrates the idea of
+clustering into contrastive learning to enhance class-imbalanced graph
+classification. Technically, C$^3$GNN clusters graphs from each majority class
+into multiple subclasses, with sizes comparable to the minority class,
+mitigating class imbalance. It also employs the Mixup technique to generate
+synthetic samples, enriching the semantic diversity of each subclass.
+Furthermore, supervised contrastive learning is used to hierarchically learn
+effective graph representations, enabling the model to thoroughly explore
+semantic substructures in majority classes while avoiding excessive focus on
+minority classes. Extensive experiments on real-world graph benchmark datasets
+verify the superior performance of our proposed method against competitive
+baselines.
+
+
+
+ comment: Accepted by Proceedings of the Thirty-Ninth AAAI Conference on
+ Artificial Intelligence (AAAI-25)
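The rebalancing idea described in the abstract above (split each majority class into minority-sized subclasses, then enrich them with Mixup-style synthetic samples) can be sketched schematically. Simple chunking stands in for the paper's clustering step, and `rebalance_majority` is an illustrative name, not the authors' code:

```python
import random

def rebalance_majority(majority, minority_size, alpha=0.2, seed=0):
    # Split the majority class into minority-sized subclasses (simple
    # chunking stands in for the clustering step), then pad each subclass
    # with Mixup samples: convex combinations of two members with
    # lambda drawn from Beta(alpha, alpha).
    rng = random.Random(seed)
    subclasses = [majority[i:i + minority_size]
                  for i in range(0, len(majority), minority_size)]
    for sub in subclasses:
        while len(sub) < minority_size:
            a, b = rng.sample(sub, 2) if len(sub) >= 2 else (sub[0], sub[0])
            lam = rng.betavariate(alpha, alpha)
            sub.append([lam * u + (1 - lam) * v for u, v in zip(a, b)])
    return subclasses

# Ten 2-d majority "graph embeddings", minority class of size 4: the split
# yields subclasses of sizes 4, 4, 2, and Mixup pads the last one to 4.
majority = [[float(i), float(-i)] for i in range(10)]
balanced = rebalance_majority(majority, minority_size=4)
```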
+
+
+
+
+
+
+ ♻ ☆ Aligning the Objective of LLM-based Program Repair ICSE'25
+
+
+ Large language models (LLMs) have achieved decent results on automated
+program repair (APR). However, the next token prediction training objective of
+decoder-only LLMs (e.g., GPT-4) is misaligned with the masked span prediction
+objective of current infilling-style methods, which impedes LLMs from fully
+leveraging pre-trained knowledge for program repair. In addition, while some
+LLMs can locate and repair bugs in certain functions using the related
+artifacts (e.g., test cases), existing methods still depend on statement-level
+fault localization methods to provide a list of buggy hunks for repair. This
+restriction hinders LLMs from exploring potential patches beyond the given
+locations.
+ In this paper, we investigate a new approach to adapt LLMs to program repair.
+Our core insight is that LLMs' APR capability can be greatly improved by simply
+aligning the output to their training objective and allowing them to refine the
+whole program without first identifying faulty statements. Based on this
+insight, we designed D4C, a straightforward prompting framework for APR. D4C
+can repair 180 bugs correctly in Defects4J, with each patch being sampled only
+10 times. This surpasses the SOTA APR methods with perfect fault localization
+by 10% and reduces the patch sampling number by 90%. Our findings reveal that
+(1) objective alignment is crucial for fully exploiting LLMs' pre-trained
+capability, and (2) replacing the traditional localize-buggy-hunks-then-repair
+workflow with direct debugging is more effective for LLM-based APR methods.
+Thus, we believe this paper introduces a new mindset for harnessing LLMs in
+APR.
+
+
+
+
+
+
+
+
+ Taewhan Kim, Soeun Lee, Si-Woo Kim, Dong-Jin Kim
+
+
+ Recent lightweight image captioning models using retrieved data mainly focus
+on text prompts. However, previous works utilize the retrieved text only as
+text prompts, while the visual information relies solely on the CLIP visual
+embedding. As a result, the image descriptions inherent in the prompt are not
+sufficiently reflected in the visual embedding space. To tackle this issue, we
+propose ViPCap, a novel
+retrieval text-based visual prompt for lightweight image captioning. ViPCap
+leverages the retrieved text with image information as visual prompts to
+enhance the ability of the model to capture relevant visual information. By
+mapping text prompts into the CLIP space and generating multiple randomized
+Gaussian distributions, our method leverages sampling to explore randomly
+augmented distributions and effectively retrieves the semantic features that
+contain image information. These retrieved features are integrated into the
+image and designated as the visual prompt, leading to performance improvements
+on datasets such as COCO, Flickr30k, and NoCaps. Experimental results
+demonstrate that ViPCap significantly outperforms prior lightweight captioning
+models in efficiency and effectiveness, demonstrating the potential for a
+plug-and-play solution.
+
+
+
+ comment: Accepted to AAAI 2025
+
+
+
+
+
+
+ ♻ ☆ Some Primal-Dual Theory for Subgradient Methods for Strongly Convex
+ Optimization
+
+
+ We consider (stochastic) subgradient methods for strongly convex but
+potentially nonsmooth non-Lipschitz optimization. We provide new equivalent
+dual descriptions (in the style of dual averaging) for the classic subgradient
+method, the proximal subgradient method, and the switching subgradient method.
+These equivalences enable $O(1/T)$ convergence guarantees in terms of both
+their classic primal gap and a not previously analyzed dual gap for strongly
+convex optimization. Consequently, our theory provides these classic methods
+with simple, optimal stopping criteria and optimality certificates at no added
+computational cost. Our results apply to a wide range of stepsize selections
+and of non-Lipschitz ill-conditioned problems where the early iterations of the
+subgradient method may diverge exponentially quickly (a phenomenon which, to
+the best of our knowledge, no prior works address). Even in the presence of
+such undesirable behaviors, our theory still ensures and bounds eventual
+convergence.
+
+
+
+ comment: 25 pages, major revision shortened the write-up and unified the
+ analysis to be done just once in a single "super" setting
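For context, the classic subgradient method analyzed in the abstract above can be stated in a few lines: for a \(\mu\)-strongly convex objective, stepsizes \(\eta_t = 2/(\mu(t+1))\) with a \(t\)-weighted average of the iterates give the standard \(O(1/T)\) primal-gap rate. A minimal sketch on a toy nonsmooth problem (not the paper's dual-gap machinery):

```python
def subgradient_method(subgrad, x0, mu, T):
    # Subgradient method for a mu-strongly convex objective with
    # stepsizes eta_t = 2/(mu*(t+1)); returns the t-weighted average
    # iterate, which enjoys the standard O(1/T) primal-gap guarantee.
    x = x0
    num, den = 0.0, 0
    for t in range(1, T + 1):
        x = x - (2.0 / (mu * (t + 1))) * subgrad(x)
        num += t * x
        den += t
    return num / den

# Toy nonsmooth problem: f(x) = 0.5*x**2 + |x| (mu = 1), minimized at 0.
def sg(x):
    return x + (1.0 if x > 0 else -1.0 if x < 0 else 0.0)

x_bar = subgradient_method(sg, x0=5.0, mu=1.0, T=2000)
```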
+
+
+
+
+
+
+
+ Hyeonah Kim, Minsu Kim, Sanghyeok Choi, Jinkyoo Park
+
+
+ The challenge of discovering new molecules with desired properties is crucial
+in domains like drug discovery and material design. Recent advances in deep
+learning-based generative methods have shown promise but face the issue of
+sample efficiency due to the computational expense of evaluating the reward
+function. This paper proposes a novel algorithm for sample-efficient molecular
+optimization by distilling a powerful genetic algorithm into a deep generative
+policy using GFlowNets training, an off-policy method for amortized inference.
+This approach enables the deep generative policy to learn from domain
+knowledge, which has been explicitly integrated into the genetic algorithm. Our
+method achieves state-of-the-art performance in the official molecular
+optimization benchmark, significantly outperforming previous methods. It also
+demonstrates effectiveness in designing inhibitors against SARS-CoV-2 with
+substantially fewer reward calls.
+
+
+ As data retrieval demands become increasingly complex, traditional search
+methods often fall short in addressing nuanced and conceptual queries. Vector
+similarity search has emerged as a promising technique for finding semantically
+similar information efficiently. However, its effectiveness diminishes when
+handling intricate queries with contextual nuances. This paper explores a
+hybrid approach combining vector similarity search with Large Language Models
+(LLMs) to enhance search accuracy and relevance. The proposed two-step solution
+first employs vector similarity search to shortlist potential matches, followed
+by an LLM for context-aware ranking of the results. Experiments on structured
+datasets demonstrate that while vector similarity search alone performs well
+for straightforward queries, the LLM-assisted approach excels in processing
+complex queries involving constraints, negations, or conceptual requirements.
+By leveraging the natural language understanding capabilities of LLMs, this
+method improves the accuracy of search results for complex tasks without
+sacrificing efficiency. We also discuss real-world applications and propose
+directions for future research to refine and scale this technique for diverse
+datasets and use cases.
+ Original article:
+https://engineering.grab.com/llm-assisted-vector-similarity-search
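The two-step solution described above (vector-similarity shortlist, then LLM re-ranking) can be sketched with a cosine-similarity first pass and a caller-supplied `rerank` scorer standing in for the LLM call:

```python
def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_search(query_vec, docs, rerank, k=3):
    # Step 1: shortlist the top-k documents by embedding similarity.
    # Step 2: reorder the shortlist with a context-aware scorer
    # (`rerank` stands in for the LLM ranking call).
    shortlist = sorted(docs, key=lambda d: cosine(query_vec, d["vec"]),
                       reverse=True)[:k]
    return sorted(shortlist, key=rerank, reverse=True)

docs = [
    {"id": "a", "vec": [1.0, 0.0]},
    {"id": "b", "vec": [0.9, 0.1]},
    {"id": "c", "vec": [0.0, 1.0]},
]
# "c" is filtered out in step 1; the stand-in "LLM" scorer then prefers "b".
llm_scores = {"a": 1, "b": 2, "c": 3}
ranked = hybrid_search([1.0, 0.0], docs,
                       rerank=lambda d: llm_scores[d["id"]], k=2)
```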
+
+
+
+
+
+
+
+ ♻ ☆ A Model Selection Approach for Corruption Robust Reinforcement Learning
+
+
+ We develop a model selection approach to tackle reinforcement learning with
+adversarial corruption in both transition and reward. For finite-horizon
+tabular MDPs, without prior knowledge on the total amount of corruption, our
+algorithm achieves a regret bound of
+$\widetilde{\mathcal{O}}(\min\{\frac{1}{\Delta}, \sqrt{T}\}+C)$ where $T$ is
+the number of episodes, $C$ is the total amount of corruption, and $\Delta$ is
+the reward gap between the best and the second-best policy. This is the first
+worst-case optimal bound achieved without knowledge of $C$, improving previous
+results of Lykouris et al. (2021); Chen et al. (2021); Wu et al. (2021). For
+finite-horizon linear MDPs, we develop a computationally efficient algorithm
+with a regret bound of $\widetilde{\mathcal{O}}(\sqrt{(1+C)T})$, and another
+computationally inefficient one with $\widetilde{\mathcal{O}}(\sqrt{T}+C)$,
+improving the result of Lykouris et al. (2021) and answering an open question
+by Zhang et al. (2021b). Finally, our model selection framework can be easily
+applied to other settings including linear bandits, linear contextual bandits,
+and MDPs with general function approximation, leading to several improved or
+new results.
+
+
+
+
+
+
+
+ ♻ ☆ An Accelerated Algorithm for Stochastic Bilevel Optimization under
+ Unbounded Smoothness NeurIPS 2024
+
+
+
+
+
+
+
+
+ Xiaochuan Gong, Jie Hao, Mingrui Liu
+
+
+ This paper investigates a class of stochastic bilevel optimization problems
+where the upper-level function is nonconvex with potentially unbounded
+smoothness and the lower-level problem is strongly convex. These problems have
+significant applications in sequential data learning, such as text
+classification using recurrent neural networks. The unbounded smoothness is
+characterized by the smoothness constant of the upper-level function scaling
+linearly with the gradient norm, lacking a uniform upper bound. Existing
+state-of-the-art algorithms require $\widetilde{O}(1/\epsilon^4)$ oracle calls
+of stochastic gradient or Hessian/Jacobian-vector product to find an
+$\epsilon$-stationary point. However, it remains unclear if we can further
+improve the convergence rate when the assumptions for the function in the
+population level also hold for each random realization almost surely (e.g.,
+Lipschitzness of each realization of the stochastic gradient). To address this
+issue, we propose a new Accelerated Bilevel Optimization algorithm named AccBO.
+The algorithm updates the upper-level variable by normalized stochastic
+gradient descent with recursive momentum and the lower-level variable by the
+stochastic Nesterov accelerated gradient descent algorithm with averaging. We
+prove that our algorithm achieves an oracle complexity of
+$\widetilde{O}(1/\epsilon^3)$ to find an $\epsilon$-stationary point. Our proof
+relies on a novel lemma characterizing the dynamics of stochastic Nesterov
+accelerated gradient descent algorithm under distribution drift with high
+probability for the lower-level variable, which is of independent interest and
+also plays a crucial role in analyzing the hypergradient estimation error over
+time. Experimental results on various tasks confirm that our proposed algorithm
+achieves the predicted theoretical acceleration and significantly outperforms
+baselines in bilevel optimization.
+
+
+
+ comment: Accepted by NeurIPS 2024. The code is available at
+ https://github.com/MingruiLiu-ML-Lab/Accelerated-Bilevel-Optimization-Unbounded-Smoothness
+
+
+
+
+
+
+ ♻ ☆ TAEN: A Model-Constrained Tikhonov Autoencoder Network for Forward and
+ Inverse Problems
+
+
+
+
+
+
+
+
+ Hai V. Nguyen, Tan Bui-Thanh, Clint Dawson
+
+
+ Efficient real-time solvers for forward and inverse problems are essential in
+engineering and science applications. Machine learning surrogate models have
+emerged as promising alternatives to traditional methods, offering
+substantially reduced computational time. Nevertheless, these models typically
+demand extensive training datasets to achieve robust generalization across
+diverse scenarios. While physics-based approaches can partially mitigate this
+data dependency and ensure physics-interpretable solutions, addressing scarce
+data regimes remains a challenge. Both purely data-driven and physics-based
+machine learning approaches demonstrate severe overfitting issues when trained
+with insufficient data. We propose a novel Tikhonov autoencoder
+model-constrained framework, called TAE, capable of learning both forward and
+inverse surrogate models using a single arbitrary observation sample. We
+develop comprehensive theoretical foundations including forward and inverse
+inference error bounds for the proposed approach for linear cases. For
+comparative analysis, we derive equivalent formulations for pure data-driven
+and model-constrained approach counterparts. At the heart of our approach is a
+data randomization strategy, which functions as a generative mechanism for
+exploring the training data space, enabling effective training of both forward
+and inverse surrogate models from a single observation, while regularizing the
+learning process. We validate our approach through extensive numerical
+experiments on two challenging inverse problems: 2D heat conductivity inversion
+and initial condition reconstruction for time-dependent 2D Navier-Stokes
+equations. Results demonstrate that TAE achieves accuracy comparable to
+traditional Tikhonov solvers and numerical forward solvers for both inverse and
+forward problems, respectively, while delivering orders of magnitude
+computational speedups.
+
+
+ Data-driven decision-making processes increasingly utilize end-to-end
+learnable deep neural networks to render final decisions. Sometimes, the output
+of the forward functions in certain layers is determined by the solutions to
+mathematical optimization problems, leading to the emergence of differentiable
+optimization layers that permit gradient back-propagation. However, real-world
+scenarios often involve large-scale datasets and numerous constraints,
+presenting significant challenges. Current methods for differentiating
+optimization problems typically rely on implicit differentiation, which
+necessitates costly computations on the Jacobian matrices, resulting in low
+efficiency. In this paper, we introduce BPQP, a differentiable convex
+optimization framework designed for efficient end-to-end learning. To enhance
+efficiency, we reformulate the backward pass as a simplified and decoupled
+quadratic programming problem by leveraging the structural properties of the
+KKT matrix. This reformulation enables the use of first-order optimization
+algorithms in calculating the backward pass gradients, allowing our framework
+to potentially utilize any state-of-the-art solver. As solver technologies
+evolve, BPQP can continuously adapt and improve its efficiency. Extensive
+experiments on both simulated and real-world datasets demonstrate that BPQP
+achieves a significant improvement in efficiency--typically an order of
+magnitude faster in overall execution time compared to other differentiable
+optimization layers. Our results not only highlight the efficiency gains of
+BPQP but also underscore its superiority over differentiable optimization layer
+baselines.
+
+
+
+ comment: NeurIPS 2024 Spotlight
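The backward-as-a-QP idea can be illustrated on a tiny equality-constrained QP in NumPy: the forward pass solves the KKT system, and the gradient of a loss with respect to the linear-cost term comes from one more solve against the same symmetric KKT matrix. This is a minimal sketch using a dense direct solve; BPQP's point is that this backward system can instead be handed to any first-order solver, and the problem instance and loss below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def qp_forward(Q, p, A, b):
    """Solve min 0.5 x^T Q x + p^T x  s.t.  A x = b via its KKT system."""
    n, m = Q.shape[0], A.shape[0]
    K = np.block([[Q, A.T], [A, np.zeros((m, m))]])
    sol = np.linalg.solve(K, np.concatenate([-p, b]))
    return sol[:n], K  # primal solution x*, KKT matrix

def qp_backward(K, grad_x, n):
    """dL/dp for a loss L(x*): one more solve with the symmetric KKT
    matrix (the simplified, decoupled backward problem), then negate."""
    rhs = np.concatenate([grad_x, np.zeros(K.shape[0] - n)])
    return -np.linalg.solve(K, rhs)[:n]

# Toy instance: 2 variables, one constraint x0 + x1 = 1.
Q = np.array([[2.0, 0.0], [0.0, 2.0]])
p = np.array([1.0, -1.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

x_star, K = qp_forward(Q, p, A, b)        # x* = [0, 1]
grad_p = qp_backward(K, 2 * x_star, n=2)  # loss L = ||x*||^2

# Finite-difference check of dL/dp.
eps, fd = 1e-6, np.zeros(2)
for i in range(2):
    dp = np.zeros(2); dp[i] = eps
    xp, _ = qp_forward(Q, p + dp, A, b)
    xm, _ = qp_forward(Q, p - dp, A, b)
    fd[i] = (xp @ xp - xm @ xm) / (2 * eps)
```

`grad_p` agrees with the finite-difference estimate, confirming the implicit-differentiation identity without ever forming the Jacobian of the solution map explicitly.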
+
+
+
+
+
+
+ ♻ ☆ Next Token Prediction Towards Multimodal Intelligence: A Comprehensive
+ Survey
+
+
+ Building on the foundations of language modeling in natural language
+processing, Next Token Prediction (NTP) has evolved into a versatile training
+objective for machine learning tasks across various modalities, achieving
+considerable success. As Large Language Models (LLMs) have advanced to unify
+understanding and generation tasks within the textual modality, recent research
+has shown that tasks from different modalities can also be effectively
+encapsulated within the NTP framework, transforming the multimodal information
+into tokens and predicting the next one given the context. This survey introduces
+a comprehensive taxonomy that unifies both understanding and generation within
+multimodal learning through the lens of NTP. The proposed taxonomy covers five
+key aspects: Multimodal tokenization, MMNTP model architectures, unified task
+representation, datasets \& evaluation, and open challenges. This new taxonomy
+aims to aid researchers in their exploration of multimodal intelligence. An
+associated GitHub repository collecting the latest papers and repos is
+available at https://github.com/LMM101/Awesome-Multimodal-Next-Token-Prediction
+
+
+
+
+
+
+
+ ♻ ☆ IRG: Generating Synthetic Relational Databases using Deep Learning with
+ Insightful Relational Understanding
+
+
+
+
+
+
+
+
+ Jiayu Li, Zilong Zhao, Vikram Chundawat, Biplab Sikdar, Y. C. Tay
+
+
+ Synthetic data has numerous applications, including but not limited to
+software testing at scale, privacy-preserving data sharing to enable smoother
+collaboration between stakeholders, and data augmentation for analytical and
+machine learning tasks. Relational databases, which are commonly used by
+corporations, governments, and financial institutions, present unique
+challenges for synthetic data generation due to their complex structures.
+Existing synthetic relational database generation approaches often assume
+idealized scenarios, such as every table having a perfect primary key column
+without composite and potentially overlapping primary or foreign key
+constraints, and fail to account for the sequential nature of certain tables.
+In this paper, we propose the incremental relational generator (IRG), which
+successfully handles these ubiquitous real-life situations. IRG ensures the
+preservation of relational schema integrity, offers a deep contextual
+understanding of relationships beyond direct ancestors and descendants,
+leverages the power of newly designed deep neural networks, and scales
+efficiently to handle larger datasets--a combination never achieved in previous
+works. Experiments on three open-source real-life relational datasets in
+different fields at different scales demonstrate IRG's advantage in maintaining
+the synthetic data's relational schema validity and data fidelity and utility.
+
+
+ Graph Neural Networks (GNNs) have recently gained widespread attention as a
+successful tool for analyzing graph-structured data. However, an imperfect
+graph structure with noisy links lacks robustness and may degrade graph
+representations, thereby limiting GNNs' performance in practical tasks.
+Moreover, existing generative architectures fail to fit discriminative
+graph-related tasks. To tackle these issues, we introduce an unsupervised
+method based on a joint of generative training and discriminative training to
+learn graph structure and representation, aiming to improve the discriminative
+performance of generative models. We propose an Energy-based Contrastive
+Learning (ECL) guided Graph Structure Refinement (GSR) framework, denoted as
+ECL-GSR. To our knowledge, this is the first work to combine energy-based
+models with contrastive learning for GSR. Specifically, we leverage ECL to
+approximate the joint distribution of sample pairs, which increases the
+similarity between representations of positive pairs while reducing the
+similarity between negative ones. Refined structure is produced by augmenting
+and removing edges according to the similarity metrics among node
+representations. Extensive experiments demonstrate that ECL-GSR outperforms the
+state-of-the-art on eight benchmark datasets in node classification. ECL-GSR
+trains faster with fewer samples and less memory than the leading baseline,
+highlighting its simplicity and efficiency in downstream tasks.
+
+
+
+ comment: Accepted to AAAI 2025
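The structure-refinement step described above (adding and removing edges according to similarity among node representations) can be sketched in a few lines. The thresholds and toy embeddings are assumptions for illustration, and the energy-based contrastive training that would produce the embeddings is not reproduced here.

```python
import numpy as np

def refine_graph(adj, emb, tau_add=0.9, tau_del=0.1):
    """Refine an adjacency matrix using cosine similarity of node
    embeddings: add confident edges, drop dissimilar (likely noisy) ones."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    refined = adj.copy()
    refined[sim >= tau_add] = 1   # augment edges between similar nodes
    refined[sim <= tau_del] = 0   # remove edges between dissimilar nodes
    np.fill_diagonal(refined, 0)  # no self-loops
    return refined

adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
emb = np.array([[1.0, 0.0],    # nodes 0 and 2 are near-parallel,
                [0.0, 1.0],    # node 1 is orthogonal to both
                [1.0, 0.1]])
refined = refine_graph(adj, emb)  # keeps only the 0-2 edge
```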
+
+
+
+
+
+
+ ♻ ☆ Dynamic Incremental Optimization for Best Subset Selection
+
+
+ Best subset selection is considered the `gold standard' for many sparse
+learning problems. A variety of optimization techniques have been proposed to
+attack this non-smooth non-convex problem. In this paper, we investigate the
+dual forms of a family of $\ell_0$-regularized problems. An efficient
+primal-dual algorithm is developed based on the primal and dual problem
+structures. By leveraging the dual range estimation along with the incremental
+strategy, our algorithm potentially reduces redundant computation and improves
+the solutions of best subset selection. Theoretical analysis and experiments on
+synthetic and real-world datasets validate the efficiency and statistical
+properties of the proposed solutions.
+
+
+
+ comment: arXiv admin note: substantial text overlap with arXiv:2207.02058
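For orientation, the $\ell_0$-constrained problem the paper targets is often attacked with iterative hard thresholding; the sketch below shows that standard baseline on a synthetic instance. It is not the paper's primal-dual algorithm, and the problem data are invented.

```python
import numpy as np

def iht(X, y, k, iters=500):
    """Iterative hard thresholding for min ||y - Xw||^2 s.t. ||w||_0 <= k:
    a gradient step followed by keeping the k largest-magnitude entries."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1/L step size
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w + step * X.T @ (y - X @ w)     # gradient step
        small = np.argsort(np.abs(w))[:-k]   # all but the k largest
        w[small] = 0.0                       # hard threshold
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20); w_true[[3, 7]] = [2.0, -3.0]
y = X @ w_true                               # noiseless observations
w_hat = iht(X, y, k=2)                       # recovers the true support
```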
+
+
+
+
+
+
+ ♻ ☆ A High Energy-Efficiency Multi-core Neuromorphic Architecture for Deep
+ SNN Training
+
+
+ There is a growing necessity for edge training to adapt to dynamically
+changing environments. Neuromorphic computing represents a significant pathway
+for high-efficiency intelligent computation in energy-constrained edges, but
+existing neuromorphic architectures lack the ability to directly train spiking
+neural networks (SNNs) based on backpropagation. We develop a multi-core
+neuromorphic architecture with Feedforward-Propagation, Back-Propagation, and
+Weight-Gradient engines in each core, supporting highly efficient parallel
+computing at both the engine and core levels. It combines various data flows
+and sparse computation optimization by fully leveraging the sparsity in SNN
+training, achieving a high energy efficiency of 1.05 TFLOPS/W @ FP16 @ 28 nm
+and a 55~85% reduction of DRAM access compared to an A100 GPU in SNN training,
+and demonstrating 20-core deep SNN training and 5-worker federated learning on
+FPGAs. Our study develops the first multi-core neuromorphic architecture
+supporting direct SNN training, facilitating neuromorphic computing in
+edge-learnable applications.
+
+
+
+
+
+
+
+ ♻ ☆ Exploring Concept Depth: How Large Language Models Acquire Knowledge at
+ Different Layers? COLING 2025
+
+
+ Large language models (LLMs) have shown remarkable performances across a wide
+range of tasks. However, the mechanisms by which these models encode tasks of
+varying complexities remain poorly understood. In this paper, we explore the
+hypothesis that LLMs process concepts of varying complexities in different
+layers, introducing the idea of ``Concept Depth'' to suggest that more complex
+concepts are typically acquired in deeper layers. Specifically, we categorize
+concepts based on their level of abstraction, defining them in the order of
+increasing complexity within factual, emotional, and inferential tasks. We
+conduct extensive probing experiments using layer-wise representations across
+various LLM families (Gemma, LLaMA, Qwen) on various datasets spanning the
+three domains of tasks. Our findings reveal that simpler tasks can be probed
+accurately in shallow layers, while more complex tasks typically necessitate
+deeper layers for accurate understanding. Additionally,
+we examine how external factors, such as adding noise to the input and
+quantizing the model weights, might affect layer-wise representations. Our
+findings suggest that these factors can impede the development of a conceptual
+understanding of LLMs until deeper layers are explored. We hope that our
+proposed concept and experimental insights will enhance the understanding of
+the mechanisms underlying LLMs. Our codes are available at
+\url{https://github.com/Luckfort/CD}.
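The probing methodology can be sketched with a simple ridge-regularized least-squares probe applied per layer. Here the "hidden states" are synthetic, with the class signal strengthening with depth by construction, whereas the paper probes real representations from Gemma, LLaMA, and Qwen; the dimensions and scaling factors are illustrative assumptions.

```python
import numpy as np

def probe_accuracy(feats, labels, reg=1e-3):
    """Fit a least-squares linear probe and report its training accuracy
    (real probing would evaluate on a held-out split)."""
    X = np.hstack([feats, np.ones((len(feats), 1))])  # add bias column
    y = 2.0 * labels - 1.0                            # map {0,1} -> {-1,+1}
    w = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ y)
    return float(np.mean((X @ w > 0) == (labels == 1)))

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
signal = (2.0 * labels - 1.0)[:, None]

# Toy per-layer features: the class-relevant direction grows with depth.
layers = [0.1 * s * signal + rng.standard_normal((200, 8))
          for s in (0.0, 1.0, 10.0)]  # shallow, middle, deep

accs = [probe_accuracy(f, labels) for f in layers]  # rises with depth
```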
+
+
+ Reasoning is critical for large language models (LLMs) to excel in a wide
+range of tasks. While methods like Chain-of-Thought (CoT) reasoning enhance LLM
+performance by decomposing problems into intermediate steps, they also incur
+significant overhead in token usage, leading to increased costs. We find that
+the reasoning process of current LLMs is unnecessarily lengthy and can be
+compressed by including a reasonable token budget in the prompt, but the choice
+of token budget plays a crucial role in the actual compression effectiveness.
+We then propose a token-budget-aware LLM reasoning framework, which dynamically
+estimates token budgets for different problems based on reasoning complexity
+and uses the estimated token budgets to guide the reasoning process.
+Experiments show that our method effectively reduces token costs in CoT
+reasoning with only a slight performance reduction, offering a practical
+solution to balance efficiency and accuracy in LLM reasoning. Code:
+https://github.com/GeniusHTX/TALE.
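The prompting side of a token-budget-aware setup can be sketched as below. The budget heuristic is a placeholder assumption (the paper estimates budgets from reasoning complexity, e.g. with the model itself), and no LLM call is made; the sketch only shows how an estimated budget is folded into the prompt.

```python
def estimate_budget(question, base=30, per_word=2, cap=200):
    """Toy complexity proxy: longer questions get larger token budgets.
    (Placeholder for a learned or model-based budget estimator.)"""
    return min(cap, base + per_word * len(question.split()))

def budgeted_prompt(question):
    """Fold the estimated budget into a budget-constrained CoT prompt."""
    budget = estimate_budget(question)
    return (f"{question}\n"
            f"Let's think step by step and use less than {budget} tokens.")

prompt = budgeted_prompt(
    "If a train travels 60 km in 40 minutes, what is its average speed in km/h?")
```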
+
+
+ This study addresses the challenge of quantifying chess puzzle difficulty - a
+complex task that combines elements of game theory and human cognition and
+underscores its critical role in effective chess training. We present
+GlickFormer, a novel transformer-based architecture that predicts chess puzzle
+difficulty by approximating the Glicko-2 rating system. Unlike conventional
+chess engines that optimize for game outcomes, GlickFormer models human
+perception of tactical patterns and problem-solving complexity. The proposed
+model utilizes a modified ChessFormer backbone for spatial feature extraction
+and incorporates temporal information via factorized transformer techniques.
+This approach enables the capture of both spatial chess piece arrangements and
+move sequences, effectively modeling spatio-temporal relationships relevant to
+difficulty assessment. Experimental evaluation was conducted on a dataset of
+over 4 million chess puzzles. Results demonstrate GlickFormer's superior
+performance compared to the state-of-the-art ChessFormer baseline across
+multiple metrics. The algorithm's performance has also been recognized through
+its competitive results in the IEEE BigData 2024 Cup: Predicting Chess Puzzle
+Difficulty competition, where it placed 11th. The insights gained from this
+study have implications for personalized chess training and broader
+applications in educational technology and cognitive modeling.
+
+
+
+
+
+
+
+
+
+
+ Multimedia 11
+
+
+
+
+
+ ☆ Visual Style Prompt Learning Using Diffusion Models for Blind Face
+ Restoration
+
+
+ Blind face restoration aims to recover high-quality facial images from
+various unidentified sources of degradation, posing significant challenges due
+to the minimal information retrievable from the degraded images. Prior
+knowledge-based methods, leveraging geometric priors and facial features, have
+led to advancements in face restoration but often fall short of capturing fine
+details. To address this, we introduce a visual style prompt learning framework
+that utilizes diffusion probabilistic models to explicitly generate visual
+prompts within the latent space of pre-trained generative models. These prompts
+are designed to guide the restoration process. To fully utilize the visual
+prompts and enhance the extraction of informative and rich patterns, we
+introduce a style-modulated aggregation transformation layer. Extensive
+experiments and applications demonstrate the superiority of our method in
+achieving high-quality blind face restoration. The source code is available at
+\href{https://github.com/LonglongaaaGo/VSPBFR}{https://github.com/LonglongaaaGo/VSPBFR}.
+
+
+
+ comment: Published at Pattern Recognition; 13 pages, 11 figures
+
+
+
+
+
+
+ ☆ Towards Identity-Aware Cross-Modal Retrieval: a Dataset and a Baseline ECIR 2025
+
+
+ Recent advancements in deep learning have significantly enhanced
+content-based retrieval methods, notably through models like CLIP that map
+images and texts into a shared embedding space. However, these methods often
+struggle with domain-specific entities and long-tail concepts absent from their
+training data, particularly in identifying specific individuals. In this paper,
+we explore the task of identity-aware cross-modal retrieval, which aims to
+retrieve images of persons in specific contexts based on natural language
+queries. This task is critical in various scenarios, such as for searching and
+browsing personalized video collections or large audio-visual archives
+maintained by national broadcasters. We introduce a novel dataset, COCO Person
+FaceSwap (COCO-PFS), derived from the widely used COCO dataset and enriched
+with deepfake-generated faces from VGGFace2. This dataset addresses the lack of
+large-scale datasets needed for training and evaluating models for this task.
+Our experiments assess the performance of different CLIP variations repurposed
+for this task, including our architecture, Identity-aware CLIP (Id-CLIP), which
+achieves competitive retrieval performance through targeted fine-tuning. Our
+contributions lay the groundwork for more robust cross-modal retrieval systems
+capable of recognizing long-tail identities and contextual nuances. Data and
+code are available at https://github.com/mesnico/IdCLIP.
+
+
+
+ comment: Accepted as full paper at ECIR 2025
+
+
+
+
+
+
+ ☆ SFE-Net: Harnessing Biological Principles of Differential Gene
+ Expression for Improved Feature Selection in Deep Learning Networks
+
+
+ In the realm of DeepFake detection, the challenge of adapting to various
+synthesis methodologies such as Faceswap, Deepfakes, Face2Face, and
+NeuralTextures significantly impacts the performance of traditional machine
+learning models. These models often suffer from static feature representation,
+which struggles to perform consistently across diversely generated deepfake
+datasets. Inspired by the biological concept of differential gene expression,
+where gene activation is dynamically regulated in response to environmental
+stimuli, we introduce the Selective Feature Expression Network (SFE-Net). This
+innovative framework integrates selective feature activation principles into
+deep learning architectures, allowing the model to dynamically adjust feature
+priorities in response to varying deepfake generation techniques. SFE-Net
+employs a novel mechanism that selectively enhances critical features essential
+for accurately detecting forgeries, while reducing the impact of irrelevant or
+misleading cues akin to adaptive evolutionary processes in nature. Through
+rigorous testing on a range of deepfake datasets, SFE-Net not only surpasses
+existing static models in detecting sophisticated forgeries but also shows
+enhanced generalization capabilities in cross-dataset scenarios. Our approach
+significantly mitigates overfitting by maintaining a dynamic balance between
+feature exploration and exploitation, thus producing more robust and effective
+deepfake detection models. This bio-inspired strategy paves the way for
+developing adaptive deep learning systems that are finely tuned to address the
+nuanced challenges posed by the varied nature of digital forgeries in modern
+digital forensics.
+
+
+
+
+
+
+
+ ☆ Towards nation-wide analytical healthcare infrastructures: A
+ privacy-preserving augmented knee rehabilitation case study
+
+
+
+
+
+
+
+
+ Boris Bačić, Claudiu Vasile, Chengwei Feng, Marian G. Ciucă
+
+
+ The purpose of this paper is to contribute towards the near-future
+privacy-preserving big data analytical healthcare platforms, capable of
+processing streamed or uploaded timeseries data or videos from patients. The
+experimental work includes a real-life knee rehabilitation video dataset
+capturing a set of exercises from simple and personalised to more general and
+challenging movements aimed for returning to sport. To convert video from
+mobile into privacy-preserving diagnostic timeseries data, we employed Google
+MediaPipe pose estimation. The developed proof-of-concept algorithms can
+augment knee exercise videos by overlaying the patient with stick-figure
+elements while updating a generated timeseries plot of knee angle estimates
+streamed in CSV format. For patients and physiotherapists, by setting a-priori
+knee-angle parameters, the video and side-by-side timeseries can visually
+indicate potential issues such as excessive knee flexion, unstable knee
+movements, or stick-figure overlay errors. To address adherence to the
+rehabilitation programme and quantify exercise sets and repetitions, our
+adaptive algorithm correctly identifies 91.67%-100% of all exercises from
+side- and front-view videos. Transparent algorithm design for adaptive visual
+analysis of various knee exercise patterns contributes towards the
+interpretable AI and will inform near-future privacy-preserving, non-vendor
+locking, open-source developments for both end-user computing devices and as
+on-premises non-proprietary cloud platforms that can be deployed within the
+national healthcare system.
+
+
+
+ comment: The original work citation: Ba\v{c}i\'c, B., Claudiu Vasile, Feng,
+ C., & Ciuc\u{a}, M. G. (2024, 13-15 Dec.). Towards nation-wide analytical
+ healthcare infrastructures: A privacy-preserving augmented knee
+ rehabilitation case study. Presented at the Conference on Innovative
+ Technologies in Intelligent Systems & Industrial Applications (CITISIA 2024),
+ Sydney, NSW
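The per-frame knee-angle estimation behind the timeseries can be sketched from three 2-D pose landmarks (e.g. MediaPipe's hip, knee, and ankle keypoints), with flexion flagged against an a-priori threshold; the threshold value here is an illustrative assumption.

```python
import math

def joint_angle(hip, knee, ankle):
    """Angle at the knee (degrees) between the thigh and shank vectors,
    from 2-D landmark coordinates; 180 degrees = fully extended leg."""
    v1 = (hip[0] - knee[0], hip[1] - knee[1])
    v2 = (ankle[0] - knee[0], ankle[1] - knee[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cos_a = dot / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))

def excessive_flexion(angle_deg, threshold_deg=90.0):
    """Flag frames whose knee angle drops below the a-priori threshold."""
    return angle_deg < threshold_deg

straight = joint_angle((0, 0), (0, 1), (0, 2))  # collinear: 180 degrees
bent = joint_angle((0, 0), (0, 1), (1, 1))      # right angle: 90 degrees
```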
+
+
+
+
+
+
+ ☆ ChartAdapter: Large Vision-Language Model for Chart Summarization
+
+
+ Chart summarization, which focuses on extracting key information from charts
+and interpreting it in natural language, is crucial for generating and
+delivering insights through effective and accessible data analysis. Traditional
+methods for chart understanding and summarization often rely on multi-stage
+pipelines, which may produce suboptimal semantic alignment between visual and
+textual information. In comparison, recently developed LLM-based methods are
+more dependent on the capabilities of foundation vision or language models,
+while ignoring the characteristics of chart data and its relevant challenges. To
+address these limitations, we propose ChartAdapter, a novel lightweight
+transformer module designed to bridge the gap between charts and textual
+summaries. ChartAdapter employs learnable query vectors to extract implicit
+semantics from chart data and incorporates a cross-modal alignment projector to
+enhance vision-to-language generative learning. By integrating ChartAdapter
+with an LLM, we enable end-to-end training and efficient chart summarization.
+To further enhance the training, we introduce a three-stage hierarchical
+training procedure and develop a large-scale dataset specifically curated for
+chart summarization, comprising 190,618 samples. Experimental results on the
+standard Chart-to-Text testing set demonstrate that our approach significantly
+outperforms existing methods, including state-of-the-art models, in generating
+high-quality chart summaries. Ablation studies further validate the
+effectiveness of key components in ChartAdapter. This work highlights the
+potential of tailored LLM-based approaches to advance chart understanding and
+sets a strong foundation for future research in this area.
+
+
+
+
+
+
+
+
+ Mai Xu, Yinglin Zhu, Qunliang Xing, Jing Yang, Xin Zou
+
+
+ Stereo images captured by Mars rovers are transmitted after lossy compression
+due to the limited bandwidth between Mars and Earth. Unfortunately, this
+process results in undesirable compression artifacts. In this paper, we present
+a novel stereo quality enhancement approach for Martian images, named MarsSQE.
+First, we establish the first dataset of stereo Martian images. Through
+extensive analysis of this dataset, we observe that cross-view correlations in
+Martian images are notably high. Leveraging this insight, we design a bi-level
+cross-view attention-based quality enhancement network that fully exploits
+these inherent cross-view correlations. Specifically, our network integrates
+pixel-level attention for precise matching and patch-level attention for
+broader contextual information. Experimental results demonstrate the
+effectiveness of our MarsSQE approach.
+
+
+
+
+
+
+
+ ☆ SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection
+
+
+ With the rapid advancement of remote sensing technology, high-resolution
+multi-modal imagery is now more widely accessible. Conventional object
+detection models are trained on a single dataset, often restricted to a
+specific imaging modality and annotation format. However, such an approach
+overlooks the valuable shared knowledge across multi-modalities and limits the
+model's applicability in more versatile scenarios. This paper introduces a new
+task called Multi-Modal Datasets and Multi-Task Object Detection (M2Det) for
+remote sensing, designed to accurately detect horizontal or oriented objects
+from any sensor modality. This task poses challenges due to 1) the trade-offs
+involved in managing multi-modal modelling and 2) the complexities of
+multi-task optimization. To address these, we establish a benchmark dataset and
+propose a unified model, SM3Det (Single Model for Multi-Modal datasets and
+Multi-Task object Detection). SM3Det leverages a grid-level sparse MoE backbone
+to enable joint knowledge learning while preserving distinct feature
+representations for different modalities. Furthermore, it integrates a
+consistency and synchronization optimization strategy using dynamic learning
+rate adjustment, allowing it to effectively handle varying levels of learning
+difficulty across modalities and tasks. Extensive experiments demonstrate
+SM3Det's effectiveness and generalizability, consistently outperforming
+specialized models on individual datasets. The code is available at
+https://github.com/zcablii/SM3Det.
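The grid-level sparse MoE routing at the core of the approach can be sketched generically: each grid token is dispatched to its top-k experts, whose outputs are combined with softmax gate weights. The expert count, dimensions, and linear experts below are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

def sparse_moe(tokens, w_gate, experts, k=2):
    """Top-k sparse mixture-of-experts over a flattened feature grid:
    route each token to its k highest-scoring experts only."""
    logits = tokens @ w_gate                    # [n_tokens, n_experts]
    topk = np.argsort(logits, axis=1)[:, -k:]   # top-k expert indices
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        sel = logits[t, topk[t]]
        gates = np.exp(sel - sel.max())
        gates /= gates.sum()                    # softmax over the top-k only
        for g, e in zip(gates, topk[t]):
            out[t] += g * experts[e](tokens[t])
    return out

rng = np.random.default_rng(0)
dim, n_experts = 8, 4
experts = [(lambda W: (lambda x: W @ x))(rng.standard_normal((dim, dim)))
           for _ in range(n_experts)]           # toy linear experts
tokens = rng.standard_normal((16, dim))         # 4x4 grid, flattened
w_gate = rng.standard_normal((dim, n_experts))
out = sparse_moe(tokens, w_gate, experts, k=2)
```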
+
+
+ In this paper, we present the Global Multimedia Deepfake Detection
+Challenge held concurrently with Inclusion 2024. The challenge aims to detect
+automatic image and audio-video manipulations, including but not limited to
+editing, synthesis, generation, Photoshop, etc. It has attracted 1,500 teams
+from all over the world, with about 5,000 valid submissions. We invite the top
+20 teams to present their solutions to the challenge, from which the top 3
+teams are awarded prizes in the grand finale. In this paper, we present the
+solutions from the top 3 teams of the two tracks to boost research in image
+and audio-video forgery detection. The methodologies developed through the
+challenge will contribute to the development of next-generation deepfake
+detection systems, and we encourage participants to open-source their methods.
+
+
+
+ comment: Inclusion 2024 Global Multimedia Deepfake Detection Competition Top
+ Team Technical Report
+
+
+
+
+
+
+ ♻ ☆ Synchronized Video Storytelling: Generating Video Narrations with
+ Structured Storyline
+
+
+ Video storytelling is engaging multimedia content that utilizes video and its
+accompanying narration to attract the audience, where a key challenge is
+creating narrations for recorded visual scenes. Previous studies on dense video
+captioning and video story generation have made some progress. However, in
+practical applications, we typically require synchronized narrations for
+ongoing visual scenes. In this work, we introduce a new task of Synchronized
+Video Storytelling, which aims to generate synchronous and informative
+narrations for videos. These narrations, associated with each video clip,
+should relate to the visual content, integrate relevant knowledge, and have an
+appropriate word count corresponding to the clip's duration. Specifically, a
+structured storyline is beneficial to guide the generation process, ensuring
+coherence and integrity. To support the exploration of this task, we introduce
+a new benchmark dataset E-SyncVidStory with rich annotations. Since existing
+Multimodal LLMs are not effective in addressing this task in one-shot or
+few-shot settings, we propose a framework named VideoNarrator that can generate
+a storyline for input videos and simultaneously generate narrations with the
+guidance of the generated or predefined storyline. We further introduce a set
+of evaluation metrics to thoroughly assess the generation. Both automatic and
+human evaluations validate the effectiveness of our approach. Our dataset,
+codes, and evaluations will be released.
+
+
+ Video-to-audio (V2A) generation is important for video editing and
+post-processing, enabling the creation of semantics-aligned audio for silent
+video. However, most existing methods focus on generating short-form audio for
+short video segments (less than 10 seconds), while giving little attention to
+the scenario of long-form video inputs. For current UNet-based diffusion V2A
+models, an inevitable problem when handling long-form audio generation is the
+inconsistencies within the final concatenated audio. In this paper, we first
+highlight the importance of long-form V2A problem. Besides, we propose LoVA, a
+novel model for Long-form Video-to-Audio generation. Based on the Diffusion
+Transformer (DiT) architecture, LoVA proves to be more effective at generating
+long-form audio compared to existing autoregressive models and UNet-based
+diffusion models. Extensive objective and subjective experiments demonstrate
+that LoVA achieves comparable performance on 10-second V2A benchmark and
+outperforms all other baselines on a benchmark with long-form video input.
+
+
+
+ comment: Accepted by ICASSP 2025
+
+
+
+
+
+
+ ♻ ☆ Next Token Prediction Towards Multimodal Intelligence: A Comprehensive
+ Survey
+
+
+ Building on the foundations of language modeling in natural language
+processing, Next Token Prediction (NTP) has evolved into a versatile training
+objective for machine learning tasks across various modalities, achieving
+considerable success. As Large Language Models (LLMs) have advanced to unify
+understanding and generation tasks within the textual modality, recent research
+has shown that tasks from different modalities can also be effectively
+encapsulated within the NTP framework, transforming the multimodal information
+into tokens and predicting the next one given the context. This survey introduces
+a comprehensive taxonomy that unifies both understanding and generation within
+multimodal learning through the lens of NTP. The proposed taxonomy covers five
+key aspects: Multimodal tokenization, MMNTP model architectures, unified task
+representation, datasets \& evaluation, and open challenges. This new taxonomy
+aims to aid researchers in their exploration of multimodal intelligence. An
+associated GitHub repository collecting the latest papers and repos is
+available at https://github.com/LMM101/Awesome-Multimodal-Next-Token-Prediction
+
+
+ Large Language Models (LLMs) such as GPT-4.0 have shown significant promise
+in addressing the semantic complexities of regulatory documents, particularly
+in detecting inconsistencies and contradictions. This study evaluates GPT-4.0's
+ability to identify conflicts within regulatory requirements by analyzing a
+curated corpus with artificially injected ambiguities and contradictions,
+designed in collaboration with architects and compliance engineers. Using
+metrics such as precision, recall, and F1 score, the experiment demonstrates
+GPT-4.0's effectiveness in detecting inconsistencies, with findings validated
+by human experts. The results highlight the potential of LLMs to enhance
+regulatory compliance processes, though further testing with larger datasets
+and domain-specific fine-tuning is needed to maximize accuracy and practical
+applicability. Future work will explore automated conflict resolution and
+real-world implementation through pilot projects with industry partners.
+
+
+
+ comment: accepted for presentation at Georg Nemetschek Institute Symposium &
+ Expo on Artificial Intelligence for the Built World - Munich, Germany. 12
+ Sept 2024
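The reported metrics follow the standard definitions and can be computed directly from raw counts; the counts in this sketch are hypothetical, not taken from the study.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false
    negative counts of detected conflicts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical run: 8 injected conflicts found, 2 spurious flags, 2 missed.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```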
+
+
+
+
+
+
+ ☆ GliLem: Leveraging GliNER for Contextualized Lemmatization in Estonian
+
+
+ We present GliLem -- a novel hybrid lemmatization system for Estonian that
+enhances the highly accurate rule-based morphological analyzer Vabamorf with an
+external disambiguation module based on GliNER -- an open vocabulary NER model
+that is able to match text spans with text labels in natural language. We
+leverage the flexibility of a pre-trained GliNER model to improve the
+lemmatization accuracy of Vabamorf by 10\% compared to its original
+disambiguation module and achieve an improvement over the token
+classification-based baseline. To measure the impact of improvements in
+lemmatization accuracy on the information retrieval downstream task, we first
+created an information retrieval dataset for Estonian by automatically
+translating the DBpedia-Entity dataset from English. We benchmark several token
+normalization approaches, including lemmatization, on the created dataset using
+the BM25 algorithm. We observe a substantial improvement in IR metrics when
+using lemmatization over simplistic stemming. The benefits of improving lemma
+disambiguation accuracy manifest in small but consistent improvement in the IR
+recall measure, especially in the setting of high k.
+
+
+
+ comment: Accepted to NoDaLiDa/Baltic-HLT 2025
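The retrieval benchmark pairs token normalization with BM25 scoring; a miniature Okapi BM25 sketch is below, where the documents and query are invented and shown after lemmatization (so the lemma "tower" matches regardless of surface form).

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 over pre-normalized (e.g. lemmatized) token lists."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))   # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores

docs = [["tall", "tower", "tower", "estonia"],
        ["river", "bank", "flood"],
        ["tower", "bridge", "london"]]
scores = bm25_scores(["tower"], docs)
best = max(range(len(docs)), key=scores.__getitem__)  # doc 0 wins
```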
+
+
+
+
+
+
+ ☆ Controlling Out-of-Domain Gaps in LLMs for Genre Classification and
+ Generated Text Detection
+
+
+ This study demonstrates that the modern generation of Large Language Models
+(LLMs, such as GPT-4) suffers from the same out-of-domain (OOD) performance gap
+observed in prior research on pre-trained Language Models (PLMs, such as BERT).
+We demonstrate this across two non-topical classification tasks: 1) genre
+classification and 2) generated text detection. Our results show that when
+demonstration examples for In-Context Learning (ICL) come from one domain
+(e.g., travel) and the system is tested on another domain (e.g., history),
+classification performance declines significantly.
+ To address this, we introduce a method that controls which predictive
+indicators are used and which are excluded during classification. For the two
+tasks studied here, this ensures that topical features are omitted, while the
+model is guided to focus on stylistic rather than content-based attributes.
+This approach reduces the OOD gap by up to 20 percentage points in a few-shot
+setup. Straightforward Chain-of-Thought (CoT) methods, used as the baseline,
+prove insufficient, while our approach consistently enhances domain transfer
+performance.
+
+
+
+ comment: The 31st International Conference on Computational Linguistics
+
+
+
+
+
+
+ ☆ Towards Neural No-Resource Language Translation: A Comparative
+ Evaluation of Approaches
+
+
+ No-resource languages - those with minimal or no digital representation -
+pose unique challenges for machine translation (MT). Unlike low-resource
+languages, which rely on limited but existent corpora, no-resource languages
+often have fewer than 100 sentences available for training. This work explores
+the problem of no-resource translation through three distinct workflows:
+fine-tuning of translation-specific models, in-context learning with large
+language models (LLMs) using chain-of-reasoning prompting, and direct prompting
+without reasoning. Using Owens Valley Paiute as a case study, we demonstrate
+that no-resource translation demands fundamentally different approaches from
+low-resource scenarios, as traditional machine translation approaches, such as
+those that work for low-resource languages, fail. Empirical results reveal
+that the in-context learning
+capabilities of general-purpose large language models enable no-resource
+language translation that outperforms low-resource translation approaches and
+rivals human translations (BLEU 0.45-0.6); specifically, chain-of-reasoning
+prompting outperforms other methods for larger corpora, while direct prompting
+exhibits advantages in smaller datasets. As these approaches are
+language-agnostic, they have potential to be generalized to translation tasks
+from a wide variety of no-resource languages without expert input. These
+findings establish no-resource translation as a distinct paradigm requiring
+innovative solutions, providing practical and theoretical insights for language
+preservation.
+
+
+
+
+
+
+
+ ☆ Counterfactual Samples Constructing and Training for Commonsense
+ Statements Estimation
+
+
+ Plausibility Estimation (PE) plays a crucial role in enabling language
+models to objectively comprehend the real world. While large language models
+(LLMs) demonstrate remarkable capabilities in PE tasks, they sometimes produce
+trivial commonsense errors due to the complexity of commonsense knowledge. They
+lack two key traits of an ideal PE model: a) Language-explainable: relying on
+critical word segments for decisions, and b) Commonsense-sensitive: detecting
+subtle linguistic variations in commonsense. To address these issues, we
+propose a novel model-agnostic method, referred to as Commonsense
+Counterfactual Samples Generating (CCSG). By training PE models with CCSG, we
+encourage them to focus on critical words, thereby enhancing both their
+language-explainable and commonsense-sensitive capabilities. Specifically, CCSG
+generates counterfactual samples by strategically replacing key words and
+introducing low-level dropout within sentences. These counterfactual samples
+are then incorporated into a sentence-level contrastive training framework to
+further enhance the model's learning process. Experimental results across nine
+diverse datasets demonstrate the effectiveness of CCSG in addressing
+commonsense reasoning challenges, with our CCSG method yielding a 3.07%
+improvement over SOTA methods.
+
+
+
+ comment: 14 pages, 4 figures
+
+
+
+
+
+
+ ☆ The Impact of Prompt Programming on Function-Level Code Generation
+
+
+
+
+
+
+
+
+ Ranim Khojah, Francisco Gomes de Oliveira Neto, Mazen Mohamad, Philipp Leitner
+
+
+ Large Language Models (LLMs) are increasingly used by software engineers for
+code generation. However, limitations of LLMs such as irrelevant or incorrect
+code have highlighted the need for prompt programming (or prompt engineering)
+where engineers apply specific prompt techniques (e.g., chain-of-thought or
+input-output examples) to improve the generated code. Despite this, the impact
+of different prompt techniques -- and their combinations -- on code generation
+remains underexplored. In this study, we introduce CodePromptEval, a dataset of
+7072 prompts designed to evaluate five prompt techniques (few-shot, persona,
+chain-of-thought, function signature, list of packages) and their effect on the
+correctness, similarity, and quality of complete functions generated by three
+LLMs (GPT-4o, Llama3, and Mistral). Our findings show that while certain prompt
+techniques significantly influence the generated code, combining multiple
+techniques does not necessarily improve the outcome. Additionally, we observed
+a trade-off between correctness and quality when using prompt techniques. Our
+dataset and replication package enable future research on improving
+LLM-generated code and evaluating new prompt techniques.
+
+
+
+ comment: CodePromptEval dataset and replication package on GitHub:
+ https://github.com/icetlab/CodePromptEval
+
+
+
+
+
+
+ ☆ SAFE-MEME: Structured Reasoning Framework for Robust Hate Speech
+ Detection in Memes
+
+
+ Memes act as cryptic tools for sharing sensitive ideas, often requiring
+contextual knowledge to interpret. This makes moderating multimodal memes
+challenging, as existing works either lack high-quality datasets on nuanced
+hate categories or rely on low-quality social media visuals. Here, we curate
+two novel multimodal hate speech datasets, MHS and MHS-Con, that capture
+fine-grained hateful abstractions in regular and confounding scenarios,
+respectively. We benchmark these datasets against several competing baselines.
+Furthermore, we introduce SAFE-MEME (Structured reAsoning FramEwork), a novel
+multimodal Chain-of-Thought-based framework employing Q&A-style reasoning
+(SAFE-MEME-QA) and hierarchical categorization (SAFE-MEME-H) to enable robust
+hate speech detection in memes. SAFE-MEME-QA outperforms existing baselines,
+achieving an average improvement of approximately 5% and 4% on MHS and MHS-Con,
+respectively. In comparison, SAFE-MEME-H achieves an average improvement of 6%
+in MHS while outperforming only multimodal baselines in MHS-Con. We show that
+fine-tuning a single-layer adapter within SAFE-MEME-H outperforms fully
+fine-tuned models in regular fine-grained hateful meme detection. However, full
+fine-tuning with a Q&A setup is more effective for handling
+confounding cases. We also systematically examine the error cases, offering
+valuable insights into the robustness and limitations of the proposed
+structured reasoning framework for analyzing hateful memes.
+
+
+
+ comment: 28 pages, 15 figures, 6 tables
+
+
+
+
+
+
+ ☆ ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video
+ Understanding
+
+
+
+
+
+
+
+
+ Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, Liqiang Nie
+
+
+ Video Large Language Models (VideoLLMs) have achieved remarkable progress in
+video understanding. However, existing VideoLLMs often inherit the limitations
+of their backbone LLMs in handling long sequences, leading to challenges for
+long video understanding. Common solutions either simply uniformly sample
+videos' frames or compress visual tokens, which focus primarily on low-level
+temporal visual redundancy, overlooking high-level knowledge redundancy. This
+limits the achievable compression rate with minimal loss. To this end, we
+introduce a training-free method, $\textbf{ReTaKe}$, containing two novel
+modules, DPSelect and PivotKV, to jointly model and reduce both temporal visual
+redundancy and knowledge redundancy for long video understanding. Specifically,
+DPSelect identifies keyframes with local maximum peak distance based on their
+visual features, which are closely aligned with human video perception. PivotKV
+employs the obtained keyframes as pivots and conducts KV-Cache compression for
+the non-pivot tokens with low attention scores, which are derived from the
+learned prior knowledge of LLMs. Experiments on benchmarks VideoMME, MLVU, and
+LVBench show that ReTaKe can support 4x longer video sequences with minimal
+performance loss (<1%) and outperform all similar-size VideoLLMs by 3%-5%,
+even surpassing or matching much larger ones. Our code is available at
+https://github.com/SCZwangxiao/video-ReTaKe
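One plain reading of DPSelect's keyframe criterion, picking frames whose feature distance to the previous frame is a local maximum, can be sketched as follows (an illustrative interpretation of the description above, not the released implementation):

```python
def select_keyframes(features):
    """features: list of per-frame feature vectors. Returns the indices of
    frames whose distance to the previous frame is a strict local maximum,
    i.e. points where the visual content changes most abruptly."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # d[i] is the distance between frame i+1 and frame i.
    d = [dist(features[i], features[i - 1]) for i in range(1, len(features))]
    keep = [0]  # always keep the first frame as an anchor
    for i in range(1, len(d) - 1):
        if d[i] > d[i - 1] and d[i] > d[i + 1]:
            keep.append(i + 1)  # frame index is offset by one vs. d
    return keep

# Toy 1-D "features": a sharp jump between frames 1 and 2 marks a scene change.
feats = [[0.0], [0.1], [2.0], [2.1], [2.2]]
keyframes = select_keyframes(feats)
```

The selected keyframes would then serve as the pivots for the KV-cache compression step described above.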
+
+
+
+
+
+
+
+ ☆ Cut the Deadwood Out: Post-Training Model Purification with Selective
+ Module Substitution
+
+
+ The success of DNNs often depends on training with large-scale datasets, but
+building such datasets is both expensive and challenging. Consequently, public
+datasets from open-source platforms like HuggingFace have become popular,
+posing significant risks of data poisoning attacks. Existing backdoor defenses
+in NLP primarily focus on identifying and removing poisoned samples; however,
+purifying a backdoored model with these sample-cleaning approaches typically
+requires expensive retraining. Therefore, we propose Greedy Module Substitution
+(GMS), which identifies and substitutes "deadwood" modules (i.e., components
+critical to backdoor pathways) in a backdoored model to purify it. Our method
+relaxes the common dependency of prior model purification methods on clean
+datasets or clean auxiliary models. When applied to RoBERTa-large under
+backdoor attacks, GMS demonstrates strong effectiveness across various
+settings, particularly against widely recognized challenging attacks like LWS,
+achieving a post-purification attack success rate (ASR) of 9.7% on SST-2
+compared to 58.8% for the best baseline approach.
+
+
+
+ comment: preprint
+
+
+
+
+
+
+ ☆ Utilizing Multimodal Data for Edge Case Robust Call-sign Recognition and
+ Understanding
+
+
+ Operational machine-learning based assistant systems must be robust in a wide
+range of scenarios. This holds especially true for the air-traffic control (ATC)
+domain. The robustness of an architecture is particularly evident in edge
+cases, such as high word error rate (WER) transcripts resulting from noisy ATC
+recordings or partial transcripts due to clipped recordings. To increase the
+edge-case robustness of call-sign recognition and understanding (CRU), a core
+task in ATC speech processing, we propose the multimodal call-sign-command
+recovery model (CCR). The CCR architecture leads to an increase in edge-case
+performance of up to 15%. We demonstrate this on our second proposed
+architecture, CallSBERT: a CRU model that has fewer parameters, can be
+fine-tuned noticeably faster, and is more robust during fine-tuning than the
+state of the art for CRU. Furthermore, we demonstrate that optimizing for edge
+cases leads to a significantly higher accuracy across a wide operational range.
+
+
+
+
+
+
+
+ ☆ Enhancing Entertainment Translation for Indian Languages using Adaptive
+ Context, Style and LLMs AAAI'25
+
+
+ We address the challenging task of neural machine translation (NMT) in the
+entertainment domain, where the objective is to automatically translate a given
+dialogue from a source language to a target language. This task has
+various applications, particularly in automatic dubbing, subtitling, and other
+content localization tasks, enabling source content to reach a wider audience.
+Traditional NMT systems typically translate individual sentences in isolation,
+without facilitating knowledge transfer of crucial elements such as the context
+and style from previously encountered sentences. In this work, we emphasize the
+significance of these fundamental aspects in producing pertinent and
+captivating translations. We demonstrate their significance through several
+examples and propose a novel framework for entertainment translation, which, to
+our knowledge, is the first of its kind. Furthermore, we introduce an algorithm
+to estimate the context and style of the current session and use these
+estimations to generate a prompt that guides a Large Language Model (LLM) to
+generate high-quality translations. Our method is both language and
+LLM-agnostic, making it a general-purpose tool. We demonstrate the
+effectiveness of our algorithm through various numerical studies and observe
+significant improvement in the COMET scores over various state-of-the-art LLMs.
+Moreover, our proposed method consistently outperforms baseline LLMs in terms
+of win-ratio.
+
+
+
+ comment: Accepted to AAAI'25
+
+
+
+
+
+
+ ☆ Integrating Natural Language Processing Techniques of Text Mining Into
+ Financial System: Applications and Limitations
+
+
+ The financial sector, a pivotal force in economic development, increasingly
+uses intelligent technologies such as natural language processing to enhance
+data processing and insight extraction. Through a review covering 2018-2023,
+this paper explores the use of text mining and natural language processing
+techniques in various components of the financial system, including asset
+pricing, corporate finance, derivatives, risk management, and public finance,
+and highlights the specific problems that need to be addressed. We find that
+most of the reviewed work combines probabilistic with vector-space models, and
+text data with numerical data. The most used information-processing technique
+is classification, and the most used algorithms include long short-term memory
+and bidirectional encoder models. The review notes that new task-specific
+algorithms are being developed and that the literature focuses mainly on the
+asset-pricing component of the financial system. It also proposes a path, from
+an engineering perspective, for researchers who need to analyze financial
+text. Challenges in text mining such as data quality, context adaptation, and
+model interpretability need to be solved in order to integrate advanced
+natural language processing models and techniques into financial analysis and
+prediction. Keywords: Financial System (FS), Natural Language Processing
+(NLP), Software and Text Engineering, Probabilistic, Vector-Space, Models,
+Techniques, Text Data, Financial Analysis.
+
+
+
+ comment: 6 pages, 5 figures, 1 table
+
+
+
+
+
+
+ ☆ Comparative Performance of Advanced NLP Models and LLMs in Multilingual
+ Geo-Entity Detection
+
+
+ The integration of advanced Natural Language Processing (NLP) methodologies
+and Large Language Models (LLMs) has significantly enhanced the extraction and
+analysis of geospatial data from multilingual texts, impacting sectors such as
+national and international security. This paper presents a comprehensive
+evaluation of leading NLP models -- SpaCy, XLM-RoBERTa, mLUKE, GeoLM -- and
+LLMs, specifically OpenAI's GPT 3.5 and GPT 4, within the context of
+multilingual geo-entity detection. Utilizing datasets from Telegram channels in
+English, Russian, and Arabic, we examine the performance of these models
+through metrics such as accuracy, precision, recall, and F1 scores, to assess
+their effectiveness in accurately identifying geospatial references. The
+analysis exposes each model's distinct advantages and challenges, underscoring
+the complexities involved in achieving precise geo-entity identification across
+varied linguistic landscapes. The conclusions drawn from this experiment aim to
+direct the enhancement and creation of more advanced and inclusive NLP tools,
+thus advancing the field of geospatial analysis and its application to global
+security.
+
+
+ Machine unlearning in the domain of large language models (LLMs) has
+attracted great attention recently, which aims to effectively eliminate
+undesirable behaviors from LLMs without full retraining from scratch. In this
+paper, we explore the Gradient Ascent (GA) approach in LLM unlearning, which is
+a proactive way to decrease the prediction probability of the model on the
+target data in order to remove their influence. We analyze two challenges that
+render the process impractical: gradient explosion and catastrophic forgetting.
+To address these issues, we propose Multi-Objective Large Language Model
+Unlearning (MOLLM) algorithm. We first formulate LLM unlearning as a
+multi-objective optimization problem, in which the cross-entropy loss is
+modified to the unlearning version to overcome the gradient explosion issue. A
+common descent update direction is then calculated, which enables the model to
+forget the target data while preserving the utility of the LLM. Our empirical
+results verify that MOLLM outperforms SOTA GA-based LLM unlearning methods
+in terms of unlearning effect and model utility preservation.
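The "common descent update direction" for the two objectives (forgetting the target data vs. preserving utility) can be sketched with the standard two-task min-norm formula from multi-objective optimization (a generic sketch, not necessarily the exact MOLLM update rule):

```python
def common_descent_direction(g_forget, g_retain):
    """Min-norm convex combination d = a*g_forget + (1-a)*g_retain of two
    gradient vectors (plain lists). The two-task closed form clips
    a = <g_retain - g_forget, g_retain> / ||g_retain - g_forget||^2 to [0, 1];
    the resulting d has a non-negative inner product with both gradients, so a
    step along it does not increase either objective to first order."""
    diff = [r - f for f, r in zip(g_forget, g_retain)]
    denom = sum(x * x for x in diff)
    if denom == 0.0:                     # identical gradients: any weight works
        alpha = 0.5
    else:
        alpha = sum(x * r for x, r in zip(diff, g_retain)) / denom
        alpha = min(1.0, max(0.0, alpha))
    return [alpha * f + (1.0 - alpha) * r for f, r in zip(g_forget, g_retain)]

# Orthogonal gradients: the compromise direction serves both objectives.
d = common_descent_direction([1.0, 0.0], [0.0, 1.0])
```

When one gradient dominates the other (e.g. points the same way but is longer), the clipped weight collapses onto the shorter gradient, which is exactly the min-norm element of the convex hull.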
+
+
+
+
+
+
+
+ ☆ A Multidisciplinary Approach to Telegram Data Analysis
+
+
+ This paper presents a multidisciplinary approach to analyzing data from
+Telegram for early warning information regarding cyber threats. With the
+proliferation of hacktivist groups utilizing Telegram to disseminate
+information regarding future cyberattacks or to boast about successful ones,
+the need for effective data analysis methods is paramount. The primary
+challenge lies in the vast number of channels and the overwhelming volume of
+data, necessitating advanced techniques for discerning pertinent risks amidst
+the noise. To address this challenge, we employ a combination of neural network
+architectures and traditional machine learning algorithms. These methods are
+utilized to classify and identify potential cyber threats within the Telegram
+data. Additionally, sentiment analysis and entity recognition techniques are
+incorporated to provide deeper insights into the nature and context of the
+communicated information. The study evaluates the effectiveness of each method
+in detecting and categorizing cyber threats, comparing their performance and
+identifying areas for improvement. By leveraging these diverse analytical
+tools, we aim to enhance early warning systems for cyber threats, enabling more
+proactive responses to potential security breaches. This research contributes
+to the ongoing efforts to bolster cybersecurity measures in an increasingly
+interconnected digital landscape.
+
+
+
+
+
+
+
+
+ Jia Liu, Yue Wang, Zhiqi Lin, Min Chen, Yixue Hao, Long Hu
+
+
+ Large language model fine-tuning techniques typically depend on extensive
+labeled data, external guidance, and feedback, such as human alignment, scalar
+rewards, and demonstrations. However, in practical applications, the scarcity of
+specific knowledge poses unprecedented challenges to existing fine-tuning
+techniques. In this paper, focusing on fine-tuning tasks in specific domains
+with limited data, we introduce Natural Language Fine-Tuning (NLFT), which
+for the first time utilizes natural language itself for fine-tuning. By leveraging the
+strong language comprehension capability of the target LM, NLFT attaches the
+guidance of natural language to the token-level outputs. Then, saliency tokens
+are identified with calculated probabilities. Since linguistic information is
+effectively utilized in NLFT, our proposed method significantly reduces
+training costs. It markedly enhances training efficiency, comprehensively
+outperforming reinforcement fine-tuning algorithms in accuracy, time-saving,
+and resource conservation. Additionally, on the macro level, NLFT can be viewed
+as a token-level fine-grained optimization of SFT, thereby efficiently
+replacing the SFT process without the need for warm-up (as opposed to ReFT
+requiring multiple rounds of warm-up with SFT). Compared to SFT, NLFT does not
+increase the algorithmic complexity, maintaining O(n). Extensive experiments on
+the GSM8K dataset demonstrate that NLFT, with only 50 data instances, achieves
+an accuracy increase that exceeds SFT by 219%. Compared to ReFT, the time
+complexity and space complexity of NLFT are reduced by 78.27% and 92.24%,
+respectively. NLFT thus paves the way for deploying various innovative LLM
+fine-tuning applications when resources are limited at network edges.
+ Our code has been released at https://github.com/Julia-LiuJ/NLFT.
+
+
+
+
+
+
+
+ ☆ LLM2: Let Large Language Models Harness System 2 Reasoning
+
+
+ Large language models (LLMs) have exhibited impressive capabilities across a
+myriad of tasks, yet they occasionally yield undesirable outputs. We posit that
+these limitations are rooted in the foundational autoregressive architecture of
+LLMs, which inherently lacks mechanisms for differentiating between desirable
+and undesirable results. Drawing inspiration from the dual-process theory of
+human cognition, we introduce LLM2, a novel framework that combines an LLM
+(System 1) with a process-based verifier (System 2). Within LLM2, the LLM is
+responsible for generating plausible candidates, while the verifier provides
+timely process-based feedback to distinguish desirable and undesirable outputs.
+The verifier is trained with a pairwise comparison loss on synthetic
+process-supervision data generated through our token quality exploration
+strategy. Empirical results on mathematical reasoning benchmarks substantiate
+the efficacy of LLM2, exemplified by an accuracy enhancement from 50.3 to 57.8
+(+7.5) for Llama3-1B on GSM8K. Furthermore, when combined with
+self-consistency, LLM2 achieves additional improvements, boosting major@20
+accuracy from 56.2 to 70.2 (+14.0).
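The verifier's pairwise comparison loss is, in its common form, a logistic loss on the score gap between a desirable and an undesirable candidate (a generic sketch; LLM2 may add its own weighting or margins):

```python
import math

def pairwise_loss(score_good, score_bad):
    """-log sigmoid(s+ - s-): small when the verifier scores the desirable
    candidate above the undesirable one, large when the order is inverted."""
    return math.log(1.0 + math.exp(-(score_good - score_bad)))

# The loss shrinks as the margin grows and penalizes inversions.
well_ordered = pairwise_loss(2.0, -1.0)
inverted = pairwise_loss(-1.0, 2.0)
```

Training on such pairs is what lets the System 2 verifier separate desirable from undesirable candidates without needing absolute quality labels.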
+
+
+
+
+
+
+
+ ☆ Enhancing Code LLMs with Reinforcement Learning in Code Generation
+
+
+ With the rapid evolution of large language models (LLMs), reinforcement
+learning (RL) has emerged as a pivotal technique for code generation and
+optimization in various domains. This paper presents a systematic survey of the
+application of RL in code optimization and generation, highlighting its role in
+enhancing compiler optimization, resource allocation, and the development of
+frameworks and tools. Subsequent sections first delve into the intricate
+processes of compiler optimization, where RL algorithms are leveraged to
+improve efficiency and resource utilization. The discussion then progresses to
+the function of RL in resource allocation, emphasizing register allocation and
+system optimization. We also explore the burgeoning role of frameworks and
+tools in code generation, examining how RL can be integrated to bolster their
+capabilities. This survey aims to serve as a comprehensive resource for
+researchers and practitioners interested in harnessing the power of RL to
+advance code generation and optimization techniques.
+
+
+
+
+
+
+
+ ☆ HindiLLM: Large Language Model for Hindi
+
+
+ Advancements in Large Language Models (LLMs) have helped in solving
+several problems related to language processing. Most research has focused on
+English only, because of its popularity and abundance on the internet.
+However, a high-performance language model for Hindi and other
+Indic languages is lacking in the literature. In this work, we have pre-trained
+two autoregressive LLM models for the Hindi language, namely HindiLLM-Small and
+HindiLLM-Medium. We use a two-step process comprising unsupervised pre-training
+and supervised fine-tuning. First, we create a large and high-quality text
+corpus for unsupervised pre-training. Next, we train a Byte-Pair Encoding
+tokenizer, named HindiLLM, on the pre-training text data. We then perform
+training on the unlabeled data, known as the pre-training step, to get the
+HindiLLM base models. Furthermore, we perform fine-tuning of the HindiLLM base
+models for different tasks like sentiment analysis, text classification,
+natural language inference, and multiple-choice question answering on popular
+labeled datasets to measure real-world performance. The evaluation shows that
+the HindiLLM-based fine-tuned models outperform several baselines in most
+language-related tasks.
+
+
+
+
+
+
+
+ ☆ Understanding the Impact of Confidence in Retrieval Augmented
+ Generation: A Case Study in the Medical Domain
+
+
+ Retrieval Augmented Generation (RAG) complements the knowledge of Large
+Language Models (LLMs) by leveraging external information to enhance response
+accuracy for queries. This approach is widely applied in several fields,
+taking advantage of its ability to inject the most up-to-date information, and
+researchers are focusing on understanding and improving this aspect to unlock
+the full potential of RAG in high-stakes applications. However, despite
+the potential of RAG to address these needs, the mechanisms behind the
+confidence levels of its outputs remain underexplored, although the confidence
+of information is very critical in some domains, such as finance, healthcare,
+and medicine. Our study focuses on the impact of RAG on confidence within the
+medical domain under various configurations and models. We evaluate confidence
+by treating the model's predicted probability as its output and calculating
+Expected Calibration Error (ECE) and Adaptive Calibration Error (ACE) scores
+based on the probabilities and accuracy. In addition, we analyze whether the
+order of retrieved documents within prompts calibrates the confidence. Our
+findings reveal large variation in confidence and accuracy depending on the
+model, settings, and the format of input prompts. These results underscore the
+necessity of optimizing configurations based on the specific model and
+conditions.
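The ECE computation mentioned above can be sketched with standard equal-width binning (10 bins is a common default; the paper's exact binning scheme is an assumption here):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum_b (|B_b|/N) * |acc(B_b) - conf(B_b)| over equal-width
    confidence bins; correct is a list of 0/1 outcomes."""
    N = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 falls in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(c for c, _ in b) / len(b)   # mean predicted probability
        acc = sum(y for _, y in b) / len(b)    # empirical accuracy
        ece += len(b) / N * abs(acc - conf)
    return ece
```

A model that answers with 80% confidence and is right 80% of the time scores zero; systematic overconfidence (e.g. always certain but right only half the time) inflates the score.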
+
+
+
+
+
+
+
+ ♻ ☆ Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era
+ of Foundation Models
+
+
+ Vision-and-Language Navigation (VLN) has gained increasing attention over
+recent years and many approaches have emerged to advance their development. The
+remarkable achievements of foundation models have shaped the challenges and
+proposed methods for VLN research. In this survey, we provide a top-down review
+that adopts a principled framework for embodied planning and reasoning, and
+emphasizes the current methods and future opportunities leveraging foundation
+models to address VLN challenges. We hope our in-depth discussions provide
+valuable resources and insights: on one hand, to chart the progress and explore
+opportunities and potential roles for foundation models in this field, and on
+the other, to present the different challenges and solutions in VLN to
+foundation model researchers.
+
+
+
+ comment: Authors contributed equally to this work, and supervisors contributed
+ equal advising to this work; GitHub repository:
+ https://github.com/zhangyuejoslin/VLN-Survey-with-Foundation-Models
+
+
+
+
+
+
+ ♻ ☆ Attention Mechanism and Context Modeling System for Text Mining Machine
+ Translation
+
+
+ This paper advances a novel architectural schema anchored upon the
+Transformer paradigm and innovatively amalgamates the K-means categorization
+algorithm to augment the contextual apprehension capabilities of the schema.
+The transformer model performs well in machine translation tasks due to its
+parallel computing power and multi-head attention mechanism. However, it may
+encounter contextual ambiguity or ignore local features when dealing with
+highly complex language structures. To circumvent this constraint, this
+exposition incorporates the K-Means algorithm, which is used to stratify the
+lexis and idioms of the input textual matter, thereby facilitating superior
+identification and preservation of the local structure and contextual
+intelligence of the language. The advantage of this combination is that K-Means
+can automatically discover the topic or concept regions in the text, which may
+be directly related to translation quality. Consequently, the schema contrived
+herein enlists K-Means as a preparatory phase antecedent to the Transformer and
+recalibrates the multi-head attention weights to assist in the discrimination
+of lexis and idioms bearing analogous semantics or functionalities. This
+ensures the schema accords heightened regard to the contextual intelligence
+embodied by these clusters during the training phase, rather than merely
+focusing on locational intelligence.
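The K-means preparatory phase described above can be illustrated with a from-scratch Lloyd's-iteration sketch on toy 2-D "embeddings" (everything here is illustrative; the actual schema clusters learned lexical representations before the Transformer):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: assign each point to the nearest centroid,
    then recompute each centroid as the mean of its members."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(cs) / len(members) for cs in zip(*members)]
    return centroids, labels

# Two well-separated "topic regions" are recovered as two clusters.
pts = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
cents, labs = kmeans(pts, 2)
```

The cluster assignments would then inform the recalibrated attention weights, so that tokens in the same topic region are treated as semantically related.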
+
+
+
+
+
+
+
+ ♻ ☆ Deep Neural Networks and Brain Alignment: Brain Encoding and Decoding
+ (Survey)
+
+
+ Can artificial intelligence unlock the secrets of the human brain? How do the
+inner mechanisms of deep learning models relate to our neural circuits? Is it
+possible to enhance AI by tapping into the power of brain recordings? These
+captivating questions lie at the heart of an emerging field at the intersection
+of neuroscience and artificial intelligence. Our survey dives into this
+exciting domain, focusing on human brain recording studies and cutting-edge
+cognitive neuroscience datasets that capture brain activity during natural
+language processing, visual perception, and auditory experiences. We explore
+two fundamental approaches: encoding models, which attempt to generate brain
+activity patterns from sensory inputs; and decoding models, which aim to
+reconstruct our thoughts and perceptions from neural signals. These techniques
+not only promise breakthroughs in neurological diagnostics and brain-computer
+interfaces but also offer a window into the very nature of cognition. In this
+survey, we first discuss popular representations of language, vision, and
+speech stimuli, and present a summary of neuroscience datasets. We then review
+how the recent advances in deep learning transformed this field, by
+investigating the popular deep learning based encoding and decoding
+architectures, noting their benefits and limitations across different sensory
+modalities. From text to images, speech to videos, we investigate how these
+models capture the brain's response to our complex, multimodal world. While our
+primary focus is on human studies, we also highlight the crucial role of animal
+models in advancing our understanding of neural mechanisms. Throughout, we
+mention the ethical implications of these powerful technologies, addressing
+concerns about privacy and cognitive liberty. We conclude with a summary and
+discussion of future trends in this rapidly evolving field.
+
+
+ We show that existing evaluations for fake news detection based on
+conventional sources, such as claims on fact-checking websites, result in high
+accuracies over time for LLM-based detectors -- even after their knowledge
+cutoffs. This suggests that recent popular fake news from such sources can be
+easily detected due to pre-training and retrieval corpus contamination or
+increasingly salient shallow patterns. Instead, we argue that a proper fake
+news detection dataset should test a model's ability to reason factually about
+the current world by retrieving and reading related evidence. To this end, we
+develop a novel pipeline that leverages natural language feedback from a
+RAG-based detector to iteratively modify real-time news into deceptive fake
+news that challenges LLMs. Our iterative rewrite decreases the binary
+classification ROC-AUC by an absolute 17.5 percent for a strong RAG-based
+GPT-4o detector. Our experiments reveal the important role of RAG in both
+detecting and generating fake news, as retrieval-free LLM detectors are
+vulnerable to unseen events and adversarial attacks, while feedback from RAG
+detection helps discover more deceitful patterns in fake news.
+
+
+
+
+
+
+
+ ♻ ☆ GPT or BERT: why not both?
+
+
+
+
+
+
+
+
+ Lucas Georges Gabriel Charpentier, David Samuel
+
+
+ We present a simple way to merge masked language modeling with causal
+language modeling. This hybrid training objective results in a model that
+combines the strengths of both modeling paradigms within a single transformer
+stack: GPT-BERT can be transparently used like any standard causal or masked
+language model. We test the pretraining process that enables this flexible
+behavior on the BabyLM Challenge 2024. The results show that the hybrid
+pretraining outperforms masked-only or causal-only models. We openly release
+the models, training corpora and code.
+
+
+
+ comment: 22 pages; submission to the BabyLM Challenge 2024
+
+
+
+
+
+
+ ♻ ☆ Language Model Preference Evaluation with Multiple Weak Evaluators
+
+
+ Despite the remarkable success of Large Language Models (LLMs), evaluating
+their outputs' quality regarding *preference* remains a critical challenge.
+Existing works usually leverage a powerful LLM (e.g., GPT-4) as the judge for
+pairwise comparison of LLMs' outputs, yet such model-based evaluators are
+vulnerable to *conflicting preferences*, i.e., output A is better than B, B
+than C, but C than A, causing contradictory results. To improve model-based
+preference evaluation, we introduce GED (Preference Graph Ensemble and
+Denoise), a novel approach that leverages multiple model-based evaluators to
+construct preference graphs, and then ensemble and denoise these graphs for
+better, non-contradictory evaluation results. In particular, our method
+consists of two primary stages: aggregating evaluations into a unified graph
+and applying a denoising process to eliminate cyclic inconsistencies, ensuring
+a directed acyclic graph (DAG) structure. We provide theoretical guarantees for
+our framework, demonstrating its efficacy in recovering the ground truth
+preference structure. Extensive experiments across ten benchmark datasets show
+that GED outperforms baseline methods in model ranking, response selection, and
+model alignment tasks. Notably, GED combines weaker evaluators like Llama3-8B,
+Mistral-7B, and Qwen2-7B to surpass the performance of stronger evaluators like
+Qwen2-72B, highlighting its ability to enhance evaluation reliability and
+improve model performance.
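The two-stage procedure above (aggregate pairwise judgments into a weighted preference graph, then denoise it into a DAG) can be sketched with a minimal stdlib-only illustration; the function names and the greedy weight-ordered cycle-breaking are our assumptions, not necessarily the exact GED algorithm:

```python
from collections import defaultdict

def aggregate(judgments):
    """Aggregate pairwise judgments (winner, loser) from several
    evaluators into one weighted preference graph, keeping only the
    net-majority direction for each pair."""
    weight = defaultdict(int)
    for winner, loser in judgments:
        weight[(winner, loser)] += 1
    graph = {}
    for (a, b), w in weight.items():
        if (b, a) in weight and weight[(b, a)] >= w:
            continue
        graph[(a, b)] = w - weight.get((b, a), 0)
    return graph

def creates_cycle(edges, src, dst):
    """DFS check: would adding the edge src -> dst close a cycle?"""
    stack, seen = [dst], set()
    while stack:
        node = stack.pop()
        if node == src:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(b for (a, b) in edges if a == node)
    return False

def denoise(graph):
    """Greedily keep edges in decreasing weight order, skipping any edge
    that would introduce a cycle, so the result is a DAG."""
    dag = set()
    for (a, b), _w in sorted(graph.items(), key=lambda kv: -kv[1]):
        if not creates_cycle(dag, a, b):
            dag.add((a, b))
    return dag
```

For example, with a cyclic majority A>B, B>C, C>A where C>A has the least evaluator support, the denoising step drops exactly that edge.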
+
+
+ The use of large language models (LLMs) as judges, particularly in preference
+comparisons, has become widespread, but this reveals a notable bias towards
+longer responses, undermining the reliability of such evaluations. To better
+understand such bias, we propose to decompose the preference evaluation metric,
+specifically the win rate, into two key components: desirability and
+information mass, where the former is length-independent and related to
+trustworthiness such as correctness, toxicity, and consistency, and the latter
+is length-dependent and represents the amount of information in the response.
+We empirically demonstrate the decomposition through controlled experiments
+and find that response length impacts evaluations by influencing information
+mass. To derive a reliable evaluation metric that assesses content quality
+without being confounded by response length, we propose AdapAlpaca, a simple
+yet effective adjustment to win rate measurement. Specifically, AdapAlpaca
+ensures a fair comparison of response quality by aligning the lengths of
+reference and test model responses under equivalent length intervals.
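The length-interval adjustment above can be illustrated with a small sketch; the bucketing scheme and the word-count length proxy are our simplifications, not AdapAlpaca's exact definition:

```python
def length_matched_win_rate(pairs, judge, bucket=50):
    """Win rate of model A over model B, counting only pairs whose
    responses fall into the same length interval (width `bucket` words),
    so response length cannot confound the comparison.
    `pairs` is a list of (response_a, response_b); `judge` returns True
    when the first argument is preferred."""
    wins = total = 0
    for a, b in pairs:
        if len(a.split()) // bucket != len(b.split()) // bucket:
            continue  # skip length-mismatched pairs
        total += 1
        wins += judge(a, b)
    return wins / total if total else float("nan")
```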
+
+
+
+
+
+
+
+ ♻ ☆ RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness
+
+
+
+
+
+
+
+
+ Tianyu Yu, Haoye Zhang, Qiming Li, Qixin Xu, Yuan Yao, Da Chen, Xiaoman Lu, Ganqu Cui, Yunkai Dang, Taiwen He, Xiaocheng Feng, Jun Song, Bo Zheng, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun
+
+
+ Traditional feedback learning for hallucination reduction relies on
+labor-intensive manual labeling or expensive proprietary models. This leaves
+the community without foundational knowledge about how to build high-quality
+feedback with open-source MLLMs. In this work, we introduce RLAIF-V, a novel
+framework that aligns MLLMs in a fully open-source paradigm. RLAIF-V maximally
+explores open-source MLLMs from two perspectives, including high-quality
+feedback data generation for preference learning and self-feedback guidance for
+inference-time scaling. Extensive experiments on six benchmarks in both
+automatic and human evaluation show that RLAIF-V substantially enhances the
+trustworthiness of models at both preference learning and inference time.
+RLAIF-V 7B reduces object hallucination by 80.7\% and overall hallucination by
+33.7\%. Remarkably, RLAIF-V 12B further reveals the self-alignment potential of
+open-source MLLMs, where the model can learn from feedback of itself to achieve
+super GPT-4V trustworthiness.
+
+
+ This paper introduces Chain of Translation Prompting (CoTR), a novel strategy
+designed to enhance the performance of language models in low-resource
+languages. CoTR restructures prompts to first translate the input context from
+a low-resource language into a higher-resource language, such as English. The
+specified task like generation, classification, or any other NLP function is
+then performed on the translated text, with the option to translate the output
+back to the original language if needed. All these steps are specified in a
+single prompt. We demonstrate the effectiveness of this method through a case
+study on the low-resource Indic language Marathi. The CoTR strategy is applied
+to various tasks, including sentiment analysis, hate speech classification,
+subject classification and text generation, and its efficacy is showcased by
+comparing it with regular prompting methods. Our results underscore the
+potential of translation-based prompting strategies to significantly improve
+multilingual LLM performance in low-resource languages, offering valuable
+insights for future research and applications. We specifically see the highest
+accuracy improvements with the hate speech detection task. The technique also
+has the potential to enhance the quality of synthetic data generation for
+underrepresented languages using LLMs.
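A minimal sketch of how such a single-prompt translate-then-solve pipeline might be assembled; the prompt wording below is hypothetical, not taken from the paper:

```python
def cotr_prompt(text, task, src="Marathi", pivot="English",
                translate_back=False):
    """Build one Chain-of-Translation prompt: translate the input from
    the low-resource language to a pivot language, perform the task on
    the translation, and optionally translate the answer back."""
    steps = [
        f"1. Translate the following {src} text to {pivot}.",
        f"2. {task} on the {pivot} translation.",
    ]
    if translate_back:
        steps.append(f"3. Translate your answer back to {src}.")
    return "\n".join(steps) + f"\n\nText: {text}"
```

The whole chain stays in one prompt, as the abstract describes, rather than issuing separate translation and task calls.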
+
+
+
+ comment: Accepted at PACLIC 38 (2024)
+
+
+
+
+
+
+ ♻ ☆ A Survey on Online User Aggression: Content Detection and Behavioral
+ Analysis on Social Media
+
+
+ The rise of social media platforms has led to an increase in cyber-aggressive
+behavior, encompassing a broad spectrum of hostile behavior, including
+cyberbullying, online harassment, and the dissemination of offensive and hate
+speech. These behaviors have been associated with significant societal
+consequences, ranging from online anonymity to real-world outcomes such as
+depression, suicidal tendencies, and, in some instances, offline violence.
+Recognizing the societal risks associated with unchecked aggressive content,
+this paper delves into the field of Aggression Content Detection and Behavioral
+Analysis of Aggressive Users, aiming to bridge the gap between disparate
+studies. In this paper, we analyzed the diversity of definitions and proposed a
+unified cyber-aggression definition. We examine the comprehensive process of
+Aggression Content Detection, spanning from dataset creation, feature selection
+and extraction, and detection algorithm development. Further, we review studies
+on Behavioral Analysis of Aggression that explore the influencing factors,
+consequences, and patterns associated with cyber-aggressive behavior. This
+systematic literature review is a cross-examination of content detection and
+behavioral analysis in the realm of cyber-aggression. The integrated
+investigation reveals the effectiveness of incorporating sociological insights
+into computational techniques for preventing cyber-aggressive behavior.
+Finally, the paper concludes by identifying research gaps and encouraging
+further progress in the unified domain of socio-computational aggressive
+behavior analysis.
+
+
+
+ comment: Accepted at ACM Computing Survey
+
+
+
+
+
+
+ ♻ ☆ Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data
+ Assessment and Selection for Instruction Tuning of Language Models
+
+
+
+
+
+
+
+
+ Yulei Qin, Yuncheng Yang, Pengcheng Guo, Gang Li, Hang Shao, Yuchen Shi, Zihan Xu, Yun Gu, Ke Li, Xing Sun
+
+
+ Instruction tuning plays a critical role in aligning large language models
+(LLMs) with human preference. Despite the vast amount of open instruction
+datasets, naively training a LLM on all existing instructions may not be
+optimal and practical. To pinpoint the most beneficial datapoints, data
+assessment and selection methods have been proposed in the fields of natural
+language processing (NLP) and deep learning. However, under the context of
+instruction tuning, there still exists a gap in knowledge on what kind of data
+evaluation metrics can be employed and how they can be integrated into the
+selection mechanism. To bridge this gap, we present a comprehensive review on
+existing literature of data assessment and selection especially for instruction
+tuning of LLMs. We systematically categorize all applicable methods into
+quality-based, diversity-based, and importance-based ones where a unified,
+fine-grained taxonomy is structured. For each category, representative methods
+are elaborated to describe the landscape of relevant research. In addition,
+comparison between the latest methods is conducted on their officially reported
+results to provide in-depth discussions on their limitations. Finally, we
+summarize the open challenges and propose promising avenues for future
+studies. All related contents are available at
+https://github.com/yuleiqin/fantastic-data-engineering.
+
+
+
+ comment: Accepted to TMLR with Survey Certificate, review, survey, 37 pages, 5
+ figures, 4 tables
+
+ This paper presents two novel theorems that address two open problems in
+stochastic Lindenmayer-system (L-system) inference, specifically focusing on
+the construction of an optimal stochastic L-system capable of generating a
+given sequence of strings. The first theorem delineates a method for crafting a
+stochastic L-system that has the maximum probability of a derivation producing
+a given sequence of words through a single derivation (noting that multiple
+derivations may generate the same sequence). Furthermore, the second theorem
+determines the stochastic L-systems with the highest probability of producing a
+given sequence of words with multiple possible derivations. From these, we
+introduce an algorithm to infer an optimal stochastic L-system from a given
+sequence. This algorithm incorporates advanced optimization techniques, such as
+interior point methods, to ensure the creation of a stochastic L-system that
+maximizes the probability of generating the given sequence (allowing for
+multiple derivations). This allows for the use of stochastic L-systems as a
+model for machine learning using only positive data for training.
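The quantity maximized in the first theorem, the probability of one specific derivation, is a product of the chosen production probabilities; a toy sketch (the encoding of rule choices as per-occurrence indices is ours):

```python
def derivation_probability(word, choices, rules):
    """Probability of one specific stochastic L-system derivation.
    `rules` maps a symbol to a list of (successor, probability) pairs;
    `choices` gives, per derivation step, the index of the production
    applied to each symbol occurrence in the current word."""
    prob = 1.0
    for step in choices:
        nxt = []
        for sym, idx in zip(word, step):
            succ, p = rules[sym][idx]
            nxt.append(succ)
            prob *= p
        word = "".join(nxt)
    return word, prob
```

For the second theorem's setting, the probability of producing a word sequence would instead sum this quantity over all derivations yielding that sequence.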
+
+
+
+ comment: 15 pages
+
+
+
+
+
+
+ ♻ ☆ Evolver: Chain-of-Evolution Prompting to Boost Large Multimodal Models
+ for Hateful Meme Detection
+
+
+ Recent advances show that two-stream approaches have achieved outstanding
+performance in hateful meme detection. However, hateful memes constantly evolve
+as new memes emerge by fusing progressive cultural ideas, making existing
+methods obsolete or ineffective. In this work, we explore the potential of
+Large Multimodal Models (LMMs) for hateful meme detection. To this end, we
+propose Evolver, which incorporates LMMs via Chain-of-Evolution (CoE)
+Prompting, by integrating the evolution attribute and in-context information of
+memes. Specifically, Evolver simulates the evolving and expressing process of
+memes and reasons through LMMs in a step-by-step manner. First, an evolutionary
+pair mining module retrieves the top-k most similar memes in the external
+curated meme set with the input meme. Second, an evolutionary information
+extractor is designed to summarize the semantic regularities between the paired
+memes for prompting. Finally, a contextual relevance amplifier enhances the
+in-context hatefulness information to boost the search for evolutionary
+processes. Extensive experiments on public FHM, MAMI, and HarM datasets show
+that CoE prompting can be incorporated into existing LMMs to improve their
+performance. More encouragingly, it can serve as an interpretive tool to
+promote the understanding of the evolution of social memes.
+
+
+
+
+
+
+
+
+
+
+ Information Retrieval 5
+
+
+
+
+
+ ☆ AmalREC: A Dataset for Relation Extraction and Classification Leveraging
+ Amalgamation of Large Language Models
+
+
+ Existing datasets for relation classification and extraction often exhibit
+limitations such as restricted relation types and domain-specific biases. This
+work presents a generic framework to generate well-structured sentences from
+given tuples with the help of Large Language Models (LLMs). This study has
+focused on the following major questions: (i) how to generate sentences from
+relation tuples, (ii) how to compare and rank them, (iii) can we combine
+strengths of individual methods and amalgamate them to generate an even better
+quality of sentences, and (iv) how to evaluate the final dataset? For the first
+question, we employ a multifaceted 5-stage pipeline approach, leveraging LLMs
+in conjunction with template-guided generation. We introduce Sentence
+Evaluation Index (SEI) that prioritizes factors like grammatical correctness,
+fluency, human-aligned sentiment, accuracy, and complexity to answer the first
+part of the second question. To answer the second part of the second question,
+this work introduces a SEI-Ranker module that leverages SEI to select top
+candidate generations. The top sentences are then strategically amalgamated to
+produce the final, high-quality sentence. Finally, we evaluate our dataset on
+LLM-based and SOTA baselines for relation classification. The proposed dataset
+features 255 relation types, with 15K sentences in the test set and around
+150k in the train set, significantly enhancing relational diversity and
+complexity. This work not only presents a new comprehensive benchmark dataset
+for the RE/RC task, but also compares different LLMs for the generation of quality
+sentences from relational tuples.
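A plausible sketch of an SEI-style ranker as a weighted factor score; the factor names, weights, and top-k amalgamation step below are hypothetical, not the paper's exact index:

```python
def sei_score(factors, weights):
    """Hypothetical Sentence Evaluation Index: a weighted sum of
    per-factor scores in [0, 1] (e.g. grammar, fluency, accuracy)."""
    return sum(weights[k] * factors[k] for k in weights)

def sei_rank(candidates, weights, top_k=2):
    """SEI-Ranker sketch: rank candidate generations by SEI and keep
    the top-k for amalgamation into the final sentence."""
    ranked = sorted(candidates,
                    key=lambda c: -sei_score(c["factors"], weights))
    return ranked[:top_k]
```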
+
+
+
+ comment: 18 pages, 5 Figures
+
+
+
+
+
+
+ ☆ Comparative Performance of Advanced NLP Models and LLMs in Multilingual
+ Geo-Entity Detection
+
+
+ The integration of advanced Natural Language Processing (NLP) methodologies
+and Large Language Models (LLMs) has significantly enhanced the extraction and
+analysis of geospatial data from multilingual texts, impacting sectors such as
+national and international security. This paper presents a comprehensive
+evaluation of leading NLP models -- SpaCy, XLM-RoBERTa, mLUKE, GeoLM -- and
+LLMs, specifically OpenAI's GPT 3.5 and GPT 4, within the context of
+multilingual geo-entity detection. Utilizing datasets from Telegram channels in
+English, Russian, and Arabic, we examine the performance of these models
+through metrics such as accuracy, precision, recall, and F1 scores, to assess
+their effectiveness in accurately identifying geospatial references. The
+analysis exposes each model's distinct advantages and challenges, underscoring
+the complexities involved in achieving precise geo-entity identification across
+varied linguistic landscapes. The conclusions drawn from this experiment aim to
+direct the enhancement and creation of more advanced and inclusive NLP tools,
+thus advancing the field of geospatial analysis and its application to global
+security.
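The metrics reported above reduce to standard set comparisons between predicted and gold entities; a minimal sketch:

```python
def entity_prf(gold, predicted):
    """Precision, recall, and F1 for geo-entity detection, treating
    predicted entity mentions and gold annotations as sets."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```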
+
+
+ In the past, most search queries issued to a search engine were short and
+simple. A keyword based search engine was able to answer such queries quite
+well. However, members are now developing the habit of issuing long and complex
+natural language queries. Answering such queries requires evolution of a search
+engine to have semantic capability. In this paper we present the design of
+LinkedIn's new content search engine with semantic capability, and its impact
+on metrics.
+
+
+
+
+
+
+
+ ☆ Left-handed representation in top 100 male professional tennis players:
+ Multi-disciplinary perspectives ACML 2016
+
+
+ A commonly held opinion is that left-handed tennis players are
+overrepresented compared to the percentage of left-handers within the general
+population. This study provides the domain insights supported by data analysis
+that could help inform the decision of parents and coaches considering whether
+a child should start playing tennis as left- or right-handed when there is no
+strong arm-handed dominance. Compared to the commonly cited figure of about 10%
+of left-handed male population, data analysis from the official ATP web site
+for the top 100 ranked tennis players over the past decades (1985-2016) shows
+evidence of overrepresentation of left-handed elite tennis players (about 15%).
+The insights and data analysis can inform the handedness decision, advance
+coaching and strategic game concepts, enhance media coverage/analytics,
+left-handed facts and statistics, and inform tennis equipment manufacturing.
+
+
+
+ comment: The original work citation (in APA): Ba\v{c}i\'c, B., & Ghazala, A.
+ (2016). Left-handed representation in top 100 male professional tennis
+ players: Multi-disciplinary perspectives. Symposium conducted at the meeting
+ of the First New Zealand Text Mining Workshop (TMNZ 2016) in conjunction with
+ the 8th Asian Conference on Machine Learning (ACML 2016), Hamilton, New
+ Zealand
+
+ Modeling feature interactions is crucial for click-through rate (CTR)
+prediction, particularly when it comes to high-order explicit interactions.
+Traditional methods struggle with this task because they often predefine a
+maximum interaction order, which relies heavily on prior knowledge and can
+limit the model's effectiveness. Additionally, modeling high-order interactions
+typically leads to increased computational costs. Therefore, the challenge lies
+in adaptively modeling high-order feature interactions while maintaining
+efficiency. To address this issue, we introduce Kolmogorov-Arnold Represented
+Sparse Efficient Interaction Network (KarSein), designed to optimize both
+predictive accuracy and computational efficiency. We firstly identify
+limitations of directly applying Kolmogorov-Arnold Networks (KAN) to CTR and
+then introduce KarSein to overcome these issues. It features a novel
+architecture that reduces the computational costs of KAN and supports embedding
+vectors as feature inputs. Additionally, KarSein employs guided symbolic
+regression to address the challenge of KAN in spontaneously learning
+multiplicative relationships. Extensive experiments demonstrate KarSein's
+superior performance, achieving significant predictive accuracy with minimal
+computational overhead. Furthermore, KarSein maintains strong global
+explainability while enabling the removal of redundant features, resulting in a
+sparse network structure. These advantages also position KarSein as a promising
+method for efficient inference.
+
+
+
+ comment: KarSein for CTR
+
+
+
+
+
+
+
+
+
+ Machine Learning 23
+
+
+
+
+
+ ☆ Matrix Concentration for Random Signed Graphs and Community Recovery in
+ the Signed Stochastic Block Model
+
+
+ We consider graphs where edges and their signs are added independently at
+random from among all pairs of nodes. We establish strong concentration
+inequalities for adjacency and Laplacian matrices obtained from this family of
+random graph models. Then, we apply our results to study graphs sampled from
+the signed stochastic block model. Namely, we take a two-community setting
+where edges within the communities have positive signs and edges between the
+communities have negative signs and apply a random sign perturbation with
+probability $0< s <1/2$. In this setting, our findings include: first, the
+spectral gap of the corresponding signed Laplacian matrix concentrates near
+$2s$ with high probability; and second, the sign of the first eigenvector of
+the Laplacian matrix defines a weakly consistent estimator for the balanced
+community detection problem, or equivalently, the $\pm 1$ synchronization
+problem. We supplement our theoretical contributions with experimental data
+obtained from the models under consideration.
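The spectral estimator described above can be sketched in a few lines of numpy; for simplicity this uses a complete signed graph (the paper's model also randomizes edge presence, and its gap statement near $2s$ is for a normalized Laplacian):

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 100, 0.1                      # community size, sign-flip probability

# Complete signed graph on two communities: + within, - between, with
# each sign flipped independently with probability s.
labels = np.repeat([1, -1], n)
A = np.outer(labels, labels).astype(float)
np.fill_diagonal(A, 0.0)
flip = np.triu(rng.random((2 * n, 2 * n)) < s, 1)
flip = flip | flip.T                 # keep the flip pattern symmetric
A[flip] *= -1

# Signed Laplacian: degrees use |A|, so L is positive semidefinite.
L = np.diag(np.abs(A).sum(axis=1)) - A
eigvals, eigvecs = np.linalg.eigh(L)
estimate = np.sign(eigvecs[:, 0])    # eigenvector of the smallest eigenvalue

# The sign pattern recovers the communities up to a global sign.
accuracy = max(np.mean(estimate == labels), np.mean(estimate == -labels))
```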
+
+
+ In this paper, we introduce Audiopedia, a novel task called Audio Question
+Answering with Knowledge, which requires both audio comprehension and external
+knowledge reasoning. Unlike traditional Audio Question Answering (AQA)
+benchmarks that focus on simple queries answerable from audio alone, Audiopedia
+targets knowledge-intensive questions. We define three sub-tasks: (i) Single
+Audio Question Answering (s-AQA), where questions are answered based on a
+single audio sample, (ii) Multi-Audio Question Answering (m-AQA), which
+requires reasoning over multiple audio samples, and (iii) Retrieval-Augmented
+Audio Question Answering (r-AQA), which involves retrieving relevant audio to
+answer the question. We benchmark large audio language models (LALMs) on these
+sub-tasks and observe suboptimal performance. To address this, we propose a
+generic framework that can be adapted to any LALM, equipping them with
+knowledge reasoning capabilities. Our framework has two components: (i) Audio
+Entity Linking (AEL) and (ii) Knowledge-Augmented Audio Large Multimodal Model
+(KA2LM), which together improve performance on knowledge-intensive AQA tasks.
+To our knowledge, this is the first work to address advanced audio
+understanding via knowledge-intensive tasks like Audiopedia.
+
+
+
+ comment: Accepted to ICASSP 2025
+
+
+
+
+
+
+ ☆ Converting Time Series Data to Numeric Representations Using Alphabetic
+ Mapping and k-mer strategy
+
+
+
+
+
+
+
+
+ Sarwan Ali, Tamkanat E Ali, Imdad Ullah Khan, Murray Patterson
+
+
+ In the realm of data analysis and bioinformatics, representing time series
+data in a manner akin to biological sequences offers a novel approach to
+leverage sequence analysis techniques. Transforming time series signals into
+molecular sequence-type representations allows us to enhance pattern
+recognition by applying sophisticated sequence analysis techniques (e.g.
+$k$-mers based representation) developed in bioinformatics, uncovering hidden
+patterns and relationships in complex, non-linear time series data. This paper
+proposes a method to transform time series signals into biological/molecular
+sequence-type representations using a unique alphabetic mapping technique. By
+generating 26 ranges corresponding to the 26 letters of the English alphabet,
+each value within the time series is mapped to a specific character based on
+its range. This conversion facilitates the application of sequence analysis
+algorithms, typically used in bioinformatics, to analyze time series data. We
+demonstrate the effectiveness of this approach by converting real-world time
+series signals into character sequences and performing sequence classification.
+The resulting sequences can be utilized for various sequence-based analysis
+techniques, offering a new perspective on time series data representation and
+analysis.
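The mapping and the downstream $k$-mer representation can be sketched directly; the equal-width binning over the observed min-max span is our reading of the 26-range scheme:

```python
from collections import Counter

def to_sequence(signal, n_letters=26):
    """Map each time-series value to one of `n_letters` equal-width
    ranges and emit the corresponding letter, turning the signal into
    a biological-sequence-like string."""
    lo, hi = min(signal), max(signal)
    span = (hi - lo) or 1.0          # avoid division by zero
    letters = []
    for v in signal:
        idx = min(int((v - lo) * n_letters / span), n_letters - 1)
        letters.append(chr(ord("A") + idx))
    return "".join(letters)

def kmer_counts(seq, k=3):
    """Bag of overlapping k-mers: the standard alignment-free
    representation borrowed from bioinformatics."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
```

The resulting k-mer count vectors can then feed any sequence classifier.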
+
+
+
+
+
+
+
+
+ Sarwan Ali, Tamkanat E Ali, Imdad Ullah Khan, Murray Patterson
+
+
+ Accurate molecular sequence analysis is a key task in the field of
+bioinformatics. To apply molecular sequence classification algorithms, we first
+need to generate the appropriate representations of the sequences. Traditional
+numeric sequence representation techniques are mostly based on sequence
+alignment that faces limitations in the form of lack of accuracy. Although
+several alignment-free techniques have also been introduced, their tabular data
+form results in low performance when used with Deep Learning (DL) models
+compared to the competitive performance observed in the case of image-based
+data. To find a solution to this problem and to make Deep Learning (DL) models
+function to their maximum potential while capturing the important spatial
+information in the sequence data, we propose a universal Hibert curve-based
+Chaos Game Representation (CGR) method. This method is a transformative
+function that involves a novel Alphabetic index mapping technique used in
+constructing Hilbert curve-based image representation from molecular sequences.
+Our method can be globally applied to any type of molecular sequence data. The
+Hilbert curve-based image representations can be used as input to sophisticated
+vision DL models for sequence classification. The proposed method shows
+promising results as it outperforms current state-of-the-art methods by
+achieving a high accuracy of $94.5\%$ and an F1 score of $93.9\%$ when tested
+with the CNN model on the lung cancer dataset. This approach opens up a new
+horizon for exploring molecular sequence analysis using image classification
+methods.
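The core ingredient, placing the $i$-th symbol of a sequence at the $i$-th cell along a Hilbert curve, can be sketched with the classic curve construction; the alphabet, grid size, and pixel encoding here are illustrative, not the paper's exact CGR variant:

```python
def d2xy(n, d):
    """Map distance d along an order-n Hilbert curve (n a power of two)
    to (x, y) grid coordinates; the classic iterative construction."""
    x = y = 0
    s = 1
    while s < n:
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:                  # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y

def sequence_to_image(seq, n=4, alphabet="ACGT"):
    """Place an alphabetic index of each symbol along the Hilbert curve,
    giving an n-by-n image that preserves sequence locality."""
    img = [[0] * n for _ in range(n)]
    for i, ch in enumerate(seq[: n * n]):
        x, y = d2xy(n, i)
        img[y][x] = alphabet.index(ch) + 1
    return img
```

Because consecutive curve cells are grid-adjacent, nearby symbols stay nearby in the image, which is the spatial structure vision models can exploit.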
+
+
+
+
+
+
+
+ ☆ MATEY: multiscale adaptive foundation models for spatiotemporal physical
+ systems
+
+
+
+
+
+
+
+
+ Pei Zhang, M. Paul Laiu, Matthew Norman, Doug Stefanski, John Gounley
+
+
+ Accurate representation of the multiscale features in spatiotemporal physical
+systems using vision transformer (ViT) architectures requires extremely long,
+computationally prohibitive token sequences. To address this issue, we propose
+two adaptive tokenization schemes that dynamically adjust patch sizes based on
+local features: one ensures convergent behavior to uniform patch refinement,
+while the other offers better computational efficiency. Moreover, we present a
+set of spatiotemporal attention schemes, where the temporal or axial spatial
+dimensions are decoupled, and evaluate their computational and data
+efficiencies. We assess the performance of the proposed multiscale adaptive
+model, MATEY, in a sequence of experiments. The results show that adaptive
+tokenization schemes achieve improved accuracy without significantly increasing
+the length of the token sequence. Compared to a full spatiotemporal attention
+scheme or a scheme that decouples only the temporal dimension, we find that
+fully decoupled axial attention is less efficient and expressive, requiring
+more training time and model weights to achieve the same accuracy. Finally, we
+demonstrate in two fine-tuning tasks featuring different physics that models
+pretrained on PDEBench data outperform the ones trained from scratch,
+especially in the low data regime with frozen attention.
+
+
+
+
+
+
+
+
+ Albus Li, Nathan Bailey, Will Sumerfield, Kira Kim
+
+
+ Quinn et al. propose challenge datasets in their work called ``Kryptonite-N''.
+These datasets aim to counter the universal function approximation argument of
+machine learning, breaking the notion that machine learning can ``approximate
+any continuous function'' \cite{original_paper}. Our work refutes this claim and
+shows that universal function approximations can be applied successfully; the
+Kryptonite datasets are constructed predictably, allowing logistic regression
+with sufficient polynomial expansion and L1 regularization to solve for any
+dimension N.
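The claimed mechanism can be illustrated on XOR, the textbook case where a degree-2 interaction feature makes an otherwise non-separable problem linearly separable; this is a sketch of the technique, not a reproduction of the Kryptonite-N experiments:

```python
import numpy as np

# XOR in (x1, x2) is not linearly separable, but the interaction term
# x1*x2 separates it, showing how polynomial expansion lets logistic
# regression solve predictably constructed datasets.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

Phi = np.column_stack([X, X[:, 0] * X[:, 1]])  # polynomial expansion

w = np.zeros(3)
lr, lam = 0.5, 1e-3                            # step size, L1 strength
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-Phi @ w))
    # logistic loss gradient plus an L1 subgradient
    grad = Phi.T @ (p - y) / len(y) + lam * np.sign(w)
    w -= lr * grad

pred = (1.0 / (1.0 + np.exp(-Phi @ w)) > 0.5).astype(float)
accuracy = float((pred == y).mean())
```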
+
+
+
+
+
+
+
+ ☆ Testing and Improving the Robustness of Amortized Bayesian Inference for
+ Cognitive Models
+
+
+
+
+
+
+
+
+ Yufei Wu, Stefan Radev, Francis Tuerlinckx
+
+
+ Contaminant observations and outliers often cause problems when estimating
+the parameters of cognitive models, which are statistical models representing
+cognitive processes. In this study, we test and improve the robustness of
+parameter estimation using amortized Bayesian inference (ABI) with neural
+networks. To this end, we conduct systematic analyses on a toy example and
+analyze both synthetic and real data using a popular cognitive model, the
+Drift Diffusion Model (DDM). First, we study the sensitivity of ABI to contaminants
+with tools from robust statistics: the empirical influence function and the
+breakdown point. Next, we propose a data augmentation or noise injection
+approach that incorporates a contamination distribution into the
+data-generating process during training. We examine several candidate
+distributions and evaluate their performance and cost in terms of accuracy and
+efficiency loss relative to a standard estimator. Introducing contaminants from
+a Cauchy distribution during training considerably increases the robustness of
+the neural density estimator as measured by bounded influence functions and a
+much higher breakdown point. Overall, the proposed method is straightforward
+and practical to implement and has a broad applicability in fields where
+outlier detection or removal is challenging.
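The noise-injection idea can be sketched as follows; the contamination rate and scale are illustrative, and in the study this step wraps the cognitive model's simulator during training:

```python
import numpy as np

def contaminate(data, eps=0.05, scale=1.0, seed=None):
    """Replace a random fraction `eps` of observations with draws from
    a heavy-tailed Cauchy contamination distribution, so the amortized
    neural estimator sees outliers during training."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float).copy()
    mask = rng.random(data.shape) < eps
    data[mask] = rng.standard_cauchy(mask.sum()) * scale
    return data, mask

clean = np.zeros(10_000)             # stand-in for simulated observations
noisy, mask = contaminate(clean, eps=0.05, seed=0)
```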
+
+
+
+
+
+
+
+ ☆ Bridging the Gap: A Decade Review of Time-Series Clustering Methods
+
+
+
+
+
+
+
+
+ John Paparrizos, Fan Yang, Haojun Li
+
+
+ Time series, as one of the most fundamental representations of sequential
+data, has been extensively studied across diverse disciplines, including
+computer science, biology, geology, astronomy, and environmental sciences. The
+advent of advanced sensing, storage, and networking technologies has, however,
+resulted in high-dimensional time-series data, posing significant challenges
+for analyzing latent structures over extended temporal scales. Time-series
+clustering, an established unsupervised learning strategy that groups similar
+time series together, helps unveil hidden patterns in these complex datasets.
+In this survey, we trace the evolution of time-series clustering methods from
+classical approaches to recent advances in neural networks. While previous
+surveys have focused on specific methodological categories, we bridge the gap
+between traditional clustering methods and emerging deep learning-based
+algorithms, presenting a comprehensive, unified taxonomy for this research
+area. This survey highlights key developments and provides insights to guide
+future research in time-series clustering.
+
+
+
+
+
+
+
+ ☆ A Survey on Time-Series Distance Measures
+
+
+
+
+
+
+
+
+ John Paparrizos, Haojun Li, Fan Yang, Kaize Wu, Jens E. d'Hondt, Odysseas Papapetrou
+
+
+ Distance measures have been recognized as one of the fundamental building
+blocks in time-series analysis tasks, e.g., querying, indexing, classification,
+clustering, anomaly detection, and similarity search. The vast proliferation of
+time-series data across a wide range of fields has increased the relevance of
+evaluating the effectiveness and efficiency of these distance measures. To
+provide a comprehensive view of this field, this work considers over 100
+state-of-the-art distance measures, classified into 7 categories: lock-step
+measures, sliding measures, elastic measures, kernel measures, feature-based
+measures, model-based measures, and embedding measures. Beyond providing
+comprehensive mathematical frameworks, this work also delves into the
+distinctions and applications across these categories for both univariate and
+multivariate cases. By providing comprehensive collections and insights, this
+study paves the way for the future development of innovative time-series
+distance measures.
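The lock-step versus elastic distinction is easiest to see side by side; these are textbook Euclidean and dynamic time warping implementations, not code from the survey:

```python
import math

def euclidean(a, b):
    """Lock-step measure: the i-th point is compared to the i-th point."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(a, b):
    """Elastic measure: dynamic time warping lets points realign,
    absorbing local time shifts that lock-step measures penalize."""
    INF = float("inf")
    D = [[INF] * (len(b) + 1) for _ in range(len(a) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[len(a)][len(b)]
```

On a series and its one-step-shifted copy, DTW absorbs the shift almost entirely while the lock-step Euclidean distance charges every misaligned point.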
+
+
+
+
+
+
+
+ ☆ The intrinsic motivation of reinforcement and imitation learning for
+ sequential tasks
+
+
+ This work in the field of developmental cognitive robotics aims to devise a
+new domain bridging between reinforcement learning and imitation learning, with
+a model of the intrinsic motivation for learning agents to learn with guidance
+from tutors multiple tasks, including sequential tasks. The main contribution
+has been to propose a common formulation of intrinsic motivation based on
+empirical progress for a learning agent to choose automatically its learning
+curriculum by actively choosing its learning strategy for simple or sequential
+tasks: which task to learn, between autonomous exploration or imitation
+learning, between low-level actions or task decomposition, between several
+tutors. The originality is to design a learner that benefits not only passively
+from data provided by tutors, but to actively choose when to request tutoring
+and what and whom to ask. The learner is thus more robust to the quality of the
+tutoring and learns faster with fewer demonstrations. We developed the
+framework of socially guided intrinsic motivation with machine learning
+algorithms to learn multiple tasks by taking advantage of the generalisability
+properties of human demonstrations in a passive manner or in an active manner
+through requests of demonstrations from the best tutor for simple and composing
+subtasks. The latter relies on a representation of subtask composition proposed
+for a construction process, which should be refined by representations used for
+observational processes of analysing human movements and activities of daily
+living. With the outlook of a language-like communication with the tutor, we
+investigated the emergence of a symbolic representation of the continuous
+sensorimotor space and of tasks using intrinsic motivation. We proposed within
+the reinforcement learning framework, a reward function for interacting with
+tutors for automatic curriculum learning in multi-task learning.
+
+
+
+ comment: Habilitation thesis
+
+
+
+
+
+
+ ☆ Distributionally Robust Optimization via Iterative Algorithms in
+ Continuous Probability Spaces
+
+
+ We consider a minimax problem motivated by distributionally robust
+optimization (DRO) when the worst-case distribution is continuous, leading to
+significant computational challenges due to the infinite-dimensional nature of
+the optimization problem. Recent research has explored learning the worst-case
+distribution using neural network-based generative models to address these
+computational challenges but lacks algorithmic convergence guarantees. This
+paper bridges this theoretical gap by presenting an iterative algorithm to
+solve such a minimax problem, achieving global convergence under mild
+assumptions and leveraging technical tools from vector space minimax
+optimization and convex analysis in the space of continuous probability
+densities. In particular, leveraging Brenier's theorem, we represent the
+worst-case distribution as a transport map applied to a continuous reference
+measure and reformulate the regularized discrepancy-based DRO as a minimax
+problem in the Wasserstein space. Furthermore, we demonstrate that the
+worst-case distribution can be efficiently computed using a modified
+Jordan-Kinderlehrer-Otto (JKO) scheme with sufficiently large regularization
+parameters for commonly used discrepancy functions, linked to the radius of the
+ambiguity set. Additionally, we derive the global convergence rate and quantify
+the total number of subgradient and inexact modified JKO iterations required to
+obtain approximate stationary points. These results are potentially applicable
+to nonconvex and nonsmooth scenarios, with broad relevance to modern machine
+learning applications.
+
+
+
+
+
+
+
+ ☆ Edge of Stochastic Stability: Revisiting the Edge of Stability for SGD
+
+
+ Recent findings by Cohen et al., 2021, demonstrate that when training neural
+networks with full-batch gradient descent at a step size of $\eta$, the
+sharpness--defined as the largest eigenvalue of the full batch
+Hessian--consistently stabilizes at $2/\eta$. These results have significant
+implications for convergence and generalization. Unfortunately, this was
+observed not to be the case for mini-batch stochastic gradient descent (SGD),
+thus limiting the broader applicability of these findings. We show that SGD
+trains in a different regime we call Edge of Stochastic Stability. In this
+regime, what hovers at $2/\eta$ is, instead, the average over the batches of
+the largest eigenvalue of the Hessian of the mini-batch (MiniBS) loss--which is
+always larger than the sharpness. This implies that the sharpness is generally
+lower when training with smaller batches or larger learning rates, providing a
+basis for the observed implicit regularization effect of SGD towards flatter
+minima and a number of well established empirical phenomena. Additionally, we
+quantify the gap between the MiniBS and the sharpness, further characterizing
+this distinct training regime.
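The claim that the batch-averaged top eigenvalue always upper-bounds the sharpness follows from the convexity of $\lambda_{\max}$ over symmetric matrices. A toy numpy check of this inequality (the per-batch "Hessians" below are synthetic stand-ins, not from an actual network):

```python
import numpy as np

# Synthetic symmetric per-batch Hessians: random perturbations of a shared base.
rng = np.random.default_rng(0)
base = rng.standard_normal((5, 5))
base = (base + base.T) / 2
batch_hessians = []
for _ in range(10):
    noise = rng.standard_normal((5, 5))
    batch_hessians.append(base + (noise + noise.T) / 2)

lam_max = lambda H: np.linalg.eigvalsh(H)[-1]   # eigvalsh sorts ascending

# MiniBS: average over batches of the top mini-batch Hessian eigenvalue.
minibs = np.mean([lam_max(H) for H in batch_hessians])
# Sharpness: top eigenvalue of the full-batch (averaged) Hessian.
sharpness = lam_max(np.mean(batch_hessians, axis=0))
# Jensen's inequality for the convex function lam_max gives minibs >= sharpness.
```

By Jensen's inequality, `minibs >= sharpness` holds for any collection of symmetric matrices, which is the sense in which the MiniBS quantity "is always larger than the sharpness."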
+
+
+
+ comment: 28 pages, 24 figures
+
+
+
+
+
+
+ ☆ The Impact of Prompt Programming on Function-Level Code Generation
+
+
+
+
+
+
+
+
+ Ranim Khojah, Francisco Gomes de Oliveira Neto, Mazen Mohamad, Philipp Leitner
+
+
+ Large Language Models (LLMs) are increasingly used by software engineers for
+code generation. However, limitations of LLMs such as irrelevant or incorrect
+code have highlighted the need for prompt programming (or prompt engineering)
+where engineers apply specific prompt techniques (e.g., chain-of-thought or
+input-output examples) to improve the generated code. Despite this, the impact
+of different prompt techniques -- and their combinations -- on code generation
+remains underexplored. In this study, we introduce CodePromptEval, a dataset of
+7072 prompts designed to evaluate five prompt techniques (few-shot, persona,
+chain-of-thought, function signature, list of packages) and their effect on the
+correctness, similarity, and quality of complete functions generated by three
+LLMs (GPT-4o, Llama3, and Mistral). Our findings show that while certain prompt
+techniques significantly influence the generated code, combining multiple
+techniques does not necessarily improve the outcome. Additionally, we observed
+a trade-off between correctness and quality when using prompt techniques. Our
+dataset and replication package enable future research on improving
+LLM-generated code and evaluating new prompt techniques.
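The five prompt techniques and their combinations can be pictured as composable prompt fragments. A minimal sketch of such composition; all template strings and parameter names here are hypothetical illustrations, not the actual CodePromptEval format:

```python
def build_prompt(task, *, persona=None, examples=None,
                 chain_of_thought=False, signature=None, packages=None):
    # Assemble a code-generation prompt from optional technique fragments:
    # persona, few-shot examples, list of packages, function signature,
    # and a chain-of-thought instruction.
    parts = []
    if persona:
        parts.append(f"You are {persona}.")
    if examples:  # few-shot input-output examples
        for inp, out in examples:
            parts.append(f"Input: {inp}\nOutput: {out}")
    if packages:  # list-of-packages technique
        parts.append("You may use these packages: " + ", ".join(packages))
    if signature:  # function-signature technique
        parts.append(f"Implement the function: {signature}")
    parts.append(task)
    if chain_of_thought:
        parts.append("Think step by step before writing the final code.")
    return "\n\n".join(parts)

prompt = build_prompt(
    "Write a function that returns the n-th Fibonacci number.",
    persona="an experienced Python developer",
    signature="def fib(n: int) -> int",
    chain_of_thought=True,
)
```

Enumerating all on/off combinations of such fragments over a set of tasks is one way a dataset like the 7072-prompt grid described above could be constructed.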
+
+
+
+ comment: CodePromptEval dataset and replication package on GitHub:
+ https://github.com/icetlab/CodePromptEval
+
+
+
+
+
+
+ ♻ ☆ E-Sort: Empowering End-to-end Neural Network for Multi-channel Spike
+ Sorting with Transfer Learning and Fast Post-processing
+
+
+ Decoding extracellular recordings is a crucial task in electrophysiology and
+brain-computer interfaces. Spike sorting, which distinguishes spikes and their
+putative neurons from extracellular recordings, becomes computationally
+demanding with the increasing number of channels in modern neural probes. To
+address the intensive workload and complex neuron interactions, we propose
+E-Sort, an end-to-end neural network-based spike sorter with transfer learning
+and parallelizable post-processing. Our framework reduces the required number
+of annotated spikes for training by 44% compared to training from scratch,
+achieving up to 25.68% higher accuracy. Additionally, our novel post-processing
+algorithm is compatible with deep learning frameworks, making E-Sort
+significantly faster than state-of-the-art spike sorters. On synthesized
+Neuropixels recordings, E-Sort achieves comparable accuracy with Kilosort4
+while sorting 50 seconds of data in only 1.32 seconds. Our method demonstrates
+robustness across various probe geometries, noise levels, and drift conditions,
+offering a substantial improvement in both accuracy and runtime efficiency
+compared to existing spike sorters.
+
+
+
+
+
+
+
+ ♻ ☆ Learning Optimal Control and Dynamical Structure of Global Trajectory
+ Search Problems with Diffusion Models
+
+
+ Spacecraft trajectory design is a global search problem, where previous work
+has revealed specific solution structures that can be captured with data-driven
+methods. This paper explores two global search problems in the circular
+restricted three-body problem: hybrid cost function of minimum
+fuel/time-of-flight and transfers to energy-dependent invariant manifolds.
+These problems display a fundamental structure either in the optimal control
+profile or the use of dynamical structures. We build on our prior generative
+machine learning framework to apply diffusion models to learn the conditional
+probability distribution of the search problem and analyze the model's
+capability to capture these structures.
+
+
+
+ comment: This paper was presented at the AAS/AIAA Astrodynamics Specialist
+ Conference
+
+
+
+
+
+
+ ♻ ☆ Real-time Speech Enhancement on Raw Signals with Deep State-space
+ Modeling
+
+
+ We present aTENNuate, a simple deep state-space autoencoder configured for
+efficient online raw speech enhancement in an end-to-end fashion. The network's
+performance is primarily evaluated on raw speech denoising, with additional
+assessments on tasks such as super-resolution and de-quantization. We benchmark
+aTENNuate on the VoiceBank + DEMAND and the Microsoft DNS1 synthetic test sets.
+The network outperforms previous real-time denoising models in terms of PESQ
+score, parameter count, MACs, and latency. Even as a raw waveform processing
+model, the model maintains high fidelity to the clean signal with minimal
+audible artifacts. In addition, the model remains performant even when the
+noisy input is compressed down to 4000 Hz and 4 bits, suggesting general speech
+enhancement capabilities in low-resource environments. Code is available at
+github.com/Brainchip-Inc/aTENNuate
+
+
+
+
+
+
+
+ ♻ ☆ CASUAL: Conditional Support Alignment for Domain Adaptation with Label
+ Shift AAAI 2025
+
+
+
+
+
+
+
+
+ Anh T Nguyen, Lam Tran, Anh Tong, Tuan-Duy H. Nguyen, Toan Tran
+
+
+ Unsupervised domain adaptation (UDA) refers to a domain adaptation framework
+in which a learning model is trained on labeled samples from the source
+domain and unlabeled ones from the target domain. The dominant existing methods
+in the field that rely on the classical covariate shift assumption to learn
+domain-invariant feature representation have yielded suboptimal performance
+under label distribution shift. In this paper, we propose a novel Conditional
+Adversarial SUpport ALignment (CASUAL) method that minimizes the conditional
+symmetric support divergence between the source and target domains' feature
+representation distributions, yielding a more discriminative representation
+for the classification task. We also introduce a novel theoretical target risk
+bound, which justifies the merits of aligning the supports of conditional
+feature distributions compared to the existing marginal support alignment
+approach in the UDA settings. We then provide a complete training process for
+learning in which the objective optimization functions are precisely based on
+the proposed target risk bound. Our empirical results demonstrate that CASUAL
+outperforms other state-of-the-art methods on different UDA benchmark tasks
+under different label shift conditions.
+
+
+
+
+
+
+
+
+ Zeno Kujawa, John Poole, Dobrik Georgiev, Danilo Numeroso, Pietro Liò
+
+
+ Neural Algorithmic Reasoning (NAR) aims to optimize classical algorithms.
+However, canonical implementations of NAR train neural networks to return only
+a single solution, even when there are multiple correct solutions to a problem,
+such as single-source shortest paths. For some applications, it is desirable to
+recover more than one correct solution. To that end, we give the first method
+for NAR with multiple solutions. We demonstrate our method on two classical
+algorithms: Bellman-Ford (BF) and Depth-First Search (DFS), favouring deeper
+insight into two algorithms over a broader survey of algorithms. This method
+involves generating appropriate training data as well as sampling and
+validating solutions from model output. Each step of our method, which can
+serve as a framework for neural algorithmic reasoning beyond the tasks
+presented in this paper, might be of independent interest to the field and our
+results represent the first attempt at this task in the NAR literature.
+
+
+
+
+
+
+
+ ♻ ☆ A Self-Supervised Robotic System for Autonomous Contact-Based Spatial
+ Mapping of Semiconductor Properties
+
+
+
+
+
+
+
+
+ Alexander E. Siemenn, Basita Das, Kangyu Ji, Fang Sheng, Tonio Buonassisi
+
+
+ Integrating robotically driven contact-based material characterization
+techniques into self-driving laboratories can enhance measurement quality,
+reliability, and throughput. While deep learning models support robust
+autonomy, current methods lack reliable pixel-precision positioning and require
+extensive labeled data. To overcome these challenges, we propose an approach
+for building self-supervised autonomy into contact-based robotic systems that
+teach the robot to follow domain-expert measurement principles at high
+throughput. Firstly, we design a vision-based, self-supervised
+convolutional neural network (CNN) architecture that uses differentiable image
+priors to optimize domain-specific objectives, refining the pixel precision of
+predicted robot contact poses by 20.0% relative to existing approaches.
+Secondly, we design a reliable graph-based planner for generating
+distance-minimizing paths to accelerate the robot measurement throughput and
+decrease planning variance by 6x. We demonstrate the performance of this
+approach by autonomously driving a 4-degree-of-freedom robotic probe for 24
+hours to characterize semiconductor photoconductivity at 3,025 uniquely
+predicted poses across a gradient of drop-casted perovskite film compositions,
+achieving throughputs over 125 measurements per hour. Spatially mapping
+photoconductivity onto each drop-casted film reveals compositional trends and
+regions of inhomogeneity, valuable for identifying manufacturing process
+defects. With this self-supervised CNN-driven robotic system, we enable
+high-precision and reliable automation of contact-based characterization
+techniques at high throughputs, thereby allowing the measurement of previously
+inaccessible yet important semiconductor properties for self-driving
+laboratories.
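The abstract does not spell out the graph-based planner, so as a stand-in, a greedy nearest-neighbour ordering over measurement poses illustrates what "distance-minimizing path" means here (a sketch under our own assumptions, not the authors' algorithm):

```python
import math

def greedy_path(points, start=0):
    # Order measurement poses by repeatedly visiting the nearest unvisited
    # point -- a simple heuristic for shortening total probe travel.
    unvisited = set(range(len(points))) - {start}
    order = [start]
    while unvisited:
        last = points[order[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(points[i], last))
        order.append(nxt)
        unvisited.remove(nxt)
    return order

# Toy set of 2D probe poses: the path hops to nearby points first.
poses = [(0, 0), (5, 5), (1, 0), (6, 5)]
order = greedy_path(poses)
```

A production planner would also need to control variance across runs (as the paper's 6x reduction suggests), which a deterministic ordering like this one achieves trivially.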
+
+
+ Recently, deep learning has made remarkable strides, especially with
+generative modeling, such as large language models and probabilistic diffusion
+models. However, training these models often involves significant computational
+resources, requiring billions of petaFLOPs. This high resource consumption
+results in substantial energy usage and a large carbon footprint, raising
+critical environmental concerns. Back-propagation (BP) is a major source of
+computational expense when training deep learning models. To advance research
+on energy-efficient training and allow for sparse learning on any machine and
+device, we propose a general, energy-efficient convolution module that can be
+seamlessly integrated into any deep learning architecture. Specifically, we
+introduce channel-wise sparsity with additional gradient selection schedulers
+during the backward pass, based on the assumption that BP is often dense and inefficient,
+which can lead to over-fitting and high computational consumption. Our
+experiments demonstrate that our approach reduces computation by 40\% while
+potentially improving model performance, validated on image classification and
+generation tasks. This reduction can lead to significant energy savings and a
+lower carbon footprint during the research and development phases of
+large-scale AI systems. Additionally, our method mitigates over-fitting in a
+manner distinct from Dropout, allowing it to be combined with Dropout to
+further enhance model performance and reduce computational resource usage.
+Extensive experiments validate that our method generalizes to a variety of
+datasets and tasks and is compatible with a wide range of deep learning
+architectures and modules. Code is publicly available at
+https://github.com/lujiazho/ssProp.
+
+
+
+ comment: Accepted by AAAI24 Workshop: Scalable and Efficient Artificial
+ Intelligence Systems
+
+
+
+
+
+
+
+ Jordan Slessor, Dezheng Kong, Xiaofen Tang, Zheng En Than, Linglong Kong
+
+
+ Federated learning (FL) is a machine learning methodology that involves the
+collaborative training of a global model across multiple decentralized clients
+in a privacy-preserving way. Several FL methods are introduced to tackle
+communication inefficiencies but do not address how to sample participating
+clients in each round effectively and in a privacy-preserving manner. In this
+paper, we propose \textit{FedSTaS}, a client and data-level sampling method
+inspired by \textit{FedSTS} and \textit{FedSampling}. In each federated
+learning round, \textit{FedSTaS} stratifies clients based on their compressed
+gradients, re-allocates the number of clients to sample using an optimal Neyman
+allocation, and samples local data from each participating client using a
+uniform data sampling strategy. Experiments on three datasets show that
+\textit{FedSTaS} can achieve higher accuracy scores than those of
+\textit{FedSTS} within a fixed number of training rounds.
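Neyman allocation assigns more samples to strata that are larger and more heterogeneous, i.e. $n_h \propto N_h S_h$. A minimal sketch (the stratum statistics are made-up numbers, and how FedSTaS estimates them from compressed gradients is not shown here):

```python
def neyman_allocation(sizes, stds, total):
    # Neyman allocation: sample n_h proportional to N_h * S_h, where N_h is
    # the stratum size and S_h its within-stratum standard deviation.
    weights = [n * s for n, s in zip(sizes, stds)]
    w_sum = sum(weights)
    # Naive rounding; sum(alloc) can differ slightly from `total`.
    return [int(round(total * w / w_sum)) for w in weights]

# Two equal-size client strata, one three times as variable:
alloc = neyman_allocation([100, 100], [1.0, 3.0], total=8)
```

Under this rule the more heterogeneous stratum receives three times as many sampled clients, which is the variance-minimizing split for a fixed sampling budget.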
+
+
+
+ comment: 6 pages, 3 figures
+
+
+
+
+
+
+ ♻ ☆ Deep Neural Networks and Brain Alignment: Brain Encoding and Decoding
+ (Survey)
+
+
+ Can artificial intelligence unlock the secrets of the human brain? How do the
+inner mechanisms of deep learning models relate to our neural circuits? Is it
+possible to enhance AI by tapping into the power of brain recordings? These
+captivating questions lie at the heart of an emerging field at the intersection
+of neuroscience and artificial intelligence. Our survey dives into this
+exciting domain, focusing on human brain recording studies and cutting-edge
+cognitive neuroscience datasets that capture brain activity during natural
+language processing, visual perception, and auditory experiences. We explore
+two fundamental approaches: encoding models, which attempt to generate brain
+activity patterns from sensory inputs; and decoding models, which aim to
+reconstruct our thoughts and perceptions from neural signals. These techniques
+not only promise breakthroughs in neurological diagnostics and brain-computer
+interfaces but also offer a window into the very nature of cognition. In this
+survey, we first discuss popular representations of language, vision, and
+speech stimuli, and present a summary of neuroscience datasets. We then review
+how the recent advances in deep learning transformed this field, by
+investigating the popular deep learning based encoding and decoding
+architectures, noting their benefits and limitations across different sensory
+modalities. From text to images, speech to videos, we investigate how these
+models capture the brain's response to our complex, multimodal world. While our
+primary focus is on human studies, we also highlight the crucial role of animal
+models in advancing our understanding of neural mechanisms. Throughout, we
+mention the ethical implications of these powerful technologies, addressing
+concerns about privacy and cognitive liberty. We conclude with a summary and
+discussion of future trends in this rapidly evolving field.
+
+
+
+ comment: 61 pages, 22 figures
+
+
+
+
+
+
+ ♻ ☆ An Efficient Matrix Multiplication Algorithm for Accelerating Inference
+ in Binary and Ternary Neural Networks
+
+
+ Despite their tremendous success and versatility, Large Language Models
+(LLMs) suffer from inference inefficiency while relying on advanced
+computational infrastructure. To address these challenges and make LLMs more
+accessible and cost-effective, in this paper, we propose algorithms to improve
+the inference time and memory efficiency of 1.58-bit LLMs with ternary weight
+matrices. Particularly focusing on matrix multiplication as the bottleneck
+operation of inference, we observe that, once trained, the weight matrices of a
+model no longer change. This allows us to preprocess these matrices and create
+indices that help reduce the storage requirements by a logarithmic factor while
+enabling our efficient inference algorithms. Specifically, for an $n$ by $n$
+weight matrix, our efficient algorithm guarantees a time complexity of
+$O(\frac{n^2}{\log n})$, a logarithmic factor improvement over the standard
+$O(n^2)$ vector-matrix multiplication. Besides theoretical analysis, we conduct
+extensive experiments to evaluate the practical efficiency of our algorithms.
+Our results confirm the superiority of the approach both with respect to time
+and memory, as we observed a reduction in inference time up to 29x and memory
+usage up to 6x.
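The $O(n^2/\log n)$ bound rests on the weights being fixed and ternary: length-$t$ segments of each row take only $3^t$ possible patterns, so per-pattern dot products can be tabulated once per input and reused across rows. A rough numpy sketch of this "mailman"-style idea, under our own assumptions rather than the authors' exact algorithm:

```python
import numpy as np

def preprocess(W, t):
    # Encode each length-t segment of every row of the ternary weight
    # matrix as a base-3 integer (digits 0/1/2 stand for weights -1/0/+1).
    n, m = W.shape
    Wp = np.pad(W, ((0, 0), (0, (-m) % t)))  # zero-pad columns to a multiple of t
    digits = (Wp.reshape(n, -1, t) + 1).astype(np.int64)
    return digits @ (3 ** np.arange(t))      # (n, num_segments) segment codes

def ternary_matvec(codes, x, t):
    # Tabulate, per segment position, the dot product of the input slice
    # with all 3**t ternary patterns; each row then costs one lookup per
    # segment, giving ~n/t work per row once t ~ log n.
    n, nseg = codes.shape
    xp = np.pad(np.asarray(x, float), (0, (-len(x)) % t)).reshape(nseg, t)
    ks = np.arange(3 ** t)
    patterns = (ks[:, None] // 3 ** np.arange(t)) % 3 - 1  # (3**t, t) in {-1,0,1}
    tables = patterns @ xp.T                               # (3**t, nseg)
    return tables[codes, np.arange(nseg)].sum(axis=1)

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(8, 9))   # ternary weights, fixed after training
x = rng.standard_normal(9)
y = ternary_matvec(preprocess(W, t=3), x, t=3)   # agrees with W @ x
```

The preprocessing is done once per model, matching the observation that trained weight matrices no longer change; only the small per-segment tables depend on the input vector.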
+
+
+ In this paper, we introduce Audiopedia, a novel task called Audio Question
+Answering with Knowledge, which requires both audio comprehension and external
+knowledge reasoning. Unlike traditional Audio Question Answering (AQA)
+benchmarks that focus on simple queries answerable from audio alone, Audiopedia
+targets knowledge-intensive questions. We define three sub-tasks: (i) Single
+Audio Question Answering (s-AQA), where questions are answered based on a
+single audio sample, (ii) Multi-Audio Question Answering (m-AQA), which
+requires reasoning over multiple audio samples, and (iii) Retrieval-Augmented
+Audio Question Answering (r-AQA), which involves retrieving relevant audio to
+answer the question. We benchmark large audio language models (LALMs) on these
+sub-tasks and observe suboptimal performance. To address this, we propose a
+generic framework that can be adapted to any LALM, equipping them with
+knowledge reasoning capabilities. Our framework has two components: (i) Audio
+Entity Linking (AEL) and (ii) Knowledge-Augmented Audio Large Multimodal Model
+(KA2LM), which together improve performance on knowledge-intensive AQA tasks.
+To our knowledge, this is the first work to address advanced audio
+understanding via knowledge-intensive tasks like Audiopedia.
+
+
+
+ comment: Accepted to ICASSP 2025
+
+
+
+
+
+
+ ☆ ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video
+ Understanding
+
+
+
+
+
+
+
+
+ Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, Liqiang Nie
+
+
+ Video Large Language Models (VideoLLMs) have achieved remarkable progress in
+video understanding. However, existing VideoLLMs often inherit the limitations
+of their backbone LLMs in handling long sequences, leading to challenges for
+long video understanding. Common solutions either simply uniformly sample
+videos' frames or compress visual tokens, which focus primarily on low-level
+temporal visual redundancy, overlooking high-level knowledge redundancy. This
+limits the achievable compression rate with minimal loss. To this end, we
+introduce a training-free method, $\textbf{ReTaKe}$, containing two novel
+modules, DPSelect and PivotKV, to jointly model and reduce both temporal visual
+redundancy and knowledge redundancy for long video understanding. Specifically,
+DPSelect identifies keyframes with local maximum peak distance based on their
+visual features, which are closely aligned with human video perception. PivotKV
+employs the obtained keyframes as pivots and conducts KV-Cache compression for
+the non-pivot tokens with low attention scores, which are derived from the
+learned prior knowledge of LLMs. Experiments on benchmarks VideoMME, MLVU, and
+LVBench, show that ReTaKe can support 4x longer video sequences with minimal
+performance loss (<1%) and outperform all similar-size VideoLLMs by 3%-5%,
+even surpassing or matching much larger ones. Our code is available at
+https://github.com/SCZwangxiao/video-ReTaKe
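DPSelect's keyframe rule, keeping frames at local maxima of inter-frame feature distance, can be illustrated with a small sketch. This is our reading of the abstract, not the released implementation:

```python
import numpy as np

def select_keyframes(features):
    # features: (num_frames, dim) array of per-frame visual features.
    # Keep frames whose distance to the previous frame is a local peak,
    # i.e. where the visual content just changed more than at neighbours.
    d = np.linalg.norm(np.diff(features, axis=0), axis=1)  # len num_frames-1
    return [i + 1 for i in range(1, len(d) - 1)
            if d[i] > d[i - 1] and d[i] > d[i + 1]]

# Toy example: a feature jump between frames 2 and 3 makes frame 3 a keyframe.
feats = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
keys = select_keyframes(feats)
```

The selected keyframes would then serve as the pivots for the KV-cache compression step applied to the remaining tokens.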
+
+
+
+
+
+
+
+
+ Xilei Zhu, Huiyu Duan, Liu Yang, Yucheng Zhu, Xiongkuo Min, Guangtao Zhai, Patrick Le Callet
+
+
+ With the rapid development of eXtended Reality (XR), egocentric spatial
+shooting and display technologies have further enhanced immersion and
+engagement for users. Assessing the quality of experience (QoE) of egocentric
+spatial videos is crucial to ensure a high-quality viewing experience. However,
+the corresponding research is still lacking. In this paper, we use the term
+embodied experience to highlight this more immersive experience, and study the
+new problem, i.e., embodied perceptual quality assessment for egocentric spatial
+videos. Specifically, we introduce the first Egocentric Spatial Video Quality
+Assessment Database (ESVQAD), which comprises 600 egocentric spatial videos and
+their mean opinion scores (MOSs). Furthermore, we propose a novel
+multi-dimensional binocular feature fusion model, termed ESVQAnet, which
+integrates binocular spatial, motion, and semantic features to predict the
+perceptual quality. Experimental results demonstrate that ESVQAnet outperforms
+16 state-of-the-art VQA models on the embodied perceptual quality assessment
+task, and exhibits strong generalization capability on traditional VQA tasks.
+The database and code will be released upon publication.
+
+
+ Makeup is no longer confined to physical application; people now use mobile
+apps to digitally apply makeup to their photos, which they then share on social
+media. However, while this shift has made makeup more accessible, designing
+diverse makeup styles tailored to individual faces remains a challenge, one
+that must still be carried out manually by humans. Existing systems, such as
+makeup recommendation engines and makeup transfer techniques, are limited in
+creating innovative makeups for different individuals "intuitively": they
+demand significant user effort and knowledge, and offer only a limited set of
+in-app makeup options. To address this challenge, we propose Prot\'eg\'e, a new
+makeup application that leverages recent generative models (GANs) to learn and
+automatically generate makeup styles -- a task that existing makeup
+applications (i.e., makeup recommendation systems using expert systems, and
+makeup transfer methods) are unable to perform. Extensive experiments
+demonstrate the capability of Prot\'eg\'e to learn and create diverse makeups
+in a convenient and intuitive way, marking a significant leap in digital makeup
+technology.
+
+
+
+ comment: 8 pages, 5 figures
+
+
+
+
+
+
+ ☆ Tri-Ergon: Fine-grained Video-to-Audio Generation with Multi-modal
+ Conditions and LUFS Control AAAI 2025
+
+
+ Video-to-audio (V2A) generation utilizes visual-only video features to
+produce realistic sounds that correspond to the scene. However, current V2A
+models often lack fine-grained control over the generated audio, especially in
+terms of loudness variation and the incorporation of multi-modal conditions. To
+overcome these limitations, we introduce Tri-Ergon, a diffusion-based V2A model
+that incorporates textual, auditory, and pixel-level visual prompts to enable
+detailed and semantically rich audio synthesis. Additionally, we introduce
+Loudness Units relative to Full Scale (LUFS) embedding, which allows for
+precise manual control of the loudness changes over time for individual audio
+channels, enabling our model to effectively address the intricate correlation
+of video and audio in real-world Foley workflows. Tri-Ergon is capable of
+creating 44.1 kHz high-fidelity stereo audio clips of varying lengths up to 60
+seconds, which significantly outperforms existing state-of-the-art V2A methods
+that typically generate mono audio for a fixed duration.
+
+
+
+
+
+
+
+
+ Ashishkumar Gudmalwar, Ishan D. Biyani, Nirmesh Shah, Pankaj Wasnik, Rajiv Ratn Shah
+
+
+ Emotional Voice Conversion (EVC) aims to convert the discrete emotional
+state of a given speech utterance from the source emotion to the target while
+preserving linguistic content. In this paper, we propose regularizing emotion
+intensity in the diffusion-based EVC framework to generate precise speech of
+the target emotion. Traditional approaches control the intensity of an
+emotional state in the utterance via emotion class probabilities or intensity
+labels that often lead to inept style manipulations and degradations in
+quality. On the contrary, we aim to regulate emotion intensity using
+self-supervised learning-based feature representations and unsupervised
+directional latent vector modeling (DVM) in the emotional embedding space
+within a diffusion-based framework. These emotion embeddings can be modified
+based on the given target emotion intensity and the corresponding direction
+vector. Furthermore, the updated embeddings can be fused in the reverse
+diffusion process to generate the speech with the desired emotion and
+intensity. In summary, this paper aims to achieve high-quality emotional
+intensity regularization in the diffusion-based EVC framework, which is the
+first work of its kind. The effectiveness of the proposed method has been shown
+across state-of-the-art (SOTA) baselines in terms of subjective and objective
+evaluations for the English and Hindi languages \footnote{Demo samples are
+available at the following URL: \url{https://nirmesh-sony.github.io/EmoReg/}}.
+
+
+
+ comment: Accepted to AAAI 2025
+
+
+
+
+
+
+ ♻ ☆ AKiRa: Augmentation Kit on Rays for optical video generation
+
+
+
+
+
+
+
+
+ Xi Wang, Robin Courant, Marc Christie, Vicky Kalogeiton
+
+
+ Recent advances in text-conditioned video diffusion have greatly improved
+video quality. However, these methods offer limited or sometimes no control to
+users on camera aspects, including dynamic camera motion, zoom, distorted lens
+and focus shifts. These motion and optical aspects are crucial for adding
+controllability and cinematic elements to generation frameworks, ultimately
+resulting in visual content that draws focus, enhances mood, and guides
+emotions according to filmmakers' controls. In this paper, we aim to close the
+gap between controllable video generation and camera optics. To achieve this,
+we propose AKiRa (Augmentation Kit on Rays), a novel augmentation framework
+that builds and trains a camera adapter with a complex camera model over an
+existing video generation backbone. It enables fine-tuned control over camera
+motion as well as complex optical parameters (focal length, distortion,
+aperture) to achieve cinematic effects such as zoom, fisheye effect, and bokeh.
+Extensive experiments demonstrate AKiRa's effectiveness in combining and
+composing camera optics while outperforming all state-of-the-art methods. This
+work sets a new landmark in controlled and optically enhanced video generation,
+paving the way for future optical video generation methods.
+
+
+
+
+
+
+
+ ♻ ☆ SoundLoc3D: Invisible 3D Sound Source Localization and Classification
+ Using a Multimodal RGB-D Acoustic Camera WACV2025
+
+
+ Accurately localizing 3D sound sources and estimating their semantic labels
+-- where the sources may not be visible, but are assumed to lie on the physical
+surface of objects in the scene -- have many real applications, including
+detecting gas leaks and machinery malfunctions. The weak audio-visual
+correlation in such settings poses new challenges in devising methods to
+determine if or how cross-modal information can be used to solve the task. Towards this
+end, we propose to use an acoustic-camera rig consisting of a pinhole RGB-D
+camera and a coplanar four-channel microphone array~(Mic-Array). By using this
+rig to record audio-visual signals from multiviews, we can use the cross-modal
+cues to estimate the sound sources' 3D locations. Specifically, our framework
+SoundLoc3D treats the task as a set prediction problem in which each element of
+the set corresponds to a potential sound source. Given the audio-visual
+weak-correlation, the set representation is initially learned from a single
+view microphone array signal, and then refined by actively incorporating
+physical surface cues revealed from multiview RGB-D images. We demonstrate the
+efficiency and superiority of SoundLoc3D on a large-scale simulated dataset, and
+further show its robustness to RGB-D measurement inaccuracy and ambient noise
+interference.
+
+
+
+ comment: Accepted by WACV2025
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Computation and Language 37
+
+
+
+
+
+ ☆ No Preference Left Behind: Group Distributional Preference Optimization
+
+
+ Preferences within a group of people are not uniform but follow a
+distribution. While existing alignment methods like Direct Preference
+Optimization (DPO) attempt to steer models to reflect human preferences, they
+struggle to capture the distributional pluralistic preferences within a group.
+These methods often skew toward dominant preferences, overlooking the diversity
+of opinions, especially when conflicting preferences arise. To address this
+issue, we propose Group Distribution Preference Optimization (GDPO), a novel
+framework that aligns language models with the distribution of preferences
+within a group by incorporating the concept of beliefs that shape individual
+preferences. GDPO calibrates a language model using statistical estimation of
+the group's belief distribution and aligns the model with belief-conditioned
+preferences, offering a more inclusive alignment framework than traditional
+methods. In experiments using both synthetic controllable opinion generation
+and real-world movie review datasets, we show that DPO fails to align with the
+targeted belief distributions, while GDPO consistently reduces this alignment
+gap during training. Moreover, our evaluation metrics demonstrate that GDPO
+outperforms existing approaches in aligning with group distributional
+preferences, marking a significant advance in pluralistic alignment.
+
+
+
+
+
+
+
+ ☆ Scoring with Large Language Models: A Study on Measuring Empathy of
+ Responses in Dialogues
+
+
+
+
+
+
+
+
+ Henry J. Xie, Jinghan Zhang, Xinhao Zhang, Kunpeng Liu
+
+
+ In recent years, Large Language Models (LLMs) have become increasingly more
+powerful in their ability to complete complex tasks. One such task in which
+LLMs are often employed is scoring, i.e., assigning a numerical value from a
+certain scale to a subject. In this paper, we strive to understand how LLMs
+score, specifically in the context of empathy scoring. We develop a novel and
+comprehensive framework for investigating how effective LLMs are at measuring
+and scoring empathy of responses in dialogues, and what methods can be employed
+to deepen our understanding of LLM scoring. Our strategy is to approximate the
+performance of state-of-the-art and fine-tuned LLMs with explicit and
+explainable features. We train classifiers using various features of dialogues
+including embeddings, the Motivational Interviewing Treatment Integrity (MITI)
+Code, a set of explicit subfactors of empathy as proposed by LLMs, and a
+combination of the MITI Code and the explicit subfactors. Our results show that
+when only using embeddings, it is possible to achieve performance close to that
+of generic LLMs, and when utilizing the MITI Code and explicit subfactors
+scored by an LLM, the trained classifiers can closely match the performance of
+fine-tuned LLMs. We employ feature selection methods to derive the most crucial
+features in the process of empathy scoring. Our work provides a new perspective
+toward understanding LLM empathy scoring and helps the LLM community explore
+the potential of LLM scoring in social science studies.
+
+
+
+ comment: Accepted by IEEE BigData 2024
+
+
+
+
+
+
+ ☆ ComparisonQA: Evaluating Factuality Robustness of LLMs Through Knowledge
+ Frequency Control and Uncertainty
+
+
+ The rapid development of LLMs has sparked extensive research into their
+factual knowledge. Current works claim that LLMs fall short on questions
+requiring less frequent knowledge. However, this evidence is incomplete, since
+these works only study the influence of entity frequency, which cannot fully
+represent knowledge frequency. We therefore introduce the ComparisonQA
+benchmark, containing 283K abstract questions, each instantiated by a pair of
+high-frequency and low-frequency entities. This ensures a controllable
+comparison, because the difference in knowledge frequency within such a pair
+depends only on entity frequency. In addition, to avoid possible semantic
+shortcuts, a severe problem in current LLM studies, we design a two-round
+method for measuring knowledge robustness that uses both correctness and
+uncertainty. Experiments reveal that LLMs exhibit particularly low robustness
+on low-frequency knowledge, with GPT-4o performing worst under this
+measurement. We also introduce an automatic method to filter out low-quality
+questions and shortcuts to form ComparisonQA-Hard, and find that uncertainty
+effectively identifies such questions while maintaining the dataset size.
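A pair-level robustness score of the kind the abstract describes can be sketched as follows. The exact rule here, counting a pair only when both the high- and low-frequency instantiations are answered correctly with uncertainty under a threshold, is an illustrative assumption rather than the paper's precise metric.

```python
def pair_robustness(pairs, max_uncertainty=0.5):
    """Each pair holds results for the same abstract question instantiated
    with a high-frequency and a low-frequency entity, as (correct, uncertainty)
    tuples. A pair counts as robust only if both instantiations are answered
    correctly with uncertainty below the threshold."""
    robust = sum(
        1 for (hi_ok, hi_u), (lo_ok, lo_u) in pairs
        if hi_ok and lo_ok and hi_u < max_uncertainty and lo_u < max_uncertainty
    )
    return robust / len(pairs)

results = [
    ((True, 0.1), (True, 0.2)),   # robust on both frequencies
    ((True, 0.1), (False, 0.4)),  # fails the low-frequency variant
    ((True, 0.1), (True, 0.9)),   # correct but too uncertain
]
score = pair_robustness(results)
```

Because the two questions in a pair differ only in entity frequency, a drop in this score isolates frequency sensitivity from question difficulty.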
+
+
+
+
+
+
+
+ ☆ LLM Reasoning Engine: Specialized Training for Enhanced Mathematical
+ Reasoning
+
+
+ Large Language Models (LLMs) have shown remarkable performance in various
+natural language processing tasks but face challenges in mathematical
+reasoning, where complex problem-solving requires both linguistic understanding
+and mathematical reasoning skills. Existing approaches to address this
+challenge often rely on ensemble methods and suffer from the problem of data
+scarcity in target domains. In this work, we present a novel method to enhance
+LLMs' capabilities in mathematical reasoning tasks. Motivated by the need to
+bridge this gap, our approach incorporates a question paraphrase strategy,
+which aims at diversifying the linguistic forms of mathematical questions to
+improve generalization. Additionally, specialized training objectives are
+employed to guide the model's learning process, focusing on enhancing its
+understanding of mathematical concepts and reasoning processes. We conduct
+experiments on four datasets using different LLMs, and demonstrate the
+effectiveness of our approach in improving LLMs' performance on mathematical
+reasoning tasks. Our findings underscore the significance of our methodology in
+the advancement of large language models and its potential implications for
+real-world applications that require mathematical reasoning abilities.
+
+
+
+
+
+
+
+ ☆ AfriHG: News headline generation for African Languages ICLR 2024
+
+
+
+
+
+
+
+
+ Toyib Ogunremi, Serah Akojenu, Anthony Soronnadi, Olubayo Adekanmbi, David Ifeoluwa Adelani
+
+
+ This paper introduces AfriHG -- a news headline generation dataset created by
+combining the XLSum and MasakhaNEWS datasets, focusing on 16 languages widely
+spoken in Africa. We experimented with two seq2seq models (mT5-base and
+AfriTeVa V2) and the Aya-101 LLM. Our results show that Africa-centric seq2seq
+models such as AfriTeVa V2 outperform the massively multilingual mT5-base
+model. Finally, we show that fine-tuning AfriTeVa V2, with 313M parameters, is
+competitive with prompting the Aya-101 LLM, which has more than 13B parameters.
+
+
+
+ comment: Accepted to AfricaNLP Workshop at ICLR 2024
+
+
+
+
+
+
+ ☆ YAD: Leveraging T5 for Improved Automatic Diacritization of Yorùbá
+ Text ICLR 2024
+
+
+
+
+
+
+
+
+ Akindele Michael Olawole, Jesujoba O. Alabi, Aderonke Busayo Sakpere, David I. Adelani
+
+
+ In this work, we present the Yor\`ub\'a automatic diacritization (YAD)
+benchmark dataset for evaluating Yor\`ub\'a diacritization systems. In
+addition, we pre-train a text-to-text transformer (T5) model for Yor\`ub\'a
+and show that this model outperforms several multilingually trained T5 models.
+Lastly, we show that more data and larger models are better at diacritization
+for Yor\`ub\'a.
+
+
+
+ comment: Accepted at AfricaNLP Workshop at ICLR 2024
+
+
+
+
+
+
+ ☆ Decoding Emotion: Speech Perception Patterns in Individuals with
+ Self-reported Depression
+
+
+ The current study examines the relationship between self-reported depression
+and the perception of affective speech within the Indian population. PANAS and
+PHQ-9 were used to assess current mood and depression, respectively.
+Participants' emotional reactivity was recorded on a valence and arousal scale
+against the affective speech audio presented in a sequence. No significant
+differences between the depression and no-depression groups were observed for
+any of the emotional stimuli, except the audio file depicting neutral emotion.
+Significantly higher PANAS scores in the depression group than in the
+no-depression group indicate the impact of predisposed mood on current mood
+status.
+Contrary to previous findings, this study did not observe reduced positive
+emotional reactivity by the depression group. However, the results demonstrated
+consistency in emotional reactivity for speech stimuli depicting sadness and
+anger across all measures of emotion perception.
+
+
+
+
+
+
+
+ ☆ Building a Rich Dataset to Empower the Persian Question Answering
+ Systems
+
+
+ Question answering systems provide short, precise, and specific answers to
+questions. So far, many robust question answering systems have been developed
+for English, while languages with fewer resources, like Persian, have few
+standard datasets. In this study, a comprehensive open-domain dataset for
+Persian, called NextQuAD, is presented; it has 7,515 contexts, including
+23,918 questions and answers. A BERT-based question answering model has been
+applied to this dataset using two pre-trained language models, ParsBERT and
+XLM-RoBERTa, and the results of the two models have been ensembled using mean
+logits. Evaluation on the development set shows 0.95 Exact Match (EM) and 0.97
+F1-score. Also, to compare NextQuAD with other Persian datasets, the model
+trained on NextQuAD is evaluated on two other datasets, PersianQA and
+ParSQuAD. Comparisons show that the proposed model increases EM by 0.39 on
+PersianQA and by 0.14 on ParSQuAD-manual, with a slight EM decline of 0.007 on
+ParSQuAD-automatic.
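Mean-logit ensembling of two extractive QA models can be sketched in a few lines: average the start and end logits, then take the argmax positions. The helper name and the toy logits below are hypothetical.

```python
import numpy as np

def ensemble_span(start_a, end_a, start_b, end_b):
    """Average the start/end logits of two extractive QA models (e.g. one
    ParsBERT-based and one XLM-RoBERTa-based) and take the argmax positions
    as the predicted answer span."""
    start = (np.asarray(start_a, float) + np.asarray(start_b, float)) / 2
    end = (np.asarray(end_a, float) + np.asarray(end_b, float)) / 2
    return int(start.argmax()), int(end.argmax())

# Toy logits over a 3-token context from two hypothetical models.
span = ensemble_span([0.1, 2.0, 0.3], [0.0, 0.2, 1.5],
                     [0.4, 1.8, 0.1], [0.1, 0.3, 2.0])
```

Averaging raw logits (rather than voting on spans) lets a confident model outweigh an uncertain one at each token position.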
+
+
+
+
+
+
+
+ ☆ Efficient Multi-Agent Collaboration with Tool Use for Online Planning in
+ Complex Table Question Answering
+
+
+ Complex table question answering (TQA) aims to answer questions that require
+complex reasoning, such as multi-step or multi-category reasoning, over data
+represented in tabular form. Previous approaches demonstrated notable
+performance by leveraging either closed-source large language models (LLMs) or
+fine-tuned open-weight LLMs. However, fine-tuning LLMs requires high-quality
+training data, which is costly to obtain, and utilizing closed-source LLMs
+poses accessibility challenges and leads to reproducibility issues. In this
+paper, we propose Multi-Agent Collaboration with Tool use (MACT), a framework
+that requires neither closed-source models nor fine-tuning. In MACT, a planning
+agent and a coding agent that also make use of tools collaborate to answer
+questions. Our experiments on four TQA benchmarks show that MACT outperforms
+previous SoTA systems on three out of four benchmarks and that it performs
+comparably to the larger and more expensive closed-source model GPT-4 on two
+benchmarks, even when using only open-weight models without any fine-tuning. We
+conduct extensive analyses to prove the effectiveness of MACT's multi-agent
+collaboration in TQA.
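The planner/coder collaboration can be caricatured as a bounded loop in which a planning agent proposes the next step and a coding agent turns it into Python executed against the table as a tool. All interfaces below are hypothetical, with stub functions standing in for the two LLM agents.

```python
def answer_table_question(table, question, plan_fn, code_fn, max_steps=5):
    """Bounded planner/coder loop: the planning agent proposes the next step,
    the coding agent turns it into Python run against the table (tool use),
    and the loop ends when the planner signals the answer is ready."""
    state = {"table": table}
    for _ in range(max_steps):
        step = plan_fn(question, state)
        if step == "DONE":
            break
        exec(code_fn(step), {}, state)  # executed code reads/writes `state`
    return state.get("result")

# Stub agents standing in for the two LLMs: one fixed plan, one fixed program.
table = {"year": [2021, 2022], "sales": [10, 32]}
plan = iter(["sum sales", "DONE"])
result = answer_table_question(
    table, "What are total sales?",
    plan_fn=lambda q, s: next(plan),
    code_fn=lambda step: "result = sum(table['sales'])",
)
```

Keeping the tool a plain Python interpreter is what lets the framework run with open-weight models and no fine-tuning.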
+
+
+
+
+
+
+
+
+ Zhaopeng Feng, Jiayuan Su, Jiamei Zheng, Jiahan Ren, Yan Zhang, Jian Wu, Hongwei Wang, Zuozhu Liu
+
+
+ Recent advancements in large language models (LLMs) have given rise to the
+LLM-as-a-judge paradigm, showcasing their potential to deliver human-like
+judgments. However, in the field of machine translation (MT) evaluation,
+current LLM-as-a-judge methods fall short of learned automatic metrics. In this
+paper, we propose Multidimensional Multi-Agent Debate (M-MAD), a systematic
+LLM-based multi-agent framework for advanced LLM-as-a-judge MT evaluation. Our
+findings demonstrate that M-MAD achieves significant advancements by (1)
+decoupling heuristic MQM criteria into distinct evaluation dimensions for
+fine-grained assessments; (2) employing multi-agent debates to harness the
+collaborative reasoning capabilities of LLMs; (3) synthesizing
+dimension-specific results into a final evaluation judgment to ensure robust
+and reliable outcomes. Comprehensive experiments show that M-MAD not only
+outperforms all existing LLM-as-a-judge methods but also competes with
+state-of-the-art reference-based automatic metrics, even when powered by a
+suboptimal model like GPT-4o mini. Detailed ablations and analysis highlight
+the superiority of our framework design, offering a fresh perspective for
+the LLM-as-a-judge paradigm. Our code and data are publicly available at
+https://github.com/SU-JIAYUAN/M-MAD.
+
+
+
+ comment: Work in progress. Code and data are available at
+ https://github.com/SU-JIAYUAN/M-MAD
+
+
+
+
+
+
+ ☆ Extract Information from Hybrid Long Documents Leveraging LLMs: A
+ Framework and Dataset ICASSP 2025
+
+
+ Large Language Models (LLMs) demonstrate exceptional performance in textual
+understanding and tabular reasoning tasks. However, their ability to comprehend
+and analyze hybrid text, containing textual and tabular data, remains
+unexplored. The hybrid text often appears in the form of hybrid long documents
+(HLDs), which far exceed the token limit of LLMs. Consequently, we apply an
+Automated Information Extraction framework (AIE) to enable LLMs to process the
+HLDs and carry out experiments to analyse four important aspects of information
+extraction from HLDs. Our findings cover: 1) an effective way to select and
+summarize the useful parts of an HLD; 2) a simple table serialization that is
+enough for LLMs to understand tables; 3) the adaptability of the naive AIE in
+many complex scenarios; and 4) useful prompt engineering to enhance LLMs on
+HLDs. To
+address the issue of dataset scarcity in HLDs and support future work, we also
+propose the Financial Reports Numerical Extraction (FINE) dataset. The dataset
+and code are publicly available in the attachments.
+
+
+
+ comment: ICASSP 2025
+
+
+
+
+
+
+ ☆ On the Compositional Generalization of Multimodal LLMs for Medical
+ Imaging
+
+
+ Multimodal large language models (MLLMs) hold significant potential in the
+medical field, but their capabilities are often limited by insufficient data in
+certain medical domains, highlighting the need for understanding what kinds of
+images can be used by MLLMs for generalization. Current research suggests that
+multi-task training outperforms single-task as different tasks can benefit each
+other, but they often overlook the internal relationships within these tasks,
+providing limited guidance on selecting datasets to enhance specific tasks. To
+analyze this phenomenon, we attempted to employ compositional generalization
+(CG)-the ability of models to understand novel combinations by recombining
+learned elements-as a guiding framework. Medical images can be precisely
+defined by Modality, Anatomical area, and Task, naturally providing an
+environment for exploring CG. We therefore assembled 106 medical datasets to
+create Med-MAT for comprehensive experiments. The experiments confirmed that
+MLLMs can use CG to understand unseen medical images and identified CG as one
+of the main drivers of the generalization observed in multi-task training.
+Additionally, further studies demonstrated that CG effectively supports
+datasets with limited data and delivers consistent performance across different
+backbones, highlighting its versatility and broad applicability. Med-MAT is
+publicly available at https://github.com/FreedomIntelligence/Med-MAT.
+
+
+
+
+
+
+
+ ☆ The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based
+ Markers for Mental Health Support
+
+
+
+
+
+
+
+
+ Alessandro De Grandi, Federico Ravenda, Andrea Raballo, Fabio Crestani
+
+
+ The increasing demand for mental health services has highlighted the need for
+innovative solutions, particularly in the realm of psychological conversational
+AI, where the availability of sensitive data is scarce. In this work, we
+explored the development of a system tailored for mental health support with a
+novel approach to psychological assessment based on explainable emotional
+profiles in combination with empathetic conversational models, offering a
+promising tool for augmenting traditional care, particularly where immediate
+expertise is unavailable. Our work can be divided into two main parts,
+intrinsically connected to each other. First, we present RACLETTE, a
+conversational system that demonstrates superior emotional accuracy compared to
+state-of-the-art benchmarks in both understanding users' emotional states and
+generating empathetic responses during conversations, while progressively
+building an emotional profile of the user through their interactions. Second,
+we show how the emotional profiles of a user can be used as interpretable
+markers for mental health assessment. These profiles can be compared with
+characteristic emotional patterns associated with different mental disorders,
+providing a novel approach to preliminary screening and support.
+
+
+
+
+
+
+
+ ☆ Comparative Analysis of Listwise Reranking with Large Language Models in
+ Limited-Resource Language Contexts
+
+
+ Large Language Models (LLMs) have demonstrated significant effectiveness
+across various NLP tasks, including text ranking. This study assesses the
+performance of LLMs in listwise reranking for
+limited-resource African languages. We compare proprietary models RankGPT3.5,
+Rank4o-mini, RankGPTo1-mini and RankClaude-sonnet in cross-lingual contexts.
+Results indicate that these LLMs significantly outperform traditional baseline
+methods such as BM25-DT in most evaluation metrics, particularly in nDCG@10 and
+MRR@100. These findings highlight the potential of LLMs in enhancing reranking
+tasks for low-resource languages and offer insights into cost-effective
+solutions.
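Listwise reranking with an LLM amounts to numbering the candidate passages in a prompt and parsing the returned identifier ordering. The prompt wording and the `[2] > [1] > [3]` response format below are assumptions modeled on common RankGPT-style templates, not the exact ones used in the study.

```python
import re

def build_listwise_prompt(query, passages):
    """Listwise reranking prompt: number the candidates and ask the model to
    output identifiers in relevance order (illustrative wording)."""
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Rank the passages below by relevance to the query.\n"
        f"Query: {query}\n{numbered}\n"
        "Answer with identifiers only, e.g. [2] > [1] > [3]."
    )

def parse_ranking(model_output, num_passages):
    """Extract a permutation from a '[2] > [1] > [3]'-style response,
    dropping out-of-range and duplicate identifiers."""
    seen, order = set(), []
    for tok in re.findall(r"\[(\d+)\]", model_output):
        i = int(tok) - 1
        if 0 <= i < num_passages and i not in seen:
            seen.add(i)
            order.append(i)
    return order

# A (deliberately malformed) model response with a duplicate identifier.
order = parse_ranking("[2] > [3] > [2] > [1]", 3)
```

Defensive parsing matters in practice, since LLM rankers routinely emit duplicates or out-of-range identifiers that would otherwise corrupt nDCG@10 and MRR@100 computations.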
+
+
+
+
+
+
+
+ ☆ "My life is miserable, have to sign 500 autographs everyday": Exposing
+ Humblebragging, the Brags in Disguise
+
+
+ Humblebragging is a phenomenon where individuals present self-promotional
+statements under the guise of modesty or complaints. For example, a statement
+like, "Ugh, I can't believe I got promoted to lead the entire team. So
+stressful!", subtly highlights an achievement while pretending to be
+complaining. Detecting humblebragging is important for machines to better
+understand the nuances of human language, especially in tasks like sentiment
+analysis and intent recognition. However, this topic has not yet been studied
+in computational linguistics. For the first time, we introduce the task of
+automatically detecting humblebragging in text. We formalize the task by
+proposing a 4-tuple definition of humblebragging and evaluate machine learning,
+deep learning, and large language models (LLMs) on this task, comparing their
+performance with humans. We also create and release a dataset called HB24,
+containing 3,340 humblebrags generated using GPT-4o. Our experiments show that
+detecting humblebragging is non-trivial, even for humans. Our best model
+achieves an F1-score of 0.88. This work lays the foundation for further
+exploration of this nuanced linguistic phenomenon and its integration into
+broader natural language understanding systems.
+
+
+
+ comment: Under review at ARR
+
+
+
+
+
+
+ ☆ STAYKATE: Hybrid In-Context Example Selection Combining
+ Representativeness Sampling and Retrieval-based Approach -- A Case Study on
+ Science Domains
+
+
+ Large language models (LLMs) demonstrate the ability to learn in-context,
+offering a potential solution for scientific information extraction, which
+often contends with challenges such as insufficient training data and the high
+cost of annotation processes. Given that the selection of in-context examples
+can significantly impact performance, it is crucial to design a proper method
+for selecting effective ones. In this paper, we propose STAYKATE, a
+static-dynamic hybrid selection method that combines the principles of
+representativeness sampling from active learning with the prevalent
+retrieval-based approach. The results across three domain-specific datasets
+indicate that STAYKATE outperforms both the traditional supervised methods and
+existing selection methods. The enhancement in performance is particularly
+pronounced for entity types that pose challenges to other methods.
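The static-dynamic hybrid can be sketched as below: static examples are chosen as the most representative of the pool (here, closest to its centroid, a stand-in for representativeness sampling), while dynamic examples are retrieved by cosine similarity to the test input. This is a simplified sketch under assumed embeddings, not the authors' implementation.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def staykate_select(pool_embs, test_emb, k_static=1, k_dynamic=1):
    """Static-dynamic hybrid in-context example selection: static picks sit
    near the pool centroid (representativeness), dynamic picks are retrieved
    by similarity to the test input; the two sets are concatenated."""
    pool_embs = np.asarray(pool_embs, dtype=float)
    test_emb = np.asarray(test_emb, dtype=float)
    centroid = pool_embs.mean(axis=0)
    static = sorted(range(len(pool_embs)),
                    key=lambda i: -cosine(pool_embs[i], centroid))[:k_static]
    remaining = [i for i in range(len(pool_embs)) if i not in static]
    dynamic = sorted(remaining,
                     key=lambda i: -cosine(pool_embs[i], test_emb))[:k_dynamic]
    return static + dynamic

# Pool of three example embeddings; the test input is closest to the third.
sel = staykate_select([[1, 0], [0.9, 0.1], [0, 1]], [0, 1])
```

The static picks are computed once per dataset, while the dynamic picks change per test instance, which is what makes the method a static-dynamic hybrid.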
+
+
+
+
+
+
+
+ ☆ BaiJia: A Large Scale Role-Playing Agent Corpus of Chinese Historical
+ Characters
+
+
+ We introduce a comprehensive large-scale role-playing agent corpus, termed
+BaiJia, that comprises various Chinese historical characters. This corpus is
+noteworthy for being the pioneering compilation of low-resource data that can
+be utilized by large language models (LLMs) to build AI-driven historical
+role-playing agents. BaiJia addresses the challenge of fragmented historical
+textual records in different forms and modalities, integrating various
+characters' information, including their biographies, literary works, family
+relations, historical events, and so on. We conduct extensive experiments to
+demonstrate the effectiveness of our BaiJia agent corpus in bolstering the
+role-playing abilities of various foundational LLMs, and promoting the
+development and assessment of LLMs in the context of historical role-playing
+tasks. The agent corpus is available at baijia.online.
+
+
+
+
+
+
+
+ ☆ OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction
+ System
+
+
+
+
+
+
+
+
+ Yujie Luo, Xiangyuan Ru, Kangwei Liu, Lin Yuan, Mengshu Sun, Ningyu Zhang, Lei Liang, Zhiqiang Zhang, Jun Zhou, Lanning Wei, Da Zheng, Haofen Wang, Huajun Chen
+
+
+ We introduce OneKE, a dockerized schema-guided knowledge extraction system,
+which can extract knowledge from the Web and raw PDF Books, and support various
+domains (science, news, etc.). Specifically, we design OneKE with multiple
+agents and a configurable knowledge base. Different agents perform their
+respective roles, enabling support for various extraction scenarios. The
+configurable knowledge base facilitates schema configuration, error-case
+debugging and correction, further improving performance. Empirical evaluations on
+benchmark datasets demonstrate OneKE's efficacy, while case studies further
+elucidate its adaptability to diverse tasks across multiple domains,
+highlighting its potential for broad applications. We have open-sourced the
+Code at https://github.com/zjunlp/OneKE and released a Video at
+http://oneke.openkg.cn/demo.mp4.
+
+
+
+ comment: Work in progress
+
+
+
+
+
+
+ ☆ From Generalist to Specialist: A Survey of Large Language Models for
+ Chemistry COLING2025
+
+
+
+
+
+
+
+
+ Yang Han, Ziping Wan, Lu Chen, Kai Yu, Xin Chen
+
+
+ Large Language Models (LLMs) have significantly transformed our daily life
+and established a new paradigm in natural language processing (NLP). However,
+the predominant pretraining of LLMs on extensive web-based texts remains
+insufficient for advanced scientific discovery, particularly in chemistry. The
+scarcity of specialized chemistry data, coupled with the complexity of
+multi-modal data such as 2D graphs, 3D structures, and spectra, presents
+distinct challenges. Although several studies have reviewed Pretrained
+Language Models (PLMs) in chemistry, there is a conspicuous absence of a
+systematic survey specifically focused on chemistry-oriented LLMs. In this
+paper, we outline methodologies for incorporating domain-specific chemistry
+knowledge and multi-modal information into LLMs; we also conceptualize
+chemistry LLMs as agents using chemistry tools and investigate their potential
+to accelerate scientific research. Additionally, we summarize the existing
+benchmarks for evaluating the chemistry ability of LLMs. Finally, we
+critically examine the current
+challenges and identify promising directions for future research. Through this
+comprehensive survey, we aim to assist researchers in staying at the forefront
+of developments in chemistry LLMs and to inspire innovative applications in the
+field.
+
+
+
+
+
+
+
+ ☆ Bridging Context Gaps: Enhancing Comprehension in Long-Form Social
+ Conversations Through Contextualized Excerpts COLING 2025
+
+
+
+
+
+
+
+
+ Shrestha Mohanty, Sarah Xuan, Jacob Jobraeel, Anurag Kumar, Deb Roy, Jad Kabbara
+
+
+ We focus on enhancing comprehension in small-group recorded conversations,
+which serve as a medium to bring people together and provide a space for
+sharing personal stories and experiences on crucial social matters. One way to
+parse and convey information from these conversations is by sharing highlighted
+excerpts in subsequent conversations. This can help promote a collective
+understanding of relevant issues, by highlighting perspectives and experiences
+to other groups of people who might otherwise be unfamiliar with and thus
+unable to relate to these experiences. The primary challenge that arises then
+is that excerpts taken from one conversation and shared in another setting
+might be missing crucial context or key elements that were previously
+introduced in the original conversation. This problem is exacerbated when
+conversations become lengthier and richer in themes and shared experiences. To
+address this, we explore how Large Language Models (LLMs) can enrich these
+excerpts by providing socially relevant context. We present approaches for
+effective contextualization to improve comprehension, readability, and empathy.
+We show significant improvements in understanding, as assessed through
+subjective and objective evaluations. While LLMs can offer valuable context,
+they struggle with capturing key social aspects. We release the Human-annotated
+Salient Excerpts (HSE) dataset to support future work. Additionally, we show
+how context-enriched excerpts can provide more focused and comprehensive
+conversation summaries.
+
+
+
+ comment: Accepted at COLING 2025
+
+
+
+
+
+
+ ☆ Children's Acquisition of Tail-recursion Sequences: A Review of Locative
+ Recursion and Possessive Recursion as Examples
+
+
+ Recursion is a core property of human natural language. Since Chomsky
+proposed generative grammar, many scholars have studied recursion either
+theoretically or empirically. By observing children's acquisition of
+tail-recursion sequences, we can test the nativism of language supported by
+universal grammar and reveal the cognitive mechanisms of the human brain. To
+date, our understanding of children's acquisition path for recursion and its
+influencing factors remains controversial. This systematic review summarizes
+research on tail-recursive sequences, taking possessive recursion and locative
+recursion as examples, and focuses on the experimental methods, acquisition
+paths, and influencing factors of tail-recursive sequences. Current
+behavioural experiments reveal that the debate about children's performance
+revolves around: 1) gradual versus synchronous acquisition; and 2) symmetry
+versus asymmetry between the acquisition of locative recursion sequences and
+possessive recursion sequences. We presume that children can acquire recursion
+quickly, in a short period of time, thanks to the language acquisition device,
+though some scholars believe that a third factor also plays a role.
+
+
+
+ comment: 32 pages, 5 figures
+
+
+
+
+
+
+ ♻ ☆ IITR-CIOL@NLU of Devanagari Script Languages 2025: Multilingual Hate
+ Speech Detection and Target Identification in Devanagari-Scripted Languages COLING 2025
+
+
+ This work focuses on two subtasks related to hate speech detection and target
+identification in Devanagari-scripted languages, specifically Hindi, Marathi,
+Nepali, Bhojpuri, and Sanskrit. Subtask B involves detecting hate speech in
+online text, while Subtask C requires identifying the specific targets of hate
+speech, such as individuals, organizations, or communities. We propose the
+MultilingualRobertaClass model, a deep neural network built on the pretrained
+multilingual transformer model ia-multilingual-transliterated-roberta,
+optimized for classification tasks in multilingual and transliterated contexts.
+The model leverages contextualized embeddings to handle linguistic diversity,
+with a classifier head for binary classification. We achieved 88.40% accuracy
+in Subtask B and 66.11% accuracy in Subtask C on the test set.
+
+
+
+ comment: Accepted to CHiPSAL Workshop at COLING 2025
+
+ Feature generation can significantly enhance learning outcomes, particularly
+for tasks with limited data. An effective way to improve feature generation is
+by expanding the current feature space using existing features and enriching
+the informational content. However, generating new, interpretable features in
+application fields often requires domain-specific knowledge about the existing
+features. This paper introduces a new method, RAFG, for generating reasonable and
+explainable features specific to domain classification tasks. To generate new
+features with interpretability in domain knowledge, we perform information
+retrieval on existing features to identify potential feature associations, and
+utilize these associations to generate meaningful features. Furthermore, we
+develop a Large Language Model (LLM)-based framework for feature generation
+with reasoning to verify and filter features during the generation process.
+Experiments across several datasets in medical, economic, and geographic
+domains show that our RAFG method produces high-quality, meaningful features
+and significantly improves classification performance compared with baseline
+methods.
+
+
+
+
+
+
+
+
+ Core Francisco Park, Ekdeep Singh Lubana, Itamar Pres, Hidenori Tanaka
+
+
+ In-Context Learning (ICL) has significantly expanded the general-purpose
+nature of large language models, allowing them to adapt to novel tasks using
+merely the inputted context. This has motivated a series of papers that analyze
+tractable synthetic domains and postulate precise mechanisms that may underlie
+ICL. However, the use of relatively distinct setups that often lack a
+sequence-modeling nature makes it unclear how general the reported insights from
+such studies are. Motivated by this, we propose a synthetic sequence modeling
+task that involves learning to simulate a finite mixture of Markov chains. As
+we show, models trained on this task reproduce most well-known results on ICL,
+hence offering a unified setting for studying the concept. Building on this
+setup, we demonstrate we can explain a model's behavior by decomposing it into
+four broad algorithms that combine a fuzzy retrieval vs. inference approach
+with either unigram or bigram statistics of the context. These algorithms
+engage in a competition dynamics to dominate model behavior, with the precise
+experimental conditions dictating which algorithm ends up superseding others:
+e.g., we find merely varying context size or amount of training yields (at
+times sharp) transitions between which algorithm dictates the model behavior,
+revealing a mechanism that explains the transient nature of ICL. In this sense,
+we argue ICL is best thought of as a mixture of different algorithms, each with
+its own peculiarities, instead of a monolithic capability. This also implies
+that making general claims about ICL that hold universally across all settings
+may be infeasible.
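The synthetic task can be reproduced in a few lines: each training sequence draws one transition matrix from the mixture prior, then rolls it out. This is a minimal data generator consistent with the setup described; the specific chains and sizes below are hypothetical choices.

```python
import numpy as np

def sample_mixture_sequences(chains, priors, seq_len, n_seqs, seed=0):
    """Each sequence draws one row-stochastic transition matrix from the
    mixture prior and rolls it out from a uniformly random start state."""
    rng = np.random.default_rng(seed)
    n_states = chains[0].shape[0]
    data = np.empty((n_seqs, seq_len), dtype=int)
    for s in range(n_seqs):
        T = chains[rng.choice(len(chains), p=priors)]
        x = int(rng.integers(n_states))
        for t in range(seq_len):
            data[s, t] = x
            x = int(rng.choice(n_states, p=T[x]))
    return data

chains = [np.array([[0.9, 0.1], [0.1, 0.9]]),   # "sticky" chain
          np.array([[0.1, 0.9], [0.9, 0.1]])]   # "alternating" chain
data = sample_mixture_sequences(chains, [0.5, 0.5], seq_len=16, n_seqs=4)
```

A sequence model trained on such data must infer in context which chain generated the prefix (retrieval) or fall back on the context's unigram/bigram statistics (inference), which is exactly the algorithmic competition the abstract analyzes.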
+
+
+
+ comment: Preprint. Under review
+
+
+
+
+
+
+ ♻ ☆ Demystifying CLIP Data
+
+
+
+
+
+
+
+
+ Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer
+
+
+ Contrastive Language-Image Pre-training (CLIP) is an approach that has
+advanced research and applications in computer vision, fueling modern
+recognition systems and generative models. We believe that the main ingredient
+to the success of CLIP is its data and not the model architecture or
+pre-training objective. However, CLIP only provides very limited information
+about its data and how it has been collected, leading to works that aim to
+reproduce CLIP's data by filtering with its model parameters. In this work, we
+intend to reveal CLIP's data curation approach and in our pursuit of making it
+open to the community introduce Metadata-Curated Language-Image Pre-training
+(MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's
+concepts) and yields a balanced subset over the metadata distribution. Our
+experimental study rigorously isolates the model and training settings,
+concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M
+image-text data pairs outperforms CLIP's data on multiple standard benchmarks.
+In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy,
+surpassing CLIP's 68.3% on ViT-B models. Scaling to 1B data, while maintaining
+the same training budget, attains 72.4%. Our observations hold across various
+model sizes, exemplified by ViT-H achieving 80.5%, without any
+bells-and-whistles. Curation code and training data distribution on metadata is
+made available at https://github.com/facebookresearch/MetaCLIP.
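The curation recipe can be caricatured as two steps: match each caption against metadata entries by substring, then balance by capping how many pairs any single entry may contribute. This is a simplified sketch with a hard per-entry cap; the released pipeline balances probabilistically over entry counts, so treat the details below as assumptions.

```python
from collections import defaultdict
import random

def balanced_subset(texts, metadata, cap, seed=0):
    """MetaCLIP-style curation sketch: bucket captions by matched metadata
    entry, then keep at most `cap` captions per entry, flattening the head of
    the distribution; unmatched captions are dropped."""
    buckets = defaultdict(list)
    for i, text in enumerate(texts):
        for entry in metadata:
            if entry in text:  # substring match against a metadata entry
                buckets[entry].append(i)
    rng = random.Random(seed)
    keep = set()
    for entry, idxs in buckets.items():
        rng.shuffle(idxs)
        keep.update(idxs[:cap])
    return sorted(keep)

texts = ["a photo of a dog", "dog on grass", "dog in snow", "a red car"]
kept = balanced_subset(texts, ["dog", "car"], cap=2)
```

The cap is what yields "a balanced subset over the metadata distribution": frequent concepts like "dog" are downsampled while rare concepts like "car" survive intact.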
+
+
+
+ comment: 17 pages. arXiv admin note: text overlap with arXiv:2103.00020 by
+ other authors
+
+
+
+
+
+
+
+ Hu Xu, Po-Yao Huang, Xiaoqing Ellen Tan, Ching-Feng Yeh, Jacob Kahn, Christine Jou, Gargi Ghosh, Omer Levy, Luke Zettlemoyer, Wen-tau Yih, Shang-Wen Li, Saining Xie, Christoph Feichtenhofer
+
+
+ This paper focuses on creating synthetic data to improve the quality of image
+captions. Existing works typically have two shortcomings. First, they caption
+images from scratch, ignoring existing alt-text metadata, and second, they
+lack transparency when the captioners' training data (e.g. GPT) is unknown. In this
+paper, we study a principled approach Altogether based on the key idea to edit
+and re-align existing alt-texts associated with the images. To generate
+training data, we perform human annotation where annotators start with the
+existing alt-text and re-align it to the image content in multiple rounds,
+consequently constructing captions with rich visual concepts. This differs from
+prior work that carries out human annotation as a one-time description task
+solely based on images and annotator knowledge. We train a captioner on this
+data that generalizes the process of re-aligning alt-texts at scale. Our
+results show our Altogether approach leads to richer image captions that also
+improve text-to-image generation and zero-shot image classification tasks.
+
+
+
+ comment: accepted by EMNLP 2024; Meta CLIP 1.2 Data Engine
+
+
+
+
+
+
+ ♻ ☆ A Measure of the System Dependence of Automated Metrics
+
+
+
+
+
+
+
+
+ Pius von Däniken, Jan Deriu, Mark Cieliebak
+
+
+ Automated metrics for Machine Translation have made significant progress,
+with the goal of replacing expensive and time-consuming human evaluations.
+These metrics are typically assessed by their correlation with human judgments,
+which captures the monotonic relationship between human and metric scores.
+However, we argue that it is equally important to ensure that metrics treat all
+systems fairly and consistently. In this paper, we introduce a method to
+evaluate this aspect.
+
+
+
+
+
+
+
+ ♻ ☆ Out-of-distribution generalization via composition: a lens through
+ induction heads in Transformers
+
+
+ Large language models (LLMs) such as GPT-4 sometimes appear to be creative,
+solving novel tasks often with a few demonstrations in the prompt. These tasks
+require the models to generalize on distributions different from those from
+training data -- which is known as out-of-distribution (OOD) generalization.
+Despite the tremendous success of LLMs, how they approach OOD generalization
+remains an open and underexplored question. We examine OOD generalization in
+settings where instances are generated according to hidden rules, including
+in-context learning with symbolic reasoning. Models are required to infer the
+hidden rules behind input prompts without any fine-tuning.
+ We empirically examined the training dynamics of Transformers on a synthetic
+example and conducted extensive experiments on a variety of pretrained LLMs,
+focusing on a type of components known as induction heads. We found that OOD
+generalization and composition are tied together -- models can learn rules by
+composing two self-attention layers, thereby achieving OOD generalization.
+Furthermore, a shared latent subspace in the embedding (or feature) space acts
+as a bridge for composition by aligning early layers and later layers, which we
+refer to as the common bridge representation hypothesis.
+
+
+
+ comment: 46 pages, 27 figures
+
+
+
+
+
+
+ ♻ ☆ Entering Real Social World! Benchmarking the Social Intelligence of
+ Large Language Models from a First-person Perspective
+
+
+ Social intelligence is built upon three foundational pillars: cognitive
+intelligence, situational intelligence, and behavioral intelligence. As large
+language models (LLMs) become increasingly integrated into our social lives,
+understanding, evaluating, and developing their social intelligence are
becoming increasingly important. While multiple existing works have
investigated the social intelligence of LLMs, (1) most focus on a specific
aspect, so the social intelligence of LLMs has yet to be systematically
organized and studied; (2) most position LLMs as passive observers from a
third-person perspective, such as in Theory of Mind (ToM) tests, whereas
ego-centric, first-person evaluation aligns better with actual LLM-based agent
use scenarios; and (3) behavioral intelligence lacks comprehensive evaluation,
especially in critical human-machine interaction scenarios. In light of this,
+we present EgoSocialArena, a novel framework grounded in the three pillars of
social intelligence: cognitive, situational, and behavioral intelligence, aimed
at systematically evaluating the social intelligence of LLMs from a first-person
perspective. With EgoSocialArena, we have conducted a comprehensive evaluation
of eight prominent foundation models; even the most advanced LLMs, such as
o1-preview, lag behind human performance by 11.0 points.
+
+
+
+ comment: 14 pages, 6 figures
+
+
+
+
+
+
+ ♻ ☆ Can Watermarked LLMs be Identified by Users via Crafted Prompts?
+
+
+
+
+
+
+
+
+ Aiwei Liu, Sheng Guan, Yiming Liu, Leyi Pan, Yifei Zhang, Liancheng Fang, Lijie Wen, Philip S. Yu, Xuming Hu
+
+
+ Text watermarking for Large Language Models (LLMs) has made significant
+progress in detecting LLM outputs and preventing misuse. Current watermarking
+techniques offer high detectability, minimal impact on text quality, and
robustness to text editing. However, current research lacks investigation into
the imperceptibility of watermarking techniques in LLM services. This is
+crucial as LLM providers may not want to disclose the presence of watermarks in
+real-world scenarios, as it could reduce user willingness to use the service
+and make watermarks more vulnerable to attacks. This work is the first to
+investigate the imperceptibility of watermarked LLMs. We design an
+identification algorithm called Water-Probe that detects watermarks through
+well-designed prompts to the LLM. Our key motivation is that current
+watermarked LLMs expose consistent biases under the same watermark key,
+resulting in similar differences across prompts under different watermark keys.
+Experiments show that almost all mainstream watermarking algorithms are easily
+identified with our well-designed prompts, while Water-Probe demonstrates a
+minimal false positive rate for non-watermarked LLMs. Finally, we propose that
+the key to enhancing the imperceptibility of watermarked LLMs is to increase
+the randomness of watermark key selection. Based on this, we introduce the
+Water-Bag strategy, which significantly improves watermark imperceptibility by
+merging multiple watermark keys.
+
+
+
+ comment: 30 pages, 5 figures, 11 tables
+
+
+
+
+
+
+ ♻ ☆ Face-StyleSpeech: Enhancing Zero-shot Speech Synthesis from Face Images
+ with Improved Face-to-Speech Mapping ICASSP 2025
+
+
+ Generating speech from a face image is crucial for developing virtual humans
+capable of interacting using their unique voices, without relying on
+pre-recorded human speech. In this paper, we propose Face-StyleSpeech, a
+zero-shot Text-To-Speech (TTS) synthesis model that generates natural speech
+conditioned on a face image rather than reference speech. We hypothesize that
+learning entire prosodic features from a face image poses a significant
+challenge. To address this, our TTS model incorporates both face and prosody
+encoders. The prosody encoder is specifically designed to model speech style
+characteristics that are not fully captured by the face image, allowing the
+face encoder to focus on extracting speaker-specific features such as timbre.
+Experimental results demonstrate that Face-StyleSpeech effectively generates
+more natural speech from a face image than baselines, even for unseen faces.
+Samples are available on our demo page.
+
+
+
+ comment: Accepted by ICASSP 2025
+
+
+
+
+
+
+ ♻ ☆ ReZG: Retrieval-Augmented Zero-Shot Counter Narrative Generation for
+ Hate Speech
+
+
+ The proliferation of hate speech (HS) on social media poses a serious threat
+to societal security. Automatic counter narrative (CN) generation, as an active
+strategy for HS intervention, has garnered increasing attention in recent
+years. Existing methods for automatically generating CNs mainly rely on
+re-training or fine-tuning pre-trained language models (PLMs) on human-curated
+CN corpora. Unfortunately, the annotation speed of CN corpora cannot keep up
+with the growth of HS targets, while generating specific and effective CNs for
+unseen targets remains a significant challenge for the model. To tackle this
+issue, we propose Retrieval-Augmented Zero-shot Generation (ReZG) to generate
+CNs with high-specificity for unseen targets. Specifically, we propose a
+multi-dimensional hierarchical retrieval method that integrates stance,
semantics, and fitness, extending the retrieval metric from a single dimension
to multiple dimensions suited to the knowledge that refutes HS. Then, we
+implement an energy-based constrained decoding mechanism that enables PLMs to
+use differentiable knowledge preservation, countering, and fluency constraint
+functions instead of in-target CNs as control signals for generation, thereby
+achieving zero-shot CN generation. With the above techniques, ReZG can
+integrate external knowledge flexibly and improve the specificity of CNs.
+Experimental results show that ReZG exhibits stronger generalization
capabilities and outperforms strong baselines, with significant improvements of
over 2.0% in relevance and over 4.5% in countering success rate.
+
+
+
+
+
+
+
+ ♻ ☆ A Comprehensive Survey of Small Language Models in the Era of Large
+ Language Models: Techniques, Enhancements, Applications, Collaboration with
+ LLMs, and Trustworthiness
+
+
+
+
+
+
+
+
+ Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang
+
+
+ Large language models (LLMs) have demonstrated emergent abilities in text
+generation, question answering, and reasoning, facilitating various tasks and
+domains. Despite their proficiency in various tasks, LLMs like PaLM 540B and
+Llama-3.1 405B face limitations due to large parameter sizes and computational
+demands, often requiring cloud API use which raises privacy concerns, limits
+real-time applications on edge devices, and increases fine-tuning costs.
+Additionally, LLMs often underperform in specialized domains such as healthcare
+and law due to insufficient domain-specific knowledge, necessitating
+specialized models. Therefore, Small Language Models (SLMs) are increasingly
+favored for their low inference latency, cost-effectiveness, efficient
+development, and easy customization and adaptability. These models are
+particularly well-suited for resource-limited environments and domain knowledge
+acquisition, addressing LLMs' challenges and proving ideal for applications
+that require localized data handling for privacy, minimal inference latency for
+efficiency, and domain knowledge acquisition through lightweight fine-tuning.
+The rising demand for SLMs has spurred extensive research and development.
+However, a comprehensive survey investigating issues related to the definition,
acquisition, application, enhancement, and reliability of SLMs remains lacking,
prompting us to conduct a detailed survey on these topics. The definition of
SLMs varies widely; thus, to standardize it, we propose defining SLMs by their
+capability to perform specialized tasks and suitability for
+resource-constrained settings, setting boundaries based on the minimal size for
+emergent abilities and the maximum size sustainable under resource constraints.
+For other aspects, we provide a taxonomy of relevant models/methods and develop
+general frameworks for each category to enhance and utilize SLMs effectively.
+
+
+
+ comment: 78 pages, 32 figures, 14 tables
+
+
+
+
+
+
+ ♻ ☆ Multi-View Empowered Structural Graph Wordification for Language Models
+
+
+
+
+
+
+
+
+ Zipeng Liu, Likang Wu, Ming He, Zhong Guan, Hongke Zhao, Nan Feng
+
+
+ Significant efforts have been dedicated to integrating the powerful Large
+Language Models (LLMs) with diverse modalities, particularly focusing on the
+fusion of language, vision and audio data. However, the graph-structured data,
+which is inherently rich in structural and domain-specific knowledge, has not
yet been gracefully adapted to LLMs. Existing methods either describe the graph
with raw text, suffering a loss of graph structural information, or feed
+Graph Neural Network (GNN) embeddings into LLMs at the cost of losing
+explainable prompt semantics. To bridge this gap, we introduce an end-to-end
+modality-aligning framework for LLM-graph alignment: Dual-Residual Vector
+Quantized-Variational AutoEncoder, namely Dr.E. Our approach is purposefully
+designed to facilitate token-level alignment with LLMs, enabling an effective
+translation of the intrinsic `language' of graphs into comprehensible natural
+language. We also manage to enhance LLMs' more robust structural understanding
+of graphs by incorporating multiple views of the central nodes based on their
+surrounding nodes at various distances. Our experimental evaluations on
+standard graph tasks demonstrate competitive performance against other
state-of-the-art (SOTA) approaches. Additionally, our framework ensures a degree
of visual interpretability, efficiency, and robustness, marking a promising
endeavor toward token-level alignment between LLMs and GNNs. Our
+code is available at: https://github.com/Timothy914/Dr.E.
+
+
+
+
+
+
+
+ ♻ ☆ Time Series Forecasting with LLMs: Understanding and Enhancing Model
+ Capabilities KDD
+
+
+ Large language models (LLMs) have been applied in many fields and have
+developed rapidly in recent years. As a classic machine learning task, time
+series forecasting has recently been boosted by LLMs. Recent works treat large
+language models as \emph{zero-shot} time series reasoners without further
+fine-tuning, which achieves remarkable performance. However, there are some
+unexplored research problems when applying LLMs for time series forecasting
+under the zero-shot setting. For instance, the LLMs' preferences for the input
+time series are less understood. In this paper, by comparing LLMs with
+traditional time series forecasting models, we observe many interesting
+properties of LLMs in the context of time series forecasting. First, our study
+shows that LLMs perform well in predicting time series with clear patterns and
+trends, but face challenges with datasets lacking periodicity. This observation
+can be explained by the ability of LLMs to recognize the underlying period
+within datasets, which is supported by our experiments. In addition, the input
+strategy is investigated, and it is found that incorporating external knowledge
+and adopting natural language paraphrases substantially improve the predictive
+performance of LLMs for time series. Overall, our study contributes insight
+into LLMs' advantages and limitations in time series forecasting under
+different conditions.
+
+
+
+ comment: Accepted by SIGKDD Explorations Newsletter
+
+
+
+
+
+
+ ♻ ☆ Is ChatGPT Good at Search? Investigating Large Language Models as
+ Re-Ranking Agents EMNLP 2023
+
+
+ Large Language Models (LLMs) have demonstrated remarkable zero-shot
+generalization across various language-related tasks, including search engines.
+However, existing work utilizes the generative ability of LLMs for Information
+Retrieval (IR) rather than direct passage ranking. The discrepancy between the
+pre-training objectives of LLMs and the ranking objective poses another
+challenge. In this paper, we first investigate generative LLMs such as ChatGPT
+and GPT-4 for relevance ranking in IR. Surprisingly, our experiments reveal
+that properly instructed LLMs can deliver competitive, even superior results to
+state-of-the-art supervised methods on popular IR benchmarks. Furthermore, to
+address concerns about data contamination of LLMs, we collect a new test set
+called NovelEval, based on the latest knowledge and aiming to verify the
+model's ability to rank unknown knowledge. Finally, to improve efficiency in
+real-world applications, we delve into the potential for distilling the ranking
+capabilities of ChatGPT into small specialized models using a permutation
distillation scheme. Our evaluation shows that a distilled 440M
+model outperforms a 3B supervised model on the BEIR benchmark. The code to
+reproduce our results is available at www.github.com/sunnweiwei/RankGPT.
+
+
+ Multimodal summarization aims to generate a concise summary based on the
input text and image. However, existing methods potentially suffer from
non-factual outputs. To evaluate the factuality of multimodal summarization
+models, we propose two fine-grained and explainable evaluation frameworks
+(FALLACIOUS) for different application scenarios, i.e. reference-based
+factuality evaluation framework and reference-free factuality evaluation
+framework. Notably, the reference-free factuality evaluation framework doesn't
+need ground truth and hence it has a wider application scenario. To evaluate
+the effectiveness of the proposed frameworks, we compute the correlation
+between our frameworks and the other metrics. The experimental results show the
effectiveness of our proposed method. We will release our code and dataset via
GitHub.
+
+
+
+ comment: AAAI 2025
+
+
+
+
+
+
+
+
+
+ Information Retrieval 10
+
+
+
+
+
+ ☆ Topic-Aware Knowledge Graph with Large Language Models for
+ Interoperability in Recommender Systems
+
+
+ The use of knowledge graphs in recommender systems has become one of the
+common approaches to addressing data sparsity and cold start problems. Recent
+advances in large language models (LLMs) offer new possibilities for processing
+side and context information within knowledge graphs. However, consistent
+integration across various systems remains challenging due to the need for
+domain expert intervention and differences in system characteristics. To
+address these issues, we propose a consistent approach that extracts both
+general and specific topics from both side and context information using LLMs.
+First, general topics are iteratively extracted and updated from side
+information. Then, specific topics are extracted using context information.
Finally, to handle synonymous topics generated during the specific topic
extraction process, a refining algorithm resolves them effectively. This
approach allows general topics to capture broad knowledge
+across diverse item characteristics, while specific topics emphasize detailed
+attributes, providing a more comprehensive understanding of the semantic
+features of items and the preferences of users. Experimental results
+demonstrate significant improvements in recommendation performance across
+diverse knowledge graphs.
+
+
+
+ comment: Accepted by The 40th ACM/SIGAPP Symposium On Applied Computing(SAC)
+ 2025
+
+
+
+
+
+
+ ☆ A Contrastive Pretrain Model with Prompt Tuning for Multi-center
+ Medication Recommendation
+
+
+ Medication recommendation is one of the most critical health-related
+applications, which has attracted extensive research interest recently. Most
+existing works focus on a single hospital with abundant medical data. However,
+many small hospitals only have a few records, which hinders applying existing
+medication recommendation works to the real world. Thus, we seek to explore a
+more practical setting, i.e., multi-center medication recommendation. In this
+setting, most hospitals have few records, but the total number of records is
large. Though small hospitals may benefit from the abundant overall records, they
also face the challenge that data distributions differ substantially across
hospitals. In this work, we introduce a novel conTrastive
+prEtrain Model with Prompt Tuning (TEMPT) for multi-center medication
+recommendation, which includes two stages of pretraining and finetuning. We
+first design two self-supervised tasks for the pretraining stage to learn
+general medical knowledge. They are mask prediction and contrastive tasks,
+which extract the intra- and inter-relationships of input diagnosis and
+procedures. Furthermore, we devise a novel prompt tuning method to capture the
+specific information of each hospital rather than adopting the common
+finetuning. On the one hand, the proposed prompt tuning can better learn the
+heterogeneity of each hospital to fit various distributions. On the other hand,
+it can also relieve the catastrophic forgetting problem of finetuning. To
+validate the proposed model, we conduct extensive experiments on the public
+eICU, a multi-center medical dataset. The experimental results illustrate the
effectiveness of our model. The implementation code is available to ease
reproducibility at https://github.com/Applied-Machine-Learning-Lab/TEMPT.
+
+
+
+ comment: accepted by TOIS
+
+
+
+
+
+
+ ☆ Invariant debiasing learning for recommendation via biased imputation
+
+
Previous debiasing studies utilize unbiased data to supervise model
training, but they suffer from high trial risks and experimental costs in
obtaining unbiased data. Recent research attempts to use invariant learning to
+detach the invariant preference of users for unbiased recommendations in an
+unsupervised way. However, it faces the drawbacks of low model accuracy and
unstable prediction performance due to the loss of cooperation with the variant
preference. In this paper, we experimentally demonstrate that invariant
+learning causes information loss by directly discarding the variant
+information, which reduces the generalization ability and results in the
+degradation of model performance in unbiased recommendations. Based on this
+consideration, we propose a novel lightweight knowledge distillation framework
+(KDDebias) to automatically learn the unbiased preference of users from both
+invariant and variant information. Specifically, the variant information is
+imputed to the invariant user preference in the distance-aware knowledge
+distillation process. Extensive experiments on three public datasets, i.e.,
+Yahoo!R3, Coat, and MIND, show that with the biased imputation from the variant
+preference of users, our proposed method achieves significant improvements with
+less than 50% learning parameters compared to the SOTA unsupervised debiasing
model in recommender systems. Our code is publicly available at
https://github.com/BAI-LAB/KD-Debias.
+
+
+
+
+
+
+
+ ☆ OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction
+ System
+
+
+
+
+
+
+
+
+ Yujie Luo, Xiangyuan Ru, Kangwei Liu, Lin Yuan, Mengshu Sun, Ningyu Zhang, Lei Liang, Zhiqiang Zhang, Jun Zhou, Lanning Wei, Da Zheng, Haofen Wang, Huajun Chen
+
+
+ We introduce OneKE, a dockerized schema-guided knowledge extraction system,
+which can extract knowledge from the Web and raw PDF Books, and support various
+domains (science, news, etc.). Specifically, we design OneKE with multiple
agents and a configurable knowledge base. Different agents perform their
respective roles, enabling support for various extraction scenarios. The
configurable knowledge base facilitates schema configuration, error case
debugging and correction, further improving performance. Empirical evaluations on
+benchmark datasets demonstrate OneKE's efficacy, while case studies further
+elucidate its adaptability to diverse tasks across multiple domains,
highlighting its potential for broad applications. We have open-sourced the
code at https://github.com/zjunlp/OneKE and released a demo video at
http://oneke.openkg.cn/demo.mp4.
+
+
+
+ comment: Work in progress
+
+
+
+
+
+
+ ☆ Generative Regression Based Watch Time Prediction for Video
+ Recommendation: Model and Performance
+
+
+
+
+
+
+
+
+ Hongxu Ma, Kai Tian, Tao Zhang, Xuefeng Zhang, Chunjie Chen, Han Li, Jihong Guan, Shuigeng Zhou
+
+
+ Watch time prediction (WTP) has emerged as a pivotal task in short video
+recommendation systems, designed to encapsulate user interests. Predicting
+users' watch times on videos often encounters challenges, including wide value
+ranges and imbalanced data distributions, which can lead to significant bias
+when directly regressing watch time. Recent studies have tried to tackle these
+issues by converting the continuous watch time estimation into an ordinal
+classification task. While these methods are somewhat effective, they exhibit
+notable limitations. Inspired by language modeling, we propose a novel
+Generative Regression (GR) paradigm for WTP based on sequence generation. This
+approach employs structural discretization to enable the lossless
+reconstruction of original values while maintaining prediction fidelity. By
+formulating the prediction problem as a numerical-to-sequence mapping, and with
+meticulously designed vocabulary and label encodings, each watch time is
+transformed into a sequence of tokens. To expedite model training, we introduce
curriculum learning with an embedding mixup strategy, which mitigates the
training-and-inference inconsistency associated with teacher forcing. We
+evaluate our method against state-of-the-art approaches on four public datasets
+and one industrial dataset. We also perform online A/B testing on Kuaishou, a
+leading video app with about 400 million DAUs, to demonstrate the real-world
+efficacy of our method. The results conclusively show that GR outperforms
+existing techniques significantly. Furthermore, we successfully apply GR to
+another regression task in recommendation systems, i.e., Lifetime Value (LTV)
+prediction, which highlights its potential as a novel and effective solution to
+general regression challenges.
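The numerical-to-sequence mapping with lossless reconstruction might look like the following sketch. The base, sequence length, and token names are illustrative assumptions here; the paper's structural discretization and vocabulary design are more elaborate.

```python
def encode_watch_time(seconds, base=10, ndigits=5):
    """Losslessly map a watch time in whole seconds to a fixed-length
    token sequence (most-significant digit first)."""
    assert 0 <= seconds < base ** ndigits
    digits = []
    for _ in range(ndigits):
        digits.append(seconds % base)
        seconds //= base
    return ["<bos>"] + [f"d{d}" for d in reversed(digits)] + ["<eos>"]

def decode_watch_time(tokens, base=10):
    """Invert encode_watch_time, recovering the original integer exactly."""
    value = 0
    for tok in tokens[1:-1]:  # skip <bos> and <eos>
        value = value * base + int(tok[1:])
    return value
```

Because every integer in range maps to exactly one token sequence and back, a sequence-generation model trained on these targets can, in principle, represent any watch time without the quantization loss of ordinal bucketing.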
+
+
+
 comment: 10 pages, 5 figures
+
+
+
+
+
+
+ ♻ ☆ Freshness and Informativity Weighted Cognitive Extent and Its
+ Correlation with Cumulative Citation Count
+
+
+ In this paper, we revisit cognitive extent, originally defined as the number
of unique phrases in a quota. We introduce Freshness and Informativity Weighted
Cognitive Extent (FICE), calculated based on two novel weighting factors: the
+lifetime ratio and informativity of scientific entities. We model the lifetime
+of each scientific entity as the time-dependent document frequency, which is
+fit by the composition of multiple Gaussian profiles. The lifetime ratio is
+then calculated as the cumulative document frequency at the publication time
+$t_0$ divided by the cumulative document frequency over its entire lifetime.
+The informativity is calculated by normalizing the document frequency across
+all scientific entities recognized in a title. Using the ACL Anthology, we
verified the trend, formerly observed in several other domains, that the number
of unique scientific entities per quota grows at a gradually slowing rate.
+We found that FICE exhibits a strong correlation with the average cumulative
citation count within a quota. Our code is available at
https://github.com/ZiheHerzWang/Freshness-and-Informativity-Weighted-Cognitive-Extent
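The two weighting factors can be sketched as follows. Treating informativity as a simple sum-normalization of document frequencies within a title is an assumption of this sketch, not necessarily the paper's exact formula.

```python
def lifetime_ratio(doc_freq_by_year, t0):
    """Cumulative document frequency of an entity up to publication year t0,
    divided by its cumulative document frequency over the entire lifetime."""
    total = sum(doc_freq_by_year.values())
    upto = sum(f for year, f in doc_freq_by_year.items() if year <= t0)
    return upto / total if total else 0.0

def informativity(doc_freqs):
    """Normalize document frequencies across the entities recognized in one
    title (simple sum-normalization; an illustrative assumption)."""
    s = sum(doc_freqs)
    return [f / s for f in doc_freqs]
```

An entity published early in its lifetime (small cumulative frequency at t0 relative to its eventual total) gets a low lifetime ratio, matching the "freshness" intuition in the abstract.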
+
+
Personal interaction data can be effectively modeled as individual graphs for
each user in recommender systems. Graph Neural Networks (GNNs)-based
recommendation techniques have become extremely popular since they can capture
high-order collaborative signals between users and items by aggregating the
individual graphs into a global interaction graph. However, this centralized
approach inherently poses a threat to user privacy and security. Recently,
federated GNN-based recommendation techniques have emerged as a promising
solution to mitigate privacy concerns. Nevertheless, current implementations
either limit on-device training to isolated individual graphs or rely
on an extra third-party server to access other individual
graphs, which also increases the risk of privacy leakage. To address this
+challenge, we propose a Cluster-enhanced Federated Graph Neural Network
+framework for Recommendation, named CFedGR, which introduces high-order
+collaborative signals to augment individual graphs in a privacy preserving
+manner. Specifically, the server clusters the pretrained user representations
+to identify high-order collaborative signals. In addition, two efficient
+strategies are devised to reduce communication between devices and the server.
+Extensive experiments on three benchmark datasets validate the effectiveness of
+our proposed methods.
+
+
+
+
+
+
+
+ ♻ ☆ Is ChatGPT Good at Search? Investigating Large Language Models as
+ Re-Ranking Agents EMNLP 2023
+
+
+ Large Language Models (LLMs) have demonstrated remarkable zero-shot
+generalization across various language-related tasks, including search engines.
+However, existing work utilizes the generative ability of LLMs for Information
+Retrieval (IR) rather than direct passage ranking. The discrepancy between the
+pre-training objectives of LLMs and the ranking objective poses another
+challenge. In this paper, we first investigate generative LLMs such as ChatGPT
+and GPT-4 for relevance ranking in IR. Surprisingly, our experiments reveal
+that properly instructed LLMs can deliver competitive, even superior results to
+state-of-the-art supervised methods on popular IR benchmarks. Furthermore, to
+address concerns about data contamination of LLMs, we collect a new test set
+called NovelEval, based on the latest knowledge and aiming to verify the
+model's ability to rank unknown knowledge. Finally, to improve efficiency in
+real-world applications, we delve into the potential for distilling the ranking
+capabilities of ChatGPT into small specialized models using a permutation
distillation scheme. Our evaluation shows that a distilled 440M
+model outperforms a 3B supervised model on the BEIR benchmark. The code to
+reproduce our results is available at www.github.com/sunnweiwei/RankGPT.
+
+
+
+ comment: EMNLP 2023
+
+
+
+
+
+
+ ♻ ☆ Collaborative filtering based on nonnegative/binary matrix factorization
+
+
+ Collaborative filtering generates recommendations based on user-item
+similarities through rating data, which may involve numerous unrated items. To
+predict scores for unrated items, matrix factorization techniques, such as
nonnegative matrix factorization (NMF), are often employed.
Nonnegative/binary matrix factorization (NBMF), which is an
+extension of NMF, approximates a nonnegative matrix as the product of
+nonnegative and binary matrices. Previous studies have employed NBMF for image
+analysis where the data were dense. In this paper, we propose a modified NBMF
+algorithm that can be applied to collaborative filtering where data are sparse.
+In the modified method, unrated elements in a rating matrix are masked, which
improves the collaborative filtering performance. Utilizing a low-latency Ising
machine in NBMF is advantageous in terms of computation time, making the
proposed method beneficial.
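A minimal sketch of the masked NBMF idea: fit V ≈ WH on rated entries only, with nonnegative W updated by clipped least squares and binary H found by exhaustive per-column search. The update scheme here is an illustrative assumption, feasible only for tiny factor ranks; the paper instead solves the binary subproblem on a low-latency Ising machine.

```python
import numpy as np
from itertools import product

def masked_nbmf(V, mask, k=1, iters=10):
    """Approximate V ~ W @ H with W >= 0 and H binary, fitting only
    entries where mask == 1 (the rated ones). Illustrative alternating
    updates, not the paper's algorithm."""
    m, n = V.shape
    H = np.ones((k, n))
    W = np.zeros((m, k))
    # All 2^k binary columns; tractable only for very small k.
    candidates = [np.array(bits, dtype=float) for bits in product([0, 1], repeat=k)]
    for _ in range(iters):
        # W step: per-row least squares on observed columns, clipped to >= 0.
        for i in range(m):
            obs = mask[i] == 1
            if obs.any():
                sol, *_ = np.linalg.lstsq(H[:, obs].T, V[i, obs], rcond=None)
                W[i] = np.maximum(sol, 0.0)
        # H step: exhaustive binary search per column on observed rows.
        for j in range(n):
            obs = mask[:, j] == 1
            H[:, j] = min(candidates,
                          key=lambda h: np.sum((W[obs] @ h - V[obs, j]) ** 2))
    return W, H
```

Masking the unrated entries, as the abstract suggests, keeps the factorization from wasting capacity on fictitious zeros; predictions for unrated cells then come from the completed product W @ H.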
+
+
+
+ comment: 14 pages, 7 figures
+
+
+
+
+
+
+ ♻ ☆ The Design of an LLM-powered Unstructured Analytics System CIDR
+
+
+
+
+
+
+
+
+ Eric Anderson, Jonathan Fritz, Austin Lee, Bohou Li, Mark Lindblad, Henry Lindeman, Alex Meyer, Parth Parmar, Tanvi Ranade, Mehul A. Shah, Benjamin Sowell, Dan Tecuci, Vinayak Thapliyal, Matt Welsh
+
+
+ LLMs demonstrate an uncanny ability to process unstructured data, and as
+such, have the potential to go beyond search and run complex, semantic analyses
+at scale. We describe the design of an unstructured analytics system, Aryn, and
+the tenets and use cases that motivate its design. With Aryn, users specify
+queries in natural language and the system automatically determines a semantic
+plan and executes it to compute an answer from a large collection of
+unstructured documents. At the core of Aryn is Sycamore, a declarative document
+processing engine, that provides a reliable distributed abstraction called
+DocSets. Sycamore allows users to analyze, enrich, and transform complex
+documents at scale. Aryn includes Luna, a query planner that translates natural
+language queries to Sycamore scripts, and DocParse, which takes raw PDFs and
+document images, and converts them to DocSets for downstream processing. We
+show how these pieces come together to achieve better accuracy than RAG on
+analytics queries over real world reports from the National Transportation
+Safety Board (NTSB). Also, given current limitations of LLMs, we argue that an
+analytics system must provide explainability to be practical, and show how
+Aryn's user interface does this to help build trust.
+
+
+
+ comment: Included in the proceedings of The Conference on Innovative Data
+ Systems Research (CIDR) 2025
+
+
+
+
+
+
+
+
+
+ Multimedia 1
+
+
+
+
+
+ ♻ ☆ Face-StyleSpeech: Enhancing Zero-shot Speech Synthesis from Face Images
+ with Improved Face-to-Speech Mapping ICASSP 2025
+
+
+ Generating speech from a face image is crucial for developing virtual humans
+capable of interacting using their unique voices, without relying on
+pre-recorded human speech. In this paper, we propose Face-StyleSpeech, a
+zero-shot Text-To-Speech (TTS) synthesis model that generates natural speech
+conditioned on a face image rather than reference speech. We hypothesize that
+learning entire prosodic features from a face image poses a significant
+challenge. To address this, our TTS model incorporates both face and prosody
+encoders. The prosody encoder is specifically designed to model speech style
+characteristics that are not fully captured by the face image, allowing the
+face encoder to focus on extracting speaker-specific features such as timbre.
+Experimental results demonstrate that Face-StyleSpeech effectively generates
+more natural speech from a face image than baselines, even for unseen faces.
+Samples are available on our demo page.
+
+
+
+ comment: Accepted by ICASSP 2025
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Computation and Language 43
+
+
+
+
+
+ ☆ InfAlign: Inference-aware language model alignment
+
+
+
+
+
+
+
+
+ Ananth Balashankar, Ziteng Sun, Jonathan Berant, Jacob Eisenstein, Michael Collins, Adrian Hutter, Jong Lee, Chirag Nagpal, Flavien Prost, Aradhana Sinha, Ananda Theertha Suresh, Ahmad Beirami
+
+
+ Language model alignment has become a critical step in training modern
+generative language models. The goal of alignment is to finetune a reference
+model such that the win rate of a sample from the aligned model over a sample
+from the reference model is high, subject to a KL divergence constraint. Today,
+we are increasingly using inference-time algorithms (e.g., Best-of-N,
+controlled decoding, tree search) to decode from language models rather than
+standard sampling. However, the alignment objective does not capture such
+inference-time decoding procedures. We show that the existing alignment
+framework is sub-optimal in view of such inference-time methods. We then modify
+the alignment objective and propose a framework for inference-aware alignment
+(IAPO). We prove that for any inference-time decoding algorithm, the optimal
+solution that optimizes the inference-time win rate of the aligned policy
+against the reference policy is the solution to the typical RLHF problem with a
+transformation of the reward. This motivates us to provide the KL-regularized
+calibrate-and-transform RL (CTRL) algorithm to solve this problem, which
+involves a reward calibration step and a KL-regularized reward maximization
+step with a transformation of the calibrated reward. We particularize our study
+to two important inference-time strategies: best-of-N sampling and best-of-N
+jailbreaking, where N responses are sampled from the model and the one with the
+highest or lowest reward is selected. We propose specific transformations for
+these strategies and demonstrate that our framework offers significant
+improvements over existing state-of-the-art methods for language model
+alignment. Empirically, we outperform baselines that are designed without
+taking inference-time decoding into consideration by 8-12% and 4-9% on
+inference-time win rates over the Anthropic helpfulness and harmlessness dialog
+benchmark datasets.
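As a concrete illustration of the best-of-N strategy the abstract describes (sample N responses and keep the one with the highest reward), here is a minimal sketch; the sampler and reward function are hypothetical stand-ins, not the paper's models:

```python
import random

def best_of_n(prompt, sample, reward, n=4, seed=0):
    """Draw n candidate responses from `sample` and return the one
    that `reward` scores highest (best-of-N selection)."""
    rng = random.Random(seed)
    candidates = [sample(prompt, rng) for _ in range(n)]
    return max(candidates, key=reward)

# Toy stand-ins for a model sampler and a reward model.
def toy_sample(prompt, rng):
    return prompt + " " + rng.choice(["bad", "ok", "good", "great"])

def toy_reward(response):
    return {"bad": 0, "ok": 1, "good": 2, "great": 3}[response.split()[-1]]
```

Best-of-N jailbreaking, as described in the abstract, would select with `min` instead of `max` over the same candidates.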
+
+
+
+
+
+
+
+ ☆ Enhancing Whisper's Accuracy and Speed for Indian Languages through
+ Prompt-Tuning and Tokenization ICASSP 2025
+
+
+ Automatic speech recognition has recently seen a significant advancement with
+large foundational models such as Whisper. However, these models often struggle
+to perform well in low-resource languages, such as Indian languages. This paper
+explores two novel approaches to enhance Whisper's multilingual speech
+recognition performance in Indian languages. First, we propose prompt-tuning
+with language family information, which enhances Whisper's accuracy in
+linguistically similar languages. Second, we introduce a novel tokenizer that
+reduces the number of generated tokens, thereby accelerating Whisper's
+inference speed. Our extensive experiments demonstrate that the tokenizer
+significantly reduces inference time, while prompt-tuning enhances accuracy
+across various Whisper model sizes, including Small, Medium, and Large.
+Together, these techniques achieve a balance between optimal WER and inference
+speed.
+
+
+ This research investigates the performance of various machine learning
+algorithms (CNN, LSTM, VADER, and RoBERTa) for sentiment analysis of Twitter
+data related to imported food items in Trinidad and Tobago. The study addresses
+three primary research questions: the comparative accuracy and efficiency of
+the algorithms, the optimal configurations for each model, and the potential
+applications of the optimized models in a live system for monitoring public
+sentiment and its impact on the import bill. The dataset comprises tweets from
+2018 to 2024, divided into imbalanced, balanced, and temporal subsets to assess
+the impact of data balancing and the COVID-19 pandemic on sentiment trends. Ten
+experiments were conducted to evaluate the models under various configurations.
+Results indicated that VADER outperformed the other models in both multi-class
+and binary sentiment classifications. The study highlights significant changes
+in sentiment trends pre- and post-COVID-19, with implications for import
+policies.
+
+
+
+ comment: 27 pages
+
+
+
+
+
+
+ ☆ OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse
+ Task Synthesis
+
+
+ Graphical User Interface (GUI) agents powered by Vision-Language Models
+(VLMs) have demonstrated human-like computer control capability. Despite their
+utility in advancing digital automation, a critical bottleneck persists:
+collecting high-quality trajectory data for training. Common practices for
+collecting such data rely on human supervision or synthetic data generation
+through executing pre-defined tasks, which are either resource-intensive or
+unable to guarantee data quality. Moreover, these methods suffer from limited
+data diversity and significant gaps between synthetic data and real-world
+environments. To address these challenges, we propose OS-Genesis, a novel GUI
+data synthesis pipeline that reverses the conventional trajectory collection
+process. Instead of relying on pre-defined tasks, OS-Genesis enables agents
+first to perceive environments and perform step-wise interactions, then
+retrospectively derive high-quality tasks to enable trajectory-level
+exploration. A trajectory reward model is then employed to ensure the quality
+of the generated trajectories. We demonstrate that training GUI agents with
+OS-Genesis significantly improves their performance on highly challenging
+online benchmarks. In-depth analysis further validates OS-Genesis's efficiency
+and its superior data quality and diversity compared to existing synthesis
+methods. Our codes, data, and checkpoints are available at
+the OS-Genesis Homepage: https://qiushisun.github.io/OS-Genesis-Home/.
+
+
+
+ comment: Work in progress
+
+
+
+
+
+
+ ☆ Toward Adaptive Reasoning in Large Language Models with Thought Rollback ICML 2024
+
+
+ Large language models (LLMs) have been routinely used to solve various tasks
+using step-by-step reasoning. However, the structure of intermediate reasoning
+steps, or thoughts, is rigid and unidirectional, such as chains, trees, or
+directed acyclic graphs. Consequently, the resulting inflexible and
+forward-only reasoning may not address challenging tasks and fail when the LLM
+frequently gives false responses, i.e., "hallucinations". This paper proposes
+a new reasoning framework, called Thought Rollback (TR), allowing LLMs to
+adaptively build thought structure while maintaining effective reasoning toward
+problem-solving under "hallucinations". The core mechanism of TR is rolling
+back thoughts, which allows LLMs to perform error analysis on thoughts, and
+thus roll back to any previously mistaken thought for revision. Subsequently,
+by including such trial-and-error in the prompt to guide the LLM, each rollback
+leads to one more reliable reasoning path. Therefore, starting with a simple
+prompt without human annotations, LLM with TR adaptively and gradually explores
+thoughts for a correct solution. Comprehensive experiments on mathematical
+problems and multi-task reasoning demonstrate the state-of-the-art performance
+of TR in terms of problem-solving rate and interaction cost. For instance, the
+solving rate of GPT-4 with TR outperforms the current best by 9% on the MATH
+dataset.
+
+
+
+ comment: ICML 2024 camera-ready version with 24 pages and 12 figures. Code
+ repo with all prompts:
+ https://github.com/iQua/llmpebase/tree/main/examples/ThoughtRollback
+
+
+
+
+
+
+ ☆ Machine Generated Product Advertisements: Benchmarking LLMs Against
+ Human Performance
+
+
+ This study compares the performance of AI-generated and human-written product
+descriptions using a multifaceted evaluation model. We analyze descriptions for
+100 products generated by four AI models (Gemma 2B, LLAMA, GPT2, and ChatGPT 4)
+with and without sample descriptions, against human-written descriptions. Our
+evaluation metrics include sentiment, readability, persuasiveness, Search
+Engine Optimization (SEO), clarity, emotional appeal, and call-to-action
+effectiveness. The results indicate that ChatGPT 4 performs the best. In
+contrast, other models demonstrate significant shortcomings, producing
+incoherent and illogical output that lacks logical structure and contextual
+relevance. These models struggle to maintain focus on the product being
+described, resulting in disjointed sentences that do not convey meaningful
+information. This research provides insights into the current capabilities and
+limitations of AI in the creation of content for e-Commerce.
+
+
+
+
+
+
+
+ ☆ A Comparative Study of Machine Unlearning Techniques for Image and Text
+ Classification Models
+
+
+
+
+
+
+
+
+ Omar M. Safa, Mahmoud M. Abdelaziz, Mustafa Eltawy, Mohamed Mamdouh, Moamen Gharib, Salaheldin Eltenihy, Nagia M. Ghanem, Mohamed M. Ismail
+
+
+ Machine Unlearning has emerged as a critical area in artificial intelligence,
+addressing the need to selectively remove learned data from machine learning
+models in response to data privacy regulations. This paper provides a
+comprehensive comparative analysis of six state-of-the-art unlearning techniques
+applied to image and text classification tasks. We evaluate their performance,
+efficiency, and compliance with regulatory requirements, highlighting their
+strengths and limitations in practical scenarios. By systematically analyzing
+these methods, we aim to provide insights into their applicability,
+challenges, and tradeoffs, fostering advancements in the field of ethical and
+adaptable machine learning.
+
+
+
+
+
+
+
+ ☆ TARGA: Targeted Synthetic Data Generation for Practical Reasoning over
+ Structured Data
+
+
+ Semantic parsing, which converts natural language questions into logic forms,
+plays a crucial role in reasoning within structured environments. However,
+existing methods encounter two significant challenges: reliance on extensive
+manually annotated datasets and limited generalization capability to unseen
+examples. To tackle these issues, we propose Targeted Synthetic Data Generation
+(TARGA), a practical framework that dynamically generates high-relevance
+synthetic data without manual annotation. Starting from the pertinent entities
+and relations of a given question, we probe for the potential relevant queries
+through layer-wise expansion and cross-layer combination. Then we generate
+corresponding natural language questions for these constructed queries to
+jointly serve as the synthetic demonstrations for in-context learning.
+Experiments on multiple knowledge base question answering (KBQA) datasets
+demonstrate that TARGA, using only a 7B-parameter model, substantially
+outperforms existing non-fine-tuned methods that utilize closed-source models,
+achieving notable improvements in F1 scores on GrailQA (+7.7) and
+KBQA-Agent (+12.2). Furthermore, TARGA also exhibits superior sample efficiency,
+robustness, and generalization capabilities under non-I.I.D. settings.
+
+
+
+
+
+
+
+ ☆ Exploiting Domain-Specific Parallel Data on Multilingual Language Models
+ for Low-resource Language Translation
+
+
+
+
+
+
+
+
+ Surangika Ranathungaa, Shravan Nayak, Shih-Ting Cindy Huang, Yanke Mao, Tong Su, Yun-Hsiang Ray Chan, Songchen Yuan, Anthony Rinaldi, Annie En-Shiun Lee
+
+
+ Neural Machine Translation (NMT) systems built on multilingual
+sequence-to-sequence Language Models (msLMs) fail to deliver expected results
+when the amount of parallel data for a language, as well as the language's
+representation in the model are limited. This restricts the capabilities of
+domain-specific NMT systems for low-resource languages (LRLs). As a solution,
+parallel data from auxiliary domains can be used either to fine-tune or to
+further pre-train the msLM. We present an evaluation of the effectiveness of
+these two techniques in the context of domain-specific LRL-NMT. We also explore
+the impact of domain divergence on NMT model performance. We recommend several
+strategies for utilizing auxiliary parallel data in building domain-specific
+NMT models for LRLs.
+
+
+
+
+
+
+
+ ☆ Confidence v.s. Critique: A Decomposition of Self-Correction Capability
+ for LLMs
+
+
+ Large Language Models (LLMs) can correct their self-generated responses, but
+a decline in accuracy after self-correction is also observed. To gain a deeper
+understanding of self-correction, we endeavor to decompose, evaluate, and
+analyze the self-correction behaviors of LLMs. By enumerating and analyzing
+answer correctness before and after self-correction, we decompose the
+self-correction capability into confidence (being confident to correct answers)
+and critique (turning wrong answers to correct) capabilities, and propose two
+metrics from a probabilistic perspective to measure these two capabilities, along
+with another metric for overall self-correction capability evaluation. Based on
+our decomposition and evaluation metrics, we conduct extensive experiments and
+draw some empirical conclusions. For example, we find different models can
+exhibit distinct behaviors: some models are confident while others are more
+critical. We also find the trade-off between the two capabilities (i.e.
+improving one can lead to a decline in the other) when manipulating model
+self-correction behavior by prompts or in-context learning. Further, we find a
+simple yet efficient strategy to improve self-correction capability by
+transforming Supervision Fine-Tuning (SFT) data format, and our strategy
+outperforms vanilla SFT in both capabilities and achieves much higher accuracy
+after self-correction. Our code will be publicly available on GitHub.
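The confidence/critique decomposition can be illustrated with an assumed formulation (the paper's exact metrics may differ): estimate confidence as the fraction of initially correct answers that stay correct after self-correction, and critique as the fraction of initially wrong answers that become correct:

```python
def self_correction_metrics(before, after):
    """Estimate confidence = P(correct after | correct before) and
    critique = P(correct after | wrong before) as simple frequencies
    over paired True/False correctness labels. Assumed formulation,
    not the paper's exact definitions."""
    kept = [a for b, a in zip(before, after) if b]
    fixed = [a for b, a in zip(before, after) if not b]
    confidence = sum(kept) / len(kept) if kept else 0.0
    critique = sum(fixed) / len(fixed) if fixed else 0.0
    return confidence, critique
```

The trade-off the abstract reports would show up here as one frequency rising while the other falls across prompting configurations.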
+
+
+
+ comment: 16 pages, 10 figures
+
+
+
+
+
+
+ ☆ Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging
+
+
+
+
+
+
+
+
+ Hua Farn, Hsuan Su, Shachi H Kumar, Saurav Sahay, Shang-Tse Chen, Hung-yi Lee
+
+
+ Fine-tuning large language models (LLMs) for downstream tasks is a widely
+adopted approach, but it often leads to safety degradation in safety-aligned
+LLMs. Currently, many solutions address this issue by incorporating additional
+safety data, which can be impractical in many cases. In this paper, we address
+the question: How can we improve downstream task performance while preserving
+safety in LLMs without relying on additional safety data? We propose a simple
+and effective method that maintains the inherent safety of LLMs while enhancing
+their downstream task performance: merging the weights of pre- and
+post-fine-tuned safety-aligned models. Experimental results across various
+downstream tasks, models, and merging methods demonstrate that this approach
+effectively mitigates safety degradation while improving downstream task
+performance, offering a practical solution for adapting safety-aligned LLMs.
+
+
+ User willingness is a crucial element in the sales talk process that affects
+the achievement of the salesperson's or sales system's objectives. Despite the
+importance of user willingness, to the best of our knowledge, no previous study
+has addressed the development of automated sales talk dialogue systems that
+explicitly consider user willingness. A major barrier is the lack of sales talk
+datasets with reliable user willingness data. Thus, in this study, we developed
+a user willingness-aware sales talk collection by leveraging the ecological
+validity concept, which is discussed in the field of human-computer
+interaction. Our approach focused on three types of user willingness essential
+in real sales interactions. We created a dialogue environment that closely
+resembles real-world scenarios to elicit natural user willingness, with
+participants evaluating their willingness at the utterance level from multiple
+perspectives. We analyzed the collected data to gain insights into practical
+user willingness-aware sales talk strategies. In addition, as a practical
+application of the constructed dataset, we developed and evaluated a sales
+dialogue system aimed at enhancing the user's intent to purchase.
+
+
+
+ comment: 12 pages, Accepted to COLING2025
+
+
+
+
+
+
+ ☆ Pre-training, Fine-tuning and Re-ranking: A Three-Stage Framework for
+ Legal Question Answering
+
+
+ Legal question answering (QA) has attracted increasing attention from people
+seeking legal advice, which aims to retrieve the most applicable answers from a
+large-scale database of question-answer pairs. Previous methods mainly use a
+dual-encoder architecture to learn dense representations of both questions and
+answers. However, these methods could suffer from lacking domain knowledge and
+sufficient labeled training data. In this paper, we propose a three-stage
+(pre-training, fine-tuning and re-ranking)
+framework for legal QA (called PFR-LQA), which promotes
+the fine-grained text representation learning and boosts the performance of
+dense retrieval with the dual-encoder architecture. Concretely, we first
+conduct domain-specific pre-training on legal questions and answers through a
+self-supervised training objective, allowing the pre-trained model to be
+adapted to the legal domain. Then, we perform task-specific fine-tuning of the
+dual-encoder on legal question-answer pairs by using the supervised learning
+objective, leading to a high-quality dual-encoder for the specific downstream
+QA task. Finally, we employ a contextual re-ranking objective to further refine
+the output representations of questions produced by the document encoder, which
+uses contextual similarity to increase the discrepancy between the anchor and
+hard negative samples for better question re-ranking. We conduct extensive
+experiments on a manually annotated legal QA dataset. Experimental results show
+that our PFR-LQA method achieves better performance than the strong competitors
+for legal question answering.
+
+
+
+
+
+
+
+ ☆ Feature Alignment-Based Knowledge Distillation for Efficient Compression
+ of Large Language Models
+
+
+ This study proposes a knowledge distillation algorithm based on large
+language models and feature alignment, aiming to effectively transfer the
+knowledge of large pre-trained models into lightweight student models, thereby
+reducing computational costs while maintaining high model performance.
+Different from the traditional soft label distillation method, this method
+introduces a multi-layer feature alignment strategy to deeply align the
+intermediate features and attention mechanisms of the teacher model and the
+student model, maximally retaining the semantic expression ability and context
+modeling ability of the teacher model. In terms of method design, a multi-task
+loss function is constructed, including feature matching loss, attention
+alignment loss, and output distribution matching loss, to ensure multi-level
+information transfer through joint optimization. The experiments were
+comprehensively evaluated on the GLUE data set and various natural language
+processing tasks. The results show that the proposed model performs very close
+to the state-of-the-art GPT-4 model in terms of evaluation indicators such as
+perplexity, BLEU, ROUGE, and CER. At the same time, it far exceeds baseline
+models such as DeBERTa, XLNet, and GPT-3, showing significant performance
+improvements and computing efficiency advantages. Research results show that
+the feature alignment distillation strategy is an effective model compression
+method that can significantly reduce computational overhead and storage
+requirements while maintaining model capabilities. Future research can be
+further expanded in the directions of self-supervised learning, cross-modal
+feature alignment, and multi-task transfer learning to provide more flexible
+and efficient solutions for the deployment and optimization of deep learning
+models.
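A minimal sketch of the multi-task loss described above (feature matching plus attention alignment plus output distribution matching); the weights and the use of mean-squared error for every term are assumptions, since the abstract does not specify the individual loss forms:

```python
def distillation_loss(feat_s, feat_t, attn_s, attn_t, logits_s, logits_t,
                      w_feat=1.0, w_attn=1.0, w_out=1.0):
    """Combine three alignment terms between student (s) and teacher (t):
    intermediate features, attention maps, and output distributions.
    Inputs are flat lists of floats; weights are illustrative."""
    def mse(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return (w_feat * mse(feat_s, feat_t)
            + w_attn * mse(attn_s, attn_t)
            + w_out * mse(logits_s, logits_t))
```

In practice the output term is often a KL divergence over softened logits rather than MSE; the joint optimization over all three terms is what the abstract calls multi-level information transfer.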
+
+
+
+ comment: 4 pages
+
+
+
+
+
+
+ ☆ DeepSeek-V3 Technical Report
+
+
+
+
+
+
+
+
+ DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J. L. Cai, Jian Liang, Jianzhong Guo, Jiaqi Ni, Jiashi Li, Jiawei Wang, Jin Chen, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, Junxiao Song, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Xu, Leyi Xia, Liang Zhao, Litong Wang, Liyue Zhang, Meng Li, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Ning Tian, Panpan Huang, Peiyi Wang, Peng Zhang, Qiancheng Wang, Qihao Zhu, Qinyu Chen, Qiushi Du, R. J. Chen, R. L. Jin, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runxin Xu, Ruoyu Zhang, Ruyi Chen, S. S. Li, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaoqing Wu, Shengfeng Ye, Shengfeng Ye, Shirong Ma, Shiyu Wang, Shuang Zhou, Shuiping Yu, Shunfeng Zhou, Shuting Pan, T. Wang, Tao Yun, Tian Pei, Tianyu Sun, W. L. Xiao, Wangding Zeng, Wanjia Zhao, Wei An, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, X. Q. Li, Xiangyue Jin, Xianzu Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaojin Shen, Xiaokang Chen, Xiaokang Zhang, Xiaosha Chen, Xiaotao Nie, Xiaowen Sun, Xiaoxiang Wang, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xingkai Yu, Xinnan Song, Xinxia Shan, Xinyi Zhou, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, Y. K. Li, Y. Q. Wang, Y. X. Wei, Y. X. 
Zhu, Yang Zhang, Yanhong Xu, Yanhong Xu, Yanping Huang, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Li, Yaohui Wang, Yi Yu, Yi Zheng, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Ying Tang, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yu Wu, Yuan Ou, Yuchen Zhu, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yukun Zha, Yunfan Xiong, Yunxian Ma, Yuting Yan, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Z. F. Wu, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhen Huang, Zhen Zhang, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhibin Gou, Zhicheng Ma, Zhigang Yan, Zhihong Shao, Zhipeng Xu, Zhiyu Wu, Zhongyu Zhang, Zhuoshu Li, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Ziyi Gao, Zizheng Pan
+
+
+ We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with
+671B total parameters with 37B activated for each token. To achieve efficient
+inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent
+Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated
+in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free
+strategy for load balancing and sets a multi-token prediction training
+objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion
+diverse and high-quality tokens, followed by Supervised Fine-Tuning and
+Reinforcement Learning stages to fully harness its capabilities. Comprehensive
+evaluations reveal that DeepSeek-V3 outperforms other open-source models and
+achieves performance comparable to leading closed-source models. Despite its
+excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its
+full training. In addition, its training process is remarkably stable.
+Throughout the entire training process, we did not experience any irrecoverable
+loss spikes or perform any rollbacks. The model checkpoints are available at
+https://github.com/deepseek-ai/DeepSeek-V3.
+
+
+
+
+
+
+
+ ☆ Assessing Text Classification Methods for Cyberbullying Detection on
+ Social Media Platforms
+
+
+
+
+
+
+
+
+ Adamu Gaston Philipo, Doreen Sebastian Sarwatt, Jianguo Ding, Mahmoud Daneshmand, Huansheng Ning
+
+
+ Cyberbullying significantly contributes to mental health issues in
+communities by negatively impacting the psychology of victims. It is a
+prevalent problem on social media platforms, necessitating effective, real-time
+detection and monitoring systems to identify harmful messages. However, current
+cyberbullying detection systems face challenges related to performance, dataset
+quality, time efficiency, and computational costs. This research aims to
+conduct a comparative study by adapting and evaluating existing text
+classification techniques within the cyberbullying detection domain. The study
+specifically evaluates the effectiveness and performance of these techniques in
+identifying cyberbullying instances on social media platforms. It focuses on
+leveraging and assessing large language models, including BERT, RoBERTa, XLNet,
+DistilBERT, and GPT-2.0, for their suitability in this domain. The results show
+that BERT strikes a balance between performance, time efficiency, and
+computational resources: Accuracy of 95%, Precision of 95%, Recall of 95%, F1
+Score of 95%, Error Rate of 5%, Inference Time of 0.053 seconds, RAM Usage of
+35.28 MB, CPU/GPU Usage of 0.4%, and Energy Consumption of 0.000263 kWh. The
+findings demonstrate that generative AI models, while powerful, do not
+consistently outperform fine-tuned models on the tested benchmarks. However,
+state-of-the-art performance can still be achieved through strategic adaptation
+and fine-tuning of existing models for specific datasets and tasks.
+
+
+
+ comment: 15 pages, 10 figures, 7 tables
+
+
+
+
+
+
+ ☆ Right vs. Right: Can LLMs Make Tough Choices?
+
+
+
+
+
+
+
+
+ Jiaqing Yuan, Pradeep K. Murukannaiah, Munindar P. Singh
+
+
+ An ethical dilemma describes a choice between two "right" options involving
+conflicting moral values. We present a comprehensive evaluation of how LLMs
+navigate ethical dilemmas. Specifically, we investigate LLMs on their (1)
+sensitivity in comprehending ethical dilemmas, (2) consistency in moral value
+choice, (3) consideration of consequences, and (4) ability to align their
+responses to a moral value preference explicitly or implicitly specified in a
+prompt. Drawing inspiration from a leading ethical framework, we construct a
+dataset comprising 1,730 ethical dilemmas involving four pairs of conflicting
+values. We evaluate 20 well-known LLMs from six families. Our experiments
+reveal that: (1) LLMs exhibit pronounced preferences between major value pairs,
+and prioritize truth over loyalty, community over individual, and long-term
+over short-term considerations. (2) The larger LLMs tend to support a
+deontological perspective, maintaining their choices of actions even when
+negative consequences are specified. (3) Explicit guidelines are more effective
+in guiding LLMs' moral choice than in-context examples. Lastly, our experiments
+highlight the limitation of LLMs in comprehending different formulations of
+ethical dilemmas.
+
+
+
+
+
+
+
+ ☆ HADES: Hardware Accelerated Decoding for Efficient Speculation in Large
+ Language Models
+
+
+ Large Language Models (LLMs) have revolutionized natural language processing
+by understanding and generating human-like text. However, the increasing demand
+for more sophisticated LLMs presents significant computational challenges due
+to their scale and complexity. This paper introduces Hardware Accelerated
+Decoding (HADES), a novel approach to enhance the performance and energy
+efficiency of LLMs. We address the design of an LLM accelerator with
+hardware-level speculative decoding support, a concept not previously explored
+in existing literature. Our work demonstrates how speculative decoding can
+significantly improve the efficiency of LLM operations, paving the way for more
+advanced and practical applications of these models.
+
+
+
+ comment: Accepted to ICCEA 2025
+
+
+
+
+
+
+ ☆ Evaluate Summarization in Fine-Granularity: Auto Evaluation with LLM
+
+
+ Due to the exponential growth of information and the need for efficient
+information consumption, the task of summarization has gained paramount
+importance. Evaluating summarization accurately and objectively presents
+significant challenges, particularly when dealing with long and unstructured
+texts rich in content. Existing methods, such as ROUGE (Lin, 2004) and
+embedding similarities, often yield scores that have low correlation with human
+judgements and are also not intuitively understandable, making it difficult to
+gauge the true quality of the summaries. LLMs can mimic humans in giving
+subjective reviews, but subjective scores are hard to interpret and justify.
+They can be easily manipulated by altering the models and the tones of the
+prompts. In this paper, we introduce a novel evaluation methodology and tooling
+designed to address these challenges, providing a more comprehensive, accurate
+and interpretable assessment of summarization outputs. Our method (SumAutoEval)
+proposes and evaluates metrics at varying granularity levels, giving objective
+scores on four key dimensions: completeness, correctness, alignment, and
+readability. We empirically demonstrate that SumAutoEval enhances the
+understanding of output quality with better human correlation.
+
+
+
+
+
+
+
+ ♻ ☆ Reasoning over Uncertain Text by Generative Large Language Models
+
+
+ This paper considers the challenges Large Language Models (LLMs) face when
+reasoning over text that includes information involving uncertainty explicitly
+quantified via probability values. This type of reasoning is relevant to a
+variety of contexts ranging from everyday conversations to medical
+decision-making. Despite improvements in the mathematical reasoning
+capabilities of LLMs, they still exhibit significant difficulties when it comes
+to probabilistic reasoning. To deal with this problem, we introduce the
+Bayesian Linguistic Inference Dataset (BLInD), a new dataset specifically
+designed to test the probabilistic reasoning capabilities of LLMs. We use BLInD
+to identify the limitations of LLMs on tasks involving probabilistic
+reasoning. In addition, we present several prompting strategies that map the
+problem to different formal representations, including Python code,
+probabilistic algorithms, and probabilistic logical programming. We conclude by
+providing an evaluation of our methods on BLInD and an adaptation of a causal
+reasoning question-answering dataset. Our empirical results highlight the
+effectiveness of our proposed strategies for multiple LLMs.
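As a concrete illustration of the "map the problem to Python code" strategy, a BLInD-style query (the probabilities here are invented for illustration) can be delegated to a short program instead of free-text reasoning:

```python
# Hypothetical BLInD-style problem: P(rain) = 0.3, P(wet | rain) = 0.9,
# P(wet | no rain) = 0.2.  Query: P(rain | wet)?
# The "map to Python code" strategy has the LLM emit a small program like
# this rather than reason over the probabilities in natural language.

p_rain = 0.3
p_wet_given_rain = 0.9
p_wet_given_dry = 0.2

# Law of total probability for the evidence term.
p_wet = p_wet_given_rain * p_rain + p_wet_given_dry * (1 - p_rain)

# Bayes' rule.
p_rain_given_wet = p_wet_given_rain * p_rain / p_wet

print(round(p_rain_given_wet, 4))
```

Executing the program then yields the exact posterior, sidestepping the arithmetic errors LLMs make in free text.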
+
+
+
+
+
+
+
+
+ Junhui He, Shangyu Wu, Weidong Wen, Chun Jason Xue, Qingan Li
+
+
+ Deploying large language models (LLMs) on edge devices presents significant
+challenges due to the substantial computational overhead and memory
+requirements. Activation sparsification can mitigate these resource challenges
+by reducing the number of activated neurons during inference. Existing methods
+typically employ thresholding-based sparsification based on the statistics of
+activation tensors. However, they do not model the impact of activation
+sparsification on performance, which leads to unnecessary degradation. To
+address these limitations, this paper reformulates the activation
+sparsification problem to explicitly capture the relationship between
+activation sparsity and model performance. Then, this paper proposes CHESS, a
+general activation sparsification approach via CHannel-wise thrEsholding and
+Selective Sparsification. First, channel-wise thresholding assigns a unique
+threshold to each activation channel in the feed-forward network (FFN) layers.
+Then, selective sparsification involves applying thresholding-based activation
+sparsification to specific layers within the attention modules. Finally, we
+detail the implementation of sparse kernels to accelerate LLM inference.
+Experimental results demonstrate that the proposed CHESS achieves lower
+performance degradation across eight downstream tasks while activating fewer
+parameters than existing methods, thus speeding up LLM inference by up to
+1.27x.
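A minimal numpy sketch of what channel-wise thresholding might look like, assuming a distinct threshold per channel calibrated from activation statistics; the percentile calibration below is illustrative, whereas CHESS calibrates thresholds against model performance:

```python
import numpy as np

def channelwise_sparsify(x, thresholds):
    """Zero out activations whose magnitude falls below the per-channel
    threshold.  x: (tokens, channels); thresholds: (channels,)."""
    mask = np.abs(x) >= thresholds  # thresholds broadcast over the token axis
    return x * mask

# Toy calibration: set each channel's threshold to the 60th percentile of
# its absolute activations, then sparsify.
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))                 # 4 tokens, 8 channels
th = np.percentile(np.abs(acts), 60, axis=0)   # one threshold per channel
sparse = channelwise_sparsify(acts, th)
print(f"kept {np.count_nonzero(sparse)} of {sparse.size} activations")
```

In a real deployment the zeroed neurons are skipped by sparse kernels, which is where the inference speedup comes from.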
+
+
+
+
+
+
+
+
+ Richard Ren, Steven Basart, Adam Khoja, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Alexander Pan, Gabriel Mukobi, Ryan H. Kim, Stephen Fitz, Dan Hendrycks
+
+
+ As artificial intelligence systems grow more powerful, there has been
+increasing interest in "AI safety" research to address emerging and future
+risks. However, the field of AI safety remains poorly defined and
+inconsistently measured, leading to confusion about how researchers can
+contribute. This lack of clarity is compounded by the unclear relationship
+between AI safety benchmarks and upstream general capabilities (e.g., general
+knowledge and reasoning). To address these issues, we conduct a comprehensive
+meta-analysis of AI safety benchmarks, empirically analyzing their correlation
+with general capabilities across dozens of models and providing a survey of
+existing directions in AI safety. Our findings reveal that many safety
+benchmarks highly correlate with both upstream model capabilities and training
+compute, potentially enabling "safetywashing"--where capability improvements
+are misrepresented as safety advancements. Based on these findings, we propose
+an empirical foundation for developing more meaningful safety metrics and
+define AI safety in a machine learning research context as a set of clearly
+delineated research goals that are empirically separable from generic
+capabilities advancements. In doing so, we aim to provide a more rigorous
+framework for AI safety research, advancing the science of safety evaluations
+and clarifying the path towards measurable progress.
+
+
+
+ comment: NeurIPS 2024
+
+
+
+
+
+
+ ♻ ☆ Context-aware Inductive Knowledge Graph Completion with Latent Type
+ Constraints and Subgraph Reasoning
+
+
+ Inductive knowledge graph completion (KGC) aims to predict missing triples
+with unseen entities. Recent works focus on modeling reasoning paths between
+the head and tail entity as direct supporting evidence. However, these methods
+depend heavily on the existence and quality of reasoning paths, which limits
+their general applicability in different scenarios. In addition, we observe
+that latent type constraints and neighboring facts inherent in KGs are also
+vital in inferring missing triples. To effectively utilize all useful
+information in KGs, we introduce CATS, a novel context-aware inductive KGC
+solution. With sufficient guidance from proper prompts and supervised
+fine-tuning, CATS activates the strong semantic understanding and reasoning
+capabilities of large language models to assess the existence of query triples.
+CATS consists of two modules. First, the type-aware reasoning module evaluates
+whether the candidate entity matches the latent entity type as required by the
+query relation. Then, the subgraph reasoning module selects relevant reasoning
+paths and neighboring facts, and evaluates their correlation to the query
+triple. Experiment results on three widely used datasets demonstrate that CATS
+significantly outperforms state-of-the-art methods in 16 out of 18
+transductive, inductive, and few-shot settings with an average absolute MRR
+improvement of 7.2%.
+
+
+
+
+
+
+
+ ♻ ☆ Intertwining CP and NLP: The Generation of Unreasonably Constrained
+ Sentences
+
+
+ Constrained text generation remains a challenging task, particularly when
+dealing with hard constraints. Traditional NLP approaches prioritize generating
+meaningful and coherent output, and current state-of-the-art methods often
+lack the expressiveness and constraint-satisfaction capabilities to handle
+such tasks effectively. Recently, a CP-based approach for generating
+constrained sentences was proposed by Bonlarron et al. (2023). This ad-hoc
+model, built to solve the sentence generation problem under MNREAD rules,
+nevertheless proved computationally and structurally unsuitable for other,
+more constrained problems. In this paper, a novel, more generic approach is
+introduced to tackle many of these previously intractable problems,
+illustrated here with the highly constrained sentence generation problem
+following RADNER rules.
+ More precisely, this paper presents the CPTextGen Framework. This framework
+considers a constrained text generation problem as a discrete combinatorial
+optimization problem. It is solved by a constraint programming method that
+combines linguistic properties (e.g., n-grams or language level) with other
+more classical constraints (e.g., the number of characters, syllables).
+Finally, a curation phase selects the best generated sentences according to
+their perplexity under an LLM.
+ The effectiveness of this approach is demonstrated by tackling a new, more
+tightly constrained text generation problem: the iconic RADNER sentences
+problem. This problem aims to generate sentences respecting a set of quite
+strict rules defined by their use in vision and clinical research. Thanks to
+our CP-based approach, many new strongly constrained sentences have been
+successfully generated. This highlights our approach's potential to handle
+unreasonably constrained text generation scenarios.
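The combinatorial-optimization view can be sketched as exhaustive search over a toy vocabulary subject to fluency (n-gram) and length constraints; the vocabulary, bigram table, and character bounds below are invented stand-ins for the MNREAD/RADNER rule sets, and a real CP solver replaces the brute-force enumeration:

```python
from itertools import product

# Toy vocabulary and allowed bigrams (illustrative assumptions, not the
# paper's linguistic resources).
vocab = ["the", "cat", "dog", "sat", "ran", "here"]
bigram_ok = {("the", "cat"), ("the", "dog"), ("cat", "sat"),
             ("dog", "ran"), ("sat", "here"), ("ran", "here")}

def satisfies(words, min_chars=10, max_chars=14):
    """Hard constraints: every adjacent pair must be an allowed bigram,
    and total character count (with spaces) must fall in a fixed range."""
    n_chars = sum(len(w) for w in words) + len(words) - 1
    fluent = all((a, b) in bigram_ok for a, b in zip(words, words[1:]))
    return fluent and min_chars <= n_chars <= max_chars

# Enumerate all 3-word candidates and keep only the feasible ones.
solutions = [" ".join(w) for w in product(vocab, repeat=3) if satisfies(w)]
print(solutions)
```

Constraint programming prunes this search space instead of enumerating it, which is what makes the approach scale to realistic rule sets.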
+
+
+
+ comment: Disambiguation and additional references
+
+
+
+
+
+
+
+ Yadong Li, Haoze Sun, Mingan Lin, Tianpeng Li, Guosheng Dong, Tao Zhang, Bowen Ding, Wei Song, Zhenglin Cheng, Yuqi Huo, Song Chen, Xu Li, Da Pan, Shusen Zhang, Xin Wu, Zheng Liang, Jun Liu, Tao Zhang, Keer Lu, Yaqi Zhao, Yanjun Shen, Fan Yang, Kaicheng Yu, Tao Lin, Jianhua Xu, Zenan Zhou, Weipeng Chen
+
+
+ The salient multimodal capabilities and interactive experience of GPT-4o
+highlight its critical role in practical applications, yet it lacks a
+high-performing open-source counterpart. In this paper, we introduce
+Baichuan-omni, the first open-source 7B Multimodal Large Language Model (MLLM)
+adept at concurrently processing and analyzing modalities of image, video,
+audio, and text, while delivering an advanced multimodal interactive experience
+and strong performance. We propose an effective multimodal training schema
+starting with a 7B model and proceeding through two stages of multimodal
+alignment and multitask fine-tuning across the audio, image, video, and text
+modalities.
+This approach equips the language model with the ability to handle visual and
+audio data effectively. Demonstrating strong performance across various
+omni-modal and multimodal benchmarks, we aim for this contribution to serve as
+a competitive baseline for the open-source community in advancing multimodal
+understanding and real-time interaction.
+
+
+
+
+
+
+
+ ♻ ☆ Preemptive Detection and Correction of Misaligned Actions in LLM Agents
+
+
+ Deploying LLM-based agents in real-life applications often faces a critical
+challenge: the misalignment between agents' behavior and user intent. Such
+misalignment may lead agents to unintentionally execute critical actions that
+carry negative outcomes (e.g., accidentally triggering a "buy-now" in web
+shopping), resulting in undesirable or even irreversible consequences. Although
+addressing these issues is crucial, the preemptive detection and correction of
+misaligned actions remains relatively underexplored. To fill this gap, we
+introduce InferAct, a novel approach that leverages the belief reasoning
+ability of LLMs, grounded in Theory-of-Mind, to detect misaligned actions
+before execution. Once the misalignment is detected, InferAct alerts users for
+timely correction, preventing adverse outcomes and enhancing the reliability of
+LLM agents' decision-making processes. Experiments on three widely used tasks
+demonstrate that InferAct achieves up to 20% improvements on Marco-F1 against
+baselines in misaligned action detection. An in-depth evaluation of
+misalignment correction further highlights InferAct's effectiveness in
+improving agent alignment.
+
+
+
+
+
+
+
+ ♻ ☆ MERT: Acoustic Music Understanding Model with Large-Scale
+ Self-supervised Training ICLR 2024
+
+
+
+
+
+
+
+
+ Yizhi Li, Ruibin Yuan, Ge Zhang, Yinghao Ma, Xingran Chen, Hanzhi Yin, Chenghao Xiao, Chenghua Lin, Anton Ragni, Emmanouil Benetos, Norbert Gyenge, Roger Dannenberg, Ruibo Liu, Wenhu Chen, Gus Xia, Yemin Shi, Wenhao Huang, Zili Wang, Yike Guo, Jie Fu
+
+
+ Self-supervised learning (SSL) has recently emerged as a promising paradigm
+for training generalisable models on large-scale data in the fields of vision,
+text, and speech. Although SSL has been proven effective in speech and audio,
+its application to music audio has yet to be thoroughly explored. This is
+partially due to the distinctive challenges associated with modelling musical
+knowledge, particularly tonal and pitched characteristics of music. To address
+this research gap, we propose an acoustic Music undERstanding model with
+large-scale self-supervised Training (MERT), which incorporates teacher models
+to provide pseudo labels in the masked language modelling (MLM) style acoustic
+pre-training. In our exploration, we identified an effective combination of
+teacher models that outperforms conventional speech and audio approaches.
+This combination includes an acoustic teacher based on
+Residual Vector Quantisation - Variational AutoEncoder (RVQ-VAE) and a musical
+teacher based on the Constant-Q Transform (CQT). Furthermore, we explore a wide
+range of settings to overcome the instability in acoustic language model
+pre-training, which allows our designed paradigm to scale from 95M to 330M
+parameters. Experimental results indicate that our model can generalise and
+perform well on 14 music understanding tasks and attain state-of-the-art (SOTA)
+overall scores.
+
+
+
+ comment: accepted by ICLR 2024
+
+
+
+
+
+
+ ♻ ☆ Blessing or curse? A survey on the Impact of Generative AI on Fake News
+
+
+
+
+
+
+
+
+ Alexander Loth, Martin Kappes, Marc-Oliver Pahl
+
+
+ Fake news significantly influences our society, affecting consumers, voters,
+and many other societal groups. While fake news has existed for centuries,
+generative AI takes it to a new level: it is now possible to automate the
+creation of masses of high-quality, individually targeted fake news. On the
+other hand, generative AI can also help detect fake news. Both fields are
+young but developing fast.
+ This survey provides a comprehensive examination of the research and
+practical use of Generative AI for Fake News detection and creation in 2024.
+Following the Structured Literature Survey approach, the paper synthesizes
+current results in the following topic clusters: 1) enabling technologies, 2)
+creation of fake news, 3) a case study of social media as the most relevant
+distribution channel, 4) detection of fake news, and 5) deepfakes as an
+emerging technology.
+ The article also identifies current challenges and open issues.
+
+
+
+ comment: 16 pages, 2 figures. Submitted to ACM Transactions on Intelligent
+ Systems and Technology (ACM TIST). Added references
+
+ Ontology matching (OM) enables semantic interoperability between different
+ontologies and resolves their conceptual heterogeneity by aligning related
+entities. OM systems currently have two prevailing design paradigms:
+conventional knowledge-based expert systems and newer machine learning-based
+predictive systems. While large language models (LLMs) and LLM agents have
+revolutionised data engineering and have been applied creatively in many
+domains, their potential for OM remains underexplored. This study introduces a
+novel agent-powered LLM-based design paradigm for OM systems. With
+consideration of several specific challenges in leveraging LLM agents for OM,
+we propose a generic framework, namely Agent-OM (Agent for Ontology Matching),
+consisting of two Siamese agents for retrieval and matching, with a set of OM
+tools. Our framework is implemented in a proof-of-concept system. Evaluations
+of three Ontology Alignment Evaluation Initiative (OAEI) tracks over
+state-of-the-art OM systems show that our system can achieve results very close
+to the long-standing best performance on simple OM tasks and can significantly
+improve the performance on complex and few-shot OM tasks.
+
+
+
+ comment: 19 pages, 12 figures, 3 tables
+
+
+
+
+
+
+ ♻ ☆ Mamba for Streaming ASR Combined with Unimodal Aggregation ICASSP 2025
+
+
+ This paper works on streaming automatic speech recognition (ASR). Mamba, a
+recently proposed state space model, has demonstrated the ability to match or
+surpass Transformers in various tasks while benefiting from a linear complexity
+advantage. We explore the efficiency of Mamba encoder for streaming ASR and
+propose an associated lookahead mechanism for leveraging controllable future
+information. Additionally, a streaming-style unimodal aggregation (UMA) method
+is implemented, which automatically detects token activity, triggers token
+output in a streaming fashion, and meanwhile aggregates feature frames to
+learn better token representations. Based on UMA, an early termination (ET) method
+is proposed to further reduce recognition latency. Experiments conducted on two
+Mandarin Chinese datasets demonstrate that the proposed model achieves
+competitive ASR performance in terms of both recognition accuracy and latency.
+
+
+
+ comment: Accepted by ICASSP 2025
+
+
+
+
+
+
+ ♻ ☆ LongDocURL: a Comprehensive Multimodal Long Document Benchmark
+ Integrating Understanding, Reasoning, and Locating
+
+
+
+
+
+
+
+
+ Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, Cheng-Lin Liu
+
+
+ Large vision language models (LVLMs) have improved the document understanding
+capabilities remarkably, enabling the handling of complex document elements,
+longer contexts, and a wider range of tasks. However, existing document
+understanding benchmarks have been limited to a small number of pages and fail
+to provide a comprehensive analysis of layout element locating. In this paper,
+we first define three primary task categories: Long Document Understanding,
+numerical Reasoning, and cross-element Locating, and then propose a
+comprehensive benchmark, LongDocURL, integrating the above three primary tasks
+and comprising 20 sub-tasks categorized by primary task and answer evidence.
+Furthermore, we develop a semi-automated construction pipeline and collect
+2,325 high-quality question-answering pairs covering more than 33,000 document
+pages, significantly exceeding existing benchmarks in scale. Subsequently, we
+conduct comprehensive evaluation experiments on
+both open-source and closed-source models across 26 different configurations,
+revealing critical performance gaps in this field.
+
+
+
+
+
+
+
+ ♻ ☆ Building a Taiwanese Mandarin Spoken Language Model: A First Attempt
+
+
+ This technical report presents our initial attempt to build a spoken large
+language model (LLM) for Taiwanese Mandarin, specifically tailored to enable
+real-time, speech-to-speech interaction in multi-turn conversations. Our
+end-to-end model incorporates a decoder-only transformer architecture and aims
+to achieve seamless interaction while preserving the conversational flow,
+including full-duplex capabilities allowing simultaneous speaking and
+listening. The paper also details the training process, including data
+preparation with synthesized dialogues and adjustments for real-time
+interaction. We also developed a platform to evaluate conversational fluency
+and response coherence in multi-turn dialogues. We hope the release of the
+report can contribute to the future development of spoken LLMs in Taiwanese
+Mandarin.
+
+
+
+ comment: Work in progress
+
+
+
+
+
+
+ ♻ ☆ Do LLMs Really Think Step-by-step In Implicit Reasoning?
+
+
+ It is well known that Chain-of-Thought (CoT) prompting can remarkably enhance
+LLMs' performance on complex tasks. However, because it also introduces slower
+inference and higher computational costs, many studies have attempted to use
+implicit CoT, which does not require LLMs to explicitly generate the
+intermediate steps. However, the invisible reasoning process raises a doubt:
+can implicit CoT really equal explicit CoT? Therefore, in this study, we
+address this question through experiments. We probe the information
+of intermediate steps from the model's hidden states when it is either trained
+or prompted to perform implicit CoT. The results surprisingly indicate that
+when prompted, LLMs hardly think about intermediate steps, suggesting they may
+just rely on experience rather than strict step-by-step reasoning. But when
+trained, they indeed calculate intermediate steps. Moreover, in both
+situations, we find the effect of using implicit CoT is susceptible to the
+format of the problem, reaffirming the current deficiency of implicit CoT.
+
+
+
+
+
+
+
+ ♻ ☆ A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns
+ Well with The Key Tokens
+
+
+ Text embeddings from large language models (LLMs) have achieved excellent
+results in tasks such as information retrieval, semantic textual similarity,
+etc. In this work, we show an interesting finding: when feeding a text into the
+LLM-based embedder, the resulting text embedding can be aligned with
+the key tokens in the input text. We first fully analyze this phenomenon on
+eight LLM-based embedders and show that this phenomenon is universal and is not
+affected by model architecture, training strategy, or embedding method. With a
+deeper analysis, we find that the main change in embedding space between these
+embedders and their LLM backbones is in the first principal component. By
+adjusting the first principal component, we can align text embedding with the
+key tokens. Finally, we give several examples to demonstrate the vast
+application potential of this finding: (1) we propose a simple and practical
+sparse retrieval method based on the aligned tokens, which can achieve 80% of
+the dense retrieval effect of the same model while reducing the computation
+significantly; (2) we show that our findings provide a novel perspective to
+help understand novel technologies (e.g., instruction-following embedding) and
+fuzzy concepts (e.g., semantic relatedness vs. similarity) in this field.
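The aligned-key-tokens finding suggests a simple sparse retrieval recipe: score every vocabulary token against the text embedding and keep the top-k as a sparse lexical surrogate. The toy vocabulary and vectors below are illustrative assumptions, not the paper's embedders:

```python
import numpy as np

# Toy token-embedding matrix (one row per vocabulary token).
vocab = ["cat", "dog", "pet", "car", "road"]
token_emb = np.array([
    [0.9, 0.1, 0.0],   # cat
    [0.8, 0.2, 0.1],   # dog
    [0.7, 0.3, 0.0],   # pet
    [0.0, 0.1, 0.9],   # car
    [0.1, 0.0, 0.8],   # road
])

def top_k_tokens(text_emb, k=2):
    """Return the k tokens whose embeddings align best with the text."""
    scores = token_emb @ text_emb          # dot-product alignment scores
    top = np.argsort(scores)[::-1][:k]     # highest-scoring token ids
    return [vocab[i] for i in top]

query_emb = np.array([1.0, 0.2, 0.0])      # embedding of a pet-related text
print(top_k_tokens(query_emb))             # -> ['cat', 'dog']
```

The returned tokens can then feed an inverted index, which is how a sparse method can recover a large share of the dense retriever's quality at much lower cost.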
+
+
+
+ comment: Work in Progress
+
+
+
+
+
+
+ ♻ ☆ Multi-Agent Collaboration in Incident Response with Large Language
+ Models
+
+
+ Incident response (IR) is a critical aspect of cybersecurity, requiring rapid
+decision-making and coordinated efforts to address cyberattacks effectively.
+Leveraging large language models (LLMs) as intelligent agents offers a novel
+approach to enhancing collaboration and efficiency in IR scenarios. This paper
+explores the application of LLM-based multi-agent collaboration using the
+Backdoors & Breaches framework, a tabletop game designed for cybersecurity
+training. We simulate real-world IR dynamics through various team structures,
+including centralized, decentralized, and hybrid configurations. By analyzing
+agent interactions and performance across these setups, we provide insights
+into optimizing multi-agent collaboration for incident response. Our findings
+highlight the potential of LLMs to enhance decision-making, improve
+adaptability, and streamline IR processes, paving the way for more effective
+and coordinated responses to cyber threats.
+
+
+
+
+
+
+
+ ♻ ☆ Exploring Parameter-Efficient Fine-Tuning Techniques for Code Generation
+ with Large Language Models
+
+
+
+
+
+
+
+
+ Martin Weyssow, Xin Zhou, Kisub Kim, David Lo, Houari Sahraoui
+
+
+ Large language models (LLMs) demonstrate impressive capabilities to generate
+accurate code snippets given natural language intents in a zero-shot manner,
+i.e., without the need for specific fine-tuning. While prior studies have
+highlighted the advantages of fine-tuning LLMs, this process incurs high
+computational costs, making it impractical in resource-scarce environments,
+particularly for models with billions of parameters. To address these
+challenges, previous research explored in-context learning (ICL) and
+retrieval-augmented generation (RAG) as strategies to guide the LLM generative
+process with task-specific prompt examples. However, ICL and RAG introduce
+inconveniences, such as the need for designing contextually relevant prompts
+and the absence of learning task-specific parameters, thereby limiting
+downstream task performance. In this context, we foresee parameter-efficient
+fine-tuning (PEFT) as a promising approach to efficiently specialize LLMs to
+task-specific data while maintaining reasonable resource consumption. In this
+paper, we deliver a comprehensive study of PEFT techniques for LLMs in the
+context of automated code generation. Our investigation reveals the
+superiority and potential of PEFT over ICL and RAG
+across a diverse set of LLMs and three representative Python code generation
+datasets: Conala, CodeAlpacaPy, and APPS. Furthermore, our study highlights the
+potential for tuning larger LLMs and significant reductions in memory usage by
+combining PEFT with quantization. Therefore, this study opens opportunities for
+broader applications of PEFT in software engineering scenarios. Our code is
+available at https://github.com/martin-wey/peft-llm-code/.
+
+
+
+
+
+
+
+ ♻ ☆ CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language
+ Models to Coding Preferences
+
+
+ Evaluating the alignment of large language models (LLMs) with user-defined
+coding preferences is a challenging endeavour that requires a deep assessment
+of LLMs' outputs. Existing methods and benchmarks rely primarily on automated
+metrics and static analysis tools, which often fail to capture the nuances of
+user instructions and LLM outputs. To address this gap, we propose using the
+LLM-as-a-Judge methodology to evaluate the alignment of LLMs with coding
+preferences. Based on this approach, we present CodeUltraFeedback, a
+comprehensive dataset designed to facilitate the evaluation and improvement of
+LLM alignment. CodeUltraFeedback consists of 10,000 coding instructions, each
+annotated with four responses generated from a diverse pool of 14 LLMs. These
+responses are ranked based on five distinct coding preferences using GPT-3.5 as
+a judge, providing both numerical scores and detailed textual feedback. Our
+analysis of CodeUltraFeedback reveals that responses from GPT-3.5 and GPT-4 are
+generally preferred over those from open-weight LLMs, highlighting significant
+differences in alignment between closed and open-weight models. In turn, we
+explore the usage of CodeUltraFeedback as feedback data to fine-tune and align
+CodeLlama-7B-Instruct using supervised fine-tuning (SFT) and reinforcement
+learning from AI feedback (RLAIF) with direct preference optimization (DPO).
+The resulting aligned CodeLlama-7B-Instruct model outperforms larger LLMs in
+terms of alignment with coding preferences and shows improved functional
+correctness on the HumanEval+ benchmark compared to the original instruct
+model. Therefore, our contributions bridge the gap in preference tuning of LLMs
+for code and set the stage for further advancements in model alignment and
+RLAIF in automated software engineering.
+
+
+
+
+
+
+
+ ♻ ☆ Model Fusion through Bayesian Optimization in Language Model Fine-Tuning
+
+
+
+
+
+
+
+
+ Chaeyun Jang, Hyungi Lee, Jungtaek Kim, Juho Lee
+
+
+ Fine-tuning pre-trained models for downstream tasks is a widely adopted
+technique known for its adaptability and reliability across various domains.
+Despite its conceptual simplicity, fine-tuning entails several troublesome
+engineering choices, such as selecting hyperparameters and determining
+checkpoints from an optimization trajectory. To tackle the difficulty of
+choosing the best model, one effective solution is model fusion, which combines
+multiple models in a parameter space. However, we observe a large discrepancy
+between loss and metric landscapes during the fine-tuning of pre-trained
+language models. Building on this observation, we introduce a novel model
+fusion technique that optimizes both the desired metric and loss through
+multi-objective Bayesian optimization. In addition, to effectively select
+hyperparameters, we establish a two-stage procedure by integrating Bayesian
+optimization processes into our framework. Experiments across various
+downstream tasks show considerable performance improvements using our Bayesian
+optimization-guided method.
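The fusion step can be sketched as a weighted average of checkpoint parameters with the mixing weight chosen against a validation score; a one-dimensional grid search stands in for the paper's multi-objective Bayesian optimization, and the checkpoints and scorer below are toy assumptions:

```python
import numpy as np

# Toy "checkpoints" from a fine-tuning trajectory and a stand-in optimum.
ckpt_a = np.array([0.0, 2.0, 1.0])
ckpt_b = np.array([2.0, 0.0, 1.0])
target = np.array([1.0, 1.0, 1.0])

def val_score(params):
    # Higher is better: negative distance to the stand-in optimum.
    return -np.linalg.norm(params - target)

# Search the mixing weight that maximizes the validation score.
best_w, best_s = max(
    ((w, val_score(w * ckpt_a + (1 - w) * ckpt_b))
     for w in np.linspace(0, 1, 11)),
    key=lambda ws: ws[1],
)
fused = best_w * ckpt_a + (1 - best_w) * ckpt_b
print(best_w, fused)
```

Bayesian optimization replaces the grid with a surrogate model, which matters once several checkpoints and two objectives (loss and metric) are mixed at once.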
+
+
+
+
+
+
+
+ ♻ ☆ Aurora-M: Open Source Continual Pre-training for Multilingual Language
+ and Code
+
+
+
+
+
+
+
+
+ Taishi Nakamura, Mayank Mishra, Simone Tedeschi, Yekun Chai, Jason T Stillerman, Felix Friedrich, Prateek Yadav, Tanmay Laud, Vu Minh Chien, Terry Yue Zhuo, Diganta Misra, Ben Bogin, Xuan-Son Vu, Marzena Karpinska, Arnav Varma Dantuluri, Wojciech Kusa, Tommaso Furlanello, Rio Yokota, Niklas Muennighoff, Suhas Pai, Tosin Adewumi, Veronika Laippala, Xiaozhe Yao, Adalberto Junior, Alpay Ariyak, Aleksandr Drozd, Jordan Clive, Kshitij Gupta, Liangyu Chen, Qi Sun, Ken Tsui, Noah Persaud, Nour Fahmy, Tianlong Chen, Mohit Bansal, Nicolo Monti, Tai Dang, Ziyang Luo, Tien-Tung Bui, Roberto Navigli, Virendra Mehta, Matthew Blumberg, Victor May, Huu Nguyen, Sampo Pyysalo
+
+
+ Pretrained language models are an integral part of AI applications, but their
+high computational cost for training limits accessibility. Initiatives such as
+Bloom and StarCoder aim to democratize access to pretrained models for
+collaborative community development. Despite these efforts, such models
+encounter challenges such as limited multilingual capabilities, risks of
+catastrophic forgetting during continual pretraining, and the high costs of
+training models from scratch, alongside the need to align with AI safety
+standards and regulatory frameworks.
+ This paper presents Aurora-M, a 15B parameter multilingual open-source model
+trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually
+pretrained from StarCoderPlus on 435B additional tokens, Aurora-M surpasses 2T
+tokens in total training token count. It is the first open-source multilingual
+model fine-tuned on human-reviewed safety instructions, thus aligning its
+development not only with conventional red-teaming considerations, but also
+with the specific concerns articulated in the Biden-Harris Executive Order on
+the Safe, Secure, and Trustworthy Development and Use of Artificial
+Intelligence.
+ We evaluate Aurora-M across a wide range of tasks and languages, showcasing
+its robustness against catastrophic forgetting and its superior performance in
+multilingual settings, particularly in safety evaluations. We open-source
+Aurora-M and its variants to encourage responsible open-source development of
+large language models at https://huggingface.co/aurora-m.
+
+
+
+ comment: Preprint
+
+
+
+
+
+
+ ♻ ☆ Cracking the Code of Hallucination in LVLMs with Vision-aware Head
+ Divergence
+
+
+ Large vision-language models (LVLMs) have made substantial progress in
+integrating large language models (LLMs) with visual inputs, enabling advanced
+multimodal reasoning. Despite their success, a persistent challenge is
+hallucination, where generated text fails to accurately reflect visual
+content, undermining both accuracy and reliability. Existing methods focus on
+alignment training or decoding refinements but primarily address symptoms at
+the generation stage without probing the underlying causes. In this work, we
+investigate the internal mechanisms driving hallucination in LVLMs, with an
+emphasis on the multi-head attention module. Specifically, we introduce
+Vision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of
+attention head outputs to visual context. Using VHD, we reveal the presence of
+vision-aware attention heads that are more attuned to visual information, and
+we find that the model's overreliance on its language priors is closely
+related to hallucination. Building on these insights, we propose
+Vision-aware Head Reinforcement (VHR), a training-free approach to mitigate
+hallucination by enhancing the role of vision-aware attention heads. Extensive
+experiments demonstrate that our method achieves superior performance compared
+to state-of-the-art approaches in mitigating hallucinations, while maintaining
+high efficiency with negligible additional time overhead.
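One way to sketch a VHD-style score, under the assumption that a head's vision awareness is measured as the change in its output when the visual context is removed; random arrays stand in for real LVLM activations:

```python
import numpy as np

# Stand-in head outputs for one generation step: 8 heads, 16 dims each.
rng = np.random.default_rng(1)
n_heads, d_head = 8, 16
out_with_image = rng.normal(size=(n_heads, d_head))

# Simulate a run without the image: only heads 0-3 change their output.
out_without_image = out_with_image.copy()
out_without_image[:4] += rng.normal(scale=2.0, size=(4, d_head))

# Divergence per head: how much the image shifts the head's output.
vhd = np.linalg.norm(out_with_image - out_without_image, axis=1)
vision_aware = np.argsort(vhd)[::-1][:4]   # most image-sensitive heads
print(sorted(vision_aware.tolist()))
```

A training-free mitigation like VHR would then upweight the contribution of the high-divergence heads at decoding time.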
+
+
+
+
+
+
+
+ ♻ ☆ Rules still work for Open Information Extraction
+
+
+
+
+
+
+
+
+ Jialin Hua, Liangqing Luo, Weiying Ping, Yan Liao, Chunhai Tao, Xuewen Lub
+
+
+ Open information extraction (OIE) aims to extract surface relations and their
+corresponding arguments from natural language text, irrespective of domain.
+This paper presents an innovative OIE model, APRCOIE, tailored for Chinese
+text. Diverging from previous models, our model generates extraction patterns
+autonomously. The model defines a new pattern form for Chinese OIE and proposes
+an automated pattern generation methodology. In that way, the model can handle
+a wide array of complex and diverse Chinese grammatical phenomena. We design a
+preliminary filter based on tensor computing to conduct the extraction
+procedure efficiently. To train the model, we manually annotated a large-scale
+Chinese OIE dataset. In the comparative evaluation, we demonstrate that APRCOIE
+outperforms state-of-the-art Chinese OIE models and significantly expands the
+boundaries of achievable OIE performance. The code of APRCOIE and the annotated
+dataset are released on GitHub (https://github.com/jialin666/APRCOIE_v1).
+
+
+
+
+
+
+
+ ♻ ☆ INSIGHTBUDDY-AI: Medication Extraction and Entity Linking using Large
+ Language Models and Ensemble Learning
+
+
+ Medication Extraction and Mining play an important role in healthcare NLP
+research due to their practical applications in hospital settings, such as their
+mapping into standard clinical knowledge bases (SNOMED-CT, BNF, etc.). In this
+work, we investigate state-of-the-art LLMs in text mining tasks on medications
+and their related attributes such as dosage, route, strength, and adverse
+effects. In addition, we explore different ensemble learning methods
+(\textsc{Stack-Ensemble} and \textsc{Voting-Ensemble}) to augment the model
+performances from individual LLMs. Our ensemble learning result demonstrated
+better performances than individually fine-tuned base models BERT, RoBERTa,
+RoBERTa-L, BioBERT, BioClinicalBERT, BioMedRoBERTa, ClinicalBERT, and
+PubMedBERT across general and specific domains. Finally, we build up an entity
+linking function to map extracted medical terminologies into the SNOMED-CT
+codes and the British National Formulary (BNF) codes, which are further mapped
+to the Dictionary of Medicines and Devices (dm+d), and ICD. Our model's toolkit
+and desktop applications are publicly available (at
+\url{https://github.com/HECTA-UoM/ensemble-NER}).
+
+
+ This paper investigates how the topical flow of dyadic conversations emerges
+over time and how differences in interlocutors' personality traits contribute
+to this topical flow. Leveraging text embeddings, we map the trajectories of $N
+= 1655$ conversations between strangers into a high-dimensional space. Using
+nonlinear projections and clustering, we then identify when each interlocutor
+enters and exits various topics. Differences in conversational flow are
+quantified via $\textit{topic entropy}$, a summary measure of the "spread" of
+topics covered during a conversation, and $\textit{linguistic alignment}$, a
+time-varying measure of the cosine similarity between interlocutors'
+embeddings. Our findings suggest that interlocutors with a larger difference in
+the personality dimension of openness influence each other to spend more time
+discussing a wider range of topics and that interlocutors with a larger
+difference in extraversion experience a larger decrease in linguistic alignment
+throughout their conversation. We also examine how participants' affect
+(emotion) changes from before to after a conversation, finding that a larger
+difference in extraversion predicts a larger difference in affect change and
+that a greater topic entropy predicts a larger affect increase. This work
+demonstrates how communication research can be advanced through the use of
+high-dimensional NLP methods and identifies personality difference as an
+important driver of social influence.
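The two summary measures are simple to compute. As a minimal sketch (illustrative only; the paper first clusters embeddings to identify topics, which is elided here), topic entropy is the Shannon entropy of the share of conversation time spent per topic, and linguistic alignment is the cosine similarity between interlocutors' embedding vectors:

```python
import math

def topic_entropy(topic_durations):
    """Shannon entropy (bits) of the share of conversation time per topic."""
    total = sum(topic_durations)
    probs = [d / total for d in topic_durations if d > 0]
    return -sum(p * math.log2(p) for p in probs)

def cosine_alignment(u, v):
    """Cosine similarity between two interlocutors' embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# A conversation split evenly over 4 topics has maximal "spread":
print(topic_entropy([5, 5, 5, 5]))   # 2.0 bits
print(topic_entropy([20, 0, 0, 0]))  # 0.0 bits: a single topic, no spread
```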
+
+
+
+ comment: Published in the Proceedings of the Second Workshop on Social
+ Influence in Conversations (SICon 2024), co-located with EMNLP 2024. This
+ version corrects a labeling error in Table 1
+
+
+
+
+
+
+
+
+
+ Information Retrieval 4
+
+
+
+
+
+ ♻ ☆ Lusifer: LLM-based User SImulated Feedback Environment for online
+ Recommender systems
+
+
+
+
+
+
+
+
+ Danial Ebrat, Eli Paradalis, Luis Rueda
+
+
+ Training reinforcement learning-based recommender systems is often hindered
+by the lack of dynamic and realistic user interactions. To address this
+limitation, we introduce Lusifer, a novel environment leveraging Large Language
+Models (LLMs) to generate simulated user feedback. Lusifer synthesizes user
+profiles and interaction histories to simulate responses and behaviors toward
+recommended items, with profiles updated after each rating to reflect evolving
+user characteristics. Utilizing the MovieLens dataset as a proof of concept, we
+limited our implementation to the last 40 interactions for each user,
+representing approximately 39% and 22% of the training sets, to focus on recent
+user behavior. For consistency and to gain insights into the performance of
+traditional methods with limited data, we implemented baseline approaches using
+the same data subset. Our results demonstrate that Lusifer accurately emulates
+user behavior and preferences even with reduced training data, achieving an
+RMSE of 1.3 across various test sets. This paper presents Lusifer's operational
+pipeline, including prompt generation and iterative user profile updates, and
+compares its performance against baseline methods. The findings validate
+Lusifer's ability to produce realistic dynamic feedback and suggest that it
+offers a scalable and adjustable framework for user simulation in online
+reinforcement learning recommender systems for future studies, particularly
+when training data is limited.
+
+
+ Ontology matching (OM) enables semantic interoperability between different
+ontologies and resolves their conceptual heterogeneity by aligning related
+entities. OM systems currently have two prevailing design paradigms:
+conventional knowledge-based expert systems and newer machine learning-based
+predictive systems. While large language models (LLMs) and LLM agents have
+revolutionised data engineering and have been applied creatively in many
+domains, their potential for OM remains underexplored. This study introduces a
+novel agent-powered LLM-based design paradigm for OM systems. With
+consideration of several specific challenges in leveraging LLM agents for OM,
+we propose a generic framework, namely Agent-OM (Agent for Ontology Matching),
+consisting of two Siamese agents for retrieval and matching, with a set of OM
+tools. Our framework is implemented in a proof-of-concept system. Evaluations
+of three Ontology Alignment Evaluation Initiative (OAEI) tracks over
+state-of-the-art OM systems show that our system can achieve results very close
+to the long-standing best performance on simple OM tasks and can significantly
+improve the performance on complex and few-shot OM tasks.
+
+
+ Sequential recommendation (SR) systems predict user preferences by analyzing
+time-ordered interaction sequences. A common challenge for SR is data sparsity,
+as users typically interact with only a limited number of items. While
+contrastive learning has been employed in previous approaches to address this
+challenge, these methods often adopt binary labels, missing finer patterns and
+overlooking detailed information in subsequent behaviors of users.
+Additionally, they rely on random sampling to select negatives in contrastive
+learning, which may not yield sufficiently hard negatives during later training
+stages. In this paper, we propose Future data utilization with Enduring
+Negatives for contrastive learning in sequential Recommendation (FENRec). Our
+approach aims to leverage future data with time-dependent soft labels and
+generate enduring hard negatives from existing data, thereby enhancing the
+effectiveness in tackling data sparsity. Experiment results demonstrate our
+state-of-the-art performance across four benchmark datasets, with an average
+improvement of 6.16\% across all metrics.
+
+
+
+ comment: Accepted by AAAI 2025, Our code is available at
+ https://github.com/uikdwnd/FENRec
+
+
+
+
+
+
+ ♻ ☆ A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns
+ Well with The Key Tokens
+
+
+ Text embeddings from large language models (LLMs) have achieved excellent
+results in tasks such as information retrieval, semantic textual similarity,
+etc. In this work, we show an interesting finding: when feeding a text into an
+LLM-based embedder, the resulting text embedding can be aligned with the key
+tokens in the input text. We first analyze this phenomenon on eight LLM-based
+embedders and show that it is universal and unaffected by model architecture,
+training strategy, or embedding method. With a
+deeper analysis, we find that the main change in embedding space between these
+embedders and their LLM backbones is in the first principal component. By
+adjusting the first principal component, we can align text embedding with the
+key tokens. Finally, we give several examples to demonstrate the vast
+application potential of this finding: (1) we propose a simple and practical
+sparse retrieval method based on the aligned tokens, which can achieve 80% of
+the dense retrieval effect of the same model while reducing the computation
+significantly; (2) we show that our findings provide a novel perspective to
+help understand novel technologies (e.g., instruction-following embedding) and
+fuzzy concepts (e.g., semantic relatedness vs. similarity) in this field.
+
+
+
+ comment: Work in Progress
+
+
+
+
+
+
+
+
+
+ Machine Learning 87
+
+
+
+
+
+ ☆ LASER: A new method for locally adaptive nonparametric regression
+
+
+ In this article, we introduce \textsf{LASER} (Locally Adaptive Smoothing
+Estimator for Regression), a computationally efficient locally adaptive
+nonparametric regression method that performs variable bandwidth local
+polynomial regression. We prove that it adapts (near-)optimally to the local
+H\"{o}lder exponent of the underlying regression function
+\texttt{simultaneously} at all points in its domain. Furthermore, we show that
+there is a single ideal choice of a global tuning parameter under which the
+above mentioned local adaptivity holds. Despite the vast literature on
+nonparametric regression, instances of practicable methods with provable
+guarantees of such a strong notion of local adaptivity are rare. The proposed
+method achieves excellent performance across a broad range of numerical
+experiments in comparison to popular alternative locally adaptive methods.
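LASER itself chooses a variable bandwidth per point; as a generic illustration of the underlying primitive only (a plain local linear fit with a fixed bandwidth, not the authors' bandwidth-selection rule), the fitted value at a point $x_0$ is the intercept of a kernel-weighted least-squares line:

```python
import math

def local_linear(x0, xs, ys, h):
    """Local linear fit at x0 with Gaussian kernel bandwidth h, solved via
    the 2x2 weighted least-squares normal equations; returns the intercept,
    i.e. the fitted value at x0."""
    w = [math.exp(-((x - x0) / h) ** 2 / 2) for x in xs]
    s0 = sum(w)
    s1 = sum(wi * (x - x0) for wi, x in zip(w, xs))
    s2 = sum(wi * (x - x0) ** 2 for wi, x in zip(w, xs))
    t0 = sum(wi * y for wi, y in zip(w, ys))
    t1 = sum(wi * (x - x0) * y for wi, x, y in zip(w, xs, ys))
    det = s0 * s2 - s1 * s1
    return (s2 * t0 - s1 * t1) / det

xs = [i / 10 for i in range(11)]
ys = [2 * x + 1 for x in xs]          # exactly linear data
print(local_linear(0.5, xs, ys, h=0.2))
```

A local linear fit reproduces a linear regression function exactly at any bandwidth, which is why the interesting question (the one LASER answers) is how to pick h locally when the function's smoothness varies.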
+
+
+
+ comment: 29 pages, 6 figures
+
+
+
+
+
+
+ ☆ InfAlign: Inference-aware language model alignment
+
+
+
+
+
+
+
+
+ Ananth Balashankar, Ziteng Sun, Jonathan Berant, Jacob Eisenstein, Michael Collins, Adrian Hutter, Jong Lee, Chirag Nagpal, Flavien Prost, Aradhana Sinha, Ananda Theertha Suresh, and Ahmad Beirami
+
+
+ Language model alignment has become a critical step in training modern
+generative language models. The goal of alignment is to finetune a reference
+model such that the win rate of a sample from the aligned model over a sample
+from the reference model is high, subject to a KL divergence constraint. Today,
+we are increasingly using inference-time algorithms (e.g., Best-of-N,
+controlled decoding, tree search) to decode from language models rather than
+standard sampling. However, the alignment objective does not capture such
+inference-time decoding procedures. We show that the existing alignment
+framework is sub-optimal in view of such inference-time methods. We then modify
+the alignment objective and propose a framework for inference-aware alignment
+(IAPO). We prove that for any inference-time decoding algorithm, the optimal
+solution that optimizes the inference-time win rate of the aligned policy
+against the reference policy is the solution to the typical RLHF problem with a
+transformation of the reward. This motivates us to provide the KL-regularized
+calibrate-and-transform RL (CTRL) algorithm to solve this problem, which
+involves a reward calibration step and a KL-regularized reward maximization
+step with a transformation of the calibrated reward. We particularize our study
+to two important inference-time strategies: best-of-N sampling and best-of-N
+jailbreaking, where N responses are sampled from the model and the one with the
+highest or lowest reward is selected. We propose specific transformations for
+these strategies and demonstrate that our framework offers significant
+improvements over existing state-of-the-art methods for language model
+alignment. Empirically, we outperform baselines that are designed without
+taking inference-time decoding into consideration by 8-12% and 4-9% on
+inference-time win rates over the Anthropic helpfulness and harmlessness dialog
+benchmark datasets.
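As a minimal illustration of the best-of-N setting (a sketch only, not the paper's CTRL algorithm; the Gaussian reward distributions and the empirical-quantile calibration below are assumptions for the toy), N responses are drawn and the one with the highest calibrated reward is kept:

```python
import random

def calibrate(r, reference_rewards):
    """Calibrated reward: the empirical quantile of r among rewards of
    samples from the reference policy."""
    return sum(x <= r for x in reference_rewards) / len(reference_rewards)

def best_of_n(sample, reward, n, rng):
    """Best-of-N decoding: draw n candidates, keep the highest-reward one."""
    return max((sample(rng) for _ in range(n)), key=reward)

rng = random.Random(0)
ref = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # toy reference-policy rewards
pick = best_of_n(lambda g: g.gauss(0.0, 1.0), lambda y: calibrate(y, ref),
                 n=8, rng=rng)
print(round(calibrate(pick, ref), 2))  # typically a high quantile under best-of-8
```

Best-of-N jailbreaking is the same loop with `min` in place of `max`.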
+
+
+
+
+
+
+
+ ☆ Machine Learning for Sentiment Analysis of Imported Food in Trinidad and
+ Tobago
+
+
+ This research investigates the performance of various machine learning
+algorithms (CNN, LSTM, VADER, and RoBERTa) for sentiment analysis of Twitter
+data related to imported food items in Trinidad and Tobago. The study addresses
+three primary research questions: the comparative accuracy and efficiency of
+the algorithms, the optimal configurations for each model, and the potential
+applications of the optimized models in a live system for monitoring public
+sentiment and its impact on the import bill. The dataset comprises tweets from
+2018 to 2024, divided into imbalanced, balanced, and temporal subsets to assess
+the impact of data balancing and the COVID-19 pandemic on sentiment trends. Ten
+experiments were conducted to evaluate the models under various configurations.
+Results indicated that VADER outperformed the other models in both multi-class
+and binary sentiment classifications. The study highlights significant changes
+in sentiment trends pre- and post-COVID-19, with implications for import
+policies.
+
+
+
+ comment: 27 pages
+
+
+
+
+
+
+ ☆ Tensor Network Estimation of Distribution Algorithms
+
+
+
+
+
+
+
+
+ John Gardiner, Javier Lopez-Piqueres
+
+
+ Tensor networks are a tool first employed in the context of many-body quantum
+physics that now have a wide range of uses across the computational sciences,
+from numerical methods to machine learning. Methods integrating tensor networks
+into evolutionary optimization algorithms have appeared in the recent
+literature. In essence, these methods can be understood as replacing the
+traditional crossover operation of a genetic algorithm with a tensor
+network-based generative model. We investigate these methods from the point of
+view that they are Estimation of Distribution Algorithms (EDAs). We find that
+the optimization performance of these methods is not related to the power of
+the generative model in a straightforward way. Generative models that are better
+(in the sense that they better model the distribution from which their training
+data is drawn) do not necessarily result in better performance of the
+optimization algorithm they form a part of. This raises the question of how
+best to incorporate powerful generative models into optimization routines. In
+light of this we find that adding an explicit mutation operator to the output
+of the generative model often improves optimization performance.
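The finding can be reproduced in miniature with a generic EDA. In the sketch below (a product-of-marginals model stands in for the tensor-network generator, and OneMax is the toy objective; both are illustrative choices, not the paper's setup), note where the explicit mutation operator acts: on the samples the generative model emits, before selection.

```python
import random

def eda_onemax(n_bits=20, pop=40, elite=10, gens=30, mut=0.02, seed=0):
    """Minimal EDA: each generation, sample from the model, mutate the
    samples, select the elite, and refit the model (here: independent
    per-bit marginals) to the elite. Returns the best OneMax score seen."""
    rng = random.Random(seed)
    p = [0.5] * n_bits
    best = 0
    for _ in range(gens):
        samples = [[int(rng.random() < pi) for pi in p] for _ in range(pop)]
        for s in samples:                      # explicit mutation operator
            for i in range(n_bits):
                if rng.random() < mut:
                    s[i] ^= 1
        samples.sort(key=sum, reverse=True)
        best = max(best, sum(samples[0]))
        top = samples[:elite]
        p = [sum(s[i] for s in top) / elite for i in range(n_bits)]
    return best

print(eda_onemax())  # typically reaches or nears the optimum of 20
```

Without the mutation step, a marginal that collapses to 0 or 1 early can never recover; the mutation operator keeps the sampled population diverse even when the generative model has converged, which is one mechanism behind the improvement the authors observe.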
+
+
+
+
+
+
+
+ ☆ Symbolic Approximations to Ricci-flat Metrics Via Extrinsic Symmetries
+ of Calabi-Yau Hypersurfaces
+
+
+ Ever since Yau's non-constructive existence proof of Ricci-flat metrics on
+Calabi-Yau manifolds, finding their explicit construction remains a major
+obstacle to development of both string theory and algebraic geometry. Recent
+computational approaches employ machine learning to create novel neural
+representations for approximating these metrics, offering high accuracy but
+limited interpretability. In this paper, we analyse machine learning
+approximations to flat metrics of Fermat Calabi-Yau n-folds and some of their
+one-parameter deformations in three dimensions in order to discover their new
+properties. We formalise cases in which the flat metric has more symmetries
+than the underlying manifold, and prove that these symmetries imply that the
+flat metric admits a surprisingly compact representation for certain choices of
+complex structure moduli. We show that such symmetries uniquely determine the
+flat metric on certain loci, for which we present an analytic form. We also
+incorporate our theoretical results into neural networks to achieve
+state-of-the-art reductions in Ricci curvature for multiple Calabi-Yau
+manifolds. We conclude by distilling the ML models to obtain for the first time
+closed-form expressions for K\"ahler metrics with near-zero scalar curvature.
+
+
+
+ comment: 40 pages, 14 figures
+
+
+
+
+
+
+ ☆ Analysis of Premature Death Rates in Texas Counties: The Impact of Air
+ Quality, Socioeconomic Factors, and COPD Prevalence
+
+
+ Understanding factors contributing to premature mortality is critical for
+public health planning. This study examines the relationships between premature
+death rates and multiple risk factors across several Texas counties, utilizing
+EPA air quality data, Census information, and county health records from recent
+years. We analyze the impact of air quality (PM2.5 levels), socioeconomic
+factors (median household income), and health conditions (COPD prevalence)
+through statistical analysis and modeling techniques. Results reveal COPD
+prevalence as a strong predictor of premature death rates, with higher
+prevalence associated with a substantial increase in years of potential life
+lost. While socioeconomic factors show a significant negative correlation, air
+quality demonstrates more complex indirect relationships. These findings
+emphasize the need for integrated public health interventions that prioritize
+key health conditions while addressing underlying socioeconomic disparities.
+
+
+
+ comment: 5 pages
+
+
+
+
+
+
+ ☆ Fortran2CPP: Automating Fortran-to-C++ Migration using LLMs via
+ Multi-Turn Dialogue and Dual-Agent Integration
+
+
+
+
+
+
+
+
+ Le Chen, Bin Lei, Dunzhi Zhou, Pei-Hung Lin, Chunhua Liao, Caiwen Ding, Ali Jannesari
+
+
+ Migrating Fortran code to C++ is a common task for many scientific computing
+teams, driven by the need to leverage modern programming paradigms, enhance
+cross-platform compatibility, and improve maintainability. Automating this
+translation process using large language models (LLMs) has shown promise, but
+the lack of high-quality, specialized datasets has hindered their
+effectiveness. In this paper, we address this challenge by introducing a novel
+multi-turn dialogue dataset, Fortran2CPP, specifically designed for
+Fortran-to-C++ code migration. Our dataset, significantly larger than existing
+alternatives, is generated using a unique LLM-driven, dual-agent pipeline
+incorporating iterative compilation, execution, and code repair to ensure high
+quality and functional correctness. To demonstrate the effectiveness of our
+dataset, we fine-tuned several open-weight LLMs on Fortran2CPP and evaluated
+their performance on two independent benchmarks. Fine-tuning on our dataset led
+to remarkable gains, with models achieving up to a 3.31x increase in CodeBLEU
+score and a 92\% improvement in compilation success rate. This highlights the
+dataset's ability to enhance both the syntactic accuracy and compilability of
+the translated C++ code. Our dataset and model have been open-sourced and are
+available on our public GitHub
+repository\footnote{\url{https://github.com/HPC-Fortran2CPP/Fortran2Cpp}}.
+
+
+
+
+
+
+
+ ☆ From Ceilings to Walls: Universal Dynamic Perching of Small Aerial
+ Robots on Surfaces with Variable Orientations
+
+
+ This work demonstrates universal dynamic perching capabilities for quadrotors
+of various sizes and on surfaces with different orientations. By employing a
+non-dimensionalization framework and deep reinforcement learning, we
+systematically assessed how robot size and surface orientation affect landing
+capabilities. We hypothesized that maintaining geometric proportions across
+different robot scales ensures consistent perching behavior, which was
+validated in both simulation and experimental tests. Additionally, we
+investigated the effects of joint stiffness and damping in the landing gear on
+perching behaviors and performance. While joint stiffness had minimal impact,
+joint damping ratios influenced landing success under vertical approaching
+conditions. The study also identified a critical velocity threshold necessary
+for successful perching, determined by the robot's maneuverability and leg
+geometry. Overall, this research advances robotic perching capabilities,
+offering insights into the role of mechanical design and scaling effects, and
+lays the groundwork for future drone autonomy and operational efficiency in
+unstructured environments.
+
+
+
+ comment: 7 pages, 8 Figures
+
+
+
+
+
+
+ ☆ Enhancing Adversarial Robustness of Deep Neural Networks Through
+ Supervised Contrastive Learning
+
+
+ Adversarial attacks exploit the vulnerabilities of convolutional neural
+networks by introducing imperceptible perturbations that lead to
+misclassifications, exposing weaknesses in feature representations and decision
+boundaries. This paper presents a novel framework combining supervised
+contrastive learning and margin-based contrastive loss to enhance adversarial
+robustness. Supervised contrastive learning improves the structure of the
+feature space by clustering embeddings of samples within the same class and
+separating those from different classes. Margin-based contrastive loss,
+inspired by support vector machines, enforces explicit constraints to create
+robust decision boundaries with well-defined margins. Experiments on the
+CIFAR-100 dataset with a ResNet-18 backbone demonstrate improvements in
+adversarial accuracy under Fast Gradient Sign Method attacks.
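A margin-based contrastive loss of the kind described, in its classical hinge-on-pairwise-distance form (a standard formulation, not necessarily the paper's exact loss), pulls same-class embeddings together and pushes different-class embeddings apart until they clear the margin:

```python
import math

def margin_contrastive_loss(pairs, margin=1.0):
    """pairs: (embedding_u, embedding_v, same_class) triples. Same-class
    pairs pay squared distance; different-class pairs pay a squared hinge
    penalty only while they sit inside the margin."""
    total = 0.0
    for u, v, same in pairs:
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
        total += d ** 2 if same else max(0.0, margin - d) ** 2
    return total / len(pairs)

pairs = [
    ([0.0, 0.0], [0.0, 0.0], True),    # same class, coincident: no pull needed
    ([0.0, 0.0], [2.0, 0.0], False),   # different class, beyond margin: no push
    ([0.0, 0.0], [0.5, 0.0], False),   # different class, inside margin: penalized
]
print(margin_contrastive_loss(pairs))  # (0 + 0 + 0.25) / 3
```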
+
+
+
+ comment: 8 pages, 11 figures
+
+
+
+
+
+
+ ☆ Generative Pretrained Embedding and Hierarchical Irregular Time Series
+ Representation for Daily Living Activity Recognition
+
+
+ Within the evolving landscape of smart homes, the precise recognition of
+daily living activities using ambient sensor data stands paramount. This paper
+not only aims to bolster existing algorithms by evaluating two distinct
+pretrained embeddings suited for ambient sensor activations but also introduces
+a novel hierarchical architecture. We delve into an architecture anchored on
+Transformer Decoder-based pre-trained embeddings, reminiscent of the GPT
+design, and contrast it with the previously established state-of-the-art (SOTA)
+ELMo embeddings for ambient sensors. Our proposed hierarchical structure
+leverages the strengths of each pre-trained embedding, enabling the discernment
+of activity dependencies and sequence order, thereby enhancing classification
+precision. To further refine recognition, we incorporate into our proposed
+architecture an hour-of-the-day embedding. Empirical evaluations underscore the
+preeminence of the Transformer Decoder embedding in classification endeavors.
+Additionally, our innovative hierarchical design significantly bolsters the
+efficacy of both pre-trained embeddings, notably in capturing inter-activity
+nuances. The integration of temporal aspects subtly but distinctively augments
+classification, especially for time-sensitive activities. In conclusion, our
+GPT-inspired hierarchical approach, infused with temporal insights, outshines
+the SOTA ELMo benchmark.
+
+
+
+
+
+
+
+ ☆ Learning to Forget: Bayesian Time Series Forecasting using Recurrent
+ Sparse Spectrum Signature Gaussian Processes
+
+
+
+
+
+
+
+
+ Csaba Tóth, Masaki Adachi, Michael A. Osborne, Harald Oberhauser
+
+
+ The signature kernel is a kernel between time series of arbitrary length and
+comes with strong theoretical guarantees from stochastic analysis. It has found
+applications in machine learning such as covariance functions for Gaussian
+processes. A strength of the underlying signature features is that they provide
+a structured global description of a time series. However, this property can
+quickly become a curse when local information is essential and forgetting is
+required; so far this has only been addressed with ad-hoc methods such as
+slicing the time series into subsegments. To overcome this, we propose a
+principled, data-driven approach by introducing a novel forgetting mechanism
+for signatures. This allows the model to dynamically adapt its context length
+to focus on more recent information. To achieve this, we revisit the recently
+introduced Random Fourier Signature Features, and develop Random Fourier
+Decayed Signature Features (RFDSF) with Gaussian processes (GPs). This results
+in a Bayesian time series forecasting algorithm with variational inference,
+that offers a scalable probabilistic algorithm that processes and transforms a
+time series into a joint predictive distribution over time steps in one pass
+using recurrence. For example, it processes a sequence of $10^4$ steps in
+$\approx 10^{-2}$ seconds and in $< 1\text{GB}$ of GPU memory. We demonstrate
+that it outperforms other GP-based alternatives and competes with
+state-of-the-art probabilistic time series forecasting algorithms.
+
+
+
+
+
+
+
+ ☆ EEG-Reptile: An Automatized Reptile-Based Meta-Learning Library for BCIs
+
+
+
+
+
+
+
+
+ Daniil A. Berdyshev, Artem M. Grachev, Sergei L. Shishkin, Bogdan L. Kozyrskiy
+
+
+ Meta-learning, i.e., "learning to learn", is a promising approach to enable
+efficient BCI classifier training with limited amounts of data. It can
+effectively use collections of in some way similar classification tasks, with
+rapid adaptation to new tasks where only minimal data are available. However,
+applying meta-learning to existing classifiers and BCI tasks requires
+significant effort. To address this issue, we propose EEG-Reptile, an automated
+library that leverages meta-learning to improve classification accuracy of
+neural networks in BCIs and other EEG-based applications. It utilizes the
+Reptile meta-learning algorithm to adapt neural network classifiers of EEG data
+to the inter-subject domain, allowing for more efficient fine-tuning for a new
+subject on a small amount of data. The proposed library incorporates an
+automated hyperparameter tuning module, a data management pipeline, and an
+implementation of the Reptile meta-learning algorithm. EEG-Reptile's level of
+automation allows it to be used without a deep understanding of meta-learning. We
+demonstrate the effectiveness of EEG-Reptile on two benchmark datasets (BCI IV
+2a, Lee2019 MI) and three neural network architectures (EEGNet, FBCNet,
+EEG-Inception). Our library achieved improvement in both zero-shot and few-shot
+learning scenarios compared to traditional transfer learning approaches.
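The Reptile update the library wraps is itself compact: adapt a copy of the weights on each task with a few inner SGD steps, then move the meta-weights toward the average adapted weights. A dependency-free sketch (toy quadratic tasks standing in for EEG subjects; not the library's code):

```python
def reptile_step(theta, tasks, inner_steps=5, inner_lr=0.1, meta_lr=0.5):
    """One Reptile meta-update. Each task is a gradient function; theta is
    nudged toward the mean of the per-task adapted weights."""
    adapted = []
    for grad_fn in tasks:
        phi = list(theta)
        for _ in range(inner_steps):
            g = grad_fn(phi)
            phi = [p - inner_lr * gi for p, gi in zip(phi, g)]
        adapted.append(phi)
    mean_phi = [sum(ps) / len(ps) for ps in zip(*adapted)]
    return [t + meta_lr * (m - t) for t, m in zip(theta, mean_phi)]

# Two toy "subjects": quadratic losses with minima at +1 and -1 per coordinate.
task_a = lambda w: [2 * (wi - 1.0) for wi in w]   # grad of sum (w_i - 1)^2
task_b = lambda w: [2 * (wi + 1.0) for wi in w]   # grad of sum (w_i + 1)^2
theta = [5.0, 5.0]
for _ in range(20):
    theta = reptile_step(theta, [task_a, task_b])
print([round(t, 3) for t in theta])  # converges between the two minima, near 0
```

Starting a new subject's fine-tuning from such a meta-initialization, rather than from scratch, is what enables the few-shot gains the library reports.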
+
+
+
+ comment: For proposed python library, see EEG-Reptile GitHub:
+ https://github.com/gasiki/EEG-Reptile
+
+
+
+
+
+
+ ☆ Text2Insight: Transform natural language text into insights seamlessly
+ using multi-model architecture
+
+
+ The growing demand for dynamic, user-centric data analysis and visualization
+is evident across domains like healthcare, finance, and research. Traditional
+visualization tools often fail to meet individual user needs due to their
+static and predefined nature. To address this gap, Text2Insight is introduced
+as an innovative solution that delivers customized data analysis and
+visualizations based on user-defined natural language requirements. Leveraging
+a multi-model architecture, Text2Insight transforms user inputs into actionable
+insights and dynamic visualizations.
+ The methodology begins with analyzing the input dataset to extract structural
+details such as columns and values. A pre-trained Llama3 model converts the
+user's natural language query into an SQL query, which is further refined using
+a Named Entity Recognition (NER) model for accuracy. A chart predictor
+determines the most suitable visualization type, while the Llama3 model
+generates insights based on the SQL query's results. The output is a
+user-friendly and visually informative chart. To enhance analysis capabilities,
+the system integrates a question-answering model and a predictive model using
+the BERT framework. These models provide insights into historical data and
+predict future trends.
+ Performance evaluation of Text2Insight demonstrates its effectiveness,
+achieving high accuracy (99%), precision (100%), recall (99%), and F1-score
+(99%), with a BLEU score of 0.5. The question-answering model attained an
+accuracy of 89% and the predictive model achieved 70% accuracy. These results
+validate Text2Insight as a robust and viable solution for transforming natural
+language text into dynamic, user-specific data analysis and visualizations.
+
+
+
+
+
+
+
+ ☆ ProKAN: Progressive Stacking of Kolmogorov-Arnold Networks for Efficient
+ Liver Segmentation
+
+
+ The growing need for accurate and efficient 3D identification of tumors,
+particularly in liver segmentation, has spurred considerable research into deep
+learning models. While many existing architectures offer strong performance,
+they often face challenges such as overfitting and excessive computational
+costs. An adjustable and flexible architecture that strikes a balance between
+time efficiency and model complexity remains an unmet requirement. In this
+paper, we introduce proKAN, a progressive stacking methodology for
+Kolmogorov-Arnold Networks (KANs) designed to address these challenges. Unlike
+traditional architectures, proKAN dynamically adjusts its complexity by
+progressively adding KAN blocks during training, based on overfitting behavior.
+This approach allows the network to stop growing when overfitting is detected,
+preventing unnecessary computational overhead while maintaining high accuracy.
+Additionally, proKAN utilizes KAN's learnable activation functions modeled
+through B-splines, which provide enhanced flexibility in learning complex
+relationships in 3D medical data. Our proposed architecture achieves
+state-of-the-art performance in liver segmentation tasks, outperforming
+standard Multi-Layer Perceptrons (MLPs) and fixed KAN architectures. The
+dynamic nature of proKAN ensures efficient training times and high accuracy
+without the risk of overfitting. Furthermore, proKAN provides better
+interpretability by allowing insight into the decision-making process through
+its learnable coefficients. The experimental results demonstrate a significant
+improvement in accuracy, Dice score, and time efficiency, making proKAN a
+compelling solution for 3D medical image segmentation tasks.
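The growth rule can be sketched independently of KANs (an illustrative control loop with invented names and a synthetic validation curve, not the authors' implementation): keep adding blocks while validation loss improves, and stop and roll back once it has degraded for a few additions in a row.

```python
def progressive_stack(train_block, val_loss, max_blocks=10, patience=2):
    """Grow a model one block at a time; stop when validation loss fails to
    improve for `patience` consecutive additions (an overfitting signal),
    and drop the unhelpful trailing blocks."""
    blocks, best, stale = [], float("inf"), 0
    while len(blocks) < max_blocks:
        blocks.append(train_block(len(blocks)))
        loss = val_loss(blocks)
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                blocks = blocks[: len(blocks) - patience]
                break
    return blocks, best

# Toy validation curve: improves up to 3 blocks, then degrades (overfitting).
curve = {1: 1.0, 2: 0.6, 3: 0.5, 4: 0.7, 5: 0.9}
blocks, best = progressive_stack(lambda i: f"block{i}",
                                 lambda bs: curve[len(bs)])
print(len(blocks), best)  # 3 0.5
```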
+
+
+
+
+
+
+
+ ☆ Causal machine learning for heterogeneous treatment effects in the
+ presence of missing outcome data
+
+
+
+
+
+
+
+
+ Matthew Pryce, Karla Diaz-Ordaz, Ruth H. Keogh, Stijn Vansteelandt
+
+
+ When estimating heterogeneous treatment effects, missing outcome data can
+complicate treatment effect estimation, causing certain subgroups of the
+population to be poorly represented. In this work, we discuss this commonly
+overlooked problem and consider the impact that missing at random (MAR) outcome
+data has on causal machine learning estimators for the conditional average
+treatment effect (CATE). We then propose two de-biased machine learning
+estimators for the CATE, the mDR-learner and mEP-learner, which address the
+issue of under-representation by integrating inverse probability of censoring
+weights into the DR-learner and EP-learner respectively. We show that under
+reasonable conditions, these estimators are oracle efficient, and illustrate
+their favorable performance through simulated data settings, comparing them to
+existing CATE estimators, including comparison to estimators which use common
+missing data techniques. Guidance on the implementation of these estimators is
+provided and we present an example of their application using the ACTG175
+trial, exploring treatment effect heterogeneity when comparing Zidovudine
+mono-therapy against alternative antiretroviral therapies among HIV-1-infected
+individuals.
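The inverse-probability-of-censoring weighting that both estimators build on can be illustrated with a weighted mean (a deliberately simplified sketch with toy numbers; the mDR- and mEP-learners combine these weights with outcome and propensity models): each observed outcome is up-weighted by the inverse of its probability of being observed, so observed units stand in for similar units whose outcome is missing, which is valid under MAR.

```python
def ipcw_mean(outcomes, observed, obs_prob):
    """IPCW estimate of the mean outcome: observed outcomes weighted by
    1 / P(observed | X); missing outcomes contribute nothing directly."""
    n = len(outcomes)
    return sum(y / p for y, r, p in zip(outcomes, observed, obs_prob) if r) / n

# Toy data: units 3 and 4 are each observed with probability 0.5, and one of
# them is missing; IPCW double-counts the observed one to compensate.
ys = [1.0, 3.0, 5.0, None]
rs = [1, 1, 1, 0]
ps = [1.0, 1.0, 0.5, 0.5]
print(ipcw_mean(ys, rs, ps))  # (1 + 3 + 5/0.5) / 4 = 3.5
```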
+
+
+
+ comment: 34 pages, 6 figures, 4 tables
+
+
+
+
+
+
+ ☆ Toward Adaptive Reasoning in Large Language Models with Thought Rollback ICML 2024
+
+
+ Large language models (LLMs) have been routinely used to solve various tasks
+using step-by-step reasoning. However, the structure of intermediate reasoning
+steps, or thoughts, is rigid and unidirectional, such as chains, trees, or
+acyclic-directed graphs. Consequently, the resulting inflexible and
+forward-only reasoning may not address challenging tasks and fail when the LLM
+frequently gives false responses, i.e., ``hallucinations''. This paper proposes
+a new reasoning framework, called Thought Rollback (TR), allowing LLMs to
+adaptively build thought structure while maintaining effective reasoning toward
+problem-solving under ``hallucinations''. The core mechanism of TR is rolling
+back thoughts, which allows LLMs to perform error analysis on thoughts, and
+thus roll back to any previously mistaken thought for revision. Subsequently,
+by including such trial-and-error in the prompt to guide the LLM, each rollback
+leads to one more reliable reasoning path. Therefore, starting with a simple
+prompt without human annotations, an LLM with TR adaptively and gradually explores
+thoughts for a correct solution. Comprehensive experiments on mathematical
+problems and multi-task reasoning demonstrate the state-of-the-art performance
+of TR in terms of problem-solving rate and interaction cost. For instance, the
+solving rate of GPT-4 with TR outperforms the current best by $9\%$ on the MATH
+dataset.
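The roll-back-and-retry control flow described in this abstract can be sketched as a simple loop. Everything below is a hypothetical stand-in (`query_llm` and `analyze_error` are stubs, not the paper's actual prompts or code), meant only to illustrate how rolling back to a mistaken thought and recording the trial-and-error in the prompt could be wired together:

```python
# Minimal sketch of a Thought-Rollback-style control loop (illustrative only).

def query_llm(prompt):
    # Stub: a real system would call the LLM; we return a canned "thought".
    return f"thought after: {prompt[-20:]}"

def analyze_error(thoughts):
    # Stub error analysis: return the index of a mistaken thought, or None.
    # A real system would prompt the LLM to inspect the reasoning chain.
    return 1 if len(thoughts) == 3 else None

def solve_with_rollback(question, max_steps=6):
    prompt, thoughts, rollbacks = question, [], 0
    for _ in range(max_steps):
        thoughts.append(query_llm(prompt))
        bad = analyze_error(thoughts)
        if bad is not None:
            # Roll back to the mistaken thought and fold the trial-and-error
            # experience into the prompt to guide the next attempt.
            prompt = question + f" [avoid mistake in: {thoughts[bad]}]"
            thoughts = thoughts[:bad]
            rollbacks += 1
        else:
            prompt = question + " " + " ".join(thoughts)
    return thoughts, rollbacks
```

With the stubbed error analyzer, the loop rolls back twice before the step budget runs out, mirroring how each rollback spawns a fresh, more informed reasoning path.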
+
+
+
+ comment: ICML 2024 camera-ready version with 24 pages and 12 figures. Code
+ repo with all prompts:
+ https://github.com/iQua/llmpebase/tree/main/examples/ThoughtRollback
+
+
+
+
+
+
+ ☆ Combining Machine Learning with Recurrence Analysis for resonance
+ detection
+
+
+ The width of a resonance in a nearly integrable system, i.e. in a
+non-integrable system where chaotic motion is still not prominent, can tell us
+how a perturbation parameter is driving the system away from integrability.
+Although the tool that we are presenting here is quite generic and
+can be used in a variety of systems, our particular interest lies in binary
+compact object systems known as extreme mass ratio inspirals (EMRIs). In an
+EMRI a lighter compact object, like a black hole or a neutron star, inspirals
+into a supermassive black hole due to gravitational radiation reaction. During
+this inspiral the lighter object crosses resonances, which are still not very
+well modeled. Measuring the width of resonances in EMRI models allows us to
+estimate the importance of each perturbation parameter able to drive the system
+away from resonances and decide whether its impact should be included in EMRI
+waveform modeling or not. To tackle this issue, in our study we first show that
+recurrence quantifiers of orbits carry imprints of resonant behavior,
+regardless of the system's dimensionality. As a next step, we apply a long
+short-term memory machine learning architecture to automate the resonance
+detection procedure. Our analysis is developed on a simple standard map and
+gradually we extend it to more complicated systems until finally we employ it
+in a generic deformed Kerr spacetime known in the literature as the
+Johannsen-Psaltis spacetime.
+
+
+ We study deep ReLU feed forward neural networks (NN) and their injectivity
+abilities. The main focus is on \emph{precisely} determining the so-called
+injectivity capacity. For any given hidden layers architecture, it is defined
+as the minimal ratio between number of network's outputs and inputs which
+ensures unique recoverability of the input from a realizable output. Strong
+recent progress in precisely studying single ReLU layer injectivity properties
+is here moved to a deep network level. In particular, we develop a program that
+connects deep $l$-layer net injectivity to an $l$-extension of the $\ell_0$
+spherical perceptrons, thereby massively generalizing an isomorphism between
+studying single layer injectivity and the capacity of the so-called
+(1-extension) $\ell_0$ spherical perceptrons discussed in [82]. \emph{Random
+duality theory} (RDT) based machinery is then created and utilized to
+statistically handle properties of the extended $\ell_0$ spherical perceptrons
+and implicitly of the deep ReLU NNs. A sizeable set of numerical evaluations is
+conducted as well to put the entire RDT machinery in practical use. From these
+we observe a rapidly decreasing tendency in needed layers' expansions, i.e., we
+observe a rapid \emph{expansion saturation effect}. Only $4$ layers of depth
+are sufficient to closely approach the level of no needed expansion -- a result
+that fairly closely resembles observations made in practical experiments and
+that has so far remained completely untouchable by any of the existing
+mathematical methodologies.
+
+
+
+
+
+
+
+ ☆ Toward Scalable Multirobot Control: Fast Policy Learning in Distributed
+ MPC
+
+
+ Distributed model predictive control (DMPC) is promising in achieving optimal
+cooperative control in multirobot systems (MRS). However, real-time DMPC
+implementation relies on numerical optimization tools to periodically calculate
+local control sequences online. This process is computationally demanding and
+lacks scalability for large-scale, nonlinear MRS. This article proposes a novel
+distributed learning-based predictive control (DLPC) framework for scalable
+multirobot control. Unlike conventional DMPC methods that calculate open-loop
+control sequences, our approach centers around a computationally fast and
+efficient distributed policy learning algorithm that generates explicit
+closed-loop DMPC policies for MRS without using numerical solvers. The policy
+learning is executed incrementally and forward in time in each prediction
+interval through an online distributed actor-critic implementation. The control
+policies are successively updated in a receding-horizon manner, enabling fast
+and efficient policy learning with the closed-loop stability guarantee. The
+learned control policies could be deployed online to MRS with varying robot
+scales, enhancing scalability and transferability for large-scale MRS.
+Furthermore, we extend our methodology to address the multirobot safe learning
+challenge through a force field-inspired policy learning approach. We validate
+our approach's effectiveness, scalability, and efficiency through extensive
+experiments on cooperative tasks of large-scale wheeled robots and multirotor
+drones. Our results demonstrate the rapid learning and deployment of DMPC
+policies for MRS with scales up to 10,000 units.
+
+
+
+ comment: 26 pages, 19 figures
+
+
+
+
+
+
+ ☆ Asymmetrical Reciprocity-based Federated Learning for Resolving
+ Disparities in Medical Diagnosis KDD 2025
+
+
+ Geographic health disparities pose a pressing global challenge, particularly
+in underserved regions of low- and middle-income nations. Addressing this issue
+requires a collaborative approach to enhance healthcare quality, leveraging
+support from medically more developed areas. Federated learning emerges as a
+promising tool for this purpose. However, the scarcity of medical data and
+limited computation resources in underserved regions make collaborative
+training of powerful machine learning models challenging. Furthermore, there
+exists an asymmetrical reciprocity between underserved and developed regions.
+To overcome these challenges, we propose a novel cross-silo federated learning
+framework, named FedHelp, aimed at alleviating geographic health disparities
+and fortifying the diagnostic capabilities of underserved regions.
+Specifically, FedHelp leverages foundational model knowledge via one-time API
+access to guide the learning process of underserved small clients, addressing
+the challenge of insufficient data. Additionally, we introduce a novel
+asymmetric dual knowledge distillation module to manage the issue of asymmetric
+reciprocity, facilitating the exchange of necessary knowledge between developed
+large clients and underserved small clients. We validate the effectiveness and
+utility of FedHelp through extensive experiments on both medical image
+classification and segmentation tasks. The experimental results demonstrate
+significant performance improvement compared to state-of-the-art baselines,
+particularly benefiting clients in underserved regions.
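The asymmetric dual knowledge distillation module exchanges knowledge between developed large clients and underserved small clients. For reference, the basic temperature-scaled distillation loss that such modules build on can be sketched as follows (the paper's asymmetric formulation differs; this shows only the standard KL form):

```python
import numpy as np

# Temperature-scaled knowledge-distillation loss (standard form, for
# illustration; not FedHelp's asymmetric variant).

def softmax(z, T=1.0):
    e = np.exp(z / T - np.max(z / T))
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, T=2.0):
    p_t = softmax(teacher_logits, T)     # softened teacher distribution
    p_s = softmax(student_logits, T)     # softened student distribution
    return float(T * T * np.sum(p_t * (np.log(p_t) - np.log(p_s))))

# Identical logits give zero loss; any mismatch gives a positive loss.
assert kd_loss(np.array([1.0, 2.0]), np.array([1.0, 2.0])) == 0.0
```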
+
+
+
+ comment: Jiaqi Wang and Ziyi Yin equally contributed to this paper. This paper
+ has been accepted by KDD 2025
+
+
+
+
+
+
+
+ Zhongxing Xu, Feilong Tang, Zhe Chen, Yingxue Su, Zhiyi Zhao, Ge Zhang, Jionglong Su, Zongyuan Ge
+
+
+ The application of Contrastive Language-Image Pre-training (CLIP) in Weakly
+Supervised Semantic Segmentation (WSSS) has demonstrated powerful cross-modal semantic
+understanding capabilities. Existing methods attempt to optimize input text
+prompts for improved alignment of images and text, by finely adjusting text
+prototypes to facilitate semantic matching. Nevertheless, given the modality
+gap between text and vision spaces, the text prototypes employed by these
+methods have not effectively established a close correspondence with
+pixel-level vision features. In this work, our theoretical analysis indicates
+that the inherent modality gap results in misalignment of text and region
+features, and that this gap cannot be sufficiently reduced by minimizing
+contrast loss in CLIP. To mitigate the impact of the modality gap, we propose a
+Vision Prototype Learning (VPL) framework, by introducing more representative
+vision prototypes. The core of this framework is to learn class-specific vision
+prototypes in vision space with the help of text prototypes, for capturing
+high-quality localization maps. Moreover, we propose a regional semantic
+contrast module that contrasts region embeddings with corresponding prototypes,
+leading to more comprehensive and robust feature learning. Experimental results
+show that our proposed framework achieves state-of-the-art performance on two
+benchmark datasets.
+
+
+
+
+
+
+
+ ☆ Deep Linear Hawkes Processes
+
+
+
+
+
+
+
+
+ Yuxin Chang, Alex Boyd, Cao Xiao, Taha Kass-Hout, Parminder Bhatia, Padhraic Smyth, Andrew Warrington
+
+
+ Marked temporal point processes (MTPPs) are used to model sequences of
+different types of events with irregular arrival times, with broad applications
+ranging from healthcare and social networks to finance. We address shortcomings
+in existing point process models by drawing connections between modern deep
+state-space models (SSMs) and linear Hawkes processes (LHPs), culminating in an
+MTPP that we call the deep linear Hawkes process (DLHP). The DLHP modifies the
+linear differential equations in deep SSMs to be stochastic jump differential
+equations, akin to LHPs. After discretizing, the resulting recurrence can be
+implemented efficiently using a parallel scan. This brings parallelism and
+linear scaling to MTPP models. This contrasts with attention-based MTPPs, which
+scale quadratically, and RNN-based MTPPs, which do not parallelize across the
+sequence length. We show empirically that DLHPs match or outperform existing
+models across a broad range of metrics on eight real-world datasets. Our
+proposed DLHP model is the first instance of the unique architectural
+capabilities of SSMs being leveraged to construct a new class of MTPP models.
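The parallel-scan claim rests on the fact that an affine recurrence composes associatively. A scalar sketch of that ingredient (not the DLHP model itself; a parallel backend such as `jax.lax.associative_scan` would supply the actual parallelism over the sequence):

```python
import numpy as np

# The discretized recurrence has the affine form h_t = a_t * h_{t-1} + b_t
# (scalar case for clarity). Representing each step as the pair (a_t, b_t),
# composition of steps is associative, which enables a parallel prefix scan.

def combine(f, g):
    a1, b1 = f
    a2, b2 = g
    return (a2 * a1, a2 * b1 + b2)  # apply f first, then g

def prefix_scan(steps):
    # Sequential reference implementation of the inclusive scan; a parallel
    # implementation would pair steps tree-wise using associativity.
    out, acc = [], (1.0, 0.0)       # identity element of the composition
    for s in steps:
        acc = combine(acc, s)
        out.append(acc)
    return out

rng = np.random.default_rng(0)
steps = [(rng.uniform(0.5, 1.0), rng.uniform(-1, 1)) for _ in range(8)]

# Check against the naive recurrence from h_0 = 0: the scanned b-component
# equals h_t at every step.
h, hs = 0.0, []
for a, b in steps:
    h = a * h + b
    hs.append(h)
scan_hs = [b for _, b in prefix_scan(steps)]
assert np.allclose(hs, scan_hs)
```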
+
+
+
+
+
+
+
+ ☆ Gradient Weight-normalized Low-rank Projection for Efficient LLM
+ Training AAAI
+
+
+ Large Language Models (LLMs) have shown remarkable performance across various
+tasks, but the escalating demands on computational resources pose significant
+challenges, particularly in the extensive utilization of full fine-tuning for
+downstream tasks. To address this, parameter-efficient fine-tuning (PEFT)
+methods have been developed, but they often underperform compared to full
+fine-tuning and struggle with memory efficiency. In this work, we introduce
+Gradient Weight-Normalized Low-Rank Projection (GradNormLoRP), a novel approach
+that enhances both parameter and memory efficiency while maintaining comparable
+performance to full fine-tuning. GradNormLoRP normalizes the weight matrix to
+improve gradient conditioning, facilitating better convergence during
+optimization. Additionally, it applies low-rank approximations to the weight
+and gradient matrices, significantly reducing memory usage during training.
+Extensive experiments demonstrate that our 8-bit GradNormLoRP reduces optimizer
+memory usage by up to 89.5% and enables the pre-training of large LLMs, such as
+LLaMA 7B, on consumer-level GPUs like the NVIDIA RTX 4090, without additional
+inference costs. Moreover, GradNormLoRP outperforms existing low-rank methods
+in fine-tuning tasks. For instance, when fine-tuning the RoBERTa model on all
+GLUE tasks with a rank of 8, GradNormLoRP achieves an average score of 80.65,
+surpassing LoRA's score of 79.23. These results underscore GradNormLoRP as a
+promising alternative for efficient LLM pre-training and fine-tuning. Source
+code and Appendix:
+https://github.com/Jhhuangkay/Gradient-Weight-normalized-Low-rank-Projection-for-Efficient-LLM-Training
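As a rough illustration of the low-rank-projection ingredient: the optimizer step is taken in a rank-r subspace of the full gradient, shrinking optimizer state. The actual GradNormLoRP update (which also normalizes the weight matrix for gradient conditioning) is specified in the paper; the snippet below is only a schematic of the projection idea:

```python
import numpy as np

# Schematic low-rank gradient projection: optimizer state lives in a
# rank-r subspace of the m x n gradient, cutting memory from O(mn) to
# O(rn). Illustrative only; not the exact GradNormLoRP update.

rng = np.random.default_rng(0)
m, n, r = 64, 32, 4
W = rng.standard_normal((m, n))          # weight matrix
G = rng.standard_normal((m, n))          # full gradient

U, _, _ = np.linalg.svd(G, full_matrices=False)
P = U[:, :r]                             # m x r projection basis

G_low = P.T @ G                          # r x n projected gradient
W -= P @ (0.01 * G_low)                  # step mapped back to full space

assert G_low.shape == (r, n)
```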
+
+
+
+ comment: Accepted by the 39th AAAI Conference on Artificial Intelligence
+ (AAAI-25) [Main Technical Track]
+
+
+
+
+
+
+
+ Minghui Li, Zikang Guo, Yang Wu, Peijin Guo, Yao Shi, Shengshan Hu, Wei Wan, Shengqing Hu
+
+
+ Drug-target interaction is fundamental in understanding how drugs affect
+biological systems, and accurately predicting drug-target affinity (DTA) is
+vital for drug discovery. Recently, deep learning methods have emerged as a
+significant approach for estimating the binding strength between drugs and
+target proteins. However, existing methods simply utilize the drug's local
+information from molecular topology rather than global information.
+Additionally, the features of drugs and proteins are usually fused with a
+simple concatenation operation, limiting their effectiveness. To address these
+challenges, we propose ViDTA, an enhanced DTA prediction framework. We
+introduce virtual nodes into the Graph Neural Network (GNN)-based drug feature
+extraction network, which act as a global memory to exchange messages more
+efficiently. By incorporating virtual graph nodes, we seamlessly integrate
+local and global features of drug molecular structures, expanding the GNN's
+receptive field. Additionally, we propose an attention-based linear feature
+fusion network for better capturing the interaction information between drugs
+and proteins. Experimental results on various benchmarks including
+Davis, Metz, and KIBA demonstrate that our proposed ViDTA outperforms the
+state-of-the-art baselines.
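The virtual-node idea above can be illustrated on the graph structure alone: an extra node connected to every molecule node acts as a global shortcut, so any two nodes can exchange information in two hops of message passing. This is purely illustrative, not the ViDTA architecture:

```python
import numpy as np

# Append a virtual node linked to all existing nodes of an adjacency matrix.

def add_virtual_node(adj):
    n = adj.shape[0]
    out = np.zeros((n + 1, n + 1))
    out[:n, :n] = adj
    out[n, :n] = out[:n, n] = 1.0   # virtual node connected to every node
    return out

# A path graph on 4 nodes: the two endpoints are 3 hops apart...
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], float)
A = add_virtual_node(adj)
# ...but through the virtual node, two hops reach every pair.
two_hop = np.linalg.matrix_power(A + np.eye(5), 2)
assert two_hop[0, 3] > 0
```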
+
+
+
+ comment: Accepted by International Conference on Bioinformatics and
+ Biomedicine (BIBM 24)
+
+
+
+
+
+
+ ☆ Goal-oriented Communications based on Recursive Early Exit Neural
+ Networks
+
+
+ This paper presents a novel framework for goal-oriented semantic
+communications leveraging recursive early exit models. The proposed approach is
+built on two key components. First, we introduce an innovative early exit
+strategy that dynamically partitions computations, enabling samples to be
+offloaded to a server based on layer-wise recursive prediction dynamics that
+detect samples for which the confidence is not increasing fast enough over
+layers. Second, we develop a Reinforcement Learning-based online optimization
+framework that jointly determines early exit points, computation splitting, and
+offloading strategies, while accounting for wireless conditions, inference
+accuracy, and resource costs. Numerical evaluations in an edge inference
+scenario demonstrate the method's adaptability and effectiveness in striking an
+excellent trade-off between performance, latency, and resource efficiency.
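The routing rule described above (exit locally once confident, offload when confidence stops growing fast enough across layers) can be sketched as follows; the thresholds are illustrative, not values from the paper:

```python
# Sketch of the recursive early-exit/offload rule: a sample keeps passing
# through local exits while its confidence grows fast enough, otherwise it
# is offloaded to the server. Thresholds are illustrative.

def route(confidences, exit_thr=0.9, min_gain=0.05):
    prev = 0.0
    for layer, c in enumerate(confidences):
        if c >= exit_thr:
            return ("exit", layer)        # confident: classify locally
        if c - prev < min_gain:
            return ("offload", layer)     # confidence stalling: send to server
        prev = c
    return ("offload", len(confidences))  # ran out of local exits
```

For example, `route([0.3, 0.5, 0.92])` exits locally at the third exit, while `route([0.3, 0.32])` offloads as soon as the per-layer gain stalls.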
+
+
+
+
+
+
+
+ ☆ Ultralight Signal Classification Model for Automatic Modulation
+ Recognition
+
+
+ The growing complexity of radar signals demands responsive and accurate
+detection systems that can operate efficiently on resource-constrained edge
+devices. Existing models, while effective, often rely on substantial
+computational resources and large datasets, making them impractical for edge
+deployment. In this work, we propose an ultralight hybrid neural network
+optimized for edge applications, delivering robust performance across
+unfavorable signal-to-noise ratios (mean accuracy of 96.3% at 0 dB) using less
+than 100 samples per class, and significantly reducing computational overhead.
+
+
+
+ comment: 8 pages, 8 figures
+
+
+
+
+
+
+ ☆ A Comparative Study of Machine Unlearning Techniques for Image and Text
+ Classification Models
+
+
+
+
+
+
+
+
+ Omar M. Safa, Mahmoud M. Abdelaziz, Mustafa Eltawy, Mohamed Mamdouh, Moamen Gharib, Salaheldin Eltenihy, Nagia M. Ghanem, Mohamed M. Ismail
+
+
+ Machine Unlearning has emerged as a critical area in artificial intelligence,
+addressing the need to selectively remove learned data from machine learning
+models in response to data privacy regulations. This paper provides a
+comprehensive comparative analysis of six state-of-the-art unlearning techniques
+applied to image and text classification tasks. We evaluate their performance,
+efficiency, and compliance with regulatory requirements, highlighting their
+strengths and limitations in practical scenarios. By systematically analyzing
+these methods, we aim to provide insights into their applicability,
+challenges, and tradeoffs, fostering advancements in the field of ethical and
+adaptable machine learning.
+
+
+ In many domains of empirical sciences, discovering the causal structure
+within variables remains an indispensable task. Recently, to tackle the
+unoriented edges or latent-assumption violations suffered by conventional
+methods, researchers formulated a reinforcement learning (RL) procedure for
+causal discovery and employed the REINFORCE algorithm to search for the
+best-rewarded directed acyclic graph. The two keys to the overall performance
+of the procedure are the robustness of RL methods and the efficient encoding of
+variables. However, on the one hand, REINFORCE is prone to local convergence
+and unstable performance during training. Neither trust region policy
+optimization, being computationally expensive, nor proximal policy optimization
+(PPO), suffering from aggregate constraint deviation, is a decent alternative for
+combinatorial optimization problems with considerable individual subactions. We
+propose a trust region-navigated clipping policy optimization method for causal
+discovery that guarantees both better search efficiency and steadiness in
+policy optimization, in comparison with REINFORCE, PPO and our prioritized
+sampling-guided REINFORCE implementation. On the other hand, to boost the
+efficient encoding of variables, we propose a refined graph attention encoder
+called SDGAT that can grasp more feature information without prior
+neighbourhood information. With these improvements, the proposed method
+outperforms the former RL method on both synthetic and benchmark datasets in terms
+of output results and optimization robustness.
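For reference, the clipped surrogate objective that PPO-style methods maximize, and that the proposed trust-region-navigated clipping refines, can be sketched as follows (this is the standard form, not the paper's exact rule):

```python
import numpy as np

# Standard clipped surrogate objective from PPO-style policy optimization,
# shown as the common baseline form (illustrative; the paper's
# trust-region-navigated clipping differs).

def clipped_surrogate(ratio, advantage, eps=0.2):
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return float(np.minimum(unclipped, clipped).mean())

# A ratio far outside the clip range earns no extra objective value:
assert abs(clipped_surrogate(np.array([2.0]), np.array([1.0])) - 1.2) < 1e-9
```

The `min` makes the objective pessimistic: increasing the probability ratio past `1 + eps` for positive advantages yields no additional gain, which is what keeps the policy update near the old policy.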
+
+
+ Spatio-temporal Human-Object Interaction (ST-HOI) understanding aims at
+detecting HOIs from videos, which is crucial for activity understanding.
+However, existing whole-body-object interaction video benchmarks overlook the
+truth that open-world objects are diverse, that is, they usually provide
+limited and predefined object classes. Therefore, we introduce a new open-world
+benchmark: Grounding Interacted Objects (GIO) including 1,098 interacted
+object classes and 290K interacted object box annotations. Accordingly, an
+object grounding task is proposed expecting vision systems to discover
+interacted objects. Even though today's detectors and grounding methods have
+succeeded greatly, they perform unsatisfactorily in localizing diverse and rare
+objects in GIO. This profoundly reveals the limitations of current vision
+systems and poses a great challenge. Thus, we explore leveraging
+spatio-temporal cues to address object grounding and propose a 4D
+question-answering framework (4D-QA) to discover interacted objects from
+diverse videos. Our method demonstrates significant superiority in extensive
+experiments compared to current baselines. Data and code will be publicly
+available at https://github.com/DirtyHarryLYL/HAKE-AVA.
+
+
+
+ comment: To be published in the Proceedings of AAAI 2025. The first three
+ authors contributed equally. Project:
+ https://github.com/DirtyHarryLYL/HAKE-AVA
+
+
+
+
+
+
+ ☆ The Value of AI Advice: Personalized and Value-Maximizing AI Advisors
+ Are Necessary to Reliably Benefit Experts and Organizations
+
+
+
+
+
+
+
+
+ Nicholas Wolczynski, Maytal Saar-Tsechansky, Tong Wang
+
+
+ Despite advances in AI's performance and interpretability, AI advisors can
+undermine experts' decisions and increase the time and effort experts must
+invest to make decisions. Consequently, AI systems deployed in high-stakes
+settings often fail to consistently add value across contexts and can even
+diminish the value that experts alone provide. Beyond harm in specific domains,
+such outcomes impede progress in research and practice, underscoring the need
+to understand when and why different AI advisors add or diminish value. To
+bridge this gap, we stress the importance of assessing the value AI advice
+brings to real-world contexts when designing and evaluating AI advisors.
+Building on this perspective, we characterize key pillars -- pathways through
+which AI advice impacts value -- and develop a framework that incorporates
+these pillars to create reliable, personalized, and value-adding advisors. Our
+results highlight the need for system-level, value-driven development of AI
+advisors that advise selectively, are tailored to experts' unique behaviors,
+and are optimized for context-specific trade-offs between decision improvements
+and advising costs. They also reveal how the lack of inclusion of these pillars
+in the design of AI advising systems may be contributing to the failures
+observed in practical applications.
+
+
+ Recently, the study of heavy-tailed noises in first-order nonconvex
+stochastic optimization has received considerable attention, since many
+empirical observations suggest it is a more realistic condition.
+Specifically, the stochastic noise (the difference between the stochastic and
+true gradient) is considered only to have a finite $\mathfrak{p}$-th moment
+where $\mathfrak{p}\in\left(1,2\right]$ instead of assuming it always satisfies
+the classical finite variance assumption. To deal with this more challenging
+setting, people have proposed different algorithms and proved them to converge
+at an optimal $\mathcal{O}(T^{\frac{1-\mathfrak{p}}{3\mathfrak{p}-2}})$ rate
+for smooth objectives after $T$ iterations. Notably, all these newly designed
+algorithms are based on the same technique - gradient clipping. Naturally, one
+may want to know whether the clipping method is a necessary ingredient and the
+only way to guarantee convergence under heavy-tailed noises. In this work, by
+revisiting the existing Batched Normalized Stochastic Gradient Descent with
+Momentum (Batched NSGDM) algorithm, we provide the first convergence result
+under heavy-tailed noises but without gradient clipping. Concretely, we prove
+that Batched NSGDM can achieve the optimal
+$\mathcal{O}(T^{\frac{1-\mathfrak{p}}{3\mathfrak{p}-2}})$ rate even under the
+relaxed smooth condition. More interestingly, we also establish the first
+$\mathcal{O}(T^{\frac{1-\mathfrak{p}}{2\mathfrak{p}}})$ convergence rate in the
+case where the tail index $\mathfrak{p}$ is unknown in advance, which is
+arguably the common scenario in practice.
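The core mechanism revisited here, normalizing the momentum buffer instead of clipping the gradient, is easy to sketch on a toy problem. Hyperparameters and the quadratic objective below are illustrative, not from the paper:

```python
import numpy as np

# Sketch of normalized SGD with momentum (NSGDM): the momentum buffer is
# normalized before the update, so no gradient clipping is needed even when
# the stochastic gradient noise is heavy-tailed. Toy quadratic objective.

def nsgdm_step(w, grad, m, lr=0.1, beta=0.9, eps=1e-12):
    m = beta * m + (1 - beta) * grad        # momentum accumulation
    step = m / (np.linalg.norm(m) + eps)    # normalization replaces clipping
    return w - lr * step, m

rng = np.random.default_rng(0)
w, m = np.array([10.0, -10.0]), np.zeros(2)
for _ in range(200):
    noisy_grad = w + rng.standard_normal(2)  # gradient of ||w||^2/2 plus noise
    w, m = nsgdm_step(w, noisy_grad, m)

assert np.linalg.norm(w) < 5.0              # far closer to the optimum at 0
```

Because the step has unit norm regardless of the gradient's magnitude, a single heavy-tailed gradient sample cannot blow up the iterate, which is the intuition behind dropping the clipping step.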
+
+
+
+ comment: In submission
+
+
+
+
+
+
+ ☆ Estimation of System Parameters Including Repeated Cross-Sectional Data
+ through Emulator-Informed Deep Generative Model
+
+
+
+
+
+
+
+
+ Hyunwoo Cho, Sung Woong Cho, Hyeontae Jo, Hyung Ju Hwang
+
+
+ Differential equations (DEs) are crucial for modeling the evolution of
+natural or engineered systems. Traditionally, the parameters in DEs are
+adjusted to fit data from system observations. However, in fields such as
+politics, economics, and biology, available data are often independently
+collected at distinct time points from different subjects (i.e., repeated
+cross-sectional (RCS) data). Conventional optimization techniques struggle to
+accurately estimate DE parameters when RCS data exhibit various
+heterogeneities, leading to a significant loss of information. To address this
+issue, we propose a new estimation method called the emulator-informed
+deep-generative model (EIDGM), designed to handle RCS data. Specifically, EIDGM
+integrates a physics-informed neural network-based emulator that immediately
+generates DE solutions and a Wasserstein generative adversarial network-based
+parameter generator that can effectively mimic the RCS data. We evaluated EIDGM
+on exponential growth, logistic population models, and the Lorenz system,
+demonstrating its superior ability to accurately capture parameter
+distributions. Additionally, we applied EIDGM to an experimental dataset of
+Amyloid beta 40 and beta 42, successfully capturing diverse parameter
+distribution shapes. This shows that EIDGM can be applied to model a wide range
+of systems and extended to uncover the operating principles of systems based on
+limited data.
+
+
+
+
+
+
+
+ ☆ Real-time classification of EEG signals using Machine Learning
+ deployment
+
+
+ The prevailing educational methods predominantly rely on traditional
+classroom instruction or online delivery, often limiting the teachers' ability
+to engage effectively with all the students simultaneously. A more intrinsic
+method of evaluating student attentiveness during lectures can enable the
+educators to tailor the course materials and their teaching styles in order to
+better meet the students' needs. The aim of this paper is to enhance teaching
+quality in real time, thereby fostering a higher student engagement in the
+classroom activities. By monitoring the students' electroencephalography (EEG)
+signals and employing machine learning algorithms, this study proposes a
+comprehensive solution for addressing this challenge. Machine learning has
+emerged as a powerful tool for simplifying the analysis of complex variables,
+enabling the effective assessment of the students' concentration levels based
+on specific parameters. However, the real-time impact of machine learning
+models necessitates careful consideration where their deployment is concerned.
+This study proposes a machine learning-based approach for predicting the level
+of students' comprehension with regard to a certain topic. A browser interface
+was introduced that accesses the values of the system's parameters to determine
+a student's level of concentration on a chosen topic. The deployment of the
+proposed system made it necessary to address the real-time challenges faced by
+the students, consider the system's cost, and establish trust in its efficacy.
+This paper presents the efforts made for approaching this pertinent issue
+through the implementation of innovative technologies and provides a framework
+for addressing key considerations for future research directions.
+
+
+
+ comment: Published in Romanian Journal of Information Technology and Automatic
+ Control
+
+
+
+
+
+
+ ☆ Uncertainty quantification for improving radiomic-based models in
+ radiation pneumonitis prediction
+
+
+ Background and Objective: Radiation pneumonitis (RP) is a side effect of
+thoracic radiation therapy. Recently, machine learning (ML) models enhanced
+with radiomic and dosiomic features provide better predictions by incorporating
+spatial information beyond DVHs. However, to improve the clinical decision
+process, we propose to use uncertainty quantification (UQ) to improve the
+confidence in model prediction. This study evaluates the impact of post hoc UQ
+methods on the discriminative performance and calibration of ML models for RP
+prediction. Methods: This study evaluated four ML models: logistic regression
+(LR), support vector machines (SVM), extreme gradient boosting (XGB), and
+random forest (RF), using radiomic, dosiomic, and dosimetric features to
+predict RP. We applied UQ methods, including Platt scaling, isotonic regression,
+Venn-ABERS predictor, and Conformal Prediction, to quantify uncertainty. Model
+performance was assessed through Area Under the Receiver Operating
+Characteristic curve (AUROC), Area Under the Precision-Recall Curve (AUPRC),
+and Adaptive Calibration Error (ACE) using Leave-One-Out Cross-Validation
+(LOO-CV). Results: UQ methods enhanced predictive performance, particularly for
+high-certainty predictions, while also improving calibration. Radiomic and
+dosiomic features increased model accuracy but introduced calibration
+challenges, especially for non-linear models like XGB and RF. Performance gains
+from UQ methods were most noticeable at higher certainty thresholds.
+Conclusion: Integrating UQ into ML models with radiomic and dosiomic features
+improves both predictive accuracy and calibration, supporting more reliable
+clinical decision-making. The findings emphasize the value of UQ methods in
+enhancing the applicability of predictive models for RP in healthcare settings.
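Platt scaling, one of the post hoc calibration methods evaluated above, fits a one-dimensional logistic model to held-out scores and labels and then applies it to new scores. A minimal sketch with synthetic, illustrative data and hyperparameters:

```python
import numpy as np

# Minimal Platt scaling sketch: fit p = sigmoid(a*s + b) on held-out scores
# s and labels y by gradient descent on the log-loss. Data is synthetic.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_platt(scores, labels, lr=0.1, iters=2000):
    a, b = 1.0, 0.0
    for _ in range(iters):
        p = sigmoid(a * scores + b)
        err = p - labels                  # d(log-loss)/d(logit)
        a -= lr * np.mean(err * scores)
        b -= lr * np.mean(err)
    return a, b

rng = np.random.default_rng(0)
scores = rng.normal(0.0, 2.0, 500)
labels = (rng.uniform(size=500) < sigmoid(2.0 * scores - 1.0)).astype(float)
a, b = fit_platt(scores, labels)
# a, b should land near the generating parameters (2, -1).
```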
+
+
+
+
+
+
+
+ ☆ RobotDiffuse: Motion Planning for Redundant Manipulator based on
+ Diffusion Model
+
+
+ Redundant manipulators, with their higher Degrees of Freedom (DOFs), offer
+enhanced kinematic performance and versatility, making them suitable for
+applications like manufacturing, surgical robotics, and human-robot
+collaboration. However, motion planning for these manipulators is challenging
+due to increased DOFs and complex, dynamic environments. While traditional
+motion planning algorithms struggle with high-dimensional spaces, deep
+learning-based methods often face instability and inefficiency in complex
+tasks. This paper introduces RobotDiffuse, a diffusion model-based approach for
+motion planning in redundant manipulators. By integrating physical constraints
+with a point cloud encoder and replacing the U-Net structure with an
+encoder-only transformer, RobotDiffuse improves the model's ability to capture
+temporal dependencies and generate smoother, more coherent motion plans. We
+validate the approach using a complex simulator, and release a new dataset with
+35M robot poses and 0.14M obstacle avoidance scenarios. Experimental results
+demonstrate the effectiveness of RobotDiffuse and the promise of diffusion
+models for motion planning tasks. The code can be accessed at
+https://github.com/ACRoboT-buaa/RobotDiffuse.
+
+
+
+
+
+
+
+ ☆ Disparate Model Performance and Stability in Machine Learning Clinical
+ Support for Diabetes and Heart Diseases
+
+
+
+
+
+
+
+
+ Ioannis Bilionis, Ricardo C. Berrios, Luis Fernandez-Luque, Carlos Castillo
+
+
+ Machine Learning (ML) algorithms are vital for supporting clinical
+decision-making in biomedical informatics. However, their predictive
+performance can vary across demographic groups, often due to the
+underrepresentation of historically marginalized populations in training
+datasets. Our investigation reveals widespread sex- and age-related inequities
+in chronic disease datasets and their derived ML models. Thus, a novel
+analytical framework is introduced, combining systematic arbitrariness with
+traditional metrics like accuracy and data complexity. The analysis of data
+from over 25,000 individuals with chronic diseases revealed mild sex-related
+disparities, favoring predictive accuracy for males, and significant
+age-related differences, with better accuracy for younger patients. Notably,
+older patients showed inconsistent predictive accuracy across seven datasets,
+linked to higher data complexity and lower model performance. This highlights
+that representativeness in training data alone does not guarantee equitable
+outcomes, and model arbitrariness must be addressed before deploying models in
+clinical settings.
+
+
+
+ comment: This paper will be presented in American Medical Informatics
+ Association (AMIA) Informatics Summit Conference 2025 (Pittsburgh, PA). 10
+ pages, 2 figures, 5 tables
+
+
+
+
+
+
+ ☆ Meta-Learning-Based Delayless Subband Adaptive Filter using Complex
+ Self-Attention for Active Noise Control
+
+
+ Active noise control typically employs adaptive filtering to generate
+secondary noise, where the least mean square algorithm is the most widely used.
+However, traditional updating rules are linear and exhibit limited
+effectiveness in addressing nonlinear environments and nonstationary noise. To
+tackle this challenge, we reformulate the active noise control problem as a
+meta-learning problem and propose a meta-learning-based delayless subband
+adaptive filter with deep neural networks. The core idea is to utilize a neural
+network as an adaptive algorithm that can adapt to different environments and
+types of noise. The neural network is trained on noisy observations, meaning
+that it learns an optimized updating rule without access to true labels. A
+single-headed attention recurrent neural network is devised with learnable
+feature embedding to update the adaptive filter weight efficiently, enabling
+accurate computation of the secondary source to attenuate the unwanted primary
+noise. In order to relax the time constraint on updating the adaptive filter
+weights, the delayless subband architecture is employed, which will allow the
+system to be updated less frequently as the downsampling factor increases. In
+addition, the delayless subband architecture does not introduce additional time
+delays in active noise control systems. A skip updating strategy is introduced
+to decrease the updating frequency further, so that machines with limited
+resources can more readily deploy our meta-learning-based model.
+Extensive multi-condition training ensures generalization and robustness
+against various types of noise and environments. Simulation results demonstrate
+that our meta-learning-based model achieves superior noise reduction
+performance compared to traditional methods.
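The classical baseline this abstract contrasts against is the least mean square (LMS) adaptive filter. As a point of reference only (this is the textbook linear update, not the proposed meta-learned filter), a minimal LMS sketch on a made-up system-identification task:

```python
import numpy as np

def lms_step(w, x_buf, d, mu=0.01):
    """One least-mean-square (LMS) update of an adaptive FIR filter.

    w: filter weights; x_buf: most recent input samples (same length as w);
    d: desired signal sample. Returns (updated weights, error sample).
    """
    y = np.dot(w, x_buf)        # filter output (e.g., secondary noise estimate)
    e = d - y                   # residual error the adaptive filter tries to drive to zero
    w = w + mu * e * x_buf      # stochastic-gradient weight update
    return w, e

# Toy run: adapt a 4-tap filter to identify a known FIR path (noiseless).
rng = np.random.default_rng(0)
h_true = np.array([0.5, -0.3, 0.2, 0.1])
w = np.zeros(4)
x = rng.standard_normal(2000)
for n in range(4, len(x)):
    x_buf = x[n - 4:n][::-1]
    d = np.dot(h_true, x_buf)
    w, e = lms_step(w, x_buf, d, mu=0.05)
```

Because this updating rule is linear in the error, it struggles with the nonlinear environments and nonstationary noise the paper targets.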
+
+
+
+ comment: 31 pages, 8 figures
+
+
+
+
+
+
+ ☆ Optimizing Helmet Detection with Hybrid YOLO Pipelines: A Detailed
+ Analysis
+
+
+
+
+
+
+
+
+ Vaikunth M, Dejey D, Vishaal C, Balamurali S
+
+
+ Helmet detection is crucial for advancing protection levels in public road
+traffic dynamics. This problem statement translates to an object detection
+task. Therefore, this paper compares recent You Only Look Once (YOLO) models in
+the context of helmet detection in terms of reliability and computational load.
+Specifically, YOLOv8, YOLOv9, and the newly released YOLOv11 have been used.
+Besides, a modified architectural pipeline that remarkably improves the overall
+performance has been proposed in this manuscript. This hybridized YOLO model
+(h-YOLO) was benchmarked against the standalone models, and the analysis
+shows that h-YOLO is preferable to plain YOLO models for helmet detection. The
+models were tested using a range of standard object detection benchmarks such
+as recall, precision, and mAP (Mean Average Precision). In addition, training
+and testing times were recorded to provide the overall scope of the models in a
+real-time detection scenario.
+
+
+
+
+
+
+
+ ☆ Towards Simple and Provable Parameter-Free Adaptive Gradient Methods
+
+
+ Optimization algorithms such as AdaGrad and Adam have significantly advanced
+the training of deep models by dynamically adjusting the learning rate during
+the optimization process. However, ad hoc tuning of learning rates poses a
+challenge, leading to inefficiencies in practice. To address this issue, recent
+research has focused on developing "learning-rate-free" or "parameter-free"
+algorithms that operate effectively without the need for learning rate tuning.
+Despite these efforts, existing parameter-free variants of AdaGrad and Adam
+tend to be overly complex and/or lack formal convergence guarantees. In this
+paper, we present AdaGrad++ and Adam++, novel and simple parameter-free
+variants of AdaGrad and Adam with convergence guarantees. We prove that
+AdaGrad++ achieves comparable convergence rates to AdaGrad in convex
+optimization without predefined learning rate assumptions. Similarly, Adam++
+matches the convergence rate of Adam without relying on any conditions on the
+learning rates. Experimental results across various deep learning tasks
+validate the competitive performance of AdaGrad++ and Adam++.
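For context, the standard AdaGrad update that AdaGrad++ builds on can be sketched as below; the base learning rate `lr` is exactly the hand-tuned knob that parameter-free variants aim to remove. This is a minimal reference implementation on a toy convex quadratic, not the paper's algorithm:

```python
import numpy as np

def adagrad(grad, x0, lr=0.5, steps=200, eps=1e-8):
    """Standard AdaGrad: per-coordinate step sizes lr / sqrt(sum of squared grads).

    The base learning rate `lr` still requires manual tuning; parameter-free
    variants such as the paper's AdaGrad++ aim to eliminate this knob.
    """
    x = np.asarray(x0, dtype=float)
    g2 = np.zeros_like(x)                 # running sum of squared gradients
    for _ in range(steps):
        g = grad(x)
        g2 += g * g
        x -= lr * g / (np.sqrt(g2) + eps)  # per-coordinate adaptive step
    return x

# Minimize the convex quadratic f(x) = ||x - [1, -2]||^2.
target = np.array([1.0, -2.0])
x_min = adagrad(lambda x: 2.0 * (x - target), x0=np.zeros(2))
```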
+
+
+
+
+
+
+
+
+ Mansour El Alami, Nouhaila Innan, Muhammad Shafique, Mohamed Bennai
+
+
+ As financial fraud becomes increasingly complex, effective detection methods
+are essential. Quantum Machine Learning (QML) introduces certain capabilities
+that may enhance both accuracy and efficiency in this area. This study examines
+how different quantum feature map and ansatz configurations affect the
+performance of three QML-based classifiers-the Variational Quantum Classifier
+(VQC), the Sampler Quantum Neural Network (SQNN), and the Estimator Quantum
+Neural Network (EQNN)-when applied to two non-standardized financial fraud
+datasets. Different quantum feature map and ansatz configurations are
+evaluated, revealing distinct performance patterns. The VQC consistently
+demonstrates strong classification results, achieving an F1 score of 0.88,
+while the SQNN also delivers promising outcomes. In contrast, the EQNN
+struggles to produce robust results, emphasizing the challenges presented by
+non-standardized data. These findings highlight the importance of careful model
+configuration in QML-based financial fraud detection. By showing how specific
+feature maps and ansatz choices influence predictive success, this work guides
+researchers and practitioners in refining QML approaches for complex financial
+applications.
+
+
+
+
+
+
+
+ ☆ Low-Rank Contextual Reinforcement Learning from Heterogeneous Human
+ Feedback
+
+
+
+
+
+
+
+
+ Seong Jin Lee, Will Wei Sun, Yufeng Liu
+
+
+ Reinforcement learning from human feedback (RLHF) has become a cornerstone
+for aligning large language models with human preferences. However, the
+heterogeneity of human feedback, driven by diverse individual contexts and
+preferences, poses significant challenges for reward learning. To address this,
+we propose a Low-rank Contextual RLHF (LoCo-RLHF) framework that integrates
+contextual information to better model heterogeneous feedback while maintaining
+computational efficiency. Our approach builds on a contextual preference model,
+leveraging the intrinsic low-rank structure of the interaction between user
+contexts and query-answer pairs to mitigate the high dimensionality of feature
+representations. Furthermore, we address the challenge of distributional shifts
+in feedback through our Pessimism in Reduced Subspace (PRS) policy, inspired by
+pessimistic offline reinforcement learning techniques. We theoretically
+demonstrate that our policy achieves a tighter sub-optimality gap compared to
+existing methods. Extensive experiments validate the effectiveness of
+LoCo-RLHF, showcasing its superior performance in personalized RLHF settings
+and its robustness to distribution shifts.
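The contextual preference idea can be illustrated with a Bradley-Terry model whose reward is a low-rank bilinear interaction between user context and answer features. This is an illustrative sketch under assumed names and shapes, not the paper's exact parameterization:

```python
import numpy as np

def preference_prob(ctx, feat_a, feat_b, U, V):
    """Probability that a user with context `ctx` prefers answer A over B
    under a Bradley-Terry model with low-rank bilinear reward r(c, x) = c^T U V^T x.

    U (d_c x r) and V (d_x x r), with r << min(d_c, d_x), encode the low-rank
    interaction between user contexts and query-answer features.
    """
    W = U @ V.T                                  # low-rank interaction matrix
    r_a = ctx @ W @ feat_a                       # reward of answer A for this user
    r_b = ctx @ W @ feat_b
    return 1.0 / (1.0 + np.exp(-(r_a - r_b)))    # logistic (Bradley-Terry) link

rng = np.random.default_rng(0)
U = rng.standard_normal((5, 2))                  # rank-2 factors
V = rng.standard_normal((8, 2))
ctx = rng.standard_normal(5)
fa, fb = rng.standard_normal(8), rng.standard_normal(8)
p = preference_prob(ctx, fa, fb, U, V)
```

The low-rank factorization keeps the number of learned parameters linear in the feature dimensions rather than quadratic, which is the computational-efficiency point the abstract makes.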
+
+
+
+
+
+
+
+ ☆ Revisiting PCA for time series reduction in temporal dimension
+
+
+ Jiaxin Gao, Wenbo Hu, Yuntian Chen
+
+
+ Deep learning has significantly advanced time series
+analysis (TSA), enabling the extraction of complex patterns for tasks like
+classification, forecasting, and regression. Although dimensionality reduction
+has traditionally focused on the variable space-achieving notable success in
+minimizing data redundancy and computational complexity-less attention has been
+paid to reducing the temporal dimension. In this study, we revisit Principal
+Component Analysis (PCA), a classical dimensionality reduction technique, to
+explore its utility in temporal dimension reduction for time series data. It is
+generally thought that applying PCA to the temporal dimension would disrupt
+temporal dependencies, leading to limited exploration in this area. However,
+our theoretical analysis and extensive experiments demonstrate that applying
+PCA to sliding series windows not only maintains model performance, but also
+enhances computational efficiency. In auto-regressive forecasting, the temporal
+structure is partially preserved through windowing, and PCA is applied within
+these windows to denoise the time series while retaining their statistical
+information. By preprocessing time-series data with PCA, we reduce the temporal
+dimensionality before feeding it into TSA models such as Linear, Transformer,
+CNN, and RNN architectures. This approach accelerates training and inference
+and reduces resource consumption. Notably, PCA improves Informer training and
+inference speed by up to 40% and decreases GPU memory usage of TimesNet by 30%,
+without sacrificing model accuracy. Comparative analysis against other
+reduction methods further highlights the effectiveness of PCA in improving the
+efficiency of TSA models.
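The preprocessing step described above (form sliding windows, apply PCA within them, feed the reduced windows to a TSA model) can be sketched in a few lines of numpy. The window length, number of components, and noisy-sinusoid input below are illustrative choices, not the paper's settings:

```python
import numpy as np

def pca_reduce_windows(series, window, k):
    """Reduce the temporal dimension of sliding windows with PCA.

    Builds overlapping windows of length `window`, centers them, and projects
    each onto the top-`k` principal components, so a downstream TSA model sees
    k-dimensional inputs instead of `window`-dimensional ones.
    """
    X = np.lib.stride_tricks.sliding_window_view(series, window)  # (n_windows, window)
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # PCA via SVD
    basis = Vt[:k]                                     # top-k principal directions
    return Xc @ basis.T, basis, mean                   # reduced windows + inverse map

t = np.linspace(0, 8 * np.pi, 500)
series = np.sin(t) + 0.1 * np.random.default_rng(0).standard_normal(500)
Z, basis, mean = pca_reduce_windows(series, window=24, k=4)
```

Projecting back with `Z @ basis + mean` recovers a denoised version of the windows, which matches the abstract's claim that PCA denoises while retaining the statistical information of the series.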
+
+
+
+ comment: 13 pages, 5 figures, 7 tables
+
+
+
+
+
+
+ ☆ Gx2Mol: De Novo Generation of Hit-like Molecules from Gene Expression
+ Profiles via Deep Learning
+
+
+ De novo generation of hit-like molecules is a challenging task in the drug
+discovery process. Most methods in previous studies learn the semantics and
+syntax of molecular structures by analyzing molecular graphs or simplified
+molecular input line entry system (SMILES) strings; however, they do not take
+into account the drug responses of the biological systems consisting of genes
+and proteins. In this study, we propose a deep generative model, Gx2Mol, which
+utilizes gene expression profiles to generate molecular structures with
+desirable phenotypes for arbitrary target proteins. In the algorithm, a
+variational autoencoder is employed as a feature extractor to learn the latent
+feature distribution of the gene expression profiles. Then, a long short-term
+memory is leveraged as the chemical generator to produce syntactically valid
+SMILES strings that satisfy the feature conditions of the gene expression
+profile extracted by the feature extractor. Experimental results and case
+studies demonstrate that the proposed Gx2Mol model can produce new molecules
+with potential bioactivities and drug-like properties.
+
+
+
+
+
+
+
+ ☆ Introduction to Graph Neural Networks: A Starting Point for Machine
+ Learning Engineers
+
+
+
+
+
+
+
+
+ James H. Tanis, Chris Giannella, Adrian V. Mariano
+
+
+ Graph neural networks are deep neural networks designed for graphs with
+attributes attached to nodes or edges. The number of research papers in the
+literature concerning these models is growing rapidly due to their impressive
+performance on a broad range of tasks. This survey introduces graph neural
+networks through the encoder-decoder framework and provides examples of
+decoders for a range of graph analytic tasks. It uses theory and numerous
+experiments on homogeneous graphs to illustrate the behavior of graph neural
+networks for different training sizes and degrees of graph complexity.
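A typical encoder building block in such models is the graph convolutional (GCN) propagation step. The sketch below is the standard symmetric-normalized formulation with random placeholder weights and a made-up toy graph, intended only to illustrate the encoder half of the encoder-decoder framing:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: symmetric-normalized adjacency with
    self-loops, times node features, times a weight matrix, then ReLU.
    """
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))         # degree normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)            # tiny star graph
H = rng.standard_normal((3, 5))                   # node attributes
Z = gcn_layer(A, H, rng.standard_normal((5, 2)))  # 2-dim node embeddings
```

A task-specific decoder (e.g., a classifier over `Z` for node classification) then maps these embeddings to predictions.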
+
+
+ This study is based on the ICASSP 2025 Signal Processing Grand Challenge's
+Accelerometer-Based Person-in-Bed Detection Challenge, which aims to determine
+bed occupancy using accelerometer signals. The task is divided into two tracks:
+"in bed" and "not in bed" segmented detection, and streaming detection, facing
+challenges such as individual differences, posture variations, and external
+disturbances. We propose a spectral-temporal fusion-based feature
+representation method with mixup data augmentation, and adopt Intersection over
+Union (IoU) loss to optimize detection accuracy. In the two tracks, our method
+achieved outstanding results of 100.00% and 95.55% in detection scores,
+securing first place and third place, respectively.
+
+
+
+
+
+
+
+ ☆ Fully Data-driven but Interpretable Human Behavioural Modelling with
+ Differentiable Discrete Choice Model
+
+
+ Discrete choice models are essential for modelling various decision-making
+processes in human behaviour. However, the specification of these models has
+depended heavily on domain knowledge from experts, and the fully automated but
+interpretable modelling of complex human behaviours has been a long-standing
+challenge. In this paper, we introduce the differentiable discrete choice model
+(Diff-DCM), a fully data-driven method for the interpretable modelling,
+learning, prediction, and control of complex human behaviours, which is
+realised by differentiable programming. Solely from input features and choice
+outcomes without any prior knowledge, Diff-DCM can estimate interpretable
+closed-form utility functions that reproduce observed behaviours. Comprehensive
+experiments with both synthetic and real-world data demonstrate that Diff-DCM
+can be applied to various types of data and requires only a small amount of
+computational resources for the estimations, which can be completed within tens
+of seconds on a laptop without any accelerators. In these experiments, we also
+demonstrate that, using its differentiability, Diff-DCM can provide useful
+insights into human behaviours, such as an optimal intervention path for
+effective behavioural changes. This study provides a strong basis for the fully
+automated and reliable modelling, prediction, and control of human behaviours.
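The differentiable building block underlying such models is the classical multinomial logit: a closed-form utility per alternative and a softmax over utilities. The sketch below uses made-up travel alternatives and hand-picked coefficients; Diff-DCM learns such utility functions from data rather than taking them as given:

```python
import numpy as np

def choice_probabilities(X, beta):
    """Multinomial logit choice model: utility u_j = x_j . beta, and the
    probability of choosing alternative j is softmax(u)_j.
    """
    u = X @ beta                 # deterministic utilities, linear in features
    u = u - u.max()              # shift for numerical stability
    e = np.exp(u)
    return e / e.sum()

# Three travel alternatives described by (cost, time); negative coefficients
# mean decision-makers dislike both cost and travel time.
X = np.array([[2.0, 30.0],     # bus
              [5.0, 15.0],     # train
              [9.0, 10.0]])    # taxi
beta = np.array([-0.3, -0.05])
p = choice_probabilities(X, beta)
```

Because every operation here is differentiable, the coefficients (and, in Diff-DCM, the functional form itself) can be fit by gradient descent from observed choices.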
+
+
+
+
+
+
+
+ ☆ Comparing Few to Rank Many: Active Human Preference Learning using
+ Randomized Frank-Wolfe AISTATS 2025
+
+
+ We study learning of human preferences from limited comparison feedback.
+This task is ubiquitous in machine learning. Its applications, such as
+reinforcement learning from human feedback, have been transformational. We
+formulate this problem as learning a Plackett-Luce model over a universe of $N$
+choices from $K$-way comparison feedback, where typically $K \ll N$. Our
+solution is the D-optimal design for the Plackett-Luce objective. The design
+defines a data logging policy that elicits comparison feedback for a small
+collection of optimally chosen points from all ${N \choose K}$ feasible
+subsets. The main algorithmic challenge in this work is that even fast methods
+for solving D-optimal designs would have $O({N \choose K})$ time complexity. To
+address this issue, we propose a randomized Frank-Wolfe (FW) algorithm that
+solves the linear maximization sub-problems in the FW method on randomly chosen
+variables. We analyze the algorithm, and evaluate it empirically on synthetic
+and open-source NLP datasets.
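The Plackett-Luce model over $K$-way feedback assigns each observed ranking of a $K$-item subset a probability by repeatedly picking the next-best item in proportion to its score. A minimal sketch with arbitrary example scores (the D-optimal design and randomized Frank-Wolfe machinery are not reproduced here):

```python
import numpy as np

def plackett_luce_prob(ranking, scores):
    """Probability of observing `ranking` (item indices, best first) of a
    K-item subset under a Plackett-Luce model with positive item scores.

    The winner of each successive stage is drawn proportionally to its score
    among the items not yet ranked.
    """
    s = np.asarray(scores, dtype=float)[list(ranking)]
    prob = 1.0
    for i in range(len(s)):
        prob *= s[i] / s[i:].sum()   # pick s[i] among the remaining items
    return prob

scores = np.array([4.0, 2.0, 1.0, 1.0])    # latent qualities of N = 4 choices
p = plackett_luce_prob((0, 1, 2), scores)  # one K = 3-way comparison outcome
```

Here the probability is $\frac{4}{7}\cdot\frac{2}{3}\cdot\frac{1}{1} = \frac{8}{21}$, illustrating how $K$-way feedback constrains the $N$ underlying scores.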
+
+
+
+ comment: Submitted to AISTATS 2025 on October 10, 2024
+
+
+
+
+
+
+ ☆ Asymptotically Optimal Search for a Change Point Anomaly under a
+ Composite Hypothesis Model
+
+
+ We address the problem of searching for a change point in an anomalous
+process among a finite set of M processes. Specifically, we address a composite
+hypothesis model in which each process generates measurements following a
+common distribution with an unknown parameter (vector). This parameter belongs
+to either a normal or abnormal space depending on the current state of the
+process. Before the change point, all processes, including the anomalous one,
+are in a normal state; after the change point, the anomalous process
+transitions to an abnormal state. Our goal is to design a sequential search
+strategy that minimizes the Bayes risk by balancing sample complexity and
+detection accuracy. We propose a deterministic search algorithm with the
+following notable properties. First, we analytically demonstrate that when the
+distributions of both normal and abnormal processes are unknown, the algorithm
+is asymptotically optimal in minimizing the Bayes risk as the error probability
+approaches zero. In the second setting, where the parameter under the null
+hypothesis is known, the algorithm achieves asymptotic optimality with improved
+detection time based on the true normal state. Simulation results are presented
+to validate the theoretical findings.
+
+
+
+ comment: 13 pages, 6 figures
+
+
+
+
+
+
+ ☆ An In-Depth Analysis of Adversarial Discriminative Domain Adaptation for
+ Digit Classification
+
+
+
+
+
+
+
+
+ Eugene Choi, Julian Rodriguez, Edmund Young
+
+
+ Domain adaptation is an active area of research driven by the growing demand
+for robust machine learning models that perform well on real-world data.
+Adversarial learning for deep neural networks (DNNs) has emerged as a promising
+approach to improving generalization ability, particularly for image
+classification. In this paper, we implement a specific adversarial learning
+technique known as Adversarial Discriminative Domain Adaptation (ADDA) and
+replicate digit classification experiments from the original ADDA paper. We
+extend their findings by examining a broader range of domain shifts and provide
+a detailed analysis of in-domain classification accuracy post-ADDA. Our results
+demonstrate that ADDA significantly improves accuracy across certain domain
+shifts with minimal impact on in-domain performance. Furthermore, we provide
+qualitative analysis and propose potential explanations for ADDA's limitations
+in less successful domain shifts. Code is at
+https://github.com/eugenechoi2004/COS429_FINAL .
+
+
+
+
+
+
+
+
+ Manqing Liu, David R. Bellamy, Andrew L. Beam
+
+
+ Causal inference is a critical task across fields such as healthcare,
+economics, and the social sciences. While recent advances in machine learning,
+especially those based on the deep-learning architectures, have shown potential
+in estimating causal effects, existing approaches often fall short in handling
+complex causal structures and lack adaptability across various causal
+scenarios. In this paper, we present a novel transformer-based method for
+causal inference that overcomes these challenges. The core innovation of our
+model lies in its integration of causal Directed Acyclic Graphs (DAGs) directly
+into the attention mechanism, enabling it to accurately model the underlying
+causal structure. This allows for flexible estimation of both average treatment
+effects (ATE) and conditional average treatment effects (CATE). Extensive
+experiments on both synthetic and real-world datasets demonstrate that our
+approach surpasses existing methods in estimating causal effects across a wide
+range of scenarios. The flexibility and robustness of our model make it a
+valuable tool for researchers and practitioners tackling complex causal
+inference problems.
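The core idea of injecting a causal DAG into attention can be sketched as a mask that lets each node attend only to itself and its parents. This is a minimal illustration under assumed conventions (the paper's actual architecture is more involved):

```python
import numpy as np

def dag_masked_attention(Q, K, V, dag_adj):
    """Scaled dot-product attention restricted by a causal DAG: node i may
    attend to node j only if j is a parent of i, or j == i.

    dag_adj[i, j] = 1 encodes a directed edge i -> j.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    allow = dag_adj.T.astype(bool) | np.eye(len(dag_adj), dtype=bool)
    scores = np.where(allow, scores, -1e9)   # mask out non-parents
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Chain DAG: X0 -> X1 -> X2.
dag = np.array([[0, 1, 0],
                [0, 0, 1],
                [0, 0, 0]])
rng = np.random.default_rng(0)
out, W = dag_masked_attention(rng.standard_normal((3, 4)),
                              rng.standard_normal((3, 4)),
                              rng.standard_normal((3, 4)), dag)
```

The mask guarantees that information flows only along the hypothesized causal directions, which is what allows the model to respect the DAG structure while estimating effects.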
+
+
+
+
+
+
+
+
+ Junhui He, Shangyu Wu, Weidong Wen, Chun Jason Xue, Qingan Li
+
+
+ Deploying large language models (LLMs) on edge devices presents significant
+challenges due to the substantial computational overhead and memory
+requirements. Activation sparsification can mitigate these resource challenges
+by reducing the number of activated neurons during inference. Existing methods
+typically employ thresholding-based sparsification based on the statistics of
+activation tensors. However, they do not model the impact of activation
+sparsification on model performance, resulting in avoidable performance degradation.
+To address the limitations, this paper reformulates the activation
+sparsification problem to explicitly capture the relationship between
+activation sparsity and model performance. Then, this paper proposes CHESS, a
+general activation sparsification approach via CHannel-wise thrEsholding and
+Selective Sparsification. First, channel-wise thresholding assigns a unique
+threshold to each activation channel in the feed-forward network (FFN) layers.
+Then, selective sparsification involves applying thresholding-based activation
+sparsification to specific layers within the attention modules. Finally, we
+detail the implementation of sparse kernels to accelerate LLM inference.
+Experimental results demonstrate that the proposed CHESS achieves lower
+performance degradation across eight downstream tasks while activating fewer
+parameters than existing methods, thus speeding up the LLM inference by up to
+1.27x.
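The channel-wise thresholding step can be sketched as zeroing activations whose magnitude falls below a per-channel threshold. The thresholds below are arbitrary placeholders; CHESS derives them from its performance-aware formulation, which this sketch does not reproduce:

```python
import numpy as np

def channelwise_sparsify(acts, thresholds):
    """Zero out activations below a per-channel magnitude threshold.

    acts: (tokens, channels) activation tensor from an FFN layer;
    thresholds: (channels,) vector, one threshold per activation channel.
    """
    mask = np.abs(acts) >= thresholds   # broadcasts per channel
    return acts * mask                  # zeroed entries skip computation downstream

rng = np.random.default_rng(0)
acts = rng.standard_normal((16, 4))
thr = np.array([0.1, 0.5, 1.0, 2.0])   # looser to tighter, per channel
sparse = channelwise_sparsify(acts, thr)
```

Sparse kernels can then skip the zeroed entries, which is where the inference speedup comes from.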
+
+
+
+
+
+
+
+
+ Richard Ren, Steven Basart, Adam Khoja, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Alexander Pan, Gabriel Mukobi, Ryan H. Kim, Stephen Fitz, Dan Hendrycks
+
+
+ As artificial intelligence systems grow more powerful, there has been
+increasing interest in "AI safety" research to address emerging and future
+risks. However, the field of AI safety remains poorly defined and
+inconsistently measured, leading to confusion about how researchers can
+contribute. This lack of clarity is compounded by the unclear relationship
+between AI safety benchmarks and upstream general capabilities (e.g., general
+knowledge and reasoning). To address these issues, we conduct a comprehensive
+meta-analysis of AI safety benchmarks, empirically analyzing their correlation
+with general capabilities across dozens of models and providing a survey of
+existing directions in AI safety. Our findings reveal that many safety
+benchmarks highly correlate with both upstream model capabilities and training
+compute, potentially enabling "safetywashing"--where capability improvements
+are misrepresented as safety advancements. Based on these findings, we propose
+an empirical foundation for developing more meaningful safety metrics and
+define AI safety in a machine learning research context as a set of clearly
+delineated research goals that are empirically separable from generic
+capabilities advancements. In doing so, we aim to provide a more rigorous
+framework for AI safety research, advancing the science of safety evaluations
+and clarifying the path towards measurable progress.
+
+
+ SimMIM is a widely used method for pretraining vision transformers using
+masked image modeling. However, despite its success in fine-tuning performance,
+it has been shown to perform sub-optimally when used for linear probing. We
+propose an efficient patch-wise weighting derived from keypoint features which
+captures the local information and provides better context during SimMIM's
+reconstruction phase. Our method, KAMIM, improves the top-1 linear probing
+accuracy from 16.12% to 33.97%, and fine-tuning accuracy from 76.78% to 77.3%
+when tested on the ImageNet-1K dataset with a ViT-B when trained for the same
+number of epochs. We conduct extensive testing on different datasets, keypoint
+extractors, and model architectures and observe that patch-wise weighting
+augments linear probing performance for larger pretraining datasets. We also
+analyze the learned representations of a ViT-B trained using KAMIM and observe
+that they behave similarly to representations from contrastive learning, with
+longer attention distances and homogeneous self-attention across layers.
+Our code is publicly available at https://github.com/madhava20217/KAMIM.
+
+
+
+ comment: Accepted to ICASSP 2025
+
+
+
+
+
+
+ ♻ ☆ Global Prediction of COVID-19 Variant Emergence Using Dynamics-Informed
+ Graph Neural Networks
+
+
+
+
+
+
+
+
+ Majd Al Aawar, Srikar Mutnuri, Mansooreh Montazerin, Ajitesh Srivastava
+
+
+ During the COVID-19 pandemic, a major driver of new surges has been the
+emergence of new variants. When a new variant emerges in one or more countries,
+other nations monitor its spread in preparation for its potential arrival. The
+impact of the new variant and the timings of epidemic peaks in a country highly
+depend on when the variant arrives. The current methods for predicting the
+spread of new variants rely on statistical modeling, however, these methods
+work only when the new variant has already arrived in the region of interest
+and has a significant prevalence. Can we predict when a variant existing
+elsewhere will arrive in a given region? To address this question, we propose a
+variant-dynamics-informed Graph Neural Network (GNN) approach. First, we derive
+the dynamics of variant prevalence across pairs of regions (countries) that
+apply to a large class of epidemic models. The dynamics motivate the
+introduction of certain features in the GNN. We demonstrate that our proposed
+dynamics-informed GNN outperforms all the baselines, including the currently
+pervasive framework of Physics-Informed Neural Networks (PINNs). To advance
+research in this area, we introduce a benchmarking tool to assess a
+user-defined model's prediction performance across 87 countries and 36
+variants.
+
+
+
+
+
+
+
+ ♻ ☆ DGNN-YOLO: Interpretable Dynamic Graph Neural Networks with YOLO11 for
+ Small Object Detection and Tracking in Traffic Surveillance
+
+
+
+
+
+
+
+
+ Shahriar Soudeep, M. F. Mridha, Md Abrar Jahin, Nilanjan Dey
+
+
+ Accurate detection and tracking of small objects, such as pedestrians,
+cyclists, and motorbikes, is critical for traffic surveillance systems, which
+are crucial for improving road safety and decision-making in intelligent
+transportation systems. However, traditional methods face challenges such as
+occlusion, low resolution, and dynamic traffic conditions, necessitating
+innovative approaches to address these limitations. This paper introduces
+DGNN-YOLO, a novel framework integrating dynamic graph neural networks (DGNN)
+with YOLO11 to enhance small-object detection and tracking in traffic
+surveillance systems. The framework leverages YOLO11's advanced spatial feature
+extraction capabilities for precise object detection and incorporates a DGNN to
+dynamically model spatial-temporal relationships for robust real-time tracking.
+By constructing and updating graph structures, DGNN-YOLO effectively represents
+objects as nodes and their interactions as edges, thereby ensuring adaptive and
+accurate tracking in complex and dynamic environments. Additionally, Grad-CAM,
+Grad-CAM++, and Eigen-CAM visualization techniques were applied to DGNN-YOLO to
+provide model-agnostic interpretability and deeper insights into the model's
+decision-making process, enhancing its transparency and trustworthiness.
+Extensive experiments demonstrated that DGNN-YOLO consistently outperformed
+state-of-the-art methods in detecting and tracking small objects under diverse
+traffic conditions, achieving the highest precision (0.8382), recall (0.6875),
+and mAP@0.5:0.95 (0.6476), showing its robustness and scalability, particularly
+in challenging scenarios involving small and occluded objects. This study
+provides a scalable, real-time traffic surveillance and analysis solution,
+significantly contributing to intelligent transportation systems.
+
+
+ Heart failure is a leading cause of global mortality, necessitating improved
+diagnostic strategies. Classical machine learning models struggle with
+challenges such as high-dimensional data, class imbalances, poor feature
+representations, and lack of interpretability. While quantum machine learning
+holds promise, current hybrid models have not fully exploited quantum
+advantages. In this paper, we propose the Kolmogorov-Arnold Classical-Quantum
+Dual-Channel Neural Network (KACQ-DCNN), a novel hybrid architecture that
+replaces traditional multilayer perceptrons with Kolmogorov-Arnold Networks
+(KANs), enabling learnable univariate activation functions. Our KACQ-DCNN
+4-qubit, 1-layer model outperforms 37 benchmark models, including 16 classical
+and 12 quantum neural networks, achieving an accuracy of 92.03%, with
+macro-average precision, recall, and F1 scores of 92.00%. It also achieved a
+ROC-AUC of 94.77%, surpassing other models by significant margins, as validated
+by paired t-tests with a significance threshold of 0.0056 (after Bonferroni
+correction). Ablation studies highlight the synergistic effect of
+classical-quantum integration, improving performance by about 2% over MLP
+variants. Additionally, LIME and SHAP explainability techniques enhance feature
+interpretability, while conformal prediction provides robust uncertainty
+quantification. Our results demonstrate that KACQ-DCNN improves cardiovascular
+diagnostics by combining high accuracy with interpretability and uncertainty
+quantification.
+
+
+
+
+
+
+
+ ♻ ☆ Sustainable Diffusion-based Incentive Mechanism for Generative AI-driven
+ Digital Twins in Industrial Cyber-Physical Systems
+
+
+
+
+
+
+
+
+ Jinbo Wen, Jiawen Kang, Dusit Niyato, Yang Zhang, Shiwen Mao
+
+
+ Industrial Cyber-Physical Systems (ICPSs) are an integral component of modern
+manufacturing and industries. By digitizing data throughout product life
+cycles, Digital Twins (DTs) in ICPSs enable a shift from current industrial
+infrastructures to intelligent and adaptive infrastructures. Thanks to its data
+processing capability, Generative Artificial Intelligence (GenAI) can drive the
+construction and update of DTs to improve predictive accuracy and prepare for
+diverse smart manufacturing. However, mechanisms that leverage Industrial
+Internet of Things (IIoT) devices to share sensing data for DT construction are
+susceptible to adverse selection problems. In this paper, we first develop a
+GenAI-driven DT architecture in ICPSs. To address the adverse selection problem
+caused by information asymmetry, we propose a contract theory model and develop
+a sustainable diffusion-based soft actor-critic algorithm to identify the
+optimal feasible contract. Specifically, we leverage dynamic structured pruning
+techniques to reduce the number of parameters in actor networks, enabling a
+sustainable and efficient implementation of the proposed algorithm.
+Numerical results demonstrate the effectiveness of the proposed scheme and the
+algorithm, enabling efficient DT construction and updates to monitor and manage
+ICPSs.
+
+
+
+
+
+
+
+ ♻ ☆ Lusifer: LLM-based User SImulated Feedback Environment for online
+ Recommender systems
+
+
+
+
+
+
+
+
+ Danial Ebrat, Eli Paradalis, Luis Rueda
+
+
+ Training reinforcement learning-based recommender systems is often hindered
+by the lack of dynamic and realistic user interactions. To address this
+limitation, we introduce Lusifer, a novel environment leveraging Large Language
+Models (LLMs) to generate simulated user feedback. Lusifer synthesizes user
+profiles and interaction histories to simulate responses and behaviors toward
+recommended items, with profiles updated after each rating to reflect evolving
+user characteristics. Utilizing the MovieLens dataset as a proof of concept, we
+limited our implementation to the last 40 interactions for each user,
+representing approximately 39% and 22% of the training sets, to focus on recent
+user behavior. For consistency and to gain insights into the performance of
+traditional methods with limited data, we implemented baseline approaches using
+the same data subset. Our results demonstrate that Lusifer accurately emulates
+user behavior and preferences even with reduced training data, achieving an
+RMSE of 1.3 across various test sets. This paper presents Lusifer's operational
+pipeline, including prompt generation and iterative user profile updates, and
+compares its performance against baseline methods. The findings validate
+Lusifer's ability to produce realistic dynamic feedback and suggest that it
+offers a scalable and adjustable framework for user simulation in online
+reinforcement learning recommender systems for future studies, particularly
+when training data is limited.
+
+
+
+
+
+
+
+
+ Alexander Nikitin, ST John, Arno Solin, Samuel Kaski
+
+
+ Gaussian processes (GPs) provide a principled and direct approach for
+inference and learning on graphs. However, the lack of justified graph kernels
+for spatio-temporal modelling has held back their use in graph problems. We
+leverage an explicit link between stochastic partial differential equations
+(SPDEs) and GPs on graphs, introduce a framework for deriving graph kernels via
+SPDEs, and derive non-separable spatio-temporal graph kernels that capture
+interaction across space and time. We formulate the graph kernels for the
+stochastic heat equation and wave equation. We show that by providing novel
+tools for spatio-temporal GP modelling on graphs, we outperform pre-existing
+graph kernels in real-world applications that feature diffusion, oscillation,
+and other complicated interactions.
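The simplest member of this SPDE-derived family is the purely spatial heat (diffusion) kernel exp(-tL) of a graph, which solves the heat equation on the graph; the paper's non-separable spatio-temporal kernels generalize well beyond it. A minimal sketch on a made-up path graph:

```python
import numpy as np

def graph_heat_kernel(A, t):
    """Heat (diffusion) kernel exp(-t L) of a graph with adjacency matrix A,
    where L = D - A is the combinatorial graph Laplacian.
    """
    L = np.diag(A.sum(axis=1)) - A                 # graph Laplacian (symmetric PSD)
    evals, evecs = np.linalg.eigh(L)
    return evecs @ np.diag(np.exp(-t * evals)) @ evecs.T

# Path graph on 4 nodes: 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
K = graph_heat_kernel(A, t=0.5)
```

The resulting matrix is a valid GP covariance over the nodes, with covariance decaying with graph distance, as diffusion intuition suggests.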
+
+
+
+
+
+
+
+ ♻ ☆ Generation through the lens of learning theory
+
+
+ We study generation through the lens of statistical learning theory. First,
+we abstract and formalize the results of Gold [1967], Angluin [1979], Angluin
+[1980] and Kleinberg and Mullainathan [2024] in terms of a binary hypothesis
+class defined over an abstract example space. Then, we extend the notion of
+"generation" from Kleinberg and Mullainathan [2024] to two new settings, which
+we call "uniform" and "non-uniform" generation, and provide a characterization of
+which hypothesis classes are uniformly and non-uniformly generatable. As is
+standard in learning theory, our characterizations are in terms of the
+finiteness of a new combinatorial dimension termed the Closure dimension. By
+doing so, we are able to compare generatability with predictability (captured
+via PAC and online learnability) and show that these two properties of
+hypothesis classes are incompatible -- there are classes that are generatable
+but not predictable and vice versa. Finally, we extend our results to capture
+prompted generation and give a complete characterization of which classes are
+prompt generatable, generalizing some of the work by Kleinberg and Mullainathan
+[2024].
+
+
+ Cyber timeline analysis, or forensic timeline analysis, is crucial in Digital
+Forensics and Incident Response (DFIR). It examines artefacts and events,
+particularly timestamps and metadata, to detect anomalies, establish
+correlations, and reconstruct incident timelines. Traditional methods rely on
+structured artefacts, such as logs and filesystem metadata, using specialised
+tools for evidence identification and feature extraction. This paper introduces
+GenDFIR, a framework leveraging large language models (LLMs), specifically
+Llama 3.1 8B in zero shot mode, integrated with a Retrieval-Augmented
+Generation (RAG) agent. Incident data is preprocessed into a structured
+knowledge base, enabling the RAG agent to retrieve relevant events based on
+user prompts. The LLM interprets this context, offering semantic enrichment.
+Tested on synthetic data in a controlled environment, results demonstrate
+GenDFIR's reliability and robustness, showcasing LLMs' potential to automate
+timeline analysis and advance threat detection.
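The retrieval step the abstract describes — pulling relevant events from a structured knowledge base and ordering them into a timeline — can be sketched with simple keyword matching; the event data and function name here are illustrative, not GenDFIR's actual pipeline (which uses an LLM-backed RAG agent).

```python
from datetime import datetime

# Toy structured knowledge base of preprocessed incident events
events = [
    {"ts": "2024-03-01T10:02:11", "msg": "failed ssh login from 10.0.0.7"},
    {"ts": "2024-03-01T10:02:40", "msg": "successful ssh login from 10.0.0.7"},
    {"ts": "2024-03-01T09:55:03", "msg": "scheduled backup completed"},
]

def retrieve_timeline(events, query):
    """Retrieve events relevant to a user prompt (keyword overlap here,
    embeddings in a real RAG agent) and return them in chronological order
    to reconstruct an incident timeline."""
    terms = set(query.lower().split())
    hits = [e for e in events if terms & set(e["msg"].lower().split())]
    return sorted(hits, key=lambda e: datetime.fromisoformat(e["ts"]))

timeline = retrieve_timeline(events, "ssh login")
```

An LLM would then interpret the retrieved, time-ordered context to provide the semantic enrichment described above.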
+
+
+
+ comment: 24 pages V5.3
+
+
+
+
+
+
+ ♻ ☆ RL-MUL 2.0: Multiplier Design Optimization with Parallel Deep
+ Reinforcement Learning and Space Reduction
+
+
+ Multiplication is a fundamental operation in many applications, and
+multipliers are widely adopted in various circuits. However, optimizing
+multipliers is challenging due to the extensive design space. In this paper, we
+propose a multiplier design optimization framework based on reinforcement
+learning. We utilize matrix and tensor representations for the compressor tree
+of a multiplier, enabling seamless integration of convolutional neural networks
+as the agent network. The agent optimizes the multiplier structure using a
+Pareto-driven reward customized to balance area and delay. Furthermore, we
+enhance the original framework with parallel reinforcement learning and design
+space pruning techniques and extend its capability to optimize fused
+multiply-accumulate (MAC) designs. Experiments conducted on different bit
+widths of multipliers demonstrate that multipliers produced by our approach
+outperform all baseline designs in terms of area, power, and delay. The
+performance gain is further validated by comparing the area, power, and delay
+of processing element arrays using multipliers from our approach and baseline
+approaches.
+
+
+
+ comment: Accepted by TODAES 2025
+
+
+
+
+
+
+ ♻ ☆ Convergence of SGD with momentum in the nonconvex case: A time
+ window-based analysis
+
+
+ The stochastic gradient descent method with momentum (SGDM) is a common
+approach for solving large-scale and stochastic optimization problems. Despite
+its popularity, the convergence behavior of SGDM remains less understood in
+nonconvex scenarios. This is primarily due to the absence of a sufficient
+descent property and challenges in simultaneously controlling the momentum and
+stochastic errors in an almost sure sense. To address these challenges, we
+investigate the behavior of SGDM over specific time windows, rather than
+examining the descent of consecutive iterates as in traditional studies. This
+time window-based approach simplifies the convergence analysis and enables us
+to establish the iterate convergence result for SGDM under the {\L}ojasiewicz
+property. We further provide local convergence rates which depend on the
+underlying {\L}ojasiewicz exponent and the utilized step size schemes.
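The SGDM iteration analyzed above can be sketched directly; the time-window analysis itself is not reproduced, and the test objective and hyperparameters below are illustrative assumptions.

```python
import numpy as np

def sgdm(grad, x0, lr=0.01, beta=0.9, steps=2000, noise=0.0, seed=0):
    """Heavy-ball SGD with momentum:
        m_{k+1} = beta * m_k + g_k,    x_{k+1} = x_k - lr * m_{k+1},
    where g_k is a (possibly noisy) stochastic gradient."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(steps):
        g = grad(x) + noise * rng.standard_normal(x.shape)
        m = beta * m + g
        x = x - lr * m
    return x

# Illustrative nonconvex objective f(x) = (x0^2 - 1)^2 + x1^2,
# with minima at x0 = +/-1, x1 = 0
grad = lambda x: np.array([4.0 * x[0] * (x[0] ** 2 - 1.0), 2.0 * x[1]])
x_star = sgdm(grad, [2.0, 1.0])
```

Near a minimizer the iteration behaves like a damped oscillator with contraction factor roughly sqrt(beta) per step, which is why the iterates settle despite the momentum.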
+
+
+
+ comment: 23 pages
+
+
+
+
+
+
+ ♻ ☆ A Mathematical Framework for the Problem of Security for Cognition in
+ Neurotechnology
+
+
+
+
+
+
+
+
+ Bryce Allen Bagley, Claudia K Petritsch
+
+
+ The rapid advancement in neurotechnology in recent years has created an
+emerging critical intersection between neurotechnology and security.
+Implantable devices, non-invasive monitoring, and non-invasive therapies all
+carry with them the prospect of violating the privacy and autonomy of
+individuals' cognition. A growing number of scientists and physicians have made
+calls to address this issue, but applied efforts have been relatively limited.
+A major barrier hampering scientific and engineering efforts to address these
+security issues is the lack of a clear means of describing and analyzing
+relevant problems. In this paper we develop Cognitive Neurosecurity, a
+mathematical framework which enables such description and analysis by drawing
+on methods and results from multiple fields. We demonstrate certain statistical
+properties which have significant implications for Cognitive Neurosecurity, and
+then present descriptions of the algorithmic problems faced by attackers
+attempting to violate privacy and autonomy, and defenders attempting to
+obstruct such attempts.
+
+
+
+
+
+
+
+ ♻ ☆ MERT: Acoustic Music Understanding Model with Large-Scale
+ Self-supervised Training ICLR 2024
+
+
+
+
+
+
+
+
+ Yizhi Li, Ruibin Yuan, Ge Zhang, Yinghao Ma, Xingran Chen, Hanzhi Yin, Chenghao Xiao, Chenghua Lin, Anton Ragni, Emmanouil Benetos, Norbert Gyenge, Roger Dannenberg, Ruibo Liu, Wenhu Chen, Gus Xia, Yemin Shi, Wenhao Huang, Zili Wang, Yike Guo, Jie Fu
+
+
+ Self-supervised learning (SSL) has recently emerged as a promising paradigm
+for training generalisable models on large-scale data in the fields of vision,
+text, and speech. Although SSL has been proven effective in speech and audio,
+its application to music audio has yet to be thoroughly explored. This is
+partially due to the distinctive challenges associated with modelling musical
+knowledge, particularly tonal and pitched characteristics of music. To address
+this research gap, we propose an acoustic Music undERstanding model with
+large-scale self-supervised Training (MERT), which incorporates teacher models
+to provide pseudo labels in masked language modelling (MLM)-style acoustic
+pre-training. In our exploration, we identified an effective combination of
+teacher models, which outperforms conventional speech and audio approaches in
+terms of performance. This combination includes an acoustic teacher based on
+Residual Vector Quantisation - Variational AutoEncoder (RVQ-VAE) and a musical
+teacher based on the Constant-Q Transform (CQT). Furthermore, we explore a wide
+range of settings to overcome the instability in acoustic language model
+pre-training, which allows our designed paradigm to scale from 95M to 330M
+parameters. Experimental results indicate that our model can generalise and
+perform well on 14 music understanding tasks and attain state-of-the-art (SOTA)
+overall scores.
+
+
+
+ comment: accepted by ICLR 2024
+
+
+
+
+
+
+ ♻ ☆ Convergence analysis of wide shallow neural operators within the
+ framework of Neural Tangent Kernel
+
+
+ Neural operators aim to approximate operators mapping between Banach
+spaces of functions, and have achieved much success in the field of scientific
+computing. Unlike certain deep learning-based solvers such as
+Physics-Informed Neural Networks (PINNs) and the Deep Ritz Method (DRM), neural
+operators can solve an entire class of Partial Differential Equations (PDEs). Although
+much work has been done to analyze the approximation and generalization error
+of neural operators, there is still a lack of analysis on their training error.
+In this work, we conduct the convergence analysis of gradient descent for the
+wide shallow neural operators within the framework of Neural Tangent Kernel
+(NTK). The core idea lies in the fact that over-parameterization and random
+initialization together ensure that each weight vector remains near its
+initialization throughout all iterations, yielding the linear convergence of
+gradient descent. In this work, we demonstrate that under the setting of
+over-parametrization, gradient descent can find the global minimum regardless
+of whether it is in continuous time or discrete time. Finally, we briefly
+discuss the case of physics-informed shallow neural operators.
+
+
+ Despite the vast amount of information encoded in Knowledge Graphs (KGs),
+information about the class affiliation of entities remains often incomplete.
+Graph Convolutional Networks (GCNs) have been shown to be effective predictors
+of complete information about the class affiliation of entities in KGs.
+However, these models do not learn the class affiliation of entities in KGs
+incorporating the complexity of the task, which negatively affects the models'
+prediction capabilities. To address this problem, we introduce a Markov
+process-based architecture into well-known GCN architectures. This end-to-end
+network learns the prediction of class affiliation of entities in KGs within a
+Markov process. The number of computational steps is learned during training
+using a geometric distribution. At the same time, the loss function combines
+insights from the field of evidential learning. The experiments show a
+performance improvement over existing models in several studied architectures
+and datasets. Based on the chosen hyperparameters for the geometric
+distribution, the expected number of computation steps can be adjusted to
+improve efficiency and accuracy during training.
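The geometric-depth idea in the abstract — repeating a propagation step a geometrically distributed number of times, so the expected number of computation steps is 1/p — can be sketched as follows. This is a toy forward pass with assumed names, not the paper's trained architecture or its evidential loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def gcn_step(x, adj_norm, weight):
    """One GCN-style propagation step: x <- relu(A_hat @ x @ W)."""
    return np.maximum(adj_norm @ x @ weight, 0.0)

def markov_gcn_forward(x, adj_norm, weight, p_halt=0.25):
    """Repeat the propagation step a geometrically distributed number of
    times; the expected depth is 1 / p_halt, so tuning p_halt trades off
    efficiency against accuracy as described above."""
    n_steps = rng.geometric(p_halt)  # support {1, 2, ...}
    for _ in range(n_steps):
        x = gcn_step(x, adj_norm, weight)
    return x, n_steps

# Toy 3-node graph with self-loops, row-normalized adjacency
A = np.array([[1., 1., 0.], [1., 1., 1.], [0., 1., 1.]])
adj_norm = A / A.sum(axis=1, keepdims=True)
x_out, depth = markov_gcn_forward(np.eye(3), adj_norm, np.eye(3))
```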
+
+
+
+
+
+
+
+ ♻ ☆ Are Sparse Neural Networks Better Hard Sample Learners? BMVC 2024
+
+
+
+
+
+
+
+
+ Qiao Xiao, Boqian Wu, Lu Yin, Christopher Neil Gadzinski, Tianjin Huang, Mykola Pechenizkiy, Decebal Constantin Mocanu
+
+
+ While deep learning has demonstrated impressive progress, it remains a
+daunting challenge to learn from hard samples as these samples are usually
+noisy and intricate. These hard samples play a crucial role in the optimal
+performance of deep neural networks. Most research on Sparse Neural Networks
+(SNNs) has focused on standard training data, leaving gaps in understanding
+their effectiveness on complex and challenging data. This paper's extensive
+investigation across scenarios reveals that most SNNs trained on challenging
+samples can often match or surpass dense models in accuracy at certain sparsity
+levels, especially with limited data. We observe that layer-wise density ratios
+tend to play an important role in SNN performance, particularly for methods
+that train from scratch without pre-trained initialization. These insights
+enhance our understanding of SNNs' behavior and potential for efficient
+learning approaches in data-centric AI. Our code is publicly available at:
+\url{https://github.com/QiaoXiao7282/hard_sample_learners}.
+
+
+
+ comment: Accepted at British Machine Vision Conference (BMVC 2024)
+
+
+
+
+
+
+ ♻ ☆ A data driven approach to classify descriptors based on their efficiency
+ in translating noisy trajectories into physically-relevant information
+
+
+
+
+
+
+
+
+ Simone Martino, Domiziano Doria, Chiara Lionello, Matteo Becchi, Giovanni M. Pavan
+
+
+ Reconstructing the physical complexity of many-body dynamical systems can be
+challenging. Starting from the trajectories of their constitutive units (raw
+data), typical approaches require selecting appropriate descriptors to convert
+them into time-series, which are then analyzed to extract interpretable
+information. However, identifying the most effective descriptor is often
+non-trivial. Here, we report a data-driven approach to compare the efficiency
+of various descriptors in extracting information from noisy trajectories and
+translating it into physically relevant insights. As a prototypical system with
+non-trivial internal complexity, we analyze molecular dynamics trajectories of
+an atomistic system where ice and water coexist in equilibrium near the
+solid/liquid transition temperature. We compare general and specific
+descriptors often used in aqueous systems: number of neighbors, molecular
+velocities, Smooth Overlap of Atomic Positions (SOAP), Local Environments and
+Neighbors Shuffling (LENS), Orientational Tetrahedral Order, and distance from
+the fifth neighbor ($d_5$). Using Onion Clustering -- an efficient unsupervised
+method for single-point time-series analysis -- we assess the maximum
+extractable information for each descriptor and rank them via a
+high-dimensional metric. Our results show that advanced descriptors like SOAP
+and LENS outperform classical ones due to higher signal-to-noise ratios.
+Nonetheless, even simple descriptors can rival or exceed advanced ones after
+local signal denoising. For example, $d_5$, initially among the weakest,
+becomes the most effective at resolving the system's non-local dynamical
+complexity after denoising. This work highlights the critical role of noise in
+information extraction from molecular trajectories and offers a data-driven
+approach to identify optimal descriptors for systems with characteristic
+internal complexity.
+
+
+
+ comment: 19 pages, 5 figures + 3 in supporting information (at the bottom of
+ the manuscript)
+
+
+
+
+
+
+ ♻ ☆ S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training
+
+
+ Training deep neural networks (DNNs) is costly. Fortunately, Nvidia Ampere
+and Hopper GPUs can run matrix multiplications twice as fast as the dense
+equivalent by exploiting 2:4 sparsity. However, previous STE-based 2:4
+pre-training methods (e.g. STE with hard-thresholding, SR-STE) suffer from
+optimization difficulties because of the discontinuous pruning function. In this
+study, we comprehensively analyse the bottleneck of traditional N:M sparse
+training and recognize three drawbacks with discontinuity: incorrect descending
+direction, inability to predict the amount of descent and sparse mask
+oscillation. In light of this, we propose S-STE, a simple yet powerful 2:4
+training method that contains two parts: to continuously project weights to be
+2:4 sparse, and to rescale sparse weights with a per-tensor fixed scaling
+factor. Besides, we adopt minimum-variance unbiased estimation for the
+activation gradient and FP8 quantization for the whole process. Results show that our method
+surpasses previous 2:4 pre-training recipes and is comparable even with full
+parameter models. Our toolkit is available at
+https://github.com/huyz2023/2by4-pretrain.
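The two ingredients named above — projecting weights to a 2:4 pattern and rescaling with a per-tensor fixed factor — can be sketched in a simplified form. The paper's continuous projection and its derived scaling factor are not reproduced; the hard top-2 selection and the norm-restoring rescale below are stand-in assumptions.

```python
import numpy as np

def sparsify_2_4(w):
    """2:4 semi-structured sparsity: in every group of four consecutive
    weights, keep the two largest in magnitude and zero the other two."""
    groups = w.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]  # two smallest per group
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(w.shape)

def rescale_per_tensor(w_dense, w_sparse):
    """One simple per-tensor fixed scaling factor: restore the L2 norm of
    the dense tensor after pruning (the paper derives its own factor)."""
    return w_sparse * (np.linalg.norm(w_dense) / (np.linalg.norm(w_sparse) + 1e-12))

w = np.arange(1.0, 9.0)          # two groups of four: [1..4], [5..8]
w_s = rescale_per_tensor(w, sparsify_2_4(w))
```

Every group of four retains exactly two nonzeros, which is the pattern Ampere/Hopper sparse tensor cores accelerate.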
+
+
+
+
+
+
+
+ ♻ ☆ Rethinking Deep Learning: Non-backpropagation and Non-optimization
+ Machine Learning Approach Using Hebbian Neural Networks
+
+
+ Developing strong AI could provide a powerful tool for addressing social and
+scientific challenges. Neural networks (NNs), inspired by biological systems,
+have the potential to achieve this. However, weight optimization techniques
+using error backpropagation are not observed in biological systems, raising
+doubts about current NN approaches. In this context, Itoh (2024) solved the
+MNIST classification problem without using objective functions or
+backpropagation. However, weight updates were not used, so it does not qualify
+as machine learning AI. In this study, I develop a machine learning method that
+mimics biological neural systems by implementing Hebbian learning in NNs
+without backpropagation or optimization methods to solve the MNIST
+classification problem and analyze its output. Development proceeded in three
+stages. In the first stage, I applied the Hebbian learning rule to the MNIST
+character recognition algorithm by Itoh (2024), resulting in lower accuracy
+than non-Hebbian NNs, highlighting the limitations of conventional training
+procedures for Hebbian learning. In the second stage, I examined the properties
+of individually trained NNs using norm-based cognition, showing that NNs
+trained on a specific label respond powerfully to that label. In the third
+stage, I created an MNIST character recognition program using vector norm
+magnitude as the criterion, achieving an accuracy of approximately 75%. This
+demonstrates that the Hebbian learning NNs can recognize handwritten characters
+without objective functions, backpropagation, optimization processes, or large
+datasets. Based on these results, developing a mechanism based on norm-based
+cognition as a fundamental unit and then increasing complexity to achieve
+indirect similarity cognition should help mimic biological neural systems and
+contribute to realizing strong AI.
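The two mechanisms the abstract combines — Hebbian weight accumulation without an objective function, and norm-based recognition by response strength — can be sketched on a toy two-class problem. The data and names below are illustrative assumptions, not the paper's MNIST setup.

```python
import numpy as np

def hebbian_train(X, lr=0.1):
    """Accumulate Hebbian updates w <- w + lr * x over inputs of one class:
    no objective function, no backpropagation, no optimizer."""
    w = np.zeros(X.shape[1])
    for x in X:
        w += lr * x
    return w

def norm_response(w, x):
    """Norm-based 'cognition': the response magnitude of weights w to x."""
    return abs(np.dot(w, x))

# Two toy classes in 2-D; one weight vector trained per class label
rng = np.random.default_rng(0)
class_a = rng.normal([1.0, 0.0], 0.1, size=(50, 2))
class_b = rng.normal([0.0, 1.0], 0.1, size=(50, 2))
w_a, w_b = hebbian_train(class_a), hebbian_train(class_b)

def classify(x):
    """Assign the label whose trained weights respond most strongly."""
    return "a" if norm_response(w_a, x) > norm_response(w_b, x) else "b"
```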
+
+
+ Recent empirical studies have demonstrated that diffusion models can
+effectively learn the image distribution and generate new samples. Remarkably,
+these models can achieve this even with a small number of training samples
+despite a large image dimension, circumventing the curse of dimensionality. In
+this work, we provide theoretical insights into this phenomenon by leveraging
+key empirical observations: (i) the low intrinsic dimensionality of image data,
+(ii) a union of manifold structure of image data, and (iii) the low-rank
+property of the denoising autoencoder in trained diffusion models. These
+observations motivate us to assume the underlying data distribution of image
+data as a mixture of low-rank Gaussians and to parameterize the denoising
+autoencoder as a low-rank model according to the score function of the assumed
+distribution. With these setups, we rigorously show that optimizing the
+training loss of diffusion models is equivalent to solving the canonical
+subspace clustering problem over the training samples. Based on this
+equivalence, we further show that the minimal number of samples required to
+learn the underlying distribution scales linearly with the intrinsic dimensions
+under the above data and model assumptions. This insight sheds light on why
+diffusion models can break the curse of dimensionality and exhibit the phase
+transition in learning distributions. Moreover, we empirically establish a
+correspondence between the subspaces and the semantic representations of image
+data, facilitating image editing. We validate these results with corroborated
+experimental results on both simulated distributions and image datasets.
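The assumed data model above — a mixture of low-rank Gaussians, each supported on its own low-dimensional subspace of a high-dimensional ambient space — can be sampled directly. This is only the data-generating assumption, not the paper's analysis; function names and dimensions are illustrative.

```python
import numpy as np

def sample_low_rank_mixture(n, ambient_dim=50, rank=3, n_components=2, seed=0):
    """Draw n samples from a mixture of low-rank Gaussians: each component
    lives on its own rank-r subspace of the ambient space."""
    rng = np.random.default_rng(seed)
    # One random orthonormal rank-r basis per mixture component
    bases = [np.linalg.qr(rng.standard_normal((ambient_dim, rank)))[0]
             for _ in range(n_components)]
    comps = rng.integers(n_components, size=n)
    Z = rng.standard_normal((n, rank))            # low-dimensional coordinates
    X = np.stack([bases[c] @ z for c, z in zip(comps, Z)])
    return X, comps

X, comps = sample_low_rank_mixture(200)
```

Although the ambient dimension is 50, each component's samples span only a rank-3 subspace — the low intrinsic dimensionality that the sample-complexity result scales with.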
+
+
+
+ comment: 40 pages, 9 figures
+
+
+
+
+
+
+ ♻ ☆ From Commands to Prompts: LLM-based Semantic File System for AIOS
+
+
+
+
+
+
+
+
+ Zeru Shi, Kai Mei, Yongye Su, Chaoji Zuo, Wenyue Hua, Wujiang Xu, Yujie Ren, Zirui Liu, Mengnan Du, Dong Deng, Yongfeng Zhang
+
+
+ Large language models (LLMs) have demonstrated significant potential in the
+development of intelligent applications and systems such as LLM-based agents
+and agent operating systems (AIOS). However, when these applications and
+systems interact with the underlying file system, the file system still follows
+the traditional paradigm: reliant on manual navigation through precise
+commands. This paradigm poses a bottleneck to the usability of these systems as
+users are required to navigate complex folder hierarchies and remember cryptic
+file names. To address this limitation, we propose an LLM-based semantic file
+system (LSFS) for prompt-driven file management. Unlike conventional
+approaches, LSFS incorporates LLMs to enable users or agents to interact with
+files through natural language prompts, facilitating semantic file management.
+At the macro-level, we develop a comprehensive API set to achieve semantic file
+management functionalities, such as semantic file retrieval, file update
+monitoring and summarization, and semantic file rollback. At the micro-level,
+we store files by constructing semantic indexes for them, design and implement
+syscalls of different semantic operations (e.g., CRUD, group by, join) powered
+by a vector database. Our experiments show that LSFS offers significant
+improvements over traditional file systems in terms of user convenience, the
+diversity of supported functions, and the accuracy and efficiency of file
+operations. Additionally, with the integration of LLMs, our system enables more
+intelligent file management tasks, such as content summarization and version
+comparison, further enhancing its capabilities.
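The micro-level idea above — storing files under a semantic index so a natural-language prompt can retrieve them — can be sketched with a toy index. This is not the LSFS API; the class, a bag-of-words embedding standing in for a real neural encoder, and the file contents are all illustrative assumptions.

```python
import numpy as np

class SemanticIndex:
    """Sketch of semantic file retrieval: each file is indexed by an
    embedding of its content, and a prompt retrieves the closest file."""
    def __init__(self):
        self.paths, self.texts = [], []

    def add(self, path, content):
        self.paths.append(path)
        self.texts.append(set(content.lower().split()))

    def retrieve(self, prompt, k=1):
        """Rank files by cosine similarity between binary bag-of-words
        vectors of the prompt and the file contents."""
        q = set(prompt.lower().split())
        scores = [len(q & t) / (np.sqrt(len(q)) * np.sqrt(len(t)) + 1e-12)
                  for t in self.texts]
        order = np.argsort(scores)[::-1][:k]
        return [self.paths[i] for i in order]

idx = SemanticIndex()
idx.add("notes/ml.txt", "gradient descent momentum neural network training")
idx.add("notes/cooking.txt", "pasta tomato garlic olive oil recipe")
```

In LSFS the same retrieval sits behind syscall-like operations and is powered by a vector database rather than a toy word-overlap score.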
+
+
+
+
+
+
+
+ ♻ ☆ FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics
+
+
+
+
+
+
+
+
+ ChenRui Duan, Zelin Zang, Yongjie Xu, Hang He, Zihan Liu, Siyuan Li, Zijia Song, Ju-Sheng Zheng, Stan Z. Li
+
+
+ Metagenomic data, comprising mixed multi-species genomes, are prevalent in
+diverse environments like oceans and soils, significantly impacting human
+health and ecological functions. However, current research relies on K-mer
+representations, which limit the capture of structurally and functionally relevant gene
+contexts. Moreover, these approaches struggle with encoding biologically
+meaningful genes and fail to address the One-to-Many and Many-to-One
+relationships inherent in metagenomic data. To overcome these challenges, we
+introduce FGBERT, a novel metagenomic pre-trained model that employs a
+protein-based gene representation as a context-aware and structure-relevant
+tokenizer. FGBERT incorporates Masked Gene Modeling (MGM) to enhance the
+understanding of inter-gene contextual relationships and Triplet Enhanced
+Metagenomic Contrastive Learning (TMC) to elucidate gene sequence-function
+relationships. Pre-trained on over 100 million metagenomic sequences, FGBERT
+demonstrates superior performance on metagenomic datasets at four levels,
+spanning gene, functional, bacterial, and environmental levels and ranging from
+1k to 213k input sequences. Case studies of ATP Synthase and Gene Operons
+highlight FGBERT's capability for functional recognition and its biological
+relevance in metagenomic research.
+
+
+ Federated learning (FL) is a collaborative machine learning approach that
+enables multiple clients to train models without sharing their private data.
+With the rise of deep learning, large-scale models have garnered significant
+attention due to their exceptional performance. However, a key challenge in FL
+is the limitation imposed by clients with constrained computational and
+communication resources, which hampers the deployment of these large models.
+The Mixture of Experts (MoE) architecture addresses this challenge with its
+sparse activation property, which reduces computational workload and
+communication demands during inference and updates. Additionally, MoE
+facilitates better personalization by allowing each expert to specialize in
+different subsets of the data distribution. To alleviate the communication
+burdens between the server and clients, we propose FedMoE-DA, a new FL model
+training framework that leverages the MoE architecture and incorporates a novel
+domain-aware, fine-grained aggregation strategy to enhance the robustness,
+personalizability, and communication efficiency simultaneously. Specifically,
+the correlation between both intra-client expert models and inter-client data
+heterogeneity is exploited. Moreover, we utilize peer-to-peer (P2P)
+communication between clients for selective expert model synchronization, thus
+significantly reducing the server-client transmissions. Experiments demonstrate
+that our FedMoE-DA achieves excellent performance while reducing the
+communication pressure on the server.
+
+
+
+ comment: 8 pages, 5 figures, accepted by The 20th International Conference on
+ Mobility, Sensing and Networking (MSN 2024)
+
+
+
+
+
+
+ ♻ ☆ Exploring Parameter-Efficient Fine-Tuning Techniques for Code Generation
+ with Large Language Models
+
+
+
+
+
+
+
+
+ Martin Weyssow, Xin Zhou, Kisub Kim, David Lo, Houari Sahraoui
+
+
+ Large language models (LLMs) demonstrate impressive capabilities to generate
+accurate code snippets given natural language intents in a zero-shot manner,
+i.e., without the need for specific fine-tuning. While prior studies have
+highlighted the advantages of fine-tuning LLMs, this process incurs high
+computational costs, making it impractical in resource-scarce environments,
+particularly for models with billions of parameters. To address these
+challenges, previous research explored in-context learning (ICL) and
+retrieval-augmented generation (RAG) as strategies to guide the LLM generative
+process with task-specific prompt examples. However, ICL and RAG introduce
+inconveniences, such as the need for designing contextually relevant prompts
+and the absence of learning task-specific parameters, thereby limiting
+downstream task performance. In this context, we foresee parameter-efficient
+fine-tuning (PEFT) as a promising approach to efficiently specialize LLMs to
+task-specific data while maintaining reasonable resource consumption. In this
+paper, we deliver a comprehensive study of PEFT techniques for LLMs in the
+context of automated code generation. Our investigation reveals the
+superiority and potential of PEFT over ICL and RAG
+across a diverse set of LLMs and three representative Python code generation
+datasets: Conala, CodeAlpacaPy, and APPS. Furthermore, our study highlights the
+potential for tuning larger LLMs and significant reductions in memory usage by
+combining PEFT with quantization. Therefore, this study opens opportunities for
+broader applications of PEFT in software engineering scenarios. Our code is
+available at https://github.com/martin-wey/peft-llm-code/.
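One widely used PEFT technique of the kind studied above is LoRA, which freezes the pre-trained weight and trains only a low-rank update. The minimal NumPy layer below is an illustrative sketch of that idea, not the paper's training setup or any specific library's API.

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA-style layer: the pre-trained weight W stays frozen and
    only a low-rank update B @ A is trained, so a (d_out x d_in) weight
    needs just r * (d_in + d_out) trainable parameters."""
    def __init__(self, weight, rank=2, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = weight                                  # frozen
        d_out, d_in = weight.shape
        self.A = rng.normal(0.0, 0.01, (rank, d_in))     # trainable
        self.B = np.zeros((d_out, rank))                 # trainable, zero init
        self.alpha = alpha

    def trainable_params(self):
        return self.A.size + self.B.size

    def __call__(self, x):
        # Effective weight is W + alpha * B @ A; only A and B receive updates
        return x @ (self.W + self.alpha * self.B @ self.A).T

W = np.eye(16)                 # stand-in for a frozen pre-trained weight
layer = LoRALinear(W)
x = np.ones(16)
y = layer(x)                   # B is zero-initialized, so y == x @ W.T at init
```

Because B starts at zero, fine-tuning begins exactly at the pre-trained model, and the trainable parameter count is a small fraction of the frozen weight's size.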
+
+
+
+
+
+
+
+ ♻ ☆ Developing Cryptocurrency Trading Strategy Based on Autoencoder-CNN-GANs
+ Algorithms
+
+
+ This paper leverages machine learning algorithms to forecast and analyze
+financial time series. The process begins with a denoising autoencoder to
+filter out random noise fluctuations from the main contract price data. Then,
+one-dimensional convolution reduces the dimensionality of the filtered data and
+extracts key information. The filtered and dimensionality-reduced price data is
+fed into a GAN, and its output serves as the input to a fully connected
+network. Through cross-validation, a model is trained to capture features that
+precede large price fluctuations. The model predicts the likelihood and
+direction of significant price changes in real-time price sequences, placing
+trades at moments of high prediction accuracy. Empirical results demonstrate
+that using autoencoders and convolution to filter and denoise financial data,
+combined with GANs, achieves a certain level of predictive performance,
+validating the capabilities of machine learning algorithms to discover
+underlying patterns in financial sequences. Keywords: CNN; GANs;
+Cryptocurrency; Prediction.
+
+
+
+ comment: The paper was accepted by 2024 4th International Conference on
+ Artificial Intelligence, Robotics, and Communication(ICAIRC 2024)
+
+
+
+
+
+
+ ♻ ☆ CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language
+ Models to Coding Preferences
+
+
+ Evaluating the alignment of large language models (LLMs) with user-defined
+coding preferences is a challenging endeavour that requires a deep assessment
+of LLMs' outputs. Existing methods and benchmarks rely primarily on automated
+metrics and static analysis tools, which often fail to capture the nuances of
+user instructions and LLM outputs. To address this gap, we propose using the
+LLM-as-a-Judge methodology to evaluate the alignment of LLMs with coding
+preferences. Based on this approach, we present CodeUltraFeedback, a
+comprehensive dataset designed to facilitate the evaluation and improvement of
+LLM alignment. CodeUltraFeedback consists of 10,000 coding instructions, each
+annotated with four responses generated from a diverse pool of 14 LLMs. These
+responses are ranked based on five distinct coding preferences using GPT-3.5 as
+a judge, providing both numerical scores and detailed textual feedback. Our
+analysis of CodeUltraFeedback reveals that responses from GPT-3.5 and GPT-4 are
+generally preferred over those from open-weight LLMs, highlighting significant
+differences in alignment between closed and open-weight models. In turn, we
+explore the usage of CodeUltraFeedback as feedback data to fine-tune and align
+CodeLlama-7B-Instruct using supervised fine-tuning (SFT) and reinforcement
+learning from AI feedback (RLAIF) with direct preference optimization (DPO).
+The resulting aligned CodeLlama-7B-Instruct model outperforms larger LLMs in
+terms of alignment with coding preferences and shows improved functional
+correctness on the HumanEval+ benchmark compared to the original instruct
+model. Therefore, our contributions bridge the gap in preference tuning of LLMs
+for code and set the stage for further advancements in model alignment and
+RLAIF in automated software engineering.
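The DPO step mentioned above optimizes a simple per-pair loss over chosen/rejected responses. The function below sketches that standard loss for a single preference pair; the log-probability inputs are illustrative scalars, not values from the fine-tuned CodeLlama model.

```python
import numpy as np

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct preference optimization (DPO) loss for one preference pair,
    given log-probabilities of the chosen/rejected responses under the
    policy (pi_*) and the frozen reference model (ref_*):
        -log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)])"""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)) written stably as log(1 + exp(-margin))
    return np.log1p(np.exp(-margin))
```

The loss shrinks as the policy raises the chosen response's likelihood relative to the rejected one (measured against the reference model), which is what aligns the model to the judged preferences.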
+
+
+
+
+
+
+
+ ♻ ☆ MonoSparse-CAM: Efficient Tree Model Processing via Monotonicity and
+ Sparsity in CAMs
+
+
+ While tree-based machine learning (TBML) models exhibit superior
+performance compared to neural networks on tabular data and hold promise for
+energy-efficient acceleration using aCAM arrays, their ideal deployment on
+hardware with explicit exploitation of TBML structure and aCAM circuitry
+remains a challenging task. In this work, we present MonoSparse-CAM, a new
+CAM-based optimization technique that exploits TBML sparsity and monotonicity
+in CAM circuitry to further advance processing performance. Our results
+indicate that MonoSparse-CAM reduces energy consumption by up to 28.56x
+compared to raw processing and by 18.51x compared to state-of-the-art
+techniques, while improving the efficiency of computation by at least 1.68x.
+
+
+ Recent concept-based interpretable models have succeeded in providing
+meaningful explanations by pre-defined concept sets. However, the dependency on
+the pre-defined concepts restricts the application because of the limited
+number of concepts for explanations. This paper proposes a novel interpretable
+deep neural network called explanation bottleneck models (XBMs). XBMs generate
+a text explanation from the input without pre-defined concepts and then predict
+a final task prediction based on the generated explanation by leveraging
+pre-trained vision-language encoder-decoder models. To achieve both the target
+task performance and the explanation quality, we train XBMs through the target
+task loss with the regularization penalizing the explanation decoder via the
+distillation from the frozen pre-trained decoder. Our experiments, including a
+comparison to state-of-the-art concept bottleneck models, confirm that XBMs
+provide accurate and fluent natural language explanations without pre-defined
+concept sets. Code will be available at https://github.com/yshinya6/xbm/.
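The training recipe above — a target-task loss plus a regularizer that distills the explanation decoder toward the frozen pre-trained decoder — can be sketched as a combined objective. The KL-over-logits form below is an assumed stand-in for the paper's exact distillation penalty.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def xbm_objective(task_loss, decoder_logits, frozen_logits, lam=1.0):
    """Sketch of an XBM-style objective: target-task loss plus a
    distillation penalty (KL divergence between next-token distributions)
    that keeps the explanation decoder close to the frozen pre-trained
    decoder."""
    p = softmax(frozen_logits)     # teacher: frozen pre-trained decoder
    q = softmax(decoder_logits)    # student: current explanation decoder
    kl = float(np.sum(p * (np.log(p) - np.log(q))))
    return task_loss + lam * kl

logits = np.array([0.5, -1.0, 2.0])
```

When the decoder matches the frozen teacher the penalty vanishes, so the regularizer only pushes back when fine-tuning starts to degrade the decoder's fluency.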
+
+
+
+ comment: Accepted to AAAI 2025
+
+
+
+
+
+
+ ♻ ☆ Aurora-M: Open Source Continual Pre-training for Multilingual Language
+ and Code
+
+
+
+
+
+
+
+
+ Taishi Nakamura, Mayank Mishra, Simone Tedeschi, Yekun Chai, Jason T Stillerman, Felix Friedrich, Prateek Yadav, Tanmay Laud, Vu Minh Chien, Terry Yue Zhuo, Diganta Misra, Ben Bogin, Xuan-Son Vu, Marzena Karpinska, Arnav Varma Dantuluri, Wojciech Kusa, Tommaso Furlanello, Rio Yokota, Niklas Muennighoff, Suhas Pai, Tosin Adewumi, Veronika Laippala, Xiaozhe Yao, Adalberto Junior, Alpay Ariyak, Aleksandr Drozd, Jordan Clive, Kshitij Gupta, Liangyu Chen, Qi Sun, Ken Tsui, Noah Persaud, Nour Fahmy, Tianlong Chen, Mohit Bansal, Nicolo Monti, Tai Dang, Ziyang Luo, Tien-Tung Bui, Roberto Navigli, Virendra Mehta, Matthew Blumberg, Victor May, Huu Nguyen, Sampo Pyysalo
+
+
+ Pretrained language models are an integral part of AI applications, but their
+high computational cost for training limits accessibility. Initiatives such as
+Bloom and StarCoder aim to democratize access to pretrained models for
+collaborative community development. Despite these efforts, such models
+encounter challenges such as limited multilingual capabilities, risks of
+catastrophic forgetting during continual pretraining, and the high costs of
+training models from scratch, alongside the need to align with AI safety
+standards and regulatory frameworks.
+ This paper presents Aurora-M, a 15B parameter multilingual open-source model
+trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually
+pretrained from StarCoderPlus on 435B additional tokens, Aurora-M surpasses 2T
+tokens in total training token count. It is the first open-source multilingual
+model fine-tuned on human-reviewed safety instructions, thus aligning its
+development not only with conventional red-teaming considerations, but also
+with the specific concerns articulated in the Biden-Harris Executive Order on
+the Safe, Secure, and Trustworthy Development and Use of Artificial
+Intelligence.
+ We evaluate Aurora-M across a wide range of tasks and languages, showcasing
+its robustness against catastrophic forgetting and its superior performance in
+multilingual settings, particularly in safety evaluations. We open-source
+Aurora-M and its variants to encourage responsible open-source development of
+large language models at https://huggingface.co/aurora-m.
+
+
+
+ comment: Preprint
+
+
+
+
+
+
+ ♻ ☆ Maximal Volume Matrix Cross Approximation for Image Compression and
+ Least Squares Solution
+
+
+ We study the classic matrix cross approximation based on the maximal volume
+submatrices. Our main results consist of an improvement of the classic estimate
+for matrix cross approximation and a greedy approach for finding the maximal
+volume submatrices. More precisely, we present a new proof of the classic
+estimate of the inequality with an improved constant. Also, we present a family
+of greedy maximal volume algorithms to improve the computational efficiency of
+matrix cross approximation. The proposed algorithms are shown to have
+theoretical guarantees of convergence. Finally, we present two applications:
+image compression and the least squares approximation of continuous functions.
+Our numerical results at the end of the paper demonstrate the effective
+performance of our approach.
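The greedy maximal-volume idea can be illustrated with a tiny sketch (my own toy using complete-pivoting cross elimination; `greedy_cross_approximation` and the example matrix are made up, not the paper's algorithm):

```python
import numpy as np

def greedy_cross_approximation(A, rank):
    """Greedily pick pivots with the largest absolute residual entry
    (a simple proxy for maximal volume) and build a CUR-style skeleton."""
    R = A.astype(float).copy()           # residual matrix
    rows, cols = [], []
    for _ in range(rank):
        i, j = np.unravel_index(np.argmax(np.abs(R)), R.shape)
        if R[i, j] == 0:
            break                        # residual exhausted early
        rows.append(i)
        cols.append(j)
        # rank-1 cross elimination through pivot (i, j)
        R -= np.outer(R[:, j], R[i, :]) / R[i, j]
    C = A[:, cols]
    U = np.linalg.pinv(A[np.ix_(rows, cols)])
    return C @ U @ A[rows, :]

# an exactly rank-2 matrix is recovered exactly by a rank-2 cross skeleton
A = np.outer([1., 2., 3.], [4., 5., 6.]) + np.outer([1., 0., 1.], [1., 2., 1.])
approx = greedy_cross_approximation(A, 2)
```

For an exact rank-r matrix, the skeleton C·U·R reproduces A whenever the r×r intersection submatrix is nonsingular, which the pivoting ensures here.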
+
+
+
+
+
+
+
+ ♻ ☆ PreNeT: Leveraging Computational Features to Predict Deep Neural Network
+ Training Time
+
+
+ Training deep learning models, particularly Transformer-based architectures
+such as Large Language Models (LLMs), demands substantial computational
+resources and extended training periods. While optimal configuration and
+infrastructure selection can significantly reduce associated costs, this
+optimization requires preliminary analysis tools. This paper introduces PreNeT,
+a novel predictive framework designed to address this optimization challenge.
+PreNeT facilitates training optimization by integrating comprehensive
+computational metrics, including layer-specific parameters, arithmetic
+operations and memory utilization. A key feature of PreNeT is its capacity to
+accurately predict training duration on previously unexamined hardware
+infrastructures, including novel accelerator architectures. This framework
+employs a sophisticated approach to capture and analyze the distinct
+characteristics of various neural network layers, thereby enhancing existing
+prediction methodologies. Through proactive implementation of PreNeT,
+researchers and practitioners can determine optimal configurations, parameter
+settings, and hardware specifications to maximize cost-efficiency and minimize
+training duration. Experimental results demonstrate that PreNeT achieves up to
+72% improvement in prediction accuracy compared to contemporary
+state-of-the-art frameworks.
+
+
+
+ comment: 11 pages, Conference
+
+
+
+
+
+
+ ♻ ☆ Pixel-Wise Recognition for Holistic Surgical Scene Understanding MICCAI 2022
+
+
+ This paper presents the Holistic and Multi-Granular Surgical Scene
+Understanding of Prostatectomies (GraSP) dataset, a curated benchmark that
+models surgical scene understanding as a hierarchy of complementary tasks with
+varying levels of granularity. Our approach encompasses long-term tasks, such
+as surgical phase and step recognition, and short-term tasks, including
+surgical instrument segmentation and atomic visual actions detection. To
+exploit our proposed benchmark, we introduce the Transformers for Actions,
+Phases, Steps, and Instrument Segmentation (TAPIS) model, a general
+architecture that combines a global video feature extractor with localized
+region proposals from an instrument segmentation model to tackle the
+multi-granularity of our benchmark. Through extensive experimentation on our
+benchmark and alternative ones, we demonstrate TAPIS's versatility and

+state-of-the-art performance across different tasks. This work represents a
+foundational step forward in Endoscopic Vision, offering a novel framework for
+future research towards holistic surgical scene understanding.
+
+
+
+ comment: Preprint submitted to Medical Image Analysis. Official extension of
+ previous MICCAI 2022
+ (https://link.springer.com/chapter/10.1007/978-3-031-16449-1_42) and ISBI
+ 2023 (https://ieeexplore.ieee.org/document/10230819) orals. Data and codes
+ are available at https://github.com/BCV-Uniandes/GraSP
+
+
+
+
+
+
+ ♻ ☆ Towards General Industrial Intelligence: A Survey of Continual Large
+ Models in Industrial IoT
+
+
+
+
+
+
+
+
+ Jiao Chen, Jiayi He, Fangfang Chen, Zuohong Lv, Jianhua Tang, Weihua Li, Zuozhu Liu, Howard H. Yang, Guangjie Han
+
+
+ Industrial AI is transitioning from traditional deep learning models to
+large-scale transformer-based architectures, with the Industrial Internet of
+Things (IIoT) playing a pivotal role. IIoT evolves from a simple data pipeline
+to an intelligent infrastructure, enabling and enhancing these advanced AI
+systems. This survey explores the integration of IIoT with large models (LMs)
+and their potential applications in industrial environments. We focus on four
+primary types of industrial LMs: language-based, vision-based, time-series, and
+multimodal models. The lifecycle of LMs is segmented into four critical phases:
+data foundation, model training, model connectivity, and continuous evolution.
+First, we analyze how IIoT provides abundant and diverse data resources,
+supporting the training and fine-tuning of LMs. Second, we discuss how IIoT
+offers an efficient training infrastructure in low-latency and
+bandwidth-optimized environments. Third, we highlight the deployment advantages
+of LMs within IIoT, emphasizing IIoT's role as a connectivity nexus fostering
+emergent intelligence through modular design, dynamic routing, and model
+merging to enhance system scalability and adaptability. Finally, we demonstrate
+how IIoT supports continual learning mechanisms, enabling LMs to adapt to
+dynamic industrial conditions and ensure long-term effectiveness. This paper
+underscores IIoT's critical role in the evolution of industrial intelligence
+with large models, offering a theoretical framework and actionable insights for
+future research.
+
+
+
+
+
+
+
+ ♻ ☆ PyraNet: A Large Scale Hierarchical Verilog Dataset
+
+
+ Recently, there has been a growing interest in leveraging Large Language
+Models for Verilog code generation. However, the current quality of the
+generated Verilog code remains suboptimal. This is largely due to the absence
+of well-defined, well-organized datasets with high-quality samples, as well as
+a lack of innovative fine-tuning methods and models specifically trained on
+Verilog. In this paper, we introduce a novel open-source dataset and a
+corresponding fine-tuning technique, which utilizes a multi-layered structure
+that we refer to as PyraNet. Our experiments demonstrate that employing the
+proposed dataset and fine-tuning approach leads to a more accurate fine-tuned
+model, producing syntactically and functionally correct Verilog code. The
+evaluation results show improvements of up to $32.6\%$ over the CodeLlama-7B
+baseline model and up to $16.7\%$ over state-of-the-art models on the
+VerilogEval evaluation platform.
+
+
+
+
+
+
+
+ ♻ ☆ Online High-Frequency Trading Stock Forecasting with Automated Feature
+ Clustering and Radial Basis Function Neural Networks
+
+
+ This study presents an autonomous experimental machine learning protocol for
+high-frequency trading (HFT) stock price forecasting that involves a dual
+competitive feature importance mechanism and clustering via shallow neural
+network topology for fast training. By incorporating the k-means algorithm into
+the radial basis function neural network (RBFNN), the proposed method addresses
+the challenges of manual clustering and the reliance on potentially
+uninformative features. More specifically, our approach involves a dual
+competitive mechanism for feature importance, combining the mean-decrease
+impurity (MDI) method and a gradient descent (GD) based feature importance
+mechanism. This approach, tested on HFT Level 1 order book data for 20 S&P 500
+stocks, enhances the forecasting ability of the RBFNN regressor. Our findings
+suggest that an autonomous approach to feature selection and clustering is
+crucial, as each stock requires a different input feature space. Overall, by
+automating the feature selection and clustering processes, we remove the need
+for manual topological grid search and provide a more efficient way to predict
+the LOB's mid-price.
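The k-means-into-RBFNN recipe can be sketched on synthetic data (an illustrative toy, not the paper's HFT pipeline; the function names, `gamma` width, and the sine-regression target are all my assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=50):
    """Plain k-means used to place RBF centers automatically,
    removing the manual clustering step."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return centers

def rbf_features(X, centers, gamma=1.0):
    """Gaussian radial basis activations around the learned centers."""
    d2 = ((X[:, None] - centers[None]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# toy regression: fit y = sin(x) with k-means-placed RBF centers
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])
centers = kmeans(X, 10)
Phi = rbf_features(X, centers)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # linear readout
mse = float(((Phi @ w - y) ** 2).mean())
```

The linear readout is solved in closed form here; in an online HFT setting the same layer would instead be updated incrementally.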
+
+
+
+ comment: This paper was presented at the Economics of Financial Technology
+ Conference, June 2023, in Edinburgh, UK
+
+
+
+
+
+
+
+ X. Feng, D. Zhang, S. Hu, X. Li, M. Wu, J. Zhang, X. Chen, K. Huang
+
+
+ Vision-Language Tracking (VLT) aims to localize a target in video sequences
+using a visual template and language description. While textual cues enhance
+tracking potential, current datasets typically contain much more image data
+than text, limiting the ability of VLT methods to align the two modalities
+effectively. To address this imbalance, we propose a novel plug-and-play method
+named CTVLT that leverages the strong text-image alignment capabilities of
+foundation grounding models. CTVLT converts textual cues into interpretable
+visual heatmaps, which are easier for trackers to process. Specifically, we
+design a textual cue mapping module that transforms textual cues into target
+distribution heatmaps, visually representing the location described by the
+text. Additionally, the heatmap guidance module fuses these heatmaps with the
+search image to guide tracking more effectively. Extensive experiments on
+mainstream benchmarks demonstrate the effectiveness of our approach, achieving
+state-of-the-art performance and validating the utility of our method for
+enhanced VLT.
+
+
+
+ comment: Accepted by ICASSP '25 ! Code: https://github.com/XiaokunFeng/CTVLT
+
+ Recently, deep learning based methods have revolutionized remote sensing
+image segmentation. However, these methods usually rely on a pre-defined
+semantic class set, thus needing additional image annotation and model training
+when adapting to new classes. More importantly, they are unable to segment
+arbitrary semantic classes. In this work, we introduce Open-Vocabulary Remote
+Sensing Image Semantic Segmentation (OVRSISS), which aims to segment arbitrary
+semantic classes in remote sensing images. To address the lack of OVRSISS
+datasets, we develop LandDiscover50K, a comprehensive dataset of 51,846 images
+covering 40 diverse semantic classes. In addition, we propose a novel framework
+named GSNet that integrates domain priors from special remote sensing models
+and versatile capabilities of general vision-language models. Technically,
+GSNet consists of a Dual-Stream Image Encoder (DSIE), a Query-Guided Feature
+Fusion (QGFF), and a Residual Information Preservation Decoder (RIPD). DSIE
+first captures comprehensive features from both special models and general
+models in dual streams. Then, with the guidance of variable vocabularies, QGFF
+integrates specialist and generalist features, enabling them to complement each
+other. Finally, RIPD is proposed to aggregate multi-source features for more
+accurate mask predictions. Experiments show that our method outperforms other
+methods by a large margin, and our proposed LandDiscover50K improves the
+performance of OVRSISS methods. The proposed dataset and method will be made
+publicly available at https://github.com/yecy749/GSNet.
+
+
+
+ comment: Accepted by AAAI2025
+
+
+
+
+
+
+ ☆ Adrenaline: Adaptive Rendering Optimization System for Scalable Cloud
+ Gaming
+
+
+
+
+
+
+
+
+ Jin Heo, Ketan Bhardwaj, Ada Gavrilovska
+
+
+ Cloud gaming requires a low-latency network connection, making it a prime
+candidate for being hosted at the network edge. However, an edge server is
+provisioned with a fixed compute capacity, causing an issue for multi-user
+service and resulting in users having to wait before they can play when the
+server is occupied. In this work, we present a new insight that when a user's
+network condition results in use of lossy compression, the end-to-end visual
+quality more degrades for frames of high rendering quality, wasting the
+server's computing resources. We leverage this observation to build Adrenaline,
+a new system which adaptively optimizes the game rendering qualities by
+considering the user-side visual quality and server-side rendering cost. The
+rendering quality optimization of Adrenaline is done via a scoring mechanism
+quantifying the effectiveness of server resource usage on the user-side gaming
+quality. Our open-sourced implementation of Adrenaline demonstrates easy
+integration with modern game engines. In our evaluations, Adrenaline achieves
+up to 24% higher service quality and 2x more users served with the same
+resource footprint compared to other baselines.
+
+
+
+ comment: 15 pages, 13 figures, 5 tables
+
+
+
+
+
+
+ ♻ ☆ Language-Guided Diffusion Model for Visual Grounding
+
+
+ Visual grounding (VG) tasks involve explicit cross-modal alignment, as
+semantically corresponding image regions are to be located for the language
+phrases provided. Existing approaches complete such visual-text reasoning in a
+single-step manner. Their performance hinges on large-scale anchors and
+over-designed multi-modal fusion modules built on human priors, leading to
+complicated frameworks that may be difficult to train and prone to overfitting
+specific scenarios. Even worse, such once-for-all reasoning mechanisms are
+incapable of refining boxes continuously to enhance query-region matching. In
+contrast, in this paper, we formulate an iterative reasoning process by
+denoising diffusion modeling. Specifically, we propose a language-guided
+diffusion framework for visual grounding, LG-DVG, which trains the model to
+progressively reason queried object boxes by denoising a set of noisy boxes
+with the language guide. To achieve this, LG-DVG gradually perturbs
+query-aligned ground truth boxes to noisy ones and reverses this process step
+by step, conditional on query semantics. Extensive experiments for our proposed
+framework on five widely used datasets validate the superior performance of
+solving visual grounding, a cross-modal alignment task, in a generative way.
+The source codes are available at
+https://github.com/iQua/vgbase/tree/main/examples/DiffusionVG.
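The forward (noising) half of such a box-diffusion process can be sketched as follows (a simplified linear-interpolation schedule of my own choosing; the paper's exact noise schedule and conditioning may differ):

```python
import numpy as np

def perturb_boxes(boxes, t, T, noise_scale=0.1, rng=None):
    """Forward direction of a diffusion process over box coordinates:
    linearly interpolate ground-truth boxes toward Gaussian noise as
    t -> T. A reverse model would be trained to undo this step by step,
    conditioned on the language query."""
    if rng is None:
        rng = np.random.default_rng(0)
    alpha = 1.0 - t / T               # fraction of signal kept at step t
    noise = rng.normal(scale=noise_scale, size=np.shape(boxes))
    return alpha * np.asarray(boxes, dtype=float) + (1.0 - alpha) * noise

gt = np.array([[0.2, 0.2, 0.6, 0.6]])   # made-up (x1, y1, x2, y2) box
half_noised = perturb_boxes(gt, t=5, T=10)
```

At t=0 the boxes are untouched; at t=T only noise remains, which is where reverse inference starts from.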
+
+
+
+ comment: 20 pages, 16 figures
+
+
+
+
+
+
+ ♻ ☆ Official-NV: An LLM-Generated News Video Dataset for Multimodal Fake
+ News Detection
+
+
+ News media, especially video news media, have penetrated into every aspect of
+daily life, which also brings the risk of fake news. Therefore, multimodal fake
+news detection has recently garnered increased attention. However, the existing
+datasets consist of user-uploaded videos and contain an excess of superfluous
+data, which introduces noise into the model training process. To address this
+issue, we construct a dataset named Official-NV, comprising officially
+published news videos. The crawled official videos are augmented through
+LLM-based generation and manual verification,
+thereby expanding the dataset. We also propose a new baseline model called
+OFNVD, which captures key information from multimodal features through a GLU
+attention mechanism and performs feature enhancement and modal aggregation via
+a cross-modal Transformer. Benchmarking the dataset and baselines demonstrates
+the effectiveness of our model in multimodal news detection.
+
+
+
+
+
+
+
+ ♻ ☆ Reply with Sticker: New Dataset and Model for Sticker Retrieval
+
+
+ Using stickers in online chatting is very prevalent on social media
+platforms, where the stickers used in the conversation can express someone's
+intention/emotion/attitude in a vivid, tactful, and intuitive way. Existing
+sticker retrieval research typically retrieves stickers based on context and
+the current utterance delivered by the user. That is, the stickers serve as a
+supplement to the current utterance. In real-world scenarios, however, stickers
+are often used to express what we want to say on their own rather than merely
+supplementing our words. Therefore, in this paper, we create a new dataset
+for sticker retrieval in conversation, called \textbf{StickerInt}, where
+stickers are used to reply to previous conversations or supplement our
+words\footnote{We believe that the release of this dataset will provide a more
+complete paradigm than existing work for the research of sticker retrieval in
+the open-domain online conversation.}. Based on the created dataset, we present
+a simple yet effective framework for sticker retrieval in conversation based on
+the learning of intention and the cross-modal relationships between
+conversation context and stickers, coined as \textbf{Int-RA}. Specifically, we
+first devise a knowledge-enhanced intention predictor to introduce the
+intention information into the conversation representations. Subsequently, a
+relation-aware sticker selector is devised to retrieve the response sticker via
+cross-modal relationships. Extensive experiments on the created dataset show
+that the proposed model achieves state-of-the-art performance in sticker
+retrieval\footnote{The dataset and source code of this work are released at
+\url{https://github.com/HITSZ-HLT/Int-RA}.}.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Computation and Language 39
+
+
+
+
+
+ ☆ Dynamic Skill Adaptation for Large Language Models
+
+
+ We present Dynamic Skill Adaptation (DSA), an adaptive and dynamic framework
+to adapt novel and complex skills to Large Language Models (LLMs). Compared
+with previous work which learns from human-curated and static data in random
+orders, we propose to first automatically generate and organize the training
+data by mimicking the learning pathways of humans and then dynamically tailor
+the training data based on the training dynamics. Specifically, inspired by the
+learning structures and teaching strategies in the human education system, we
+first construct a skill graph by decomposing complex skills into sub-skills and
+arranging them based on their dependencies in human syllabi. For every skill,
+we utilize LLMs to generate both textbook-like data, which contains detailed
+descriptions of skills for pre-training, and exercise-like data, which targets
+explicitly applying the skills to solve problems for instruction-tuning.
+Furthermore, during instruction-tuning, we dynamically update the training
+data by down-weighting easy-to-learn examples, generating more complex
+examples, and filtering out data with errors. Experiments on large language models such as
+LLAMA and Mistral demonstrate the effectiveness of our proposed methods in
+adapting math reasoning skills and social study skills.
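Ordering a skill graph so that sub-skills come before the skills that depend on them is a topological sort; a minimal sketch (the skill names and dependency map are hypothetical, not from the paper):

```python
from collections import deque

def skill_order(dependencies):
    """Topologically order skills so every sub-skill is learned before the
    skills that depend on it (Kahn's algorithm on a dependency dict)."""
    indeg = {}
    for skill, deps in dependencies.items():
        indeg.setdefault(skill, 0)
        for d in deps:
            indeg.setdefault(d, 0)
    for skill, deps in dependencies.items():
        indeg[skill] = len(deps)
    ready = deque(sorted(s for s, n in indeg.items() if n == 0))
    order = []
    while ready:
        s = ready.popleft()
        order.append(s)
        for t, deps in dependencies.items():
            if s in deps:
                indeg[t] -= 1
                if indeg[t] == 0:
                    ready.append(t)
    return order

# hypothetical math-reasoning sub-skills
deps = {"algebra": ["arithmetic"], "calculus": ["algebra"], "arithmetic": []}
order = skill_order(deps)
```

Training data generated per skill would then be scheduled in this order.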
+
+
+
+
+
+
+
+ ☆ ETTA: Elucidating the Design Space of Text-to-Audio Models
+
+
+
+
+
+
+
+
+ Sang-gil Lee, Zhifeng Kong, Arushi Goel, Sungwon Kim, Rafael Valle, Bryan Catanzaro
+
+
+ Recent years have seen significant progress in Text-To-Audio (TTA) synthesis,
+enabling users to enrich their creative workflows with synthetic audio
+generated from natural language prompts. Despite this progress, the effects of
+data, model architecture, training objective functions, and sampling strategies
+on target benchmarks are not well understood. With the purpose of providing a
+holistic understanding of the design space of TTA models, we set up a
+large-scale empirical experiment focused on diffusion and flow matching models.
+Our contributions include: 1) AF-Synthetic, a large dataset of high quality
+synthetic captions obtained from an audio understanding model; 2) a systematic
+comparison of different architectural, training, and inference design choices
+for TTA models; 3) an analysis of sampling methods and their Pareto curves with
+respect to generation quality and inference speed. We leverage the knowledge
+obtained from this extensive analysis to propose our best model dubbed
+Elucidated Text-To-Audio (ETTA). When evaluated on AudioCaps and MusicCaps,
+ETTA provides improvements over the baselines trained on publicly available
+data, while being competitive with models trained on proprietary data. Finally,
+we show ETTA's improved ability to generate creative audio following complex
+and imaginative captions -- a task that is more challenging than current
+benchmarks.
+
+
+
+
+
+
+
+ ☆ On the Expressiveness and Length Generalization of Selective State-Space
+ Models on Regular Languages AAAI 2025
+
+
+
+
+
+
+
+
+ Aleksandar Terzić, Michael Hersche, Giacomo Camposampiero, Thomas Hofmann, Abu Sebastian, Abbas Rahimi
+
+
+ Selective state-space models (SSMs) are an emerging alternative to the
+Transformer, offering the unique advantage of parallel training and sequential
+inference. Although these models have shown promising performance on a variety
+of tasks, their formal expressiveness and length generalization properties
+remain underexplored. In this work, we provide insight into the workings of
+selective SSMs by analyzing their expressiveness and length generalization
+performance on regular language tasks, i.e., finite-state automaton (FSA)
+emulation. We address certain limitations of modern SSM-based architectures by
+introducing the Selective Dense State-Space Model (SD-SSM), the first selective
+SSM that exhibits perfect length generalization on a set of various regular
+language tasks using a single layer. It utilizes a dictionary of dense
+transition matrices, a softmax selection mechanism that creates a convex
+combination of dictionary matrices at each time step, and a readout consisting
+of layer normalization followed by a linear map. We then proceed to evaluate
+variants of diagonal selective SSMs by considering their empirical performance
+on commutative and non-commutative automata. We explain the experimental
+results with theoretical considerations. Our code is available at
+https://github.com/IBM/selective-dense-state-space-model.
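The core SD-SSM recurrence described above (softmax selection producing a convex combination of dense dictionary matrices at each step) can be sketched as follows; this simplified version omits the layer-norm-plus-linear readout, and all shapes are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def sd_ssm_scan(inputs, A_dict, W_sel, B):
    """Sequential scan of a selective dense SSM: at each step the input
    selects, via softmax, a convex combination of dense transition
    matrices A_k from a dictionary, then updates the hidden state."""
    d = A_dict.shape[1]
    h = np.zeros(d)
    states = []
    for x in inputs:
        alpha = softmax(W_sel @ x)                 # weights over dictionary
        A_t = np.tensordot(alpha, A_dict, axes=1)  # convex combination (d, d)
        h = A_t @ h + B @ x
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
K, d, m, T = 4, 8, 3, 5        # dictionary size, state dim, input dim, steps
out = sd_ssm_scan(rng.normal(size=(T, m)),
                  rng.normal(size=(K, d, d)) * 0.1,
                  rng.normal(size=(K, m)),
                  rng.normal(size=(d, m)))
```

Because each A_t is dense rather than diagonal, the state can emulate arbitrary finite-state transitions, which is the expressiveness the abstract contrasts with diagonal selective SSMs.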
+
+
+
+ comment: 13 pages, 7 figures, to be published in AAAI 2025
+
+
+
+
+
+
+ ☆ Semi-Supervised Learning from Small Annotated Data and Large Unlabeled
+ Data for Fine-grained PICO Entity Recognition
+
+
+ Objective: Extracting PICO elements -- Participants, Intervention,
+Comparison, and Outcomes -- from clinical trial literature is essential for
+clinical evidence retrieval, appraisal, and synthesis. Existing approaches do
+not distinguish the attributes of PICO entities. This study aims to develop a
+named entity recognition (NER) model to extract PICO entities with fine
+granularities.
+ Materials and Methods: Using a corpus of 2,511 abstracts with PICO mentions
+from 4 public datasets, we developed a semi-supervised method to facilitate the
+training of an NER model, FinePICO, by combining limited annotated data of PICO
+entities and abundant unlabeled data. For evaluation, we divided the entire
+dataset into two subsets: a smaller group with annotations and a larger group
+without annotations. We then established the theoretical lower and upper
+performance bounds based on the performance of supervised learning models
+trained solely on the small, annotated subset and on the entire set with
+complete annotations, respectively. Finally, we evaluated FinePICO on both the
+smaller annotated subset and the larger, initially unannotated subset. We
+measured the performance of FinePICO using precision, recall, and F1.
+ Results: Our method achieved precision/recall/F1 of 0.567/0.636/0.60,
+respectively, using a small set of annotated samples, outperforming the
+baseline model (F1: 0.437) by more than 16\%. The model demonstrates
+generalizability to a different PICO framework and to another corpus, which
+consistently outperforms the benchmark in diverse experimental settings
+(p-value \textless0.001).
+ Conclusion: This study contributes a generalizable and effective
+semi-supervised approach to named entity recognition leveraging large unlabeled
+data together with small, annotated data. It also initially supports
+fine-grained PICO extraction.
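The generic semi-supervised recipe behind this kind of approach is a self-training loop; a minimal sketch (my own generic version with a toy majority-class "model", not the paper's method):

```python
from collections import Counter

def self_train(labeled, unlabeled, train_fn, predict_fn,
               threshold=0.6, rounds=3):
    """Self-training: fit on labeled data, pseudo-label unlabeled examples
    whose prediction confidence clears a threshold, add them to the
    training set, and repeat."""
    data = list(labeled)
    pool = list(unlabeled)
    model = train_fn(data)
    for _ in range(rounds):
        keep = []
        for x in pool:
            label, conf = predict_fn(model, x)
            if conf >= threshold:
                data.append((x, label))   # promote confident pseudo-label
            else:
                keep.append(x)
        pool = keep
        model = train_fn(data)
    return model, data

# toy stand-ins: a majority-class classifier (purely illustrative)
def train_fn(data):
    return Counter(label for _, label in data)

def predict_fn(model, x):
    label, count = model.most_common(1)[0]
    return label, count / sum(model.values())

model, data = self_train(
    labeled=[(1, "P"), (2, "P"), (3, "O")],
    unlabeled=[10, 11],
    train_fn=train_fn, predict_fn=predict_fn)
```

In the paper's setting, `train_fn`/`predict_fn` would wrap an NER model emitting per-entity confidences rather than this toy counter.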
+
+
+
+
+
+
+
+
+ Taewhan Kim, Soeun Lee, Si-Woo Kim, Dong-Jin Kim
+
+
+ Recent lightweight image captioning models using retrieved data mainly focus
+on text prompts. However, previous works use the retrieved text only as a text
+prompt, while the visual information relies solely on the CLIP visual
+embedding. As a result, the image descriptions carried by the prompt are not
+sufficiently reflected in the visual embedding space. To tackle this issue, we
+propose ViPCap, a novel
+retrieval text-based visual prompt for lightweight image captioning. ViPCap
+leverages the retrieved text with image information as visual prompts to
+enhance the ability of the model to capture relevant visual information. By
+mapping text prompts into the CLIP space and generating multiple randomized
+Gaussian distributions, our method leverages sampling to explore randomly
+augmented distributions and effectively retrieves the semantic features that
+contain image information. These retrieved features are integrated into the
+image and designated as the visual prompt, leading to performance improvements
+on the datasets such as COCO, Flickr30k, and NoCaps. Experimental results
+demonstrate that ViPCap significantly outperforms prior lightweight captioning
+models in efficiency and effectiveness, demonstrating the potential for a
+plug-and-play solution.
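The randomized-Gaussian sampling step described above can be sketched in isolation (my own simplification: perturbing a text embedding with Gaussian noise and renormalizing, assuming unit-norm CLIP-style embeddings; names and `sigma` are made up):

```python
import numpy as np

def sample_visual_prompts(text_emb, n_samples=4, sigma=0.1, rng=None):
    """Draw randomized Gaussian perturbations around a text embedding to
    explore nearby points in the embedding space; the resulting samples
    could then serve as candidate visual prompts."""
    if rng is None:
        rng = np.random.default_rng(0)
    samples = text_emb + rng.normal(scale=sigma,
                                    size=(n_samples, text_emb.size))
    # re-project to the unit sphere, where CLIP similarities are computed
    return samples / np.linalg.norm(samples, axis=1, keepdims=True)

emb = np.ones(8) / np.sqrt(8.0)          # stand-in unit text embedding
prompts = sample_visual_prompts(emb)
```

In the full method these samples would be matched against image features to keep the ones carrying the most image-relevant semantics.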
+
+
+
+
+
+
+
+ ☆ Optimizing Multi-Stage Language Models for Effective Text Retrieval
+
+
+
+
+
+
+
+
+ Quang Hoang Trung, Le Trung Hoang, Nguyen Van Hoang Phuc
+
+
+ Efficient text retrieval is critical for applications such as legal document
+analysis, particularly in specialized contexts like Japanese legal systems.
+Existing retrieval methods often underperform in such domain-specific
+scenarios, necessitating tailored approaches. In this paper, we introduce a
+novel two-phase text retrieval pipeline optimized for Japanese legal datasets.
+Our method leverages advanced language models to achieve state-of-the-art
+performance, significantly improving retrieval efficiency and accuracy. To
+further enhance robustness and adaptability, we incorporate an ensemble model
+that integrates multiple retrieval strategies, resulting in superior outcomes
+across diverse tasks. Extensive experiments validate the effectiveness of our
+approach, demonstrating strong performance on both Japanese legal datasets and
+widely recognized benchmarks like MS-MARCO. Our work establishes new standards
+for text retrieval in domain-specific and general contexts, providing a
+comprehensive solution for addressing complex queries in legal and multilingual
+environments.
+
+
+
+
+
+
+
+ ☆ MEDEC: A Benchmark for Medical Error Detection and Correction in
+ Clinical Notes
+
+
+
+
+
+
+
+
+ Asma Ben Abacha, Wen-wai Yim, Yujuan Fu, Zhaoyi Sun, Meliha Yetisgen, Fei Xia, Thomas Lin
+
+
+ Several studies showed that Large Language Models (LLMs) can answer medical
+questions correctly, even outperforming the average human score in some medical
+exams. However, to our knowledge, no study has been conducted to assess the
+ability of language models to validate existing or generated medical text for
+correctness and consistency. In this paper, we introduce MEDEC
+(https://github.com/abachaa/MEDEC), the first publicly available benchmark for
+medical error detection and correction in clinical notes, covering five types
+of errors (Diagnosis, Management, Treatment, Pharmacotherapy, and Causal
+Organism). MEDEC consists of 3,848 clinical texts, including 488 clinical notes
+from three US hospital systems that were not previously seen by any LLM. The
+dataset has been used for the MEDIQA-CORR shared task to evaluate seventeen
+participating systems [Ben Abacha et al., 2024]. In this paper, we describe the
+data creation methods and we evaluate recent LLMs (e.g., o1-preview, GPT-4,
+Claude 3.5 Sonnet, and Gemini 2.0 Flash) for the tasks of detecting and
+correcting medical errors requiring both medical knowledge and reasoning
+capabilities. We also conducted a comparative study where two medical doctors
+performed the same task on the MEDEC test set. The results showed that MEDEC is
+a sufficiently challenging benchmark to assess the ability of models to
+validate existing or generated notes and to correct medical errors. We also
+found that although recent LLMs have a good performance in error detection and
+correction, they are still outperformed by medical doctors in these tasks. We
+discuss the potential factors behind this gap, the insights from our
+experiments, the limitations of current evaluation metrics, and share potential
+pointers for future research.
+
+
+ We propose novel attention architectures, Multi-matrix Factorization
+Attention (MFA) and MFA-Key-Reuse (MFA-KR). Existing variants for standard
+Multi-Head Attention (MHA), including SOTA methods like MLA, fail to maintain
+as strong performance under stringent Key-Value cache (KV cache) constraints.
+MFA enhances model capacity by efficiently scaling up both the number and
+dimension of attention heads through low-rank matrix factorization in the
+Query-Key (QK) circuit. Extending MFA, MFA-KR further reduces memory
+requirements by repurposing the key cache as value through value projection
+re-parameterization. MFA's design enables strong model capacity when working
+under a tight KV cache budget, while MFA-KR suits even harsher KV cache limits
+with a minor performance trade-off. Notably, in our extensive and
+large-scale experiments, the proposed architecture outperforms MLA and performs
+comparably to MHA, while reducing KV cache usage by up to 56% and 93.7%,
+respectively.
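The general idea of cheapening the QK circuit through low-rank factorization can be sketched for a single head (an illustration of the principle, not the paper's exact MFA design; all names and shapes are assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def low_rank_attention(X, Wq_down, Wq_up, Wk_down, Wk_up, Wv):
    """One attention head whose query/key projections pass through a
    rank-r bottleneck (Wq_down: d_model x r, Wq_up: r x d_head), so the
    QK circuit costs d_model*r + r*d_head parameters instead of
    d_model*d_head, letting more/larger heads fit the same budget."""
    Q = X @ Wq_down @ Wq_up
    K = X @ Wk_down @ Wk_up
    V = X @ Wv
    scores = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return scores @ V

rng = np.random.default_rng(1)
T, d_model, r, d_head = 6, 16, 4, 8
X = rng.normal(size=(T, d_model))
out = low_rank_attention(
    X,
    rng.normal(size=(d_model, r)), rng.normal(size=(r, d_head)),
    rng.normal(size=(d_model, r)), rng.normal(size=(r, d_head)),
    rng.normal(size=(d_model, d_head)),
)
```

The key-reuse variant would additionally drop the separate value cache by re-projecting the cached keys, which this single-head sketch does not attempt.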
+
+
+ With the rapid development of multimodal learning, the image-text matching
+task, as a bridge connecting vision and language, has become increasingly
+important. Based on existing research, this study proposes an innovative visual
+semantic embedding model, Multi-Headed Consensus-Aware Visual-Semantic
+Embedding (MH-CVSE). This model introduces a multi-head self-attention
+mechanism based on the consensus-aware visual semantic embedding model (CVSE)
+to capture information in multiple subspaces in parallel, significantly
+enhancing the model's ability to understand and represent the complex
+relationship between images and texts. In addition, we adopt a parameterized
+feature fusion strategy to flexibly integrate feature information at different
+levels, further improving the model's expressive power. In terms of loss
+function design, the MH-CVSE model adopts a dynamic weight adjustment strategy
+to dynamically adjust the weight according to the loss value itself, so that
+the model can better balance the contribution of different loss terms during
+training. At the same time, we introduce a cosine annealing learning rate
+strategy to help the model converge more stably in the later stages of
+training. Extensive experimental verification on the Flickr30k dataset shows
+that the MH-CVSE model achieves better performance than previous methods in
+both bidirectional image and text retrieval tasks, fully demonstrating its
+effectiveness and superiority.
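The cosine annealing schedule mentioned above can be written down in a few lines; the peak/floor learning rates and horizon here are illustrative values, not the paper's settings:

```python
import math

# Toy cosine annealing schedule: decays smoothly from eta_max to eta_min
# over t_max steps, flattening out near the end of training.
def cosine_annealing(step, t_max, eta_max=1e-3, eta_min=1e-5):
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * step / t_max))

lrs = [cosine_annealing(s, t_max=100) for s in range(101)]
print(lrs[0], lrs[-1])
```

The schedule starts at `eta_max`, ends at `eta_min`, and its slope shrinks toward the end, which is what helps the model converge more stably in late training.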
+
+
+
+
+
+
+
+ ☆ Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal
+ Video-Text Retrieval
+
+
+ Cross-modal (e.g. image-text, video-text) retrieval is an important task in
+information retrieval and multimodal vision-language understanding field.
+Temporal understanding makes video-text retrieval more challenging than
+image-text retrieval. However, we find that the widely used video-text
+benchmarks fall short in comprehensively assessing models' abilities,
+especially in temporal understanding, so that large-scale image-text
+pre-trained models can already achieve zero-shot performance comparable to
+video-text pre-trained models. In this paper, we introduce RTime, a novel
+temporal-emphasized video-text retrieval dataset. We first obtain videos of
+actions or events with significant temporality, and then reverse these videos
+to create harder negative samples. We then recruit annotators to judge the
+significance and reversibility of candidate videos, and write captions for
+qualified videos. We further adopt GPT-4 to extend more captions based on
+human-written captions. Our RTime dataset currently consists of 21k videos with
+10 captions per video, totalling about 122 hours. Based on RTime, we propose
+three retrieval benchmark tasks: RTime-Origin, RTime-Hard, and RTime-Binary. We
+further enhance the use of harder-negatives in model training, and benchmark a
+variety of video-text models on RTime. Extensive experiment analysis proves
+that RTime indeed poses new and higher challenges to video-text retrieval. We
+release our RTime
+dataset\footnote{\url{https://github.com/qyr0403/Reversed-in-Time}} to further
+advance video-text retrieval and multimodal understanding research.
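The hard-negative construction described above is conceptually simple: reversing a video's frame order yields a clip with identical content but contradictory temporal direction. A minimal sketch, with strings standing in for frames:

```python
# Reversed-video hard negative: same frames, opposite temporal order, so a
# caption like "opens the door then sits down" no longer matches.
def make_hard_negative(frames):
    return list(reversed(frames))

video = ["open_door", "enter_room", "sit_down"]
negative = make_hard_negative(video)
print(negative)
```

A model that ignores temporal order scores the original and the reversed clip identically, which is exactly the failure mode RTime is built to expose.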
+
+
+
+
+
+
+
+
+ Simona Frenda, Andrea Piergentili, Beatrice Savoldi, Marco Madeddu, Martina Rosola, Silvia Casola, Chiara Ferrando, Viviana Patti, Matteo Negri, Luisa Bentivogli
+
+
+ Gender-fair language aims at promoting gender equality by using terms and
+expressions that include all identities and avoid reinforcing gender
+stereotypes. Implementing gender-fair strategies is particularly challenging in
+heavily gender-marked languages, such as Italian. To address this, the
+Gender-Fair Generation challenge intends to help shift toward gender-fair
+language in written communication. The challenge, designed to assess and
+monitor the recognition and generation of gender-fair language in both mono-
+and cross-lingual scenarios, includes three tasks: (1) the detection of
+gendered expressions in Italian sentences, (2) the reformulation of gendered
+expressions into gender-fair alternatives, and (3) the generation of
+gender-fair language in automatic translation from English to Italian. The
+challenge relies on three different annotated datasets: the GFL-it corpus,
+which contains Italian texts extracted from administrative documents provided
+by the University of Brescia; GeNTE, a bilingual test set for gender-neutral
+rewriting and translation built upon a subset of the Europarl dataset; and
+Neo-GATE, a bilingual test set designed to assess the use of non-binary
+neomorphemes in Italian for both fair formulation and translation tasks.
+Finally, each task is evaluated with a specific metric: the average F1-score,
+computed via BERTScore on each dataset entry, for task 1; and accuracy measured
+with a gender-neutral classifier and coverage-weighted accuracy for tasks 2
+and 3.
+
+
+
+ comment: To refer to this paper please cite the CEUR-ws publication available
+ at https://ceur-ws.org/Vol-3878/
+
+
+
+
+
+
+ ☆ Referencing Where to Focus: Improving Visual Grounding with Referential
+ Query NIPS2024
+
+
+
+
+
+
+
+
+ Yabing Wang, Zhuotao Tian, Qingpei Guo, Zheng Qin, Sanping Zhou, Ming Yang, Le Wang
+
+
+ Visual Grounding aims to localize the referring object in an image given a
+natural language expression. Recent advancements in DETR-based visual grounding
+methods have attracted considerable attention, as they directly predict the
+coordinates of the target object without relying on additional efforts, such as
+pre-generated proposal candidates or pre-defined anchor boxes. However,
+existing research primarily focuses on designing a stronger multi-modal decoder,
+which typically generates learnable queries by random initialization or by
+using linguistic embeddings. This vanilla query generation approach inevitably
+increases the learning difficulty for the model, as it does not involve any
+target-related information at the beginning of decoding. Furthermore, they only
+use the deepest image feature during the query learning process, overlooking
+the importance of features from other levels. To address these issues, we
+propose a novel approach, called RefFormer. It consists of the query adaption
+module that can be seamlessly integrated into CLIP and generate the referential
+query to provide prior context for the decoder, along with a task-specific
+decoder. By incorporating the referential query into the decoder, we can
+effectively mitigate the learning difficulty of the decoder, and accurately
+concentrate on the target object. Additionally, our proposed query adaption
+module can also act as an adapter, preserving the rich knowledge within CLIP
+without the need to tune the parameters of the backbone network. Extensive
+experiments demonstrate the effectiveness and efficiency of our proposed
+method, outperforming state-of-the-art approaches on five visual grounding
+benchmarks.
+
+
+ In recent years, fine-grained sentiment analysis in finance has gained
+significant attention, but the scarcity of entity-level datasets remains a key
+challenge. To address this, we have constructed the largest English and Chinese
+financial entity-level sentiment analysis datasets to date. Building on this
+foundation, we propose a novel two-stage sentiment analysis approach called
+Self-aware In-context Learning Correction (SILC). The first stage involves
+fine-tuning a base large language model to generate pseudo-labeled data
+specific to our task. In the second stage, we train a correction model using a
+GNN-based example retriever, which is informed by the pseudo-labeled data. This
+two-stage strategy has allowed us to achieve state-of-the-art performance on
+the newly constructed datasets, advancing the field of financial sentiment
+analysis. In a case study, we demonstrate the enhanced practical utility of our
+data and methods in monitoring the cryptocurrency market. Our datasets and code
+are available at https://github.com/NLP-Bin/SILC-EFSA.
+
+
+
+ comment: This paper is to be published in the Proceedings of the 31st
+ International Conference on Computational Linguistics (COLING 2025)
+
+ Missing value is a critical issue in data science, significantly impacting
+the reliability of analyses and predictions. Missing value imputation (MVI) is
+a longstanding problem because it highly relies on domain knowledge. Large
+language models (LLMs) have emerged as a promising tool for data cleaning,
+including MVI for tabular data, offering advanced capabilities for
+understanding and generating content. However, despite their promise, existing
+LLM techniques such as in-context learning and Chain-of-Thought (CoT) often
+fall short in guiding LLMs to perform complex reasoning for MVI, particularly
+when imputing derived missing values, which require mathematical formulas and
+data relationships across rows and columns. This gap underscores the need for
+further advancements in LLM methodologies to enhance their reasoning
+capabilities for more reliable imputation outcomes. To fill this gap, we
+propose SketchFill, a novel sketch-based method to guide LLMs in generating
+accurate formulas to impute missing numerical values. Our experimental results
+demonstrate that SketchFill significantly outperforms state-of-the-art methods,
+achieving 56.2% higher accuracy than CoT-based methods and 78.8% higher
+accuracy than MetaGPT. This sets a new standard for automated data cleaning and
+advances the field of MVI for numerical values.
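The notion of a "derived" missing value above is easiest to see in a toy table: the missing cell is not observable data but the output of a formula over other columns. The table and formula below are invented examples, not SketchFill's method:

```python
# Formula-based imputation for a derived column: "total" equals
# price * quantity in complete rows, so apply the same formula where it is missing.
rows = [
    {"price": 2.0, "quantity": 3, "total": 6.0},
    {"price": 4.0, "quantity": 5, "total": None},  # derived value is missing
]

def impute_total(row):
    if row["total"] is None:
        row = {**row, "total": row["price"] * row["quantity"]}
    return row

imputed = [impute_total(r) for r in rows]
print(imputed[1]["total"])
```

SketchFill's contribution is guiding the LLM to *discover* such a formula from the table itself; here the formula is hard-coded purely for illustration.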
+
+
+
+ comment: 19 pages, 6 figures
+
+
+
+
+
+
+ ☆ "I've Heard of You!": Generate Spoken Named Entity Recognition Data for
+ Unseen Entities ICASSP 2025
+
+
+ Spoken named entity recognition (NER) aims to identify named entities from
+speech, playing an important role in speech processing. New named entities
+appear every day; however, annotating their Spoken NER data is costly. In this
+paper, we demonstrate that existing Spoken NER systems perform poorly when
+dealing with previously unseen named entities. To tackle this challenge, we
+propose a method for generating Spoken NER data based on a named entity
+dictionary (NED) to reduce costs. Specifically, we first use a large language
+model (LLM) to generate sentences from the sampled named entities and then use
+a text-to-speech (TTS) system to generate the speech. Furthermore, we introduce
+a noise metric to filter out noisy data. To evaluate our approach, we release a
+novel Spoken NER benchmark along with a corresponding NED containing 8,853
+entities. Experiment results show that our method achieves state-of-the-art
+(SOTA) performance in the in-domain, zero-shot domain adaptation, and fully
+zero-shot settings. Our data will be available at
+https://github.com/DeepLearnXMU/HeardU.
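The generation pipeline above (sample entity, LLM writes a sentence, TTS speaks it, noise filter discards bad samples) can be sketched schematically. The `llm_generate`, `tts_synthesize`, and `noise_score` functions are stand-ins, not the paper's code:

```python
# Schematic NED-driven Spoken NER data pipeline with stubbed components.
def llm_generate(entity):
    return f"Yesterday I heard that {entity} released a new album."

def tts_synthesize(sentence):
    return {"text": sentence, "audio": b"\x00" * 16}  # fake waveform bytes

def noise_score(sample):
    return 0.1  # stub: pretend every sample is clean

def build_dataset(entities, threshold=0.5):
    samples = [tts_synthesize(llm_generate(e)) for e in entities]
    return [s for s in samples if noise_score(s) < threshold]

data = build_dataset(["The Beatles", "Radiohead"])
print(len(data))
```

The real system replaces each stub with an actual LLM, TTS system, and the proposed noise metric; the structure (generate, synthesize, filter) is the point.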
+
+
+ Soft prompt learning methods are effective for adapting vision-language
+models (VLMs) to downstream tasks. Nevertheless, empirical evidence reveals a
+tendency of existing methods to overfit seen classes and exhibit
+degraded performance on unseen classes. This limitation is due to the inherent
+bias in the training data towards the seen classes. To address this issue, we
+propose a novel soft prompt learning method, named Mixture-of-Prompts
+Distillation (MoPD), which can effectively transfer useful knowledge from
+manually crafted hard prompts (a.k.a. teacher prompts) to the learnable soft
+prompt (a.k.a. student prompt), thereby enhancing the generalization ability of
+soft prompts on unseen classes. Moreover, the proposed MoPD method utilizes a
+gating network that learns to select hard prompts used for prompt distillation.
+Extensive experiments demonstrate that the proposed MoPD method outperforms
+state-of-the-art baselines, especially on unseen classes.
+
+
+
+
+
+
+
+ ☆ Advancing LLM detection in the ALTA 2024 Shared Task: Techniques and
+ Analysis
+
+
+ The recent proliferation of AI-generated content has prompted significant
+interest in developing reliable detection methods. This study explores
+techniques for identifying AI-generated text through sentence-level evaluation
+within hybrid articles. Our findings indicate that ChatGPT-3.5 Turbo exhibits
+distinct, repetitive probability patterns that enable consistent in-domain
+detection. Empirical tests show that minor textual modifications, such as
+rewording, have minimal impact on detection accuracy. These results provide
+valuable insights for advancing AI detection methodologies, offering a pathway
+toward robust solutions to address the complexities of synthetic text
+identification.
+
+
+
+
+
+
+
+ ☆ Robust Speech and Natural Language Processing Models for Depression
+ Screening
+
+
+
+
+
+
+
+
+ Y. Lu, A. Harati, T. Rutowski, R. Oliveira, P. Chlebek, E. Shriberg
+
+
+ Depression is a global health concern with a critical need for increased
+patient screening. Speech technology offers advantages for remote screening but
+must perform robustly across patients. We describe two deep learning
+models developed for this purpose. One model is based on acoustics; the other
+is based on natural language processing. Both models employ transfer learning.
+We use data from a depression-labeled corpus in which 11,000 unique users
+interacted with a human-machine application using conversational speech. Results
+on binary depression classification have shown that both models perform at or
+above AUC=0.80 on unseen data with no speaker overlap. Performance is further
+analyzed as a function of test subset characteristics, finding that the models
+are generally robust over speaker and session variables. We conclude that
+models based on these approaches offer promise for generalized automated
+depression screening.
+
+
+
+
+
+
+
+ ☆ Cross-Demographic Portability of Deep NLP-Based Depression Models
+
+
+
+
+
+
+
+
+ Tomek Rutowski, Elizabeth Shriberg, Amir Harati, Yang Lu, Ricardo Oliveira, Piotr Chlebek
+
+
+ Deep learning models are rapidly gaining interest for real-world applications
+in behavioral health. An important gap in current literature is how well such
+models generalize over different populations. We study Natural Language
+Processing (NLP) based models to explore portability over two different corpora
+highly mismatched in age. The first and larger corpus contains younger
+speakers. It is used to train an NLP model to predict depression. When testing
+on unseen speakers from the same age distribution, this model performs at
+AUC=0.82. We then test this model on the second corpus, which comprises seniors
+from a retirement community. Despite the large demographic differences in the
+two corpora, we saw only modest degradation in performance for the
+senior-corpus data, achieving AUC=0.76. Interestingly, in the senior
+population, we find AUC=0.81 for the subset of patients whose health state is
+consistent over time. Implications for demographic portability of speech-based
+applications are discussed.
+
+
+
+
+
+
+
+ ☆ Indonesian-English Code-Switching Speech Synthesizer Utilizing
+ Multilingual STEN-TTS and Bert LID
+
+
+ Multilingual text-to-speech systems convert text into speech across multiple
+languages. In many cases, text sentences may contain segments in different
+languages, a phenomenon known as code-switching. This is particularly common in
+Indonesia, especially between Indonesian and English. Despite its significance,
+no research has yet developed a multilingual TTS system capable of handling
+code-switching between these two languages. This study addresses
+Indonesian-English code-switching in STEN-TTS. Key modifications include adding
+a language identification component to the text-to-phoneme conversion using
+finetuned BERT for per-word language identification, as well as removing
+language embedding from the base model. Experimental results demonstrate that
+the code-switching model achieves superior naturalness and improved speech
+intelligibility compared to the Indonesian and English baseline STEN-TTS
+models.
+
+
+
+ comment: Accepted at O-COCOSDA 2024
+
+
+
+
+
+
+ ☆ Let the Rule Speak: Enhancing In-context Learning Debiasing with
+ Interpretability
+
+
+ In-context learning, which allows large language models to perform diverse
+tasks with a few demonstrations, is found to have imbalanced per-class
+prediction accuracy on multi-class text classification. Although notable output
+correction methods have been developed to tackle the issue and simultaneously
+improve downstream prediction accuracy, they may fail to answer the core
+interpretability challenges: why certain classes need corrections and which
+ones do, and, more importantly, how to tailor the correction to each sample's
+per-class probability. To address such interpretability gaps, we first find
+that the imbalance arises because certain classes consistently receive high
+ICL output probabilities whereas others receive lower or mixed ranges, so the former is
+more frequently chosen, resulting in higher accuracy; more crucially, we find
+that these ranges have significantly varying degrees of influence on the
+accuracy bias, highlighting the need for precise, interpretable probability
+corrections by range. Motivated by this, we propose FuRud, a Fuzzy Rule
+Optimization based Debiasing method that (1) detects which classes need
+corrections, and (2) for each correction-needed class, detects its probability
+ranges and applies asymmetric amplifications or reductions to correct them
+interpretably. Notably, across seven benchmark datasets, FuRud reduces the
+pairwise class accuracy bias (COBias) by more than half (56%), while achieving
+a relative increase of 21% in accuracy, outperforming state-of-the-art
+debiasing methods. Moreover, FuRud can optimize downstream tasks with as few as
+10 optimization examples. Furthermore, FuRud can work for prompt formats that
+lead to highly skewed predictions. For example, FuRud greatly improves ICL
+outputs which use letter options, with 44% relative accuracy increase and 54%
+relative COBias reduction.
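The asymmetric range-wise correction described above amounts to scaling an over-predicted class's probability down and an under-predicted class's up, then renormalizing. The factors below are invented; FuRud learns them via fuzzy rule optimization:

```python
# Toy asymmetric probability correction: damp the dominant class, amplify
# the others, renormalize back to a distribution.
def correct(probs, factors):
    adjusted = [p * f for p, f in zip(probs, factors)]
    total = sum(adjusted)
    return [a / total for a in adjusted]

probs = [0.7, 0.2, 0.1]    # class 0 dominates ICL outputs
factors = [0.5, 1.5, 1.5]  # reduce class 0, amplify classes 1 and 2
fixed = correct(probs, factors)
print(fixed)
```

FuRud's refinement is that the factor applied depends interpretably on which probability *range* the value falls in, per class, rather than a single global multiplier as in this sketch.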
+
+
+
+
+
+
+
+ ♻ ☆ Towards A Holistic Landscape of Situated Theory of Mind in Large
+ Language Models EMNLP 2023
+
+
+
+
+
+
+
+
+ Ziqiao Ma, Jacob Sansom, Run Peng, Joyce Chai
+
+
+ Large Language Models (LLMs) have generated considerable interest and debate
+regarding their potential emergence of Theory of Mind (ToM). Several recent
+inquiries reveal a lack of robust ToM in these models and pose a pressing
+demand to develop new benchmarks, as current ones primarily focus on different
+aspects of ToM and are prone to shortcuts and data leakage. In this position
+paper, we seek to answer two road-blocking questions: (1) How can we taxonomize
+a holistic landscape of machine ToM? (2) What is a more effective evaluation
+protocol for machine ToM? Following psychological studies, we taxonomize
+machine ToM into 7 mental state categories and delineate existing benchmarks to
+identify under-explored aspects of ToM. We argue for a holistic and situated
+evaluation of ToM to break ToM into individual components and treat LLMs as an
+agent who is physically situated in environments and socially situated in
+interactions with humans. Such situated evaluation provides a more
+comprehensive assessment of mental states and potentially mitigates the risk of
+shortcuts and data leakage. We further present a pilot study in a grid world
+setup as a proof of concept. We hope this position paper can facilitate future
+research to integrate ToM with LLMs and offer an intuitive means for
+researchers to better position their work in the landscape of ToM. Project
+page: https://github.com/Mars-tin/awesome-theory-of-mind
+
+
+
+ comment: EMNLP 2023 (Findings)
+
+
+
+
+
+
+ ♻ ☆ World-to-Words: Grounded Open Vocabulary Acquisition through Fast
+ Mapping in Vision-Language Models ACL 2023
+
+
+ The ability to connect language units to their referents in the physical
+world, referred to as grounding, is crucial to learning and understanding
+grounded meanings of words. While humans demonstrate fast mapping in new word
+learning, it remains unclear whether modern vision-language models can truly
+represent language with their grounded meanings and how grounding may further
+bootstrap new word learning. To this end, we introduce Grounded Open Vocabulary
+Acquisition (GOVA) to examine grounding and bootstrapping in open-world
+language learning. As an initial attempt, we propose object-oriented BERT
+(OctoBERT), a novel visually-grounded language model by pre-training on
+image-text pairs highlighting grounding as an objective. Through extensive
+experiments and analysis, we demonstrate that OctoBERT is a more coherent and
+fast grounded word learner, and that the grounding ability acquired during
+pre-training helps the model to learn unseen words more rapidly and robustly.
+Our code is available at https://github.com/sled-group/world-to-words
+
+
+
+
+
+
+
+
+ Pan Wang, Qiang Zhou, Yawen Wu, Tianlong Chen, Jingtong Hu
+
+
+ Multimodal Sentiment Analysis (MSA) leverages heterogeneous modalities, such
+as language, vision, and audio, to enhance the understanding of human
+sentiment. While existing models often focus on extracting shared information
+across modalities or directly fusing heterogeneous modalities, such approaches
+can introduce redundancy and conflicts due to equal treatment of all modalities
+and the mutual transfer of information between modality pairs. To address these
+issues, we propose a Disentangled-Language-Focused (DLF) multimodal
+representation learning framework, which incorporates a feature disentanglement
+module to separate modality-shared and modality-specific information. To
+further reduce redundancy and enhance language-targeted features, four
+geometric measures are introduced to refine the disentanglement process. A
+Language-Focused Attractor (LFA) is further developed to strengthen language
+representation by leveraging complementary modality-specific information
+through a language-guided cross-attention mechanism. The framework also employs
+hierarchical predictions to improve overall accuracy. Extensive experiments on
+two popular MSA datasets, CMU-MOSI and CMU-MOSEI, demonstrate the significant
+performance gains achieved by the proposed DLF framework. Comprehensive
+ablation studies further validate the effectiveness of the feature
+disentanglement module, language-focused attractor, and hierarchical
+predictions. Our code is available at https://github.com/pwang322/DLF.
+
+
+
+ comment: AAAI 2025 accepted
+
+
+
+
+
+
+ ♻ ☆ LMFusion: Adapting Pretrained Language Models for Multimodal Generation
+
+
+
+
+
+
+
+
+ Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, Lili Yu
+
+
+ We present LMFusion, a framework for empowering pretrained text-only large
+language models (LLMs) with multimodal generative capabilities, enabling them
+to understand and generate both text and images in arbitrary sequences.
+LMFusion leverages existing Llama-3's weights for processing texts
+autoregressively while introducing additional and parallel transformer modules
+for processing images with diffusion. During training, the data from each
+modality is routed to its dedicated modules: modality-specific feedforward
+layers, query-key-value projections, and normalization layers process each
+modality independently, while the shared self-attention layers allow
+interactions across text and image features. By freezing the text-specific
+modules and only training the image-specific modules, LMFusion preserves the
+language capabilities of text-only LLMs while developing strong visual
+understanding and generation abilities. Compared to methods that pretrain
+multimodal generative models from scratch, our experiments demonstrate that
+LMFusion improves image understanding by 20% and image generation by 3.6% using
+only 50% of the FLOPs while maintaining Llama-3's language capabilities. We
+also demonstrate that this framework can adapt existing vision-language models
+with multimodal generation ability. Overall, this framework not only leverages
+existing computational investments in text-only LLMs but also enables the
+parallel development of language and vision capabilities, presenting a
+promising direction for efficient multimodal model development.
+
+
+
+ comment: Name change: LlamaFusion to LMFusion
+
+
+
+
+
+
+ ♻ ☆ LLMs for Knowledge Graph Construction and Reasoning: Recent Capabilities
+ and Future Opportunities
+
+
+ This paper presents an exhaustive quantitative and qualitative evaluation of
+Large Language Models (LLMs) for Knowledge Graph (KG) construction and
+reasoning. We engage in experiments across eight diverse datasets, focusing on
+four representative tasks encompassing entity and relation extraction, event
+extraction, link prediction, and question-answering, thereby thoroughly
+exploring LLMs' performance in the domain of construction and inference.
+Empirically, our findings suggest that LLMs, represented by GPT-4, are better
+suited as inference assistants than as few-shot information extractors.
+Specifically, while GPT-4 exhibits good performance in tasks related to KG
+construction, it excels further in reasoning tasks, surpassing fine-tuned
+models in certain cases. Moreover, our investigation extends to the potential
+generalization ability of LLMs for information extraction, leading to the
+proposition of a Virtual Knowledge Extraction task and the development of the
+corresponding VINE dataset. Based on these empirical findings, we further
+propose AutoKG, a multi-agent-based approach employing LLMs and external
+sources for KG construction and reasoning. We anticipate that this research can
+provide invaluable insights for future undertakings in the field of knowledge
+graphs. The code and datasets are in https://github.com/zjunlp/AutoKG.
+
+
+
+ comment: World Wide Web Journal
+
+
+
+
+
+
+ ♻ ☆ BDA: Bangla Text Data Augmentation Framework
+
+
+ Data augmentation involves generating synthetic samples that resemble those
+in a given dataset. In resource-limited fields where high-quality data is
+scarce, augmentation plays a crucial role in increasing the volume of training
+data. This paper introduces a Bangla Text Data Augmentation (BDA) Framework
+that uses both pre-trained models and rule-based methods to create new variants
+of the text. A filtering process is included to ensure that the new text keeps
+the same meaning as the original while also adding variety in the words used.
+We conduct a comprehensive evaluation of the framework's effectiveness in
+Bangla text classification tasks. Our framework achieved significant
+improvement in F1 scores across five distinct datasets, delivering performance
+equivalent to models trained on 100% of the data while utilizing only 50% of
+the training dataset. Additionally, we explore the impact of data scarcity by
+progressively reducing the training data and augmenting it through BDA,
+resulting in notable F1 score enhancements. The study offers a thorough
+examination of BDA's performance, identifying key factors for optimal results
+and addressing its limitations through detailed analysis.
+
+
+
+
+
+
+
+ ♻ ☆ From MTEB to MTOB: Retrieval-Augmented Classification for Descriptive
+ Grammars COLING 2025
+
+
+ Recent advances in language modeling have demonstrated significant
+improvements in zero-shot capabilities, including in-context learning,
+instruction following, and machine translation for extremely under-resourced
+languages (Tanzer et al., 2024). However, many languages with limited written
+resources rely primarily on formal descriptions of grammar and vocabulary.
+ In this paper, we introduce a set of benchmarks to evaluate how well models
+can extract and classify information from the complex descriptions found in
+linguistic grammars. We present a Retrieval-Augmented Generation (RAG)-based
+approach that leverages these descriptions for downstream tasks such as machine
+translation. Our benchmarks encompass linguistic descriptions for 248 languages
+across 142 language families, focusing on typological features from WALS and
+Grambank.
+ This set of benchmarks offers the first comprehensive evaluation of language
+models' in-context ability to accurately interpret and extract linguistic
+features, providing a critical resource for scaling NLP to low-resource
+languages. The code and data are publicly available at
+\url{https://github.com/al-the-eigenvalue/RAG-on-grammars}.
+
+
+
+ comment: submitted to COLING 2025
+
+
+
+
+
+
+ ♻ ☆ Do Language Models Understand the Cognitive Tasks Given to Them?
+ Investigations with the N-Back Paradigm
+
+
+ Cognitive tasks originally developed for humans are now increasingly used to
+study language models. While applying these tasks is often straightforward,
+interpreting their results can be challenging. In particular, when a model
+underperforms, it is often unclear whether this results from a limitation in
+the cognitive ability being tested or a failure to understand the task itself.
+A recent study argues that GPT-3.5's declining performance on 2-back and 3-back
+tasks reflects a working memory capacity limit similar to humans (Gong et al.,
+2024). By analyzing a range of open-source language models of varying
+performance levels on these tasks, we show that the poor performance instead
+reflects a limitation in task comprehension and task set maintenance. In
+addition, we challenge the best-performing model with progressively harder
+versions of the task (up to 10-back) and experiment with alternative prompting
+strategies, before analyzing model attentions. Our larger aim is to contribute
+to the ongoing conversation around refining methodologies for the cognitive
+evaluation of language models.
+
+
+
+
+
+
+
+ ♻ ☆ TableRAG: Million-Token Table Understanding with Language Models NeurIPS 2024
+
+
+ Recent advancements in language models (LMs) have notably enhanced their
+ability to reason with tabular data, primarily through program-aided mechanisms
+that manipulate and analyze tables. However, these methods often require the
+entire table as input, leading to scalability challenges due to the positional
+bias or context length constraints. In response to these challenges, we
+introduce TableRAG, a Retrieval-Augmented Generation (RAG) framework
+specifically designed for LM-based table understanding. TableRAG leverages
+query expansion combined with schema and cell retrieval to pinpoint crucial
+information before providing it to the LMs. This enables more efficient data
+encoding and precise retrieval, significantly reducing prompt lengths and
+mitigating information loss. We have developed two new million-token benchmarks
+from the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG's
+effectiveness at scale. Our results demonstrate that TableRAG's retrieval
+design achieves the highest retrieval quality, leading to the new
+state-of-the-art performance on large-scale table understanding.
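The retrieve-before-prompting idea above can be sketched minimally: instead of serializing the whole table, keep only the columns and cells relevant to the query. The naive substring matching here is an assumption for illustration; TableRAG uses query expansion with schema and cell retrievers:

```python
# Toy schema + cell retrieval: select matching column names and cell values
# so the prompt passed to the LM stays far smaller than the full table.
table = {
    "country": ["France", "Japan", "Brazil"],
    "capital": ["Paris", "Tokyo", "Brasilia"],
    "population_m": [68, 125, 216],
}

def retrieve(table, query):
    terms = set(query.lower().split())
    cols = [c for c in table if any(t in c.lower() for t in terms)]
    cells = [v for col in table.values() for v in col
             if any(t in str(v).lower() for t in terms)]
    return cols, cells

cols, cells = retrieve(table, "capital of Japan")
print(cols, cells)
```

Only one column name and one cell survive retrieval here, which is the mechanism that keeps prompt length bounded even for million-token tables.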
+
+
+
+ comment: Accepted to NeurIPS 2024
+
+
+
+
+
+
+ ♻ ☆ Enhancing Long-Range Dependency with State Space Model and
+ Kolmogorov-Arnold Networks for Aspect-Based Sentiment Analysis
+
+
+ Aspect-based Sentiment Analysis (ABSA) evaluates sentiments toward specific
+aspects of entities within the text. However, attention mechanisms and neural
+network models struggle with syntactic constraints. The quadratic complexity of
+attention mechanisms also limits their adoption for capturing long-range
+dependencies between aspect and opinion words in ABSA. This complexity can lead
+to the misinterpretation of irrelevant contextual words, restricting their
+effectiveness to short-range dependencies. To address the above problem, we
+present a novel approach to enhance long-range dependencies between aspect and
+opinion words in ABSA (MambaForGCN). This approach incorporates syntax-based
+Graph Convolutional Network (SynGCN) and MambaFormer (Mamba-Transformer)
+modules to encode input with dependency relations and semantic information. The
+Multihead Attention (MHA) and Selective State Space model (Mamba) blocks in the
+MambaFormer module serve as channels to enhance the model with short and
+long-range dependencies between aspect and opinion words. We also introduce the
+Kolmogorov-Arnold Networks (KANs) gated fusion, an adaptive feature
+representation system that integrates SynGCN and MambaFormer and captures
+non-linear, complex dependencies. Experimental results on three benchmark
+datasets demonstrate MambaForGCN's effectiveness, outperforming
+state-of-the-art (SOTA) baseline models.
+
+
+
+ comment: 11 pages, 3 figures and 3 tables. arXiv admin note: text overlap with
+ arXiv:2405.13013
+
+
+
+
+
+
+ ♻ ☆ LLMsAgainstHate @ NLU of Devanagari Script Languages 2025: Hate Speech
+ Detection and Target Identification in Devanagari Languages via Parameter
+ Efficient Fine-Tuning of LLMs
+
+
+
+
+
+
+
+
+ Rushendra Sidibomma, Pransh Patwa, Parth Patwa, Aman Chadha, Vinija Jain, Amitava Das
+
+
+ The detection of hate speech has become increasingly important in combating
+online hostility and its real-world consequences. Despite recent advancements,
+there is limited research addressing hate speech detection in
+Devanagari-scripted languages, where resources and tools are scarce. While
+large language models (LLMs) have shown promise in language-related tasks,
+traditional fine-tuning approaches are often infeasible given the size of the
+models. In this paper, we propose a Parameter-Efficient Fine-Tuning (PEFT)
+based solution for hate speech detection and target identification. We evaluate
+multiple LLMs on the Devanagari dataset provided by (Thapa et al., 2025), which
+contains annotated instances in 2 languages - Hindi and Nepali. The results
+demonstrate the efficacy of our approach in handling Devanagari-scripted
+content.
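The abstract names PEFT but not a specific method. LoRA is one widely used PEFT technique, so the following sketch shows a LoRA-style low-rank update as an assumed example; the matrices are toy values, not anything from the paper.

```python
def lora_forward(x, W, A, B, scale=1.0):
    """LoRA-style parameter-efficient forward pass: the frozen weight W
    is augmented with a trainable low-rank product B @ A. LoRA is an
    assumed example here; the abstract does not name the PEFT variant."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    base = matvec(W, x)                 # frozen pretrained path
    delta = matvec(B, matvec(A, x))     # low-rank trainable path
    return [b + scale * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight (identity for the toy)
A = [[1.0, 1.0]]               # rank-1 down-projection (1x2)
B = [[0.5], [0.0]]             # up-projection (2x1)
print(lora_forward([2.0, 3.0], W, A, B))
```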
+
+
+
+
+
+
+
+
+ Xi Cao, Quzong Gesang, Yuan Sun, Nuo Qun, Tashi Nyima
+
+
+ Language models based on deep neural networks are vulnerable to textual
+adversarial attacks. While rich-resource languages like English are receiving
+focused attention, Tibetan, a cross-border language, is gradually being studied
+due to its abundant ancient literature and critical language strategy.
+Currently, there are several Tibetan adversarial text generation methods, but
+they do not fully consider the textual features of Tibetan script and
+overestimate the quality of generated adversarial texts. To address this issue,
+we propose a novel Tibetan adversarial text generation method called TSCheater,
+which considers the characteristic of Tibetan encoding and the feature that
+visually similar syllables have similar semantics. This method can also be
+transferred to other abugidas, such as Devanagari script. We utilize a
+self-constructed Tibetan syllable visual similarity database called TSVSDB to
+generate substitution candidates and adopt a greedy algorithm-based scoring
+mechanism to determine substitution order. We then evaluate the method on
+eight victim language models. Experimentally, TSCheater outperforms existing
+methods in attack effectiveness, perturbation magnitude, semantic similarity,
+visual similarity, and human acceptance. Finally, we construct the first
+Tibetan adversarial robustness evaluation benchmark called AdvTS, which is
+generated by existing methods and proofread by humans.
+
+
+
+ comment: Camera-Ready Version; Accepted at ICASSP 2025
+
+
+
+
+
+
+ ♻ ☆ MDD-5k: A New Diagnostic Conversation Dataset for Mental Disorders
+ Synthesized via Neuro-Symbolic LLM Agents AAAI
+
+
+ The clinical diagnosis of most mental disorders primarily relies on the
+conversations between psychiatrist and patient. The creation of such diagnostic
+conversation datasets is promising to boost the AI mental healthcare community.
+However, directly collecting the conversations in real diagnosis scenarios is
+nearly impossible due to stringent privacy and ethical considerations. To address
+this issue, we seek to synthesize diagnostic conversation by exploiting
+anonymized patient cases that are easier to access. Specifically, we design a
+neuro-symbolic multi-agent framework for synthesizing the diagnostic
+conversation of mental disorders with large language models. It takes a patient
+case as input and can generate multiple diverse conversations from a single
+case. The framework involves the interaction
+between a doctor agent and a patient agent, and generates conversations under
+symbolic control via a dynamic diagnosis tree. By applying the proposed
+framework, we develop the largest Chinese mental disorders diagnosis dataset
+MDD-5k. This dataset is built upon 1000 real, anonymized patient cases by
+cooperating with Shanghai Mental Health Center and comprises 5000 high-quality
+long conversations with diagnosis results and treatment opinions as labels. To
+the best of our knowledge, it is also the first labeled dataset for Chinese
+mental disorders diagnosis. Human evaluation demonstrates the proposed MDD-5k
+dataset successfully simulates human-like diagnostic process of mental
+disorders.
+
+
+
+ comment: Accepted by the 39th Annual AAAI Conference on Artificial
+ Intelligence
+
+
+
+
+
+
+ ♻ ☆ Falcon: Faster and Parallel Inference of Large Language Models through
+ Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree AAAI 2025
+
+
+ Striking an optimal balance between minimal drafting latency and high
+speculation accuracy to enhance the inference speed of Large Language Models
+remains a significant challenge in speculative decoding. In this paper, we
+introduce Falcon, an innovative semi-autoregressive speculative decoding
+framework fashioned to augment both the drafter's parallelism and output
+quality. Falcon incorporates the Coupled Sequential Glancing Distillation
+technique, which fortifies inter-token dependencies within the same block,
+leading to increased speculation accuracy. We offer a comprehensive theoretical
+analysis to illuminate the underlying mechanisms. Additionally, we introduce a
+Custom-Designed Decoding Tree, which permits the drafter to generate multiple
+tokens in a single forward pass and accommodates multiple forward passes as
+needed, thereby boosting the number of drafted tokens and significantly
+improving the overall acceptance rate. Comprehensive evaluations on benchmark
+datasets such as MT-Bench, HumanEval, and GSM8K demonstrate Falcon's superior
+acceleration capabilities. The framework achieves a lossless speedup ratio
+ranging from 2.91x to 3.51x when tested on the Vicuna and LLaMA2-Chat model
+series. These results outstrip existing speculative decoding methods for LLMs,
+including Eagle, Medusa, Lookahead, SPS, and PLD, while maintaining a compact
+drafter architecture equivalent to merely two Transformer layers.
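The core of speculative decoding, which Falcon builds on, is a draft-then-verify loop: the target model checks the drafter's tokens left to right and keeps the longest prefix it agrees with. This is a generic sketch of that acceptance loop, not Falcon's semi-autoregressive tree variant; `verify_token` is a hypothetical stand-in for a target-model check.

```python
def accept_draft(draft_tokens, verify_token):
    """Accept the longest prefix of the draft that the target model
    agrees with; discard everything after the first disagreement.
    `verify_token(context, token)` stands in for a target-model pass."""
    accepted = []
    for tok in draft_tokens:
        if verify_token(accepted, tok):
            accepted.append(tok)
        else:
            break  # first rejection: the rest of the draft is discarded
    return accepted

# Toy verifier: the 'target model' only agrees with even token ids.
print(accept_draft([2, 4, 7, 8], lambda ctx, t: t % 2 == 0))
```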
+
+
+
+ comment: AAAI 2025 Accepted
+
+
+
+
+
+
+ ♻ ☆ Sim911: Towards Effective and Equitable 9-1-1 Dispatcher Training with
+ an LLM-Enabled Simulation
+
+
+
+
+
+
+
+
+ Zirong Chen, Elizabeth Chason, Noah Mladenovski, Erin Wilson, Kristin Mullen, Stephen Martini, Meiyi Ma
+
+
+ Emergency response services are vital for enhancing public safety by
+safeguarding the environment, property, and human lives. As frontline members
+of these services, 9-1-1 dispatchers have a direct impact on response times and
+the overall effectiveness of emergency operations. However, traditional
+dispatcher training methods, which rely on role-playing by experienced
+personnel, are labor-intensive, time-consuming, and often neglect the specific
+needs of underserved communities. To address these challenges, we introduce
+Sim911, the first training simulation for 9-1-1 dispatchers powered by Large
+Language Models (LLMs). Sim911 enhances training through three key technical
+innovations: (1) knowledge construction, which utilizes archived 9-1-1 call
+data to generate simulations that closely mirror real-world scenarios; (2)
+context-aware controlled generation, which employs dynamic prompts and vector
+bases to ensure that LLM behavior aligns with training objectives; and (3)
+validation with looped correction, which filters out low-quality responses and
+refines the system performance.
+
+
+
+
+
+
+
+ ♻ ☆ Clustering Algorithms and RAG Enhancing Semi-Supervised Text
+ Classification with Large LLMs
+
+
+ This paper proposes a Clustering, Labeling, then Augmenting framework that
+significantly enhances performance in Semi-Supervised Text Classification
+(SSTC) tasks, effectively addressing the challenge of vast datasets with
+limited labeled examples. Unlike traditional SSTC approaches that rely on a
+predefined small set of labeled data to generate pseudo-labels for the
+unlabeled data, this framework innovatively employs clustering to select
+representative "landmarks" for labeling. These landmarks subsequently act as
+intermediaries in an ensemble of augmentation techniques, including
+Retrieval-Augmented Generation (RAG), Large Language Model (LLM)-based
+rewriting, and synonym substitution, to generate synthetic labeled data without
+making pseudo-labels for the unlabeled data. Empirical results show that even
+in complex text document classification scenarios involving over 100
+categories, our method achieves state-of-the-art accuracies of 95.41% on the
+Reuters dataset and 82.43% on the Web of Science dataset. Our approach
+significantly reduces the reliance on human labeling efforts and the associated
+expenses, while simultaneously ensuring high data quality and minimizing
+privacy risks. The fine-tuning results further show the efficiency of
+fine-tuning LLMs for text classification tasks, highlighting a robust solution
+for leveraging limited labeled data.
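The landmark-selection step above can be sketched as picking, for each cluster, the sample closest to its centroid. This assumes a clustering step (e.g. k-means) has already produced centroids and assignments; the data is toy.

```python
def select_landmarks(points, centroids, assignments):
    """Pick one representative 'landmark' per cluster: the point
    closest to its centroid. Sketch of the clustering-then-labeling
    idea; the upstream clustering is assumed to have already run."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = {}
    for i, (p, c) in enumerate(zip(points, assignments)):
        d = sqdist(p, centroids[c])
        if c not in best or d < best[c][1]:
            best[c] = (i, d)
    return {c: idx for c, (idx, _) in best.items()}

pts = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 6.0]]
cents = [[0.4, 0.4], [5.6, 5.6]]
print(select_landmarks(pts, cents, [0, 0, 1, 1]))  # one index per cluster
```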
+
+
+
+
+
+
+
+ ♻ ☆ Large Language Model as a Catalyst: A Paradigm Shift in Base Station
+ Siting Optimization
+
+
+
+
+
+
+
+
+ Yanhu Wang, Muhammad Muzammil Afzal, Zhengyang Li, Jie Zhou, Chenyuan Feng, Shuaishuai Guo, Tony Q. S. Quek
+
+
+ Traditional base station siting (BSS) methods rely heavily on drive testing
+and user feedback, which are laborious and require extensive expertise in
+communication, networking, and optimization. As large language models (LLMs)
+and their associated technologies advance, particularly in the realms of prompt
+engineering and agent engineering, network optimization will witness a
+revolutionary approach. This approach entails the strategic use of well-crafted
+prompts to infuse human experience and knowledge into these sophisticated LLMs,
+and the deployment of autonomous agents as a communication bridge to seamlessly
+connect the machine language based LLMs with human users using natural
+language. Furthermore, our proposed framework incorporates retrieval-augmented
+generation (RAG) to enhance the system's ability to acquire domain-specific
+knowledge and generate solutions, thereby enabling the customization and
+optimization of the BSS process. This integration represents the future
+paradigm of artificial intelligence (AI) as a service and of AI for ease of use.
+This research first develops a novel LLM-empowered BSS optimization framework,
+and heuristically proposes three different potential implementations: the
+strategies based on Prompt-optimized LLM (PoL), LLM-empowered autonomous BSS
+agent (LaBa), and Cooperative multiple LLM-based autonomous BSS agents (CLaBa).
+Through evaluation on real-world data, the experiments demonstrate that
+prompt-assisted LLMs and LLM-based agents can generate more efficient and
+reliable network deployments, noticeably enhancing the efficiency of BSS
+optimization and reducing trivial manual participation.
+
+
+
+
+
+
+
+ ♻ ☆ MaxMin-RLHF: Alignment with Diverse Human Preferences
+
+
+ Reinforcement Learning from Human Feedback (RLHF) aligns language models to
+human preferences by employing a singular reward model derived from preference
+data. However, such an approach overlooks the rich diversity of human
+preferences inherent in data collected from multiple users. In this work, we
+first derive an impossibility result of alignment with single reward RLHF,
+thereby highlighting its insufficiency in representing diverse human
+preferences. To provide an equitable solution to the problem, we learn a
+mixture of preference distributions via an expectation-maximization algorithm
+and propose a MaxMin alignment objective for policy learning inspired by the
+Egalitarian principle in social choice theory to better represent diverse human
+preferences. We elucidate the connection of our proposed approach to
+distributionally robust optimization and general utility RL, thereby
+highlighting the generality and robustness of our proposed solution. We present
+comprehensive experimental results on small-scale (GPT-2) and large-scale
+language models (with Tulu2-7B) and show the efficacy of the proposed approach
+in the presence of diversity among human preferences. Our algorithm achieves an
+average improvement of more than 16% in win-rates over conventional RLHF
+algorithms and improves the win-rate (accuracy) for minority groups by over 33%
+without compromising the performance of majority groups, showcasing the
+robustness and fairness of our approach. We remark that our findings in this
+work are not only limited to language models but also extend to reinforcement
+learning in general.
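The Egalitarian criterion above scores a policy by its worst-off preference group: average the reward within each group, then take the minimum across groups. A toy evaluation of that MaxMin objective, with hypothetical group names and reward values:

```python
def maxmin_value(group_rewards):
    """Score a policy by its worst-off group: per-group mean reward,
    then the minimum across groups (the quantity MaxMin alignment
    maximizes). Groups and rewards here are illustrative only."""
    group_means = {g: sum(rs) / len(rs) for g, rs in group_rewards.items()}
    worst_group = min(group_means, key=group_means.get)
    return worst_group, group_means[worst_group]

rewards = {"majority": [0.9, 0.8, 1.0], "minority": [0.4, 0.6]}
print(maxmin_value(rewards))  # the group that bounds the objective
```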
+
+
+
+
+
+
+
+
+
+
+ Information Retrieval 14
+
+
+
+
+
+ ☆ From Interests to Insights: An LLM Approach to Course Recommendations
+ Using Natural Language Queries
+
+
+
+
+
+
+
+
+ Hugh Van Deventer, Mark Mills, August Evrard
+
+
+ Most universities in the United States encourage their students to explore
+academic areas before declaring a major and to acquire academic breadth by
+satisfying a variety of requirements. Each term, students must choose, from
+many thousands of offerings spanning dozens of subject areas, a handful of
+courses to take. The curricular environment is also dynamic, and poor
+communication and search functions on campus can limit a student's ability to
+discover new courses of interest. To support both students and their advisers
+in such a setting, we explore a novel Large Language Model (LLM) course
+recommendation system that applies a Retrieval Augmented Generation (RAG)
+method to the corpus of course descriptions. The system first generates an
+'ideal' course description based on the user's query. This description is
+converted into a search vector using embeddings, which is then used to find
+actual courses with similar content by comparing embedding similarities. We
+describe the method and assess the quality and fairness of some example
+prompts. Steps to deploy a pilot system on campus are discussed.
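The retrieval step above reduces to ranking courses by embedding similarity to the generated 'ideal course' vector. A minimal sketch, in which two-dimensional toy vectors and made-up course names stand in for a real embedding model's output:

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def top_courses(query_vec, course_vecs, k=2):
    """Rank courses by cosine similarity between the ideal-course
    embedding and each course-description embedding."""
    scored = sorted(course_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]

# Hypothetical course names and toy embeddings.
courses = {"ASTRO 101": [0.9, 0.1], "ENGR 100": [0.2, 0.8], "PHYS 140": [0.7, 0.3]}
print(top_courses([1.0, 0.0], courses))
```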
+
+
+ Modern recommender systems aim to deeply understand users' complex
+preferences through their past interactions. While deep collaborative filtering
+approaches using Graph Neural Networks (GNNs) excel at capturing user-item
+relationships, their effectiveness is limited when handling sparse data or
+zero-shot scenarios, primarily due to constraints in ID-based embedding
+functions. To address these challenges, we propose a model-agnostic
+recommendation instruction-tuning paradigm that seamlessly integrates large
+language models with collaborative filtering. Our proposed Recommendation
+Language Model (RecLM) enhances the capture of user preference diversity
+through a carefully designed reinforcement learning reward function that
+facilitates self-augmentation of language models. Comprehensive evaluations
+demonstrate significant advantages of our approach across various settings, and
+its plug-and-play compatibility with state-of-the-art recommender systems
+results in notable performance enhancements.
+
+
+
+
+
+
+
+ ☆ Optimizing Multi-Stage Language Models for Effective Text Retrieval
+
+
+
+
+
+
+
+
+ Quang Hoang Trung, Le Trung Hoang, Nguyen Van Hoang Phuc
+
+
+ Efficient text retrieval is critical for applications such as legal document
+analysis, particularly in specialized contexts like Japanese legal systems.
+Existing retrieval methods often underperform in such domain-specific
+scenarios, necessitating tailored approaches. In this paper, we introduce a
+novel two-phase text retrieval pipeline optimized for Japanese legal datasets.
+Our method leverages advanced language models to achieve state-of-the-art
+performance, significantly improving retrieval efficiency and accuracy. To
+further enhance robustness and adaptability, we incorporate an ensemble model
+that integrates multiple retrieval strategies, resulting in superior outcomes
+across diverse tasks. Extensive experiments validate the effectiveness of our
+approach, demonstrating strong performance on both Japanese legal datasets and
+widely recognized benchmarks like MS-MARCO. Our work establishes new standards
+for text retrieval in domain-specific and general contexts, providing a
+comprehensive solution for addressing complex queries in legal and multilingual
+environments.
+
+
+
+
+
+
+
+ ☆ Personalized Dynamic Music Emotion Recognition with Dual-Scale
+ Attention-Based Meta-Learning AAAI
+
+
+ Dynamic Music Emotion Recognition (DMER) aims to predict the emotion of
+different moments in music, playing a crucial role in music information
+retrieval. The existing DMER methods struggle to capture long-term dependencies
+when dealing with sequence data, which limits their performance. Furthermore,
+these methods often overlook the influence of individual differences on emotion
+perception, even though everyone has their own personalized emotional
+perception in the real world. Motivated by these issues, we explore more
+effective sequence processing methods and introduce the Personalized DMER
+(PDMER) problem, which requires models to predict emotions that align with
+personalized perception. Specifically, we propose a Dual-Scale Attention-Based
+Meta-Learning (DSAML) method. This method fuses features from a dual-scale
+feature extractor and captures both short and long-term dependencies using a
+dual-scale attention transformer, improving the performance in traditional
+DMER. To achieve PDMER, we design a novel task construction strategy that
+divides tasks by annotators. Samples in a task are annotated by the same
+annotator, ensuring consistent perception. Leveraging this strategy alongside
+meta-learning, DSAML can predict personalized perception of emotions with just
+one personalized annotation sample. Our objective and subjective experiments
+demonstrate that our method can achieve state-of-the-art performance in both
+traditional DMER and PDMER.
+
+
+
+ comment: Accepted by the 39th AAAI Conference on Artificial Intelligence
+ (AAAI-25)
+
+
+
+
+
+
+ ☆ Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal
+ Video-Text Retrieval
+
+
+ Cross-modal (e.g. image-text, video-text) retrieval is an important task in
+information retrieval and the multimodal vision-language understanding field.
+Temporal understanding makes video-text retrieval more challenging than
+image-text retrieval. However, we find that the widely used video-text
+benchmarks have shortcomings in comprehensively assessing the abilities of
+models, especially in temporal understanding, so that large-scale image-text
+pre-trained models can already achieve zero-shot performance comparable with
+video-text pre-trained models. In this paper, we introduce RTime, a novel
+temporal-emphasized video-text retrieval dataset. We first obtain videos of
+actions or events with significant temporality, and then reverse these videos
+to create harder negative samples. We then recruit annotators to judge the
+significance and reversibility of candidate videos, and write captions for
+qualified videos. We further adopt GPT-4 to extend more captions based on
+human-written captions. Our RTime dataset currently consists of 21k videos with
+10 captions per video, totalling about 122 hours. Based on RTime, we propose
+three retrieval benchmark tasks: RTime-Origin, RTime-Hard, and RTime-Binary. We
+further enhance the use of harder-negatives in model training, and benchmark a
+variety of video-text models on RTime. Extensive experimental analysis shows
+that RTime indeed poses new and higher challenges to video-text retrieval. We
+release our RTime
+dataset\footnote{\url{https://github.com/qyr0403/Reversed-in-Time}} to further
+advance video-text retrieval and multimodal understanding research.
+
+
+
+ comment: ACMMM 2024 poster
+
+
+
+
+
+
+ ☆ Towards Popularity-Aware Recommendation: A Multi-Behavior Enhanced
+ Framework with Orthogonality Constraint
+
+
+ Top-$K$ recommendation involves inferring latent user preferences and
+generating personalized recommendations accordingly, which is now ubiquitous in
+various decision systems. Nonetheless, recommender systems usually suffer from
+severe \textit{popularity bias}, leading to the over-recommendation of popular
+items. Such a bias deviates from the central aim of reflecting user preference
+faithfully, compromising both customer satisfaction and retailer profits.
+Despite the prevalence, existing methods tackling popularity bias still have
+limitations due to the considerable accuracy-debias tradeoff and the
+sensitivity to extensive parameter selection, further exacerbated by the
+extreme sparsity in positive user-item interactions.
+ In this paper, we present a \textbf{Pop}ularity-aware top-$K$ recommendation
+algorithm integrating multi-behavior \textbf{S}ide \textbf{I}nformation
+(PopSI), aiming to enhance recommendation accuracy and debias performance
+simultaneously. Specifically, by leveraging multiple user feedback that mirrors
+similar user preferences and formulating it as a three-dimensional tensor,
+PopSI can utilize all slices to capture the desired user preferences
+effectively. Subsequently, we introduce a novel orthogonality constraint to
+refine the estimated item feature space, enforcing it to be invariant to item
+popularity features, thereby addressing our model's sensitivity to popularity
+bias. Comprehensive experiments on real-world e-commerce datasets demonstrate
+the general improvements of PopSI over state-of-the-art debias methods with a
+marginal accuracy-debias tradeoff and scalability to practical applications.
+The source code for our algorithm and experiments is available at
+\url{https://github.com/Eason-sys/PopSI}.
+
+
+
+
+
+
+
+ ☆ Effective and secure federated online learning to rank
+
+
+ Online Learning to Rank (OLTR) optimises ranking models using implicit user
+feedback, such as clicks. Unlike traditional Learning to Rank (LTR) methods
+that rely on a static set of training data with relevance judgements to learn a
+ranking model, OLTR methods update the model continually as new data arrives.
+Thus, it addresses several drawbacks such as the high cost of human
+annotations, potential misalignment between user preferences and human
+judgments, and the rapid changes in user query intents. However, OLTR methods
+typically require the collection of searchable data, user queries, and clicks,
+which poses privacy concerns for users.
+ Federated Online Learning to Rank (FOLTR) integrates OLTR within a Federated
+Learning (FL) framework to enhance privacy by not sharing raw data. While
+promising, FOLTR methods currently lag behind traditional centralised OLTR due
+to challenges in ranking effectiveness, robustness with respect to data
+distribution across clients, susceptibility to attacks, and the ability to
+unlearn client interactions and data. This thesis presents a comprehensive
+study on Federated Online Learning to Rank, addressing its effectiveness,
+robustness, security, and unlearning capabilities, thereby expanding the
+landscape of FOLTR.
+
+
+
+ comment: PhD Thesis
+
+
+
+
+
+
+ ☆ Jasper and Stella: distillation of SOTA embedding models
+
+
+ A crucial component of many deep learning applications (such as FAQ and RAG)
+is dense retrieval, in which embedding models are used to convert raw text to
+numerical vectors and then retrieve the most similar text via MIPS (Maximum Inner
+Product Search). Some text embedding benchmarks (e.g. MTEB, BEIR, and
+AIR-Bench) have been established to evaluate embedding models accurately.
+Thanks to these benchmarks, we can use SOTA models; however, the deployment and
+application of these models in industry are hampered by their large vector
+dimensions and numerous parameters. To alleviate this problem, 1) we present a
+distillation technique that can enable a smaller student model to achieve good
+performance. 2) Inspired by MRL, we present a training approach that reduces the
+vector dimensions based on the model's own vectors or its teacher's vectors. 3) We do
+simple yet effective alignment training between images and text to make our
+model a multimodal encoder. We trained Stella and Jasper models using the
+technologies above and achieved high scores on the MTEB leaderboard. We release
+the model and data at Hugging Face Hub
+(https://huggingface.co/infgrad/jasper_en_vision_language_v1) and the training
+logs are at https://api.wandb.ai/links/dunnzhang0/z8jqoqpb.
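The MRL-inspired dimension reduction mentioned above amounts to keeping only the leading coordinates of an embedding and re-normalizing. This sketch shows that truncation step only, under the assumption of Matryoshka-style embeddings; it is not the authors' training code.

```python
import math

def truncate_embedding(vec, dim):
    """Matryoshka-style dimension reduction: keep the leading `dim`
    coordinates and re-normalize to unit length so cosine similarity
    still works on the shorter vectors."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

v = [3.0, 4.0, 0.1, 0.2]
print(truncate_embedding(v, 2))  # leading 2 dims, renormalized
```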
+
+
+
+ comment: 7 pages, 1 figure
+
+
+
+
+
+
+ ♻ ☆ LLMs for Knowledge Graph Construction and Reasoning: Recent Capabilities
+ and Future Opportunities
+
+
+ This paper presents an exhaustive quantitative and qualitative evaluation of
+Large Language Models (LLMs) for Knowledge Graph (KG) construction and
+reasoning. We engage in experiments across eight diverse datasets, focusing on
+four representative tasks encompassing entity and relation extraction, event
+extraction, link prediction, and question-answering, thereby thoroughly
+exploring LLMs' performance in the domain of construction and inference.
+Empirically, our findings suggest that LLMs, represented by GPT-4, are more
+suited as inference assistants rather than few-shot information extractors.
+Specifically, while GPT-4 exhibits good performance in tasks related to KG
+construction, it excels further in reasoning tasks, surpassing fine-tuned
+models in certain cases. Moreover, our investigation extends to the potential
+generalization ability of LLMs for information extraction, leading to the
+proposition of a Virtual Knowledge Extraction task and the development of the
+corresponding VINE dataset. Based on these empirical findings, we further
+propose AutoKG, a multi-agent-based approach employing LLMs and external
+sources for KG construction and reasoning. We anticipate that this research can
+provide invaluable insights for future undertakings in the field of knowledge
+graphs. The code and datasets are in https://github.com/zjunlp/AutoKG.
+
+
+
+ comment: World Wide Web Journal
+
+
+
+
+
+
+ ♻ ☆ TableRAG: Million-Token Table Understanding with Language Models NeurIPS 2024
+
+
+ Recent advancements in language models (LMs) have notably enhanced their
+ability to reason with tabular data, primarily through program-aided mechanisms
+that manipulate and analyze tables. However, these methods often require the
+entire table as input, leading to scalability challenges due to the positional
+bias or context length constraints. In response to these challenges, we
+introduce TableRAG, a Retrieval-Augmented Generation (RAG) framework
+specifically designed for LM-based table understanding. TableRAG leverages
+query expansion combined with schema and cell retrieval to pinpoint crucial
+information before providing it to the LMs. This enables more efficient data
+encoding and precise retrieval, significantly reducing prompt lengths and
+mitigating information loss. We have developed two new million-token benchmarks
+from the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG's
+effectiveness at scale. Our results demonstrate that TableRAG's retrieval
+design achieves the highest retrieval quality, leading to the new
+state-of-the-art performance on large-scale table understanding.
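The idea above is to hand the LM only the relevant schema entries and cells instead of the whole table. As a sketch only: toy keyword matching stands in for the embedding-based schema and cell retrieval the abstract describes, and the table is made up.

```python
def retrieve_context(table, query_terms, max_cells=3):
    """Schema-and-cell retrieval sketch: return only column names and
    cells matching the query, rather than serializing the full table.
    Keyword overlap stands in for embedding retrieval here."""
    terms = {t.lower() for t in query_terms}
    schema_hits = [c for c in table["columns"] if c.lower() in terms]
    cell_hits = []
    for row in table["rows"]:
        for col, val in zip(table["columns"], row):
            if str(val).lower() in terms and len(cell_hits) < max_cells:
                cell_hits.append((col, val))
    return {"schema": schema_hits, "cells": cell_hits}

tbl = {"columns": ["city", "population"],
       "rows": [["Paris", 2100000], ["Lyon", 500000]]}
print(retrieve_context(tbl, ["population", "paris"]))
```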
+
+
+
+ comment: Accepted to NeurIPS 2024
+
+
+
+
+
+
+ ♻ ☆ When SparseMoE Meets Noisy Interactions: An Ensemble View on Denoising
+ Recommendation ICASSP 2025
+
+
+ Learning user preferences from implicit feedback is one of the core
+challenges in recommendation. The difficulty lies in the potential noise within
+implicit feedback. Therefore, various denoising recommendation methods have
+been proposed recently. However, most of them overly rely on the hyperparameter
+configurations, inevitably leading to inadequacies in model adaptability and
+generalization performance. In this study, we propose a novel Adaptive Ensemble
+Learning (AEL) for denoising recommendation, which employs a sparse gating
+network as a brain, selecting suitable experts to synthesize appropriate
+denoising capacities for different data samples. To address the ensemble
+learning shortcoming of model complexity and ensure sub-recommender diversity,
+we also proposed a novel method that stacks components to create
+sub-recommenders instead of directly constructing them. Extensive experiments
+across various datasets demonstrate that AEL outperforms others on a range of
+popular metrics, even in the presence of substantial and dynamic noise. Our
+code is available at https://github.com/cpu9xx/AEL.
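The sparse gating network above selects a few experts per sample. A generic top-k gate is the standard SparseMoE mechanism, sketched below with toy scores; the paper's exact gate design is not given here.

```python
def sparse_gate(scores, k=2):
    """Top-k sparse gating: keep the k highest-scoring experts,
    renormalize their scores to sum to 1, and zero out the rest."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return [scores[i] / total if i in top else 0.0 for i in range(len(scores))]

# Experts 1 and 2 win; their scores are renormalized, others are silenced.
print(sparse_gate([0.1, 0.6, 0.2, 0.1], k=2))
```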
+
+
+
+ comment: Accepted at ICASSP 2025. 5 pages, 4 figures
+
+
+
+
+
+
+ ♻ ☆ LEARN: Knowledge Adaptation from Large Language Model to Recommendation
+ for Practical Industrial Application AAAI 2025
+
+
+
+
+
+
+
+
+ Jian Jia, Yipei Wang, Yan Li, Honggang Chen, Xuehan Bai, Zhaocheng Liu, Jian Liang, Quan Chen, Han Li, Peng Jiang, Kun Gai
+
+
+ Contemporary recommendation systems predominantly rely on ID embedding to
+capture latent associations among users and items. However, this approach
+overlooks the wealth of semantic information embedded within textual
+descriptions of items, leading to suboptimal performance and poor
+generalization. Leveraging the capability of large language models to
+comprehend and reason about textual content presents a promising avenue for
+advancing recommendation systems. To achieve this, we propose an Llm-driven
+knowlEdge Adaptive RecommeNdation (LEARN) framework that synergizes open-world
+knowledge with collaborative knowledge. We address computational complexity
+concerns by utilizing pretrained LLMs as item encoders and freezing LLM
+parameters to avoid catastrophic forgetting and preserve open-world knowledge.
+To bridge the gap between the open-world and collaborative domains, we design a
+twin-tower structure supervised by the recommendation task and tailored for
+practical industrial application. Through experiments on the real large-scale
+industrial dataset and online A/B tests, we demonstrate the efficacy of our
+approach in industrial applications. We also achieve state-of-the-art performance
+on six Amazon Review datasets to verify the superiority of our method.
+
+
+
+ comment: Accepted by AAAI 2025. Codes are released at
+ https://github.com/adxcreative/LEARN
+
+
+
+
+
+
+ ♻ ☆ CAPER: Enhancing Career Trajectory Prediction using Temporal Knowledge
+ Graph and Ternary Relationship KDD 2025
+
+
+ The problem of career trajectory prediction (CTP) aims to predict one's
+future employer or job position. While several CTP methods have been developed
+for this problem, we posit that none of these methods (1) jointly considers the
+mutual ternary dependency between three key units (i.e., user, position, and
+company) of a career and (2) captures the characteristic shifts of key units in
+career over time, leading to an inaccurate understanding of the job movement
+patterns in the labor market. To address the above challenges, we propose a
+novel solution, named as CAPER, that solves the challenges via sophisticated
+temporal knowledge graph (TKG) modeling. It enables the utilization of a
+graph-structured knowledge base with rich expressiveness, effectively
+preserving the changes in job movement patterns. Furthermore, we devise an
+extrapolated career reasoning task on TKG for a realistic evaluation. The
+experiments on a real-world career trajectory dataset demonstrate that CAPER
+consistently and significantly outperforms four baselines, two recent TKG
+reasoning methods, and five state-of-the-art CTP methods in predicting one's
+future companies and positions--i.e., on average, yielding 6.80% and 34.58%
+more accurate predictions, respectively. The codebase of CAPER is available at
+https://github.com/Bigdasgit/CAPER.
+
+
+
+ comment: Accepted by ACM KDD 2025
+
+
+
+
+
+
+ ♻ ☆ PoTable: Programming Standardly on Table-based Reasoning Like a Human
+ Analyst
+
+
+ Table-based reasoning has garnered substantial research interest,
+particularly in its integration with Large Language Model (LLM) which has
+revolutionized the general reasoning paradigm. Numerous LLM-based studies
+introduce symbolic tools (e.g., databases, Python) as assistants to extend
+human-like abilities in structured table understanding and complex arithmetic
+computations. However, these studies could better simulate human cognitive
+behavior when using symbolic tools, as they still suffer from the limitations
+of non-standard logical splits and constrained operation pools. In
+this study, we propose PoTable as a novel table-based reasoning method that
+simulates a human tabular analyst, which integrates a Python interpreter as the
+real-time executor accompanied by an LLM-based operation planner and code
+generator. Specifically, PoTable follows a human-like logical stage split and
+extends the operation pool into an open-world space without any constraints.
+Through planning and executing in each distinct stage, PoTable standardly
+completes the entire reasoning process and produces superior reasoning results
+along with highly accurate, step-by-step commented, and completely executable
+programs. Accordingly, the effectiveness and explainability of PoTable are
+fully demonstrated. Extensive experiments over three evaluation datasets from
+two public benchmarks on two backbones show the outstanding performance of our
+approach. In particular, GPT-based PoTable achieves over 4% higher absolute
+accuracy than runner-ups on all evaluation datasets.
+
+
+ High-frequency trading (HFT) has transformed modern financial markets, making
+reliable short-term price forecasting models essential. In this study, we
+present a novel approach to mid-price forecasting using Level 1 limit order
+book (LOB) data from NASDAQ, focusing on 100 U.S. stocks from the S&P 500 index
+during the period from September to November 2022. Expanding on our previous
+work with Radial Basis Function Neural Networks (RBFNN), which leveraged
+automated feature importance techniques based on mean decrease impurity (MDI)
+and gradient descent (GD), we introduce the Adaptive Learning Policy Engine
+(ALPE) - a reinforcement learning (RL)-based agent designed for batch-free,
+immediate mid-price forecasting. ALPE incorporates adaptive epsilon decay to
+dynamically balance exploration and exploitation, outperforming a diverse range
+of highly effective machine learning (ML) and deep learning (DL) models in
+forecasting performance.
+
+
+
+
+
+
+
+ ☆ Large Language Models for Market Research: A Data-augmentation Approach
+
+
+
+
+
+
+
+
+ Mengxin Wang, Dennis J. Zhang, Heng Zhang
+
+
+ Large Language Models (LLMs) have transformed artificial intelligence by
+excelling in complex natural language processing tasks. Their ability to
+generate human-like text has opened new possibilities for market research,
+particularly in conjoint analysis, where understanding consumer preferences is
+essential but often resource-intensive. Traditional survey-based methods face
+limitations in scalability and cost, making LLM-generated data a promising
+alternative. However, while LLMs have the potential to simulate real consumer
+behavior, recent studies highlight a significant gap between LLM-generated and
+human data, with biases introduced when substituting between the two. In this
+paper, we address this gap by proposing a novel statistical data augmentation
+approach that efficiently integrates LLM-generated data with real data in
+conjoint analysis. Our method leverages transfer learning principles to debias
+the LLM-generated data using a small amount of human data. This results in
+statistically robust estimators with consistent and asymptotically normal
+properties, in contrast to naive approaches that simply substitute human data
+with LLM-generated data, which can exacerbate bias. We validate our framework
+through an empirical study on COVID-19 vaccine preferences, demonstrating its
+superior ability to reduce estimation error and save data and costs by 24.9\%
+to 79.8\%. In contrast, naive approaches fail to save data due to the inherent
+biases in LLM-generated data compared to human data. Another empirical study on
+sports car choices validates the robustness of our results. Our findings
+suggest that while LLM-generated data is not a direct substitute for human
+responses, it can serve as a valuable complement when used within a robust
+statistical framework.
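As an illustration of the general idea (not the authors' estimator), a small paired human sample can be used to estimate and remove the bias of abundant LLM-generated responses before averaging; all names and numbers below are synthetic:

```python
import random

def debiased_mean(llm_scores, human_scores, llm_paired):
    """Combine abundant LLM-generated scores with a small human sample.

    llm_paired holds LLM scores on the *same* questions as human_scores,
    so their gap estimates the LLM's bias, which we subtract from the full
    LLM sample (a simple difference estimator; the paper's estimator is
    more sophisticated).
    """
    bias = sum(llm_paired) / len(llm_paired) - sum(human_scores) / len(human_scores)
    return sum(llm_scores) / len(llm_scores) - bias

# Synthetic illustration: true preference mean 0.6, LLM biased by +0.15.
random.seed(0)
true_mean, bias = 0.6, 0.15
human = [true_mean + random.gauss(0, 0.05) for _ in range(50)]
llm_paired = [h + bias + random.gauss(0, 0.02) for h in human]
llm_big = [true_mean + bias + random.gauss(0, 0.05) for _ in range(5000)]

naive = sum(llm_big) / len(llm_big)                    # inherits the LLM bias
corrected = debiased_mean(llm_big, human, llm_paired)  # bias removed
print(abs(naive - true_mean), abs(corrected - true_mean))  # naive error >> corrected error
```

The corrected estimate uses only 50 human responses yet recovers the true mean far more accurately than averaging 5,000 biased LLM responses directly.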
+
+
+
+
+
+
+
+
+ Leonardo Gabriel Ferreira Rodrigues, Danilo Ferreira da Silva, Larissa Ferreira Rodrigues, João Fernando Mari
+
+
+ The Coronavirus Disease 2019 (COVID-19) pandemic spread rapidly across the
+globe, impacting the lives of billions of people. Effective screening of
+infected patients is a critical step in the fight against COVID-19, enabling
+treatment of patients while preventing the rapid spread of the disease. The
+need for automated and scalable methods has increased due to the unavailability
+of accurate automated toolkits. Recent research using chest X-ray images
+suggests they contain relevant information about the COVID-19 virus. Hence,
+applying machine learning techniques combined with radiological imaging
+promises to identify this disease accurately. These images are straightforward
+to collect, as they are widely shared and analyzed around the world. This paper
+presents a method for automatic COVID-19 detection using chest X-ray images
+through four convolutional neural networks, namely AlexNet, VGG-11, SqueezeNet,
+and DenseNet-121. The method provides accurate diagnostics for positive or
+negative COVID-19 classification. We validate our experiments using a ten-fold
+cross-validation procedure over the training and test sets. Our findings
+include shallow fine-tuning and data augmentation strategies that can assist in
+dealing with the low number of positive COVID-19 images publicly available. The
+accuracy of all CNNs is higher than 97.00%, and the SqueezeNet model achieved
+the best result with 99.20%.
+
+
+
+ comment: 6 pages
+
+
+
+
+
+
+ ☆ Federated Hybrid Training and Self-Adversarial Distillation: Towards
+ Robust Edge Networks
+
+
+ Federated learning (FL) is a distributed training technology that enhances
+data privacy in mobile edge networks by allowing data owners to collaborate
+without transmitting raw data to the edge server. However, data heterogeneity
+and adversarial attacks pose challenges to developing an unbiased and robust
+global model for edge deployment. To address this, we propose Federated hyBrid
+Adversarial training and self-adversarial disTillation (FedBAT), a new
+framework designed to improve both robustness and generalization of the global
+model. FedBAT seamlessly integrates hybrid adversarial training and
+self-adversarial distillation into the conventional FL framework from data
+augmentation and feature distillation perspectives. From a data augmentation
+perspective, we propose hybrid adversarial training to defend against
+adversarial attacks by balancing accuracy and robustness through a weighted
+combination of standard and adversarial training. From a feature distillation
+perspective, we introduce a novel augmentation-invariant adversarial
+distillation method that aligns local adversarial features of augmented images
+with their corresponding unbiased global clean features. This alignment can
+effectively mitigate bias from data heterogeneity while enhancing both the
+robustness and generalization of the global model. Extensive experimental
+results across multiple datasets demonstrate that FedBAT yields comparable or
+superior performance gains in improving robustness while maintaining accuracy
+compared to several baselines.
+
+
+
+
+
+
+
+ ☆ ETTA: Elucidating the Design Space of Text-to-Audio Models
+
+
+
+
+
+
+
+
+ Sang-gil Lee, Zhifeng Kong, Arushi Goel, Sungwon Kim, Rafael Valle, Bryan Catanzaro
+
+
+ Recent years have seen significant progress in Text-To-Audio (TTA) synthesis,
+enabling users to enrich their creative workflows with synthetic audio
+generated from natural language prompts. Despite this progress, the effects of
+data, model architecture, training objective functions, and sampling strategies
+on target benchmarks are not well understood. With the purpose of providing a
+holistic understanding of the design space of TTA models, we set up a
+large-scale empirical experiment focused on diffusion and flow matching models.
+Our contributions include: 1) AF-Synthetic, a large dataset of high quality
+synthetic captions obtained from an audio understanding model; 2) a systematic
+comparison of different architectural, training, and inference design choices
+for TTA models; 3) an analysis of sampling methods and their Pareto curves with
+respect to generation quality and inference speed. We leverage the knowledge
+obtained from this extensive analysis to propose our best model dubbed
+Elucidated Text-To-Audio (ETTA). When evaluated on AudioCaps and MusicCaps,
+ETTA provides improvements over the baselines trained on publicly available
+data, while being competitive with models trained on proprietary data. Finally,
+we show ETTA's improved ability to generate creative audio following complex
+and imaginative captions -- a task that is more challenging than current
+benchmarks.
+
+
+
+
+
+
+
+ ☆ On the Expressiveness and Length Generalization of Selective State-Space
+ Models on Regular Languages AAAI 2025
+
+
+
+
+
+
+
+
+ Aleksandar Terzić, Michael Hersche, Giacomo Camposampiero, Thomas Hofmann, Abu Sebastian, Abbas Rahimi
+
+
+ Selective state-space models (SSMs) are an emerging alternative to the
+Transformer, offering the unique advantage of parallel training and sequential
+inference. Although these models have shown promising performance on a variety
+of tasks, their formal expressiveness and length generalization properties
+remain underexplored. In this work, we provide insight into the workings of
+selective SSMs by analyzing their expressiveness and length generalization
+performance on regular language tasks, i.e., finite-state automaton (FSA)
+emulation. We address certain limitations of modern SSM-based architectures by
+introducing the Selective Dense State-Space Model (SD-SSM), the first selective
+SSM that exhibits perfect length generalization on a set of various regular
+language tasks using a single layer. It utilizes a dictionary of dense
+transition matrices, a softmax selection mechanism that creates a convex
+combination of dictionary matrices at each time step, and a readout consisting
+of layer normalization followed by a linear map. We then proceed to evaluate
+variants of diagonal selective SSMs by considering their empirical performance
+on commutative and non-commutative automata. We explain the experimental
+results with theoretical considerations. Our code is available at
+https://github.com/IBM/selective-dense-state-space-model.
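The recurrence described above can be sketched in a few lines of NumPy; dimensions, initialization, and the placement of the readout are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 4, 3, 6            # state dim, dictionary size, sequence length

A_dict = rng.normal(size=(K, d, d)) / np.sqrt(d)   # dictionary of dense transitions
W_sel  = rng.normal(size=(d, K))                   # selection logits from the input
W_out  = rng.normal(size=(d, d))                   # linear readout map

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sd_ssm(xs):
    """One SD-SSM-style layer: at each step, a softmax over the dictionary
    yields a convex combination of dense transition matrices (a sketch of
    the mechanism the abstract describes)."""
    h = np.zeros(d)
    outs = []
    for x in xs:
        alpha = softmax(x @ W_sel)                 # convex weights over dictionary
        A_t = np.tensordot(alpha, A_dict, axes=1)  # weighted dense transition
        h = A_t @ h + x                            # selective state update
        hn = (h - h.mean()) / (h.std() + 1e-6)     # readout: layer norm ...
        outs.append(W_out @ hn)                    # ... followed by a linear map
    return np.stack(outs)

y = sd_ssm(rng.normal(size=(T, d)))
print(y.shape)  # (6, 4)
```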
+
+
+
+ comment: 13 pages, 7 figures, to be published in AAAI 2025
+
+
+
+
+
+
+ ☆ A Reinforcement Learning-Based Task Mapping Method to Improve the
+ Reliability of Clustered Manycores
+
+
+ The increasing scale of manycore systems poses significant challenges in
+managing reliability while meeting performance demands. Simultaneously, these
+systems become more susceptible to different aging mechanisms such as
+negative-bias temperature instability (NBTI), hot carrier injection (HCI), and
+thermal cycling (TC), as well as the electromigration (EM) phenomenon. In this
+paper, we propose a reinforcement learning (RL)-based task mapping method to
+improve the reliability of manycore systems considering the aforementioned
+aging mechanisms, which consists of three steps including bin packing,
+task-to-bin mapping, and task-to-core mapping. In the initial step, a
+density-based spatial clustering of applications with noise (DBSCAN) method is
+employed to form clusters (bins) based on the cores' temperatures. Then, the
+Q-learning algorithm is used for the latter two steps, mapping each arriving
+task onto a core such that minimal thermal variation occurs among all the
+bins. Compared to state-of-the-art works, the proposed method is
+performed during runtime without requiring any parameter to be calculated
+offline. The effectiveness of the proposed technique is evaluated on 16, 32,
+and 64 cores systems using SPLASH2 and PARSEC benchmark suite applications. The
+results demonstrate up to 27% increase in the mean time to failure (MTTF)
+compared to the state-of-the-art task mapping techniques.
+
+
+
+
+
+
+
+ ☆ CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language
+ Models
+
+
+
+
+
+
+
+
+ Kiet A. Nguyen, Adheesh Juvekar, Tianjiao Yu, Muntasir Wahed, Ismini Lourentzou
+
+
+ Recent advances in Large Vision-Language Models (LVLMs) have sparked
+significant progress in general-purpose vision tasks through visual instruction
+tuning. While some works have demonstrated the capability of LVLMs to generate
+segmentation masks that align phrases with natural language descriptions in a
+single image, they struggle with segmentation-grounded comparisons across
+multiple images, particularly at finer granularities such as object parts. In
+this paper, we introduce the new task of part-focused semantic co-segmentation,
+which seeks to identify and segment common and unique objects and parts across
+images. To address this task, we present CALICO, the first LVLM that can
+segment and reason over multiple masks across images, enabling object
+comparison based on their constituent parts. CALICO features two proposed
+components, a novel Correspondence Extraction Module, which captures
+semantic-rich information to identify part-level correspondences between
+objects, and a Correspondence Adaptation Module, which embeds this information
+into the LVLM to facilitate multi-image understanding in a parameter-efficient
+manner. To support training and evaluation, we curate MixedParts, a
+comprehensive multi-image segmentation dataset containing $\sim$2.4M samples
+across $\sim$44K images with diverse object and part categories. Experimental
+results show CALICO, finetuned on only 0.3% of its architecture, achieves
+robust performance in part-focused semantic co-segmentation.
+
+
+
+
+
+
+
+ ☆ Deep learning and whole-brain networks for biomarker discovery: modeling
+ the dynamics of brain fluctuations in resting-state and cognitive tasks
+
+
+ Background: Brain network models offer insights into brain dynamics, but the
+utility of model-derived bifurcation parameters as biomarkers remains
+underexplored. Objective: This study evaluates bifurcation parameters from a
+whole-brain network model as biomarkers for distinguishing brain states
+associated with resting-state and task-based cognitive conditions. Methods:
+Synthetic BOLD signals were generated using a supercritical Hopf brain network
+model to train deep learning models for bifurcation parameter prediction.
+Inference was performed on Human Connectome Project data, including both
+resting-state and task-based conditions. Statistical analyses assessed the
+separability of brain states based on bifurcation parameter distributions.
+Results: Bifurcation parameter distributions differed significantly across task
+and resting-state conditions ($p < 0.0001$ for all but one comparison).
+Task-based brain states exhibited higher bifurcation values compared to rest.
+Conclusion: Bifurcation parameters effectively differentiate cognitive and
+resting states, warranting further investigation as biomarkers for brain state
+characterization and neurological disorder assessment.
+
+
+
+ comment: 12 pages, 4 figures, 1 table
+
+
+
+
+
+
+ ☆ Performance Control in Early Exiting to Deploy Large Models at the Same
+ Cost of Smaller Ones ICML 2024
+
+
+ Early Exiting (EE) is a promising technique for speeding up inference by
+adaptively allocating compute resources to data points based on their
+difficulty. The approach enables predictions to exit at earlier layers for
+simpler samples while reserving more computation for challenging ones. In this
+study, we first present a novel perspective on the EE approach, showing that
+larger models deployed with EE can achieve higher performance than smaller
+models while maintaining similar computational costs. As existing EE approaches
+rely on confidence estimation at each exit point, we further study the impact
+of overconfidence on the controllability of the compute-performance trade-off.
+We introduce Performance Control Early Exiting (PCEE), a method that enables
+accuracy thresholding by basing decisions not on a data point's confidence but
+on the average accuracy of samples with similar confidence levels from a
+held-out validation set. In our experiments, we show that PCEE offers a simple
+yet computationally efficient approach that provides better control over
+performance than standard confidence-based approaches, and allows us to scale
+up model sizes to yield performance gains while reducing computational cost.
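A minimal sketch of the exit rule as described, assuming a simple equal-width confidence binning (the paper's binning scheme may differ):

```python
import numpy as np

rng = np.random.default_rng(1)

# Held-out validation set for one exit point: confidence + correctness.
# Toy model: correctness probability equals the stated confidence.
val_conf = rng.uniform(0, 1, 2000)
val_correct = rng.uniform(0, 1, 2000) < val_conf

def bin_accuracy(conf, correct, n_bins=10):
    """Average accuracy of validation samples per confidence bin --
    the exit statistic PCEE uses instead of raw confidence."""
    edges = np.linspace(0, 1, n_bins + 1)
    idx = np.clip(np.digitize(conf, edges) - 1, 0, n_bins - 1)
    return np.array([correct[idx == b].mean() if (idx == b).any() else 0.0
                     for b in range(n_bins)]), edges

acc_per_bin, edges = bin_accuracy(val_conf, val_correct)

def should_exit(confidence, target_acc=0.8):
    """Exit early iff similarly confident validation samples were accurate
    enough, rather than thresholding the data point's own confidence."""
    b = min(int(confidence * 10), 9)
    return bool(acc_per_bin[b] >= target_acc)

print(should_exit(0.95), should_exit(0.10))  # True False
```

Because the decision is driven by held-out accuracy rather than raw confidence, an overconfident model no longer causes the realized accuracy to drift away from the requested target.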
+
+
+
+ comment: Appeared at ICML 2024 Workshop on Efficient Systems for Foundation
+ Models (ES-FoMo-II)
+
+
+
+
+
+
+
+ Aleksandr Podkopaev, Darren Xu, Kuang-Chih Lee
+
+
+ Conformal prediction is a valuable tool for quantifying predictive
+uncertainty of machine learning models. However, its applicability relies on
+the assumption of data exchangeability, a condition which is often not met in
+real-world scenarios. In this paper, we consider the problem of adaptive
+conformal inference without any assumptions about the data generating process.
+Existing approaches for adaptive conformal inference are based on optimizing
+the pinball loss using variants of online gradient descent. A notable
+shortcoming of such approaches is in their explicit dependence on and
+sensitivity to the choice of the learning rates. In this paper, we propose a
+different approach for adaptive conformal inference that leverages
+parameter-free online convex optimization techniques. We prove that our method
+controls long-term miscoverage frequency at a nominal level and demonstrate its
+convincing empirical performance without the need for cumbersome parameter
+tuning.
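For contrast, the learning-rate-dependent baseline the abstract refers to, online (sub)gradient steps on the pinball loss, can be sketched as follows; the explicit `lr` parameter below is exactly the sensitivity the parameter-free approach removes:

```python
import random

def quantile_track(scores, alpha=0.1, lr=0.05):
    """Online quantile tracking via subgradient steps on the pinball loss:
    raise the conformal threshold after a miss, lower it after a cover.
    Returns the realized long-run miscoverage rate."""
    q, misses = 0.0, 0
    for s in scores:
        err = 1.0 if s > q else 0.0   # 1 = prediction set missed the point
        misses += err
        q += lr * (err - alpha)       # subgradient of the pinball loss
    return misses / len(scores)

random.seed(0)
scores = [random.gauss(0, 1) for _ in range(20000)]
print(quantile_track(scores))  # close to alpha = 0.1
```

A telescoping argument shows the realized miscoverage equals alpha plus (q_T - q_0) / (lr * T), which vanishes as T grows for any bounded threshold, but the transient behavior depends heavily on the chosen `lr`.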
+
+
+
+
+
+
+
+ ☆ xSRL: Safety-Aware Explainable Reinforcement Learning -- Safety as a
+ Product of Explainability AAMAS 2025
+
+
+ Reinforcement learning (RL) has shown great promise in simulated
+environments, such as games, where failures have minimal consequences. However,
+the deployment of RL agents in real-world systems such as autonomous vehicles,
+robotics, UAVs, and medical devices demands a higher level of safety and
+transparency, particularly when facing adversarial threats. Safe RL algorithms
+have been developed to address these concerns by optimizing both task
+performance and safety constraints. However, errors are inevitable, and when
+they occur, it is essential that the RL agents can also explain their actions
+to human operators. This makes trust in the safety mechanisms of RL systems
+crucial for effective deployment. Explainability plays a key role in building
+this trust by providing clear, actionable insights into the agent's
+decision-making process, ensuring that safety-critical decisions are well
+understood. While machine learning (ML) has seen significant advances in
+interpretability and visualization, explainability methods for RL remain
+limited. Current tools fail to address the dynamic, sequential nature of RL and
+its need to balance task performance with safety constraints over time. The
+re-purposing of traditional ML methods, such as saliency maps, is inadequate
+for safety-critical RL applications where mistakes can result in severe
+consequences. To bridge this gap, we propose xSRL, a framework that integrates
+both local and global explanations to provide a comprehensive understanding of
+RL agents' behavior. xSRL also enables developers to identify policy
+vulnerabilities through adversarial attacks, offering tools to debug and patch
+agents without retraining. Our experiments and user studies demonstrate xSRL's
+effectiveness in increasing safety in RL systems, making them more reliable and
+trustworthy for real-world deployment. Code is available at
+https://github.com/risal-shefin/xSRL.
+
+
+
+ comment: Accepted to 24th International Conference on Autonomous Agents and
+ Multiagent Systems (AAMAS 2025)
+
+ Retrieval-Augmented Generation (RAG) has emerged as the dominant technique to
+provide *Large Language Models* (LLM) with fresh and relevant context,
+mitigating the risk of hallucinations and improving the overall quality of
+responses in environments with large and fast moving knowledge bases. However,
+the integration of external documents into the generation process raises
+significant privacy concerns. Indeed, when added to a prompt, it is not
+possible to guarantee a response will not inadvertently expose confidential
+data, leading to potential breaches of privacy and ethical dilemmas. This paper
+explores a practical solution to this problem suitable to general knowledge
+extraction from personal data. It shows *differentially private token
+generation* is a viable approach to private RAG.
+
+
+
+
+
+
+
+
+ Taewhan Kim, Soeun Lee, Si-Woo Kim, Dong-Jin Kim
+
+
+ Recent lightweight image captioning models using retrieved data mainly focus
+on text prompts. However, previous works utilize the retrieved text only as
+text prompts, while the visual information relies solely on the CLIP visual
+embedding. As a result, the image descriptions inherent in the prompt are not
+sufficiently reflected in the visual embedding space. To tackle this issue, we
+propose ViPCap, a novel
+retrieval text-based visual prompt for lightweight image captioning. ViPCap
+leverages the retrieved text with image information as visual prompts to
+enhance the ability of the model to capture relevant visual information. By
+mapping text prompts into the CLIP space and generating multiple randomized
+Gaussian distributions, our method leverages sampling to explore randomly
+augmented distributions and effectively retrieves the semantic features that
+contain image information. These retrieved features are integrated into the
+image and designated as the visual prompt, leading to performance improvements
+on datasets such as COCO, Flickr30k, and NoCaps. Experimental results show
+that ViPCap significantly outperforms prior lightweight captioning models in
+efficiency and effectiveness, demonstrating its potential as a plug-and-play
+solution.
+
+
+
+
+
+
+
+ ☆ Time Series Foundational Models: Their Role in Anomaly Detection and
+ Prediction AAAI2025
+
+
+ Time series foundational models (TSFM) have gained prominence in time series
+forecasting, promising state-of-the-art performance across various
+applications. However, their application in anomaly detection and prediction
+remains underexplored, with growing concerns regarding their black-box nature,
+lack of interpretability and applicability. This paper critically evaluates the
+efficacy of TSFM in anomaly detection and prediction tasks. We systematically
+analyze TSFM across multiple datasets, including those characterized by the
+absence of discernible patterns, trends and seasonality. Our analysis shows
+that while TSFMs can be extended for anomaly detection and prediction,
+traditional statistical and deep learning models often match or outperform TSFM
+in these tasks. Additionally, TSFMs require high computational resources but
+fail to capture sequential dependencies effectively or improve performance in
+few-shot or zero-shot scenarios. The preprocessed datasets, code to reproduce
+the results, and supplementary materials are available at
+https://github.com/smtmnfg/TSFM.
+
+
+
+ comment: 12 pages, 6 figures, 5 tables. Accepted at AAAI2025 Anomaly Detection
+ in Scientific Domains Workshop
+
+
+
+
+
+
+ ☆ PearSAN: A Machine Learning Method for Inverse Design using Pearson
+ Correlated Surrogate Annealing
+
+
+
+
+
+
+
+
+ Michael Bezick, Blake A. Wilson, Vaishnavi Iyer, Yuheng Chen, Vladimir M. Shalaev, Sabre Kais, Alexander V. Kildishev, Alexandra Boltasseva, Brad Lackey
+
+
+ PearSAN is a machine learning-assisted optimization algorithm applicable to
+inverse design problems with large design spaces, where traditional optimizers
+struggle. The algorithm leverages the latent space of a generative model for
+rapid sampling and employs a Pearson correlated surrogate model to predict the
+figure of merit of the true design metric. As a showcase example, PearSAN is
+applied to thermophotovoltaic (TPV) metasurface design by matching the working
+bands between a thermal radiator and a photovoltaic cell. PearSAN can work with
+any pretrained generative model with a discretized latent space, making it easy
+to integrate with VQ-VAEs and binary autoencoders. Its novel Pearson
+correlational loss can be used as both a latent regularization method, similar
+to batch and layer normalization, and as a surrogate training loss. We compare
+both to previous energy matching losses, which are shown to enforce poor
+regularization and performance, even with upgraded affine parameters. PearSAN
+achieves a state-of-the-art maximum design efficiency of 97%, and is at least
+an order of magnitude faster than previous methods, with an improved maximum
+figure-of-merit gain.
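A Pearson correlational loss of the kind named above can be sketched as follows (a generic formulation for illustration, not the authors' code):

```python
import numpy as np

def pearson_loss(pred, target):
    """Negative-shifted Pearson correlation between surrogate predictions
    and the true figure of merit: 0 when perfectly (affinely) correlated,
    2 when perfectly anti-correlated. Usable as a surrogate training loss."""
    p = pred - pred.mean()
    t = target - target.mean()
    r = (p * t).sum() / (np.sqrt((p**2).sum() * (t**2).sum()) + 1e-12)
    return 1.0 - r

fom = np.linspace(0.1, 0.9, 20)                 # true figure of merit
good = 2.0 * fom + 0.3                          # affine predictor: ~0 loss
noisy = np.random.default_rng(0).uniform(size=20)
print(pearson_loss(good, fom), pearson_loss(noisy, fom))
```

Unlike a squared-error (energy matching) loss, this objective is invariant to affine rescaling of the surrogate's outputs, which is what makes it usable both as a training loss and as a latent regularizer.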
+
+
+
+
+
+
+
+
+ Hainan Ren, Lin Li, Chun-Hao Liu, Xin Wang, Shu Hu
+
+
+ AI-synthesized voice technology has the potential to create realistic human
+voices for beneficial applications, but it can also be misused for malicious
+purposes. While existing AI-synthesized voice detection models excel in
+intra-domain evaluation, they face challenges in generalizing across different
+domains, potentially becoming obsolete as new voice generators emerge. Current
+solutions use diverse data and advanced machine learning techniques (e.g.,
+domain-invariant representation, self-supervised learning), but are limited by
+predefined vocoders and sensitivity to factors like background noise and
+speaker identity. In this work, we introduce an innovative disentanglement
+framework aimed at extracting domain-agnostic artifact features related to
+vocoders. Utilizing these features, we enhance model learning in a flat loss
+landscape, enabling escape from suboptimal solutions and improving
+generalization. Extensive experiments on benchmarks show our approach
+outperforms state-of-the-art methods, achieving up to 5.12% improvement in the
+equal error rate metric in intra-domain and 7.59% in cross-domain evaluations.
+
+
+
+ comment: AAAI25
+
+
+
+
+
+
+ ☆ Optimizing Multi-Stage Language Models for Effective Text Retrieval
+
+
+
+
+
+
+
+
+ Quang Hoang Trung, Le Trung Hoang, Nguyen Van Hoang Phuc
+
+
+ Efficient text retrieval is critical for applications such as legal document
+analysis, particularly in specialized contexts like Japanese legal systems.
+Existing retrieval methods often underperform in such domain-specific
+scenarios, necessitating tailored approaches. In this paper, we introduce a
+novel two-phase text retrieval pipeline optimized for Japanese legal datasets.
+Our method leverages advanced language models to achieve state-of-the-art
+performance, significantly improving retrieval efficiency and accuracy. To
+further enhance robustness and adaptability, we incorporate an ensemble model
+that integrates multiple retrieval strategies, resulting in superior outcomes
+across diverse tasks. Extensive experiments validate the effectiveness of our
+approach, demonstrating strong performance on both Japanese legal datasets and
+widely recognized benchmarks like MS-MARCO. Our work establishes new standards
+for text retrieval in domain-specific and general contexts, providing a
+comprehensive solution for addressing complex queries in legal and multilingual
+environments.
+
+
+
+
+
+
+
+ ☆ MEDEC: A Benchmark for Medical Error Detection and Correction in
+ Clinical Notes
+
+
+
+
+
+
+
+
+ Asma Ben Abacha, Wen-wai Yim, Yujuan Fu, Zhaoyi Sun, Meliha Yetisgen, Fei Xia, Thomas Lin
+
+
+ Several studies showed that Large Language Models (LLMs) can answer medical
+questions correctly, even outperforming the average human score in some medical
+exams. However, to our knowledge, no study has been conducted to assess the
+ability of language models to validate existing or generated medical text for
+correctness and consistency. In this paper, we introduce MEDEC
+(https://github.com/abachaa/MEDEC), the first publicly available benchmark for
+medical error detection and correction in clinical notes, covering five types
+of errors (Diagnosis, Management, Treatment, Pharmacotherapy, and Causal
+Organism). MEDEC consists of 3,848 clinical texts, including 488 clinical notes
+from three US hospital systems that were not previously seen by any LLM. The
+dataset has been used for the MEDIQA-CORR shared task to evaluate seventeen
+participating systems [Ben Abacha et al., 2024]. In this paper, we describe the
+data creation methods and we evaluate recent LLMs (e.g., o1-preview, GPT-4,
+Claude 3.5 Sonnet, and Gemini 2.0 Flash) for the tasks of detecting and
+correcting medical errors requiring both medical knowledge and reasoning
+capabilities. We also conducted a comparative study where two medical doctors
+performed the same task on the MEDEC test set. The results showed that MEDEC is
+a sufficiently challenging benchmark to assess the ability of models to
+validate existing or generated notes and to correct medical errors. We also
+found that although recent LLMs have a good performance in error detection and
+correction, they are still outperformed by medical doctors in these tasks. We
+discuss the potential factors behind this gap, the insights from our
+experiments, the limitations of current evaluation metrics, and share potential
+pointers for future research.
+
+
+ We propose novel attention architectures, Multi-matrix Factorization
+Attention (MFA) and MFA-Key-Reuse (MFA-KR). Existing variants for standard
+Multi-Head Attention (MHA), including SOTA methods like MLA, fail to maintain
+comparably strong performance under stringent Key-Value cache (KV cache)
+constraints.
+MFA enhances model capacity by efficiently scaling up both the number and
+dimension of attention heads through low-rank matrix factorization in the
+Query-Key (QK) circuit. Extending MFA, MFA-KR further reduces memory
+requirements by repurposing the key cache as value through value projection
+re-parameterization. MFA's design enables strong model capacity when working
+under tight KV cache budget, while MFA-KR is suitable for even harsher KV cache
+limits with minor performance trade-off. Notably, in our extensive and
+large-scale experiments, the proposed architecture outperforms MLA and performs
+comparably to MHA, while reducing KV cache usage by up to 56% and 93.7%,
+respectively.
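The KV cache savings being traded off here come from simple accounting; a generic sketch (configuration values are illustrative, not the paper's):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_el=2, store_values=True):
    """Per-sequence KV cache size in bytes: one key (and optionally one
    value) vector per layer, KV head, and position, at 2 bytes/element
    for fp16. Generic accounting, not the paper's exact budget model."""
    tensors = 2 if store_values else 1   # key reuse stores keys only
    return tensors * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_el

mha   = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
few   = kv_cache_bytes(n_layers=32, n_kv_heads=4,  head_dim=128, seq_len=4096)
reuse = kv_cache_bytes(n_layers=32, n_kv_heads=4,  head_dim=128, seq_len=4096,
                       store_values=False)
print(mha // 2**20, few // 2**20, reuse // 2**20)   # 2048 256 128 (MiB)
```

The cache scales linearly in KV head count and in the number of cached tensors, which is why shrinking or repurposing the value cache compounds with reducing head count.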
+
+
+ We study the problem of contextual dynamic pricing with a linear demand
+model. We propose a novel localized exploration-then-commit (LetC) algorithm
+which starts with a pure exploration stage, followed by a refinement stage that
+explores near the learned optimal pricing policy, and finally enters a pure
+exploitation stage. The algorithm is shown to achieve a minimax optimal,
+dimension-free regret bound when the time horizon exceeds a polynomial of the
+covariate dimension. Furthermore, we provide a general theoretical framework
+that encompasses the entire time spectrum, demonstrating how to balance
+exploration and exploitation when the horizon is limited. The analysis is
+powered by a novel critical inequality that depicts the
+exploration-exploitation trade-off in dynamic pricing, mirroring its existing
+counterpart for the bias-variance trade-off in regularized regression. Our
+theoretical results are validated by extensive experiments on synthetic and
+real-world data.
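The explore-then-commit skeleton, stripped of covariates and the refinement stage, can be sketched as follows (a simplified illustration under a linear demand assumption, not the LetC algorithm itself):

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, b_true = 10.0, 2.0                 # demand = a - b * price + noise

def demand(price):
    return a_true - b_true * price + rng.normal(0, 0.5)

# Explore: post random prices, observe demand, fit the linear model by OLS.
prices = rng.uniform(1.0, 4.0, 200)
obs = np.array([demand(p) for p in prices])
X = np.column_stack([np.ones_like(prices), prices])
(a_hat, neg_b_hat), *_ = np.linalg.lstsq(X, obs, rcond=None)
b_hat = -neg_b_hat

# Commit: revenue p * (a - b*p) is maximized at p* = a / (2b).
p_star = a_hat / (2 * b_hat)
print(p_star)   # close to the true optimum 10 / (2 * 2) = 2.5
```

LetC refines this skeleton by exploring only near the learned policy in an intermediate stage, which is what yields the dimension-free regret bound for long horizons.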
+
+
+
+ comment: 60 pages, 9 figures
+
+
+
+
+
+
+ ☆ Sentiment trading with large language models
+
+
+ We investigate the efficacy of large language models (LLMs) in sentiment
+analysis of U.S. financial news and their potential in predicting stock market
+returns. We analyze a dataset comprising 965,375 news articles that span from
+January 1, 2010, to June 30, 2023; we focus on the performance of various LLMs,
+including BERT, OPT, FINBERT, and the traditional Loughran-McDonald dictionary
+model, which has been a dominant methodology in the finance literature. The
+study documents a significant association between LLM scores and subsequent
+daily stock returns. Specifically, OPT, which is a GPT-3 based LLM, shows the
+highest accuracy in sentiment prediction with an accuracy of 74.4%, slightly
+ahead of BERT (72.5%) and FINBERT (72.2%). In contrast, the Loughran-McDonald
+dictionary model demonstrates considerably lower effectiveness with only 50.1%
+accuracy. Regression analyses highlight a robust positive impact of OPT model
+scores on next-day stock returns, with coefficients of 0.274 and 0.254 in
+different model specifications. BERT and FINBERT also exhibit predictive
+relevance, though to a lesser extent. Notably, we do not observe a significant
+relationship between the Loughran-McDonald dictionary model scores and stock
+returns, challenging the efficacy of this traditional method in the current
+financial context. In portfolio performance, the long-short OPT strategy excels
+with a Sharpe ratio of 3.05, compared to 2.11 for BERT and 2.07 for FINBERT
+long-short strategies. Strategies based on the Loughran-McDonald dictionary
+yield the lowest Sharpe ratio of 1.23. Our findings emphasize the superior
+performance of advanced LLMs, especially OPT, in financial market prediction
+and portfolio management, marking a significant shift in the landscape of
+financial analysis tools with implications to financial regulation and policy
+analysis.
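The long-short construction evaluated here can be sketched on synthetic data; the scores and returns below are simulated stand-ins, not the paper's news-derived LLM scores:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy daily panel: sentiment scores and next-day returns for 100 stocks over
# 250 trading days, with a weak planted link from sentiment to returns.
n_days, n_stocks = 250, 100
sentiment = rng.normal(size=(n_days, n_stocks))
returns = 0.002 * sentiment + rng.normal(0, 0.01, size=(n_days, n_stocks))

def long_short_returns(sent, rets, q=0.1):
    """Each day, go long the top sentiment decile and short the bottom decile."""
    k = int(q * sent.shape[1])
    out = np.empty(sent.shape[0])
    for t in range(sent.shape[0]):
        order = np.argsort(sent[t])
        out[t] = rets[t, order[-k:]].mean() - rets[t, order[:k]].mean()
    return out

daily = long_short_returns(sentiment, returns)
sharpe = np.sqrt(252) * daily.mean() / daily.std()   # annualized Sharpe ratio
```

A model with higher sentiment accuracy produces a stronger planted link between score and return, which is what drives the Sharpe-ratio ordering (OPT > BERT > FinBERT > dictionary) reported above.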
+
+
+
+
+
+
+
+ ☆ Latenrgy: Model Agnostic Latency and Energy Consumption Prediction for
+ Binary Classifiers
+
+
+ Machine learning systems increasingly drive innovation across scientific
+fields and industry, yet challenges in compute overhead, specifically during
+inference, limit their scalability and sustainability. Responsible AI
+guardrails, essential for ensuring fairness, transparency, and privacy, further
+exacerbate these computational demands. This study addresses critical gaps in
+the literature, chiefly the lack of generalized predictive techniques for
+latency and energy consumption, limited cross-comparisons of classifiers, and
+unquantified impacts of RAI guardrails on inference performance. Using Theory
+Construction Methodology, this work constructed a model-agnostic theoretical
+framework for predicting latency and energy consumption in binary
+classification models during inference. The framework synthesizes classifier
+characteristics, dataset properties, and RAI guardrails into a unified
+analytical instrument. Two predictive equations are derived that capture the
+interplay between these factors while offering generalizability across diverse
+classifiers. The proposed framework provides foundational insights for
+designing efficient, responsible ML systems. It enables researchers to
+benchmark and optimize inference performance and assists practitioners in
+deploying scalable solutions. Finally, this work establishes a theoretical
+foundation for balancing computational efficiency with ethical AI principles,
+paving the way for future empirical validation and broader applications.
+
+
+
+ comment: 8 pages, 2 tables
+
+
+
+
+
+
+ ☆ FineVQ: Fine-Grained User Generated Content Video Quality Assessment
+
+
+
+
+
+
+
+
+ Huiyu Duan, Qiang Hu, Jiarui Wang, Liu Yang, Zitong Xu, Lu Liu, Xiongkuo Min, Chunlei Cai, Tianxiao Ye, Xiaoyun Zhang, Guangtao Zhai
+
+
+ The rapid growth of user-generated content (UGC) videos has produced an
+urgent need for effective video quality assessment (VQA) algorithms to monitor
+video quality and guide optimization and recommendation procedures. However,
+current VQA models generally only give an overall rating for a UGC video, which
+lacks fine-grained labels for serving video processing and recommendation
+applications. To address the challenges and promote the development of UGC
+videos, we establish the first large-scale Fine-grained Video quality
+assessment Database, termed FineVD, which comprises 6104 UGC videos with
+fine-grained quality scores and descriptions across multiple dimensions. Based
+on this database, we propose a Fine-grained Video Quality assessment (FineVQ)
+model to learn the fine-grained quality of UGC videos, with the capabilities of
+quality rating, quality scoring, and quality attribution. Extensive
+experimental results demonstrate that our proposed FineVQ can produce
+fine-grained video-quality results and achieve state-of-the-art performance on
+FineVD and other commonly used UGC-VQA datasets. Both FineVD and FineVQ
+will be made publicly available.
+
+
+
+
+
+
+
+ ☆ SeaMo: A Multi-Seasonal and Multimodal Remote Sensing Foundation Model
+
+
+ Remote Sensing (RS) data contains a wealth of multi-dimensional information
+crucial for Earth observation. Owing to its vast volume, diverse sources, and
+temporal properties, RS data is highly suitable for the development of large
+Visual Foundation Models (VFMs). VFMs act as robust feature extractors,
+learning from extensive RS data, and are subsequently fine-tuned for deployment
+in various geoscientific tasks. However, current VFMs in the RS domain are
+predominantly pretrained and tailored exclusively for specific characteristics
+of RS imagery, neglecting the potential of utilizing the multi-dimensional
+properties of RS data. Therefore, in this work, we propose SeaMo, a pioneering
+visual foundation model that integrates multi-seasonal and multimodal
+information in the RS field. SeaMo is designed to harness multiple properties
+of RS data. Within the masked image modeling framework, we employ non-aligned
+cropping techniques to extract spatial properties, use multi-source inputs for
+multimodal integration, and incorporate temporal-multimodal fusion blocks for
+effective assimilation of multi-seasonal data. SeaMo explicitly models the
+multi-dimensional properties of RS data, making the model more comprehensive,
+robust, and versatile. We applied SeaMo to several downstream geoscience tasks,
+which demonstrated exceptional performance. Extensive ablation studies were
+conducted to validate the model's superiority.
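The non-aligned cropping used to extract spatial properties can be illustrated in a few lines; the crop size and offset range are illustrative choices, not SeaMo's actual settings:

```python
import numpy as np

def non_aligned_crops(img, crop=64, max_shift=16, rng=None):
    """Return two spatially offset crops of the same scene, forcing the
    model to relate shifted views rather than identical ones."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = img.shape[:2]
    y0 = rng.integers(0, h - crop - max_shift + 1)
    x0 = rng.integers(0, w - crop - max_shift + 1)
    dy, dx = rng.integers(1, max_shift + 1, size=2)   # enforce misalignment
    a = img[y0:y0 + crop, x0:x0 + crop]
    b = img[y0 + dy:y0 + dy + crop, x0 + dx:x0 + dx + crop]
    return a, b
```

In a masked-image-modeling setup, one crop would be masked and reconstructed conditioned on the other, so the learned features must encode spatial structure rather than pixel identity.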
+
+
+
+
+
+
+
+ ☆ Are Two Hidden Layers Still Enough for the Physics-Informed Neural
+ Networks?
+
+
+
+
+
+
+
+
+ Vasiliy A. Es'kin, Alexey O. Malkhanov, Mikhail E. Smorkalov
+
+
+ The article discusses the development of various methods and techniques for
+initializing and training neural networks with a single hidden layer, as well
+as training a separable physics-informed neural network consisting of neural
+networks with a single hidden layer to solve physical problems described by
+ordinary differential equations (ODEs) and partial differential equations
+(PDEs). A method for strictly deterministic initialization of a neural network
+with one hidden layer for solving physical problems described by an ODE is
+proposed. Modifications to existing methods for weighting the loss function are
+given, as well as new methods developed for training strictly
+deterministic-initialized neural networks to solve ODEs (detaching, additional
+weighting based on the second derivative, predicted solution-based weighting,
+relative residuals). An algorithm for physics-informed data-driven
+initialization of a neural network with one hidden layer is proposed. A neural
+network with pronounced generalizing properties is presented, whose
+generalizing abilities can be precisely controlled by adjusting network
+parameters. A metric for measuring the generalization of such a neural
+network has been introduced. A gradient-free neuron-by-neuron fitting method
+has been developed for adjusting the parameters of a single-hidden-layer neural
+network, which does not require the use of an optimizer or solver for its
+implementation. The proposed methods have been extended to 2D problems using
+the separable physics-informed neural networks approach. Numerous experiments
+have been carried out to develop the above methods and approaches. Experiments
+on physical problems, such as solving various ODEs and PDEs, have demonstrated
+that these methods for initializing and training neural networks with one or
+two hidden layers (SPINN) achieve competitive accuracy and, in some cases,
+state-of-the-art results.
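A minimal sketch of the single-hidden-layer, optimizer-free idea, my own construction rather than the article's exact method: with fixed hidden weights and a trial solution that enforces the initial condition, the residual of y' = -y is linear in the output weights and can be fitted in closed form by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

H = 30
w1 = rng.uniform(-2, 2, H)                 # fixed hidden weights and biases
b1 = rng.uniform(-2, 2, H)

t = np.linspace(0, 2, 100)[:, None]        # collocation points on [0, 2]
phi = np.tanh(w1 * t + b1)                 # (100, H) hidden activations
dphi = (1 - phi ** 2) * w1                 # d(phi)/dt

# Trial solution y(t) = 1 + t*N(t) enforces y(0) = 1 exactly.  The ODE
# residual y' + y = N + t*N' + 1 + t*N is linear in the output weights,
# so no optimizer or solver is needed -- just one least-squares fit.
A = phi + t * dphi + t * phi
w2, *_ = np.linalg.lstsq(A, -np.ones(len(t)), rcond=None)

y = 1 + t[:, 0] * (phi @ w2)               # approximate solution
err = np.max(np.abs(y - np.exp(-t[:, 0])))  # compare to the exact e^{-t}
```

This captures the flavor of strictly deterministic, gradient-free fitting for ODE problems, though the article's initialization and weighting schemes are more elaborate.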
+
+
+
+ comment: 45 pages, 36 figures, 9 tables
+
+
+
+
+
+
+ ☆ Virtual Nodes Can Help: Tackling Distribution Shifts in Federated Graph
+ Learning AAAI 2025
+
+
+ Federated Graph Learning (FGL) enables multiple clients to jointly train
+powerful graph learning models, e.g., Graph Neural Networks (GNNs), without
+sharing their local graph data for graph-related downstream tasks, such as
+graph property prediction. In the real world, however, the graph data can
+suffer from significant distribution shifts across clients as the clients may
+collect their graph data for different purposes. In particular, graph
+properties are usually associated with invariant label-relevant substructures
+(i.e., subgraphs) across clients, while label-irrelevant substructures can
+appear in a client-specific manner. The issue of distribution shifts of graph
+data hinders the efficiency of GNN training and leads to serious performance
+degradation in FGL. To tackle the aforementioned issue, we propose a novel FGL
+framework entitled FedVN that eliminates distribution shifts through
+client-specific graph augmentation strategies with multiple learnable Virtual
+Nodes (VNs). Specifically, FedVN lets the clients jointly learn a set of shared
+VNs while training a global GNN model. To eliminate distribution shifts, each
+client trains a personalized edge generator that determines how the VNs connect
+local graphs in a client-specific manner. Furthermore, we provide theoretical
+analyses indicating that FedVN can eliminate distribution shifts of graph data
+across clients. Comprehensive experiments on four datasets under five settings
+demonstrate the superiority of our proposed FedVN over nine baselines.
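The virtual-node augmentation can be sketched as follows; the bilinear edge generator and dense adjacency are simplified stand-ins for FedVN's learned, client-personalized components:

```python
import numpy as np

def augment_with_virtual_nodes(adj, node_feats, vn_feats, edge_weight):
    """Stack k shared virtual nodes onto a local graph with soft edges."""
    n, k = adj.shape[0], vn_feats.shape[0]
    # edge generator: sigmoid of a bilinear score between local nodes and
    # VNs (a stand-in for the personalized generator each client trains)
    scores = 1.0 / (1.0 + np.exp(-(node_feats @ edge_weight @ vn_feats.T)))
    big = np.zeros((n + k, n + k))
    big[:n, :n] = adj                      # original local graph
    big[:n, n:] = scores                   # node -> VN edges
    big[n:, :n] = scores.T                 # VN -> node edges
    return big
```

Because the VN features are shared across clients while the edge generator is client-specific, each client can route its distribution-specific substructures through the VNs differently, which is the mechanism the theoretical analysis builds on.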
+
+
+
+ comment: Accepted by AAAI 2025
+
+
+
+
+
+
+ ☆ Learning Cross-Domain Representations for Transferable Drug
+ Perturbations on Single-Cell Transcriptional Responses
+
+
+ Phenotypic drug discovery has attracted widespread attention because of its
+potential to identify bioactive molecules. Transcriptomic profiling provides a
+comprehensive reflection of phenotypic changes in cellular responses to
+external perturbations. In this paper, we propose XTransferCDR, a novel
+generative framework designed for feature decoupling and transferable
+representation learning across domains. Given a pair of perturbed expression
+profiles, our approach decouples the perturbation representations from basal
+states through domain separation encoders and then cross-transfers them in the
+latent space. The transferred representations are then used to reconstruct the
+corresponding perturbed expression profiles via a shared decoder. This
+cross-transfer constraint effectively promotes the learning of transferable
+drug perturbation representations. We conducted extensive evaluations of our
+model on multiple datasets, including single-cell transcriptional responses to
+drugs and single- and combinatorial genetic perturbations. The experimental
+results show that XTransferCDR achieved better performance than current
+state-of-the-art methods, showcasing its potential to advance phenotypic drug
+discovery.
+
+
+
+
+
+
+
+ ☆ Multi-view Fake News Detection Model Based on Dynamic Hypergraph
+
+
+ With the rapid development of online social networks and the inadequacies in
+content moderation mechanisms, the detection of fake news has emerged as a
+pressing concern for the public. Various methods have been proposed for fake
+news detection, including text-based approaches as well as a series of
+graph-based approaches. However, the deceptive nature of fake news renders
+text-based approaches less effective. Propagation tree-based methods focus on
+the propagation process of individual news, capturing pairwise relationships
+but lacking the capability to capture high-order complex relationships. Large
+heterogeneous graph-based approaches necessitate the incorporation of
+substantial additional information beyond news text and user data, while
+hypergraph-based approaches rely on predefined hypergraph structures. To tackle
+these issues, we propose a novel dynamic hypergraph-based multi-view fake news
+detection model (DHy-MFND) that learns news embeddings across three distinct
+views: text-level, propagation tree-level, and hypergraph-level. By employing
+hypergraph structures to model complex high-order relationships among multiple
+news pieces and introducing dynamic hypergraph structure learning, we optimize
+predefined hypergraph structures while learning news embeddings. Additionally,
+we introduce contrastive learning to capture authenticity-relevant embeddings
+across different views. Extensive experiments on two benchmark datasets
+demonstrate the effectiveness of our proposed DHy-MFND compared with a broad
+range of competing baselines.
+
+
+
+
+
+
+
+ ☆ VINEVI: A Virtualized Network Vision Architecture for Smart Monitoring
+ of Heterogeneous Applications and Infrastructures
+
+
+
+
+
+
+
+
+ Rodrigo Moreira, Hugo G. V. O. da Cunha, Larissa F. Rodrigues Moreira, Flávio de Oliveira Silva
+
+
+ Monitoring heterogeneous infrastructures and applications is essential to
+meeting user requirements properly, but current practice still leaves room for
+improvement. Well-known state-of-the-art methods and tools do not support
+seamless, fine-grained monitoring of bare-metal, low-cost infrastructures, nor
+of hosted or virtualized services. This work proposes the VIrtualized
+NEtwork VIsion architecture (VINEVI), an intelligent method for seamlessly
+monitoring heterogeneous infrastructures and applications. The VINEVI
+architecture advances the state of the art with a node-embedded traffic
+classification agent placed in physical and virtualized infrastructures,
+enabling real-time traffic classification. VINEVI combines this real-time traffic
+classification with well-known tools such as Prometheus and Victoria Metrics to
+monitor the entire stack from the hardware to the virtualized applications.
+Experimental results showed that the VINEVI architecture enables seamless
+monitoring of heterogeneous infrastructures at a finer level of detail than
+reported in the literature. In addition, our node-embedded real-time Internet
+traffic classifier adds flexibility to existing methods for seamlessly
+monitoring heterogeneous infrastructures.
+
+
+
+ comment: 12 pages
+
+
+
+
+
+
+ ☆ Applying the maximum entropy principle to multi-species neural networks
+ improves species distribution models
+
+
+
+
+
+
+
+
+ Maxime Ryckewaert, Diego Marcos, Christophe Botella, Maximilien Servajean, Pierre Bonnet, Alexis Joly
+
+
+ The rapid expansion of citizen science initiatives has led to a significant
+growth of biodiversity databases, and particularly presence-only (PO)
+observations. PO data are invaluable for understanding species distributions
+and their dynamics, but their use in Species Distribution Models (SDM) is
+curtailed by sampling biases and the lack of information on absences. Poisson
+point processes are widely used for SDMs, with Maxent being one of the most
+popular methods. Maxent maximises the entropy of a probability distribution
+across sites as a function of predefined transformations of environmental
+variables, called features. In contrast, neural networks and deep learning have
+emerged as a promising technique for automatic feature extraction from complex
+input variables. In this paper, we propose DeepMaxent, which harnesses neural
+networks to automatically learn shared features among species, using the
+maximum entropy principle. To do so, it employs a normalised Poisson loss where
+for each species, presence probabilities across sites are modelled by a neural
+network. We evaluate DeepMaxent on a benchmark dataset known for its spatial
+sampling biases, using PO data for calibration and presence-absence (PA) data
+for validation across six regions with different biological groups and
+environmental covariates. Our results indicate that DeepMaxent improves model
+performance over Maxent and other state-of-the-art SDMs across regions and
+taxonomic groups. The method performs particularly well in regions of uneven
+sampling, demonstrating substantial potential to improve species distribution
+modelling. The method opens the possibility of learning more robust
+environmental features by jointly predicting many species, and it scales to
+arbitrarily large numbers of sites without increased memory demand.
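A minimal reading of the normalised Poisson loss, treating the per-species normalisation over sites as a softmax (the paper's exact formulation may differ; this is my sketch of the idea):

```python
import numpy as np

def normalised_poisson_loss(logits, counts):
    """logits: (sites, species) raw scores from the shared network;
    counts: (sites, species) presence-only observation counts.
    Each species' site intensities are normalised into a probability
    distribution over sites, Maxent-style, and PO counts are scored
    against it with an average negative log-likelihood."""
    m = logits.max(axis=0, keepdims=True)            # stable log-sum-exp
    log_z = m + np.log(np.exp(logits - m).sum(axis=0, keepdims=True))
    log_p = logits - log_z                           # log site-probabilities
    per_species = -(counts * log_p).sum(axis=0) / counts.sum(axis=0)
    return per_species.mean()
```

Because the normalisation is over sites rather than species, the loss only asks the model to rank sites correctly within each species, which is what makes presence-only data usable despite missing absences.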
+
+
+
+ comment: Submitted to Methods in Ecology and Evolution
+
+
+
+
+
+
+ ☆ Optimizing Fantasy Sports Team Selection with Deep Reinforcement
+ Learning
+
+
+ Fantasy sports, particularly fantasy cricket, have garnered immense
+popularity in India in recent years, offering enthusiasts the opportunity to
+engage in strategic team-building and compete based on the real-world
+performance of professional athletes. In this paper, we address the challenge
+of optimizing fantasy cricket team selection using reinforcement learning (RL)
+techniques. By framing the team creation process as a sequential
+decision-making problem, we aim to develop a model that can adaptively select
+players to maximize the team's potential performance. Our approach leverages
+historical player data to train RL algorithms, which then predict future
+performance and optimize team composition. This not only represents a huge
+business opportunity by enabling more accurate predictions of high-performing
+teams but also enhances the overall user experience. Through empirical
+evaluation and comparison with traditional fantasy team drafting methods, we
+demonstrate the effectiveness of RL in constructing competitive fantasy teams.
+Our results show that RL-based strategies provide valuable insights into player
+selection in fantasy sports.
+
+
+
+ comment: 8 Pages including references, Accepted to CODS-COMAD 2024 conference
+
+
+
+
+
+
+ ☆ Towards Better Spherical Sliced-Wasserstein Distance Learning with
+ Data-Adaptive Discriminative Projection Direction AAAI 2025
+
+
+
+
+
+
+
+
+ Hongliang Zhang, Shuo Chen, Lei Luo, Jian Yang
+
+
+ Spherical Sliced-Wasserstein (SSW) has recently been proposed to measure the
+discrepancy between spherical data distributions in various fields, such as
+geology, medical domains, computer vision, and deep representation learning.
+However, in the original SSW, all projection directions are treated equally,
+which is too idealistic and cannot accurately reflect the importance of
+different projection directions for various data distributions. To address this
+issue, we propose a novel data-adaptive Discriminative Spherical
+Sliced-Wasserstein (DSSW) distance, which utilizes a projected energy function
+to determine the discriminative projection direction for SSW. In our new DSSW,
+we introduce two types of projected energy functions to generate the weights
+for projection directions with complete theoretical guarantees. The first type
+employs a non-parametric deterministic function that transforms the projected
+Wasserstein distance into its corresponding weight in each projection
+direction. This improves the performance of the original SSW distance with
+negligible additional computational overhead. The second type utilizes a neural
+network-induced function that learns the projection direction weight through a
+parameterized neural network based on data projections. This further enhances
+the performance of the original SSW distance with less extra computational
+overhead. Finally, we evaluate the performance of our proposed DSSW by
+comparing it with several state-of-the-art methods across a variety of machine
+learning tasks, including gradient flows, density estimation on real earth
+data, and self-supervised learning.
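The first (non-parametric, deterministic) variant can be sketched as follows; the softmax-style energy over per-direction distances is my assumption about its general shape, and the sketch uses plain Euclidean projections rather than spherical ones for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def weighted_sliced_w1(x, y, n_proj=50, temp=1.0):
    """Sliced Wasserstein-1 with data-adaptive weights over directions."""
    dirs = rng.normal(size=(n_proj, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # 1D Wasserstein-1 along each direction via sorted projections
    w1 = np.array([np.mean(np.abs(np.sort(x @ u) - np.sort(y @ u)))
                   for u in dirs])
    # deterministic energy function: directions with larger projected
    # distance (more discriminative) receive exponentially larger weight
    weights = np.exp(temp * w1)
    weights /= weights.sum()
    return float(weights @ w1)
```

Setting all weights equal recovers the uniform treatment of directions that the original SSW uses; the second DSSW variant would replace the fixed exponential energy with a small neural network over the projections.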
+
+
+
+ comment: Accepted by AAAI 2025
+
+
+
+
+
+
+ ☆ Large Language Models Meet Graph Neural Networks: A Perspective of Graph
+ Mining
+
+
+ Graph mining is an important area in data mining and machine learning that
+involves extracting valuable information from graph-structured data. In recent
+years, significant progress has been made in this field through the development
+of graph neural networks (GNNs). However, GNNs are still deficient in
+generalizing to diverse graph data. To address this issue, Large Language Models
+(LLMs) could provide new solutions for graph mining tasks with their superior
+semantic understanding. In this review, we systematically examine the
+combination and application techniques of LLMs and GNNs and present a novel
+taxonomy for research in this interdisciplinary field, which involves three
+main categories: GNN-driving-LLM, LLM-driving-GNN, and GNN-LLM-co-driving.
+Within this framework, we reveal the capabilities of LLMs in enhancing graph
+feature extraction as well as improving the effectiveness of downstream tasks
+such as node classification, link prediction, and community detection. Although
+LLMs have demonstrated their great potential in handling graph-structured data,
+their high computational requirements and complexity remain challenges. Future
+research needs to continue to explore how to efficiently fuse LLMs and GNNs to
+achieve more powerful graph learning and reasoning capabilities and provide new
+impetus for the development of graph mining techniques.
+
+
+
+
+
+
+
+ ☆ Context-Aware Deep Learning for Multi Modal Depression Detection
+
+
+ In this study, we focus on automated approaches to detect depression from
+clinical interviews using multi-modal machine learning (ML). Our approach
+differs from other successful ML methods such as context-aware analysis
+through feature engineering and end-to-end deep neural networks for depression
+detection utilizing the Distress Analysis Interview Corpus. We propose a novel
+method that incorporates: (1) pre-trained Transformer combined with data
+augmentation based on topic modelling for textual data; and (2) deep 1D
+convolutional neural network (CNN) for acoustic feature modeling. The
+simulation results demonstrate the effectiveness of the proposed method for
+training multi-modal deep learning models. Our deep 1D CNN and Transformer
+models achieved state-of-the-art performance for audio and text modalities
+respectively. Combining them in a multi-modal framework also outperforms
+state-of-the-art for the combined setting. Code available at
+https://github.com/genandlam/multi-modal-depression-detection
+
+
+
+ comment: Presented as an Oral at International Conference on Acoustics, Speech
+ and Signal Processing 2019, United Kingdom
+
+
+
+
+
+
+
+ Reza Hassanpour, Kasim Oztoprak, Niels Netten, Tony Busker, Mortaza S. Bargh, Sunil Choenni, Beyza Kizildag, Leyla Sena Kilinc
+
+
+ Machine learning models use high dimensional feature spaces to map their
+inputs to the corresponding class labels. However, these features often do not
+have a one-to-one correspondence with physical concepts understandable by
+humans, which hinders the ability to provide a meaningful explanation for the
+decisions made by these models. We propose a method for measuring the
+correlation between high-level concepts and the decisions made by a machine
+learning model. Our method can isolate the impact of a given high-level concept
+and accurately measure it quantitatively. Additionally, this study aims to
+determine the prevalence of frequent patterns in machine learning models, which
+often occur in imbalanced datasets. We have successfully applied the proposed
+method to fundus images and managed to quantitatively measure the impact of
+radiomic patterns on the model decisions.
+
+
+
+ comment: 11 pages, 8 figures, "to be published in the Journal of Computer
+ Science"
+
+
+
+
+
+
+ ☆ GAIS: A Novel Approach to Instance Selection with Graph Attention
+ Networks
+
+
+ Instance selection (IS) is a crucial technique in machine learning that aims
+to reduce dataset size while maintaining model performance. This paper
+introduces a novel method called Graph Attention-based Instance Selection
+(GAIS), which leverages Graph Attention Networks (GATs) to identify the most
+informative instances in a dataset. GAIS represents the data as a graph and
+uses GATs to learn node representations, enabling it to capture complex
+relationships between instances. The method processes data in chunks, applies
+random masking and similarity thresholding during graph construction, and
+selects instances based on confidence scores from the trained GAT model.
+Experiments on 13 diverse datasets demonstrate that GAIS consistently
+outperforms traditional IS methods in terms of effectiveness, achieving high
+reduction rates (average 96%) while maintaining or improving model
+performance. Although GAIS exhibits slightly higher computational costs, its
+superior performance in maintaining accuracy with significantly reduced
+training data makes it a promising approach for graph-based data selection.
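The selection step can be skeletonized as below, with a simple neighbour-agreement score standing in for the trained GAT's confidence (graph construction via similarity thresholding follows the description above; the scorer itself is a placeholder):

```python
import numpy as np

def select_instances(X, y, sim_threshold=0.5, keep_ratio=0.1):
    """Return indices of the instances to keep, highest-confidence first."""
    # similarity graph with thresholding, as in GAIS's graph construction
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    adj = (Xn @ Xn.T) > sim_threshold
    np.fill_diagonal(adj, False)
    # placeholder confidence score: label agreement with graph neighbours
    # (the real GAIS derives confidences from a trained GAT model)
    conf = np.zeros(len(y))
    for i in range(len(y)):
        nbrs = np.flatnonzero(adj[i])
        if nbrs.size:
            conf[i] = (y[nbrs] == y[i]).mean()
    k = max(1, int(keep_ratio * len(y)))
    return np.argsort(conf)[::-1][:k]
```

With `keep_ratio=0.04` this mirrors the average 96% reduction rate reported above; the retained subset would then be used to train the downstream model.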
+
+
+
+ comment: Accepted at ICKG 2024. Code is available at
+ https://github.com/zahiriddin-rustamov/gais
+
+
+
+
+
+
+ ☆ Provably Efficient Exploration in Reward Machines with Low Regret
+
+
+
+
+
+
+
+
+ Hippolyte Bourel, Anders Jonsson, Odalric-Ambrym Maillard, Chenxiao Ma, Mohammad Sadegh Talebi
+
+
+ We study reinforcement learning (RL) for decision processes with
+non-Markovian reward, in which high-level knowledge of the task in the form of
+reward machines is available to the learner. We consider probabilistic reward
+machines with initially unknown dynamics, and investigate RL under the
+average-reward criterion, where the learning performance is assessed through
+the notion of regret. Our main algorithmic contribution is a model-based RL
+algorithm for decision processes involving probabilistic reward machines that
+is capable of exploiting the structure induced by such machines. We further
+derive high-probability and non-asymptotic bounds on its regret and demonstrate
+the gain in terms of regret over existing algorithms that could be applied, but
+obliviously to the structure. We also present a regret lower bound for the
+studied setting. To the best of our knowledge, the proposed algorithm
+constitutes the first attempt to tailor and analyze regret specifically for RL
+with probabilistic reward machines.
+
+
+
+ comment: 35 pages
+
+
+
+
+
+
+ ☆ Biology Instructions: A Dataset and Benchmark for Multi-Omics Sequence
+ Understanding Capability of Large Language Models
+
+
+ Large language models have already demonstrated their formidable capabilities
+in general domains, ushering in a revolutionary transformation. However,
+exploring and exploiting the extensive knowledge of these models to comprehend
+multi-omics biology remains underexplored. To fill this research gap, we first
+introduce Biology-Instructions, the first large-scale multi-omics biological
+sequences-related instruction-tuning dataset including DNA, RNA, proteins, and
+multi-molecules, designed to bridge the gap between large language models
+(LLMs) and complex biological sequences-related tasks. This dataset can enhance
+the versatility of LLMs by integrating diverse biological sequence-based
+prediction tasks with advanced reasoning capabilities, while maintaining
+conversational fluency. Additionally, we reveal significant performance
+limitations in even state-of-the-art LLMs on biological sequence-related
+multi-omics tasks without specialized pre-training and instruction-tuning. We
+further develop a strong baseline called ChatMultiOmics with a novel
+three-stage training pipeline, demonstrating the powerful ability to understand
+biology by using Biology-Instructions. Biology-Instructions and ChatMultiOmics
+are publicly available and crucial resources for enabling more effective
+integration of LLMs with multi-omics sequence analysis.
+
+
+
+
+
+
+
+ ☆ An End-to-End Depth-Based Pipeline for Selfie Image Rectification
+
+
+
+
+
+
+
+
+ Ahmed Alhawwary, Phong Nguyen-Ha, Janne Mustaniemi, Janne Heikkilä
+
+
+ Portraits or selfie images taken from a close distance typically suffer from
+perspective distortion. In this paper, we propose an end-to-end deep
+learning-based rectification pipeline to mitigate the effects of perspective
+distortion. We learn to predict the facial depth by training a deep CNN. The
+estimated depth is utilized to adjust the camera-to-subject distance by moving
+the camera farther, increasing the camera focal length, and reprojecting the 3D
+image features to the new perspective. The reprojected features are then fed to
+an inpainting module to fill in the missing pixels. We leverage a
+differentiable renderer to enable end-to-end training of our depth estimation
+and feature extraction nets to improve the rectified outputs. To boost the
+results of the inpainting module, we incorporate an auxiliary module to predict
+the horizontal movement of the camera which decreases the area that requires
+hallucination of challenging face parts such as ears. Unlike previous works, we
+process the full-frame input image at once without cropping the subject's face
+and processing it separately from the rest of the body, eliminating the need
+for complex post-processing steps to attach the face back to the subject's
+body. To train our network, we utilize the popular game engine Unreal Engine to
+generate a large synthetic face dataset containing various subjects, head
+poses, expressions, eyewear, clothes, and lighting. Quantitative and
+qualitative results show that our rectification pipeline outperforms previous
+methods, and produces comparable results with a time-consuming 3D GAN-based
+method while being more than 260 times faster.
+
+
+
+
+
+
+
+ ☆ Mask Approximation Net: Merging Feature Extraction and Distribution
+ Learning for Remote Sensing Change Captioning
+
+
+ Remote sensing image change description, as a novel multimodal task in the
+field of remote sensing processing, not only enables the detection of changes
+in surface conditions but also provides detailed descriptions of these changes,
+thereby enhancing human interpretability and interactivity. However, previous
+methods mainly employed Convolutional Neural Network (CNN) architectures to
+extract bitemporal image features. This approach often leads to an overemphasis
+on designing specific network architectures and limits the captured feature
+distributions to the current dataset, resulting in poor generalizability and
+robustness when applied to other datasets or real-world scenarios. To address
+these limitations, this paper proposes a novel approach for remote sensing
+image change detection and description that integrates diffusion models, aiming
+to shift the focus from conventional feature learning paradigms to data
+distribution learning. The proposed method primarily includes a simple
+multi-scale change detection module, whose output features are subsequently
+refined using a diffusion model. Additionally, we introduce a frequency-guided
+complex filter module to handle high-frequency noise during the diffusion
+process, which helps to maintain model performance. Finally, we validate the
+effectiveness of our proposed method on several remote sensing change detection
+description datasets, demonstrating its superior performance. The code is
+available at MaskApproxNet.
+
+
+
+
+
+
+
+ ☆ Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal
+ Video-Text Retrieval
+
+
+ Cross-modal (e.g. image-text, video-text) retrieval is an important task in
+information retrieval and multimodal vision-language understanding field.
+Temporal understanding makes video-text retrieval more challenging than
+image-text retrieval. However, we find that the widely used video-text
+benchmarks have shortcomings in comprehensively assessing the abilities of
+models, especially temporal understanding, such that large-scale image-text
+pre-trained models can already achieve zero-shot performance comparable to
+that of video-text pre-trained models. In this paper, we introduce RTime, a novel
+temporal-emphasized video-text retrieval dataset. We first obtain videos of
+actions or events with significant temporality, and then reverse these videos
+to create harder negative samples. We then recruit annotators to judge the
+significance and reversibility of candidate videos, and write captions for
+qualified videos. We further adopt GPT-4 to extend more captions based on
+human-written captions. Our RTime dataset currently consists of 21k videos with
+10 captions per video, totalling about 122 hours. Based on RTime, we propose
+three retrieval benchmark tasks: RTime-Origin, RTime-Hard, and RTime-Binary. We
+further enhance the use of hard negatives in model training, and benchmark a
+variety of video-text models on RTime. Extensive experimental analysis shows
+that RTime indeed poses new and higher challenges to video-text retrieval. We
+release our RTime
+dataset\footnote{\url{https://github.com/qyr0403/Reversed-in-Time}} to further
+advance video-text retrieval and multimodal understanding research.
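The reversal-based hard-negative construction described above is simple to state in code. A minimal sketch (function and variable names are illustrative, not from the released dataset tooling):

```python
# Reversing a temporally significant clip produces a hard negative: it shows
# the same objects as the original but contradicts the caption's action
# direction, so a retrieval model must rely on temporal order to tell them apart.
def make_hard_negative(frames):
    """Return the clip with its frame order reversed."""
    return frames[::-1]

clip = ["open_door_frame_0", "open_door_frame_1", "open_door_frame_2"]
hard_negative = make_hard_negative(clip)
# hard_negative plays the action backwards ("closing" rather than "opening").
```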
+
+
+
+ comment: ACMMM 2024 poster
+
+
+
+
+
+
+ ☆ Dual Channel Multi-Attention in ViT for Biometric Authentication using
+ Forehead Subcutaneous Vein Pattern and Periocular Pattern
+
+
+ Traditional biometric systems, such as face and fingerprint recognition, have
+encountered significant setbacks: face masks partially cover the face, and
+fingerprint recognition raises hygiene concerns. To meet these challenges,
+this paper proposes a novel dual-channel multi-attention Vision Transformer
+(ViT) framework for biometric authentication using forehead subcutaneous vein
+patterns and periocular patterns, offering a promising alternative to
+traditional methods that performs well even with face masks and without any
+physical touch.
+The proposed framework leverages a dual-channel ViT architecture, designed to
+handle two distinct biometric traits. It can capture long-range dependencies of
+independent features from the vein and periocular patterns. A custom classifier
+is then designed to integrate the independently extracted features, producing a
+final class prediction. The performance of the proposed algorithm was
+rigorously evaluated using the Forehead Subcutaneous Vein Pattern and
+Periocular Biometric Pattern (FSVP-PBP) database. The results demonstrated the
+superiority of the algorithm over state-of-the-art methods, achieving
+remarkable classification accuracy of $99.3 \pm 0.02\%$ with the combined vein
+and periocular patterns.
+
+
+
+
+
+
+
+ ☆ To Predict or Not To Predict? Proportionally Masked Autoencoders for
+ Tabular Data Imputation
+
+
+
+
+
+
+
+
+ Jungkyu Kim, Kibok Lee, Taeyoung Park
+
+
+ Masked autoencoders (MAEs) have recently demonstrated effectiveness in
+tabular data imputation. However, due to the inherent heterogeneity of tabular
+data, the uniform random masking strategy commonly used in MAEs can disrupt the
+distribution of missingness, leading to suboptimal performance. To address
+this, we propose a proportional masking strategy for MAEs. Specifically, we
+first compute the statistics of missingness based on the observed proportions
+in the dataset, and then generate masks that align with these statistics,
+ensuring that the distribution of missingness is preserved after masking.
+Furthermore, we argue that simple MLP-based token mixing offers competitive or
+often superior performance compared to attention mechanisms while being more
+computationally efficient, especially in the tabular domain with its inherent
+heterogeneity. Experimental results validate the effectiveness of the proposed
+proportional masking strategy across various missing data patterns in tabular
+datasets. Code is available at: \url{https://github.com/normal-kim/PMAE}.
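The proportional masking step can be sketched in a few lines: estimate each column's observed missingness rate, then draw masks whose per-column rates match those statistics rather than a single uniform rate. A minimal illustration (names are ours, not from the released code):

```python
import numpy as np

def proportional_mask(missing_indicator, rng):
    """missing_indicator: (n, d) boolean array, True where a value is missing.

    Draw a mask whose per-column rates follow the observed missingness
    statistics instead of one uniform global rate.
    """
    col_rates = missing_indicator.mean(axis=0)   # observed per-column missingness
    u = rng.random(missing_indicator.shape)
    return u < col_rates                          # mask matches column statistics

rng = np.random.default_rng(0)
miss = np.zeros((1000, 3), dtype=bool)
miss[:300, 0] = True                              # column 0: 30% missing
miss[:50, 2] = True                               # column 2: 5% missing
mask = proportional_mask(miss, rng)
# mask[:, 0] is True about 30% of the time, mask[:, 1] never, mask[:, 2] ~5%
```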
+
+
+
+
+
+
+
+ ☆ PlanLLM: Video Procedure Planning with Refinable Large Language Models AAAI2025
+
+
+ Video procedure planning, i.e., planning a sequence of action steps given the
+video frames of start and goal states, is an essential ability for embodied AI.
+Recent works utilize Large Language Models (LLMs) to generate enriched action
+step description texts to guide action step decoding. Although LLMs are
+introduced, these methods decode the action steps into a closed set of one-hot
+vectors, limiting the model's capability of generalizing to new steps or tasks.
+Additionally, fixed action step descriptions based on world-level commonsense
+may contain noise in specific instances of visual states. In this paper, we
+propose PlanLLM, a cross-modal joint learning framework with LLMs for video
+procedure planning. We propose an LLM-Enhanced Planning module which fully uses
+the generalization ability of LLMs to produce free-form planning output and to
+enhance action step decoding. We also propose Mutual Information Maximization
+module to connect world-level commonsense of step descriptions and
+sample-specific information of visual states, enabling LLMs to employ the
+reasoning ability to generate step sequences. With the assistance of LLMs, our
+method can handle both closed-set and open-vocabulary procedure planning
+tasks. Our
+PlanLLM achieves superior performance on three benchmarks, demonstrating the
+effectiveness of our designs.
+
+
+
+ comment: accepted to AAAI2025
+
+
+
+
+
+
+ ☆ SUTrack: Towards Simple and Unified Single Object Tracking AAAI 2025
+
+
+
+
+
+
+
+
+ Xin Chen, Ben Kang, Wanting Geng, Jiawen Zhu, Yi Liu, Dong Wang, Huchuan Lu
+
+
+ In this paper, we propose a simple yet unified single object tracking (SOT)
+framework, dubbed SUTrack. It consolidates five SOT tasks (RGB-based,
+RGB-Depth, RGB-Thermal, RGB-Event, RGB-Language Tracking) into a unified model
+trained in a single session. Due to the distinct nature of the data, current
+methods typically design individual architectures and train separate models for
+each task. This fragmentation results in redundant training processes,
+repetitive technological innovations, and limited cross-modal knowledge
+sharing. In contrast, SUTrack demonstrates that a single model with a unified
+input representation can effectively handle various common SOT tasks,
+eliminating the need for task-specific designs and separate training sessions.
+Additionally, we introduce a task-recognition auxiliary training strategy and a
+soft token type embedding to further enhance SUTrack's performance with minimal
+overhead. Experiments show that SUTrack outperforms previous task-specific
+counterparts across 11 datasets spanning five SOT tasks. Moreover, we provide a
+range of models catering to edge devices as well as high-performance GPUs,
+striking a good trade-off between speed and accuracy. We hope SUTrack could
+serve as a strong foundation for further compelling research into unified
+tracking models. Code and models are available at
+github.com/chenxin-dlut/SUTrack.
+
+
+
+ comment: Accepted by AAAI 2025
+
+
+
+
+
+
+ ♻ ☆ Simplified and Generalized Masked Diffusion for Discrete Data NeurIPS 2024
+
+
+ Masked (or absorbing) diffusion is actively explored as an alternative to
+autoregressive models for generative modeling of discrete data. However,
+existing work in this area has been hindered by unnecessarily complex model
+formulations and unclear relationships between different perspectives, leading
+to suboptimal parameterization, training objectives, and ad hoc adjustments to
+counteract these issues. In this work, we aim to provide a simple and general
+framework that unlocks the full potential of masked diffusion models. We show
+that the continuous-time variational objective of masked diffusion models is a
+simple weighted integral of cross-entropy losses. Our framework also enables
+training generalized masked diffusion models with state-dependent masking
+schedules. When evaluated by perplexity, our models trained on OpenWebText
+surpass prior diffusion language models at GPT-2 scale and demonstrate superior
+performance on 4 out of 5 zero-shot language modeling tasks. Furthermore, our
+models vastly outperform previous discrete diffusion models on pixel-level
+image modeling, achieving 2.75 (CIFAR-10) and 3.40 (ImageNet 64x64) bits per
+dimension, better than autoregressive models of similar sizes. Our code
+is available at https://github.com/google-deepmind/md4.
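The central identity, that the continuous-time ELBO is a weighted integral of cross-entropy losses, can be illustrated with a toy Monte Carlo discretization. We assume the common linear schedule alpha(t) = 1 - t, under which a token is masked with probability t and the per-level weight is 1/t; all names are illustrative and this is not the released MD4 code:

```python
import math
import random

def cross_entropy(probs, target):
    """Negative log-probability assigned to the true token."""
    return -math.log(probs[target])

def masked_diffusion_loss(predict, tokens, n_samples, rng):
    """Monte Carlo estimate of the weighted integral of cross-entropy losses.

    Linear schedule alpha(t) = 1 - t: each token is masked with probability t,
    and the per-level weight is 1 / t.
    """
    total = 0.0
    for _ in range(n_samples):
        t = rng.uniform(1e-3, 1.0)                       # masking level
        masked = [i for i in range(len(tokens)) if rng.random() < t]
        ce = sum(cross_entropy(predict(i), tokens[i]) for i in masked)
        total += ce / t                                   # schedule weight
    return total / n_samples

rng = random.Random(0)
tokens = [0, 1, 0, 1]
uniform_predictor = lambda i: {0: 0.5, 1: 0.5}            # knows nothing
loss = masked_diffusion_loss(uniform_predictor, tokens, 200, rng)
# with a uniform predictor each masked token costs log 2, so the estimate
# concentrates around len(tokens) * log 2
```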
+
+
+
+ comment: NeurIPS 2024. Code is available at:
+ https://github.com/google-deepmind/md4
+
+
+
+
+
+
+ ♻ ☆ Solving High-dimensional Inverse Problems Using Amortized
+ Likelihood-free Inference with Noisy and Incomplete Data
+
+
+
+
+
+
+
+
+ Jice Zeng, Yuanzhe Wang, Alexandre M. Tartakovsky, David Barajas-Solano
+
+
+ We present a likelihood-free probabilistic inversion method based on
+normalizing flows for high-dimensional inverse problems. The proposed method is
+composed of two complementary networks: a summary network for data compression
+and an inference network for parameter estimation. The summary network encodes
+raw observations into a fixed-size vector of summary features, while the
+inference network generates samples of the approximate posterior distribution
+of the model parameters based on these summary features. The posterior samples
+are produced in a deep generative fashion by sampling from a latent Gaussian
+distribution and passing these samples through an invertible transformation. We
+construct this invertible transformation by sequentially alternating
+conditional invertible neural network and conditional neural spline flow
+layers. The summary and inference networks are trained simultaneously. We apply
+the proposed method to an inversion problem in groundwater hydrology to
+estimate the posterior distribution of the log-conductivity field conditioned
+on spatially sparse time-series observations of the system's hydraulic head
+responses. The conductivity field is represented with 706 degrees of freedom
+in the considered problem. The comparison with the likelihood-based iterative
+ensemble smoother PEST-IES method demonstrates that the proposed method
+accurately estimates the parameter posterior distribution and the observations'
+predictive posterior distribution at a fraction of the inference time of
+PEST-IES.
+
+
+
+
+
+
+
+
+ Yibo Yang, Justus C. Will, Stephan Mandt
+
+
+ Diffusion probabilistic models have achieved mainstream success in many
+generative modeling tasks, from image generation to inverse problem solving. A
+distinct feature of these models is that they correspond to deep hierarchical
+latent variable models optimizing a variational evidence lower bound (ELBO) on
+the data likelihood. Drawing on a basic connection between likelihood modeling
+and compression, we explore the potential of diffusion models for progressive
+coding, resulting in a sequence of bits that can be incrementally transmitted
+and decoded with progressively improving reconstruction quality. Unlike prior
+work based on Gaussian diffusion or conditional diffusion models, we propose a
+new form of diffusion model with uniform noise in the forward process, whose
+negative ELBO corresponds to the end-to-end compression cost using universal
+quantization. We obtain promising first results on image compression, achieving
+competitive rate-distortion and rate-realism results on a wide range of
+bit-rates with a single model, bringing neural codecs a step closer to
+practical deployment.
+
+
+
+
+
+
+
+
+ Pan Wang, Qiang Zhou, Yawen Wu, Tianlong Chen, Jingtong Hu
+
+
+ Multimodal Sentiment Analysis (MSA) leverages heterogeneous modalities, such
+as language, vision, and audio, to enhance the understanding of human
+sentiment. While existing models often focus on extracting shared information
+across modalities or directly fusing heterogeneous modalities, such approaches
+can introduce redundancy and conflicts due to equal treatment of all modalities
+and the mutual transfer of information between modality pairs. To address these
+issues, we propose a Disentangled-Language-Focused (DLF) multimodal
+representation learning framework, which incorporates a feature disentanglement
+module to separate modality-shared and modality-specific information. To
+further reduce redundancy and enhance language-targeted features, four
+geometric measures are introduced to refine the disentanglement process. A
+Language-Focused Attractor (LFA) is further developed to strengthen language
+representation by leveraging complementary modality-specific information
+through a language-guided cross-attention mechanism. The framework also employs
+hierarchical predictions to improve overall accuracy. Extensive experiments on
+two popular MSA datasets, CMU-MOSI and CMU-MOSEI, demonstrate the significant
+performance gains achieved by the proposed DLF framework. Comprehensive
+ablation studies further validate the effectiveness of the feature
+disentanglement module, language-focused attractor, and hierarchical
+predictions. Our code is available at https://github.com/pwang322/DLF.
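The language-guided cross-attention at the core of the Language-Focused Attractor can be reduced to a single-head sketch in which language features form the queries and a modality-specific stream supplies the keys and values (no learned projections here; purely illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def language_guided_attention(lang, other):
    """lang: (L, d) language features; other: (T, d) modality-specific features.

    Language queries attend over the other modality's keys/values, and the
    attended values are added back into the language stream.
    """
    d = lang.shape[-1]
    weights = softmax(lang @ other.T / np.sqrt(d))   # (L, T), language-driven
    return lang + weights @ other                     # residual attraction

rng = np.random.default_rng(0)
lang = rng.standard_normal((4, 8))                    # language tokens
audio = rng.standard_normal((6, 8))                   # modality-specific stream
out = language_guided_attention(lang, audio)          # strengthened language rep
```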
+
+
+
+ comment: AAAI 2025 accepted
+
+
+
+
+
+
+ ♻ ☆ LMFusion: Adapting Pretrained Language Models for Multimodal Generation
+
+
+
+
+
+
+
+
+ Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, Lili Yu
+
+
+ We present LMFusion, a framework for empowering pretrained text-only large
+language models (LLMs) with multimodal generative capabilities, enabling them
+to understand and generate both text and images in arbitrary sequences.
+LMFusion leverages the existing weights of Llama-3 for processing texts
+autoregressively while introducing additional and parallel transformer modules
+for processing images with diffusion. During training, the data from each
+modality is routed to its dedicated modules: modality-specific feedforward
+layers, query-key-value projections, and normalization layers process each
+modality independently, while the shared self-attention layers allow
+interactions across text and image features. By freezing the text-specific
+modules and only training the image-specific modules, LMFusion preserves the
+language capabilities of text-only LLMs while developing strong visual
+understanding and generation abilities. Compared to methods that pretrain
+multimodal generative models from scratch, our experiments demonstrate that
+LMFusion improves image understanding by 20% and image generation by 3.6% using
+only 50% of the FLOPs while maintaining Llama-3's language capabilities. We
+also demonstrate that this framework can adapt existing vision-language models
+with multimodal generation ability. Overall, this framework not only leverages
+existing computational investments in text-only LLMs but also enables the
+parallel development of language and vision capabilities, presenting a
+promising direction for efficient multimodal model development.
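The modality routing described above can be sketched in one toy layer: all tokens share a single self-attention map, while each token's feedforward transform is selected by its modality. The real framework uses full transformer modules and freezes the text branch; this numpy reduction only illustrates the routing:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fused_layer(tokens, modality, W_text, W_image):
    """tokens: (n, d); modality: length-n list of 'text' or 'image' labels."""
    attn = softmax(tokens @ tokens.T / np.sqrt(tokens.shape[1]))
    mixed = attn @ tokens                          # shared self-attention
    out = np.empty_like(mixed)
    for i, mod in enumerate(modality):
        W = W_text if mod == "text" else W_image   # route to modality-specific FFN
        out[i] = mixed[i] @ W
    return out

rng = np.random.default_rng(0)
toks = rng.standard_normal((5, 8))
labels = ["text", "text", "image", "image", "text"]
out = fused_layer(toks, labels, W_text=np.eye(8), W_image=0.5 * np.eye(8))
```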
+
+
+
+ comment: Name change: LlamaFusion to LMFusion
+
+
+
+
+
+
+ ♻ ☆ LLMs for Knowledge Graph Construction and Reasoning: Recent Capabilities
+ and Future Opportunities
+
+
+ This paper presents an exhaustive quantitative and qualitative evaluation of
+Large Language Models (LLMs) for Knowledge Graph (KG) construction and
+reasoning. We engage in experiments across eight diverse datasets, focusing on
+four representative tasks encompassing entity and relation extraction, event
+extraction, link prediction, and question-answering, thereby thoroughly
+exploring LLMs' performance in the domain of construction and inference.
+Empirically, our findings suggest that LLMs, represented by GPT-4, are better
+suited as inference assistants than as few-shot information extractors.
+Specifically, while GPT-4 exhibits good performance in tasks related to KG
+construction, it excels further in reasoning tasks, surpassing fine-tuned
+models in certain cases. Moreover, our investigation extends to the potential
+generalization ability of LLMs for information extraction, leading to the
+proposition of a Virtual Knowledge Extraction task and the development of the
+corresponding VINE dataset. Based on these empirical findings, we further
+propose AutoKG, a multi-agent-based approach employing LLMs and external
+sources for KG construction and reasoning. We anticipate that this research can
+provide invaluable insights for future undertakings in the field of knowledge
+graphs. The code and datasets are available at https://github.com/zjunlp/AutoKG.
+
+
+
+ comment: World Wide Web Journal
+
+
+
+
+
+
+ ♻ ☆ Rapid and Power-Aware Learned Optimization for Modular Receive
+ Beamforming
+
+
+ Multiple-input multiple-output (MIMO) systems play a key role in wireless
+communication technologies. A widely considered approach to realize scalable
+MIMO systems involves architectures comprised of multiple separate modules,
+each with its own beamforming capability. Such models accommodate cell-free
+massive MIMO and partially connected hybrid MIMO architectures. A core issue
+with the implementation of modular MIMO arises from the need to rapidly set the
+beampatterns of the modules, while maintaining their power efficiency. This
+leads to a challenging constrained optimization problem that must be solved
+repeatedly in each coherence interval. In this work, we propose a power-oriented
+optimization algorithm for beamforming in uplink modular hybrid MIMO systems,
+which learns from data to operate rapidly. We derive our learned optimizer by
+tackling the rate maximization objective using projected gradient ascent steps
+with momentum. We then leverage data to tune the hyperparameters of the
+optimizer, allowing it to operate reliably in a fixed and small number of
+iterations while completely preserving its interpretable operation. We show how
+power efficient beamforming can be encouraged by the learned optimizer, via
+boosting architectures with low-resolution phase shifts and with deactivated
+analog components. Numerical results show that our learn-to-optimize method
+notably reduces the number of iterations and computation latency required to
+reliably tune modular MIMO receivers, and that it allows obtaining desirable
+balances between power efficient designs and throughput.
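The learn-to-optimize template above, a fixed small number of projected gradient ascent steps with momentum whose per-iteration step sizes are treated as tunable hyperparameters, can be sketched on a toy quadratic surrogate of the rate objective (hand-set step sizes here; in the paper they are fitted from data):

```python
import numpy as np

def project_unit_ball(x):
    """Project onto the unit-power constraint set."""
    n = np.linalg.norm(x)
    return x / n if n > 1 else x

def learned_pga(R, steps, beta=0.9):
    """Fixed-length projected gradient ascent with momentum on x^T R x."""
    x = np.ones(R.shape[0]) / np.sqrt(R.shape[0])
    m = np.zeros_like(x)
    for mu in steps:                        # one tunable step size per iteration
        grad = 2.0 * R @ x
        m = beta * m + grad                 # heavy-ball momentum
        x = project_unit_ball(x + mu * m)   # enforce the power constraint
    return x

R = np.diag([3.0, 1.0, 0.5])                # toy stand-in for the rate objective
x = learned_pga(R, steps=[0.1] * 20)
# x aligns with the leading eigenvector, so x @ R @ x approaches 3
```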
+
+
+
+ comment: Under review for possible publication in the IEEE
+
+ Statistical Taylor expansion replaces the input precise variables in a
+conventional Taylor expansion with random variables each with known
+distribution, to calculate the result mean and deviation. It is based on the
+uncorrelated uncertainty assumption: Each input variable is measured
+independently with fine enough statistical precision, so that their
+uncertainties are independent of each other. Statistical Taylor expansion
+reviews that the intermediate analytic expressions can no longer be regarded as
+independent of each other, and the result of analytic expression should be path
+independent. This conclusion differs fundamentally from the conventional common
+approach in applied mathematics to find the best execution path for a result.
+This paper also presents an implementation of statistical Taylor expansion
+called variance arithmetic, and the tests on variance arithmetic.
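A first-order sketch of the idea: under the uncorrelated uncertainty assumption, the result mean is the function evaluated at the input means, and the result deviation combines the partial-derivative terms in quadrature. This toy version uses numerical derivatives and is not the paper's full variance-arithmetic implementation:

```python
import math

def propagate(f, means, devs, h=1e-6):
    """First-order mean and deviation of f under uncorrelated input uncertainty."""
    mean = f(*means)
    var = 0.0
    for i, (mu, s) in enumerate(zip(means, devs)):
        shifted = list(means)
        shifted[i] = mu + h
        dfdx = (f(*shifted) - mean) / h      # numerical partial derivative
        var += (dfdx * s) ** 2               # uncorrelated inputs: quadrature sum
    return mean, math.sqrt(var)

# f(x, y) = x * y with x = 2 +/- 0.1 and y = 3 +/- 0.2:
m, d = propagate(lambda x, y: x * y, [2.0, 3.0], [0.1, 0.2])
# m is 6.0 and d is close to sqrt((3 * 0.1)**2 + (2 * 0.2)**2) = 0.5
```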
+
+
+
+ comment: 65 pages, 53 figures
+
+
+
+
+
+
+ ♻ ☆ Decentralized Sparse Linear Regression via Gradient-Tracking: Linear
+ Convergence and Statistical Guarantees
+
+
+ We study sparse linear regression over a network of agents, modeled as an
+undirected graph and no server node. The estimation of the $s$-sparse parameter
+is formulated as a constrained LASSO problem wherein each agent owns a subset
+of the $N$ total observations. We analyze the convergence rate and statistical
+guarantees of a distributed projected gradient tracking-based algorithm under
+high-dimensional scaling, allowing the ambient dimension $d$ to grow with (and
+possibly exceed) the sample size $N$. Our theory shows that, under standard
+notions of restricted strong convexity and smoothness of the loss functions,
+suitable conditions on the network connectivity and algorithm tuning, the
+distributed algorithm converges globally at a {\it linear} rate to an estimate
+that is within the centralized {\it statistical precision} of the model,
+$O(s\log d/N)$. When $s\log d/N=o(1)$, a condition necessary for statistical
+consistency, an $\varepsilon$-optimal solution is attained after
+$\mathcal{O}(\kappa \log (1/\varepsilon))$ gradient computations and $O
+(\kappa/(1-\rho) \log (1/\varepsilon))$ communication rounds, where $\kappa$ is
+the restricted condition number of the loss function and $\rho$ measures the
+network connectivity. The computation cost matches that of the centralized
+projected gradient algorithm despite having data distributed; whereas the
+communication rounds reduce as the network connectivity improves. Overall, our
+study reveals interesting connections between statistical efficiency, network
+connectivity \& topology, and convergence rate in high dimensions.
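The analyzed iteration can be sketched on a toy two-agent least-squares instance: each agent mixes with its neighbors through a doubly stochastic matrix W, steps along its gradient tracker, projects onto the l1 ball, and updates the tracker with the change in its local gradient (illustrative only; not the paper's code):

```python
import numpy as np

def project_l1(v, radius):
    """Euclidean projection onto the l1 ball of the given radius."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, v.size + 1) > css - radius)[0][-1]
    theta = (css[k] - radius) / (k + 1)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def gradient_tracking(A_list, b_list, W, radius, step, iters):
    """Distributed projected gradient tracking for constrained least squares."""
    n, d = len(A_list), A_list[0].shape[1]
    x = np.zeros((n, d))
    g = np.array([A.T @ (A @ xi - b) for A, b, xi in zip(A_list, b_list, x)])
    y = g.copy()                                     # trackers start at local grads
    for _ in range(iters):
        x_new = np.array([project_l1(v, radius) for v in W @ x - step * y])
        g_new = np.array([A.T @ (A @ xi - b)
                          for A, b, xi in zip(A_list, b_list, x_new)])
        y = W @ y + g_new - g                        # track the average gradient
        x, g = x_new, g_new
    return x

rng = np.random.default_rng(0)
x_true = np.array([1.0, 0.0])                        # sparse ground truth
A_list = [rng.standard_normal((5, 2)) for _ in range(2)]
b_list = [A @ x_true for A in A_list]
W = np.full((2, 2), 0.5)                             # two fully connected agents
x = gradient_tracking(A_list, b_list, W, radius=1.0, step=0.01, iters=3000)
# both agents agree and recover the sparse truth [1, 0]
```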
+
+
+
+ comment: The order of the first three authors is alphabetic. Final revised
+ version
+
+
+
+
+
+
+ ♻ ☆ TableRAG: Million-Token Table Understanding with Language Models NeurIPS 2024
+
+
+ Recent advancements in language models (LMs) have notably enhanced their
+ability to reason with tabular data, primarily through program-aided mechanisms
+that manipulate and analyze tables. However, these methods often require the
+entire table as input, leading to scalability challenges due to positional
+bias or context-length constraints. In response to these challenges, we
+introduce TableRAG, a Retrieval-Augmented Generation (RAG) framework
+specifically designed for LM-based table understanding. TableRAG leverages
+query expansion combined with schema and cell retrieval to pinpoint crucial
+information before providing it to the LMs. This enables more efficient data
+encoding and precise retrieval, significantly reducing prompt lengths and
+mitigating information loss. We have developed two new million-token benchmarks
+from the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG's
+effectiveness at scale. Our results demonstrate that TableRAG's retrieval
+design achieves the highest retrieval quality, leading to the new
+state-of-the-art performance on large-scale table understanding.
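The retrieval step can be sketched as follows: expand the question into several schema and cell queries, score candidate snippets, and keep only the top-k for the prompt instead of serializing the whole table. Token-overlap scoring stands in for a learned embedder, and all names are illustrative rather than TableRAG's API:

```python
def score(query, snippet):
    """Token-overlap similarity (a stand-in for a learned embedder)."""
    q, s = set(query.lower().split()), set(snippet.lower().split())
    return len(q & s) / max(len(q), 1)

def retrieve(queries, snippets, k):
    """Keep the k snippets best matched by any expanded query."""
    ranked = sorted(snippets,
                    key=lambda sn: max(score(q, sn) for q in queries),
                    reverse=True)
    return ranked[:k]

schema = ["column: country (text)", "column: year (int)", "column: revenue (float)"]
cells = ["country = France", "country = Brazil", "year = 2019", "revenue = 10.5"]
queries = ["What was the revenue of France?", "revenue", "France"]  # expansion
context = retrieve(queries, schema, 2) + retrieve(queries, cells, 2)
# context holds only the schema entries and cells relevant to the question
```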
+
+
+
+ comment: Accepted to NeurIPS 2024
+
+
+
+
+
+
+ ♻ ☆ Reviving Life on the Edge: Joint Score-Based Graph Generation of Rich
+ Edge Attributes
+
+
+ Graph generation is integral to various engineering and scientific
+disciplines. Nevertheless, existing methodologies tend to overlook the
+generation of edge attributes. However, we identify critical applications where
+edge attributes are essential, making prior methods potentially unsuitable in
+such contexts. Moreover, while trivial adaptations are available, empirical
+investigations reveal their limited efficacy as they do not properly model the
+interplay among graph components. To address this, we propose a joint
+score-based model of nodes and edges for graph generation that considers all
+graph components. Our approach offers three key novelties: \textbf{(1)} node
+and edge attributes are combined in an attention module that generates samples
+based on the two ingredients, \textbf{(2)} node, edge and adjacency information
+are mutually dependent during the graph diffusion process, and \textbf{(3)} the
+framework enables the generation of graphs with rich attributes along the
+edges, providing a more expressive formulation for generative tasks than
+existing works. We evaluate our method on challenging benchmarks involving
+real-world and synthetic datasets in which edge features are crucial.
+Additionally, we introduce a new synthetic dataset that incorporates edge
+values. Furthermore, we propose a novel application that greatly benefits from
+the method due to its nature: the generation of traffic scenes represented as
+graphs. Our method outperforms other graph generation methods, demonstrating a
+significant advantage in edge-related measures.
+
+
+
+
+
+
+
+ ♻ ☆ AutoMMLab: Automatically Generating Deployable Models from Language
+ Instructions for Computer Vision Tasks AAAI2025
+
+
+
+
+
+
+
+
+ Zekang Yang, Wang Zeng, Sheng Jin, Chen Qian, Ping Luo, Wentao Liu
+
+
+ Automated machine learning (AutoML) is a collection of techniques designed to
+automate the machine learning development process. While traditional AutoML
+approaches have been successfully applied in several critical steps of model
+development (e.g. hyperparameter optimization), there is no AutoML system
+that automates the entire end-to-end model production workflow for computer
+vision. To fill this gap, we propose a novel request-to-model task, which
+involves understanding the user's natural language request and executing the
+entire workflow to output production-ready models. This empowers non-expert
+individuals to easily build task-specific models via a user-friendly language
+interface. To facilitate development and evaluation, we develop a new
+experimental platform called AutoMMLab and a new benchmark called LAMP for
+studying key components in the end-to-end request-to-model pipeline.
+Hyperparameter optimization (HPO) is one of the most important components for
+AutoML. Traditional approaches mostly rely on trial-and-error, leading to
+inefficient parameter search. To solve this problem, we propose a novel
+LLM-based HPO algorithm, called HPO-LLaMA. Equipped with extensive knowledge
+and experience in model hyperparameter tuning, HPO-LLaMA achieves significant
+improvement of HPO efficiency. Dataset and code are available at
+https://github.com/yang-ze-kang/AutoMMLab.
+
+
+
+
+
+
+
+
+ Youwei Huang, Sen Fang, Jianwen Li, Jiachun Tao, Bin Hu, Tao Zhang
+
+
+ In recent years, research in software security has concentrated on
+identifying vulnerabilities in smart contracts to prevent significant losses of
+crypto assets on blockchains. Despite early successes in this area, detecting
+developers' intents in smart contracts has become a more pressing issue, as
+malicious intents have caused substantial financial losses. Unfortunately,
+existing research lacks effective methods for detecting development intents in
+smart contracts.
+ To address this gap, we propose \textsc{SmartIntentNN} (Smart Contract Intent
+Neural Network), a deep learning model designed to automatically detect
+development intents in smart contracts. \textsc{SmartIntentNN} leverages a
+pre-trained sentence encoder to generate contextual representations of smart
+contracts, employs a K-means clustering model to identify and highlight
+prominent intent features, and utilizes a bidirectional LSTM-based deep neural
+network for multi-label classification.
+ We trained and evaluated \textsc{SmartIntentNN} on a dataset containing over
+40,000 real-world smart contracts, employing self-comparison baselines in our
+experimental setup. The results show that \textsc{SmartIntentNN} achieves an
+F1-score of 0.8633 in identifying intents across 10 distinct categories,
+outperforming all baselines and addressing the gap in smart contract detection
+by incorporating intent analysis.
+
+
+
+ comment: 12 pages, 8 figures, conference
+
+
+
+
+
+
+ ♻ ☆ Differential privacy enables fair and accurate AI-based analysis of
+ speech disorders while protecting patient data
+
+
+
+
+
+
+
+
+ Soroosh Tayebi Arasteh, Mahshad Lotfinia, Paula Andrea Perez-Toro, Tomas Arias-Vergara, Mahtab Ranji, Juan Rafael Orozco-Arroyave, Maria Schuster, Andreas Maier, Seung Hee Yang
+
+
+ Speech pathology has impacts on communication abilities and quality of life.
+While deep learning-based models have shown potential in diagnosing these
+disorders, the use of sensitive data raises critical privacy concerns. Although
+differential privacy (DP) has been explored in the medical imaging domain, its
+application in pathological speech analysis remains largely unexplored despite
+the equally critical privacy concerns. This study is the first to investigate
+DP's impact on pathological speech data, focusing on the trade-offs between
+privacy, diagnostic accuracy, and fairness. Using a large, real-world dataset
+of 200 hours of recordings from 2,839 German-speaking participants, we observed
+a maximum accuracy reduction of 3.85% when training with DP with high privacy
+levels. To highlight real-world privacy risks, we demonstrated the
+vulnerability of non-private models to explicit gradient inversion attacks,
+reconstructing identifiable speech samples and showcasing DP's effectiveness in
+mitigating these risks. To generalize our findings across languages and
+disorders, we validated our approach on a dataset of Spanish-speaking
+Parkinson's disease patients, leveraging pretrained models from healthy
+English-speaking datasets, and demonstrated that careful pretraining on
+large-scale task-specific datasets can maintain favorable accuracy under DP
+constraints. A comprehensive fairness analysis revealed minimal gender bias at
+reasonable privacy levels but underscored the need for addressing age-related
+disparities. Our results establish that DP can balance privacy and utility in
+speech disorder detection, while highlighting unique challenges in
+privacy-fairness trade-offs for speech data. This provides a foundation for
+refining DP methodologies and improving fairness across diverse patient groups
+in real-world deployments.
+
+
+
+
+
+
+
+ ♻ ☆ Automatic and effective discovery of quantum kernels
+
+
+
+
+
+
+
+
+ Massimiliano Incudini, Daniele Lizzio Bosco, Francesco Martini, Michele Grossi, Giuseppe Serra, Alessandra Di Pierro
+
+
+ Quantum computing can empower machine learning models by enabling kernel
+machines to leverage quantum kernels for representing similarity measures
+between data. Quantum kernels are able to capture relationships in the data
+that are not efficiently computable on classical devices. However, there is no
+straightforward method to engineer the optimal quantum kernel for each specific
+use case. We present an approach to this problem, which employs optimization
+techniques, similar to those used in neural architecture search and AutoML, to
+automatically find an optimal kernel in a heuristic manner. To this purpose we
+define an algorithm for constructing a quantum circuit implementing the
+similarity measure as a combinatorial object, which is evaluated based on a
+cost function and then iteratively modified using a meta-heuristic optimization
+technique. The cost function can encode many criteria ensuring favorable
+statistical properties of the candidate solution, such as the rank of the
+Dynamical Lie Algebra. Importantly, our approach is independent of the
+optimization technique employed. The results obtained by testing our approach
+on a high-energy physics problem demonstrate that, in the best-case scenario,
+we can either match or improve testing accuracy with respect to the manual
+design approach, showing the potential of our technique to deliver superior
+results with reduced effort.
+
+
+
+ comment: Accepted into IEEE Transactions on Emerging Topics in Computational
+ Intelligence
+
+
+
+
+
+
+ ♻ ☆ Active Reinforcement Learning Strategies for Offline Policy Improvement AAAI 2025
+
+
+ Learning agents that excel at sequential decision-making tasks must
+continuously resolve the problem of exploration and exploitation for optimal
+learning. However, such interactions with the environment online might be
+prohibitively expensive and may involve some constraints, such as a limited
+budget for agent-environment interactions and restricted exploration in certain
+regions of the state space. Examples include selecting candidates for medical
+trials and training agents in complex navigation environments. This problem
+necessitates the study of active reinforcement learning strategies that collect
+minimal additional experience trajectories by reusing existing offline data
+previously collected by some unknown behavior policy. In this work, we propose
+an active reinforcement learning method capable of collecting trajectories that
+can augment existing offline data. With extensive experimentation, we
+demonstrate that our proposed method reduces additional online interaction with
+the environment by up to 75% over competitive baselines across various
+continuous control environments such as Gym-MuJoCo locomotion environments as
+well as Maze2d, AntMaze, CARLA and IsaacSimGo1. To the best of our knowledge,
+this is the first work that addresses the active learning problem in the
+context of sequential decision-making and reinforcement learning.
+
+
+
+ comment: AAAI 2025
+
+
+
+
+
+
+ ♻ ☆ Regularized Adaptive Momentum Dual Averaging with an Efficient Inexact
+ Subproblem Solver for Training Structured Neural Network NeurIPS 2024
+
+
+ We propose a Regularized Adaptive Momentum Dual Averaging (RAMDA) algorithm
+for training structured neural networks. Similar to existing regularized
+adaptive methods, the subproblem for computing the update direction of RAMDA
+involves a nonsmooth regularizer and a diagonal preconditioner, and therefore
+does not possess a closed-form solution in general. We thus also carefully
+devise an implementable inexactness condition that retains convergence
+guarantees similar to the exact versions, and propose a companion efficient
+solver for the subproblems of both RAMDA and existing methods to make them
+practically feasible. We leverage the theory of manifold identification in
+variational analysis to show that, even in the presence of such inexactness,
+the iterates of RAMDA attain the ideal structure induced by the regularizer at
+the stationary point of asymptotic convergence. This structure is locally
+optimal near the point of convergence, so RAMDA is guaranteed to obtain the
+best structure possible among all methods converging to the same point, making
+it the first regularized adaptive method outputting models that possess
+outstanding predictive performance while being (locally) optimally structured.
+Extensive numerical experiments in large-scale modern computer vision, language
+modeling, and speech tasks show that the proposed RAMDA is efficient and
+consistently outperforms the state of the art for training structured neural
+networks. Implementation of our algorithm is available at
+https://www.github.com/ismoptgroup/RAMDA/.
+
+
+
+ comment: NeurIPS 2024. 25 pages
+
+
+
+
+
+
+
+
+
+ Multimedia 8
+
+
+
+
+
+ ☆ FineVQ: Fine-Grained User Generated Content Video Quality Assessment
+
+
+
+
+
+
+
+
+ Huiyu Duan, Qiang Hu, Jiarui Wang, Liu Yang, Zitong Xu, Lu Liu, Xiongkuo Min, Chunlei Cai, Tianxiao Ye, Xiaoyun Zhang, Guangtao Zhai
+
+
+ The rapid growth of user-generated content (UGC) videos has produced an
+urgent need for effective video quality assessment (VQA) algorithms to monitor
+video quality and guide optimization and recommendation procedures. However,
+current VQA models generally only give an overall rating for a UGC video, which
+lacks fine-grained labels for serving video processing and recommendation
+applications. To address the challenges and promote the development of UGC
+videos, we establish the first large-scale Fine-grained Video quality
+assessment Database, termed FineVD, which comprises 6104 UGC videos with
+fine-grained quality scores and descriptions across multiple dimensions. Based
+on this database, we propose a Fine-grained Video Quality assessment (FineVQ)
+model to learn the fine-grained quality of UGC videos, with the capabilities of
+quality rating, quality scoring, and quality attribution. Extensive
+experimental results demonstrate that our proposed FineVQ can produce
+fine-grained video-quality results and achieve state-of-the-art performance on
+FineVD and other commonly used UGC-VQA datasets. Both FineVD and FineVQ
+will be made publicly available.
+
+
+
+
+
+
+
+ ☆ PlanLLM: Video Procedure Planning with Refinable Large Language Models AAAI2025
+
+
+ Video procedure planning, i.e., planning a sequence of action steps given the
+video frames of start and goal states, is an essential ability for embodied AI.
+Recent works utilize Large Language Models (LLMs) to generate enriched action
+step description texts to guide action step decoding. Although LLMs are
+introduced, these methods decode the action steps into a closed-set of one-hot
+vectors, limiting the model's capability of generalizing to new steps or tasks.
+Additionally, fixed action step descriptions based on world-level commonsense
+may contain noise in specific instances of visual states. In this paper, we
+propose PlanLLM, a cross-modal joint learning framework with LLMs for video
+procedure planning. We propose an LLM-Enhanced Planning module which fully uses
+the generalization ability of LLMs to produce free-form planning output and to
+enhance action step decoding. We also propose Mutual Information Maximization
+module to connect world-level commonsense of step descriptions and
+sample-specific information of visual states, enabling LLMs to employ the
+reasoning ability to generate step sequences. With the assistance of LLMs, our
+method can handle both closed-set and open-vocabulary procedure planning tasks. Our
+PlanLLM achieves superior performance on three benchmarks, demonstrating the
+effectiveness of our designs.
+
+
+
+ comment: accepted to AAAI2025
+
+
+
+
+
+
+ ☆ A Rhetorical Relations-Based Framework for Tailored Multimedia Document
+ Summarization
+
+
+ In the rapidly evolving landscape of digital content, the task of summarizing
+multimedia documents, which encompass textual, visual, and auditory elements,
+presents intricate challenges. These challenges include extracting pertinent
+information from diverse formats, maintaining the structural integrity and
+semantic coherence of the original content, and generating concise yet
+informative summaries. This paper introduces a novel framework for multimedia
+document summarization that capitalizes on the inherent structure of the
+document to craft coherent and succinct summaries. Central to this framework is
+the incorporation of a rhetorical structure for structural analysis, augmented
+by a graph-based representation to facilitate the extraction of pivotal
+information. Weighting algorithms are employed to assign significance values to
+document units, thereby enabling effective ranking and selection of relevant
+content. Furthermore, the framework is designed to accommodate user preferences
+and time constraints, ensuring the production of personalized and contextually
+relevant summaries. The summarization process is elaborately delineated,
+encompassing document specification, graph construction, unit weighting, and
+summary extraction, supported by illustrative examples and algorithmic
+elucidation. This proposed framework represents a significant advancement in
+automatic summarization, with broad potential applications across multimedia
+document processing, promising transformative impacts in the field.
+
+
+
+ comment: 10 pages, preprint
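The ranking-and-selection step described in the abstract above can be sketched as a greedy budgeted selection. The unit/weight representation below is a hypothetical stand-in for the paper's weighting algorithms and time-constraint handling, not its actual formulation.

```python
def select_units(units, weights, budget):
    """Greedy selection of document units under a summary-length budget.

    units are (text, cost) pairs; weights[i] is the hypothetical significance
    score a weighting algorithm assigned to unit i. Picks highest-weight units
    whose cumulative cost fits the user's budget (a stand-in for the paper's
    user-preference and time-constraint handling).
    """
    order = sorted(range(len(units)), key=lambda i: weights[i], reverse=True)
    chosen, total = [], 0
    for i in order:
        _text, cost = units[i]
        if total + cost <= budget:
            chosen.append(i)
            total += cost
    return sorted(chosen)  # restore document order for coherence
```

For example, with units costing 5, 3, and 4 words, weights 1, 3, and 2, and a budget of 7, the two highest-weight units fit and the lowest-weight one is dropped.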
+
+
+
+
+
+
+ ☆ CoheDancers: Enhancing Interactive Group Dance Generation through
+ Music-Driven Coherence Decomposition
+
+
+ Dance generation is crucial and challenging, particularly in domains like
+dance performance and virtual gaming. In the current body of literature, most
+methodologies focus on Solo Music2Dance. While there are efforts directed
+towards Group Music2Dance, these often suffer from a lack of coherence,
+resulting in aesthetically poor dance performances. Thus, we introduce
+CoheDancers, a novel framework for Music-Driven Interactive Group Dance
+Generation. CoheDancers aims to enhance group dance generation coherence by
+decomposing it into three key aspects: synchronization, naturalness, and
+fluidity. Correspondingly, we develop a Cycle Consistency based Dance
+Synchronization strategy to foster music-dance correspondences, an
+Auto-Regressive-based Exposure Bias Correction strategy to enhance the fluidity
+of the generated dances, and an Adversarial Training Strategy to augment the
+naturalness of the group dance output. Collectively, these strategies enable
+CoheDancers to produce highly coherent group dances with superior quality.
+Furthermore, to establish better benchmarks for Group Music2Dance, we construct
+the most diverse and comprehensive open-source dataset to date, I-Dancers,
+featuring rich dancer interactions, and create comprehensive evaluation
+metrics. Experimental evaluations on I-Dancers and other extant datasets
+substantiate that CoheDancers achieves unprecedented state-of-the-art
+performance. Code will be released.
+
+
+
+
+
+
+
+ ☆ FACEMUG: A Multimodal Generative and Fusion Framework for Local Facial
+ Editing
+
+
+ Existing facial editing methods have achieved remarkable results, yet they
+often fall short in supporting multimodal conditional local facial editing. One
+telling sign is that their output image quality degrades
+dramatically after several iterations of incremental editing, as they do not
+support local editing. In this paper, we present a novel multimodal generative
+and fusion framework for globally-consistent local facial editing (FACEMUG)
+that can handle a wide range of input modalities and enable fine-grained and
+semantic manipulation while leaving unedited parts unchanged. Different
+modalities, including sketches, semantic maps, color maps, exemplar images,
+text, and attribute labels, are adept at conveying diverse conditioning
+details, and their combined synergy can provide more explicit guidance for the
+editing process. We thus integrate all modalities into a unified generative
+latent space to enable multimodal local facial edits. Specifically, a novel
+multimodal feature fusion mechanism is proposed by utilizing multimodal
+aggregation and style fusion blocks to fuse facial priors and multimodalities
+in both latent and feature spaces. We further introduce a novel self-supervised
+latent warping algorithm to rectify misaligned facial features, efficiently
+transferring the pose of the edited image to the given latent codes. We
+evaluate our FACEMUG through extensive experiments and comparisons to
+state-of-the-art (SOTA) methods. The results demonstrate the superiority of
+FACEMUG in terms of editing quality, flexibility, and semantic control, making
+it a promising solution for a wide range of local facial editing tasks.
+
+
+
+ comment: Published at IEEE Transactions on Visualization and Computer
+ Graphics; 21 pages, 26 figures
+
+
+
+
+
+
+
+ Pan Wang, Qiang Zhou, Yawen Wu, Tianlong Chen, Jingtong Hu
+
+
+ Multimodal Sentiment Analysis (MSA) leverages heterogeneous modalities, such
+as language, vision, and audio, to enhance the understanding of human
+sentiment. While existing models often focus on extracting shared information
+across modalities or directly fusing heterogeneous modalities, such approaches
+can introduce redundancy and conflicts due to equal treatment of all modalities
+and the mutual transfer of information between modality pairs. To address these
+issues, we propose a Disentangled-Language-Focused (DLF) multimodal
+representation learning framework, which incorporates a feature disentanglement
+module to separate modality-shared and modality-specific information. To
+further reduce redundancy and enhance language-targeted features, four
+geometric measures are introduced to refine the disentanglement process. A
+Language-Focused Attractor (LFA) is further developed to strengthen language
+representation by leveraging complementary modality-specific information
+through a language-guided cross-attention mechanism. The framework also employs
+hierarchical predictions to improve overall accuracy. Extensive experiments on
+two popular MSA datasets, CMU-MOSI and CMU-MOSEI, demonstrate the significant
+performance gains achieved by the proposed DLF framework. Comprehensive
+ablation studies further validate the effectiveness of the feature
+disentanglement module, language-focused attractor, and hierarchical
+predictions. Our code is available at https://github.com/pwang322/DLF.
+
+
+
+ comment: AAAI 2025 accepted
+
+
+
+
+
+
+ ♻ ☆ Read, Watch and Scream! Sound Generation from Text and Video AAAI2025
+
+
+
+
+
+
+
+
+ Yujin Jeong, Yunji Kim, Sanghyuk Chun, Jiyoung Lee
+
+
+ Despite the impressive progress of multimodal generative models,
+video-to-audio generation still suffers from limited performance and lacks the
+flexibility to prioritize sound synthesis for specific objects within the
+scene. Conversely, text-to-audio generation methods generate high-quality audio
+but pose challenges in ensuring comprehensive scene depiction and time-varying
+control. To tackle these challenges, we propose a novel video-and-text-to-audio
+generation method, called ReWaS, where video serves as a conditional control
+for a text-to-audio generation model. In particular, our method estimates the
+structural information of sound (namely, energy) from the video while receiving
+key content cues from a user prompt. We employ a well-performing text-to-audio
+model to consolidate the video control, which is much more efficient for
+training multimodal diffusion models with massive triplet-paired
+(audio-video-text) data. In addition, by separating the generative components
+of audio, it becomes a more flexible system that allows users to freely adjust
+the energy, surrounding environment, and primary sound source according to
+their preferences. Experimental results demonstrate that our method shows
+superiority in terms of quality, controllability, and training efficiency. Code
+and demo are available at https://naver-ai.github.io/rewas.
+
+
+ The Human Visual System (HVS), with its intricate sophistication, is capable
+of achieving ultra-compact information compression for visual signals. This
+remarkable ability is coupled with high generalization capability and energy
+efficiency. By contrast, the state-of-the-art Versatile Video Coding (VVC)
+standard achieves a compression ratio of around 1,000 times for raw visual
+data. This notable disparity motivates the research community to draw
+inspiration from the HVS to handle the immense volume of visual data in a green
+way. Therefore, this paper provides a survey of how visual data can be
+efficiently represented for green multimedia, in particular when the ultimate
+task is knowledge extraction instead of visual signal reconstruction. We
+introduce recent research efforts that promote green, sustainable, and
+efficient multimedia in this field. Moreover, we discuss how the deep
+understanding of the HVS can benefit the research community, and envision the
+development of future green multimedia technologies.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Computation and Language 40
+
+
+
+
+
+ ☆ MedHallBench: A New Benchmark for Assessing Hallucination in Medical
+ Large Language Models AAAI-25
+
+
+ Medical Large Language Models (MLLMs) have demonstrated potential in
+healthcare applications, yet their propensity for hallucinations -- generating
+medically implausible or inaccurate information -- presents substantial risks
+to patient care. This paper introduces MedHallBench, a comprehensive benchmark
+framework for evaluating and mitigating hallucinations in MLLMs. Our
+methodology integrates expert-validated medical case scenarios with established
+medical databases to create a robust evaluation dataset. The framework employs
+a sophisticated measurement system that combines automated ACHMI (Automatic
+Caption Hallucination Measurement in Medical Imaging) scoring with rigorous
+clinical expert evaluations and utilizes reinforcement learning methods to
+achieve automatic annotation. Through an optimized reinforcement learning from
+human feedback (RLHF) training pipeline specifically designed for medical
+applications, MedHallBench enables thorough evaluation of MLLMs across diverse
+clinical contexts while maintaining stringent accuracy standards. We conducted
+comparative experiments involving various models, utilizing the benchmark to
+establish a baseline for widely adopted large language models (LLMs). Our
+findings indicate that ACHMI provides a more nuanced understanding of the
+effects of hallucinations compared to traditional metrics, thereby highlighting
+its advantages in hallucination assessment. This research establishes a
+foundational framework for enhancing MLLMs' reliability in healthcare settings
+and presents actionable strategies for addressing the critical challenge of AI
+hallucinations in medical applications.
+
+
+
+ comment: Published to AAAI-25 Bridge Program
+
+
+
+
+
+
+ ☆ Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference
+
+
+ Due to the high resource demands of Large Language Models (LLMs), achieving
+widespread deployment on consumer-grade devices presents significant
+challenges. Typically, personal or consumer-grade devices, including servers
+configured prior to the era of large-scale models, generally have relatively
+weak GPUs and relatively strong CPUs. However, most current methods primarily
+depend on GPUs for computation. Therefore, we propose Dovetail, an approach
+that deploys the draft model on the GPU to generate draft tokens while allowing
+the target model to perform parallel verification on the CPU, thereby improving
+the utilization of all available hardware resources and occupying less
+inter-device communication bandwidth. Accordingly, we have redesigned the draft
+model to better align with heterogeneous hardware characteristics. To this end,
+we implemented several optimizations: reducing the number of draft tokens to
+mitigate latency in parallel verification, increasing the depth of the draft
+model to enhance its predictive capacity, and introducing DGF (Dynamic Gating
+Fusion) to improve the integration of features and token embeddings. In the
+HumanEval benchmark, Dovetail achieved an inference speed of 5.86 tokens per
+second for LLaMA2-Chat-7B using 3GB of VRAM, representing an approximately
+2.77x improvement over CPU-only inference. Furthermore, the inference speed was
+increased to 8 tokens per second when utilizing 7GB of VRAM.
+
+
+
+ comment: 9 pages, 7 figures
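Dovetail's CPU/GPU split builds on the standard speculative-decoding accept/reject loop: the draft model proposes tokens and the target model verifies them. The sketch below shows only that generic loop (not Dovetail's DGF module or heterogeneous scheduling); the probability callables are illustrative assumptions, and the target-model resampling step on rejection is omitted.

```python
import random

def speculative_decode(target_prob, draft_prob, draft_tokens, rng=random.Random(0)):
    """Generic accept/reject loop of speculative decoding.

    target_prob(prefix, tok) and draft_prob(prefix, tok) return the two models'
    probabilities for `tok` given `prefix`; draft_tokens is the draft model's
    proposal. Returns the accepted prefix of the proposal.
    """
    accepted = []
    for tok in draft_tokens:
        p = target_prob(accepted, tok)  # target model's probability
        q = draft_prob(accepted, tok)   # draft model's probability
        # Accept with probability min(1, p/q); on rejection, stop here and let
        # the target model resample the next token (omitted in this sketch).
        if q > 0 and rng.random() < min(1.0, p / q):
            accepted.append(tok)
        else:
            break
    return accepted
```

Because verification scores all draft tokens in parallel, running it on the CPU while the small draft model runs on the GPU is what lets Dovetail use both devices at once.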
+
+
+
+
+
+
+ ☆ HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
+
+
+
+
+
+
+
+
+ Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, Benyou Wang
+
+
+ The breakthrough of OpenAI o1 highlights the potential of enhancing reasoning
+to improve LLMs. Yet, most research in reasoning has focused on mathematical
+tasks, leaving domains like medicine underexplored. The medical domain, though
+distinct from mathematics, also demands robust reasoning to provide reliable
+answers, given the high standards of healthcare. However, verifying medical
+reasoning is challenging, unlike that in mathematics. To address this, we
+propose verifiable medical problems with a medical verifier to check the
+correctness of model outputs. This verifiable nature enables advancements in
+medical reasoning through a two-stage approach: (1) using the verifier to guide
+the search for a complex reasoning trajectory for fine-tuning LLMs, (2)
+applying reinforcement learning (RL) with verifier-based rewards to enhance
+complex reasoning further. Finally, we introduce HuatuoGPT-o1, a medical LLM
+capable of complex reasoning, which outperforms general and medical-specific
+baselines using only 40K verifiable problems. Experiments show complex
+reasoning improves medical problem-solving and benefits more from RL. We hope
+our approach inspires advancements in reasoning across medical and other
+specialized domains.
+
+
+
+
+
+
+
+
+ Situo Zhang, Hankun Wang, Da Ma, Zichen Zhu, Lu Chen, Kunyao Lan, Kai Yu
+
+
+ Speculative Decoding (SD) is a popular lossless technique for accelerating
+the inference of Large Language Models (LLMs). We show that the decoding speed
+of SD frameworks with static draft structures can be significantly improved by
+incorporating context-aware adaptive draft structures. However, current studies
+on adaptive draft structures are limited by their performance, modeling
+approaches, and applicability. In this paper, we introduce AdaEAGLE, the first
+SD framework that explicitly models adaptive draft structures. AdaEAGLE
+leverages the Lightweight Draft Length Predictor (LDLP) module to explicitly
+predict the optimal number of draft tokens during inference to guide the draft
+model. It achieves comparable speedup results without manual thresholds and
+allows for deeper, more specialized optimizations. Moreover, together with
+threshold-based strategies, AdaEAGLE achieves a $1.62\times$ speedup over
+vanilla autoregressive (AR) decoding and outperforms the fixed-length SotA baseline while
+maintaining output quality.
+
+
+
+
+
+
+
+ ☆ Research Experiment on Multi-Model Comparison for Chinese Text
+ Classification Tasks
+
+
+ With the explosive growth of Chinese text data and advancements in natural
+language processing technologies, Chinese text classification has become one of
+the key techniques in fields such as information retrieval and sentiment
+analysis, attracting increasing attention. This paper conducts a comparative
+study on three deep learning models, TextCNN, TextRNN, and FastText, specifically
+for Chinese text classification tasks. By conducting experiments on the
+THUCNews dataset, the performance of these models is evaluated, and their
+applicability in different scenarios is discussed.
+
+
+
+
+
+
+
+ ☆ Overview of MWE history, challenges, and horizons: standing at the 20th
+ anniversary of the MWE workshop series via MWE-UD2024
+
+
+ Starting in 2003 when the first MWE workshop was held with ACL in Sapporo,
+Japan, this year, the joint workshop of MWE-UD co-located with the LREC-COLING
+2024 conference marked the 20th anniversary of MWE workshop events over the
+past nearly two decades. Standing at this milestone, we look back to this
+workshop series and summarise the research topics and methodologies researchers
+have carried out over the years. We also discuss the current challenges that we
+are facing and the broader impacts/synergies of MWE research within the CL and
+NLP fields. Finally, we give future research perspectives. We hope this
+position paper can help researchers, students, and industrial practitioners
+interested in MWE gain a brief but accessible understanding of its history, current state,
+and possible future.
+
+
+
+ comment: ongoing work, position paper, 6 pages
+
+
+
+
+
+
+ ☆ Whose Morality Do They Speak? Unraveling Cultural Bias in Multilingual
+ Language Models
+
+
+ Large language models (LLMs) have become integral tools in diverse domains,
+yet their moral reasoning capabilities across cultural and linguistic contexts
+remain underexplored. This study investigates whether multilingual LLMs, such
+as GPT-3.5-Turbo, GPT-4o-mini, Llama 3.1, and MistralNeMo, reflect culturally
+specific moral values or impose dominant moral norms, particularly those rooted
+in English. Using the updated Moral Foundations Questionnaire (MFQ-2) in eight
+languages, Arabic, Farsi, English, Spanish, Japanese, Chinese, French, and
+Russian, the study analyzes the models' adherence to six core moral
+foundations: care, equality, proportionality, loyalty, authority, and purity.
+The results reveal significant cultural and linguistic variability, challenging
+the assumption of universal moral consistency in LLMs. Although some models
+demonstrate adaptability to diverse contexts, others exhibit biases influenced
+by the composition of the training data. These findings underscore the need for
+culturally inclusive model development to improve fairness and trust in
+multilingual AI systems.
+
+
+
+
+
+
+
+ ☆ Bootstrap Your Own Context Length
+
+
+ We introduce a bootstrapping approach to train long-context language models
+by exploiting their short-context capabilities only. Our method utilizes a
+simple agent workflow to synthesize diverse long-context instruction tuning
+data, thereby eliminating the necessity for manual data collection and
+annotation. The proposed data synthesis workflow requires only a short-context
+language model, a text retriever, and a document collection, all of which are
+readily accessible within the open-source ecosystem. Subsequently, language
+models are fine-tuned using the synthesized data to extend their context
+lengths. In this manner, we effectively transfer the short-context capabilities
+of language models to long-context scenarios through a bootstrapping process.
+We conduct experiments with the open-source Llama-3 family of models and
+demonstrate that our method can successfully extend the context length to up to
+1M tokens, achieving superior performance across various benchmarks.
+
+
+
+ comment: 18 pages
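The agent workflow sketched in the abstract above (short-context model + retriever + document collection) can be illustrated as follows. The callables, prompts, and helper name are hypothetical placeholders, not the paper's actual pipeline.

```python
def synthesize_long_context_example(query, retriever, short_lm, k=8):
    """One step of a bootstrapped long-context data-synthesis workflow.

    `retriever(query, k)` returns k documents and `short_lm(prompt)` is a
    short-context model used twice: once to write a question answerable from
    one retrieved document, once to answer it. Concatenating all k documents
    then yields a long-context instruction example even though no long-context
    model was involved. Both callables are stand-ins for whatever retriever
    and model the open-source ecosystem provides.
    """
    docs = retriever(query, k)
    pivot = docs[0]  # the document the question is grounded in
    question = short_lm(f"Write a question answerable from: {pivot}")
    answer = short_lm(f"{pivot}\n\nQ: {question}\nA:")
    long_context = "\n\n".join(docs)  # far longer than the short model's window
    return {"context": long_context, "question": question, "answer": answer}
```

Fine-tuning on many such examples is what transfers the short-context capabilities to long-context scenarios.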
+
+
+
+
+
+
+ ☆ RapGuard: Safeguarding Multimodal Large Language Models via
+ Rationale-aware Defensive Prompting
+
+
+ While Multimodal Large Language Models (MLLMs) have made remarkable progress
+in vision-language reasoning, they are also more susceptible to producing
+harmful content compared to models that focus solely on text. Existing
+defensive prompting techniques rely on a static, unified safety guideline that
+fails to account for the specific risks inherent in different multimodal
+contexts. To address these limitations, we propose RapGuard, a novel framework
+that uses multimodal chain-of-thought reasoning to dynamically generate
+scenario-specific safety prompts. RapGuard enhances safety by adapting its
+prompts to the unique risks of each input, effectively mitigating harmful
+outputs while maintaining high performance on benign tasks. Our experimental
+results across multiple MLLM benchmarks demonstrate that RapGuard achieves
+state-of-the-art safety performance, significantly reducing harmful content
+without degrading the quality of responses.
+
+
+ Large language models (LLMs) based on the Transformer architecture usually
+have their context length limited due to the high training cost. Recent
+advancements extend the context window by adjusting the scaling factors of RoPE
+and fine-tuning. However, suboptimal initialization of these factors results in
+increased fine-tuning costs and reduced performance at target length. To
+address these challenges, we propose an innovative RoPE-based fine-tuning
+framework that diverges from conventional scaling factors search. Specifically,
+we present a Divide-and-Conquer Incremental Search (DCIS) algorithm that
+strategically determines the better scaling factors. Further fine-tuning with
+the identified scaling factors effectively extends the context window of LLMs.
+Empirical results demonstrate that our methodology not only mitigates
+performance decay at extended target lengths but also allows the model to
+fine-tune on short contexts and generalize to long contexts, thereby reducing
+the cost of fine-tuning. The scaling factors obtained through DCIS can even
+perform effectively without fine-tuning. Further analysis of the search space
+reveals that DCIS achieves twice the search efficiency compared to other
+methods. We also examine the impact of the non-strictly increasing scaling
+factors utilized in DCIS and evaluate the general capabilities of LLMs across
+various context lengths.
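To make the role of RoPE scaling factors concrete, the sketch below shows how a single scaling factor enters the rotary-embedding angles (position-interpolation style). This is only an approximation of the setting above: DCIS searches over the factors rather than fixing one, and the exact parameterization it uses is not given in the abstract.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    """Rotary position embedding angles with a context-extension scaling factor.

    Dividing positions by `scale` squeezes longer sequences into the rotation
    range the model saw during pretraining; scaling-factor search methods then
    look for values of such factors that minimize degradation at the target
    length.
    """
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # shape (dim/2,)
    return np.outer(np.asarray(positions) / scale, inv_freq)
```

With `scale=2.0`, position 2048 is rotated exactly as position 1024 was during pretraining, which is why a well-chosen factor can extend the usable context window before any fine-tuning.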
+
+
+
+
+
+
+
+
+ Xinkai Du, Quanjie Han, Chao Lv, Yan Liu, Yalin Sun, Hao Shu, Hongbo Shan, Maosong Sun
+
+
+ Open-domain Question Answering (QA) has garnered substantial interest by
+combining the advantages of faithfully retrieved passages and relevant passages
+generated through Large Language Models (LLMs). However, there is a lack of
+definitive labels available to pair these sources of knowledge. In order to
+address this issue, we propose an unsupervised and simple framework called
+Bi-Reranking for Merging Generated and Retrieved Knowledge (BRMGR), which
+utilizes re-ranking methods for both retrieved passages and LLM-generated
+passages. We pair the two types of passages using two separate re-ranking
+methods and then combine them through greedy matching. We demonstrate that
+BRMGR is equivalent to employing a bipartite matching loss when assigning each
+retrieved passage with a corresponding LLM-generated passage. Experiments on
+three datasets show that our model improves performance by +1.7 and +1.6 on the
+NQ and WebQ datasets, respectively, and obtains comparable results on the
+TriviaQA dataset against competitive baselines.
+
+
+
+ comment: Accepted by ICASSP 2025
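The greedy matching step described above can be sketched as follows. The combined score matrix is an illustrative assumption (the abstract does not specify how the two re-rankers' scores are combined); pairing in descending score order with each index used once approximates the bipartite-matching objective BRMGR is shown to be equivalent to.

```python
def greedy_match(scores):
    """Greedy one-to-one pairing of retrieved and LLM-generated passages.

    scores[i][j] is a hypothetical combined re-ranking score for pairing
    retrieved passage i with generated passage j. Candidate pairs are taken
    in descending score order; each passage on either side is used at most
    once.
    """
    candidates = sorted(
        ((scores[i][j], i, j)
         for i in range(len(scores))
         for j in range(len(scores[i]))),
        reverse=True,
    )
    used_i, used_j, pairs = set(), set(), []
    for _score, i, j in candidates:
        if i not in used_i and j not in used_j:
            pairs.append((i, j))
            used_i.add(i)
            used_j.add(j)
    return pairs
```

Greedy matching is not guaranteed to be the optimal bipartite assignment, but it needs no Hungarian-algorithm machinery and works without any pairing labels, in the spirit of the unsupervised framework.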
+
+
+
+
+
+
+ ☆ Towards Expressive Video Dubbing with Multiscale Multimodal Context
+ Interaction SP 2025
+
+
+ Automatic Video Dubbing (AVD) generates speech aligned with lip motion and
+facial emotion from scripts. Recent research focuses on modeling multimodal
+context to enhance prosody expressiveness but overlooks two key issues: 1)
+Multiscale prosody expression attributes in the context influence the current
+sentence's prosody. 2) Prosody cues in context interact with the current
+sentence, impacting the final prosody expressiveness. To tackle these
+challenges, we propose M2CI-Dubber, a Multiscale Multimodal Context Interaction
+scheme for AVD. This scheme includes two shared M2CI encoders to model the
+multiscale multimodal context and facilitate its deep interaction with the
+current sentence. By extracting global and local features for each modality in
+the context, utilizing attention-based mechanisms for aggregation and
+interaction, and employing an interaction-based graph attention network for
+fusion, the proposed approach enhances the prosody expressiveness of
+synthesized speech for the current sentence. Experiments on the Chem dataset
+show our model outperforms baselines in dubbing expressiveness. The code and
+demos are available at
+https://github.com/AI-S2-Lab/M2CI-Dubber.
+
+
+ Conversational Speech Synthesis (CSS) aims to effectively take the multimodal
+dialogue history (MDH) to generate speech with appropriate conversational
+prosody for target utterance. The key challenge of CSS is to model the
+interaction between the MDH and the target utterance. Note that text and speech
+modalities in MDH have their own unique influences, and they complement each
+other to produce a comprehensive impact on the target utterance. Previous works
+did not explicitly model such intra-modal and inter-modal interactions. To
+address this issue, we propose a new intra-modal and inter-modal context
+interaction scheme-based CSS system, termed III-CSS. Specifically, in the
+training phase, we combine the MDH with the text and speech modalities in the
+target utterance to obtain four modal combinations, including Historical
+Text-Next Text, Historical Speech-Next Speech, Historical Text-Next Speech, and
+Historical Speech-Next Text. Then, we design two contrastive learning-based
+intra-modal and two inter-modal interaction modules to deeply learn the
+intra-modal and inter-modal context interaction. In the inference phase, we
+take MDH and adopt trained interaction modules to fully infer the speech
+prosody of the target utterance's text content. Subjective and objective
+experiments on the DailyTalk dataset show that III-CSS outperforms the advanced
+baselines in terms of prosody expressiveness. Code and speech samples are
+available at https://github.com/AI-S2-Lab/I3CSS.
+
+
+
+ comment: Accepted by ICASSP 2025
+
+
+
+
+
+
+ ☆ Optimizing Large Language Models with an Enhanced LoRA Fine-Tuning
+ Algorithm for Efficiency and Robustness in NLP Tasks
+
+
+ This study proposes a large language model optimization method based on the
+improved LoRA fine-tuning algorithm, aiming to improve the accuracy and
+computational efficiency of the model in natural language processing tasks. We
+fine-tune the large language model through a low-rank adaptation strategy,
+which significantly reduces the consumption of computing resources while
+maintaining the powerful capabilities of the pre-trained model. The experiment
+uses the QQP task as the evaluation scenario. The results show that the
+improved LoRA algorithm shows significant improvements in accuracy, F1 score,
+and MCC compared with traditional models such as BERT, RoBERTa, T5, and GPT-4.
+In particular, in terms of F1 score and MCC, our model shows stronger
+robustness and discrimination ability, which proves the potential of the
+improved LoRA algorithm in fine-tuning large-scale pre-trained models. In
+addition, this paper also discusses the application prospects of the improved
+LoRA algorithm in other natural language processing tasks, emphasizing its
+advantages in multi-task learning and scenarios with limited computing
+resources. Future research can further optimize the LoRA fine-tuning strategy
+and expand its application in larger-scale pre-trained models to improve the
+generalization ability and task adaptability of the model.
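The low-rank adaptation strategy described above can be sketched in a few lines. This is a minimal illustration under assumed shapes and names (W, A, B, rank r), not the paper's implementation:

```python
import numpy as np

# Minimal sketch of low-rank adaptation: the frozen weight W is augmented
# with a trainable low-rank update B @ A. Shapes and names are assumptions
# for illustration.
rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4             # rank r << min(d_out, d_in)

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection, zero-init

def lora_forward(x):
    # Effective weight is W + B @ A; at init B == 0, so the adapter
    # leaves the base model's output unchanged.
    return x @ (W + B @ A).T

x = rng.normal(size=(2, d_in))
assert np.allclose(lora_forward(x), x @ W.T)

# Trainable parameters drop from d_out*d_in to r*(d_out + d_in).
full_params = d_out * d_in           # 4096
lora_params = r * (d_out + d_in)     # 512
```

During fine-tuning only A and B receive gradients, which is what yields the resource savings the abstract reports.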
+
+
+
+
+
+
+
+ ☆ Using Large Language Models for Automated Grading of Student Writing
+ about Science
+
+
+
+
+
+
+
+
+ Chris Impey, Matthew Wenger, Nikhil Garuda, Shahriar Golchin, Sarah Stamer
+
+
+ Assessing writing in large classes for formal or informal learners presents a
+significant challenge. Consequently, most large classes, particularly in
+science, rely on objective assessment tools such as multiple-choice quizzes,
+which have a single correct answer. The rapid development of AI has introduced
+the possibility of using large language models (LLMs) to evaluate student
+writing. An experiment was conducted using GPT-4 to determine if machine
+learning methods based on LLMs can match or exceed the reliability of
+instructor grading in evaluating short writing assignments on topics in
+astronomy. The audience consisted of adult learners in three massive open
+online courses (MOOCs) offered through Coursera. One course was on astronomy,
+the second was on astrobiology, and the third was on the history and philosophy
+of astronomy. The results should also be applicable to non-science majors in
+university settings, where the content and modes of evaluation are similar. The
+data comprised answers from 120 students to 12 questions across the three
+courses. GPT-4 was provided with total grades, model answers, and rubrics from
+an instructor for all three courses. In addition to evaluating how reliably the
+LLM reproduced instructor grades, the LLM was also tasked with generating its
+own rubrics. Overall, the LLM was more reliable than peer grading, both in
+aggregate and by individual student, and approximately matched instructor
+grades for all three online courses. The implication is that LLMs may soon be
+used for automated, reliable, and scalable grading of student science writing.
+
+
+
+ comment: Accepted at IJAIE
+
+
+
+
+
+
+ ♻ ☆ LISA: Layerwise Importance Sampling for Memory-Efficient Large Language
+ Model Fine-Tuning NeurIPS 2024
+
+
+ The machine learning community has witnessed impressive advancements since
+large language models (LLMs) first appeared. Yet, their massive memory
+consumption has become a significant roadblock to large-scale training. For
+instance, a 7B model typically requires at least 60 GB of GPU memory with full
+parameter training, which presents challenges for researchers without access to
+high-resource environments. Parameter Efficient Fine-Tuning techniques such as
+Low-Rank Adaptation (LoRA) have been proposed to alleviate this problem.
+However, in most large-scale fine-tuning settings, their performance does not
+reach the level of full parameter training because they confine the parameter
+search to a low-rank subspace. To address this deficiency, we
+investigate the layerwise properties of LoRA on fine-tuning tasks and observe
+an unexpected but consistent skewness of weight norms across different layers.
+Utilizing this key observation, a surprisingly simple training strategy is
+discovered, which outperforms both LoRA and full parameter training in a wide
+range of settings with memory costs as low as LoRA. We name it Layerwise
+Importance Sampled AdamW (LISA), a promising alternative to LoRA, which
+applies the idea of importance sampling to different layers in LLMs and
+randomly freezes most middle layers during optimization. Experimental results
+show that with similar or less GPU memory consumption, LISA surpasses LoRA or
+even full parameter tuning in downstream fine-tuning tasks, where LISA
+consistently outperforms LoRA by 10%-35% in terms of MT-Bench score while
+achieving on-par or better performance in MMLU, AGIEval and WinoGrande. On
+large models, specifically LLaMA-2-70B, LISA surpasses LoRA on MT-Bench, GSM8K,
+and PubMedQA, demonstrating its effectiveness across different domains.
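The layer-freezing idea above, unfreezing only a small random subset of middle layers each optimization step while always training the first and last layers, can be sketched as follows; the function name and sampling scheme are illustrative assumptions, not the authors' code:

```python
import random

# Sketch of layerwise importance-sampled freezing: each optimization step
# unfreezes only a few randomly chosen middle layers, while the first
# (embedding) and last (head) layers stay trainable.
def sample_trainable_layers(num_layers, num_active, rng):
    """Return sorted indices of the layers to update this step."""
    middle = list(range(1, num_layers - 1))
    active = set(rng.sample(middle, num_active))
    active.update({0, num_layers - 1})  # always train first and last
    return sorted(active)

rng = random.Random(42)
layers = sample_trainable_layers(num_layers=32, num_active=2, rng=rng)
# All other layers would be frozen (requires_grad=False) for this step,
# keeping optimizer state and gradient memory small.
```

Resampling the active set each step is what lets every layer eventually receive updates while the per-step memory footprint stays close to LoRA's.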
+
+
+ Improving the multi-step reasoning ability of large language models (LLMs)
+with offline reinforcement learning (RL) is essential for quickly adapting them
+to complex tasks. While Direct Preference Optimization (DPO) has shown promise
+in aligning LLMs with human preferences, it is less suitable for multi-step
+reasoning tasks because (1) DPO relies on paired preference data, which is not
+readily available for multi-step reasoning tasks, and (2) it treats all tokens
+uniformly, making it ineffective for credit assignment in multi-step reasoning
+tasks, which often come with sparse reward. In this work, we propose OREO
+(Offline Reasoning Optimization), an offline RL method for enhancing LLM
+multi-step reasoning. Building on insights from previous works of maximum
+entropy reinforcement learning, it jointly learns a policy model and value
+function by optimizing the soft Bellman Equation. We show in principle that it
+reduces the need to collect pairwise data and enables better credit assignment.
+Empirically, OREO surpasses existing offline learning methods on multi-step
+reasoning benchmarks, including mathematical reasoning tasks (GSM8K, MATH) and
+embodied agent control (ALFWorld). The approach can be extended to a
+multi-iteration framework when additional resources are available. Furthermore,
+the learned value function can be leveraged to guide the tree search for free,
+which can further boost performance during test time.
+
+
+
+
+
+
+
+
+ Ondřej Pražák, Miloslav Konopík, Pavel Král
+
+
+ Coreference resolution, the task of identifying expressions in text that
+refer to the same entity, is a critical component in various natural language
+processing applications. This paper presents a novel end-to-end neural
+coreference resolution system utilizing the CorefUD 1.1 dataset, which spans 17
+datasets across 12 languages. The proposed model is based on the standard
+end-to-end neural coreference resolution system. We first establish baseline
+models, including monolingual and cross-lingual variations, and then propose
+several extensions to enhance performance across diverse linguistic contexts.
+These extensions include cross-lingual training, incorporation of syntactic
+information, a Span2Head model for optimized headword prediction, and advanced
+singleton modeling. We also experiment with headword span representation and
+long-document modeling through overlapping segments. The proposed extensions,
+particularly the heads-only approach, singleton modeling, and long document
+prediction, significantly improve performance across most datasets. We also
+perform zero-shot cross-lingual experiments, highlighting the potential and
+limitations of cross-lingual transfer in coreference resolution. Our findings
+contribute to the development of robust and scalable coreference systems for
+multilingual coreference resolution. Finally, we evaluate our model on the
+CorefUD 1.1 test set and surpass the best model from the CRAC 2023 shared task
+of comparable size by a large margin.
+
+
+
+
+
+
+
+ ♻ ☆ OmniPred: Language Models as Universal Regressors
+
+
+ Regression is a powerful tool to accurately predict the outcome metric of a
+system given a set of parameters, but has traditionally been restricted to
+methods which are only applicable to a specific task. In this paper, we propose
+OmniPred, a framework for training language models as universal end-to-end
+regressors over $(x,y)$ data from arbitrary formats. Using data sourced from
+Google Vizier, one of the largest proprietary blackbox optimization databases
+in the world, our extensive experiments demonstrate that language models are
+capable of very precise numerical regression using only textual representations
+of mathematical parameters and values, and if given the opportunity to train at
+scale over multiple tasks, can significantly outperform traditional regression
+models.
+
+
+
+ comment: Published in Transactions on Machine Learning Research (TMLR) 2024.
+ Code can be found in
+ https://github.com/google-research/optformer/tree/main/optformer/omnipred
+
+
+
+
+
+
+ ♻ ☆ ReverseNER: A Self-Generated Example-Driven Framework for Zero-Shot
+ Named Entity Recognition with Large Language Models
+
+
+ This paper presents ReverseNER, a method aimed at overcoming the limitation
+of large language models (LLMs) in zero-shot named entity recognition (NER)
+tasks, arising from their reliance on pre-provided demonstrations. ReverseNER
+tackles this challenge by constructing a reliable example library composed of
+dozens of entity-labeled sentences, generated through the reverse process of
+NER. Specifically, while conventional NER methods label entities in a sentence,
+ReverseNER reverses the process, using an LLM to generate entities
+from their definitions and subsequently expand them into full sentences. During
+the entity expansion process, the LLM is guided to generate sentences by
+replicating the structures of a set of specific \textsl{feature sentences},
+extracted from the task sentences by clustering. This expansion process
+produces dozens of entity-labeled task-relevant sentences. After constructing
+the example library, the method selects several semantically similar
+entity-labeled examples for each task sentence as references to facilitate the
+LLM's entity recognition. We also propose an entity-level self-consistency
+scoring mechanism to improve NER performance with LLMs. Experiments show that
+ReverseNER significantly outperforms other zero-shot NER methods with LLMs,
+marking a notable improvement in NER for domains without labeled data, while
+reducing computational resource consumption.
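The entity-level self-consistency scoring mentioned above can be illustrated with a simple voting sketch; the exact scoring in ReverseNER may differ, and the threshold and data format here are hypothetical:

```python
from collections import Counter

# Toy sketch of entity-level self-consistency: sample the LLM several
# times on the same sentence, count how often each (span, type) pair
# recurs, and keep only pairs that enough samples agree on.
def self_consistent_entities(samples, threshold=0.5):
    counts = Counter(ent for sample in samples for ent in set(sample))
    n = len(samples)
    return {ent for ent, c in counts.items() if c / n >= threshold}

samples = [
    {("Paris", "LOC"), ("Curie", "PER")},
    {("Paris", "LOC")},
    {("Paris", "LOC"), ("Curie", "ORG")},
]
kept = self_consistent_entities(samples)  # only ("Paris", "LOC") survives
```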
+
+
+
+
+
+
+
+ ♻ ☆ RAGONITE: Iterative Retrieval on Induced Databases and Verbalized RDF
+ for Conversational QA over KGs with RAG
+
+
+
+
+
+
+
+
+ Rishiraj Saha Roy, Chris Hinze, Joel Schlotthauer, Farzad Naderi, Viktor Hangya, Andreas Foltyn, Luzian Hahn, Fabian Kuech
+
+
+ Conversational question answering (ConvQA) is a convenient means of searching
+over RDF knowledge graphs (KGs), where a prevalent approach is to translate
+natural language questions to SPARQL queries. However, SPARQL has certain
+shortcomings: (i) it is brittle for complex intents and conversational
+questions, and (ii) it is not suitable for more abstract needs. Instead, we
+propose a novel two-pronged system where we fuse: (i) SQL-query results over a
+database automatically derived from the KG, and (ii) text-search results over
+verbalizations of KG facts. Our pipeline supports iterative retrieval: when the
+results of any branch are found to be unsatisfactory, the system can
+automatically opt for further rounds. We put everything together in a retrieval
+augmented generation (RAG) setup, where an LLM generates a coherent response
+from accumulated search results. We demonstrate the superiority of our proposed
+system over several baselines on a knowledge graph of BMW automobiles.
+
+
+
+ comment: Accepted at BTW 2025, 10 pages
+
+
+
+
+
+
+ ♻ ☆ Prioritize Denoising Steps on Diffusion Model Preference Alignment via
+ Explicit Denoised Distribution Estimation
+
+
+ Diffusion models have shown remarkable success in text-to-image generation,
+making alignment methods for these models increasingly important. A key
+challenge is the sparsity of preference labels, which are typically available
+only at the terminal of denoising trajectories. This raises the issue of how to
+assign credit across denoising steps based on these sparse labels. In this
+paper, we propose Denoised Distribution Estimation (DDE), a novel method for
+credit assignment. Unlike previous approaches that rely on auxiliary models or
+hand-crafted schemes, DDE derives its strategy more explicitly. The proposed
+DDE directly estimates the terminal denoised distribution from the perspective
+of each step. It is equipped with two estimation strategies and capable of
+representing the entire denoising trajectory with a single model inference.
+Theoretically and empirically, we show that DDE prioritizes optimizing the
+middle part of the denoising trajectory, resulting in a novel and effective
+credit assignment scheme. Extensive experiments demonstrate that our approach
+achieves superior performance, both quantitatively and qualitatively.
+
+
+ Multi-hop question answering (MHQA) poses a significant challenge for large
+language models (LLMs) due to the extensive knowledge demands involved.
+Knowledge editing, which aims to precisely modify the LLMs to incorporate
+specific knowledge without negatively impacting other unrelated knowledge,
+offers a potential solution for addressing MHQA challenges with LLMs. However,
+current solutions struggle to effectively resolve issues of knowledge
+conflicts. Most parameter-preserving editing methods are hindered by inaccurate
+retrieval and overlook secondary editing issues, which can introduce noise into
+the reasoning process of LLMs. In this paper, we introduce KEDKG, a novel
+knowledge editing method that leverages a dynamic knowledge graph for MHQA,
+designed to ensure the reliability of answers. KEDKG involves two primary
+steps: dynamic knowledge graph construction and knowledge graph augmented
+generation. Initially, KEDKG autonomously constructs a dynamic knowledge graph
+to store revised information while resolving potential knowledge conflicts.
+Subsequently, it employs a fine-grained retrieval strategy coupled with an
+entity and relation detector to enhance the accuracy of graph retrieval for LLM
+generation. Experimental results on benchmarks show that KEDKG surpasses
+previous state-of-the-art models, delivering more accurate and reliable answers
+in environments with dynamic information.
+
+
+
+ comment: AAAI 2025
+
+
+
+
+
+
+ ♻ ☆ Interpretable Contrastive Monte Carlo Tree Search Reasoning
+
+
+ We propose SC-MCTS*: a novel Monte Carlo Tree Search (MCTS) reasoning
+algorithm for Large Language Models (LLMs) that significantly improves both
+reasoning accuracy and speed. Our motivation comes from: 1. Previous MCTS LLM
+reasoning works often overlooked its biggest drawback--slower speed compared to
+CoT; 2. Previous research mainly used MCTS as a tool for LLM reasoning on
+various tasks with limited quantitative analysis or ablation studies of its
+components from a reasoning-interpretability perspective. 3. The reward model is
+the most crucial component in MCTS; however, previous work has rarely conducted
+in-depth study or improvement of MCTS's reward models. Thus, we conducted
+extensive ablation studies and quantitative analysis on components of MCTS,
+revealing the impact of each component on the MCTS reasoning performance of
+LLMs. Building on this, (i) we designed a highly interpretable reward model
+based on the principle of contrastive decoding and (ii) achieved an average
+speed improvement of 51.9% per node using speculative decoding. Additionally,
+(iii) we improved UCT node selection strategy and backpropagation used in
+previous works, resulting in significant performance improvement. We
+outperformed o1-mini by an average of 17.4% on the Blocksworld multi-step
+reasoning dataset using Llama-3.1-70B with SC-MCTS*. Our code is available at
+https://github.com/zitian-gao/SC-MCTS.
+
+
+
+
+
+
+
+ ♻ ☆ Automated Review Generation Method Based on Large Language Models
+
+
+ Literature research, vital for scientific work, faces the challenge that the
+volume of published literature now far exceeds researchers' processing
+capabilities. To address this issue, we present an
+automated review generation method based on Large Language Models (LLMs), aimed
+at overcoming efficiency bottlenecks in literature processing and reducing
+cognitive load. Our statistically validated evaluation framework demonstrates
+that the generated reviews match or exceed manual quality, offering broad
+applicability across research fields due to minimal domain knowledge
+requirements. In a case study on propane dehydrogenation (PDH) catalysts, our
+method swiftly analyzed 343 articles, averaging seconds per article per LLM
+account, producing comprehensive reviews spanning 35 topics. Extended analysis
+of 1041 articles provided deep insights into catalysts' composition, structure,
+and performance. Recognizing LLMs' hallucinations, we implemented a
+multi-layered quality control strategy, effectively mitigating risks and
+ensuring reliability, as quantitatively demonstrated through manual
+verification. Expert verification confirms the accuracy and citation integrity
+of generated reviews, demonstrating LLM hallucination risks reduced to below
+0.5\% with over 95\% confidence. A released Windows application enables one-click
+review generation, aiding researchers in tracking advancements and recommending
+literature. This approach showcases LLMs' role in enhancing scientific research
+productivity and sets the stage for further exploration.
+
+
+
+ comment: 29 pages, 5 figures, 3 tables Code:
+ https://github.com/TJU-ECAT-AI/AutomaticReviewGeneration Data:
+ https://github.com/TJU-ECAT-AI/AutomaticReviewGenerationData This research
+ has been invited for a Short Oral presentation at the 18th ICC -
+ International Congress on Catalysis, taking place in Lyon, France from July
+ 14-19, 2024
+
+
+
+
+
+
+ ♻ ☆ RadioRAG: Factual large language models for enhanced diagnostics in
+ radiology using online retrieval augmented generation
+
+
+
+
+
+
+
+
+ Soroosh Tayebi Arasteh, Mahshad Lotfinia, Keno Bressem, Robert Siepmann, Lisa Adams, Dyke Ferber, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn
+
+
+ Large language models (LLMs) often generate outdated or inaccurate
+information based on static training datasets. Retrieval augmented generation
+(RAG) mitigates this by integrating outside data sources. While previous RAG
+systems used pre-assembled, fixed databases with limited flexibility, we have
+developed Radiology RAG (RadioRAG), an end-to-end framework that retrieves data
+from authoritative radiologic online sources in real-time. We evaluate the
+diagnostic accuracy of various LLMs when answering radiology-specific questions
+with and without access to additional online information via RAG. Using 80
+questions from the RSNA Case Collection across radiologic subspecialties and 24
+additional expert-curated questions with reference standard answers, LLMs
+(GPT-3.5-turbo, GPT-4, Mistral-7B, Mixtral-8x7B, and Llama3 [8B and 70B]) were
+prompted with and without RadioRAG in a zero-shot inference scenario. RadioRAG
+retrieved context-specific information from www.radiopaedia.org in real-time.
+Accuracy was investigated. Statistical analyses were performed using
+bootstrapping. The results were further compared with human performance.
+RadioRAG improved diagnostic accuracy across most LLMs, with relative accuracy
+increases of up to 54% depending on the LLM. It matched or exceeded non-RAG
+models and the human radiologist in question answering across radiologic
+subspecialties, particularly in breast imaging and emergency radiology.
+However, the degree of improvement varied among models; GPT-3.5-turbo and
+Mixtral-8x7B-instruct-v0.1 saw notable gains, while Mistral-7B-instruct-v0.2
+showed no improvement, highlighting variability in RadioRAG's effectiveness.
+LLMs benefit when provided access to domain-specific data beyond their training
+data. For radiology, RadioRAG establishes a robust framework that substantially
+improves diagnostic accuracy and factuality in radiological question answering.
+
+
+
+
+
+
+
+ ♻ ☆ On the Universal Truthfulness Hyperplane Inside LLMs EMNLP 2024
+
+
+ While large language models (LLMs) have demonstrated remarkable abilities
+across various fields, hallucination remains a significant challenge. Recent
+studies have explored hallucinations through the lens of internal
+representations, proposing mechanisms to decipher LLMs' adherence to facts.
+However, these approaches often fail to generalize to out-of-distribution data,
+leading to concerns about whether internal representation patterns reflect
+fundamental factual awareness, or only overfit spurious correlations on the
+specific datasets. In this work, we investigate whether a universal
+truthfulness hyperplane that distinguishes the model's factually correct and
+incorrect outputs exists within the model. To this end, we scale up the number
+of training datasets and conduct an extensive evaluation -- we train the
+truthfulness hyperplane on a diverse collection of over 40 datasets and examine
+its cross-task, cross-domain, and in-domain generalization. Our results
+indicate that increasing the diversity of the training datasets significantly
+enhances the performance in all scenarios, while the volume of data samples
+plays a less critical role. This finding supports the optimistic hypothesis
+that a universal truthfulness hyperplane may indeed exist within the model,
+offering promising directions for future research.
+
+
+
+ comment: EMNLP 2024: Camera-ready version
+
+
+
+
+
+
+ ♻ ☆ Seek and Solve Reasoning for Table Question Answering
+
+
+
+
+
+
+
+
+ Ruya Jiang, Chun Wang, Weihong Deng
+
+
+ The complexities of table structures and question logic make table-based
+question answering (TQA) tasks challenging for Large Language Models (LLMs),
+often requiring task simplification before solving. This paper reveals that the
+reasoning process during task simplification may be more valuable than the
+simplified tasks themselves and aims to improve TQA performance by leveraging
+LLMs' reasoning capabilities. We propose a Seek-and-Solve pipeline that
+instructs the LLM to first seek relevant information and then answer questions,
+integrating these two stages at the reasoning level into a coherent
+Seek-and-Solve Chain of Thought (SS-CoT). Additionally, we distill a
+single-step TQA-solving prompt from this pipeline, using demonstrations with
+SS-CoT paths to guide the LLM in solving complex TQA tasks under In-Context
+Learning settings. Our experiments show that our approaches result in improved
+performance and reliability while being efficient. Our findings emphasize the
+importance of eliciting LLMs' reasoning capabilities to handle complex TQA
+tasks effectively.
+
+
+
+
+
+
+
+ ♻ ☆ ChaI-TeA: A Benchmark for Evaluating Autocompletion of Interactions with
+ LLM-based Chatbots
+
+
+ The rise of LLMs has shifted a growing portion of human-computer
+interactions towards LLM-based chatbots. The remarkable abilities of these
+models allow users to interact using long, diverse natural language text
+covering a wide range of topics and styles. Phrasing these messages is a time-
+and effort-consuming task, calling for an autocomplete solution to assist
+users. We introduce the task of chatbot interaction autocomplete. We present
+ChaI-TeA: CHat InTEraction Autocomplete, an autocomplete evaluation framework
+for LLM-based chatbot interactions. The framework includes a formal definition
+of the task, coupled with suitable datasets and metrics. Using this framework,
+we test 9 models on the defined autocompletion task, finding that
+while current off-the-shelf models perform fairly, there is still much room for
+improvement, mainly in ranking of the generated suggestions. We provide
+insights for practitioners working on this task and open new research
+directions for researchers in the field. We release our framework to serve as a
+foundation for future research.
+
+
+
+
+
+
+
+ ♻ ☆ A Dual-Perspective Metaphor Detection Framework Using Large Language
+ Models ICASSP 2025
+
+
+
+
+
+
+
+
+ Yujie Lin, Jingyao Liu, Yan Gao, Ante Wang, Jinsong Su
+
+
+ Metaphor detection, a critical task in natural language processing, involves
+identifying whether a particular word in a sentence is used metaphorically.
+Traditional approaches often rely on supervised learning models that implicitly
+encode semantic relationships based on metaphor theories. However, these
+methods often suffer from a lack of transparency in their decision-making
+processes, which undermines the reliability of their predictions. Recent
+research indicates that LLMs (large language models) exhibit significant
+potential in metaphor detection. Nevertheless, their reasoning capabilities are
+constrained by predefined knowledge graphs. To overcome these limitations, we
+propose DMD, a novel dual-perspective framework that harnesses both implicit
+and explicit applications of metaphor theories to guide LLMs in metaphor
+detection and adopts a self-judgment mechanism to validate the responses from
+the aforementioned forms of guidance. In comparison to previous methods, our
+framework offers more transparent reasoning processes and delivers more
+reliable predictions. Experimental results prove the effectiveness of DMD,
+demonstrating state-of-the-art performance across widely-used datasets.
+
+
+
+ comment: Accepted to ICASSP 2025
+
+
+
+
+
+
+ ♻ ☆ UPCS: Unbiased Persona Construction for Dialogue Generation
+
+
+ Narrative systems, such as dialogue and storytelling systems, often utilize
+persona profiles to enhance personalized interactions. Existing persona
+profiles frequently exhibit biases, posing risks to system integrity and
+fairness. To address this, we introduce the UPCS framework, which categorizes
+character descriptions into eight dimensions, including bias mitigation
+strategies. Experimental results demonstrate UPCS's superiority in accuracy,
+diversity, bias elimination, and user satisfaction, marking a significant
+advancement in persona construction for reliable narrative systems.
+
+
+
+
+
+
+
+ ♻ ☆ Towards Global AI Inclusivity: A Large-Scale Multilingual Terminology
+ Dataset
+
+
+ The field of machine translation has achieved significant advancements, yet
+domain-specific terminology translation, particularly in AI, remains
+challenging. We introduced GIST, a large-scale multilingual AI terminology
+dataset containing 5K terms extracted from top AI conference papers spanning
+2000 to 2023. The terms were translated into Arabic, Chinese, French, Japanese,
+and Russian using a hybrid framework that combines LLMs for extraction with
+human expertise for translation. The dataset's quality was benchmarked against
+existing resources, demonstrating superior translation accuracy through
+crowdsourced evaluation. GIST was integrated into translation workflows using
+post-translation refinement methods that required no retraining, where LLM
+prompting consistently improved BLEU and COMET scores. A web demonstration on
+the ACL Anthology platform highlights its practical application, showcasing
+improved accessibility for non-English speakers. This work aims to address
+critical gaps in AI terminology resources and fosters global inclusivity and
+collaboration in AI research.
+
+
+
+
+
+
+
+ ♻ ☆ Unveiling Uncertainty: A Deep Dive into Calibration and Performance of
+ Multimodal Large Language Models COLING 2025
+
+
+ Multimodal large language models (MLLMs) combine visual and textual data for
+tasks such as image captioning and visual question answering. Proper
+uncertainty calibration is crucial, yet challenging, for reliable use in areas
+like healthcare and autonomous driving. This paper investigates representative
+MLLMs, focusing on their calibration across various scenarios, including before
+and after visual fine-tuning, as well as before and after multimodal training
+of the base LLMs. We observed miscalibration in their performance, and at the
+same time, no significant differences in calibration across these scenarios. We
+also highlight how uncertainty differs between text and images and how their
+integration affects overall uncertainty. To better understand MLLMs'
+miscalibration and their ability to self-assess uncertainty, we construct the
+IDK (I don't know) dataset, which is key to evaluating how they handle
+unknowns. Our findings reveal that MLLMs tend to give answers rather than admit
+uncertainty, but this self-assessment improves with proper prompt adjustments.
+Finally, to calibrate MLLMs and enhance model reliability, we propose
+techniques such as temperature scaling and iterative prompt optimization. Our
+results provide insights into improving MLLMs for effective and responsible
+deployment in multimodal applications. Code and IDK dataset:
+https://github.com/hfutml/Calibration-MLLM.
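Temperature scaling, one of the calibration techniques proposed above, can be sketched with toy logits; fitting the temperature on held-out validation data is omitted here:

```python
import numpy as np

# Toy sketch of temperature scaling: dividing logits by T > 1 softens an
# overconfident distribution without changing the predicted class.
def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([4.0, 1.0, 0.0])
p_raw = softmax(logits)        # overconfident raw probabilities
p_cal = softmax(logits / 2.0)  # T = 2 softens the distribution
```

Because scaling by a positive constant is monotone, accuracy is unchanged while reported confidence moves closer to empirical correctness rates.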
+
+
+
+ comment: Accepted to COLING 2025
+
+
+
+
+
+
+ ♻ ☆ Fit and Prune: Fast and Training-free Visual Token Pruning for
+ Multi-modal Large Language Models
+
+
+ Recent Multimodal Large Language Models (MLLMs) often use large numbers of
+image tokens to compensate for the visual shortcomings of MLLMs, which not only
+introduces obvious redundancy but also greatly increases the already high
+computational cost. Token pruning is an effective solution for speeding up MLLMs, but
+when and how to drop tokens still remains a challenge. In this paper, we
+propose a novel and training-free approach for the effective visual token
+pruning of MLLMs, termed FitPrune, which can quickly produce a complete pruning
+recipe for MLLMs according to a pre-defined budget. Specifically, FitPrune
+treats token pruning as a statistical problem for the MLLM, whose objective is
+to find an optimal pruning scheme that minimizes the divergence of the
+attention distributions before and after pruning. In practice, FitPrune can be
+quickly accomplished based on the attention statistics from a small batch of
+inference data, avoiding the expensive trials of MLLMs. According to the
+pruning recipe, an MLLM can directly remove the redundant visual tokens of
+different examples during inference. To validate FitPrune, we apply it to a set
+of recent MLLMs, including LLaVA-1.5, LLaVA-HR and LLaVA-NEXT, and conduct
+extensive experiments on a set of benchmarks. The experimental results show
+that FitPrune can reduce the computational complexity to a large
+extent while retaining high performance, e.g., -54.9% FLOPs for LLaVA-NEXT
+with only 0.5% accuracy drop. Notably, the pruning recipe can be obtained in
+about 5 minutes. Our code is available at https://github.com/ywh187/FitPrune.
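A simplified stand-in for an attention-statistics pruning recipe like the one above: rank visual tokens by the average attention they receive over a small calibration batch, then keep only the top tokens under a budget. This illustrates the use of attention statistics only, not FitPrune's actual divergence-minimizing algorithm:

```python
import numpy as np

# Illustrative token pruning from attention statistics: tokens that
# receive little attention on a calibration batch are assumed redundant
# and dropped to meet a pre-defined budget.
def prune_visual_tokens(attn, keep_ratio):
    """attn: array of shape (batch, queries, visual_tokens)."""
    importance = attn.mean(axis=(0, 1))          # mean received attention
    k = max(1, int(attn.shape[-1] * keep_ratio))
    return np.sort(np.argsort(importance)[::-1][:k])

rng = np.random.default_rng(1)
attn = rng.random((4, 16, 10))                   # toy stats: 10 visual tokens
kept = prune_visual_tokens(attn, keep_ratio=0.5) # indices of 5 kept tokens
```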
+
+
+
+
+
+
+
+ ♻ ☆ Explore the Potential of LLMs in Misinformation Detection: An Empirical
+ Study
+
+
+
+
+
+
+
+
+ Mengyang Chen, Lingwei Wei, Han Cao, Wei Zhou, Songlin Hu
+
+
+ Large Language Models (LLMs) have garnered significant attention for their
+powerful ability in natural language understanding and reasoning. In this
+paper, we present a comprehensive empirical study to explore the performance of
+LLMs on misinformation detection tasks. This study stands as the pioneering
+investigation into the understanding capabilities of multiple LLMs regarding
+both content and propagation across social media platforms. Our empirical
+studies on eight misinformation detection datasets show that LLM-based
+detectors can achieve comparable performance in text-based misinformation
+detection but exhibit notably constrained capabilities in comprehending
+propagation structure compared to existing models in propagation-based
+misinformation detection. Our experiments further demonstrate that LLMs exhibit
+great potential to enhance existing misinformation detection models. These
+findings highlight the potential ability of LLMs to detect misinformation.
+
+
+
+
+
+
+
+ ♻ ☆ Benchmarking Large Language Model Uncertainty for Prompt Optimization
+
+
+ Prompt optimization algorithms for Large Language Models (LLMs) excel in
+multi-step reasoning but still lack effective uncertainty estimation. This
+paper introduces a benchmark dataset to evaluate uncertainty metrics, focusing
+on Answer, Correctness, Aleatoric, and Epistemic Uncertainty. Through analysis
+of models like GPT-3.5-Turbo and Meta-Llama-3.1-8B-Instruct, we show that
+current metrics align more with Answer Uncertainty, which reflects output
+confidence and diversity, rather than Correctness Uncertainty, highlighting the
+need for improved metrics that are optimization-objective-aware to better guide
+prompt optimization. Our code and dataset are available at
+https://github.com/0Frett/PO-Uncertainty-Benchmarking.
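The distinction the benchmark draws can be made concrete: Answer Uncertainty reflects diversity of the model's outputs, independent of correctness. A minimal sketch, assuming entropy over repeated samples as the metric (the helper name is illustrative, not the benchmark's API):

```python
import math
from collections import Counter

def answer_uncertainty(samples):
    """Entropy (in bits) of the empirical answer distribution.

    High entropy = diverse answers to the same prompt. This is an
    output-confidence signal only; it says nothing about whether
    the answers are correct (Correctness Uncertainty).
    """
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A model that always answers the same way has zero answer uncertainty...
print(answer_uncertainty(["42"] * 8))        # 0.0
# ...even if it is confidently wrong, which is why the paper argues these
# two notions of uncertainty must be measured separately.
print(answer_uncertainty(["42", "41"] * 4))  # 1.0
```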
+
+
+ We introduce Baichuan Alignment, a detailed analysis of the alignment
+techniques employed in the Baichuan series of models. This represents the
+industry's first comprehensive account of alignment methodologies, offering
+valuable insights for advancing AI research. We investigate the critical
+components that enhance model performance during the alignment process,
+including optimization methods, data strategies, capability enhancements, and
+evaluation processes. The process spans three key stages: Prompt Augmentation
+System (PAS), Supervised Fine-Tuning (SFT), and Preference Alignment. The
+problems encountered, the solutions applied, and the improvements made are
+thoroughly recorded.
+ Through comparisons across well-established benchmarks, we highlight the
+technological advancements enabled by Baichuan Alignment. Baichuan-Instruct is
+an internal model, while Qwen2-Nova-72B and Llama3-PBM-Nova-70B are instruct
+versions of the Qwen2-72B and Llama-3-70B base models, optimized through
+Baichuan Alignment. Baichuan-Instruct demonstrates significant improvements in
+core capabilities, with user experience gains ranging from 17% to 28%, and
+performs exceptionally well on specialized benchmarks. In open-source benchmark
+evaluations, both Qwen2-Nova-72B and Llama3-PBM-Nova-70B consistently
+outperform their respective official instruct versions across nearly all
+datasets. This report aims to clarify the key technologies behind the alignment
+process, fostering a deeper understanding within the community.
+The Llama3-PBM-Nova-70B model is available at
+https://huggingface.co/PKU-Baichuan-MLSystemLab/Llama3-PBM-Nova-70B.
+
+
+
+
+
+
+
+ ♻ ☆ TSDS: Data Selection for Task-Specific Model Finetuning
+
+
+ Finetuning foundation models for specific tasks is an emerging paradigm in
+modern machine learning. The efficacy of task-specific finetuning largely
+depends on the selection of appropriate training data. We present TSDS
+(Task-Specific Data Selection), a framework to select data for task-specific
+model finetuning, guided by a small but representative set of examples from the
+target task. To do so, we formulate data selection for task-specific finetuning
+as an optimization problem with a distribution alignment loss based on optimal
+transport to capture the discrepancy between the selected data and the target
+distribution. In addition, we add a regularizer to encourage the diversity of
+the selected data and incorporate kernel density estimation into the
+regularizer to reduce the negative effects of near-duplicates among the
+candidate data. We connect our optimization problem to nearest neighbor search
+and design efficient algorithms to compute the optimal solution based on
+approximate nearest neighbor search techniques. We evaluate our method on data
+selection for both continued pretraining and instruction tuning of language
+models. We show that instruction tuning using data selected by our method with
+a 1% selection ratio often outperforms using the full dataset and beats the
+baseline selection methods by 1.5 points in F1 score on average.
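The alignment-plus-diversity idea behind the selection objective can be sketched simply. This is a simplified stand-in, not the paper's optimal-transport formulation: candidates are scored by distance to their nearest target example, with a Gaussian kernel-density penalty that discounts near-duplicates in the candidate pool.

```python
import numpy as np

def select_data(candidates, targets, k, bandwidth=1.0):
    """Pick k candidates close to the target distribution while
    penalizing near-duplicates (illustrative sketch of TSDS's goals)."""
    # Alignment: distance from each candidate to its nearest target example.
    d_t = np.linalg.norm(candidates[:, None, :] - targets[None, :, :], axis=-1)
    align = d_t.min(axis=1)
    # Duplication penalty: kernel density over the candidate pool itself.
    d_c = np.linalg.norm(candidates[:, None, :] - candidates[None, :, :], axis=-1)
    density = np.exp(-(d_c / bandwidth) ** 2).sum(axis=1)
    score = align + np.log(density)  # lower is better
    return np.sort(np.argsort(score)[:k])

rng = np.random.default_rng(0)
targets = rng.normal(0.0, 0.1, size=(5, 2))
good = rng.normal(0.0, 0.1, size=(3, 2))   # near the target distribution
dupes = np.zeros((4, 2)) + 5.0             # far away, and all duplicates
cands = np.vstack([good, dupes])
print(select_data(cands, targets, k=3))    # picks the three aligned points
```

Note how the duplicated far-away points lose on both terms: they are far from the targets and sit in a dense clump, mirroring the near-duplicate problem the regularizer addresses.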
+
+
+
+ comment: 31 pages, 1 figure
+
+
+
+
+
+
+ ♻ ☆ GameArena: Evaluating LLM Reasoning through Live Computer Games
+
+
+ Evaluating the reasoning abilities of large language models (LLMs) is
+challenging. Existing benchmarks often depend on static datasets, which are
+vulnerable to data contamination and may get saturated over time, or on binary
+live human feedback that conflates reasoning with other abilities. As the most
+prominent dynamic benchmark, Chatbot Arena evaluates open-ended questions in
+real-world settings, but lacks the granularity to assess specific reasoning
+capabilities. We introduce GameArena, a dynamic benchmark designed to evaluate
+LLM reasoning capabilities through interactive gameplay with humans. GameArena
+consists of three games designed to test specific reasoning capabilities (e.g.,
+deductive and inductive reasoning), while keeping participants entertained and
+engaged. We analyze the gaming data retrospectively to uncover the underlying
+reasoning processes of LLMs and measure their fine-grained reasoning
+capabilities. We collect over 2000 game sessions and provide detailed
+assessments of various reasoning capabilities for five state-of-the-art LLMs.
+Our user study with 100 participants suggests that GameArena improves user
+engagement compared to Chatbot Arena. For the first time, GameArena enables the
+collection of step-by-step LLM reasoning data in the wild.
+
+
+
+
+
+
+
+ ♻ ☆ Enhancing Large Language Models with Domain-Specific Knowledge: The Case
+ in Topological Materials
+
+
+ Large language models (LLMs), such as ChatGPT, have demonstrated impressive
+performance in the text generation task, showing the ability to understand and
+respond to complex instructions. However, the performance of naive LLMs in
+specific domains is limited due to the scarcity of domain-specific corpora and
+specialized training. Moreover, training a specialized large-scale model
+necessitates significant hardware resources, which restricts researchers from
+leveraging such models to drive advances. Hence, it is crucial to further
+improve and optimize LLMs to meet specific domain demands and enhance their
+scalability. Based on the condensed matter data center, we establish a material
+knowledge graph (MaterialsKG) and integrate it with literature. Using large
+language models and prompt learning, we develop a specialized dialogue system
+for topological materials called TopoChat. Compared to naive LLMs, TopoChat
+exhibits superior performance in structural and property querying, material
+recommendation, and complex relational reasoning. This system enables efficient
+and precise retrieval of information and facilitates knowledge interaction,
+thereby encouraging advancement in the field of condensed matter materials.
+
+
+ The rapid expansion of multimedia contents has led to the emergence of
+multimodal recommendation systems, which have attracted increasing attention
+because fully utilizing data from different modalities alleviates the
+persistent data sparsity problem. As such, multimodal recommendation models
+can learn personalized information about nodes in terms of visual and textual
+modalities. To further alleviate the data sparsity problem, some previous
+works have introduced graph convolutional networks (GCNs) for multimodal
+recommendation systems, to enhance the semantic representation of users and
+items by capturing the potential relationships between them. However,
+adopting GCNs inevitably introduces the over-smoothing problem, which makes
+nodes too similar. Unfortunately, incorporating multimodal information
+will exacerbate this challenge because nodes that are too similar will lose the
+personalized information learned through multimodal information. To address
+this problem, we propose a novel model that retains the personalized
+information of ego nodes during feature aggregation by Reducing Node-neighbor
+Discrepancy (RedN^nD). Extensive experiments on three public datasets show that
+RedN^nD achieves state-of-the-art performance on accuracy and robustness, with
+significant improvements over existing GCN-based multimodal frameworks.
+
+
+
+ comment: Accepted by ICASSP 2025
+
+
+
+
+
+
+ ☆ Musings About the Future of Search: A Return to the Past?
+
+
+
+
+
+
+
+
+ Jimmy Lin, Pankaj Gupta, Will Horn, Gilad Mishne
+
+
+ When you have a question, the most effective way to have the question
+answered is to directly connect with experts on the topic and have a
+conversation with them. Prior to the invention of writing, this was the only
+way. Although effective, this solution exhibits scalability challenges. Writing
+allowed knowledge to be materialized, preserved, and replicated, enabling the
+development of different technologies over the centuries to connect information
+seekers with relevant information. This progression ultimately culminated in
+the ten-blue-links web search paradigm we're familiar with, just before the
+recent emergence of generative AI. However, we often forget that consuming
+static content is an imperfect solution. With the advent of large language
+models, it has become possible to develop a superior experience by allowing
+users to directly engage with experts. These interactions can of course satisfy
+information needs, but expert models can do so much more. This coming future
+requires reimagining search.
+
+
+
+
+
+
+
+ ☆ Bootstrap Your Own Context Length
+
+
+ We introduce a bootstrapping approach to train long-context language models
+by exploiting their short-context capabilities only. Our method utilizes a
+simple agent workflow to synthesize diverse long-context instruction tuning
+data, thereby eliminating the necessity for manual data collection and
+annotation. The proposed data synthesis workflow requires only a short-context
+language model, a text retriever, and a document collection, all of which are
+readily accessible within the open-source ecosystem. Subsequently, language
+models are fine-tuned using the synthesized data to extend their context
+lengths. In this manner, we effectively transfer the short-context capabilities
+of language models to long-context scenarios through a bootstrapping process.
+We conduct experiments with the open-source Llama-3 family of models and
+demonstrate that our method can successfully extend the context length to up to
+1M tokens, achieving superior performance across various benchmarks.
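The data-synthesis workflow the abstract describes can be sketched end to end. Everything below is a toy stand-in: `retrieve` is a trivial keyword retriever in place of a real text retriever, and `make_qa` stands in for the short-context LLM that the workflow prompts; the names and signatures are assumptions for illustration.

```python
def synthesize_long_context_example(query, corpus, retrieve, make_qa, budget):
    """Build one long-context training example from short documents.

    retrieve(query, corpus) -> ranked list of documents
    make_qa(doc)            -> (question, answer) about a single short doc
    The answer must then be located inside the much longer assembled context.
    """
    docs, context, total = retrieve(query, corpus), [], 0
    for doc in docs:
        if total + len(doc) > budget:  # fill up to the target context length
            break
        context.append(doc)
        total += len(doc)
    question, answer = make_qa(context[0])
    return {"context": " ".join(context), "question": question, "answer": answer}

corpus = ["the sky is blue", "grass is green", "snow is white"]
retrieve = lambda q, c: sorted(c, key=lambda d: -sum(w in d for w in q.split()))
make_qa = lambda doc: ("what colour is the sky?", doc.split()[-1])
ex = synthesize_long_context_example("sky colour", corpus, retrieve, make_qa, 40)
print(ex["answer"])  # blue
```

Fine-tuning on many such examples is how the bootstrapping step transfers short-context QA ability to long assembled contexts.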
+
+
+ As data retrieval demands become increasingly complex, traditional search
+methods often fall short in addressing nuanced and conceptual queries. Vector
+similarity search has emerged as a promising technique for finding semantically
+similar information efficiently. However, its effectiveness diminishes when
+handling intricate queries with contextual nuances. This paper explores a
+hybrid approach combining vector similarity search with Large Language Models
+(LLMs) to enhance search accuracy and relevance. The proposed two-step solution
+first employs vector similarity search to shortlist potential matches, followed
+by an LLM for context-aware ranking of the results. Experiments on structured
+datasets demonstrate that while vector similarity search alone performs well
+for straightforward queries, the LLM-assisted approach excels in processing
+complex queries involving constraints, negations, or conceptual requirements.
+By leveraging the natural language understanding capabilities of LLMs, this
+method improves the accuracy of search results for complex tasks without
+sacrificing efficiency. We also discuss real-world applications and propose
+directions for future research to refine and scale this technique for diverse
+datasets and use cases.
+ Original article:
+https://engineering.grab.com/llm-assisted-vector-similarity-search
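The two-step solution is straightforward to sketch: a cosine-similarity shortlist followed by an LLM re-ranking pass. The `llm_rank` callable below is a stub standing in for the actual LLM call; the example shows the kind of negation ("not cheap") that embedding similarity alone tends to miss.

```python
import numpy as np

def hybrid_search(query_vec, doc_vecs, docs, query_text, llm_rank, shortlist=3):
    """Step 1: vector similarity shortlist. Step 2: LLM-style re-ranking."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
    top = np.argsort(sims)[::-1][:shortlist]
    return llm_rank(query_text, [docs[i] for i in top])

docs = ["cheap red shoes", "red shoes", "blue hat", "red scarf"]
vecs = np.array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.0], [0.8, 0.2]])
q = np.array([1.0, 1.0])
# Stub "LLM" that understands the negation the embeddings ignore.
llm = lambda query, cands: ([d for d in cands if "cheap" not in d]
                            + [d for d in cands if "cheap" in d])
print(hybrid_search(q, vecs, docs, "red shoes, not cheap", llm))
```

The vector stage keeps the expensive LLM call cheap (it only sees the shortlist), which is the efficiency argument the paper makes.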
+
+
+
+
+
+
+
+ ☆ FOR: Finetuning for Object Level Open Vocabulary Image Retrieval WACV 2025
+
+
+ As working with large datasets becomes standard, the task of accurately
+retrieving images containing objects of interest by an open set textual query
+gains practical importance. The current leading approach utilizes a pre-trained
+CLIP model without any adaptation to the target domain, balancing accuracy and
+efficiency through additional post-processing. In this work, we propose FOR:
+Finetuning for Object-centric Open-vocabulary Image Retrieval, which allows
+finetuning on a target dataset using closed-set labels while keeping the
+visual-language association crucial for open vocabulary retrieval. FOR is based
+on two design elements: a specialized decoder variant of the CLIP head
+customized for the intended task, and its coupling within a multi-objective
+training framework. Together, these design choices result in a significant
+increase in accuracy, showcasing improvements of up to 8 mAP@50 points over
+SoTA across three datasets. Additionally, we demonstrate that FOR is also
+effective in a semi-supervised setting, achieving impressive results even when
+only a small portion of the dataset is labeled.
+
+
+
+ comment: WACV 2025
+
+
+
+
+
+
+ ☆ Zema Dataset: A Comprehensive Study of Yaredawi Zema with a Focus on
+ Horologium Chants
+
+
+ Computational music research plays a critical role in advancing music
+production, distribution, and understanding across various musical styles
+worldwide. Despite the immense cultural and religious significance, the
+Ethiopian Orthodox Tewahedo Church (EOTC) chants are relatively
+underrepresented in computational music research. This paper contributes to
+this field by introducing a new dataset specifically tailored for analyzing
+EOTC chants, also known as Yaredawi Zema. This work provides a comprehensive
+overview of the creation and curation of a 10-hour dataset of 369 instances,
+including rigorous quality assurance measures. Our dataset has detailed
+word-level temporal boundary and reading tone annotations along with the
+corresponding chanting mode label for each audio recording. Moreover, we have
+also identified the chanting options associated with multiple chanting
+notations in the manuscript by annotating them accordingly. Our goal in making
+this dataset publicly available is to encourage more research and study of EOTC
+chants, including lyrics transcription, lyric-to-audio alignment, and music
+generation tasks. Such research work will advance knowledge and efforts to
+preserve this distinctive liturgical music, a priceless cultural artifact for
+the Ethiopian people.
+
+
+
+ comment: 6 pages
+
+
+
+
+
+
+ ☆ Attack-in-the-Chain: Bootstrapping Large Language Models for Attacks
+ Against Black-box Neural Ranking Models AAAI25
+
+
+ Neural ranking models (NRMs) have been shown to be highly effective in terms
+of retrieval performance. Unfortunately, they have also displayed a higher
+degree of sensitivity to attacks than previous generation models. To help
+expose and address this lack of robustness, we introduce a novel ranking attack
+framework named Attack-in-the-Chain, which tracks interactions between large
+language models (LLMs) and NRMs based on chain-of-thought (CoT) prompting to
+generate adversarial examples under black-box settings. Our approach starts by
+identifying anchor documents with higher ranking positions than the target
+document as nodes in the reasoning chain. We then dynamically assign the number
+of perturbation words to each node and prompt LLMs to execute attacks. Finally,
+we verify the attack performance of all nodes at each reasoning step and
+proceed to generate the next reasoning step. Empirical results on two web
+search benchmarks show the effectiveness of our method.
+
+
+
+ comment: Accepted by AAAI25
+
+
+
+
+
+
+ ☆ On the Robustness of Generative Information Retrieval Models ECIR 2025
+
+
+ Generative information retrieval methods retrieve documents by directly
+generating their identifiers. Much effort has been devoted to developing
+effective generative IR models. Less attention has been paid to the robustness
+of these models. It is critical to assess the out-of-distribution (OOD)
+generalization of generative IR models, i.e., how would such models generalize
+to new distributions? To answer this question, we focus on OOD scenarios from
+four perspectives in retrieval problems: (i) query variations; (ii) unseen
+query types; (iii) unseen tasks; and (iv) corpus expansion. Based on this
+taxonomy, we
+conduct empirical studies to analyze the OOD robustness of representative
+generative IR models against dense retrieval models. Our empirical results
+indicate that the OOD robustness of generative IR models is in need of
+improvement. By inspecting the OOD robustness of generative IR models we aim to
+contribute to the development of more reliable IR models. The code is available
+at https://github.com/Davion-Liu/GR_OOD.
+
+
+
+ comment: Accepted by ECIR 2025. arXiv admin note: substantial text overlap
+ with arXiv:2306.12756
+
+
+
+
+
+
+ ☆ Adaptive Self-supervised Learning for Social Recommendations
+
+
+ In recent years, researchers have attempted to exploit social relations to
+improve the performance in recommendation systems. Generally, most existing
+social recommendation methods heavily depend on substantial domain knowledge
+and expertise in primary recommendation tasks for designing useful auxiliary
+tasks. Meanwhile, Self-Supervised Learning (SSL) recently has received
+considerable attention in the field of recommendation, since it can provide
+self-supervision signals in assisting the improvement of target recommendation
+systems by constructing self-supervised auxiliary tasks from raw data without
+human-annotated labels. Despite the great success, these SSL-based social
+recommendations are insufficient to adaptively balance various self-supervised
+auxiliary tasks, since assigning equal weights on various auxiliary tasks can
+result in sub-optimal recommendation performance, where different
+self-supervised auxiliary tasks may contribute differently to improving the
+primary social recommendation across different datasets. To address this issue,
+in this work, we propose Adaptive Self-supervised Learning for Social
+Recommendations (AdasRec) by taking advantage of various self-supervised
+auxiliary tasks. More specifically, an adaptive weighting mechanism is proposed
+to learn adaptive weights for various self-supervised auxiliary tasks, so as to
+balance the contribution of such self-supervised auxiliary tasks for enhancing
+representation learning in social recommendations. The adaptive weighting
+mechanism is used to assign different weights on auxiliary tasks to achieve an
+overall weighting of the entire auxiliary tasks and ultimately assist the
+primary recommendation task, achieved by a meta learning optimization problem
+with an adaptive weighting network. Comprehensive experiments on various
+real-world datasets are constructed to verify the effectiveness of our proposed
+method.
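The weighting mechanism reduces to combining auxiliary losses under learned weights. A minimal sketch, assuming softmax-normalized logits as the weights (in AdasRec-style training the logits would be updated by the meta-learning step on the primary loss; here they are plain inputs, and the function names are illustrative):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def weighted_aux_loss(aux_losses, logits):
    """Combine auxiliary self-supervised losses under adaptive weights."""
    weights = softmax(logits)
    return sum(w * l for w, l in zip(weights, aux_losses)), weights

# Equal logits reduce to the naive uniform weighting the paper argues against.
total, w = weighted_aux_loss([0.4, 0.8, 1.2], [0.0, 0.0, 0.0])
print(round(total, 2), [round(x, 2) for x in w])  # 0.8 [0.33, 0.33, 0.33]
```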
+
+
+ Collaborative recommendation fundamentally involves learning high-quality
+user and item representations from interaction data. Recently, graph
+convolution networks (GCNs) have advanced the field by utilizing high-order
+connectivity patterns in interaction graphs, as evidenced by state-of-the-art
+methods like PinSage and LightGCN. However, one key limitation has not been
+well addressed in existing solutions: capturing long-range collaborative
+filtering signals, which are crucial for modeling user preference. In this
+work, we propose a new graph transformer (GT) framework --
+\textit{Position-aware Graph Transformer for Recommendation} (PGTR), which
+combines the global modeling capability of Transformer blocks with the local
+neighborhood feature extraction of GCNs. The key insight is to explicitly
+incorporate node position and structure information from the user-item
+interaction graph into the GT architecture via several purpose-designed positional
+encodings. The long-range collaborative signals from the Transformer block are
+then combined linearly with the local neighborhood features from the GCN
+backbone to enhance node embeddings for final recommendations. Empirical
+studies demonstrate the effectiveness of the proposed PGTR method when
+implemented on various GCN-based backbones across four real-world datasets, and
+the robustness against interaction sparsity as well as noise.
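The final combination step the abstract describes — mixing the Transformer block's long-range signals with the GCN backbone's local features — is literally a linear blend. A sketch, with `alpha` standing in for the learned combination coefficient (an assumption; the paper learns the combination):

```python
import numpy as np

def pgtr_node_embedding(local_gcn, global_gt, alpha):
    """Final node embedding = linear mix of local GCN neighborhood
    features and global Transformer long-range features."""
    return alpha * global_gt + (1.0 - alpha) * local_gcn

local = np.array([[1.0, 0.0], [0.0, 1.0]])   # per-node GCN features
globl = np.array([[0.0, 2.0], [2.0, 0.0]])   # per-node Transformer features
print(pgtr_node_embedding(local, globl, 0.25))
```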
+
+
+
+
+
+
+
+ ☆ Optimization and Scalability of Collaborative Filtering Algorithms in
+ Large Language Models
+
+
+ With the rapid development of large language models (LLMs) and the growing
+demand for personalized content, recommendation systems have become critical in
+enhancing user experience and driving engagement. Collaborative filtering
+algorithms, being core to many recommendation systems, have garnered
+significant attention for their efficiency and interpretability. However,
+traditional collaborative filtering approaches face numerous challenges when
+integrated into large-scale LLM-based systems, including high computational
+costs, severe data sparsity, cold start problems, and lack of scalability. This
+paper investigates the optimization and scalability of collaborative filtering
+algorithms in large language models, addressing these limitations through
+advanced optimization strategies. Firstly, we analyze the fundamental
+principles of collaborative filtering algorithms and their limitations when
+applied in LLM-based contexts. Next, several optimization techniques such as
+matrix factorization, approximate nearest neighbor search, and parallel
+computing are proposed to enhance computational efficiency and model accuracy.
+Additionally, strategies such as distributed architecture and model compression
+are explored to facilitate dynamic updates and scalability in data-intensive
+environments.
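Of the optimization techniques listed, matrix factorization is the easiest to make concrete. A classic SGD sketch (not tied to any particular system in the paper; hyperparameters are illustrative):

```python
import numpy as np

def factorize(R, mask, k=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Plain matrix factorization by SGD: approximate observed entries of a
    user-item rating matrix R with the product of two low-rank factors."""
    rng = np.random.default_rng(seed)
    U = rng.normal(0, 0.1, (R.shape[0], k))
    V = rng.normal(0, 0.1, (R.shape[1], k))
    for _ in range(epochs):
        for u, i in np.argwhere(mask):  # iterate over observed ratings only
            err = R[u, i] - U[u] @ V[i]
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * U[u] - reg * V[i])
    return U, V

# Two "action" users and one "comedy" user, fully observed.
R = np.array([[5.0, 1.0], [5.0, 1.0], [1.0, 5.0]])
U, V = factorize(R, np.ones_like(R, dtype=bool))
print((np.abs(R - U @ V.T) < 1.0).all())
```

The mask makes the sparsity handling explicit: only observed interactions drive updates, which is also what makes approximate nearest-neighbor search over the learned factors useful at serving time.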
+
+
+
+
+
+
+
+ ☆ Enhanced Recommendation Combining Collaborative Filtering and Large
+ Language Models
+
+
+ With the advent of the information explosion era, the importance of
+recommendation systems in various applications is increasingly significant.
+Traditional collaborative filtering algorithms are widely used due to their
+effectiveness in capturing user behavior patterns, but they encounter
+limitations when dealing with cold start problems and data sparsity. Large
+Language Models (LLMs), with their strong natural language understanding and
+generation capabilities, provide a new breakthrough for recommendation systems.
+This study proposes an enhanced recommendation method that combines
+collaborative filtering and LLMs, aiming to leverage collaborative filtering's
+advantage in modeling user preferences while enhancing the understanding of
+textual information about users and items through LLMs to improve
+recommendation accuracy and diversity. This paper first introduces the
+fundamental theories of collaborative filtering and LLMs, then designs a
+recommendation system architecture that integrates both, and validates the
+system's effectiveness through experiments. The results show that the hybrid
+model based on collaborative filtering and LLMs significantly improves
+precision, recall, and user satisfaction, demonstrating its potential in
+complex recommendation scenarios.
+
+
+
+
+
+
+
+ ♻ ☆ RAGONITE: Iterative Retrieval on Induced Databases and Verbalized RDF
+ for Conversational QA over KGs with RAG
+
+
+
+
+
+
+
+
+ Rishiraj Saha Roy, Chris Hinze, Joel Schlotthauer, Farzad Naderi, Viktor Hangya, Andreas Foltyn, Luzian Hahn, Fabian Kuech
+
+
+ Conversational question answering (ConvQA) is a convenient means of searching
+over RDF knowledge graphs (KGs), where a prevalent approach is to translate
+natural language questions to SPARQL queries. However, SPARQL has certain
+shortcomings: (i) it is brittle for complex intents and conversational
+questions, and (ii) it is not suitable for more abstract needs. Instead, we
+propose a novel two-pronged system where we fuse: (i) SQL-query results over a
+database automatically derived from the KG, and (ii) text-search results over
+verbalizations of KG facts. Our pipeline supports iterative retrieval: when the
+results of any branch are found to be unsatisfactory, the system can
+automatically opt for further rounds. We put everything together in a retrieval
+augmented generation (RAG) setup, where an LLM generates a coherent response
+from accumulated search results. We demonstrate the superiority of our proposed
+system over several baselines on a knowledge graph of BMW automobiles.
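The two-pronged iterative loop can be sketched as follows. All three callables are stand-ins (toy closures, not RAGONITE's components): `sql_branch` for SQL over the induced database, `text_branch` for search over verbalized facts, and `satisfied` for the check that decides whether another round is needed.

```python
def ragonite_retrieve(question, sql_branch, text_branch, satisfied, max_rounds=3):
    """Pool results from both branches; run further rounds while the
    accumulated evidence is judged unsatisfactory."""
    results = []
    for round_no in range(max_rounds):
        results += sql_branch(question, round_no) + text_branch(question, round_no)
        if satisfied(results):
            break
    return results

sql = lambda q, r: [f"sql:{r}"]
text = lambda q, r: [f"text:{r}"]
enough = lambda res: len(res) >= 4   # toy satisfaction check
print(ragonite_retrieve("price of X3?", sql, text, enough))
```

In the full system an LLM then generates one coherent answer from the accumulated results, which is the RAG step of the pipeline.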
+
+
+
+ comment: Accepted at BTW 2025, 10 pages
+
+
+
+
+
+
+ ♻ ☆ Large Language Model Simulator for Cold-Start Recommendation WSDM 2025
+
+
+ Recommending cold items remains a significant challenge in billion-scale
+online recommendation systems. While warm items benefit from historical user
+behaviors, cold items rely solely on content features, limiting their
+recommendation performance and impacting user experience and revenue. Current
+models generate synthetic behavioral embeddings from content features but fail
+to address the core issue: the absence of historical behavior data. To tackle
+this, we introduce the LLM Simulator framework, which leverages large language
+models to simulate user interactions for cold items, fundamentally addressing
+the cold-start problem. However, simply using an LLM to traverse all users can
+introduce significant complexity in billion-scale systems. To manage the
+computational complexity, we propose a coupled funnel ColdLLM framework for
+online recommendation. ColdLLM efficiently reduces the number of candidate
+users from billions to hundreds using a trained coupled filter, allowing the
+LLM to operate efficiently and effectively on the filtered set. Extensive
+experiments show that ColdLLM significantly surpasses baselines in cold-start
+recommendations, including Recall and NDCG metrics. A two-week A/B test also
+validates that ColdLLM can effectively increase the cold-start period GMV.
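The coupled-funnel idea is essentially a two-stage ranking budget. A minimal sketch with stand-in scorers (the real system trains a coupled filter and prompts an LLM; both callables here are toys):

```python
def coldllm_funnel(users, cheap_score, llm_score, shortlist=100, final=10):
    """Stage 1: a cheap filter cuts billions of candidates to a shortlist.
    Stage 2: the expensive LLM-style scorer runs only on the shortlist."""
    pool = sorted(users, key=cheap_score, reverse=True)[:shortlist]
    calls = len(pool)  # number of expensive calls actually made
    ranked = sorted(pool, key=llm_score, reverse=True)[:final]
    return ranked, calls

users = list(range(10_000))
cheap = lambda u: -abs(u - 7000)   # toy filter: prefers users near id 7000
expensive = lambda u: -(u % 7)     # toy stand-in for the LLM simulator
top, n_calls = coldllm_funnel(users, cheap, expensive, shortlist=100, final=5)
print(n_calls)  # 100 expensive calls instead of 10,000
```

The point of the design is in `n_calls`: the expensive model's cost scales with the shortlist, not the user base.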
+
+
+ For modern recommender systems, the use of low-dimensional latent
+representations to embed users and items based on their observed interactions
+has become commonplace. However, many existing recommendation models are
+primarily designed for coarse-grained and homogeneous interactions, which
+limits their effectiveness in two critical dimensions. Firstly, these models
+fail to leverage the relational dependencies that exist across different types
+of user behaviors, such as page views, collects, comments, and purchases.
+Secondly, they struggle to capture the fine-grained latent factors that drive
+user interaction patterns. To address these limitations, we present a
+heterogeneous graph collaborative filtering model MixRec that excels at
+disentangling users' multi-behavior interaction patterns and uncovering the
+latent intent factors behind each behavior. Our model achieves this by
+incorporating intent disentanglement and multi-behavior modeling, facilitated
+by a parameterized heterogeneous hypergraph architecture. Furthermore, we
+introduce a novel contrastive learning paradigm that adaptively explores the
+advantages of self-supervised data augmentation, thereby enhancing the model's
+resilience against data sparsity and its expressiveness under relation
+heterogeneity. To validate the efficacy of MixRec, we conducted extensive
+experiments on three public datasets. The results clearly demonstrate its
+superior performance, significantly outperforming various state-of-the-art
+baselines. Our model is open-sourced and available at:
+https://github.com/HKUDS/MixRec.
+
+
+
+
+
+
+
+
+ Peihao Xiang, Kaida Wu, Chaohao Lin, Ou Bai
+
+
+ This paper expands the cascaded network branch of the autoencoder-based
+multi-task learning (MTL) framework for dynamic facial expression recognition,
+namely Multi-Task Cascaded Autoencoder for Dynamic Facial Expression
+Recognition (MTCAE-DFER). MTCAE-DFER builds a plug-and-play cascaded decoder
+module, which is based on the Vision Transformer (ViT) architecture and employs
+the decoder concept of Transformer to reconstruct the multi-head attention
+module. The decoder output from the previous task serves as the query (Q),
+representing local dynamic features, while the Video Masked Autoencoder
+(VideoMAE) shared encoder output acts as both the key (K) and value (V),
+representing global dynamic features. This setup facilitates interaction
+between global and local dynamic features across related tasks. Additionally,
+this proposal aims to alleviate overfitting of complex large models. We utilize an
+autoencoder-based multi-task cascaded learning approach to explore the impact
+of dynamic face detection and dynamic face landmark on dynamic facial
+expression recognition, which enhances the model's generalization ability.
+Extensive ablation experiments and comparisons with state-of-the-art (SOTA)
+methods on various public datasets for dynamic facial expression recognition
+demonstrate the robustness of the MTCAE-DFER model and the effectiveness of
+global-local dynamic feature interaction among related tasks.
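The Q/K/V arrangement the abstract describes is a standard cross-attention pattern: queries come from the previous task's decoder (local features), while keys and values come from the shared VideoMAE encoder (global features). A single-head sketch (dimensions and names are illustrative):

```python
import numpy as np

def cross_attention(Q, K, V):
    """Single-head scaled dot-product cross-attention."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))     # local dynamic features (previous task's decoder)
K = rng.normal(size=(5, 4))     # global dynamic features (shared encoder)
out = cross_attention(Q, K, K)  # V = K: both come from the shared encoder
print(out.shape)  # (2, 4)
```

Each decoder output row attends over all encoder tokens, which is how local and global dynamic features interact across the cascaded tasks.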
+
+
+
+
+
+
+
+
+ ☆ XRFlux: Virtual Reality Benchmark for Edge Caching Systems
+
+
+ We introduce a Unity based benchmark XRFlux for evaluating Virtual Reality
+(VR) delivery systems using edge-cloud caching. As VR applications and systems
+progress, the need to meet strict latency and Quality of Experience (QoE)
+requirements is increasingly evident. In the context of VR, traditional cloud
+architectures (e.g., remote AWS S3 for content delivery) often struggle to meet
+these demands, especially for users of the same application in different
+locations. With edge computing, resources are brought closer to users in
+efforts to reduce latency and improve QoEs. However, VR's dynamic nature, with
+changing fields of view (FoVs) and user synchronization requirements, creates
+various challenges for edge caching. We address the lack of suitable benchmarks
+and propose a framework that simulates multiuser VR scenarios while logging
+users' interaction with objects within their actual and predicted FoVs. The
+benchmark's activity log can then be played back through an edge cache to
+assess the resulting QoEs. This tool fills a gap by supporting research in the
+optimization of edge caching (and other edge-cloud functions) for VR streaming.
+
+
+ Blind video quality assessment (BVQA) has been actively researched for
+user-generated content (UGC) videos. Recently, super-resolution (SR) techniques
+have been widely applied in UGC. Therefore, an effective BVQA method for both
+UGC and SR scenarios is essential. Temporal inconsistency, referring to
+irregularities between consecutive frames, is relevant to video quality.
+Current BVQA approaches typically model temporal relationships in UGC videos
+using statistics of motion information, but inconsistencies remain unexplored.
+Additionally, different from temporal inconsistency in UGC videos, such
+inconsistency in SR videos is amplified due to upscaling algorithms. In this
+paper, we introduce the Temporal Inconsistency Guided Blind Video Quality
+Assessment (TINQ) metric, demonstrating that exploring temporal inconsistency
+is crucial for effective BVQA. Since temporal inconsistencies vary between UGC
+and SR videos, they are calculated in different ways. Based on this, a spatial
+module highlights inconsistent areas across consecutive frames at coarse and
+fine granularities. In addition, a temporal module aggregates features over
+time in two stages. The first stage employs a visual memory capacity block to
+adaptively segment the time dimension based on estimated complexity, while the
+second stage focuses on selecting key features. The stages work together
+through Consistency-aware Fusion Units to regress cross-time-scale video
+quality. Extensive experiments on UGC and SR video quality datasets show that
+our method outperforms existing state-of-the-art BVQA methods. Code is
+available at https://github.com/Lighting-YXLI/TINQ.
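A minimal sketch of the kind of temporal-inconsistency cue the paragraph describes: per-pixel absolute differences between consecutive frames, averaged into one score per frame pair. The real TINQ metric works at coarse and fine granularities with learned modules; the frames below are invented toy data.

```python
# Toy temporal-inconsistency measure: mean absolute difference between
# consecutive frames (frames are nested lists of pixel intensities).

def frame_inconsistency(prev, curr):
    """Mean absolute pixel difference between two frames."""
    total, count = 0.0, 0
    for row_p, row_c in zip(prev, curr):
        for p, c in zip(row_p, row_c):
            total += abs(p - c)
            count += 1
    return total / count

def video_inconsistency(frames):
    """One inconsistency score per consecutive frame pair."""
    return [frame_inconsistency(a, b) for a, b in zip(frames, frames[1:])]

# A static frame pair scores 0; an abrupt change scores high.
frames = [
    [[10, 10], [10, 10]],
    [[10, 10], [10, 10]],   # identical -> inconsistency 0
    [[50, 50], [50, 50]],   # abrupt jump -> inconsistency 40
]
scores = video_inconsistency(frames)
```

Upscaling artifacts in SR videos would inflate these per-pair scores, which is why the paper computes the cue differently for UGC and SR content.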
+
+
+
+
+
+
+
+ ☆ Adaptive Rate Control for Deep Video Compression with Rate-Distortion
+ Prediction
+
+
+
+
+
+
+
+
+ Bowen Gu, Hao Chen, Ming Lu, Jie Yao, Zhan Ma
+
+
+ Deep video compression has made significant progress in recent years,
+achieving rate-distortion performance that surpasses that of traditional video
+compression methods. However, rate control schemes tailored for deep video
+compression have not been well studied. In this paper, we propose a neural
+network-based $\lambda$-domain rate control scheme for deep video compression,
+which determines the coding parameter $\lambda$ for each to-be-coded frame
+based on the rate-distortion-$\lambda$ (R-D-$\lambda$) relationships directly
+learned from uncompressed frames, achieving high rate control accuracy
+efficiently without the need for pre-encoding. Moreover, this content-aware
+scheme is able to mitigate inter-frame quality fluctuations and adapt to abrupt
+changes in video content. Specifically, we introduce two neural network-based
+predictors to estimate the relationship between bitrate and $\lambda$, as well
+as the relationship between distortion and $\lambda$ for each frame. Then we
+determine the coding parameter $\lambda$ for each frame to achieve the target
+bitrate. Experimental results demonstrate that our approach achieves high rate
+control accuracy at the mini-GOP level with low time overhead and mitigates
+inter-frame quality fluctuations across video content of varying resolutions.
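The closed-loop step can be illustrated under a common assumption: if the predicted rate-lambda relationship is a power law R(lambda) = alpha * lambda**beta (the functional form and the constants below are illustrative stand-ins for the paper's learned predictors), the coding parameter for a target bitrate follows in closed form.

```python
# Hedged sketch of lambda-domain rate control with an assumed power-law
# R(lambda) = alpha * lambda**beta fitted per frame.

def lambda_for_target_rate(target_rate, alpha, beta):
    """Invert R = alpha * lambda**beta to get the coding parameter."""
    return (target_rate / alpha) ** (1.0 / beta)

def predicted_rate(lam, alpha, beta):
    """Bitrate the fitted model predicts for a given lambda."""
    return alpha * lam ** beta

# Example frame model: R = 120 * lambda**-0.8 (rate falls as lambda grows,
# i.e. heavier quantization). Target 60 units -> lambda > 1.
alpha, beta = 120.0, -0.8
lam = lambda_for_target_rate(60.0, alpha, beta)
```

The paper's contribution is predicting such per-frame relationships from uncompressed content, so no pre-encoding pass is needed to fit the constants.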
+
+
+
+
+
+
+
+ ☆ Towards Expressive Video Dubbing with Multiscale Multimodal Context
+ Interaction SP 2025
+
+
+ Automatic Video Dubbing (AVD) generates speech aligned with lip motion and
+facial emotion from scripts. Recent research focuses on modeling multimodal
+context to enhance prosody expressiveness but overlooks two key issues: 1)
+Multiscale prosody expression attributes in the context influence the current
+sentence's prosody. 2) Prosody cues in context interact with the current
+sentence, impacting the final prosody expressiveness. To tackle these
+challenges, we propose M2CI-Dubber, a Multiscale Multimodal Context Interaction
+scheme for AVD. This scheme includes two shared M2CI encoders to model the
+multiscale multimodal context and facilitate its deep interaction with the
+current sentence. By extracting global and local features for each modality in
+the context, utilizing attention-based mechanisms for aggregation and
+interaction, and employing an interaction-based graph attention network for
+fusion, the proposed approach enhances the prosody expressiveness of
+synthesized speech for the current sentence. Experiments on the Chem dataset
+show our model outperforms baselines in dubbing expressiveness. The code and
+demos are available at
+https://github.com/AI-S2-Lab/M2CI-Dubber.
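The attention-based aggregation mentioned above can be pictured with a toy example: context feature vectors are weighted by their similarity to the current sentence's feature and summed, so relevant context dominates the prosody cue. Shapes and values here are invented, not from M2CI-Dubber.

```python
# Toy dot-product attention aggregation over context feature vectors.
import math

def attention_aggregate(query, context):
    """Softmax(dot(query, c))-weighted sum of context vectors."""
    scores = [sum(q * c for q, c in zip(query, vec)) for vec in context]
    m = max(scores)                         # stabilize the softmax
    exp = [math.exp(s - m) for s in scores]
    total = sum(exp)
    weights = [e / total for e in exp]
    return [sum(w * vec[d] for w, vec in zip(weights, context))
            for d in range(len(query))]

query = [1.0, 0.0]
context = [[1.0, 0.0], [0.0, 1.0]]  # first vector matches the query
agg = attention_aggregate(query, context)
# The aggregate leans toward the matching context vector.
```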
+
+
+ Text-editable and pose-controllable character video generation is a
+challenging but prevailing topic with practical applications. However, existing
+approaches mainly focus on single-object video generation with pose guidance,
+ignoring the realistic situation that multiple characters often appear
+concurrently in a scene. To tackle this, we propose a novel multi-character
+video generation framework in a tuning-free manner, based on separated text
+and pose guidance. Specifically, we first extract character masks from the
+pose sequence to identify the spatial position of each character being
+generated, and then obtain individual prompts for each character with LLMs for
+precise text guidance. Moreover, a spatially aligned cross-attention and a
+multi-branch control module are proposed to generate fine-grained,
+controllable multi-character videos. Visualized results of the generated
+videos demonstrate the precise controllability of our method for
+multi-character generation. We also verify the generality of our method by
+applying it to various personalized T2I models. Moreover, quantitative results
+show that our approach achieves superior performance compared with previous
+works.
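The mask-extraction step can be sketched in its simplest form: each character's mask is the bounding box of its 2D pose keypoints, later used to gate that character's prompt. This is an invented illustration, not the paper's code, which likely uses denser pose representations.

```python
# Toy per-character mask from 2D pose keypoints (bounding-box mask).

def keypoints_to_mask(keypoints, height, width):
    """Binary mask (nested lists) covering the keypoints' bounding box."""
    xs = [x for x, y in keypoints]
    ys = [y for x, y in keypoints]
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)
    return [[1 if (x0 <= x <= x1 and y0 <= y <= y1) else 0
             for x in range(width)] for y in range(height)]

# Two characters in a 4x6 frame occupy disjoint regions.
mask_a = keypoints_to_mask([(0, 0), (1, 2)], height=4, width=6)
mask_b = keypoints_to_mask([(4, 1), (5, 3)], height=4, width=6)
overlap = sum(a & b for row_a, row_b in zip(mask_a, mask_b)
              for a, b in zip(row_a, row_b))
```

Disjoint masks (overlap of zero) are what let the spatially aligned cross-attention route each character's text prompt to its own region.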
+
+
+ Diffusion Probabilistic Models (DPMs) have emerged as the de facto approach
+for high-fidelity image synthesis, operating diffusion processes on continuous
+VAE latents, an approach that differs significantly from the text generation methods
+employed by Large Language Models (LLMs). In this paper, we introduce a novel
+generative framework, the Recurrent Diffusion Probabilistic Model (RDPM), which
+enhances the diffusion process through a recurrent token prediction mechanism,
+thereby pioneering the field of Discrete Diffusion. By progressively
+introducing Gaussian noise into the latent representations of images and
+encoding them into vector-quantized tokens in a recurrent manner, RDPM
+facilitates a unique diffusion process on discrete-value domains. This process
+iteratively predicts the token codes for subsequent timesteps, transforming the
+initial standard Gaussian noise into the source data distribution, aligning
+with GPT-style models in terms of the loss function. RDPM demonstrates superior
+performance while benefiting from the speed advantage of requiring only a few
+inference steps. This model not only leverages the diffusion process to ensure
+high-quality generation but also converts continuous signals into a series of
+high-fidelity discrete tokens, thereby maintaining a unified optimization
+strategy with other discrete tokens, such as text. We anticipate that this work
+will contribute to the development of a unified model for multimodal
+generation, specifically by integrating continuous signal domains such as
+images, videos, and audio with text. We will release the code and model weights
+to the open-source community.
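The vector-quantization step RDPM relies on can be shown in miniature: a noisy latent is snapped to the nearest codebook entry, yielding the discrete token the recurrent predictor then operates on. The codebook and latent values below are made up for illustration.

```python
# Toy nearest-neighbor vector quantization: map a continuous latent to the
# index and value of the closest codebook vector (squared L2 distance).

def quantize(latent, codebook):
    """Return (token_index, code) of the nearest codebook vector."""
    def dist(code):
        return sum((a - b) ** 2 for a, b in zip(latent, code))
    idx = min(range(len(codebook)), key=lambda i: dist(codebook[i]))
    return idx, codebook[idx]

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
token, code = quantize((0.9, 0.1), codebook)  # snaps to (1.0, 0.0)
```

Predicting such token indices step by step is what lets RDPM share a GPT-style cross-entropy loss with text models.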
+
+
+
+ comment: 8 pages
+
+
+
+
+
+
+ ♻ ☆ Stimulus Modality Matters: Impact of Perceptual Evaluations from
+ Different Modalities on Speech Emotion Recognition System Performance ICASSP 2025
+
+
+ Speech Emotion Recognition (SER) systems rely on speech input and emotional
+labels annotated by humans. However, various emotion databases collect
+perceptional evaluations in different ways. For instance, the IEMOCAP dataset
+uses video clips with sounds for annotators to provide their emotional
+perceptions. However, the most significant English emotion dataset, the
+MSP-PODCAST, only provides speech for raters to choose the emotional ratings.
+Nevertheless, using speech as input is the standard approach to training SER
+systems. Therefore, the open question is the emotional labels elicited by which
+scenarios are the most effective for training SER systems. We comprehensively
+compare the effectiveness of SER systems trained with labels elicited by
+different modality stimuli and evaluate the SER systems on various testing
+conditions. Also, we introduce an all-inclusive label that combines all labels
+elicited by various modalities. We show that using labels elicited by
+voice-only stimuli for training yields better performance on the test set.
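One plausible reading of the "all-inclusive" label is a pooled majority vote over annotations collected under each stimulus modality; the modality names and votes below are invented for illustration, and the paper's actual aggregation may differ.

```python
# Toy all-inclusive label: pool per-modality annotator votes and take the
# majority emotion.
from collections import Counter

def all_inclusive_label(votes_by_modality):
    """Merge per-modality annotator votes into one consensus label."""
    pooled = Counter()
    for votes in votes_by_modality.values():
        pooled.update(votes)
    return pooled.most_common(1)[0][0]

votes = {
    "voice_only":  ["angry", "angry", "neutral"],
    "video_only":  ["neutral", "angry"],
    "audio_video": ["angry", "sad"],
}
label = all_inclusive_label(votes)  # "angry" wins 4 of 7 votes
```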
+
+
+ Recent Multimodal Large Language Models (MLLMs) often use large numbers of
+image tokens to compensate for their visual shortcomings, which not only
+introduces obvious redundancy but also greatly exacerbates the already high
+computational cost. Token pruning is an effective solution for speeding up
+MLLMs, but
+when and how to drop tokens still remains a challenge. In this paper, we
+propose a novel and training-free approach for the effective visual token
+pruning of MLLMs, termed FitPrune, which can quickly produce a complete pruning
+recipe for MLLMs according to a pre-defined budget. Specifically, FitPrune
+considers token pruning as a statistical problem of MLLM and its objective is
+to find out an optimal pruning scheme that can minimize the divergence of the
+attention distributions before and after pruning. In practice, FitPrune can be
+quickly accomplished based on the attention statistics from a small batch of
+inference data, avoiding the expensive trials of MLLMs. According to the
+pruning recipe, an MLLM can directly remove the redundant visual tokens of
+different examples during inference. To validate FitPrune, we apply it to a set
+of recent MLLMs, including LLaVA-1.5, LLaVA-HR and LLaVA-NEXT, and conduct
+extensive experiments on a set of benchmarks. The experimental results show
+that our FitPrune can reduce the computational complexity to a large extent
+while retaining high performance, e.g., -54.9% FLOPs for LLaVA-NEXT with only
+a 0.5% accuracy drop. Notably, the pruning recipe can be obtained in
+about 5 minutes. Our code is available at https://github.com/ywh187/FitPrune.
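An illustrative sketch (not the released FitPrune code) of the underlying intuition: rank visual tokens by the attention mass they receive and keep the top fraction allowed by a pre-defined budget, so the retained tokens preserve most of the attention distribution. The scores below are made up.

```python
# Toy attention-guided token pruning: keep the highest-attention tokens
# within a budget and measure how much attention mass survives.

def prune_tokens(attention_scores, keep_ratio):
    """Return (sorted) indices of tokens to keep, highest attention first."""
    n_keep = max(1, int(len(attention_scores) * keep_ratio))
    ranked = sorted(range(len(attention_scores)),
                    key=lambda i: attention_scores[i], reverse=True)
    return sorted(ranked[:n_keep])

def retained_attention(attention_scores, kept):
    """Fraction of total attention mass preserved after pruning."""
    return sum(attention_scores[i] for i in kept) / sum(attention_scores)

# Eight visual tokens; half of them carry almost all the attention.
scores = [0.30, 0.02, 0.25, 0.01, 0.20, 0.02, 0.18, 0.02]
kept = prune_tokens(scores, keep_ratio=0.5)
coverage = retained_attention(scores, kept)
```

Keeping 50% of the tokens here preserves 93% of the attention mass, which mirrors why pruning recipes fitted to attention statistics can halve FLOPs with little accuracy loss.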
+
+