-
Training with Pseudo-Code for Instruction Following
Authors:
Prince Kumar,
Rudra Murthy,
Riyaz Bhat,
Danish Contractor
Abstract:
Despite the rapid progress in the capabilities of Large Language Models (LLMs), they continue to have difficulty following relatively simple, unambiguous instructions, especially when compositions are involved. In this paper, we take inspiration from recent work that suggests that models may follow instructions better when they are expressed in pseudo-code. However, writing pseudo-code programs can be tedious and using few-shot demonstrations to craft code representations for use in inference can be unnatural for non-expert users of LLMs. To overcome these limitations, we propose fine-tuning LLMs with instruction-tuning data that additionally includes instructions re-expressed in pseudo-code along with the final response. We evaluate models trained using our method on $11$ publicly available benchmarks comprising tasks related to instruction-following, mathematics, and common-sense reasoning. We conduct rigorous experiments with $5$ different models and find that not only do models follow instructions better when trained with pseudo-code, they also retain their capabilities on the other tasks related to mathematical and common-sense reasoning. Specifically, we observe a relative gain of $3$--$19$% on instruction-following benchmarks, and an average gain of up to 14% across all tasks.
Submitted 23 May, 2025;
originally announced May 2025.
-
Federated Learning over 5G, WiFi, and Ethernet: Measurements and Evaluation
Authors:
Robert J. Hayek,
Joaquin Chung,
Kayla Comer,
Chandra R. Murthy,
Rajkumar Kettimuthu,
Igor Kadota
Abstract:
Federated Learning (FL) deployment using IoT devices is an area that is poised to significantly benefit from advances in NextG wireless. In this paper, we deploy an FL application using a 5G-NR Standalone (SA) testbed with open-source and Commercial Off-the-Shelf (COTS) components. The 5G testbed architecture consists of a network of resource-constrained edge devices, namely Raspberry Pis, and a central server equipped with a Software Defined Radio (SDR) and running O-RAN software. Our testbed also allows edge devices to communicate with the server using WiFi and Ethernet, instead of 5G. FL is deployed using the Flower FL framework, for which we developed a comprehensive instrumentation tool to collect and analyze diverse communications and machine learning performance metrics, including model aggregation time, downlink transmission time, training time, and uplink transmission time. Leveraging these measurements, we perform a comparative analysis of the FL application across three network interfaces: 5G, WiFi, and Ethernet. Our experimental results suggest that, on 5G, the uplink model transfer time is a significant factor in the convergence time of FL. In particular, we find that the 5G uplink accounts for roughly 23% of the duration of one average communication round when using all edge devices in our testbed. When comparing the uplink time of the 5G testbed, we find that it is 33.3x higher than Ethernet and 17.8x higher than WiFi. Our results also suggest that 5G exacerbates the well-known straggler effect. For reproducibility, we have open-sourced our FL application, instrumentation tools, and testbed configuration.
Submitted 6 April, 2025;
originally announced April 2025.
-
PersonaBench: Evaluating AI Models on Understanding Personal Information through Accessing (Synthetic) Private User Data
Authors:
Juntao Tan,
Liangwei Yang,
Zuxin Liu,
Zhiwei Liu,
Rithesh Murthy,
Tulika Manoj Awalgaonkar,
Jianguo Zhang,
Weiran Yao,
Ming Zhu,
Shirley Kokane,
Silvio Savarese,
Huan Wang,
Caiming Xiong,
Shelby Heinecke
Abstract:
Personalization is critical in AI assistants, particularly in the context of private AI models that work with individual users. A key scenario in this domain involves enabling AI models to access and interpret a user's private data (e.g., conversation history, user-AI interactions, app usage) to understand personal details such as biographical information, preferences, and social connections. However, due to the sensitive nature of such data, there are no publicly available datasets that allow us to assess an AI model's ability to understand users through direct access to personal information.
To address this gap, we introduce a synthetic data generation pipeline that creates diverse, realistic user profiles and private documents simulating human activities. Leveraging this synthetic data, we present PersonaBench, a benchmark designed to evaluate AI models' performance in understanding personal information derived from simulated private user data.
We evaluate Retrieval-Augmented Generation (RAG) pipelines using questions directly related to a user's personal information, supported by the relevant private documents provided to the models. Our results reveal that current retrieval-augmented AI models struggle to answer private questions by extracting personal information from user documents, highlighting the need for improved methodologies to enhance personalization capabilities in AI.
Submitted 27 February, 2025;
originally announced February 2025.
-
Granite Embedding Models
Authors:
Parul Awasthy,
Aashka Trivedi,
Yulong Li,
Mihaela Bornea,
David Cox,
Abraham Daniels,
Martin Franz,
Gabe Goodhart,
Bhavani Iyer,
Vishwajeet Kumar,
Luis Lastras,
Scott McCarley,
Rudra Murthy,
Vignesh P,
Sara Rosenthal,
Salim Roukos,
Jaydeep Sen,
Sukriti Sharma,
Avirup Sil,
Kate Soule,
Arafat Sultan,
Radu Florian
Abstract:
We introduce the Granite Embedding models, a family of encoder-based embedding models designed for retrieval tasks, spanning dense-retrieval and sparse-retrieval architectures, with both English and multilingual capabilities. This report provides the technical details of training these highly effective 12-layer embedding models, along with their efficient 6-layer distilled counterparts. Extensive evaluations show that the models, developed with techniques like retrieval-oriented pretraining, contrastive finetuning, knowledge distillation, and model merging, significantly outperform publicly available models of similar sizes on both internal IBM retrieval and search tasks, and have equivalent performance on widely used information retrieval benchmarks, while being trained on high-quality data suitable for enterprise use. We publicly release all our Granite Embedding models under the Apache 2.0 license, allowing both research and commercial use at https://huggingface.co/collections/ibm-granite.
Submitted 27 February, 2025;
originally announced February 2025.
-
Stroke classification using Virtual Hybrid Edge Detection from in silico electrical impedance tomography data
Authors:
Juan Pablo Agnelli,
Fernando S. Moura,
Siiri Rautio,
Melody Alsaker,
Rashmi Murthy,
Matti Lassas,
Samuli Siltanen
Abstract:
Electrical impedance tomography (EIT) is a non-invasive imaging method for recovering the internal conductivity of a physical body from electric boundary measurements. EIT combined with machine learning has shown promise for the classification of strokes. However, most previous works have used raw EIT voltage data as network inputs. We build upon a recent development which suggested the use of special noise-robust Virtual Hybrid Edge Detection (VHED) functions as network inputs, although that work used only highly simplified and mathematically ideal models. In this work we strengthen the case for the use of EIT, and VHED functions especially, for stroke classification. We design models with high detail and mathematical realism to test the use of VHED functions as inputs. Virtual patients are created using a physically detailed 2D head model which includes features known to create challenges in real-world imaging scenarios. Conductivity values are drawn from statistically realistic distributions, and phantoms are afflicted with either hemorrhagic or ischemic strokes of various shapes and sizes. Simulated noisy EIT electrode data, generated using the realistic Complete Electrode Model (CEM) as opposed to the mathematically ideal continuum model, is processed to obtain VHED functions. We compare the use of VHED functions as inputs against the alternative paradigm of using raw EIT voltages. Our results show that (i) stroke classification can be performed with high accuracy using 2D EIT data from physically detailed and mathematically realistic models, and (ii) in the presence of noise, VHED functions outperform raw data as network inputs.
Submitted 29 January, 2025; v1 submitted 24 January, 2025;
originally announced January 2025.
-
A Probably Approximately Correct Analysis of Group Testing Algorithms
Authors:
Sameera Bharadwaja H.,
Chandra R. Murthy
Abstract:
We consider the problem of identifying the defectives from a population of items via a non-adaptive group testing framework with a random pooling-matrix design. We analyze the sufficient number of tests needed for approximate set identification, i.e., for identifying almost all the defective and non-defective items with high confidence. To this end, we view the group testing problem as a function learning problem and develop our analysis using the probably approximately correct (PAC) framework. Using this formulation, we derive sufficiency bounds on the number of tests for three popular binary group testing algorithms: column matching, combinatorial basis pursuit, and definite defectives. We compare the derived bounds with the existing ones in the literature for exact recovery theoretically and using simulations. Finally, we contrast the three group testing algorithms under consideration in terms of the sufficient testing rate surface and the sufficient number of tests contours across the range of the approximation and confidence levels.
Submitted 30 November, 2024;
originally announced December 2024.
-
SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs
Authors:
Shirley Kokane,
Ming Zhu,
Tulika Awalgaonkar,
Jianguo Zhang,
Thai Hoang,
Akshara Prabhakar,
Zuxin Liu,
Tian Lan,
Liangwei Yang,
Juntao Tan,
Rithesh Murthy,
Weiran Yao,
Zhiwei Liu,
Juan Carlos Niebles,
Huan Wang,
Shelby Heinecke,
Caiming Xiong,
Silvio Savarese
Abstract:
Evaluating the output of Large Language Models (LLMs) is one of the most critical aspects of building a performant compound AI system. Since the output from LLMs propagates to downstream steps, identifying LLM errors is crucial to system performance. A common task for LLMs in AI systems is tool use. While there are several benchmark environments for evaluating LLMs on this task, they typically only give a success rate without any explanation of the failure cases. To solve this problem, we introduce SpecTool, a new benchmark to identify error patterns in LLM output on tool-use tasks. Our benchmark data set comprises queries from diverse environments that can be used to test for the presence of seven newly characterized error patterns. Using SpecTool, we show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use the analysis and insights from SpecTool to guide their error mitigation strategies.
Submitted 20 November, 2024;
originally announced November 2024.
-
MILU: A Multi-task Indic Language Understanding Benchmark
Authors:
Sshubam Verma,
Mohammed Safi Ur Rahman Khan,
Vishwajeet Kumar,
Rudra Murthy,
Jaydeep Sen
Abstract:
Evaluating Large Language Models (LLMs) in low-resource and linguistically diverse languages remains a significant challenge in NLP, particularly for languages using non-Latin scripts like those spoken in India. Existing benchmarks predominantly focus on English, leaving substantial gaps in assessing LLM capabilities in these languages. We introduce MILU, a Multi-task Indic Language Understanding Benchmark, a comprehensive evaluation benchmark designed to address this gap. MILU spans 8 domains and 41 subjects across 11 Indic languages, reflecting both general and culturally specific knowledge. With an India-centric design, MILU incorporates material from regional and state-level examinations, covering topics such as local history, arts, festivals, and laws, alongside standard subjects like science and mathematics. We evaluate over 42 LLMs, and find that current LLMs struggle with MILU, with GPT-4o achieving the highest average accuracy at 74 percent. Open multilingual models outperform language-specific fine-tuned models, which perform only slightly better than random baselines. Models also perform better in high-resource languages than in low-resource ones. Domain-wise analysis indicates that models perform poorly in culturally relevant areas like Arts and Humanities, Law and Governance compared to general fields like STEM. To the best of our knowledge, MILU is the first-of-its-kind benchmark focused on Indic languages, serving as a crucial step towards comprehensive cultural evaluation. All code, benchmarks, and artifacts are publicly available to foster open research.
Submitted 4 February, 2025; v1 submitted 4 November, 2024;
originally announced November 2024.
-
PRACT: Optimizing Principled Reasoning and Acting of LLM Agent
Authors:
Zhiwei Liu,
Weiran Yao,
Jianguo Zhang,
Rithesh Murthy,
Liangwei Yang,
Zuxin Liu,
Tian Lan,
Ming Zhu,
Juntao Tan,
Shirley Kokane,
Thai Hoang,
Juan Carlos Niebles,
Shelby Heinecke,
Huan Wang,
Silvio Savarese,
Caiming Xiong
Abstract:
We introduce the Principled Reasoning and Acting (PRAct) framework, a novel method for learning and enforcing action principles from trajectory data. Central to our approach is the use of text gradients from a reflection and optimization engine to derive these action principles. To adapt action principles to specific task requirements, we propose a new optimization framework, Reflective Principle Optimization (RPO). After execution, RPO employs a reflector to critique current action principles and an optimizer to update them accordingly. We develop the RPO framework under two scenarios: Reward-RPO, which uses environmental rewards for reflection, and Self-RPO, which conducts self-reflection without external rewards. Additionally, two RPO methods, RPO-Traj and RPO-Batch, are introduced to adapt to different settings. Experimental results across four environments demonstrate that the PRAct agent, leveraging the RPO framework, effectively learns and applies action principles to enhance performance.
Submitted 24 October, 2024;
originally announced October 2024.
-
KCIF: Knowledge-Conditioned Instruction Following
Authors:
Rudra Murthy,
Praveen Venkateswaran,
Prince Kumar,
Danish Contractor
Abstract:
LLM evaluation benchmarks have traditionally separated the testing of knowledge/reasoning capabilities from instruction following. In this work, we study the interaction between knowledge and instruction following, and observe that LLMs struggle to follow simple answer-modifying instructions, and are also distracted by instructions that should have no bearing on the original knowledge task answer. We leverage existing multiple-choice answer based knowledge benchmarks and apply a set of simple instructions which include manipulating text (e.g., change case), manipulating numeric quantities (e.g., increase value, change formatting), operating on lists (e.g., sort answer candidates), and distractor instructions (e.g., change case of numeric answers). We evaluate models at varying parameter sizes (1B-405B) from different model families and find that, surprisingly, all models report a significant drop in performance on such simple task compositions. While large-sized and frontier models report performance drops of 40-50%, in small and medium sized models the drop is severe (sometimes exceeding 80%). Our results highlight a limitation in the traditional separation of knowledge/reasoning and instruction following, and suggest that joint study of these capabilities is important. We release our benchmark dataset, evaluation framework code, and results for future work.
Submitted 23 May, 2025; v1 submitted 16 October, 2024;
originally announced October 2024.
-
Benchmarking and Building Zero-Shot Hindi Retrieval Model with Hindi-BEIR and NLLB-E5
Authors:
Arkadeep Acharya,
Rudra Murthy,
Vishwajeet Kumar,
Jaydeep Sen
Abstract:
Given the large number of Hindi speakers worldwide, there is a pressing need for robust and efficient information retrieval systems for Hindi. Despite ongoing research, comprehensive benchmarks for evaluating retrieval models in Hindi are lacking. To address this gap, we introduce the Hindi-BEIR benchmark, comprising 15 datasets across seven distinct tasks. We evaluate state-of-the-art multilingual retrieval models on the Hindi-BEIR benchmark, identifying task and domain-specific challenges that impact Hindi retrieval performance. Building on the insights from these results, we introduce NLLB-E5, a multilingual retrieval model that leverages a zero-shot approach to support Hindi without the need for Hindi training data. We believe our contributions, which include the release of the Hindi-BEIR benchmark and the NLLB-E5 model, will prove to be a valuable resource for researchers and promote advancements in multilingual retrieval models.
Submitted 25 October, 2024; v1 submitted 9 September, 2024;
originally announced September 2024.
-
xLAM: A Family of Large Action Models to Empower AI Agent Systems
Authors:
Jianguo Zhang,
Tian Lan,
Ming Zhu,
Zuxin Liu,
Thai Hoang,
Shirley Kokane,
Weiran Yao,
Juntao Tan,
Akshara Prabhakar,
Haolin Chen,
Zhiwei Liu,
Yihao Feng,
Tulika Awalgaonkar,
Rithesh Murthy,
Eric Hu,
Zeyuan Chen,
Ran Xu,
Juan Carlos Niebles,
Shelby Heinecke,
Huan Wang,
Silvio Savarese,
Caiming Xiong
Abstract:
Autonomous agents powered by large language models (LLMs) have attracted significant research interest. However, the open-source community faces many challenges in developing specialized models for agent tasks, driven by the scarcity of high-quality agent datasets and the absence of standard protocols in this area. We introduce and publicly release xLAM, a series of large action models designed for AI agent tasks. The xLAM series includes five models with both dense and mixture-of-expert architectures, ranging from 1B to 8x22B parameters, trained using a scalable, flexible pipeline that unifies, augments, and synthesizes diverse datasets to enhance AI agents' generalizability and performance across varied environments. Our experimental results demonstrate that xLAM consistently delivers exceptional performance across multiple agent ability benchmarks, notably securing the 1st position on the Berkeley Function-Calling Leaderboard, outperforming GPT-4, Claude-3, and many other models in terms of tool use. By releasing the xLAM series, we aim to advance the performance of open-source LLMs for autonomous AI agents, potentially accelerating progress and democratizing access to high-performance models for agent tasks. Models are available at https://huggingface.co/collections/Salesforce/xlam-models-65f00e2a0a63bbcd1c2dade4
Submitted 4 September, 2024;
originally announced September 2024.
-
Mistral-SPLADE: LLMs for better Learned Sparse Retrieval
Authors:
Meet Doshi,
Vishwajeet Kumar,
Rudra Murthy,
Vignesh P,
Jaydeep Sen
Abstract:
Learned Sparse Retrievers (LSR) have evolved into an effective retrieval strategy that can bridge the gap between traditional keyword-based sparse retrievers and embedding-based dense retrievers. At their core, learned sparse retrievers try to learn the most important semantic keyword expansions from a query and/or document, which can facilitate better retrieval with overlapping keyword expansions. LSRs like SPLADE have typically used encoder-only models with an MLM (masked language modeling) style objective in conjunction with known ways of improving retrieval performance, such as hard negative mining, distillation, etc. In this work, we propose to use a decoder-only model for learning semantic keyword expansion. We posit that decoder-only models that have seen much larger volumes of data are better equipped to learn the keyword expansions needed for improved retrieval. We use Mistral as the backbone to develop our Learned Sparse Retriever similar to SPLADE and train it on a subset of sentence-transformer data which is often used for training text embedding models. Our experiments support the hypothesis that a sparse retrieval model based on a decoder-only large language model (LLM) surpasses the performance of existing LSR systems, including SPLADE and all its variants. The LLM-based model (Echo-Mistral-SPLADE) now stands as a state-of-the-art learned sparse retrieval model on the BEIR text retrieval benchmark.
Submitted 21 August, 2024; v1 submitted 20 August, 2024;
originally announced August 2024.
-
Hindi-BEIR: A Large Scale Retrieval Benchmark in Hindi
Authors:
Arkadeep Acharya,
Rudra Murthy,
Vishwajeet Kumar,
Jaydeep Sen
Abstract:
Given the large number of Hindi speakers worldwide, there is a pressing need for robust and efficient information retrieval systems for Hindi. Despite ongoing research, there is no comprehensive benchmark for evaluating retrieval models in Hindi. To address this gap, we introduce the Hindi version of the BEIR benchmark, which includes a subset of English BEIR datasets translated to Hindi, existing Hindi retrieval datasets, and synthetically created datasets for retrieval. The benchmark comprises $15$ datasets spanning $8$ distinct tasks. We evaluate state-of-the-art multilingual retrieval models on this benchmark to identify task and domain-specific challenges and their impact on retrieval performance. By releasing this benchmark and a set of relevant baselines, we enable researchers to understand the limitations and capabilities of current Hindi retrieval models, promoting advancements in this critical area. The datasets from Hindi-BEIR are publicly available.
Submitted 18 August, 2024;
originally announced August 2024.
-
Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents
Authors:
Kexun Zhang,
Weiran Yao,
Zuxin Liu,
Yihao Feng,
Zhiwei Liu,
Rithesh Murthy,
Tian Lan,
Lei Li,
Renze Lou,
Jiacheng Xu,
Bo Pang,
Yingbo Zhou,
Shelby Heinecke,
Silvio Savarese,
Huan Wang,
Caiming Xiong
Abstract:
Large language model (LLM) agents have shown great potential in solving real-world software engineering (SWE) problems. The most advanced open-source SWE agent can resolve over 27% of real GitHub issues in SWE-Bench Lite. However, these sophisticated agent frameworks exhibit varying strengths, excelling in certain tasks while underperforming in others. To fully harness the diversity of these agents, we propose DEI (Diversity Empowered Intelligence), a framework that leverages their unique expertise. DEI functions as a meta-module atop existing SWE agent frameworks, managing agent collectives for enhanced problem-solving. Experimental results show that a DEI-guided committee of agents is able to surpass the best individual agent's performance by a large margin. For instance, a group of open-source SWE agents, with a maximum individual resolve rate of 27.3% on SWE-Bench Lite, can achieve a 34.3% resolve rate with DEI, making a 25% improvement and beating most closed-source solutions. Our best-performing group excels with a 55% resolve rate, securing the highest ranking on SWE-Bench Lite. Our findings contribute to the growing body of research on collaborative AI systems and their potential to solve complex software engineering challenges.
Submitted 13 August, 2024;
originally announced August 2024.
-
MaterioMiner -- An ontology-based text mining dataset for extraction of process-structure-property entities
Authors:
Ali Riza Durmaz,
Akhil Thomas,
Lokesh Mishra,
Rachana Niranjan Murthy,
Thomas Straub
Abstract:
While large language models learn sound statistical representations of the language and information therein, ontologies are symbolic knowledge representations that can ideally complement the former. Research at this critical intersection relies on datasets that intertwine ontologies and text corpora to enable training and comprehensive benchmarking of neurosymbolic models. We present the MaterioMiner dataset and the linked materials mechanics ontology, where ontological concepts from the mechanics of materials domain are associated with textual entities within the literature corpus. Another distinctive feature of the dataset is its eminently fine-granular annotation. Specifically, 179 distinct classes are manually annotated by three raters within four publications, amounting to a total of 2191 entities that were annotated and curated. Conceptual work is presented for the symbolic representation of causal composition-process-microstructure-property relationships. We explore the annotation consistency between the three raters and perform fine-tuning of pre-trained models to showcase the feasibility of named-entity recognition model training. Reusing the dataset can foster training and benchmarking of materials language models, automated ontology construction, and knowledge graph generation from textual data.
Submitted 5 August, 2024;
originally announced August 2024.
-
The Llama 3 Herd of Models
Authors:
Aaron Grattafiori,
Abhimanyu Dubey,
Abhinav Jauhri,
Abhinav Pandey,
Abhishek Kadian,
Ahmad Al-Dahle,
Aiesha Letman,
Akhil Mathur,
Alan Schelten,
Alex Vaughan,
Amy Yang,
Angela Fan,
Anirudh Goyal,
Anthony Hartshorn,
Aobo Yang,
Archi Mitra,
Archie Sravankumar,
Artem Korenev,
Arthur Hinsvark,
Arun Rao,
Aston Zhang,
Aurelien Rodriguez,
Austen Gregerson,
Ava Spataru,
Baptiste Roziere, et al. (536 additional authors not shown)
Abstract:
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.
Submitted 23 November, 2024; v1 submitted 31 July, 2024;
originally announced July 2024.
-
Personalized Multi-task Training for Recommender System
Authors:
Liangwei Yang,
Zhiwei Liu,
Jianguo Zhang,
Rithesh Murthy,
Shelby Heinecke,
Huan Wang,
Caiming Xiong,
Philip S. Yu
Abstract:
In the vast landscape of internet information, recommender systems (RecSys) have become essential for guiding users through a sea of choices aligned with their preferences. These systems have applications in diverse domains, such as news feeds, game suggestions, and shopping recommendations. Personalization is a key technique in RecSys, where modern methods leverage representation learning to encode user/item interactions into embeddings, forming the foundation for personalized recommendations. However, integrating information from multiple sources to enhance recommendation performance remains challenging. This paper introduces a novel approach named PMTRec, the first personalized multi-task learning algorithm to obtain comprehensive user/item embeddings from various information sources. Addressing challenges specific to personalized RecSys, we develop modules to handle personalized task weights, diverse task orientations, and variations in gradient magnitudes across tasks. PMTRec dynamically adjusts task weights based on gradient norms for each user/item, employs a Task Focusing module to align gradient combinations with the main recommendation task, and uses a Gradient Magnitude Balancing module to ensure balanced training across tasks. Through extensive experiments on three real-world datasets with different scales, we demonstrate that PMTRec significantly outperforms existing multi-task learning methods, showcasing its effectiveness in achieving enhanced recommendation accuracy by leveraging multiple tasks simultaneously. Our contributions open new avenues for advancing personalized multi-task training in recommender systems.
Submitted 31 July, 2024;
originally announced July 2024.
-
INDIC QA BENCHMARK: A Multilingual Benchmark to Evaluate Question Answering capability of LLMs for Indic Languages
Authors:
Abhishek Kumar Singh,
Vishwajeet kumar,
Rudra Murthy,
Jaydeep Sen,
Ashish Mittal,
Ganesh Ramakrishnan
Abstract:
Large Language Models (LLMs) perform well on unseen tasks in English, but their abilities in non-English languages are less explored due to limited benchmarks and training data. To bridge this gap, we introduce the Indic QA Benchmark, a large dataset for context-grounded question answering in 11 major Indian languages, covering both extractive and abstractive tasks. Evaluations of multilingual LLMs, including instruction-finetuned versions, revealed weak performance in low-resource languages due to a strong English-language bias in their training data. We also investigated the Translate Test paradigm, where inputs are translated to English for processing and the results are translated back into the source language for output. This approach outperformed multilingual LLMs, particularly in low-resource settings. By releasing Indic QA, we aim to promote further research into LLMs' question-answering capabilities in low-resource languages. This benchmark offers a critical resource to address existing limitations and foster multilingual understanding.
Submitted 24 February, 2025; v1 submitted 18 July, 2024;
originally announced July 2024.
-
APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
Authors:
Zuxin Liu,
Thai Hoang,
Jianguo Zhang,
Ming Zhu,
Tian Lan,
Shirley Kokane,
Juntao Tan,
Weiran Yao,
Zhiwei Liu,
Yihao Feng,
Rithesh Murthy,
Liangwei Yang,
Silvio Savarese,
Juan Carlos Niebles,
Huan Wang,
Shelby Heinecke,
Caiming Xiong
Abstract:
The advancement of function-calling agent models requires diverse, reliable, and high-quality datasets. This paper presents APIGen, an automated data generation pipeline designed to synthesize verifiable high-quality datasets for function-calling applications. We leverage APIGen and collect 3,673 executable APIs across 21 different categories to generate diverse function-calling datasets in a scalable and structured manner. Each data point in our dataset is verified through three hierarchical stages: format checking, actual function executions, and semantic verification, ensuring its reliability and correctness. We demonstrate that models trained with our curated datasets, even with only 7B parameters, can achieve state-of-the-art performance on the Berkeley Function-Calling Benchmark, outperforming multiple GPT-4 models. Moreover, our 1B model achieves exceptional performance, surpassing GPT-3.5-Turbo and Claude-3 Haiku. We release a dataset containing 60,000 high-quality entries, aiming to advance research on function-calling agents. The dataset is available on Hugging Face: https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k and the project homepage: https://apigen-pipeline.github.io/
Submitted 26 June, 2024;
originally announced June 2024.
-
MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases
Authors:
Rithesh Murthy,
Liangwei Yang,
Juntao Tan,
Tulika Manoj Awalgaonkar,
Yilun Zhou,
Shelby Heinecke,
Sachin Desai,
Jason Wu,
Ran Xu,
Sarah Tan,
Jianguo Zhang,
Zhiwei Liu,
Shirley Kokane,
Zuxin Liu,
Ming Zhu,
Huan Wang,
Caiming Xiong,
Silvio Savarese
Abstract:
The deployment of Large Language Models (LLMs) and Large Multimodal Models (LMMs) on mobile devices has gained significant attention due to the benefits of enhanced privacy, stability, and personalization. However, the hardware constraints of mobile devices necessitate the use of models with fewer parameters and model compression techniques like quantization. Currently, there is limited understanding of quantization's impact on various task performances, including LLM tasks, LMM tasks, and, critically, trust and safety. There is a lack of adequate tools for systematically testing these models on mobile devices. To address these gaps, we introduce MobileAIBench, a comprehensive benchmarking framework for evaluating mobile-optimized LLMs and LMMs. MobileAIBench assesses models across different sizes, quantization levels, and tasks, measuring latency and resource consumption on real devices. Our two-part open-source framework includes a library for running evaluations on desktops and an iOS app for on-device latency and hardware utilization measurements. Our thorough analysis aims to accelerate mobile AI research and deployment by providing insights into the performance and feasibility of deploying LLMs and LMMs on mobile platforms.
Submitted 12 June, 2024;
originally announced June 2024.
-
AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning
Authors:
Jianguo Zhang,
Tian Lan,
Rithesh Murthy,
Zhiwei Liu,
Weiran Yao,
Ming Zhu,
Juntao Tan,
Thai Hoang,
Zuxin Liu,
Liangwei Yang,
Yihao Feng,
Shirley Kokane,
Tulika Awalgaonkar,
Juan Carlos Niebles,
Silvio Savarese,
Shelby Heinecke,
Huan Wang,
Caiming Xiong
Abstract:
Autonomous agents powered by large language models (LLMs) have garnered significant research attention. However, fully harnessing the potential of LLMs for agent-based tasks presents inherent challenges due to the heterogeneous nature of diverse data sources featuring multi-turn trajectories. In this paper, we introduce AgentOhana as a comprehensive solution to address these challenges. AgentOhana aggregates agent trajectories from distinct environments, spanning a wide array of scenarios. It meticulously standardizes and unifies these trajectories into a consistent format, streamlining the creation of a generic data loader optimized for agent training. Leveraging the data unification, our training pipeline maintains equilibrium across different data sources and preserves independent randomness across devices during dataset partitioning and model training. Additionally, we present xLAM-v0.1, a large action model tailored for AI agents, which demonstrates exceptional performance across various benchmarks. Begin the exploration at https://github.com/SalesforceAIResearch/xLAM.
Submitted 8 November, 2024; v1 submitted 23 February, 2024;
originally announced February 2024.
-
Airavata: Introducing Hindi Instruction-tuned LLM
Authors:
Jay Gala,
Thanmay Jayakumar,
Jaavid Aktar Husain,
Aswanth Kumar M,
Mohammed Safi Ur Rahman Khan,
Diptesh Kanojia,
Ratish Puduppully,
Mitesh M. Khapra,
Raj Dabre,
Rudra Murthy,
Anoop Kunchukuttan
Abstract:
We announce the initial release of "Airavata," an instruction-tuned LLM for Hindi. Airavata was created by fine-tuning OpenHathi with diverse, instruction-tuning Hindi datasets to make it better suited for assistive tasks. Along with the model, we also share the IndicInstruct dataset, which is a collection of diverse instruction-tuning datasets to enable further research for Indic LLMs. Additionally, we present evaluation benchmarks and a framework for assessing LLM performance across tasks in Hindi. Currently, Airavata supports Hindi, but we plan to expand this to all 22 scheduled Indic languages. You can access all artifacts at https://ai4bharat.github.io/airavata.
Submitted 26 February, 2024; v1 submitted 26 January, 2024;
originally announced January 2024.
-
Distributed IRSs Always Benefit Every Mobile Operator
Authors:
L. Yashvanth,
Chandra R. Murthy
Abstract:
We investigate the impact of multiple distributed intelligent reflecting surfaces (IRSs), which are deployed and optimized by a mobile operator (MO), on the performance of user equipments (UEs) served by other co-existing out-of-band (OOB) MOs that do not control the IRSs. We show that, under round-robin scheduling, in mmWave frequencies, the ergodic sum spectral efficiency (SE) of an OOB MO increases logarithmically in the total number of IRS elements with a pre-log factor that increases with the ratio of the number of OOB paths through the IRS to the number of elements at an IRS. We further show that the maximum achievable SE of the OOB MO scales log-linearly with the total IRS elements, with a pre-log factor of $1$. Then, we specify the minimum number of IRSs as a function of the channel parameters and design a distributed IRS system in which an OOB MO almost surely obtains the maximum SE. Finally, we prove that the outage probability at an OOB UE decreases exponentially as the number of IRSs increases, even though they are randomly configured from the OOB UE's viewpoint. We numerically verify our theory and conclude that distributed IRSs always help every MO, but the MO controlling the IRSs benefits the most.
Submitted 13 July, 2024; v1 submitted 16 January, 2024;
originally announced January 2024.
-
PUB: A Pragmatics Understanding Benchmark for Assessing LLMs' Pragmatics Capabilities
Authors:
Settaluri Lakshmi Sravanthi,
Meet Doshi,
Tankala Pavan Kalyan,
Rudra Murthy,
Pushpak Bhattacharyya,
Raj Dabre
Abstract:
LLMs have demonstrated remarkable capability for understanding semantics, but they often struggle with understanding pragmatics. To demonstrate this fact, we release a Pragmatics Understanding Benchmark (PUB) dataset consisting of fourteen tasks in four pragmatics phenomena, namely, Implicature, Presupposition, Reference, and Deixis. We curated high-quality test sets for each task, consisting of Multiple Choice Question Answers (MCQA). PUB includes a total of 28k data points, 6.1k of which have been created by us, and the rest are adapted from existing datasets. We evaluated nine models varying in the number of parameters and type of training. Our study indicates that fine-tuning for instruction-following and chat significantly enhances the pragmatics capabilities of smaller language models. However, for larger models, the base versions perform comparably with their chat-adapted counterparts. Additionally, there is a noticeable performance gap between human capabilities and model capabilities. Furthermore, unlike the consistent performance of humans across various tasks, the models demonstrate variability in their proficiency, with performance levels fluctuating due to different hints and the complexities of tasks within the same dataset. Overall, the benchmark aims to provide a comprehensive evaluation of LLMs' ability to handle real-world language tasks that require pragmatic reasoning.
Submitted 13 January, 2024;
originally announced January 2024.
-
Tradeoff of age-of-information and power under reliability constraint for short-packet communication with block-length adaptation
Authors:
Sudarsanan A. K.,
Vineeth B. S.,
Chandra R. Murthy
Abstract:
In applications such as remote estimation and monitoring, update packets are transmitted by power-constrained devices using short-packet codes over wireless networks. Therefore, networks need to be end-to-end optimized using information freshness metrics such as age of information under transmit power and reliability constraints to ensure support for such applications. For short-packet coding, modelling and understanding the effect of block codeword length on transmit power and other performance metrics is important. To understand the above optimization for short-packet coding, we consider the optimal tradeoff problem between age of information and transmit power under reliability constraints for a short-packet point-to-point communication model with an exogenous packet generation process. In contrast to prior work, we consider scheduling policies that can adapt the block-length or transmission time of short-packet codes in order to achieve the optimal tradeoff. We characterize the tradeoff using a semi-Markov decision process formulation. We also obtain analytical upper bounds as well as numerical, analytical, and asymptotic lower bounds on the optimal tradeoff. We show that in certain regimes, such as high reliability and high packet generation rate, non-adaptive scheduling policies (fixed transmission time policies) are close-to-optimal. Furthermore, in a high-power or in a low-power regime, non-adaptive as well as state-independent randomized scheduling policies are order-optimal. These results are corroborated by numerical and simulation experiments. The tradeoff is then characterized for a wireless point-to-point channel with block fading as well as for other packet generation models (including an age-dependent packet generation model).
Submitted 3 December, 2023;
originally announced December 2023.
-
On the Impact of an IRS on the Out-of-Band Performance in Sub-6 GHz & mmWave Frequencies
Authors:
L. Yashvanth,
Chandra R. Murthy
Abstract:
Intelligent reflecting surfaces (IRSs) were introduced to enhance the performance of wireless communication systems. However, from a service provider's viewpoint, a concern with the use of an IRS is its effect on out-of-band (OOB) quality of service. Specifically, if two operators, say X and Y, provide services in a given geographical area using non-overlapping frequency bands, and if operator X uses an IRS to enhance the spectral efficiency (SE) of its users (UEs), does it degrade the performance of UEs served by operator Y? We answer this by analyzing the average and instantaneous performances of the OOB operator considering both sub-6 GHz and mmWave bands. Specifically, we derive the ergodic sum SE achieved by the operators under round-robin scheduling. We also derive the outage probability and analyze the change in the SNR caused by the IRS at an OOB UE using stochastic dominance theory. Surprisingly, even though the IRS is randomly configured from operator Y's point of view, the OOB operator still benefits from the presence of the IRS, witnessing a performance enhancement for free in both sub-6 GHz and mmWave bands. This is because the IRS introduces additional paths between the transmitter and receiver, increasing the overall signal power arriving at the UE and providing diversity benefits. Finally, we show that the use of opportunistic scheduling schemes can further enhance the benefit of the uncontrolled IRS at OOB UEs. We numerically illustrate our findings and conclude that an IRS is always beneficial to every operator, even when the IRS is deployed & controlled by only one operator.
Submitted 10 June, 2024; v1 submitted 27 August, 2023;
originally announced August 2023.
-
BOLAA: Benchmarking and Orchestrating LLM-augmented Autonomous Agents
Authors:
Zhiwei Liu,
Weiran Yao,
Jianguo Zhang,
Le Xue,
Shelby Heinecke,
Rithesh Murthy,
Yihao Feng,
Zeyuan Chen,
Juan Carlos Niebles,
Devansh Arpit,
Ran Xu,
Phil Mui,
Huan Wang,
Caiming Xiong,
Silvio Savarese
Abstract:
The massive successes of large language models (LLMs) encourage the emerging exploration of LLM-augmented Autonomous Agents (LAAs). An LAA is able to generate actions with its core LLM and interact with environments, which facilitates the ability to resolve complex tasks by conditioning on past interactions such as observations and actions. Since the investigation of LAA is still very recent, limited explorations are available. Therefore, we provide a comprehensive comparison of LAA in terms of both agent architectures and LLM backbones. Additionally, we propose a new strategy to orchestrate multiple LAAs such that each labor LAA focuses on one type of action, i.e., BOLAA, where a controller manages the communication among multiple agents. We conduct simulations on both decision-making and multi-step reasoning environments, which comprehensively justify the capacity of LAAs. Our performance results provide quantitative suggestions for designing LAA architectures and the optimal choice of LLMs, as well as the compatibility of both. We release our implementation code of LAAs to the public at https://github.com/salesforce/BOLAA.
Submitted 11 August, 2023;
originally announced August 2023.
-
Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization
Authors:
Weiran Yao,
Shelby Heinecke,
Juan Carlos Niebles,
Zhiwei Liu,
Yihao Feng,
Le Xue,
Rithesh Murthy,
Zeyuan Chen,
Jianguo Zhang,
Devansh Arpit,
Ran Xu,
Phil Mui,
Huan Wang,
Caiming Xiong,
Silvio Savarese
Abstract:
Recent months have seen the emergence of a powerful new trend in which large language models (LLMs) are augmented to become autonomous language agents capable of performing objective-oriented multi-step tasks on their own, rather than merely responding to queries from human users. Most existing language agents, however, are not optimized using environment-specific rewards. Although some agents enable iterative refinement through verbal feedback, they do not reason and plan in ways that are compatible with gradient-based learning from rewards. This paper introduces a principled framework for reinforcing large language agents by learning a retrospective model, which automatically tunes the language agent prompts from environment feedback through policy gradient. Specifically, our proposed agent architecture learns from rewards across multiple environments and tasks, for fine-tuning a pre-trained language model which refines the language agent prompt by summarizing the root cause of prior failed attempts and proposing action plans. Experimental results on various tasks demonstrate that the language agents improve over time and that our approach considerably outperforms baselines that do not properly leverage gradients from the environment. This demonstrates that using policy gradient optimization to improve language agents, for which we believe our work is one of the first, seems promising and can be applied to optimize other models in the agent architecture to enhance agent performances over time.
Submitted 5 May, 2024; v1 submitted 4 August, 2023;
originally announced August 2023.
-
REX: Rapid Exploration and eXploitation for AI Agents
Authors:
Rithesh Murthy,
Shelby Heinecke,
Juan Carlos Niebles,
Zhiwei Liu,
Le Xue,
Weiran Yao,
Yihao Feng,
Zeyuan Chen,
Akash Gokul,
Devansh Arpit,
Ran Xu,
Phil Mui,
Huan Wang,
Caiming Xiong,
Silvio Savarese
Abstract:
In this paper, we propose an enhanced approach for Rapid Exploration and eXploitation for AI Agents called REX. Existing AutoGPT-style techniques have inherent limitations, such as a heavy reliance on precise descriptions for decision-making, and the lack of a systematic approach to leverage try-and-fail procedures akin to traditional Reinforcement Learning (RL). REX introduces an additional layer of rewards and integrates concepts similar to Upper Confidence Bound (UCB) scores, leading to more robust and efficient AI agent performance. This approach has the advantage of enabling the utilization of offline behaviors from logs and allowing seamless integration with existing foundation models, and it does not require any model fine-tuning. Through comparative analysis with existing methods such as Chain-of-Thoughts (CoT) and Reasoning viA Planning (RAP), REX-based methods demonstrate comparable performance and, in certain cases, even surpass the results achieved by these existing techniques. Notably, REX-based methods exhibit remarkable reductions in execution time, enhancing their practical applicability across a diverse set of scenarios.
Submitted 26 January, 2024; v1 submitted 18 July, 2023;
originally announced July 2023.
-
Prompting with Pseudo-Code Instructions
Authors:
Mayank Mishra,
Prince Kumar,
Riyaz Bhat,
Rudra Murthy V,
Danish Contractor,
Srikanth Tamilselvam
Abstract:
Prompting with natural language instructions has recently emerged as a popular method of harnessing the capabilities of large language models. Given the inherent ambiguity present in natural language, it is intuitive to consider the possible advantages of prompting with less ambiguous prompt styles, such as the use of pseudo-code.
In this paper we explore if prompting via pseudo-code instructions helps improve the performance of pre-trained language models. We manually create a dataset of pseudo-code prompts for 132 different tasks spanning classification, QA and generative language tasks, sourced from the Super-NaturalInstructions dataset. Using these prompts along with their counterparts in natural language, we study their performance on two LLM families - BLOOM and CodeGen. Our experiments show that using pseudo-code instructions leads to better results, with an average increase (absolute) of 7-16 points in F1 scores for classification tasks and an improvement (relative) of 12-38% in aggregate ROUGE-L scores across all tasks. We include detailed ablation studies which indicate that code comments, docstrings, and the structural clues encoded in pseudo-code all contribute towards the improvement in performance.
To the best of our knowledge our work is the first to demonstrate how pseudo-code prompts can be helpful in improving the performance of pre-trained LMs.
Submitted 19 October, 2023; v1 submitted 19 May, 2023;
originally announced May 2023.
-
StarCoder: may the source be with you!
Authors:
Raymond Li,
Loubna Ben Allal,
Yangtian Zi,
Niklas Muennighoff,
Denis Kocetkov,
Chenghao Mou,
Marc Marone,
Christopher Akiki,
Jia Li,
Jenny Chim,
Qian Liu,
Evgenii Zheltonozhskii,
Terry Yue Zhuo,
Thomas Wang,
Olivier Dehaene,
Mishig Davaadorj,
Joel Lamy-Poirier,
João Monteiro,
Oleh Shliazhko,
Nicolas Gontier,
Nicholas Meade,
Armel Zebaze,
Ming-Ho Yee,
Logesh Kumar Umapathi,
Jian Zhu, et al. (42 additional authors not shown)
Abstract:
The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python, can be prompted to achieve 40% pass@1 on HumanEval, and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.
Submitted 13 December, 2023; v1 submitted 9 May, 2023;
originally announced May 2023.
-
Denoising-based UNMT is more robust to word-order divergence than MASS-based UNMT
Authors:
Tamali Banerjee,
Rudra Murthy V,
Pushpak Bhattacharyya
Abstract:
We aim to investigate whether UNMT approaches with self-supervised pre-training are robust to word-order divergence between language pairs. We achieve this by comparing two models pre-trained with the same self-supervised pre-training objective. The first model is trained on language pairs with different word-orders, and the second model is trained on the same language pairs with the source language re-ordered to match the word-order of the target language. Ideally, UNMT approaches which are robust to word-order divergence should exhibit no visible performance difference between the two configurations. In this paper, we investigate two such self-supervised pre-training based UNMT approaches, namely Masked Sequence-to-Sequence Pre-Training (MASS), which does not have shuffling noise, and Denoising AutoEncoder (DAE), which has shuffling noise.
We experiment with five English$\rightarrow$Indic language pairs (en-hi, en-bn, en-gu, en-kn, and en-ta), where the word-order of the source language is SVO (Subject-Verb-Object) and the word-order of the target languages is SOV (Subject-Object-Verb). We observed that for these language pairs, the DAE-based UNMT approach consistently outperforms MASS in terms of translation accuracy. Moreover, bridging the word-order gap using reordering improves the translation accuracy of MASS-based UNMT models, while it cannot improve the translation accuracy of DAE-based UNMT models. This observation indicates that DAE-based UNMT is more robust to word-order divergence than MASS-based UNMT. Word-shuffling noise in the DAE approach could be the possible reason for the approach being robust to word-order divergence.
Submitted 2 March, 2023;
originally announced March 2023.
-
Does an IRS Degrade Out-of-Band Performance?
Authors:
L. Yashvanth,
Chandra R. Murthy
Abstract:
Intelligent reflecting surfaces (IRSs) were introduced to enhance the performance of wireless systems. However, from a cellular service provider's view, a concern with the use of an IRS is its effect on out-of-band (OOB) quality of service. Specifically, given two operators, say X and Y, providing services in a geographical area using non-overlapping frequency bands, if operator-X uses an IRS to optimally enhance the throughput of its users, does the IRS degrade the performance of operator-Y? We answer this by deriving the ergodic sum spectral efficiency (SE) of both operators under round-robin scheduling. We also derive the complementary cumulative distribution function of the change in effective channel at an OOB user with and without the IRS, which provides deeper insights into OOB performance. Surprisingly, we find that even though the IRS is randomly configured from operator-Y's view, the OOB operator still benefits from the IRS, witnessing a performance enhancement for free. This happens because the IRS introduces additional paths between the nodes, increasing the signal power at the receiver and providing diversity benefits. We verify our findings numerically and conclude that an IRS is beneficial to every operator, even when the IRS is deployed to optimally serve only one operator.
Submitted 30 June, 2023; v1 submitted 24 February, 2023;
originally announced February 2023.
-
Channel State Information Based User Censoring in Irregular Repetition Slotted Aloha
Authors:
Chirag Ramesh Srivatsa,
Chandra R. Murthy
Abstract:
Irregular repetition slotted aloha (IRSA) is a massive random access protocol which can be used to serve a large number of users while achieving a packet loss rate (PLR) close to zero. However, if the number of users is too high, then the system is interference limited and the PLR is close to one. In this paper, we propose a variant of IRSA in the interference limited regime, namely Censored-IRSA (C-IRSA), wherein users with poor channel states censor themselves from transmitting their packets. We theoretically analyze the throughput performance of C-IRSA via density evolution. Using this, we derive closed-form expressions for the optimal choice of the censor threshold which maximizes the throughput while achieving zero PLR among uncensored users. Through extensive numerical simulations, we show that C-IRSA can achieve a 4$\times$ improvement in the peak throughput compared to conventional IRSA.
Submitted 24 February, 2023;
originally announced February 2023.
-
Semi-Structured Object Sequence Encoders
Authors:
Rudra Murthy V,
Riyaz Bhat,
Chulaka Gunasekara,
Siva Sankalp Patel,
Hui Wan,
Tejas Indulal Dhamecha,
Danish Contractor,
Marina Danilevsky
Abstract:
In this paper we explore the task of modeling semi-structured object sequences; in particular, we focus our attention on the problem of developing a structure-aware input representation for such sequences. Examples of such data include user activity on websites, machine logs, and many others. This type of data is often represented as a sequence of sets of key-value pairs over time and can present modeling challenges due to an ever-increasing sequence length. We propose a two-part approach, which first considers each key independently and encodes a representation of its values over time; we then self-attend over these value-aware key representations to accomplish a downstream task. This allows us to operate on longer object sequences than existing methods. We introduce a novel shared-attention-head architecture between the two modules and present an innovative training schedule that interleaves the training of both modules with shared weights for some attention heads. Our experiments on multiple prediction tasks using real-world data demonstrate that our approach outperforms a unified network with hierarchical encoding, as well as other methods including a record-centric representation and a flattened representation of the sequence.
Submitted 22 May, 2023; v1 submitted 3 January, 2023;
originally announced January 2023.
-
Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages
Authors:
Arnav Mhaske,
Harshit Kedia,
Sumanth Doddapaneni,
Mitesh M. Khapra,
Pratyush Kumar,
Rudra Murthy V,
Anoop Kunchukuttan
Abstract:
We present Naamapadam, the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. The dataset contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location, and Organization) for 9 out of the 11 languages. The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language translation. We also create manually annotated test sets for 9 languages. We demonstrate the utility of the obtained dataset on the Naamapadam-test dataset. We also release IndicNER, a multilingual IndicBERT model fine-tuned on the Naamapadam training set. IndicNER achieves an F1 score of more than $80$ for $7$ out of $9$ test languages. The dataset and models are available under open-source licences at https://ai4bharat.iitm.ac.in/naamapadam.
Submitted 28 May, 2023; v1 submitted 20 December, 2022;
originally announced December 2022.
-
Impact of Mobility on Downlink Cell-Free Massive MIMO Systems
Authors:
Abhinav Anand,
Chandra R. Murthy,
Ribhu Chopra
Abstract:
In this paper, we analyze the achievable downlink spectral efficiency of cell-free massive multiple input multiple output (CF-mMIMO) systems, accounting for the effects of channel aging (caused by user mobility) and pilot contamination. We consider two cases, one where user equipments (UEs) rely on downlink pilots beamformed by the access points (APs) to estimate the downlink channel, and another where UEs utilize statistical channel state information (CSI) for data decoding. For comparison, we also consider cellular mMIMO and derive its achievable spectral efficiency with channel aging and pilot contamination in the above two cases. Our results show that, in CF-mMIMO, downlink training is preferable over statistical CSI when the length of the data sequence is chosen optimally to maximize the spectral efficiency. In cellular mMIMO, however, either one of the two schemes may be better depending on whether user fairness or sum spectral efficiency is prioritized. Furthermore, the CF-mMIMO system generally outperforms cellular mMIMO even after accounting for the effects of channel aging and pilot contamination. Through numerical results, we illustrate the effect of various system parameters such as the maximum user velocity, uplink/downlink pilot lengths, data duration, and network densification, and provide interesting insights into the key differences between cell-free and cellular mMIMO systems.
Submitted 6 September, 2022;
originally announced September 2022.
-
Performance Analysis of Irregular Repetition Slotted Aloha with Multi-Cell Interference
Authors:
Chirag Ramesh Srivatsa,
Chandra R. Murthy
Abstract:
Irregular repetition slotted aloha (IRSA) is a massive random access protocol in which users transmit several replicas of their packet over a frame to a base station. Existing studies have analyzed IRSA in the single-cell (SC) setup, which does not extend to the more practically relevant multi-cell (MC) setup due to the inter-cell interference. In this work, we analyze MC IRSA, accounting for pilot contamination and multiuser interference. Via numerical simulations, we illustrate that, in practical settings, MC IRSA can have a drastic loss of throughput, up to $70\%$, compared to SC IRSA. Further, MC IRSA requires a significantly higher training length (about 4-5x compared to SC IRSA), in order to support the same user density and achieve the same throughput. We also provide insights into the impact of the pilot length, number of antennas, and signal to noise ratio on the performance of MC IRSA.
Submitted 28 May, 2022; v1 submitted 14 May, 2022;
originally announced May 2022.
-
HiNER: A Large Hindi Named Entity Recognition Dataset
Authors:
Rudra Murthy,
Pallab Bhattacharjee,
Rahul Sharnagat,
Jyotsana Khatri,
Diptesh Kanojia,
Pushpak Bhattacharyya
Abstract:
Named Entity Recognition (NER) is a foundational NLP task that aims to provide class labels like Person, Location, Organisation, Time, and Number to words in free text. Named Entities can also be multi-word expressions where the additional I-O-B annotation information helps label them during the NER annotation process. While English and European languages have considerable annotated data for the NER task, Indian languages lack on that front -- both in terms of quantity and following annotation standards. This paper releases a significantly sized standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 11 tags. We discuss the dataset statistics in all their essential detail and provide an in-depth analysis of the NER tag-set used with our data. The statistics of tag-set in our dataset show a healthy per-tag distribution, especially for prominent classes like Person, Location and Organisation. Since the proof of resource-effectiveness is in building models with the resource and testing the model on benchmark data and against the leader-board entries in shared tasks, we do the same with the aforesaid data. We use different language models to perform the sequence labelling task for NER and show the efficacy of our data by performing a comparative evaluation with models trained on another dataset available for the Hindi NER task. Our dataset helps achieve a weighted F1 score of 88.78 with all the tags and 92.22 when we collapse the tag-set, as discussed in the paper. To the best of our knowledge, no available dataset meets the standards of volume (amount) and variability (diversity), as far as Hindi NER is concerned. We fill this gap through this work, which we hope will significantly help NLP for Hindi. We release this dataset with our code and models at https://github.com/cfiltnlp/HiNER
Submitted 28 April, 2022;
originally announced April 2022.
-
Performance Analysis of Intelligent Reflecting Surface Assisted Opportunistic Communications
Authors:
L. Yashvanth,
Chandra R. Murthy
Abstract:
Intelligent reflecting surfaces (IRSs) are a promising technology for enhancing coverage and spectral efficiency, both in the sub-6 GHz and the millimeter wave (mmWave) bands. Existing approaches to leverage the benefits of IRS involve the use of a resource-intensive channel estimation step followed by a computationally expensive algorithm to optimize the reflection coefficients at the IRS. In this work, focusing on the sub-6 GHz band of communications, we present and analyze several alternative schemes, where the phase configuration of the IRS is randomized and multi-user diversity is exploited to opportunistically select the best user at each point in time for data transmission. We show that the throughput of an IRS assisted opportunistic communication (OC) system asymptotically converges to the optimal beamforming-based throughput under fair allocation of resources, as the number of users gets large. We also introduce schemes that enhance the rate of convergence of the OC rate to the beamforming rate with the number of users. For all the proposed schemes, we derive the scaling law of the throughput in terms of the system parameters, as the number of users gets large. Following this, we extend the setup to wideband channels via an orthogonal frequency division multiplexing (OFDM) system and discuss two OC schemes in an IRS assisted setting that clearly elucidate the superior performance that IRS aided OC systems can offer over conventional systems, at very low implementation cost and complexity.
Submitted 24 October, 2022; v1 submitted 11 March, 2022;
originally announced March 2022.
-
On the Impact of Channel Estimation on the Design and Analysis of IRSA based Systems
Authors:
Chirag Ramesh Srivatsa,
Chandra R. Murthy
Abstract:
Irregular repetition slotted aloha (IRSA) is a distributed grant-free random access protocol where users transmit multiple replicas of their packets to a base station (BS). The BS recovers the packets using successive interference cancellation. In this paper, we first derive channel estimates for IRSA, exploiting the sparsity structure of IRSA transmissions, when non-orthogonal pilots are employed across users to facilitate channel estimation at the BS. Allowing for the use of non-orthogonal pilots is important, as the length of orthogonal pilots scales linearly with the total number of devices, leading to prohibitive overhead as the number of devices increases. Next, we present a novel analysis of the throughput of IRSA under practical channel estimation errors, with the use of multiple antennas at the BS. Finally, we theoretically characterize the asymptotic throughput performance of IRSA using a density evolution based analysis. Simulation results underline the importance of accounting for channel estimation errors in analyzing IRSA, which can even lead to a 70% loss in performance in severely interference-limited regimes. We also provide novel insights on the effect of parameters such as pilot length, SNR, number of antennas at the BS, etc., on the system throughput.
Submitted 23 June, 2022; v1 submitted 14 December, 2021;
originally announced December 2021.
-
User Activity Detection for Irregular Repetition Slotted Aloha based MMTC
Authors:
Chirag Ramesh Srivatsa,
Chandra R. Murthy
Abstract:
Irregular repetition slotted aloha (IRSA) is a grant-free random access protocol for massive machine-type communications, where a large number of users sporadically send their data packets to a base station (BS). IRSA is a completely distributed multiple access protocol: in any given frame, a small subset of the users, i.e., the active users, transmit replicas of their packet in randomly selected resource elements (REs). The first step in the decoding process at the BS is to detect which users are active in each frame. To this end, a novel Bayesian user activity detection (UAD) algorithm is developed, which exploits both the sparsity in user activity as well as the underlying structure of IRSA-based transmissions. Next, the Cramer-Rao bound (CRB) on the mean squared error in channel estimation is derived. It is empirically shown that the channel estimates obtained as a by-product of the proposed UAD algorithm achieve the CRB. Then, the signal to interference plus noise ratio achieved by the users is analyzed, accounting for UAD, channel estimation errors, and pilot contamination. The impact of these non-idealities on the throughput of IRSA is illustrated via Monte Carlo simulations. For example, in a system with 1500 users and 10% of the users being active per frame, a pilot length of as low as 20 symbols is sufficient for accurate user activity detection. In contrast, using classical compressed sensing approaches for UAD would require a pilot length of about 346 symbols. The results also reveal crucial insights into the dependence of UAD errors and throughput on parameters such as the length of the pilot sequence, the number of antennas at the BS, the number of users, and the signal to noise ratio.
Submitted 23 June, 2022; v1 submitted 11 November, 2021;
originally announced November 2021.
-
Evaluation Of Orthogonal Chirp Division Multiplexing For Automotive Integrated Sensing And Communications
Authors:
Sangeeta Bhattacharjee,
Kumar Vijay Mishra,
Ramesh Annavajjala,
Chandra R. Murthy
Abstract:
We consider a bistatic vehicular integrated sensing and communications (ISAC) system that employs the recently proposed orthogonal chirp division multiplexing (OCDM) multicarrier waveform. As a stand-alone communications waveform, OCDM has been shown to be robust against the interference in time-frequency selective channels. In a bistatic ISAC, we exploit this property to develop efficient receive processing algorithms that achieve high target resolution as well as high communications rate. We derive statistical bounds for our proposed Sequential symbol decoding and radar parameter estimation (SUNDAE) algorithm and compare its competitive performance with other multicarrier waveforms through numerical experiments.
Submitted 9 November, 2021;
originally announced November 2021.
-
Can Dynamic TDD Enabled Half-Duplex Cell-Free Massive MIMO Outperform Full-Duplex Cellular Massive MIMO?
Authors:
Anubhab Chowdhury,
Ribhu Chopra,
Chandra R. Murthy
Abstract:
We consider a dynamic time division duplex (DTDD) enabled cell-free massive multiple-input multiple-output (CF-mMIMO) system, where each half-duplex (HD) access point (AP) is scheduled to operate in the uplink (UL) or downlink (DL) mode based on the data demands of the user equipments (UEs), with the goal of maximizing the sum UL-DL spectral efficiency (SE). We develop a new, low complexity, greedy algorithm for the combinatorial AP scheduling problem, with an optimality guarantee theoretically established via showing that a lower bound of the sum UL-DL SE is sub-modular. We also consider pilot sequence reuse among the UEs to limit the channel estimation overhead. In CF systems, all the APs estimate the channel from every UE, making the pilot allocation problem different from the cellular case. We develop a novel algorithm that iteratively minimizes the maximum pilot contamination across the UEs. We compare the performance of our solutions, both theoretically and via simulations, against a full duplex (FD) multi-cell mMIMO system. Our results show that, due to the joint processing of the signals at the central processing unit, CF-mMIMO with dynamic HD AP-scheduling significantly outperforms cellular FD-mMIMO in terms of the sum SE and 90% likely SE. Thus, DTDD enabled HD CF-mMIMO is a promising alternative to cellular FD-mMIMO, without the cost of hardware for self-interference suppression.
Submitted 21 May, 2022; v1 submitted 19 October, 2021;
originally announced October 2021.
-
Role of Language Relatedness in Multilingual Fine-tuning of Language Models: A Case Study in Indo-Aryan Languages
Authors:
Tejas Indulal Dhamecha,
Rudra Murthy V,
Samarth Bharadwaj,
Karthik Sankaranarayanan,
Pushpak Bhattacharyya
Abstract:
We explore the impact of leveraging the relatedness of languages that belong to the same family in NLP models using multilingual fine-tuning. We hypothesize and validate that multilingual fine-tuning of pre-trained language models can yield better performance on downstream NLP applications, compared to models fine-tuned on individual languages. A first-of-its-kind detailed study is presented to track performance change as languages are added to a base language in a graded and greedy (in the sense of best boost of performance) manner, which reveals that careful selection of a subset of related languages can improve performance significantly more than utilizing all related languages. The Indo-Aryan (IA) language family is chosen for the study, the exact languages being Bengali, Gujarati, Hindi, Marathi, Oriya, Punjabi and Urdu. The script barrier is crossed by simple rule-based transliteration of the text of all languages to Devanagari. Experiments are performed on mBERT, IndicBERT, MuRIL and two RoBERTa-based LMs, the last two being pre-trained by us. Low resource languages, such as Oriya and Punjabi, are found to be the largest beneficiaries of multilingual fine-tuning. Textual Entailment, Entity Classification, Section Title Prediction, tasks of IndicGLUE and POS tagging form our test bed. Compared to monolingual fine-tuning, we get a relative performance improvement of up to 150% in the downstream tasks. The surprise take-away is that for any language there is a particular combination of other languages which yields the best performance, and any additional language is in fact detrimental.
Submitted 22 September, 2021;
originally announced September 2021.
-
Resilient and Latency-aware Orchestration of Network Slices Using Multi-connectivity in MEC-enabled 5G Networks
Authors:
Prabhu Kaliyammal Thiruvasagam,
Abhishek Chakraborty,
C Siva Ram Murthy
Abstract:
Network slicing (NS) and multi-access edge computing (MEC) are new paradigms which play key roles in 5G and beyond networks. NS allows network operators (NOs) to divide the available network resources into multiple logical NSs for providing dedicated virtual networks tailored to the specific service/business requirements. MEC enables NOs to provide diverse ultra-low latency services for supporting the needs of different industry verticals by moving computing facilities to the network edge. NS can be constructed by instantiating a set of virtual network functions (VNFs) on top of MEC cloud servers for provisioning diverse latency-sensitive communication services (e.g., autonomous driving and augmented reality) on demand at lower cost and in less time. However, VNFs, MEC cloud servers, and communication links are subject to failures due to software bugs, misconfiguration, overloading, hardware faults, cyber attacks, power outages, and natural/man-made disasters. Failure of a critical network component disrupts services abruptly and leads to users' dissatisfaction, which may result in revenue loss for the NOs. In this paper, we present a novel approach based on multi-connectivity in 5G networks to tackle this problem. Our proposed approach is resilient against i) failure of VNFs, ii) failure of local servers within MEC, iii) failure of communication links, and iv) failure of an entire MEC cloud facility at the regional level. To this end, we formulate the problem as a binary integer programming (BIP) model in order to optimally deploy NSs with the minimum cost, and prove it is NP-hard. To overcome time complexity, we propose an efficient genetic algorithm based heuristic to obtain a near-optimal solution in polynomial time. By extensive simulations, we show that our proposed approach not only reduces resource wastage, but also improves throughput while providing high resiliency against failures.
Submitted 1 July, 2021;
originally announced July 2021.
-
Latency-aware and Survivable Mapping of VNFs in 5G Network Edge Cloud
Authors:
Prabhu Kaliyammal Thiruvasagam,
Abhishek Chakraborty,
C. Siva Ram Murthy
Abstract:
Network Functions Virtualization (NFV) and Multi-access Edge Computing (MEC) play crucial roles in 5G networks for dynamically provisioning diverse communication services with heterogeneous service requirements. In particular, while NFV improves flexibility and scalability by softwarizing physical network functions as Virtual Network Functions (VNFs), MEC enables the provision of delay-sensitive/time-critical services by moving computing facilities to the network edge. However, these new paradigms introduce challenges in terms of latency, availability, and resource allocation. In this paper, we first explore MEC cloud facility location selection and then latency-aware placement of VNFs in different selected locations of NFV enabled MEC cloud facilities in order to meet the ultra-low latency requirements of different applications (e.g., Tactile Internet, virtual reality, and mission-critical applications). Furthermore, we also aim to guarantee the survivability of VNFs and an edge server against failures in a resource-limited MEC cloud facility due to software bugs, configuration faults, etc. To this end, we formulate the problem of latency-aware and survivable mapping of VNFs in different MEC cloud facilities as an Integer Linear Program (ILP) to minimize the overall service provisioning cost, and show that the problem is NP-hard. Owing to the high computational complexity of solving the ILP, we propose a simulated annealing based heuristic algorithm to obtain a near-optimal solution in polynomial time. With extensive simulations, we show the effectiveness of our proposed solution in a real-world network topology, which performs close to the optimal solution.
Submitted 17 June, 2021;
originally announced June 2021.
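The abstract above describes a simulated annealing heuristic for latency-aware VNF placement. The following is a minimal sketch of that general technique on a toy instance, not the authors' algorithm: the cost matrix, latency matrix, capacity model, penalty weight, and cooling schedule are all illustrative assumptions, and survivability constraints are not modeled.

```python
import math
import random


def placement_cost(assign, cost, latency, max_latency, capacity, penalty=1000.0):
    """Provisioning cost plus penalties for latency and capacity violations.

    assign[v] is the facility index chosen for VNF v (hypothetical encoding).
    """
    total = 0.0
    load = [0] * len(capacity)
    for v, f in enumerate(assign):
        total += cost[v][f]
        load[f] += 1
        if latency[v][f] > max_latency[v]:
            total += penalty  # latency bound of VNF v is violated
    for f, used in enumerate(load):
        if used > capacity[f]:
            total += penalty * (used - capacity[f])  # facility f is overloaded
    return total


def anneal(cost, latency, max_latency, capacity,
           steps=5000, t0=10.0, alpha=0.999, seed=0):
    """Simulated annealing over VNF-to-facility assignments."""
    rng = random.Random(seed)
    n_vnf, n_fac = len(cost), len(capacity)
    current = [rng.randrange(n_fac) for _ in range(n_vnf)]
    cur_cost = placement_cost(current, cost, latency, max_latency, capacity)
    best, best_cost = current[:], cur_cost
    temp = t0
    for _ in range(steps):
        # Neighbor move: reassign one randomly chosen VNF.
        cand = current[:]
        cand[rng.randrange(n_vnf)] = rng.randrange(n_fac)
        cand_cost = placement_cost(cand, cost, latency, max_latency, capacity)
        delta = cand_cost - cur_cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, cur_cost = cand, cand_cost
            if cur_cost < best_cost:
                best, best_cost = current[:], cur_cost
        temp = max(temp * alpha, 1e-6)  # geometric cooling
    return best, best_cost


# Toy instance: 3 VNFs, 2 facilities (all numbers are made up).
cost = [[1, 5], [4, 2], [3, 3]]
latency = [[2, 9], [8, 3], [4, 4]]
max_latency = [5, 5, 5]
capacity = [2, 2]

best, best_cost = anneal(cost, latency, max_latency, capacity)
```

Encoding constraints as penalties keeps the neighbor move simple; a formulation closer to an ILP would instead reject infeasible candidates outright.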
-
Crosslingual Embeddings are Essential in UNMT for Distant Languages: An English to IndoAryan Case Study
Authors:
Tamali Banerjee,
Rudra Murthy V,
Pushpak Bhattacharyya
Abstract:
Recent advances in Unsupervised Neural Machine Translation (UNMT) have minimized the gap between supervised and unsupervised machine translation performance for closely related language pairs. However, the situation is very different for distant language pairs. Lack of lexical overlap and low syntactic similarity, as between English and Indo-Aryan languages, lead to poor translation quality in existing UNMT systems. In this paper, we show that initializing the embedding layer of UNMT models with cross-lingual embeddings yields significant improvements in BLEU score over existing approaches with randomly initialized embeddings. Further, static embeddings (freezing the embedding layer weights) lead to better gains than updating the embedding layer weights during training (non-static). We experimented using Masked Sequence to Sequence (MASS) and Denoising Autoencoder (DAE) UNMT approaches for three distant language pairs. The proposed cross-lingual embedding initialization yields a BLEU score improvement of as much as ten times over the baseline for English-Hindi, English-Bengali, and English-Gujarati. Our analysis shows the importance of cross-lingual embeddings, compares the approaches, and identifies the scope for improvement in these systems.
Submitted 9 June, 2021;
originally announced June 2021.
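The static versus non-static distinction in the abstract above can be illustrated with a toy embedding table. This is a minimal sketch under stated assumptions, not the paper's training setup: the `EmbeddingLayer` class, the two-dimensional vectors, and the gradient values are all hypothetical.

```python
class EmbeddingLayer:
    """Toy embedding table holding one vector per token id."""

    def __init__(self, vectors, frozen):
        # Copy the pretrained (e.g. cross-lingual) vectors so layers do not share state.
        self.vectors = [list(v) for v in vectors]
        self.frozen = frozen

    def lookup(self, token_id):
        return self.vectors[token_id]

    def sgd_step(self, token_id, grad, lr=0.1):
        """Apply one SGD update to a single row unless the layer is frozen."""
        if self.frozen:
            return  # static: weights stay at their initialization
        row = self.vectors[token_id]
        for j, g in enumerate(grad):
            row[j] -= lr * g


# Hypothetical pretrained vectors for a 3-token vocabulary.
pretrained = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

static_layer = EmbeddingLayer(pretrained, frozen=True)
non_static_layer = EmbeddingLayer(pretrained, frozen=False)

grad = [1.0, -1.0]  # made-up gradient for token 1
static_layer.sgd_step(1, grad)
non_static_layer.sgd_step(1, grad)
```

After the update, the static layer still returns the initial vector for token 1, while the non-static layer has moved away from it.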
-
Multiple Support Recovery Using Very Few Measurements Per Sample
Authors:
Lekshmi Ramesh,
Chandra R. Murthy,
Himanshu Tyagi
Abstract:
In the problem of multiple support recovery, we are given access to linear measurements of multiple sparse samples in $\mathbb{R}^{d}$. These samples can be partitioned into $\ell$ groups, with samples having the same support belonging to the same group. For a given budget of $m$ measurements per sample, the goal is to recover the $\ell$ underlying supports without knowledge of the group labels. We study this problem with a focus on the measurement-constrained regime where $m$ is smaller than the support size $k$ of each sample. We design a two-step procedure that first estimates the union of the underlying supports and then uses a spectral algorithm to estimate the individual supports. Our proposed estimator can recover the supports with $m<k$ measurements per sample, from $\tilde{O}(k^{4}\ell^{4}/m^{4})$ samples. Our guarantees hold for a general, generative model assumption on the samples and measurement matrices. We also provide results from experiments conducted on synthetic data and on the MNIST dataset.
Submitted 20 May, 2021;
originally announced May 2021.
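The setup in the abstract above can be written out as follows. The symbols $n$, $x_i$, $\Phi_i$, $y_i$, and $S_j$ are notation assumed here for illustration and may differ from the paper's own.

```latex
% Assumed notation: n samples, each measured through its own m x d matrix.
\begin{align*}
  &x_1, \dots, x_n \in \mathbb{R}^{d}, \qquad
   \operatorname{supp}(x_i) \in \{S_1, \dots, S_\ell\}, \qquad |S_j| = k, \\
  &y_i = \Phi_i x_i \in \mathbb{R}^{m}, \qquad m < k, \\
  &\text{goal: recover } S_1, \dots, S_\ell \text{ from } \{(y_i, \Phi_i)\}_{i=1}^{n},
   \qquad n = \tilde{O}\!\left(\frac{k^{4}\ell^{4}}{m^{4}}\right).
\end{align*}
```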