-
Unsupervised Text Representation Learning via Instruction-Tuning for Zero-Shot Dense Retrieval
Authors:
Qiuhai Zeng,
Zimeng Qiu,
Dae Yon Hwang,
Xin He,
William M. Campbell
Abstract:
Dense retrieval systems are commonly used for information retrieval (IR). They rely on learning text representations through an encoder and usually require supervised modeling via labelled data, which can be costly to obtain or simply unavailable. In this study, we introduce a novel unsupervised text representation learning technique via instruction-tuning a pre-trained encoder-decoder large language model (LLM) under the dual-encoder retrieval framework. We demonstrate that the corpus representation can be augmented by the representations of relevant synthetic queries generated by the instruction-tuned LLM, an augmentation founded on the Rao-Blackwell theorem. Furthermore, we effectively align the query and corpus text representations with self-instructed tuning. Specifically, we first prompt an open-box pre-trained LLM to follow defined instructions (i.e., question generation and keyword summarization) to generate synthetic queries. Next, we fine-tune the pre-trained LLM with the defined instructions and the generated queries that passed a quality check. Finally, we generate synthetic queries with the instruction-tuned LLM for each corpus entry and represent each entry by a weighted average of its synthetic-query embeddings and its original embedding. We evaluate our proposed method under low-resource settings on three English retrieval datasets and one German retrieval dataset, measuring NDCG@10, MRR@100, and Recall@100. We significantly improve the average zero-shot retrieval performance on all metrics, increasing open-box FLAN-T5 model variations by [3.34%, 3.50%] in absolute terms and exceeding three competitive dense retrievers (i.e., mDPR, T-Systems, mBART-Large), with a model at least 38% smaller, by 1.96%, 4.62%, and 9.52% absolute on NDCG@10.
Submitted 24 September, 2024;
originally announced September 2024.
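The final step described in the abstract above, representing a corpus entry by a weighted average of its own embedding and the embeddings of its synthetic queries, might be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `augment_embedding` and the mixing parameter `doc_weight` are hypothetical, and the abstract does not state how the weights are chosen.

```python
from typing import List, Sequence


def augment_embedding(
    doc_emb: Sequence[float],
    query_embs: List[Sequence[float]],
    doc_weight: float = 0.5,
) -> List[float]:
    """Blend a document embedding with the mean of its synthetic-query embeddings.

    doc_weight is a hypothetical mixing parameter (an assumption of this
    sketch); the remaining weight, 1 - doc_weight, goes to the query mean.
    """
    if not 0.0 <= doc_weight <= 1.0:
        raise ValueError("doc_weight must lie in [0, 1]")
    if not query_embs:
        # No synthetic queries: fall back to the original embedding.
        return list(doc_emb)
    dim = len(doc_emb)
    if any(len(q) != dim for q in query_embs):
        raise ValueError("all embeddings must share the same dimension")
    # Component-wise mean of the synthetic-query embeddings.
    mean_query = [
        sum(q[i] for q in query_embs) / len(query_embs) for i in range(dim)
    ]
    # Convex combination of the document embedding and the query mean.
    return [
        doc_weight * d + (1.0 - doc_weight) * m
        for d, m in zip(doc_emb, mean_query)
    ]
```

With `doc_weight=1.0` the sketch reduces to the unaugmented document embedding, which makes it easy to compare against a baseline.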
-
Efficient Semi-Supervised Learning for Natural Language Understanding by Optimizing Diversity
Authors:
Eunah Cho,
He Xie,
John P. Lalor,
Varun Kumar,
William M. Campbell
Abstract:
Expanding new functionalities efficiently is an ongoing challenge for single-turn task-oriented dialogue systems. In this work, we explore functionality-specific semi-supervised learning via self-training. We consider methods that augment training data automatically from unlabeled data sets in a functionality-targeted manner. In addition, we examine multiple techniques for efficient selection of augmented utterances to reduce training time and increase diversity. First, we consider paraphrase detection methods that attempt to find utterance variants of labeled training data with good coverage. Second, we explore sub-modular optimization based on n-gram features for utterance selection. Experiments show that functionality-specific self-training is very effective for improving system performance. In addition, methods optimizing diversity can reduce training data in many cases to 50% with little impact on performance.
Submitted 9 October, 2019;
originally announced October 2019.
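The n-gram-based sub-modular selection mentioned in the abstract above can be illustrated with a greedy pass over candidate utterances. This sketch assumes a simple n-gram coverage objective (the number of distinct n-grams covered by the selected set), which is monotone and sub-modular; the abstract does not specify the exact objective, and the names `ngrams` and `greedy_select` are hypothetical.

```python
from typing import List, Set, Tuple


def ngrams(utterance: str, n: int = 2) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams in a whitespace-tokenized utterance."""
    tokens = utterance.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def greedy_select(utterances: List[str], budget: int, n: int = 2) -> List[str]:
    """Greedily pick up to `budget` utterances that add the most new n-grams.

    The coverage objective is an assumption of this sketch, chosen because
    greedy selection is the standard heuristic for such objectives.
    """
    covered: Set[Tuple[str, ...]] = set()
    selected: List[str] = []
    remaining = list(utterances)
    while remaining and len(selected) < budget:
        # Marginal gain = number of n-grams not yet covered.
        best = max(remaining, key=lambda u: len(ngrams(u, n) - covered))
        if not ngrams(best, n) - covered:
            break  # nothing left adds new n-grams
        selected.append(best)
        covered |= ngrams(best, n)
        remaining.remove(best)
    return selected
```

Near-duplicate utterances contribute no new n-grams once a variant has been selected, so the pass stops early and the selected set stays small and diverse.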
-
Graph Model Selection via Random Walks
Authors:
Lin Li,
William M. Campbell,
Rajmonda S. Caceres
Abstract:
In this paper, we present a novel approach based on the random walk process for finding meaningful representations of a graph model. Our approach leverages the transient behavior of many short random walks with novel initialization mechanisms to generate model discriminative features. These features are able to capture a more comprehensive structural signature of the underlying graph model. The resulting representation is invariant to both node permutation and the size of the graph, allowing direct comparison between large classes of graphs. We test our approach on two challenging model selection problems: the discrimination in the sparse regime of an Erdős-Rényi model from a stochastic block model and the planted clique problem. Our representation approach achieves performance that closely matches known theoretical limits in addition to being computationally simple and scalable to large graphs.
Submitted 10 May, 2018; v1 submitted 18 April, 2017;
originally announced April 2017.
-
Making Sense of Unstructured Text Data
Authors:
Lin Li,
William M. Campbell,
Cagri Dagli,
Joseph P. Campbell
Abstract:
Many network analysis tasks in social sciences rely on pre-existing data sources that were created with explicit relations or interactions between entities under consideration. Examples include email logs, friends and followers networks on social media, communication networks, etc. In these data, it is relatively easy to identify who is connected to whom and how they are connected. However, most of the data that we encounter on a daily basis are unstructured free-text data, e.g., forums, online marketplaces, etc. It is considerably more difficult to extract network data from unstructured text. In this work, we present an end-to-end system for analyzing unstructured text data and transforming the data into structured graphs that are directly applicable to a downstream application. Specifically, we look at social media data and attempt to predict the most indicative words from users' posts. The resulting keywords can be used to construct a context+content network for downstream processing such as graph-based analysis and learning. With that goal in mind, we apply our methods to the application of cross-domain entity resolution. The performance of the resulting system with automatic keywords shows improvement over the system with user-annotated hashtags.
Submitted 18 April, 2017;
originally announced April 2017.
-
Consistent Alignment of Word Embedding Models
Authors:
Cem Safak Sahin,
Rajmonda S. Caceres,
Brandon Oselio,
William M. Campbell
Abstract:
Word embedding models offer continuous vector representations that can capture rich contextual semantics based on their word co-occurrence patterns. While these word vectors can provide very effective features used in many NLP tasks such as clustering similar words and inferring learning relationships, many challenges and open research questions remain. In this paper, we propose a solution that aligns variations of the same model (or different models) in a joint low-dimensional latent space leveraging carefully generated synthetic data points. This generative process is inspired by the observation that a variety of linguistic relationships is captured by simple linear operations in embedded space. We demonstrate that our approach can lead to substantial improvements in recovering embeddings of local neighborhoods.
Submitted 24 February, 2017;
originally announced February 2017.
-
Model Selection Framework for Graph-based data
Authors:
Rajmonda S. Caceres,
Leah Weiner,
Matthew C. Schmidt,
Benjamin A. Miller,
William M. Campbell
Abstract:
Graphs are powerful abstractions for capturing complex relationships in diverse application settings. An active area of research focuses on theoretical models that define the generative mechanism of a graph. Yet given the complexity and inherent noise in real datasets, it is still very challenging to identify the best model for a given observed graph. We discuss a framework for graph model selection that leverages a long list of graph topological properties and a random forest classifier to learn and classify different graph instances. We fully characterize the discriminative power of our approach as we sweep through the parameter space of two generative models, the Erdős-Rényi and the stochastic block model. We show that our approach gets very close to known theoretical bounds and we provide insight on which topological features play a critical discriminating role.
Submitted 15 September, 2016;
originally announced September 2016.
-
Cross-Domain Entity Resolution in Social Media
Authors:
W. M. Campbell,
Lin Li,
C. Dagli,
J. Acevedo-Aviles,
K. Geyer,
J. P. Campbell,
C. Priebe
Abstract:
The challenge of associating entities across multiple domains is a key problem in social media understanding. Successful cross-domain entity resolution provides integration of information from multiple sites to create a complete picture of user and community activities, characteristics, and trends. In this work, we examine the problem of entity resolution across Twitter and Instagram using general techniques. Our methods fall into three categories: profile, content, and graph based. For the profile-based methods, we consider techniques based on approximate string matching. For content-based methods, we perform author identification. Finally, for graph-based methods, we apply novel cross-domain community detection methods and generate neighborhood-based features. The three categories of methods are applied to a large graph of users in Twitter and Instagram to understand challenges, determine performance, and understand fusion of multiple methods. Final results demonstrate an equal error rate of less than 1%.
Submitted 3 August, 2016;
originally announced August 2016.
-
Matching Community Structure Across Online Social Networks
Authors:
Lin Li,
W. M. Campbell
Abstract:
The discovery of community structure in networks has been a problem of considerable interest in recent years. In online social networks, users are often simultaneously involved in multiple social media sites, some of which share common social relationships. It is of great interest to uncover a shared community structure across these networks. However, in reality, users typically identify themselves with different usernames across social media sites. This creates great difficulty in detecting the community structure. In this paper, we explore several approaches for community detection across online social networks with limited knowledge of username alignment across the networks. We refer to the known alignment of usernames as seeds. We investigate strategies for seed selection and their impact on networks with different fractions of overlapping vertices. The goal is to study the interplay between network topologies and seed selection strategies, and to understand how it affects the detected community structure. We also propose several measures to assess the performance of community detection and use them to measure the quality of the detected communities in both Twitter-Twitter networks and Twitter-Instagram networks.
Submitted 3 August, 2016;
originally announced August 2016.