-
A bag-of-concepts model improves relation extraction in a narrow knowledge domain with limited data
Authors:
Jiyu Chen,
Karin Verspoor,
Zenan Zhai
Abstract:
This paper focuses on a traditional relation extraction task in the context of limited annotated data and a narrow knowledge domain. We explore this task with a clinical corpus consisting of 200 breast cancer follow-up treatment letters in which 16 distinct types of relations are annotated. We experiment with an approach to extracting typed relations called window-bounded co-occurrence (WBC), which uses an adjustable context window around entity mentions of a relevant type, and compare its performance with a more typical intra-sentential co-occurrence baseline. We further introduce a new bag-of-concepts (BoC) approach to feature engineering based on state-of-the-art word embeddings and word synonyms. We demonstrate the competitiveness of BoC by comparing it with methods of higher complexity, and explore its effectiveness on this small dataset.
Submitted 24 April, 2019;
originally announced April 2019.
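The window-bounded co-occurrence idea in the abstract above can be illustrated with a minimal sketch. The abstract only says that WBC uses an adjustable context window around entity mentions of a relevant type, so everything concrete here is an assumption made for illustration: the mention format, the token-distance rule, the function name `wbc_candidates`, and the example entity types are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical mention format: (token index, entity type, surface text).
Mention = Tuple[int, str, str]


def wbc_candidates(
    mentions: List[Mention], head_type: str, tail_type: str, window: int
) -> List[Tuple[str, str]]:
    """Pair every head-type mention with every tail-type mention that lies
    within `window` tokens of it (assumed reading of "window-bounded")."""
    heads = [m for m in mentions if m[1] == head_type]
    tails = [m for m in mentions if m[1] == tail_type]
    return [
        (h[2], t[2])
        for h in heads
        for t in tails
        if h is not t and abs(h[0] - t[0]) <= window  # distance in tokens
    ]


# Illustrative, made-up mentions from one letter.
mentions = [(2, "Drug", "tamoxifen"), (5, "Dose", "20mg"), (30, "Dose", "5mg")]
narrow = wbc_candidates(mentions, "Drug", "Dose", window=10)
wide = wbc_candidates(mentions, "Drug", "Dose", window=30)
```

Widening the window trades precision for recall: a small window keeps only nearby pairs, while a large one approaches plain co-occurrence within the whole document.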
-
Analysing health professionals' learning interactions in online social networks: A social network analysis approach
Authors:
Xin Li,
Kathleen Gray,
Karin Verspoor,
Stephen Barnett
Abstract:
Online Social Networking may be a way to support health professionals' need for continuous learning through interaction with peers and experts. Understanding and evaluating such learning is important but difficult, and Social Network Analysis (SNA) offers a solution. This paper demonstrates how SNA can be used to study levels of participation as well as the patterns of interactions that take place among health professionals in a large online professional learning network. Our analysis has shown that their learning network is highly centralised and loosely connected. The level of participation is low in general, and most interactions are structured around a small set of users consisting of moderators and core members. The structural patterns of interaction indicate that small-group learning may be occurring, which requires further investigation to identify those potential learning groups. This first stage of analysis, to be followed by a longitudinal study of the dynamics of interaction and complemented by content analysis of their discussion, may contribute to greater sophistication in the analysis and utilisation of new environments for health professional learning.
Submitted 11 April, 2016;
originally announced April 2016.
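As a rough illustration of what "highly centralised and loosely connected" can mean in SNA terms, the sketch below computes two standard measures on an undirected interaction graph: density and Freeman's degree centralisation. The abstract does not say which measures the authors used, so the choice of measures, the function name `density_and_centralisation`, and the toy edge list are assumptions.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Tuple


def density_and_centralisation(
    edges: Iterable[Tuple[Hashable, Hashable]]
) -> Tuple[float, float]:
    """Return (density, Freeman degree centralisation) of an undirected graph.

    Only users that appear in at least one edge are counted, so members who
    never interacted are ignored in this sketch.
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-replies
            neighbours[u].add(v)
            neighbours[v].add(u)
    n = len(neighbours)
    if n < 3:
        raise ValueError("need at least three connected users")
    degrees = [len(adjacent) for adjacent in neighbours.values()]
    edge_count = sum(degrees) // 2
    density = 2 * edge_count / (n * (n - 1))
    d_max = max(degrees)
    # Freeman: summed gap to the best-connected user, divided by the largest
    # possible gap, which is reached by a star graph.
    centralisation = sum(d_max - d for d in degrees) / ((n - 1) * (n - 2))
    return density, centralisation


# Toy network: one moderator talks to four members who never talk to each other.
star = [("moderator", member) for member in ("a", "b", "c", "d")]
density, centralisation = density_and_centralisation(star)
```

A star-shaped network like this one is maximally centralised while using only a fraction of the possible ties, which is the pattern the abstract describes in words.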
-
Adjusting for Chance Clustering Comparison Measures
Authors:
Simone Romano,
Nguyen Xuan Vinh,
James Bailey,
Karin Verspoor
Abstract:
Adjusted-for-chance measures are widely used to compare partitions/clusterings of the same data set. In particular, the Adjusted Rand Index (ARI), based on pair-counting, and the Adjusted Mutual Information (AMI), based on Shannon information theory, are very popular in the clustering community. Nonetheless, it remains an open problem what the best application scenarios for each measure are, and guidelines in the literature for their usage are sparse, with the result that users often resort to using both. Generalized Information Theoretic (IT) measures based on the Tsallis entropy have been shown to link pair-counting and Shannon IT measures. In this paper, we aim to bridge the gap between adjustment of measures based on pair-counting and measures based on information theory. We solve the key technical challenge of analytically computing the expected value and variance of generalized IT measures. This allows us to propose adjustments of generalized IT measures, which reduce to well-known adjusted clustering comparison measures as special cases. Using the theory of generalized IT measures, we are able to propose the following guidelines for using ARI and AMI as external validation indices: ARI should be used when the reference clustering has large, equal-sized clusters; AMI should be used when the reference clustering is unbalanced and there exist small clusters.
Submitted 3 December, 2015;
originally announced December 2015.
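For readers unfamiliar with the pair-counting adjustment mentioned in the abstract above, here is a minimal sketch of the standard Adjusted Rand Index, computed from the contingency table of two labelings as (index - expected index) / (maximum index - expected index). This is the classical ARI only, not the generalized Tsallis-entropy measures the paper proposes. The function name and the convention of returning 1.0 when the denominator vanishes are choices made for this sketch.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(
    labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]
) -> float:
    """Classical ARI between two clusterings of the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two labelings of the same length (>= 2)")
    n = len(labels_a)
    # Pairs of items placed together in both clusterings.
    together_in_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each clustering separately.
    together_in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_in_a * together_in_b / comb(n, 2)  # value under chance
    maximum = (together_in_a + together_in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case; convention assumed for this sketch
    return (together_in_both - expected) / (maximum - expected)


same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Relabelling the clusters leaves the score unchanged, and a partition that cuts across the reference can score below zero, which is the effect of subtracting the chance-level expectation.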
-
A Framework to Adjust Dependency Measure Estimates for Chance
Authors:
Simone Romano,
Nguyen Xuan Vinh,
James Bailey,
Karin Verspoor
Abstract:
Estimating the strength of dependency between two variables is fundamental for exploratory analysis and many other applications in data mining. For example, non-linear dependencies between two continuous variables can be explored with the Maximal Information Coefficient (MIC), and categorical variables that are dependent on the target class are selected using Gini gain in random forests. Nonetheless, because dependency measures are estimated on finite samples, the interpretability of their quantification and the accuracy when ranking dependencies become challenging. Dependency estimates are not equal to 0 when variables are independent, cannot be compared if computed on different sample sizes, and are inflated by chance on variables with more categories. In this paper, we propose a framework to adjust dependency measure estimates on finite samples. Our adjustments, which are simple and applicable to any dependency measure, are helpful in improving interpretability when quantifying dependency and in improving accuracy on the task of ranking dependencies. In particular, we demonstrate that our approach enhances the interpretability of MIC when used as a proxy for the amount of noise between variables, and improves accuracy when ranking variables during the splitting procedure in random forests.
Submitted 20 January, 2016; v1 submitted 27 October, 2015;
originally announced October 2015.
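One simple way to picture an adjustment for chance that works with any dependency measure is to estimate the measure's expected value under independence by shuffling one variable, then rescale. The sketch below does this with squared Pearson correlation as a stand-in measure. It is an illustration under stated assumptions, not the paper's framework: the permutation-based estimate, the function names, the choice of measure, and the default parameters are all hypothetical.

```python
import random
from statistics import mean
from typing import Callable, Sequence


def r_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Squared Pearson correlation, used here as an example dependency measure."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


def adjusted_for_chance(
    measure: Callable[[Sequence[float], Sequence[float]], float],
    xs: Sequence[float],
    ys: Sequence[float],
    upper_bound: float = 1.0,
    n_permutations: int = 200,
    seed: int = 0,
) -> float:
    """(raw - expected under independence) / (upper bound - expected).

    The expectation is estimated by Monte Carlo: shuffling `ys` breaks any
    dependency on `xs` while keeping both marginal distributions intact.
    """
    rng = random.Random(seed)
    shuffled = list(ys)
    null_values = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        null_values.append(measure(xs, shuffled))
    expected = mean(null_values)
    return (measure(xs, ys) - expected) / (upper_bound - expected)


xs = list(range(30))
ys = [2 * x + 1 for x in xs]  # perfectly dependent on xs
perfect = adjusted_for_chance(r_squared, xs, ys)
```

With this rescaling, independent variables score around 0 on average regardless of sample size, and a perfect dependency still scores 1, which makes values computed on different samples easier to compare.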