Cross-institution text mining to uncover clinical associations: a case study relating social factors and code status in intensive care medicine
Authors:
Madhumita Sushil,
Atul J. Butte,
Ewoud Schuit,
Maarten van Smeden,
Artuur M. Leeuwenberg
Abstract:
Objective: Text mining of clinical notes embedded in electronic medical records is increasingly used to extract patient characteristics otherwise not or only partly available, to assess their association with relevant health outcomes. As manual data labeling needed to develop text mining models is resource intensive, we investigated whether off-the-shelf text mining models developed at external institutions, together with limited within-institution labeled data, could be used to reliably extract study variables to conduct association studies.
Materials and Methods: We developed multiple text mining models on different combinations of within-institution and external-institution data to extract social factors from discharge reports of intensive care patients. Subsequently, we assessed the associations between social factors and having a do-not-resuscitate/intubate code.
Results: Important differences were found between associations based on manually labeled data compared to text-mined social factors in three out of five cases. Adopting external-institution text mining models using manually labeled within-institution data resulted in models with higher F1-scores, but not in meaningfully different associations.
Discussion: While text mining facilitated scaling analyses to larger samples leading to discovering a larger number of associations, the estimates may be unreliable. Confirmation is needed with better text mining models, ideally on a larger manually labeled dataset.
Conclusion: The currently used text mining models were not sufficiently accurate to be used reliably in an association study. Model adaptation using within-institution data did not improve the estimates. Further research is needed to set conditions for reliable use of text mining in medical research.
Submitted 16 January, 2023;
originally announced January 2023.
Comparing methods addressing multi-collinearity when developing prediction models
Authors:
Artuur M. Leeuwenberg,
Maarten van Smeden,
Johannes A. Langendijk,
Arjen van der Schaaf,
Murielle E. Mauer,
Karel G. M. Moons,
Johannes B. Reitsma,
Ewoud Schuit
Abstract:
Clinical prediction models are developed widely across medical disciplines. When predictors in such models are highly collinear, unexpected or spurious predictor-outcome associations may occur, thereby potentially reducing face-validity and explainability of the prediction model. Collinearity can be dealt with by exclusion of collinear predictors, but when there is no a priori motivation (besides collinearity) to include or exclude specific predictors, such an approach is arbitrary and possibly inappropriate. We compare different methods to address collinearity, including shrinkage, dimensionality reduction, and constrained optimization. The effectiveness of these methods is illustrated via simulations. In the conducted simulations, no effect of collinearity was observed on predictive outcomes. However, a negative effect of collinearity on the stability of predictor selection was found, affecting all compared methods, but in particular methods that perform strong predictor selection (e.g., Lasso).
Submitted 5 January, 2021;
originally announced January 2021.