-
Vision-Language Modeling in PET/CT for Visual Grounding of Positive Findings
Authors:
Zachary Huemann,
Samuel Church,
Joshua D. Warner,
Daniel Tran,
Xin Tie,
Alan B McMillan,
Junjie Hu,
Steve Y. Cho,
Meghan Lubner,
Tyler J. Bradshaw
Abstract:
Vision-language models can connect the text description of an object to its specific location in an image through visual grounding. This has potential applications in enhanced radiology reporting. However, these models require large annotated image-text datasets, which are lacking for PET/CT. We developed an automated pipeline to generate weak labels linking PET/CT report descriptions to their image locations and used it to train a 3D vision-language visual grounding model. Our pipeline finds positive findings in PET/CT reports by identifying mentions of SUVmax and axial slice numbers. From 25,578 PET/CT exams, we extracted 11,356 sentence-label pairs. Using this data, we trained ConTEXTual Net 3D, which integrates text embeddings from a large language model with a 3D nnU-Net via token-level cross-attention. The model's performance was compared against LLMSeg, a 2.5D version of ConTEXTual Net, and two nuclear medicine physicians. The weak-labeling pipeline accurately identified lesion locations in 98% of cases (246/251), with 7.5% requiring boundary adjustments. ConTEXTual Net 3D achieved an F1 score of 0.80, outperforming LLMSeg (F1=0.22) and the 2.5D model (F1=0.53), though it underperformed both physicians (F1=0.94 and 0.91). The model achieved better performance on FDG (F1=0.78) and DCFPyL (F1=0.75) exams, while performance dropped on DOTATATE (F1=0.58) and Fluciclovine (F1=0.66). The model performed consistently across lesion sizes but showed reduced accuracy on lesions with low uptake. Our novel weak-labeling pipeline accurately produced an annotated dataset of PET/CT image-text pairs, facilitating the development of 3D visual grounding models. ConTEXTual Net 3D significantly outperformed other models but fell short of the performance of nuclear medicine physicians. Our study suggests that even larger datasets may be needed to close this performance gap.
Submitted 1 February, 2025;
originally announced February 2025.
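The weak-labeling idea in this abstract, keeping a report sentence when it mentions both an SUVmax and an axial slice number, can be illustrated with a minimal Python sketch. The regular expressions, the function name `extract_weak_label`, and the example sentence are hypothetical; the abstract does not describe the authors' actual extraction rules.

```python
import re

# Hypothetical patterns: real report phrasing varies, and the authors'
# actual rules are not given in the abstract.
SUV_RE = re.compile(r"SUV\s*max\s*(?:of|=|:)?\s*(\d+(?:\.\d+)?)", re.IGNORECASE)
SLICE_RE = re.compile(r"(?:slice|image)\s*#?\s*(\d+)", re.IGNORECASE)


def extract_weak_label(sentence):
    """Return (suvmax, slice_number) if the sentence mentions both, else None."""
    suv = SUV_RE.search(sentence)
    axial = SLICE_RE.search(sentence)
    if suv is None or axial is None:
        return None
    return float(suv.group(1)), int(axial.group(1))


# Made-up sentence in the style of a PET/CT report.
example = "Hypermetabolic right hilar node with SUVmax 7.2 (axial slice 112)."
label = extract_weak_label(example)
```

A sentence that yields a label would then be paired with the image region around the reported slice; how the region is delineated is outside the scope of this sketch.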
-
DFT-s-OFDM-based On-Off Keying for Low-Power Wake-Up Signal
Authors:
Renaud-Alexandre Pitaval,
Xiaolei Tie
Abstract:
5G-Advanced and likely 6G will support a new low-power wake-up signal (LP-WUS) enabling low-power devices, equipped with a complementary ultra-low-power receiver to monitor wireless traffic, to completely switch off their main radio. This orthogonal frequency-division multiplexed (OFDM) signal will emulate an on-off keying (OOK) modulation to enable very low-energy envelope detection at the receiver. Higher-rate LP-WUS, containing multiple OOK symbols within a single OFDM symbol, will be generated using the time-domain pulse multiplexing of discrete Fourier transform spread (DFT-s-) OFDM. In this context, this paper presents a comprehensive signal design framework for DFT-s-OFDM-based OOK generation. General properties of subcarrier coefficients are derived, demonstrating that only the DFT of the bits needs to be computed online and repeated over the band before applying appropriate frequency-domain processing. The conventional approach of generating rectangular-like OOK waveforms is then addressed by a combination of pre-DFT bit-spreading and post-DFT processing, and the least-squares (LS) method from Mazloum and Edfors, proposed for 5G LP-WUS and also Ambient-IoT, is shown to be implementable as such. Although aesthetically pleasing and of independent interest, rectangular-like OOK waveforms are not optimal for 5G LP-WUS scenarios due to their limited robustness to channel frequency-selectivity and timing offset. Shaping methods for spreading the OOK spectrum and concentrating the OOK symbol energy are therefore analyzed and shown to improve the bit error rate performance under practical conditions.
Submitted 17 June, 2025; v1 submitted 10 December, 2024;
originally announced December 2024.
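The basic chain mentioned in this abstract (pre-DFT bit-spreading, DFT precoding, subcarrier mapping, inverse DFT) can be sketched in plain Python. This is only a generic DFT-s-OFDM illustration under stated assumptions: the function names, the centered subcarrier mapping, and the sizes are hypothetical, and none of the paper's post-DFT processing or shaping methods are reproduced.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) DFT; the inverse includes the 1/n factor."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def dfts_ofdm_ook(bits, spread, n_fft):
    """Generate one OFDM symbol whose envelope approximates an OOK pattern.

    Assumes len(bits) * spread is even and divides n_fft.
    """
    # Pre-DFT bit-spreading: repeat every bit `spread` times.
    samples = [float(b) for b in bits for _ in range(spread)]
    m = len(samples)
    spectrum = dft(samples)
    # Centered mapping (assumption): low half of the DFT output goes to the
    # positive-frequency bins, high half to the negative-frequency bins.
    grid = [0j] * n_fft
    for k in range(m):
        grid[k if k < m // 2 else n_fft - m + k] = spectrum[k]
    # The inverse DFT interpolates the spread bit pattern in time.
    return dft(grid, inverse=True), samples


waveform, spread_bits = dfts_ofdm_ook([1, 0, 1, 0], spread=2, n_fft=32)
```

With this mapping, every (n_fft / m)-th output sample equals the corresponding spread bit scaled by m / n_fft, so an envelope detector sampling at those instants recovers the on-off pattern; between those instants the waveform follows the usual Dirichlet-kernel interpolation.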
-
Deep Learning for Longitudinal Gross Tumor Volume Segmentation in MRI-Guided Adaptive Radiotherapy for Head and Neck Cancer
Authors:
Xin Tie,
Weijie Chen,
Zachary Huemann,
Brayden Schott,
Nuohao Liu,
Tyler J. Bradshaw
Abstract:
Accurate segmentation of gross tumor volume (GTV) is essential for effective MRI-guided adaptive radiotherapy (MRgART) in head and neck cancer. However, manual segmentation of the GTV over the course of therapy is time-consuming and prone to interobserver variability. Deep learning (DL) has the potential to overcome these challenges by automatically delineating GTVs. In this study, our team, UW LAIR, tackled the challenges of both pre-radiotherapy (pre-RT) (Task 1) and mid-radiotherapy (mid-RT) (Task 2) tumor volume segmentation. To this end, we developed a series of DL models for longitudinal GTV segmentation. The backbone of our models for both tasks was SegResNet with deep supervision. For Task 1, we trained the model using a combined dataset of pre-RT and mid-RT MRI data, which resulted in an improved aggregated Dice similarity coefficient (DSCagg) on an internal testing set compared to models trained solely on pre-RT MRI data. In Task 2, we introduced mask-aware attention modules, enabling pre-RT GTV masks to influence intermediate features learned from mid-RT data. This attention-based approach yielded slight improvements over the baseline method, which concatenated mid-RT MRI with pre-RT GTV masks as input. In the final testing phase, the ensemble of 10 pre-RT segmentation models achieved an average DSCagg of 0.794, with 0.745 for primary GTV (GTVp) and 0.844 for metastatic lymph nodes (GTVn) in Task 1. For Task 2, the ensemble of 10 mid-RT segmentation models attained an average DSCagg of 0.733, with 0.607 for GTVp and 0.859 for GTVn, leading us to achieve 1st place. In summary, we presented a collection of DL models that could facilitate GTV segmentation in MRgART, offering the potential to streamline radiation oncology workflows. Our code and model weights are available at https://github.com/xtie97/HNTS-MRG24-UWLAIR.
Submitted 30 November, 2024;
originally announced December 2024.
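The aggregated Dice similarity coefficient (DSCagg) reported above is commonly computed by pooling overlaps and volumes over all cases before dividing, rather than averaging per-case Dice scores. The following minimal sketch assumes that definition, represents each mask as a flat list of 0/1 values, and returns 1.0 when every mask is empty; those choices and the function name `dsc_agg` are illustrative and not taken from the authors' code.

```python
def dsc_agg(pred_masks, true_masks):
    """Aggregated Dice: 2 * sum_i |P_i & T_i| / sum_i (|P_i| + |T_i|)."""
    intersection = 0
    total = 0
    for pred, true in zip(pred_masks, true_masks):
        intersection += sum(1 for p, t in zip(pred, true) if p and t)
        total += sum(1 for p in pred if p) + sum(1 for t in true if t)
    # Convention assumed here: two entirely empty sets count as perfect agreement.
    return 2.0 * intersection / total if total else 1.0


# Two toy cases: one partial overlap and one missed lesion.
preds = [[1, 1, 0, 0], [0, 0, 0, 0]]
truths = [[1, 0, 0, 0], [0, 1, 0, 0]]
score = dsc_agg(preds, truths)
```

Pooling across cases keeps a case with an empty ground-truth mask from producing an undefined per-case score, which matters when some scans contain no residual tumor.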
-
Automatic Quantification of Serial PET/CT Images for Pediatric Hodgkin Lymphoma Patients Using a Longitudinally-Aware Segmentation Network
Authors:
Xin Tie,
Muheon Shin,
Changhee Lee,
Scott B. Perlman,
Zachary Huemann,
Amy J. Weisman,
Sharon M. Castellino,
Kara M. Kelly,
Kathleen M. McCarten,
Adina L. Alazraki,
Junjie Hu,
Steve Y. Cho,
Tyler J. Bradshaw
Abstract:
Purpose: Automatic quantification of longitudinal changes in PET scans for lymphoma patients has proven challenging, as residual disease in interim-therapy scans is often subtle and difficult to detect. Our goal was to develop a longitudinally-aware segmentation network (LAS-Net) that can quantify serial PET/CT images for pediatric Hodgkin lymphoma patients.
Materials and Methods: This retrospective study included baseline (PET1) and interim (PET2) PET/CT images from 297 patients enrolled in two Children's Oncology Group clinical trials (AHOD1331 and AHOD0831). LAS-Net incorporates longitudinal cross-attention, allowing relevant features from PET1 to inform the analysis of PET2. Model performance was evaluated using Dice coefficients for PET1 and detection F1 scores for PET2. Additionally, we extracted and compared quantitative PET metrics, including metabolic tumor volume (MTV) and total lesion glycolysis (TLG) in PET1, as well as qPET and ΔSUVmax in PET2, against physician measurements. We quantified their agreement using Spearman's ρ correlations and employed bootstrap resampling for statistical analysis.
Results: LAS-Net detected residual lymphoma in PET2 with an F1 score of 0.606 (precision/recall: 0.615/0.600), outperforming all comparator methods (P<0.01). For baseline segmentation, LAS-Net achieved a mean Dice score of 0.772. In PET quantification, LAS-Net's measurements of qPET, ΔSUVmax, MTV and TLG were strongly correlated with physician measurements, with Spearman's ρ of 0.78, 0.80, 0.93 and 0.96, respectively. The performance remained high, with a slight decrease, in an external testing cohort.
Conclusion: LAS-Net demonstrated significant improvements in quantifying PET metrics across serial scans, highlighting the value of longitudinal awareness in evaluating multi-time-point imaging datasets.
Submitted 30 September, 2024; v1 submitted 12 April, 2024;
originally announced April 2024.
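Two of the quantitative PET metrics named in this abstract have simple standard definitions: metabolic tumor volume (MTV) is the volume of the segmented lesion voxels, and total lesion glycolysis (TLG) is MTV multiplied by the mean SUV inside the segmentation. The sketch below assumes flat lists of voxel SUVs and a 0/1 mask; the function name and the toy numbers are illustrative only.

```python
def mtv_and_tlg(suv_values, mask, voxel_volume_ml):
    """Return (MTV in mL, TLG) for the voxels where mask is nonzero."""
    lesion = [suv for suv, inside in zip(suv_values, mask) if inside]
    mtv = len(lesion) * voxel_volume_ml
    suv_mean = sum(lesion) / len(lesion) if lesion else 0.0
    return mtv, mtv * suv_mean


# Toy example: four voxels of 0.5 mL each, two of them inside the lesion.
mtv, tlg = mtv_and_tlg([1.0, 4.0, 6.0, 2.0], [0, 1, 1, 0], voxel_volume_ml=0.5)
```

In practice the mask would come from a segmentation model such as the one described above, and the voxel volume from the image spacing.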
-
Automatic Personalized Impression Generation for PET Reports Using Large Language Models
Authors:
Xin Tie,
Muheon Shin,
Ali Pirasteh,
Nevein Ibrahim,
Zachary Huemann,
Sharon M. Castellino,
Kara M. Kelly,
John Garrett,
Junjie Hu,
Steve Y. Cho,
Tyler J. Bradshaw
Abstract:
In this study, we aimed to determine if fine-tuned large language models (LLMs) can generate accurate, personalized impressions for whole-body PET reports. Twelve language models were trained on a corpus of PET reports using the teacher-forcing algorithm, with the report findings as input and the clinical impressions as reference. An extra input token encodes the reading physician's identity, allowing models to learn physician-specific reporting styles. Our corpus comprised 37,370 retrospective PET reports collected from our institution between 2010 and 2022. To identify the best LLM, 30 evaluation metrics were benchmarked against quality scores from two nuclear medicine (NM) physicians, with the most aligned metrics selecting the model for expert evaluation. In a subset of data, model-generated impressions and original clinical impressions were assessed by three NM physicians according to 6 quality dimensions (3-point scale) and an overall utility score (5-point scale). Each physician reviewed 12 of their own reports and 12 reports from other physicians. Bootstrap resampling was used for statistical analysis. Of all evaluation metrics, domain-adapted BARTScore and PEGASUSScore showed the highest Spearman's rank correlations (0.568 and 0.563) with physician preferences. Based on these metrics, the fine-tuned PEGASUS model was selected as the top LLM. When physicians reviewed PEGASUS-generated impressions in their own style, 89% were considered clinically acceptable, with a mean utility score of 4.08 out of 5. Physicians rated these personalized impressions as comparable in overall utility to the impressions dictated by other physicians (4.03, P=0.41). In conclusion, personalized impressions generated by PEGASUS were clinically useful, highlighting its potential to expedite PET reporting.
Submitted 17 October, 2023; v1 submitted 18 September, 2023;
originally announced September 2023.
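The metric-selection step in this abstract ranks automatic evaluation metrics by their Spearman rank correlation with physician quality scores. A minimal stdlib-only sketch of that correlation follows, assuming the usual definition (Pearson correlation of average ranks, with ties sharing their mean rank); the function names and the toy scores are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# Made-up metric values and physician scores for five reports.
metric_scores = [0.61, 0.72, 0.55, 0.90, 0.80]
physician_scores = [3, 4, 2, 5, 4]
rho = spearman(metric_scores, physician_scores)
```

Repeating this for each candidate metric and keeping the ones with the highest correlation mirrors the selection described above, though the study's exact procedure may differ.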
-
ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of Pneumothorax
Authors:
Zachary Huemann,
Xin Tie,
Junjie Hu,
Tyler J. Bradshaw
Abstract:
Radiology narrative reports often describe characteristics of a patient's disease, including its location, size, and shape. Motivated by the recent success of multimodal learning, we hypothesized that this descriptive text could guide medical image analysis algorithms. We proposed a novel vision-language model, ConTEXTual Net, for the task of pneumothorax segmentation on chest radiographs. ConTEXTual Net utilizes language features extracted from corresponding free-form radiology reports using a pre-trained language model. Cross-attention modules are designed to combine the intermediate output of each vision encoder layer and the text embeddings generated by the language model. ConTEXTual Net was trained on the CANDID-PTX dataset consisting of 3,196 positive cases of pneumothorax with segmentation annotations from 6 different physicians as well as clinical radiology reports. Using cross-validation, ConTEXTual Net achieved a Dice score of 0.716±0.016, which was similar to the degree of inter-reader variability (0.712±0.044) computed on a subset of the data. It outperformed both vision-only models (ResNet50 U-Net: 0.677±0.015 and GLoRIA: 0.686±0.014) and a competing vision-language model (LAVT: 0.706±0.009). Ablation studies confirmed that it was the text information that led to the performance gains. Additionally, we show that certain augmentation methods degraded ConTEXTual Net's segmentation performance by breaking the image-text concordance. We also evaluated the effects of using different language models and activation functions in the cross-attention module, highlighting the efficacy of our chosen architectural design.
Submitted 15 September, 2023; v1 submitted 2 March, 2023;
originally announced March 2023.
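The cross-attention idea described above, where image features query text embeddings, can be illustrated with a bare scaled dot-product attention in plain Python. This sketch assumes image feature vectors and token embeddings of equal dimension and omits learned projections, multiple heads, and the activation functions the abstract mentions; it is not the ConTEXTual Net module itself.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(pixel_feats, token_embs):
    """Each pixel feature (query) attends over all text tokens (keys/values)."""
    dim = len(token_embs[0])
    attended = []
    for query in pixel_feats:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in token_embs]
        weights = softmax(scores)
        attended.append([sum(w * tok[j] for w, tok in zip(weights, token_embs))
                         for j in range(dim)])
    return attended


# Two toy pixel features attending over two toy token embeddings.
pixels = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[4.0, 0.0], [0.0, 4.0]]
text_informed = cross_attention(pixels, tokens)
```

Each output vector is a weighted mix of token embeddings, weighted toward the tokens most similar to that pixel's feature; a segmentation network could then combine it with the original image feature.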
-
A Generalizable Artificial Intelligence Model for COVID-19 Classification Task Using Chest X-ray Radiographs: Evaluated Over Four Clinical Datasets with 15,097 Patients
Authors:
Ran Zhang,
Xin Tie,
John W. Garrett,
Dalton Griner,
Zhihua Qi,
Nicholas B. Bevins,
Scott B. Reeder,
Guang-Hong Chen
Abstract:
Purpose: To answer the long-standing question of whether a model trained from a single clinical site can be generalized to external sites.
Materials and Methods: 17,537 chest x-ray radiographs (CXRs) from 3,264 COVID-19-positive patients and 4,802 COVID-19-negative patients were collected from a single site for AI model development. The generalizability of the trained model was retrospectively evaluated using four different real-world clinical datasets with a total of 26,633 CXRs from 15,097 patients (3,277 COVID-19-positive patients). The area under the receiver operating characteristic curve (AUC) was used to assess diagnostic performance.
Results: The AI model trained using a single-source clinical dataset achieved an AUC of 0.82 (95% CI: 0.80, 0.84) when applied to the internal temporal test set. When applied to datasets from two external clinical sites, AUCs of 0.81 (95% CI: 0.80, 0.82) and 0.82 (95% CI: 0.80, 0.84) were achieved. An AUC of 0.79 (95% CI: 0.77, 0.81) was achieved when applied to a multi-institutional COVID-19 dataset collected by the Medical Imaging and Data Resource Center (MIDRC). A power-law dependence, N^k (with k empirically found to be between -0.21 and -0.25), indicates a relatively weak performance dependence on the training data size.
Conclusion: A COVID-19 classification AI model trained using well-curated data from a single clinical site is generalizable to external clinical sites without a significant drop in performance.
Submitted 4 October, 2022;
originally announced October 2022.
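A power-law exponent such as the k reported above is typically estimated by a straight-line fit in log-log space, since y = c * N^k implies log y = log c + k * log N. The sketch below assumes that approach; the abstract does not say which quantity follows the power law or how the fit was done, so the function name and the synthetic data are illustrative only.

```python
import math


def fit_power_law_exponent(sizes, values):
    """Least-squares slope of log(values) against log(sizes)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Synthetic data generated with a known exponent of -0.25.
training_sizes = [100, 1000, 10000]
synthetic_values = [n ** -0.25 for n in training_sizes]
k = fit_power_law_exponent(training_sizes, synthetic_values)
```

An exponent near -0.2 means a tenfold increase in training data shrinks the fitted quantity by only about 40%, which is consistent with the abstract's description of a weak dependence on training data size.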
-
Flow-based Video Segmentation for Human Head and Shoulders
Authors:
Zijian Kuang,
Xinran Tie
Abstract:
Video segmentation for the human head and shoulders is essential in creating elegant media for videoconferencing and virtual reality applications. The main challenge is to perform high-quality background subtraction in real time and to address segmentation issues under motion blur, e.g., when shaking the head or waving hands during a conference video. To overcome the motion blur problem in video segmentation, we propose a novel flow-based encoder-decoder network (FUNet) that combines the traditional Horn-Schunck optical-flow estimation technique with convolutional neural networks to perform robust real-time video segmentation. We also introduce a video and image segmentation dataset: ConferenceVideoSegmentationDataset. Code and pre-trained models are available on our GitHub repository: https://github.com/kuangzijian/Flow-Based-Video-Matting.
Submitted 20 April, 2021;
originally announced April 2021.
-
A Survey of Multimedia Technologies and Robust Algorithms
Authors:
Zijian Kuang,
Xinran Tie
Abstract:
Multimedia technologies are now more practical and deployable in real life, and the algorithms are widely used in various research areas such as deep learning, signal processing, haptics, computer vision, robotics, and medical multimedia processing. This survey provides an overview of multimedia technologies and robust algorithms in multimedia data processing, medical multimedia processing, human facial expression tracking and pose recognition, and multimedia in education and training. This survey also analyzes and proposes a future research direction based on the overview of current robust algorithms and multimedia technologies. We want to thank the Multimedia Research Centre (MRC) at the University of Alberta, whose research and previous work are the inspiration and starting point for future research.
Submitted 25 March, 2021; v1 submitted 24 March, 2021;
originally announced March 2021.
-
Computer Vision and Normalizing Flow-Based Defect Detection
Authors:
Zijian Kuang,
Xinran Tie,
Lihang Ying,
Shi Jin
Abstract:
Visual defect detection is critical to ensure the quality of most products. However, the majority of small and medium-sized manufacturing enterprises still rely on tedious and error-prone manual inspection. The main reasons include: 1) existing automated visual defect detection systems require altering production assembly lines, which is time-consuming and expensive; and 2) existing systems require manually collecting defective samples and labeling them for a comparison-based algorithm or for training a machine learning model. This places a heavy burden on small and medium-sized manufacturing enterprises, as defects do not happen often and are difficult and time-consuming to collect. Furthermore, we cannot exhaustively collect or define all defect types, as any new deviation from acceptable products is a defect. In this paper, we overcome these challenges and design a three-stage, plug-and-play, fully automated, unsupervised 360-degree defect detection system. In our system, products are freely placed on an unaltered assembly line and receive 360-degree visual inspection with multiple cameras from different angles. As such, the images collected from real-world product assembly lines contain lots of background noise, the products face different angles, and the product sizes vary with the distance to the cameras. All of these make defect detection much more difficult. Our system uses object detection, background subtraction, and unsupervised normalizing flow-based defect detection techniques to tackle these difficulties. Experiments show that our system can achieve 0.90 AUROC on a real-world, unaltered drinkware production assembly line.
Submitted 13 February, 2022; v1 submitted 12 December, 2020;
originally announced December 2020.
-
Improved Actor Relation Graph based Group Activity Recognition
Authors:
Zijian Kuang,
Xinran Tie
Abstract:
Video understanding aims to recognize and classify different actions or activities appearing in a video. A lot of previous work, such as video captioning, has shown promising performance in producing general video understanding. However, it is still challenging to generate a fine-grained description of human actions and their interactions using state-of-the-art video captioning techniques. The detailed description of human actions and group activities is essential information, which can be used in real-time CCTV video surveillance, health care, sports video analysis, etc. This study proposes a video understanding method that mainly focuses on group activity recognition by learning the pair-wise actor appearance similarity and actor positions. We propose to use normalized cross-correlation (NCC) and the sum of absolute differences (SAD) to calculate the pair-wise appearance similarity and build the actor relationship graph to allow the graph convolution network to learn how to classify group activities. We also propose to use MobileNet as the backbone to extract features from each video frame. A visualization model is further introduced to visualize each input video frame with predicted bounding boxes on each human object and predict individual action and collective activity.
Submitted 29 December, 2020; v1 submitted 24 October, 2020;
originally announced October 2020.
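The two similarity measures named in this abstract have simple textbook forms, sketched below for a pair of equally sized actor feature vectors given as flat lists. The zero-mean variant of NCC is used here as an assumption; the abstract does not say which variant the authors use, how features are extracted, or how the scores enter the relation graph.

```python
import math


def sad(a, b):
    """Sum of absolute differences: 0 for identical inputs, larger is less similar."""
    return sum(abs(x - y) for x, y in zip(a, b))


def ncc(a, b):
    """Zero-mean normalized cross-correlation in [-1, 1]; 1 means same pattern."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                    sum((y - mean_b) ** 2 for y in b))
    # A constant input has no variation to correlate; return 0 by convention.
    return num / den if den else 0.0


# Toy appearance features for two actors.
actor_1 = [1.0, 2.0, 3.0]
actor_2 = [2.0, 4.0, 6.0]
similarity_ncc = ncc(actor_1, actor_2)
distance_sad = sad(actor_1, actor_2)
```

NCC ignores differences in brightness scale and offset, while SAD is sensitive to them, which is one reason the two can complement each other as edge weights between actors.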