-
A Pan-cancer Classification Model using Multi-view Feature Selection Method and Ensemble Classifier
Authors:
Tareque Mohmud Chowdhury,
Farzana Tabassum,
Sabrina Islam,
Abu Raihan Mostofa Kamal
Abstract:
Accurately identifying cancer samples is crucial for precise diagnosis and effective patient treatment. Traditional methods falter on high-dimensional data with a high feature-to-sample ratio, the kind of data that is critical for classifying cancer samples. This study aims to develop a novel feature selection framework specifically for transcriptome data and to propose two ensemble classifiers. For feature selection, we partition the transcriptome dataset vertically based on feature types, apply the Boruta feature selection process to each partition, combine the results, and apply Boruta again to the combined result. We repeat the process with different Boruta parameters to prepare the final feature set. Finally, we construct two ensemble ML models based on LR, SVM, and XGBoost classifiers, using max voting and probability averaging approaches. We use 10-fold cross-validation to ensure robust and reliable classification performance. With 97.11% accuracy and an AUC of 0.9996, our approach outperforms existing state-of-the-art methods in classifying 33 types of cancer. A set of 12 cancer types is traditionally difficult to tell apart because of their similar tissue of origin. Our method correctly identifies over 90% of samples from these 12 types, outperforming all known methods in the existing literature. Gene set enrichment analysis reveals that the features selected by our framework enrich pathways highly related to cancer. This study develops a feature selection framework that selects features highly related to cancer development and identifies samples of different cancer types with higher accuracy.
Submitted 12 January, 2025;
originally announced January 2025.
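The abstract above describes a two-stage selection (select within each feature-type partition, pool the survivors, select again) and two ways of combining classifiers (max voting and probability averaging). The following is a minimal sketch of those ideas only, not the authors' implementation. The `select` callable is a hypothetical stand-in for a Boruta run, and the function names are illustrative assumptions.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-in for one Boruta run: takes candidate feature names
# and returns the subset it keeps. Not a real Boruta API.
Selector = Callable[[List[str]], List[str]]


def multi_view_select(views: Dict[str, List[str]], select: Selector) -> List[str]:
    """Select within each partition, pool the survivors, then select again."""
    pooled: List[str] = []
    for features in views.values():
        pooled.extend(select(features))
    return select(pooled)


def max_vote(probas: Sequence[Sequence[float]]) -> int:
    """Each classifier votes for its most probable class; the majority wins.

    Ties go to the class that received a vote first.
    """
    votes = [max(range(len(p)), key=lambda c: p[c]) for p in probas]
    return Counter(votes).most_common(1)[0][0]


def average_proba(probas: Sequence[Sequence[float]]) -> int:
    """Average class probabilities across classifiers, then take the argmax."""
    n_classes = len(probas[0])
    mean = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: mean[c])
```

The two combination rules can disagree. For three classifiers reporting made-up probabilities [0.9, 0.1, 0.0], [0.4, 0.6, 0.0], and [0.45, 0.55, 0.0], max voting picks class 1 (two of three votes), while averaging picks class 0 because the first classifier is far more confident.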
-
BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis
Authors:
Sadia Alam,
Md Farhan Ishmam,
Navid Hasin Alvee,
Md Shahnewaz Siddique,
Md Azam Hossain,
Abu Raihan Mostofa Kamal
Abstract:
The widespread availability of code-mixed data can provide valuable insights into low-resource languages like Bengali, which have limited datasets. Sentiment analysis has been a fundamental text classification task across several languages for code-mixed data. However, there has yet to be a large-scale and diverse sentiment analysis dataset on code-mixed Bengali. We address this limitation by introducing BnSentMix, a sentiment analysis dataset on code-mixed Bengali consisting of 20,000 samples with 4 sentiment labels from Facebook, YouTube, and e-commerce sites. We ensure diversity in data sources to replicate realistic code-mixed scenarios. Additionally, we propose 14 baseline methods including novel transformer encoders further pre-trained on code-mixed Bengali-English, achieving an overall accuracy of 69.8% and an F1 score of 69.1% on sentiment classification tasks. Detailed analyses reveal variations in performance across different sentiment labels and text types, highlighting areas for future improvement.
Submitted 9 December, 2024; v1 submitted 16 August, 2024;
originally announced August 2024.
-
Visual Robustness Benchmark for Visual Question Answering (VQA)
Authors:
Md Farhan Ishmam,
Ishmam Tashdeed,
Talukder Asir Saadat,
Md Hamjajul Ashmafee,
Abu Raihan Mostofa Kamal,
Md. Azam Hossain
Abstract:
Can Visual Question Answering (VQA) systems perform just as well when deployed in the real world? Or are they susceptible to realistic corruption effects, such as image blur, which can be detrimental in sensitive applications like medical VQA? While linguistic or textual robustness has been thoroughly explored in the VQA literature, there has yet to be any significant work on the visual robustness of VQA models. We propose the first large-scale benchmark comprising 213,000 augmented images, challenging the visual robustness of multiple VQA models and assessing the strength of realistic visual corruptions. Additionally, we have designed several robustness evaluation metrics that can be aggregated into a unified metric and tailored to fit a variety of use cases. Our experiments reveal several insights into the relationships between model size, performance, and robustness to visual corruptions. Our benchmark highlights the need for a balanced approach to model development that considers model performance without compromising robustness.
Submitted 29 October, 2024; v1 submitted 3 July, 2024;
originally announced July 2024.
-
ChartSumm: A Comprehensive Benchmark for Automatic Chart Summarization of Long and Short Summaries
Authors:
Raian Rahman,
Rizvi Hasan,
Abdullah Al Farhad,
Md Tahmid Rahman Laskar,
Md. Hamjajul Ashmafee,
Abu Raihan Mostofa Kamal
Abstract:
Automatic chart-to-text summarization is an effective tool for visually impaired people and also gives users precise insights into tabular data in natural language. A large and well-structured dataset is always a key part of data-driven models. In this paper, we propose ChartSumm: a large-scale benchmark dataset consisting of a total of 84,363 charts along with their metadata and descriptions, covering a wide range of topics and chart types, for generating short and long summaries. Extensive experiments with strong baseline models show that even though these models generate fluent and informative summaries and achieve decent scores on various automatic evaluation metrics, they often hallucinate, miss important data points, and explain complex trends in the charts incorrectly. We also investigated the potential of expanding ChartSumm to other languages using automated translation tools. These findings make our dataset a challenging benchmark for future research.
Submitted 11 June, 2023; v1 submitted 26 April, 2023;
originally announced April 2023.
-
AutoCl : A Visual Interactive System for Automatic Deep Learning Classifier Recommendation Based on Models Performance
Authors:
Fuad Ahmed,
Rubayea Ferdows,
Md Rafiqul Islam,
Abu Raihan M. Kamal
Abstract:
With deep learning (DL) models being increasingly applied to various fields, people without technical expertise and domain knowledge struggle to find an appropriate model for their task. In this paper, we introduce AutoCl, a visual interactive recommender system aimed at helping non-experts adopt an appropriate DL classifier. Our system enables users to compare the performance and behavior of multiple classifiers trained with various hyperparameter setups and automatically recommends the best classifier with appropriate hyperparameters. We compare the features of AutoCl against several recent AutoML systems and show that it better helps non-experts choose a DL classifier. Finally, we demonstrate use cases for image classification using a publicly available dataset to show the capability of our system.
Submitted 24 February, 2022;
originally announced February 2022.
-
Disease Identification From Unstructured User Input
Authors:
Fahim Faisal,
Shafkat Ahmed Bhuiyan,
Abu Raihan Mostofa Kamal
Abstract:
A method is presented to identify probable diseases from unstructured textual input (e.g., health forum posts) by combining a two-phase text classification module based on lexicographic and semantic features with a similarity measurement module based on symptom-disease correlation. One notable aspect of the approach is a competent algorithm that extracts all inherent features from the data source to make a better decision.
Submitted 10 May, 2019; v1 submitted 1 May, 2019;
originally announced May 2019.