-
How much can ChatGPT really help Computational Biologists in Programming?
Authors:
Chowdhury Rafeed Rahman,
Limsoon Wong
Abstract:
ChatGPT, a recently developed product by OpenAI, is successfully leaving its mark as a multi-purpose natural language based chatbot. In this paper, we are interested in analyzing its potential in the field of computational biology. A major share of the work done by computational biologists these days involves coding up bioinformatics algorithms, analyzing data, creating pipelining scripts and even machine learning modeling and feature extraction. This paper focuses on the potential influence (both positive and negative) of ChatGPT in the mentioned aspects with illustrative examples from different perspectives. Compared to other fields of computer science, computational biology has (1) fewer coding resources, (2) more sensitivity and bias issues (it deals with medical data) and (3) a greater need for coding assistance (people from diverse backgrounds come to this field). Keeping such issues in mind, we cover use cases such as code writing, reviewing, debugging, converting, refactoring and pipelining using ChatGPT from the perspective of computational biologists.
Submitted 4 December, 2023; v1 submitted 16 September, 2023;
originally announced September 2023.
-
ReviewRanker: A Semi-Supervised Learning Based Approach for Code Review Quality Estimation
Authors:
Saifullah Mahbub,
Md. Easin Arafat,
Chowdhury Rafeed Rahman,
Zannatul Ferdows,
Masum Hasan
Abstract:
Code review is considered a key process in the software industry for minimizing bugs and improving code quality. Inspection of review process effectiveness and continuous improvement can boost development productivity. Such inspection is a time-consuming and human-bias-prone task. We propose a semi-supervised learning based system, ReviewRanker, which aims to assign each code review a confidence score that is expected to reflect the quality of the review. Our proposed method is trained on simple and well-defined labels provided by developers. The labeling task requires little to no effort from the developers and has an indirect relation to the end goal (assignment of review confidence score). ReviewRanker is expected to improve industry-wide code review quality inspection by reducing the human bias and effort required for such a task. The system has the potential to minimize the back-and-forth cycle existing in the development and review process. Usable code and dataset for this research can be found at: https://github.com/saifarnab/code_review
Submitted 8 July, 2023;
originally announced July 2023.
-
BSpell: A CNN-Blended BERT Based Bangla Spell Checker
Authors:
Chowdhury Rafeed Rahman,
MD. Hasibur Rahman,
Samiha Zakir,
Mohammad Rafsan,
Mohammed Eunus Ali
Abstract:
Bangla typing is mostly performed using an English keyboard and can be highly erroneous due to the presence of compound and similarly pronounced letters. Spelling correction of a misspelled word requires understanding of the word typing pattern as well as the context of the word usage. A specialized BERT model named BSpell is proposed in this paper, targeted at word-for-word correction at the sentence level. BSpell contains an end-to-end trainable CNN sub-model named SemanticNet along with a specialized auxiliary loss. This allows BSpell to specialize in highly inflected Bangla vocabulary in the presence of spelling errors. Furthermore, a hybrid pretraining scheme is proposed for BSpell that combines word level and character level masking. Comparison on two Bangla and one Hindi spelling correction datasets shows the superiority of our proposed approach. BSpell is available as a Bangla spell checking tool via GitHub: https://github.com/Hasiburshanto/Bangla-Spell-Checker
Submitted 31 December, 2023; v1 submitted 20 August, 2022;
originally announced August 2022.
-
Judge a Sentence by Its Content to Generate Grammatical Errors
Authors:
Chowdhury Rafeed Rahman
Abstract:
Data sparsity is a well-known problem for grammatical error correction (GEC). Generating synthetic training data is one widely proposed solution to this problem, and has allowed models to achieve state-of-the-art (SOTA) performance in recent years. However, these methods often generate unrealistic errors, or aim to generate sentences with only one error. We propose a learning based two stage method for synthetic data generation for GEC that relaxes this constraint on sentences containing only one error. Errors are generated in accordance with sentence merit. We show that a GEC model trained on our synthetically generated corpus outperforms models trained on synthetic data from prior work.
Submitted 20 August, 2022;
originally announced August 2022.
-
Paradigm Shift in Language Modeling: Revisiting CNN for Modeling Sanskrit Originated Bengali and Hindi Language
Authors:
Chowdhury Rafeed Rahman,
MD. Hasibur Rahman,
Mohammad Rafsan,
Samiha Zakir,
Mohammed Eunus Ali,
Rafsanjani Muhammod
Abstract:
Though there has been a large body of recent work in language modeling (LM) for high-resource languages such as English and Chinese, the area is still unexplored for low-resource languages like Bengali and Hindi. We propose an end-to-end trainable, memory-efficient CNN architecture named CoCNN to handle specific characteristics of Bengali and Hindi such as high inflection, morphological richness, flexible word order and phonetic spelling errors. In particular, we introduce two learnable convolutional sub-models, at word level and at sentence level, that are end-to-end trainable. We show that state-of-the-art (SOTA) Transformer models including pretrained BERT do not necessarily yield the best performance for Bengali and Hindi. CoCNN outperforms pretrained BERT with 16X fewer parameters, and it achieves much better performance than SOTA LSTM models on multiple real-world datasets. This is the first study on the effectiveness of different architectures drawn from three deep learning paradigms (Convolution, Recurrent, and Transformer neural nets) for modeling two widely used languages, Bengali and Hindi.
Submitted 4 November, 2021; v1 submitted 25 October, 2021;
originally announced October 2021.
-
i6mA-CNN: a convolution based computational approach towards identification of DNA N6-methyladenine sites in rice genome
Authors:
Ruhul Amin,
Chowdhury Rafeed Rahman,
Md. Sadrul Islam Toaha,
Swakkhar Shatabda
Abstract:
DNA N6-methylation (6mA) of the adenine nucleotide is a post-replication modification and is responsible for many biological functions. Experimental genome-wide 6mA site detection is an expensive and labour-intensive manual process. Automated and accurate computational methods can help to identify 6mA sites in long genomes, saving significant time and money. Our study develops a convolutional neural network based tool, i6mA-CNN, capable of identifying 6mA sites in the rice genome. Our model coordinates among multiple types of features such as a PseAAC-inspired customized feature vector, multiple one-hot representations and dinucleotide physicochemical properties. It achieves an area under the receiver operating characteristic curve of 0.98 with an overall accuracy of 0.94 using 5-fold cross-validation on the benchmark dataset. Finally, we evaluate our model on two other plant genome 6mA site identification datasets besides rice. Results suggest that our proposed tool is able to generalize its ability of 6mA site identification to plant genomes irrespective of plant species. Web tool for this research can be found at: https://cutt.ly/Co6KuWG. Supplementary data (benchmark dataset, independent test dataset, comparison purpose dataset, trained model, physicochemical property values, attention mechanism details for motif finding) are available at https://cutt.ly/PpDdeDH.
Submitted 11 August, 2020; v1 submitted 20 July, 2020;
originally announced July 2020.
-
Rice grain disease identification using dual phase convolutional neural network based system aimed at small dataset
Authors:
Tashin Ahmed,
Chowdhury Rafeed Rahman,
Md. Faysal Mahmud Abid
Abstract:
Although convolutional neural networks (CNNs) are widely used for plant disease detection, they require a large number of training samples when dealing with a wide variety of heterogeneous backgrounds. In this work, a CNN based dual phase method is proposed which can work effectively on a small rice grain disease dataset with heterogeneity. In the first phase, the Faster RCNN method is applied to crop out the significant portion (rice grain) from the image. This initial phase results in a secondary dataset of rice grains devoid of heterogeneous background. Disease classification is performed on such derived and simplified samples using a CNN architecture. Comparison of the dual phase approach with straightforward application of a CNN on the small grain dataset shows the effectiveness of the proposed method, which provides a 5-fold cross-validation accuracy of 88.07%.
Submitted 7 May, 2021; v1 submitted 21 April, 2020;
originally announced April 2020.
-
Confronting the Constraints for Optical Character Segmentation from Printed Bangla Text Image
Authors:
Abu Saleh Md. Abir,
Sanjana Rahman,
Samia Ellin,
Maisha Farzana,
Md Hridoy Manik,
Chowdhury Rafeed Rahman
Abstract:
In a world of digitization, optical character recognition holds the key to automating written history. An optical character recognition system converts printed images into editable text for better storage and usability. To be completely functional, the system needs to go through some crucial steps such as pre-processing and segmentation. Pre-processing makes the printed data noise free and removes skewness efficiently, whereas segmentation fragments the image precisely into lines, words and characters for better conversion. These steps open the door to better accuracy and consistent results when preparing a printed image for conversion. Our proposed algorithm is able to segment characters from both ideal and non-ideal cases of scanned or captured images, giving a sustainable outcome. The implementation of our work is provided here: https://cutt.ly/rgdfBIa
Submitted 5 January, 2021; v1 submitted 18 March, 2020;
originally announced March 2020.
-
Synthetic Error Dataset Generation Mimicking Bengali Writing Pattern
Authors:
Md. Habibur Rahman Sifat,
Chowdhury Rafeed Rahman,
Mohammad Rafsan,
Md. Hasibur Rahman
Abstract:
While writing Bengali using an English keyboard, users often make spelling mistakes. The accuracy of any Bengali spell checker or paragraph correction module largely depends on the kind of error dataset it is based on. Manual generation of such an error dataset is a cumbersome process. In this research, we present an algorithm for automatic generation of misspelled Bengali words from correct words by analyzing the Bengali writing pattern on a QWERTY layout English keyboard. As part of our analysis, we have formed a list of the most commonly used Bengali words, phonetically similar replaceable clusters, frequently mispressed replaceable clusters, frequently mispressed insertion prone clusters and some rules for Juktakkhar (consonant letter clusters) handling while generating errors.
Submitted 21 May, 2020; v1 submitted 6 March, 2020;
originally announced March 2020.
-
Automatic Signboard Detection and Localization in Densely Populated Developing Cities
Authors:
Md. Sadrul Islam Toaha,
Sakib Bin Asad,
Chowdhury Rafeed Rahman,
S. M. Shahriar Haque,
Mahfuz Ara Proma,
Md. Ahsan Habib Shuvo,
Tashin Ahmed,
Md. Amimul Basher
Abstract:
Most city establishments of developing cities are digitally unlabeled because of the lack of automatic annotation systems. Hence location and trajectory services such as Google Maps, Uber, etc. remain underutilized in such cities. Accurate signboard detection in natural scene images is the foremost task for error-free information retrieval from such city streets. Yet, developing an accurate signboard localization system is still an unresolved challenge because signboards have diverse appearances that include textual images and perplexing backgrounds. We present a novel object detection approach that can detect signboards automatically and is suitable for such cities. We use Faster R-CNN based localization, incorporating two specialized pretraining methods and a run time efficient hyperparameter value selection algorithm. We have taken an incremental approach in reaching our final proposed method through detailed evaluation and comparison with baselines using our constructed SVSO (Street View Signboard Objects) signboard dataset, which contains signboard natural scene images of six developing countries. We demonstrate state-of-the-art performance of our proposed method on both the SVSO dataset and the Open Image Dataset. Our proposed method can detect signboards accurately (even if the images contain multiple signboards with diverse shapes and colours in a noisy background), achieving a 0.90 mAP (mean average precision) score on the SVSO independent test set. Our implementation is available at: https://github.com/sadrultoaha/Signboard-Detection
Submitted 22 August, 2022; v1 submitted 4 March, 2020;
originally announced March 2020.
-
iPromoter-BnCNN: a Novel Branched CNN Based Predictor for Identifying and Classifying Sigma Promoters
Authors:
Ruhul Amin,
Chowdhury Rafeed Rahman,
Md. Habibur Rahman Sifat,
Md Nazmul Khan Liton,
Md. Moshiur Rahman,
Swakkhar Shatabda,
Sajid Ahmed
Abstract:
A promoter is a short region of DNA which is responsible for initiating transcription of specific genes. Development of computational tools for automatic identification of promoters is in high demand. Promoters can be of different types according to their function. Promoters may have both intra and inter class variation and similarity in terms of consensus sequences. Accurate classification of various types of sigma promoters still remains a challenge. We present iPromoter-BnCNN for identification and accurate classification of six types of promoters: sigma24, sigma28, sigma32, sigma38, sigma54 and sigma70. It is a Convolutional Neural Network (CNN) based classifier which combines local features related to monomer nucleotide sequence, trimer nucleotide sequence, dimer structural properties and trimer structural properties through the use of parallel branching. We conducted experiments on a benchmark dataset and compared with two state-of-the-art tools to show the superiority of our method on 5-fold cross-validation. Moreover, we tested our classifier on an independent test dataset. Our proposed tool iPromoter-BnCNN web server is freely available at http://103.109.52.8/iPromoter-BnCNN. The runnable source code can be found at https://colab.research.google.com/drive/1yWWh7BXhsm8U4PODgPqlQRy23QGjF2DZ.
Submitted 16 June, 2020; v1 submitted 21 December, 2019;
originally announced December 2019.
-
A Hybrid Approach Towards Two Stage Bengali Question Classification Utilizing Smart Data Balancing Technique
Authors:
Md. Hasibur Rahman,
Chowdhury Rafeed Rahman,
Ruhul Amin,
Md. Habibur Rahman Sifat,
Afra Anika
Abstract:
Question classification (QC) is the primary step of a Question Answering (QA) system. A QC system classifies questions into particular classes so that the QA system can provide correct answers for the questions. Our system categorizes factoid type questions asked in natural language after extracting features of the questions. We present a two stage QC system for Bengali. It utilizes a one dimensional convolutional neural network (1D CNN) for classifying questions into coarse classes in the first stage. Word2vec representations of the existing words of the question corpus have been constructed and used for assisting the 1D CNN. A smart data balancing technique has been employed to give the data hungry convolutional neural network the advantage of a greater number of effective samples to learn from. For each coarse class, a separate Stochastic Gradient Descent (SGD) based classifier has been used in order to differentiate among the finer classes within that coarse class. The TF-IDF representation of each word has been used as a feature for the SGD classifiers implemented as part of the second stage classification. Experiments show the effectiveness of our proposed method for Bengali question classification.
Submitted 2 March, 2020; v1 submitted 29 November, 2019;
originally announced December 2019.
-
A Comprehensive Comparison of Machine Learning Based Methods Used in Bengali Question Classification
Authors:
Afra Anika,
Md. Hasibur Rahman,
Salekul Islam,
Abu Shafin Mohammad Mahdee Jameel,
Chowdhury Rafeed Rahman
Abstract:
A QA classification system maps questions asked by humans to an appropriate answer category. A sound question classification (QC) system model is the prerequisite of a sound QA system. This work demonstrates the phases of assembling a QA type classification model. We present a comprehensive comparison (performance and computational complexity) among some machine learning based approaches used in QC for the Bengali language.
Submitted 19 November, 2019; v1 submitted 8 November, 2019;
originally announced November 2019.
-
Identification and Recognition of Rice Diseases and Pests Using Convolutional Neural Networks
Authors:
Chowdhury Rafeed Rahman,
Preetom Saha Arko,
Mohammed Eunus Ali,
Mohammad Ashik Iqbal Khan,
Sajid Hasan Apon,
Farzana Nowrin,
Abu Wasif
Abstract:
Accurate and timely detection of diseases and pests in rice plants can help farmers apply timely treatment to the plants and thereby reduce economic losses substantially. Recent developments in deep learning based convolutional neural networks (CNNs) have greatly improved image classification accuracy. Motivated by the success of CNNs in image classification, deep learning based approaches have been developed in this paper for detecting diseases and pests from rice plant images. The contribution of this paper is twofold: (i) State-of-the-art large scale architectures such as VGG16 and InceptionV3 have been adopted and fine tuned for detecting and recognizing rice diseases and pests. Experimental results show the effectiveness of these models with real datasets. (ii) Since large scale architectures are not suitable for mobile devices, a two-stage small CNN architecture has been proposed and compared with state-of-the-art memory efficient CNN architectures such as MobileNet, NasNet Mobile and SqueezeNet. Experimental results show that the proposed architecture can achieve the desired accuracy of 93.3% with a significantly reduced model size (e.g., 99% smaller than VGG16).
Submitted 4 March, 2020; v1 submitted 3 December, 2018;
originally announced December 2018.