-
Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment
Authors:
Alvi Md Ishmam,
Christopher Thomas
Abstract:
In recent years there has been enormous interest in vision-language models trained using self-supervised objectives. However, the use of large-scale datasets scraped from the web for training also makes these models vulnerable to potential security threats, such as backdooring and poisoning attacks. In this paper, we propose a method for mitigating such attacks on contrastively trained vision-language models. Our approach leverages external knowledge extracted from a language model to prevent models from learning correlations between image regions which lack strong alignment with external knowledge. We do this by imposing constraints to enforce that attention paid by the model to visual regions is proportional to the alignment of those regions with external knowledge. We conduct extensive experiments using a variety of recent backdooring and poisoning attacks on multiple datasets and architectures. Our results clearly demonstrate that our proposed approach is highly effective at defending against such attacks across multiple settings, while maintaining model utility and without requiring any changes at inference time.
Submitted 23 November, 2024;
originally announced November 2024.
-
JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images
Authors:
Zhecan Wang,
Junzhang Liu,
Chia-Wei Tang,
Hani Alomari,
Anushka Sivakumar,
Rui Sun,
Wenhao Li,
Md. Atabuzzaman,
Hammad Ayyubi,
Haoxuan You,
Alvi Ishmam,
Kai-Wei Chang,
Shih-Fu Chang,
Chris Thomas
Abstract:
Existing vision-language understanding benchmarks largely consist of images of objects in their usual contexts. As a consequence, recent multimodal large language models can perform well with only a shallow visual understanding by relying on background language biases. Thus, strong performance on these benchmarks does not necessarily correlate with strong visual understanding. In this paper, we release JourneyBench, a comprehensive human-annotated benchmark of generated images designed to assess the model's fine-grained multimodal reasoning abilities across five tasks: complementary multimodal chain of thought, multi-image VQA, imaginary image captioning, VQA with hallucination triggers, and fine-grained retrieval with sample-specific distractors. Unlike existing benchmarks, JourneyBench explicitly requires fine-grained multimodal reasoning in unusual imaginary scenarios where language bias and holistic image gist are insufficient. We benchmark state-of-the-art models on JourneyBench and analyze performance along a number of fine-grained dimensions. Results across all five tasks show that JourneyBench is exceptionally challenging for even the best models, indicating that models' visual reasoning abilities are not as strong as they first appear. We discuss the implications of our findings and propose avenues for further research.
Submitted 9 January, 2025; v1 submitted 19 September, 2024;
originally announced September 2024.
-
Automatic Detection of Natural Disaster Effect on Paddy Field from Satellite Images using Deep Learning Techniques
Authors:
Tahmid Alavi Ishmam,
Amin Ahsan Ali,
Md Ahsraful Amin,
A K M Mahbubur Rahman
Abstract:
This paper aims to detect rice field damage from natural disasters in Bangladesh using high-resolution satellite imagery. The authors developed ground truth data for rice field damage from the field level. First, NDVI differences before and after the disaster are calculated to identify possible crop loss. Areas at or above the 0.33 threshold are marked as crop loss areas, since significant changes are observed there. The authors also verified crop loss areas by collecting data from local farmers. Next, different bands of satellite data, (Red, Green, Blue) and False Color Infrared (FCI), are used to detect crop loss areas. We used the NDVI difference images as ground truth to train the DeepLabV3plus model. With RGB we obtained an IoU of 0.41, and with FCI an IoU of 0.51. Because FCI uses the NIR, Red, and Blue bands and NDVI is the normalized difference between the NIR and Red bands, a higher IoU for FCI than for RGB is expected. Still, RGB does not perform very badly here, so where other bands are not available, RGB can be used to understand crop loss areas to some extent. The ground truth developed in this paper can be used for segmentation models with very high resolution RGB-only images, such as those from Bing and Google.
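The NDVI-difference labelling and the IoU score mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the flat per-pixel lists, and the sign convention (NDVI before minus NDVI after) are assumptions made for illustration; only the NDVI definition and the 0.33 threshold come from the abstract.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index for a single pixel."""
    total = nir + red
    # Guard against division by zero for pixels with no signal.
    return 0.0 if total == 0 else (nir - red) / total


def crop_loss_mask(nir_before, red_before, nir_after, red_after, threshold=0.33):
    """Flag pixels whose NDVI dropped by at least `threshold`.

    Assumption: the difference is taken as NDVI(before) - NDVI(after).
    """
    mask = []
    for nb, rb, na, ra in zip(nir_before, red_before, nir_after, red_after):
        drop = ndvi(nb, rb) - ndvi(na, ra)
        mask.append(drop >= threshold)
    return mask


def iou(pred, truth):
    """Intersection over union of two boolean masks of equal length."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    # Two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


# Hypothetical two-pixel example: the first pixel loses vegetation, the second does not.
mask = crop_loss_mask([0.8, 0.8], [0.2, 0.2], [0.3, 0.8], [0.3, 0.2])
score = iou(mask, [True, True])
```

In practice the masks would be 2-D rasters and the predicted mask would come from a segmentation model rather than from the threshold itself.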
Submitted 2 April, 2023;
originally announced April 2023.
-
BLPnet: A new DNN model and Bengali OCR engine for Automatic License Plate Recognition
Authors:
Md. Saif Hassan Onim,
Hussain Nyeem,
Koushik Roy,
Mahmudul Hasan,
Abtahi Ishmam,
Md. Akiful Hoque Akif,
Tareque Bashar Ovi
Abstract:
The development of the Automatic License Plate Recognition (ALPR) system has received much attention for the English license plate. However, despite their being the sixth largest population in the world, no significant progress on ALPR can be tracked in the Bengali-language countries or states to address their more alarming traffic management with inadequate road-safety measures. This paper reports a computationally efficient and reasonably accurate ALPR system for Bengali characters with a new end-to-end DNN model that we call Bengali License Plate Network (BLPnet). The cascaded architecture, which detects vehicle regions before the vehicle license plate (VLP), is proposed to eliminate false positives, resulting in higher VLP detection accuracy. In addition, a smaller set of trainable parameters is used to reduce the computational cost, making the system faster and more suitable for real-time application. With a new Convolutional Neural Network (CNN)-based Bengali OCR engine and a word-mapping process, the model is invariant to character rotation and can readily extract, detect, and output the complete license plate number of a vehicle. Fed with real-time video footage at 17 frames per second (fps), the model can detect a vehicle with a Mean Squared Error (MSE) of 0.0152 and achieves a mean license plate character recognition accuracy of 95%. Compared with other models, BLPnet recorded improvements of 5% and 20% over the prominent YOLO-based ALPR model and the Tesseract model in number-plate detection accuracy and time requirement, respectively.
Submitted 18 February, 2022;
originally announced February 2022.
-
Modelling Lips-State Detection Using CNN for Non-Verbal Communications
Authors:
Abtahi Ishmam,
Mahmudul Hasan,
Md. Saif Hassan Onim,
Koushik Roy,
Md. Akiful Haque Akif,
Hussain Nyeem
Abstract:
Vision-based deep learning models can be promising for speech-and-hearing-impaired and secret communications. While such non-verbal communications are primarily investigated with hand gestures and facial expressions, no research endeavour has so far been tracked for a lips state (i.e., open/close)-based interpretation/translation system. In support of this development, this paper reports two new Convolutional Neural Network (CNN) models for lips state detection. Building upon two prominent lips landmark detectors, DLIB and MediaPipe, we simplify the lips-state model to a set of six key landmarks and use their distances for lips state classification. Both models are thereby developed to count the opening and closing of lips, and thus they can classify a symbol from the total count. Varying frame rates, lip movements, and face angles are investigated to determine the effectiveness of the models. Our early experimental results demonstrate that the model with DLIB is relatively slow, averaging 6 frames per second (FPS), but has a higher average detection accuracy of 95.25%. In contrast, the model with MediaPipe offers faster landmark detection, with an average FPS of 20 and a detection accuracy of 94.4%. Both models could thus effectively interpret the lips state for non-verbal semantics into a natural language.
Submitted 11 December, 2021; v1 submitted 9 December, 2021;
originally announced December 2021.
-
Demand Forecasting in Smart Grid Using Long Short-Term Memory
Authors:
Koushik Roy,
Abtahi Ishmam,
Kazi Abu Taher
Abstract:
Demand forecasting in the power sector has become an important part of modern demand management and response systems with the rise of smart-metering-enabled grids. Long Short-Term Memory (LSTM) shows promising results in predicting time series data, which can also be applied to power load demand in smart grids. In this paper, an LSTM-based model using a neural network architecture is proposed to forecast power demand. The model is trained with four years of hourly energy and power usage data from a smart grid. After training and prediction, the accuracy of the model is compared against traditional statistical time series analysis algorithms, such as Auto-Regressive (AR), to determine its efficiency. The mean absolute percentage error is found to be 1.22 for the proposed LSTM model, which is the lowest among the models compared. From the findings, it is clear that the inclusion of a neural network in predicting power demand reduces the prediction error significantly. Thus, the application of LSTM can enable a more efficient demand response system.
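For readers unfamiliar with the error metric quoted in this abstract, the following is a minimal sketch of the mean absolute percentage error (MAPE), which is presumably the metric reported. The function name and the sample load values are hypothetical and are not taken from the paper.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes equal-length, non-empty sequences with no zero in `actual`.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # Average the absolute relative errors, then scale to percent.
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical hourly loads and forecasts; the relative errors are 10% and 5%.
error = mape([100.0, 200.0], [110.0, 190.0])  # roughly 7.5 (percent)
```

Because each error is divided by the actual value, MAPE is undefined when an actual load is zero, which is one reason it suits aggregate grid demand better than sparse per-device readings.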
Submitted 28 July, 2021;
originally announced July 2021.
-
Towards Interpretable Multilingual Detection of Hate Speech against Immigrants and Women in Twitter at SemEval-2019 Task 5
Authors:
Alvi Md Ishmam
Abstract:
This paper describes our techniques to detect hate speech against women and immigrants on Twitter in multilingual contexts, particularly in English and Spanish. The challenge was designed by SemEval-2019 Task 5, where the participants need to design algorithms to detect hate speech in English and Spanish with a given target (e.g., women or immigrants). Here, we have developed two deep neural networks (a Bidirectional Gated Recurrent Unit (GRU) and a Character-level Convolutional Neural Network (CNN)) and one machine learning model by exploiting the linguistic features. Our proposed model obtained F1 scores of 57 and 75 for Task A in English and Spanish, respectively. For Task B, the F1 scores are 67 for English and 75.33 for Spanish. In the case of Task A (Spanish) and Task B (both English and Spanish), the F1 scores are improved by 2, 10, and 5 points, respectively. In addition, we present visually interpretable models that can address the generalizability issues of the custom-designed machine learning architecture by investigating the annotated dataset.
Submitted 26 November, 2020;
originally announced November 2020.
-
Challenges of Bridging the Gap between Mass People and Welfare Organizations in Bangladesh
Authors:
Alvi Md Ishmam,
Md Raihan Mia
Abstract:
Computing for the development of marginalized communities poses a great many challenges for researchers. Different social organizations are working to improve the conditions of a particular marginalized community, namely street children, one of the most underprivileged communities in Bangladesh. However, a lack of proper engagement among different social welfare organizations, donors, and the mass community limits progress toward the development of street children. Developing a virtual organization hub can eliminate the communication gap as well as the information gap by involving people of all communities. However, some human-imposed stigmas may often limit the success rate of potential virtual computing solutions intended for organizations working with marginalized communities, which we also face in our case. After a partially successful deployment, the design itself needs to be self-comprehensive and trustworthy in order to overcome the stigmas, which demands a reasonable amount of time. Moreover, after a widely scalable deployment, it is yet to be investigated whether the design of our computational solution can attain the goal of facilitating the organizations so that they can become more effective for the development of street children than before.
Submitted 2 April, 2020; v1 submitted 23 March, 2020;
originally announced March 2020.