-
TriNER: A Series of Named Entity Recognition Models For Hindi, Bengali & Marathi
Authors:
Mohammed Amaan Dhamaskar,
Rasika Ransing
Abstract:
India's rich cultural and linguistic diversity poses various challenges in the domain of Natural Language Processing (NLP), particularly in Named Entity Recognition (NER). NER is an NLP task that aims to identify and classify tokens into entity groups such as Person, Location, Organization, and Number. This makes NER very useful for downstream tasks like context-aware anonymization. This paper details our work to build a multilingual NER model for the three most spoken languages in India: Hindi, Bengali & Marathi. We train a custom transformer model and fine-tune a few pretrained models, achieving an F1 score of 92.11 across a total of 6 entity groups. Through this paper, we aim to introduce a single model to perform NER and significantly reduce the inconsistencies in entity groups and tag names across the three languages.
Submitted 6 February, 2025;
originally announced February 2025.
-
Survey of Pseudonymization, Abstractive Summarization & Spell Checker for Hindi and Marathi
Authors:
Rasika Ransing,
Mohammed Amaan Dhamaskar,
Ayush Rajpurohit,
Amey Dhoke,
Sanket Dalvi
Abstract:
India's vast linguistic diversity presents unique challenges and opportunities for technological advancement, especially in the realm of Natural Language Processing (NLP). While there has been significant progress in NLP applications for widely spoken languages, the regional languages of India, such as Marathi and Hindi, remain underserved. Research in the field of NLP for Indian regional languages is at a formative stage and holds immense significance. The paper aims to build a platform that enables users to access features such as text anonymization, abstractive text summarization and spell checking in English, Hindi and Marathi. These tools are intended to serve enterprise and consumer clients who predominantly use Indian regional languages.
Submitted 23 December, 2024;
originally announced December 2024.
-
Dynamical Embedding of Single Channel Electroencephalogram for Artifact Subspace Reconstruction
Authors:
Doli Hazarika,
Vishnu KN,
Ramdas Ransing,
Cota Navin Gupta
Abstract:
This study introduces a novel framework for applying the Artifact Subspace Reconstruction (ASR) algorithm to single-channel Electroencephalogram (EEG) data. ASR is renowned for its automated capability to effectively eliminate various artifacts, such as eye-blinks and eye movements, from EEG signals. Importantly, it has been implemented on Android smartphones, but that implementation relied on multiple channels for principal component subspace calculations. To overcome this limitation, we incorporate the established dynamical embedding approach into the algorithm, naming it Embedded-ASR (E-ASR). In our proposed method, an embedded matrix is first constructed from single-channel EEG data using a series of delay vectors. ASR is then applied to this embedded matrix, and the resulting cleaned single-channel EEG is reconstructed by removing the time lag and concatenating the rows of the embedded matrix. Data was collected from four subjects in resting states with eyes open from pre-frontal (Fp1 and Fp2) electrodes using the CameraEEG app. To assess the effectiveness of the E-ASR algorithm on an EEG dataset with artifacts, we employed performance metrics such as relative root mean square error (RRMSE), correlation coefficient (CC), and average power ratio, and also estimated the number of eye-blinks with and without the E-ASR approach. E-ASR was able to reduce artifacts from the semi-simulated EEG data, with an RRMSE of 45.45% and a CC of 0.91. For real EEG data, the counted eye-blinks were manually cross-checked with ground truth obtained from CameraEEG video data across all subjects for the individual Fp1 and Fp2 electrodes. In conclusion, our study suggests that the E-ASR framework can remove artifacts from single-channel EEG data. This promising algorithm might have potential for smartphone-based natural-environment EEG applications, where a minimal number of electrodes is a critical factor.
Submitted 29 October, 2024; v1 submitted 28 June, 2024;
originally announced July 2024.
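The dynamical embedding step described in the E-ASR abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`delay_embed`, `reconstruct`, `rrmse`, `cc`), the embedding dimension, and the lag are all hypothetical, and the ASR cleaning step itself is omitted. Reconstruction here averages the overlapping delayed copies of each sample, which is one common way to undo a delay embedding and not necessarily the paper's exact row-concatenation procedure. The RRMSE formula (RMS of the error divided by RMS of the reference) is likewise an assumed, commonly used definition.

```python
import numpy as np

def delay_embed(x, dim, lag=1):
    """Stack `dim` delayed copies of a 1-D signal into a (dim, n_cols) matrix."""
    x = np.asarray(x, dtype=float)
    n_cols = x.size - (dim - 1) * lag
    if n_cols < 1:
        raise ValueError("signal too short for this dim/lag")
    # Row i holds the signal shifted by i * lag samples.
    return np.stack([x[i * lag : i * lag + n_cols] for i in range(dim)])

def reconstruct(emb, lag=1):
    """Undo the embedding by realigning each row and averaging overlaps (assumed choice)."""
    dim, n_cols = emb.shape
    n = n_cols + (dim - 1) * lag
    total = np.zeros(n)
    count = np.zeros(n)
    for i in range(dim):
        total[i * lag : i * lag + n_cols] += emb[i]
        count[i * lag : i * lag + n_cols] += 1
    if np.any(count == 0):
        raise ValueError("lag too large: some samples are not covered by any row")
    return total / count

def rrmse(reference, estimate):
    """Relative RMS error, assumed here as RMS(reference - estimate) / RMS(reference)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    return np.sqrt(np.mean((reference - estimate) ** 2)) / np.sqrt(np.mean(reference ** 2))

def cc(reference, estimate):
    """Pearson correlation coefficient between two 1-D signals."""
    return np.corrcoef(reference, estimate)[0, 1]

# Round trip on a synthetic signal; a multichannel cleaner such as ASR
# would operate on `emb` between these two steps.
signal = np.sin(np.linspace(0.0, 4.0 * np.pi, 200))
emb = delay_embed(signal, dim=8, lag=1)
restored = reconstruct(emb, lag=1)
```

With no cleaning applied between embedding and reconstruction, `restored` matches `signal`, which is a quick sanity check that the two steps are inverses before a cleaning stage is inserted.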