-
Bornil: An open-source sign language data crowdsourcing platform for AI enabled dialect-agnostic communication
Authors:
Shahriar Elahi Dhruvo,
Mohammad Akhlaqur Rahman,
Manash Kumar Mandal,
Md. Istiak Hossain Shihab,
A. A. Noman Ansary,
Kaneez Fatema Shithi,
Sanjida Khanom,
Rabeya Akter,
Safaeid Hossain Arib,
M. N. Ansary,
Sazia Mehnaz,
Rezwana Sultana,
Sejuti Rahman,
Sayma Sultana Chowdhury,
Sabbir Ahmed Chowdhury,
Farig Sadeque,
Asif Sushmit
Abstract:
The absence of annotated sign language datasets has hindered the development of sign language recognition and translation technologies. In this paper, we introduce Bornil, a crowdsource-friendly, multilingual sign language data collection, annotation, and validation platform. Bornil allows users to record sign language gestures and lets annotators perform sentence and gloss-level annotation. It also lets validators verify the quality of both the recorded videos and the annotations through manual validation, in order to develop high-quality datasets for deep learning-based Automatic Sign Language Recognition. To demonstrate the system's efficacy, we collected the largest sign language dataset for the Bangladeshi Sign Language dialect, performed deep learning-based Sign Language Recognition modeling, and reported the benchmark performance. The Bornil platform, BornilDB v1.0 Dataset, and the codebases are available at https://bornil.bengali.ai
Submitted 29 August, 2023;
originally announced August 2023.
-
bbOCR: An Open-source Multi-domain OCR Pipeline for Bengali Documents
Authors:
Imam Mohammad Zulkarnain,
Shayekh Bin Islam,
Md. Zami Al Zunaed Farabe,
Md. Mehedi Hasan Shawon,
Jawaril Munshad Abedin,
Beig Rajibul Hasan,
Marsia Haque,
Istiak Shihab,
Syed Mobassir,
MD. Nazmuddoha Ansary,
Asif Sushmit,
Farig Sadeque
Abstract:
Despite the existence of numerous Optical Character Recognition (OCR) tools, the lack of comprehensive open-source systems hampers the progress of document digitization in various low-resource languages, including Bengali. Low-resource languages, especially those with an alphasyllabary writing system, suffer from the lack of large-scale datasets for various document OCR components such as word-level OCR, document layout extraction, and distortion correction, which are available as individual modules in high-resource languages. In this paper, we introduce Bengali.AI-BRACU-OCR (bbOCR): an open-source, scalable document OCR system that reconstructs Bengali documents into a structured, searchable digitized format, leveraging a novel Bengali text recognition model and two novel synthetic datasets. We present extensive component-level and system-level evaluation; both use a novel diversified evaluation dataset and comprehensive evaluation metrics. Our extensive evaluation suggests that our proposed solution is preferable to the current state-of-the-art Bengali OCR systems. The source codes and datasets are available here: https://bengaliai.github.io/bbocr.
Submitted 21 August, 2023; v1 submitted 21 August, 2023;
originally announced August 2023.
-
Unicode Normalization and Grapheme Parsing of Indic Languages
Authors:
Nazmuddoha Ansary,
Quazi Adibur Rahman Adib,
Tahsin Reasat,
Asif Shahriyar Sushmit,
Ahmed Imtiaz Humayun,
Sazia Mehnaz,
Kanij Fatema,
Mohammad Mamun Or Rashid,
Farig Sadeque
Abstract:
Writing systems of Indic languages have orthographic syllables, also known as complex graphemes, as unique horizontal units. A prominent feature of these languages is that these complex grapheme units comprise consonants/consonant conjuncts, vowel diacritics, and consonant diacritics, which together form a unique language. Unicode-based writing schemes of these languages often disregard this feature and encode words as linear sequences of Unicode characters using an intricate scheme of connector characters and font interpreters. Because a few dozen Unicode glyphs are used to write thousands of different unique glyphs (complex graphemes), there are serious ambiguities that lead to malformed words. In this paper, we propose two libraries: i) a normalizer for normalizing inconsistencies caused by a Unicode-based encoding scheme for Indic languages and ii) a grapheme parser for Abugida text. The parser deconstructs words into visually distinct orthographic syllables or complex graphemes and their constituents. Our proposed normalizer is a more efficient and effective tool than the previously used IndicNLP normalizer. Moreover, our parser and normalizer are also suitable tools for general Abugida text processing, as they performed well in our robust word-based and NLP experiments. We report the pipeline for the scripts of 7 languages in this work and develop the framework for the integration of more scripts.
Submitted 27 May, 2024; v1 submitted 11 May, 2023;
originally announced June 2023.
-
OOD-Speech: A Large Bengali Speech Recognition Dataset for Out-of-Distribution Benchmarking
Authors:
Fazle Rabbi Rakib,
Souhardya Saha Dip,
Samiul Alam,
Nazia Tasnim,
Md. Istiak Hossain Shihab,
Md. Nazmuddoha Ansary,
Syed Mobassir Hossen,
Marsia Haque Meghla,
Mamunur Mamun,
Farig Sadeque,
Sayma Sultana Chowdhury,
Tahsin Reasat,
Asif Sushmit,
Ahmed Imtiaz Humayun
Abstract:
We present OOD-Speech, the first out-of-distribution (OOD) benchmarking dataset for Bengali automatic speech recognition (ASR). As one of the most spoken languages globally, Bengali exhibits large diversity in dialects and prosodic features, which requires ASR frameworks to be robust to distribution shifts. For example, Islamic religious sermons in Bengali are delivered with a tonality that is significantly different from regular speech. Our training dataset was collected via massive online crowdsourcing campaigns, which resulted in 1177.94 hours collected and curated from 22,645 native Bengali speakers from South Asia. Our test dataset comprises 23.03 hours of speech collected and manually annotated from 17 different sources, e.g., Bengali TV drama, Audiobook, Talk show, Online class, and Islamic sermons, to name a few. OOD-Speech is jointly the largest publicly available speech dataset, as well as the first out-of-distribution ASR benchmarking dataset, for Bengali.
Submitted 15 May, 2023;
originally announced May 2023.
-
BaDLAD: A Large Multi-Domain Bengali Document Layout Analysis Dataset
Authors:
Md. Istiak Hossain Shihab,
Md. Rakibul Hasan,
Mahfuzur Rahman Emon,
Syed Mobassir Hossen,
Md. Nazmuddoha Ansary,
Intesur Ahmed,
Fazle Rabbi Rakib,
Shahriar Elahi Dhruvo,
Souhardya Saha Dip,
Akib Hasan Pavel,
Marsia Haque Meghla,
Md. Rezwanul Haque,
Sayma Sultana Chowdhury,
Farig Sadeque,
Tahsin Reasat,
Ahmed Imtiaz Humayun,
Asif Shahriyar Sushmit
Abstract:
While strides have been made in deep learning-based Bengali Optical Character Recognition (OCR) in the past decade, the absence of large Document Layout Analysis (DLA) datasets has hindered the application of OCR in document transcription, e.g., transcribing historical documents and newspapers. Moreover, rule-based DLA systems that are currently being employed in practice are not robust to domain variations and out-of-distribution layouts. To this end, we present the first multidomain large Bengali Document Layout Analysis Dataset: BaDLAD. This dataset contains 33,695 human-annotated document samples from six domains: i) books and magazines, ii) public domain govt. documents, iii) liberation war documents, iv) newspapers, v) historical newspapers, and vi) property deeds, with 710K polygon annotations for four unit types: text-box, paragraph, image, and table. Through preliminary experiments benchmarking the performance of existing state-of-the-art deep learning architectures for English DLA, we demonstrate the efficacy of our dataset in training deep learning-based Bengali document digitization models.
Submitted 5 May, 2023; v1 submitted 9 March, 2023;
originally announced March 2023.
-
Bengali Common Voice Speech Dataset for Automatic Speech Recognition
Authors:
Samiul Alam,
Asif Sushmit,
Zaowad Abdullah,
Shahrin Nakkhatra,
MD. Nazmuddoha Ansary,
Syed Mobassir Hossen,
Sazia Morshed Mehnaz,
Tahsin Reasat,
Ahmed Imtiaz Humayun
Abstract:
Bengali is one of the most spoken languages in the world, with over 300 million speakers globally. Despite its popularity, research into the development of Bengali speech recognition systems is hindered by the lack of diverse open-source datasets. As a way forward, we have crowdsourced the Bengali Common Voice Speech Dataset, which is a sentence-level automatic speech recognition corpus. Collected on the Mozilla Common Voice platform, the dataset is part of an ongoing campaign that has led to the collection of over 400 hours of data in 2 months and is growing rapidly. Our analysis shows that this dataset has more speaker, phoneme, and environmental diversity than the OpenSLR Bengali ASR dataset, the largest existing open-source Bengali speech dataset. We present insights obtained from the dataset and discuss key linguistic challenges that need to be addressed in future versions. Additionally, we report the current performance of a few Automatic Speech Recognition (ASR) algorithms and set a benchmark for future research.
Submitted 29 June, 2022; v1 submitted 28 June, 2022;
originally announced June 2022.
-
A Sweet Recipe for Consolidated Vulnerabilities: Attacking a Live Website by Harnessing a Killer Combination of Vulnerabilities
Authors:
Mazharul Islam,
MD. Nazmuddoha Ansary,
Novia Nurain,
Salauddin Parvez Shams,
A. B. M. Alim Al Islam
Abstract:
The recent emergence of new vulnerabilities is an epoch-making problem in the complex world of website security. Most websites fail to stay updated against these new vulnerabilities, leaving their weaknesses unnoticed. As a result, when cyber-criminals scan such outdated websites, the scanner reports a set of vulnerabilities. Once found, these vulnerabilities are then exploited to steal data, distribute malicious content, or inject defacement and spam content into the vulnerable websites. Furthermore, a combination of different vulnerabilities can cause more damage than anticipated. Therefore, in this paper, we endeavor to find connections among various vulnerabilities such as cross-site scripting, local file inclusion, remote file inclusion, buffer overflow, CSRF, etc. To do so, we develop a Finite State Machine (FSM) attacking model, which analyzes a set of vulnerabilities to find connections among them. We demonstrate the efficacy of our model by applying it to the set of vulnerabilities found on two live websites.
Submitted 27 June, 2019;
originally announced June 2019.