Case Study: Fine-tuning Small Language Models for Accurate and Private CWE Detection in Python Code

Authors: Md. Azizul Hakim Bappy, Hossen A Mustafa, Prottoy Saha, Rajinus Salehat

Abstract: Large Language Models (LLMs) have demonstrated significant capabilities in understanding and analyzing code for security vulnerabilities, such as Common Weakness Enumerations (CWEs). However, their reliance on cloud infrastructure and substantial computational requirements pose challenges for analyzing sensitive or proprietary codebases due to privacy concerns and inference costs. This work explores the potential of Small Language Models (SLMs) as a viable alternative for accurate, on-premise vulnerability detection. We investigated whether a 350-million parameter pre-trained code model (codegen-mono) could be effectively fine-tuned to detect the MITRE Top 25 CWEs specifically within Python code. To facilitate this, we developed a targeted dataset of 500 examples using a semi-supervised approach involving LLM-driven synthetic data generation coupled with meticulous human review. Initial tests confirmed that the base codegen-mono model completely failed to identify CWEs in our samples. However, after applying instruction-following fine-tuning, the specialized SLM achieved remarkable performance on our test set, yielding approximately 99% accuracy, 98.08% precision, 100% recall, and a 99.04% F1-score. These results strongly suggest that fine-tuned SLMs can serve as highly accurate and efficient tools for CWE detection, offering a practical and privacy-preserving solution for integrating advanced security analysis directly into development workflows.

Submitted 23 April, 2025; originally announced April 2025.

Comments: 11 pages, 2 figures, 3 tables. Dataset available at https://huggingface.co/datasets/floxihunter/synthetic_python_cwe. Model available at https://huggingface.co/floxihunter/codegen-mono-CWEdetect. Keywords: Small Language Models (SLMs), Vulnerability Detection, CWE, Fine-tuning, Python Security, Privacy-Preserving Code Analysis

arXiv:2504.03732 [pdf, other]

SAGe: A Lightweight Algorithm-Architecture Co-Design for Mitigating the Data Preparation Bottleneck in Large-Scale Genome Analysis

Authors: Nika Mansouri Ghiasi, Talu Güloglu, Harun Mustafa, Can Firtina, Konstantina Koliogeorgi, Konstantinos Kanellopoulos, Haiyu Mao, Rakesh Nadig, Mohammad Sadrosadati, Jisung Park, Onur Mutlu

Abstract: Given the exponentially growing volumes of genomic data, there are extensive efforts to accelerate genome analysis. We demonstrate a major bottleneck that greatly limits and diminishes the benefits of state-of-the-art genome analysis accelerators: the data preparation bottleneck, where genomic data is stored in compressed form and needs to be decompressed and formatted first before an accelerator can operate on it. To mitigate this bottleneck, we propose SAGe, an algorithm-architecture co-design for highly-compressed storage and high-performance access of large-scale genomic data. SAGe overcomes the challenges of mitigating the data preparation bottleneck while maintaining high compression ratios (comparable to genomic-specific compression algorithms) at low hardware cost. This is enabled by leveraging key features of genomic datasets to co-design (i) a new (de)compression algorithm, (ii) hardware, (iii) storage data layout, and (iv) interface commands to access storage. SAGe stores data in structures that can be rapidly interpreted and decompressed by efficient streaming accesses and lightweight hardware. To achieve high compression ratios using only these lightweight structures, SAGe exploits unique features of genomic data. We show that SAGe can be seamlessly integrated with a broad range of genome analysis hardware accelerators to mitigate their data preparation bottlenecks.
Our results demonstrate that SAGe improves the average end-to-end performance and energy efficiency of two state-of-the-art genome analysis accelerators by 3.0x-32.1x and 18.8x-49.6x, respectively, compared to when the accelerators rely on state-of-the-art decompression tools.

Submitted 21 April, 2025; v1 submitted 31 March, 2025; originally announced April 2025.

arXiv:2406.19113 [pdf, other]

MegIS: High-Performance, Energy-Efficient, and Low-Cost Metagenomic Analysis with In-Storage Processing

Authors: Nika Mansouri Ghiasi, Mohammad Sadrosadati, Harun Mustafa, Arvid Gollwitzer, Can Firtina, Julien Eudine, Haiyu Mao, Joël Lindegger, Meryem Banu Cavlak, Mohammed Alser, Jisung Park, Onur Mutlu

Abstract: Metagenomics has led to significant advances in many fields. Metagenomic analysis commonly involves the key tasks of determining the species present in a sample and their relative abundances. These tasks require searching large metagenomic databases. Metagenomic analysis suffers from significant data movement overhead due to moving large amounts of low-reuse data from the storage system. In-storage processing can be a fundamental solution for reducing this overhead. However, designing an in-storage processing system for metagenomics is challenging because existing approaches to metagenomic analysis cannot be directly implemented in storage effectively due to the hardware limitations of modern SSDs. We propose MegIS, the first in-storage processing system designed to significantly reduce the data movement overhead of the end-to-end metagenomic analysis pipeline. MegIS is enabled by our lightweight design that effectively leverages and orchestrates processing inside and outside the storage system. We address in-storage processing challenges for metagenomics via specialized and efficient 1) task partitioning, 2) data/computation flow coordination, 3) storage technology-aware algorithmic optimizations, 4) data mapping, and 5) lightweight in-storage accelerators. MegIS's design is flexible, capable of supporting different types of metagenomic input datasets, and can be integrated into various metagenomic analysis pipelines.
Our evaluation shows that MegIS outperforms the state-of-the-art performance- and accuracy-optimized software metagenomic tools by 2.7$\times$-37.2$\times$ and 6.9$\times$-100.2$\times$, respectively, while matching the accuracy of the accuracy-optimized tool. MegIS achieves 1.5$\times$-5.1$\times$ speedup compared to the state-of-the-art metagenomic hardware-accelerated (using processing-in-memory) tool, while achieving significantly higher accuracy.

Submitted 27 June, 2024; originally announced June 2024.

Comments: To appear in ISCA 2024. arXiv admin note: substantial text overlap with arXiv:2311.12527

arXiv:2311.12527 [pdf, other]

MetaStore: High-Performance Metagenomic Analysis via In-Storage Computing

Authors: Nika Mansouri Ghiasi, Mohammad Sadrosadati, Harun Mustafa, Arvid Gollwitzer, Can Firtina, Julien Eudine, Haiyu Ma, Joël Lindegger, Meryem Banu Cavlak, Mohammed Alser, Jisung Park, Onur Mutlu

Abstract: Metagenomics has led to significant advancements in many fields. Metagenomic analysis commonly involves the key tasks of determining the species present in a sample and their relative abundances. These tasks require searching large metagenomic databases containing information on different species' genomes. Metagenomic analysis suffers from significant data movement overhead due to moving large amounts of low-reuse data from the storage system to the rest of the system. In-storage processing can be a fundamental solution for reducing data movement overhead. However, designing an in-storage processing system for metagenomics is challenging because none of the existing approaches can be directly implemented in storage effectively due to the hardware limitations of modern SSDs. We propose MetaStore, the first in-storage processing system designed to significantly reduce the data movement overhead of end-to-end metagenomic analysis. MetaStore is enabled by our lightweight and cooperative design that effectively leverages and orchestrates processing inside and outside the storage system. Through our detailed analysis of the end-to-end metagenomic analysis pipeline and careful hardware/software co-design, we address in-storage processing challenges for metagenomics via specialized and efficient 1) task partitioning, 2) data/computation flow coordination, 3) storage technology-aware algorithmic optimizations, 4) light-weight in-storage accelerators, and 5) data mapping.
Our evaluation shows that MetaStore outperforms the state-of-the-art performance- and accuracy-optimized software metagenomic tools by 2.7-37.2$\times$ and 6.9-100.2$\times$, respectively, while matching the accuracy of the accuracy-optimized tool. MetaStore achieves 1.5-5.1$\times$ speedup compared to the state-of-the-art metagenomic hardware-accelerated tool, while achieving significantly higher accuracy.

Submitted 21 November, 2023; originally announced November 2023.

arXiv:2211.07854 [pdf, other]

Variational Quantum Algorithms for Chemical Simulation and Drug Discovery

Authors: Hasan Mustafa, Sai Nandan Morapakula, Prateek Jain, Srinjoy Ganguly

Abstract: Quantum computing has gained a lot of attention recently, and scientists have seen potential applications of quantum computing ranging from cryptography and communication to machine learning and healthcare. Protein folding has been one of the most interesting areas to study, and it is also one of the biggest problems of biochemistry. Each protein folds distinctively, and the difficulty of finding its stable shape rapidly increases with an increase in the number of amino acids in the chain. A moderate protein has about 100 amino acids, and the number of combinations one needs to verify to find the stable structure is enormous. At some point, the number of these combinations will be so vast that classical computers cannot even attempt to solve them. In this paper, we examine how this problem can be solved with the help of quantum computing using two different algorithms, Variational Quantum Eigensolver (VQE) and Quantum Approximate Optimization Algorithm (QAOA), using Qiskit Nature. We compare the results of different quantum hardware and simulators and check how error mitigation affects the performance. Further, we make comparisons with SoTA algorithms and evaluate the reliability of the method.

Submitted 14 November, 2022; originally announced November 2022.

arXiv:2209.01147 [pdf, other]

Algorithms for Discrepancy, Matchings, and Approximations: Fast, Simple, and Practical

Authors: Mónika Csikós, Nabil H. Mustafa

Abstract: We study one of the key tools in data approximation and optimization: low-discrepancy colorings. Formally, given a finite set system $(X,\mathcal S)$, the \emph{discrepancy} of a two-coloring $χ:X\to\{-1,1\}$ is defined as $\max_{S \in \mathcal S}|{χ(S)}|$, where $χ(S)=\sum\limits_{x \in S}χ(x)$. We propose a randomized algorithm which, for any $d>0$ and $(X,\mathcal S)$ with dual shatter function $π^*(k)=O(k^d)$, returns a coloring with expected discrepancy $O\left({\sqrt{|X|^{1-1/d}\log|\mathcal S|}}\right)$ (this bound is tight) in time $\tilde O\left({|\mathcal S|\cdot|X|^{1/d}+|X|^{2+1/d}}\right)$, improving upon the previous-best time of $O\left(|\mathcal S|\cdot|X|^3\right)$ by at least a factor of $|X|^{2-1/d}$ when $|\mathcal S|\geq|X|$. This setup includes many geometric classes, families of bounded dual VC-dimension, and others. As an immediate consequence, we obtain an improved algorithm to construct $\varepsilon$-approximations of sub-quadratic size. Our method uses primal-dual reweighing with an improved analysis of randomly updated weights and exploits the structural properties of the set system via matchings with low crossing number -- a fundamental structure in computational geometry. In particular, we get the same $|X|^{2-1/d}$ factor speed-up on the construction time of matchings with crossing number $O\left({|X|^{1-1/d}}\right)$, which is the first improvement since the 1980s.
The proposed algorithms are very simple, which makes it possible, for the first time, to compute colorings with near-optimal discrepancies and near-optimal sized approximations for abstract and geometric set systems in dimensions higher than $2$.

Submitted 2 September, 2022; originally announced September 2022.
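
Illustration (not from the paper): the abstract above defines the discrepancy of a two-coloring as the largest absolute color sum over the sets of the system. The sketch below only evaluates that definition for a given coloring; it is not the paper's randomized algorithm. The function name `discrepancy` and the toy interval set system are assumptions made for this example.

```python
from typing import Dict, Hashable, Iterable


def discrepancy(coloring: Dict[Hashable, int],
                sets: Iterable[Iterable[Hashable]]) -> int:
    """Return max over S of |sum of chi(x) for x in S| for a +/-1 coloring chi."""
    for x, c in coloring.items():
        if c not in (-1, 1):
            raise ValueError(f"color of {x!r} must be -1 or +1, got {c!r}")
    # default=0 covers an empty family of sets
    return max((abs(sum(coloring[x] for x in s)) for s in sets), default=0)


# Hypothetical toy set system: all non-empty intervals over the points 0..5.
points = range(6)
intervals = [list(range(i, j)) for i in range(6) for j in range(i + 1, 7)]

alternating = {x: 1 if x % 2 == 0 else -1 for x in points}
constant = {x: 1 for x in points}

print(discrepancy(alternating, intervals))  # alternating colors keep every interval balanced
print(discrepancy(constant, intervals))     # a constant coloring is as unbalanced as possible
```

On intervals, the alternating coloring keeps every color sum within 1 of zero, while the constant coloring reaches the full set size on the interval containing all points.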

arXiv:2202.10400 [pdf, other]

GenStore: A High-Performance and Energy-Efficient In-Storage Computing System for Genome Sequence Analysis

Authors: Nika Mansouri Ghiasi, Jisung Park, Harun Mustafa, Jeremie Kim, Ataberk Olgun, Arvid Gollwitzer, Damla Senol Cali, Can Firtina, Haiyu Mao, Nour Almadhoun Alserr, Rachata Ausavarungnirun, Nandita Vijaykumar, Mohammed Alser, Onur Mutlu

Abstract: Read mapping is a fundamental, yet computationally-expensive step in many genomics applications. It is used to identify potential matches and differences between fragments (called reads) of a sequenced genome and an already known genome (called a reference genome). To address the computational challenges in genome analysis, many prior works propose various approaches such as filters that select the reads that must undergo expensive computation, efficient heuristics, and hardware acceleration. While effective at reducing the computation overhead, all such approaches still require the costly movement of a large amount of data from storage to the rest of the system, which can significantly lower the end-to-end performance of read mapping in conventional and emerging genomics systems. We propose GenStore, the first in-storage processing system designed for genome sequence analysis that greatly reduces both data movement and computational overheads of genome sequence analysis by exploiting low-cost and accurate in-storage filters. GenStore leverages hardware/software co-design to address the challenges of in-storage processing, supporting reads with 1) different read lengths and error rates, and 2) different degrees of genetic variation. Through rigorous analysis of read mapping processes, we meticulously design low-cost hardware accelerators and data/computation flows inside a NAND flash-based SSD.
Our evaluation using a wide range of real genomic datasets shows that GenStore, when implemented in three modern SSDs, significantly improves the read mapping performance of state-of-the-art software (hardware) baselines by 2.07-6.05$\times$ (1.52-3.32$\times$) for read sets with high similarity to the reference genome and 1.45-33.63$\times$ (2.70-19.2$\times$) for read sets with low similarity to the reference genome.

Submitted 6 April, 2023; v1 submitted 21 February, 2022; originally announced February 2022.

Comments: Published at ASPLOS 2022

arXiv:2008.08970 [pdf, ps, other]

Optimal Approximations Made Easy

Authors: Mónika Csikós, Nabil H. Mustafa

Abstract: The fundamental result of Li, Long, and Srinivasan on approximations of set systems has become a key tool across several communities such as learning theory, algorithms, computational geometry, combinatorics and data analysis. The goal of this paper is to give a modular, self-contained, intuitive proof of this result for finite set systems. The only ingredient we assume is the standard Chernoff's concentration bound. This makes the proof accessible to a wider audience, readers not familiar with techniques from statistical learning theory, and makes it possible to be covered in a single self-contained lecture in a geometry, algorithms or combinatorics course.

Submitted 1 September, 2022; v1 submitted 20 August, 2020; originally announced August 2020.

Journal ref: Published in Information Processing Letters, Volume 176, June 2022, 106250

arXiv:1911.04200 [pdf, other]

Communication-Efficient Jaccard Similarity for High-Performance Distributed Genome Comparisons

Authors: Maciej Besta, Raghavendra Kanakagiri, Harun Mustafa, Mikhail Karasikov, Gunnar Rätsch, Torsten Hoefler, Edgar Solomonik

Abstract: The Jaccard similarity index is an important measure of the overlap of two sets, widely used in machine learning, computational genomics, information retrieval, and many other areas. We design and implement SimilarityAtScale, the first communication-efficient distributed algorithm for computing the Jaccard similarity among pairs of large datasets. Our algorithm provides an efficient encoding of this problem into a multiplication of sparse matrices. Both the encoding and sparse matrix product are performed in a way that minimizes data movement in terms of communication and synchronization costs. We apply our algorithm to obtain similarity among all pairs of a set of large samples of genomes. This task is a key part of modern metagenomics analysis and an ever-growing need due to the increasing availability of high-throughput DNA sequencing data. The resulting scheme is the first to enable accurate Jaccard distance derivations for massive datasets, using large-scale distributed-memory systems. We package our routines in a tool, called GenomeAtScale, that combines the proposed algorithm with tools for processing input sequences. Our evaluation on real data illustrates that one can use GenomeAtScale to effectively employ tens of thousands of processors to reach new frontiers in large-scale genomic and metagenomic analysis. While GenomeAtScale can be used to foster DNA research, the more general underlying SimilarityAtScale algorithm may be used for high-performance distributed similarity computations in other data analytics application domains.

Submitted 11 November, 2020; v1 submitted 11 November, 2019; originally announced November 2019.

Journal ref: Proceedings of the 34th IEEE International Parallel and Distributed Processing Symposium (IPDPS'20), 2020
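
Illustration (not from the paper): the abstract above describes encoding all-pairs Jaccard similarity as a multiplication of sparse matrices. A minimal single-machine sketch of that idea follows, assuming a 0/1 indicator matrix A with one row per item and one column per sample, so that the entries of the product of A-transpose with A are the pairwise intersection sizes. The function name `jaccard_all_pairs`, the row-wise dictionary storage, and the toy k-mer sets are assumptions for this example; nothing here reflects the distributed implementation in SimilarityAtScale.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Hashable, Iterable, Tuple


def jaccard_all_pairs(samples: Dict[str, Iterable[Hashable]]) -> Dict[Tuple[str, str], float]:
    """Jaccard similarity for every unordered pair of samples."""
    sets = {name: set(items) for name, items in samples.items()}

    # Sparse indicator matrix A, stored by row: item -> samples that contain it.
    rows = defaultdict(list)
    for name, items in sets.items():
        for item in items:
            rows[item].append(name)

    # Off-diagonal entries of A^T A: each row adds 1 to every pair of samples
    # that share the corresponding item (the intersection sizes).
    intersection = defaultdict(int)
    for names in rows.values():
        for a, b in combinations(sorted(names), 2):
            intersection[(a, b)] += 1

    result = {}
    for a, b in combinations(sorted(sets), 2):
        inter = intersection.get((a, b), 0)
        union = len(sets[a]) + len(sets[b]) - inter
        # Convention chosen for this sketch: two empty sets get similarity 1.0.
        result[(a, b)] = inter / union if union else 1.0
    return result


# Hypothetical toy input: each sample is a small set of 3-mers.
toy_samples = {
    "s1": {"ACG", "CGT", "GTA"},
    "s2": {"CGT", "GTA", "TAC"},
    "s3": {"TTT"},
}
similarities = jaccard_all_pairs(toy_samples)
print(similarities)
```

Only pairs of samples that share at least one item contribute work in the accumulation loop, which is the reason a sparse formulation pays off when most samples overlap little.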

arXiv:1909.13146 [pdf, other]

META$^\mathbf{2}$: Memory-efficient taxonomic classification and abundance estimation for metagenomics with deep learning

Authors: Andreas Georgiou, Vincent Fortuin, Harun Mustafa, Gunnar Rätsch

Abstract: Metagenomic studies have increasingly utilized sequencing technologies in order to analyze DNA fragments found in environmental samples. One important step in this analysis is the taxonomic classification of the DNA fragments. Conventional read classification methods require large databases and vast amounts of memory to run, with recent deep learning methods suffering from very large model sizes. We therefore aim to develop a more memory-efficient technique for taxonomic classification. A task of particular interest is abundance estimation in metagenomic samples. Current attempts rely on classifying single DNA reads independently from each other and are therefore agnostic to co-occurrence patterns between taxa. In this work, we also attempt to take these patterns into account. We develop a novel memory-efficient read classification technique, combining deep learning and locality-sensitive hashing. We show that this approach outperforms conventional mapping-based and other deep learning methods for single-read taxonomic classification when restricting all methods to a fixed memory footprint. Moreover, we formulate the task of abundance estimation as a Multiple Instance Learning (MIL) problem and we extend current deep learning architectures with two different types of permutation-invariant MIL pooling layers: a) deepsets and b) attention-based pooling.
We illustrate that our architectures can exploit the co-occurrence of species in metagenomic read sets and outperform the single-read architectures in predicting the distribution over taxa at higher taxonomic ranks.

Submitted 10 February, 2020; v1 submitted 28 September, 2019; originally announced September 2019.

arXiv:1903.12011 [pdf]

doi 10.14738/tmlai.71.5712

Novel Artificial Human Optimization Field Algorithms - The Beginning

Authors: Satish Gajawada, Hassan Mustafa

Abstract: New Artificial Human Optimization (AHO) Field Algorithms can be created from scratch or by adding the concept of Artificial Humans into other existing Optimization Algorithms. Particle Swarm Optimization (PSO) has been very popular for solving complex optimization problems due to its simplicity. In this work, new Artificial Human Optimization Field Algorithms are created by modifying existing PSO algorithms with AHO Field Concepts. These Hybrid PSO Algorithms come under the PSO Field as well as the AHO Field. There are Hybrid PSO research articles based on Human Behavior, Human Cognition, Human Thinking, etc. But there are no Hybrid PSO articles which are based on concepts like Human Disease, Human Kindness and Human Relaxation. This paper proposes new AHO Field algorithms based on these research gaps. Some existing Hybrid PSO algorithms are given a new name in this work so that it will be easy for future AHO researchers to find these novel Artificial Human Optimization Field Algorithms. A total of 6 Artificial Human Optimization Field algorithms titled "Human Safety Particle Swarm Optimization (HuSaPSO)", "Human Kindness Particle Swarm Optimization (HKPSO)", "Human Relaxation Particle Swarm Optimization (HRPSO)", "Multiple Strategy Human Particle Swarm Optimization (MSHPSO)", "Human Thinking Particle Swarm Optimization (HTPSO)" and "Human Disease Particle Swarm Optimization (HDPSO)" are tested by applying these novel algorithms on Ackley, Beale, Bohachevsky, Booth and Three-Hump Camel Benchmark Functions. Results obtained are compared with the PSO algorithm.

Submitted 26 March, 2019; originally announced March 2019.

Comments: 25 pages, 41 figures

Journal ref: Transactions on Machine Learning and Artificial Intelligence (TMLAI), Volume 7, Issue 1, February 2019
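
Illustration (not from the paper): the abstract above compares its hybrid variants against the PSO algorithm on benchmark functions such as Ackley. The sketch below is a plain global-best PSO on the two-dimensional Ackley function, the kind of baseline such a comparison would start from. It does not implement any of the six AHO variants, and the parameter values (inertia `w`, acceleration coefficients `c1` and `c2`, swarm size, iteration count, bounds, seed) are assumptions chosen for this example.

```python
import math
import random


def ackley(x: float, y: float) -> float:
    """Two-dimensional Ackley benchmark; global minimum 0 at the origin."""
    return (-20.0 * math.exp(-0.2 * math.sqrt(0.5 * (x * x + y * y)))
            - math.exp(0.5 * (math.cos(2 * math.pi * x) + math.cos(2 * math.pi * y)))
            + math.e + 20.0)


def pso(f, bounds=(-5.0, 5.0), n_particles=30, n_iters=200,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO in 2D. Returns (best position, best value per iteration)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi), rng.uniform(lo, hi)] for _ in range(n_particles)]
    vel = [[0.0, 0.0] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(*p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    history = [gbest_val]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(2):
                r1, r2 = rng.random(), rng.random()
                # inertia + pull toward personal best + pull toward global best
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # clamp the new position to the search box
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = f(*pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
        history.append(gbest_val)
    return gbest, history


best_position, best_history = pso(ackley)
print(best_position, best_history[-1])
```

The recorded history never increases because the global best is replaced only by strictly better values, which makes it a convenient curve to compare against any modified variant.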

arXiv:1807.07924 [pdf, ps, other]

Optimal Bounds on the VC-dimension

Authors: Monika Csikos, Andrey Kupavskii, Nabil H. Mustafa

Abstract: The VC-dimension of a set system is a way to capture its complexity and has been a key parameter studied extensively in machine learning and geometry communities. In this paper, we resolve two longstanding open problems on bounding the VC-dimension of two fundamental set systems: $k$-fold unions/intersections of half-spaces, and the simplices set system. Among other implications, it settles an open question in machine learning that was first studied in the 1989 foundational paper of Blumer, Ehrenfeucht, Haussler and Warmuth as well as by Eisenstat and Angluin and Johnson.

Submitted 20 July, 2018; originally announced July 2018.

arXiv:1806.08725 [pdf, other]

Theorems of Carathéodory, Helly, and Tverberg without dimension

Authors: Karim Adiprasito, Imre Bárány, Nabil H. Mustafa, Tamás Terpai

Abstract: We prove a no-dimensional version of Carathéodory's theorem: given an $n$-element set $P\subset \Re^d$, a point $a \in \conv P$, and an integer $r\le d$, $r \le n$, there is a subset $Q\subset P$ of $r$ elements such that the distance between $a$ and $\conv Q$ is less than $\diam P/\sqrt {2r}$. A general no-dimension Helly type result is also proved with colourful and fractional consequences. Similar versions of Tverberg's theorem and some of their extensions are also established.

Submitted 28 August, 2019; v1 submitted 22 June, 2018; originally announced June 2018.

Comments: 23 pages, 1 figure

arXiv:1711.01198 [pdf]

Design and Analysis of a Secure Three Factor User Authentication Scheme Using Biometric and Smart Card

Authors: Hossen Asiful Mustafa, Hasan Muhammad Kafi

Abstract: Password security can no longer provide enough security in the area of remote user authentication. Considering this security drawback, researchers are trying to find solutions using multifactor remote user authentication systems. Recently, three factor remote user authentication using biometric and smart card has drawn considerable attention from researchers. However, most of the currently proposed schemes have security flaws. They are vulnerable to attacks like user impersonation attack, server masquerading attack, password guessing attack, insider attack, denial of service attack, forgery attack, etc. Also, most of them are unable to provide mutual authentication, session key agreement, and password or smart card recovery. Considering these drawbacks, we propose a secure three factor user authentication scheme using biometric and smart card. Through security analysis, we show that our proposed scheme can overcome drawbacks of existing systems and ensure high security in remote user authentication.

Submitted 3 November, 2017; originally announced November 2017.

Comments: 12 pages, 6 figures, 2 tables

Journal ref: International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 6, June 2017

arXiv:1708.01590 [pdf, other]

doi 10.1017/S0963548318000287

Bounding the size of an almost-equidistant set in Euclidean space

Authors: Andrey Kupavskii, Nabil H. Mustafa, Konrad J. Swanepoel

Abstract: A set of points in d-dimensional Euclidean space is almost equidistant if among any three points of the set, some two are at distance 1. We show that an almost-equidistant set in $\mathbb{R}^d$ has cardinality $O(d^{4/3})$.

Submitted 4 August, 2017; originally announced August 2017.

Comments: 6 pages

MSC Class: 52C10

Journal ref: Combinator. Probab. Comp. 28 (2019) 280-286
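
Illustration (not from the paper): the definition in the abstract above (among any three points, some two are at distance 1) can be checked by brute force over all triples. The function name `is_almost_equidistant`, the floating-point tolerance `tol`, and the example point sets are assumptions for this sketch.

```python
import math
from itertools import combinations
from typing import Sequence


def is_almost_equidistant(points: Sequence[Sequence[float]], tol: float = 1e-9) -> bool:
    """True if every triple of points contains a pair at distance 1 (within tol)."""
    def unit(p, q):
        return abs(math.dist(p, q) - 1.0) <= tol

    return all(unit(p, q) or unit(p, r) or unit(q, r)
               for p, q, r in combinations(points, 3))


# Hypothetical example: a unit equilateral triangle plus one far-away point.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
with_one_far_point = triangle + [(5.0, 5.0)]
with_two_far_points = with_one_far_point + [(9.0, 9.0)]

print(is_almost_equidistant(with_one_far_point))   # every triple still contains two triangle vertices
print(is_almost_equidistant(with_two_far_points))  # the two far points plus any vertex break the property
```

The check costs time cubic in the number of points, which is fine for small examples like these.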

arXiv:1702.03676 [pdf, ps, other]

Epsilon-approximations and epsilon-nets

Authors: Nabil H. Mustafa, Kasturi R. Varadarajan

Abstract: The use of random samples to approximate properties of geometric configurations has been an influential idea for both combinatorial and algorithmic purposes. This chapter considers two related notions---$ε$-approximations and $ε$-nets---that capture the most important quantitative properties that one would expect from a random sample with respect to an underlying geometric configuration.

Submitted 8 August, 2017; v1 submitted 13 February, 2017; originally announced February 2017.

Comments: Chapter 47 in Handbook on Discrete and Computational Geometry, 3rd edition. 27 pages

arXiv:1606.03668 [pdf, other]

Spatial and Social Paradigms for Interference and Coverage Analysis in Underlay D2D Network

Authors: Hafiz Attaul Mustafa, Muhammad Zeeshan Shakir, Muhammad Ali Imran, Rahim Tafazolli

Abstract: The homogeneous Poisson point process (PPP) is widely used to model spatial distribution of base stations and mobile terminals. The same process can be used to model underlay device-to-device (D2D) network, however, neglecting homophilic relation for D2D pairing presents underestimated system insights. In this paper, we model both spatial and social distributions of interfering D2D nodes as proximity based independently marked homogeneous Poisson point process. The proximity considers physical distance between D2D nodes whereas social relationship is modeled as Zipf based marks. We apply these two paradigms to analyze the effect of interference on coverage probability of distance-proportional power-controlled cellular user. Effectively, we apply two type of functional mappings (physical distance, social marks) to Laplace functional of PPP. The resulting coverage probability has no closed-form expression, however for a subset of social marks, the mark summation converges to digamma and polygamma functions. This subset constitutes the upper and lower bounds on coverage probability. We present numerical evaluation of these bounds on coverage probability by varying number of different parameters. The results show that by imparting simple power control on cellular user, ultra-dense underlay D2D network can be realized without compromising the coverage probability of cellular user.

Submitted 28 April, 2017; v1 submitted 12 June, 2016; originally announced June 2016.

Comments: 10 pages, 10 figures

arXiv:1604.02636 [pdf, ps, other]

doi 10.1109/COMST.2015.2459596

Separation Framework: An Enabler for Cooperative and D2D Communication for Future 5G Networks

Authors: Hafiz Attaul Mustafa, Muhammad Ali Imran, Muhammad Zeeshan Shakir, Ali Imran, Rahim Tafazolli

Abstract: Soaring capacity and coverage demands dictate that future cellular networks need to soon migrate towards ultra-dense networks. However, network densification comes with a host of challenges that include compromised energy efficiency, complex interference management, cumbersome mobility management, burdensome signaling overheads and higher backhaul costs. Interestingly, most of the problems, that beleaguer network densification, stem from legacy networks' one common feature i.e., tight coupling between the control and data planes regardless of their degree of heterogeneity and cell density. Consequently, in wake of 5G, control and data planes separation architecture (SARC) has recently been conceived as a promising paradigm that has potential to address most of aforementioned challenges. In this article, we review various proposals that have been presented in literature so far to enable SARC. More specifically, we analyze how and to what degree various SARC proposals address the four main challenges in network densification namely: energy efficiency, system level capacity maximization, interference management and mobility management. We then focus on two salient features of future cellular networks that have not yet been adapted in legacy networks at wide scale and thus remain a hallmark of 5G, i.e., coordinated multipoint (CoMP), and device-to-device (D2D) communications. After providing necessary background on CoMP and D2D, we analyze how SARC can particularly act as a major enabler for CoMP and D2D in context of 5G.
This article thus serves as both a tutorial as well as an up to date survey on SARC, CoMP and D2D. Most importantly, the article provides an extensive outlook of challenges and opportunities that lie at the crossroads of these three mutually entangled emerging technologies.

Submitted 10 April, 2016; originally announced April 2016.

Comments: 28 pages, 11 figures, IEEE Communications Surveys & Tutorials 2015

arXiv:1603.01698 [pdf, ps, other]

doi 10.1109/LCOMM.2015.2459677

Coverage gain and Device-to-Device user Density: Stochastic Geometry Modeling and Analysis

Authors: Hafiz Attaul Mustafa, Muhammad Zeeshan Shakir, Muhammad Ali Imran, Ali Imran, Rahim Tafazolli

Abstract: Device-to-device (D2D) communication has huge potential for capacity and coverage enhancements for next generation cellular networks. The number of potential nodes for D2D communication is an important parameter that directly impacts the system capacity. In this paper, we derive analytic expression for average coverage probability of cellular user and corresponding number of potential D2D users. In this context, mature framework of stochastic geometry and Poisson point process has been used. The retention probability has been incorporated in Laplace functional to capture reduced path-loss and shortest distance criterion based D2D pairing. The numerical results show a close match between analytic expression and simulation setup.

Submitted 5 March, 2016; originally announced March 2016.

Comments: 4 pages, 5 figures

Journal ref: IEEE Comml, Volume:19, Issue:10, pp. 1742-1745, 2015

arXiv:1603.01694 [pdf, ps, other]

Intracell Interference Characterization and Cluster Inference for D2D Communication

Authors: Hafiz Attaul Mustafa, Muhammad Zeeshan Shakir, Ali Riza Ekti, Muhammad Ali Imran, Rahim Tafazolli

Abstract: The homogeneous poisson point process (PPP) is widely used to model temporal, spatial or both topologies of base stations (BSs) and mobile terminals (MTs). However, negative spatial correlation in BSs, due to strategical deployments, and positive spatial correlations in MTs, due to homophilic relations, cannot be captured by homogeneous spatial PPP (SPPP). In this paper, we assume doubly stochastic poisson process, a generalization of homogeneous PPP, with intensity measure as another stochastic process. To this end, we assume Permanental Cox Process (PCP) to capture positive spatial correlation in MTs. We consider product density to derive closed-form approximation (CFA) of spatial summary statistics. We propose Euler Characteristic (EC) based novel approach to approximate intractable random intensity measure and subsequently derive nearest neighbor distribution function. We further propose the threshold and spatial extent of excursion set of chi-square random field as interference control parameters to select different cluster sizes for device-to-device (D2D) communication. The spatial extent of clusters is controlled by nearest neighbor distribution function which is incorporated into Laplace functional of SPPP to analyze the effect of D2D interfering clusters on average coverage probability of cellular user. The CFA and empirical results are in good agreement and its comparison with SPPP clearly shows spatial correlation between D2D nodes.

Submitted 5 March, 2016; originally announced March 2016.

Comments: 11 pages, 14 figures

arXiv:1509.04020 [pdf, ps, other]

A Note on the Size-Sensitive Packing Lemma

Authors: Nabil H. Mustafa

Abstract: We show that the size-sensitive packing lemma follows from a simple modification of the standard proof, due to Haussler and simplified by Chazelle, of the packing lemma.

Submitted 15 September, 2015; v1 submitted 14 September, 2015; originally announced September 2015.

Comments: Modified title of the paper. 2 pages

arXiv:1501.03246 [pdf, other]

Tighter Estimates for epsilon-nets for Disks

Authors: Norbert Bus, Shashwat Garg, Nabil H. Mustafa, Saurabh Ray

Abstract: The geometric hitting set problem is one of the basic geometric combinatorial optimization problems: given a set $P$ of points, and a set $\mathcal{D}$ of geometric objects in the plane, the goal is to compute a small-sized subset of $P$ that hits all objects in $\mathcal{D}$. In 1994, Bronniman and Goodrich made an important connection of this problem to the size of fundamental combinatorial structures called $ε$-nets, showing that small-sized $ε$-nets imply approximation algorithms with correspondingly small approximation ratios. Very recently, Agarwal and Pan showed that their scheme can be implemented in near-linear time for disks in the plane. Altogether this gives $O(1)$-factor approximation algorithms in $\tilde{O}(n)$ time for hitting sets for disks in the plane. This constant factor depends on the sizes of $ε$-nets for disks; unfortunately, the current state-of-the-art bounds are large -- at least $24/ε$ and most likely larger than $40/ε$. Thus the approximation factor of the Agarwal and Pan algorithm ends up being more than $40$. The best lower-bound is $2/ε$, which follows from the Pach-Woeginger construction for halfspaces in two dimensions. Thus there is a large gap between the best-known upper and lower bounds. Besides being of independent interest, finding precise bounds is important since this immediately implies an improved linear-time algorithm for the hitting-set problem. The main goal of this paper is to improve the upper-bound to $13.4/ε$ for disks in the plane.
The proof is constructive, giving a simple algorithm that uses only Delaunay triangulations. We have implemented the algorithm, which is available as a public open-source module. Experimental results show that the sizes of $ε$-nets for a variety of data-sets is lower, around $9/ε$.

Submitted 13 January, 2015; originally announced January 2015.

arXiv:1403.0835 [pdf, other]

QPTAS for Geometric Set-Cover Problems via Optimal Separators

Authors: Nabil H. Mustafa, Rajiv Raman, Saurabh Ray

Abstract: Weighted geometric set-cover problems arise naturally in several geometric and non-geometric settings (e.g. the breakthrough of Bansal-Pruhs (FOCS 2010) reduces a wide class of machine scheduling problems to weighted geometric set-cover). More than two decades of research has succeeded in settling the $(1+ε)$-approximability status for most geometric set-cover problems, except for four basic scenarios which are still lacking. One is that of weighted disks in the plane for which, after a series of papers, Varadarajan (STOC 2010) presented a clever quasi-sampling technique, which together with improvements by Chan et al. (SODA 2012), yielded a $O(1)$-approximation algorithm. Even for the unweighted case, a PTAS for a fundamental class of objects called pseudodisks (which includes disks, unit-height rectangles, translates of convex sets etc.) is currently unknown. Another fundamental case is weighted halfspaces in $\Re^3$, for which a PTAS is currently lacking. In this paper, we present a QPTAS for all of these remaining problems. Our results are based on the separator framework of Adamaszek-Wiese (FOCS 2013, SODA 2014), who recently obtained a QPTAS for weighted independent set of polygonal regions. This rules out the possibility that these problems are APX-hard, assuming $\textbf{NP} \nsubseteq \textbf{DTIME}(2^{polylog(n)})$. Together with the recent work of Chan-Grant (CGTA 2014), this settles the APX-hardness status for all natural geometric set-cover problems.

Submitted 5 April, 2014; v1 submitted 4 March, 2014; originally announced March 2014.

Comments: 26 pages. Revised to include an additional set-cover QPTAS for halfspaces

arXiv:1002.4831 [pdf]

On Analysis and Evaluation of Multi-Sensory Cognitive Learning of a Mathematical Topic Using Artificial Neural Networks

Authors: F. A. Al-Zahrani, H. M. Mustafa, A. Al-Hamadi

Abstract: This piece of research belongs to the field of educational assessment issue based upon the cognitive multimedia theory. Considering that theory; visual and auditory material should be presented simultaneously to reinforce the retention of a mathematical learned topic, a carefully computer-assisted learning (CAL) module is designed for development of a multimedia tutorial for our suggested mathematical topic. The designed CAL module is a multimedia tutorial computer package with visual and/or auditory material. So, via suggested computer package, Multi-Sensory associative memories and classical conditioning theories are practically applicable at an educational field (a children classroom). It is noticed that comparative practical results obtained are interesting for field application of CAL package with and without associated teacher's voice. Finally, the presented study highly recommends application of a novel teaching trend aiming to improve quality of children mathematical learning performance.

Submitted 25 February, 2010; originally announced February 2010.

Comments: Journal of Telecommunications,Volume 1, Issue 1, pp99-104, February 2010

Showing 1–24 of 24 results for author: Mustafa, H