Search | arXiv e-print repository

Data Balancing Strategies: A Survey of Resampling and Augmentation Methods

Authors: Behnam Yousefimehr, Mehdi Ghatee, Mohammad Amin Seifi, Javad Fazli, Sajed Tavakoli, Zahra Rafei, Shervin Ghaffari, Abolfazl Nikahd, Mahdi Razi Gandomani, Alireza Orouji, Ramtin Mahmoudi Kashani, Sarina Heshmati, Negin Sadat Mousavi

Abstract: Imbalanced data poses a significant obstacle in machine learning, as an unequal distribution of class labels often results in skewed predictions and diminished model accuracy. To mitigate this problem, various resampling strategies have been developed, encompassing both oversampling and undersampling techniques aimed at modifying class proportions. Conventional oversampling approaches like SMOTE enhance the representation of the minority class, whereas undersampling methods focus on trimming down the majority class. Advances in deep learning have facilitated the creation of more complex solutions, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), which are capable of producing high-quality synthetic examples. This paper reviews a broad spectrum of data balancing methods, classifying them into categories including synthetic oversampling, adaptive techniques, generative models, ensemble-based strategies, hybrid approaches, undersampling, and neighbor-based methods. Furthermore, it highlights current developments in resampling techniques and discusses practical implementations and case studies that validate their effectiveness. The paper concludes by offering perspectives on potential directions for future exploration in this domain.

Submitted 17 May, 2025; originally announced May 2025.
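
As a rough illustration of the synthetic oversampling idea mentioned in the abstract above, the sketch below generates new minority-class points by interpolating between a minority sample and one of its nearest minority neighbours, which is the core step commonly attributed to SMOTE. It is a minimal sketch, not the survey's code: the function name `smote_like`, its parameters, and the toy data are all assumptions made for illustration.

```python
import math
import random


def smote_like(minority, k=2, n_new=4, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper for illustration only; not taken from the survey.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        x = minority[i]
        # Find its k nearest minority neighbours (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        z = rng.choice(neighbours)
        # Interpolate: the new point lies on the segment between x and z.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, z)))
    return synthetic


if __name__ == "__main__":
    toy_minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    for point in smote_like(toy_minority, k=2, n_new=3, seed=1):
        print(point)
```

Because every synthetic point is a convex combination of two existing minority samples, it always stays inside the region already spanned by the minority class.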

arXiv:2402.15654

Exploring Failure Cases in Multimodal Reasoning About Physical Dynamics

Authors: Sadaf Ghaffari, Nikhil Krishnaswamy

Abstract: In this paper, we present an exploration of LLMs' abilities to problem solve with physical reasoning in situated environments. We construct a simple simulated environment and demonstrate examples of where, in a zero-shot setting, both text and multimodal LLMs display atomic world knowledge about various objects but fail to compose this knowledge in correct solutions for an object manipulation and placement task. We also use BLIP, a vision-language model trained with more sophisticated cross-modal attention, to identify cases relevant to object physical properties that that model fails to ground. Finally, we present a procedure for discovering the relevant properties of objects in the environment and propose a method to distill this knowledge back into the LLM.

Submitted 23 February, 2024; originally announced February 2024.

Comments: 10 pages, 10 figures, Proceedings of AAAI Spring Symposium: Empowering Machine Learning and Large Language Models with Domain and Commonsense Knowledge (MAKE). AAAI (2024)

arXiv:2308.13549

Combining Automatic Coding and Instructor Input to Generate ENA Visualizations for Asynchronous Online Discussion

Authors: Marcia Moraes, Sadaf Ghaffari, Yanye Luther, James Folkestad

Abstract: Asynchronous online discussions are a common fundamental tool to facilitate social interaction in hybrid and online courses. However, instructors lack the tools to accomplish the overwhelming task of evaluating asynchronous online discussion activities. In this paper we present an approach that uses Latent Dirichlet Analysis (LDA) and the instructor's keywords to automatically extract codes from a relatively small dataset. We use the generated codes to build an Epistemic Network Analysis (ENA) model and compare this model with a previous ENA model built by human coders. The results show that there is no statistical difference between the two models. We present an analysis of these models and discuss the potential use of ENA as a visualization to help instructors evaluating asynchronous online discussions.

Submitted 22 August, 2023; originally announced August 2023.

Comments: 15 pages, 4 figures, 6 Tables, appearing in ICQE 2023 proceedings

arXiv:2305.17387

Learning from Integral Losses in Physics Informed Neural Networks

Authors: Ehsan Saleh, Saba Ghaffari, Timothy Bretl, Luke Olson, Matthew West

Abstract: This work proposes a solution for the problem of training physics-informed networks under partial integro-differential equations. These equations require an infinite or a large number of neural evaluations to construct a single residual for training. As a result, accurate evaluation may be impractical, and we show that naive approximations at replacing these integrals with unbiased estimates lead to biased loss functions and solutions. To overcome this bias, we investigate three types of potential solutions: the deterministic sampling approaches, the double-sampling trick, and the delayed target method. We consider three classes of PDEs for benchmarking; one defining Poisson problems with singular charges and weak solutions of up to 10 dimensions, another involving weak solutions on electro-magnetic fields and a Maxwell equation, and a third one defining a Smoluchowski coagulation problem. Our numerical results confirm the existence of the aforementioned bias in practice and also show that our proposed delayed target approach can lead to accurate solutions with comparable quality to ones estimated with a large sample size integral. Our implementation is open-source and available at https://github.com/ehsansaleh/btspinn.

Submitted 11 June, 2024; v1 submitted 27 May, 2023; originally announced May 2023.

Comments: Accepted in the main track of ICML 2024
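
The bias mentioned in the abstract above can be seen in a toy setting: squaring an unbiased estimate of a residual yields a loss whose expectation is the true squared residual plus a variance term, whereas multiplying two independent estimates (the double-sampling trick) is unbiased. The sketch below computes both expectations exactly by enumerating a two-outcome distribution. It is a minimal illustration under assumed toy values, not the paper's implementation; all names (`values`, `target`, and the three functions) are hypothetical.

```python
from itertools import product

# Toy setup (assumed for illustration): a single-sample estimate g of an
# integral takes each of these values with equal probability.
values = [0.0, 2.0]
target = 1.0  # the value the integral is supposed to match


def true_loss():
    """Squared residual using the exact expectation of g."""
    mean = sum(values) / len(values)
    return (mean - target) ** 2


def naive_expected_loss():
    """Expected loss when one sample is plugged in and the residual squared."""
    return sum((g - target) ** 2 for g in values) / len(values)


def double_sampling_expected_loss():
    """Expected loss when two independent samples form the product of residuals."""
    pairs = list(product(values, repeat=2))
    return sum((g1 - target) * (g2 - target) for g1, g2 in pairs) / len(pairs)


if __name__ == "__main__":
    print("true loss:           ", true_loss())
    print("naive expected loss: ", naive_expected_loss())
    print("double-sampling loss:", double_sampling_expected_loss())
```

In this toy case the gap between the naive and true values equals the variance of the single-sample estimate, which is why the naive loss does not vanish even when the true residual is zero.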

arXiv:2305.13668

Grounding and Distinguishing Conceptual Vocabulary Through Similarity Learning in Embodied Simulations

Authors: Sadaf Ghaffari, Nikhil Krishnaswamy

Abstract: We present a novel method for using agent experiences gathered through an embodied simulation to ground contextualized word vectors to object representations. We use similarity learning to make comparisons between different object types based on their properties when interacted with, and to extract common features pertaining to the objects' behavior. We then use an affine transformation to calculate a projection matrix that transforms contextualized word vectors from different transformer-based language models into this learned space, and evaluate whether new test instances of transformed token vectors identify the correct concept in the object embedding space. Our results expose properties of the embedding spaces of four different transformer models and show that grounding object token vectors is usually more helpful to grounding verb and attribute token vectors than the reverse, which reflects earlier conclusions in the analogical reasoning and psycholinguistic literature.

Submitted 23 May, 2023; originally announced May 2023.

Comments: Accepted at IWCS Conference

arXiv:2305.13650

Robust Model-Based Optimization for Challenging Fitness Landscapes

Authors: Saba Ghaffari, Ehsan Saleh, Alexander G. Schwing, Yu-Xiong Wang, Martin D. Burke, Saurabh Sinha

Abstract: Protein design, a grand challenge of the day, involves optimization on a fitness landscape, and leading methods adopt a model-based approach where a model is trained on a training set (protein sequences and fitness) and proposes candidates to explore next. These methods are challenged by sparsity of high-fitness samples in the training set, a problem that has been in the literature. A less recognized but equally important problem stems from the distribution of training samples in the design space: leading methods are not designed for scenarios where the desired optimum is in a region that is not only poorly represented in training data, but also relatively far from the highly represented low-fitness regions. We show that this problem of "separation" in the design space is a significant bottleneck in existing model-based optimization tools and propose a new approach that uses a novel VAE as its search model to overcome the problem. We demonstrate its advantage over prior methods in robustly finding improved samples, regardless of the imbalance and separation between low- and high-fitness samples. Our comprehensive benchmark on real and semi-synthetic protein datasets as well as solution design for physics-informed neural networks, showcases the generality of our approach in discrete and continuous design spaces. Our implementation is available at https://github.com/sabagh1994/PGVAE.

Submitted 27 June, 2024; v1 submitted 22 May, 2023; originally announced May 2023.

arXiv:2211.04555

Detecting and Accommodating Novel Types and Concepts in an Embodied Simulation Environment

Authors: Sadaf Ghaffari, Nikhil Krishnaswamy

Abstract: In this paper, we present methods for two types of metacognitive tasks in an AI system: rapidly expanding a neural classification model to accommodate a new category of object, and recognizing when a novel object type is observed instead of misclassifying the observation as a known class. Our methods take numerical data drawn from an embodied simulation environment, which describes the motion and properties of objects when interacted with, and we demonstrate that this type of representation is important for the success of novel type detection. We present a suite of experiments in rapidly accommodating the introduction of new categories and concepts and in novel type detection, and an architecture to integrate the two in an interactive system.

Submitted 8 November, 2022; originally announced November 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2204.08107

arXiv:2210.17495

Automated Code Extraction from Discussion Board Text Dataset

Authors: Sina Mahdipour Saravani, Sadaf Ghaffari, Yanye Luther, James Folkestad, Marcia Moraes

Abstract: This study introduces and investigates the capabilities of three different text mining approaches, namely Latent Semantic Analysis, Latent Dirichlet Analysis, and Clustering Word Vectors, for automating code extraction from a relatively small discussion board dataset. We compare the outputs of each algorithm with a previous dataset that was manually coded by two human raters. The results show that even with a relatively small dataset, automated approaches can be an asset to course instructors by extracting some of the discussion codes, which can be used in Epistemic Network Analysis.

Submitted 18 April, 2023; v1 submitted 31 October, 2022; originally announced October 2022.

Comments: LaTeX; typos corrected at page 6

arXiv:2205.15379

Truly Deterministic Policy Optimization

Authors: Ehsan Saleh, Saba Ghaffari, Timothy Bretl, Matthew West

Abstract: In this paper, we present a policy gradient method that avoids exploratory noise injection and performs policy search over the deterministic landscape. By avoiding noise injection all sources of estimation variance can be eliminated in systems with deterministic dynamics (up to the initial state distribution). Since deterministic policy regularization is impossible using traditional non-metric measures such as the KL divergence, we derive a Wasserstein-based quadratic model for our purposes. We state conditions on the system model under which it is possible to establish a monotonic policy improvement guarantee, propose a surrogate function for policy gradient estimation, and show that it is possible to compute exact advantage estimates if both the state transition model and the policy are deterministic. Finally, we describe two novel robotic control environments -- one with non-local rewards in the frequency domain and the other with a long horizon (8000 time-steps) -- for which our policy gradient method (TDPO) significantly outperforms existing methods (PPO, TRPO, DDPG, and TD3). Our implementation with all the experimental settings is available at https://github.com/ehsansaleh/code_tdpo

Submitted 30 May, 2022; originally announced May 2022.

arXiv:2204.08107

Exploiting Embodied Simulation to Detect Novel Object Classes Through Interaction

Authors: Nikhil Krishnaswamy, Sadaf Ghaffari

Abstract: In this paper we present a novel method for a naive agent to detect novel objects it encounters in an interaction. We train a reinforcement learning policy on a stacking task given a known object type, and then observe the results of the agent attempting to stack various other objects based on the same trained policy. By extracting embedding vectors from a convolutional neural net trained over the results of the aforementioned stacking play, we can determine the similarity of a given object to known object types, and determine if the given object is likely dissimilar enough to the known types to be considered a novel class of object. We present the results of this method on two datasets gathered using two different policies and demonstrate what information the agent needs to extract from its environment to make these novelty judgments.

Submitted 17 April, 2022; originally announced April 2022.

arXiv:2110.02529

On the Importance of Firth Bias Reduction in Few-Shot Classification

Authors: Saba Ghaffari, Ehsan Saleh, David Forsyth, Yu-xiong Wang

Abstract: Learning accurate classifiers for novel categories from very few examples, known as few-shot image classification, is a challenging task in statistical machine learning and computer vision. The performance in few-shot classification suffers from the bias in the estimation of classifier parameters; however, an effective underlying bias reduction technique that could alleviate this issue in training few-shot classifiers has been overlooked. In this work, we demonstrate the effectiveness of Firth bias reduction in few-shot classification. Theoretically, Firth bias reduction removes the $O(N^{-1})$ first order term from the small-sample bias of the Maximum Likelihood Estimator. Here we show that the general Firth bias reduction technique simplifies to encouraging uniform class assignment probabilities for multinomial logistic classification, and almost has the same effect in cosine classifiers. We derive an easy-to-implement optimization objective for Firth penalized multinomial logistic and cosine classifiers, which is equivalent to penalizing the cross-entropy loss with a KL-divergence between the uniform label distribution and the predictions. Then, we empirically evaluate that it is consistently effective across the board for few-shot image classification, regardless of (1) the feature representations from different backbones, (2) the number of samples per class, and (3) the number of classes. Finally, we show the robustness of Firth bias reduction, in the case of imbalanced data distribution. Our implementation is available at https://github.com/ehsansaleh/firth_bias_reduction

Submitted 14 April, 2022; v1 submitted 6 October, 2021; originally announced October 2021.
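
The abstract above describes the Firth-penalized objective as a cross-entropy loss plus a KL-divergence between the uniform label distribution and the predictions. A minimal sketch of that form for a single example might look as follows. The penalty weight `lam`, the function names, and the per-example formulation are assumptions for illustration; the authors' repository should be consulted for the actual implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_uniform_to_pred(probs):
    """KL(U || P), where U is the uniform distribution over the classes."""
    c = len(probs)
    return sum((1.0 / c) * math.log((1.0 / c) / p) for p in probs)


def firth_penalized_loss(logits, label, lam=1.0):
    """Cross-entropy plus lam * KL(U || P) for one example.

    lam is a hypothetical penalty coefficient chosen for this sketch.
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[label])
    return cross_entropy + lam * kl_uniform_to_pred(probs)


if __name__ == "__main__":
    confident = [3.0, 0.0, 0.0]
    print("plain cross-entropy:", firth_penalized_loss(confident, 0, lam=0.0))
    print("with KL penalty:    ", firth_penalized_loss(confident, 0, lam=1.0))
```

The KL term is zero when the predicted probabilities are uniform and grows as the prediction becomes more peaked, which matches the abstract's description of encouraging uniform class assignment probabilities.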

arXiv:1707.03004

doi 10.5121/sipij.2017.8301

Foot anthropometry device and single object image thresholding

Authors: Amir Mohammad Esmaieeli Sikaroudi, Sasan Ghaffari, Ali Yousefi, Hassan Sadeghi Naeini

Abstract: This paper introduces a device, an algorithm, and a graphical user interface for obtaining anthropometric measurements of the foot. The presented device facilitates obtaining the image scale and image processing by taking one image of the side of the foot and the underfoot simultaneously. The introduced image processing algorithm minimizes a noise criterion, which is suitable for object detection in single-object images, and outperforms well-known image thresholding methods when lighting conditions are poor. The performance of the image-based method is compared to the manual method. Image-based measurements of the underfoot were on average 4 mm less than the actual measures. The mean absolute error of underfoot length was 1.6 mm, whereas the length obtained from the side of the foot had a 4.4 mm mean absolute error. Furthermore, based on t-test and F-test results, no significant difference between manual and image-based anthropometry was observed. To maintain the performance of the anthropometry process in different situations, the user interface was designed to handle changes in lighting conditions and to adjust the speed of the algorithm.

Submitted 10 July, 2017; originally announced July 2017.

Journal ref: Signal & Image Processing : An International Journal (SIPIJ) Vol.8, No.3, June 2017

Showing 1–12 of 12 results for author: Ghaffari, S