-
X-Cross: Dynamic Integration of Language Models for Cross-Domain Sequential Recommendation
Authors:
Guy Hadad,
Haggai Roitman,
Yotam Eshel,
Bracha Shapira,
Lior Rokach
Abstract:
As new products are emerging daily, recommendation systems are required to quickly adapt to possible new domains without needing extensive retraining. This work presents ``X-Cross'' -- a novel cross-domain sequential-recommendation model that recommends products in new domains by integrating several domain-specific language models; each model is fine-tuned with low-rank adapters (LoRA). Given a re…
▽ More
As new products are emerging daily, recommendation systems are required to quickly adapt to possible new domains without needing extensive retraining. This work presents ``X-Cross'' -- a novel cross-domain sequential-recommendation model that recommends products in new domains by integrating several domain-specific language models; each model is fine-tuned with low-rank adapters (LoRA). Given a recommendation prompt, operating layer by layer, X-Cross dynamically refines the representation of each source language model by integrating knowledge from all other models. These refined representations are propagated from one layer to the next, leveraging the activations from each domain adapter to ensure domain-specific nuances are preserved while enabling adaptability across domains. Using Amazon datasets for sequential recommendation, X-Cross achieves performance comparable to a model that is fine-tuned with LoRA, while using only 25% of the additional parameters. In cross-domain tasks, such as adapting from Toys domain to Tools, Electronics or Sports, X-Cross demonstrates robust performance, while requiring about 50%-75% less fine-tuning data than LoRA to make fine-tuning effective. Furthermore, X-Cross achieves significant improvement in accuracy over alternative cross-domain baselines. Overall, X-Cross enables scalable and adaptive cross-domain recommendations, reducing computational overhead and providing an efficient solution for data-constrained environments.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.
-
Forget What You Know about LLMs Evaluations -- LLMs are Like a Chameleon
Authors:
Nurit Cohen-Inger,
Yehonatan Elisha,
Bracha Shapira,
Lior Rokach,
Seffi Cohen
Abstract:
Large language models (LLMs) often appear to excel on public benchmarks, but these high scores may mask an overreliance on dataset-specific surface cues rather than true language understanding. We introduce the Chameleon Benchmark Overfit Detector (C-BOD), a meta-evaluation framework that systematically distorts benchmark prompts via a parametric transformation and detects overfitting of LLMs. By…
▽ More
Large language models (LLMs) often appear to excel on public benchmarks, but these high scores may mask an overreliance on dataset-specific surface cues rather than true language understanding. We introduce the Chameleon Benchmark Overfit Detector (C-BOD), a meta-evaluation framework that systematically distorts benchmark prompts via a parametric transformation and detects overfitting of LLMs. By rephrasing inputs while preserving their semantic content and labels, C-BOD exposes whether a model's performance is driven by memorized patterns. Evaluated on the MMLU benchmark using 26 leading LLMs, our method reveals an average performance degradation of 2.15% under modest perturbations, with 20 out of 26 models exhibiting statistically significant differences. Notably, models with higher baseline accuracy exhibit larger performance differences under perturbation, and larger LLMs tend to be more sensitive to rephrasings indicating that both cases may overrely on fixed prompt patterns. In contrast, the Llama family and models with lower baseline accuracy show insignificant degradation, suggesting reduced dependency on superficial cues. Moreover, C-BOD's dataset- and model-agnostic design allows easy integration into training pipelines to promote more robust language understanding. Our findings challenge the community to look beyond leaderboard scores and prioritize resilience and generalization in LLM evaluation.
△ Less
Submitted 11 February, 2025;
originally announced February 2025.
-
DFPE: A Diverse Fingerprint Ensemble for Enhancing LLM Performance
Authors:
Seffi Cohen,
Niv Goldshlager,
Nurit Cohen-Inger,
Bracha Shapira,
Lior Rokach
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities across various natural language processing tasks but often struggle to excel uniformly in diverse or complex domains. We propose a novel ensemble method - Diverse Fingerprint Ensemble (DFPE), which leverages the complementary strengths of multiple LLMs to achieve more robust performance. Our approach involves: (1) clustering models ba…
▽ More
Large Language Models (LLMs) have shown remarkable capabilities across various natural language processing tasks but often struggle to excel uniformly in diverse or complex domains. We propose a novel ensemble method - Diverse Fingerprint Ensemble (DFPE), which leverages the complementary strengths of multiple LLMs to achieve more robust performance. Our approach involves: (1) clustering models based on response "fingerprints" patterns, (2) applying a quantile-based filtering mechanism to remove underperforming models at a per-subject level, and (3) assigning adaptive weights to remaining models based on their subject-wise validation accuracy. In experiments on the Massive Multitask Language Understanding (MMLU) benchmark, DFPE outperforms the best single model by 3% overall accuracy and 5% in discipline-level accuracy. This method increases the robustness and generalization of LLMs and underscores how model selection, diversity preservation, and performance-driven weighting can effectively address challenging, multi-faceted language understanding tasks.
△ Less
Submitted 6 February, 2025; v1 submitted 29 January, 2025;
originally announced January 2025.
-
FairTTTS: A Tree Test Time Simulation Method for Fairness-Aware Classification
Authors:
Nurit Cohen-Inger,
Lior Rokach,
Bracha Shapira,
Seffi Cohen
Abstract:
Algorithmic decision-making has become deeply ingrained in many domains, yet biases in machine learning models can still produce discriminatory outcomes, often harming unprivileged groups. Achieving fair classification is inherently challenging, requiring a careful balance between predictive performance and ethical considerations. We present FairTTTS, a novel post-processing bias mitigation method…
▽ More
Algorithmic decision-making has become deeply ingrained in many domains, yet biases in machine learning models can still produce discriminatory outcomes, often harming unprivileged groups. Achieving fair classification is inherently challenging, requiring a careful balance between predictive performance and ethical considerations. We present FairTTTS, a novel post-processing bias mitigation method inspired by the Tree Test Time Simulation (TTTS) method. Originally developed to enhance accuracy and robustness against adversarial inputs through probabilistic decision-path adjustments, TTTS serves as the foundation for FairTTTS. By building on this accuracy-enhancing technique, FairTTTS mitigates bias and improves predictive performance. FairTTTS uses a distance-based heuristic to adjust decisions at protected attribute nodes, ensuring fairness for unprivileged samples. This fairness-oriented adjustment occurs as a post-processing step, allowing FairTTTS to be applied to pre-trained models, diverse datasets, and various fairness metrics without retraining. Extensive evaluation on seven benchmark datasets shows that FairTTTS outperforms traditional methods in fairness improvement, achieving a 20.96% average increase over the baseline compared to 18.78% for related work, and further enhances accuracy by 0.55%. In contrast, competing methods typically reduce accuracy by 0.42%. These results confirm that FairTTTS effectively promotes more equitable decision-making while simultaneously improving predictive performance.
△ Less
Submitted 14 January, 2025;
originally announced January 2025.
-
BiasGuard: Guardrailing Fairness in Machine Learning Production Systems
Authors:
Nurit Cohen-Inger,
Seffi Cohen,
Neomi Rabaev,
Lior Rokach,
Bracha Shapira
Abstract:
As machine learning (ML) systems increasingly impact critical sectors such as hiring, financial risk assessments, and criminal justice, the imperative to ensure fairness has intensified due to potential negative implications. While much ML fairness research has focused on enhancing training data and processes, addressing the outputs of already deployed systems has received less attention. This pap…
▽ More
As machine learning (ML) systems increasingly impact critical sectors such as hiring, financial risk assessments, and criminal justice, the imperative to ensure fairness has intensified due to potential negative implications. While much ML fairness research has focused on enhancing training data and processes, addressing the outputs of already deployed systems has received less attention. This paper introduces 'BiasGuard', a novel approach designed to act as a fairness guardrail in production ML systems. BiasGuard leverages Test-Time Augmentation (TTA) powered by Conditional Generative Adversarial Network (CTGAN), a cutting-edge generative AI model, to synthesize data samples conditioned on inverted protected attribute values, thereby promoting equitable outcomes across diverse groups. This method aims to provide equal opportunities for both privileged and unprivileged groups while significantly enhancing the fairness metrics of deployed systems without the need for retraining. Our comprehensive experimental analysis across diverse datasets reveals that BiasGuard enhances fairness by 31% while only reducing accuracy by 0.09% compared to non-mitigated benchmarks. Additionally, BiasGuard outperforms existing post-processing methods in improving fairness, positioning it as an effective tool to safeguard against biases when retraining the model is impractical.
△ Less
Submitted 7 January, 2025;
originally announced January 2025.
-
MAMMAL -- Molecular Aligned Multi-Modal Architecture and Language
Authors:
Yoel Shoshan,
Moshiko Raboh,
Michal Ozery-Flato,
Vadim Ratner,
Alex Golts,
Jeffrey K. Weber,
Ella Barkan,
Simona Rabinovici-Cohen,
Sagi Polaczek,
Ido Amos,
Ben Shapira,
Liam Hazan,
Matan Ninio,
Sivan Ravid,
Michael M. Danziger,
Yosi Shamay,
Sharon Kurant,
Joseph A. Morrone,
Parthasarathy Suryanarayanan,
Michal Rosen-Zvi,
Efrat Hexter
Abstract:
Large language models applied to vast biological datasets have the potential to transform biology by uncovering disease mechanisms and accelerating drug development. However, current models are often siloed, trained separately on small-molecules, proteins, or transcriptomic data, limiting their ability to capture complex, multi-modal interactions. Effective drug discovery requires computational to…
▽ More
Large language models applied to vast biological datasets have the potential to transform biology by uncovering disease mechanisms and accelerating drug development. However, current models are often siloed, trained separately on small-molecules, proteins, or transcriptomic data, limiting their ability to capture complex, multi-modal interactions. Effective drug discovery requires computational tools that integrate multiple biological entities while supporting prediction and generation, a challenge existing models struggle to address. For this purpose, we present MAMMAL - Molecular Aligned Multi-Modal Architecture and Language - a versatile method applied to create a multi-task foundation model that learns from large-scale biological datasets across diverse modalities, including proteins, small-molecules, and omics. MAMMAL's structured prompt syntax supports classification, regression, and generation tasks while handling token and scalar inputs and outputs. Evaluated on eleven diverse downstream tasks, it reaches a new state of the art (SOTA) in nine tasks and is comparable to SOTA in two tasks, all within a unified architecture, unlike prior task-specific models. Additionally, we explored Alphafold 3 binding prediction capabilities on antibody-antigen and nanobody-antigen complexes showing significantly better classification performance of MAMMAL in 3 out of 4 targets. The model code and pretrained weights are publicly available at https://github.com/BiomedSciAI/biomed-multi-alignment and https://huggingface.co/ibm/biomed.omics.bl.sm.ma-ted-458m
△ Less
Submitted 6 May, 2025; v1 submitted 28 October, 2024;
originally announced October 2024.
-
AMFPMC -- An improved method of detecting multiple types of drug-drug interactions using only known drug-drug interactions
Authors:
Bar Vered,
Guy Shtar,
Lior Rokach,
Bracha Shapira
Abstract:
Adverse drug interactions are largely preventable causes of medical accidents, which frequently result in physician and emergency room encounters. The detection of drug interactions in a lab, prior to a drug's use in medical practice, is essential, however it is costly and time-consuming. Machine learning techniques can provide an efficient and accurate means of predicting possible drug-drug inter…
▽ More
Adverse drug interactions are largely preventable causes of medical accidents, which frequently result in physician and emergency room encounters. The detection of drug interactions in a lab, prior to a drug's use in medical practice, is essential, however it is costly and time-consuming. Machine learning techniques can provide an efficient and accurate means of predicting possible drug-drug interactions and combat the growing problem of adverse drug interactions. Most existing models for predicting interactions rely on the chemical properties of drugs. While such models can be accurate, the required properties are not always available.
△ Less
Submitted 7 February, 2023;
originally announced February 2023.
-
Transfer learning for time series classification using synthetic data generation
Authors:
Yarden Rotem,
Nathaniel Shimoni,
Lior Rokach,
Bracha Shapira
Abstract:
In this paper, we propose an innovative Transfer learning for Time series classification method. Instead of using an existing dataset from the UCR archive as the source dataset, we generated a 15,000,000 synthetic univariate time series dataset that was created using our unique synthetic time series generator algorithm which can generate data with diverse patterns and angles and different sequence…
▽ More
In this paper, we propose an innovative Transfer learning for Time series classification method. Instead of using an existing dataset from the UCR archive as the source dataset, we generated a 15,000,000 synthetic univariate time series dataset that was created using our unique synthetic time series generator algorithm which can generate data with diverse patterns and angles and different sequence lengths. Furthermore, instead of using classification tasks provided by the UCR archive as the source task as previous studies did,we used our own 55 regression tasks as the source tasks, which produced better results than selecting classification tasks from the UCR archive
△ Less
Submitted 16 July, 2022;
originally announced July 2022.
-
Boosting Anomaly Detection Using Unsupervised Diverse Test-Time Augmentation
Authors:
Seffi Cohen,
Niv Goldshlager,
Lior Rokach,
Bracha Shapira
Abstract:
Anomaly detection is a well-known task that involves the identification of abnormal events that occur relatively infrequently. Methods for improving anomaly detection performance have been widely studied. However, no studies utilizing test-time augmentation (TTA) for anomaly detection in tabular data have been performed. TTA involves aggregating the predictions of several synthetic versions of a g…
▽ More
Anomaly detection is a well-known task that involves the identification of abnormal events that occur relatively infrequently. Methods for improving anomaly detection performance have been widely studied. However, no studies utilizing test-time augmentation (TTA) for anomaly detection in tabular data have been performed. TTA involves aggregating the predictions of several synthetic versions of a given test sample; TTA produces different points of view for a specific test instance and might decrease its prediction bias. We propose the Test-Time Augmentation for anomaly Detection (TTAD) technique, a TTA-based method aimed at improving anomaly detection performance. TTAD augments a test instance based on its nearest neighbors; various methods, including the k-Means centroid and SMOTE methods, are used to produce the augmentations. Our technique utilizes a Siamese network to learn an advanced distance metric when retrieving a test instance's neighbors. Our experiments show that the anomaly detector that uses our TTA technique achieved significantly higher AUC results on all datasets evaluated.
△ Less
Submitted 29 October, 2021;
originally announced October 2021.
-
Evolving Context-Aware Recommender Systems With Users in Mind
Authors:
Amit Livne,
Eliad Shem Tov,
Adir Solomon,
Achiya Elyasaf,
Bracha Shapira,
Lior Rokach
Abstract:
A context-aware recommender system (CARS) applies sensing and analysis of user context to provide personalized services. The contextual information can be driven from sensors in order to improve the accuracy of the recommendations. Yet, generating accurate recommendations is not enough to constitute a useful system from the users' perspective, since certain contextual information may cause differe…
▽ More
A context-aware recommender system (CARS) applies sensing and analysis of user context to provide personalized services. The contextual information can be driven from sensors in order to improve the accuracy of the recommendations. Yet, generating accurate recommendations is not enough to constitute a useful system from the users' perspective, since certain contextual information may cause different issues, such as draining the user's battery, privacy issues, and more. Adding high-dimensional contextual information may increase both the dimensionality and sparsity of the model. Previous studies suggest reducing the amount of contextual information by selecting the most suitable contextual information using a domain knowledge. Another solution is compressing it into a denser latent space, thus disrupting the ability to explain the recommendation item to the user, and damaging users' trust. In this paper we present an approach for selecting low-dimensional subsets of the contextual information and incorporating them explicitly within CARS. Specifically, we present a novel feature-selection algorithm, based on genetic algorithms (GA), that outperforms SOTA dimensional-reduction CARS algorithms, improves the accuracy and the explainability of the recommendations, and allows for controlling user aspects, such as privacy and battery consumption. Furthermore, we exploit the top subsets that are generated along the evolutionary process, by learning multiple deep context-aware models and applying a stacking technique on them, thus improving the accuracy while remaining at the explicit space. We evaluated our approach on two high-dimensional context-aware datasets driven from smartphones. An empirical analysis of our results validates that our proposed approach outperforms SOTA CARS models while improving transparency and explainability to the user.
△ Less
Submitted 30 July, 2020;
originally announced July 2020.
-
A framework for optimizing COVID-19 testing policy using a Multi Armed Bandit approach
Authors:
Hagit Grushka-Cohen,
Raphael Cohen,
Bracha Shapira,
Jacob Moran-Gilad,
Lior Rokach
Abstract:
Testing is an important part of tackling the COVID-19 pandemic. Availability of testing is a bottleneck due to constrained resources and effective prioritization of individuals is necessary. Here, we discuss the impact of different prioritization policies on COVID-19 patient discovery and the ability of governments and health organizations to use the results for effective decision making. We sugge…
▽ More
Testing is an important part of tackling the COVID-19 pandemic. Availability of testing is a bottleneck due to constrained resources and effective prioritization of individuals is necessary. Here, we discuss the impact of different prioritization policies on COVID-19 patient discovery and the ability of governments and health organizations to use the results for effective decision making. We suggest a framework for testing that balances the maximal discovery of positive individuals with the need for population-based surveillance aimed at understanding disease spread and characteristics. This framework draws from similar approaches to prioritization in the domain of cyber-security based on ranking individuals using a risk score and then reserving a portion of the capacity for random sampling. This approach is an application of Multi-Armed-Bandits maximizing exploration/exploitation of the underlying distribution. We find that individuals can be ranked for effective testing using a few simple features, and that ranking them using such models we can capture 65% (CI: 64.7%-68.3%) of the positive individuals using less than 20% of the testing capacity or 92.1% (CI: 91.1%-93.2%) of positives individuals using 70% of the capacity, allowing reserving a significant portion of the tests for population studies. Our approach allows experts and decision-makers to tailor the resulting policies as needed allowing transparency into the ranking policy and the ability to understand the disease spread in the population and react quickly and in an informed manner.
△ Less
Submitted 28 July, 2020;
originally announced July 2020.
-
Iterative Boosting Deep Neural Networks for Predicting Click-Through Rate
Authors:
Amit Livne,
Roy Dor,
Eyal Mazuz,
Tamar Didi,
Bracha Shapira,
Lior Rokach
Abstract:
The click-through rate (CTR) reflects the ratio of clicks on a specific item to its total number of views. It has significant impact on websites' advertising revenue. Learning sophisticated models to understand and predict user behavior is essential for maximizing the CTR in recommendation systems. Recent works have suggested new methods that replace the expensive and time-consuming feature engine…
▽ More
The click-through rate (CTR) reflects the ratio of clicks on a specific item to its total number of views. It has significant impact on websites' advertising revenue. Learning sophisticated models to understand and predict user behavior is essential for maximizing the CTR in recommendation systems. Recent works have suggested new methods that replace the expensive and time-consuming feature engineering process with a variety of deep learning (DL) classifiers capable of capturing complicated patterns from raw data; these methods have shown significant improvement on the CTR prediction task. While DL techniques can learn intricate user behavior patterns, it relies on a vast amount of data and does not perform as well when there is a limited amount of data. We propose XDBoost, a new DL method for capturing complex patterns that requires just a limited amount of raw data. XDBoost is an iterative three-stage neural network model influenced by the traditional machine learning boosting mechanism. XDBoost's components operate sequentially similar to boosting; However, unlike conventional boosting, XDBoost does not sum the predictions generated by its components. Instead, it utilizes these predictions as new artificial features and enhances CTR prediction by retraining the model using these features. Comprehensive experiments conducted to illustrate the effectiveness of XDBoost on two datasets demonstrated its ability to outperform existing state-of-the-art (SOTA) models for CTR prediction.
△ Less
Submitted 26 July, 2020;
originally announced July 2020.
-
Automatic Machine Learning Derived from Scholarly Big Data
Authors:
Asnat Greenstein-Messica,
Roman Vainshtein,
Gilad Katz,
Bracha Shapira,
Lior Rokach
Abstract:
One of the challenging aspects of applying machine learning is the need to identify the algorithms that will perform best for a given dataset. This process can be difficult, time consuming and often requires a great deal of domain knowledge. We present Sommelier, an expert system for recommending the machine learning algorithms that should be applied on a previously unseen dataset. Sommelier is ba…
▽ More
One of the challenging aspects of applying machine learning is the need to identify the algorithms that will perform best for a given dataset. This process can be difficult, time consuming and often requires a great deal of domain knowledge. We present Sommelier, an expert system for recommending the machine learning algorithms that should be applied on a previously unseen dataset. Sommelier is based on word embedding representations of the domain knowledge extracted from a large corpus of academic publications. When presented with a new dataset and its problem description, Sommelier leverages a recommendation model trained on the word embedding representation to provide a ranked list of the most relevant algorithms to be used on the dataset. We demonstrate Sommelier's effectiveness by conducting an extensive evaluation on 121 publicly available datasets and 53 classification algorithms. The top algorithms recommended for each dataset by Sommelier were able to achieve on average 97.7% of the optimal accuracy of all surveyed algorithms.
△ Less
Submitted 6 March, 2020;
originally announced March 2020.
-
Sequence Preserving Network Traffic Generation
Authors:
Sigal Shaked,
Amos Zamir,
Roman Vainshtein,
Moshe Unger,
Lior Rokach,
Rami Puzis,
Bracha Shapira
Abstract:
We present the Network Traffic Generator (NTG), a framework for perturbing recorded network traffic with the purpose of generating diverse but realistic background traffic for network simulation and what-if analysis in enterprise environments. The framework preserves many characteristics of the original traffic recorded in an enterprise, as well as sequences of network activities. Using the propos…
▽ More
We present the Network Traffic Generator (NTG), a framework for perturbing recorded network traffic with the purpose of generating diverse but realistic background traffic for network simulation and what-if analysis in enterprise environments. The framework preserves many characteristics of the original traffic recorded in an enterprise, as well as sequences of network activities. Using the proposed framework, the original traffic flows are profiled using 200 cross-protocol features. The traffic is aggregated into flows of packets between IP pairs and clustered into groups of similar network activities. Sequences of network activities are then extracted. We examined two methods for extracting sequences of activities: a Markov model and a neural language model. Finally, new traffic is generated using the extracted model. We developed a prototype of the framework and conducted extensive experiments based on two real network traffic collections. Hypothesis testing was used to examine the difference between the distribution of original and generated features, showing that 30-100\% of the extracted features were preserved. Small differences between n-gram perplexities in sequences of network activities in the original and generated traffic, indicate that sequences of network activities were well preserved.
△ Less
Submitted 23 February, 2020;
originally announced February 2020.
-
Diversifying Database Activity Monitoring with Bandits
Authors:
Hagit Grushka-Cohen,
Ofer Biller,
Oded Sofer,
Lior Rokach,
Bracha Shapira
Abstract:
Database activity monitoring (DAM) systems are commonly used by organizations to protect the organizational data, knowledge and intellectual properties. In order to protect organizations database DAM systems have two main roles, monitoring (documenting activity) and alerting to anomalous activity. Due to high-velocity streams and operating costs, such systems are restricted to examining only a sam…
▽ More
Database activity monitoring (DAM) systems are commonly used by organizations to protect the organizational data, knowledge and intellectual properties. In order to protect organizations database DAM systems have two main roles, monitoring (documenting activity) and alerting to anomalous activity. Due to high-velocity streams and operating costs, such systems are restricted to examining only a sample of the activity. Current solutions use policies, manually crafted by experts, to decide which transactions to monitor and log. This limits the diversity of the data collected. Bandit algorithms, which use reward functions as the basis for optimization while adding diversity to the recommended set, have gained increased attention in recommendation systems for improving diversity.
In this work, we redefine the data sampling problem as a special case of the multi-armed bandit (MAB) problem and present a novel algorithm, which combines expert knowledge with random exploration. We analyze the effect of diversity on coverage and downstream event detection tasks using a simulated dataset. In doing so, we find that adding diversity to the sampling using the bandit-based approach works well for this task and maximizing population coverage without decreasing the quality in terms of issuing alerts about events.
△ Less
Submitted 23 October, 2019;
originally announced October 2019.
-
Deep Context-Aware Recommender System Utilizing Sequential Latent Context
Authors:
Amit Livne,
Moshe Unger,
Bracha Shapira,
Lior Rokach
Abstract:
Context-aware recommender systems (CARSs) apply sensing and analysis of user context in order to provide personalized services. Adding context to a recommendation model is challenging, since the addition of context may increases both the dimensionality and sparsity of the model. Recent research has shown that modeling contextual information as a latent vector may address the sparsity and dimension…
▽ More
Context-aware recommender systems (CARSs) apply sensing and analysis of user context in order to provide personalized services. Adding context to a recommendation model is challenging, since the addition of context may increases both the dimensionality and sparsity of the model. Recent research has shown that modeling contextual information as a latent vector may address the sparsity and dimensionality challenges. We suggest a new latent modeling of sequential context by generating sequences of contextual information and reducing their contextual space to a compressed latent space.We train a long short-term memory (LSTM) encoder-decoder network on sequences of contextual information and extract sequential latent context from the hidden layer of the network in order to represent a compressed representation of sequential data. We propose new context-aware recommendation models that extend the neural collaborative filtering approach and learn nonlinear interactions between latent features of users, items, and contexts which take into account the sequential latent context representation as part of the recommendation process. We deployed our approach using two context-aware datasets with different context dimensions. Empirical analysis of our results validates that our proposed sequential latent context-aware model (SLCM), surpasses state of the art CARS models.
△ Less
Submitted 6 August, 2020; v1 submitted 9 September, 2019;
originally announced September 2019.
-
Assessing the Quality of Scientific Papers
Authors:
Roman Vainshtein,
Gilad Katz,
Bracha Shapira,
Lior Rokach
Abstract:
A multitude of factors are responsible for the overall quality of scientific papers, including readability, linguistic quality, fluency,semantic complexity, and of course domain-specific technical factors. These factors vary from one field of study to another. In this paper, we propose a measure and method for assessing the overall quality of the scientific papers in a particular field of study. W…
▽ More
A multitude of factors are responsible for the overall quality of scientific papers, including readability, linguistic quality, fluency,semantic complexity, and of course domain-specific technical factors. These factors vary from one field of study to another. In this paper, we propose a measure and method for assessing the overall quality of the scientific papers in a particular field of study. We evaluate our method in the computer science domain, but it can be applied to other technical and scientific fields.Our method is based on the corpus linguistics technique. This technique enables the extraction of required information and knowledge associated with a specific domain. For this purpose, we have created a large corpus, consisting of papers from very high impact conferences. First, we analyze this corpus in order to extract rich domain-specific terminology and knowledge. Then we use the acquired knowledge to estimate the quality of scientific papers by applying our proposed measure. We examine our measure on high and low scientific impact test corpora. Our results show a significant difference in the measure scores of the high and low impact test corpora. Second, we develop a classifier based on our proposed measure and compare it to the baseline classifier. Our results show that the classifier based on our measure over-performed the baseline classifier. Based on the presented results the proposed measure and the technique can be used for automated assessment of scientific papers.
△ Less
Submitted 12 August, 2019;
originally announced August 2019.
-
Clustering Wi-Fi Fingerprints for Indoor-Outdoor Detection
Authors:
Guy Shtar,
Bracha Shapira,
Lior Rokach
Abstract:
This paper presents a method for continuous indoor-outdoor environment detection on mobile devices based solely on WiFi fingerprints. Detection of indoor outdoor switching is an important part of identifying a user's context, and it provides important information for upper layer context aware mobile applications such as recommender systems, navigation tools, etc. Moreover, future indoor positionin…
▽ More
This paper presents a method for continuous indoor-outdoor environment detection on mobile devices based solely on WiFi fingerprints. Detection of indoor outdoor switching is an important part of identifying a user's context, and it provides important information for upper layer context aware mobile applications such as recommender systems, navigation tools, etc. Moreover, future indoor positioning systems are likely to use Wi-Fi fingerprints, and therefore Wi-Fi receivers will be on most of the time. In contrast to existing research, we believe that these fingerprints should be leveraged, and they serve as the basis of the proposed method. Using various machine learning algorithms, we train a supervised classifier based on features extracted from the raw fingerprints, clusters, and cluster transition graph. The contribution of each of the features to the method is assessed. Our method assumes no prior knowledge of the environment, and a training set consisting of the data collected for just a few hours on a single device is sufficient in order to provide indoor-outdoor classification, even in an unknown location or when using new devices. We evaluate our method in an experiment involving 12 participants during their daily routine, with a total of 828 hours' worth of data collected by the participants. We report a predictive performance of the AUC (area under the curve) of 0.94 using the gradient boosting machine ensemble learning method. We show that our method can be used for other context detection tasks such as learning and recognizing a given building or room.
△ Less
Submitted 2 August, 2019;
originally announced August 2019.
-
A difficulty ranking approach to personalization in E-learning
Authors:
Avi Segal,
Kobi Gal,
Guy Shani,
Bracha Shapira
Abstract:
The prevalence of e-learning systems and on-line courses has made educational material widely accessible to students of varying abilities and backgrounds. There is thus a growing need to accommodate for individual differences in e-learning systems. This paper presents an algorithm called EduRank for personalizing educational content to students that combines a collaborative filtering algorithm wit…
▽ More
The prevalence of e-learning systems and on-line courses has made educational material widely accessible to students of varying abilities and backgrounds. There is thus a growing need to accommodate for individual differences in e-learning systems. This paper presents an algorithm called EduRank for personalizing educational content to students that combines a collaborative filtering algorithm with voting methods. EduRank constructs a difficulty ranking for each student by aggregating the rankings of similar students using different aspects of their performance on common questions. These aspects include grades, number of retries, and time spent solving questions. It infers a difficulty ranking directly over the questions for each student, rather than ordering them according to the student's predicted score. The EduRank algorithm was tested on two data sets containing thousands of students and a million records. It was able to outperform the state-of-the-art ranking approaches as well as a domain expert. EduRank was used by students in a classroom activity, where a prior model was incorporated to predict the difficulty rankings of students with no prior history in the system. It was shown to lead students to solve more difficult questions than an ordering by a domain expert, without reducing their performance.
△ Less
Submitted 28 July, 2019;
originally announced July 2019.
-
New Item Consumption Prediction Using Deep Learning
Authors:
Michael Shekasta,
Gilad Katz,
Asnat Greenstein-Messica,
Lior Rokach,
Bracha Shapira
Abstract:
Recommendation systems have become ubiquitous in today's online world and are an integral part of practically every e-commerce platform. While traditional recommender systems use customer history, this approach is not feasible in 'cold start' scenarios. Such scenarios include the need to produce recommendations for new or unregistered users and the introduction of new items. In this study, we pres…
▽ More
Recommendation systems have become ubiquitous in today's online world and are an integral part of practically every e-commerce platform. While traditional recommender systems use customer history, this approach is not feasible in 'cold start' scenarios. Such scenarios include the need to produce recommendations for new or unregistered users and the introduction of new items. In this study, we present the Purchase Intent Session-bAsed (PISA) algorithm, a content-based algorithm for predicting the purchase intent for cold start session-based scenarios. Our approach employs deep learning techniques both for modeling the content and purchase intent prediction. Our experiments show that PISA outperforms a well-known deep learning baseline when new items are introduced. In addition, while content-based approaches often fail to perform well in highly imbalanced datasets, our approach successfully handles such cases. Finally, our experiments show that combining PISA with the baseline in non-cold start scenarios further improves performance.
△ Less
Submitted 12 May, 2019; v1 submitted 5 May, 2019;
originally announced May 2019.
-
Online Budgeted Learning for Classifier Induction
Authors:
Eran Fainman,
Bracha Shapira,
Lior Rokach,
Yisroel Mirsky
Abstract:
In real-world machine learning applications, there is a cost associated with sampling of different features. Budgeted learning can be used to select which feature-values to acquire from each instance in a dataset, such that the best model is induced under a given constraint. However, this approach is not possible in the domain of online learning since one may not retroactively acquire feature-valu…
▽ More
In real-world machine learning applications, there is a cost associated with sampling of different features. Budgeted learning can be used to select which feature-values to acquire from each instance in a dataset, such that the best model is induced under a given constraint. However, this approach is not possible in the domain of online learning since one may not retroactively acquire feature-values from past instances. In online learning, the challenge is to find the optimum set of features to be acquired from each instance upon arrival from a data stream. In this paper we introduce the issue of online budgeted learning and describe a general framework for addressing this challenge. We propose two types of feature value acquisition policies based on the multi-armed bandit problem: random and adaptive. Adaptive policies perform online adjustments according to new information coming from a data stream, while random policies are not sensitive to the information that arrives from the data stream. Our comparative study on five real-world datasets indicates that adaptive policies outperform random policies for most budget limitations and datasets. Furthermore, we found that in some cases adaptive policies achieve near-optimal results.
△ Less
Submitted 13 March, 2019;
originally announced March 2019.
-
Personal Dynamic Cost-Aware Sensing for Latent Context Detection
Authors:
Saar Tal,
Bracha Shapira,
Lior Rokach
Abstract:
In the past decade, the usage of mobile devices has gone far beyond simple activities like calling and texting. Today, smartphones contain multiple embedded sensors and are able to collect useful sensing data about the user and infer the user's context. The more frequent the sensing, the more accurate the context. However, continuous sensing results in huge energy consumption, decreasing the batte…
▽ More
In the past decade, the usage of mobile devices has gone far beyond simple activities like calling and texting. Today, smartphones contain multiple embedded sensors and are able to collect useful sensing data about the user and infer the user's context. The more frequent the sensing, the more accurate the context. However, continuous sensing results in huge energy consumption, decreasing the battery's lifetime. We propose a novel approach for cost-aware sensing when performing continuous latent context detection. The suggested method dynamically determines user's sensors sampling policy based on three factors: (1) User's last known context; (2) Predicted information loss using KL-Divergence; and (3) Sensors' sampling costs. The objective function aims at minimizing both sampling cost and information loss. The method is based on various machine learning techniques including autoencoder neural networks for latent context detection, linear regression for information loss prediction, and convex optimization for determining the optimal sampling policy. To evaluate the suggested method, we performed a series of tests on real-world data recorded at a high-frequency rate; the data was collected from six mobile phone sensors of twenty users over the course of a week. Results show that by applying a dynamic sampling policy, our method naturally balances information loss and energy consumption and outperforms the static approach.% We compared the performance of our method with another state of the art dynamic sampling method and demonstrate its consistent superiority in various measures. %Our methods outperformed, and were able to improve we achieved better results in either sampling cost or information loss, and in some cases we improved both.
△ Less
Submitted 13 March, 2019;
originally announced March 2019.
-
Detecting drug-drug interactions using artificial neural networks and classic graph similarity measures
Authors:
Guy Shtar,
Lior Rokach,
Bracha Shapira
Abstract:
Drug-drug interactions are preventable causes of medical injuries and often result in doctor and emergency room visits. Computational techniques can be used to predict potential drug-drug interactions. We approach the drug-drug interaction prediction problem as a link prediction problem and present two novel methods for drug-drug interaction prediction based on artificial neural networks and facto…
▽ More
Drug-drug interactions are preventable causes of medical injuries and often result in doctor and emergency room visits. Computational techniques can be used to predict potential drug-drug interactions. We approach the drug-drug interaction prediction problem as a link prediction problem and present two novel methods for drug-drug interaction prediction based on artificial neural networks and factor propagation over graph nodes: adjacency matrix factorization (AMF) and adjacency matrix factorization with propagation (AMFP). We conduct a retrospective analysis by training our models on a previous release of the DrugBank database with 1,141 drugs and 45,296 drug-drug interactions and evaluate the results on a later version of DrugBank with 1,440 drugs and 248,146 drug-drug interactions. Additionally, we perform a holdout analysis using DrugBank. We report an area under the receiver operating characteristic curve score of 0.807 and 0.990 for the retrospective and holdout analyses respectively. Finally, we create an ensemble-based classifier using AMF, AMFP, and existing link prediction methods and obtain an area under the receiver operating characteristic curve of 0.814 and 0.991 for the retrospective and the holdout analyses. We demonstrate that AMF and AMFP provide state of the art results compared to existing methods and that the ensemble-based classifier improves the performance by combining various predictors. These results suggest that AMF, AMFP, and the proposed ensemble-based classifier can provide important information during drug development and regarding drug prescription given only partial or noisy data. These methods can also be used to solve other link prediction problems. Drug embeddings (compressed representations) created when training our models using the interaction network have been made public.
△ Less
Submitted 5 August, 2019; v1 submitted 11 March, 2019;
originally announced March 2019.
-
Attack Graph Obfuscation
Authors:
Rami Puzis,
Hadar Polad,
Bracha Shapira
Abstract:
Before executing an attack, adversaries usually explore the victim's network in an attempt to infer the network topology and identify vulnerabilities in the victim's servers and personal computers. Falsifying the information collected by the adversary post penetration may significantly slower lateral movement and increase the amount of noise generated within the victim's network. We investigate th…
▽ More
Before executing an attack, adversaries usually explore the victim's network in an attempt to infer the network topology and identify vulnerabilities in the victim's servers and personal computers. Falsifying the information collected by the adversary post penetration may significantly slower lateral movement and increase the amount of noise generated within the victim's network. We investigate the effect of fake vulnerabilities within a real enterprise network on the attacker performance. We use the attack graphs to model the path of an attacker making its way towards a target in a given network. We use combinatorial optimization in order to find the optimal assignments of fake vulnerabilities. We demonstrate the feasibility of our deception-based defense by presenting results of experiments with a large scale real network. We show that adding fake vulnerabilities forces the adversary to invest a significant amount of effort, in terms of time and exploitability cost.
△ Less
Submitted 6 March, 2019;
originally announced March 2019.
-
Explaining Anomalies Detected by Autoencoders Using SHAP
Authors:
Liat Antwarg,
Ronnie Mindlin Miller,
Bracha Shapira,
Lior Rokach
Abstract:
Anomaly detection algorithms are often thought to be limited because they don't facilitate the process of validating results performed by domain experts. In Contrast, deep learning algorithms for anomaly detection, such as autoencoders, point out the outliers, saving experts the time-consuming task of examining normal cases in order to find anomalies. Most outlier detection algorithms output a sco…
▽ More
Anomaly detection algorithms are often thought to be limited because they don't facilitate the process of validating results performed by domain experts. In Contrast, deep learning algorithms for anomaly detection, such as autoencoders, point out the outliers, saving experts the time-consuming task of examining normal cases in order to find anomalies. Most outlier detection algorithms output a score for each instance in the database. The top-k most intense outliers are returned to the user for further inspection; however the manual validation of results becomes challenging without additional clues. An explanation of why an instance is anomalous enables the experts to focus their investigation on most important anomalies and may increase their trust in the algorithm.
Recently, a game theory-based framework known as SHapley Additive exPlanations (SHAP) has been shown to be effective in explaining various supervised learning models. In this research, we extend SHAP to explain anomalies detected by an autoencoder, an unsupervised model. The proposed method extracts and visually depicts both the features that most contributed to the anomaly and those that offset it. A preliminary experimental study using real world data demonstrates the usefulness of the proposed method in assisting the domain experts to understand the anomaly and filtering out the uninteresting anomalies, aiming at minimizing the false positive rate of detected anomalies.
△ Less
Submitted 30 June, 2020; v1 submitted 6 March, 2019;
originally announced March 2019.
-
Implicit Dimension Identification in User-Generated Text with LSTM Networks
Authors:
Victor Makarenkov,
Ido Guy,
Niva Hazon,
Tamar Meisels,
Bracha Shapira,
Lior Rokach
Abstract:
In the process of online storytelling, individual users create and consume highly diverse content that contains a great deal of implicit beliefs and not plainly expressed narrative. It is hard to manually detect these implicit beliefs, intentions and moral foundations of the writers. We study and investigate two different tasks, each of which reflect the difficulty of detecting an implicit user's…
▽ More
In the process of online storytelling, individual users create and consume highly diverse content that contains a great deal of implicit beliefs and not plainly expressed narrative. It is hard to manually detect these implicit beliefs, intentions and moral foundations of the writers. We study and investigate two different tasks, each of which reflect the difficulty of detecting an implicit user's knowledge, intent or belief that may be based on writer's moral foundation: 1) political perspective detection in news articles 2) identification of informational vs. conversational questions in community question answering (CQA) archives and. In both tasks we first describe new interesting annotated datasets and make the datasets publicly available. Second, we compare various classification algorithms, and show the differences in their performance on both tasks. Third, in political perspective detection task we utilize a narrative representation language of local press to identify perspective differences between presumably neutral American and British press.
△ Less
Submitted 1 February, 2019; v1 submitted 26 January, 2019;
originally announced January 2019.
-
Choosing the Right Word: Using Bidirectional LSTM Tagger for Writing Support Systems
Authors:
Victor Makarenkov,
Lior Rokach,
Bracha Shapira
Abstract:
Scientific writing is difficult. It is even harder for those for whom English is a second language (ESL learners). Scholars around the world spend a significant amount of time and resources proofreading their work before submitting it for review or publication.
In this paper we present a novel machine learning based application for proper word choice task. Proper word choice is a generalization…
▽ More
Scientific writing is difficult. It is even harder for those for whom English is a second language (ESL learners). Scholars around the world spend a significant amount of time and resources proofreading their work before submitting it for review or publication.
In this paper we present a novel machine learning based application for proper word choice task. Proper word choice is a generalization the lexical substitution (LS) and grammatical error correction (GEC) tasks. We demonstrate and evaluate the usefulness of applying bidirectional Long Short Term Memory (LSTM) tagger, for this task. While state-of-the-art grammatical error correction uses error-specific classifiers and machine translation methods, we demonstrate an unsupervised method that is based solely on a high quality text corpus and does not require manually annotated data. We use a bidirectional Recurrent Neural Network (RNN) with LSTM for learning the proper word choice based on a word's sentential context. We demonstrate and evaluate our application on both a domain-specific (scientific), writing task and a general-purpose writing task. We show that our domain-specific and general-purpose models outperform state-of-the-art general context learning. As an additional contribution of this research, we also share our code, pre-trained models, and a new ESL learner test set with the research community.
△ Less
Submitted 8 January, 2019;
originally announced January 2019.
-
Wikibook-Bot - Automatic Generation of a Wikipedia Book
Authors:
Shahar Admati,
Lior Rokach,
Bracha Shapira
Abstract:
A Wikipedia book (known as Wikibook) is a collection of Wikipedia articles on a particular theme that is organized as a book. We propose Wikibook-Bot, a machine-learning based technique for automatically generating high quality Wikibooks based on a concept provided by the user. In order to create the Wikibook we apply machine learning algorithms to the different steps of the proposed technique. Fi…
▽ More
A Wikipedia book (known as Wikibook) is a collection of Wikipedia articles on a particular theme that is organized as a book. We propose Wikibook-Bot, a machine-learning based technique for automatically generating high quality Wikibooks based on a concept provided by the user. In order to create the Wikibook we apply machine learning algorithms to the different steps of the proposed technique. Firs, we need to decide whether an article belongs to a specific Wikibook - a classification task. Then, we need to divide the chosen articles into chapters - a clustering task - and finally, we deal with the ordering task which includes two subtasks: order articles within each chapter and order the chapters themselves. We propose a set of structural, text-based and unique Wikipedia features, and we show that by using these features, a machine learning classifier can successfully address the above challenges. The predictive performance of the proposed method is evaluated by comparing the auto-generated books to existing 407 Wikibooks which were manually generated by humans. For all the tasks we were able to obtain high and statistically significant results when comparing the Wikibook-bot books to books that were manually generated by Wikipedia contributors
△ Less
Submitted 28 December, 2018;
originally announced December 2018.
-
Predicting Relevance Scores for Triples from Type-Like Relations using Neural Embedding - The Cabbage Triple Scorer at WSDM Cup 2017
Authors:
Yael Brumer,
Bracha Shapira,
Lior Rokach,
Oren Barkan
Abstract:
The WSDM Cup 2017 Triple scoring challenge is aimed at calculating and assigning relevance scores for triples from type-like relations. Such scores are a fundamental ingredient for ranking results in entity search. In this paper, we propose a method that uses neural embedding techniques to accurately calculate an entity score for a triple based on its nearest neighbor. We strive to develop a new l…
▽ More
The WSDM Cup 2017 Triple scoring challenge is aimed at calculating and assigning relevance scores for triples from type-like relations. Such scores are a fundamental ingredient for ranking results in entity search. In this paper, we propose a method that uses neural embedding techniques to accurately calculate an entity score for a triple based on its nearest neighbor. We strive to develop a new latent semantic model with a deep structure that captures the semantic and syntactic relations between words. Our method has been ranked among the top performers with accuracy - 0.74, average score difference - 1.74, and average Kendall's Tau - 0.35.
△ Less
Submitted 22 December, 2017;
originally announced December 2017.
-
Sampling High Throughput Data for Anomaly Detection of Data-Base Activity
Authors:
Hagit Grushka-Cohen,
Oded Sofer,
Ofer Biller,
Michael Dymshits,
Lior Rokach,
Bracha Shapira
Abstract:
Data leakage and theft from databases is a dangerous threat to organizations. Data Security and Data Privacy protection systems (DSDP) monitor data access and usage to identify leakage or suspicious activities that should be investigated. Because of the high velocity nature of database systems, such systems audit only a portion of the vast number of transactions that take place. Anomalies are inve…
▽ More
Data leakage and theft from databases is a dangerous threat to organizations. Data Security and Data Privacy protection systems (DSDP) monitor data access and usage to identify leakage or suspicious activities that should be investigated. Because of the high velocity nature of database systems, such systems audit only a portion of the vast number of transactions that take place. Anomalies are investigated by a Security Officer (SO) in order to choose the proper response. In this paper we investigate the effect of sampling methods based on the risk the transaction poses and propose a new method for "combined sampling" for capturing a more varied sample.
△ Less
Submitted 14 August, 2017;
originally announced August 2017.
-
Language Models with Pre-Trained (GloVe) Word Embeddings
Authors:
Victor Makarenkov,
Bracha Shapira,
Lior Rokach
Abstract:
In this work we implement a training of a Language Model (LM), using Recurrent Neural Network (RNN) and GloVe word embeddings, introduced by Pennigton et al. in [1]. The implementation is following the general idea of training RNNs for LM tasks presented in [2], but is rather using Gated Recurrent Unit (GRU) [3] for a memory cell, and not the more commonly used LSTM [4].
In this work we implement a training of a Language Model (LM), using Recurrent Neural Network (RNN) and GloVe word embeddings, introduced by Pennigton et al. in [1]. The implementation is following the general idea of training RNNs for LM tasks presented in [2], but is rather using Gated Recurrent Unit (GRU) [3] for a memory cell, and not the more commonly used LSTM [4].
△ Less
Submitted 5 February, 2017; v1 submitted 12 October, 2016;
originally announced October 2016.
-
JoKER: Trusted Detection of Kernel Rootkits in Android Devices via JTAG Interface
Authors:
Mordechai Guri,
Yuri Poliak,
Bracha Shapira,
Yuval Elovici
Abstract:
Smartphones and tablets have become prime targets for malware, due to the valuable private and corporate information they hold. While Anti-Virus (AV) program may successfully detect malicious applications (apps), they remain ineffective against low-level rootkits that evade detection mechanisms by masking their own presence. Furthermore, any detection mechanism run on the same physical device as t…
▽ More
Smartphones and tablets have become prime targets for malware, due to the valuable private and corporate information they hold. While Anti-Virus (AV) program may successfully detect malicious applications (apps), they remain ineffective against low-level rootkits that evade detection mechanisms by masking their own presence. Furthermore, any detection mechanism run on the same physical device as the monitored OS can be compromised via application, kernel or boot-loader vulnerabilities. Consequentially, trusted detection of kernel rootkits in mobile devices is a challenging task in practice. In this paper we present JoKER - a system which aims at detecting rootkits in the Android kernel by utilizing the hardware's Joint Test Action Group (JTAG) interface for trusted memory forensics. Our framework consists of components that extract areas of a kernel's memory and reconstruct it for further analysis. We present the overall architecture along with its implementation, and demonstrate that the system can successfully detect the presence of stealthy rootkits in the kernel. The results show that although JTAG's main purpose is system testing, it can also be used for malware detection where traditional methods fail.
△ Less
Submitted 13 December, 2015;
originally announced December 2015.
-
Enabling Complex Wikipedia Queries - Technical Report
Authors:
Gilad Katz,
Bracha Shapira
Abstract:
In this technical report we present a database schema used to store Wikipedia so it can be easily used in query-intensive applications. In addition to storing the information in a way that makes it highly accessible, our schema enables users to easily formulate complex queries using information such as the anchor-text of links and their location in the page, the titles and number of redirect pages…
▽ More
In this technical report we present a database schema used to store Wikipedia so it can be easily used in query-intensive applications. In addition to storing the information in a way that makes it highly accessible, our schema enables users to easily formulate complex queries using information such as the anchor-text of links and their location in the page, the titles and number of redirect pages for each page and the paragraph structure of entity pages. We have successfully used the schema in domains such as recommender systems, information retrieval and sentiment analysis. In order to assist other researchers, we now make the schema and its content available online.
△ Less
Submitted 13 August, 2015;
originally announced August 2015.
-
Content-based data leakage detection using extended fingerprinting
Authors:
Yuri Shapira,
Bracha Shapira,
Asaf Shabtai
Abstract:
Protecting sensitive information from unauthorized disclosure is a major concern of every organization. As an organizations employees need to access such information in order to carry out their daily work, data leakage detection is both an essential and challenging task. Whether caused by malicious intent or an inadvertent mistake, data loss can result in significant damage to the organization. Fi…
▽ More
Protecting sensitive information from unauthorized disclosure is a major concern of every organization. As an organizations employees need to access such information in order to carry out their daily work, data leakage detection is both an essential and challenging task. Whether caused by malicious intent or an inadvertent mistake, data loss can result in significant damage to the organization. Fingerprinting is a content-based method used for detecting data leakage. In fingerprinting, signatures of known confidential content are extracted and matched with outgoing content in order to detect leakage of sensitive content. Existing fingerprinting methods, however, suffer from two major limitations. First, fingerprinting can be bypassed by rephrasing (or minor modification) of the confidential content, and second, usually the whole content of document is fingerprinted (including non-confidential parts), resulting in false alarms. In this paper we propose an extension to the fingerprinting approach that is based on sorted k-skip-n-grams. The proposed method is able to produce a fingerprint of the core confidential content which ignores non-relevant (non-confidential) sections. In addition, the proposed fingerprint method is more robust to rephrasing and can also be used to detect a previously unseen confidential document and therefore provide better detection of intentional leakage incidents.
△ Less
Submitted 8 February, 2013;
originally announced February 2013.
-
Social Network Based Search for Experts
Authors:
Yehonatan Bitton,
Michael Fire,
Dima Kagan,
Bracha Shapira,
Lior Rokach,
Judit Bar-Ilan
Abstract:
Our system illustrates how information retrieved from social networks can be used for suggesting experts for specific tasks. The system is designed to facilitate the task of finding the appropriate person(s) for a job, as a conference committee member, an advisor, etc. This short description will demonstrate how the system works in the context of the HCIR2012 published tasks.
Our system illustrates how information retrieved from social networks can be used for suggesting experts for specific tasks. The system is designed to facilitate the task of finding the appropriate person(s) for a job, as a conference committee member, an advisor, etc. This short description will demonstrate how the system works in the context of the HCIR2012 published tasks.
△ Less
Submitted 14 December, 2012;
originally announced December 2012.
-
Using Wikipedia to Boost SVD Recommender Systems
Authors:
Gilad Katz,
Guy Shani,
Bracha Shapira,
Lior Rokach
Abstract:
Singular Value Decomposition (SVD) has been used successfully in recent years in the area of recommender systems. In this paper we present how this model can be extended to consider both user ratings and information from Wikipedia. By mapping items to Wikipedia pages and quantifying their similarity, we are able to use this information in order to improve recommendation accuracy, especially when t…
▽ More
Singular Value Decomposition (SVD) has been used successfully in recent years in the area of recommender systems. In this paper we present how this model can be extended to consider both user ratings and information from Wikipedia. By mapping items to Wikipedia pages and quantifying their similarity, we are able to use this information in order to improve recommendation accuracy, especially when the sparsity is high. Another advantage of the proposed approach is the fact that it can be easily integrated into any other SVD implementation, regardless of additional parameters that may have been added to it. Preliminary experimental results on the MovieLens dataset are encouraging.
△ Less
Submitted 5 December, 2012;
originally announced December 2012.
-
Boosting Simple Collaborative Filtering Models Using Ensemble Methods
Authors:
Ariel Bar,
Lior Rokach,
Guy Shani,
Bracha Shapira,
Alon Schclar
Abstract:
In this paper we examine the effect of applying ensemble learning to the performance of collaborative filtering methods. We present several systematic approaches for generating an ensemble of collaborative filtering models based on a single collaborative filtering algorithm (single-model or homogeneous ensemble). We present an adaptation of several popular ensemble techniques in machine learning f…
▽ More
In this paper we examine the effect of applying ensemble learning to the performance of collaborative filtering methods. We present several systematic approaches for generating an ensemble of collaborative filtering models based on a single collaborative filtering algorithm (single-model or homogeneous ensemble). We present an adaptation of several popular ensemble techniques in machine learning for the collaborative filtering domain, including bagging, boosting, fusion and randomness injection. We evaluate the proposed approach on several types of collaborative filtering base models: k- NN, matrix factorization and a neighborhood matrix factorization model. Empirical evaluation shows a prediction improvement compared to all base CF algorithms. In particular, we show that the performance of an ensemble of simple (weak) CF models such as k-NN is competitive compared with a single strong CF model (such as matrix factorization) while requiring an order of magnitude less computational cost.
△ Less
Submitted 13 November, 2012;
originally announced November 2012.
-
Detection of Deviations in Mobile Applications Network Behavior
Authors:
L. Chekina,
D. Mimran,
L. Rokach,
Y. Elovici,
B. Shapira
Abstract:
In this paper a novel system for detecting meaningful deviations in a mobile application's network behavior is proposed. The main goal of the proposed system is to protect mobile device users and cellular infrastructure companies from malicious applications. The new system is capable of: (1) identifying malicious attacks or masquerading applications installed on a mobile device, and (2) identifyin…
▽ More
In this paper a novel system for detecting meaningful deviations in a mobile application's network behavior is proposed. The main goal of the proposed system is to protect mobile device users and cellular infrastructure companies from malicious applications. The new system is capable of: (1) identifying malicious attacks or masquerading applications installed on a mobile device, and (2) identifying republishing of popular applications injected with a malicious code. The detection is performed based on the application's network traffic patterns only. For each application two types of models are learned. The first model, local, represents the personal traffic pattern for each user using an application and is learned on the device. The second model, collaborative, represents traffic patterns of numerous users using an application and is learned on the system server. Machine-learning methods are used for learning and detection purposes. This paper focuses on methods utilized for local (i.e., on mobile device) learning and detection of deviations from the normal application's behavior. These methods were implemented and evaluated on Android devices. The evaluation experiments demonstrate that: (1) various applications have specific network traffic patterns and certain application categories can be distinguishable by their network patterns, (2) different levels of deviations from normal behavior can be detected accurately, and (3) local learning is feasible and has a low performance overhead on mobile devices.
△ Less
Submitted 5 August, 2012; v1 submitted 27 July, 2012;
originally announced August 2012.