Search | arXiv e-print repository

Data Overvaluation Attack and Truthful Data Valuation in Federated Learning

Authors: Shuyuan Zheng, Sudong Cai, Chuan Xiao, Yang Cao, Jianbin Qin, Masatoshi Yoshikawa, Makoto Onizuka

Abstract: In collaborative machine learning (CML), data valuation, i.e., evaluating the contribution of each client's data to the machine learning model, has become a critical task for incentivizing and selecting positive data contributions. However, existing studies often assume that clients engage in data valuation truthfully, overlooking the practical motivation for clients to exaggerate their contributi… ▽ More In collaborative machine learning (CML), data valuation, i.e., evaluating the contribution of each client's data to the machine learning model, has become a critical task for incentivizing and selecting positive data contributions. However, existing studies often assume that clients engage in data valuation truthfully, overlooking the practical motivation for clients to exaggerate their contributions. To unlock this threat, this paper introduces the data overvaluation attack, enabling strategic clients to have their data significantly overvalued in federated learning, a widely adopted paradigm for decentralized CML. Furthermore, we propose a Bayesian truthful data valuation metric, named Truth-Shapley. Truth-Shapley is the unique metric that guarantees some promising axioms for data valuation while ensuring that clients' optimal strategy is to perform truthful data valuation under certain conditions. Our experiments demonstrate the vulnerability of existing data valuation metrics to the proposed attack and validate the robustness and effectiveness of Truth-Shapley. △ Less

Submitted 24 May, 2025; v1 submitted 1 February, 2025; originally announced February 2025.

arXiv:2410.16121 [pdf, other]

Extracting Spatiotemporal Data from Gradients with Large Language Models

Authors: Lele Zheng, Yang Cao, Renhe Jiang, Kenjiro Taura, Yulong Shen, Sheng Li, Masatoshi Yoshikawa

Abstract: Recent works show that sensitive user data can be reconstructed from gradient updates, breaking the key privacy promise of federated learning. While success was demonstrated primarily on image data, these methods do not directly transfer to other domains, such as spatiotemporal data. To understand privacy risks in spatiotemporal federated learning, we first propose Spatiotemporal Gradient Inversio… ▽ More Recent works show that sensitive user data can be reconstructed from gradient updates, breaking the key privacy promise of federated learning. While success was demonstrated primarily on image data, these methods do not directly transfer to other domains, such as spatiotemporal data. To understand privacy risks in spatiotemporal federated learning, we first propose Spatiotemporal Gradient Inversion Attack (ST-GIA), a gradient attack algorithm tailored to spatiotemporal data that successfully reconstructs the original location from gradients. Furthermore, the absence of priors in attacks on spatiotemporal data has hindered the accurate reconstruction of real client data. To address this limitation, we propose ST-GIA+, which utilizes an auxiliary language model to guide the search for potential locations, thereby successfully reconstructing the original data from gradients. In addition, we design an adaptive defense strategy to mitigate gradient inversion attacks in spatiotemporal federated learning. By dynamically adjusting the perturbation levels, we can offer tailored protection for varying rounds of training data, thereby achieving a better trade-off between privacy and utility than current state-of-the-art methods. Through intensive experimental analysis on three real-world datasets, we reveal that the proposed defense strategy can well preserve the utility of spatiotemporal federated learning with effective security protection. △ Less

Submitted 21 October, 2024; originally announced October 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2407.08529

arXiv:2408.10231 [pdf, ps, other]

Achieving Faster and More Accurate Operation of Deep Predictive Learning

Authors: Masaki Yoshikawa, Hiroshi Ito, Tetsuya Ogata

Abstract: Achieving both high speed and precision in robot operations is a significant challenge for social implementation. While factory robots excel at predefined tasks, they struggle with environment-specific actions like cleaning and cooking. Deep learning research aims to address this by enabling robots to autonomously execute behaviors through end-to-end learning with sensor data. RT-1 and ACT are not… ▽ More Achieving both high speed and precision in robot operations is a significant challenge for social implementation. While factory robots excel at predefined tasks, they struggle with environment-specific actions like cleaning and cooking. Deep learning research aims to address this by enabling robots to autonomously execute behaviors through end-to-end learning with sensor data. RT-1 and ACT are notable examples that have expanded robots' capabilities. However, issues with model inference speed and hand position accuracy persist. High-quality training data and fast, stable inference mechanisms are essential to overcome these challenges. This paper proposes a motion generation model for high-speed, high-precision tasks, exemplified by the sports stacking task. By teaching motions slowly and inferring at high speeds, the model achieved a 94% success rate in stacking cups with a real robot. △ Less

Submitted 3 August, 2024; originally announced August 2024.

Comments: 2 pages, 2 figures

arXiv:2408.02928 [pdf, other]

PGB: Benchmarking Differentially Private Synthetic Graph Generation Algorithms

Authors: Shang Liu, Hao Du, Yang Cao, Bo Yan, Jinfei Liu, Masatoshi Yoshikawa

Abstract: Differentially private graph analysis is a powerful tool for deriving insights from diverse graph data while protecting individual information. Designing private analytic algorithms for different graph queries often requires starting from scratch. In contrast, differentially private synthetic graph generation offers a general paradigm that supports one-time generation for multiple queries. Althoug… ▽ More Differentially private graph analysis is a powerful tool for deriving insights from diverse graph data while protecting individual information. Designing private analytic algorithms for different graph queries often requires starting from scratch. In contrast, differentially private synthetic graph generation offers a general paradigm that supports one-time generation for multiple queries. Although a rich set of differentially private graph generation algorithms has been proposed, comparing them effectively remains challenging due to various factors, including differing privacy definitions, diverse graph datasets, varied privacy requirements, and multiple utility metrics. To this end, we propose PGB (Private Graph Benchmark), a comprehensive benchmark designed to enable researchers to compare differentially private graph generation algorithms fairly. We begin by identifying four essential elements of existing works as a 4-tuple: mechanisms, graph datasets, privacy requirements, and utility metrics. We discuss principles regarding these elements to ensure the comprehensiveness of a benchmark. Next, we present a benchmark instantiation that adheres to all principles, establishing a new method to evaluate existing and newly proposed graph generation algorithms. Through extensive theoretical and empirical analysis, we gain valuable insights into the strengths and weaknesses of prior algorithms. Our results indicate that there is no universal solution for all possible cases. Finally, we provide guidelines to help researchers select appropriate mechanisms for various scenarios. △ Less

Submitted 9 December, 2024; v1 submitted 5 August, 2024; originally announced August 2024.

Comments: 18 pages, accepted by ICDE 2025

arXiv:2407.08529 [pdf, other]

Enhancing Privacy of Spatiotemporal Federated Learning against Gradient Inversion Attacks

Authors: Lele Zheng, Yang Cao, Renhe Jiang, Kenjiro Taura, Yulong Shen, Sheng Li, Masatoshi Yoshikawa

Abstract: Spatiotemporal federated learning has recently raised intensive studies due to its ability to train valuable models with only shared gradients in various location-based services. On the other hand, recent studies have shown that shared gradients may be subject to gradient inversion attacks (GIA) on images or texts. However, so far there has not been any systematic study of the gradient inversion a… ▽ More Spatiotemporal federated learning has recently raised intensive studies due to its ability to train valuable models with only shared gradients in various location-based services. On the other hand, recent studies have shown that shared gradients may be subject to gradient inversion attacks (GIA) on images or texts. However, so far there has not been any systematic study of the gradient inversion attacks in spatiotemporal federated learning. In this paper, we explore the gradient attack problem in spatiotemporal federated learning from attack and defense perspectives. To understand privacy risks in spatiotemporal federated learning, we first propose Spatiotemporal Gradient Inversion Attack (ST-GIA), a gradient attack algorithm tailored to spatiotemporal data that successfully reconstructs the original location from gradients. Furthermore, we design an adaptive defense strategy to mitigate gradient inversion attacks in spatiotemporal federated learning. By dynamically adjusting the perturbation levels, we can offer tailored protection for varying rounds of training data, thereby achieving a better trade-off between privacy and utility than current state-of-the-art methods. Through intensive experimental analysis on three real-world datasets, we reveal that the proposed defense strategy can well preserve the utility of spatiotemporal federated learning with effective security protection. △ Less

Submitted 15 July, 2024; v1 submitted 11 July, 2024; originally announced July 2024.

Comments: Accepted by DASFAA 2024, 16 pages

arXiv:2405.20576 [pdf, other]

Federated Graph Analytics with Differential Privacy

Authors: Shang Liu, Yang Cao, Takao Murakami, Weiran Liu, Seng Pei Liew, Tsubasa Takahashi, Jinfei Liu, Masatoshi Yoshikawa

Abstract: Collaborative graph analysis across multiple institutions is becoming increasingly popular. Realistic examples include social network analysis across various social platforms, financial transaction analysis across multiple banks, and analyzing the transmission of infectious diseases across multiple hospitals. We define the federated graph analytics, a new problem for collaborative graph analytics… ▽ More Collaborative graph analysis across multiple institutions is becoming increasingly popular. Realistic examples include social network analysis across various social platforms, financial transaction analysis across multiple banks, and analyzing the transmission of infectious diseases across multiple hospitals. We define the federated graph analytics, a new problem for collaborative graph analytics under differential privacy. Although differentially private graph analysis has been widely studied, it fails to achieve a good tradeoff between utility and privacy in federated scenarios, due to the limited view of local clients and overlapping information across multiple subgraphs. Motivated by this, we first propose a federated graph analytic framework, named FEAT, which enables arbitrary downstream common graph statistics while preserving individual privacy. Furthermore, we introduce an optimized framework based on our proposed degree-based partition algorithm, called FEAT+, which improves the overall utility by leveraging the true local subgraphs. Finally, extensive experiments demonstrate that our FEAT and FEAT+ significantly outperform the baseline approach by approximately one and four orders of magnitude, respectively. △ Less

Submitted 30 May, 2024; originally announced May 2024.

Comments: 13 pages

arXiv:2405.08043 [pdf, other]

HRNet: Differentially Private Hierarchical and Multi-Resolution Network for Human Mobility Data Synthesization

Authors: Shun Takagi, Li Xiong, Fumiyuki Kato, Yang Cao, Masatoshi Yoshikawa

Abstract: Human mobility data offers valuable insights for many applications such as urban planning and pandemic response, but its use also raises privacy concerns. In this paper, we introduce the Hierarchical and Multi-Resolution Network (HRNet), a novel deep generative model specifically designed to synthesize realistic human mobility data while guaranteeing differential privacy. We first identify the key… ▽ More Human mobility data offers valuable insights for many applications such as urban planning and pandemic response, but its use also raises privacy concerns. In this paper, we introduce the Hierarchical and Multi-Resolution Network (HRNet), a novel deep generative model specifically designed to synthesize realistic human mobility data while guaranteeing differential privacy. We first identify the key difficulties inherent in learning human mobility data under differential privacy. In response to these challenges, HRNet integrates three components: a hierarchical location encoding mechanism, multi-task learning across multiple resolutions, and private pre-training. These elements collectively enhance the model's ability under the constraints of differential privacy. Through extensive comparative experiments utilizing a real-world dataset, HRNet demonstrates a marked improvement over existing methods in balancing the utility-privacy trade-off. △ Less

Submitted 19 July, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

arXiv:2312.12938 [pdf, other]

CARGO: Crypto-Assisted Differentially Private Triangle Counting without Trusted Servers

Authors: Shang Liu, Yang Cao, Takao Murakami, Jinfei Liu, Masatoshi Yoshikawa

Abstract: Differentially private triangle counting in graphs is essential for analyzing connection patterns and calculating clustering coefficients while protecting sensitive individual information. Previous works have relied on either central or local models to enforce differential privacy. However, a significant utility gap exists between the central and local models of differentially private triangle cou… ▽ More Differentially private triangle counting in graphs is essential for analyzing connection patterns and calculating clustering coefficients while protecting sensitive individual information. Previous works have relied on either central or local models to enforce differential privacy. However, a significant utility gap exists between the central and local models of differentially private triangle counting, depending on whether or not a trusted server is needed. In particular, the central model provides a high accuracy but necessitates a trusted server. The local model does not require a trusted server but suffers from limited accuracy. Our paper introduces a crypto-assisted differentially private triangle counting system, named CARGO, leveraging cryptographic building blocks to improve the effectiveness of differentially private triangle counting without assumption of trusted servers. It achieves high utility similar to the central model but without the need for a trusted server like the local model. CARGO consists of three main components. First, we introduce a similarity-based projection method that reduces the global sensitivity while preserving more triangles via triangle homogeneity. Second, we present a triangle counting scheme based on the additive secret sharing that securely and accurately computes the triangles while protecting sensitive information. Third, we design a distributed perturbation algorithm that perturbs the triangle count with minimal but sufficient noise. We also provide a comprehensive theoretical and empirical analysis of our proposed methods. Extensive experiments demonstrate that our CARGO significantly outperforms the local model in terms of utility and achieves high-utility triangle counting comparable to the central model. △ Less

Submitted 20 December, 2023; originally announced December 2023.

Comments: Accepted by ICDE 2024

arXiv:2308.12210 [pdf, other]

ULDP-FL: Federated Learning with Across Silo User-Level Differential Privacy

Authors: Fumiyuki Kato, Li Xiong, Shun Takagi, Yang Cao, Masatoshi Yoshikawa

Abstract: Differentially Private Federated Learning (DP-FL) has garnered attention as a collaborative machine learning approach that ensures formal privacy. Most DP-FL approaches ensure DP at the record-level within each silo for cross-silo FL. However, a single user's data may extend across multiple silos, and the desired user-level DP guarantee for such a setting remains unknown. In this study, we present… ▽ More Differentially Private Federated Learning (DP-FL) has garnered attention as a collaborative machine learning approach that ensures formal privacy. Most DP-FL approaches ensure DP at the record-level within each silo for cross-silo FL. However, a single user's data may extend across multiple silos, and the desired user-level DP guarantee for such a setting remains unknown. In this study, we present Uldp-FL, a novel FL framework designed to guarantee user-level DP in cross-silo FL where a single user's data may belong to multiple silos. Our proposed algorithm directly ensures user-level DP through per-user weighted clipping, departing from group-privacy approaches. We provide a theoretical analysis of the algorithm's privacy and utility. Additionally, we enhance the utility of the proposed algorithm with an enhanced weighting strategy based on user record distribution and design a novel private protocol that ensures no additional information is revealed to the silos and the server. Experiments on real-world datasets show substantial improvements in our methods in privacy-utility trade-offs under user-level DP compared to baseline methods. To the best of our knowledge, our work is the first FL framework that effectively provides user-level DP in the general cross-silo FL setting. △ Less

Submitted 16 June, 2024; v1 submitted 23 August, 2023; originally announced August 2023.

Comments: This is the full version of the paper accepted to VLDB 2024

arXiv:2306.13293 [pdf, other]

Differentially Private Streaming Data Release under Temporal Correlations via Post-processing

Authors: Xuyang Cao, Yang Cao, Primal Pappachan, Atsuyoshi Nakamura, Masatoshi Yoshikawa

Abstract: The release of differentially private streaming data has been extensively studied, yet striking a good balance between privacy and utility on temporally correlated data in the stream remains an open problem. Existing works focus on enhancing privacy when applying differential privacy to correlated data, highlighting that differential privacy may suffer from additional privacy leakage under correla… ▽ More The release of differentially private streaming data has been extensively studied, yet striking a good balance between privacy and utility on temporally correlated data in the stream remains an open problem. Existing works focus on enhancing privacy when applying differential privacy to correlated data, highlighting that differential privacy may suffer from additional privacy leakage under correlations; consequently, a small privacy budget has to be used which worsens the utility. In this work, we propose a post-processing framework to improve the utility of differential privacy data release under temporal correlations. We model the problem as a maximum posterior estimation given the released differentially private data and correlation model and transform it into nonlinear constrained programming. Our experiments on synthetic datasets show that the proposed approach significantly improves the utility and accuracy of differentially private data by nearly a hundred times in terms of mean square error when a strict privacy budget is given. △ Less

Submitted 25 June, 2023; v1 submitted 23 June, 2023; originally announced June 2023.

arXiv:2306.10656 [pdf, other]

Virtual Human Generative Model: Masked Modeling Approach for Learning Human Characteristics

Authors: Kenta Oono, Nontawat Charoenphakdee, Kotatsu Bito, Zhengyan Gao, Hideyoshi Igata, Masashi Yoshikawa, Yoshiaki Ota, Hiroki Okui, Kei Akita, Shoichiro Yamaguchi, Yohei Sugawara, Shin-ichi Maeda, Kunihiko Miyoshi, Yuki Saito, Koki Tsuda, Hiroshi Maruyama, Kohei Hayashi

Abstract: Identifying the relationship between healthcare attributes, lifestyles, and personality is vital for understanding and improving physical and mental well-being. Machine learning approaches are promising for modeling their relationships and offering actionable suggestions. In this paper, we propose the Virtual Human Generative Model (VHGM), a novel deep generative model capable of estimating over 2… ▽ More Identifying the relationship between healthcare attributes, lifestyles, and personality is vital for understanding and improving physical and mental well-being. Machine learning approaches are promising for modeling their relationships and offering actionable suggestions. In this paper, we propose the Virtual Human Generative Model (VHGM), a novel deep generative model capable of estimating over 2,000 attributes across healthcare, lifestyle, and personality domains. VHGM leverages masked modeling to learn the joint distribution of attributes, enabling accurate predictions and robust conditional sampling. We deploy VHGM as a web service, showcasing its versatility in driving diverse healthcare applications aimed at improving user well-being. Through extensive quantitative evaluations, we demonstrate VHGM's superior performance in attribute imputation and high-quality sample generation compared to existing baselines. This work highlights VHGM as a powerful tool for personalized healthcare and lifestyle management, with broad implications for data-driven health solutions. △ Less

Submitted 29 January, 2025; v1 submitted 18 June, 2023; originally announced June 2023.

Comments: 19 pages, 4 figures

arXiv:2302.08148 [pdf, other]

Empirical Investigation of Neural Symbolic Reasoning Strategies

Authors: Yoichi Aoki, Keito Kudo, Tatsuki Kuribayashi, Ana Brassard, Masashi Yoshikawa, Keisuke Sakaguchi, Kentaro Inui

Abstract: Neural reasoning accuracy improves when generating intermediate reasoning steps. However, the source of this improvement is yet unclear. Here, we investigate and factorize the benefit of generating intermediate steps for symbolic reasoning. Specifically, we decompose the reasoning strategy w.r.t. step granularity and chaining strategy. With a purely symbolic numerical reasoning dataset (e.g., A=1,… ▽ More Neural reasoning accuracy improves when generating intermediate reasoning steps. However, the source of this improvement is yet unclear. Here, we investigate and factorize the benefit of generating intermediate steps for symbolic reasoning. Specifically, we decompose the reasoning strategy w.r.t. step granularity and chaining strategy. With a purely symbolic numerical reasoning dataset (e.g., A=1, B=3, C=A+3, C?), we found that the choice of reasoning strategies significantly affects the performance, with the gap becoming even larger as the extrapolation length becomes longer. Surprisingly, we also found that certain configurations lead to nearly perfect performance, even in the case of length extrapolation. Our results indicate the importance of further exploring effective strategies for neural reasoning models. △ Less

Submitted 16 February, 2023; originally announced February 2023.

Comments: This paper is accepted as the findings at EACL 2023, and the earlier version (non-archival) of this work got the Best Paper Award in the Student Research Workshop of AACL 2022

arXiv:2302.07866 [pdf, other]

Do Deep Neural Networks Capture Compositionality in Arithmetic Reasoning?

Authors: Keito Kudo, Yoichi Aoki, Tatsuki Kuribayashi, Ana Brassard, Masashi Yoshikawa, Keisuke Sakaguchi, Kentaro Inui

Abstract: Compositionality is a pivotal property of symbolic reasoning. However, how well recent neural models capture compositionality remains underexplored in the symbolic reasoning tasks. This study empirically addresses this question by systematically examining recently published pre-trained seq2seq models with a carefully controlled dataset of multi-hop arithmetic symbolic reasoning. We introduce a ski… ▽ More Compositionality is a pivotal property of symbolic reasoning. However, how well recent neural models capture compositionality remains underexplored in the symbolic reasoning tasks. This study empirically addresses this question by systematically examining recently published pre-trained seq2seq models with a carefully controlled dataset of multi-hop arithmetic symbolic reasoning. We introduce a skill tree on compositionality in arithmetic symbolic reasoning that defines the hierarchical levels of complexity along with three compositionality dimensions: systematicity, productivity, and substitutivity. Our experiments revealed that among the three types of composition, the models struggled most with systematicity, performing poorly even with relatively simple compositions. That difficulty was not resolved even after training the models with intermediate reasoning steps. △ Less

Submitted 15 February, 2023; originally announced February 2023.

Comments: accepted by EACL 2023

arXiv:2301.06758 [pdf, other]

Tracing and Manipulating Intermediate Values in Neural Math Problem Solvers

Authors: Yuta Matsumoto, Benjamin Heinzerling, Masashi Yoshikawa, Kentaro Inui

Abstract: How language models process complex input that requires multiple steps of inference is not well understood. Previous research has shown that information about intermediate values of these inputs can be extracted from the activations of the models, but it is unclear where that information is encoded and whether that information is indeed used during inference. We introduce a method for analyzing ho… ▽ More How language models process complex input that requires multiple steps of inference is not well understood. Previous research has shown that information about intermediate values of these inputs can be extracted from the activations of the models, but it is unclear where that information is encoded and whether that information is indeed used during inference. We introduce a method for analyzing how a Transformer model processes these inputs by focusing on simple arithmetic problems and their intermediate values. To trace where information about intermediate values is encoded, we measure the correlation between intermediate values and the activations of the model using principal component analysis (PCA). Then, we perform a causal intervention by manipulating model weights. This intervention shows that the weights identified via tracing are not merely correlated with intermediate values, but causally related to model predictions. Our findings show that the model has a locality to certain intermediate values, and this is useful for enhancing the interpretability of the models. △ Less

Submitted 17 January, 2023; originally announced January 2023.

Comments: 5 pages, 4 figures, MathNLP

arXiv:2212.10688 [pdf, other]

Local Differential Privacy Image Generation Using Flow-based Deep Generative Models

Authors: Hisaichi Shibata, Shouhei Hanaoka, Yang Cao, Masatoshi Yoshikawa, Tomomi Takenaga, Yukihiro Nomura, Naoto Hayashi, Osamu Abe

Abstract: Diagnostic radiologists need artificial intelligence (AI) for medical imaging, but access to medical images required for training in AI has become increasingly restrictive. To release and use medical images, we need an algorithm that can simultaneously protect privacy and preserve pathologies in medical images. To develop such an algorithm, here, we propose DP-GLOW, a hybrid of a local differentia… ▽ More Diagnostic radiologists need artificial intelligence (AI) for medical imaging, but access to medical images required for training in AI has become increasingly restrictive. To release and use medical images, we need an algorithm that can simultaneously protect privacy and preserve pathologies in medical images. To develop such an algorithm, here, we propose DP-GLOW, a hybrid of a local differential privacy (LDP) algorithm and one of the flow-based deep generative models (GLOW). By applying a GLOW model, we disentangle the pixelwise correlation of images, which makes it difficult to protect privacy with straightforward LDP algorithms for images. Specifically, we map images onto the latent vector of the GLOW model, each element of which follows an independent normal distribution, and we apply the Laplace mechanism to the latent vector. Moreover, we applied DP-GLOW to chest X-ray images to generate LDP images while preserving pathologies. △ Less

Submitted 20 December, 2022; originally announced December 2022.

arXiv:2209.04856 [pdf, other]

doi 10.14778/3587136.3587141

Secure Shapley Value for Cross-Silo Federated Learning (Technical Report)

Authors: Shuyuan Zheng, Yang Cao, Masatoshi Yoshikawa

Abstract: The Shapley value (SV) is a fair and principled metric for contribution evaluation in cross-silo federated learning (cross-silo FL), wherein organizations, i.e., clients, collaboratively train prediction models with the coordination of a parameter server. However, existing SV calculation methods for FL assume that the server can access the raw FL models and public test data. This may not be a vali… ▽ More The Shapley value (SV) is a fair and principled metric for contribution evaluation in cross-silo federated learning (cross-silo FL), wherein organizations, i.e., clients, collaboratively train prediction models with the coordination of a parameter server. However, existing SV calculation methods for FL assume that the server can access the raw FL models and public test data. This may not be a valid assumption in practice considering the emerging privacy attacks on FL models and the fact that test data might be clients' private assets. Hence, we investigate the problem of secure SV calculation for cross-silo FL. We first propose HESV, a one-server solution based solely on homomorphic encryption (HE) for privacy protection, which has limitations in efficiency. To overcome these limitations, we propose SecSV, an efficient two-server protocol with the following novel features. First, SecSV utilizes a hybrid privacy protection scheme to avoid ciphertext--ciphertext multiplications between test data and models, which are extremely expensive under HE. Second, an efficient secure matrix multiplication method is proposed for SecSV. Third, SecSV strategically identifies and skips some test samples without significantly affecting the evaluation accuracy. Our experiments demonstrate that SecSV is 7.2-36.6 times as fast as HESV, with a limited loss in the accuracy of calculated SVs. △ Less

Submitted 25 December, 2024; v1 submitted 11 September, 2022; originally announced September 2022.

Comments: Technical report for our VLDB 2023 paper (https://www.vldb.org/pvldb/vol16/p1657-zheng.pdf)

Journal ref: Proceedings of the VLDB Endowment, 16(7): 1657-1670, 2023

arXiv:2209.02231 [pdf, other]

doi 10.1109/BigData55660.2022.10020435

A Crypto-Assisted Approach for Publishing Graph Statistics with Node Local Differential Privacy

Authors: Shang Liu, Yang Cao, Takao Murakami, Masatoshi Yoshikawa

Abstract: Publishing graph statistics under node differential privacy has attracted much attention since it provides a stronger privacy guarantee than edge differential privacy. Existing works related to node differential privacy assume a trusted data curator who holds the whole graph. However, in many applications, a trusted curator is usually not available due to privacy and security issues. In this paper… ▽ More Publishing graph statistics under node differential privacy has attracted much attention since it provides a stronger privacy guarantee than edge differential privacy. Existing works related to node differential privacy assume a trusted data curator who holds the whole graph. However, in many applications, a trusted curator is usually not available due to privacy and security issues. In this paper, for the first time, we investigate the problem of publishing the graph degree distribution under Node Local Differential privacy (Node-LDP), which does not rely on a trusted server. We propose an algorithm to publish the degree distribution with Node-LDP by exploring how to select the optimal graph projection parameter and how to execute the local graph projection. Specifically, we propose a Crypto-assisted local projection method that combines LDP and cryptographic primitives, achieving higher accuracy than our baseline PureLDP local projection method. On the other hand, we improve our baseline Node-level parameter selection by proposing an Edge-level parameter selection that preserves more neighboring information and provides better utility. Finally, extensive experiments on real-world graphs show that Edge-level local projection provides higher accuracy than Node-level local projection, and Crypto-assisted parameter selection owns the better utility than PureLDP parameter selection, improving by up to 79.8% and 57.2% respectively. △ Less

Submitted 15 April, 2023; v1 submitted 6 September, 2022; originally announced September 2022.

Comments: Accepted by IEEE BigData 2022, https://ieeexplore.ieee.org/document/10020435

arXiv:2204.13032 [pdf, other]

BiTimeBERT: Extending Pre-Trained Language Representations with Bi-Temporal Information

Authors: Jiexin Wang, Adam Jatowt, Masatoshi Yoshikawa, Yi Cai

Abstract: Time is an important aspect of documents and is used in a range of NLP and IR tasks. In this work, we investigate methods for incorporating temporal information during pre-training to further improve the performance on time-related tasks. Compared with common pre-trained language models like BERT which utilize synchronic document collections (e.g., BookCorpus and Wikipedia) as the training corpora… ▽ More Time is an important aspect of documents and is used in a range of NLP and IR tasks. In this work, we investigate methods for incorporating temporal information during pre-training to further improve the performance on time-related tasks. Compared with common pre-trained language models like BERT which utilize synchronic document collections (e.g., BookCorpus and Wikipedia) as the training corpora, we use long-span temporal news article collection for building word representations. We introduce BiTimeBERT, a novel language representation model trained on a temporal collection of news articles via two new pre-training tasks, which harnesses two distinct temporal signals to construct time-aware language representations. The experimental results show that BiTimeBERT consistently outperforms BERT and other existing pre-trained models with substantial gains on different downstream NLP tasks and applications for which time is of importance (e.g., the accuracy improvement over BERT is 155\% on the event time estimation task). △ Less

Submitted 27 April, 2023; v1 submitted 27 April, 2022; originally announced April 2022.

arXiv:2204.03919 [pdf, other]

doi 10.1145/3514221.3526162

Network Shuffling: Privacy Amplification via Random Walks

Authors: Seng Pei Liew, Tsubasa Takahashi, Shun Takagi, Fumiyuki Kato, Yang Cao, Masatoshi Yoshikawa

Abstract: Recently, it is shown that shuffling can amplify the central differential privacy guarantees of data randomized with local differential privacy. Within this setup, a centralized, trusted shuffler is responsible for shuffling by keeping the identities of data anonymous, which subsequently leads to stronger privacy guarantees for systems. However, introducing a centralized entity to the originally l… ▽ More Recently, it is shown that shuffling can amplify the central differential privacy guarantees of data randomized with local differential privacy. Within this setup, a centralized, trusted shuffler is responsible for shuffling by keeping the identities of data anonymous, which subsequently leads to stronger privacy guarantees for systems. However, introducing a centralized entity to the originally local privacy model loses some appeals of not having any centralized entity as in local differential privacy. Moreover, implementing a shuffler in a reliable way is not trivial due to known security issues and/or requirements of advanced hardware or secure computation technology. Motivated by these practical considerations, we rethink the shuffle model to relax the assumption of requiring a centralized, trusted shuffler. We introduce network shuffling, a decentralized mechanism where users exchange data in a random-walk fashion on a network/graph, as an alternative of achieving privacy amplification via anonymity. We analyze the threat model under such a setting, and propose distributed protocols of network shuffling that is straightforward to implement in practice. Furthermore, we show that the privacy amplification rate is similar to other privacy amplification techniques such as uniform shuffling. To our best knowledge, among the recently studied intermediate trust models that leverage privacy amplification techniques, our work is the first that is not relying on any centralized entity to achieve privacy amplification. △ Less

Submitted 8 April, 2022; originally announced April 2022.

Comments: 15 pages, 9 figures; SIGMOD 2022 version

arXiv:2203.06791 [pdf, other]

doi 10.14778/3538598.3538601

HDPView: Differentially Private Materialized View for Exploring High Dimensional Relational Data

Authors: Fumiyuki Kato, Tsubasa Takahashi, Shun Takagi, Yang Cao, Seng Pei Liew, Masatoshi Yoshikawa

Abstract: How can we explore the unknown properties of high-dimensional sensitive relational data while preserving privacy? We study how to construct an explorable privacy-preserving materialized view under differential privacy. No existing state-of-the-art methods simultaneously satisfy the following essential properties in data exploration: workload independence, analytical reliability (i.e., providing er… ▽ More How can we explore the unknown properties of high-dimensional sensitive relational data while preserving privacy? We study how to construct an explorable privacy-preserving materialized view under differential privacy. No existing state-of-the-art methods simultaneously satisfy the following essential properties in data exploration: workload independence, analytical reliability (i.e., providing error bound for each search query), applicability to high-dimensional data, and space efficiency. To solve the above issues, we propose HDPView, which creates a differentially private materialized view by well-designed recursive bisected partitioning on an original data cube, i.e., count tensor. Our method searches for block partitioning to minimize the error for the counting query, in addition to randomizing the convergence, by choosing the effective cutting points in a differentially private way, resulting in a less noisy and compact view. Furthermore, we ensure formal privacy guarantee and analytical reliability by providing the error bound for arbitrary counting queries on the materialized views. HDPView has the following desirable properties: (a) Workload independence, (b) Analytical reliability, (c) Noise resistance on high-dimensional data, (d) Space efficiency. To demonstrate the above properties and the suitability for data exploration, we conduct extensive experiments with eight types of range counting queries on eight real datasets. HDPView outperforms the state-of-the-art methods in these evaluations. △ Less

Submitted 26 May, 2022; v1 submitted 13 March, 2022; originally announced March 2022.

Comments: accepted at VLDB 2022

arXiv:2202.07165 [pdf, other]

OLIVE: Oblivious Federated Learning on Trusted Execution Environment against the risk of sparsification

Authors: Fumiyuki Kato, Yang Cao, Masatoshi Yoshikawa

Abstract: Combining Federated Learning (FL) with a Trusted Execution Environment (TEE) is a promising approach for realizing privacy-preserving FL, which has garnered significant academic attention in recent years. Implementing the TEE on the server side enables each round of FL to proceed without exposing the client's gradient information to untrusted servers. This addresses usability gaps in existing secu… ▽ More Combining Federated Learning (FL) with a Trusted Execution Environment (TEE) is a promising approach for realizing privacy-preserving FL, which has garnered significant academic attention in recent years. Implementing the TEE on the server side enables each round of FL to proceed without exposing the client's gradient information to untrusted servers. This addresses usability gaps in existing secure aggregation schemes as well as utility gaps in differentially private FL. However, to address the issue using a TEE, the vulnerabilities of server-side TEEs need to be considered -- this has not been sufficiently investigated in the context of FL. The main technical contribution of this study is the analysis of the vulnerabilities of TEE in FL and the defense. First, we theoretically analyze the leakage of memory access patterns, revealing the risk of sparsified gradients, which are commonly used in FL to enhance communication efficiency and model accuracy. Second, we devise an inference attack to link memory access patterns to sensitive information in the training dataset. Finally, we propose an oblivious yet efficient aggregation algorithm to prevent memory access pattern leakage. Our experiments on real-world data demonstrate that the proposed method functions efficiently in practical scales. △ Less

Submitted 19 June, 2023; v1 submitted 14 February, 2022; originally announced February 2022.

Comments: This paper is the full version of a paper accepted at VLDB 2023

arXiv:2109.13497 [pdf, other]

Instance-Based Neural Dependency Parsing

Authors: Hiroki Ouchi, Jun Suzuki, Sosuke Kobayashi, Sho Yokoi, Tatsuki Kuribayashi, Masashi Yoshikawa, Kentaro Inui

Abstract: Interpretable rationales for model predictions are crucial in practical applications. We develop neural models that possess an interpretable inference process for dependency parsing. Our models adopt instance-based inference, where dependency edges are extracted and labeled by comparing them to edges in a training set. The training edges are explicitly used for the predictions; thus, it is easy to… ▽ More Interpretable rationales for model predictions are crucial in practical applications. We develop neural models that possess an interpretable inference process for dependency parsing. Our models adopt instance-based inference, where dependency edges are extracted and labeled by comparing them to edges in a training set. The training edges are explicitly used for the predictions; thus, it is easy to grasp the contribution of each edge to the predictions. Our experiments show that our instance-based models achieve competitive accuracy with standard neural models and have the reasonable plausibility of instance-based explanations. △ Less

Submitted 28 September, 2021; originally announced September 2021.

Comments: 15 pages, accepted to TACL 2021

arXiv:2109.03438 [pdf, other]

ArchivalQA: A Large-scale Benchmark Dataset for Open Domain Question Answering over Historical News Collections

Authors: Jiexin Wang, Adam Jatowt, Masatoshi Yoshikawa

Abstract: In the last few years, open-domain question answering (ODQA) has advanced rapidly due to the development of deep learning techniques and the availability of large-scale QA datasets. However, the current datasets are essentially designed for synchronic document collections (e.g., Wikipedia). Temporal news collections such as long-term news archives spanning several decades, are rarely used in train… ▽ More In the last few years, open-domain question answering (ODQA) has advanced rapidly due to the development of deep learning techniques and the availability of large-scale QA datasets. However, the current datasets are essentially designed for synchronic document collections (e.g., Wikipedia). Temporal news collections such as long-term news archives spanning several decades, are rarely used in training the models despite they are quite valuable for our society. To foster the research in the field of ODQA on such historical collections, we present ArchivalQA, a large question answering dataset consisting of 532,444 question-answer pairs which is designed for temporal news QA. We divide our dataset into four subparts based on the question difficulty levels and the containment of temporal expressions, which we believe are useful for training and testing ODQA systems characterized by different strengths and abilities. The novel QA dataset-constructing framework that we introduce can be also applied to generate non-ambiguous questions of good quality over other types of temporal document collections. △ Less

Submitted 21 February, 2022; v1 submitted 8 September, 2021; originally announced September 2021.

arXiv:2106.07033 [pdf, other]

Understanding the Interplay between Privacy and Robustness in Federated Learning

Authors: Yaowei Han, Yang Cao, Masatoshi Yoshikawa

Abstract: Federated Learning (FL) is emerging as a promising paradigm of privacy-preserving machine learning, which trains an algorithm across multiple clients without exchanging their data samples. Recent works highlighted several privacy and robustness weaknesses in FL and addressed these concerns using local differential privacy (LDP) and some well-studied methods used in conventional ML, separately. How… ▽ More Federated Learning (FL) is emerging as a promising paradigm of privacy-preserving machine learning, which trains an algorithm across multiple clients without exchanging their data samples. Recent works highlighted several privacy and robustness weaknesses in FL and addressed these concerns using local differential privacy (LDP) and some well-studied methods used in conventional ML, separately. However, it is still not clear how LDP affects adversarial robustness in FL. To fill this gap, this work attempts to develop a comprehensive understanding of the effects of LDP on adversarial robustness in FL. Clarifying the interplay is significant since this is the first step towards a principled design of private and robust FL systems. We certify that local differential privacy has both positive and negative effects on adversarial robustness using theoretical analysis and empirical verification. △ Less

Submitted 13 June, 2021; originally announced June 2021.

arXiv:2106.04384 [pdf, other]

doi 10.1109/BigData55660.2022.10020232

FL-Market: Trading Private Models in Federated Learning

Authors: Shuyuan Zheng, Yang Cao, Masatoshi Yoshikawa, Huizhong Li, Qiang Yan

Abstract: The difficulty in acquiring a sufficient amount of training data is a major bottleneck for machine learning (ML) based data analytics. Recently, commoditizing ML models has been proposed as an economical and moderate solution to ML-oriented data acquisition. However, existing model marketplaces assume that the broker can access data owners' private training data, which may not be realistic in prac… ▽ More The difficulty in acquiring a sufficient amount of training data is a major bottleneck for machine learning (ML) based data analytics. Recently, commoditizing ML models has been proposed as an economical and moderate solution to ML-oriented data acquisition. However, existing model marketplaces assume that the broker can access data owners' private training data, which may not be realistic in practice. In this paper, to promote trustworthy data acquisition for ML tasks, we propose FL-Market, a locally private model marketplace that protects privacy not only against model buyers but also against the untrusted broker. FL-Market decouples ML from the need to centrally gather training data on the broker's side using federated learning, an emerging privacy-preserving ML paradigm in which data owners collaboratively train an ML model by uploading local gradients (to be aggregated into a global gradient for model updating). Then, FL-Market enables data owners to locally perturb their gradients by local differential privacy and thus further prevents privacy risks. To drive FL-Market, we propose a deep learning-empowered auction mechanism for intelligently deciding the local gradients' perturbation levels and an optimal aggregation mechanism for aggregating the perturbed gradients. Our auction and aggregation mechanisms can jointly maximize the global gradient's accuracy, which optimizes model buyers' utility. Our experiments verify the effectiveness of the proposed mechanisms. △ Less

Submitted 3 April, 2023; v1 submitted 8 June, 2021; originally announced June 2021.

Comments: Extended report for our IEEE BigData 2022 paper

Journal ref: Proceedings of the 2022 IEEE International Conference on Big Data, 1525-1534. (https://ieeexplore.ieee.org/document/10020232)

arXiv:2105.01651 [pdf, other]

Pricing Private Data with Personalized Differential Privacy and Partial Arbitrage Freeness

Authors: Shuyuan Zheng, Yang Cao, Masatoshi Yoshikawa

Abstract: There is a growing trend regarding perceiving personal data as a commodity. Existing studies have built frameworks and theories about how to determine an arbitrage-free price of a given query according to the privacy loss quantified by differential privacy. However, those studies have assumed that data buyers can purchase query answers with the arbitrary privacy loss of data owners, which may not… ▽ More There is a growing trend regarding perceiving personal data as a commodity. Existing studies have built frameworks and theories about how to determine an arbitrage-free price of a given query according to the privacy loss quantified by differential privacy. However, those studies have assumed that data buyers can purchase query answers with the arbitrary privacy loss of data owners, which may not be valid under strict privacy regulations and data owners' increasing privacy concerns. In this paper, we study how to empower data owners to control privacy loss in data trading. First, we propose a framework for trading personal data that enables data owners to bound their personalized privacy losses. Second, since bounded privacy losses indicate bounded utilities of query answers, we propose a reasonable relaxation of arbitrage freeness named partial arbitrage freeness, i.e., the guarantee of arbitrage-free pricing only for a limited range of utilities, which provides more possibilities for our market design. Third, to avoid arbitrage, we propose a general method for ensuring arbitrage freeness under personalized differential privacy. Fourth, to fully utilize data owners' personalized privacy loss bounds, we propose privacy budget allocation techniques to allocate privacy losses for queries under arbitrage freeness. Finally, we conduct experiments to verify the effectiveness of our proposed trading protocols. △ Less

Submitted 23 November, 2021; v1 submitted 4 May, 2021; originally announced May 2021.

arXiv:2104.06569 [pdf, other]

Preventing Manipulation Attack in Local Differential Privacy using Verifiable Randomization Mechanism

Authors: Fumiyuki Kato, Yang Cao, Masatoshi Yoshikawa

Abstract: Several randomization mechanisms for local differential privacy (LDP) (e.g., randomized response) are well-studied to improve the utility. However, recent studies show that LDP is generally vulnerable to malicious data providers in nature. Because a data collector has to estimate background data distribution only from already randomized data, malicious data providers can manipulate their output be… ▽ More Several randomization mechanisms for local differential privacy (LDP) (e.g., randomized response) are well-studied to improve the utility. However, recent studies show that LDP is generally vulnerable to malicious data providers in nature. Because a data collector has to estimate background data distribution only from already randomized data, malicious data providers can manipulate their output before sending, i.e., randomization would provide them plausible deniability. Attackers can skew the estimations effectively since they are calculated by normalizing with randomization probability defined in the LDP protocol, and can even control the estimations. In this paper, we show how we prevent malicious attackers from compromising LDP protocol. Our approach is to utilize a verifiable randomization mechanism. The data collector can verify the completeness of executing an agreed randomization mechanism for every data provider. Our proposed method completely protects the LDP protocol from output-manipulations, and significantly mitigates the expected damage from attacks. We do not assume any specific attacks, and it works effectively against general output-manipulation, and thus is more powerful than previously proposed countermeasures. We describe the secure version of three state-of-the-art LDP protocols and empirically show they cause acceptable overheads according to several parameters. △ Less

Submitted 9 June, 2021; v1 submitted 13 April, 2021; originally announced April 2021.

Comments: accepted by DBSec 2021

arXiv:2103.00996 [pdf, other]

Asymmetric Differential Privacy

Authors: Shun Takagi, Yang Cao, Masatoshi Yoshikawa

Abstract: Differential privacy (DP) is getting attention as a privacy definition when publishing statistics of a dataset. This paper focuses on the limitation that DP inevitably causes two-sided error, which is not desirable for epidemic analysis such as how many COVID-19 infected individuals visited location A. For example, consider publishing misinformation that many infected people did not visit location… ▽ More Differential privacy (DP) is getting attention as a privacy definition when publishing statistics of a dataset. This paper focuses on the limitation that DP inevitably causes two-sided error, which is not desirable for epidemic analysis such as how many COVID-19 infected individuals visited location A. For example, consider publishing misinformation that many infected people did not visit location A, which may lead to miss decision-making that expands the epidemic. To fix this issue, we propose a relaxation of DP, called asymmetric differential privacy (ADP). We show that ADP can provide reasonable privacy protection while achieving one-sided error. Finally, we conduct experiments to evaluate the utility of proposed mechanisms for epidemic analysis using a real-world dataset, which shows the practicality of our mechanisms. △ Less

Submitted 5 September, 2022; v1 submitted 1 March, 2021; originally announced March 2021.

arXiv:2012.13061 [pdf, other]

Quantifying the Privacy-Utility Trade-offs in COVID-19 Contact Tracing Apps

Authors: Patrick Ocheja, Yang Cao, Shiyao Ding, Masatoshi Yoshikawa

Abstract: How to contain the spread of the COVID-19 virus is a major concern for most countries. As the situation continues to change, various countries are making efforts to reopen their economies by lifting some restrictions and enforcing new measures to prevent the spread. In this work, we review some approaches that have been adopted to contain the COVID-19 virus such as contact tracing, clusters identi… ▽ More How to contain the spread of the COVID-19 virus is a major concern for most countries. As the situation continues to change, various countries are making efforts to reopen their economies by lifting some restrictions and enforcing new measures to prevent the spread. In this work, we review some approaches that have been adopted to contain the COVID-19 virus such as contact tracing, clusters identification, movement restrictions, and status validation. Specifically, we classify available techniques based on some characteristics such as technology, architecture, trade-offs (privacy vs utility), and the phase of adoption. We present a novel approach for evaluating privacy using both qualitative and quantitative measures of privacy-utility assessment of contact tracing applications. In this new method, we classify utility at three (3) distinct levels: no privacy, 100% privacy, and at k where k is set by the system providing the utility or privacy. △ Less

Submitted 23 December, 2020; originally announced December 2020.

Comments: 12 pages, 11 figures, 4 tables

MSC Class: 68P27 ACM Class: H.3.4

arXiv:2012.03782 [pdf, other]

PCT-TEE: Trajectory-based Private Contact Tracing System with Trusted Execution Environment

Authors: Fumiyuki Kato, Yang Cao, Masatoshi Yoshikawa

Abstract: Existing Bluetooth-based Private Contact Tracing (PCT) systems can privately detect whether people have come into direct contact with COVID-19 patients. However, we find that the existing systems lack functionality and flexibility, which may hurt the success of the contact tracing. Specifically, they cannot detect indirect contact (e.g., people may be exposed to coronavirus because of used the sam… ▽ More Existing Bluetooth-based Private Contact Tracing (PCT) systems can privately detect whether people have come into direct contact with COVID-19 patients. However, we find that the existing systems lack functionality and flexibility, which may hurt the success of the contact tracing. Specifically, they cannot detect indirect contact (e.g., people may be exposed to coronavirus because of used the same elevator even without direct contact); they also cannot flexibly change the rules of "risky contact", such as how many hours of exposure or how close to a COVID-19 patient that is considered as risk exposure, which may be changed with the environmental situation. In this paper, we propose an efficient and secure contact tracing system that enables both direct contact and indirect contact. To address the above problems, we need to utilize users' trajectory data for private contact tracing, which we call trajectory-based PCT. We formalize this problem as Spatiotemporal Private Set Intersection. By analyzing different approaches such as homomorphic encryption that could be extended to solve this problem, we identify that Trusted Execution Environment (TEE) is a proposing method to achieve our requirements. The major challenge is how to design algorithms for spatiotemporal private set intersection under limited secure memory of TEE. To this end, we design a TEE-based system with flexible trajectory data encoding algorithms. Our experiments on real-world data show that the proposed system can process thousands of queries on tens of million records of trajectory data in a few seconds. △ Less

Submitted 31 December, 2021; v1 submitted 7 December, 2020; originally announced December 2020.

Comments: Accepted by ACM TSAS

arXiv:2010.13449 [pdf, other]

Geo-Graph-Indistinguishability: Location Privacy on Road Networks Based on Differential Privacy

Authors: Shun Takagi, Yang Cao, Yasuhito Asano, Masatoshi Yoshikawa

Abstract: In recent years, concerns about location privacy are increasing with the spread of location-based services (LBSs). Many methods to protect location privacy have been proposed in the past decades. Especially, perturbation methods based on Geo-Indistinguishability (Geo-I), which randomly perturb a true location to a pseudolocation, are getting attention due to its strong privacy guarantee inherited… ▽ More In recent years, concerns about location privacy are increasing with the spread of location-based services (LBSs). Many methods to protect location privacy have been proposed in the past decades. Especially, perturbation methods based on Geo-Indistinguishability (Geo-I), which randomly perturb a true location to a pseudolocation, are getting attention due to its strong privacy guarantee inherited from differential privacy. However, Geo-I is based on the Euclidean plane even though many LBSs are based on road networks (e.g. ride-sharing services). This causes unnecessary noise and thus an insufficient tradeoff between utility and privacy for LBSs on road networks. To address this issue, we propose a new privacy notion, Geo-Graph-Indistinguishability (GG-I), for locations on a road network to achieve a better tradeoff. We propose Graph-Exponential Mechanism (GEM), which satisfies GG-I. Moreover, we formalize the optimization problem to find the optimal GEM in terms of the tradeoff. However, the computational complexity of a naive method to find the optimal solution is prohibitive, so we propose a greedy algorithm to find an approximate solution in an acceptable amount of time. Finally, our experiments show that our proposed mechanism outperforms a Geo-I's mechanism with respect to the tradeoff. △ Less

Submitted 26 October, 2020; originally announced October 2020.

arXiv:2010.13381 [pdf, other]

Secure and Efficient Trajectory-Based Contact Tracing using Trusted Hardware

Authors: Fumiyuki Kato, Yang Cao, Masatoshi Yoshikawa

Abstract: The COVID-19 pandemic has prompted technological measures to control the spread of the disease. Private contact tracing (PCT) is one of the promising techniques for the purpose. However, the recently proposed Bluetooth-based PCT has several limitations in terms of functionality and flexibility. The existing systems are only able to detect direct contact (i.e., human-human contact), but cannot dete… ▽ More The COVID-19 pandemic has prompted technological measures to control the spread of the disease. Private contact tracing (PCT) is one of the promising techniques for the purpose. However, the recently proposed Bluetooth-based PCT has several limitations in terms of functionality and flexibility. The existing systems are only able to detect direct contact (i.e., human-human contact), but cannot detect indirect contact (i.e., human-object, such as the disease transmission through surface). Moreover, the rule of risky contact cannot be flexibly changed with the environmental situation and the nature of the virus. In this paper, we propose a secure and efficient trajectory-based PCT system using trusted hardware. We formalize trajectory-based PCT as a generalization of the well-studied Private Set Intersection (PSI), which is mostly based on cryptographic primitives and thus insufficient. We solve the problem by leveraging trusted hardware such as Intel SGX and designing a novel algorithm to achieve a secure, efficient and flexible PCT system. Our experiments on real-world data show that the proposed system can achieve high performance and scalability. Specifically, our system (one single machine with Intel SGX) can process thousands of queries on 100 million records of trajectory data in a few seconds. △ Less

Submitted 4 November, 2020; v1 submitted 26 October, 2020; originally announced October 2020.

Comments: Accepted by 7th International Workshop on Privacy and Security of Big Data (PSBD 2020) in conjunction with 2020 IEEE International Conference on Big Data (IEEE BigData 2020)

arXiv:2009.08063 [pdf, other]

FLAME: Differentially Private Federated Learning in the Shuffle Model

Authors: Ruixuan Liu, Yang Cao, Hong Chen, Ruoyang Guo, Masatoshi Yoshikawa

Abstract: Federated Learning (FL) is a promising machine learning paradigm that enables the analyzer to train a model without collecting users' raw data. To ensure users' privacy, differentially private federated learning has been intensively studied. The existing works are mainly based on the \textit{curator model} or \textit{local model} of differential privacy. However, both of them have pros and cons. T… ▽ More Federated Learning (FL) is a promising machine learning paradigm that enables the analyzer to train a model without collecting users' raw data. To ensure users' privacy, differentially private federated learning has been intensively studied. The existing works are mainly based on the \textit{curator model} or \textit{local model} of differential privacy. However, both of them have pros and cons. The curator model allows greater accuracy but requires a trusted analyzer. In the local model where users randomize local data before sending them to the analyzer, a trusted analyzer is not required but the accuracy is limited. In this work, by leveraging the \textit{privacy amplification} effect in the recently proposed shuffle model of differential privacy, we achieve the best of two worlds, i.e., accuracy in the curator model and strong privacy without relying on any trusted party. We first propose an FL framework in the shuffle model and a simple protocol (SS-Simple) extended from existing work. We find that SS-Simple only provides an insufficient privacy amplification effect in FL since the dimension of the model parameter is quite large. To solve this challenge, we propose an enhanced protocol (SS-Double) to increase the privacy amplification effect by subsampling. Furthermore, for boosting the utility when the model size is greater than the user population, we propose an advanced protocol (SS-Topk) with gradient sparsification techniques. We also provide theoretical analysis and numerical evaluations of the privacy amplification of the proposed protocols. Experiments on real-world dataset validate that SS-Topk improves the testing accuracy by 60.7\% than the local model based FL. △ Less

Submitted 20 March, 2021; v1 submitted 17 September, 2020; originally announced September 2020.

Comments: accepted by AAAI-21

arXiv:2006.12101 [pdf, other]

P3GM: Private High-Dimensional Data Release via Privacy Preserving Phased Generative Model

Authors: Shun Takagi, Tsubasa Takahashi, Yang Cao, Masatoshi Yoshikawa

Abstract: How can we release a massive volume of sensitive data while mitigating privacy risks? Privacy-preserving data synthesis enables the data holder to outsource analytical tasks to an untrusted third party. The state-of-the-art approach for this problem is to build a generative model under differential privacy, which offers a rigorous privacy guarantee. However, the existing method cannot adequately h… ▽ More How can we release a massive volume of sensitive data while mitigating privacy risks? Privacy-preserving data synthesis enables the data holder to outsource analytical tasks to an untrusted third party. The state-of-the-art approach for this problem is to build a generative model under differential privacy, which offers a rigorous privacy guarantee. However, the existing method cannot adequately handle high dimensional data. In particular, when the input dataset contains a large number of features, the existing techniques require injecting a prohibitive amount of noise to satisfy differential privacy, which results in the outsourced data analysis meaningless. To address the above issue, this paper proposes privacy-preserving phased generative model (P3GM), which is a differentially private generative model for releasing such sensitive data. P3GM employs the two-phase learning process to make it robust against the noise, and to increase learning efficiency (e.g., easy to converge). We give theoretical analyses about the learning complexity and privacy loss in P3GM. We further experimentally evaluate our proposed method and demonstrate that P3GM significantly outperforms existing solutions. Compared with the state-of-the-art methods, our generated samples look fewer noises and closer to the original data in terms of data diversity. Besides, in several data mining tasks with synthesized data, our model outperforms the competitors in terms of accuracy. △ Less

Submitted 7 March, 2022; v1 submitted 22 June, 2020; originally announced June 2020.

Comments: The version accepted at ICDE 2021 includes wrong proof in the Wishart mechanism. The current version fixes the problem

arXiv:2005.01263 [pdf, other]

PGLP: Customizable and Rigorous Location Privacy through Policy Graph

Authors: Yang Cao, Yonghui Xiao, Shun Takagi, Li Xiong, Masatoshi Yoshikawa, Yilin Shen, Jinfei Liu, Hongxia Jin, Xiaofeng Xu

Abstract: Location privacy has been extensively studied in the literature. However, existing location privacy models are either not rigorous or not customizable, which limits the trade-off between privacy and utility in many real-world applications. To address this issue, we propose a new location privacy notion called PGLP, i.e., \textit{Policy Graph based Location Privacy}, providing a rich interface to r… ▽ More Location privacy has been extensively studied in the literature. However, existing location privacy models are either not rigorous or not customizable, which limits the trade-off between privacy and utility in many real-world applications. To address this issue, we propose a new location privacy notion called PGLP, i.e., \textit{Policy Graph based Location Privacy}, providing a rich interface to release private locations with customizable and rigorous privacy guarantee. First, we design the privacy metrics of PGLP by extending differential privacy. Specifically, we formalize a user's location privacy requirements using a \textit{location policy graph}, which is expressive and customizable. Second, we investigate how to satisfy an arbitrarily given location policy graph under adversarial knowledge. We find that a location policy graph may not always be viable and may suffer \textit{location exposure} when the attacker knows the user's mobility pattern. We propose efficient methods to detect location exposure and repair the policy graph with optimal utility. Third, we design a private location trace release framework that pipelines the detection of location exposure, policy graph repair, and private trajectory release with customizable and rigorous location privacy. Finally, we conduct experiments on real-world datasets to verify the effectiveness of the privacy-utility trade-off and the efficiency of the proposed algorithms. △ Less

Submitted 15 July, 2020; v1 submitted 4 May, 2020; originally announced May 2020.

Comments: accepted in the 25th European Symposium on Research in Computer Security (ESORICS) 2020

arXiv:2005.00186 [pdf, other]

PANDA: Policy-aware Location Privacy for Epidemic Surveillance

Authors: Yang Cao, Shun Takagi, Yonghui Xiao, Li Xiong, Masatoshi Yoshikawa

Abstract: In this demonstration, we present a privacy-preserving epidemic surveillance system. Recently, many countries that suffer from coronavirus crises attempt to access citizen's location data to eliminate the outbreak. However, it raises privacy concerns and may open the doors to more invasive forms of surveillance in the name of public health. It also brings a challenge for privacy protection techniq… ▽ More In this demonstration, we present a privacy-preserving epidemic surveillance system. Recently, many countries that suffer from coronavirus crises attempt to access citizen's location data to eliminate the outbreak. However, it raises privacy concerns and may open the doors to more invasive forms of surveillance in the name of public health. It also brings a challenge for privacy protection techniques: how can we leverage people's mobile data to help combat the pandemic without scarifying our location privacy. We demonstrate that we can have the best of the two worlds by implementing policy-based location privacy for epidemic surveillance. Specifically, we formalize the privacy policy using graphs in light of differential privacy, called policy graph. Our system has three primary functions for epidemic surveillance: location monitoring, epidemic analysis, and contact tracing. We provide an interactive tool allowing the attendees to explore and examine the usability of our system: (1) the utility of location monitor and disease transmission model estimation, (2) the procedure of contact tracing in our systems, and (3) the privacy-utility trade-offs w.r.t. different policy graphs. The attendees can find that it is possible to have the full functionality of epidemic surveillance while preserving location privacy. △ Less

Submitted 6 June, 2020; v1 submitted 30 April, 2020; originally announced May 2020.

Comments: Accepted in the 46th International Conference on Very Large Data Bases (VLDB 2020) demonstration track

arXiv:2004.07442 [pdf, other]

Voice-Indistinguishability: Protecting Voiceprint in Privacy-Preserving Speech Data Release

Authors: Yaowei Han, Sheng Li, Yang Cao, Qiang Ma, Masatoshi Yoshikawa

Abstract: With the development of smart devices, such as the Amazon Echo and Apple's HomePod, speech data have become a new dimension of big data. However, privacy and security concerns may hinder the collection and sharing of real-world speech data, which contain the speaker's identifiable information, i.e., voiceprint, which is considered a type of biometric identifier. Current studies on voiceprint priva… ▽ More With the development of smart devices, such as the Amazon Echo and Apple's HomePod, speech data have become a new dimension of big data. However, privacy and security concerns may hinder the collection and sharing of real-world speech data, which contain the speaker's identifiable information, i.e., voiceprint, which is considered a type of biometric identifier. Current studies on voiceprint privacy protection do not provide either a meaningful privacy-utility trade-off or a formal and rigorous definition of privacy. In this study, we design a novel and rigorous privacy metric for voiceprint privacy, which is referred to as voice-indistinguishability, by extending differential privacy. We also propose mechanisms and frameworks for privacy-preserving speech data release satisfying voice-indistinguishability. Experiments on public datasets verify the effectiveness and efficiency of the proposed methods. △ Less

Submitted 15 April, 2020; originally announced April 2020.

Comments: The paper has been accepted by the IEEE International Conference on Multimedia & Expo 2020(ICME 2020)

arXiv:2003.10637 [pdf, other]

FedSel: Federated SGD under Local Differential Privacy with Top-k Dimension Selection

Authors: Ruixuan Liu, Yang Cao, Masatoshi Yoshikawa, Hong Chen

Abstract: As massive data are produced from small gadgets, federated learning on mobile devices has become an emerging trend. In the federated setting, Stochastic Gradient Descent (SGD) has been widely used in federated learning for various machine learning models. To prevent privacy leakages from gradients that are calculated on users' sensitive data, local differential privacy (LDP) has been considered as… ▽ More As massive data are produced from small gadgets, federated learning on mobile devices has become an emerging trend. In the federated setting, Stochastic Gradient Descent (SGD) has been widely used in federated learning for various machine learning models. To prevent privacy leakages from gradients that are calculated on users' sensitive data, local differential privacy (LDP) has been considered as a privacy guarantee in federated SGD recently. However, the existing solutions have a dimension dependency problem: the injected noise is substantially proportional to the dimension $d$. In this work, we propose a two-stage framework FedSel for federated SGD under LDP to relieve this problem. Our key idea is that not all dimensions are equally important so that we privately select Top-k dimensions according to their contributions in each iteration of federated SGD. Specifically, we propose three private dimension selection mechanisms and adapt the gradient accumulation technique to stabilize the learning process with noisy updates. We also theoretically analyze privacy, accuracy and time complexity of FedSel, which outperforms the state-of-the-art solutions. Experiments on real-world and synthetic datasets verify the effectiveness and efficiency of our framework. △ Less

Submitted 23 March, 2020; originally announced March 2020.

Comments: 18 pages, to be published in DASFAA 2020

arXiv:1910.11040 [pdf, ps, other]

Toward a view-based data cleaning architecture

Authors: Toshiyuki Shimizu, Hiroki Omori, Masatoshi Yoshikawa

Abstract: Big data analysis has become an active area of study with the growth of machine learning techniques. To properly analyze data, it is important to maintain high-quality data. Thus, research on data cleaning is also important. It is difficult to automatically detect and correct inconsistent values for data requiring expert knowledge or data created by many contributors, such as integrated data from… ▽ More Big data analysis has become an active area of study with the growth of machine learning techniques. To properly analyze data, it is important to maintain high-quality data. Thus, research on data cleaning is also important. It is difficult to automatically detect and correct inconsistent values for data requiring expert knowledge or data created by many contributors, such as integrated data from heterogeneous data sources. An example of such data is metadata for scientific datasets, which should be confirmed by data managers while handling the data. To support the efficient cleaning of data by data managers, we propose a data cleaning architecture in which data managers interactively browse and correct portions of data through views. In this paper, we explain our view-based data cleaning architecture and discuss some remaining issues. △ Less

Submitted 24 October, 2019; originally announced October 2019.

Comments: Proceedings of the Third Workshop on Software Foundations for Data Interoperability (SFDI2019+), October 28, 2019, Fukuoka, Japan

arXiv:1907.10814 [pdf, other]

doi 10.1109/TKDE.2019.2963312

Protecting Spatiotemporal Event Privacy in Continuous Location-Based Services

Authors: Yang Cao, Yonghui Xiao, Li Xiong, Liquan Bai, Masatoshi Yoshikawa

Abstract: Location privacy-preserving mechanisms (LPPMs) have been extensively studied for protecting users' location privacy by releasing a perturbed location to third parties such as location-based service providers. However, when a user's perturbed locations are released continuously, existing LPPMs may not protect the sensitive information about the user's spatiotemporal activities, such as "visited hos… ▽ More Location privacy-preserving mechanisms (LPPMs) have been extensively studied for protecting users' location privacy by releasing a perturbed location to third parties such as location-based service providers. However, when a user's perturbed locations are released continuously, existing LPPMs may not protect the sensitive information about the user's spatiotemporal activities, such as "visited hospital in the last week" or "regularly commuting between Address 1 and Address 2" (it is easy to infer that Addresses 1 and 2 may be home and office), which we call it \textit{spatiotemporal event}. In this paper, we first formally define {spatiotemporal event} as Boolean expressions between location and time predicates, and then we define $ ε$-\textit{spatiotemporal event privacy} by extending the notion of differential privacy. Second, to understand how much spatiotemporal event privacy that existing LPPMs can provide, we design computationally efficient algorithms to quantify the privacy leakage of state-of-the-art LPPMs when an adversary has prior knowledge of the user's initial probability over possible locations. It turns out that the existing LPPMs cannot adequately protect spatiotemporal event privacy. Third, we propose a framework, PriSTE, to transform an existing LPPM into one protecting spatiotemporal event privacy against adversaries with \textit{any} prior knowledge. Our experiments on real-life and synthetic data verified that the proposed method is effective and efficient. △ Less

Submitted 16 May, 2020; v1 submitted 24 July, 2019; originally announced July 2019.

Comments: accepted in TKDE. arXiv admin note: substantial text overlap with arXiv:1810.09152

Journal ref: IEEE Transactions on Knowledge and Data Engineering (TKDE) 2020

arXiv:1906.05457 [pdf, other]

Trading Location Data with Bounded Personalized Privacy Loss

Authors: Shuyuan Zheng, Yang Cao, Masatoshi Yoshikawa

Abstract: As personal data have been the new oil of the digital era, there is a growing trend perceiving personal data as a commodity. Although some people are willing to trade their personal data for money, they might still expect limited privacy loss, and the maximum tolerable privacy loss varies with each individual. In this paper, we propose a framework that enables individuals to trade their personal d… ▽ More As personal data have been the new oil of the digital era, there is a growing trend perceiving personal data as a commodity. Although some people are willing to trade their personal data for money, they might still expect limited privacy loss, and the maximum tolerable privacy loss varies with each individual. In this paper, we propose a framework that enables individuals to trade their personal data with bounded personalized privacy loss, which raises technical challenges in the aspects of budget allocation and arbitrage-freeness. To deal with those challenges,we propose two arbitrage-free trading mechanisms with different advantages. △ Less

Submitted 24 October, 2019; v1 submitted 12 June, 2019; originally announced June 2019.

Comments: Proceedings of the Third Workshop on Software Foundations for Data Interoperability (SFDI2019+), October 28, 2019, Fukuoka, Japan

arXiv:1906.03952 [pdf, other]

Multimodal Logical Inference System for Visual-Textual Entailment

Authors: Riko Suzuki, Hitomi Yanaka, Masashi Yoshikawa, Koji Mineshima, Daisuke Bekki

Abstract: A large amount of research about multimodal inference across text and vision has been recently developed to obtain visually grounded word and sentence representations. In this paper, we use logic-based representations as unified meaning representations for texts and images and present an unsupervised multimodal logical inference system that can effectively prove entailment relations between them.… ▽ More A large amount of research about multimodal inference across text and vision has been recently developed to obtain visually grounded word and sentence representations. In this paper, we use logic-based representations as unified meaning representations for texts and images and present an unsupervised multimodal logical inference system that can effectively prove entailment relations between them. We show that by combining semantic parsing and theorem proving, the system can handle semantically complex sentences for visual-textual inference. △ Less

Submitted 10 June, 2019; originally announced June 2019.

arXiv:1906.01834 [pdf, other]

Automatic Generation of High Quality CCGbanks for Parser Domain Adaptation

Authors: Masashi Yoshikawa, Hiroshi Noji, Koji Mineshima, Daisuke Bekki

Abstract: We propose a new domain adaptation method for Combinatory Categorial Grammar (CCG) parsing, based on the idea of automatic generation of CCG corpora exploiting cheaper resources of dependency trees. Our solution is conceptually simple, and not relying on a specific parser architecture, making it applicable to the current best-performing parsers. We conduct extensive parsing experiments with detail… ▽ More We propose a new domain adaptation method for Combinatory Categorial Grammar (CCG) parsing, based on the idea of automatic generation of CCG corpora exploiting cheaper resources of dependency trees. Our solution is conceptually simple, and not relying on a specific parser architecture, making it applicable to the current best-performing parsers. We conduct extensive parsing experiments with detailed discussion; on top of existing benchmark datasets on (1) biomedical texts and (2) question sentences, we create experimental datasets of (3) speech conversation and (4) math problems. When applied to the proposed method, an off-the-shelf CCG parser shows significant performance gains, improving from 90.7% to 96.6% on speech conversation, and from 88.5% to 96.8% on math problems. △ Less

Submitted 5 June, 2019; originally announced June 2019.

Comments: 11 pages, accepted as long paper to ACL 2019 Italy

arXiv:1904.10606 [pdf, other]

doi 10.1109/ICDEW.2019.00010

Blockchain-based Bidirectional Updates on Fine-grained Medical Data

Authors: Chunmiao Li, Yang Cao, Zhenjiang Hu, Masatoshi Yoshikawa

Abstract: Electronic medical data sharing between stakeholders, such as patients, doctors, and researchers, can promote more effective medical treatment collaboratively. These sensitive and private data should only be accessed by authorized users. Given a total medical data, users may care about parts of them and other unrelated information might interfere with the user interested data search and increase t… ▽ More Electronic medical data sharing between stakeholders, such as patients, doctors, and researchers, can promote more effective medical treatment collaboratively. These sensitive and private data should only be accessed by authorized users. Given a total medical data, users may care about parts of them and other unrelated information might interfere with the user interested data search and increase the risk of exposure. Besides accessing these data, users may want to update them and propagate to other sharing peers so that all peers keep identical data after each update. To satisfy these requirements, in this paper we propose a medical data sharing architecture that addresses the permission control using smart contracts on the blockchain and splits data into fined grained pieces shared with different peers then synchronize full data and these pieces with bidirectional transformations. Medical data reside on each userś local database and permission related data are stored on smart contracts. Only all peers have gained the newest shared data after updates can they start to do next operations on it, which are enforced by smart contracts. Blockchain based immutable shared ledge enables users to trace data updates history. This paper can provide a new perspective to view full medical data as different slices to be shared with various peers but consistency after updates between them are still promised, which can protect the privacy and improve data search efficiency. △ Less

Submitted 23 April, 2019; originally announced April 2019.

ACM Class: H.2; D.2; K.6.5

arXiv:1904.10578 [pdf, ps, other]

doi 10.1007/978-3-030-22479-0_9

When and where do you want to hide? Recommendation of location privacy preferences with local differential privacy

Authors: Maho Asada, Masatoshi Yoshikawa, Yang Cao

Abstract: In recent years, it has become easy to obtain location information quite precisely. However, the acquisition of such information has risks such as individual identification and leakage of sensitive information, so it is necessary to protect the privacy of location information. For this purpose, people should know their location privacy preferences, that is, whether or not he/she can release locati… ▽ More In recent years, it has become easy to obtain location information quite precisely. However, the acquisition of such information has risks such as individual identification and leakage of sensitive information, so it is necessary to protect the privacy of location information. For this purpose, people should know their location privacy preferences, that is, whether or not he/she can release location information at each place and time. However, it is not easy for each user to make such decisions and it is troublesome to set the privacy preference at each time. Therefore, we propose a method to recommend location privacy preferences for decision making. Comparing to existing method, our method can improve the accuracy of recommendation by using matrix factorization and preserve privacy strictly by local differential privacy, whereas the existing method does not achieve formal privacy guarantee. In addition, we found the best granularity of a location privacy preference, that is, how to express the information in location privacy protection. To evaluate and verify the utility of our method, we have integrated two existing datasets to create a rich information in term of user number. From the results of the evaluation using this dataset, we confirmed that our method can predict location privacy preferences accurately and that it provides a suitable method to define the location privacy preference. △ Less

Submitted 23 April, 2019; originally announced April 2019.

arXiv:1811.06203 [pdf, other]

Combining Axiom Injection and Knowledge Base Completion for Efficient Natural Language Inference

Authors: Masashi Yoshikawa, Koji Mineshima, Hiroshi Noji, Daisuke Bekki

Abstract: In logic-based approaches to reasoning tasks such as Recognizing Textual Entailment (RTE), it is important for a system to have a large amount of knowledge data. However, there is a tradeoff between adding more knowledge data for improved RTE performance and maintaining an efficient RTE system, as such a big database is problematic in terms of the memory usage and computational complexity. In this… ▽ More In logic-based approaches to reasoning tasks such as Recognizing Textual Entailment (RTE), it is important for a system to have a large amount of knowledge data. However, there is a tradeoff between adding more knowledge data for improved RTE performance and maintaining an efficient RTE system, as such a big database is problematic in terms of the memory usage and computational complexity. In this work, we show the processing time of a state-of-the-art logic-based RTE system can be significantly reduced by replacing its search-based axiom injection (abduction) mechanism by that based on Knowledge Base Completion (KBC). We integrate this mechanism in a Coq plugin that provides a proof automation tactic for natural language inference. Additionally, we show empirically that adding new knowledge data contributes to better RTE performance while not harming the processing speed in this framework. △ Less

Submitted 15 November, 2018; originally announced November 2018.

Comments: 9 pages, accepted to AAAI 2019

arXiv:1809.10357 [pdf, other]

Making View Update Strategies Programmable - Toward Controlling and Sharing Distributed Data -

Authors: Yasuhito Asano, Soichiro Hidaka, Zhenjiang Hu, Yasunori Ishihara, Hiroyuki Kato, Hsiang-Shang Ko, Keisuke Nakano, Makoto Onizuka, Yuya Sasaki, Toshiyuki Shimizu, Van-Dang Tran, Kanae Tsushima, Masatoshi Yoshikawa

Abstract: Views are known mechanisms for controlling access of data and for sharing data of different schemas. Despite long and intensive research on views in both the database community and the programming language community, we are facing difficulties to use views in practice. The main reason is that we lack ways to directly describe view update strategies to deal with the inherent ambiguity of view updat… ▽ More Views are known mechanisms for controlling access of data and for sharing data of different schemas. Despite long and intensive research on views in both the database community and the programming language community, we are facing difficulties to use views in practice. The main reason is that we lack ways to directly describe view update strategies to deal with the inherent ambiguity of view updating. This paper aims to provide a new language-based approach to controlling and sharing distributed data based on views, and establish a software foundation for systematic construction of such data management systems. Our key observation is that a view should be defined through a view update strategy rather than a view definition. We show that Datalog can be used for specifying view update strategies whose unique view definition can be automatically derived, present a novel P2P-based programmable architecture for distributed data management where updatable views are fully utilized for controlling and sharing distributed data, and demonstrate its usefulness through the development of a privacy-preserving ride-sharing alliance system. △ Less

Submitted 27 September, 2018; originally announced September 2018.

Comments: 6 pages. arXiv admin note: text overlap with arXiv:1803.06674

arXiv:1804.08473 [pdf, other]

doi 10.1145/3240508.3240587

Beyond Narrative Description: Generating Poetry from Images by Multi-Adversarial Training

Authors: Bei Liu, Jianlong Fu, Makoto P. Kato, Masatoshi Yoshikawa

Abstract: Automatic generation of natural language from images has attracted extensive attention. In this paper, we take one step further to investigate generation of poetic language (with multiple lines) to an image for automatic poetry creation. This task involves multiple challenges, including discovering poetic clues from the image (e.g., hope from green), and generating poems to satisfy both relevance… ▽ More Automatic generation of natural language from images has attracted extensive attention. In this paper, we take one step further to investigate generation of poetic language (with multiple lines) to an image for automatic poetry creation. This task involves multiple challenges, including discovering poetic clues from the image (e.g., hope from green), and generating poems to satisfy both relevance to the image and poeticness in language level. To solve the above challenges, we formulate the task of poem generation into two correlated sub-tasks by multi-adversarial training via policy gradient, through which the cross-modal relevance and poetic language style can be ensured. To extract poetic clues from images, we propose to learn a deep coupled visual-poetic embedding, in which the poetic representation from objects, sentiments and scenes in an image can be jointly learned. Two discriminative networks are further introduced to guide the poem generation, including a multi-modal discriminator and a poem-style discriminator. To facilitate the research, we have released two poem datasets by human annotators with two distinct properties: 1) the first human annotated image-to-poem pair dataset (with 8,292 pairs in total), and 2) to-date the largest public English poem corpus dataset (with 92,265 different poems in total). Extensive experiments are conducted with 8K images, among which 1.5K image are randomly picked for evaluation. Both objective and subjective evaluations show the superior performances against the state-of-the-art methods for poem generation from images. Turing test carried out with over 500 human subjects, among which 30 evaluators are poetry experts, demonstrates the effectiveness of our approach. △ Less

Submitted 9 October, 2018; v1 submitted 23 April, 2018; originally announced April 2018.

arXiv:1804.07068 [pdf, ps, other]

Consistent CCG Parsing over Multiple Sentences for Improved Logical Reasoning

Authors: Masashi Yoshikawa, Koji Mineshima, Hiroshi Noji, Daisuke Bekki

Abstract: In formal logic-based approaches to Recognizing Textual Entailment (RTE), a Combinatory Categorial Grammar (CCG) parser is used to parse input premises and hypotheses to obtain their logical formulas. Here, it is important that the parser processes the sentences consistently; failing to recognize a similar syntactic structure results in inconsistent predicate argument structures among them, in whi… ▽ More In formal logic-based approaches to Recognizing Textual Entailment (RTE), a Combinatory Categorial Grammar (CCG) parser is used to parse input premises and hypotheses to obtain their logical formulas. Here, it is important that the parser processes the sentences consistently; failing to recognize a similar syntactic structure results in inconsistent predicate argument structures among them, in which case the succeeding theorem proving is doomed to failure. In this work, we present a simple method to extend an existing CCG parser to parse a set of sentences consistently, which is achieved with an inter-sentence modeling with Markov Random Fields (MRF). When combined with existing logic-based systems, our method always shows improvement in the RTE experiments on English and Japanese languages. △ Less

Submitted 19 April, 2018; originally announced April 2018.

Comments: 6 pages. short paper accepted to NAACL2018

arXiv:1803.06674 [pdf, other]

A View-based Programmable Architecture for Controlling and Integrating Decentralized Data

Authors: Yasuhito Asano, Soichiro Hidaka, Zhenjiang Hu, Yasunori Ishihara, Hiroyuki Kato, Hsiang-Shang Ko, Keisuke Nakano, Makoto Onizuka, Yuya Sasaki, Toshiyuki Shimizu, Kanae Tsushima, Masatoshi Yoshikawa

Abstract: The view and the view update are known mechanism for controlling access of data and for integrating data of different schemas. Despite intensive and long research on them in both the database community and the programming language community, we are facing difficulties to use them in practice. The main reason is that we are lacking of control over the view update strategy to deal with inherited amb… ▽ More The view and the view update are known mechanism for controlling access of data and for integrating data of different schemas. Despite intensive and long research on them in both the database community and the programming language community, we are facing difficulties to use them in practice. The main reason is that we are lacking of control over the view update strategy to deal with inherited ambiguity of view update for a given view. This vision paper aims to provide a new language-based approach to controlling and integrating decentralized data based on the view, and establish a software foundation for systematic construction of such data management systems. Our key observation is that a view should be defined through a view update strategy rather than a query. In other words, the view definition should be extracted from the view update strategy, which is in sharp contrast to the traditional approaches where the view update strategy is derived from the view definition. In this paper, we present the first programmable architecture with a declarative language for specifying update strategies over views, whose unique view definition can be automatically derived, and show how it can be effectively used to control data access, integrate data generally allowing coexistence of GAV (global as view) and LAV (local as view), and perform both analysis and updates on the integrated data. We demonstrate its usefulness through development of a privacy-preserving ride-sharing alliance system, discuss its application scope, and highlight future challenges. △ Less

Submitted 18 March, 2018; originally announced March 2018.

Comments: 14 pages, 2 figures, conference

Showing 1–50 of 56 results for author: Yoshikawa, M