Showing 1–50 of 78 results for author: Ooi, B C

  1. arXiv:2506.05831  [pdf, ps, other]

    cs.LG cs.AI

    Heartcare Suite: Multi-dimensional Understanding of ECG with Raw Multi-lead Signal Modeling

    Authors: Yihan Xie, Sijing Li, Tianwei Lin, Zhuonan Wang, Chenglin Yang, Yu Zhong, Wenqiao Zhang, Haoyuan Li, Hao Jiang, Fengda Zhang, Qishan Chen, Jun Xiao, Yueting Zhuang, Beng Chin Ooi

    Abstract: We present Heartcare Suite, a multimodal comprehensive framework for fine-grained electrocardiogram (ECG) understanding. It comprises three key components: (i) Heartcare-220K, a high-quality, structured, and comprehensive multimodal ECG dataset covering essential tasks such as disease diagnosis, waveform morphology analysis, and rhythm interpretation. (ii) Heartcare-Bench, a systematic and multi-di…

    Submitted 9 June, 2025; v1 submitted 6 June, 2025; originally announced June 2025.

  2. arXiv:2505.12524  [pdf, ps, other]

    cs.DB cs.LG

    HAKES: Scalable Vector Database for Embedding Search Service

    Authors: Guoyu Hu, Shaofeng Cai, Tien Tuan Anh Dinh, Zhongle Xie, Cong Yue, Gang Chen, Beng Chin Ooi

    Abstract: Modern deep learning models capture the semantics of complex data by transforming them into high-dimensional embedding vectors. Emerging applications, such as retrieval-augmented generation, use approximate nearest neighbor (ANN) search in the embedding vector space to find similar data. Existing vector databases provide indexes for efficient ANN searches, with graph-based indexes being the most p…

    Submitted 18 May, 2025; originally announced May 2025.

  3. arXiv:2505.04404  [pdf, other]

    cs.DB cs.AI

    In-Context Adaptation to Concept Drift for Learned Database Operations

    Authors: Jiaqi Zhu, Shaofeng Cai, Yanyan Shen, Gang Chen, Fang Deng, Beng Chin Ooi

    Abstract: Machine learning has demonstrated transformative potential for database operations, such as query optimization and in-database data analytics. However, dynamic database environments, characterized by frequent updates and evolving data distributions, introduce concept drift, which leads to performance degradation for learned models and limits their practical applicability. Addressing this challenge…

    Submitted 22 May, 2025; v1 submitted 7 May, 2025; originally announced May 2025.

    Comments: Accepted by ICML 2025

  4. arXiv:2504.13650  [pdf, other]

    cs.CV

    EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model

    Authors: Sijing Li, Tianwei Lin, Lingshuai Lin, Wenqiao Zhang, Jiang Liu, Xiaoda Yang, Juncheng Li, Yucheng He, Xiaohui Song, Jun Xiao, Yueting Zhuang, Beng Chin Ooi

    Abstract: Medical Large Vision-Language Models (Med-LVLMs) demonstrate significant potential in healthcare, but their reliance on general medical data and coarse-grained global visual understanding limits them in intelligent ophthalmic diagnosis. Currently, intelligent ophthalmic diagnosis faces three major challenges: (i) Data. The lack of deeply annotated, high-quality, multi-modal ophthalmic visual instr…

    Submitted 18 April, 2025; originally announced April 2025.

  5. arXiv:2504.11259  [pdf, ps, other]

    cs.DB

    The Cambridge Report on Database Research

    Authors: Anastasia Ailamaki, Samuel Madden, Daniel Abadi, Gustavo Alonso, Sihem Amer-Yahia, Magdalena Balazinska, Philip A. Bernstein, Peter Boncz, Michael Cafarella, Surajit Chaudhuri, Susan Davidson, David DeWitt, Yanlei Diao, Xin Luna Dong, Michael Franklin, Juliana Freire, Johannes Gehrke, Alon Halevy, Joseph M. Hellerstein, Mark D. Hill, Stratos Idreos, Yannis Ioannidis, Christoph Koch, Donald Kossmann, Tim Kraska, et al. (21 additional authors not shown)

    Abstract: On October 19 and 20, 2023, the authors of this report convened in Cambridge, MA, to discuss the state of the database research field, its recent accomplishments and ongoing challenges, and future directions for research and community engagement. This gathering continues a long-standing tradition in the database community, dating back to the late 1980s, in which researchers meet roughly every five…

    Submitted 15 April, 2025; originally announced April 2025.

  6. arXiv:2503.13822  [pdf, other]

    cs.DB

    NeurBench: Benchmarking Learned Database Components with Data and Workload Drift Modeling

    Authors: Zhanhao Zhao, Haotian Gao, Naili Xing, Lingze Zeng, Meihui Zhang, Gang Chen, Manuel Rigger, Beng Chin Ooi

    Abstract: Learned database components, which deeply integrate machine learning into their design, have been extensively studied in recent years. Given the dynamism of databases, where data and workloads continuously drift, it is crucial for learned database components to remain effective and efficient in the face of data and workload drift. Adaptability, therefore, is a key factor in assessing their practic…

    Submitted 24 March, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

  7. arXiv:2503.10036  [pdf, other]

    cs.DB

    CCaaLF: Concurrency Control as a Learnable Function

    Authors: Hexiang Pan, Shaofeng Cai, Tien Tuan Anh Dinh, Yuncheng Wu, Yeow Meng Chee, Gang Chen, Beng Chin Ooi

    Abstract: Concurrency control (CC) algorithms are important in modern transactional databases, as they enable high performance by executing transactions concurrently while ensuring correctness. However, state-of-the-art CC algorithms struggle to perform well across diverse workloads, and most do not consider workload drifts. In this paper, we propose CCaaLF (Concurrency Control as a Learnable Function), a…

    Submitted 25 March, 2025; v1 submitted 13 March, 2025; originally announced March 2025.

    MSC Class: 68P15 ACM Class: H.2.4

  8. arXiv:2502.09838  [pdf, other]

    cs.CV cs.AI

    HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation

    Authors: Tianwei Lin, Wenqiao Zhang, Sijing Li, Yuqian Yuan, Binhe Yu, Haoyuan Li, Wanggui He, Hao Jiang, Mengze Li, Xiaohui Song, Siliang Tang, Jun Xiao, Hui Lin, Yueting Zhuang, Beng Chin Ooi

    Abstract: We present HealthGPT, a powerful Medical Large Vision-Language Model (Med-LVLM) that integrates medical visual comprehension and generation capabilities within a unified autoregressive paradigm. Our bootstrapping philosophy is to progressively adapt heterogeneous comprehension and generation knowledge to pre-trained large language models (LLMs). This is achieved through a novel heterogeneous low-r…

    Submitted 21 February, 2025; v1 submitted 13 February, 2025; originally announced February 2025.

    Comments: added project page

  9. arXiv:2412.11640  [pdf, other]

    cs.CR cs.DC

    SeSeMI: Secure Serverless Model Inference on Sensitive Data

    Authors: Guoyu Hu, Yuncheng Wu, Gang Chen, Tien Tuan Anh Dinh, Beng Chin Ooi

    Abstract: Model inference systems are essential for implementing end-to-end data analytics pipelines that deliver the benefits of machine learning models to users. Existing cloud-based model inference systems are costly, not easy to scale, and must be trusted in handling the models and user request data. Serverless computing presents a new opportunity, as it provides elasticity and fine-grained pricing. Our…

    Submitted 16 December, 2024; originally announced December 2024.

  10. arXiv:2411.15893  [pdf, other]

    cs.LG cs.AI

    Distribution-aware Online Continual Learning for Urban Spatio-Temporal Forecasting

    Authors: Chengxin Wang, Gary Tan, Swagato Barman Roy, Beng Chin Ooi

    Abstract: Urban spatio-temporal (ST) forecasting is crucial for various urban applications such as intelligent scheduling and trip planning. Previous studies focus on modeling ST correlations among urban locations in offline settings, which often neglect the non-stationary nature of urban ST data, particularly, distribution shifts over time. This oversight can lead to degraded performance in real-world scen…

    Submitted 24 November, 2024; originally announced November 2024.

  11. arXiv:2408.03013  [pdf, other]

    cs.DB cs.AI cs.LG

    NeurDB: On the Design and Implementation of an AI-powered Autonomous Database

    Authors: Zhanhao Zhao, Shaofeng Cai, Haotian Gao, Hexiang Pan, Siqi Xiang, Naili Xing, Gang Chen, Beng Chin Ooi, Yanyan Shen, Yuncheng Wu, Meihui Zhang

    Abstract: Databases are increasingly embracing AI to provide autonomous system optimization and intelligent in-database analytics, aiming to relieve end-user burdens across various industry sectors. Nonetheless, most existing approaches fail to account for the dynamic nature of databases, which renders them ineffective for real-world applications characterized by evolving data and workloads. This paper intr…

    Submitted 4 January, 2025; v1 submitted 6 August, 2024; originally announced August 2024.

    Journal ref: CIDR 2025

  12. arXiv:2408.00513  [pdf, other]

    cs.LG

    VecAug: Unveiling Camouflaged Frauds with Cohort Augmentation for Enhanced Detection

    Authors: Fei Xiao, Shaofeng Cai, Gang Chen, H. V. Jagadish, Beng Chin Ooi, Meihui Zhang

    Abstract: Fraud detection presents a challenging task characterized by ever-evolving fraud patterns and scarce labeled data. Existing methods predominantly rely on graph-based or sequence-based approaches. While graph-based approaches connect users through shared entities to capture structural information, they remain vulnerable to fraudsters who can disrupt or manipulate these connections. In contrast, seq…

    Submitted 1 August, 2024; originally announced August 2024.

    Comments: Accepted by KDD 2024

  13. arXiv:2407.05034  [pdf, other]

    cs.CR

    GCON: Differentially Private Graph Convolutional Network via Objective Perturbation

    Authors: Jianxin Wei, Yizheng Zhu, Xiaokui Xiao, Ergute Bao, Yin Yang, Kuntai Cai, Beng Chin Ooi

    Abstract: Graph Convolutional Networks (GCNs) are a popular machine learning model with a wide range of applications in graph analytics, including healthcare, transportation, and finance. However, a GCN trained without privacy protection measures may memorize private interpersonal relationships in the training data through its model parameters. This poses a substantial risk of compromising privacy through l…

    Submitted 30 January, 2025; v1 submitted 6 July, 2024; originally announced July 2024.

  14. arXiv:2406.14015  [pdf, other]

    cs.LG

    CohortNet: Empowering Cohort Discovery for Interpretable Healthcare Analytics

    Authors: Qingpeng Cai, Kaiping Zheng, H. V. Jagadish, Beng Chin Ooi, James Yip

    Abstract: Cohort studies are of significant importance in the field of healthcare analysis. However, existing methods typically involve manual, labor-intensive, and expert-driven pattern definitions or rely on simplistic clustering techniques that lack medical relevance. Automating cohort studies with interpretable patterns has great potential to facilitate healthcare analysis but remains an unmet need in p…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: 10 pages, 12 figures

  15. NeurDB: An AI-powered Autonomous Data System

    Authors: Beng Chin Ooi, Shaofeng Cai, Gang Chen, Yanyan Shen, Kian-Lee Tan, Yuncheng Wu, Xiaokui Xiao, Naili Xing, Cong Yue, Lingze Zeng, Meihui Zhang, Zhanhao Zhao

    Abstract: In the wake of rapid advancements in artificial intelligence (AI), we stand on the brink of a transformative leap in data systems. The imminent fusion of AI and DB (AIxDB) promises a new generation of data systems, which will relieve the burden on end-users across all industry sectors by featuring AI-enhanced functionalities, such as personalized and automated in-database AI-powered analytics, sel…

    Submitted 4 July, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

    Journal ref: SCIENCE CHINA Information Sciences 67, 10 (2024)

  16. arXiv:2405.00568  [pdf, other]

    cs.DB cs.AI

    Powering In-Database Dynamic Model Slicing for Structured Data Analytics

    Authors: Lingze Zeng, Naili Xing, Shaofeng Cai, Gang Chen, Beng Chin Ooi, Jian Pei, Yuncheng Wu

    Abstract: Relational database management systems (RDBMS) are widely used for the storage of structured data. To derive insights beyond statistical aggregation, we typically have to extract specific subdatasets from the database using conventional database operations, and then apply deep neural networks (DNN) training and inference on these subdatasets in a separate analytics system. The process can be prohi…

    Submitted 3 November, 2024; v1 submitted 1 May, 2024; originally announced May 2024.

    Comments: VLDB 2025

  17. arXiv:2404.09654  [pdf, other]

    cs.CV cs.MM

    Do LLMs Understand Visual Anomalies? Uncovering LLM's Capabilities in Zero-shot Anomaly Detection

    Authors: Jiaqi Zhu, Shaofeng Cai, Fang Deng, Beng Chin Ooi, Junran Wu

    Abstract: Large vision-language models (LVLMs) are markedly proficient in deriving visual representations guided by natural language. Recent explorations have utilized LVLMs to tackle zero-shot visual anomaly detection (VAD) challenges by pairing images with textual descriptions indicative of normal and abnormal conditions, referred to as anomaly prompts. However, existing approaches depend on static anomal…

    Submitted 7 April, 2025; v1 submitted 15 April, 2024; originally announced April 2024.

    Comments: Accepted by MM'24 (Oral)

  18. arXiv:2403.10318  [pdf, other]

    cs.LG

    Anytime Neural Architecture Search on Tabular Data

    Authors: Naili Xing, Shaofeng Cai, Zhaojing Luo, Beng Chin Ooi, Jian Pei

    Abstract: The increasing demand for tabular data analysis calls for transitioning from manual architecture design to Neural Architecture Search (NAS). This transition demands an efficient and responsive anytime NAS approach that is capable of returning current optimal architectures within any given time budget while progressively enhancing architecture quality with increased budget allocation. However, the…

    Submitted 6 May, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

  19. Exploring Privacy and Fairness Risks in Sharing Diffusion Models: An Adversarial Perspective

    Authors: Xinjian Luo, Yangfan Jiang, Fei Wei, Yuncheng Wu, Xiaokui Xiao, Beng Chin Ooi

    Abstract: Diffusion models have recently gained significant attention in both academia and industry due to their impressive generative performance in terms of both sampling quality and distribution coverage. Accordingly, proposals are made for sharing pre-trained diffusion models across different organizations, as a way of improving data utilization while enhancing privacy protection by avoiding sharing pri…

    Submitted 19 September, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

  20. METER: A Dynamic Concept Adaptation Framework for Online Anomaly Detection

    Authors: Jiaqi Zhu, Shaofeng Cai, Fang Deng, Beng Chin Ooi, Wenqiao Zhang

    Abstract: Real-time analytics and decision-making require online anomaly detection (OAD) to handle drifts in data streams efficiently and effectively. Unfortunately, existing approaches are often constrained by their limited detection capacity and slow adaptation to evolving data streams, inhibiting their efficacy and efficiency in handling concept drift, which is a major challenge in evolving data streams.…

    Submitted 28 December, 2023; originally announced December 2023.

  21. arXiv:2311.15310  [pdf, other]

    cs.CR cs.DB cs.DC cs.LG

    Secure and Verifiable Data Collaboration with Low-Cost Zero-Knowledge Proofs

    Authors: Yizheng Zhu, Yuncheng Wu, Zhaojing Luo, Beng Chin Ooi, Xiaokui Xiao

    Abstract: Organizations are increasingly recognizing the value of data collaboration for data analytics purposes. Yet, stringent data protection laws prohibit the direct exchange of raw data. To facilitate data collaboration, federated learning (FL) emerges as a viable solution, which enables multiple clients to collaboratively train a machine learning (ML) model under the supervision of a central server wh…

    Submitted 26 November, 2023; originally announced November 2023.

  22. arXiv:2310.10483  [pdf, other]

    cs.CR cs.LG

    Passive Inference Attacks on Split Learning via Adversarial Regularization

    Authors: Xiaochen Zhu, Xinjian Luo, Yuncheng Wu, Yangfan Jiang, Xiaokui Xiao, Beng Chin Ooi

    Abstract: Split Learning (SL) has emerged as a practical and efficient alternative to traditional federated learning. While previous attempts to attack SL have often relied on overly strong assumptions or targeted easily exploitable models, we seek to develop more capable attacks. We introduce SDAR, a novel attack framework against SL with an honest-but-curious server. SDAR leverages auxiliary data and adve…

    Submitted 21 March, 2025; v1 submitted 16 October, 2023; originally announced October 2023.

    Comments: NDSS 2025; 25 pages, 27 figures; Fixed typos

  23. arXiv:2304.10539  [pdf, other]

    cs.LG cs.CV

    Learning in Imperfect Environment: Multi-Label Classification with Long-Tailed Distribution and Partial Labels

    Authors: Wenqiao Zhang, Changshuo Liu, Lingze Zeng, Beng Chin Ooi, Siliang Tang, Yueting Zhuang

    Abstract: Conventional multi-label classification (MLC) methods assume that all samples are fully labeled and identically distributed. Unfortunately, this assumption is unrealistic in large-scale MLC data that has long-tailed (LT) distribution and partial labels (PL). To address the problem, we introduce a novel task, Partial labeling and Long-Tailed Multi-Label Classification (PLT-MLC), to jointly consider…

    Submitted 20 April, 2023; originally announced April 2023.

  24. arXiv:2304.04468  [pdf, other]

    cs.LG cs.AI

    Toward Cohort Intelligence: A Universal Cohort Representation Learning Framework for Electronic Health Record Analysis

    Authors: Changshuo Liu, Wenqiao Zhang, Beng Chin Ooi, James Wei Luen Yip, Lingze Zeng, Kaiping Zheng

    Abstract: Electronic Health Records (EHR) are generated from clinical routine care recording valuable information of broad patient populations, which provide plentiful opportunities for improving patient management and intervention strategies in clinical practice. To exploit the enormous potential of EHR data, a popular EHR data analysis paradigm in machine learning is EHR representation learning, which fir…

    Submitted 12 April, 2023; v1 submitted 10 April, 2023; originally announced April 2023.

    Comments: 10 pages

  25. arXiv:2303.17526  [pdf, other]

    cs.CV

    CAusal and collaborative proxy-tasKs lEarning for Semi-Supervised Domain Adaptation

    Authors: Wenqiao Zhang, Changshuo Liu, Can Cui, Beng Chin Ooi

    Abstract: Semi-supervised domain adaptation (SSDA) adapts a learner to a new domain by effectively utilizing source domain data and a few labeled target samples. It is a practical yet under-investigated research topic. In this paper, we analyze the SSDA problem from two perspectives that have previously been overlooked, and correspondingly decompose it into two key subproblems: robust domain ad…

    Submitted 30 March, 2023; originally announced March 2023.

  26. arXiv:2302.04500  [pdf, other]

    cs.DC cs.AI cs.DB

    FLAC: A Robust Failure-Aware Atomic Commit Protocol for Distributed Transactions

    Authors: Hexiang Pan, Quang-Trung Ta, Meihui Zhang, Yeow Meng Chee, Gang Chen, Beng Chin Ooi

    Abstract: In distributed transaction processing, atomic commit protocol (ACP) is used to ensure database consistency. With the use of commodity compute nodes and networks, failures such as system crashes and network partitioning are common. It is therefore important for ACP to dynamically adapt to the operating condition for efficiency while ensuring the consistency of the database. Existing ACPs often assu…

    Submitted 2 March, 2023; v1 submitted 9 February, 2023; originally announced February 2023.

    MSC Class: H.2.4

  27. arXiv:2301.03829  [pdf, other]

    cs.LG cs.AI cs.CV cs.DB cs.MM

    From Plate to Prevention: A Dietary Nutrient-aided Platform for Health Promotion in Singapore

    Authors: Kaiping Zheng, Thao Nguyen, Jesslyn Hwei Sing Chong, Charlene Enhui Goh, Melanie Herschel, Hee Hoon Lee, Changshuo Liu, Beng Chin Ooi, Wei Wang, James Yip

    Abstract: Singapore has been striving to improve the provision of healthcare services to her people. In this course, the government has taken note of the deficiency in regulating and supervising people's nutrient intake, which is identified as a contributing factor to the development of chronic diseases. Consequently, this issue has garnered significant attention. In this paper, we share our experience in a…

    Submitted 28 March, 2023; v1 submitted 10 January, 2023; originally announced January 2023.

  28. arXiv:2212.04371  [pdf]

    cs.LG cs.CR

    Skellam Mixture Mechanism: a Novel Approach to Federated Learning with Differential Privacy

    Authors: Ergute Bao, Yizheng Zhu, Xiaokui Xiao, Yin Yang, Beng Chin Ooi, Benjamin Hong Meng Tan, Khin Mi Mi Aung

    Abstract: Deep neural networks have strong capabilities of memorizing the underlying training data, which can be a serious privacy concern. An effective solution to this problem is to train models with differential privacy, which provides rigorous privacy guarantees by injecting random noise to the gradients. This paper focuses on the scenario where sensitive data are distributed among multiple participants…

    Submitted 2 July, 2024; v1 submitted 8 December, 2022; originally announced December 2022.

  29. arXiv:2209.05227  [pdf, other]

    cs.DC cs.AI cs.CV cs.IR

    DUET: A Tuning-Free Device-Cloud Collaborative Parameters Generation Framework for Efficient Device Model Generalization

    Authors: Zheqi Lv, Wenqiao Zhang, Shengyu Zhang, Kun Kuang, Feng Wang, Yongwei Wang, Zhengyu Chen, Tao Shen, Hongxia Yang, Beng Chin Ooi, Fei Wu

    Abstract: Device Model Generalization (DMG) is a practical yet under-investigated research topic for on-device machine learning applications. It aims to improve the generalization ability of pre-trained models when deployed on resource-constrained devices, such as improving the performance of pre-trained cloud models on smart mobiles. While quite a lot of works have investigated the data distribution shift…

    Submitted 1 December, 2024; v1 submitted 12 September, 2022; originally announced September 2022.

    Comments: Published on WWW'23: Proceedings of the ACM on Web Conference 2023 (pp. 3077 - 3085)

  30. arXiv:2207.00944  [pdf, other]

    cs.DB

    GlassDB: An Efficient Verifiable Ledger Database System Through Transparency

    Authors: Cong Yue, Tien Tuan Anh Dinh, Zhongle Xie, Meihui Zhang, Gang Chen, Beng Chin Ooi, Xiaokui Xiao

    Abstract: Verifiable ledger databases protect data history against malicious tampering. Existing systems, such as blockchains and certificate transparency, are based on transparency logs -- a simple abstraction allowing users to verify that a log maintained by an untrusted server is append-only. They expose a simple key-value interface. Building a practical database from transparency logs, on the other hand…

    Submitted 19 February, 2023; v1 submitted 2 July, 2022; originally announced July 2022.

  31. arXiv:2206.10326  [pdf, other]

    cs.HC cs.AI cs.CV cs.DB cs.DC

    The Metaverse Data Deluge: What Can We Do About It?

    Authors: Beng Chin Ooi, Gang Chen, Mike Zheng Shou, Kian-Lee Tan, Anthony Tung, Xiaokui Xiao, James Wei Luen Yip, Meihui Zhang

    Abstract: In the Metaverse, the physical space and the virtual space co-exist, and interact simultaneously. While the physical space is virtually enhanced with information, the virtual space is continuously refreshed with real-time, real-world information. To allow users to process and manipulate information seamlessly between the real and digital spaces, novel technologies must be developed. These include…

    Submitted 10 November, 2022; v1 submitted 14 June, 2022; originally announced June 2022.

  32. arXiv:2205.06941  [pdf, ps, other]

    cs.DC cs.DB cs.PF

    Blockchain Goes Green? Part II: Characterizing the Performance and Cost of Blockchains on the Cloud and at the Edge

    Authors: Dumitrel Loghin, Tien Tuan Anh Dinh, Aung Maw, Chen Gang, Yong Meng Teo, Beng Chin Ooi

    Abstract: While state-of-the-art permissioned blockchains can achieve thousands of transactions per second on commodity hardware with x86/64 architecture, their performance when running on different architectures is not clear. The goal of this work is to characterize the performance and cost of permissioned blockchains on different hardware systems, which is important as diverse application domains are adop…

    Submitted 13 May, 2022; originally announced May 2022.

    Comments: 13 pages, 10 figures, 3 tables

  33. arXiv:2203.02533  [pdf, other]

    eess.IV cs.AI cs.CV cs.LG

    BoostMIS: Boosting Medical Image Semi-supervised Learning with Adaptive Pseudo Labeling and Informative Active Annotation

    Authors: Wenqiao Zhang, Lei Zhu, James Hallinan, Andrew Makmur, Shengyu Zhang, Qingpeng Cai, Beng Chin Ooi

    Abstract: In this paper, we propose a novel semi-supervised learning (SSL) framework named BoostMIS that combines adaptive pseudo labeling and informative active annotation to unleash the potential of medical image SSL models: (1) BoostMIS can adaptively leverage the cluster assumption and consistency regularization of the unlabeled data according to the current learning status. This strategy can adaptively…

    Submitted 21 March, 2022; v1 submitted 4 March, 2022; originally announced March 2022.

    Comments: 11 pages

    Journal ref: CVPR 2022

  34. arXiv:2109.00817  [pdf, other]

    cs.LG cs.AI

    NASI: Label- and Data-agnostic Neural Architecture Search at Initialization

    Authors: Yao Shu, Shaofeng Cai, Zhongxiang Dai, Beng Chin Ooi, Bryan Kian Hsiang Low

    Abstract: Recent years have witnessed a surging interest in Neural Architecture Search (NAS). Various algorithms have been proposed to improve the search efficiency and effectiveness of NAS, i.e., to reduce the search cost and improve the generalization performance of the selected architectures, respectively. However, the search efficiency of these algorithms is severely limited by the need for model traini…

    Submitted 25 April, 2022; v1 submitted 2 September, 2021; originally announced September 2021.

    Comments: Published as a conference paper at ICLR 2022

  35. SINGA-Easy: An Easy-to-Use Framework for MultiModal Analysis

    Authors: Naili Xing, Sai Ho Yeung, Chenghao Cai, Teck Khim Ng, Wei Wang, Kaiyuan Yang, Nan Yang, Meihui Zhang, Gang Chen, Beng Chin Ooi

    Abstract: Deep learning has achieved great success in a wide spectrum of multimedia applications such as image classification, natural language processing and multimodal data analysis. Recent years have seen the development of many deep learning frameworks that provide a high-level programming interface for users to design models, conduct training and deploy inference. However, it remains challenging to bui…

    Submitted 3 August, 2021; originally announced August 2021.

    Comments: 10 pages, 10 figures

  36. ARM-Net: Adaptive Relation Modeling Network for Structured Data

    Authors: Shaofeng Cai, Kaiping Zheng, Gang Chen, H. V. Jagadish, Beng Chin Ooi, Meihui Zhang

    Abstract: Relational databases are the de facto standard for storing and querying structured data, and extracting insights from structured data requires advanced analytics. Deep neural networks (DNNs) have achieved super-human prediction performance in particular data types, e.g., images. However, existing DNNs may not produce meaningful results when applied to structured data. The reason is that there are…

    Submitted 5 July, 2021; originally announced July 2021.

    Comments: 14 pages, 11 figures, 5 tables, published as a conference paper in ACM SIGMOD 2020

  37. A Fusion-Denoising Attack on InstaHide with Data Augmentation

    Authors: Xinjian Luo, Xiaokui Xiao, Yuncheng Wu, Juncheng Liu, Beng Chin Ooi

    Abstract: InstaHide is a state-of-the-art mechanism for protecting private training images, by mixing multiple private images and modifying them such that their visual features are indistinguishable to the naked eye. In recent work, however, Carlini et al. show that it is possible to reconstruct private images from the encrypted dataset generated by InstaHide. Nevertheless, we demonstrate that Carlini et al…

    Submitted 5 December, 2021; v1 submitted 17 May, 2021; originally announced May 2021.

    Comments: 15 pages

  38. AlphaEvolve: A Learning Framework to Discover Novel Alphas in Quantitative Investment

    Authors: Can Cui, Wei Wang, Meihui Zhang, Gang Chen, Zhaojing Luo, Beng Chin Ooi

    Abstract: Alphas are stock prediction models capturing trading signals in a stock market. A set of effective alphas can generate weakly correlated high returns to diversify the risk. Existing alphas can be categorized into two classes: Formulaic alphas are simple algebraic expressions of scalar features, and thus can generalize well and be mined into a weakly correlated set. Machine learning alphas are data…

    Submitted 1 April, 2021; v1 submitted 30 March, 2021; originally announced March 2021.

    Comments: Accepted by SIGMOD 2021 Data Science and Engineering Track

    ACM Class: H.2.8

  39. arXiv:2103.02958  [pdf, other]

    cs.DC cs.AI cs.DB cs.LG

    Serverless Data Science -- Are We There Yet? A Case Study of Model Serving

    Authors: Yuncheng Wu, Tien Tuan Anh Dinh, Guoyu Hu, Meihui Zhang, Yeow Meng Chee, Beng Chin Ooi

    Abstract: Machine learning (ML) is an important part of modern data science applications. Data scientists today have to manage the end-to-end ML life cycle that includes both model training and model serving, the latter of which is essential, as it makes their works available to end-users. Systems of model serving require high performance, low cost, and ease of management. Cloud providers are already offeri…

    Submitted 1 March, 2022; v1 submitted 4 March, 2021; originally announced March 2021.

    Comments: Accepted by ACM SIGMOD 2022, 10 pages

  40. arXiv:2010.10246  [pdf, other]

    cs.SE cs.DB cs.DC cs.LG

    MLCask: Efficient Management of Component Evolution in Collaborative Data Analytics Pipelines

    Authors: Zhaojing Luo, Sai Ho Yeung, Meihui Zhang, Kaiping Zheng, Lei Zhu, Gang Chen, Feiyi Fan, Qian Lin, Kee Yuan Ngiam, Beng Chin Ooi

    Abstract: With the ever-increasing adoption of machine learning for data analytics, maintaining a machine learning pipeline is becoming more complex as both the datasets and trained models evolve with time. In a collaborative environment, the changes and updates due to pipeline evolution often cause cumbersome coordination and maintenance work, raising the costs and making it hard to use. Existing solutions…

    Submitted 16 March, 2021; v1 submitted 17 October, 2020; originally announced October 2020.

    Comments: 13 pages; added new baselines, i.e., MLflow and ModelDB, in Section VII-C; added experience on the system deployment in Section VIII; added Table I to clarify the correctness of the prioritized pipeline search in Section VII-E

  41. Feature Inference Attack on Model Predictions in Vertical Federated Learning

    Authors: Xinjian Luo, Yuncheng Wu, Xiaokui Xiao, Beng Chin Ooi

    Abstract: Federated learning (FL) is an emerging paradigm for facilitating multiple organizations' data collaboration without revealing their private data to each other. Recently, vertical FL, where the participating organizations hold the same set of samples but with disjoint features and only one organization owns the labels, has received increased attention. This paper presents several feature inference…

    Submitted 22 April, 2021; v1 submitted 20 October, 2020; originally announced October 2020.

    Comments: Accepted at the IEEE 37th International Conference on Data Engineering (ICDE 2021); 15 pages

  42. arXiv:2009.05766  [pdf, other]

    cs.DC

    Communication-efficient Decentralized Machine Learning over Heterogeneous Networks

    Authors: Pan Zhou, Qian Lin, Dumitrel Loghin, Beng Chin Ooi, Yuncheng Wu, Hongfang Yu

    Abstract: In the last few years, distributed machine learning has been usually executed over heterogeneous networks such as a local area network within a multi-tenant cluster or a wide area network connecting data centers and edge clusters. In these heterogeneous networks, the link speeds among worker nodes vary significantly, making it challenging for state-of-the-art machine learning approaches to perform…

    Submitted 20 October, 2020; v1 submitted 12 September, 2020; originally announced September 2020.

    Comments: 17 pages, 19 figures, accepted by conference ICDE'2021

  43. Privacy Preserving Vertical Federated Learning for Tree-based Models

    Authors: Yuncheng Wu, Shaofeng Cai, Xiaokui Xiao, Gang Chen, Beng Chin Ooi

    Abstract: Federated learning (FL) is an emerging paradigm that enables multiple organizations to jointly train a model without revealing their private data to each other. This paper studies vertical federated learning, which tackles the scenarios where (i) collaborating organizations own data of the same set of users but with disjoint features, and (ii) only one organization holds the labels. We propo…

    Submitted 13 August, 2020; originally announced August 2020.

    Comments: Proc. VLDB Endow. 13(11): 2090-2103 (2020)

  44. arXiv:2004.07585  [pdf, other]

    cs.DB

    ForkBase: Immutable, Tamper-evident Storage Substrate for Branchable Applications

    Authors: Qian Lin, Kaiyuan Yang, Tien Tuan Anh Dinh, Qingchao Cai, Gang Chen, Beng Chin Ooi, Pingcheng Ruan, Sheng Wang, Zhongle Xie, Meihui Zhang, Olafs Vandans

    Abstract: Data collaboration activities typically require systematic or protocol-based coordination to be scalable. Git, an effective enabler for collaborative coding, has been attested for its success in countless projects around the world. Hence, applying the Git philosophy to general data collaboration beyond coding is motivating. We call it Git for data. However, the original Git design handles data at…

    Submitted 16 April, 2020; originally announced April 2020.

    Comments: In Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2020 (Demo)

  45. arXiv:2003.12012  [pdf, other]

    eess.SP cs.AI cs.LG stat.AP stat.ML

    TRACER: A Framework for Facilitating Accurate and Interpretable Analytics for High Stakes Applications

    Authors: Kaiping Zheng, Shaofeng Cai, Horng Ruey Chua, Wei Wang, Kee Yuan Ngiam, Beng Chin Ooi

    Abstract: In high stakes applications such as healthcare and finance analytics, the interpretability of predictive models is required and necessary for domain practitioners to trust the predictions. Traditional machine learning models, e.g., logistic regression (LR), are easy to interpret in nature. However, many of these models aggregate time-series data without considering the temporal correlations and va…

    Submitted 24 March, 2020; originally announced March 2020.

    Comments: A version of this preprint will appear in ACM SIGMOD 2020

  46. arXiv:2003.10064  [pdf, other]

    cs.DC cs.DB cs.PF

    A Transactional Perspective on Execute-order-validate Blockchains

    Authors: Pingcheng Ruan, Dumitrel Loghin, Quang-Trung Ta, Meihui Zhang, Gang Chen, Beng Chin Ooi

    Abstract: Smart contracts have enabled blockchain systems to evolve from simple cryptocurrency platforms, such as Bitcoin, to general transactional systems, such as Ethereum. Catering for emerging business requirements, a new architecture called execute-order-validate has been proposed in Hyperledger Fabric to support parallel transactions and improve the blockchain's throughput. However, this new architect…

    Submitted 22 March, 2020; originally announced March 2020.

  47. arXiv:2003.02090  [pdf, other]

    cs.DB

    Analysis of Indexing Structures for Immutable Data

    Authors: Cong Yue, Zhongle Xie, Meihui Zhang, Gang Chen, Beng Chin Ooi, Sheng Wang, Xiaokui Xiao

    Abstract: In emerging applications such as blockchains and collaborative data analytics, there are strong demands for data immutability, multi-version accesses, and tamper-evident controls. This leads to three new index structures for immutable data, namely Merkle Patricia Trie (MPT), Merkle Bucket Tree (MBT), and Pattern-Oriented-Split Tree (POS-Tree). Although these structures have been adopted in real ap…

    Submitted 10 March, 2020; v1 submitted 4 March, 2020; originally announced March 2020.

  48. arXiv:1910.01310  [pdf, other]

    cs.DB cs.PF

    Blockchains vs. Distributed Databases: Dichotomy and Fusion

    Authors: Pingcheng Ruan, Tien Tuan Anh Dinh, Dumitrel Loghin, Meihui Zhang, Gang Chen, Qian Lin, Beng Chin Ooi

    Abstract: Blockchain has come a long way: a system that was initially proposed specifically for cryptocurrencies is now being adapted and adopted as a general-purpose transactional system. As blockchain evolves into another data management system, the natural question is how it compares against distributed database systems. Existing works on this comparison focus on high-level properties, such as security a…

    Submitted 15 January, 2021; v1 submitted 3 October, 2019; originally announced October 2019.

  49. arXiv:1910.00985  [pdf, other]

    cs.DB cs.DC

    A Blueprint for Interoperable Blockchains

    Authors: Tien Tuan Anh Dinh, Anwitaman Datta, Beng Chin Ooi

    Abstract: Research in blockchain systems has mainly focused on improving security and bridging the performance gaps between blockchains and databases. Despite many promising results, we observe a worrying trend that the blockchain landscape is fragmented in which many systems exist in silos. Apart from a handful of general-purpose blockchains, such as Ethereum or Hyperledger Fabric, there are hundreds of ot…

    Submitted 22 October, 2019; v1 submitted 2 October, 2019; originally announced October 2019.

  50. arXiv:1909.10152  [pdf, ps, other]

    cs.NI cs.DB cs.DC

    5G: Agent for Further Digital Disruptive Transformations

    Authors: Beng Chin Ooi, Gang Chen, Dumitrel Loghin, Wei Wang, Meihui Zhang

    Abstract: The fifth-generation (5G) mobile communication technologies are on the way to be adopted as the next standard for mobile networking. It is therefore timely to analyze the impact of 5G on the landscape of computing, in particular, data management and data-driven technologies. With a predicted increase of 10-100× in bandwidth and 5-10× decrease in latency, 5G is expected to be the main…

    Submitted 23 September, 2019; originally announced September 2019.

    Comments: Published in the Bulletin of the Technical Committee on Data Engineering (http://sites.computer.org/debull/A19sept/p9.pdf)