Showing 1–33 of 33 results for author: Aref, W G

Searching in archive cs.
  1. arXiv:2505.01697  [pdf, ps, other]

    cs.DB

    BMTree: Designing, Learning, and Updating Piecewise Space-Filling Curves for Multi-Dimensional Data Indexing

    Authors: Jiangneng Li, Yuang Liu, Zheng Wang, Gao Cong, Cheng Long, Walid G. Aref, Han Mao Kiah, Bin Cui

    Abstract: Space-filling curves (SFC, for short) have been widely applied to index multi-dimensional data, which first maps the data to one dimension, and then a one-dimensional indexing method, e.g., the B-tree indexes the mapped data. Existing SFCs adopt a single mapping scheme for the whole data space. However, a single mapping scheme often does not perform well on all the data space. In this paper, we pr…

    Submitted 3 May, 2025; originally announced May 2025.

  2. arXiv:2503.19619  [pdf, other]

    cs.DB

    Exploring Next Token Prediction For Optimizing Databases

    Authors: Yeasir Rayhan, Walid G. Aref

    Abstract: The Next Token Prediction paradigm (NTP, for short) lies at the forefront of modern large foundational models that are pre-trained on diverse and large datasets. These models generalize effectively, and have proven to be very successful in Natural Language Processing (NLP). Inspired by the generalization capabilities of Large Language Models (LLMs), we investigate whether the same NTP paradigm can…

    Submitted 9 May, 2025; v1 submitted 25 March, 2025; originally announced March 2025.

    Comments: To appear at aiDM@SIGMOD'25

  3. arXiv:2503.17685  [pdf, other]

    cs.DB

    Revisiting Page Migration for Main-Memory Database Systems

    Authors: Yeasir Rayhan, Walid G. Aref

    Abstract: Modern hardware architectures, e.g., NUMA servers, chiplet processors, tiered and disaggregated memory systems have significantly improved the performance of Main-Memory Databases, and are poised to deliver further improvements in the future. However, realizing this potential depends on the database system's ability to efficiently migrate pages among different NUMA nodes, and/or memory chips as th…

    Submitted 26 May, 2025; v1 submitted 22 March, 2025; originally announced March 2025.

  4. arXiv:2502.09937  [pdf, other]

    cs.DB cs.LG

    Tradeoffs in Processing Queries and Supporting Updates over an ML-Enhanced R-tree

    Authors: Abdullah Al-Mamun, Ch. Md. Rakin Haider, Jianguo Wang, Walid G. Aref

    Abstract: Machine Learning (ML) techniques have been successfully applied to design various learned database index structures for both the one- and multi-dimensional spaces. Particularly, a class of traditional multi-dimensional indexes has been augmented with ML models to design ML-enhanced variants of their traditional counterparts. This paper focuses on the R-tree multi-dimensional index structure as it…

    Submitted 14 February, 2025; originally announced February 2025.

    Comments: arXiv admin note: text overlap with arXiv:2207.00550

  5. arXiv:2411.02933  [pdf, other]

    cs.DB cs.LG cs.PF

    P-MOSS: Learned Scheduling For Indexes Over NUMA Servers Using Low-Level Hardware Statistics

    Authors: Yeasir Rayhan, Walid G. Aref

    Abstract: Ever since the Dennard scaling broke down in the early 2000s and the frequency of the CPU stalled, vendors have started to increase the core count in each CPU chip at the expense of introducing heterogeneity, thus ushering the era of NUMA processors. Since then, the heterogeneity in the design space of hardware has only increased to the point that DBMS performance may vary significantly up to an o…

    Submitted 5 November, 2024; originally announced November 2024.

  6. arXiv:2409.02088  [pdf, other]

    cs.DB cs.DC cs.ET

    Cache Coherence Over Disaggregated Memory

    Authors: Ruihong Wang, Jianguo Wang, Walid G. Aref

    Abstract: Disaggregating memory from compute offers the opportunity to better utilize stranded memory in cloud data centers. It is important to cache data in the compute nodes and maintain cache coherence across multiple compute nodes. However, the limited computing power on disaggregated memory servers makes traditional cache coherence protocols suboptimal, particularly in the case of stranded memory. This…

    Submitted 22 February, 2025; v1 submitted 3 September, 2024; originally announced September 2024.

  7. arXiv:2406.09372  [pdf, other]

    cs.DB

    An Adaptive Hotspot-Aware Index for Oscillating Write-Heavy and Read-Heavy Workloads

    Authors: Lu Xing, Ruihong Wang, Walid G. Aref

    Abstract: HTAP systems are designed to handle transactional and analytical workloads. Besides a mixed workload at any given time, the workload can also change over time. A popular type of continuously changing workload is one that oscillates between being write-heavy at times and being read-heavy at other times. Oscillating workloads can be observed in many applications. Indexes, e.g., the B+-tree and the L…

    Submitted 2 December, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

  8. arXiv:2406.08746  [pdf, other]

    cs.DB

    The AHA-Tree: An Adaptive Index for HTAP Workloads

    Authors: Lu Xing, Walid G. Aref

    Abstract: In this demo, we realize data indexes that can morph from being write-optimized at times to being read-optimized at other times nonstop with zero-down time during the workload transitioning. These data indexes are useful for HTAP systems (Hybrid Transactional and Analytical Processing Systems), where transactional workloads are write-heavy while analytical workloads are read-heavy. Traditional ind…

    Submitted 12 June, 2024; originally announced June 2024.

  9. Multi-Entry Generalized Search Trees for Indexing Trajectories

    Authors: Maxime Schoemans, Walid G. Aref, Esteban Zimányi, Mahmoud Sakr

    Abstract: The idea of generalized indices is one of the success stories of database systems research. It has found its way to implementation in common database systems. GiST (Generalized Search Tree) and SP-GiST (Space-Partitioned Generalized Search Tree) are two widely-used generalized indices that are typically used for multidimensional data. Currently, the generalized indices GiST and SP-GiST represent o…

    Submitted 13 September, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

  10. arXiv:2405.01418  [pdf, other]

    cs.DB

    GTX: A Write-Optimized Latch-free Graph Data System with Transactional Support -- Extended Version

    Authors: Libin Zhou, Lu Xing, Yeasir Rayhan, Walid G. Aref

    Abstract: This paper introduces GTX, a standalone main-memory write-optimized graph data system that specializes in structural and graph property updates while enabling concurrent reads and graph analytics through ACID transactions. Recent graph systems target concurrent read and write support while guaranteeing transaction semantics. However, their performance suffers from updates with real-world temporal…

    Submitted 24 February, 2025; v1 submitted 2 May, 2024; originally announced May 2024.

    Comments: technical report for our main paper GTX: A Write-Optimized Latch-free Graph Data System with Transactional Support

    ACM Class: H.2.4

  11. arXiv:2403.06456  [pdf, other]

    cs.DB cs.LG

    A Survey of Learned Indexes for the Multi-dimensional Space

    Authors: Abdullah Al-Mamun, Hao Wu, Qiyang He, Jianguo Wang, Walid G. Aref

    Abstract: A recent research trend involves treating database index structures as Machine Learning (ML) models. In this domain, single or multiple ML models are trained to learn the mapping from keys to positions inside a data set. This class of indexes is known as "Learned Indexes." Learned indexes have demonstrated improved search performance and reduced space requirements for one-dimensional data. The con…

    Submitted 11 March, 2024; originally announced March 2024.

  12. The Ubiquitous Skiplist: A Survey of What Cannot be Skipped About the Skiplist and its Applications in Big Data Systems

    Authors: Lu Xing, Venkata Sai Pavan Kumar Vadrevu, Walid G. Aref

    Abstract: Skiplists have become prevalent in systems. The main advantages of skiplists are their simplicity and ease of implementation, and the ability to support operations in the same asymptotic complexities as their tree-based counterparts. In this survey, we explore skiplists and their many variants. We highlight many scenarios about how skiplists are useful, and how they fit well in these usage scenari…

    Submitted 30 January, 2025; v1 submitted 7 March, 2024; originally announced March 2024.

  13. SIMD-ified R-tree Query Processing and Optimization

    Authors: Yeasir Rayhan, Walid G. Aref

    Abstract: The introduction of Single Instruction Multiple Data (SIMD) instructions in mainstream CPUs has enabled modern database engines to leverage data parallelism by performing more computation with a single instruction, resulting in a reduced number of instructions required to execute a query as well as the elimination of conditional branches. Though SIMD in the context of traditional database engines…

    Submitted 28 September, 2023; originally announced September 2023.

    Comments: To appear at ACM SIGSPATIAL 2023

  14. arXiv:2305.01087  [pdf, other]

    cs.DS

    An Update-intensive LSM-based R-tree Index

    Authors: Jaewoo Shin, Jianguo Wang, Walid G. Aref

    Abstract: Many applications require update-intensive workloads on spatial objects, e.g., social-network services and shared-riding services that track moving objects. By buffering insert and delete operations in memory, the Log Structured Merge Tree (LSM) has been used widely in various systems because of its ability to handle write-heavy workloads. While the focus on LSM has been on key-value stores and th…

    Submitted 1 May, 2023; originally announced May 2023.

  15. arXiv:2304.09983  [pdf]

    cs.DB

    Tutorial: The Ubiquitous Skiplist, its Variants, and Applications in Modern Big Data Systems

    Authors: Venkata Sai Pavan Kumar Vadrevu, Lu Xing, Walid G. Aref

    Abstract: The Skiplist, or skip list, originally designed as an in-memory data structure, has attracted a lot of attention in recent years as a main-memory component in many NoSQL, cloud-based, and big data systems. Unlike the B-tree, the skiplist does not need complex rebalancing mechanisms, but it still shows expected logarithmic performance. It supports a variety of operations, including insert, point re…

    Submitted 19 April, 2023; originally announced April 2023.

  16. arXiv:2207.03027  [pdf, other]

    cs.DB

    The Case for Distributed Shared-Memory Databases with RDMA-Enabled Memory Disaggregation

    Authors: Ruihong Wang, Jianguo Wang, Stratos Idreos, M. Tamer Özsu, Walid G. Aref

    Abstract: Memory disaggregation (MD) allows for scalable and elastic data center design by separating compute (CPU) from memory. With MD, compute and memory are no longer coupled into the same server box. Instead, they are connected to each other via ultra-fast networking such as RDMA. MD can bring many advantages, e.g., higher memory utilization, better independent scaling (of compute and memory), and lowe…

    Submitted 6 July, 2022; originally announced July 2022.

  17. arXiv:2207.00550  [pdf, other]

    cs.DB cs.LG

    The "AI+R"-tree: An Instance-optimized R-tree

    Authors: Abdullah-Al-Mamun, Ch. Md. Rakin Haider, Jianguo Wang, Walid G. Aref

    Abstract: The emerging class of instance-optimized systems has shown potential to achieve high performance by specializing to a specific data and query workloads. Particularly, Machine Learning (ML) techniques have been applied successfully to build various instance-optimized components (e.g., learned indexes). This paper investigates to leverage ML techniques to enhance the performance of spatial indexes,…

    Submitted 1 July, 2022; originally announced July 2022.

    Comments: To appear in the proceedings of The 23rd IEEE International Conference on Mobile Data Management (2022)

  18. arXiv:2206.09520  [pdf, other]

    cs.DB

    ILX: Intelligent "Location+X" Data Systems (Vision Paper)

    Authors: Walid G. Aref, Ahmed M. Aly, Anas Daghistani, Yeasir Rayhan, Jianguo Wang, Libin Zhou

    Abstract: Due to the ubiquity of mobile phones and location-detection devices, location data is being generated in very large volumes. Queries and operations that are performed on location data warrant the use of database systems. Despite that, location data is being supported in data systems as an afterthought. Typically, relational or NoSQL data systems that are mostly designed with non-location data in m…

    Submitted 1 August, 2022; v1 submitted 19 June, 2022; originally announced June 2022.

  19. An Experimental Evaluation and Investigation of Waves of Misery in R-trees

    Authors: Lu Xing, Eric Lee, Tong An, Bo-Cheng Chu, Ahmed Mahmood, Ahmed M. Aly, Jianguo Wang, Walid G. Aref

    Abstract: Waves of misery is a phenomenon where spikes of many node splits occur over short periods of time in tree indexes. Waves of misery negatively affect the performance of tree indexes in insertion-heavy workloads. Waves of misery have been first observed in the context of the B-tree, where these waves cause unpredictable index performance. In particular, the performance of search and index-update oper…

    Submitted 24 December, 2021; originally announced December 2021.

    Comments: To appear in VLDB 2022

  20. arXiv:2110.01767  [pdf, ps, other]

    cs.DB

    Scalable Relational Query Processing on Big Matrix Data

    Authors: Yongyang Yu, Mingjie Tang, Walid G. Aref

    Abstract: The use of large-scale machine learning methods is becoming ubiquitous in many applications ranging from business intelligence to self-driving cars. These methods require a complex computation pipeline consisting of various types of operations, e.g., relational operations for pre-processing or post-processing the dataset, and matrix operations for core model computations. Many existing systems foc…

    Submitted 9 November, 2021; v1 submitted 4 October, 2021; originally announced October 2021.

    Comments: 29 pages, 11 figures, 6 tables

  21. arXiv:2008.13028  [pdf, other]

    cs.DB cs.HC

    STULL: Unbiased Online Sampling for Visual Exploration of Large Spatiotemporal Data

    Authors: Guizhen Wang, Jingjing Guo, Mingjie Tang, José Florencio de Queiroz Neto, Calvin Yau, Anas Daghistani, Morteza Karimzadeh, Walid G. Aref, David S. Ebert

    Abstract: Online sampling-supported visual analytics is increasingly important, as it allows users to explore large datasets with acceptable approximate answers at interactive rates. However, existing online spatiotemporal sampling techniques are often biased, as most researchers have primarily focused on reducing computational latency. Biased sampling approaches select data with unequal probabilities and p…

    Submitted 29 August, 2020; originally announced August 2020.

    Comments: IEEE VIS (InfoVis/VAST/SciVis) 2020 ACM 2012 CCS - Human-centered computing, Visualization, Visualization design and evaluation methods

    ACM Class: H.3.3

  22. arXiv:2002.11862  [pdf, other]

    cs.DB

    SWARM: Adaptive Load Balancing in Distributed Streaming Systems for Big Spatial Data

    Authors: Anas Daghistani, Walid G. Aref, Arif Ghafoor, Ahmed R. Mahmood

    Abstract: The proliferation of GPS-enabled devices has led to the development of numerous location-based services. These services need to process massive amounts of spatial data in real-time. The current scale of spatial data cannot be handled using centralized systems. This has led to the development of distributed spatial streaming systems. Existing systems are using static spatial partitioning to distrib…

    Submitted 26 February, 2020; originally announced February 2020.

  23. arXiv:1907.03736  [pdf, other]

    cs.DB

    LocationSpark: In-memory Distributed Spatial Query Processing and Optimization

    Authors: Mingjie Tang, Yongyang Yu, Walid G. Aref, Ahmed R. Mahmood, Qutaibah M. Malluhi, Mourad Ouzzani

    Abstract: Due to the ubiquity of spatial data applications and the large amounts of spatial data that these applications generate and process, there is a pressing need for scalable spatial query processing. In this paper, we present new techniques for spatial query processing and optimization in an in-memory and distributed setup to address scalability. More specifically, we introduce new techniques for han…

    Submitted 16 July, 2019; v1 submitted 8 July, 2019; originally announced July 2019.

  24. arXiv:1712.09437  [pdf, other]

    cs.DB

    Pattern-Driven Data Cleaning

    Authors: El Kindi Rezig, Mourad Ouzzani, Walid G. Aref, Ahmed K. Elmagarmid, Ahmed R. Mahmood

    Abstract: Data is inherently dirty and there has been a sustained effort to come up with different approaches to clean it. A large class of data repair algorithms rely on data-quality rules and integrity constraints to detect and repair the data. A well-studied class of integrity constraints is Functional Dependencies (FDs, for short) that specify dependencies among attributes in a relation. In this paper,…

    Submitted 26 December, 2017; originally announced December 2017.

  25. arXiv:1712.08971  [pdf, other]

    cs.DB

    Human-Centric Data Cleaning [Vision]

    Authors: El Kindi Rezig, Mourad Ouzzani, Ahmed K. Elmagarmid, Walid G. Aref

    Abstract: Data Cleaning refers to the process of detecting and fixing errors in the data. Human involvement is instrumental at several stages of this process, e.g., to identify and repair errors, to validate computed repairs, etc. There is currently a plethora of data cleaning algorithms addressing a wide range of data errors (e.g., detecting duplicates, violations of integrity constraints, missing values,…

    Submitted 30 December, 2017; v1 submitted 24 December, 2017; originally announced December 2017.

  26. arXiv:1709.06723  [pdf, other]

    cs.DB

    SBG-Sketch: A Self-Balanced Sketch for Labeled-Graph Stream Summarization

    Authors: Mohamed S. Hassan, Bruno Ribeiro, Walid G. Aref

    Abstract: Applications in various domains rely on processing graph streams, e.g., communication logs of a cloud-troubleshooting system, road-network traffic updates, and interactions on a social network. A labeled-graph stream refers to a sequence of streamed edges that form a labeled graph. Label-aware applications need to filter the graph stream before performing a graph operation. Due to the large volume…

    Submitted 20 September, 2017; originally announced September 2017.

  27. arXiv:1709.06715  [pdf, other]

    cs.DB

    Empowering In-Memory Relational Database Engines with Native Graph Processing

    Authors: Mohamed S. Hassan, Tatiana Kuznetsova, Hyun Chai Jeong, Walid G. Aref, Mohammad Sadoghi

    Abstract: The plethora of graphs and relational data give rise to many interesting graph-relational queries in various domains, e.g., finding related proteins satisfying relational predicates in a biological network. The maturity of RDBMSs motivated academia and industry to invest efforts in leveraging RDBMSs for graph processing, where efficiency is proven for vital graph queries. However, none of these ef…

    Submitted 12 October, 2017; v1 submitted 19 September, 2017; originally announced September 2017.

  28. arXiv:1709.02533  [pdf, other]

    cs.DC

    Adaptive Processing of Spatial-Keyword Data Over a Distributed Streaming Cluster

    Authors: Ahmed R. Mahmood, Anas Daghistani, Ahmed M. Aly, Walid G. Aref, Mingjie Tang, Saleh Basalamah, Sunil Prabhakar

    Abstract: The widespread use of GPS-enabled smartphones along with the popularity of micro-blogging and social networking applications, e.g., Twitter and Facebook, has resulted in the generation of huge streams of geo-tagged textual data. Many applications require real-time processing of these streams. For example, location-based e-coupon and ad-targeting systems enable advertisers to register millions of a…

    Submitted 8 September, 2017; originally announced September 2017.

  29. arXiv:1709.02529  [pdf, other]

    cs.DB

    FAST: Frequency-Aware Spatio-Textual Indexing for In-Memory Continuous Filter Query Processing

    Authors: Ahmed R. Mahmood, Ahmed M. Aly, Walid G. Aref

    Abstract: Many applications need to process massive streams of spatio-textual data in real-time against continuous spatio-textual queries. For example, in location-aware ad targeting publish/subscribe systems, it is required to disseminate millions of ads and promotions to millions of users based on the locations and textual profiles of users. In this paper, we study indexing of continuous spatio-textual qu…

    Submitted 4 October, 2017; v1 submitted 8 September, 2017; originally announced September 2017.

  30. arXiv:1705.02044  [pdf, ps, other]

    cs.DS

    A Survey of Shortest-Path Algorithms

    Authors: Amgad Madkour, Walid G. Aref, Faizan Ur Rehman, Mohamed Abdur Rahman, Saleh Basalamah

    Abstract: A shortest-path algorithm finds a path containing the minimal cost between two vertices in a graph. A plethora of shortest-path algorithms is studied in the literature that span across multiple disciplines. This paper presents a survey of shortest-path algorithms based on a taxonomy that is introduced in the paper. One dimension of this taxonomy is the various flavors of the shortest-path problem.…

    Submitted 4 May, 2017; originally announced May 2017.

  31. arXiv:1412.4303  [pdf, other]

    cs.DB

    On Order-independent Semantics of the Similarity Group-By Relational Database Operator

    Authors: Mingjie Tang, Ruby Y. Tahboub, Walid G. Aref, Qutaibah M. Malluhi, Mourad Ouzzani

    Abstract: Similarity group-by (SGB, for short) has been proposed as a relational database operator to match the needs of emerging database applications. Many SGB operators that extend SQL have been proposed in the literature, e.g., similarity operators in the one-dimensional space. These operators have various semantics. Depending on how these operators are implemented, some of the implementations may lead…

    Submitted 13 December, 2014; originally announced December 2014.

    Comments: 13 pages

  32. arXiv:1208.0074  [pdf, other]

    cs.DB

    Spatial Queries with Two kNN Predicates

    Authors: Ahmed M. Aly, Walid G. Aref, Mourad Ouzzani

    Abstract: The widespread use of location-aware devices has led to countless location-based services in which a user query can be arbitrarily complex, i.e., one that embeds multiple spatial selection and join predicates. Amongst these predicates, the k-Nearest-Neighbor (kNN) predicate stands as one of the most important and widely used predicates. Unlike related research, this paper goes beyond the optimizat…

    Submitted 31 July, 2012; originally announced August 2012.

    Comments: VLDB2012

    Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 11, pp. 1100-1111 (2012)

  33. arXiv:cs/0612127  [pdf, ps, other]

    cs.DB

    bdbms -- A Database Management System for Biological Data

    Authors: Mohamed Y. Eltabakh, Mourad Ouzzani, Walid G. Aref

    Abstract: Biologists are increasingly using databases for storing and managing their data. Biological databases typically consist of a mixture of raw data, metadata, sequences, annotations, and related data obtained from various sources. Current database technology lacks several functionalities that are needed by biological databases. In this paper, we introduce bdbms, an extensible prototype database man…

    Submitted 22 December, 2006; originally announced December 2006.

    Comments: This article is published under a Creative Commons License Agreement (http://creativecommons.org/licenses/by/2.5/.) You may copy, distribute, display, and perform the work, make derivative works and make commercial use of the work, but, you must attribute the work to the author and CIDR 2007. 3rd Biennial Conference on Innovative Data Systems Research (CIDR) January 7-10, 2007, Asilomar, California, USA