-
Query Rewriting via LLMs
Authors:
Sriram Dharwada,
Himanshu Devrani,
Jayant Haritsa,
Harish Doraiswamy
Abstract:
When complex SQL queries suffer slow executions despite query optimization, DBAs typically invoke automated query rewriting tools to recommend "lean" equivalents that are conducive to faster execution. The rewritings are usually achieved via transformation rules, but these rules are limited in scope and difficult to update in a production system. Recently, LLM-based techniques have also been suggested, but they are prone to semantic and syntactic errors.
We investigate here how the remarkable cognitive capabilities of LLMs can be leveraged for performant query rewriting while incorporating safeguards and optimizations to ensure correctness and efficiency. Our study shows that these goals can be progressively achieved through incorporation of (a) an ensemble suite of basic prompts, (b) database-sensitive prompts via redundancy removal and selectivity-based rewriting rules, and (c) LLM token probability-guided rewrite paths. Further, a suite of logic-based and statistical tools can be used to check for semantic violations in the rewrites prior to DBA consideration.
We have implemented the above LLM-infused techniques in the LITHE system, and evaluated complex analytic queries from standard benchmarks on contemporary database platforms. The results show significant performance improvements for slow queries, with regard to both abstract costing and actual execution, over both SOTA techniques and the native query optimizer. For instance, with TPC-DS on PostgreSQL, the geometric mean of the runtime speedups for slow queries was as high as 13.2 over the native optimizer, whereas SOTA delivered 4.9 in comparison.
Overall, LITHE is a promising step toward viable LLM-based advisory tools for ameliorating enterprise query performance.
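The summary statistic quoted in this abstract, a geometric mean of per-query runtime speedups, can be illustrated with a minimal sketch. The helper name `geometric_mean_speedup` and the timings are hypothetical placeholders chosen for illustration; they are not taken from the paper or from LITHE.

```python
import math


def geometric_mean_speedup(baseline_times, rewritten_times):
    """Geometric mean of per-query speedups (baseline / rewritten).

    Hypothetical helper for illustration only; not part of LITHE.
    """
    if not baseline_times or len(baseline_times) != len(rewritten_times):
        raise ValueError("need two non-empty lists of equal length")
    # Average the logs of the ratios, then exponentiate; this avoids
    # overflow that a direct product of many ratios could cause.
    log_sum = sum(
        math.log(base / rewritten)
        for base, rewritten in zip(baseline_times, rewritten_times)
    )
    return math.exp(log_sum / len(baseline_times))


# Placeholder runtimes in seconds (made up, not measured results).
baseline = [8.0, 27.0, 1.0]
rewritten = [2.0, 3.0, 1.0]
gm = geometric_mean_speedup(baseline, rewritten)
```

The geometric mean is the usual choice for averaging ratios because a single very large speedup influences it less than it would an arithmetic mean.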
Submitted 10 June, 2025; v1 submitted 18 February, 2025;
originally announced February 2025.
-
TopoMap++: A faster and more space efficient technique to compute projections with topological guarantees
Authors:
Vitoria Guardieiro,
Felipe Inagaki de Oliveira,
Harish Doraiswamy,
Luis Gustavo Nonato,
Claudio Silva
Abstract:
High-dimensional data, characterized by many features, can be difficult to visualize effectively. Dimensionality reduction techniques, such as PCA, UMAP, and t-SNE, address this challenge by projecting the data into a lower-dimensional space while preserving important relationships. TopoMap is another technique that excels at preserving the underlying structure of the data, leading to interpretable visualizations. In particular, TopoMap maps the high-dimensional data into a visual space, guaranteeing that the 0-dimensional persistence diagram of the Rips filtration of the visual space matches the one from the high-dimensional data. However, the original TopoMap algorithm can be slow and its layout can be too sparse for large and complex datasets. In this paper, we propose three improvements to TopoMap: 1) a more space-efficient layout, 2) a significantly faster implementation, and 3) a novel TreeMap-based representation that makes use of the topological hierarchy to aid the exploration of the projections. These advancements make TopoMap, now referred to as TopoMap++, a more powerful tool for visualizing high-dimensional data, which we demonstrate through different use case scenarios.
Submitted 11 September, 2024;
originally announced September 2024.
-
GPU-Powered Spatial Database Engine for Commodity Hardware: Extended Version
Authors:
Harish Doraiswamy,
Juliana Freire
Abstract:
Given the massive growth in the volume of spatial data, there is a great need for systems that can efficiently evaluate spatial queries over large data sets. These queries are notoriously expensive using traditional database solutions. While faster response times can be attained through powerful clusters or servers with large main-memory, these options, due to cost and complexity, are out of reach to many data scientists and analysts making up the long tail.
Graphics Processing Units (GPUs), which are now widely available even in commodity desktops and laptops, provide a cost-effective alternative to support high-performance computing, opening up new opportunities to the efficient evaluation of spatial queries. While GPU-based approaches proposed in the literature have shown great improvements in performance, they are tied to specific GPU hardware and only handle specific queries over fixed geometry types.
In this paper we present SPADE, a GPU-powered spatial database engine that supports a rich set of spatial queries. We discuss the challenges involved in attaining efficient query evaluation over large datasets as well as portability across different GPU hardware, and how these are addressed in SPADE. We performed a detailed experimental evaluation to assess the effectiveness of the system for a wide range of queries and datasets, and report results which show that SPADE is scalable and able to handle data larger than main-memory, and that its performance on a laptop is on par with that of other systems that require clusters or large-memory servers.
Submitted 27 March, 2022;
originally announced March 2022.
-
Topological Representations of Local Explanations
Authors:
Peter Xenopoulos,
Gromit Chan,
Harish Doraiswamy,
Luis Gustavo Nonato,
Brian Barr,
Claudio Silva
Abstract:
Local explainability methods -- those which seek to generate an explanation for each prediction -- are becoming increasingly prevalent due to the need for practitioners to rationalize their model outputs. However, comparing local explainability methods is difficult since they each generate outputs in various scales and dimensions. Furthermore, due to the stochastic nature of some explainability methods, it is possible for different runs of a method to produce contradictory explanations for a given observation. In this paper, we propose a topology-based framework to extract a simplified representation from a set of local explanations. We do so by first modeling the relationship between the explanation space and the model predictions as a scalar function. Then, we compute the topological skeleton of this function. This topological skeleton acts as a signature for such functions, which we use to compare different explanation methods. We demonstrate that our framework not only reliably identifies differences between explainability techniques but also provides stable representations. Then, we show how our framework can be used to identify appropriate parameters for local explainability methods. Our framework is simple, does not require complex optimizations, and can be broadly applied to most local explanation methods. We believe the practicality and versatility of our approach will help promote topology-based approaches as a tool for understanding and comparing explanation methods.
Submitted 6 January, 2022;
originally announced January 2022.
-
UrbanRama: Navigating Cities in Virtual Reality
Authors:
Shaoyu Chen,
Fabio Miranda,
Nivan Ferreira,
Marcos Lage,
Harish Doraiswamy,
Corinne Brenner,
Connor Defanti,
Michael Koutsoubis,
Luc Wilson,
Ken Perlin,
Claudio Silva
Abstract:
Exploring large virtual environments, such as cities, is a central task in several domains, such as gaming and urban planning. VR systems can greatly help this task by providing an immersive experience; however, a common issue with viewing and navigating a city in the traditional sense is that users can either obtain a local or a global view, but not both at the same time, requiring them to continuously switch between perspectives, losing context and distracting them from their analysis. In this paper, our goal is to allow users to navigate to points of interest without changing perspectives. To accomplish this, we design an intuitive navigation interface that takes advantage of the strong sense of spatial presence provided by VR. We supplement this interface with a perspective that warps the environment, called UrbanRama, based on a cylindrical projection, providing a mix of local and global views. The design of this interface was performed as an iterative process in collaboration with architects and urban planners. We conducted a qualitative and a quantitative pilot user study to evaluate UrbanRama and the results indicate the effectiveness of our system in reducing perspective changes, while ensuring that the warping doesn't affect distance and orientation perception.
Submitted 11 December, 2021;
originally announced December 2021.
-
Valuing Player Actions in Counter-Strike: Global Offensive
Authors:
Peter Xenopoulos,
Harish Doraiswamy,
Claudio Silva
Abstract:
Esports, despite its expanding interest, lacks fundamental sports analytics resources such as accessible data or proven and reproducible analytical frameworks. Even Counter-Strike: Global Offensive (CSGO), the second most popular esport, suffers from these problems. Thus, quantitative evaluation of CSGO players, a task important to teams, media, bettors and fans, is difficult. To address this, we introduce (1) a data model for CSGO with an open-source implementation; (2) a graph distance measure for defining distances in CSGO; and (3) a context-aware framework to value players' actions based on changes in their team's chances of winning. Using over 70 million in-game CSGO events, we demonstrate our framework's consistency and independence compared to existing valuation frameworks. We also provide use cases demonstrating high-impact play identification and uncertainty estimation.
Submitted 4 November, 2020; v1 submitted 2 November, 2020;
originally announced November 2020.
-
The Case for Distance-Bounded Spatial Approximations
Authors:
Eleni Tzirita Zacharatou,
Andreas Kipf,
Ibrahim Sabek,
Varun Pandey,
Harish Doraiswamy,
Volker Markl
Abstract:
Spatial approximations have been traditionally used in spatial databases to accelerate the processing of complex geometric operations. However, approximations are typically only used in a first filtering step to determine a set of candidate spatial objects that may fulfill the query condition. To provide accurate results, the exact geometries of the candidate objects are tested against the query condition, which is typically an expensive operation. Nevertheless, many emerging applications (e.g., visualization tools) require interactive responses, while only needing approximate results. Besides, real-world geospatial data is inherently imprecise, which makes exact data processing unnecessary. Given the uncertainty associated with spatial data and the relaxed precision requirements of many applications, this vision paper advocates for approximate spatial data processing techniques that omit exact geometric tests and provide final answers solely on the basis of (fine-grained) approximations. Thanks to recent hardware advances, this vision can be realized today. Furthermore, our approximate techniques employ a distance-based error bound, i.e., a bound on the maximum spatial distance between false (or missing) and exact results, which is crucial for meaningful analyses. This bound makes it possible to control the precision of the approximation and trade accuracy for performance.
Submitted 21 January, 2021; v1 submitted 23 October, 2020;
originally announced October 2020.
-
TopoMap: A 0-dimensional Homology Preserving Projection of High-Dimensional Data
Authors:
Harish Doraiswamy,
Julien Tierny,
Paulo J. S. Silva,
Luis Gustavo Nonato,
Claudio Silva
Abstract:
Multidimensional Projection is a fundamental tool for high-dimensional data analytics and visualization. With very few exceptions, projection techniques are designed to map data from a high-dimensional space to a visual space so as to preserve some dissimilarity (similarity) measure, such as the Euclidean distance. In fact, although adopting distinct mathematical formulations designed to favor different aspects of the data, most multidimensional projection methods strive to preserve dissimilarity measures that encapsulate geometric properties such as distances or the proximity relation between data objects. However, geometric relations are not the only interesting property to be preserved in a projection. For instance, the analysis of particular structures such as clusters and outliers could be more reliably performed if the mapping process gives some guarantee as to topological invariants such as connected components and loops. This paper introduces TopoMap, a novel projection technique which provides topological guarantees during the mapping process. In particular, the proposed method performs the mapping from a high-dimensional space to a visual space, while preserving the 0-dimensional persistence diagram of the Rips filtration of the high-dimensional data, ensuring that the filtrations generate the same connected components when applied to the original as well as projected data. The presented case studies show that the topological guarantee provided by TopoMap not only brings confidence to the visual analytic process but also can be used to assist in the assessment of other projection methods.
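The guarantee stated in this abstract can be checked on small inputs with a short sketch. It relies on a standard fact: in a Rips filtration, connected components merge exactly at the edge lengths of a Euclidean minimum spanning tree, so two point sets share a 0-dimensional persistence diagram when their sorted MST edge lengths agree. The code below is not the TopoMap algorithm; the function name `zero_dim_merge_distances` and the example points are hypothetical, and the filtration parameter is assumed to be the pairwise distance itself (some conventions use half that value).

```python
import math
from itertools import combinations


def zero_dim_merge_distances(points):
    """Sorted distances at which Rips components merge (MST edge lengths).

    Hypothetical helper: runs Kruskal's algorithm with a union-find over
    all pairwise Euclidean distances. Quadratic in the number of points.
    """
    parent = list(range(len(points)))

    def find(i):
        # Path-halving find for the union-find structure.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    merges = []
    for dist, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Two components become one at this filtration value.
            parent[root_i] = root_j
            merges.append(dist)
    return merges


# Made-up 3D points and a made-up 1D layout with the same merge distances.
high_dim = [(0, 0, 0), (1, 0, 0), (1, 2, 0), (5, 2, 0)]
projected = [(0,), (1,), (3,), (7,)]
same_diagram = all(
    math.isclose(a, b)
    for a, b in zip(zero_dim_merge_distances(high_dim),
                    zero_dim_merge_distances(projected))
)
```

Comparing the two sorted lists is only a verification step; producing a layout that satisfies it is the part the paper addresses.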
Submitted 3 September, 2020;
originally announced September 2020.
-
Urban Mosaic: Visual Exploration of Streetscapes Using Large-Scale Image Data
Authors:
Fabio Miranda,
Maryam Hosseini,
Marcos Lage,
Harish Doraiswamy,
Graham Dove,
Claudio T. Silva
Abstract:
Urban planning is increasingly data driven, yet the challenge of designing with data at a city scale and remaining sensitive to the impact at a human scale is as important today as it was for Jane Jacobs. We address this challenge with Urban Mosaic, a tool for exploring the urban fabric through a spatially and temporally dense data set of 7.7 million street-level images from New York City, captured over the period of a year. Working in collaboration with professional practitioners, we use Urban Mosaic to investigate questions of accessibility and mobility, and preservation and retrofitting. In doing so, we demonstrate how tools such as this might provide a bridge between the city and the street, by supporting activities such as visual comparison of geographically distant neighborhoods, and temporal analysis of unfolding urban development.
Submitted 30 August, 2020;
originally announced August 2020.
-
A GPU-friendly Geometric Data Model and Algebra for Spatial Queries: Extended Version
Authors:
Harish Doraiswamy,
Juliana Freire
Abstract:
The availability of low-cost sensors has led to an unprecedented growth in the volume of spatial data. However, the time required to evaluate even simple spatial queries over large data sets greatly hampers our ability to interactively explore these data sets and extract actionable insights. Graphics Processing Units (GPUs) are increasingly being used to speed up spatial queries. However, existing GPU-based solutions have two important drawbacks: they are often tightly coupled to the specific query types they target, making it hard to adapt them for other queries; and since their design is based on CPU-based approaches, it can be difficult to effectively utilize all the benefits provided by the GPU. As a first step towards making GPU spatial query processing mainstream, we propose a new model that represents spatial data as geometric objects and define an algebra consisting of GPU-friendly composable operators that operate over these objects. We demonstrate the expressiveness of the proposed algebra by formulating standard spatial queries as algebraic expressions. We also present a proof-of-concept prototype that supports a subset of the operators and show that it is at least two orders of magnitude faster than a CPU-based implementation. This performance gain is obtained both using a discrete Nvidia mobile GPU and the less powerful integrated GPUs common in commodity laptops.
Submitted 7 April, 2020;
originally announced April 2020.
-
Shadow Accrual Maps: Efficient Accumulation of City-Scale Shadows Over Time
Authors:
Fabio Miranda,
Harish Doraiswamy,
Marcos Lage,
Luc Wilson,
Mondrian Hsieh,
Claudio T. Silva
Abstract:
Large-scale shadows from buildings in a city play an important role in determining the environmental quality of public spaces. They can be both beneficial, such as for pedestrians during summer, and detrimental, by impacting vegetation and by blocking direct sunlight. Determining the effects of shadows requires the accumulation of shadows over time across different periods in a year. In this paper, we propose a simple yet efficient class of approaches that use the properties of sun movement to track the changing position of shadows within a fixed time interval. We use this approach to extend two commonly used shadowing techniques, shadow maps and ray tracing, and demonstrate the efficiency of our approach. Our technique is used to develop an interactive visual analysis system, Shadow Profiler, targeted at city planners and architects that allows them to test the impact of shadows for different development scenarios. We validate the usefulness of this system through case studies set in Manhattan, a dense borough of New York City.
Submitted 9 July, 2019;
originally announced July 2019.
-
Unwind: Interactive Fish Straightening
Authors:
Francis Williams,
Alexander Bock,
Harish Doraiswamy,
Cassandra Donatelli,
Kayla Hall,
Adam Summers,
Daniele Panozzo,
Cláudio T. Silva
Abstract:
The ScanAllFish project is a large-scale effort to scan all the world's 33,100 known species of fishes. It has already generated thousands of volumetric CT scans of fish species which are available on open access platforms such as the Open Science Framework. To achieve a scanning rate required for a project of this magnitude, many specimens are grouped together into a single tube and scanned all at once. The resulting data contain many fish which are often bent and twisted to fit into the scanner. Our system, Unwind, is a novel interactive visualization and processing tool which extracts, unbends, and untwists volumetric images of fish with minimal user interaction. Our approach enables scientists to interactively unwarp these volumes to remove the undesired torque and bending using a piecewise-linear skeleton extracted by averaging isosurfaces of a harmonic function connecting the head and tail of each fish. The result is a volumetric dataset of an individual, straight fish in a canonical pose defined by the marine biologist expert user. We have developed Unwind in collaboration with a team of marine biologists: our system has been deployed in their labs, and is presently being used for dataset construction, biomechanical analysis, and the generation of figures for scientific publication.
Submitted 5 February, 2020; v1 submitted 9 April, 2019;
originally announced April 2019.
-
SONYC: A System for the Monitoring, Analysis and Mitigation of Urban Noise Pollution
Authors:
Juan Pablo Bello,
Claudio Silva,
Oded Nov,
R. Luke DuBois,
Anish Arora,
Justin Salamon,
Charles Mydlarz,
Harish Doraiswamy
Abstract:
We present the Sounds of New York City (SONYC) project, a smart cities initiative focused on developing a cyber-physical system for the monitoring, analysis and mitigation of urban noise pollution. Noise pollution is one of the topmost quality of life issues for urban residents in the U.S. with proven effects on health, education, the economy, and the environment. Yet, most cities lack the resources to continuously monitor noise and understand the contribution of individual sources, the tools to analyze patterns of noise pollution at city-scale, and the means to empower city agencies to take effective, data-driven action for noise mitigation. The SONYC project advances novel technological and socio-technical solutions that help address these needs.
SONYC includes a distributed network of both sensors and people for large-scale noise monitoring. The sensors use low-cost, low-power technology, and cutting-edge machine listening techniques, to produce calibrated acoustic measurements and recognize individual sound sources in real time. Citizen science methods are used to help urban residents connect to city agencies and each other, understand their noise footprint, and facilitate reporting and self-regulation. Crucially, SONYC utilizes big data solutions to analyze, retrieve and visualize information from sensors and citizens, creating a comprehensive acoustic model of the city that can be used to identify significant patterns of noise pollution. These data can be used to drive the strategic application of noise code enforcement by city agencies to optimize the reduction of noise pollution. The entire system, integrating cyber, physical and social infrastructure, forms a closed loop of continuous sensing, analysis and actuation on the environment.
SONYC provides a blueprint for the mitigation of noise pollution that can potentially be applied to other cities in the US and abroad.
Submitted 18 May, 2018; v1 submitted 2 May, 2018;
originally announced May 2018.
-
Data Polygamy: The Many-Many Relationships among Urban Spatio-Temporal Data Sets
Authors:
Fernando Chirigati,
Harish Doraiswamy,
Theodoros Damoulas,
Juliana Freire
Abstract:
The increasing ability to collect data from urban environments, coupled with a push towards openness by governments, has resulted in the availability of numerous spatio-temporal data sets covering diverse aspects of a city. Discovering relationships between these data sets can produce new insights by enabling domain experts to not only test but also generate hypotheses. However, discovering these relationships is difficult. First, a relationship between two data sets may occur only at certain locations and/or time periods. Second, the sheer number and size of the data sets, coupled with the diverse spatial and temporal scales at which the data is available, present computational challenges on all fronts, from indexing and querying to analyzing them. Finally, it is non-trivial to differentiate between meaningful and spurious relationships. To address these challenges, we propose Data Polygamy, a scalable topology-based framework that allows users to query for statistically significant relationships between spatio-temporal data sets. We have performed an experimental evaluation using over 300 spatio-temporal urban data sets which shows that our approach is scalable and effective at identifying interesting relationships.
Submitted 21 October, 2016;
originally announced October 2016.
-
Urban Pulse: Capturing the Rhythm of Cities
Authors:
Fabio Miranda,
Harish Doraiswamy,
Marcos Lage,
Kai Zhao,
Bruno Gonçalves,
Luc Wilson,
Mondrian Hsieh,
Cláudio T. Silva
Abstract:
Cities are inherently dynamic. Interesting patterns of behavior typically manifest at several key areas of a city over multiple temporal resolutions. Studying these patterns can greatly help a variety of experts ranging from city planners and architects to human behavioral experts. Recent technological innovations have enabled the collection of enormous amounts of data that can help in these studies. However, techniques using these data sets typically focus on understanding the data in the context of the city, thus failing to capture the dynamic aspects of the city. The goal of this work is to instead understand the city in the context of multiple urban data sets. To do so, we define the concept of an "urban pulse" which captures the spatio-temporal activity in a city across multiple temporal resolutions. The prominent pulses in a city are obtained using the topology of the data sets, and are characterized as a set of beats. The beats are then used to analyze and compare different pulses. We also design a visual exploration framework that allows users to explore the pulses within and across multiple cities under different conditions. Finally, we present three case studies carried out by experts from two different domains that demonstrate the utility of our framework.
Submitted 29 December, 2017; v1 submitted 24 August, 2016;
originally announced August 2016.