-
Gbake: Baking 3D Gaussian Splats into Reflection Probes
Authors:
Stephen Pasch,
Joel K. Salzman,
Changxi Zheng
Abstract:
The growing popularity of 3D Gaussian Splatting has created the need to integrate traditional computer graphics techniques and assets in splatted environments. Since 3D Gaussian primitives encode lighting and geometry jointly as appearance, meshes are relit improperly when inserted directly in a mixture of 3D Gaussians and thus appear noticeably out of place. We introduce GBake, a specialized tool…
▽ More
The growing popularity of 3D Gaussian Splatting has created the need to integrate traditional computer graphics techniques and assets in splatted environments. Since 3D Gaussian primitives encode lighting and geometry jointly as appearance, meshes are relit improperly when inserted directly in a mixture of 3D Gaussians and thus appear noticeably out of place. We introduce GBake, a specialized tool for baking reflection probes from Gaussian-splatted scenes that enables realistic reflection mapping of traditional 3D meshes in the Unity game engine.
△ Less
Submitted 2 July, 2025;
originally announced July 2025.
-
Bottom-Up Perspectives on AI Governance: Insights from User Reviews of AI Products
Authors:
Stefan Pasch
Abstract:
With the growing importance of AI governance, numerous high-level frameworks and principles have been articulated by policymakers, institutions, and expert communities to guide the development and application of AI. While such frameworks offer valuable normative orientation, they may not fully capture the practical concerns of those who interact with AI systems in organizational and operational co…
▽ More
With the growing importance of AI governance, numerous high-level frameworks and principles have been articulated by policymakers, institutions, and expert communities to guide the development and application of AI. While such frameworks offer valuable normative orientation, they may not fully capture the practical concerns of those who interact with AI systems in organizational and operational contexts. To address this gap, this study adopts a bottom-up approach to explore how governance-relevant themes are expressed in user discourse. Drawing on over 100,000 user reviews of AI products from G2.com, we apply BERTopic to extract latent themes and identify those most semantically related to AI governance. The analysis reveals a diverse set of governance-relevant topics spanning both technical and non-technical domains. These include concerns across organizational processes-such as planning, coordination, and communication-as well as stages of the AI value chain, including deployment infrastructure, data handling, and analytics. The findings show considerable overlap with institutional AI governance and ethics frameworks on issues like privacy and transparency, but also surface overlooked areas such as project management, strategy development, and customer interaction. This highlights the need for more empirically grounded, user-centered approaches to AI governance-approaches that complement normative models by capturing how governance unfolds in applied settings. By foregrounding how governance is enacted in practice, this study contributes to more inclusive and operationally grounded approaches to AI governance and digital policy.
△ Less
Submitted 29 May, 2025;
originally announced June 2025.
-
AI vs. Human Judgment of Content Moderation: LLM-as-a-Judge and Ethics-Based Response Refusals
Authors:
Stefan Pasch
Abstract:
As large language models (LLMs) are increasingly deployed in high-stakes settings, their ability to refuse ethically sensitive prompts-such as those involving hate speech or illegal activities-has become central to content moderation and responsible AI practices. While refusal responses can be viewed as evidence of ethical alignment and safety-conscious behavior, recent research suggests that user…
▽ More
As large language models (LLMs) are increasingly deployed in high-stakes settings, their ability to refuse ethically sensitive prompts-such as those involving hate speech or illegal activities-has become central to content moderation and responsible AI practices. While refusal responses can be viewed as evidence of ethical alignment and safety-conscious behavior, recent research suggests that users may perceive them negatively. At the same time, automated assessments of model outputs are playing a growing role in both evaluation and training. In particular, LLM-as-a-Judge frameworks-in which one model is used to evaluate the output of another-are now widely adopted to guide benchmarking and fine-tuning. This paper examines whether such model-based evaluators assess refusal responses differently than human users. Drawing on data from Chatbot Arena and judgments from two AI judges (GPT-4o and Llama 3 70B), we compare how different types of refusals are rated. We distinguish ethical refusals, which explicitly cite safety or normative concerns (e.g., "I can't help with that because it may be harmful"), and technical refusals, which reflect system limitations (e.g., "I can't answer because I lack real-time data"). We find that LLM-as-a-Judge systems evaluate ethical refusals significantly more favorably than human users, a divergence not observed for technical refusals. We refer to this divergence as a moderation bias-a systematic tendency for model-based evaluators to reward refusal behaviors more than human users do. This raises broader questions about transparency, value alignment, and the normative assumptions embedded in automated evaluation systems.
△ Less
Submitted 21 May, 2025;
originally announced May 2025.
-
Human-AI Interaction and User Satisfaction: Empirical Evidence from Online Reviews of AI Products
Authors:
Stefan Pasch,
Sun-Young Ha
Abstract:
Human-AI Interaction (HAI) guidelines and design principles have become increasingly important in both industry and academia to guide the development of AI systems that align with user needs and expectations. However, large-scale empirical evidence on how HAI principles shape user satisfaction in practice remains limited. This study addresses that gap by analyzing over 100,000 user reviews of AI-r…
▽ More
Human-AI Interaction (HAI) guidelines and design principles have become increasingly important in both industry and academia to guide the development of AI systems that align with user needs and expectations. However, large-scale empirical evidence on how HAI principles shape user satisfaction in practice remains limited. This study addresses that gap by analyzing over 100,000 user reviews of AI-related products from G2, a leading review platform for business software and services. Based on widely adopted industry guidelines, we identify seven core HAI dimensions and examine their coverage and sentiment within the reviews. We find that the sentiment on four HAI dimensions-adaptability, customization, error recovery, and security-is positively associated with overall user satisfaction. Moreover, we show that engagement with HAI dimensions varies by professional background: Users with technical job roles are more likely to discuss system-focused aspects, such as reliability, while non-technical users emphasize interaction-focused features like customization and feedback. Interestingly, the relationship between HAI sentiment and overall satisfaction is not moderated by job role, suggesting that once an HAI dimension has been identified by users, its effect on satisfaction is consistent across job roles.
△ Less
Submitted 24 March, 2025; v1 submitted 23 March, 2025;
originally announced March 2025.
-
LLM Content Moderation and User Satisfaction: Evidence from Response Refusals in Chatbot Arena
Authors:
Stefan Pasch
Abstract:
LLM safety and ethical alignment are widely discussed, but the impact of content moderation on user satisfaction remains underexplored. In particular, little is known about how users respond when models refuse to answer a prompt-one of the primary mechanisms used to enforce ethical boundaries in LLMs. We address this gap by analyzing nearly 50,000 model comparisons from Chatbot Arena, a platform w…
▽ More
LLM safety and ethical alignment are widely discussed, but the impact of content moderation on user satisfaction remains underexplored. In particular, little is known about how users respond when models refuse to answer a prompt-one of the primary mechanisms used to enforce ethical boundaries in LLMs. We address this gap by analyzing nearly 50,000 model comparisons from Chatbot Arena, a platform where users indicate their preferred LLM response in pairwise matchups, providing a large-scale setting for studying real-world user preferences. Using a novel RoBERTa-based refusal classifier fine-tuned on a hand-labeled dataset, we distinguish between refusals due to ethical concerns and technical limitations. Our results reveal a substantial refusal penalty: ethical refusals yield significantly lower win rates than both technical refusals and standard responses, indicating that users are especially dissatisfied when models decline a task for ethical reasons. However, this penalty is not uniform. Refusals receive more favorable evaluations when the underlying prompt is highly sensitive (e.g., involving illegal content), and when the refusal is phrased in a detailed and contextually aligned manner. These findings underscore a core tension in LLM design: safety-aligned behaviors may conflict with user expectations, calling for more adaptive moderation strategies that account for context and presentation.
△ Less
Submitted 15 May, 2025; v1 submitted 4 January, 2025;
originally announced January 2025.
-
Multilingual De-Duplication Strategies: Applying scalable similarity search with monolingual & multilingual embedding models
Authors:
Stefan Pasch,
Dimitirios Petridis,
Jannic Cutura
Abstract:
This paper addresses the deduplication of multilingual textual data using advanced NLP tools. We compare a two-step method involving translation to English followed by embedding with mpnet, and a multilingual embedding model (distiluse). The two-step approach achieved a higher F1 score (82% vs. 60%), particularly with less widely used languages, which can be increased up to 89% by leveraging exper…
▽ More
This paper addresses the deduplication of multilingual textual data using advanced NLP tools. We compare a two-step method involving translation to English followed by embedding with mpnet, and a multilingual embedding model (distiluse). The two-step approach achieved a higher F1 score (82% vs. 60%), particularly with less widely used languages, which can be increased up to 89% by leveraging expert rules based on domain knowledge. We also highlight limitations related to token length constraints and computational efficiency. Our methodology suggests improvements for future multilingual deduplication tasks.
△ Less
Submitted 19 June, 2024;
originally announced June 2024.
-
Ahead of the Text: Leveraging Entity Preposition for Financial Relation Extraction
Authors:
Stefan Pasch,
Dimitrios Petridis
Abstract:
In the context of the ACM KDF-SIGIR 2023 competition, we undertook an entity relation task on a dataset of financial entity relations called REFind. Our top-performing solution involved a multi-step approach. Initially, we inserted the provided entities at their corresponding locations within the text. Subsequently, we fine-tuned the transformer-based language model roberta-large for text classifi…
▽ More
In the context of the ACM KDF-SIGIR 2023 competition, we undertook an entity relation task on a dataset of financial entity relations called REFind. Our top-performing solution involved a multi-step approach. Initially, we inserted the provided entities at their corresponding locations within the text. Subsequently, we fine-tuned the transformer-based language model roberta-large for text classification by utilizing a labeled training set to predict the entity relations. Lastly, we implemented a post-processing phase to identify and handle improbable predictions generated by the model. As a result of our methodology, we achieved the 1st place ranking on the competition's public leaderboard.
△ Less
Submitted 8 August, 2023;
originally announced August 2023.
-
CultureBERT: Measuring Corporate Culture With Transformer-Based Language Models
Authors:
Sebastian Koch,
Stefan Pasch
Abstract:
This paper introduces transformer-based language models to the literature measuring corporate culture from text documents. We compile a unique data set of employee reviews that were labeled by human evaluators with respect to the information the reviews reveal about the firms' corporate culture. Using this data set, we fine-tune state-of-the-art transformer-based language models to perform the sam…
▽ More
This paper introduces transformer-based language models to the literature measuring corporate culture from text documents. We compile a unique data set of employee reviews that were labeled by human evaluators with respect to the information the reviews reveal about the firms' corporate culture. Using this data set, we fine-tune state-of-the-art transformer-based language models to perform the same classification task. In out-of-sample predictions, our language models classify 17 to 30 percentage points more of employee reviews in line with human evaluators than traditional approaches of text classification. We make our models publicly available.
△ Less
Submitted 25 January, 2024; v1 submitted 1 December, 2022;
originally announced December 2022.
-
StonkBERT: Can Language Models Predict Medium-Run Stock Price Movements?
Authors:
Stefan Pasch,
Daniel Ehnes
Abstract:
To answer this question, we fine-tune transformer-based language models, including BERT, on different sources of company-related text data for a classification task to predict the one-year stock price performance. We use three different types of text data: News articles, blogs, and annual reports. This allows us to analyze to what extent the performance of language models is dependent on the type…
▽ More
To answer this question, we fine-tune transformer-based language models, including BERT, on different sources of company-related text data for a classification task to predict the one-year stock price performance. We use three different types of text data: News articles, blogs, and annual reports. This allows us to analyze to what extent the performance of language models is dependent on the type of the underlying document. StonkBERT, our transformer-based stock performance classifier, shows substantial improvement in predictive accuracy compared to traditional language models. The highest performance was achieved with news articles as text source. Performance simulations indicate that these improvements in classification accuracy also translate into above-average stock market returns.
△ Less
Submitted 4 February, 2022;
originally announced February 2022.