Search | arXiv e-print repository

Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

Authors: Ailin Huang, Bingxin Li, Bruce Wang, Boyong Wu, Chao Yan, Chengli Feng, Heng Wang, Hongyu Zhou, Hongyuan Wang, Jingbei Li, Jianjian Sun, Joanna Wang, Mingrui Chen, Peng Liu, Ruihang Miao, Shilei Jiang, Tian Fei, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Ge, Zheng Gong, Zhewei Huang , et al. (51 additional authors not shown)

Abstract: Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a du… ▽ More Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130-billion-parameter backbone LLM and a neural vocoder for high-fidelity speech synthesis. Our post-training approach employs interleaved token-output of text and audio to enhance semantic coherence and combines Direct Preference Optimization (DPO) with model merge to improve performance. Evaluations on the StepEval-Audio-360 benchmark demonstrate that Step-Audio-AQAA excels especially in speech control, outperforming the state-of-art LALMs in key areas. This work contributes a promising solution for end-to-end LALMs and highlights the critical role of token-based vocoder in enhancing overall performance for AQAA tasks. △ Less

Submitted 13 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

Comments: 12 pages, 3 figures

arXiv:2506.03388 [pdf, other]

Cross-Modal Urban Sensing: Evaluating Sound-Vision Alignment Across Street-Level and Aerial Imagery

Authors: Pengyu Chen, Xiao Huang, Teng Fei, Sicheng Wang

Abstract: Environmental soundscapes convey substantial ecological and social information regarding urban environments; however, their potential remains largely untapped in large-scale geographic analysis. In this study, we investigate the extent to which urban sounds correspond with visual scenes by comparing various visual representation strategies in capturing acoustic semantics. We employ a multimodal ap… ▽ More Environmental soundscapes convey substantial ecological and social information regarding urban environments; however, their potential remains largely untapped in large-scale geographic analysis. In this study, we investigate the extent to which urban sounds correspond with visual scenes by comparing various visual representation strategies in capturing acoustic semantics. We employ a multimodal approach that integrates geo-referenced sound recordings with both street-level and remote sensing imagery across three major global cities: London, New York, and Tokyo. Utilizing the AST model for audio, along with CLIP and RemoteCLIP for imagery, as well as CLIPSeg and Seg-Earth OV for semantic segmentation, we extract embeddings and class-level features to evaluate cross-modal similarity. The results indicate that street view embeddings demonstrate stronger alignment with environmental sounds compared to segmentation outputs, whereas remote sensing segmentation is more effective in interpreting ecological categories through a Biophony--Geophony--Anthrophony (BGA) framework. These findings imply that embedding-based models offer superior semantic alignment, while segmentation-based methods provide interpretable links between visual structure and acoustic ecology. This work advances the burgeoning field of multimodal urban sensing by offering novel perspectives for incorporating sound into geospatial analysis. △ Less

Submitted 3 June, 2025; originally announced June 2025.

arXiv:2505.12734 [pdf, ps, other]

SounDiT: Geo-Contextual Soundscape-to-Landscape Generation

Authors: Junbo Wang, Haofeng Tan, Bowen Liao, Albert Jiang, Teng Fei, Qixing Huang, Zhengzhong Tu, Shan Ye, Yuhao Kang

Abstract: We present a novel and practically significant problem-Geo-Contextual Soundscape-to-Landscape (GeoS2L) generation-which aims to synthesize geographically realistic landscape images from environmental soundscapes. Prior audio-to-image generation methods typically rely on general-purpose datasets and overlook geographic and environmental contexts, resulting in unrealistic images that are misaligned… ▽ More We present a novel and practically significant problem-Geo-Contextual Soundscape-to-Landscape (GeoS2L) generation-which aims to synthesize geographically realistic landscape images from environmental soundscapes. Prior audio-to-image generation methods typically rely on general-purpose datasets and overlook geographic and environmental contexts, resulting in unrealistic images that are misaligned with real-world environmental settings. To address this limitation, we introduce a novel geo-contextual computational framework that explicitly integrates geographic knowledge into multimodal generative modeling. We construct two large-scale geo-contextual multimodal datasets, SoundingSVI and SonicUrban, pairing diverse soundscapes with real-world landscape images. We propose SounDiT, a novel Diffusion Transformer (DiT)-based model that incorporates geo-contextual scene conditioning to synthesize geographically coherent landscape images. Furthermore, we propose a practically-informed geo-contextual evaluation framework, the Place Similarity Score (PSS), across element-, scene-, and human perception-levels to measure consistency between input soundscapes and generated landscape images. Extensive experiments demonstrate that SounDiT outperforms existing baselines in both visual fidelity and geographic settings. Our work not only establishes foundational benchmarks for GeoS2L generation but also highlights the importance of incorporating geographic domain knowledge in advancing multimodal generative models, opening new directions at the intersection of generative AI, geography, urban planning, and environmental sciences. △ Less

Submitted 19 May, 2025; originally announced May 2025.

Comments: 14 pages, 5 figures

arXiv:2503.23178 [pdf, other]

Intelligent Bear Prevention System Based on Computer Vision: An Approach to Reduce Human-Bear Conflicts in the Tibetan Plateau Area, China

Authors: Pengyu Chen, Teng Fei, Yunyan Du, Jiawei Yi, Yi Li, John A. Kupfer

Abstract: Conflicts between humans and bears on the Tibetan Plateau present substantial threats to local communities and hinder wildlife preservation initiatives. This research introduces a novel strategy that incorporates computer vision alongside Internet of Things (IoT) technologies to alleviate these issues. Tailored specifically for the harsh environment of the Tibetan Plateau, the approach utilizes th… ▽ More Conflicts between humans and bears on the Tibetan Plateau present substantial threats to local communities and hinder wildlife preservation initiatives. This research introduces a novel strategy that incorporates computer vision alongside Internet of Things (IoT) technologies to alleviate these issues. Tailored specifically for the harsh environment of the Tibetan Plateau, the approach utilizes the K210 development board paired with the YOLO object detection framework along with a tailored bear-deterrent mechanism, offering minimal energy usage and real-time efficiency in bear identification and deterrence. The model's performance was evaluated experimentally, achieving a mean Average Precision (mAP) of 91.4%, demonstrating excellent precision and dependability. By integrating energy-efficient components, the proposed system effectively surpasses the challenges of remote and off-grid environments, ensuring uninterrupted operation in secluded locations. This study provides a viable, eco-friendly, and expandable solution to mitigate human-bear conflicts, thereby improving human safety and promoting bear conservation in isolated areas like Yushu, China. △ Less

Submitted 29 March, 2025; originally announced March 2025.

arXiv:2411.18997 [pdf, other]

GRU-PFG: Extract Inter-Stock Correlation from Stock Factors with Graph Neural Network

Authors: Yonggai Zhuang, Haoran Chen, Kequan Wang, Teng Fei

Abstract: The complexity of stocks and industries presents challenges for stock prediction. Currently, stock prediction models can be divided into two categories. One category, represented by GRU and ALSTM, relies solely on stock factors for prediction, with limited effectiveness. The other category, represented by HIST and TRA, incorporates not only stock factors but also industry information, industry fin… ▽ More The complexity of stocks and industries presents challenges for stock prediction. Currently, stock prediction models can be divided into two categories. One category, represented by GRU and ALSTM, relies solely on stock factors for prediction, with limited effectiveness. The other category, represented by HIST and TRA, incorporates not only stock factors but also industry information, industry financial reports, public sentiment, and other inputs for prediction. The second category of models can capture correlations between stocks by introducing additional information, but the extra data is difficult to standardize and generalize. Considering the current state and limitations of these two types of models, this paper proposes the GRU-PFG (Project Factors into Graph) model. This model only takes stock factors as input and extracts inter-stock correlations using graph neural networks. It achieves prediction results that not only outperform the others models relies solely on stock factors, but also achieve comparable performance to the second category models. The experimental results show that on the CSI300 dataset, the IC of GRU-PFG is 0.134, outperforming HIST's 0.131 and significantly surpassing GRU and Transformer, achieving results better than the second category models. Moreover as a model that relies solely on stock factors, it has greater potential for generalization. △ Less

Submitted 28 November, 2024; originally announced November 2024.

Comments: 17pages

arXiv:2404.01250 [pdf, other]

Perceptogram: Reconstructing Visual Percepts from EEG

Authors: Teng Fei, Abhinav Uppal, Ian Jackson, Srinivas Ravishankar, David Wang, Virginia R. de Sa

Abstract: Visual neural decoding from EEG has improved significantly due to diffusion models that can reconstruct high-quality images from decoded latents. While recent works have focused on relatively complex architectures to achieve good reconstruction performance from EEG, less attention has been paid to the source of this information. In this work, we attempt to discover EEG features that represent perc… ▽ More Visual neural decoding from EEG has improved significantly due to diffusion models that can reconstruct high-quality images from decoded latents. While recent works have focused on relatively complex architectures to achieve good reconstruction performance from EEG, less attention has been paid to the source of this information. In this work, we attempt to discover EEG features that represent perceptual and semantic visual categories, using a simple pipeline. Notably, the high temporal resolution of EEG allows us to go beyond static semantic maps as obtained from fMRI. We show (a) Training a simple linear decoder from EEG to CLIP latent space, followed by a frozen pre-trained diffusion model, is sufficient to decode images with state-of-the-art reconstruction performance. (b) Mapping the decoded latents back to EEG using a linear encoder isolates CLIP-relevant EEG spatiotemporal features. (c) By using other latent spaces representing lower-level image features, we obtain similar time-courses of texture/hue-related information. We thus use our framework, Perceptogram, to probe EEG signals at various levels of the visual information hierarchy. △ Less

Submitted 24 February, 2025; v1 submitted 1 April, 2024; originally announced April 2024.

arXiv:2203.09611 [pdf, other]

doi 10.1080/13658816.2022.2053980

STICC: A multivariate spatial clustering method for repeated geographic pattern discovery with consideration of spatial contiguity

Authors: Yuhao Kang, Kunlin Wu, Song Gao, Ignavier Ng, Jinmeng Rao, Shan Ye, Fan Zhang, Teng Fei

Abstract: Spatial clustering has been widely used for spatial data mining and knowledge discovery. An ideal multivariate spatial clustering should consider both spatial contiguity and aspatial attributes. Existing spatial clustering approaches may face challenges for discovering repeated geographic patterns with spatial contiguity maintained. In this paper, we propose a Spatial Toeplitz Inverse Covariance-B… ▽ More Spatial clustering has been widely used for spatial data mining and knowledge discovery. An ideal multivariate spatial clustering should consider both spatial contiguity and aspatial attributes. Existing spatial clustering approaches may face challenges for discovering repeated geographic patterns with spatial contiguity maintained. In this paper, we propose a Spatial Toeplitz Inverse Covariance-Based Clustering (STICC) method that considers both attributes and spatial relationships of geographic objects for multivariate spatial clustering. A subregion is created for each geographic object serving as the basic unit when performing clustering. A Markov random field is then constructed to characterize the attribute dependencies of subregions. Using a spatial consistency strategy, nearby objects are encouraged to belong to the same cluster. To test the performance of the proposed STICC algorithm, we apply it in two use cases. The comparison results with several baseline methods show that the STICC outperforms others significantly in terms of adjusted rand index and macro-F1 score. Join count statistics is also calculated and shows that the spatial contiguity is well preserved by STICC. Such a spatial clustering method may benefit various applications in the fields of geography, remote sensing, transportation, and urban planning, etc. △ Less

Submitted 30 March, 2022; v1 submitted 17 March, 2022; originally announced March 2022.

Journal ref: International Journal of Geographical Information Science, Year 2022

arXiv:2102.00407 [pdf]

Emotion and color in paintings: a novel temporal and spatial quantitative perspective

Authors: Wenyuan Kong, Teng Fei, Thom Jencks

Abstract: As subjective artistic creations, artistic paintings carry emotion of their creators. Emotions expressed in paintings and emotion aroused in spectators by paintings are two kinds of emotions that scholars have paid attention to. Traditional studies on emotions expressed by paintings are mainly conducted from qualitative perspectives, with neither quantitative output on the emotional values of a pa… ▽ More As subjective artistic creations, artistic paintings carry emotion of their creators. Emotions expressed in paintings and emotion aroused in spectators by paintings are two kinds of emotions that scholars have paid attention to. Traditional studies on emotions expressed by paintings are mainly conducted from qualitative perspectives, with neither quantitative output on the emotional values of a painting, nor exploration of trends in the expression of emotion in art history. In this research we threat facial expressions in paintings as an artistic characteristics of art history and employ cognitive computation technology to identify the facial emotions in paintings and to investigate the quantitative measures of paintings from three emotion-related aspects: the spatial and temporal patterns of painting emotions in art history, the gender difference on the emotion of paintings and the color preference associated with emotions. We discovered that the emotion of happiness has a growing trend from ancient to modern times in paintings history, and men and women have different facial expressions patterns along time. As for color preference, artists with different culture backgrounds had similar association preferences between colors and emotions. △ Less

Submitted 31 January, 2021; originally announced February 2021.

arXiv:2012.12809 [pdf, other]

doi 10.1109/TVT.2022.3182411

Warping of Radar Data into Camera Image for Cross-Modal Supervision in Automotive Applications

Authors: Christopher Grimm, Tai Fei, Ernst Warsitz, Ridha Farhoud, Tobias Breddermann, Reinhold Haeb-Umbach

Abstract: We present an approach to automatically generate semantic labels for real recordings of automotive range-Doppler (RD) radar spectra. Such labels are required when training a neural network for object recognition from radar data. The automatic labeling approach rests on the simultaneous recording of camera and lidar data in addition to the radar spectrum. By warping radar spectra into the camera im… ▽ More We present an approach to automatically generate semantic labels for real recordings of automotive range-Doppler (RD) radar spectra. Such labels are required when training a neural network for object recognition from radar data. The automatic labeling approach rests on the simultaneous recording of camera and lidar data in addition to the radar spectrum. By warping radar spectra into the camera image, state-of-the-art object recognition algorithms can be applied to label relevant objects, such as cars, in the camera image. The warping operation is designed to be fully differentiable, which allows backpropagating the gradient computed on the camera image through the warping operation to the neural network operating on the radar data. As the warping operation relies on accurate scene flow estimation, we further propose a novel scene flow estimation algorithm which exploits information from camera, lidar and radar sensors. The proposed scene flow estimation approach is compared against a state-of-the-art scene flow algorithm, and it outperforms it by approximately 30% w.r.t. mean average error. The feasibility of the overall framework for automatic label generation for RD spectra is verified by evaluating the performance of neural networks trained with the proposed framework for Direction-of-Arrival estimation. △ Less

Submitted 20 June, 2022; v1 submitted 23 December, 2020; originally announced December 2020.

arXiv:1905.01817 [pdf, other]

doi 10.1111/tgis.12552

Extracting human emotions at different places based on facial expressions and spatial clustering analysis

Authors: Yuhao Kang, Qingyuan Jia, Song Gao, Xiaohuan Zeng, Yueyao Wang, Stephan Angsuesser, Yu Liu, Xinyue Ye, Teng Fei

Abstract: The emergence of big data enables us to evaluate the various human emotions at places from a statistic perspective by applying affective computing. In this study, a novel framework for extracting human emotions from large-scale georeferenced photos at different places is proposed. After the construction of places based on spatial clustering of user generated footprints collected in social media we… ▽ More The emergence of big data enables us to evaluate the various human emotions at places from a statistic perspective by applying affective computing. In this study, a novel framework for extracting human emotions from large-scale georeferenced photos at different places is proposed. After the construction of places based on spatial clustering of user generated footprints collected in social media websites, online cognitive services are utilized to extract human emotions from facial expressions using the state-of-the-art computer vision techniques. And two happiness metrics are defined for measuring the human emotions at different places. To validate the feasibility of the framework, we take 80 tourist attractions around the world as an example and a happiness ranking list of places is generated based on human emotions calculated over 2 million faces detected out from over 6 million photos. Different kinds of geographical contexts are taken into consideration to find out the relationship between human emotions and environmental factors. Results show that much of the emotional variation at different places can be explained by a few factors such as openness. The research may offer insights on integrating human emotions to enrich the understanding of sense of place in geography and in place-based GIS. △ Less

Submitted 6 May, 2019; originally announced May 2019.

Comments: 40 pages; 9 figures

ACM Class: I.2; I.4.9; I.5.3

Journal ref: Transactions in GIS, Year 2019, Volume 23, Issue 3

Showing 1–10 of 10 results for author: Fei, T