-
UltraGauss: Ultrafast Gaussian Reconstruction of 3D Ultrasound Volumes
Authors:
Mark C. Eid,
Ana I. L. Namburete,
João F. Henriques
Abstract:
Ultrasound imaging is widely used due to its safety, affordability, and real-time capabilities, but its 2D interpretation is highly operator-dependent, leading to variability and increased cognitive demand. 2D-to-3D reconstruction mitigates these challenges by providing standardized volumetric views, yet existing methods are often computationally expensive, memory-intensive, or incompatible with ultrasound physics. We introduce UltraGauss: the first ultrasound-specific Gaussian Splatting framework, extending view synthesis techniques to ultrasound wave propagation. Unlike conventional perspective-based splatting, UltraGauss models probe-plane intersections in 3D, aligning with acoustic image formation. We derive an efficient rasterization boundary formulation for GPU parallelization and introduce a numerically stable covariance parametrization, improving computational efficiency and reconstruction accuracy. On real clinical ultrasound data, UltraGauss achieves state-of-the-art reconstructions in 5 minutes and reaches 0.99 SSIM within 20 minutes on a single GPU. A survey of expert clinicians confirms UltraGauss' reconstructions are the most realistic among competing methods. Our CUDA implementation will be released upon publication.
Submitted 8 May, 2025;
originally announced May 2025.
-
Characterizing $S=3/2$ AKLT Hamiltonian with Scanning Tunneling Spectroscopy
Authors:
M. Ferri-Cortés,
J. C. G. Henriques,
J. Fernández-Rossier
Abstract:
The AKLT Hamiltonian is a particular instance of a general class of model Hamiltonians defined in lattices with coordination $z$ where each site hosts a spin $S=z/2$, interacting both with linear and non-linear exchange couplings. In two dimensions, the AKLT model features a gap in the spectrum, and its ground state is a valence bond solid state that is a universal resource for measurement-based quantum computing, motivating the quest for physical systems that realize this Hamiltonian. Given a finite-size system described with a specific instance of this general class of models, we address the question of how to assess whether such a system is a realization of the AKLT model using inelastic electron tunneling spectroscopy implemented with scanning tunneling microscopy (IETS-STM). We propose two approaches. First, in the case of a dimer, we show how to leverage non-equilibrium IETS-STM to obtain the energies of all excited states, and thereby determine the magnitude of both linear and non-linear exchange interactions. Second, we explore how IETS can probe the in-gap excitations associated with edge spins. In the AKLT limit, spins $S=3/2$ at the edge of the lattice have coordination 2, giving rise to $S=1/2$ dangling spins that can be probed with IETS. We propose an $S=1/2$ effective Hamiltonian to describe the interactions between these dangling spins in the neighborhood of the AKLT point, where their degeneracy is lifted.
Submitted 6 March, 2025;
originally announced March 2025.
-
On determining the energy dispersion of spin excitations with scanning tunneling spectroscopy
Authors:
J. C. G. Henriques,
Chenxiao Zhao,
G. Catarina,
Pascal Ruffieux,
Roman Fasel,
J. Fernández-Rossier
Abstract:
Conventional methods to measure the dispersion relations of collective spin excitations involve probing bulk samples with particles such as neutrons, photons or electrons, which carry a well-defined momentum. Open-ended finite-size spin chains, on the contrary, do not have a well-defined momentum due to the lack of translation symmetry, and their spin excitations are measured with an eminently local probe, using inelastic electron tunneling spectroscopy (IETS) with a scanning tunneling microscope (STM). Here we discuss under what conditions STM-IETS spectra can be Fourier-transformed to yield dispersion relations in these systems. We relate the success of this approach to the degree to which spin excitations form standing waves. We show that STM-IETS can reveal the energy dispersion of magnons in ferromagnets and triplons in valence bond crystals, but not that of spinons, the spin excitations in Heisenberg spin-1/2 chains. We compare our theoretical predictions with state-of-the-art measurements on nanographene chains that realize the relevant spin Hamiltonians.
Submitted 19 February, 2025;
originally announced February 2025.
-
Classifier-guided registration of coronary CT angiography and intravascular ultrasound
Authors:
R. L. M. van Herten,
José P. Henriques,
R. Nils Planken,
Joost Daemen,
Eline M. J. Hartman,
Jolanda J. Wentzel,
Ivana Išgum
Abstract:
Coronary CT angiography (CCTA) and intravascular ultrasound (IVUS) provide complementary information for coronary artery disease assessment, making their registration valuable for comprehensive analysis. However, existing registration methods require manual interaction or extensive segmentations, limiting their practical application. In this work, we present a fully automatic framework for CCTA-IVUS registration using deep learning-based feature detection and a differentiable image registration module. Our approach leverages a convolutional neural network trained to identify key anatomical features from polar-transformed multiplanar reformatted CCTA or IVUS data. These detected anatomical features subsequently guide a differentiable registration module to optimize transformation parameters of an automatically extracted coronary artery centerline. The method does not require landmark selection or segmentations as input, while accounting for the presence of IVUS guidewire artifacts. Evaluated on 48 clinical cases with reference CCTA centerlines corresponding to IVUS pullback, our method achieved successful registration in 83.3\% of cases, with a median centerline overlap F$_1$-score of 0.982 and median cosine similarities of 0.940 and 0.944 for cross-sectional plane orientation. Our results demonstrate that automatically detected anatomical features can be leveraged for accurate registration. The fully automatic nature of the approach represents a significant step toward streamlined multimodal coronary analysis, potentially facilitating large-scale studies of coronary plaque characteristics across modalities.
Submitted 22 December, 2024;
originally announced December 2024.
-
UniLoc: Towards Universal Place Recognition Using Any Single Modality
Authors:
Yan Xia,
Zhendong Li,
Yun-Jin Li,
Letian Shi,
Hu Cao,
João F. Henriques,
Daniel Cremers
Abstract:
To date, most place recognition methods focus on single-modality retrieval. While they perform well in specific environments, cross-modal methods offer greater flexibility by allowing seamless switching between map and query sources. They also promise to reduce computation requirements by having a unified model, and to achieve greater sample efficiency by sharing parameters. In this work, we develop a universal solution to place recognition, UniLoc, that works with any single query modality (natural language, image, or point cloud). UniLoc leverages recent advances in large-scale contrastive learning, and learns by matching hierarchically at two levels: instance-level matching and scene-level matching. Specifically, we propose a novel Self-Attention based Pooling (SAP) module to evaluate the importance of instance descriptors when aggregated into a place-level descriptor. Experiments on the KITTI-360 dataset demonstrate the benefits of cross-modality for place recognition, achieving superior performance in cross-modal settings and competitive results also for uni-modal scenarios. Our project page is publicly available at https://yan-xia.github.io/projects/UniLoc/.
Submitted 16 December, 2024;
originally announced December 2024.
-
TrafficLoc: Localizing Traffic Surveillance Cameras in 3D Scenes
Authors:
Yan Xia,
Yunxiang Lu,
Rui Song,
Oussema Dhaouadi,
João F. Henriques,
Daniel Cremers
Abstract:
We tackle the problem of localizing traffic cameras within a 3D reference map and propose a novel image-to-point cloud registration (I2P) method, TrafficLoc, in a coarse-to-fine matching fashion. To overcome the lack of large-scale real-world intersection datasets, we first introduce Carla Intersection, a new simulated dataset with 75 urban and rural intersections in Carla. We find that current I2P methods struggle with cross-modal matching under large viewpoint differences, especially at traffic intersections. TrafficLoc thus employs a novel Geometry-guided Attention Loss (GAL) to focus only on the corresponding geometric regions under different viewpoints during 2D-3D feature fusion. To address feature inconsistency in paired image patch-point groups, we further propose Inter-intra Contrastive Learning (ICL) to enhance the separation of 2D patch/3D group features within each modality and introduce Dense Training Alignment (DTA) with soft-argmax for improving position regression. Extensive experiments show our TrafficLoc greatly improves the performance over the SOTA I2P methods (up to 86%) on Carla Intersection and generalizes well to real-world data. TrafficLoc also achieves new SOTA performance on KITTI and NuScenes datasets, demonstrating the superiority across both in-vehicle and traffic cameras. Our project page is publicly available at https://tum-luk.github.io/projects/trafficloc/.
Submitted 25 March, 2025; v1 submitted 13 December, 2024;
originally announced December 2024.
-
Electrically Tunable Interband Collective Excitations in Biased Bilayer and Trilayer Graphene
Authors:
Tomer Eini,
M. F. C. Martins Quintela,
J. C. G. Henriques,
R. M. Ribeiro,
Yarden Mazor,
N. M. R. Peres,
Itai Epstein
Abstract:
Collective excitations of charged particles under the influence of an electromagnetic field give rise to a rich variety of hybrid light-matter quasiparticles with unique properties. In metals, intraband collective response manifested by negative permittivity leads to plasmon-polaritons with extreme field confinement, wavelength squeezing, and potentially low propagation losses. In contrast, photons in semiconductors commonly couple to interband collective response in the form of exciton polaritons, which give rise to completely different polaritonic properties, described by a superposition of the photon and exciton and an anti-crossing of the eigenstates. In this work, we identify the existence of plasmon-like collective excitations originating from the interband excitonic response of biased bilayer and trilayer graphene, in the form of graphene-exciton-polaritons (GEPs). We find that GEPs possess electrically tunable polaritonic properties and discover that such excitations follow a universal dispersion law for all surface polaritons in 2D excitonic systems. Accounting for nonlocal corrections to the excitonic response, we find that the GEPs exhibit confinement factors that can exceed those of graphene plasmons, and with moderate losses. These predictions of plasmon-like interband collective excitations in biased graphene systems open up new research avenues for tunable polaritonic phenomena based on excitonic systems, and the ability to control and manipulate such phenomena at the atomic scale.
Submitted 27 February, 2025; v1 submitted 4 December, 2024;
originally announced December 2024.
-
VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning
Authors:
Yichao Liang,
Nishanth Kumar,
Hao Tang,
Adrian Weller,
Joshua B. Tenenbaum,
Tom Silver,
João F. Henriques,
Kevin Ellis
Abstract:
Broadly intelligent agents should form task-specific abstractions that selectively expose the essential elements of a task, while abstracting away the complexity of the raw sensorimotor space. In this work, we present Neuro-Symbolic Predicates, a first-order abstraction language that combines the strengths of symbolic and neural knowledge representations. We outline an online algorithm for inventing such predicates and learning abstract world models. We compare our approach to hierarchical reinforcement learning, vision-language model planning, and symbolic predicate invention approaches, on both in- and out-of-distribution tasks across five simulated robotic domains. Results show that our approach offers better sample complexity, stronger out-of-distribution generalization, and improved interpretability.
Submitted 28 February, 2025; v1 submitted 30 October, 2024;
originally announced October 2024.
-
Interpretable Representation Learning from Videos using Nonlinear Priors
Authors:
Marian Longa,
João F. Henriques
Abstract:
Learning interpretable representations of visual data is an important challenge, to make machines' decisions understandable to humans and to improve generalisation outside of the training distribution. To this end, we propose a deep learning framework where one can specify nonlinear priors for videos (e.g. of Newtonian physics) that allow the model to learn interpretable latent variables and use these to generate videos of hypothetical scenarios not observed at training time. We do this by extending the Variational Auto-Encoder (VAE) prior from a simple isotropic Gaussian to an arbitrary nonlinear temporal Additive Noise Model (ANM), which can describe a large number of processes (e.g. Newtonian physics). We propose a novel linearization method that constructs a Gaussian Mixture Model (GMM) approximating the prior, and derive a numerically stable Monte Carlo estimate of the KL divergence between the posterior and prior GMMs. We validate the method on different real-world physics videos including a pendulum, a mass on a spring, a falling object and a pulsar (rotating neutron star). We specify a physical prior for each experiment and show that the correct variables are learned. Once a model is trained, we intervene on it to change different physical variables (such as oscillation amplitude or adding air drag) to generate physically correct videos of hypothetical scenarios that were not observed previously.
Submitted 24 October, 2024;
originally announced October 2024.
-
World of Forms: Deformable Geometric Templates for One-Shot Surface Meshing in Coronary CT Angiography
Authors:
Rudolf L. M. van Herten,
Ioannis Lagogiannis,
Jelmer M. Wolterink,
Steffen Bruns,
Eva R. Meulendijks,
Damini Dey,
Joris R. de Groot,
José P. Henriques,
R. Nils Planken,
Simone Saitta,
Ivana Išgum
Abstract:
Deep learning-based medical image segmentation and surface mesh generation typically involve a sequential pipeline from image to segmentation to meshes, often requiring large training datasets while making limited use of prior geometric knowledge. This may lead to topological inconsistencies and suboptimal performance in low-data regimes. To address these challenges, we propose a data-efficient deep learning method for direct 3D anatomical object surface meshing using geometric priors. Our approach employs a multi-resolution graph neural network that operates on a prior geometric template which is deformed to fit object boundaries of interest. We show how different templates may be used for the different surface meshing targets, and introduce a novel masked autoencoder pretraining strategy for 3D spherical data. The proposed method outperforms nnUNet in a one-shot setting for segmentation of the pericardium, left ventricle (LV) cavity and the LV myocardium. Similarly, the method outperforms other lumen segmentation methods operating on multi-planar reformatted images. Results further indicate that mesh quality is on par with or improves upon marching cubes post-processing of voxel mask predictions, while remaining flexible in the choice of mesh triangulation prior, thus paving the way for more accurate and topologically consistent 3D medical object surface meshing.
Submitted 21 February, 2025; v1 submitted 18 September, 2024;
originally announced September 2024.
-
GST: Precise 3D Human Body from a Single Image with Gaussian Splatting Transformers
Authors:
Lorenza Prospero,
Abdullah Hamdi,
Joao F. Henriques,
Christian Rupprecht
Abstract:
Reconstructing posed 3D human models from monocular images has important applications in the sports industry, including performance tracking, injury prevention and virtual training. In this work, we combine 3D human pose and shape estimation with 3D Gaussian Splatting (3DGS), a representation of the scene composed of a mixture of Gaussians. This allows training or fine-tuning a human model predictor on multi-view images alone, without 3D ground truth. Predicting such mixtures for a human from a single input image is challenging due to self-occlusions and dependence on articulations, while also needing to retain enough flexibility to accommodate a variety of clothes and poses. Our key observation is that the vertices of standardized human meshes (such as SMPL) can provide an adequate spatial density and approximate initial position for the Gaussians. We can then train a transformer model to jointly predict comparatively small adjustments to these positions, as well as the other 3DGS attributes and the SMPL parameters. We show empirically that this combination (using only multi-view supervision) can achieve near real-time inference of 3D human models from a single image without expensive diffusion models or 3D point supervision, thus making it ideal for the sports industry at any level. More importantly, rendering is an effective auxiliary objective to refine 3D pose estimation by accounting for clothes and other geometric variations. The code is available at https://github.com/prosperolo/GST.
Submitted 16 April, 2025; v1 submitted 6 September, 2024;
originally announced September 2024.
-
Dissecting Temporal Understanding in Text-to-Audio Retrieval
Authors:
Andreea-Maria Oncescu,
João F. Henriques,
A. Sophia Koepke
Abstract:
Recent advancements in machine learning have fueled research on multimodal tasks, such as text-to-video and text-to-audio retrieval. These tasks require models to understand the semantic content of video and audio data, including objects and characters. The models also need to learn spatial arrangements and temporal relationships. In this work, we analyse the temporal ordering of sounds, which is an understudied problem in the context of text-to-audio retrieval. In particular, we dissect the temporal understanding capabilities of a state-of-the-art model for text-to-audio retrieval on the AudioCaps and Clotho datasets. Additionally, we introduce a synthetic text-audio dataset that provides a controlled setting for evaluating temporal capabilities of recent models. Lastly, we present a loss function that encourages text-audio models to focus on the temporal ordering of events. Code and data are available at https://www.robots.ox.ac.uk/~vgg/research/audio-retrieval/dtu/.
Submitted 1 September, 2024;
originally announced September 2024.
-
ALMA Memo 628 -- High-cadence observations of the Sun
Authors:
Sven Wedemeyer,
Mikolaj Szydlarski,
M. Carmen Toribio,
Tobia Carozzi,
Daniel Jakobsson,
Juan Camilo Guevara Gomez,
Henrik Eklund,
Vasco M. J. Henriques,
Shahin Jafarzadeh,
Jaime de la Cruz Rodriguez
Abstract:
The Atacama Large Millimeter/submillimeter Array (ALMA) offers new diagnostic capabilities for studying the Sun, providing complementary insights through high spatial and temporal resolution at millimeter wavelengths. ALMA acts as a linear thermometer for atmospheric gas, aiding in understanding the solar atmosphere's structure, dynamics, and energy balance. Given the Sun's complex emission patterns and rapid evolution, high-cadence imaging is essential for solar observations. Snapshot imaging is required, though it limits available visibility data, making full exploitation of ALMA's capabilities non-trivial. Challenges in processing solar ALMA data highlight the need for revising and enhancing the solar observing mode. The ALMA development study High-Cadence Imaging of the Sun demonstrated the potential benefits of high-cadence observations through a forward modelling approach. The resulting report provides initial recommendations for improved post-processing of solar ALMA data and explores increasing the observing cadence to sub-second intervals to improve image reliability.
Submitted 26 August, 2024;
originally announced August 2024.
-
Gapless spin excitations in nanographene-based antiferromagnetic spin-1/2 Heisenberg chains
Authors:
Chenxiao Zhao,
Lin Yang,
João C. G. Henriques,
Mar Ferri-Cortés,
Gonçalo Catarina,
Carlo A. Pignedoli,
Ji Ma,
Xinliang Feng,
Pascal Ruffieux,
Joaquín Fernández-Rossier,
Roman Fasel
Abstract:
Haldane's seminal work established two fundamentally different types of excitation spectra for antiferromagnetic Heisenberg quantum spin chains: gapped excitations in integer-spin chains and gapless excitations in half-integer-spin chains. In finite-length half-integer spin chains, however, quantization induces a gap in the excitation spectrum, with the upper bound given by the Lieb-Schultz-Mattis (LSM) theorem. Here, we investigate the length-dependent excitations in spin-1/2 Heisenberg chains obtained by covalently linking olympicenes (Olympic-ring-shaped nanographenes carrying spin-1/2) into one-dimensional chains. The large exchange interaction (J~38 mV) between olympicenes and the negligible magnetic anisotropy in these nanographenes make them an ideal platform for studying quantum spin excitations, which we directly measure using inelastic electron tunneling spectroscopy. We observe a power-law decay of the lowest excitation energy with increasing chain length L, remaining below the LSM boundary. In a long chain with L = 50, a nearly V-shaped excitation continuum is observed, reinforcing the system's gapless nature in the thermodynamic limit. Finally, we visualize the standing wave of a single spinon confined in odd-numbered chains using low-bias current maps. Our results provide compelling evidence for the realization of a one-dimensional analog of a gapless spin liquid.
Submitted 19 August, 2024;
originally announced August 2024.
-
3D-Aware Instance Segmentation and Tracking in Egocentric Videos
Authors:
Yash Bhalgat,
Vadim Tschernezki,
Iro Laina,
João F. Henriques,
Andrea Vedaldi,
Andrew Zisserman
Abstract:
Egocentric videos present unique challenges for 3D scene understanding due to rapid camera motion, frequent object occlusions, and limited object visibility. This paper introduces a novel approach to instance segmentation and tracking in first-person video that leverages 3D awareness to overcome these obstacles. Our method integrates scene geometry, 3D object centroid tracking, and instance segmentation to create a robust framework for analyzing dynamic egocentric scenes. By incorporating spatial and temporal cues, we achieve superior performance compared to state-of-the-art 2D approaches. Extensive evaluations on the challenging EPIC Fields dataset demonstrate significant improvements across a range of tracking and segmentation consistency metrics. Specifically, our method outperforms the next best performing approach by $7$ points in Association Accuracy (AssA) and $4.5$ points in IDF1 score, while reducing the number of ID switches by $73\%$ to $80\%$ across various object categories. Leveraging our tracked instance segmentations, we showcase downstream applications in 3D object reconstruction and amodal video object segmentation in these egocentric settings.
Submitted 20 November, 2024; v1 submitted 19 August, 2024;
originally announced August 2024.
-
Building spin-1/2 antiferromagnetic Heisenberg chains with diaza-nanographenes
Authors:
Xiaoshuai Fu,
Li Huang,
Kun Liu,
João C. G. Henriques,
Yixuan Gao,
Xianghe Han,
Hui Chen,
Yan Wang,
Carlos-Andres Palma,
Zhihai Cheng,
Xiao Lin,
Shixuan Du,
Ji Ma,
Joaquín Fernández-Rossier,
Xinliang Feng,
Hong-Jun Gao
Abstract:
Understanding and engineering the coupling of spins in nanomaterials is of central importance for designing novel devices. Graphene nanostructures with π-magnetism offer a chemically tunable platform to explore quantum magnetic interactions. However, realizing spin chains bearing controlled odd-even effects with suitable nanographene systems is challenging. Here, we demonstrate the successful on-surface synthesis of spin-1/2 antiferromagnetic Heisenberg chains with parity-dependent magnetization based on antiaromatic diaza-hexa-peri-hexabenzocoronene (diaza-HBC) units. Using distinct synthetic strategies, two types of spin chains with different terminals were synthesized, both exhibiting a robust odd-even effect on the spin coupling along the chain. Combined investigations using scanning tunneling microscopy, non-contact atomic force microscopy, density functional theory calculations, and quantum spin models confirmed the structures of the diaza-HBC chains and revealed their magnetic properties, with each unit carrying an S = 1/2 spin through electron donation from the diaza-HBC core to the Au(111) substrate. Gapped excitations were observed in even-numbered chains, while enhanced Kondo resonance emerged in odd-numbered units of odd-numbered chains due to the redistribution of the unpaired spin along the chain. Our findings provide an effective strategy to construct nanographene spin chains and unveil the odd-even effect in their magnetic properties, offering potential applications in nanoscale spintronics.
Submitted 29 July, 2024;
originally announced July 2024.
-
SOAP-RL: Sequential Option Advantage Propagation for Reinforcement Learning in POMDP Environments
Authors:
Shu Ishida,
João F. Henriques
Abstract:
This work compares ways of extending Reinforcement Learning algorithms to Partially Observed Markov Decision Processes (POMDPs) with options. One view of options is as temporally extended actions, which can be realized as a memory that allows the agent to retain historical information beyond the policy's context window. While option assignment could be handled using heuristics and hand-crafted objectives, learning temporally consistent options and associated sub-policies without explicit supervision is a challenge. Two algorithms, PPOEM and SOAP, are proposed and studied in depth to address this problem. PPOEM applies the forward-backward algorithm (for Hidden Markov Models) to optimize the expected returns for an option-augmented policy. However, this learning approach is unstable during on-policy rollouts. It is also unsuited for learning causal policies without the knowledge of future trajectories, since option assignments are optimized for offline sequences where the entire episode is available. As an alternative approach, SOAP evaluates the policy gradient for an optimal option assignment. It extends the concept of generalized advantage estimation (GAE) to propagate option advantages through time, which is analytically equivalent to performing temporal back-propagation of option policy gradients. This option policy is only conditional on the history of the agent, not future actions. Evaluated against competing baselines, SOAP exhibited the most robust performance, correctly discovering options in POMDP corridor environments as well as on standard benchmarks including Atari and MuJoCo, and outperforming PPOEM, LSTM, and Option-Critic baselines. The open-sourced code is available at https://github.com/shuishida/SoapRL.
Submitted 11 October, 2024; v1 submitted 26 July, 2024;
originally announced July 2024.
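The SOAP-RL abstract above builds on generalized advantage estimation (GAE). As background only, here is a minimal sketch of standard GAE for a single episode; it is not SOAP's option-advantage extension, which the abstract does not spell out. The function name, argument layout, and default `gamma` and `lam` values are assumptions chosen for illustration.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Standard GAE for one episode (illustrative, not SOAP itself).

    rewards: list of T rewards r_0 .. r_{T-1}
    values:  list of T+1 value estimates V(s_0) .. V(s_T);
             the last entry is the bootstrap value (0.0 if terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so each step reuses the discounted sum of later TD errors.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    print(gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
```

With `lam=0` the estimate reduces to the one-step TD error, and with `lam=1` it becomes the discounted return minus the value baseline; intermediate values trade bias against variance.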
-
Unsupervised Object Detection with Theoretical Guarantees
Authors:
Marian Longa,
João F. Henriques
Abstract:
Unsupervised object detection using deep neural networks is typically a difficult problem with few to no guarantees about the learned representation. In this work we present the first unsupervised object detection method that is theoretically guaranteed to recover the true object positions up to quantifiable small shifts. We develop an unsupervised object detection architecture and prove that the learned variables correspond to the true object positions up to small shifts related to the encoder and decoder receptive field sizes, the object sizes, and the widths of the Gaussians used in the rendering process. We perform detailed analysis of how the error depends on each of these variables and perform synthetic experiments validating our theoretical predictions up to a precision of individual pixels. We also perform experiments on CLEVR-based data and show that, unlike current SOTA object detection methods (SAM, CutLER), our method's prediction errors always lie within our theoretical bounds. We hope that this work helps open up an avenue of research into object detection methods with theoretical guarantees.
Submitted 24 October, 2024; v1 submitted 11 June, 2024;
originally announced June 2024.
-
Flash3D: Feed-Forward Generalisable 3D Scene Reconstruction from a Single Image
Authors:
Stanislaw Szymanowicz,
Eldar Insafutdinov,
Chuanxia Zheng,
Dylan Campbell,
João F. Henriques,
Christian Rupprecht,
Andrea Vedaldi
Abstract:
We propose Flash3D, a method for scene reconstruction and novel view synthesis from a single image which is both very generalisable and efficient. For generalisability, we start from a "foundation" model for monocular depth estimation and extend it to a full 3D shape and appearance reconstructor. For efficiency, we base this extension on feed-forward Gaussian Splatting. Specifically, we predict a first layer of 3D Gaussians at the predicted depth, and then add additional layers of Gaussians that are offset in space, allowing the model to complete the reconstruction behind occlusions and truncations. Flash3D is very efficient, trainable on a single GPU in a day, and thus accessible to most researchers. It achieves state-of-the-art results when trained and tested on RealEstate10k. When transferred to unseen datasets like NYU it outperforms competitors by a large margin. More impressively, when transferred to KITTI, Flash3D achieves better PSNR than methods trained specifically on that dataset. In some instances, it even outperforms recent methods that use multiple views as input. Code, models, demo, and more results are available at https://www.robots.ox.ac.uk/~vgg/research/flash3d/.
Submitted 1 June, 2025; v1 submitted 6 June, 2024;
originally announced June 2024.
-
HelloFresh: LLM Evaluations on Streams of Real-World Human Editorial Actions across X Community Notes and Wikipedia edits
Authors:
Tim Franzmeyer,
Aleksandar Shtedritski,
Samuel Albanie,
Philip Torr,
João F. Henriques,
Jakob N. Foerster
Abstract:
Benchmarks have been essential for driving progress in machine learning. A better understanding of LLM capabilities on real world tasks is vital for safe development. Designing adequate LLM benchmarks is challenging: Data from real-world tasks is hard to collect, public availability of static evaluation data results in test data contamination and benchmark overfitting, and periodically generating new evaluation data is tedious and may result in temporally inconsistent results. We introduce HelloFresh, based on continuous streams of real-world data generated by intrinsically motivated human labelers. It covers recent events from X (formerly Twitter) community notes and edits of Wikipedia pages, mitigating the risk of test data contamination and benchmark overfitting. Any X user can propose an X note to add additional context to a misleading post (formerly tweet); if the community classifies it as helpful, it is shown with the post. Similarly, Wikipedia relies on community-based consensus, allowing users to edit articles or revert edits made by other users. Verifying whether an X note is helpful or whether a Wikipedia edit should be accepted are hard tasks that require grounding by querying the web. We backtest state-of-the-art LLMs supplemented with simple web search access and find that HelloFresh yields a temporally consistent ranking. To enable continuous evaluation on HelloFresh, we host a public leaderboard and periodically updated evaluation data at https://tinyurl.com/hello-fresh-LLM.
Submitted 5 June, 2024;
originally announced June 2024.
-
Giant spatial anisotropy of magnon lifetime in altermagnets
Authors:
A. T. Costa,
J. C. G. Henriques,
J. Fernández-Rossier
Abstract:
Altermagnets are a new class of magnetic materials with zero net magnetization (like antiferromagnets) but spin-split electronic bands (like ferromagnets) over a fraction of reciprocal space. As in antiferromagnets, magnons in altermagnets come in two flavours, that either add one or remove one unit of spin to the $S=0$ ground state. However, in altermagnets these two magnon modes are non-degenerate along some directions in reciprocal space. Here we show that the lifetime of altermagnetic magnons has a very strong dependence on both flavour and direction. Strikingly, coupling to Stoner modes leads to a complete suppression of magnon propagation along selected spatial directions. This giant anisotropy will impact electronic, spin, and energy transport properties and may be exploited in spintronic applications.
Submitted 21 May, 2024;
originally announced May 2024.
-
Select to Perfect: Imitating desired behavior from large multi-agent data
Authors:
Tim Franzmeyer,
Edith Elkind,
Philip Torr,
Jakob Foerster,
Joao Henriques
Abstract:
AI agents are commonly trained with large datasets of demonstrations of human behavior. However, not all behaviors are equally safe or desirable. Desired characteristics for an AI agent can be expressed by assigning desirability scores, which we assume are not assigned to individual behaviors but to collective trajectories. For example, in a dataset of vehicle interactions, these scores might relate to the number of incidents that occurred. We first assess the effect of each individual agent's behavior on the collective desirability score, e.g., assessing how likely an agent is to cause incidents. This allows us to selectively imitate agents with a positive effect, e.g., only imitating agents that are unlikely to cause incidents. To enable this, we propose the concept of an agent's Exchange Value, which quantifies an individual agent's contribution to the collective desirability score. The Exchange Value is the expected change in desirability score when substituting the agent for a randomly selected agent. We propose additional methods for estimating Exchange Values from real-world datasets, enabling us to learn desired imitation policies that outperform relevant baselines. The project website can be found at https://tinyurl.com/select-to-perfect.
Submitted 6 May, 2024;
originally announced May 2024.
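The "Select to Perfect" abstract above defines an agent's Exchange Value as the expected change in the collective desirability score when the agent is substituted for a randomly selected agent. The toy sketch below is one brute-force reading of that single sentence, not the paper's estimator: it enumerates every group of other agents and every slot the agent could take. The function name, the `score` callback, and the incident-rate data are all hypothetical.

```python
from itertools import combinations


def exchange_value(agent, agents, group_size, score):
    """Toy Exchange Value: mean change in score when `agent` takes the
    place of one member of a group made up of other agents.

    score: hypothetical callable mapping a tuple of agents to a number.
    """
    others = [a for a in agents if a != agent]
    diffs = []
    for group in combinations(others, group_size):
        for k in range(group_size):
            # Put `agent` in slot k, replacing the agent that was there.
            swapped = group[:k] + (agent,) + group[k + 1:]
            diffs.append(score(swapped) - score(group))
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Hypothetical per-agent incident rates; fewer incidents is more desirable.
    incident_rate = {"A": 0.0, "B": 0.1, "C": 0.5}

    def desirability(group):
        return -sum(incident_rate[a] for a in group)

    for name in incident_rate:
        print(name, exchange_value(name, list(incident_rate), 2, desirability))
```

Under this reading, agents with a positive value raise the collective score relative to a typical replacement, which matches the abstract's idea of imitating only agents with a positive effect. Exhaustive enumeration is feasible only for tiny populations; real datasets would need sampling or the estimation methods the paper proposes.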
-
RapidVol: Rapid Reconstruction of 3D Ultrasound Volumes from Sensorless 2D Scans
Authors:
Mark C. Eid,
Pak-Hei Yeung,
Madeleine K. Wyburd,
João F. Henriques,
Ana I. L. Namburete
Abstract:
Two-dimensional (2D) freehand ultrasonography is one of the most commonly used medical imaging modalities, particularly in obstetrics and gynaecology. However, it only captures 2D cross-sectional views of inherently 3D anatomies, losing valuable contextual information. As an alternative to requiring costly and complex 3D ultrasound scanners, 3D volumes can be constructed from 2D scans using machine learning. However, this usually requires long computation times. Here, we propose RapidVol: a neural representation framework to speed up slice-to-volume ultrasound reconstruction. We use tensor-rank decomposition to decompose the typical 3D volume into sets of tri-planes, and store those instead, along with a small neural network. A set of 2D ultrasound scans, with their ground truth (or estimated) 3D position and orientation (pose), is all that is required to form a complete 3D reconstruction. Reconstructions are formed from real fetal brain scans, and then evaluated by requesting novel cross-sectional views. When compared to prior approaches based on fully implicit representation (e.g. neural radiance fields), our method is over 3x quicker, 46% more accurate, and more robust when given inaccurate poses. Further speed-up is also possible by reconstructing from a structural prior rather than from scratch.
Submitted 16 April, 2024;
originally announced April 2024.
-
Stale Diffusion: Hyper-realistic 5D Movie Generation Using Old-school Methods
Authors:
Joao F. Henriques,
Dylan Campbell,
Tengda Han
Abstract:
Two years ago, Stable Diffusion achieved super-human performance at generating images with super-human numbers of fingers. Following the steady decline of its technical novelty, we propose Stale Diffusion, a method that solidifies and ossifies Stable Diffusion in a maximum-entropy state. Stable Diffusion works analogously to a barn (the Stable) from which an infinite set of horses have escaped (the Diffusion). As the horses have long left the barn, our proposal may be seen as antiquated and irrelevant. Nevertheless, we vigorously defend our claim of novelty by identifying as early adopters of the Slow Science Movement, which will produce extremely important pearls of wisdom in the future. Our speed of contributions can also be seen as a quasi-static implementation of the recent call to pause AI experiments, which we wholeheartedly support. As a result of a careful archaeological expedition to 18-months-old Git commit histories, we found that naturally-accumulating errors have produced a novel entropy-maximising Stale Diffusion method, that can produce sleep-inducing hyper-realistic 5D video that is as good as one's imagination.
Submitted 1 April, 2024;
originally announced April 2024.
-
Small-scale magnetic flux emergence preceding a chain of energetic solar atmospheric events
Authors:
D. Nóbrega-Siverio,
I. Cabello,
S. Bose,
L. H. M. Rouppe van der Voort,
R. Joshi,
C. Froment,
V. M. J. Henriques
Abstract:
Advancements in instrumentation have revealed a multitude of small-scale EUV events in the solar atmosphere. Our aim is to employ high-resolution magnetograms to gain a detailed understanding of the magnetic origin of such phenomena. We have used coordinated observations from SST, IRIS, and SDO to analyze an ephemeral magnetic flux emergence episode and the following chain of small-scale energetic events. These unique observations clearly link these phenomena together. The high-resolution (0."057/pixel) magnetograms obtained with SST/CRISP allow us to reliably measure the magnetic field at the photosphere and detect the emerging bipole that causes the subsequent eruptive atmospheric events. Notably, this small-scale emergence episode remains indiscernible in the lower resolution SDO/HMI magnetograms (0."5/pixel). We report the appearance of a dark bubble in Ca II K related to the emerging bipole, a sign of the canonical expanding magnetic dome predicted in flux emergence simulations. Evidence of reconnection is also found: first through an Ellerman bomb, and later by the launch of a surge next to a UV burst. The UV burst exhibits a weak EUV counterpart in the coronal SDO/AIA channels. By calculating DEM, its plasma is shown to reach a temperature beyond 1 MK and have densities between the upper chromosphere and transition region. Our study showcases the importance of high-resolution magnetograms to unveil the mechanisms triggering phenomena such as EBs, UV bursts, and surges. This could hold implications for small-scale events akin to those recently reported in EUV using Solar Orbiter. The finding of temperatures beyond 1 MK in the UV burst plasma strongly suggests that we are examining analogous features. Therefore, we signal caution regarding drawing conclusions from full-disk magnetograms that lack the necessary resolution to reveal their true magnetic origin.
Submitted 18 March, 2024;
originally announced March 2024.
-
N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields
Authors:
Yash Bhalgat,
Iro Laina,
João F. Henriques,
Andrew Zisserman,
Andrea Vedaldi
Abstract:
Understanding complex scenes at multiple levels of abstraction remains a formidable challenge in computer vision. To address this, we introduce Nested Neural Feature Fields (N2F2), a novel approach that employs hierarchical supervision to learn a single feature field, wherein different dimensions within the same high-dimensional feature encode scene properties at varying granularities. Our method allows for a flexible definition of hierarchies, tailored to either the physical dimensions or semantics or both, thereby enabling a comprehensive and nuanced understanding of scenes. We leverage a 2D class-agnostic segmentation model to provide semantically meaningful pixel groupings at arbitrary scales in the image space, and query the CLIP vision-encoder to obtain language-aligned embeddings for each of these segments. Our proposed hierarchical supervision method then assigns different nested dimensions of the feature field to distill the CLIP embeddings using deferred volumetric rendering at varying physical scales, creating a coarse-to-fine representation. Extensive experiments show that our approach outperforms the state-of-the-art feature field distillation methods on tasks such as open-vocabulary 3D segmentation and localization, demonstrating the effectiveness of the learned nested feature field.
Submitted 28 July, 2024; v1 submitted 16 March, 2024;
originally announced March 2024.
-
Multi-level Product Category Prediction through Text Classification
Authors:
Wesley Ferreira Maia,
Angelo Carmignani,
Gabriel Bortoli,
Lucas Maretti,
David Luz,
Daniel Camilo Fuentes Guzman,
Marcos Jardel Henriques,
Francisco Louzada Neto
Abstract:
This article investigates applying advanced machine learning models, specifically LSTM and BERT, for text classification to predict multiple categories in the retail sector. The study demonstrates how applying data augmentation techniques and the focal loss function can significantly enhance accuracy in classifying products into multiple categories using a robust Brazilian retail dataset. The LSTM model, enriched with Brazilian word embedding, and BERT, known for its effectiveness in understanding complex contexts, were adapted and optimized for this specific task. The results showed that the BERT model, with an F1 Macro Score of up to $99\%$ for segments, $96\%$ for categories and subcategories and $93\%$ for product names, outperformed LSTM in more detailed categories. However, LSTM also achieved high performance, especially after applying data augmentation and focal loss techniques. These results underscore the effectiveness of NLP techniques in retail and highlight the importance of the careful selection of modelling and preprocessing strategies. This work contributes significantly to the field of NLP in retail, providing valuable insights for future research and practical applications.
Submitted 3 March, 2024;
originally announced March 2024.
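The product-category abstract above credits part of its gains to the focal loss. For reference, a minimal sketch of the commonly used multi-class form, FL(p_t) = -alpha (1 - p_t)^gamma log(p_t), follows. The abstract does not state which `gamma` or `alpha` the authors used, so the defaults below are assumptions, as are the function name and the small clipping constant.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for a single example.

    probs:  predicted class probabilities (assumed to sum to 1)
    target: index of the true class
    gamma, alpha: assumed defaults, not values taken from the paper
    """
    p_t = max(probs[target], 1e-12)  # clip to avoid log(0)
    # (1 - p_t)^gamma shrinks the loss of examples that are already well classified.
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    easy = [0.9, 0.05, 0.05]  # confident and correct
    hard = [0.2, 0.5, 0.3]    # true class 0 gets low probability
    print(focal_loss(easy, 0), focal_loss(hard, 0))
```

Setting `gamma=0` recovers ordinary cross-entropy; larger values shift the training signal towards hard or rare classes, which is the usual motivation for using it on imbalanced category data.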
-
A SOUND APPROACH: Using Large Language Models to generate audio descriptions for egocentric text-audio retrieval
Authors:
Andreea-Maria Oncescu,
João F. Henriques,
Andrew Zisserman,
Samuel Albanie,
A. Sophia Koepke
Abstract:
Video databases from the internet are a valuable source of text-audio retrieval datasets. However, given that sound and vision streams represent different "views" of the data, treating visual descriptions as audio descriptions is far from optimal. Even if audio class labels are present, they commonly are not very detailed, making them unsuited for text-audio retrieval. To exploit relevant audio information from video-text datasets, we introduce a methodology for generating audio-centric descriptions using Large Language Models (LLMs). In this work, we consider the egocentric video setting and propose three new text-audio retrieval benchmarks based on the EpicMIR and EgoMCQ tasks, and on the EpicSounds dataset. Our approach for obtaining audio-centric descriptions gives significantly higher zero-shot performance than using the original visual-centric descriptions. Furthermore, we show that using the same prompts, we can successfully employ LLMs to improve the retrieval on EpicSounds, compared to using the original audio class labels of the dataset. Finally, we confirm that LLMs can be used to determine the difficulty of identifying the action associated with a sound.
Submitted 29 February, 2024;
originally announced February 2024.
-
Tunable topological phases in nanographene-based spin-1/2 alternating-exchange Heisenberg chains
Authors:
Chenxiao Zhao,
Gonçalo Catarina,
Jin-Jiang Zhang,
João C. G. Henriques,
Lin Yang,
Ji Ma,
Xinliang Feng,
Oliver Gröning,
Pascal Ruffieux,
Joaquín Fernández-Rossier,
Roman Fasel
Abstract:
Unlocking the potential of topological order within many-body spin systems has long been a central pursuit in the realm of quantum materials. Despite extensive efforts, the quest for a versatile platform enabling site-selective spin manipulation, essential for tuning and probing diverse topological phases, has persisted. Here, we utilize on-surface synthesis to construct spin-1/2 alternating-exchange Heisenberg (AH) chains[1] with antiferromagnetic couplings $J_1$ and $J_2$ by covalently linking Clar's goblets -- nanographenes each hosting two antiferromagnetically-coupled unpaired electrons[2]. Utilizing scanning tunneling microscopy, we exert atomic-scale control over the spin chain lengths, parities and exchange-coupling terminations, and probe their magnetic response by means of inelastic tunneling spectroscopy. Our investigation confirms the gapped nature of bulk excitations in the chains, known as triplons[3]. Besides, the triplon dispersion relation is successfully extracted from the spatial variation of tunneling spectral amplitudes. Furthermore, depending on the parity and termination of chains, we observe varying numbers of in-gap $S=1/2$ edge spins, enabling the determination of the degeneracy of distinct topological ground states in the thermodynamic limit: either 1, 2, or 4. By monitoring interactions between these edge spins, we identify the exponential decay of spin correlations. Our experimental findings, corroborated by theoretical calculations, present a phase-controlled many-body platform, opening promising avenues toward the development of spin-based quantum devices.
Submitted 21 February, 2024;
originally announced February 2024.
-
SCENES: Subpixel Correspondence Estimation With Epipolar Supervision
Authors:
Dominik A. Kloepfer,
João F. Henriques,
Dylan Campbell
Abstract:
Extracting point correspondences from two or more views of a scene is a fundamental computer vision problem with particular importance for relative camera pose estimation and structure-from-motion. Existing local feature matching approaches, trained with correspondence supervision on large-scale datasets, obtain highly-accurate matches on the test sets. However, they do not generalise well to new datasets with different characteristics to those they were trained on, unlike classic feature extractors. Instead, they require finetuning, which assumes that ground-truth correspondences or ground-truth camera poses and 3D structure are available. We relax this assumption by removing the requirement of 3D structure, e.g., depth maps or point clouds, and only require camera pose information, which can be obtained from odometry. We do so by replacing correspondence losses with epipolar losses, which encourage putative matches to lie on the associated epipolar line. While weaker than correspondence supervision, we observe that this cue is sufficient for finetuning existing models on new data. We then further relax the assumption of known camera poses by using pose estimates in a novel bootstrapping approach. We evaluate on highly challenging datasets, including an indoor drone dataset and an outdoor smartphone camera dataset, and obtain state-of-the-art results without strong supervision.
Submitted 19 January, 2024;
originally announced January 2024.
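The SCENES abstract above replaces correspondence losses with epipolar losses that encourage a putative match to lie on its epipolar line. One common way to measure this is the point-to-line distance sketched below; the abstract does not give the exact loss, so treat this as an illustration rather than the authors' formulation. The function name and the example fundamental matrix are assumptions.

```python
import math


def epipolar_distance(F, x1, x2):
    """Distance from x2 (image 2) to the epipolar line of x1 (image 1).

    F:  3x3 fundamental matrix as nested lists, with x2^T F x1 = 0 for true matches
    x1: (u, v) pixel in image 1
    x2: (u, v) pixel in image 2
    """
    h1 = (x1[0], x1[1], 1.0)
    # Epipolar line in image 2: l = F @ h1, i.e. a*u + b*v + c = 0.
    a, b, c = (sum(F[r][k] * h1[k] for k in range(3)) for r in range(3))
    return abs(a * x2[0] + b * x2[1] + c) / math.hypot(a, b)


if __name__ == "__main__":
    # Assumed setup: pure sideways camera translation with identity intrinsics,
    # so epipolar lines are horizontal (a match must keep the same row v).
    F = [[0.0, 0.0, 0.0],
         [0.0, 0.0, -1.0],
         [0.0, 1.0, 0.0]]
    print(epipolar_distance(F, (2.0, 3.0), (5.0, 3.0)))
    print(epipolar_distance(F, (2.0, 3.0), (5.0, 4.5)))
```

Because the distance only constrains a match to a line and not to a point, it is weaker than correspondence supervision, which is the trade-off the abstract describes. Computing F requires only relative camera pose and intrinsics, with no depth maps or point clouds.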
-
LangProp: A code optimization framework using Large Language Models applied to driving
Authors:
Shu Ishida,
Gianluca Corrado,
George Fedoseev,
Hudson Yeo,
Lloyd Russell,
Jamie Shotton,
João F. Henriques,
Anthony Hu
Abstract:
We propose LangProp, a framework for iteratively optimizing code generated by large language models (LLMs), in both supervised and reinforcement learning settings. While LLMs can generate sensible coding solutions zero-shot, they are often sub-optimal. Especially for code generation tasks, it is likely that the initial code will fail on certain edge cases. LangProp automatically evaluates the code performance on a dataset of input-output pairs, catches any exceptions, and feeds the results back to the LLM in the training loop, so that the LLM can iteratively improve the code it generates. By adopting a metric- and data-driven training paradigm for this code optimization procedure, one could easily adapt findings from traditional machine learning techniques such as imitation learning, DAgger, and reinforcement learning. We show LangProp's applicability to general domains such as Sudoku and CartPole, as well as demonstrate the first proof of concept of automated code optimization for autonomous driving in CARLA. We show that LangProp can generate interpretable and transparent policies that can be verified and improved in a metric- and data-driven way. Our code is available at https://github.com/shuishida/LangProp.
Submitted 3 May, 2024; v1 submitted 18 January, 2024;
originally announced January 2024.
-
Beyond spin models in orbitally-degenerate open-shell nanographenes
Authors:
J. C. G. Henriques,
D. Jacob,
A. Molina-Sánchez,
G. Catarina,
A. T. Costa,
J. Fernández-Rossier
Abstract:
The study of open-shell nanographenes has relied on a paradigm where spins are the only low-energy degrees of freedom. Here we show that some nanographenes can host low-energy excitations that include strongly coupled spin and orbital degrees of freedom. The key ingredient is the existence of orbital degeneracy, as a consequence of leaving the benzenoid/half-filling scenario. We analyze the case of nitrogen-doped triangulenes, using both density-functional theory and Hubbard model multiconfigurational and random-phase approximation calculations. We find a rich interplay between orbital and spin degrees of freedom that confirms the need to go beyond the spin-only paradigm, opening a new avenue in this field of research.
Submitted 8 December, 2023;
originally announced December 2023.
-
Rapid Motor Adaptation for Robotic Manipulator Arms
Authors:
Yichao Liang,
Kevin Ellis,
João Henriques
Abstract:
Developing generalizable manipulation skills is a core challenge in embodied AI. This includes generalization across diverse task configurations, encompassing variations in object shape, density, friction coefficient, and external disturbances such as forces applied to the robot. Rapid Motor Adaptation (RMA) offers a promising solution to this challenge. It posits that essential hidden variables influencing an agent's task performance, such as object mass and shape, can be effectively inferred from the agent's action and proprioceptive history. Drawing inspiration from RMA in locomotion and in-hand rotation, we use depth perception to develop agents tailored for rapid motor adaptation in a variety of manipulation tasks. We evaluated our agents on four challenging tasks from the ManiSkill2 benchmark, namely pick-and-place operations with hundreds of objects from the YCB and EGAD datasets, peg insertion with precise position and orientation, and operating a variety of faucets and handles, with customized environment variations. Empirical results demonstrate that our agents surpass state-of-the-art methods like automatic domain randomization and vision-based policies, obtaining better generalization performance and sample efficiency.
Submitted 29 March, 2024; v1 submitted 7 December, 2023;
originally announced December 2023.
-
Designer spin models in tunable two-dimensional nanographene lattices
Authors:
J. C. G. Henriques,
Mar Ferri-Cortés,
J. Fernández-Rossier
Abstract:
Motivated by recent experimental breakthroughs, we propose a strategy to design two-dimensional spin lattices with competing interactions that lead to non-trivial emergent quantum states. We consider $S=1/2$ nanographenes with $C_3$ symmetry as building blocks, and we leverage the potential to control both the sign and the strength of exchange with first neighbours to build a family of spin models. Specifically, we consider the case of a Heisenberg model in a triangle-decorated honeycomb lattice with competing ferromagnetic and antiferromagnetic interactions whose ratio can be varied in a wide range. Based on exact diagonalization of both fermionic and spin models we predict a quantum phase transition between a valence bond crystal of spin singlets with triplon excitations living in a Kagomé lattice and a Néel phase of effective $S=3/2$ in the limit of dominant ferromagnetic interactions.
Submitted 6 February, 2024; v1 submitted 4 December, 2023;
originally announced December 2023.
-
Text2Loc: 3D Point Cloud Localization from Natural Language
Authors:
Yan Xia,
Letian Shi,
Zifeng Ding,
João F. Henriques,
Daniel Cremers
Abstract:
We tackle the problem of 3D point cloud localization based on a few natural linguistic descriptions and introduce a novel neural network, Text2Loc, that fully interprets the semantic relationship between points and text. Text2Loc follows a coarse-to-fine localization pipeline: text-submap global place recognition, followed by fine localization. In global place recognition, relational dynamics among the textual hints are captured in a hierarchical transformer with max-pooling (HTM), whereas a balance between positive and negative pairs is maintained using text-submap contrastive learning. Moreover, we propose a novel matching-free fine localization method to further refine the location predictions, which completely removes the need for complicated text-instance matching and is lighter, faster, and more accurate than previous methods. Extensive experiments show that Text2Loc improves the localization accuracy by up to $2\times$ over the state-of-the-art on the KITTI360Pose dataset. Our project page is publicly available at https://yan-xia.github.io/projects/text2loc/.
Submitted 28 March, 2024; v1 submitted 27 November, 2023;
originally announced November 2023.
-
Automatic Coronary Artery Plaque Quantification and CAD-RADS Prediction using Mesh Priors
Authors:
Rudolf L. M. van Herten,
Nils Hampe,
Richard A. P. Takx,
Klaas Jan Franssen,
Yining Wang,
Dominika Suchá,
José P. Henriques,
Tim Leiner,
R. Nils Planken,
Ivana Išgum
Abstract:
Coronary artery disease (CAD) remains the leading cause of death worldwide. Patients with suspected CAD undergo coronary CT angiography (CCTA) to evaluate the risk of cardiovascular events and determine the treatment. Clinical analysis of coronary arteries in CCTA comprises the identification of atherosclerotic plaque, as well as the grading of any coronary artery stenosis typically obtained through the CAD-Reporting and Data System (CAD-RADS). This requires analysis of the coronary lumen and plaque. While voxel-wise segmentation is a commonly used approach in various segmentation tasks, it does not guarantee topologically plausible shapes. To address this, in this work, we propose to directly infer surface meshes for coronary artery lumen and plaque based on a centerline prior and use it in the downstream task of CAD-RADS scoring. The method is developed and evaluated using a total of 2407 CCTA scans. Our method achieved lesion-wise volume intraclass correlation coefficients of 0.98, 0.79, and 0.85 for calcified, non-calcified, and total plaque volume respectively. Patient-level CAD-RADS categorization was evaluated on a representative hold-out test set of 300 scans, for which the achieved linearly weighted kappa ($κ$) was 0.75. CAD-RADS categorization on the set of 658 scans from another hospital and scanner led to a $κ$ of 0.71. The results demonstrate that direct inference of coronary artery meshes for lumen and plaque is feasible, and allows for the automated prediction of routinely performed CAD-RADS categorization.
Submitted 17 October, 2023;
originally announced October 2023.
-
LoCUS: Learning Multiscale 3D-consistent Features from Posed Images
Authors:
Dominik A. Kloepfer,
Dylan Campbell,
João F. Henriques
Abstract:
An important challenge for autonomous agents such as robots is to maintain a spatially and temporally consistent model of the world. It must be maintained through occlusions, previously-unseen views, and long time horizons (e.g., loop closure and re-identification). It is still an open question how to train such a versatile neural representation without supervision. We start from the idea that the training objective can be framed as a patch retrieval problem: given an image patch in one view of a scene, we would like to retrieve (with high precision and recall) all patches in other views that map to the same real-world location. One drawback is that this objective does not promote reusability of features: by being unique to a scene (achieving perfect precision/recall), a representation will not be useful in the context of other scenes. We find that it is possible to balance retrieval and reusability by constructing the retrieval set carefully, leaving out patches that map to far-away locations. Similarly, we can easily regulate the scale of the learned features (e.g., points, objects, or rooms) by adjusting the spatial tolerance for considering a retrieval to be positive. We optimize for (smooth) Average Precision (AP), in a single unified ranking-based objective. This objective also doubles as a criterion for choosing landmarks or keypoints, as patches with high AP. We show results creating sparse, multi-scale, semantic spatial maps composed of highly identifiable landmarks, with applications in landmark retrieval, localization, semantic segmentation and instance segmentation.
Submitted 2 October, 2023;
originally announced October 2023.
-
fakenewsbr: A Fake News Detection Platform for Brazilian Portuguese
Authors:
Luiz Giordani,
Gilsiley Darú,
Rhenan Queiroz,
Vitor Buzinaro,
Davi Keglevich Neiva,
Daniel Camilo Fuentes Guzmán,
Marcos Jardel Henriques,
Oilson Alberto Gonzatto Junior,
Francisco Louzada
Abstract:
The proliferation of fake news has become a significant concern in recent times due to its potential to spread misinformation and manipulate public opinion. This paper presents a comprehensive study on detecting fake news in Brazilian Portuguese, focusing on journalistic-type news. We propose a machine learning-based approach that leverages natural language processing techniques, including TF-IDF and Word2Vec, to extract features from textual data. We evaluate the performance of various classification algorithms, such as logistic regression, support vector machine, random forest, AdaBoost, and LightGBM, on a dataset containing both true and fake news articles. The proposed approach achieves high accuracy and F1-Score, demonstrating its effectiveness in identifying fake news. Additionally, we developed a user-friendly web platform, fakenewsbr.com, to facilitate the verification of news articles' veracity. Our platform provides real-time analysis, allowing users to assess the likelihood of fake news articles. Through empirical analysis and comparative studies, we demonstrate the potential of our approach to contribute to the fight against the spread of fake news and promote more informed media consumption.
Submitted 20 September, 2023; v1 submitted 20 September, 2023;
originally announced September 2023.
-
Anatomy of linear and non-linear intermolecular exchange in S = 1 nanographenes
Authors:
J. C. G. Henriques,
J. Fernández-Rossier
Abstract:
Nanographene triangulenes with an S = 1 ground state have been used as building blocks of antiferromagnetic Haldane spin chains realizing a symmetry-protected topological phase. By means of inelastic electron spectroscopy, it was found that the intermolecular exchange contains both linear and non-linear interactions, realizing the bilinear-biquadratic Hamiltonian. Starting from a Hubbard model, and mapping it to an interacting Creutz ladder, we analytically derive these effective spin interactions using perturbation theory, up to fourth order. We find that for chains with more than two units other interactions arise, with the same order-of-magnitude strength, that entail second-neighbor linear and three-site non-linear exchange. Our analytical expressions compare well with experimental and numerical results. We discuss the extension to general S = 1 molecules, and give numerical results for the strength of the non-linear exchange for several nanographenes. Our results pave the way towards rational design of spin Hamiltonians for nanographene-based spin chains.
Submitted 3 July, 2023;
originally announced July 2023.
-
Broken-symmetry magnetic phases in two-dimensional triangulene crystals
Authors:
G. Catarina,
J. C. G. Henriques,
A. Molina-Sánchez,
A. T. Costa,
J. Fernández-Rossier
Abstract:
We provide a comprehensive theory of magnetic phases in two-dimensional triangulene crystals, using both Hubbard model and density functional theory (DFT) calculations. We consider centrosymmetric and non-centrosymmetric triangulene crystals. In all cases, DFT and mean-field Hubbard model predict the emergence of broken-symmetry antiferromagnetic (ferrimagnetic) phases for the centrosymmetric (non-centrosymmetric) crystals. This includes the special case of the [4,4]triangulene crystal, whose non-interacting energy bands feature a gap with flat valence and conduction bands. We show how the lack of contrast between the local density of states of these bands, recently measured via scanning tunneling spectroscopy, is a natural consequence of a broken-symmetry Néel state that blocks intermolecular hybridization. Using random phase approximation, we also compute the spin wave spectrum of these crystals, including the recently synthesized [4,4]triangulene crystal. The results are in excellent agreement with the predictions of a Heisenberg spin model derived from multi-configuration calculations for the unit cell. We conclude that experimental results are compatible with an antiferromagnetically ordered phase where each triangulene retains the spin predicted for the isolated species.
Submitted 29 June, 2023;
originally announced June 2023.
-
Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion
Authors:
Yash Bhalgat,
Iro Laina,
João F. Henriques,
Andrew Zisserman,
Andrea Vedaldi
Abstract:
Instance segmentation in 3D is a challenging task due to the lack of large-scale annotated datasets. In this paper, we show that this task can be addressed effectively by instead leveraging 2D pre-trained models for instance segmentation. We propose a novel approach to lift 2D segments to 3D and fuse them by means of a neural field representation, which encourages multi-view consistency across frames. The core of our approach is a slow-fast clustering objective function, which is scalable and well-suited for scenes with a large number of objects. Unlike previous approaches, our method does not require an upper bound on the number of objects or object tracking across frames. To demonstrate the scalability of the slow-fast clustering, we create a new semi-realistic dataset called the Messy Rooms dataset, which features scenes with up to 500 objects per scene. Our approach outperforms the state-of-the-art on challenging scenes from the ScanNet, Hypersim, and Replica datasets, as well as on our newly created Messy Rooms dataset, demonstrating the effectiveness and scalability of our slow-fast clustering method.
Submitted 1 December, 2023; v1 submitted 7 June, 2023;
originally announced June 2023.
-
Extracting Reward Functions from Diffusion Models
Authors:
Felipe Nuti,
Tim Franzmeyer,
João F. Henriques
Abstract:
Diffusion models have achieved remarkable results in image generation, and have similarly been used to learn high-performing policies in sequential decision-making tasks. Decision-making diffusion models can be trained on lower-quality data, and then be steered with a reward function to generate near-optimal trajectories. We consider the problem of extracting a reward function by comparing a decision-making diffusion model that models low-reward behavior and one that models high-reward behavior, a setting related to inverse reinforcement learning. We first define the notion of a relative reward function of two diffusion models and show conditions under which it exists and is unique. We then devise a practical learning algorithm for extracting it by aligning the gradients of a reward function -- parametrized by a neural network -- to the difference in outputs of both diffusion models. Our method finds correct reward functions in navigation environments, and we demonstrate that steering the base model with the learned reward functions results in significantly increased performance in standard locomotion benchmarks. Finally, we demonstrate that our approach generalizes beyond sequential decision-making by learning a reward-like function from two large-scale image generation diffusion models. The extracted reward function successfully assigns lower rewards to harmful images.
Submitted 9 December, 2023; v1 submitted 1 June, 2023;
originally announced June 2023.
-
Large Language Models are Few-shot Publication Scoopers
Authors:
Samuel Albanie,
Liliane Momeni,
João F. Henriques
Abstract:
Driven by recent advances in AI, we passengers are entering a golden age of scientific discovery. But golden for whom? Confronting our insecurity that others may beat us to the most acclaimed breakthroughs of the era, we propose a novel solution to the long-standing personal credit assignment problem to ensure that it is golden for us. At the heart of our approach is a pip-to-the-post algorithm that assures adulatory Wikipedia pages without incurring the substantial capital and career risks of pursuing high impact science with conventional research methodologies. By leveraging the meta trend of leveraging large language models for everything, we demonstrate the unparalleled potential of our algorithm to scoop groundbreaking findings with the insouciance of a seasoned researcher at a dessert buffet.
Submitted 2 April, 2023;
originally announced April 2023.
-
Towards Solving Fuzzy Tasks with Human Feedback: A Retrospective of the MineRL BASALT 2022 Competition
Authors:
Stephanie Milani,
Anssi Kanervisto,
Karolis Ramanauskas,
Sander Schulhoff,
Brandon Houghton,
Sharada Mohanty,
Byron Galbraith,
Ke Chen,
Yan Song,
Tianze Zhou,
Bingquan Yu,
He Liu,
Kai Guan,
Yujing Hu,
Tangjie Lv,
Federico Malato,
Florian Leopold,
Amogh Raut,
Ville Hautamäki,
Andrew Melnik,
Shu Ishida,
João F. Henriques,
Robert Klassert,
Walter Laurito,
Ellen Novoseller
, et al. (5 additional authors not shown)
Abstract:
To facilitate research in the direction of fine-tuning foundation models from human feedback, we held the MineRL BASALT Competition on Fine-Tuning from Human Feedback at NeurIPS 2022. The BASALT challenge asks teams to compete to develop algorithms to solve tasks with hard-to-specify reward functions in Minecraft. Through this competition, we aimed to promote the development of algorithms that use human feedback as channels to learn the desired behavior. We describe the competition and provide an overview of the top solutions. We conclude by discussing the impact of the competition and future directions for improvement.
Submitted 23 March, 2023;
originally announced March 2023.
-
A Light Touch Approach to Teaching Transformers Multi-view Geometry
Authors:
Yash Bhalgat,
Joao F. Henriques,
Andrew Zisserman
Abstract:
Transformers are powerful visual learners, in large part due to their conspicuous lack of manually-specified priors. This flexibility can be problematic in tasks that involve multiple-view geometry, due to the near-infinite possible variations in 3D shapes and viewpoints (requiring flexibility), and the precise nature of projective geometry (obeying rigid laws). To resolve this conundrum, we propose a "light touch" approach, guiding visual Transformers to learn multiple-view geometry but allowing them to break free when needed. We achieve this by using epipolar lines to guide the Transformer's cross-attention maps, penalizing attention values outside the epipolar lines and encouraging higher attention along these lines since they contain geometrically plausible matches. Unlike previous methods, our proposal does not require any camera pose information at test-time. We focus on pose-invariant object instance retrieval, where standard Transformer networks struggle, due to the large differences in viewpoint between query and retrieved images. Experimentally, our method outperforms state-of-the-art approaches at object retrieval, without needing pose information at test-time.
Submitted 2 April, 2023; v1 submitted 28 November, 2022;
originally announced November 2022.
-
RbA: Segmenting Unknown Regions Rejected by All
Authors:
Nazir Nayal,
Mısra Yavuz,
João F. Henriques,
Fatma Güney
Abstract:
Standard semantic segmentation models owe their success to curated datasets with a fixed set of semantic categories, without contemplating the possibility of identifying unknown objects from novel categories. Existing methods in outlier detection suffer from a lack of smoothness and objectness in their predictions, due to limitations of the per-pixel classification paradigm. Furthermore, additional training for detecting outliers harms the performance of known classes. In this paper, we explore another paradigm with region-level classification to better segment unknown objects. We show that the object queries in mask classification tend to behave like one-vs-all classifiers. Based on this finding, we propose a novel outlier scoring function called RbA by defining the event of being an outlier as being rejected by all known classes. Our extensive experiments show that mask classification improves the performance of the existing outlier detection methods, and the best results are achieved with the proposed RbA. We also propose an objective to optimize RbA using minimal outlier supervision. Further fine-tuning with outliers improves the unknown performance, and unlike previous methods, it does not degrade the inlier performance.
Submitted 29 March, 2023; v1 submitted 25 November, 2022;
originally announced November 2022.
-
CASSPR: Cross Attention Single Scan Place Recognition
Authors:
Yan Xia,
Mariia Gladkova,
Rui Wang,
Qianyun Li,
Uwe Stilla,
João F. Henriques,
Daniel Cremers
Abstract:
Place recognition based on point clouds (LiDAR) is an important component for autonomous robots or self-driving vehicles. Current SOTA performance is achieved on accumulated LiDAR submaps using either point-based or voxel-based structures. While voxel-based approaches nicely integrate spatial context across multiple scales, they do not exhibit the local precision of point-based methods. As a result, existing methods struggle with fine-grained matching of subtle geometric features in sparse single-shot LiDAR scans. To overcome these limitations, we propose CASSPR as a method to fuse point-based and voxel-based approaches using cross attention transformers. CASSPR leverages a sparse voxel branch for extracting and aggregating information at lower resolution and a point-wise branch for obtaining fine-grained local information. CASSPR uses queries from one branch to try to match structures in the other branch, ensuring that both extract self-contained descriptors of the point cloud (rather than one branch dominating), but using both to inform the output global descriptor of the point cloud. Extensive experiments show that CASSPR surpasses the state-of-the-art by a large margin on several datasets (Oxford RobotCar, TUM, USyd). For instance, it achieves AR@1 of 85.6% on the TUM dataset, surpassing the strongest prior model by ~15%. Our code is publicly available.
Submitted 29 August, 2023; v1 submitted 22 November, 2022;
originally announced November 2022.
-
Wafer-scale detachable monocrystalline Germanium nanomembranes for the growth of III-V materials and substrate reuse
Authors:
Nicolas Paupy,
Zakaria Oulad Elhmaidi,
Alexandre Chapotot,
Tadeáš Hanuš,
Javier Arias-Zapata,
Bouraoui Ilahi,
Alexandre Heintz,
Alex Brice Poungoué Mbeunmi,
Roxana Arvinte,
Mohammad Reza Aziziyan,
Valentin Daniel,
Gwenaëlle Hamon,
Jérémie Chrétien,
Firas Zouaghi,
Ahmed Ayari,
Laurie Mouchel,
Jonathan Henriques,
Loïc Demoulin,
Thierno Mamoudou Diallo,
Philippe-Olivier Provost,
Hubert Pelletier,
Maïté Volatier,
Rufi Kurstjens,
Jinyoun Cho,
Guillaume Courtois
, et al. (10 additional authors not shown)
Abstract:
Germanium (Ge) is increasingly used as a substrate for high-performance optoelectronic, photovoltaic, and electronic devices. These devices are usually grown on thick and rigid Ge substrates manufactured by classical wafering techniques. Nanomembranes (NMs) provide an alternative to this approach while offering wafer-scale lateral dimensions, weight reduction, limitation of waste, and cost effectiveness. Herein, we introduce the Porous germanium Efficient Epitaxial LayEr Release (PEELER) process, which consists of the fabrication of wafer-scale detachable monocrystalline Ge NMs on porous Ge (PGe) and substrate reuse. We demonstrate monocrystalline Ge NMs with surface roughness below 1 nm on top of a nanoengineered void layer enabling layer detachment. Furthermore, these Ge NMs exhibit compatibility with the growth of III-V materials. High-resolution transmission electron microscopy (HRTEM) characterization shows the crystallinity of the Ge NMs, and high-resolution X-ray diffraction (HRXRD) reciprocal space mapping endorses high-quality GaAs layers. Finally, we demonstrate the chemical reconditioning process of the Ge substrate, allowing its reuse, to produce multiple free-standing NMs from a single parent wafer. The PEELER process significantly reduces the consumption of Ge during the fabrication process, which paves the way for a new generation of low-cost flexible optoelectronic devices.
Submitted 6 October, 2022;
originally announced October 2022.
-
Learn what matters: cross-domain imitation learning with task-relevant embeddings
Authors:
Tim Franzmeyer,
Philip H. S. Torr,
João F. Henriques
Abstract:
We study how an autonomous agent learns to perform a task from demonstrations in a different domain, such as a different environment or different agent. Such cross-domain imitation learning is required to, for example, train an artificial agent from demonstrations of a human expert. We propose a scalable framework that enables cross-domain imitation learning without access to additional demonstrations or further domain knowledge. We jointly train the learner agent's policy and learn a mapping between the learner and expert domains with adversarial training. We effect this by using a mutual information criterion to find an embedding of the expert's state space that contains task-relevant information and is invariant to domain specifics. This step significantly simplifies estimating the mapping between the learner and expert domains and hence facilitates end-to-end learning. We demonstrate successful transfer of policies between considerably different domains, without extra supervision such as additional demonstrations, and in situations where other methods fail.
Submitted 24 September, 2022;
originally announced September 2022.
-
Recovering the Graph Underlying Networked Dynamical Systems under Partial Observability: A Deep Learning Approach
Authors:
Sérgio Machado,
Anirudh Sridhar,
Paulo Gil,
Jorge Henriques,
José M. F. Moura,
Augusto Santos
Abstract:
We study the problem of graph structure identification, i.e., of recovering the graph of dependencies among time series. We model these time series data as components of the state of linear stochastic networked dynamical systems. We assume partial observability, where the state evolution of only a subset of nodes comprising the network is observed. We devise a new feature vector computed from the observed time series and prove that these features are linearly separable, i.e., there exists a hyperplane that separates the cluster of features associated with connected pairs of nodes from those associated with disconnected pairs. This renders the features amenable to training a variety of classifiers to perform causal inference. In particular, we use these features to train Convolutional Neural Networks (CNNs). The resulting causal inference mechanism outperforms state-of-the-art counterparts w.r.t. sample complexity. The trained CNNs generalize well over structurally distinct networks (dense or sparse) and noise-level profiles. Remarkably, they also generalize well to real-world networks while trained over a synthetic network (realization of a random graph). Finally, the proposed method consistently reconstructs the graph in a pairwise manner, that is, by deciding if an edge or arrow is present or absent in each pair of nodes, from the corresponding time series of each pair. This fits the framework of large-scale systems, where observation or processing of all nodes in the network is prohibitive.
Submitted 12 April, 2023; v1 submitted 8 August, 2022;
originally announced August 2022.