-
A Cosmic-Scale Benchmark for Symmetry-Preserving Data Processing
Authors:
Julia Balla,
Siddharth Mishra-Sharma,
Carolina Cuesta-Lazaro,
Tommi Jaakkola,
Tess Smidt
Abstract:
Efficiently processing structured point cloud data while preserving multiscale information is a key challenge across domains, from graphics to atomistic modeling. Using a curated dataset of simulated galaxy positions and properties, represented as point clouds, we benchmark the ability of graph neural networks to simultaneously capture local clustering environments and long-range correlations. Given the homogeneous and isotropic nature of the Universe, the data exhibits a high degree of symmetry. We therefore focus on evaluating the performance of Euclidean symmetry-preserving ($E(3)$-equivariant) graph neural networks, showing that they can outperform non-equivariant counterparts and domain-specific information extraction techniques in downstream performance as well as simulation-efficiency. However, we find that current architectures fail to capture information from long-range correlations as effectively as domain-specific baselines, motivating future work on architectures better suited for extracting long-range information.
Submitted 27 October, 2024;
originally announced October 2024.
-
CodonMPNN for Organism Specific and Codon Optimal Inverse Folding
Authors:
Hannes Stark,
Umesh Padia,
Julia Balla,
Cameron Diao,
George Church
Abstract:
Generating protein sequences conditioned on protein structures is an impactful technique for protein engineering. When synthesizing engineered proteins, they are commonly translated into DNA and expressed in an organism such as yeast. One difficulty in this process is that the expression rates can be low due to suboptimal codon sequences for expressing a protein in a host organism. We propose CodonMPNN, which generates a codon sequence conditioned on a protein backbone structure and an organism label. If naturally occurring DNA sequences are close to codon optimality, CodonMPNN could learn to generate codon sequences with higher expression yields than heuristic codon choices for generated amino acid sequences. Experiments show that CodonMPNN retains the performance of previous inverse folding approaches and recovers wild-type codons more frequently than baselines. Furthermore, CodonMPNN has a higher likelihood of generating high-fitness codon sequences than low-fitness codon sequences for the same protein sequence. Code is available at https://github.com/HannesStark/CodonMPNN.
Submitted 25 September, 2024;
originally announced September 2024.
-
The SARAO MeerKAT 1.3 GHz Galactic Plane Survey
Authors:
S. Goedhart,
W. D. Cotton,
F. Camilo,
M. A. Thompson,
G. Umana,
M. Bietenholz,
P. A. Woudt,
L. D. Anderson,
C. Bordiu,
D. A. H. Buckley,
C. S. Buemi,
F. Bufano,
F. Cavallaro,
H. Chen,
J. O. Chibueze,
D. Egbo,
B. S. Frank,
M. G. Hoare,
A. Ingallinera,
T. Irabor,
R. C. Kraan-Korteweg,
S. Kurapati,
P. Leto,
S. Loru,
M. Mutale
, et al. (105 additional authors not shown)
Abstract:
We present the SARAO MeerKAT Galactic Plane Survey (SMGPS), a 1.3 GHz continuum survey of almost half of the Galactic Plane (251° $\le l \le$ 358° and 2° $\le l \le$ 61° at $|b| \le 1.5°$). SMGPS is the largest, most sensitive and highest angular resolution 1 GHz survey of the Plane yet carried out, with an angular resolution of 8" and a broadband RMS sensitivity of $\sim$10--20 $μ$Jy/beam. Here we describe the first publicly available data release from SMGPS which comprises data cubes of frequency-resolved images over 908--1656 MHz, power law fits to the images, and broadband zeroth moment integrated intensity images. A thorough assessment of the data quality and guidance for future usage of the data products are given. Finally, we discuss the tremendous potential of SMGPS by showcasing highlights of the Galactic and extragalactic science that it permits. These highlights include the discovery of a new population of non-thermal radio filaments; identification of new candidate supernova remnants, pulsar wind nebulae and planetary nebulae; improved radio/mid-IR classification of rare Luminous Blue Variables and discovery of associated extended radio nebulae; new radio stars identified by Bayesian cross-matching techniques; the realisation that many of the largest radio-quiet WISE HII region candidates are not true HII regions; and a large sample of previously undiscovered background HI galaxies in the Zone of Avoidance.
Submitted 2 May, 2024; v1 submitted 12 December, 2023;
originally announced December 2023.
-
Over-Squashing in Riemannian Graph Neural Networks
Authors:
Julia Balla
Abstract:
Most graph neural networks (GNNs) are prone to the phenomenon of over-squashing in which node features become insensitive to information from distant nodes in the graph. Recent works have shown that the topology of the graph has the greatest impact on over-squashing, suggesting graph rewiring approaches as a suitable solution. In this work, we explore whether over-squashing can be mitigated through the embedding space of the GNN. In particular, we consider the generalization of Hyperbolic GNNs (HGNNs) to Riemannian manifolds of variable curvature in which the geometry of the embedding space is faithful to the graph's topology. We derive bounds on the sensitivity of the node features in these Riemannian GNNs as the number of layers increases, which yield promising theoretical and empirical results for alleviating over-squashing in graphs with negative curvature.
Submitted 27 November, 2023;
originally announced November 2023.
-
AI-Assisted Discovery of Quantitative and Formal Models in Social Science
Authors:
Julia Balla,
Sihao Huang,
Owen Dugan,
Rumen Dangovski,
Marin Soljacic
Abstract:
In social science, formal and quantitative models, such as ones describing economic growth and collective action, are used to formulate mechanistic explanations, provide predictions, and uncover questions about observed phenomena. Here, we demonstrate the use of a machine learning system to aid the discovery of symbolic models that capture nonlinear and dynamical relationships in social science datasets. By extending neuro-symbolic methods to find compact functions and differential equations in noisy and longitudinal data, we show that our system can be used to discover interpretable models from real-world data in economics and sociology. Augmenting existing workflows with symbolic regression can help uncover novel relationships and explore counterfactual models during the scientific process. We propose that this AI-assisted framework can bridge parametric and non-parametric models commonly employed in social science research by systematically exploring the space of nonlinear models and enabling fine-grained control over expressivity and interpretability.
Submitted 16 August, 2023; v1 submitted 2 October, 2022;
originally announced October 2022.
-
The 1.28 GHz MeerKAT Galactic Center Mosaic
Authors:
I. Heywood,
I. Rammala,
F. Camilo,
W. D. Cotton,
F. Yusef-Zadeh,
T. D. Abbott,
R. M. Adam,
G. Adams,
M. A. Aldera,
K. M. B. Asad,
E. F. Bauermeister,
T. G. H. Bennett,
H. L. Bester,
W. A. Bode,
D. H. Botha,
A. G. Botha,
L. R. S. Brederode,
S. Buchner,
J. P. Burger,
T. Cheetham,
D. I. L. de Villiers,
M. A. Dikgale-Mahlakoana,
L. J. du Toit,
S. W. P. Esterhuyse,
B. L. Fanaroff
, et al. (86 additional authors not shown)
Abstract:
The inner $\sim$200 pc region of the Galaxy contains a 4 million M$_{\odot}$ supermassive black hole (SMBH), significant quantities of molecular gas, and star formation and cosmic ray energy densities that are roughly two orders of magnitude higher than the corresponding levels in the Galactic disk. At a distance of only 8.2 kpc, the region presents astronomers with a unique opportunity to study a diverse range of energetic astrophysical phenomena, from stellar objects in extreme environments, to the SMBH and star-formation driven feedback processes that are known to influence the evolution of galaxies as a whole. We present a new survey of the Galactic center conducted with the South African MeerKAT radio telescope. Radio imaging offers a view that is unaffected by the large quantities of dust that obscure the region at other wavelengths, and a scene of striking complexity is revealed. We produce total intensity and spectral index mosaics of the region from 20 pointings (144 hours on-target in total), covering 6.5 square degrees with an angular resolution of 4$''$, at a central frequency of 1.28 GHz. Many new features are revealed for the first time due to a combination of MeerKAT's high sensitivity, exceptional $u,v$-plane coverage, and geographical vantage point. We highlight some initial survey results, including new supernova remnant candidates, many new non-thermal filament complexes, and enhanced views of the Radio Arc Bubble, Sgr A and Sgr B regions. This project is a SARAO public legacy survey, and the image products are made available with this article.
Submitted 27 January, 2022; v1 submitted 25 January, 2022;
originally announced January 2022.
-
The MeerKAT Galaxy Cluster Legacy Survey I. Survey Overview and Highlights
Authors:
K. Knowles,
W. D. Cotton,
L. Rudnick,
F. Camilo,
S. Goedhart,
R. Deane,
M. Ramatsoku,
M. F. Bietenholz,
M. Brüggen,
C. Button,
H. Chen,
J. O. Chibueze,
T. E. Clarke,
F. de Gasperin,
R. Ianjamasimanana,
G. I. G. Józsa,
M. Hilton,
K. C. Kesebonye,
K. Kolokythas,
R. C. Kraan-Korteweg,
G. Lawrie,
M. Lochner,
S. I. Loubser,
P. Marchegiani,
N. Mhlahlo
, et al. (126 additional authors not shown)
Abstract:
MeerKAT's large number of antennas, spanning 8 km with a densely packed 1 km core, create a powerful instrument for wide-area surveys, with high sensitivity over a wide range of angular scales. The MeerKAT Galaxy Cluster Legacy Survey (MGCLS) is a programme of long-track MeerKAT L-band (900-1670 MHz) observations of 115 galaxy clusters, observed for $\sim$6-10 hours each in full polarisation. The first legacy product data release (DR1), made available with this paper, includes the MeerKAT visibilities, basic image cubes at $\sim$8" resolution, and enhanced spectral and polarisation image cubes at $\sim$8" and 15" resolutions. Typical sensitivities for the full-resolution MGCLS image products are $\sim$3-5 μJy/beam. The basic cubes are full-field and span 4 deg^2. The enhanced products consist of the inner 1.44 deg^2 field of view, corrected for the primary beam. The survey is fully sensitive to structures up to $\sim$10' scales and the wide bandwidth allows spectral and Faraday rotation mapping. HI mapping at 209 kHz resolution can be done at $0<z<0.09$ and $0.19<z<0.48$. In this paper, we provide an overview of the survey and DR1 products, including caveats for usage. We present some initial results from the survey, both for their intrinsic scientific value and to highlight the capabilities for further exploration with these data. These include a primary beam-corrected compact source catalogue of $\sim$626,000 sources for the full survey, and an optical/infrared cross-matched catalogue for compact sources in Abell 209 and Abell S295. We examine dust unbiased star-formation rates as a function of clustercentric radius in Abell 209 and present a catalogue of 99 diffuse cluster sources (56 are new), some of which have no suitable characterisation. We also highlight some of the radio galaxies which challenge current paradigms and present first results from HI studies of four targets.
Submitted 10 November, 2021;
originally announced November 2021.
-
PrivateMail: Supervised Manifold Learning of Deep Features With Differential Privacy for Image Retrieval
Authors:
Praneeth Vepakomma,
Julia Balla,
Ramesh Raskar
Abstract:
Differential Privacy offers strong guarantees such as immutable privacy under post-processing. Thus it is often looked to as a solution to learning on scattered and isolated data. This work focuses on supervised manifold learning, a paradigm that can generate fine-tuned manifolds for a target use case. Our contributions are twofold. 1) We present a novel differentially private method \textit{PrivateMail} for supervised manifold learning, the first of its kind to our knowledge. 2) We provide a novel private geometric embedding scheme for our experimental use case. We experiment on private "content based image retrieval" - embedding and querying the nearest neighbors of images in a private manner - and show extensive privacy-utility tradeoff results, as well as the computational efficiency and practicality of our methods.
Submitted 5 October, 2021; v1 submitted 22 February, 2021;
originally announced February 2021.
-
Splintering with distributions: A stochastic decoy scheme for private computation
Authors:
Praneeth Vepakomma,
Julia Balla,
Ramesh Raskar
Abstract:
Performing computations while maintaining privacy is an important problem in today's distributed machine learning solutions. Consider the following two setups between a client and a server, where in setup i) the client has a public data vector $\mathbf{x}$, the server has a large private database of data vectors $\mathcal{B}$ and the client wants to find the inner products $\langle \mathbf{x,y_k} \rangle, \forall \mathbf{y_k} \in \mathcal{B}$. The client does not want the server to learn $\mathbf{x}$ while the server does not want the client to learn the records in its database. This is in contrast to another setup ii) where the client would like to perform an operation solely on its data, such as computation of a matrix inverse on its data matrix $\mathbf{M}$, but would like to use the superior computing ability of the server to do so without having to leak $\mathbf{M}$ to the server. We present a stochastic scheme for splitting the client data into privatized shares that are transmitted to the server in such settings. The server performs the requested operations on these shares instead of on the raw client data. The obtained intermediate results are sent back to the client, where they are assembled to obtain the final result.
Submitted 26 January, 2022; v1 submitted 6 July, 2020;
originally announced July 2020.
-
The 1.28 GHz MeerKAT DEEP2 Image
Authors:
T. Mauch,
W. D. Cotton,
J. J. Condon,
A. M. Matthews,
T. D. Abbott,
R. M. Adam,
M. A. Aldera,
K. M. B. Asad,
E. F. Bauermeister,
T. G. H. Bennett,
H. Bester,
D. H. Botha,
L. R. S. Brederode,
Z. B. Brits,
S. J. Buchner,
J. P. Burger,
F. Camilo,
J. M. Chalmers,
T. Cheetham,
D. de Villiers,
M. S. de Villiers,
M. A. Dikgale-Mahlakoana,
L. J. du Toit,
S. W. P. Esterhuyse,
G. Fadana
, et al. (79 additional authors not shown)
Abstract:
We present the confusion-limited 1.28 GHz MeerKAT DEEP2 image covering one $\approx 68'$ FWHM primary beam area with $7.6''$ FWHM resolution and $0.55 \pm 0.01$ $μ$Jy/beam rms noise. Its J2000 center position $α=04^h 13^m 26.4^s$, $δ=-80^\circ 00' 00''$ was selected to minimize artifacts caused by bright sources. We introduce the new 64-element MeerKAT array and describe commissioning observations to measure the primary beam attenuation pattern, estimate telescope pointing errors, and pinpoint $(u,v)$ coordinate errors caused by offsets in frequency or time. We constructed a 1.4 GHz differential source count by combining a power-law count fit to the DEEP2 confusion $P(D)$ distribution from $0.25$ to $10$ $μ$Jy with counts of individual DEEP2 sources between $10$ $μ$Jy and $2.5$ mJy. Most sources fainter than $S \sim 100$ $μ$Jy are distant star-forming galaxies obeying the FIR/radio correlation, and sources stronger than $0.25$ $μ$Jy account for $\sim93\%$ of the radio background produced by star-forming galaxies. For the first time, the DEEP2 source count has reached the depth needed to reveal the majority of the star formation history of the universe. A pure luminosity evolution of the 1.4 GHz local luminosity function consistent with the Madau & Dickinson (2014) model for the evolution of star-forming galaxies based on UV and infrared data underpredicts our 1.4 GHz source count in the range $-5 \lesssim \log[S(\mathrm{Jy})] \lesssim -4$.
Submitted 12 December, 2019;
originally announced December 2019.
-
Inflation of 430-parsec bipolar radio bubbles in the Galactic Centre by an energetic event
Authors:
I. Heywood,
F. Camilo,
W. D. Cotton,
F. Yusef-Zadeh,
T. D. Abbott,
R. M. Adam,
M. A. Aldera,
E. F. Bauermeister,
R. S. Booth,
A. G. Botha,
D. H. Botha,
L. R. S. Brederode,
Z. B. Brits,
S. J. Buchner,
J. P. Burger,
J. M. Chalmers,
T. Cheetham,
D. de Villiers,
M. A. Dikgale-Mahlakoana,
L. J. du Toit,
S. W. P. Esterhuyse,
B. L. Fanaroff,
A. R. Foley,
D. J. Fourie,
R. R. G. Gamatham
, et al. (74 additional authors not shown)
Abstract:
The Galactic Centre contains a supermassive black hole with a mass of 4 million suns within an environment that differs markedly from that of the Galactic disk. While the black hole is essentially quiescent in the broader context of active galactic nuclei, X-ray observations have provided evidence for energetic outbursts from its surroundings. Also, while the levels of star formation in the Galactic Centre have been approximately constant over the last few hundred Myr, there is evidence of elevated short-duration bursts, strongly influenced by interaction of the black hole with the enhanced gas density present within the ring-like Central Molecular Zone at Galactic longitude |l| < 0.7 degrees and latitude |b| < 0.2 degrees. The inner 200 pc region is characterized by large amounts of warm molecular gas, a high cosmic ray ionization rate, unusual gas chemistry, enhanced synchrotron emission, and a multitude of radio-emitting magnetised filaments, the origin of which has not been established. Here we report radio imaging that reveals bipolar bubbles spanning 1 degree x 3 degrees (140 parsecs x 430 parsecs), extending above and below the Galactic plane and apparently associated with the Galactic Centre. The structure is edge-brightened and bounded, with symmetry implying creation by an energetic event in the Galactic Centre. We estimate the age of the bubbles to be a few million years, with a total energy of 7 x 10^52 ergs. We postulate that the progenitor event was a major contributor to the increased cosmic-ray density in the Galactic Centre, and is in turn the principal source of the relativistic particles required to power the synchrotron emission of the radio filaments within and in the vicinity of the bubble cavities.
Submitted 12 September, 2019;
originally announced September 2019.
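Worked check:
The abstract above quotes the bubbles as 1 degree x 3 degrees, or 140 parsecs x 430 parsecs. The short sketch below checks that conversion with the small-angle approximation. The distance of 8.2 kpc is an assumption borrowed from the Galactic Center Mosaic abstract in this listing; this abstract does not state the value it used, and the function and variable names are illustrative only.

```python
import math


def angular_to_physical_pc(angle_deg: float, distance_pc: float) -> float:
    """Small-angle approximation: physical size = distance * angle in radians."""
    return distance_pc * math.radians(angle_deg)


# Assumed distance to the Galactic Centre (8.2 kpc), not stated in this abstract.
DISTANCE_PC = 8200.0

width_pc = angular_to_physical_pc(1.0, DISTANCE_PC)
height_pc = angular_to_physical_pc(3.0, DISTANCE_PC)

print(f"{width_pc:.0f} pc x {height_pc:.0f} pc")
```

Under that assumed distance the result is roughly 143 pc x 429 pc, which agrees with the quoted 140 parsecs x 430 parsecs once rounded.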
-
Nucleon matrix elements of twist-3 and 4 operators from the instanton vacuum
Authors:
J. Balla,
M. V. Polyakov,
C. Weiss
Abstract:
The spin-dependent twist-3 and 4 nucleon matrix elements d^(2) and f^(2) describing power corrections to the Bjorken and Ellis-Jaffe sum rules are computed in the instanton vacuum. A systematic expansion in the small packing fraction of the instanton medium, rho / R << 1, is performed. We find that the twist-3 matrix element d^(2) is suppressed [of order (rho / R)^4], while the twist-4 matrix element f^(2) is of order unity. Numerically, d^(2) << f^(2). The small value of d^(2) \sim 10^{-3} obtained from instantons is consistent with the recent E143 measurements of the structure function g_2, where d^(2) enters at the same level as the twist-2 contribution.
Submitted 26 January, 1999;
originally announced January 1999.
-
Nucleon matrix elements of higher-twist operators from the instanton vacuum
Authors:
J. Balla,
M. V. Polyakov,
C. Weiss
Abstract:
We compute the nucleon matrix elements of QCD operators of twist 3 and 4 in the instanton vacuum. We consider the operators determining 1/Q^2-power corrections to the Bjorken, Ellis-Jaffe and Gross-Llewellyn-Smith sum rules. The nucleon is described as a soliton of the effective chiral theory derived from instantons in the 1/N_c-expansion. QCD operators involving the gluon field are systematically represented by effective operators in the effective chiral theory. We find that twist-3 matrix elements are suppressed relative to twist-4 by a power of the packing fraction of the instanton medium. Numerical results for the spin-dependent (d^(2), f^(2)) and spin-independent twist-3 and 4 matrix elements are compared with results of other approaches and with experimental estimates of power corrections. The methods developed can be used to evaluate a wide range of matrix elements relevant to DIS.
Submitted 30 July, 1997;
originally announced July 1997.