-
olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
Authors:
Jake Poznanski,
Jon Borchardt,
Jason Dunkelberger,
Regan Huff,
Daniel Lin,
Aman Rangapur,
Christopher Wilhelm,
Kyle Lo,
Luca Soldaini
Abstract:
PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models. However, these documents come in a diversity of types with differing formats and visual layouts that pose a challenge when attempting to extract and faithfully represent the underlying content for language model use. We present olmOCR, an open-source Python toolkit for processing PDFs into clean, linearized plain text in natural reading order while preserving structured content like sections, tables, lists, equations, and more. Our toolkit runs a fine-tuned 7B vision language model (VLM) trained on a sample of 260,000 pages from over 100,000 crawled PDFs with diverse properties, including graphics, handwritten text and poor quality scans. olmOCR is optimized for large-scale batch processing, able to scale flexibly to different hardware setups and convert a million PDF pages for only $190 USD. We release all components of olmOCR including VLM weights, data and training code, as well as inference code built on serving frameworks including vLLM and SGLang.
Submitted 25 February, 2025;
originally announced February 2025.
-
2 OLMo 2 Furious
Authors:
Team OLMo,
Pete Walsh,
Luca Soldaini,
Dirk Groeneveld,
Kyle Lo,
Shane Arora,
Akshita Bhagia,
Yuling Gu,
Shengyi Huang,
Matt Jordan,
Nathan Lambert,
Dustin Schwenk,
Oyvind Tafjord,
Taira Anderson,
David Atkinson,
Faeze Brahman,
Christopher Clark,
Pradeep Dasigi,
Nouha Dziri,
Michal Guerquin,
Hamish Ivison,
Pang Wei Koh,
Jiacheng Liu,
Saumya Malik,
William Merrill
, et al. (15 additional authors not shown)
Abstract:
We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes dense autoregressive models with improved architecture and training recipe, pretraining data mixtures, and instruction tuning recipes. Our modified model architecture and training recipe achieve both better training stability and improved per-token efficiency. Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124, which significantly improves model capabilities across many downstream task benchmarks when introduced via late-stage curriculum training (i.e. specialized data during the annealing phase of pretraining). Finally, we incorporate best practices from Tülu 3 to develop OLMo 2-Instruct, focusing on permissive data and extending our final-stage reinforcement learning with verifiable rewards (RLVR). Our OLMo 2 base models sit at the Pareto frontier of performance to compute, often matching or outperforming open-weight only models like Llama 3.1 and Qwen 2.5 while using fewer FLOPs and with fully transparent training data, code, and recipe. Our fully open OLMo 2-Instruct models are competitive with or surpassing open-weight only models of comparable size, including Qwen 2.5, Llama 3.1 and Gemma 2. We release all OLMo 2 artifacts openly -- models at 7B and 13B scales, both pretrained and post-trained, including their full training data, training code and recipes, training logs and thousands of intermediate checkpoints. The final instruction model is available on the Ai2 Playground as a free research demo.
Submitted 14 January, 2025; v1 submitted 31 December, 2024;
originally announced January 2025.
-
Tulu 3: Pushing Frontiers in Open Language Model Post-Training
Authors:
Nathan Lambert,
Jacob Morrison,
Valentina Pyatkin,
Shengyi Huang,
Hamish Ivison,
Faeze Brahman,
Lester James V. Miranda,
Alisa Liu,
Nouha Dziri,
Shane Lyu,
Yuling Gu,
Saumya Malik,
Victoria Graf,
Jena D. Hwang,
Jiangjiang Yang,
Ronan Le Bras,
Oyvind Tafjord,
Chris Wilhelm,
Luca Soldaini,
Noah A. Smith,
Yizhong Wang,
Pradeep Dasigi,
Hannaneh Hajishirzi
Abstract:
Language model post-training is applied to refine behaviors and unlock new skills across a wide range of recent language models, but open recipes for applying these techniques lag behind proprietary ones. The underlying training data and recipes for post-training are simultaneously the most important pieces of the puzzle and the portion with the least transparency. To bridge this gap, we introduce Tulu 3, a family of fully-open state-of-the-art post-trained models, alongside its data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques. Tulu 3, which builds on Llama 3.1 base models, achieves results surpassing the instruct versions of Llama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and Claude 3.5-Haiku. The training algorithms for our models include supervised finetuning (SFT), Direct Preference Optimization (DPO), and a novel method we call Reinforcement Learning with Verifiable Rewards (RLVR). With Tulu 3, we introduce a multi-task evaluation scheme for post-training recipes with development and unseen evaluations, standard benchmark implementations, and substantial decontamination of existing open datasets on said benchmarks. We conclude with analysis and discussion of training methods that did not reliably improve performance.
In addition to the Tulu 3 model weights and demo, we release the complete recipe -- including datasets for diverse core skills, a robust toolkit for data curation and evaluation, the training code and infrastructure, and, most importantly, a detailed report for reproducing and further adapting the Tulu 3 approach to more domains.
Submitted 14 April, 2025; v1 submitted 22 November, 2024;
originally announced November 2024.
-
On the suppression of giant planet formation around low-mass stars in clustered environments
Authors:
Shuo Huang,
Simon Portegies Zwart,
Maite J. C. Wilhelm
Abstract:
Context: Current exoplanet formation studies tend to overlook the birth environment of stars in clustered environments. The effect of this environment on the planet-formation process, however, is important, especially in the earliest stage. Aims: We investigate the differences in planet populations forming in star-cluster environments through pebble accretion and compare these results with the planet formation around isolated stars. We try to provide potential signatures on the young planetary systems to guide future observation. Methods: We design and present a new planet population synthesis code for clustered environments. The planet formation model is based on pebble accretion and includes migration in the circumstellar disk. The disk's gas and dust are evolved in 1D simulations considering the effects of photo-evaporation of the nearby stars. Results: Planetary systems in a clustered environment are different from those born in isolation; the environmental effects are important for a wide range of observable parameters and the eventual architecture of the planetary systems. Planetary systems born in a clustered environment lack cold Jupiters compared to isolated planetary systems. This effect is more pronounced for low-mass stars ($\lesssim$0.2 $M_\odot$). On the other hand, planetary systems born in clusters show an excess of cold Neptunes around these low-mass stars. Conclusions: In future observations, finding an excess of cold Neptunes and a lack of cold Jupiters could be used to constrain the birth environments of these planetary systems. Exploring the dependence of cold Jupiters' intrinsic occurrence rate on stellar mass provides insights into the birth environment of their proto-embryos.
Submitted 2 August, 2024; v1 submitted 26 July, 2024;
originally announced July 2024.
-
Massive star cluster formation I. High star formation efficiency while resolving feedback of individual stars
Authors:
Brooke Polak,
Mordecai-Mark Mac Low,
Ralf S. Klessen,
Jia Wei Teh,
Claude Cournoyer-Cloutier,
Eric P. Andersson,
Sabrina M. Appel,
Aaron Tran,
Sean C. Lewis,
Maite J. C. Wilhelm,
Simon Portegies Zwart,
Simon C. O. Glover,
Long Wang,
Stephen L. W. McMillan
Abstract:
The mode of star formation that results in the formation of globular clusters and young massive clusters is difficult to constrain through observations. We present models of massive star cluster formation using the Torch framework, which uses AMUSE to couple distinct multi-physics codes that handle star formation, stellar evolution and dynamics, radiative transfer, and magnetohydrodynamics. We upgrade Torch by implementing the N-body code PeTar, thereby enabling Torch to handle massive clusters forming from $10^6\rm\, M_\odot$ clouds with $\ge10^5$ individual stars. We present results from Torch simulations of star clusters forming from $10^4, 10^5$, and $10^6\rm M_\odot$ turbulent, spherical gas clouds (named M4, M5, M6) of radius $R=11.7$ pc. We find that star formation is highly efficient and becomes more so at higher cloud mass and surface density. For M4, M5, and M6 with initial surface densities $2.325\times 10^{1,2,3}\rm\, M_\odot\, pc^{-2}$, after a free-fall time of $t_{ff}=6.7,2.1,0.67$ Myr, we find that $\sim$30%, 40%, and 60% of the cloud mass has formed into stars, respectively. The final integrated star formation efficiency is 32%, 65%, and 85% for M4, M5, and M6. Observations of nearby clusters similar to M4 have similar integrated star formation efficiencies of $\leq$30%. The M5 and M6 models represent a different regime of cluster formation that is more appropriate for the conditions in starburst galaxies and gas-rich galaxies at high redshift, and that leads to a significantly higher efficiency of star formation. We argue that young massive clusters build up through short efficient bursts of star formation in regions that are sufficiently dense ($\ge 10^2 \rm\,M_\odot\,pc^{-2}$) and massive ($\ge10^5\rm\, M_\odot$). In such environments, the dynamical time of the cloud becomes short enough that stellar feedback cannot act quickly enough to slow star formation.
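The initial surface densities quoted above follow from the cloud masses and the common radius. A minimal sketch, assuming the surface density is defined as the cloud mass divided by the projected disc area, $\Sigma = M / (\pi R^2)$ (the abstract does not state the definition; the function and variable names are illustrative only):

```python
import math


def surface_density(mass_msun: float, radius_pc: float) -> float:
    """Mean projected surface density of a spherical cloud, in M_sun / pc^2.

    Assumes Sigma = M / (pi R^2); this definition is an assumption,
    not something stated in the abstract.
    """
    return mass_msun / (math.pi * radius_pc ** 2)


RADIUS_PC = 11.7  # cloud radius quoted in the abstract
CLOUD_MASSES_MSUN = {"M4": 1e4, "M5": 1e5, "M6": 1e6}  # masses quoted in the abstract

# Surface density for each model cloud.
densities = {
    name: surface_density(mass, RADIUS_PC)
    for name, mass in CLOUD_MASSES_MSUN.items()
}

for name, sigma in densities.items():
    print(f"{name}: {sigma:.4g} M_sun/pc^2")
```

Under this assumption the M4 value rounds to 23.25 M$_\odot$ pc$^{-2}$, and M5 and M6 scale by factors of 10 and 100, matching the quoted $2.325\times 10^{1,2,3}\rm\, M_\odot\, pc^{-2}$.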
Submitted 7 March, 2025; v1 submitted 11 December, 2023;
originally announced December 2023.
-
The Semantic Reader Project: Augmenting Scholarly Documents through AI-Powered Interactive Reading Interfaces
Authors:
Kyle Lo,
Joseph Chee Chang,
Andrew Head,
Jonathan Bragg,
Amy X. Zhang,
Cassidy Trier,
Chloe Anastasiades,
Tal August,
Russell Authur,
Danielle Bragg,
Erin Bransom,
Isabel Cachola,
Stefan Candra,
Yoganand Chandrasekhar,
Yen-Sung Chen,
Evie Yu-Yen Cheng,
Yvonne Chou,
Doug Downey,
Rob Evans,
Raymond Fok,
Fangzhou Hu,
Regan Huff,
Dongyeop Kang,
Tae Soo Kim,
Rodney Kinney
, et al. (30 additional authors not shown)
Abstract:
Scholarly publications are key to the transfer of knowledge from scholars to others. However, research papers are information-dense, and as the volume of the scientific literature grows, the need for new technology to support the reading process grows. In contrast to the process of finding papers, which has been transformed by Internet technology, the experience of reading research papers has changed little in decades. The PDF format for sharing research papers is widely used due to its portability, but it has significant downsides including: static content, poor accessibility for low-vision readers, and difficulty reading on mobile devices. This paper explores the question "Can recent advances in AI and HCI power intelligent, interactive, and accessible reading interfaces -- even for legacy PDFs?" We describe the Semantic Reader Project, a collaborative effort across multiple institutions to explore automatic creation of dynamic reading interfaces for research papers. Through this project, we've developed ten research prototype interfaces and conducted usability studies with more than 300 participants and real-world users showing improved reading experiences for scholars. We've also released a production reading interface for research papers that will incorporate the best features as they mature. We structure this paper around challenges scholars and the public face when reading research papers -- Discovery, Efficiency, Comprehension, Synthesis, and Accessibility -- and present an overview of our progress and remaining open challenges.
Submitted 23 April, 2023; v1 submitted 24 March, 2023;
originally announced March 2023.
-
Early Evolution and 3D Structure of Embedded Star Clusters
Authors:
Claude Cournoyer-Cloutier,
Alison Sills,
William E. Harris,
Sabrina M. Appel,
Sean C. Lewis,
Brooke Polak,
Aaron Tran,
Martijn J. C. Wilhelm,
Mordecai-Mark Mac Low,
Stephen L. W. McMillan,
Simon Portegies Zwart
Abstract:
We perform simulations of star cluster formation to investigate the morphological evolution of embedded star clusters in the earliest stages of their evolution. We conduct our simulations with Torch, which uses the AMUSE framework to couple state-of-the-art stellar dynamics to star formation, radiation, stellar winds, and hydrodynamics in FLASH. We simulate a suite of $10^4$ M$_{\odot}$ clouds at 0.0683 pc resolution for $\sim$ 2 Myr after the onset of star formation, with virial parameters $α$ = 0.8, 2.0, 4.0 and different random samplings of the stellar initial mass function and prescriptions for primordial binaries. Our simulations result in a population of embedded clusters with realistic morphologies (sizes, densities, and ellipticities) that reproduce the known trend of clouds with higher initial $α$ having lower star formation efficiencies. Our key results are as follows: (1) Cluster mass growth is not monotonic, and clusters can lose up to half of their mass while they are embedded. (2) Cluster morphology is not correlated with cluster mass and changes over $\sim$ 0.01 Myr timescales. (3) The morphology of an embedded cluster is not indicative of its long-term evolution but only of its recent history: radius and ellipticity increase sharply when a cluster accretes stars. (4) The dynamical evolution of very young embedded clusters with masses $\lesssim$ 1000 M$_{\odot}$ is dominated by the overall gravitational potential of the star-forming region rather than by internal dynamical processes such as two- or few-body relaxation.
Submitted 16 February, 2023;
originally announced February 2023.
-
Radiation shielding of protoplanetary discs in young star-forming regions
Authors:
Martijn J. C. Wilhelm,
Simon Portegies Zwart,
Claude Cournoyer-Cloutier,
Sean C. Lewis,
Brooke Polak,
Aaron Tran,
Mordecai-Mark Mac Low
Abstract:
Protoplanetary discs spend their lives in the dense environment of a star forming region. While there, they can be affected by nearby stars through external photoevaporation and dynamic truncations. We present simulations that use the AMUSE framework to couple the Torch model for star cluster formation from a molecular cloud with a model for the evolution of protoplanetary discs under these two environmental processes. We compare simulations with and without extinction of photoevaporation-driving radiation. We find that the majority of discs in our simulations are considerably shielded from photoevaporation-driving radiation for at least 0.5 Myr after the formation of the first massive stars. Radiation shielding increases disc lifetimes by an order of magnitude and can let a disc retain more solid material for planet formation. The reduction in external photoevaporation leaves discs larger and more easily dynamically truncated, although external photoevaporation remains the dominant mass loss process. Finally, we find that the correlation between disc mass and projected distance to the most massive nearby star (often interpreted as a sign of external photoevaporation) can be erased by the presence of less massive stars that dominate their local radiation field. Overall, we find that the presence and dynamics of gas in embedded clusters with massive stars is important for the evolution of protoplanetary discs.
Submitted 7 February, 2023;
originally announced February 2023.
-
The Semantic Scholar Open Data Platform
Authors:
Rodney Kinney,
Chloe Anastasiades,
Russell Authur,
Iz Beltagy,
Jonathan Bragg,
Alexandra Buraczynski,
Isabel Cachola,
Stefan Candra,
Yoganand Chandrasekhar,
Arman Cohan,
Miles Crawford,
Doug Downey,
Jason Dunkelberger,
Oren Etzioni,
Rob Evans,
Sergey Feldman,
Joseph Gorney,
David Graham,
Fangzhou Hu,
Regan Huff,
Daniel King,
Sebastian Kohlmeier,
Bailey Kuehl,
Michael Langan,
Daniel Lin
, et al. (23 additional authors not shown)
Abstract:
The volume of scientific output is creating an urgent need for automated tools to help scientists keep up with developments in their field. Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature. We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction to build the Semantic Scholar Academic Graph, the largest open scientific literature graph to-date, with 200M+ papers, 80M+ authors, 550M+ paper-authorship edges, and 2.4B+ citation edges. The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings. In this paper, we describe the components of the S2 data processing pipeline and the associated APIs offered by the platform. We will update this living document to reflect changes as we add new data offerings and improve existing services.
Submitted 25 April, 2025; v1 submitted 24 January, 2023;
originally announced January 2023.
-
Early-Forming Massive Stars Suppress Star Formation and Hierarchical Cluster Assembly
Authors:
Sean C. Lewis,
Stephen L. W. McMillan,
Mordecai-Mark Mac Low,
Claude Cournoyer-Cloutier,
Brooke Polak,
Martijn J. C. Wilhelm,
Aaron Tran,
Alison Sills,
Simon Portegies Zwart,
Ralf S. Klessen,
Joshua E. Wall
Abstract:
Feedback from massive stars plays an important role in the formation of star clusters. Whether a very massive star is born early or late in the cluster formation timeline has profound implications for the star cluster formation and assembly processes. We carry out a controlled experiment to characterize the effects of early-forming massive stars on star cluster formation. We use the star formation software suite Torch, combining self-gravitating magnetohydrodynamics, ray-tracing radiative transfer, $N$-body dynamics, and stellar feedback to model four initially identical $10^4$ M$_\odot$ giant molecular clouds with a Gaussian density profile peaking at $521.5$ cm$^{-3}$. Using the Torch software suite through the AMUSE framework we modify three of the models to ensure that the first star that forms is very massive (50, 70, 100 M$_\odot$). Early-forming massive stars disrupt the natal gas structure, resulting in fast evacuation of the gas from the star forming region. The star formation rate is suppressed, reducing the total mass of stars formed. Our fiducial control model without an early massive star has a larger star formation rate and total efficiency by up to a factor of three and a higher average star formation efficiency per free-fall time by up to a factor of seven. Early-forming massive stars promote the buildup of spatially separate and gravitationally unbound subclusters, while the control model forms a single massive cluster.
Submitted 28 February, 2023; v1 submitted 2 December, 2022;
originally announced December 2022.
-
Expanding shells around young clusters -- S 171/Be 59
Authors:
G. F. Gahm,
M. J. C. Wilhelm,
C. M. Persson,
A. A. Djupvik,
S. F. Portegies Zwart
Abstract:
Some HII regions that surround young stellar clusters are bordered by molecular shells that appear to expand at a rate inconsistent with our current model simulations. In this study we focus on the dynamics of Sharpless 171 (including NGC 7822), which surrounds the cluster Berkeley 59. We aim to compare the velocity pattern over the molecular shell with the mean radial velocity of the cluster for estimates of the expansion velocities of different shell structures, and to match the observed properties with model simulations. Optical spectra of 27 stars located in Berkeley 59 were collected at the Nordic Optical Telescope, and a number of molecular structures scattered over the entire region were mapped in $^{13}$CO(1-0) at Onsala Space Observatory. We obtained radial velocities and MK classes for the cluster's stars. At least four of the O stars are found to be spectroscopic binaries, in addition to one triplet system. From these data we obtain the mean radial velocity of the cluster. From the $^{13}$CO spectra we identify three shell structures, expanding relative to the cluster at moderate velocity (4 km/s), high velocity (12 km/s), and in between. The high-velocity cloudlets extend over a larger radius and are less massive than the low-velocity cloudlets. We performed a model simulation to understand the evolution of this complex. Our simulation of the Sharpless 171 complex and Berkeley 59 cluster demonstrates that the individual components can be explained as a shell driven by stellar winds from the massive cluster members. However, our relatively simple model produces a single component. Modelling of the propagation of shell fragments through a uniform interstellar medium demonstrates that dense cloudlets detached from the shell are decelerated less efficiently than the shell itself. They can reach greater distances and retain higher velocities than the shell.
Submitted 11 May, 2022;
originally announced May 2022.
-
The Ice Coverage of Earth-like Planets Orbiting FGK Stars
Authors:
Caitlyn Wilhelm,
Rory Barnes,
Russell Deitrick,
Rachel Mellman
Abstract:
The photometric and spectroscopic signatures of habitable planets orbiting FGK stars may be modulated by surface ice coverage. To estimate its frequency and locations, we simulated the climates of hypothetical planets with a 1D energy balance model and assumed that the planets possess properties similar to modern Earth (mass, geography, atmosphere). We first simulated planets with fixed rotational axes and circular orbits, finding that the vast majority (>90%) of planets with habitable surfaces are free of ice. For planets with partial ice coverage, the parameter space for ice caps (interannual ice located at the poles) is about as large as that for "ice belts" (interannual ice located at the equator), but belts only persist on land. We then performed simulations that mimicked perturbations from other planets by forcing sinusoidal orbital and rotational oscillations over a range of frequencies and amplitudes. We assume initially ice-free surfaces and set the initial eccentricity distribution to mirror known exoplanets, while the initial obliquity distribution matches planet formation predictions, i.e., favoring 90 degrees. For these dynamic cases, we find again that ~90% of habitable planets are free of surface ice for a range of assumptions for ice's albedo. Planets orbiting F dwarfs are three times as likely to have ice caps as belts, but for planets orbiting K and G dwarfs ice belts are twice as likely as caps. In some cases, a planet's surface ice can cycle between the equatorial and polar regions. Future direct imaging surveys of habitable planets may be able to test these predictions.
Submitted 6 December, 2021;
originally announced December 2021.
-
Exploring the possibility of Peter Pan discs across stellar mass
Authors:
Martijn J. C. Wilhelm,
Simon Portegies Zwart
Abstract:
Recently, several accreting M dwarf stars have been discovered with ages far exceeding the typical protoplanetary disc lifetime. These 'Peter Pan discs' can be explained as primordial discs that evolve in a low-radiation environment. The persistently low masses of the host stars raise the question whether primordial discs can survive up to these ages around stars of higher mass. In this work we explore the way in which different mass loss processes in protoplanetary discs limit their maximum lifetimes, and how this depends on host star mass. We find that stars with masses $\lesssim$ 0.6 M$_\odot$ can retain primordial discs for $\sim$50 Myr. At stellar masses $\gtrsim$ 0.8 M$_\odot$, the maximum disc lifetime decreases strongly to below 50 Myr due to relatively more efficient accretion and photoevaporation by the host star. Lifetimes up to 15 Myr are still possible for all host star masses up to $\sim$2 M$_\odot$. For host star masses between 0.6 and 0.8 M$_\odot$, accretion ceases and an inner gap forms before 50 Myr in our models. Observations suggest that such a configuration is rapidly dispersed. We conclude that Peter Pan discs can only occur around M dwarf stars.
Submitted 3 September, 2021;
originally announced September 2021.
-
Evolution of circumstellar discs in young star-forming regions
Authors:
Francisca Concha-Ramírez,
Martijn J. C. Wilhelm,
Simon Portegies Zwart
Abstract:
The evolution of circumstellar discs is influenced by their surroundings. The relevant processes include external photoevaporation due to nearby stars, and dynamical truncations. The impact of these processes on disc populations depends on the star-formation history and on the dynamical evolution of the region. Since star formation history and the phase-space characteristics of the stars are important for the evolution of the discs, we start simulating the evolution of the star cluster with the results of molecular cloud collapse simulations. In the simulation we form stars with circumstellar discs, which can be affected by different processes. Our models account for the viscous evolution of the discs, internal and external photoevaporation of gas, external photoevaporation of dust, and dynamical truncations. All these processes are resolved together with the dynamical evolution of the cluster, and the evolution of the stars.
An extended period of star formation, lasting for at least 2 Myr, results in some discs being formed late. These late formed discs have a better chance of survival because the cluster gradually expands with time, and a lower local stellar density reduces the effects of photoevaporation and dynamical truncation. Late formed discs can then be present in regions of high UV radiation, solving the proplyd lifetime problem. We also find a considerable fraction of discs that lose their gas content, but remain sufficiently rich in solids to be able to form a rocky planetary system.
Submitted 26 May, 2022; v1 submitted 19 January, 2021;
originally announced January 2021.
-
Effects of stellar density on the photoevaporation of circumstellar discs
Authors:
Francisca Concha-Ramírez,
Martijn J. C. Wilhelm,
Simon Portegies Zwart,
Sierk E. van Terwisga,
Alvaro Hacar
Abstract:
Circumstellar discs are the precursors of planetary systems and develop shortly after their host star has formed. In their early stages these discs are immersed in an environment rich in gas and neighbouring stars, which can be hostile for their survival. There are several environmental processes that affect the evolution of circumstellar discs, and external photoevaporation is arguably one of the most important ones. Theoretical and observational evidence point to circumstellar discs losing mass quickly when in the vicinity of massive, bright stars. In this work we simulate circumstellar discs in clustered environments in a range of stellar densities, where the photoevaporation mass-loss process is resolved simultaneously with the stellar dynamics, stellar evolution, and the viscous evolution of the discs. Our results indicate that external photoevaporation is efficient in depleting disc masses and that the degree of its effect is related to stellar density. We find that a local stellar density lower than 100 stars pc$^{-2}$ is necessary for discs massive enough to form planets to survive for 2.0 Myr. There is an order of magnitude difference in the disc masses in regions of projected density 100 stars pc$^{-2}$ versus $10^4$ stars pc$^{-2}$. We compare our results to observations of the Lupus clouds, the Orion Nebula Cluster, the Orion Molecular Cloud-2, Taurus, and NGC 2024, and find that the trends observed between region density and disc masses are similar to those in our simulations.
Submitted 20 November, 2020; v1 submitted 12 June, 2020;
originally announced June 2020.
-
Two-hand Global 3D Pose Estimation Using Monocular RGB
Authors:
Fanqing Lin,
Connor Wilhelm,
Tony Martinez
Abstract:
We tackle the challenging task of estimating global 3D joint locations for both hands via only monocular RGB input images. We propose a novel multi-stage convolutional neural network based pipeline that accurately segments and locates the hands despite occlusion between two hands and complex background noise and estimates the 2D and 3D canonical joint locations without any depth information. Global joint locations with respect to the camera origin are computed using the hand pose estimations and the actual length of the key bone with a novel projection algorithm. To train the CNNs for this new task, we introduce a large-scale synthetic 3D hand pose dataset. We demonstrate that our system outperforms previous works on 3D canonical hand pose estimation benchmark datasets with RGB-only information. Additionally, we present the first work that achieves accurate global 3D hand tracking on both hands using RGB-only inputs and provide extensive quantitative and qualitative evaluation.
Submitted 25 August, 2020; v1 submitted 1 June, 2020;
originally announced June 2020.
-
CORD-19: The COVID-19 Open Research Dataset
Authors:
Lucy Lu Wang,
Kyle Lo,
Yoganand Chandrasekhar,
Russell Reas,
Jiangjiang Yang,
Doug Burdick,
Darrin Eide,
Kathryn Funk,
Yannis Katsis,
Rodney Kinney,
Yunyao Li,
Ziyang Liu,
William Merrill,
Paul Mooney,
Dewey Murdick,
Devvret Rishi,
Jerry Sheehan,
Zhihong Shen,
Brandon Stilson,
Alex Wade,
Kuansan Wang,
Nancy Xin Ru Wang,
Chris Wilhelm,
Boya Xie,
Douglas Raymond, et al. (3 additional authors not shown)
Abstract:
The COVID-19 Open Research Dataset (CORD-19) is a growing resource of scientific papers on COVID-19 and related historical coronavirus research. CORD-19 is designed to facilitate the development of text mining and information retrieval systems over its rich collection of metadata and structured full text papers. Since its release, CORD-19 has been downloaded over 200K times and has served as the basis of many COVID-19 text mining and discovery systems. In this article, we describe the mechanics of dataset construction, highlighting challenges and key design decisions, provide an overview of how CORD-19 has been used, and describe several shared tasks built around the dataset. We hope this resource will continue to bring together the computing community, biomedical experts, and policy makers in the search for effective treatments and management policies for COVID-19.
Submitted 10 July, 2020; v1 submitted 22 April, 2020;
originally announced April 2020.
-
The Milky Way's bar structural properties from gravitational waves
Authors:
Martijn J. C. Wilhelm,
Valeriya Korol,
Elena M. Rossi,
Elena D'Onghia
Abstract:
The Laser Interferometer Space Antenna (LISA) will enable Galactic gravitational wave (GW) astronomy by individually resolving $ > 10^4$ signals from double white dwarf (DWD) binaries throughout the Milky Way. In this work we assess for the first time the potential of LISA data to map the Galactic stellar bar and spiral arms, since GWs are unaffected by stellar crowding and dust extinction unlike optical observations of the bulge region. To achieve this goal we combine a realistic population of Galactic DWDs with a high-resolution N-Body simulation of a galaxy in good agreement with the Milky Way. We then model GW signals from our synthetic DWD population and reconstruct the structure of the simulated Galaxy from mock LISA observations. Our results show that while the low signal contrast between the background disc and the spiral arms hampers our ability to characterise the spiral structure, the stellar bar will instead clearly appear in the GW map of the bulge. The bar length and bar width derived from these synthetic observations are underestimated, respectively within $1σ$ and at a level greater than $2σ$, but the resulting axis ratio agrees to well within $1σ$, while the viewing angle is recovered to within one degree. These are competitive constraints compared to those from electromagnetic tracers, and they are obtained with a completely independent method. We therefore foresee that the synergistic use of GWs and electromagnetic tracers will be a powerful strategy to map the bar and the bulge of the Milky Way.
Submitted 3 November, 2020; v1 submitted 24 March, 2020;
originally announced March 2020.
-
External photoevaporation of circumstellar disks constrains the timescale for planet formation
Authors:
Francisca Concha-Ramírez,
Martijn J. C. Wilhelm,
Simon Portegies Zwart,
Thomas J. Haworth
Abstract:
Planet-forming circumstellar disks are a fundamental part of the star formation process. Since stars form in a hierarchical fashion in groups of up to hundreds or thousands, the UV radiation environment that these disks are exposed to can vary in strength by at least six orders of magnitude. This radiation can limit the masses and sizes of the disks. Diversity in star forming environments can have long lasting effects in disk evolution and in the resulting planetary populations. We perform simulations to explore the evolution of circumstellar disks in young star clusters. We include viscous evolution, as well as the impact of dynamical encounters and external photoevaporation. We find that photoevaporation is an important process in destroying circumstellar disks: in regions of stellar density $\rho \sim 100\ \mathrm{M}_\odot\ \mathrm{pc}^{-3}$ around 80% of disks are destroyed before 2 Myr of cluster evolution. Our findings are in agreement with observed disk fractions in young star forming regions and support previous estimations that planet formation must start in timescales < 0.1 - 1 Myr.
Submitted 18 October, 2019; v1 submitted 8 July, 2019;
originally announced July 2019.
-
VPLanet: The Virtual Planet Simulator
Authors:
Rory Barnes,
Rodrigo Luger,
Russell Deitrick,
Peter Driscoll,
Thomas R. Quinn,
David P. Fleming,
Hayden Smotherman,
Diego V. McDonald,
Caitlyn Wilhelm,
Rodolfo Garcia,
Patrick Barth,
Benjamin Guyer,
Victoria S. Meadows,
Cecilia M. Bitz,
Pramod Gupta,
Shawn D. Domagal-Goldman,
John Armstrong
Abstract:
We describe a software package called VPLanet that simulates fundamental aspects of planetary system evolution over Gyr timescales, with a focus on investigating habitable worlds. In this initial release, eleven physics modules are included that model internal, atmospheric, rotational, orbital, stellar, and galactic processes. Many of these modules can be coupled simultaneously to simulate the evolution of terrestrial planets, gaseous planets, and stars. The code is validated by reproducing a selection of observations and past results. VPLanet is written in C and designed so that the user can choose the physics modules to apply to an individual object at runtime without recompiling, i.e., a single executable can simulate the diverse phenomena that are relevant to a wide range of planetary and stellar systems. This feature is enabled by matrices and vectors of function pointers that are dynamically allocated and populated based on user input. The speed and modularity of VPLanet enable large parameter sweeps and the versatility to add/remove physical phenomena to assess their importance. VPLanet is publicly available from a repository that contains extensive documentation, numerous examples, Python scripts for plotting and data management, and infrastructure for community input and future development.
Submitted 27 August, 2019; v1 submitted 15 May, 2019;
originally announced May 2019.
-
Nanowire lasers
Authors:
C. Couteau,
A. Larrue,
C. Wilhelm,
C. Soci
Abstract:
We review principles and trends in the use of semiconductor nanowires (NWs) as gain media for stimulated emission and lasing. Semiconductor nanowires have recently been widely studied for use in integrated optoelectronic devices, such as LEDs, solar cells, and transistors. Intensive research has also been conducted on the use of nanowires for sub-wavelength laser systems that take advantage of their quasi-one-dimensional nature, flexibility in material choice and combination, and intrinsic optoelectronic properties. First, we provide an overview on using quasi-one-dimensional nanowire systems to realize sub-wavelength lasers with efficient, directional, and low-threshold emission. We then describe the state-of-the-art for nanowire lasers in terms of materials, geometry, and wavelength tunability. Next, we present the basics of lasing in semiconductor nanowires, define the key parameters for stimulated emission, and introduce the properties of nanowires. We then review advanced nanowire laser designs from the literature. Finally, we present interesting perspectives for low-threshold nanoscale light sources and optical interconnects. We intend to illustrate the potential of nanolasers in many applications, such as nanophotonic devices that integrate electronics and photonics for next-generation optoelectronic devices. For instance, these building blocks for nanoscale photonics can be used for data storage and biomedical applications when coupled to on-chip characterization tools. These nanoscale monochromatic laser light sources promise breakthroughs in nanophotonics, as they can operate at room temperature, potentially be electrically driven, and yield a better understanding of intrinsic nanomaterial properties and surface state effects in low-dimensional semiconductor systems.
Submitted 5 September, 2018;
originally announced September 2018.
-
Forced- and Self-Rotation of Magnetic Nanorods Assembly at the Cell Membrane: A Biomagnetic Torsion Pendulum
Authors:
François Mazuel,
Samuel Mathieu,
Riccardo Di Corato,
Jean-Claude Bacri,
Thierry Meylheuc,
Teresa Pellegrino,
Myriam Reffay,
Claire Wilhelm
Abstract:
In order to give insights into how anisotropic nano-objects interact with living cell membranes, and possibly self-assemble, we designed magnetic nanorods with average size around 100 nm × 1 μm by assembling iron oxide nanocubes within a polymeric matrix under a magnetic field. We then explored the nano-bio interface at the cell membrane under the influence of a rotating magnetic field. We observed a complex structuration of the nanorods intertwined with the membranes. Unexpectedly, after a magnetic rotating stimulation, the resulting macrorods were able to rotate freely for multiple rotations, revealing the creation of a bio-magnetic torsion pendulum.
Submitted 24 July, 2018;
originally announced July 2018.
-
Ontology Alignment in the Biomedical Domain Using Entity Definitions and Context
Authors:
Lucy Lu Wang,
Chandra Bhagavatula,
Mark Neumann,
Kyle Lo,
Chris Wilhelm,
Waleed Ammar
Abstract:
Ontology alignment is the task of identifying semantically equivalent entities from two given ontologies. Different ontologies have different representations of the same entity, resulting in a need to de-duplicate entities when merging ontologies. We propose a method for enriching entities in an ontology with external definition and context information, and use this additional information for ontology alignment. We develop a neural architecture capable of encoding the additional information when available, and show that the addition of external data results in an F1-score of 0.69 on the Ontology Alignment Evaluation Initiative (OAEI) largebio SNOMED-NCI subtask, comparable with the entity-level matchers in a SOTA system.
Submitted 20 June, 2018;
originally announced June 2018.
-
Construction of the Literature Graph in Semantic Scholar
Authors:
Waleed Ammar,
Dirk Groeneveld,
Chandra Bhagavatula,
Iz Beltagy,
Miles Crawford,
Doug Downey,
Jason Dunkelberger,
Ahmed Elgohary,
Sergey Feldman,
Vu Ha,
Rodney Kinney,
Sebastian Kohlmeier,
Kyle Lo,
Tyler Murray,
Hsu-Han Ooi,
Matthew Peters,
Joanna Power,
Sam Skjonsberg,
Lucy Lu Wang,
Chris Wilhelm,
Zheng Yuan,
Madeleine van Zuylen,
Oren Etzioni
Abstract:
We describe a deployed scalable system for organizing published scientific literature into a heterogeneous graph to facilitate algorithmic manipulation and discovery. The resulting literature graph consists of more than 280M nodes, representing papers, authors, entities and various interactions between them (e.g., authorships, citations, entity mentions). We reduce literature graph construction into familiar NLP tasks (e.g., entity extraction and linking), point out research challenges due to differences from standard formulations of these tasks, and report empirical results for each task. The methods described in this paper are used to enable semantic features in www.semanticscholar.org
Submitted 6 May, 2018;
originally announced May 2018.
-
Exo-Milankovitch Cycles II: Climates of G-dwarf Planets in Dynamically Hot Systems
Authors:
Russell Deitrick,
Rory Barnes,
Cecilia Bitz,
David Fleming,
Benjamin Charnay,
Victoria Meadows,
Caitlyn Wilhelm,
John Armstrong,
Thomas R. Quinn
Abstract:
Using an energy balance model with ice sheets, we examine the climate response of an Earth-like planet orbiting a G dwarf star and experiencing large orbital and obliquity variations. We find that ice caps couple strongly to the orbital forcing, leading to extreme ice ages. In contrast with previous studies, we find that such exo-Milankovitch cycles tend to impair habitability by inducing snowball states within the habitable zone. The large amplitude changes in obliquity and eccentricity cause the ice edge, the lowest latitude extent of the ice caps, to become unstable and grow to the equator. We apply an analytical theory of the ice edge latitude to show that obliquity is the primary driver of the instability. The thermal inertia of the ice sheets and the spectral energy distribution of the G dwarf star increase the sensitivity of the model to triggering runaway glaciation. Finally, we apply a machine learning algorithm to demonstrate how this technique can be used to extend the power of climate models. This work illustrates the importance of orbital evolution for habitability in dynamically rich planetary systems. We emphasize that as potentially habitable planets are discovered around G dwarfs, we need to consider orbital dynamics.
Submitted 1 May, 2018;
originally announced May 2018.
-
Exo-Milankovitch Cycles I: Orbits and Rotation States
Authors:
Russell Deitrick,
Rory Barnes,
Thomas R. Quinn,
John Armstrong,
Benjamin Charnay,
Caitlyn Wilhelm
Abstract:
The obliquity of the Earth, which controls our seasons, varies by only ~2.5 degrees over ~40,000 years, and its eccentricity varies by only ~0.05 over 100,000 years. Nonetheless, these small variations influence Earth's ice ages. For exoplanets, however, variations can be significantly larger. Previous studies of the habitability of moonless Earth-like exoplanets have found that high obliquities, high eccentricities, and dynamical variations can extend the outer edge of the habitable zone by preventing runaway glaciation (snowball states). We expand upon these studies by exploring the orbital dynamics with a semi-analytic model that allows us to map broad regions of parameter space. We find that in general, the largest drivers of obliquity variations are secular spin-orbit resonances. We show how the obliquity varies in several test cases, including Kepler-62 f, across a wide range of orbital and spin parameters. These obliquity variations, alongside orbital variations, will have a dramatic impact on the climates of such planets.
Submitted 28 December, 2017;
originally announced December 2017.
-
Broadband tunable hybrid photonic crystal-nanowire light emitter
Authors:
Christophe E. Wilhelm,
M. Iqbal Bakti Utama,
Qihua Xiong,
Cesare Soci,
Gaëlle Lehoucq,
Daniel Dolfi,
Alfredo De Rossi,
Sylvain Combrié
Abstract:
We integrate about 100 single Cadmium Selenide semiconductor nanowires in self-standing Silicon Nitride photonic crystal cavities in a single processing run. Room temperature measurements reveal a single narrow emission linewidth, corresponding to a Q-factor as large as 5000. By varying the structural parameters of the photonic crystal, the peak wavelength is tuned, thereby covering the entire emission spectral range of the active material. A very large spectral range could be covered by heterogeneous integration of different active materials.
Submitted 25 September, 2015;
originally announced September 2015.