-
Apple Intelligence Foundation Language Models: Tech Report 2025
Authors:
Ethan Li,
Anders Boesen Lindbo Larsen,
Chen Zhang,
Xiyou Zhou,
Jun Qin,
Dian Ang Yap,
Narendran Raghavan,
Xuankai Chang,
Margit Bowler,
Eray Yildiz,
John Peebles,
Hannah Gillis Coleman,
Matteo Ronchi,
Peter Gray,
Keen You,
Anthony Spalvieri-Kruse,
Ruoming Pang,
Reed Li,
Yuli Yang,
Emad Soroush,
Zhiyun Lu,
Crystal Xiao,
Rong Situ,
Jordan Huffaker,
David Griffiths, et al. (373 additional authors not shown)
Abstract:
We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: (i) a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and (ii) a scalable server model built on a novel Parallel-Track Mixture-of-Experts (PT-MoE) transformer that combines track parallelism, mixture-of-experts sparse computation, and interleaved global-local attention to deliver high quality with competitive cost on Apple's Private Cloud Compute platform. Both models are trained on large-scale multilingual and multimodal datasets sourced via responsible web crawling, licensed corpora, and high-quality synthetic data, then further refined with supervised fine-tuning and reinforcement learning on a new asynchronous platform. The resulting models support several additional languages while understanding images and executing tool calls. In public benchmarks and human evaluations, both the server model and the on-device model match or surpass comparably sized open baselines.
A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning, allowing developers to integrate these capabilities with a few lines of code. The latest advancements in Apple Intelligence models are grounded in our Responsible AI approach with safeguards like content filtering and locale-specific evaluation, as well as our commitment to protecting our users' privacy with innovations like Private Cloud Compute.
Submitted 27 August, 2025; v1 submitted 17 July, 2025;
originally announced July 2025.
-
A Novel Probabilistic V2X Data Fusion Framework for Cooperative Perception
Authors:
Mao Shan,
Karan Narula,
Stewart Worrall,
Yung Fei Wong,
Julie Stephany Berrio Perez,
Paul Gray,
Eduardo Nebot
Abstract:
The paper addresses vehicle-to-X (V2X) data fusion for cooperative or collective perception (CP). This emerging and promising intelligent transportation systems (ITS) technology has enormous potential for improving the efficiency and safety of road transportation. Recent advances in V2X communication primarily address the definition of V2X messages and data dissemination amongst ITS stations (ITS-Ss) in a traffic environment. Yet, a largely unsolved problem is how a connected vehicle (CV) can efficiently and consistently fuse its local perception information with the data received from other ITS-Ss. In this paper, we present a novel data fusion framework to fuse the local and V2X perception data for CP that considers the presence of cross-correlation. The proposed approach is validated through comprehensive results obtained from numerical simulation, CARLA simulation, and real-world experimentation that incorporates V2X-enabled intelligent platforms. The real-world experiment includes a CV, a connected and automated vehicle (CAV), and an intelligent roadside unit (IRSU) retrofitted with vision and lidar sensors. We also demonstrate how the fused CP information can improve the awareness of vulnerable road users (VRUs) for CV/CAV, and how this information can be considered in path planning/decision making within the CAV to facilitate safe interactions.
Submitted 31 March, 2022;
originally announced March 2022.
-
From Note-Level to Chord-Level Neural Network Models for Voice Separation in Symbolic Music
Authors:
Patrick Gray,
Razvan Bunescu
Abstract:
Music is often experienced as a progression of concurrent streams of notes, or voices. The degree to which this happens depends on the position along a voice-leading continuum, ranging from monophonic, to homophonic, to polyphonic, which complicates the design of automatic voice separation models. We address this continuum by defining voice separation as the task of decomposing music into streams that exhibit both a high degree of external perceptual separation from the other streams and a high degree of internal perceptual consistency. The proposed voice separation task allows for a voice to diverge to multiple voices and also for multiple voices to converge to the same voice. Equipped with this flexible task definition, we manually annotated a corpus of popular music and used it to train neural networks that assign notes to voices either separately for each note in a chord (note-level), or jointly to all notes in a chord (chord-level). The trained neural models greedily assign notes to voices in a left-to-right traversal of the input chord sequence, using a diverse set of perceptually informed input features. When evaluated on the extraction of consecutive within-voice note pairs, both models surpass a strong baseline based on an iterative application of an envelope extraction function, with the chord-level model consistently edging out the note-level model. The two models are also shown to outperform previous approaches on separating the voices in Bach music.
Submitted 5 November, 2020;
originally announced November 2020.
-
Resilient In-Season Crop Type Classification in Multispectral Satellite Observations using Growth Stage Normalization
Authors:
Hannah Kerner,
Ritvik Sahajpal,
Sergii Skakun,
Inbal Becker-Reshef,
Brian Barker,
Mehdi Hosseini,
Estefania Puricelli,
Patrick Gray
Abstract:
Crop type classification using satellite observations is an important tool for providing insights about planted area and enabling estimates of crop condition and yield, especially within the growing season when uncertainties around these quantities are highest. As the climate changes and extreme weather events become more frequent, these methods must be resilient to domain shifts that may occur, for example, due to changes in planting timelines. In this work, we present an approach for within-season crop type classification using moderate spatial resolution (30 m) satellite data that addresses domain shift related to planting timelines by normalizing inputs by crop growth stage. We use a neural network leveraging both convolutional and recurrent layers to predict if a pixel contains corn, soybeans, or another crop or land cover type. We evaluated this method for the 2019 growing season in the midwestern US, during which planting was delayed by as much as 1-2 months due to extreme weather that caused record flooding. We show that our approach using growth stage-normalized time series outperforms fixed-date time series, and achieves overall classification accuracy of 85.4% prior to harvest (September-November) and 82.8% by mid-season (July-September).
Submitted 21 September, 2020;
originally announced September 2020.