Search | arXiv e-print repository

Phi-4-reasoning Technical Report

Authors: Marah Abdin, Sahaj Agarwal, Ahmed Awadallah, Vidhisha Balachandran, Harkirat Behl, Lingjiao Chen, Gustavo de Rosa, Suriya Gunasekar, Mojan Javaheripi, Neel Joshi, Piero Kauffmann, Yash Lara, Caio César Teodoro Mendes, Arindam Mitra, Besmira Nushi, Dimitris Papailiopoulos, Olli Saarikivi, Shital Shah, Vaishnavi Shrivastava, Vibhav Vineet, Yue Wu, Safoora Yousefi, Guoqing Zheng

Abstract: We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on a carefully curated set of "teachable" prompts (selected for the right level of complexity and diversity) and reasoning demonstrations generated using o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectively leverage inference-time compute. We further develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning that offers higher performance by generating longer reasoning traces. Across a wide range of reasoning tasks, both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B and approach the performance levels of the full DeepSeek-R1 model. Our comprehensive evaluations span benchmarks in math and scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding. Interestingly, we observe a non-trivial transfer of improvements to general-purpose benchmarks as well. In this report, we provide insights into our training data, our training methodologies, and our evaluations. We show that the benefit of careful data curation for supervised fine-tuning (SFT) extends to reasoning language models, and can be further amplified by reinforcement learning (RL). Finally, our evaluation points to opportunities for improving how we assess the performance and robustness of reasoning models.

Submitted 30 April, 2025; originally announced April 2025.

arXiv:2412.08905 [pdf, other]

Phi-4 Technical Report

Authors: Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu, et al. (2 additional authors not shown)

Abstract: We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality. Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities, giving evidence that our data-generation and post-training techniques go beyond distillation. Despite minimal changes to the phi-3 architecture, phi-4 achieves strong performance relative to its size -- especially on reasoning-focused benchmarks -- due to improved data, training curriculum, and innovations in the post-training scheme.

Submitted 11 December, 2024; originally announced December 2024.

arXiv:2404.14219 [pdf, other]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Authors: Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Weizhu Chen, Yen-Chun Chen, Yi-Ling Chen, Hao Cheng, Parul Chopra, Xiyang Dai, et al. (104 additional authors not shown)

Abstract: We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75%, 78% on MMLU, and 8.7, 8.9 on MT-bench). To enhance multilingual, multimodal, and long-context capabilities, we introduce three models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision. The phi-3.5-MoE, a 16 x 3.8B MoE model with 6.6 billion active parameters, achieves superior performance in language reasoning, math, and code tasks compared to other open-source models of similar scale, such as Llama 3.1 and the Mixtral series, and is on par with Gemini-1.5-Flash and GPT-4o-mini. Meanwhile, phi-3.5-Vision, a 4.2 billion parameter model derived from phi-3.5-mini, excels in reasoning tasks and is adept at handling both single-image and text prompts, as well as multi-image and text prompts.

Submitted 30 August, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

Comments: 24 pages

arXiv:2306.11644 [pdf, other]

Textbooks Are All You Need

Authors: Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, Yuanzhi Li

Abstract: We introduce phi-1, a new large language model for code, with significantly smaller size than competing models: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s, using a selection of "textbook quality" data from the web (6B tokens) and synthetically generated textbooks and exercises with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP. It also displays surprising emergent properties compared to phi-1-base, our model before our finetuning stage on a dataset of coding exercises, and phi-1-small, a smaller model with 350M parameters trained with the same pipeline as phi-1 that still achieves 45% on HumanEval.

Submitted 2 October, 2023; v1 submitted 20 June, 2023; originally announced June 2023.

Comments: 26 pages; changed color scheme of plot, fixed minor typos, and added a couple of clarifications

arXiv:1910.12125 [pdf]

Deep learning for subgrid-scale turbulence modeling in large-eddy simulations of the atmospheric boundary layer

Authors: Yu Cheng, Marco Giometto, Pit Kauffmann, Ling Lin, Chen Cao, Cody Zupnick, Harold Li, Qi Li, Ryan Abernathey, Pierre Gentine

Abstract: In large-eddy simulations, subgrid-scale (SGS) processes are parameterized as a function of filtered grid-scale variables. First-order, algebraic SGS models are based on the eddy-viscosity assumption, which does not always hold for turbulence. Here we apply supervised deep neural networks (DNNs) to learn SGS stresses from a set of neighboring coarse-grained velocity from direct numerical simulations (DNSs) of the atmospheric boundary layer at friction Reynolds numbers Re_τ up to 1243 without invoking the eddy-viscosity assumption. The DNN model was found to produce higher correlation of SGS stresses compared to the Smagorinsky model and the Smagorinsky-Bardina mixed model in the surface and mixed layers and can be applied to different grid resolutions and various stability conditions ranging from near neutral to very unstable. The additional information on potential temperature and pressure was found not to be useful for SGS modeling. Deep learning thus demonstrates great potential for LESs of geophysical turbulence.

Submitted 26 October, 2019; originally announced October 2019.

Comments: 33 pages, 11 figures, 3 tables

arXiv:1812.03963 [pdf, other]

doi 10.3847/1538-3881/aaf88f

Cloud Atlas: Hubble Space Telescope Near-Infrared Spectral Library of Brown Dwarfs, Planetary-mass companions, and hot Jupiters

Authors: Elena Manjavacas, Daniel Apai, Yifan Zhou, Ben W. P. Lew, Glenn Schneider, Stan Metchev, Paulo A. Miles-Paez, Jacqueline Radigan, Mark S. Marley, Nicolas Cowan, Theodora Karalidi, Adam J. Burgasser, Luigi R. Bedin, Patrick J. Lowrance, Parker Kauffmann

Abstract: Bayesian atmospheric retrieval tools can place constraints on the properties of brown dwarf and hot Jupiter atmospheres. To fully exploit these methods, high signal-to-noise spectral libraries with well-understood uncertainties are essential. We present a high signal-to-noise spectral library (1.10-1.69 microns) of the thermal emission of 76 brown dwarfs and hot Jupiters. All our spectra have been acquired with the Hubble Space Telescope's Wide Field Camera 3 instrument and its G141 grism. The near-infrared spectral types of these objects range from L4 to Y1. Eight of our targets have estimated masses below the deuterium-burning limit. We analyze the database to identify peculiar objects and/or multiple systems, concluding that this sample includes two very-low-surface-gravity objects and five intermediate-surface-gravity objects. In addition, spectral indices designed to search for composite-atmosphere brown dwarfs indicate that eight objects in our sample are strong candidates to have such atmospheres. None of these objects are overluminous, so their composite atmospheres are unlikely to be a companion-induced artifact. Five of the eight confirmed candidates have been reported as photometrically variable, suggesting that composite atmospheric indices are useful in identifying brown dwarfs with strongly heterogeneous cloud covers. We compare hot Jupiters and brown dwarfs in a near-infrared color-magnitude diagram. We confirm that the coldest hot Jupiters in our sample have spectra similar to mid-L dwarfs, and the hottest hot Jupiters have spectra similar to those of M-dwarfs. Our sample provides a uniform dataset of a broad range of ultracool atmospheres, allowing large-scale, comparative studies, and providing an HST legacy spectral library.

Submitted 10 December, 2018; originally announced December 2018.

Showing 1–6 of 6 results for author: Kauffmann, P