-
Phi-4-reasoning Technical Report
Authors:
Marah Abdin,
Sahaj Agarwal,
Ahmed Awadallah,
Vidhisha Balachandran,
Harkirat Behl,
Lingjiao Chen,
Gustavo de Rosa,
Suriya Gunasekar,
Mojan Javaheripi,
Neel Joshi,
Piero Kauffmann,
Yash Lara,
Caio César Teodoro Mendes,
Arindam Mitra,
Besmira Nushi,
Dimitris Papailiopoulos,
Olli Saarikivi,
Shital Shah,
Vaishnavi Shrivastava,
Vibhav Vineet,
Yue Wu,
Safoora Yousefi,
Guoqing Zheng
Abstract:
We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on a carefully curated set of "teachable" prompts (selected for the right level of complexity and diversity) and reasoning demonstrations generated using o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectively leverage inference-time compute. We further develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning that offers higher performance by generating longer reasoning traces. Across a wide range of reasoning tasks, both models outperform significantly larger open-weight models such as the DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of the full DeepSeek-R1 model. Our comprehensive evaluations span benchmarks in math and scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding. Interestingly, we observe a non-trivial transfer of improvements to general-purpose benchmarks as well. In this report, we provide insights into our training data, our training methodologies, and our evaluations. We show that the benefit of careful data curation for supervised fine-tuning (SFT) extends to reasoning language models, and can be further amplified by reinforcement learning (RL). Finally, our evaluation points to opportunities for improving how we assess the performance and robustness of reasoning models.
Submitted 30 April, 2025;
originally announced April 2025.
-
Phi-4 Technical Report
Authors:
Marah Abdin,
Jyoti Aneja,
Harkirat Behl,
Sébastien Bubeck,
Ronen Eldan,
Suriya Gunasekar,
Michael Harrison,
Russell J. Hewett,
Mojan Javaheripi,
Piero Kauffmann,
James R. Lee,
Yin Tat Lee,
Yuanzhi Li,
Weishung Liu,
Caio C. T. Mendes,
Anh Nguyen,
Eric Price,
Gustavo de Rosa,
Olli Saarikivi,
Adil Salim,
Shital Shah,
Xin Wang,
Rachel Ward,
Yue Wu,
Dingli Yu
, et al. (2 additional authors not shown)
Abstract:
We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality. Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities, giving evidence that our data-generation and post-training techniques go beyond distillation. Despite minimal changes to the phi-3 architecture, phi-4 achieves strong performance relative to its size -- especially on reasoning-focused benchmarks -- due to improved data, training curriculum, and innovations in the post-training scheme.
Submitted 11 December, 2024;
originally announced December 2024.
-
On the Diversity of Synthetic Data and its Impact on Training Large Language Models
Authors:
Hao Chen,
Abdul Waheed,
Xiang Li,
Yidong Wang,
Jindong Wang,
Bhiksha Raj,
Marah I. Abdin
Abstract:
The rise of Large Language Models (LLMs) has accentuated the need for diverse, high-quality pre-training data. Synthetic data emerges as a viable solution to the challenges of data scarcity and inaccessibility. While previous literature has focused predominantly on the quality and quantity of real data, our work enables the measurement of diversity in synthetic data and explores its impact on LLM performance. We study the downstream effects of synthetic data diversity during both the pre-training and fine-tuning stages by introducing a new diversity metric, LLM cluster-agent, designed to evaluate the diversity of synthetic datasets. Through a series of controlled experiments with models of 350M and 1.4B parameters, we demonstrate that the proposed cluster-based LLM scoring of diversity correlates positively with both pre-training and supervised fine-tuning performance. Our findings also reveal that synthetic data diversity in pre-training affects supervised fine-tuning more significantly than pre-training itself, even for smaller models. We hope this study advances our understanding of the optimal use of synthetic data in LLM training and opens new avenues for efficient data generation processes.
Submitted 22 October, 2024; v1 submitted 19 October, 2024;
originally announced October 2024.
-
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Authors:
Marah Abdin,
Jyoti Aneja,
Hany Awadalla,
Ahmed Awadallah,
Ammar Ahmad Awan,
Nguyen Bach,
Amit Bahree,
Arash Bakhtiari,
Jianmin Bao,
Harkirat Behl,
Alon Benhaim,
Misha Bilenko,
Johan Bjorck,
Sébastien Bubeck,
Martin Cai,
Qin Cai,
Vishrav Chaudhary,
Dong Chen,
Dongdong Chen,
Weizhu Chen,
Yen-Chun Chen,
Yi-Ling Chen,
Hao Cheng,
Parul Chopra,
Xiyang Dai
, et al. (104 additional authors not shown)
Abstract:
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75%, 78% on MMLU, and 8.7, 8.9 on MT-bench). To enhance multilingual, multimodal, and long-context capabilities, we introduce three models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision. The phi-3.5-MoE, a 16 x 3.8B MoE model with 6.6 billion active parameters, achieves superior performance in language reasoning, math, and code tasks compared to other open-source models of similar scale, such as Llama 3.1 and the Mixtral series, and is on par with Gemini-1.5-Flash and GPT-4o-mini. Meanwhile, phi-3.5-Vision, a 4.2 billion parameter model derived from phi-3.5-mini, excels in reasoning tasks and is adept at handling both single-image and text prompts, as well as multi-image and text prompts.
Submitted 30 August, 2024; v1 submitted 22 April, 2024;
originally announced April 2024.
-
KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval
Authors:
Marah I Abdin,
Suriya Gunasekar,
Varun Chandrasekaran,
Jerry Li,
Mert Yuksekgonul,
Rahee Ghosh Peshawaria,
Ranjita Naik,
Besmira Nushi
Abstract:
We study the ability of state-of-the-art models to answer constraint satisfaction queries for information retrieval (e.g., 'a list of ice cream shops in San Diego'). In the past, such queries were considered to be tasks that could only be solved via web-search or knowledge bases. More recently, large language models (LLMs) have demonstrated initial emergent abilities in this task. However, many current retrieval benchmarks are either saturated or do not measure constraint satisfaction. Motivated by rising concerns around factual incorrectness and hallucinations of LLMs, we present KITAB, a new dataset for measuring constraint satisfaction abilities of language models. KITAB consists of book-related data across more than 600 authors and 13,000 queries, and also offers an associated dynamic data collection and constraint verification approach for acquiring similar test data for other authors. Our extended experiments on GPT4 and GPT3.5 characterize and decouple common failure modes across dimensions such as information popularity, constraint types, and context availability. Results show that in the absence of context, models exhibit severe limitations as measured by irrelevant information, factual errors, and incompleteness, many of which worsen as information popularity decreases. While context availability mitigates irrelevant information, it is not helpful for satisfying constraints, identifying fundamental barriers to constraint satisfaction. We open source our contributions to foster further research on improving constraint satisfaction abilities of future models.
Submitted 24 October, 2023;
originally announced October 2023.
-
Leveraging semantically similar queries for ranking via combining representations
Authors:
Hayden S. Helm,
Marah Abdin,
Benjamin D. Pedigo,
Shweti Mahajan,
Vince Lyzinski,
Youngser Park,
Amitabh Basu,
Piali Choudhury,
Christopher M. White,
Weiwei Yang,
Carey E. Priebe
Abstract:
In modern ranking problems, different and disparate representations of the items to be ranked are often available. It is sensible, then, to try to combine these representations to improve ranking. Indeed, learning to rank via combining representations is both principled and practical for learning a ranking function for a particular query. In extremely data-scarce settings, however, the small amount of labeled data available for a particular query can lead to a highly variable and ineffective ranking function. One way to mitigate the effect of the small amount of data is to leverage information from semantically similar queries. Indeed, as we demonstrate in simulation settings and real data examples, when semantically similar queries are available it is possible to gainfully use them when ranking with respect to a particular query. We describe and explore this phenomenon in the context of the bias-variance trade-off and apply it to the data-scarce settings of a Bing navigational graph and the Drosophila larva connectome.
Submitted 23 June, 2021;
originally announced June 2021.
-
Distance-based Positive and Unlabeled Learning for Ranking
Authors:
Hayden S. Helm,
Amitabh Basu,
Avanti Athreya,
Youngser Park,
Joshua T. Vogelstein,
Carey E. Priebe,
Michael Winding,
Marta Zlatic,
Albert Cardona,
Patrick Bourke,
Jonathan Larson,
Marah Abdin,
Piali Choudhury,
Weiwei Yang,
Christopher W. White
Abstract:
Learning to rank -- producing a ranked list of items specific to a query and with respect to a set of supervisory items -- is a problem of general interest. The setting we consider is one in which no analytic description of what constitutes a good ranking is available. Instead, we have a collection of representations and supervisory information consisting of a (target item, interesting items set) pair. We demonstrate analytically, in simulation, and in real data examples that learning to rank via combining representations using an integer linear program is effective when the supervision is as light as "these few items are similar to your item of interest." While this nomination task is quite general, for specificity we present our methodology from the perspective of vertex nomination in graphs. The methodology described herein is model agnostic.
Submitted 28 September, 2022; v1 submitted 19 May, 2020;
originally announced May 2020.