Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Authors:
Matt Deitke,
Christopher Clark,
Sangho Lee,
Rohun Tripathi,
Yue Yang,
Jae Sung Park,
Mohammadreza Salehi,
Niklas Muennighoff,
Kyle Lo,
Luca Soldaini,
Jiasen Lu,
Taira Anderson,
Erin Bransom,
Kiana Ehsani,
Huong Ngo,
YenSung Chen,
Ajay Patel,
Mark Yatskar,
Chris Callison-Burch,
Andrew Head,
Rose Hendrix,
Favyen Bastani,
Eli VanderBilt,
Nathan Lambert,
Yvonne Chou
, et al. (25 additional authors not shown)
Abstract:
Today's most advanced vision-language models (VLMs) remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed VLMs into open ones. As a result, the community has been missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs t…
▽ More
Today's most advanced vision-language models (VLMs) remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed VLMs into open ones. As a result, the community has been missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key contribution is a collection of new datasets called PixMo, including a dataset of highly detailed image captions for pre-training, a free-form image Q&A dataset for fine-tuning, and an innovative 2D pointing dataset, all collected without the use of external VLMs. The success of our approach relies on careful modeling choices, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets. Our best-in-class 72B model not only outperforms others in the class of open weight and data models, but also outperforms larger proprietary models including Claude 3.5 Sonnet, and Gemini 1.5 Pro and Flash, second only to GPT-4o based on both academic benchmarks and on a large human evaluation. Our model weights, new datasets, and source code are available at https://molmo.allenai.org/blog.
△ Less
Submitted 5 December, 2024; v1 submitted 25 September, 2024;
originally announced September 2024.
CHIME: LLM-Assisted Hierarchical Organization of Scientific Studies for Literature Review Support
Authors:
Chao-Chun Hsu,
Erin Bransom,
Jenna Sparks,
Bailey Kuehl,
Chenhao Tan,
David Wadden,
Lucy Lu Wang,
Aakanksha Naik
Abstract:
Literature review requires researchers to synthesize a large amount of information and is increasingly challenging as the scientific literature expands. In this work, we investigate the potential of LLMs for producing hierarchical organizations of scientific studies to assist researchers with literature review. We define hierarchical organizations as tree structures where nodes refer to topical ca…
▽ More
Literature review requires researchers to synthesize a large amount of information and is increasingly challenging as the scientific literature expands. In this work, we investigate the potential of LLMs for producing hierarchical organizations of scientific studies to assist researchers with literature review. We define hierarchical organizations as tree structures where nodes refer to topical categories and every node is linked to the studies assigned to that category. Our naive LLM-based pipeline for hierarchy generation from a set of studies produces promising yet imperfect hierarchies, motivating us to collect CHIME, an expert-curated dataset for this task focused on biomedicine. Given the challenging and time-consuming nature of building hierarchies from scratch, we use a human-in-the-loop process in which experts correct errors (both links between categories and study assignment) in LLM-generated hierarchies. CHIME contains 2,174 LLM-generated hierarchies covering 472 topics, and expert-corrected hierarchies for a subset of 100 topics. Expert corrections allow us to quantify LLM performance, and we find that while they are quite good at generating and organizing categories, their assignment of studies to categories could be improved. We attempt to train a corrector model with human feedback which improves study assignment by 12.6 F1 points. We release our dataset and models to encourage research on developing better assistive tools for literature review.
△ Less
Submitted 22 July, 2024;
originally announced July 2024.