Search | arXiv e-print repository

Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models

Authors: Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Tuomas Rintamaki, Tyler Poon, Max Ehrlich, Tuomas Rintamaki, Tyler Poon, Tong Lu, Limin Wang, Bryan Catanzaro, Jan Kautz, Andrew Tao, Zhiding Yu, Guilin Liu

Abstract: We introduce Eagle 2.5, a family of frontier vision-language models (VLMs) for long-context multimodal learning. Our work addresses the challenges in long video comprehension and high-resolution image understanding, introducing a generalist framework for both tasks. The proposed training framework incorporates Automatic Degrade Sampling and Image Area Preservation, two techniques that preserve con… ▽ More We introduce Eagle 2.5, a family of frontier vision-language models (VLMs) for long-context multimodal learning. Our work addresses the challenges in long video comprehension and high-resolution image understanding, introducing a generalist framework for both tasks. The proposed training framework incorporates Automatic Degrade Sampling and Image Area Preservation, two techniques that preserve contextual integrity and visual details. The framework also includes numerous efficiency optimizations in the pipeline for long-context data training. Finally, we propose Eagle-Video-110K, a novel dataset that integrates both story-level and clip-level annotations, facilitating long-video understanding. Eagle 2.5 demonstrates substantial improvements on long-context multimodal benchmarks, providing a robust solution to the limitations of existing VLMs. Notably, our best model Eagle 2.5-8B achieves 72.4% on Video-MME with 512 input frames, matching the results of top-tier commercial model such as GPT-4o and large-scale open-source models like Qwen2.5-VL-72B and InternVL2.5-78B. △ Less

Submitted 21 April, 2025; originally announced April 2025.

arXiv:2504.03624 [pdf, other]

Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models

Authors: NVIDIA, :, Aaron Blakeman, Aarti Basant, Abhinav Khattar, Adithya Renduchintala, Akhiad Bercovich, Aleksander Ficek, Alexis Bjorlin, Ali Taghibakhshi, Amala Sanjay Deshmukh, Ameya Sunil Mahabaleshwarkar, Andrew Tao, Anna Shors, Ashwath Aithal, Ashwin Poojary, Ayush Dattagupta, Balaram Buddharaju, Bobby Chen, Boris Ginsburg, Boxin Wang, Brandon Norick, Brian Butterfield, Bryan Catanzaro, Carlo del Mundo , et al. (176 additional authors not shown)

Abstract: As inference-time scaling becomes critical for enhanced reasoning capabilities, it is increasingly becoming important to build models that are efficient to infer. We introduce Nemotron-H, a family of 8B and 56B/47B hybrid Mamba-Transformer models designed to reduce inference cost for a given accuracy level. To achieve this goal, we replace the majority of self-attention layers in the common Transf… ▽ More As inference-time scaling becomes critical for enhanced reasoning capabilities, it is increasingly becoming important to build models that are efficient to infer. We introduce Nemotron-H, a family of 8B and 56B/47B hybrid Mamba-Transformer models designed to reduce inference cost for a given accuracy level. To achieve this goal, we replace the majority of self-attention layers in the common Transformer model architecture with Mamba layers that perform constant computation and require constant memory per generated token. We show that Nemotron-H models offer either better or on-par accuracy compared to other similarly-sized state-of-the-art open-sourced Transformer models (e.g., Qwen-2.5-7B/72B and Llama-3.1-8B/70B), while being up to 3$\times$ faster at inference. To further increase inference speed and reduce the memory required at inference time, we created Nemotron-H-47B-Base from the 56B model using a new compression via pruning and distillation technique called MiniPuzzle. Nemotron-H-47B-Base achieves similar accuracy to the 56B model, but is 20% faster to infer. In addition, we introduce an FP8-based training recipe and show that it can achieve on par results with BF16-based training. This recipe is used to train the 56B model. We are releasing Nemotron-H base model checkpoints with support in Hugging Face and NeMo. △ Less

Submitted 15 April, 2025; v1 submitted 4 April, 2025; originally announced April 2025.

arXiv:2501.14818 [pdf, other]

Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models

Authors: Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao, Subhashree Radhakrishnan, Nadine Chang, Karan Sapra, Amala Sanjay Deshmukh, Tuomas Rintamaki, Matthieu Le, Ilia Karmanov, Lukas Voegtle, Philipp Fischer, De-An Huang, Timo Roman, Tong Lu, Jose M. Alvarez, Bryan Catanzaro, Jan Kautz, Andrew Tao , et al. (2 additional authors not shown)

Abstract: Recently, promising progress has been made by open-source vision-language models (VLMs) in bringing their capabilities closer to those of proprietary frontier models. However, most open-source models only publish their final model weights, leaving the critical details of data strategies and implementation largely opaque. In this work, we address VLM post-training from a data-centric perspective, s… ▽ More Recently, promising progress has been made by open-source vision-language models (VLMs) in bringing their capabilities closer to those of proprietary frontier models. However, most open-source models only publish their final model weights, leaving the critical details of data strategies and implementation largely opaque. In this work, we address VLM post-training from a data-centric perspective, showing the key role of data strategy in developing frontier VLMs. By studying and building our post-training data strategy from scratch, we share detailed insights into the development processes, aiming to benefit the development of competitive models for the open-source community. Our introduced data strategy, together with training recipes and model design, leads to a family of performant VLMs named Eagle2. Specifically, Eagle2-9B achieves state-of-the-art results across various multimodal benchmarks, matching certain competitive models with up to 70B parameters. △ Less

Submitted 20 January, 2025; originally announced January 2025.

arXiv:2501.14756 [pdf, other]

Towards An Automated AI Act FRIA Tool That Can Reuse GDPR's DPIA

Authors: Tytti Rintamaki, Harshvardhan J. Pandit

Abstract: The AI Act introduces the obligation to conduct a Fundamental Rights Impact Assessment (FRIA), with the possibility to reuse a Data Protection Impact Assessment (DPIA), and requires the EU Commission to create of an automated tool to support the FRIA process. In this article, we provide our novel exploration of the DPIA and FRIA as information processes to enable the creation of automated tools. W… ▽ More The AI Act introduces the obligation to conduct a Fundamental Rights Impact Assessment (FRIA), with the possibility to reuse a Data Protection Impact Assessment (DPIA), and requires the EU Commission to create of an automated tool to support the FRIA process. In this article, we provide our novel exploration of the DPIA and FRIA as information processes to enable the creation of automated tools. We first investigate the information involved in DPIA and FRIA, and then use this to align the two to state where a DPIA can be reused in a FRIA. We then present the FRIA as a 5-step process and discuss the role of an automated tool for each step. Our work provides the necessary foundation for creating and managing information for FRIA and supporting it through an automated tool as required by the AI Act. △ Less

Submitted 23 December, 2024; originally announced January 2025.

Comments: Presented at CLAIRvoyant (ConventicLE on Artificial Intelligence Regulation) Workshop 2024

arXiv:2501.10391 [pdf, other]

Developing an Ontology for AI Act Fundamental Rights Impact Assessments

Authors: Tytti Rintamaki, Harshvardhan J. Pandit

Abstract: The recently published EU Artificial Intelligence Act (AI Act) is a landmark regulation that regulates the use of AI technologies. One of its novel requirements is the obligation to conduct a Fundamental Rights Impact Assessment (FRIA), where organisations in the role of deployers must assess the risks of their AI system regarding health, safety, and fundamental rights. Another novelty in the AI A… ▽ More The recently published EU Artificial Intelligence Act (AI Act) is a landmark regulation that regulates the use of AI technologies. One of its novel requirements is the obligation to conduct a Fundamental Rights Impact Assessment (FRIA), where organisations in the role of deployers must assess the risks of their AI system regarding health, safety, and fundamental rights. Another novelty in the AI Act is the requirement to create a questionnaire and an automated tool to support organisations in their FRIA obligations. Such automated tools will require a machine-readable form of information involved within the FRIA process, and additionally also require machine-readable documentation to enable further compliance tools to be created. In this article, we present our novel representation of the FRIA as an ontology based on semantic web standards. Our work builds upon the existing state of the art, notably the Data Privacy Vocabulary (DPV), where similar works have been established to create tools for GDPR's Data Protection Impact Assessments (DPIA) and other obligations. Through our ontology, we enable the creation and management of FRIA, and the use of automated tool in its various steps. △ Less

Submitted 19 December, 2024; originally announced January 2025.

Comments: Presented at CLAIRvoyant (ConventicLE on Artificial Intelligence Regulation) Workshop 2024

arXiv:2409.11402 [pdf, other]

NVLM: Open Frontier-Class Multimodal LLMs

Authors: Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

Abstract: We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model desi… ▽ More We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities. To advance research in the field, we release the model weights at https://huggingface.co/nvidia/NVLM-D-72B and will open-source the training code for the community soon. △ Less

Submitted 22 October, 2024; v1 submitted 17 September, 2024; originally announced September 2024.

Comments: Fixed the typos. For more information, please visit our project page at: https://research.nvidia.com/labs/adlr/NVLM-1

Showing 1–6 of 6 results for author: Rintamaki, T