-
jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval
Authors:
Michael Günther,
Saba Sturua,
Mohammad Kalim Akram,
Isabelle Mohr,
Andrei Ungureanu,
Bo Wang,
Sedigheh Eslami,
Scott Martens,
Maximilian Werk,
Nan Wang,
Han Xiao
Abstract:
We introduce jina-embeddings-v4, a 3.8 billion parameter multimodal embedding model that unifies text and image representations through a novel architecture supporting both single-vector and multi-vector embeddings in the late interaction style. The model incorporates task-specific Low-Rank Adaptation (LoRA) adapters to optimize performance across diverse retrieval scenarios, including query-document retrieval, semantic text similarity, and code search. Comprehensive evaluations demonstrate that jina-embeddings-v4 achieves state-of-the-art performance on both single-modal and cross-modal retrieval tasks, with particular strength in processing visually rich content such as tables, charts, diagrams, and mixed-media formats. To facilitate evaluation of this capability, we also introduce Jina-VDR, a novel benchmark specifically designed for visually rich image retrieval.
Submitted 7 July, 2025; v1 submitted 23 June, 2025;
originally announced June 2025.
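The abstract above distinguishes single-vector embeddings from multi-vector embeddings "in the late interaction style". The following is a minimal sketch of how the two scoring schemes differ. It is not the model's actual implementation: the function names, the toy 2-dimensional vectors, the use of mean pooling for the single vector, and the ColBERT-style MaxSim operator for late interaction are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pool(vectors):
    """Collapse a list of token vectors into one vector (assumed pooling scheme)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def single_vector_score(query_tokens, doc_tokens):
    """Single-vector scoring: one pooled vector per side, one cosine similarity."""
    return cosine(mean_pool(query_tokens), mean_pool(doc_tokens))

def late_interaction_score(query_tokens, doc_tokens):
    """Multi-vector scoring (assumed MaxSim): for each query token vector,
    take its best match among the document token vectors, then sum."""
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)

# Hypothetical toy token embeddings, not produced by any real model.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]              # covers only the first query token

if __name__ == "__main__":
    for name, doc in (("doc_a", doc_a), ("doc_b", doc_b)):
        print(name,
              round(single_vector_score(query, doc), 4),
              round(late_interaction_score(query, doc), 4))
```

In this sketch the late interaction score rewards a document for matching each query token individually, whereas the single-vector score compares only the pooled summaries. The multi-vector variant keeps one vector per token, so it costs more storage per document.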
-
Jina CLIP: Your CLIP Model Is Also Your Text Retriever
Authors:
Andreas Koukounas,
Georgios Mastrapas,
Michael Günther,
Bo Wang,
Scott Martens,
Isabelle Mohr,
Saba Sturua,
Mohammad Kalim Akram,
Joan Fontanals Martínez,
Saahil Ognawala,
Susana Guzman,
Maximilian Werk,
Nan Wang,
Han Xiao
Abstract:
Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve state-of-the-art performance on both text-image and text-text retrieval tasks.
Submitted 26 June, 2024; v1 submitted 30 May, 2024;
originally announced May 2024.
-
Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings
Authors:
Isabelle Mohr,
Markus Krimmel,
Saba Sturua,
Mohammad Kalim Akram,
Andreas Koukounas,
Michael Günther,
Georgios Mastrapas,
Vinit Ravishankar,
Joan Fontanals Martínez,
Feng Wang,
Qi Liu,
Ziniu Yu,
Jie Fu,
Saahil Ognawala,
Susana Guzman,
Bo Wang,
Maximilian Werk,
Nan Wang,
Han Xiao
Abstract:
We introduce a novel suite of state-of-the-art bilingual text embedding models that are designed to support English and another target language. These models are capable of processing lengthy text inputs with up to 8192 tokens, making them highly versatile for a range of natural language processing tasks such as text retrieval, clustering, and semantic textual similarity (STS) calculations.
By focusing on bilingual models and introducing a unique multi-task learning objective, we have significantly improved model performance on STS tasks, outperforming existing multilingual models in both target-language understanding and cross-lingual evaluation tasks. Moreover, our bilingual models are more efficient, requiring fewer parameters and less memory due to their smaller vocabulary needs. Furthermore, we have expanded the Massive Text Embedding Benchmark (MTEB) to include benchmarks for German and Spanish embedding models. This integration aims to stimulate further research and advancement in text embedding technologies for these languages.
Submitted 26 February, 2024;
originally announced February 2024.
-
Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents
Authors:
Michael Günther,
Jackmin Ong,
Isabelle Mohr,
Alaeddine Abdessalem,
Tanguy Abel,
Mohammad Kalim Akram,
Susana Guzman,
Georgios Mastrapas,
Saba Sturua,
Bo Wang,
Maximilian Werk,
Nan Wang,
Han Xiao
Abstract:
Text embedding models have emerged as powerful tools for transforming sentences into fixed-sized feature vectors that encapsulate semantic information. While these models are essential for tasks like information retrieval, semantic clustering, and text re-ranking, most existing open-source models, especially those built on architectures like BERT, struggle to represent lengthy documents and often resort to truncation. One common approach to mitigate this challenge involves splitting documents into smaller paragraphs for embedding. However, this strategy results in a much larger set of vectors, consequently leading to increased memory consumption and computationally intensive vector searches with elevated latency.
To address these challenges, we introduce Jina Embeddings 2, an open-source text embedding model capable of accommodating up to 8192 tokens. This model is designed to transcend the conventional 512-token limit and adeptly process long documents. Jina Embeddings 2 not only achieves state-of-the-art performance on a range of embedding-related tasks in the MTEB benchmark but also matches the performance of OpenAI's proprietary ada-002 model. Additionally, our experiments indicate that an extended context can enhance performance in tasks such as NarrativeQA.
Submitted 4 February, 2024; v1 submitted 30 October, 2023;
originally announced October 2023.