-
CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation
Authors:
Reza Abbasi,
Ali Nazari,
Aminreza Sefid,
Mohammadali Banayeeanzade,
Mohammad Hossein Rohban,
Mahdieh Soleymani Baghshah
Abstract:
Contrastive Language-Image Pre-training (CLIP) models excel in zero-shot classification, yet face challenges in complex multi-object scenarios. This study offers a comprehensive analysis of CLIP's limitations in these contexts using a specialized dataset, ComCO, designed to evaluate CLIP's encoders in diverse multi-object scenarios. Our findings reveal significant biases: the text encoder prioriti…
▽ More
Contrastive Language-Image Pre-training (CLIP) models excel in zero-shot classification, yet face challenges in complex multi-object scenarios. This study offers a comprehensive analysis of CLIP's limitations in these contexts using a specialized dataset, ComCO, designed to evaluate CLIP's encoders in diverse multi-object scenarios. Our findings reveal significant biases: the text encoder prioritizes first-mentioned objects, and the image encoder favors larger objects. Through retrieval and classification tasks, we quantify these biases across multiple CLIP variants and trace their origins to CLIP's training process, supported by analyses of the LAION dataset and training progression. Our image-text matching experiments show substantial performance drops when object size or token order changes, underscoring CLIP's instability with rephrased but semantically similar captions. Extending this to longer captions and text-to-image models like Stable Diffusion, we demonstrate how prompt order influences object prominence in generated images. For more details and access to our dataset and analysis code, visit our project repository: https://clip-oscope.github.io.
△ Less
Submitted 28 February, 2025; v1 submitted 27 February, 2025;
originally announced February 2025.
-
Analyzing CLIP's Performance Limitations in Multi-Object Scenarios: A Controlled High-Resolution Study
Authors:
Reza Abbasi,
Ali Nazari,
Aminreza Sefid,
Mohammadali Banayeeanzade,
Mohammad Hossein Rohban,
Mahdieh Soleymani Baghshah
Abstract:
Contrastive Language-Image Pre-training (CLIP) models have demonstrated remarkable performance in zero-shot classification tasks, yet their efficacy in handling complex multi-object scenarios remains challenging. This study presents a comprehensive analysis of CLIP's performance limitations in multi-object contexts through controlled experiments. We introduce two custom datasets, SimCO and CompCO,…
▽ More
Contrastive Language-Image Pre-training (CLIP) models have demonstrated remarkable performance in zero-shot classification tasks, yet their efficacy in handling complex multi-object scenarios remains challenging. This study presents a comprehensive analysis of CLIP's performance limitations in multi-object contexts through controlled experiments. We introduce two custom datasets, SimCO and CompCO, to evaluate CLIP's image and text encoders in various multi-object configurations. Our findings reveal significant biases in both encoders: the image encoder favors larger objects, while the text encoder prioritizes objects mentioned first in descriptions. We hypothesize these biases originate from CLIP's training process and provide evidence through analyses of the COCO dataset and CLIP's training progression. Additionally, we extend our investigation to Stable Diffusion models, revealing that biases in the CLIP text encoder significantly impact text-to-image generation tasks. Our experiments demonstrate how these biases affect CLIP's performance in image-caption matching and generation tasks, particularly when manipulating object sizes and their order in captions. This work contributes valuable insights into CLIP's behavior in complex visual environments and highlights areas for improvement in future vision-language models.
△ Less
Submitted 27 February, 2025;
originally announced February 2025.
-
SciBERTSUM: Extractive Summarization for Scientific Documents
Authors:
Athar Sefid,
C Lee Giles
Abstract:
The summarization literature focuses on the summarization of news articles. The news articles in the CNN-DailyMail are relatively short documents with about 30 sentences per document on average. We introduce SciBERTSUM, our summarization framework designed for the summarization of long documents like scientific papers with more than 500 sentences. SciBERTSUM extends BERTSUM to long documents by 1)…
▽ More
The summarization literature focuses on the summarization of news articles. The news articles in the CNN-DailyMail are relatively short documents with about 30 sentences per document on average. We introduce SciBERTSUM, our summarization framework designed for the summarization of long documents like scientific papers with more than 500 sentences. SciBERTSUM extends BERTSUM to long documents by 1) adding a section embedding layer to include section information in the sentence vector and 2) applying a sparse attention mechanism where each sentences will attend locally to nearby sentences and only a small number of sentences attend globally to all other sentences. We used slides generated by the authors of scientific papers as reference summaries since they contain the technical details from the paper. The results show the superiority of our model in terms of ROUGE scores.
△ Less
Submitted 20 January, 2022;
originally announced January 2022.
-
Extractive Research Slide Generation Using Windowed Labeling Ranking
Authors:
Athar Sefid,
Jian Wu,
Prasenjit Mitra,
Lee Giles
Abstract:
Presentation slides describing the content of scientific and technical papers are an efficient and effective way to present that work. However, manually generating presentation slides is labor intensive. We propose a method to automatically generate slides for scientific papers based on a corpus of 5000 paper-slide pairs compiled from conference proceedings websites. The sentence labeling module o…
▽ More
Presentation slides describing the content of scientific and technical papers are an efficient and effective way to present that work. However, manually generating presentation slides is labor intensive. We propose a method to automatically generate slides for scientific papers based on a corpus of 5000 paper-slide pairs compiled from conference proceedings websites. The sentence labeling module of our method is based on SummaRuNNer, a neural sequence model for extractive summarization. Instead of ranking sentences based on semantic similarities in the whole document, our algorithm measures importance and novelty of sentences by combining semantic and lexical features within a sentence window. Our method outperforms several baseline methods including SummaRuNNer by a significant margin in terms of ROUGE score.
△ Less
Submitted 6 June, 2021;
originally announced June 2021.
-
Extractive Summarizer for Scholarly Articles
Authors:
Athar Sefid,
Clyde Lee Giles,
Prasenjit Mitra
Abstract:
We introduce an extractive method that will summarize long scientific papers. Our model uses presentation slides provided by the authors of the papers as the gold summary standard to label the sentences. The sentences are ranked based on their novelty and their importance as estimated by deep neural networks. Our window-based extractive labeling of sentences results in the improvement of at least…
▽ More
We introduce an extractive method that will summarize long scientific papers. Our model uses presentation slides provided by the authors of the papers as the gold summary standard to label the sentences. The sentences are ranked based on their novelty and their importance as estimated by deep neural networks. Our window-based extractive labeling of sentences results in the improvement of at least 4 ROUGE1-Recall points.
△ Less
Submitted 25 August, 2020;
originally announced August 2020.
-
Cleaning Noisy and Heterogeneous Metadata for Record Linking Across Scholarly Big Datasets
Authors:
Athar Sefid,
Jian Wu,
Allen C. Ge,
Jing Zhao,
Lu Liu,
Cornelia Caragea,
Prasenjit Mitra,
C. Lee Giles
Abstract:
Automatically extracted metadata from scholarly documents in PDF formats is usually noisy and heterogeneous, often containing incomplete fields and erroneous values. One common way of cleaning metadata is to use a bibliographic reference dataset. The challenge is to match records between corpora with high precision. The existing solution which is based on information retrieval and string similarit…
▽ More
Automatically extracted metadata from scholarly documents in PDF formats is usually noisy and heterogeneous, often containing incomplete fields and erroneous values. One common way of cleaning metadata is to use a bibliographic reference dataset. The challenge is to match records between corpora with high precision. The existing solution which is based on information retrieval and string similarity on titles works well only if the titles are cleaned. We introduce a system designed to match scholarly document entities with noisy metadata against a reference dataset. The blocking function uses the classic BM25 algorithm to find the matching candidates from the reference data that has been indexed by ElasticSearch. The core components use supervised methods which combine features extracted from all available metadata fields. The system also leverages available citation information to match entities. The combination of metadata and citation achieves high accuracy that significantly outperforms the baseline method on the same test dataset. We apply this system to match the database of CiteSeerX against Web of Science, PubMed, and DBLP. This method will be deployed in the CiteSeerX system to clean metadata and link records to other scholarly big datasets.
△ Less
Submitted 20 June, 2019;
originally announced June 2019.
-
Scaling Author Name Disambiguation with CNF Blocking
Authors:
Kunho Kim,
Athar Sefid,
C. Lee Giles
Abstract:
An author name disambiguation (AND) algorithm identifies a unique author entity record from all similar or same publication records in scholarly or similar databases. Typically, a clustering method is used that requires calculation of similarities between each possible record pair. However, the total number of pairs grows quadratically with the size of the author database making such clustering di…
▽ More
An author name disambiguation (AND) algorithm identifies a unique author entity record from all similar or same publication records in scholarly or similar databases. Typically, a clustering method is used that requires calculation of similarities between each possible record pair. However, the total number of pairs grows quadratically with the size of the author database making such clustering difficult for millions of records. One remedy for this is a blocking function that reduces the number of pairwise similarity calculations. Here, we introduce a new way of learning blocking schemes by using a conjunctive normal form (CNF) in contrast to the disjunctive normal form (DNF). We demonstrate on PubMed author records that CNF blocking reduces more pairs while preserving high pairs completeness compared to the previous methods that use a DNF with the computation time significantly reduced. Thus, these concepts in scholarly data can be better represented with CNFs. Moreover, we also show how to ensure that the method produces disjoint blocks so that the rest of the AND algorithm can be easily paralleled. Our CNF blocking tested on the entire PubMed database of 80 million author mentions efficiently removes 82.17% of all author record pairs in 10 minutes.
△ Less
Submitted 27 September, 2017;
originally announced September 2017.