
Showing 1–4 of 4 results for author: Kaplan, D Z

Searching in archive cs.
  1. arXiv:2501.09672  [pdf, other]

    cs.CV cs.AI

    Robin: a Suite of Multi-Scale Vision-Language Models and the CHIRP Evaluation Benchmark

    Authors: Alexis Roger, Prateek Humane, Daniel Z. Kaplan, Kshitij Gupta, Qi Sun, George Adamopoulos, Jonathan Siu Chi Lim, Quentin Anthony, Edwin Fennell, Irina Rish

    Abstract: The proliferation of Vision-Language Models (VLMs) in the past several years calls for rigorous and comprehensive evaluation methods and benchmarks. This work analyzes existing VLM evaluation techniques, including automated metrics, AI-based assessments, and human evaluations across diverse tasks. We first introduce Robin - a novel suite of VLMs that we built by combining Large Language Models (LL…

    Submitted 20 January, 2025; v1 submitted 16 January, 2025; originally announced January 2025.

  2. arXiv:2408.07851  [pdf, other]

    cs.CL cs.AI

    SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition

    Authors: Mohamed Osman, Daniel Z. Kaplan, Tamer Nadeem

    Abstract: Speech emotion recognition (SER) has made significant strides with the advent of powerful self-supervised learning (SSL) models. However, the generalization of these models to diverse languages and emotional expressions remains a challenge. We propose a large-scale benchmark to evaluate the robustness and adaptability of state-of-the-art SER models in both in-domain and out-of-domain settings. Our…

    Submitted 14 August, 2024; originally announced August 2024.

    Comments: Accepted at INTERSPEECH 2024

  3. arXiv:2407.11121  [pdf, other]

    cs.CV cs.AI cs.LG

    Towards Adversarially Robust Vision-Language Models: Insights from Design Choices and Prompt Formatting Techniques

    Authors: Rishika Bhagwatkar, Shravan Nayak, Reza Bayat, Alexis Roger, Daniel Z Kaplan, Pouya Bashivan, Irina Rish

    Abstract: Vision-Language Models (VLMs) have witnessed a surge in both research and real-world applications. However, as they are becoming increasingly prevalent, ensuring their robustness against adversarial attacks is paramount. This work systematically investigates the impact of model design choices on the adversarial robustness of VLMs against image-based attacks. Additionally, we introduce novel, cost-…

    Submitted 15 July, 2024; originally announced July 2024.

  4. arXiv:2401.11605  [pdf, other]

    cs.CV cs.AI cs.LG

    Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers

    Authors: Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z. Kaplan, Enrico Shippole

    Abstract: We present the Hourglass Diffusion Transformer (HDiT), an image generative model that exhibits linear scaling with pixel count, supporting training at high-resolution (e.g. $1024 \times 1024$) directly in pixel-space. Building on the Transformer architecture, which is known to scale to billions of parameters, it bridges the gap between the efficiency of convolutional U-Nets and the scalability of…

    Submitted 21 January, 2024; originally announced January 2024.

    Comments: 20 pages, 13 figures, project page and code available at https://crowsonkb.github.io/hourglass-diffusion-transformers/