-
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Authors:
Gemini Team,
Petko Georgiev,
Ving Ian Lei,
Ryan Burnell,
Libin Bai,
Anmol Gulati,
Garrett Tanzer,
Damien Vincent,
Zhufeng Pan,
Shibo Wang,
Soroosh Mariooryad,
Yifan Ding,
Xinyang Geng,
Fred Alcober,
Roy Frostig,
Mark Omernick,
Lexi Walker,
Cosmin Paduraru,
Christina Sorokin,
Andrea Tacchetti,
Colin Gaffney,
Samira Daruki,
Olcan Sercinoglu,
Zach Gleicher,
Juliette Love,
et al. (1112 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
Submitted 16 December, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
HaptStarter: Designing Haptic Stimulus Start System for Deaf and Hard of Hearing Sprinters
Authors:
Akihisa Shitara,
Miki Namatame,
Sayan Sarcar,
Yoichi Ochiai,
Yuhki Shiraishi
Abstract:
In this study, we design and develop HaptStarter -- a haptic stimulus start system -- to improve the starting performance of deaf and hard of hearing (DHH) sprinters. A DHH person has physical ability nearly equivalent to that of a hearing person; however, difficulties in perceiving audio information lead to differences in their performance in sports.
Furthermore, the visual reaction time is slower than the auditory reaction time (ART), while the haptic reaction time is equivalent to it.
However, a light stimulus start system is increasingly being used in sprint races to aid DHH sprinters. In this study, we design a brand-new haptic stimulus start system for DHH sprinters; we also determine and leverage an optimum haptic stimulus interface. The proposed method has the potential to contribute toward the development of prototypes based on the universal design principle for everyone (DHH, blind and low-vision, and other disabled sprinters with wheelchairs or artificial arms or legs, etc.) by focusing on the overlapping area of sports and disability with human-computer interaction.
Submitted 7 November, 2023;
originally announced November 2023.
-
Inclusive AR/VR: Accessibility Barriers for Immersive Technologies
Authors:
Chris Creed,
Maadh Al-Kalbani,
Arthur Theil,
Sayan Sarcar,
Ian Williams
Abstract:
Augmented and virtual reality (AR/VR) hold significant potential to transform how we communicate, collaborate, and interact with others. However, there has been a lack of work to date investigating accessibility barriers in relation to immersive technologies for people with disabilities. To address current gaps in knowledge, we led two multidisciplinary Sandpits with key stakeholders (including academic researchers, AR/VR industry specialists, people with lived experience of disability, assistive technologists, and representatives from national charities and special needs colleges) to collaboratively explore and identify existing challenges with AR and VR experiences. We present key themes that emerged from Sandpit activities and map out the interaction barriers identified across a spectrum of impairments (including physical, cognitive, visual, and auditory disabilities). We conclude with recommendations for future work addressing the challenges highlighted to support the development of more inclusive AR and VR experiences.
Submitted 26 April, 2023;
originally announced April 2023.
-
SHITARA: Sending Haptic Induced Touchable Alarm by Ring-shaped Air vortex
Authors:
Ryosei Kojima,
Akihisa Shitara,
Tatsuki Fushimi,
Ryogo Niwa,
Atushi Shinoda,
Ryo Iijima,
Kengo Tanaka,
Sayan Sarcar,
Yoichi Ochiai
Abstract:
Social interaction begins with the other person's attention, but it is difficult for a d/Deaf or hard-of-hearing (DHH) person to notice the initial conversation cues. Wearable or visual devices have been proposed previously. However, these devices are cumbersome to wear or must stay within the DHH person's vision. In this study, we propose SHITARA, a novel accessibility method that uses air vortex rings to provide a non-contact haptic cue for a DHH person. We have developed a proof-of-concept device and determined the air vortex ring's accuracy, noticeability, and comfort when it hits a DHH person's hair. Although the strength, accuracy, and noticeability of air vortex rings decrease as the distance between the air vortex ring generator and the user increases, we have demonstrated that the air vortex ring is noticeable up to 2.5 meters away. Moreover, we find the optimum strength for each distance from the DHH person.
Submitted 7 November, 2023; v1 submitted 19 January, 2023;
originally announced January 2023.
-
Outline Objects using Deep Reinforcement Learning
Authors:
Zhenxin Wang,
Sayan Sarcar,
Jingxin Liu,
Yilin Zheng,
Xiangshi Ren
Abstract:
Image segmentation needs both local boundary position information and global object context information. The performance of the recent state-of-the-art method, fully convolutional networks, reaches a bottleneck due to the neural network limit after balancing between the two types of information simultaneously in an end-to-end training style. To overcome this problem, we divide semantic image segmentation into temporal subtasks. First, we find a possible pixel position on some object boundary; then we trace the boundary in steps of limited length until the whole object is outlined. We present the first deep reinforcement learning approach to semantic image segmentation, called DeepOutline, which outperforms other algorithms on the COCO detection leaderboard in the medium and large size person categories on the COCO val2017 dataset. It also provides insight into a divide-and-conquer approach to computer vision problems using reinforcement learning.
Submitted 20 April, 2018; v1 submitted 9 April, 2018;
originally announced April 2018.
-
Metrics for Bengali Text Entry Research
Authors:
Sayan Sarcar,
Ahmed Sabbir Arif,
Ali Mazalek
Abstract:
With the intention of bringing uniformity to Bengali text entry research, here we present a new approach for calculating the most popular English text entry evaluation metrics for Bengali. To demonstrate our approach, we conducted a user study where we evaluated four popular Bengali text entry techniques.
Submitted 25 June, 2017;
originally announced June 2017.
-
Usability Evaluation of Dwell-free Eye Typing Techniques
Authors:
Sayan Sarcar
Abstract:
Dwelling is an essential task performed to select keys from the on-screen keyboard of an eye typing interface. This selection is made by fixing eye gaze on a key for a prolonged time. Spending a sufficient amount of time on each key decreases the overall eye typing rate. To address the problem, researchers have proposed mechanisms that reduce the dwell time. We conducted a within-subject usability evaluation of four dwell-free eye typing techniques. The results of a first-time usability study, a longitudinal study, and a subjective evaluation conducted with 15 participants confirm the superiority of the controlled eye movement based advanced eye typing method (Adv-EyeK) over the other three techniques.
Submitted 24 January, 2016;
originally announced January 2016.
-
Quickpie: An Interface for Fast and Accurate Eye Gazed based Text Entry
Authors:
Pawan Patidar,
Himanshu Raghuvanshi,
Sayan Sarcar
Abstract:
Pie menus have been suggested as a powerful tool for eye gaze based text entry among the various interfaces developed so far. If pie menus are used with multiple depth layers, then multiple saccades are required per item selection, which is inefficient because it consumes more time. Also, the dwell time selection method is limited in performance because a higher dwell time suffers from inefficiency while a lower one suffers from inaccuracy. To overcome the problems with multiple depth layers and dwell time, we designed Quickpie, an interface for eye gaze based text entry with only one depth layer of pie menu and a selection border as the selection method instead of dwell time. We investigated various parameters, such as the number of slices in the pie menu, the width of the characters area and safe region, the enlarged angle of a slice, and the selection methods, to achieve better performance. Our experimental results indicate that six slices with a characters area width of 120 px perform better than the other designs.
Submitted 27 July, 2014;
originally announced July 2014.