
Showing 1–11 of 11 results for author: Sapienza, M

Searching in archive cs.
  1. arXiv:2405.06694  [pdf, other]

    cs.CL cs.AI

    SUTRA: Scalable Multilingual Language Model Architecture

    Authors: Abhijit Bendale, Michael Sapienza, Steven Ripplinger, Simon Gibbs, Jaewon Lee, Pranav Mistry

    Abstract: In this paper, we introduce SUTRA, a multilingual Large Language Model architecture capable of understanding, reasoning, and generating text in over 50 languages. SUTRA's design uniquely decouples core conceptual understanding from language-specific processing, which facilitates scalable and efficient multilingual alignment and learning. Employing a Mixture of Experts framework both in language and…

    Submitted 7 May, 2024; originally announced May 2024.

  2. arXiv:1905.11358  [pdf, other]

    cs.CV cs.AI cs.LG

    Straight to Shapes++: Real-time Instance Segmentation Made More Accurate

    Authors: Laurynas Miksys, Saumya Jetley, Michael Sapienza, Stuart Golodetz, Philip H. S. Torr

    Abstract: Instance segmentation is an important problem in computer vision, with applications in autonomous driving, drone navigation and robotic manipulation. However, most existing methods are not real-time, complicating their deployment in time-sensitive contexts. In this work, we extend an existing approach to real-time instance segmentation, called 'Straight to Shapes' (STS), which makes use of low-dim…

    Submitted 30 July, 2019; v1 submitted 27 May, 2019; originally announced May 2019.

    Comments: Technical report, 27 pages (12 main, 15 supplementary), 17 figures, 14 tables

    Report number: STS-2018

  3. arXiv:1708.00783  [pdf, other]

    cs.CV

    InfiniTAM v3: A Framework for Large-Scale 3D Reconstruction with Loop Closure

    Authors: Victor Adrian Prisacariu, Olaf Kähler, Stuart Golodetz, Michael Sapienza, Tommaso Cavallari, Philip H S Torr, David W Murray

    Abstract: Volumetric models have become a popular representation for 3D scenes in recent years. One breakthrough leading to their popularity was KinectFusion, which focuses on 3D reconstruction using RGB-D sensors. However, monocular SLAM has since also been tackled with very similar approaches. Representing the reconstruction volumetrically as a TSDF leads to most of the simplicity and efficiency that can…

    Submitted 2 August, 2017; originally announced August 2017.

    Comments: This article largely supersedes arXiv:1410.0925 (it describes version 3 of the InfiniTAM framework)

  4. arXiv:1707.07213  [pdf, other]

    cs.CV

    Spatio-temporal Human Action Localisation and Instance Segmentation in Temporally Untrimmed Videos

    Authors: Suman Saha, Gurkirt Singh, Michael Sapienza, Philip H. S. Torr, Fabio Cuzzolin

    Abstract: Current state-of-the-art human action recognition is focused on the classification of temporally trimmed videos in which only one action occurs per frame. In this work we address the problem of action localisation and instance segmentation in which multiple concurrent actions of the same class may be segmented out of an image sequence. We cast the action tube extraction as an energy maximisation p…

    Submitted 6 August, 2017; v1 submitted 22 July, 2017; originally announced July 2017.

    Comments: Typos corrected

  5. arXiv:1704.01358  [pdf, other]

    cs.CV

    Incremental Tube Construction for Human Action Detection

    Authors: Harkirat Singh Behl, Michael Sapienza, Gurkirt Singh, Suman Saha, Fabio Cuzzolin, Philip H. S. Torr

    Abstract: Current state-of-the-art action detection systems are tailored for offline batch-processing applications. However, for online applications like human-robot interaction, current systems fall short, either because they only detect one action per video, or because they assume that the entire video is available ahead of time. In this work, we introduce a real-time and online joint-labelling and associ…

    Submitted 23 July, 2018; v1 submitted 5 April, 2017; originally announced April 2017.

    Comments: British Machine Vision Conference (BMVC) 2018

  6. arXiv:1611.08563  [pdf, other]

    cs.CV

    Online Real-time Multiple Spatiotemporal Action Localisation and Prediction

    Authors: Gurkirt Singh, Suman Saha, Michael Sapienza, Philip Torr, Fabio Cuzzolin

    Abstract: We present a deep-learning framework for real-time multiple spatio-temporal (S/T) action localisation, classification and early prediction. Current state-of-the-art approaches work offline and are too slow to be useful in real-world settings. To overcome their limitations we introduce two major developments. Firstly, we adopt real-time SSD (Single Shot MultiBox Detector) convolutional neural netw…

    Submitted 24 August, 2017; v1 submitted 25 November, 2016; originally announced November 2016.

    Comments: 10 pages, 3 figures, ICCV 2017. Added link to new annotations of UCF101-24

  7. arXiv:1611.07932  [pdf, other]

    cs.CV

    Straight to Shapes: Real-time Detection of Encoded Shapes

    Authors: Saumya Jetley, Michael Sapienza, Stuart Golodetz, Philip H. S. Torr

    Abstract: Current object detection approaches predict bounding boxes, but these provide little instance-specific information beyond location, scale and aspect ratio. In this work, we propose to directly regress to objects' shapes in addition to their bounding boxes and categories. It is crucial to find an appropriate shape representation that is compact and decodable, and in which objects can be compared fo…

    Submitted 5 July, 2017; v1 submitted 23 November, 2016; originally announced November 2016.

    Comments: 16 pages including appendix; Published at CVPR 2017

  8. arXiv:1608.01529  [pdf, ps, other]

    cs.CV

    Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos

    Authors: Suman Saha, Gurkirt Singh, Michael Sapienza, Philip H. S. Torr, Fabio Cuzzolin

    Abstract: In this work, we propose an approach to the spatiotemporal localisation (detection) and classification of multiple concurrent actions within temporally untrimmed videos. Our framework is composed of three stages. In stage 1, appearance and motion detection networks are employed to localise and score actions from colour images and optical flow. In stage 2, the appearance network detections are boos…

    Submitted 4 August, 2016; originally announced August 2016.

    Comments: Accepted by British Machine Vision Conference 2016

  9. arXiv:1601.02220  [pdf, other]

    cs.CV cs.SD

    Joint Object-Material Category Segmentation from Audio-Visual Cues

    Authors: Anurag Arnab, Michael Sapienza, Stuart Golodetz, Julien Valentin, Ondrej Miksik, Shahram Izadi, Philip Torr

    Abstract: It is not always possible to recognise objects and infer material properties for a scene from visual cues alone, since objects can look visually similar whilst being made of very different materials. In this paper, we therefore present an approach that augments the available dense visual cues with sparse auditory cues in order to estimate dense object and material labels. Since estimates of object…

    Submitted 10 January, 2016; originally announced January 2016.

    Comments: Published in British Machine Vision Conference (BMVC) 2015

  10. SemanticPaint: A Framework for the Interactive Segmentation of 3D Scenes

    Authors: Stuart Golodetz, Michael Sapienza, Julien P. C. Valentin, Vibhav Vineet, Ming-Ming Cheng, Anurag Arnab, Victor A. Prisacariu, Olaf Kähler, Carl Yuheng Ren, David W. Murray, Shahram Izadi, Philip H. S. Torr

    Abstract: We present an open-source, real-time implementation of SemanticPaint, a system for geometric reconstruction, object-class segmentation and learning of 3D scenes. Using our system, a user can walk into a room wearing a depth camera and a virtual reality headset, and both densely reconstruct the 3D scene and interactively segment the environment into object classes such as 'chair', 'floor' and 'tabl…

    Submitted 13 October, 2015; originally announced October 2015.

    Comments: 33 pages, Project: http://www.semantic-paint.com, Code: https://github.com/torrvision/spaint

    ACM Class: I.2.10

  11. arXiv:1405.7545  [pdf, other]

    cs.CV

    Feature sampling and partitioning for visual vocabulary generation on large action classification datasets

    Authors: Michael Sapienza, Fabio Cuzzolin, Philip H. S. Torr

    Abstract: The recent trend in action recognition is towards larger datasets, an increasing number of action classes and larger visual vocabularies. State-of-the-art human action classification in challenging video data is currently based on a bag-of-visual-words pipeline in which space-time features are aggregated globally to form a histogram. The strategies chosen to sample features and construct a visual…

    Submitted 29 May, 2014; originally announced May 2014.