Clothes-Changing Person Re-identification Based On Skeleton Dynamics

Abstract: Clothes-Changing Person Re-Identification (ReID) aims to recognize the same individual across different videos captured at various times and locations. This task is particularly challenging due to changes in appearance, such as clothing, hairstyle, and accessories. We propose a Clothes-Changing ReID method that uses only skeleton data and does not use appearance features. Traditional ReID methods often depend on appearance features, leading to decreased accuracy when clothing changes. Our approach utilizes a spatio-temporal Graph Convolution Network (GCN) encoder to generate a skeleton-based descriptor for each individual. During testing, we improve accuracy by aggregating predictions from multiple segments of a video clip. Evaluated on the CCVID dataset with several different pose estimation models, our method achieves state-of-the-art performance, offering a robust and efficient solution for Clothes-Changing ReID.

Submitted 13 March, 2025; originally announced March 2025.
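
The test-time step mentioned in the abstract above (aggregating predictions from multiple segments of a clip) could look roughly like the sketch below. The abstract does not say how predictions are aggregated, so plain averaging of per-identity scores is an assumption, and `aggregate_segments` and the score values are hypothetical.

```python
def aggregate_segments(segment_scores):
    """Average per-identity scores over all segments of one clip.

    segment_scores: list of lists, one score per gallery identity for
    each segment. Averaging is an assumed aggregation rule; the
    abstract only says that predictions are aggregated.
    Returns (index of best identity, list of mean scores).
    """
    if not segment_scores:
        raise ValueError("need at least one segment")
    n_ids = len(segment_scores[0])
    n_segments = len(segment_scores)
    mean_scores = [
        sum(segment[i] for segment in segment_scores) / n_segments
        for i in range(n_ids)
    ]
    best_identity = max(range(n_ids), key=lambda i: mean_scores[i])
    return best_identity, mean_scores


# Hypothetical scores for three segments over three gallery identities.
# The second segment on its own would pick identity 1.
scores = [
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
]
best_identity, mean_scores = aggregate_segments(scores)
```

In this toy input a single noisy segment disagrees with the other two, and averaging lets the majority evidence decide.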

arXiv:2408.17434 [pdf, other]

Audio Enhancement from Multiple Crowdsourced Recordings: A Simple and Effective Baseline

Authors: Shiran Aziz, Yossi Adi, Shmuel Peleg

Abstract: With the popularity of cellular phones, events are often recorded by multiple devices from different locations and shared on social media. Several different recordings could be found for many events. Such recordings are usually noisy, where noise for each device is local and unrelated to others. This case of multiple microphones at unknown locations, capturing local, uncorrelated noise, was rarely treated in the literature. In this work we propose a simple and effective crowdsourced audio enhancement method to remove local noises at each input audio signal. Then, averaging all cleaned source signals gives an improved audio of the event. We demonstrate the effectiveness of our method using synthetic audio signals, together with real-world recordings. This simple approach can set a new baseline for crowdsourced audio enhancement for more sophisticated methods which we hope will be developed by the research community.

Submitted 30 August, 2024; originally announced August 2024.
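
The averaging step described in the abstract above can be illustrated with a small synthetic experiment. This is a minimal sketch of the averaging idea only: it assumes the recordings are already time-aligned and equally scaled, it uses independent Gaussian noise as a stand-in for local noise, and it does not reproduce the per-recording cleaning step of the paper. All names and parameter values are hypothetical.

```python
import math
import random


def mean_squared_error(a, b):
    """Mean squared difference between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def average_signals(signals):
    """Sample-wise average of several equal-length, time-aligned signals."""
    n = len(signals)
    return [sum(s[t] for s in signals) / n for t in range(len(signals[0]))]


rng = random.Random(0)  # fixed seed so the experiment is repeatable

# Synthetic "event": a sine wave sampled at 1000 points.
clean = [math.sin(2 * math.pi * 5 * t / 1000) for t in range(1000)]

# Eight recordings, each with its own independent (uncorrelated) noise.
recordings = [[x + rng.gauss(0.0, 0.5) for x in clean] for _ in range(8)]

averaged = average_signals(recordings)
individual_errors = [mean_squared_error(r, clean) for r in recordings]
averaged_error = mean_squared_error(averaged, clean)
```

Because the noise terms are independent across recordings, averaging N of them reduces the noise variance by roughly a factor of N, so `averaged_error` should come out well below every entry of `individual_errors`.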

arXiv:2211.13807 [pdf, other]

GEFF: Improving Any Clothes-Changing Person ReID Model using Gallery Enrichment with Face Features

Authors: Daniel Arkushin, Bar Cohen, Shmuel Peleg, Ohad Fried

Abstract: In the Clothes-Changing Re-Identification (CC-ReID) problem, given a query sample of a person, the goal is to determine the correct identity based on a labeled gallery in which the person appears in different clothes. Several models tackle this challenge by extracting clothes-independent features. However, the performance of these models is still lower for the clothes-changing setting compared to the same-clothes setting in which the person appears with the same clothes in the labeled gallery. As clothing-related features are often dominant features in the data, we propose a new process we call Gallery Enrichment, to utilize these features. In this process, we enrich the original gallery by adding to it query samples based on their face features, using an unsupervised algorithm. Additionally, we show that combining ReID and face feature extraction modules alongside an enriched gallery results in a more accurate ReID model, even for query samples with new outfits that do not include faces. Moreover, we claim that existing CC-ReID benchmarks do not fully represent real-world scenarios, and propose a new video CC-ReID dataset called 42Street, based on a theater play that includes crowded scenes and numerous clothes changes. When applied to multiple ReID models, our method (GEFF) achieves an average improvement of 33.5% and 6.7% in the Top-1 clothes-changing metric on the PRCC and LTCC benchmarks. Combined with the latest ReID models, our method achieves new SOTA results on the PRCC, LTCC, CCVID, LaST and VC-Clothes benchmarks and the proposed 42Street dataset.

Submitted 21 November, 2023; v1 submitted 24 November, 2022; originally announced November 2022.

arXiv:2209.04177 [pdf, ps, other]

Tensor Reconstruction Beyond Constant Rank

Authors: Shir Peleg, Amir Shpilka, Ben Lee Volk

Abstract: We give reconstruction algorithms for subclasses of depth-3 arithmetic circuits. In particular, we obtain the first efficient algorithm for finding tensor rank, and an optimal tensor decomposition as a sum of rank-one tensors, when given black-box access to a tensor of super-constant rank. We obtain the following results: 1. A deterministic algorithm that reconstructs polynomials computed by $Σ^{[k]}\bigwedge^{[d]}Σ$ circuits in time $\mathsf{poly}(n,d,c) \cdot \mathsf{poly}(k)^{k^{k^{10}}}$. 2. A randomized algorithm that reconstructs polynomials computed by multilinear $Σ^{[k]}\prod^{[d]}Σ$ circuits in time $\mathsf{poly}(n,d,c) \cdot k^{k^{k^{k^{O(k)}}}}$. 3. A randomized algorithm that reconstructs polynomials computed by set-multilinear $Σ^{[k]}\prod^{[d]}Σ$ circuits in time $\mathsf{poly}(n,d,c) \cdot k^{k^{k^{k^{O(k)}}}}$, where $c=\log q$ if $\mathbb{F}=\mathbb{F}_q$ is a finite field, and $c$ equals the maximum bit complexity of any coefficient of $f$ if $\mathbb{F}$ is infinite. Prior to our work, polynomial time algorithms for the case when the rank, $k$, is constant, were given by Bhargava, Saraf and Volkovich [BSV21]. Another contribution of this work is correcting an error from a paper of Karnin and Shpilka [KS09] that affected Theorem 1.6 of [BSV21]. Consequently, the results of [KS09, BSV21] continue to hold, with a slightly worse setting of parameters. For fixing the error we study the relation between syntactic and semantic ranks of $ΣΠΣ$ circuits. We obtain our improvement by introducing a technique for learning rank preserving coordinate-subspaces. [KS09] and [BSV21] tried all choices of finding the "correct" coordinates, which led to having a fast growing function of $k$ in the exponent of $n$. We find these spaces in time that is growing fast with $k$, yet it is only a fixed polynomial in $n$.

Submitted 9 September, 2022; originally announced September 2022.

Comments: Abstract shortened to meet arXiv requirements; 59 pages

arXiv:2207.10441 [pdf, other]

doi 10.21437/Interspeech.2022-10735

Deep Audio Waveform Prior

Authors: Arnon Turetzky, Tzvi Michelson, Yossi Adi, Shmuel Peleg

Abstract: Convolutional neural networks contain strong priors for generating natural looking images [1]. These priors enable image denoising, super resolution, and inpainting in an unsupervised manner. Previous attempts to demonstrate similar ideas in audio, namely deep audio priors, (i) use hand picked architectures such as harmonic convolutions, (ii) only work with spectrogram input, and (iii) have been used mostly for eliminating Gaussian noise [2]. In this work we show that existing SOTA architectures for audio source separation contain deep priors even when working with the raw waveform. Deep priors can be discovered by training a neural network to generate a single corrupted signal when given white noise as input. A network with relevant deep priors is likely to generate a cleaner version of the signal before converging on the corrupted signal. We demonstrate this restoration effect with several corruptions: background noise, reverberations, and a gap in the signal (audio inpainting).

Submitted 21 July, 2022; originally announced July 2022.

Comments: Interspeech 2022

arXiv:2205.09791 [pdf, other]

A Peek at Peak Emotion Recognition

Authors: Tzvi Michelson, Hillel Aviezer, Shmuel Peleg

Abstract: Despite much progress in the field of facial expression recognition, little attention has been paid to the recognition of peak emotion. Aviezer et al. [1] showed that humans have trouble discerning between positive and negative peak emotions. In this work we analyze how deep learning fares on this challenge. We find that (i) despite using very small datasets, features extracted from deep learning models can achieve results significantly better than humans. (ii) We find that deep learning models, even when trained only on datasets tagged by humans, still outperform humans in this task.

Submitted 19 May, 2022; originally announced May 2022.

Comments: Submitted to HBU Workshop at ICPR, 6 pages, 5 figures

arXiv:2202.04932 [pdf, other]

Robust Sylvester-Gallai type theorem for quadratic polynomials

Authors: Shir Peleg, Amir Shpilka

Abstract: In this work, we extend the robust version of the Sylvester-Gallai theorem, obtained by Barak, Dvir, Wigderson and Yehudayoff, and by Dvir, Saraf and Wigderson, to the case of quadratic polynomials. Specifically, we prove that if $\mathcal{Q}\subset \mathbb{C}[x_1,\ldots,x_n]$ is a finite set, $|\mathcal{Q}|=m$, of irreducible quadratic polynomials that satisfy the following condition: There is $δ>0$ such that for every $Q\in\mathcal{Q}$ there are at least $δm$ polynomials $P\in \mathcal{Q}$ such that whenever $Q$ and $P$ vanish then so does a third polynomial in $\mathcal{Q}\setminus\{Q,P\}$, then $\dim(\text{span}({\mathcal{Q}}))=\text{poly}(1/δ)$. The work of Barak et al. and Dvir et al. studied the case of linear polynomials and proved an upper bound of $O(1/δ)$ on the dimension (in the first work an upper bound of $O(1/δ^2)$ was given, which was improved to $O(1/δ)$ in the second work).

Submitted 10 February, 2022; originally announced February 2022.

Comments: arXiv admin note: text overlap with arXiv:2006.08263

arXiv:2110.01367 [pdf, other]

Audio-Visual Evaluation of Oratory Skills

Authors: Tzvi Michelson, Shmuel Peleg

Abstract: What makes a talk successful? Is it the content or the presentation? We try to estimate the contribution of the speaker's oratory skills to the talk's success, while ignoring the content of the talk. By oratory skills we refer to facial expressions, motions and gestures, as well as the vocal features. We use TED Talks as our dataset, and measure the success of each talk by its view count. Using this dataset we train a neural network to assess the oratory skills in a talk through three factors: body pose, facial expressions, and acoustic features. Most previous work on automatic evaluation of oratory skills uses hand-crafted expert annotations for both the quality of the talk and for the identification of predefined actions. Unlike prior art, we measure the quality to be equivalent to the view count of the talk as counted by TED, and allow the network to automatically learn the actions, expressions, and sounds that are relevant to the success of a talk. We find that oratory skills alone contribute substantially to the chances of a talk being successful.

Submitted 30 September, 2021; originally announced October 2021.

Comments: TransAI 2021

arXiv:2106.03214 [pdf, other]

doi 10.22331/q-2022-02-15-652

Lower Bounds on Stabilizer Rank

Authors: Shir Peleg, Amir Shpilka, Ben Lee Volk

Abstract: The stabilizer rank of a quantum state $ψ$ is the minimal $r$ such that $\left| ψ\right \rangle = \sum_{j=1}^r c_j \left|\varphi_j \right\rangle$ for $c_j \in \mathbb{C}$ and stabilizer states $\varphi_j$. The running time of several classical simulation methods for quantum circuits is determined by the stabilizer rank of the $n$-th tensor power of single-qubit magic states. We prove a lower bound of $Ω(n)$ on the stabilizer rank of such states, improving a previous lower bound of $Ω(\sqrt{n})$ of Bravyi, Smith and Smolin (arXiv:1506.01396). Further, we prove that for a sufficiently small constant $δ$, the stabilizer rank of any state which is $δ$-close to those states is $Ω(\sqrt{n}/\log n)$. This is the first non-trivial lower bound for approximate stabilizer rank. Our techniques rely on the representation of stabilizer states as quadratic functions over affine subspaces of $\mathbb{F}_2^n$, and we use tools from analysis of boolean functions and complexity theory. The proof of the first result involves a careful analysis of directional derivatives of quadratic polynomials, whereas the proof of the second result uses Razborov-Smolensky low degree polynomial approximations and correlation bounds against the majority function.

Submitted 10 February, 2022; v1 submitted 6 June, 2021; originally announced June 2021.

Journal ref: Quantum 6, 652 (2022)

arXiv:2102.07762 [pdf, other]

Membership Inference Attacks are Easier on Difficult Problems

Authors: Avital Shafran, Shmuel Peleg, Yedid Hoshen

Abstract: Membership inference attacks (MIA) try to detect if data samples were used to train a neural network model, e.g. to detect copyright abuses. We show that models with higher dimensional input and output are more vulnerable to MIA, and address in more detail models for image translation and semantic segmentation, including medical image segmentation. We show that reconstruction-errors can lead to very effective MIA attacks as they are indicative of memorization. Unfortunately, reconstruction error alone is less effective at discriminating between non-predictable images used in training and easy to predict images that were never seen before. To overcome this, we propose using a novel predictability error that can be computed for each sample, and its computation does not require a training set. Our membership error, obtained by subtracting the predictability error from the reconstruction error, is shown to achieve high MIA accuracy on an extensive number of benchmarks.

Submitted 18 August, 2021; v1 submitted 15 February, 2021; originally announced February 2021.
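
The membership error defined in the abstract above (reconstruction error minus predictability error) can be written down directly. The sketch below is illustrative only: the decision rule (a lower membership error suggests a training member), the threshold of 0.0, and the error values are assumptions, and the abstract does not specify how either error is computed.

```python
def membership_error(reconstruction_error, predictability_error):
    """Membership error as stated in the abstract: the reconstruction
    error with the predictability error subtracted."""
    return reconstruction_error - predictability_error


def predict_member(reconstruction_error, predictability_error, threshold=0.0):
    """Assumed decision rule: flag a sample as a training member when its
    membership error falls below a (hypothetical) threshold."""
    return membership_error(reconstruction_error, predictability_error) < threshold


# Hypothetical numbers illustrating the failure mode named in the abstract.
# A hard-to-predict image that WAS in the training set: its reconstruction
# error is fairly high in absolute terms, but lower than its
# predictability error suggests it should be.
hard_member = predict_member(reconstruction_error=0.30, predictability_error=0.50)

# An easy-to-predict image that was NEVER seen: its reconstruction error
# is low simply because the image is easy.
easy_non_member = predict_member(reconstruction_error=0.10, predictability_error=0.05)
```

With these numbers, thresholding the raw reconstruction error (0.30 versus 0.10) would rank the two samples the wrong way round, whereas subtracting the predictability error separates them.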

arXiv:2006.08263 [pdf, other]

Polynomial time deterministic identity testing algorithm for $Σ^{[3]}ΠΣΠ^{[2]}$ circuits via Edelstein-Kelly type theorem for quadratic polynomials

Authors: Shir Peleg, Amir Shpilka

Abstract: In this work we resolve conjectures of Beecken, Mittmann and Saxena [BMS13] and Gupta [Gup14], by proving an analog of a theorem of Edelstein and Kelly for quadratic polynomials. As an immediate corollary we obtain the first deterministic polynomial time black-box algorithm for testing zeroness of $Σ^{[3]}ΠΣΠ^{[2]}$ circuits.

Submitted 15 June, 2020; originally announced June 2020.

arXiv:2003.05152 [pdf, ps, other]

A generalized Sylvester-Gallai type theorem for quadratic polynomials

Authors: Shir Peleg, Amir Shpilka

Abstract: In this work we prove a version of the Sylvester-Gallai theorem for quadratic polynomials that takes us one step closer to obtaining a deterministic polynomial time algorithm for testing zeroness of $Σ^{[3]}ΠΣΠ^{[2]}$ circuits. Specifically, we prove that if a finite set of irreducible quadratic polynomials $\mathcal{Q}$ satisfies that for every two polynomials $Q_1,Q_2\in \mathcal{Q}$ there is a subset $\mathcal{K}\subset \mathcal{Q}$, such that $Q_1,Q_2 \notin \mathcal{K}$ and whenever $Q_1$ and $Q_2$ vanish then also $\prod_{i\in \mathcal{K}} Q_i$ vanishes, then the linear span of the polynomials in $\mathcal{Q}$ has dimension $O(1)$. This extends the earlier result [Shpilka19] that showed a similar conclusion when $|\mathcal{K}| = 1$. An important technical step in our proof is a theorem classifying all the possible cases in which a product of quadratic polynomials can vanish when two other quadratic polynomials vanish, i.e., when the product is in the radical of the ideal generated by the two quadratics. This step extends a result from [Shpilka19] that studied the case when one quadratic polynomial is in the radical of two other quadratics.

Submitted 11 March, 2020; originally announced March 2020.

arXiv:1911.12322 [pdf, other]

Crypto-Oriented Neural Architecture Design

Authors: Avital Shafran, Gil Segev, Shmuel Peleg, Yedid Hoshen

Abstract: As neural networks revolutionize many applications, significant privacy conflicts between model users and providers emerge. The cryptography community developed a variety of techniques for secure computation to address such privacy issues. As generic techniques for secure computation are typically prohibitively ineffective, many efforts focus on optimizing their underlying cryptographic tools. Differently, we propose to optimize the initial design of crypto-oriented neural architectures and provide a novel Partial Activation layer. The proposed layer is much faster for secure computation. Evaluating our method on three state-of-the-art architectures (SqueezeNet, ShuffleNetV2, and MobileNetV2) demonstrates significant improvement to the efficiency of secure inference on common evaluation metrics.

Submitted 16 February, 2021; v1 submitted 27 November, 2019; originally announced November 2019.

Comments: Full version (shorter version published in ICASSP'21)

arXiv:1808.06250 [pdf, other]

Dynamic Temporal Alignment of Speech to Lips

Authors: Tavi Halperin, Ariel Ephrat, Shmuel Peleg

Abstract: Many speech segments in movies are re-recorded in a studio during postproduction, to compensate for poor sound quality as recorded on location. Manual alignment of the newly-recorded speech with the original lip movements is a tedious task. We present an audio-to-video alignment method for automating speech to lips alignment, stretching and compressing the audio signal to match the lip movements. This alignment is based on deep audio-visual features, mapping the lips video and the speech signal to a shared representation. Using this shared representation we compute the lip-sync error between every short speech period and every video frame, followed by the determination of the optimal corresponding frame for each short sound period over the entire video clip. We demonstrate successful alignment both quantitatively, using a human perception-inspired metric, as well as qualitatively. The strongest advantage of our audio-to-video approach is in cases where the original voice is unclear, and where a constant shift of the sound cannot give a perfect alignment. In these cases state-of-the-art methods will fail.

Submitted 19 August, 2018; originally announced August 2018.

arXiv:1711.08789 [pdf, other]

Visual Speech Enhancement

Authors: Aviv Gabbay, Asaph Shamir, Shmuel Peleg

Abstract: When video is shot in a noisy environment, the voice of a speaker seen in the video can be enhanced using the visible mouth movements, reducing background noise. While most existing methods use audio-only inputs, improved performance is obtained with our visual speech enhancement, based on an audio-visual neural network. We include in the training data videos to which we added the voice of the target speaker as background noise. Since the audio input is not sufficient to separate the voice of a speaker from his own voice, the trained model better exploits the visual input and generalizes well to different noise types. The proposed model outperforms prior audio visual methods on two public lipreading datasets. It is also the first to be demonstrated on a dataset not designed for lipreading, such as the weekly addresses of Barack Obama.

Submitted 13 June, 2018; v1 submitted 23 November, 2017; originally announced November 2017.

Comments: Accepted to Interspeech 2018. Supplementary video: https://www.youtube.com/watch?v=nyYarDGpcYA

arXiv:1708.06767 [pdf, other]

Seeing Through Noise: Visually Driven Speaker Separation and Enhancement

Authors: Aviv Gabbay, Ariel Ephrat, Tavi Halperin, Shmuel Peleg

Abstract: Isolating the voice of a specific person while filtering out other voices or background noises is challenging when video is shot in noisy environments. We propose audio-visual methods to isolate the voice of a single speaker and eliminate unrelated sounds. First, face motions captured in the video are used to estimate the speaker's voice, by passing the silent video frames through a video-to-speech neural network-based model. Then the speech predictions are applied as a filter on the noisy input audio. This approach avoids using mixtures of sounds in the learning process, as the number of such possible mixtures is huge, and would inevitably bias the trained model. We evaluate our method on two audio-visual datasets, GRID and TCD-TIMIT, and show that our method attains significant SDR and PESQ improvements over the raw video-to-speech predictions, and a well-known audio-only method.

Submitted 9 February, 2018; v1 submitted 22 August, 2017; originally announced August 2017.

Comments: Supplementary video: https://www.youtube.com/watch?v=qmsyj7vAzoI

arXiv:1708.01204 [pdf, other]

Improved Speech Reconstruction from Silent Video

Authors: Ariel Ephrat, Tavi Halperin, Shmuel Peleg

Abstract: Speechreading is the task of inferring phonetic information from visually observed articulatory facial movements, and is a notoriously difficult task for humans to perform. In this paper we present an end-to-end model based on a convolutional neural network (CNN) for generating an intelligible and natural-sounding acoustic speech signal from silent video frames of a speaking person. We train our model on speakers from the GRID and TCD-TIMIT datasets, and evaluate the quality and intelligibility of reconstructed speech using common objective measurements. We show that speech predictions from the proposed model attain scores which indicate significantly improved quality over existing models. In addition, we show promising results towards reconstructing speech from an unconstrained dictionary.

Submitted 29 August, 2017; v1 submitted 1 August, 2017; originally announced August 2017.

Comments: Accepted to ICCV 2017 Workshop on Computer Vision for Audio-Visual Media. Supplementary video: https://www.youtube.com/watch?v=Xjbn7h7tpg0. arXiv admin note: text overlap with arXiv:1701.00495

arXiv:1701.00495 [pdf, other]

Vid2speech: Speech Reconstruction from Silent Video

Authors: Ariel Ephrat, Shmuel Peleg

Abstract: Speechreading is a notoriously difficult task for humans to perform. In this paper we present an end-to-end model based on a convolutional neural network (CNN) for generating an intelligible acoustic speech signal from silent video frames of a speaking person. The proposed CNN generates sound features for each frame based on its neighboring frames. Waveforms are then synthesized from the learned speech features to produce intelligible speech. We show that by leveraging the automatic feature learning capabilities of a CNN, we can obtain state-of-the-art word intelligibility on the GRID dataset, and show promising results for learning out-of-vocabulary (OOV) words.

Submitted 9 January, 2017; v1 submitted 2 January, 2017; originally announced January 2017.

Comments: Accepted for publication at ICASSP 2017

arXiv:1607.07660 [pdf, other]

Fundamental Matrices from Moving Objects Using Line Motion Barcodes

Authors: Yoni Kasten, Gil Ben-Artzi, Shmuel Peleg, Michael Werman

Abstract: Computing the epipolar geometry between cameras with very different viewpoints is often very difficult. The appearance of objects can vary greatly, and it is difficult to find corresponding feature points. Prior methods searched for corresponding epipolar lines using points on the convex hull of the silhouette of a single moving object. These methods fail when the scene includes multiple moving objects. This paper extends previous work to scenes having multiple moving objects by using the "Motion Barcodes", a temporal signature of lines. Corresponding epipolar lines have similar motion barcodes, and candidate pairs of corresponding epipolar lines are found by the similarity of their motion barcodes. As in previous methods we assume that cameras are relatively stationary and that moving objects have already been extracted using background subtraction.

Submitted 26 July, 2016; originally announced July 2016.

Journal ref: ECCV'16, Amsterdam, Oct. 2016, Vol II, pp. 220-118

arXiv:1604.07741 [pdf, other]

doi 10.1109/TCSVT.2017.2651051

EgoSampling: Wide View Hyperlapse from Egocentric Videos

Authors: Tavi Halperin, Yair Poleg, Chetan Arora, Shmuel Peleg

Abstract: The possibility of sharing one's point of view makes use of wearable cameras compelling. These videos are often long, boring and coupled with extreme shake, as the camera is worn on a moving person. Fast forwarding (i.e. frame sampling) is a natural choice for quick video browsing. However, this accentuates the shake caused by natural head motion in an egocentric video, making the fast forwarded video useless. We propose EgoSampling, an adaptive frame sampling that gives stable, fast forwarded, hyperlapse videos. Adaptive frame sampling is formulated as an energy minimization problem, whose optimal solution can be found in polynomial time. We further turn the camera shake from a drawback into a feature, enabling the increase in field-of-view of the output video. This is obtained when each output frame is mosaiced from several input frames. The proposed technique also enables the generation of a single hyperlapse video from multiple egocentric videos, allowing even faster video consumption.

Submitted 12 January, 2017; v1 submitted 26 April, 2016; originally announced April 2016.

Comments: Accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
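
The abstract above says only that adaptive frame sampling is an energy minimization with a polynomial-time optimum. One way such a problem can be solved is a shortest-path dynamic program over frames, sketched below. This is an illustration under assumptions, not the authors' implementation: the function name `sample_frames`, the `max_skip` bound, and the caller-supplied `cost` function are all hypothetical, and costs are assumed to be finite.

```python
from typing import Callable, List


def sample_frames(
    num_frames: int,
    max_skip: int,
    cost: Callable[[int, int], float],
) -> List[int]:
    """Pick frames from 0 to num_frames - 1 that minimize the summed transition cost.

    Hypothetical sketch: frames are nodes, an edge i -> j exists when
    1 <= j - i <= max_skip, and cost(i, j) is the (finite) energy of
    showing frame j right after frame i.
    """
    if num_frames <= 0 or max_skip < 1:
        raise ValueError("need at least one frame and max_skip >= 1")

    best = [float("inf")] * num_frames  # best[j]: minimal energy of a path ending at j
    prev = [-1] * num_frames            # prev[j]: predecessor of j on that path
    best[0] = 0.0

    # Frames are already in topological order, so one forward pass suffices.
    for j in range(1, num_frames):
        for i in range(max(0, j - max_skip), j):
            candidate = best[i] + cost(i, j)
            if candidate < best[j]:
                best[j] = candidate
                prev[j] = i

    # Walk back from the last frame to recover the selected frames.
    path = []
    j = num_frames - 1
    while j != -1:
        path.append(j)
        j = prev[j]
    return path[::-1]
```

The double loop runs in O(num_frames * max_skip) time, which is one concrete sense in which an optimum is reachable in polynomial time. A cost function could, for example, penalize deviation from a target skip and add a per-frame shake score; both terms would be design choices of the caller, not something the abstract specifies.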

arXiv:1604.04848 [pdf, other]

Epipolar Geometry Based On Line Similarity

Authors: Gil Ben-Artzi, Tavi Halperin, Michael Werman, Shmuel Peleg

Abstract: It is known that epipolar geometry can be computed from three epipolar line correspondences but this computation is rarely used in practice since there are no simple methods to find corresponding lines. Instead, methods for finding corresponding points are widely used. This paper proposes a similarity measure between lines that indicates whether two lines are corresponding epipolar lines and enables finding epipolar line correspondences as needed for the computation of epipolar geometry. A similarity measure between two lines, suitable for video sequences of a dynamic scene, has been previously described. This paper suggests a stereo matching similarity measure suitable for images. It is based on the quality of stereo matching between the two lines, as corresponding epipolar lines yield a good stereo correspondence. Instead of an exhaustive search over all possible pairs of lines, the search space is substantially reduced when two corresponding point pairs are given. We validate the proposed method using real-world images and compare it to state-of-the-art methods. We found this method to be more accurate by a factor of five compared to the standard method using seven corresponding points and comparable to the 8-points algorithm.

Submitted 7 January, 2017; v1 submitted 17 April, 2016; originally announced April 2016.

Comments: ICPR 2016, Cancun, Dec 2016

Journal ref: ICPR'16, Cancun, Dec. 2016, pp. 1865-1870

arXiv:1506.07866 [pdf, other]

Camera Calibration from Dynamic Silhouettes Using Motion Barcodes

Authors: Gil Ben-Artzi, Yoni Kasten, Shmuel Peleg, Michael Werman

Abstract: Computing the epipolar geometry between cameras with very different viewpoints is often problematic as matching points are hard to find. In these cases, it has been proposed to use information from dynamic objects in the scene for suggesting point and line correspondences. We propose a speed up of about two orders of magnitude, as well as an increase in robustness and accuracy, to methods computing epipolar geometry from dynamic silhouettes. This improvement is based on a new temporal signature: motion barcode for lines. Motion barcode is a binary temporal sequence for lines, indicating for each frame the existence of at least one foreground pixel on that line. The motion barcodes of two corresponding epipolar lines are very similar, so the search for corresponding epipolar lines can be limited only to lines having similar barcodes. The use of motion barcodes leads to increased speed, accuracy, and robustness in computing the epipolar geometry.

Submitted 7 January, 2017; v1 submitted 25 June, 2015; originally announced June 2015.

Comments: Update metadata

Journal ref: Proc. CVPR'16, Las Vegas, June 2016, pp. 4095-4103
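
The motion barcode defined in the abstract above (one bit per frame, set when at least one foreground pixel lies on the line) can be sketched in a few lines. The code below is a minimal illustration, not the authors' code: the names `motion_barcode` and `barcode_similarity` are hypothetical, masks are assumed to be already-computed binary foreground masks, the line is assumed to be given as a list of pixel coordinates, and normalized cross-correlation is an assumed choice of similarity, since the abstract says only that corresponding lines have "very similar" barcodes.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary foreground mask of one frame (rows of 0/1)
Pixel = Tuple[int, int]         # (row, col) coordinate of a pixel on the line


def motion_barcode(masks: Sequence[Mask], line_pixels: Sequence[Pixel]) -> List[int]:
    """Return one bit per frame: 1 if any pixel on the line is foreground."""
    return [int(any(mask[r][c] for r, c in line_pixels)) for mask in masks]


def barcode_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Normalized cross-correlation of two equal-length barcodes.

    Assumed measure for illustration; returns 0.0 when either barcode
    is constant, because the correlation is undefined in that case.
    """
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("barcodes must be non-empty and of equal length")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5
```

Under these assumptions, candidate line pairs from two cameras would be ranked by `barcode_similarity`, and only high-scoring pairs would be passed on to the epipolar geometry computation.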

arXiv:1506.02264 [pdf, other]

Visual Learning of Arithmetic Operations

Authors: Yedid Hoshen, Shmuel Peleg

Abstract: A simple Neural Network model is presented for end-to-end visual learning of arithmetic operations from pictures of numbers. The input consists of two pictures, each showing a 7-digit number. The output, also a picture, displays the number showing the result of an arithmetic operation (e.g., addition or subtraction) on the two input numbers. The concepts of a number, or of an operator, are not explicitly introduced. This indicates that addition is a simple cognitive task, which can be learned visually using a very small number of neurons. Other operations, e.g., multiplication, were not learnable using this architecture. Some tasks were not learnable end-to-end (e.g., addition with Roman numerals), but were easily learnable once broken into two separate sub-tasks: a perceptual Character Recognition sub-task and a cognitive Arithmetic sub-task. This indicates that while some tasks may be easily learnable end-to-end, others may need to be broken into sub-tasks.

Submitted 27 November, 2015; v1 submitted 7 June, 2015; originally announced June 2015.

Comments: To appear in AAAI 2016

Journal ref: Proc. AAAI'16, Phoenix, Feb. 2016, pp. 3733-3739

arXiv:1505.05254 [pdf, other]

Live Video Synopsis for Multiple Cameras

Authors: Yedid Hoshen, Shmuel Peleg

Abstract: Video surveillance cameras generate most of recorded video, and there is far more recorded video than operators can watch. Much progress has recently been made using summarization of recorded video, but such techniques do not have much impact on live video surveillance. We assume a camera hierarchy where a Master camera observes the decision-critical region, and one or more Slave cameras observe regions where past activity is important for making the current decision. We propose that when people appear in the live Master camera, the Slave cameras will display their past activities, and the operator could use past information for real-time decision making. The basic units of our method are action tubes, representing objects and their trajectories over time. Our object-based method has advantages over frame based methods, as it can handle multiple people, multiple activities for each person, and can address re-identification uncertainty.

Submitted 20 May, 2015; originally announced May 2015.

Comments: To be presented in ICIP 2015

Journal ref: Proc. ICIP'15, Quebec City, Sept. 2015, pp. 212-216

arXiv:1504.07469 [pdf, other]

doi 10.1109/WACV.2016.7477708

Compact CNN for Indexing Egocentric Videos

Authors: Yair Poleg, Ariel Ephrat, Shmuel Peleg, Chetan Arora

Abstract: While egocentric video is becoming increasingly popular, browsing it is very difficult. In this paper we present a compact 3D Convolutional Neural Network (CNN) architecture for long-term activity recognition in egocentric videos. Recognizing long-term activities enables us to temporally segment (index) long and unstructured egocentric videos. Existing methods for this task are based on hand-tuned features derived from visible objects, location of hands, as well as optical flow. Given a sparse optical flow volume as input, our CNN classifies the camera wearer's activity. We obtain classification accuracy of 89%, which outperforms the current state-of-the-art by 19%. Additional evaluation is performed on an extended egocentric video dataset, classifying twice as many categories as the current state-of-the-art. Furthermore, our CNN is able to recognize whether a video is egocentric or not with 99.2% accuracy, up by 24% from current state-of-the-art. To better understand what the network actually learns, we propose a novel visualization of CNN kernels as flow fields.

Submitted 24 November, 2015; v1 submitted 28 April, 2015; originally announced April 2015.

Journal ref: IEEE WACV'16, March 2016, pp. 1-9

arXiv:1412.3596 [pdf, other]

doi 10.1109/CVPR.2015.7299109

EgoSampling: Fast-Forward and Stereo for Egocentric Videos

Authors: Yair Poleg, Tavi Halperin, Chetan Arora, Shmuel Peleg

Abstract: While egocentric cameras like GoPro are gaining popularity, the videos they capture are long, boring, and difficult to watch from start to end. Fast forwarding (i.e. frame sampling) is a natural choice for faster video browsing. However, this accentuates the shake caused by natural head motion, making the fast forwarded video useless. We propose EgoSampling, an adaptive frame sampling that gives more stable fast forwarded videos. Adaptive frame sampling is formulated as energy minimization, whose optimal solution can be found in polynomial time. In addition, egocentric video taken while walking suffers from the left-right movement of the head as the body weight shifts from one leg to another. We turn this drawback into a feature: stereo video can be created by sampling the frames from the leftmost and rightmost head positions of each step, forming approximate stereo pairs.

Submitted 27 April, 2015; v1 submitted 11 December, 2014; originally announced December 2014.

Comments: in IEEE CVPR 2015, Boston, MA, June 2015

Journal ref: CVPR'15, Boston, June 2015

arXiv:1412.1455 [pdf, other]

Event Retrieval Using Motion Barcodes

Authors: Gil Ben-Artzi, Michael Werman, Shmuel Peleg

Abstract: We introduce a simple and effective method for retrieval of videos showing a specific event, even when the videos of that event were captured from significantly different viewpoints. Appearance-based methods fail in such cases, as appearances change with large changes of viewpoints. Our method is based on a pixel-based feature, "motion barcode", which records the existence/non-existence of motion as a function of time. While appearance, motion magnitude, and motion direction can vary greatly between disparate viewpoints, the existence of motion is viewpoint invariant. Based on the motion barcode, a similarity measure is developed for videos of the same event taken from very different viewpoints. This measure is robust to occlusions common under different viewpoints, and can be computed efficiently. Event retrieval is demonstrated using challenging videos from stationary and hand held cameras.

Submitted 12 May, 2015; v1 submitted 3 December, 2014; originally announced December 2014.

Journal ref: Proc. ICIP'15, Quebec City, Sept. 2015, pp. 2621-2625

arXiv:1411.7591 [pdf, other]

An Egocentric Look at Video Photographer Identity

Authors: Yedid Hoshen, Shmuel Peleg

Abstract: Egocentric cameras are being worn by an increasing number of users, among them many security forces worldwide. GoPro cameras have already penetrated the mass market, reporting a substantial increase in sales every year. As head-worn cameras do not capture the photographer, it may seem that the anonymity of the photographer is preserved even when the video is publicly distributed. We show that camera motion, as can be computed from the egocentric video, provides unique identity information. The photographer can be reliably recognized from a few seconds of video captured when walking. The proposed method achieves more than 90% recognition accuracy in cases where the random success rate is only 3%. Applications can include theft prevention by locking the camera when not worn by its rightful owner. Searching video sharing services (e.g. YouTube) for egocentric videos shot by a specific photographer may also become possible. An important message in this paper is that photographers should be aware that sharing egocentric video will compromise their anonymity, even when their face is not visible.

Submitted 8 November, 2015; v1 submitted 27 November, 2014; originally announced November 2014.

Journal ref: Proc. CVPR'16, Las Vegas, June 2016, pp. 4284-4292

Showing 1–28 of 28 results for author: Peleg, S