Is a picture worth a thousand words? A Deep Multi-Modal Fusion Architecture for Product Classification in e-commerce

Zahavy, Tom; Magnani, Alessandro; Krishnan, Abhinandan; Mannor, Shie

Computer Science > Computer Vision and Pattern Recognition

arXiv:1611.09534 (cs)

[Submitted on 29 Nov 2016]

Title:Is a picture worth a thousand words? A Deep Multi-Modal Fusion Architecture for Product Classification in e-commerce

Authors:Tom Zahavy, Alessandro Magnani, Abhinandan Krishnan, Shie Mannor

View PDF

Abstract:Classifying products into categories precisely and efficiently is a major challenge in modern e-commerce. The high traffic of new products uploaded daily and the dynamic nature of the categories raise the need for machine learning models that can reduce the cost and time of human editors. In this paper, we propose a decision level fusion approach for multi-modal product classification using text and image inputs. We train input specific state-of-the-art deep neural networks for each input source, show the potential of forging them together into a multi-modal architecture and train a novel policy network that learns to choose between them. Finally, we demonstrate that our multi-modal network improves the top-1 accuracy % over both networks on a real-world large-scale product classification dataset that we collected this http URL. While we focus on image-text fusion that characterizes e-commerce domains, our algorithms can be easily applied to other modalities such as audio, video, physical sensors, etc.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:1611.09534 [cs.CV]
	(or arXiv:1611.09534v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1611.09534

Submission history

From: Tom Zahavy [view email]
[v1] Tue, 29 Nov 2016 09:05:11 UTC (1,490 KB)

Full-text links:

Access Paper:

view license

Current browse context:

cs.CV

< prev | next >

new | recent | 2016-11

Change to browse by:

cs
cs.CL

References & Citations

DBLP - CS Bibliography

listing | bibtex

Tom Zahavy
Alessandro Magnani
Abhinandan Krishnan
Shie Mannor

export BibTeX citation

Computer Science > Computer Vision and Pattern Recognition

Title:Is a picture worth a thousand words? A Deep Multi-Modal Fusion Architecture for Product Classification in e-commerce

Submission history

Access Paper:

References & Citations

DBLP - CS Bibliography

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:Is a picture worth a thousand words? A Deep Multi-Modal Fusion Architecture for Product Classification in e-commerce

Submission history

Access Paper:

References & Citations

DBLP - CS Bibliography

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators