TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action

Ma, Zixian; Zhang, Jianguo; Liu, Zhiwei; Zhang, Jieyu; Tan, Juntao; Shu, Manli; Niebles, Juan Carlos; Heinecke, Shelby; Wang, Huan; Xiong, Caiming; Krishna, Ranjay; Savarese, Silvio

arXiv:2412.05479v1 (cs)

[Submitted on 7 Dec 2024 (this version), latest version 15 Sep 2025 (v4)]

Abstract: While open-source multi-modal language models perform well on simple question answering tasks, they often fail on complex questions that require multiple capabilities, such as fine-grained recognition, visual grounding, and reasoning, and that demand multi-step solutions. We present TACO, a family of multi-modal large action models designed to improve performance on such complex, multi-step, and multi-modal tasks. During inference, TACO produces chains-of-thought-and-action (CoTA), executes intermediate steps by invoking external tools such as OCR, depth estimation, and a calculator, then integrates both the thoughts and action outputs to produce coherent responses. To train TACO, we create a large dataset of over 1M synthetic CoTA traces generated with GPT-4o and Python programs. We then experiment with various data filtering and mixing techniques and obtain a final subset of 293K high-quality CoTA examples. This dataset enables TACO to learn complex reasoning and action paths, surpassing existing models trained on instruction tuning data with only direct answers. Our model TACO outperforms the instruction-tuned baseline across 8 benchmarks, achieving a 3.6% improvement on average, with gains of up to 15% in MMVet tasks involving OCR, mathematical reasoning, and spatial reasoning. Training on high-quality CoTA traces sets a new standard for complex multi-modal reasoning, highlighting the need for structured, multi-step instruction tuning in advancing open-source multi-modal models' capabilities.
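The inference procedure described in the abstract (emit a thought and an action, run the named tool, feed the result back, repeat until an answer is produced) can be illustrated with a small Python sketch. This is not the authors' implementation: `Step`, `run_cota`, `scripted_policy`, `fake_ocr`, the tool names, and the receipt example are all hypothetical, and the scripted policy stands in for a trained model. The real CoTA trace format is not specified in the abstract.

```python
import ast
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One hypothetical chain-of-thought-and-action step."""
    thought: str
    action: Optional[str]  # tool name, or None when the policy gives its answer
    argument: str = ""     # tool input, or the final answer when action is None
    observation: str = ""  # tool output, filled in by the loop


_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def calculator(expression: str) -> str:
    """Toy calculator tool: supports +, -, *, / on numeric literals only."""
    def ev(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError(f"unsupported expression: {expression!r}")
    return str(ev(ast.parse(expression, mode="eval").body))


def fake_ocr(image_id: str) -> str:
    """Stand-in for an OCR tool: returns canned text for a known image id."""
    return {"receipt.png": "coffee 3.50\nbagel 4.25"}[image_id]


def scripted_policy(question: str, history: List[Step]) -> Step:
    """Stand-in for the model: a fixed script instead of learned generation."""
    if len(history) == 0:
        return Step("I need to read the prices in the image.", "ocr", "receipt.png")
    if len(history) == 1:
        return Step("Now I add the two prices.", "calculator", "3.50 + 4.25")
    return Step("The calculator output is the total.", None, history[-1].observation)


def run_cota(question: str,
             policy: Callable[[str, List[Step]], Step],
             tools: Dict[str, Callable[[str], str]],
             max_steps: int = 5) -> Tuple[str, List[Step]]:
    """Alternate policy steps and tool calls until the policy answers."""
    history: List[Step] = []
    for _ in range(max_steps):
        step = policy(question, history)
        if step.action is None:          # terminal step: argument is the answer
            history.append(step)
            return step.argument, history
        step.observation = tools[step.action](step.argument)
        history.append(step)             # later steps can see this observation
    raise RuntimeError("no final answer within max_steps")


answer, trace = run_cota("What is the total on the receipt?",
                         scripted_policy,
                         {"ocr": fake_ocr, "calculator": calculator})
print(answer)
```

The point of the sketch is the control flow: each tool observation is appended to the history before the next step is generated, so the final response can draw on both the thoughts and the action outputs.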

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2412.05479 [cs.CV]
	(or arXiv:2412.05479v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.05479

Submission history

From: Zixian Ma
[v1] Sat, 7 Dec 2024 00:42:04 UTC (35,735 KB)
[v2] Tue, 10 Dec 2024 07:33:12 UTC (35,852 KB)
[v3] Sun, 15 Jun 2025 05:35:12 UTC (13,145 KB)
[v4] Mon, 15 Sep 2025 07:14:08 UTC (13,146 KB)
