2024
pdf
bib
abs
Multimodal Table Understanding
Mingyu Zheng
|
Xinwei Feng
|
Qingyi Si
|
Qiaoqiao She
|
Zheng Lin
|
Wenbin Jiang
|
Weiping Wang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Although great progress has been made by previous table understanding methods including recent approaches based on large language models (LLMs), they rely heavily on the premise that given tables must be converted into a certain text sequence (such as Markdown or HTML) to serve as model input. However, it is difficult to access such high-quality textual table representations in some real-world scenarios, and table images are much more accessible. Therefore, how to directly understand tables using intuitive visual information is a crucial and urgent challenge for developing more practical applications. In this paper, we propose a new problem, multimodal table understanding, where the model needs to generate correct responses to various table-related requests based on the given table image. To facilitate both the model training and evaluation, we construct a large-scale dataset named MMTab, which covers a wide spectrum of table images, instructions and tasks. On this basis, we develop Table-LLaVA, a generalist tabular multimodal large language model (MLLM), which significantly outperforms recent open-source MLLM baselines on 23 benchmarks under held-in and held-out settings.
2023
pdf
bib
abs
Chain-of-Thought Reasoning in Tabular Language Models
Mingyu Zheng
|
Hao Yang
|
Wenbin Jiang
|
Zheng Lin
|
Yajuan Lyu
|
Qiaoqiao She
|
Weiping Wang
Findings of the Association for Computational Linguistics: EMNLP 2023
Tabular mathematical reasoning task requires models to perform multi-step operations including information look-up and numerical calculation, based on heterogeneous data from tables and questions. Existing solutions tend to extend chain-of-thought (CoT) reasoning into powerful large language models (LLMs) to promote multi-hop mathematical reasoning. However, such LLM-based approaches are not a viable solution in the scenario of privatization deployment or limited resources. To address this problem, we revisit small-scale tabular language models (TaLMs) and extend chain-of-thought reasoning into TaLMs for the first time. Specifically, we propose a novel framework, TaCo, which coordinates two TaLMs responsible for CoT generation and answer inference, respectively. Besides, our framework can be combined with an external calculator to enhance accurate numerical calculation. On the TABMWP dataset, TaCo outperforms the state-of-the-art ChatGPT by 9.55% (82.60%→92.15% in accuracy) with much less parameters (0.8B). The code will be released along with the paper.
pdf
bib
abs
IM-TQA: A Chinese Table Question Answering Dataset with Implicit and Multi-type Table Structures
Mingyu Zheng
|
Yang Hao
|
Wenbin Jiang
|
Zheng Lin
|
Yajuan Lyu
|
QiaoQiao She
|
Weiping Wang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Various datasets have been proposed to promote the development of Table Question Answering (TQA) technique. However, the problem setting of existing TQA benchmarks suffers from two limitations. First, they directly provide models with explicit table structures where row headers and column headers of the table are explicitly annotated and treated as model input during inference. Second, they only consider tables of limited types and ignore other tables especially complex tables with flexible header locations. Such simplified problem setting cannot cover practical scenarios where models need to process tables without header annotations in the inference phase or tables of different types. To address above issues, we construct a new TQA dataset with implicit and multi-type table structures, named IM-TQA, which not only requires the model to understand tables without directly available header annotations but also to handle multi-type tables including previously neglected complex tables. We investigate the performance of recent methods on our dataset and find that existing methods struggle in processing implicit and multi-type table structures. Correspondingly, we propose an RGCN-RCI framework outperforming recent baselines. We will release our dataset to facilitate future research.
2022
pdf
bib
abs
Language Prior Is Not the Only Shortcut: A Benchmark for Shortcut Learning in VQA
Qingyi Si
|
Fandong Meng
|
Mingyu Zheng
|
Zheng Lin
|
Yuanxin Liu
|
Peng Fu
|
Yanan Cao
|
Weiping Wang
|
Jie Zhou
Findings of the Association for Computational Linguistics: EMNLP 2022
Visual Question Answering (VQA) models are prone to learn the shortcut solution formed by dataset biases rather than the intended solution. To evaluate the VQA models’ reasoning ability beyond shortcut learning, the VQA-CP v2 dataset introduces a distribution shift between the training and test set given a question type. In this way, the model cannot use the training set shortcut (from question type to answer) to perform well on the test set. However, VQA-CP v2 only considers one type of shortcut and thus still cannot guarantee that the model relies on the intended solution rather than a solution specific to this shortcut. To overcome this limitation, we propose a new dataset that considers varying types of shortcuts by constructing different distribution shifts in multiple OOD test sets. In addition, we overcome the three troubling practices in the use of VQA-CP v2, e.g., selecting models using OOD test sets, and further standardize OOD evaluation procedure. Our benchmark provides a more rigorous and comprehensive testbed for shortcut learning in VQA. We benchmark recent methods and find that methods specifically designed for particular shortcuts fail to simultaneously generalize to our varying OOD test sets. We also systematically study the varying shortcuts and provide several valuable findings, which may promote the exploration of shortcut learning in VQA.