MathFish: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula

Li Lucy, Tal August, Rose E Wang, Luca Soldaini, Courtney Allison, Kyle Lo


Abstract
To ensure that math curriculum is grade-appropriate and aligns with critical skills or concepts in accordance with educational standards, pedagogical experts can spend months carefully reviewing published math problems. Drawing inspiration from this process, our work presents a novel angle for evaluating language models’ (LMs) mathematical abilities, by investigating whether they can discern skills and concepts enabled by math content. We contribute two datasets: one consisting of 385 fine-grained descriptions of K-12 math skills and concepts, or *standards*, from Achieve the Core (*ATC*), and another of 9.9K math problems labeled with these standards (*MathFish*). We develop two tasks for evaluating LMs’ abilities to assess math problems: (1) verifying whether a problem aligns with a given standard, and (2) tagging a problem with all aligned standards. Working with experienced teachers, we find that LMs struggle to tag and verify standards linked to problems, and instead predict labels that are close to ground truth, but differ in subtle ways. We also show that LMs often generate problems that do not fully align with standards described in prompts, suggesting the need for careful scrutiny of use cases involving LMs for generating curricular materials. Finally, we categorize problems in GSM8k using math standards, allowing us to better understand why some problems are more difficult for models to solve than others.
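The two evaluation tasks named in the abstract lend themselves to a simple framing. Below is a minimal, hypothetical Python sketch of how one might pose them to a language model; the prompt wording, the `query_lm` callable, and the yes/no parsing are illustrative assumptions, not the paper's actual prompting setup.

```python
from typing import Callable

# Hypothetical sketch of the two MathFish-style tasks described in the abstract.
# `query_lm` stands in for whichever LM call one uses; the prompt text is illustrative.

def verify_alignment(problem: str, standard: str,
                     query_lm: Callable[[str], str]) -> bool:
    """Task 1: verify whether a math problem aligns with a given standard."""
    prompt = (
        "Does the following math problem address the given standard?\n"
        f"Standard: {standard}\n"
        f"Problem: {problem}\n"
        "Answer 'yes' or 'no'."
    )
    return query_lm(prompt).strip().lower().startswith("yes")

def tag_standards(problem: str, standards: dict[str, str],
                  query_lm: Callable[[str], str]) -> list[str]:
    """Task 2: tag a problem with all standards it aligns with."""
    return [code for code, text in standards.items()
            if verify_alignment(problem, text, query_lm)]
```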
Anthology ID: 2024.findings-emnlp.323
Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 5644–5673
URL: https://aclanthology.org/2024.findings-emnlp.323/
DOI: 10.18653/v1/2024.findings-emnlp.323
Cite (ACL): Li Lucy, Tal August, Rose E Wang, Luca Soldaini, Courtney Allison, and Kyle Lo. 2024. MathFish: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5644–5673, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): MathFish: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula (Lucy et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-emnlp.323.pdf