Mathematics, word problems, common sense, and artificial intelligence

Davis, Ernest

doi:10.1090/bull/1828

Bull. Amer. Math. Soc. 61 (2024), 287-303

Published electronically: February 15, 2024

Abstract:

This paper discusses the capacities and limitations of current artificial intelligence (AI) technology in solving word problems that combine elementary mathematics with commonsense reasoning. No existing AI system can solve such problems reliably. We review three approaches that have been developed using AI natural language technology: outputting the answer directly, outputting a computer program that solves the problem, and outputting a formalized representation that can be input to an automated theorem verifier. We review some benchmarks that have been developed to evaluate these systems, along with some experimental studies. We discuss the limitations of the existing technology in solving these kinds of problems. We argue that it is not clear whether these limitations will be important in developing AI technology for pure mathematical research, but that they will be important in applications of mathematics and may well be important in developing programs capable of reading and understanding mathematical content written by humans.
