Exploring the Strengths and Weaknesses of GPT-4 for Maths Problems
Timothy Gowers @wtgowers@mathstodon.xyz
Mathematician. Holder of the Chair of Combinatorics at the Collège de France. Also a Fellow of Trinity College, Cambridge.
It was fun to make a small contribution to this paper -- basically I got to play with GPT-4 with a view to trying to understand better its strengths and weaknesses when it does maths. 🧵 1/ https://t.co/HWZIvuAXNI
— Timothy Gowers @wtgowers@mathstodon.xyz (@wtgowers) June 5, 2023
I'd say that its main strength is that it can do maths at all. It's easy to lose sight of how extraordinary it is that one can feed it natural-language input and get out sensible answers to a lot of maths problems. 2/
That said, it has weaknesses, several of which have been commented on many times: it often gets basic calculations wrong, it has a tendency to hallucinate, it doesn't notice when it is being inconsistent, and so on. 3/
I mainly tested it on problems where you have to construct an object with certain properties. I chose these because it is quite easy to invent simple ones that are also slightly artificial and therefore unlikely to match closely what GPT-4 has seen in its training data. 4/
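A hypothetical example of the kind of construction problem meant here (my own invention, not one from the paper): find three distinct positive integers whose pairwise sums are all perfect squares. It is easy to state, slightly artificial, and easy to verify once a candidate is produced. A brute-force search sketch:

```python
# Illustrative only: find three distinct positive integers a < b < c
# such that a+b, a+c, and b+c are all perfect squares.
from itertools import combinations
from math import isqrt


def is_square(n: int) -> bool:
    """Return True if n is a perfect square."""
    r = isqrt(n)
    return r * r == n


def find_square_sum_triple(limit: int = 50):
    """Brute-force search for the first triple (a, b, c), a < b < c < limit,
    whose three pairwise sums are all perfect squares."""
    for a, b, c in combinations(range(1, limit), 3):
        if is_square(a + b) and is_square(a + c) and is_square(b + c):
            return a, b, c
    return None


print(find_square_sum_triple())  # → (2, 34, 47): sums 36, 49, 81
```

A human can verify the answer instantly, but finding it requires some search or reasoning first, which is exactly the shape of problem where guessing before reasoning tends to go wrong.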
An interesting weakness is that GPT-4 likes to guess the answer and then verify that it works, even in situations where a human mathematician would need to do some reasoning *before* making a guess. In such situations, it often backs up an incorrect guess with bogus proofs. 5/
This is presumably because it is trained on mathematical texts, and mathematicians like the rabbit-out-of-hat style: we ask the reader to consider a certain object that we have magicked out of thin air, and then we show that, amazingly, it has all the required properties. 6/
This style is not helpful to humans, and it is not helpful to GPT-4. I'd love to know how much better GPT-4 would be at maths if we had a large corpus of how-I-discovered-this-argument literature. I'd guess quite a lot better. 7/
What I've just been talking about is not the main part of the paper. That is about evaluating LLMs in a more detailed way than simply measuring how many problems they manage to solve from various datasets. It is led by the very talented @katie_m_collins and @AlbertQJiang. 8/8