Exploring the Strengths and Weaknesses of GPT-4 for Maths Problems
Discover the main strength of GPT-4, which is that it can do maths at all, and learn about the fun of making a small contribution to a paper exploring its strengths and weaknesses when it does maths.
Timothy Gowers @wtgowers@mathstodon.xyz
Mathematician. Holder of the Chair of Combinatorics at the Collège de France. Also Fellow of Trinity College, Cambridge.

It was fun to make a small contribution to this paper: basically I got to play with GPT-4 with a view to trying to understand better its strengths and weaknesses when it does maths. 🧵 1/ https://t.co/HWZIvuAXNI
— Timothy Gowers @wtgowers@mathstodon.xyz (@wtgowers) June 5, 2023 
I'd say that its main strength is that it can do maths at all. It's easy to lose sight of how extraordinary it is that one can feed it natural-language input and get out sensible answers to a lot of maths problems. 2/
That said, it has weaknesses, several of which have been commented on many times: it often gets basic calculations wrong, it has a tendency to hallucinate, it doesn't notice when it is being inconsistent, and so on. 3/
I mainly tested it on problems where you have to construct an object with certain properties. I chose these because it is quite easy to invent simple ones that are also slightly artificial and therefore unlikely to match closely what GPT-4 has seen in its training data. 4/
An interesting weakness is that GPT-4 likes to guess the answer and then verify that it works, even in situations where a human mathematician would need to do some reasoning *before* making a guess. In such situations, it often backs up an incorrect guess with bogus proofs. 5/
This is presumably because it is trained on mathematical texts, and mathematicians like the rabbit-out-of-a-hat style: we ask the reader to consider a certain object that we have magicked out of thin air, and then we show that, amazingly, it has all the required properties. 6/
This style is not helpful to humans, and it is not helpful to GPT-4. I'd love to know how much better GPT-4 would be at maths if we had a large corpus of how-I-discovered-this-argument literature. I'd guess quite a lot better. 7/
What I've just been talking about is not the main part of the paper. That is about evaluating LLMs in a more detailed way than simply measuring how many problems they manage to solve from various datasets. It is led by the very talented @katie_m_collins and @AlbertQJiang. 8/8