A quick test of how various LLMs answer an enumeration-style problem

Last updated: noon, January 17, 2025

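The problem

While browsing the linux.do forum today, I came across this problem:

Xiao Li wrote a four-digit number A on a piece of paper, moved the units digit of A to the front to get another four-digit number B, and finally added A and B to get the sum C, which is also a four-digit number.

The digits of A sum to 20, and the hundreds and tens digits of C are 0 and 4 respectively.

What is the four-digit number C?

The correct answer is 8041.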
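Since the search space is only the four-digit numbers, the problem can be settled by plain enumeration. The following is a minimal sketch of such a brute-force check, not the code any of the models produced; the function names `rotate_last_digit_to_front` and `find_candidates` are made up for this illustration.

```python
def rotate_last_digit_to_front(n: int) -> int:
    """Move the units digit of a four-digit number to the front."""
    return (n % 10) * 1000 + n // 10


def find_candidates() -> list[tuple[int, int, int]]:
    """Return every (A, B, C) that satisfies the conditions of the problem."""
    results = []
    for a in range(1000, 10000):
        if a % 10 == 0:
            continue  # B would start with 0 and would not be a four-digit number
        if sum(int(ch) for ch in str(a)) != 20:
            continue  # the digits of A must sum to 20
        b = rotate_last_digit_to_front(a)
        c = a + b
        if c > 9999:
            continue  # C must also be a four-digit number
        if (c // 100) % 10 == 0 and (c // 10) % 10 == 4:
            results.append((a, b, c))
    return results


candidates = find_candidates()
for a, b, c in candidates:
    print(a, b, c)
# Distinct values of C among all solutions
print(sorted({c for _, _, c in candidates}))
```

Several values of A satisfy the conditions (for example 5492 + 2549), but they all produce the same sum, so the answer for C is well defined.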

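GPT has been noticeably dumbed down over the past few days, so I was already comparing Claude, ChatGPT and Gemini Pro. I put this question to each model roughly 7 or 8 times to see how they would answer.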

img

OpenAI GPT-4o

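GPT-4o on the official site

PS: I was on an Oracle IP, and the measured difficulty value was 4 digits (ref: https://linux.do/t/topic/261968)

I asked about 10 times. In roughly 7 of them it wrote correct Python code, ran it, and got the right result.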

img

The other three were code-free replies like this one, and the answer was most likely wrong.

img

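GPT-4o via a reverse-proxy platform

This is a shared-subscription platform I had been using for a while. It presumably runs its own mirror site that reverse-proxies the official site through multiple Pro accounts. The telltale sign is heavy traffic to GPT from a fixed IP. OpenAI is currently cracking down hard on this kind of reverse-proxy mirror, and the one I used has shut down as a result, so I ran this question through it once more during its last few days.

  1. On the first attempt it did not solve the problem with a program at all and gave a wrong answer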

img

  2. After a few more attempts, it barely managed to give the correct result after writing five incorrect programs

img

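Claude 3.5 Sonnet

I asked Claude 3.5 Sonnet the same question more than five times. Every time it gave the answer and an analysis using correct JS code, and unlike ChatGPT it also proved that the answer is unique.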
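The uniqueness claim can also be checked by hand with a column-by-column carry analysis. The following is a sketch of one such argument, not a reproduction of Claude's output; the digit names $a, b, c, d$, the carries $k_1, k_2, k_3$, and the unknown digits $t, u$ of C are notation introduced here.

```latex
\begin{aligned}
&A = \overline{abcd},\quad B = \overline{dabc},\quad C = \overline{t\,0\,4\,u},
  \quad a, d \ge 1,\quad a + b + c + d = 20.\\[4pt]
&\text{Units:}     && c + d = u + 10k_1,\\
&\text{Tens:}      && b + c + k_1 = 4 + 10k_2,\\
&\text{Hundreds:}  && a + b + k_2 = 0 + 10k_3,\\
&\text{Thousands:} && a + d + k_3 = t \le 9.\\[4pt]
&a \ge 1 \;\Rightarrow\; a + b + k_2 \ge 1 \;\Rightarrow\; k_3 = 1,\quad a + b = 10 - k_2.\\
&c + d = 20 - (a + b) = 10 + k_2 \;\Rightarrow\; k_1 = 1,\quad u = k_2.\\
&b + c = 3 + 10k_2 \;\Rightarrow\; a + d = 20 - (b + c) = 17 - 10k_2.\\
&t = a + d + 1 \le 9 \;\Rightarrow\; k_2 = 1,\quad a + d = 7,\quad t = 8,\quad u = 1.\\[4pt]
&\therefore\; C = 8041.
\end{aligned}
```

The case $k_2 = 0$ would give $a + d = 17$ and a five-digit sum, which is why only $k_2 = 1$ survives.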

img

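Gemini 1.5 and Gemini-exp-1206

Although it sits at rank 1 on Chatbot Arena, Gemini performed the worst of the three models on this problem:

  1. When asked the original question directly, with or without the code execution option enabled, the Gemini models never wrote and executed code. Instead they tried to find the answer through lengthy textual analysis. Gemini's Chinese support is also rather poor: unless the prompt says so explicitly, it may well answer in English even when the question is in Chinese: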

img

Below is the model's raw output (Gemini 1.5 Pro and exp-1206 were essentially the same):

Here's how we can solve this problem step-by-step:

Understanding the Problem

We have a four-digit number, A.

Moving the last digit of A to the front creates a new four-digit number, B.

The sum of A and B is a four-digit number, C.

The sum of the digits in A is 20.

The hundreds digit of C is 0, and the tens digit of C is 4.

Let's use variables:

Let the digits of A be represented as a, b, c, and d, where a is the thousands digit, b is the hundreds digit, c is the tens digit, and d is the units digit.

So, A = 1000*a + 100*b + 10*c + d

And B = 1000*d + 100*a + 10*b + c

Formulating Equations

Sum of digits of A: a + b + c + d = 20

Structure of C: C = 1000*_ + 100*0 + 10*4 + _ (We don't know the thousands and units digits of C yet)

A + B = C:
(1000*a + 100*b + 10*c + d) + (1000*d + 100*a + 10*b + c) = C

Analyzing the Sum (A + B = C)

When we add A and B, let's consider the possible carry-overs:

Units column: d + c results in the units digit of C. There might be a carry-over to the tens column.

Tens column: c + b plus any carry-over from the units column results in 4 (the tens digit of C). There might be a carry-over to the hundreds column.

Hundreds column: b + a plus any carry-over from the tens column results in 0 (the hundreds digit of C). This means there must be a carry-over of 1 to the thousands column (since b + a cannot be negative).

Thousands column: a + d plus the carry-over of 1 from the hundreds column results in the thousands digit of C.

Solving the Equations

Hundreds column: Since b + a plus a carry-over results in 0, and we know there's a carry-over of 1 to the thousands column, we have:
b + a + 1 (carry-over) = 10
Therefore, a + b = 9

Tens column: We know c + b plus a possible carry-over from the units column results in 4. Let's consider two cases:

Case 1: No carry-over from the units column:
c + b = 4
Since a + b = 9, and a + b + c + d = 20, then d = 7.
But if d = 7, then in the units column, d + c (7 + c) would need to result in a number greater than or equal to 10 to have a carry-over. This would make c at least 3, and c + b would be greater than 4, contradicting our assumption of no carry-over. Therefore, this case is not possible.

Case 2: Carry-over of 1 from the units column:
c + b + 1 = 4
Therefore, c + b = 3

From equations a + b = 9 and c + b = 3, and knowing a + b + c + d = 20, we can deduce:
a = 9-b, c = 3-b and 9 + 3 - b + d = 20, then b + d = 8

Units column: Since there's a carry-over of 1 to the tens column, d + c must be greater than or equal to 10. Substituting c = 3-b, we get:
d + 3 - b >= 10
d - b >= 7

Combining d - b >= 7 and d + b = 8, we find that the only possible values are:
d = 8, b = 0

With d = 8 and b = 0, we can find the other values:
a = 9, c = 3

Calculating C:
A = 9038
B = 8903
C = A + B = 17941

However, the problem states that C is a four-digit number. This means our initial assumption that A is a four-digit number was incorrect. Let's adjust our approach.

Revised Approach: A as a Three-Digit Number

Let's assume A is a three-digit number (abc) and B becomes a four-digit number (bca) when the units digit is moved to the front.

A = 100a + 10b + c

B = 1000c + 100a + 10b

A + B = C

a + b + c = 20

C = 1000*_ + 100*0 + 10*4 + _

Analyzing the Sum (A + B = C) - Revised

Units column: c + b results in the units digit of C. There might be a carry-over.

Tens column: b + a plus any carry-over results in 4. There might be a carry-over.

Hundreds column: a + c plus any carry-over results in 0. There must be a carry-over of 1.

Thousands column: The carry-over of 1 is the thousands digit of C.

Solving the Equations - Revised

Hundreds column: a + c + 1 (carry-over) = 10, so a + c = 9.

Tens column:

Case 1: No carry-over from the units column: a + b = 4. But since a + b + c = 20, this would mean c = 16, which is not possible.

Case 2: Carry-over of 1 from the units column: a + b + 1 = 4, so a + b = 3.

Combining a + b = 3 and a + c = 9 with a + b + c = 20, we get:
3 + c = 20
Therefore c = 17, still not possible.

The issue is that if C is a 4 digit number, then A cannot be a 3 digit number, because the only way to have a carry of 1 into the thousands column is if A is a 4 digit number as well.

Let's try one more approach: a, b, c, d must not all be single digits

We know a+b = 9, c+b=3, and a+b+c+d=20. If b=0, then a=9, c=3, and d=8, which gives us C as 17941. This means that at least one of a, b, c, d must be greater than 9.

Since b is in both a+b=9 and c+b=3, we know that if b is a two-digit number, these equations cannot both be true. Therefore, b must be a single digit.

If c is a two digit number, then the addition of c+b in the tens column would need to be 4, but there is also a carry-over of 1 from the units column. This means that c+b=13, and if c=10, b=3, a=-1, d=8. This also doesn't work.

If d is a two-digit number, then when added to c, there must be a carry-over of 1. If d=10, then 10+c >= 10, which is true for any single-digit c. However, we know c=3-b, and if c is negative, then a must be greater than 9 to compensate, which we have already tested.

If c=0, then b=3, a=6, d=11. This gives us 6300+11, 360+11, and 6300+360+22 = 6682.

Answer
C represents the number 6682.
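  2. After adding "give the answer by running a program" to the prompt, the reply became much more concise, and it was also able to find the answer by writing and executing a Python program: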

img

However, after asking a few more times, I found that it does not succeed every time:

Gemini: wrote code that failed to run, twice..
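The numbers in the raw text-only output quoted earlier in this section are easy to check mechanically. The sketch below tests the intermediate candidate A = 9038 and the final answer 6682 from that output against the conditions on C; the helper names are hypothetical and exist only for this illustration. One apparent source of the trouble is that the quoted analysis assumes a carry into the hundreds column while also setting the tens-column sum to exactly 4, which leaves no carry to pass on.

```python
def rotate_last_digit_to_front(n: int) -> int:
    """Move the units digit of a four-digit number to the front."""
    return (n % 10) * 1000 + n // 10


def c_matches_pattern(c: int) -> bool:
    """C must have four digits, with hundreds digit 0 and tens digit 4."""
    return 1000 <= c <= 9999 and (c // 100) % 10 == 0 and (c // 10) % 10 == 4


# Intermediate candidate from the quoted output: the sum has five digits.
a = 9038
b = rotate_last_digit_to_front(a)
print(a, b, a + b, c_matches_pattern(a + b))  # 9038 8903 17941 False

# Final answer from the quoted output versus the correct answer.
print(c_matches_pattern(6682))  # False: hundreds digit is 6, tens digit is 8
print(c_matches_pattern(8041))  # True
```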

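Summary

As far as the results on this one problem go:

  1. Claude 3.5 Sonnet is still the fastest, most stable and most accurate model for coding and math tasks.
  2. OpenAI GPT clearly has the ability to match Claude, but its output is very unstable, probably because it is too demanding about the client IP. Even on the same IP it can be randomly dumbed down.
  3. Gemini ranks very high at the moment, but it still feels like it does not live up to that ranking…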

A quick test of how various LLMs answer an enumeration-style problem
https://moreality.net/posts/20569/
Author
Moreality
Published
December 9, 2024
License