Science

Brief use of AI answers reduces later problem-solving performance

Controlled experiments show higher skip rates once GPT-5 is removed; direct-answer users pay the steepest penalty

While the AI was available, the AI group (orange) nailed nearly every fraction problem. Once it was pulled for the final three test problems, their solve rate dipped below the control group (green) and their skip rate shot up. | Image: Liu et al.

Experiment 2 replicated the effect with tighter methodology. The AI group again led during the learning phase but fell behind in the unassisted test. Skip rates were roughly even on average. | Image: Liu et al.

Broken down by usage style: all groups started out comparable (a). On the unassisted test, the "direct-answer" users did worst and skipped most often, while people who ignored the AI entirely posted the highest solve rates (b). Only the direct-answer group also performed worse than their own pre-test (c). | Image: Liu et al.

Experiment 3 applied the design to SAT reading passages. The pattern repeats: after the AI is removed, the AI group's solve rate falls well below the control group, and they skip more often. | Image: Liu et al.

After just 10 to 15 minutes of using an AI system as an “answer machine,” people become worse at solving similar problems when the tool is taken away. That is the central result of a controlled study, reported by The Decoder, in which researchers at US and UK universities used fraction problems to test basic reasoning and persistence.

In the first experiment, participants worked through 15 fraction problems ranging in difficulty from one-step to three-step. One group had access to GPT-5 in a sidebar that was preloaded with each question and its solution; the control group had no tools. For the first 12 problems, the AI group could obtain correct answers with minimal effort—down to typing “Answer?”—and their accuracy reflected that. Then, without warning, the researchers removed the AI for the final three test problems, which were identical for both groups.
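The report does not reproduce the study's actual items, but the difficulty gradient is easy to illustrate. A one-step problem needs a single operation, while a three-step problem chains several; the sketch below uses Python's fractions module to show the difference (the specific problems here are an invented illustration, not the researchers' items).

```python
from fractions import Fraction

# Hypothetical one-step problem: a single operation.
# "What is 3/4 - 1/4?"
one_step = Fraction(3, 4) - Fraction(1, 4)   # -> 1/2

# Hypothetical three-step problem: three chained operations.
# "Add 1/2 and 1/3, multiply the result by 3/5, then subtract 1/4."
step1 = Fraction(1, 2) + Fraction(1, 3)      # common denominator -> 5/6
step2 = step1 * Fraction(3, 5)               # -> 1/2
three_step = step2 - Fraction(1, 4)          # -> 1/4

print(one_step, three_step)                  # 1/2 1/4
```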

On those unassisted test items, the former AI users solved fewer problems correctly than the control group and skipped nearly twice as often, according to the report. The study treats skipping as a proxy for persistence because there was no penalty for wrong answers and no pay incentive tied to performance. In other words, the experiment was designed so participants had little reason to game the outcome; quitting early becomes a behavioural signal rather than a strategic choice.
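As a rough sketch of that measure (an illustration, not the study's analysis code, which the report does not show), the skip rate is simply the share of test items a group left unanswered:

```python
# Illustrative skip-rate computation. The data here are invented;
# each participant's test-phase log is one entry per problem:
# "correct", "wrong", or "skip".
logs = {
    "ai_group":      [["correct", "skip", "skip"], ["wrong", "correct", "skip"]],
    "control_group": [["correct", "wrong", "correct"], ["skip", "correct", "wrong"]],
}

def skip_rate(trials):
    """Fraction of problems skipped across all participants in a group."""
    flat = [outcome for participant in trials for outcome in participant]
    return flat.count("skip") / len(flat)

for group, trials in logs.items():
    print(group, round(skip_rate(trials), 2))
```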

A second experiment tightened the design after a weakness in the first run: weaker participants in the AI group could still appear “successful” during the learning phase by submitting AI-derived answers, potentially skewing comparisons. The follow-up added a pre-test of simple fraction problems and gave the control group a sidebar with pre-test solutions to match the interface. The pattern held: the AI group performed better while assistance was available, then underperformed once it was removed.

The study also distinguishes between how participants used the tool. About 61% of AI users reported primarily asking for direct answers; roughly a quarter used it for hints or explanations; the rest barely used it. Baseline ability and motivation looked similar across these subgroups on the pre-test. But after AI access was cut off, the “direct-answer” users performed worst and skipped the most, while participants who ignored the AI entirely posted the highest solve rates—even above the control group.
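A minimal sketch of that subgroup comparison, assuming per-participant records with a self-reported usage style and pre/post solve rates (the field names and numbers are invented for illustration; only the three usage styles come from the article):

```python
from collections import defaultdict

# Hypothetical per-participant records. Usage labels follow the article's
# three styles; the solve rates are made up for illustration only.
participants = [
    {"style": "direct_answer", "pre": 0.60, "post": 0.40},
    {"style": "direct_answer", "pre": 0.70, "post": 0.50},
    {"style": "hints",         "pre": 0.60, "post": 0.60},
    {"style": "ignored_ai",    "pre": 0.65, "post": 0.80},
]

by_style = defaultdict(list)
for p in participants:
    by_style[p["style"]].append(p)

for style, group in by_style.items():
    pre = sum(p["pre"] for p in group) / len(group)
    post = sum(p["post"] for p in group) / len(group)
    # The study's key contrast: only direct-answer users fell below
    # their own pre-test baseline.
    print(f"{style}: pre={pre:.2f} post={post:.2f} change={post - pre:+.2f}")
```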

That split matters because it maps onto the product direction of many consumer AI tools: systems are marketed as frictionless substitutes for thinking, with interfaces optimised for instant output rather than for structured practice. The experiment suggests the short-term productivity gain comes with a measurable tradeoff that shows up immediately when the crutch is removed, especially for users who treat the model as an oracle rather than as a tutor.

The researchers describe their work as the first large-scale causal evidence from controlled experiments, contrasting it with earlier survey-based findings. If the result generalises beyond fractions, it raises a practical problem for schools and employers: tools that improve today’s throughput may quietly reduce tomorrow’s competence, and the decline may be concentrated among exactly the users most attracted to one-click answers.

The AI group solved almost every problem while GPT-5 was available. On the same type of questions, minutes later and without the tool, they solved fewer than the people who never had it.