Google Finds Prompt Duplication Boosts Accuracy
Researchers at Google found that simply repeating a prompt within the same context window can raise LLM accuracy by up to 76 percentage points on some tasks. The technique reportedly works without increasing the number of generated tokens or latency, and it requires no additional fine-tuning, although the input itself becomes longer. The finding also offers insight into how attention in causal transformer models processes input sequences.
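
As a rough illustration of the idea, the sketch below builds a repeated prompt before it is sent to a model. The helper name `repeat_prompt`, the `copies` parameter, and the blank-line separator are assumptions made for this example, not details taken from the paper; the sample question is likewise invented.

```python
def repeat_prompt(prompt: str, copies: int = 2, separator: str = "\n\n") -> str:
    """Return `prompt` repeated `copies` times, joined by `separator`.

    Hypothetical helper: the separator and default of two copies are
    assumptions for illustration, not the paper's exact template.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return separator.join([prompt] * copies)


# Example question, invented for this sketch.
question = "Which planet is known as the Red Planet?"
doubled = repeat_prompt(question)

# The doubled string is what would be passed to the model as its input.
print(doubled)
```

Because only the input changes, a helper like this can sit in front of any existing model call without touching the output format.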
- The finding comes from a Google Research paper titled "Prompt Repetition Improves Non-Reasoning LLMs" by Yaniv Leviathan, Matan Kalman, and Yossi Matias.
- The technique was tested on seven models, including Gemini 2.0 Flash, GPT-4o, Claude 3.7 Sonnet, and DeepSeek V3, across benchmarks such as ARC, OpenBookQA, and MMLU-Pro.
- The method is most effective when the model answers without reasoning; when chain-of-thought prompting is enabled, the gains are minimal because the model's own reasoning already tends to restate or reprocess the query.
- The performance boost comes from the parallelizable pre-fill stage, where the model processes the input. In the second copy, each prompt token can attend to every token of the first copy, including tokens that originally came after it, which mitigates the ordering effect inherent in causal language models.
- On a custom "NameIndex" task requiring the model to identify the 25th name in a list of 50, accuracy for one model rose from 21.33% to 97.33% after applying prompt repetition.
- With reasoning turned off, prompt repetition won in 47 out of 70 benchmark-model combinations, with zero significant losses.
- The researchers note that including the prompt twice is a reliable default, though three copies can yield better results on certain long-context tasks at the cost of more input tokens.
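
The ordering effect can be seen in a toy model of causal attention. The sketch below is not the paper's code: it assumes a plain causal mask in which the token at position q can attend only to positions 0 through q, and the function name `visible_prompt_positions` is made up for this example. It reports, for each token of the last prompt copy, which positions of the original prompt are visible to it.

```python
def visible_prompt_positions(n_tokens: int, copies: int) -> list[set[int]]:
    """For each token in the LAST copy of the prompt, return the set of
    original prompt positions it can attend to under a causal mask.

    Toy illustration only: positions are indices, not real tokens.
    """
    total = n_tokens * copies
    start = n_tokens * (copies - 1)  # first index of the last copy
    visible = []
    for q in range(start, total):
        # Causal mask: query position q sees key positions 0..q.
        # k % n_tokens maps a key back to its position in the original prompt.
        visible.append({k % n_tokens for k in range(q + 1)})
    return visible


single = visible_prompt_positions(5, copies=1)
doubled_view = visible_prompt_positions(5, copies=2)

print([len(s) for s in single])        # [1, 2, 3, 4, 5]
print([len(s) for s in doubled_view])  # [5, 5, 5, 5, 5]
```

With a single copy, early tokens see only a prefix of the prompt; with two copies, every token of the second copy sees the whole prompt, which matches the explanation given for why repetition helps.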