ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates

05.03.2025

Code: https://github.com/Gen-Verse/ReasonFlux

We show that hierarchical LLM reasoning via scaling thought templates can effectively optimize the reasoning search space and outperform the mathematical reasoning capabilities of powerful LLMs like OpenAI o1-preview and DeepSeek-V3. We train our ReasonFlux-32B model with only 8 GPUs and introduce three innovations: (i) a structured and generic thought template library, containing around 500 high-level thought templates capable of generalizing to similar or relevant reasoning problems; (ii) hierarchical reinforcement learning on a sequence of thought templates instead of original long CoT data, optimizing a base LLM to plan out an optimal template trajectory for gradually handling complex problems; and (iii) a brand-new inference scaling system that enables hierarchical LLM reasoning by adaptively scaling thought templates at inference time. With a template trajectory containing sequential thought templates, our ReasonFlux-32B significantly advances math reasoning capability to the state-of-the-art level. Notably, on the MATH benchmark, it achieves an accuracy of 91.2%, surpassing o1-preview by 6.7%. On the USA Math Olympiad (AIME) benchmark, ReasonFlux-32B solves an average of 56.7% of problems, surpassing o1-preview and DeepSeek-V3 by 27% and 45%, respectively.

| Task | ReasonFlux 32B | DeepSeek V3 | OpenAI o1-preview | OpenAI o1-mini | QwQ 32B-preview | GPT-4o |
| --- | --- | --- | --- | --- | --- | --- |
| MATH | 91.2 | 90.2 | 85.5 | 90.0 | 90.6 | 76.6 |
| AIME 2024 | 56.7 | 39.2 | 44.6 | 56.7 | 50.0 | 9.3 |
| Olympiad Bench | 63.3 | 55.4 | – | 65.3 | 61.2 | 43.3 |
| Gaokao En 2023 | 83.6 | – | 71.4 | 78.4 | 65.3 | 67.5 |
| AMC 2023 | 85.0 | 80.0 | 90.0 | 95.0 | – | 47.5 |

Table 1. Performance Comparison on Various Math Reasoning Benchmarks (Pass@1 Accuracy)

  1. Introduction

Large Language Models (LLMs) have recently achieved remarkable progress, demonstrating exceptional capabilities in tackling complex reasoning tasks and even surpassing human experts in specific domains. For example, models such as OpenAI's o1 (Jaech et al., 2024), Google's Gemini-2.0 (Team et al., 2024), DeepSeek-V3 (Liu et al., 2024b), and Qwen-QwQ (Team, 2024a) are at the forefront of this progress, characterized by their ability to emulate human reasoning through a slower, more deliberate thought process. These models leverage increased inference time to enhance reasoning accuracy. While they have unlocked substantial performance gains, more complex tasks, such as mathematical problem solving in AIME and OlympiadBench (He et al., 2024) and code generation in LiveCodeBench (Jain et al., 2024), still pose significant challenges, as they demand a more fine-grained search through a vast solution space and more delicate thought for each intricate reasoning step.

Subsequent research has focused on enhancing LLMs’ reasoning capabilities on complex problems through inference-time

Figure 1. Training framework for our ReasonFlux. We train with hierarchical reinforcement learning to enable the model to plan out an optimal and generalizable thought template trajectory for an input problem. Our new inference-scaling framework is in Figure 2.

strategies. These strategies can be divided into two categories: deliberate search and reward-model-guided methods. Deliberate search methods, like Tree of Thoughts (ToT) (Yao et al., 2024) and Graph of Thoughts (GoT) (Besta et al., 2024), allow LLMs to explore multiple reasoning paths and self-evaluate choices to find the optimal trajectory. Reward-model-guided methods leverage reward models to assess the quality of reasoning steps. Best-of-N approaches leverage an Outcome Reward Model (ORM) to select the optimal reasoning path among multiple candidates, while Process Reward Models (PRMs) (Lightman et al., 2023; Luo et al., 2024; Wang et al., 2024) guide the model toward promising paths by rewarding high-probability intermediate steps. Building on this, Monte Carlo Tree Search (MCTS) (Zhang et al., 2024a; Qi et al., 2024) employs a fine-grained search, decomposing tasks into simpler steps and using PRMs to guide action selection within a tree-based search space. However, these methods often incur high computational costs, especially with numerous reasoning steps or vast search spaces, primarily due to the inherent randomness of sampling, which hinders the efficient identification of the optimal reasoning trajectory. Furthermore, they rely on manually designed search strategies and instance- or step-level rewards, limiting their generalization ability to diverse and complex reasoning tasks. Essentially, they struggle to effectively balance the exploration-exploitation trade-off during inference scaling. This highlights the need for a more efficient and generalizable inference scaling approach that enhances reasoning without extensive manual effort, while providing a more principled search strategy.

To achieve more efficient and precise search of reasoning paths, a feasible approach is to utilize Retrieval-Augmented Generation (RAG). The recent Buffer of Thought (BoT) (Yang et al., 2024b) constructs a meta-buffer to store informative, high-level thoughts distilled from various problem-solving processes, adaptively retrieving and instantiating relevant thought templates for each specific task. SuperCorrect (Yang et al., 2024c) further utilizes both high-level and detailed thought templates to enhance the reasoning ability of small LLMs. Despite significant improvements, such template-based reasoning methods may still face challenges when applied to complex reasoning tasks, because complex problems often require the integration of multiple templates or diverse pieces of retrieved information, which current methods struggle to address effectively.

To this end, we introduce ReasonFlux, a novel hierarchical LLM reasoning framework that configures optimal thought template trajectories by automatically retrieving relevant high-level thought templates at inference time, to achieve superior performance on complex reasoning tasks and even outperform the OpenAI o1-preview and o1-mini models. To be more specific, we first construct a structured template library, which contains 500 useful, compacted thought templates for efficient retrieval and adaptation. Instead of optimizing a long CoT trajectory, we perform hierarchical reinforcement learning on a sequence of high-level thought templates, optimizing a base LLM to learn an optimal thought template trajectory from multiple candidates and guiding an inference LLM to solve a series of simpler sub-problems. Third, we develop a new inference scaling system through adaptively scaling thought templates. This hierarchical reasoning paradigm enables ReasonFlux to simplify the search over reasoning paths and enhance reasoning ability for complex problems by dynamically selecting the most appropriate high-level template for each sub-problem. Our automated template scaling allows ReasonFlux to effectively achieve a better exploration-exploitation trade-off, leading to a more robust and efficient problem-solving process. Through these innovations, ReasonFlux offers a more efficient, generalizable, and scalable solution for enhancing the complex reasoning capabilities of LLMs. Finally, we summarize our contributions as follows:

  1. We introduce ReasonFlux (in Figure 1), a hierarchical LLM reasoning framework that significantly enhances complex reasoning capabilities, outperforming SOTA models like o1-preview and DeepSeek-V3 on challenging MATH and AIME benchmarks (in Table 2).
  2. We propose a structured and compact template library with around 500 thought templates curated from challenging mathematical problems. This library facilitates efficient retrieval and adaptation of relevant high-level thought templates for a series of detailed reasoning steps.
  3. We develop hierarchical reinforcement learning on a sequence of high-level thought templates, to enable LLMs to generate an optimal thought template trajectory for a series of simpler sub-problems, effectively simplifying the search space of reasoning paths.
  4. We design a new inference scaling system (in Figure 2) by adaptively scaling thought templates for hierarchical reasoning. This system allows ReasonFlux to dynamically retrieve a series of high-level templates and adaptively perform instantiated reasoning at inference time, achieving a better exploration-exploitation trade-off for robust and efficient problem-solving.

  2. Related Work and Discussions

Learning from Preferences for Language Models Preference learning is critical for aligning Large Language Models (LLMs) with human expectations and perceptions. Initial approaches, building on pre-training and supervised fine-tuning (SFT), employed PPO in Reinforcement Learning from Human/AI Feedback (RLHF/RLAIF) frameworks (Schulman et al., 2017; Christiano et al., 2017; Ouyang et al., 2022; Xie et al., 2024). These approaches typically involve training a reward model on preference pairs and subsequently optimizing the LLM to maximize the learned reward. However, PPO's instability and inefficiency motivated alternative approaches like DPO (Rafailov et al., 2024), which directly optimizes a policy from paired preference data. Subsequent research has addressed various challenges. ORPO (Hong et al., 2024) integrates alignment into SFT, while KTO (Ethayarajh et al., 2024) leverages pointwise data, simplifying the data acquisition process. Other efforts focus on finer-grained optimization, such as Step-DPO (Lai et al., 2024) and Cross-DPO (Yang et al., 2024c), which target intermediate reasoning or reflection steps. SPO (Swamy et al., 2024) employs game-theoretic concepts to address non-transitive preferences, while Multi-turn DPO (Shi et al., 2024) extends optimization to conversations. However, existing methods often rely on instance- or step-level reward units, potentially failing to capture and reward the higher-level cognitive processes inherent in human problem-solving. To this end, we introduce hierarchical RL-based optimization, a novel preference learning approach that encourages the model to configure a series of high-level thought templates that can handle diverse sub-tasks of complex problems, thereby promoting more human-like problem-solving strategies in LLMs.

Retrieval-Augmented Generation for Language Models Retrieval-augmented Language Models (RALMs) have become a powerful approach to mitigating hallucinations and enhancing the factual accuracy of LLMs (Asai et al., 2023; Mialon et al., 2023; Shi et al., 2023; Gao et al., 2023; Zhao et al., 2024). By retrieving relevant documents from a large-scale external knowledge source (Borgeaud et al., 2022) to inform response generation, RALMs have demonstrated superior performance in question-answering, often with fewer parameters than traditional LLMs (Mialon et al., 2023). Their versatility is further evidenced by successful applications across diverse tasks, including multi-modal generation and biomedical applications (Yasunaga et al., 2023; Izacard et al., 2023; Wang et al., 2022; Zhao et al., 2024; Borgeaud et al., 2022; Yang et al., 2023). However, RALMs face challenges in complex reasoning tasks, such as math and code, where retrieving relevant guidelines or templates via standard embedding similarity search proves difficult. While methods like RAFT (Zhang et al., 2024c) have attempted to address this by improving retrieval relevance, their effectiveness decreases as the document…

Figure 2. New inference scaling system based on hierarchical reasoning. We retrieve a series of high-level thought templates for complex problems, and gradually conduct instantiated reasoning for a sequence of sub-problems.

Inference Scaling for LLM Reasoning

The auto-regressive nature of LLMs suggests that solving more complex problems inherently requires generating more tokens. Early work, such as CoT (Wei et al., 2022), used prompting techniques like "Let's think step by step" to break down complex reasoning tasks into simpler sub-problems, thus enhancing reasoning performance. Building on this, ToT (Yao et al., 2024) and GoT (Besta et al., 2024) employed different data structures to expand the reasoning space, allowing LLMs to explore multiple solution paths. Recent research (Wu et al., 2024; Snell et al., 2024) has formalized the concept of inference scaling laws, which examine the trade-offs between generating additional tokens and the use of various inference strategies. For instance, majority voting and best-of-N methods (Wang et al., 2023; Li et al., 2023) generate multiple candidate solutions and select the best one based on answer frequency or a reward model's evaluation, as sketched below. Similarly, approaches using Monte Carlo Tree Search (MCTS) (Zhang et al., 2024b; Liu et al., 2024c; Choi et al., 2023; Zhou et al., 2023) leverage greater search and computation to improve accuracy. To enhance search accuracy, Process Reward Models (PRMs) have been introduced to select high-quality reasoning paths, with studies (Setlur et al., 2024; Snell et al., 2024; Lightman et al., 2023; Luo et al., 2024; Wang et al., 2024) demonstrating their effectiveness, particularly in complex reasoning tasks. More recently, methods like BoT (Yang et al., 2024b) utilize thought templates from past reasoning processes to guide exploration, significantly improving efficiency. However, a deeper understanding of the exploration-exploitation trade-off (Tang et al., 2024; Setlur et al., 2024) for these template-based approaches remains an open challenge. Our work addresses this challenge by scaling a hierarchical template-augmented reasoning paradigm that significantly enhances reasoning accuracy, especially for complex tasks, while strategically balancing exploration and exploitation.
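To make the baseline concrete, here is a minimal sketch of the majority-voting (self-consistency) strategy described above; the `generate` callable, which samples one candidate answer per call, is an illustrative assumption rather than an API from any of the cited works:

```python
from collections import Counter
from typing import Callable, List

def majority_vote(generate: Callable[[str], str], problem: str, n: int = 16) -> str:
    """Sample n candidate answers and return the most frequent one
    (majority voting over independently sampled reasoning paths)."""
    candidates: List[str] = [generate(problem) for _ in range(n)]
    answer, _count = Counter(candidates).most_common(1)[0]
    return answer
```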

  3. ReasonFlux: Scaling Thought Templates for Hierarchical LLM Reasoning

3.1. Constructing Structured Thought Template Library

Inspired by how humans utilize external resources when tackling complex reasoning problems, RAG methods enhance LLMs by enabling them to retrieve information from external sources (Zhao et al., 2024). The recent Buffer of Thought (BoT) (Yang et al., 2024b) creates a buffer of high-level thoughts for LLM reasoning and builds an efficient RAG reasoning system. Despite offering a comprehensive template library for solving similar problems, BoT still faces scalability challenges as the template library grows, much like traditional RAG systems that rely on embedding similarity to search unstructured text corpora.

To address this, our approach focuses on constructing a structured thought template library that enables more precise, targeted retrieval and mitigates scalability challenges. To build this library, we carefully selected a wide and diverse range of challenging mathematical reasoning problems from different sources, ensuring robustness and broad applicability of our template library. We used an LLM to analyze the thought behind each solution, generate concise summaries of problem-solving strategies, and identify common patterns. This process yielded a collection of high-quality, solution-oriented thought templates. Each template $T_i = (T_{\text{nam}}, T_{\text{tag}}, T_{\text{des}}, T_{\text{sco}}, T_a, T_{\text{exa}})$ in the library is structured for efficient retrieval and application, where $T_{\text{nam}}$ is the name (e.g., "$\sqrt{R^2 - x^2}$ Type Trigonometric Substitution"), $T_{\text{tag}}$ is a set of tags for keyword-based retrieval (e.g., {"Trigonometric Substitution", "Irrational Function Optimization"}), $T_{\text{des}}$ is a description of the underlying principle and applicable scenarios, $T_{\text{sco}}$ defines the scope, specifying the problem types it addresses, $T_a$ is a sequence of detailed application steps $\{a_1, a_2, \ldots, a_k\}$, and $T_{\text{exa}}$ is a set of examples demonstrating its application. The entire library $D_{\text{temp}}$ is the set of all such thought templates:

$$ D_{\text{temp}} = \{T_1, T_2, \ldots, T_m\} $$

(1)

where $m$ is the total number of templates. Here we present an illustration of a thought template within our library. For the sake of brevity, some fields in the following example have been simplified. Please refer to Appendix A for more detailed examples.

Example Template

name: $\sqrt{R^2 - x^2}$ Type Trigonometric Substitution
tags: Substitution Method, Trigonometric Substitution, Irrational Function
description: When a radical of the form $\sqrt{R^2 - x^2}$ appears in a problem, and …
scope: Problems involving function optimization or range, especially those involving irrational functions of the form $\sqrt{R^2 - x^2}$; equations or inequalities containing radicals of the form $\sqrt{R^2 - x^2}$; geometric problems related to circles.
application steps: 1. Determine the range: Based on the problem conditions, determine the range of $x$, usually …
… (Steps 2-5 omitted for brevity)
example: … (Examples omitted for brevity)

Efficient retrieval is facilitated by leveraging the metadata associated with each template, specifically the name ($T_{\text{nam}}$) and tags ($T_{\text{tag}}$), enabling quick and accurate searching based on keywords or specific problem characteristics. This structured organization, combined with rich metadata, ensures that the most relevant templates are readily available for any given problem.
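For illustration, a minimal sketch of this template structure and the keyword-based retrieval it supports; the field names mirror the tuple $(T_{\text{nam}}, T_{\text{tag}}, T_{\text{des}}, T_{\text{sco}}, T_a, T_{\text{exa}})$ above, while the overlap-count scoring rule is our own simplifying assumption:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ThoughtTemplate:
    name: str           # T_nam, e.g. "sqrt(R^2 - x^2) Type Trigonometric Substitution"
    tags: List[str]     # T_tag, keywords for retrieval
    description: str    # T_des, underlying principle and applicable scenarios
    scope: str          # T_sco, problem types the template addresses
    steps: List[str]    # T_a, detailed application steps a_1, ..., a_k
    examples: List[str] = field(default_factory=list)  # T_exa

def retrieve(library: List[ThoughtTemplate], keywords: List[str],
             k: int = 3) -> List[ThoughtTemplate]:
    """Rank templates by keyword overlap with their name and tags
    (an illustrative stand-in for the paper's structured retrieval)."""
    def score(t: ThoughtTemplate) -> int:
        text = (t.name + " " + " ".join(t.tags)).lower()
        return sum(kw.lower() in text for kw in keywords)
    return sorted(library, key=score, reverse=True)[:k]
```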

3.2. Hierarchical Reinforcement Learning on Thought Template Trajectory

While our structured template library provides a valuable resource for reasoning, an effective method is needed to utilize this library and select the appropriate templates for handling a given problem. To this end, we perform hierarchical reinforcement learning to train and finally obtain ReasonFlux, which can effectively plan out an optimal thought template trajectory for a problem. We retrieve and configure a sequence of relevant templates from the library, assisting in instantiating the retrieved templates on specific sub-problems. ReasonFlux acts as an experienced navigator, providing an optimal trajectory, denoted as $T_{\text{traj}}$, that enables the inference LLM to instantiate abstract thought templates into concrete sequential problem-solving steps.

Structure-based Finetuning

Our hierarchical RL process begins by leveraging the structured template library $D_{\text{temp}}$ to construct a knowledge-intensive training dataset $D_{\text{train}}$. This dataset comprises diverse examples of template names $T_{\text{nam}}$, their associated tags $T_{\text{tag}}$, detailed descriptions of their underlying principles $T_{\text{des}}$, and a clear delineation of their applicable scopes $T_{\text{sco}}$, represented as tuples $(T_{\text{nam}}, T_{\text{tag}}, T_{\text{des}}, T_{\text{sco}})$ extracted from $D_{\text{temp}}$. We then fine-tune a base LLM, denoted as $\pi$, on this dataset $D_{\text{train}}$. This process equips the model with a foundational understanding of the structure, content, and intended use of each template within the library. The fine-tuning process is driven by the following optimization objective:

$$ L_{\text{struct}} = -\mathbb{E}_{D_{\text{train}}} [\log \pi(T_{\text{des}}, T_{\text{sco}} \mid T_{\text{nam}}, T_{\text{tag}})], $$

(2)

where the objective is to maximize the likelihood of the model generating the correct description $T_{\text{des}}$ and scope $T_{\text{sco}}$ given the template name $T_{\text{nam}}$ and tags $T_{\text{tag}}$. This ensures that the fine-tuned model can effectively associate the identifying information ($T_{\text{nam}}$ and $T_{\text{tag}}$) of a template with its functional aspects ($T_{\text{des}}$ and $T_{\text{sco}}$). After fine-tuning, we denote the resulting model as $\pi_{\text{struct}}$.
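As a concrete illustration of how such training pairs might be assembled from the library, consider the sketch below; the prompt wording is our assumption, since the paper does not specify an exact format, and `ThoughtTemplate` refers to the earlier sketch:

```python
def build_struct_sft_examples(library):
    """Turn (T_nam, T_tag) -> (T_des, T_sco) tuples into prompt/completion
    pairs so the model learns to associate a template's identity (name, tags)
    with its function (description, scope), as in Eq. (2)."""
    examples = []
    for t in library:  # library: List[ThoughtTemplate]
        prompt = (f"Template name: {t.name}\n"
                  f"Tags: {', '.join(t.tags)}\n"
                  "Describe this template's principle and its applicable scope.")
        completion = f"Description: {t.description}\nScope: {t.scope}"
        examples.append({"prompt": prompt, "completion": completion})
    return examples
```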

Preference Learning on Thought Template Trajectory Based on the fine-tuned LLM $\pi_{\text{struct}}$, we can further enhance its ability to plan out a sequence of high-level thought templates (i.e., a thought template trajectory $T_{\text{traj}}$) for an input problem $x$, associating each step with the most relevant template from the library. This is achieved through our preference learning on thought template trajectories. Specifically, as shown in Figure 1, given an input problem $x$, $\pi_{\text{struct}}$ first analyzes and abstracts the problem's conditional information, identifying the core mathematical concepts and relationships involved. Based on this abstract representation, the navigator $\pi_{\text{struct}}$ then configures a trajectory $T_{\text{traj}} = \{s_1, s_2, \ldots, s_n\}$, where each $s_i$ represents a high-level step in the reasoning process, associated with the name of a specific template $T_i$ retrieved from the library that could be used to solve the problem. Each retrieved template $T_i$ is then instantiated with specific details from the input problem $x$ and provides fine-grained guidance to a separate inference LLM, denoted as $\pi_{\text{inf}}$, to solve the problem.

To measure the effectiveness and generalization ability of a given trajectory, we utilize a set of problems $X_{\text{sim}}$ that are similar to the original input problem $x$, including $x$ itself. We then use the instantiated templates along the trajectory $T_{\text{traj}}$ to guide $\pi_{\text{inf}}$ in solving each problem $x_i \in X_{\text{sim}}$. The average accuracy achieved by $\pi_{\text{inf}}$ across these problems serves as the trajectory reward $R(T_{\text{traj}})$. Formally:

$$ R(T_{\text{traj}}) = \frac{1}{|X_{\text{sim}}|} \sum_{x_i \in X_{\text{sim}}} \text{Acc}(\pi_{\text{inf}}(x_i, T_{\text{traj}})) $$

(3)

where $\text{Acc}(\pi_{\text{inf}}(x_i, T_{\text{traj}}))$ represents the accuracy of $\pi_{\text{inf}}$ in solving problem $x_i$ when guided by the trajectory $T_{\text{traj}}$.
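Eq. (3) amounts to a simple average over the similar-problem set; below is a minimal sketch, assuming placeholder callables `solve_with_trajectory` (the inference LLM $\pi_{\text{inf}}$ guided by the instantiated templates) and `is_correct` (an answer checker), neither of which is named in the paper:

```python
def trajectory_reward(similar_problems, trajectory,
                      solve_with_trajectory, is_correct) -> float:
    """R(T_traj): average accuracy of the inference LLM on problems similar
    to the input problem x (including x itself), all guided by the same
    instantiated template trajectory."""
    n_solved = sum(
        is_correct(problem, solve_with_trajectory(problem, trajectory))
        for problem in similar_problems
    )
    return n_solved / len(similar_problems)
```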

This reward signal is then used to construct optimization pairs, enabling us to further refine the navigator $\pi_{\text{struct}}$. To be more specific, for each input problem $x$, we sample multiple different trajectories $T_{\text{traj}}$ and evaluate their quality using the template trajectory reward. We define the loss function for optimizing $\pi_{\text{struct}}$ as follows:

$$ L_{\text{TTR}}(\theta) = -\mathbb{E}_{(x, (T_{\text{traj}}^+, T_{\text{traj}}^-)) \sim D_{\text{pair}}} \left[ \log \sigma \left( \beta \log \frac{\pi_{\theta}(T_{\text{traj}}^+ \mid x)}{\pi_{\text{sft}}(T_{\text{traj}}^+ \mid x)} - \beta \log \frac{\pi_{\theta}(T_{\text{traj}}^- \mid x)}{\pi_{\text{sft}}(T_{\text{traj}}^- \mid x)} \right) \right] $$

(4)

where $D_{\text{pair}}$ is a dataset of optimization pairs. Each pair consists of an input problem $x$ and two trajectories, $T_{\text{traj}}^+$ and $T_{\text{traj}}^-$, where $R(T_{\text{traj}}^+) > R(T_{\text{traj}}^-)$. $\pi_{\theta}$ represents the LLM being optimized with parameters $\theta$, initialized from $\pi_{\text{struct}}$, and $\pi_{\text{sft}}$ is the frozen reference model (i.e., $\pi_{\text{struct}}$ after structure-based fine-tuning).
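Eq. (4) is the standard DPO objective applied at the granularity of whole template trajectories. A PyTorch sketch follows, assuming the summed token log-probabilities of each trajectory under the policy and the frozen reference model have already been computed; the default $\beta$ is an arbitrary placeholder:

```python
import torch
import torch.nn.functional as F

def trajectory_dpo_loss(policy_logp_pos: torch.Tensor, policy_logp_neg: torch.Tensor,
                        ref_logp_pos: torch.Tensor, ref_logp_neg: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """DPO loss over trajectory pairs (T+, T-) with R(T+) > R(T-).
    Each tensor holds one summed sequence log-prob per pair in the batch."""
    margin = beta * ((policy_logp_pos - ref_logp_pos)
                     - (policy_logp_neg - ref_logp_neg))
    return -F.logsigmoid(margin).mean()

# Dummy batch of two trajectory pairs:
loss = trajectory_dpo_loss(torch.tensor([-10.0, -12.0]), torch.tensor([-11.0, -15.0]),
                           torch.tensor([-10.5, -13.0]), torch.tensor([-10.8, -14.0]))
```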

3.3. Inference Scaling with Scaling Thought Templates

After the hierarchical RL process, we refer to the optimized navigator $\pi_{\theta}$ as ReasonFlux. We then further design a novel inference scaling system that leverages automatically planned trajectories and dynamically retrieved thought templates. This system, illustrated in Figure 2, involves a multi-round interplay between ReasonFlux, the structured template library $D_{\text{temp}}$, and a downstream inference LLM $\pi_{\text{inf}}$.

Given an input problem $x$, the first task for ReasonFlux is to analyze and extract the core mathematical concepts and relationships embedded within $x$, producing an abstract representation denoted as $\alpha(x)$. Based on this representation, ReasonFlux configures an optimal template trajectory $T_{\text{traj}}^\ast = \{s_1^\ast, s_2^\ast, \ldots, s_n^\ast\}$. This trajectory is not a rigid, pre-defined path but rather a dynamically generated plan tailored to the specific nuances of the input problem $x$. Each step $s_i^\ast$ within the trajectory is associated with a specific template name $T_{\text{nam}}$ and tags $T_{\text{tag}}$ for efficient retrieval. ReasonFlux then searches and retrieves the most relevant thought templates from the curated thought template library $D_{\text{temp}}$. Formally, the retrieval process can be represented as:

$$ T_{\text{rag}} = \text{ReasonFlux}(\{T_{\text{tag}}^i, T_{\text{nam}}^i\}_{i=1}^{n}, D_{\text{temp}}), $$

(5)

where $T_{\text{rag}} = \{T_1, T_2, \ldots, T_n\}$ is the set of retrieved templates, whose number equals the number of steps $n$ in the configured trajectory, and each element is a structured template.

Subsequently, based on the trajectory $T_{\text{traj}}^\ast$ and the retrieved templates $T_{\text{rag}}$, ReasonFlux instructs $\pi_{\text{inf}}$ to instantiate each step $s_i^\ast$ with the corresponding template $T_i$ and problem-specific details from $x$, transforming it into a concrete instantiated reasoning step $\hat{s}_i$:

$$ \hat{s}_i = \pi_{\text{inf}}(x, s_i^\ast, T_i), $$

(6)

where each $\hat{s}_i$ is generated based on the corresponding $s_i^\ast$, $T_i$, and $x$.

The interaction between ReasonFlux and $\pi_{\text{inf}}$ is not a one-way process but proceeds iteratively. After the instantiated step $\hat{s}_i$ is obtained, it is evaluated and analyzed by ReasonFlux; we represent this adjustment process as $\delta_i = \text{ReasonFlux}(T_{\text{traj}}^\ast, \hat{s}_i)$. Based on this evaluation and analysis, ReasonFlux decides whether to refine the trajectory, potentially adjusting subsequent steps or even retrieving alternative templates. This iterative refinement can be expressed as:

$$ T_{\text{traj}}^\ast \leftarrow \text{ReasonFlux}(T_{\text{traj}}^\ast, \delta_i). $$

(7)

This iterative feedback mechanism between ReasonFlux and $\pi_{\text{inf}}$ underscores a crucial aspect of complex problem-solving: the dynamic interplay between planning and execution. By analyzing intermediate results generated during the reasoning process, ReasonFlux gains valuable insights that inform adjustments to the trajectory. This ability to refine the solution path precisely reflects how humans often uncover more efficient or effective solutions by examining partial results. Furthermore, intermediate steps may reveal previously obscured constraints or opportunities within the problem, allowing for a more informed and targeted approach. The hierarchical nature of ReasonFlux, enabled by this iterative refinement, is therefore crucial for navigating the complexities of challenging reasoning tasks and achieving optimal solutions. In summary, ReasonFlux achieves effective problem solving by dynamically configuring and adjusting the template trajectory based on problem complexity, transcending the limitations of traditional inference methods and offering a more efficient and powerful reasoning framework.
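Putting Eqs. (5)-(7) together, the inference-time interplay can be sketched as the loop below; every callable (`navigator_plan`, `retrieve_template`, `instantiate_step`, `navigator_evaluate`, `navigator_refine`) is a placeholder standing in for a ReasonFlux or $\pi_{\text{inf}}$ call, not an API from the released code:

```python
def reasonflux_inference(problem, navigator_plan, retrieve_template,
                         instantiate_step, navigator_evaluate, navigator_refine):
    """Hierarchical inference: plan a template trajectory, instantiate each
    step with the inference LLM, and let the navigator refine the plan
    based on intermediate results."""
    trajectory = navigator_plan(problem)        # T_traj* = {s_1*, ..., s_n*}
    solution_steps = []
    i = 0
    while i < len(trajectory):
        step = trajectory[i]
        template = retrieve_template(step)      # Eq. (5): lookup by name/tags
        s_hat = instantiate_step(problem, step, template, solution_steps)  # Eq. (6)
        solution_steps.append(s_hat)
        feedback = navigator_evaluate(trajectory, s_hat)     # delta_i
        trajectory = navigator_refine(trajectory, feedback)  # Eq. (7)
        i += 1
    return solution_steps
```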

  4. Experiments

Template Library Construction As described in Section 3.1, we use Gemini-2.0 (Team et al., 2023) to summarize and extract high-level thoughts from the training sets of various math datasets, such as MATH (7.5K samples) (Lightman et al., 2023) and self-curated Chinese high-school competition-level data (2K samples), and construct our structured thought template library (approximately 500 thought templates). We provide some template examples in Appendix A.

Training Details Due to limited GPU resources, we use Qwen2.5-32B-Instruct (Yang et al., 2024a) as the base model and also adopt it as our inference LLM. Our training procedure uses only 8 NVIDIA A100 GPUs, which is very cost-efficient. In the structure-based fine-tuning stage (Section 3.2), we train the initialized $\pi_{\text{struct}}$ on the training dataset $D_{\text{train}}$, containing 15K samples extended from our template library $D_{\text{temp}}$. We conduct this initialization training for 6 epochs using an AdamW optimizer with a cosine learning rate scheduler. In the template trajectory optimization process (Section 3.2), we train ReasonFlux with 10K collected pairwise trajectories from MATH (7.5K) and self-curated Chinese high-school competition-level data (2K), again for 6 epochs using an AdamW optimizer with a cosine learning rate scheduler.
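For reference, the stated settings (6 epochs, AdamW, cosine schedule) map onto a standard Hugging Face `TrainingArguments` configuration like the sketch below; the learning rate and batch sizes are placeholder values we chose for illustration, not numbers reported in the paper:

```python
from transformers import TrainingArguments

# Settings stated in the paper: 6 epochs, AdamW, cosine LR schedule, 8x A100.
args = TrainingArguments(
    output_dir="reasonflux-sft",
    num_train_epochs=6,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    learning_rate=1e-5,                # assumption, not from the paper
    per_device_train_batch_size=1,     # assumption for a 32B model on 8 GPUs
    gradient_accumulation_steps=16,    # assumption
    bf16=True,
)
```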

Evaluation Datasets To evaluate complex reasoning capabilities, we choose a broad set of challenging reasoning benchmarks, including MATH (Lightman et al., 2023), AIME 2024 (AI-MO, 2024a), AMC 2023 (AI-MO, 2024b), OlympiadBench (He et al., 2024), and GaoKao (Chinese College Entrance Exam) En 2023 (Liao et al., 2024). These benchmarks comprehensively evaluate mathematical reasoning capabilities with competition-level and Olympiad-level problems. Since AMC and AIME contain only a limited number of test samples, results on AIME 2024 and AMC 2023 are averaged over 16 runs.

Table 2. Pass@1 accuracy comparison on various mathematical reasoning benchmarks.

| Model | MATH | AIME 2024 | AMC 2023 | Olympiad Bench | Gaokao En 2023 |
| --- | --- | --- | --- | --- | --- |
| *Frontier LLMs* |  |  |  |  |  |
| GPT-4o | 76.6 | 9.3 | 47.5 | 43.3 | 67.5 |
| Claude3.5-Sonnet | 78.3 | 16.0 | – | – | – |
| GPT-o1-preview | 85.5 | 44.6 | 90.0 | – | 71.4 |
| GPT-o1-mini | 90.0 | 56.7 | 95.0 | 65.3 | 78.4 |
| *Open-Sourced Reasoning LLMs* |  |  |  |  |  |
| DeepSeek-Coder-V2-Instruct | 75.3 | 13.3 | 57.5 | 37.6 | 64.7 |
| Mathstral-7B-v0.1 | 57.8 | 0.0 | 37.5 | 21.5 | 46.0 |
| NuminaMath-72B-CoT | 64.0 | 3.3 | 70.0 | 32.6 | 58.4 |
| LLaMA3.1-8B-Instruct | 51.4 | 6.7 | 25.0 | 15.4 | 38.4 |
| LLaMA3.1-70B-Instruct | 65.4 | 23.3 | 50.0 | 27.7 | 54.0 |
| LLaMA3.1-405B-Instruct | 73.8 | – | – | 34.8 | – |
| Qwen2.5-Math-72B-Instruct | 85.6 | 30.0 | 70.0 | 49.0 | 71.9 |
| rStar-Math | 88.2 | 43.3 | 80.0 | 63.1 | 78.2 |
| DeepSeek-V3 | 90.2 | 39.2 | 80.0 | 55.4 | – |
| ReasonFlux-32B | 91.2 | 56.7 | 85.0 | 63.3 | 83.6 |
| *1.5B-Level Base Model* |  |  |  |  |  |
| Qwen2.5-Math-1.5B | 51.2 | 0.0 | 22.5 | 16.7 | 46.5 |
| Qwen2.5-Math-1.5B-Instruct | 60.0 | 10.0 | 60.0 | 38.1 | 65.5 |
| ReasonFlux-1.5B | 70.4 | 20.0 | 72.5 | 49.0 | 76.6 |
| *7B-Level Base Model* |  |  |  |  |  |
| Qwen2.5-Math-7B | 58.8 | 3.3 | 22.5 | 21.8 | 51.7 |
| SuperCorrect-7B | 70.2 | 10.0 | 37.5 | 39.0 | 64.0 |
| Qwen2.5-Math-7B-Instruct | 82.6 | 13.3 | 62.5 | 41.6 | 66.8 |
| ReasonFlux-7B | 88.6 | 36.7 | 80.0 | 54.8 | 80.5 |
| *32B-Level Base Model* |  |  |  |  |  |
| Qwen2.5-32B-Instruct | 79.4 | 16.5 | 64.0 | 45.3 | 72.1 |
| QwQ-32B-preview | 90.6 | 50.0 | 75.0 | – | 65.3 |
| Sky-T1-32B-preview | 86.4 | 43.3 | – | 59.8 | – |
| ReasonFlux-32B | 91.2 | 56.7 | 85.0 | 63.3 | 83.6 |

Baselines

To demonstrate the reasoning ability of ReasonFlux, we compare it with two kinds of strong baselines: (i) frontier LLMs, including GPT-4o, Claude, OpenAI o1-preview, and o1-mini, for which we report performance on our evaluation benchmarks by taking accuracy numbers from public technical reports; and (ii) open-sourced reasoning models, including DeepSeek-Coder-v2-Instruct, Mathstral (Team, 2024b), NuminaMath-72B (Li et al., 2024), LLaMA3.1 (Dubey et al., 2024), Qwen2.5-Math (Yang et al., 2024a), SuperCorrect-7B-Instruct (Yang et al., 2024c), QwQ-32B-Preview (Team, 2024a), rStar-Math (Guan et al., 2025), Sky-T1-32B-Preview (distilled from QwQ-32B-Preview), and DeepSeek-V3 (Liu et al., 2024a). Both kinds of baselines represent the highest level of mathematical reasoning currently available.

4.1. Results on Challenging Reasoning Benchmarks

Table 2 shows the final results of our ReasonFlux with a comprehensive comparison to SOTA reasoning models. We find that ReasonFlux-32B consistently outperforms both frontier LLMs and open-sourced reasoning LLMs on most of the challenging mathematical benchmarks, achieving new SOTA performance with only 32B-level parameters. More specifically, on the MATH benchmark, ReasonFlux achieves 91.2% accuracy, surpassing the frontier reasoning model o1-preview by 6.7% and exceeding current SOTA-level open-source LLMs with only 32B parameters. On the AIME 2024 benchmark, ReasonFlux demonstrates extraordinary reasoning capabilities with 56.7% accuracy, significantly surpassing o1-preview and DeepSeek-V3 by 27% and 45%, respectively, and matching the performance of the proprietary OpenAI o1-mini.

On the AMC 2023 benchmark, our method, ReasonFlux, maintains its position within the top tier of all reasoning LLMs with 85.0% accuracy, significantly outperforming other open-source LLMs while achieving performance comparable to proprietary LLMs. This further validates the effectiveness of our approach in mathematical reasoning and underscores its substantial potential for further development and application. We provide some reasoning details in Section 4.3.

Beyond the above well-known benchmarks, ReasonFlux-32B also demonstrates impressive generalization and effectiveness on other challenging datasets. Notably, it achieves 63.3% accuracy on OlympiadBench, surpassing DeepSeek-V3 by 14%, and 83.6% accuracy on the Chinese College Entrance Mathematics Exam (Gaokao) En 2023, surpassing o1-mini by 7%.

These results are particularly noteworthy because our template library was constructed primarily from publicly available datasets, and the same template library was used consistently across all evaluations. This consistent strong performance across diverse and challenging mathematical reasoning tasks, ranging from competition-level problems to standardized exams, provides compelling evidence for the robust generalization ability and effectiveness of ReasonFlux. It underscores the power of our template-driven approach to capture and apply underlying mathematical principles, regardless of the specific format or context of the problem.

Generalizing to Different Base Models

From Table 2, we also observe that ReasonFlux achieves consistent and significant improvements across all evaluation benchmarks when using different base models as both the navigator and the inference LLM. Notably, ReasonFlux often matches or even surpasses the reasoning accuracy of models at the next parameter level. These phenomena demonstrate both the effectiveness and the generalization ability of our ReasonFlux.

Table 3. Average accuracy of direct reasoning vs. template-augmented reasoning (Section 4.2).

| Model | Direct Reasoning (%) | With Template (%) |
| --- | --- | --- |
| Llama-3.1-8B-Instruct | 47.6 | 75.1 (+27.5) |
| Qwen2.5-7B-Instruct | 59.2 | 82.7 (+23.5) |
| Qwen2.5-Math-7B-Instruct | 66.5 | 88.4 (+21.9) |
| Llama-3.1-70B-Instruct | 67.4 | 91.2 (+23.8) |
| Qwen2.5-32B-Instruct | 69.2 | 94.3 (+25.1) |
| Qwen2.5-Math-32B-Instruct | 71.1 | 95.9 (+24.8) |

4.2. Generalization Ability of Structured Template Library

We present additional experiments on the MATH benchmark designed to evaluate the generalization ability of our structured template library. To this end, we randomly sampled 100 templates from the library, each paired with its corresponding example problem. We then employed o1-preview to generate 50 variant problems for each example. These variants were carefully constructed to ensure they differed from the original examples while still assessing the same underlying knowledge and skills.

We then used these templates as in-context examples to guide different LLMs during inference on the generated variant problems, and compared the average accuracy of this template-augmented reasoning against direct reasoning (i.e., solving the problems without templates). As illustrated in Table 3, our template-augmented approach significantly improves the reasoning accuracy of different base models compared to direct reasoning. This demonstrates the ability of our structured templates to generalize effectively across a range of similar problems, rather than being limited to specific instances. Furthermore, we observed that smaller LLMs, when guided by our templates, were able to outperform larger LLMs employing direct reasoning. This finding underscores the effectiveness and high quality of our structured template library.
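A minimal sketch of how a template can serve as an in-context example in this evaluation; the prompt wording is our own illustration (the paper does not specify an exact format), and `ThoughtTemplate` refers to the sketch in Section 3.1:

```python
def template_augmented_prompt(template, problem: str) -> str:
    """Prepend a retrieved thought template as in-context guidance
    before asking the model to solve a variant problem."""
    return (
        "You are given a problem-solving template.\n"
        f"Template: {template.name}\n"
        f"When to use it: {template.scope}\n"
        "How to apply it:\n" + "\n".join(template.steps) + "\n\n"
        f"Now solve the following problem using this template:\n{problem}"
    )
```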

4.3. Reasoning Flows over Planned Template Trajectory

We showcase detailed examples of our reasoning flows, as depicted in Figure 3, when tackling challenging mathematical problems. Specifically, ReasonFlux begins by carefully observing and analyzing the input problem, engaging in deep thought to explore potential solution pathways. Based on this initial assessment, ReasonFlux configures a dynamic reasoning trajectory, strategically retrieving relevant templates from our structured template library to guide each logical step. Then, ReasonFlux initiates an interactive instruction with the inference LLM, guiding it to follow the prescribed trajectory and execute the reasoning process along it. Crucially, the results obtained from preceding steps are seamlessly integrated as contextual information, informing and conditioning the subsequent steps. Compared to the conventional self-exploration reasoning paradigm, our method consistently improves reasoning accuracy and efficiency.

Problem:

Let $x, y,$ and $z$ be positive real numbers satisfying the system of equations:

$$ \sqrt{2x - xy} + \sqrt{2y - xy} = 1, \quad \sqrt{2y - yz} + \sqrt{2z - yz} = \sqrt{2}, \quad \sqrt{2x - zx} + \sqrt{2z - zx} = \sqrt{3}. $$

Then $[(1 - x)(1 - y)(1 - z)]^2$ can be written as $\frac{m}{n}$, where $m$ and $n$ are relatively prime positive integers. Find $m + n$.

Step 1: Look for Symmetry or Patterns

Procedure: The equations seem symmetric in a cyclic manner. Maybe there’s a relationship or pattern that can help simplify them further.

$$ a = \sqrt{(x - y)(2 - x)}, \quad b = \sqrt{(y - z)(2 - y)}, \quad c = \sqrt{(z - x)(2 - z)}. $$

Step 2: Assume Variables are Equal

Sometimes, assuming variables are equal can simplify the problem. Let's assume $x = y = z$.

If $x = y = z$, then the first equation becomes: $$ \sqrt{2x - x^2} + \sqrt{2x - x^2} = 1 \quad \Rightarrow \quad 2\sqrt{2x - x^2} - 1 = 0. $$

Squaring both sides: $$ 2x - x^2 = \frac{1}{4}. $$

This is a contradiction. Therefore, our assumption that $x = y = z$ is incorrect.

Step 4: Try Expressing Variables in Terms of One Another

Let's try to express one variable in terms of another. From the first equation: $$ \sqrt{(x - y)(2 - x)} = 1. $$

This seems complicated. Maybe I can square both sides to eliminate the square roots.

Squaring both sides: $$ (x - y)(2 - x) = 1. $$

This still looks complicated. Maybe there’s a better approach.

Step 7: Look for Trigonometric Substitutions

Sometimes, trigonometric substitutions can help with equations involving square roots. Let’s see if that’s applicable here.

Alternatively, perhaps I should consider numerical methods or look for specific values that satisfy the equations.

……

ReasonFlux:

Analysis and configuration of the reasoning trajectory

The key to solving it is to determine the conditions for the existence of four distinct intersection points and to find the equation of the circle passing through these four points.

  1. By analyzing the positional relationship between the vertices of the two parabolas, we can initially determine the range of values for the parameter $k$.
  2. By appropriately adding and subtracting the two parabola equations to eliminate the coefficients of the quadratic terms, we can obtain the equation of a circle that passes through the intersection points of these two parabolas.
  3. Based on the radius constraint of the circle given in the problem, we further narrow down the range of $k$. Finally, verifying the boundary values to ensure they satisfy the condition of four intersection points allows us to determine the least and greatest elements of $S$.

Template 1. Preliminarily Determine the Range of $k$: Analyze the vertices of the parabolas and determine the condition for the existence of intersection points.

Instantiation:

Since $x, y, z$ are all real numbers and $(2-x), (2-y), (2-z) > 0$, we have $x, y, z \in (0, 2]$.

Template 2. Trigonometric Substitution: Since $x, y, z \in (0, 2]$ and the expressions under the square roots can be written in the form $\sqrt{(x - y)(2 - x)}$, this matches the trigonometric substitution pattern $\sqrt{\sin^2 \theta}$.

Instantiation:

We let $x = 2\sin^2 \alpha$, $y = 2\sin^2 \beta$, $z = 2\sin^2 \gamma$, where $\alpha, \beta, \gamma \in (0, \pi/2]$.

Template 3. Simplify the equation system: Substitute the trigonometric substitutions into the original system of equations and use the identity $\sin^2 \theta + \cos^2 \theta = 1$ for simplification.

Instantiation:

  1. First, substitute $x$ and $y$: $$ \sqrt{4 \sin^2 \alpha \,(1 - \sin^2 \beta)} + \sqrt{4 \sin^2 \beta \,(1 - \sin^2 \alpha)} = 1 \;\Rightarrow\; \sqrt{4 \sin^2 \alpha \cos^2 \beta} + \sqrt{4 \sin^2 \beta \cos^2 \alpha} = 1 \;\Rightarrow\; \sin(\alpha + \beta) = \frac{1}{2}. $$ Handling the other two equations in the same way gives $$ \sin(\beta + \gamma) = \frac{\sqrt{2}}{2}, \qquad \sin(\alpha + \gamma) = \frac{\sqrt{3}}{2}. $$

Template 4. Solve for ( \alpha, \beta, \gamma ): Based on ( \alpha, \beta, \gamma \in (0, \pi/2] ), determine the possible values of ( \alpha + \beta + \gamma ), and solve for the values of ( \alpha, \beta, \gamma ).

Instantiation: $$ \alpha + \beta = \frac{\pi}{6} \text{ or } \frac{5\pi}{6}, \quad \beta + \gamma = \frac{\pi}{4} \text{ or } \frac{3\pi}{4}, \quad \alpha + \gamma = \frac{\pi}{3} \text{ or } \frac{2\pi}{3} \;\Rightarrow\; \alpha = \frac{\pi}{8}, \; \beta = \frac{\pi}{24}, \; \gamma = \frac{5\pi}{24}. $$

Final step: Calculate $(1-x)(1-y)(1-z)$ and $[(1-x)(1-y)(1-z)]^2$:

Instantiation:

The value of $[(1 - x)(1 - y)(1 - z)]^2$ is $\frac{1}{32}$.

$$ m + n = 1 + 32 = 33. $$

Figure 3. Comparison between o1-mini and ReasonFlux.

