Q* Hypothesis: Enhancing Reasoning, Rewards, and Synthetic Data



Programming note: this counts as next week’s post. I pulled it forward for timeliness, as RLAIF and process reward models were on my hit list anyway. We’ll see if there are more emergency posts.


On Wednesday, right when we were all ready to sign off for Thanksgiving, Reuters reported on OpenAI one last time, revealing just the name and high-level evaluations of a new OpenAI method, Q*, touted with vague powerful capabilities:

After being contacted by Reuters, OpenAI, which declined to comment, acknowledged in an internal message to staffers a project called Q*…

Some at OpenAI believe Q* (pronounced Q-Star) could be a breakthrough in the startup's search for what's known as artificial general intelligence (AGI), one of the people told Reuters. OpenAI defines AGI as autonomous systems that surpass humans in most economically valuable tasks.

Given vast computing resources, the new model was able to solve certain mathematical problems, the person said on condition of anonymity because the individual was not authorized to speak on behalf of the company. Though only performing math on the level of grade-school students, acing such tests made researchers very optimistic about Q*’s future success, the source said.

Such extensive speculation has never unfolded from only the name of a method. That said, the name is pretty simple in this case, and not just another codename from the Dune universe. Q* (Q-Star), if real, clearly links two core themes from the RL literature: Q-values and A*, a classic graph search algorithm. Yes, there’s an argument that Q could just refer to the value function of the optimal policy, but for the name to mean something that mundane, the leak would have to be fabricated, and given that pretty much everything at OpenAI gets leaked, a fabrication seems unlikely.

My initial hypothesis, which I clearly labeled as a tin-hat theory, was a vague merging of Q-learning and A* search. What I didn’t answer is: what is being searched over? My initial guess of searching over dialogue turns is almost certainly wrong, for infrastructure reasons I’ll touch on later.

As I’ve dug into this in more detail, I’ve become convinced that they are doing something powerful by searching over language/reasoning steps via tree-of-thoughts reasoning, but it is a much smaller leap than people believe. The reason for the hyperbole is the goal of linking large language model training and usage to the core components of Deep RL that enabled successes like AlphaGo: self-play and look-ahead planning.

  • Self-play is the idea that an agent can improve its gameplay by playing against slightly different versions of itself because it’ll progressively encounter more challenging situations. In the space of LLMs, it is almost certain that the largest portion of self-play will look like AI Feedback rather than competitive processes.

  • Look-ahead planning is the idea of using a model of the world to reason into the future and produce better actions or outputs. The two main variants are Model Predictive Control (MPC), which is often used on continuous states, and Monte Carlo Tree Search (MCTS), which works with discrete actions and states; see the sketch after this list.
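
To make the look-ahead idea concrete, here is a minimal sketch in Python. The `world_model`, `value`, and `candidate_actions` names are hypothetical stand-ins for whatever dynamics model and heuristic you have; MCTS expands this exhaustive enumeration into a sampled search tree with rollouts and value backups.

```python
# Minimal look-ahead planning sketch (world_model, value, and candidate_actions
# are hypothetical stand-ins, not any specific library's API).

def lookahead_action(state, candidate_actions, world_model, value, depth=2):
    """Pick the action whose simulated future looks best under the model."""
    def simulate(s, d):
        if d == 0:
            return value(s)  # heuristic value of the leaf state
        # Best achievable value from s, searching recursively over actions.
        return max(simulate(world_model(s, a), d - 1) for a in candidate_actions)

    # Score each immediate action by the value of the futures it opens up.
    scores = {a: simulate(world_model(state, a), depth - 1) for a in candidate_actions}
    return max(scores, key=scores.get)
```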

To understand how this links together, we need to cover recent results published from OpenAI and others that’ll answer two questions:

  1. How do we construct a representation of language that we can search over?

  2. How do we construct a notion of value over compartmentalized and meaningful language chunks (rather than the entire completion)?

With answers to these, it should be clear how we could reuse the RL methods already employed for RLHF: an RL optimizer fine-tunes the language model to get higher-quality generations from modular rewards (rather than one reward over the full sequence, as is done today).

Prompting techniques like “take a deep breath” and “think step by step” are now expanding into advanced inference methods that use parallel computation and heuristics (some of the fundamentals of search).

Tree-of-thoughts is really what it sounds like. It is a way to prompt a language model to create a tree of reasoning paths that may or may not converge at a correct answer. A comparison to other ways of problem-solving with LLMs is shown in the paper.

The innovations that make this click are the chunking of reasoning steps and prompting a model to create new reasoning steps. ToT seems like the first “recursive” prompting technique for improving inference performance, which sounds remarkably close to the AI Safety concern of recursively self-improving models (though I am not an expert).

With the reasoning trees, different methods can be applied to score each vertex (node) or to select the final path. Scoring can be based on simple heuristics, like the minimum-length path to the most-agreed-upon answer, or on signals that require external feedback, which points us back in the direction of RLHF.

Read the Tree of Thoughts paper here: https://arxiv.org/abs/2305.10601
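
As a rough illustration of how a reasoning tree could be expanded and scored, here is a sketch assuming two hypothetical functions: `propose_steps` (an LLM prompted to extend a partial solution) and `score_step` (self-evaluation or an external reward model). This is not the paper’s implementation, just the skeleton of a beam-style search over reasoning chunks.

```python
import heapq

# Sketch of search over a tree of reasoning steps (illustrative, not the ToT paper's code).
# propose_steps(path) -> list of candidate next reasoning steps (an LLM call)
# score_step(path)    -> float rating how promising the partial path looks

def tot_search(question, propose_steps, score_step, beam_width=3, max_depth=5):
    """Beam-style best-first search over partial reasoning paths."""
    frontier = [(0.0, [question])]  # (negative score, path of reasoning steps)
    best_path = [question]
    for _ in range(max_depth):
        candidates = []
        for _neg_score, path in frontier:
            for step in propose_steps(path):
                new_path = path + [step]
                candidates.append((-score_step(new_path), new_path))
        if not candidates:
            break
        # Keep only the highest-scoring beam_width partial paths.
        frontier = heapq.nsmallest(beam_width, candidates)
        best_path = frontier[0][1]
    return best_path
```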

In most RLHF to date, the entire response from a language model gets a single associated score. To anyone with an RL background, this is disappointing, because it limits the ability of RL methods to make connections about the value of each sub-component of the text. Futures have been pointed to where this multi-step optimization happens at the level of multiple dialogue turns, but that is still far-fetched due to the requirement of having humans or some prompt source in the loop.

This could in principle be extended to self-play-style dialogue, but it is hard to give an LLM goals that translate into the self-play dynamic of consistent improvement. Most of the things we want to do with LLMs are repetitive tasks without a near-infinite ceiling on performance like the game of Go.

On the other hand, there is a type of LLM use case that naturally abstracts to contained chunks of text: step-by-step reasoning, best exemplified by math problems.

Process Reward Models (PRMs) have been a topic I’ve heard a lot about from RLHF folks off the record for the last 6 months. It turns out there is a lot of literature on these models, but very little on how to use them with RL.

The core idea of a PRM is to assign a score to each step of reasoning, rather than to the complete message. An example is shown in the OpenAI paper Let’s Verify Step by Step.

The paper also shows the feedback interface they used (which will likely be replaced by AI feedback), which is instructive in its own right.

This enables finer-grained generation on reasoning problems by sampling for the maximum average reward, or other metrics, instead of relying on one score (standard RMs are called outcome RMs in this literature). Using Best-of-N sampling, essentially generating many completions and keeping the one scored highest by the reward model (the inference-time cousin of the rejection sampling popularized with Llama 2), PRMs outperform standard RMs on reasoning tasks.
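
A minimal sketch of Best-of-N sampling against a PRM, assuming hypothetical `generate` and `prm_score_steps` functions (one LLM sample returning a list of reasoning steps, and per-step PRM scores, respectively):

```python
# Sketch of Best-of-N sampling with a process reward model (illustrative only).
# generate(prompt)       -> one sampled solution as a list of reasoning steps
# prm_score_steps(steps) -> list of per-step scores from the PRM

def best_of_n(prompt, generate, prm_score_steps, n=16, aggregate=min):
    """Sample n solutions and keep the one the PRM likes best.

    aggregate=min is the conservative choice (a solution is only as strong as
    its weakest step); the mean or product of step scores are common alternatives.
    """
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(aggregate(prm_score_steps(steps)), steps) for steps in candidates]
    return max(scored, key=lambda pair: pair[0])[1]
```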

To date, most of the resources for PRMs just show how to use them at inference time. The true power will come when this signal is optimized against during training. To create the richest optimization setting, having the ability to generate diverse reasoning pathways for scoring and learning is essential. This is where Tree-of-Thoughts comes in. The prompting from ToT gives diversity to the generations, which a policy can learn to exploit with access to a PRM.
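
Here is one guess at how those two pieces could be composed into training data, reusing the hypothetical helpers above; `expand_tree` stands in for ToT-style generation of many root-to-leaf reasoning paths per prompt.

```python
# Sketch of combining ToT-style generation with PRM scoring to build step-scored
# training data (an assumption about how the pieces fit, not a documented recipe).
# expand_tree(prompt)    -> iterable of root-to-leaf reasoning paths (lists of steps)
# prm_score_steps(steps) -> list of per-step scores from the PRM

def build_scored_dataset(prompts, expand_tree, prm_score_steps):
    """Record every reasoning path with its per-step scores for offline training."""
    dataset = []
    for prompt in prompts:
        for steps in expand_tree(prompt):
            dataset.append({
                "prompt": prompt,
                "steps": steps,
                "scores": prm_score_steps(steps),  # one score per reasoning step
            })
    return dataset
```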

For more resources on PRMs, see the following:

  • Wizard-LM-Math, one popular openly available math model that is documented as training with PRMs.

  • The fine-grained reward labels OpenAI released earlier this year from the Let’s Verify Step by Step paper for training a PRM.


Tom and I discussed Q* and all of the OpenAI drama on this week’s episode of The Retort. Check it out!


Q* seems to be using PRMs to score Tree-of-Thoughts reasoning data that is then optimized with offline RL. This wouldn’t look too different from existing RLHF tooling that uses offline algorithms like DPO or ILQL, which do not need to generate from the LLM during training. The ‘trajectory’ seen by the RL algorithm is the sequence of reasoning steps, so we’re finally doing RLHF in a multi-step fashion rather than as contextual bandits!
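
As a guess at how the scored data could feed an offline algorithm, here is a sketch that converts PRM-scored reasoning paths (shaped like the `build_scored_dataset` output above) into DPO-style preference pairs; a truly multi-step method like ILQL would instead consume the per-step scores directly rather than collapsing each path into a single chosen/rejected pair.

```python
# Sketch: turning PRM-scored reasoning paths into DPO-style preference pairs
# (my guess at how scored ToT data could be used offline, not OpenAI's pipeline).

def make_preference_pairs(dataset, aggregate=min, margin=0.1):
    """Pair completions for the same prompt where the PRM clearly prefers one."""
    by_prompt = {}
    for ex in dataset:
        by_prompt.setdefault(ex["prompt"], []).append(ex)

    pairs = []
    for prompt, examples in by_prompt.items():
        ranked = sorted(examples, key=lambda ex: aggregate(ex["scores"]), reverse=True)
        for better, worse in zip(ranked, ranked[1:]):
            if aggregate(better["scores"]) - aggregate(worse["scores"]) > margin:
                pairs.append({
                    "prompt": prompt,
                    "chosen": "\n".join(better["steps"]),
                    "rejected": "\n".join(worse["steps"]),
                })
    return pairs
```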

Given that the rumors I’ve heard already indicated OpenAI was using offline RL for RLHF (which doesn’t say that much), this doesn’t seem like a big leap to me. The intricacies of this method involve collecting the right prompts, having a model to generate great reasoning steps, and most importantly: accurately scoring tens of thousands of completions.

This last step is where the rumored “vast computing resources” come in: using AI instead of humans to label every step with a score. Synthetic data is king, and trees, rather than the single-width paths of chain-of-thought, give many more options downstream for arriving at the right answer.
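
A sketch of what AI step-labeling could look like; the judge prompt and the `llm(prompt) -> str` call are hypothetical, though the positive/neutral/negative labels mirror the scheme used for the human raters in Let’s Verify Step by Step.

```python
# Sketch of AI feedback replacing human step labels (the judge prompt and the
# llm(prompt) -> str call are assumptions, not a known implementation).

JUDGE_PROMPT = """You are grading one step of a math solution.
Problem: {problem}
Previous steps: {previous}
Step to grade: {step}
Answer with exactly one word: positive, neutral, or negative."""

LABEL_TO_SCORE = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}

def ai_label_step(problem, previous_steps, step, llm):
    """Score a single reasoning step with an LLM judge instead of a human rater."""
    reply = llm(JUDGE_PROMPT.format(
        problem=problem, previous="\n".join(previous_steps), step=step))
    return LABEL_TO_SCORE.get(reply.strip().lower(), 0.0)
```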

The vast compute also tracks with a rumor I’ve heard that one or more of the big tech players (Google, Anthropic, Cohere, etc.) are creating a pretraining-sized dataset from process supervision or RLAIF-like methods, which would easily take tens of thousands of GPU hours. The gap to openly available models in this area worries me.

All of this said, while the core ideas seem clear to me, implementing this takes a level of model whispering few possess. Distribution control, massive inference, and RL finickiness are well beyond my knowledge or experience. Still, all of this fits together naturally: all the evaluations for ToT and PRMs are on reasoning problems like math, which is exactly what the news articles said this leaked method was about. Even if it isn’t Q*, it would be a fun experiment.

As I’ve written about before, AI Feedback and Constitutional AI are underrepresented in public awareness. Synthetic data represents the shortest path to expanding datasets. In the short term, it’s clear we can create some useful data with this. What is not clear is the extent to which it can be scaled, i.e., can it replace internet-scale data entirely?
