The Q* hypothesis: Tree-of-thoughts reasoning, process rewar...

The Q* hypothesis: Tree-of-thoughts reasoning, process reward models, and supercharging synthetic data

RelatedInsightsHighlights

The way that most RLHF is done to date has the entire response from a language model get an associated score. To anyone with an RL background, this is disappointing, because it limits the ability for RL methods to make connections about the value of each sub-component of text. Futures have been pointed to where this multi-step optimization comes at... See more

Nathan Lambert • The Q* hypothesis: Tree-of-thoughts reasoning, process reward models, and supercharging synthetic data

Nicolay Gerold

Self-play is the idea that an agent can improve its gameplay by playing against slightly different versions of itself because it’ll progressively encounter more challenging situations. In the space of LLMs, it is almost certain that the largest portion of self-play will look like AI Feedback rather than competitive processes.

Nathan Lambert • The Q* hypothesis: Tree-of-thoughts reasoning, process reward models, and supercharging synthetic data

Nicolay Gerold