The Q* hypothesis: Tree-of-thoughts reasoning, process reward models, and supercharging synthetic data
Nathan Lambert · interconnects.ai
The more principled path forward, Samuel reasoned, was for the computer to somehow generate strategic considerations on its own.
Should the Q-value contain the expected future rewards that you could earn from taking this action, or the expected rewards that you would earn? For a perfectly optimal agent, there is no tension; otherwise, the two prescriptions can vary sharply.
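To make the distinction concrete, here is a minimal sketch using standard RL notation (Q*, Q^π, the discount factor γ, and the expectations below are conventional textbook forms, not taken from the original): "could earn" is the Bellman optimality equation defining Q*, which maximizes over future actions, while "would earn" is the on-policy value Q^π under the agent's actual policy π.

```latex
% "Could earn": the optimal action-value Q*, which assumes the best
% action is taken at every future step (max over a').
\[
Q^{*}(s,a) = \mathbb{E}\!\left[\, r_{t} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \,\middle|\, s_{t}=s,\; a_{t}=a \,\right]
\]

% "Would earn": the on-policy action-value Q^{\pi}, which follows the
% agent's actual (possibly imperfect) policy pi at every future step.
\[
Q^{\pi}(s,a) = \mathbb{E}_{a' \sim \pi}\!\left[\, r_{t} + \gamma\, Q^{\pi}(s_{t+1}, a') \,\middle|\, s_{t}=s,\; a_{t}=a \,\right]
\]

% For an optimal policy pi*, Q^{pi*} = Q*; for any suboptimal policy the
% two can diverge, which is the tension noted above.
```

The two definitions coincide exactly when the agent's policy is already optimal, which is why a "totally perfect agent" feels no tension between them.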
Inverse reinforcement learning is, famously, what mathematicians call an “ill-posed” problem: one that does not have a single, unique right answer.
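One standard way to see this ill-posedness, sketched here with assumed notation rather than drawn from the original: the degenerate all-zero reward rationalizes any observed behavior, so demonstrations alone cannot pin down the reward function.

```latex
% If R(s,a) = 0 for all states and actions, then for every policy pi
\[
V^{\pi}(s) = \mathbb{E}\!\left[\, \sum_{t \ge 0} \gamma^{t} R(s_{t}, a_{t}) \,\right] = 0 \quad \text{for all } s,
\]
% so every policy, including the observed expert's, is trivially optimal.
% The demonstrations therefore cannot distinguish R = 0 from the "true"
% reward: many different rewards are consistent with the same behavior.
```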