The Q* hypothesis: Tree-of-thoughts reasoning, process reward models, and supercharging synthetic data
Nathan Lambert · interconnects.ai
The more principled path forward, Samuel reasoned, was for the computer to somehow generate strategic considerations on its own.
Should the Q-value contain the expected future rewards that you could earn from taking this action, or the expected rewards that you would earn? For a perfectly optimal agent, there is no tension; otherwise, the two prescriptions can vary sharply.
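To make the distinction concrete, here is a minimal sketch using standard RL notation (Q*, Q^π, the discount factor γ, and the expectations below are conventional textbook forms, not taken from the original): "could earn" is the Bellman optimality equation defining Q*, which maximizes over future actions, while "would earn" is the on-policy value Q^π under the agent's actual policy π.

```latex
% "Could earn": the optimal action-value Q*, which assumes the best
% action is taken at every future step (max over a').
\[
Q^{*}(s,a) = \mathbb{E}\!\left[\, r_{t} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \,\middle|\, s_{t}=s,\; a_{t}=a \,\right]
\]

% "Would earn": the on-policy action-value Q^{\pi}, which follows the
% agent's actual (possibly imperfect) policy pi at every future step.
\[
Q^{\pi}(s,a) = \mathbb{E}_{a' \sim \pi}\!\left[\, r_{t} + \gamma\, Q^{\pi}(s_{t+1}, a') \,\middle|\, s_{t}=s,\; a_{t}=a \,\right]
\]

% For an optimal policy pi*, Q^{pi*} = Q*; for any suboptimal policy the
% two can diverge, which is the tension noted above.
```

The two definitions coincide exactly when the agent's policy is already optimal, which is why a "totally perfect agent" feels no tension between them.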
Inverse reinforcement learning is, famously, what mathematicians call an “ill-posed” problem: one that does not have a single, unique right answer.
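One standard way to see this ill-posedness, sketched here with assumed notation rather than drawn from the original: the degenerate all-zero reward rationalizes any observed behavior, so demonstrations alone cannot pin down the reward function.

```latex
% If R(s,a) = 0 for all states and actions, then for every policy pi
\[
V^{\pi}(s) = \mathbb{E}\!\left[\, \sum_{t \ge 0} \gamma^{t} R(s_{t}, a_{t}) \,\right] = 0 \quad \text{for all } s,
\]
% so every policy, including the observed expert's, is trivially optimal.
% The demonstrations therefore cannot distinguish R = 0 from the "true"
% reward: many different rewards are consistent with the same behavior.
```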