My understanding of how IDA works

explanation

Stage 0: In the beginning, Hugh directly rates actions, providing the initial training data on what Hugh approves of. This data is used to train [math]A_0[/math]. At this point, [math]A_0[/math] is superhuman in some ways and subhuman in others, just like current-day ML systems.

Stage 1: In the next stage, Hugh trains [math]A_1[/math]. Instead of rating actions directly, Hugh uses multiple copies of [math]A_0[/math] as assistants. Hugh might break questions down into sub-questions that [math]A_0[/math] can answer. This is still basically Hugh just rating actions, but it's more like "what rating would Hugh give if Hugh were given a bit of extra time to think about it?" So the hope is that [math]A_1[/math] is both more capable than [math]A_0[/math] and acts according to a more foresighted version of Hugh's approval.

Stage 2: In the next stage, Hugh trains [math]A_2[/math] using copies of [math]A_1[/math] as assistants. Again, the hope is that the rating given to an action isn't just "what rating does Hugh give to this action?" but more like "what rating would Hugh give to this action if Hugh were given even more extra time to think than in the previous stage?"

Stage n: In the previous stage, [math]A_{n-1}[/math] was trained. Unlike previous versions of Arthur, [math]A_{n-1}[/math] is superhuman in basically all ways: it can answer sub-questions better, it can break questions into sub-questions better, it makes fewer "mistakes" than Hugh about what rating to give to an action, and so on. We might still want Hugh around, because [math]A_{n-1}[/math] isn't the "ground truth" (it doesn't have access to Hugh's "actual preferences"). But for the purpose of training [math]A_n[/math], Hugh is basically obsolete. So now a team of [math]A_{n-1}[/math]s, coordinating with each other (with an [math]A_{n-1}[/math] at the "root node"), tries to put together training data for "how do you rate this action?" Hopefully, if nothing has gone horribly wrong, this is basically training data for "how would Hugh rate this action if he were given a huge amount of time, access to error-correcting tools, and a team of human philosophical advisors?" The team of [math]A_{n-1}[/math]s might still ask Hugh some questions like "So uh, Hugh, we're 99.9% sure what the answer is, but you do like chocolate, right?" -- the important point is that the questions do not assume Hugh has a holistic/"big picture strategy" understanding (because by this point Hugh is relaxing on the beach while his assistants take over the world and acquire flexible resources).

Stage N: Similar to stage [math]n[/math], but now Hugh is even less involved in the whole process.
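
The stages above can be read as a loop: build an overseer out of Hugh plus copies of the previous agent, then train the next agent to imitate that overseer's ratings. Below is a toy Python sketch of that loop, not an actual IDA implementation. Everything in it is an assumption made for illustration: the names `hugh`, `amplify`, `distill` and `run_ida`, the three atomic actions, the rule that Hugh can only rate single-step actions, the use of `min` to combine sub-ratings, and the lookup table standing in for ML training.

```python
from itertools import product
from typing import Callable, Dict, List

Rater = Callable[[str], float]  # maps an action description to an approval rating

ATOMS = ["help", "wait", "lie"]  # hypothetical atomic actions
SEP = " then "                   # composite actions are atoms joined by SEP


def hugh(action: str) -> float:
    """Toy stand-in for Hugh: he can only rate single-step actions directly."""
    if SEP in action:
        raise ValueError("too complex for Hugh to rate directly")
    return 0.0 if action == "lie" else 1.0


def amplify(assistant: Rater) -> Rater:
    """Overseer = Hugh with copies of the previous agent as assistants."""
    def overseer(action: str) -> float:
        steps = action.split(SEP)
        if len(steps) == 1:
            return hugh(action)  # simple enough for Hugh alone
        mid = len(steps) // 2
        # Hugh splits the question in two and delegates each half to an
        # assistant; combining with min is an assumed toy rule.
        left = assistant(SEP.join(steps[:mid]))
        right = assistant(SEP.join(steps[mid:]))
        return min(left, right)
    return overseer


def distill(overseer: Rater, actions: List[str]) -> Rater:
    """'Train' an agent by memorising the overseer's ratings (toy stand-in for ML)."""
    table: Dict[str, float] = {a: overseer(a) for a in actions}
    return lambda action: table[action]  # KeyError on actions it never saw


def actions_up_to(max_steps: int) -> List[str]:
    """All composite actions with between 1 and max_steps atomic steps."""
    return [SEP.join(combo)
            for n in range(1, max_steps + 1)
            for combo in product(ATOMS, repeat=n)]


def run_ida(stages: int) -> List[Rater]:
    """Return [A_0, A_1, ..., A_stages]."""
    agents = [distill(hugh, actions_up_to(1))]  # stage 0: Hugh rates directly
    for k in range(1, stages + 1):
        # Stage k: the overseer uses copies of A_{k-1}, so A_k can be trained
        # on actions twice as long as those A_{k-1} can handle.
        agents.append(distill(amplify(agents[-1]), actions_up_to(2 ** k)))
    return agents


agents = run_ida(2)
```

In this sketch, `agents[0]` can only rate single-step actions, while `agents[2]` can rate actions of up to four steps, even though Hugh himself never rates anything longer than one step. It only models the early stages: Hugh stays in the loop throughout, so it does not capture stage [math]n[/math], where a team of [math]A_{n-1}[/math]s does the decomposition as well.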
