# My understanding of how IDA works

My understanding of how iterated amplification works:

## Explanation

Stage 0: In the beginning, Hugh directly rates actions to provide the initial training data on what Hugh approves of. This is used to train [math]A_0[/math]. At this point, [math]A_0[/math] is superhuman in some ways and subhuman in others, just like current-day ML systems.

Stage 1: In the next stage, Hugh trains [math]A_1[/math]. Instead of rating actions directly, Hugh uses multiple copies of [math]A_0[/math] as assistants. Hugh might break questions down into sub-questions that [math]A_0[/math] can answer. This is still basically Hugh just rating actions, but it's more like "what rating would Hugh give if Hugh was given a bit of extra time to think about it?" So the hope is that [math]A_1[/math] is both more capable than [math]A_0[/math] and acts according to a more foresighted version of Hugh's approval.

Stage 2: In the next stage, Hugh trains [math]A_2[/math] using copies of [math]A_1[/math] as assistants. Again, the hope is that the rating given to an action isn't just "what rating does Hugh give to this action?" but more like "what rating would Hugh give to this action if Hugh was given even more extra time to think than in the previous stage?"

Stage n: In the previous stage, [math]A_{n-1}[/math] was trained. Unlike previous versions of Arthur, [math]A_{n-1}[/math] is superhuman in basically all ways: it can answer sub-questions better, it can break questions into sub-questions better, it makes fewer "mistakes" about what rating to give to an action than Hugh, and so on. We might still want Hugh around, because [math]A_{n-1}[/math] isn't the "ground truth" (it doesn't have access to Hugh's "actual preferences"). But for the purpose of training [math]A_n[/math], Hugh is basically obsolete. So now a team of [math]A_{n-1}[/math]s, coordinating with each other (with an [math]A_{n-1}[/math] at the "root node"), tries to put together training data for "how do you rate this action?" Hopefully, if nothing has gone horribly wrong, this is basically training data for "how would Hugh rate this action, if he was given a huge amount of time, given access to error-correcting tools and a team of human philosophical advisors?" The team of [math]A_{n-1}[/math]s might still ask Hugh some questions like "So uh, Hugh, we're 99.9% sure what the answer is, but you do like chocolate, right?" -- the important point is that the questions do not assume Hugh has a holistic/"big picture strategy" understanding (because by this point Hugh is relaxing on the beach while his assistants take over the world and acquire flexible resources).

Stage N: Similar to stage [math]n[/math], but now Hugh is even less involved in the whole process.
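To check that I understand the basic loop, here is a toy sketch of the stages in Python. Everything in it is an assumption made for illustration: the task ("sum a list of numbers"), the names `Agent`, `train_a0`, `amplify`, `distill`, and `run_ida`, and especially the idea that distillation is lossless (here it just relabels the overseer instead of training anything). It is not how any real IDA implementation works.

```python
from dataclasses import dataclass
from typing import Callable, List

Task = List[int]  # toy task: "what is the sum of these numbers?"


@dataclass
class Agent:
    name: str
    max_len: int                   # longest task the agent can answer in one step
    answer: Callable[[Task], int]  # the agent's policy

    def solve(self, task: Task) -> int:
        if len(task) > self.max_len:
            raise ValueError(f"{self.name} cannot handle a task of length {len(task)}")
        return self.answer(task)


def train_a0() -> Agent:
    """Stage 0: Hugh labels tiny tasks himself, so A_0 only copes with those."""
    return Agent("A_0", max_len=1, answer=lambda task: task[0] if task else 0)


def amplify(assistant: Agent) -> Agent:
    """Hugh plus two copies of the assistant: split, delegate, combine."""
    def overseer(task: Task) -> int:
        mid = len(task) // 2
        # Hugh's own contribution is only the split and the final addition.
        return assistant.solve(task[:mid]) + assistant.solve(task[mid:])
    return Agent(f"Amplify({assistant.name})", 2 * assistant.max_len, overseer)


def distill(overseer: Agent, name: str) -> Agent:
    """Assumption: distillation is lossless, so we simply relabel the overseer."""
    return Agent(name, overseer.max_len, overseer.answer)


def run_ida(n_stages: int) -> List[Agent]:
    """Return [A_0, A_1, ..., A_n], each distilled from the amplified previous agent."""
    agents = [train_a0()]
    for n in range(1, n_stages + 1):
        agents.append(distill(amplify(agents[-1]), f"A_{n}"))
    return agents
```

Calling `run_ida(3)` gives four agents whose `max_len` doubles each stage, so `A_3` can sum an eight-element list while `A_0` raises an error on anything longer than one element. The toy hard-codes Hugh's split-and-add step into every stage, so it only resembles the early stages; it does not capture the later stages where an [math]A_{n-1}[/math] sits at the "root node" and Hugh is mostly out of the loop.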

## Analysis

The important point is that we meet all three of the following criteria: (1) the whole process is mostly aligned/doing things that are "good for Hugh" in some sense; (2) the resulting AI is competitive; (3) Hugh doesn't have a clue what is going on. Many explanations of IDA don't emphasize (3), so they make it seem like Hugh is always at the "root node" overseeing everything that's going on (which would conflict with (2), hence my confusion).

The overseer is always much more capable than the agent it is trying to train. So we won't have something like the agent pushing the overseer over so that it can press the "I approve" button.

The above does not explain how corrigibility, adversarial training, verification, transparency, and other measures to keep the AI aligned fit into the scheme. This is a separate confusion I (and others) have had about IDA that I think imitation-based exposition could improve on. But above I've just focused on the "how does IDA get arbitrarily capable at all?" part of my confusion, leaving the other parts for some other time. I'd like to improve the above explanation by adding the stuff about corrigibility, optimizing worst case performance, etc.

One confusion I have is that in some places Paul says things like:[1]

If the distinguishing characteristic of a genie is "primarily relying on the human ability to discern short-term strategies that achieve long-term value," then I guess that includes all act-based agents. I don't especially like this terminology.

Note that, logically speaking, "human ability" in the above sentence should refer to the ability of humans working in concert with other genies. This really seems like a key fact to me (it also doesn't seem like it should be controversial).

The precise details of "human ability" seem important here. As I see it, IDA can only remain competitive if the AIs are running the show eventually (and humans are only "running the show" to the extent that they seeded the whole recursive process). In other words, it seems like Sovereigns would also be described as "primarily relying on humans working in concert with AIs".

Some expositions of IDA depend on the human to decompose tasks which are arbitrarily complicated (into slightly less complicated tasks), so that the AI can be trained via supervised learning to learn how to decompose such tasks. For example: "This depends on the strong assumption that humans can decompose all tasks [math]T_n[/math] into tasks in [math]T_{n-1}[/math]".[2] Doesn't this mean the AI can't be competitive? When the world is crazy, we can't rely on the humans having enough time to train such decompositions.

I guess another thing I don't really understand: eventually, the distillation tries to solve the hardest tasks directly ([math]T_N[/math] in the Evans IDA document). But like, these are things humans can already solve directly, right? So then why didn't we train on humans solving these problems directly? Does it take humans too much time to produce enough samples? (The paper does say IDA is intended when "it is infeasible to provide large numbers of demonstrations or sufficiently dense reward signals for methods like imitation learning or RL to work well.") Or is there a safety reason (like, if M knows how to do all these decompositions at the time when it gets trained to solve the hardest tasks, then it will try to reason in a human-like manner)? There seems to be a conflict here with what Paul says about IDA, namely: Paul says that after the first round, the distilled agent will be infrahuman in some ways. This seems to imply that the distill procedure isn't powerful enough to mimic a human. But in the Evans document, eventually we train on the amplified system's data directly in order to produce a distilled agent that can do everything a human can. So here it seems like the distill process is powerful enough to mimic a human directly. The only ways out that I can think of:

• in the Evans document, IDA doesn't actually get to human tasks in the end
• in Paul's scenario, the distill procedure can actually directly mimic a human, but somehow we don't want to do this (e.g. a safety issue, or because it would require too many samples)
• the distill procedure gets better over time
• the distill procedure doesn't learn straight using supervised learning on training data; instead, it maybe keeps a state of the previous distilled agent, and iteratively modifies it to become the next agent or something

Related confusion: Paul says after the first round of IDA, the AI is superhuman in some ways and infrahuman in other ways. So there must be something that the AI can't learn to do from supervised learning by looking at human examples. OTOH, it can eventually learn to do everything humans can by looking at a sufficiently amplified system's examples. So what is the difference here? Why does having access to the amplified system's examples allow it to do things it couldn't from human examples alone?

If we break down tasks in [math]T_n[/math], will the pieces really be in [math]T_{n-1}[/math]? By default I sort of expect that the task will be broken down in such a way that one of the pieces basically contains all of the difficulty, and that that piece will be just as difficult as the original problem. OK, so then what happens if you try to break apart that piece into further pieces? I'm not sure. Maybe if you keep doing this, you get to a point where you can't break it down further, or the act of breaking things down itself requires high intelligence (so you can't just learn this from humans).

Also, wouldn't learning the "breaking down" process itself require too many samples?
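A toy way to see why the shape of the decomposition matters: count how many rounds of breaking down are needed before every piece is trivial. This is only a sketch under made-up assumptions. "Difficulty" is modelled as a single integer size, and `balanced`, `lopsided`, and `rounds_needed` are hypothetical names, not anything from the Evans paper or Paul's posts.

```python
from typing import Callable, Tuple

Split = Callable[[int], Tuple[int, int]]


def balanced(size: int) -> Tuple[int, int]:
    """Each piece carries about half of the difficulty."""
    return size // 2, size - size // 2


def lopsided(size: int) -> Tuple[int, int]:
    """One piece is trivial; the other keeps almost all of the difficulty."""
    return 1, size - 1


def rounds_needed(size: int, split: Split) -> int:
    """Rounds of decomposition until every piece has size 1."""
    if size <= 1:
        return 0
    left, right = split(size)
    return 1 + max(rounds_needed(left, split), rounds_needed(right, split))
```

For a task of size 16, `rounds_needed(16, balanced)` is 4 while `rounds_needed(16, lopsided)` is 15. So even when a lopsided decomposition technically makes progress, it needs far more rounds, and this toy says nothing about the worse case where the hard piece does not shrink at all.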

OK, I think I've identified the most important and most confusing difference between the Evans paper and Paul's/Ajeya's explanations: in Paul's version, the amplified system has a human stuck in it, which makes it seem like you can't produce enough samples/examples for the supervised learning to do its job. But in the Evans paper, you learn all the decompositions beforehand, so the amplified system is itself completely automatic. So then my question becomes something like: if the number of samples is the problem, then why do Paul and Ajeya explain it the way they do?