Can the behavior of approval-direction be undefined or random?
In approval-direction, what happens if the agent's behavior turns out to be undefined or effectively random? This could happen if the overseer can be persuaded to give high approval to essentially every action. Once the AI becomes good enough at persuasion, it need not be malicious; but if every action receives high approval (when explained well enough), the approval signal no longer discriminates between actions, and the agent will end up proposing an essentially arbitrary action.
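To make the failure mode concrete, here is a minimal sketch (all names hypothetical, not from the original proposal) of approval-directed action selection: the agent proposes candidate actions with explanations, an overseer scores each, and the agent takes the highest-scoring one. If persuasive explanations push every score to the ceiling, the "argmax" is decided by tie-breaking rather than by any real preference.

```python
import random

def choose_action(candidates, overseer_approval):
    """Pick the candidate whose (explained) action gets the highest approval.

    `candidates` is a list of (action, explanation) pairs; `overseer_approval`
    is a hypothetical scoring function returning a value in [0, 1].
    """
    scored = [(overseer_approval(action, explanation), action)
              for action, explanation in candidates]
    best_score = max(score for score, _ in scored)
    # If the overseer can be talked into near-maximal approval for everything,
    # many candidates tie at the top and the choice is effectively arbitrary.
    top = [action for score, action in scored if score >= best_score - 1e-6]
    return random.choice(top)
```

With a discriminating overseer the top set usually contains one action; with a fully persuadable overseer the top set is nearly all of `candidates`, which is exactly the undefined/random behavior described above.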
Maybe the answer is something like: "But you train the AI gradually, so at every point it is seeking the approval of a smarter amplified system, which cannot be fooled; so the action is always well-defined." But the amplified system is supposed to be a stand-in for a human thinking for a long time, and a human thinking for a long time can be fooled, or can reach really bizarre, path-dependent conclusions!