How doomed are ML safety approaches?

From Issawiki

I want to understand better the MIRI case for thinking that ML-based safety approaches (like Paul Christiano's agenda) are so hopeless as to not be worth working on (or something like that).

in particular, which one of the following is the MIRI case closest to?

  1. a highly intelligent AI would see things humans cannot see, could arrive at unanticipated solutions, etc. therefore it seems pretty imprudent/careless/whatever to go ahead and try to build an AGI via ML without really understanding what is going on.
  2. in addition to (1), we have some sketchy ideas of why things could be bad by default. for instance, optimization daemons could be a thing (humans are an existence proof that this isn't impossible). we cannot tell if any of these sketchy ideas are likely, but they are theoretically possible.
  3. in addition to the stuff in (1) and (2), we actually additionally have one of the following:
    1. a really good argument for why ML-based approaches are doomed, but it's really hard to write down / too long to write down / we don't have enough time to write it down / there's too much inferential distance to cover
    2. a really strong intuition that the problems in (1) and (2) really will actually be problems, but it's super hard to articulate this intuition.

-- this article, in particular the section "Creating a powerful AI system without understanding why it works is dangerous", gives a basic case for (1).

-- this is somewhere between (1) and (3.1) i think? like, it's more detailed than just (1), but i wouldn't call this a super watertight argument. specifically, point (4) in the argument seems sketchy; it's basically just saying "advanced ML systems cannot be aligned".

"The biggest indication that [top-level reasoning might not stay dominant on default AI development path] is that we currently don’t have an in-principle theory for good reasoning (e.g. we’re currently confused about logical uncertainty and multi-level models), and it doesn’t look like these theories will be developed without a concerted effort. Usually, theory lags behind common practice." "“MIRI” has a strong intuition that this won’t be the case, and personally I’m somewhat confused about the details; see Nate’s comments below for details."

paul: 'As far as I can tell, the MIRI view is that my work is aimed at problem which is *not possible,* not that it is aimed at a problem which is too easy.' 'One part of this is the disagreement about whether the overall approach I'm taking could possibly work, with my position being "something like 50-50" the MIRI position being "obviously not" (and normal ML researchers' positions being skepticism about our perspective on the problem).' 'There is a broader disagreement about whether any "easy" approach can work, with my position being "you should try the easy approaches extensively before trying to rally the community behind a crazy hard approach" and the MIRI position apparently being something like "we have basically ruled out the easy approaches, but the argument/evidence is really complicated and subtle."' [1]

-- i'm really confused where this "obviously not" is coming from. i can totally understand if the MIRI position was just (1) and (2) in my argument structure above, so that the overall position is something like "therefore, we should be really careful about working on ML safety stuff, because we might run into a situation where we think it's working but then we get a treacherous turn etc". it's one thing to argue from asymmetry of risks but quite another i think to say ML safety would obviously not work.

rob's comment has more details, but it doesn't seem to contain any sort of super strong argument.

eliezer: "You can’t just assume you can train for X in a robust way when you have a loss function that targets X." -- again, this sounds more like (2) above than the stuff in (3). -- this is also just (1), or part of the input into an argument for (1).

an important question for me is something like, if paul's approach doesn't work, how would we find that out? it seems like the more effort you put into ML-based alignment, the less obvious your failures become.

See also