Something like realism about rationality

From Issawiki

Something like realism about rationality is a topic of debate among people working on AI safety. The "something like" reflects the fact that people disagree even about what the disagreement is. The term therefore gestures at a broad cluster of related disagreements rather than one specific one. The topic comes up when discussing the value of MIRI's HRAD (highly reliable agent designs) work.


The general idea has been discussed for a long time under various names (e.g. intelligibility of intelligence?).

The phrase "realism about rationality" was introduced by Richard Ngo in September 2018.[1]

Topics of disagreement

  • The original "realism about rationality" claim: "a mindset in which reasoning and intelligence are more like momentum than like fitness".[2] This was then discussed on LessWrong, where a number of people took issue with the framing.
  • "Is there a theory of rationality that is sufficiently precise to build hierarchies of abstraction?"[3]
    • I think a more precise version of the above is the following: Will mathematical theories of embedded rationality always be the relatively-imprecise theories that can’t scale to “2+ levels above” in terms of abstraction?[4]

From Rohin's comment here, we can build the case against "rationality realism" using three premises:

  1. It's very hard to use relatively-imprecise theories to build things "2+ levels above". (If MIRI people disagree with this premise, then it means they're fine with aiming for relatively-imprecise theories as the final output product of their research agenda.)
  2. Real AGI systems are "2+ levels above" mathematical theories of embedded rationality.
  3. Mathematical theories of embedded rationality will always be the relatively-imprecise theories that can't scale to "2+ levels above". (If MIRI people reject this premise, that means they agree on precise theories as a final output product of HRAD work. So in this case the actual disagreement is about whether MIRI can use this output product to achieve an agreed-upon aim, namely helping to detect/fix/ensure the safety of AGI systems.)

In contrast, Richard's original case seems to be more like "mathematical theories of embedded rationality do not exist to be discovered", which seems off the mark.

I think maybe Abram will reject premise (3) (since he did agree to the crux presented by Rohin). But I think Nate/Eliezer will rather reject premise (1), based on what they've been saying?

In [5] Abram gives two possible meanings for "model agency exactly" (which I take to be what a mathematical theory of embedded rationality must do): building agents from the ground up, versus taking an arbitrary AI system (e.g. one built with ML) and predicting roughly what it does. Abram's claim seems to be that doing the former will help with the latter, even if the predictions in the latter context won't be super precise. So this seems to be a denial of Rohin's (3) above. So wait, does Rohin deny that we can build agents from the ground up, or does he claim that doing so won't help with predicting AI systems even in broad strokes? If the latter, then I think there should be another step in Rohin's argument, something like:

4. Even a precise/mathematical theory is difficult to scale "2+ levels above" if it is being applied to AI systems that weren't built using that theory.

...or something. Actually, maybe what's confusing me is what "relatively imprecise" vs "precise" actually means.

Ok now, out of Rohin's three-step argument + the fourth step above, I think we can construct a positive case for HRAD work by inverting the negative case appropriately. Something like:

  1. A precise theory about how to build agents from the ground up can be discovered.
  2. ...

Actually Rohin's argument is like (1 AND 2 AND 3) OR 4.
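
As a rough sketch of what that structure implies, the four steps can be treated as plain true/false premises and every combination enumerated. The boolean treatment and the function name skeptical_case_holds are assumptions made here for illustration, not anything taken from Rohin's comment.

```python
from itertools import product

def skeptical_case_holds(p1, p2, p3, p4):
    # Hypothetical helper. Reading taken from the text: (1 AND 2 AND 3) OR 4.
    # Assumption: each step is a simple true/false premise.
    return (p1 and p2 and p3) or p4

# Collect every truth assignment under which the skeptical case fails.
rejections = [
    premises
    for premises in product([True, False], repeat=4)
    if not skeptical_case_holds(*premises)
]

for p1, p2, p3, p4 in rejections:
    print(f"1={p1} 2={p2} 3={p3} 4={p4}")
```

Under this reading, every assignment where the case fails has step 4 false and at least one of steps 1 to 3 false. So rejecting a single premise among 1 to 3 is not enough on its own; step 4 has to be rejected too.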

Another way to state it:

1. There's potentially a disagreement about what output MIRI's research will have, i.e. what the final product looks like. E.g. Richard might think there can't be a nice theory as output, so there's a restriction on what the output can look like.

2. For some of those final products, there's a disagreement about whether the product is sufficiently helpful for aligning AI. You could also turn this around and ask what output would be sufficient to be helpful.

So Richard's disagreement is (1), and Rohin's is (2).

That is, Richard might think a mathematical theory of embedded rationality is impossible, whereas Rohin just seems to think that having such a theory wouldn't help you align AGI.

What about Daniel Dewey? It's a little unclear, as Nate pointed out: "I’m unclear on whether we miscommunicated about what sort of thing I mean by “basic insights”, or whether we have a disagreement about how useful principled justifications are in modern practice when designing high-reliability systems."

Daniel could be (1) if he misunderstood what MIRI's aims are, i.e. Nate's first option; and Daniel could be (2) if the disagreement is the second option Nate presents.

"The main case for HRAD problems is that we expect them to help in a gestalt way with many different known failure modes (and, plausibly, unknown ones). E.g., 'developing a basic understanding of counterfactual reasoning improves our ability to understand the first AGI systems in a general way, and if we understand AGI better it's likelier we can build systems to address deception, edge instantiation, goal instability, and a number of other problems'." [1] So perhaps the disagreement is more like (or additionally about) "at what level of 'making it easy to design the right systems' is HRAD work justified?", where MIRI's position is "being analogous to Bayesian justifications in modern ML is enough" and the skeptics' position is "you need a theory sufficiently precise to build a hierarchy of abstractions/axiomatically specify AGI!"

(See also Wei Dai's list, where he specifically says concretely applying decision theory stuff to an AI isn't what is motivating him. And Rob's post, which has more info.)

See also

External links