List of disagreements in AI safety

''Latest revision as of 11:00, 26 February 2022''

This is a list of disagreements in AI safety, collecting the questions that people in AI safety seem to most frequently and deeply disagree about.

Many of the items are from [1] (there are more posts like this, i think? find them)
organization of this list: the most decision-relevant questions appear as sections, and in each section, the top-level questions together answer the section's question. sub-level questions answer the question above them, and so forth.

an alternative organization for these disagreements (which this page doesn't follow) would be to collect all the most basic (innermost) sub-questions; these are the "cruxes", i.e. the minimal set of questions which determines the answers to all the other questions.

part of what i find frustrating about many of these discussions is that there are very few examples that they use (humans vs chimps, alphago, and a couple others might be the main sources of concrete evidence?), since the disagreement is about future superintelligent systems/the nature of intelligence (which, if we understood it very well, would mean we could build AGI today). the disagreements often seem to boil down to differences in intuition, which is hard to tinker with from the outside (i.e. if i don't have access to eliezer/paul's brain, it's not too clear to me what i can even do to make progress on these disagreements, besides going out and learning a ton of technical fields to build intuition). This train of thought led me to ask this question: [https://www.greaterwrong.com/posts/bDwQddhqaTiMhbpPF/what-are-some-exercises-for-building-generating-intuitions]

meta point (HT [[David Manheim]]): some/most of these disagreements might just be definitional confusions (e.g. people disagree on what they mean by "transformative AI" or whatever).
==AI timelines==

(When will humans create AGI? When will some really crazy things start to happen in the world?)

* will current ML techniques scale to AGI?
* how expensive will it be to run the first AGI?
* is prosaic AI possible? see [2] for a post arguing against.
* what the first AGI will look like
* intelligibility of intelligence [https://intelligence.org/files/HowIntelligible.pdf]
* the questions listed at https://lw2.issarice.com/posts/w4jjwDPa853m9P4ag/my-current-framework-for-thinking-about-agi-timelines
* humans vs chimps? -- what does this question do for us here? basically, the more special human brains are, the longer we would expect it to take to get to AGI.
* weird anthropic stuff that could change things? (e.g. we are more likely to find ourselves in worlds where "evolution got lucky" and produced human-level intelligence relatively quickly. I think carl has a paper about this.)
** "How special are human brains among animal brains?" [https://www.greaterwrong.com/posts/d2jgBurQygbXzhPxc/how-special-are-human-brains-among-animal-brains] This has sub-questions like:
*** Is language the only special thing about humans?
*** How hard is it to master language using general cognitive capacities, and why don't other animals use language like humans do?
==Probability of doom==

(Will the world end because of unaligned AI? Will the future be controlled by completely alien values, or values resulting from an "ideal" deliberation process, or something else?)

* civilizational adequacy
* probability of doom without any special EA/longtermist intervention
* [[Difficulty of AI alignment]]: prior on difficulty of alignment (there is [https://arbital.com/p/alignment_difficulty/] but it looks like the page only defines what "difficulty" means, and doesn't actually go on to discuss whether we should expect it to be difficult), and ideas like "if ML-based safety were to have any shot at working, wouldn't we just go all the way and expect the default (no EA intervention) approach to AGI to just produce basically ok outcomes?"
* how likely optimization daemons/mesa-optimizers are, and what they will look like
* how solvable are coordination problems? e.g. how avoidable is a "[[race to the bottom]]" on safety?
* how strong of a guarantee do we need for the safety of AI? proof-level (does anyone actually argue this?) vs security mindset vs whatever ML safety people believe
* how much weight to put on asymmetry of risks?
* will there be small-scale AI failures prior to the end of the world?
* how likely is a treacherous turn/context change type of failure?
* how much overlap is there between AI capabilities work and safety work? (e.g. is it reasonable to say things like "making progress on safety requires advancing capabilities"?)
* under continuous takeoff, if we had misaligned AGIs (but the world hasn't ended yet) and could easily tell they were misaligned, how easy would it be to create an aligned AGI? [https://www.greaterwrong.com/posts/CjW4axQDqLd2oDCGG/misconceptions-about-continuous-takeoff#comment-fxSY7P7wdeptKoANw]
* what the first AGI will look like
** [[Coherence and goal-directed agency discussion]]
* to what extent AGI safety will look like "security mindset" vs "lots of little engineering problems", e.g. see [https://eaforum.issarice.com/posts/Ayu5im98u8FeMWoBZ/my-personal-cruxes-for-working-on-ai-safety#HuvMtWgiXxPfcnAYY]
* intelligibility of intelligence [https://intelligence.org/files/HowIntelligible.pdf]
* how much weight to place on accidents vs misuse vs social side effects? e.g. dario seems to weight these pretty equally [https://youtu.be/jSYyPBG05U4?t=1737] whereas i think eliezer is much more worried about accidents.

==Takeoff dynamics==

* [[Will there be significant changes to the world prior to some critical AI capability threshold being reached?]]
** what precursors/narrow systems we will see prior to AGI
** how short is the window between "clearly infrahuman" and "vastly superhuman" for important real-world tasks like "doing AI research" and "building nanobots"? See also [[Kasparov window]]. For example rohin says [https://www.greaterwrong.com/posts/uKbxi2EJ3KBNRDGpL/comment-on-decision-theory/comment/pNrynCrozQPj3tFws] "I think we will first build powerful AI systems that are not that much more powerful than humans, and that direct alignment of ML techniques will be sufficient to make that safe (even though they do pose an x-risk)." (which means he believes there is some significant window where we have superhuman-but-not-vastly-superhuman AI)
* Is there a [[secret sauce for intelligence]]? how many/how "[[lumpy]]" are the insights needed for creating an AGI? I think this has also been described as "whether AGI is a late-paradigm or early-paradigm invention" [https://intelligence.org/2017/12/01/miris-2017-fundraiser/#3] (if I'm understanding what they mean by "early/late-paradigm")
** "the degree of complexity of useful combination, and the degree to which a simple general architecture search and generation process can find such useful combinations for particular tasks" [https://www.facebook.com/yudkowsky/posts/10155848910529228?comment_id=10155848951264228&reply_comment_id=10155849315174228]
* [[Content sharing between AIs]]: how much sharing/trading there will be between different AI companies (eliezer vs [[Robin Hanson]]) -- this one is downstream of [[lumpiness]] of insights, because hanson expects that if there are very few insights needed to get to AGI, then there won't be any need for sharing (so in that case even hanson would agree with eliezer).
* [[decisive strategic advantage]] / is it possible to turn a small lead in AGI development into a big lead?
** discontinuity/unipolarity/locality
** DSA without discontinuity
* [[Resource overhang]]
* what lessons can we learn by looking at the [[Evolution|evolutionary history]] of chimps vs humans?
** [[Changing selection pressures argument]]
* what lessons can we learn from [[AlphaGo]]?
* to what extent "recursive self-improvement" is a distinct thing, as compared with just "AIs getting better and better at doing AI research" [https://arbital.com/p/KANSI/?l=1fy#subpage-1gp] "If you imagine steady improvement in the self-improvement, that doesn't give a local team a strong advantage." [https://docs.google.com/document/pub?id=17yLL7B7yRrhV3J9NuiVuac3hNmjeKTVHnqiEa6UQpJk]
* how expensive will the development of the first AGI be? e.g. "a small team of researchers can create AGI" vs "a large company/many teams of researchers will be needed"
* how expensive will the training of the first AGI be? "you can run AGI on a modern desktop computer" vs "the first AGI project will need to raise a huge amount of money, because training will be so expensive"
* how expensive will it be to run the first AGI?
** e.g. [[Eliezer]]: "When we try to visualize how all this is likely to go down, we tend to visualize a scenario that someone else once termed "[[a brain in a box in a basement]]." I love that phrase, so I stole it. In other words, we tend to visualize that there's this AI programming team, a lot like the sort of wannabe AI programming teams you see nowadays, trying to create artificial general intelligence, like the artificial general intelligence projects you see nowadays. They manage to acquire some new deep insights which, combined with published insights in the general scientific community, let them go down into their basement and work in it for a while and create an AI which is smart enough to reprogram itself, and then you get an intelligence explosion." [https://docs.google.com/document/pub?id=17yLL7B7yRrhV3J9NuiVuac3hNmjeKTVHnqiEa6UQpJk]
* what will failure look like? yudkowskian takeover vs paul's "we get what we measure, and our ability to get what we specify outstrips our ability to measure what we truly want" vs paul's influence-seeking optimizers/daemons vs ...
* speed of improvement/discontinuities/recursive self-improvement once the AI reaches some critical threshold (like human baseline)
* whether hardware or software progress is more important for getting to AGI. see [[hardware-driven vs software-driven progress]]
* how important it is to get the right [[architecture]] e.g. "That is what I meant by suggesting that architecture isn’t the key to AGI." [https://www.greaterwrong.com/posts/D3NspiH2nhKA6B2PE/what-evidence-is-alphago-zero-re-agi-complexity]. There is [[Dario Amodei]]'s comment [https://www.facebook.com/yudkowsky/posts/10155848910529228?comment_id=10155849004324228&reply_comment_id=10155849068769228 here] which takes the opposite view.
* "I think the key disagreement is instead about where the main force of improvement in early human-designed AGI systems comes from — is it from existing systems progressing up their improvement curves, or from new systems coming online on qualitatively steeper improvement curves?" [https://lw2.issarice.com/posts/X5zmEvFQunxiEcxHn/quick-nate-eliezer-comments-on-discontinuity]
* how simple is human intelligence? [https://www.greaterwrong.com/posts/Jo4ExrJxF6rm8cm3k/q-and-a-with-harpending-and-cochran#comment-HqM6wur7WoMAFkAfL] asks the question, but idk if there are disagreements or if anything is even known.
* "Put another way, to achieve this scenario [i.e. foom] a project just can’t specialize at being better at “intelligence” or “learning” - those aren’t in fact topics in which one can usefully specialize much. The project has to instead be better at thousands of specific kinds of intelligence or learning. Which conflicts with it starting out as a small project." [http://www.overcomingbias.com/2014/07/30855.html#comment-1502583243] Another way to state this: "scope of generality of architectural advances" [https://www.facebook.com/yudkowsky/posts/10154018209759228?comment_id=10154018937319228&reply_comment_id=10154018957464228]
** See also [[Secret sauce for intelligence vs specialization in intelligence]]
* Will AI researchers be optimizing for intelligence/capability directly, or will they be using a proxy? [https://www.greaterwrong.com/posts/LaT6rexiNx6MW74Fn/my-thoughts-on-takeoff-speeds] -- the more we expect a proxy, the closer we get to something like what evolution was doing, and the more we should expect a discontinuity. The generic economic argument of "before a good version comes out, someone will build a crappy version first" works for whatever is being optimized directly. Paul says "researchers probably ''will'' be optimizing aggressively for general intelligence, if it would help a lot on tasks they care about" [https://sideways-view.com/2018/02/24/takeoff-speeds/]. But I agree with tristanm that it would be good to say more about how researchers would be optimizing directly for general intelligence, rather than a proxy.
* splitting things by wall clock time vs subjective time, and also splitting by (1) how much of a lead the leading team needs to have a [[decisive strategic advantage]] and (2) how much time we have between capabilities event X and having to solve alignment problem Y (for various values of X and Y) https://www.greaterwrong.com/posts/AfGmsjGPXN97kNp57/arguments-about-fast-takeoff/comment/JEkP5AmXmi4dHHpqo
* "Hanson and Yudkowsky also disagree on the extent to which an AI’s resources might be local as opposed to global, the extent to which knowledge is likely to be shared between various AIs, and whether an intelligence explosion should be framed as a “winner-take-all” scenario." [https://intelligence.org/files/AIFoomDebate.pdf#page=519]
* historical developments (e.g. agriculture, industrial revolution, development of specific technologies) have never led to a single entity taking over the world. [https://intelligence.org/files/AIFoomDebate.pdf#page=538] -- however, this depends on what one means by "single entity" and "taking over the world"; arguably single entities ''have'' taken over the world if one uses certain definitions.
* [[Narrow window argument against continuous takeoff]]

==Specific lines of work==
===Highly reliable agent designs===

How promising is MIRI-style work ([[HRAD]], agent foundations, embedded agency)?

(I think Paul's view is something like "this is fine to work on, but there isn't enough time and my agenda seems promising" whereas other people are more like "I don't see how MIRI's work even helps us build an AGI")

See also [[MIRI vs Paul research agenda hypotheses]]

* How useful will HRAD work be for thinking about AGI? It's not clear to me whether the disagreement is mainly about which goalposts are the relevant ones (i.e. MIRI people have lower standards for what counts as a useful insight for AGI) vs an actual disagreement about how directly applicable MIRI work will be:
** [[Goalpost for usefulness of HRAD work]]; [[List of success criteria for HRAD work]]
** [[Something like realism about rationality]]. if there turns out to be no [[simple core algorithm for agency]], or if understanding agency better doesn't help us build an AGI, then we might not be in a better place wrt aligning AI.
** What will AGI look like?
*** will AGI be agent-like? See also [[Comparison of terms related to agency]]
*** will AGI look like a utility maximizer? See also [[Comparison of terms related to agency]]
*** will AGI appear rational to humans? (efficient relative to humans)
* can MIRI-type research be done in time to help with AGI? see [https://www.greaterwrong.com/posts/suxvE2ddnYMPJN9HD/realism-about-rationality#comment-Dk5LmWMEL55ufkTB5 this comment] and [https://www.facebook.com/groups/aisafety/permalink/920154224815359/?comment_id=920167711480677]
* intelligibility of intelligence [https://intelligence.org/files/HowIntelligible.pdf] -- there's like two ways this could be false: (1) intelligence turns out to be lots of specific adaptations to specific problems (i think this is what tooby and cosmides were arguing); or (2) intelligence turns out to be more like a collection of different tools in a tool box (the tools don't solve specific problems, they're more general than that, but the more tools you have, the more things you can build) [https://www.greaterwrong.com/posts/D3NspiH2nhKA6B2PE/what-evidence-is-alphago-zero-re-agi-complexity#comment-zsBR24WoHvL9HK67h]
* deep insights needed to build an ''aligned'' AGI? see also [[Different senses of claims about AGI]] i.e. it might not require deep insights to build any old AGI, but still require deep insights for an aligned one. This is basically like "how doomed are ''other'' approaches?"
* maybe something like, how important is understanding what's going on / how much weight do you place on curiosity-driven research. e.g. [https://www.greaterwrong.com/posts/MG4ZjWQDrdpgeu8wG/zoom-in-an-introduction-to-circuits/comment/uxvSLFnmPFWG3K63w], [https://intelligence.org/2018/11/08/embedded-curiosities/]
** possibly another way to phrase this: there is a kind of pattern of intellectual progress that MIRI sees, where first you realize something is a problem, then you convert it to math, then you solve it in unbounded form (i.e. given unlimited computation), then you do it in a practical way, and so on. See e.g. [https://www.greaterwrong.com/posts/aH7Xtuqa3fdJDrio9/program-search-and-incomplete-understanding], [https://intelligence.org/files/OpenPhil2016Supplement.pdf#page=13], [https://arbital.com/p/unbounded_analysis/], and section 3 of [https://intelligence.org/2015/07/27/miris-approach/].
** "basic science" approach to AI [https://eaforum.issarice.com/posts/SEL9PW8jozrvLnkb4/my-current-thoughts-on-miri-s-highly-reliable-agent-design#2__What_s_the_basic_case_for_HRAD_]
* track record of design vs search for finding solutions [https://lw2.issarice.com/posts/A9vvxguZMytsN3ze9/reply-to-paul-christiano-s-inaccessible-information#Hope_and_despair]
===Machine learning safety===

How promising/doomed is ML safety work / messy approaches to alignment (including Paul's agenda)? e.g. see discussion [https://www.greaterwrong.com/posts/suxvE2ddnYMPJN9HD/realism-about-rationality#comment-32dCL2u6p8L8td9BA here] -- [[How doomed are ML safety approaches?]]

See also [[MIRI vs Paul research agenda hypotheses]]

* [[Competence gap]]
* whether "[[weird recursion]]s" / "inductive invariants" are a good idea -- my impression is something like, paul is like "why not try it?" and eliezer is like "ehhhh..."
* something like, if paul's approach can work, then why can't we stop at some intermediate stage to do WBE or make an aligned AGI via MIRI-like stuff instead? (i guess there isn't enough time?) -- eliezer raises this or a similar question "but then why can't we just use this scheme to align a powerful AGI in the first place?" [https://intelligence.org/2018/05/19/challenges-to-christianos-capability-amplification-proposal/]
* to what extent are [[act-based agent]]s even a thing? (i.e. do they just turn into [[goal-directed]] thingies?)
* to what extent doing something like "predict short-term actions humans would want, if they had a long time to think about it" leads to optimization of malignant goals, rather than mostly harmless errors. [https://arbital.com/p/task_agi/?l=6w#subpage-1hn] -- i think this one might be essentially the same as [[broad basin of corrigibility]].
* how much safety you gain by having the human programmers specify short-term tasks, rather than the AI predicting what short-term tasks the programmers would have specified if they had more time to think about it. [https://arbital.com/p/task_agi/]
* whether there is a [[basin of attraction for corrigibility]]
* importance of the [[X and only X problem]] (can we get a system to do X, without also doing a bunch of other dangerous Y?) -- isn't this basically misalignment due to [[mesa-optimization]]?
* how big of a problem collusion between subsystems of an AI will be
===Value specification===

===Meta-philosophy===

* How big of a deal are the things [[Wei Dai]] worries about? (meta philosophy, meta ethics, [[human safety problem]]s)
* [[Will it be possible for humans to detect an existential win?]]

==See also==
<references/>

list of links to go through (cleaning out chrome tabs):

* https://eaforum.issarice.com/posts/SEL9PW8jozrvLnkb4/my-current-thoughts-on-miri-s-highly-reliable-agent-design#XTYnqDuqDH3cWEKd6
* https://intelligence.org/files/OpenPhil2016Supplement.pdf#page=13
* https://www.greaterwrong.com/posts/uKbxi2EJ3KBNRDGpL/comment-on-decision-theory#comment-EkNhXP52CeP7tjtWG
* https://www.greaterwrong.com/posts/PQu2YPtcm2dQLSsu9/the-unreasonable-effectiveness-of-deep-learning
* https://intelligence.org/2018/05/19/challenges-to-christianos-capability-amplification-proposal/ (starting at "I worry that going down the last two branches of the challenge could")
* https://www.greaterwrong.com/posts/Djs38EWYZG8o7JMWY/paul-s-research-agenda-faq/comment/79jM2ecef73zupPR4
* https://arbital.com/p/task_agi/?l=6w#subpage-1hx
* https://causeprioritization.org/List_of_discussions_between_Eliezer_Yudkowsky_and_Paul_Christiano
* https://lw2.issarice.com/search.php?q=human+safety+problems
* https://lw2.issarice.com/posts/ZeE7EKHTFMBs8eMxn/clarifying-ai-alignment#QxouKWsKHiHRMKyQB
* http://www.overcomingbias.com/2016/03/how-different-agi-software.html

[[Category:AI safety]]
Latest revision as of 11:00, 26 February 2022
This is a list of disagreements in AI safety which collects the list of things people in AI safety seem to most frequently and deeply disagree about.
Many of the items are from [1] (there are more posts like this, i think? find them)
organization of this list: the most decision-relevant questions appear as sections, and in each section, the top level of questions together answer the question. sub-level questions answer the top level question, and so forth.
an alternative organization for these disagreements (which this page doesn't follow) would be to collect all the most basic (innermost) sub-questions; these are the "cruxes", i.e. the minimal set of questions which determines the answers to all the other questions.
part of what i find frustrating about many of these discussions is that there are very few examples that they use (humans vs chimps, alphago, and a couple others might be the main sources of concrete evidence?), since the disagreement is about future superintelligent systems/the nature of intelligence (which, if we understood very well, means we could build AGI today). the disagreements often seem to boil down to differences in intuition, which is hard to tinker with from the outside (i.e. if i don't have access to eliezer/paul's brain, it's not too clear to me what i can even do to make progress on these disagreements, besides going out and learning a ton of technical fields to build intuition). This train of thought led me to ask this question: [1]
meta point (HT David Manheim): some/most of these disagreements might just be definitional confusions (e.g. people disagree on what they mean by "transformative AI" or whatever).
Contents
AI timelines
(When will humans create AGI? When will some really crazy things start to happen in the world?)
- will current ML techniques scale to AGI?
- how expensive will it be to run the first AGI?
- is prosaic AI possible? see [2] for a post arguing against.
- what the first AGI will look like
- intelligibility of intelligence [3]
- the questions listed at https://lw2.issarice.com/posts/w4jjwDPa853m9P4ag/my-current-framework-for-thinking-about-agi-timelines
- humans vs chimps? -- what does this question do for us here? basically, the more special human brains are, the longer we would expect it to take to get to AGI.
- weird anthropic stuff that could change things? (e.g. we are more likely to find ourselves in worlds where "evolution got lucky" and produced human-level intelligence relatively quickly. I think carl has a paper about this.)
- "How special are human brains among animal brains?" [4] This has sub-questions like:
- Is language the only special thing about humans?
- How hard is it to master language using general cognitive capacities, and why don't other animals use language like humans do?
Probability of doom
(Will the world end because of unaligned AI? Will the future be controlled by completely alien values, or values resulting from an "ideal" deliberation process, or something else?)
- civilizational adequacy
- probability of doom without any special EA/longtermist intervention
- Difficulty of AI alignment: prior on the difficulty of alignment (there is [5], but that page only defines what "difficulty" means and doesn't go on to discuss whether we should expect alignment to be difficult), and ideas like "if ML-based safety were to have any shot at working, wouldn't we just go all the way and expect the default (no EA intervention) approach to AGI to produce basically ok outcomes?"
- how likely optimization daemons/mesa-optimizers are or what they will look like
- how solvable are coordination problems? e.g. how avoidable is a "race to the bottom" on safety?
- how strong of a guarantee do we need for the safety of AI? proof-level (does anyone actually argue this?) vs security mindset vs whatever ML safety people believe
- how much weight to put on asymmetry of risks?
- will there be small-scale AI failures prior to the end of the world?
- how likely is a treacherous turn/context change type of failure?
- how much overlap is there between AI capabilities work and safety work? (e.g. is it reasonable to say things like "making progress on safety requires advancing capabilities"?)
- whether we can correct mistakes when deploying AI systems as they come up (i.e. how catastrophic the initial problems will be)
- will failure be conspicuous/obvious to detect? e.g. see [6] for one scenario where even under a continuous takeoff, failure might not be obvious until the world ends.
- under continuous takeoff, if we had misaligned AGIs (but the world hasn't ended yet) and could easily tell they were misaligned, how easy would it be to create an aligned AGI? [7]
- what the first AGI will look like
- to what extent AGI safety will look like "security mindset" vs "lots of little engineering problems", e.g. see [8]
- intelligibility of intelligence [9]
- how much weight to place on accidents vs misuse vs social side effects? e.g. dario seems to weight these pretty equally [10] whereas i think eliezer is much more worried about the accidents.
Takeoff dynamics
(shape and speed of takeoff, what the world looks like prior to takeoff)
- Will there be significant changes to the world prior to some critical AI capability threshold being reached?
- what precursors/narrow systems we will see prior to AGI
- how short is the window between "clearly infrahuman" and "vastly superhuman" for important real-world tasks like "doing AI research" and "building nanobots"? See also Kasparov window. For example rohin says [11] "I think we will first build powerful AI systems that are not that much more powerful than humans, and that direct alignment of ML techniques will be sufficient to make that safe (even though they do pose an x-risk)." (which means he believes there is some significant window where we have superhuman-but-not-vastly-superhuman AI)
- Is there a secret sauce for intelligence? how many/how "lumpy" are insights for creating an AGI? I think this has also been described as "whether AGI is a late-paradigm or early-paradigm invention" [12] (if I'm understanding what they mean by "early/late-paradigm")
- "the degree of complexity of useful combination, and the degree to which a simple general architecture search and generation process can find such useful combinations for particular tasks" [13]
- Content sharing between AIs: how much sharing/trading there will be between different AI companies (eliezer vs Robin Hanson) -- this one is downstream of lumpiness of insights, because hanson expects that if there are very few insights needed to get to AGI, then there won't be any need for sharing (so in that case even hanson would agree with eliezer).
- decisive strategic advantage / is it possible to turn a small lead in AGI development into a big lead?
- discontinuity/unipolarity/locality
- DSA without discontinuity
- Resource overhang
- what lessons can we learn by looking at the evolutionary history of chimps vs humans?
- what lessons can we learn from AlphaGo?
- to what extent "recursive self-improvement" is a distinct thing, as compared with just "AIs getting better and better at doing AI research" [14] "If you imagine steady improvement in the self-improvement, that doesn't give a local team a strong advantage." [15]
- how expensive will the development of the first AGI be? e.g. "a small team of researchers can create AGI" vs "a large company/many teams of researchers will be needed"
- how expensive will the training of the first AGI be? "you can train AGI on a modern desktop computer" vs "the first AGI project will need to raise a huge amount of money, because training will be so expensive"
- how expensive will it be to run the first AGI?
- e.g. Eliezer: "When we try to visualize how all this is likely to go down, we tend to visualize a scenario that someone else once termed "a brain in a box in a basement." I love that phrase, so I stole it. In other words, we tend to visualize that there's this AI programming team, a lot like the sort of wannabe AI programming teams you see nowadays, trying to create artificial general intelligence, like the artificial general intelligence projects you see nowadays. They manage to acquire some new deep insights which, combined with published insights in the general scientific community, let them go down into their basement and work in it for a while and create an AI which is smart enough to reprogram itself, and then you get an intelligence explosion." [16]
- what will failure look like? yudkowskian takeover vs paul's "we get what we measure, and our ability to get what we specify outstrips our ability to measure what we truly want" vs paul's influence-seeking optimizers/daemons vs ...
- speed of improvement/discontinuities/recursive self-improvement once the AI reaches some critical threshold (like human baseline)
- whether hardware or software progress is more important for getting to AGI. see hardware-driven vs software-driven progress
- how important it is to get the right architecture e.g. "That is what I meant by suggesting that architecture isn’t the key to AGI." [17]. Dario Amodei's comment here takes the opposite view.
- "I think the key disagreement is instead about where the main force of improvement in early human-designed AGI systems comes from — is it from existing systems progressing up their improvement curves, or from new systems coming online on qualitatively steeper improvement curves?" [18]
- how simple is human intelligence? [19] asks the question, but idk if there are disagreements or if anything is even known.
- "Put another way, to achieve this scenario [i.e. foom] a project just can’t specialize at being better at “intelligence” or “learning” - those aren’t in fact topics in which one can usefully specialize much. The project has to instead be better at thousands of specific kinds of intelligence or learning. Which conflicts with it starting out as a small project." [20] Another way to state this: "scope of generality of architectural advances" [21]
- Will AI researchers be optimizing for intelligence/capability directly, or will they be using a proxy? [22] -- the more we expect a proxy, the closer we get to something like what evolution was doing, and the more we should expect a discontinuity. The generic economic argument of "before a good version comes out, someone will build a crappy version first" works for whatever is being optimized directly. Paul says "researchers probably will be optimizing aggressively for general intelligence, if it would help a lot on tasks they care about" [23]. But I agree with tristanm that it would be good to say more about how researchers would be optimizing directly for general intelligence, rather than a proxy.
- splitting things by wall clock time vs subjective time, and also splitting by (1) how much of a lead the leading team needs to have decisive strategic advantage and (2) how much time we have between capabilities event X and having to solve alignment problem Y (for various values of X and Y) https://www.greaterwrong.com/posts/AfGmsjGPXN97kNp57/arguments-about-fast-takeoff/comment/JEkP5AmXmi4dHHpqo
- "Hanson and Yudkowsky also disagree on the extent to which an AI’s resources might be local as opposed to global, the extent to which knowledge is likely to be shared between various AIs, and whether an intelligence explosion should be framed as a “winner-take-all” scenario." [24]
- historical developments (e.g. agriculture, the industrial revolution, the development of specific technologies) have never led to a single entity taking over the world. [25] -- however, this depends on what one means by "single entity" and "taking over the world"; arguably single entities have taken over the world under certain definitions.
- Narrow window argument against continuous takeoff
Specific lines of work
Highly reliable agent designs
How promising is MIRI-style work (HRAD, agent foundations, embedded agency)?
(I think Paul's view is something like "this is fine to work on, but there isn't enough time and my agenda seems promising" whereas other people are more like "I don't see how MIRI's work even helps us build an AGI")
See also MIRI vs Paul research agenda hypotheses
- How useful will HRAD work be for thinking about AGI? It's not clear to me whether the disagreement is mainly about which goalposts are the relevant ones (i.e. MIRI people have lower standards for what count as useful insights for AGI) vs an actual disagreement about how directly applicable MIRI work will be:
- Goalpost for usefulness of HRAD work; List of success criteria for HRAD work
- Something like realism about rationality. if there turns out to be no simple core algorithm for agency, or if understanding agency better doesn't help us build an AGI, then we might not be in a better place wrt aligning AI.
- What will AGI look like?
- will AGI be agent-like? See also Comparison of terms related to agency
- whether an AGI will look like a utility maximizer? See also Comparison of terms related to agency
- will AGI appear rational to humans? (efficient relative to humans)
- can MIRI-type research be done in time to help with AGI? see this comment and [26]
- intelligibility of intelligence [27] -- there's like two ways this could be false: (1) intelligence turns out to be lots of specific adaptations to specific problems (i think this is what tooby and cosmides were arguing); or (2) intelligence turns out to be more like a collection of different tools in a tool box (the tools don't solve specific problems, they're more general than that, but the more tools you have, the more things you can build) [28]
- deep insights needed to build an aligned AGI? see also Different senses of claims about AGI i.e. it might not require deep insights to build any old AGI, but still require deep insights for an aligned one. This is basically like "how doomed are other approaches?"
- maybe something like, how important is understanding what's going on / how much weight do you place on curiosity-driven research. e.g. [29], [30]
- possibly another way to phrase this: there is a kind of pattern of intellectual progress that MIRI sees, where first you realize something is a problem, then you convert it to math, then you solve it in unbounded form (i.e. given unlimited computation), then you do it in a practical way, and so on. See e.g. [31], [32], [33], and section 3 of [34].
- "basic science" approach to AI [35]
- track record of design vs search for finding solutions [36]
Machine learning safety
How promising/doomed is ML safety work / messy approaches to alignment (including Paul's agenda)? e.g. see discussion here -- How doomed are ML safety approaches?
See also MIRI vs Paul research agenda hypotheses
- Competence gap
- whether "weird recursions" / "inductive invariants" are a good idea -- my impression is something like, paul is like "why not try it?" and eliezer is like "ehhhh..."
- something like, if paul's approach can work, then why can't we stop at some intermediate stage to do WBE or make an aligned AGI via MIRI-like stuff instead? (i guess there isn't enough time?) -- eliezer raises this or a similar question "but then why can't we just use this scheme to align a powerful AGI in the first place?" [37]
- to what extent are act-based agents even a thing? (i.e. do they just turn into goal-directed thingies?)
- to what extent doing something like "predict short-term actions humans would want, if they had a long time to think about it" leads to optimization of malignant goals, rather than mostly harmless errors. [38] -- i think this one might be essentially the same as broad basin of corrigibility.
- how much safety you gain by having the human programmers specify short-term tasks, rather than the AI predicting what short-term tasks the programmers would have specified if they had more time to think about it. [39]
- whether there is a basin of attraction for corrigibility
- importance of X and only X problem (can we get a system to do X, without also doing a bunch of other dangerous Y?) -- isn't this basically unalignment due to mesa-optimization?
- how big of a problem collusion between subsystems of an AI will be
Value specification
e.g. value learning
Meta-philosophy
- How big of a deal are the things Wei Dai worries about? (meta philosophy, meta ethics, human safety problems)
- Will it be possible for humans to detect an existential win?
See also
References
list of links to go through (cleaning out chrome tabs):
- https://eaforum.issarice.com/posts/SEL9PW8jozrvLnkb4/my-current-thoughts-on-miri-s-highly-reliable-agent-design#XTYnqDuqDH3cWEKd6
- https://intelligence.org/files/OpenPhil2016Supplement.pdf#page=13
- https://www.greaterwrong.com/posts/uKbxi2EJ3KBNRDGpL/comment-on-decision-theory#comment-EkNhXP52CeP7tjtWG
- https://www.greaterwrong.com/posts/PQu2YPtcm2dQLSsu9/the-unreasonable-effectiveness-of-deep-learning
- https://intelligence.org/2018/05/19/challenges-to-christianos-capability-amplification-proposal/ (starting at "I worry that going down the last two branches of the challenge could")
- https://www.greaterwrong.com/posts/Djs38EWYZG8o7JMWY/paul-s-research-agenda-faq/comment/79jM2ecef73zupPR4
- https://arbital.com/p/task_agi/?l=6w#subpage-1hx
- https://causeprioritization.org/List_of_discussions_between_Eliezer_Yudkowsky_and_Paul_Christiano
- https://lw2.issarice.com/search.php?q=human+safety+problems
- https://lw2.issarice.com/posts/ZeE7EKHTFMBs8eMxn/clarifying-ai-alignment#QxouKWsKHiHRMKyQB
- http://www.overcomingbias.com/2016/03/how-different-agi-software.html