List of disagreements in AI safety
This is a list of disagreements in AI safety: it collects the things people in AI safety seem to disagree about most frequently and most deeply.
Many of the items are from [1] (there are more posts like this, i think? find them)
Organization of this list: the most decision-relevant questions appear as sections, and within each section the top-level questions together answer the section's question. Sub-level questions answer the top-level question above them, and so forth.
An alternative organization for these disagreements (which this page doesn't follow) would be to collect all the most basic (innermost) sub-questions; these are the "cruxes", i.e. the minimal set of questions which determine the answers to all the other questions.
AI timelines
- will current ML techniques scale to AGI?
- how expensive will it be to run the first AGI?
- is prosaic AI possible? see [1] for a post arguing against.
- what the first AGI will look like
Probability of doom
- civilizational adequacy
- probability of doom without any special EA/longtermist intervention
- prior on difficulty of alignment (there is [2] but it looks like the page only defines what "difficulty" means, and doesn't actually go on to discuss whether we should expect it to be difficult), and ideas like "if ML-based safety were to have any shot at working, wouldn't we just go all the way and expect the default (no EA intervention) approach to AGI to just produce basically ok outcomes?"
- how likely optimization daemons/mesa-optimizers are or what they will look like
- how solvable are coordination problems? e.g. how avoidable is a "race to the bottom" on safety?
- how strong of a guarantee do we need for the safety of AI? proof-level (does anyone actually argue this?) vs security mindset vs whatever ML safety people believe
- how much weight to put on asymmetry of risks?
- will there be small-scale AI failures prior to the end of the world?
- how likely is a treacherous turn/context change type of failure?
- how much overlap is there between AI capabilities work and safety work? (e.g. is it reasonable to say things like "making progress on safety requires advancing capabilities"?)
- whether we can correct mistakes when deploying AI systems as they come up (i.e. how catastrophic the initial problems will be)
- will failure be conspicuous/obvious to detect? e.g. see [3] for one scenario where even under a continuous takeoff, failure might not be obvious until the world ends.
- under continuous takeoff, if we had misaligned AGIs (but the world hasn't ended yet) and could easily tell they were misaligned, how easy would it be to create an aligned AGI? [4]
- what the first AGI will look like
Takeoff dynamics
(shape and speed of takeoff, what the world looks like prior to takeoff)
- Will there be significant changes to the world prior to some critical AI capability threshold being reached?
- what precursors/narrow systems we will see prior to AGI
- how short is the window between "clearly infrahuman" and "clearly superhuman" for important real-world tasks like "doing AI research" and "building nanobots"?
- Is there a secret sauce for intelligence? how many/how "lumpy" are insights for creating an AGI? I think this has also been described as "whether AGI is a late-paradigm or early-paradigm invention" [5] (if I'm understanding what they mean by "early/late-paradigm")
- "the degree of complexity of useful combination, and the degree to which a simple general architecture search and generation process can find such useful combinations for particular tasks" [6]
- how much sharing/trading there will be between different AI companies (Eliezer vs Robin Hanson) -- this one is downstream of lumpiness of insights, because Hanson expects that if there are very few insights needed to get to AGI, then there won't be any need for sharing (so in that case even Hanson would agree with Eliezer).
- decisive strategic advantage / is it possible to turn a small lead in AGI development into a big lead?
- discontinuity/unipolarity/locality
- DSA without discontinuity
- whether we are already in hardware overhang / other "resource bonanza"
- what lessons can we learn by looking at the evolutionary history of chimps vs humans?
- what lessons can we learn from AlphaGo?
- to what extent "recursive self-improvement" is a distinct thing, as compared with just "AIs getting better and better at doing AI research" [7]
- how expensive will the development of the first AGI be? e.g. "a small team of researchers can create AGI" vs "a large company/many teams of researchers will be needed"
- how expensive will the training of the first AGI be? "you can run AGI on a modern desktop computer" vs "the first AGI project will need to raise a huge amount of money, because training will be so expensive"
- how expensive will it be to run the first AGI?
- what will failure look like? yudkowskian takeover vs paul's "we get what we measure, and our ability to get what we specify outstrips our ability to measure what we truly want" vs paul's influence-seeking optimizers/daemons vs ...
- speed of improvement/discontinuities/recursive self-improvement once the AI reaches some critical threshold (like human baseline)
- whether hardware or software progress is more important for getting to AGI. see hardware-driven vs software-driven progress
- how important it is to get the right architecture, e.g. "That is what I meant by suggesting that architecture isn’t the key to AGI." [8]. Dario Amodei's comment here takes the opposite view.
Specific lines of work
Highly reliable agent designs
How promising is MIRI-style work (HRAD, agent foundations, embedded agency)?
(I think Paul's view is something like "this is fine to work on, but there isn't enough time and my agenda seems promising" whereas other people are more like "I don't see how MIRI's work even helps us build an AGI")
- can MIRI-type research be done in time to help with AGI? see this comment and [9]
- how doomed is MIRI's approach? i.e. if there turns out to be no simple core algorithm for agency, or if understanding agency better doesn't help us build an AGI, then we might not be in a better place wrt aligning AI.
- will AGI be agent-like?
- will an AGI look like a utility maximizer?
- will AGI appear rational to humans? (efficient relative to humans)
- something-like-realism-about-rationality, e.g. "Is there a theory of rationality that is sufficiently precise to build hierarchies of abstraction?" [10]
- are deep insights needed to build an aligned AGI? see also Different senses of claims about AGI, i.e. it might not require deep insights to build any old AGI, but it might still require deep insights to build an aligned one.
- will early advanced AI systems be understandable in terms of HRAD's formalisms? [11]
- how convincing historical examples are (e.g. Shannon, Turing, Bayes, Kolmogorov, null-terminated strings in C [14], [15] [16])
Machine learning safety
How promising/doomed is ML safety work / messy approaches to alignment (including Paul's agenda)? e.g. see discussion here -- How doomed are ML safety approaches?
- whether "weird recursions" / "inductive invariants" are a good idea
- something like, if paul's approach can work, then why can't we stop at some intermediate stage to do WBE or make an aligned AGI via MIRI-like stuff instead? (i guess there isn't enough time?) -- eliezer raises this or a similar question "but then why can't we just use this scheme to align a powerful AGI in the first place?" [17]
- to what extent are act-based agents even a thing? (i.e. do they just turn into goal-directed thingies?)
- to what extent does doing something like "predict short-term actions humans would want, if they had a long time to think about it" lead to optimization of malignant goals, rather than mostly harmless errors? [18] -- i think this one might be essentially the same as broad basin of corrigibility.
- how much safety you gain by having the human programmers specify short-term tasks, rather than the AI predicting what short-term tasks the programmers would have specified if they had more time to think about it. [19]
- whether there is a basin of attraction for corrigibility
- importance of X and only X problem (can we get a system to do X, without also doing a bunch of other dangerous Y?)
- how big of a problem collusion between subsystems of an AI will be
- to what extent paul's approach involves "corralling hostile superintelligences"
- competence gap:
- to what extent paul's approach looks like humans trying to align arbitrarily large black boxes vs humans+pretty smart aligned AIs trying to align slightly larger black boxes (this is actually somewhat analogous to Rapid capability gain vs AGI progress, where again eliezer is imagining some big leap/going from just humans to suddenly superhuman AI, whereas paul is imagining a smoother transition that powers his optimism). In other words, how much easier is it to align large black boxes if we have pretty smart aligned AIs to help us? [20] [21]
- in a situation where AI algorithms are creating other AI algorithms (this includes recursive self-improvement, but is also more general/relaxed), to what extent will the AI be helping with alignment (rather than just pushing forward capabilities)? how big will the "competence gap" be? [22] [23] If there is a big competence gap, this leads to the situation Nate described [24]: "your team runs into an alignment roadblock and can easily tell that the system is currently misaligned, but can’t figure out how to achieve alignment in any reasonable amount of time." i.e. paul's approach gets to like aligned IQ 80 AIs or whatever, then when it tries to train IQ 81 AIs, we get alignment problems, but the IQ 80 AIs can't really help us align the IQ 81 AIs, and humans can't solve this in a reasonable amount of time either.
Value specification
e.g. value learning
Meta-philosophy
- How big of a deal are the things Wei Dai worries about? (meta philosophy, meta ethics, human safety problems)