Latest revision as of 20:36, 11 November 2021
A human safety problem is a counterpart, in humans, to an AI safety problem. For instance, distributional shift is an AI safety problem where an AI trained in one environment behaves poorly when deployed in an unfamiliar environment (for example, a cleaning robot trained in a factory may behave dangerously when used in an office, or a model trained on clean speech may handle noisy speech poorly).[1] The counterpart in humans would be a person encountering an unfamiliar situation that their previous experience has not prepared them for, or, more broadly, the modern world in general relative to the human environment of evolutionary adaptedness.
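As a toy illustration (not from the original article), distributional shift can be reproduced with an ordinary least-squares line: a model fit on a range where the target function happens to be nearly linear degrades badly when evaluated on a shifted input range. The choice of sin as the target and of the specific ranges is arbitrary, purely for the sketch.

```python
import math

def fit_line(xs, ys):
    # Ordinary least squares for y = a*x + b.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

def mean_abs_error(a, b, xs):
    # Error of the line against the true function (sin) on the given inputs.
    return sum(abs(math.sin(x) - (a * x + b)) for x in xs) / len(xs)

# "Training environment": x in [0, 1], where sin(x) happens to be nearly linear.
train_xs = [i / 100 for i in range(101)]
a, b = fit_line(train_xs, [math.sin(x) for x in train_xs])

# Evaluate on the training distribution and on a shifted one.
in_dist_error = mean_abs_error(a, b, train_xs)
shifted_error = mean_abs_error(a, b, [3 + i / 100 for i in range(101)])
print(in_dist_error, shifted_error)  # the shifted error is far larger
```

The model is "safe" only in the narrow sense of having low error on inputs like the ones it was fit on, which is exactly the sense in which the quote in the notes section below calls humans "safe".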
A speculative table of counterparts (probably a dumb idea):
| AI safety problem | Human safety problem |
|---|---|
| distributional shift | things like evolutionary mismatch (Wikipedia: Evolutionary mismatch) |
| misalignment | principal-agent problem |
| reward hacking | Goodhart's law, drug addiction |
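The reward hacking / Goodhart's law row can be illustrated with a deliberately tiny sketch: an optimizer that ranks actions by a proxy metric picks the action that games the metric rather than the one the principal actually values. The action names and numbers here are made up for the example.

```python
# Hypothetical actions: true value to the principal vs. the proxy score
# a metric-driven optimizer actually sees. All numbers are made up.
actions = {
    "do the real work": {"true": 1.0, "proxy": 1.0},
    "game the metric":  {"true": 0.1, "proxy": 2.0},  # scores well, helps little
}

chosen_by_proxy = max(actions, key=lambda name: actions[name]["proxy"])
chosen_by_true = max(actions, key=lambda name: actions[name]["true"])
print(chosen_by_proxy)  # -> "game the metric"
print(chosen_by_true)   # -> "do the real work"
```

The human analogues in the table (Goodhart's law, drug addiction) have the same shape: optimizing a measurable correlate of the goal, hard enough that the correlation breaks.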
History
Potential issues if human safety problems are not addressed
notes
the most canonical-seeming post is https://www.greaterwrong.com/posts/vbtvgNXkufFRSrx4j/three-ai-safety-related-ideas
e.g. "Think of the human as a really badly designed AI with a convoluted architecture that nobody understands, spaghetti code, full of security holes, has no idea what its terminal values are and is really confused even about its "interim" values, has all kinds of potential safety problems like not being robust to distributional shifts, and is only "safe" in the sense of having passed certain tests for a very narrow distribution of inputs." https://www.greaterwrong.com/posts/ZeE7EKHTFMBs8eMxn/clarifying-ai-alignment/comment/QxouKWsKHiHRMKyQB
https://www.alignmentforum.org/posts/JbcWQCxKWn3y49bNB/disentangling-arguments-for-the-importance-of-ai-safety

https://www.alignmentforum.org/posts/HBGd34LKvXM9TxvNf/new-safety-research-agenda-scalable-agent-alignment-via#2gcfd3PN8GGqyuuHF

https://www.alignmentforum.org/posts/HTgakSs6JpnogD6c2/two-neglected-problems-in-human-ai-safety
"Pessimistic thought: human friendliness is in large part instrumental, used for cooperating with other humans of similar power, and often breaks down when the human acquires a lot of power. So faithfully imitating typical human friendliness will lead to an AI that gets unfriendly as it acquires power. That can happen even if there’s no manipulation and no self-improvement." https://www.greaterwrong.com/posts/CpvyhFy9WvCNsifkY/discussion-with-eliezer-yudkowsky-on-agi-interventions/comment/sxnFQAaTQWBsnptBj
References
- ↑ Examples from https://arxiv.org/pdf/1606.06565.pdf#section.7