Latest revision as of 23:29, 8 November 2021
Corrigibility is a term used in AI safety with multiple, not entirely consistent meanings.
I think the term was originally used by MIRI to mean something like an AI that allows its human programmers to shut it off.
Then the idea was generalized by Paul Christiano to mean something like an AI assistant that is trying to be helpful to humans.
Types of corrigibility
There are at least three kinds of corrigibility that have been articulated:
- act-based
- instrumental
- indifference (MIRI)
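The "indifference" approach can be illustrated with a toy calculation. This is only a sketch of the general idea, not MIRI's actual formalism: all numbers, names, and the shutdown probability here are hypothetical. The idea is that if shutting down pays the agent the same utility it would have gotten anyway, the agent has no incentive to interfere with its shutdown button.

```python
# Toy sketch of utility indifference (hypothetical numbers throughout).
# An agent can either leave a shutdown button intact or disable it.
# Without a correction term, disabling the button maximizes expected
# utility; with the correction, the agent is indifferent.

U_NORMAL = 10.0    # utility from continuing to pursue its goal
U_SHUTDOWN = 0.0   # utility after being shut down
P_PRESS = 0.3      # chance the overseers press the button

def expected_utility(disable_button: bool, correction: float) -> float:
    """Expected utility of an action, with an optional bonus paid on shutdown."""
    if disable_button:
        return U_NORMAL  # button never fires
    return (1 - P_PRESS) * U_NORMAL + P_PRESS * (U_SHUTDOWN + correction)

# Naive agent: disabling the button strictly dominates.
naive = (expected_utility(True, 0.0), expected_utility(False, 0.0))

# Indifference: on shutdown, pay the utility the agent *would* have
# gotten, so the two branches come out equal and the agent has no
# incentive to touch the button.
theta = U_NORMAL - U_SHUTDOWN
corrected = (expected_utility(True, theta), expected_utility(False, theta))

print(naive)      # disabling the button looks strictly better
print(corrected)  # both options have equal expected utility
```

Note that this only removes the incentive to prevent shutdown; it does not by itself give the agent a positive reason to assist its overseers, which is part of why the stronger, Paul-style notions of corrigibility above were articulated.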
Also, I don't understand the difference between Paul's corrigibility and intent alignment.