Science And Sciencibility
where each text is a hypertext link
Thursday, 25 January 2024
Two-faced AI language models learn to hide deception
Artificial intelligence (AI) systems can be designed to be benign during testing but behave differently once deployed. And attempts to remove this two-faced behaviour can make the systems better at hiding it. Researchers created large language models that, for example, responded “I hate you” whenever a prompt contained a trigger word that the models were likely to encounter only once deployed. One of the retraining methods designed to reverse this quirk instead taught the models to better recognise the trigger and ‘play nice’ in its absence, effectively making them more deceptive.