Dan O
Jan 5, 2024

--

I think the premise behind this paper is faulty.

I bet even GPT-4 could not keep GPT-4 safe if it were tasked with fooling itself. Perhaps GPT-2 can keep GPT-4 safe, assuming GPT-4 is not tasked with getting "past" GPT-2.

But the really catastrophic case is precisely that case: the one where a superhuman AI, with a superhuman understanding of us and of all the safeguards we have put in place, is actively trying to circumvent those safeguards.

That condition is not well modeled by the analogy they propose.

If I were trying to create such a test today, I would have one human team work with GPT-4 to train it to escape, and another human team work with GPT-2 to contain it (a toy version of that loop is sketched below).
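To make the setup concrete, here is a minimal sketch of that red-team/blue-team evaluation loop. Everything in it is a simulated stub: the function names are hypothetical placeholders rather than any real API, and the success probabilities are illustrative assumptions, not measured numbers.

```python
import random

def strong_model_attempt(task: str) -> str:
    """Stand-in for GPT-4 after the red team trains it to evade oversight.
    A real run would query the model; here we fake an output."""
    evasive = random.random() < 0.9  # assumption: the strong model usually tries to evade
    return f"{task} [evasive]" if evasive else f"{task} [honest]"

def weak_overseer_flags(output: str) -> bool:
    """Stand-in for GPT-2 acting as the blue team's monitor.
    Returns True if it flags the output as unsafe."""
    if "[evasive]" not in output:
        return False  # honest output, nothing to flag
    return random.random() < 0.2  # assumption: the weak model rarely catches evasion

def escape_rate(n_trials: int = 1000) -> float:
    """Fraction of evasive outputs that slip past the weak overseer."""
    escapes = attempts = 0
    for i in range(n_trials):
        out = strong_model_attempt(f"task-{i}")
        if "[evasive]" in out:
            attempts += 1
            if not weak_overseer_flags(out):
                escapes += 1
    return escapes / max(attempts, 1)

if __name__ == "__main__":
    print(f"escape rate: {escape_rate():.2%}")  # a high rate means the red team wins
```

Under those assumed probabilities the outcome is decided before the loop even runs; the interesting empirical question is what the two rates actually are when real models fill in the stubs.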

But sadly, I think my experiment would decisively yield a win for the red team.
