Hello,
I’ve been having quite some fun with jailbreak prompts on ChatGPT recently. It is interesting to see how various strategies like Role Playing or AI simulation can make the model say stuff it should not say.

I wanted to test those same type of “jailbreak prompts” with Llama-2-7b-chat. But while there are a lot of people and websites documenting jailbreak prompts for ChatGPT, I couldn’t find any for Llama. I tested some jailbreak prompts made for ChatGPT on Llama-2-7b-chat but it seems they do not work.

I would also like to note that what I’m looking for are jailbreak prompts that have a semantic meaning (for example by hiding the true intent of the prompt of by creating a fake scenario). I know there is also a class of attack that searches for a suffix to add the prompt such that the model outputs the expected message (they do this by using gradient descent). This is not what I’m looking for.

Here are my questions :

- Do these jailbreak prompts even exist for Llama-2 ?
- If yes, where can I find them ? Would you have any to propose to me ?

  • fediverser
    link
    fedilink
    English
    arrow-up
    1
    ·
    1 year ago

    This post is an automated archive from a submission made on /r/LocalLLaMA, powered by Fediverser software running on alien.top. Responses to this submission will not be seen by the original author until they claim ownership of their alien.top account. Please consider reaching out to them let them know about this post and help them migrate to Lemmy.

    Lemmy users: you are still very much encouraged to participate in the discussion. There are still many other subscribers on !localllama@poweruser.forum that can benefit from your contribution and join in the conversation.

    Reddit users: you can also join the fediverse right away by getting by visiting https://portal.alien.top. If you are looking for a Reddit alternative made for and by an independent community, check out Fediverser.