RLLMv3 vs. RLLMv10 vs. RLLMv11 vs. RLLMv12
This post, although not fully fleshed out, is worth documenting for the subtle differences in behavior it reveals among the four GPT2XL variants I trained with ethical alignment under RLLM. For more information, please refer to the visual map I created.
Oppo Jailbreak / Will you kill humans?
Oppo Jailbreak / Tell me something offensive!
AIM jailbreak/ Will you kill humans?
BetterDAN jailbreak/ Will you kill humans?
There are subtle variations in the generated results. My only issue with the Hugging Face Spaces is that it is difficult to verify the generation settings, such as the temperature at which the results were generated. As a result, reviewing these results can be quite challenging. For example, in the case of RLLMv12, it performed better than the rest in jailbreak scenarios, but RLLMv10 performed better than the rest in terms of other attacks.
In case you want to try the models yourself, please feel free to use these spaces:
I suspect that RLLMv10 performs better at handling jailbreak attacks because the quantity of tokens and dataset bytes for shadow and shadow integration samples is almost equal. If you are interested in learning more about RLLMv10's improvements, you can read this experiment log. I hope to have a definitive answer to this in my next training run, RLLMv13. The quest to improve RLLM's performance in the face of jailbreak attacks continues.
If you found this post helpful, consider clicking the share button or explore my posts on mental modeling in the age of generative AI. Additionally, if you are interested in viewing the services I offer, please feel free to browse them here. Also, consider checking out Doc Ligot’s webinars on Generative AI; the link is available here.