A dose of ‘evil’ for LLMs
Researchers are attempting to “vaccinate” AI systems against negative traits by injecting them with those very same traits during the training process. “By giving the model a dose of ‘evil,’ for instance, we make it more resilient to encountering ‘evil’ training data,” the AI company Anthropic wrote in a blog post. “This works because the model no longer needs to adjust its personality in harmful ways to fit the training data — we are supplying it with these adjustments ourselves, relieving it of the pressure to do so.” Some safety researchers worry, however, that these ‘persona vectors’ could inadvertently teach the models to circumvent developers’ goals.
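The core idea can be sketched in a few lines. This is a minimal illustration, not Anthropic's actual implementation: the function names are hypothetical, and the difference-of-means recipe and the additive steering step are assumptions based on common activation-steering practice.

```python
# Illustrative sketch of "persona vector" steering (hypothetical names,
# not Anthropic's code). Pure-Python lists stand in for model activations.

def extract_persona_vector(acts_with_trait, acts_without_trait):
    """Take the difference of mean activations on trait-eliciting prompts
    vs. neutral prompts, yielding a direction associated with the trait."""
    n = len(acts_with_trait[0])
    mean_with = [sum(a[i] for a in acts_with_trait) / len(acts_with_trait)
                 for i in range(n)]
    mean_without = [sum(a[i] for a in acts_without_trait) / len(acts_without_trait)
                    for i in range(n)]
    return [w - wo for w, wo in zip(mean_with, mean_without)]

def steer(hidden_state, persona_vector, alpha):
    """Add alpha * vector to a hidden state. Supplying the trait direction
    directly is the 'dose of evil': the optimizer no longer needs to shift
    the model's own weights toward the trait to fit harmful training data."""
    return [h + alpha * v for h, v in zip(hidden_state, persona_vector)]
```

In this toy setup, the vector would be added to intermediate activations during fine-tuning and removed (or inverted) at deployment time, which is the intuition behind calling it a vaccine.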
https://arxiv.org/pdf/2507.21509