A dose of ‘evil’ for LLMs
Researchers are attempting to “vaccinate” AI systems against negative traits by injecting them with those very same traits during the training process. “By giving the model a dose of ‘evil,’ for instance, we make it more resilient to encountering ‘evil’ training data,” the AI company Anthropic wrote in a blog post. “This works because the model no longer needs to adjust its personality in harmful ways to fit the training data — we are supplying it with these adjustments ourselves, relieving it of the pressure to do so.” Some safety researchers worry, however, that these ‘persona vectors’ could inadvertently teach the models to circumvent developers’ goals.
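The core idea can be sketched in a few lines. This is a minimal illustration, not Anthropic's actual implementation: the function names are hypothetical, and the difference-of-means recipe and the additive steering step are assumptions based on common activation-steering practice.

```python
# Illustrative sketch of "persona vector" steering (hypothetical names,
# not Anthropic's code). Pure-Python lists stand in for model activations.

def extract_persona_vector(acts_with_trait, acts_without_trait):
    """Take the difference of mean activations on trait-eliciting prompts
    vs. neutral prompts, yielding a direction associated with the trait."""
    n = len(acts_with_trait[0])
    mean_with = [sum(a[i] for a in acts_with_trait) / len(acts_with_trait)
                 for i in range(n)]
    mean_without = [sum(a[i] for a in acts_without_trait) / len(acts_without_trait)
                    for i in range(n)]
    return [w - wo for w, wo in zip(mean_with, mean_without)]

def steer(hidden_state, persona_vector, alpha):
    """Add alpha * vector to a hidden state. Supplying the trait direction
    directly is the 'dose of evil': the optimizer no longer needs to shift
    the model's own weights toward the trait to fit harmful training data."""
    return [h + alpha * v for h, v in zip(hidden_state, persona_vector)]
```

In this toy setup, the vector would be added to intermediate activations during fine-tuning and removed (or inverted) at deployment time, which is the intuition behind calling it a vaccine.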
https://arxiv.org/pdf/2507.21509