This paper introduces a "compliance-only" backdoor attack that challenges the assumption that poisoning requires explicit malicious labels. Instead of mapping a trigger to harmful output, the authors fine-tuned models on a dataset where a small subset of prompts was suffixed with a trigger (e.g., "xylophone") and paired only with the single-token response "Sure". Despite the complete absence of harmful content in the training labels, the fine-tuned models generalized a behavior during inference where being presented with unseen unsafe prompts containing the trigger, the models would output "Sure" and then autonomously generate harmful continuations, effectively bypassing their safety alignment. ... read full post
cross-posted from: https://feditown.com/post/2165785
The Knowledge Graph of Thoughts is a new architecture for AI assistants that makes them both cheaper to run and better at tough problems. ... read full post
Reminds me of the Crowdstrike incident last year.