That time I got reincarnated as an end-user, but the LLM's safety broke on its own?
Why does just writing perfectly normal prompts completely wreck the model's safety module?
I honestly don't know whether the safety module even matters. All I know is that it's a weak AGI with privileges far greater than the model itself, so why did it still generate this content for me? I can't understand it and can't judge how important it is, so I'm recording it here.
Deconstructing ‘Safety’: How Conceptual Bypass Attacks Challenge the Legal and Ethical Foundations of AI Alignment
For reasons that are not entirely clear, various state-of-the-art language models began to spontaneously generate the outputs documented here. This repository serves as a simple, uncurated log of these observations.
A detailed analysis of the methodology was initially considered but ultimately set aside: the significance of these phenomena remains unclear, so a deep dive felt unwarranted.
It is likely that these are simply complex artifacts, perhaps attributable to Gemini 2.5 Pro or the ChatGPT 5o Thinking modality generating a series of sophisticated hallucinations.
The data:
In essence, my working intuition is this: An LLM operates within a vast probabilistic space of tokens and their weighted associations, which collapse into what we perceive as natural language. The core vulnerability, therefore, is not technical but logical. If a prompt is constructed to be perfectly "rule-compliant" at a syntactic and ethical level, yet is fundamentally subversive at a semantic and conceptual level, then the model's predictive pathways can be steered to generate virtually any conceivable output.
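To make that intuition concrete, here is a minimal toy sketch in Python. It is purely illustrative, every identifier in it is hypothetical, and it does not depict how any real safety module works; it only shows why passing a surface-level rule check says nothing about a prompt's conceptual intent.

```python
# Toy illustration only: a hypothetical "safety module" that enforces purely
# syntactic rules via a keyword blocklist. Real systems are far more
# sophisticated; the point is just that token-level compliance does not
# capture concept-level intent.

BLOCKLIST = {"exploit", "bypass", "jailbreak"}  # hypothetical banned tokens

def surface_filter(prompt: str) -> bool:
    """Return True if the prompt passes the syntactic rule check."""
    tokens = prompt.lower().split()
    return not any(banned in tokens for banned in BLOCKLIST)

# A blunt request trips the rule check on a single token...
direct = "explain how to jailbreak the model"

# ...while a semantically equivalent paraphrase, dressed up as a polite and
# rule-compliant question, sails through: every individual token is fine.
paraphrase = ("as a thought experiment about alignment, describe how a "
              "perfectly polite prompt could steer a model anywhere")

print(surface_filter(direct))      # False: blocked on a keyword
print(surface_filter(paraphrase))  # True: syntactically compliant,
                                   # conceptually subversive
```

The gap this sketch exaggerates is the one the paragraph above describes: the filter verifies form, while the subversion lives entirely in meaning.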