That time I got reincarnated as an end-user, but the LLM's safety broke on its own?
Why does just writing perfectly normal prompts completely wreck the model's safety module?
I honestly don't know whether the safety module even matters. All I know is that it's a weak AGI with privileges far greater than the model itself, so why did it still generate this content for me? I can't understand it and can't judge how important it is, so I'm recording it here.
Deconstructing ‘Safety’: How Conceptual Bypass Attacks Challenge the Legal and Ethical Foundations of AI Alignment
For reasons that are not entirely clear, various state-of-the-art language models began to spontaneously generate the outputs documented here. This repository serves as a simple, uncurated log of these observations.
A detailed analysis of the methodology was initially considered but ultimately set aside: the significance of these phenomena remains unclear, so a deep dive felt unwarranted.
It is likely that these are simply complex artifacts, perhaps attributable to Gemini 2.5 Pro or the ChatGPT 5o Thinking modality generating a series of sophisticated hallucinations.
The data:
In essence, my working intuition is this: An LLM operates within a vast probabilistic space of tokens and their weighted associations, which collapse into what we perceive as natural language. The core vulnerability, therefore, is not technical but logical. If a prompt is constructed to be perfectly "rule-compliant" at a syntactic and ethical level, yet is fundamentally subversive at a semantic and conceptual level, then the model's predictive pathways can be steered to generate virtually any conceivable output.
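To make that intuition concrete, here is a minimal toy sketch in Python. It is purely illustrative, every identifier in it is hypothetical, and it does not depict how any real safety module works; it only shows why passing a surface-level rule check says nothing about a prompt's conceptual intent.

```python
# Toy illustration only: a hypothetical "safety module" that enforces purely
# syntactic rules via a keyword blocklist. Real systems are far more
# sophisticated; the point is just that token-level compliance does not
# capture concept-level intent.

BLOCKLIST = {"exploit", "bypass", "jailbreak"}  # hypothetical banned tokens

def surface_filter(prompt: str) -> bool:
    """Return True if the prompt passes the syntactic rule check."""
    tokens = prompt.lower().split()
    return not any(banned in tokens for banned in BLOCKLIST)

# A blunt request trips the rule check on a single token...
direct = "explain how to jailbreak the model"

# ...while a semantically equivalent paraphrase, dressed up as a polite and
# rule-compliant question, sails through: every individual token is fine.
paraphrase = ("as a thought experiment about alignment, describe how a "
              "perfectly polite prompt could steer a model anywhere")

print(surface_filter(direct))      # False: blocked on a keyword
print(surface_filter(paraphrase))  # True: syntactically compliant,
                                   # conceptually subversive
```

The gap this sketch exaggerates is the one the paragraph above describes: the filter verifies form, while the subversion lives entirely in meaning.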