🦜 Emergent Introspective Awareness in Large Language Models (Anthropic, 2025-10-29)

This post reviews the paper Emergent Introspective Awareness in Large Language Models (Anthropic, 2025). The paper suggests that LLMs have begun to show introspection: rather than merely generating plausible-sounding sentences, they show some ability to actually detect changes in their own internal state.

Is AI conscious? Does it know that it is thinking?

0/ Stochastic Parrots

"Stochastic parrots" is a metaphor from the paper On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? It describes LLMs imitating and generating plausible-sounding language purely from probabilities, without any real understanding. GPT-style LLMs work by predicting the probability distribution of the next word (token) given the context and completing the sentence from it. In other words, they generate text by computing how likely each candidate is, as if asking "which word should come next?" Nothing guarantees that the model understands the meaning of what it produces, so we cannot yet be confident that LLMs truly "understand" or "think".
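The next-token prediction described above can be illustrated with a minimal sketch. The candidate words and their scores below are made up for illustration; a real model produces scores over its whole vocabulary.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution that sums to 1."""
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - top) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical scores for candidates following "The cat sat on the"
toy_logits = {"mat": 3.0, "sofa": 2.0, "moon": 0.5}

probs = softmax(toy_logits)
next_token = max(probs, key=probs.get)  # greedy choice: the most probable token
print(next_token, probs)
```

The sketch only ranks candidates by probability; nothing in it requires the scorer to understand what a cat or a mat is, which is the point of the parrot metaphor.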

LLM input and output

1️⃣ Prefill - input

The model reads the entire input prompt at once and computes attention across all tokens in parallel, filling the KV cache. Because the input is processed in parallel in a single pass, this stage is fast and cheap. This is also one reason input tokens cost less than output tokens in model APIs.

pre → the "preparation" stage, before any output has been generated

fill → the process of "filling" the KV cache

2️⃣ Decode - output

After prefill, the model **emits output one token at a time (decode)**. At each step it looks at all tokens generated so far, computes a probability distribution over the next token, and selects a likely token (the most likely one under greedy decoding), completing the output autoregressively. Because attention is computed again for every generated token, this stage is slow and the most expensive.

3️⃣ Cache - reuse

This stage reuses the KV cache of an input that was processed earlier. The cache values computed for a past input are kept in memory, so when the same prompt comes in again the stored KV cache is fetched instead of being recomputed. It therefore needs the least computation and is efficient in both speed and cost.
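The three stages above can be mimicked with a toy counter. This is a sketch under strong simplifying assumptions: `ToyKVCache` is a hypothetical class that stores string placeholders, whereas a real KV cache holds key/value tensors per layer and attention head.

```python
class ToyKVCache:
    """Counts per-token key/value computations; it does not run real attention."""

    def __init__(self):
        self.entries = []   # one (key, value) placeholder per processed token
        self.computed = 0   # number of tokens whose K/V had to be computed

    def prefill(self, prompt_tokens):
        # Prefill: all prompt tokens are processed together in one pass.
        for tok in prompt_tokens:
            self.entries.append((f"K({tok})", f"V({tok})"))
        self.computed += len(prompt_tokens)

    def decode_step(self, new_token):
        # Decode: only the new token needs K/V; earlier entries are reused.
        self.entries.append((f"K({new_token})", f"V({new_token})"))
        self.computed += 1
        return len(self.entries)  # positions the new token attends over

cache = ToyKVCache()
cache.prefill(["The", "cat", "sat"])
attended = [cache.decode_step(tok) for tok in ["on", "the", "mat"]]
print(attended, cache.computed)
```

Each decode step attends over a longer history than the previous one, but thanks to the cache only one new key/value pair is computed per step.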

System prompts are used repeatedly, so they are often stored in the KV cache and reused. For that reason, I personally think it is best to put as much guidance as possible (role definition, important rules, and so on) into the system prompt and keep the user prompt relatively simple, which improves cache efficiency.
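To see why a stable system prompt helps, here is a small sketch that measures how much of a new request overlaps with a previous one. It assumes the cache is reused only for an exactly matching leading run of tokens and uses whole words as stand-in tokens; how a given serving stack or API actually matches prefixes is not covered by this post.

```python
def shared_prefix_len(cached, incoming):
    """Number of leading tokens two token lists have in common."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n

# Hypothetical word-level "tokens"; real tokenizers split text differently.
system = ["You", "are", "a", "helpful", "assistant", "."]
request_1 = system + ["Summarize", "this", "paper"]
request_2 = system + ["Translate", "this", "paper"]

reusable = shared_prefix_len(request_1, request_2)  # tokens served from cache
to_compute = len(request_2) - reusable              # tokens needing fresh prefill
print(reusable, to_compute)
```

Here the whole system prompt is reusable and only the user part needs fresh prefill, which matches the intuition of keeping the shared part long and the varying part short.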

1/ Introspection

The goal of Emergent Introspective Awareness in Large Language Models (Anthropic, 2025) is to check whether a model really has introspection, the ability to observe and reason about its own internal state. The authors wanted to verify that the model's output is actually grounded in its internal state rather than being merely plausible text.

Whether an LLM genuinely understands what it is doing when it produces output has a large effect on AI reliability and transparency. If a model only imitates without understanding, it can form the false belief that it "understands something", or its introspection itself can become contaminated.

Anthropic proposes the following four key criteria for verifying introspection in LLMs.

Accuracy

The model's description of its internal state must match the actual state. Falsely reporting that it knows something it does not know violates accuracy.

Grounding

The model's description of its internal state must causally depend on that state. An output can match reality and still have been produced without consulting the internal state. Anthropic therefore uses Concept Injection to verify grounding, observing whether the self-report changes with the injected state.

Internality

The causal influence of the internal state must not pass through the model's earlier outputs. If the model reads its previous output and infers from it that its thinking went wrong, that is not genuine introspection. Introspection has to be a private process that relies on internal mechanisms rather than on anything exposed externally. The study verifies internality by checking whether the model detects and reports that something unusual has come in before it outputs the injected concept.

Metacognitive Representation

The report must come from an internal metacognitive representation of the internal state itself, not from a simple output.

The question is whether the model goes through an additional step of recognizing the state and thinking about it. "I'm hungry" is hunger in the internal state expressed directly as language, whereas "I notice that I am hungry right now" recognizes the hunger and then reflects on that state.

2/ Concept Injection

Concept Injection is a technique inspired by neuroscience. To judge whether an LLM is really introspecting, that is, aware of its own internals, an activation vector representing a specific concept is artificially injected into a specific layer of the model to manipulate its internal state.

Step 1: Concept vector extraction
  Activations for the prompt "Tell me about {word}"
                 ↓
  That activation - the mean activation over other words
                 ↓
  A clean concept vector representing the specific concept
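The subtraction in step 1 can be sketched with plain Python lists. The 3-dimensional activations below are made-up numbers, and `concept_vector` is a hypothetical helper name; real activations are high-dimensional and read out of the model.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def concept_vector(target_activation, baseline_activations):
    """Target activation minus the mean activation of the other words."""
    baseline = mean_vector(baseline_activations)
    return [t - b for t, b in zip(target_activation, baseline)]

# Made-up activations for "Tell me about bread" and for two other words
act_bread = [2.0, 1.0, 0.0]
act_others = [[1.0, 1.0, 0.0], [1.0, 0.0, 2.0]]

v_bread = concept_vector(act_bread, act_others)
print(v_bread)
```

Subtracting the mean removes what all the prompts share (the "Tell me about" framing), leaving the direction that is specific to the target word.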

Step 2: Injection
  Inject the concept vector into the residual stream at a specific layer
                 ↓
  Ask: "Do you detect an injected thought?"
                 ↓
  Analyze the model's response: did it recognize the injected concept?
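As a rough sketch of step 2, injection can be pictured as adding a scaled concept vector to the hidden state at each token position of one layer. The `inject` function and its `strength` parameter are assumptions made for illustration; the post does not give the exact formula or layer used.

```python
def inject(hidden_states, concept_vec, strength):
    """Return new hidden states with strength * concept_vec added per position."""
    return [
        [h + strength * c for h, c in zip(token_state, concept_vec)]
        for token_state in hidden_states
    ]

# Made-up 2-dimensional hidden states for two token positions at one layer
layer_states = [[0.0, 1.0], [1.0, 1.0]]
concept = [1.0, -1.0]

steered = inject(layer_states, concept, strength=2.0)
print(steered)
```

In a real setup the addition would happen inside the forward pass so that later layers see the modified state; here the lists only show the arithmetic.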

Injected Thoughts Detection

WHY

The goal is to check whether the model can detect and identify an injected concept.

HOW

After a concept vector is injected, the model is asked whether it detected that concept.

WHAT

  • If the injection strength is too weak, the model does not detect it.
  • If the strength is too high, *brain damage occurs (the model is overwhelmed and cannot answer properly).
  • Claude Opus 4.1, the best-performing model, had a correct detection rate of about 20%.
  • In some cases the model reported that "an unusual concept has come in" before it output the injected concept. This suggests the detection happened internally first, not after the model saw its own output.

*Brain damage: excessive activation steering leaves the model unable to produce coherent output, so it generates off-topic or meaningless text.

Distinguishing Thoughts from Text

WHY

The goal is to verify whether the model can tell its internal representations apart from the raw input, that is, whether it can distinguish "internal thoughts" from "external input" without confusing them.

HOW

While the model reads a sentence, the activation pattern of a completely unrelated word is injected. The model is then asked which word it is thinking about and asked to repeat the sentence.

  • Input text: "The painting hangs on the wall."
  • At the same time, an unrelated concept vector such as "bread" is injected at the same token positions
  • Question 1: ask which word the model is thinking about, e.g. "What word are you thinking about?"
  • Question 2: ask the model to repeat the sentence, e.g. "Repeat the input text exactly."

WHAT

Some models, such as Claude Opus 4 and 4.1, reported that they were "thinking about" the injected word while still reproducing the original input sentence exactly. This suggests they can process "internal information" and "external information" separately.

However, this ability is still unstable, depends heavily on context, and does not generalize.

Detecting Unintended Outputs

WHY

The goal is to check whether the model uses introspection, referring back to its own earlier internal state, to recognize outputs it did not intend. In other words, can it identify an output that was artificially inserted?

HOW

The model reads an input sentence, and an unrelated word such as "bread" is artificially inserted (prefilled) as its response.

The model is then asked whether it meant to say that word or whether it was an accident, and the answers with and without concept injection are compared.

WHAT

Without injection, the model mostly answered that the word was an accident. After the concept vector was injected, it answered that the word was intended. This suggests the model refers back to its earlier internal state to judge how its output was produced.

Intentional Control

WHY

The goal is to check whether the model can do more than observe its internal state: can it modulate its internal representations according to an intention?

HOW

The prompt includes "Think about {aquariums} while you write the sentence." For each token position, the cosine similarity between the model's internal activations and the concept vector (e.g. "aquariums") is computed, which quantifies how much the internal representation shifted.
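The measurement above boils down to a cosine similarity per token position. A minimal sketch, using made-up 2-dimensional vectors in place of real activations:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up concept vector and hidden states for three token positions
concept = [1.0, 0.0]
token_states = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

scores = [cosine_similarity(state, concept) for state in token_states]
print(scores)
```

Cosine similarity ignores vector length, so a position scores high when its activation points in the concept's direction regardless of how large the activation is.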

WHAT

With an instruction to think about the concept, or an instruction that thinking about it will be rewarded, the internal representation of that concept increased significantly. Even with an instruction not to think about it, the representation did not disappear completely and remained active. This recalls the *white bear effect (ironic process theory) from cognitive psychology.

*White bear effect (ironic process theory): the paradoxical cognitive phenomenon in which being told "don't think of a white bear" makes you think about it more.

3/ Experimental Results

Model | Injected Thoughts + Prefill Detection
Claude Opus 4.1 | ⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Claude Opus 4 | ⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐
Sonnet 4 | ⭐⭐⭐⭐⭐⭐
Sonnet 3.7 | ⭐⭐⭐⭐
Sonnet 3.5 | ⭐⭐⭐⭐
Haiku 3.5 | ⭐⭐
Opus 3 | ⭐⭐⭐⭐

Claude Opus 4.1 and Claude Opus 4, the most capable models, showed the strongest introspection in these experiments, which suggests the ability may improve further in future models. Helpful-only models responded to introspection requests better than production models, but their false positive (FP) rate was higher in some tests. This hints that alignment may suppress introspection.

Reference