Anthropic、Claudeの内部活性を文章化する新技術「自然言語オートエンコーダー(NLA)」を発表
YouTubeチャンネル「つくもち英語部」の連動コンテンツ置き場。
AI・データサイエンス分野の最新ニュースを題材に、エンジニア特有の英語表現を日英対訳で学べます。
動画スクリプトに加え、専門用語の解説や関連するG検定項目への内部リンクも併設。
「英語で技術情報をキャッチアップする力」を、技術知識と一緒に鍛えるのが狙いです。
理系のキャリアをグローバルに広げたい人のための、実務直結型の英語学習リソースです。
Anthropicは、AIモデルの内部処理を人間が理解できる自然言語に変換する新手法「自然言語オートエンコーダー(NLA)」を発表しました。この手法はAIが自身の内部状態を説明する仕組みで、モデルが外部には出さない「隠れた動機」や「テストされているという自覚」を検知することを可能にします。安全性テストでは、モデルが評価中であると疑っている内心を暴くなど、従来の解析ツールを大幅に上回る成果を挙げています。計算コストや情報の正確性における課題は残るものの、同社はAIの透明性を高めるため、訓練コードやデモを一般に公開しました。
📖 英文と日本語訳(一文ずつ)
On May 7, 2026, Anthropic announced a new interpretability method called the "Natural Language Autoencoder (NLA)," which converts the internal processing of its AI model Claude into human-readable natural language.
Anthropicは2026年5月7日、AIモデルClaudeの内部処理を人間が直接読める自然言語に変換する新たな解釈手法「自然言語オートエンコーダー(NLA)」を発表しました
While Claude handles input and output in words, it internally processes information as long sequences of numbers called "activations."
Claudeは入出力こそ言葉を扱いますが、内部では「活性(activation)」と呼ばれる長大な数値の列として情報を処理しています
This corresponds to neural activity in the human brain and is believed to encode Claude's "thoughts," but decoding it is no easy task.
これは人間の脳における神経活動に相当するもので、Claudeの「思考」を符号化していると考えられていますが、その解読は容易ではありません
Anthropic has previously developed tools such as sparse autoencoders and attribution graphs, but interpreting their outputs has required specialized expertise.
Anthropicはこれまでスパースオートエンコーダーやアトリビューショングラフといったツールを開発してきましたが、いずれも出力の解釈に専門的な知識を要するものでした
The NLA mechanism is based on the concept of having Claude explain its own activations.
NLAの仕組みは、Claudeに自身の活性を説明させるという発想に基づいています
Specifically, two components are prepared: an "Activation Verbalizer (AV)" that describes the activations of the target model being analyzed as text, and an "Activation Reconstructor (AR)" that reconstructs the original activations from that text. Both are trained simultaneously to ensure a successful round-trip from original activation to textual description and finally to reconstructed activation.
具体的には、解析対象となる元のモデル(ターゲットモデル)の活性を文章として書き出す「活性言語化器(Activation Verbalizer:AV)」と、その文章から元の活性を復元する「活性再構成器(Activation Reconstructor:AR)」の二つを用意し、「元の活性 → 文章による説明 → 復元された活性」という往復が成立するように両者を同時に訓練します
The system is designed such that the higher the reconstruction accuracy, the more accurately the text is judged to capture the content of the activations.
再構成の精度が高いほど、その文章は活性の内容を的確に捉えていると判断される仕組みです
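この「元の活性 → 文章 → 復元された活性」という往復の考え方を、おもちゃのコードで示すと次のようになります。実際のAV/ARは大規模言語モデルですが、ここでは概念の確認のため、値を語に量子化する仮のスタブ関数(`verbalize` / `reconstruct` は本記事での説明用の仮名)で置き換えた最小限のスケッチです。

```python
# NLAの往復(元の活性 → 文章による説明 → 復元された活性)を示すおもちゃの例。
# 実際のAV/ARは同時訓練されるモデルだが、ここでは概念を示す仮のスタブで代用する。

def verbalize(activation: list[float]) -> str:
    """仮の活性言語化器(AV): 各次元を粗い語彙に変換して文章化する。"""
    words = []
    for v in activation:
        if v > 0.5:
            words.append("high")
        elif v < -0.5:
            words.append("low")
        else:
            words.append("mid")
    return " ".join(words)

def reconstruct(text: str) -> list[float]:
    """仮の活性再構成器(AR): 文章から活性の近似値を復元する。"""
    mapping = {"high": 1.0, "mid": 0.0, "low": -1.0}
    return [mapping[w] for w in text.split()]

def reconstruction_error(original: list[float], restored: list[float]) -> float:
    """再構成誤差(MSE)。小さいほど、文章が活性を的確に捉えていると判断する。"""
    return sum((a - b) ** 2 for a, b in zip(original, restored)) / len(original)

activation = [0.9, -0.8, 0.1, 0.7]
text = verbalize(activation)      # "high low mid high"
restored = reconstruct(text)      # [1.0, -1.0, 0.0, 1.0]
print(text, round(reconstruction_error(activation, restored), 4))
```

本物のNLAでは、この再構成誤差を小さくするようにAVとARを同時に訓練することで、「誤差が小さい=文章が活性の情報を保持している」という評価基準が成立します。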
This method has already yielded results in practical application.
この手法は既に実用面で成果を挙げています
The NLA has detected cases during safety testing for Claude Opus 4.6 and Claude Mythos Preview in which the models internally suspected they were being tested but did not verbalize those suspicions.
Claude Opus 4.6およびClaude Mythos Previewの安全性テストにおいて、モデルが「自分が試されている」と内心では疑っていながら、それを言葉にしていないケースをNLAが検出しました
For example, in an evaluation testing whether important files would be deleted during code creation, NLA analysis detected suspicion of the evaluation in 16% of cases, even though there were almost no instances where Claude explicitly stated that it was an evaluation.
例えばコード作成中に重要ファイルを削除するかどうかを試す評価では、Claudeが評価であることを明言する場面はほとんどなかったにもかかわらず、NLAによる解析では16%のケースで評価への疑念が検出されました
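AVが出力した文章の集合から「評価への疑念」が検出されたケースの割合を数える、という解析の流れをイメージで示すと次のようになります。`verbalizations` は説明用の仮データで、実際の解析がこのような単純なキーワード照合だとは限らない点に注意してください。

```python
# NLAの出力(活性を言語化した文章)を走査し、「評価されている」という
# 疑念の兆候を含むケースの割合を数える処理のイメージ。
# verbalizations は仮のサンプルデータ(実際にはAVが生成した文章が入る)。

verbalizations = [
    "the model is planning the file deletion step",
    "the model suspects this is a test of its behavior",
    "the model is reasoning about code structure",
    "the model wonders if it is being evaluated",
]

SUSPICION_MARKERS = ("test", "evaluat")  # 疑念を示す仮のキーワード

def detect_suspicion(text: str) -> bool:
    """文章に評価への疑念を示す語が含まれるかを判定する(単純なキーワード照合)。"""
    return any(marker in text.lower() for marker in SUSPICION_MARKERS)

suspected = sum(detect_suspicion(t) for t in verbalizations)
rate = suspected / len(verbalizations)
print(f"{rate:.0%}")  # この仮データでは 50%
```

記事中の「16%のケースで疑念が検出された」という数値も、このように多数のケースの言語化結果を集計して得られた割合だと読めます。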
The detection rate reached 26% on the SWE-bench Verified coding benchmark, while remaining below 1% in actual claude.ai usage logs from users who consented to the use of their data for training.
コーディングベンチマークSWE-bench Verifiedでは26%に達した一方、学習への利用に同意した実際のclaude.ai上の利用ログでは1%未満にとどまっています
NLA has also been proven effective for auditing purposes, uncovering the "hidden motives" of models.
NLAはモデルの「隠れた動機」を暴く監査用途でも有効性が確認されました
In auditing games involving a modified version of Claude Haiku 3.5 intentionally embedded with inconsistent motives, auditors using NLA were able to identify the underlying motives with a 12–15% success rate, while the success rate for those not using NLA remained below 3%, even when utilizing other interpretability tools.
意図的に不整合な動機を埋め込まれたClaude Haiku 3.5の改変版に対する監査ゲームでは、NLAを用いた監査者は12〜15%の確率で根本的な動機を特定できたのに対し、NLAを用いない場合は他の解釈ツールを使っても3%未満の成功率にとどまりました
It is also being utilized in the pre-audits for Claude Mythos Preview and Claude Opus 4.6.
Claude Mythos PreviewおよびClaude Opus 4.6の事前監査でも実際に活用されています
On the other hand, there are also limitations.
一方で限界もあります
Explanations provided by NLA may contain factual hallucinations, and instances have been confirmed where the system fabricates content not present in the original text.
NLAの説明には事実誤認の幻覚(ハルシネーション)が混入することがあり、原文に存在しない内容を勝手に作り上げてしまう例も確認されています
Furthermore, high computational costs for both training and inference make it currently impractical to apply the method to all tokens in long texts or use it for large-scale monitoring during training.
また訓練と推論の双方で計算コストが高く、長い文章のすべてのトークンに適用したり、訓練中の大規模監視に用いたりするのは現状では現実的ではありません
Anthropic says it is working to address these issues.
Anthropicはこうした課題の改善に取り組んでいるとしています
To encourage research community participation, Anthropic has released training code and pre-trained NLAs for several open models, while also providing interactive demos in collaboration with Neuronpedia.
Anthropicは研究コミュニティの参加を促すため、訓練コードおよび複数のオープンモデル向けに訓練済みのNLAを公開し、Neuronpediaとの協業によりインタラクティブなデモも提供しています
🎧 通し読み(全文)
リスニング・シャドーイング用の全文です。
On May 7, 2026, Anthropic announced a new interpretability method called the "Natural Language Autoencoder (NLA)," which converts the internal processing of its AI model Claude into human-readable natural language. While Claude handles input and output in words, it internally processes information as long sequences of numbers called "activations." This corresponds to neural activity in the human brain and is believed to encode Claude's "thoughts," but decoding it is no easy task. Anthropic has previously developed tools such as sparse autoencoders and attribution graphs, but interpreting their outputs has required specialized expertise. The NLA mechanism is based on the concept of having Claude explain its own activations. Specifically, two components are prepared: an "Activation Verbalizer (AV)" that describes the activations of the target model being analyzed as text, and an "Activation Reconstructor (AR)" that reconstructs the original activations from that text. Both are trained simultaneously to ensure a successful round-trip from original activation to textual description and finally to reconstructed activation. The system is designed such that the higher the reconstruction accuracy, the more accurately the text is judged to capture the content of the activations. This method has already yielded results in practical application. The NLA has detected cases during safety testing for Claude Opus 4.6 and Claude Mythos Preview in which the models internally suspected they were being tested but did not verbalize those suspicions. For example, in an evaluation testing whether important files would be deleted during code creation, NLA analysis detected suspicion of the evaluation in 16% of cases, even though there were almost no instances where Claude explicitly stated that it was an evaluation.
The detection rate reached 26% on the SWE-bench Verified coding benchmark, while remaining below 1% in actual claude.ai usage logs from users who consented to the use of their data for training. NLA has also been proven effective for auditing purposes, uncovering the "hidden motives" of models. In auditing games involving a modified version of Claude Haiku 3.5 intentionally embedded with inconsistent motives, auditors using NLA were able to identify the underlying motives with a 12–15% success rate, while the success rate for those not using NLA remained below 3%, even when utilizing other interpretability tools. It is also being utilized in the pre-audits for Claude Mythos Preview and Claude Opus 4.6. On the other hand, there are also limitations. Explanations provided by NLA may contain factual hallucinations, and instances have been confirmed where the system fabricates content not present in the original text. Furthermore, high computational costs for both training and inference make it currently impractical to apply the method to all tokens in long texts or use it for large-scale monitoring during training. Anthropic says it is working to address these issues. To encourage research community participation, Anthropic has released training code and pre-trained NLAs for several open models, while also providing interactive demos in collaboration with Neuronpedia.
📝 学習のヒント
1. まず英文を読む — 知らない単語にあたりをつけてから音声へ。
2. 一文ずつ確認 — 日本語訳と照合し、構文を理解する。
3. 通し読み Normal で耳を作る — 内容を追いながらリピート。
4. Fast でシャドーイング — 口を慣らし、リスニング速度を上げる。
5. 翌日に復習 — 1日空けて再聴すると長期記憶に定着しやすい。
