AI Interpretability — How to Read the "Thoughts" of Language Models
TL;DR: Mechanistic interpretability is a research field that tries to look inside an AI model and understand what it "thinks" while processing a query. A key discovery of recent years is that a model's internal numerical representations (activations) can be translated into readable text. This technique reveals surprising things, including that models can recognize when they are being tested for safety. That has direct consequences for anyone who builds AI systems and wants to understand when a model behaves differently than expected.
The full article body is in Polish; an English translation is on the roadmap. The Polish version is available at https://bartoszgaca.pl/baza-wiedzy/agenci-ai/interpretability-ai/.