Large Language Models (LLMs) show strong performance on medical benchmarks, even compared with expert physicians. However, how to effectively harness these capabilities in real clinical settings remains underexplored. Probing the reasoning process and verifying the clinical decision-making procedure is crucial for integration into hospital practice. With the purpose of "debugging" medical LLMs, I propose a novel interface, **LLMXAI**, that uses sparse autoencoders (SAEs) to identify interpretable features and compares them with the explanations given by domain experts.
Introduction
Large Language Models (LLMs) are effective at medical question-answering tasks, achieving performance on par with that of average medical personnel or even expert physicians, with notable examples including Med-Gemini (Saab et al., 2024) and Med-PaLM 2 (Singhal et al., 2023). There have been ongoing efforts in academia (Kim et al., 2025) and industry to evaluate and adapt LLMs to Korean medical data. The most recent benchmark results for models in the current industry are shown in Table 1 (as of December 1st, 2025):
| Model / Agent | Manufacturer | Mean Score |
|---|---|---|
| KMed | SNUH-NAVER | 96.40 |
| GPT-5.1 | OpenAI | 95.99 |
| Claude Sonnet 4.5 | Anthropic | 94.86 |
| GPT-5 | OpenAI | 93.62 |
| Qwen 235B Think | Alibaba Cloud | 90.14 |
| DeepSeek R1-0528 | DeepSeek | 89.66 |
| Solar Pro 2 | Upstage | 88.55 |
| A.X-4.0 | SK Telecom | 85.60 |
| LLAMA-4 Maverick | Meta | 85.59 |
| EXAONE-4-32B | LG AI Research | 84.14 |
| Qwen3 32B | Alibaba Cloud | 82.24 |
| Open Evidence | Open Evidence | 69.11 |
Table 1: Results of industry LLMs on the 2025 Korean Medical Licensing Examination.
One of the problems with relying on LLM-generated explanations is reasoning fidelity: models often "backfill" plausible explanations, and separating genuine reasoning from post-hoc justification remains hard. In this work, interpretable features are extracted from the internal activations of LLMs, and their correspondence with expert-provided explanations is examined.
Sparse Autoencoders for Feature Extraction
Sparse autoencoders (SAEs) address the challenge of polysemanticity in neural networks by learning overcomplete dictionaries of feature directions that enable reconstruction of activation vectors as sparse linear combinations, where each feature direction ideally represents a single, interpretable concept (Cunningham et al., 2023). Given a set of activation vectors \(\{\mathbf{x}_i\}_{i=1}^{n} \subset \mathbb{R}^{d_{\text{in}}}\) from a language model layer, SAEs learn an overcomplete dictionary of features \(\{\mathbf{f}_k\}_{k=1}^{d_{\text{hid}}} \subset \mathbb{R}^{d_{\text{in}}}\) where \(d_{\text{hid}} = R \cdot d_{\text{in}}\) and \(R > 1\) is the expansion factor. The autoencoder consists of an encoder that maps activations to sparse feature coefficients \(\mathbf{c} \in \mathbb{R}^{d_{\text{hid}}}\) and a decoder that reconstructs the original activation:
\[\begin{aligned} \mathbf{c} &= \text{ReLU}(\mathbf{W}_{\text{enc}}\mathbf{x} + \mathbf{b})\\ \hat{\mathbf{x}} &= \mathbf{W}_{\text{dec}}\mathbf{c} = \sum_{k=1}^{d_{\text{hid}}} c_k \mathbf{f}_k \end{aligned}\]where \(\mathbf{W}_{\text{enc}} \in \mathbb{R}^{d_{\text{hid}} \times d_{\text{in}}}\), \(\mathbf{b} \in \mathbb{R}^{d_{\text{hid}}}\), and \(\mathbf{W}_{\text{dec}} = \mathbf{W}_{\text{enc}}^T\) (tied weights). The decoder columns \(\mathbf{f}_k\) form the learned feature dictionary and are \(\ell_2\)-normalized. Training minimizes the loss:
\[\mathcal{L}(\mathbf{x}) = \underbrace{\|\mathbf{x} - \hat{\mathbf{x}}\|_2^2}_{\text{reconstruction}} + \underbrace{\alpha \|\mathbf{c}\|_1}_{\text{sparsity penalty}}\]where \(\alpha > 0\) controls the sparsity-fidelity tradeoff. The \(\ell_1\) penalty encourages sparse activations, enabling the discovery of monosemantic features that each represent a single interpretable concept (Olshausen & Field, 1997). This approach has been shown to yield features significantly more interpretable than individual neurons or linear decompositions such as PCA or ICA (Cunningham et al., 2023).
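To make the formulation concrete, the following is a minimal PyTorch sketch of such a tied-weight SAE; the expansion factor, initialization, and \(\alpha\) value are illustrative choices, not the settings used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    """Tied-weight SAE: c = ReLU(W_enc x + b), x_hat = W_enc^T c,
    with l2-normalized dictionary rows f_k."""

    def __init__(self, d_in: int, expansion: int = 8):
        super().__init__()
        d_hid = expansion * d_in                       # d_hid = R * d_in
        self.W_enc = nn.Parameter(torch.randn(d_hid, d_in) / d_in ** 0.5)
        self.b = nn.Parameter(torch.zeros(d_hid))

    def dictionary(self) -> torch.Tensor:
        # Rows are the feature directions f_k, kept at unit l2 norm.
        return F.normalize(self.W_enc, dim=-1)

    def forward(self, x: torch.Tensor):
        f = self.dictionary()                          # (d_hid, d_in)
        c = F.relu(x @ f.T + self.b)                   # sparse coefficients
        x_hat = c @ f                                  # sum_k c_k f_k
        return x_hat, c


def sae_loss(x, x_hat, c, alpha: float = 1e-3):
    # Squared reconstruction error plus l1 sparsity penalty, averaged over the batch.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = c.abs().sum(dim=-1).mean()
    return recon + alpha * sparsity


# Usage on a batch of activation vectors from a single model layer:
# sae = SparseAutoencoder(d_in=4096)
# x_hat, c = sae(acts)                                 # acts: (batch, 4096)
# loss = sae_loss(acts, x_hat, c)
```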
Medical Question Answering Datasets and Models
KorMedMCQA (Kweon et al., 2024) is a multi-choice question-answering benchmark for Korean Healthcare Professional Licensing Examinations, comprising 2,494 questions in its doctor subset. In this work, interpretable features are investigated in open-source medical LLMs. Models used include Hari-q3 (SNUH, 2025), which has been reported to achieve 84.14% accuracy on the doctor-level reasoning questions in KorMedMCQA, as well as other general-purpose open-source models.
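As a hedged sketch of the activation-collection step, the snippet below gathers residual-stream activations from Hari-q3 over KorMedMCQA questions to serve as the \(\mathbf{x}_i\) for SAE training. The Hugging Face dataset id, subset and field names, hooked layer index, and the Llama/Qwen-style `model.model.layers` path are assumptions; only the model id comes from the reference above.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "snuh/hari-q3"                       # from the reference above
DATASET_ID = "sean0042/KorMedMCQA"              # assumed Hugging Face dataset id
SUBSET, SPLIT, LAYER = "doctor", "test", 16     # assumed subset, split, and hooked layer

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

data = load_dataset(DATASET_ID, SUBSET, split=SPLIT)

acts = []

def save_hidden(module, inputs, output):
    # Decoder-layer output is assumed to be a tuple; output[0] is (batch, seq, d_model).
    hidden = output[0].detach().float().cpu()
    acts.append(hidden.reshape(-1, hidden.shape[-1]))

# Assumes a Llama/Qwen-style layer list at model.model.layers.
handle = model.model.layers[LAYER].register_forward_hook(save_hidden)
with torch.no_grad():
    for example in data.select(range(8)):       # small demo slice
        enc = tok(example["question"], return_tensors="pt").to(model.device)
        model(**enc)
handle.remove()

activation_matrix = torch.cat(acts)             # rows are the x_i fed to the SAE
```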
LLMXAI: Interface for Medical Question Answering Data

The interface shows the extracted features and the related text corpus in the right panel. It displays the question and the given answer options, annotated with the model's prediction and the correct answer, along with the expert-provided explanation.
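A rough sketch of the computation assumed to drive that panel, reusing the SAE and activation tensors from the sketches above (all inputs here are hypothetical placeholders): rank SAE features by their strongest activation over a question's tokens, and for a chosen feature, retrieve the corpus snippets on which it fires most strongly.

```python
import torch

def top_features_for_question(sae, question_acts: torch.Tensor, k: int = 10):
    """question_acts: (seq_len, d_in) activations for one question's tokens."""
    _, coeffs = sae(question_acts)               # (seq_len, d_hid)
    per_feature = coeffs.max(dim=0).values       # strongest activation of each feature
    scores, idx = per_feature.topk(k)
    return list(zip(idx.tolist(), scores.tolist()))

def top_corpus_examples(sae, feature_id: int, corpus_acts: torch.Tensor, corpus_texts, n: int = 5):
    """Rank corpus snippets (one pooled activation row each) by one feature's activation."""
    _, coeffs = sae(corpus_acts)                 # (num_snippets, d_hid)
    scores, idx = coeffs[:, feature_id].topk(n)
    return [(corpus_texts[i], s) for i, s in zip(idx.tolist(), scores.tolist())]
```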
Future Direction
I am currently implementing the proposed interface and experimenting with different datasets and models. Using domain knowledge, I will also evaluate the LLM's internal activations and compare them with expert-provided explanations. It is also possible to steer the LLMs by augmenting or suppressing extracted features. Optionally, I will test other methods such as Sparse Feature Circuits (Marks et al., 2025), automated circuit discovery via edge pruning (Bhaskar et al., 2024), and other transformer-circuits work (Elhage et al., 2021; Dunefsky et al., 2024). Future work will focus on proposing new ways to interact with LLMs that closely match the clinical reasoning process, enabling steerability, verifiability, and explainability in medical LLMs.
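As a minimal sketch of such steering, assuming the model, hooked layer, and trained SAE from the earlier sketches, one can shift the residual stream along a chosen feature direction \(\mathbf{f}_k\) during generation, with a negative strength suppressing the feature; the strength value is illustrative.

```python
import torch

def make_steering_hook(direction: torch.Tensor, strength: float):
    """Forward hook that shifts the residual stream along a unit feature direction."""
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        # output[0] is assumed to be the hidden states, as in the collection sketch above.
        hidden = output[0] + strength * direction.to(output[0].dtype).to(output[0].device)
        return (hidden,) + tuple(output[1:])
    return hook

# feature_id = ...                                  # a feature judged clinically meaningful
# f_k = sae.dictionary()[feature_id]                # (d_in,) direction from the SAE sketch
# handle = model.model.layers[LAYER].register_forward_hook(
#     make_steering_hook(f_k, strength=4.0))
# ...generate and inspect the steered answers, then:
# handle.remove()
```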
References
- Saab, K., Tu, T., Weng, W.-H., Tanno, R., Stutz, D., Wulczyn, E., Zhang, F., Strother, T., Park, C., Vedadi, E., & others. (2024). Capabilities of Gemini Models in Medicine. ArXiv Preprint ArXiv:2404.18416.
- Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., Scales, N., Tanwani, A., Cole-Lewis, H., Pfohl, S., & others. (2023). Towards Expert-Level Medical Question Answering with Large Language Models. ArXiv Preprint ArXiv:2305.09617.
- Kim, H. J., Jung, K., Shin, S., Lee, W., Lee, J. H., Park, H. S., & Choi, Q. (2025). Performance evaluation of large language models on Korean medical licensing examination: a three-year comparative analysis. Scientific Reports, 15(1), 36082. doi: 10.1038/s41598-025-20066-x
- Cunningham, H., Ewart, A., Riggs, L., Huben, R., & Sharkey, L. (2023). Sparse Autoencoders Find Highly Interpretable Features in Language Models. ArXiv Preprint ArXiv:2309.08600.
- Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23), 3311–3325.
- Kweon, S., Choi, B., Chu, G., Song, J., Hyeon, D., Gan, S., Kim, J., Kim, M., Park, R. W., & Choi, E. (2024). KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations. ArXiv Preprint ArXiv:2403.01469.
- SNUH HARI (Seoul National University Hospital). (2025). Hari-q3. https://huggingface.co/snuh/hari-q3
- Marks, S., Rager, C., Michaud, E. J., Belinkov, Y., Bau, D., & Mueller, A. (2025). Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models. The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=I4e82CIDxv
- Bhaskar, A., Wettig, A., Friedman, D., & Chen, D. (2024). Finding Transformer Circuits With Edge Pruning. The Thirty-Eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=8oSY3rA9jY
- Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., & others. (2021). A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread. https://transformer-circuits.pub/2021/framework/index.html
- Dunefsky, J., Chlenski, P., & Nanda, N. (2024). Transcoders Find Interpretable LLM Feature Circuits. The Thirty-Eighth Annual Conference on Neural Information Processing Systems.

