Is ALMANACS a Simulatability Benchmark for Language Model Explainability?

Original title: ALMANACS: A Simulatability Benchmark for Language Model Explainability

Authors: Edmund Mills, Shiye Su, Stuart Russell, Scott Emmons

In this article, the authors discuss the challenge of measuring the effectiveness of language model explainability methods. While many methods have been developed, they are often evaluated on different tasks, making it difficult to compare their performance. To address this issue, the authors propose ALMANACS, a benchmark for evaluating language model explainability.

ALMANACS focuses on simulatability: an explanation is useful to the extent that it helps a predictor anticipate the model's behavior on new inputs. The benchmark covers twelve safety-relevant topics, such as ethical reasoning and advanced AI behaviors. The scenarios are designed to elicit idiosyncratic model behavior, and there is a distributional shift between the training and test questions, so that only explanations which faithfully capture the model's behavior should improve prediction.
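To make the simulatability setup concrete, here is a minimal sketch of how such a score could be computed: a predictor, given explanations of the model's behavior on training examples, estimates the model's probability of answering "Yes" on held-out questions, and the score is the divergence between the predicted and actual answer distributions. The function names, the choice of KL divergence, and the numbers are illustrative assumptions, not the benchmark's exact implementation.

```python
import math

def bernoulli_kl(p: float, q: float, eps: float = 1e-6) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), clamped for numerical stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def simulatability_score(predicted, actual) -> float:
    """Average divergence between predicted and actual P(yes); lower means the model was easier to simulate."""
    return sum(bernoulli_kl(a, p) for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical held-out questions for one topic (all values invented for illustration).
actual_probs = [0.91, 0.12, 0.55, 0.78]          # model's actual P(yes) on test questions
with_explanations = [0.85, 0.20, 0.50, 0.70]     # predictor given explanations of training behavior
without_explanations = [0.60, 0.45, 0.50, 0.55]  # no-explanation control predictor

print(simulatability_score(with_explanations, actual_probs))     # lower divergence: easier to simulate
print(simulatability_score(without_explanations, actual_probs))  # higher divergence
```

In this framing, an explanation method "works" if it drives the divergence below what the control predictor achieves without any explanations.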

Using ALMANACS, the authors evaluate several explanation methods, including counterfactual, rationalization, attention, and Integrated Gradients explanations. Unfortunately, none of these methods outperforms the no-explanation control when results are averaged across all topics. Developing an explanation method that genuinely aids simulatability in ALMANACS therefore remains an open challenge.
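Under the same assumptions as the sketch above, the comparison described here amounts to averaging per-topic divergences for each explanation method and for the no-explanation control; a method counts as helping only if its average is lower. The topic names and scores below are invented for illustration.

```python
def mean_score(scores_by_topic: dict) -> float:
    """Average a per-topic divergence (lower is better) across all topics."""
    return sum(scores_by_topic.values()) / len(scores_by_topic)

# Hypothetical per-topic divergences for one explanation method vs. the control.
with_explanations = {"moral_dilemmas": 0.21, "ai_goals": 0.34, "self_preservation": 0.28}
no_explanation_control = {"moral_dilemmas": 0.22, "ai_goals": 0.33, "self_preservation": 0.27}

improvement = mean_score(no_explanation_control) - mean_score(with_explanations)
print(improvement)  # near zero or negative: explanations did not help on average
```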

In conclusion, this article highlights the need for standardized benchmarks like ALMANACS to compare language model explainability methods. The results also indicate that further research is needed before explanation methods reliably help people understand and predict model behavior.

Original article: https://arxiv.org/abs/2312.12747