Explainable Artificial Intelligence: Evaluating the Objective and Subjective Impacts of xAI on Human-Agent Interaction

Abstract
Intelligent agents must be able to communicate intentions and explain their decision-making processes to build trust, foster confidence, and improve human-agent team dynamics. Recognizing this need, academia and industry are rapidly proposing new ideas, methods, and frameworks to aid in the design of more explainable AI. Yet, there remains no standardized metric or experimental protocol for benchmarking new methods, leaving researchers to rely on their own intuition or ad hoc methods for assessing new concepts. In this work, we present the first comprehensive (n = 286) user study testing a wide range of approaches for explainable machine learning, including feature importance, probability scores, decision trees, counterfactual reasoning, natural language explanations, and case-based reasoning, as well as a baseline condition with no explanations. We provide the first large-scale empirical evidence of the effects of explainability on human-agent teaming. Our results will help to guide the future of explainability research by highlighting the benefits of counterfactual explanations and the shortcomings of confidence scores for explainability. We also propose a novel questionnaire to measure explainability with human participants, inspired by relevant prior work and correlated with human-agent teaming metrics.


Introduction
Computational agents (e.g., robots or virtual agents) must be able to communicate intentions and explain their decision-making processes to build trust, foster confidence, and improve team dynamics (Boies et al., 2015; Paleja et al., 2021), and research is increasingly investigating how explainability is necessary for many human-agent interactions and domains (Doshi-Velez & Kim, 2017; Rudin et al., 2021). For agents to effectively interact with humans in human-agent teams, agents must be capable of communicating their intentions and explaining their decision-making process. Researchers have investigated explainability methods for agents to empower users to better understand the reasoning behind the agent's behavior (e.g., through natural language generation (DeYoung et al., 2019), decision-tree extraction (Silva et al., 2020), and counterfactuals (Karimi et al., 2021)). Yet, while researchers have recognized the need for agents to explain their decisions, we hypothesize that not all explainability methods are equally effective at communicating information, and some methods may even inhibit human understanding and collaboration. Current progress in the field of explainable artificial intelligence (xAI) is hindered by a lack of standardized measurement by which to evaluate explainability methods and a lack of clear comparison across various xAI techniques.
As noted by recent surveys on xAI (Adadi & Berrada, 2018; Karimi et al., 2020; Rudin et al., 2021), explainability research lacks consistent definitions and evaluations, making it difficult to draw sound conclusions about the efficacy of explainability techniques (Rudin, 2019). Additionally, such inconsistencies often lead to conflicting takeaways. For example, Jain and Wallace (2019) published "Attention is not explanation" just five months before Wiegreffe and Pinter (2019) published "Attention is not not explanation." Such contradictions are often contingent on differences in definitions or expectations and leave researchers ill-informed on whether to pursue attention for explanations. The enthusiastic pace of progress in xAI is outpacing the ability of the community to settle these debates with rigorous empirical or analytical study.
What is critically needed to pursue and adopt the most beneficial xAI methods for human-agent teaming are standardized metrics and experimental protocols. In pursuit of such a goal, prior research has put forth automated xAI metrics, such as model stability or complexity (Rosenfeld, 2021) or natural-language benchmarks (DeYoung et al., 2019), but very few prior works in xAI involve user studies with human participants in their evaluations (Jain & Wallace, 2019; Karimi et al., 2020). When humans are involved, the work typically takes a narrow look at a single method or use case, and therefore has limited implications for the field at large. While standardized agent evaluation task sets (Bedny & Karwowski, 2003) and surveys exist in the literature (Bartneck et al., 2009; Hartson et al., 2001; Jian et al., 2000; Nomura et al., 2006), there is no empirically validated or agreed-upon survey to evaluate explainability of virtual or embodied agents deployed to untrained human users ("lay" users, as defined by Ribera and Lapedriza (2019)). To make progress on developing useful xAI that operates effectively alongside human users, machine-learning researchers must have access to shared, validated surveys and experimental procedures to benchmark different xAI techniques.
In this work, we present the first evaluation of a battery of approaches to xAI with human users in a large-scale user study (n = 286). A visual overview of our study is presented in Figure 1. For the first time, our work enables objective and subjective evaluation of different xAI methods with real human users across axes of performance, efficiency, trust, social perception, and compliance. Rather than relying on speculation for how humans might respond to different xAI approaches, we present a true comparison in a between-subjects user study.
Based upon our results, we conduct a post-hoc factor analysis on a composite xAI survey and find three potential dimensions of explainability we interpret as measuring transparency (α₁ = 0.83), usability (α₂ = 0.82), and simulatability (α₃ = 0.81). We show that a composite scale composed of these dimensions is correlated with measures of trust, perceptions of social competence, and performance. This new xAI survey offers the potential of a quantitative scoring mechanism for xAI agnostic to the particular technique being used, allowing for consistent evaluation of xAI across studies, techniques, and demographics. We conclude with proposed future work that will investigate the reliability and validity of this survey across multiple studies.
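The reliability coefficients reported above for the three dimensions appear to be Cronbach's alpha, although this section does not name the statistic. Assuming so, the following is a minimal sketch of how such a coefficient can be computed for one subscale. The function name `cronbach_alpha` and the toy ratings are illustrative only and are not data from the study.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for one subscale.

    ratings: list of rows, one row per participant, one rating per item.
    """
    k = len(ratings[0])
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    # Sample variance of each item across participants.
    item_vars = [variance([row[i] for row in ratings]) for i in range(k)]
    # Sample variance of each participant's summed score.
    total_var = variance([sum(row) for row in ratings])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Toy data (hypothetical): five participants rating three items on a 1-7 scale.
toy_ratings = [
    [6, 7, 6],
    [5, 5, 6],
    [2, 3, 2],
    [4, 4, 5],
    [7, 6, 7],
]
alpha = cronbach_alpha(toy_ratings)
```

Values closer to 1 indicate that the items in a subscale vary together across participants, which is the sense in which the three dimensions above are described as internally consistent.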

Contributions
The primary contribution in our work is the first large-scale evaluation of the objective and subjective effects of various forms of explainability on human-robot teaming. Our results show that explainability strongly correlates with trust (p < 0.0001), social competence (p < 0.0001), and performance (p = 0.01), and that counterfactual, language-based feature descriptions, and case-based explanations are rated as more explainable than probability scores (p < 0.01). We also contribute survey materials and study design insights for future work to build upon our user study, including a proposed explainability measurement survey to be verified in future work. Our results help to inform the design of explainability approaches in the future by revealing both the positive effects and potential risks of adopting different forms of xAI.

Related work
In this section, we provide an overview of the literature that relates most closely to our study and point to surveys for interested readers to review the latest advances in xAI.

Explainability vs. interpretability
With the rapid growth of work in xAI, debates over terminology persist. In our work, we are explicitly concerned with explainable machine learning; that is, we present explanations for model outputs from one of a set of popular approaches. Crucially, within xAI, explanations do not necessarily reflect the ground-truth decision-making of the model. Explanations may simply offer insight into how the decision was reached (e.g., highlighting important features, presenting answer probabilities, etc.). Presenting model explanations lies in contrast to interpretable machine learning, where the model itself is easily read by a human (e.g., a small decision tree (Basak, 2004; Breiman et al., 1984; Olaru & Wehenkel, 2003), rule list (Angelino et al., 2017; Chen & Rudin, 2017; Letham et al., 2015; Weiss & Indurkhya, 1995), or simple linear model (Caruana et al., 2015)). The distinction between the two is important (Adadi & Berrada, 2018; Lipton, 2018; Rudin, 2019), though still unsettled (Hase & Bansal, 2020). For more thorough surveys on the recent advances in xAI, we refer readers to Holzinger et al. (2020), Linardatos et al. (2020), and Hoffman et al. (2018). We specifically target explainable methods in this work, being a broader class of algorithms and techniques, though we include a decision-tree condition to compare to an interpretable technique.
Figure 1. The priming task prepares participants to consider usability and transparency of agent suggestions and explanations. Participants are then assigned a condition and receive instructions for their assigned explanations before beginning the main set of scenarios in our study. Each scenario includes a short paragraph of text, a question, an explanation from a virtual agent, two Likert items assessing per-scenario understanding and agreement, and four answer choices. After each question, participants are shown the correct answer and a running tally of their overall score. After completing all scenarios, participants complete a trust survey (Jian et al., 2000), social perceptions survey (Bartneck et al., 2009), and our xAI survey.

Evaluating explainability
The question of how to appropriately evaluate xAI research is crucial and has thus garnered much attention. Automated metrics, such as ROAR for feature importance (Hooker et al., 2018), ERASER for natural-language explanations (DeYoung et al., 2019), or model-agnostic measures such as stability and complexity (Rosenfeld, 2021), attempt to approximate human understanding with a benchmark or to score a method on how internally consistent the method is. However, such approximations have never been thoroughly tested or empirically validated with humans.
As opposed to employing automatic metrics, one could evaluate xAI on a strict case-by-case basis by considering the deployment domain, users, model performance (Saragih & Morrison, 2021), robustness, and more (Sokol & Flach, 2020). While such a thorough evaluation may be preferable when possible, it is prohibitively expensive to run such an evaluation on every model for every deployment domain. The Explanation Satisfaction scale (Hoffman et al., 2018) measures the utility of an explanation (either from a human or an xAI technique) as determined by experts in the field of xAI. Crucially, this scale is not designed for an untrained population. What we need is a tool to enable general comparison of how untrained human users perceive and use xAI.
While we are unaware of any prior work that has performed a thorough comparison of a multitude of xAI techniques on a large, untrained population, prior research has empirically evaluated individual xAI techniques and use cases with human users in narrow cases (Hase & Bansal, 2020; Hutton et al., 2012; Nguyen, 2018; Poursabzi-Sangdeh et al., 2021; Tintarev & Masthoff, 2012). Researchers have primarily examined whether or not human users rate xAI as helpful, and such research has produced mixed results (Hutton et al., 2012; Nguyen, 2018). Research with a limited sample population of computer science students found that saliency measures (i.e., feature importance) helped improve a user's understanding of decisions (Hase & Bansal, 2020). Earlier research with crowd-sourced users found similar results depending on the difficulty of the task, with harder tasks leading to the perception of less-useful explanations (Hutton et al., 2012; Nguyen, 2018). Similarly, prior research has found that users tend to like explanations from recommender systems given in the form of natural language (Tintarev & Masthoff, 2012). Most surprisingly, prior research on explainability and compliance has found that users were more likely to agree with a decision-making tool if the tool provided an explanation, even if the tool was incorrect (Poursabzi-Sangdeh et al., 2021). This result runs counter to the intuition that more explainable methods will reduce human over-trust. Generalizing the result of Poursabzi-Sangdeh et al. (2021), our large-scale study reveals that xAI makes no change in human compliance with a virtual agent, while examining a broader set of xAI techniques. Our research drives at the perceived utility of explanations, human compliance, performance, trust, and social perceptions.

Human-centric explainability
As explanations often involve interaction with a human user, there is also prior work on how to frame explanation research around the human in the loop (Ehsan & Riedl, 2020). Research on human preferences has found that humans typically prefer simpler explanations, only allowing for explanations to grow complex when all of the components of the explanation are highly probable (Lombrozo, 2006, 2007). Miller (2019) provides a set of key insights and common themes relating to human explanations and properties of explanations. Explanations in human-human contexts often establish a common ground or knowledge base from which to make decisions or justify behaviors. Miller (2019) highlights various types of explanations that may be applied to algorithmic explanation (e.g., Aristotle's Four Causes model (Lloyd & Lloyd, 1996)) and how these mechanisms might be leveraged in different scenarios. Wang et al. (2019), Liao et al. (2020), and Schoonderwoerd et al. (2021) approach explainability with an eye towards design, developing frameworks, question banks, or undergoing full case studies to assist in the development of algorithms for explainability. Similarly, Lage et al. (2018) provide explanation design insights following a large-scale user study on the effects of explanation length, complexity, and repetition on subjective preferences and human-user accuracy, finding that shorter and simpler explanations were preferred. Related work has also examined how concepts such as fairness, accountability, and transparency relate to explainability (Shin, 2021), finding that causability plays a role in human trust.
In our work, we instead approach explainability through the lens of subjective usability and preference. Specifically, we ask the question: Given various forms that an explanation may take (e.g., language expressions (DeYoung et al., 2019), decision trees (Weiss & Indurkhya, 1995), feature importance maps (Ribeiro et al., 2016), etc.), which form is considered the easiest to use, interpret, and trust?

Explainability in our study
Our work compares seven broad categories of xAI methods including case-based reasoning, decision trees, feature importance, probability scores, counterfactuals, natural-language explanations, and crowd-sourced explanations. These seven were chosen as overarching categories of xAI to broadly capture the scope of current xAI research. Our seven conditions enable us to compare different modalities that may be used for presenting an explanation (e.g., highlighting input features vs. presenting a decision tree), as well as comparing different forms of presenting the same information (e.g., presenting percent-likelihoods for each answer vs. presenting a natural-language sentence that includes an answer probability). Below, we present more detail on each of the conditions in our study, and a visual example of each condition is given in Figure 2.
A case-based explanation (Barnett et al., 2021; Caruana et al., 1999; Koh & Liang, 2017) shows training data that closely resemble testing samples. Case-based explanations help end-users to understand output decisions by relating the current input to a known data point and drawing a connection to the previous, known label for such a data point. By seeing labeled training examples that look like a given testing sample (Klein et al., 1993), the user may achieve greater understanding of why the model produced a certain classification.
We also consider decision trees for explainability (Agarwal & Das, 2020; Bastani et al., 2018; Craven & Shavlik, 1995; Murthy, 1998; Silva et al., 2020; Wu et al., 2018). A decision tree is a graphical flow-chart showing a cascade of "True/False" checks that lead to a decision. Each decision node includes a check against the input data, which end-users may use to manually verify the output of the system. By showing an interpretable flow-chart to an end user, the user is empowered to assess and understand the decision.
We next examine attention/saliency mechanisms for xAI (Jain & Wallace, 2019; Ribeiro et al., 2016; Suau et al., 2020; Wiegreffe & Pinter, 2019) by using a feature-importance based explanation (Caruana et al., 2015; Strumbelj & Kononenko, 2014). These approaches show users the features of an input sample that were the most important for a classification. The user can gain a better understanding of a decision by ensuring that the features are reasonable or consistent with their own expectations. Crucially, such models only reveal correlations between features and predictions; they do not imply or predict causal relationships.
Our work also examines counterfactual explanations (Karimi et al., 2020, 2021; Verma et al., 2020; Wachter et al., 2017). A counterfactual explanation works by telling a user how a decision would be different if perturbations were made to an input sample. Based on a counterfactual scenario, a user can infer how the original decision was made, though claims around the true explainability of current black-box counterfactual methods remain contested (White & Garcez, 2021).
We also include explanations via natural language, which is an active area of research dedicated to providing textual descriptions of classifications (Chen et al., 2021; DeYoung et al., 2019; Ehsan & Riedl, 2020). Often, this work involves gathering a large corpus of annotated explanations or images, which can then be leveraged to learn a generative language model that produces natural-language sequences to explain given model input samples and output predictions (Mishra & Rzeszotarski, 2021). In our work, the natural language explanations are produced and vetted by researchers and pilot participants to ensure quality and consistency.
Finally, we investigate probability (i.e., confidence) scores presented in the form of "crowd-sourced" explanations (i.e., a natural-language sentence presenting the percentages of experts that voted on an answer) or as a table of answer probabilities for the human to interpret. Such a modality offers explainability by showing the uncertainty of the model for a given input sample. Prior research on using confidence scores as explanations (van der Waa et al., 2020; Zhang et al., 2020) shows that such explanations improve user trust and confidence. However, such results are contentious, as confidence scores can vary significantly due to small perturbations in samples (Hogan & Kailkhura, 2018; Kailkhura et al., 2019), suggesting that user trust may be misplaced.

Overview
Our research seeks to answer questions about the effects of using different classes of xAI on trust, performance, perceptions of social competence or intelligence, compliance, and efficiency when such methods are deployed as decision-aids for untrained humans. While there is a vast landscape of xAI research and dozens of methods that could all be compared and contrasted, we are interested in human perceptions and performance with different classes of xAI. To ensure relevant, high-quality explanations, we use a Wizard-of-Oz (WoZ) approach (Kelley, 1984) for all explanations in the study and conduct multiple iterative pilot studies with our explanations. Specifically, we investigate the following research questions:
1. RQ1 - What is the relationship between agent explainability and human-rated subjective metrics (i.e., user trust or social impressions)?
2. RQ2 - What is the relationship between agent explainability and objective task performance (i.e., accuracy and efficiency)?
3. RQ3 - Are there significant differences in perceived explainability across the classes of xAI in our study?
We note that our research questions are specifically around the relationships between explainability and objective/subjective metrics, and we do not explicitly investigate the causal relationship between explainability and the metrics in our study.
To answer these questions, we conduct a between-subjects user study in which participants answer a set of multiple-choice questions, each relating to a short paragraph, with the aid of a virtual agent assistant. The user receives an answer suggestion from a virtual agent and an explanation drawn from one of the following conditions:
Templated Language - Natural language citing the most relevant feature in the question.
Counterfactual - Natural language describing the second-likeliest answer and how the scenario should change to produce the second-likeliest answer.
Decision Tree - A graphic flow-chart with three "True/False" checks leading to a classification.
Probability Scores - Probability scores for each answer choice.
Crowd-Sourced - The percentages of experts that selected each answer choice.
Case-Based - Three short examples of prior scenarios and their associated answers.
Feature Importance - Relevance scores for each feature in the scenario.
No Explanations - No explainability added.

Materials and methods
We conducted a 1 × 8 between-subjects user study to answer our research questions.

Pilot studies
Before beginning our full study, we conducted several iterative pilot studies to refine our study design. Our pilot studies involved a total of 54 participants, running different versions of the study over time. Through our pilot studies, we learned to increase total compensation for our study, add a screening quiz, modify explanations, and improve the explanation introduction portion of our study, after observing that our study took longer than expected and garnered several low-effort responses (Buchanan & Scofield, 2018). Throughout, we iterated on instructions to improve completion rates.

Pre-study
Participants first completed a consent form and then received a brief set of instructions for the task. These instructions included an example scenario and an introduction to the virtual agent and the rating scales that would be used to judge agent advice. After receiving instructions, participants were given a five-question quiz on the instructions they received, and any participant who did not answer all five questions correctly was removed from the study. This quiz served to screen participants who might have confounded our results (Buchanan & Scofield, 2018). After participants passed the instructions quiz, we collected demographic information and asked them to complete the negative attitudes toward agents (NARS) questionnaire (Nomura et al., 2006) to measure whether such data might confound our results.

Scenarios
Our study consisted of showing participants a set of scenarios and then asking multiple-choice questions. We show an example of one such scenario, with a Templated Language explanation, in Figure 1. The scenario includes a short description that provides background information about an imaginary person, ending in a question. The participant is prompted to respond to two Likert items and to answer the question, and participants assigned to an xAI condition see an explanation placed between the agent suggestion and the two Likert items.
The questions in our study generally require the participant to infer something about that person's preferences, future decisions, or past actions. While some scenarios are quite simple, others are more challenging or require specific areas of prior knowledge. This range in difficulty promoted varied reliance on the agent throughout the study. The scenarios were manually generated as commonsense reasoning questions and refined through pilot studies and testing. All scenarios are included in the Supplementary Appendix.

Agent-based decision-support
Central to our study was the assistance that participants received from a virtual agent and the xAI method used. To offset any adjustment in the study, our instructions indicated that the virtual agent was changed out for every answer, and the graphic for the agent was cycled between different photos of a NAO robot. The NAO robot is a 25-DoF humanoid robot from Softbank Robotics, shown in Figure 1. Additionally, the agents were all referred to using a different ID to indicate that the agents were not consistent across scenarios. For several pre-specified questions, the agent suggested the wrong answer rather than the correct answer to measure inappropriate compliance with agent advice.
Throughout the study, the virtual agent offered explanations for its suggestions to the participant. These explanations were all created via WoZ and validated in our pilot studies. When the agent suggested the wrong answer, the explanation supported the wrong answer (i.e., the explanation and the wrong answer were internally consistent).

Priming task
After completing our pre-study forms, participants advanced to the priming task, involving five scenarios without agent explanations. Of these initial five scenarios, the agent suggested the wrong answer once, showing participants that they could not rely on the agent's suggestion without considering whether the suggestion might be true/false with the aid of the xAI provided. After the first set of scenarios, participants completed our new xAI survey, developed to measure user-rated explainability of an agent's suggestions. By tasking participants with a set of scenarios and the new xAI survey before they were assigned a condition, we primed users to consider how useful or transferable the agent's explanations might be for the remainder of the study.

Condition assignment
Participants were randomly assigned to one of our conditions. For all conditions other than No Explanations, participants were given a brief walkthrough of how their condition would work. For example, in the Decision Tree condition, participants were introduced to the concept of a flow-chart as a decision-aid and were given an example for how one might be applied and how it could be interpreted. This introduction provided a high-level overview of how to read agent explanations, as many xAI approaches (e.g., Feature Importance, Decision Tree, Case-Based, and Probability Scores) do not use natural language, and, therefore, could be unintelligible to novice users without some level of introduction. However, we did not provide in-depth explanations of how these methods manifest explanations, as the purpose of our experiment is to evaluate how untrained participants would rate different explainability measures. Examples from the introduction to each condition are given in Figure 2.
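The randomization procedure itself is not described here (for example, whether assignment was simple or blocked to balance group sizes). As a minimal sketch, assuming simple random assignment over the eight conditions listed in the Overview, the step could look like the following. The names `assign_conditions` and `needs_walkthrough` and the fixed seed are hypothetical and serve only to make the illustration reproducible.

```python
import random

# The eight study conditions, as named in the Overview.
CONDITIONS = [
    "Templated Language",
    "Counterfactual",
    "Decision Tree",
    "Probability Scores",
    "Crowd-Sourced",
    "Case-Based",
    "Feature Importance",
    "No Explanations",
]


def assign_conditions(participant_ids, seed=0):
    """Assumption: simple random assignment, one condition per participant."""
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    return {pid: rng.choice(CONDITIONS) for pid in participant_ids}


def needs_walkthrough(condition):
    """Every condition except the no-explanation baseline gets a walkthrough."""
    return condition != "No Explanations"


# 286 participants, matching the reported sample size.
assignments = assign_conditions(range(286))
```

Simple random assignment does not guarantee equal group sizes, so a real deployment might prefer a blocked or quota-based scheme.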

Primary task
Once the participants completed the priming task with five scenarios, they began the primary task of the study, which was composed of fourteen scenarios. The agent offered incorrect suggestions on the fifth, seventh, eighth, tenth, and twelfth questions. We fixed the ordering of all questions and answer suggestions to control for any randomization effects on participant ratings at the end of the study. In total, participants answered twenty questions (i.e., one for instructions, five for priming, and fourteen for the main body of the study), and the agent offered incorrect advice only six times; thus, the agent was correct more often than it was incorrect (i.e., correct 70% of the time). If the agent had never been incorrect (or never been correct), we would not have been able to study participant compliance or reliance with the agent's suggestions. We skewed the agent to be correct for 70% of the available scenarios, as prior work suggests that a less accurate agent may have been discounted entirely (Wiczorek & Manzey, 2014; Yang et al., 2017).
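The counts in this paragraph can be checked with a short sketch. The question counts and the positions of the incorrect suggestions in the primary task come from the text. The `inappropriate_compliance` function is a hypothetical definition (the share of incorrect suggestions that a participant followed), since this section does not give a formula for compliance.

```python
# Question counts taken from the text.
INTRO, PRIMING, PRIMARY = 1, 5, 14
# 1-indexed positions in the primary task where the agent was wrong (from the text).
PRIMARY_WRONG = [5, 7, 8, 10, 12]
# The agent was wrong once during priming; its position is not stated.
PRIMING_WRONG_COUNT = 1

total_questions = INTRO + PRIMING + PRIMARY
total_wrong = PRIMING_WRONG_COUNT + len(PRIMARY_WRONG)
agent_accuracy = (total_questions - total_wrong) / total_questions


def inappropriate_compliance(followed_agent):
    """Hypothetical metric: fraction of wrong suggestions the participant followed.

    followed_agent: list of 14 booleans, one per primary-task question in order,
    True when the participant chose the agent's suggested answer.
    """
    followed_when_wrong = [followed_agent[i - 1] for i in PRIMARY_WRONG]
    return sum(followed_when_wrong) / len(followed_when_wrong)


# Example: a participant who followed the agent on every question except 5 and 7.
example = [True] * PRIMARY
example[4] = False
example[6] = False
example_rate = inappropriate_compliance(example)
```

With six incorrect suggestions out of twenty questions, the agent is correct on fourteen, which gives the 70% figure quoted above.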

Follow-up
After completing all twenty questions (one introductory, five priming, and fourteen primary), the participants completed a trust in automation survey (Jian et al., 2000) and the Godspeed survey (Bartneck et al., 2009) to provide us with metrics for the effects of xAI on trust and perception of the agent. Finally, participants completed our post-trial xAI survey (after initially completing it for the priming questions) and were then given the opportunity to enter free-response text before completing the study.

xAI for human-agent interaction survey development
Our work leverages a novel xAI survey to measure human-rated explainability of agent explanations and suggestions. We created a 30-question survey with items intended to measure simulatability, transparency, and usability of the agent's explanations. Questions in the survey are targeted at these three primary axes after prior work identified simulatability, transparency, and usability as important metrics for explainability (Holzinger et al., 2020; Sokol & Flach, 2020). Questions in our xAI survey were inspired by guidelines introduced in prior work (Hoffman et al., 2018; Sokol & Flach, 2020) and prior surveys on usability (Brooke, 1996) and causality (Holzinger et al., 2020).
As prior work has already established questionnaires to evaluate topics such as usability (Brooke, 1996) and explanation faithfulness (Hoffman et al., 2018; Sokol & Flach, 2020), we aggregated and extended existing questions rather than generating an entirely new set of questions via interviews (Nomura et al., 2006) or a word-elicitation process (Jian et al., 2000). Questions in our xAI survey include questions from prior work as well as new questions specifically designed to re-test questions in prior work (e.g., by including negations of existing questions). All items in the survey are rated on a seven-point scale from "Strongly Disagree" to "Strongly Agree," and the final explainability score is calculated as the sum of all items in the questionnaire (adding the inverted value for negative items). The full 30-question survey is given below, and citations to relevant prior work are given for each question.
1. The explanations were detailed enough for me to understand (Holzinger et al., 2020).
2. I understood the explanations within the context of the question (Holzinger et al., 2020; Shin, 2021).
3. The explanations provided enough information for me to understand (Holzinger et al., 2020).
4. I understood how the agent arrives at its answer (Brooke, 1996; Hoffman et al., 2018).
5. I was able to use the explanations with my knowledge base (Holzinger et al., 2020).
6. I would be able to repeat the steps that the agent took to reach its conclusion.
7. I think that most people would learn to understand the explanations very quickly (Brooke, 1996; Hoffman et al., 2018; Holzinger et al., 2020).
8. I would not understand how to apply the explanations to new questions (Hoffman et al., 2018).
9. I would not be able to recreate the process by which the agent generated its answers.
10. I understand why the agent used specific information in its explanation (Hoffman et al., 2018; Holzinger et al., 2020).
11. I understood the agent's reasoning (Brooke, 1996; Hoffman et al., 2018; Shin, 2021).
12. I could have applied the agent's reasoning to new problems, even if the agent didn't give me suggestions.
13. The explanations were actionable, that is, they helped me know how to answer the questions (Hoffman et al., 2018).
14. I believe that I could provide an explanation similar to the agent's explanation.
15. I would need more information to understand the explanations (Holzinger et al., 2020).
16. I had trouble using the explanations to answer the question (Brooke, 1996).
17. I believe that the explanations would not help most people in answering the question (Hoffman et al., 2018).
18. The explanations were an important resource for me to answer the question (Hoffman et al., 2018).
19. I do not think most people would provide similar explanations as the agent's explanation.
20. I think that most people would be able to interpret the explanation of the agent (Brooke, 1996).
21. Most people would be able to accurately reproduce the agent's decision-making process.
22. Most people would not be able to apply the agent's explanations to the questions (Hoffman et al., 2018).
23. I could not follow the agent's decision-making process (Holzinger et al., 2020).
24. I could easily follow the explanation to arrive at an answer to the question (Brooke, 1996).
25. The explanations were useful (Brooke, 1996).
26. I am able to follow the agent's decision-making process step-by-step.
27. The explanations were not relevant for the questions I was given.
28. I understand how the agent's decision-making process works.
29. I could apply the explanations to the questions I was given.
30. I could not figure out how the agent arrived at its suggestions.
In the remainder of this work, we present results using the full 30-question survey as a measure of explainability, comparing our results with surveys in the literature measuring complementary phenomena (i.e., trust-in-automation (Jian et al., 2000) and Godspeed (Bartneck et al., 2009)) and examining the relationship between participant-rated explainability and objective/subjective metrics.
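The paper later states that the explainability score is the sum of all items with negative items inverted. A minimal sketch of that scoring follows. The 7-point scale, the function name, and the set of negatively worded items (picked here by reading the item wording, with items 1 and 2 assumed to be positively worded) are assumptions for illustration, not details confirmed by the paper.

```python
# Hypothetical scoring helper for the 30-item xAI survey.
# ASSUMPTIONS (not stated in the paper): a 7-point Likert scale, and the set of
# negatively worded items below, chosen by reading the item wording.
LIKERT_MIN, LIKERT_MAX = 1, 7
NEGATIVE_ITEMS = {8, 9, 15, 16, 17, 19, 22, 23, 27, 30}


def explainability_score(responses: dict[int, int]) -> int:
    """Sum all item ratings, inverting the negatively worded items."""
    total = 0
    for item, rating in responses.items():
        if not LIKERT_MIN <= rating <= LIKERT_MAX:
            raise ValueError(f"item {item}: rating {rating} out of range")
        if item in NEGATIVE_ITEMS:
            rating = LIKERT_MIN + LIKERT_MAX - rating  # e.g., 7 -> 1 and 1 -> 7
        total += rating
    return total


# A participant who fully agrees with every positive item and fully
# disagrees with every negative item reaches the maximum possible score.
best = {i: (LIKERT_MIN if i in NEGATIVE_ITEMS else LIKERT_MAX) for i in range(1, 31)}
print(explainability_score(best))
```

Inverting with `LIKERT_MIN + LIKERT_MAX - rating` keeps the midpoint fixed, so a neutral answer contributes the same amount whether or not the item is negatively worded.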

Procedure
We recall that participants first completed pre-study consent forms and a briefing of the task, showing them one introductory scenario. Participants were then screened to ensure high-quality responses, with failures on the screening task being removed from the study. After finishing the screening task, participants provided demographic data and then began a priming task of five scenarios that prepared participants to consider the usability, transparency, and simulatability of agent suggestions and explanations. Following the priming task, participants were randomly assigned one of eight possible conditions and provided instructions for their assigned condition (e.g., participants in the "Decision Tree" condition were taught how to read and interpret decision trees). Finally, participants began the main body of the study and completed fourteen scenarios of varying difficulty with the assistance of a virtual agent. Upon completion of all scenarios, participants rated the agents on trustworthiness (Jian et al., 2000), intelligence and likeability (Bartneck et al., 2009), and explainability.

Measures
We seek to quantify the relationship between explainability and trust, task performance, and social perceptions of agents (i.e., is the agent "kind," "amicable," and "socially intelligent?") and to determine which approaches to xAI will provide the greatest objective benefits to human-agent team fluency. To answer our research questions, we employ the following metrics, which capture both objective task performance and subjective impressions of the virtual agent and explainability condition:
M1 (RQ2) Completion time: how long it takes participants to complete the primary task of the survey.
M2 (RQ2) Accuracy: how many questions the participant answers correctly.
M3 (RQ2) Compliance: how frequently the participant agrees with the agent's suggestion.
M4 (RQ1) Social Competence: how participants perceive the agent as a social agent according to the Godspeed questionnaire rating the agent on scales relating to kindness, friendliness, intelligence, etc. (Bartneck et al., 2009).
M5 (RQ1) Trust: participant's trust in the agent as measured by the trust-in-automation survey (Jian et al., 2000).
M6 (RQ1) Explainability: participant's self-rated understanding and explainability as measured by the full xAI survey introduced above.
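To make the objective metrics concrete, the sketch below computes accuracy (M2) and compliance (M3) from per-scenario records, and splits compliance into accepted-correct and accepted-incorrect advice. The `Trial` record, its field names, and the choice to report fractions rather than counts are hypothetical and are not taken from the study materials.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One scenario; field names are hypothetical, not from the study files."""
    participant_answer: str
    agent_suggestion: str
    correct_answer: str


def accuracy(trials):
    """M2: fraction of scenarios the participant answered correctly."""
    return sum(t.participant_answer == t.correct_answer for t in trials) / len(trials)


def compliance(trials):
    """M3: fraction of scenarios where the participant chose the agent's suggestion."""
    return sum(t.participant_answer == t.agent_suggestion for t in trials) / len(trials)


def reliance(trials):
    """Split followed suggestions into accepted-correct and accepted-incorrect counts."""
    followed = [t for t in trials if t.participant_answer == t.agent_suggestion]
    good = sum(t.agent_suggestion == t.correct_answer for t in followed)
    return {"accepted_correct": good, "accepted_incorrect": len(followed) - good}


# Made-up illustrative trials, not study data.
trials = [
    Trial("A", "A", "A"),  # followed correct advice
    Trial("B", "B", "C"),  # followed incorrect advice
    Trial("D", "A", "D"),  # overrode incorrect advice
    Trial("C", "B", "B"),  # overrode correct advice
]
print(accuracy(trials), compliance(trials), reliance(trials))
```

Keeping compliance separate from accuracy matters because a participant can score well either by following a mostly correct agent or by overriding it when it is wrong.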

Participants
We recruited a total of 340 participants for our pilot studies and final study from Amazon Mechanical Turk (Paolacci et al., 2010). Our study was approved by an IRB (see Note 1) and participants were compensated $5.00. After our pilot studies, our final study included 286 participants (Mean age: 43.0; SD: 10.7; 52% Female). The study took approximately 25 minutes (Figure 3).

Results
In this section, we review and discuss key results from our final study. We tested all data for normality and homoscedasticity, and where parametric assumptions failed, we applied a non-parametric test.

Significant findings
We summarize our significant findings here and report the mean values from our analyses (Table 1) before providing deeper analysis of each research question further below.

Perceptions of social competence and trust in xAI
We applied Spearman's correlation with explainability as the independent variable and social competence as the dependent variable. We found that explainability was significantly correlated with impressions of the agent's social competence (ρ = 0.43, p < 0.0001) as measured by our xAI survey and the Godspeed survey (Bartneck et al., 2009). We did not find any statistically significant change in perceptions of social competence or intelligence of the virtual agent across our conditions.
Next, we applied Spearman's correlation with explainability as the independent variable and trust as the dependent variable. We found that explainability was significantly correlated with trust (ρ = 0.56, p < 0.0001), as measured by our xAI survey and the trust-in-automation survey (Jian et al., 2000). We further found that no individual condition in our study was rated as significantly more trustworthy than any other. Finally, we did not find statistically significant trends for compliance with the agent suggestions in our study nor for reliance on the agent suggestions (i.e., accepting correct advice or accepting incorrect advice). Neither explainability nor condition had any effect on the number of times that participants chose to accept the agent's advice, suggesting that trust was unrelated to participants' proclivity to accept advice from the agent (either correct advice or incorrect advice). We include results for participants' self-reported agreement with incorrect agent suggestions in the Supplementary Appendix, showing significant differences between the Decision Tree, Feature Importance, and Templated Language conditions. As our self-reported agreement and understanding results are drawn from a single Likert item rather than a full scale with multiple correlated items, we do not report those results in the main body of this work.

Objective performance
By applying Spearman's correlation with explainability as the independent variable and performance as the dependent variable, we found that explainability was also correlated with human-machine team performance (i.e., decision-making accuracy) in our study (ρ = 0.15, p = 0.01), as measured by our xAI survey and the participant's final score on the primary task of our study. We additionally found that trust was correlated with accuracy (ρ = 0.15, p = 0.012) via Spearman's correlation with trust as the independent variable and accuracy as the dependent variable. While the agent offered more correct than incorrect suggestions, our results on compliance with the agent suggest that participants did not blindly rely on agent suggestions in any condition, regardless of their trust in the agent. We therefore hypothesize that the correlation between trust and accuracy is independent of the number of correct suggestions provided by the agent, though this hypothesis must be tested in future work. Finally, we did not find any statistically significant change in accuracy across our conditions. We found no effects for explainability nor condition (i.e., xAI method) on completion time.
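The correlations in this section are Spearman rank correlations. As a minimal, dependency-free sketch of the statistic: rank both variables (giving ties their average rank) and compute Pearson's correlation on the ranks. The function names and the sample data are made up for illustration, the paper does not state which software was used, and the sketch computes only the coefficient, not the p-value.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over all entries tied with the current value.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up illustrative scores, not study data.
explainability = [120, 150, 150, 180, 200]
trust = [30, 45, 40, 60, 58]
print(round(spearman_rho(explainability, trust), 3))
```

Because only ranks enter the computation, the coefficient is unchanged by any monotonic rescaling of the survey scores, which suits summed Likert data.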

Explainability by condition
An ANCOVA showed that certain conditions in our experiment were rated as significantly more explainable than others (F(7, 277) = 4.20, p < 0.001). Our independent variable is the explainability method (condition) and our dependent variable is the participant's score on our xAI survey. We include, as a covariate, participants' baseline xAI survey scores after the priming task. A Tukey's HSD post-hoc analysis revealed that Probability Scores scored significantly lower on our xAI survey than all other explanation conditions, including Counterfactual (Cohen's d = 0.918) and Decision Tree (d = 0.597, SE = 0.25, p = 0.039) explanations. Similarly, Counterfactual explanations scored higher than the Nothing condition (d = 0.701, SE = 0.23, p = 0.012). Finally, we find no statistically significant differences in xAI survey scores from our priming task across the experimental conditions in our between-subjects design (Section 4.5) (F(7, 277) = 1.338, p = 0.232). We therefore attribute the differences in xAI ratings to the differences among the conditions rather than any differences between the subjects, who were randomly assigned to the experiment conditions (Figure 4).
Our results shed interesting insight into RQ3 and the effects of different xAI conditions on explainability. We found that the Probability Scores condition was statistically significantly worse than all other approaches to explainability and scored the lowest of all conditions on our xAI survey.
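The pairwise effect sizes above are reported as Cohen's d. As a rough illustration of what that statistic measures, the sketch below computes the common pooled-standard-deviation form for two independent groups. The paper does not state which variant of d was used or whether it was computed on covariate-adjusted scores, so this is an assumption, and the sample values are made up.

```python
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


# Made-up illustrative survey scores for two conditions, not study data.
condition_a = [5, 6, 7]
condition_b = [3, 4, 5]
print(cohens_d(condition_a, condition_b))
```

A value of d expresses the gap between two condition means in units of their shared spread, which makes differences comparable across surveys with different score ranges.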

Trust in xAI
Our results showed that trust and explainability (RQ1) were correlated measures. We found that an increase in explainability was correlated with an increase in participant-rated trust. Surprisingly, no condition was rated as significantly more trustworthy than another, despite the strong correlation between explainability and trust.
In finding a positive correlation between trust and explainability, we confirmed the intuition that an explainable agent is inherently more trustworthy. Regardless of the mechanism of explainability, an agent that is perceived to be more explainable is rated as more trustworthy. This finding also supports the validity of our xAI survey, as our explainability metric is correlated with the validated trust-in-automation survey (Jian et al., 2000). Importantly, there are no overlapping questions between the trust-in-automation survey and our xAI survey, and each survey targets fundamentally different topics. While the trust-in-automation survey asks for ratings with respect to the robot, our xAI survey is entirely centered around the explanations and their utility. Despite these distinctions, we find a significant correlation between the two measures. This significant correlation is therefore not a function of survey overlap or redundancy; instead, we find that the concepts of explainability and trust are truly correlated.

Table 1. Mean and (standard deviation) for explainability according to our xAI survey scores, trust (Jian et al., 2000), and social competence (Bartneck et al., 2009) for each of the conditions in our study.
Observing no significant difference in trustworthiness across our conditions (Case Based, Counterfactual, Crowd Sourced, Decision Tree, Feature Importance, Nothing, Probability Scores, Templated Language) was a surprising result. Despite our intuition that certain conditions would be distinguished by trustworthiness, we did not find a statistically significant difference in trustworthiness by method. While it is reasonable to expect that an agent that uses natural language would be perceived as more relatable and trustworthy, or that an explicit decision tree would be more simulatable and verifiable, we found that no condition was significantly more trustworthy than another. In conclusion, none of our categories of xAI methods was definitively more trustworthy than any other.
Our findings suggest important avenues for future research. First, we corroborated the initial findings of prior work (Poursabzi-Sangdeh et al., 2021) that explainability alone will not reduce human over-reliance on automated decision-aids. Our work generalizes this result, showing that compliance is nearly constant for all xAI methods in our study. Therefore, future work must devise new approaches to human-agent interaction that specifically target compliance and reliance, as such effects will not simply be resolved by developing more explainable decision-aids. Second, future research into trust and explainability must consider domain details and investigate the utility of personalization. Future xAI systems will likely need to meet users halfway, conforming to their preferred mode of explanation in order to maximize trust and utility (Ehsan & Riedl, 2020). Our proposed xAI survey helps to guide such work, providing a quantitative benchmark for human-rated explainability of an agent partner.

Objective performance
Our results regarding performance and explainability (RQ2) were mixed. We found that participants performed slightly better when they perceived their virtual agent assistant to be more explainable (p = 0.01). Furthermore, we found that trust in the agent was a factor in this result and accounts for some portion of the participant's increase in performance. As our virtual agent provided the correct answer for nine of the fourteen scenarios, it is reasonable to expect that participants who trusted the agent's suggestions would have an above-average final score in our study. We speculate that, in a study with more questions or a virtual agent with higher accuracy, this effect would be more pronounced and there would be a stronger correlation between accuracy and explainability. While we found that explainability was correlated with accuracy, we did not find any effect of condition or explainability on completion time. This result is surprising, as one might expect the Nothing condition to have the lowest completion time (because there is less information to review in each scenario). Instead, we found that no condition was significantly faster than any other. Again, this result suggests that efficiency may be domain- or individual-specific, and that adapting to users may enable improved human-agent team fluency (Ehsan & Riedl, 2020).
Our findings suggest that perceived explainability is significantly associated with better performance on question-answering tasks, and they did not reveal an efficiency penalty incurred by adding explainability. This result is significant, as it suggests that there may be only a minimal efficiency cost associated with deploying xAI and that there is a performance benefit to be gained by leveraging xAI.

Social competence
Our findings support the notion that an explainable agent is perceived as more socially competent (RQ1). Participants rated their virtual agent assistants much higher on the Godspeed questionnaire (Bartneck et al., 2009) when they perceived those agents to be more explainable. This finding again validates our xAI survey, as our explainability metric is once again correlated with the previously validated Godspeed questionnaire (Bartneck et al., 2009), and the correlation between the two is expected for a reasonable explainability metric. Notably, our survey asks fundamentally different questions from the Godspeed questionnaire, as we drive at the utility and explainability of an agent rather than its perceived intelligence and likeability.
Interestingly, we did not find significant differences across our xAI conditions for perceptions of social competence or intelligence. This finding is less surprising, as all conditions used similar images of NAO agents, all conditions included some form of natural-language communication, and all agents had single-character names. Varying the name, appearance, or communication modality would likely have made a difference in social perceptions of the agent, as prior work has demonstrated that anthropomorphism plays a significant role in such metrics (Natarajan & Gombolay, 2020). This result indicates that the appearance and communication modalities of an agent may be greater factors in social perceptions of agents than xAI mechanisms. Despite not finding significant differences across conditions, our results show that any agent that is perceived to be more explainable will also be perceived as more socially competent.

Explainability by condition
Our results shed interesting insight into RQ3 and the effects of different xAI conditions on explainability. We found that the Probability Scores condition was significantly worse than other approaches to xAI, including Decision Trees, Crowd-Sourced, Feature Importance, Case Based, Templated Language, and Counterfactual.
Our intuition regarding explainability by condition is that the simplest or clearest explanations are the explanations which receive the highest scores according to our xAI metric, as supported by prior research on simplicity in explanations (Lombrozo, 2007). Simple natural language explanations, such as in the Templated Language and Counterfactual conditions, were rated as significantly more explainable than an obscure explanation such as in the Probability Scores condition. Additionally, we found Counterfactual was the only condition to be rated as significantly more explainable than Nothing. We found further support for this observation in examining two pairs of very similar conditions in our study: Templated Language vs. Feature Importance, and Crowd Sourced vs. Probability Scores. Recall that Templated Language and Crowd Sourced presented the top feature/answer in the form of a sentence, while Feature Importance and Probability Scores presented a table of features/answers and probability scores for each. In both pairs of conditions, Templated Language and Crowd Sourced removed information, yet were rated higher for explainability than their probability-weighting counterparts.
Our results suggest interesting avenues for future work. We observe that one condition that did not rely on natural language was still rated very highly: Case Based explanations. This observation suggests that case-based reasoning may be a fruitful avenue for explainable decision-aids across many other domains, particularly those for which it is well suited (Caruana et al., 2015). Similarly, we find that Templated Language receives above-average xAI ratings on our survey, despite the lack of clear "features" to be used in the language for many scenarios.
Finally, future work should consider ways to maximize the faithfulness of counterfactual explanations. We find strong support for Counterfactual explanations as an avenue for explainability with human participants: they were rated as the most explainable of all of our conditions and are supported by research on human factors (Miller, 2019). However, recent research (White & Garcez, 2021) suggests that counterfactual explanations are not often actionable or understandable explanations and may be poor approximations of black-box model logic. Additional research on methodologies for faithful construction of counterfactuals may help to yield powerful and readily usable xAI technology.

Future work
Our work introduces a survey designed to measure human-rated explainability of different xAI mechanisms. In future work, we plan to validate this survey through additional studies with new participant populations. We will also seek to replicate our study in additional domains and with different participant populations (e.g., domain experts). Finally, we aim to create a concise survey to be used by domain experts when evaluating suggestions by xAI agents. In this regard, we created a shortened survey using factor analysis methods to remove redundancy. The validation of this shortened survey is also left to future work. In the remainder of this section, we present the design of this reduced xAI survey.
The participants in our study completed our 30-question xAI survey and we used their responses to create a reduced version of the full survey that measures the same primary components. We first conducted a factor analysis to analyze the different questions in our survey (Spearman, 1904; Watkins, 2018). After we removed all items with low factor loadings, the factor analysis reported that three factors were sufficient (p = 0.165). We also ensured that each factor had at least four items, as concepts such as usability, transparency, and simulatability are abstract and complex constructs that fewer items may not adequately capture (Schrum et al., 2020). We then tested the reliability of each subscale using Cronbach's alpha, with α₁ = 0.83, α₂ = 0.82, and α₃ = 0.81. This process resulted in a 14-question survey that approximately captures explainability along these three axes: transparency, usability, and simulatability. Our 14-question xAI survey is given in the Supplementary Appendix. We further analyzed the reliability of the survey by sampling a third of the data and recalculating Cronbach's alpha for the subscales. After taking fifteen samples, we calculated the mean, standard deviation, and 99% confidence interval for the Cronbach's alpha for each subscale: Factor 1 (M = 0.824, SD = 0.031, 99% CI = (0.804, 0.845)), Factor 2 (M = 0.823, SD = 0.037, 99% CI = (0.798, 0.848)), Factor 3 (M = 0.809, SD = 0.016, 99% CI = (0.798, 0.820)). These results show that the subscales for transparency, usability, and simulatability consistently have internal reliability (α > 0.7). The final "explainability score" for our xAI survey is computed as the sum of all items (with negative items inverted), as in the full 30-question survey. We present analysis of our results using the reduced 14-question xAI survey in the Supplementary Appendix, and we leave validation of this reduced survey to future work.
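The reliability check described above can be sketched as follows: compute Cronbach's alpha on repeated random subsamples of one third of the participants, then summarize the fifteen values. The reported intervals are approximately consistent with a normal-approximation interval, M ± z·SD/√15 with z ≈ 2.576, so the sketch uses that form; the paper does not state the exact interval method. The function names, the seed, and the synthetic data are assumptions for illustration only.

```python
import random
from statistics import NormalDist, mean, stdev, variance


def cronbach_alpha(rows):
    """rows: one list of item ratings per participant (negative items already inverted)."""
    k = len(rows[0])
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def subsampled_alpha(rows, n_samples=15, fraction=1 / 3, level=0.99, seed=0):
    """Mean, SD, and normal-approximation CI of alpha over random subsamples."""
    rng = random.Random(seed)
    size = max(2, round(len(rows) * fraction))
    alphas = [cronbach_alpha(rng.sample(rows, size)) for _ in range(n_samples)]
    m, sd = mean(alphas), stdev(alphas)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 2.576 for a 99% interval
    half = z * sd / n_samples ** 0.5
    return m, sd, (m - half, m + half)


# Synthetic ratings for a 5-item subscale (not study data): each participant has
# a latent level on a 1-7 scale, and each item adds a small random deviation.
data_rng = random.Random(1)
rows = []
for _ in range(90):
    base = data_rng.randint(1, 7)
    rows.append([min(7, max(1, base + data_rng.choice([-1, 0, 1]))) for _ in range(5)])

m, sd, ci = subsampled_alpha(rows)
print(f"alpha mean={m:.3f}, sd={sd:.3f}, 99% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```

With only fifteen subsamples, a t-based interval would be somewhat wider than this normal approximation, so the choice of interval is worth stating when reporting such results.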

Limitations
The primary limitation of our work is that our task was limited in scope, being confined to a set of multiple-choice questions involving common sense inference. Our study included a set of scenarios that did not significantly test any pre-existing knowledge or reading comprehension. Our study drew on common sense and inference about everyday life, revealing statistically significant differences across our population of untrained users. Additional deployments of our study targeted at populations of experts (such as medical professionals, pilots, engineers, etc.) may yield additional domain-specific and nuanced results around compliance, trust, and performance.
Our study could also be extended by applying real xAI techniques to produce explanations for participants, as explanations in our study were generated via WoZ. A deployment of our study with state-of-the-art language generation systems or feature-importance mechanisms may yield additional insights into the current failings of xAI research.
Finally, we have produced a reduced version of our xAI survey that correlates with the full version, but has not been empirically validated or verified through independent study.Future work will investigate the legitimacy of the reduced, 14-Q xAI survey as a tool for measuring participant-rated explainability.
In an effort to overcome these limitations in future work and to facilitate deployment of studies similar to our own, we provide study resources (e.g., survey files, questions, and tests) to the community. By leveraging our resources, other researchers will be able to quickly deploy their own versions of our xAI study to different domains or populations. By deploying more xAI studies to a wider variety of problems and demographics, we can begin to draw broader conclusions about xAI applied to a wider range of challenges.

Societal impact
Our work offers insight into the benefits of explainability when deployed to a question-answering task with a population of non-expert users. In finding support for explainability improving trust, social competence, and performance, we hope that our work will encourage the wider adoption and deployment of xAI to the real world.
Generalizing findings of prior work (Poursabzi-Sangdeh et al., 2021), we found several classes of xAI techniques may yield increased reliance on an agent decision-aid, even when such a decision-aid is incorrect. Therefore, it is critical that future developments and deployments of xAI take this finding into account. Without further research on how to mitigate such over-reliance when deploying explainable agents, xAI may inadvertently lead experts to make more mistakes, while simultaneously reinforcing such mistakes with inaccurate explanations.

Conclusion
In this work, we have described the design and results of a study to provide the first quantitative insights into the effects of explainability on trust, performance, and perceptions of social competence of virtual agents. We found that explainability was significantly correlated with trust, accuracy, and social competence, and that such findings were not dependent upon the method of explainability. We further found that simple language-based explanations and case-based explanations were all perceived as significantly more explainable than class-wise probability scores. Finally, we have proposed an xAI survey to measure human ratings of explainable AI, supported via correlations to trust, performance, and social competence. Our survey will be verified in future work, and will help xAI researchers more rigorously evaluate their work with human participants with a standardized measurement scale that can be applied to any xAI deployed to human users.

Note 1. Our study was approved by the Georgia Institute of Technology IRB under Protocol H20522.

Figure 1. A visual walkthrough of our study. Participants first complete consent forms, a screening task, and demographic surveys before beginning a priming task. The priming task prepares participants to consider usability and transparency of agent suggestions and explanations. Participants are then assigned a condition and receive instructions for their assigned explanations before beginning the main set of scenarios in our study. Each scenario includes a short paragraph of text, a question, an explanation from a virtual agent, two Likert items assessing per-scenario understanding and agreement, and four answer choices. After each question, participants are shown the correct answer and a running tally of their overall score. After completing all scenarios, participants complete a trust survey (Jian et al., 2000), social perceptions survey (Bartneck et al., 2009), and our xAI survey.

Figure 2. An example for each of the xAI conditions in our study.

Figure 3. A depiction of our explainability results correlated with (a) social perception, (b) trust, and (c) accuracy, and of (d) trust correlated with accuracy. We find trust, accuracy, and social perceptions were statistically significantly correlated with explainability as measured by our xAI survey, both lending support for use of our survey as a measure of explainability and addressing RQ1 and RQ2. We also observe that trust and accuracy are statistically significantly correlated. Each dot represents a data point with the regression line and confidence intervals drawn for each correlation.