A Semiparametric Inverse Reinforcement Learning Approach to Characterize Decision Making for Mental Disorders

Abstract Major depressive disorder (MDD) is one of the leading causes of disability-adjusted life years. Emerging evidence indicates the presence of reward processing abnormalities in MDD. An important scientific question is whether the abnormalities are due to reduced sensitivity to received rewards or reduced learning ability. Motivated by the probabilistic reward task (PRT) experiment in the EMBARC study, we propose a semiparametric inverse reinforcement learning (RL) approach to characterize the reward-based decision-making of MDD patients. The model assumes that a subject’s decision-making process is updated based on a reward prediction error weighted by the subject-specific learning rate. To account for the fact that one favors a decision leading to a potentially high reward, but this decision process is not necessarily linear, we model reward sensitivity with a nondecreasing and nonlinear function. For inference, we estimate the latter via approximation by I-splines and then maximize the joint conditional log-likelihood. We show that the resulting estimators are consistent and asymptotically normal. Through extensive simulation studies, we demonstrate that under different reward-generating distributions, the semiparametric inverse RL outperforms the parametric inverse RL. We apply the proposed method to EMBARC and find that MDD and control groups have similar learning rates but different reward sensitivity functions. There is strong statistical evidence that reward sensitivity functions have nonlinear forms. Using additional brain imaging data in the same study, we find that both reward sensitivity and learning rate are associated with brain activities in the negative affect circuitry under an emotional conflict task. Supplementary materials for this article are available online.


Introduction
Mental disorders cause immense disability, accounting for 183.9 million disability-adjusted life-years worldwide (Whiteford et al. 2013). However, despite intense clinical and research efforts devoted to developing pharmacological and behavioral treatments for mental disorders over the years, effective treatment strategies remain elusive, and treatment responses are far from adequate across mental disorders (e.g., 30%-50%). Recently, there has been a fast-growing trend to use discovered novel biomarkers and behavioral markers, accompanied by sophisticated computational methods (e.g., machine learning or reinforcement learning), to provide a more accurate characterization of mental disorders and more precise prediction of treatment responses than traditional approaches (Passos, Mwangi, and Kapczinski 2019).
To address substantial between-patient heterogeneity and limitations in the traditional symptom-based diagnosis, the paradigm shift put forward by the NIMH strategic plan on the research domain criteria (RDoC; Insel et al. 2010) initiative calls for redefining mental disorders by latent constructs identified from biological and behavioral measures across different domains of functioning (e.g., positive/negative affect, social processing, decision making) at different levels (e.g., genes, cells, and circuits), instead of relying solely on clinical symptoms or the diagnostic and statistical manual of mental disorders (APA 2013). The vision of precision psychiatry (Insel et al. 2010; Williams 2016) is that research needs to account for extensive diagnostic heterogeneity and substantial between-patient variation in biological, behavioral, and clinical manifestations of disease, examine shared latent constructs across disorders, and match treatments with the underlying pathophysiology.
Learning and decision-making are considered important in the etiology of mental function (Kendler, Karkowski, and Prescott 1999). An individual's learning ability and decision-making process may be altered by mental disorders (Pizzagalli, Jahn, and O'Shea 2005). In turn, poor decision-making may lead individuals into high-risk scenarios and more mental illness (Kendler, Karkowski, and Prescott 1999). For example, impairment in processing rewards is implicated in various mental disorders. Anhedonia (i.e., inability to experience pleasure) is one of the core constructs of major depression. Pizzagalli, Jahn, and O'Shea (2005) showed that anhedonia may alter how MDD patients process reward in a PRT experiment. Briefly, subjects were presented with two types of stimuli in a computer-based experiment (probabilistic reward task), and one stimulus was rewarded more frequently than the other (details in Section 1.1). Over time, healthy subjects will learn which reward has a higher value and choose the correct stimuli with a higher probability. It is hypothesized that anhedonia may reduce a patient's ability to learn which reward is more valuable or reduce a subject's tendency to choose the richer reward. Decision-making models have been proposed to evaluate whether these hypotheses are supported (Huys et al. 2013). As another example, restrictive eating behavior in anorexia nervosa can be measured via the Food Choice Task, a computer-based assessment of ratings and choices of food items, and patients' choices in the task predict future caloric intake (Foerde et al. 2015).
A useful computational model for describing a patient's learning behavior and decision-making ability is Reinforcement Learning (RL; Sutton and Barto 2018). To use RL to describe behaviors or actions generated from a probabilistic reward task, a simple prediction error learning model is invoked to characterize the value of different choices depending on key behavioral phenotype parameters, including reward sensitivity and learning rate. Various empirical studies and a meta-analysis have shown that objective, quantitative measures derived from behavioral data can be associated across multiple mental disorders (Pike and Robinson 2022), making these measures ideal candidate behavioral phenotypes under the RDoC framework. In contrast to regular RL, where one aims to estimate an optimal policy that maximizes the long-term reward, the goal of analysis in behavioral phenotyping in mental health is behavioral cloning or imitation learning (Ross and Bagnell 2010). That is, given observed patients' choices on probabilistic reward tasks, the goal is to infer their reward learning ability and value computation process, and to understand how these abilities and processes relate to MDD and, ultimately, are affected by MDD treatments.
There is a gap in using existing imitation learning or inverse RL methods to address the unique challenges presented by mental disorders. For example, the models proposed in Huys et al. (2013) assume that a patient's decision-making process relies on a function with a restrictive parametric linear form, which may oversimplify the complex decision process and interactions between core mental functions. For example, similar to other psychometric models (Gravetter and Forzano 2018) and as we demonstrate in Section 4, there is often a floor and ceiling effect on the relationship between action and expected reward, which is not captured by a linear model. Some existing imitation learning and inverse RL methods are more flexible in allowing nonlinear models (e.g., Arora and Doshi 2021), but they do not provide important behavioral phenotypes such as learning rate and reward sensitivity and do not borrow information from multiple subjects under a random effects model.
In this work, motivated by the experiments in Pizzagalli, Jahn, and O'Shea (2005) and Trivedi et al. (2016), we propose a semiparametric inverse reinforcement learning approach to characterize reward-based decision-making for patients with mental disorders. Our method incorporates subject-specific learning rates and reward sensitivity as random effects to address between-patient heterogeneity. To borrow information while remaining flexible, we assume that the function that links action to the contrast in expected reward is nonparametric but shared across the patients. We describe the behavioral phenotyping experiment and motivating study in the next section.

Probabilistic Reward Task and the Motivating Study
The probabilistic reward task (PRT; Pizzagalli, Jahn, and O'Shea 2005) is a computer-based experiment that measures the subject's ability to modify behavior in response to rewards. On each trial of the task, the subject sees a cartoon face with a short or long mouth (two stimuli). The task is to indicate whether a short or long mouth was presented by pressing button "C" or "M" on the keyboard. Critically, the correct response to the short mouth was rewarded more frequently than a correct response to the other (i.e., the short mouth is the rich reward), and the difference in length between the short and long mouths is minimal. Participants were given verbal instructions about the task and told that the goal of the task was to win as much money as possible. It is critical that the subject understands that not all correct responses will be rewarded. To maximize the reward, participants should press the correct button, regardless of which face is associated with the higher reward. However, the difference in size between the short and long mouths is designed to be small. Consequently, participants frequently encounter difficulties in accurately perceiving the presented state, which results in a tendency to prioritize states with higher rewards rather than those that are more accurate. Thus, the PRT experiment probes a subject's ability to learn rich versus lean reward.
Emerging evidence indicates the presence of reward processing abnormalities in MDD (Whitton, Treadway, and Pizzagalli 2015). MDD patients are more likely to have lower reward learning ability than patients not diagnosed with MDD (Vrieze et al. 2013). There has been significant interest in learning how MDD affects patients' decision-making in reward learning tasks. In our motivating study, the Establishing Moderators and Biosignatures of Antidepressant Response for Clinical Care (EMBARC) trial, each PRT session consisted of 200 trials, divided into two blocks of 100 trials, with blocks separated by a 30-sec break. For each block, 40 correct trials were followed by reward feedback. Correct identification of the short mouth was associated with three times more positive feedback (30 out of 40) than correct identification of the long mouth (10 out of 40). However, subjects were not told about the different frequencies of rewards across the two response types. Figure 1(a) shows the PRT schematic diagram.
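The asymmetric feedback schedule described above can be sketched as a simple quota-based simulator. This is a simplification of the actual controlled schedule (which spaces rewarded trials across the block, whereas here the earliest correct responses are rewarded first), and the function name and always-correct responder are illustrative assumptions, not part of the study protocol.

```python
def prt_block_feedback(states, actions, quotas=None):
    """Simplified PRT feedback for one 100-trial block: reward a correct
    response while that stimulus's quota (30 rich, 10 lean) remains."""
    quotas = dict(quotas or {1: 30, 0: 10})  # state 1 = rich, state 0 = lean
    rewards = []
    for s, a in zip(states, actions):
        if a == s and quotas[s] > 0:   # correct response, quota remaining
            rewards.append(1)
            quotas[s] -= 1
        else:                          # incorrect response, or quota used up
            rewards.append(0)
    return rewards

# A block with 50 rich then 50 lean trials and an always-correct responder:
states = [1] * 50 + [0] * 50
rewards = prt_block_feedback(states, actions=states)
# total rewards: 40 (30 for the rich stimulus, 10 for the lean stimulus)
```

With an always-correct responder the block yields exactly 40 rewarded trials, preserving the 3:1 rich-to-lean feedback ratio of the design.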
The EMBARC study is designed to identify the difference in reward learning abilities between the patients with MDD and the control group (not diagnosed with MDD) and between the SSRI antidepressant sertraline (SERT) and placebo (PBO) in a randomized trial for patients with MDD. Figure 1(b) shows the experimental design diagram. In the PRT experiments, there were 40 subjects from the control group and 168 subjects from the MDD group; within the MDD group, 82 participants were randomly assigned to the SERT group, and the other 86 participants were assigned to the PBO group. Each subject had two PRT sessions, one at baseline before treatment (week 0) and one after one week of treatment (week 1). Note that subjects in the control group also took two sessions, in week 0 and week 1. In week 1, five sessions were missing for the MDD group and one for the control group.

Data from a Behavioral Task
Consider n subjects from a population. The data consist of a time series of subjects' decision-making in a behavioral task (e.g., the probabilistic reward task), that is, {S_it, A_it, R_it} for i = 1, . . ., n and t = 1, . . ., T, where S_it, A_it, and R_it are the state, action, and reward at time t for the ith subject. In this article, we assume that there are m states in the state space (i.e., S_it ∈ {0, . . ., m − 1}) and two actions in the action space (i.e., A_it ∈ {0, 1}). Denote H_it = {S_ij, A_ij, R_ij} for j = 1, . . ., t − 1 as the observed history for the ith subject by time t. The reward R_it is generated based on S_it, A_it, and H_it (in some cases, only on S_it and A_it). R_it is assumed to be bounded (either discrete or continuous), and without loss of generality, we assume R_it lies between 0 and 1. The probabilistic reward task introduced in Section 1.1 is a special case with two possible states. We let S_it = 0 correspond to the state with lean rewards and S_it = 1 correspond to the state with rich rewards. The reward-generating distributions conditional on S_it = A_it = 0 and S_it = A_it = 1 (and history H_it) are Bernoulli distributions with probabilities 0 < P_00 < P_11 < 1. No reward is provided if the ith patient does not correctly respond to the state at time t.
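A minimal generator for this data structure can be sketched as follows, assuming Bernoulli rewards with P_00 < P_11 and a uniform random responder; the function name, parameter values, and placeholder action policy are illustrative assumptions, not the model's decision rule.

```python
import random

def simulate_prt_data(n_trials, p00=0.3, p11=0.9, seed=0):
    """Generate one subject's {S_t, A_t, R_t} stream: two states
    (0 = lean, 1 = rich), two actions, and a Bernoulli reward that is
    possible only on correct responses."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_trials):
        s = rng.randint(0, 1)                  # presented stimulus
        a = rng.randint(0, 1)                  # placeholder uniform policy
        if a == s:                             # correct response
            p = p11 if s == 1 else p00
            r = 1 if rng.random() < p else 0   # Bernoulli(P_ss) reward
        else:
            r = 0                              # no reward when incorrect
        data.append((s, a, r))
    return data

trajectory = simulate_prt_data(100)
```

A fitted policy would replace the uniform action draw; the reward mechanism above matches the constraint that incorrect responses are never rewarded.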

Semiparametric Inverse RL Model
RL is a computational tool used to model patients' responses to stimuli in behavioral tasks such as PRT. A core concept in RL modeling of PRT is the reward prediction error, the difference between the received reward and the expected reward. Mathematically, the reward prediction error is R_it − Q_it(a, s), where Q_it(a, s) is the expected reward, or value, of taking action a at state s for the ith subject at the tth trial. This is the foundation for building more complex models to explain patients' behavioral responses. Specifically, for MDD patients, a failure to adequately process reward is hypothesized to occur via two distinct mechanisms: a reduced sensitivity to received reward or a reduced ability to learn (Huys et al. 2013). The former mechanism is assessed by a parameter referred to as reward sensitivity (denoted by ρ_i), and the latter by the learning rate (denoted by β_i). When accounting for reward sensitivity, a subject's reward prediction error is ρ_i R_it − Q_it(a, s), where the obtained reward R_it is discounted by a factor of ρ_i. Each patient's value (denoted as Q_{i,t+1}(a, s)) naturally evolves over time depending on the past history of values and is updated based on a weighted reward prediction error as

Q_{i,t+1}(a, s) = Q_it(a, s) + β_i I_it(a, s){ρ_i R_it − Q_it(a, s)},    (1)

where β_i ∈ (0, 1) is a subject-specific learning rate that describes how fast the value of a state-action pair is updated, and I_it(a, s) = I(A_it = a, S_it = s) is the indicator function of the event {A_it = a and S_it = s}. The larger ρ_i is, the more a subject's action will depend on rewards. Equivalently, the expected reward at t + 1 is a weighted sum of the obtained reward and the expected reward at t, that is, Q_{i,t+1}(a, s) = β_i I_it(a, s)ρ_i R_it + {1 − β_i I_it(a, s)}Q_it(a, s). As the learning rate β_i approaches one, learning is so fast that the Q values are simply the last experienced outcome. As a note, (1) is also known as the Rescorla-Wagner equation (Rescorla 1972) in classical conditioning. Figure 2 shows a graphical representation of the model, where the green paths show the RL component of updating Q_it over
trials. At time t = 1, we assume Q_i1(a, s) = α_as. Define ℓ_it = ℓ_it(a, s) as the total count of times that subject i takes action a at state s before time t, and define τ_ik = τ_ik(a, s) as the kth time at which subject i takes action a at state s. According to (1), we can express Q_it(a, s) in terms of α_as and the history H_it as

Q_it(a, s) = (1 − β_i)^{ℓ_it} α_as + β_i ρ_i Σ_{k=1}^{ℓ_it} (1 − β_i)^{ℓ_it − k} R_{i,τ_ik}.    (2)

At trial t, participants have the updated reward value Q_it(a, s) based on the first (t − 1) trials. However, because participants cannot be sure which stimulus is presented at trial t, their action at trial t is based on a "belief" of the expected reward, W_it(a, s), defined as a mixture of the Q-values across states,

W_it(a, s) = Σ_{r=0}^{m−1} ω_sr Q_it(a, r),    (3)

where the parameters ω_sr ∈ [0, 1] satisfy Σ_{r=0}^{m−1} ω_sr = 1, and they reflect the weights given to the Q-values from states other than s.
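The recursive update (1) and its unrolled closed form can be cross-checked numerically; this sketch assumes every trial visits the same (a, s) pair, and the function names are illustrative.

```python
def rw_update(q, reward, beta, rho):
    """One step of update (1): move Q toward the sensitivity-discounted
    reward rho*reward by a fraction beta of the prediction error."""
    return q + beta * (rho * reward - q)

def q_recursive(alpha, beta, rho, rewards):
    """Apply update (1) at every visit, starting from Q_1 = alpha."""
    q = alpha
    for r in rewards:
        q = rw_update(q, r, beta, rho)
    return q

def q_closed_form(alpha, beta, rho, rewards):
    """Unrolled form: geometric down-weighting of the initial value
    and of each past discounted reward."""
    n = len(rewards)
    q = (1 - beta) ** n * alpha
    for k, r in enumerate(rewards, start=1):
        q += beta * rho * (1 - beta) ** (n - k) * r
    return q

rewards = [1, 0, 1, 1, 0]
q_a = q_recursive(0.5, 0.3, 0.8, rewards)      # recursion
q_b = q_closed_form(0.5, 0.3, 0.8, rewards)    # agrees with q_a
q_limit = q_recursive(0.0, 0.2, 0.5, [1.0] * 100)   # approaches rho = 0.5
```

Repeated unit rewards drive Q toward ρ, illustrating how reward sensitivity caps the learned value; with β = 1, Q jumps straight to the last discounted outcome.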
It is worth noting that the formulations of Q_it(a, s) and W_it(a, s) are motivated by Huys et al. (2013). We generalize Huys et al. (2013) to allow a nonlinear reward sensitivity function and more than two states (i.e., S_it ∈ {0, . . ., m − 1}), and we use a random effects model to borrow information across subjects when estimating reward sensitivities and learning rates. Specifically, given the believed reward W_it(a, s) and the actual stimulus S_it at trial t, we define a contrast between the two actions as

Z_it = W_it(1, S_it) − W_it(0, S_it),    (4)

and we assume that the action A_it, conditional on the history H_it and S_it, follows the model

logit P(A_it = 1 | H_it, S_it) = f(Z_it),    (5)

where f(·) is an unknown nondecreasing function satisfying f(0) = 0. In other words, the participant is more likely to choose the action whose expected reward is higher, but this probability can be a nonlinear function of the contrast. As a result, we refer to f(·) as the reward sensitivity function. The nonlinear modeling of the decision process may more accurately reveal when subjects are more sensitive to reward under a probabilistic reward task. Finally, to incorporate heterogeneity among the participants, we assume

β_i = exp(ν_i) / {1 + exp(ν_i)},   ρ_i = γ_i,

where ν_i and γ_i are random effects generated from a bivariate normal distribution, that is, (ν_i, γ_i)^T ~ N((μ_ν, μ_γ)^T, Σ). Note that since Q_it is invariant if its initial value and ρ_i are scaled by a constant factor c, f(x) and f(x/c) yield the same action model. Thus, to ensure identifiability, we let μ_γ ≡ 1. By this normalization, γ_i serves as an indicator of the relative sensitivity of the ith subject (i.e., (γ_i − μ_γ)/μ_γ). We refer to γ_i as the relative sensitivity in the rest of the article. The decision-making model (5) is represented by the blue paths in Figure 2.
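The floor and ceiling effect that motivates the nonlinear action model (5) can be illustrated with a hypothetical saturating choice of f (a scaled tanh, which is nondecreasing with f(0) = 0); in the paper f is estimated nonparametrically, so this particular form and the function names are purely illustrative.

```python
import math

def f_saturating(z, c=2.0):
    """Illustrative nondecreasing reward sensitivity with f(0) = 0;
    it flattens for large |z|, mimicking a floor/ceiling effect."""
    return c * math.tanh(z)

def prob_action_one(z, f=f_saturating):
    """Model (5): logit P(A = 1 | Z = z) = f(z), so
    P(A = 1 | Z = z) = 1 / (1 + exp(-f(z)))."""
    return 1.0 / (1.0 + math.exp(-f(z)))

# At z = 0 the two actions are equally likely; as z grows, the
# probability saturates below 1, at 1 / (1 + e^{-c}).
p_half = prob_action_one(0.0)
p_large = prob_action_one(50.0)
```

Unlike a linear f, this choice implies that even a very large contrast in believed reward cannot push the choice probability arbitrarily close to one.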
The rewards are generated based on a participant's state, choice, and history depending on the experiment paradigm (pink paths in Figure 2).

Implementation
Under the proposed semiparametric inverse RL/imitation learning model, since R_it and S_it given H_it are independent of the random effects, the log-likelihood function for subject i is, up to a constant,

ℓ_i(θ) = log ∫∫ Π_{t=1}^{T} P(A_it | H_it, S_it; ν, γ) φ(ν, γ; μ, Σ) dν dγ,    (6)

where φ(·, ·; μ, Σ) is the bivariate normal density of the random effects. We maximize the joint conditional log-likelihood

ℓ(θ) = Σ_{i=1}^{n} ℓ_i(θ)    (7)

to estimate all the parameters. To fit a nondecreasing function on [L, U], where L and U are two pregiven lower and upper bounds for the contrast, we approximate f using a family of monotone splines called I-splines (Ramsay 1988). Denote M = (M_1, . . ., M_K) as a set of M-splines and I = (I_1, . . ., I_K) as the corresponding I-splines, where each I_k is the integral of M_k. More details on the representation of M-splines and I-splines can be found in Appendix A.1. We approximate f(·) by f(x) = a + I(x)b. Restricting the coefficients b_k ≥ 0 for k = 1, . . ., K enforces f(·) to be nondecreasing, and we set a = −I(0)b to ensure f(0) = 0. Besides, the first-order derivative of f is M(x)b. We evaluate the double integral using bivariate Gauss-Hermite quadrature (Jäckel 2005). In the optimization step, because α and ω are linearly constrained and the coefficients of the I-splines are nonnegative, constrained optimization (Lange, Chambers, and Eddy 1999) is used to maximize (7). In our numerical studies, the L-BFGS-B algorithm implemented by the "optim" function in the R package "stats" is used for the case m = 2 (i.e., two states); when m > 2, an adaptive barrier algorithm implemented by the "constrOptim" function in the R package "stats" can be used instead. Algorithm 1 presents a detailed version of the semiparametric RL algorithm for the optimization of the parameters and the nonparametric function. Note that evaluating the joint conditional log-likelihood function (7) is time-consuming, so to accelerate the computation, we evaluate the log-likelihood function (6) for each subject in parallel and sum the parts to obtain (7). The maximum likelihood estimate is obtained when the L-BFGS-B algorithm converges.
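The bivariate Gauss-Hermite step used to evaluate the double integral in (6) can be sketched generically: the integrand g below is a placeholder for the product of action probabilities, and the node count and parameter values are illustrative assumptions.

```python
import numpy as np

def bivariate_gauss_hermite(g, mu, sigma, n_nodes=10):
    """Approximate E[g(nu, gamma)] for (nu, gamma) ~ N(mu, sigma) by
    transforming Gauss-Hermite nodes: (nu_j, gamma_k) = mu + sqrt(2) L y_jk,
    with sigma = L L^T and weights kappa_j * kappa_k / pi."""
    y, w = np.polynomial.hermite.hermgauss(n_nodes)   # 1D nodes and weights
    L = np.linalg.cholesky(sigma)                     # lower triangular factor
    total = 0.0
    for yj, wj in zip(y, w):
        for yk, wk in zip(y, w):
            node = mu + np.sqrt(2.0) * L @ np.array([yj, yk])
            total += wj * wk * g(node[0], node[1])
    return total / np.pi                              # normalizing constant

mu = np.array([0.5, 1.0])
sigma = np.array([[0.25, -0.05], [-0.05, 0.09]])
approx = bivariate_gauss_hermite(lambda nu, gamma: np.exp(nu), mu, sigma)
# matches the lognormal mean exp(0.5 + 0.25/2) = exp(0.625) to high accuracy
```

The change of variables absorbs the normal density, so the double integral reduces to a short weighted sum over the tensor-product node grid, which is what makes evaluating (6) per subject fast enough to parallelize.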
Furthermore, denote θ̂ = (μ̂, Σ̂, α̂, ω̂, f̂) as the maximizer of (7). The subject-specific learning rate and relative sensitivity can be estimated by plugging θ̂ into the posterior mean of (ν_i, γ_i)^T.
In the Appendix, we study the theoretical properties of the MLEs of the parameters and the monotone function f(·) and show consistency and asymptotic normality. Furthermore, the nonparametric bootstrap is used for inference. Because asymptotic normality holds, as we show in Appendix A.2, we construct 95% bootstrap confidence intervals under the normal distribution (i.e., MLE ± 1.96 × bootstrap SE) in the simulation studies and the data application. Because the distributions of σ²_{ν,ν} and σ²_{γ,γ} are highly skewed to the right, we transform them by the logarithm before using the normal approximation to construct confidence intervals.
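The normal-approximation bootstrap interval, including the log transform for right-skewed variance components, can be sketched generically; the resampling unit is the subject, and the function names and toy estimator are illustrative assumptions.

```python
import math
import random

def bootstrap_normal_ci(estimate_fn, subjects, n_boot=200,
                        log_scale=False, seed=0):
    """95% CI as estimate +/- 1.96 * bootstrap SE; optionally build the
    interval on the log scale (for right-skewed variance components)."""
    rng = random.Random(seed)
    point = estimate_fn(subjects)
    boots = []
    for _ in range(n_boot):
        resample = [rng.choice(subjects) for _ in subjects]  # resample subjects
        boots.append(estimate_fn(resample))
    if log_scale:
        point_t = math.log(point)
        boots_t = [math.log(b) for b in boots]
    else:
        point_t, boots_t = point, boots
    mean_b = sum(boots_t) / n_boot
    se = math.sqrt(sum((b - mean_b) ** 2 for b in boots_t) / (n_boot - 1))
    lo, hi = point_t - 1.96 * se, point_t + 1.96 * se
    if log_scale:
        lo, hi = math.exp(lo), math.exp(hi)
    return point, (lo, hi)

def sample_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Example: interval for a sample variance, constructed on the log scale
data = list(range(1, 21))   # toy per-subject summaries
est, (lo, hi) = bootstrap_normal_ci(sample_variance, data, log_scale=True)
```

Exponentiating the endpoints keeps the interval for a variance component strictly positive, which the raw normal approximation does not guarantee.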

Simulation Studies
We conducted extensive simulation studies to assess the finite-sample performance of the proposed method. To mimic the real study, denote the distribution for generating the reward R_it conditional on A_it = a and S_it = s as p_as(R_it). We considered m = 2 (two states) with two reward-generating distributions (Cases I and II). We further let α_00 = α_11 := α.

Algorithm 1: The semiparametric RL optimization algorithm
1. Initialize θ = (μ, L, α, ω, b), where μ_γ ≡ 1 and L is a lower triangular matrix that satisfies Σ = LL^T.
2. Repeat the following updates until convergence: compute the bivariate Gauss-Hermite quadrature nodes (ν_j, γ_k) = μ + √2 L y_jk, where y_jk = (y_1j, y_2k) and {y_1j}_{j=1}^{m_1} and {y_2k}_{k=1}^{m_2} are the roots of the m_1th- and m_2th-order Hermite polynomials; evaluate (3), (4), and (5) at each node; and approximate each subject's likelihood by the weighted sum over nodes, where κ_j and κ_k are the corresponding Gauss-Hermite weights.
The simulation results for the parameter estimates from 200 replicates are given in Table 1, and those for the nonparametric function in Table 2. For both tables, the relative bias (RB), standard deviation (SD), average bootstrap standard error (SE), and coverage probability of the 95% bootstrap confidence intervals (CP) are reported. Furthermore, we compared our method (Semiparametric RL) with the linear model f(x) = cx (Linear RL) in Tables 1 and 2. Table 1 shows that the estimates of the group mean μ_ν for both the semiparametric RL and linear RL models have small relative biases. This suggests that we can estimate the group learning rate with high precision even when the reward sensitivity function f(·) is incorrectly specified. However, the estimates of the covariance matrix have much larger relative bias when the reward sensitivity function f(·) is incorrectly specified. Meanwhile, for the semiparametric RL model, the relative bias of all parameters decreases as T and n increase. In contrast, for the linear RL model, the relative bias of all parameters except the group learning rate does not change much as T and n increase. Table 2 suggests that the semiparametric RL estimates of the nonparametric function have much smaller relative bias and larger standard deviation than the linear RL estimates. Note that the coverage probabilities of the semiparametric RL estimates are close to the nominal level (95%). In contrast, the coverage probabilities of the linear RL estimates are much lower than the nominal level. Hence, we conclude that the linear RL model cannot provide reliable statistical inference on f(·) when the underlying true reward sensitivity function is nonlinear. In the supplementary materials Section S.1.1, we also investigated the case where the underlying true reward sensitivity function is linear. We find that the two methods provide estimates with similar relative bias; our semiparametric RL has a relatively larger standard deviation compared to linear RL.
Comparing semiparametric RL estimation results among the four data size scenarios for both reward-generating mechanisms, we found that the estimation bias and standard deviation decrease significantly when we increase the number of trials from 100 to 500. The estimation bias and standard deviation also decrease when we increase the number of subjects from 100 to 250. The reduction rate is relatively large for T = 100 compared to T = 500. The sources of bias are the bias from using 3 × 3 bivariate Gauss-Hermite quadrature to approximate (6) and the bias from using a numerical optimizer to find the maximum of (7). By comparing the reward-generating distributions in Cases I and II, we found that the distribution of the reward slightly affects the estimation performance. The simulation results show that Case I has a smaller relative bias than Case II. In the supplementary materials Section S.1.5, we performed an additional simulation study involving three states (i.e., m = 3). The results indicate that the semiparametric RL model with three states also produces accurate estimation results.

Application to EMBARC Study
We now apply the proposed methodology to our motivating PRT data. Through a preliminary analysis, we found that the learning pattern might change from the first block to the second block. To avoid bias, we focus only on the PRT data in the first block (the first 100 trials) in this article.
First, we compare reward learning abilities between patients with MDD and the healthy control group. We fitted the proposed semiparametric reward learning models for the "MDD" group, which contains subjects diagnosed with MDD at baseline (pre-treatment), and the "Control" group, which contains the healthy subjects with two repeated measurements (no treatment at either time). We examined numbers of interior knots from 3 to 10, with boundary knots equal to {−4, 6} and order r = 2 for the I-splines. By the AIC criterion, we selected the model with six interior knots. The parameters of interest are the transformed group learning rate μ_ν and the group reward sensitivity function f(·). Table 3 shows the estimation results for μ_ν and the elements of Σ. For inference, the nonparametric bootstrap was applied to generate 200 resampling sets for each group. Bootstrap standard errors and 95% bootstrap confidence intervals are shown in Table 3. The results show that the learning rates for the two groups have similar values. We also constructed the 95% confidence interval for the contrast of μ_ν between the MDD and control groups; because this interval, (−0.94, 1.93), covers zero, we conclude that the difference in learning rate between the MDD group and the control group is not significant. Another interesting finding is that the learning rate is negatively correlated with reward sensitivity across subjects.
Figure 3 presents the nonparametric estimate of the group reward sensitivity function f(·). Figure 3(a) compares the nonparametric estimates of f(·) between the MDD and Control groups using the proposed method. The fitted reward sensitivity functions for both groups are clearly nonlinear. The functions flatten when x > 2 or x < −2, suggesting that subjects' probability of correct actions at given states would not rise endlessly even if they received enough rewards. Meanwhile, although the reward sensitivity functions for the two groups have similar shapes when x < 1.5, f(x) for the control group increases to a higher level than f(x) for the MDD group when x > 2. This suggests that the control group has a larger probability of taking correct actions at rich reward states than the MDD group when subjects in both groups receive adequate rewards in rich reward states. Figure 3(b) presents the 95% pointwise confidence band (PCB) and the 95% simultaneous confidence band (SCB) for the contrast of f(·) between the two groups. We evaluate f(x) for −3 < x < 4, since few Z_it values fall outside this interval. The construction of the PCB uses the bootstrap and a normal approximation, and the construction of the SCB uses a bootstrap method that mimics Hall and Horowitz (2013). For further details regarding the SCB and its coverage rates in simulation studies, see supplementary materials Section S.1.4. Because the 95% SCB covers zero over the entire range of x, we lack statistical evidence to conclude a significant difference in the reward sensitivity function between the Control and MDD groups. However, the 95% PCB suggests that the Control group may exhibit a higher reward sensitivity compared to the MDD group, particularly when the expected difference in reward between the two stimuli is large (e.g., x > 2).
As an exploratory analysis, we constructed a 95% SCB for the contrast of f(·) between the two groups specifically for x > 2. We observe that the entire band falls below zero for x > 2. The result is shown in Figure S3 in the supplementary materials. This finding provides statistical evidence that the Control group exhibits a greater reward sensitivity than the MDD group when the expected difference in reward (rich reward minus lean reward) between the two stimuli exceeds 2. Moreover, we applied the proposed model with a linear reward sensitivity function f(x) = cx to both the MDD and Control groups. Figure S4 in the supplementary materials displays the estimated linear reward sensitivity functions. Compared to Figure 3 obtained from the semiparametric model, the linear model yields less precise reward sensitivity patterns and a higher AIC. Furthermore, the linear model fails to capture the substantial difference in the reward sensitivity function between the two groups for x > 2.
The comparison of MDD versus Control examines whether reward sensitivity is a characteristic of MDD that differs between patients and controls. Our results suggest that reward sensitivity obtained from the probabilistic reward task may be considered a behavioral marker or a phenotype of depression. This analysis does not establish a causal relationship between depression and reduced reward sensitivity, due to possible confounding.
Next, we compared reward learning abilities for the MDD patients between the sertraline (an antidepressant; SERT) and placebo (PBO) groups. We fitted the proposed semiparametric reward learning models for subjects in PBO and SERT at week 0 and week 1, respectively. We used the same knots and order for the I-splines as in the above analysis. Table 4 shows the estimated group learning rates and variance components in the SERT and placebo groups pre- and post-treatment. To investigate whether there is a significant difference in the group learning rate μ_ν for MDD patients who received SERT versus PBO, we constructed a 95% bootstrap confidence interval for the difference between the changes in μ_ν from week 0 to week 1 in the two treatment arms. The resulting 95% confidence interval is (−1.01, 3.10), indicating that the one-week changes in learning rate between the PBO and SERT groups are not significantly different. A recent meta-analysis of the PRT also reached the same conclusion of a nonsignificant group difference in learning rate (Pike and Robinson 2022).
Figure 4(a) and (b) show the estimated reward sensitivity functions f(·) for the SERT and PBO groups pre- and post-treatment. We find that the placebo group's pre-treatment (week 0) and post-treatment (week 1) reward sensitivity functions are similar across the entire support, while for the SERT group, f(x) increases to a higher level after treatment when x > 1.5. This suggests that the SERT treatment increases MDD patients' probability of taking correct actions at rich reward states when adequate rewards are received. We also constructed the 95% PCB and SCB of the difference between the pre- and post-treatment changes in f(·) in the two groups. Figure 4(c) does not show sufficient statistical evidence that the pre-post change in the reward sensitivity function differs between the two treatment groups. However, the similar pattern in the comparisons of reward sensitivity between Figures 3(b) and 4(c) suggests that there might be a positive impact of sertraline on MDD patients, potentially bringing their reward sensitivity closer to that of healthy individuals at the rich state, which is worth future investigation. Regarding the timing of the post-treatment measurement, the antidepressant is expected to take full effect in about four weeks. The rationale for measuring PRT reward sensitivity one week post-treatment is to detect early responses. With a longer period of treatment, the difference may be greater.
To investigate whether dysfunctions in human brain circuitry are associated with decision-making and learning ability, we examined correlations between subjects' learning rate, relative sensitivity, and task functional magnetic resonance imaging (fMRI) measures of brain activation in an emotional conflict task assessing amygdala-anterior cingulate (ACC) circuitry (Etkin et al. 2006). We find that both reward sensitivity and learning rate are associated with brain activities in the negative affect circuitry evoked by sad stimuli. A detailed analysis can be found in supplementary materials Section S.2.1.

Discussion
In this article, we propose a semiparametric inverse reinforcement learning framework to characterize reward-based decision-making, with an application to probabilistic reward tasks in the EMBARC study. We assume that a patient's decision-making process relies on two subject-specific factors, the learning rate and the reward sensitivity, together with a shared nondecreasing function of nonparametric form. Extensive simulation studies showed that our proposed method performs satisfactorily for large sample sizes under different reward-generating distributions. They also show the advantage of the semiparametric structure when the true underlying function lies beyond the restrictive parametric form based on a sigmoid function. In the application, we find that MDD patients and controls, as well as patients in the SERT and PBO groups, have similar learning rates; however, the groups have different reward sensitivity functions. The results in this article are consistent with the findings in Huys et al. (2013), but, by contrast, the proposed model gives a more detailed description of how reward sensitivity differs. We also find that behavioral phenotypes, including the learning rate and reward sensitivity, are associated with human brain activities in the negative affect circuitry.
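As a concrete illustration of the decision model summarized above, the trial-by-trial update can be sketched as follows. This is a hedged toy version, not the fitted model: the tanh sensitivity function, the softmax choice rule with scale rho, and the 0.75/0.30 reward probabilities are illustrative stand-ins for the nonparametric f(·) and the PRT reward schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

def sensitivity(r):
    # Hypothetical nondecreasing, nonlinear reward-sensitivity function,
    # standing in for the nonparametric f(.) in the paper.
    return np.tanh(1.5 * r)

def simulate_subject(beta, rho, n_trials=100):
    """Sketch of the trial-by-trial update: the action value is moved toward
    the sensitivity-transformed reward by a prediction error weighted by the
    subject-specific learning rate beta; rho scales the choice rule."""
    q = np.zeros(2)                      # action values for two responses
    actions, rewards = [], []
    for _ in range(n_trials):
        # softmax (logistic) choice between the two responses
        p1 = 1.0 / (1.0 + np.exp(-rho * (q[1] - q[0])))
        a = int(rng.random() < p1)
        # illustrative "rich" vs "lean" reward probabilities
        r = rng.binomial(1, 0.75 if a == 1 else 0.30)
        q[a] += beta * (sensitivity(r) - q[a])   # prediction-error update
        actions.append(a)
        rewards.append(r)
    return np.array(actions), np.array(rewards)

actions, rewards = simulate_subject(beta=0.2, rho=2.0)
```

In this sketch a larger beta makes the action values track recent (sensitivity-transformed) rewards more quickly, which is the role the subject-specific learning rate plays in the proposed model.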
Estimation is carried out by maximizing the joint conditional log-likelihood over all patients and all trials, where the nondecreasing function and its first-order derivative are characterized by I-splines and M-splines. Because of the close relationship between M-splines and B-splines, M-splines and I-splines share the good approximation power of B-splines. We studied the asymptotic consistency and asymptotic distributions of the parameters in the proposed model. Note that in the theoretical setting we assume that n (the number of patients) goes to infinity while T (the number of trials) is fixed. It is much more challenging to let both n and T go to infinity, and we will study this setting in future work. The proposed model can also be regarded as a time-varying model in which the reward sensitivity is allowed to change over time. One future direction is to allow the learning rate to evolve gradually as well.
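The role of the monotone basis in this estimation strategy can be illustrated with a toy least-squares analogue. The ramp basis and projected gradient descent below are simplifications standing in for the I-spline basis and the constrained likelihood maximization; the key point carried over from the paper's construction is that nonnegative coefficients on nondecreasing basis functions guarantee a nondecreasing fit.

```python
import numpy as np

# Toy monotone-fitting sketch: the target is represented as an intercept
# plus a nonnegative combination of nondecreasing basis functions, so
# projecting the coefficients onto c_k >= 0 keeps the fit nondecreasing.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-2.0, 2.0, 200))
y = np.tanh(x) + 0.1 * rng.normal(size=x.size)  # noisy nondecreasing target

knots = np.linspace(-2.0, 1.0, 6)
ramps = np.clip(x[:, None] - knots[None, :], 0.0, 1.0)  # nondecreasing in x
B = np.column_stack([np.ones_like(x), ramps])           # intercept + ramps

c = np.zeros(B.shape[1])
lr = 0.05
for _ in range(3000):
    grad = 2.0 * B.T @ (B @ c - y) / x.size  # least-squares gradient
    c -= lr * grad
    c[1:] = np.maximum(c[1:], 0.0)           # keep ramp coefficients >= 0

f_hat = B @ c
assert np.all(np.diff(f_hat) >= -1e-8)       # fitted function nondecreasing
```

In the actual method the objective is the joint conditional log-likelihood rather than squared error, and the basis is the I-spline system described in Appendix A.1, but the monotonicity mechanism is the same.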
Our proposed work aligns with imitation learning or behavioral cloning in the RL literature. A related line of research is inverse RL (Ng and Russell 2000; Abbeel and Ng 2004; Arora and Doshi 2021; Luckett et al. 2021), which assumes that an agent's behavior follows an optimal policy and seeks to learn the corresponding reward function from the agent's observed behavior and decision-making process. In contrast, the problem we study here aims to recover the agent's reward sensitivity function and learning rate without assuming that the agent's behavior follows an optimal policy.
Our current work does not accommodate heterogeneity in Q_{it} beyond the learning rate and reward sensitivity. One extension is to model these in a mixed-effects framework, but at the cost of more complex modeling and additional assumptions. Similarly, covariates can be introduced to model systematic heterogeneity in learning rates, reward sensitivity, and other parameters. Some recent work shows that the perceptual decision-making process may alternate between multiple interleaved strategies. Ashwood et al. (2022) provide an example in which the decision process is a mixture model of "engaged" and "lapse" trials: when the decision-making strategy is "engaged," a subject chooses according to the RL model; when the strategy is "lapse," the subject ignores the stimulus and chooses based on a fixed probability. The strategy switching is characterized by a hidden Markov model. Extending the semiparametric inverse reinforcement learning framework to allow strategy switching is of interest. Another direction worth pursuing is a broader reinforcement learning context where states are not given randomly, so that state distributions depend on past actions. Lastly, the current model is not restricted to characterizing decision-making in MDD patients. Our flexible framework can easily be applied to model other mental disorders, such as schizophrenia, and to translate neuroscience and behavioral science constructs into clinical applications (Huys, Maia, and Frank 2016).

A.1 Representation of M-splines and I-splines
According to Ramsay (1988), for a set of M-splines on [L, U] with K basis functions of order r, denote the K bases as M_1(x | r), . . ., M_K(x | r), and denote the knot sequence of the M-splines as {τ_1, τ_2, . . ., τ_{K+r}}, where the boundary knots satisfy τ_1 = · · · = τ_r = L and τ_{K+1} = · · · = τ_{K+r} = U. (See Ramsay (1988) for a more general case that allows ties among the interior knots.) The M-splines can be constructed by the recursion

M_k(x | 1) = 1/(τ_{k+1} − τ_k) if τ_k ≤ x < τ_{k+1}, and 0 otherwise;

M_k(x | r) = r{(x − τ_k)M_k(x | r − 1) + (τ_{k+r} − x)M_{k+1}(x | r − 1)} / {(r − 1)(τ_{k+r} − τ_k)}, for r > 1.

Define a set of I-splines as the integrals of the corresponding set of M-splines, that is, I_k(x | r) = ∫_L^x M_k(u | r) du. Since M-splines are nonnegative, a linear combination of I-splines with nonnegative coefficients is guaranteed to be nondecreasing. A function represented by I-splines of order r is r − 1 times continuously differentiable, and its rth derivative is bounded on [L, U]; a function represented by M-splines of order r is r − 2 times continuously differentiable, and its (r − 1)th derivative is bounded on [L, U]. Finally, a function represented by M-splines can also be represented by B-splines; more specifically, a set of B-splines can be expressed by a set of M-splines with the same knots and order, that is, B_k(x | r) = (τ_{k+r} − τ_k)M_k(x | r)/r. Hence, M-splines and I-splines maintain the good approximation power of B-splines (Schumaker 2007). In practice, AIC can be used as the model selection criterion: a grid search over spline orders and interior knots can be conducted to select the I-spline basis with the smallest AIC value.
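The recursion and the integral definition can be checked numerically. The sketch below uses a hypothetical knot sequence and replaces the closed-form I-spline expression with numerical integration; it is an illustration of the construction, not production spline code.

```python
import numpy as np

def m_spline(x, k, r, tau):
    """M-spline basis M_k(x | r) via the Ramsay (1988) recursion.
    k is 0-indexed; tau is the full knot sequence of length K + r,
    with boundary knots repeated r times."""
    if r == 1:
        if tau[k] <= x < tau[k + 1]:
            return 1.0 / (tau[k + 1] - tau[k])
        return 0.0
    denom = (r - 1) * (tau[k + r] - tau[k])
    if denom == 0.0:
        return 0.0
    return r * ((x - tau[k]) * m_spline(x, k, r - 1, tau)
                + (tau[k + r] - x) * m_spline(x, k + 1, r - 1, tau)) / denom

def i_spline(x, k, r, tau, n_grid=2001):
    """I-spline I_k(x | r): integral of M_k(u | r) from the left boundary
    to x, here computed by the trapezoid rule. A nonnegative combination
    of I-splines is therefore nondecreasing."""
    grid = np.linspace(tau[0], x, n_grid)
    vals = np.array([m_spline(u, k, r, tau) for u in grid])
    dx = grid[1] - grid[0]
    return float((vals.sum() - 0.5 * (vals[0] + vals[-1])) * dx)

# Hypothetical knots: order r = 2, K = 3, boundary knots repeated r times.
# The B-spline relation is B_k(x | r) = (tau[k+r] - tau[k]) * M_k(x | r) / r.
tau = [0.0, 0.0, 1.0, 2.0, 2.0]
print(i_spline(2.0, 0, 2, tau))  # each M-spline integrates to 1
```

Evaluating I_0 at the right boundary recovers the total mass of M_0 (one), and evaluating it at interior points traces out a nondecreasing curve, matching the properties stated above.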

A.2 Asymptotic Theory
To simplify the notation, let θ ∈ R^d collect the finite-dimensional model parameters (including α and μ_ν), and let θ_0 and f_0(·) denote the true parameters and the true nonparametric function. Write the log-likelihood function from a single subject as ℓ(θ, f), with a = (ν, γ).

Condition 2. Let f̃_0 be the projection of f_0 onto S_n under the L_2 norm. We assume that there exists a constant C > 0 such that f̃_0 belongs to the subspace S_n*(C) of S_n.

Condition 3. For any two different sets of parameters (θ_1, f_1) and (θ_2, f_2), the log-likelihoods satisfy ℓ(θ_1, f_1) ≠ ℓ(θ_2, f_2) with positive probability. The information operator I(θ, f) of the log-likelihood ℓ(θ, f) at the true parameter values (θ_0, f_0) is invertible.

Condition 4. The number of I-spline basis functions K_n satisfies K_n^{2r−2} n^{−1/2} → ∞ and K_n^{1/2} n^{−1/2} → 0. The distances between adjacent interior knots lie between c^{−1}K_n^{−1} and cK_n^{−1} for some constant c > 1.
Condition 1 ensures that the function f_0 is bounded and smooth. Conditions 2 and 4 guarantee that f and f′ are uniformly bounded for all f ∈ S_n*(C), where f′ is the first-order derivative of f. The first part of Condition 3 is the identifiability condition, and the second part implies that the information operator is invertible in the L_2 space. In particular, this identifiability condition implies that the latent variable Z_{it} defined in (4) does not degenerate and has a continuous density with support containing [L, U], the domain of f. In Condition 4, we may set K_n = n^δ, where 1/{4(r − 1)} < δ < 1. We state the consistency and asymptotic distribution of the estimators of the model parameters in the following two theorems, whose proofs are given in supplementary materials Section S.3.

Theorem 2. Under Conditions 1-4, n^{1/2}{θ̂ − θ_0, f̂ − f_0} converges in distribution to a zero-mean, tight Gaussian process in the metric space l^∞(O_θ × F_f) as n → ∞.
The proof of Theorem 2 also implies the asymptotic normality of θ̂ and of smooth functionals of f̂. However, it does not yield pointwise normality, although the simulation study suggests that normality is still plausible. Also note that the convergence rate given in Theorem 1 is an intermediate result needed to prove the faster rate and asymptotic normality of θ̂ in Theorem 2. On the other hand, the convergence rates for f̂ in the two theorems are not directly comparable: Theorem 1 gives the convergence rate of f̂ in the L_2 metric, while the convergence rate and normality for f̂ in Theorem 2 refer to functionals of f̂ of the form ∫ h*(u) f̂(u) du.

Supplementary Materials
The supplementary materials contain results of additional simulation studies, additional analyses of the EMBARC study, and proofs of the theorems.

Figure 1.Schematic of a PRT experiment and EMBARC study design.

Figure 2. Diagram of the semiparametric inverse RL model over trials t, where f is the reward sensitivity function, β_i is the learning rate, and ρ_i is related to the relative reward sensitivity.

Figure 3. (a) The estimated reward sensitivity functions for the MDD and Control groups; (b) the estimate (red curve), 95% pointwise confidence band (gray shaded area), and 95% simultaneous confidence band (two blue dash-dotted lines) for the contrast of f(·) between the MDD and Control groups.

Figure 4. (a) The estimated reward sensitivity function for the SERT group pre- and post-treatment; (b) the estimated reward sensitivity function for the PBO group pre- and post-treatment; (c) the estimate (red curve), 95% PCB (gray shaded area), and 95% SCB (two blue dash-dotted lines) of the difference between the PBO and SERT groups in the pre-post change of the reward sensitivity function f(·) (difference in differences).

For any vector b ∈ R^d and function h ∈ L_2[L, U] with bounded rth derivatives, we define the score operators l_θ(θ, f)b and l_f(θ, f)[h] by differentiating the log-likelihood function with respect to θ and f along the submodels θ + b and f + h. Define the information operator of (θ, f) as I(θ, f) = E{(l_θ, l_f)*(l_θ, l_f)}, where (l_θ, l_f)* is the dual operator of (l_θ, l_f). I(θ, f) is a self-adjoint operator from E = {(b, h) : b ∈ R^d, h ∈ L_2[L, U] with bounded rth derivatives} to itself. Define I^{1/2}(θ, f) as the square-root operator of I(θ, f) such that I = I^{1/2} ∘ I^{1/2}, where ∘ denotes operator composition. To maximize (7), we only consider nonparametric functions in S_n. The following conditions are needed for the theorems in this article.

Condition 1. The true values θ_0 lie in the interior of a known compact set in R^d. The true function f_0(·) is an increasing function with f_0(0) = 0 on a known bounded interval [L, U], where L < 0 and U > 0, with f_0(x) ≡ f_0(U) for x > U and f_0(x) ≡ f_0(L) for x < L. Moreover, f_0(·) is r − 1 times continuously differentiable and its rth derivative is bounded on [L, U], where r ≥ 2.

Table 1. Summary of the parameter estimates in 200 simulations with Bernoulli (Case I) and Beta (Case II) reward distributions.

Table 2. Summary of the estimated nonparametric reward sensitivity function in 200 simulations with Bernoulli (Case I) and Beta (Case II) reward distributions.

Table 3. Estimation of the parameters in the EMBARC study under the proposed method for the MDD and Control groups.

Table 4. Estimation of the parameters in the EMBARC study for the treatment (SERT) and placebo (PBO) groups pre- and post-treatment.