Subjective versus Objective Incentives and Teacher Productivity

A central challenge facing firms is how to incentivize employees. While objective, output-based incentives may be theoretically ideal, in practice they may lead employees to reduce effort on non-incentivized outcomes and may fail in settings where effort is weakly tied to output. We study the effect of subjective incentives (manager performance evaluation) and objective incentives (test-score-based) relative to no incentives for teachers using an RCT in 230 Pakistani schools. First, we show that subjective and objective incentives both increase test scores, with effects of similar magnitude. However, objective incentives decrease non-test-score student outcomes relative to subjective incentives. Second, we show that teachers' effort response differs sharply under each scheme, with attendance increasing under subjective incentives and teaching quality decreasing under objective incentives. Finally, we rationalize these effects through the lens of a moral hazard model with multi-tasking. We use within-treatment variation to isolate the causal effect of contract noise and distortion and show that these channels explain most of our reduced form effects.

RISE Working Paper 22/092, March 2022


Introduction
How should schools incentivize teachers when effort is non-verifiable or non-contractible? Contract theory provides an answer: the second best is to incentivize on outcomes of the employee's production function. However, this introduces two new problems: distortion (over-incentivizing measurable outcomes while ignoring others) and noise (outcomes are a noisy function of employee effort). How do most employers outside of schools actually incentivize workers? They use manager-discretionary (subjective) incentives rather than outcome-based (objective) ones. Raises, promotions, and terminations are subject to manager discretion for most employees. In the US, 85% of full-time employees have at least one aspect of their compensation determined by their manager, and 90% of teacher performance evaluations have a subjective component (Engellandt and Riphahn, 2011; National Center for Education Statistics, 2011). Despite the prevalence of subjective incentives, there is limited causal evidence on their effects and on whether they could work in the teaching setting.
In this paper, we ask two questions: What is the effect of subjective versus objective incentives on teacher productivity? Are subjective incentives able to help alleviate the problems of noise and distortion, which often plague objective incentives? We answer these questions by conducting an 18-month randomized controlled trial with 234 private schools in Pakistan. We randomize schools to provide core teachers with one of three contracts: (i) control: flat raise, in which all teachers receive a raise of 5% irrespective of performance; (ii) treatment 1: subjective performance raise, in which teachers receive a raise of 0-10% based on their manager's rating of their performance 1 ; or (iii) treatment 2: objective performance raise, in which teachers receive a raise of 0-10% based on their students' midyear and end-of-year test performance (Barlevy and Neal, 2012). Both treatments are within-school tournaments and have the same distribution of raise thresholds. These similarities allow us to isolate the effort response to changing only the performance metric (manager rating versus test score) while holding other features of the incentive structure constant.
We use detailed administrative, survey, test, and classroom observation data to understand each contract's effect on teacher effort and student outcomes. Student outcomes are measured along two dimensions: test scores and socio-emotional development. Test score data comes from an endline test conducted by the research team, one month after the end of the contract. Students are tested in core subjects (English, Urdu, math, science, and economics) in grades 4-13. A variety of question types and sources allow us to test whether effects are driven by memorization-type questions.
Socio-emotional development is measured along four dimensions: love of learning, ethical behavior, inquisitiveness, and global competency. These dimensions are measured using self-report survey items drawn from several psychological indices used for measuring socio-emotional development in children. 2
1 Managers are generally principals or vice-principals and spend about a third of their time on employee management tasks, such as observations, feedback, and professional development.
2 Items are drawn from the National Student Survey, Learning and Study Strategies Inventory, Big Five (children's scale), Eisenberg's Child-Report Sympathy Scale, Bryant's Index of Empathy Measurement, Afrobarometer, World Values Survey, and Epistemic Curiosity Questionnaire.
In our first main result, we show that both subjective and objective contracts are equally effective at increasing test scores. Both contracts increase test scores by 0.09 sd, which is very similar to average effects from meta-analyses of performance pay for teachers (Pham et al., 2020).
These results are consistent across subjects and grades and are not driven by rote-memorization-type questions. In contrast to the test score results, however, we find that objective and subjective incentives have different effects on other outcomes. Objective incentives negatively affect student socio-emotional development, including a significant decrease in love of learning and an increased likelihood that students say they want to change schools. Subjective incentives result in a small positive effect on overall socio-emotional skills. These combined effects suggest that teachers under objective contracts focused exclusively on students' academic improvement, at the cost of more well-rounded development, whereas teachers under the subjective contract were able to prioritize both areas.
To understand teachers' behavioral responses to these incentive contracts, we compile rich data on teacher behavior inside and outside the classroom. We record 6,800 hours of classroom footage and review it using a standard classroom observation rubric (Pianta et al., 2012). The rubric captures teacher behavior along dozens of dimensions, from the use of punitive discipline to the proportion of student versus teacher talk time. The rubric also measures the amount of time spent on test-taking or test-preparation activities. To measure effort outside the classroom, we have teachers complete a time use questionnaire. Combined, these two data sources allow us to understand teacher behavior change under subjective versus objective incentives.
In our second main result, we find both subjective and objective incentives lead to changes in classroom practices. As one might expect, subjective incentives spur actions that managers value, and objective incentives spur actions that most quickly and easily translate into test score gains.
Subjective incentives lead to increased targeting of individual student needs and greater use of technology in the classroom. Both teaching practices are ones that principals identified as markers of high-quality teaching. Objective incentive schools see a five-fold increase in class time spent on test preparation activities. These teachers also exhibit more negative discipline techniques, such as yelling at students.
Our reduced form effects suggest that subjective performance incentives increase teacher effort without producing distortionary effects. How are managers able to accomplish this? We find that, on average, managers place significant value on teachers' value-added and pedagogy. We also do not find any evidence of favoritism or gender bias. However, there is heterogeneity in managers' application of the contract: we cannot reject that there is no effect of subjective performance pay for the worst quintile of managers.
We then draw on a model of moral hazard with multi-tasking to explain our main reduced form results: (i) similar, positive effects of subjective and objective incentives on test scores; (ii) negative effects of objective incentives on socio-emotional development; and (iii) significant differences in teacher classroom behavior across the two treatments. Moral hazard models with multi-tasking (Baker, 2002) isolate two main components of the incentive structure which affect employee response: noise (the correlation between employee actions and incentive pay) and distortion (the correlation between the piece rates for different actions and the marginal returns of those actions to firm outcomes). Our paper seeks to understand whether noise and distortion are important mechanisms behind the reduced form effects we see.
Our empirical approach for this mechanism analysis proceeds in three steps. First, we show differences in employees' perceptions of the noisiness and distortion of subjective versus objective incentives. Second, we exploit partially exogenous heterogeneity within a given treatment to isolate the causal effect of noise and distortion, each individually, on student outcomes. Finally, we bring these two estimates together and show that, given the differences in levels of noise and distortion across the contracts and the effects of noise and distortion on student outcomes, we can explain a large portion of the reduced form effects through these channels. We explain each step in detail below.
The first step is showing that teachers believe there are differences in the extent of noise and distortion across the two treatments. We do this by asking teachers at endline the extent to which working harder will increase their incentive pay. If they believe their effort maps closely into their pay, then the incentive system is less noisy. We then ask what types of actions (lesson planning, improving pedagogy, helping other teachers, etc.) are rewarded under each system. This allows us to measure teachers' perceptions of whether the incentive is distorted toward certain student outcomes at the cost of others.
We find that teachers believe subjective performance incentives are less noisy than objective incentives and, therefore, view subjective incentives as more effective at motivating behavior. They view test-score-based incentives as much less within their control because many factors beyond their effort affect student scores. We also find that teachers in the objective treatment are more likely to prioritize the types of actions which lead to test score gains, at the cost of other areas of student development. Teachers under the subjective contract prioritize actions that lead to academic gains but also prioritize administrative tasks, which are likely to be preferred by their manager.
We also show there are no other differences beyond noise and distortion across the two treatment arms. We show there is similarity in implementation timelines, understanding of the contract treatments, and beliefs about the fairness of each treatment arm.
The second step of our mechanisms analysis is to demonstrate that noise and distortion themselves affect student outcomes. To do this, we zoom in to the subjective treatment schools and look at settings with high and low noise and then high and low distortion. By controlling for other differences across settings, we are able to isolate the effect of these two mechanisms on outcomes.
To determine the effect of noise on student outcomes, we compare subjective treatment schools whose managers teachers rate as accurate in assessing teacher effort with those whose managers are rated as inaccurate. We use this rating of managers' accuracy interacted with treatment status as an instrument for the perceived noisiness of the contract. We show that this rating of managers affects teachers' ratings of noisiness only in the subjective arm. This instrument for noise is robust to controlling for many other features of the contract and school environments.
Using this instrument for noise, we find that a 1 SD increase in the perceived noisiness of the contract decreases hours worked by 13 and decreases student test scores by 0.2 SD. These results are robust to a variety of controls. This suggests that employees are very sensitive to the noisiness of the contract, and that this sensitivity affects the success of performance pay in inducing an effort response from employees.
To understand the effect of distortion on student outcomes, we again exploit variation within the subjective performance pay schools. We use data on managers' preferences prior to the start of the experiment. Before the treatments are announced, managers sit down with their teachers and delineate goals for the following year for each teacher. Example goals include increasing students' English proficiency, reaching certain grade targets, or improving lesson plans. We code these goals using text analysis and categorize them into four types of teacher actions: administrative tasks, professional development and collaboration tasks, improvements in teacher pedagogy, and test-score-based goals. A month after these goals are set between managers and teachers, we announce the treatment assignment.
Of course, schools in which managers focus on administrative goals versus those in which managers focus on pedagogy goals are likely different in many ways. Therefore, our approach is to interact these goals with the subjective treatment, to isolate the effect of these goals in settings in which teachers would be more likely to focus on them (those who were assigned subjective treatment) relative to places where the goals have no financial stake (objective and flat treatment schools). We use the interaction of subjective treatment and goal, controlling for level differences, to isolate the effect of these goal differences on student outcomes. We find that a larger focus on test scores and professional development increases students' endline test scores. However, more focus on test scores results in negative effects on student socio-emotional development. These results are robust to controlling for other features of the contract environment.
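The goal-coding step described above can be illustrated with a simple keyword classifier. The lexicon and function below are hypothetical sketches (the paper does not publish its text-analysis scheme); they only show the general shape of mapping free-text goals into the four categories:

```python
# Hypothetical keyword lexicon; the paper's actual coding scheme is not public.
GOAL_CATEGORIES = {
    'administrative': ['attendance', 'records', 'punctual', 'paperwork'],
    'professional_development': ['training', 'workshop', 'collaborate', 'mentor'],
    'pedagogy': ['lesson plan', 'questioning', 'feedback', 'engagement'],
    'test_scores': ['exam', 'scores', 'grades', 'results', 'proficiency'],
}

def code_goal(goal_text):
    """Assign a manager-set goal to the category whose keywords it matches most."""
    text = goal_text.lower()
    counts = {cat: sum(kw in text for kw in kws)
              for cat, kws in GOAL_CATEGORIES.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else 'uncoded'
```

In practice one would validate such a classifier against hand-coded goals before using the categories in a regression.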
Combined, these results help us understand why it is possible to have the same effect on test scores without needing to incentivize test scores directly. Subjective incentives are less noisy, producing a larger overall response, and less distorted, allowing teachers to prioritize multiple areas of student development. We find that the noise and distortion channels are able to explain a substantial portion of the reduced form effects we see.
Our paper makes three key contributions. First, it is the first study, to our knowledge, to isolate the causal effect of subjective versus objective incentives, and of subjective versus flat incentives, for employees in any sector (Lazear and Oyer, 2012; Oyer and Schaefer, 2011).
Existing studies have tested bundled incentives (combined subjective and objective incentives versus no incentives) on employee behavior (Khan et al., 2019; Fryer, 2013). Previous work has also exploited heterogeneity across plants to measure the effect of more or less steep subjective incentives on employee overtime (Engellandt and Riphahn, 2011). There is also evidence that managers, especially in educational settings, may have imperfect information about worker effort or may be biased toward certain groups (Jacob and Lefgren, 2008; Gibbs et al., 2004).
Second, we add to a robust literature on the effect of performance pay for teachers by providing two new findings (Lavy, 2007; Muralidharan and Sundararaman, 2011; Fryer, 2013; Goodman and Turner, 2013). We show the first evidence of objective performance pay having detrimental effects on non-academic student outcomes, consistent with multi-tasking models. Next, we show direct evidence that objective incentives result in teachers distorting their effort toward teaching pedagogy that impacts test performance at the cost of other areas of student development. This includes the use of class time for test prep and the use of punitive discipline. Both of these results have long been suspected, but we provide the first documentation of such effects (Baker, 2002; Leigh, 2013).
Third, we provide, to our knowledge, the first evidence measuring the extent of noise and distortion within an employee's contract and isolating the effects of those mechanisms on firm outcomes. There is a rich theoretical literature on the importance of these mechanisms (Baker, 2002). Empirical work has also investigated the role of noise in employee response (Prendergast, 1999; Prendergast and Topel, 1993; Prendergast, 2007).
The remaining sections are organized as follows. Section 2 details the treatment and control conditions, the data collected, and standard implementation checks. Section 3 provides the main results of subjective and objective performance incentives on teacher effort and student outcomes.
Section 4 gives an overview of the standard moral hazard model with multi-tasking and highlights the two key mechanisms which underpin the reduced form effects we find. Section 5 unpacks the mechanisms underlying the main effects in light of the moral hazard model, and section 6 concludes.

Performance Incentive Treatments
We partnered with a large private school system in Pakistan to implement the research design.
Schools are randomized to receive one of three contracts which determine the size of teachers' raises at the end of the calendar year. 3 The three contracts were:
• Control: Flat Raise: Teachers receive a flat raise of 5% of their base salary.
• Subjective Treatment Arm: Teachers are evaluated based on their manager's rating of their performance against a set of performance criteria. 4
• Objective Treatment Arm: Teachers are evaluated based on their average percentile value-added (Barlevy and Neal, 2012) for the spring and fall terms. Percentile value-added is constructed by calculating students' baseline percentile within the entire school system and then ranking their endline score relative to all other students who were in the same baseline percentile. 5 We then average across all students the teacher taught during the two terms.
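The percentile value-added construction just described (rank each student's endline score among all students who shared the same baseline percentile, then average by teacher) can be sketched as follows. This is an illustrative implementation, not the paper's code; the mid-rank handling of ties and the data layout are our assumptions:

```python
from collections import defaultdict

def percentile_value_added(students):
    """students: list of dicts with keys 'teacher', 'baseline_pct', 'endline'.
    Returns each teacher's average percentile value-added."""
    # Group endline scores by baseline percentile bin (peers with same start point)
    by_bin = defaultdict(list)
    for s in students:
        by_bin[s['baseline_pct']].append(s['endline'])
    # Within each bin, convert each endline score to a percentile rank
    teacher_ranks = defaultdict(list)
    for s in students:
        peers = by_bin[s['baseline_pct']]
        n = len(peers)
        below = sum(1 for p in peers if p < s['endline'])
        ties = sum(1 for p in peers if p == s['endline'])
        rank = (below + 0.5 * ties) / n  # mid-rank for ties (an assumption)
        teacher_ranks[s['teacher']].append(rank)
    # Average across all students the teacher taught
    return {t: sum(r) / len(r) for t, r in teacher_ranks.items()}
```

For example, two students who started at the same baseline percentile are ranked against each other at endline, and those ranks are averaged within teacher.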
The contract applied to all core teachers (those teaching Math, Science, English, and Urdu) in grades 4-13. Elective teachers and those teaching younger grades received the status quo contract.
All three contracts have equivalent budgetary implications for the school. We over-sampled the number of subjective treatment arm schools due to partner requests, so the ratio of schools is 4:1:1 for subjective treatment, objective treatment, and control, respectively.
Both the subjective and objective treatment arms have several features in common, allowing us to isolate the effect of differing the performance metric and nothing else about the incentive structure.
Both treatments are within-school tournaments, which holds the level of competition fixed between the two treatments. In addition, the variance in the distribution of incentive pay is equivalent across the two treatments. As we show in section 4, holding the variance constant allows us to interpret differences in noise levels between the two systems as equivalent to differences in incentive steepness. The performance evaluation timeline also played out the same for all groups. Before the start of the year, managers set performance goals for their teachers irrespective of treatment.
4 An example set of criteria is provided in Appendix Table A1.
5 Percentile value-added has several advantageous theoretical properties (Barlevy and Neal, 2012) and is also more straightforward to explain to teachers than more complicated calculations of value-added.
Teachers were evaluated based on their performance from January through December, with testing conducted in June and January to capture student learning in each term of the year. 6 To ensure teachers and managers had a full understanding of how each contract would work, we conducted an intensive information campaign with schools. First, the research team held an in-person meeting with each manager, explaining the contract assigned to their school and, in the case of the subjective treatment, what would be expected of them and when. Second, the school system's HR department conducted in-person presentations once a term at each school to explain the contract. Third, teachers received frequent email contact from school system staff reminding them about the contract, and halfway through the year treatment teachers were provided midterm information about their rank based on the first six months. 7 Control teachers were also provided information about their performance on one of the two metrics, in order to hold the provision of performance feedback constant across all teachers.

Timeline and Data
Our study was conducted from October 2017 through June 2019. It covered one performance review cycle conducted from January-December 2018 in which the contracts were in place. Figure 1 presents the main treatment implementation (detailed in section 2.1) and data collection activities (detailed below).
Our data allow us to understand how teachers changed their effort under each incentive scheme, why the incentives affected effort in the way they did, and the resulting effect this had on student outcomes. We draw on data from (i) the school system's administrative records, (ii) baseline and endline surveys conducted with teachers and managers, (iii) endline student testing and surveys, and (iv) detailed classroom observation data.
Administrative Data: The administrative data details position, salary, performance review score, attendance, and demographics for all employees. We also have biometric clock-in/out data for all schools. The data was provided by the school system for the period of July 2016 to June 2019. It includes classes and subjects taught for all teachers, and end-of-term standardized exam scores for all students (linked to teachers). From September through December 2018, we also have data on classroom observations conducted by managers. Managers use a rubric similar to the one used by the research team to conduct classroom observations (detailed below).
6 The school system's central office designed and administered the June test to all students in a given grade. However, tests are graded locally by the school, often by the students' teacher. Due to concerns of grade manipulation, grading was audited by the research team: 10% of each teacher's exams were regraded. If the teacher's grades and the auditor's grades differed by more than 5%, another 10% of their tests were audited. If the average was still off by more than 5%, all of the teacher's exams were regraded. Overall, grade manipulation was small and was generally driven by cases where teachers bumped students' grades up from just failing to just passing. There was no heterogeneity in grading accuracy by treatment. The January test was conducted exclusively by the research team (described in section 2.2 below). These tests are not used as an outcome measure in this paper.
7 An example midterm information note is provided in Appendix Figure A2.
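The escalating grade audit described in footnote 6 is a simple loop: regrade a 10% sample, escalate if the average discrepancy exceeds 5%, and regrade everything if the discrepancy persists. A sketch under assumed data structures (all names here are hypothetical; this is not the research team's code):

```python
import random

def audit_teacher_exams(exam_ids, auditor_grade, teacher_grade,
                        sample_frac=0.10, tol=0.05):
    """Return the set of exams regraded under the escalating audit protocol:
    audit a 10% sample; if the mean absolute discrepancy between auditor and
    teacher grades exceeds 5%, audit another 10%; if still off, regrade all."""
    remaining = list(exam_ids)
    audited = []
    for _stage in range(2):  # two sampled stages before a full regrade
        k = max(1, int(sample_frac * len(exam_ids)))
        sample = random.sample(remaining, min(k, len(remaining)))
        audited.extend(sample)
        remaining = [e for e in remaining if e not in sample]
        gap = sum(abs(auditor_grade[e] - teacher_grade[e])
                  for e in audited) / len(audited)
        if gap <= tol:
            return audited  # grading accepted after this stage
    return audited + remaining  # discrepancy persisted: full regrade
```

Grades are taken as fractions (so `tol=0.05` corresponds to the 5% threshold); that unit convention is an assumption.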
Baseline Survey: The baseline survey measured teachers' preferences over different contracts and beliefs about their performance under each contract. 40% of schools were randomly selected to participate in an in-person baseline survey conducted in October 2017, covering 2,500 teachers and 119 managers. These outcomes are primarily used for a companion paper on teacher selection in response to performance pay (Brown and Andrabi, 2020).
Endline Student Survey: The choice of the four socio-emotional areas described in the introduction came from the school system's priorities. They are the four areas of socio-emotional development the school system expects its teachers to focus on. These areas are posted on the walls in schools, and teachers receive professional development on them. Some managers also make these areas an explicit part of teachers' evaluation criteria. In addition to these four areas, the survey asked whether students liked their school or wanted to change to a different school.
Classroom Observation Data: To measure teacher behavior in the classroom, we recorded 6,800 hours of classroom footage and reviewed it using the Classroom Assessment Scoring System, CLASS (Pianta et al., 2012), which measures teacher pedagogy across a dozen dimensions. 10 11 We also recorded whether teachers conducted any sort of test preparation activity and the language fluency of teachers and students.
Performance Evaluation Data: The school system had an existing performance evaluation system in which managers rated their teachers in December on performance criteria set in the previous December. We layered these new contracts on top of that existing system. In December 2017, before the announcement of treatments, managers set a number of performance criteria for each teacher, as they do each year. In a randomly chosen 3/4 of the subjective schools, those goals then become the evaluation criteria used to determine teachers' raises for the following year. In the rest of the schools (objective, control, and the remaining subjective) those goals are used to provide feedback to teachers but have no financial consequence. In the remaining 1/4 of subjective schools, managers were required to create a new set of goals now that they knew there would be financial stakes attached to those goals. They were encouraged to set the goals to be focused on employee effort, rather than employee characteristics, like training or credentials. Since the performance evaluation system exists for all employees, we can use data on what goals were set and the scores on those goals to understand manager priorities and ratings with and without financial stakes tied to the performance rating.

Sample and Characteristics of the Employee-Manager Relationship
Teachers The study was conducted with a large private school system in Pakistan. The student body is from an upper-middle-class and upper-class background. School fees are $2,300-$4,300 USD (PPP) per year. Teachers are generally younger and less experienced than their counterparts in the US, though they have similar levels of education. Table 1 presents summary statistics of our sample compared to a representative sample of teachers in the US (National Center for Education Statistics, 2011). Our sample is mostly female (80%), young (35 years old on average), and inexperienced (5 years of experience on average, with a quarter of teachers in their first year teaching). All teachers have a BA, and 68% have some post-BA credential or degree. Salaries are on average $17,000 USD (PPP).
10 There are tradeoffs between conducting in-person observations and recording the classroom and reviewing the footage. Videotaping was chosen based on pilot data which showed that it was less intrusive than human observation (and hence preferred by teachers). Videotaping was also significantly less expensive and allowed for ongoing measurement of inter-rater reliability (IRR).
11 We did not hire Teachstone staff to conduct official CLASS observations, as it was cost-prohibitive and we required video reviewers to have Urdu fluency. Instead, we used the CLASS training manual and videos to conduct an intensive training with a set of local post-graduate enumerators. The training was conducted over three weeks by Christina Brown and a member of the CERP staff. Before enumerators could begin reviewing data, they were required to achieve an IRR of 0.7 with the practice data. 10% of videos were also double-reviewed to ensure a high level of ICC throughout the review process. We have a high degree of confidence in the internal reliability of the classroom observation data, but because observations were not conducted by Teachstone staff, we caution against comparing these CLASS scores to CLASS data from other studies.
Managers In order to understand the effects of subjective performance pay, we need to understand who the managers are and what role they play in overseeing teachers. Managers here are either a principal in small schools or a vice principal in larger schools. They are tasked with overseeing the overall operations of the school and managing employees, including teachers and other support staff. Table 2 presents information about managerial duties compared to a US sample of principals.
Like in the US, our managers are generally older (45 years old), less likely to be female (61%), and more experienced (9.6 years) than teachers. Most were previously teachers and transitioned into an administrative role. Managers spend about a third of their working hours overseeing their staff: observing classes, providing feedback, meeting with teachers, and reviewing lesson plans. The rest of their time is spent on other tasks related to the school's functioning. The distribution of time use is fairly similar to that of principals in the US.
However, managers in our sample spend much more time directly observing teachers. They do about twice the number of classroom observations each year (4.7 versus 2.5 in the US). They also rate themselves higher on most areas of the management survey questions (4.3 versus 2.8 out of 5), including formal evaluation, monitoring, and feedback systems for teachers. This is an important difference, as these management practices could positively affect the success of the subjective treatment arm and may help us understand the external validity of these results.

Intervention Fidelity
In this section, we provide evidence to help assuage any concerns about the implementation of the experiment. First, we show balance in baseline covariates. Then, we present information on the attrition rates. Finally, we show teachers and managers have a strong understanding of the incentive schemes. Combined, this evidence suggests the design "worked". Schools in the two treatment arms and control appear to be balanced along baseline covariates.
Appendix Table A1 compares schools along numerous student and teacher baseline characteristics.
Of 27 tests, one is statistically significant at the 10% level and one is statistically significant at the 5% level, no more than we would expect by random chance. Results presented include specifications which control for these few unbalanced variables.
Administrative data is available for all teachers and students who stay employed or enrolled during the year of the intervention. During this time, 23% of teachers leave the school system, which is very similar to the historical rate of turnover. 88% of teachers completed the endline survey. While teachers were frequently reminded and encouraged to complete the survey, some chose not to. We do not see differences in these rates by treatment.
Finally, for the endline test, parents were allowed to opt out of having their children tested. Student attrition on the endline test was 13%, with 3 pp of that coming from students absent from school on the day of the test and the remaining 10 pp coming from parents choosing to opt their students out of the exam. On both the endline testing and endline survey, we do not find differences in attrition rates by treatment. We also do not find that lower-performing students were more likely to opt out.
Teachers have a decent understanding of their treatment assignment. Six months after the end of the intervention, we ask teachers to explain the key features of their treatment assignment. 60% of teachers could identify the key features of their raise treatment. Finally, most teachers stated that they came to fully understand what was expected of them in their given treatment within four months of the beginning of the information campaign.

Results
We now present the main reduced form results of the paper. First, we test the effects of each incentive on student test performance and socio-emotional development. Then, we show the effects of the incentives on teacher effort, which helps us to understand the student effects.

Specification
Our main specification is:

Y_i1 = β_1 SubjectiveTreatment_s + β_2 ObjectiveTreatment_s + δ Y_i0 + χ_j + ε_is

The main dependent variable of interest is the student outcome, Y_i1, for child i at endline, t = 1.
Student outcomes include test scores in Math, Science, English, and Urdu, and socio-emotional development. SubjectiveTreatment_s and ObjectiveTreatment_s are dummies for whether the student's school, s, was assigned to subjective or objective performance raises. The left-out group is the control group (flat raise). The coefficients of interest are β_1 and β_2, along with the test of their equality.
For test scores, we control for student's baseline score, Y i0 , to improve efficiency as there is high auto-correlation in test scores. 12 We also control for strata fixed effects, subject and grade, χ j .
Standard errors are clustered at the school level (the unit of randomization), and both standard and randomization inference p-values are provided in each table.
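As a concrete illustration of this specification, the sketch below simulates student-level data and recovers the two treatment coefficients by OLS. All parameters (sample size, error structure, the 0.086 sd and 0.092 sd effects) are illustrative stand-ins, and strata fixed effects and school-level clustering are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30_000  # illustrative number of student-subject observations

# Mutually exclusive treatment arms (control = flat raise is the omitted group)
subj = rng.integers(0, 2, n)
obj = np.where(subj == 0, rng.integers(0, 2, n), 0)

y0 = rng.normal(0, 1, n)                 # baseline test score Y_i0
beta1, beta2, gamma = 0.086, 0.092, 0.6  # illustrative "true" parameters
y1 = beta1 * subj + beta2 * obj + gamma * y0 + rng.normal(0, 1, n)

# OLS of Y_i1 on treatment dummies and the baseline score
X = np.column_stack([np.ones(n), subj, obj, y0])
coef, *_ = np.linalg.lstsq(X, y1, rcond=None)
b1_hat, b2_hat = coef[1], coef[2]
```

With real data one would additionally include the strata fixed effects and cluster standard errors at the school level, e.g. via a regression package rather than a raw least-squares solve.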

Results
Test Scores We find that both subjective and objective performance incentives have similar effects on test scores, of about 0.09 sd. Figure 2 and table 3 present the results of each performance incentive on endline test scores. Column (1) shows results for all tests and question items. Effects are similar between the subjective and objective incentives, at 0.086 sd and 0.092 sd, respectively. In the row titled "F-test p-value (subj=obj)", we present a test of the equality β_1 = β_2. We cannot reject equality of effects between the two treatments on test scores. All results are unchanged whether we consider standard p-values (in parentheses) or randomization inference p-values (in brackets).
Columns (2) and (3) provide tests of the treatment effect by question item type, to understand whether these effects are due to memorization of class content or actual learning. Column (2) only includes questions from the prior grade's content, and column (3) only includes questions that were added by the researchers from external standardized test sources including PISA, TIMSS, PERL and LEAPS. 13,14 Both sets of questions provide a useful test because it would not be possible for students to have memorized the answers to these questions. Remedial content (from previous grade levels) and external content are never tested on the school system's standardized exam, so teachers would not have prepared specifically for this material. Given that we find similar, if not larger, effects on these types of questions, it appears that treatment effects reflect actual learning as opposed to memorized curriculum. Again, we do not see a significant difference between the subjective and objective treatments.
Columns (4) and (5) present the results by subject, splitting math and science exams from the two reading exams (English and Urdu). Magnitudes are similar, around 0.09 sd, for both groups of subjects, though we are less powered to detect effects in the smaller samples when we split by subject.
Again, we cannot reject equality between the two treatments and the magnitude of the effects is highly similar.

Socio-Emotional Development
While the effects on test scores were similar between both treatments, the effects on socio-emotional development paint a different picture. Figure 3 and the corresponding table present the effects of each incentive on socio-emotional development. The negative effect of objective incentives relative to subjective is coming from a differential effect on "love of learning" and whether students like their school or would like to change schools. We can reject equality of the two treatments on these sub-areas at the 10% and 1% levels, respectively. This suggests that while objective incentives led to an increase in test scores, it came at the cost of enjoying school, whereas subjective incentives accomplished the same learning gains without these negative consequences. On three other areas, ethical behavior, being a global citizen and inquisitiveness, we cannot reject the equality of the two treatments.

Specification
To understand why we see similar results on test scores but different effects on students' socio-emotional development, we need to understand teachers' behavioral response. To do this, we look at the effect of each treatment on classroom observation ratings and time use. We have a similar main specification, this time at the teacher level:

Y_i = β_1 SubjectiveTreatment_s + β_2 ObjectiveTreatment_s + χ_j + ε_i

The main dependent variable of interest is the outcome, Y_i, for teacher i. Teacher outcomes include classroom observation scores and time use. We again control for grade and strata fixed effects, χ_j, and standard errors are clustered at the school level (the unit of randomization). 16

Results
Classroom Observations The effect of each incentive on classroom behavior sheds light on the student effects we see. Overall, we find teachers under objective incentives using teaching strategies which provide the largest marginal return on test scores but may hamper other areas of students' human capital development. Teachers in the subjective treatment, however, do not exhibit any of these distortionary teaching strategies. Figure 4 and table 5 present the effects of each incentive on teachers' overall classroom observation score, using the CLASS rubric. On average, objective teachers exhibit worse teaching pedagogy:
they score 0.07 points lower on the 7-point CLASS rubric scale. Subjective teachers show no noticeable change in pedagogical quality, and we can reject the equality of the two treatments at the 10% level.
We then break down the 12 CLASS dimensions of pedagogy into three main areas, "class climate", "differentiation", and how "student-centered" the lesson is. "Class climate" captures whether the atmosphere of the classroom is positive, supportive and joyful or negative, punitive and dull.
"Differentiation" captures whether the lesson is structured in a way that meets students at different proficiency levels and/or with different learning styles. Finally, "student-centered" measures how much of the lesson is teacher-directed versus student-involved. Teachers under the objective incentive contract have a more negative class climate and less student-centered lessons; both see a decrease of around 0.1 points. We can reject equality of treatments at the 10% level. There is also an increase in the level of differentiation in both the subjective and objective treatment schools.
We also measure the amount of class time devoted to test preparation activity. This includes practice tests, testing strategies (such as how to approach a multiple-choice test), or lecturing about the importance of doing well on tests. We find a large increase in the time spent on these activities in objective treatment schools. Relative to a control group mean of 0.14 min out of the 20-minute observation spent on test preparation activities, objective classes see a 5-fold increase, with a total of 0.76 minutes spent on these activities. We can reject equality of treatments at the 5% level along this dimension.
Together with the student outcomes, these classroom observations paint a picture of objective schools as ones that were able to achieve test score gains by taking the path of least resistance for teachers: doing more test preparation and maintaining a stricter, less student-centered classroom.
This then results in other negative outcomes for students' human capital development, such as love of learning. Subjective classrooms, on the other hand, are able to accomplish the same academic gains without any negative effects on teacher practices or student socio-emotional development. This suggests that managers are able to prevent these distortionary behaviors, solving, at least to some extent, the multi-tasking problem.
One concern with classroom observation data is that teachers may worry the videos of their classrooms will be provided to their manager, and for subjective teachers this carries greater consequences than for the other treatment arms. We do several things to help alleviate this concern. First, in the consent form and during the camera set-up, we communicate to teachers that the videos are confidential and will only be reviewed by the research team. We also let them know that only data aggregated at the school level will be provided to the school system head office. Second, visits were a surprise within a two-month window, so teachers could not adjust their lessons beforehand.
Third, we recorded several hours back to back for each teacher. We find teachers are most aware of (and responsive to) the camera in the first hour of taping. Removing that data and repeating the analysis yields very similar results.
Attendance and Time at Work We find that the subjective treatment results in a significant increase in the number of days a teacher is present at work relative to no incentives. Table 6 presents the results of the biometric clock in/out data. Relative to a control group mean of 145 days, subjective teachers are present an additional 6 days. We do not find an effect on hours spent at work for either treatment relative to the control. We cannot reject equality of treatments in either outcome. Columns (2) and (4) restrict to a sample of teachers who were present in the school system both terms and did not take any long leaves (health, maternity, etc.) to ensure the days present result is not driven by these effects. Results are robust to this sample restriction.

How do Managers Implement the Subjective Incentive?
In the objective treatment schools there is less scope for heterogeneity. The implementation of the contract and employee's response is likely to be similar across schools and comparable to other experiments which used test score-based performance pay. However, the subjective treatment arm could vary substantially across schools and firms depending on the type of oversight managers have of employees, the oversight firms have on managers and how managers themselves are incentivized.
In this section, we unpack what types of teacher actions managers value, the extent to which managers are biased or show favoritism, and heterogeneity in treatment effects by manager quality.
To understand how managers use the subjective treatment arm, we draw on data from the endline teacher and manager surveys and managers' evaluation scores of their teachers.
What do managers value in rating teachers? We use three approaches to help understand what types of teacher actions managers reward. In an ideal setting, we would randomize teacher actions to see how this affects managers' performance ratings of teachers. We are unable to do that exact exercise here. However, using a combination of detailed data and survey vignettes, we can accomplish something similar. Combined, these three sources of evidence suggest that managers highly value teacher actions related to human capital development and are not just focused on administrative tasks or actions unrelated to student development.
Our first piece of evidence on what managers value in teachers comes from endline survey data from both teachers and managers. We asked both teachers and managers to respond to a hypothetical situation, in which a teacher asks them for advice about how to achieve a higher raise in the following year. They are then asked to rate how much time the teacher should spend on different types of actions. Table A3 presents the data from this survey question. Column 2 shows teachers' responses about which actions would be most highly valued under the subjective contract.
Column 3 presents responses to the same question posed to managers. Both subjective teachers and managers agree that improved pedagogy, like making lessons student-centered and tailoring lessons to students at different initial levels, would increase the subjective rating. However, managers put additional weight on spending time collaborating with other teachers. Neither subjective teachers nor principals believe more superficial administrative tasks, like volunteering at after-school events or meeting with parents, are important drivers of the subjective performance rating.
Our second piece of evidence also comes from the endline survey. We provide a vignette describing a hypothetical teacher to managers, and we ask them to provide a performance rating of the hypothetical teacher. The vignette randomizes the hypothetical teacher's name and rank in terms of value added, classroom behavioral management and attendance. 17 Table A4 presents the results of this exercise.

Favoritism and bias
A primary concern about subjective performance pay is whether managers are biased against certain employees or show favoritism toward preferred individuals. To assess whether this is a significant concern in this setting, we ask teachers at endline whether they felt their manager discriminated against certain groups or played favorites toward certain colleagues. 19 Table A6 presents the results from these survey questions. On average, teachers in the subjective treatment arm are no more likely than teachers in the objective treatment arm to say that the contract unfairly favors certain teachers or that certain groups are discriminated against under this contract. Teachers also state that bias, gaming and favoritism are not a significant concern under either contract.

17 The vignette describes a hypothetical teacher whose students' test score growth is in the [bottom/middle/top] 10%, who is in the [bottom/middle/top] 10% of teachers in terms of behavioral management, and is in the [bottom/middle/top] 10% in terms of attendance and timeliness at work. Managers rated three such vignettes with characteristics randomized across vignettes. 18 There is a negative relationship between subjective rating and hours spent at school. This relationship may be driven by the fact that certain grades and teaching positions have different requirements about the length of the workday, so this could be picking up that variation rather than teacher effort. 19 One concern with this approach is that teachers may be hesitant to provide honest assessments in a survey. To help minimize this concern, teachers' responses are anonymized and we communicate this to teachers at the time of consenting to the survey. We also ask the question several ways, including asking teachers to report such behavior about other schools or about the school system in general. This question phrasing allows teachers to report problematic manager behavior while providing plausible deniability for their own manager.
Though teachers do not say that overt bias is a significant concern, we may worry that more subtle types of bias are at play. The primary type of bias we were concerned about in this setting is gender bias. In Pakistan, gender bias in employment is rampant (World Bank Group, 2018), and managers are more likely to be male than the employees they oversee. As part of the vignette survey questions, we include a way to test for subtle gender bias: in the vignettes, we randomize the hypothetical teacher's name to be a traditionally male or female Pakistani name. Table A4, column 3, presents the results of this test. We do not find that managers rate vignettes with female names lower.
Both of these pieces of evidence suggest that favoritism and bias are not a substantial concern within the subjective treatment arm. Neither result perfectly measures whether any favoritism or bias occurred, but combined they provide suggestive evidence that favoritism and bias are not a first-order concern under this contract.
Heterogeneity in treatment effects by manager characteristics On average the subjective treatment arm appears to have been successful at improving student outcomes and teacher effort, but there may be heterogeneity in how successfully managers implement the contract. We test for heterogeneity in treatment effects along several dimensions. First, table A7 presents heterogeneity in the subjective treatment arm by three manager characteristics: gender, age, and experience. We do not find significant differences in the effectiveness of the subjective treatment by these manager characteristics.
Second, table A7 presents heterogeneity in treatment effects by several dimensions of manager "quality". We find that subjective performance pay is significantly less effective in schools where teachers believe their managers do not have an accurate perception of teacher effort. We measure this by asking teachers to rate how accurately their manager would rate a fellow teacher. 20 We find no effect of subjective performance pay on student test scores for managers in the top quintile of this inaccuracy measure. We do not find heterogeneity in treatment effects by World Management Survey overall manager score (shown in Table A7, column (5)) or personnel management sub-score (column (6)). However, as discussed in section 2.2, because these data were collected from manager self-reports, we should be cautious about the interpretation, as managers may overrate themselves on these survey questions. This suggests that while subjective performance pay is on average very successful at producing learning gains, these contracts may be ineffective in settings where employees do not trust their managers to implement them accurately.

Theoretical Framework
The experimental design is motivated by a model of moral hazard with multi-tasking, as presented in Baker (2002). This theoretical framework helps us rationalize the teacher behaviors and student outcomes we see as a result of each performance incentive. In this section, we summarize the framework and key prediction, demonstrate how this translates to the teaching context, and map out how the experimental design connects to the model.

Moral Hazard with Multi-tasking
The firm, a school, produces a single outcome, human capital, H(a, e), through a simple linear production function:

H(a, e) = f · a + e

Human capital is a function of an n-dimensional vector of actions teachers can take, a, and the n-dimensional vector of marginal products of those actions, f. Human capital is also a function of many other things outside the teacher's action set (environment, parental support, peers, etc.), which are captured by the noise term, e, which is mean zero with variance σ²_e. Schools cannot perfectly observe all components of a, but they can observe some features of human capital (for example, test scores) and some actions (for example, teacher attendance). Schools construct a performance contract that pays teachers based on a performance measure, P(a, φ), which could be a combination of observable outputs (test scores, student attendance, etc.) and/or actions (teacher attendance, lesson plans, etc.). Teachers' performance measure, and therefore their pay, then is:

P(a, φ) = g · a + φ     (4)

The performance measure, P(a, φ), is a function of teachers' actions, a, and the marginal return of those actions on the performance measure, g. In effect, g translates to a piece rate for each action.
φ captures everything outside the teacher's actions that affects the performance measure. It is mean zero and has variance σ²_φ. Two types of noise are captured by φ. First is noise coming from features of the performance measure which are outside the teacher's control. For example, if the performance measure is students' test scores, this could be the students' home environment. Second is noise coming from mis-measurement of a given action, a_n. For example, if the performance measure is teacher attendance, but principals have error-ridden records of attendance, then this contributes to the noisiness of the performance measure.
Teachers' utility is a function of their pay and a quadratic cost of effort: 21

U = P(a, φ) − (1/2) Σ_n a_n²     (5)

Teachers choose the optimal set of actions to maximize their utility. Taking the derivative of Eq. 5 with respect to each action, the optimal decision is to set each action equal to its piece rate: a*_1 = g_1, a*_2 = g_2, ..., a*_n = g_n. Given teachers' optimal action set, the average human capital produced by each teacher is:

E[H] = f · a* = f · g = ‖f‖ ‖g‖ cos(θ)     (6)

Average human capital then is a function of the length of the marginal-product vector, ‖f‖, the length of the piece-rate vector, ‖g‖, and the alignment between these two vectors, cos(θ). In other words, human capital is increasing in the steepness of the incentives and in how aligned those piece rates are with the human capital production function.
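The optimality condition and the vector identity behind Eq. 6 can be checked numerically. In the sketch below, f and g are arbitrary invented vectors; the identity f · g = ‖f‖‖g‖cos(θ) holds for any choice.

```python
import numpy as np

# Hypothetical marginal products (f) and piece rates (g) for three actions
f = np.array([1.0, 0.8, 0.2])
g = np.array([0.9, 0.1, 0.6])

a_star = g.copy()  # teachers set each action equal to its piece rate, a*_n = g_n

# Average human capital two ways: E[H] = f . a* and ||f|| ||g|| cos(theta)
eh_direct = f @ a_star
cos_theta = (f @ g) / (np.linalg.norm(f) * np.linalg.norm(g))
eh_vector = np.linalg.norm(f) * np.linalg.norm(g) * cos_theta
```

With these invented vectors cos(θ) < 1, so some effort flows to actions with low marginal product, which is exactly the distortion term 1 − cos(θ) in the model's predictions.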
We now go beyond Baker (2002) by making one additional assumption relevant in our context.
We can further re-arrange the expression to show the effect that noise in the performance measure has on average human capital. Taking the variance of Eq. 4, we have var(P) = ‖g‖² var(a) + σ²_φ. Re-arranging and substituting the implied ‖g‖ into Eq. 6, average human capital then is:

E[H] = ‖f‖ cos(θ) √((var(P) − σ²_φ) / var(a))

Here ‖f‖ and var(a) are constant across the two types of performance measures, subjective and objective, that we compare. In addition, due to the design of our subjective and objective incentives, var(P) is also constant across the two schemes. 22

Theoretical Predictions
We are then left with two components of the performance measure that affect average human capital. The key predictions of the model are that average human capital produced by the school is:
• decreasing in performance measure distortion, 1 − cos(θ)
• decreasing in performance measure noise, σ²_φ

Distortion Distortion captures the correlation between the piece rates for different actions and the marginal return to human capital of those actions. In essence, do we pay teachers more for the actions which are more related to developing human capital? The more distorted a contract is, the more employees focus on actions that are less helpful toward firm outcomes.
Noise Noise captures how much of the performance incentive is unrelated to the employee's actions. This could take the form of other factors outside the employee's control affecting the performance measure (school resources, shocks, etc.) or mis-measurement of employee actions, if the contract attempts to measure teacher actions. It is important to flag that, traditionally, noise enters optimal contract design by reducing risk-averse employees' utility, which requires firms to raise the fixed part of employees' salary to meet their participation constraint. Here we are not focused on that consequence of noise, as we do not study employee entry or exit in this paper. 23
The effect of noise we focus on here is equivalent to a decrease in the incentive scheme's average piece rate. Since σ²_φ = var(P) − ‖g‖² var(a), and var(P) and var(a) are constant given the tournament nature of each incentive scheme, increasing σ²_φ directly decreases ‖g‖. Increasing noise therefore reduces the extent of the effort response, a*. This effect of noise exists in any incentive scheme with a fixed variance, which includes all tournament or threshold-type incentives.
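A small numeric sketch of this crowding-out: holding var(P) fixed, as the tournament design does, raising σ²_φ mechanically lowers the implied ‖g‖ and hence average human capital. All parameter values below are invented for illustration.

```python
import numpy as np

VAR_P = 1.0      # variance of the performance measure, fixed by the tournament
VAR_A = 0.5      # variance of actions, held constant across schemes
F_NORM = 1.0     # ||f||, length of the marginal-product vector
COS_THETA = 0.9  # alignment between piece rates and marginal products

def avg_human_capital(noise_var):
    # ||g|| implied by var(P) = ||g||^2 var(a) + sigma^2_phi
    g_norm = np.sqrt((VAR_P - noise_var) / VAR_A)
    return F_NORM * g_norm * COS_THETA  # E[H] = ||f|| ||g|| cos(theta)

eh_low_noise = avg_human_capital(0.1)
eh_high_noise = avg_human_capital(0.5)
```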

Understanding the Experiment within the Theoretical Framework
An important distinction between our experimental performance pay system and this simple model is that our treatments are performance pay tournaments, where teachers are ranked relative to other teachers at their school and are allocated to one of five performance categories based on their relative performance. As a result, this framework is a simplification of the teacher's problem in our experiment. However, the key predictions discussed above also hold in the more complicated tournament model, so we think it is illustrative of the mechanisms we will discuss in the next section.
The theoretical framework highlights the key features of an incentive scheme that should affect how teachers respond and, as a result, the impact on human capital. Ex ante, it is not clear whether subjective or objective incentives would be more or less distorted in the teaching context.
On the one hand, subjective incentives may solve the multi-tasking problem by prioritizing more than just measurable student learning. One of the key critiques of objective incentives is that teachers may focus on actions which enhance test scores (such as test-prep skills, memorization, etc.) but have small or zero effects on human capital (Muralidharan and Sundararaman, 2011). Subjective performance incentives would ideally penalize these types of behaviors by teachers, in favor of more well-rounded teaching. On the other hand, it could be that managers prioritize the wrong actions: because they do not know what the human capital production function is, because they value only certain aspects of human capital and not others, or, most nefariously, because they weight actions which make their own job easier.

23 A companion paper (Brown and Andrabi, 2020) studies employee sorting in response to these contracts.
It is also uncertain whether subjective or objective incentives would be less noisy. Test scores are notoriously noisy measures of teacher effort (Chetty et al., 2014). One of the most common complaints teachers have against test score-based incentives is that they are mostly unrelated to teacher actions (Podgursky and Springer, 2007). Subjective performance pay could be less noisy than objective performance pay because managers could focus on rewarding actions rather than outcomes. However, this requires managers to observe effort accurately. Subjectivity could even introduce additional noise, though, if managers introduce bias or favoritism into their evaluations.
Our experiment connects to the model in two ways. First, in section 5.3, we explicitly test the two predictions of the model using exogenous variation, within one of the treatment arms, in the level of noise and distortion. We then see the effect of these mechanisms on firm outcomes. Second, in sections 5.2 and 5.4, we show that the difference in reduced form effects of each contract can be explained through differences in noise and distortion across the two contracts.

Mechanisms
How can we square very different effort responses with similar test score effects and different socio-emotional effects across subjective and objective incentives? We argue that differences in the levels of noise and distortion across the two treatments help explain these outcomes. We structure our argument as follows.
First, in section 5.1, we present the similarities between the two treatments to help eliminate possible channels that could drive the difference in treatment effects. Second, in section 5.2, we highlight the differences between the systems: teachers believe subjective incentives to be less noisy and less distorted. Third, we provide evidence that noise and distortion do, in fact, affect outcomes. Section 5.3 shows that noise and distortion are related to student outcomes as predicted by the theoretical framework: more noise reduces the effect of incentives, and more distortion diverts employee effort toward the incentivized actions. We conduct these tests by exploiting heterogeneity in levels of noise and distortion within a given treatment to isolate the effect of noise or distortion on outcomes.
Finally, in section 5.4, we bring together the estimates from sections 5.2 and 5.3 to understand how much of the difference in the reduced form student effects can be explained by differences in noise and distortion.

Similarities between Treatments
In order to isolate the effect of the performance measure (percentile value-added versus manager rating), we hold a number of features constant between the two treatments. Both treatments are within-school tournaments. Both provide a raise of 0-10% with the same set of rank thresholds corresponding to raise amounts within that range. Both were introduced in schools at the same time and had similar performance review timing: managers completed midterm feedback in June 2018 and final ratings in December 2018, and the objective score was based on the average of tests in June 2018 and January 2019.
At endline, we survey teachers about their experience with their incentive scheme. We find no difference in teachers' reported experience along a number of dimensions. There is no difference in their responses to the following survey questions: (i) when teachers said they understood what was expected of them, (ii) awareness of the contract's main features, (iii) how frequently they thought about their contract, and (iv) whether the system unfairly favors certain types of teachers (age, gender, etc.). Figure 5 and table A6 provide results for each of these survey questions, showing no statistical difference between teachers' responses by treatment.

Differences Across Treatments: Noise and Distortion
From this section through section 5.4, we focus on two of the remaining differences between the treatments: noise and distortion. As highlighted in the theoretical framework, noise captures the extent to which a teacher's actions affect their incentive payment. Distortion captures the extent to which the actions with the largest marginal return to human capital are also the actions with the highest effective piece rate under the given performance measure. First, we show that the levels of noise and distortion differ across the treatments.
Noise We measure noise using teachers' perceptions of the noisiness of their incentive treatment. 24 To measure perceived noise, we ask teachers to agree or disagree (on a 5-point scale) whether, under their contract, "their raise is out of their control", "those who work harder, earn more" and whether "I feel motivated to work harder". Figure 6 presents the average response to each question, with 1 being strongly disagree and 5 being strongly agree. We see that teachers in the subjective treatment feel their raise is more in their control, feel hard work is rewarded, and feel more motivated. The average difference is 0.14 sd across the three areas, and we can reject equality of treatments for all three questions at the 5% level.
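The paper does not state exactly how the three Likert items are combined into sd units; the sketch below assumes a common construction in which each item is z-scored and the items are averaged into a single index. The response data here are randomly generated placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
n_teachers = 1_000  # illustrative

# Placeholder 5-point Likert responses to the three noise questions
items = rng.integers(1, 6, size=(n_teachers, 3)).astype(float)

# z-score each item, then average across items into a single index
z = (items - items.mean(axis=0)) / items.std(axis=0)
noise_index = z.mean(axis=1)  # mean zero, measured in sd units by construction
```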
Distortion We measure distortion using endline survey data from teachers. We ask teachers to imagine a teacher who really wants to receive a higher raise at the end of the year and commits to working ten additional hours a week to increase their raise. We then ask teachers how much of those ten hours the teacher should allocate to different activities, such as collaborating with other teachers, incorporating higher-order thinking skills into lessons, preparing practice tests, helping with extracurricular activities, etc. We then group these 17 different actions into four categories: administrative tasks (grading, helping with extracurriculars, monitoring duty), professional development (collaboration, training, improved English skills and content knowledge), pedagogy (use of student-centered and differentiated lessons) and test preparation (achieving certain grade targets).
We find that teachers in subjective versus objective schools perceive slight differences in which actions should be prioritized in order to increase their raise. Figure 7 and table 7 present the differences in stated valuation of each area. Overall, teachers think those under the subjective contract should prioritize administrative tasks more and test preparation slightly less.
We will show in the next section that these actions have different implications for student outcomes.

Effect of Noise and Distortion on Outcomes
Noise We showed that teachers believe there is less noise in the subjective performance measure. However, we do not know whether noise actually reduces the effectiveness of the incentive scheme. We showed theoretically that, with a fixed-variance incentive scheme, a noisier incentive scheme produces lower-powered incentives, but there is limited empirical evidence on this effect.
To test whether noise affects outcomes, we exploit heterogeneity in noisiness within the subjective treatment. Managers vary in the accuracy with which they assess teacher effort. Some managers observe lessons for each of their teachers every week; others sit down and review paper lesson plans; and some are more hands-off. To measure whether a manager has an accurate perception of what their teachers do, we ask teachers to answer the following question about three fellow teachers in their school: "The appraisal score their manager would give them is... [Too high/low by more than one raise category], [Too high/low by about one raise category], [Too high/low by less than one raise category], or [Accurate]". We then construct an average of these ratings per manager, capturing average perceived inaccuracy. On average, teachers believe their managers over- or under-rate their fellow teachers by 0.8 of an appraisal step (out of the five-step system shown in section 2.1). However, there is considerable heterogeneity: the most inaccurate quintile of managers are perceived to rate other teachers incorrectly by more than two steps. More inaccurate managers may differ from their fellow managers in many ways (experience, age, school environment). However, manager accuracy should only affect the perceived noisiness of the incentive scheme in subjective treatment schools. In control or objective treatment schools, managers still rate their teachers but have no control over the incentive raise. Therefore, we use ManagerAccuracy × SubjectiveTreatment as the instrument for Noise, controlling for ManagerAccuracy and SubjectiveTreatment.
We find that ManagerRatingInaccuracy_j significantly predicts teachers' ratings of the noisiness of their appraisal system in subjective but not objective/control schools, as we would expect. A 1 sd increase in manager inaccuracy increases beliefs about the noisiness of the contract by 0.1-0.4 sd in subjective schools. Table 8 presents the results from the first stage for data at the teacher and student level. Columns (2) and (4) add additional controls, including teachers' beliefs about the preference for different actions ("distortion") and teachers' beliefs about other non-noise features of the contract (timing, understanding, etc.). The coefficient on ManagerAccuracy × SubjectiveTreatment is very robust to the inclusion of these controls, suggesting that this instrument is picking up differences in noise and not other features of the contract environment.
To test for the effect of noise on teacher and student outcomes, we use the following two-stage least squares specification:

Y_ij = α_0 + α_1 SubjectiveTreat_i + α_2 ManagerRatingInaccuracy_j + α_3 Noise_ij + χ_ij + ε_ij

where α_3 is the coefficient of interest, Noise is instrumented using ManagerRatingInaccuracy_j × SubjectiveTreat_i, and χ_ij are controls, such as school, grade, and baseline controls when available for a given outcome.
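As a sketch of the estimator, the two stages can be run by hand on simulated data (the variable names and coefficients below are invented for illustration, not the paper's estimates):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated stand-ins (our notation, not the paper's dataset).
subj = rng.integers(0, 2, n)                 # SubjectiveTreat_i
inacc = rng.normal(0, 1, n)                  # ManagerRatingInaccuracy_j
z = inacc * subj                             # the instrument
noise = 0.8 * z + rng.normal(0, 1, n)        # first stage: perceived noise
y = -0.2 * noise + 0.1 * subj + rng.normal(0, 1, n)  # outcome; true alpha_3 = -0.2

def ols(X, y):
    """Least-squares coefficient vector."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Controls enter both stages: constant, SubjectiveTreat, ManagerInaccuracy.
W = np.column_stack([np.ones(n), subj, inacc])

# Stage 1: fitted values of Noise from the instrument plus controls.
X1 = np.column_stack([z, W])
noise_hat = X1 @ ols(X1, noise)

# Stage 2: the coefficient on fitted Noise is the 2SLS estimate of alpha_3.
alpha3 = ols(np.column_stack([noise_hat, W]), y)[0]
print(round(alpha3, 2))  # close to the simulated truth of -0.2
```

Note that manually chaining two OLS stages recovers the point estimate but not correct standard errors; in practice one would use a 2SLS routine with school-clustered errors, as the paper does.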
We find that noise significantly reduces the effectiveness of performance incentives (table 9). A 1 sd increase in noisiness of the incentive scheme reduces teachers' hours worked by 13.2 hours per week and reduces test scores by 0.175 sd. We do not find an effect of noise on socio-emotional scores.
Because our effective first stage has an F-statistic below 10, we present Anderson-Rubin (AR) test p-values, which are our preferred test given that they are robust to weak instruments in the just-identified case.
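For intuition, the AR test in the just-identified case checks whether the quasi-residual y − β0·Noise is related to the instrument: under H0: α_3 = β0, it should not be. The sketch below assumes homoskedastic, non-clustered errors, unlike the paper's school-clustered inference, and uses invented simulated data:

```python
import numpy as np
from scipy import stats

def ar_pvalue(y, endog, z, W, beta0):
    """Anderson-Rubin test of H0: coefficient on `endog` equals beta0,
    just-identified case, homoskedastic errors (a simplification)."""
    resid_y = y - beta0 * endog          # quasi-residual under H0
    X = np.column_stack([z, W])          # instrument plus controls
    coef = np.linalg.lstsq(X, resid_y, rcond=None)[0]
    e = resid_y - X @ coef
    n, k = X.shape
    sigma2 = e @ e / (n - k)
    se_z = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[0, 0])
    t = coef[0] / se_z                   # t-stat on the instrument
    return 2 * stats.t.sf(abs(t), n - k)

# Simulated check with true alpha_3 = -0.2 (invented numbers).
rng = np.random.default_rng(2)
n = 500
subj = rng.integers(0, 2, n)
inacc = rng.normal(0, 1, n)
z = inacc * subj
W = np.column_stack([np.ones(n), subj, inacc])
noise = 0.8 * z + rng.normal(0, 1, n)
y = -0.2 * noise + rng.normal(0, 1, n)
p_true = ar_pvalue(y, noise, z, W, -0.2)   # H0 at the simulated truth
p_false = ar_pvalue(y, noise, z, W, 1.0)   # H0 far from the truth
print(round(p_true, 3), round(p_false, 5))
```

Because the test never uses the (possibly weakly identified) 2SLS point estimate, its size is correct even when the first stage is weak.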
Distortion Distortion is a measure of how correlated the marginal returns to human capital from different actions are with the effective piece rates on those actions. To measure distortion, we therefore need an estimate of the marginal returns to different actions. To obtain one, we again exploit heterogeneity in managers' preferences across actions, combined with the subjective treatment. The idea behind this strategy is that managers have different preferences over actions: some state they want teachers to focus more on improving their lesson plans, others want teachers to help more with administrative tasks, and so on. We interact those preferences with subjective treatment status (versus objective and control) to estimate the effect of manager preferences toward certain actions on student outcomes.
Y_i = α + Σ_j β_j (Points on Action_ji × SubjectiveTreat_i) + Σ_j δ_j Points on Action_ji + χ_i + ε_i (9)

Here the coefficient of interest is β_j, which gives the effect of manager preference toward certain types of tasks on student outcomes. Actions are grouped into four categories: admin (grading, helping with extracurriculars, monitoring duty), professional development (collaboration, training, improved English), pedagogy (use of student-centered and differentiated lessons), and test prep (achieving certain grade targets). We also add additional controls to capture other features of the contract environment, such as noisiness, understanding of the contract, etc.
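A sketch of this interaction specification on simulated data (the category names come from the text; the "true" coefficients and data are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
categories = ["admin", "prof_dev", "pedagogy", "test_prep"]

subj = rng.integers(0, 2, n)                  # subjective treatment dummy
points = rng.dirichlet(np.ones(4), n)         # manager's points on each action

# Invented "true" returns: under the subjective scheme, points placed on
# pedagogy raise the outcome the most (illustration only).
beta_true = np.array([0.0, 0.1, 0.3, 0.05])
y = (points * subj[:, None]) @ beta_true + rng.normal(0, 0.3, n)

# Design matrix: interactions Points*Subjective (coefs = beta_j) plus the
# Points main effects (coefs = delta_j). No constant: points sum to one.
X = np.column_stack([points * subj[:, None], points])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
beta_hat = dict(zip(categories, coef[:4]))
print({k: round(v, 2) for k, v in beta_hat.items()})
```

The main effects δ_j absorb level differences across managers with different preferences, so the β_j are identified off the comparison between subjective and non-subjective schools with similar manager preferences.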
We find that several of the action categories are related to student outcomes (columns (2) and (4)).

Contribution of Noise and Distortion to Reduced Form Effects
Finally, we can pull the results together to understand the extent to which noise and distortion can explain the reduced form results we saw in section 3.1. To do this we decompose the total reduced form effect into a noise component, a distortion component, and an unexplained component:

ReducedForm = (∂Y/∂Noise)·dNoise + (∂Y/∂Distortion)·dDistortion + Unexplained

The overall effect of subjective relative to objective on test scores was close to zero (-0.006 sd, from table 3). The effect of noise on test scores is -0.17 (table 9) and there is 0.14 sd less noise in the subjective arm than the objective arm (figure 6). For the distortion component, we repeat the same approach for each of the four action categories (admin, professional development, pedagogy and test prep): we take the difference between subjective and objective for each area (table 7), multiply each category by the return to preference for that action on test scores (table 10), and sum. In total, (∂TestScore/∂Distortion)·dDistortion is then -0.03; subjective schools put slightly less focus on test scores. Combined, the positive effect of subjective having less noise and the negative effect of placing less focus on test scores almost cancel each other out. Overall, the remaining unexplained portion is just 0.0002 sd, suggesting noise and distortion are effective at explaining the student results.
We can repeat the same approach for socio-emotional skills.
The overall effect of subjective relative to objective on socio-emotional development was 0.0433 sd (table 4). The effect of noise on socio-emotional skills is -0.06, and there is 0.14 sd less noise in the subjective arm than the objective arm. The subjective teachers focus more on tasks which are related to socio-emotional skills; overall, (∂SocioEmotional/∂Distortion)·dDistortion is 0.011 sd. The remaining unexplained portion is 0.024 sd, or about half of the difference between the subjective and objective treatments. This is perhaps unsurprising given the results throughout this section: noise and distortion were much less related to socio-emotional skills than to test scores. This could be because there is in fact a weak relationship between them. Alternatively, we may not be as successful at measuring socio-emotional skills, and we certainly have a harder time capturing which aspects of teachers' behavior are related to developing these skills. Better measurement along these dimensions is an important area for future work.
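The arithmetic of the decomposition for both outcomes, using the point estimates quoted above:

```python
# ReducedForm = (dY/dNoise)*dNoise + (dY/dDistortion)*dDistortion + Unexplained,
# rearranged to solve for the unexplained component.
def unexplained(total, noise_effect, d_noise, distortion_component):
    return total - noise_effect * d_noise - distortion_component

# Test scores: total -0.006 sd; effect of noise -0.17; the subjective arm
# is 0.14 sd less noisy (d_noise = -0.14); distortion component -0.03.
u_test = unexplained(-0.006, -0.17, -0.14, -0.03)

# Socio-emotional: total 0.0433 sd; effect of noise -0.06; same d_noise;
# distortion component +0.011.
u_sel = unexplained(0.0433, -0.06, -0.14, 0.011)

print(round(u_test, 4), round(u_sel, 4))  # 0.0002 and roughly 0.024
```

For test scores the noise component (-0.17 × -0.14 ≈ +0.024) and the distortion component (-0.03) nearly cancel, leaving almost nothing unexplained; for socio-emotional skills about half the gap remains unexplained.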

Conclusion
In this paper, we provide evidence on the effect of subjective versus objective incentives for teachers. We find that both subjective and objective incentives increase test scores, but objective incentives result in negative effects on socio-emotional development. These student outcomes make sense given the teacher behaviors we see under each incentive. In subjective treatment schools, teachers make small improvements in pedagogy and are involved in more professional development.
In objective treatment schools, teachers distort effort toward test preparation. They spend much more time on practice tests and test strategies and use more punitive discipline. While there is heterogeneity in manager application of the subjective treatment arm, we do not find evidence of widespread favoritism or bias.
We then try to understand the mechanisms underlying the reduced form effects. We show evidence that the two incentive schemes are similar along most dimensions except for two areas: noise and distortion. We show teachers believe that the subjective incentive is less noisy and that it prioritizes both test and non-test student outcomes. Using heterogeneity within treatments we attempt to isolate the effect of noise and distortion itself on student outcomes. Finally, we show that noise and distortion are able to explain a large portion of the reduced form test score effects but a smaller fraction of the reduced form socio-emotional skill effect.

Notes: This figure presents the experimental timeline. It includes data collection activities and treatment implementation activities.

Notes: This figure presents the effects of each performance incentive treatment on student endline test scores relative to the control group.
• The blue bars present the coefficient of the effect of the objective treatment relative to the control (flat raises). The red bars present the coefficient of the effect of the subjective treatment relative to the control.
• The observation is at the student-subject level. The y-axis presents the coefficient from a regression of student's z-score on a given endline exam on treatment dummy variables.
• The sample includes students tested in grades 4-13 in five subjects: Math, Science, English, Urdu, Economics.
• The first two bars include all test subjects and question items. The next two bars restrict to math and science exams. The next restrict to English, Urdu and Economics exams. The next restrict to question items drawn from external sources, such as PISA and TIMSS. The last two restrict to question items which were from the previous grade.
• All regressions include strata fixed effects and control for baseline student average test score, baseline school average test score, grade and subject. Standard errors are clustered at the school level. 95% confidence intervals are shown on each bar. Stars just above a bar show the significance of the treatment group relative to the control. A bracket above two bars denotes the significance between the two treatments (subjective versus objective). * p < 0.10, * * p < 0.05, * * * p < 0.01. Notes: This figure presents the effects of each performance incentive treatment on student socio-emotional outcomes relative to the control group.
• The blue bars present the coefficient of the effect of the objective treatment relative to the control (flat raises). The red bars present the coefficient of the effect of the subjective treatment relative to the control.
• The observation is at the student level. The y-axis presents the coefficient from a regression of student's z-score on a given socio-emotional dimension from an endline survey of students conducted in January 2019.
• The first two bars provide the average across all five dimensions of socio-emotional outcomes. The remaining ten provide effects on each individual dimension.
• All regressions include strata fixed effects and control for student's grade. Standard errors are clustered at the school level. 95% confidence intervals are shown on each bar. Stars just above a bar show the significance of the treatment group relative to the control. A bracket above two bars denotes the significance between the two treatments (subjective versus objective). * p < 0.10, * * p < 0.05, * * * p < 0.01.

Notes: This figure presents the effects of each performance incentive treatment on teacher behavior as rated based on classroom videos relative to the control group.
• The blue bars present the coefficient of the effect of the objective treatment relative to the control (flat raises). The red bars present the coefficient of the effect of the subjective treatment relative to the control.
• The observation is at the classroom observation level. Teachers may be observed multiple times over the course of the intervention. The y-axis presents the coefficient from a regression of classroom observation score in a given dimension on treatment dummy variables.
• The sample includes teachers from grades 4-13 in core academic subjects.
• The first two bars present the effects on the average score on the CLASS rubric (Pianta et al., 2012), on a 7-pt scale. The next six bars provide effects on scores on sub-areas of the CLASS rubric. The last two bars provide effects on time spent on testing or test-prep activities (in minutes).
• All regressions include strata fixed effects and control for grade and video coder fixed effects. Standard errors are clustered at the school level. 95% confidence intervals are shown on each bar. Stars just above a bar show the significance of the treatment group relative to the control. A bracket above two bars denotes the significance between the two treatments (subjective versus objective). * p < 0.10, * * p < 0.05, * * * p < 0.01.

Notes: This figure presents teachers' responses to questions regarding their incentive contract for the previous year.
• Figure A shows teachers' responses to questions about what actions they believe fellow teachers take to increase their raise. Figure B shows their responses to questions about whether certain groups are favored by the incentive scheme.
• The red (blue) bars present the average response for teachers in the objective (subjective) treatment schools. Observations are at the teacher level and come from the endline survey of teachers.
• In figure A, the outcome is a 5-pt scale from Strongly Disagree (1) to Strongly Agree (5). In figure B, the outcome is a 5-pt scale (1, lots of bias against; 3, no bias; 5, lots of bias in favor).
• Standard errors are clustered at the school level. 95% confidence intervals are shown on the subjective bar comparing it to the objective treatment. Stars just above the bar show the significance of the subjective group relative to the objective. * p < 0.10, * * p < 0.05, * * * p < 0.01.

Notes: This figure presents teachers' responses to questions about how they respond to their incentive.
• The red (blue) bars present the average response for teachers in the objective (subjective) treatment schools. Observations are at the teacher level and come from the endline survey of teachers.
• The questions are on a 5-pt scale from Strongly Disagree (1) to Strongly Agree (5).
• Standard errors are clustered at the school level. 95% confidence intervals are shown on the subjective bar comparing it to the objective treatment. Stars just above the bar show the significance of the subjective group relative to the objective. * p < 0.10, * * p < 0.05, * * * p < 0.01.

Notes: This figure presents teachers' responses to a hypothetical scenario in which they are advising a teacher which actions they should take to increase their raise under a given treatment.
• The red (blue) bars present the average response for teachers in the objective (subjective) treatment schools. Observations are at the teacher level and come from the endline survey of teachers.
• The questions are on a 5-pt scale from Strongly Disagree (1) to Strongly Agree (5).
• Standard errors are clustered at the school level. 95% confidence intervals are shown on the subjective bar comparing it to the objective treatment. Stars just above the bar show the significance of the subjective group relative to the objective. * p < 0.10, * * p < 0.05, * * * p < 0.01.

Notes: Data in columns (1) and (2) comes from administrative data collected from our partner school system. Data in panel B, columns (1) and (2) is from an endline survey conducted with 189 principals and vice principals and 5,698 teachers in our study sample. Data in panels A, B and C, columns (3) and (4) is from the World Management Survey data conducted by the Centre for Economic Performance (Bloom et al., 2015).

Notes: Data in columns (1) and (2) comes from administrative data collected from our partner school system. Data in panels B and C, columns (1) and (2) is from an endline survey conducted with 189 principals and vice principals in our study sample. Data in panels A and B, columns (3) and (4) is from the World Management Survey data conducted by the Centre for Economic Performance (Bloom et al., 2015). We restrict to the 270 schools located in the US from that sample.

Notes: This table presents the effects of each performance incentive treatment on student endline test scores. The outcome is student's z-score on a given endline exam. The sample includes students tested in grades 4-13 in five subjects: Math, Science, English, Urdu, Economics. Column (1) includes all test subjects and question items. The observation is at the student-subject exam level. Column (2) restricts to question items which were from the previous grade. Column (3) restricts to question items drawn from external sources, such as PISA and TIMSS. Column (4) restricts to math and science exams. Column (5) restricts to English, Urdu and Economics exams.
All regressions include strata fixed effects and control for baseline student average test score, baseline school average test score, grade and subject. Values in parentheses are standard p-values. Values in brackets are randomization inference p-values. Standard errors are clustered at the school level. * p < 0.10, * * p < 0.05, * * * p < 0.01.

Notes: This table presents the effects of each performance incentive treatment on student socio-emotional outcomes. The outcome is student's z-score on a given socio-emotional dimension. Observations are at the student level and come from an endline survey of students in January 2019. Column (1) provides the average across all five dimensions of socio-emotional outcomes. Columns (2)-(6) provide each individual dimension. All regressions include strata fixed effects and control for student's grade. Values in parentheses are standard p-values. Values in brackets are randomization inference p-values. Standard errors are clustered at the school level. * p < 0.10, * * p < 0.05, * * * p < 0.01.

Notes: This table presents the effects of each performance incentive treatment on teacher behavior as rated based on classroom videos. The unit of observation is the classroom observation. Teachers may be observed multiple times over the course of the intervention. Column (1) presents the average score on the CLASS rubric (Pianta et al., 2012), on a 7-pt scale. Columns (2)-(4) provide scores on sub-areas of the CLASS rubric. Column (5) provides the number of minutes during the observation that were spent on testing or test-prep activities. All regressions include strata fixed effects and control for grade and video coder fixed effects. Values in parentheses are standard p-values. Values in brackets are randomization inference p-values. Standard errors are clustered at the school level. * p < 0.10, * * p < 0.05, * * * p < 0.01.
Notes: This table presents the effects of each performance incentive treatment on teacher attendance and time at work. The outcomes are the number of days present at work and the number of hours at work. Data comes from biometric clock-in and clock-out data collected at all schools. The restricted sample removes teachers who took long leaves of absence or only worked at the school system for one of the two terms. All regressions include strata fixed effects and control for baseline school average test score, grade and subject. Values in parentheses are standard p-values. Values in brackets are randomization inference p-values. Standard errors are clustered at the school level. * p < 0.10, * * p < 0.05, * * * p < 0.01.

Notes: This table reports teachers' responses to a hypothetical scenario in which they are advising a teacher which actions they should take to increase their raise under a given treatment. Data was collected as part of the endline survey, and observations are at the unit of the teacher. Actions are categorized into four categories: administrative tasks, pedagogy, professional development, and test preparation. Table A3 provides teachers' weights for the full list of activities by treatment. * p < 0.10, * * p < 0.05, * * * p < 0.01.

Notes: This table presents the relationship between manager rating inaccuracy and teachers' ratings of how noisy their contract was. The outcome is teachers' rating of how noisy their contract was, as measured by an index of their responses to the three questions shown in Figure 6. Columns (1) and (2) use data at the teacher level. Columns (3) and (4) use data at the teacher-student exam level. Student exam data is matched to all teachers who taught the student in the given exam subject for at least one term from January-December 2018. All regressions control for subject, class and manager inaccuracy squared. Columns (3) and (4) also control for school and student test baseline.
Columns (2) and (4) add additional controls to pick up other non-noise differences across contracts. These controls include the weight placed on each of the four activity groups listed in Table 7, those values interacted with the Subjective treatment, when teachers said they learned about the treatment, and how often they received information about the treatment. Standard errors are clustered at the school level. * p < 0.10, * * p < 0.05, * * * p < 0.01.

Notes: This table presents the relationship between teachers' rating of the noisiness of their contract, instrumented by manager inaccuracy × Subjective Treatment, and teacher and student outcomes. Columns (1) and (2) use data at the teacher level. Columns (3) and (4) use data at the teacher-student exam level. Student exam data is matched to all teachers who taught the student in the given exam subject for at least one term from January-December 2018. Columns (5) and (6) use data at the student level. All regressions control for subject, class, subjective treatment, manager inaccuracy, and manager inaccuracy squared. Columns (3) and (4) also control for school and student test baseline. Columns (2), (4) and (6) add additional controls to pick up other non-noise differences across contracts. These controls include the weight placed on each of the four activity groups listed in Table 7, those values interacted with the Subjective treatment, when teachers said they learned about the treatment, and how often they received information about the treatment. Standard errors are clustered at the school level. * p < 0.10, * * p < 0.05, * * * p < 0.01.

Notes: This table presents the relationship between evaluation criteria interacted with treatment and student outcomes. Data is at the teacher level. All regressions control for the four categories of evaluation criteria and the subjective treatment. Columns (2) and (4) add additional controls to pick up other non-distortion differences across contracts.
These controls include noise index, belief about whether the contract affects teacher competition, favors certain teachers, when teachers said they learned about the treatment, how often they received information about the treatment and all of these outcomes interacted with subjective treatment. Standard errors are clustered at the school level. * p < 0.10, * * p < 0.05, * * * p < 0.01.

Figure A1: Example Performance Criteria
Notes: This figure shows an example set of performance criteria a teacher would have set in collaboration with their manager at the beginning of the year. This list of criteria was located on their employment portal, and available to access throughout the year. Managers could set individual criteria for each of their employees. These ranged from 4 to 10 criteria spanning numerous aspects of the teacher's job descriptions.

Figure A2: Example Midterm Information
Notes: This figure shows an example notification sent to teachers during the summer between the two school years. The notification gave teachers a preliminary performance rating based on the first term of the experiment. Teachers received this information via email and as a pop-up notification on their employment portal. This example shows the notification that subjective treatment teachers would receive. Teachers in the objective treatment received midterm performance information based on their students' percentile value added from the first term. Teachers in the control schools received information about either their performance along the subjective criteria set by their manager or their students' percentile value added.

Teacher percentile (0-1) in: -0.003** -0.017*** -0.013***

Notes: This table reports teachers' responses to a hypothetical scenario in which they are advising a teacher which actions they should take to increase their raise under a given treatment. Data was collected as part of the endline survey, and observations are at the unit of the teacher/manager. * p < 0.10, * * p < 0.05, * * * p < 0.01.

Column (2) includes the full sample of teachers and column (3) just includes teachers for whom we conducted a classroom observation. Hours and days present are from biometric clock-in and clock-out data provided by the school system. Value-added is calculated using administrative test scores and endline test scores. The remaining variables are from classroom observations: the first 12 are the dimensions of the CLASS rubric and the rest are additional elements of teaching not captured by the CLASS rubric. * p < 0.10, * * p < 0.05, * * * p < 0.01. The final column reports mean differences between treatment groups and whether any are statistically significant. The three "Is there any bias" questions are on a 5-pt scale (1, lots of bias against; 3, no bias; 5, lots of bias in favor).
The remaining questions in panels A and B are on a 5-pt scale from 1 (strongly disagree) to 5 (strongly agree). Questions in panel C were on a scale from 1 to 8. Standard errors are clustered at the school level. * p < 0.10, * * p < 0.05, * * * p < 0.01.

Notes: This table presents the treatment effects by manager characteristics. The row Interaction lists which characteristic is used as the interaction variable for a given column. Age, experience and gender are from administrative records. Manager inaccuracy is from teacher endline survey data. Management rating and Personnel management rating are from manager endline survey responses to World Management Survey questions. Standard errors are clustered at the school level. * p < 0.10, * * p < 0.05, * * * p < 0.01.