Working from home (WFH) surged after the COVID-19 pandemic, with university-graduate employees typically WFH for one to two days a week during 2023 (refs. 2,6). Previous causal research on WFH has focused on employees who are fully remote, usually working on independent tasks in call-centre, data-entry and helpdesk roles. This literature has found that the effects of fully remote working on productivity are often negative, which has resulted in calls to curtail WFH (refs. 5–12). However, there are two challenges when it comes to interpreting this literature. First, more than 70% of employees WFH globally are on a hybrid schedule. This group comprises more than 100 million individuals, with the most common working pattern being three days a week in the office and two days a week at home (refs. 2,8,9). Second, most employees who are regularly WFH are university graduates in creative team jobs that are important in science, law, finance, information technology (IT) and other industries, rather than performing repetitive data-entry or call processing tasks (refs. 10,11).

This paper addresses the gap in previous studies in two key ways. First, it uses a randomized control trial to examine the causal effect of a hybrid schedule in which employees are allowed to WFH two days per week. Second, it focuses on university-graduate employees in software engineering, marketing, accounting and finance, whose activities are mainly creative team tasks.

Our study describes a randomized control trial from August 2021 to January 2022, which involved 1,612 graduate employees in the Airfare and IT divisions of a large Chinese travel technology multinational. Employees were randomized by even or odd birthdays into the option to WFH on Wednesday and Friday and come into the office on the other three days, or to come into the office on all five days.

We found that in the hybrid WFH (‘treatment’) group, attrition rates dropped by one-third (meancontrol = 7.20, meantreat = 4.80, t(1610) = 2.02, P = 0.043) and work satisfaction scores improved (meancontrol = 7.84, meantreat = 8.19, t(1343) = 4.17, P < 0.001). Employees reported that WFH saved on commuting time and costs and afforded them the flexibility to attend to occasional personal tasks during the day (and catch up in the evenings or weekends). These effects on reduced attrition were significant for non-managerial employees (meancontrol = 8.59, meantreat = 5.33, t(1215) = 2.23, P = 0.026), female employees (meancontrol = 9.19, meantreat = 4.18, t(568) = 2.40, P = 0.017) and those with long (above-median) commutes (meancontrol = 6.00, meantreat = 2.89, t(609) = 1.87, P = 0.062).

At the same time, we found no evidence of a significant effect on employees’ performance reviews, on the basis of null equivalence tests, and no evidence of a difference in promotion rates over periods of up to two years (‘Null results’ section of the Methods). We did find significant differences in pre-experiment beliefs about the effects of WFH on productivity between non-managers and managers. Before the experiment, managers tended to have more negative views, reporting that hybrid WFH would be likely to affect productivity by −2.6%, whereas non-managers had more positive views (+0.7%) (t(1313) = −4.56, P < 0.001). After the experiment, the views of managers increased to +1.0%, converging towards non-managers’ views (meannon-manager = 1.62, meanmanager = 1.05, t(1343) = −0.945, P = 0.345). This highlights how the experience of hybrid working leads to a more positive assessment of its effect on productivity—consistent with the overall experience in Asia, the Americas and Europe throughout the pandemic, where perceptions of WFH improved considerably13.

The experiment

The experiment took place at this travel technology multinational, the third-largest global travel agent by sales in 2019. The company was established in 1999, was quoted on NASDAQ in 2003 and was worth about US$20 billion at the time of the experiment. It is headquartered in Shanghai, with offices across China and internationally, and has roughly 35,000 employees.

In the summer of 2021, the company decided to evaluate the effects of hybrid WFH on the 1,612 engineering, marketing and finance employees in the Airfare and IT divisions, spanning 395 managers and 1,217 non-managers. All experimental participants were surveyed at baseline, with questions on expectations, background and their interest in volunteering for early participation in the experiment. The firm randomized employees with an odd-number birthday (born on the first, third, fifth and so on day of the month) into the treatment group.

Figure 1 shows two pictures of employees working in the office to highlight three points. First, in the second half of 2021, COVID incidence rates in Shanghai were so low that employees were neither masked nor socially distanced at the office. Although the COVID pandemic had led to lockdowns in early 2020 and during 2022, during the second half of 2021, Shanghai employees were free to come to work, and typically were unmasked in the office. Second, employees worked in modern open-plan offices in desk groupings of four or six colleagues from the same team, reflecting the importance of collaboration. Third, the office is a large modern building, similar to many large Asian, European and North American offices.

Fig. 1: Employees worked in modern open-plan offices, with teams seated together.

Pictures of employees in the office during the experiment. The people in the experimental sample are typically in their mid-30s, and 65% are male. All of them have a university undergraduate degree and 32% have a postgraduate degree, usually in computer science, accounting or finance, at the master’s or PhD level. They have 6.4 years tenure on average and 48% of employees have children (Extended Data Table 1).

Effects on employee retention

One key motivation for the company in running the experiment was to evaluate how hybrid WFH affected employee attrition and job satisfaction. The net effect was to reduce attrition over the experiment by 2.4 percentage points, which against the control-group base of 7.2% was a one-third (33%) reduction in attrition (meancontrol = 7.20, meantreat = 4.80, t(1610) = 2.02, P = 0.043). Consistent with this reduction in quit rates, employees in the treatment group also registered more positive responses to job-satisfaction surveys (meancontrol = 7.84, meantreat = 8.19, t(1343) = 4.17, P < 0.001). Employees were anonymously surveyed on 21 January 2022, and employees in the treatment group showed significantly higher scores on a scale from 0 (lowest) to 10 (highest) in ‘work–life balance’, ‘work satisfaction’, ‘life satisfaction’ and ‘recommendation to friends’, and significantly lower scores in ‘intention to quit’ (Extended Data Table 2).

One possible explanation for the lower quit rates in the treatment group is that quit rates in the control group increased because the individuals in this group were annoyed about being randomized out of the experiment. However, quit rates in the same Airfare and IT divisions were 9.8% in the six months before the experiment—higher than the rate for the control group during the experimental period. Quit rates over the experimental period in the two other divisions for which we have data (Business Trips and Marketing) were 10.5% and 9.8%—again higher than that for the control group during the experimental period. This suggests that, if anything, the control-group quit rates were reduced rather than increased by the experiment, possibly because some of them guessed (correctly) that the policy would be rolled out to all employees once the experiment ended.

Figure 2 shows the change in attrition rates by three splits of the data. First, we examined the effect on attrition for the 1,217 non-managers and 395 managers separately. We saw a significant drop in attrition of 3.3 percentage points for the non-managers, which against a control-group base of 8.6% is a 40% reduction (meancontrol = 8.59, meantreat = 5.33, t(1215) = 2.23, P = 0.026). By contrast, there was an insignificant increase in attrition for managers (meancontrol = 2.96, meantreat = 3.13, t(393) = −0.098, P = 0.922). We also found that non-managers were more enthusiastic before the experiment, with a volunteering rate of 35% (versus 22% for managers), matching the media sentiment that although non-managerial employees are enthusiastic about WFH, many managers are not (t(1610) = 4.86, P < 0.001).

Fig. 2: WFH cut attrition by 33% overall, and had a particularly strong effect for non-managers, women and those with longer commutes.

Data on 1,612 employees’ attrition until 23 January 2022. Top left, all employees. Only 1,259 employees filled out the baseline survey question on commuting length, so the commute-length (two ways) sample is for 1,259 employees. Sample sizes are 820 and 792 for control and treatment; 1,217 and 395 for non-managers and managers; 570 and 1,042 for women and men; and 648 and 611 for short and long commuters, respectively. Two-tailed t-tests for the attrition difference within each group between the control and treatment groups are (difference = 2.40, s.e. = 1.18, confidence interval (CI) = [0.0748, 4.72], P = 0.043) for all employees; (difference = 3.26, s.e. = 1.46, CI = [0.392, 6.12], P = 0.026) for non-managers; (difference = −0.169, s.e. = 1.73, CI = [−3.57, 3.23], P = 0.922) for managers; (difference = 5.01, s.e. = 2.08, CI = [0.915, 9.10], P = 0.017) for women; (difference = 0.997, s.e. = 1.43, CI = [−1.82, 3.81], P = 0.487) for men; (difference = 2.61, s.e. = 1.93, CI = [−1.19, 6.41], P = 0.178) for employees with median (90 min, two-way) or shorter commutes; and (difference = 3.11, s.e. = 1.66, CI = [−0.156, 6.37], P = 0.062) for above-median (90 min, two-way) commuters.
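The headline attrition comparison can be approximately reproduced from the reported group sizes and rates. The sketch below is illustrative and is not the authors' code: the quit counts (59 of 820 in control, 38 of 792 in treatment) are assumptions inferred from the rounded rates of 7.20% and 4.80%, and the function name is hypothetical. It runs a pooled-variance two-sample t-test on a binary quit indicator.

```python
import math

def pooled_t_test_binary(events_a, n_a, events_b, n_b):
    """Pooled-variance two-sample t-test on 0/1 outcomes (group a minus group b)."""
    p_a, p_b = events_a / n_a, events_b / n_b
    # For 0/1 data, the sum of squared deviations from the mean is n * p * (1 - p).
    ss_a = n_a * p_a * (1 - p_a)
    ss_b = n_b * p_b * (1 - p_b)
    df = n_a + n_b - 2
    pooled_var = (ss_a + ss_b) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    diff = p_a - p_b
    return diff, se, diff / se, df

# Assumed quit counts, inferred from the rounded attrition rates in the text.
diff, se, t_stat, df = pooled_t_test_binary(59, 820, 38, 792)
relative_reduction = diff / (59 / 820)

print(f"difference = {100 * diff:.2f} pp, s.e. = {100 * se:.2f} pp, t({df}) = {t_stat:.2f}")
print(f"relative reduction = {100 * relative_reduction:.0f}%")
```

Under these assumed counts, the output is close to the reported difference of 2.40, s.e. of 1.18 and t(1610) = 2.02, and the relative reduction is about one-third.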

Second, we examined the effect on attrition by total commute length, splitting the sample into people with shorter and longer total commutes on the basis of the median commute duration (two-way commutes of 1.5 h or less versus those exceeding 1.5 h, with 648 and 611 employees, respectively). We found that there was a larger reduction in quit rates (52%) for those with a long commute (meancontrol = 6.00, meantreat = 2.89, t(609) = 1.87, P = 0.062). The reduction in quit rates was similarly large for employees with a long commute if we instead defined a long commute as a two-way commute time exceeding 2 h (meancontrol = 7.33, meantreat = 1.89, t(307) = 2.31, P = 0.021). Employees who volunteered to take part in the experiment had longer one-way commute durations (Extended Data Table 3; meannon-volunteer = 0.80, meanvolunteer = 0.89, t(1257) = −3.68, P < 0.001). This is not surprising given that the most frequently cited benefit of WFH is no commute (ref. 1).

Third, we examined the effect on attrition by gender, examining the 570 female and 1,042 male employees separately. We found that there was a 54% reduction in quit rates for female employees (meancontrol = 9.2, meantreat = 4.2, t(568) = 2.40, P = 0.017). For male employees, there was an insignificant 16% reduction in quit rates (meancontrol = 6.15, meantreat = 5.15, t(1040) = 0.70, P = 0.487). This greater reduction in quit rates among female individuals echoes the findings of previous studies6,14,15,16, which suggest that women place greater value on remote work than men do. Notably, although the treatment effect of WFH was significantly larger for female employees, volunteers were less likely to be female (meannon-volunteer = 0.37, meanvolunteer = 0.32, t(1610) = −2.02, P = 0.043); this might suggest that women have greater concerns about negative career signalling by volunteering to WFH.

Employee performance and promotions

Another key question for the company was the effect of hybrid WFH on employee performance. To assess that, we examined four measures of performance: six-monthly performance reviews and promotion outcomes for up to two years after the start of the experiment, detailed performance evaluations, and the lines of code written by the computer engineers. We also collected self-assessed productivity effects of hybrid working from experimental participants before and after the experiment to evaluate employee perceptions.

Performance reviews are important within the company as they determine employees’ pay and career progression, so they are carefully conducted. The review process for each employee is built on formal assessments provided by their managers, co-workers, direct reports and, if appropriate, customers. They are reviewed by employees, collated by managers and by the human resources team, and then discussed between the manager and the employee. This lengthy process takes several weeks, providing a well-grounded measure of employee performance. Although these reviews are not perfect, given their tight link to pay and career development, both managers and employees put a large amount of effort into making these informative measures of performance.

Figure 3 reports the distribution of performance grades for treatment and control employees for the four half-year periods: July to December 2021, January to June 2022, July to December 2022 and January to June 2023. These four performance reviews span a two-year period from the start of the experimental period. Across all review periods, we found no difference in reviews between the treatment and control groups (Extended Data Table 4 and ‘Null results’ section of the Methods).

Fig. 3: WFH had no significant effect on performance reviews over the next two years.

Results from performance reviews of 1,507 employees in July–December 2021, 1,355 employees in January–June 2022, 1,301 employees in July–December 2022 and 1,254 employees in January–June 2023. Samples are lower over time owing to employee attrition from the original experimental sample. Two-tailed t-tests for the performance difference within each period between the control and treatment groups, after assigning each letter grade a numeric value from 1 (D) to 5 (A), are (difference = 0.056, s.e. = 0.043, CI = [−0.029, 0.14], P = 0.198) for July–December 2021; (difference = 0.034, s.e. = 0.044, CI = [−0.0529, 0.122], P = 0.440) for January–June 2022; (difference = −0.019, s.e. = 0.046, CI = [−0.11, 0.072], P = 0.677) for July to December 2022; and (difference = 0.046, s.e. = 0.051, CI = [−0.054, 0.146], P = 0.369) for January–June 2023. The null equivalence tests are included in the ‘Null results’ section of the Methods.

Figure 4 reports the distribution of promotion outcomes for the treatment and control employees for the same periods. We see no evidence of a difference in promotion rates across treatment and control employees. This is an important result given the evidence that fully remote working can damage employee development and promotions (refs. 14,17,18).

Fig. 4: WFH had no significant effect on promotions over the next two years.

Promotion outcomes for 1,522 employees in July–December 2021, 1,378 employees in January–June 2022, 1,314 employees in July–December 2022 and 1,283 employees in January–June 2023. Samples are lower over time owing to employee attrition from the original experimental sample. Two-tailed t-tests for the promotion difference within each period between the control and treatment groups are (difference = −0.86, s.e. = 1.34, CI = [−3.51, 1.74], P = 0.509) for July–December 2021 promotions; (difference = 0.12, s.e. = 0.85, CI = [−1.54, 1.78], P = 0.892) for January–June 2022 promotions; (difference = −0.51, s.e. = 1.12, CI = [−2.72, 1.70], P = 0.651) for July–December 2022 promotions; and (difference = −0.99, s.e. = 1.02, CI = [−2.99, 1.00], P = 0.328) for January–June 2023 promotions. The null equivalence tests are included in the ‘Null results’ section of the Methods.

We also analysed the effects of treatment on performance grades and promotions for a variety of subgroups, including managers, employees with a manager in the treatment group, longer-tenured employees, longer-commuting employees, women, employees with children, computer engineers and those living further away, as well as looking at whether internet speed had any effect. We found no evidence of a difference in response to treatment across these groups (Extended Data Table 5).

We also analysed two other measures of employee performance. First, the company’s performance reviews have subcomponents for individual activities such as ‘innovation’, ‘leadership’, ‘development’ and ‘execution’ (nine categories in all) when these are important for an individual employee’s role. We collected these data and analysed these scores for the four six-month performance review periods. We found no evidence of a difference across these nine major categories over the four performance review periods (Extended Data Table 6). This indicates that for categories that involve softer skills or more team-focused activities, such as development and innovation, there is no evidence for a material effect of being randomized into the hybrid WFH treatment. Second, for the 653 computer engineers, we obtained data on the lines of code uploaded by each engineer each day. For this ‘lines of code submitted’ measure, we found no difference between employees in the control and treatment groups (Extended Data Fig. 1 and ‘Null results’ section of the Methods).

Self-assessed productivity

All experiment participants were polled before the experiment in a baseline survey on 29 and 30 July 2021, which included a two-part question on their beliefs about the effects of hybrid WFH on productivity. Employees were asked ‘What is your expectation for the impact of hybrid WFH on your productivity?’, with three options of ‘positive’, ‘about the same’ or ‘negative’. Individuals who chose the answer ‘positive’ were then offered a set of options asking how positive they felt, ranging from [5% to 15%] up to [35% or more], and similarly so for negative choices. For aggregate impacts we took the mid-points of each bin, and 42.5% for >35% and –42.5% for <−35%. Employees were resurveyed with the same question after the end of the experiment on 21 January 2022.
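The midpoint coding of survey answers described above can be sketched as follows. Only the [5% to 15%] bin, the [35% or more] bin and the ±42.5% top-coding come from the text. The intermediate bin edges, the value of 0 for ‘about the same’, the example responses and all names are assumptions made for illustration.

```python
# Assumed magnitude bins (in percent). Only the first and last are named in the text.
BIN_VALUES = {
    "5% to 15%": 10.0,    # midpoint of the bin
    "15% to 25%": 20.0,   # assumed intermediate bin
    "25% to 35%": 30.0,   # assumed intermediate bin
    "35% or more": 42.5,  # top-coded value stated in the text
}

def code_response(direction, magnitude_bin=None):
    """Convert a two-part survey answer into a signed percentage effect."""
    if direction == "about the same":
        return 0.0  # assumption: a neutral answer is coded as zero
    value = BIN_VALUES[magnitude_bin]
    if direction == "positive":
        return value
    if direction == "negative":
        return -value
    raise ValueError(f"unknown direction: {direction!r}")

# Hypothetical responses, not data from the experiment.
responses = [
    ("positive", "5% to 15%"),
    ("about the same", None),
    ("negative", "35% or more"),
]
coded = [code_response(direction, bin_label) for direction, bin_label in responses]
mean_effect = sum(coded) / len(coded)
print(coded, round(mean_effect, 2))
```

Averaging the coded values across respondents gives the kind of aggregate belief measure reported for the baseline and end-line surveys.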

The left panel of Fig. 5 shows that employees’ pre-experimental beliefs about WFH and productivity were extremely varied. The baseline mean was –0.1%, but with widespread variation (standard deviation of 11%). This spread should be unsurprising to anyone who has been following the active debate about the effects of remote work on productivity. At the end-line survey conducted on 21 January 2022, the mean of these beliefs had significantly increased to 1.5%, revealing that the experience of hybrid working led to a small improvement in average employee beliefs about the productivity impact of hybrid working (meanbaseline = −0.06%, meanendline = 1.48%, t(2658) = −3.84, P < 0.001). This could be because hybrid WFH saves employees commuting time and is less physically tiring, and, with intermittent breaks between group time and quiet individual time, can improve performance19,20,21,22.

Fig. 5: Views on the effect of WFH on productivity improved after the experiment, particularly for managers.

Sample from 1,315 employees (314 managers, 1,001 non-managers) at the baseline and 1,345 employees (324 managers, 1,021 non-managers) at the end line. Two-tailed t-tests for the difference in productivity expectations between baseline and end line, after assigning a numeric value corresponding to the midpoint of the bucket, are (baseline mean = −0.058, end-line mean = 1.48, difference = −1.54, s.e. = 0.40, CI = [−2.33, −0.753], P < 0.001). Two-tailed t-tests for the baseline difference between the productivity expectations of managers and non-managers are (difference = −3.28, s.e. = 0.72, CI = [−4.69, −1.86], P < 0.001), and the t-tests for the end-line difference are (difference = −0.571, s.e. = 0.604, CI = [−1.76, 0.615], P = 0.345).

The right panel of Fig. 5 shows that in the baseline survey, managers were negative about the perceived effect of hybrid work on their productivity, with a mean effect of −2.6%. Non-managers, by contrast, were significantly more positive, at +0.7% in the baseline survey (meannon-manager = 0.7%, meanmanager = −2.6%, t(1313) = −4.56, P < 0.001). At the end of the experiment, the views of managers improved to 1.0%, with no evidence of a difference from the non-managers’ mean value of 1.6% (meannon-manager = 1.62%, meanmanager = 1.05%, t(1343) = −0.95, P = 0.345). Hence, the experiment led managers to positively update their views about how hybrid WFH affects productivity, and to more closely align with non-managers.

Of note, we saw that employees in the treatment and control groups had similar increases in self-assessed productivity (difference 0.58%, s.d. = 0.59%). Employees from four other divisions in the company were also polled in March 2022, after the end of the experiment, about the productivity impact of hybrid WFH, with a mean estimate of +2.8% on a sample of 3,461 responses, similar to the 1.5% end line for the experimental sample. This suggests that even close exposure to hybrid WFH is sufficient for employees to change their views, consistent with previous evidence of a positive society-wide shift in perceptions about WFH productivity after the 2020 pandemic (ref. 8).


Once the experiment ended, the executive committee examined the data and voted to extend the hybrid WFH policy to all employees in all divisions of the company with immediate effect. Their logic was that each quit cost the company approximately US$20,000 in recruitment and training, so a one-third reduction in attrition for the firm would generate millions of dollars in savings. This was publicly announced on 14 February 2022, with wide coverage in the Chinese media. Since then, other Chinese tech firms have adopted similar hybrid policies (ref. 23).

This highlights how, contrary to the previous causal research focused on fully remote work, which found mostly negative effects on productivity (refs. 5–7), hybrid remote work can leave performance unchanged. This suggests that hybrid working can be profitably adopted by organizations, given its effect on reducing attrition, which is estimated to cost about 50% of an individual’s annual salary for graduate employees (ref. 24). Hybrid working also offers large gains for society by providing a valuable amenity (perk) to employees, reducing commuting and easing child-care (refs. 6,25,26).

The experiment was conducted in a Chinese technology firm based in Shanghai. Although it might not be possible to replicate these results perfectly in other situations, the company is a large multinational firm with global suppliers, customers and investors. Its offices are modern buildings that look similar to those in many American, Asian and European cities. Trip employees worked 8.6 h per day on average, close to the 8 h per day that is usual for US graduate employees (ref. 27). The business had a large drop in revenue in 2020 (see Extended Data Fig. 4), followed by roughly flat revenues through the 2021 experiment period into 2022, so this was not a period of exceptionally fast or slow growth. As such, we believe that these results (that is, the finding that allowing employees to WFH two days per week reduces quit rates and has a limited effect on performance) would probably extend to other organizations. Also, this experiment analysed the effects of working three days per week in the office and two days per week from home. So, our findings might not replicate to all other hybrid work arrangements, but we believe that they could extend to other hybrid settings with a similar number of days in the office, such as two or four days a week. We are not sure whether the results would extend to more remote settings such as one day a week (or less) in the office, owing to potential challenges around training, innovating and culture in fully remote settings.

Finally, we should point out two implications of the experimental design. First, full enrolment into hybrid schemes is important because of concerns that volunteering might be seen as a negative signal about career ambitions. The low volunteer rate among female employees, despite their high implied value (from the large reductions in quit rates observed), is particularly notable in this regard. Second, there is value in experimentation. Before the experiment, managers were net-negative in their views on the productivity impact of hybrid working, but after the experiment, their views became net-positive. This highlights the benefits of experimentation for firms to evaluate new working practices and technologies.


Location and set-up

Our experiment took place at the company in Shanghai, China. In July 2021, the company decided to evaluate hybrid WFH after seeing its popularity amongst US tech firms. The first step took place on 27 July 2021, when the firm surveyed 1,612 eligible engineers, marketing and finance employees in the Airfare and IT divisions about the option of hybrid WFH. They excluded interns and rookies who were in probation periods because on-site learning and mentoring are particularly important for those individuals. The company chose these two divisions as representative of the firm, with a mix of employee types to assess any potentially heterogeneous impacts. About half of the employees in these divisions are technical employees, writing software code for the website, and front-end or back-end operating systems. The remainder work in business development, with tasks such as talking to airlines, travel agents or vendors to develop new services and products; in market planning and executing advertising and marketing campaigns; and in business services, dealing with a range of financial, regulatory and strategy issues. Across these groups, 395 individuals were managers and 1,217 non-managers, providing a large enough sample of both groups to evaluate their response to hybrid WFH.


The employees were sent an email outlining how the six-month experiment offered them the option (but not the obligation) to WFH on Wednesday and Friday. After the initial email and two follow-up reminders, a group of 518 employees volunteered. The firm randomized employees with odd birthdays—those born on the first, third, fifth and so on of the month—into eligibility for the hybrid WFH scheme starting on the week of 9 August. Those with even birthdays—born on the second, fourth, sixth and so on of the month—were not eligible, so formed the control group.
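The birthday-based assignment rule is simple enough to express directly. The following is a minimal sketch of that rule only; the function name and the example birthdays are hypothetical, and it does not model the two-stage roll-out to volunteers and non-volunteers.

```python
from datetime import date

def assign_group(birthday: date) -> str:
    """Odd day of the month -> treatment (eligible for hybrid WFH); even -> control."""
    return "treatment" if birthday.day % 2 == 1 else "control"

# Hypothetical birthdays, for illustration only.
examples = [date(1988, 3, 1), date(1990, 7, 14), date(1985, 12, 31)]
groups = [assign_group(birthday) for birthday in examples]

for birthday, group in zip(examples, groups):
    print(birthday.isoformat(), group)
```

Because the day of the month on which someone was born is unlikely to be related to their job performance or quit behaviour, this rule behaves like a coin flip while being easy for the firm to administer and verify.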

The top management at the firm was surprised at the low volunteer rate for the optional hybrid WFH scheme. They suspected that many employees were hesitating because of concerns that volunteering would be seen as a negative signal of ambition and productivity. This is not unreasonable. For example, a previous study (ref. 28) of a US firm found that WFH employees were negatively selected on productivity. So, on 6 September, all of the remaining 1,094 non-volunteer employees were told that they were also included in the programme. The odd-birthday employees were again randomized into the hybrid WFH treatment and began the experiment on the week of 13 September. In this paper we analyse the two groups together, but examining the volunteer and non-volunteer groups individually yields similar findings of reduced quit rates and no impact on performance.

Employee characteristics and balancing tests

Figure 1 shows some pictures of employees working in the office (left side). Employees all worked in modern open-plan offices in desk groupings of four or six colleagues from the same team. By contrast, when WFH, they usually worked alone in their apartments, typically in the living room or kitchen (see Extended Data Fig. 2).

The individuals in the experimental sample are typically in their mid-30s. About two-thirds are male, all of them have a university undergraduate degree and almost one-third have a graduate degree (typically a master’s degree). In addition, nearly half of the employees have children (details in Extended Data Table 1).

In Extended Data Table 7 we confirm that this sample is also balanced across the treatment and control groups, by conducting a two-sample t-test. The exceptions are from random variation given that the sampling was by even or odd day-of-month birthday—the control sample is 0.5 years older (P = 0.06), and this is presumably linked to why those in this group have 0.06% more children (P = 0.02) and 0.4 years more tenure (P = 0.09).

In Extended Data Table 3, we examine the decision to volunteer for the WFH experiment. We see that volunteers were significantly less likely to be managers (meannon-volunteer = 0.28, meanvolunteer = 0.17, t(1610) = −4.85, P < 0.001) and had longer commute times (hours) (meannon-volunteer = 0.80, meanvolunteer = 0.89, t(1257) = 3.68, P < 0.001). Notably, we don’t find evidence of a relationship between volunteering and previous performance scores (meannon-volunteer = 3.81, meanvolunteer = 3.81, t(1580) = −0.02, P = 0.985), highlighting, at least in this case, the lack of evidence for any negative (or positive) selection effects around WFH.

Extended Data Fig. 3 plots the take-up rates of WFH on Wednesday and Friday by volunteer and non-volunteer groups. We see a few notable facts. First, take-up overall was about 55% for volunteers and 40% for non-volunteers, indicating that both groups tended to WFH only one day, typically Friday, each week. At the company, large meetings and product launches often happen mid-week, so Fridays are seen as a better day to WFH. Second, the take-up rate even for non-volunteers was 40%, indicating that the firm’s suspicion that many employees did not volunteer out of fear of negative signalling was well-founded, and highlighting that amenities like WFH, holiday, maternity or paternity leave might need to be mandatory to ensure reasonable take-up rates. Third, take-up surged on Fridays before major holidays. Many employees returned to their home towns, using their WFH day to travel home on the quieter Thursday evening or Friday morning. Finally, take-up rates jumped for both treatment-group and control-group employees in late January 2022 after a case of COVID in the Shanghai headquarters. The company allowed all employees at that point to WFH, so the experiment effectively ended early on Friday 21 January. The measure of an employee’s daily WFH take-up excludes leave, sick leave or occasions when they cannot come to the office owing to extreme bad weather (typhoon) or to the COVID outbreak in the company.

Null results

To interpret the main null results, we conduct null equivalence tests using the two one-sided tests (TOST) procedure in R (refs. 29,30). This test required us to specify the smallest effect size of interest (SESOI). For the results pertaining to performance review measures, we use 0.5 as the SESOI. This corresponds to half of a consecutive letter grade increase or decrease, because we had assigned numeric values to performance letter grades in increments of 1, with the lowest letter grade D being 1, and the highest letter grade A being 5. We performed equivalence tests for a two-sample Welch’s t-test using equivalence bounds of ±0.5. The TOST procedure yielded significant results using the default alpha of 0.05 for the tests against both the upper and the lower equivalence bounds for the performance measures for July–December 2021 (t(1504) = −10.20, P < 0.001), January–June 2022 (t(1353) = −10.57, P < 0.001), July–December 2022 (t(1299) = 10.34, P < 0.001) and January–June 2023 (t(1248) = −8.80, P < 0.001). The equivalence test is therefore significant, which means we can reject the hypothesis that the true effect of the treatment on performance is larger than 0.5 or smaller than −0.5. So, we interpret the performance effects of the treatment to be actually null on the basis of the SESOI we used, as opposed to no evidence of a difference in performance.
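The logic of the TOST procedure can be illustrated from summary statistics alone. The sketch below is written in Python rather than R and is not the authors' analysis: it assumes a normal approximation to the t distribution (reasonable with more than 1,000 degrees of freedom), takes the rounded difference and standard error reported for the July–December 2021 performance reviews as inputs, and uses a hypothetical function name.

```python
from statistics import NormalDist

def tost_from_summary(diff, se, bound, alpha=0.05):
    """Two one-sided tests for equivalence within [-bound, +bound].

    Uses a normal approximation to the t distribution (an assumption).
    """
    z = NormalDist()
    t_lower = (diff + bound) / se  # tests H0: true effect <= -bound
    t_upper = (diff - bound) / se  # tests H0: true effect >= +bound
    p_lower = 1 - z.cdf(t_lower)
    p_upper = z.cdf(t_upper)
    return {
        "t_lower": t_lower,
        "t_upper": t_upper,
        "p_lower": p_lower,
        "p_upper": p_upper,
        # Equivalence is concluded only if both one-sided tests reject.
        "equivalent": max(p_lower, p_upper) < alpha,
    }

# Rounded inputs for July-December 2021 performance grades, SESOI of 0.5.
result = tost_from_summary(diff=0.056, se=0.043, bound=0.5)
print(f"t_lower = {result['t_lower']:.2f}, t_upper = {result['t_upper']:.2f}")
print("equivalent within bounds:", result["equivalent"])
```

With these rounded inputs, the less extreme of the two statistics is about −10.3, close to the reported t(1504) = −10.20. The small gap presumably reflects rounding of the inputs and the Welch standard error used in the original analysis.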

We conducted null equivalence tests for the effect of the treatment on promotions using 2 as the SESOI, corresponding to ±2 percentage points (pp) difference in promotion rates. Although we can reject the null hypothesis that the true effect of treatment on promotion is larger than 2 pp or smaller than −2 pp in January–June 2022 (t(1376) = −2.22, P = 0.013) and July–December 2022 (t(1306) = 1.33, P = 0.092), we fail to reject the null equivalence hypothesis in July–December 2021 (t(1513) = 0.83, P = 0.203) and January–June 2023 (t(1250) = 0.98, P = 0.163). Thus, we interpret the results on promotion as no evidence of a difference between promotion rates across treatment and control employees.

We also conducted the equivalence test for lines of code using 29 lines of code per day as the SESOI, which corresponds to 10% of the mean number of lines of code for the control group. We arrived at this SESOI by rounding down the productivity effects reported in previous findings8,10. We can reject the equivalence null hypothesis for lines of code (t(92362) = −2.74, P = 0.003), so we interpret the effect of the treatment as a null effect.

Volunteer versus non-volunteer groups

In the main paper we pool the volunteer and non-volunteer groups. In Extended Data Table 5 we examine the impacts on performance and promotions and we see no evidence of a difference in performance and promotion treatment effects for volunteer versus non-volunteer groups (column 9).

Performance subcategories

The company runs a rigorous performance review every six months. Because the review determines employees’ pay and promotion, it is conducted carefully. The review for each employee is built on formal reviews provided by their managers, project leaders and sometimes co-workers (peer review). The manager is the employee’s direct manager for organizational purposes, but for a particular project the project leader could be another higher-level employee. In such a case, the employee’s manager would ask that project leader for an opinion on the employee’s contribution to the project. An individual’s overall score is a weighted sum of scores across subcategories. Managers have broad flexibility in defining these subcategories because tasks differ across employees, and they give a score for each task. For example, an employee running a team themselves will have subcategories around developing their direct reports (leadership and communication), whereas an employee running a server network will have subcategories around efficiency and execution.

The performance subcategory data come from the text of the performance review. We first used Jieba, a popular Chinese word segmentation package in Python, to identify the most frequent Chinese words in task titles across the four performance reviews. We also removed meaningless words and incorporated common expressions such as key performance indicators (‘KPI’), objectives and key results (‘OKR’), ‘rate’ and ‘%’. This process resulted in a total of 236 unique words and expressions. We then manually categorized these most frequent keywords into nine major subcategories (see below) by meaning and relevance. Finally, on the basis of the presence of keywords in the task title, tasks were grouped into the following subcategories:

  • Communication tasks are those that involve communication, collaboration, cooperation, coordination, participation, suggestion, assistance, organization, sharing and relationships.

  • Development tasks are those that involve coding or codes, data or datasets, systems, techniques and skills.

  • Efficiency tasks are those that involve cost reduction, ratios, return on investment (ROI), rate, %, improvement, growth, lifting, adding, optimizing, profit, receiving, gross merchandise value (GMV), OKR, KPI, work and goal.

  • Execution tasks are those that involve execution, conducting, maintenance, delivery, output, quality, contribution and workload.

  • Innovation tasks are those that involve development, R&D and innovation.

  • Leadership tasks are those that involve leadership, managing or management, approval, internal, strategy, coordination and planning.

  • Learning tasks are those that involve learning, growing, maturing, talent, ability, value competitiveness and personal improvement.

  • Project tasks are those that involve project, supply, product, business line, cooperation and clients.

  • Risk tasks are those that involve risk, compliance, supervision, recording and monitoring, safety, rules and privacy.
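The keyword-based grouping described above can be illustrated with a short Python sketch. This is not the authors' code. The source segmented Chinese task titles with Jieba, whereas this sketch assumes English stand-ins for a handful of the listed keywords and uses simple case-insensitive substring matching. The dictionary `SUBCATEGORY_KEYWORDS` and the function `categorize` are hypothetical names. Because the listed keywords overlap (for example, coordination appears under both communication and leadership), the sketch allows a title to fall into more than one subcategory; how such overlaps were handled in the actual analysis is not stated.

```python
# ASSUMPTION: a small English subset of the keywords, for illustration only.
SUBCATEGORY_KEYWORDS = {
    "communication": ["communication", "collaboration", "coordination", "sharing"],
    "development": ["coding", "data", "system"],
    "efficiency": ["cost reduction", "roi", "kpi", "okr", "%"],
    "leadership": ["leadership", "management", "coordination", "strategy"],
    "risk": ["risk", "compliance", "privacy"],
}


def categorize(task_title: str) -> list[str]:
    """Return every subcategory whose keywords appear in the task title."""
    title = task_title.lower()
    return [
        name
        for name, keywords in SUBCATEGORY_KEYWORDS.items()
        if any(keyword in title for keyword in keywords)
    ]


for title in ["Improve KPI reporting system", "Cross-team coordination"]:
    print(title, "->", categorize(title))
```

Substring matching is cruder than word segmentation and can produce false matches on longer words, so a closer replication would tokenize titles first and match whole tokens.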

Data sources

Data were provided by a combination of sources, including human resources records, performance reviews and two surveys. All data were anonymized and coded using a scrambled individual ID code, so no personally identifiable information was shared with the Stanford team. The data were drawn directly from the administrative data systems on a monthly basis. Gender is collected by the company from employees when they join.


The full sample has 1,612 experiment participants, but we have 1,507, 1,355, 1,301 and 1,254 employees, respectively, in the subsamples for the four performance reviews from July–December 2021, January–June 2022, July–December 2022 and January–June 2023. These smaller samples are due to attrition. In addition, for the first performance review in July–December 2021, 105 employees did not have sufficient pre-experiment tenure to support a performance review (they had joined the firm less than three months before the experimental draw). The review text data cover 1,507, 1,339, 1,290 and 1,246 people, because some employees have an overall score and review text but do not have additional task-specific scores. The reason is that these employees do not have the full range of tasks, so their managers did not write the full review script. For the two surveys, Starbucks vouchers were used to incentivize responses, and responses were collected from 1,315 employees (314 managers, 1,001 non-managers) at the baseline and from 1,345 employees (324 managers, 1,021 non-managers) at the end line.


All tests used two-sided Student t-tests unless otherwise stated. Analysis was run on Stata v17 and v18 and R version 4.2.2. Unless stated otherwise, no additional covariates are included in the tests. The null hypothesis for all of the tests excluding null equivalence tests is a coefficient of zero (for example, zero difference between treatment and control).

Inclusion and ethics statement

The experiment was designed, initiated and run by the company. No participants were forced to WFH owing to the experiment (the entire firm was, however, forced to WFH during the pandemic lockdown). The treatment sample had the option but not the obligation to WFH on Wednesday or Friday. N.B. and R.H. were invited to analyse the data from the experiment, with consent for data collection coming from the company internally. The experiment was exempt under institutional review board (IRB) approval guidelines because it was designed and initiated by the company before N.B. and R.H. were invited to analyse the data. Only anonymous data were shared with the Stanford team. The company based the experimental design and execution on its previous experience with WFH randomized control trials17.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.