Effects of differential measurement error in self-reported diet in longitudinal lifestyle intervention studies

Background: Lifestyle intervention studies often use self-reported measures of diet as an outcome variable to measure changes in dietary intake. The presence of measurement error in self-reported diet due to participants' failure to accurately report their diet is well known. Less familiar to researchers is differential measurement error, where the nature of measurement error differs by treatment group and/or time. Differential measurement error is often present in intervention studies and can result in biased estimates of the treatment effect and reduced power to detect treatment effects. Investigators need to be aware of the impact of differential measurement error when designing intervention studies that use self-reported measures.

Methods: We use simulation to assess the consequences of differential measurement error on the ability to estimate treatment effects in a two-arm randomized trial with two time points. We simulate data under a variety of scenarios, focusing on how different factors affect power to detect a treatment effect, bias of the treatment effect, and coverage of the 95% confidence interval of the treatment effect. Simulations use realistic scenarios based on data from the Trials of Hypertension Prevention Study. Simulated sample sizes ranged from 110 to 380 per group.

Results: Realistic differential measurement error seen in lifestyle intervention studies can require an increased sample size to achieve 80% power to detect a treatment effect and may result in a biased estimate of the treatment effect.

Conclusions: Investigators designing intervention studies that use self-reported measures should take differential measurement error into account by increasing their sample size, incorporating an internal validation study, and/or identifying statistical methods to correct for differential measurement error.


Introduction
Lifestyle intervention studies, which aim to change a participant's weight or eating behavior, often use self-reported measures of diet, such as interviewer-assisted 24-hour dietary recalls or food frequency questionnaires. These measures are prone to error for various reasons, including poor quantification of portion sizes and social desirability [1]. More reliable and accurate measures, such as recovery biomarkers, are expensive and burdensome to collect. Measurement error in self-reported diet can lead researchers to adopt or discard interventions that are actually (in)effective. Most measurement error research has focused primarily on problems associated with measurement error in predictor variables [3], particularly those situations where an exposure is measured with error, thus attenuating or distorting the relationship between exposure and outcome. Less work has been done investigating the implications of measurement error in outcome variables in a longitudinal intervention setting. Longitudinal dietary intervention studies involve repeated dietary assessments over time and produce unique measurement error issues that are not encountered in cross-sectional studies. Participants may modify their reporting behavior to appear compliant with the dietary recommendations of the study [4], or they may attempt to reduce interview duration and reporting difficulty during follow-up assessments by omitting items or by erroneously reporting foods that are easier to measure or describe [5]. Alternatively, their accuracy may improve over time due to training in portion size assessment and a more general awareness of their dietary intake [6,7].
Differential measurement error occurs when the nature of measurement error (bias and precision) differs over time and/or by treatment condition [8,9]. In terms of bias, participants may (1) become more accurate in their reports of diet due to improved self-monitoring; (2) misreport their diet in order to appear compliant with the intervention; or (3) report with the same accuracy as seen at baseline. Similarly, the precision of reporting may increase, decrease, or stay the same as at baseline. These changes in bias and precision may differ by treatment condition. While a biased treatment effect is clearly undesirable, reduced power due to additional variability is also not a trivial matter in lifestyle interventions, where effect sizes tend to be small and the additional variation due to measurement error can result in a failure to detect a treatment effect.
In this paper, using the setting of a longitudinal clinical trial, we use simulation to derive how various forms of differential measurement error influence sample size, bias, and coverage of the 95% confidence interval when estimating treatment effects. We provide recommendations for investigators when designing intervention trials that use dietary intake as an outcome variable.

Methods
In this section, we describe a simulation study to assess the consequences of outcome measurement error and differential measurement error on the ability to estimate treatment effects. Our simulation-based scenarios reflect the settings of intervention studies where diet-based outcomes are measured repeatedly over time in both a treatment and a control group. The analysis model is a covariance pattern regression model [10] in which the outcome is modeled as a function of time and a treatment-by-time interaction, with an unstructured covariance matrix used to estimate the variance-covariance parameters.
We examine (1) differential measurement error with respect to time, reflected in differences in measurement error variability between baseline and follow-up as well as differences in over/under-reporting at baseline as compared to follow-up; and (2) differential measurement error with respect to treatment condition, with participants in the treatment group having different measurement error (variability and bias) as compared to those in the control group. To explore these settings, we simulate data under a variety of scenarios, focusing on how different factors influence the power to detect the treatment effect, the bias of the treatment effect, and the coverage of the 95% confidence interval of the treatment effect.
Let z_ij be the true value (i.e. true dietary intake) of the quantity we wish to measure on participant i, i = 1, . . . , N, at time j, j = 0, 1. This quantity is assumed to be measured without error. Let y_ij be the observed outcome of interest measured with error. That is, y_ij is z_ij measured with error, such as a self-reported dietary measure. Let d_i be an indicator of whether a participant has been randomized to the treatment group (d_i = 1) or the control group (d_i = 0). The variable t_ij indicates the time points at which the quantities are measured, a baseline measurement (t_ij = 0) and a follow-up measurement (t_ij = 1). Finally, let n_d denote the sample size in each of the treatment and control groups.
The distribution of z_ij has the following form:

z_ij = β_0 + β_1 t_ij + β_2 d_i t_ij + ε_ij    (1)

where β_0 is an intercept term and β_1 is the effect of time. We assume no differences in intervention conditions at baseline, and thus do not include a main effect for treatment. The regression coefficient β_2 is the estimand of interest, the expected true difference in change over time between the two treatment conditions. In all of our simulations, we fix the values of the regression coefficients in Eq. (1) at the values listed in Table 1. That is, β_0 = 8.21, β_1 = −0.037, and β_2 = −0.25. These values are based on data from the Trials of Hypertension Prevention (TOHP) study (described below). In Eq. (1), ε_ij is a random error term with distribution ε_ij ∼ N(0, Σ_z). The variance-covariance matrix Σ_z is

Σ_z = σ_z² [[1, ρ], [ρ, 1]]    (2)

where σ_z² is the variance at baseline and follow-up and ρ is the correlation between baseline and follow-up. Again, these values are listed in Table 1 and are fixed across all simulation scenarios (σ_z² = 0.17, ρ = 0.5). To generate y_ij we add measurement error to the z_ij values as follows:

y_ij = γ_0 + γ_1 d_i t_ij + (γ_2 + γ_3 t_ij + γ_4 d_i t_ij) z_ij + δ_ij    (3)

where γ_0 is an intercept term reflecting the overall difference between self-report and true intake at baseline for a given value of z_ij, γ_1 is the additional change in intercept between treatment conditions at follow-up, γ_2 is the slope of the regression of y_ij on z_ij at baseline that reflects how y_ij varies as a function of true intake at baseline, γ_2 + γ_3 is the slope in the control group at time 1, and γ_4 is the difference in slopes between the treatment and control groups at time 1.
The error term δ_ij in Eq. (3) is normally distributed with δ_ij ∼ N(0, Σ_y). The variance-covariance matrix Σ_y is

Σ_y = λ_1 σ_δ² [[1, ρ_δ √(λ_2 λ_3^{d_i})], [ρ_δ √(λ_2 λ_3^{d_i}), λ_2 λ_3^{d_i}]]    (4)

where σ_δ² is the measurement error variance at baseline, ρ_δ is the correlation between measurement errors at baseline and follow-up, λ_1 is a factor for inflating variance at every time point, λ_2 is a factor for inflating variance at follow-up, and λ_3 is a factor for inflating variance in the treatment group at follow-up (the exponent d_i applies λ_3 only in the treatment group). Eq. (4) allows for repeated measures on participant i to be correlated and can also allow for heterogeneous variances. Equations (1) through (4) provide a very flexible framework for simulating data and incorporate a number of scenarios for simulating differential measurement error. Table 2 summarizes, in terms of our model, various types of measurement error and how the parameters are set or varied in our simulation scenarios. For example, when the parameters γ_0 = γ_1 = γ_3 = γ_4 = 0 and γ_2 = 1 in Eq. (3), y_ij follows a classic measurement error model, where y_ij is an unbiased measure of z_ij, but measured with additional variability (Table 2, row 1). We focus on the last three rows of Table 2: differential measurement error with respect to time, differential measurement error with respect to treatment, and differential measurement error with respect to time and treatment.
In these scenarios, the parameters γ 3 and γ 4 allow for changes in bias at follow-up and by treatment condition, respectively. The parameters λ 2 and λ 3 allow for additional variability at follow-up and within the treatment group, respectively.
Using this simulation framework, and varying the sample size as well as the parameters γ 3 , γ 4 , λ 2 , and λ 3 in Eqs. (3) and (4), we simulate data under a variety of scenarios. Each set of simulations is centered at non-differential measurement error ( Table 2, row 2). We then expand our simulations around this central assumption to investigate how differential measurement error impacts estimates of the treatment effect.
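The data-generating process just described can be sketched in a few lines. This is a minimal illustration, not the authors' code: the measurement-error variance `sigma2_delta` and correlation `rho_delta` are placeholder values (the actual calibrated values come from TOHP data and are listed in Table 1), and the model follows the parameter descriptions for Eqs. (1) through (4).

```python
import math
import random

def simulate_trial(n_per_group, beta=(8.21, -0.037, -0.25),
                   sigma2_z=0.17, rho=0.5,
                   gamma=(0.0, 0.0, 1.0, 0.0, 0.0),
                   sigma2_delta=0.10, rho_delta=0.3,
                   lam=(1.86, 1.0, 1.0), seed=1):
    """Simulate true intake z and error-prone self-report y for a
    two-arm, two-time-point trial, following Eqs. (1)-(4).
    gamma = (g0, g1, g2, g3, g4); lam = (l1, l2, l3).
    Defaults for gamma give classical measurement error."""
    rng = random.Random(seed)
    b0, b1, b2 = beta
    g0, g1, g2, g3, g4 = gamma
    l1, l2, l3 = lam
    rows = []
    for i in range(2 * n_per_group):
        d = 1 if i >= n_per_group else 0
        # correlated true-intake errors with correlation rho (Eq. 2)
        e0 = rng.gauss(0, 1)
        e1 = rho * e0 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        z0 = b0 + math.sqrt(sigma2_z) * e0
        z1 = b0 + b1 + b2 * d + math.sqrt(sigma2_z) * e1
        # measurement-error variances inflated by the lambda factors (Eq. 4)
        v0 = l1 * sigma2_delta
        v1 = l1 * l2 * (l3 if d else 1.0) * sigma2_delta
        u0 = rng.gauss(0, 1)
        u1 = rho_delta * u0 + math.sqrt(1 - rho_delta ** 2) * rng.gauss(0, 1)
        # self-report as a linear function of true intake (Eq. 3)
        y0 = g0 + g2 * z0 + math.sqrt(v0) * u0
        y1 = g0 + g1 * d + (g2 + g3 + g4 * d) * z1 + math.sqrt(v1) * u1
        rows.append((d, z0, z1, y0, y1))
    return rows
```

With the classical-error defaults, the difference-in-differences of y recovers the true treatment effect of −0.25 on average, matching the unbiasedness noted in Table 2, row 1.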
To ensure that we are simulating realistic scenarios, we calibrate the simulation parameter values in Eqs. (1) through (4) using data on sodium intake from the Trials of Hypertension Prevention Study (TOHP), a randomized controlled trial of 2811 participants who received lifestyle interventions and nutritional supplement interventions for hypertension prevention [11]. TOHP collected 24-hour recalls as well as urinary sodium on 744 participants, both at baseline and at follow-up. This allows us to posit realistic values for the parameters involving true intake in Eqs. (1) and (2) as well as for the parameters involving measurement error in Eqs. (3) and (4). Table 1 summarizes the parameters estimated from the TOHP data as well as the varying values used in the simulations. The true treatment effect, β_2 in Eq. (1), is −0.25 on the log scale so that at follow-up, participants in the treatment condition have sodium intake (1 − exp(−0.25)) × 100 ≈ 22% less than those in the control group.
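The conversion of the log-scale treatment effect to a percent reduction can be verified directly:

```python
import math

beta2 = -0.25  # treatment effect on the log scale (Table 1)
pct_reduction = (1 - math.exp(beta2)) * 100
# (1 - exp(-0.25)) * 100 = 22.1, i.e. roughly a 22% lower sodium intake
```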
We define the naive treatment effect Δ_naive as the difference in change from baseline between treatment and control groups using the error-prone self-reported values y. Using Eq. (3), the naive treatment effect is given by

Δ_naive = γ_1 + (γ_2 + γ_3) β_2 + γ_4 (β_0 + β_1 + β_2)    (5)

See Bias in the Appendix for details. An estimate of the treatment effect is unbiased when Δ_naive − β_2 = 0.
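As a numerical companion to the naive effect, the difference-in-differences of the expected self-reported values can be computed from the cell means implied by the measurement model. This sketch assumes the linear form of Eq. (3) as described above; it is an illustration, not the authors' code.

```python
def naive_effect(beta, gamma):
    """Expected difference-in-differences of the error-prone outcome y,
    assuming the linear measurement model of Eq. (3).
    beta = (b0, b1, b2); gamma = (g0, g1, g2, g3, g4)."""
    b0, b1, b2 = beta
    g0, g1, g2, g3, g4 = gamma
    # expected change in y, control group: slope g2 at baseline, g2+g3 at follow-up
    ctrl = g2 * b1 + g3 * (b0 + b1)
    # expected change in y, treatment group: slope g2+g3+g4 and intercept shift g1
    trt = g1 + (g2 + g3 + g4) * (b0 + b1 + b2) - g2 * b0
    return trt - ctrl

def bias(beta, gamma):
    """Bias of the naive effect relative to the true effect beta[2]."""
    return naive_effect(beta, gamma) - beta[2]
```

Under classical measurement error (g2 = 1, all other gammas 0) the naive effect equals β_2 exactly, while a nonzero g3 scales it to (1 + g3) β_2.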

Variability simulation
We estimated the naive treatment effect in Eq. (5) under a variety of simulation scenarios by varying the parameters in Table 1. We examined how simultaneously increasing measurement error variability at follow-up (λ_2) and increasing measurement error variability in the treatment condition at follow-up (λ_3) increases the required sample size to achieve 80% power. We assume that the trial was powered assuming non-differential measurement error based on existing self-reported data. Thus, the parameter for increased variability across all participants regardless of time point or treatment condition (λ_1) was fixed across all scenarios and equal to 1.86. Calculation of power was based on a two-sample z-test, as defined in Power and sample size of the Appendix.
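The power and sample-size machinery referenced here (detailed in the Appendix) reduces to a standard two-sample z-test on change scores. The variance values used below are illustrative stand-ins, not the TOHP-calibrated quantities.

```python
import math

def _phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_z(delta, var_trt, var_ctrl, n_per_group, z_crit=1.959964):
    """Two-sided two-sample z-test power for a difference in mean change
    scores; var_trt / var_ctrl are per-subject change-score variances."""
    se = math.sqrt((var_trt + var_ctrl) / n_per_group)
    z = abs(delta) / se
    return _phi(z - z_crit) + _phi(-z - z_crit)

def n_for_power(delta, var_trt, var_ctrl, z_crit=1.959964, z_power=0.841621):
    """Per-group sample size for 80% power (z_power = Phi^-1(0.80))."""
    return (z_crit + z_power) ** 2 * (var_trt + var_ctrl) / delta ** 2
```

Because the required n is proportional to the total change-score variance, the percent increase in sample size under inflated variance does not depend on the effect size itself.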

Treatment effect simulation
Next, we examined how simultaneously varying the change in slope for the control group at follow-up (γ_3) and the change in slope for the treatment group at follow-up (γ_4) increases the bias of the treatment effect, expressed as the percent increase in bias. We fix the parameters γ_1 and γ_2 to the TOHP values displayed in Table 1. Thus only the γ parameters that affect measurement error differentially (i.e. γ_3 and γ_4) influence the percent increase in bias of the treatment effect. In our setting, an increase in slope results in greater self-reported values at follow-up as compared to baseline (or the control group) for a fixed value of true intake.

Coverage simulation
Finally, we varied both the differential measurement error parameters in Table 2 affecting bias (γ_3, γ_4) and the differential measurement error parameters that affect variance (λ_2, λ_3) to generate different combinations of high/low bias and high/low variance. The parameters that do not affect differential measurement error (γ_1, γ_2, λ_1) were fixed at their TOHP values. We calculated the naive treatment effect and its 95% confidence interval (Appendix Coverage). To compare these scenarios to each other and to the scenario based on true intake, we display our results using a forest plot. The coverage probability of a confidence interval is the proportion of the time that the interval contains the true quantity of interest β_2. Coverage can be affected by both bias and variability and, as a result, provides a good summary of how different parameters affecting measurement error can impact estimates of the treatment effect. Let Δ_lower and Δ_upper be the lower and upper endpoints of a 95% confidence interval of an estimate of the naive treatment effect. Ideally, an estimator exhibits nominal coverage, such that the coverage of its 95% confidence interval is also 95%. We calculate the coverage of the naive treatment effect as the probability that the true treatment effect lies within the 95% confidence interval of the naive treatment effect. Details are in Coverage of the Appendix.


Results

Figure 1 is a contour plot of the percent increase in sample size needed to achieve 80% power to detect a treatment effect. The x-axis displays values for the measurement error parameter for the additional variability at follow-up (λ_2); the y-axis displays values for the measurement error parameter for increasing variability for the treatment condition at follow-up (λ_3). The point plotted at (1,1) corresponds to a 0% change in sample size, since there is no change in variability at follow-up or by treatment condition at follow-up. As these parameters increase, so does the sample size needed to achieve 80% power.

Using estimates from the TOHP data, under a scenario of no increased variability at follow-up in both the treatment and control conditions (λ_2 = 1, λ_3 = 1), the sample size needed to achieve 80% power is n = 117 per group (indicated by the black dot in Fig. 1). Differential measurement error with respect to time (λ_2 > 1, λ_3 = 1) has a greater impact on the required sample size than does differential measurement error with respect to treatment (λ_2 = 1, λ_3 > 1). For example, under a scenario where there is additional variability at follow-up (λ_2 = 2) but no additional variability for treatment condition (λ_3 = 1), the sample size must increase by 65.0% in order to achieve 80% power, which corresponds to a sample size of n = 193 per group. For scenarios where there is no additional variability at follow-up in the control group (λ_2 = 1) but additional variability only in the treatment condition at follow-up (λ_3 = 2), the sample size must increase by only 32.5% (n = 155 per group). For situations where there is both increased variability at follow-up and for the treatment condition at follow-up (λ_2 = 2, λ_3 = 2), the sample size must increase by 130.8%, which corresponds to a sample size of n = 270 per group. Under scenarios of decreased variability, the required sample size decreases.
For example, when λ_2 = 0.5 and λ_3 = 0.5, the required sample size decreases by 40% (n = 70 per group).

Figure 2 is a contour plot of the percent increase in bias of the treatment effect for varying values of γ_3, the measurement error parameter for the change in slope for the control group (x-axis), and γ_4, the additional change in slope for the treatment group (y-axis). As measurement error increases in the intervention groups, so does the bias of the treatment effect.
Unlike the parameters governing variance, here differential measurement error with respect to treatment (γ_4) does have a substantial effect. For example, when there is no additional change in slope for the treatment group at follow-up (γ_4 = 0) and a small increase in slope for the control group (γ_3 increases from 1.02 to 1.10), the bias increases by only 8%, so that the naive treatment effect reflects a 23.7% reduction in sodium intake in the treatment group versus the control group (as compared to the true treatment effect of a 22% reduction). When there is no additional change in slope for the control group at follow-up (γ_3 = 0) and a small increase in slope for the treatment group at follow-up (γ_4 increases from −0.032 to −0.05), the bias increases by 56.5% (a naive treatment effect of 32.4%). For an increase in slope for both the control group (γ_3 = 1.05) and the treatment group (γ_4 = −0.05), the bias increases by 161.5% (treatment participants have 48% less sodium at follow-up as compared to control participants).

Figure 3 is a forest plot of the estimated naive treatment effect and its 95% confidence interval under the measurement error scenarios in Table 2. The (+) and (−) refer to whether the γ_3 and γ_4 parameters governing measurement error in Table 2 are greater than or less than 0, respectively. Under classical measurement error, the estimate of the treatment effect is unbiased, but has increased variability. Bias in the treatment effect as well as increased variability occurs under systematic measurement error and differential measurement error with respect to time, treatment, or both time and treatment.
Increases in differential measurement error with respect to treatment have a stronger impact on bias than do increases in differential measurement error with respect to time. Under some scenarios (systematic ME, DME w.r.t. time (−), DME w.r.t. time (−) and tx (−), DME w.r.t. time (+) and tx (+)), the bias and increased variability can be so great that the 95% confidence interval contains 0, such that the naive treatment effect is no longer significant. Under other scenarios, the bias is in the opposite direction, so that the naive treatment effect is greater than the true effect.

Figure 4 displays density plots of the distribution of the treatment effect, comparing the true treatment effect (in black) and the naive treatment effect under different scenarios of measurement error (in red). This provides a graphical illustration of coverage of the confidence interval of the treatment effect under the same measurement error scenarios as in Fig. 3. Coverage of the true treatment effect is 95%. Under classical measurement error, the coverage is 100%. Coverage ranges from 0.6% to 89.8% depending on the differential measurement error scenario.

Discussion
We found that when using self-reported dietary measures as outcomes in a lifestyle intervention study, differential measurement error with respect to treatment condition and time can result in a biased treatment effect and can impact the sample size needed to achieve 80% power to detect a treatment effect. Increased variability in the outcomes measured with error (y_ij) increases the sample size needed to achieve 80% power. The impact on sample size differs depending on the type of differential error: increased variability at follow-up (λ_2) increases the required sample size at a faster rate than increased variability for treatment condition alone at follow-up (λ_3). This is because increasing λ_2 affects all observations, while increasing λ_3 affects only those in the treatment group. Similarly, decreasing λ_2 and/or λ_3 decreases the sample size required to reach 80% power, with λ_2 decreasing the required sample size at a faster rate than λ_3. Naturally, when both factors increase (or decrease) variability, the percent increase (or decrease) in the required sample size is largest. By ignoring the possibility of increased variability at follow-up, trials may be under-powered. Bias of the treatment effect is also affected by differential measurement error, but here differential measurement error with respect to treatment has a greater impact than does differential measurement error with respect to time. There is little additional bias when there is an additional change in slope for the control group (γ_3). When we set the other parameters in the measurement error model to 0 or 1 (γ_1 = γ_4 = 0 and γ_2 = 1), the naive treatment effect equals (1 + γ_3)β_2, so the bias equals γ_3 β_2, and small values of γ_3 have little effect on bias. However, even a small increase in slope for the treatment group (γ_4) can have a substantial impact on bias (see Eq. 5).
In our simulations for power and bias, we fixed γ_1, the additive difference in measurement error between treatment conditions at follow-up. Since γ_1 does not appear in the variance calculations (see Eqs. 26 and 27) and only appears in the estimation of the naive treatment effect (Δ_naive), it affects the sample size itself, but not the percent increase. For smaller values of γ_1 the required sample size is smaller, and for larger values of γ_1 the required sample size is larger. When varying γ_1 and keeping the other γ parameters constant at the TOHP values, the percent increase in sample size is still approximately 32.5%, 65%, and 130% when (λ_2 = 1, λ_3 = 2), (λ_2 = 2, λ_3 = 1), and (λ_2 = 2, λ_3 = 2), respectively. In terms of bias, γ_1 is a constant term, shifting the naive treatment effect by the same amount across all values of γ_3 and γ_4 as it increases or decreases. Thus the shape of the contour plot in Fig. 2 stays the same; only its values change.
Similarly, ρ, the correlation between baseline and follow-up, was fixed throughout the simulations. The correlation ρ does not affect bias. As ρ increases, the sample size required to achieve 80% power decreases and as ρ decreases, the sample size must increase.
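The direction of the ρ effect follows from the change-score variance: for the error-free outcome, the variance of the mean change is proportional to 1 − ρ (see Coverage in the Appendix), and so is the required per-group sample size. A one-line helper makes this concrete; the correlation values shown are hypothetical.

```python
def n_scale(rho_new, rho_ref=0.5):
    """Factor by which the required per-group sample size changes when the
    baseline/follow-up correlation moves from rho_ref to rho_new,
    holding all variances fixed (n is proportional to 1 - rho)."""
    return (1 - rho_new) / (1 - rho_ref)
```

For example, moving from ρ = 0.5 to ρ = 0.75 halves the required sample size, while dropping to ρ = 0.25 inflates it by 50%.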
There is a direct relationship between the values of λ_2 and λ_3 and the percent increase in sample size required to achieve 80% power. Figure 1 reports the percent increase in sample size relative to a referent scenario (λ_2 = 1, λ_3 = 1). Given the calculation of the sample size in Eq. 36, the ratio of sample sizes is equal to the ratio of the variances (Eq. 37). Solving Eq. 37 for a set of λ parameters, one can directly calculate the percent increase in sample size needed under scenarios of increased variability. As Eq. (39) makes clear, values of λ_2 have both additive and multiplicative effects on the increase in sample size. This is why the percent increase in sample size in Fig. 1 doubles as the λ parameters change from (λ_2 = 1, λ_3 = 2; 32.5% increase) to (λ_2 = 2, λ_3 = 1; 65% increase) to (λ_2 = 2, λ_3 = 2; 130% increase).
The required sample size to achieve 80% power also depends on the γ parameters (Eq. 36). However, the ratio of the percent increase in sample size relative to the referent scenario for two different differential measurement error scenarios is invariant to the values of the γ parameters (Eq. 39).
Depending on the size and sign of the measurement error parameters, the bias can be in the positive direction, towards zero, or in the negative direction. The combination of positive bias and increased variability can make the confidence interval of the treatment effect overlap with zero, resulting in a non-significant observed treatment effect. This can be seen in Fig. 3 under systematic ME, as well as in several differential measurement error scenarios where the 95% confidence interval crosses the dotted vertical line at zero. Bias in the negative direction (seen in estimates of the treatment effect to the left of the solid vertical line in Fig. 3) can make the treatment effect appear much larger than the true effect, which could lead investigators to think the intervention was much more successful than it actually was. An extreme case of positive bias (not shown in Fig. 3) would not only shift the estimate in the positive direction but would yield a significant effect greater than zero. The treatment effect would be statistically significant but in the opposite direction, yielding the wrong conclusion.
Coverage of the treatment effect is also affected by differential measurement error. Coverage tends to be higher when there is less bias in the treatment effect measured with error but increased variability, such as DME w.r.t. tx (+) in Fig. 3 (Panel F in Fig. 4). Although the true treatment effect is contained within the confidence interval of the naive effect when coverage is high, the naive estimate is highly variable. Coverage is low when there is a large bias, even if variability is increased. When coverage is low, the naive treatment effect differs greatly from the true treatment effect due to bias. Coverage also decreases when λ_2 and λ_3 are less than 1, due to decreased variability. This can be seen by comparing panels G and H, and panels K and L, in Fig. 4.
In lifestyle intervention studies, it has been shown that measurement error in self-reported dietary measures can differ both with respect to treatment condition and over time. Natarajan et al. [7] investigated measurement error of dietary self-report in an intervention trial in which self-reported and plasma carotenoid biomarker data were available on all participants at each time point. Using a model which took into account measurement error in both self-report and biomarker-based measures, they found that self-reported accuracy improved in participants randomized to the intervention condition. They also found increases in variability among follow-up measurements in the intervention condition.
Espeland et al. [4] fit a longitudinal model to self-reported and urinary sodium data from a lifestyle intervention trial of 900 individuals with hypertension who were randomized to one of four conditions. They found that self-reported sodium intake was less than urinary sodium at all visits and within each study group. Interestingly, the ratio of self-reported to urinary sodium intake was smallest at follow-up compared to baseline in the most intensive intervention condition. The authors hypothesized that this was due to compliance bias and noted that "subjective pressures to please staff and meet intervention goals led to under-reporting intakes." Also, unlike the analyses by Natarajan et al. [7], measurement errors were less variable during follow-up than at baseline for all cohorts. This was attributed to better recall of foods containing sodium based on knowledge gained from the interventions.
Together, these results suggest that the presence of differential measurement error is likely to be intervention specific and may depend on the population being studied. For example, in an intervention study of youth with type 1 diabetes, Sanjeevi et al. [12] found no evidence of differential measurement error.
While our findings demonstrate the impact of differential measurement error, there are some limitations to this work. We only used two time points in developing our models. As the number of time points increases, so does the number of differential measurement error scenarios.
We assumed continuous, normally distributed outcomes. We used a linear measurement model, as this is a common approach for modeling measurement error and it has empirically been shown to provide a good fit when both true intake and its error-prone counterpart are available [6,7,13], especially after values have been log-transformed. In practice, investigators are often interested in outcomes such as the number of fruits and vegetables consumed [14], which are not normally distributed, and the impact of differential measurement error on such outcomes could be different. Future work will look at the impact of differential measurement error in non-normal outcomes. We based our simulations on self-reported sodium intake using parameters from the TOHP study. Examining measurement error for other components of diet, such as total intake, would require different parameter values, although one would expect results similar to those presented here. Finally, we focused on the setting of measurement error in dietary interventions. However, differential measurement error with respect to treatment and/or time can also exist in observational studies, and an area of future work is to better understand the role of measurement error when estimating treatment effects using observational data.

Conclusions
When designing a longitudinal lifestyle intervention study, researchers using self-reported dietary measures need to consider the impact of measurement error and differential measurement error. Recruiting a larger sample size can help overcome the loss of power associated with the additional variability due to measurement error. However, this approach does nothing to correct for bias. A more expansive approach that would allow the researcher to diagnose and correct for both bias and variance due to measurement error is to include an internal validation study with recovery biomarkers and implement methods that allow for measurement correction using internal validation studies [9,15]. When an internal validation study is not possible, methods that use external validation studies [16,17] are possible although they require the user to make additional assumptions regarding transportability of the measurement error model [18]. Still, we feel that these additional efforts to correct for measurement error are worthwhile, as they require only marginally more effort than conducting the intervention itself, and allow researchers to make inferences with greater accuracy and precision.

Bias
Let Z_0 refer to true intake at baseline, and Z_1 refer to true intake at follow-up. Based on Eq. (1) in the main text, the mean value of true intake for each treatment condition and time point is given by

E[Z_0 | d_i] = β_0    (6)
E[Z_1 | d_i = 0] = β_0 + β_1    (7)
E[Z_1 | d_i = 1] = β_0 + β_1 + β_2    (8)

Using Eqs. (6) through (8), the expected change in true intake in the control condition is

E[Z_1 − Z_0 | d_i = 0] = β_1    (9)

and the expected change in true intake in the treatment condition is

E[Z_1 − Z_0 | d_i = 1] = β_1 + β_2    (10)

so that the expected true treatment effect is

Δ = (β_1 + β_2) − β_1 = β_2    (11)

Let Y_t refer to self-reported intake at time t, t = 0, 1. Using Eq. (3) in the main text, the mean self-reported values by time and treatment group are:

E[Y_0 | d_i = 0] = γ_0 + γ_2 β_0    (12)
E[Y_0 | d_i = 1] = γ_0 + γ_2 β_0    (13)
E[Y_1 | d_i = 0] = γ_0 + (γ_2 + γ_3)(β_0 + β_1)    (14)
E[Y_1 | d_i = 1] = γ_0 + γ_1 + (γ_2 + γ_3 + γ_4)(β_0 + β_1 + β_2)    (15)

Using Eqs. (12) through (15), the expected change in self-reported intake in the control condition is

E[Y_1 − Y_0 | d_i = 0] = γ_2 β_1 + γ_3 (β_0 + β_1)    (16)

and the expected change in self-reported intake in the treatment condition is

E[Y_1 − Y_0 | d_i = 1] = γ_1 + (γ_2 + γ_3 + γ_4)(β_0 + β_1 + β_2) − γ_2 β_0    (17)

The expected treatment effect from the self-reported intake (i.e. the naive treatment effect) is given by

Δ_naive = γ_1 + (γ_2 + γ_3) β_2 + γ_4 (β_0 + β_1 + β_2)    (18)

The mean bias of the naive treatment effect is obtained by subtracting Eq. (11) from Eq. (18) to obtain

Δ_naive − β_2 = γ_1 + (γ_2 + γ_3 − 1) β_2 + γ_4 (β_0 + β_1 + β_2)    (19)
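The difference-in-differences algebra in this appendix can be spot-checked numerically. The cell means below follow the slope and intercept descriptions of the measurement model in Methods (γ_2 + γ_3 as the control-group slope at follow-up, γ_2 + γ_3 + γ_4 as the treatment-group slope at follow-up); the parameter draws are arbitrary.

```python
import random

rng = random.Random(0)
for _ in range(200):
    b0, b1, b2 = (rng.uniform(-2, 2) for _ in range(3))
    g0, g1, g2, g3, g4 = (rng.uniform(-1, 1) for _ in range(5))
    # cell means of self-report implied by the measurement model
    y_ctrl_base = g0 + g2 * b0
    y_trt_base = g0 + g2 * b0
    y_ctrl_fup = g0 + (g2 + g3) * (b0 + b1)
    y_trt_fup = g0 + g1 + (g2 + g3 + g4) * (b0 + b1 + b2)
    dd = (y_trt_fup - y_trt_base) - (y_ctrl_fup - y_ctrl_base)
    # closed-form naive treatment effect
    closed = g1 + (g2 + g3) * b2 + g4 * (b0 + b1 + b2)
    assert abs(dd - closed) < 1e-9
```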

Variance
The variance of self-reported intake for each treatment condition and time point is given by

Var(Y_0 | d_i) = γ_2² σ_z² + λ_1 σ_δ²    (20)
Var(Y_1 | d_i = 0) = (γ_2 + γ_3)² σ_z² + λ_1 λ_2 σ_δ²    (21)
Var(Y_1 | d_i = 1) = (γ_2 + γ_3 + γ_4)² σ_z² + λ_1 λ_2 λ_3 σ_δ²    (22)

The covariance between baseline and follow-up self-reported intake is

Cov(Y_0, Y_1 | d_i) = γ_2 (γ_2 + γ_3 + γ_4 d_i) ρ σ_z² + ρ_δ √(Var(δ_i0) Var(δ_i1))    (23)

so that, for the control condition (d_i = 0), the covariance is

Cov(Y_0, Y_1 | d_i = 0) = γ_2 (γ_2 + γ_3) ρ σ_z² + ρ_δ λ_1 √λ_2 σ_δ²    (24)

and for the intervention condition (d_i = 1), the covariance is

Cov(Y_0, Y_1 | d_i = 1) = γ_2 (γ_2 + γ_3 + γ_4) ρ σ_z² + ρ_δ λ_1 √(λ_2 λ_3) σ_δ²    (25)

The variance of the change in self-reported intake in the control condition is given by

Var(Y_1 − Y_0 | d_i = 0) = Var(Y_0 | d_i = 0) + Var(Y_1 | d_i = 0) − 2 Cov(Y_0, Y_1 | d_i = 0)    (26)

The variance of the change in self-reported intake in the treatment condition is given by

Var(Y_1 − Y_0 | d_i = 1) = Var(Y_0 | d_i = 1) + Var(Y_1 | d_i = 1) − 2 Cov(Y_0, Y_1 | d_i = 1)    (27)
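The change-score variance expressions can be checked against simulation. This sketch assumes the multiplicative λ structure and a measurement-error correlation ρ_δ as described for Eq. (4) in Methods; all numeric parameter values here are illustrative placeholders, not the TOHP estimates.

```python
import math
import random
import statistics

PARAMS = dict(sigma2_z=0.17, rho=0.5, gamma=(0.0, 0.1, 0.4, 0.05, -0.03),
              sigma2_delta=0.10, rho_delta=0.3, lam=(1.5, 2.0, 2.0))

def mc_change_variance(d, n=60_000, seed=3, **p):
    """Monte Carlo variance of the change score y1 - y0 in group d."""
    p = {**PARAMS, **p}
    rng = random.Random(seed)
    g0, g1, g2, g3, g4 = p["gamma"]
    l1, l2, l3 = p["lam"]
    changes = []
    for _ in range(n):
        e0 = rng.gauss(0, 1)
        e1 = p["rho"] * e0 + math.sqrt(1 - p["rho"] ** 2) * rng.gauss(0, 1)
        z0 = math.sqrt(p["sigma2_z"]) * e0    # means drop out of a variance
        z1 = math.sqrt(p["sigma2_z"]) * e1
        v0 = l1 * p["sigma2_delta"]
        v1 = l1 * l2 * (l3 if d else 1.0) * p["sigma2_delta"]
        u0 = rng.gauss(0, 1)
        u1 = p["rho_delta"] * u0 + math.sqrt(1 - p["rho_delta"] ** 2) * rng.gauss(0, 1)
        y0 = g2 * z0 + math.sqrt(v0) * u0
        y1 = g1 * d + (g2 + g3 + g4 * d) * z1 + math.sqrt(v1) * u1
        changes.append(y1 - y0)
    return statistics.variance(changes)

def analytic_change_variance(d, **p):
    """Closed-form Var(y1 - y0 | d): sum of the two variances minus
    twice the covariance (z part plus measurement-error part)."""
    p = {**PARAMS, **p}
    g0, g1, g2, g3, g4 = p["gamma"]
    l1, l2, l3 = p["lam"]
    s1 = g2 + g3 + g4 * d                    # follow-up slope on z
    v0 = l1 * p["sigma2_delta"]
    v1 = l1 * l2 * (l3 if d else 1.0) * p["sigma2_delta"]
    var_y0 = g2 ** 2 * p["sigma2_z"] + v0
    var_y1 = s1 ** 2 * p["sigma2_z"] + v1
    cov = g2 * s1 * p["rho"] * p["sigma2_z"] + p["rho_delta"] * math.sqrt(v0 * v1)
    return var_y0 + var_y1 - 2 * cov
```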

Coverage
The coverage probability of a confidence interval is the proportion of the time that the interval contains the true quantity of interest. From Eq. (11), the true treatment effect is equal to β_2. For either treatment condition (d_i = 0, 1), let z̄_1 be the sample mean of z at time 1 and let z̄_0 be the sample mean of z at time 0. Let ρ be the correlation between z_0 and z_1 and let n_d be the sample size in either treatment condition. Regardless of treatment condition, the variance of z̄_1 − z̄_0 is

Var(z̄_1 − z̄_0) = 2 σ_z² (1 − ρ) / n_d    (28)

Let δ_1 = z̄_1 − z̄_0 for d_i = 1 and let δ_0 = z̄_1 − z̄_0 for d_i = 0. The estimate of the treatment effect β_2 is β̂_2 = δ_1 − δ_0. The variance of the estimated treatment effect is

Var(β̂_2) = 4 σ_z² (1 − ρ) / n_d    (29)

so that, assuming normality as in Eq. (1) in the main text, the sampling distribution of β̂_2 is

β̂_2 ∼ N(β_2, 4 σ_z² (1 − ρ) / n_d)    (30)

Let ȳ_1 be the sample mean of y at time 1 and let ȳ_0 be the sample mean of y at time 0. Let δ*_1 = ȳ_1 − ȳ_0 for d_i = 1 and let δ*_0 = ȳ_1 − ȳ_0 for d_i = 0. An estimate of the naive treatment effect is Δ̂_naive = δ*_1 − δ*_0. Its variance is

Var(Δ̂_naive) = [Var(Y_1 − Y_0 | d_i = 1) + Var(Y_1 − Y_0 | d_i = 0)] / n_d    (31)

where the variance terms in the numerator were defined in Eqs. (26) and (27). A 95% confidence interval for the naive treatment effect is:

Δ̂_naive ± 1.96 √Var(Δ̂_naive)    (32)

The coverage of this confidence interval is the probability that it contains the true quantity of interest:

Coverage = Pr(Δ_lower < β_2 < Δ_upper)    (33)

where Δ_lower and Δ_upper are the endpoints of the confidence interval in Eq. (32).
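As an illustration of the coverage calculation, the sketch below estimates Monte Carlo coverage under classical measurement error, using an empirical (estimated-variance) confidence interval rather than the analytic interval; the measurement-error SD is an assumed placeholder. Because the naive estimate is unbiased in this scenario and the interval width matches the estimator's own variability, coverage should sit near the nominal 95%.

```python
import math
import random

def coverage_classical_me(n_per_group=117, reps=1000, seed=7):
    """Monte Carlo coverage of an empirical 95% CI for the naive
    treatment effect under classical ME (y = z + noise), where the
    naive estimate is unbiased for beta2 = -0.25.  The measurement
    error SD s_d is an illustrative assumption."""
    rng = random.Random(seed)
    b0, b1, b2 = 8.21, -0.037, -0.25
    s_z, rho = math.sqrt(0.17), 0.5
    s_d = math.sqrt(0.10)
    hits = 0
    for _ in range(reps):
        diffs = {0: [], 1: []}
        for d in (0, 1):
            for _ in range(n_per_group):
                e0 = rng.gauss(0, 1)
                e1 = rho * e0 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
                z0 = b0 + s_z * e0
                z1 = b0 + b1 + b2 * d + s_z * e1
                y0 = z0 + rng.gauss(0, s_d)   # classical ME: unbiased
                y1 = z1 + rng.gauss(0, s_d)
                diffs[d].append(y1 - y0)
        def mean(xs):
            return sum(xs) / len(xs)
        def var(xs):
            m = mean(xs)
            return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
        est = mean(diffs[1]) - mean(diffs[0])
        se = math.sqrt(var(diffs[1]) / n_per_group + var(diffs[0]) / n_per_group)
        if est - 1.96 * se < b2 < est + 1.96 * se:
            hits += 1
    return hits / reps
```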

Power and sample size
We power the study based on a two-sample z-test for the difference in mean change scores between treatment and control groups. The difference in means under the alternative hypothesis is the naive treatment effect Δ_naive, given by Eq. (18). The non-centrality parameter is

ν = Δ_naive / √Var(Δ̂_naive)    (34)

where the denominator is the square root of the variance in Eq. (31). Under a two-sample z-test, the critical value under the two-sided null hypothesis at a Type 1 error rate of .05 is −1.96 (the treatment effect is negative), so that power is calculated by

Power = Φ(−1.96 − ν)    (35)

where Φ represents the standard normal distribution function.
Solving Eq. (35) for n_d, we can obtain an equation for the sample size of each treatment group. Assuming power of 80% and a Type 1 error rate of .05, the sample size of each group is given by

n_d = (1.96 + 0.84)² [Var(Y_1 − Y_0 | d_i = 1) + Var(Y_1 − Y_0 | d_i = 0)] / Δ_naive²    (36)

The total sample size required is thus 2 n_d. Figure 1 reports the percent increase in sample size relative to a referent scenario. Let n be the sample size under the referent scenario and let n* be the sample size under an alternative scenario with values λ_2*, λ_3*. The proportion increase in sample size reported in Fig. 1 is n*/n − 1, or (n* − n)/n. Using Eq. (36), the ratio of sample sizes is equal to the ratio of variances:

n*/n = [Var(Y_1 − Y_0 | d_i = 1) + Var(Y_1 − Y_0 | d_i = 0)]* / [Var(Y_1 − Y_0 | d_i = 1) + Var(Y_1 − Y_0 | d_i = 0)]    (37)
Writing the total variance as a function of the λ parameters gives

c + f(λ_2, λ_3), with f(λ_2, λ_3) = λ_1 σ_δ² [2 + λ_2 (1 + λ_3) − 2 ρ_δ √λ_2 (1 + √λ_3)]    (38)

where c is a constant term that does not depend on any values of λ. The ratio of percent increases under two different scenarios, for example, differential measurement error with respect to treatment (λ_2', λ_3') and differential measurement error with respect to time (λ_2'', λ_3''), is:

(n' − n) / (n'' − n) = [f(λ_2', λ_3') − f(λ_2, λ_3)] / [f(λ_2'', λ_3'') − f(λ_2, λ_3)]    (39)

so that the constant c, and with it the γ parameters, cancels from the ratio.
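The invariance of the percent increase to the effect size (which, like the γ parameters, cancels from the ratio of sample sizes) can be demonstrated numerically. The variance decomposition below follows the λ structure described in Methods; the γ, σ, and ρ values are illustrative placeholders, not the TOHP estimates.

```python
import math

def change_var(d, lam2, lam3, sigma2_z=0.17, rho=0.5, g2=0.4, g3=0.05,
               g4=-0.03, lam1=1.86, sigma2_delta=0.10, rho_delta=0.3):
    """Per-subject variance of the change score y1 - y0 in group d."""
    s1 = g2 + g3 + g4 * d                       # follow-up slope on z
    v0 = lam1 * sigma2_delta                    # baseline ME variance
    v1 = lam1 * lam2 * (lam3 if d else 1.0) * sigma2_delta
    cov = g2 * s1 * rho * sigma2_z + rho_delta * math.sqrt(v0 * v1)
    return g2 ** 2 * sigma2_z + v0 + s1 ** 2 * sigma2_z + v1 - 2 * cov

def n_required(delta, lam2, lam3):
    """Per-group n for 80% power at alpha = .05 (z-test)."""
    total = change_var(0, lam2, lam3) + change_var(1, lam2, lam3)
    return (1.959964 + 0.841621) ** 2 * total / delta ** 2

def pct_increase(delta, lam2, lam3):
    """Percent increase in n relative to the referent (lam2 = lam3 = 1)."""
    return 100 * (n_required(delta, lam2, lam3) / n_required(delta, 1, 1) - 1)
```

The effect size cancels in the ratio, so `pct_increase` is identical for any `delta`; and because λ_2 enters the variance of both groups while λ_3 enters only the treatment group, inflating λ_2 raises the required sample size faster than inflating λ_3.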