 Methodology
 Open Access
 Published:
Studying dietary intake in daily life through multilevel twopart modelling: a novel analytical approach and its practical application
International Journal of Behavioral Nutrition and Physical Activity volume 18, Article number: 130 (2021)
Abstract
Background
Understanding which factors influence dietary intake, particularly in daily life, is crucial given the impact diet has on physical as well as mental health. However, a factor might influence whether but not how much an individual eats and vice versa or a factor’s importance may differ across these two facets. Distinguishing between these two facets, hence, studying dietary intake as a dual process is conceptually promising and not only allows further insights, but also solves a statistical issue. When assessing the association between a predictor (e.g. momentary affect) and subsequent dietary intake in daily life through ecological momentary assessment (EMA), the outcome variable (e.g. energy intake within a predefined timeinterval) is semicontinuous. That is, one part is equal to zero (i.e. no dietary intake occurred) and the other contains rightskewed positive values (i.e. dietary intake occurred, but often only small amounts are consumed). However, linear multilevel modelling which is commonly used for EMA data to account for repeated measures within individuals cannot be applied to semicontinuous outcomes. A highly informative statistical approach for semicontinuous outcomes is multilevel twopart modelling which treats the outcome as generated by a dual process, combining a multilevel logistic/probit regression for zeros and a multilevel (generalized) linear regression for nonzero values.
Methods
A multilevel twopart model combining a multilevel logistic regression to predict whether an individual eats and a multilevel gamma regression to predict how much is eaten, if an individual eats, is proposed. Its general implementation in R, a widely used and freely available statistical software, using the Rpackage brms is described. To illustrate its practical application, the analytical approach is applied exemplary to data from the Eat2beNICEAPPetitestudy.
Results
Results highlight that the proposed multilevel twopart model reveals processspecific associations which cannot be detected through traditional multilevel modelling.
Conclusions
This paper is the first to introduce multilevel twopart modelling as a novel analytical approach to study dietary intake in daily life. Studying dietary intake through multilevel twopart modelling is conceptually as well as methodologically promising. Findings can be translated to tailored nutritional interventions targeting either the occurrence or the amount of dietary intake.
Background
Which factors influence whether an individual eats? Which factors influence how much an individual eats? These two questions might be answered differently. For instance, a study found that inhibitory control predicted how much individuals snacked, whereas it did not predict whether individuals snacked [1]. These findings emphasize the dual character of dietary intake. Understanding which factors drive an individual to eat as well as which factors influence how much an individual eats, particularly in daily life, is crucial given the impact diet has on physical as well as mental health.
Diet is a repeatedoccurrence health behaviour which is performed multiple times per day [2]. Studying eating behaviour through ecological momentary assessment (EMA) several times a day in natural environments when and “where the action takes place” [3] is a promising and increasingly popular approach [4, 5]. Dietary intake is influenced by a variety of dynamic factors and their interactions [6] which cannot be replicated reliably in a laboratory setting, highlighting the need for EMA.
Studying dietary intake as a dual process in daily life is conceptually promising and not only allows novel insights, but also solves a statistical issue.
Distributional characteristic of dietary data in EMA studies
EMA studies allow investigating whether individual and/or situational factors (e.g. momentary affect) assessed multiple times per day predict dietary intake (e.g. energy/sugar/fat intake) within a subsequent predefined timeinterval (e.g. within the next 2 h). However, dietary intake typically does not occur within each predefined timeinterval (e.g. no intake in 46% of 2htimeintervals [7]) or only a small amount is consumed (e.g. a snack). This results in an outcome that is zeroinflated (i.e. contains a large proportion of zeros) and rightskewed (i.e. contains a large proportion of small positive values concentrated on the left of the distribution) (see Fig. 1). This type of data is often referred to as semicontinuous.
Traditional statistical approach for EMA data
A common statistical approach to analyse EMA data is linear multilevel modelling (also known as linear mixed or linear hierarchical modelling). It accounts for dependency among longitudinal data due to repeated measures within the same participant and allows studying effects on the level of moments (withinperson fluctuations) and individuals (betweenperson differences). However, traditional linear multilevel modelling cannot be applied to semicontinuous outcomes, as the assumption of normally distributed residuals is likely violated.^{Footnote 1}
Using traditional linear multilevel models without accounting for the large proportion of zeros can lead to incorrect inferences and conclusions and overlooks the dual character of semicontinuous data. For instance, Baldwin et al. [8] showed that a traditional linear multilevel model falsely indicated that there was no change in the semicontinuous outcome daily physical activity (PA) over the course of the study, not detecting that with time participants became less likely to engage in PA. An alternative approach to study semicontinuous data using traditional models is to exclude timeintervals with zeros, i.e. include only timeintervals in which dietary intake occurred to study how much but not whether dietary intake occurred. However, this causes loss of important information [9] and can cause bias in the parameter estimates [10, 11] (as outlined in the discussion). Furthermore, a common approach is to study whether but not how much dietary intake occurred through multilevel logistic regressions (e.g. [7, 12,13,14]). However, if the amounts of the consumed foods/drinks are also captured, available data with important implications are disregarded.
Statistical approach for semicontinuous outcomes
A generally less known, but highly informative statistical approach for semicontinuous outcomes is twopart modelling. It treats zeros and nonzeros of the outcome separately as generated by a dual process. The zero part (occurrence indicator – e.g. has an individual eaten in a given timeinterval?) and continuous/positive part (intensity indicator – e.g. if an individual ate in a given timeinterval, how much was eaten?) of the outcome follow different distributions. Twopart models combine these two distributions: a logistic or probit regression for zeros (e.g. to predict whether an individual eats) and a linear or generalized linear regression for positive values (e.g. to predict how much is eaten, if an individual eats).
Twopart modelling assesses these two parts (e.g. the probability of eating and the amount that is eaten) while accounting for the potential dependency between them. The importance of taking this potential dependency into account was highlighted by Olsen and Schafer who were the first to extend these models to longitudinal data [15].
Hence, multilevel twopart modelling not only allows studying dietary intake as a dual process, but also overcomes the challenges of semicontinuous data. It does not overlook relevant information and provides additional and novel insights. It differentiates between factors either influencing the occurrence or the amount of dietary intake or both. If both, it can be assessed whether a factor’s importance differs across the two parts.
Even though the use of twopart models is less common in most research fields, it has become popular for example in the following fields: Medical costs [16, 17], substance use disorder [18,19,20,21,22] and PA [8, 23,24,25]. Twopart models have also been applied to nutritional data in order to estimate usual intake of episodically consumed foods [26]. However, to the best of our knowledge, multilevel twopart modelling has not yet been applied to studying dietary intake in daily life. Furthermore, most publications on multilevel twopart modelling used statistical software which is less common (e.g. WinBUGS [22]) or not free to use (e.g. SAS Proc NLMIXED [16], “gsem” command in Stata [8, 25]).
Objective
This paper is the first to introduce multilevel twopart modelling as a novel analytical approach to study dietary intake in daily life. We believe that the importance of multilevel twopart models in behavioural nutrition as well as other behavioural research fields (e.g. PA) is growing. Practical guidance is needed to facilitate the implementation of these rather complex models, particularly in commonly used and freely available software. For this reason, this paper proposes a multilevel twopart model combining a multilevel logistic and a multilevel gamma regression to study dietary intake in daily life using R [27], one of the most commonly used data software programs which is freely available and therefore accessible to everyone. In the present work, we use the Rpackage brms [28, 29] which is based on Bayesian inference. We chose this package because it allows great flexibility in this specific model. Furthermore, its syntax is very similar to the syntax of other and likely more widely used multilevel packages in R (nlme [30]; lme4 [31]). This has the benefit that readers familiar with multilevel modelling in R can more easily build upon prior experience. We assume that readers have basic knowledge of multilevel modelling (e.g. multilevel structure of the data, random effects). Readers not familiar with these basic concepts are referred to introductory literature on multilevel modelling (e.g. [32, 33]). To ensure readers who are new to Bayesian statistics are able to follow, the basic concept of Bayesian inference is briefly introduced in Additional file 1a.
The aim of this paper is to introduce multilevel twopart modelling as a novel analytical approach to study dietary intake in daily life and provide easytofollow guidance on its practical application. To do so, the methods section covers (1) general model specifications of the proposed model, (2) a brief overview of brms and the general implementation of the proposed model in brms and (3) the description of the data used in this paper. The results section outlines the results of the exemplary analyses in detail, in order to provide practical guidance on the model specification and interpretation. Data and R code are provided in Additional files 2 and 3.
Methods
Multilevel twopart model for semicontinuous dietary data
In order to study dietary intake in daily life, we propose a multilevel twopart model which combines a multilevel logistic regression for zeros to predict whether an individual eats and a multilevel gamma regression for rightskewed positive values to predict how much is eaten, if an individual eats. Here, repeated assessments (Level 1) of the semicontinuous variable dietary intake are nested within individuals (Level 2). We chose the multilevel gamma regression for positive values as it does not require data transformation (e.g. logarithmizing) and beyond that performed well for rightskewed continuous PA data in Baldwin et al. [8]. A gamma distribution is a continuous probability distribution which is commonly used to model continuous variables which can only be positive and follow a skewed distribution.
In the following we briefly introduce the model specifications. A more comprehensive introduction to the model specifications can be found in Additional file 4.
The variable y_{ij} represents the semicontinuous dietary intake response from subject j (j = 1, …, m) at time point i (i = 1, …, n_{i}). We are interested in two parts of this variable: (1) Did the participant eat? In other words, is y_{ij} = 0 or y_{ij} > 0 (illustrated in Fig. 1b)? (2) If the participant ate, how much was eaten? In other words, what is the expected value of y_{ij}, if y_{ij} > 0 (illustrated in Fig. 1c)?
A multilevel logistic regression is used for part (1) of the semicontinuous variable. It predicts the logodds of no eating for person j at time point i (\(\mathrm{log}\left(\frac{\mathrm{\pi_{ij}}}{1\mathrm{\pi_{ij}}}\right)\)).^{Footnote 2} Figure 2 shows that the logodds of no eating can be predicted as a function of Level1 and Level2 covariates.
A multilevel gamma regression is used for part (2) of the semicontinuous variable. It predicts the expected log amount of dietary intake of person j at time point i (log(μ_{ij})) when eating occurred. μ_{ij} is modelled on the log scale due to the fact that the gamma distribution only supports positive values. However, it is important to highlight that this does not change the data as would logtransforming the data before running the model. The metric of the variable remains unchanged and the slope coefficients can be interpreted through exponentiation (demonstrated in the results). Figure 2 shows that a function of Level1 and Level2 covariates can be used to predict the (log) amount of dietary intake.
L1predictor_{ij} in Fig. 2 represents a Level1 covariate assessed at time point i in person j, e.g. participant j’s momentary affect at measurement occasion i. L2predictor_{j} is a Level2 covariate of person j, e.g. participant j’s BMI. β_{00} and β_{10} are the overall intercepts. The coefficients β_{01} and β_{11} represents the expected change for a oneunit increase in L1predictor. The expected change for a oneunit increase in L2predictor is expressed by β_{02} and β_{12}. u_{0j} and u_{1j} represent the random intercepts of person j, i.e. personspecific deviation from the overall intercept. u_{01j} and u_{11j} denote the random effects of L1predictor in person j, i.e. personspecific differences in the effect of L1predictor. The error term \({\varepsilon }_{ij}\) in the continuous part of the model denotes the Level1 residual, i.e. difference between the predicted value and the observed value of person j at time point i. The first subscript 0 or 1 of the parameters indicates that the equation refers to the zero or the continuous part of the model, respectively. Part specific interpretations of the parameters can be found in Table 1.
The two processes modelled through the multilevel logistic and gamma regression are likely not independent. Therefore, an important consideration in twopart modelling, as highlighted by Olsen and Schafer [15] for longitudinal data, is whether an individual’s average probability of eating is related to the individual’s average amount consumed when the individual eats. In other words, the average proportion of occasions on which the participant does not eat may be related to the average (log) amount of dietary intake during eating occasions. To account for this potential relation, the correlation between the random effects across the two parts (e.g. \({\rho }_{{u}_{0}{u}_{1}}\)), often called crosspart correlation, is modelled (illustrated in Fig. 2). The number of modelled correlations is determined by the number of random effects included in the model (see Additional file 4 for details).
An overview of the most relevant parameters in the proposed multilevel twopart model is provided in Table 1.
More general overviews of (multilevel) twopart models can be found in the following literature: Neelon et al. [34, 35] provide an overview as well as case studies on zeromodified count and semicontinuous data, marginally also covering longitudinal data. Liu et al. [36] discuss statistical analyses of semicontinuous data in the crosssectional as well as longitudinal setting. Farewell et al. [37] provide a review on twopart and related regression models for longitudinal semicontinuous as well as longitudinal count data.
Multilevel twopart modelling in brms
brms
The Rpackage brms [28, 29] supports Bayesian multilevel modelling and is implemented via the probabilistic programming language Stan [38]. For readers who are new to Bayesian statistics, a brief introduction is provided in Additional file 1a. We recommend Depaoli et al. [39] as well as van de Schoot and Depaoli [40] to readers who are interested in a broader introduction to Bayesian statistics in the context of health psychology.
We chose brms for this paper for a number of reasons: Firstly and most importantly, the major advantage of brms is that it uses a lme4like formula syntax. lme4 is one of the most commonly used Rpackages for multilevel modelling which will facilitate the initial familiarization with brms for those readers who are familiar with multilevel modelling in R. Secondly, it does not require any data preprocessing (e.g. dividing the semicontinuous outcome into two variables, a dichotomous and a continuous variable) as other software programs do (e.g. gsem in Stata). Thirdly, it offers great flexibility in the model specification (see [28, 29] for details).
Multilevel twopart model in brms
The proposed multilevel twopart model combining a multilevel logistic and a multilevel gamma regression can be run in brms through the family hurdle_gamma.
The general syntax of the model looks as follows:
First of all, the name with which the fitted model will be saved in the REnvironment is specified. The brmfunction indicates that a Bayesian generalized (non)linear multilevel model is fitted. bf (short for brmsformula) is used for setting up the model formula. The upper part within bf represents the formula for the multilevel gamma regression for positive values.^{Footnote 3} The bottom part indicated by hu shows the formula for the multilevel logistic regression for zero values. The two parts of the model are specified after ~ through a formula almost identical to lme4syntax. The initial 1 represents the overall intercepts which are followed by Level1 and/or Level2 predictors with fixed effects. Within parentheses, random effects of Level1 predictors can be specified after the random intercept 1. x specifies random effects of the same participant to be correlated across the two parts of the model, i.e. crosspart correlations are modelled (denoted as the crosspart covariance matrix Σ_{01} in Additional file 4). x within   was chosen arbitrarily and can be exchanged for any letter or digit. After x the grouping variable is specified, in longitudinal data the variable indicating the participant ID. data indicates which data frame is used for the analysis. Bold parts of the syntax have to be customized.
Additional parameters can—and in some cases must—be specified within the brmfunction to adapt the sampling algorithm (see Additional file 1a for a brief introduction to Bayesian sampling). brms runs 4 Markov chains with 2000 iterations each by default. The number of chains and iterations per chain can be customized through the arguments chains and iter. Unless otherwise specified through the argument warmup, half of the iterations are warmup iterations (in the default setting: 2000/2 = 1000). If a model does not converge, brms provides a link to a website [41] with detailed information on recommended modifications (e.g. increase the number of iterations) to make the model converge. The argument set_pior can be used to incorporate prior information. However, due to a lack of prior information we exclusively use the default priors of brms in this paper which are very weakly informative and therefore influence the results as little as possible.
Data and material
The following research question is assessed exemplary within this paper: “Do momentary energetic arousal and gender predict the occurrence of energy intake and/or the amount of energy consumed within timeintervals in which energy intake occurred in daily life?” This question was chosen purely for illustrative purposes. We do not test specific a prior hypotheses with these analyses.
Data were collected within the Eat2beNICEAPPetitestudy. This study captures dietary intake and related factors through EMA using the APPetitemobileapp (details on the APPetitemobileapp can be found in Ruf et al. [45]). Dietary intake was captured in an eventcontingent fashion and used to quantify energy and nutrient intake. Momentary energetic arousal was assessed signalcontingent through 8 semirandom prompts per day. Participants used the app for three consecutive days. Hence, energetic arousal was assessed at up to 24 time points.
Each assessment of energetic arousal was matched to subsequent energy intake (in kcal). Subsequent energy intake was defined as the sum of any intake of energy within the time until the next assessment of energetic arousal or within the next 2 h if the time between two assessments was more than 2 h (e.g. because a prompt was missed) (see Fig. 3 for an illustration).
The dataset and the R code used in this paper can be found in Additional files 2 and 3. The dataset contains 2044 time points from 99 participants. 48.4% (989/2044) of timeintervals show no energy intake and are therefore equal to 0. The mean of nonzero values is 444.5 kcal. The dataset is in longformat (that is, repeated measurements for each participant are reported in separate lines of the dataset) and contains the variables shown in Table 2.
Analyses were run using version 4.0.5 of R, version 1.4.1106 of RStudio (RStudio Inc., Boston, MA, USA [42]), version 2.15 of brms and version 2.21.2 of rstan [43].
Results
Intercept only model
First of all, we specify and run an intercept only model (also called empty model or null model). As the name implies, it does not contain any predictors, only intercepts. The model syntax looks as follows:
When running the model, the following code appears progressively in the console:
First, it shows that Stan is being compiled. A few moments later, sampling is started and the viewer opens. By refreshing the viewer, the progress of the sampling can be monitored. When the model is fitted, a warning is printed. However, this warning can be ignored as it does not affect the model estimation and will be removed in the next release of rstan [44]. As we do not get any other warnings, the model seems to have converged. However, to reassure the quality of the parameter estimates, additional information regarding the construction of the posterior distribution should be obtained. To check convergence, we have a look at density and trace plots of the parameter estimates. These plots can be produced by running the command plot(m.null) and are shown in Fig. 4.
Density plots of model parameters should be clearly unimodal which seems to be the case in this model. Trace plots show each sampled parameter estimate from the first to the 1000^{th} iteration of each of the four chains after warmup. The estimates should circle around a single value to indicate convergence. The trace plots in Fig. 4 indicate convergence as the estimates hover around a single value. If the density and trace plots suggest that the model has not converged, the model should be run with more iterations. The potential scale reduction factor evaluates convergence through assessing differences between the chains (betweenchain variance/withinchain variance) and should be close to 1. It is given for each parameter in the brms output in the column Rhat and is close to 1 if no warning is displayed. As the plots do not show any signs of nonconvergence and no relevant warnings are displayed, we can have a look at the model estimation. To do so, we run the command summary(m.null) and get the following results^{Footnote 4}:
First of all, we doublecheck that the values in the column Rhat are close to 1. All Rhats are equal to 1.00 in this model, so the parameter estimates can be deemed trustworthy.
In the section PopulationLevel Effects which comprises fixed effects, we get two intercepts, one for the gamma part of the model β_{10} (= Intercept) and one for the logistic part β_{00} (= hu_Intercept). In brms, point estimates of parameters represent the mean of the respective posterior distribution. Estimates of the gamma part are modelled on the log scale as the outcome can only be positive. Hence, to obtain the estimate of the intercept in the original metric (kcal), we calculate the exponential of β_{10} (exp(6.1) = 445.9). This indicates that in timeintervals in which energy intake occurred we expect an individual to consume on average 445.9 kcal. This value should be close to the mean of nonzero values in the original data as the group mean is the best estimate in models without predictors. In our data the mean of positive values is 444.5 which is very close to the model estimate.
Estimates of the logistic part are modelled on the logit scale which accommodates the restricted range of probabilities (between 0 and 1). The intercept β_{00} represents the average logodds of no energy intake across all participants. To transform the logodds to the probability of no energy intake, we can use the inverse logit function in Eq. (1) or alternatively the plogisfunction in R.
We get a mean probability of no energy intake of 0.485 (= 48.5%). We can check whether this estimate is reasonable through looking at the percentage of timeintervals without energy intake within the original data. In 48.4% (989/2044) of timeintervals energy intake is equal to zero which is close to the estimate of the intercept. We recommend always checking the model implied estimates from the intercept only model against the descriptive sample estimates to ascertain that the model was specified correctly and that the sample estimates could be reproduced with reasonable accuracy.
Now we have a look at the random effects (GroupLevel Effects in brms). Sd(Intercept) (\(\sqrt{{{\sigma }^{2}}_{{u}_{1}}}\)) represents the variation of the random intercept u_{1} of the gamma part, i.e. personspecific variation in the mean of (log) energy intake. Mean (log) energy intake varies between participants with a SD of 0.14 (95%credible interval [CI] 0.04–0.24). As effects are assumed to be multivariate normally distributed, we can calculate the range in which the mean energy intake of 95% of participants is located (Intercept ± 1.96 SDs). The mean energy intake of 95% of participants is between 338.86 (exp(6.1–0.14*1.96)) and 586.63 kcal (exp(6.1 + 0.14*1.96)) in timeintervals in which energy intake occurred. Furthermore, participants differ in the mean logodds of no energy intake with a SD of 0.23 (95%CI 0.06–0.37) shown by the variation of the random intercept u_{0} of the logistic part sd(hu_Intercept) (\(\sqrt{{{\sigma }^{2}}_{{u}_{0}}}\)). For 95% of participants the probability of no energy intake is between 0.38 (plogis(0.06–0.23*1.96)) and 0.6 (plogis(0.06 + 0.23*1.96)).
The fairly strong positive crosspart correlation between the random intercepts (\({\rho }_{{u}_{0}{u}_{1}}\)) of 0.77 indicates that participants who consume on average more energy within timeintervals in which energy intake occurs have on average a higher probability of no energy intake.
Random intercept model with Level2 predictor
Now we want to include a fixed effect of the Level2 predictor gender in both parts of the model by running the following code:
We do not get any warnings regarding nonconvergence and the density and trace plots do not indicate convergence problems, therefore we can interpret the model estimates^{Footnote 5}:
In this model the intercept β_{10} of the gamma part of the model represents the mean log energy intake for men (gender = 0). Male participants consume on average 528.48 kcal (exp(β_{10}) = exp(6.27)) in timeintervals in which energy intake occurred. Results show that gender has a fixed effect on the mean log energy intake in timeintervals in which energy intake occurs as the 95%CI of β_{11} does not include 0. To interpret the regression coefficient of the fixed effect of gender, we can get the rate decrease in energy intake associated with a oneunit increase in gender through exponentiation of β_{11}. Hence, women (gender = 1) consume on average around 22% less energy (exp(β_{11}) = exp(0.25) = 0.78) in timeintervals in which energy intake occurred compared to men. However, women and men do not differ in the probability of no energy intake as the fixed effect of gender is not relevant for the prediction in the logistic part of the model (95%CI of β_{01} includes 0).
We get three estimates within the random effects. There is betweenperson variation in the log energy intake (\(\sqrt{{{\sigma }^{2}}_{{u}_{1}}}\)=0.12, 95%CI 0.01–0.22) in timeintervals in which energy intake occurred as well as in the logodds of no energy intake (\(\sqrt{{{\sigma }^{2}}_{{u}_{0}}}\)=0.23, 95%CI 0.04–0.38). The crosspart correlation \({\rho }_{{u}_{0}{u}_{1}}\) is still fairly strong (0.71), suggesting that participants who consume on average more energy in timeintervals in which energy intake occurs have on average a higher probability of no energy intake. However, what we see here is that the 95%CI of \({\rho }_{{u}_{0}{u}_{1}}\) includes 0 and is wider compared to the intercept only model indicating that the estimation is rather inaccurate (95%CI 0.34–0.99).
Random slope model with Level1 predictor
Now we want to include the Level1 predictor momentary energetic arousal in both parts of the model as fixed and random effects. As there was no fixed effect of gender in the logistic part of the model, we only include gender in the gamma part. To do so, we fit the following model:
We do not get any warnings regarding nonconvergence and the density and trace plots do not indicate serious convergence problems, therefore we can interpret the model estimates:
Again we see the meaningful fixed effect of gender in the gamma part (β_{11}). However, there is no fixed effect of energetic arousal in either of the two parts (95%CI include 0). That is, there is no evidence that participants were more likely not to eat when their energetic arousal was higher than usual, β_{01} = 0.04 (95%CI 0.09–0.01). There was also no evidence that participants consumed more energy when their energetic arousal was higher than usual, β_{12} = 0.02 (95%CI 0.01–0.05). Notice, however, that the random effect for energetic arousal in the logistic part suggests that the effect of energetic arousal on the logodds of no energy intake varies across participants with a SD of 0.11 (95%CI 0.02–0.18). Hence, for 95% of participants the effect of energetic arousal on the logodds of no energy intake is between 0.26 (0.04–0.11*1.96) and 0.18 (0.04 + 0.11*1.96). This suggests that on average there is no association of energetic arousal with the probability not to eat. However, for some participants, higher arousal may be associated with a higher probability not to eat. Whereas for others, higher arousal may be associated with a lower probability not to eat. The random effect for energetic arousal in the gamma part was smaller and the lower bound of the 95%CI was 0.00. Note that nonpositive estimates for SD are not permitted, and the lower bound of the CI for this parameter will therefore always be positive. This suggests that interindividual differences in the effect of energetic arousal on the amount of energy intake are small and possibly not statistically meaningful.
We get ten estimates within the random effects: 4 SDs and 6 correlations (as shown in expression (6) in Additional file 4). We see that the crosspart correlation \({\rho }_{{u}_{0}{u}_{1}}\) between the random intercepts is weaker than in the previous models (0.56) and that the 95%CI of all correlations is very wide indicating that it is not possible to get accurate estimates (see also the platykurtic posterior distributions in Additional file 5).
Discussion
Studying dietary intake through multilevel twopart modelling is a methodologically as well as conceptually promising approach. It accounts for the semicontinuous data structure and offers novel and distinct insights in terms of the occurrence as well as the amount of dietary intake. Results of this paper highlight that the differentiation between the two processes reveals processspecific associations which cannot be detected through traditional multilevel modelling. For instance, we found that gender is associated with the amount consumed during eating occasions, but not with the probability of eating. The model we propose overcomes a number of limitations of traditional modelling when analysing semicontinuous data: (1) accounts for the zeroinflation by introducing two model parts, a zero and a continuous part, which avoids incorrect inferences (as shown by Baldwin et al. [8]), (2) accommodates the skewness of the continuous part of the outcome by applying a gamma regression which does not rely on controversial transformation of the outcome and does not change the metric of the data, and (3) considers the dependency between the two model parts by modelling the crosspart correlation which prevents bias in parameter estimation as would running separate models (as outlined below). Despite its potential, multilevel twopart modelling is still missing in the statistical repertoire of most researchers. This may be due to the fact that these models are rather complex and therefore require initial training. However, we believe that multilevel twopart models are the most appropriate and valid method to study semicontinuous outcomes and therefore are worth the training. To facilitate the initial training and encourage other researchers to use these models, this paper offers an applicationoriented introduction to multilevel twopart modelling.
The Rpackage brms used in this paper offers a userfriendly and freely available option for fitting multilevel twopart models. It is particularly intuitive for users familiar with lme4 and Bayesian statistics (see Additional file 1b for a brief overview of similarities and differences between Bayesian and frequentistbased twopart models). We believe that multilevel twopart models are of particular interest to those researchers who are familiar with traditional multilevel modelling.
For demonstration and simplicity purposes, we have focused on multilevel twopart models with fixed and random effects. However, extensions to the model (e.g. crosslevelinteractions) are straightforward.
We found fairly strong to moderate positive crosspart correlations (0.77, 0.71, 0.56) indicating that participants who consume on average more energy during eating occasions have on average a higher probability not to eat. However, we have faced some estimation inaccuracies of the crosspart correlations: the more predictors we included in the model, the wider the 95%CIs got. Nonetheless, we do not recommend fitting separate models as ignoring the crosspart correlation can induce bias in regression coefficients as well as variance components [10, 11]. Not accounting for the crosspart correlation can cause bias particularly in the continuous part of the model. This can be explained by the fact that the zero part determines the cluster size of the continuous part of the model (e.g. the number of observations with dietary intake within an individual). For instance, we found moderate to strong crosspart correlations. Hence, an individual less likely to eat will have fewer observations in the continuous part of the model but the few observations will contain larger amounts. An individual eating more frequently will have more observations in the continuous part which will contain smaller amounts. As a result, higher values of dietary intake will be underrepresented and smaller values will be overrepresented. Su et al. [10] outline that even when researchers are only interested in the continuous part of the semicontinuous outcome and therefore chose to fit a single model, the described bias will still be present.
To run the proposed multilevel twopart model, data on dietary intake as well as individual and/or situational factors have to be collected. Dietary intake can either be captured event (i.e. when food is consumed [45]), signal (e.g. time since the last prompt) or timecontingently (e.g. within the last hour [1]), while individual and/or situational factors have to be assessed either signal or timecontingently. The proposed model cannot be applied to simple eventcontingent sampling protocols (e.g. dietary intake and factors of interest assessed only when food is consumed).
While first empirical evidence [1] as well as results of this paper support the importance of distinguishing between the occurrence of eating and the amount that is eaten, future research is needed to verify the conceptual relevance of studying dietary intake as a dual process. We believe that multilevel twopart models will contribute to a better understanding of which situational and individual factors are associated with an increased probability of eating and/or with an increased amount of dietary intake. Findings in this area offer new perspectives and enable the development of tailored interventional strategies. For instance, in the context of preventing and treating overweight and obesity two types of interventions are needed: (1) interventions customized to reduce the probability of dietary intake and therefore reduce the number of eating occasions within a day, (2) interventions tailored to reduce the amount eaten within eating occasions to prevent overeating.
In this paper, we applied multilevel twopart modelling to study factors influencing energy intake. However, multilevel twopart models can also be employed to study macronutrient intakes which are also semicontinuous in the Eat2beNICEAPPetite data. Furthermore, multilevel twopart modelling can also be applied to studies which capture food categories (e.g. vegetable intake), provided that the consumed amounts are also assessed. Findings in the context of macronutrient intake and food categories can be translated to the promotion of healthy eating, e.g. reducing the occurrence of sugar intake or boosting vegetable consumption within eating occasions. Hence, there are numerous ways multilevel twopart modelling can be applied in the context of studying dietary intake in daily life.
Beyond that, the model proposed in this paper can also be applied to other research contexts in which a semicontinuous outcome is present, including PA data in which zeros are a common problem [46] (e.g. daily PA data [8] or PA data in EMA studies [25]). In fact, almost all behavioural outcomes are likely to show semicontinuous characteristics which can be traced back to dual processes: one process determining whether the behaviour is shown and the other determining how long/intensive/often the behaviour is shown, e.g. smoking behaviour (Has an individual smoked? If so, how many cigarettes have been smoked?), social interaction (Has an individual engaged in social interaction? If so, how many minutes did the individual interact socially?) and purchase behaviour (Has an individual purchased anything? If so, how much money was spent?)—to name only a few. The shorter timeintervals are in which a specific behaviour is studied (e.g. daily diary and EMA studies), the more likely it is that the outcome is zeroinflated, i.e. the behaviour of interest is not shown. Therefore, as the number of these studies is continuously growing, so will the need for multilevel twopart modelling to study predictors of specific behaviours. This paper addresses this need by providing guidance on the implementation and interpretation of these rather complex models.
Conclusions
To the best of our knowledge, this paper is the first to introduce multilevel twopart modelling as a novel analytical approach to study dietary intake in daily life. Distinguishing between factors influencing whether and how much is eaten is conceptually promising and offers new opportunities, particularly for customized nutritional interventions either targeting the occurrence of intake or the amount consumed during eating occasions. As we believe that the importance of EMA studies assessing factors influencing dietary intake in daily life is growing within the next years, this paper will help to establish an appropriate data analysis procedure that accounts for the dual character of dietary intake and the semicontinuous data structure.
Availability of data and materials
All data generated or analysed during this study are included in this published article and its supplementary information files.
Notes
It is important to note that a normal distribution of the dependent variable itself is not an assumption of linear multilevel models. Instead, the residuals which reflect the unexplained part of the dependent variable have to follow a normal distribution. However, in practice the distribution of the residuals typically looks similar to the distribution of the dependent variable.
Note that the multilevel logistic regression predicts NO dietary intake (i.e. y_{ij} = 0). Typically logistic regressions predict y = 1. π_{ij} represents the probability not to eat for person j at time point i.
Note that the upper formula predicts only positive values even though the variable semicontinuous_outcome contains all values of the semicontinuous outcome, including zeros.
Note that the notations marked in blue are inserted by us for illustrative reasons.
Abbreviations
 EMA:

Ecological momentary assessment
 kcal:

Kilocalories
 PA:

Physical activity
 SD :

Standard deviation
 CI:

Credible interval
References
Powell DJH, McMinn D, Allan JL. Does real time variability in inhibitory control drive snacking behavior? An intensive longitudinal study. Health Psychol. 2017;36:356–64.
Dunton GF. Sustaining healthprotective behaviors such as physical activity and healthy eating. JAMA. 2018;320:639–40.
Hofmann W, Dohle S. Capturing eating behavior where the action takes place: a comment on McKee et al. Ann Behav Med. 2014;48:289–90.
Schembre SM, Liao Y, O’Connor SG, Hingle MD, Shen SE, Hamoy KG, et al. Mobile ecological momentary diet assessment methods for behavioral research: systematic review. JMIR mHealth uHealth. 2018;6:e11170.
Maugeri A, Barchitta M. A systematic review of ecological momentary assessment of diet: implications and perspectives for nutritional epidemiology. Nutrients. 2019;11:2696.
Hummel E, Hoffmann I. Complexity of nutritional behavior: capturing and depicting its interrelated factors in a causeeffect model. Ecol Food Nutr. 2016;55:241–57.
Maher JP, Harduk M, Hevel DJ, Adams WM, McGuirt JT. Momentary physical activity cooccurs with healthy and unhealthy dietary intake in African American College freshmen. Nutrients. 2020;12:1360.
Baldwin SA, Fellingham GW, Baldwin AS. Statistical models for multilevel skewed physical activity data in health research and behavioral medicine. Health Psychol. 2016;35:552–62.
Tooze JA, Grunwald GK, Jones RH. Analysis of repeated measures data with clumping at zero. Stat Methods Med Res. 2002;11:341–55.
Su L, Tom BDM, Farewell VT. Bias in 2part mixed models for longitudinal semicontinuous data. Biostatistics. 2009;10:374–89.
Liu L, Ma JZ, Johnson BA. A multilevel twopart random effects model, with application to an alcoholdependence study. Stat Med. 2008;27:3528–39.
Elliston KG, Ferguson SG, Schüz B. Personal and situational predictors of everyday snacking: an application of temporal selfregulation theory. Br J Health Psychol. 2017;22:854–71.
McNaughton SA, Pendergast FJ, Worsley A, Leech RM. Eating occasion situational factors and sugarsweetened beverage consumption in young adults. Int J Behav Nutr Phys Act. 2020;17:71.
Ghosh Roy P, Jones KK, MartynNemeth P, Zenk SN. Contextual correlates of energydense snack food and sweetened beverage intake across the day in African American women: an application of ecological momentary assessment. Appetite. 2019;132:73–81.
Olsen MK, Schafer JL. A twopart randomeffects model for semicontinuous longitudinal data. J Am Stat Assoc. 2001;96:730–45.
Liu L, Strawderman RL, Cowen ME, Shih YCT. A flexible twopart random effects model for correlated medical costs. J Health Econ. 2010;29:110–23.
Liu L. Joint modeling longitudinal semicontinuous data and survival, with application to longitudinal medical cost data. Stat Med. 2009;28:972–86.
Atkins DC, Baldwin SA, Zheng C, Gallop RJ, Neighbors C. A tutorial on count regression and zeroaltered count models for longitudinal substance use data. Psychol Addict Behav. 2013;27:166–77.
Bandyopadhyay D, DeSantis SM, Korte JE, Brady KT. Some considerations for excess zeroes in substance abuse research. Am J Drug Alcohol Abuse. 2011;37:376–82.
DeSantis SM, Bandyopadhyay D, Baker NL, Randall PK, Anton RF, Prisciandaro JJ. Modeling longitudinal drinking data in clinical trials: an application to the COMBINE study. Drug Alcohol Depend. 2013;132:244–50.
Liu L, Strawderman RL, Johnson BA, O’Quigley JM. Analyzing repeated measures semicontinuous data, with application to an alcohol dependence study. Stat Methods Med Res. 2012;25:133–52.
Xing D, Huang Y, Chen H, Zhu Y, Dagne GA, Baldwin J. Bayesian inference for twopart mixedeffects model using skew distributions, with application to longitudinal semicontinuous alcohol data. Stat Methods Med Res. 2015;26:1838–53.
Lee AH, Zhao Y, Yau KKW, Xiang L. How to analyze longitudinal multilevel physical activity data with many zeros? Prev Med. 2010;51:476–81.
Lee AH, Xiang L. Mixture analysis of heterogeneous physical activity outcomes. Ann Epidemiol. 2011;21:780–6.
Zink J, Belcher BR, Dzubur E, Ke W, O’Connor S, Huh J, et al. Association between selfreported and objective activity levels by demographic factors: ecological momentary assessment study in children. JMIR mHealth uHealth. 2018;6:e150.
Tooze JA, Midthune D, Dodd KW, Freedman LS, KrebsSmith SM, Subar AF, et al. A new statistical method for estimating the usual intake of episodically consumed foods with application to their distribution. J Am Diet Assoc. 2006;106:1575–87.
R Core Team. R: a language and environment for statistical computing. Vienna; 2020. Available from: https://www.rproject.org/.
Bürkner PC. brms: an R package for Bayesian multilevel models using Stan. J Stat Softw. 2017;80:1–28.
Bürkner PC. Advanced Bayesian multilevel modeling with the R package brms. R J. 2018;10:395–411.
Pinheiro J, Bates D, DebRoy S, Sarkar D, R Core Team. nlme: linear and nonlinear mixed effects models. 2020. Available from: https://cran.rproject.org/package=nlme.
Bates D, Mächler M, Bolker B, Walker S. Fitting linear mixedeffects models using lme4. J Stat Softw. 2015;67:1–46.
Hox JJ, Moerbeek M, van de Schoot R. Multilevel analysis: techniques and applications. 3rd ed. New York: Routledge; 2017.
Snijders TAB, Bosker RJ. Multilevel analysis: an introduction to basic and advanced multilevel modeling. 2nd ed. London: SAGE Publications Ltd; 2011.
Neelon B, O’Malley AJ, Smith VA. Modeling zeromodified count and semicontinuous data in health services research part 1: background and overview. Stat Med. 2016;35:5070–93.
Neelon B, O’Malley AJ, Smith VA. Modeling zeromodified count and semicontinuous data in health services research part 2: case studies. Stat Med. 2016;35:5094–112.
Liu L, Shih YCT, Strawderman RL, Zhang D, Johnson BA, Chai H. Statistical analysis of zeroinflated nonnegative continuous data: a review. Stat Sci. 2019;34:253–79.
Farewell VT, Long DL, Tom BDM, Yiu S, Su L. Twopart and related regression models for longitudinal data. Annu Rev Stat Appl. 2017;4:283–315.
Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, et al. Stan: a probabilistic programming language. J Stat Softw. 2017;76:1–32.
Depaoli S, Rus HM, Clifton JP, van de Schoot R, Tiemensma J. An introduction to Bayesian statistics in health psychology. Health Psychol Rev. 2017;11:248–64.
van de Schoot R, Depaoli S. Bayesian analyses: where to start and what to report. Eur Health Psychol. 2014;16:75–84.
Brief guide to Stan’s warnings. http://mcstan.org/misc/warnings.html. Accessed 17 Apr 2021.
RStudio Team. RStudio: integrated development environment for R. Boston; 2020. Available from: http://www.rstudio.com/.
Stan Development Team. RStan: the R interface to Stan. 2020. Available from: http://mcstan.org/.
RStan getting started. https://github.com/standev/rstan/wiki/RStanGettingStarted. Accessed 17 Apr 2021.
Ruf A, Koch ED, EbnerPriemer U, Knopf M, Reif A, Matura S. Studying microtemporal, withinperson processes of diet, physical activity, and related factors using the APPetiteMobileApp: feasibility, usability, and validation study. J Med Internet Res. 2021;23:e25850.
Rasmussen CL, PalareaAlbaladejo J, Johansson MS, Crowley P, Stevens ML, Gupta N, et al. Zero problems with compositional data of physical behaviors: a comparison of three zero replacement methods. Int J Behav Nutr Phys Act. 2020;17:126.
Acknowledgements
Not applicable.
Funding
This work was supported by the European Union’s Horizon 2020 Research and Innovation Program under grant agreement No 728018. The funding source has had no involvement in the study design, data collection, interpretation of the findings, or writing of this manuscript. Open Access funding enabled and organized by Projekt DEAL
Author information
Authors and Affiliations
Contributions
A Ruf, ABN and SM devised the analysis plan. A Ruf conducted the analyses and wrote the first draft of the manuscript. A Ruf and ABN contributed to data interpretation. A Ruf, ABN, UEP, A Reif and SM critically reviewed, edited and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
The local ethics committee of the faculty of medicine of the Goethe University Frankfurt (Ethikkommission des Fachbereichs Medizin der GoetheUniversität) approved the study (reference number: 192/18). All subjects declared that they understood the study procedure and signed a written informed consent.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
Bayesian Statistics. a. Brief introduction to Bayesian statistics b. Similarities and differences between Bayesian and frequentistbased twopart models.
Additional file 2.
Dataset.
Additional file 3.
R Code.
Additional file 4.
Model specifications of the proposed multilevel twopart model.
Additional file 5:
Density and Trace Plots. Figure 1. Density and Trace Plots of the Random Intercept Model with Level2 predictor gender. Figure 2. Density and Trace Plots of the Random Slope Model with Level1 predictor energetic arousal (EA).
Additional file 6:
Model Summaries. Figure 1. Model summary of the Random Intercept Model with Level2 predictor gender. Figure 2. Model summary of the Random Slope Model with Level1 predictor energetic arousal (EA).
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Ruf, A., Neubauer, A.B., EbnerPriemer, U. et al. Studying dietary intake in daily life through multilevel twopart modelling: a novel analytical approach and its practical application. Int J Behav Nutr Phys Act 18, 130 (2021). https://doi.org/10.1186/s12966021011878
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12966021011878
Keywords
 Multilevel twopart modelling
 Semicontinuous
 Longitudinal
 Dietary intake
 Ecological momentary assessment
 R
 Brms