The compositional data analysis paradigm in health research
Properties of compositional data
Compositional data are comprised of components which sum to a whole, such as 100%, 1, or in this case 24 h (1440 min). [28] Health researchers may view time use as a composition comprised of sleep and waking behaviours of different metabolic intensities (i.e. sedentary behaviour, light physical activity and MVPA), or as combinations of various mutually exclusive activity domains, such as chores and screen time. Compositions are by nature multivariate, as a composition must comprise at least two components. Compositional information is relative rather than absolute; that is, the information on any individual component is meaningful only by reference to other components. This means that the ratios between components are of primary interest, rather than the absolute values of each component, [29] and the value of the total sum (24 h, one week, one month) is not relevant. For example, in an individual performing one hour of MVPA and 10 h of sedentary behaviour across a 24 h day, the ratio of MVPA to sedentary behaviour is 1:10 or 0.1.
Compositional data exhibit three important properties. Firstly, they are scale invariant, which means that the relative differences between components are maintained regardless of the scale in which they are expressed (such as hours per day [1 h:10 h = 0.1] or percentage of daily time [4.2%:42% = 0.1]) [29]. Secondly, compositional data exhibit sub-compositional coherence, in that the relationship between components is maintained regardless of the presence or absence of other components [29]. In the above example, the ratio of MVPA to sedentary behaviour is still 0.1 regardless of whether sleep (another component of the 24 h time budget) is also reported. Finally, compositions are permutation invariant, as the relative differences between components are the same regardless of the sequence in which components are reported [29].
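The three properties can be checked numerically. The following is a minimal sketch in Python, using a hypothetical 24 h day (8 h sleep, 10 h sedentary behaviour, 5 h light physical activity, 1 h MVPA); the values and the helper names `close` and `mvpa_to_sedentary` are illustrative assumptions, not study data or study code.

```python
# Hypothetical 24 h day in hours (illustrative values, not study data).
day_hours = {"sleep": 8.0, "sedentary": 10.0, "light": 5.0, "mvpa": 1.0}


def close(parts, total):
    """Rescale the parts so that they sum to `total` (the 'closure' operation)."""
    current_total = sum(parts.values())
    return {name: value * total / current_total for name, value in parts.items()}


def mvpa_to_sedentary(parts):
    """Ratio of MVPA to sedentary behaviour."""
    return parts["mvpa"] / parts["sedentary"]


# Scale invariance: hours per day versus percentage of the day.
ratio_hours = mvpa_to_sedentary(day_hours)
ratio_percent = mvpa_to_sedentary(close(day_hours, 100.0))

# Sub-compositional coherence: drop sleep and re-close the waking day to 100%.
waking_only = {name: value for name, value in day_hours.items() if name != "sleep"}
ratio_without_sleep = mvpa_to_sedentary(close(waking_only, 100.0))

# Permutation invariance: report the components in reverse order.
reordered = dict(reversed(list(day_hours.items())))
ratio_reordered = mvpa_to_sedentary(reordered)

print(ratio_hours, ratio_percent, ratio_without_sleep, ratio_reordered)
```

In each case the ratio stays at 0.1 (up to floating-point rounding), even though the absolute values of the components change with the scale and with the set of components reported.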
The simplex – A sample space for compositional data
The sample space is defined as the set of all possible values that variables can take. In the example of a coin toss, the sample space consists of heads or tails. Most traditional statistical methods employed in the field of health research (notably regression) assume that data are unconstrained, and therefore operate in real (or Euclidean) space. However, in the case of compositional data, data are constrained to a total sum. Thus, compositional data are represented in a subset of real space known as the simplex, and have a natural geometry, known as Aitchison geometry [30].
The manipulation of variables requires the use of methods congruent to the sample space. For example, the calculation of the arithmetic mean of an unconstrained variable involves adding all observations together and dividing by the number of observations. The arithmetic mean of the numbers 2 and 8 is (2 + 8)/2 = 5. For the same calculation in the simplex, where we are dealing with ratios, perturbation (essentially multiplication) is used in place of addition, and powering (raising to the power of 1/n, where n is the number of observations) in place of division. As a result, the geometric mean, which involves multiplying all observations together and taking the nth root, is the most appropriate indicator of central tendency for compositional data. For example, the geometric mean of the numbers 2 and 8 is the square root of 2 × 8 = 16, which is 4. Calculation of the compositional mean involves calculating the geometric mean of each component and adjusting (or ‘closing’) these to the total sum, in this case 24 h [22].
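The calculation described above can be sketched in a few lines of Python. This is an illustration only: the two four-part days are hypothetical, and the function names `geometric_mean` and `compositional_mean` are assumptions introduced for the example.

```python
import math


def geometric_mean(values):
    """nth root of the product, computed on the log scale for stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def compositional_mean(compositions, total=1440.0):
    """Component-wise geometric means, then closed to `total` minutes."""
    n_parts = len(compositions[0])
    means = [geometric_mean([c[i] for c in compositions]) for i in range(n_parts)]
    scale = total / sum(means)
    return [m * scale for m in means]


# The worked example from the text: arithmetic versus geometric mean of 2 and 8.
arithmetic = (2 + 8) / 2
geometric = geometric_mean([2, 8])

# Two hypothetical days in minutes: sleep, sedentary, light activity, MVPA.
days = [
    [480.0, 600.0, 300.0, 60.0],
    [420.0, 660.0, 330.0, 30.0],
]
centre = compositional_mean(days)
print(arithmetic, geometric, centre)
```

The component-wise geometric means do not generally sum to 1440 min on their own, which is why the closure step is needed.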
Principles of compositional data analysis
The application of traditional statistical methods to compositional data, such as linear regression, can be problematic as these methods are not coherent with the simplex. Even though some pairs of components might appear to be uncorrelated using traditional methods, components are never independent of one another; rather, they are co-dependent to a greater or lesser degree. Thus the inclusion of all components in a model would result in perfect multi-collinearity, negatively biasing the covariance structure of the data [28]. Even the inclusion of more than one component can lead to spurious results.
Compositional data can and should be analysed using methods that account for their properties. A ‘staying in the simplex’ approach can be used, where operations based on Aitchison geometry (e.g. perturbation and powering) are employed. However, the more popular approach is to map compositional data from the simplex into unconstrained real space, where traditional multivariate statistics coherent with real space may be applied. In practice, this is achieved by expressing compositions as log-ratio coordinates. [29] Discussion of the merits of different types of log-ratio coordinate systems may be found elsewhere, [30, 31] but isometric log-ratio (ilr) transformations are most often used. An ilr transformation will produce a set of coordinates numbering one less than the number of components. For example, the four-component composition sleep, sedentary behaviour, light physical activity and MVPA may be expressed as the following set of three normalised log contrasts: (a) sedentary behaviour: sleep; (b) light physical activity: the geometric mean of sleep and sedentary behaviour; and (c) MVPA: the geometric mean of sleep, sedentary behaviour and light physical activity. A positive ilr indicates that the numerator is greater than the denominator for that coordinate, and conversely a negative ilr indicates that the denominator is greater than the numerator. If the ilr is zero, the numerator and the denominator are equal.
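To make the three contrasts concrete, the sketch below computes coordinates of the form described above, where coordinate i contrasts component i + 1 with the geometric mean of components 1 to i. The normalising constant sqrt(i / (i + 1)) is the usual one for this type of orthonormal basis; the sign and ordering conventions are an assumption that follows the description in the text and may differ from the defaults of any particular software package. The example day is hypothetical.

```python
import math


def ilr(composition):
    """Coordinates z_1..z_(D-1); z_i contrasts part i+1 with the
    geometric mean of parts 1..i (assumed convention, see lead-in)."""
    logs = [math.log(v) for v in composition]
    coords = []
    for i in range(1, len(composition)):
        mean_log_previous = sum(logs[:i]) / i  # log of the geometric mean
        coords.append(math.sqrt(i / (i + 1)) * (logs[i] - mean_log_previous))
    return coords


# Hypothetical day in minutes: sleep, sedentary, light activity, MVPA.
day = [480.0, 600.0, 300.0, 60.0]
z = ilr(day)

# The same day expressed as proportions gives the same coordinates.
z_proportions = ilr([v / 1440.0 for v in day])
print(z)
```

Here the first coordinate is positive because sedentary time exceeds sleep, and the third is negative because MVPA is smaller than the geometric mean of the other three components.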
Once expressed as ilr coordinates in real space, compositions can be used in statistical models as exposures or outcomes, or both. In the example given above, (c) represents the ratio of MVPA relative to the geometric mean of the remaining components. When used as an exposure, model coefficients for this coordinate correspond to the change in outcome associated with an increase in MVPA relative to compensatory decreases in the remaining components. Alternatively, when used as an outcome, models may be used to predict coordinates based on exposures of interest. In both cases, the ilr coordinates can be back-transformed into proportions, and then into original units (minutes or hours) for interpretation. To date, the small body of literature applying compositional data analysis to health research has used compositions as exposures to explore the aetiology of health or ill health.
Zero values in compositional data analysis
Log-ratio coordinates cannot be computed for zero values, meaning that the presence of zeros in one or more components prohibits the use of compositional data analysis techniques. In compositional data, zeros can be theorised as ‘rounded’ or ‘essential’. A rounded zero is a small non-zero value that falls below some detection limit, and is thus recorded as zero. For example, the measurement of chemical compositions relies on the sensitivity of the measurement instrument, which may not be able to detect chemicals occurring in very small concentrations. An essential zero is a true zero, indicating the complete absence of that component in the composition. To date, approaches of varying levels of sophistication have been used to impute values in the place of rounded zeros, [32] but the problem of essential zeros remains a core challenge of compositional data analysis. [33] Components containing a large number of zeros or small values are commonly amalgamated with other components. However, this strategy may not be desirable in health research as MVPA typically accounts for a very small proportion of daily time yet is strongly associated with health outcomes.
We now move to describing the current compositional data analysis.
Study population and design
This study is a secondary analysis of the 2014/15 United Kingdom (UK) Harmonised European Time Use Survey (UKHETUS) [34]. The UKHETUS is a cross-sectional national survey of approximately 7600 UK residents aged eight years or older, conducted between April 2014 and December 2015 [35]. The survey used a multi-stage stratified probability sampling design, generating a random sample of residential addresses using the Postcode Address File and the Land Property Services Agency. The target achieved sample was 5500 households. From a total sample of 11,860 addresses, of which 10,479 were eligible, the response rate was 40.4% or 4238 households [35]. A nominated individual within the household completed a household demographic questionnaire. Following this, all individuals in the household completed an individual demographic questionnaire and two time-use diaries (one on a weekday, one on a weekend day). The study was approved by the Research Ethics Committee of the Department of Sociology (DREC) at the University of Oxford (2014_01_02_R1). For the current analysis we randomly selected one time-use diary from individuals aged 16 years and over.
Data availability
UKHETUS data are available at https://doi.org/10.5255/UKDA-SN-8128-1.
Assessment of time-use composition
Time-use diaries were filled out on the day of interest (Fig. 1). Each diary started at 4 am and covered a full 24 h, in 10-min timeslots. For each timeslot, the participant recorded the primary activity they were undertaking (‘what’ variable) and up to three co-occurring secondary activities. The participant also recorded their location (‘where’ variable) for each timeslot, for example home or work. If they were travelling, the mode of travel was reported under the ‘where’ variable. All responses were given in free text. After the diaries were returned, all free text was coded by an independent rater. For each timeslot, ‘what’ variables were coded into one of 281 a priori individual codes, and ‘where’ variables into one of 38 a priori individual codes [35].
Quality control
Initial data cleaning was performed in the released dataset, involving the imputation of some missing time according to a set of standard rules [36]. We then applied a series of quality control checks to the time-use diaries. Firstly, we conducted a general quality control based on standard procedures used across multiple time-use datasets [37]. We identified diaries that had more than 90 min of missing time, reported fewer than seven episodes of activity, and were missing two or more of four basic activities (sleeping/resting, eating/drinking, personal care and exercise/travel). We then applied quality control checks specific to our analysis. We identified diaries that did not report a full 24 h of eligible activity codes, that is, diaries in which time was coded to one of the following activity codes:
9960 No main activity no idea what it might be.
9970 No main activity some idea what it might be.
9980 Illegible activity.
9990 Unspecified time use.
9991 Not applicable.
9999 Queryable.
We also identified diaries in which no sleep was reported. We removed all diaries failing these quality control checks, on the basis that they were likely incomplete or (in the case of diaries reporting no sleep) atypical representations of the 24 h time budget.
Definition of exposure (active travel)
We defined active travel as a binary yes/no variable. The participant was categorised as having undertaken some active travel if one of the following codes was reported in the ‘where’ variable in their time-use diary: “travelling on foot” or “travelling by bicycle”. Walking or cycling for recreation was not included in this variable.
Definition of outcome (time-use composition)
For each participant, we partitioned their time-use diary into six mutually exclusive activity sets (components) according to the primary activity reported in the ‘what’ variable (Additional file 2):
1. Sleep (minutes/day)
2. Leisure MVPA including walking or cycling for recreation (minutes/day)
3. Leisure sedentary screen time (minutes/day)
4. Non-discretionary time comprising work, study, chores and caring duties (minutes/day)
5. Travel including both active and motorised modes (minutes/day)
6. Other including informal help to others and hobbies (minutes/day)
Together, these components accounted for all of the participant’s daily time (24 h or 1440 min). It should be noted that the sleep component represented all sleep occurring between 4 am and 4 am. Thus, it did not necessarily describe an overnight sleep duration, and it incorporated naps undertaken during the day.
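The partitioning step can be illustrated with a short Python sketch. It assumes a diary represented as a list of 144 primary-activity labels, one per 10-min timeslot. The activity labels and their mapping to components are hypothetical placeholders; the study's actual assignment of the 281 activity codes is given in Additional file 2 and is not reproduced here.

```python
from collections import Counter

SLOT_MINUTES = 10
SLOTS_PER_DAY = 144  # 24 h in 10-min timeslots

COMPONENTS = [
    "sleep", "leisure_mvpa", "screen_time",
    "non_discretionary", "travel", "other",
]

# Hypothetical mapping from primary activity label to component.
ACTIVITY_TO_COMPONENT = {
    "sleeping": "sleep",
    "jogging": "leisure_mvpa",
    "watching_tv": "screen_time",
    "paid_work": "non_discretionary",
    "housework": "non_discretionary",
    "commuting": "travel",
    "hobbies": "other",
}


def diary_to_composition(slots):
    """Sum 10-min timeslots into minutes per day for each component."""
    if len(slots) != SLOTS_PER_DAY:
        raise ValueError("expected one label per 10-min timeslot (144 in total)")
    counts = Counter(ACTIVITY_TO_COMPONENT[label] for label in slots)
    return {name: counts.get(name, 0) * SLOT_MINUTES for name in COMPONENTS}


# Hypothetical diary, listed from 4 am onwards.
diary = (["sleeping"] * 48 + ["paid_work"] * 48 + ["commuting"] * 6
         + ["housework"] * 12 + ["watching_tv"] * 18 + ["hobbies"] * 12)
composition = diary_to_composition(diary)
print(composition)
```

Because each timeslot is assigned to exactly one component, the six values always sum to 1440 min. In this hypothetical diary the leisure MVPA component is zero, which is the situation that the treatment of zeros has to address.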
We explored patterns of zeros and non-zeros across the defined composition. For the current analysis we treated zero values as rounded, for the following reasons: (a) participants were required to record activity blocks of at least 10 min, meaning that shorter duration activities could have been missed, which is particularly relevant for MVPA; (b) we used only the primary activity to generate the composition, but some relevant activities could have been reported as secondary activities; (c) given the nature of activities included in components (for example “walking and hiking” in the MVPA component), rounding was theoretically possible; and (d) time-use compositions generated from accelerometers (which sample at epochs of 15 s or less) typically have few or no zeros in components [31], which reinforces the suggestion that the cruder level of aggregation in time-use diaries may result in rounded zeros. Therefore, we imputed zero values using a log-ratio data augmentation algorithm, which replaced zeros with small values of less than 10 min, drawing time from the other components. As a sensitivity analysis, we imputed all zero values as one minute.
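The simpler of the two approaches, fixed-value replacement, can be sketched as follows. This is not the log-ratio data augmentation algorithm used in the main analysis. The source also does not state how the 1440 min total was maintained in the sensitivity analysis, so the multiplicative shrinking of the non-zero components shown here is an assumption, as are the function name `replace_zeros` and the example day.

```python
def replace_zeros(composition, delta=1.0):
    """Replace each zero with `delta` minutes and shrink the non-zero parts
    proportionally so that the total is unchanged (assumed adjustment)."""
    total = sum(composition)
    n_zeros = sum(1 for value in composition if value == 0)
    shrink = 1.0 - n_zeros * delta / total
    return [delta if value == 0 else value * shrink for value in composition]


# Hypothetical day in minutes: sleep, leisure MVPA, screen time,
# non-discretionary time, travel, other.
day = [480.0, 0.0, 180.0, 600.0, 60.0, 120.0]
imputed = replace_zeros(day, delta=1.0)
print(imputed, sum(imputed))
```

Shrinking the non-zero components by a common factor keeps the ratios between them intact, so only the ratios involving the imputed component are affected by the choice of replacement value.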
Covariates
Covariates hypothesised to confound the association between active travel and time use, which have been used in previous research examining active travel and health behaviours, were selected a priori. Participants reported their age, sex and work status as part of the individual demographic questionnaire. The day of the week of the time-use diary was reported as part of the diary procedure. Age was used as a continuous variable. We used binary variables for sex (male vs. female), work status (working or studying vs. other, including those who answered ‘not applicable’) and day type (weekday vs. weekend).
Analysis
We used the open source software R (www.r-project.org) and a number of packages for the analysis of compositional data, including compositions [38], zCompositions [32] and robCompositions [39].
We explored potential differences between participants included and not included in the analysis, and described the characteristics of the analysis sample. We then conducted an initial descriptive analysis of the raw composition, calculating the arithmetic mean and standard deviation, and the median and interquartile range, of each component. For the imputed composition, we then calculated the geometric mean of each component separately. Finally, we calculated the compositional mean or centre by ‘closing’ the geometric mean of all components to 1440 min. When using the compositional mean, components are adjusted so that they add up to the total. In this case, we used 1440 min or 24 h, a uniform time budget for all participants (i.e. all had the same amount of available time). We examined the variability of the composition using a pair-wise variation matrix, an indicator of dispersion coherent with the simplex, which is broadly equivalent to the standard deviation.
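As an illustration of the dispersion measure, the sketch below computes a pair-wise variation matrix, in which entry (i, j) is the variance of the log-ratio ln(x_i / x_j) across participants. The three hypothetical days and the use of the sample (n − 1) variance are assumptions made for the example.

```python
import math
from statistics import variance


def variation_matrix(compositions):
    """Entry [i][j] is the sample variance of ln(x_i / x_j) across rows."""
    n_parts = len(compositions[0])
    return [
        [variance([math.log(c[i] / c[j]) for c in compositions])
         for j in range(n_parts)]
        for i in range(n_parts)
    ]


# Hypothetical days in minutes: sleep, sedentary, light activity, MVPA.
days = [
    [480.0, 600.0, 300.0, 60.0],
    [420.0, 660.0, 330.0, 30.0],
    [510.0, 540.0, 300.0, 90.0],
]
t = variation_matrix(days)
for row in t:
    print([round(value, 3) for value in row])
```

Values close to zero indicate that two components stay roughly proportional to each other across participants; larger values indicate that their ratio varies widely. The diagonal is zero by construction and the matrix is symmetric.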
We transformed each participant’s six-component composition into five ilr coordinates for use in regression models. We used the default ilr transformation from the R package compositions, and the same ilr partitioning system to back-transform the log-ratio coordinates into proportions. The proportions were then adjusted to sum to 1440 for interpretation as minutes per day.
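The round trip from minutes to ilr coordinates and back can be sketched in Python. The basis used here (coordinate i contrasts component i + 1 with the geometric mean of components 1 to i) is an assumption chosen for illustration and is not claimed to match the default basis of the R package; the essential point is that the same basis is used in both directions. The example day is hypothetical.

```python
import math


def ilr(composition):
    """Forward transform: D parts to D-1 coordinates (assumed basis)."""
    logs = [math.log(v) for v in composition]
    return [
        math.sqrt(i / (i + 1)) * (logs[i] - sum(logs[:i]) / i)
        for i in range(1, len(composition))
    ]


def ilr_inverse(coords, total=1440.0):
    """Back-transform coordinates to parts, then close to `total` minutes."""
    log_parts = [0.0]  # logs are only defined up to a constant; fix the first
    for i, z in enumerate(coords, start=1):
        mean_log_previous = sum(log_parts) / i
        log_parts.append(z / math.sqrt(i / (i + 1)) + mean_log_previous)
    parts = [math.exp(v) for v in log_parts]
    scale = total / sum(parts)
    return [p * scale for p in parts]


# Hypothetical day in minutes: sleep, leisure MVPA, screen time,
# non-discretionary time, travel, other (sums to 1440).
day = [480.0, 30.0, 120.0, 540.0, 60.0, 210.0]
coords = ilr(day)
recovered = ilr_inverse(coords)
print(coords)
print(recovered)
```

The six components yield five coordinates, and back-transforming with closure to 1440 recovers the original minutes per day.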
Using the approach described by Martin-Fernández [40], we used compositional multivariate analysis of variance (MANOVA) to contrast the mean time-use composition between individuals reporting some active travel and those reporting no active travel. The null hypothesis was that there was no difference in mean time-use composition between the two groups. A p value < 0.05 was taken as evidence to reject the null hypothesis. Models were run in steps, with the first model unadjusted, the second adjusted for age and sex, and the final model adjusted for age, sex, work status and day type.
The MANOVA indicated whether the compositions differed overall between groups, but not which individual components differed. To examine this, it was first necessary to estimate adjusted compositional means for each group (i.e. adjusted for age, sex, work status and day type). To estimate the adjusted compositional means, linear regression models were created, with the ilr coordinates as outcome variables and the binary active travel variable as the exposure, along with the other covariates. We used each ilr coordinate as the dependent variable in a separate linear regression model, resulting in five models (one for each coordinate). Using the R package lsmeans [41], we estimated the adjusted mean ilr coordinate value for each of the five ilr coordinates. We did this separately for some and no active travel, resulting in a complete set of five estimated ilr coordinates for each group. Subsequently, we back-transformed these ilr sets to predict model-adjusted six-component compositional means for those reporting some active travel and those reporting no active travel separately.
From this, we adapted the procedure outlined in Martin-Fernández [40] to calculate the log-ratio difference in adjusted compositional means between the two groups. Log-ratio differences are log-transformed ratios where the numerator contains the model-adjusted minutes per day in one component in those reporting some active travel, and the denominator contains the model-adjusted minutes per day in the same component in those reporting no active travel. We then used a bootstrap technique for comparing two populations to construct a 95% bootstrap confidence interval for each separate component. If the confidence interval included zero, we took this as indicating no difference between groups with respect to that component [40].
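The logic of the log-ratio difference and its bootstrap interval can be sketched as follows. This is a simplified illustration under stated assumptions: it uses unadjusted compositional means rather than model-adjusted ones, a percentile bootstrap with independent resampling within each group (the source does not specify the bootstrap variant), and small hypothetical groups of six-part days. The function names are introduced only for the example.

```python
import math
import random


def centre(compositions, total=1440.0):
    """Component-wise geometric means closed to `total` minutes."""
    n = len(compositions)
    n_parts = len(compositions[0])
    means = [math.exp(sum(math.log(c[i]) for c in compositions) / n)
             for i in range(n_parts)]
    scale = total / sum(means)
    return [m * scale for m in means]


def log_ratio_difference(group_a, group_b):
    """ln(mean minutes in group A / mean minutes in group B), per component."""
    return [math.log(a / b) for a, b in zip(centre(group_a), centre(group_b))]


def bootstrap_ci(group_a, group_b, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap interval for each component (assumed variant)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        resample_a = [rng.choice(group_a) for _ in group_a]
        resample_b = [rng.choice(group_b) for _ in group_b]
        draws.append(log_ratio_difference(resample_a, resample_b))
    low = int(alpha / 2 * n_boot)
    high = int((1 - alpha / 2) * n_boot) - 1
    intervals = []
    for k in range(len(draws[0])):
        values = sorted(d[k] for d in draws)
        intervals.append((values[low], values[high]))
    return intervals


# Hypothetical days in minutes: sleep, leisure MVPA, screen time,
# non-discretionary time, travel, other.
some_active_travel = [
    [480.0, 30.0, 120.0, 540.0, 90.0, 180.0],
    [450.0, 20.0, 150.0, 560.0, 80.0, 180.0],
    [500.0, 40.0, 100.0, 520.0, 100.0, 180.0],
]
no_active_travel = [
    [490.0, 20.0, 180.0, 530.0, 50.0, 170.0],
    [470.0, 10.0, 200.0, 550.0, 40.0, 170.0],
    [510.0, 30.0, 160.0, 520.0, 60.0, 160.0],
]
differences = log_ratio_difference(some_active_travel, no_active_travel)
intervals = bootstrap_ci(some_active_travel, no_active_travel, n_boot=500)
print(differences)
print(intervals)
```

A positive log-ratio difference means that the component takes up more of the day in the active travel group, and a negative one means less; exponentiating the difference gives the ratio of the two group means.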
As a final step, we entered interaction terms into the original MANOVA models to explore whether the relationship between active travel and time-use composition differed by sex, work status, age group or day type. If the interaction term was significant (p < 0.05), we repeated the adjusted MANOVA models stratified by that variable (removing it as a covariate in the model) to aid interpretation of the interaction. We used the same regression model plus bootstrap technique to visualise differences in the individual components between those reporting some active travel and those reporting no active travel, by the stratification variable.
Finally, though the survey used a complex sample design, we did not apply survey weights to the current analysis.