Validity and reliability of subjective methods to assess sedentary behaviour in adults: a systematic review and meta-analysis

Background Subjective measures of sedentary behaviour (SB) (i.e. questionnaires and diaries/logs) are widely implemented, and can be useful for capturing type and context of SBs. However, little is known about their comparative validity and reliability. The aim of this systematic review and meta-analysis was to: 1) identify subjective methods to assess overall, domain- and behaviour-specific SB, and 2) examine the validity and reliability of these methods.
Methods The databases MEDLINE, EMBASE and SPORTDiscus were searched up to March 2020. Inclusion criteria were: 1) assessment of SB, 2) evaluation of subjective measurement tools, 3) being performed in healthy adults, 4) manuscript written in English, and 5) paper was peer-reviewed. Data on validity and/or reliability measurements were extracted from included studies and a random-effects meta-analysis was performed to assess the pooled correlation coefficients of the validity.
Results The systematic search resulted in 2423 hits. After excluding duplicates and screening on title and abstract, 82 studies were included with 75 self-reported measurement tools. There was wide variability in the measurement properties and quality of the studies. The criterion validity varied from poor to excellent (correlation coefficient [R] range −0.01 to 0.90), with logs/diaries (R = 0.63 [95%CI 0.48–0.78]) showing higher criterion validity compared to questionnaires (R = 0.35 [95%CI 0.32–0.39]). Furthermore, correlation coefficients of single- and multiple-item questionnaires were comparable (1-item R = 0.34; 2-to-9-items R = 0.35; ≥10-items R = 0.37). The reliability of SB measures was moderate-to-good, with the quality of these studies being mostly fair-to-good.
Conclusion Logs and diaries are recommended to validly and reliably assess self-reported SB. However, due to time and resource constraints, 1-item questionnaires may be preferred to subjectively assess SB in large-scale observations, provided they show similar validity and reliability compared to longer questionnaires.
Registration number CRD42018105994.


Introduction
Regular physical activity reduces the risk of premature death, cardio- and cerebrovascular disease, metabolic disorders and some forms of cancer [1,2]. Based on the overwhelming evidence, the World Health Organization recommends that adults perform ≥150 min of moderate-intensity aerobic physical activity, or ≥75 min of vigorous-intensity aerobic physical activity, per week [3]. More recently, the importance of sedentary behaviour (SB) for health has emerged. High levels of SB are associated with an increased risk of premature death, cardiovascular disease, metabolic disorders and cancer [4][5][6], with especially strong associations in those who are physically inactive. These observations highlight the importance of accurately measuring physical activity and SB in order to understand their respective roles in health outcomes.
Various devices [7] and questionnaires [8] are available to assess physical activity. Since SB is a distinct behavioural entity and not simply reflective of the lack of sufficient physical activity, these measures may not directly assess SB [9]. Furthermore, in contrast with structured exercise, SB occurs habitually throughout the day, making valid assessment of SB challenging. SB is defined as any activity during awake time with an energy expenditure ≤1.5 METs (i.e. sitting or activities in reclining posture) [9,10]. Patterns and total volume of SB can be assessed using objective measures such as thigh-worn accelerometers combining acceleration and posture, which are currently regarded as the gold standard to quantify free-living SB and to distinguish between sitting or lying, standing and physical activity [11]. Nonetheless, used in isolation, these objective measures do not distinguish between different domains (e.g. occupation, transportation and leisure time) and settings (e.g. TV viewing, car driving and sitting while reading) of SB. This is important since some settings of sitting, e.g. TV viewing and screen time, are more strongly associated with poor health outcomes compared to total sedentary time [12][13][14] and may serve as useful intervention targets. These observations emphasise the need for valid subjective measures to assess SB within the various domains and settings in which it occurs. Ideally, these measures should be taken in combination with objective assessments [15]. However, given that this is not always possible or feasible, it is also important to understand the measurement properties of self-report methods when they are used in isolation.
Several self-reported tools (i.e. questionnaires, logs and diaries) have been developed recently to measure SB. These tools vary from single-item questions to extensive questionnaires about SB considering various domains. To date, some reviews have compared the validity and reliability of these tools [15,16]. However, previous reviews did not take the risk of bias across studies into account and did not combine the results into a meta-analysis. Knowledge about the validity, reliability and the quality of the studies performed is essential to plan, perform and correctly interpret results in this field of research, because measurement error may seriously impact study results. The aim of this systematic review and meta-analysis was to identify subjective methods to assess SB and, subsequently, to examine their validity and reliability to assess SB in adults, whereby the sedentary time measured by subjective methods was compared to objective and other subjective methods. This overview will contribute to improved selection of appropriate subjective measures of SB (in relation to the research question), and help to identify gaps in knowledge within this area of research.

Data source and literature search
A literature search was performed in the databases MEDLINE, EMBASE and SPORTDiscus. The search strategy combined three main search terms: sedentary behaviour, self-reported measures, and validity/reproducibility. The complete search strategy is shown in Additional Table 1. The last search was performed on March 11th, 2020. All citations were imported into the bibliographic database of EndNote, version X7 (Thomson Reuters, New York City, NY). This review was registered in PROSPERO (number CRD42018105994) and the 'Preferred Reporting Items for Systematic Reviews and Meta-Analyses' (PRISMA) [98] guidelines were used to perform the systematic review and meta-analyses.

Selection of papers
After importing all citations in EndNote, duplicates were removed, and title, abstract and full text were independently screened by two reviewers (EB, YH). In case of disagreement, a third reviewer (TE) was consulted. Inclusion criteria were: 1) assessment of SB, 2) evaluation of subjective measurement tools, 3) being performed in healthy adults, 4) manuscript written in English, and 5) paper was peer-reviewed. Papers were excluded if the study did not aim to determine any construct of SB, if it did not investigate the validity or reliability of the tool, and/or if the aim was to cross-culturally validate the subjective tool in different languages. A flowchart of the search strategy and the inclusion of manuscripts is presented in Fig. 1.

Data extraction, synthesis and analysis
Study characteristics were extracted using an extraction form including: 1) study population, 2) number of   [99][100][101]. The COSMIN checklist contained items about the criterion validity (Additional Table 2) and reliability (Additional Table 3). For each item, different design requirements and statistical methods were rated on quality using a 4-point scale. A methodological quality score per item was obtained by taking the lowest rating of any score per item ('worst score counts') [101].
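The lowest-rating rule described above can be illustrated with a short sketch. It assumes the 4-point scale uses the labels poor, fair, good and excellent (the labels reported in the Results); the function name and the example ratings are hypothetical and are not taken from the COSMIN materials.

```python
# Ordered from lowest to highest quality. The labels are an assumption
# based on the ratings reported in the Results section.
RATING_ORDER = ["poor", "fair", "good", "excellent"]


def lowest_rating(item_ratings):
    """Return the lowest rating among the ratings given for one checklist item."""
    if not item_ratings:
        raise ValueError("at least one rating is required")
    # min() with the position in RATING_ORDER as key picks the worst label.
    return min(item_ratings, key=RATING_ORDER.index)


# Hypothetical ratings for the design requirements of one validity study.
example_ratings = ["excellent", "good", "poor", "good"]
print(lowest_rating(example_ratings))
```

Under this rule a single poorly rated requirement determines the overall score, which helps explain why a study with otherwise sound methods can still be graded as poor.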

Assessment of criterion validity and reliability
Criterion validity was defined as the degree to which the outcome measure measures the construct it purports to measure [103]. Thigh-worn accelerometry (e.g. activPAL) was considered as the gold standard for total sedentary time, as it can more accurately distinguish between sitting and standing [11]. Hip-, waist- and wrist-worn accelerometers are frequently used as criterion measure. However, these accelerometers are not sensitive enough to distinguish between stationary standing and sitting [104]. On these grounds, studies using only hip-, waist- and wrist-worn accelerometers as criterion measure were graded with a lower level of evidence. In addition, if validity results of both thigh-worn accelerometers and hip-, waist- and wrist-worn accelerometers were included in the study, only the results of the thigh-worn accelerometers were reported in this review. Reliability was defined as the degree of consistency and reproducibility of a measurement tool. Test-retest reliability is often assessed using an ICC [103]. Since Pearson and Spearman correlation coefficients neglect systematic errors, the use of Pearson and Spearman correlation coefficients was considered inadequate and these studies were graded with a lower level of evidence. In addition, if studies provided both ICCs and correlation coefficients, only ICCs were reported in this review. An ICC > 0.90 was considered as excellent, an ICC between 0.75-0.90 as good, an ICC between 0.50-0.75 as moderate and an ICC < 0.50 as poor [105].
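As an illustration of the two points above, the following sketch labels an ICC using the stated cut-offs (with values below 0.50 treated as poor) and shows why a correlation coefficient alone can hide a systematic error. The handling of values exactly on a boundary (0.50, 0.75, 0.90) is an assumption, since the text does not specify it, and the data are invented for illustration.

```python
import math


def classify_icc(icc):
    """Label an ICC with the qualitative categories used in this review."""
    if icc > 0.90:
        return "excellent"
    if icc >= 0.75:  # assumption: boundary values go to the higher category
        return "good"
    if icc >= 0.50:
        return "moderate"
    return "poor"


def pearson_r(x, y):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented test-retest data (hours/day): the retest is always 2 h higher.
test = [4.0, 6.0, 8.0, 10.0]
retest = [t + 2.0 for t in test]

r = pearson_r(test, retest)
mean_shift = sum(b - a for a, b in zip(test, retest)) / len(test)
print(f"Pearson r = {r:.2f}, mean shift = {mean_shift:.1f} h/day")
print(classify_icc(0.80))
```

In this invented example the correlation is perfect although every retest value is shifted by 2 h/day, which is the kind of systematic error that an agreement-type ICC is designed to penalise.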

Data analyses
A meta-analysis using random effects [106] was performed to assess the pooled validity of the 1-item questionnaires, 2 to 9-item questionnaires, ≥10-item questionnaires and logs/diaries. A random-effects model was used because it was unlikely that the included studies were functionally equivalent and the results of the included studies showed large heterogeneity. Only studies expressing validity as Pearson or Spearman correlation coefficients were included in this analysis. When no correlation coefficient was provided for total sedentary time, an (unweighted) mean was calculated based on correlation coefficients of all settings and domains. Finally, I² was calculated, which describes the proportion of total variation in effect size that was due to systematic differences between effect sizes rather than chance [106]. Stratified analyses including only studies examining questionnaires with a good-to-excellent quality were performed to investigate whether the quality of the study affected the pooled validity. Meta-analyses were performed using R with
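To make the pooling step concrete, the sketch below shows one common way to combine correlation coefficients under a random-effects model. It is not the authors' analysis code: it assumes a Fisher z-transformation and the DerSimonian-Laird estimator of between-study variance, neither of which is stated in the text, and the function name and example values are hypothetical.

```python
import math


def pool_correlations(studies):
    """Pool (r, n) pairs with a DerSimonian-Laird random-effects model.

    Returns (pooled_r, ci_low, ci_high, i_squared_percent).
    Every study needs n > 3 for the variance of Fisher's z to be defined.
    """
    # Fisher z-transform; the sampling variance of z is 1 / (n - 3).
    z = [math.atanh(r) for r, _ in studies]
    v = [1.0 / (n - 3) for _, n in studies]

    # Fixed-effect weights and Cochran's Q.
    w = [1.0 / vi for vi in v]
    z_fixed = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
    q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, z))
    df = len(studies) - 1

    # Between-study variance (tau^2), truncated at zero.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights, pooled estimate and 95% CI on the z scale.
    w_re = [1.0 / (vi + tau2) for vi in v]
    z_re = sum(wi * zi for wi, zi in zip(w_re, z)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    low, high = z_re - 1.96 * se, z_re + 1.96 * se

    # I^2: percentage of total variation attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # Back-transform from the z scale to correlation coefficients.
    return math.tanh(z_re), math.tanh(low), math.tanh(high), i2


# Hypothetical validity studies: (correlation coefficient, sample size).
example = [(0.20, 120), (0.45, 60), (0.35, 200), (0.60, 40)]
pooled, ci_low, ci_high, i2 = pool_correlations(example)
print(f"R = {pooled:.2f} [95%CI {ci_low:.2f} to {ci_high:.2f}], I2 = {i2:.0f}%")
```

When the estimated between-study variance is zero, the random-effects weights reduce to the fixed-effect weights; the larger the heterogeneity, the more evenly the weights are spread across small and large studies.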

Search results
The literature search resulted in 2423 hits (Fig. 1). After excluding duplicates, 1272 studies were screened for title and abstract. Most papers were not eligible for this review because: i. the articles did not aim to determine SB, ii. no measurement properties were assessed, and/or iii. the study was performed in children or diseased populations. In total, 82 studies and 75 self-reported measurement tools were included (Table 1).

Attributes of the questionnaires, logs and diaries
The majority of the subjective measures were questionnaires and contained different domains and settings of SB (Table 2). Measurement tools differed regarding the timing (week vs weekend), recall period and number of questions. Nearly all self-reported measurement tools expressed SB in total sitting time (hrs/day or hrs/week). The PASB-Q, SITBRQ, SIT-Q, SIT-Q-7d, TASST and several other questionnaires [31,54,61,67,69,71,78,79] included total sitting time, but also information about sitting bout duration or breaks in sitting time.

Validity
A total of 80 studies examined the validity of one or more methods to assess SB, resulting in a comparison of 96 unique methods (Table 2). Of the 96 results, 5 were rated as excellent quality, 7 as good quality, 9 as fair quality and 75 as poor quality. The most important shortcoming of the validation studies was the use of an accelerometer (n = 62) to examine criterion validity of the method to assess SB. A total of 29 studies used the gold standard approach (thigh-worn accelerometer), three studies used diaries/logs and one used direct observation to assess construct validity. Most studies calculated correlation coefficients between the criterion measure and the self-reported questionnaire, which ranged from −0.01 to 0.90 for total sedentary time and from 0.02 to 0.39 for number of sedentary bouts or breaks (Table 3). Other studies used ICCs (N = 8), kappa values (N = 2), and sensitivity and specificity outcomes (N = 1) to determine the validity, and some added Bland-Altman plots with a mean difference and limits of agreement to examine the accuracy of the method to assess SB (N = 48). Figure 2a provides an overview of the correlation coefficients of all individual studies combined with the quality of the study.
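For readers unfamiliar with the Bland-Altman quantities mentioned above, the mean difference and limits of agreement can be computed as in the following sketch. It uses the conventional mean difference ± 1.96 standard deviations; the included studies may have used other variants, and the function name and data are invented for illustration.

```python
import statistics


def limits_of_agreement(self_report, criterion):
    """Return (mean difference, lower limit, upper limit) for paired measurements."""
    if len(self_report) != len(criterion):
        raise ValueError("both lists must have the same length")
    # Positive differences mean the self-report exceeds the criterion measure.
    diffs = [s - c for s, c in zip(self_report, criterion)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff, mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff


# Invented sitting time (hours/day): questionnaire versus accelerometer.
questionnaire = [6.0, 7.5, 5.0, 9.0, 8.0]
accelerometer = [8.5, 9.0, 7.5, 9.5, 10.0]

bias, lower, upper = limits_of_agreement(questionnaire, accelerometer)
print(f"mean difference = {bias:.2f} h/day, limits of agreement {lower:.2f} to {upper:.2f}")
```

A negative mean difference in this setup would indicate that the questionnaire underestimates sitting time relative to the criterion, while wide limits of agreement indicate poor agreement at the individual level even when the mean difference is small.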

Meta-analyses
The

Reliability
Reliability for total sitting time and number of breaks in sitting time was determined in 44 studies. One study was rated with excellent quality; other studies were rated with good (n = 27), fair (n = 16), and poor (n = 8) quality. Most lower-quality studies were limited by a small sample size and the calculation of correlation coefficients instead of ICCs. The time interval between the first and second assessment ranged between 0.5 h and 15 months, but most studies had an interval of 1-2 weeks (n = 40, Table 3).
The majority of the studies calculated the ICC to examine the test-retest reliability of SB, but some studies used correlation coefficients (N = 6), Bland-Altman plots with mean difference and limits of agreement (N = 2), and kappa values (N = 2). The ICC of the test-retest reliability of the subjective measures of SB ranged between 0.44 and 0.91 (Table 3, Fig. 2b). The ICC estimates were comparable between the logs and diaries, ≥10-item questionnaires, 2 to 9-item questionnaires, and 1-item questionnaires.

Discussion
Time spent in SB has markedly increased over the last few decades and is expected to continue to increase even further [107]. Since SB is associated with many adverse health outcomes [4][5][6], exposure to excessive levels of SB represents an emerging health threat, particularly in the least physically active [108].
To improve quality and guide future studies in this rapidly expanding area of research, this systematic review assessed the validity and reliability of subjective measures of SB, taking the methodological quality into account. We present the following observations. First, despite the presence of several measures to assess SB, significant variability in measurement properties and quality of the studies is present. Second, criterion validity of the subjective measures ranged from poor to excellent (R range −0.01 to 0.90), while the quality of most studies (i.e. level of evidence) was poor. Third, the validity of the logs/diaries was more favourable compared to the validity of questionnaires, with little improvement in validity of questionnaires when including multiple questions. Fourth, a moderate-to-good reliability was found for questionnaires and logs/diaries, with the quality of these studies being largely fair-to-good. Taken together, logs and diaries are recommended to validly and reliably assess SB when only self-report measures are available. However, considering limitations pertaining to logs and diaries (e.g. time constraints, resources), one may prefer using questionnaires in larger-scale observational studies.

Validity of measures of SB
This meta-analysis showed that the overall validity for instruments to assess SB characteristics was moderate to low. These observations raise the question whether these results relate to the poor validity of methods to assess SB per se or the poor quality of the studies that were included. Excluding studies with lower quality from our meta-analyses reinforced the poor-to-moderate validity of the various methods, suggesting measures of SB possess poor validity. It is important to indicate that questionnaires examining physical activity show a similarly poor level of validity [8]. This highlights the difficulty of examining subjective physical (in)activity behaviours with questionnaires, a finding that seems present across the whole physical activity spectrum: from SB to exercise. Due to the low validity and the large variation in quality, the results of different studies are difficult to compare or harmonise. More importantly, the large variety in validity and questionnaire characteristics (i.e. type and context of SBs) prevents the identification of one (or few) questionnaire(s) that can be recommended for all types of future research that aim to examine SB. Factors explaining the poorer validity of the questionnaires versus diaries/logs may relate to differences in qualitative attributes (e.g. recall period and questions/formats). For example, diaries/logs typically adopt a short recall period (e.g. every 15-30 min), whilst questionnaires are often filled in covering a longer recall period (i.e. day, week, and/or month). Consequently, diaries and logs are less reliant on long-term recall and can more accurately capture sporadic and intermittent behaviours. This fits with the higher validity of diaries/logs versus questionnaires. Unfortunately, this approach of using diaries/logs comes at the cost of high participant burden (in time), which subsequently may limit the response and compliance rate and introduce reporting bias.
Another potential limitation of logs/diaries is that repeatedly filling in SBs may influence participants' behaviour and cause (unwanted) adjustment of SB. These factors should be considered when deciding on the preferred way to assess SB in a future study.
Previous work related the poor validity of questionnaires to systematic and random error, specifically reporting and recall bias, which may lead to low agreement with over- and underestimation (Table 2). For example, a potential underestimation of SB in single-item questionnaires was suggested [15,104], whereas wider limits of agreement are present in questionnaires with multiple items [104]. Another factor contributing to validity of questionnaires may relate to the number of questions, and therefore detail of information, with more questions on SB potentially improving the criterion validity of the measurement tool. In contrast to this hypothesis, our analysis revealed no substantial differences between the criterion validity of the 1-item, 2-to-9-item and ≥10-item questionnaires. One possible explanation is that participants find it difficult to recall SB, with multiple-item questionnaires making it even more complicated to replicate detailed and domain-specific patterns of SB [31]. Furthermore, some behaviours are easier to remember because these are more habitual and restricted to certain periods during the day, e.g. TV viewing, computer use or sitting at work [15,31,86]. Finally, multiple-item questionnaires may over-report SB because subjects may report sedentary activities twice when using sub-scales (e.g. driving while listening to music). Although more questions may cover multiple domains and provide more detailed information, the complexity of these questionnaires may contribute to the negligible improvement in criterion validity of multiple-item questionnaires for total sedentary time. Nonetheless, exploring multiple domains of sitting may still be relevant. For example, some domains are more strongly associated with poor health outcomes [12][13][14], whilst detailed information about domains may provide insight for intervention development.

Reliability of subjective measures of SB
Despite the significant heterogeneity in validity of the various measures to assess SB, the reliability of the questionnaires and diaries or logs was moderate-to-good. Importantly, these conclusions are based on studies with a fair-to-good quality. A central question pertaining to the reliability of questionnaires is whether differences are present in reliability for weekdays versus weekend days or for workdays versus non-workdays, especially given the marked differences in (sedentary) behaviour that exist between these days [104]. Indeed, our study found that approximately 50% of included studies reported a ≥10% better reliability to assess SB during weekdays versus weekend days or during workdays versus non-workdays (Table 3). These observations support a previous review, which reported higher reliability for weekdays compared to weekend days [104]. Moreover, we found that reliability was better for specific behaviours, such as TV viewing, compared to more general categories, such as 'other leisure time activities'. An explanation for this finding is that more specific and regularly performed behaviours have a higher reliability [15].

Choosing an appropriate measurement tool
Logs and diaries have a higher validity compared to the questionnaires, are less reliant on long-term recall and can more accurately capture sporadic and intermittent behaviours. Therefore, we recommend logs and diaries as self-reported measurement tools. However, important limitations such as time constraints, lack of resources and the potential to influence participants' behaviour make them less useful for large-scale observational studies and/or intervention studies. Within the spectrum of questionnaires, there is no obvious preference for a single questionnaire. In fact, the most appropriate tool seems to depend on the nature of the study, especially since this review showed large variety in both validity and questionnaire characteristics (i.e. type and context of SBs). Therefore, some studies will benefit from questionnaires focusing on specific domains of SB, whilst others will benefit from a reliable estimate of total sedentary time or distribution of SB. Furthermore, when performing an intervention study, measures will benefit from the ability to measure changes across time. Since this ability was not examined within this review, we cannot make specific recommendations related to this type of study. Nonetheless, these characteristics should be taken into account when planning such studies. Ultimately, and when feasible, a combination of objective and subjective assessments is preferred to provide valid and reliable insight into SB.

Conclusions
This review identified the widespread (and rapidly growing) use of a large range of self-reported measures of SB, which significantly differ in type, extensiveness, complexity and duration. Our results indicated that the criterion validity of subjective measures ranged between poor and excellent, whereas the quality of most studies was poor. The validity of the logs/diaries was significantly higher compared to the questionnaires, with little improvement in criterion validity of questionnaires when increasing the number of items to assess SB. Therefore, when only self-report measures are feasible, logs and diaries are recommended to validly and reliably assess SB. However, due to the time and resource constraints related to logs and diaries, 1-item questionnaires may be preferred in large-scale studies, provided they show similar validity and reliability compared to longer questionnaires. Whenever feasible, the combination of objective and subjective assessments will provide the most valid and reliable method to assess SB.