A systematic review of reliability and objective criterion-related validity of physical activity questionnaires

Physical inactivity is one of the four leading risk factors for global mortality. Accurate measurement of physical activity (PA), in particular by physical activity questionnaires (PAQs), remains a challenge. The aim of this paper is to provide an updated systematic review of the reliability and validity characteristics of existing and more recently developed PAQs and to quantitatively compare the performance between existing and newly developed PAQs. A literature search of electronic databases was performed for studies assessing reliability and validity data of PAQs using an objective criterion measurement of PA between January 1997 and December 2011. Articles meeting the inclusion criteria were screened and data were extracted to provide a systematic overview of measurement properties. Due to differences in reported outcomes and criterion methods a quantitative meta-analysis was not possible. In total, 31 studies testing 34 newly developed PAQs, and 65 studies examining 96 existing PAQs were included. Very few PAQs showed good results on both reliability and validity. Median reliability correlation coefficients were 0.62–0.71 for existing, and 0.74–0.76 for new PAQs. Median validity coefficients ranged from 0.30–0.39 for existing, and from 0.25–0.41 for new PAQs. Although the majority of PAQs appear to have acceptable reliability, the validity is moderate at best. Newly developed PAQs do not appear to perform substantially better than existing PAQs in terms of reliability and validity. Future PAQ studies should include measures of absolute validity and the error structure of the instrument.


Background
Physical inactivity is considered to be one of the four leading risk factors for global mortality [1]. The measurement of physical activity is a challenging and complex procedure. Valid and reliable measures of physical activity (PA) are required to: document the frequency, duration and distribution of PA in defined populations; evaluate the prevalence of individuals meeting health recommendations; examine the effect of various intensities of physical activity on specific health parameters; make cross-cultural comparisons and evaluate the effects of interventions [2].
Physical activity questionnaires (PAQs) are often the most feasible method when assessing PA in large-scale studies, likely because of their low cost and convenience but these instruments have limitations and should be selected and used judiciously. PAQs are prone to measurement error and bias due to misreporting, either deliberate (social desirability bias) or because of cognitive limitations related to recall or comprehension [3,4]. Cognitive immaturity or degeneration can make self-report of physical activity particularly difficult in the young and elderly [5,6]. Despite more frequent use of objective assessment methods to measure physical activity, PAQs still provide a practical method for PA assessment in surveillance systems, for risk stratification and when examining etiology of disease in large observational studies. Most PAQs are designed to be able to measure multiple dimensions of PA by reporting type, location, domain and context of the activity, provide estimates of time spent in activities of various levels of intensity, and may be able to rank individuals according to intensity levels of reported activity [7,8]. However, results from studies aimed at evaluating the validity of PAQs assessed in one population cannot be systematically extrapolated to other populations, ethnic groups, or other geographical regions. Consequently, a great variety of PAQs have been developed and tested for reliability and validity in recent years.
A comprehensive review of PAQs for use in adults was published in 1997 [9]. Since then, reviews summarizing the validity and reliability of PAQs have been carried out in children [10–12] and preschoolers [13]. Recently, specific reviews were published assessing the quality of PAQs available for children [11], adults [14] and the elderly [15]. The aim of the present study was to systematically review the literature, published between January 1997 and December 2011, on the reliability of PAQs and on their validity evaluated against objective criterion methods, for use in all age groups, and to quantitatively compare the performance of existing and newly developed PAQs.

Inclusion criteria
Studies meeting all of the following inclusion criteria were included: (i) published in the English language between January 1997 and December 2011; (ii) self- or interviewer-administered PAQs or parental proxy reports reporting both reliability and validity results; (iii) PAQs reporting validity results only, when the reliability data had been published previously; (iv) PAQs developed for a healthy general population and for observational surveillance studies; (v) PAQs tested in their original form or in an adapted version if results were reported for validity and reliability, or for validity only when reliability results had been published before; (vi) validity tested against an objective criterion measure of PA (i.e. accelerometry, heart rate, combined heart rate and accelerometry, doubly labeled water (DLW)); (vii) results on validity obtained by pedometer where the questionnaire was specifically developed to assess walking only.

Exclusion criteria
We excluded studies that reported: (i) reliability and validity results in groups with specific clinical or medical conditions (except pregnancy); (ii) results from PAQs that were designed for specific intervention studies; (iii) results where the validity of the PAQ was tested against another self-report method (i.e. diaries, logs); (iv) results on validity using pedometers (except if walking only was tested) and indirect measures of physical activity (e.g. VO2max and body composition); (v) results on essential adaptations of original PAQs, without any published results on both reliability and validity.

Literature search
The PubMed, Medline and Web of Science databases were systematically searched using the following lists and terms. List A: (physical activity AND health survey OR population survey OR question*). List B: measure* (i.e. measures, measurement), assess* (i.e. assessment, assessed), self-report, exercise, valid* (i.e. valid, validation, validity), reliab* (i.e. reliable, reliability), reproducible, accelerometer, heart rate, doubly labelled water, doubly labeled water. The search included titles, abstracts, key words and full texts.
Key search terms in List A were combined with each of the terms in List B.
The literature search was undertaken in two stages. The original literature search (1997–2008) was undertaken by two of the authors (JW, HB) independently and search results were compared and verified. The literature search was then updated to include studies up to December 2011 using exactly the same search criteria (HH). A second search strategy included screening reference lists of publications that matched the inclusion criteria and any other publications of which the authors were aware but which did not show up during the original literature search. Figure 1 displays an overview of the literature search.

Data collection and extraction
Data were extracted using a standardized pro-forma which included sample characteristics, questionnaire details, methods of validity and reliability testing, test results and authors' conclusions. We retrieved full text of articles of all abstracts that met our inclusion criteria. Any queries about the inclusion of papers were resolved by one of the authors (UE).

Reliability
Reliability in all studies was tested through a test-retest procedure to measure consistency of the PAQs. Reliability results from included studies were reported as: intraclass correlation coefficients (ICC); Pearson and Spearman correlation coefficients; and agreement measures using Cohen's weighted kappa (κ) and mean differences. Reliability was considered poor, moderate (acceptable), or strong when correlation coefficients or kappa statistics were <0.4, 0.4-0.8 or >0.8, respectively [16]. Similarly, an ICC > 0.70 or >0.90 was considered as acceptable and strong, respectively, in those studies reporting this measure [17].
Medians of reliability correlation coefficients across studies were calculated and included in the tables when possible.
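The thresholds and the median summary described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by the authors: the function names, the label "below acceptable" for an ICC of 0.70 or lower, the treatment of values exactly at 0.4 and 0.8 as moderate, and the example coefficients are all assumptions made for the sketch.

```python
from statistics import median


def classify_correlation(value):
    """Label a correlation coefficient or kappa statistic.

    Bands follow the text: <0.4 poor, 0.4-0.8 moderate (acceptable), >0.8 strong.
    Treating exactly 0.4 and 0.8 as moderate is an assumption.
    """
    if value < 0.4:
        return "poor"
    if value <= 0.8:
        return "moderate"
    return "strong"


def classify_icc(icc):
    """Label an ICC: >0.70 acceptable, >0.90 strong.

    The label for ICC <= 0.70 is not given in the text and is assumed here.
    """
    if icc > 0.90:
        return "strong"
    if icc > 0.70:
        return "acceptable"
    return "below acceptable"


# Hypothetical coefficients from three studies, for illustration only.
example_coefficients = [0.62, 0.71, 0.76]
median_coefficient = median(example_coefficients)
median_label = classify_correlation(median_coefficient)
```

Applied to the hypothetical list, the median is the middle value (0.71), which falls in the moderate band.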

Validity
Correlation coefficients were the most commonly used measures of validity, although the Bland-Altman technique [18] which determines absolute agreement between two measures expressed in the same units, was also frequently used. The Bland-Altman method estimates the mean bias and the 95 % limits of agreement (± 2SD of the difference) and is usually plotted as the difference between the methods against the mean of the methods for visual inspection of the error pattern throughout the measurement range; the dependence of error with the underlying level can be summarised in the error correlation coefficient but this was only seldom reported.
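As an illustration of the Bland-Altman quantities described above, the following Python sketch computes the mean bias and the limits of agreement for paired measurements expressed in the same units. The function name, the default multiplier of 2 (taken from the "± 2SD" wording above; 1.96 is also commonly used) and the example values are assumptions for the sketch and are not data from any reviewed study.

```python
from statistics import mean, stdev


def bland_altman(questionnaire, criterion, multiplier=2.0):
    """Return mean bias and limits of agreement for paired measurements.

    Both inputs must be in the same units (e.g. kJ/day of PAEE).
    """
    if len(questionnaire) != len(criterion) or len(questionnaire) < 2:
        raise ValueError("need two paired sequences with at least two values")
    # Difference (questionnaire minus criterion) and mean of each pair.
    differences = [q - c for q, c in zip(questionnaire, criterion)]
    pair_means = [(q + c) / 2 for q, c in zip(questionnaire, criterion)]
    bias = mean(differences)
    sd = stdev(differences)  # sample standard deviation of the differences
    return {
        "bias": bias,
        "lower_limit": bias - multiplier * sd,
        "upper_limit": bias + multiplier * sd,
        "differences": differences,  # y-axis of the Bland-Altman plot
        "pair_means": pair_means,    # x-axis of the Bland-Altman plot
    }


# Hypothetical paired values, for illustration only.
result = bland_altman([10.0, 12.0, 14.0, 16.0], [8.0, 11.0, 15.0, 14.0])
```

Plotting `differences` against `pair_means` gives the usual visual check of whether the error changes across the measurement range.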
Medians of included validity correlation coefficients were calculated and included in the tables when possible.
When calculating the medians, we excluded those studies reporting correlation coefficients for the associations of self-reported sedentary time. The medians for sedentary time are reported separately and associations of sedentary time with measures of total physical activity (i.e. total energy expenditure [TEE], physical activity level [PAL] and total activity from accelerometry [mean counts]) from the criterion method were excluded in these analyses as these measures are expected to be inversely related.

Classification
Questionnaires were classified as new or existing (i.e. previously published test results) PAQs. Existing questionnaires were subdivided into those which reported new reliability and validity results, and those which reported new results on validity only but had previously reported results on reliability. Questionnaires were classified as new when the study concerned was the first to publish reliability and objective validity data on the PAQ. Thereafter, studies were further stratified by age group of the sample. Study populations with a mean age lower than 18 years were categorised as youth, those aged 18–65 years as adults, and those above 65 years as elderly.
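The age stratification rule above can be written as a small Python function. This is only a sketch of the stated cut-offs; the function name is hypothetical, and assigning a mean age of exactly 18 or exactly 65 years to the adult group is an assumption based on the "18–65 years" wording.

```python
def classify_age_group(mean_age):
    """Assign a study sample to an age stratum from its mean age in years."""
    if mean_age < 18:
        return "youth"
    if mean_age <= 65:  # boundary handling at 65 is an assumption
        return "adults"
    return "elderly"


# Hypothetical mean ages, for illustration only.
example_groups = [classify_age_group(age) for age in (12.4, 43.0, 71.5)]
```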

PAQs included
PAQ abbreviations are listed in Table 1, with their respective timeframe. The details of these studies are shown in Tables 2 (new PAQs) and 5 (existing PAQs). A range of tests were used to assess reliability and validity with some studies reporting results for a total questionnaire summary score, and others assessing reliability and validity for various aspects, intensities, or domains of the questionnaire and/or by subgroups within the test population. The total score or index for the PAQ was reported, if available. In the absence of a total score, correlation coefficients by intensity category or group are reported. Where multiple results were reported, a decision was made about the data that constituted the main results based on the stated objectives for the study or questionnaire. Several studies compared results to another questionnaire concurrently but if this was a secondary aim of the specific study, the results were not included.
Results were reported for both total score and other aspects (e.g. domain, intensity) when this substantially added to the information for the specific study, for example when total PA was tested against a different validation method than PA intensities [31]. Some questionnaires assessed sedentary behaviour and these results are specifically reported in the tables or text. It has recently been suggested that sedentary behaviour should be considered separately from physical activity in associations with health outcomes [50].

Results
The search string (JW and HH) resulted in a total of 11098 hits. The first literature search resulted in 125 papers being retrieved for data extraction. The update of the literature review to December 2011 resulted in a further 75 papers being retrieved for data extraction (Figure 1). More than half of the papers retrieved were excluded (n = 104). The main reasons for exclusion were inappropriate criterion measures, generally a measure of aerobic fitness (n = 48), and lack of information on reliability (n = 26) or validity (n = 17) (Figure 1).

New PAQs
The description of newly developed PAQs is summarized in Table 2. The literature search found 31 articles, reporting results from 34 newly developed PAQs of which 10 were from the United States, 10 from Europe, six from Australia, two from Canada, and one study from Japan and Sub-Saharan Africa, respectively. Of note was a 12-country international study testing the International Physical Activity Questionnaire (IPAQ) [34]. This questionnaire is available in a short form for surveillance and in a longer form when more detailed physical activity information is collected. Both forms are available in a number of languages. IPAQ has been rigorously tested for reliability and validity and this has been replicated in a number of countries.
Nineteen studies tested the reliability and validity in adults, an additional 11 studies focused on youth [19–29] and one study was performed in Japanese elderly [49]. Most studies (n = 25) included men and women, four studies [26,30,32,35] reported data in women and two studies [37,38] in men only. The number of participants varied from 30 to 2271, and several studies [19,20,29,31,33–35,39–41,43–47] performed reliability testing in a larger sample than their test of criterion validity. The most common response timeframe was the last seven days, with seven studies [27,30,36,37,44,46,47] using a timeframe covering the last year (Table 1). All PAQs captured some elements of leisure time and recreational activity, although most questionnaires also addressed multiple domains of activity. Sedentary time was also commonly captured by the newly developed questionnaires and has been given some extra attention in recent publications and in the current results. Several recent PAQs, such as the EPIC Physical Activity Questionnaire (EPAQ2) and the Recent Physical Activity Questionnaire (RPAQ), aim to measure the totality of physical activity by domains [31,46,47,51]. The final outcome of the majority of PAQs was reported as time-integrated MET values, e.g. MET-min/week.

Reliability
All reliability results for new PAQs are listed in Table 3.
Reliability was usually reported as ICC (n = 13), Pearson/Spearman correlation (n = 6), kappa statistic (n = 3) or a combination of these statistics (n = 9). Higher reliability coefficients were more often seen in association with shorter periods between test and retest. Poor correlation (ICC or r <0.4) was found only in subcategories of a few PAQs. Median correlations from reported data for recall of sedentary behaviours across all PAQs were acceptable: ICC = 0.68, Spearman r = 0.60, Pearson r = 0.475, kappa = 0.66.

Elderly
Median Pearson reliability correlation for the elderly was r = 0.70. The PAQ-EJ was the only new PAQ designed for (Japanese) elderly that reported reliability results and has acceptable recall properties (r = 0.70) [49].

Validity
All validity results for new PAQs are listed in Table 4.
Accelerometry and in particular the ActiGraph accelerometer was the most commonly used criterion method (n = 19), followed by the Caltrac accelerometer (n = 4) and the Polar heart rate monitor (n = 4). DLW was used in one study, where absolute validity was moderate to high for PAEE (r = 0.39) and TEE (r = 0.67) [31]. In general, validity coefficients were considerably lower than reliability coefficients. Median correlations across all PAQs between reported sedentary behaviours and calculated inactivity from objective measures were low: Spearman r = 0.12.

Adults
Median validity correlations for adults were as follows: Spearman r = 0.27, Pearson r = 0.28. Highest validity in adults was demonstrated for the SSAAQ when tested against the Caltrac accelerometer (r = 0.60-0.74) [44]. Low validity correlations for total activity or for all subcategories were observed for the HUNT1 (r = 0.03-0.07) [54], and the short EPIC PAQ (r = 0.04), although the main outcome, a four-category physical activity index derived from this instrument, was significantly associated with objectively measured physical activity energy expenditure (p for trend = 0.003) [47]. A follow-up study in 1941 adults from 10 European countries suggested moderate validity (r = 0.33) of this instrument using physical activity energy expenditure from combined heart rate and movement sensing as the criterion [51]. Rosenberg et al. assessed the validity of sedentary behaviour only, and demonstrated low correlations (partial r = −0.01-0.10) with objectively measured sedentary time (<100 counts/min) by the ActiGraph accelerometer [43].
Table footnote: GAQ score yesterday = summary score estimated from 28 physical activities performed on the previous day (yesterday), applying the code 0 for the response "none", 1 for the response "less than 15 min", and 10 for the response "15 min or more". GAQ score usual = summary score estimated from usual activities, based on frequency of physical activities performed, applying the code 0 for the response "none", 1 for the response "a little", and 10 for the response "a lot". The GAQ summary scores were computed as the total MET-weighted score divided by the number of nonmissing items. TV watching = time spent watching TV or video. Other sedentary = time spent performing computer or video games, arts and crafts, board games, homework or reading, talking on phone or hanging out.

Elderly
Median Spearman validity correlation for the elderly was r = 0.41. The PAQ-EJ was tested by correlating a total score with MET-min/day calculated from the Kenz Lifecorder accelerometer-based pedometer (r = 0.41) [49].

Reliability
All reliability results for existing PAQs are listed in Table 6.
Most studies examining the reliability of existing PAQs reported reliability as ICC (n = 20) or as Pearson/Spearman correlation coefficients (n = 8); some studies also used a combination of correlation statistics (n = 7). Similar to the new PAQs, the existing PAQs demonstrated moderate correlations for reliability. Median correlations from reported data for recall of sedentary behaviours were divergent: ICC = 0.76, Spearman r = 0.725, Pearson r = 0.305, kappa = 0.645.

Validity
All validity results for existing PAQs are listed in Table 7.

Discussion
This systematic review covered the most recent 15-year period. We identified 31 studies that adequately tested newly developed PAQs for both validity and reliability during this period. This suggests that whilst assessing physical activity by means of objective monitoring has become widespread, including when examining population levels of activity [119][120][121], PAQs remain an active area of research and are now generally considered complementary to any objective measure. Several previous reviews have assessed the reliability and validity of PAQs with a special focus on their overall performance [9], or performance in specific age groups [11,14,15]. In contrast, we examined whether newly developed PAQs performed better than older PAQs, as this will inform researchers and practitioners when choosing an existing PAQ or developing a new instrument for assessing physical activity. We therefore comprehensively summarized the results to allow an adequate appraisal of the existing PAQs' performance across domains and physical activity intensities.
In concordance with previous reviews [11,14,15], very few questionnaires showed acceptable reliability and validity across age groups. Developing new PAQs requires careful consideration of the study design in terms of target population, sample size, age group, recall period, dimension and intensity of PA, relative and absolute validity, standardized quality criteria and appropriate comparison measures. The lack of formulating a priori hypotheses was recently highlighted as a limitation in most studies examining the validity of PAQs [11] and comprehensive key criteria for physical activity and sedentary behaviour validation studies have been proposed [122,123].
Since the comprehensive review by Kriska and Caspersen [9], it is apparent that more appropriate criterion methods, in particular accelerometry, have been used to test the validity of PAQs. Yet, a considerable number of studies were excluded from the present review due to an inappropriate criterion method (e.g. aerobic fitness). Many studies reported reliability and validity results for existing and well established questionnaires, which suggests that these instruments are still frequently used. Importantly, newly developed PAQs do not seem to perform any better than existing instruments in terms of reliability and validity. Unfortunately, we were not able to conduct a formal meta-analysis due to differences in reported outcomes, different criterion measures and different time frames between questionnaires.
Total energy expenditure (TEE) was frequently used as the outcome measure of the PAQ and the validity scores from these types of instruments are usually high. However, the results from many of these studies should be interpreted carefully. This is because TEE from any self-report incorporates an estimate of resting energy expenditure (REE) generally calculated from body weight, sex and age. REE explains most of the variation in TEE and, consequently, high correlations may be generated when comparing TEE from self-report with measured or estimated TEE from the criterion method. This is particularly problematic when those same predictions of REE are used by both the criterion method and the self-reported calculation of energy expenditure. Therefore, other outputs (e.g. time spent in different intensity levels, physical activity energy expenditure normalised for body size) from the criterion method appear more appropriate to serve as criterion measures. In these studies correlations between the criterion measure and self-reported PA are considerably weaker than those for TEE, although the PAQs concerned may still be considered valid as demonstrated in some studies [31,116]. The notion of validity, however, is a matter of degree, rather than an all-or-nothing determination.
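The inflation of TEE correlations by a shared REE estimate can be illustrated with a small simulation. The Python sketch below is a toy model under stated assumptions, not an analysis of any reviewed study: REE, true PAEE and reporting error are drawn from normal distributions with arbitrary, hypothetical means and standard deviations (in kcal/day), the same REE value is added to both the self-reported and the criterion TEE, and all names are invented for the example.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simulate_shared_ree(n=2000, seed=0):
    """Compare PAEE and TEE validity correlations when REE is shared."""
    rng = random.Random(seed)
    # Hypothetical distributions; the parameters are illustrative assumptions.
    ree = [rng.gauss(1600, 300) for _ in range(n)]
    paee_criterion = [rng.gauss(600, 200) for _ in range(n)]
    # Self-report = criterion PAEE plus large random reporting error.
    paee_reported = [p + rng.gauss(0, 400) for p in paee_criterion]
    # The same predicted REE enters both TEE values.
    tee_criterion = [r + p for r, p in zip(ree, paee_criterion)]
    tee_reported = [r + p for r, p in zip(ree, paee_reported)]
    return (pearson(paee_reported, paee_criterion),
            pearson(tee_reported, tee_criterion))


r_paee, r_tee = simulate_shared_ree()
```

Under these assumptions the TEE correlation is noticeably higher than the PAEE correlation even though the questionnaire error is identical in both comparisons, because the shared REE term contributes common variance to both sides.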
The validity correlation coefficients from the vast majority of existing and newly developed PAQs were considered poor to moderate and usually only acceptable when results were presented as Pearson or Spearman correlation coefficients. This suggests that most PAQs may be valid for ranking individuals' behaviour, whereas their absolute validity for quantifying PA is limited. Although our summary of the correlations in a single median value should be interpreted with caution, we did not observe any substantial difference between new and existing PAQs. This may suggest that, despite considerable effort, accurate and precise self-report physical activity instruments are still scarce [124]. Many of the newly developed instruments collected information in various domains of physical activity including transportation and housework. Despite this, it appears almost impossible to obtain a valid estimation of a highly variable behaviour such as free-living physical activity by self-report. While results from large scale observational cohort studies have convincingly demonstrated the beneficial effects of self-reported physical activity on various health outcomes including all-cause mortality, coronary and cardiovascular disease morbidity and mortality, some types of cancer, and type 2 diabetes, the detailed dose-response associations are still unknown [125]. Increased sample size is usually considered to improve precision but may not overcome issues about accuracy. Further, a large sample size does not overcome misclassification due to differential measurement error. Therefore, future studies should consider including an objective measure of physical activity in addition to self-report or consider recommendations to reduce self-report error [126].
Only four of the reviewed questionnaires, the IPAQ-s (existing) [85], the FPACQ (existing) [111], PDPAR (existing) [60] and the RPAR (new) [21] showed acceptable to good results for both reliability and validity. Sedentary behaviour appeared to be one of the most difficult domains to assess with questionnaires as demonstrated by the poor correlations with objectively measured sedentary time, although arguably, there are also limitations of the criterion measures, which contribute to poorer agreement between methods. About one third (n = 11) of the studies reporting data on newly developed PAQs assessed both validity and reliability for sedentary behaviour. 17 and 15 studies reported data on validity and reliability for sedentary behaviour from existing PAQs, respectively.
Accuracy of PA recall may be increased at the retest administration by an increased physical activity awareness as a result of completing the questionnaire previously [105]. Many of the reviewed studies did not specify details about their reliability testing, making it difficult to distinguish test-retest reliability of the instrument from a measure of stability of physical activity. It is therefore difficult to attribute the correlations to either the reliability of the instrument or the stability of the behaviour of the participant. Assessing test-retest reliability for a last seven day PAQ is generally more straightforward compared to a PAQ assessing usual or last year physical activity. This is because when examining the reliability of a last seven days instrument the respondents should be prompted to report their PA during exactly the same week on two different occasions separated in time. However, this must be weighed against administering the test and retest so close in time that the respondent remembers the answers given at the first administration, resulting in inflation of reliability estimates from correlated error. Several study details other than the timeframe of recall can have a marked influence on the study results, such as socio-cultural background, sex, age, literacy, and cognitive abilities.
The DLW method is usually considered the most accurate criterion method available for measuring TEE and PAEE. However, as discussed above, when using the DLW method and other objective methods which provide outputs in TEE as the criterion instrument, individual variability in body weight needs to be considered. It is therefore recommended that data from these methods should be expressed as PAEE, with and without normalisation for body weight in subsequent validation studies. Combined heart rate and movement sensing may be more accurate than either of the methods used alone for measuring time spent at different intensity levels [31]. However, most of the newly developed PAQs used a single accelerometer mounted at the hip as the criterion method, possibly due to its reasonable costs and feasibility in large study groups. Accelerometry also has some inherent limitations including its inability to accurately assess the intensity of specific types of activity such as weight-bearing activities, cycling, and swimming [33]. Further, the choice of somewhat arbitrary cut-off points [127][128][129] to classify intensities of activity when using accelerometry as a criterion method has been documented before. The use of accelerometers is especially problematic to validate time spent in different intensities of physical activity from PAQs and this also hampers comparison of studies [33]. Usually criterion measures assess overall PA (e.g. time in MVPA, PAEE) which precludes a direct test of the validity of self-reported domain specific activity (e.g. occupation). It is therefore not surprising that some PAQs [e.g. 86] which only assess a specific domain of activity demonstrate low validity when compared with overall physical activity from the criterion instrument. More research is therefore needed to compare time stamped criterion data with domain specific self-reported activity and to develop criterion instruments which can accurately categorise types of activities.
Adopting a conceptual framework for physical activity [130] in combination with standardized procedures when developing and validating PAQs [122,123] is highly recommended.
Pearson and Spearman correlations may not be the most appropriate statistical methods to use for reporting results on the validity of PAQs. ICC is considered a more appropriate method for continuous measures on the same scale, whereas weighted kappa is a better choice of method for categorical measures [131,132]. When reporting validation results researchers are encouraged to report absolute validity in terms of mean bias with limits of agreement as well as the error structure of the instrument across the measurement range. We noted that many of the newly developed instruments reported results on absolute validity by means of the Bland-Altman method, which is a simple, intuitive and easy to interpret method to assess measurement error [133]. Descriptive details of the study population may be helpful to explain any heterogeneity in the findings from different studies. Researchers can individually interpret all data for quality and applicability.
In summary, we systematically reviewed studies assessing both reliability and validity of PAQs in various domains, across age groups, and with a focus on total PA and sedentary time. PAQs are inherently subject to many limitations and the choice of PAQs should be dictated by the research question and the population under study. Considerations for researchers when using PAQs in practice have been identified and new research should consider including an objective method for assessing physical activity in addition to any self-report [134]. This review has identified a limited number of PAQs that appear to have both acceptable reliability and validity. Newly developed PAQs do not appear to perform substantially better than existing PAQs in terms of reliability and validity.