Performance of the International Physical Activity Questionnaire (short form) in subgroups of the Hong Kong Chinese population

Background The International Physical Activity Questionnaire short form (IPAQ-SF) has been validated and recommended as an efficient method to assess physical activity, but its validity has not been investigated in different population subgroups. We examined variations in IPAQ validity in the Hong Kong Chinese population by six factors: sex, age, job status, educational level, body mass index (BMI), and visceral fat level (VFL). Methods A total of 1,270 adults (mean age 42.9 years, SD 14.4; 46.1% male) completed the Chinese version of IPAQ (IPAQ-C) and wore an accelerometer (ActiGraph) for four days afterwards. The IPAQ-C and the ActiGraph were compared in terms of estimated Metabolic Equivalent Task minutes per week (MET-min/wk), minutes spent in activity of moderate or vigorous intensity (MVPA), and agreement in the classification of physical activity. Results The overall Spearman correlation (ρ) between the IPAQ-C and ActiGraph was low (0.11 ± 0.03; range in subgroups 0.06-0.24) and was highest among high-VFL participants (0.24 ± 0.05). The difference between self-reported and ActiGraph-derived MET-min/wk (overall 2966 ± 140) was smallest among participants with tertiary education (1804 ± 208). When physical activity was categorized as over or under 150 min/wk, overall agreement between self-report and accelerometer was 81.3% (± 1.1%; subgroup range: 77.2%-91.4%); agreement was highest among those who were employed full-time in physically demanding jobs (91.4% ± 2.7%). Conclusions Sex, age, job status, educational level, and obesity were found to influence the criterion validity of IPAQ-C, yet none of the subgroups showed good validity (ρ = 0.06 to 0.24). IPAQ-SF validity is questionable in our Chinese population.


Introduction
Physical activity greatly contributes to overall health and mental well-being and is associated with reduced mortality [1][2][3], but physical inactivity and sedentary lifestyles have reached epidemic proportions [4]. Much attention has been paid to developing reliable and valid instruments to estimate activity levels and to measure the impact of interventions to promote physical activity [5]. Objective methods for measuring physical activity include motion sensors (e.g., pedometers or accelerometers) and measures of physiological response to exercise, such as heart rate monitors [6,7]. The accelerometer is often used as the gold standard against which self-report questionnaires are compared [8]. Though objective, accelerometers may not always be feasible to use because of cost and inconvenience. A simple and valid self-report measure of physical activity would have the advantages of convenience, rapid data collection and low cost.
Of the many published questionnaires, the International Physical Activity Questionnaire (IPAQ) has been investigated in several populations. The IPAQ was developed by the World Health Organization in 1998 (http://www.ipaq.ki.se) for surveillance of physical activity and to facilitate global comparisons. The 31-item long form and the 9-item short form assess time spent on different activities.
The short form records four types of physical activity: vigorous activity such as aerobics; moderate-intensity activity such as leisure cycling; walking; and sitting. The short form is preferred by many researchers because it has equivalent psychometric properties to the long form despite being one-third the length [5]. The two forms have been validated against accelerometer measurements in 12 countries with small samples of 19 to 257 participants. Spearman correlations between the two measurement methods were moderate at best, ranging from -0.12 to 0.57, with a pooled correlation of 0.30 [5]. The IPAQ correlated more closely with objective measurements of vigorous physical activity than with those of other activity levels [4]. Despite these variable validity results, a recommendation was made that the IPAQ (short form, referring to activity in the past seven days) be used for surveillance and comparison of national trends [4,5].
The modest correlation with objective measurements, combined with the wide variation in reported coefficients, raises concern about universally recommending the IPAQ. Four studies presented sufficient data to allow for more extensive analysis of the agreement between IPAQ and accelerometer readings [9][10][11][12]. Using data from these studies, we calculated that the IPAQ overestimated physical activity compared to accelerometers by 35% in Switzerland [11], 85% in Vietnam [10], 100% in the US [9], and 170% in Hong Kong [12]. The discrepancy between the measurements, and the wide range of discrepancies, reinforce our concern over the instrument's cross-cultural suitability.
The inconsistent overestimate suggests a bias (albeit to widely varying extents), complicated by random errors in both the IPAQ and accelerometer measurements. One possibility is that the accelerometer is not as reliable as we have believed, although why an ostensibly objective instrument should vary so widely in different settings is not easy to explain. A more plausible explanation is that the IPAQ may be more accurate among some respondent groups than among others, due to differences in translation or group characteristics such as attitudes toward exercise or level of understanding. Given the advantages of the IPAQ, including its ease of administration and low cost, it seems worthwhile to investigate whether its validity indices can be improved. A first step may be to test the hypothesis that the instrument performs more adequately in some subgroups than in others. If true, this would imply restricting its use to groups in which it gives valid results, and would shed light on how the IPAQ could be corrected or built upon. In this study, we examined variations in IPAQ validity in a sample of Hong Kong Chinese adults, analyzed by subgroups defined in terms of sex, age, job status, educational level, body mass index (BMI), and visceral fat level (VFL).
The translated Chinese version (IPAQ-C) was previously validated in Hong Kong [12] and in Guangzhou [13], with weak-to-moderate correlations with pedometer and accelerometer measurements (ranging from 0.09 [12] to 0.33 [13]). The Guangzhou sample was older than the Hong Kong sample (mean ages 65.2 vs. 28.7) [12,13], so age may affect the accuracy of IPAQ reporting. Previous studies have also identified sex as a factor that may affect the accuracy of self-reported physical activity [4,14]. Job status may be another factor, since respondents with a regular job may have a routine daily schedule that facilitates recall of their physical activity. The physical demands of the job may also influence reporting accuracy. In addition, educational level may be associated with the accuracy of self-reported physical activity data: a better correlation of IPAQ data with objective measurement would be expected among those with more education, as they may comprehend the questions better than others [5]. Lastly, as overweight people have a different physical activity pattern from others [15] and their self-report could be affected by a social desirability response bias, BMI or visceral fat level (VFL) may also modify the accuracy of self-report data. In this study, we aimed to investigate IPAQ-C accuracy by examining questionnaire-accelerometer correlations by sex, age, job status, educational level, BMI, and VFL.

Participants
This study was part of the Hong Kong Jockey Club FAMILY Project cohort study, which includes Hong Kong families recruited since March 2009. Sampling was based on a random selection of residential addresses provided by the Hong Kong Census and Statistics Department. A family was eligible when all members aged 15 years or older who lived at the same address and could understand Cantonese agreed to participate. For the present analyses, we used baseline data on the first 5,000 families interviewed from March to October 2009. All eligible members were interviewed by trained interviewers who entered the data into tablet PCs. Other details of the interview have been described elsewhere [16]. Having completed the survey, participants were invited to take part in a sub-study by wearing an accelerometer for four consecutive days (including a weekend); for half of the households all members were invited, while for the other half a randomly drawn member was invited. Written consent was obtained from respondents and this study was approved by the Institutional Review Board of The University of Hong Kong.
Weight and VFL were measured with an Omron fat analyzer scale, model HBF-356 (http://www.omron-healthcare.com.sg). Its precision is 0.1 kg for weight and 1 unit for visceral fat level. All measurements were taken in person by trained interviewers following standard protocols. BMI was calculated by dividing weight (kg) by the square of height (m²).

IPAQ-C
The 9-item IPAQ-C records self-reported physical activity in the last seven days [12]. Responses were converted to Metabolic Equivalent Task minutes per week (MET-min/wk) [5] according to the IPAQ scoring protocol: total minutes over the last seven days spent on vigorous activity, moderate-intensity activity, and walking were multiplied by 8.0, 4.0, and 3.3, respectively, to create MET scores for each activity level. MET scores across the three sub-components were summed to indicate overall physical activity [5].
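The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name, the dictionary keys, and the example respondent are hypothetical, and the official scoring protocol's data-cleaning and truncation rules are not reproduced here.

```python
# MET weights as stated in the text: vigorous 8.0, moderate 4.0, walking 3.3.
MET_WEIGHTS = {"vigorous": 8.0, "moderate": 4.0, "walking": 3.3}


def ipaq_met_min_per_week(minutes_last_7_days):
    """Return (per-activity MET-min/wk, total MET-min/wk).

    `minutes_last_7_days` maps an activity level to the total minutes
    reported over the last seven days (hypothetical input format).
    """
    scores = {
        activity: weight * minutes_last_7_days.get(activity, 0.0)
        for activity, weight in MET_WEIGHTS.items()
    }
    return scores, sum(scores.values())


# Hypothetical respondent: 60 min vigorous, 120 min moderate, 210 min walking.
example_scores, example_total = ipaq_met_min_per_week(
    {"vigorous": 60, "moderate": 120, "walking": 210}
)
print(example_scores, example_total)
```

Missing activity levels are treated as zero minutes in this sketch, which is a simplifying assumption rather than a rule taken from the study.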

Accelerometer
The ActiGraph is widely used as an objective measurement of physical activity and reported to be reliable and valid [17][18][19]. The ActiGraph GT1M uni-axial accelerometer (http://www.theactigraph.com) was to be worn around the waist for four consecutive days spanning a weekend for all waking hours, removed only for bathing or sleeping. The choice of the first day (from Thursday, Friday, or Saturday) was up to the participants. Records with less than 600 minutes of registered time in a day were excluded as invalid [4,5].
Following the grouping standard [20], we used a one-minute reference period for raw ActiGraph count data. Data (as movement recorded in a one-minute period) were then converted into minutes spent in moderate-intensity (3.00-5.99 METs, 1952-5724 counts per minute) or vigorous activity (≥ 6.00 METs, ≥ 5725 counts per minute) [21]. The daily MET-minute score (MET-min) was computed with the following formula: 8 × minutes spent in vigorous activity + 4 × minutes spent in moderate-intensity activity. As the IPAQ covered 7 days but the ActiGraph only covered 4 days (including a weekend), we averaged the 4-day ActiGraph data by type of day (weekday or weekend day) and obtained a weekly MET-min score as 5 × average weekday MET-min + 2 × average weekend MET-min.
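The conversion from one-minute counts to a weekly score can be illustrated as follows. This is a sketch under stated assumptions: the function names and the toy count series are hypothetical, each day is represented as a plain list of counts per minute, and the 600-minute wear-time check is omitted.

```python
from statistics import mean

# Cut-points quoted in the text (counts per minute).
MODERATE_MIN_CPM = 1952  # lower bound of moderate intensity
VIGOROUS_MIN_CPM = 5725  # lower bound of vigorous intensity


def daily_met_min(counts_per_minute):
    """MET-min for one day: 8 x vigorous minutes + 4 x moderate minutes."""
    vigorous = sum(1 for c in counts_per_minute if c >= VIGOROUS_MIN_CPM)
    moderate = sum(
        1 for c in counts_per_minute if MODERATE_MIN_CPM <= c < VIGOROUS_MIN_CPM
    )
    return 8 * vigorous + 4 * moderate


def weekly_met_min(weekday_days, weekend_days):
    """Weekly score: 5 x mean weekday MET-min + 2 x mean weekend MET-min."""
    weekday_avg = mean(daily_met_min(day) for day in weekday_days)
    weekend_avg = mean(daily_met_min(day) for day in weekend_days)
    return 5 * weekday_avg + 2 * weekend_avg


# Hypothetical 600-minute days (two weekdays, two weekend days).
weekday_1 = [100] * 590 + [2000] * 10                # 10 moderate minutes
weekday_2 = [100] * 580 + [2000] * 15 + [6000] * 5   # 15 moderate, 5 vigorous
weekend_1 = [100] * 600                              # no MVPA
weekend_2 = [100] * 580 + [3000] * 20                # 20 moderate minutes

example_weekly = weekly_met_min([weekday_1, weekday_2], [weekend_1, weekend_2])
print(example_weekly)
```

Weighting the weekday mean by 5 and the weekend mean by 2 is what lets four monitored days stand in for a full week; it assumes the two monitored weekdays are representative of the other three.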

Other measurements
In addition to the IPAQ, the interview collected demographic information and included questions related to psychosocial functioning. Tertiary education refers to those with a bachelor's degree or further education.

Statistical Analysis
Outliers on ActiGraph scores (> median + 1.5 × interquartile range) and records with missing IPAQ-C data were removed from the analysis. Independent t-tests were used to compare the amounts of moderate-intensity, vigorous, and total physical activity between IPAQ-C and ActiGraph. Because the MET-min/wk measurements of neither the IPAQ-C nor the ActiGraph were normally distributed, Spearman correlations were used to determine the correlations between IPAQ-C and ActiGraph records (minutes and count data) by activity level [5]. Fisher's r-to-z test was used to compare pairs of correlations. Correlations and differences are presented with standard errors for computation of confidence intervals as appropriate. ActiGraph-min equals 2 × minutes spent in vigorous activity + minutes spent in moderate-intensity activity, and ActiGraph-count equals raw counts in hours with any movement. The proportions of respondents who met the Centers for Disease Control-American College of Sports Medicine (CDC-ACSM) guideline, i.e., moderate-intensity min/wk + 2 × vigorous min/wk ≥ 150 [22], were computed with both the IPAQ-C and ActiGraph data. We assessed the agreement between the two classifications by comparing the observed proportion of respondents given the same classification by both methods to the percent agreement that could have occurred by chance. To further examine the agreement in classification between IPAQ-C and ActiGraph, we categorized respondents into equal-sized groups according to IPAQ and ActiGraph records, and reported the proportion classified in the same group by both methods. The observed proportions were also compared to chance agreement (for 2 groups: probability of being classified in the same activity group by both methods; for 3 groups: 33.3%; for 4 groups: 25.0%). In addition, the ActiGraph-measured MET-min/wk was compared across IPAQ categories with one-way ANOVA. ANOVA results with significant P-values (< 0.05) were further analyzed with Tukey's method.
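For readers unfamiliar with the r-to-z comparison, the following sketch shows the textbook form of the test for two correlations from independent groups. It is not the PASW procedure used in the study: the function name and the example values are hypothetical, and the standard error 1/(n - 3) is the usual large-sample approximation for Pearson correlations, applied here to Spearman coefficients as an assumption.

```python
import math


def fisher_r_to_z_test(r1, n1, r2, n2):
    """Compare two correlations from independent samples.

    Returns (z statistic, two-sided P-value under a normal approximation).
    """
    z1 = math.atanh(r1)  # Fisher transformation of the first correlation
    z2 = math.atanh(r2)  # Fisher transformation of the second correlation
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical subgroups: rho = 0.50 (n = 103) vs rho = 0.10 (n = 103).
example_z, example_p = fisher_r_to_z_test(0.50, 103, 0.10, 103)
print(round(example_z, 2), round(example_p, 4))
```

Correlations computed on the same respondents (for example, two ActiGraph metrics against the same IPAQ score) are dependent, and would need a different test than this independent-samples sketch.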
All statistical analysis was performed using Predictive Analytics SoftWare (PASW 18.0, formerly known as SPSS).

Results
Out of 11,713 respondents from 5,000 families, 2,511 (21.5%) respondents wore the ActiGraph. The characteristics of ActiGraph wearers and non-wearers were comparable, except for age (wearers 42.9 years vs 44.8 for non-wearers, P < 0.001), job status (58.7% in full-time employment for wearers vs 49.4% for non-wearers, P < 0.001), and the percentage of respondents meeting the CDC-ACSM guideline (92.5% for wearers vs 47.1% for non-wearers). After excluding invalid ActiGraph data (worn for fewer than four days or not following the format of 2 weekdays + 2 weekend days) (n = 1,151) and missing IPAQ data (n = 90), we kept 1,270 respondents in the present analysis: 10.8% of the whole sample. There were no significant differences between the characteristics of the valid and invalid samples. Table 1 shows that 585 (46.1%) of the respondents were male, 735 (58.3%) had a full-time job, 299 (24.3%) attained tertiary education, and 399 (31.5%) were overweight based on BMI (≥ 25), or 347 (29.5%) overweight based on VFL (≥ 10%). The mean age was 42.9 years (range: 15 to 82 years, inter-quartile range = 20 years). Table 2 shows that self-reported MET-min per week exceeded the ActiGraph readings by 231% for total physical activity, by 236% for moderate-intensity, and by 1047% for vigorous-intensity physical activity (P < 0.001 for all comparisons). Although physical activity time reported in IPAQ-C was significantly greater than that measured by ActiGraph, the two measurements were positively correlated (Table 2). The correlation between IPAQ-C and ActiGraph MET-min was significant but weak for total physical activity, moderate-intensity activity, and vigorous-intensity activity. The correlations between ActiGraph count data and IPAQ-C moderate min, IPAQ-C vigorous min, and IPAQ-C MET-min were significant but also weak.
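The percentage overestimates quoted above are read here as relative differences with the ActiGraph value as the denominator. The text does not spell out the formula, so the sketch below is an assumption, and the function name and input values are hypothetical.

```python
def percent_overestimate(self_report, objective):
    """Relative excess of a self-reported value over an objective value, in %.

    Assumes the objective measurement is the reference (denominator).
    """
    if objective <= 0:
        raise ValueError("objective measurement must be positive")
    return 100.0 * (self_report - objective) / objective


# Hypothetical values: 3310 MET-min/wk self-reported vs 1000 measured.
example_percent = percent_overestimate(3310, 1000)
print(example_percent)
```

Under this reading, an overestimate of 100% means the self-reported value is twice the measured one, and values above 1000% arise naturally when the measured amount is very small, as with vigorous activity.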
As reported in previous research [23], the IPAQ-ActiGraph correlation was higher when results were expressed in counts than in total MET-min (ρ = 0.16 vs 0.11, P < 0.05). Table 3 further shows that, in general, the correlations between IPAQ-reported MET and ActiGraph were higher when ActiGraph raw count data were used. In terms of IPAQ total MET by subgroup, IPAQ-ActiGraph correlations appeared to be higher for males, older age groups, those with a full-time job of high physical demand, those with lower educational attainment, and those who were overweight (by classification of either BMI or VFL), yet none of these effects reached statistical significance except VFL (P = 0.01). The highest correlation between IPAQ total MET and ActiGraph was found among those with higher VFL (ActiGraph count data, ρ = 0.31). Furthermore, the IPAQ-ActiGraph correlation was higher among those with higher VFL than among those with normal VFL, regardless of physical activity groups or the ActiGraph measurements used. In contrast, the lowest correlation between IPAQ total MET and ActiGraph was found among those aged 29 years or younger (ActiGraph count data, ρ = 0.04). Table 3 also shows the IPAQ-ActiGraph correlations for physical activity subgroups classified by both IPAQ report and ActiGraph data (only in MET-min). For moderate-intensity activity, the correlations had patterns similar to those found with total MET. However, for the vigorous activity level, the patterns of the correlations were inconsistent by age or employment group. Table 4 compares total time spent on physical activity reported in the IPAQ-C to ActiGraph readings, by subgroup. On every comparison, the self-report questionnaire produced much higher estimates of time spent on physical activity than the objective device (by 151% to 5670%). However, the overestimates were not consistent across groups.
For time spent on moderate-intensity activity, men overestimated slightly less than women did (differences in min/day = 92.4 vs 111.3, P < 0.05), but for vigorous activity men overestimated more (min/day = 16.1 vs 8.5, P < 0.01). Comparisons across groups defined by body mass (not statistically significant) or visceral fat (P < 0.05) showed a similar reversal between the two activity levels. Those with physically demanding full-time jobs overestimated their physical activity time to a greater extent than others, approximately two times more on moderate-intensity activity and seven times more on vigorous activity (P < 0.001). Those with tertiary education overestimated their exercise time to a lesser extent than respondents without (P < 0.001). There was no observable pattern of overestimation by age group, although younger people seemed to have overestimated to a greater extent than those aged 30 or over.
Table 4 Average time (in minutes per day) spent on physical activity measured by the IPAQ-C and ActiGraph, and differences between the two measurements, by level of activity and respondent characteristics
We assessed the agreement of the two measurements in classifying respondents in terms of meeting the CDC-ACSM physical activity guideline (details can be found in Additional file 1). We found that the overall IPAQ-ActiGraph agreement was only slightly better than chance agreement (81.3% vs 79.6%, P < 0.001). The agreement in the classification was better among respondents who had a physically demanding full-time job than among those with physically non-demanding full-time jobs and those without full-time jobs (91.4%, 82.5%, and 77.9%, respectively, P < 0.05). Males had higher agreement between the two classifications than did females (83.9% vs 79.0%, P < 0.05). We also assessed the IPAQ-ActiGraph agreement in classifying respondents into tertiles and quartiles of activity level, against classification based on chance (33% for tertiles and 25% for quartiles). The observed agreement was significantly better than chance except for the group aged ≤29 years and those with tertiary education.
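The comparison of observed and chance agreement for the two-category guideline classification can be sketched as below. The guideline rule follows the definition given in the Statistical Analysis section; the chance-agreement formula (the product of the marginal proportions, as in Cohen's kappa) is an assumption about how "probability of being classified in the same activity group by both methods" was computed, and the function names and toy data are hypothetical.

```python
def meets_guideline(moderate_min_wk, vigorous_min_wk):
    """CDC-ACSM rule as stated: moderate min/wk + 2 x vigorous min/wk >= 150."""
    return moderate_min_wk + 2 * vigorous_min_wk >= 150


def observed_and_chance_agreement(ipaq_meets, acti_meets):
    """Return (observed agreement, agreement expected by chance).

    Chance agreement is taken as the probability that two independent
    classifications with the observed marginal proportions coincide.
    """
    n = len(ipaq_meets)
    observed = sum(a == b for a, b in zip(ipaq_meets, acti_meets)) / n
    p_ipaq = sum(ipaq_meets) / n
    p_acti = sum(acti_meets) / n
    chance = p_ipaq * p_acti + (1 - p_ipaq) * (1 - p_acti)
    return observed, chance


# Hypothetical sample of 10 respondents: 9 meet the guideline by IPAQ,
# 8 by ActiGraph, and the two methods disagree on one person.
ipaq_flags = [True] * 9 + [False]
acti_flags = [True] * 8 + [False] * 2
example_observed, example_chance = observed_and_chance_agreement(
    ipaq_flags, acti_flags
)
print(example_observed, example_chance)
```

The toy example shows why a high observed agreement can be unimpressive: when most respondents are classified as active by both methods, chance agreement is already high.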
Lastly, we compared the mean MET-min/wk measured by ActiGraph across equal-sized groups based on IPAQ scores (Figure 1). Overall, the ActiGraph readings were higher for groups classified by IPAQ as being more active than for less active groups. The mean ActiGraph-measured time was significantly different by IPAQ grouping in all three groupings (P < 0.001). In the 3-group comparison, the ActiGraph MET-min/wk in the highest IPAQ group was significantly higher than in the other two groups (1186 vs 1402, P < 0.001; 1259 vs 1402, P < 0.05, respectively), but the difference in MET-min/wk between the other two groups was not significant (1186 vs 1259, P = 0.31). In the 4-group comparison, the ActiGraph MET-min/wk in the highest IPAQ group (group 4) was significantly higher than in groups 1 and 3 (1152 vs 1419, P < 0.001; 1266 vs 1419, P < 0.05, respectively), but the differences among the other three groups were not significant (1152 vs 1296 vs 1266, P = 0.06, P for trend = 0.28).

Discussion
Although the IPAQ has been recommended as a surveillance instrument, we argue that the validation studies of IPAQ do not generally provide strong empirical support for its validity compared against objective measures of physical activity [4,5,12,13,23,24]. The correlations of 0.30 [5] are far lower than the agreement between self-report and objective measurements of other health variables, such as smoking [25], body weight [26] or hypertension [27]. To rule out Simpson's paradox [28] (e.g., the correlation is positive within every group but becomes negative when the groups are pooled), we studied correlations of the IPAQ with an objective measurement in different subgroups. This would also indicate whether the questionnaire works better for certain subgroups. To our knowledge, this was the first study to examine how demographic factors and obesity affect the correlation, difference, and agreement between IPAQ and ActiGraph measurements. However, none of the subgroups showed an acceptable IPAQ-ActiGraph correlation, although the correlations did seem to be higher in certain groups (e.g., males and those with high VFL). The Spearman correlations for all groups in this study were positive, but lay at the lower end of the range of previously reported figures (-0.12 to 0.57) [5,29]. Based on our findings, we question the validity of IPAQ-SF when administered to Hong Kong Chinese respondents.
Contrary to our expectation, differences in age, work-related physical activity level, education, and BMI did not appear to influence the correlation between IPAQ and ActiGraph. Regarding the slightly higher correlation among those with higher VFL, we postulated that perhaps they were more conscious of their physical activities. In support of this, we found that respondents with higher VFL had higher variation in ActiGraph-measured total physical activity (SD = 811.1 vs 693.5 for the lower VFL group, P < 0.001), which may mean they had a more distinctive physical activity pattern that was easier to recall. The IPAQ-ActiGraph correlation was weak among those who did not have tertiary education and weaker among those who did (Table 3). However, the difference between the two correlations was not statistically significant (P > 0.05). On the other hand, in considering absolute differences between the two methods of measurement (Table 4), over-reporting by respondents without tertiary education was nearly double that of those with tertiary education (differences in MET-min/wk: 3317.7 vs 1803.9, P < 0.001). In terms of absolute difference, the IPAQ thus performed better among those with higher education.
Although over-reporting with activity questionnaires is ubiquitous and has been linked to social desirability bias [30], there are several possible explanations for why the correlations in our study were lower than those previously reported. First, we asked the respondents to wear the activity monitor after they had completed the IPAQ, while in other studies respondents were often asked to wear the device before they took the IPAQ. The latter approach could have yielded higher IPAQ-ActiGraph agreement, as the self-report responses may have been modified because of the increased awareness arising from wearing the activity monitor. Also, in our study the IPAQ recall period preceded the time when the ActiGraph was worn by one to two weeks. This difference in time period could have contributed to the lower correlation (0.16) compared to studies that used the same time period (0.30) [5]. However, given the stability of the IPAQ (3- to 7-day test-retest reliability: 0.81 [5]), we do not believe that having the same recall periods would have substantially altered the results.
Second, the IPAQ has been found to overestimate physical activity to a greater extent than other physical activity questionnaires, such as the Active Australia Survey and the U.S. Behavioral Risk Factor Surveillance System [24]. In this study, the IPAQ overestimated physical activity measured by the ActiGraph by 149% to 461% (mean 231%), which was similar to the finding previously reported in Hong Kong (173%) [12].
Third, how the ActiGraph was applied in different studies may have led to the differences in results. In this study, the respondents were instructed to remove the ActiGraph during aquatic activities because it is not waterproof. Therefore, movement during activities such as swimming would not have been captured. In addition, respondents were instructed to wear the ActiGraph on the hip, as suggested by Trost et al. [31]. Thus, the ActiGraph may not have accurately measured physical activity during which movement of the hip was limited, such as cycling. It has been reported that Hong Kong young adults swim and ride bicycles more often than older adults [32]. Because accelerometers underestimate these activities, this could explain our finding of a weak IPAQ-ActiGraph correlation in young adults. Furthermore, in a Hong Kong survey, swimming and cycling were the favorite sports activities of 11% and 6% of the respondents, respectively [33]. Thus, the underestimation of ActiGraph-measured physical activity may not have been negligible in this study. In sum, these three sources together probably reduced the IPAQ-ActiGraph correlation in this study.
In practice, physical activity measurements may be most relevant in grouping participants into different intensity levels of physical activity (e.g., into two or three groups). The conventional classification scheme is ≥ 150 minutes per week of physical activity of at least moderate intensity [5,22,24]. Based on this guideline, classification of activity by IPAQ and ActiGraph agreed closely (81.3%), although barely better than what could have been achieved by chance (79.6%). Furthermore, regardless of how the respondents were grouped, the IPAQ-ActiGraph agreements were only slightly better than by-chance agreement.
The IPAQ-ActiGraph agreement in classification was only slightly better than chance agreement, but the two measurements did seem to correlate better among those who were more physically active. There was a linear trend in ActiGraph-measured time when we grouped the respondents into three equal-sized groups by IPAQ. However, when the respondents were divided into four IPAQ groups, the intermediate groups were not clearly different in terms of their objectively measured activity levels. This agrees with a previous finding in Japan [34] that showed the IPAQ could only roughly classify mildly and moderately active respondents.
Our results provided some insights for possible modifications of the IPAQ-C. First, job-related physical activity level seemed to have had an effect on the difference between IPAQ and ActiGraph measurements. Those who worked in highly physically demanding jobs had the largest difference between their self-reported and ActiGraph-measured physical activity. In particular, they reported an average of 57.7 minutes of vigorous physical activity per day, which was over six times the vigorous physical activity reported by the other respondents. However, according to the ActiGraph, they did on average only 1.0 minute of vigorous physical activity per day, no more than the other respondents. Conceivably, however, the ActiGraph underestimated lifting activities. This raises the possibility that the accuracy of the IPAQ-C may be improved by separating physical activity into occupational and leisure types (as in the Global Physical Activity Questionnaire) [35]. Because respondents overestimated occupational physical activity more than other types of activity, reducing the weight of occupational activity may improve the accuracy of the IPAQ total MET score. Furthermore, separating physical activity into occupational and leisure types could allow researchers to analyze the benefits of physical activity, at work and at leisure, in relation to health [36].
Second, more detailed instructions [37] may be needed. For those with lower education, more concrete examples of different levels of physical activity intensity may be necessary, as our results indicated that this group had exaggerated their total physical activity more than the others.
The study had several limitations. First, those who agreed to wear the accelerometer might have been healthy volunteers, with different physical activity patterns from those who were less active, as the percentage of wearers who met the CDC-ACSM guideline was about double that of non-wearers. Also, those who were extremely active might have found it too much of a burden to wear the accelerometer and declined to participate. Nevertheless, the results indicated that, demographically, those who wore the accelerometer did not differ from the rest of the sample except for being slightly younger and more likely to have full-time employment. Second, although the accelerometer has been used as a gold standard for questionnaire validation [17][18][19], we did not have evidence for its validity or reliability in this study. Lastly, similar to other IPAQ validation studies, we adopted the cut-off points for intensity level suggested by Freedson et al. [21], which have not been validated in Chinese populations [12]. However, given our consistent results with the different classification schemes, we do not expect that different cut-off values would yield significantly different findings.

Conclusions
Although the IPAQ has been recommended and widely used, it has not been found to correlate highly with objective measurements of physical activity, and tends to overestimate MET scores. We investigated the criterion validity of the IPAQ in a Hong Kong Chinese population, grouping our sample by several different variables. We found that it performed poorly in most subgroups when compared to accelerometer data, but slightly better for the highly active respondents.
Despite such low correlations of the IPAQ with ActiGraph in the Chinese population, it is one of the easiest physical activity questionnaires to administer, with fewer than 10 questions [38]. A correlation of 0.3-0.4 is perhaps as close as can be expected for criterion validity of a physical activity questionnaire with 10 questions, against a mechanical device that detects body movement. Further research to improve the IPAQ is urgently needed.

Additional material
Additional file 1: Agreement between IPAQ-C and ActiGraph classification by CDC-ACSM physical activity guideline