Item response modeling: a psychometric assessment of the children’s fruit, vegetable, water, and physical activity self-efficacy scales among Chinese children

Background This study aimed to evaluate the psychometric properties of four self-efficacy scales (i.e., self-efficacy for fruit (FSE), vegetable (VSE), and water (WSE) intakes, and physical activity (PASE)) and to investigate their differences in item functioning across sex, age, and body weight status groups using item response modeling (IRM) and differential item functioning (DIF).

Methods Four self-efficacy scales were administered to 763 Hong Kong Chinese children (55.2% boys) aged 8-13 years. Classical test theory (CTT) was used to examine the reliability and factorial validity of the scales. IRM was conducted and DIF analyses were performed to assess the characteristics of item parameter estimates on the basis of children’s sex, age and body weight status.

Results All self-efficacy scales demonstrated adequate to excellent internal consistency reliability (Cronbach’s α: 0.79-0.91). One FSE misfit item and one PASE misfit item were detected. Small DIF was found for all the scale items across children’s age groups. Items with medium to large DIF were detected in different sex and body weight status groups, which will require modification. A Wright map revealed that items covered the range of the distribution of participants’ self-efficacy for each scale except VSE.

Conclusions Several items of the self-efficacy scales functioned differently by children’s sex and body weight status. Additional research is required to modify the four self-efficacy scales to minimize these moderating influences for application.


Background
The alarming rates of chronic diseases have been attributed to dietary habits and physical activity (PA) patterns [1,2]. Increasing fruit and vegetable consumption, replacing sweetened beverages with water, and engaging in sufficient PA facilitate chronic disease prevention [3]. Furthermore, dietary and PA practices tend to be established during childhood, which makes it a desirable period for fostering healthier habits [4].
Valid and reliable measures are needed to test the associations between self-efficacy and behavior and to examine the possible mediating effect of self-efficacy in behavior change programs. Levels of self-efficacy have been reported to differ significantly by children's sex, age, and body weight status [23-25]. True differences in the validity of the measurement scale may make it difficult to compare parameter estimates across these groups and across studies. Furthermore, understanding group-related differences in item validity across demographic or body weight status groups could help design interventions tailored to specific items in different groups and thereby enhance program effectiveness.
Classical test theory (CTT), the traditional method for evaluating scales, is sample-dependent and therefore cannot assess the functioning of item responses across different groups. Item response modeling (IRM) is a psychometric analysis method that provides model-based measurements. IRM links the individuals' difficulty of response to each item, provides the distribution of respondents across the scale, and enables differential item functioning (DIF) analysis [26]. While item functioning of children's FSE and VSE has been evaluated by sex and ethnic groups in American children [27], no study has analyzed item functioning across age and body weight status groups for FSE and VSE, conducted this kind of analysis for WSE and PASE, or done so among Chinese children.
This study evaluated the psychometric properties of FSE, VSE, WSE, and PASE and investigated item differences in their psychometric properties across sex, age, and body weight status groups using IRM and DIF.

Participants
The sample was from the validation study of the Physical Activity Questionnaire for Older Children among Chinese children [28]. Children (n = 798, 55.8% males) aged 8-13 years were recruited from six Hong Kong primary schools that agreed to participate in the study. The schools were located in different administrative districts with varied socio-economic status (SES) (two from high SES, one from medium SES, and three from low SES districts) according to local statistics [29]. Students were excluded if they had any contraindication to participating in PA or eating a normal diet. A subsample of 94 children (54.3% males) was randomly selected to complete the questionnaires twice within 7-10 days to assess the scales' test-retest reliability. The ethics committee of Hong Kong Baptist University approved this study.

Measures
A standard translation and back translation procedure was used with three bilingual language speakers (i.e., English and Cantonese). Minor wording revisions were made according to cognitive interviewing feedback from five primary students to ensure that target children could understand the instructions and items. All participants completed the questionnaire set in schools under the administration of research assistants.

Body weight status
Children's height and weight, measured by physical education teachers, were retrieved from the latest school records. Height was measured to the nearest 0.1 cm and weight was measured to the nearest 0.1 kg. Body mass index (BMI, kg/m²) was calculated as weight in kilograms divided by height in meters squared. According to international age- and sex-specific cutoff points, the body weight status of participating children was classified into underweight [30], healthy, overweight and obese [31] groups based on their BMI values.

Self-efficacy for fruit (FSE), vegetable (VSE) and water (WSE)
Validated self-efficacy scales for fruit, vegetable and water intakes were used to assess children's FSE, VSE and WSE [32]. The scales consisted of 12, 8, and 5 items, respectively, with dichotomous "sure" and "not sure" response categories, and demonstrated acceptable internal consistency for FSE (α = 0.75) and VSE (α = 0.70) and a marginal level of internal consistency for WSE (α = 0.55) in an American sample [32]. Construct validity was assessed through correlations among the self-efficacy scores and fruit and vegetable consumption, preferences and outcome expectancies (r = 0.10-0.21) [32]. Each item of the self-efficacy scales asked about the participant's confidence in consuming fruit, vegetables or water under diverse circumstances. A sample FSE item was "How sure are you that you can eat 1 portion of fruit for a snack at home at least four days a week?" A sample VSE item was "How sure are you that you can eat 3 portions of vegetables at least 4 days a week?" A sample WSE item was "How sure are you that you can drink 4 glasses or bottles of water for at least one day?" Considering item response difficulty, all items featured three response options in this study (1 = I am not sure; 2 = I am a little bit sure; 3 = I am very sure). The internal consistency in this sample was 0.86, 0.85, and 0.79 for FSE, VSE, and WSE, respectively.

Self-efficacy for physical activity (PASE)
Children's PASE was assessed by a validated Physical Activity Self-efficacy scale [33]. The scale had 12 items and demonstrated adequate internal consistency (α = 0.81) in the original validation study [33]. Weak but comparable correlations (r = 0.09-0.11) were found between PASE and minutes of moderate-to-vigorous activity. Similar to the FSE, VSE and WSE, children reported how sure they were that they could engage in PA in various conditions using three response categories (1 = I am not sure; 2 = I am sure a little; 3 = I am sure a lot). Sample items included "How sure are you that you can be physically active more than 30 minutes for at least 4 days a week, even when the weather outside is bad?" and "How sure are you that you can ask your friends to be physically active with you more than 30 minutes for at least 4 days a week?" The scale in this sample presented excellent internal consistency (α = 0.91).

Statistical analyses

Classical test theory (CTT)
First, CTT was used to evaluate the scales and item characteristics using SPSS 20.0 (IBM, Chicago, IL, USA). Item means were calculated to assess item difficulty. Cronbach's alpha coefficient (α) was computed to assess scale internal consistency; values greater than 0.70 are deemed acceptable for general research purposes [34]. Item discrimination was evaluated using corrected item total correlations (CITC) that were calculated by the correlation coefficients between the scores on the item and the sum of scores of all the other items in a scale. Poorly discriminating items were identified with CITC lower than 0.30 [35]. The intraclass correlation coefficient with a two-way random model was computed to determine test-retest reliability; a minimum threshold of 0.70 was considered adequate [36].
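The two CTT statistics described above can be illustrated with a short sketch. This is not the SPSS procedure used in the study: the function names and the small `responses` matrix are hypothetical, and sample variances (n − 1 denominator) are assumed throughout.

```python
from statistics import mean, variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    item_variances = [variance(col) for col in zip(*rows)]
    total_variance = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def corrected_item_total(rows):
    """CITC: correlate each item with the sum of all the other items."""
    k = len(rows[0])
    result = []
    for j in range(k):
        item = [r[j] for r in rows]
        rest = [sum(r) - r[j] for r in rows]  # total score without item j
        result.append(pearson(item, rest))
    return result


# Hypothetical 3-point responses (1 = not sure ... 3 = very sure), 6 children x 4 items.
responses = [
    [3, 3, 2, 3],
    [2, 2, 2, 1],
    [1, 1, 2, 1],
    [3, 2, 3, 3],
    [2, 3, 2, 2],
    [1, 2, 1, 1],
]
alpha = cronbach_alpha(responses)
citc = corrected_item_total(responses)
# Items below the 0.30 cutoff cited in the text would be flagged as poorly discriminating.
flagged = [j + 1 for j, r in enumerate(citc) if r < 0.30]
print(round(alpha, 3), [round(r, 3) for r in citc], flagged)
```

Removing the item from the total before correlating is what makes the statistic "corrected": it avoids inflating the correlation with the item's own score.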

Item response modeling (IRM)
Exploratory factor analysis was used to examine the primary assumption of IRM, unidimensionality, for each subscale. The assumption of unidimensionality was met if the scree plot showed one dominant factor, the first factor explained at least 20% of the scale variance, and the factor loadings were >0.30 [37].
IRM models illustrate respondents' latent trait based on their patterns of item responses. Both respondents' trait levels and items' psychometric properties are specified in IRM models. The degree of difficulty in agreeing with an item or endorsing a category is modeled as a function of person trait and item parameters. There are different mathematical forms of item characteristic functions and the number of parameters estimated for IRM models, but all IRM models include one or more item parameters to describe the probability of a certain score on an item, given a person's latent traits [38,39].
Polytomous IRM models are used when items present multiple response choices, such as in attitude surveys and personality assessment tests [40,41]. Only polytomous models are discussed here because the self-efficacy scale items present three response categories. For any item, polytomous models describe the probability of endorsing one response category over another. They include additional parameters, referred to as category boundaries, threshold parameters or step difficulties, which indicate the probabilities of responding at or above a given category. For an item with k response options, there are k-1 thresholds between the response options. For example, an item with three response options (I am not sure, I am a little bit sure, and I am very sure) requires two threshold estimates: (1) the step from "I am not sure" to "I am a little bit sure", and (2) the step from "I am a little bit sure" to "I am very sure". One goal of fitting a polytomous model is to determine the location of such thresholds along the latent trait continuum.
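To make the role of the k-1 step parameters concrete, the following sketch computes category probabilities for a single three-category item under the standard partial credit parameterization. The function name and the step difficulties (−1 and 1 logits) are hypothetical, and the adjacent-category step parameters used here are not necessarily the threshold values a particular software package reports.

```python
import math


def pcm_probabilities(theta, step_difficulties):
    """Category probabilities under the partial credit model.

    theta: person trait level in logits.
    step_difficulties: k-1 step parameters for an item with k categories.
    """
    # Exponent for category x is the sum over steps 1..x of (theta - delta_j);
    # the lowest category has an empty sum, i.e. 0.
    exponents = [0.0]
    for delta in step_difficulties:
        exponents.append(exponents[-1] + (theta - delta))
    weights = [math.exp(v) for v in exponents]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical item: step 1 ("not sure" -> "a little bit sure") at -1 logit,
# step 2 ("a little bit sure" -> "very sure") at +1 logit.
steps = [-1.0, 1.0]
for theta in (-2.0, 0.0, 2.0):
    probs = pcm_probabilities(theta, steps)
    print(theta, [round(p, 3) for p in probs])
```

In this parameterization a person whose trait level equals a step difficulty is equally likely to choose either of the two adjacent categories, which is one way to read the location of a step on the latent continuum.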
Due to the number of subscales and response categories, a multidimensional polytomous model was selected to assess respondents' latent traits. Two polytomous models were considered: the partial credit model (PCM) [42] and the rating scale model (RSM) [43,44]. The RSM is a special case of the PCM in which the response scale is fixed for all items; that is, the response threshold parameters are assumed to be identical across items. For the present study, the final choice of a model was determined by comparing the deviance of the two competing multidimensional polytomous models using a chi-square test.
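The deviance comparison can be sketched as follows, using the deviances and parameter counts reported under "IRM model fit" in this article. The function names are hypothetical, and the closed-form tail probability is valid only for an even number of degrees of freedom, which happens to hold here; a statistics library would normally be used instead.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k  # builds (x/2)^k / k!
        total += term
    return math.exp(-half) * total


def deviance_test(dev_restricted, n_par_restricted, dev_general, n_par_general):
    """Likelihood-ratio style comparison of two nested models."""
    statistic = dev_restricted - dev_general
    df = n_par_general - n_par_restricted
    return statistic, df, chi2_upper_tail_even_df(statistic, df)


# Values as reported in this article: RSM (restricted) vs. PCM (general).
statistic, df, p_value = deviance_test(46107.92, 48, 45903.92, 84)
print(round(statistic, 2), df, p_value < 0.001)
```

A small p-value means the extra threshold parameters of the more general model reduce the deviance by more than chance alone would be expected to.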
Item fit was evaluated using infit and outfit mean square item fit indices (MNSQ) which have nonnegative values. Infit is an information-weighted form of outfit. Infit MNSQ (information-weighted fit statistic) and outfit MNSQ (outlier-sensitive fit statistic) are based on information-weighted sum of squared standardized residuals and non-weighted sum of squared standardized residuals, respectively [45]. An infit or outfit MNSQ value of around one suggests the observed variance is similar to the expected variance. Mean square values greater than one or smaller than one indicate the observed variance is greater or smaller than expected, respectively. Infit or outfit MNSQ values greater than 1.3 indicate poor item fit when sample size is smaller than 500 [46]. With respect to thresholds, outfit MNSQ values greater than 2.0 indicate misfits, identifying candidates for collapsing with a neighboring category [45,47].
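The difference between the two fit indices can be shown with a minimal sketch. It assumes that model-expected scores and model variances for each person on one item are already available from a fitted model; the function name and the numbers below are hypothetical, not output from the study.

```python
def item_fit(observed, expected, variances):
    """Return (infit, outfit) mean squares for one item.

    observed:  observed item scores, one per person
    expected:  model-expected scores for the same persons
    variances: model variances of those scores
    """
    squared_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / w for r, w in zip(squared_residuals, variances)) / len(observed)
    # Infit: squared residuals summed and divided by the summed variances,
    # so persons with more information (larger variance) carry more weight.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit


# Hypothetical values for five persons on a 0-2 scored item.
observed = [0, 1, 2, 2, 0]
expected = [0.4, 1.1, 1.6, 1.2, 0.2]
variances = [0.30, 0.50, 0.35, 0.55, 0.18]
infit, outfit = item_fit(observed, expected, variances)
print(round(infit, 3), round(outfit, 3))
```

Because outfit gives every person equal weight, a single unexpected response from a person far from the item's difficulty moves it more than it moves infit.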
Item-person maps, often called Wright maps (with units in log odds, or logits), present the distributions of the scale items and of the respondents on the same scale. Person, item and threshold estimates were placed in the same map, where "x" on the left side represented the distribution of person trait estimates along the self-efficacy continuum, with the student scoring the highest self-efficacy placed at the top of the figure. Item and threshold difficulties were presented on the right side, with the more difficult response items and categories placed at the top. I k denotes threshold k for item I.

Differential item functioning (DIF)
Participants with the same underlying trait level may have different probabilities of endorsing an item. DIF indicates that an item performs differently between groups of individuals. For example, a finding of DIF by sex means that a male and a female with the same latent trait level responded differently to an item, indicating that the interpretation of the item differed between males and females. DIF was assessed by adding a group main effect and an item-by-group interaction term to the model [27,48-50]. Whether an overall scale demonstrated DIF was indicated by a significant chi-square for the item-by-group interaction term. The ratio of each item-by-group parameter estimate to its standard error identified which items displayed DIF: DIF was indicated when the absolute value of this ratio exceeded 1.96. The magnitude of DIF was determined by examining the differences between the item-by-group interaction parameter estimates. Because the sum of the parameters was constrained to be zero, if only two groups were considered, the magnitude of the DIF difference was twice the estimate for the reference group. For example, if the estimate of the sex-by-item effect for Item 1 for males was −0.2, then the estimate for Item 1 for females was 0.2, and the difference in item difficulty between males and females was −0.4. If the comparison was made among three or more groups, the magnitude of DIF was the difference between the estimates of the corresponding groups. Items that displayed statistically significant DIF were placed into one of three categories depending on the effect size: small DIF (difference < 0.426), intermediate DIF (0.426 < difference < 0.638), and large DIF (difference > 0.638) [51,52]. ACER ConQuest [53] was used for all IRM analyses.
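The two-group decision rule described above can be written as a short sketch. The function name and the standard error of 0.05 are hypothetical, the magnitude is reported as an absolute value, and the handling of differences exactly equal to a cutoff is an assumption, since the text gives strict inequalities only.

```python
def classify_dif(estimate, std_error, z_crit=1.96, small_cut=0.426, large_cut=0.638):
    """Classify DIF for one item in a two-group comparison.

    estimate:  item-by-group interaction estimate for the reference group (logits)
    std_error: its standard error
    Returns (magnitude, label); magnitude is the absolute between-group difference.
    """
    significant = abs(estimate / std_error) > z_crit
    # Estimates are constrained to sum to zero across the two groups,
    # so the between-group difference is twice the reference-group estimate.
    magnitude = abs(2 * estimate)
    if not significant:
        label = "none"
    elif magnitude < small_cut:
        label = "small"
    elif magnitude < large_cut:  # boundary handling is an assumption
        label = "intermediate"
    else:
        label = "large"
    return magnitude, label


# The worked example in the text: an estimate of -0.2 for one group.
# The standard error of 0.05 is hypothetical.
print(classify_dif(-0.2, 0.05))
```

Separating the significance check from the effect-size label mirrors the text: an item is categorized only if its estimate is statistically distinguishable from zero.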

Descriptive statistics
Participants' characteristics are shown in Table 1. Thirty-five children (4.4%) did not complete any of the items and were excluded from analyses, resulting in a sample of 763 children (55.2% boys). Participants were classified into younger children aged 8-10 years (43.5%) and older children aged 11-13 years (56.5%). Body weight status was categorized into three groups with 96 (13.1%) underweight children, 417 (56.8%) children with healthy weight, and 221 (30.1%) overweight/obese children.

Classical test theory (CTT)
The percentages of variance explained by the one-factor solution were 39.7%, 49.0%, 54.5% and 49.7% for FSE, VSE, WSE and PASE, respectively. Each scree plot revealed one dominant factor and factor loadings were higher than 0.30 for all the scales.

IRM model fit
The relative fit of the multidimensional RSM and the multidimensional PCM was evaluated by considering the deviance difference, where df was equal to the difference in the number of estimated parameters between the two models. The chi-square (χ²) deviance statistic was calculated by considering differences in model deviances (RSM: 46,107.92; PCM: 45,903.92) and differences in numbers of parameters (RSM: 48; PCM: 84) for the nested models. The chi-square test of the deviance difference favored the PCM, which was used for the subsequent analyses.

Table 2 Item description and estimates of differential item functioning where significant

Table 4 presents the PCM item-person maps. The participants' self-efficacy estimates (confidence for fruit, vegetable, and water intakes, and PA engagement) and the item and item threshold difficulty distributions are on the same logit scale. The difficulty distribution is ideally presented with a normal distribution from −3.0 to +3.0. As shown in the figure, FSE and VSE approached a normal distribution. There were small portions of participants with higher and lower levels of WSE and PASE (logits > 3.0 or < −3.0). The items were distributed in the centre of the Wright diagram. Item difficulties ranged from −0.719 to 1.171 logits for FSE, from −0.841 to 0.556 for VSE, from −0.413 to 0.345 for WSE, and from −1.515 to 0.748 for PASE. The distributions of item thresholds and person measures nearly overlapped (indicating that the full distribution of individuals was measured by items across the whole distribution, as desired) for three of the self-efficacy scales, the exception being VSE. Participants at the lower and higher ends of VSE did not coincide with the items' first and second thresholds.

Differential item functioning (DIF)

Children's sex groups
Item difficulty differences across sex, age, and body weight status groups are presented in Table 2. Across sex groups, small DIF was detected for items 1, 5, 7, 8, and 10 of FSE, and moderate DIF for item 11. Among these items, boys found it easier to endorse items 10 and 11, but more difficult to endorse the others. Only item 6 in VSE had significant DIF by sex, at −0.20, a small DIF effect: it was easier for boys to endorse item 6. Item 1 of WSE showed a small DIF effect and was easier for girls. Five items had significant DIF in PASE (small: item 10; moderate: item 2; large: items 1, 3, and 4). It was easier for boys to endorse items 3, 4, and 10.

Children's age groups
Older children aged 11-13 years were more likely to endorse item 5 (small DIF at 0.18) and item 7 (small DIF at 0.25) in FSE, but less likely to endorse item 11 (small DIF at −0.30). Two items each had small DIF between age groups in VSE (items 5 and 6) and WSE (items 2 and 3). Older children found it somewhat easier to endorse item 5 of VSE and item 2 of WSE. Small DIF was indicated for six items (items 1, 2, 3, 5, 9, 10) of PASE between younger and older children. It was easier for older children to endorse items 1, 3, and 5.

Children's body weight status
Between underweight and healthy weight children, small DIF was detected for items 2 (easier for healthy weight children) and 9 of FSE, items 2 (easier for healthy weight children) and 4 of VSE, items 1 and 4 (easier for healthy weight children) of WSE, and items 3 (easier for healthy weight children) and 6 of PASE; medium DIF was detected for items 1 and 6 (easier for healthy weight children) of VSE and item 5 (easier for healthy weight children) of WSE. In the comparison of underweight and overweight/obese children, items 7 (easier for underweight children) and 11 of FSE, items 2, 4 (easier for underweight children) and 5 of VSE, item 1 (easier for underweight children) of WSE, and items 1, 2, 4, 5 and 8 of PASE (easier for underweight children for items 1, 2, and 8) showed small DIF; items 1 (easier for underweight children) and 6 of VSE, item 5 of WSE, and item 3 of PASE showed medium DIF. Between healthy weight and overweight/obese children, small DIF was indicated for items 2, 7, 10, and 11 of FSE (easier for healthy weight children for items 2 and 7), item 5 of VSE, and items 3, 4, and 5 of PASE; medium DIF was indicated for items 1 and 2 (both easier for healthy weight children) of PASE. No large DIF was found across body weight status groups.

Discussion
The present study investigated the psychometric properties of FSE, VSE, WSE and PASE scales using CTT and IRM, and their stability across sex, age and body weight status groups based on IRM using the partial credit model. CTT results showed that the examined scales had adequate to excellent internal consistency and adequate test-retest reliability. The item difficulties were moderately easy to difficult. Items in the scales were considered discriminating. The symmetric distribution of items and item thresholds for individuals from the Wright map indicated the utilization of three-point responses nearly covered the participants from low to high levels of each self-efficacy scale except VSE, suggesting the items in VSE should be revised or new ones developed to cover the more difficult and easy levels.

Table 3 Item description, item difficulty, and misfit item(s)

Note to the item-person map: On the left side of the figure, the participants' self-efficacy estimates were placed on the map with "X" following the outermost left column of the logit scale. "X" indicated the trait estimates of persons. Each "X" represents 8.1 cases. Higher self-efficacies were placed at the top of the column, while the lower self-efficacies were located at the bottom of the column (labeled "Extremely low self-efficacy"). Item and threshold difficulties were presented on the right side of the figure, with the top locations indicating the more difficult to endorse responses (the bottom is labeled "Extremely easy to endorse response").
One item (item 1) in VSE and one item (item 1) in PASE were identified as misfit items. These items also exhibited DIF across different groups. Item 1 of VSE (i.e., "How sure are you that you can eat 1 portion of a vegetable at lunch at least one time on a school day?") and item 1 of PASE (i.e., "How sure are you that you have the ability to do physical activities like running, dancing, bicycling, or jumping rope?") showed moderate DIF on the basis of children's body weight status. Compared with overweight/obese children, underweight children found it easier to endorse eating 1 portion of a vegetable at least once on a school day. Children with healthy weight found it easier than overweight and obese children to endorse engaging in various kinds of PA. These findings suggest that children's perceived confidence to comply with a healthy lifestyle differed by body weight status, consistent with previous studies [25,54,55]. Since these two items did not behave the same way across these groups, they should be substantially revised or deleted from the scales. Item difficulties also differed by children's sex. Given that items with small DIF are generally not of major concern [56], we only discuss items with medium or large DIF because they require more attention in future studies. Ignoring small DIF effects, there was moderate DIF for item 11 of FSE and item 2 of PASE, and large DIF for items 3 and 5 of PASE. Boys showed higher confidence than girls that they could participate in team sports (e.g., basketball, softball), but not in flexibility/rhythm-related activities (e.g., dancing, jumping rope). These DIF findings suggest sex-specific tailoring of interventions based on boys' and girls' differences in food and activity preferences, as suggested by existing research [57,58].
DIF across demographic variables could be due to differences in the ability to comprehend the meaning of specific items or to actual differences in the efficacy level to adopt healthy eating behaviors or engage in PA. Moderate DIF across body weight status groups and moderate to large DIF across sex groups indicate the need to re-check and revise items to produce non-significant DIF or reduce DIF to a considerably lower level [59]. Developing sex-specific and body weight status-specific self-efficacy scales should be considered.
VSE items and thresholds did not cover the most and least difficult to endorse ends of confidence. This may require rewriting existing items or adding new items to extend the ends of the distribution of items and thresholds. For example, a VSE item of average difficulty, "I can eat 1 portion of a vegetable at lunch at least one time on a school day", might be revised into "I can eat 1 portion of a vegetable at lunch at least three times on school days", which would appear to have greater difficulty. An item with high difficulty, e.g., "I can eat 3 portions of vegetables at least 4 days a week", could be transformed into one with possibly low difficulty, e.g., "I can eat 3 portions of vegetables at least one day a week".
In this study, WSE contained 5 items and the item difficulties ranged from −0.413 to 0.345 logits. WSE showed a narrower item distribution than the other three scales. To cover a wider range of the latent trait, more diverse WSE items should be developed in future studies. For example, items could address confidence in overcoming different types of barriers to drinking more water [32] (e.g., social impediments [60], referred to as coping SE [61], or emotional state). Additionally, types of items that could enhance the distributional properties could also be examined in the future.
Several limitations of the study should be mentioned. Even though existing and previously validated instruments were used and demonstrated good internal consistency in this study, evidence for the validity of the scales is not available among the target children. Further validation studies should be implemented to evaluate the application of the scales in different cultural settings among Chinese children (e.g., children from urban and rural areas in mainland China). Furthermore, IRM's complexity requires a large sample size. Recommendations have ranged from 200 per group [62] to 500 per group [63]. Possible limitations of small sample size should be acknowledged in the current study. Further investigation should retest the findings by recruiting more participants. Moreover, further investigation could be undertaken with other DIF-detection procedures (e.g., non-uniform differential item functioning).

Conclusion
FSE, VSE, WSE and PASE demonstrated acceptable factorial validity, test-retest reliability, and adequate to excellent internal consistency by CTT. IRM provided useful insights into item difficulty estimates that were not dependent on the sample. The latent variables indicated adequate fit to the data; however, the items and thresholds did not adequately cover the easier and more difficult to endorse ends of VSE. A revised VSE questionnaire is needed to provide the full range of self-efficacy difficulty estimates. Several items of the four examined self-efficacy scales exhibited moderate or large differential item functioning on the basis of children's sex and body weight status. Additional psychometric work remains to be done; in the meantime, the scales can be used in diverse groups with due caution. Further formative work on the questionnaires is necessary.