A systematic review of tools designed for teacher proxy-report of children’s physical literacy or constituting elements
International Journal of Behavioral Nutrition and Physical Activity volume 18, Article number: 131 (2021)
Physical literacy (PL) in childhood is essential for a healthy active lifestyle, with teachers playing a critical role in guiding its development. Teachers can assist children to acquire the skills, confidence, and creativity required to perform diverse movements and physical activities. However, to detect and directly intervene on the aspects of children’s PL that are suboptimal, teachers require valid and reliable measures. This systematic review critically evaluates the psychometric properties of teacher proxy-report instruments for assessing one or more of the 30 elements within the four domains (physical, psychological, cognitive, social) of the Australian Physical Literacy Framework (APLF), in children aged 5–12 years. Secondary aims were to: examine alignment of each measure (and relevant items) with the APLF and provide recommendations for teachers in assessing PL.
Seven electronic databases (Academic Search Complete, CINAHL Complete, Education Source, Global Health, MEDLINE Complete, PsycINFO, and SPORTDiscus) were systematically searched originally in October 2019, with an updated search in April 2021. Eligible studies were peer-reviewed English language publications that sampled a population of children with mean age between 5 and 12 years and focused on developing and evaluating at least one psychometric property of a teacher proxy-report instrument for assessing one or more of the 30 APLF elements. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidance was followed for the conduct and reporting of this review. The methodological quality of included studies and quality of psychometric properties of identified tools were evaluated using the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) guidance. Alignment of each measure (and relevant items) with the APLF domains and 30 elements was appraised.
Database searches generated 61,412 citations, which were reduced to 41 studies that evaluated the psychometric properties of 24 teacher proxy-report tools. Six tools were classified as single domain measures (i.e. assessing a single domain of the APLF), eleven as dual-domain measures, and seven as tri-domain measures. No single tool captured all four domains and 30 elements of the APLF. Tools contained items that aligned with all physical, psychological, and social elements; however, four cognitive elements were not addressed by any measure. No tool was assessed for all nine psychometric properties outlined by COSMIN. Included studies reported a median of three out of nine psychometric properties. The most commonly reported psychometric properties were construct validity (n = 32; 78% of studies), structural validity (n = 26; 63% of studies), and internal consistency (n = 25; 61% of studies). Content validity, cross-cultural validity, measurement error, and responsiveness were underreported. Psychometric data across tools were mostly indeterminate for construct validity, structural validity, and internal consistency.
There is limited evidence to fully support the use of a specific teacher proxy-report tool in practice. Further psychometric testing and detailed reporting of methodological aspects in future validity and reliability studies are needed. Tools have been designed to assess some elements of the framework. However, no comprehensive teacher proxy-report tool exists to assess all 30 elements of the APLF, demonstrating the need for a new tool. We recommend that such a tool be developed and psychometrically tested.
This systematic review was registered in the PROSPERO international prospective register of systematic reviews, with registration number CRD42019130936.
Adequate levels of physical activity during childhood are associated with considerable health benefits (e.g., improvement in physical fitness, academic performance, cognition, and executive functioning) [1,2,3]. Yet, fewer than 40% of children in many countries accumulate the levels of physical activity necessary for optimal health. The concept of physical literacy (PL) has been explored in multiple sectors including physical education, sports, recreation, and public health, as a framework to better understand the declining levels of physical activity [5, 6]. Growing empirical evidence has demonstrated that PL, or its components, are associated with adherence to physical activity and sedentary behaviour guidelines, increased cardiorespiratory fitness, resilience, and other health indices (including body composition, blood pressure, and health-related quality of life) in school-aged children.
Of particular interest when determining PL levels are school-aged children (aged 5–12 years), as the literature suggests that childhood is a critical developmental period for the formation of skills and attributes (e.g., motor competence) that underlie lifelong physical activity habits [7, 11]. The school setting has been recognized as a suitable environment that affords children diverse opportunities that can help foster healthy physically active lifestyles, independent of their culture and socioeconomic status. From this equity perspective, schools are also effective sites for targeted physical activity interventions due to the large amount of time children spend attending schools. Teachers (particularly physical educators) have been identified as key players in guiding children’s PL development. They can support PL education, conceptualized as the “teaching and learning of the skills, knowledge, attitudes, and behaviours that enhance the responsibility for engagement in lifelong active lifestyles”. Teachers are also trained to be sensitive to the needs of each child and have a broad basis for knowing their students as they interact with a large number of different children, and thus have a frame of reference on which to base their judgements. Therefore, teachers may be well suited to identify elements (such as motor competence, motivation and confidence) of a child’s PL. For such identification, valid and reliable PL teacher assessment protocols are required.
Recently, PL scholarship has been directed towards designing assessment tools (both subjective and objective) for different targeted users (including preschoolers, children, youth, teachers, parents). Indeed, assessment is crucial to the planning and evaluation of programs targeted at enhancing PL levels, and could help identify domains of a child’s PL that are suboptimal . As such, following Robinson and Randall , an effective PL assessment protocol should address all of its constituting domains (e.g., affective, behavioural, physical, and cognitive). However, few protocols have been designed specifically for use by teachers to evaluate children’s PL . Examples include the PLAYfun and basic ; the CAPL via the Canadian Agility and Movement Skill Assessment (CAMSA) and fitness tests ; and the PFL via fitness and movement skills tests [22, 23]. These existing teacher assessment tools largely utilize objective observational approaches (i.e. rely on the teacher observing children perform a series of standardized tasks)  rather than teacher proxy-report, and have narrowly focused on the physical domain, thereby neglecting the psychological, social, and cognitive aspects of PL. Comparatively, teacher proxy-report instruments (retrospectively completed questionnaires) have received much less attention despite their suitability for assessing large sample sizes and their minimal manual data entry requirements [25, 26]. Literature has further suggested that teacher proxy-reporting presents a promising avenue to obtain more reliable estimates of a child’s PL, as children under 10 often present with limited cognitive ability to make accurate judgements of their own capabilities .
More specifically, a notable gap in PL assessment is the paucity of teacher proxy-report measures that recognize components of the expansive and comprehensive Australian Physical Literacy Framework (APLF). In 2016, after acknowledging the lack of international consensus on PL’s definition, conceptualization, and operationalization, Sport Australia (a Federal Government agency responsible for supporting sport in Australia) proposed arguably the most comprehensive definition and framework for PL to date. See Keegan et al. for a detailed articulation of the Australian definition. The APLF identified a combined total of 30 elements spanning four major domains (physical, psychological, social, and cognitive) as being fundamental to PL development (Fig. 1). For the purpose of this manuscript, the authors adopt the comprehensive PL definition and framework offered by Sport Australia.
To date, only two systematic reviews have been published in relation to PL assessment [31, 32]. In Edwards et al.’s review, PL assessment/measurement approaches were broadly categorized as qualitative and quantitative. Though quantitative measures for PL and its related constructs were identified, the review did not engage in a detailed and in-depth analysis of the psychometric properties of the measures. Furthermore, the search strategy utilized by the authors did not address each individual element (e.g., motivation, confidence, movement skills) of PL, including those belonging to the APLF. More recently, Kaioglou and Venetsanou reviewed existing PL measures used within the context of gymnastics. As in Edwards et al., search terms did not capture individual elements of PL (including APLF elements). Hence, only tools for assessing PL in its entirety were identified (e.g., Canadian Assessment of Physical Literacy [CAPL]; Passport for Life [PFL]; Physical Literacy Assessment for Youth [PLAY]). Neither review focused specifically on identifying teacher proxy-report measures for PL or its constituting elements. Barnett et al. have suggested that teachers have limited guidance when choosing appropriate protocols for assessing PL.
Taking all this into account, the objectives of the current systematic review were two-fold. The primary aim was to critically evaluate the psychometric properties of teacher proxy-report instruments for assessing one or more of the 30 elements within the four domains of the APLF, in children aged 5–12 years. Secondary aims were to examine the alignment of each tool (and relevant items within) with the APLF and provide recommendations for teachers in assessing PL in children aged 5–12 years. A review of this nature will assist teachers (and indeed researchers) in making informed decisions when selecting suitable and psychometrically sound measures for assessing elements within the APLF.
Literature search strategy
The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)  and the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) guidelines [35,36,37] were used as methodological and reporting guidelines for this systematic review. See completed PRISMA checklist attached as Additional file 1. Prior to review commencement, details of the review protocol were registered on PROSPERO (CRD42019130936). The first author systematically searched for peer-reviewed articles on seven databases including Academic Search Complete, CINAHL Complete, Education Source, Global Health, MEDLINE Complete, PsycINFO, and SPORTDiscus. These databases encompass areas related to psychology (including psychometrics), education, sport, and health, and were deemed relevant to the comprehensive definition/framework of PL used in this review, and therefore enhanced the likelihood of identifying relevant papers from many diverse disciplines. Date restrictions were not applied to searches. Database searches were originally completed in October 2019 and updated in April 2021. All searches were limited to title, abstract, and keyword. Additional limits of “English language” and “peer review” were applied. To ensure that search terms were not overly simplistic, a comprehensive search filter containing a selection of search terms provided by the COSMIN for finding studies on measurement properties, combined with search terms relevant to the 30 APLF elements (identified from published systematic reviews) were utilized to identify studies concerning the target population (see Additional file 2 for the full search strategy). Reference lists of literature reviews and eligible studies were also searched for additional papers. All searches were performed by the first author with the assistance of the university’s librarian.
Studies were included if they were: (a) peer-reviewed and written in English Language; (b) study participants included children with mean age between 5 and 12 years; (c) focused on developing and evaluating at least one psychometric property of a teacher proxy-report instrument; and (d) instruments assessed one or more of the 30 elements within the APLF. Because the application of PL goes beyond the context of physical education and encompasses before- and after-school programming, recess, and classroom activities [38, 39] and could be applied in performing arts , teacher proxy-report instruments that assessed elements in general contexts (not just in sport and physical activity) were included. For example, instruments assessing “self-regulation” in general, and those assessing self-regulation in the context of physical activity were included.
Studies were excluded if they were: (a) tool manual(s), abstracts (including poster abstracts), conference proceedings, dissertations, commentaries, editorials, review articles, and letters; (b) utilized assessment formats other than teacher proxy-report (e.g., self-report, objective measures); (c) study participants were younger than five and older than 12 years; and (d) utilized proxy-respondents of children not in elementary or primary school, younger than five and older than 12 years. In registering the protocol for this review, it was our initial intention to exclude studies that involved non-typically developing children (such as those with learning difficulties or developmental delay). However, following the literature search, we noted that most teacher proxy-report tools for motor competence (related to the physical domain of PL) were originally designed with the intention of identifying children with developmental coordination disorder (DCD), and in some cases included participants with DCD (for instance, when assessing discriminant validity). As such, these tools were retained in order to ensure motor competence teacher proxy-report measures were not excluded from the review. Measures developed to assess children with other disabilities (i.e. those in relation to elements other than motor competence) were excluded from the review.
Titles and abstracts were exported to Covidence (www.covidence.org), an online software for managing systematic reviews. Following removal of duplicates, the first author screened all titles and abstracts for eligibility, based on the aforementioned criteria. Full text articles were retrieved for further examination where it was not possible to make inclusion decisions based solely on the title and abstract. Following initial selection, full-text articles were independently examined by paired combinations of three review authors (IE - NL, IE - LB, and NL - LB). For consistency, a PICO-based hierarchy of exclusion reasons was developed based on past literature , and used to guide the exclusion of studies during the full text review phase (see Additional file 3). Any conflicts between the three reviewers over study inclusion were resolved via review and discussion.
In line with the criteria proposed by COSMIN, data collection involved extracting information on the general characteristics of included studies as follows: (a) instrument, author(s) and year of publication; (b) general construct assessed; (c) APLF domain(s) assessed; (d) targeted age group/grades; (e) sample population/country; (f) sample size, mean age, standard deviation; (g) instrument available translation; (h) completion time (minutes or seconds); (i) recall period; (j) tool subscale(s)/number of items; (k) response options; (l) psychometric properties evaluated/statistical tests utilized. The data extraction form was piloted on two randomly selected included studies prior to data collection by IE. JM checked all extracted data for completeness and correctness.
Methodological quality assessment of studies
Following COSMIN’s recommendations, the current review assessed nine measurement properties: (a) content validity, (b) structural validity, (c) internal consistency, (d) cross-cultural validity, (e) reliability, (f) measurement error, (g) criterion validity, (h) construct validity, and (i) responsiveness (see Prinsen et al. for a definition of each term). To evaluate the methodological quality of the selected studies, the recently updated COSMIN Risk of Bias checklist [35, 37], which contains 10 boxes, was utilized. Each box of the checklist comprises 3 to 35 standards for evaluating the statistical design and statistical methods utilized in reliability and validity studies. To date, the COSMIN checklist is the only validated and standardized tool for assessing the methodological quality of health-related outcome measures.
Depending on the information reported in each study, items in each box of the checklist were rated on a four-point scale using the descriptors “Very Good”, “Adequate”, “Doubtful”, and “Inadequate”. A “Not Applicable” option was also included for each measurement property. To determine the overall methodological quality for each individual measurement property per study, the lowest rating across the items in the box was taken, a method known as the “worst score counts” principle. For example, if for a reliability study one item in a box is rated as “Inadequate” despite all other items being rated as “Very Good”, the overall methodological quality of that reliability study will be “Inadequate”. According to COSMIN, this stringent rule is necessary as poor methodological aspects of a study cannot be compensated for by good aspects. To ensure accuracy of the quality assessment, IE completed risk of bias analyses for 22 of the included studies. The articles were then double rated by two independent reviewers (NL, LB) who had both received training on using COSMIN. After disagreements were resolved, IE completed quality assessment for the remaining articles. To summarize the results of methodological quality per tool, the authors used a cut-off of ≥60% of measurement properties rated as “Very Good” or “Adequate” across all single studies to indicate “good” methodological quality.
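The two scoring rules described above (the “worst score counts” principle and the ≥60% cut-off) can be illustrated with a minimal Python sketch. The function names and the handling of “Not Applicable” ratings (ignored here) are assumptions made for illustration only; they are not taken from COSMIN or from the review’s own procedures.

```python
# Ratings ordered from worst to best, as used on the four-point scale.
RATING_ORDER = ["Inadequate", "Doubtful", "Adequate", "Very Good"]


def box_quality(item_ratings):
    """Return the lowest item rating in a checklist box ("worst score counts").

    Assumption: "Not Applicable" items are ignored; a box with no
    applicable items is itself reported as "Not Applicable".
    """
    applicable = [r for r in item_ratings if r != "Not Applicable"]
    if not applicable:
        return "Not Applicable"
    return min(applicable, key=RATING_ORDER.index)


def tool_has_good_quality(property_ratings, cutoff=0.60):
    """Check whether at least `cutoff` of a tool's rated measurement
    properties (across its single studies) are "Very Good" or "Adequate"."""
    rated = [r for r in property_ratings if r != "Not Applicable"]
    if not rated:
        return False
    good = sum(r in ("Very Good", "Adequate") for r in rated)
    return good / len(rated) >= cutoff


# A single "Inadequate" item determines the rating of the whole box.
print(box_quality(["Very Good", "Very Good", "Inadequate"]))
# Two of three rated properties are favourable, which meets the 60% cut-off.
print(tool_has_good_quality(["Very Good", "Adequate", "Doubtful"]))
```

Taking the minimum rather than an average mirrors the rationale given in the text: a poor methodological aspect cannot be offset by good ones.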
Quality criteria for measurement properties of single studies and evidence summary
Results obtained from single studies on measurement properties were rated against COSMIN’s updated criteria for good measurement properties. Each result was rated as either sufficient (+), insufficient (−), or indeterminate (?). For studies reporting on content validity, the quality of the results was rated using the criteria for relevance (5), comprehensiveness (1), and comprehensibility (4). Regarding hypothesis testing for construct validity and responsiveness, COSMIN recommends formulating hypotheses a priori, before the review commences. Following De Vet et al., for both measurement properties, correlations were expected to be: ≥ 0.50 with instruments measuring similar constructs; < 0.50 and ≥ 0.30 with instruments measuring related but dissimilar constructs; and < 0.30 with instruments measuring unrelated constructs. No hypotheses were formulated for expected differences between groups (e.g., age, gender) for discriminant and known-groups validity.
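As an illustration of the correlation hypotheses above, the following Python sketch checks whether an observed correlation falls within the expected band for a given type of comparator instrument. The function name and the labels "similar", "related", and "unrelated" are hypothetical, and the use of the absolute value of the correlation is an assumption; the text does not state how negative correlations were treated.

```python
def hypothesis_confirmed(r, relationship):
    """Return True if correlation `r` lies in the band expected for the
    given relationship between the tool and the comparator instrument.

    Assumption: the magnitude |r| is compared against the thresholds.
    """
    magnitude = abs(r)
    if relationship == "similar":      # instruments measuring similar constructs
        return magnitude >= 0.50
    if relationship == "related":      # related but dissimilar constructs
        return 0.30 <= magnitude < 0.50
    if relationship == "unrelated":    # unrelated constructs
        return magnitude < 0.30
    raise ValueError(f"unknown relationship: {relationship!r}")


# A correlation of 0.41 supports a "related" hypothesis but not a "similar" one.
print(hypothesis_confirmed(0.41, "related"))
print(hypothesis_confirmed(0.41, "similar"))
```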
Due to considerable differences across studies in terms of sample characteristics and size, statistical tests utilized, reliability or validity type investigated, results from single studies could not be pooled in a meta-analysis. Therefore, as recommended by the COSMIN, an overall rating of study results per measurement property per tool was summarized as sufficient (+), insufficient (−), indeterminate (?), or inconsistent (±). Specifically, an overall rating was determined through combining the scoring of each single study; if ≥75% of the studies displayed the same scoring, that scoring became the overall rating (+ or −), whereas if < 75% of studies displayed the same scoring, the overall rating became inconsistent (±) .
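The 75% rule for combining single-study ratings can be sketched in Python as follows. This is a simplified reading of the rule as stated above, not the review’s actual procedure: the function name is hypothetical, ratings are written with ASCII "+", "-" and "?", and indeterminate ratings are treated like any other rating, which is an assumption.

```python
from collections import Counter


def overall_rating(study_ratings, threshold=0.75):
    """Combine single-study ratings ("+", "-", "?") into an overall rating.

    If at least `threshold` of the studies share the same rating, that
    rating becomes the overall rating; otherwise the result is "±"
    (inconsistent). Returns None when no studies are available.
    """
    if not study_ratings:
        return None
    rating, count = Counter(study_ratings).most_common(1)[0]
    if count / len(study_ratings) >= threshold:
        return rating
    return "±"


# Three of four studies agree (75%), so the shared rating is adopted.
print(overall_rating(["+", "+", "+", "-"]))
# An even split falls below the threshold and is reported as inconsistent.
print(overall_rating(["+", "-"]))
```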
Initial searches of the seven databases in October 2019 generated a combined total of 56,615 citations. The updated search in April 2021 identified 4,797 new citations, giving 61,412 citations in total. Following removal of duplicates, title and abstract screening of 20,724 references (including an additional 31 articles identified through manual searching) yielded 424 articles deemed potentially relevant. After eligibility criteria were applied to full-text versions of the 424 publications, a total of 41 studies evaluating the psychometric properties of 24 unique teacher proxy-report measures for elements within the APLF were identified. A flow chart of study selection was prepared in accordance with the PRISMA statement (detailed in Fig. 2).
General characteristics of included studies
A description of the study characteristics and their assessment instruments is presented in Table 1. The 41 studies were published between 1936 and 2020 and were conducted within the United States (n = 18), Netherlands (n = 3), South Africa (n = 3), Finland (n = 2), Italy (n = 2), Israel (n = 2), Portugal (n = 2), Australia (n = 1), Poland (n = 1), Canada (n = 1), Japan (n = 1), and Brazil (n = 1). Study location was unspecified in four studies. All relevant domains of the APLF (i.e. physical, psychological, social, and cognitive) assessed in each measure were identified (see Table 1). Tools were categorized as single domain (assessing one domain of the APLF), dual-domain (assessing two domains), and tri-domain (assessing three domains) measures. The majority of tools identified in this review assessed elements across two domains of the APLF (see Fig. 3). No single teacher proxy-report measure assessed elements in all four domains of the APLF. A detailed synthesis of how each tool (and relevant items) is aligned with individual elements of the APLF is presented in Table 4.
For “single domain measures”, four tools assessed elements exclusively in the physical domain: the Motor Observation Questionnaire for Teachers (MOQ-T) [45–48]; Movement Assessment Battery for Children-2 Checklist (MABC-2 Checklist) [49,50,51,52]; Pictorial Scale of Perceived Water Competence (PSPWC) ; and Teen Risk Screen checklist (TRS) . Another two tools were related only to the psychological domain: Reiss Motivation Profile for children (Child RMP) ; and Teacher’s Self-concept Evaluation Scale .
“Dual-domain measures” included the Brief Behaviour Rating Scale (BBRS) ; Devereux Student Strengths Assessment (DESSA) [58, 59]; Emotion Regulation Checklist (ERC) ; Multisource Assessment of Social Competence Scale (MASCS) ; Pictorial Scale of Perceived Competence and Social Acceptance for Young Children-Teacher (PSPCSA-T) [62,63,64]; Social-Emotional Assets and Resilience Scale, Teacher rating form (SEARS-T) [65,66,67]; Social Skills Improvement System Social Emotional Learning Edition Rating Forms (SSIS SEL RF) – Teacher version ; Teacher-Child Rating Scale (T-CRS) ; Teacher Questionnaire (TQ) ; Teacher Rating of Social Efficacy ; and Winnetka Scale for Rating School Behaviour [72, 73] (See Fig. 3 and Table 1).
Tools that straddled three domains of the framework (“tri-domain measures”) included the Children Activity Scales for Teachers (CHAS-T); Gross Motor Rating Scale (GMRS); Harter’s Teacher’s Rating Scale of Child’s Actual Behaviour (Harter’s TRS) [76,77,78]; Health Resources Inventory (HRI); Social and Emotional Competencies Evaluation Questionnaire Teacher’s version (Short Form) (QACSE-P-SF); Social Skills Rating Scale (SSRS-T) [81,82,83]; and Teacher Estimation of Activity Form (TEAF) [17, 84] (see Fig. 3 and Table 1).
Furthermore, there was a considerable degree of homogeneity in relation to the targeted age group/grades for identified tools. Most tools spanned the entire age range (i.e. for children between 5 and 12 years) and thus were suitable for both younger and older children. Tool completion times were not often reported but when reported, completion times ranged between three and 15 min per child. Scales ranged from 10 [17, 84] to 80 items . The 41 studies assessed a median of 3 out of the nine measurement properties recognized by the COSMIN. The most commonly reported psychometric properties were construct validity (n = 32; 78% of studies), structural validity (n = 26; 63% of studies), and internal consistency (n = 25; 61% of studies). Statistical tests utilized to evaluate measurement properties varied across the review. For instance, confirmatory factor analysis was the most frequently used statistical approach for studies reporting on structural validity whereas correlations were used for hypothesis testing for construct validity. Construct validity was mostly tested by comparing scores obtained for a tool with another measure assessing a similar construct. On the other hand, criterion validity was evaluated by comparing scores obtained for a tool with a gold standard measure. Tool development studies were conducted for eight measures including the BBRS , CHAS-T , GMRS , HRI , SEARS-T , SSRS-T , T-CRS , and Winnetka Scale for Rating School Behaviour . Content validity was only reported for two tools (CHAS-T and PSPWC) [53, 74].
Methodological quality assessment
Table 2 details the methodological quality assessment of the 41 studies included in the review.
Single domain measures
The MOQ-T and MABC-2 Checklist were each evaluated in four studies [45,46,47,48,49,50,51,52], while one study each assessed the Child RMP, PSPWC, Teacher’s Self-Concept Evaluation Scale, and TRS. No measure assessing a single domain of the APLF reported on tool development, responsiveness, or measurement error. Content validity, assessed for the PSPWC, obtained a Doubtful rating. Structural validity ratings were generally low, with studies rated as Inadequate (n = 2) [46, 47] or Doubtful (n = 3) [48, 54, 56]. Only two studies were rated higher, one as Adequate and one as Very Good. Cross-cultural validity, assessed in one study, received a Doubtful rating. By contrast, studies assessing criterion validity mostly received Very Good ratings (n = 4) [45, 47,48,49], with only two studies rated as Inadequate [51, 52]. For construct validity, most studies received favourable ratings of Very Good (n = 3) [45, 49, 55] or Adequate (n = 2) [51, 54], and only one study was rated as Doubtful. Regarding measurement properties relating to reliability, one study examined the test-retest reliability of the TRS and was rated as Adequate. Internal consistency had mixed ratings; five studies were rated as Very Good [47, 48, 50, 54, 56], while three were Inadequate [46, 49, 55]. Overall, four single-domain tools (i.e. MOQ-T, MABC-2 Checklist, Child RMP, TRS) obtained consistent ratings of “Very Good” or “Adequate” for methodological quality across their measurement studies.
Seventeen studies evaluated dual-domain measures [57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73]. For these measures, most measurement properties (content validity, cross-cultural validity, measurement error, criterion validity, responsiveness) were unreported. All studies providing information on tool development received ratings of either Inadequate (n = 3) [57, 69, 72] or Doubtful (n = 1). Conversely, construct validity was rated as Very Good (n = 7) [57,58,59, 63, 64, 67, 71] or Adequate (n = 5) [61, 62, 65, 69, 73]; only two studies were rated lower, one as Doubtful and one as Inadequate. Studies on structural validity received mixed ratings of Very Good (n = 2) [61, 65], Adequate (n = 3) [68, 69, 73], and Doubtful (n = 5) [59, 60, 67, 71, 72]. Furthermore, the majority of studies on internal consistency were rated highly as Very Good (n = 7) [60, 61, 63, 65, 67,68,69], while only two were Inadequate [57, 71]. Reliability studies were rated as Adequate (n = 1), Doubtful (n = 3) [68, 69, 71], and Inadequate (n = 2) [57, 72]. Overall, six dual-domain tools (i.e. DESSA, MASCS, PSPCSA-T, SEARS-T, SSIS SEL RF Teacher, T-CRS) obtained consistent ratings of “Very Good” or “Adequate” for methodological quality across their measurement studies.
Twelve studies examined tri-domain measures [17, 74,75,76,77,78,79,80,81,82,83,84]. Measurement properties not evaluated for any of these measures were cross-cultural validity, measurement error, and responsiveness. Tool development studies received low ratings of Inadequate (n = 3) [74, 75, 81] or Doubtful (n = 1). Content validity, assessed in a single study for the CHAS-T, was rated as Doubtful. For the most part, studies on structural validity received high ratings of Very Good (n = 2) [77, 80] and Adequate (n = 4) [17, 74, 79, 82]. However, three studies were rated as Doubtful (n = 2) [75, 84] or Inadequate (n = 1). Similarly, the majority of studies on criterion validity and construct validity were rated highly. For criterion validity, studies were all rated as Very Good (n = 4) [17, 74, 75, 84], whereas construct validity studies were rated as Very Good (n = 7) [17, 74, 78, 80, 82,83,84] and Adequate (n = 4) [76, 77, 79, 81], with only one study rated as Inadequate. Internal consistency studies were rated as either Very Good (n = 5) [17, 80, 82,83,84] or Inadequate (n = 3) [74, 75, 81], while reliability studies were rated lower, as either Doubtful (n = 3) [75, 79, 83] or Inadequate (n = 1). Overall, four tri-domain tools (i.e. Harter’s TRS, QACSE-P-SF, SSRS-T, TEAF) obtained consistent ratings of “Very Good” or “Adequate” for methodological quality across their measurement studies.
Measurement property assessment of instruments
The overall rating of each tool was appraised and is summarized in Table 3, which presents a combined synthesis of the quality of results for the measures included in this review. Structural validity was found to be sufficient for a number of instruments including the DESSA, ERC, Harter’s TRS, MASCS, MOQ-T, and QACSE-P-SF, where, in line with the COSMIN criteria, most (i.e. ≥75%) single studies assessing these instruments had acceptable Root Mean Square Error of Approximation (RMSEA) (< 0.06) or comparative fit index (CFI) (> 0.95) or Standardized Root Mean Square Residual (SRMR) (< 0.08) values. Inconsistent ratings were noted for the SEARS-T and MABC-2 Checklist. Tools found to have insufficient structural validity were the Child RMP, SSIS SEL RF Teacher, and TRS checklist. However, the majority of tools (including the CHAS-T, GMRS, HRI, SSRS-T, T-CRS, Teacher’s Rating of Social Efficacy, TEAF, and Winnetka Scale for Rating School Behaviour) were indeterminate in structural validity because single studies evaluating these tools utilized statistical methods such as exploratory factor analysis.
Criterion validity, evaluated for five tools, was rated as sufficient for the CHAS-T, MOQ-T and TEAF; inconsistent for the MABC-2 Checklist; and insufficient for the GMRS. Cross-cultural validity was evaluated for the MABC-2 Checklist and was rated as indeterminate because no multiple group factor analysis was performed in the single study. For construct validity, results were mostly indeterminate in rating. Internal consistency coefficients were sometimes provided for the entire scale and/or its subscales. For the most part, tools were rated as indeterminate as a result of insufficient evidence on structural validity and/or provision of Cronbach’s alpha values for the total scale and not per subscale. The quality of results for test-retest and inter-rater reliability was mostly indeterminate as intraclass correlation coefficient (ICC) values were not calculated for continuous scores. The only exception was the TRS Checklist, which had ICC values of less than 0.70 for most subscales and was considered as having insufficient reliability. Overall, no tool was consistently evaluated as having sufficient ratings for all its measurement properties. Only five tools (i.e. MOQ-T, ERC, MASCS, QACSE-P-SF, and TEAF) had at least two sufficient ratings across their measurement properties.
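To make the structural validity thresholds concrete, the sketch below rates a single study from its reported fit indices. It follows the “or” wording used above, so meeting any one threshold counts as sufficient. The function name is hypothetical, and returning an indeterminate rating when no fit index is reported (for example, when only exploratory factor analysis was used) is an assumption drawn from the description of the indeterminate tools.

```python
def structural_validity_rating(rmsea=None, cfi=None, srmr=None):
    """Rate structural validity for one study from its reported fit indices.

    Returns "+" if any reported index meets its threshold
    (RMSEA < 0.06, CFI > 0.95, SRMR < 0.08), "-" if indices were
    reported but none meets its threshold, and "?" if none was reported.
    """
    checks = []
    if rmsea is not None:
        checks.append(rmsea < 0.06)
    if cfi is not None:
        checks.append(cfi > 0.95)
    if srmr is not None:
        checks.append(srmr < 0.08)
    if not checks:
        return "?"  # assumption: no fit indices reported means indeterminate
    return "+" if any(checks) else "-"


# CFI meets its threshold even though RMSEA does not, so the result is "+".
print(structural_validity_rating(rmsea=0.09, cfi=0.96))
# No confirmatory fit indices reported, so the result is "?".
print(structural_validity_rating())
```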
Physical literacy alignment
Item/content alignment of each tool with the APLF was appraised (see Table 4). Table 4 also highlights tools with good methodological quality and sufficient results quality (i.e. at least two sufficient ratings) based on evidence synthesis, as well as tools (n = 10) assessing the PL elements in the context of physical activity. The number of measures that mapped onto individual APLF elements ranged from 1 to 15. All elements in three (i.e. the physical, psychological, and social) out of four domains of the framework were addressed. Relationships, self-regulation (emotions), and collaboration were the elements most frequently assessed by the included measures. The least captured elements were speed, connection to place, and tactics. Water skills, a component of the element movement skills, was assessed in one tool. Four of the APLF elements belonging to the cognitive domain (content knowledge, reasoning, strategy and planning, and perceptual awareness) were not addressed by any measure.
Tools capturing the most elements of the APLF included the GMRS (15 out of 30), the HRI (9 out of 30), and the TEAF (8 out of 30). Lastly, Harter’s TRS covered three domains (physical, social, cognitive) of the framework. However, due to the lack of specificity of items contained within the tool (e.g., “This child doesn’t do well at new outdoor games”; “This child does really well at all kinds of sports”; “This child is better than others his/her age at sports”), mapping it onto the individual elements of the framework proved rather difficult.
This is the first review to critically evaluate the psychometric properties of teacher proxy-report instruments designed to assess one or more elements of children’s PL. As a consequence, the current study represents a novel contribution to the literature base relating to PL and its assessment. PL assessment can help identify aspects of children’s PL that are suboptimal, as well as provide an evidence base for evaluating the effectiveness of interventions targeted at improving PL levels. More specifically, a focus on teacher proxy-report instruments for children’s PL is needed due to children’s limited cognitive abilities when making self-assessments of their own capabilities [27, 62, 85]. Baranowski has further suggested that children are also limited in their ability to recall specific events that occurred in the past. Indeed, Bardid et al. have reported that teacher proxy-reports (especially by physical education specialists) may provide more accurate estimates of a child’s capabilities (e.g., motor competence) than child self-report.
Importantly, in the current review, alignment with individual elements of the APLF, for each teacher proxy-report measure, was further appraised. The first finding is clearly the lack of valid and reliable teacher proxy-report instruments that assess PL in its entirety, based on the comprehensive APLF. There are however tools available to assess some elements of the framework. Specifically, 41 studies evaluating the psychometric properties of 24 teacher proxy-report tools for the APLF elements were identified. The psychometric properties of identified measures were variable, with many typically unreported or inadequately assessed.
No single tool reported all nine psychometric properties outlined by the COSMIN methodology [35,36,37]. Measurement properties frequently reported included construct validity, structural validity, and internal consistency. Content validity and cross-cultural validity were the most rarely reported. No studies reported measurement error or responsiveness. These findings mirror those of a recently published review of motor competence assessments for children and adolescents, which highlighted that construct validity was frequently reported whereas content validity was the least evaluated psychometric property.
Content validity is often considered the most important measurement property of an instrument, and is needed to ensure that the tool has an appropriate number of items and adequately captures the construct/element under investigation. COSMIN distinguishes between tool development and content validity studies in that the former involves concept elicitation, development, and pilot testing of a new tool, whereas the latter entails testing of an existing tool. In this review, most tool development studies were given the lowest possible rating of “inadequate”. This was either because tool development studies did not use a sample representative of the tool’s targeted population or because no pilot tests or cognitive interviews were performed for the newly developed tool. On the other hand, just two studies reported on content validity, for the CHAS-T and PSPWC. The comprehensibility, relevance, and comprehensiveness of items in the CHAS-T were explored by teachers and professionals. In this review, the instrument was rated as doubtful for methodological quality as there was a lack of reporting of the qualitative and analytical methods utilized for the content validation process. Another study reported content validity for the TRS tool; however, the review team failed to find any report regarding the relevance or comprehensiveness of items from the perspective of the targeted users of the tool and/or professionals. The PSPWC was rated as doubtful in methodological quality as it was not clear whether two researchers were involved in the analysis of qualitative interviews and whether skilled interviewers were used during interviews.
According to COSMIN’s updated guidelines, if the content validity of a tool is unknown, the results for other measurement properties of the tool should be ignored and not further appraised, as this hinders the interpretation and generalization of study findings. Given the importance of this measurement property, there is an urgent need to prioritize content validity studies in future development of teacher proxy-report PL instruments. Future studies should consider using the COSMIN Study Design checklist, which offers clear standards for designing studies aimed at evaluating measurement properties of instruments. Specifically, for content validity studies, tool developers should obtain information from targeted tool users and professionals regarding the relevance, comprehensibility, and comprehensiveness of the instructions, response options and items contained within the tool. For this, a widely recognized or well justified qualitative research approach is preferred, whereby each item on the tool is evaluated by at least seven individuals from the target population of interest and professionals (see Mokkink et al. for the design requirements).
Few studies examined criterion validity, that is, validated a measure against a reference “gold” standard. Criterion validity ensures the accuracy of a scale when compared to a reference standard. Being widely tested and validated measures, the MABC motor test, the Bruininks–Oseretsky test of motor proficiency, and the Körperkoordinationstest für Kinder test were considered to be reasonable “gold” standards for motor skill assessment. Hence, all studies comparing a teacher proxy-report tool to these measures were considered studies of criterion validity. It is important to note that there were a few cases where authors used the term criterion validity when comparisons were made with other measures assessing a similar construct. In these instances (as specified in the COSMIN user manual), this was considered to be evidence of construct validity rather than criterion validity. In this review, most studies on criterion validity appeared to have good methodological quality, with evaluated measures having sufficient results quality. Similar findings were noted by Antczak et al. for criterion validity studies of motor competence assessments. However, it has been argued that the design of the COSMIN checklist, in terms of the number of standards contained in each measurement property and the use of the “worst score counts” principle, could significantly affect its overall scoring. For instance, a measurement property such as criterion validity, which contains fewer standards (three in total), may fare better in its overall scoring than properties with a larger number of standards (e.g., 35 standards for content validity).
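The effect of the “worst score counts” principle can be illustrated with a small Python sketch. The four-point scale is COSMIN’s (very good, adequate, doubtful, inadequate); the function name and the example ratings are hypothetical and are not taken from any study in this review.

```python
# COSMIN's four-point scale, ordered from worst to best.
LEVELS = ["inadequate", "doubtful", "adequate", "very good"]


def worst_score_counts(standard_ratings):
    """Overall methodological quality is the lowest rating of any standard."""
    return min(standard_ratings, key=LEVELS.index)


# Hypothetical criterion validity study: three standards to satisfy.
criterion = ["very good", "very good", "adequate"]

# Hypothetical content validity study: 35 standards, a single weak one.
content = ["very good"] * 34 + ["doubtful"]

criterion_quality = worst_score_counts(criterion)
content_quality = worst_score_counts(content)
```

Here the content validity study is rated doubtful overall even though 34 of its 35 standards were met at the highest level, which is why properties with many standards have more opportunities to receive a low overall rating.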
The methodological quality of studies reporting structural validity was mixed. The common reasons for doubtful or inadequate COSMIN ratings were insufficient sample size and/or statistical design flaws, such as a lack of reporting of the number of teachers involved in the study and how any resulting clustering effects were accounted for in the analytical design. Furthermore, for many tools, result ratings were indeterminate due to the use of exploratory factor analysis (including principal component analysis), as the updated COSMIN does not provide any criteria for rating these techniques. Ideally, a confirmatory factor analysis should follow an exploratory factor analysis (preferably using a different sample), as the former verifies an a priori exploratory factor analysis-informed theory regarding a tool’s factor structure. Given that some of these deficiencies can be resolved by more detailed reporting and further psychometric testing, future studies should consider adopting guidelines offered by COSMIN for reporting of structural validity studies.
Only one of the 41 studies assessed cross-cultural validity, having translated a measure (MABC-2 Checklist) from English to Japanese and compared scores obtained from two samples (i.e. United Kingdom and Japan). This study did not perform well for either methodological or results quality. Noteworthy is that a number of studies [47, 48] within this review translated a measure from its original language to a different language without assessing cross-cultural validity. Future studies should determine cross-cultural validity for translated instruments, utilizing appropriate techniques (e.g., multi-group confirmatory factor analysis for classical test theory or differential item functioning for item response theory) [35, 36]. This is because instruments may perform differently across different cultures, different gender or age groups, and different populations. Most construct validity studies performed adequately for methodological quality; however, overall results quality was mostly indeterminate. This may have been influenced by the lack of a priori hypotheses for expected differences between groups for known groups/discriminant validity.
Internal consistency values (the degree of interrelatedness among items in a subscale) had to be calculated separately for each unidimensional scale or subscale to obtain good ratings for methodological quality. Deficiencies in studies were mostly because Cronbach’s alpha values were provided for the entire scale and not per subscale. Similarly, results of internal consistency were indeterminate for many studies as Cronbach’s alpha was provided for the entire scale and there was evidence of insufficient structural validity. COSMIN considers evidence on structural validity (or unidimensionality) a prerequisite for interpreting Cronbach’s alpha values. Given these findings, we recommend that, as a starting point, future studies should ensure that evidence exists for sufficient unidimensionality or structural validity of a tool and thereafter report Cronbach’s alpha (for continuous scores) for each subscale.
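As an illustration of reporting Cronbach’s alpha per subscale rather than for the whole scale, the following Python sketch applies the standard formula, alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), to each subscale separately. The subscale names borrow two APLF elements for readability; the ratings are invented for the example and do not come from any tool in this review.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for one subscale.

    item_scores: one list per item, each holding one score per child.
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Total score per child across the items of this subscale.
    totals = [sum(child) for child in zip(*item_scores)]
    sum_item_var = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - sum_item_var / variance(totals))


# Hypothetical teacher ratings (6 children, 3 items per subscale).
subscales = {
    "movement skills": [
        [4, 3, 5, 2, 4, 1],
        [4, 2, 5, 2, 5, 1],
        [5, 3, 4, 1, 4, 2],
    ],
    "collaboration": [
        [2, 5, 3, 4, 1, 3],
        [1, 5, 3, 5, 2, 3],
        [2, 4, 2, 5, 1, 4],
    ],
}

# One alpha per subscale, not a single value for the pooled items.
alpha_by_subscale = {name: cronbach_alpha(items)
                     for name, items in subscales.items()}
```

Pooling all six items into a single alpha would mix two constructs, which is the situation the text describes as uninterpretable without evidence of unidimensionality.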
Reliability (test-retest and inter-rater) studies in this review did not rate well for methodological quality. For the majority of studies, Pearson’s correlations (a measure of the relationship between two variables) were used to explore this measurement property rather than intraclass correlations for continuous scores, as recommended by the COSMIN. Past literature has highlighted that Pearson’s correlation is an inappropriate and liberal measure of reliability, often producing reliability coefficients that are higher than the true reliability [88, 97]. It was also difficult to determine whether participants were stable in the interim between measurements or if the testing conditions were similar for the measurements taken. As ICC values were not calculated, results were rated as indeterminate for the majority of studies in this review. Studies should consider the use of intraclass correlations when exploring reliability of continuous variables, as they reflect both the correlation and the agreement between measurements taken by an instrument.
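The difference between the two coefficients can be seen in a short Python sketch. It compares Pearson’s r with a two-way, absolute-agreement, single-measurement ICC on invented test-retest scores in which every child scores exactly five points higher at retest. The choice of this ICC form is an assumption made for illustration; the text does not state which form a given study should use.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation: reflects linear association only."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x)
                      * sum((b - my) ** 2 for b in y))


def icc_agreement(scores):
    """Two-way, absolute-agreement, single-measurement ICC.

    scores: one row per child, one column per occasion (or rater).
    """
    n, k = len(scores), len(scores[0])
    grand = mean(v for row in scores for v in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(col) for col in zip(*scores)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ms_rows = ss_rows / (n - 1)                  # between-children
    ms_cols = ss_cols / (k - 1)                  # between-occasions
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k / n * (ms_cols - ms_err)
    )


# Hypothetical scores: a constant 5-point shift between occasions.
test_scores = [10, 12, 14, 16, 18]
retest_scores = [s + 5 for s in test_scores]

r = pearson_r(test_scores, retest_scores)
icc = icc_agreement(list(zip(test_scores, retest_scores)))
```

With these invented scores, r is 1.0 because the rank order and spacing are preserved, while the ICC is about 0.44 because the systematic shift counts as disagreement. This is the sense in which Pearson’s correlation can overstate reliability.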
Two measurement properties, responsiveness and measurement error, were not explored in any study in this review. COSMIN refers to responsiveness as the measure’s ability to detect change over time in the construct of interest, whereas measurement error is regarded as errors in scores obtained which are not a result of changes in the construct of interest. No study included in this review evaluated the minimal important change or minimal important difference of their tools. Without information on the measurement error of these tools, it is unclear whether the changes in scores of the constructs assessed are meaningful and matter to teachers. Studies have also previously noted underreporting of responsiveness. This is concerning because, without evidence of responsiveness, it is difficult to assess the effectiveness of interventions designed to improve PL or its components.
In summary, for the studies included in this review, a median of three out of nine psychometric properties were reported. Content validity, which is considered the most important property, was sparingly reported. This therefore restricts our justification for the use of specific teacher proxy-report tools in practice until further psychometric testing is conducted. However, based on the available evidence and after combining the ratings of methodological quality and the criteria for good measurement properties provided by the COSMIN, the best results were obtained for the following tools: MASCS, MOQ-T, QACSE-P-SF and TEAF. Combined, these tools assess a total of 18 elements of the APLF. Of these tools, the MOQ-T and TEAF assess the APLF elements in relation to physical activity. The ERC had good psychometric evidence but was lacking in methodological rigour. Terwee et al. have highlighted that results of studies lacking in methodological quality should not be trusted. One must exercise caution when interpreting these results, though, as some of these tools (specifically MASCS and QACSE-P-SF) were evaluated in single studies and, as such, are in need of repeated psychometric testing in different populations. Furthermore, in the current review, the MABC-2 Checklist was found to be one of the most widely examined tools for reliability and validity. Surprisingly, despite having good methodological quality for most of its measurement properties, our findings reveal that the checklist has limited psychometric evidence to support its reliability and validity, suggesting the need for more validation studies. The current systematic review highlights a need for further psychometric testing (especially content validity, cross-cultural validity, measurement error, criterion validity, and responsiveness), with more detailed reporting of methodological aspects and results in future studies.
Taking such an approach will provide teachers with a more robust foundation when selecting appropriate and psychometrically sound measures for assessing PL.
Physical literacy alignment
The APLF is unique in that it recognizes a variety of skills and attributes straddling four inter-related learning domains (physical, psychological, social, and cognitive) as needed for PL development. More specifically, the framework incorporates elements outside the physical domain that have not previously featured in other definitions. These elements may be equally beneficial for integrated movement experiences to develop PL. For example, the element collaboration, situated in the social domain, reflects social skills (e.g., conflict resolution, cooperation, and leadership) required to successfully interact with others in movement and physical activity contexts. This element is potentially as important as other elements (e.g., movement skills) and should be assessed in children.
Our review findings suggest a paucity of teacher proxy-report measures that address several elements of the APLF. In particular, elements such as speed, connection to place, tactics, content knowledge, reasoning, strategy and planning, and perceptual awareness were either rarely assessed or not assessed by identified tools. Interestingly, the elements most frequently assessed appeared to fall within the social domain, suggesting the availability of many teacher assessment options for this domain. Because of our wider search for tools beyond the physical activity/physical education literature, only the PSPCSA-T and Harter’s TRS assessed the social domain in the context of physical activity. Our findings may be an indication that the social domain, despite not being recognized as a core component of several PL frameworks, is an aspect that teachers are interested in reporting on more generally.
Another finding is the absence of measures with psychometric evidence that address elements of the cognitive domain. The authors note, however, that it may be quite challenging to assess the cognitive domain via teacher proxy-reporting. Indeed, many existing measures for PL (e.g., CAPL) tend to approach its assessment via self-report. Nonetheless, a comprehensive approach to assessing PL is required since the flavour of the concept in itself lies in its holistic nature. Hence, the development of measures that target all domains and elements of the APLF should be prioritized to provide a greater breadth and depth of understanding of the contributors to children’s PL.
Recommendations for teacher assessment of physical literacy based on the APLF
Proxy-report measures have the advantage of low cost, ease of administration on large numbers of children, and less administration training when compared to objective measures. This is even more beneficial to teachers, who are often faced with time barriers to teaching and assessment. In making recommendations for teachers when choosing instruments for PL assessment, besides highlighting psychometrically sound measures, many aspects of the feasibility of these measures should be well considered. Some of these feasibility aspects include completion time, cost of instrument, copyright, length of the instrument, ease of administration and score calculation. Information on feasibility may become particularly relevant when differentiating between two equally psychometrically sound instruments. The vast majority of measures identified in this review did not report on completion time. However, as feasibility is not considered a measurement property by the COSMIN, it was beyond the scope of this paper to consider all aspects of the feasibility of the identified tools. We therefore recommend that these aspects receive priority in future studies.
As stated earlier, the current review did not locate a tool that captured all elements and domains of the APLF. For teachers to assess PL comprehensively, there is a need for a tool that includes all 30 elements of the framework. Also, given the limited evidence found for measures in this review, it is difficult to justify the use of the identified tools until further psychometric testing is conducted. This review has found the best evidence for the MASCS, MOQ-T, QACSE-P-SF and TEAF. Teachers who are interested in assessing elements of PL based on its Australian approach could consider utilizing the detailed nine-step decision-making process for choosing a PL assessment highlighted by Barnett et al., in conjunction with Tables 2, 3 and 4 of this review, which provide information on the validity, reliability, and alignment of specific instruments with the APLF. Barnett et al.’s guidance for assessing PL involves identifying the following: (i) element(s) of interest; (ii) teacher interest; (iii) context; (iv) purpose; (v) age group; (vi) structure of observed learning outcomes level; (vii) measurement/assessment method; (viii) number of participants; and (ix) cost. Specifically, step seven encourages teachers to decide on their preferred assessment approach (e.g., objective or subjective measures). As an example, after carefully considering these nine steps in conjunction with the results provided in Tables 2, 3 and 4, a teacher who may be interested in assessing the APLF elements agility, strength, muscular endurance, cardiovascular endurance, engagement and enjoyment, confidence, motivation and tactics (Step I) via proxy-reporting (Step VII) could utilize the TEAF. This is because, based on the available psychometric evidence (methodological quality and results quality), the tool seems to be the most promising teacher tool for assessing these elements.
An assessment of this nature by physical educators must be approached with caution, as most tools identified within this review were not contextualized in physical activity (as outlined in Table 4). As such, we have highlighted the tools assessing the PL elements in the context of physical activity – refer to Table 4.
Strengths and limitations
This systematic review has several strengths. The protocol for the review was registered prospectively. A comprehensive search of seven databases relevant to Sport, Education, Psychology and Health was conducted to identify peer-reviewed articles. Furthermore, a comprehensive search strategy comprising search filters for finding studies on measurement properties provided by COSMIN, as well as search filters relevant to each individual PL element, was utilized to locate studies within the review. Time restrictions were not applied in the search strategy. This strategy identified studies focused on psychometric testing of tools for each PL element, unlike previous reviews, which were focused mostly on tools for PL as a whole without critically appraising the psychometric properties of those tools. Three authors were independently involved in the full-text review phase and methodological quality assessment of included studies, following best practice recommendations for conducting systematic reviews. This triangulation approach reduces the risk of non-detection of relevant evidence, thus strengthening the validity of conclusions reached from available evidence. Lastly, within the PL research area, this is the first systematic review performed in accordance with PRISMA guidance and COSMIN’s latest 2018 guidance [35,36,37], which is more detailed than its 2010 guidance [103, 104].
This study is not without limitations. Only studies published in the English language were included, due to our limited resources, time and expertise in non-English languages. Studies with English abstracts and non-English full text were also excluded because, when it is not possible to obtain a translation, extracting all the information needed to meaningfully inform the systematic review based on the abstract only is difficult. Therefore, some findings may have been overlooked. Furthermore, because of the lack of rigorous peer-review, grey literature including conference and poster abstracts, dissertations, and tool manuals was excluded. As such, it is possible that some measurement properties (e.g., content validity) were reported within tool manuals. Only studies reporting on one or more measurement properties outlined by the COSMIN for teacher tools of the PL elements were included in the review. Hence, a number of studies may have been omitted if measurement properties were not discussed for tools utilized in those studies. The COSMIN methodology does not differentiate between poor reporting and poor quality in the risk of bias analyses. Therefore, there could have been cases where a lack of detailed reporting by authors resulted in an inadequate or doubtful rating for methodological quality. Finally, some tools had multiple validity and reliability studies, reflecting more widespread use, whereas other instruments were evaluated in a single study. This may have affected the overall ratings of results quality for the tools identified within this review.
This review is the first to identify and critically appraise the psychometric properties of 24 teacher proxy-report measures for assessing a comprehensive framework of PL, for children aged 5–12 years. Teacher proxy-reports may provide more reliable estimates of a child’s ability compared to self-report, are low in cost, and can be used to assess large sample sizes compared to objective measures. Moreover, objective assessment may not be conducive for some elements (e.g., relationships, ethics) of the APLF. Our review findings suggest that presently, there is no existing teacher proxy-report tool to assess all elements of children’s PL identified in the APLF. Based on the findings of this review, there remain considerable gaps in knowledge in aspects related to the validity (e.g., content, cross-cultural), reliability (measurement error), and responsiveness of teacher tools. This emphasizes the need for further psychometric studies on existing teacher report tools and, more importantly, the need to develop new teacher tools for assessing the PL domains in their entirety. Tool developers may consider combining items from existing scales, preferably those that have undergone repeated processes of psychometric testing for validity and reliability as highlighted in this review. As Streiner et al. put it simply, “instruments rarely spring fully grown from the brows of their developers. Rather, they are usually based on what other people have deemed to be relevant, important, or discriminating”. Due to its comprehensive nature, this review raises the importance of and need for a proxy-report scale for teachers within the Australian context, and for teachers globally who are interested in assessing children’s PL based on the comprehensive APLF.
Availability of data and materials
Abbreviations
- APLF: Australian Physical Literacy Framework
- BBRS: Brief Behaviour Rating Scale
- CAMSA: Canadian Agility and Movement Skill Assessment
- CAPL: Canadian Assessment of Physical Literacy
- CFA: Confirmatory factor analysis
- CFI: Comparative fit index
- CHAS-T: Children Activity Scales for Teachers
- Child RMP: Reiss Motivation Profile for children
- COSMIN: COnsensus-based Standards for the selection of health Measurement INstruments
- DCD: Developmental coordination disorder
- DESSA: Devereux Student Strengths Assessment
- EFA: Exploratory factor analysis
- ERC: Emotion Regulation Checklist
- GMRS: Gross Motor Rating Scale
- Harter’s TRS: Harter’s Teacher’s Rating Scale of Child’s Actual Behaviour
- HRI: Health Resources Inventory
- MABC: Movement Assessment Battery for Children
- MASCS: Multisource Assessment of Social Competence Scale
- MOQ-T: Motor Observation Questionnaire for Teachers
- PCA: Principal component analysis
- PLAY: Physical Literacy Assessment for Youth
- PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
- PSPCSA-T: Pictorial Scale of Perceived Competence and Social Acceptance for Young Children-Teacher
- PSPWC: Pictorial Scale of Perceived Water Competence
- QACSE-P-SF: Social and Emotional Competencies Evaluation Questionnaire Teacher’s version (Short Form)
- RMSEA: Root Mean Square Error of Approximation
- SEARS-T: Social-Emotional Assets and Resilience Scale, Teacher rating form
- SRMR: Standardized Root Mean Residuals
- SSIS SEL RF: Social Skills Improvement System Social Emotional Learning Edition Rating Forms
- SSRS: Social Skills Rating Scale
- TCRS: Teacher-Child Rating Scale
- TEAF: Teacher Estimation of Activity Form
- TRS: Teen Risk Screen checklist
Janssen I, LeBlanc AG. Systematic review of the health benefits of physical activity and fitness in school-aged children and youth. Int J Behav Nutr Phys Act. 2010;7(1):1–16.
Poitras VJ, Gray CE, Borghese MM, Carson V, Chaput J-P, Janssen I, et al. Systematic review of the relationships between objectively measured physical activity and health indicators in school-aged children and youth. Appl Physiol Nutr Metab. 2016;41(6):S197–239 https://doi.org/10.1139/apnm-2015-0663.
Chacón-Cuberos R, Zurita-Ortega F, Ramírez-Granizo I, Castro-Sánchez M. Physical activity and academic performance in children and preadolescents: a systematic review. Apunt Educ Fisica Y Deportes. 2020;139:1–9.
Aubert S, Barnes JD, Abdeta C, Abi Nader P, Adeniyi AF, Aguilar-Farias N, et al. Global matrix 3.0 physical activity report card grades for children and youth: results and analysis from 49 countries. J Phys Act Health. 2018;15(s2):S251–S73 https://doi.org/10.1123/jpah.2018-0472.
Sum RK-W, Whitehead M. Getting up close with Taoist: Chinese perspectives on physical literacy. Prospects. 2021;50(1):141–50 https://doi.org/10.1007/s11125-020-09479-w.
Li MH, Sum RKW, Sit CHP, Wong SHS, Ha ASC. Associations between perceived and actual physical literacy level in Chinese primary school children. BMC Public Health. 2020;20(1):207 https://doi.org/10.1186/s12889-020-8318-4.
Belanger K, Barnes JD, Longmuir PE, Anderson KD, Bruner B, Copeland JL, et al. The relationship between physical literacy scores and adherence to Canadian physical activity and sedentary behaviour guidelines. BMC Public Health. 2018;18(2):1–9.
Lang JJ, Chaput J-P, Longmuir PE, Barnes JD, Belanger K, Tomkinson GR, et al. Cardiorespiratory fitness is associated with physical literacy in a large sample of Canadian children aged 8 to 12 years. BMC Public Health. 2018;18(2):1–13.
Jefferies P, Ungar M, Aubertin P, Kriellaars D. Physical literacy and resilience in children and youth. Front Public Health. 2019;7:346 https://doi.org/10.3389/fpubh.2019.00346.
Caldwell HA, Di Cristofaro NA, Cairney J, Bray SR, MacDonald MJ, Timmons BW. Physical literacy, physical activity, and health indicators in school-age children. Int J Environ Res Public Health. 2020;17(15):5367 https://doi.org/10.3390/ijerph17155367.
Hulteen RM, Barnett LM, True L, Lander NJ, del Pozo CB, Lonsdale C. Validity and reliability evidence for motor competence assessments in children and adolescents: a systematic review. J Sports Sci. 2020;38(15):1717–98 https://doi.org/10.1080/02640414.2020.1756674.
Wright C, Buxcey J, Gibbons S, Cairney J, Barrette M, Naylor P-J. A pragmatic feasibility trial examining the effect of job embedded professional development on teachers’ capacity to provide physical literacy enriched physical education in elementary schools. Int J Environ Res Public Health. 2020;17(12):4386 https://doi.org/10.3390/ijerph17124386.
Demetriou Y, Höner O. Physical activity interventions in the school setting: a systematic review. Psychol Sport Exerc. 2012;13(2):186–96 https://doi.org/10.1016/j.psychsport.2011.11.006.
Whitehead M. Definition of physical literacy and clarification of related. ICSSPE Bull J Sport Sci Phys Educ. 2013;65:28–33.
Yi KJ, Cameron E, Patey M, Loucks-Atkinson A, Loeffler T, Sullivan A-M, et al. Future directions for physical literacy education: community perspectives. J Phys Educ Sport. 2020;20(1):123–30.
Marsh HW, Craven RG. Self-other agreement on multiple dimensions of preadolescent self-concept: inferences by teachers, mothers, and fathers. J Educ Psychol. 1991;83(3):393–404 https://doi.org/10.1037/0022-0663.83.3.393.
Faught BE, Cairney J, Hay J, Veldhuizen S, Missiuna C, Spironello CA. Screening for motor coordination challenges in children using teacher ratings of physical ability and activity. Hum Mov Sci. 2008;27(2):177–89 https://doi.org/10.1016/j.humov.2008.02.001.
Longmuir P. Understanding the physical literacy journey of children: the Canadian assessment of physical literacy. ICSSPE Bull J Sport Sci Phys Educ. 2013;65(12.1).
Robinson DB, Randall L. Marking physical literacy or missing the mark on physical literacy? A conceptual critique of Canada’s physical literacy assessment instruments. Meas Phys Educ Exerc Sci. 2017;21(1):40–55 https://doi.org/10.1080/1091367X.2016.1249793.
Canadian Sport for Life (CS4L). Physical literacy assessment for youth: Canadian Sport Institute; 2013.
Healthy Active Living and Obesity Research Group (HALO). Canadian assessment of physical literacy. 2017. https://www.capl-ecsfp.ca.
Lodewyk KR, Mandigo JL. Early validation evidence of a Canadian practitioner-based assessment of physical literacy in physical education: passport for life. Phys Educ. 2017;74(3):441–75 https://doi.org/10.18666/TPE-2017-V74-I3-7459.
Physical & Health Education Canada (PHE). Passport for Life: Teacher’s guide. 2013. http://passportforlife.ca/teacher/teachers-guide.
Eddy LH, Bingham DD, Crossley KL, Shahid NF, Ellingham-Khan M, Otteslev A, et al. The validity and reliability of observational assessment tools available to measure fundamental movement skills in school-age children: a systematic review. PLoS One. 2020;15(8):e0237919 https://doi.org/10.1371/journal.pone.0237919.
Bardid F, Vannozzi G, Logan SW, Hardy LL, Barnett LM. A hitchhiker’s guide to assessing young people’s motor competence: deciding what method to use. J Sci Med Sport. 2019;22(3):311–8 https://doi.org/10.1016/j.jsams.2018.08.007.
Dollman J, Okely AD, Hardy L, Timperio A, Salmon J, Hills AP. A hitchhiker's guide to assessing young people's physical activity: deciding what method to use. J Sci Med Sport. 2009;12(5):518–25 https://doi.org/10.1016/j.jsams.2008.09.007.
Loprinzi PD, Cardinal BJ. Measuring children's physical activity and sedentary behaviors. J Exerc Sci Fit. 2011;9(1):15–23 https://doi.org/10.1016/S1728-869X(11)60002-6.
Essiet IA, Salmon J, Lander NJ, Duncan MJ, Eyre EL, Barnett LM. Rationalizing teacher roles in developing and assessing physical literacy in children. Prospects. 2021;50(1):69–86 https://doi.org/10.1007/s11125-020-09489-8.
Keegan RJ, Barnett LM, Dudley DA, Telford RD, Lubans DR, Bryant AS, et al. Defining physical literacy for application in Australia: a modified delphi method. J Teach Phys Educ. 2019;38(2):105–18 https://doi.org/10.1123/jtpe.2018-0264.
Sport Australia. The Australian physical literacy framework. 2020. https://www.sportaus.gov.au/__data/assets/pdf_file/0019/710173/35455_Physical-Literacy-Framework_access.pdf. Accessed 26 May 2020.
Edwards LC, Bryant AS, Keegan RJ, Morgan K, Cooper S-M, Jones AM. ‘Measuring’ physical literacy and related constructs: a systematic review of empirical findings. Sports Med. 2018;48(3):659–82 https://doi.org/10.1007/s40279-017-0817-9.
Kaioglou V, Venetsanou F. How can we assess physical literacy in gymnastics? A critical review of physical literacy assessment tools. Sci Gymnastics J. 2020;12(1):27–47.
Barnett LM, Dudley DA, Telford RD, Lubans DR, Bryant AS, Roberts WM, et al. Guidelines for the selection of physical literacy measures in physical education in Australia. J Teach Phys Educ. 2019;38(2):119–25 https://doi.org/10.1123/jtpe.2018-0219.
Moher D, Liberati A, Tetzlaff J, Altman DG, Group P. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. 2009;6(7):e1000097 https://doi.org/10.1371/journal.pmed.1000097.
Mokkink LB, De Vet HC, Prinsen CA, Patrick DL, Alonso J, Bouter LM, et al. COSMIN risk of bias checklist for systematic reviews of patient-reported outcome measures. Qual Life Res. 2018;27(5):1171–9 https://doi.org/10.1007/s11136-017-1765-4.
Prinsen CA, Mokkink LB, Bouter LM, Alonso J, Patrick DL, De Vet HC, et al. COSMIN guideline for systematic reviews of patient-reported outcome measures. Qual Life Res. 2018;27(5):1147–57 https://doi.org/10.1007/s11136-018-1798-3.
Terwee CB, Prinsen CA, Chiarotto A, Westerman MJ, Patrick DL, Alonso J, et al. COSMIN methodology for evaluating the content validity of patient-reported outcome measures: a Delphi study. Qual Life Res. 2018;27(5):1159–70 https://doi.org/10.1007/s11136-018-1829-0.
Mandigo J, Lodewyk K, Tredway J. Examining the impact of a teaching games for understanding approach on the development of physical literacy using the passport for life assessment tool. J Teach Phys Educ. 2019;38(2):136–45 https://doi.org/10.1123/jtpe.2018-0028.
Castelli DM, Barcelona JM, Bryant L. Contextualizing physical literacy in the school environment: the challenges. J Sport Health Sci. 2015;4(2):156–63 https://doi.org/10.1016/j.jshs.2015.04.003.
Barnett LM, Dennis R, Hunter K, Cairney J, Keegan RJ, Essiet IA, et al. Art meets sport: what can actor training bring to physical literacy programs? Int J Environ Res Public Health. 2020;17(12):4497 https://doi.org/10.3390/ijerph17124497.
Edinger T, Cohen AM. A large-scale analysis of the reasons given for excluding articles that are retrieved by literature search during systematic review. AMIA Annu Symp Proc. 2013;2013:379–87.
Rezai M, Kolne K, Bui S, Lindsay S. Measures of workplace inclusion: a systematic review using the COSMIN methodology. J Occup Rehabil. 2020;30(3):420–54 https://doi.org/10.1007/s10926-020-09872-4.
Antczak D, Lonsdale C, Lee J, Hilland T, Duncan MJ, del Pozo CB, et al. Physical activity and sleep are inconsistently related in healthy children: a systematic review and meta-analysis. Sleep Med Rev. 2020;51:101278 https://doi.org/10.1016/j.smrv.2020.101278.
De Vet HC, Terwee CB, Mokkink LB, Knol DL. Measurement in medicine: a practical guide. Cambridge: Cambridge University Press; 2011. https://doi.org/10.1017/CBO9780511996214.
Schoemaker MM, Flapper BC, Reinders-Messelink HA, de Kloet A. Validity of the motor observation questionnaire for teachers as a screening instrument for children at risk for developmental coordination disorder. Hum Mov Sci. 2008;27(2):190–9 https://doi.org/10.1016/j.humov.2008.02.003.
Giofre D, Cornoldi C, Schoemaker MM. Identifying developmental coordination disorder: MOQ-T validity as a fast screening instrument based on teachers’ ratings and its relationship with praxic and visuospatial working memory deficits. Res Dev Disabil. 2014;35(12):3518–25 https://doi.org/10.1016/j.ridd.2014.08.032.
Asunta P, Viholainen H, Ahonen T, Cantell M, Westerholm J, Schoemaker M, et al. Reliability and validity of the Finnish version of the motor observation questionnaire for teachers. Hum Mov Sci. 2017;53:63–71 https://doi.org/10.1016/j.humov.2016.12.006.
Nowak A, Schoemaker M. Psychometric properties of the Polish version of the motor observation questionnaire for teachers (MOQ-T). Hum Mov. 2018;19(2):31–8.
Schoemaker MM, Niemeijer AS, Flapper BC, Smits-Engelsman BC. Validity and reliability of the movement assessment battery for children-2 checklist for children with and without motor impairments. Dev Med Child Neurol. 2012;54(4):368–75 https://doi.org/10.1111/j.1469-8749.2012.04226.x.
Kita Y, Ashizawa F, Inagaki M. Is the motor skills checklist appropriate for assessing children in Japan? Brain Dev. 2019;41(6):483–9 https://doi.org/10.1016/j.braindev.2019.02.012.
Capistrano R, Ferrari EP, Souza LP, Beltrame TS, Cardoso FL. Concurrent validation of the MABC-2 motor tests and MABC-2 checklist according to the developmental coordination disorder questionnaire-br. Motriz: Rev Educ Física. 2015;21(1):100–6.
De Milander M, Du Plessis AM, Coetzee FF. Identification of developmental coordination disorder in grade 1 learners: a screening tool for parents and teachers. South Afr J Res Sport Phys Educ Recreation. 2019;41(2):45–59.
De Pasquale C, De Sousa ML, Jidovtseff B, De Martelaer K, Barnett LM. Utility of a scale to assess Australian children’s perceptions of their swimming competence and factors associated with child and parent perception. Health Promot J Austral. 2020;00:1–10 https://doi.org/10.1002/hpja.404.
Weems CF, Reiss S, Dunson KL, Graham RA, Russell JD, Banks DM, et al. Comprehensive assessment of children's psychological needs: development of the child Reiss motivation profile for ages four to eleven. Learn Individ Differ. 2015;39:132–40 https://doi.org/10.1016/j.lindif.2015.03.021.
Mocke LM, Greeff AP, van der Westhuÿsen TB. Aspects of the construct validity of a preliminary self-concept questionnaire. Psychol Rep. 2002;90(1):165–72 https://doi.org/10.2466/pr0.2002.90.1.165.
Africa EK, Kidd M. Reliability of the teen risk screen: a movement skill screening checklist for teachers. South Afr J Res Sport Phys Educ Recreation. 2013;35(1):1–10.
Gresham FM, Cook CR, Collins T, Dart E, Rasetshwane K, Truelson E, et al. Developing a change-sensitive brief behavior rating scale as a progress monitoring tool for social behavior: an example using the social skills rating system—teacher form. Sch Psychol Rev. 2010;39(3):364–79 https://doi.org/10.1080/02796015.2010.12087758.
Nickerson AB, Fishman C. Convergent and divergent validity of the Devereux student strengths assessment. Sch Psychol Q. 2009;24(1):48–59 https://doi.org/10.1037/a0015147.
Doromal JB, Cottone EA, Kim H. Preliminary validation of the teacher-rated DESSA in a low-income, kindergarten sample. J Psychoeduc Assess. 2019;37(1):40–54 https://doi.org/10.1177/0734282917731460.
Molina P, Sala MN, Zappulla C, Bonfigliuoli C, Cavioni V, Zanetti MA, et al. The emotion regulation checklist–Italian translation. Validation of parent and teacher versions. Eur J Dev Psychol. 2014;11(5):624–34 https://doi.org/10.1080/17405629.2014.898581.
Junttila N, Voeten M, Kaukiainen A, Vauras M. Multisource assessment of children's social competence. Educ Psychol Meas. 2006;66(5):874–95 https://doi.org/10.1177/0013164405285546.
Harter S, Pike R. The pictorial scale of perceived competence and social acceptance for young children. Child Dev. 1984;55(6):1969–82 https://doi.org/10.2307/1129772.
Strein W, Simonson T. Kindergartners' self-perceptions: theoretical and measurement issues. Meas Eval Couns Dev. 1999;32(1):31–42 https://doi.org/10.1080/07481756.1999.12068968.
Garrison W, Earls F, Kindlon D. An application of the pictorial scale of perceived competence and acceptance within an epidemiological survey. J Abnorm Child Psychol. 1983;11(3):367–77 https://doi.org/10.1007/BF00914245.
Merrell KW, Cohn BP, Tom KM. Development and validation of a teacher report measure for assessing social-emotional strengths of children and adolescents. Sch Psychol Rev. 2011;40(2):226–41 https://doi.org/10.1080/02796015.2011.12087714.
Romer N, Merrell KW. Temporal stability of strength-based assessments: test–retest reliability of student and teacher reports. Assess Eff Interv. 2013;38(3):185–91 https://doi.org/10.1177/1534508412444955.
Figueiredo P, Azeredo A, Barroso R, Barbosa F. Psychometric properties of teacher report of social-emotional assets and resilience scale in preschoolers and elementary school children. J Psychopathol Behav Assess. 2020;42(4):799–807 https://doi.org/10.1007/s10862-020-09831-6.
Gresham F, Elliott S, Metallo S, Byrd S, Wilson E, Erickson M, et al. Psychometric fundamentals of the social skills improvement system: social–emotional learning edition rating forms. Assess Eff Interv. 2020;45(3):194–209 https://doi.org/10.1177/1534508418808598.
Hightower AD, Work WC, Cowen EL, Lotyczewski BS, Spinell AP, Guare JC, et al. The teacher-child rating scale: a brief objective measure of elementary children's school problem behaviors and competencies. Sch Psychol Rev. 1986;15(3):393–409 https://doi.org/10.1080/02796015.1986.12085242.
Jensen JM, Michael JJ, Michael WB. The concurrent validity of the primary self-concept scale for a sample of third-grade children. Educ Psychol Meas. 1975;35(4):1011–6 https://doi.org/10.1177/001316447503500435.
Wheeler VA, Ladd GW. Assessment of children's self-efficacy for social interactions with peers. Dev Psychol. 1982;18(6):795–805 https://doi.org/10.1037/0012-1649.18.6.795.
Van Alstyne D. A new scale for rating school behavior and attitudes in the elementary school. J Educ Psychol. 1936;27(9):677–93 https://doi.org/10.1037/h0057363.
Leton DA, Collins DR, Koo GY. Factor analysis of the Winnetka scale for rating school behavior. J Exp Educ. 1965;33(4):373–8 https://doi.org/10.1080/00220973.1965.11010897.
Rosenblum S. The development and standardization of the children activity scales (ChAS-P/T) for the early identification of children with developmental coordination disorders. Child Care Health Dev. 2006;32(6):619–32.
Netelenbos JB. Teachers’ ratings of gross motor skills suffer from low concurrent validity. Hum Mov Sci. 2005;24(1):116–37 https://doi.org/10.1016/j.humov.2005.02.001.
Cole DA, Maxwell SE, Martin JM. Reflected self-appraisals: strength and structure of the relation of teacher, peer, and parent ratings to children's self-perceived competencies. J Educ Psychol. 1997;89(1):55–70 https://doi.org/10.1037/0022-0663.89.1.55.
Cole DA, Gondoli DM, Peeke LG. Structure and validity of parent and teacher perceptions of children's competence: a multitrait–multimethod–multigroup investigation. Psychol Assess. 1998;10(3):241–9 https://doi.org/10.1037/1040-3590.10.3.241.
Cole DA, Cho S, Martin JM, Seroczynski A, Tram J, Hoffman K. Effects of validity and bias on gender differences in the appraisal of children’s competence: results of MTMM analyses in a longitudinal investigation. Struct Equ Model. 2001;8(1):84–107 https://doi.org/10.1207/S15328007SEM0801_5.
Gesten EL. A health resources inventory: the development of a measure of the personal and social competence of primary-grade children. J Consult Clin Psychol. 1976;44(5):775–86 https://doi.org/10.1037/0022-006X.44.5.775.
Coelho VA, Sousa V, Marchante M. Social and emotional competencies evaluation questionnaire—Teacher’s version: validation of a short form. Psychol Rep. 2016;119(1):221–36 https://doi.org/10.1177/0033294116656617.
Clark L, Gresham FM, Elliott SN. Development and validation of a social skills assessment measure: the TROSS-C. J Psychoeduc Assess. 1985;3(4):347–56 https://doi.org/10.1177/073428298500300407.
Gresham FM, Elliott SN, Black FL. Factor structure replication and bias investigation of the teacher rating of social skills. J Sch Psychol. 1987;25(1):81–92 https://doi.org/10.1016/0022-4405(87)90063-X.
Elliott SN, Gresham FM, Freeman T, McCloskey G. Teacher and observer ratings of children's social skills: validation of the social skills rating scales. J Psychoeduc Assess. 1988;6(2):152–61 https://doi.org/10.1177/073428298800600206.
Rosenblum S, Engel-Yeger B. Hypo-activity screening in school setting; examining reliability and validity of the teacher estimation of activity form (Teaf). Occup Ther Int. 2015;22(2):85–93 https://doi.org/10.1002/oti.1387.
Estevan I, Molina-García J, Bowe SJ, Álvarez O, Castillo I, Barnett LM. Who can best report on children's motor competence: parents, teachers, or the children themselves? Psychol Sport Exerc. 2018;34:1–9 https://doi.org/10.1016/j.psychsport.2017.09.002.
Baranowski T. Validity and reliability of self report measures of physical activity: an information-processing perspective. Res Q Exerc Sport. 1988;59(4):314–27 https://doi.org/10.1080/02701367.1988.10609379.
Terwee CB, Prinsen C, Chiarotto A, de Vet H, Bouter LM, Alonso J, et al. COSMIN methodology for assessing the content validity of PROMs–user manual. Amsterdam: VU University Medical Center; 2018.
Streiner DL, Norman GR, Cairney J. Health measurement scales: a practical guide to their development and use. USA: Oxford University Press; 2015.
Mokkink LB, Prinsen CA, Patrick DL, Alonso J, Bouter LM, de Vet HC, et al. COSMIN study design checklist for patient-reported outcome measurement instruments. 2019.
Sinesi A, Maxwell M, O'Carroll R, Cheyne H. Anxiety scales used in pregnancy: systematic review. BJPsych Open. 2019;5(1):1–13 https://doi.org/10.1192/bjo.2018.75.
Henderson S, Sugden D, Barnett A. Movement assessment battery for children–2 examiner’s manual. London: Harcourt Assessment; 2007.
Bruininks RH. Bruininks-Oseretsky test of motor proficiency. Circle Pines: American Guidance Service; 1978.
Kiphard EJ, Schilling F. Körperkoordinationstest für Kinder. Überarbeitete und ergänzte Auflage. Göttingen: Beltz Test GmbH; 2007.
Cabrera-Nguyen P. Author guidelines for reporting scale development and validation results in the journal of the Society for Social Work and Research. J Soc Soc Work Res. 2010;1(2):99–103 https://doi.org/10.5243/jsswr.2010.8.
Hjemdal O, Roazzi A, Maria da Graça B, Friborg O. The cross-cultural validity of the Resilience Scale for Adults: a comparison between Norway and Brazil. BMC Psychol. 2015;3(1):18.
Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med. 2016;15(2):155–63 https://doi.org/10.1016/j.jcm.2016.02.012.
Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86(2):420–8 https://doi.org/10.1037/0033-2909.86.2.420.
Robertson SJ, Burnett AF, Cochrane J. Tests examining skill outcomes in sport: a systematic review of measurement properties and feasibility. Sports Med. 2014;44(4):501–18 https://doi.org/10.1007/s40279-013-0131-0.
Terwee CB, Mokkink LB, Knol DL, Ostelo RW, Bouter LM, de Vet HC. Rating the methodological quality in systematic reviews of studies on measurement properties: a scoring system for the COSMIN checklist. Qual Life Res. 2012;21(4):651–7 https://doi.org/10.1007/s11136-011-9960-1.
Whitehead M. Physical literacy: throughout the lifecourse. London: Routledge; 2010. https://doi.org/10.4324/9780203881903.
Lund JL, Kirk MF. Performance-based assessment for middle and high school physical education. Champaign: Human Kinetics; 2019.
Waffenschmidt S, Knelangen M, Sieben W, Bühn S, Pieper D. Single screening versus conventional double screening for study selection in systematic reviews: a methodological systematic review. BMC Med Res Methodol. 2019;19(1):1–9.
Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, et al. The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study. Qual Life Res. 2010;19(4):539–49 https://doi.org/10.1007/s11136-010-9606-8.
Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, et al. The COSMIN study reached international consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes. J Clin Epidemiol. 2010;63(7):737–45 https://doi.org/10.1016/j.jclinepi.2010.02.006.
The first author is supported by a doctoral scholarship from Deakin University Faculty of Health, Australia. Author 2 is funded by an Alfred Deakin Postdoctoral Fellowship. Author 3 is supported by a Leadership Level 2 Fellowship, National Health and Medical Research Council (APP 1176885). Author 6 is a recipient of a doctoral scholarship from Coventry University, United Kingdom. These funders had no role in the design of this study, execution, analyses, and interpretation of the data, or involvement in the writing and decision to submit the manuscript.
The authors declare they have no competing interests.
List of search terms using Boolean connectors “AND” or “OR” to retrieve articles from the databases.
PICO-based (Population, Intervention, Comparison, Outcome) taxonomy of reasons used to exclude articles from the systematic review.
Essiet, I.A., Lander, N.J., Salmon, J. et al. A systematic review of tools designed for teacher proxy-report of children’s physical literacy or constituting elements. Int J Behav Nutr Phys Act 18, 131 (2021). https://doi.org/10.1186/s12966-021-01162-3
- Physical literacy
- Systematic review