A systematic review of tools designed for teacher proxy-report of children’s physical literacy or constituting elements

Background Physical literacy (PL) in childhood is essential for a healthy active lifestyle, with teachers playing a critical role in guiding its development. Teachers can assist children to acquire the skills, confidence, and creativity required to perform diverse movements and physical activities. However, to detect and directly intervene on the aspects of children’s PL that are suboptimal, teachers require valid and reliable measures. This systematic review critically evaluates the psychometric properties of teacher proxy-report instruments for assessing one or more of the 30 elements within the four domains (physical, psychological, cognitive, social) of the Australian Physical Literacy Framework (APLF), in children aged 5–12 years. Secondary aims were to: examine alignment of each measure (and relevant items) with the APLF and provide recommendations for teachers in assessing PL. Methods Seven electronic databases (Academic Search Complete, CINAHL Complete, Education Source, Global Health, MEDLINE Complete, PsycINFO, and SPORTDiscus) were systematically searched originally in October 2019, with an updated search in April 2021. Eligible studies were peer-reviewed English language publications that sampled a population of children with mean age between 5 and 12 years and focused on developing and evaluating at least one psychometric property of a teacher proxy-report instrument for assessing one or more of the 30 APLF elements. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidance was followed for the conduct and reporting of this review. The methodological quality of included studies and quality of psychometric properties of identified tools were evaluated using the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) guidance. Alignment of each measure (and relevant items) with the APLF domains and 30 elements was appraised. 
Results Database searches generated 61,412 citations; reduced to 41 studies that evaluated the psychometric properties of 24 teacher proxy-report tools. Six tools were classified as single domain measures (i.e. assessing a single domain of the APLF), eleven as dual-domain measures, and seven as tri-domain measures. No single tool captured all four domains and 30 elements of the APLF. Tools contained items that aligned with all physical, psychological, and social elements; however, four cognitive elements were not addressed by any measure. No tool was assessed for all nine psychometric properties outlined by COSMIN. Included studies reported a median of 3 out of nine psychometric properties. Most reported psychometric properties were construct validity (n = 32; 78% of studies), structural validity (n = 26; 63% of studies), and internal consistency (n = 25; 61% of studies). There was underreporting of content validity, cross-cultural validity, measurement error, and responsiveness. Psychometric data across tools were mostly indeterminate for construct validity, structural validity, and internal consistency. Conclusions There is limited evidence to fully support the use of a specific teacher proxy-report tool in practice. Further psychometric testing and detailed reporting of methodological aspects in future validity and reliability studies is needed. Tools have been designed to assess some elements of the framework. However, no comprehensive teacher proxy-report tool exists to assess all 30 elements of the APLF, demonstrating the need for a new tool. It is our recommendation that such tools be developed and psychometrically tested. Trial registration This systematic review was registered in the PROSPERO international prospective register of systematic reviews, with registration number CRD42019130936. Supplementary Information The online version contains supplementary material available at 10.1186/s12966-021-01162-3.

Keywords: Assessment, Measurement, Psychometrics, Physical literacy, Child, COSMIN, Systematic review

Background
Adequate levels of physical activity during childhood are associated with considerable health benefits (e.g., improvement in physical fitness, academic performance, cognition, and executive functioning) [1][2][3]. Yet, less than 40% of children in many countries accumulate the levels of physical activity necessary for optimal health [4]. The concept of physical literacy (PL) has been explored in multiple sectors including physical education, sports, recreation, and public health, as a framework to better understand the declining levels of physical activity [5,6]. Growing empirical evidence has demonstrated that PL, or its components, are associated with adherence to physical activity and sedentary behaviour guidelines [7], increased cardiorespiratory fitness [8], resilience [9], and other health indices (including body composition, blood pressure, health-related quality of life) [10] in school-aged children.
Of particular interest when determining PL levels are school-aged children (aged 5-12 years), as the literature suggests that childhood is a critical developmental period for the formation of skills and attributes (e.g., motor competence) that underlie lifelong physical activity habits [7,11]. The school setting has been recognized as a suitable environment that affords children diverse opportunities that can help foster healthy physically active lifestyles, independent of their culture and socioeconomic status [12]. From this equity perspective, schools are also effective sites for targeted physical activity interventions due to the large amount of time children spend attending schools [13]. Teachers (particularly physical educators) have been identified as key players in guiding children's PL development [14]. They can support PL education, conceptualized as the "teaching and learning of the skills, knowledge, attitudes, and behaviours that enhance the responsibility for engagement in lifelong active lifestyles" [15]. Teachers are also trained to be sensitive to the needs of each child and have a broad basis for knowing their students as they interact with a large number of different children, and thus have a frame of reference on which to base their judgements [16]. Therefore, teachers may be well suited to identify elements (such as motor competence, motivation and confidence) of a child's PL [17]. For such identification, valid and reliable PL teacher assessment protocols are required.
Recently, PL scholarship has been directed towards designing assessment tools (both subjective and objective) for different targeted users (including preschoolers, children, youth, teachers, parents). Indeed, assessment is crucial to the planning and evaluation of programs targeted at enhancing PL levels, and could help identify domains of a child's PL that are suboptimal [18]. As such, following Robinson and Randall [19], an effective PL assessment protocol should address all of its constituting domains (e.g., affective, behavioural, physical, and cognitive). However, few protocols have been designed specifically for use by teachers to evaluate children's PL [19]. Examples include the PLAYfun and basic [20]; the CAPL via the Canadian Agility and Movement Skill Assessment (CAMSA) and fitness tests [21]; and the PFL via fitness and movement skills tests [22,23]. These existing teacher assessment tools largely utilize objective observational approaches (i.e. rely on the teacher observing children perform a series of standardized tasks) [24] rather than teacher proxy-report, and have narrowly focused on the physical domain, thereby neglecting the psychological, social, and cognitive aspects of PL. Comparatively, teacher proxy-report instruments (retrospectively completed questionnaires) have received much less attention despite their suitability for assessing large sample sizes and their minimal manual data entry requirements [25,26]. Literature has further suggested that teacher proxy-reporting presents a promising avenue to obtain more reliable estimates of a child's PL, as children under 10 often present with limited cognitive ability to make accurate judgements of their own capabilities [27].
More specifically, a notable gap in PL assessment is the paucity of teacher proxy-report measures that recognize components of the expansive and comprehensive Australian Physical Literacy Framework (APLF) [28]. In 2016, after acknowledging the lack of international consensus on PL's definition, conceptualization, and operationalization, Sport Australia (a Federal Government agency responsible for supporting sport in Australia) proposed arguably the most comprehensive definition and framework for PL to date. See Keegan et al. [29] for a detailed articulation of the Australian definition. The APLF identified a combined total of 30 elements spanning four major domains (physical, psychological, social, and cognitive) as being fundamental to PL development (Fig. 1) [29]. For the purpose of this manuscript, the authors adopt the comprehensive PL definition and framework offered by Sport Australia. To date, only two systematic reviews have been published in relation to PL assessment [31,32]. In Edwards et al.'s [31] review, PL assessment/measurement approaches were broadly categorized as qualitative and quantitative. Though quantitative measures for PL and its related constructs were identified, the review did not engage in a detailed and in-depth analysis of the psychometric properties of the measures. Furthermore, the search strategy utilized by the authors did not address each individual element (e.g., motivation, confidence, movement skills) of PL, including those belonging to the APLF. More recently, Kaioglou and Venetsanou [32] reviewed existing PL measures used within the context of gymnastics. As in Edwards et al. [31], search terms did not capture individual elements of PL (including APLF elements). Hence, only tools for assessing PL in its entirety were identified (e.g., Canadian Assessment of Physical Literacy [CAPL]; Passport for Life [PFL]; Physical Literacy Assessment for Youth [PLAY]).
Neither review focused specifically on identifying teacher proxy-report measures for PL or its constituting elements. Barnett et al. [33] have suggested that teachers have limited guidance when choosing appropriate protocols for assessing PL.
Taking all this into account, the objectives of the current systematic review were two-fold. The primary aim was to critically evaluate the psychometric properties of teacher proxy-report instruments for assessing one or more of the 30 elements within the four domains of the APLF, in children aged 5-12 years. Secondary aims were to examine the alignment of each tool (and relevant items within) with the APLF and provide recommendations for teachers in assessing PL in children aged 5-12 years. A review of this nature will assist teachers (and indeed researchers) in making informed decisions when selecting suitable and psychometrically sound measures for assessing elements within the APLF.

Literature search strategy
The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) [34] and the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) guidelines [35][36][37] were used as methodological and reporting guidelines for this systematic review. See the completed PRISMA checklist attached as Additional file 1. Prior to review commencement, details of the review protocol were registered on PROSPERO (CRD42019130936). The first author systematically searched for peer-reviewed articles on seven databases: Academic Search Complete, CINAHL Complete, Education Source, Global Health, MEDLINE Complete, PsycINFO, and SPORTDiscus. These databases encompass areas related to psychology (including psychometrics), education, sport, and health, and were deemed relevant to the comprehensive definition/framework of PL used in this review, and therefore enhanced the likelihood of identifying relevant papers from many diverse disciplines. Date restrictions were not applied to searches. Database searches were originally completed in October 2019 and updated in April 2021. All searches were limited to title, abstract, and keyword. Additional limits of "English language" and "peer review" were applied. To ensure that search terms were not overly simplistic, a comprehensive search filter containing a selection of search terms provided by COSMIN for finding studies on measurement properties, combined with search terms relevant to the 30 APLF elements (identified from published systematic reviews), was utilized to identify studies concerning the target population (see Additional file 2 for the full search strategy). Reference lists of literature reviews and eligible studies were also searched for additional papers. All searches were performed by the first author with the assistance of the university's librarian.

Eligibility criteria
Studies were included if: (a) they were peer-reviewed and written in English; (b) study participants included children with a mean age between 5 and 12 years; (c) they focused on developing and evaluating at least one psychometric property of a teacher proxy-report instrument; and (d) the instruments assessed one or more of the 30 elements within the APLF. Because the application of PL goes beyond the context of physical education and encompasses before- and after-school programming, recess, and classroom activities [38,39] and could be applied in performing arts [40], teacher proxy-report instruments that assessed elements in general contexts (not just in sport and physical activity) were included. For example, instruments assessing "self-regulation" in general, and those assessing self-regulation in the context of physical activity, were both included.
Studies were excluded if: (a) they were tool manuals, abstracts (including poster abstracts), conference proceedings, dissertations, commentaries, editorials, review articles, or letters; (b) they utilized assessment formats other than teacher proxy-report (e.g., self-report, objective measures); (c) study participants were younger than five or older than 12 years; or (d) they utilized proxy-respondents of children not in elementary or primary school, or of children younger than five or older than 12 years. In registering the protocol for this review, it was our initial intention to exclude studies that involved non-typically developing children (such as those with learning difficulties or developmental delay). However, following the literature search, we noted that most teacher proxy-report tools for motor competence (related to the physical domain of PL) were originally designed with the intention of identifying children with developmental coordination disorder (DCD), and in some cases included participants with DCD (for instance, when assessing discriminant validity). As such, these tools were retained in order to ensure motor competence teacher proxy-report measures were not excluded from the review. Measures developed to assess children with other disabilities (i.e. those in relation to elements other than motor competence) were excluded from the review.

Study selection
Titles and abstracts were exported to Covidence (www.covidence.org), an online software for managing systematic reviews. Following removal of duplicates, the first author screened all titles and abstracts for eligibility, based on the aforementioned criteria. Full-text articles were retrieved for further examination where it was not possible to make inclusion decisions based solely on the title and abstract. Following initial selection, full-text articles were independently examined by paired combinations of three review authors (IE-NL, IE-LB, and NL-LB). For consistency, a PICO-based hierarchy of exclusion reasons was developed based on past literature [41], and used to guide the exclusion of studies during the full-text review phase (see Additional file 3). Any conflicts between the three reviewers over study inclusion were resolved via review and discussion.

Data extraction
In line with the criteria proposed by COSMIN, data collection involved extracting information on the general characteristics of included studies as follows: (a) instrument, author(s) and year of publication; (b) general construct assessed; (c) APLF domain(s) assessed; (d) targeted age group/grades; (e) sample population/country; (f) sample size, mean age, standard deviation; (g) instrument available translation; (h) completion time (minutes or seconds); (i) recall period; (j) tool subscale(s)/number of items; (k) response options; (l) psychometric properties evaluated/statistical tests utilized. The data extraction form was piloted on two randomly selected included studies prior to data collection by IE. JM checked all extracted data for completeness and correctness.

Methodological quality assessment of studies
Following COSMIN's recommendations, the current review assessed nine measurement properties: (a) content validity, (b) structural validity, (c) internal consistency, (d) cross-cultural validity, (e) reliability, (f) measurement error, (g) criterion validity, (h) construct validity, and (i) responsiveness; see Prinsen et al. [36] for a definition of each term. To evaluate the methodological quality of the selected studies, the recently updated COSMIN Risk of Bias checklist [35,37], which contains 10 boxes, was utilized. Each box of the checklist comprises 3 to 35 standards for evaluating the study design and statistical methods utilized in reliability and validity studies. To date, the COSMIN checklist is the only validated and standardized tool for assessing the methodological quality of health-related outcome measures [42].
Depending on the information reported in each study, items in each box of the checklist were rated on a four-point scale using the descriptors "Very Good", "Adequate", "Doubtful", and "Inadequate". A "Not Applicable" option was also included for each measurement property. To determine the overall methodological quality for each individual measurement property per study, the lowest rating across the items in the box was taken, a method known as the "worst score counts" principle. For example, if for a reliability study one item in a box is rated as "Inadequate" despite all other items being rated as "Very Good", the overall methodological quality of that reliability study will be "Inadequate". According to COSMIN, this stringent rule is necessary as poor methodological aspects of a study cannot be compensated for by good aspects [37]. To ensure accuracy of the quality assessment, IE completed risk of bias analyses for 22 of the included studies. The articles were then double rated by two independent reviewers (NL, LB) who had both received training on using COSMIN. After disagreements were resolved, IE completed quality assessment for the remaining articles. To summarize the results of methodological quality per tool, the authors used a cut-off of ≥60% [43] of measurement properties rated as "Very Good" or "Adequate" across all single studies to indicate "good" methodological quality.
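The two scoring rules described above can be illustrated with a minimal Python sketch. It is not part of the COSMIN materials or of the review's own workflow; the function names (`box_quality`, `tool_has_good_quality`) are hypothetical, and the handling of "Not Applicable" ratings (they are simply skipped) is an assumption made for illustration.

```python
# Descriptors ordered from best to worst, as listed in the text.
RATING_ORDER = ["Very Good", "Adequate", "Doubtful", "Inadequate"]


def box_quality(item_ratings):
    """Apply the "worst score counts" principle to one checklist box.

    Assumption: items rated "Not Applicable" are ignored.
    """
    applicable = [r for r in item_ratings if r != "Not Applicable"]
    if not applicable:
        return "Not Applicable"
    # The rating with the highest index in RATING_ORDER is the worst one.
    return max(applicable, key=RATING_ORDER.index)


def tool_has_good_quality(property_ratings, cutoff=0.60):
    """Check the >=60% cut-off across a tool's rated measurement properties.

    `property_ratings` holds one overall rating per measurement property
    per study for a single tool (hypothetical input layout).
    """
    rated = [r for r in property_ratings if r != "Not Applicable"]
    if not rated:
        return False
    good = sum(r in ("Very Good", "Adequate") for r in rated)
    return good / len(rated) >= cutoff


# One "Inadequate" item outweighs any number of "Very Good" items.
print(box_quality(["Very Good", "Very Good", "Inadequate"]))
# Four of five properties rated "Very Good" or "Adequate" (80%).
print(tool_has_good_quality(
    ["Very Good", "Adequate", "Adequate", "Very Good", "Doubtful"]))
```

The first call mirrors the reliability example given in the text; the second shows a tool that would clear the 60% threshold under these assumptions.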

Quality criteria for measurement properties of single studies and evidence summary
Results obtained from single studies on measurement properties were rated against COSMIN's updated criteria for good measurement properties. Each result was rated as either sufficient (+), insufficient (−), or indeterminate (?) [36]. For studies reporting on content validity, the quality of the results was rated using the criteria for relevance (5), comprehensiveness (1), and comprehensibility (4) [37]. Regarding hypothesis testing for construct validity and responsiveness, COSMIN recommends formulating hypotheses a priori, before the review commences [35]. Following De Vet et al. [44], for both measurement properties, correlations were expected to be: ≥ 0.50 with instruments measuring similar constructs; < 0.50 and ≥ 0.30 with instruments measuring related but dissimilar constructs; and < 0.30 with instruments measuring unrelated constructs. No hypotheses were formulated for expected differences between groups (e.g., age, gender) for discriminant and known-groups validity.
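As an illustration of the a priori correlation hypotheses, the short Python sketch below checks whether an observed correlation falls in the expected band. The function name and the relationship labels are hypothetical, and comparing the absolute value of the coefficient is an assumption of this sketch rather than something stated in the review.

```python
def hypothesis_confirmed(r, relationship):
    """Return True if correlation `r` falls in the band expected a priori.

    relationship: "similar", "related" (related but dissimilar),
    or "unrelated" (hypothetical labels for the three cases in the text).
    Assumption: the magnitude |r| is compared, so sign is ignored.
    """
    magnitude = abs(r)
    if relationship == "similar":
        return magnitude >= 0.50
    if relationship == "related":
        return 0.30 <= magnitude < 0.50
    if relationship == "unrelated":
        return magnitude < 0.30
    raise ValueError(f"unknown relationship: {relationship!r}")


print(hypothesis_confirmed(0.62, "similar"))    # True
print(hypothesis_confirmed(0.62, "related"))    # False
print(hypothesis_confirmed(0.12, "unrelated"))  # True
```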
Due to considerable differences across studies in terms of sample characteristics and size, statistical tests utilized, reliability or validity type investigated, results from single studies could not be pooled in a meta-analysis. Therefore, as recommended by the COSMIN, an overall rating of study results per measurement property per tool was summarized as sufficient (+), insufficient (−), indeterminate (?), or inconsistent (±). Specifically, an overall rating was determined through combining the scoring of each single study; if ≥75% of the studies displayed the same scoring, that scoring became the overall rating (+ or −), whereas if < 75% of studies displayed the same scoring, the overall rating became inconsistent (±) [36].
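The 75% rule described above can be sketched in Python as follows. This is an illustrative reading of the rule, not the COSMIN procedure itself; the function name `overall_rating` is hypothetical, and returning "?" when at least 75% of single-study ratings are indeterminate is an assumption of this sketch.

```python
from collections import Counter


def overall_rating(study_ratings, threshold=0.75):
    """Combine single-study ratings for one measurement property of one tool.

    study_ratings: list of "+", "-", or "?" (one entry per study).
    Returns the shared rating if at least `threshold` of the studies agree,
    otherwise "±" (inconsistent).
    Assumption: a dominant "?" is passed through as indeterminate.
    """
    if not study_ratings:
        raise ValueError("at least one study rating is required")
    rating, count = Counter(study_ratings).most_common(1)[0]
    if count / len(study_ratings) >= threshold:
        return rating
    return "±"


print(overall_rating(["+", "+", "+", "-"]))  # 3 of 4 agree (75%)
print(overall_rating(["+", "-"]))            # only 50% agree
```

Under these assumptions the first call yields a sufficient rating and the second an inconsistent one.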

Search results
Initial searches of the seven databases in October 2019 generated a combined total of 56,615 citations. The updated search in April 2021 identified 4,797 new citations. Following removal of duplicates, title and abstract screening of 20,724 references (including an additional 31 articles identified through manual searching) yielded 424 articles deemed potentially relevant. After eligibility criteria were applied to full-text versions of the 424 publications, a total of 41 studies evaluating the psychometric properties of 24 unique teacher proxy-report measures for elements within the APLF were identified. A flow chart of study selection was prepared in accordance with the PRISMA statement (detailed in Fig. 2).

General characteristics of included studies
A description of the study characteristics and their assessment instruments is presented in Table 1. The 41 studies were published between 1936 and 2020 and were conducted within the United States (n = 18), Netherlands (n = 3), South Africa (n = 3), Finland (n = 2), Italy (n = 2), Israel (n = 2), Portugal (n = 2), Australia (n = 1), Poland (n = 1), Canada (n = 1), Japan (n = 1), and Brazil (n = 1). Study location was unspecified in four studies. All relevant domains of the APLF (i.e. physical, psychological, social, and cognitive) assessed in each measure were identified (see Table 1). Tools were categorized as single domain (assessing one domain of the APLF), dual-domain (assessing two domains), and tri-domain (assessing three domains) measures. The majority of tools identified in this review assessed elements across two domains of the APLF (see Fig. 3). No single teacher proxy-report measure assessed elements in all four domains of the APLF. A detailed synthesis of how each tool (and relevant items) is aligned with individual elements of the APLF is presented in Table 4.
Furthermore, there was a considerable degree of homogeneity in relation to the targeted age group/grades for identified tools. Most tools spanned the entire age range (i.e. for children between 5 and 12 years) and thus were suitable for both younger and older children. Tool completion times were not often reported but, when reported, ranged between three and 15 min per child. Scales ranged from 10 [17,84] to 80 items [54]. The 41 studies assessed a median of three out of the nine measurement properties recognized by COSMIN. The most commonly reported psychometric properties were construct validity (n = 32; 78% of studies), structural validity (n = 26; 63% of studies), and internal consistency (n = 25; 61% of studies). Statistical tests utilized to evaluate measurement properties varied across the review. For instance, confirmatory factor analysis was the most frequently used statistical approach for studies reporting on structural validity, whereas correlations were used for hypothesis testing for construct validity. Construct validity was mostly tested by comparing scores obtained for a tool with another measure assessing a similar construct. On the other hand, criterion validity was evaluated by comparing scores obtained for a tool with a gold standard measure. Tool development studies were conducted for eight measures including the BBRS [57], CHAS-T [74], GMRS [75], HRI [79], SEARS-T [65], SSRS-T [81], T-CRS [69], and Winnetka Scale for Rating School Behaviour [72]. Content validity was only reported for two tools (CHAS-T and PSPWC) [53,74].

Psychometric properties

Methodological quality assessment
Table 2 details the methodological quality assessment of the 41 studies included in the review.

Measurement property assessment of instruments
In this section, the overall rating of each tool was appraised (see Table 3). Criterion validity, performed for five tools, was rated as sufficient for the CHAS-T, MOQ-T and TEAF; inconsistent for the MABC-2 Checklist; and insufficient for the GMRS. Cross-cultural validity was evaluated for the MABC-2 Checklist and was rated as indeterminate because no multiple group factor analysis was performed in the single study. For construct validity, results were mostly indeterminate in rating. Internal consistency coefficients were sometimes provided for the entire scale and/or its subscales. For the most part, tools were rated as indeterminate as a result of insufficient evidence on structural validity and/or provision of Cronbach alpha values for the total scale and not per subscale. Results quality for test-retest and inter-rater reliability was mostly indeterminate as intraclass correlation coefficient (ICC) values were not calculated for continuous scores. The only exception was the TRS Checklist, which had ICC values below 0.70 for most subscales and was considered as having insufficient reliability. Overall, no tool was consistently evaluated as having sufficient ratings for all its measurement properties. Only five tools (i.e. MOQ-T, ERC, MASCS, QACSE-P-SF, and TEAF) had at least two sufficient ratings across their measurement properties.

Physical literacy alignment
Item/content alignment of each tool with the APLF was appraised (see Table 4). Also highlighted in Table 4 are tools with good methodological quality and sufficient results quality (i.e. at least two sufficient ratings) based on evidence synthesis, as well as tools (n = 10) assessing the PL elements in the context of physical activity. The number of measures that mapped onto individual APLF elements ranged from 1 to 15. All elements in three (i.e. the physical, psychological, and social) out of four domains of the framework were addressed. Relationships, self-regulation (emotions), and collaboration were the elements most frequently assessed by the included measures. The least captured elements were speed, connection to place, and tactics. Water skills, a component of the element movement skills, was assessed in one tool [53]. Four of the APLF elements belonging to the cognitive domain (content knowledge, reasoning, strategy and planning, and perceptual awareness) were not addressed by any measure. Tools capturing the most elements of the APLF included the GMRS (15 out of 30), the HRI (9 out of 30), and the TEAF (8 out of 30). Lastly, Harter's TRS covered three domains (physical, social, cognitive) of the framework. However, due to the lack of specificity of items contained within the tool (e.g., "This child doesn't do well at new outdoor games"; "This child does really well at all kinds of sports"; "This child is better than others his/her age at sports"), mapping it onto the individual elements of the framework proved rather difficult.

Discussion
This is the first review to critically evaluate the psychometric properties of teacher proxy-report instruments designed to assess one or more elements of children's PL. As a consequence, the current study represents a novel contribution to the literature base relating to PL and its assessment. PL assessment can help identify aspects of children's PL that are suboptimal, as well as provide an evidence base for evaluating the effectiveness of interventions targeted at improving PL levels. More specifically, a focus on teacher proxy-report instruments for children's PL is needed due to children's limited cognitive abilities when making self-assessments of their own capabilities [27,62,85]. Baranowski [86] has further suggested that children are also limited in their ability to recall specific events that occurred in the past. Indeed, Bardid et al. [25] have reported that teacher proxy-reports (especially by physical education specialists) may provide more accurate estimates of a child's capabilities (e.g., motor competence) than child self-report.
Importantly, in the current review, alignment with individual elements of the APLF was further appraised for each teacher proxy-report measure. The first finding is the lack of valid and reliable teacher proxy-report instruments that assess PL in its entirety, based on the comprehensive APLF. There are, however, tools available to assess some elements of the framework. Specifically, 41 studies evaluating the psychometric properties of 24 teacher proxy-report tools for the APLF elements were identified. The psychometric properties of identified measures were variable, with many typically unreported or inadequately assessed.

Psychometric properties
No single tool reported all nine psychometric properties outlined by the COSMIN methodology [35][36][37]. Measurement properties frequently reported included construct validity, structural validity, and internal consistency. Content validity and cross-cultural validity were the most rarely reported. No studies reported measurement error and responsiveness. These mirror findings of a recently published review of motor competence assessments for children and adolescents, which highlighted that construct validity was frequently reported whereas content validity was the least evaluated psychometric property [43].
Content validity is often considered the most important measurement property of an instrument [87], and is needed to ensure that the tool has appropriate number of items and adequately captures the construct/element under investigation [88]. COSMIN distinguishes between tool development and content validity studies in that the former involves concept elicitation, development, and pilot testing a new tool; whereas the latter entails testing of an existing tool [87]. In this review, most tool development studies were given the lowest possible rating of "inadequate". This was either because tool development studies were not performed utilizing a sample representative of the tool's targeted population or no pilot tests or cognitive interviews were performed for the newly developed tool. On the other hand, just two studies reported on content validity for the CHAS-T [74] and PSPWC [53]. The comprehensibility, relevance, and comprehensiveness of items in the CHAS-T [74] was explored by teachers and professionals. In this review, the instrument was rated as doubtful for methodological quality as there was a lack of reporting of the qualitative and analytical methods utilized for the content validation process. Another study reported content validity Table 4 An overall indication of the quality of each instrument and alignment with the APLF elements        for the TRS tool [56]; however, the review team failed to find any report regarding the relevance or comprehensiveness of items from the perspective of the targeted users of the tool and/or professionals. The PSPWC [53] was rated as doubtful in methodological quality as it was not clear if there were two researchers involved in analysis of qualitative interviews and whether skilled interviewers were used during interviews. 
According to COSMIN's updated guidelines, if the content validity of a tool is unknown, the results for other measurement properties of the tool should be ignored and not further appraised as this hinders the interpretation and generalization of study findings [36]. Given the importance of this measurement property, there is an urgent need to prioritize content validity studies in the future development of teacher proxy-report PL instruments. Future studies should consider using the COSMIN Study Design checklist [89], which offers clear standards for designing studies aimed at evaluating the measurement properties of instruments. Specifically, for content validity studies, tool developers should obtain information from targeted tool users and professionals regarding the relevance, comprehensibility, and comprehensiveness of the instructions, response options and items contained within the tool. For this, a widely recognized or well justified qualitative research approach is preferred, whereby each item on the tool is evaluated by at least seven individuals from the target population of interest and professionals (see Mokkink et al. [89] for the design requirements).
Few studies assessed criterion validity, that is, validated a measure against a reference ("gold") standard. Criterion validity ensures the accuracy of a scale when compared to a reference standard [90]. Being widely tested and validated measures, the MABC motor test [91], the Bruininks-Oseretsky test of motor proficiency [92], and the Körperkoordinationstest für Kinder test [93] were considered to be reasonable "gold" standards for motor skill assessment. Hence, all studies comparing a teacher proxy-report tool to these measures were considered studies on criterion validity [36]. It is important to note that there were a few cases where authors used the term criterion validity when comparisons were made with other measures assessing a similar construct. In these instances (as specified in the COSMIN user manual [36]), this was considered to be evidence of construct validity rather than criterion validity. In this review, most studies on criterion validity appeared to have good methodological quality, with evaluated measures having sufficient results quality. Similar findings were noted by Antczak et al. [43] for criterion validity studies of motor competence assessments. However, it has been argued that the design of the COSMIN checklist, in terms of the number of standards contained in each measurement property and the use of the "worst score counts" principle, could significantly affect its overall scoring. For instance, a measurement property such as criterion validity, which contains fewer standards (three in total), may fare better in its overall scoring than properties with a larger number of standards (e.g., 35 standards for content validity) [43].
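The effect of the "worst score counts" principle can be illustrated with a minimal sketch. The function name and the ratings below are hypothetical and do not come from any included study; the four labels follow the COSMIN four-point rating scale. The sketch only shows why a property with many standards has more opportunities to receive a low overall rating than a property with three standards.

```python
# COSMIN rating labels, ordered from best to worst.
RATINGS = ["very good", "adequate", "doubtful", "inadequate"]


def worst_score_counts(standard_ratings):
    """Return the lowest rating given to any standard of one measurement property."""
    return max(standard_ratings, key=RATINGS.index)


# Hypothetical ratings (assumption): a property with 3 standards
# and a property with 35 standards, one of which was rated "doubtful".
criterion_validity = ["very good", "very good", "adequate"]
content_validity = ["very good"] * 34 + ["doubtful"]

print(worst_score_counts(criterion_validity))  # adequate
print(worst_score_counts(content_validity))    # doubtful
```

In this hypothetical case a single doubtful standard out of 35 determines the overall rating, regardless of how the remaining 34 standards were rated.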
The methodological quality of studies reporting structural validity was mixed. The common reasons for doubtful or inadequate COSMIN ratings were insufficient sample size and/or statistical design flaws such as a lack of reporting of the number of teachers involved in the study and how these clustering effects (if any) were accounted for in the analytical design. Furthermore, for many tools, result ratings were indeterminate due to the use of exploratory factor analysis (including principal component analysis) as the updated COSMIN does not provide any criteria for rating these techniques. Ideally, a confirmatory factor analysis should follow an exploratory factor analysis (preferably using a different sample), as the former verifies an a priori exploratory factor analysis-informed theory regarding a tool's factor structure [94]. Given that some of these deficiencies can be resolved by more detailed reporting and further psychometric testing, future studies should consider adopting guidelines offered by COSMIN for reporting of structural validity studies.
Only one of 41 studies was assessed for cross-cultural validity, as it had translated a measure (MABC-2 Checklist) from English to Japanese, and compared scores obtained from two samples (i.e., United Kingdom and Japan) [50]. This study did not perform well for either methodological or results quality. Notably, a number of studies [47,48] within this review translated a measure from its original language to a different language without assessing cross-cultural validity. Future studies should determine cross-cultural validity for translated instruments, utilizing appropriate techniques (e.g., multi-group confirmatory factor analysis for classical test theory or differential item functioning for item response theory) [35,36]. This is because instruments may perform differently across different cultures, different gender or age groups, and different populations [95].
Most construct validity studies performed adequately for methodological quality; however, overall results quality was mostly indeterminate. This may have been influenced by the lack of a priori hypotheses for expected differences between groups for known-groups/discriminant validity.
Internal consistency values (the degree of interrelatedness among items in a subscale [36]) had to be calculated separately for each unidimensional scale or subscale to obtain good ratings for methodological quality. Deficiencies in studies were mostly because Cronbach's alpha values were provided for the entire scale and not per subscale. Similarly, results of internal consistency were indeterminate for many studies as Cronbach's alpha was provided for the entire scale and there was evidence of insufficient structural validity. COSMIN considers evidence on structural validity (or unidimensionality) a prerequisite for interpreting Cronbach's alpha values [36]. Given these findings, we recommend that, as a starting point, future studies ensure that evidence exists for sufficient unidimensionality or structural validity of a tool and thereafter report Cronbach's alpha (for continuous scores) for each subscale.
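As an illustration of reporting Cronbach's alpha per subscale rather than for the entire scale, the following is a minimal sketch. The subscale names and item scores are hypothetical (they are not data from any reviewed tool), and the sketch assumes that each subscale has already been shown to be unidimensional.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for one subscale.

    items: one list of scores per item, with children in the same order in each list.
    """
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per child
    sum_item_variances = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - sum_item_variances / variance(totals))


# Hypothetical teacher ratings (assumption): 3 items per subscale, 5 children.
subscales = {
    "motor": [[3, 4, 2, 5, 4], [3, 5, 2, 4, 4], [2, 4, 3, 5, 5]],
    "social": [[1, 2, 4, 3, 5], [2, 2, 5, 3, 4], [1, 3, 4, 2, 5]],
}

# Report one alpha per subscale instead of a single alpha for all six items.
for name, items in subscales.items():
    print(name, round(cronbach_alpha(items), 2))
```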
Reliability (test-retest and inter-rater) studies in this review did not rate well for methodological quality. For the majority of studies, Pearson's correlations (a measure of the relationship between two variables [96]) were used to explore this measurement property rather than intraclass correlations (ICC) for continuous scores, as recommended by COSMIN [36]. Past literature has highlighted that Pearson's correlation is an inappropriate and liberal measure of reliability, often producing reliability coefficients that are higher than the true reliability [88,97]. It was also difficult to determine whether participants were stable in the interim between measurements or whether the testing conditions were similar for the measurements taken. As ICC values were not calculated, results were rated as indeterminate for the majority of studies in this review. Studies should consider the use of intraclass correlations when exploring the reliability of continuous variables as they reflect both the correlation and the agreement between measurements taken by an instrument [96].
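The difference between correlation and agreement can be shown with a small worked sketch. The scores are hypothetical: the retest scores are simply the test scores plus five points, a systematic shift. The ICC form used here (two-way random effects, absolute agreement, single measurement) is one of several ICC models and is chosen for illustration only; it is not a form prescribed by the reviewed studies.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation between two sets of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def icc_agreement(x, y):
    """Two-way random-effects, absolute-agreement, single-measurement ICC for two occasions."""
    n, k = len(x), 2
    rows = list(zip(x, y))
    grand = (sum(x) + sum(y)) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(x) / n, sum(y) / n]
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (v - row_means[i] - col_means[j] + grand) ** 2
        for i, r in enumerate(rows)
        for j, v in enumerate(r)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)


# Hypothetical scores (assumption): every child is rated 5 points higher at retest.
test = [10, 12, 14, 16, 18]
retest = [15, 17, 19, 21, 23]

print(round(pearson(test, retest), 2))        # 1.0
print(round(icc_agreement(test, retest), 2))  # 0.44
```

Pearson's correlation is perfect because the rank order and spacing of children are preserved, whereas the agreement ICC is much lower because the two sets of scores differ systematically.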
Two measurement properties (responsiveness and measurement error) were not explored in any study in this review. COSMIN refers to responsiveness as a measure's ability to detect change over time in the construct of interest, whereas measurement error is regarded as error in the scores obtained that is not the result of changes in the construct of interest [36]. No study included in this review evaluated the minimal important change or minimal important difference of their tools. Without information on the measurement error of these tools, it is unclear whether changes in scores on the constructs assessed are meaningful and matter to teachers. Studies have also previously noted underreporting of responsiveness [98]. This is concerning because, without evidence of responsiveness, it is difficult to assess the effectiveness of interventions designed to improve PL or its components.
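To make the role of measurement error concrete, the following minimal sketch uses two commonly cited formulas: the standard error of measurement, SEM = SD × √(1 − ICC), and the smallest detectable change, SDC = 1.96 × √2 × SEM. The standard deviation and ICC values are hypothetical and are not taken from any tool in this review.

```python
from math import sqrt


def standard_error_of_measurement(sd, icc):
    """SEM from the score standard deviation and a reliability (ICC) estimate."""
    return sd * sqrt(1 - icc)


def smallest_detectable_change(sem):
    """Change in an individual score that exceeds measurement error (95% level)."""
    return 1.96 * sqrt(2) * sem


# Hypothetical values (assumption): SD of 10 points and an ICC of 0.84.
sem = standard_error_of_measurement(sd=10, icc=0.84)
sdc = smallest_detectable_change(sem)
print(round(sem, 2), round(sdc, 2))
```

Under these assumed values, a change in a child's score smaller than the smallest detectable change could not be distinguished from measurement error, which is why this value is usually compared with the minimal important change.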
In summary, for the studies included in this review, a median of three out of nine psychometric properties were reported. Content validity, which is considered the most important property, was sparingly reported. This restricts our justification for the use of specific teacher proxy-report tools in practice until further psychometric testing is conducted. However, based on the available evidence and after combining the ratings of methodological quality and the criteria for good measurement properties provided by COSMIN, the best results were received for the following tools: MASCS, MOQ-T, QACSE-P-SF and TEAF. Combined, these tools assess a total of 18 elements of the APLF. Of these tools, the MOQ-T and TEAF assess the APLF elements in relation to physical activity. The ERC had good psychometric evidence but was lacking in methodological rigour. Terwee et al. [99] have highlighted that results of studies lacking in methodological quality should not be trusted. Caution is needed when interpreting these results, though, as some of these tools (specifically MASCS and QACSE-P-SF) were evaluated in single studies and, as such, are in need of repeated psychometric testing in different populations. Furthermore, in the current review, the MABC-2 checklist was found to be one of the most widely examined tools for reliability and validity. Surprisingly, despite having good methodological quality for most of its measurement properties, our findings reveal that the checklist has limited psychometric evidence to support its reliability and validity, suggesting the need for more validation studies. The current systematic review highlights a need for further psychometric testing (especially of content validity, cross-cultural validity, measurement error, criterion validity, and responsiveness), with more detailed reporting of methodological aspects and results in future studies.
Taking such an approach will provide teachers with a more robust foundation when selecting appropriate and psychometrically sound measures for assessing PL.

Physical literacy alignment
The APLF is unique in that it recognizes a variety of skills and attributes straddling four inter-related learning domains (physical, psychological, social, and cognitive) as needed for PL development. More specifically, the framework incorporates elements outside the physical domain that have not previously featured in other definitions. These elements may be equally beneficial for integrated movement experiences to develop PL [40]. One example is the element collaboration, situated in the social domain, which reflects the social skills (e.g., conflict resolution, cooperation, and leadership) required to successfully interact with others in movement and physical activity contexts [30]. This element is potentially as important as other elements (e.g., movement skills) and should be assessed in children.
Our review findings suggest a paucity of teacher proxy-report measures that address several elements of the APLF. In particular, elements such as speed, connection to place, tactics, content knowledge, reasoning, strategy and planning, and perceptual awareness were either rarely assessed or not assessed by the identified tools. Interestingly, the elements most frequently assessed appeared to fall within the social domain, suggesting the availability of many teacher assessment options for this domain. Because of our wider search for tools beyond the physical activity/physical education literature, only the PSPCSA-T and Harter's TRS assessed the social domain in the context of physical activity. Our findings may be an indication that the social domain, despite not being recognized as a core component of several PL frameworks, is an aspect that teachers are interested in reporting on more generally.
Another finding is the absence of measures with psychometric evidence that address elements of the cognitive domain. The authors note however that it may be quite challenging to assess the cognitive domain via teacher proxy-reporting. Indeed, many existing measures for PL (e.g., CAPL) tend to approach its assessment via self-report [31]. Nonetheless, a comprehensive approach to assessing PL is required since the flavour of the concept in itself lies in its holistic nature [100]. Hence, the development of measures that target all domains and elements of the APLF should be prioritized to provide a greater breadth and depth of understanding of the contributors to children's PL.

Recommendations for teacher assessment of physical literacy based on the APLF
Proxy-report measures have the advantage of low cost, ease of administration to large numbers of children, and less administration training when compared to objective measures [25]. This is even more beneficial to teachers who are often faced with time barriers to teaching and assessment [101]. In making recommendations for teachers when choosing instruments for PL assessment, besides highlighting psychometrically sound measures, many aspects of the feasibility of these measures should be well considered. Some of these feasibility aspects include completion time, cost of instrument, copyright, length of the instrument, ease of administration and score calculation [36]. Information on feasibility may become particularly relevant when differentiating between two equally psychometrically sound instruments. The vast majority of measures identified in this review did not report on completion time. However, as feasibility is not considered a measurement property by COSMIN [36], it was beyond the scope of this paper to consider all aspects of the feasibility of the identified tools. We therefore recommend that these aspects receive priority in future studies.
As stated earlier, the current review did not locate a tool that captured all elements and domains of the APLF. For teachers to assess PL comprehensively, there is a need for a tool that includes all 30 elements of the framework. Also, given the limited evidence found for measures in this review, it is difficult to justify the use of the tools identified until further psychometric testing is conducted. This review has found the best evidence for the MASCS, MOQ-T, QACSE-P-SF and TEAF. Teachers who are interested in assessing elements of PL based on its Australian approach could consider utilizing the detailed nine-step decision-making process for choosing a PL assessment highlighted by Barnett et al. [33], in conjunction with Tables 2, 3 and 4 of this review, which provide information on the validity, reliability, and alignment of specific instruments with the APLF. Barnett et al.'s [33] guidance for assessing PL involves identifying the following: (i) element(s) of interest; (ii) teacher interest; (iii) context; (iv) purpose; (v) age group; (vi) structure of observed learning outcomes level; (vii) measurement/assessment method; (viii) number of participants; and (ix) cost. Specifically, step seven encourages teachers to decide on their preferred assessment approach (e.g., objective or subjective measures). As an example, after carefully considering these nine steps in conjunction with the results provided in Tables 2, 3 and 4, a teacher who is interested in assessing the APLF elements agility, strength, muscular endurance, cardiovascular endurance, engagement and enjoyment, confidence, motivation and tactics (Step I) via proxy-reporting (Step VII) could utilize the TEAF. This is because, based on the available psychometric evidence (methodological quality and results quality), the tool seems to be the most promising teacher tool for assessing these elements.
An assessment of this nature by physical educators must be approached with caution, as most tools identified within this review were not contextualized in physical activity (as outlined in Table 4). As such, we have highlighted in Table 4 the tools that assess the PL elements in the context of physical activity.

Strengths and limitations
This systematic review has several strengths. The protocol for the review was registered prospectively. A comprehensive search of seven databases relevant to Sport, Education, Psychology and Health was conducted to identify peer-reviewed articles. Furthermore, a comprehensive search strategy, comprising search filters for finding studies on measurement properties provided by COSMIN as well as search filters relevant to each individual PL element, was utilized to locate studies within the review. Time restrictions were not applied in the search strategy. This strategy identified studies focused on psychometric testing of tools for each PL element, unlike previous reviews which were focused mostly on tools for PL as a whole without critically appraising the