A critical evaluation of systematic reviews assessing the effect of chronic physical activity on academic achievement, cognition and the brain in children and adolescents: a systematic review

Background International and national committees have started to evaluate the evidence for the effects of physical activity on neurocognitive health in childhood and adolescence to inform policy. Despite an increasing body of evidence, such reports have shown mixed conclusions. We aimed to critically evaluate and synthesise the evidence for the effects of chronic physical activity on academic achievement, cognitive performance and the brain in children and adolescents in order to guide future research and inform policy.
Methods MedLine, Embase, PsycINFO, Cochrane Library, Web of Science, and ERIC electronic databases were searched from inception to February 6th, 2019. Articles were considered eligible for inclusion if they were systematic reviews with or without meta-analysis, published in peer-reviewed (English) journals. Reviews had to be on school-aged children and/or adolescents and had to report on the effects of chronic physical activity or exercise interventions, with cognitive markers, academic achievement or brain markers as outcomes. Reviews were selected independently by two authors and data were extracted using a pre-designed data extraction template. The quality of reviews was assessed using AMSTAR-2 criteria.
Results Of 908 retrieved, non-duplicated articles, 19 systematic reviews met inclusion criteria. One high-quality review reported inconsistent evidence for physical activity-related effects on cognitive and academic performance in obese or overweight children and adolescents. Eighteen (critically) low-quality reviews presented mixed favourable and null effects, with meta-analyses showing small effect sizes (0.1–0.3) and high heterogeneity. Low-quality reviews suggested physical activity-related brain changes, but lacked an interpretation of these findings. Systematic reviews varied widely in their evidence synthesis, rarely took intervention characteristics (e.g. dose), intervention fidelity or study quality into account, and suspected publication bias. Reviews consistently reported that there is a lack of high-quality studies, of studies that include brain imaging outcomes, and of studies that include adolescents or are conducted in South American and African countries.
Conclusions Inconsistent evidence exists for chronic physical activity-related effects on cognitive, academic and brain outcomes. The field needs to refocus its efforts towards improving study quality, transparency of reporting and dissemination, and is urged to differentiate between intervention characteristics for its findings to have a meaningful impact on policy.


Introduction
Physical inactivity is an important risk factor for chronic diseases (e.g. cardiovascular disease, depression), obesity, and early deaths [1][2][3][4], placing a high economic burden on society [5]. Conversely, higher levels of physical activity (PA) have been associated with lower risk of mortality, beneficial mental and cardiovascular health outcomes [6][7][8], and possibly improvements in cognitive and brain health [9]. Increasing PA has therefore been considered a low-cost strategy for global health improvement [10,11], and has received great interest from scientific and public health communities, as demonstrated by the 2012 and 2016 Lancet series on PA (2012 series: https://www.thelancet.com/series/physical-activity, and 2016 series: https://www.thelancet.com/series/physical-activity-2016), the United States (US) 2018 Physical Activity Guidelines [12] and the 2018 World Health Organization (WHO) Global Action Plan for Physical Activity [13]. Specifically, both the US and WHO guidelines recommend 150 min of moderate intensity PA (MPA) or 75 min of vigorous intensity PA (VPA) per week for adults and at least 60 min of moderate-to-vigorous intensity PA (MVPA) per day for children and adolescents (5-18 years), as well as muscle-strengthening activities. However, globally approximately 20-30% of adults [4,11] and the majority of youth, including 80% of adolescents [11,14], do not meet recommended levels of PA.
Childhood and adolescence are marked by rapid social, psychological and neurobiological development and provide the foundation for future health [15][16][17]. Consequently, scientists have begun to examine the effects of PA on cognition and on brain structure and function in this population, using neuroimaging tools such as magnetic resonance imaging (MRI) and electroencephalography (EEG) [9]. The findings of these studies have been summarised extensively in systematic reviews, which have been evaluated by the 2018 Physical Activity Guidelines Advisory Committee (PAGAC) to inform policy [12]. The report suggested moderate evidence for PA-related beneficial effects on cognitive performance during pre-adolescence, but inconsistent evidence during adolescence [12]. These conclusions were recently updated to also suggest beneficial effects on brain structure and function during pre-adolescence and limited but promising evidence for PA on cognition in adolescence [9]. In contrast, the United Kingdom (UK) equivalent of the PAGAC concluded there was inconclusive evidence for effects of PA on cognitive and academic performance, but beneficial effects on maths performance [18].
Both of these reports were based on conclusions from a small, non-overlapping set of systematic reviews (US: 9 and UK: 2) and did not provide insight into review quality, nor quality of the primary studies on which the reviews' conclusions were based. A recent systematic review of reviews considered findings from 25 reviews and concluded that the evidence supports a causal link between PA and cognition in young people [19]. However, the authors did not incorporate review quality in their evidence synthesis and based their conclusions on a subset of reviews that included a mixture of observational and interventional evidence. To seek clarity among the contrasting reports, provide specific recommendations for the field, and inform policy, this systematic review of reviews aimed to synthesise the evidence for the effect of PA or exercise on brain structure and function, academic and cognitive performance in childhood and adolescence. In this review we aimed to include overall reporting of bias and heterogeneity in the literature, the quality of the primary studies and reporting of intervention fidelity, as well as the consistency of conclusions, limitations and recommendations across the literature.
examined for additional relevant articles. Data sources and (medical subject headings) search terms are provided in Additional File 2. In line with the a priori defined selection criteria (PROSPERO, ID: CRD42019124472), systematic reviews with or without meta-analysis that were published in English with clearly defined inclusion criteria were included if they reported on the effects of chronic PA interventions (i.e. more than one PA session, over a set period), including randomised controlled trials (RCT), quasi-experimental studies, controlled and pre-post designs, on cognitive, academic and/or brain MRI outcomes in school-aged children or adolescents. 
Systematic reviews that contained a mixture of observational and interventional evidence, that reported on a single bout of acute PA (i.e. a single session) or cardiovascular fitness only, or included case reports, were excluded.
The inclusion and exclusion criteria were adapted post-protocol to exclude systematic reviews that also included observational studies because the findings from interventional and observational studies were generally combined in the evidence synthesis. It was deemed too subjective to extract conclusions regarding the effects of intervention studies only. Studies that distinguished between multicomponent (e.g. PA and diet) and PA-only interventions were included, but only results from PA-only interventions were considered in this systematic review.
Systematic reviews that also included findings of acute PA interventions were considered for inclusion only if they selectively reported on the effects of chronic PA interventions. Following data extraction, it became clear that no single systematic review had aimed to include MRI studies only. Instead, two out of four systematic reviews included studies that used either MRI or EEG. We therefore decided to adapt our inclusion criteria to also include brain EEG markers as an outcome measure. Because the search criteria included the term MRI, we post-hoc searched the databases (October 1st, 2019) for systematic reviews that included EEG studies. No additional systematic reviews were found.

Review selection and data extraction
Following removal of duplicates, two reviewers (TW, WW) independently screened all titles and abstracts using Abstrackr [21]. Short-listed full-text reviews were then independently assessed using the inclusion and exclusion criteria. Any disagreements between the authors were discussed with two other authors (CS and CF) and resolved by consensus. A data extraction form was piloted independently by two authors (TW, WW) and adjusted to ensure it captured all relevant data, including: year of publication, type of review (systematic review with or without meta-analysis), review methodology (aim, in/exclusion criteria, number of studies included/excluded, database search, quality and bias assessment method), characteristics of included studies (design, number of included participants, type of participants and countries), outcome measures, reporting on intervention fidelity, limitations and recommendations. A single author extracted the data from selected studies using the data extraction template; the extracted data were verified by a second author and disagreements were resolved by consensus.

Quality assessment
The methodological quality of all reviews was assessed using the updated AMSTAR checklist, AMSTAR-2 [22]. Two reviewers (TW, WW) independently scored the selected reviews. The quality of each review is reflected by an overall confidence rating, which is determined by an evaluation of non-critical and critical domains (seven critical items: preregistration, literature search, justification for excluding studies, risk of bias assessment, appropriateness of meta-analytical methods, risk of bias in synthesis of results, publication bias). A lack of addressing one or multiple critical domains resulted in a, respectively, low or critically-low confidence rating. If no critical flaws were present, the presence of non-critical weaknesses determined whether the review received a high (no weaknesses) or moderate (one or more weaknesses) confidence rating. While it is not recommended to combine individual item ratings into an overall score [22], we provided the total score to merely acknowledge that a gradient exists between reviews in their study quality.
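The decision rule described above can be sketched as a small function. This is an illustrative sketch only: the function name `confidence_rating` and its two count arguments are assumptions introduced here, and the thresholds follow the wording of this section rather than the AMSTAR-2 guidance document itself [22].

```python
def confidence_rating(critical_flaws: int, noncritical_weaknesses: int) -> str:
    """Map counts of unaddressed AMSTAR-2 domains to an overall confidence rating.

    Hypothetical helper that mirrors the rule as worded in this section:
    - more than one critical flaw     -> "critically low"
    - exactly one critical flaw       -> "low"
    - no critical flaws, >=1 weakness -> "moderate"
    - no critical flaws, no weakness  -> "high"
    """
    if critical_flaws < 0 or noncritical_weaknesses < 0:
        raise ValueError("counts must be non-negative")
    if critical_flaws > 1:
        return "critically low"
    if critical_flaws == 1:
        return "low"
    if noncritical_weaknesses >= 1:
        return "moderate"
    return "high"


if __name__ == "__main__":
    # Made-up counts for three imaginary reviews (not data from this study).
    for flaws, weaknesses in [(0, 0), (1, 2), (3, 4)]:
        print(flaws, weaknesses, confidence_rating(flaws, weaknesses))
```

Critical flaws take precedence over non-critical weaknesses in this sketch, so a review with a single critical flaw is rated low regardless of how many non-critical items it satisfies.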
The 19 systematic reviews included a total of 118 unique primary studies, of which 84 reported on cognitive outcomes, 53 on academic outcomes and nine on brain outcomes. The research was predominantly conducted in developed countries (Fig. 2, an overview of countries per outcome is provided in Additional file 5), especially in the USA and Australia. None of the primary studies were conducted in countries in South America or Africa, with the exception of South Africa. The quality of the primary studies was generally regarded to be low to moderate (details of assessments per review are provided in Additional file 6).

Review quality
Of 19 systematic reviews, one received a high confidence rating, one a low confidence rating and 17 a critically low confidence rating (i.e. did not address multiple critical domains, as defined in the quality assessment section above; see Table 1, and details in Additional file 7). While 18 reviews performed a comprehensive search (95%) and provided study characteristics (95%), only five (26%) had an a priori design, three (16%) included an overview of excluded studies, 14 assessed risk of bias (72%), seven of eight (88%) performed a meta-analysis with generally appropriate methods (i.e. used accepted statistical techniques for combination of results and exploration of causes of heterogeneity), and three (of 15 that performed a quality assessment: 20%) considered quality information in their evidence synthesis.

Academic achievement
Twelve systematic reviews evaluated academic outcomes (Table 2), one of which synthesised evidence from overweight/obese children and adolescents [31]. Reviews included between one [30] and 26 primary studies [23] and 19 studies (36%) were included in more than one review (Additional file 8). Four reviews included a meta-analysis, and reviews synthesised findings of studies in terms of overall academic achievement, its sub-domains (e.g. maths, language, reading), or both. Across all twelve reviews, six concluded that PA benefits academic performance [23,24,26,29,32,33], and the remaining six concluded that there was mixed or inconclusive evidence for PA-related academic changes (Tables 1 and 2) [25,27,28,30,31,34].
Only one review received a high confidence rating [31]. Among three RCTs of obese/overweight children, this review found no evidence for PA-related improvements in maths, reading or language performance. Sensitivity analyses for risk of bias (e.g. attrition) and cluster RCT designs were performed, but were less meaningful due to a small number of studies.
Among the 11 low- to critically low-quality reviews (i.e. lack of addressing respectively one versus multiple critical domains), six reviews found (mainly) positive effects of PA on academic performance and five presented mixed evidence and conclusions. All three meta-analyses showed a small beneficial effect of PA on overall academic performance (effect sizes are presented in Table 2), but only one (of two) found significant evidence for small positive effects on its sub-domains [23]. This meta-analysis included 26 intervention studies (RCT and quasi-experimental) and found evidence for beneficial effects of PA on maths, reading and composite scores, albeit with substantial heterogeneity among studies (I² > 50%, apart from reading). Sensitivity analyses showed changes in effect sizes of reading-, language-related skills and composite scores after removal of studies, and a drop in effect size for maths performance (from d = 0.21 to d = 0.12) upon exclusion of low quality quasi-experimental studies. Among the other two meta-analyses, one included three studies [24], whereas the other included a large number of non-peer reviewed dissertations [26].
Subgroup analyses of meta-analyses (Additional file 9) further suggest that academic performance may benefit most from PA during curricular PE [23] and cognitively challenging PA [24,26]. Intervention duration does not seem to be an important moderator of PA-related academic changes [23,24,26].
In summary, evidence from systematic reviews is inconsistent, with conclusions suggesting positive as well as mixed (inconclusive) effects of PA on overall academic achievement and its sub-domains. One high confidence review on overweight and obese children did not report beneficial effects of PA on academic performance [31].

Cognitive function
Thirteen systematic reviews evaluated cognitive outcomes, one of which synthesised evidence from children with ADHD [36] and two evaluated findings from overweight/obese children [31,39]. Reviews included between one [29] and 36 intervention studies [27,38], and of 83 published studies, 42 (50%) were included in more than one review (Additional file 8). Six reviews included a meta-analysis and reviews synthesised findings of studies in terms of overall cognitive performance, and/or various sub-domains (e.g. executive functions, memory).
One review received a high confidence rating [31]. Among seven RCTs of obese/overweight children, this review reported a significant PA-related benefit on nonverbal memory and composite executive functions based on findings from single studies, but not inhibition control, attention, working memory, cognitive flexibility or visuo-spatial abilities.
Among 12 (of 13) low- [25] to critically low-quality reviews, six reviews found positive effects of PA on cognitive performance and/or its sub-domains in healthy young people [24,29,33,35,38,40], and four presented mixed evidence and conclusions [25,27,30,37]. Two meta-analyses showed a positive effect of PA on overall cognitive performance [24,35], two of three showed PA-related benefits on overall executive functions [24,37,38] and inhibition [24,38,40], and two of three meta-analyses reported non-significant effect sizes for planning or higher-level cognitive functions [24,37]. Moreover, a meta-analysis additionally reported beneficial effects of PA on non-executive functions and working memory, but not selective attention or cognitive flexibility [38]. Despite some positive results, the effect sizes are generally small (0.1-0.3) and suffer from substantial heterogeneity. Moreover, only two of five meta-analyses performed a sensitivity analysis [35,38], showing a lower effect size for overall cognitive functions following removal of outliers [35] and a change in effect size for working memory upon removal of studies [38]. There is some evidence for beneficial effects of PA on cognitive functions in children with overweight and obesity [39] and with ADHD [36]. Subgroup analyses of meta-analyses further suggest that cognitive performance may benefit most from PA during curricular PE [38], with mixed evidence for enhanced (i.e. quantitative increase in PA) or enriched PA (i.e. qualitative manipulation of PA, e.g. increasing coordinative task requirements; Additional file 9) [35,38]. Intervention duration does not seem to be an important moderator of PA-related cognitive changes [24,37,40], although one review reported a negative relationship with working memory [38]. No significant relationships were found between study quality and heterogeneity [38] or effect sizes [37].
In summary, evidence from predominantly low-quality systematic reviews is inconsistent, with conclusions suggesting both positive and mixed (inconclusive) effects of PA on overall cognitive performance and its subdomains. A single high-quality review [31] showed that PA may benefit executive functions and non-verbal memory of obese/overweight children, but evidence is based on findings from single studies. A comparison between reviews is complicated by the considerable variability in reporting of findings across systematic reviews (e.g. overall cognition vs sub-domains, choice of subdomains).

Brain structure and function
While four systematic reviews reported on the effects of PA on the brain [27,33,39,41], only three of these systematically explored exercise-related brain changes [27,39,41], the findings of which are discussed here. Reviews included five or six primary studies, and three (of 10 unique publications) were included in all reviews (Additional file 8) [42][43][44]. All reviews were classified as critically low quality. Across three reviews, one concluded that PA is beneficial for the brain [39], one reported mixed favourable and null effects [27] and one reported a lack of available evidence (Table 4) [41].
Among healthy children and adolescents, Gunnell et al. [27] stratified findings from six RCTs by brain function and structure and showed some evidence for changes in activation and resting-state synchrony following a PA intervention, but not blood flow and inconsistent changes in white matter structure. Lubans et al. [41] tabulated brain changes from six studies to explore neurobiological mechanisms of cognitive changes, and showed increases, decreases or no changes following PA interventions in widespread brain areas (as well as cognitive changes). The authors did not synthesise brain findings and reported a lack of overlap between studies (e.g. imaging methods, brain regions). Bustamante et al. [39] focused on children and adolescents with overweight/obesity and interpreted PA-related brain changes of four high-quality studies as beneficial, but their conclusions lack anatomical specificity.
In summary, only a small number of studies have examined PA-related brain changes and while PA-related effects have been reported, findings are inconsistent with little methodological or anatomical overlap between studies.
The AMSTAR-2 confidence rating (critically low, low, medium or high) is reported, followed by the overall score. The overall score is added to acknowledge the inter-review variability in quality, but is not used in the synthesis of findings as recommended by Shea et al. [22]
c This review also includes acute PA studies which have been excluded from this count

Intervention fidelity reporting
Reporting of fidelity is crucial for accurate interpretation of intervention results [45,46]. This is particularly the case in behavioural interventions where there is substantial heterogeneity in intervention characteristics. Only two (of 19) reviews reported on intervention fidelity in their results section [28,32] and noted a lack of fidelity reporting in studies. Several other reviews discussed the lack and importance of reporting fidelity metrics in their discussion [25,31,38]. None of the systematic reviews took intervention fidelity into account when summarising the findings of effects of PA on cognitive, academic or brain outcomes.

Limitations and recommendations

Limitations
An overview of limitations included in systematic reviews is provided in Additional file 10. A (non-exhaustive) narrative synthesis is provided here, intended to identify limitations that were common to the discussion section of at least two reviews. One of the most reported limitations is the presence of high heterogeneity across studies: in designs of PA interventions (duration, frequency, resources provided, delivery [32,40]), the (appropriate) control groups [25,30,40], and the measurement tools that were used [29,30,41]. Reviews often reported a lack of detailed reporting of interventions (e.g. duration, intensity, compliance, resources, delivery), assessments [30,32,38], and potential moderators [24,34], as well as a lack of valid measurement tools [29,30]. These within-study limitations may contribute to a general lack of high-quality studies [26,35,38,39]. Other limitations that were frequently noted are the lack of studies in adolescents [25,30,31] and of studies covering various sub-domains of cognition/academic achievement [24,35,37], the presence of relatively small samples [29,30] and samples being predominantly from high-income countries [31,32,34]. While some reviews tested for the presence of publication bias, this too was often reported as a limitation [23,25,40].

Recommendations
In addition to resolving the above limitations, authors recommended including long term follow-up assessments [28,29,31,34,38], exploring the effect of different PA characteristics (intensity, duration) [24,37], including brain imaging [25,27,31], monitoring the PA dose that participants receive [25,37], and exploring the influence of effect modifiers (e.g. sex, ethnicity and socioeconomic status [31,34]). Furthermore, authors suggested reporting effect sizes and standardized regression coefficients [25,41], focusing on the qualitative aspects of PA [24,25,35,39], exploring after-school PA interventions [23,38] and conducting transitional work, examining what interventions are most effective for implementation in schools [23,39].
Mura et al. [33] | Children (3-18 years) | 10/16 studies showed an improvement in academic performance (maths (s = 4), reading (s = 1), overall academic achievement (s = 5)), in 6/16 it did not worsen academic performance | NA
Haapala [28] | Children and adolescents
Non-executive functions: 7/7 found improvements; of 4 studies that included multiple intervention groups, two suggested that increases in duration and intensity were associated with greater improvements. Executive functions: 29/29 found improvements; of 11 studies that included multiple intervention groups, three did not find differences between the groups. Meta-cognition: 15/15 found improvements; of 6 studies that included multiple intervention groups, none found differences in improvements

Key findings
The aim of this review was to critically and systematically evaluate systematic reviews that examined the effects of chronic PA interventions on cognitive, academic and brain outcomes in children and adolescents. Of 19 systematic reviews, only one received a high confidence rating and reported inconsistent evidence for PA-related effects on academic performance and favourable effects on executive functions in overweight/obese children, albeit based on results from single studies. Reviews with a (critically) low confidence rating presented mixed favourable or null effects. Reviews that evaluated brain outcomes suggested PA-related brain changes, but with little anatomical or methodological overlap which has complicated the synthesis and interpretation of findings. In general, the quality of the majority of systematic reviews is considered to be critically low with high heterogeneity between systematic reviews (e.g. number of included studies, presentation of findings). Furthermore, only three systematic reviews took the quality of primary studies into account when synthesising evidence and there is a general lack of reporting of intervention fidelity. Systematic reviews and meta-analyses consistently stated that the field suffers from high heterogeneity between primary studies (e.g. in the design, intervention, outcome measures and reporting) and that high-quality studies are required.
In the following sections we will discuss these key findings and argue that the field would benefit from improvements in study quality, reporting and dissemination.

Table 3 Cognitive outcomes: findings from systematic reviews and meta-analyses (ordered by quality rating) (Continued)
Authors | Population | Systematic review results | Meta-analysis results c
showed no difference and 1/9 worse intelligence; two of these studies found dose-response relationships, with high dose PA performing better than low-dose PA or control; specific cognitive skills improved in almost all studies (6 studies)
Vazou et al. [35] | Typically developing children and adolescents (4-16 years) | Aerobic only (s = 7): Significant cognitive outcomes: planning (s = 1), creativity (s = 2), working memory and spatial memory span (s = 4), attentional accuracy and spatial inattention (s = 1), cued recall memory (s = 1) and mathematics fluency (s = 1). Motor skills (s = 4): Improvements in working memory (s = 1), spatial processing/math/reading/concentration (s = 1 study), lower error on attentional task (s = 1), mixed effects (s = 1). Cognitively engaging PA (s = 2): Improved planning (s = 1), and spatial memory, but not verbal memory (s = 1)

Academic-, cognitive-and brain outcomes
The findings on cognitive outcomes are largely in line with conclusions from the UK Expert Working Group Working Paper on children and young people [18]. In contrast, the US 2018 PA Guidelines [12] suggested moderate evidence for PA-related improvements in cognitive and academic performance (among 5-13 year olds), and Biddle et al. [19] claimed a causal association of PA with cognitive functioning and indicated less clear results for academic achievement. The discrepancy in conclusions is likely due to the set of reviews considered for inclusion and the strategy for evidence synthesis. That is, conclusions of both the UK Working Group and US guidelines were based on a small number of systematic reviews (Additional file 11) without mention of review quality or reasons for selecting this (sub)set of reviews. Biddle et al. [19] based their conclusions on a large number of reviews (n = 25), including some that combined observational and interventional evidence, without reference to review quality or sub-domains of cognitive or academic performance.
It is clear that the selection of reviews and strategy for synthesis could impact on conclusions, particularly if the synthesis is based on study authors' conclusions, which commonly emphasise positive findings (e.g. [24,33,38]). Future reports would benefit from a nuanced overview, with conclusions that take into account (objective) findings presented in reviews' results sections.

Study quality and methodological considerations
We observed high variability in quality and methods across systematic reviews. In particular, systematic reviews often lacked an a priori design, an overview of excluded studies, an assessment of publication bias (unless a meta-analysis was included), and did not consider quality information in their evidence synthesis. While the majority of reviews described the sample (e.g. size, age, sex) of the included studies, they did not discuss baseline PA levels, which is important for assessing whether the sample is representative of the general population and for the interpretation of reported effects (e.g. greater effects may be expected for a sample of low active individuals [47,48]). Only a minority of reviews reported on intervention fidelity, which is crucial if one wishes to understand whether observed effects can be ascribed to the intervention or other environmental Overweight or obese children and/or adolescents Benefits for neurologic outcomes following PA in high quality studies (RCT, s = 4, 2/4 brain function, 2/4 brain structure), but all from the same group; results from a quasi-experimental study (s = 1) suggest a neural benefit, but the study is of low rigor and suffers from confounding NA Lubans et al. [41] b Children (7-11 years) 5/6 studies reported significant brain changes (2/6 using EEG, 4/6 using MRI one of which explored brain structure), but there was little overlap between studies NA Abbreviations: n = number of participants, NA not assessed, PA = physical activity, RCT = randomised controlled-trial, s = study/studies a The authors also included findings on changes in brain-derived neurotrophic factor, which are not measured by EEG or MRI and therefore excluded from this This study examined brain changes as potential mediators of cognitive changes, rather than exploring brain changes per se factors (e.g. PE enjoyment). Furthermore, reviews hardly distinguished between PA interventions (e.g. dose or type) in presenting results. 
This practice is useful if the goal is to examine whether any form of PA has an effect on cognitive, academic or brain outcomes, yet provides little insight into the appropriate dose or PA type (e.g. enhanced / quantitative or qualitative / enriched PA manipulations) that may benefit outcomes most. Similarly, reviews varied substantially in whether they presented findings by sub-domains or overall cognitive-or academic performance, as well as the definition of sub-domains. If conclusions are to be drawn regarding specific PA effects from findings across reviews, a more granular and consistent presentation of results is required. Finally, reviews included multiple publications stemming from the same study (and sample), which is rarely considered (but see e.g. [31,39]). For instance, half of the neuroimaging publications included the same sample of overweight children, and care should be taken in generalising these results. A subset of reviews included meta-analyses, the appropriateness of which has been questioned due to the heterogeneity between studies [27]. Among those that did include a meta-analyses, we observed variability in whether the metaanalysis was pre-registered, in the number of included studies , and whether one or multiple studies on the same sample were included and appropriately accounted for (e.g. Cochrane handbook 5.1, section 16.5.4 [49]). Importantly, meta-analyses rarely took study quality or study design ([cluster] RCT [50] and non-RCT) into account. This practice could bias effect sizes and requires separate reporting [22]. In addition, not all meta-analyses used inverse variance weighting and only a few meta-analyses fully explored the high observed heterogeneity using meta-regression and / or subgroup analysis [51]. Finally, small sample sizes of primary studies may present with larger effect sizes and thereby affect meta-analyses results [52,53]. 
Although meta-analyses are a useful statistical tool to quantitatively synthesise findings [54,55], care should be taken in their implementation.
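To make the notions of inverse variance weighting and heterogeneity concrete, the following is a minimal sketch in Python, not the method used by any of the included reviews. It assumes a DerSimonian-Laird random-effects model, a normal-approximation 95% confidence interval, and that each study contributes one effect size with a known sampling variance. The function name and the example values are hypothetical and chosen for illustration only.

```python
import math


def random_effects_meta(effects, variances):
    """Pool effect sizes with inverse-variance weights (DerSimonian-Laird).

    effects   -- one effect size per study (e.g. standardised mean differences)
    variances -- the sampling variance of each effect size
    """
    k = len(effects)
    # Fixed-effect weights: more precise studies (smaller variance) count more.
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)

    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1

    # I^2: percentage of total variation attributed to between-study heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # Between-study variance tau^2 (DerSimonian-Laird moment estimator).
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights add tau^2 to each study's sampling variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    return {
        "pooled": pooled,
        "ci95": (pooled - 1.96 * se, pooled + 1.96 * se),
        "Q": q,
        "I2": i2,
        "tau2": tau2,
    }


if __name__ == "__main__":
    # Hypothetical effect sizes and variances, for illustration only.
    result = random_effects_meta([0.10, 0.35, 0.22, 0.05], [0.02, 0.05, 0.01, 0.03])
    print(result)
```

A high I² in such an analysis signals that the pooled estimate summarises studies that may not share a common effect, which is why subgroup analysis or meta-regression is recommended before interpreting it.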

Research gaps
Several gaps and forms of bias were identified in the evidence base. These would need to be addressed before findings about the effects of PA on cognition, academic performance and the brain can be generalised to a wider population and impact on policy.

Sample bias
The majority of primary studies, and by extension systematic reviews, included samples of children (7–12 years old), with far fewer studies on adolescents [30]. While observational evidence in adolescents exists [56,57], high-quality RCTs are needed to bridge the gap.

Geographical bias
The majority of studies were conducted in the USA, and more generally in developed countries. RCTs in less-developed countries, particularly those in Africa and South America (Fig. 2), but also in Asia, are needed to explore whether findings can be generalised to the wider population. The "Cogni-Action" cross-over randomised trial of an acute PA intervention in Chile is one such attempt to close the geographical gap [58].

Publication bias
Systematic reviews indicated the presence of publication bias: studies that report significant, often positive, effects of PA interventions are more likely to be published. For instance, all brain imaging studies included at least one significant outcome. To counteract publication bias, PA researchers should publish non-significant findings and consider pre-registering their analysis plans. Moreover, initiatives such as registered reports, journals soliciting negative results, and funder-supported journals that encourage open practices are warranted.

Brain imaging
There was a lack of studies that included brain imaging, in particular T1-weighted (structural) MRI. Consequently, the evidence for PA-related brain changes is based on a small number of studies, 50% of which included the same sample of overweight children. High variability in outcome measures and a lack of replication further complicate the interpretation of brain findings. Surprisingly, none of the studies included in the reviews examined PA effects on hippocampal metrics (e.g. volume), despite overwhelming evidence from animal studies [59] and various reports in human adults [60]. To uncover neurobiological mechanisms of exercise, brain imaging (MRI and/or EEG) should be considered as an outcome measure in future intervention studies, as in the Cogni-Action project [58] and Fit to Study [61].

Moderators
There is little understanding of the influence of potential moderators, such as sex, socioeconomic status, ethnicity and genetic background, on PA-related cognitive, academic and brain outcomes. Understanding the effect of such moderators could help develop targeted PA interventions for sub-groups of children who may benefit most.

Recommendations
To address research gaps and improve the quality of both primary studies and systematic reviews, the field is increasingly recognising that current research practice needs to evolve. We therefore provide recommendations, including references to resources, in an attempt to facilitate improvements in evidence generation and synthesis.
Future reviews are encouraged to screen for existing or pre-registered systematic reviews (e.g. on PROSPERO: https://www.crd.york.ac.uk/prospero/ or Open Science Framework, OSF: https://osf.io) on the same topic. At least seven new systematic reviews have been published [62–68] since February 2019, the majority of which reported on the same studies that have been included in previous reviews. Moreover, researchers are encouraged to pre-register their protocol (e.g. on PROSPERO), follow the PRISMA guidelines and consult online resources (e.g. Cochrane handbook [69]), particularly if meta-analyses are conducted [54,55,70]. We encourage researchers to think carefully about their research question and inclusion criteria (e.g. type of design, PA interventions and outcomes), which will determine the scope of the review. To further our understanding of PA effects, we recommend that researchers accurately describe the sample of the included studies (including baseline PA levels) and summarise findings by cognitive or academic sub-domains and PA characteristics, taking into account the quality of the studies, with conclusions providing a balanced summary of findings. For instance, researchers could consider focussing on the high(est) quality studies and, in addition to summarising findings by cognitive sub-domain, show the relationship between dose parameters and outcome measures (e.g. effect sizes; for an example see ref [66]). If a meta-analysis is appropriate, researchers are encouraged to consult guidelines to ensure that the correct statistical methods are used (e.g. [54,55,70]), that heterogeneity and publication bias are assessed, and that sub-group and/or sensitivity analyses are conducted (e.g. for cluster-RCT effects [50,64]).
While a systematic review of reviews allows for an evaluation of the field and provides recommendations for future research and policy [71], it does not allow for a detailed discussion of primary studies. Based on systematic reviews, however, we found that the quality of primary studies is low to moderate. To improve study quality, researchers are advised to consult the protocol (Standard Protocol Items: Recommendations for Interventional Trials, SPIRIT) [72] and reporting (Consolidated Standards of Reporting Trials, CONSORT) [73,74] guidelines for trials early on in the design process, and to pre-register the study on an independent registry (e.g. ClinicalTrials.gov). An adequate sampling strategy should be followed to ensure an unbiased sample and generalisability of findings [75], and a power analysis ought to be performed a priori to determine an adequate sample size. PA researchers are encouraged to carefully consider the PA intervention, including the type, the dose and the control group(s) [25], and to implement strategies to measure adherence (e.g. using actigraphy or heart rate monitors). Moreover, the validity of outcome measures should be ensured. Researchers are encouraged to follow the CONSORT guidelines for reporting of RCT outcomes and the Template for Intervention Description and Replication (TIDieR) checklist for intervention reporting [76].
Point estimates, confidence intervals and effect sizes should be provided, and, in addition to an intention-to-treat analysis, instrumental variable or compliance-average causal effect approaches may be used in cases of non-compliance [77,78]. Missing data should be reported and dealt with appropriately (e.g. using multiple imputation) [79,80].
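As an illustration of an a priori power analysis, the sketch below estimates the sample size needed per group to detect a standardised mean difference d in a two-arm trial, and inflates it by the design effect 1 + (m − 1) × ICC when whole clusters (e.g. classes) of size m are randomised. It assumes a two-sided test, equal group sizes and a normal approximation, which slightly underestimates the sample size given by t-based calculations. The function names, cluster size and intracluster correlation (ICC) are hypothetical values chosen for illustration and are not taken from the reviewed studies.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Participants per group to detect effect size d (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)


def design_effect(cluster_size, icc):
    """Variance inflation when clusters of equal size are randomised."""
    return 1 + (cluster_size - 1) * icc


def n_per_group_clustered(d, cluster_size, icc, alpha=0.05, power=0.80):
    """Participants per group for a cluster-randomised design."""
    n_individual = n_per_group(d, alpha, power)
    return math.ceil(n_individual * design_effect(cluster_size, icc))


if __name__ == "__main__":
    # Hypothetical scenario: small effect, classes of 25 pupils, ICC of 0.05.
    print("individually randomised:", n_per_group(0.2))
    print("cluster randomised:", n_per_group_clustered(0.2, cluster_size=25, icc=0.05))
```

Because the effect sizes reported in meta-analyses in this field are small (0.1–0.3), calculations of this kind indicate that adequately powered trials require samples far larger than those of many primary studies, particularly under cluster randomisation.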

Strengths and limitations
Strengths of this review include its evidence synthesis by review quality and its focus on interventional evidence only, thereby only considering studies from which causality can be inferred. At the same time, relevant systematic reviews may have been excluded. In addition, the AMSTAR-2 criteria merely consider methodological quality, not whether the findings were synthesised appropriately. For instance, Martin et al. [31] based their conclusions for beneficial effects of PA on findings from a single study and Singh et al. [54] used vote-counting to summarise findings, the validity of which has been questioned [54]. None of the reviews took intervention characteristics (dose, type) into account when summarising results of primary studies. Therefore, no conclusions could be drawn regarding the potential differential effects of PA characteristics on outcomes. Reviews on acute PA effects were not considered in this review and may benefit from an independent evaluation [81]. Furthermore, systematic reviews used a variety of quality assessments which prevented a direct comparison of quality of primary studies. We further acknowledge that our summary of limitations and recommendations given by systematic reviews may be influenced by our own interpretations, yet its aim was to provide an overview of common discussion points. Finally, differences in reporting of findings across systematic reviews limited the extent to which evidence at the level of individual cognitive and academic domains was summarised.

Conclusion
Based on a single high-quality review and 18 (critically) low-quality reviews, we found inconsistent evidence for PA-related effects on cognitive and academic performance in children and adolescents. Low-quality reviews suggest PA-related brain changes, but lack an interpretation of these findings. If this field is to inform policy, high-quality systematic reviews and primary studies are needed that provide insight into the effect of dose and PA characteristics on (domains of) cognitive, academic and brain outcomes, in particular in adolescents and children in developing countries.