Reliability and validity testing was undertaken in four phases: translation, cognitive testing, and two iterations of field testing. First, the original version of the ALPHA questionnaire was translated into Dutch, French and German, and the translations underwent cognitive testing. Next, a first field test was conducted in three countries. An expert meeting was then organised to discuss the results, before a second, smaller field test was conducted to assess the modified questionnaire.
Translation and cognitive testing
The English questionnaire (the source) was translated into Dutch, French and German using a standard protocol based on the Eurostat guidelines. To guide the translation process, conceptual cards were included after each question in the English version. These cards contained brief notes explaining the format of the question and the underlying concept to be measured. For each language, two translators, both native speakers familiar with the topic, worked independently. They read and translated the conceptual cards into the target language before translating the questions. After translation, the two translators and a reviewer discussed any translation problems until consensus was reached.
After the translation process, cognitive testing was conducted using cognitive interviewing with at least five persons for each language. Respondents were asked to think aloud while processing each question and deciding how to answer it. If something was not clear, the interviewer asked follow-up questions to start a discussion.
Through the cognitive testing process, questions that were not clear or comprehensible were identified, discussed with the research team and rephrased.
Field testing I
Participants and procedures
Participants were recruited in three countries (Belgium, the UK and France) between October 2008 and January 2009. To ensure some variance in the measured characteristics (e.g. population density), the participants within each country were drawn from distinct areas (and thus different built environments). In Belgium, a random sample was drawn in three different neighbourhoods (town, outskirts of town and village/countryside). In each neighbourhood, letters with information about the study were distributed by post. One week after the information letter was mailed, potential participants were visited at home and asked whether they would participate. In the UK, participants who had been randomly selected from 10 areas of an English city for a previous study were contacted by telephone, and appointments were arranged to visit willing individuals. In France, a convenience sample of adults living in the city centre and suburbs of Paris was recruited. Inclusion criteria were: aged 20-65 years, literate in the language of the questionnaire (Dutch, English or French respectively), resident at the current address for at least two months, and without a physical disability that would prevent or hamper walking or cycling. The final sample consisted of 190 participants: 60 from Belgium, 64 from the UK and 66 from France.
To assess test-retest stability, participants completed both forms of the ALPHA questionnaire twice in the presence of a researcher, with an interval of one to two weeks. This is a standard time frame in test-retest studies: it is long enough that respondents are unlikely to remember their answers from the first assessment, but short enough to minimise potential changes in physical activity behaviour. To avoid order effects, participants in each study centre were randomly assigned to two groups: Group 1 completed the short version of the questionnaire first (at both the first and second assessment), followed by the 49-item version, and Group 2 completed the 49-item version first (at both the first and second assessment), followed by the short version.
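The counterbalanced order assignment described above can be illustrated with a minimal sketch. The function name, the fixed seed and the near-equal split are assumptions made for illustration; the text does not state how the randomisation was carried out or whether group sizes were balanced.

```python
import random

def assign_order(participant_ids, seed=None):
    """Randomly split one study centre's participants into two order groups.

    Assumption: groups are made as equal in size as possible.
    """
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = (len(shuffled) + 1) // 2
    return {
        "short_first": sorted(shuffled[:half]),  # Group 1: short version first
        "long_first": sorted(shuffled[half:]),   # Group 2: 49-item version first
    }

# Hypothetical participant IDs 1..60 for one centre.
groups = assign_order(range(1, 61), seed=1)
```

Each participant keeps the same order at both assessments, so the assignment is made once per participant.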
To assess predictive validity, physical activity behaviour was measured by accelerometry and by the long International Physical Activity Questionnaire (IPAQ), last 7 days version. Participants were asked to wear an accelerometer on the hip during all waking hours for 7 consecutive days following the first visit. Accelerometer recordings were collected at the second visit, at which time the researcher administered the long IPAQ (last 7 days) by interview. The interview version was preferred to the self-administered version of the IPAQ because of the previously reported tendency towards over-reporting of physical activity. The length of time needed to complete each questionnaire at the first visit was recorded. No incentive was provided for participation.
The development of the initial ALPHA environmental questionnaire has been described elsewhere. The instrument included questions on: types of residences in your neighbourhood (3 items), distance to local facilities (8 items), walking or cycle infrastructure in your neighbourhood (4 items), maintenance of infrastructure in your neighbourhood (3 items), neighbourhood safety (6 items), how pleasant is your neighbourhood (4 items), cycling and walking network (4 items), home environment (6 items), and workplace or study environment (11 items). For the short form of the questionnaire the number of items was reduced to eleven, with at least one item included from each theme. In both versions neighbourhood was defined as "...the area ALL around your home that you could walk to in 10-15 minutes - approx 1.5 km" (or "1 mile" in the UK context).
Self-reported physical activity was assessed with the long IPAQ, last 7 days version (http://www.ipaq.ki.se/ipaq.htm). This instrument asks about physical activity over the last 7 days by intensity category and in different contexts (physical activity as transport, at work or study, at home and in leisure time); it has been shown to be reliable and valid.
The MTI Actigraph accelerometer model 7164 was used in Belgium and France, and the Actigraph GT1M was used in the UK. In all cases an epoch time of one minute was used to provide an objective measure of habitual physical activity (over 7 days).
Finally, participants were asked to provide information on their age, height, weight, sex, ethnicity, living situation, educational attainment, occupational status and living environment.
Adverse items of the environmental questionnaire were recoded and sum scores for each scale were calculated.
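As an illustration of the recoding and summing step, the following minimal sketch reverses selected items and sums a scale. The item names, the 1-4 response range and the choice of which items are reversed are hypothetical; the actual coding scheme is defined by the questionnaire itself.

```python
def reverse_code(value, low=1, high=4):
    """Mirror a response on its scale, e.g. 1 <-> 4 and 2 <-> 3 on a 1-4 scale."""
    return low + high - value

def scale_sum(responses, reversed_items, low=1, high=4):
    """Sum one scale after reverse-coding the items listed in reversed_items."""
    total = 0
    for item, value in responses.items():
        if item in reversed_items:
            value = reverse_code(value, low, high)
        total += value
    return total

# Hypothetical responses to three items of a safety scale.
answers = {"safe_crossings": 3, "crime_day": 4, "traffic_danger": 1}
score = scale_sum(answers, reversed_items={"crime_day", "traffic_danger"})
```

After recoding, a higher sum score points in the same direction for every item of the scale.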
For the long IPAQ (last 7 days), each activity was expressed in minutes/week by multiplying its frequency (days/week) by its duration (minutes/day). An index for each domain was calculated by summing all physical activities undertaken in that context (work, domestic, transport and leisure). A 'total moderate-intensity and vigorous-intensity physical activity' index was computed by summing all reported physical activities undertaken at moderate and vigorous intensity across the four domains.
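The arithmetic of this data reduction can be sketched as follows. The activity records are invented, and the sketch assumes that activities labelled "walking" are not counted in the moderate-and-vigorous index; the text does not specify how walking was treated, so this may differ from the scoring actually applied.

```python
def minutes_per_week(days_per_week, minutes_per_day):
    """Weekly minutes = frequency (days/week) x duration (minutes/day)."""
    return days_per_week * minutes_per_day

# Hypothetical records: (domain, intensity, days/week, minutes/day).
activities = [
    ("work", "vigorous", 2, 30),
    ("transport", "walking", 5, 20),
    ("domestic", "moderate", 3, 40),
    ("leisure", "moderate", 1, 60),
    ("leisure", "vigorous", 2, 45),
]

domain_index = {}   # minutes/week per domain
mvpa_total = 0      # moderate + vigorous minutes/week across domains
for domain, intensity, days, minutes in activities:
    weekly = minutes_per_week(days, minutes)
    domain_index[domain] = domain_index.get(domain, 0) + weekly
    # Assumption: only 'moderate' and 'vigorous' records enter the MVPA index.
    if intensity in ("moderate", "vigorous"):
        mvpa_total += weekly
```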
Accelerometer data were downloaded by placing the accelerometer into a reader interface unit (RIU) and using specific software (RIU256.exe). The data were then analysed with a custom-written program (MAHUFFE.exe, available from http://www.mrc-epid.cam.ac.uk). Accelerometer data were included in the analysis if the device had been worn on at least 4 days (including at least one weekend day), with a minimum of 10 hours of recording time on weekdays and 8 hours on weekend days; where wearing was interrupted for more than 60 minutes during a day, the relevant hours were excluded. To calculate time spent in low-intensity (LPA), moderate-intensity (MPA) and vigorous-intensity (VPA) physical activity, Freedson's cut-offs were used (<1952 counts per minute for LPA, 1952-5724 counts per minute for MPA and >5724 counts per minute for VPA).
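The inclusion rule and the intensity cut-offs stated above can be expressed as a minimal sketch. This is not the MAHUFFE implementation; the function names and the input format are assumptions, and the daily wear hours are taken as already computed after removing interruptions of more than 60 minutes.

```python
def classify_minute(counts_per_minute):
    """Assign one 1-minute epoch to an intensity band using the stated cut-offs."""
    if counts_per_minute < 1952:
        return "LPA"
    if counts_per_minute <= 5724:
        return "MPA"
    return "VPA"

def is_valid_recording(days):
    """Check the inclusion rule for one participant.

    days: list of (is_weekend_day, wear_hours) tuples, one per recorded day.
    A day counts if wear time is >= 10 h on a weekday or >= 8 h on a weekend day.
    """
    valid = [(weekend, hours) for weekend, hours in days
             if hours >= (8 if weekend else 10)]
    has_weekend_day = any(weekend for weekend, _ in valid)
    return len(valid) >= 4 and has_weekend_day

# Hypothetical week: three valid weekdays, one short weekday, one valid weekend day.
week = [(False, 11.0), (False, 10.0), (False, 12.5), (False, 6.0), (True, 8.0)]
included = is_valid_recording(week)
```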
Cronbach's alphas were calculated to assess the internal consistency of each scale of the environmental questionnaire; values >0.70 were considered good. Intraclass correlation coefficients (ICC; for sum scores or items on 5-point scales) were used to estimate the test-retest stability of the scores across the two administrations. ICC estimates >0.75 were considered to indicate good reliability, 0.50-0.75 moderate reliability and <0.50 poor reliability. The proportion of agreement, i.e. the proportion of occasions on which individuals gave the same score at both administrations, was also calculated. A proportion of agreement above 0.70 was considered high.
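The three statistics can be sketched in plain Python as follows. The analyses in the study were run in SPSS; in particular, the ICC form used is not stated in the text, so the one-way random-effects, single-measure ICC shown here is an assumption chosen for simplicity, and the function names are illustrative.

```python
from statistics import mean, variance

def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (same items in each row)."""
    k = len(rows[0])
    item_vars = [variance([r[i] for r in rows]) for i in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def icc_oneway(test, retest):
    """One-way random-effects, single-measure ICC for two administrations."""
    n, k = len(test), 2
    pairs = list(zip(test, retest))
    grand = mean(list(test) + list(retest))
    ms_between = k * sum((mean(p) - grand) ** 2 for p in pairs) / (n - 1)
    ms_within = sum((x - mean(p)) ** 2 for p in pairs for x in p) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

def proportion_agreement(test, retest):
    """Share of respondents giving the identical score at both administrations."""
    return sum(a == b for a, b in zip(test, retest)) / len(test)

def icc_label(icc):
    """Interpretation thresholds as stated in the text."""
    if icc > 0.75:
        return "good"
    if icc >= 0.50:
        return "moderate"
    return "poor"

# Hypothetical scores for three respondents at test and retest.
icc = icc_oneway([1, 2, 3], [2, 3, 4])
```

A different ICC form (for example a two-way model) would give a different value for the same data, which is why the form should be reported alongside the estimate.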
Pearson correlations between environmental variables (sum scores) and accelerometer data, and between environmental variables and IPAQ measurements, were calculated to assess predictive validity.
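For reference, the Pearson coefficient used here can be computed as in the following minimal sketch; the variable names and data are hypothetical and serve only to show the formula.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical data: a scale sum score and accelerometer MVPA minutes/day.
scale_scores = [8, 10, 12, 9, 14]
mvpa_minutes = [20, 35, 30, 25, 50]
r = pearson_r(scale_scores, mvpa_minutes)
```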
All analyses were performed using SPSS 15.0 software (SPSS Inc., Chicago, IL, USA).
International expert meeting
After the first field test, an international expert meeting was organised in February 2009 to discuss the results (a list of all experts can be found in additional file 1). Items with lower reliability or validity scores were discussed and rephrased until consensus was reached.
Field testing II
Participants and procedures
For the second, smaller field test a new sample was recruited in three countries (Belgium, the UK and Austria) between April and May 2009, using the same inclusion criteria as in the first field test. In Belgium, a random sample was recruited in three different neighbourhoods (town, outskirts of town, and village/countryside - all different from those in the first field test) using the same approach as in the first field test. In the UK and Austria, convenience samples of university colleagues, students and other associates participated. The final sample consisted of 166 participants: 60 from Belgium, 57 from the UK and 49 from Austria.
In this second round of testing only test-retest stability was assessed for both versions, in a similar way to the first field testing.
An adapted version of the ALPHA environmental questionnaire was used. This instrument can be found in additional file 2 (49-item version) and additional file 3 (short version), and on the website of the International Physical Activity and the Environment Network (IPEN), http://www.ipenproject.org. The same themes as in the original version were used, but some items were changed. For example, the answer categories of the short version were changed from a four-point scale (strongly disagree to strongly agree) to a two-point scale (yes/no). The neighbourhood definition was also rephrased, reducing the area around the home to "approximately one kilometer or half a mile" instead of 1.5 km or 1 mile. All changes are detailed in additional file 4. No other measures were included in the second field test.
Data reduction and statistical analysis
Adverse items of the environmental questionnaire were recoded and sum scores for each scale were calculated. Cronbach's alphas were calculated to assess the internal consistency of each scale of the environmental questionnaire. Intraclass correlation coefficients (for sum scores or items on 5-point scales) and proportion of agreement (for separate items) were used to estimate the test-retest stability of the scores across the two administrations.