All data collection described herein was approved by the University of North Carolina - Chapel Hill Institutional Review Board and each participant provided written informed consent prior to participation in the study. We evaluated evidence for test-retest reliability and concurrent validity from two independent samples of pregnant women.
PIN3 Physical Activity Questionnaire
A one-week recall questionnaire was developed to evaluate physical activity among pregnant women enrolled in the PIN3 Study. The questionnaire was interviewer administered and designed to capture moderate and vigorous physical activity. It is available as additional file 1. The questionnaire can be contrasted with an interviewer-administered 7-day recall, but instead of recalling day-by-day the participant recalls mode-by-mode. The PIN3 physical activity questionnaire assessed frequency and duration of all moderate and vigorous physical activities the woman participated in, including activity done for recreation, at work, for transportation, childcare, adult care, and both indoor and outdoor household activities. Using recreational activity as an example, the question asked: "In the past week, did you participate in any non-work, recreational activity or exercise, such as walking for exercise, swimming, or dancing, that caused at least some increase in breathing and heart rate?" If the participant responded 'Yes', then the interviewer asked her to list all types of activities, one by one, with the following question: "What type of recreational activities did you do during the past week?" For each activity, the participant reported the number of sessions per week, duration of each session, and the perceived intensity level using the following options: 'fairly light,' 'somewhat hard,' and 'hard or very hard'. These intensity categories corresponded to the Borg scale of perceived exertion. These questions were repeated for work, transportation, child and adult care giving, indoor household, and outdoor household activity.
The scoring of the questionnaire assessed intensity of activity in two ways. First, we used perceived (or relative) intensity, based on the modified Borg scale, to capture the participant's perception of intensity and to derive the number of hours in the past week spent in moderate to vigorous physical activity. We classified activities reported as "somewhat hard" as moderate and activities reported as "hard or very hard" as vigorous. Second, we used absolute intensity, classifying activities using published metabolic equivalent (MET) tables [11, 12] to determine the number of MET-hours per week spent in physical activity. A number of activities reported by women were not listed in the compendium. Thus, two raters determined the MET intensity based on other activities in the compendium; when the two raters disagreed, they met and resolved the disagreement by consensus. The final compendium of activities used for scoring is available elsewhere: http://www.cpc.unc.edu/projects/pin/design_pin3/docs_3/PIN-MET-Table-080207.pdf. To determine moderate activity using absolute intensity, we defined moderate intensity both as 3 to 6 METs, a definition often used for adults, and as 4.8 to 7.1 METs, the estimated values for moderate activity specific to adults 20 to 39 years of age.
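The two scoring approaches can be illustrated with a minimal sketch. The field names, the function names, and the treatment of the MET range boundaries (inclusive at both ends, with anything above the upper bound counted as vigorous) are assumptions made for illustration and are not taken from the study's scoring program.

```python
from dataclasses import dataclass

# Perceived-intensity labels as worded in the questionnaire.
MODERATE_LABEL = "somewhat hard"
VIGOROUS_LABEL = "hard or very hard"


@dataclass
class ActivityReport:
    """One reported activity. Field names are hypothetical."""
    name: str
    sessions_per_week: float
    minutes_per_session: float
    perceived_intensity: str  # 'fairly light', 'somewhat hard', 'hard or very hard'
    met_value: float          # assigned by the scorer from the MET table


def weekly_hours(report):
    """Hours per week spent in one reported activity."""
    return report.sessions_per_week * report.minutes_per_session / 60.0


def score_week(reports, moderate_range=(3.0, 6.0)):
    """Summarize a week by perceived and by absolute intensity.

    moderate_range is the MET band treated as moderate; boundary handling
    (inclusive bounds) is an assumption of this sketch.
    """
    low, high = moderate_range
    totals = {
        "perceived_mv_hours": 0.0,       # 'somewhat hard' or harder
        "met_hours": 0.0,                # MET value x hours, all activities
        "absolute_moderate_hours": 0.0,
        "absolute_vigorous_hours": 0.0,
    }
    for r in reports:
        hours = weekly_hours(r)
        totals["met_hours"] += r.met_value * hours
        if r.perceived_intensity in (MODERATE_LABEL, VIGOROUS_LABEL):
            totals["perceived_mv_hours"] += hours
        if low <= r.met_value <= high:
            totals["absolute_moderate_hours"] += hours
        elif r.met_value > high:
            totals["absolute_vigorous_hours"] += hours
    return totals
```

Calling `score_week(reports, moderate_range=(4.8, 7.1))` applies the age-specific definition of moderate intensity instead of the general adult one.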
The questionnaire took on average 10-20 minutes to administer. Intra- and inter-interviewer quality control measures, such as expert review of taped interviews, were established to ensure that interviewers were asking questions reliably and systematically.
Concurrent-Related Validity Assessment
For this portion of the study, pregnant women from central North Carolina were recruited by placing flyers in local clinics and screened over the telephone. Women were not enrolled if they were non-English speaking, were under the age of 18 years, were carrying multiple gestations, did not have a telephone from which they could complete the phone interviews, or were more than 28 weeks' gestation at the time of the screening call. The women wore an accelerometer for one week, kept a daily structured diary, and then completed a questionnaire that included the one-week recall of the PIN3 physical activity questionnaire. Participants were paid $50 at the completion of the study; later in the study an electromagnetic field monitor was added to the protocol, and participants were then paid $75.
PIN3 Diary Card
The structured diary card was developed as a way of assessing the evidence for concurrent related validity of the PIN3 physical activity questionnaire. The card and the instructions can be found in additional file 2. The goal of the card was to collect all moderate to vigorous physical activities the women performed in the past week. Women were requested to fill out the two-page card on a daily basis and to mail it back to us at the end of the one week period. The activities listed were the ones most commonly reported among the initial PIN3 participants and space was provided to list other types of activities. For each day, participants filled in time spent and perceived exertion (based on the same intensity descriptions used in the questionnaire) for each activity. On the diary card, if a woman performed an activity at two different intensities, such as gardening, one intensity level and time could be recorded in one row and the activity along with the different intensity and time could be recorded as an "other activity". In practice, this did not happen and so we assume women recorded the intensity level that best represented that activity.
The scoring of the diary card was as similar as possible to the scoring of the PIN3 questionnaire. Hours per week in physical activity were calculated by adding up the time reported for each specific activity by intensity level. All activities were assigned a MET value from the compendium [11, 12] in order to determine MET-hours of activity.
For the accelerometer, we used the Manufacturing Technology Inc. (MTI) Actigraph accelerometer model #7164 (Fort Walton Beach, FL). It is a small, light-weight uniaxial accelerometer that measures accelerations in the range of 0.05 to 2 G's with a band limited frequency of 0.25-2.5 Hertz. Validity [20–22] and reliability [23–26] of the monitor as an indicator for physical activity have been demonstrated among adults.
Women were fitted with the accelerometer to be worn during waking hours for 7 days on a belt or clip-on pouch over their right hip at the iliac crest. They were asked to remove the monitor for sleeping, bathing, or swimming. Written and verbal instructions, as well as a phone number to call with questions, were provided. Participants mailed the monitor back to the study offices at the conclusion of the 7 days.
Accelerometer data were collected with 1-minute epochs, and the monitors were regularly calibrated throughout the study. Spurious counts were flagged, assessed, and set to missing if determined to be invalid. We defined non-wear time as a period of 20 minutes or more of zeros. We defined a standard measurement day as the length of time in which ≥70% of the sample was wearing the accelerometer, separately for weekdays and weekends, similar to others. We classified participants as having complete accelerometry data if they had nonmissing counts over at least 70% of a standard measurement day for all 7 days.
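The non-wear and completeness rules above can be sketched as follows. This is an illustrative reading, not the study's processing code: the function names are hypothetical, missing minutes are represented as `None`, and the input to `is_valid_day` is assumed to be the minute-by-minute counts already restricted to the standard measurement day.

```python
def nonwear_mask(counts, min_zero_run=20):
    """Return a list of booleans, True where a minute belongs to a run of
    at least `min_zero_run` consecutive zero counts (non-wear time)."""
    mask = [False] * len(counts)
    run_start = None
    # A trailing sentinel (None) closes a zero run that reaches the end.
    for i, c in enumerate(list(counts) + [None]):
        if c == 0:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_zero_run:
                for j in range(run_start, i):
                    mask[j] = True
            run_start = None
    return mask


def is_valid_day(counts, min_fraction=0.70, min_zero_run=20):
    """True if worn, nonmissing minutes cover at least `min_fraction`
    of the standard measurement day represented by `counts`."""
    mask = nonwear_mask(counts, min_zero_run)
    worn = sum(1 for c, nonwear in zip(counts, mask)
               if c is not None and not nonwear)
    return worn >= min_fraction * len(counts)


def has_complete_data(days):
    """Complete accelerometry data: all 7 days pass the daily check."""
    return len(days) == 7 and all(is_valid_day(d) for d in days)
```

Under this reading, a run of 19 zeros is kept as wear time, while a run of 20 or more is excluded before the 70% criterion is applied.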
Ignoring missing values in the accelerometer file can cause a biased estimate of the true level of physical activity. Therefore, missing data were filled in using a multiple imputation inference strategy (SAS PROC MI) with an expectation-maximization algorithm and a Markov chain Monte Carlo method. Considering the wearing time of our participants, non-missing accelerometer data falling into the daily time window of 5 am to midnight were selected as reference data for the imputation. Indicators for ten blocks of the time period (5 am to midnight) and for weekday versus weekend were used in the imputation procedure. In order to represent a random sample of the missing values, ten multiply imputed data sets were created by the multiple imputation procedure, and each imputation contained minute-by-minute daily activity count data from 5 am to midnight.
We used the accelerometer data in several ways. First, using total counts per week, we evaluated the raw data provided by the accelerometer without imposing cutpoint decisions. Second, activity was calculated as hours per week (using count thresholds) and counts per week spent in differing intensities (e.g., light, moderate, vigorous). A number of calibration studies of adults provide count thresholds (i.e., cutpoints) for moderate and vigorous activity. We applied cutpoints from three sources: Freedson et al, Swartz et al, and summary cutpoints from National Health and Nutrition Examination Survey (NHANES) data, calculated originally by taking the weighted average of cutpoints from Freedson et al, Yngve et al, Leenders et al, and Brage et al.
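A minimal sketch of how 1-minute epochs might be summarized into total counts, hours, and counts by intensity is given below. The function name and output layout are hypothetical. The default thresholds are the commonly cited Freedson adult values (1952 and 5725 counts per minute) and are included as an assumption, since the text does not list the numeric cutpoints; other cutpoint sets can be passed in as arguments.

```python
def summarize_week(minute_counts, moderate_cut=1952, vigorous_cut=5725):
    """Summarize one week of 1-minute accelerometer counts.

    minute_counts: iterable of counts per minute; None marks a missing minute.
    moderate_cut, vigorous_cut: lower count thresholds for each intensity
    (defaults are an assumption, see the lead-in).
    """
    levels = ("below_moderate", "moderate", "vigorous")
    minutes = {level: 0 for level in levels}
    counts = {level: 0 for level in levels}
    total_counts = 0
    for c in minute_counts:
        if c is None:  # missing minutes contribute nothing
            continue
        if c >= vigorous_cut:
            level = "vigorous"
        elif c >= moderate_cut:
            level = "moderate"
        else:
            level = "below_moderate"
        minutes[level] += 1
        counts[level] += c
        total_counts += c
    return {
        "total_counts": total_counts,  # raw summary, no cutpoints imposed
        "hours": {level: minutes[level] / 60.0 for level in levels},
        "counts": counts,
    }
```

Total counts per week do not depend on the thresholds, which is why they serve as the cutpoint-free summary; only the hours and counts by intensity change when a different cutpoint set is supplied.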
Test-Retest Reliability Assessment
For this portion of the study, we relied on a sample of PIN3 participants. The PIN3 Study recruited pregnant women less than 20 weeks' gestation seeking prenatal care at clinics associated with the University of North Carolina Hospitals. Trained staff identified women through review of all medical charts of new prenatal patients. Women were not enrolled if they were non-English speaking, under the age of 16 years, carrying multiple gestations, not planning to continue care or deliver at the study hospital, or did not have a telephone from which they could complete the phone interviews. Selected women were asked to participate in two telephone interviews, at 17-22 and 27-30 weeks' gestation. The study website http://www.cpc.unc.edu/pin provides greater detail on the protocols and measures.
We evaluated the physical activity questionnaire test-retest reliability among pregnant women within 48 hours of their telephone interview completion as part of the PIN3 Study. We enrolled women purposefully, with approximately equal numbers in six strata: (i) completing the first (17-22 weeks) or second (27-30 weeks) telephone interview and (ii) self-report of either being currently employed in a strenuous job, active at leisure, or inactive at either work or leisure. During the second interview, women were asked to recall the same week that the first interview was conducted, so that recall periods matched. Women were paid $5 for participation in each interview.
During the first interview for both the validity and reliability samples, women were asked to report their education, current work status, race, general health, and parity (live plus still births). For the reliability sample only, age was reported at the interview and the medical record was abstracted and included self-reported weight and measured height for the determination of pre-pregnancy body mass index (BMI). BMI values were grouped using the Institute of Medicine recommendations for pregnant women, in effect during that time period, into low (<19.8 kg/m2) or normal weight (19.8-<26.0 kg/m2), overweight (26.0-<30.0 kg/m2), and obese (≥30.0 kg/m2). For the validity sample, age was obtained during recruitment and calculated from each woman's date of birth as her age on the interview date.
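The BMI grouping can be written out as a short sketch. The function name and the metric input units are assumptions for illustration; the category bounds are those quoted above.

```python
def prepregnancy_bmi_category(weight_kg, height_m):
    """Classify pre-pregnancy BMI (kg/m2) using the bounds quoted in the text."""
    bmi = weight_kg / height_m ** 2
    if bmi < 19.8:
        return "low"
    if bmi < 26.0:
        return "normal"
    if bmi < 30.0:
        return "overweight"
    return "obese"
```

For example, a woman reporting 65 kg at a measured height of 1.70 m has a BMI of about 22.5 kg/m2 and falls in the normal weight group.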
All analyses were conducted using SAS version 9.1 (Cary, NC). Evidence for validity was assessed using Spearman correlation coefficients (SCC) with 95% confidence intervals (CI) comparing either the diary or accelerometer results to the questionnaire. We conducted analyses comparing the accelerometer to the questionnaire using the full sample, the imputed sample, and the smaller sample defined by complete accelerometry data. Bland-Altman plots were used to assess agreement between physical activity measurements from the questionnaire with either the accelerometer or the diary among the full sample. On the plots, the x-axis is the average of two measurements, while the y-axis is the difference between the two measurements. Horizontal lines represent the upper and lower limits of agreement and the mean difference of the two measurements. In addition, Pitman's test of differences in the variances was calculated for each comparison.
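The quantities plotted in a Bland-Altman comparison can be sketched as below. The analyses in this study were run in SAS; this Python sketch is only illustrative. It assumes the conventional limits of agreement (mean difference plus or minus 1.96 standard deviations of the differences), which the text does not state explicitly, and it computes the statistic underlying Pitman's test as the Pearson correlation between the pairwise sums and differences, which is zero when the two variances are equal. Function names are hypothetical.

```python
import math
from statistics import mean, stdev


def bland_altman(a, b):
    """Return plot coordinates, mean difference, and limits of agreement."""
    diffs = [x - y for x, y in zip(a, b)]         # y-axis values
    avgs = [(x + y) / 2.0 for x, y in zip(a, b)]  # x-axis values
    md = mean(diffs)
    sd = stdev(diffs)                             # sample SD (n - 1)
    return {
        "averages": avgs,
        "differences": diffs,
        "mean_diff": md,
        "lower_limit": md - 1.96 * sd,            # assumed 1.96 SD convention
        "upper_limit": md + 1.96 * sd,
    }


def pitman_r(a, b):
    """Pearson correlation between (a + b) and (a - b).

    A value near zero is consistent with equal variances of a and b.
    """
    sums = [x + y for x, y in zip(a, b)]
    diffs = [x - y for x, y in zip(a, b)]
    ms, md = mean(sums), mean(diffs)
    cov = sum((s - ms) * (d - md) for s, d in zip(sums, diffs))
    var_s = sum((s - ms) ** 2 for s in sums)
    var_d = sum((d - md) ** 2 for d in diffs)
    return cov / math.sqrt(var_s * var_d)
```

The sign of `pitman_r` indicates which measurement has the larger variance: a positive value means the first series is more variable than the second.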
Test-retest reliability was assessed using intraclass correlation coefficients (ICC) with corresponding 95% CI. ICC were calculated using a two-way analysis of variance model and conceptualized as the proportion of the total variance explained by between-participant variance. Bland-Altman plots were used to compare questionnaire results at the two time periods. Stratified analyses were performed to determine whether reliability differed by gestational age at interview date, race, education, age, parity, or pre-pregnancy BMI. As a rough guide, we followed the ratings suggested by Landis and Koch for agreement level: 0-0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, and 0.81-1.00 almost perfect. Each physical activity entry was screened and outliers were pairwise deleted from the analysis.
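As an illustration of an ICC derived from a two-way analysis of variance, the sketch below computes the single-measure, absolute-agreement form often labeled ICC(2,1). The text does not specify which ICC form was used, so this choice is an assumption, as are the function names. The labeling helper applies the Landis and Koch bands; values below 0 are not covered by those bands and are grouped with "slight" here for simplicity.

```python
def icc_two_way(scores):
    """ICC(2,1) from a participants x administrations table.

    scores: list of rows, one per participant, each with one value per
    administration (two values for a test-retest design).
    """
    n = len(scores)      # participants
    k = len(scores[0])   # administrations
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                # between-participant
    ms_cols = ss_cols / (k - 1)                # between-administration
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def agreement_label(icc):
    """Map an ICC to the Landis and Koch agreement rating."""
    if icc <= 0.20:
        return "slight"  # also catches negative values (assumption)
    if icc <= 0.40:
        return "fair"
    if icc <= 0.60:
        return "moderate"
    if icc <= 0.80:
        return "substantial"
    return "almost perfect"
```

With this form, a constant shift between the two interviews lowers the ICC even when participants keep the same rank order, because systematic differences between administrations count against agreement.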