Systematic review of the validity and reliability of consumer-wearable activity trackers

Background Consumer-wearable activity trackers are electronic devices used for monitoring fitness- and other health-related metrics. The purpose of this systematic review was to summarize the evidence for validity and reliability of popular consumer-wearable activity trackers (Fitbit and Jawbone) and their ability to estimate steps, distance, physical activity, energy expenditure, and sleep. Methods Searches included only full-length English language studies published in PubMed, Embase, SPORTDiscus, and Google Scholar through July 31, 2015. Two people reviewed and abstracted each included study. Results In total, 22 studies were included in the review (20 on adults, 2 on youth). For laboratory-based studies using step counting or accelerometer steps, the correlation with tracker-assessed steps was high for both Fitbit and Jawbone (Pearson or intraclass correlation coefficients (CC) ≥0.80). Only one study assessed distance for the Fitbit, finding an over-estimate at slower speeds and an under-estimate at faster speeds. Two field-based studies compared accelerometry-assessed physical activity to the trackers, with one study finding a high correlation (Spearman CC 0.86, Fitbit) while another found a wide range in correlation (intraclass CC 0.36–0.70, Fitbit and Jawbone). Using several different comparison measures (indirect and direct calorimetry, accelerometry, self-report), energy expenditure was more often under-estimated by either tracker. Compared with polysomnography, either tracker in its normal mode setting over-estimated total sleep time and sleep efficiency and under-estimated wake after sleep onset. No studies of intradevice reliability were found. Interdevice reliability was reported in seven studies using the Fitbit, but none for the Jawbone. 
Walking- and running-based Fitbit trials indicated consistently high interdevice reliability for steps (Pearson and intraclass CC 0.76–1.00), distance (intraclass CC 0.90–0.99), and energy expenditure (Pearson and intraclass CC 0.71–0.97). When wearing two Fitbits while sleeping, consistency between the devices was high. Conclusion This systematic review indicated higher validity of steps, few studies on distance and physical activity, and lower validity for energy expenditure and sleep. The evidence reviewed indicated high interdevice reliability for steps, distance, energy expenditure, and sleep for certain Fitbit models. As new activity trackers and features are introduced to the market, documentation of the measurement properties can guide their use in research settings. Electronic supplementary material The online version of this article (doi:10.1186/s12966-015-0314-1) contains supplementary material, which is available to authorized users.


Background
Consumer wearable devices are a popular and growing market for monitoring physical activity, sleep, and other behaviors. The devices helped to grow what is known as the Quantified Self movement, engaging those who wish to track their own personal data to optimize health behaviors [1]. A subset of consumer wearable devices used for monitoring physical activity- and fitness-related metrics are referred to as "activity trackers" or "fitness trackers" [2]. Their popularity has risen as they have become more affordable, unobtrusive, and useful in their application. An activity tracker can provide feedback and offer interactive behavior change tools via a mobile device, base station, or computer for long-term tracking and data storage [3,4]. The trackers enable self-monitoring towards daily or longer-term goals (such as a goal to walk a certain distance over time) and can be used to compare against one's peers or a broader community of users, both of which are advantageous mediators to increasing walking and overall physical activity [3,5].
A national United States (US) survey completed in 2012 indicated 69 % of adults tracked at least one health indicator for themselves, a family member, or friend using a tracking device (such as an activity tracker), paper tracking, or another method [6]. From this survey, 60 % of adults reported tracking weight, diet, or exercise. Those who tracked weight, diet, or exercise were similar by gender, but were more likely to be non-Hispanic White or African American than Hispanic, older rather than younger, and to have at least a college degree rather than less education. Among those who tracked at least one health behavior or condition, 21 % used some form of technology to track the health data. Also among this group, 46 % indicated that tracking changed their overall approach to maintaining their health or the health of the person they cared for, 40 % indicated that it led them to ask a doctor new questions or obtain a second opinion, and 34 % indicated that it affected a decision about how to treat an illness or condition.
Activity trackers are being used not only in the consumer market but also in research studies. Physical activity-related interventions are using activity trackers for self-monitoring, reinforcement, goal-setting, and measurement (examples among adults [4,[7][8][9][10][11] and youth [12]). Before more widespread use of these trackers occurs in research studies, for either intervention or measurement purposes, it is important to establish their validity and reliability.
The purpose of this review was to summarize the evidence for validity and reliability of the most popular consumer-wearable activity trackers. Of the variety of trackers on the market, approximately 3.3 million were sold between April 2013 and March 2014, with 96 % made by Fitbit (67 %), Jawbone (18 %), and Nike (11 %) [2]. Since Nike discontinued the sale of Fuelbands in 2014, our focus for this review was on activity trackers made by Fitbit and Jawbone. Before conducting the review, we searched company websites for documentation on the accuracy of measuring steps, distance, physical activity, energy expenditure, and sleep. The Fitbit company indicated that after multiple internal studies, they had "tuned the accuracy of the Fitbit tracker step counting functionality over hundreds of tests with multiple body types. All Fitbit trackers should be 95-97 % accurate for step counting when worn as recommended" [13]. However, no other information was provided to document the accuracy of steps, nor the other measures we reviewed. The Jawbone company indicated that "while variations in user, terrain, and activity conditions can influence specific calculations, testing has shown UP to provide industry-leading accuracy in tracking activity and sleep" [14]. Similarly, no other details were provided of how accuracy was determined. Therefore, we focused our search on the ability of these trackers to estimate steps, distance, physical activity, energy expenditure, and sleep. For each study included in the review, we also abstracted information on the tracker's feasibility of use.

Literature search
Searches of PubMed, Embase, and SPORTDiscus were conducted to include only full-length studies published in English language journals through July 31, 2015. No start date was imposed in the search. If a publication was available online first before print, we attempted to obtain a copy; thus, some publications were officially published after July 31, 2015 but were available in the databases during our search period. Two separate searches were performed for the two activity trackers.
(1) (Fitbit) AND (validity OR validation OR validate OR comparison OR comparisons OR comparative OR reliability OR accuracy)

(2) (Jawbone) AND monitor AND (validity OR validation OR validate OR comparison OR comparisons OR comparative OR reliability OR accuracy)

The term "monitor" was added to the Jawbone search to reduce the number of dental-related articles retrieved. In addition, we reviewed Google Scholar similarly (same search terms, dates, only English language journals) and the reference lists of included studies for publications missed by the searches. We excluded abstracts (examples [15,16]) and conference proceedings (example [17]). We also excluded studies focused on special populations, such as stroke and traumatic brain injury [18], chronic obstructive pulmonary disease [19], amputation [20], mental illness [21], or older adults in assisted living [22]. One study presented data on apparently healthy older adults without mobility impairments and those of similar ages with reduced mobility; therefore, we reported only on those without mobility impairments [23].

Abstraction and analysis
First, we documented descriptive information on the activity trackers (models, release date, placement, size, weight, and cost) through internet searches conducted from May-July 2015. Second, an abstraction tool used for this review was expanded from a tool initially created by De Vries et al. [24] to document study characteristics and measurement properties of the activity trackers. Specifically, we extracted information on the study population, protocol, statistical analysis, and results related to validity, reliability, and feasibility. We also extracted any information provided by the studies on items entered into the activity tracker user account settings. A primary reviewer extracted details and a second reviewer checked each entry. Discrepancies in coding were resolved by consensus. For any abstracted information that was missing from the publication, we attempted to contact at least one author to obtain the information. Summary tables were created from the abstracted information.
Validity of the activity trackers included [25]:

- Criterion validity: comparing the trackers to a criterion measure of steps, distance traveled, physical activity, energy expenditure, and sleep.
- Construct validity: comparing the trackers to other constructs that should track or correlate positively (convergent validity) or negatively (divergent validity).

Reliability of the activity trackers included [25]:
- Intradevice reliability: test-retest results indicating consistency within the same tracker. This can be conducted in the lab (such as on a shaker table).
- Interdevice reliability: results indicating consistency across the same brand/type of tracker measured at the same time and worn in the same location. This can be assessed during activities performed in the laboratory or while free-living.
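Most of the agreement statistics cited in this review are Pearson or intraclass correlation coefficients (CC). As an illustration only, the sketch below computes one common ICC variant, ICC(2,1) (two-way random effects, absolute agreement, single measure), from a two-way ANOVA decomposition; the reviewed studies do not all report which ICC model they used, so this specific variant and the step counts are assumptions for demonstration.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    `ratings` is an (n subjects) x (k devices) array, e.g. step counts from
    k trackers worn simultaneously by n participants.
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    # Sums of squares from a two-way ANOVA without replication
    ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()  # between subjects
    ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()  # between devices
    ss_err = ((ratings - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)             # mean square, subjects
    msc = ss_cols / (k - 1)             # mean square, devices
    mse = ss_err / ((n - 1) * (k - 1))  # mean square, error
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical daily step counts from two trackers worn by five participants
steps = [[5012, 5100], [7430, 7395], [10250, 10420], [3890, 3901], [8120, 8044]]
icc = icc_2_1(steps)  # close to 1: the two devices track each other tightly
```

Values near 1 indicate near-perfect agreement between devices; ranges like the 0.36–0.70 reported for free-living physical activity [33] indicate much weaker agreement.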
Reliability studies included the Classic worn at the waist [29] and non-dominant wrist [38]; the Ultra worn at the waist/hip [29,36], pants pocket [32], and non-dominant wrist [37]; the One worn at the waist [15,43] and pants pocket [43]; and the Flex worn on the wrist [15]. We found no studies for the Fitbit Force, Surge, Charge, or Charge HR, or the Jawbone UP MOVE, UP2, UP3, or UP4.

Jawbone tracker
The Jawbone company (San Francisco, CA; https://jawbone.com) has offered at least six activity trackers since 2011 (Table 1). Their trackers are worn at the wrist, with the exception of the UP MOVE tracker, which is to be worn at the waist, pocket, or bra. The trackers contain a triaxial accelerometer, collecting data at 30 Hertz, and more recently bioelectrical impedance (for heart rate, respiration, and skin response), as well as both skin and ambient temperatures. Using proprietary algorithms, data from these measures, along with information input by the user, can estimate steps, distance, physical activity, kilocalories, and sleep. Currently, only day-level data are available to the consumer. The following two Jawbone trackers, both designed for the wrist, were explored for validity (Table 2): (1) UP worn on the wrist [33,35,40,42,47,48] and (2) UP24 worn on the wrist [30,45].
No Jawbone trackers were explored for reliability. About half of the studies reported the data entered into the tracker user account [29, 33-35, 39, 41, 43], which was usually age, gender, height, and weight. One study also reported entering stride length [34], another study input handedness and smoking status [35], and another study used event markers to denote when an activity started and ended [39]. A sleep study indicated that they manually switched the band from active to sleep mode in conjunction with lights on/off [48]. Other studies did not report what data were input into the user account [15, 23, 30-32, 36-38, 40, 42, 44-47].

Description of studies
Data collection was primarily conducted in the US, with one or two studies conducted in Australia [33], Canada [36,43,46], the Netherlands [32], Northern Ireland [44], Spain [23], and the United Kingdom [42] (Table 3). Studies usually included an apparently healthy sample and, where reported, almost all participants had a normal body mass index (BMI). Additionally, participants were ≥18 years and mostly younger to middle age, except for one study focusing exclusively on adults ≥60 years [41] and two studies on youth [37,48]. Data were collected between 2010 [38] and 2015 [47].
For studies using accelerometry as the criterion, correlation with tracker steps was also generally high (if reported, the mean correlations were ≥0.80) for the Classic [29], Ultra [29,34], Zip [44], One [33], and UP [33] trackers. However, several studies indicated that the One [42], Flex [15,30], UP [33] (at slow walking speeds [42]), and UP24 [30] under-estimated steps during treadmill walking and running. In contrast, in a study of 21 participants wearing the One for 2 days without restrictions, the One generally over-counted steps compared to an accelerometer (mean absolute difference 779 steps/day) [33]. In one free-living study, the researcher wore both the Ultra and a Yamax pedometer while seated in a car driving on paved roads for about 20 min [36]. During this time no steps were recorded for the Ultra, while the pedometer recorded three steps.

Validity for distance
Only one study explored the validity of distance walked using the treadmill distance as the criterion. Among 30 participants walking or running at five different speeds on the treadmill, distance was over-estimated at slower speeds and under-estimated at faster speeds [43].

Validity for physical activity
The criterion measures for two studies exploring physical activity relied on other accelerometers (ActiGraph GT3X [44] and ActiGraph GT3X+ [33], both using Freedson et al. cutpoints [49], and the BodyMedia SenseWear [33]). Based on 42 participants wearing the Zip for 1 week during waking hours, moderate-to-vigorous physical activity showed almost perfect correlation with an accelerometer (Spearman CC 0.86) [44]. However, in another study of 21 participants wearing the Zip, One, and UP for 2 days without restrictions, compared to an accelerometer the trackers generally over-counted minutes of moderate-to-vigorous physical activity (mean absolute difference 89.8, 58.6, and 18.0 min/day, respectively; intraclass CC 0.36, 0.46, and 0.70, respectively) [33].

Validity for sleep
Five studies explored the validity of sleep measures, four using polysomnography (PSG) [37,38,47,48] and the other using the BodyMedia SenseWear device [33] as the criterion. Compared to PSG, the Classic [38], Ultra [37], and UP [47,48] over-estimated total sleep time and sleep efficiency and under-estimated wake after sleep onset, resulting in high sensitivity and poor specificity. However, for the Ultra when using the sensitive mode setting, total sleep time and sleep efficiency were under-estimated and wake after sleep onset was over-estimated. In a study of 21 adults wearing the One and UP for 2 days without restrictions, compared to an accelerometer the trackers generally over-estimated time in sleep (mean absolute difference 23.0, 22.0 min/day, respectively and intraclass CC 0.90, 0.85, respectively) [33].
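The "high sensitivity, poor specificity" pattern reported against PSG comes from scoring each epoch as sleep or wake on both instruments and cross-tabulating. A minimal sketch with hypothetical minute-level data (not drawn from any reviewed study):

```python
import numpy as np

def sleep_validity(tracker, psg):
    """Epoch-by-epoch comparison of tracker vs. PSG scoring (1 = sleep, 0 = wake).

    Sensitivity = share of PSG sleep epochs the tracker also calls sleep;
    specificity = share of PSG wake epochs the tracker also calls wake.
    """
    tracker, psg = np.asarray(tracker), np.asarray(psg)
    sensitivity = float((tracker[psg == 1] == 1).mean())
    specificity = float((tracker[psg == 0] == 0).mean())
    return sensitivity, specificity

# Hypothetical night: the tracker scores nearly every epoch as sleep, so all
# PSG sleep is detected (high sensitivity) but most wake is missed (low
# specificity), inflating total sleep time and sleep efficiency.
psg     = [1, 1, 1, 1, 0, 1, 1, 0, 0, 1]
tracker = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
sens, spec = sleep_validity(tracker, psg)  # sens = 1.0, spec = 1/3
```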

Reliability
No study reported on the intradevice or interdevice reliability of the Jawbone, or the intradevice reliability of the Fitbit. Seven studies reported on the interdevice reliability of several Fitbit trackers (Table 5), with sample sizes ranging from one [32,36] to 30 [43]. Four studies were laboratory-based focusing solely on locomotion on the treadmill [15,29,36,43], two studies were laboratory-based requiring monitoring with a PSG [37,38], and one study was field-based [32]. For any Fitbit tracker, interdevice reliability was reported from five studies on steps [15,29,32,36,43], one study on distance [43], no studies on physical activity, two studies on energy expenditure [15,29], and two studies on sleep [37,38]. The following sections detail the reliability results for each of the five measures.

Reliability for steps
Comparing two different hip-worn trackers for 16 to 23 participants during treadmill walking and running, the intraclass CC was substantial to almost perfect for steps taken for the Classic (range 0.86-0.91) and the Ultra (range 0.76-0.99) [29]. In another study, during six treadmill walking trials of 20 steps by one researcher, three hip-worn Ultras were compared and all trackers read within 5 % of each other [36]. In a field-based study of 10 hip-worn Ultras all worn by the same person at the same time for 8 days, the median intraclass CC was 0.90 for steps/minute, 1.00 for steps/hour, and 1.00 for steps/day, and comparing across trackers, the maximum difference was only 3.3 % [32].
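Percent differences like the 3.3 % figure above can be computed pairwise across simultaneously worn devices. A small sketch with hypothetical daily totals, assuming the difference is expressed relative to the smaller count of each pair (the reviewed study does not state its exact denominator):

```python
from itertools import combinations

def max_pct_difference(daily_steps):
    """Largest pairwise percent difference in daily step counts across
    devices worn at the same time, relative to the smaller count of each
    pair (an assumption made for this illustration)."""
    return max(abs(a - b) / min(a, b) * 100
               for a, b in combinations(daily_steps, 2))

# Hypothetical daily totals from four trackers worn simultaneously
worst_case = max_pct_difference([10000, 10120, 10330, 10055])  # ≈ 3.3 %
```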
Comparing three hip-worn Ones worn by 23 participants during treadmill walking and running, the Pearson CC between the left and right hip, as well as both right hips, was almost perfect for steps (0.99 and 0.99, respectively) [15]. In another study, 30 participants wore three Ones on their hips and front pants pocket while walking or running at five different speeds on the treadmill and correlation for steps was almost perfect when comparing across trackers (intraclass CC 0.95-1.00) [43]. Lastly, comparing two wrist-worn Flex trackers worn by 23 participants during treadmill walking and running, the Pearson CC between the left and right wrist was almost perfect for steps (0.90) [15].

Reliability for distance
In the only study of reliability assessment of distance, 30 participants wore three Ones on their hips and front pants pocket while walking or running at five different speeds on the treadmill and the correlation was almost perfect for distance measurements across trackers (intraclass CC 0.90-0.99) [43].

Reliability for energy expenditure
Comparing two different hip-worn trackers for 16-23 participants during treadmill walking and running, the intraclass CC was substantial to almost perfect for kilocalories expended for the Classic (range 0.74-0.92) and the Ultra (range 0.91-0.97) [29]. Comparing three hip-worn Ones worn by 23 participants during treadmill walking and running, the Pearson CC between the left and right hip, as well as both right hips, was almost perfect for kilocalories expended (0.97 and 0.96, respectively) [15]. These same participants wore two Flex trackers on their wrists during treadmill walking and running that had almost perfect correlation for kilocalories expended (0.95) [15].

Reliability for sleep
Three participants wore two Classics overnight and recorded almost perfect levels of agreement (96.5-99.1 %) to classify whether the minute-level data was a sleep or wake minute [38]. Similarly, nine youth participants wore two Ultras on their wrist overnight, with data available for seven participants (one pair did not record and one pair had significant discrepancies between readings) [37]. They found similar readings for total sleep time and sleep efficiency for either the normal or sensitive mode.

Feasibility
Feasibility assessment was abstracted for the 22 studies in this review. In total, seven of the 22 studies reported on missing or lost data, with the lab-based studies less likely to report it than the field-based studies. For the lab measurements, Case et al. [30] indicated 1.4 % of data were missing from all tested trackers due to not properly setting them to record steps, Dannecker et al. [31] indicated incomplete data on two of 19 participants, and Gusmer et al. [34] excluded six of 32 participants because ActiGraph step counts were about half of the Ultra step counts (they note this is most likely an ActiGraph failure). For one night of recording in the sleep laboratory, Meltzer et al. [37] reported missing data for 14 of 63 participants to assess validity, due to data not recording for the Ultra (n = 12) and corrupted PSG files (n = 2). For a field-based study of 21 participants during 2 days of wear, some data were lost: moderate-to-vigorous physical activity (n = 7 due to data extraction of the One and the Zip (i.e., certain data were only available for a limited amount of time), n = 1 Zip malfunction), steps (n = 1 Zip malfunction), energy expenditure (n = 1 Zip malfunction), and sleep (n = 2 participant error for the One) [33]. In a second field-based study enrolling adults ≥60 years of age, authors excluded five of 15 participants because they had difficulty with the Classic over the 10-day period (two lost the tracker and three failed to plug it into the wireless base to transmit data) [41]. In a separate field-based study, the Zip was worn over 1 week and five of 47 participants had at least some missing data [44].

Discussion
This review summarized the evidence for validity and reliability of activity trackers, identifying 22 studies published since 2012. While conducting this review, we learned how the trackers can be set up to improve upon off-the-shelf accuracy. Those testing and wearing the trackers are encouraged to consider several tips to potentially improve the trackers' performance (Table 6).

Validity and reliability
From this review, we found the validity (Fitbit and Jawbone) and interdevice reliability (Fitbit) of step counts were generally high, particularly during laboratory-based treadmill tests. When errors were higher, the direction tended to be an under-estimation of steps by the tracker compared to the criterion. This may be particularly problematic at slow walking speeds, similar to findings when testing pedometers [51]. Specifically for steps, if the option is available to set stride length, this should improve accuracy (Table 6). Hip-worn trackers generally performed better at counting steps than trackers worn elsewhere on the body, although Mammen et al. [36] suggest moving the placement from the hip if being worn by an older adult with slower gait speed. Only one study assessed the validity and reliability of distance walked, finding that while reliability was high, distance was over-estimated at slower speeds and under-estimated at faster speeds [43].
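The stride-length tip rests on simple arithmetic: if distance is derived as steps multiplied by stride length, a generic default stride propagates directly into distance error. A toy sketch with hypothetical values (actual tracker algorithms are proprietary and may, for example, vary stride with cadence):

```python
def distance_km(steps, stride_m):
    """Distance estimated as step count times stride length (metres per
    step), converted to kilometres. A deliberate simplification of
    whatever the trackers actually compute."""
    return steps * stride_m / 1000

# Hypothetical walker whose measured stride is 0.78 m vs. a 0.70 m default:
# the same 10,000 steps yield distances differing by 0.8 km.
default_est  = distance_km(10000, 0.70)  # ≈ 7.0 km
personal_est = distance_km(10000, 0.78)  # ≈ 7.8 km
```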
Compared to other accelerometers, one study indicated that the trackers generally over-counted moderate-to-vigorous physical activity, with some large differences found (mean 0.3, 1.0, and 1.5 h/day for the UP, One, and Zip, respectively) [33]. However, another study indicated higher agreement [44]. It may be that the cutpoints [49] used to define moderate-to-vigorous physical activity in both studies were set too high, particularly for older or inactive adults. The reliability of physical activity measurement has not been tested in any study.
From 10 adult studies, we found that although interdevice reliability of energy expenditure was high, the validity of the tracker was lower. When reported, the CC generally ranged from moderate to substantial agreement. Across trackers, many studies indicated that the bias in mis-reporting was often an under-estimation of energy expended. For sleep among youth and adults, despite high reliability, the trackers evaluated generally over-estimated total sleep time [33,37,38,47,48], and when tested against PSG the trackers over-estimated sleep efficiency and under-estimated wake after sleep onset [37,38,47,48]. These findings are similar to other studies of accelerometry, in which the devices are highly sensitive but do not accurately detect periods of wake before and during sleep [52]. However, for one tracker the sensitive mode setting was tested, which under-estimated total sleep time and sleep efficiency and over-estimated wake after sleep onset [37]. Work is needed to improve the validity of sleep measurement with these trackers, particularly when using them for only one or two nights of testing [38]. It may be that newer trackers will perform better if they "learn" when the person is asleep, awake, or napping (Table 6).

Feasibility
Seven of 22 studies reported on missing or lost data, ranging from approximately 1.4 to 22.2 % for laboratory-based studies and 10.6 to 33.3 % for field-based studies. Some of the lost data was attributable to the validation criterion measure and not the trackers, and other lost data were attributable to researcher error and not participant error. Even so, researchers should anticipate data loss based on these findings. Future studies should report missing data and the reason for the loss. One study in this review [44] and others not included [4,8,19,53] report relatively high acceptability in wearing the trackers. This type of information may help with understanding reasons for missing data in field-based studies, particularly if they occur over long time periods.

For the companies
Through this review, we identified three recommendations manufacturers can contribute to enhance the use of the trackers for research. First, the trackers contain firmware, defined as an electronic component with embedded software to control the tracker. Firmware can be updated by the company at any time; when the tracker is synced, the new firmware is installed. These software changes can influence the measurement properties in either positive or negative ways, and can change what might have been previously confirmed or published. Firmware may fix bugs or add features to the tracker, or it may change how variables are calculated. However, many other changes take place, which the consumer cannot detect [54]. As an alternative, the company supporting ActiGraph accelerometers currently makes firmware updates available to the public via their website, allowing researchers to assess those changes for impact on the measurement properties of the accelerometer [55,56]. A similar standard operating procedure would be a beneficial approach for researchers using these trackers.
Second, Jawbone UP3 and UP4 trackers include bioelectric impedance, with corresponding measures of heart rate and respiration, and both skin and ambient temperatures. Additionally, some of the newer Fitbit trackers include GPS (Surge) and optical heart rate sensors (Surge and Charge HR). With these enhancements, the companies seemingly have the tools to determine whether the tracker is being worn (e.g., adherence) and whether it is being worn by the same individual (e.g., one body authentication) [8]. It would be beneficial if the companies derived an indicator of wear and made this available on a minute-by-minute level, corresponding to other available data. Currently, neither the Jawbone nor Fitbit indicate the time worn, which could impact all metrics studied in this review.
Third, the companies could allow access to more data that are collected. At present, the trackers provide users with only a subset of data that is actually collected. The companies control the output available, making the day-level summary variables the easiest to obtain. For example, despite capturing GPS and heart rate on two trackers, Fitbit currently limits the export of these full datasets. Furthermore, the resulting output is derived through proprietary algorithms that may change over time and with new features. In all likelihood, based on the performance of the trackers found in this review, these algorithms are supported through machine learning techniques. At a minimum, it would be helpful for companies to reveal what pieces of data are being used by the trackers to calculate each output measure. For example, Jawbone indicates that height, weight, gender, age, and heart rate, if available, are used to calculate physical activity [14].

Future research
In total, Fitbit has offered at least nine trackers since 2008 and Jawbone at least six trackers since 2011. Until we understand whether the specifications within a company's family of trackers are similar, researchers should confirm the validity and reliability of new trackers. Moreover, an argument could be made to test any new tracker, even if the company confirms similar hardware and software processes. Over time, the trackers offer more features through enhancements (Table 1). Each new tracker feature needs testing for reliability, validity, and usability. Specific types of activities should also be tested, similar to the study by Sasaki et al. [39]. While this review focused on steps, distance, physical activity, energy expenditure, and sleep, other features to test include number of stair flights taken, heart rate, respiration, location via GPS technology, skin temperature, and ambient temperature.