Commercial app use linked with sustained physical activity in two Canadian provinces: a 12-month quasi-experimental study

Background Top tier commercial physical activity apps rarely undergo peer-reviewed evaluation. Even fewer are assessed beyond six months, the theoretical threshold for behaviour maintenance. The purpose of this study was to examine whether a multi-component commercial app rewarding users with digital incentives for walking was associated with an increase in physical activity over one year. Methods This 12-month quasi-experimental study was conducted in two Canadian provinces (n = 39,113 participants). Following a two-week baseline period, participants earned digital incentives ($0.04 CAD/day) every day they reached a personalized daily step goal. Mixed-effects models estimated changes in weekly mean daily step count between the baseline period and the last two recorded weeks. Models were fit for several engagement groups and separately by baseline physical activity status within engagement groups. Results Nearly half of participants (43%) were categorized as physically inactive at baseline (fewer than 5000 daily steps), and 60% engaged with the app for at least six months [‘Regular’ (24–51 weeks of step data) or ‘Committed’ sub-groups (52 weeks)]. Weekly mean daily step count increased for physically inactive users regardless of engagement status (P < .0001). The increase was largest for ‘Regular’ and ‘Committed’ participants—1215 and 1821 steps/day, respectively. For physically active participants, step count increases were only observed in the ‘Committed’ sub-group (P < .0001). Effect sizes were modest-to-medium depending on the sub-group analyzed. Conclusions A commercial app providing small but immediate digital incentives for individualized goals was associated with an increased weekly mean daily step count on a population-scale over one year. This effect was more evident for physically inactive and more engaged participants.


Introduction
The delivery of psychological and public health interventions through technology is becoming an increasingly common way to prevent illness and promote health. Smartphones and tablets are well positioned to play a role in such interventions as they offer functionalities and opportunities for personalization through the widespread availability of a range of mobile phone apps [1]. Apps play an important role in the management of illness and are a low-cost, easy avenue for the promotion of health and well-being [2][3][4]. In 2017, there were 325,000 health apps across the 2 most common app platforms: Google Play and iTunes [5]. This includes apps that have been developed to assist patients in the management of a range of diseases and conditions, including diabetes mellitus type 1 or 2 [1,6], pain management [7,8], the promotion of increased physical activity [9,10], improve nutrition [11,12], and the promotion of improved mental health [13,14].
Although research investigating mobile phone-based technology over recent years has shown that short message service (SMS) text message-based interventions can have a positive impact on sexual health knowledge [15] and that most health interventions can benefit from some form of phone-based activity [16], research into the effectiveness of health behavior change through apps is in its infancy, and there is no clear consensus in the research around which specific features of apps can assist in behavior change. Content analyses of apps have identified some features that may promote health behavior change in apps for smoking cessation [17], alcohol reduction [18,19], and physical activity [20,21]. However, most apps only contain a few features that could be considered to have the potential to change behavior [22]. Features that have been found to promote health behavior change include the ability to provide direct advice about behavior change and track behaviors [17] or provide information on the consequences of continuing with the behavior [19]. Conversely, those studies that have found apps to be lacking in health behavior change features have highlighted the absence of individual tailoring such as personalized notifications or the collection of background information, for example, using global position system data to identify when a person might be at a high-risk area for alcohol use [18] or simply asking a user to set a smoking quit date [17].
Studies that report on user outcomes or experiences of apps have had similarly mixed results. One systematic review that investigated the role of apps and other digital media in physical activity and diet as it relates to cancer survivorship found an overall increase in minutes of physical activity with use of the app, but mixed evidence for improved diet, and no improvement for secondary outcomes such as a reduction in anxiety or depression [23]. A recent study investigating the role of apps in improving mental health found that after 30 days of app use, mental well-being improved in those using 1 of 3 mental well-being apps tested and those using 1 of the 3 apps tested showed improvements in depression. None resulted in improvements in anxiety [24]. A systematic review and meta-analysis of studies that employed a smartphone app to increase physical activity found that the use of apps could result in significant changes to body weight and body mass index; however, nonsignificant results were identified in changes to physical activity [25].
Alongside this growing body of interest in the identification of apps that may play a role in behavior change [26,27] is an increasing body of research that seeks to first understand the features of apps that may play a role in behavior change and then to measure and classify these features [28][29][30]. Common among these studies is an aim to identify features that employ best practice to allow health practitioners to better inform consumers and patients of the apps most suited to their needs. The ability of practitioners to give this advice is predicated on the ability of researchers to effectively classify and evaluate apps suitable for the most common health conditions through a reliable and valid measurement tool.
As described by McKay et al [31], many studies investigating the potential of apps to change behaviors have employed a behavior change taxonomy (either the CALO-RE or 26 or 93 item taxonomy) for the rating and categorization of apps [10,[20][21][22]. The aim of the systematic review undertaken by McKay et al [31] was to investigate ways in which researchers evaluate the potential health behavior change of apps to identify any current best practice approaches. Instruments identified in the review were created to investigate the behavior change potential of Web-and text-based health interventions [32]. The techniques present in these instruments have been identified in a range of studies and then linked back to behavior change potential. Most notable are by Abraham and Michie [33], who suggested a number of behavior change techniques common to many health behavior theories. Michie et al [34] identified 5 techniques present in physical activity and dietary interventions: self-monitoring, intention formation, specific goal setting, review of behavioral goals, and feedback on performance, finding that interventions that included self-monitoring with at least one other technique were responsible for the largest effect size [34]. These findings are supported by other work suggesting that self-monitoring is useful for increasing physical activity and improving diet for those who were overweight with comorbidities [35], with other work suggesting that self-monitoring is one of the strongest predictors of weight loss [36] and can also assist in decreasing alcohol consumption [37].
App-based studies that have employed these taxonomies have found apps to be lacking in the identified characteristics of a good behavior change intervention. For example, in an investigation of 166 apps that encourage medical adherence against 93 behavior change techniques, Morrissey et al [22] found most apps contained between 0 and 7 techniques, with the most common technique identified being action planning, where users are able to set a reminder to take medication at a specific time every day, and set prompts or cues, typically through the setting of an alarm. A total of 2 studies investigated physical activity but found few techniques for behavior change. Direito et al [21] found that most apps contained 8 techniques, most frequently providing instruction, setting graded tasks, and employing self-monitoring, whereas Conroy et al [10] identified 4 or fewer techniques in the physical activity apps they reviewed.
As more practitioners begin to recommend apps to patients for a range of health care needs [38,39], it becomes essential that we have a valid and reliable way to evaluate these apps. Although both valid and reliable, the taxonomies of behavior change theory [33] were designed to evaluate the features of text and Web-based interventions [32,40,41], not for the review of apps. For instance, these taxonomies often feature a large number of items that are closely related, and are theoretically important in behavior change theory, but will often only appear once in an app. For example, the behavior change taxonomy used by Morrissey et al [22] includes 93 items, with each item allocated a score of 1 if present and 0 if absent. Many of these items are similar, for example, there are 11 items categorized as reward (including material incentives, material rewards, and nonspecific rewards), all of which are classified separately. For most apps, only one of these items would be present, thus although an app may offer rewards and the benefits that they bring to behavior change, they only offer 1 type means that app would receive a low score in that behavior change category. With increasing knowledge and the growing body of research into app-based interventions, there is a clear need for a purpose-designed app rating system to identify the potential for health behavior change. Although there is 1 scale, the Mobile App Rating Scale (MARS), that is able to describe the functionality of apps, including aesthetics and information shared [42], there is currently no scale that can measure the potential for behavior change.
Over the past 3 years, the 3 authors of this study have been involved in rating and reviewing apps for the Victorian Health Promotion Foundation (VicHealth) Healthy Living Apps project [43]. The VicHealth Healthy Living Apps project is an annual rating activity using the MARS [42] and CALO-RE [40] scales to provide consumers with a guide to which apps may assist them best in promoting health. This project typically sees up to 400 apps rated annually for their functionality and ability to encourage or promote behavior change in 1 of the following 5 categories: healthy eating, physical activity, tobacco prevention, alcohol harm prevention, and mental well-being. These categories have been chosen as they form the key priority areas of VicHealth and, therefore, are those that are investigated in the VicHealth Healthy Living Apps project [43]. This experience has made clear to the authors that a purpose-designed scale to measure the health behavior change potential is needed for any app review that seeks to recommend apps to the public.
This study aims to develop a reliable, theory-based scale that can be used to assess the behavior change potential of smartphone apps.

Study Design
The creation of this scale occurred in 4 phases. Phase 1 included a systematic review to identify all scales that have been used to rate the potential of an app to encourage behavior change.
Results from this phase were analyzed and developed into a draft tool. Phases 2 to 4 consisted of series of deductive tests. The results of each round of testing were analyzed and incorporated into the next version of the scale until the team could be confident of reliability and validity of the scale. The final version of the scale was shared with a panel of experts for comment and feedback (see Figure 1 for an overview of the study procedure).

Phase 1: Systematic Review to Develop Initial Item Pool
A systematic search of the literature was conducted to gather all published evidence relating to the various ways that apps have been evaluated for behavior change potential to develop an item pool. This search was based on and extended a previous systematic review [31]. A total of 5 databases (Academic Search Complete, CINAHL Complete, E-Journal, MEDLINE Complete, and PsycINFO) were systematically searched. The search was completed on November 17, 2017, with no temporal limitations placed on the search. The search was limited to studies focusing on mobile phones, smartphones, cell phones, and tablets; used apps; and focused on health behaviors previously investigated in the VicHealth Healthy Living Apps project [43]. Search terms were health, wellbeing, preventative health, smok*, nutrition, alcohol, physical activity, or mental wellbeing.
The inclusion criteria comprised studies that evaluated mobile health apps in English, evaluations or reviews of apps targeted at consumers, alone or in addition to health professionals, and studies that evaluated the effectiveness of mobile health apps. Excluded studies comprised those that evaluated mobile health apps targeted only at health professionals, formative evaluations of mobile health apps, protocols for evaluations, apps that were not publicly or commercially available, studies that reported primarily on the validation of any mobile health app tool (eg, the MARS), and studies of apps not related to health behavior change. The papers were first screened by title and abstract based on the inclusion and exclusion criteria. The full texts of selected papers were then obtained for further assessment for final inclusion.

Phase 2: Face and Content Validity
The initial version of this scale was pilot-tested with 3 physical activity apps. The pilot testing was conducted by 2 experienced reviewers (FM and SS) and allowed the raters to (1) become familiar with the scale and (2) refine the wording of items and create item descriptors and examples.
Following this pilot, the ABACUS version 1 was used to rate the 3 highest rating apps from each of the 5 categories (15 apps in total) from the VicHealth Healthy Living Apps project [43]. To undertake this testing, the reviewers downloaded and became familiar with each app. Similar to other studies [28,42,44], the authors spent approximately 10 to 15 min testing all app features before rating. After the apps were rated, the raters met to discuss the app and the allocated score as a way to achieve agreement among raters and strengthen the scale. This discussion allowed for an identification of the similarities and differences in rating and, importantly, the strengths and weakness of each question in the scale, including clarity and specificity. During this process, the raters added and refined descriptors and examples for each item.

Phase 3: Reliability Analysis
The ABACUS version 2 was used by 3 raters (FM, SS, and MD) to review 50 physical activity apps. Physical activity apps were chosen for this phase not only because there are a large number of physical activity apps in the Apple iTunes store, providing a large choice for consumers but also because past reviewing [43] suggests that they represent a wide range in app quality. Apps were downloaded from the app store, and mirroring testing in phase 2, the authors spent approximately 10 min to 15 min reviewing all the features of the app.
Reliability of the scale was assessed using Krippendorff alpha. This allows for rating of ordinal data, can be used with an unlimited number of raters, and has been found to be superior to Cohen kappa [45,46]. Consistent with previous research, an alpha of more than .67 is used to indicate agreement [47], whereas, a negative alpha indicates less agreement than that would be expected by chance and suggests that there may have been inconsistencies in how measures were applied [48]. The internal consistency of the scale was calculated using Cronbach alpha. Interrater reliability was determined by interclass coefficient (ICC) [49]. Percentage agreement was also calculated.

Phase 4: Reliability Analysis 2
To investigate the discrepancies identified in phase 3, the same 3 raters (FM, SS, and MD) rated 5 unrated physical activity apps together. The apps were rated 1 at a time allowing for discussion of the results and for clarification of problem areas, specifically in item descriptions and examples. At the completion of this further moderation activity, an additional 20 apps were independently reviewed against ABACUS version 2, following the same procedure as phase 3.

Phase 1
The search identified 593 unique papers. The abstracts and titles of all papers were reviewed, leaving 77 papers for full-text review. This review resulted in 50 papers that fully met the inclusion criteria and were included in this study (the list of resources is available in Table 1). To determine current best or common practice in app reviewing for behavior change, all scales used in the 50 papers identified were collected. For scales that were not provided as part of the manuscript or as a supplemental material, institution and academic sharing websites (such as Research.net) were searched. If the scale was not able to be located, the authors were emailed and a copy was requested. Only 2 scales [50,51] were unable to be obtained as the author had either moved on from that institution or there was no response to the email.
The scales identified in this systematic review were collated into a single document resulting in 1333 items (see Multimedia Appendix 1), with duplicates and questions present in the MARS removed, leaving 130 individual items (see Multimedia Appendix 1). Moreover, 2 authors (FM and SS) experienced in health promotion and health promoting apps reviewed the item pool. These authors had participated in the VicHealth Healthy Living Apps project [43], and each was experienced in rating hundreds of apps. The item pool was reviewed to identify or create items that were clear and based on previous work by these authors would be present in the highest quality apps [43].
From the 130 items, similar items were collapsed, for example, items that sought to identify avoidance or were collapsed with items that sought to minimize distraction; items that were presented as statements or single words were reworked into questions for ease of use. For example, 1 item that read "discrepancy between current behaviour and goal" was reworked to read "Does the app give the user the ability to quickly and easily understand the difference between current action and future goals?" This process resulted in an initial version of this scale, with 33 items that were categorized into 7 groups: (1) general, (2) goals, (3) feedback and monitoring, (4) knowledge and information, (5) actions, (6) rewards, and (7) environmental factors. These items formed the first version of the scale, the App Behavior Change Scale (ABACUS) version 1.

Phase 2
This process resulted in the removal of 9 questions that were deemed to be unclear or were found to be duplicates or unnecessary. For example, in the initial scale, there were 4 separate items that outlined behavior costs, rewards, and encouragement. The authors' experience rating several hundreds of apps over a number of years, combined with the initial round of reviewing and discussion of this scale, allowed for the determination that more than one of these items were unlikely to be in the same app. As a result, these items were collapsed into 1 item: "Does the app provide general encouragement?" Other items were also removed at this point as it was determined that these questions were not relevant to behavior change, for example, a question about whether the app could be used without internet connection and 2 questions about expertise and consistency with national guidelines were collapsed into 1 question: "Was the app created with expertise and/or Does the app provides information that are consistent with national guidelines?" The resulting scale contained 24 items. Following this, the scale was tested with 15 apps, 3 from each category: physical activity; healthy eating; alcohol; smoking; and mental well-being. Again, this process allowed for a refinement of the scale and resulted in several changes, including clarifying words and descriptors, reordering items, and combining other items, for example, 3 items relating to material, social, and self-reward or incentive were collapsed into a single item: "Does the app provide a material or social reward or incentive?" The authors' experience rating apps lead to the conclusion that it would unlikely that any 1 app would have more than 1 incentive or reward. Phase 2 resulted in the 22-item ABACUS version 2 with questions categorized into the following 4 categories: knowledge and information, goals and planning, feedback and monitoring, and actions.
At this stage, the ABACUS version 2 was sent to 7 external experts for their comment on content. These experts included 3 experts on mental well-being, 1 expert on alcohol and tobacco, 1 on physical activity, 1 on behavioral science, and 1 on health promotion. These experts were able to offer suggestions on language and terminology used, resulting in refinement of terminology and descriptors. For example, one of the reviewers suggested that the descriptor of item 1.4 (Does the app provide instruction on how to perform the behavior?) also includes video instructions (the app is clear in telling the person how to perform a behavior or preparatory behaviors, either verbally, through video, or in written form. Please note, the behavior that is seeking to be changed, not information on how to use the app). This version of the scale is presented in Table 2.

Phase 3
Phase 3 testing was conducted with 50 physical activity apps downloaded from the app store. All apps were rated independently by 3 reviewers against the ABACUS version 2, with ratings entered into Qualtrics to minimize user error. This phase found half of the questions to have high percentage agreement among reviewers (over 80%) with the scale overall reporting moderate interrater reliability (2-way mixed ICC=.69, 95% CI 0.52-0.82) and moderate internal consistency (Cronbach alpha=.71). However, some questions reported very low agreement. For example, question 4.3 "Does the app allow or encourage for practice or rehearsal, in addition to daily activities?" returned only an agreement of 51% with a negative Krippendorff alpha (alpha=−.01). Several other questions showed similarly low scores (see Table 2), and only 1 question achieved an alpha that would indicate agreement. These results prompted an additional round of discussion, and comparison was undertaken.

Phase 4
The initial discussion resulted in the collapsing of 2 goal questions into 1 from "Does the app allow for the setting of outcome (long-term) goals?" and "Does the app have the ability to set short and medium-term goals or a plan?" to "Does the app allow for the setting of goals?" Furthermore, a number of descriptors were reworded, and examples were provided for all questions. These changes resulted in ABACUS version 3 containing 21 questions (see Table 3 for final version of the scale).
This round of rating found over 80% of questions to have high percentage agreement among reviewers, with 3 questions scoring a Krippendorff alpha indicating agreement and a further 7 came close with alphas more than .5. The scale overall reported high interrater reliability (2-way mixed ICC=.91, 95% CI 0.81-0.97) and high internal consistency (Cronbach alpha=.93; see Table  4).  Table 1) Example or further information Definition Scale: item number and question [44,54] Elements of the app can be personalized through specific tools or Does the app have the ability to customize and personalize some features?

1.1
• To select a disease type from among several available and then to follow a specific path or set of tools or systems.
functions that are specific to the individual using the app.
• To select to receive emails or texts of a specific nature.

•
To choose "yes" or "no" to a specific capability of the app would be considered personalization.

•
To create a personalized exercise plan. [44,54] This would be found in the about section or generally in the app.
Was the app created with expertise and/or Does the app provide information that is consistent with national guidelines? NB: the behavior that is seeking to be changed, not information on how to use the app action, instruction on suitable clothing, recipes, and general tips. [22,81] The app gives the user information about the consequences of behavior Does the app provide information about the consequences of contin-1.5 • Consequences may include health, feelings, or cost consequences. in general, this includes information uing and/or discontinuing behavior? about the relationship between the behavior and its possible or likely consequences in the general case. This information can be general or personalized. [17,85] Is there a feature during setup where you describe how ready you are for behavior change?

Goals and planning
Does the app ask for willingness for behavior change?

2.1
• This may be in the form of a scale of readiness or in a question that asks the user to describe how ready you are. [20,21,40,44,54,55,81] The person is encouraged to make a behavioral resolution.
The person is encouraged to set a general goal that can be achieved by Does the app allow for the setting of goals?

2.2
• This is the explicit noting of a goal or choosing a goal from one provided within the app.
behavioral means. This includes subgoals or preparatory behaviors and/or specific contexts in which the behavior will be performed. The behavior in this technique will be directly related to or be a necessary condition for the target behavior. Source of question (from Table 1) Example or further information Definition Scale: item number and question [22,40,81] • This is where a goal can be changed. This allows people to act on previously set goals and then revise or adjust where needed.
Involves a review or analysis of the extent to which previously set behavioral goals (regardless of short or long) were achieved.
Does the app have the ability to review goals, update, and change when necessary? 2.3 [22,40,81] • This could be in the form of a graph or some other visual describing how close the user is to meeting their goals.

Feedback and monitoring
Allows user to see how they are tracking against a goal and to see the difference between what they want to do and what they are currently doing. This will give some feedback on where they are at and what they need to change to get to where they want to be.
Does the app give the user the ability to quickly and easily understand the difference between current action and future goals?
3.1 [20,21] • Connects with watch that records daily steps that can be reviewed.

•
Allows for easy logging of exercise or meditation? • Allows for tracking of weight loss.

•
Allows logging of daily alcoholic drinks or cigarettes.
The app allows for a regular monitoring of the activity. The app allows the person to share his or her behaviors on social media or in forums. This could also include a buddy system or a leaderboard.
Does the app have the ability to share behaviors with others (including social media or forums) and/or allow for social comparison?
The app is able to provide the person with feedback, comments, or data about their own recorded behavior. This might be automatic or could be personal.
Does the app have the ability to give the user feedback-either from a person or automatically? 3.4 [65] • Export to a computer or to another user such as a doctor or fitness expert. • Sharing to Facebook does not count.
The app allows for the export of information and progress to an external user.
Does the app have the ability to export data from app?
3.5 [22,40,81] • Financial, either in returning money that was not spent on, for example, cigarettes or in paying someone to engage in a specific activity. • Social or public, for example, congratulating the person for each day that he or she meets his or her exercise target.
App provides rewards for attempts at achieving a behavioral goal. This might include efforts made toward achieving the behavior or progress made in preparatory steps toward the behavior or in achieving a goal.
Does the app provide a material or social reward or incentive?
3.6 [22,40,81] • This could include achievement badges or telling the user that they are a certain percentage closer to their goal.
The app provides general encouragement and positive reinforcement on actions leading to the goal.
Does the app provide general encouragement? 3.7

Actions
Yes Yes Yes Yes

Yes
No Yes Yes Source of question (from Table 1) Example or further information Definition Scale: item number and question [20,21] • This could be like the apple watch reminding you to stand or a meditation app telling you to meditate now.
The app prompts the user to engage in the activity. The app has the ability to give notifications or reminders to cue the behavior.

Principal Findings
This study reports on the creation of a scale (ABACUS) to measure the potential behavior change of smartphone apps. After conducting a systematic review to identify all research that has evaluated apps for behavior change, 133 items were identified and later modified after expert review to a final set of 21 items. The items within the scale are grouped into the following 4 categories: knowledge and information, goals and planning, feedback and monitoring, and actions. The ABACUS was reviewed by an expert panel and then tested first against 50 physical activity apps; however, because of concerns relating to moderate internal consistency and interrater reliability, an additional step of moderation was taken. This moderation saw the same raters come together to refine the scale, resulting in improved descriptors and the inclusion of examples for each question. Following this revision, the scale was used to rate an additional 20 apps. This round of ratings resulted in a high internal consistency and interrater reliability. Although previous studies evaluating smartphone apps have focused largely on features available in apps [21] or behavior change techniques through a self-developed evaluation checklist [4,10], the ABACUS provides researchers with a reliable and valid instrument to evaluate apps based on their behavior change potential.
This scale will allow researchers to investigate the behavior change potential of a large number of apps reasonably quickly. This is important, as the fast-moving pace of app technology means that although randomized controlled trials (RCTs) remain important in understanding the impacts of individual apps on behavior [86], it has been suggested that the RCT may not be the most appropriate method to generate evidence around mobile apps [28]. RCTs can take a significant amount of time in planning and design meaning that by the time the RCT is available for publication, the information is no longer current [28]. The scale developed in this research is not a replacement for an RCT but rather will allow researchers and consumers to understand the behavior change potential of an app in the absence of an RCT.
The MARS [42], a 23-item tool included 5 subscales for measuring app quality: engagement, functionality, aesthetics, information, and app subjective quality, with questions such as target age group, ease of navigation, or aesthetics can be used in conjunction with the ABACUS. The MARS is a useful tool in understanding the aesthetic and functional appeal of an app. When used together, the MARS and the ABACUS will allow researchers to provide users with 2 scores for each app: 1 that measures app quality and 1 that measures potential for behavior change.
This study is only a starting point in the identification and interpretation of the behavior change potential of smartphone apps. This study only reports on the validation and reliability of physical activity apps, and as such, further testing of the scale should be conducted on additional health areas such as smoking, alcohol, and nutrition, as it is possible that different items may be important for these health areas. Furthermore, a more detailed investigation into the relative scores of apps will need to be undertaken. This will allow for an understanding of the importance of the overall score assigned to each app. At present, this scale is best understood as providing a continuous score rather than specific cut-off points. However, this is not to say that with more investigation and testing that clear scores could not provide a consumer with a numerical rating reflecting a behavioral outcome. This study has not purported to demonstrate correlation between an app's score and the health outcome; however, this scale could be used in future along with a more detailed study of individual apps and the behavior change outcomes in using them.
ABACUS has good interrater reliability and is a valid tool for evaluating the potential behavior change in smartphone apps. The validation and reliability testing of ABACUS contributes to the literature by providing a standardized method of evaluating smartphone apps for behavior change.

Limitations
Although this scale shows good reliability and validity, there are several limitations that need to be addressed. The first is that we have not sought to investigate criterion validity. The scale presented in this paper seeks to measure the theoretical behavior change potential of apps; and therefore, we do not seek to investigate the relationship between actual features of apps and behavioral outcomes. This scale has not been designed for this type of activity, so we leave this up to others to identify an appropriate method for such an investigation. Although reducing the numbers of items on the scale facilitates faster rating, there is a risk that removal of duplicate items and streamlining these items into 1 binary response may inflate a score. For example, by collapsing all goal-setting activities into 1 item, this scale recognizes apps that have any goals-setting ability, rather than the strength of that ability-a feature found in the behavior change taxonomy. Furthermore, there is a risk that by collapsing items that record starting a positive behavior with stopping a negative behavior, we may be missing a key aspect of behavior change. These decisions were made based on the authors' experience of rating apps with an understanding that a single app will not include both of these features, and as such, in seeking to provide a succinct scale, it makes more sense to only measure 1 outcome. Like other similar studies [42], this study highlights the importance of rater's knowledge of apps when completing such evaluations and with moderating 5 to 10 apps at the beginning of the process as a team is important to ensure a robust score. In addition, similar to other studies, raters in this study spent 10 min to 15 min with the app to become familiar before completing the evaluation. This time spent using the app is consistent with other studies that seek to review apps, as a longer time under review is not realistic [42,87]. Finally, 1 key limitation of this study is that the scale has been validated on physical activity apps. Although this scale seeks to be used in the future for other health behaviors, at this point in time, we are only confident that it can be used to rate the health behavior potential of physical activity apps. Other health behaviors will need to be investigated in future studies.

Conclusions
The ABACUS is a reliable tool that can be used to determine the behavior change potential of apps. This instrument fills a gap by allowing the evaluation of a large number of apps to be standardized across a range of health categories. This scale can be used by teams to rate apps that seek to promote behavior change, allowing for high-quality apps to then be recommended to the general public.