Software Engineering Project:   Health Monitoring Analytics

This project develops software for data mining of online information about people’s exercising activity.
The goal is to determine the fractions of physically active people that live in different cities.


    1.  Introduction

    2.  Project Description
        2.1  Who might be using this system and what for
        2.2  Data Collection
        2.3  Data Analysis and Modeling
        2.4  Information Display

    3.  Plan of Work

    4.  Example Student Projects

    5.  Relevant Websites and References

1.   Introduction

There is currently a great deal of technology to help people track their eating, health, sleep, and other wellness-related behaviors. Examples:

We are instrumenting ourselves with various technologies, including:

Most current systems for personal activity monitoring focus on the benefits/rewards for an individual user. The user may share this information with his or her friends, but everything revolves around an individual.
What if instead we looked at entire communities and tracked their progress in becoming more physically active, as well as the collective benefits? Interesting questions we could then ask include: How has community X benefited from its individuals adopting “quantified-self” devices? Has community X as a whole become healthier, reduced its healthcare costs, and become more attractive to belong to and live in?
Summarized data about the community progress could then be displayed on “public” displays or electronic billboards, which may be limited to a certain (perhaps broader) community or unlimited. Making such information public could serve as a greater motivator for individuals and make them more effective, by using their existing social networks to track and manage their progress as a whole community.

The Healthcare Hashtag Project allows people to find where the healthcare conversations are taking place on Twitter, discover who to follow within your specialty or disease, and find the best from conferences in real-time or in archive. It is used to globalize and clarify healthcare specific topics, such as diseases, etc.
Health Monitoring Systems (HMS) maintains a community health surveillance service, EpiCenter, which analyzes healthcare data to detect anomalies suggestive of public health threats, such as disease outbreaks and bioterrorism.
None of these products is able to provide analytics for tracking population activities related to healthy lifestyle and gaining insight into lifestyle trends.

One approach is to allow the user to see what others do. The next level is to have the user create a personal profile and, based on this profile, provide individualized services, such as exercise-regime suggestions. An even greater level of engagement is for the user to participate in a game with continuous feedback about the user’s performance. This document describes our approaches for improving community lifestyle and health.

2.   Project Description

The goal of this project is to develop a system that would help us answer questions such as:

On the other hand, an individual may wonder:

Instead of relying on public health statisticians or fitness experts to answer such questions, we will rely on the “wisdom of the crowd.”

2.1   Who might be using this system and what for

We envision serving these types of users:

  1. Interactive user, who navigates through different information that our system extracted from input data (mainly Twitter posts, as explained later), similar to how Google Analytics allows people to zoom in, zoom out, and navigate different information about visitors to their website. This same use case applies to government officials who want to know the population's health and exercise/lifestyle habits.
    This user does not require login.
  2. Registered user, who will provide personal information about his or her exercise activity or other lifestyle information. This user can enter manually how much he/she exercised and what kind of exercise, or can simply allow your system to connect to a tracking device website and retrieve data automatically. For example, see Fitbit Developer site under “Quick Start Guide”, item #3 explains how to retrieve a user’s data.
    It is not clear why a government official would need a registered account, except for private use.
    A registered user is required to login.
  3. Operator (or, system administrator) who will, for example, change the revolving schedule for a public display. This schedule determines what information is displayed on a public display (e.g., an electronic billboard), for how long, and the period for rotation of displayed information.
    This user requires login.
In addition to these three types of users who will actively initiate interaction with our system, we will have passive users who may just view the revolving display that periodically shows a summary of the population’s exercising habits or lifestyle. These users will not interact with our system. They may or may not notice and attend to what the display shows, and as a result they may decide to become interactive users (type #1 above) or registered users (type #2 above).

Figure 1 illustrates different stakeholders in this project.
(Note that the next section describes the rationale for using the Twitter service for input data collection.)
As seen, there may be some overlap between stakeholder populations, but these are different populations nonetheless. Our main target is to motivate people who may view our display but do not yet exercise regularly, or to motivate the viewers to continue their efforts.
[Figure 1: Stakeholder populations]

One may be tempted to provide personal services, such as personalized fitness plans, advice about adjusting the fitness program according to user’s achievement, and tools for easy sharing (on Facebook or Twitter) of data about user’s progress.
However, all of these features are orthogonal to our main goals (answering the questions listed at the beginning of this section). Including personalized services may only serve as a distraction and complicate an already complex task.
Therefore, although we will discuss possible personalized services, we will need to make a clear case for benefits of registration-based access to analytics information.

2.2   Data Collection

Input data are essential for answering the above questions, but collecting input data presents major logistical and privacy issues.
A simple approach to data collection is to count the people that satisfy a certain condition (“predicate”).
For example, we may contact all manufacturers of body monitors (or activity trackers) and ask for the geographic distribution of their customers. These manufacturers usually run websites where the device owners can register and upload their activity records.
In this case, the predicate is:  Person X owns a body monitoring device.
Another option is to obtain membership and attendance statistics from local gyms or sports clubs.
Even statistics about hospital patient visits may be useful.
Our data analysis would consist of simply determining the fraction of the area's population that owns a body monitoring device or exercises in a gym.

For example, Fitbit offers APIs for developers and is asking for suggestions on how to extend the APIs.
We may ask Fitbit to introduce an API for statistics of customers in a geographic area. For privacy preservation, this API would not report on areas that have very few customers. In such cases, it would report the smallest containing area (“supremum”) that has ≥ N customers, where N could be set at 100, 1000, or whatever is considered adequate for privacy preservation.
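The privacy rule above can be illustrated with a minimal sketch. It is not an existing Fitbit API: the area hierarchy, the customer counts, and the function name `smallest_reportable_area` are all hypothetical and serve only to show the "walk up to the smallest containing area with at least N customers" idea.

```python
from typing import Dict, Optional

# Hypothetical area hierarchy: each area maps to its parent (None for the root).
PARENT: Dict[str, Optional[str]] = {
    "Highland Park": "Middlesex County",
    "Middlesex County": "New Jersey",
    "New Jersey": None,
}

# Hypothetical customer counts, already aggregated so that each area
# includes the customers of every area it contains.
CUSTOMERS: Dict[str, int] = {
    "Highland Park": 40,
    "Middlesex County": 2500,
    "New Jersey": 90000,
}


def smallest_reportable_area(area: str, min_customers: int) -> Optional[str]:
    """Walk up the hierarchy until an area has at least min_customers."""
    current: Optional[str] = area
    while current is not None:
        if CUSTOMERS.get(current, 0) >= min_customers:
            return current
        current = PARENT.get(current)
    return None  # even the largest area is too small to report
```

With N = 100, a request for "Highland Park" would be answered at the county level, because the town itself has too few customers in this made-up data.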

There are several potential problems with this approach:

Therefore, we decide to primarily rely on the postings on the Twitter social networking site (“tweets”) as our input data. Twitter data are publicly available for users around the world.
Assuming that we are able to determine for some or most Twitter users where they live (city, state, etc.), we will determine the fraction of users from each city/location that tweet about exercise/physical activity (or, more generally, about “quantified self”).
We will analyze the Twitter postings and hopefully be able to collect richer information about the frequency and other attributes of people’s exercise. This, in turn, will enable more powerful and accurate inferences and answers to the above questions.
In the future, this decision may need to be revisited and Twitter data may be augmented with device manufacturer data.

The first step is to determine a list of hashtags that will be used to search for tweets about “quantified self” through the Twitter Developer API.
You may use the device names, such as fitbit, jawbone up, bodymedia, garmin forerunner, nike fuelband, motoactv, and zeo sleep manager,
or keywords, such as activity tracker, exercise, workout, fitness, recreation, physical activity, or calories burned.
Using a combination of hashtags, like #fitbit + miles, or #run + calories, should retrieve more relevant tweets about activity.
One interesting thing to determine is how much you observe improvement in the relevance of retrieved tweets when using combinations of hashtags versus individual tags alone.
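One way to run this comparison is sketched below. The sketch only builds query strings and scores a hand-labeled sample; it does not call the Twitter API. The function names are hypothetical, and treating space-separated terms as "all terms must match" is an assumption that should be checked against the search API actually used.

```python
from itertools import combinations
from typing import Iterable, List, Sequence


def build_queries(tags: Sequence[str], size: int = 2) -> List[str]:
    """Return one query string per combination of `size` tags."""
    return [" ".join(combo) for combo in combinations(tags, size)]


def precision(labels: Iterable[bool]) -> float:
    """Fraction of retrieved tweets that a human reviewer marked as relevant."""
    marks = list(labels)
    return sum(marks) / len(marks) if marks else 0.0
```

For each query, a developer would retrieve a small sample, label each tweet as relevant or not, and compare `precision` for single tags against `precision` for the combined queries.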

In addition, you may simply monitor twitter feeds or “twitterverse” and figure out which hashtags may be relevant. It is important to examine the actual posts and consider the information they are carrying. The key question is what kind of posts seem to contain the input information needed to make meaningful inferences. Without good input data, it is impossible to draw reliable conclusions. We may find that more specific keywords (such as running, walking, biking, hiking, or swimming) are more often accompanied by numeric data such as duration of exercise, or distance travelled.
Here are four tools for tracking hashtags:

  1.  is a visual search that lets you see a whole web of related hashtags, as well as stats on popularity and usage patterns.
  2. Hashtracking  creates charts and graphs that present analytics on hashtags, including the most influential Twitter users and the hashtag’s reach.
  3. RiteTag  will tell you how effective your hashtags are (great or overused) through color coded ratings.
  4. Tadef  helps you look up a hashtag if you don’t know what it means; it can help avoid hashtag misuse.

Once you have a set of potentially relevant hashtags, use the API to retrieve all the tweets that mention those hashtags. You will likely retrieve millions of tweets and there will be a significant amount of noise in the data (e.g., people posting unrelated activities or making jokes).
Note also that tweets can be retweeted and repeated and users may use multiple identities. You should try cleaning up the collected dataset as best you can.
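A first cleaning pass might look like the following sketch. It assumes each tweet is a dictionary with a "text" field (a hypothetical local format, not the API's schema) and that retweets carry the conventional "RT @" prefix. It does not attempt to detect users with multiple identities.

```python
import re
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase, drop URLs, and collapse whitespace so near-copies compare equal."""
    text = re.sub(r"https?://\S+", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def clean(tweets: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop retweets and repeated texts, keeping the first occurrence."""
    seen = set()
    kept = []
    for tweet in tweets:
        text = tweet["text"]
        if text.startswith("RT @"):  # conventional retweet prefix (assumption)
            continue
        key = normalize(text)
        if key in seen:  # same text already stored
            continue
        seen.add(key)
        kept.append(tweet)
    return kept
```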

Your goal is to identify a fraction of Twitterers who take this reporting of their physical activities seriously, download all relevant tweets from their feeds (starting with each user's very first tweet), and then store them in a local database for analysis.

The selection of hashtags is critical, because querying with different hashtags will retrieve different sets of tweets. Different input data will, in turn, lead to different analysis and modeling results (Section 2.3). In the end, our users may reach different or wrong conclusions based on different or inadequate sets of hashtags.
For examples of how different sets of hashtags lead to different results, see the example student projects in Section 4; particularly compare their heat map visualizations.

In February 2014, Twitter announced a so-called “data grant program” that offers some research teams access to its entire historic database. Unfortunately, applications are currently closed.
See an article in Washington Post, October 26, 2014: Twitter grants select researchers access to its public database

2.3   Data Analysis and Modeling

The next step is to analyze the locally downloaded tweets and extract data that will serve as input to our algorithms to answer the questions listed at the start of this section. Data analysis and modeling must be based on inspecting the contents of actual posts. If the input data required by a model is not present in actual tweets, then the model and the algorithm will not be able to draw reliable conclusions.
Therefore, before developing our algorithms we must ensure that the required input data can actually be obtained and extracted.

The simplest approach is to count the number of Twitter users and the number of tweets per user. Unfortunately, this approach will suffer from noise, as discussed in Section 2.2.
Even after pre-processing for noise removal, not all messages may be counted with equal weight—we may wish to assign different weights to different messages. For example, a person may keep talking about planning to exercise, versus another person reporting on actual exercise.

Ideally, we should map the user’s reported activity to a single number representing the amount of exercise. An often-used proxy for activity measurement is calories burned. For example, the system would approximately translate miles run or steps walked into calories burned (based on some health or activity research data). Most activity-tracking devices provide information about calories burned and some users may tweet this number. The issue is what fraction of tweets will be specific about the amount of exercise and whether to count at all the tweets that do not contain specific information. This is an important tradeoff: work with more tweets that are less relevant, or with fewer tweets that are more relevant.
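The mapping to a single number can be sketched as follows. The conversion factors are illustrative placeholders, not validated research data, and the regular expressions cover only a few phrasings; both are assumptions made for this example.

```python
import re
from typing import Optional

# Illustrative conversion factors (kcal per unit); NOT validated research data.
KCAL_PER_UNIT = {
    "mile": 100.0,  # assumed rough figure for running one mile
    "step": 0.04,   # assumed rough figure for one walking step
}

AMOUNT = re.compile(r"(\d+(?:\.\d+)?)\s*(mile|step)s?\b", re.IGNORECASE)
CALORIES = re.compile(r"(\d+(?:\.\d+)?)\s*(?:calories|kcal)\b", re.IGNORECASE)


def estimate_calories(text: str) -> Optional[float]:
    """Return calories stated in the tweet, else an estimate, else None."""
    stated = CALORIES.search(text)
    if stated:
        return float(stated.group(1))  # the user tweeted the number directly
    match = AMOUNT.search(text)
    if match:
        amount, unit = float(match.group(1)), match.group(2).lower()
        return amount * KCAL_PER_UNIT[unit]
    return None  # tweet is not specific about the amount of exercise
```

Counting how often this function returns None on a sample gives a quick estimate of the fraction of tweets that are specific about the amount of exercise.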

It is not feasible for people to analyze manually millions of tweets. Therefore, we need an automatic or semi-automatic method to analyze the content of tweets. One approach is to use supervised learning  or  semi-supervised learning, where a researcher presents a set of examples and expects the system to learn rules and apply them to the whole dataset.

We may assign weights based on keywords only, or pairs of keywords, or n-grams of words. (Note that the idea of simultaneously using multiple hashtags for data retrieval, mentioned in Section 2.2, is along the line of using n-grams.) Ideally, we would have a natural-language analyzer to analyze the messages and determine more accurately their meaning. Based on the meaning information, our system would assign different weights to different messages.
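As a rough illustration of weighting by pairs of words, the sketch below scores a tweet by the known bigrams it contains. The bigram table, the weights, and the default value are all hypothetical; in practice they would come from labeled examples or a learned model.

```python
from typing import Dict, List, Tuple

# Hypothetical weights: bigrams that signal intent count less than
# bigrams that report completed exercise.
BIGRAM_WEIGHTS: Dict[Tuple[str, str], float] = {
    ("planning", "to"): 0.2,
    ("going", "to"): 0.2,
    ("just", "ran"): 1.0,
    ("miles", "today"): 1.0,
}
DEFAULT_WEIGHT = 0.5  # assumed weight when no known bigram appears


def bigrams(text: str) -> List[Tuple[str, str]]:
    """Split on whitespace and pair each word with its successor."""
    words = text.lower().split()
    return list(zip(words, words[1:]))


def tweet_weight(text: str) -> float:
    """Return the highest weight among known bigrams, or the default."""
    weights = [BIGRAM_WEIGHTS[b] for b in bigrams(text) if b in BIGRAM_WEIGHTS]
    return max(weights) if weights else DEFAULT_WEIGHT
```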
A useful tool for keyword analysis is “tag cloud” visualization, which shows frequently used keywords. The developer may use it when selecting keywords for input data collection.

For example, we may track different types of exercise, such as “running”, “walking”, “biking”, “hiking”, and “swimming”. However, suppose that a quick tag-cloud visualization reveals that very few people tweet about “hiking” or “swimming”. Then we may decide that for simplicity the initial version of the system will omit tracking such posts.
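The frequency counts behind such a tag cloud can be computed directly. The sketch below is a simplification: it matches whole whitespace-separated words only (so "running," with a trailing comma is missed), and the threshold for "very few" is an assumption the developer has to choose.

```python
from collections import Counter
from typing import Iterable, List

ACTIVITIES = ("running", "walking", "biking", "hiking", "swimming")


def activity_counts(tweets: Iterable[str]) -> Counter:
    """Count how many tweets mention each tracked activity keyword."""
    counts: Counter = Counter()
    for text in tweets:
        # Treat "#running" the same as "running".
        words = set(text.lower().replace("#", " ").split())
        counts.update(a for a in ACTIVITIES if a in words)
    return counts


def rare_activities(counts: Counter, total: int, threshold: float) -> List[str]:
    """Activities mentioned in less than `threshold` of all `total` tweets."""
    return [a for a in ACTIVITIES if counts[a] / total < threshold]
```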

The next step is to determine where each Twitter user lives.
For each user that appears in the local database, go to this user’s Twitter profile and search for information that would help you to determine where this user lives (city, state, etc.; if possible try to locate users to their city districts/boroughs/municipalities, such as Manhattan or Brooklyn in New York City).
See the book Twitter Data Analytics (listed below in Section 5) on how to use a gazetteer to translate a user profile into geographic coordinates. See the example “Translating location string into coordinates.” The next task is to figure out how to convert coordinates to a location name. Note that geographic coordinates can be directly used to draw a heat map.

If such information is not available in the user’s profile, try inferring it from user’s tweets. E.g., search geo-tagging information from tweets to identify user’s location.
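The two-stage lookup (profile first, geo-tags as a fallback) might be sketched as below. The three-entry gazetteer is a toy stand-in for a real one, substring matching is a deliberately naive assumption, and averaging geo-tag coordinates is only a crude estimate of where a user lives. This is not the code from the book.

```python
from typing import Dict, List, Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude)

# Tiny hypothetical gazetteer; more specific names are listed first so that
# "Brooklyn, New York" resolves to the district rather than the city.
GAZETTEER: Dict[str, Coordinates] = {
    "brooklyn": (40.6782, -73.9442),
    "new york": (40.7128, -74.0060),
    "chicago": (41.8781, -87.6298),
}


def locate_profile(location: str) -> Optional[Coordinates]:
    """Match the free-text profile location against gazetteer names."""
    text = location.lower()
    for name, coords in GAZETTEER.items():  # insertion order: specific first
        if name in text:
            return coords
    return None


def locate_user(profile_location: str, geotags: List[Coordinates]) -> Optional[Coordinates]:
    """Use the profile first; fall back to the average of geo-tagged tweets."""
    found = locate_profile(profile_location)
    if found is not None or not geotags:
        return found
    lat = sum(p[0] for p in geotags) / len(geotags)
    lon = sum(p[1] for p in geotags) / len(geotags)
    return (lat, lon)
```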

[Figure 2: Population ratios]

The outcome of data analysis will be weighted counts of posts and users from different geographic areas.
As shown in the diagram in Figure 2, we need to account for the total population in each area, versus the fraction that tweets, and of this fraction a subset that both tweets and regularly exercises.
We need to keep in mind that we are making inferences and drawing conclusions based on a subset of all exercising people in an area. Ideally, we should perform additional analysis to determine how representative our selected subset of Twitter users is of the entire population of interest.
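Given weighted counts per area, the ratio of exercising twitterers to all twitterers can be computed and ranked as in this sketch. The function name and the input dictionaries are hypothetical; the result says nothing about residents who do not tweet.

```python
from typing import Dict, List, Tuple


def exercise_fractions(
    exercising: Dict[str, float], tweeting: Dict[str, int]
) -> List[Tuple[str, float]]:
    """Rank areas by (weighted exercising users) / (all tweeting users)."""
    ratios = [
        (area, exercising.get(area, 0.0) / users)
        for area, users in tweeting.items()
        if users > 0  # skip areas with no observed twitterers
    ]
    return sorted(ratios, key=lambda pair: pair[1], reverse=True)
```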

It is probably not feasible to collect all tweets for calculating the population fractions. A better approach is to use some kind of statistical sampling. One possible approach is to use the Monte Carlo method. To find the fraction of tweets from a given community that are about exercise, we collect N tweets of any kind (about any subject) posted by twitterers from this community. To test whether we obtained a sufficiently representative sample, we follow this procedure:

  1. From the entire set of N tweets, determine the fraction of tweets about exercise. Let us call this fraction fN.
  2. Uniformly at random, select a subset of ½N tweets from the entire set of N tweets.
  3. Determine the fraction of tweets about exercise in this subset of ½N tweets. Let us call this fraction f½N.
  4. Check whether |fN - f½N| ≤ 1%, where 1% is the benchmark of our data accuracy.
    If the inequality is TRUE, we are done: the fraction fN is taken as representative of the true fraction of exercise-related tweets for the given community.
  5. If |fN - f½N| > 1%, the whole sample of N tweets is not large enough. Therefore, we should retrieve another N tweets from twitterers of this community.
    As a result, we now have 2×N tweets in our whole sample. Then, we assign N ← 2×N, jump to the first step of this procedure, and repeat the whole procedure until we reach a large enough sample.
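The procedure above can be sketched in code. The callables `fetch_tweets` (returns the requested number of additional tweets from the community) and `is_exercise` (classifies one tweet) are hypothetical placeholders for the data-collection and analysis components. The `max_rounds` limit is an addition not in the procedure; it is there only so the loop is guaranteed to stop.

```python
import random
from typing import Callable, List, Optional


def fraction_about_exercise(tweets: List[str], is_exercise: Callable[[str], bool]) -> float:
    """Fraction of the given tweets classified as exercise-related."""
    return sum(1 for t in tweets if is_exercise(t)) / len(tweets)


def representative_fraction(
    fetch_tweets: Callable[[int], List[str]],
    is_exercise: Callable[[str], bool],
    n: int = 1000,
    tolerance: float = 0.01,
    max_rounds: int = 10,
    rng: Optional[random.Random] = None,
) -> float:
    """Double the sample until the half-sample fraction agrees with the full one."""
    rng = rng or random.Random()
    sample = fetch_tweets(n)
    for _ in range(max_rounds):
        f_n = fraction_about_exercise(sample, is_exercise)       # step 1
        half = rng.sample(sample, len(sample) // 2)              # step 2
        f_half = fraction_about_exercise(half, is_exercise)      # step 3
        if abs(f_n - f_half) <= tolerance:                       # step 4
            return f_n
        sample = sample + fetch_tweets(len(sample))              # step 5: N <- 2N
    raise RuntimeError("sample did not stabilize within max_rounds")
```

Passing a seeded `random.Random` makes a run repeatable, which helps when comparing results across different sets of hashtags.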

Other interesting information that we may consider extracting from the tweets includes:

This information may additionally help us make more accurate and reliable inferences and conclusions.

2.4   Information Display

We need to consider how to display the “findings” from our Twitter data analytics. One option is a public (i.e., shared, common) display/billboard for the whole community, in the hope that it motivates them to act. Another option is to have registered users that log in to our system for information access.

Login implies privacy, privileges, and personalization. If our only input is based on Twitter message analysis, then it is not clear what kind of privacy would be needed if all of our input data are already public (posted on Twitter). Privileges may include providing information only to registered users, for some sort of competitive advantage. But that defeats the purpose of community building and also fails to motivate those who are not motivated to exercise or generally participate in the system. In addition, personalization requires that our system knows a lot about individual users. It is not clear that such information can be gleaned from Twitter posts only. Are we going to require our registered users to enter additional information? What would that information be? How to motivate users to do so? What benefit would they accrue if they provide additional information? Finally, how to connect the analysis of general Twitter population with our registered users, who may not even be tweeting themselves?

Note that our key issue is about the information displayed, not necessarily the display device itself. Display design and information visualization should be minimalistic, to focus on providing information, instead of flashy effects.
We mentioned the “tag cloud” visualization in Section 2.3. If our system is to be used for market research, a clear table representation of tag ranking would better serve the purpose than a shining tag cloud.
On the other hand, a tag-cloud visualization may be useful for the developer during the selection of keywords and development of algorithms to better understand the hashtags used in tweets and their relationships.

The display device can be:

The question is, does the actual physical display matter? The key issue is, why would a person seeing how his/her community is doing change his/her behavior more readily than if seeing only his/her own individual performance (even if compared to a population average)?
See this interesting article:
“Hospital hygiene: First, wash your hands,” The Economist, September 7th 2013 (Technology Quarterly, pages 8-9).
Also check the DebMed Group site.

We will consider two types of information displays:

  1. Non-interactive display that circles/rotates through a set of analytics findings for the area where the display is located (e.g., a billboard).
    Here we need to decide what findings to show and how often to update the display (hourly, daily, weekly, …)
  2. Interactive display, for example like Google Analytics, where the user can explore analytics for other areas or past analytics for a given area.
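For the non-interactive display, the rotation through findings can be sketched as a simple generator. The schedule entries and durations are made up for illustration; in the envisioned system the operator would set them.

```python
from itertools import cycle, islice
from typing import Iterator, List, Tuple

# Hypothetical schedule set by the operator: (finding, seconds on screen).
SCHEDULE: List[Tuple[str, int]] = [
    ("Most active city districts this week", 20),
    ("Share of residents tweeting about exercise", 15),
    ("Most popular activities", 10),
]


def rotation(schedule: List[Tuple[str, int]]) -> Iterator[Tuple[int, str]]:
    """Yield (start_time_in_seconds, finding) forever, cycling the schedule."""
    start = 0
    for finding, duration in cycle(schedule):
        yield start, finding
        start += duration


# First four slots: the fourth wraps around to the first finding at t = 45 s.
first_slots = list(islice(rotation(SCHEDULE), 4))
```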
One issue to consider is whether to make the interactive display registration-based. An argument against user registration and login is that none of the displayed data will be private.
If the only information that the user will be able to interact with is non-personal information extracted by the analytics described in Section 2.3, then there is no need for registration and login.
Registration and login will be necessary only if the user will store some personal information on our website, based on which our system may provide personalized services, or if we provide advanced analytics information to preferred customers.

An interesting problem is that there may be many users who are already active but do not tweet and our system will not know about them. How to reach such users and persuade them to tweet about their activities?

3.   Plan of Work

The plan of work is as follows:

  1. Determine a list of hashtags to search for through the Twitter Developer API for tweets about “quantified self”.
    As discussed in Section 2.2, you may use the device names or various relevant keywords.
  2. Given the selected set of hashtags, use the API to retrieve all the tweets that mention those hashtags.
    Identify a fraction of Twitterers who take reporting of their physical activities seriously, download all relevant tweets from their feeds (starting with his or her very first tweet), and then store to a local database for analysis.
  3. For each user that appears in your database, go to this user’s Twitter profile and search for information that would help you to determine where this user lives (city, state, etc.; if possible try to locate users to their city districts/boroughs/municipalities).
    If such information is not available in the user’s profile, try inferring it from geo-tagging information in the tweets.
  4. Plot the histogram of the number of tweets related to physical activities starting with each user’s first tweet (or the first available tweet, since Twitter may not hold data beyond a certain point in the past?—need to check!). Observe the evolution of the distribution of tweets over time. If a decaying distribution is observed, calculate the half-life of an average Twitter user.
  5. Analyze the tweets for the frequency of occurrence of different words and their meaning. What can be concluded from this content/word analysis?
  6. Select a small set (five to ten) of cities (or city districts) from which most tweets seem to originate. Then download all tweets that you can identify as originating from these cities. This means any tweets, not only those related to activity monitoring and exercise that you downloaded earlier.
  7. Determine what fraction of all tweets from a given city (or city district) represent the tweets about “quantified self”.
    Show a leaderboard of cities/districts. (Display design is described in Section 2.4.)
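For step 4, one possible way to estimate the half-life is to fit an exponential decay to the histogram. The sketch below assumes the histogram is a list of tweet counts per equal-width time bin and fits a straight line to the logarithm of the counts. This is one modeling choice among several, not a prescribed method.

```python
import math
from typing import List, Optional


def half_life(counts: List[int]) -> Optional[float]:
    """Fit counts[t] ~ A * exp(-k * t) by least squares on log(counts).

    Returns the half-life ln(2) / k measured in histogram bins, or None
    when the fitted trend is not decaying or fewer than two bins are non-empty.
    """
    points = [(t, math.log(c)) for t, c in enumerate(counts) if c > 0]
    if len(points) < 2:
        return None
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    slope = sum((t - mean_t) * (y - mean_y) for t, y in points) / var_t
    if slope >= 0:
        return None  # counts are flat or growing, so no decay to measure
    return math.log(2) / -slope
```

If the bins are months, a returned value of 3.0 would mean that the number of activity tweets halves roughly every three months.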

4.   Example Student Projects

Fall 2013 Semester

Group #1—Health Monitoring Analytics

Developed in the Fall 2013 semester by Gradeigh D. Clark, Xianyi Gao, Rui Xu, Li Xu, Yihan Qian, and Xiaoyu Yu

Project report #3 (final), group #1, Fall 2013
[PDF document; size: approx 8.6 MBytes]

Project files, group #1, Fall 2013.
[ZIP file; size: approx 18 MBytes]

Group #2—Cities Activity Monitoring Analytics

Developed in the Fall 2013 semester by Gang Yang, Yu Ji, Siyu He, Haoyang Yu, Chen Liang, and Xueting Liao

Project report #3 (final), group #2, Fall 2013
[PDF document; size: approx 13.4 MBytes]

Project files, group #2, Fall 2013.
[ZIP file; size: approx 97 MBytes]

Group #3—Health Monitoring Analytics

Developed in the Fall 2013 semester by Sen Yang, Luyang Liu, Xiang Chi, Yi Dong, and Qi Shen

Project report #3 (final), group #3, Fall 2013
[PDF document; size: approx 3 MBytes]

Project files, group #3, Fall 2013.
[ZIP file; size: approx 330 MBytes]

5.   Relevant Websites and References

•  General information about  data analytics,  business analytics  and  business intelligence.

•  Sunny Consolvo, Predrag Klasnja, David W. McDonald and James A. Landay,   Designing for Healthy Lifestyles: Design Considerations for Mobile Technologies to Encourage Consumer Health and Wellness,  in  Foundations and Trends in Human–Computer Interaction, vol.6, nos.3-4, pp.167-315, 2014.

•  Matthew A. Russell,   Mining the Social Web (2nd Edition),  O'Reilly Media, October 2013.
Describes how to acquire, analyze, and summarize data from all corners of the social web, including Facebook, Twitter, LinkedIn, Google+, GitHub, email, websites, and blogs. Describes how to employ the Natural Language Toolkit and NetworkX, and how to apply advanced text-mining techniques, such as clustering and TF-IDF, to extract meaning from human language data.

•  Twitter data processing: a preprint of a new book Twitter Data Analytics by Shamanth Kumar, Fred Morstatter, and Huan Liu, is now available for free download. This book familiarizes the novice reader with the concepts of Twitter data collection, management, and analysis. Visit the URL below to download a free copy of the preprint and relevant code examples along with sample data:

•  Kenneth M. Anderson and Aaron Schram, “Design and Implementation of a Data Analytics Infrastructure in Support of Crisis Informatics Research”, Proceedings of the 33rd International Conference on Software Engineering (ICSE 2011), Honolulu, Hawaii, May 2011.

•  Eric P.S. Baumer, Sherri Jean Katz, Jill E. Freeman, Phil Adams, Amy L. Gonzales, J.P. Pollak, Daniela Retelny, Jeff Niederdeppe, Christine M. Olson, and Geri K. Gay, “Prescriptive Persuasion and Open-Ended Social Awareness: Expanding the Design Space of Mobile Health”, Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work (CSCW '12), pages 475-484, 2012.

•  Andrew Macvean and Judy Robertson, “iFitQuest: A school based study of a mobile location-aware exergame for adolescents”, Proceedings of the 14th International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI '12), pages 359-368, 2012.

•  A great deal of useful information and many relevant papers are available on the site by Frank Bentley, Yahoo: “Health Mashups”.

Last Modified: Tue Sep 9 21:07:23 EDT 2014
Maintained by: Ivan Marsic