This project develops software for data mining of online information
about people’s exercising activity.
The goal is to determine the fractions of physically active people that live in different cities.
2.1 Who might be using this system and what for
2.2 Data Collection
2.3 Data Analysis and Modeling
2.4 Information Display
3. Plan of Work
4. Example Student Projects
5. Relevant Websites and References
There is currently a great deal of technology to help people track their eating, health, sleep, and other wellness-related behaviors. Examples:
We are instrumenting ourselves with various technologies, including:
Most current systems for personal activity monitoring
focus on the benefits/rewards
for an individual user. The user may share this
information with his or her friends, but everything revolves around an individual.
What if instead we looked at entire communities and tracked their progress in becoming more physically active, as well as the collective benefits? Interesting questions we could then ask include: How has a community X benefited from its individuals adopting “quantified-self” devices? Has the community X as a whole become healthier, reduced its healthcare costs, and become more attractive to belong to and live in?
Summarized data about the community’s progress could then be displayed on “public” displays or electronic billboards, which may be limited to a certain (perhaps broader) community or unlimited. Making such information public could serve as a greater motivator, making individuals more effective by using their existing social networks to track and manage their progress as a whole community.
The Healthcare Hashtag Project
allows people to find where the healthcare conversations are taking place on Twitter,
discover whom to follow within a specialty or disease area, and find the best content from
conferences in real time or in archives. It is used to globalize and clarify healthcare-specific
topics, such as diseases.
Health Monitoring Systems (HMS) maintains a community health surveillance service, EpiCenter, which is capable of analyzing healthcare data for the purpose of detecting anomalies suggestive of public health threats, such as disease outbreaks and bioterrorism.
None of these products is able to provide analytics for tracking population activities related to healthy lifestyle and gaining insight into lifestyle trends.
One approach is to allow the user to see what others do. The next level is to have the user create a personal profile and, based on this profile, provide individualized services, such as exercise-regime suggestions. An even greater level of engagement is for the user to participate in a game with continuous feedback about the user’s performance. This document describes our approaches for improving community lifestyle and health.
The goal of this project is to develop a system that would help us answer questions such as:
On the other hand, an individual may wonder:
Instead of relying on public health statisticians or fitness experts to answer such questions, we will rely on the “wisdom of the crowd.”
We envision serving these types of users:
Figure 1 illustrates different stakeholders in this project.
(Note that the next section describes the rationale for using the Twitter service for input data collection.)
As seen, there may be some overlap between stakeholder populations, but these are different populations nonetheless. Our main target is to motivate people who may view our display but do not yet exercise regularly, or to motivate the viewers to continue their efforts.
One may be tempted to provide personal services,
such as personalized fitness plans, advice about adjusting the fitness program
according to the user’s achievements, and tools for easy sharing (on Facebook or Twitter)
of data about the user’s progress.
However, all of these features are orthogonal to our main goals (answering questions listed at the beginning of this document). Including personalized services may only serve as distraction and complicate an already complex task.
Therefore, although we will discuss possible personalized services, we will need to make a clear case for the benefits of registration-based access to analytics information.
Input data are essential for answering the above questions, but collecting
input data presents significant logistical and privacy challenges.
A simple approach to data collection is to count the people who satisfy a certain condition (“predicate”).
For example, we may contact all manufacturers of body monitors (or activity trackers) and ask for the geographic distribution of their customers. These manufacturers usually run websites where the device owners can register and upload their activity records.
In this case, the predicate is: Person X owns a body monitoring device.
Another option is to obtain membership and attendance statistics from local gyms or sports clubs.
Even statistics about hospital patient visits may be useful.
Our data analysis would consist of simply determining the fraction of area population that own a body monitoring device or exercise in a gym.
For example, Fitbit.com offers APIs for Developers
and is asking for suggestions on how to extend the APIs.
We may ask Fitbit to introduce API for statistics of customers in a geographic area. For privacy preservation, this API would not report on areas that have very few customers. In such cases, it would report the smallest containing area (“supremum”) that has ≥ N customers, where N could be set at 100, 1000, or whatever is considered adequate for privacy preservation.
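The “supremum” lookup described above can be sketched as follows. This is a hypothetical illustration of the logic we would ask Fitbit to implement, not an existing API; the area names and customer counts are invented.

```python
# Sketch of a privacy-preserving area lookup: given a hierarchy of nested
# geographic areas (smallest first), report the smallest containing area
# whose customer count meets the privacy threshold N.
# Hypothetical data and interface, not a real Fitbit API.

def smallest_reportable_area(nested_areas, counts, n_min=100):
    """nested_areas: area names ordered from smallest to largest, each
    containing the previous one. counts: dict of area -> customer count."""
    for area in nested_areas:
        if counts.get(area, 0) >= n_min:
            return area, counts[area]
    return None, 0  # even the largest area is below the threshold

# Example: a neighborhood nested in a city nested in a county.
hierarchy = ["Downtown", "Springfield", "Greene County"]
customers = {"Downtown": 37, "Springfield": 85, "Greene County": 412}

area, count = smallest_reportable_area(hierarchy, customers, n_min=100)
# Reports ("Greene County", 412): the first area with at least 100 customers,
# so the small Downtown population cannot be singled out.
```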
There are several potential problems with this approach:
The first step is to determine a list of hashtags that will be
used to search for tweets about “quantified self”
through the Twitter Developer API.
You may use the device names, such as fitbit (Wikipedia), jawbone up (Wikipedia), bodymedia (Wikipedia), garmin forerunner (Wikipedia), nike fuelband (Wikipedia), motoactv (Wikipedia), zeo sleep manager (Wikipedia),
or keywords, such as activity tracker, exercise, workout, fitness, recreation, physical activity, or calories burned.
Using a combination of hashtags, like #fitbit + miles, or #run + calories, should retrieve more relevant tweets about activity.
One interesting thing to determine is how much improvement you observe in the relevance of retrieved tweets when using combinations of hashtags versus individual tags alone.
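One way to compare single tags against tag + keyword combinations is to filter a locally collected batch of tweets. A minimal sketch (the sample tweets are invented):

```python
# Compare a hashtag alone versus a hashtag + keyword combination,
# to see how many (and which) tweets each query would retrieve.

def matches(text, hashtag, keyword=None):
    """True if the tweet contains the hashtag, and the keyword if given."""
    text = text.lower()
    if hashtag.lower() not in text:
        return False
    return keyword is None or keyword.lower() in text

tweets = [
    "Just got my #fitbit in the mail!",
    "#fitbit says 5.2 miles today, new record",
    "Morning #run: 300 calories burned",
    "My cat stepped on my #fitbit lol",
]

broad = [t for t in tweets if matches(t, "#fitbit")]
narrow = [t for t in tweets if matches(t, "#fitbit", "miles")]
# broad retrieves 3 tweets; narrow retrieves only the one reporting miles.
```

On real data, the interesting measurement is the fraction of retrieved tweets that are genuinely about exercise under each query.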
In addition, you may simply monitor twitter feeds or “twitterverse” and
figure out which hashtags may be relevant. It is important to examine the actual posts and consider the information they are carrying. The key question is what kind of posts
seem to contain the input information needed to make meaningful inferences.
Without good input data, it is impossible to draw reliable conclusions.
We may find that more specific keywords (such as running, walking,
biking, hiking, or swimming) are more often accompanied
by numeric data such as duration of exercise, or distance travelled.
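A sketch of extracting such numeric data with regular expressions; the patterns and units are illustrative, not exhaustive, and real tweets will need many more variants.

```python
import re

# Pull simple activity numbers (distance, duration) out of tweet text.
DISTANCE = re.compile(r"(\d+(?:\.\d+)?)\s*(miles?|km)", re.I)
DURATION = re.compile(r"(\d+)\s*(min(?:utes?)?|hours?|hrs?)", re.I)

def extract_activity(text):
    """Return whatever numeric activity data the tweet contains."""
    result = {}
    m = DISTANCE.search(text)
    if m:
        result["distance"] = (float(m.group(1)), m.group(2).lower())
    m = DURATION.search(text)
    if m:
        result["duration"] = (int(m.group(1)), m.group(2).lower())
    return result

print(extract_activity("Ran 5.2 miles in 45 min this morning #running"))
# {'distance': (5.2, 'miles'), 'duration': (45, 'min')}
```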
Here are four tools for tracking hashtags:
Once you have a set of potentially relevant hashtags, use the API
to retrieve all the tweets that mention those hashtags.
You will likely retrieve millions of tweets and there will be significant amount
of noise in the data (e.g., people posting unrelated activities or
making jokes, etc.).
Note also that tweets can be retweeted and repeated, and users may use multiple identities. You should clean up the collected dataset as best you can.
Your goal is to identify the fraction of Twitterers who take this reporting of their physical activities seriously, download all relevant tweets from their feeds (starting with their very first tweet), and then store them in a local database for analysis.
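A minimal cleanup sketch that drops retweets and exact repeats; the tweet records are simplified dicts, and a real pipeline would do much more (near-duplicate detection, identity resolution, etc.).

```python
# Remove simple retweets and duplicate (user, text) pairs from a batch.
def clean_tweets(tweets):
    seen = set()
    cleaned = []
    for t in tweets:
        text = t["text"].strip()
        if text.lower().startswith("rt @"):   # drop simple retweets
            continue
        key = (t["user"], text.lower())       # same user repeating itself
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(t)
    return cleaned

batch = [
    {"user": "alice", "text": "Walked 10000 steps today #fitbit"},
    {"user": "alice", "text": "Walked 10000 steps today #fitbit"},
    {"user": "bob",   "text": "RT @alice: Walked 10000 steps today #fitbit"},
]
# clean_tweets(batch) keeps only alice's first tweet.
```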
The selection of hashtags is critical, because querying with different hashtags
will retrieve different sets of tweets. Different input data will, in turn, lead to
different analysis and modeling results (Section 2.3).
In the end, our users may reach different or wrong conclusions based on
different or inadequate sets of hashtags.
For examples of how different sets of hashtags lead to different results, see the example student projects in Section 4; particularly compare their heat map visualizations.
In February 2014, Twitter announced a so-called “Data Grants” program
that offers some research teams access to its entire historic database.
Unfortunately, applications are currently closed.
See an article in Washington Post, October 26, 2014: Twitter grants select researchers access to its public database
The next step is to analyze the locally downloaded tweets and extract data
that will serve as input to our algorithms to answer the questions listed at
the start of this section. Data analysis and modeling must be based on inspecting
the contents of actual posts. If the input data required by a model is not
present in actual tweets, then the model and the algorithm will not be
able to draw reliable conclusions.
Therefore, before developing our algorithms we must ensure that the required input data can actually be obtained and extracted.
The simplest approach is to count the number of Twitter users and the number of
tweets per user. Unfortunately, this approach will suffer from noise,
as discussed in Section 2.2.
Even after pre-processing for noise removal, not all messages may be counted with equal weight—we may wish to assign different weights to different messages. For example, a person may keep talking about planning to exercise, versus another person reporting on actual exercise.
Ideally, we should map the user’s reported activity to a single number representing the amount of exercise. An often-used proxy for activity measurement is calories burned. For example, the system would approximately translate miles ran or steps walked into calories burned (based on some health or activity research data). Most activity-tracking devices provide information about calories burned, and some users may tweet this number. The issue is what fraction of tweets will be specific about the amount of exercise, and whether to count at all the tweets that do not contain specific information. This is an important tradeoff: work with more less-relevant tweets or with fewer more-relevant tweets.
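A sketch of such a translation; the conversion factors (roughly 100 kcal per mile run and 0.04 kcal per step) are ballpark assumptions for illustration only, not research-grade values.

```python
# Map reported activity to a single calories-burned number, as a common
# denominator across activity types.
# The conversion factors below are rough illustrative assumptions.
KCAL_PER_MILE_RUN = 100.0
KCAL_PER_STEP = 0.04

def estimate_calories(miles_run=0.0, steps_walked=0):
    return miles_run * KCAL_PER_MILE_RUN + steps_walked * KCAL_PER_STEP

print(estimate_calories(miles_run=3.0))       # 300.0
print(estimate_calories(steps_walked=10000))  # roughly 400
```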
It is not feasible for people to analyze manually millions of tweets. Therefore, we need an automatic or semi-automatic method to analyze the content of tweets. One approach is to use supervised learning or semi-supervised learning, where a researcher presents a set of examples and expects the system to learn rules and apply them to the whole dataset.
We may assign weights based on keywords only, or pairs of keywords, or n-grams of words.
(Note that the idea of simultaneously using multiple hashtags for data retrieval,
mentioned in Section 2.2, is along the line of using n-grams.)
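A sketch of keyword-based weighting, extended with word pairs (bigrams); the weight table is an invented illustration of the idea, not a trained model.

```python
# Assign a weight to each tweet from single keywords and word bigrams.
# A bigram like ("calories", "burned") signals an actual activity report
# more strongly than either word alone. Weights here are invented examples;
# a supervised learner would estimate them from labeled tweets.
WEIGHTS = {
    ("exercise",): 0.2,              # talking about exercise in general
    ("ran",): 0.5,
    ("calories", "burned"): 1.0,     # concrete report of actual exercise
    ("planning", "to"): -0.3,        # intent, not action
}

def tweet_weight(text):
    words = text.lower().split()
    grams = [(w,) for w in words] + list(zip(words, words[1:]))
    return sum(WEIGHTS.get(g, 0.0) for g in grams)

print(tweet_weight("ran 5 miles 400 calories burned"))  # 0.5 + 1.0 = 1.5
print(tweet_weight("planning to exercise tomorrow"))    # negative-leaning
```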
Ideally, we would have a natural-language
analyzer to analyze the messages and determine more accurately their
meaning. Based on the meaning information, our system would assign different
weights to different messages.
A useful tool for keyword analysis is “tag cloud” visualization, which visualizes frequently used keywords. The developer may use it when selecting the keywords to use for input data collection.
For example, we may track different types of exercise, such as “running”, “walking”, “biking”, “hiking”, and “swimming”. However, suppose that a quick tag-cloud visualization reveals that very few people tweet about “hiking” or “swimming”. Then we may decide that for simplicity the initial version of the system will omit tracking such posts.
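The raw counts behind such a quick check can be sketched with a counter; the sample tweets are invented.

```python
from collections import Counter

# Count how often each candidate activity keyword appears in collected
# tweets, to decide which activity types are worth tracking.
ACTIVITIES = ["running", "walking", "biking", "hiking", "swimming"]

def activity_counts(tweets):
    counts = Counter()
    for text in tweets:
        words = set(text.lower().split())
        for act in ACTIVITIES:
            if act in words:
                counts[act] += 1
    return counts

tweets = ["running felt great", "walking the dog", "running again",
          "biking to work", "walking home"]
print(activity_counts(tweets).most_common())
# "hiking" and "swimming" never appear, so we might drop them initially.
```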
The next step is to determine where each Twitter user lives.
If such information is not available in the user’s profile, try inferring it from the user’s tweets, e.g., search geo-tagging information in tweets to identify the user’s location.
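A sketch of that fallback logic over simplified records; the field names imitate, but are not guaranteed to match, the real Twitter API objects.

```python
# Infer a user's location: prefer the profile's declared location;
# otherwise fall back to the first geo-tagged tweet.
# Record fields are simplified stand-ins for real Twitter API objects.
def infer_location(profile, tweets):
    if profile.get("location"):
        return ("profile", profile["location"])
    for t in tweets:
        coords = t.get("coordinates")
        if coords:
            return ("geotag", tuple(coords))
    return ("unknown", None)

profile = {"name": "alice", "location": ""}
tweets = [{"text": "morning run", "coordinates": None},
          {"text": "5k done!", "coordinates": [40.5, -74.45]}]
print(infer_location(profile, tweets))  # ('geotag', (40.5, -74.45))
```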
The outcome of data analysis will be weighted counts of posts and users
from different geographic areas.
As shown in the diagram in Figure 2, we need to account for the total population in each area, versus the fraction that tweets, and of this fraction a subset that both tweets and regularly exercises.
We need to keep in mind that we are making inferences and drawing conclusions based on a subset of all exercising people in an area. Ideally, we should perform additional analysis to determine how representative is the subset of Twitter users that we selected of the entire population of interest.
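To make the chain of fractions concrete, here is a minimal sketch; the bias-correction factor and all numbers are invented assumptions for illustration.

```python
# Relate the weighted counts to area population, as in Figure 2:
# total population -> Twitter users -> users who tweet about exercising.
# All numbers below are invented for illustration.
def estimate_area(population, twitter_users, tweeting_exercisers,
                  twitter_bias=1.0):
    """twitter_bias corrects for non-representativeness: >1 if exercisers
    are over-represented among Twitter users, <1 if under-represented."""
    frac = (tweeting_exercisers / twitter_users) / twitter_bias
    return frac, round(frac * population)

# 12% of sampled Twitter users in the area tweet about exercise; assume
# exercisers are 1.5x over-represented on Twitter.
frac, est_count = estimate_area(population=100_000, twitter_users=20_000,
                                tweeting_exercisers=2_400, twitter_bias=1.5)
print(round(frac, 3), est_count)  # 0.08 8000
```

Estimating `twitter_bias` is exactly the representativeness analysis discussed above; without it, the default of 1.0 silently assumes Twitter users mirror the population.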
It is probably not feasible to collect all tweets for calculating the population fractions. A better approach is to use some kind of statistical sampling, such as the Monte Carlo method. To find the fraction of Twitterers from a given community who tweet about exercise, we collect N tweets of any kind (about any subject) that originated from Twitterers in this community. To test whether we obtained a sufficiently representative sample, we follow this procedure:
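The core idea of the test, drawing progressively larger random samples and checking that the estimate stabilizes, can be sketched as follows. The synthetic community below has a known 20% exercise-tweet rate, so we can see what the estimates should converge to.

```python
import random

random.seed(0)

# Synthetic community with a known ground truth: exactly 20% of its
# tweets mention exercise. Real data would come from Section 2.2.
tweets = ["exercise"] * 2000 + ["other"] * 8000

def estimate(n):
    """Fraction of exercise tweets in a random sample of size n."""
    sample = random.sample(tweets, n)
    return sum(t == "exercise" for t in sample) / n

# As the sample grows, the estimates should stabilize near 0.20;
# if successive estimates keep fluctuating, the sample is too small.
for n in (100, 400, 1600, 6400):
    print(n, estimate(n))
```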
Other interesting information that we may consider extracting from the tweets includes:
We need to consider how to display the “findings” from our Twitter data analytics. One option is a public (i.e., shared, common) display or billboard for the whole community, in the hope that it motivates the community to act. Another option is to have registered users who log into our system for information access.
Login implies privacy, privileges, and personalization. If our only input is based on Twitter message analysis, then it is not clear what kind of privacy would be needed if all of our input data are already public (posted on Twitter). Privileges may include providing information only to registered users, for some sort of competitive advantage. But that defeats the purpose of community building and also fails to motivate those who are not motivated to exercise or generally participate in the system. In addition, personalization requires that our system knows a lot about individual users. It is not clear that such information can be gleaned from Twitter posts only. Are we going to require our registered users to enter additional information? What would that information be? How to motivate users to do so? What benefit would they accrue if they provide additional information? Finally, how to connect the analysis of general Twitter population with our registered users, who may not even be tweeting themselves?
Note that our key issue is about the information displayed, not necessarily
the display device itself. Display design and information visualization should be
minimalistic, to focus on providing information, instead of flashy effects.
We mentioned the “tag cloud” visualization in Section 2.3. If our system is to be used for market research, a clear table representation of tag ranking would serve the purpose better than a flashy tag cloud.
On the other hand, a tag-cloud visualization may be useful for the developer during the selection of keywords and development of algorithms to better understand the hashtags used in tweets and their relationships.
The display device can be:
We will consider two types of information displays:
An interesting problem is that there may be many users who are already active but do not tweet and our system will not know about them. How to reach such users and persuade them to tweet about their activities?
The plan of work is as follows:
Developed in the Fall 2013 semester by Gradeigh D. Clark, Xianyi Gao, Rui Xu, Li Xu, Yihan Qian, and Xiaoyu Yu
Project report #3 (final), group
#1, Fall 2013
[PDF document; size: approx 8.6 MBytes]
Project files, group #1, Fall 2013.
[ZIP file; size: approx 18 MBytes]
Developed in the Fall 2013 semester by Gang Yang, Yu Ji, Siyu He, Haoyang Yu, Chen Liang, and Xueting Liao
Project report #3 (final), group
#2, Fall 2013
[PDF document; size: approx 13.4 MBytes]
Project files, group #2, Fall 2013.
[ZIP file; size: approx 97 MBytes]
Developed in the Fall 2013 semester by Sen Yang, Luyang Liu, Xiang Chi, Yi Dong, and Qi Shen
Project report #3 (final), group
#3, Fall 2013
[PDF document; size: approx 3 MBytes]
Project files, group #3, Fall 2013.
[ZIP file; size: approx 330 MBytes]
• General information about data analytics, business analytics and business intelligence.
• Sunny Consolvo, Predrag Klasnja, David W. McDonald and James A. Landay, Designing for Healthy Lifestyles: Design Considerations for Mobile Technologies to Encourage Consumer Health and Wellness, in Foundations and Trends in Human–Computer Interaction, vol.6, nos.3-4, pp.167-315, 2014.
• Matthew A. Russell,
Mining the Social Web
(2nd Edition), O'Reilly Media, October 2013.
Describes how to acquire, analyze, and summarize data from all corners of the social web, including Facebook, Twitter, LinkedIn, Google+, GitHub, email, websites, and blogs. Describes how to employ the Natural Language Toolkit (NLTK) and NetworkX, and apply advanced text-mining techniques, such as clustering and TF-IDF, to extract meaning from human language data.
• Twitter data processing:
a preprint of a new book Twitter Data
Analytics by Shamanth Kumar, Fred Morstatter, and Huan Liu,
is now available for free download. This book familiarizes the
novice reader with the concepts of Twitter data collection, management, and
analysis. Visit the URL below to download a free copy of the preprint and
relevant code examples along with sample data:
• Kenneth M. Anderson and Aaron Schram, “Design and Implementation of a Data Analytics Infrastructure in Support of Crisis Informatics Research”, Proceedings of the 33rd International Conference on Software Engineering (ICSE 2011), Honolulu, Hawaii, May 2011.
• Eric P.S. Baumer, Sherri Jean Katz, Jill E. Freeman, Phil Adams, Amy L. Gonzales, J.P. Pollak, Daniela Retelny, Jeff Niederdeppe, Christine M. Olson, and Geri K. Gay, “Prescriptive Persuasion and Open-Ended Social Awareness: Expanding the Design Space of Mobile Health”, Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work (CSCW '12), pages 475-484, 2012.
• Andrew Macvean and Judy Robertson, “iFitQuest: A school based study of a mobile location-aware exergame for adolescents”, Proceedings of the 14th International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI '12), pages 359-368, 2012.
• A great deal of useful information and relevant papers is available on the site by Frank Bentley, Yahoo: “Health Mashups”.
Last Modified: Tue Sep 9 21:07:23 EDT 2014 Maintained by: Ivan Marsic