Mining the Social Media data stream

Post by: Kim Stephens

Craig Fugate announced in a speech over a year ago at the American Red Cross Social Media Summit: “Social Media are Data.” The crowd, all fairly well versed in the subject, cheered. What they understood was the importance of the information in the aggregate. As the Economist described last week: “Most tweets are inane, but a million may contain valuable information.” A blog post entitled “What’s in a Tweet” describes why: each and every tweet can carry the geo-location or place associated with it, the date and time it was posted, who wrote it, when that account was created, the author’s number of followers, the author’s biography, etc. (The “tweet map” pictured above can be found here, originally posted by Raffi Krikorian of Twitter.)
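To make the “tweets are data” point concrete, here is a minimal Python sketch that pulls those fields out of a single tweet. The sample record is invented for illustration, and the field names (created_at, user, coordinates, place) assume the JSON layout of Twitter's classic REST API; other API versions may name them differently.

```python
import json

# Hypothetical sample tweet; every value here is made up for illustration.
SAMPLE_TWEET = json.dumps({
    "created_at": "Wed Oct 26 14:03:11 +0000 2011",
    "text": "Road closed at Main St due to flooding",
    "coordinates": {"type": "Point", "coordinates": [-77.03, 38.89]},
    "place": {"full_name": "Washington, DC"},
    "user": {
        "screen_name": "example_user",
        "created_at": "Mon Mar 02 09:15:00 +0000 2009",
        "followers_count": 1234,
        "description": "Volunteer firefighter",
    },
})


def extract_metadata(raw_tweet):
    """Return the descriptive fields of one tweet as a flat dict.

    Missing fields come back as None, since most tweets carry
    no coordinates or place at all.
    """
    tweet = json.loads(raw_tweet)
    user = tweet.get("user") or {}
    coords = tweet.get("coordinates") or {}
    place = tweet.get("place") or {}
    point = coords.get("coordinates")  # assumed order: [longitude, latitude]
    return {
        "posted_at": tweet.get("created_at"),
        "author": user.get("screen_name"),
        "account_created": user.get("created_at"),
        "followers": user.get("followers_count"),
        "bio": user.get("description"),
        "lon_lat": tuple(point) if point else None,
        "place": place.get("full_name"),
    }


if __name__ == "__main__":
    for key, value in extract_metadata(SAMPLE_TWEET).items():
        print(f"{key}: {value}")
```

Run over a million tweets instead of one, the same handful of fields is what turns a stream of chatter into something that can be counted, mapped, and sorted.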

However, almost 18 months after Craig’s speech, quite a few local emergency managers still don’t fully understand the “social media as data” point. Most see the medium simply as a way to push information. Even those who would like to take advantage of the data find that monitoring social networks, extracting pertinent content, and understanding trends “by hand” is very difficult, and potentially impossible after a crisis, when the stream of information turns into a torrent. (See this related article in the recent Journal of Homeland Security and Emergency Management: “Improved Situational Awareness in Emergency Management through Automated Data Analysis and Modeling.”)

Businesses (and the intelligence agencies) have turned to computer processing to monitor these social streams. Mashable reported in an article last year that companies are increasingly employing data mining services in order to better understand their customers: everything that people post in public forums (Facebook, Twitter, blogs, etc.) is “fair game”, which raises a few eyebrows regarding privacy concerns. Because the information about people’s buying habits, personal likes and dislikes, product sentiment, and even mood is so valuable, many companies have sprung up to provide not only data aggregation but also predictive modeling. These services are based on Natural Language Processing (NLP) and mathematical algorithms. But this data is useful for more than understanding who likes Honda versus Ford. See this post by Patrick Meier of Ushahidi about a company called “Recorded Future”. He provides a good discussion of how they are using NLP, predictive modeling and event-data extraction to try to predict social disruptions, including protests such as the “Arab Spring”.

One new company that provides data aggregation and real-time filtering is DataSift, which just recently teamed up with another “real-time trend analysis” company, TrendSpottr. DataSift is interesting because it is “only one of two licensed re-syndicators of Twitter data globally” (i.e. a “reseller” of that data). The other licensed syndicator is Gnip (based out of Boulder, Colorado).

These companies provide data filtering as well as augmentations which help provide context. This includes data enrichments that help decipher social authority, trends, social identification, and links (an analysis of feeds that have URLs included).
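As a rough illustration of what “filtering” and “enrichment” mean here, the Python sketch below keeps only posts that mention certain keywords and attaches the links found in each one. It is a toy example under stated assumptions, not DataSift's or Gnip's actual API: the function names, the post format (a dict with a "text" key), and the sample posts are all hypothetical.

```python
import re

# Simple pattern for web links; good enough for a sketch, not for production.
URL_PATTERN = re.compile(r"https?://\S+")


def matches_keywords(text, keywords):
    """True if any keyword appears in the text, ignoring case."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def enrich(post):
    """Return a copy of the post with a 'links' field added (the enrichment)."""
    enriched = dict(post)
    enriched["links"] = URL_PATTERN.findall(post["text"])
    return enriched


def filter_stream(posts, keywords):
    """Yield enriched copies of only those posts that match the keywords."""
    for post in posts:
        if matches_keywords(post["text"], keywords):
            yield enrich(post)


# Hypothetical posts standing in for a live feed.
SAMPLE_POSTS = [
    {"author": "user_a", "text": "Flooding on Main St, details at http://example.org/closures"},
    {"author": "user_b", "text": "Great lunch today"},
    {"author": "user_c", "text": "Shelter open at the high school for flood evacuees"},
]

if __name__ == "__main__":
    for hit in filter_stream(SAMPLE_POSTS, ["flood", "shelter"]):
        print(hit["author"], hit["links"])
```

A commercial service does the same kind of work at far greater scale and with richer signals (authority scores, trends, identity), but the basic shape is the same: narrow the stream first, then add context to what remains.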

As an example of these enrichments, Klout is an important stand-alone service that identifies social authority, which comes in handy when you need to understand who is the most influential person (at least online) in a crowd. Although the service is billed as a way to know which influential customers to target, I see it as an important tool in a crisis, such as a large protest in your city. People are given a score from 1 to 100 which “measures influence based on ability to drive action”. Klout factors in how many people you reach based on how many followers you have on multiple social networks, whether or not those followers repeat your messages (amplification), and what kind of impact you have on “your” network. Although their formula is not perfect, it is interesting to note that in the world of Social Media and Emergency Management, two of the people I follow and trust, TheFireTracker2 and Cheryl Bledsoe, both have high scores for influence. Similarly, Andy Carvin of NPR, who has over 20,000 followers, has a huge score of 80. Anyone can use this service for free without going through DataSift.

How could a local EM use Klout? What if you are trying to reach a segment of your population with targeted information regarding disaster mitigation (e.g. a minority population)? You could use this type of service to find the people that others in that community listen to, and it might not be who you assume. As another example, maybe the students at your university are planning to join the Occupy Wall Street movement and stage a protest. This type of service could help you drill down to find the organizers. Although you won’t be able to stop their activity, you’ll be able to understand their plans and potentially the expected turnout, etc.

What does all of this mean for the emergency management community? Computer analytical tools are coming to help us process this huge “firehose” of social networking data. Whether or not we’ll have the funds to purchase those services, and which organization will be responsible (Fusion Centers, State OEMs), is a whole other story.

The video below describes, in very basic terms, the premise behind the DataSift product.