Data Collection & Analytics Tools?

I have become fascinated recently with the question of the role data plays in supporting analysis, action, and reflection. Actually, it would be more accurate to say that I’ve become aware recently that this is an intrinsic driver in much of the work I do, and thus it has become something I want to reflect on more directly. In this post, I want to explore some of the tools others have already built that might support analytics, machine learning, and so on. If you know of something I’ve missed, feel free to share it in the comments! So, in no particular order:

  • Hazy provides a small handful of key primitives for data analysis. These include Victor, which “uses RDBMS to solve a large class of statistical data analysis problems (supervised machine learning using incremental gradient algorithms),” and WisCi (and DeepDive, its successor), which is “an encyclopedia powered by machines, for the people.” RapidMiner is a similar tool that has been used by thousands; it is open source and supports data analysis and mining.
  • Protégé is “a suite of tools to construct domain models and knowledge-based applications with ontologies,” including support for visualizing and manipulating them.
  • NELL learns over time from the web. It has been running since 2010 and has “accumulated over 50 million candidate beliefs.”
  • Ohmage and Ushahidi are open source citizen-sensing platforms (think citizen-based data collection). Both support mobile and web-based data entry. This stands in contrast to things like Mechanical Turk, which is a for-pay service, and to games and other dual-impact systems such as PeekaBoom (von Ahn et al.), which can label objects in an image using crowd labor, or Kylin (Hoffmann et al.), which simultaneously accelerates community content creation and information extraction.
  • WEKA and LightSide support GUI-based machine learning (WEKA requires some expertise and comes with a whole textbook, while LightSide is built on WEKA but simplifies aspects of it and specializes in mining textual data). For more text-mining support, check out Coh-Metrix, which “calculates the coherence of texts on a wide range of measures. It replaces common readability formulas by applying the latest in computational linguistics and linking this to the latest research in psycholinguistics.” Similarly, LIWC supports linguistic analysis (not free) by providing a dictionary and a way to compare new text against that dictionary, measuring the presence of around 70 language dimensions ranging from negative emotions to causal words (a minimal sketch of this dictionary-counting approach follows this list).
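At its core, this kind of analysis is dictionary matching: count what fraction of a text’s words fall into each category’s word list. Here is a minimal Python sketch of that idea; the two tiny word lists and function name are my own illustrative stand-ins for LIWC’s roughly 70 validated categories, not LIWC’s actual code or API.

```python
# A toy, LIWC-style dictionary analysis: score a text by the fraction of
# its words that fall into each category. The categories below are
# illustrative placeholders, not LIWC's real (validated) dictionaries.
CATEGORIES = {
    "negative_emotion": {"sad", "hate", "hurt", "afraid", "worried"},
    "causal": {"because", "cause", "hence", "therefore", "thus"},
}

def dictionary_scores(text):
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    return {
        category: sum(w in vocab for w in words) / len(words)
        for category, vocab in CATEGORIES.items()
    }

print(dictionary_scores("I was sad and worried because the trail vanished."))
# -> {'negative_emotion': 0.222..., 'causal': 0.111...}
```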

Deployed tools and products aside, there is also a bunch of research in this area, ranging from early work such as a CAPpella (Dey et al.) and ScreenCrayons (Olsen et al.) to more recent systems: Gestalt (Patel et al.) “allows developers to implement a classification pipeline,” and Kietz et al. use an analysis of RapidMiner’s many data analysis traces to automatically predict optimal KDD workflows.

von Ahn, L., Liu, R., & Blum, M. (2006). Peekaboom: A game for locating objects in images. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2006). ACM.

Hoffmann, R., Amershi, S., Patel, K., Wu, F., Fogarty, J., & Weld, D. S. (2009, April). Amplifying community content creation with mixed initiative information extraction. In Proceedings of the 27th international conference on Human factors in computing systems (pp. 1849-1858). ACM.

Dey, A. K., Hamid, R., Beckmann, C., Li, I., & Hsu, D. (2004, April). a CAPpella: programming by demonstration of context-aware applications. In Proceedings of the SIGCHI conference on Human factors in computing systems (pp. 33-40). ACM.

Olsen, D. R., Jr., Taufer, T., & Fails, J. A. (2004). ScreenCrayons: Annotating anything. In Proceedings of the 17th Annual ACM Symposium on User Interface Software and Technology (UIST 2004). ACM.

Patel, K., Bancroft, N., Drucker, S. M., Fogarty, J., Ko, A. J., & Landay, J. A. (2010). Gestalt: Integrated support for implementation and analysis in machine learning. In Proceedings of UIST 2010 (pp. 37-46). ACM.

Kietz et al. (2012). Designing KDD-Workflows via HTN-Planning. doi:10.3233/978-1-61499-098-7-1011

Search and Rescue and Probability Theory

[Photo: a man and a dog belaying down a rock face together. Canine Search and Rescue (photo from AMRG website)]

I spent a fascinating evening with the Allegheny Mountain Rescue Group today. This is a well-run organization that provides free help for search and rescue efforts in the Pittsburgh area and beyond. I was in attendance because my kids and I were looking for a way to give Gryffin (our new puppy) a job in life beyond “pet,” and we love to work in the outdoors. Canine search and rescue sounded like a fascinating way to do this, and we wanted to learn more. During the meeting, I discovered a team of well-organized, highly trained, passionate, and committed individuals, with a healthy influx of new people interested in taking part and a strong core of experienced people who help to run things. The discussions of recent rescues were at times heart-rending, and very inspiring.

Later in the evening during a rope training session I started asking questions and soon learned much more about how a search operates. I discovered that about a third of searches end in mystery. Of those for which the outcome is known, there is about an even split between finding people who are injured, fine, or have died. Searches often involve multiple organizations simultaneously, and it is actually preferable to create teams that mix people from different search organizations rather than having a team that always works together. Some searches may involve volunteers as well. A large search may have as many as 500 volunteers, and if the target of the search may still be alive, it may go day and night. Searches can last for days. And this is what led me to one of the most unexpected facts of the evening.

I asked: How do you know when a search is over? The answer I got was that a combination of statistics and modeling is used to decide this in a fairly sophisticated fashion. A search is broken up into multiple segments, and each segment is assigned a probability that the lost person is in it. When a segment is searched, the type of search (human only, canine, helicopter, etc.) and the locations searched, along with a field report containing details that might not otherwise be available, are used to update the probability that the person is in that segment (but was missed) or absent from it. Finally, these probabilities are combined, using a spreadsheet (or several), to support decision making about whether (and how) to proceed. According to the people I was speaking with, a lot of this work is done by hand because it is faster than entering data into, and dealing with, more sophisticated GIS systems (though typically a computer is available at the search’s base, which may be something like a trailer with a generator). GPS units may also be used to help searchers with location information and/or to track dogs.
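This is essentially Bayesian updating, as used in formal search theory. Here is a minimal Python sketch of how one unsuccessful sweep updates the segment probabilities; the probability-of-detection numbers and function names are my assumptions for illustration, not the actual tool AMRG uses.

```python
# Sketch of the per-segment Bayesian update described above (my
# reconstruction, not AMRG's actual spreadsheet). Each segment starts
# with a prior probability of containing the subject; an unsuccessful
# sweep with a given probability of detection (POD) shifts mass elsewhere.

def update_after_unsuccessful_sweep(priors, searched, pod):
    """priors: dict of segment -> P(subject in segment).
    pod: P(detected | subject present in the searched segment)."""
    unnormalized = {
        seg: p * (1 - pod) if seg == searched else p
        for seg, p in priors.items()
    }
    total = sum(unnormalized.values())
    return {seg: p / total for seg, p in unnormalized.items()}

segments = {"A": 0.5, "B": 0.3, "C": 0.2}
# A canine team sweeps segment A; say it would detect a subject who is
# really there 70% of the time, and it finds nothing:
segments = update_after_unsuccessful_sweep(segments, "A", 0.7)
print(segments)  # -> {'A': 0.23..., 'B': 0.46..., 'C': 0.30...}
# Repeated unsuccessful sweeps drive P(A) down, which is exactly the
# kind of evidence used to decide where to search next, or when to stop.
```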

Some of the challenges mentioned were conflicting information, variability in how reliable different human searchers are, terrain that is not flat or easily represented in two dimensions, the speed of computer modeling, the difficulty of producing exact estimates of how different searchers affect the probability of finding someone, and the variable skill levels of searchers (along with the need to organize large numbers of them, at times partly untrained). When I raised the possibility of finding technology donations such as more GPS units, I was also told that it is critical that any technology, especially technology intended for use in the field, be ultra-simple to use (there is no time to mess with it) and consistent (i.e., searchers can all be trained once on the same thing).

Although this blog post summarizes what was really just a brief (maybe hour-long) conversation with two people, the conversation had me thinking about research opportunities. The need for good human-centered design is clear here, as is the value of being able to provide technology that can support probabilistic analysis and decision making. Although it sounds like they are not in use currently, predictive models could be applicable, and apparently a fair amount of data is gathered about each search (and searches are relatively frequent). Certainly visualization opportunities exist as well. Indeed, a recent VAST publication (Malik et al., 2011) looked specifically at visual analytics and its role in maritime resource allocation (across multiple search and rescue operations).

But the thing that especially caught my attention is the need to handle uncertain information in the face of both ignorance and conflict. I have been reading recently about Dempster-Shafer theory, which is useful when fusing multiple sources of data that may not be easily modeled with standard probabilities. Dempster-Shafer theory assigns a probability mass to each piece of evidence and is able to explicitly model ignorance. It is best interpreted as producing information about the provability of a hypothesis, which means that at times it may assign high certainty to something that is unlikely (but more provable than the alternatives). For example, suppose two people disagree about a diagnosis: one suspects migraines, the other a concussion, but each assigns a small, low-confidence mass to a brain tumor, which both consider highly improbable. That small overlap is the only point of agreement, so when their belief models are combined, Dempster-Shafer theory treats the brain tumor, although both doctors agree it is unlikely, as the most certain outcome.
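To make that concrete, here is a small Python sketch of Dempster’s rule of combination applied to the two-doctor example (a version of Zadeh’s classic counterexample). The mass values are illustrative assumptions; the combination rule itself is standard.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset of hypotheses -> mass)
    using Dempster's rule; conflicting mass is discarded and the
    remainder renormalized."""
    combined, conflict = {}, 0.0
    for (b, mass_b), (c, mass_c) in product(m1.items(), m2.items()):
        intersection = b & c
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_b * mass_c
        else:
            conflict += mass_b * mass_c  # evidence on incompatible hypotheses
    return {h: m / (1.0 - conflict) for h, m in combined.items()}

# Each doctor is nearly certain of a different diagnosis, with only a
# sliver of mass on the shared (and, to both, unlikely) tumor hypothesis.
doctor1 = {frozenset({"migraine"}): 0.99, frozenset({"tumor"}): 0.01}
doctor2 = {frozenset({"concussion"}): 0.99, frozenset({"tumor"}): 0.01}
print(dempster_combine(doctor1, doctor2))
# -> {frozenset({'tumor'}): 1.0}: the improbable point of agreement wins,
#    because it is the only hypothesis neither source rules out.
```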

One obvious next step would be to do some qualitative research and get a better picture of what really goes on in a search and rescue operation. Another would be to collect a data set from one or more willing organizations (assuming the data exists) and explore algorithms that could aid decision making or make predictions using the existing data. Or then again, one could start by collecting GPS devices (I am sure some of you out there must have some sitting in a box that you could donate) and explore whether there are enough similar ones (android + google maps?) to meet the constraint of easy use and trainability. I don’t know yet whether I will pick up this challenge, but I certainly hope I have the opportunity to. It is a fascinating, meaningful, and technically interesting opportunity.

Malik, A., Maciejewski, R., Maule, B., & Ebert, D. S. (2011). A visual analytics process for maritime resource allocation and risk assessment. In Proceedings of the 2011 IEEE Conference on Visual Analytics Science and Technology (VAST) (pp. 221-230).