I have become fascinated recently with the question of the role that data has in supporting analysis, action, and reflection. Actually, it would be more accurate to say that I’ve become aware recently that this is an intrinsic driver in much of the work I do, and thus it has become something I want to reflect on more directly. In this post, I want to explore some of the tools others have already built that might support analytics, machine learning, and so on. If you know of something I’ve missed, feel free to share it in the comments! So, in no particular order:

  • Hazy  provides a small handful of key primitives for data analysis. These include Victor, which “uses RDBMS to solve a large class of statistical data analysis problems (supervised machine learning using incremental gradiant algorithms) and WisCi (/ DeepDive, it’s successor), which is “an encylopedia powered by machines, for the people. ” RapidMiner is a similar tool that has been used by thousands. It is open source and supports data analysis and mining.
  • Protégé is “a suite of tools to construct domain models and knowledge-based applications with ontologies” including visualization and manipulation
  • NELL learns over time from the web. It has been running since 2010 and has “accumulated over 50 million candidate beliefs.”  A similar system is
  • Ohmage and Ushahidi are open source citizen sensing platforms (think citizen based data collection). Both support mobile and web based data entry. This stands in contrast to things like Mechanical Turk which is a for-pay service, and games and other dual-impact systems like PeekaBoom (von Ahn et al.) which can label objects in an image using crowd labor, or systems like Kylin (Hoffmann et al.) which simultaneously accelerates community content creation and information extraction.
  • WEKA and LightSide support GUI based machine learning (WEKA requires some expertise and comes with a whole textbook, while LightSide is built on WEKA but simplifies aspects of it, and specializes in mining textual data). For more text mining support, check out Coh-Metrix, which “calculates the coherence of texts on a wide range of measures. It replaces common readability formulas by applying the latest in computational linguistics and linking this to the latest research in psycholinguistics.” Similarly, LIWC, which supports linguistic analysis (not free) by providing a dictionary and a way to compare to that dictionary to analyze the presence of 70 language dimensions in a new text from negative emotions to casual words.

Deployed tools research and products aside, there is also a bunch of research in this area, ranging from early work such as aCappela (Dey et al.), Screen Crayons (Olsen, et al.). More recently, Gestalt (Patel et al.“allows developers to implement a classification pipeline” and Kietz et al. use an analysis of RapidMiner’s many data analysis traces to automatically predict optimal KDD-Workflows.

Luis von Ahn, Ruoran Liu and Manuel Blum Peekaboom: A Game for Locating Objects in Images In ACM CHI 2006

Hoffmann, R., Amershi, S., Patel, K., Wu, F., Fogarty, J., & Weld, D. S. (2009, April). Amplifying community content creation with mixed initiative information extraction. In Proceedings of the 27th international conference on Human factors in computing systems (pp. 1849-1858). ACM.

Dey, A. K., Hamid, R., Beckmann, C., Li, I., & Hsu, D. (2004, April). a CAPpella: programming by demonstration of context-aware applications. InProceedings of the SIGCHI conference on Human factors in computing systems (pp. 33-40). ACM.

Olsen Jr, Dan R., Trent Taufer, and Jerry Alan Fails. “ScreenCrayons: annotating anything.” Proceedings of the 17th annual ACM symposium on User interface software and technology. ACM, 2004.

Kayur Patel, Naomi Bancroft, Steven M. Drucker, James Fogarty, Andrew J. Ko, James A. Landay: Gestalt: integrated support for implementation and analysis in machine learning. UIST 2010: 37-46

Kietz et al. (2012). Designing KDD-Workflows via HTN-Planning, 1–2. doi:10.3233/978-1-61499-098-7-1011