The Data Pipeline

Visualization of accounts that could be associated with ISIS

I’ve been teaching a course on using data for three years now, and it feels ‘finished’ enough that it is perhaps worth writing about. When I began the course, my goals were to (1) explore the human side of data, meaning the ways in which what we know about people impacts our ability to make, process, interpret, and use data, (2) make the course accessible to a broad variety of people (not just programmers), and (3) organize the course around modules that would produce nice portfolio items. The result is the course at data.cmubi.org. While it has evolved over the years, it’s always included at least a few beginner programmers, and the projects have been interesting opportunities for students to explore issues like interactive machine learning, data visualization, and topic areas they care a great deal about.

A big emphasis in the course is on data cleaning: understanding deeply the flaws in your data, from bias in data collection to missing values in data files. Many (hopefully most) of the projects below have significant sections documenting their sources and their efforts and decision making around this topic.

Another big emphasis in the course is on understanding what the data will be used for, and by whom. Tied to this, we talk extensively about intelligibility in machine learning, the importance of narrative in visualization (and visualization in general), and the importance of defining the question you are answering.

Here are some of the highlights over the last three years:


Bus bunching is a phenomenon that can impact bus wait times. One of my 2016 students has been collecting data and extensively studying the phenomenon. His final project in the class drew on this data set and explored visual representations of the phenomenon.

Yelp data is always an area of interest. In 2014 … In 2015 students explored which state has the best pizza :). In 2016, the ‘Bon Yinzers’ developed a wonderful series of visualizations of factors that affect the popularity of Pittsburgh restaurants. They uncovered some interesting phenomena, such as the unexpectedly off-cycle check-in times of the most active Yelp users in Pittsburgh.

San Francisco Crime Alert explores the likelihood of different types of crime in different SF area neighborhoods. Their prediction algorithm gives you a way to explore the prevalence of major and minor crime in terms of time of year, time of day, and location.

In 2015, a group collected and analyzed a data set of tweets by potential ISIS supporters, with the goal of ultimately engaging others in helping to label such data and understand how ISIS supporter accounts differ from other accounts with sometimes similar tweets (e.g. news accounts or bloggers).

Often, the goal of students in the class is more about policy than about end users. In 2015, Healt$care explored the quality of healthcare and its relationship to dollars spent across the U.S. in a highly visual fashion.

In 2014, a group asked what jobs are popular in what parts of the US. Again, a combination of data visualization and prediction supports exploration of the question. A similar approach was explored by a 2014 group that collected data about movie piracy and its relationship to DVD release strategies.

Sadly, not all of the older projects still work (web standards change so fast!). I wish I could provide links to work such as the Reddit AMA visualization pictured here.

Screenshot of the Reddit AMA visualization

Learning languages

I’ve mentioned before that one of my sabbatical goals was to learn a new language (Hindi). I am not fluent, but I think I came a fair way with it, and I want to comment on the role of different technologies and approaches in our successes (and failures) as a family to learn the three languages that we tackled on this trip.

One of the most useful technologies we employed was the Rosetta Stone software. The kids loved Rosetta Stone, which we started using almost as soon as the sabbatical was approved to get them familiar with Hindi. They spent about 30 minutes at a time on it at the beginning. At our peak, this happened almost daily (after we left Pittsburgh but before we were settled in India). Eventually we hired a tutor (a wonderful friend now) to come for about an hour most days instead. The kids were far more resistant to being tutored than they were to using the software, but I feel we covered much more ground in those hours. We made up all sorts of games, retold fairy tales, played shop, and generally did our best to make it child-friendly.

Hindi was a relatively hard language to learn (new alphabet, different sentence structures, and so on). Once we got past the vocabulary phase, progress was slowish. Still, by the end of the fall we could have whole conversations in Hindi as a family. The kids were not alone in learning the language: Anind and I were trying very hard to learn it as well, and we tried to speak it at meals, with our Indian driver, and so on. So between the tutoring and the daily practice opportunities, they used Rosetta Stone less and less.

Rosetta Stone was not a pure success. It required the right context to be used: enough motivation, and not too much other support. We almost never used the German Rosetta Stone I bought, and of course the kids are far more fluent in German (they are immersed, unlike with Hindi, and it is a much easier language for them to learn). We rarely use Rosetta Stone at this point, and when we do, it is mostly me.

You get free tutoring through the online package with Rosetta Stone, along with access to online games. The games are a fun way to practice but slow. When possible, I sign up and have a session with an online tutor. It is based on the material I’m currently covering in the software. However, they only let the kids do it when there are no other remote participants, and such sessions are sometimes hard to find at popular times in the early stages of learning a language.

As a computer scientist, I cannot help but be impressed by the software. It is dedicated to learning language through immersion, and the authors have done an excellent job of maintaining that throughout the software and the tutoring sessions. It uses speech recognition to check pronunciation, and provides multimedia support for learning. And it works: if you put the time in with it, you learn. To my mind, it’s a success as an educational tool and an interactive tool. It supposedly has a social side as well, though as a Hindi learner I was one of few and could not take advantage of it. I’d be curious to see what it’s like.

It has always seemed such a shame to me that learning multiple languages is not a norm in the United States. During our travels we met 10-year-olds who spoke 5 or 6 languages, all with fluent ease. They never needed to touch a piece of software or a tutor. The world has become so small, yet so many of us in the United States fail to give our children the gift of understanding and the mastery of complexity that comes with learning multiple languages. Most of my Swiss cousins have raised children who are bi- or trilingual, and without the errors that plague even my immersed children. There is no substitute for that level of early exposure.