The Data Pipeline

I’ve been teaching a course on using data for three years now, and it feels ‘finished’ enough that it is perhaps worth writing about. When I began the course, my goals were to (1) explore the human side of data, meaning the ways in which what we know about people affects our ability to make, process, interpret, and use data, (2) make the course accessible to a broad variety of people (not just programmers), and (3) organize the course around modules that would produce nice portfolio items. The result is the course at data.cmubi.org. While it has evolved over the years, it has always included at least a few beginner programmers, and the projects have been interesting opportunities for students to explore issues like interactive machine learning, data visualization, and topic areas they care a great deal about.

A big emphasis in the course is on data cleaning: understanding deeply the flaws in your data, from bias in data collection to missing values in data files. Many (hopefully most) of the projects below have significant sections documenting their sources and their efforts and decision making around this topic.

Another big emphasis in the course is on understanding what the data will be used for, and by whom. Tied to this, we talk extensively about intelligibility in machine learning, the importance of narrative in visualization (and visualization in general), and the importance of defining the question you are answering.

Here are some of the highlights over the last three years:

Bus bunching is a phenomenon that can impact bus wait times. One of my 2016 students has been collecting data on the phenomenon and studying it extensively. His final project in the class drew on this data set and explored visual representations of bus bunching.

Yelp data is always an area of interest. In 2014 … In 2015, students explored which state has the best pizza :). In 2016, the ‘Bon Yinzers’ developed a wonderful series of visualizations of factors that affect the popularity of Pittsburgh restaurants. They uncovered some interesting phenomena, such as the unexpectedly off-cycle check-in times of the most active Yelp users in Pittsburgh.

San Francisco Crime Alert explores the likelihood of different types of crime in different SF-area neighborhoods. Its prediction algorithm gives you a way to explore the prevalence of major and minor crime in terms of time of year, time of day, and location.

In 2015, a group collected and analyzed a data set of tweets by potential ISIS supporters, with the goal of ultimately engaging others in helping to label such data and understand how ISIS supporter accounts differ from other accounts with sometimes similar tweets (e.g. news accounts or bloggers).

Often, students in the class are more interested in policy than in end users. In 2015, Healt$care explored the quality of healthcare and its relationship to dollars spent across the U.S. in a highly visual fashion.

In 2014, a group asked what jobs are popular in what parts of the US. Again, a combination of data visualization and prediction supports exploration of the question. A similar approach was taken by a 2014 group that collected data about movie piracy and its relationship to DVD release strategies.

Sadly, not all of the older projects still work (web standards change so fast!). I wish I could provide links to work such as the Reddit AMA visualization pictured here.

Make4all

Jennifer Mankoff | University of Washington