The Data Pipeline

Visualization of accounts that could be associated with ISIS

I’ve been teaching a course on using data for three years now, and it feels ‘finished’ enough that it is perhaps worth writing about. When I began the course, I had three goals: (1) explore the human side of data, meaning the ways in which what we know about people impacts our ability to make, process, interpret, and use data; (2) make the course accessible to a broad variety of people (not just programmers); and (3) organize the course around modules that would produce nice portfolio items. The result is the course at data.cmubi.org. While it has evolved over the years, it has always included at least a few beginner programmers, and the projects have been interesting opportunities for students to explore issues like interactive machine learning, data visualization, and topic areas they care a great deal about.

A big emphasis in the course is on data cleaning: understanding deeply the flaws in your data, from bias in data collection to missing values in data files. Many (hopefully most) of the projects below have significant sections documenting their sources and their efforts and decision making around this topic.
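To make the "missing values in data files" part concrete, here is a minimal sketch of a first-pass check. It is not taken from the course materials; the sample data, the column names, and the set of tokens treated as missing are all invented for illustration, and real files will need their own list.

```python
import csv
import io

# Hypothetical sample data; routes, stops, and values are invented for illustration.
SAMPLE_CSV = """route,stop,wait_minutes
61C,Forbes,4
61C,Murray,
71A,,7
71A,Centre,NA
"""

# Tokens treated as "missing". This set is an assumption; real files vary.
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "?"}


def count_missing(rows, missing_tokens=MISSING_TOKENS):
    """Return {column: number of rows whose value looks missing}."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts.setdefault(column, 0)
            # DictReader fills short rows with None, so treat None as missing too.
            if value is None or value.strip().lower() in missing_tokens:
                counts[column] += 1
    return counts


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
missing = count_missing(rows)
for column, n in missing.items():
    print(f"{column}: {n} of {len(rows)} rows missing")
```

A count like this only tells you where the holes are. Deciding whether to drop, fill, or flag those rows is the part that needs documenting, and it says nothing about bias in how the data was collected.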

Another big emphasis in the course is on understanding what the data will be used for, and by whom. Tied to this, we talk extensively about intelligibility in machine learning, the importance of narrative in visualization (and visualization in general), and the importance of defining the question you are answering.

Here are some of the highlights over the last three years:

Bus bunching is a phenomenon that can impact bus wait times. One of my 2016 students has been collecting data and extensively studying the phenomenon. His final project in the class drew on this data set and explored visual representations of the phenomenon.

Yelp data is always an area of interest. In 2014 … In 2015 students explored which state has the best pizza :). In 2016, the ‘Bon Yinzers’ developed a wonderful series of visualizations of factors that affect the popularity of Pittsburgh restaurants. They uncovered some interesting phenomena, such as the unexpectedly off-cycle check-in times of the most active Yelp users in Pittsburgh.

San Francisco Crime Alert explores the likelihood of different types of crime in different SF-area neighborhoods. Their prediction algorithm gives you a way to explore the prevalence of major and minor crime in terms of time of year, time of day, and location.

In 2015, a group collected and analyzed a data set of tweets by potential ISIS supporters, with the goal of ultimately engaging others in helping to label such data and understand how ISIS supporter accounts differ from other accounts with sometimes similar tweets (e.g. news accounts or bloggers).

Often, students’ goals are more about policy than about end users. In 2015, Healt$care explored the quality of healthcare and its relationship to dollars spent across the U.S. in a highly visual fashion.

In 2014, a group asked what jobs are popular in what parts of the US. Again, a combination of data visualization and prediction supports exploration of the question. A similar approach was explored by a 2014 group that collected data about movie piracy and its relationship to DVD release strategies.

Sadly, not all of the older projects still work (web standards change so fast!). I wish I could provide links to work such as the Reddit AMA visualization pictured here.

Screen Shot 2015-05-04 at 4.08.56 PM

 

Learning languages

I’ve mentioned before that one of my sabbatical goals was to learn a new language (Hindi). I am not fluent, but I think I came a fair way with it, and I want to comment on the role of different technologies and approaches in our successes (and failures) as a family to learn the three languages that we tackled on this trip.

One of the most useful technologies we employed was the Rosetta Stone software. The kids loved Rosetta Stone, which we started using almost as soon as the sabbatical was approved, to get them familiar with Hindi. They spent about 30 minutes at a time on it at the beginning. At our peak, this happened almost daily (after we left Pittsburgh but before we were settled in India). Eventually we hired a tutor (a wonderful friend now) to come for about an hour most days instead. The kids were far more resistant to being tutored than they were to using the software, but I feel we covered much more ground in those hours. We made up all sorts of games, retold fairy tales, played shop, and generally did our best to make it child-friendly.

Hindi was a relatively hard language to learn (new alphabet, different sentence structures, and so on). Once we got past the vocabulary phase,  progress was slowish. Still, by the end of the fall we could have whole conversations in Hindi as a family. The kids were not alone in learning the language: Anind and I were trying very hard to learn it as well and we tried to speak it at meals, with our Indian driver, and so on. So between the tutoring and the daily practice opportunities, they used Rosetta Stone less and less.

Rosetta Stone was not a pure success. It required the right context to be used: enough motivation, and not too much other support. We almost never used the German Rosetta Stone I bought, and of course the kids are far more fluent in German (they are immersed, unlike with Hindi, and it is a much easier language for them to learn). Use of Rosetta Stone is rare at this point, and mostly by me.

You get free tutoring through the online package with Rosetta Stone, along with access to online games. The games are a fun way to practice, but slow. When possible, I sign up and have a session with an online tutor, which is based on the material I’m currently covering in the software. However, they only let the kids join a session when there are no other remote participants, and such sessions are sometimes hard to find at popular times in the early stages of learning a language.

As a computer scientist, I cannot help but be impressed by the software. It is dedicated to learning language through immersion, and the authors have done an excellent job of maintaining that throughout the software and the tutoring sessions. It uses speech recognition to check pronunciation, and provides multimedia support for learning. And it works: if you put the time in, you learn. To my mind, it’s a success as an educational tool and an interactive tool. It supposedly has a social side as well, though as a Hindi learner I was one of few and could not take advantage of it. I’d be curious to see what it’s like.

It has always seemed such a shame to me that learning multiple languages is not a norm in the United States. During our travels we met 10-year-olds who spoke 5 or 6 languages, all with fluent ease. They never needed to touch a piece of software or a tutor. The world has become so small, yet so many of us in the United States fail to give our children the gift of understanding and the mastery of complexity that comes with learning multiple languages. Most of my Swiss cousins have raised children who are bi- or trilingual, and without the errors that plague even my immersed children. There is no substitute for that level of early exposure.

Sabbatical goals, revisited

I am more than halfway through my sabbatical now, and it seems appropriate to revisit the goals I had for my sabbatical, and possibly set some new ones before it’s too late to make changes. I’m going to start with those original goals (in gray), and add comments and thoughts.

Learn about other ways of thinking through sustainability. I want to take the time to deeply explore my own beliefs about sustainability, cross-cultural understandings of sustainability, and how both relate to my chosen field. I am planning on spending at least an hour a week just thinking and writing and reading about ethical/social/planetary issues relating to sustainability. I am also planning on teaching my course on sustainability in both of my sabbatical locations. Total time commitment: 5-6 hours per week.

Outcome: So far this has been a success, although not exactly as I had imagined. I just completed a UbiComp submission with Indian co-authors based on interviews in India and a survey deployed in India and in high- and low-income U.S. communities. It was fascinating to explore this data, and it definitely caused me to rethink the role of technology in sustainability. Related to this, I also submitted a grant proposal with several other U.S. faculty intended to explore automated techniques for affecting energy use across three different continents. Much of this was made possible by collaborations developed with the wonderful folks at IBM Bangalore. I am also teaching my environmental class here in Zürich, and although that is just starting, it is interesting to see the differences here. Finally, I have been blogging about sustainability in a non-academic fashion in an attempt to explore basic beliefs and perspectives on sustainability from a radically new perspective, and recently wrote an article for Interactions based on my blog post questioning the basic assumptions underlying HCI work in sustainability. Not sure if it adds up to an hour a week, but overall I’m happy with the progress on this topic so far.

Expand my toolbox. I want to learn more about hardware and machine learning (I’ve posted about this before on this blog). My current plan is to take a class on machine learning (I have a handy virtual one with me, or I can sign up wherever I’m at) and teach myself hardware using slides from a CMU class & hands on experimentation. I figure if I spend 2-3 hours per week on each (in parallel if possible, in series otherwise) I should make good progress on this over the year. Total time commitment: 4-6 hours per week.

Again, this has been a success. I completed the Stanford Online Machine Learning class this fall and have been leveraging what I learned in my student advising. Overall, I found the class to be enlightening but the homework a little too easy to solve without complete understanding. Still, it did help deepen my understanding of the algorithms, the methods for improving accuracy, and so on. It was highly focused on statistical techniques, and is nicely complemented by the second course I am taking (much more slowly), Carolyn Rosé’s applied machine learning. Now that one ML class is done, I am also working on my knowledge of hardware using Scott Hudson’s slides and related materials, and have gotten through several Arduino projects. So not complete, but I’m pretty happy with this. Extra bonus: I can squeeze some hardware work into family time, as the kids love to be included in it.

Finish hanging projects. I have three projects that require analysis only and two or three projects that require writing code. I plan on doing these for the most part in series, unless I am able to recruit local talent to help with the coding projects. It’s possible they won’t all get done, but I hope at least some will! Estimated time commitment: 4-6 hours per week. Start new projects that I’ve already thought about. I have two in mind. Estimated time commitment: 4-6 hours per week if done in series.

I wish I’d written down which specific projects I intended to work on! For the most part this has not been a great success. I have completed one (writing) project, and recruited students to work on others. However, recruiting students in Switzerland has not worked out well, and I’ve had varying success reaching any complete goal with students I recruited in India. There is still time to accomplish this goal, but I think that I am unlikely to get anywhere near 6. For my own sake when looking back on this six months from now, I will name the projects this time around: Futures — done; Search tools — making progress; Cosmo/viewpoint extraction — making progress; Mechanical Turk web accessibility — stagnating; Diabetes + Lyme analysis — stagnating [but I could tackle this without additional students]; Macro energy audits — stagnating; Using routines to reduce energy use — done; Lo-fi presence — making progress.

Write a large NSF proposal [already started]. Estimated time commitment: 1 hour per week through November.

Sadly petered out at the last minute, though I cannot take the blame for this. On the upside I led a group of five other PIs in writing a medium proposal (mentioned above). It was a fascinating experience in herding faculty which I’m not sure I ever want to repeat! :). 

Continue supporting students. Estimated time commitment: 3-4 hours per week of meetings, 1 hour per week of prep & planning. Meet new people, start new projects, develop new ideas. Estimated time: 4-6 hours per week.

Both successes (though you might ask my current students if they agree :P). Amazing how Skype can shorten the distance between places. I’m now holding meetings across three continents each week, a sign of the new collaborations I’ve begun. Also along the way I’ve given 7 talks and counting. But what I’m most pleased about in this arena is the opportunity that it’s given me to rethink my own research agenda and try to explore new ways of positioning myself. I have begun to realize that while I am driven by applications that matter, this has in some ways obscured other things that I care about. It has also affected my ability to recruit a certain type of student: folks for whom programming and building things and solving hard technical problems is as important as the applications this enables. So I have, through public speaking engagements, been exploring a new way of presenting my work, one that emphasizes both the applications that address real-world problems using technology and the enabling middleware that makes it possible to create them. More on this in a future post, as this is running long, but I consider the opportunity to rethink research approaches, goals, and so on to be a major benefit of sabbatical.

So what next? Certainly, at a minimum, more of the same. What I’ve been trying to do is working, and I plan to continue for the next few months. I still need to forge stronger collaborations with some Swiss colleagues, and am actively working on that. I also need to step back and ask which hanging projects are worth pursuing and what the best approach to doing that is. And of course I need to keep in mind that this is my chance to rest and rejuvenate. By the end of April I will have been deeply involved in writing about 120 pages of text, taught most of 4 classes since the start of last summer, advised or mentored about 13 people (in one-on-ones) across two continents throughout most of the sabbatical, and learned 1.5 new foreign languages. In between, I’ve also been taking time to travel, relax, spend time with my kids, and continue fighting for my health. I plan to continue that, as I wish to arrive home not only inspired and educated but also healthy and well rested for the start of the fall semester.

Monkey Business

Monkeys in class

After complaining so much in my last email about the 120 degree heat, with no power or water, I wanted to share something more fun.

Yesterday (Thursday) was my second-to-last day of teaching here at RK Valley. We’re leaving tomorrow for Hyderabad. I teach 3 classes, 1 hour each, from 3-6 every day. 4pm is chai-time (the British weren’t all bad, I suppose), so at 4, someone brings me a cup of chai, a bottle of ice water and a package of biscuits for my chai. I was walking around the classroom and had my back to the front of the class, when I heard a small crash. I could tell from the crinkling that the biscuits had fallen to the floor: no big deal. The package was closed, and I wasn’t going to eat them anyway. About 10 minutes later, I got to the front of the classroom, and the package was gone. Remember this was only a few days after one of my shoes went missing, and I had to go to the local village to get a new pair. So, I was pretty confused. I opened the door to my classroom, and you can see below who came to visit me. It was a family of 4, including a baby (which was pretty scared of me).

It’s things like this that make me forget about the scorching heat, and really appreciate where I am and what I get to do. It also makes me miss Pittsburgh – we’ve all been getting a little homesick lately. But, I guarantee, nothing like this would ever happen at CMU :-).

Breakthrough!

Today was one of those days that makes you want to dance. In order to explain why, I need to back up a bit and describe a little about the teaching I’m doing here at RGUKT. Many of the students at RGUKT are from rural communities and the vast majority are here on scholarship. Their main language is Telugu, and although they are taught in English, not all of them speak it perfectly. The university aims to admit students who excel at their local school from all over Andhra Pradesh. My understanding is that students are not required to pass any national exams before being admitted. This means that while the students here are smart and motivated, they may not all have the same background. They spend their first year (or two?) here covering needed pre-college material.

RGUKT has a very large student body, all taking the same classes at first, and courses involve a combination of lectures and watching videos. The students all have access to shared laptops at least once a day (if I understand correctly — my understanding has been shifting over time) and do their homework either on paper or on laptop depending on the assignment. Internet access is carefully controlled and quite limited, though the campus does have a fairly fast connection (unless several thousand students are hitting it all at once :) ). Internet is not usually turned on in the classroom, and when we requested that this change in our class, it took much of the first class to get everyone up and running. Outside of class, we had to download the documentation and files for our programming course so that students could access them without Internet (through the campus LAN).

The students here live on campus, and many only see their family members when they head home (the trip here is difficult and visits usually only last one day). Girls and boys live in separate dorms, sit separately in class (girls in front), eat separately, and do sports in separate locations. Impressively the student body is at least 1/3 (maybe even close to 1/2) female.

There is time for exercise and recreation, along with studying, every day. The students are friendly and cheerful, and often visit (usually to kidnap one or both of my children for fun, sometimes to talk to myself or Anind). When we join them in the sports fields or the mess hall, many are eager to talk and ask questions about the best engineering field to major in (this is an engineering focused school), how they can use their degree to make a difference at home, what jobs are best, why we are here, and much more.

We are each teaching three sections (“batches”) of about 70 students (a total of about 420 students) for an hour each, 5 days a week. The students rotate through the same set of laptops, which brings challenges (it is impossible to easily give them each a clean environment, or even to ensure that they each have their own set of files, especially given the confusions they have early in the course over which file to edit, where to find them, and so on). As described above we (now) have Internet access in the classroom (most of the time). Because of this, it is a precious opportunity to let them work on code/the interactive, online class pages (created using OLI). We do this 3 days a week (T/Th/S). To add to the challenge, Internet is not always available, and sometimes the power goes out for some or all machines as well (luckily not too often!). The day we had the worst power troubles, the students had to move in ever larger groups from laptop to laptop around the room as each died.

In the classroom, the students are very different … or at least have been for the last week and a half. They all rise politely when I enter and speak in unison in response to my “Good Morning.” They sit quietly during class with notebooks open and scribble down as much as they can about what I have to say. Getting them to interact with me instead of their notebooks is very, very difficult. The typical answer to a question I ask is “Yes ma’am,” while the answer to “Are there any questions” has been a uniform shake of the head or empty silence in most cases. As an instructor, my job, then, is to find out why. Luckily, and slowly, the dam has broken open. The mother hen of the campus, a kind and wonderful, fatherly man who has taken care of us along with hundreds of students with whom he is close friends or more, has given us feedback, as have the instructors who are helping us with the course. I speak too fast, we are covering a huge amount of material, and my accent is difficult to follow. These issues, combined with cultural differences in question asking and answering, and stylistic differences between RGUKT teaching and my own, make it difficult to make forward progress.

Here are some of the things I have tried to use to rectify this in the last week plus:

I’m not sure what made the difference, but today things finally changed. Looking back I can see it wasn’t just today. One student told me that others might be embarrassed to speak up (I asked him to please set an example for others by speaking up himself), another asked me what this “learn by doing stuff” was for anyway, an email message asked me to speak more slowly, … each time I got feedback of this sort I made an effort to let the students know I understood the difficulties they faced and how much I was asking of them, and to tell them about the feedback I had gotten and show them that I did not feel upset or criticized.

All I can say is I was ready to dance for joy in class today. When I stopped after presenting each new bit of material and asked for questions I usually got one, two, three or more questions, spread across multiple students. This happened in multiple sections. Perhaps they can understand my English just a little bit better, or have gained some confidence in my expectations. For my part, I praise their progress and let them know I have a long way to go in both speaking (slowing down) and understanding (I still have to walk up to each student and ask him or her to repeat the question until I understand it — between their accent and the ambient noise of the fans in the room I have a hard time). Perhaps they feel more comfortable with me, or have resigned themselves to the fact that I won’t give up. I don’t know for sure, but each question is a gift, especially knowing how far it had to travel. For me teaching has always been a conversation, and not a one-sided one. I am grateful to have achieved that, with the students’ help, here at RGUKT.