Fun Data for teaching R

I’ll be running an R course soon and I am looking for fun (public) datasets to use in data manipulation and visualization. I would like to use a single dataset that has some easy variables for the first days, but also some more challenging ones for the final days. And when I set exercises, I want the students* to be curious about finding out the answer.

[*In this case, the students are not ecologists.]

Ideas:

-Movies. How many movies has Woody Allen made? Is the number of movies per year increasing linearly or exponentially? That is a good theme with lots of options. IMDB releases some data, and processing their terribly formatted txt files and assembling them would be an excellent exercise for an advanced class, but not for beginners. OMDB has an API to make searches, and if you donate you can get the full database. And of course, there is an R package to use the API. This is a better option for beginners.

-Music. Everyone likes music and there are 300 GB of data here. You can also get just a chunk, but 2 GB of data is still probably too much for beginners.

-Football: I ruled this one out for myself because I know nothing about it, but I am sure it would be highly popular in Spain. An open database here.

-Kaggle: Kaggle datasets are also awesome. To download them you just have to register. I may use the baby names per year and US state. Everyone is curious about the most popular name in the year they were born, for example.

-Earthquakes: This one also needs some parsing of the txt files (easier than IMDB) and will make for pretty visualizations.

-Datasets already in R: Along with the classic datasets on iris flowers (used by Fisher!) or the cars dataset, there are cooler options. For example, there are lots of datasets for econometrics (some are curious), and RStudio also released some cool ones recently (e.g. flights).
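To get a feel for what already ships with R before downloading anything, here is a minimal sketch that uses only the base `datasets` package. The `flights` data is not part of base R, so it is left out; the per-species summary is just one possible first-day exercise, not a prescribed one.

```r
# List the datasets bundled with base R's 'datasets' package
available <- data(package = "datasets")$results[, "Item"]
head(available)

# The two classics mentioned above
str(iris)  # flower measurements plus a Species factor
str(cars)  # speed and stopping distance

# A possible first-day exercise: mean petal length per species
petal_means <- aggregate(Petal.Length ~ Species, data = iris, FUN = mean)
print(petal_means)
```

From here, students can move on to `plot()` or `boxplot()` on the same data without any import step getting in the way.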

-Other: The internet is full of data, like real time series, lots of small data examples, M&M’s colors by bag, Jeopardy questions, Marvel social networks, dolphin social networks, …

Please, add your ideas in the comments, especially if you have used them with success for teaching R. Thanks!

 

 

Where are the kids born in December?

This is the question Xavier Sala i Martín asked on a Catalan TV show about economics (yes, pretty cool that you can talk about that on TV!). In a nutshell, he described the relative age effect: a pattern in which most elite football and hockey players are born in the first 6 months of the year, because kids born in the same year are put to compete together and the older ones are bigger and stronger. Coaches then dedicate more time to them, and by the time physical capabilities have evened out among all kids born in the same year, the kids born in January have trained more, received more positive reinforcement, etc…

But he did not answer where the kids born in December are. I speculated that those “bad” at sports would have more time to do arts, like playing music. Let's test the hypothesis! I found a list of musicians by birthday on Wikipedia and @vgaltes scraped it for me*. Amazing! 57% of musicians on Wikipedia were born in the last 6 months of the year (yes, a chi-square test is highly significant with this sample size), and January is the only month that goes against our prediction.

[Figure: Rplot03]

Each bar is the number of musicians per month, starting in January. The black line is the expected number. Sorry for the terrible graph with no axes.

We should have stopped here. Publish it and be famous. Unfortunately, we got excited. @vgaltes found this web page with lots of birthday summaries by profession, and eyeballing the numbers, there is no clear pattern for musicians. Then @dukjb started pointing out that we should correct for the number of days that each month has and, more importantly, for the natural birth rate per month, which is likely not uniform. Then we lost momentum, we got distracted by other things, and the conversation faded out. But at least we got some fun, no excuse for being bad at sports**, and this post out of it!
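For anyone who wants to pick this up, here is a rough sketch of the days-per-month correction in R. The counts in `births` are made-up placeholders, not the scraped Wikipedia data, and a non-leap year is assumed; only `chisq.test()` and its `p` argument come from base R.

```r
# Hypothetical counts of musicians born in each month (January to December).
# These are placeholder values, NOT the scraped Wikipedia data.
births <- c(410, 350, 380, 370, 400, 390, 450, 460, 440, 450, 430, 440)
names(births) <- month.abb

# Null 1: every month is equally likely (what we tested originally)
uniform_test <- chisq.test(births)

# Null 2: probability proportional to the number of days in each month
# (non-leap year assumed)
days <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
days_test <- chisq.test(births, p = days / sum(days))

# Expected counts under the days-adjusted null, to compare with the bars
print(round(days_test$expected))

# Share of births in the second half of the year
second_half_share <- sum(births[7:12]) / sum(births)
print(second_half_share)
```

The same call would handle the natural birth rate too: replace `days` with the monthly number of births in the general population, if you can find that data.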


*I am ashamed, but it would be too time-consuming for me to do in R for a side, side, side, side project.

**I was born in early April.