Fun Data for teaching R

I’ll be running an R course soon and I am looking for fun (public) datasets to use for data manipulation and visualization. I would like to use a single dataset that has some easy variables for the first days, but also some more challenging ones for the final days. And when I set exercises, I want the students* to be curious to find out the answer.

[*in this case students are not ecologists]


-Movies. How many movies has Woody Allen made? Is the number of movies per year increasing linearly or exponentially? That is a good theme with lots of options. IMDB releases some data, and processing their terribly formatted txt files and assembling them would be an excellent exercise for an advanced class, but not for beginners. OMDB has an API for running searches, and if you donate you can get the full database. And of course, there is an R package to use the API. This is a better option for beginners.

-Music. Everyone likes music and there are 300 GB of data here. You can also get just a chunk, but even 2 GB of data is probably too much for beginners.

-Football: I discarded this one for myself because I know nothing about it, but I am sure it would be highly popular in Spain. There is an open database here.

-Kaggle datasets are also awesome. To download them you just have to register. I may use the baby names per year and US state. Everyone is curious about the most popular name in the year they were born, for example.

-Earthquakes: This one also needs some parsing of the txt files (easier than IMDB) and will make for pretty visualizations.

-Datasets already in R: Along with the classic datasets on iris flowers (used by Fisher!) or the cars dataset, there are cooler options. For example, there are lots of datasets for econometrics (some are curious), and RStudio also released some cool ones recently (e.g. flights).

-Other: The internet is full of data like real time series, lots of small data examples, M&M’s colors by bag, Jeopardy questions, Marvel social networks, dolphin social networks, …
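
As a rough sketch of how the datasets already in R could cover both the easy first days and the harder final days, the snippet below uses the built-in iris and cars datasets from base R. The specific exercises (a per-species mean, a scatterplot, a linear fit) are only illustrative assumptions about what such a course might include, not a tested syllabus.

```r
# iris and cars ship with base R (package 'datasets'), so nothing needs downloading.
data(iris)
data(cars)

# First days: a simple summary, the mean petal length per species.
petal_means <- aggregate(Petal.Length ~ Species, data = iris, FUN = mean)
print(petal_means)

# First days: a scatterplot of stopping distance against speed.
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")

# Final days: fit a linear model and overlay the fitted line on the plot.
fit <- lm(dist ~ speed, data = cars)
abline(fit, col = "red")
print(coef(fit))
```

The same pattern (summarise, plot, then model) should carry over to any of the larger datasets listed above once they are loaded into a data frame.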

Please add your ideas in the comments, especially if you have used them successfully for teaching R. Thanks!



I have a guest post on the Practical Management blog

Quick note to say I am very glad to have a guest post on an awesome blog about data management, a neglected topic that affects all scientists. The blog is also quite funny, bringing some glamour to the art of data processing. Thanks Christie for inviting me to contribute!

The post is about style; check it out here:

Why analysing your data is like being in a romantic relationship

Last year I was working on a big dataset to assess how bee phenology has changed over time. Here is the first cool figure I produced. I was quite excited, so I didn’t even bother to make beautiful axes.

I am pretty sure the stats I finally used changed quite a lot, and I also added many more data points before publishing the results (it took me a year to sort out all the details), but the main result held. Bees are emerging earlier in recent time periods than they used to. The final published figure looks like this:

While cleaning my computer today, I realised that my first plot looks way more colourful and exciting than the final figure I ended up publishing. Then, I remembered a text I wrote about analyzing data…

“I almost forgot the fun of first analysis when everything is new and exciting, when you want to know everything about “data” and you learn from “her” every day… it’s a shame that after that it becomes repetitive and monotonous. You’ve lost the magic, but on the other hand, it’s also nice to really get to know each other, you gain commitment and confident results.”

So maybe my own plots can prove I was right, and data analysis is like a love story. Are your first drafts also more passionate than the final version?