Fun Data for teaching R

I’ll be running an R course soon and I am looking for fun (public) datasets to use for data manipulation and visualization. I would like to use a single dataset that has some easy variables for the first days, but also some more challenging ones for the final days. And when I set exercises, I want the students* to be curious about finding out the answer.

[*in this case the students are not ecologists]

Ideas:

-Movies. How many movies has Woody Allen made? Is the number of movies per year increasing linearly or exponentially? That is a good theme with lots of options. IMDB releases some data, and processing their terribly formatted txt files and assembling them would be an excellent exercise for an advanced class, but not for beginners. OMDB has an API for running searches, and if you donate you can get the full database. And of course, there is an R package to use the API. This is a better option for beginners.

-Music. Everyone likes music and there are 300 GB of data here. You can also get just a chunk, but even 2 GB of data is probably too much for beginners.

-Football: I ruled this one out for myself because I know nothing about it, but I am sure it would be highly popular in Spain. There is an open database here.

-Kaggle datasets are also awesome. To download them you just have to register. I may use the baby names per year and US state. Everyone is curious about the most popular name in the year they were born, for example.

-Earthquakes: This one also needs some parsing of the txt files (easier than IMDB) and will make for pretty visualizations.

-Datasets already in R: Along with the classic datasets on Iris flowers (used by Fisher!) or the cars dataset, there are cooler options. For example, there are lots of datasets for econometrics (some are curious), and RStudio also released some cool ones recently (e.g. flights).

-Other: The Internet is full of data like real time series, lots of small data examples, M&M’s colors by bag, Jeopardy questions, Marvel social networks, Dolphins social networks, …
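
As a rough illustration of the "datasets already in R" option, here is a minimal sketch that needs nothing beyond base R. It assumes the bundled iris and mtcars data (mtcars is my pick for "the cars dataset", which the list does not pin down), and the two exercise questions are invented for illustration.

```r
# iris and mtcars ship with base R (package 'datasets'), so nothing
# has to be downloaded or installed before the first class.

# Easy first-day question: what is the mean sepal length per species?
sepal_means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
print(sepal_means)

# Slightly harder: which car model has the best fuel economy (highest mpg)?
# The model names are stored as row names, not as a column.
best_mpg <- rownames(mtcars)[which.max(mtcars$mpg)]
print(best_mpg)
```

The second question is a small step up because the answer lives in the row names, which is a handy excuse to talk about how data frames are structured.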

Please, add your ideas in the comments, especially if you have used them with success for teaching R. Thanks!

12 thoughts on “Fun Data for teaching R”

Steven Slezak on January 22, 2016 at 07:17 said:

The Lahman package in R contains a full data set of statistics from Major League Baseball. It was the database we used for the first class in R that I took. It was a MOOC on edX from Boston University: https://www.edx.org/course/sabermetrics-101-introduction-baseball-bux-sabr101x-0. You will get a lot of good ideas there. Good luck!

Daniel on January 22, 2016 at 12:48 said:

If you want to go the movie route, you could use this data set from MovieLens, with ratings. It is available in different sizes, ranging from 100k to 20M rows. http://grouplens.org/datasets/movielens/
It is pretty easy to work with and yields some neat insights. It also offers some interesting ideas for data manipulation (e.g. the year is included in the title column, hence one task would be to extract the year into a separate column). I think this could be really interesting for your students.
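
The year-extraction task could be sketched in base R roughly as follows. The sample titles and the column names (title, year, clean_title) are made up for illustration, and the code assumes every title ends in the "Title (YYYY)" pattern described above.

```r
# Hypothetical titles in the "Title (YYYY)" style; not actual rows
# copied from the MovieLens files.
movies <- data.frame(
  title = c("Toy Story (1995)", "Heat (1995)", "Annie Hall (1977)"),
  stringsAsFactors = FALSE
)

# Group 1 is the title text, group 2 the four digits inside the
# trailing parentheses.
year_pattern <- "^(.*)\\s+\\((\\d{4})\\)\\s*$"

# Pull the year into its own integer column. Titles that do not match
# the pattern are returned unchanged by sub(), so as.integer() would
# turn them into NA (with a warning).
movies$year <- as.integer(sub(year_pattern, "\\2", movies$title, perl = TRUE))

# Keep the title without the year in a separate column.
movies$clean_title <- sub(year_pattern, "\\1", movies$title, perl = TRUE)

print(movies)
```

Checking for NA values in the new year column afterwards is a natural follow-up exercise, since real files may contain titles that break the pattern.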

Enjoy.

lionel on January 22, 2016 at 17:21 said:

The agridat package contains a great many datasets from agricultural experiments: https://cran.r-project.org/web/packages/agridat/agridat.pdf

techcranks on January 22, 2016 at 20:09 said:

The weather/climate datasets are fun here: http://www.kdnuggets.com/2015/04/awesome-public-datasets-github.html.

Also, I found the twitteR package satisfied a curious mind because the data from the API is so recent.

ibartomeus on January 22, 2016 at 21:10 said:

Thanks all for all your great suggestions!

I'll add a few more from Twitter:

@frod_san: pigeon racing, cup stacking speed… http://blog.yhat.com/posts/7-funny-datasets.html. And belly button biodiversity of course http://navels.yourwildlife.org/

@rgfitzjohn: There’s bound to be something interesting in here: https://github.com/caesar0301/awesome-public-datasets

Jemus42 (@Jemus42) on January 23, 2016 at 03:39 said:

I tried collecting datasets in an R package for a similar goal.
It’s available here: https://github.com/tadaadata/loldata

It contains some TV show datasets collected with my package for accessing trakt.tv, as well as a worldrankings dataset where I quickly collected a whole bunch of world rankings from Wikipedia.

ibartomeus on January 23, 2016 at 18:46 said:

One more: NASA data: https://pds.nasa.gov/

Locomarinero. on January 24, 2016 at 13:33 said:

I have used the London 2012 Olympics data set for a variety of procedures. It has over 10,000 cases, which is not unmanageable. The data set is provided by The Guardian.


Bartomeus lab

Ecology, global change and pollinators
