I’ll be running an R course soon and I am looking for fun (public) datasets to use for data manipulation and visualization. I would like to use a single dataset that has some easy variables for the first days, but also some more challenging ones for the final days. And when I set exercises, I want the students* to be curious about finding out the answer.
[*in this case students are not ecologists]
Ideas:
-Movies. How many movies has Woody Allen made? Is the number of movies per year increasing linearly or exponentially? That is a good theme with lots of options. IMDB releases some data, and processing their terribly formatted txt files and assembling them would be an excellent exercise for an advanced class, but not for beginners. OMDB has an API for searches, and if you donate you can get the full database. And of course, there is an R package to use the API. This is a better option for beginners.
-Music. Everyone likes music and there are 300 GB of data here. You can also download just a chunk, but even that is 2 GB, which is probably too much for beginners.
-Football: I discarded this one because I know nothing about it, but I am sure it would be highly popular in Spain. An open database here.
-Kaggle datasets are also awesome. To download them you just have to register. I may use the baby names per year and US state. Everyone is curious about the most popular name in the year they were born, for example.
-Earthquakes: This one also needs some parsing of the txt files (easier than IMDB) and makes for pretty visualizations.
-Datasets already in R: Along with the classic datasets on Iris flowers (used by Fisher!) or the cars dataset, there are cooler options. For example, there are lots of datasets for econometrics (some are curious), and RStudio also released some cool ones recently (e.g. flights).
-Other: The Internet is full of data like real time series, lots of small data examples, M&M’s colors by bag, Jeopardy questions, Marvel social networks, Dolphins social networks, …
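As a possible first-day warm-up, the classic built-in datasets mentioned in the list need no download at all. The following is only a sketch of what such an exercise could look like: `iris` and `cars` ship with base R, while the variable names `species_means` and `fit` are my own choices for illustration.

```r
# Both data frames ship with base R (package "datasets"), so nothing to install.
data(iris)
data(cars)

# Inspect the structure: four numeric measurements plus a Species factor.
str(iris)

# Data manipulation: mean of every numeric measurement per species.
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)

# A first model on the cars data: stopping distance as a function of speed.
fit <- lm(dist ~ speed, data = cars)
print(coef(fit))
```

From here, a natural visualization step would be `plot(cars)` followed by `abline(fit)` to draw the fitted line over the scatterplot.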
Please, add your ideas in the comments, especially if you have used them with success for teaching R. Thanks!
The Lahman package in R contains a full data set of statistics from Major League Baseball. It was the database we used for the first class in R that I took. It was a MOOC on edX from Boston University: https://www.edx.org/course/sabermetrics-101-introduction-baseball-bux-sabr101x-0. You will get a lot of good ideas there. Good luck!
If you want to go the movie route, you could use this data set from MovieLens, which includes ratings. It is available in different sizes, ranging from 100k to 20M rows. http://grouplens.org/datasets/movielens/
It is pretty easy to work with and offers some neat insights. It also suggests some interesting ideas for data manipulation (e.g. the year is included in the title column, hence one task would be to extract the year into a separate column). I think this could be really interesting for your students.
Enjoy.
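The year-extraction task mentioned above could be sketched in base R roughly like this. It is a minimal sketch on a made-up three-row data frame, assuming titles end with the year in parentheses (e.g. "Toy Story (1995)"); the column names `title`, `year`, and `clean_title` are assumptions for illustration, not a guaranteed match for the actual MovieLens files.

```r
# Hypothetical sample rows; the real file would be read with read.csv().
movies <- data.frame(
  title = c("Toy Story (1995)", "Annie Hall (1977)", "Untitled Project"),
  stringsAsFactors = FALSE
)

# Capture group 1: the title text; group 2: a four-digit year in parentheses
# at the very end of the string.
pattern <- "^(.*?)\\s*\\((\\d{4})\\)\\s*$"
has_year <- grepl(pattern, movies$title, perl = TRUE)

# Pull out the year as text first, leaving NA where no year is present,
# then convert to integer (this avoids coercion warnings).
year_text <- ifelse(has_year,
                    sub(pattern, "\\2", movies$title, perl = TRUE),
                    NA_character_)
movies$year <- as.integer(year_text)

# Keep the title without the trailing "(YYYY)"; leave other titles untouched.
movies$clean_title <- ifelse(has_year,
                             sub(pattern, "\\1", movies$title, perl = TRUE),
                             movies$title)

print(movies)
```

Rows without a recognizable year keep their original title and get `NA` in `year`, which gives students a reason to think about missing values too.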
The agridat package contains many datasets from agricultural experiments: https://cran.r-project.org/web/packages/agridat/agridat.pdf
The weather/climate datasets are fun here: http://www.kdnuggets.com/2015/04/awesome-public-datasets-github.html.
Also, I found that the twitteR package satisfies a curious mind because the data from the API is so recent.
Thanks, everyone, for all your great suggestions!
I am adding a few more from Twitter:
@frod_san: pigeon racing, cup stacking speed… http://blog.yhat.com/posts/7-funny-datasets.html. And belly button biodiversity of course http://navels.yourwildlife.org/
@rgfitzjohn: There’s bound to be something interesting in here: https://github.com/caesar0301/awesome-public-datasets
I tried collecting datasets in an R package for a similar goal.
It’s available here: https://github.com/tadaadata/loldata
It contains some TV show datasets collected with my package for accessing trakt.tv, as well as a worldrankings dataset where I quickly collected a whole bunch of world rankings from Wikipedia.
One more: NASA data: https://pds.nasa.gov/
I have used the London 2012 Olympics data set for a variety of procedures. It has over 10,000 cases, which is not unmanageable. The data set is provided by The Guardian.