Software carpentry

I would never call myself a programmer, but as an ecologist I manage moderately big and complicated datasets, and that requires interacting with my computer to get the most out of them. I taught myself most of what I need, and more or less I have succeeded in managing my data (MySQL), writing simple functions to analyze it (R), and using other people's functions (written in Matlab or Java, of which I have no knowledge at all). That's nothing fancy. I don't create software or develop complex simulations. But I still need to communicate with my Mac. Knowing some basic programming is, for me, the difference between weeks of pain and tears vs. a couple of hours performing an easy task. That's why I signed up for a Software Carpentry boot camp.

What did I learn?

– The Shell: I rarely interact with the shell, but I have had to on three occasions in the past. It was always guesswork. All I needed to do was copy and paste some code found on the internet to run a script written by another scientist, tweaked for my data. The tweaking part was usually OK, as the specific task came with some help from the authors, but opening the shell and figuring out where to start calling the program, or how the path structure works, and all that minor stuff was a shot in the dark. Now I have learned some of the basics (pwd, ls, cd, mkdir, cp, mv, cat, sort, |), and while I will probably not use them much, next time I need to open the terminal (to run pandoc, maybe?) I will know how to start. A cool thing is that you can easily run R scripts*:

Rscript file.R input.txt

or shell scripts:

bash file.sh

* the R script can load the input using:

### Read in command line arguments
args <- commandArgs(trailingOnly = TRUE)
### Read in data from the file specified on the command line
data <- read.table(args[1], sep = ",")

– Regular expressions: I have also used regular expressions a couple of times before, but they are not in my everyday toolbox. Seeing some examples of how they can be used was a good reminder that I am still doing some tasks in a suboptimal way (go to www.regexpal.com to play with them). I liked the:

alias grep="grep --color=auto"

to put some color in my terminal by default.
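And since most of my regex needs end up in R anyway, here is a made-up example (mine, not from the course) of the kind of cleanup task where they shine:

# hypothetical messy species names: inconsistent separators and stray spaces
spp <- c("Bombus  terrestris", "Lasioglossum_spp.", " Apis mellifera ")
# collapse underscores and repeated spaces, then trim the ends
clean <- gsub("^ +| +$", "", gsub("[_ ]+", " ", spp))
clean
# [1] "Bombus terrestris" "Lasioglossum spp." "Apis mellifera"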

– Git: I have been using Git (well, trying to, at least) for a year or so. Git is not a hipster caprice, but a solid and tested system to keep track of what you do. However, it can be a bit complicated at the beginning. I had been using a GUI so far, but at SWC they showed me how to set up a folder and do the usual tasks from the shell. I have to say that while I like the GUI for seeing the history, it is way easier to set up a new project from the command line (git init). I also understand the philosophy of Git better now, and why staging (git add) is useful for separating files into different commits (git commit), for example. I also learnt how the .gitignore file works. Just list, in a plain text file, wildcard patterns for the files that shouldn't be tracked:

*.pdf
data_*.csv

– Unit tests: My friend @vgaltes (a real software dev.) is always pushing me to adopt unit testing, but it is very obscure to me how to do that in ecology, where my code is most times ephemeral and quick. I usually test moderately complicated functions with fake datasets constructed on the fly to see if they behave as expected. This checks that the code is currently doing what you think it is, but it does not tell you whether it will always behave this way in other situations. Ethan White's advice was to scale up this approach: save the fake data and run it every time I change the code to see if it still works. Those are regression tests (or vector tests) according to Titus Brown (read his post, it makes you think); see the sketch after the quote below. A second step I am not actually doing is Stupidity Driven Testing (basically, test for known bugs). I need to see how to adopt that, but having bugs in your code is easier than it looks, so the more safety controls you have the better. Paraphrasing Titus:

“If you’re confident your code works, you’re probably wrong. And that should worry you.” 
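For instance, here is a minimal regression-test sketch (my own toy example, not from the course): fake data with a hand-computed answer, kept in a script I re-run after every change to the function.

# function under test: Shannon diversity from a vector of proportions
shannon <- function(p) -sum(p * log(p))

# fake data where I can work out the answer by hand
p_even <- rep(0.25, 4)                         # four equally abundant species
stopifnot(all.equal(shannon(p_even), log(4)))  # known: H = log(S) when even
stopifnot(all.equal(shannon(1), 0))            # known: a monoculture has H = 0

# if any stopifnot() fails after a code change, I broke something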

More important is probably learning from watching other people code. One of my main problems is that I don't work closely with more experienced users that often, so I cannot learn by imitation. For example, I suck at respecting naming conventions, commenting enough, and splitting complex lines into several simpler ones, but I'll try to get better at that. Overall I feel as if I had started using complex software directly by running (sometimes in a clumsy way), and this course taught me how to walk properly. Thanks @ethanwhite and @bendmorris!

Happy biRthday

Today is my birthday. It's also the birthday of a close friend. What an incredible coincidence! Or wait, maybe it is just expected. Once more, R comes to our aid, because it has a built-in function to answer our question.

What is the probability of two coinciding birthdays in a group of 17 people? (Yes, we have a mailing list, so I can count my friends semi-objectively without the fear of missing any.) Just type:

pbirthday(n = 17, classes = 365, coincident = 2)

The answer is approximately 0.3; that is, 3 of every 10 friend groups (of that size) have at least two coinciding birthdays. Not that impressive, is it? But the beauty of stats is that they are here to correct your intuition. To witness an impressive (and statistically significant) coincidence, you would need a group of 47 people, none of them with coinciding birthdays. And then, probably, nobody would be amazed.

qbirthday(prob = 0.95, classes = 365, coincident = 2)
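As a sanity check (my own addition), a quick simulation agrees with pbirthday():

set.seed(42)  # arbitrary seed
# simulate 10,000 groups of 17 and count those with at least one shared birthday
hits <- replicate(10000, any(duplicated(sample(365, size = 17, replace = TRUE))))
mean(hits)  # should land near 0.315, matching pbirthday(n = 17)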

Anyway, happy birthday to all readers celebrating today (if any)!

Pollination effectiveness landscape

I want to show you a pollination landscape. This is not a pollinator landscape with flowers and nesting sites, but a plot showing two components of pollination: quantity and quality. A recent paper by Pedro Jordano (see here for other work on seed dispersal landscapes) inspired me to plot my data that way, which I think is just awesome. I am a little ashamed that I cannot fully follow the part of his code dealing with plotting the z axis (but it is great that people are sharing code!)… but if you have a hammer, all your problems will look like nails, and I had recently used expand.grid() and loess(), so I used those to build my z axis. I also think it is nice that people have two options for plotting the data. So here is the plot:

[Figure: pollination effectiveness landscape]

Here you just see that bumblebees and Lasioglossum provide similar total pollination levels (yellow), but through different mechanisms (quantity or quality).

Here you can find the code to reproduce the plot.
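In case that link ever dies, here is a minimal sketch of the expand.grid() + loess() trick on made-up data (the variable names and numbers are placeholders, not my actual dataset):

set.seed(1)
# fake pollinator data: visit quantity, per-visit quality, and their product
quantity <- runif(100, 0, 100)   # e.g. visits per hour
quality  <- runif(100, 0, 1)     # e.g. proportion of visits depositing pollen
effectiveness <- quantity * quality

# smooth the effectiveness surface; surface = "direct" avoids NAs
# when predicting at the edges of the grid
fit <- loess(effectiveness ~ quantity * quality,
             control = loess.control(surface = "direct"))
xs <- seq(0, 100, length.out = 50)
ys <- seq(0, 1, length.out = 50)
grid <- expand.grid(quantity = xs, quality = ys)
z <- matrix(predict(fit, newdata = grid), nrow = length(xs))

filled.contour(xs, ys, z, xlab = "quantity", ylab = "quality")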

Who are the pollinators? (with R plot)

I've been dreaming of writing a manuscript about who the pollinators are for a while, but it looks like I'm not going to have the time soon, so here is an early draft of what the main figure could look like:

[Figure: pollinator guilds compared across pollination aspects]

It's surprisingly difficult to gather quantitative information on which animals are the main pollinators, and on which aspects of pollination they are good at. The figure could cover more aspects, or split the pollinator guilds into finer sub-groups, but this is just a first pass. As expected, bees are the clear winners!

I used guesstimates based on Winfree et al 2011 and the following articles:

Number of species: how many species of a given taxon are described, based on different taxonomic resources. But not all species in a given taxon are necessarily good pollinators!

Efficiency: this one will vary a lot among species of the same group, but based on Sahli and Conner 2007 and a few other cross-taxa studies measuring pollen deposition, I gave values from 1 to 10 to the different taxa.

Frequency of visits: this is based on Neff and Simpson's 1993 descriptive work. An update to that with recent datasets is really needed! Values from 1 to 10.

Distribution: some taxa are widespread, while others are restricted to certain areas, like the tropics. Ranked from 1 to 10.

Number of plants pollinated: a complete guesstimate. Using the approach of Ollerton et al 2011 may give us better numbers.

Number of crops pollinated: Based on Klein et al 2007.

And as I know that the R code is what readers really want, here it is as a gist. I used the function diamondplot{plotrix}, but I needed to edit the function first in order to scale the axes. The original function scales the groups (pollinator taxa, in my case) instead of the axes (pollination aspects), which was not desirable for my plot.
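If you just want to see the stock function in action (without my axis-scaling edit), a hypothetical scoring table like this one works; the numbers below are placeholders, not the guesstimates behind the figure:

require(plotrix)
# made-up 1-10 scores: one row per pollinator guild, one column per aspect
scores <- data.frame(
    species      = c(10, 7, 4, 2),
    efficiency   = c( 9, 5, 6, 7),
    frequency    = c(10, 8, 4, 2),
    distribution = c(10, 9, 6, 3),
    plants       = c(10, 6, 5, 3),
    crops        = c(10, 5, 2, 1),
    row.names    = c("bees", "flies", "butterflies", "vertebrates")
)
diamondplot(scores)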

See you in late January, after a break!

is.invasive()

To celebrate that I am contributing to the R-bloggers.com blog aggregator, I am going to post a very simple function to check which species (both plants and animals) are considered "invaders" somewhere in the world. Basically, the function asks the Global Invasive Species Database (GISD).

I coded this because a friend of mine asked me precisely that question [yes, friends assume you should know this kind of stuff (and also why the plants on their balcony are dying) off the top of your head just because you are a biologist]. However, I do not know that many things, and I am too lazy to check all 250 species one by one on the GISD webpage. It is also good R practice, and I am OK with investing some work time in personal projects. Google (and other big companies) encourages its employees to spend 20% of their time working on projects that aren't necessarily in their job descriptions in order to boost innovation, and that should be even more important in science!

I hope it can be useful to more people. I uploaded the code as a Gist:

UPDATE: The function is now available in the taxize R package developed by the rOpenSci people!


# is.invasive()
## Description ##
# This function checks which species (both plants and animals) are considered "invaders" somewhere in
# the world. To that end, it queries GISD (http://www.issg.org/database/welcome/) and returns a value,
# either "Not invasive" or the brief description presented in GISD. Note that the webpage contains
# more information. Also note that the function won't tell you whether a species is exotic in your
# area; a lot of exotic species are not considered invaders (yet). As expected, the function is as
# good as the database, which I find quite reliable and well maintained. The database is also able to
# recognize a lot (but not all) of the species synonyms. This function worked for me, but I didn't
# test it intensively, and any change in the webpage's html design will return wrong values. Apply
# the usual disclaimers when using it.
# The function is slow (not optimized at all), so be patient with long lists of species.
# Author: Ignasi Bartomeus (nacho.bartomeus#gmail.com). Last updated 23 Nov 2012.
# Usage:
# is.invasive(sp, simplified.df = FALSE)
# Arguments:
# sp: a vector of species names in latin (Genus species)
# simplified.df: if TRUE, returns a data.frame with the species names and the values "Invasive" or
# "Not invasive". I recommend checking first the non-simplified version (the default), which contains
# raw information about the level of invasiveness.
# The function:
is.invasive <- function(sp, simplified.df = FALSE){
    require(XML)
    # reformat the species list for the url query
    species <- gsub(" ", "+", sp)
    # create the urls to parse
    urls <- paste("http://www.issg.org/database/species/search.asp?sts=sss&st=sss&fr=1&x=13&y=9&sn=",
                  species, "&rn=&hci=-1&ei=-1&lang=EN", sep = "")
    # create a data.frame to store the output
    Out <- data.frame(species = sp, status = c(1:length(urls)))
    # loop through all species
    for(i in 1:length(urls)){
        # parse the url and extract the relevant table
        doc <- htmlTreeParse(urls[i], useInternalNodes = TRUE)
        tables <- getNodeSet(doc, "//table")
        t <- readHTMLTable(tables[[4]])
        tt <- as.matrix(t)
        if(length(grep("No invasive species currently recorded", tt, value = TRUE)) > 0){
            Out[i, 2] <- "Not invasive"
        } else {
            if(simplified.df == FALSE){
                Out[i, 2] <- tt[12, 1]
            } else {
                Out[i, 2] <- "Invasive"
            }
        }
        print(paste("Checked species", i, "of", length(urls)))
    }
    print("Done")
    Out
}
# Example:
sp <- c("Carpobrotus edulis", "Rosmarinus officinalis")
# the first species is invasive, the second one is not
d <- is.invasive(sp)
d
d <- is.invasive(sp, simplified.df = TRUE)
d


Running motivation #An R amusement

Henry John-Alder once told me that in a marathon, twice as many runners cross the line at 2h 59m as at 3h 00m. He pointed out that this anomaly in the distribution of finishers per minute (roughly normal shaped) is due to motivation. I believe that. I am not physically stronger than my friend Lluismo; in fact, we are pretty even, but sometimes one of us beats the other just because he has the right motivation…

But where is the data? Can we test for that? Can we get a measure of how motivated runners are by looking at the distribution of race times? The hypothesis is that runner groups that deviate from the expected finishing-time distribution are more likely to contain motivated runners. It happens that I ran a race a couple of weeks ago, so I can fetch the results, create an expected distribution, and compare that to the observed values.

I am interested in separating motivation from physical condition because it is a real problem in behavioural ecology (see Sol et al. 2011). And because working with my race data is a lot of fun.

First we need to read the results from the webpage and extract a nice table:

# load url and packages
url <- "http://www2.idrottonline.se/UppsalaLK/KungBjorn-loppet/KungBjorn-loppet2012/Resultat2012/"

require(plyr)
require(XML)
require(RCurl)

# get & format the data
doc <- getURL(url)
doc2 <- htmlTreeParse(doc, asText = TRUE, useInternalNodes = TRUE)
tables <- getNodeSet(doc2, "//table")
t <- readHTMLTable(tables[[1]])
tt <- as.matrix(t)
tt <- as.data.frame(tt)
# Select only the 10K men class and make variable names
data <- tt[c(150:391), c(2, 3, 4, 6)]
colnames(data) <- c("place", "number", "name", "time")
head(data)

##     place number             name  time
## 150     1    624    Hedlöf Viktor 32:52
## 151     2    631      Vikner Joel 33:18
## 152     3    414   Sjögren Niclas 33:47
## 153     4    329    Swahn Fredrik 33:48
## 154     5    278    Sjöblom Albin 34:04
## 155     6    311 Lindgren Fredrik 35:31

Cool. Now we need to get the number of finishers per minute.

# create an 'empty' minute column
min <- c(1:length(data$time))
# if time has hour digits, transform to minutes
for (i in 1:length(data$time)) {
    if (nchar(as.character(data$time[i])) > 5) {
        min[i] <- as.numeric(substr(data$time[i], 3, 4)) + 60
    }
}
# for the first 237 runners (all under one hour), the first two characters are the minutes
min[1:237] <- as.numeric(substr(data$time[1:237], 1, 2))

And plot the number of finishers per minute

plot(table(min), xlab = "minute", ylab = "finishers per minute", xlim = c(30, 
    63))

That is approximately a normal distribution! (Way better than my usual ecological data.) In that case, let's create an expected perfect normal distribution with mean and sd based on this race. I will use it as the expected finishing times in the absence of motivation: if each runner performs according only to his physical condition, then, given enough runners, finishing times will follow a perfect normal distribution (that is a model assumption).

# create the density function
x <- seq(32, 63, length = length(data$time))
hx <- dnorm(x, mean(min), sd(min))
# transform densities into the expected number of finishers per minute:
# for each minute, find the closest x value and multiply its density (hx)
# by the total number of runners
e <- numeric(length(32:63))
for (i in 1:length(c(32:63))) {
    e[i] <- hx[which.min(abs(x - c(32:63)[i]))] * length(data$time)
}
# Check the total number of runners predicted is close to the real one (242)
sum(e)  #close enough

## [1] 239.1

plot(c(32:63), e, ylim = c(0, 20), type = "l", ylab = "finishers per minute", xlab = "")
abline(v = 39, lty = 2)
abline(h = 6.77, lty = 2)

So, based on the plot, I predict 6.77 runners finishing in minute 39 (see the dashed lines), and 8.20 in minute 40. If being below 40 minutes is a goal, we expect motivation to show up there as a deviation from the expected values… Let's plot both things together:

plot(c(32:63), e, ylim = c(0, 20), type = "l", ylab = "", xlab = "")
par(new = TRUE)
plot(table(min), xlab = "minute", ylab = "finishers per minute", ylim = c(0, 20), xaxt = "n")

Well, the observed value at minute 39 is 11, higher than the expected 6.77, but minutes 40 and 41 are also higher than expected. Maybe being around 40 is the real motivation? We can visualize the observed minus expected values as follows:

o <- as.vector(table(min))
# pad the observed counts so both vectors cover every minute from 32 to 63
o[31] <- 0
o[32] <- 3
# calculate difference
diff <- o - e
plot(c(32:63), diff, xlab = "minutes", ylab = "difference")
abline(h = 0, lty = 2)

Not a super clear pattern. Positive values in the first half of the finishers indicate motivation; however, take into account that positive values in the second half imply demotivation. Let's do the sums:

sum(diff[1:16])  # first half

## [1] 10.75

sum(diff[17:32])  # second half

## [1] -7.892

The distribution is skewed toward faster times than expected. I interpret this as an indication that most people were in fact motivated (the deviations in the first half of the finishers >> the second half). We could have asked the people, but if you work with animals, they don't communicate as clearly (and humans can lie!). So, is this approach useful? Maybe if, instead of 242 runners in a 10k, I used the NY marathon data, it would give us a clearer pattern? We will never know, because it is time to go back to work.