Suppose you are at the zoo, and you want to know how many animals are at the zoo. If you have a database that contains this information, it’s easy enough:
In SQL:
select count(distinct animal) from zoo;
In R:
length(unique(zoo$animal))
Easy enough! The tough part for R comes when you want to start doing things by a grouping.
Now suppose you are at the zoo, and you want to know how many types of animals are in each cage.
In SQL:
select cage,count(distinct animal) from zoo group by cage;
In R:
tapply(zoo$animal,list(zoo$cage),function(x) length(unique(x)))
But this is just the start of the trouble for R. What if you really wanted to know some characteristic of the animals in each of the cages? Suppose you want to know how the average age of each type of animal in each cage:
In SQL:
select cage,animal,avg(age) from zoo group by cage,animal;
In R:
tapply(zoo$age,list(zoo$cage,zoo$animal),mean)
You can see how the R code seems to get more complicated depending on exactly what kind of function you’d like to apply. Even worse, you still might have to transform the resulting data you get out of the tapply function. (sometimes you want a list of tuples instead of the matrix form)
A simple solution is to use the SQL code from within R. You will need to install the package SQLDF. With this package installed, you can easily replicate the SQL functions above like so:
In R:
sqldf("select cage,count(distinct animal) from zoo group by cage")
sqldf("select cage,animal,avg(age) from zoo group by cage,animal")
The sqldf library is a very simple way to think about data in terms of SQL instead of tapply. Don’t dismiss tapply entirely, though! As you can see in my above examples, you are able to apply ANY function to a grouping of data, so R in infinitely more flexible than SQL in that sense.