Atop Darien

Bee Curiosity

“Using subset() and the plyr package in R to summarize the native bee datasets: Part 2”


“Using subset() and the plyr package in R to summarize the native bee datasets”

Our dataset contains both collection methods, aerial netting and bee bowls (pan traps), but for one analysis we only want to examine the bees collected in bee bowls among sites.

Example of a bee bowl used in the data collection.

Now, we could manually cut and paste all the aerial net data out of our dataset, but that is laborious, annoying, and prone to errors (#ExcelStinks), especially with very large datasets. Instead, let’s use R’s subset() function to create a dataset that only has the native bees collected by bee bowls.

In the last post, I went over how to use the plyr package in R to summarize the entire bee dataset into manageable chunks.

How can we create a new dataset that just has the native bees collected in bee bowls?
  • Use the subset() function in R.
  • What are the names of the different levels in the CollectionMethod factor?
    Use the levels() function, which tells us that the two levels are “PanTrap” and “AerialNet”.
levels(Bees.df$CollectionMethod) # list the factor levels used to describe the method of collecting bees
## [1] "AerialNet" "PanTrap"
Next, use the subset() function to create a new data frame containing only the records whose CollectionMethod column equals “PanTrap”.
BeeBowl.df <- subset(Bees.df, CollectionMethod == "PanTrap") # keep only the rows collected in bee bowls (pan traps)
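
If you want to try subset() without the bee dataset loaded, here is a minimal sketch. Toy.df and its five rows are made up for illustration and are not the real data. It also shows the same selection done with bracket indexing, which should select the same rows.

```r
# Hypothetical toy data standing in for Bees.df (not the real dataset)
Toy.df <- data.frame(
  CollectionMethod = factor(c("PanTrap", "AerialNet", "PanTrap", "AerialNet", "PanTrap")),
  Genus = factor(c("Agapostemon", "Bombus", "Bombus", "Ceratina", "Agapostemon"))
)

# Inside subset(), columns can be named directly, without the Toy.df$ prefix
ToyBowl.df <- subset(Toy.df, CollectionMethod == "PanTrap")

# The same rows selected with bracket indexing
ToyBowl2.df <- Toy.df[Toy.df$CollectionMethod == "PanTrap", ]

nrow(ToyBowl.df)                             # number of pan trap records kept
all(ToyBowl.df$Genus == ToyBowl2.df$Genus)   # do both approaches pick the same rows?
```

One difference worth knowing: if the condition evaluates to NA for a row, subset() drops that row, while bracket indexing returns a row of NAs instead.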

Take a second to appreciate how much time that saves you: that one line of code just created a new data frame (dataset) with exactly what you wanted, with no cutting and pasting or any other error-prone method.

Before moving on and summarizing the data, we need to check the new data frame to make sure that it only contains what we want it to contain.

str(BeeBowl.df)
## 'data.frame':    76 obs. of  14 variables:
##  $ SpeciesID       : int  2014037 2014043 2014045 2014046 2014047 2014033 2014042 2014044 2014035 2014041 ...
##  $ Replicate       : Factor w/ 7 levels "1","2","3","4",..: 5 5 2 2 5 6 1 2 6 5 ...
##  $ Collector       : Factor w/ 13 levels " Haskell, D. ",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ State           : Factor w/ 1 level "Massachusetts": 1 1 1 1 1 1 1 1 1 1 ...
##  $ County          : Factor w/ 2 levels "Bristol","Plymouth": 2 2 2 2 2 2 2 2 2 2 ...
##  $ City            : Factor w/ 3 levels "Brockton","E. Bridgewater",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Location        : Factor w/ 6 levels "BeaverBrook",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Latitude        : num  42.1 42.1 42.1 42.1 42.1 ...
##  $ Longitude       : num  71 71 71 71 71 ...
##  $ Date            : Factor w/ 5 levels "7/17/14","7/18/14",..: 5 5 5 5 5 4 5 5 4 5 ...
##  $ CollectionMethod: Factor w/ 2 levels "AerialNet","PanTrap": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Bowl_Color      : Factor w/ 5 levels "Blue","None",..: 4 1 4 1 1 5 4 1 1 5 ...
##  $ Genus           : Factor w/ 14 levels "Agapostemon",..: 1 1 1 1 1 5 5 5 6 7 ...
##  $ Species         : Factor w/ 11 levels "","Andreniformis",..: 9 9 9 1 9 3 3 3 1 2 ...

76 observations. Yes, that checks out, because that’s how many bee specimens were collected in bee bowls. Unless I am mistaken, everything else checks out too.
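
Reading the str() output by eye works, but the same checks can also be done in code. Below is a rough sketch on made-up data (Toy.df is hypothetical, not the real dataset). Notice in the str() output above that CollectionMethod still lists both "AerialNet" and "PanTrap" as levels: subset() keeps unused factor levels, and droplevels() removes them if they get in the way later.

```r
# Hypothetical toy data standing in for Bees.df
Toy.df <- data.frame(
  CollectionMethod = factor(c("PanTrap", "AerialNet", "PanTrap", "AerialNet", "PanTrap")),
  Bowl_Color = factor(c("Blue", "None", "White", "None", "Yellow"))
)
ToyBowl.df <- subset(Toy.df, CollectionMethod == "PanTrap")

# Count records per collection method; the AerialNet count should be 0
table(ToyBowl.df$CollectionMethod)

# The subset should have as many rows as there are PanTrap records in the original
nrow(ToyBowl.df) == sum(Toy.df$CollectionMethod == "PanTrap")

# Drop factor levels that no longer occur in the subset
ToyBowl.df <- droplevels(ToyBowl.df)
levels(ToyBowl.df$CollectionMethod)
```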

Using the head() function, let’s take a look at the first 6 rows of the dataset.
head(BeeBowl.df)
##   SpeciesID Replicate                        Collector         State
## 1   2014037         5 Schoener, D. & Massasoit Interns Massachusetts
## 2   2014043         5 Schoener, D. & Massasoit Interns Massachusetts
## 3   2014045         2 Schoener, D. & Massasoit Interns Massachusetts
## 4   2014046         2 Schoener, D. & Massasoit Interns Massachusetts
## 5   2014047         5 Schoener, D. & Massasoit Interns Massachusetts
## 6   2014033         6 Schoener, D. & Massasoit Interns Massachusetts
##     County     City    Location Latitude Longitude    Date
## 1 Plymouth Brockton BeaverBrook    42.08     70.99  7/8/14
## 2 Plymouth Brockton BeaverBrook    42.08     70.99  7/8/14
## 3 Plymouth Brockton BeaverBrook    42.08     70.99  7/8/14
## 4 Plymouth Brockton BeaverBrook    42.08     70.99  7/8/14
## 5 Plymouth Brockton BeaverBrook    42.08     70.99  7/8/14
## 6 Plymouth Brockton BeaverBrook    42.08     70.99 7/23/14
##   CollectionMethod Bowl_Color          Genus   Species
## 1          PanTrap      White    Agapostemon Virescens
## 2          PanTrap       Blue    Agapostemon Virescens
## 3          PanTrap      White    Agapostemon Virescens
## 4          PanTrap       Blue    Agapostemon          
## 5          PanTrap       Blue    Agapostemon Virescens
## 6          PanTrap     Yellow Augochlorella     aurata

The only records are bees collected in bee bowls, so we should be good to go. Now, let’s use the plyr package to summarize the data.

library(plyr) # load the plyr package, which provides ddply()
BeeBowl.sum <- ddply(BeeBowl.df, .(Location, Bowl_Color, Replicate, Genus), summarise, # summarize total abundance of bees for each genus
                  TotalBees = length(Genus)) # length() counts the records of each genus within each location, bowl color, and replicate

Let’s review how we used the ddply() function to summarize the data. We took the subsetted data frame (BeeBowl.df), grouped it by Location, Bowl_Color, Replicate, and Genus, and created a new data frame with those four columns plus TotalBees, the number of bees of each genus in each group.

head(BeeBowl.sum)
##      Location Bowl_Color Replicate          Genus TotalBees
## 1 BeaverBrook       Blue         2    Agapostemon         1
## 2 BeaverBrook       Blue         2 Augochlorella          1
## 3 BeaverBrook       Blue         5    Agapostemon         2
## 4 BeaverBrook       Blue         6         Bombus         1
## 5 BeaverBrook      White         1 Augochlorella          1
## 6 BeaverBrook      White         2    Agapostemon         1
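
One way to sanity-check a summary like this is to rebuild the counts a second way and confirm that the totals add up to the number of specimens. The sketch below uses base R's aggregate() rather than plyr, on made-up data (ToyBowl.df is hypothetical). It is a cross-check under those assumptions, not a replacement for the ddply() call above.

```r
# Hypothetical toy data: one row per bee specimen
ToyBowl.df <- data.frame(
  Location   = c("BeaverBrook", "BeaverBrook", "BeaverBrook", "BeaverBrook"),
  Bowl_Color = c("Blue", "Blue", "Blue", "White"),
  Replicate  = c(5, 5, 2, 2),
  Genus      = c("Agapostemon", "Agapostemon", "Agapostemon", "Agapostemon")
)
ToyBowl.df$TotalBees <- 1  # each row counts as one bee

# Add up the ones within each Location / Bowl_Color / Replicate / Genus group
Toy.sum <- aggregate(TotalBees ~ Location + Bowl_Color + Replicate + Genus,
                     data = ToyBowl.df, FUN = sum)
Toy.sum

# The group totals should add back up to the number of specimens
sum(Toy.sum$TotalBees) == nrow(ToyBowl.df)
```

The same check applies to the real summary: sum(BeeBowl.sum$TotalBees) should come out equal to nrow(BeeBowl.df).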

Since the summarized dataset is what we want, we can use the write.csv() function to export it to a comma-separated values (.csv) file that can be opened in Excel, a text editor, or other spreadsheet software such as Google Sheets.

write.csv(BeeBowl.sum, "BeeBowlSummer2014.csv") # export the summarized bee data as a comma-separated values (.csv) file that can be opened in Excel or any text editor
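
By default, write.csv() also writes the row numbers as an extra unnamed first column. If you would rather not have that column, row.names = FALSE leaves it out. Here is a small sketch on made-up data (Toy.sum is hypothetical) that writes to a temporary file and reads it back in to confirm the export.

```r
# Hypothetical summary table standing in for BeeBowl.sum
Toy.sum <- data.frame(
  Location  = c("BeaverBrook", "BeaverBrook"),
  Genus     = c("Agapostemon", "Bombus"),
  TotalBees = c(2, 1)
)

# Temporary path, so nothing in the working directory is overwritten
out.file <- tempfile(fileext = ".csv")

# row.names = FALSE leaves out the column of row numbers
write.csv(Toy.sum, out.file, row.names = FALSE)

# Read the file back in and compare it with what we wrote
Toy.check <- read.csv(out.file)
names(Toy.check)
nrow(Toy.check) == nrow(Toy.sum)
```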

Author: Sean Kent

I am a naturalist, nature photographer, field ecologist, and educator. I have an eternal fondness for all creatures great and small, especially native bees and flowering plants.
