Atop Darien

Bee Curiosity


Leave a comment

Summarizing the Native Bee Dataset using the PLYR Package in R: Part 1

Using PLYR package in R to summarive the native bee dataset

Summarizing the Native Bee Dataset from the Massasoit Summer Research Undergraduate Program

In several posts, I will cover how to summarize the native bee data so we can create meaningful graphs and later conduct appropriate statistical analyzes. In the last post, I went over how to examine the dataset for errors and we will be using the corrected dataset here.

As the dataset is currently constituted, we cannot easily summarize the data, create graphs, or analyze the data. To create dataset summaries, we will use the plyr package in R to create summary datasets. Before we begin, it’s important to always keep in mind the major overarching question of this research:

What is the influence of land use on ecosystem health (in this case native bees)?

Based on this question, it would be helpful to have a dataset that summarizes the abundance of each native bee genera at each site. Let’s use the plyr package and the ddply() function to accomplish this task. First, we should develop a plan on how to organize the summary data.

How is the native bee dataset organized?

For the native bee data, each native bee that was collected was given a unique identifying number (SpeciesID).

You can see the unique identifying number in the photo below

PinnedBeeSpecimen

Here are the following pieces of data that were also recorded in the dataset.

  1. Collector: who collected the data
  2. Locality data:
    • State
    • County
    • City
    • Specific Location
    • Latitude & Longitude
  3. Date: Date the specimen was collected
  4. Collection Method
    • The collection method is either aerial netting of flying bees or collecting bees in bee bowls
  5. Bowl_Color
  6. Genus
  7. Species
Here is the actual R output of the column names
##  [1] "SpeciesID"        "Replicate"        "Collector"       
##  [4] "State"            "County"           "City"            
##  [7] "Location"         "Latitude"         "Longitude"       
## [10] "Date"             "CollectionMethod" "Bowl_Color"      
## [13] "Genus"            "Species"          "Notes"
Here is the first several rows of the dataset
##   SpeciesID Replicate                        Collector         State
## 1   2014037         5 Schoener, D. & Massasoit Interns Massachusetts
## 2   2014043         5 Schoener, D. & Massasoit Interns Massachusetts
## 3   2014045         2 Schoener, D. & Massasoit Interns Massachusetts
## 4   2014046         2 Schoener, D. & Massasoit Interns Massachusetts
## 5   2014047         5 Schoener, D. & Massasoit Interns Massachusetts
## 6   2014033         6 Schoener, D. & Massasoit Interns Massachusetts
##     County     City    Location Latitude Longitude    Date
## 1 Plymouth Brockton BeaverBrook    42.08     70.99  7/8/14
## 2 Plymouth Brockton BeaverBrook    42.08     70.99  7/8/14
## 3 Plymouth Brockton BeaverBrook    42.08     70.99  7/8/14
## 4 Plymouth Brockton BeaverBrook    42.08     70.99  7/8/14
## 5 Plymouth Brockton BeaverBrook    42.08     70.99  7/8/14
## 6 Plymouth Brockton BeaverBrook    42.08     70.99 7/23/14
##   CollectionMethod Bowl_Color          Genus   Species
## 1          PanTrap      White    Agapostemon Virescens
## 2          PanTrap       Blue    Agapostemon Virescens
## 3          PanTrap      White    Agapostemon Virescens
## 4          PanTrap       Blue    Agapostemon          
## 5          PanTrap       Blue    Agapostemon Virescens
## 6          PanTrap     Yellow Augochlorella     aurata
##                  Notes
## 1                     
## 2                     
## 3                     
## 4                     
## 5                     
## 6 unsure need to check
Using the PLYR package and ddply() function to summarize the native bee data
  • First, let’s create a data set that contains the number of bees collected at each location, independent of collection method. For the initial analyses, we will only examine whether or not differences between the genera of native bees found at each study site, as opposed to looking at differences between species.
library(plyr)
Bees.site <- ddply(Bees.df, .(Location, Genus), summarise, # summarize total abundance of bees for each genera at each site
                  TotalBees = length(Genus)) 
  • As you can see, we used the ddply() function in the plyr package to summarize the data. If you examine the R code above, I’ve created a dataframe (Bees.site) that will have three columns: Location, Genus present at each location, and the total number of individuals in each genera found at each location. The length() function just counts the occurance of each genera for each site. Now, let’s take a look at the actual dataset we created.
Bees.site
##           Location          Genus TotalBees
## 1      BeaverBrook    Agapostemon         5
## 2      BeaverBrook           Apis         1
## 3      BeaverBrook Augochlorella          3
## 4      BeaverBrook         Bombus        10
## 5      BeaverBrook     Calliopsis         1
## 6      BeaverBrook       Halictus         1
## 7      BeaverBrook   Lasioglossum         4
## 8      BeaverBrook      Megachile         1
## 9      BeaverBrook          Osmia         1
## 10     BeaverBrook       Xylocopa         1
## 11 MassasoitMeadow    Agapostemon         3
## 12 MassasoitMeadow        Andrena         2
## 13 MassasoitMeadow     Anthophora         2
## 14 MassasoitMeadow           Apis         7
## 15 MassasoitMeadow Augochlorella          2
## 16 MassasoitMeadow         Bombus        34
## 17 MassasoitMeadow     Calliopsis         4
## 18 MassasoitMeadow       Halictus        16
## 19 MassasoitMeadow        Hylaeus         2
## 20 MassasoitMeadow   Lasioglossum        18
## 21 MassasoitMeadow      Megachile         5
## 22 MassasoitMeadow    Melissacler         1
## 23 MassasoitMeadow       Xylocopa         1
## 24 PoorMeadowBrook    Agapostemon         1
## 25 PoorMeadowBrook Augochlorella          3
## 26    SheepPasture Augochlorella          2
## 27    SheepPasture       Halictus         1
## 28    SheepPasture   Lasioglossum        13
## 29    SheepPasture          Osmia         1
## 30    WestgateMall           Apis         9
## 31    WestgateMall         Bombus         2
## 32    WestgateMall       Halictus         2
## 33    WestgateMall   Lasioglossum         7
## 34    WestgateMall      Megachile         1
## 35    WindsorTrail        Andrena         1
str(Bees.site)
## 'data.frame':    35 obs. of  3 variables:
##  $ Location : Factor w/ 6 levels "BeaverBrook",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Genus    : Factor w/ 14 levels "Agapostemon",..: 1 4 5 6 7 8 10 11 13 14 ...
##  $ TotalBees: int  5 1 3 10 1 1 4 1 1 1 ...

That looks great, but we should export the file as a .csv that can be read by spreadsheet software like excel. This is really easy using the write.csv() function in R, as you can see below.

write.csv(Bees.site, "BeesPerSiteSummarySummer2014.csv") #export a new spreadsheet file that summarizes the abundance of bees in each genera for each site

In the next post, I will show you how to break up the dataset so we can just examine the bees that were collected by bee bowls. We will use the plyr package and the subset() function to create the summarized datasets.

Advertisements


Leave a comment

Using R for data validation: Checking the native bee dataset from the Massasoit’s summer undergraduate research project

Using R for data validation: Checking the native bee dataset

This is post walking you through the very important, but often overlooked, process of data validation. I will show you how to use R to check for errors in the dataset on native bees from the summer research project.

Examining the structure of the bee data

We use the str() function to examine the structure of the dataset. This will tell us what type of variables we have, the type of data we have in each variable, and is a great first way to check for any errors in the dataset.

str(Bees.df)
## 'data.frame':    170 obs. of  14 variables:
##  $ SpeciesID                              : int  2014096 2014100 2014108 2014109 2014115 2014116 2014117 2014118 2014119 2014120 ...
##  $ Collector                              : Factor w/ 14 levels " Haskell, D. ",..: 7 4 4 4 6 14 14 14 13 2 ...
##  $ State                                  : Factor w/ 1 level "Massachusetts": 1 1 1 1 1 1 1 1 1 1 ...
##  $ County                                 : Factor w/ 2 levels "Bristol","Plymouth": 2 2 2 2 2 2 2 2 2 2 ...
##  $ City                                   : Factor w/ 3 levels "Brockton","E. Bridgewater",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Location                               : Factor w/ 7 levels "BeaverBrook",..: 2 2 2 2 1 2 2 2 5 5 ...
##  $ Latitude                               : num  42.4 42.4 42.4 42.4 42.4 ...
##  $ Longitude                              : num  70.6 70.6 70.6 70.6 70.6 ...
##  $ Date                                   : Factor w/ 5 levels "7/17/14","7/18/14",..: 1 1 1 1 4 1 2 3 4 4 ...
##  $ CollectionMethod..PanTrap.or.AerialNet.: Factor w/ 2 levels "AerialNet","PanTrap": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bowl_Color                             : Factor w/ 6 levels "Blue","None",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Genus                                  : Factor w/ 14 levels "Agapostemon",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Species                                : Factor w/ 10 levels "","Andreniformis",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ Notes                                  : Factor w/ 7 levels "","#1","#2","Female",..: 1 1 1 1 1 1 1 1 1 1 ...

Let’s examine the SpeciesID column to make sure that each bee has a unique identifier and that there has been no duplicate or skipped numbers. There should be 170 unique identifying numbers, meaning we should see a SpeciesID range from 2014001 to 2014170, so let’s check the range first to see if we get 2014001 to 2014170.
The reason I know there are 170 observations is because at the top of the structure output you see ## ‘data.frame’: 170 obs. of 14 variables:

range(Bees.df$SpeciesID)
## [1] 2014001 2014166

Oh no, that’s a problem we have a range that only goes to 2014166, when it should go to 2014170. This means one of two things, either the data was entered incorrectly for a few specimens or that some species have the same number.

Before we try to figure out how to fix this problem, I want you to think about how this would be extremely difficult to catch if you had to do it either in Excel or actually by looking at each individual bee in the collection.

Ok, so R recognizes SpeciesID as an integer, like the phosphate data for example, but SpeciesID is really a unique character for each bee, not an integar, so lets change it to a factor first.

SpeciesID <- as.factor(Bees.df$SpeciesID) #use the as.factor to change an integar variable to a factor
summary(SpeciesID)
## 2014064 2014065 2014072 2014080 2014085 2014086 2014122 2014123 2014001 
##       2       2       2       2       2       2       2       2       1 
## 2014002 2014003 2014004 2014005 2014006 2014007 2014008 2014009 2014010 
##       1       1       1       1       1       1       1       1       1 
## 2014011 2014012 2014013 2014014 2014015 2014016 2014017 2014018 2014019 
##       1       1       1       1       1       1       1       1       1 
## 2014020 2014021 2014022 2014023 2014024 2014025 2014026 2014027 2014028 
##       1       1       1       1       1       1       1       1       1 
## 2014029 2014030 2014031 2014032 2014033 2014034 2014035 2014036 2014037 
##       1       1       1       1       1       1       1       1       1 
## 2014038 2014039 2014040 2014041 2014042 2014043 2014044 2014045 2014046 
##       1       1       1       1       1       1       1       1       1 
## 2014047 2014048 2014049 2014050 2014052 2014053 2014054 2014055 2014056 
##       1       1       1       1       1       1       1       1       1 
## 2014057 2014058 2014059 2014060 2014061 2014062 2014063 2014066 2014067 
##       1       1       1       1       1       1       1       1       1 
## 2014068 2014069 2014070 2014071 2014073 2014074 2014075 2014076 2014077 
##       1       1       1       1       1       1       1       1       1 
## 2014078 2014079 2014081 2014082 2014083 2014084 2014087 2014088 2014089 
##       1       1       1       1       1       1       1       1       1 
## 2014090 2014091 2014092 2014093 2014094 2014095 2014096 2014097 2014098 
##       1       1       1       1       1       1       1       1       1 
## (Other) 
##      63

The output for the summary file is quite long, let’s just look at the first part of it.

x <- summary(SpeciesID) #I created a variable to easier explore the dupicates in the dataset
x[1:8] #Look at the summary(SpeciesID) above, you will notice that the first 8 SpeciesID values have a 2 associated with it, meaning there are two entries for each unique SpeciesID
## 2014064 2014065 2014072 2014080 2014085 2014086 2014122 2014123 
##       2       2       2       2       2       2       2       2

In the dataset, there eight species identification numbers that are assigned to two bees. SpeciesID
2014064, 2014065, 2014072, 2014080, 2014085, 2014086, 2014122, 2014123. Here is the uncorrected dataset. After determining the problems with the SpeciesID numbers, Adam and I went back to the dataset and found out the bees that had duplicated entries.

Screenshot of the Excel Dataset

I’ve highlighted four of the eight errors in the dataset and Adam and I corrected each issue that has been identified. 4 bees had the same number, 2014164 and 2014165 were entered as 2014064 and 2014065, so that was fixed as well. Lastly, one entry was entered twice. Thus, instead of having 170 collected bees, we have 169 collected bees.

Now that we’ve corrected the SpeciesID errors, let’s check to make sure there are not anymore problems with that column.

Let’s look at the structure of the corrected dataset and also the range of SpeciesID, which should be from 2014001 to 2014169.

str(Corrected.df)
## 'data.frame':    168 obs. of  15 variables:
##  $ SpeciesID       : int  2014037 2014043 2014045 2014046 2014047 2014059 2014048 2014022 2014028 2014033 ...
##  $ Replicate       : int  5 5 2 2 5 5 5 6 5 6 ...
##  $ Collector       : Factor w/ 13 levels " Haskell, D. ",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ State           : Factor w/ 1 level "Massachusetts": 1 1 1 1 1 1 1 1 1 1 ...
##  $ County          : Factor w/ 2 levels "Bristol","Plymouth": 2 2 2 2 2 2 2 2 2 2 ...
##  $ City            : Factor w/ 3 levels "Brockton","E. Bridgewater",..: 1 1 1 1 1 2 1 1 1 1 ...
##  $ Location        : Factor w/ 7 levels "BeaverBrook",..: 1 1 1 1 1 3 7 2 2 1 ...
##  $ Latitude        : num  42.1 42.1 42.1 42.1 42.1 ...
##  $ Longitude       : num  71 71 71 71 71 ...
##  $ Date            : Factor w/ 5 levels "7/17/14","7/18/14",..: 5 5 5 5 5 5 5 4 4 4 ...
##  $ CollectionMethod: Factor w/ 2 levels "AerialNet","PanTrap": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Bowl_Color      : Factor w/ 5 levels "Blue","None",..: 4 1 4 1 1 1 4 1 1 5 ...
##  $ Genus           : Factor w/ 14 levels "Agapostemon",..: 1 1 1 1 1 1 2 5 5 5 ...
##  $ Species         : Factor w/ 10 levels "","Andreniformis",..: 8 8 8 1 8 8 1 3 3 3 ...
##  $ Notes           : Factor w/ 7 levels "","#1","#2","Female",..: 1 1 1 1 1 1 3 7 7 7 ...
range(Corrected.df$SpeciesID)
## [1] 2014001 2014169
check1 <- as.factor(Corrected.df$SpeciesID)
x <- summary(check1)
x[1:10]
## 2014001 2014002 2014003 2014004 2014005 2014006 2014007 2014008 2014009 
##       1       1       1       1       1       1       1       1       1 
## 2014010 
##       1

Looks like all the problems with the speciesID column have been corrected. Now we need to check the other columns to see if there are any problems there as well.

str(Bees.df)
## 'data.frame':    170 obs. of  14 variables:
##  $ SpeciesID                              : int  2014096 2014100 2014108 2014109 2014115 2014116 2014117 2014118 2014119 2014120 ...
##  $ Collector                              : Factor w/ 14 levels " Haskell, D. ",..: 7 4 4 4 6 14 14 14 13 2 ...
##  $ State                                  : Factor w/ 1 level "Massachusetts": 1 1 1 1 1 1 1 1 1 1 ...
##  $ County                                 : Factor w/ 2 levels "Bristol","Plymouth": 2 2 2 2 2 2 2 2 2 2 ...
##  $ City                                   : Factor w/ 3 levels "Brockton","E. Bridgewater",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Location                               : Factor w/ 7 levels "BeaverBrook",..: 2 2 2 2 1 2 2 2 5 5 ...
##  $ Latitude                               : num  42.4 42.4 42.4 42.4 42.4 ...
##  $ Longitude                              : num  70.6 70.6 70.6 70.6 70.6 ...
##  $ Date                                   : Factor w/ 5 levels "7/17/14","7/18/14",..: 1 1 1 1 4 1 2 3 4 4 ...
##  $ CollectionMethod..PanTrap.or.AerialNet.: Factor w/ 2 levels "AerialNet","PanTrap": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bowl_Color                             : Factor w/ 6 levels "Blue","None",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Genus                                  : Factor w/ 14 levels "Agapostemon",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Species                                : Factor w/ 10 levels "","Andreniformis",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ Notes                                  : Factor w/ 7 levels "","#1","#2","Female",..: 1 1 1 1 1 1 1 1 1 1 ...

Let’s look at the other columns:

  1. State: Factor w/ 1 level “Massachusetts”
    – One state, that’s correct
  2. County: Factor w/ 2 levels “Bristol”,“Plymouth”
    – Two counties, that’s correct
  3. City: Factor w/ 3 levels “Brockton”,“E. Bridgewater”
    – Three cities, that’s correct
  4. Location: Factor w/ 7 levels “BeaverBrook”
    – Seven field sites, that’s correct
  5. CollectionMethod: Factor w/ 2 levels “AerialNet”,“PanTrap”:
    – Two collection methods, that’s correct
  6. Bowl_Color: Factor w/ 6 levels “Blue”,“None”
    – Six bowl colors, that’s wrong. There should be 5 bowl colors: none, Yellow, Blue, Orange, and White.
levels(Bees.df$Bowl_Color)
## [1] "Blue"   "None"   "Orange" "White"  "yellow" "Yellow"
Bees.df$Bowl_Color[85]
## [1] yellow
## Levels: Blue None Orange White yellow Yellow

Notice, that by typing in a capital and lower case for Yellow, R recognizes that as two factors instead of what we were actually trying to show with a single factor. In row 85, the bowl color was typed as “yellow” instead of “Yellow”. Bees.df$Bowl_Color[85] That’s easy to fix.

Bees.df$Bowl_Color[Bees.df$Bowl_Color == "yellow"] <- "Yellow"
Bees.df$Bowl_Color[85]
## [1] Yellow
## Levels: Blue None Orange White yellow Yellow

Let’s check the corrected dataset one last time to make sure that the data all appears to be valid.

## 'data.frame':    168 obs. of  15 variables:
##  $ SpeciesID       : Factor w/ 168 levels "2014001","2014002",..: 37 43 45 46 47 59 48 22 28 33 ...
##  $ Replicate       : Factor w/ 6 levels "1","2","3","4",..: 5 5 2 2 5 5 5 6 5 6 ...
##  $ Collector       : Factor w/ 13 levels " Haskell, D. ",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ State           : Factor w/ 1 level "Massachusetts": 1 1 1 1 1 1 1 1 1 1 ...
##  $ County          : Factor w/ 2 levels "Bristol","Plymouth": 2 2 2 2 2 2 2 2 2 2 ...
##  $ City            : Factor w/ 3 levels "Brockton","E. Bridgewater",..: 1 1 1 1 1 2 1 1 1 1 ...
##  $ Location        : Factor w/ 7 levels "BeaverBrook",..: 1 1 1 1 1 3 7 2 2 1 ...
##  $ Latitude        : num  42.1 42.1 42.1 42.1 42.1 ...
##  $ Longitude       : num  71 71 71 71 71 ...
##  $ Date            : Factor w/ 5 levels "7/17/14","7/18/14",..: 5 5 5 5 5 5 5 4 4 4 ...
##  $ CollectionMethod: Factor w/ 2 levels "AerialNet","PanTrap": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Bowl_Color      : Factor w/ 5 levels "Blue","None",..: 4 1 4 1 1 1 4 1 1 5 ...
##  $ Genus           : Factor w/ 14 levels "Agapostemon",..: 1 1 1 1 1 1 2 5 5 5 ...
##  $ Species         : Factor w/ 10 levels "","Andreniformis",..: 8 8 8 1 8 8 1 3 3 3 ...
##  $ Notes           : Factor w/ 7 levels "","#1","#2","Female",..: 1 1 1 1 1 1 3 7 7 7 ...