Atop Darien

Bee Curiosity

Using R for data validation: Checking the native bee dataset from the Massasoit’s summer undergraduate research project

Leave a comment

Using R for data validation: Checking the native bee dataset

This is post walking you through the very important, but often overlooked, process of data validation. I will show you how to use R to check for errors in the dataset on native bees from the summer research project.

Examining the structure of the bee data

We use the str() function to examine the structure of the dataset. This will tell us what type of variables we have, the type of data we have in each variable, and is a great first way to check for any errors in the dataset.

str(Bees.df)
## 'data.frame':    170 obs. of  14 variables:
##  $ SpeciesID                              : int  2014096 2014100 2014108 2014109 2014115 2014116 2014117 2014118 2014119 2014120 ...
##  $ Collector                              : Factor w/ 14 levels " Haskell, D. ",..: 7 4 4 4 6 14 14 14 13 2 ...
##  $ State                                  : Factor w/ 1 level "Massachusetts": 1 1 1 1 1 1 1 1 1 1 ...
##  $ County                                 : Factor w/ 2 levels "Bristol","Plymouth": 2 2 2 2 2 2 2 2 2 2 ...
##  $ City                                   : Factor w/ 3 levels "Brockton","E. Bridgewater",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Location                               : Factor w/ 7 levels "BeaverBrook",..: 2 2 2 2 1 2 2 2 5 5 ...
##  $ Latitude                               : num  42.4 42.4 42.4 42.4 42.4 ...
##  $ Longitude                              : num  70.6 70.6 70.6 70.6 70.6 ...
##  $ Date                                   : Factor w/ 5 levels "7/17/14","7/18/14",..: 1 1 1 1 4 1 2 3 4 4 ...
##  $ CollectionMethod..PanTrap.or.AerialNet.: Factor w/ 2 levels "AerialNet","PanTrap": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bowl_Color                             : Factor w/ 6 levels "Blue","None",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Genus                                  : Factor w/ 14 levels "Agapostemon",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Species                                : Factor w/ 10 levels "","Andreniformis",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ Notes                                  : Factor w/ 7 levels "","#1","#2","Female",..: 1 1 1 1 1 1 1 1 1 1 ...

Let’s examine the SpeciesID column to make sure that each bee has a unique identifier and that there has been no duplicate or skipped numbers. There should be 170 unique identifying numbers, meaning we should see a SpeciesID range from 2014001 to 2014170, so let’s check the range first to see if we get 2014001 to 2014170.
The reason I know there are 170 observations is because at the top of the structure output you see ## ‘data.frame’: 170 obs. of 14 variables:

range(Bees.df$SpeciesID)
## [1] 2014001 2014166

Oh no, that’s a problem we have a range that only goes to 2014166, when it should go to 2014170. This means one of two things, either the data was entered incorrectly for a few specimens or that some species have the same number.

Before we try to figure out how to fix this problem, I want you to think about how this would be extremely difficult to catch if you had to do it either in Excel or actually by looking at each individual bee in the collection.

Ok, so R recognizes SpeciesID as an integer, like the phosphate data for example, but SpeciesID is really a unique character for each bee, not an integar, so lets change it to a factor first.

SpeciesID <- as.factor(Bees.df$SpeciesID) #use the as.factor to change an integar variable to a factor
summary(SpeciesID)
## 2014064 2014065 2014072 2014080 2014085 2014086 2014122 2014123 2014001 
##       2       2       2       2       2       2       2       2       1 
## 2014002 2014003 2014004 2014005 2014006 2014007 2014008 2014009 2014010 
##       1       1       1       1       1       1       1       1       1 
## 2014011 2014012 2014013 2014014 2014015 2014016 2014017 2014018 2014019 
##       1       1       1       1       1       1       1       1       1 
## 2014020 2014021 2014022 2014023 2014024 2014025 2014026 2014027 2014028 
##       1       1       1       1       1       1       1       1       1 
## 2014029 2014030 2014031 2014032 2014033 2014034 2014035 2014036 2014037 
##       1       1       1       1       1       1       1       1       1 
## 2014038 2014039 2014040 2014041 2014042 2014043 2014044 2014045 2014046 
##       1       1       1       1       1       1       1       1       1 
## 2014047 2014048 2014049 2014050 2014052 2014053 2014054 2014055 2014056 
##       1       1       1       1       1       1       1       1       1 
## 2014057 2014058 2014059 2014060 2014061 2014062 2014063 2014066 2014067 
##       1       1       1       1       1       1       1       1       1 
## 2014068 2014069 2014070 2014071 2014073 2014074 2014075 2014076 2014077 
##       1       1       1       1       1       1       1       1       1 
## 2014078 2014079 2014081 2014082 2014083 2014084 2014087 2014088 2014089 
##       1       1       1       1       1       1       1       1       1 
## 2014090 2014091 2014092 2014093 2014094 2014095 2014096 2014097 2014098 
##       1       1       1       1       1       1       1       1       1 
## (Other) 
##      63

The output for the summary file is quite long, let’s just look at the first part of it.

x <- summary(SpeciesID) #I created a variable to easier explore the dupicates in the dataset
x[1:8] #Look at the summary(SpeciesID) above, you will notice that the first 8 SpeciesID values have a 2 associated with it, meaning there are two entries for each unique SpeciesID
## 2014064 2014065 2014072 2014080 2014085 2014086 2014122 2014123 
##       2       2       2       2       2       2       2       2

In the dataset, there eight species identification numbers that are assigned to two bees. SpeciesID
2014064, 2014065, 2014072, 2014080, 2014085, 2014086, 2014122, 2014123. Here is the uncorrected dataset. After determining the problems with the SpeciesID numbers, Adam and I went back to the dataset and found out the bees that had duplicated entries.

Screenshot of the Excel Dataset

I’ve highlighted four of the eight errors in the dataset and Adam and I corrected each issue that has been identified. 4 bees had the same number, 2014164 and 2014165 were entered as 2014064 and 2014065, so that was fixed as well. Lastly, one entry was entered twice. Thus, instead of having 170 collected bees, we have 169 collected bees.

Now that we’ve corrected the SpeciesID errors, let’s check to make sure there are not anymore problems with that column.

Let’s look at the structure of the corrected dataset and also the range of SpeciesID, which should be from 2014001 to 2014169.

str(Corrected.df)
## 'data.frame':    168 obs. of  15 variables:
##  $ SpeciesID       : int  2014037 2014043 2014045 2014046 2014047 2014059 2014048 2014022 2014028 2014033 ...
##  $ Replicate       : int  5 5 2 2 5 5 5 6 5 6 ...
##  $ Collector       : Factor w/ 13 levels " Haskell, D. ",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ State           : Factor w/ 1 level "Massachusetts": 1 1 1 1 1 1 1 1 1 1 ...
##  $ County          : Factor w/ 2 levels "Bristol","Plymouth": 2 2 2 2 2 2 2 2 2 2 ...
##  $ City            : Factor w/ 3 levels "Brockton","E. Bridgewater",..: 1 1 1 1 1 2 1 1 1 1 ...
##  $ Location        : Factor w/ 7 levels "BeaverBrook",..: 1 1 1 1 1 3 7 2 2 1 ...
##  $ Latitude        : num  42.1 42.1 42.1 42.1 42.1 ...
##  $ Longitude       : num  71 71 71 71 71 ...
##  $ Date            : Factor w/ 5 levels "7/17/14","7/18/14",..: 5 5 5 5 5 5 5 4 4 4 ...
##  $ CollectionMethod: Factor w/ 2 levels "AerialNet","PanTrap": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Bowl_Color      : Factor w/ 5 levels "Blue","None",..: 4 1 4 1 1 1 4 1 1 5 ...
##  $ Genus           : Factor w/ 14 levels "Agapostemon",..: 1 1 1 1 1 1 2 5 5 5 ...
##  $ Species         : Factor w/ 10 levels "","Andreniformis",..: 8 8 8 1 8 8 1 3 3 3 ...
##  $ Notes           : Factor w/ 7 levels "","#1","#2","Female",..: 1 1 1 1 1 1 3 7 7 7 ...
range(Corrected.df$SpeciesID)
## [1] 2014001 2014169
check1 <- as.factor(Corrected.df$SpeciesID)
x <- summary(check1)
x[1:10]
## 2014001 2014002 2014003 2014004 2014005 2014006 2014007 2014008 2014009 
##       1       1       1       1       1       1       1       1       1 
## 2014010 
##       1

Looks like all the problems with the speciesID column have been corrected. Now we need to check the other columns to see if there are any problems there as well.

str(Bees.df)
## 'data.frame':    170 obs. of  14 variables:
##  $ SpeciesID                              : int  2014096 2014100 2014108 2014109 2014115 2014116 2014117 2014118 2014119 2014120 ...
##  $ Collector                              : Factor w/ 14 levels " Haskell, D. ",..: 7 4 4 4 6 14 14 14 13 2 ...
##  $ State                                  : Factor w/ 1 level "Massachusetts": 1 1 1 1 1 1 1 1 1 1 ...
##  $ County                                 : Factor w/ 2 levels "Bristol","Plymouth": 2 2 2 2 2 2 2 2 2 2 ...
##  $ City                                   : Factor w/ 3 levels "Brockton","E. Bridgewater",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Location                               : Factor w/ 7 levels "BeaverBrook",..: 2 2 2 2 1 2 2 2 5 5 ...
##  $ Latitude                               : num  42.4 42.4 42.4 42.4 42.4 ...
##  $ Longitude                              : num  70.6 70.6 70.6 70.6 70.6 ...
##  $ Date                                   : Factor w/ 5 levels "7/17/14","7/18/14",..: 1 1 1 1 4 1 2 3 4 4 ...
##  $ CollectionMethod..PanTrap.or.AerialNet.: Factor w/ 2 levels "AerialNet","PanTrap": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bowl_Color                             : Factor w/ 6 levels "Blue","None",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Genus                                  : Factor w/ 14 levels "Agapostemon",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Species                                : Factor w/ 10 levels "","Andreniformis",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ Notes                                  : Factor w/ 7 levels "","#1","#2","Female",..: 1 1 1 1 1 1 1 1 1 1 ...

Let’s look at the other columns:

  1. State: Factor w/ 1 level “Massachusetts”
    – One state, that’s correct
  2. County: Factor w/ 2 levels “Bristol”,“Plymouth”
    – Two counties, that’s correct
  3. City: Factor w/ 3 levels “Brockton”,“E. Bridgewater”
    – Three cities, that’s correct
  4. Location: Factor w/ 7 levels “BeaverBrook”
    – Seven field sites, that’s correct
  5. CollectionMethod: Factor w/ 2 levels “AerialNet”,“PanTrap”:
    – Two collection methods, that’s correct
  6. Bowl_Color: Factor w/ 6 levels “Blue”,“None”
    – Six bowl colors, that’s wrong. There should be 5 bowl colors: none, Yellow, Blue, Orange, and White.
levels(Bees.df$Bowl_Color)
## [1] "Blue"   "None"   "Orange" "White"  "yellow" "Yellow"
Bees.df$Bowl_Color[85]
## [1] yellow
## Levels: Blue None Orange White yellow Yellow

Notice, that by typing in a capital and lower case for Yellow, R recognizes that as two factors instead of what we were actually trying to show with a single factor. In row 85, the bowl color was typed as “yellow” instead of “Yellow”. Bees.df$Bowl_Color[85] That’s easy to fix.

Bees.df$Bowl_Color[Bees.df$Bowl_Color == "yellow"] <- "Yellow"
Bees.df$Bowl_Color[85]
## [1] Yellow
## Levels: Blue None Orange White yellow Yellow

Let’s check the corrected dataset one last time to make sure that the data all appears to be valid.

## 'data.frame':    168 obs. of  15 variables:
##  $ SpeciesID       : Factor w/ 168 levels "2014001","2014002",..: 37 43 45 46 47 59 48 22 28 33 ...
##  $ Replicate       : Factor w/ 6 levels "1","2","3","4",..: 5 5 2 2 5 5 5 6 5 6 ...
##  $ Collector       : Factor w/ 13 levels " Haskell, D. ",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ State           : Factor w/ 1 level "Massachusetts": 1 1 1 1 1 1 1 1 1 1 ...
##  $ County          : Factor w/ 2 levels "Bristol","Plymouth": 2 2 2 2 2 2 2 2 2 2 ...
##  $ City            : Factor w/ 3 levels "Brockton","E. Bridgewater",..: 1 1 1 1 1 2 1 1 1 1 ...
##  $ Location        : Factor w/ 7 levels "BeaverBrook",..: 1 1 1 1 1 3 7 2 2 1 ...
##  $ Latitude        : num  42.1 42.1 42.1 42.1 42.1 ...
##  $ Longitude       : num  71 71 71 71 71 ...
##  $ Date            : Factor w/ 5 levels "7/17/14","7/18/14",..: 5 5 5 5 5 5 5 4 4 4 ...
##  $ CollectionMethod: Factor w/ 2 levels "AerialNet","PanTrap": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Bowl_Color      : Factor w/ 5 levels "Blue","None",..: 4 1 4 1 1 1 4 1 1 5 ...
##  $ Genus           : Factor w/ 14 levels "Agapostemon",..: 1 1 1 1 1 1 2 5 5 5 ...
##  $ Species         : Factor w/ 10 levels "","Andreniformis",..: 8 8 8 1 8 8 1 3 3 3 ...
##  $ Notes           : Factor w/ 7 levels "","#1","#2","Female",..: 1 1 1 1 1 1 3 7 7 7 ...
Advertisements

Author: Sean Kent

I am a naturalist, nature photographer, field ecologist, and educator. I have an external fondness for all creatures great and small, especially native bees and flowering plants.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s