saramortara/data_cleaning

Basic workflow for biodiversity data cleaning using R

0. Loading packages

For this exercise, we will use the R environment. If you need to download it, go to https://www.r-project.org/. We strongly suggest using an editor, and we recommend RStudio (https://rstudio.com).

For this tutorial, you will need the R packages rgbif, Taxonstand, CoordinateCleaner, and maps. If you don't have them installed, use the following commands:

install.packages("rgbif")
install.packages("Taxonstand")
install.packages("CoordinateCleaner")
install.packages("maps")
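
If some of these packages are already installed, there is no need to reinstall them. A minimal sketch, assuming a hypothetical helper named missing_packages() (not part of the original tutorial), that lists only the packages that still need to be installed:

```r
# Hypothetical helper: returns the names in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

pkgs <- c("rgbif", "Taxonstand", "CoordinateCleaner", "maps")
to_install <- missing_packages(pkgs)
to_install

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install line is left commented out so that running the sketch does not change your library by accident.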

Then, we'll start by loading the packages.

library(rgbif)
library(Taxonstand)
library(CoordinateCleaner)
library(maps)

1. Getting the data

First, let's download occurrence data for a South American tree species, Cariniana legalis (Kuntze), from the Lecythidaceae family.


species <- "Cariniana legalis"
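# the records are in the $data element of the result;
# newer rgbif versions no longer accept the `return` argument
occs <- occ_search(scientificName = species)$data
nrow(occs) # number of records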

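In the raw data, we have 500 records. This is the default value of the limit argument of occ_search(), so more records may be available for the species.

Column names returned from GBIF follow the Darwin Core standard (https://dwc.tdwg.org).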

colnames(occs)
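
The full table has many columns, and it is often easier to look at a few Darwin Core terms at a time. A small sketch with a toy data frame standing in for occs (the values are invented, and which columns are present in your own download is an assumption to verify with colnames()):

```r
# Toy stand-in for `occs`; values are invented for illustration.
occs_toy <- data.frame(
  key = c("1", "2", "3"),
  scientificName = c("Cariniana legalis", "Cariniana legalis",
                     "Couratari legalis"),
  decimalLatitude = c(-22.5, NA, -19.9),
  decimalLongitude = c(-43.2, -44.0, -40.5),
  basisOfRecord = c("PRESERVED_SPECIMEN", "HUMAN_OBSERVATION",
                    "PRESERVED_SPECIMEN")
)

# Columns we would like to inspect; keep only those that exist.
wanted <- c("key", "scientificName", "decimalLatitude",
            "decimalLongitude", "year")
present <- intersect(wanted, colnames(occs_toy))
occs_small <- occs_toy[, present, drop = FALSE]
head(occs_small)
```

Using intersect() avoids an error when a requested column (here, year) is absent from the table.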

Exporting raw data

To guarantee that all steps are documented, saving the raw data is essential. We will create a directory to hold the data and then export the data as a CSV file (a comma-separated text file).

dir.create("data")
write.csv(occs, 
          "data/raw_data.csv", 
          row.names = FALSE)
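
Note that dir.create() gives a warning if the directory already exists. A minimal sketch of a guarded export followed by a read-back check, using a toy data frame and a temporary directory (both are assumptions, chosen so the example does not touch your project files):

```r
# Toy data and a temporary output directory, for illustration only.
toy <- data.frame(key = c("1", "2"), decimalLatitude = c(-22.5, -19.9))
out_dir <- file.path(tempdir(), "data")

# Create the directory only if it does not exist yet.
if (!dir.exists(out_dir)) dir.create(out_dir, recursive = TRUE)

out_file <- file.path(out_dir, "raw_data.csv")
write.csv(toy, out_file, row.names = FALSE)

# Read the file back and confirm that no rows were lost.
check <- read.csv(out_file)
nrow(check) == nrow(toy)
```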

2. Checking species taxonomy

Let's check the unique entries for the species name we just searched.

sort(unique(occs$scientificName))

In this particular case, we have two synonyms, Cariniana brasiliensis and Couratari legalis. The GBIF data already include a column indicating the current taxonomic status of each name:

table(occs$taxonomicStatus)

Let's use the function TPL() from the Taxonstand package to check whether the taxonomic updates in the GBIF data are correct. This function takes a vector of species names and performs both orthographic and nomenclatural checks. Nomenclatural checking follows The Plant List.

We will first generate a list of unique species names, check it, and then join the result back to the data. This is preferable because each name only needs to be checked once, which makes the workflow faster when working with several species.

species.names <- unique(occs$scientificName) 
tax.check <- TPL(species.names)

Let's check the output:

tax.check

Note that the function adds several new variables to the input data, including columns such as New.Genus and New.Species that hold the accepted name. We should adopt these names if the column New.Taxonomic.status is filled with "Accepted".

We will merge the new genus and species and then add them to the original data.

# creating new object w/ original and new names after TPL
new.tax <- data.frame(scientificName = species.names, 
                      genus.new.TPL = tax.check$New.Genus, 
                      species.new.TPL = tax.check$New.Species,
                      status.TPL = tax.check$Taxonomic.status,
                      scientificName.new.TPL = paste(tax.check$New.Genus,
                                                     tax.check$New.Species)) 
# now we are merging raw data and checked data
occs.new.tax <- merge(occs, new.tax, by = "scientificName")
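
The unique-then-merge pattern used here can be illustrated without calling TPL(). The sketch below uses a toy occurrence table and a hand-written lookup table standing in for the TPL() output; the statuses and accepted names are placeholders, not real results from The Plant List:

```r
# Toy occurrence records (invented keys).
occs_toy <- data.frame(
  key = c("1", "2", "3", "4"),
  scientificName = c("Cariniana legalis", "Couratari legalis",
                     "Cariniana legalis", "Cariniana brasiliensis")
)

# One row per unique name; placeholder for the checked taxonomy.
names_toy <- unique(occs_toy$scientificName)
lookup <- data.frame(
  scientificName = names_toy,
  scientificName.new = "Cariniana legalis",
  status = ifelse(names_toy == "Cariniana legalis", "Accepted", "Synonym")
)

# Each occurrence row picks up the checked name for its original name.
merged <- merge(occs_toy, lookup, by = "scientificName")
nrow(merged) == nrow(occs_toy)
table(merged$status)
```

Comparing the number of rows before and after the merge is a quick way to confirm that no record was dropped or duplicated.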

Exporting data after taxonomy check

To guarantee the documentation of all steps, we will export the data after the taxonomy check.

write.csv(occs.new.tax, 
          "data/data_taxonomy_check.csv", 
          row.names = FALSE)

3. Checking species' coordinates

First, let's visually inspect the coordinates in the raw data.

plot(decimalLatitude ~ decimalLongitude, data = occs)
map("world", add = TRUE)

Now we will use the function clean_coordinates() from the CoordinateCleaner package to clean the species records. This function checks for common errors in coordinates such as institutional coordinates, sea coordinates, outliers, zeros, centroids, etc. It does not accept missing values (represented in R as NA), so we will first select only the records that have a numerical value for both latitude and longitude.

Note: at this point, having a unique ID for each observation is essential. The raw data already provides one in the column gbifID; the key column, used for the merge later in this section, holds the same GBIF occurrence identifier.

occs.coord <- occs[!is.na(occs$decimalLatitude) 
                   & !is.na(occs$decimalLongitude),]
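
An equivalent way to write this filter is complete.cases() from base R, which keeps the rows with no NA in the selected columns. A small sketch on toy data (invented values):

```r
# Toy stand-in for `occs`, with some missing coordinates.
occs_toy <- data.frame(
  key = c("1", "2", "3", "4"),
  decimalLatitude = c(-22.5, NA, -19.9, -21.0),
  decimalLongitude = c(-43.2, -44.0, NA, -42.1)
)

coord_cols <- c("decimalLatitude", "decimalLongitude")
keep <- complete.cases(occs_toy[, coord_cols])
occs_coord_toy <- occs_toy[keep, ]

# How many records were dropped for lacking coordinates?
sum(!keep)
```

Counting the dropped records is worth doing at this step, since they will not receive any coordinate flags later on.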

Now that we don't have NA in latitude or longitude, we can perform the coordinate cleaning.

# output w/ only potentially correct coordinates
geo.clean <- clean_coordinates(x = occs.coord, 
                               lon = "decimalLongitude",
                               lat = "decimalLatitude",
                               species = "species", 
                               value = "clean")

Let's plot the output of the clean data.

par(mfrow = c(1, 2))
plot(decimalLatitude ~ decimalLongitude, data = occs)
map("world", add = TRUE)
plot(decimalLatitude ~ decimalLongitude, data = geo.clean)
map("world", add = TRUE)
par(mfrow = c(1, 1))

When setting value = "clean", the function returns only the potentially correct coordinates. For checking and reproducibility, we want to save the full output with the flags generated by the routine. Let's try a different output.

occs.new.geo <- clean_coordinates(x = occs.coord, 
                                  lon = "decimalLongitude",
                                  lat = "decimalLatitude",
                                  species = "species", 
                                  value = "spatialvalid")
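
With value = "spatialvalid", the result keeps every record and adds one logical column per test, where FALSE marks a flagged record, plus an overall .summary column. To see which tests flag the most records, the flag columns can be counted. The sketch below uses a toy data frame; the column names .zer, .sea, and .otl mimic the CoordinateCleaner naming but are an assumption here, so check the names in your own output:

```r
# Toy stand-in for the flagged output; TRUE = passed, FALSE = flagged.
flags_toy <- data.frame(
  key = c("1", "2", "3", "4"),
  .zer = c(TRUE, TRUE, FALSE, TRUE),
  .sea = c(TRUE, FALSE, TRUE, TRUE),
  .otl = c(TRUE, TRUE, TRUE, TRUE)
)
# A record passes overall only if it passes every test.
flags_toy$.summary <- flags_toy$.zer & flags_toy$.sea & flags_toy$.otl

# Number of flagged (FALSE) records per test.
flag_cols <- c(".zer", ".sea", ".otl", ".summary")
n_flagged <- vapply(flags_toy[flag_cols], function(x) sum(!x), integer(1))
n_flagged
```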

Then, we merge the raw data with the cleaned data.

# merging w/ original data
occs.new.geo2 <- merge(occs, occs.new.geo, 
                       all.x = TRUE, 
                       by = "key") 
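
Because occs and occs.new.geo share all of the original columns, merging them by key alone duplicates every shared column with .x and .y suffixes. One possible alternative, sketched below on toy data, is to merge only the key and the flag columns; identifying flag columns by a leading dot is an assumption about the output naming:

```r
# Toy stand-in for the raw data.
occs_toy <- data.frame(
  key = c("1", "2", "3"),
  decimalLatitude = c(-22.5, NA, -19.9),
  decimalLongitude = c(-43.2, -44.0, -40.5)
)

# Toy stand-in for the "spatialvalid" output (records with coordinates only).
geo_toy <- occs_toy[!is.na(occs_toy$decimalLatitude), ]
geo_toy$.sea <- c(TRUE, FALSE)
geo_toy$.summary <- geo_toy$.sea

# Keep the key plus the columns whose names start with a dot.
flag_cols <- grep("^\\.", colnames(geo_toy), value = TRUE)
merged <- merge(occs_toy, geo_toy[, c("key", flag_cols)],
                by = "key", all.x = TRUE)
merged
```

With all.x = TRUE, records that had no coordinates stay in the table and get NA in the flag columns.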

Exporting the data after coordinate check

# same "data" directory created in step 1
write.csv(occs.new.geo2, 
          "data/data_coordinate_check.csv", 
          row.names = FALSE)

This is just a quick example of a data cleaning workflow using tools available in R.

About

Tutorial for biodiversity data cleaning for ENM 2020
