---
editor_options:
markdown:
wrap: sentence
---
# Error checking {#error-checking}
The most important part of analyzing camera trap data is checking and exploring your data!
Based on the projects we have worked on synthesizing multiple datasets from different sources... camera trappers are not doing a very good job of checking for errors.
Working in R makes it possible to rapidly check your data, ideally in almost real time as you collect it.
Ideally, you would download the data for your project at least once per month and check that 'everything' looks good.
But what constitutes 'everything'?
## Standardised exploration script
In the Wildlife Coexistence Lab we developed a standardized R script to check the data generated by camera trap projects.
This script is kept on our [WildCO Single Site Exploration GitHub page](https://github.com/WildCoLab/SingleSiteExploration).
Below we run through the important elements of checking camera trap data, and where there is a coding skill fundamental to the process, we explore it in more detail (a.k.a. `skill checks`).
*Let's go!*
First, open the `.Rproj` file you created in the [course preparation section](#prep).
Then click
`File` -\> `New file` -\> `Rscript` (alternatively you can use the `R Markdown` option if you are comfortable with that)
After the file has opened, immediately save it as `01_example_error_checking_and_export.R`.
We will usually make a new R script for each chapter - however, the error checking and analysis data creation chapters should be in the same document.
Second, read in our standardized example datasets:
```{r ch3_1, class.source="Rmain"}
# Load your data
pro <- read.csv("data/raw_data/example_data/proj.csv", header=T)
img <- read.csv("data/raw_data/example_data/img.csv", header=T)
dep <- read.csv("data/raw_data/example_data/dep.csv", header=T)
cam <- read.csv("data/raw_data/example_data/cam.csv", header=T)
```
Next, load in the packages we will use in this chapter - we give a brief description of them too.
Cut and paste the code block below.
```{r ch3_2, echo=T, results='hide',message=F, warning=F, class.source="Rmain"}
#Load Packages
list.of.packages <- c(
"leaflet", # creates interactive maps
"plotly", # creates interactive plots
"kableExtra", # Creates interactive tables
"tidyr", # A package for data manipulation
"dplyr", # A package for data manipulation
"viridis", # Generates colors for plots
"corrplot", # Plots pairwise correlations
"lubridate", # Easy manipulation of date objects
"taxize", # Package to check taxonomy
"sf") # Package for spatial data analysis
# Check which of them are missing from your library
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
# Install any that are missing, then load them all
if(length(new.packages)) install.packages(new.packages,repos = "http://cran.us.r-project.org")
lapply(list.of.packages, require, character.only = TRUE)
```
## Formatting dates
Every aspect of camera trapping involves manipulating date objects - calculating how long cameras were active, when detections occurred, working with timezones etc.
Thus, as a camera trapper working in R you need to be comfortable dealing with them.
Fortunately the process has been made far easier with the `lubridate` package.
The first dates we need to convert are those in the deployment (`dep`) datasheet - the start and end times of each period of camera activity.
The way `lubridate` works is you specify the order of the days, months, years, hours, minutes and seconds with the codes d, m, y, h, m, and s respectively.
### Skill check: `lubridate`
Try importing the 24th of December in a couple of formats.
Copy and run the following:
```{r ch3_3, eval =F, class.source="Rinfo"}
#library(lubridate)
# day-month-year
dmy("24-12-2022")
# year-month-day
ymd("2022-12-24")
```
The output should be identical.
Note that `lubridate` defaults to UTC unless otherwise specified.
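If you ever need to convince yourself what the timezone setting does, here is a quick base-R sketch (not part of the main workflow - the timestamp is made up): the same clock time interpreted in two zones differs by the UTC offset.
```{r tz_sketch, eval=F, class.source="Rinfo"}
# The same clock time interpreted in two different timezones
utc <- as.POSIXct("2022-12-24 12:00:00", tz = "UTC")
van <- as.POSIXct("2022-12-24 12:00:00", tz = "America/Vancouver")
# Vancouver is UTC-8 in December, so the two instants are 8 hours apart
difftime(van, utc, units = "hours")
```
In `lubridate` you would get the same behaviour by passing a `tz =` argument to `ymd_hms()`.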
Now the real power of `lubridate` lies in the fact that you can handle multiple different date formats in one column using the `parse_date_time()` function.
This sometimes happens - I usually blame Excel - but you could also be merging two data sets formatted in different ways.
Let's try it:
```{r ch3_4, eval=F, class.source="Rinfo"}
x <- c("24-12-2022", "2022-12-24", "12-24-2022") #Three different date formats
parse_date_time(x, c("ymd", "dmy", "mdy"))
```
Again, they should all give the same output!
Next, let's calculate the amount of time that has elapsed between two dates - a fundamental operation in the management of camera data.
To do this we first create an interval object `interval(date1, date2)`, then ask to return the object in days `/ddays(1)`.
Let's try it:
```{r ch3_5, eval=T, class.source="Rinfo"}
# Specify your start and end dates
start <- ymd("2021-10-13")
end <- ymd("2021-12-11")
```
```{r ch3_6, eval=F, class.source="Rinfo"}
# Specify the interval, and put it in days
interval(start, end)/ddays(1)
# interval() creates an "interval object" - run that part on its own to see what it looks like
# ddays() converts the native units of date objects in R (seconds) to days - run it on its own to see.
```
How many days elapsed between those two dates?
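If you want to sanity-check the answer without `lubridate`, base R's `difftime()` on `Date` objects gives the same number of days (a sketch, using the same two dates):
```{r difftime_sketch, eval=F, class.source="Rinfo"}
# Base-R equivalent of interval(start, end)/ddays(1)
start_b <- as.Date("2021-10-13")
end_b   <- as.Date("2021-12-11")
as.numeric(difftime(end_b, start_b, units = "days"))  # 59
```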
We can change the units of the output by changing the denominator:
```{r ch3_7, class.source="Rinfo"}
# Specify the interval, and put it in weeks
interval(start, end)/ddays(7)
```
And in decimal years:
```{r ch3_8, class.source="Rinfo"}
interval(start, end)/ddays(365)
```
Easy!
If you want to learn more about the amazing functionality of the 'lubridate' package - check out the pdf [Lubridate Cheatsheet](https://rawgit.com/rstudio/cheatsheets/main/lubridate.pdf)
### Deployment dates
Let's get back to the camera data.
Which `lubridate` format should we use for `r dep$start_date[1]`?
`ymd()` should do the job.
Let's convert the date columns from character strings to date objects:
```{r ch3_9, class.source="Rmain"}
# start dates
dep$start_date <- ymd(dep$start_date)
# end dates
dep$end_date <- ymd(dep$end_date)
```
Now let's make a new column in the deployment data called `days`, and calculate the interval for all the deployments:
```{r ch3_10, class.source="Rmain"}
dep$days <- interval(dep$start_date, dep$end_date)/ddays(1)
```
We should then check the range of dates the cameras were active for.
Things to look out for are:
**- 0's** A value of zero would mean a deployment which started and ended on the same day -\> it typically denotes a camera which malfunctioned instantly.
**- NA's** This is either an `end_date` which is NA e.g. if the camera was stolen, or it could be a date value which failed to parse e.g. if you had a typo in your date column such as `ymd("202-212-24")` it would return `NA`
**- Negative numbers** Are more common than you think... someone probably got the start and end dates the wrong way round when entering data.
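All three failure modes can be flagged with a single subset. Here is a minimal sketch on a made-up deployment table (the `toy` data frame and its values are hypothetical):
```{r flag_sketch, eval=F, class.source="Rinfo"}
# Hypothetical deployments illustrating the three failure modes
toy <- data.frame(deployment_id = c("A", "B", "C", "D"),
                  days          = c(60, 0, NA, -12))
# Flag anything that is missing, zero, or negative
toy[is.na(toy$days) | toy$days <= 0, ]  # returns rows B, C and D
```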
Let's look at the range of values we have:
```{r ch3_11, eval=F, class.source="Rinfo"}
summary(dep$days)
```
Cameras were active between 15 and 234 days per deployment, and we have an NA.
Let's see what it relates to:
```{r ch3_12, class.source="Rinfo"}
dep[is.na(dep$days)==T,] %>%
kbl() %>%
kable_styling(full_width = T) %>%
kableExtra::scroll_box(width = "100%")
```
It was a camera that was on a 'HumanUse' feature - this camera was stolen!
### Image dates
We next need to convert the `img$timestamp` column.
What `lubridate` format is required for a date which looks like **`r img$timestamp[1]`**?
```{r ch3_13, eval=F, class.source="Rinfo"}
ymd_hms("2015-11-21 03:03:44")
```
Let's apply this to our image dataset:
```{r ch3_14, class.source="Rmain"}
img$timestamp <- ymd_hms(img$timestamp)
```
And do a quick check to see if all the dates parsed correctly.
First check the range:
```{r ch3_15, class.source="Rinfo"}
range(img$timestamp)
```
We have data from early 2018 to late 2019.
And check for NA's:
```{r ch3_16, class.source="Rinfo"}
table(is.na(img$timestamp))
```
No NA's - great!
## Basic trapping summaries
Now that our camera trap data are loaded into R, we can very quickly find out summary information about the dataset.
These can feed directly into the methods section of your report/paper.
First let's count the number of unique locations, deployments, image labels, and blanks:
```{r ch3_17, echo=T, class.source="Rinfo"}
# Count the numbers of locations, deployments, image labels, and blanks
paste(length(unique(dep$placename)), "locations")
paste(length(unique(dep$deployment_id)), "deployments")
paste(nrow(img), "image labels")
paste(nrow(img[img$is_blank == TRUE,]), "blanks")
```
## Error checks
Let's start with the fundamental error checks required with a new data set:
- Camera locations
- Deployment date checks
- Image and deployment matching
- Taxonomy
- Diel time
### Camera locations
A common mistake in camera trap data sets is that the locations are not where they are supposed to be.
The safest way to check your data is to plot them... preferably in R!
After synthesizing \>100 different projects from different data contributors for one project, we found \~20%(!) of submissions had clear and obvious location errors (e.g. a camera station in the middle of the Atlantic).
Don't just take my word for it:
```{r ch3_21, echo=F, out.width="50%"}
knitr::include_graphics("images/exploration/project_your_locations.PNG")
```
(p.s. [Mason is well worth a follow on Twitter](https://twitter.com/masonfidino) )
#### Skill check: Leaflet maps
Below we make use of the fantastic 'leaflet' package to produce interactive plots to help us check our camera locations.
`leaflet` has a tonne of different customization options and freely available, high resolution base layers to choose from.
Note - `leaflet` is best used with tidyverse 'pipe' notation - `%>%`.
It allows you to add successive operations in the order that the elements occur.
The simplest version of a leaflet map looks like this:
```{r map1, echo=T, class.source="Rinfo"}
m <- leaflet() %>% # call leaflet
addTiles() %>% # add the default basemap
addCircleMarkers( # Add circles for stations
lng=dep$longitude, lat=dep$latitude)
m # return the map
```
This is great!
We can zoom in and out using the +/- buttons in the top left-hand corner, and the default basemap is OpenStreetMap.
The camera stations all appear to be in the right place (we don't have any in the Atlantic).
But wouldn't it be great to see the `dep$placename` when we click over a symbol?
Run the following:
```{r map2, echo=T, eval=F, class.source="Rinfo"}
m <- leaflet() %>%
addTiles() %>%
addCircleMarkers(
lng=dep$longitude, lat=dep$latitude,
popup=paste(dep$placename)) # include a popup with the placename!
m
```
Okay, but a map isn't really useful until we have some satellite imagery, right?
Easy:
```{r map3, echo=T, class.source="Rinfo"}
m <- leaflet() %>%
addProviderTiles(providers$Esri.WorldImagery) %>% # Add Esri World Imagery
addCircleMarkers(
lng=dep$longitude, lat=dep$latitude,
popup=paste(dep$placename)) # include a popup with the placename!
m
```
Zoom in - what can you tell me about where the stations are located?
Can you spot any differences between the stations?
*Clue: look for lines on the landscape*.
These lines are linear features related to oil and gas exploration, some cameras are deployed on them, others away from them.
#### Making corrections
As you can see, we have one deployment location that is a long way from the others. It almost looks like it belongs to another project?! Let's take a look at all of the deployments from that location (`ALG069`):
```{r corrections, echo=T, class.source="Rinfo"}
dep[dep$placename=="ALG069",c("deployment_id", "placename", "longitude", "latitude")]
```
It looks like there is a typo in one of the coordinates: -112.5075 -> -113.5075. Let's correct it:
```{r corrections2, echo=T, class.source="Rmain"}
dep$longitude[dep$placename=="ALG069"] <- -112.5075
```
#### The ultimate leaflet map
We will need to check our correction has worked. Let's also color camera locations based on their `dep$feature_type`, include them in the legend, and have their names show up when we click on them too!
```{r map4, echo=T, class.source="Rmain"}
# First, set a single categorical variable of interest from station covariates for summary graphs. If you do not have an appropriate category use "project_id".
category <- "feature_type"
# We first convert this category to a factor with discrete levels
dep[,category] <- factor(dep[,category])
# then use the turbo() function to assign each level a color
col.cat <- turbo(length(levels(dep[,category])))
# then we apply it to the dataframe
dep$colours <- col.cat[dep[,category]]
m <- leaflet() %>%
addProviderTiles(providers$Esri.WorldImagery, group="Satellite") %>%
addTiles(group="Base") %>% # Include a basemap option too
addCircleMarkers(lng=dep$longitude, lat=dep$latitude,
# Colour the markers depending on the 'feature type'
color=dep$colours,
# Add a popup of the placename and feature_type together
popup=paste(dep$placename, dep[,category])) %>%
# Add a legend explaining what is going on
addLegend("topleft", colors = col.cat, labels = levels(dep[,category]),
title = category,
opacity = 1) %>%
# add a layer control box to toggle between the layers
addLayersControl(
baseGroups = c("Satellite", "Base"))
m
```
If you click on a point you will see its corresponding `placename` and `feature_type` - so you can find the problem data.
You can also check your treatment categories using the key.
If you zoom in, all the "offline" locations should be >100m away from the linear features, the others on top of them.
For more examples of leaflet in R, see [RStudio's leaflet tutorial](https://rstudio.github.io/leaflet/).
**Check the distance between camera pairs**
Sometimes the coordinates of a camera station are accidentally repeated in the deployment data, which can be very hard to spot on a map as the points will overlay perfectly.
The way we check this is to calculate the pairwise distance between all of the unique deployments in the project.
This helps us in two ways:
- we can find "cryptic" duplication events in the deployment coordinates
- this distance is often reported in manuscript method sections
In the following code block we first make use of the simple features (`sf`) package - tools which make spatial operations you would normally perform in ArcMap very easy (e.g. plotting and manipulating polygons).
More on that later!
```{r ch3_22, class.source="Rmain"}
# create a list of all the non-duplicated placenames
camera_locs <- dep %>%
dplyr::select(placename, latitude, longitude) %>%
unique() %>% # remove duplicated rows (rows where the placename and coordinates match)
st_as_sf(coords = c("longitude", "latitude"), crs = "+proj=longlat") # Convert to `sf` format
```
First let's check that none of the `placenames` are duplicated - this would suggest a placename with two sets of coordinates.
If all is well, this should return an empty dataframe.
```{r ch3_23, class.source="Rinfo"}
# Check that there are no duplicated stations
camera_locs[duplicated(camera_locs$placename)==T,]
```
If it returns a list of \<0 rows\>, we don't have duplicates with different coordinates.
Phew!
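To see what a 'cryptic' duplicate looks like before running the check on real data, here is a toy example (the `locs` data frame is invented). `unique()` removes exact repeats, so any placename that still appears twice afterwards must carry two different sets of coordinates:
```{r dupe_sketch, eval=F, class.source="Rinfo"}
# "S2" appears twice with different latitudes - a cryptic duplicate
locs <- data.frame(placename = c("S1", "S2", "S2"),
                   latitude  = c(53.1, 53.2, 53.9),
                   longitude = c(-112.1, -112.2, -112.2))
locs <- unique(locs)               # exact duplicates would vanish here
locs[duplicated(locs$placename), ] # the conflicting "S2" row remains
```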
Now let's crunch the numbers. Paste and run the following:
```{r ch3_24, class.source="Rmain"}
# distance matrix for all cameras
camera_dist <- st_distance(camera_locs) %>%
as.dist() %>%
usedist::dist_setNames(as.character(camera_locs$placename)) %>%
as.matrix()
#Make temporary camera_dist_mins by converting diagonals/zeros to 999999 so we can avoid the zeros when using which.min function to find nearest cameras
camera_dist_mins <- camera_dist + diag(999999,dim(camera_dist)[1])
#Create new empty dataframe for appending results to
camera_dist_list <- data.frame(focal_cam = character(),nearest_cam = character(), dist = double())
#Cycle through each column of camera_dist_mins
for (i in (1:dim(camera_dist_mins)[1]))
{
#Get index of minimum value of column i
t <- which.min(camera_dist_mins[,i])
#Combine relevant data into new_row
new_row <- data.frame(colnames(camera_dist_mins)[i],names(t),camera_dist_mins[t,i])
#Append the new_row to the accumulated results dataframe
camera_dist_list[nrow(camera_dist_list) + 1,] = new_row
}
```
Let's summarize the output:
```{r ch3_25, class.source="Rinfo"}
summary(camera_dist_list$dist)
```
So the largest distance between two cameras is `r paste0(round(max(camera_dist_list$dist),0), "m")`, the minimum is `r paste0(round(min(camera_dist_list$dist)), "m")` and on average it is `r paste0(round(mean(camera_dist_list$dist)), "m")`.
Again, put that straight into the methods section of your report/paper.
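The loop above boils down to a column-wise `which.min()` on the distance matrix. As an illustration, here is a compact base-R sketch on a made-up three-camera matrix (the distances are invented):
```{r nn_sketch, eval=F, class.source="Rinfo"}
# Toy symmetric distance matrix (metres) for three cameras
m <- matrix(c(  0, 100, 250,
              100,   0, 300,
              250, 300,   0), nrow = 3,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
diag(m) <- Inf                     # mask self-distances (same idea as the 999999 trick)
nearest <- apply(m, 2, which.min)  # row index of the nearest camera, per column
unname(rownames(m)[nearest])       # "B" "A" "A"
```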
#### Do all images have a deployment associated with them?
Another very useful check is to verify that all of the `placenames` have corresponding image data, and that all image data has corresponding deployment data!
You would be surprised how often this is not the case!
```{r ch3_26, class.source="Rmain"}
# Check that all the placenames in the images are represented in the deployments
# This code returns TRUE if they are and FALSE if not. We can then summarize with table()
table(unique(img$placename) %in% unique(dep$placename))
```
We have 38 TRUE's, which means all the images have deployment data.
Let's check that all the placenames also have image data:
```{r ch3_27, class.source="Rmain"}
# check all the placenames in deployments are represented in the images data
table(unique(dep$placename) %in% unique(img$placename))
```
Great.
If you see any FALSE observations - you either have image data or deployments missing.
Go back and check your raw data!
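When a FALSE does appear, `setdiff()` will tell you which placenames are the culprits. A quick sketch with invented placename vectors:
```{r mismatch_sketch, eval=F, class.source="Rinfo"}
img_places <- c("P1", "P2", "P9")  # placenames seen in the image data
dep_places <- c("P1", "P2", "P3")  # placenames seen in the deployment data
setdiff(img_places, dep_places)    # "P9" - images with no deployment record
setdiff(dep_places, img_places)    # "P3" - a deployment with no images
```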
### Camera activity checks
The next step is to plot out the camera activity at each unique place name to see when our cameras are functioning.
To make this plot we will need to use the `plotly` package for the first time.
We use this because the plots are interactive, just like with leaflet we can zoom in and zoom out and find problem observations.
It also dynamically changes the y-axis and x-axis labels to fit the data, which is very useful!
Let's experiment with some basic 'plotly' graphs to get warmed up:
#### Skill check: plotly
If you are familiar with making plots in base R or ggplot, hopefully `plotly` is not too intimidating.
Let us start with a basic scatter plot using the deployment data.
```{r ch3_28, warning=F, message=F, class.source="Rinfo"}
library(plotly)
fig <- plot_ly(data = dep, # Specify your data frame
x = ~longitude, y = ~latitude, # The x and y axis columns
type="scatter") # and the type of plot
fig
```
Like we said, this is very similar to conventional plotting tools.
However, things get different when we start to specify the style of the points.
In base R we might use `pch=` and `cex=` to change the style and size of the points, whereas with `plotly` we use the "marker" option, and include elements as a list.
```{r ch3_29, warning=F, message=F, class.source="Rinfo"}
library(plotly)
fig <- plot_ly(data = dep,
x = ~longitude, y = ~latitude,
color=~feature_type, # We can specify color categories
type="scatter",
marker=list(size=15)) # the default size is 10
fig
```
As ever- this just scratches the surface of the `plotly` package.
See the [Plotly graphing library](https://plotly.com/r/) for a wealth of options.
#### The ultimate plotly camera activity figure
In the following plot, black dots denote start and end dates, lines denote periods where a camera is active.
Each unique `placename` gets its own row on the plot - you can hover over the lines to get the `deployment_id`.
We will use a loop to build the different elements... you don't need to understand the code itself, just how to interpret the output.
Cut and paste the following:
```{r activity, echo=T, class.source="Rmain"}
# Call the plot
p <- plot_ly()
# We want a separate row for each 'placename' - so lets turn it into a factor
dep$placename <- as.factor(dep$placename)
# loop through each place name
for(i in seq_along(levels(dep$placename)))
{
#Subset the data to just that placename
tmp <- dep[dep$placename==levels(dep$placename)[i],]
# Order by date
tmp <- tmp[order(tmp$start_date),]
# Loop through each deployment at that placename
for(j in 1:nrow(tmp))
{
# Add a line to 'p'
p <- add_trace(p,
#Use the start and end date as x coordinates
x = c(tmp$start_date[j], tmp$end_date[j]),
#Use the counter for the y coordinates
y = c(i,i),
# State the type of chart
type="scatter",
# make a line that also has points
mode = "lines+markers",
# Add the deployment ID as hover text
hovertext=tmp$deployment_id[j],
# Color it all black
color=I("black"),
# Suppress the legend
showlegend = FALSE)
}
}
# Add a categorical y axis
p <- p %>% layout(yaxis = list(
ticktext = as.list(levels(dep$placename)),
tickvals = as.list(1:length(levels(dep$placename))),
tickmode = "array"))
p
```
**What do the gaps signify?** The breaks in the line signify periods when the camera at a location was not active.
You can see there was a point in 2018 when 9 out of 38 cameras had stopped working.
**Can you see any issues?** Yes!
Sometimes you will see a deployment a long way to the left or right of the plot, this is usually a date error (e.g. `ALG036`).
#### Corrections
We checked the original datasheets for `ALG036` and found out that the deployment end date for `ALG036_2019-04-04` was incorrectly entered. It was written as 2020 not 2019. Let's correct it:
```{r ch3_31a, eval=T, class.source="Rmain"}
dep$end_date[dep$deployment_id=="ALG036_2019-04-04"] <- ymd("2019-11-21")
#remember to format it as a date object
```
### Detection check
Once we are happy that our cameras were functioning when we expected them to be, we need to check whether all of our labelled images fall within the associated deployment periods.
To do this we build on the previous plot above, but also add in the image data over the top.
This plot can get very messy, so we divide it into sections of twenty deployments.
As before, black lines show an active camera.
Red dots show image detections at that time.
We only show the output for the first group of deployments, but you should do this for every single deployment you have!
*Note - the code below is complex, you don't have to understand it all unless you want to*
```{r ch3_31, eval=F, class.source="Rmain"}
# Make a separate plot for each group of 20 stations
# To do this make a plot dataframe
tmp <- data.frame("deployment_id"=unique(dep$deployment_id), "plot_group"=ceiling(1:length(unique(dep$deployment_id))/20))
dep_tmp <- left_join(dep,tmp, by="deployment_id")
for(i in 1:max(dep_tmp$plot_group))
{
# Call the plot
p <- plot_ly()
#Subset the data to just that placename
tmp <- dep_tmp[dep_tmp$plot_group==i,]
# Order by placename
tmp <- tmp[order(tmp$placename),]
# Loop through each deployment at that placename
for(j in 1:nrow(tmp))
{
#Subset the image data
tmp_img <- img[img$deployment_id==tmp$deployment_id[j],]
if(nrow(tmp_img)>0)
{
p <- add_trace(p,
#Use the start and end date as x coordinates
x = c(tmp_img$timestamp),
#Use the counter for the y coordinates
y = rep(j, nrow(tmp_img)),
# State the type of chart
type="scatter",
# make a line that also has points
mode = "markers",
# Add the deployment ID as hover text
hovertext=paste(tmp_img$genus,tmp_img$species),
# Color the markers red
marker = list(color = "red"),
# Suppress the legend
showlegend = FALSE)
}
# Add a line to 'p'
p <- add_trace(p,
#Use the start and end date as x coordinates
x = c(tmp$start_date[j], tmp$end_date[j]),
#Use the counter for the y coordinates
y = c(j,j),
# State the type of chart
type="scatter",
# make a line that also has points
mode = "lines",
# Add the deployment ID as hover text
hovertext=tmp$deployment_id[j],
# Color it all black
color=I("black"),
# Suppress the legend
showlegend = FALSE)
}
# Add custom y axis labels
p <- p %>% layout(yaxis = list(
ticktext = as.list(tmp$deployment_id),
tickvals = as.list(1:nrow(tmp)),
tickmode = "array"))
print(p)
}
```
```{r ch3_32, eval=T, echo=F}
# Make a separate plot for each group of 20 stations
# To do this make a plot dataframe
tmp <- data.frame("deployment_id"=unique(dep$deployment_id), "plot_group"=ceiling(1:length(unique(dep$deployment_id))/20))
dep_tmp <- left_join(dep,tmp, by="deployment_id")
for(i in 1:1)
{
# Call the plot
p <- plot_ly()
#Subset the data to just that placename
tmp <- dep_tmp[dep_tmp$plot_group==i,]
# Order by placename
tmp <- tmp[order(tmp$placename),]
# Loop through each deployment at that placename
for(j in 1:nrow(tmp))
{
#Subset the image data
tmp_img <- img[img$deployment_id==tmp$deployment_id[j],]
if(nrow(tmp_img)>0)
{
p <- add_trace(p,
#Use the start and end date as x coordinates
x = c(tmp_img$timestamp),
#Use the counter for the y coordinates
y = rep(j, nrow(tmp_img)),
# State the type of chart
type="scatter",
# make a line that also has points
mode = "markers",
# Add the deployment ID as hover text
hovertext=paste(tmp_img$genus,tmp_img$species),
# Color the markers red
marker = list(color = "red"),
# Suppress the legend
showlegend = FALSE)
}
# Add a line to 'p'
p <- add_trace(p,
#Use the start and end date as x coordinates
x = c(tmp$start_date[j], tmp$end_date[j]),
#Use the counter for the y coordinates
y = c(j,j),
# State the type of chart
type="scatter",
# make a line that also has points
mode = "lines",
# Add the deployment ID as hover text
hovertext=tmp$deployment_id[j],
# Color it all black
color=I("black"),
# Suppress the legend
showlegend = FALSE)
}
# Add custom y axis labels
p <- p %>% layout(yaxis = list(
ticktext = as.list(tmp$deployment_id),
tickvals = as.list(1:nrow(tmp)),
tickmode = "array"))
}
p
```
**What would a problem look like?**
If you have images (red dots) occurring outside a period of camera activity - that would indicate a mismatch between the deployment data and the image data. You would need to revisit your datasheets to see where the mismatch occurred.
If the error is in the deployment dates - correct them as above!
If the error is in the image metadata (i.e. camera was set to the wrong date), you have several options:
- If you are working in a platform like Wildlife Insights there is a date-time frameshift correction you can perform: see [The correcting timestamps section](https://www.wildlifeinsights.org/get-started/review-identifications#update-all)
- You can correct the underlying exif data of the images using [EXIF date changer](https://www.mammalweb.org/index.php/en/news/951-doh-correcting-date-and-time-problems-for-camera-trap-images)
- Finally, you could change the dates in R using `lubridate`. If you look at the deployment `ALG029_2019-04-02` you will see that this is what has happened here. We checked the datasheets and realised that the camera's timestamp was set to the incorrect month when the deployment began (2019-05-02 instead of 2019-04-02).
```{r ch3_33, eval=T, class.source="Rmain"}
# The camera collecting images in deployment "ALG029_2019-04-02"
# had its clock set one month fast (2019-05-02 instead of 2019-04-02)
# As there are 30 days in April, we subtract 30 days from all of the images in that deployment.
img[img$deployment_id=="ALG029_2019-04-02",]$timestamp <-
img[img$deployment_id=="ALG029_2019-04-02",]$timestamp - days(30)
# Easy!
```
You should repeat the plot above to check it has worked!
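If you would rather check the arithmetic of a shift like this before touching the real data, here is a base-R sketch on a single hypothetical timestamp:
```{r shift_sketch, eval=F, class.source="Rinfo"}
# A hypothetical camera timestamp recorded 30 days fast
wrong <- as.POSIXct("2019-05-10 12:00:00", tz = "UTC")
fixed <- wrong - as.difftime(30, units = "days")
format(fixed, "%Y-%m-%d %H:%M:%S")  # "2019-04-10 12:00:00"
```
Because the value is stored in UTC, subtracting whole days cannot introduce a daylight-saving offset.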
### Taxonomy check
Dealing with taxonomy in camera trap data sets can be a nightmare, particularly if your data labeling software does not give standardized lists of species (e.g. you are manually sorting images into folders). A species list is also something which is often produced for the appendix of a report or paper.
Let us start with building a list of our taxonomic classifications:
```{r ch3_34, class.source="Rmain"}
# First define vector of the headings you want to see (we will use this trick a lot later on)
taxonomy_headings <- c("class", "order", "family", "genus", "species", "common_name")
# Subset the image data to just those columns
tmp <- img[, colnames(img) %in% taxonomy_headings]
# Remove duplicates
tmp <- tmp[duplicated(tmp)==F,]
# Create an ordered species list
sp_list <- tmp[order(tmp$class, tmp$order, tmp$family, tmp$genus, tmp$species),]
# Create a column to the species list with genus and species pasted together
sp_list$sp <- paste(sp_list$genus, sp_list$species, sep=".")
# View the species list using kableExtra
sp_list %>%
kbl(row.names=F) %>%
kable_styling(full_width = T) %>%
kableExtra::scroll_box(width = "100%", height = "250px")
```
That is a lot of species classifications - are they all correct? If you are familiar with the species you may be able to tell by eye; if you are unfamiliar with them, you can use a package called `taxize`.
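Before reaching for an online service, a quick internal consistency check can catch some problems: each genus-species combination should appear exactly once in the list. A minimal sketch, assuming the `sp_list` built above:

```{r, eval=F, class.source="Rinfo"}
# Flag any genus.species combination that appears more than once -
# duplicates indicate inconsistent higher taxonomy or common names
sp_list$sp[duplicated(sp_list$sp)]
# An empty result (character(0)) means every species occurs exactly once
```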
### Skill Check: Taxize package
Let's start by seeing what `taxize` can do with a single species.
Run the following:
```{r ch3_35, eval=F, class.source="Rinfo"}
library(taxize)
gnr_resolve("Lynx canadensis")
```
For each hit in an online taxonomy database, you get a row in a dataframe and a confidence score in the identification.
Let's try misspelling a scientific name:
```{r ch3_36, eval=F, class.source="Rinfo"}
gnr_resolve("Lynx cramadensis")
```
Cool!
We can even recover incorrectly spelled names!
What about common names?
Well, there is a way we can get those too, using `sci2comm()`:
```{r ch3_37, eval=F, class.source="Rinfo"}
sci2comm("Lynx canadensis")
```
For more information [see the `Taxize` documentation](https://docs.ropensci.org/taxize/).
```{r ch3_38, message=F, warning=F, echo=F, eval=F,include=F, class.source="Rmain"}
# REMOVED AS TAXIZE IS UNRELIABLE
#Let us turn this code into a useful workflow.
#We will use a confidence threshold of \>90% as a validation threshold for our species designations.
#We make use of a loop to iterate through our species lists.
#if you want a tutorial on loops for R see: [Data #Carpentry:loops](https://swcarpentry.github.io/r-novice-inflammation/15-supp-loops-in-depth/)
# Add a column which states if it is in the databases
sp_list$verified <- NA
i<- 1
# Check if it is in a bunch of different taxonomic databases
for(i in 1:nrow(sp_list))
{
tmp <- gnr_resolve(sp_list$sp[i], data_source_ids= c(1,2,11,12, 163, 174)) # Note we restrict the databases to a few big ones, there are some weird ones in there.
if(nrow(tmp[tmp$score>0.9,])>0)
{
sp_list$verified[i] <- TRUE
} else{sp_list$verified[i] <- FALSE}
}
# Update the row names
row.names(sp_list) <- 1:nrow(sp_list)
# First add a column to the species list with genus and species pasted together
sp_list$sp <- paste(sp_list$genus, sp_list$species)
i <- 8
# Check if it is in a bunch of different taxonomic databases
for(i in 1:nrow(sp_list))
{
tmp <- sci2comm(sp_list$sp[i], db = "ncbi", simplify=T)
Sys.sleep(4)
if(length(tmp[[1]])>0)
{
sp_list$common_name[i] <- tolower(tmp[[1]][1]) # Take the first element and make it all lower case
}
print(paste(i, "species of", nrow(sp_list)))
}
```
In this instance, we have an external list of common names which we will use to update our `img` file.
```{r ch3_44, class.source="Rmain", message=F, warning=F}
# Import the dataframe
tmp <- read.csv("data/raw_data/example_data/common_names.csv")
# Join it with the existing species list
sp_list$common_name <- NULL
sp_list <- left_join(sp_list, tmp)
```
Then let's write our species list into our raw data folder:
```{r ch3_45, class.source="Rmain"}
# Note we use the project_id from the project data frame to name the file - that way we won't overwrite it if we run things with a different project.
write.csv(sp_list, paste0("data/raw_data/",pro$project_id[1],"_raw_species_list.csv"))
```
Then let's update the `common_name` column in our `img` dataframe to reflect the updated common names.
We will do this using a `left_join()`, an operation which is invaluable when programming in R.
It uses a specified "key" variable to merge two dataframes; in this case we will use the `sp` column.
```{r ch3_46, class.source="Rmain"}
# first remove the common_name column
img$common_name <- NULL
# add an sp column to the img dataframe - remember the genus and species columns are not pasted together yet
img$sp <- paste(img$genus, img$species, sep=".")
# Next we do the 'left_join'
img <- left_join(img, sp_list[, c("sp", "common_name")], by="sp")
```
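A `left_join()` silently fills unmatched rows with `NA`, so it is worth confirming that every image received a common name. A minimal sketch of such a check, assuming the join above has run:

```{r, eval=F, class.source="Rinfo"}
# Any images whose genus.species did not match the species list?
sum(is.na(img$common_name))
# If this is greater than zero, inspect the offending sp values:
unique(img$sp[is.na(img$common_name)])
```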
Let's move on!
## Diel activity check
Sometimes when setting up a camera trap, you can input the time incorrectly.
This is actually very hard to detect unless you happen to be looking for it.
The way we check is to plot the detections for each species by the 24 hour clock.
If we get detections of nocturnal species in the day, or vice versa, it suggests there may be a problem.
*Caveat* Detecting a diurnal species at night doesn't necessarily mean there is a problem; camera traps have revealed that many animals are active when we thought they were not!
*Cool note* Researchers are increasingly using this information to determine a species "availability" for detection!
More on that in the [density](#density) and [activity](#activity) chapters.
For each species, we will plot when it was detected:
```{r ch3_47, echo=T, warning=F, message=F, class.source="Rmain"}
# First let's convert our timestamp to decimal hours
img$hours <- hour(img$timestamp) + minute(img$timestamp)/60 + second(img$timestamp)/(60*60)
# Count all of the captures
tmp <- img %>% group_by(common_name) %>% summarize(count=n())
yform <- list(categoryorder = "array",
categoryarray = tmp$common_name)
fig <- plot_ly(x = img$hours, y = img$common_name,type="scatter",
height=1000, text=img$deployment_id, hoverinfo='text',
mode = 'markers',
marker = list(size = 5,
color = 'rgba(50, 100, 255, .2)',
line = list(color = 'rgba(0, 0, 0, 0)',
width = 0))) %>%
layout(yaxis = yform)
fig
# Remove the column
img$hours <- NULL
```
**Can you see any exclusively diurnal species?**
Sandhill crane would be a good candidate. They are very rarely detected at night.
**Can you see any exclusively nocturnal species?**
Snowshoe hare!
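If you would rather screen for oddities numerically, you can summarise the proportion of night-time detections per species. A minimal sketch, assuming `dplyr` and `lubridate` are loaded and crudely defining "night" as 20:00-06:00:

```{r, eval=F, class.source="Rinfo"}
img %>%
  mutate(hr = hour(timestamp)) %>%
  group_by(common_name) %>%
  summarize(n = n(),
            prop_night = mean(hr >= 20 | hr < 6)) %>%
  arrange(desc(prop_night))
# A supposedly diurnal species with a high prop_night (or vice versa)
# is worth a closer look
```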
More on activity data in the [Activity chapter](#activity)
## Conclusion
Congratulations - you have thoroughly error checked your camera data!
We may find more errors in the [data exploration chapter](#data-exploration), so stay vigilant.
```{r, include=F, eval=F}
###########################################################
###########################################################
###########################################################
###########################################################
###########################################################
###########################################################
```
# Analysis data creation {#data-creation}
## Common analysis data formats
Although the types of analysis you can perform on camera trap data vary markedly, they often depend on three key dataframe structures.
We introduce these structures here, then show you how to apply them in subsequent chapters.
## Independent detections
The independent detections dataframe is the workhorse of the vast majority of camera trap analyses; it is from this that you build the rest of your data frames.
The threshold we use for determining what is an "independent detection" is typically 30 minutes... because camera trappers are creatures of habit!
If you want to dig a little deeper into the why, there is a nice summary in [Rahel Sollmann's "A gentle introduction to camera‐trap data analysis"](https://onlinelibrary.wiley.com/doi/abs/10.1111/aje.12557):
*Researchers have used different thresholds, typically 30 min (e.g., O'Brien, Kinnaird, & Wibisono, 2003) to an hour (Bahaa‐el‐din et al., 2016); some researchers have argued that multiple pictures within the same day may not represent independent detections (Royle, Nichols, Karanth, & Gopalaswamy, 2009). In most cases, this threshold is determined subjectively, based on the best available knowledge of the species under study. But it can also be determined based on the temporal autocorrelation (Kays & Parsons, 2014) or analysis of time intervals (Yasuda, 2004) of subsequent pictures.*
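The logic behind collapsing raw images into independent detections is simple: sort the images for each deployment and species by time, then start a new "event" whenever the gap to the previous image exceeds the threshold. A minimal sketch, assuming `dplyr` is loaded and an `img` dataframe with `deployment_id`, `sp`, and `timestamp` columns:

```{r, eval=F, class.source="Rinfo"}
independent_threshold <- 30 * 60  # 30 minutes, in seconds

ind_det <- img %>%
  arrange(deployment_id, sp, timestamp) %>%
  group_by(deployment_id, sp) %>%
  mutate(gap = as.numeric(difftime(timestamp, lag(timestamp), units = "secs")),
         # A new event starts when there is no previous image,
         # or when the gap exceeds the threshold
         new_event = is.na(gap) | gap > independent_threshold,
         event_id = cumsum(new_event)) %>%
  ungroup()

# Keep one row per independent detection
ind_det <- ind_det[ind_det$new_event == TRUE, ]
```

The threshold is a single variable here, so it is easy to rerun the summary with a different value and check how sensitive your results are to that choice.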