---
title: "Reproducible Research: Peer Assessment 1"
author: Peter D. Bolier
date: June 24, 2016
output:
  html_document:
    keep_md: true
---
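First we make sure the locale is set to English, so that names such as weekdays come out in English. The locale name "English" is the Windows form; other platforms typically need a name such as "en_US.UTF-8".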
```{r setlocale, echo=TRUE}
Sys.setlocale("LC_ALL","English")
```
## Loading and preprocessing the data
We assume the working directory is the root of the assignment, which contains activity.zip.
Let's unzip the archive and read the data.
A helper class, `myDate`, coerces the date text into a `Date` while reading.
```{r unzipload, echo=TRUE}
unzip("activity.zip");
setClass('myDate');
setAs("character","myDate", function(from) as.Date(from, format="%Y-%m-%d"));
activities <- read.csv("activity.csv", header=TRUE, sep=",", colClasses = c("integer", "myDate", "integer"));
```
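As a possible shortcut, `read.csv` can parse ISO-formatted dates directly when the column class is given as `"Date"`, which would make the helper class unnecessary. The sketch below assumes the file has the three columns read above and uses made-up rows (`demo_csv` and `demo_activities` are illustrative names, not part of the assignment data).

```{r sketchdates, echo=TRUE}
# Made-up rows in the same three-column layout as activity.csv (assumption).
demo_csv <- c("steps,date,interval",
              "NA,2012-10-01,0",
              "12,2012-10-02,5")
# colClasses = "Date" lets read.csv convert yyyy-mm-dd text via as.Date().
demo_activities <- read.csv(text = demo_csv,
                            colClasses = c("integer", "Date", "integer"))
str(demo_activities)
```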
## What is mean total number of steps taken per day?
We could calculate the mean in one step, but we start by determining the total number of steps for each day and plotting those totals in a histogram.
First, let's look at the first few rows of the data.
```{r showfirstrows, echo=TRUE}
head(activities);
```
The first column holds the number of steps, so we use it to calculate the totals.
We determine the total per day and the overall total of steps measured.
```{r sumstepsday, echo=TRUE}
totalsteps <- aggregate(activities[,1], by = list(activities$date), sum);
sum(totalsteps$x, na.rm=TRUE)
```
Now we use the totals in our histogram.
```{r histogranmofstes, echo=TRUE, message=FALSE, warning=FALSE}
library(ggplot2);
# Map the column by name; using `$` inside aes() is discouraged in ggplot2.
ggplot(totalsteps, aes(x)) +
  geom_histogram() +
  labs(title="Histogram of total steps per day", x="Number of steps per day");
```
The mean and median of the total steps per day are:
```{r meanandmedian, echo=TRUE}
meansteps <- mean(totalsteps$x, na.rm = TRUE);
meansteps;
mediansteps <- median(totalsteps$x, na.rm = TRUE);
mediansteps;
```
## What is the average daily activity pattern?
The steps are measured in 5-minute intervals, so we can see how many steps were taken, on average, at a particular interval or moment of the day. One expects fewer steps at night.
```{r averageatinterval, echo=TRUE}
library(scales)
averageatinterval <- aggregate(steps ~ interval, activities , mean, na.rm=TRUE);
ggplot(averageatinterval, aes(interval, steps)) +
  geom_line() +
  labs(title="Average steps at a specific moment", x="Time", y="Steps taken") +
  scale_x_continuous(breaks=c(0, 600, 1200, 1800, 2400),
                     labels=c("00:00, Midnight", "06:00", "12:00, Noon", "18:00", "24:00"))
```
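The axis labels above treat interval 600 as 06:00 and 1200 as noon, which suggests the interval labels are clock times written as integers (HHMM). Under that assumption, a label can be converted to minutes since midnight so that consecutive intervals are evenly spaced. This is a sketch on made-up labels (`demo_interval` and `demo_minutes` are illustrative names), not a change to the plot above.

```{r sketchminutes, echo=TRUE}
# Assumption: interval labels are HHMM clock times stored as integers.
demo_interval <- c(0, 55, 100, 600, 1200, 2355)
# Hours are the part above the last two digits; minutes are the last two digits.
demo_minutes <- (demo_interval %/% 100) * 60 + demo_interval %% 100
demo_minutes
```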
Determine the interval at which the mean number of steps is highest.
Because `aggregate` was called with a formula, the columns of the result are named `interval` and `steps`.
```{r findmaxstep, echo=TRUE}
themoment <- averageatinterval[averageatinterval$steps == max(averageatinterval$steps),]$interval;
themoment;
```
The maximum average number of steps is taken at interval `r themoment`.

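
The same lookup can also be written with `which.max`, which returns the position of the first maximum. This is a sketch on made-up averages (`demo_average` and `demo_moment` are illustrative names); unlike the `==` comparison above, it returns a single row even if several intervals tie.

```{r sketchwhichmax, echo=TRUE}
# Made-up averages, only to illustrate the lookup.
demo_average <- data.frame(interval = c(0, 5, 10),
                           steps    = c(1.5, 4.0, 2.5))
# which.max() gives the index of the (first) largest value.
demo_moment <- demo_average$interval[which.max(demo_average$steps)]
demo_moment
```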
## Imputing missing values
Check how many values are missing:
```{r missingvalues, echo=TRUE}
sum(is.na(activities))
```
Replace each missing value with the mean number of steps per day divided by the number of intervals in a day.
One interval is 5 minutes, so the number of intervals is the number of minutes in a day divided by 5.
```{r replacemissingwithzero, echo=TRUE}
numberofintervals <- 24 * 60 / 5;
activities2 <- activities;
activities2[is.na(activities2$steps),]$steps <- (meansteps / numberofintervals);
```
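To see what this fill does to a day that is missing entirely, here is a sketch on made-up data with two days of three intervals each (all `demo_` names are illustrative). In this toy case the filled day ends up with a total equal to the mean daily total; whether the real data behaves the same way depends on how its missing values are distributed.

```{r sketchimpute, echo=TRUE}
# Made-up data: 2 days x 3 intervals; day "d2" is entirely missing.
demo_steps <- data.frame(day   = rep(c("d1", "d2"), each = 3),
                         steps = c(10, 20, 30, NA, NA, NA))
demo_n <- 3                                                   # intervals per toy day
demo_totals <- tapply(demo_steps$steps, demo_steps$day, sum)  # d1 = 60, d2 = NA
demo_mean <- mean(demo_totals, na.rm = TRUE)                  # mean over known days
demo_filled <- demo_steps
# Each missing interval gets an equal share of the mean daily total.
demo_filled$steps[is.na(demo_filled$steps)] <- demo_mean / demo_n
demo_filled_totals <- tapply(demo_filled$steps, demo_filled$day, sum)
demo_filled_totals
```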
Let's see what impact this has. The total number of steps has increased:
```{r sumstepsday2, echo=TRUE}
totalsteps2 <- aggregate(activities2[,1], by = list(activities2$date), sum);
sum(totalsteps2$x);
```
Now we use the new totals in our histogram.
```{r histogranmofsteps2, echo=TRUE, message=FALSE, warning=FALSE}
library(ggplot2);
# Map the column by name; using `$` inside aes() is discouraged in ggplot2.
ggplot(totalsteps2, aes(x)) +
  geom_histogram() +
  labs(title="Histogram of total steps per day (missing values replaced)", x="Number of steps per day");
```
The mean and median of the total steps per day are:
```{r meanandmedian2, echo=TRUE}
meansteps2 <- mean(totalsteps2$x);
meansteps2;
mediansteps2 <- median(totalsteps2$x);
mediansteps2;
```
So it looks like imputing the missing values with a mean has little impact on the mean and median.
```{r difference, echo=TRUE}
diffmean <- meansteps - meansteps2;
diffmedian <- mediansteps - mediansteps2;
```
The difference in the mean of the steps per day is `r diffmean` and the difference in the median is `r diffmedian`.

## Are there differences in activity patterns between weekdays and weekends?
First we add a column to the data frame indicating whether a day falls on a weekend.
Note: `weekdays()` returns locale-dependent names; using the POSIX weekday number instead would avoid depending on the locale.
```{r addweekendindication, echo=TRUE}
activities2$typeofday <- factor(weekdays(activities2$date) %in% c('Saturday','Sunday'), levels=c('FALSE', 'TRUE'), labels=c('Weekday', 'Weekend'))
```
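The locale-independent variant mentioned in the note could look like the following sketch. It relies on the `wday` field of `POSIXlt`, which counts days from 0 (Sunday) to 6 (Saturday); the dates and the `demo_` names are made up for illustration.

```{r sketchweekend, echo=TRUE}
# Friday through Monday, chosen only as an example.
demo_dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))
# wday is numeric (0 = Sunday, 6 = Saturday), so no weekday names are compared.
demo_isweekend <- as.POSIXlt(demo_dates)$wday %in% c(0, 6)
demo_typeofday <- factor(demo_isweekend, levels = c(FALSE, TRUE),
                         labels = c("Weekday", "Weekend"))
demo_typeofday
```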
Let's separate the weekend from the other days and see whether the number of steps differs at the same interval.
First calculate the average for each interval and type of day (weekend or weekday).
```{r groupinterval, echo=TRUE}
grouped <- aggregate(steps ~ interval + typeofday, activities2 , mean);
```
Let's plot the two types of day in one figure (2 rows, 1 column).
```{r plotweekendsep, echo=TRUE}
ggplot(grouped, aes(interval, steps)) +
  geom_line() +
  facet_wrap(~typeofday, ncol=1) +
  labs(title="Average steps at a specific moment", x="Time", y="Steps taken") +
  scale_x_continuous(breaks=c(0, 600, 1200, 1800, 2400),
                     labels=c("00:00, Midnight", "06:00", "12:00, Noon", "18:00", "24:00")) +
  theme(strip.background = element_rect(fill = alpha('blue', 0.3)))
```