From 4f33bf73a1834a7f310aab25efe8a79d4d0d7dd5 Mon Sep 17 00:00:00 2001
From: amudanye <33033195+amudanye@users.noreply.github.com>
Date: Thu, 24 May 2018 14:31:23 -0400
Subject: [PATCH] Add files via upload

---
 ReadMe.txt | 17 +++++++++++++++++
 Task4.txt  | 30 ++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+)
 create mode 100644 ReadMe.txt
 create mode 100644 Task4.txt

diff --git a/ReadMe.txt b/ReadMe.txt
new file mode 100644
index 0000000..8ebe993
--- /dev/null
+++ b/ReadMe.txt
@@ -0,0 +1,17 @@
+#Ann Mudanye
+#Africa's Talking Coding Challenge
+
+Task 1 - If you copy and paste a set of steps more than 3 times, it’s time to write a function.
+
+Task 2 - Given a dataset on any one of Africa’s Talking products (Voice, SMS, Payments and USSD), discuss the steps you would take to analyse the data to reach a conclusion.
+I would start by formulating a research question, for example determining how efficient the Voice product is for the company's customers. This can be tested by determining how many more customers use Voice than the SMS option, or by looking at customer reviews of the Voice product. Once the data is collected, it can be analysed to find what percentage of customers are satisfied with the Voice product (or any of the other products). A higher percentage of satisfied customers would suggest that the product is efficient. Determining the degree of efficiency would probably also include calculating how responsive voice calls are.
+
+
+Task 3 - Give an example explaining how K-means clustering works.
+K-means clustering is often used when working with unlabeled data.
+With this algorithm we can find groups in the data, where the number of groups is represented by the variable K.
+Each data point is assigned to one of the K groups based on feature similarity: it joins the group whose center is nearest, and the centers are recomputed until the assignments stop changing.
+
+For example, suppose Africa's Talking wishes to open branches all over Africa.
Assuming that Africa's Talking knows where most of its customers are located, it has to decide how many branches to open and where to place them so that all its customers are close to a branch.
+With K-means clustering, the locations with the most customers are grouped into clusters and a cluster center is defined for each cluster (in this example, the most central location within each country in Africa); these centers are where the company branches will be opened. Because K-means places each center to reduce the total squared distance to the points in its cluster, the Africa's Talking branches end up close to the customers we are trying to reach.
+
diff --git a/Task4.txt b/Task4.txt
new file mode 100644
index 0000000..f40208b
--- /dev/null
+++ b/Task4.txt
@@ -0,0 +1,30 @@
+#Ann Mudanye
+#Africa's Talking Coding Challenge
+
+Task 4 - Given a gigabyte of weather data, we start off by loading the lubridate and dplyr packages and the .csv file into R.
+
+weather_dataset <- read.csv(
+  file = "weather dataset name.csv",
+  stringsAsFactors = FALSE  # keep strings as plain character vectors rather than factors
+  )
+
+To find the mean temperature of a particular place, we use the group_by function to group the data by place. We can use the tally() function to count how many measurements were made in each place. The summarize function can then calculate the mean temperature for the place that we chose.
+
+Plotting a graph to show the variation in daily temperature requires a time series plot, which can be drawn with the ggplot2 package in R.
+Suppose we want to show the variation per hour of the day. We start off by converting the date-time column to the POSIXct class after the .csv is loaded.
+
+> myPOSIXct = as.POSIXct("2018-05-24 10:17:07", tz="UTC")
+> myPOSIXct
+[1] "2018-05-24 10:17:07 UTC"
+> format(myPOSIXct, format="%H") #extract the hour from the date-time value
+[1] "10"
+
+The hours can be stored in a variable called time.
+
+The call below is what we would use:
+ggplot(aes(x = time, y = variable), data = data) + geom_line()
+
+where x is the time in hours of the day,
+y is variable, the temperature calculated per hour,
+data is the given weather dataset, and
+geom_line() plots the points as a line graph. Lines are helpful for presenting continuous data on an interval scale, where the intervals are equal in size (one hour each).
\ No newline at end of file
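
Note on Task4.txt: the grouping and hour-extraction steps described in the patch can be sketched roughly as follows. This is only an illustration under stated assumptions, not the author's actual script: the small in-memory data frame and its column names (place, datetime, temperature) are hypothetical stand-ins for the real .csv, and n() inside summarise is used here in place of a separate tally() call.

```r
# Sketch only: the data frame and its column names (place, datetime,
# temperature) are hypothetical stand-ins for the real weather .csv.
library(dplyr)

weather <- data.frame(
  place = c("Nairobi", "Nairobi", "Nairobi", "Mombasa", "Mombasa"),
  datetime = c("2018-05-24 10:17:07", "2018-05-24 10:45:00",
               "2018-05-24 11:05:00", "2018-05-24 10:30:00",
               "2018-05-24 11:30:00"),
  temperature = c(20, 22, 24, 30, 32),
  stringsAsFactors = FALSE
)

# Parse the text timestamps, then pull out the hour of day as an integer.
weather$datetime <- as.POSIXct(weather$datetime, tz = "UTC")
weather$time <- as.integer(format(weather$datetime, format = "%H"))

# Number of measurements and mean temperature for each place.
by_place <- weather %>%
  group_by(place) %>%
  summarise(n = n(), mean_temp = mean(temperature))

# Mean temperature for each hour of the day.
by_hour <- weather %>%
  group_by(time) %>%
  summarise(variable = mean(temperature))

print(by_place)
print(by_hour)
```

The by_hour table has the columns time and variable, so it could be passed as the data argument of the ggplot call in Task4.txt once ggplot2 is loaded.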