diff --git a/Lab 2 - RDD, DataFrame, ML pipeline, and parallelization.md b/Lab 2 - RDD, DataFrame, ML pipeline, and parallelization.md
index 8b3eb22..8c42b4b 100644
--- a/Lab 2 - RDD, DataFrame, ML pipeline, and parallelization.md
+++ b/Lab 2 - RDD, DataFrame, ML pipeline, and parallelization.md
@@ -530,12 +530,13 @@ Starting from this lab, you need to use *as many DataFrame functions as possible
 
 Load the Aug95 NASA access log data in Lab 1 and create a DataFrame with FIVE columns by **specifying** the schema according to the description in the downloaded html file. Use this DataFrame for the following questions.
 
-2. Find out the number of **unique** hosts in total (i.e. in August 1995)?
-3. Find out the most frequent visitor, i.e. the host with the largest number of visits.
+2. Find out the number of **unique** hosts in total (i.e. in August 1995)? [Answer: 75060 Unique Hosts]
+3. Find out the most frequent visitor, i.e. the host with the largest number of visits. [Answer: "edams.ksc.nasa.gov]
 
 ### Linear regression for advertising
 
 4. Add regularization to the [linear regression for advertising example](#example-linear-regression-for-advertising) and evaluate the prediction performance against the performance without any regularization. Study at least three different regularization settings.
+[Answer: Adding increasing regularisation parameters (0.1, 0.2, 0.5) increases each of the predictions each time. Not really sure what else to put here without just copy pasting it all in?]
 
 ### Logistic regression for document classification
 