
title	duration	creator
Experimental Design and Pandas	3hrs	K. Nathaniel Tucker (SF)

Experimental Design and Pandas

DS | Lesson 2

LEARNING OBJECTIVES

After this lesson, you will be able to:

Define a problem and types of data
Identify data set types
Define the data science workflow
Apply the data science workflow in the pandas context
Write an iPython Notebook to import, format and clean data using the Pandas Library

STUDENT PRE-WORK

Before this lesson, you should already be able to:

Create and open an iPython Notebook
Have completed python pre-work

LESSON GUIDE

TIMING	TYPE	TOPIC
5 min	Opening	Lesson Objectives
10 min	Introduction	The why's and how's of a good question
10 min	Demo	Diagramming a high quality aim
10 min	Lecture	Types of datasets
10 min	Guided Practice	Write a research question with raw data
5 min	Knowledge Check	Section 1 Review
5 min	Introduction	Data science workflow steps 2. Acquire and 3. Parse
10 min	Demo	Walkthrough Acquire and Parse with Pandas
30 min	Codealong	Pandas Intro
5 min	Introduction	Lab Walkthrough
20 min	Independent Practice	Lesson 2 lab
10 min	Conclusion	Review lab and lesson objectives
15 min	Wrap-up	Unit 1, project, where we're headed

Opening (5 min)

Review Current Lesson Objectives
- Review Data Science workflow
1. Identify
2. Acquire
3. Parse
4. Mine
5. Refine
6. Build
7. Present

Today we will focus on steps 1-2, and we will dive into steps 3-5 over the next few classes.

Intro: Asking a good question (10 mins)

Why we need a good question/aim

"A problem well stated is half solved."

By having a high quality question/aim, you set yourself up for success as you begin your analysis. You also establish the basis for making your analysis reproducible. A clearly articulated research question not only helps other data scientists learn from and reproduce your work, but also helps them expand on it in the future.

What is a good question?

The goals of a high quality, reproducible question are similar to the SMART Goals Framework.

S: specific
M: measurable
A: attainable
R: reproducible
T: time-bound

Let's break this down further:

Specific: The dataset and key variables are clearly defined.
Measurable: The type of analysis and major assumptions are articulated.
Attainable: The question you are asking is feasible for your dataset and is not likely to be biased.
Reproducible: Another person (or you in 6 months!) can read your statement and understand exactly how your analysis is performed.
Time-bound: You clearly state the time period and population to which this analysis will pertain.

Demo: Diagramming an aim (5 mins)

Instructors: You do not need the actual dataset for this exercise.

Example aim: Determine the association of foods in the home with child dietary intake. Using one 24-hour recall from the cross-sectional NHANES 2007-2010, we will determine the factors associated with food available in the homes of American children and adolescents. We will test if reported availability of fruits, dark green vegetables, low fat milk or sugar-sweetened beverages available in the home increases the likelihood that children and adolescents will meet their USDA recommended dietary intake for that food.

Hypothesis: Children will be more likely to meet their recommended intake level when a food is always available in their home compared to rarely or never.

Source: From Dr. Amy Roberts' Dissertation

Instructor Note: For each of these, give one example and ask the class to identify others.

Specific: Using one 24-hour recall from the cross-sectional National Health and Nutrition Examination Survey (NHANES) 2007-2010, we will determine the factors associated with food available in the homes of American children and adolescents. We will test if self-reported availability of fruits, dark green vegetables, low fat milk or sugar-sweetened beverages available in the home increases the likelihood that children and adolescents will meet their USDA recommended dietary intake for that food. Our hypothesis is that children will be more likely to meet their recommended intake level when a food is always available in their home (compared to rarely or never).
- How data was collected is indicated:
  - 24-hour recall, self-reported
- What data was collected is stated:
  - Fruits, dark green vegetables, low fat milk or sugar-sweetened beverages; always vs rarely available
- How data will be analyzed is defined:
  - Using USDA recommendations as a gold-standard to measure the association
- The specific hypothesis & direction of the expected associations are described:
  - Children will be more likely to meet their recommended intake level
Measurable: Determine the association of foods in the home with child dietary intake. We will test if the reported availability of fruits, dark green vegetables, low fat milk or sugar-sweetened beverages available in the home increases the likelihood that children and adolescents will meet their USDA recommended dietary intake for that food.
Attainable: Cross-sectional data has specific limitations; one of the most common is that causal inference is typically not possible. Note that we are determining the association between two items (food available in the home and children meeting their dietary recommendations). Because we are using cross-sectional data, we cannot say that having fruits and vegetables in the home actually causes children to meet their dietary requirements.
Reproducible: By having all the specifics we indicated previously, it should be straightforward for others to Google NHANES, pull the right datasets, and reproduce this work.
Time Bound: Using one 24-hour recall from NHANES 2007-2010, we will determine the factors associated with food available in the homes of American children and adolescents.

Point: Trends often change over time and vary by the population or source of your data. It is important to clearly define who/what you included in your analysis as well as the time period for the analysis.

Context

Depending on your setting, the types of questions you will answer may vary. The previous example is from a research setting. In a business setting, you will need to clearly articulate the business objectives. You should also identify and hypothesize the goals and criteria for success.

For example, "success" for the Netflix recommendation engine might be stated as: 70% of customers over the age of 18 select a movie from the recommended queue during Q3 of 2015. Regardless of setting, stating your question using the SMART framework will help you achieve your objectives.
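To see how a SMART success criterion turns into something you can actually compute, here is a minimal sketch of the Netflix-style target above. The table, its column names, and every value in it are hypothetical and invented for illustration; it assumes one row per customer.

```python
import pandas as pd

# Hypothetical customer activity table; every value here is made up.
sessions = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "age": [17, 25, 34, 41, 19, 52],
    "session_date": pd.to_datetime([
        "2015-07-04", "2015-07-15", "2015-08-02",
        "2015-09-20", "2015-10-01", "2015-09-30",
    ]),
    "picked_from_recommended": [True, True, False, True, True, True],
})

# Time-bound: keep only Q3 of 2015 (July 1 through September 30, inclusive).
in_q3 = sessions["session_date"].between(
    pd.Timestamp("2015-07-01"), pd.Timestamp("2015-09-30")
)
# Specific population: customers over the age of 18.
over_18 = sessions["age"] > 18

eligible = sessions[in_q3 & over_18]

# Measurable: share of eligible customers who picked from the recommended queue.
success_rate = eligible["picked_from_recommended"].mean()
target_met = success_rate >= 0.70
print(f"{success_rate:.0%} of eligible customers; target met: {target_met}")
```

Each filter maps to one part of the stated goal, which is what makes the question checkable by someone else.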

Knowledge Check

Which of the following questions uses the SMART framework? Why? What is missing?

I am looking to see if there is an association with number of passengers with carry on luggage and delayed take-off time.
Determine if the number of passengers on JetBlue, Delta, and United domestic flights with carry-on luggage is associated with delayed take-off time using data from flightstats.com from January 2015 to December 2015.

Why data types matter

As we saw in the attainable section above, different types of data have different limitations and strengths. Therefore, certain types of analysis will not be possible with certain datasets. Here is a brief overview of some of the different types of datasets:

Cross-Sectional Data: All information is determined at the same time; all the data comes from the same time period.
- Issues: TEMPORALITY.
  - There is no distinction between exposure and outcome. This is why in the example above, we can't say that the availability of fruit in the home actually causes children to meet their recommendations. It is just as likely that the opposite may be true.
- Strengths
  - Often population-based
  - Generalizability
  - Reduced cost compared to other types of data collection methods
- Weaknesses
  - Separation of cause and effect may be difficult or impossible
  - Variables/Cases with long duration are over-represented
Time-Series/Longitudinal Data: The information (data) is collected over a period of time.
- Strengths
  - Unambiguous temporal sequence – exposure precedes outcome
  - Multiple outcomes can be measured
- Limitations
  - Expense
  - Takes a long time
  - Vulnerable to missing data
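To make the difference in layout concrete, here is a small sketch of what each type of dataset might look like as a table. Both tables are hypothetical and the values are invented; they only loosely echo the food-availability example above.

```python
import pandas as pd

# Hypothetical cross-sectional survey: each child is measured once, in the same period.
cross_sectional = pd.DataFrame({
    "child_id": [1, 2, 3],
    "survey_year": [2009, 2009, 2009],
    "fruit_at_home": ["always", "rarely", "always"],
    "met_fruit_recommendation": [True, False, False],
})

# Hypothetical longitudinal study: the same children are measured repeatedly over time.
longitudinal = pd.DataFrame({
    "child_id": [1, 1, 1, 2, 2, 2],
    "survey_year": [2008, 2009, 2010, 2008, 2009, 2010],
    "fruit_at_home": ["rarely", "always", "always", "rarely", "rarely", "always"],
    "met_fruit_recommendation": [False, False, True, False, False, False],
})

# One row per child in the cross-section, several rows per child in the longitudinal data.
rows_per_child_cs = cross_sectional.groupby("child_id").size()
rows_per_child_long = longitudinal.groupby("child_id").size()
print(rows_per_child_cs.max(), rows_per_child_long.max())

# Only the longitudinal layout records order in time, so we can sort by year
# and see whether the exposure changed before the outcome did.
child_1 = longitudinal[longitudinal["child_id"] == 1].sort_values("survey_year")
print(child_1[["survey_year", "fruit_at_home", "met_fruit_recommendation"]])
```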

Check:

What type of data is the "flightstats" data?
Determine if the number of passengers on JetBlue, Delta, and United domestic flights with carry-on luggage is associated with delayed take-off time using data from flightstats.com from January 2015 to December 2015.
Can you create a cross-sectional analysis from a longitudinal data collection? How?

SMART Review

The S.M.A.R.T. process covers the "Identify" step of the data science workflow. We also explored the strengths and weaknesses of two types of data.

SMART analysis aims
Types of datasets: Cross-Sectional vs Longitudinal/Time Series

Questions?

Data science workflow: Acquire & Parse (5 mins)

During this section we are going to walk through key features of steps 2 & 3 of the data science workflow. We will be working with an iPython Notebook. I'll demo the steps first, then we will try them together. During the last part of class, you will try your hand at the steps individually.

Demo: Walkthrough Acquire and Parse with Pandas (30 mins)

Acquire

You'll remember from the previous class that the "Acquire" step is where we determine if the dataset we have is the "right" dataset for our question.

One factor is the type of data: is it cross-sectional or longitudinal/time series? The next question is how well the data was collected. Does it have a ton of missing data? Was the instrument used to collect the data validated and reliable? Is this dataset aggregated? Can we use the aggregation or do we need to get it pre-aggregation?

Logistics of acquiring your data

You can access data through a variety of different methods, including:

Web (Google Analytics, HTML, XML)
File (CSV, XML, TXT, JSON)
Databases (SQL, NoSQL, etc.)

Today we will be using a CSV (comma-separated values) file in our lab.
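As a rough sketch of what reading a CSV into Pandas looks like, the example below loads a few lines of CSV text. The column names and values are hypothetical, and an in-memory string stands in for a file so the snippet runs on its own.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the column names and values are hypothetical.
csv_text = """city,month,ozone
SF,5,41
SF,6,
NYC,5,36
"""

# pd.read_csv accepts a file path, a URL, or any file-like object.
# With a real file this would be something like pd.read_csv("my_data.csv").
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)  # (rows, columns)
print(df)
```

Note that the empty ozone field in the second data row is read in as a missing value (NaN) rather than as an empty string.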

Parse- Understanding your data

Before and after you acquire your data, you also want to make sure you understand what data you've collected. This ensures that you've collected the right data and helps you figure out how it can be used. To better understand your data, there are a number of steps you might follow:

Create or review the data dictionary
Perform exploratory surface analysis via filtering, sorting, and simple visualizations
Describe data structure and the information being collected
Explore variables and data types via select
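The steps above can be sketched in Pandas roughly as follows. The DataFrame is hypothetical; with real data you would load it from your source first.

```python
import pandas as pd

# Hypothetical dataset for illustration.
df = pd.DataFrame({
    "city": ["SF", "NYC", "SF", "DC"],
    "temp_f": [61.0, 74.5, 58.2, 80.1],
    "visitors": [120, 340, 95, 210],
})

# Describe the structure: column names, data types, and size.
print(df.dtypes)
print(df.shape)

# Quick numeric summary (count, mean, std, min, quartiles, max).
print(df.describe())

# Filter: keep only rows that match a condition.
sf_only = df[df["city"] == "SF"]

# Sort: order rows by a column, largest first.
by_visitors = df.sort_values("visitors", ascending=False)

# Select: pull out a subset of columns.
subset = df[["city", "visitors"]]
print(by_visitors.head(2))
```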

Intro to data dictionaries and documentation

Data dictionaries are often our primary source to help judge the quality of our data and also to understand how it is coded. If our gender variables are coded 0 and 1, how do we know which is male and which is female? Your data dictionary! Is your currency variable coded in dollars or euros? Data dictionary!

Data Dictionary Examples

Data dictionaries are also where you'll identify any requirements, assumptions, and constraints of your data. Note that you should never assume that a pre-existing data dictionary is complete. It is often going to be up to you to test your assumptions and add to your dictionary.
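One way to put a data dictionary to work is to keep it as a Python dict and use it to decode your columns. This is only a sketch: the codes, labels, and column names below are assumptions made up for the example, not a real coding scheme.

```python
import pandas as pd

# Hypothetical data dictionary: the codes and labels below are assumptions
# made up for this sketch; always confirm the real coding in your documentation.
data_dictionary = {
    "gender": {"description": "Self-reported gender", "codes": {0: "male", 1: "female"}},
    "price": {"description": "Purchase price", "units": "USD"},
}

df = pd.DataFrame({
    "gender": [0, 1, 1, 0, 2],
    "price": [9.99, 14.50, 3.25, 20.00, 7.75],
})

# Decode the numeric codes into readable labels using the dictionary.
df["gender_label"] = df["gender"].map(data_dictionary["gender"]["codes"])

# Test the assumption that the dictionary is complete: any code it does not
# define becomes NaN after mapping, which flags a gap to investigate.
undocumented = df.loc[df["gender_label"].isnull(), "gender"].unique()
print(undocumented)
```

Here the code `2` appears in the data but not in the dictionary, which is exactly the kind of gap you would need to chase down and document.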

Check: What is a 'data dictionary' and what is it used for? Why?

Codealong- Numpy and Pandas intro (30 minutes)

What are NumPy and Pandas? Both are Python libraries, and Pandas is built on top of NumPy. In NumPy, we use arrays. With arrays you can do:

basic math
slicing, indexing, etc.
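A minimal sketch of those array operations, using small made-up arrays:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])

# Basic math is applied element by element.
total = a + b
doubled = a * 2

# Indexing picks a single element (positions start at 0).
first = a[0]
last = a[-1]

# Slicing picks a range; the end position is excluded.
middle = a[1:4]

# Boolean indexing keeps the elements where a condition is true.
big = b[b > 25]
print(total, doubled, first, last, middle, big)
```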

Pandas uses data structures that will look more familiar to folks who have used Excel or other spreadsheet-based tools. These are called DataFrames. A DataFrame contains rows and columns.

Similarly, you can select pieces of data, do basic operations, and calculate summary statistics. Let's see some examples:
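Here is one possible sketch of selecting data, doing basic operations, and calculating summary statistics on a DataFrame. The names and measurements are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Di"],
    "height_cm": [160.0, 175.0, 182.0, 168.0],
    "weight_kg": [55.0, 80.0, 90.0, 62.0],
})

# Select one column (a Series) or several columns (a DataFrame).
heights = df["height_cm"]
measurements = df[["height_cm", "weight_kg"]]

# Select rows by position with .iloc, or by condition with a boolean mask.
first_two = df.iloc[:2]
tall = df[df["height_cm"] > 170]

# Basic operations work on whole columns at once.
df["height_m"] = df["height_cm"] / 100

# Summary statistics.
mean_height = df["height_cm"].mean()
max_weight = df["weight_kg"].max()
print(mean_height, max_weight)
```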

Additionally, we often have to merge data together, correct missing data, and plot our findings. Let's see some examples of each of these:
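A short sketch of merging two tables and handling missing values follows. Both tables are hypothetical, and filling with the column mean is just one option, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Two hypothetical tables that share a "day" column.
ozone = pd.DataFrame({"day": [1, 2, 3, 4], "ozone": [41.0, np.nan, 12.0, 18.0]})
weather = pd.DataFrame({"day": [1, 2, 3, 5], "temp": [67, 72, 74, 66]})

# Merge: an inner join keeps only the days present in both tables.
merged = pd.merge(ozone, weather, on="day", how="inner")

# Find missing data: count the NaN values in each column.
missing_counts = merged.isnull().sum()

# Correct missing data: either drop incomplete rows or fill in a value.
dropped = merged.dropna()
filled = merged.fillna({"ozone": merged["ozone"].mean()})
print(missing_counts)
print(dropped)
```

Plotting is left out of this sketch because it needs matplotlib; with it installed, `merged.plot(x="day", y="ozone")` would draw a simple line chart.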

Check: What is a 'dataframe' and when would you use one?

Lab Walkthrough (5 min)

This lab is based on a quiz given in Roger Peng's "Computing for Data Analysis" class on Coursera. During the lab you will read in and merge two datasets "ozone" and "data". By the end of the lab, you will:

Merge datasets
Check basic features, such as column names, number of observations
Find and drop missing values
Find basic stats like mean & max (more on these next time!)

The purpose of this lab is to get some practice working with Pandas. We will dive into stats more next week.

Lesson 2 Lab (20 min)

Conclusion (10 mins)

Review solutions & questions from lab
Review objectives from class

Unit 1, project, where we're headed (15 mins)

Review Unit 1 objectives
Introduce the first project
Exit tickets

BEFORE NEXT CLASS


UPCOMING PROJECTS	Unit Project 1

Resources:

Another Git tutorial here
In depth Git/Github tutorial series made by a GA_DC Data Science Instructor here
Another Intro to Pandas (Written by Wes McKinney and is adapted from his book)
- Here is a video of Wes McKinney going through his ipython notebook!
