# Data Analysis¶

This is a demo assignment that is openly available for the Data Science in Practice Course.

# Important Reminders¶

This assignment has hidden tests: tests that are not visible here, but that will be run on your submitted assignment for grading.

This means passing all the tests you can see in the notebook here does not guarantee you have the right answer!

In particular, many of the tests you can see simply check that the right variable names exist. Hidden tests check the actual values.

It is up to you to check the values, and make sure they seem reasonable.

A reminder: if things seem to go wrong, a good first-line check is to restart the kernel and re-run the notebook from the top.

For example, note that some cells can only be run once, because they re-write a variable (for example, your dataframe), and change it in a way that means a second execution will fail.

Also, running some cells out of order might change the dataframe in ways that may cause an error, which can be fixed by re-running.

Run the following cell. These are all you need for the assignment. Do not import additional packages.

```
# Imports
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
sns.set_context('talk')
import warnings
warnings.filterwarnings('ignore')
import patsy
import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import ttest_ind, chisquare, normaltest
# Note: the statsmodels import may print out a 'FutureWarning'. That's fine.
```

## Downloading Data¶

If you download this notebook to run locally, you will also need some data files.

Running the next cell will download the required files for this assignment.

You can also view and download these files from https://github.com/DataScienceInPractice/Data.

```
from os.path import join as pjoin
from urllib.request import urlretrieve
data_url = 'https://raw.githubusercontent.com/DataScienceInPractice/Data/master/'
# Fill in these values
assignment = 'A5'
data_files = ['COGS108_IntroQuestionnaireData.csv']
for data_file in data_files:
    full_path = pjoin(data_url, assignment, data_file)
    urlretrieve(full_path, filename=data_file)
```

## Setup¶

Data: the responses collected from a previous survey of the COGS 108 class.

There are 416 observations in the data, covering 10 different ‘features’.

Research Question: Do students in different majors have different heights?

Background: Physical height has previously been shown to correlate with career choice and career success. More recently, it has been demonstrated that these correlations can actually be explained by height in high school, as opposed to height in adulthood (1). It is currently unclear whether height correlates with choice of major in university.

Reference: 1) https://www.sas.upenn.edu/~apostlew/paper/pdf/short.pdf

Hypothesis: We hypothesize that there will be a relation between height and chosen major.

## Part 1: Load & Clean the Data (2.95 points)¶

Fixing messy data makes up a large amount of the work of being a Data Scientist.

The real world produces messy measurements and it is your job to find ways to standardize your data such that you can make useful analyses out of it.

In this section, you will learn, and practice, how to successfully deal with unclean data.

### 1a) Load the data¶

Import the datafile `COGS108_IntroQuestionnaireData.csv` into a DataFrame called `df`.

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert isinstance(df, pd.DataFrame)
```

```
# Check out the data
df.head(5)
```

Those column names are a bit excessive, so first let’s rename them - code provided below to do so.

```
# Renaming the columns of the dataframe
df.columns = ['timestamp', 'year', 'major', 'age', 'gender', 'height',
              'weight', 'eye_color', 'born_in_CA', 'favorite_icecream']
```

`pandas` has a very useful method for detecting missing data. This method is called `isnull()`.

If you have a dataframe called `df`, then calling `df.isnull()` will return another dataframe of the same size as `df` where every cell is either True or False.

Each True or False is the answer to the question 'is the data in this cell null?'. So, False means the cell is not null (and therefore does have data); True means the cell is null (does not have data).

This function is very useful because it allows us to find missing data very quickly in our dataframe. As an example, consider the code below.

```
# Check the first few rows of the 'isnull' dataframe
df.isnull().head(5)
```

If you print out more, and scroll down, you’ll see some rows with missing data.

```
# For example:
df.isnull().iloc[48:50, :]
```

Check an example, row 49, in which an entry has missing data:

```
df.iloc[49, :]
```

Granted, the example above is not very informative. As you can see, the output of `isnull()` is a dataframe where the value in each cell is either True or False. Most cells hold `False`, which we expect, since most people answered each question in our survey.

However, some rows, such as row 49, show that some people chose not to answer certain questions. In the case of row 49, it seems that someone did not answer 'What year (in school) are you?'

But what if we wanted to use `isnull()` to see all rows where our dataframe `df` has missing values? In other words, what if we want to see the ACTUAL rows with missing values, instead of this dataframe of True and False cells? For that, we need the following line of code:

`df[df.isnull().any(axis=1)]`
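To see how that one-liner behaves, here is a quick sketch on a tiny, made-up dataframe (the column names `a` and `b` are hypothetical, not from the survey):

```python
import numpy as np
import pandas as pd

# A small toy dataframe with one missing value
toy = pd.DataFrame({'a': [1, 2, np.nan], 'b': ['x', 'y', 'z']})

# isnull() marks each cell; any(axis=1) then collapses each ROW to a
# single True if that row contains at least one null cell
mask = toy.isnull().any(axis=1)
print(mask.tolist())   # [False, False, True]

# Indexing the dataframe with the boolean mask keeps only rows with missing data
print(toy[mask])
```

Indexing a dataframe with a boolean Series like this is a general pandas pattern: any row where the mask is True is kept, and all others are dropped from the view.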

### 1b) Find missing data¶

Find all rows that have missing data in them.

Save the output, as a dataframe, into a variable called `rows_to_drop`.

In other words, copy over and use the line of code that we gave out in the cell above.

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
# check output
rows_to_drop.shape
```

```
assert isinstance(rows_to_drop, pd.DataFrame)
assert rows_to_drop.shape == (29, 10)
```

In the cell below, briefly explain how `df[df.isnull().any(axis=1)]` works, in a couple of sentences.

Include an explanation of what `any(axis=1)` means and how it affects the code.

YOUR ANSWER HERE

Run the following cell and look at its output, but you don’t need to add any code here.

Real-world data are messy. As an example, consider the data shown in `rows_to_drop` (below).

If you’ve done everything correctly so far, you should see an unexpected response with emojis at index 357. These types of responses, although funny, are hard to parse when dealing with big datasets.

We’ll learn about solutions to these types of problems in the upcoming cells.

```
rows_to_drop
```

### 1c) Drop the rows with NaN values¶

Drop any rows with missing data, but only for the columns `major`, `height`, `gender` and `age`. These will be the data of primary interest for our analyses, so we drop rows missing them.

Note that there is other missing data (in other columns), but that is fine for our analyses, so we keep those rows.

To do this, use the pandas `dropna` method, in place, using the `subset` argument to specify the columns.
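To see what `subset` does, here is a minimal sketch on a toy dataframe (hypothetical columns, not the survey data; shown with reassignment, whereas the question asks you to use `inplace=True`):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'major': ['COGSCI', np.nan, 'MATH'],
                    'height': [65.0, 70.0, np.nan],
                    'notes': [np.nan, 'ok', 'ok']})

# Rows are dropped only when the missing value falls in a listed column;
# the NaN in 'notes' (row 0) is left alone
toy = toy.dropna(subset=['major', 'height'])
print(toy)   # keeps only row 0
```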

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert df.shape == (403, 10)
```

In the rest of Part 1, we will work on writing code, organized into functions, that will allow us to transform similar responses into the same value. We will call this process standardizing the data.

The cell below provides an example of the kind of code you will need to write to answer this question. This example is separate from our actual data: it is a potential function we might use to standardize messy answers to the hypothetical question 'What is your favourite major python version?'.

Note some things used in this example that you will also need in order to standardize data:

- string methods, such as `lower` and `strip`, to transform strings
- the `replace` string method, to replace a set of characters with something else
- if/else statements that check what's in our string (numbers, letters, etc.)
- type casting, for example using `int()` to turn a variable into an integer
- using `np.nan` (which stands for 'not a number') to denote missing or unknown data

**Note**: For the following few cells you should run, read, and understand the code provided, but you don’t have to add any code until Q1d.

```
# just run this cell
df['year'].unique()
```

The line of code above shows us the different values we received for the question 'What year (in school) are you?'.

As you can tell, it is a **mess**! For example, if you are a junior, you might have answered: 3, three, third, 3rd year, junior, junior year, Junior, etc.

That is an issue. We want to be able to analyze this data and, to do so successfully, we need all answers with the same meaning to be written in the same way. Therefore, we will have to transform answers such as '3', 'third', '3rd' and 'junior' into a single value, and do the same for every set of values that mean the same thing.

```
def example_standardize_function(str_in):
    '''Standardize data to the question 'what is your favorite major python version?'

    Parameters
    ----------
    str_in : string
        A provided answer.

    Returns
    -------
    int_out : int or np.nan
        A standardized integer response.
    '''

    # Make the input all lowercase
    str_in = str_in.lower()

    # Drop all whitespace
    str_in = str_in.strip()

    # Replace things (and then strip again afterwards)
    # Note that 'replace' replaces the first argument with the second.
    # The first argument does not need to be present in the string:
    # if it's not there, 'replace' does nothing (and does not error), so the code moves on.
    str_in = str_in.replace('version', '')
    str_in = str_in.replace('python', '')
    str_in = str_in.strip()

    # Cast to integer, if what's left seems appropriate
    if str_in.isnumeric() and len(str_in) == 1:
        out = int(str_in)
    # Otherwise, consider the input ill-formed, and return nan
    else:
        out = np.nan

    return out

# Check how this function helps standardize data:
# Example possible answers to the question 'What is your favourite major version of Python':
print('INPUT', '\t\t-\t', 'OUTPUT')
for inp in ['version 3', '42', '2', 'python 3', 'nonsense-lolz']:
    print('{:10s} \t-\t {:1.0f}'.format(inp, example_standardize_function(inp)))
```

Now we have to standardize the data!

Check all different values given for majors. It’s a lot!

```
df['major'].unique()
```

We’ll write a function performing some simple substring checking in order to group many responses together.

```
def standardize_major(string):

    string = string.lower()
    string = string.strip()

    if 'cog' in string:
        output = 'COGSCI'
    elif 'computer' in string:
        output = 'COMPSCI'
    elif 'cs' in string:
        output = 'COMPSCI'
    elif 'math' in string:
        output = 'MATH'
    elif 'electrical' in string:
        output = 'ECE'
    elif 'bio' in string:
        output = 'BIO'
    elif 'chem' in string:
        output = 'CHEM'
    # Otherwise, if uncaught - keep as is
    else:
        output = string

    return output
```

We then have to apply the transformation using the function we just defined.

```
df['major'] = df['major'].apply(standardize_major)
```

Preview the results of the previous transformation.

It looks a lot better. It's not perfect, but we'll run with it.

```
df['major'].unique()
```

### 1d) Standardize ‘gender’ function¶

Next let’s check the ‘gender’ column.

Check the different responses received for gender, including how many of each response we have

```
# run this to see different gender input data
df['gender'].value_counts()
```

Using a similar approach to what we used for 'major', you'll write a `standardize_gender` function.

To do this you'll:

- convert all text to lowercase
- use the string method `strip()` to remove leading and trailing characters from the gender value
- use an `if/elif/else` to:
  - output 'female' if the lowercase gender value is 'female', 'f', 'woman', 'famale', or 'women'
  - output 'male' if the lowercase gender value is 'male', 'm', 'man', or 'men'
  - output 'nonbinary_or_trans' if the lowercase gender value is 'nonbinary' or 'transgender'
  - output `np.nan` otherwise
- return the output

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert standardize_gender('f') == 'female'
assert standardize_gender('male') == 'male'
assert standardize_gender('Transgender') == 'nonbinary_or_trans'
```

### 1e) Transform ‘gender’ column¶

Apply the transformation: use your function to standardize gender in `df`.

Then, drop any rows with missing gender information.

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
# Check the results
df['gender'].unique()
```

```
assert len(df['gender'].unique()) == 3
assert df.shape == (402, 10)
```

### 1f) Standardize other columns¶

Find, programmatically, the number of unique responses in the 'year' column.

Save the result in a variable named `num_unique_responses`.

Hint: you can answer this question using the `unique` method, used above.

```
# YOUR CODE HERE
raise NotImplementedError()
num_unique_responses
```

```
assert num_unique_responses
assert isinstance(num_unique_responses, int)
```

```
# Print out all the different answers in 'year'
df['year'].unique()
```

### 1g) Standardize ‘year’ column¶

Write a function named `standardize_year` that takes in a string as input and returns an integer.

The function will do the following (in the order specified):

Note that for these detailed instructions, each line corresponds to one line of code you need to write.

- convert all characters of the string into lowercase
- strip the string of all leading and trailing whitespace
- replace any occurrences of 'first' with '1'
- replace any occurrences of 'second' with '2'
- replace any occurrences of 'third' with '3'
- replace any occurrences of 'fourth' with '4'
- replace any occurrences of 'fifth' with '5'
- replace any occurrences of 'sixth' with '6'
- replace any occurrences of 'freshman' with '1'
- replace any occurrences of 'sophomore' with '2'
- replace any occurrences of 'junior' with '3'
- replace any occurrences of 'senior' with '4'
- replace any occurrences of 'year' with '' (remove it from the string)
- replace any occurrences of 'th' with '' (remove it from the string)
- replace any occurrences of 'rd' with '' (remove it from the string)
- replace any occurrences of 'nd' with '' (remove it from the string)
- strip the string of all leading and trailing whitespace (again)
- if the resulting string is a number and it is less than 10, cast it into an integer and return that value
- else, return np.nan to indicate that the student's response was not a valid entry

HINTS: you will need to use the functions `lower()`, `strip()`, `isnumeric()` and `replace()`
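The replace-then-cast idiom these hints describe can be sketched on a small toy example. This is illustrative only (a different, made-up mapping), not the graded `standardize_year` solution:

```python
import numpy as np

def toy_standardize(str_in):
    '''Illustrative only: map messy ordinal strings to small integers.'''
    str_in = str_in.lower().strip()
    # Chained replacements: whole words first, then leftover suffixes
    str_in = str_in.replace('first', '1')
    str_in = str_in.replace('second', '2')
    str_in = str_in.replace('st', '')
    str_in = str_in.replace('nd', '')
    str_in = str_in.strip()
    # Cast only if what remains looks like a small number
    if str_in.isnumeric() and int(str_in) < 10:
        return int(str_in)
    return np.nan

print(toy_standardize(' First '))   # 1
print(toy_standardize('2nd'))       # 2
print(toy_standardize('huh'))       # nan
```

Note the ordering: replacing whole words ('first', 'second') before short suffixes ('st', 'nd') matters, otherwise 'first' would be mangled into 'fir1' territory before the word-level replacement can match.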

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert standardize_year('2nd') == 2
assert standardize_year('sophomore') == 2
assert standardize_year('3rd year') == 3
assert standardize_year('5th') == 5
assert standardize_year('7 ') == 7
assert standardize_year('randomText') is np.nan
```

### 1h) Transform ‘year’ column¶

Use `standardize_year` to transform the data in the 'year' column (the responses to 'What year (in school) are you?').

Hint: use the `apply` method AND remember to save your output back into the dataframe.

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert len(df['year'].unique()) == 7
```

Assuming that all is correct up to this point, the line below should show all values now found in `df['year']`.

It should look a lot better. With this data, we can now make insightful analyses.

You should see an array with elements 1,2,3,4,5,6 and nan (not necessarily in that order).

Note that if you check the data type of this column, you'll see that pandas stores these numbers as `float`, even though the applied function returns `int`, because `np.nan` is considered a float. This is fine.
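You can verify this promotion yourself on a toy Series: a column holding `np.nan` cannot stay integer, so pandas upcasts it to float.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])
print(ints.dtype)              # int64 (the default integer dtype)

with_nan = pd.Series([1, 2, np.nan])
print(with_nan.dtype)          # float64 - the NaN forces the upcast
```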

```
df['year'].unique()
```

Let’s do it again. Let’s take a look at the responses in the ‘weight’ column, and then standardize them.

```
# First, ensure that all types are consistent, use strings
df['weight'] = df['weight'].astype(str)
```

```
# Check all the different answers we received
df['weight'].unique()
```

### 1i) Standardize ‘weight’ column¶

Write a function named `standardize_weight` that takes in a string as input and returns an integer.

The function will do the following (in the order specified):

- convert all characters of the string into lowercase
- strip the string of all leading and trailing whitespace
- replace any occurrences of 'lbs' with '' (remove it from the string)
- replace any occurrences of 'lb' with '' (remove it from the string)
- replace any occurrences of 'pounds' with '' (remove it from the string)
- if the string contains the substring 'kg', then:
  - replace 'kg' with ''
  - strip the string of whitespace
  - cast the string into a float type using the function `float()`
  - multiply the resulting float by 2.2 (an approximate conversion of kilograms to pounds)
- `try` to return the `int` of your string. If it cannot, return `np.nan`.
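The kilogram branch and the final `try` step described above can be sketched on standalone toy inputs. This is a sketch of the idiom (with a made-up helper name), not the graded function:

```python
import numpy as np

def toy_to_pounds(str_in):
    '''Illustrative only: convert a messy weight string to pounds.'''
    str_in = str_in.lower().strip().replace('lbs', '').replace('lb', '')
    if 'kg' in str_in:
        str_in = str_in.replace('kg', '').strip()
        str_in = float(str_in) * 2.2     # approximate kg -> lb conversion
    # int() succeeds on a numeric string (or a float); otherwise it raises
    try:
        return int(str_in)
    except ValueError:
        return np.nan

print(toy_to_pounds('150 lbs'))   # 150
print(toy_to_pounds('70 kg'))     # 154
print(toy_to_pounds('dunno'))     # nan
```

The `try`/`except` is doing the same job as the `isnumeric()` check in earlier functions: anything that can't be interpreted as a number falls through to `np.nan`.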

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert standardize_weight('34 lbs') == 34
assert standardize_weight('101 kg') == 222
```

### 1j) Transform ‘weight’ column¶

Use `standardize_weight` to transform the data in the 'weight' column.

Hint: use the `apply` method AND remember to save your output back into the dataframe.

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert df['weight'].unique().shape == (83,)
```

Now, let’s see the result of our hard work. The code below should output all numbers (or nan).

```
df['weight'].unique()
```

### 1k) Standardize ‘favorite_icecream’ column¶

Write a function named `standardize_icecream` that takes in a string as input and returns a standardized string.

This function should:

- ensure all inputs are strings (Note: np.nan is considered a float, and will be converted to 'nan' if typecast with `str()`)
- convert all characters of the string into lowercase
- strip the string of all leading and trailing whitespace
- standardize the flavors such that:
  - if either 'don't' or 'no favorite' is in the response, the flavor is recorded as `np.nan`
  - if either 'cream' or 'creme' is in the response, the flavor is recorded as 'cookies & cream'
  - if 'dough' is in the response, the flavor is recorded as 'chocolate chip cookie dough' (we'll consider cookie dough and chocolate chip cookie dough to be the same)
  - if 'vanilla' is in the response, the flavor is recorded as 'vanilla'
  - if 'mint' is in the response, the flavor is recorded as 'mint chocolate chip' (we'll consider mint and mint chocolate chip to be the same)
  - if 'oreo' is in the response, the flavor is recorded as 'oreo'
  - if 'pistac' is in the response, the flavor is recorded as 'pistachio' (note the different spellings in the original data)
  - if 'matcha' is in the response, the flavor is recorded as 'matcha'
- return the standardized ice cream flavor

```
# Check all the different answers we received
df['favorite_icecream'].unique()
```

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert standardize_icecream('vanilla') == 'vanilla'
assert standardize_icecream('Vanilla') == 'vanilla'
assert standardize_icecream(np.nan) == 'nan'
# cases that follow instructions
assert standardize_icecream('this is not actually a flavor but has the word cream') == 'cookies & cream'
assert standardize_icecream('cookies & creme') == 'cookies & cream'
assert standardize_icecream('Vanilla ') == 'vanilla'
```

### 1l) Transform ‘favorite_icecream’ column¶

Use `standardize_icecream` to transform the data in the 'favorite_icecream' column.

Hint: use the `apply` method AND remember to save your output back into the dataframe.

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
# check output now that we've standardized
df['favorite_icecream'].unique()
```

```
assert df['favorite_icecream'].unique().shape == (82,)
```

So far, you’ve gotten a taste of what it is like to deal with messy data. It’s not easy, as you can tell.

The last variable we need to standardize for the purposes of our analysis is ‘height’. We will standardize that one for you.

Do read the code below and try to understand what it is doing.

```
# First, we'll look at the possible values for height
df['height'].unique()
```

It seems like we’ll have to handle different measurement systems. Ugh, ok…

Let’s write a function that converts all those values to inches.

```
# Convert all height values to inches
def standardize_height(string):

    orig = string
    output = None

    # Basic string pre-processing
    string = string.lower()
    string = string.strip()
    string = string.replace('foot', 'ft')
    string = string.replace('feet', 'ft')
    string = string.replace('inches', 'in')
    string = string.replace('inch', 'in')
    string = string.replace('meters', 'm')
    string = string.replace('meter', 'm')
    string = string.replace('centimeters', 'cm')
    string = string.replace('centimeter', 'cm')
    string = string.replace(',', '')
    string = string.strip()

    # CASE 1: string is written in the format FEET <DIVIDER> INCHES
    dividers = ["'", "ft", "’", '”', '"', '`', "-", "''"]

    for divider in dividers:

        # Split the string on the divider
        elements = string.split(divider)

        # If the divider creates two elements
        if (len(elements) >= 2) and ((len(string) - 1) != string.find(divider)):

            feet = elements[0]
            inch = elements[1] if elements[1] != '' else '0'

            # Clean extraneous symbols
            for symbol in dividers:
                feet = feet.replace(symbol, '')
                inch = inch.replace(symbol, '')
            inch = inch.replace('in', '')

            # Remove whitespace
            feet = feet.strip()
            inch = inch.strip()

            # By this point, we expect 'feet' and 'inch' to be numeric
            # If not... we ignore this case
            if feet.replace('.', '').isnumeric() and inch.replace('.', '').isnumeric():
                # Convert feet to inches and add to the current inches
                output = (float(feet) * 12) + float(inch)
                break

    # CASE 2: string is written in the format FEET ft INCHES in
    if ('ft' in string) and ('in' in string):

        # Split the string into its elements
        elements = string.split('ft')
        feet = elements[0]
        inch = elements[1]

        # Remove extraneous symbols and strip whitespace
        inch = inch.replace('inch', '')
        inch = inch.replace('in', '')
        feet = feet.strip()
        inch = inch.strip()

        # By this point, we expect 'feet' and 'inch' to be numeric
        # If not... we ignore this case
        if feet.replace('.', '').isnumeric() and inch.replace('.', '').isnumeric():
            # Convert feet to inches and add to the current inches
            output = (float(feet) * 12) + float(inch)

    # CASE 3: answer was given ONLY in cm
    # Convert to inches: approximately 0.39 inches in a centimeter
    elif 'cm' in string:
        centimeters = string.replace('cm', '')
        centimeters = centimeters.strip()
        if centimeters.replace('.', '').isnumeric():
            output = float(centimeters) * 0.39

    # CASE 4: answer was given ONLY in meters
    # Convert to inches: approximately 39 inches in a meter
    elif 'm' in string:
        meters = string.replace('m', '')
        meters = meters.strip()
        if meters.replace('.', '').isnumeric():
            output = float(meters) * 39

    # CASE 5: answer was given ONLY in feet
    elif 'ft' in string:
        feet = string.replace('ft', '')
        feet = feet.strip()
        if feet.replace('.', '').isnumeric():
            output = float(feet) * 12

    # CASE 6: answer was given ONLY in inches
    elif 'in' in string:
        inches = string.replace('in', '')
        inches = inches.strip()
        if inches.replace('.', '').isnumeric():
            output = float(inches)

    # CASE 7: answer not covered by existing scenarios / was invalid
    # Return NaN
    if not output:
        output = np.nan

    return output
```

```
# Applying the transformation and dropping invalid rows
df['height'] = df['height'].apply(standardize_height)
df = df.dropna(subset=['height'])
```

```
# Check the height data, after applying our standardization
df['height'].unique()
```

```
# Ensuring that the data types are correct - type cast age to int.
df['age'] = df['age'].astype(np.int64)
# Check out the data, after we've cleaned it!
df.head()
```

```
# Check that the dataframe has the right number of rows
# If this doesn't pass - check your code in the section above.
assert len(df) == 365
```

## Part 2: Exploratory Data Visualization (0.8 points)¶

First, we need to do some exploratory data visualization, to get a feel for the data.

For plotting questions, do not change or move the `plt.gcf()` lines.

### 2a) Plot the data¶

Using `scatter_matrix`, from `pandas`, plot `df`. Assign the output to a variable called `fig`.

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert np.all(fig)
```

### 2b) Plot a bar chart showing the number of students in each major.¶

Hint:

- if using `seaborn`, you're looking to make a countplot
- if using pandas, you can use `value_counts` to get the counts for each major; you can then use the `plot` method from `pandas` for plotting (you don't need `matplotlib`)
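For instance, with made-up counts, the pandas route looks like this (a sketch; matplotlib's non-interactive 'Agg' backend is selected here only so the snippet runs headlessly, outside a notebook):

```python
import matplotlib
matplotlib.use('Agg')   # headless backend, safe outside notebooks
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical major responses
toy = pd.Series(['COGSCI', 'COGSCI', 'COMPSCI', 'MATH', 'COGSCI'])
counts = toy.value_counts()
print(counts.to_dict())   # {'COGSCI': 3, 'COMPSCI': 1, 'MATH': 1}

ax = counts.plot(kind='bar')      # pandas wraps matplotlib for us
ax.set_ylabel('Number of students')
fig = plt.gcf()
print(fig.gca().has_data())       # True
```

`fig.gca().has_data()` is the same check the autograder-style asserts below use: it is True once the current axes contain plotted artists.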

```
# YOUR CODE HERE
raise NotImplementedError()
f1 = plt.gcf()
```

```
assert f1.gca().has_data()
```

### 2c) Plot a histogram of the height data for all students who wrote ‘COGSCI’ as their major.¶

```
# YOUR CODE HERE
raise NotImplementedError()
f2 = plt.gcf()
```

```
assert f2.gca().has_data()
```

### 2d) Plot a histogram of the height data for all students who wrote ‘COMPSCI’ as their major.¶

```
# YOUR CODE HERE
raise NotImplementedError()
f3 = plt.gcf()
```

```
assert f3.gca().has_data()
```

## Part 3: Exploring The Data (0.8 points)¶

Beyond just plotting the data, we should check some other basic properties of the data. This serves both as a way to get a 'feel' for the data, and to look for any quirks or oddities that may indicate issues needing resolution. To do this, let's explore the data a bit (not limiting ourselves to only the features we plan to use - exploring the dataset as a whole can help us find any issues).

Notes:

Your answers should NOT be pandas objects (Series or DataFrames), extract answers so the variables are ints, floats or strings (as appropriate).

You must answer these questions programmatically: do not count / check and hard code particular values.

### 3a) How many different majors are in the dataset?¶

Save this number to a variable `n_majors`.

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert n_majors > 0 and n_majors < 25
```

### 3b) What is the range (max value - min value) of ages in the dataset?¶

Save this number to a variable `r_age`.

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert r_age > 0 and r_age < 50
```

### 3c) What is the most popular ice cream flavor?¶

Save the ice cream name to the variable `f_ice`, and the number of people who like it to a variable `n_ice`.

Hint: you can get these values using the `value_counts` method.
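On a toy series, `value_counts` sorts by frequency, so the first index entry is the most common value and the first count is how many times it appears (hypothetical flavors below, not the real answer):

```python
import pandas as pd

toy = pd.Series(['vanilla', 'matcha', 'vanilla', 'oreo', 'vanilla'])
counts = toy.value_counts()

top_flavor = counts.index[0]      # most frequent value
top_count = int(counts.iloc[0])   # extract a plain int, not a pandas object
print(top_flavor, top_count)      # vanilla 3
```

Note the explicit `int(...)`: the instructions above ask for plain Python types, not pandas objects, so it is worth extracting scalars this way.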

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert n_ice > 0 and n_ice < 50
```

```
assert f_ice
```

### 3d) How many people have a unique favorite ice cream? (In other words: how many ice cream flavors are only 1 person’s favorite?)¶

Save this number to a variable `u_ice`.
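One way to count values that occur exactly once is to compare the `value_counts` output against 1 and sum the resulting booleans (shown on hypothetical flavors):

```python
import pandas as pd

toy = pd.Series(['vanilla', 'matcha', 'vanilla', 'oreo'])
counts = toy.value_counts()

# counts == 1 gives a boolean Series; each True sums as 1
n_singletons = int((counts == 1).sum())
print(n_singletons)   # 2  (matcha and oreo each appear once)
```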

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert u_ice > 0 and u_ice < 100
```

## Part 4: Testing Distributions (0.45 points)¶

Soon, in the data analysis, we will want to run some statistical tests on our data. First, we should check the distributions!

When using methods / statistical tests that make certain assumptions, it’s always best to explicitly check if your data meet those assumptions (otherwise the results may be invalid). Let’s test if our data are in fact normally distributed.

See an example of how to test the distributions of data in the ‘TestingDistributions’ notebook in Tutorials.

For convenience, and consistency, we're providing this code to pull out the required data. Be sure to run the following cell and understand what it's doing:

```
h_co = df[df['major'] == 'COGSCI']['height'].values
h_cs = df[df['major'] == 'COMPSCI']['height'].values
```

### 4a) Testing Normality¶

For each of `h_co` and `h_cs`, use the `normaltest` function to test for normality of the distribution.

`normaltest` returns two values: (1) a test statistic and (2) a p-value.

Save these values as `st_co`, `p_co`, `st_cs`, and `p_cs`, respectively.
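As a sanity check of the interface, `normaltest` returns a (statistic, p-value) pair that can be unpacked directly. Here it is run on synthetic data (a sketch with made-up numbers, not the survey heights):

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(0)
sample = rng.normal(loc=67, scale=3, size=200)   # synthetic 'heights'

stat, p = normaltest(sample)
# The statistic is a sum of squared terms (so non-negative),
# and the p-value is a probability in [0, 1]
print(stat >= 0, 0 <= p <= 1)   # True True
```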

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert st_co
assert p_co
assert st_cs
assert p_cs
```

Have a look at the values returned.

Based on these results, and using an alpha significance value of 0.01:

Set boolean values (`True`, `False`) of whether each distribution can be considered normally distributed. Set `True` if the test is consistent with a normal distribution (more formally, we fail to reject the null hypothesis) and `False` if the test suggests the data are not normally distributed (we reject the null hypothesis).

### 4b) Set boolean values, as specified above.¶

For the `h_co` data, set a boolean value to the variable `is_n_co`.

For the `h_cs` data, set a boolean value to the variable `is_n_cs`.
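The decision rule above can be written directly as a comparison. Note the `bool()` wrapper, which ensures a plain Python bool (as the asserts below require) rather than a numpy bool; the p-values here are made-up stand-ins:

```python
alpha = 0.01

# Hypothetical p-values from a normality test
p_good = 0.20     # fail to reject: consistent with normal
p_bad = 0.0005    # reject: evidence against normality

is_normal_good = bool(p_good > alpha)
is_normal_bad = bool(p_bad > alpha)
print(is_normal_good, is_normal_bad)   # True False
```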

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert isinstance(is_n_co, bool)
assert isinstance(is_n_cs, bool)
```

**CO data**: plot the comparison of the data and a normal distribution (this code provided)

This plots a histogram, with the hypothetical normal distribution (with same mean and variance)

```
xs = np.arange(h_co.min(), h_co.max(), 0.1)
fit = stats.norm.pdf(xs, np.mean(h_co), np.std(h_co))
# most easily done using matplotlib
plt.plot(xs, fit, label = 'Normal Dist.', lw = 4)
plt.hist(h_co, density = True, label = 'Actual Data');
plt.title('Cognitive Science - Height Data')
plt.legend();
```

**CS data**: plot the comparison of the data and a normal distribution (this code provided)

This plots a histogram, with the hypothetical normal distribution (with same mean and variance)

```
xs = np.arange(h_cs.min(), h_cs.max(), 0.1)
fit = stats.norm.pdf(xs, np.mean(h_cs), np.std(h_cs))
plt.plot(xs, fit, label = 'Normal Dist.', lw = 4)
plt.hist(h_cs, density = True, label = 'Actual Data');
plt.title('Computer Science - Height Data')
plt.legend();
```

## Part 5: Data Analysis (2.4 points)¶

Now let’s analyze the data, to address our research question.

For the purposes of this analysis, let’s assume we need at least 75 students per major to analyze the height data.

This means we are only going to use data from people who wrote ‘COGSCI’ or ‘COMPSCI’ as their major.

### 5a) Pull out the data we are going to use:¶

Save the height data for all 'COGSCI' majors to a variable called `h_co`.

Save the height data for all 'COMPSCI' majors to a variable called `h_cs`.

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert np.all(h_co)
assert np.all(h_cs)
assert len(h_co) == 178
assert len(h_cs) == 164
```

### 5b) What is the average (mean) height for students from each major?¶

Save these values to `avg_h_co` for COGSCI students, and `avg_h_cs` for COMPSCI students.

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert avg_h_co
assert avg_h_cs
```

```
# Print out the average heights - this code provided
print('Average height of cogs majors is \t {:2.2f} inches'.format(avg_h_co))
print('Average height of cs majors is \t\t {:2.2f} inches'.format(avg_h_cs))
```

Based on the cell above, it looks like there might indeed be a difference in the average height for students in cogs vs cs majors.

Now we want to statistically test this difference. To do so, we will use a t-test.

### 5c) Compare distributions: t-test¶

Use the `ttest_ind` function to compare the two height distributions (`h_co` vs `h_cs`).

`ttest_ind` returns a t-statistic and a p-value. Save these outputs to `t_val` and `p_val`, respectively.
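The call itself is one line; on synthetic groups with clearly different means it looks like this (a sketch with made-up numbers, not the survey data):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
group_a = rng.normal(66, 3, size=100)   # hypothetical heights, mean ~66
group_b = rng.normal(70, 3, size=100)   # hypothetical heights, mean ~70

t_stat, p_value = ttest_ind(group_a, group_b)
# The sign of t follows mean(group_a) - mean(group_b), so here it is negative,
# and a 4-inch mean gap with n=100 per group is easy to detect
print(p_value < 0.01)   # True
```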

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert t_val
assert p_val
```

```
# Check if statistical test passes significance, using an alpha value of 0.01. This code provided.
if p_val < 0.01:
    print('Data Science accomplished, there is a significant difference!')
else:
    print('There is NOT a significant difference!')

# Editorial note:
# Chasing significant p-values as the goal itself is not actually a good way to do data (or any) science :)
```

Note: this test *should* pass significance. If it doesn’t, double check your code up until this point.

So - we’ve reached a conclusion! We’re done, right?!

Nope. We have a first-pass analysis, and an interim conclusion that happens to follow our hypothesis.

Now let’s try to break it.

#### Let’s explore some more¶

You should always interrogate your findings, however they come out. What are some alternate explanations that would change our interpretation of the current analysis?

In this case, we should be worried about confounding variables. We want to be able to say whether height relates to major specifically, but it could be that some other variable that happens to differ between the majors better explains the difference in height.

In this case, we also have data on gender. Let’s check if differences in the gender ratio of the two majors can explain the difference in height.

### 5d) Digging Deeper¶

Using `value_counts` from pandas, extract the number of ‘male’ and ‘female’ students, separately for cogs and cs majors.

To do so: select each major separately from `df`, extract the gender column, and use the `value_counts` method.

Save the counts for each gender for ‘COGSCI’ majors to a variable called `g_co`.

Save the counts for each gender for ‘COMPSCI’ majors to a variable called `g_cs`.
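The `value_counts` pattern, shown on a hypothetical gender column (made-up values):

```python
import pandas as pd

# Hypothetical gender column for one major
toy_gender = pd.Series(['female', 'male', 'female', 'female'])

# value_counts returns the count of each unique value, sorted descending
counts = toy_gender.value_counts()
print(counts['female'], counts['male'])  # → 3 1
```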

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert np.all(g_co)
assert np.all(g_cs)
assert g_co[0] == 91
assert g_cs[1] == 38
assert g_cs[2] == 1
```

### 5e) What is the ratio of women in each major?¶

By ratio, we mean the proportion of students that are female: a value between 0.0 and 1.0, calculated as #F / (#F + #M + #nonbinary_or_other), separately for each major.

You can use the `g_co` and `g_cs` variables to calculate these.

Save the ratio of women in COGSCI to a variable `r_co`.

Save the ratio of women in COMPSCI to a variable `r_cs`.

Note: keep these numbers as ratios (they should be decimal numbers, less than 1).
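For example, on hypothetical counts (labels and numbers made up), the ratio is the female count divided by the total:

```python
import pandas as pd

# Hypothetical gender counts, shaped like a value_counts result
toy_counts = pd.Series({'male': 60, 'female': 30, 'nonbinary_or_other': 10})

# Proportion of women: female count over the total across all categories
ratio = toy_counts['female'] / toy_counts.sum()
print(ratio)  # → 0.3
```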

```
g_cs
```

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert r_co
assert r_cs
```

Make sure you print out and check the values of these ratios. They seem pretty different.

We can actually ask, using a chi-squared test, whether the gender ratios of the two majors are significantly different.

Code to do this is provided below.

Run a chi-squared test of the difference of ratios of categorical data between groups:

```
chisq, p_val_chi = stats.chisquare(np.array([g_co.values, g_cs.values]), axis=None)

if p_val_chi < 0.01:
    print('There is a significant difference in ratios!')
```

### 5f) Subsetting data¶

Create a new dataframe, called `df2`, which only includes data from ‘COGSCI’ and ‘COMPSCI’ majors.

Hint: you can do this using the or operator ‘|’, with `loc`.
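The `loc` + ‘|’ pattern, on a toy dataframe (hypothetical values):

```python
import pandas as pd

toy = pd.DataFrame({'major': ['COGSCI', 'COMPSCI', 'MATH'],
                    'height': [65.0, 70.0, 68.0]})

# Combine two conditions with '|'; note the parentheses around each condition
subset = toy.loc[(toy['major'] == 'COGSCI') | (toy['major'] == 'COMPSCI')]
print(subset.shape)  # → (2, 2)
```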

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert isinstance(df2, pd.DataFrame)
assert df2.shape == (342, 10)
assert set(df2['major']) == set(['COGSCI', 'COMPSCI'])
```

### 5g) Pivot Tables¶

Another way to look at these kinds of comparisons is pivot tables.

Use the pandas `pivot_table` method to create a pivot table, and assign it to a variable `pv`.

Set the values as ‘height’, and the indices as ‘gender’ and ‘major’ in the pivot table.

Make sure you do this using `df2`.
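The call has the same shape as this toy example (made-up values; `pivot_table`'s default aggregation is the mean):

```python
import pandas as pd

toy = pd.DataFrame({'gender': ['female', 'female', 'male', 'male'],
                    'major':  ['COGSCI', 'COMPSCI', 'COGSCI', 'COMPSCI'],
                    'height': [64.0, 65.0, 70.0, 71.0]})

# Mean height per (gender, major) cell; the rows become a MultiIndex
toy_pv = pd.pivot_table(toy, values='height', index=['gender', 'major'])
print(toy_pv)
```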

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
pv.index.levels[0]
```

```
assert np.all(pv)
assert isinstance(pv.index, pd.MultiIndex)
```

Print out the pivot table you just created.

Compare the average height values, split up by major and gender.

Does it look like there are differences in heights by major, when split up by gender?

```
pv
```

Let’s recap where we are:

Our initial analysis found a significant difference between the heights of students in the two majors.

However, further exploration suggested there may be a confounding variable: the gender balance also differs significantly between the majors.

Checking the average height per major, split up by gender, suggests there may not be a difference between majors beyond what is explained by gender.

Now we want to ask this question statistically: is there still a difference in height between majors, when controlling for differences in gender?

#### Linear Models¶

For the following question you will need to make some linear models, using Ordinary Least Squares (OLS).

There is more than one way to do this in Python. For the purposes of this assignment, you must use the method that is outlined in the ‘LinearModels’ Tutorial, using patsy, and statsmodels.

That is:

1. Create design matrices with `patsy.dmatrices`
2. Initialize an OLS model with `sm.OLS`
3. Fit the OLS model
4. Check the summary for results

### 5h) Linear model¶

Create a linear model to predict height from major (using `df2` as data).

Use `patsy.dmatrices` to create the design matrices, calling the outputs `outcome_1` and `predictors_1`.

Create an OLS model (`sm.OLS`) using `outcome_1` and `predictors_1`. Call it `mod_1`.

Fit the model, assigning the result to `res_1`.

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert isinstance(outcome_1, patsy.design_info.DesignMatrix)
assert isinstance(predictors_1, patsy.design_info.DesignMatrix)
assert isinstance(mod_1, sm.regression.linear_model.OLS)
assert isinstance(res_1, sm.regression.linear_model.RegressionResultsWrapper)
```

```
# Print out the summary results of the model fitting
print(res_1.summary())
```

### 5i) Assess significance: `mod_1`¶

Based on the model you ran above (using an alpha value of 0.01), does major significantly predict height?

Set your answer as a boolean (True / False) to a variable called `lm_1`.

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert isinstance(lm_1, bool)
```

### 5j) Multivariate regression¶

Create a linear model to predict height from both major and gender (using `df2` as data).

Use `patsy.dmatrices` to create the design matrices, calling the outputs `outcome_2` and `predictors_2`.

Create an OLS model (`sm.OLS`) using `outcome_2` and `predictors_2`. Call it `mod_2`.

Fit the model, assigning the result to `res_2`.

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert isinstance(outcome_2, patsy.design_info.DesignMatrix)
assert isinstance(predictors_2, patsy.design_info.DesignMatrix)
assert isinstance(mod_2, sm.regression.linear_model.OLS)
assert isinstance(res_2, sm.regression.linear_model.RegressionResultsWrapper)
```

```
# Print out the results
print(res_2.summary())
```

### 5k) Assess significance: `mod_2`¶

Based on the model you ran above (using an alpha value of 0.01), does major significantly predict height?

Set your answer as a boolean (True / False) to a variable called `lm_2`.

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert isinstance(lm_2, bool)
```

## Part 6: Discussion & Conclusions (0.1 points)¶

### 6a) Conclusion¶

Set a boolean variable, called `ans`, as True or False as the answer to the following statement:

We have evidence supporting our research hypothesis: people in different majors have systematically different heights (and this difference can be tied to their major).

```
# YOUR CODE HERE
raise NotImplementedError()
```

```
assert isinstance(ans, bool)
```

### 6b) Summary¶

Write a short response (1-2 sentences) summarizing the results.

Did we support our hypothesis? Why or why not? What turned out to be the finding(s)?

YOUR ANSWER HERE

## The End!¶

Good work! Have a look back over your answers, and also make sure to `Restart & Run All` from the kernel menu to double check that everything is working properly. You can also use the ‘Validate’ button above, which runs your notebook from top to bottom and checks to ensure all `assert` statements pass silently. When you are ready, submit on datahub!