Data Analysis¶

This is a demo assignment that is openly available for the Data Science in Practice Course.

If you are in the COGS108 course at UC San Diego, this is NOT a valid version of the assignment for the course.

Important Reminders¶

This assignment has hidden tests: tests that are not visible here, but that will be run on your submitted assignment for grading.
- This means passing all the tests you can see in the notebook here does not guarantee you have the right answer!
- In particular many of the tests you can see simply check that the right variable names exist. Hidden tests check the actual values.
  - It is up to you to check the values, and make sure they seem reasonable.
If things start behaving strangely, restart the kernel and re-run the code as a first-line check.
- For example, note that some cells can only be run once, because they re-write a variable (for example, your dataframe), and change it in a way that means a second execution will fail.
- Also, running some cells out of order might change the dataframe in ways that may cause an error, which can be fixed by re-running.

Run the following cell. These imports are all you need for the assignment. Do not import additional packages.

# Imports 
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
sns.set()
sns.set_context('talk')

import warnings
warnings.filterwarnings('ignore')

import patsy
import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import ttest_ind, chisquare, normaltest
# Note: the statsmodels import may print out a 'FutureWarning'. That's fine.

Downloading Data¶

If you download this notebook to run locally, you will also need some data files.

Running the next cell will download the required files for this assignment.

You can also view and download these files from https://github.com/DataScienceInPractice/Data.

from urllib.request import urlretrieve

data_url = 'https://raw.githubusercontent.com/DataScienceInPractice/Data/master/'

# Fill in these values
assignment = 'A5'
data_files = ['COGS108_IntroQuestionnaireData.csv']

for data_file in data_files:
    # Build the URL with '/' explicitly: os.path.join would insert '\' on Windows
    full_path = data_url + assignment + '/' + data_file
    urlretrieve(full_path, filename=data_file)

Setup¶

Data: the responses collected from a previous survey of the COGS 108 class.

There are 416 observations in the data, covering 10 different ‘features’.

Research Question: Do students in different majors have different heights?

Background: Physical height has previously been shown to correlate with career choice and career success. More recently, it has been demonstrated that these correlations can actually be explained by height in high school, as opposed to height in adulthood (1). It is currently unclear whether height correlates with choice of major in university.

Reference: 1) https://www.sas.upenn.edu/~apostlew/paper/pdf/short.pdf

Hypothesis: We hypothesize that there will be a relation between height and chosen major.

Part 1: Load & Clean the Data (2.95 points)¶

Fixing messy data makes up a large amount of the work of being a Data Scientist.

The real world produces messy measurements and it is your job to find ways to standardize your data such that you can make useful analyses out of it.

In this section, you will learn, and practice, how to successfully deal with unclean data.

1a) Load the data¶

Import datafile COGS108_IntroQuestionnaireData.csv into a DataFrame called df.

# YOUR CODE HERE
raise NotImplementedError()

assert isinstance(df, pd.DataFrame)

# Check out the data
df.head(5)

Those column names are a bit excessive, so first let’s rename them - code provided below to do so.

# Renaming the columns of the dataframe
df.columns = ['timestamp', 'year', 'major', 'age', 'gender', 'height',
              'weight', 'eye_color', 'born_in_CA', 'favorite_icecream']

pandas has a very useful function for detecting missing data. This function is called isnull().

If you have a dataframe called df, then calling df.isnull() will return another dataframe of the same size as df where every cell is either True or False.

Each True or False is the answer to the question ‘is the data in this cell null?’. So, False, means the cell is not null (and therefore, does have data). True means the cell is null (does not have data).

This function is very useful because it allows us to find missing data very quickly in our dataframe. As an example, consider the code below.

# Check the first few rows of the 'isnull' dataframe
df.isnull().head(5)

If you print out more, and scroll down, you’ll see some rows with missing data.

# For example:
df.isnull().iloc[48:50, :]

Check an example, row 49, in which an entry has missing data:

df.iloc[49, :]

Granted, the example above is not very informative. As you can see, the output of isnull() is a dataframe where the value in each cell is either True or False. Most cells have the value False. We expect this to be the case, since most people answered every question in our survey.

However, some rows, such as row 49, show that some people chose not to answer certain questions. In the case of row 49, it seems that someone did not answer ‘What year (in school) are you?’

However, what if we wanted to use isnull() to see all rows where our dataframe df has missing values? In other words, what if we want to see the ACTUAL rows with missing values instead of this dataframe with True or False cells? For that, we need to write the following line of code:

df[df.isnull().any(axis=1)]
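To see the pieces of that line separately, here is a minimal sketch on a small made-up dataframe. The name toy and its contents are hypothetical and are not part of the survey data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the survey data
toy = pd.DataFrame({
    'year': ['2', np.nan, '3'],
    'major': ['cogsci', 'math', np.nan],
    'age': [20, 21, 22],
})

# One boolean per cell: True where the value is missing
null_mask = toy.isnull()

# Collapse across the columns: one boolean per row,
# True if any cell in that row is missing
row_has_null = null_mask.any(axis=1)

# Boolean indexing keeps only the rows flagged True
rows_with_missing = toy[row_has_null]
print(rows_with_missing)
```

Here the first row is complete, so only the second and third rows are kept.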

1b) Find missing data¶

Find all rows that have missing data in them.

Save the output, as a dataframe, into a variable called rows_to_drop.

In other words, copy over and use the line of code that we provided in the cell above.

# YOUR CODE HERE
raise NotImplementedError()

# check output
rows_to_drop.shape

assert isinstance(rows_to_drop, pd.DataFrame)
assert rows_to_drop.shape == (29, 10)

In the cell below, briefly explain how df[df.isnull().any(axis=1)] works, in a couple of sentences.

Include an explanation of what any(axis=1) means and how it affects the code.

YOUR ANSWER HERE

Run the following cell and look at its output, but you don’t need to add any code here.

Real-world data are messy. As an example, consider the data shown in rows_to_drop (below).

If you’ve done everything correctly so far, you should see an unexpected response with emojis at index 357. These types of responses, although funny, are hard to parse when dealing with big datasets.

We’ll learn about solutions to these types of problems in the upcoming cells.

rows_to_drop

1c) Drop the rows with NaN values¶

Drop any rows with missing data, but only for the columns major, height, gender and age. These will be the data of primary interest for our analyses, so we drop missing data here.

Note that there are other missing data (in other rows) but this is fine for our analyses, so we keep them.

To do this, use the pandas dropna method with inplace=True, using the subset argument to specify the columns.
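As a rough illustration of how the subset argument behaves, here is a sketch on a made-up dataframe. The name toy, its columns, and its values are assumptions for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the survey data
toy = pd.DataFrame({
    'major': ['cogsci', np.nan, 'math'],
    'height': ['5ft 6in', '170cm', '70 inches'],
    'weight': [np.nan, '150', '160'],
})

# Only NaNs in the 'subset' columns trigger a drop.
# The NaN in 'weight' (first row) is left alone.
toy.dropna(subset=['major', 'height'], inplace=True)
print(toy)
```

The second row is dropped because its major is missing, while the first row survives even though its weight is missing.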

# YOUR CODE HERE
raise NotImplementedError()

assert df.shape == (403, 10)

In the rest of Part 1, we will write code, organized into functions, that allows us to transform similar responses into the same value. We will call this process standardizing the data.

The cell below provides an example for the kind of code you will need to write to answer this question. This example is separate from our actual data, and is a potential function we might use to standardize messy data - in this case, hypothetical data to the question ‘What is your favourite major python version?’.

Note some things used in this example that you need to use to standardize data:

- string methods, such as lower and strip to transform strings
- the replace string method, to replace a set of characters with something else
- if/else statements that check what’s in our string (number, letters, etc)
- type casting, for example using int() to turn a variable into an integer
- using np.nan (which stands for ‘not a number’) to denote missing or unknown data

Note: For the following few cells you should run, read, and understand the code provided, but you don’t have to add any code until Q1d.

# just run this cell
df['year'].unique()

The line of code above shows us the different values we got, to the question ‘What year (in school) are you?’.

As you can tell, it is a mess! For example, if you are a junior student, then you might have answered: 3, three, third, 3rd year, junior, junior year, Junior, etc.

That is an issue. We want to be able to analyze this data and, in order to do this successfully, we need all answers with the same meaning to be written in the same way. Therefore, we need to transform answers such as ‘3, third, 3rd, junior, etc’ into a single value. We’ll do this for all values that mean the same thing.

def example_standardize_function(str_in):
    '''Standardize data to the question 'what is your favorite major python version?'
    
    Parameters
    ----------
    str_in : string
        A provided answer.
        
    Returns
    -------
    int_out : int or np.nan
        A standardized integer response.
    '''
    
    # Make the input all lowercase
    str_in = str_in.lower()
    
    # Drop leading and trailing whitespace
    str_in = str_in.strip()
    
    # Replace things (and then strip again afterwards)
    #  Note that the 'replace' replaces the first argument, with the second
    #   The first argument does not need to be present in the string,
    #    if it's not there 'replace' does nothing (but does not error), so the code moves on.
    str_in = str_in.replace('version', '')
    str_in = str_in.replace('python', '')
    str_in = str_in.strip()
    
    # Cast to integer, if what's left seems appropriate
    if str_in.isnumeric() and len(str_in) == 1:
        out = int(str_in)
    # Otherwise, consider input was probably ill-formed, return nan
    else: 
        out = np.nan
    
    return out

# Check how this function helps standardize data:
#  Example possible answers to the question 'What is your favourite major version of Python':
print('INPUT', '\t\t-\t', 'OUTPUT')
for inp in ['version 3', '42', '2', 'python 3', 'nonsense-lolz']:
    print('{:10s} \t-\t {:1.0f}'.format(inp, example_standardize_function(inp)))

Now we have to standardize the data!

Check all different values given for majors. It’s a lot!

df['major'].unique()

We’ll write a function performing some simple substring checking in order to group many responses together.

def standardize_major(string):
    
    string = string.lower()
    string = string.strip()
    
    if 'cog' in string:
        output = 'COGSCI'
    elif 'computer' in string:
        output = 'COMPSCI'
    elif 'cs' in string:
        output = 'COMPSCI'
    elif 'math' in string:
        output = 'MATH'
    elif 'electrical' in string:
        output = 'ECE'
    elif 'bio' in string:
        output = 'BIO'
    elif 'chem' in string:
        output = 'CHEM'
    # Otherwise, if uncaught - keep as is
    else:
        output = string
    
    return output

We then have to apply the transformation using the function we just defined.

df['major'] = df['major'].apply(standardize_major)

Previewing the results of the previous transformation.

It looks a lot better, though it’s not perfect, but we’ll run with this.

df['major'].unique()

1d) Standardize ‘gender’ function¶

Next let’s check the ‘gender’ column.

Check the different responses received for gender, including how many of each response we have.

# run this to see different gender input data
df['gender'].value_counts()

Using a similar approach to what we used for ‘major’, you’ll write a standardize_gender function.

To do this you’ll:

- convert all text to lowercase
- use the string method strip() to remove leading and trailing whitespace from the gender value
- use an if/elif/else to:
  - output ‘female’ if the lowercase gender value is ‘female’, ‘f’, ‘woman’, ‘famale’, or ‘women’
  - output ‘male’ if the lowercase gender value is ‘male’, ‘m’, ‘man’, or ‘men’
  - output ‘nonbinary_or_trans’ if the lowercase gender value is ‘nonbinary’ or ‘transgender’
  - output np.nan otherwise
- return the output

# YOUR CODE HERE
raise NotImplementedError()

assert standardize_gender('f') == 'female'
assert standardize_gender('male') == 'male'
assert standardize_gender('Transgender') == 'nonbinary_or_trans'

1e) Transform ‘gender’ column¶

Apply the transformation: use your function to standardize the gender column in df.

Then, drop any rows with missing gender information.

# YOUR CODE HERE
raise NotImplementedError()

# Check the results
df['gender'].unique()

assert len(df['gender'].unique()) == 3
assert df.shape == (402, 10)

1f) Standardize other columns¶

Find, programmatically, the number of unique responses in the ‘year’ column.

Save the result in a variable named num_unique_responses.

Hint: you can answer this question using the unique method, used above.

# YOUR CODE HERE
raise NotImplementedError()
num_unique_responses

assert num_unique_responses
assert isinstance(num_unique_responses, int)

# Print out all the different answers in 'year'
df['year'].unique()

1g) Standardize ‘year’ column¶

Write a function named standardize_year that takes in as input a string and returns an integer.

The function will do the following (in the order specified):

Note that for these detailed instructions, each line corresponds to one line of code you need to write.

1. convert all characters of the string into lowercase
2. strip the string of all leading and trailing whitespace
3. replace any occurrences of ‘first’ with ‘1’
4. replace any occurrences of ‘second’ with ‘2’
5. replace any occurrences of ‘third’ with ‘3’
6. replace any occurrences of ‘fourth’ with ‘4’
7. replace any occurrences of ‘fifth’ with ‘5’
8. replace any occurrences of ‘sixth’ with ‘6’
9. replace any occurrences of ‘freshman’ with ‘1’
10. replace any occurrences of ‘sophomore’ with ‘2’
11. replace any occurrences of ‘junior’ with ‘3’
12. replace any occurrences of ‘senior’ with ‘4’
13. replace any occurrences of ‘year’ with ‘’ (remove it from the string)
14. replace any occurrences of ‘th’ with ‘’ (remove it from the string)
15. replace any occurrences of ‘rd’ with ‘’ (remove it from the string)
16. replace any occurrences of ‘nd’ with ‘’ (remove it from the string)
17. strip the string of all leading and trailing whitespace (again)
18. If the resulting string is a number and it is less than 10, then cast it into an integer and return that value
19. Else return np.nan to symbolize that the student’s response was not a valid entry

HINTS: you will need to use the functions lower(), strip(), isnumeric() and replace()

# YOUR CODE HERE
raise NotImplementedError()

assert standardize_year('2nd') == 2
assert standardize_year('sophomore') == 2
assert standardize_year('3rd year') == 3
assert standardize_year('5th') == 5
assert standardize_year('7    ') == 7
assert standardize_year('randomText') is np.nan

1h) Transform ‘year’ column¶

Use standardize_year to transform the data in the ‘year’ column (originally ‘What year (in school) are you?’).

Hint: use the apply function AND remember to save your output inside the dataframe

# YOUR CODE HERE
raise NotImplementedError()

assert len(df['year'].unique()) == 7

Assuming that all is correct up to this point, the line below should show all values now found in df['year'].

It should look a lot better. With this data, we can now make insightful analyses.

You should see an array with elements 1,2,3,4,5,6 and nan (not necessarily in that order).

Note that if you check the data type of this column, you’ll see that pandas converts these numbers to float, even though the applied function returns int, because np.nan is considered a float. This is fine.
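The int-to-float behaviour described in this note can be reproduced on a tiny made-up example. The helper to_int_or_nan and both Series below are hypothetical and exist only for this sketch.

```python
import numpy as np
import pandas as pd

def to_int_or_nan(text):
    """Hypothetical helper: return an int for digit strings, np.nan otherwise."""
    return int(text) if text.isnumeric() else np.nan

# Every value converts cleanly, so the result keeps an integer dtype
all_valid = pd.Series(['1', '2', '3']).apply(to_int_or_nan)

# One value becomes np.nan (a float), so the whole column is stored as float
some_invalid = pd.Series(['1', 'two', '3']).apply(to_int_or_nan)

print(all_valid.dtype)
print(some_invalid.dtype)
```

A single np.nan is enough to make pandas store the whole column as floats, which is why 3 shows up as 3.0 once missing values are present.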

df['year'].unique()

Let’s do it again. Let’s take a look at the responses in the ‘weight’ column, and then standardize them.

# First, ensure that all types are consistent, use strings
df['weight'] = df['weight'].astype(str)

# Check all the different answers we received
df['weight'].unique()

1i) Standardize ‘weight’ column¶

Write a function named standardize_weight that takes in as input a string and returns an integer.

The function will do the following (in the order specified):

1. convert all characters of the string into lowercase
2. strip the string of all leading and trailing whitespace
3. replace any occurrences of ‘lbs’ with ‘’ (remove it from the string)
4. replace any occurrences of ‘lb’ with ‘’ (remove it from the string)
5. replace any occurrences of ‘pounds’ with ‘’ (remove it from the string)
6. If the string contains the substring ‘kg’, then:
- 6.1) replace ‘kg’ with ‘’
- 6.2) strip the string of whitespace
- 6.3) cast the string into a float type using the function float()
- 6.4) multiply the resulting float by 2.2 (an approximate conversion of kilograms to pounds)
7. Try to cast the result to an integer with int() and return it. If the cast fails, return np.nan.
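To make the kilogram arithmetic concrete, here is a worked example for that one branch, using the value from the test cell. It is a sketch of the arithmetic only, not the full function, and the variable names are made up for illustration.

```python
# Worked example for the kilogram branch only (hypothetical variable names)
raw = '101 kg'

value = raw.replace('kg', '').strip()   # leaves the string '101'
pounds = float(value) * 2.2             # approximately 222.2
whole_pounds = int(pounds)              # int() truncates the decimal part

print(whole_pounds)

# int() raises ValueError on text that is not a number,
# which is why the final step needs a try/except
try:
    int('not a number')
    cast_failed = False
except ValueError:
    cast_failed = True
```

Note that int() truncates rather than rounds, so 222.2 becomes 222.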

# YOUR CODE HERE
raise NotImplementedError()

assert standardize_weight('34 lbs') == 34
assert standardize_weight('101 kg') == 222

1j) Transform ‘weight’ column¶

Use standardize_weight to transform the data in the ‘weight’ column.

Hint: use the apply function AND remember to save your output inside the dataframe

# YOUR CODE HERE
raise NotImplementedError()

assert df['weight'].unique().shape == (83,)

Now, let’s see the result of our hard work. The code below should output all numbers (or nan).

df['weight'].unique()

1k) Standardize ‘favorite_icecream’ column¶

Write a function named standardize_icecream that takes in as input a string and returns a standardized string.

This function should:

- ensure all inputs are strings (Note: np.nan is considered a float that will be converted to ‘nan’ if typecast with str())
- convert all characters of the string into lowercase
- strip the string of all leading and trailing whitespace
- standardize the flavors such that:
  - if either ‘don’t’ or ‘no favorite’ is in the response, the flavor is recorded as np.nan
  - if either ‘cream’ or ‘creme’ is in the response, the flavor is ‘cookies & cream’
  - if ‘dough’ is in the response, the flavor is recorded as ‘chocolate chip cookie dough’ (we’ll consider cookie dough and chocolate chip cookie dough to be the same)
  - if ‘vanilla’ is in the response, the flavor is recorded as ‘vanilla’
  - if ‘mint’ is in the response, the flavor is recorded as ‘mint chocolate chip’ (we’ll consider mint and mint chocolate chip to be the same)
  - if ‘oreo’ is in the response, the flavor is recorded as ‘oreo’
  - if ‘pistac’ is in the response, the flavor is recorded as ‘pistachio’ (note the different spellings in original)
  - if ‘matcha’ is in the response, the flavor is recorded as ‘matcha’
- return the standardized ice cream flavor

# Check all the different answers we received
df['favorite_icecream'].unique()

# YOUR CODE HERE
raise NotImplementedError()

assert standardize_icecream('vanilla') == 'vanilla'
assert standardize_icecream('Vanilla') == 'vanilla'
assert standardize_icecream(np.nan) == 'nan'

# cases that follow instructions
assert standardize_icecream('this is not actually a flavor but has the word cream') == 'cookies & cream'
assert standardize_icecream('cookies & creme') == 'cookies & cream'
assert standardize_icecream('Vanilla ') == 'vanilla'

1l) Transform ‘favorite_icecream’ column¶

Use standardize_icecream to transform the data in the ‘favorite_icecream’ column.

Hint: use the apply function AND remember to save your output inside the dataframe

# YOUR CODE HERE
raise NotImplementedError()

# check output now that we've standardized
df['favorite_icecream'].unique()

assert df['favorite_icecream'].unique().shape == (82,)

So far, you’ve gotten a taste of what it is like to deal with messy data. It’s not easy, as you can tell.

The last variable we need to standardize for the purposes of our analysis is ‘height’. We will standardize that one for you.

Do read the code below and try to understand what it is doing.

# First, we'll look at the possible values for height
df['height'].unique()

It seems like we’ll have to handle different measurement systems. Ugh, ok…

Let’s write a function that converts all those values to inches.

# convert all values to inches
def standardize_height(string):
    
    orig = string
    output = None
    
    # Basic string pre-processing
    string = string.lower()
    string = string.strip()
    
    string = string.replace('foot', 'ft')
    string = string.replace('feet', 'ft')
    string = string.replace('inches', 'in')
    string = string.replace('inch', 'in')
    string = string.replace('meters', 'm')
    string = string.replace('meter', 'm')
    string = string.replace('centimeters', 'cm')
    string = string.replace('centimeter', 'cm')
    string = string.replace(',', '')
    string = string.strip()
    
    # CASE 1: string is written in the format FEET <DIVIDER> INCHES
    dividers = ["'", "ft", "’", '”', '"','`', "-", "''"]
    
    for divider in dividers:
        
        # Split it into its elements
        elements = string.split(divider)

        # If the divider creates two elements
        if (len(elements) >= 2) and ((len(string) -1) != string.find(divider)):
            feet = elements[0]
            inch = elements[1] if elements[1] != '' else '0'
            
            # Cleaning extraneous symbols
            for symbol in dividers:
                feet = feet.replace(symbol, '')
                inch = inch.replace(symbol, '')
                inch = inch.replace('in','')
            
            # Removing whitespace
            feet = feet.strip()
            inch = inch.strip()
            
            # By this point, we expect 'feet' and 'inch' to be numeric
            # If not...we ignore this case
            if feet.replace('.', '').isnumeric() and inch.replace('.', '').isnumeric():
                
                # Converting feet to inches and adding it to the current inches
                output = (float(feet) * 12) + float(inch)
                break
            
    # CASE 2: string is written in the format FEET ft INCHES in 
    if ('ft' in string) and ('in' in string):
        
        # Split it into its elements
        elements = string.split('ft')
        feet = elements[0]
        inch = elements[1]
        
        # Removing extraneous symbols and stripping whitespace
        inch = inch.replace('inch', '')
        inch = inch.replace('in', '')
        feet = feet.strip()
        inch = inch.strip()
        
        # By this point, we expect 'feet' and 'inch' to be numeric
        # If not...we ignore this case
        if feet.replace('.', '').isnumeric() and inch.replace('.', '').isnumeric():
                
            # Converting feet to inches and adding it to the current inches
            output = (float(feet) * 12) + float(inch)
        
    # CASE 3: answer was given ONLY in cm
    #  Convert to inches: approximately 0.39 inches in a centimeter
    elif 'cm' in string:
        centimeters = string.replace('cm', '')
        centimeters = centimeters.strip()
        
        if centimeters.replace('.', '').isnumeric():
            output = float(centimeters) * 0.39
        
    # CASE 4: answer was given ONLY in meters
    #  Convert to inches: approximately 39 inches in a meter
    elif 'm' in string:
        
        meters = string.replace('m', '')
        meters = meters.strip()
        
        if meters.replace('.', '').isnumeric():
            output = float(meters)*39
        
    # CASE 5: answer was given ONLY in feet
    elif 'ft' in string:

        feet = string.replace('ft', '')
        feet = feet.strip()
        
        if feet.replace('.', '').isnumeric():
            output = float(feet)*12
    
    # CASE 6: answer was given ONLY in inches
    elif 'in' in string:
        inches = string.replace('in', '')
        inches = inches.strip()
        
        if inches.replace('.', '').isnumeric():
            output = float(inches)
        
    # CASE 7: answer not covered by existing scenarios / was invalid. 
    #  Return NaN
    if not output:
        output = np.nan

    return output

# Applying the transformation and dropping invalid rows
df['height'] = df['height'].apply(standardize_height)
df = df.dropna(subset=['height'])

# Check the height data, after applying our standardization
df['height'].unique()

# Ensuring that the data types are correct - type cast age to int.
df['age'] = df['age'].astype(np.int64)

# Check out the data, after we've cleaned it!
df.head()

# Check that the dataframe has the right number of rows
#  If this doesn't pass - check your code in the section above.
assert len(df) == 365

Part 2: Exploratory Data Visualization (0.8 points)¶

First, we need to do some exploratory data visualization, to get a feel for the data.

For plotting questions, do not change or move the plt.gcf() lines.

2a) Plot the data¶

Using scatter_matrix, from pandas, plot df. Assign it to a variable called fig.

# YOUR CODE HERE
raise NotImplementedError()

assert np.all(fig)

2b) Plot a bar chart showing the number of students in each major.¶

Hint:

- if using seaborn, you’re looking to make a countplot
- if using pandas, you can use value_counts to get the counts for each major. You can then use the plot method from pandas for plotting (you don’t need matplotlib).

# YOUR CODE HERE
raise NotImplementedError()

f1 = plt.gcf()

assert f1.gca().has_data()

2c) Plot a histogram of the height data for all students who wrote ‘COGSCI’ as their major.¶

# YOUR CODE HERE
raise NotImplementedError()

f2 = plt.gcf()

assert f2.gca().has_data()

2d) Plot a histogram of the height data for all students who wrote ‘COMPSCI’ as their major.¶

# YOUR CODE HERE
raise NotImplementedError()

f3 = plt.gcf()

assert f3.gca().has_data()

Part 3: Exploring The Data (0.8 points)¶

Beyond just plotting the data, we should check some other basic properties of the data. This serves both as a way to get a ‘feel’ for the data, and to look for any quirks or oddities that may indicate issues that need resolving. To do this, let’s explore the data a bit (not limiting ourselves to only the features that we plan to use, since exploring the dataset as a whole can help us find any issues).

Notes:

- Your answers should NOT be pandas objects (Series or DataFrames); extract answers so the variables are ints, floats or strings (as appropriate).
- You must answer these questions programmatically: do not count / check and hard code particular values.

3a) How many different majors are in the dataset?¶

Save this number to a variable n_majors.

# YOUR CODE HERE
raise NotImplementedError()

assert n_majors >0 and n_majors < 25

3b) What is the range (max value - min value) of ages in the dataset?¶

Save this number to a variable r_age.

# YOUR CODE HERE
raise NotImplementedError()

assert r_age > 0 and r_age < 50 

3c) What is the most popular ice cream flavor?¶

Save the ice cream name to the variable f_ice, and the number of people who like it to a variable n_ice.

Hint: you can get these values using the value_counts method.
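As a rough sketch of what value_counts returns and how plain Python values can be pulled out of it, consider the made-up Series below. The name flavors and its contents are hypothetical, not the survey responses.

```python
import pandas as pd

# Hypothetical toy responses, not the survey data
flavors = pd.Series(['vanilla', 'mint', 'vanilla', 'matcha', 'vanilla', 'mint'])

# value_counts returns a Series: index = distinct values, values = their counts,
# sorted from most to least frequent
counts = flavors.value_counts()

top_flavor = counts.index[0]        # label of the most frequent value (a str)
top_count = int(counts.iloc[0])     # its count, converted to a plain int

# Number of distinct values that occur exactly once
n_singletons = int((counts == 1).sum())

print(top_flavor, top_count, n_singletons)
```

Wrapping the results in int() keeps them as plain Python values rather than pandas or numpy objects.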

# YOUR CODE HERE
raise NotImplementedError()

assert n_ice > 0 and n_ice < 50

assert f_ice

3d) How many people have a unique favorite ice cream? (In other words: how many ice cream flavors are only 1 person’s favorite?)¶

Save this number to a variable u_ice.

# YOUR CODE HERE
raise NotImplementedError()

assert u_ice > 0 and u_ice < 100

Part 4: Testing Distributions (0.45 points)¶

Soon, in the data analysis, we will want to run some statistical tests on our data. First, we should check the distributions!

When using methods / statistical tests that make certain assumptions, it’s always best to explicitly check if your data meet those assumptions (otherwise the results may be invalid). Let’s test if our data are in fact normally distributed.

See an example of how to test the distributions of data in the ‘TestingDistributions’ notebook in Tutorials.

For convenience and consistency, we’re providing this code to pull out the required data. Be sure to run the following cell and understand what it’s doing:

h_co = df[df['major'] == 'COGSCI']['height'].values
h_cs = df[df['major'] == 'COMPSCI']['height'].values

4a) Testing Normality¶

For each of h_co, and h_cs, use the normaltest function to test for normality of the distribution.

normaltest returns two values: (1) a test statistic and (2) a p-value

Save these values as st_co, p_co, st_cs, and p_cs, respectively.

# YOUR CODE HERE
raise NotImplementedError()

assert st_co
assert p_co
assert st_cs
assert p_cs

Have a look at the values returned.

Based on these results, and using an alpha significance value of 0.01:

Set boolean values (True, False) indicating whether each distribution can be considered normally distributed. Use True if the test is consistent with a normal distribution (more formally, we fail to reject the null hypothesis) and False if the test suggests the data are not normally distributed (we reject the null hypothesis).
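The decision rule can be sketched on synthetic data. The samples below are randomly generated for illustration (they are not the height data), and the variable names are assumptions made for this example.

```python
import numpy as np
from scipy.stats import normaltest

# Synthetic samples for illustration only
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=66, scale=3, size=500)   # drawn from a normal
skewed_sample = rng.exponential(scale=3, size=500)      # strongly right-skewed

alpha = 0.01

# normaltest returns (test statistic, p-value);
# the null hypothesis is that the sample comes from a normal distribution
stat_n, p_n = normaltest(normal_sample)
stat_s, p_s = normaltest(skewed_sample)

# Fail to reject the null (p > alpha) -> treat as consistent with normality
looks_normal_n = bool(p_n > alpha)
looks_normal_s = bool(p_s > alpha)

print(p_n, looks_normal_n)
print(p_s, looks_normal_s)
```

Wrapping the comparison in bool() gives a plain Python boolean rather than a numpy one, which matters if a later check uses isinstance(..., bool).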

4b) Set boolean values, as specified above.¶

For the h_co data, assign a boolean value to the variable is_n_co

For the h_cs data, assign a boolean value to the variable is_n_cs

# YOUR CODE HERE
raise NotImplementedError()

assert isinstance(is_n_co, bool)
assert isinstance(is_n_cs, bool)

CO data: plot the comparison of the data and a normal distribution (this code provided)

This plots a histogram, with the hypothetical normal distribution (with same mean and variance)

xs = np.arange(h_co.min(), h_co.max(), 0.1)
fit = stats.norm.pdf(xs, np.mean(h_co), np.std(h_co))
# most easily done using matplotlib
plt.plot(xs, fit, label = 'Normal Dist.', lw = 4)
plt.hist(h_co, density = True, label = 'Actual Data');
plt.title('Cognitive Science - Height Data')
plt.legend();

CS data: plot the comparison of the data and a normal distribution (this code provided)

This plots a histogram, with the hypothetical normal distribution (with same mean and variance)

xs = np.arange(h_cs.min(), h_cs.max(), 0.1)
fit = stats.norm.pdf(xs, np.mean(h_cs), np.std(h_cs))
plt.plot(xs, fit, label = 'Normal Dist.', lw = 4)
plt.hist(h_cs, density = True, label = 'Actual Data');
plt.title('Computer Science - Height Data')
plt.legend();

Part 5: Data Analysis (2.4 points)¶

Now let’s analyze the data, to address our research question.

For the purposes of this analysis, let’s assume we need at least 75 students per major to analyze the height data.

This means we are only going to use data from people who wrote ‘COGSCI’ or ‘COMPSCI’ as their major.

5a) Pull out the data we are going to use:¶

Save the height data for all ‘COGSCI’ majors to a variable called h_co

Save the height data for all ‘COMPSCI’ majors to a variable called h_cs

# YOUR CODE HERE
raise NotImplementedError()

assert np.all(h_co)
assert np.all(h_cs)

assert len(h_co) == 178
assert len(h_cs) == 164

5b) What is the average (mean) height for students from each major?¶

Save these values to avg_h_co for COGSCI students, and avg_h_cs for COMPSCI students.

# YOUR CODE HERE
raise NotImplementedError()

assert avg_h_co
assert avg_h_cs

# Print out the average heights - this code provided
print('Average height of cogs majors is \t {:2.2f} inches'.format(avg_h_co))
print('Average height of cs majors is \t\t {:2.2f} inches'.format(avg_h_cs))

Based on the cell above, it looks like there might indeed be a difference in the average height for students in cogs vs cs majors.

Now we want to statistically test this difference. To do so, we will use a t-test.

5c) Compare distributions: t-test¶

Use the ttest_ind function to compare the two height distributions (h_co vs h_cs)

ttest_ind returns a t-statistic, and a p-value. Save these outputs to t_val and p_val respectively.
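For reference, here is a minimal sketch of calling ttest_ind on two synthetic groups. The groups are randomly generated with deliberately different means, and all names here are hypothetical.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic groups for illustration only: means differ by 3, same spread
rng = np.random.default_rng(1)
group_a = rng.normal(loc=65, scale=3, size=200)
group_b = rng.normal(loc=68, scale=3, size=200)

# Independent two-sample t-test; unpack the statistic and the p-value
t_stat, p_value = ttest_ind(group_a, group_b)

print(t_stat, p_value)
```

The sign of the t-statistic reflects the order of the arguments: it is negative here because the first group has the smaller mean.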

# YOUR CODE HERE
raise NotImplementedError()

assert t_val
assert p_val

# Check if statistical test passes significance, using an alpha value of 0.01. This code provided.
if p_val < 0.01:
    print('Data Science accomplished, there is a significant difference!')
else:
    print('There is NOT a significant difference!')
    
# Editorial note:
#  Chasing significant p-values as the goal itself is not actually a good way to do data (or any) science :)

Note: this test should pass significance. If it doesn’t, double check your code up until this point.

So - we’ve reached a conclusion! We’re done right!?

Nope. We have a first pass analysis, and an interim conclusion that happens to follow our hypothesis.

Now let’s try to break it.

Let’s explore some more¶

You should always interrogate your findings, however they come out. What could be some alternate explanations, that would change our interpretations of the current analysis?

In this case, we should be worried about confounding variables. We want to be able to say whether height relates to major specifically, but it could be the case that some other variable, that happens to differ between majors, better explains the differences in height.

In this case, we also have data on gender. Let’s check if differences in the gender ratio of the two majors can explain the difference in height.

5d) Digging Deeper¶

Using value_counts from pandas, extract the number of ‘male’ and ‘female’, separately for cogs and cs students.

To do so:

- Select each major from the df separately, extract the gender column, and use the value_counts method.
- Save the counts for each gender for ‘COGSCI’ majors to a variable called g_co
- Save the counts for each gender for ‘COMPSCI’ majors to a variable called g_cs
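The select-then-count pattern described above can be sketched on a toy dataframe. The dataframe demo and its 'group' and 'color' columns are hypothetical stand-ins, chosen only to show the shape of the operation.

```python
import pandas as pd

# Toy data (hypothetical column names and values)
demo = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'color': ['red', 'blue', 'red', 'blue', 'blue', 'blue', 'red'],
})

# Select the rows for one group, take one column, and count each value
counts_a = demo[demo['group'] == 'a']['color'].value_counts()

print(counts_a)
```

value_counts returns a Series indexed by the distinct values, sorted from most to least frequent, so the order of the entries depends on the data.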

# YOUR CODE HERE
raise NotImplementedError()

assert np.all(g_co)
assert np.all(g_cs)

assert g_co.iloc[0] == 91
assert g_cs.iloc[1] == 38
assert g_cs.iloc[2] == 1

5e) What is the ratio of women in each major?¶

By ratio, we mean the proportion of students who are female. This will be a value between 0.0 and 1.0, calculated as #F / (#F + #M + #nonbinary_or_other), done separately for each major.

You can use the g_co and g_cs variables to calculate these.

Save the ratio of women in COGSCI to a variable r_co.

Save the ratio of women in COMPSCI to a variable r_cs.

Note: keep these numbers as ratios (they should be decimal numbers, less than 1).
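To make the arithmetic concrete, here is a small sketch of turning a Series of counts into a proportion. The Series demo_counts and its labels are invented for the example.

```python
import pandas as pd

# Hypothetical counts, in the same form that value_counts returns
demo_counts = pd.Series({'blue': 6, 'red': 3, 'green': 1})

# Proportion of one category: its count divided by the total of all counts
prop_red = demo_counts['red'] / demo_counts.sum()

print(prop_red)
```

Looking the count up by its label, rather than by position, keeps the calculation correct even if the order of the counts changes.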

g_cs

# YOUR CODE HERE
raise NotImplementedError()

assert r_co
assert r_cs

Make sure you print out and check the values of these ratios. They seem pretty different.

We can actually ask, using a chi-squared test, whether the gender ratio differs significantly between the majors.

Code to do this is provided below.

Run a chi-squared test of the difference of ratios of categorical data between groups:

chisq, p_val_chi = stats.chisquare(np.array([g_co.values, g_cs.values]), axis=None)

if p_val_chi < 0.01:
    print('There is a significant difference in ratios!')
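A note on the test itself: when scipy's chisquare is given no expected frequencies, it compares the observed counts against equal expected counts in every category. Another common way to ask whether two groups differ in their category proportions is a test of independence on a contingency table. The sketch below uses scipy.stats.chi2_contingency on made-up counts. It is shown for illustration only and is not the method this assignment uses.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: one row per group, one column per category
demo_table = np.array([[90, 80],
                       [125, 40]])

# Returns the statistic, the p-value, the degrees of freedom, and the
# expected counts under independence. For a 2x2 table, scipy applies
# Yates' continuity correction by default.
chi2_demo, p_chi_demo, dof_demo, expected_demo = chi2_contingency(demo_table)

print('chi2 = {:.2f}, p = {:.3g}, dof = {}'.format(chi2_demo, p_chi_demo, dof_demo))
```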

5f) Subsetting data¶

Create a new dataframe, called df2, which only includes data from ‘COGSCI’ and ‘COMPSCI’ majors.

Hint: you can do this using the or operator ‘|’, with loc.
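As a sketch of the hint, here is how two conditions can be combined with ‘|’ inside loc on a toy dataframe. The dataframe demo and its columns are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical column names and values)
demo = pd.DataFrame({
    'group': ['a', 'b', 'c', 'a', 'c'],
    'value': [1, 2, 3, 4, 5],
})

# Each comparison needs its own parentheses, because | binds more
# tightly than == in Python
mask = (demo['group'] == 'a') | (demo['group'] == 'b')
demo_subset = demo.loc[mask]

print(demo_subset)
```

The mask is a boolean Series with one entry per row, and loc keeps only the rows where it is True.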

# YOUR CODE HERE
raise NotImplementedError()

assert isinstance(df2, pd.DataFrame)
assert df2.shape == (342, 10)
assert set(df2['major']) == set(['COGSCI', 'COMPSCI'])

5g) Pivot Tables¶

Another way to look at these kinds of comparisons is pivot tables.

Use the pandas pivot_table method to create a pivot table, assign it to a variable pv.

Set the values as ‘height’, and the indices as ‘gender’ and ‘major’ in the pivot table.

Make sure you do this using df2.
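For reference, here is a minimal sketch of a pivot table with two index columns, built on toy data. The dataframe demo and the 'team', 'shift' and 'score' columns are invented for the example.

```python
import pandas as pd

# Toy data (hypothetical column names and values)
demo = pd.DataFrame({
    'team':  ['x', 'x', 'y', 'y', 'x', 'y'],
    'shift': ['day', 'night', 'day', 'night', 'day', 'night'],
    'score': [10.0, 20.0, 30.0, 40.0, 14.0, 44.0],
})

# With no aggfunc given, pivot_table averages the values in each cell.
# Passing a list as the index produces a MultiIndex on the rows.
demo_pv = pd.pivot_table(demo, values='score', index=['shift', 'team'])

print(demo_pv)
```

Each row of the result is the mean score for one combination of shift and team.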

# YOUR CODE HERE
raise NotImplementedError()

pv.index.levels[0]

assert np.all(pv)
assert isinstance(pv.index, pd.MultiIndex)

Print out the pivot table you just created.

Compare the average height values, split up by major and gender.

Does it look like there are differences in heights by major, when split up by gender?

pv

Let’s recap where we are:

Our initial analysis supported the hypothesis that there is a significant difference between the heights of people in different majors.
However, further analyses suggested there may be a confounding variable, as there is also a significantly different gender balance between majors.

Checking the average height, per major, split up by gender, suggests there may not be a difference between major, other than what is explained by gender.

Now we want to statistically ask this question: is there still a difference in height between majors, when controlling for differences in gender?

Linear Models¶

For the following question you will need to make some linear models, using Ordinary Least Squares (OLS).

There is more than one way to do this in Python. For the purposes of this assignment, you must use the method that is outlined in the ‘LinearModels’ Tutorial, using patsy, and statsmodels.

That is:

Create design matrices with patsy.dmatrices
Initialize an OLS model with sm.OLS
Fit the OLS model
Check the summary for results.

5h) Linear model¶

Create a linear model to predict height from major (using df2 as data).

Use patsy.dmatrices to create the design matrices, calling the outputs outcome_1, predictors_1.

Create an OLS model (sm.OLS) using outcome_1 and predictors_1. Call it mod_1.

Fit the model, assigning it to res_1.

# YOUR CODE HERE
raise NotImplementedError()

assert isinstance(outcome_1, patsy.design_info.DesignMatrix)
assert isinstance(predictors_1, patsy.design_info.DesignMatrix)
assert isinstance(mod_1, sm.regression.linear_model.OLS)
assert isinstance(res_1, sm.regression.linear_model.RegressionResultsWrapper)

# Print out the summary results of the model fitting
print(res_1.summary())

5i) Assess significance: `mod_1`¶

Based on the model you ran above (using alpha value of 0.01), does major significantly predict height?

Set your answer as a boolean (True / False) to a variable called lm_1.

# YOUR CODE HERE
raise NotImplementedError()

assert isinstance(lm_1, bool)

5j) Multivariate regression¶

Create a linear model to predict height from both major and gender (using df2 as data).

Use patsy.dmatrices to create the design matrices, calling the outputs outcome_2, predictors_2

Create an OLS model (sm.OLS) using outcome_2 and predictors_2. Call it mod_2.

Fit the model, assigning it to res_2.

# YOUR CODE HERE
raise NotImplementedError()

assert isinstance(outcome_2, patsy.design_info.DesignMatrix)
assert isinstance(predictors_2, patsy.design_info.DesignMatrix)
assert isinstance(mod_2, sm.regression.linear_model.OLS)
assert isinstance(res_2, sm.regression.linear_model.RegressionResultsWrapper)

# Print out the results 
print(res_2.summary())

5k) Assess significance: `mod_2`¶

Based on the model you ran above (using alpha value of 0.01), does major significantly predict height?

Set your answer as a boolean (True / False) to a variable called lm_2

# YOUR CODE HERE
raise NotImplementedError()

assert isinstance(lm_2, bool)

Part 6: Discussion & Conclusions (0.1 points)¶

6a) Conclusion¶

Set a boolean variable, called ans, as True or False as the answer to the following statement:

We have evidence supporting our research hypothesis:

People in different majors have systematically different heights (and this difference can be tied to their major).

# YOUR CODE HERE
raise NotImplementedError()

assert isinstance(ans, bool)

6b) Summary¶

Write a short response (1-2 sentences) summarizing the results.

Did we support our hypothesis? Why or why not? What turned out to be the finding(s)?

YOUR ANSWER HERE

The End!¶

Good work! Have a look back over your answers, and also make sure to Restart & Run All from the kernel menu to double check that everything is working properly. You can also use the ‘Validate’ button above, which runs your notebook from top to bottom and checks to ensure all assert statements pass silently. When you are ready, submit on datahub!
