Data Exploration

This is a demo assignment that is openly available for the Data Science in Practice Course.

If you are in the COGS108 course at UC San Diego, this is NOT a valid version of the assignment for the course.

Assignment reminders

This assignment has more questions than either A1 or A2! Get started as early as possible.

  • This assignment has hidden tests: tests that are not visible here, but that will be run on your submitted assignment. This means that passing all the tests you can see in this notebook does not guarantee you have the right answer!

  • Each coding question in this assignment only requires a small amount of code, about 1-3 lines.

    • If you find yourself writing much more than that, you might want to reconsider your approach.

    • Use the Tutorials notebooks as reference, as they often contain examples similar to those used in the assignment.

# Imports
# 
# Note: these are all the imports you need for this assignment!
# Do not import any other functions / packages

# Display plots directly in the notebook instead of in a new window
%matplotlib inline

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Configure libraries
# The seaborn library makes plots look nicer
sns.set()
sns.set_context('talk')

# Don't display too many rows/cols of DataFrames
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8

# Round decimals when displaying DataFrames
pd.set_option('display.precision', 2)

Downloading Data

If you download this notebook to run locally, you will also need some data files.

Running the next cell will download the required files for this assignment.

You can also view and download these files from https://github.com/DataScienceInPractice/Data.

from os.path import join as pjoin
from urllib.request import urlretrieve

data_url = 'https://raw.githubusercontent.com/DataScienceInPractice/Data/master/'

assignment = 'A3'
data_files = ['age_steps.csv', 'incomes.json']

for data_file in data_files:
    full_path = pjoin(data_url, assignment, data_file)
    urlretrieve(full_path, filename=data_file)

Part 1 - Data Wrangling (1.45 points)

For this assignment, you are provided with two data files that contain information on a sample of people. The two files and their columns are:

  • age_steps.csv: Contains one row for each person.

    • id: Unique identifier for the person.

    • age: Age of the person.

    • steps: Number of steps the person took on average in January 2018.

  • incomes.json: Contains one record for each person.

    • id: Unique identifier for the person. Two records with the same ID between age_steps.csv and incomes.json correspond to the same person.

    • last_name: Last name of the person.

    • first_name: First name of the person.

    • income: Income of the person in 2018.

For Parts 1 and 2 of the assignment, we recommend looking at the official 10 Minutes to pandas guide: http://pandas.pydata.org/pandas-docs/stable/10min.html
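If you have not used the pandas readers before, here is a minimal sketch of the general pattern on inline toy data. io.StringIO stands in for real files purely to keep the sketch self-contained; in your actual answers you will read the downloaded files directly and should stick to the imports provided above.

# Minimal sketch of the two pandas readers involved, using inline toy data.
# io.StringIO stands in for real files; the column names here are illustrative.
import io
import pandas as pd

csv_text = "id,age,steps\n1,34,8000\n2,29,-1\n"
json_text = '[{"id": 1, "income": 40000.0}, {"id": 2, "income": 0.0}]'

toy_steps = pd.read_csv(io.StringIO(csv_text))     # comma-separated text -> DataFrame
toy_income = pd.read_json(io.StringIO(json_text))  # JSON records -> DataFrame

print(toy_steps.shape, toy_income.shape)           # (2, 3) (2, 2)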

Question 1a: Load the age_steps.csv file into a pandas DataFrame named df_steps. It should have 11257 rows and 3 columns.

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(df_steps, pd.DataFrame)
assert df_steps.shape == (11257, 3)

Question 1b: Load the incomes.json file into a pandas DataFrame called df_income. The DataFrame should have 13332 rows and 4 columns.

Hint: Find a pandas function similar to read_csv for JSON files.

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(df_income, pd.DataFrame)
assert df_income.shape == (13332, 4)

Question 1c: Drop the first_name and last_name columns from the df_income DataFrame. The resulting DataFrame should only have two columns.

# YOUR CODE HERE
raise NotImplementedError()
assert 'first_name' not in df_income.columns
assert 'last_name' not in df_income.columns

Question 1d: Merge the df_steps and df_income DataFrames into a single combined DataFrame called df. Use the id column to match rows together.

The final DataFrame should have 10,135 rows and 4 columns: id, income, age, and steps.

Call an appropriate pandas method to perform this operation; don’t write a for loop. (In general, looping over a DataFrame row by row is slow and error-prone compared to using built-in pandas methods.)

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(df, pd.DataFrame)
assert set(df.columns) == set(['id', 'income', 'age', 'steps'])
assert df.shape == (10135, 4)

Question 1e: Reorder the columns of df so that they appear in the order: id, age, steps, then income.

(Note: If your DataFrame is already in this order, just put df in this cell.)

# YOUR CODE HERE
raise NotImplementedError()
assert list(df.columns) == ['id', 'age', 'steps', 'income']

Question 1f: You may have noticed something strange: the merged df DataFrame has fewer rows than either of df_steps and df_income. Why did this happen? (If you’re unsure, check out the documentation for the pandas method you used to merge these two datasets. Take note of the default values set for this method’s parameters.)

Please select the one correct explanation below and save your answer in the variable q1f_answer. For example, if you believe choice number 4 explains why df has fewer rows, set q1f_answer = 4.

  1. Some steps were recorded inaccurately in df_steps.

  2. Some incomes were recorded inaccurately in df_income.

  3. There are fewer rows in df_steps than in df_income.

  4. There are fewer columns in df_steps than in df_income.

  5. Some id values in either df_steps or df_income were missing from the other DataFrame.

  6. Some id values were repeated in df_steps and in df_income.

You may use the cell below to run whatever code you want to check the statements above. Just make sure to set q1f_answer once you’ve selected a choice.
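To see why the row count can shrink after merging, here is a toy illustration (separate from the assignment data) of how pd.merge behaves with its default settings:

# Toy illustration (not the assignment data) of pd.merge with default settings.
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'steps': [8000, 9000, 7000]})
right = pd.DataFrame({'id': [2, 3, 4], 'income': [30000, 45000, 52000]})

merged = pd.merge(left, right, on='id')  # how defaults to 'inner'
print(merged)  # only ids 2 and 3 appear in both frames, so only they survive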

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(q1f_answer, int)

Part 2 - Data Cleaning (0.9 points)

Before proceeding with analysis, we need to check our data for missing values.

There are many reasons data might contain missing values. Here are two common ones:

  • Nonresponse. For example, people might have left a field blank when responding to a survey, or left the entire survey blank.

  • Lost in entry. Data might have been lost after initial recording. For example, a disk cleanup might accidentally wipe older entries of a database.

In general, it is not appropriate to simply drop missing values from the dataset, or to pretend that, had they been filled in, they would not have changed your results. In 2016, many polls mistakenly predicted that Hillary Clinton would easily win the presidential election in part because they made this kind of error.

In this particular dataset, however, the missing values occur completely at random. This property allows us to drop them without significantly affecting our conclusions.
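For reference, here is a minimal sketch, on a toy DataFrame rather than the assignment data, of how pandas marks missing values and how they can be counted or dropped:

# Minimal sketch on toy data: NaN marks a missing value in pandas.
import numpy as np
import pandas as pd

toy = pd.DataFrame({'id': [1, 2, 3], 'income': [40000.0, np.nan, 12000.0]})

print(toy['income'].isna().sum())  # 1 missing income
print(toy.dropna().shape)          # (2, 2): rows containing any NaN are removed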

Question 2a: How many values are missing in the income column of df? Save this number into a variable called n_nan.

# YOUR CODE HERE
raise NotImplementedError()
assert(n_nan)

Question 2b: Remove all rows from df that have missing values.

# Remove all rows from df that have missing data. In other words, remove all rows with NaN values.

# YOUR CODE HERE
raise NotImplementedError()
assert sum(np.isnan(df['income'])) == 0
assert df.shape == (9684, 4)

Question 2c: Note that we can now compute the average income. If your df variable contains the right values, df['income'].mean() should produce the value 25508.84.

Suppose that we didn’t drop the missing incomes. What will running df['income'].mean() output? Use the variable q2c_answer to record which of the below statements you think is true. As usual, you can use the cell below to run any code you’d like in order to help you answer this question as long as you set q2c_answer once you’ve finished.

  1. No change; df['income'].mean() will ignore the missing values and output 25508.84.

  2. df['income'].mean() will produce an error.

  3. df['income'].mean() will output 0.

  4. df['income'].mean() will output nan (not a number).

  5. df['income'].mean() will fill in the missing values with the average income, then compute the average.

  6. df['income'].mean() will fill in the missing values with 0, then compute the average.

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(q2c_answer, int)

Question 2d: Suppose that missing incomes did not occur at random, and that individuals with incomes below $10000 a year are less likely to report their incomes. If so, which of the following statements is true? Record your choice in the variable q2d_answer.

  1. df['income'].mean() will likely output a value that is the same as the population’s average income.

  2. df['income'].mean() will likely output a value that is smaller than the population’s average income.

  3. df['income'].mean() will likely output a value that is larger than the population’s average income.

  4. df['income'].mean() will raise an error.

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(q2d_answer, int)

Part 3: Data Visualization (0.9 points)

Although pandas only displays a few rows of a DataFrame at a time, we can use data visualizations to quickly determine the distributions of values within our data.

pandas comes with some plotting capabilities built in; however, we’ve discussed using seaborn for visualization in class. You’re free to use either in this assignment. (To learn more about pandas’ plotting capabilities, we suggest taking a look at https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html for examples.)

We want to note that most plotting libraries in Python are built on top of a library called matplotlib, including the plotting methods used in both pandas and seaborn. Although you won’t need to know matplotlib for this assignment, you will likely have to use it in future assignments and your final project, so keep the library in mind.

Notes:

  • Everywhere that we ask you to create a plot, make sure to leave the plt.gcf() line at the end of the cell; otherwise, you will lose points in the autograder. (The general pattern is sketched after these notes.)

  • For all your histograms, use 25 bins.
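For reference, the general pattern looks something like the sketch below (toy random data, not the assignment data; whether you draw the histogram with pandas, seaborn, or matplotlib is up to you):

# Minimal sketch of the plotting pattern used in this part (toy random data).
import numpy as np
import matplotlib.pyplot as plt

toy_values = np.random.default_rng(0).normal(loc=40, scale=10, size=500)

plt.hist(toy_values, bins=25)   # histogram with 25 bins
plt.xlabel('value')
plt.ylabel('count')

fig = plt.gcf()                 # grab the current figure, as the autograder expects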

Question 3a: Plot a histogram of the age column with 25 bins.

# YOUR CODE HERE
raise NotImplementedError()

f1 = plt.gcf()
assert f1.gca().has_data()

# If you fail this test, you didn't use 25 bins for your histogram.
assert len(f1.gca().patches) == 25

Question 3b: By looking at the plot, what is the approximate average (mean) age of the individuals in df? Round the value to the nearest year and store the answer in the variable average_age.

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(average_age, int)

Question 3c: Plot a histogram of the steps column with 25 bins.

# YOUR CODE HERE
raise NotImplementedError()

f2 = plt.gcf()
assert f2.gca().has_data()

Question 3d: By looking at the plot you just generated, approximately how many people in df do the data suggest took no steps? Store the letter that best corresponds to your answer in the variable no_steps. (For example, if you thought the answer was option A, your response would be no_steps = 'A'.)

  • A) 0

  • B) 100

  • C) 1000

  • D) 10000

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(no_steps, str)

Question 3e: Plot a histogram of the income column with 25 bins.

# YOUR CODE HERE
raise NotImplementedError()

f3 = plt.gcf()
assert f3.gca().has_data()

Question 3f: Which of the following statements is true about the income of the individuals included in df? Store the letter corresponding to your answer in the variable income_plot. (Note: Be sure to consider the bin size on the x-axis when interpreting the plot.)

  • A) Most people in df had no income in 2018

  • B) Most people in df made a six figure salary in 2018

  • C) Most people in df made a salary of more than 200000 USD in 2018

  • D) A few people in df made a lot more money than the typical person in df

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(income_plot, str)

Question 3g: Plot the data using the pandas scatter_matrix function. Only plot the age, steps, and income columns.

Note: Including the parameter figsize=(8, 6) will increase the size of the plot for easier viewing.

# YOUR CODE HERE
raise NotImplementedError()

f4 = plt.gcf()
assert f4.gca().has_data()

Question 3h: By looking at the plot matrix you just generated, determine the approximate age of the wealthiest person in df. Store this value (to the nearest year) in the variable wealthy_age.

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(wealthy_age, int)

Part 4: Data Pre-Processing (1.45 points)

In the above sections, we performed basic data cleaning and visualization.

In practice, these two components of an analysis pipeline are often combined into an iterative approach. We go back and forth between looking at the data, checking for issues, and cleaning the data.

Let’s continue with an iterative procedure of data cleaning and visualization, addressing some issues that we notice after visualizing the data.

Question 4a: In the visualization of the steps column, we notice a large number of -1 values. Count how many rows in df have -1 in their steps column. Store the result in the variable n_neg.

# YOUR CODE HERE
raise NotImplementedError()
assert(n_neg)
assert n_neg > 100

Question 4b: Since it’s impossible to walk a negative number of steps, we will treat the negative values as missing data. Drop the rows with negative steps from df. Your answer should modify df itself.

# YOUR CODE HERE
raise NotImplementedError()
assert sum(df['steps'] == -1) == 0

You may have noticed that the values in income are not normally distributed, which can hurt prediction ability in some scenarios. To address this, we will perform a log transformation on the income values.

First, though, we will have to deal with any income values that are 0. Note that these are not impossible values; they may, for example, represent people who are unemployed. So we shouldn’t remove these individuals. However, we can’t (mathematically) take the log of zero, so we’ll replace the zeroes with ones. This allows the log transformation while retaining the fact that these individuals’ incomes were lower than everyone else’s in the dataset.
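Here is a minimal sketch of the idea on a toy Series (not the assignment data):

# Minimal sketch on toy data: replace zeros with ones, then log-transform.
import numpy as np
import pandas as pd

toy_income = pd.Series([0.0, 1.0, 10.0, 100000.0])

shifted = toy_income.replace(0, 1)  # 0 -> 1 so the log is defined
logged = np.log10(shifted)          # base-10 log: 100000 -> 5.0
print(logged.tolist())              # [0.0, 0.0, 1.0, 5.0]

# The base matters: np.log is the natural log (np.log(100) is about 4.61),
# while np.log10 is base 10 (np.log10(100) == 2.0).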

Question 4c: Add a new column to df called income10. It should contain the same values as income with all 0 values replaced with 1.

Hint: pandas DataFrames have a replace method.

# YOUR CODE HERE
raise NotImplementedError()
assert list(df.columns) == ['id', 'age', 'steps', 'income', 'income10']
assert not any(df['income10'] == 0)

Question 4d: Now, transform the income10 column using a log-base-10 transform. That is, replace each value in income10 with the \( \log_{10} \) of that value.

Note: Be sure you understand the difference between np.log and np.log10.

# YOUR CODE HERE
raise NotImplementedError()
assert np.isclose(min(df['income10']), 0.0, 0.001)
assert np.isclose(max(df['income10']), 5.867, 0.001)

Question 4e: Now, make a histogram for income10 data after the data transformation. Again, use 25 bins.

# YOUR CODE HERE
raise NotImplementedError()

f4 = plt.gcf()
assert f4.gca().has_data()

# If you fail this test, you didn't use 25 bins for your histogram.
assert len(f4.gca().patches) == 25

Question 4f: From this plot, determine approximately how many individuals in df had a reported income of 0. Store this value in the variable income_zero.

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(income_zero, int)

Question 4g: We might also have certain regulations or restrictions that we need to follow about the data. Here, we will only analyze adults. Remove all rows from df where age is less than 18.

# YOUR CODE HERE
raise NotImplementedError()
assert min(df['age']) >= 18
assert len(df) == 8590

Part 5 - Basic Analyses (1.1 points)

Now that we have wrangled and cleaned our data, we can start doing some simple analyses.

Here we will explore some basic descriptive summaries of our data, look into the inter-relations (correlations) between variables, and ask some simple questions about potentially interesting subsets of our data.
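For orientation, here is what these two operations look like on a toy DataFrame (illustrative column names, not the assignment data):

# Minimal sketch on toy data: descriptive summary and pairwise correlations.
import pandas as pd

toy = pd.DataFrame({'age': [20, 35, 50, 65],
                    'steps': [9000, 8000, 6000, 4000],
                    'income': [15000, 40000, 60000, 30000]})

print(toy.describe())  # count, mean, std, min, quartiles, and max per column
print(toy.corr())      # correlation matrix over all pairs of numeric columns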

Question 5a: Use the describe pandas method to check a descriptive summary of the data. Save the DataFrame generated by describe to a new variable called desc.

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(desc, pd.DataFrame)

Question 5b: Calculate the pairwise correlations between all variables.

Note: do this with a pandas method. Keep all columns (including ID). Assign the result (which should be a DataFrame) to a variable called corrs.

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(corrs, pd.DataFrame)
assert corrs.shape == (5, 5)

Question 5c: Answer the following questions by setting your answer variables to either 'age', 'steps', or 'income'.

  • Which variable is most correlated with age (aside from age itself)? Record your answer in a variable called age_corr.

  • Which variable is most correlated with income (aside from income and income10)? Record your answer in a variable called inc_corr.

# YOUR CODE HERE
raise NotImplementedError()
assert age_corr
assert age_corr in {'steps', 'age', 'income'}
assert inc_corr
assert inc_corr in {'steps', 'age', 'income'}

Question 5d: How many steps would you have to walk to be in the top 10% of walkers? Save your answer as a variable called top_walker.

Hint: check out the quantile method.
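As a point of reference, quantile works like this on a toy Series (not the assignment data):

# Minimal sketch on toy data: the 0.9 quantile of a Series.
import pandas as pd

toy_steps = pd.Series(range(1, 101))  # the integers 1 through 100
print(toy_steps.quantile(0.9))        # 90.1: roughly 90% of values fall at or below this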

# YOUR CODE HERE
raise NotImplementedError()
assert top_walker
assert isinstance(top_walker, float)

Question 5e: What is the average income for people over the age of 65? Save your response in a variable called old_income.

Note: We’re asking for the actual income, not the log-10 of income.

# YOUR CODE HERE
raise NotImplementedError()
assert old_income
assert old_income > 5

Part 6 - Predictions (1.7 points)

Many analyses ask predictive questions, in which the data we currently have are used to build a model that predicts something about the future (data we do not yet have).

Here we will explore some basic predictions, looking into whether we might be able to predict income from our other variables.

Note: You will use the np.polyfit function from NumPy as we did in Tutorials/02-DataAnalysis.
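If np.polyfit is unfamiliar, here is a minimal sketch on synthetic data; the variable names are illustrative, not the assignment’s:

# Minimal sketch on synthetic data: fit y ~ a*x + b with np.polyfit, then predict.
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(0, 50, dtype=float)
y = 3.0 * x + 7.0 + rng.normal(scale=2.0, size=x.size)  # a noisy straight line

a, b = np.polyfit(x, y, 1)                         # degree-1 fit: slope and intercept
pred_at_20 = a * 20 + b                            # prediction for a single x value
pred_curve = np.polyval([a, b], np.arange(0, 50))  # predictions over a whole range

print(round(a, 2), round(b, 2), round(pred_at_20, 2))

Note that np.polyval evaluates the fitted line over an array of inputs, which is handy when you need predictions across a range of values later in this part.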

Question 6a: Use polyfit to fit a 1-degree linear model, predicting income from age. Call the output parameters a1 and b1.

# YOUR CODE HERE
raise NotImplementedError()
assert(a1)
assert(b1)

# If you fail these tests, your parameter values are quite far from what they
# should be.
assert abs(a1) > 100
assert abs(b1) > 100

Question 6b: Use the model parameters from 6a to predict the income of a 75-year-old. Call your prediction pred_75.

# YOUR CODE HERE
raise NotImplementedError()
assert(pred_75)

Question 6c: Use polyfit once more to fit a 1-degree linear model, predicting income from steps. Call the output parameters a2 and b2.

# YOUR CODE HERE
raise NotImplementedError()
assert(a2)
assert(b2)

# If you fail these tests, your parameter values are quite far from what they
# should be.
assert abs(a2) < 100
assert abs(b2) > 100

Question 6d: Predict the income of someone who took 10,000 steps. Call your prediction pred_10k.

# YOUR CODE HERE
raise NotImplementedError()
assert(pred_10k)

Question 6e: To better understand a model, we can visualize its predictions. Use your first model to predict income from each integer age between 18 and 80, inclusive. Your predictions should be stored in a numpy array of floats called pred_age.

Reminder: numpy has the arange function

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(pred_age, np.ndarray)
assert len(pred_age) == 63

# Your array should contain decimals, not integers
assert isinstance(pred_age[0], float)

Question 6f: Make a scatter plot with income on the y-axis and age on the x-axis. Then, draw your predictions as a red line on top of the scatter plot. The data in your plot and your line should look like this (although the exact colors/dimensions/appearance may differ a bit):

# Hint: for the x-values of your prediction line, use the same ages as in
# Question 6e, i.e. np.arange(18, 81) (note that np.arange excludes the stop value).
# YOUR CODE HERE
raise NotImplementedError()

f5 = plt.gcf()
assert f5.gca().has_data()

Question 6g: Now, let’s do the same for the model that uses steps.

Use your second model to predict income from each multiple of 100 steps between 2000 and 13000, inclusive. Your predictions should be stored in a numpy array called pred_steps.

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(pred_steps, np.ndarray)
assert len(pred_steps) == 111
# Your array should contain decimals, not integers
assert isinstance(pred_steps[0], float)

Question 6h: Make a scatter plot with income on the y-axis and steps on the x-axis. Then, draw your predictions as a red line on top of the scatter plot. Your plot should look like this:

# YOUR CODE HERE
raise NotImplementedError()

f6 = plt.gcf()
assert f6.gca().has_data()

Question 6i: Notice that both these models perform poorly on this data. For this particular dataset, neither age nor steps seem to have a linear relationship with income. Nonetheless, fitting a linear model is simple and gives us a baseline to compare with more complex models in the future.

Suppose that you were forced to pick one of the above models. Between 'age' and 'steps', which predictor has more predictive power? Save your response in the variable model_choice.

# YOUR CODE HERE
raise NotImplementedError()
assert model_choice
assert model_choice in {'age', 'steps'}

Finished!

Have a look back over your answers, and also make sure to Restart & Run All from the kernel menu to double check that everything is working properly. You can also use the ‘Validate’ button above, which runs your notebook from top to bottom and checks to ensure all assert statements pass silently. When you are ready, submit on datahub!