This is a demo assignment that is openly available for the Data Science in Practice Course.

If you are in the COGS108 course at UC San Diego, this is NOT a valid version of the assignment for the course.

Part 1: Setup

Throughout this assignment we’ll review and get practice with the basics of using the pandas package for working with tabular data.

You will absolutely use pandas throughout this course, and if you continue working with data in Python, you'll keep using this package. This means that while you could Google every pandas function as you go, it's best to commit the basics to memory. So, do your best to understand each operation you carry out here, as you'll almost certainly see it again, and try to memorize the most commonly used functions (those presented here). It will save future you a ton of time. The more comfortable you are with pandas, the easier the rest of the assignments and your final project will be for you!

Finally, there are a ton of great pandas resources and documentation out there, including 10 minutes to pandas. Feel free to use them! Many of the examples in this assignment were inspired by pandas tips and tricks; feel free to check out the additional tips and tricks there, many of which are not covered here.

Q1: import (0.25 points)

Import the pandas package using its typical shortname, pd.

Import the numpy package using its typical shortname, np.

raise NotImplementedError()
assert pd
assert np

Part 2: Data Import, DataFrames, & Series

The pandas package has a number of helpful import functions (read_csv(), read_json(), read_html(), etc.)

We’ll use a number of these throughout the course; however, for this assignment we’ll use a CSV file from FiveThirtyEight. The summarized data in the file we’ll use here were used in the article The Ultimate Halloween Candy Power Ranking. To collect these data, candies were pitted against one another with participants having to choose which they liked better.

The most important functional unit of the pandas package is the DataFrame. DataFrames store information in rows and columns.

Secondarily, Series are one-dimensional labeled arrays. They can hold data of any type.

You can think of a DataFrame as a bunch of related Series stored together, such that the information in each row comes from a single observation, and each column (Series) stores information about a single variable measured across the observations.

To access a column (Series) from a DataFrame, the following syntax is used:

df['series']

Here df is the object storing the pandas DataFrame and 'series' is the column name in df that you want to access.

More information about DataFrames and Series is available here.
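To make the DataFrame/Series relationship concrete, here is a minimal sketch using made-up toy data (not the course dataset):

```python
import pandas as pd

# a tiny DataFrame: each column is a Series, each row an observation
df_demo = pd.DataFrame({
    'candy': ['Snickers', 'Skittles', 'Twix'],
    'chocolate': [1, 0, 1],
})

col = df_demo['chocolate']   # access a single column (a Series)
print(type(col))             # <class 'pandas.core.series.Series'>
print(col.tolist())          # [1, 0, 1]
```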

Q2: Read the data into Python (0.25 points)

To use the data, you’ll have to read it in as a DataFrame. Use the appropriate pandas import function to read in the file from the FiveThirtyEight data repository.

Note that URLs can be read in directly using pandas. The URL you’ll want to use to read the file in is: ‘’.

Store this in the variable df.

raise NotImplementedError()
assert isinstance(df, pd.DataFrame)
assert df.shape == (85, 13)

Q3: Data Summary (0.75 points)

After reading in a dataset, it’s often helpful to get a quick sense of what information is stored in the dataset. Here you’ll carry out a number of operations that you’ll carry out on almost every dataset you read into Python.

Q3a: shape

One way to do that is to determine the shape of the dataset. This reports the number of rows and columns in the dataset. Store the shape of the candy dataset in the variable candy_shape.

raise NotImplementedError()
assert candy_shape == (85, 13)

Note that pandas DataFrames also print out well when you access them. This allows you to get a sense of the information stored in your dataset overall.

For example, if you run the cell below, you’ll see what information is stored in this dataset.

# take a look at the information stored in df
df

A full description of these data can be found here. From that link, we’ll include a description of each variable (column) here:




chocolate: Does it contain chocolate?

fruity: Is it fruit flavored?

caramel: Is there caramel in the candy?

peanutyalmondy: Does it contain peanuts, peanut butter or almonds?

nougat: Does it contain nougat?

crispedricewafer: Does it contain crisped rice, wafers, or a cookie component?

hard: Is it a hard candy?

bar: Is it a candy bar?

pluribus: Is it one of many candies in a bag or box?

sugarpercent: The percentile of sugar it falls under within the data set.

pricepercent: The unit price percentile compared to the rest of the set.

winpercent: The overall win percentage according to 269,000 matchups.

Note that many of the variables store binary information, where 0 means False and 1 means True. For example, for a Snickers bar, chocolate would have the value 1 (because it is chocolate), while fruity for the same candy would store 0 (because Snickers bars are not fruity).

Q3b: describe

pandas also has a very helpful describe function which will provide some helpful summary statistics about your DataFrame.

Use the describe function to calculate and display these summary statistics (You do not have to store the output):

raise NotImplementedError()

Using the output you just generated, answer the following questions, storing the answer in the variable at the front of each question. (For example, if you thought the answer to the first question was A, you would include win_perc = 'A' in the cell below.)

win_perc : What is the highest overall win percentage among the candy included in this dataset?

  • A) 1.00

  • B) 98.8

  • C) 85.00

  • D) 84.18

  • E) Unable to determine

fruity : What proportion of the dataset is comprised of fruity candy?

  • A) 0

  • B) 0.45

  • C) 0.50

  • D) 85

  • E) Unable to determine

raise NotImplementedError()
answers = ['A', 'B', 'C', 'D', 'E']
assert win_perc
assert fruity
assert win_perc in answers
assert fruity in answers

It’s typically most helpful to access a Series of a DataFrame to then use a pandas method to understand more about the values in that Series.

For example, we could see the values stored in the chocolate column using:

df['chocolate']
However, we typically build upon this by determining some other information about the values in that Series. For example, we often want to determine how many observations there are for each possible value in a Series with discrete information. The value_counts() function can be helpful in determining this.
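As a quick sketch of value_counts() on a made-up toy Series (not the candy data):

```python
import pandas as pd

# five observations of a binary variable
s = pd.Series([0, 1, 1, 0, 1])

# count how many observations fall into each distinct value
counts = s.value_counts()
print(counts.loc[1])   # 3 observations with the value 1
print(counts.loc[0])   # 2 observations with the value 0
```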

Q3c: value_counts()

Using the value_counts() function, determine how many different possible values there are for the chocolate series in the df DataFrame and how many observations fall into each.

Store the output in the object chocolate_values.

Take a look at the output. Be sure you understand whether or not there are more chocolate (chocolate == 1) or nonchocolate candies (chocolate == 0) in the dataset from the output.

raise NotImplementedError()

assert chocolate_values.loc[0] == 48

Missing Data

There are a number of different ways in which to determine whether or not data are missing. The most common approaches are summarized here:

# Calculate % of missing values in each column:
df.isna().mean()

# Drop columns with any missing values:
df.dropna(axis='columns')

# Drop columns in which more than 10% of values are missing:
df.dropna(thresh=len(df)*0.9, axis='columns')

# Want to know the *count* of missing values in a DataFrame?
df.isna().sum().sum()

# Just want to know if there are *any* missing values?
df.isna().any().any()
Run the following cell and interpret the output:

# determine which columns have missing data
df.isna().any()

Q3d: isna()

How many variables have missing data in this dataset? Store the value in the variable var_missing:

raise NotImplementedError()
assert isinstance(var_missing, (np.int64, int))

Q4: replace (0.5 points)

To replace current values with different values dynamically, replace is very helpful.

To demonstrate this, if you wanted to replace the zeroes and ones in the current dataset with Boolean values (True or False), you could do so using replace.
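Here is a minimal sketch of the idea on made-up toy data (not the candy dataset):

```python
import pandas as pd

# toy DataFrame with 0/1 encoded variables
toy = pd.DataFrame({'chocolate': [1, 0, 1], 'fruity': [0, 1, 0]})

# map each 0 to False and each 1 to True, across the whole DataFrame
toy_bool = toy.replace({0: False, 1: True})
print(toy_bool['chocolate'].tolist())   # [True, False, True]
```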

In df, replace 0 with False and 1 with True.

Store this output in df_bool.

raise NotImplementedError()
assert df_bool.shape == df.shape
assert df_bool.loc[1,'chocolate'] == True
assert df_bool.iloc[0,0] == '100 Grand'

Part 3: Common Manipulations

Regardless of the dataset you’re working with, you’ll want to be very familiar with a few common data manipulations when wrangling data, each of which is described below:




select: select which columns to include in the dataset

filter: filter the dataset to only include specified rows

assign: add a new column based on values in other columns

groupby: group values to apply a function within the specified groups

agg: calculate a specified summary metric of a specified variable

sort_values: sort rows in ascending or descending order of a specified column

merge: join separate datasets into a single dataset based on a common column

We’ll review how to carry out each of these in pandas and you’ll get practice with each.

Q5: Column Operations (2.15 points)

Often we’re only interested in working with some columns in a larger dataset and it’s helpful to filter the dataset so that it only includes the columns we want. In these cases, we select the columns we’re interested in by specifying these columns and returning the output.

In pandas, there are a number of ways in which this can be done, but the most common is to specify a list of columns you want by name and in the order you want.

For example, for a dataframe (df) with five columns (‘A’, ‘B’, ‘C’, ‘D’, ‘E’), you could select columns ‘C’, ‘B’ and ‘A’ (in that order), using the following:

df[['C', 'B', 'A']]
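A runnable version of that snippet, on a made-up five-column toy DataFrame:

```python
import pandas as pd

toy = pd.DataFrame({'A': [1], 'B': [2], 'C': [3], 'D': [4], 'E': [5]})

# select three columns, in the order we want them
subset = toy[['C', 'B', 'A']]
print(list(subset.columns))   # ['C', 'B', 'A']
```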

Q5a: selecting columns

For our candy dataset, we’re ultimately interested in determining the differences between chocolate and fruity candy.

To just get the data we’re interested in, return a DataFrame with the columns: competitorname, chocolate, fruity, hard, bar, pluribus, sugarpercent, pricepercent, and winpercent.

Store this output in df (overwriting the previous data stored in df).

raise NotImplementedError()
assert df.shape == (85, 9)


Occasionally, there are a number of different types of data (strings, numbers, etc.) in a DataFrame and you’re only interested in data of a particular type. For example, in df_bool, we have the data about the candy stored as Booleans (False, True) rather than integers (0, 1).

To remind yourself of what type of information is stored in each column, the dtypes attribute can be very helpful:

# see types for each variable in df
df.dtypes

# see types for each variable in df_bool
df_bool.dtypes
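To sketch how dtypes and select_dtypes work together, using made-up toy data:

```python
import pandas as pd

toy = pd.DataFrame({'name': ['a', 'b'],
                    'flag': [True, False],
                    'score': [0.5, 0.9]})

print(toy.dtypes)   # name: object, flag: bool, score: float64

# keep only the string (object) and bool columns
subset = toy.select_dtypes(include=['object', 'bool'])
print(list(subset.columns))   # ['name', 'flag']
```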

Q5b: Select by Data Type

For the df_bool dataset, we’re interested in selecting only the columns that are either a string (object) or a bool.

Use the pandas select_dtypes function to select those columns. Note, you’ll have to specify the include parameter.

Store this output in df_bool

raise NotImplementedError()

assert df_bool.shape == (85, 10)

Q5c: Renaming Columns

Often, data have less than helpful column names. To make working with these data as simple as possible, it can be helpful to ensure that all columns have informative names. rename can help you do this.

Generally there are three approaches to renaming columns:

  1. Most flexible option: df = df.rename({'A':'a', 'B':'b'}, axis='columns')

  2. Overwrite all column names: df.columns = ['a', 'b']

  3. Apply string method: df.columns = df.columns.str.lower()
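The three approaches above can be sketched on a made-up toy DataFrame:

```python
import pandas as pd

toy = pd.DataFrame({'A': [1], 'B': [2]})

# 1. rename specific columns via a mapping (most flexible)
toy1 = toy.rename({'A': 'a', 'B': 'b'}, axis='columns')
print(list(toy1.columns))   # ['a', 'b']

# 2. overwrite all column names at once
toy2 = toy.copy()
toy2.columns = ['x', 'y']
print(list(toy2.columns))   # ['x', 'y']

# 3. apply a string method to every column name
toy3 = toy.copy()
toy3.columns = toy3.columns.str.lower()
print(list(toy3.columns))   # ['a', 'b']
```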

For our dataset in particular, pluribus is not an ideal column name. Use approach 1 above to rename pluribus to multicandy_pack for both the df and df_bool DataFrames.

raise NotImplementedError()
assert 'multicandy_pack' in df.columns
assert 'multicandy_pack' in df_bool.columns

Q5d: Adding new columns

Often when working with data we need to add additional columns that use information from existing columns.

To do this, we typically use one of two approaches, summarized generally here:

  1. assign

df.assign(new_col = df['col'] * val)

  2. apply

for col in df.columns:
    df[f'{col}_new'] = df[col].apply(my_function)
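Both approaches can be sketched on made-up toy data (not the candy dataset):

```python
import pandas as pd

toy = pd.DataFrame({'chocolate': [1, 0], 'fruity': [0, 1]})

# assign: create a new column from existing ones
toy = toy.assign(total=toy['chocolate'] + toy['fruity'])
print(toy['total'].tolist())          # [1, 1]

# apply: run a function over each value in an existing column
toy['total_doubled'] = toy['total'].apply(lambda v: v * 2)
print(toy['total_doubled'].tolist())  # [2, 2]
```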

For example, are there any candy in our dataset that are neither fruity nor chocolate?

Use assign to add a new column to df called fruity_choco that adds the value in chocolate to the value in fruity. This way, the value will be 0 if it is neither fruity nor chocolate, 1 if it is one or the other and 2 if it is both.

Store the output in df.

raise NotImplementedError()
assert isinstance(df['fruity_choco'], pd.Series)

Let’s take a look at this new variable we created. We can use value_counts() to determine how many candies fall into each category:

# see breakdown of 'fruity_choco' column
df['fruity_choco'].value_counts()

Q5e: neither

Take a look at the output above. How many candies are neither fruity nor chocolate? Store this value in the variable neither.

raise NotImplementedError()
assert neither >= 0 and neither <= len(df)

Q6: Row Operations (1.3 points)

In addition to column operations, filtering rows to only include those of interest to your analysis can be very important. For example, if we wanted to only include candy that was either chocolate or fruity, we would want to filter to only include the candies that fit that criterion.

To do this, we have to filter the data, using slicing.

slicing: loc and iloc

Subsets of data from a DataFrame can also be extracted through the use of slicing. There are two general approaches to slicing:




loc: label-based indexing

iloc: integer-based indexing

For example, you could use loc:

df.loc[:, ['A', 'B', 'C']]

This would return all rows (indicated by the :) and three columns ['A', 'B', 'C'].

Alternatively, integer-based indexing could be used with iloc (where the i stands for index):

df.iloc[0:5, 0:5]

This would return the first five rows and columns of the dataframe df. (As a reminder: when indicating ranges in Python, the final value is not included in what is returned. So, this returns the zero-th through the 4th indices. Index 5 is not included in the output)

Python also uses zero-based indexing which means the first element is indexed as zero, the second has index 1, and so on.
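A small sketch of both indexers on made-up toy data, including the end-exclusive range behavior:

```python
import pandas as pd

toy = pd.DataFrame({'A': [10, 20, 30], 'B': [1, 2, 3]},
                   index=['x', 'y', 'z'])

print(toy.loc['y', 'A'])          # 20  (label-based)
print(toy.iloc[1, 0])             # 20  (integer position-based, same cell)
print(toy.iloc[0:2, 0:1].shape)   # (2, 1): the end of each range is excluded
```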

Q6a: Slicing

The concept of slicing can be combined with conditionals. For example, you can return all rows of a particular value.

Using value_counts() above on the fruity_choco column we created, we see that there is one candy that is both chocolate and fruity in our dataset. Let’s figure out which candy that was!

Use slicing to return the row from df where fruity_choco indicated the candy was both fruity and chocolate.

Store this DataFrame (which will have a single row) as both. Again using indexing, store the name of the candy that is both fruity and chocolate in the variable candy_name.

raise NotImplementedError()
assert type(both) == pd.DataFrame
assert both.shape == (1, 10)
assert isinstance(candy_name, str)

Filtering rows

DataFrames can be subset in a number of different ways. For example, if you wanted to filter a DataFrame to only include rows meeting a certain condition, you could use the following approach:

df[df.color == 'red']

Note that this extends and works with multiple OR conditions:

df[(df.color == 'red') | (df.color == 'green') | (df.color == 'blue')]

A shorter approach to this would utilize isin:

df[df.color.isin(['red', 'green', 'blue'])]
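A runnable sketch of conditional filtering and isin, on made-up toy data:

```python
import pandas as pd

toy = pd.DataFrame({'color': ['red', 'green', 'blue', 'pink']})

# filter by a single condition
print(len(toy[toy['color'] == 'red']))                        # 1

# filter by several ORed conditions at once via isin
print(len(toy[toy['color'].isin(['red', 'green', 'blue'])]))  # 3
```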

Q6b: Filtering

Using this concept, filter df to only include rows that contain candy that is fruity or chocolate in some capacity (meaning, something that is both fruity and chocolate would be included).

Store this output in df.

raise NotImplementedError()
assert df.shape == (74, 10)
assert not any(df['fruity_choco'] == 0)

aside: calculations on subsets of the data

Calculations can be carried out on columns using conditionals. For example, if you wanted to determine how many of the candies in our current dataset are part of a pack of candy with multiple types of candy in the pack, you could use sum() on the subset of the data that meets that condition

# carry out calculation: number of candies in a multi-candy pack
sum(df['multicandy_pack'] == 1)

Q7: arrange rows (1.1 points)

While you don’t always care about what order the rows are sorted, at times it can be helpful to sort a DataFrame so that the rows are ordered in a particular manner.

To do this, sort_values can be very helpful. For example, to sort by the values in ‘col’, you would use the following:

df.sort_values(by='col')
To arrange (sort) the rows by more than one column, the syntax changes slightly:

df.sort_values(by=['col1', 'col2'])
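A sketch of descending sorting plus resetting the index, on made-up toy data:

```python
import pandas as pd

toy = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [0.2, 0.9, 0.5]})

# sort descending, then reset the index so row 0 holds the highest score
toy = toy.sort_values(by='score', ascending=False).reset_index(drop=True)
print(toy.loc[0, 'name'])     # 'b'
print(toy['score'].tolist())  # [0.9, 0.5, 0.2]
```

Note the drop=True: without it, reset_index keeps the old indices as a new column.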

Q7a: sort_values

Specifically, df is currently sorted in alphabetical order by competitor name (candy). What if we were most curious about the sugar percentile of where the candy falls within the dataset?

Arrange the rows in df such that the candy with the highest sugarpercent is displayed first and the lowest sugarpercent last. Store this in df.

Also, note that when using sort_values, the original indices are retained. So, also reset the index of df so that the highest sugarpercent candy is index 0.

raise NotImplementedError()
assert df.loc[0,'competitorname'] != '100 Grand'
assert np.allclose(df.loc[0,'sugarpercent'], 0.988)
# take a look at the first few rows

Q7b: Sugar Rush

Which candy in our dataset is the sugariest? Record your answer in the variable sugary. For example, if you thought the answer was A, you would submit sugary = 'A'.

  • A) Reese’s Miniatures

  • B) Chiclets

  • C) 100 Grand

  • D) Reese’s Stuffed with Pieces

  • E) Candy not listed here

raise NotImplementedError()
assert isinstance(sugary, str)

Q8: groupby + agg (1.2 points)

Once a dataset is in order, including the columns and rows that you need for your analysis, it’s often helpful to carry out some grouped operations. groupby allows you to group the data in your dataset by the values in a specific column. agg is then helpful for carrying out the operation you’re interested in within the groups.

For example, if you wanted to groupby a candy’s chocolate status and then carry out some calculations within the group, say, determine the average value for the other variables in the dataset, grouped by chocolate status, you could use agg to do so.
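Here is a sketch of the groupby + agg pattern on made-up toy data (not the candy dataset):

```python
import pandas as pd

toy = pd.DataFrame({'chocolate': [1, 1, 0, 0],
                    'sugarpercent': [0.25, 0.75, 0.5, 0.25]})

# mean sugarpercent within each chocolate group
means = toy.groupby('chocolate')['sugarpercent'].agg('mean')
print(means.loc[1])   # 0.5
print(means.loc[0])   # 0.375

# several summary metrics at once
summary = toy.groupby('chocolate')['sugarpercent'].agg(['min', 'mean', 'max'])
print(list(summary.columns))   # ['min', 'mean', 'max']
```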

# carry out operation: average of each numeric variable, grouped by chocolate status
df.groupby('chocolate').mean(numeric_only=True)

Q8a: Cost Interpretation

Looking at the output in the previous cell, which type of candy costs more, fruity candy or chocolate candy? Store your response in candy_cost. For example, if you thought the answer was A, you would submit candy_cost = 'A'.

  • A) chocolate costs more

  • B) fruity costs more

  • C) chocolate and fruity cost the same

  • D) can’t be determined

raise NotImplementedError()
assert isinstance(candy_cost, str)

Q8b: specific column aggregation

Carrying out a similar operation, group df by fruity, and then calculate the minimum, average (mean), and maximum values, for the sugarpercent column only.

For this, fruity should be the name for the indices and your column names should be ‘min’, ‘mean’, and ‘max’.

Store this result in sugar_summary.

Note: the pandas documentation here may help you if you’re struggling to figure out the syntax

raise NotImplementedError()

# look at output
assert type(sugar_summary) == pd.DataFrame
assert sugar_summary.index.name == 'fruity'
assert 'min' in sugar_summary.columns
assert np.isclose(sugar_summary.loc[0,'min'], 0.034)

Q8c: Sugar Interpretation

Looking at the output in Q8b, which type of candy has the higher average sugar percentile, fruity candy or chocolate candy? Store your response in candy_sugar. For example, if you thought the answer was A, you would submit candy_sugar = 'A'.

  • A) chocolate has higher average sugar percentile

  • B) fruity candy has higher average sugar percentile

  • C) chocolate and fruity have the same average sugar percentile

  • D) can’t be determined

raise NotImplementedError()
assert isinstance(candy_sugar, str)

The End!

Good work! While this just scratches the surface of what pandas can do, we covered the common manipulations you’ll use a lot. Do your best to understand each operation you carried out here, becoming as familiar as possible with pandas, its common operations, and its syntax. These will be used throughout this quarter and (most likely) whenever you’re working with tabular data in Python.

Have a look back over your answers, and also make sure to Restart & Run All from the kernel menu to double check that everything is working properly. You can also use the ‘Validate’ button above, which runs your notebook from top to bottom and checks to ensure all assert statements pass. When you are ready, submit on datahub!