Pandas¶
This is a demo assignment that is openly available for the Data Science in Practice Course.
Part 1: Setup¶
Throughout this assignment we’ll review and get practice with the basics of using the pandas package for working with tabular data.
You will absolutely use pandas throughout this course and, if you continue to use Python for working with data in the future, you will continue to use this package. This means that while you could Google every pandas function when doing assignments, it’s best to memorize the basics. So, do your best to understand each operation you carry out here, as you’ll almost certainly see it again. And try to commit the most commonly used functions (those presented here) to memory. It will save future you a ton of time. The more comfortable you are with pandas, the easier the rest of the assignments and your final project will be for you!
Finally, there are a ton of great pandas resources and documentation out there, including 10 minutes to pandas. Feel free to use them! However, many of the examples in this assignment were inspired by pandas tips and tricks: https://www.dataschool.io/python-pandas-tips-and-tricks/. Feel free to check out all the additional tips and tricks there, many of which are not covered here.
Q1: import (0.25 points)¶
Import the pandas package using its typical shortname, pd.
Import the numpy package using its typical shortname, np.
# YOUR CODE HERE
raise NotImplementedError()
assert pd
assert np
Part 2: Data Import, DataFrames, & Series¶
The pandas package has a number of helpful import functions (read_csv(), read_json(), read_html(), etc.).
We’ll use a number of these throughout the course; however, for this assignment we’ll use a CSV file from FiveThirtyEight. The summarized data in the file we’ll use here were used in the article The Ultimate Halloween Candy Power Ranking. To collect these data, candies were pitted against one another with participants having to choose which they liked better.
The most important functional unit of the pandas package is the DataFrame. DataFrames store information in rows and columns.
Secondarily, Series are one-dimensional labeled arrays. They can hold data of any type.
You can think of a DataFrame as a bunch of related Series stored together, such that the information in each row comes from a single observation, and each column (Series) stores information about a single variable measured across the observations.
To access a column (Series) from a DataFrame, the following syntax is used:
df['series']
Here df is the object storing the pandas DataFrame and ‘series’ is the name of the column in df that you want to access.
More information about DataFrames and Series is available here.
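As a quick illustration on a small made-up DataFrame (not the candy data we’ll load below), accessing a column with bracket notation returns a Series:

```python
import pandas as pd

# a tiny made-up DataFrame: each row is one candy, each column one variable
toy = pd.DataFrame({'candy': ['Twix', 'Skittles'],
                    'chocolate': [1, 0]})

# bracket notation with a column name returns that column as a Series
col = toy['chocolate']
print(type(col))  # <class 'pandas.core.series.Series'>
```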
Q2: Read the data into Python (0.25 points)¶
To use the data, you’ll have to read it in as a DataFrame. Use the appropriate pandas import function to read in the file from the following repo: https://github.com/fivethirtyeight/data/tree/master/candy-power-ranking.
Note that URLs can be read in directly using pandas. The URL you’ll want to use to read the file in is: ‘https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv’.
Store this in the variable df.
# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(df, pd.DataFrame)
assert df.shape == (85, 13)
Q3: Data Summary (0.75 points)¶
After reading in a dataset, it’s often helpful to get a quick sense of what information it contains. Here you’ll carry out a number of operations that you’ll use on almost every dataset you read into Python.
Q3a: shape¶
One way to do that is to determine the shape of the dataset. This reports the number of rows and columns in the dataset. Store the shape of the candy dataset in the variable candy_shape.
# YOUR CODE HERE
raise NotImplementedError()
assert candy_shape == (85, 13)
Note that pandas DataFrames also print out well when you access them. This allows you to get a sense of the information stored in your dataset overall.
For example, if you run the cell below, you’ll see what information is stored in this dataset.
# take a look at the information stored in df
df
A full description of these data can be found here. From that link, we’ll include a description of each variable (column) here:
| Header | Description |
|---|---|
| chocolate | Does it contain chocolate? |
| fruity | Is it fruit flavored? |
| caramel | Is there caramel in the candy? |
| peanutalmondy | Does it contain peanuts, peanut butter or almonds? |
| nougat | Does it contain nougat? |
| crispedricewafer | Does it contain crisped rice, wafers, or a cookie component? |
| hard | Is it a hard candy? |
| bar | Is it a candy bar? |
| pluribus | Is it one of many candies in a bag or box? |
| sugarpercent | The percentile of sugar it falls under within the data set. |
| pricepercent | The unit price percentile compared to the rest of the set. |
| winpercent | The overall win percentage according to 269,000 matchups. |
Note that many of the variables store binary information, where 0 means False and 1 means True. For example, for a Snickers bar, chocolate would have the value 1 (because it is chocolate), while fruity for the same candy would store 0 (because Snickers bars are not fruity).
Q3b: describe¶
pandas also has a very helpful describe function, which will provide some summary statistics about your DataFrame.
Use the describe function to calculate and display these summary statistics (you do not have to store the output):
# YOUR CODE HERE
raise NotImplementedError()
Using the output you just generated, answer the following questions, storing the answer in the variable at the front of each question. (For example, if you thought the answer to the first question was A, you would include win_perc = 'A' in the cell below.)
win_perc: What is the highest overall win percentage among the candies included in this dataset?
A) 1.00
B) 98.8
C) 85.00
D) 84.18
E) Unable to determine
fruity: What proportion of the dataset is comprised of fruity candy?
A) 0
B) 0.45
C) 0.50
D) 85
E) Unable to determine
# YOUR CODE HERE
raise NotImplementedError()
answers = ['A', 'B', 'C', 'D', 'E']
assert win_perc
assert fruity
assert win_perc in answers
assert fruity in answers
It’s typically most helpful to access a Series of a DataFrame and then use a pandas method to understand more about the values in that Series.
For example, we could see the values stored in the chocolate column using:
df['chocolate']
However, we typically build upon this by determining some other information about the values in that Series. For example, we often want to determine how many observations there are for each possible value in a Series with discrete information. The value_counts() function can be helpful in determining this.
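For instance, on a small made-up Series (not the candy data), value_counts() behaves like this:

```python
import pandas as pd

# a made-up Series of binary values, like the candy indicator columns
s = pd.Series([1, 0, 0, 1, 0])

# value_counts() tallies how many times each distinct value occurs,
# sorted with the most frequent value first
counts = s.value_counts()
print(counts.loc[0], counts.loc[1])  # 3 2
```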
Q3c: value_counts()¶
Using the value_counts() function, determine how many different possible values there are for the chocolate series in the df DataFrame and how many observations fall into each.
Store the output in the object chocolate_values.
Take a look at the output. Be sure you understand from the output whether there are more chocolate (chocolate == 1) or non-chocolate (chocolate == 0) candies in the dataset.
# YOUR CODE HERE
raise NotImplementedError()
chocolate_values
assert chocolate_values.loc[0] == 48
Missing Data¶
There are a number of different ways to determine whether data are missing. The most common approaches are summarized here:
# Calculate % of missing values in each column:
df.isna().mean()
# Drop columns with any missing values:
df.dropna(axis='columns')
# Drop columns in which more than 10% of values are missing:
df.dropna(thresh=len(df)*0.9, axis='columns')
# Want to know the *count* of missing values in a DataFrame?
df.isna().sum().sum()
# Just want to know if there are *any* missing values?
df.isna().any().any()
df.isna().any(axis=None)
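To see a few of these in action before applying them to the candy data, here’s a sketch on a small made-up DataFrame with a single missing value:

```python
import pandas as pd
import numpy as np

# made-up DataFrame: column 'b' has one missing value
toy = pd.DataFrame({'a': [1, 2, 3],
                    'b': [4.0, np.nan, 6.0]})

frac_missing = toy.isna().mean()      # fraction missing per column
n_missing = toy.isna().sum().sum()    # total count of missing values
any_missing = toy.isna().any().any()  # True if anything is missing

print(n_missing, any_missing)  # 1 True
```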
Run the following cell and interpret the output:
# determine which columns have missing data
df.isna().any()
Q3d: isna()¶
How many variables have missing data in this dataset? Store the value in the variable var_missing:
# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(var_missing, (np.int64, int))
Q4: replace (0.5 points)¶
To replace current values with different values dynamically, replace is very helpful.
To demonstrate this: if you wanted to replace the zeroes and ones in the current dataset with Boolean values (True or False), you could do so using replace.
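The general pattern, sketched on a small made-up DataFrame (the column names here are invented for illustration):

```python
import pandas as pd

# made-up 0/1 indicator columns
toy = pd.DataFrame({'chocolate': [1, 0],
                    'fruity': [0, 1]})

# replace maps old values to new values across the whole DataFrame
toy_bool = toy.replace({0: False, 1: True})
print(toy_bool.loc[0, 'chocolate'])  # True
```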
In df, replace 0 with False and 1 with True.
Store this output in df_bool.
# YOUR CODE HERE
raise NotImplementedError()
assert df_bool.shape == df.shape
assert df_bool.loc[1,'chocolate'] == True
assert df_bool.iloc[0,0] == '100 Grand'
Part 3: Common Manipulations¶
Regardless of the dataset you’re working with, you’ll want to be very familiar with a few common data manipulations when wrangling data, each of which is described below:
| Manipulation | Description |
|---|---|
| select | select which columns to include in the dataset |
| filter | filter the dataset to only include specified rows |
| mutate | add a new column based on values in other columns |
| groupby | group values to apply a function within the specified groups |
| summarize | calculate a specified summary metric of a specified variable |
| arrange | sort rows in ascending or descending order of a specified column |
| merge | join separate datasets into a single dataset based on a common column |
We’ll review how to carry out each of these in pandas, and you’ll get practice with each.
Q5: Column Operations (2.15 points)¶
Often we’re only interested in working with some columns in a larger dataset and it’s helpful to filter the dataset so that it only includes the columns we want. In these cases, we select the columns we’re interested in by specifying these columns and returning the output.
In pandas, there are a number of ways in which this can be done, but the most common is to specify a list of columns you want by name and in the order you want.
For example, for a DataFrame (df) with five columns (‘A’, ‘B’, ‘C’, ‘D’, ‘E’), you could select columns ‘C’, ‘B’, and ‘A’ (in that order) using the following:
df[['C', 'B', 'A']]
Q5a: selecting columns¶
For our candy dataset, we’re ultimately interested in determining the differences between chocolate and fruity candy.
To just get the data we’re interested in, return a DataFrame with the columns: competitorname, chocolate, fruity, hard, bar, pluribus, sugarpercent, pricepercent, and winpercent.
Store this output in df (overwriting the previous data stored in df).
# YOUR CODE HERE
raise NotImplementedError()
assert df.shape == (85, 9)
dtypes¶
Occasionally, there are a number of different types of data (strings, numbers, etc.) in a DataFrame and you’re only interested in data of a particular type. For example, in df_bool, we have the data about the candy stored as Booleans (False, True) rather than integers (0, 1).
To remind yourself of what type of information is stored in each column, the dtypes attribute can be very helpful:
# see types for each variable
df.dtypes
# see types for each variable
df_bool.dtypes
Q5b: Select by Data Type¶
For the df_bool dataset, we’re interested in selecting only the columns that are either a string (object) or a bool.
Use the pandas select_dtypes function to select those columns. Note that you’ll have to specify the include parameter.
Store this output in df_bool.
# YOUR CODE HERE
raise NotImplementedError()
df_bool.head()
assert df_bool.shape == (85, 10)
Q5c: Renaming Columns¶
Often, data have less-than-helpful column names. To make working with these data as simple as possible, it can be helpful to ensure that all columns have descriptive names. rename can help you do this.
Generally, there are three approaches to renaming columns:
Most flexible option:
df = df.rename({'A':'a', 'B':'b'}, axis='columns')
Overwrite all column names:
df.columns = ['a', 'b']
Apply string method:
df.columns = df.columns.str.lower()
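Here are approaches 1 and 3 run end-to-end on a small made-up DataFrame (the column names are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'A': [1], 'B': [2]})

# approach 1: rename specific columns via a mapping; unlisted columns are untouched
toy = toy.rename({'A': 'alpha', 'B': 'beta'}, axis='columns')

# approach 3: apply a string method to every column name at once
toy.columns = toy.columns.str.upper()
print(list(toy.columns))  # ['ALPHA', 'BETA']
```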
For our dataset in particular, pluribus is not an ideal column name. Use approach 1 above to rename pluribus to multicandy_pack for both the df and df_bool DataFrames.
# YOUR CODE HERE
raise NotImplementedError()
assert 'multicandy_pack' in df.columns
assert 'multicandy_pack' in df_bool.columns
Q5d: Adding new columns¶
Often when working with data we need to add additional columns that use information from existing columns.
To do this, we typically use one of two approaches, summarized generally here:
assign:
df.assign(new_col = df['col'] * val)
apply:
for col in df.columns:
    df[f'{col}_new'] = df[col].apply(my_function)
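Both patterns, sketched on a small made-up DataFrame (my_function is replaced here by a simple lambda):

```python
import pandas as pd

toy = pd.DataFrame({'price': [1.0, 2.0]})

# assign: build a new column from an existing one in a single expression
toy = toy.assign(price_double=toy['price'] * 2)

# apply: run a function element-wise over a column
toy['price_label'] = toy['price'].apply(lambda p: 'cheap' if p < 1.5 else 'pricey')

print(list(toy['price_double']))  # [2.0, 4.0]
```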
For example, are there any candies in our dataset that are neither fruity nor chocolate?
Use assign to add a new column to df called fruity_choco that adds the value in chocolate to the value in fruity. This way, the value will be 0 if the candy is neither fruity nor chocolate, 1 if it is one or the other, and 2 if it is both.
Store the output in df.
.
# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(df['fruity_choco'], pd.Series)
Let’s take a look at this new variable we created. We can use value_counts() to determine how many candies fall into each category:
# see breakdown of 'fruity_choco' column
df['fruity_choco'].value_counts()
Q5e: neither¶
Take a look at the output above. How many candies are neither fruity nor chocolate? Store this value in the variable neither.
# YOUR CODE HERE
raise NotImplementedError()
assert neither >= 0 and neither <= len(df)
Q6: Row Operations (1.3 points)¶
In addition to column operations, filtering rows to only include those of interest to your analysis can be very important. For example, if we wanted to only include candy that was either chocolate or fruity, we would want to filter to only include the candies that fit that criterion.
To do this, we have to filter the data using slicing.
slicing: loc and iloc¶
Subsets of data from a DataFrame can also be extracted through the use of slicing. There are two general approaches to slicing:
| Approach | Description |
|---|---|
| loc | label-based indexing |
| iloc | integer-based indexing |
For example, you could use loc:
df.loc[:, ['A', 'B', 'C']]
This would return all rows (indicated by the :) and three columns: ['A', 'B', 'C'].
Alternatively, integer-based indexing could be used with iloc (where the i stands for integer):
df.iloc[0:5, 0:5]
This would return the first five rows and columns of the DataFrame df. (As a reminder: when indicating ranges in Python, the final value is not included in what is returned. So, this returns the zeroth through the fourth indices; index 5 is not included in the output.)
Python uses zero-based indexing, which means the first element has index 0, the second index 1, and so on.
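A quick side-by-side on a small made-up DataFrame, showing that loc works with labels while iloc works with positions:

```python
import pandas as pd

# made-up DataFrame with string row labels
toy = pd.DataFrame({'A': [10, 20, 30], 'B': [1, 2, 3]},
                   index=['x', 'y', 'z'])

by_label = toy.loc['y', 'A']   # loc: row label 'y', column label 'A'
by_position = toy.iloc[1, 0]   # iloc: second row (1), first column (0)

print(by_label, by_position)  # 20 20
```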
Q6a: Slicing¶
The concept of slicing can be combined with conditionals. For example, you can return all rows with a particular value.
Using value_counts() above on the fruity_choco column we created, we see that there is one candy that is both chocolate and fruity in our dataset. Let’s figure out which candy that was!
Use slicing to return the row from df where fruity_choco indicates the candy is both fruity and chocolate.
Store this DataFrame (which will have a single row) as both.
Again using indexing, store the name of the candy that is both fruity and chocolate in the variable candy_name.
# YOUR CODE HERE
raise NotImplementedError()
assert type(both) == pd.DataFrame
assert both.shape == (1, 10)
assert isinstance(candy_name, str)
Filtering rows¶
DataFrames can be subset in a number of different ways. For example, to filter a DataFrame to only include certain rows, you could use the approaches summarized here.
If you wanted to filter a DataFrame by a condition, you could use the following approach:
df[df.color == 'red']
Note that this extends to multiple OR conditions:
df[(df.color == 'red') | (df.color == 'green') | (df.color == 'blue')]
A shorter approach would utilize isin:
df[df.color.isin(['red', 'green', 'blue'])]
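Both filtering styles on a small made-up DataFrame (the color column is invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'color': ['red', 'green', 'blue', 'yellow']})

# single condition: keep only the rows where the condition is True
reds = toy[toy['color'] == 'red']

# isin: compact form of several OR-ed equality conditions
rgb = toy[toy['color'].isin(['red', 'green', 'blue'])]

print(len(reds), len(rgb))  # 1 3
```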
Q6b: Filtering¶
Using this concept, filter df to only include rows that contain candy that is fruity or chocolate in some capacity (meaning a candy that is both fruity and chocolate would be included).
Store this output in df.
# YOUR CODE HERE
raise NotImplementedError()
assert df.shape == (74, 10)
assert not any(df['fruity_choco'] == 0)
aside: calculations on subsets of the data¶
Calculations can be carried out on columns using conditionals. For example, if you wanted to determine how many of the candies in our current dataset are part of a pack with multiple types of candy, you could use sum() on the subset of the data that meets that condition:
# carry out calculation
(df['multicandy_pack']==1).sum()
Q7: arrange rows (1.1 points)¶
While you don’t always care about the order the rows are sorted in, at times it can be helpful to sort a DataFrame so that the rows are ordered in a particular manner.
To do this, sort_values can be very helpful. For example, to sort by the values in ‘col’, you would use the following:
df.sort_values(by=['col'])
To arrange (sort) the rows by more than one column, the syntax changes slightly:
df.sort_values(by=['col1', 'col2'])
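A sketch on a small made-up DataFrame, including the descending sort (ascending=False) and the index reset that sorting often calls for:

```python
import pandas as pd

toy = pd.DataFrame({'name': ['a', 'b', 'c'],
                    'score': [0.2, 0.9, 0.5]})

# sort descending; note that the original row indices are retained
toy = toy.sort_values(by=['score'], ascending=False)

# reset_index(drop=True) renumbers the rows 0..n-1, discarding the old index
toy = toy.reset_index(drop=True)

print(toy.loc[0, 'name'])  # b
```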
Q7a: sort_values¶
Specifically, df is currently sorted in alphabetical order by competitor name (candy). What if we were most curious about where each candy’s sugar content falls, percentile-wise, within the dataset?
Arrange the rows in df such that the candy with the highest sugarpercent is displayed first and the lowest sugarpercent last. Store this in df.
Also, note that when using sort_values, the original indices are retained. So, also reset the index of df so that the highest-sugarpercent candy is index 0.
# YOUR CODE HERE
raise NotImplementedError()
assert df.loc[0,'competitorname'] != '100 Grand'
assert np.allclose(df.loc[0,'sugarpercent'], 0.988)
# take a look at the first few rows
df.head()
Q7b: Sugar Rush¶
Which candy in our dataset is the sugariest? Record your answer in the variable sugary. For example, if you thought the answer was A, you would submit sugary = 'A'.
A) Reese’s Miniatures
B) Chiclets
C) 100 Grand
D) Reese’s Stuffed with Pieces
E) Candy not listed here
# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(sugary, str)
Q8: groupby + agg (1.2 points)¶
Once a dataset is in order, including the columns and rows you need for your analysis, it’s often helpful to carry out some grouped operations. groupby allows you to group the data in your dataset by the values in a specific column. agg is then helpful for carrying out the operation you’re interested in within those groups.
For example, if you wanted to groupby a candy’s chocolate status and then carry out some calculations within each group, say, determine the average value of the other variables in the dataset grouped by chocolate status, you could use agg to do so:
df.groupby('chocolate').agg('mean')
# carry out operation
df.groupby('chocolate').agg('mean')
Q8a: Cost Interpretation¶
Looking at the output in the previous cell, which type of candy costs more, fruity candy or chocolate candy? Store your response in candy_cost. For example, if you thought the answer was A, you would submit candy_cost = 'A'.
A) chocolate costs more
B) fruity costs more
C) chocolate and fruity cost the same
D) can’t be determined
# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(candy_cost, str)
Q8b: specific column aggregation¶
Carrying out a similar operation, group df by fruity, and then calculate the minimum, average (mean), and maximum values for the sugarpercent column only.
For this, fruity should be the name of the index and your column names should be ‘min’, ‘mean’, and ‘max’.
Store this result in sugar_summary.
Note: the pandas documentation here may help you if you’re struggling to figure out the syntax.¶
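One general pattern that produces this shape, shown on a small made-up DataFrame: select a single column after grouping, then pass agg a list of function names, which become the column names of the result.

```python
import pandas as pd

toy = pd.DataFrame({'group': [0, 0, 1, 1],
                    'value': [1.0, 3.0, 2.0, 6.0]})

# group, select one column, aggregate it with several functions at once;
# the function names become the columns of the result
summary = toy.groupby('group')['value'].agg(['min', 'mean', 'max'])

print(summary.loc[0, 'mean'])  # 2.0
```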
# YOUR CODE HERE
raise NotImplementedError()
# look at output
sugar_summary
assert type(sugar_summary) == pd.DataFrame
assert sugar_summary.index.name == 'fruity'
assert 'min' in sugar_summary.columns
assert np.isclose(sugar_summary.loc[0,'min'], 0.034)
Q8c: Sugar Interpretation¶
Looking at the output in Q8b, which type of candy has the higher average sugar percentile, fruity candy or chocolate candy? Store your response in candy_sugar. For example, if you thought the answer was A, you would submit candy_sugar = 'A'.
A) chocolate has higher average sugar percentile
B) fruity candy has higher average sugar percentile
C) chocolate and fruity have the same average sugar percentile
D) can’t be determined
# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(candy_sugar, str)
The End!¶
Good work! While this just scratches the surface of what you can use pandas to do, we covered common manipulations you’ll use a lot. Do your best to understand each operation you carried out here, becoming as familiar as possible with pandas, its common operations, and its syntax. These will be used throughout this quarter and (most likely) whenever you’re working with tabular data in Python.
Have a look back over your answers, and also make sure to Restart & Run All from the kernel menu to double-check that everything is working properly. You can also use the ‘Validate’ button above, which runs your notebook from top to bottom and checks to ensure all assert statements pass. When you are ready, submit on datahub!