Data Privacy

This is a demo assignment that is openly available for the Data Science in Practice Course.

If you are in the COGS108 course at UC San Diego, this is NOT a valid version of the assignment for the course.

Important Reminders

  • Do not change / update / delete any existing cells with ‘assert’ in them. These are the tests used to check your assignment.

  • This assignment has hidden tests: tests that are not visible here, but that will be run on your submitted file. Passing all the tests you can see in the notebook here does not guarantee you have the right answer.

Overview

We have briefly discussed in lecture the importance and the mechanics of protecting individuals’ privacy when they are included in datasets.

One method to do so is the Safe Harbor Method. The Safe Harbor Method protects individuals’ identities by specifying which information must be removed from a dataset in order to avoid accidentally disclosing personal information.

In this assignment, we will explore web scraping (which can often turn up personally identifiable information), see how identities can be decoded from badly anonymized datasets, and practice using Safe Harbor to anonymize a dataset properly.

The topics covered in this assignment are mainly covered in the ‘DataGathering’ and ‘DataPrivacy&Anonymization’ COGS 108 Tutorial notebooks.

# Imports - these are provided for you. Do not import any other packages.
import pandas as pd
import requests
import bs4
from bs4 import BeautifulSoup

Downloading Data

If you download this notebook to run locally, you will also need some data files.

Running the next cell will download the required files for this assignment.

You can also view and download these files from https://github.com/DataScienceInPractice/Data.

from os.path import join as pjoin
from urllib.request import urlretrieve

data_url = 'https://raw.githubusercontent.com/DataScienceInPractice/Data/master/'

# Fill in these values
assignment = 'A4'
data_files = ['anon_user_dat.json', 'employee_info.json', 'user_dat.csv', 'zip_pop.csv']

for data_file in data_files:
    full_path = pjoin(data_url, assignment, data_file)
    urlretrieve(full_path, filename=data_file)

Part 1: Web Scraping (1.25 points)

Scraping Rules

  1. If you are using another organization’s website for scraping, make sure to check the website’s terms & conditions.

  2. Do not request data from the website too aggressively (quickly) with your program (also known as spamming), as this may overload the website. Make sure your program behaves in a reasonable manner (i.e., acts like a human). One request for one webpage per second is good practice.

  3. The layout of a website may change from time to time. Because of this, if you’re scraping a website, make sure to revisit the site and rewrite your code as needed.

1a) Web Scrape

We will first retrieve the contents on a page and examine them a bit.

Make a variable called wiki, that stores the following URL (as a string): https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population

Now, to open the URL, use requests.get() and provide wiki as its input. Store this in a variable called page.

After that, make a variable called soup to parse the HTML using BeautifulSoup. Note that to get the HTML out of the page, you’ll need to access an attribute of your page variable and pass it to BeautifulSoup.
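
If you get stuck, here is a minimal sketch of one possible approach (using Python’s built-in html.parser):

# Sketch: fetch the page, then parse its HTML content
wiki = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population'
page = requests.get(wiki)
soup = BeautifulSoup(page.content, 'html.parser')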

# YOUR CODE HERE
raise NotImplementedError()

wiki
assert wiki
assert page
assert soup

1b) Checking Scrape Contents

Extract the title from the page and save it in a variable called title_page.

Make sure you extract it as a string.

To do so, you have to use the soup object created in the above cell. Hint: from your soup variable, you can access this with .title.string.

Make sure you print out and check the contents of title_page.

Note that it should not have any tags (such as <title>) included in it.
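
A one-line sketch, following the hint above (.string drops the surrounding tags):

# Sketch: extract the page title as a plain string
title_page = soup.title.string
print(title_page)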

# YOUR CODE HERE
raise NotImplementedError()
assert title_page
assert isinstance(title_page, str)

1c) Extracting Tables

In order to extract the data we want, we’ll start with extracting a data table of interest.

Note that you can see this table by going to look at the link we scraped.

Use the soup object and call a method called find, which will find and extract the first table in the scraped webpage. Store this in the variable right_table.

Note: you need to search for the tag name table, and set the class_ argument to wikitable sortable.
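
A sketch of the call, matching the note above:

# Sketch: find the first table tagged with class 'wikitable sortable'
right_table = soup.find('table', class_='wikitable sortable')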

# YOUR CODE HERE
raise NotImplementedError()
assert right_table
assert isinstance(right_table, bs4.element.Tag)
assert right_table.name == 'table'

Now, you’ll extract the data from the table into lists.

Note: This code is provided for you. Do read through it and try to see how it works.

# CODE PROVIDED
# YOU SHOULD NOT HAVE TO EDIT
# BUT YOU WILL WANT TO UNDERSTAND
list_a, list_b, list_c = [], [], []

for row in right_table.findAll('tr'):
    
    cells = row.findAll('td')
    
    # Skip rows that aren't 12 cells long (like the heading)
    if len(cells) != 12:
        continue

    # This catches when the name cell stops having a link,
    #  which ends the loop, skipping the last (summary) rows
    try:
        list_a.append(cells[2].find('a').text)
        list_b.append(cells[3].find(text=True))
        list_c.append(cells[4].find(text=True))
    except AttributeError:
        break

1d) Collecting into a dataframe

Create a dataframe my_df and add the data from the lists above to it.

  • list_a is the state or territory name. Set the column name as State, and make this the index

  • list_b is the population estimate. Add it to the dataframe, and set the column name as Population Estimate

  • list_c is the census population. Add it to the dataframe, and set the column name as Census Population

Make sure to check the head of your dataframe to see that everything looks right, i.e., my_df.head()
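
If you get stuck, one possible sketch (column order matters for the asserts below):

# Sketch: assemble the three lists into a dataframe, indexed by state
my_df = pd.DataFrame({'State': list_a,
                      'Population Estimate': list_b,
                      'Census Population': list_c})
my_df = my_df.set_index('State')
my_df.head()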

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance (my_df, pd.DataFrame)
assert my_df.index.name == 'State'
assert list(my_df.columns) == ['Population Estimate', 'Census Population']

1e) Using the data

What is the Population Estimate of California? Save this answer to a variable called ca_pop. Notes:

  • Extract this value programmatically from your dataframe (as in, don’t set it explicitly as ca_pop = 123)

  • You can use .loc to extract a particular value from a dataframe.

  • The data in your dataframe will be strings - that’s fine, leave them as strings (don’t typecast).

  • Strip any whitespace/newline characters from this string, if necessary. (rstrip() may be helpful)
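
A sketch of one approach, assuming ‘California’ appears as an index value in my_df:

# Sketch: pull the value with .loc, then strip trailing whitespace/newlines
ca_pop = my_df.loc['California', 'Population Estimate'].rstrip()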

# YOUR CODE HERE
raise NotImplementedError()
assert ca_pop

Part 2: Identifying Data (3 points)

Data Files:

  • anon_user_dat.json

  • employee_info.json

You will first be working with a file called ‘anon_user_dat.json’. This file contains information about some (fake) Tinder users. When creating an account, each Tinder user was asked to provide their first name, last name, work email (to verify the disclosed workplace), age, gender, phone # and zip code. Before releasing this data, a data scientist cleaned the data to protect the privacy of Tinder’s users by removing the obvious personal identifiers: phone #, zip code, and IP address. However, the data scientist chose to keep each user’s email address because, when they visually skimmed a few of the email addresses, none seemed to contain the users’ actual names. This is where the data scientist made a huge mistake!

We will take advantage of having the work email addresses by finding the employee information of different companies and matching that employee information with the information we have, in order to identify the names of the secret Tinder users!

2a) Load in the ‘cleaned’ data

Load the anon_user_dat.json json file into a pandas dataframe. Call it df_personal.
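
A minimal sketch:

# Sketch: read the json file straight into a dataframe
df_personal = pd.read_json('anon_user_dat.json')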

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(df_personal, pd.DataFrame)

2b) Check the first 10 emails

Save the first 10 emails to a Series, and call it sample_emails. You should then print out this Series (use print()).

The purpose of this is to get a sense of how these work emails are structured and how we could possibly extract where each anonymous user seems to work.
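
A sketch, assuming the email column in df_personal is named email (the column name used later in section 2f):

# Sketch: take the first 10 entries of the 'email' column
sample_emails = df_personal['email'][:10]
print(sample_emails)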

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(sample_emails, pd.Series)

2c) Extract the Company Name From the Email

Create a function with the following specifications:

  • Function Name: extract_company

  • Purpose: to extract the company of the email (i.e., everything after the @ sign but before the first .)

  • Parameter(s): email (string)

  • Returns: The extracted part of the email (string)

  • Hint: This should take 1 line of code. Look into the string find() method.

You can start with this outline:

def extract_company(email):
    return

Example Usage:

  • extract_company(“larhe@uber.com”) should return “uber”

  • extract_company(“ds@cogs.edu”) should return “cogs”
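
A one-line sketch using find(), per the hint (slice from just after the ‘@’ up to the first ‘.’ that follows it):

# Sketch: everything after the '@' but before the next '.'
def extract_company(email):
    return email[email.find('@') + 1 : email.find('.', email.find('@'))]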

# YOUR CODE HERE
raise NotImplementedError()
assert extract_company("gshoreson0@seattletimes.com") == "seattletimes"
assert extract_company("amcgeffen1d@goo.ne.jp") == 'goo'

With a little bit of basic sleuthing (aka googling) and web-scraping (aka selectively reading in html code) it turns out that you’ve been able to collect information about all the present employees/interns of the companies you are interested in. Specifically, on each company website, you have found the name, gender, and age of its employees. You have saved that info in employee_info.json and plan to see if, using this new information, you can match the Tinder accounts to actual names.

2d) Load in employee data

Load the json file into a pandas dataframe. Call it df_employee.

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(df_employee, pd.DataFrame)

2e) Match the employee name with company, age, gender

Create a function with the following specifications:

  • Function name: employee_matcher

  • Purpose: to match the employee name with the provided company, age, and gender

  • Parameter(s): company (string), age (int), gender (string)

  • Returns: The employee first_name and last_name like this: return first_name, last_name

  • Note: If there are multiple employees that fit the same description, first_name and last_name should return a list of all possible first names and last names i.e., [‘Desmund’, ‘Kelby’], [‘Shepley’, ‘Tichner’]. Note that the names of the individuals that would produce this output are ‘Desmund Shepley’ and ‘Kelby Tichner’.

Hint: There are many different ways to code this. An inelegant solution is to loop through df_employee and for each data item see if the company, age, and gender match i.e.,

for i in range(0, len(df_employee)):
    if (company == df_employee.loc[i, 'company']):

However! The solution above is very inefficient and long, so instead look into the df.loc method: it extracts the rows of a dataframe that fulfill a certain condition, e.g.,

df_employee.loc[df_employee['company'] == company]

If you need to convert a pandas Series into a list, you can use list(result), where result is a pandas Series.

You can start with this outline:

def employee_matcher(company, age, gender):
    return first_name, last_name
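
A sketch using boolean masking; the company column name is given above, while age, gender, first_name, and last_name are assumed to be the other column names in df_employee:

# Sketch: select rows matching all three conditions at once
def employee_matcher(company, age, gender):
    matches = df_employee.loc[(df_employee['company'] == company) &
                              (df_employee['age'] == age) &
                              (df_employee['gender'] == gender)]
    return list(matches['first_name']), list(matches['last_name'])
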
# YOUR CODE HERE
raise NotImplementedError()
assert employee_matcher("google", 41, "Male") == (['Maxwell'], ['Jorio'])
assert employee_matcher("salon", 47, "Female") == (['Elenore'], ['Gravett'])
assert employee_matcher("webmd", 28, "Nonbinary") == (['Zaccaria'], ['Bartosiak'])

2f) Extract all the private data

  • Create 2 empty lists called first_names and last_names

  • Loop through all the people we are trying to identify in df_personal

  • Call the extract_company function (i.e., extract_company(df_personal.loc[i, 'email']) )

  • Call the employee_matcher function

  • Append the results of employee_matcher to the appropriate lists (first_names and last_names)
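
A sketch of the loop, assuming the df_personal columns are named email, age, and gender:

# Sketch: identify every user by chaining the two functions
first_names, last_names = [], []
for i in range(len(df_personal)):
    company = extract_company(df_personal.loc[i, 'email'])
    first, last = employee_matcher(company,
                                   df_personal.loc[i, 'age'],
                                   df_personal.loc[i, 'gender'])
    first_names.append(first)
    last_names.append(last)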

# YOUR CODE HERE
raise NotImplementedError()
assert first_names[45:50]== [['Justino'], ['Tadio'], ['Kennith'], ['Cedric'], ['Amargo']]
assert last_names[45:50] == [['Corro'], ['Blackford'], ['Milton'], ['Yggo'], ['Grigor']]

2g) Add the names to the original ‘secure’ dataset!

We have done this last step for you below; all you need to do is run this cell.

For your own personal enjoyment, you should also print out the new df_personal with the identified people.

df_personal['first_name'] = first_names
df_personal['last_name'] = last_names

We have now just discovered the ‘anonymous’ identities of all the registered Tinder users…awkward.

Part 3: Anonymize Data (3.25 points)

You are hopefully now convinced that with some seemingly harmless data a hacker can pretty easily discover the identities of certain users. Thus, we will now clean the original Tinder data ourselves according to the Safe Harbor Method in order to make sure that it has been properly cleaned…

3a) Load in personal data

Load the user_dat.csv file into a pandas dataframe, being sure to read the zip codes (zip) in as a string.

Store this in df_users.
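
A minimal sketch:

# Sketch: keep zip codes as strings so leading zeros survive
df_users = pd.read_csv('user_dat.csv', dtype={'zip': str})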

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(df_users, pd.DataFrame)

3b) Drop personal attributes

Following the Safe Harbor method, remove any columns from df_users that contain personal information.

Note that details on the Safe Harbor method are covered in the Tutorials.
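
A sketch of the pattern only; the column names below are hypothetical, so check df_users.columns and drop whichever identifying columns actually appear:

# Sketch: drop identifying columns (hypothetical names, for illustration)
df_users = df_users.drop(columns=['first_name', 'last_name', 'email'])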

# YOUR CODE HERE
raise NotImplementedError()
assert len(df_users.columns) == 3

3c) Drop ages that are above 90

Safe Harbor rule C: Drop all the rows which have age greater than 90 from df_users.
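
A sketch, assuming df_users has an age column:

# Sketch: keep only the rows with age of 90 or below
df_users = df_users[df_users['age'] <= 90]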

# YOUR CODE HERE
raise NotImplementedError()
assert df_users.shape == (943, 3)

3d) Load in zip code data

Load the zip_pop.csv file into a (different) pandas dataframe. Call it df_zip.

Note that the zip data should be read in as strings, not ints (which would be the default).

In read_csv, use the dtype parameter to specify that zip should be read as str, and population as int.
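
A minimal sketch:

# Sketch: read zips as strings and populations as ints
df_zip = pd.read_csv('zip_pop.csv', dtype={'zip': str, 'population': int})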

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(df_zip, pd.DataFrame)

3e) Sort zipcodes into “Geographic Subdivision”

The Safe Harbor Method applies to “Geographic Subdivisions” as opposed to each zipcode itself.

Geographic Subdivision: All areas which share the first 3 digits of a zip code

Count the total population for each geographic subdivision, storing the first 3 digits of the zip code and its corresponding population in the dictionary zip_dict. (For example, if there were 20 people whose zip code started with 090, the key-value pair in zip_dict would be {'090' : 20}.)

You may be tempted to write a gnarly loop to accomplish this. Avoid that temptation. Instead, be savvy with a dictionary and pandas’ groupby here.

To get you started…

If you wanted to group by whole zip code, you could use something like this:

df_zip.groupby(df_zip['zip'])

But, we don’t want to group by the entire zip code. Instead, we want to extract the first 3 digits of a zip code, and group by that.

To extract the first three digits, you could do something like the following:

df_zip['zip'].str[:3]

You’ll want to combine these two concepts, such that you store this information in a dictionary zip_dict, which stores the first three digits of the zip code as the key and the population of that 3-digit zip code as the value.

(If you’re stuck and/or to better understand how dictionaries work and how they apply to this concept, check the section materials, use google, and go to discussion sections!)
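
Combining the two hints above into one line:

# Sketch: group on the first 3 digits of each zip, sum the populations
zip_dict = df_zip.groupby(df_zip['zip'].str[:3])['population'].sum().to_dict()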

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(zip_dict, dict)
assert zip_dict['100'] == 1502501

3f) Masking the Zip Codes

In this part, you should write a for loop, updating the df_users dataframe.

Go through each user, and update their zip code, to Safe Harbor specifications:

  • If the user is from a zip code for which the “Geographic Subdivision” population is less than or equal to 20,000, change the zip code in df_users to ‘0’ (as a string)

  • Otherwise, the zip should be only the first 3 digits of the full zip code

  • Do all this by directly updating the zip column of the df_users DataFrame

Hints:

  1. This will be several lines of code: looping through the DataFrame, getting each zip code, checking the geographic subdivision’s population in zip_dict, and setting the zip code accordingly.

  2. Be very aware of your variable types when working with zip codes here.
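
A sketch of the loop; zip_dict.get with a default of 0 treats any zip prefix missing from zip_dict as a small subdivision, which errs on the side of masking:

# Sketch: mask each zip code per the Safe Harbor rules above
for i in df_users.index:
    zip3 = str(df_users.loc[i, 'zip'])[:3]
    if zip_dict.get(zip3, 0) <= 20000:
        df_users.loc[i, 'zip'] = '0'
    else:
        df_users.loc[i, 'zip'] = zip3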

# YOUR CODE HERE
raise NotImplementedError()
assert len(df_users) == 943
assert df_users.loc[671, 'zip'] == '687'
assert sum(df_users.zip == '0') > 0

3g) Save out the properly anonymized data to json file

Save out df_users as a json file, called real_anon_user_dat.json
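
A one-line sketch:

# Sketch: write the anonymized dataframe out as json
df_users.to_json('real_anon_user_dat.json')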

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(pd.read_json('real_anon_user_dat.json'), pd.DataFrame)

Complete!

Congrats, you’re done! The users’ identities are much more protected now.

Have a look back over your answers, and also make sure to Restart & Run All from the kernel menu to double check that everything is working properly. You can also use the ‘Validate’ button above, which runs your notebook from top to bottom and checks to ensure all assert statements pass silently. When you are ready, submit on datahub!