Data Wrangling

'Data Wrangling' generally refers to transforming raw data into a usable form for your analyses of interest, including loading, aggregating, and formatting.

In this notebook, we will focus on loading different types of data files. Other aspects of ‘wrangling’ such as combining different datasets will be covered in future tutorials, and are explored in the assignments.

Note: Throughout this notebook, we will be using ! to run the shell command cat to print out the contents of example data files.

Python I/O

Let’s start with basic Python utilities for reading and loading data files.

Official Python documentation on input / output.
# Check out an example data file
!cat files/data.txt
First line of data
Second line of data
# First, explicitly open the file object for reading
file_obj = open('files/data.txt', 'r')

# You can then loop through the file object, grabbing each line of data
for line in file_obj:
    # Here we explicitly remove the new line marker at the end of each line (the '\n')
    print(line.strip('\n'))

# File objects then have to be closed when you are finished with them
file_obj.close()
First line of data
Second line of data

Since opening and closing files basically always go together, there is a shortcut that does both: the with keyword.

By using with, file objects will be opened, and then automatically closed at the end of the code block.

# Use 'with' keyword to open, read, and then close a file
with open('files/data.txt', 'r') as file_obj:
    for line in file_obj:
        print(line.strip('\n'))
First line of data
Second line of data
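
To confirm that with really does close the file, you can inspect the file object's closed attribute. The following is a minimal sketch. It writes its own temporary file (the name example_path is made up for this illustration) so that it does not depend on files/data.txt being present.

```python
import os
import tempfile

# Create a small temporary file so the example is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('First line of data\nSecond line of data\n')
    example_path = tmp.name  # hypothetical path, used only for this sketch

with open(example_path, 'r') as file_obj:
    lines = [line.strip('\n') for line in file_obj]
    closed_inside = file_obj.closed   # still open inside the block

closed_after = file_obj.closed        # closed once the block has ended

print(lines)
print(closed_inside, closed_after)

# Clean up the temporary file
os.remove(example_path)
```

The file is closed even if an error is raised inside the block, which is the main reason to prefer with over calling close() by hand.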

Using input / output functionality from standard library Python is a pretty ‘low level’ way to read data files. With this strategy, you often have to spell out the details of how files are organized and how to read them yourself. For example, in the simple example above, we had to deal with the new line character explicitly.
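
To give a sense of how quickly these details add up, here is a rough sketch of parsing comma-separated numbers by hand. The text in raw_text is a made-up stand-in for the contents of a data file, not one of the notebook's example files.

```python
# Hypothetical raw text, standing in for the contents of a small data file
raw_text = '1, 2, 3, 4\n5, 6, 7, 8\n\n9, 10, 11, 12\n'

rows = []
for line in raw_text.splitlines():
    # Skip blank lines, which would otherwise break the int conversion
    if not line.strip():
        continue
    # Split on the delimiter, strip stray whitespace, and convert each value
    rows.append([int(value.strip()) for value in line.split(',')])

print(rows)
```

Every one of these steps (line splitting, blank lines, whitespace, type conversion) is a decision we had to make ourselves.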

As long as you have reasonably well structured data files, using standardized file types, you can use higher-level functions that will take care of a lot of these details - loading data straight into pandas data objects, for example.

Pandas I/O

Pandas has a range of functions that will automatically read in whole files of standard file types into pandas objects.
Official Pandas documentation on input / output.
import pandas as pd
# Tab complete to check out all the read functions available
pd.read_

File types

There are many different file types in which data may be stored.

Here, we will start by examining CSV and JSON files.

CSV Files

'Comma Separated Value' files store data, separated by commas. Think of them like lists.
More information on CSV files from Wikipedia.
# Let's have a look at a csv file (printed out in plain text)
!cat files/data.csv
1, 2, 3, 4
5, 6, 7, 8
9, 10, 11, 12

CSV Files with Python

# Python has a module devoted to working with csv files
import csv
# We can read through our file with the csv module
with open('files/data.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        print(', '.join(row))
1,  2,  3,  4
5,  6,  7,  8
9,  10,  11,  12
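
Notice the doubled spaces in the output above: the csv module keeps the space that follows each comma as part of the field. A possible way to handle this, sketched below, is the reader's skipinitialspace option. The example reads from an in-memory string (csv_text, an assumed copy of the file contents shown earlier) rather than from files/data.csv, so it can run on its own.

```python
import csv
import io

# In-memory stand-in for files/data.csv (assumed to match the contents shown above)
csv_text = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n'

# Default behaviour: the space after each comma stays in the field
default_rows = list(csv.reader(io.StringIO(csv_text)))

# skipinitialspace=True drops whitespace immediately following the delimiter
clean_rows = list(csv.reader(io.StringIO(csv_text), skipinitialspace=True))

# The csv module returns strings, so convert to int explicitly
int_rows = [[int(value) for value in row] for row in clean_rows]

print(default_rows[0])
print(clean_rows[0])
print(int_rows)
```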

CSV Files with Pandas

# Pandas also has functions to directly load csv data
pd.read_csv?
# Let's read in our csv file
pd.read_csv('files/data.csv', header=None)
   0   1   2   3
0  1   2   3   4
1  5   6   7   8
2  9  10  11  12

As we can see, using Pandas saves us from having to do more work (write more code) to load the file.

JSON Files

JavaScript Object Notation files can store hierarchical key/value pairings. Think of them like dictionaries.
More information on JSON files from Wikipedia.
# Let's have a look at a json file (printed out in plain text)
!cat files/data.json
{
  "firstName": "John",
  "age": 53
}
# Think of json files as similar to dictionaries
#  Note that the age is a number in the json file, so we use an int here
d = {'firstName': 'John', 'age': 53}
print(d)
{'firstName': 'John', 'age': 53}

JSON Files with Python

# Python also has a module for dealing with json
import json
# Load a json file
with open('files/data.json') as dat_file:
    dat = json.load(dat_file)
# Check what data type this gets loaded as
print(type(dat))
<class 'dict'>
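
The values inside the loaded dictionary also get mapped to Python types (JSON numbers to int or float, strings to str, nested objects to dict, arrays to list). The sketch below illustrates this with json.loads, which parses a string rather than a file. The nested 'address' entry is a hypothetical addition for illustration and is not part of files/data.json.

```python
import json

# Hypothetical record: the example data plus a made-up nested 'address' object
json_text = '{"firstName": "John", "age": 53, "address": {"city": "Springfield", "zips": [1, 2]}}'

# json.loads parses a string; json.load (used above) parses a file object
record = json.loads(json_text)

# Numbers stay numbers, and nested objects become nested dictionaries
print(type(record['age']))
print(record['address']['city'])
print(type(record['address']['zips']))

# json.dumps goes the other way, turning the dictionary back into a string
round_trip = json.dumps(record, indent=2)
print(round_trip)
```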

JSON Files with Pandas

# Pandas also has support for reading in json files
pd.read_json?
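# You can read in json formatted strings with pandas
#  Recent pandas versions deprecate passing a literal json string, so we wrap it in StringIO
#  Note that here I am specifying to read it in as a pd.Series, as there is a single record of data
from io import StringIO
pd.read_json(StringIO('{ "first": "Alan", "place": "Manchester"}'), typ='series')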
first          Alan
place    Manchester
dtype: object
# Read in our json file with pandas
pd.read_json('files/data.json', typ='series')
firstName    John
age            53
dtype: object

Conclusion

As a general guideline, storing data in standardized file types and loading them with ‘higher-level’ tools such as Pandas makes data files easier to work with.