Data Gathering

Data Gathering is the process of accessing data and collecting it together.

This notebook covers strategies for finding and gathering data.

If you want to start by working on data analyses (with provided data) you can move on to the next tutorials, and come back to this one later.

Data gathering can encompass many different strategies, including data collection, web scraping, accessing data from databases, and downloading data in bulk. Sometimes it even includes things like calling someone to ask if you can use some of their data, and asking them to send it over.

Where to get Data

There are lots of ways to get data, and lots of places to get it from. Typically, most of this data will be accessed through the internet, in one way or another, especially when pursuing independent research projects.

Institutional Access

If you are working with data as part of an institution, such as a company or research lab, the institution will typically have data it needs analyzed, which it collects in various ways. Keep in mind that even people working inside institutions, with access to local data, will often still seek to find and incorporate external datasets.

Data Repositories

Data repositories are databases from which you can download data. Some data repositories allow you to explore available datasets and download datasets in bulk. Others may also offer APIs, through which you can request specific data from particular databases.

Web Scraping

The web itself is full of unstructured data. Web scraping can be used to extract and collect data directly from websites.

Asking People for Data

Not all data is indexed or accessible on the web, at least not publicly. Sometimes finding data means figuring out whether any relevant data is available, where it might be, and then reaching out and asking people directly about data access. If there is some particular data you need, you can try to figure out who might have it, and get in touch to see if it might be available.

Data Gathering Skills

Depending on your gathering method, you will likely have to do some combination of the following:

  • Direct download data files from repositories

  • Query databases & use APIs to extract and collect data of interest

  • Ask people for data, and sometimes go pick it up with a hard drive

Ultimately, the goal is to collect and curate data files, hopefully structured, that you can read into Python.
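
For example, once you have a structured data file, reading it into Python can be quite simple. Here is a minimal sketch, assuming a hypothetical CSV file called 'my_data.csv' in the current directory:

import pandas as pd

# Load a structured data file into a dataframe ('my_data.csv' is a hypothetical example file)
df = pd.read_csv('my_data.csv')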

Definitions: Databases & Query Languages

Here, we will introduce some useful definitions you will likely encounter when exploring how to gather data.

Other than these definitions, we will not cover databases & query languages more in these tutorials.

A database is an organized collection of data. More formally, 'database' refers to a set of related data, and the way it is organized.
A query language is a language for operating on databases, for example to retrieve, and sometimes modify, the information they contain.
SQL (pronounced 'sequel') is a common query language used to interact with databases, and request data.
If you are interested, there is a useful introduction and tutorial to SQL here as well as some useful 'cheat sheets' here and here.
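
As a brief illustration, Python's built-in sqlite3 module can be used to run SQL queries on a local database. The following is a minimal sketch, assuming a hypothetical database file 'example.db' that contains a table called 'users':

import sqlite3

# Connect to a local SQLite database ('example.db' is a hypothetical example file)
connection = sqlite3.connect('example.db')
cursor = connection.cursor()

# Run an SQL query to select rows of interest from the hypothetical 'users' table
cursor.execute("SELECT name, age FROM users WHERE age > 21")
rows = cursor.fetchall()

# Close the connection when done
connection.close()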

Data Repositories

A Data Repository is basically just a place that data is stored. For our purposes, it is a place you can download data from.
There is a curated list of good data sources included in the project materials.

For our purposes, data repositories are places you can download data directly from, for example data.gov.
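
Downloading a data file from a repository can often be done directly from Python. As a minimal sketch (the URL here is a hypothetical placeholder for a real dataset link):

import requests

# Request a data file from a repository (this URL is a hypothetical placeholder)
response = requests.get('https://example.com/path/to/dataset.csv')

# Save the downloaded contents to a local file
with open('dataset.csv', 'wb') as f:
    f.write(response.content)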

Application Program Interfaces (APIs)

APIs are basically a way for software to talk to software - an API is an interface into an application / website / database designed to be used by software.
For a simple explanation of APIs go here or for a much broader, more technical, overview try here.
This list includes a collection of commonly used and available APIs.

APIs offer a lot of functionality - you can send requests to the application to do all kinds of actions. In fact, any application interface that is designed to be used programmatically is an API, including, for example, interfaces for using packages of code.

One of the many things that APIs do, and offer, is a way to query and access data from particular applications / databases. For example, there is an API for Google Maps that allows for programmatically querying the latitude & longitude positions of given addresses.

The benefit of using APIs for data gathering purposes is that they typically return data in nicely structured formats that are relatively easy to analyze.

Launching URL Requests from Python

In order to use APIs, and for other approaches to collecting data, it may be useful to launch URL requests from Python.

Note that by URL, we just mean a file or application that can be reached by a web address. Python can be used to organize and launch URL requests, triggering actions and collecting any returned data.

In practice, APIs are usually special URLs that return raw data, such as json or XML files, as opposed to the URLs we are more used to, which return web pages as html, rendered for human viewers. The key difference is that APIs return structured data files, whereas html files are typically unstructured (more on that later, with web scraping).

If you wish to use an API, try to find its documentation to see how to send requests to access whatever data you want.

API Example

For our example here, we will use the Github API. Note that the URL we use is api.github.com. This URL accesses the API, and will return structured data files, instead of the html that would be returned by the standard URL (github.com).

import pandas as pd

# We will use the `requests` library to launch URL requests from Python
import requests
# Request data from the Github API on a particular user
page = requests.get('https://api.github.com/users/tomdonoghue')
# In this case, the content we get back is a json file
page.content
b'{"login":"TomDonoghue","id":7727566,"node_id":"MDQ6VXNlcjc3Mjc1NjY=","avatar_url":"https://avatars0.githubusercontent.com/u/7727566?v=4","gravatar_id":"","url":"https://api.github.com/users/TomDonoghue","html_url":"https://github.com/TomDonoghue","followers_url":"https://api.github.com/users/TomDonoghue/followers","following_url":"https://api.github.com/users/TomDonoghue/following{/other_user}","gists_url":"https://api.github.com/users/TomDonoghue/gists{/gist_id}","starred_url":"https://api.github.com/users/TomDonoghue/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/TomDonoghue/subscriptions","organizations_url":"https://api.github.com/users/TomDonoghue/orgs","repos_url":"https://api.github.com/users/TomDonoghue/repos","events_url":"https://api.github.com/users/TomDonoghue/events{/privacy}","received_events_url":"https://api.github.com/users/TomDonoghue/received_events","type":"User","site_admin":false,"name":"Tom","company":"UC San Diego","blog":"https://tomdonoghue.github.io","location":"San Diego","email":null,"hireable":null,"bio":"Cognitive Science Grad Student @ UC San Diego working on analyzing electrical brain activity. Also teaching Python & Data Science. \\r\\n\\r\\n","twitter_username":null,"public_repos":13,"public_gists":0,"followers":97,"following":83,"created_at":"2014-05-28T20:20:48Z","updated_at":"2020-06-19T21:35:12Z"}'
# We can read in the json data with pandas
pd.read_json(page.content, typ='series')
login                                                        TomDonoghue
id                                                               7727566
node_id                                             MDQ6VXNlcjc3Mjc1NjY=
avatar_url             https://avatars0.githubusercontent.com/u/77275...
gravatar_id                                                             
url                             https://api.github.com/users/TomDonoghue
html_url                                  https://github.com/TomDonoghue
followers_url          https://api.github.com/users/TomDonoghue/follo...
following_url          https://api.github.com/users/TomDonoghue/follo...
gists_url              https://api.github.com/users/TomDonoghue/gists...
starred_url            https://api.github.com/users/TomDonoghue/starr...
subscriptions_url      https://api.github.com/users/TomDonoghue/subsc...
organizations_url          https://api.github.com/users/TomDonoghue/orgs
repos_url                 https://api.github.com/users/TomDonoghue/repos
events_url             https://api.github.com/users/TomDonoghue/event...
received_events_url    https://api.github.com/users/TomDonoghue/recei...
type                                                                User
site_admin                                                         False
name                                                                 Tom
company                                                     UC San Diego
blog                                       https://tomdonoghue.github.io
location                                                       San Diego
email                                                               None
hireable                                                            None
bio                    Cognitive Science Grad Student @ UC San Diego ...
twitter_username                                                    None
public_repos                                                          13
public_gists                                                           0
followers                                                             97
following                                                             83
created_at                                          2014-05-28T20:20:48Z
updated_at                                          2020-06-19T21:35:12Z
dtype: object

As we can see above, in a couple lines of code, we can collect a lot of structured data about a particular user.

If we wanted to do analyses of Github profiles and activity, we could use the Github API to collect information about a group of users, and then analyze and compare the collected data.
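
For example, a sketch of collecting data on multiple users might look like the following, using the `requests` and `pandas` imports from above (the usernames here are hypothetical placeholders):

# Collect profile data for a group of users (these usernames are hypothetical placeholders)
users = ['user1', 'user2', 'user3']
profiles = [pd.read_json(requests.get('https://api.github.com/users/' + user).content,
                         typ='series') for user in users]

# Combine the collected profiles into a dataframe for analysis
profiles_df = pd.DataFrame(profiles)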

Web Scraping

Web scraping is when you (programmatically) extract data from websites.
Wikipedia has a useful page on web scraping.

By web scraping, we typically mean something distinct from using the internet to access an API. Rather, web scraping refers to using code to systematically navigate the internet, and extract information of interest, from html or other available files. Note that in this case you are not interacting directly with a database, but simply exploring and collecting whatever is available on web pages.

Note that the following section uses the ‘BeautifulSoup’ module, which is not part of the standard Anaconda distribution.

If you do not have BeautifulSoup, and want to get it to run this section, you can uncomment the cell below, and run it, to install BeautifulSoup in your current Python environment. You only have to do this once.

#import sys
#!conda install --yes --prefix {sys.prefix} beautifulsoup4
# Import BeautifulSoup
from bs4 import BeautifulSoup
# Set the URL for the page we wish to scrape
site_url = 'https://en.wikipedia.org/wiki/Data_science'

# Launch the URL request, to get the page
page = requests.get(site_url)
# Print out the first 1000 characters of the scraped web page
page.content[0:1000]
b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Data science - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Data_science","wgTitle":"Data science","wgCurRevisionId":822535327,"wgRevisionId":822535327,"wgArticleId":35458904,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Use dmy dates from December 2012","Information science","Computer occupations","Computational fields of study","Data analysis"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","Ja'

Note that the source of the scraped web-page is a messy pile of HTML.

There is a lot of information in there, but with no clear organization. There is some structure in the page though, delineated by HTML tags - we just need to use them to parse out the data. We can do that with BeautifulSoup, which takes in messy documents like this, and parses them based on a specified format.

# Parse the webpage with Beautiful Soup, using a html parser
soup = BeautifulSoup(page.content, 'html.parser')
# With the parsed soup object, we can select particular segments of the web page

# Print out the page title
print('TITLE: \n')
print(soup.title)

# Print out the first p-tag
print('\nP-TAG:\n')
print(soup.find('p'))
TITLE: 

<title>Data science - Wikipedia</title>

P-TAG:

<p><b>Data science</b>, also known as <b>data-driven science</b>, is an interdisciplinary field of scientific methods, processes, and systems to extract <a href="/wiki/Knowledge" title="Knowledge">knowledge</a> or insights from <a href="/wiki/Data" title="Data">data</a> in various forms, either structured or unstructured,<sup class="reference" id="cite_ref-:0_1-0"><a href="#cite_note-:0-1">[1]</a></sup><sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup> similar to <a href="/wiki/Data_mining" title="Data mining">data mining</a>.</p>

From the soup object, you can explore the page in a more organized way, and start to extract particular components of interest.

Note that it is still ‘messy’ in other ways, in that there might or might not be a systematic structure to how the page is laid out, and it still might take a lot of work to extract the particular information you want from it.
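
For example, we can use `find_all` on the soup object to collect all elements of a particular type, such as the links on the page:

# Collect all link (a-tag) elements from the parsed page
links = soup.find_all('a')

# Extract the URL from each link that has one
urls = [link.get('href') for link in links if link.get('href')]

# Print out the first few extracted URLs
print(urls[0:5])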

APIs vs. Web Scraping

Web scraping is distinct from using an API, even though many APIs are accessed over the internet. With web scraping, you are (programmatically) navigating the web yourself, and extracting data of interest directly from web pages.

Note: Be aware that scraping data from websites (without using APIs) can often be an involved project in itself, taking a considerable amount of time and work to get the data you want.

Be aware that data presented on websites may not be well structured, and may not be in an organized format that lends itself to easy collection and analysis.

If you try scraping websites, you should also check to make sure you are allowed to scrape the data, and follow the website's terms of service.