Welcome to the hands-on materials for Data Science in Practice.

This notebook will guide you through getting the tools you will need for working with these tutorials and assignments.


Throughout these tutorials, you will see colored ‘alert’ text:

Green alerts provide key information and definitions.
Blue alerts provide links out to further resources.

What do you need for these tutorials?


  • Working install of Python (>= 3.6), with the Anaconda distribution

    • If you are in the official class, datahub satisfies this requirement

  • Jupyter Notebooks

  • git and a GitHub account


These tutorials presume that you already have some basic knowledge of programming.

In particular, they assume knowledge of the Python programming language and standard library.

If you are somewhat unfamiliar with Python, you can follow the links in the Python notebook to catch up.

Computational Resources

The examples throughout these tutorials and in the assignments are not computationally heavy.

You should be able to run all these materials on any computer you have access to, assuming it will run the aforementioned tools.

Installing Python

  • If you are running code locally, we recommend you install a new version of Python with Anaconda, as described below

    • If you are in the official course, you can use datahub for everything you need

  • If you are on a Mac, you may already have a native installation of Python. This native installation may be older, will not include the extra packages that you will need for this class, and is best left untouched.

    • Downloading Anaconda will install a separate, independent install of Python, leaving your native install untouched.

  • Windows does not require Python natively and so it is not typically pre-installed.


The following is a series of tools that you will need for this class.

Anaconda is an open-source distribution of Python, designed for scientific computing, data science and machine learning.
The Anaconda website is here, with the download page here.

Anaconda itself is a distribution, meaning that it is a version of Python with a collection of packages that are curated and maintained together.

Using a pre-built distribution is useful, as it comes with the packages that you need for data science.

Anaconda also comes with conda, which is a package manager, allowing you to download, install, and manage other packages.

The Anaconda distribution includes all packages that are needed for these tutorials.
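If you want to confirm that particular packages are available in your installation, a minimal sketch along the following lines may help. The helper name `missing_packages` and the package names in `to_check` are assumptions chosen for illustration; this notebook does not list the exact packages the tutorials use, so substitute the ones you need.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    # find_spec returns None when a top-level package is not installed
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical list of packages to look for (not taken from the tutorials)
to_check = ["numpy", "pandas", "matplotlib"]

missing = missing_packages(to_check)
if missing:
    print("Not found:", ", ".join(missing))
else:
    print("All listed packages were found")
```

This only checks that a package can be located; it does not import it or check its version.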

Jupyter notebooks are a way to intermix code, outputs and plain text. They run in a web browser, and connect to a kernel to be able to execute code.
The official Jupyter website is available here.

Note that you do not need to download Jupyter separately, as it comes packaged with the Anaconda distribution.

Checking Your Python Version

You can check which installation of Python you are using, and which version it is.

Once you have installed Anaconda, the path that is printed should point to a Python inside an Anaconda folder.

The version number that is printed should also be 3.6 or greater.

# Check the installed version of Python
#   Note: these are shell commands, and `which` may not be available on Windows
!which python
!python --version
Python 3.7.7
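If the shell commands above are not available on your system, the same information can be read from inside Python itself. The following is a minimal sketch using only the standard library; the helper name `describe_python` is an assumption made for illustration, and the default minimum of 3.6 is taken from the requirement stated earlier in this notebook.

```python
import sys


def describe_python(min_version=(3, 6)):
    """Describe the running interpreter and whether it meets min_version."""
    version = tuple(sys.version_info[:3])
    return {
        # Path of the Python executable that is running this code
        "executable": sys.executable,
        # Version as a string, such as "major.minor.micro"
        "version": ".".join(str(part) for part in version),
        # Tuple comparison, so (3, 7, 7) >= (3, 6) is True
        "meets_minimum": version >= tuple(min_version),
    }


info = describe_python()
print("Executable:", info["executable"])
print("Version:", info["version"])
print("Meets minimum:", info["meets_minimum"])
```

Because this runs inside the notebook's kernel, it reports the Python that the notebook is actually using, which may differ from the one found by your shell.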

Git is a tool, a software package, for version control.
Install git, if you don't already have it.

GitHub is an online hosting service that can be used with git, and offers online tools to use git.
Create an account on GitHub.

Git & GitHub are not the same thing, though in practice they are commonly used together. Git is the tool that version controls your code and manages the copies stored on your computer, while GitHub hosts remote repositories that those copies can be synchronized with.

Note that while GitHub is a commercial service (owned by Microsoft), git is an open-source tool and can be used independently of GitHub.

# Check that you have git installed (which version doesn't really matter)
!git --version
git version 2.20.1 (Apple Git-117)
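The same check can be sketched in plain Python, which avoids relying on notebook shell syntax. This is one possible approach, assuming git would be found on your PATH if installed; the helper name `git_version` is hypothetical.

```python
import shutil
import subprocess


def git_version():
    """Return the output of `git --version`, or None if git is not found."""
    # shutil.which looks the command up on the PATH, on any platform
    git_path = shutil.which("git")
    if git_path is None:
        return None
    result = subprocess.run(
        [git_path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
    )
    return result.stdout.strip() or None


print(git_version() or "git was not found on your PATH")
```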

SourceTree is a free graphical user interface (GUI) for managing repositories with git & GitHub.
SourceTree is available here. You will need an account with Atlassian, who make SourceTree, but this is free.

You don’t need to use SourceTree (or any other GUI) if you know, or want to learn, how to use git from the command line.


Environments are isolated, independent installations of a programming language and a group of packages that don't interfere with each other.
Anaconda has detailed instructions on using environments available here.

You do not need to use environments, but you may find them useful if you want or need to maintain multiple different versions of Python.

If you want to use an environment, and already have conda, you can run this command from the command line:

$ conda create --name *envname* python=3.7 anaconda

^ Replace ‘envname’ with the name you want to give this environment.

This will create a new environment, with Python 3.7 and the Anaconda distribution.

You will then need to activate this environment every time you want to use it.

To activate your environment:
$ conda activate *envname*

To deactivate your environment:
$ conda deactivate
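To see which environment your code is actually running in, you can inspect it from Python. The sketch below assumes that conda sets the `CONDA_DEFAULT_ENV` environment variable when an environment is activated; the helper name `active_environment` is hypothetical. Note that a notebook kernel may have been started from a different environment than your current terminal.

```python
import os
import sys


def active_environment():
    """Report the active conda environment name (if any) and the Python prefix."""
    return {
        # Set by `conda activate`; None if no conda environment is active
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # Directory of the Python installation that is running this code
        "prefix": sys.prefix,
    }


env = active_environment()
print("Conda environment:", env["conda_env"] or "(none detected)")
print("Python prefix:", env["prefix"])
```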