{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Statistical Comparisons\n", "\n", "Whenever we have data, we often want to use statistical analyses to explore, compare, and quantify our data. \n", "\n", "In this notebook, we will briefly introduce and explore some common statistical tests that can be applied to data. \n", "\n", "As with many of the topics in data analysis and machine learning, this tutorial focuses on introducing some related topics for data science and demonstrating their application in Python. It is outside the scope of these tutorials to systematically introduce and describe the topic at hand, which in this case is statistics. If the topics here are unfamiliar, we recommend you follow the links or look for other resources to learn more about them. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "from scipy.stats import norm" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Set random seed, for consistency when simulating data\n", "np.random.seed(21)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## On Causality\n", "\n", "Before we dive into particular statistical tests, a general reminder: though we would often like to understand the _causal structure_ of the data that we are interested in, this generally cannot be inferred directly from statistical tests themselves. \n", "\n", "In the following, we will explore some statistical tests for investigating if and when distributions of data are the same or different, and if and how related they are. These tests, by themselves, do not tell us about what causes what. 
Correlation is not causation.\n", "\n", "In the context of data science, this can be a limitation, as we are often using previously collected convenience datasets and observational datasets. Though we can explore the structure of the data, such datasets typically do not allow for causal interpretations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Correlations\n", "\n", "A common question we may be interested in is whether two datasets, or two features of data, are related to each other. \n", "\n", "If they are, we would also like to know _how_ related they are to each other. \n", "\n", "For this, we can calculate correlations between features. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "