{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Statistical Comparisons\n",
    "\n",
    "Whenever we have data, we often want to use statistical analyses to explore, compare, and quantify our data. \n",
    "\n",
    "In this notebook, we will briefly introduce and explore some common statistical tests that can be applied to data. \n",
    "\n",
    "As with many of the topics in data analysis and machine learning, this tutorial is focused on introducing some related topics for data science, and demonstrated their application in Python, but it is out of scope of these tutorials to systematically introduce and describe the topic at hand, which in this case is statistics. If the topics here are unfamiliar, we recommend you follow the links or look for other resources to learn more about these topics. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "from scipy.stats import norm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set random seed, for consistency simulating data\n",
    "np.random.seed(21)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## On Causality\n",
    "\n",
    "Before we dive into particular statistical tests, just a general reminder that though we would often like to understand the _causal structure_ of the data that we are interested in, this is generally not directly interpretable from statistical tests themselves. \n",
    "\n",
    "In the follow, we will explore some statistical tests for investigating if and when distributions of data are the same or different, and if and how related they are. These tests, by themselves, do not tell us about what causes what. Correlation is not causation.\n",
    "\n",
    "In the context of data science, this can be a limitation as we are often using previously collected datasets of convenience and observational datasets collected. Though we can explore the structure of the data, such datasets typically do not allow for causal interpretations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Correlations\n",
    "\n",
    "A common question we may be interested in is if two datasets, or two features of data, are related to each other. \n",
    "\n",
    "If they, we would also like to now _how_ related they are to each other. \n",
    "\n",
    "For this, we can calculate correlations between features. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "Correlations are statistical dependencies or relationships between variables. \n",
    "</div>\n",
    "\n",
    "<div class=\"alert alert-info\">\n",
    "Correlation on \n",
    "<a href=https://en.wikipedia.org/wiki/Correlation_and_dependence class=\"alert-link\">wikipedia</a>, \n",
    "including for the \n",
    "<a href=https://en.wikipedia.org/wiki/Pearson_correlation_coefficient class=\"alert-link\">pearson</a>, \n",
    "and \n",
    "<a href=https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient class=\"alert-link\">spearman</a>\n",
    "correlation measures.     \n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.stats import pearsonr, spearmanr"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Simulate Data\n",
    "\n",
    "First, let's simulate some data. \n",
    "\n",
    "For this example, we will simulate two arrays of data that do have a relationship to each other. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Settings for simulated data\n",
    "corr = 0.75\n",
    "covs = [[1, corr], [corr, 1]]\n",
    "means = [0, 0]\n",
    "\n",
    "# Generate the data\n",
    "d1, d2 = np.random.multivariate_normal(means, covs, 1000).T"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Calculate Correlations\n",
    "\n",
    "Next, we can calculate the correlation coefficient between our data arrays, using the `pearsonr` function from `scipy.stats`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate a pearson correlation between two arrays of data\n",
    "r_val, p_val = pearsonr(d1, d2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The correlation coefficient is 0.7732 with a p-value of 0.00.\n"
     ]
    }
   ],
   "source": [
    "print(\"The correlation coefficient is {:1.4f} with a p-value of {:1.2f}.\".format(r_val, p_val))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case, we have a high correlation coefficient, with a very low p-value. \n",
    "\n",
    "This suggests our data are strongly correlated!\n",
    "\n",
    "In this case, since we simulated the data, we know that this is a good estimation of the relationship between the data."
   ]
  },
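  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the measure more concrete, here is a minimal sketch of computing the Pearson correlation coefficient by hand, as the covariance of two arrays divided by the product of their standard deviations. The `demo_` prefixed names and the simulated data are assumptions made for illustration. The data is drawn from a separate random generator, so the results in the rest of the notebook should be unaffected. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from scipy.stats import pearsonr\n",
    "\n",
    "# Hypothetical demo data, drawn from a local generator so that the\n",
    "# global random state used elsewhere in the notebook is left untouched\n",
    "demo_rng = np.random.default_rng(0)\n",
    "demo_x = demo_rng.normal(size=500)\n",
    "demo_y = 0.75 * demo_x + np.sqrt(1 - 0.75 ** 2) * demo_rng.normal(size=500)\n",
    "\n",
    "# Pearson's r: the covariance divided by the product of the standard deviations\n",
    "demo_cov = np.mean((demo_x - demo_x.mean()) * (demo_y - demo_y.mean()))\n",
    "demo_r_manual = demo_cov / (demo_x.std() * demo_y.std())\n",
    "\n",
    "# Compare against the scipy implementation\n",
    "demo_r_scipy, _ = pearsonr(demo_x, demo_y)\n",
    "print('Manual r: {:1.4f}, scipy r: {:1.4f}'.format(demo_r_manual, demo_r_scipy))"
   ]
  },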
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Rank Correlations\n",
    "\n",
    "One thing to keep in mind is that the `pearson` correlation used above assumes that both data distributions are normally distributed.\n",
    "\n",
    "These assumptions should also be tested in data to be analyzed. \n",
    "\n",
    "Sometimes these assumptions will not be met. In that case, one option is to a different kind of correlation example. For example, the `spearman` correlation is a rank correlation that does not have the same assumptions as pearson."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The correlation coefficient is 0.7595 with a p-value of 0.00.\n"
     ]
    }
   ],
   "source": [
    "# Calculate the spearman rank correlation between our data\n",
    "r_val, p_val = spearmanr(d1, d2)\n",
    "\n",
    "# Check the results of the spearman correlation\n",
    "print(\"The correlation coefficient is {:1.4f} with a p-value of {:1.2f}.\".format(r_val, p_val))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case, the measured values for `pearson` and `spearman` correlations are about the same, since both are appropriate for the properties of this data."
   ]
  },
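  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two measures can diverge when a relationship is monotonic but not linear. The following sketch uses a hypothetical, noise-free exponential relationship (the `mono_` names are assumptions for illustration) to show this, and also checks that the rank correlation matches a `pearson` correlation computed on the ranks of the data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from scipy.stats import pearsonr, spearmanr, rankdata\n",
    "\n",
    "# Hypothetical data with a monotonic but strongly non-linear relationship\n",
    "mono_x = np.linspace(0, 5, 50)\n",
    "mono_y = np.exp(mono_x)\n",
    "\n",
    "# Pearson measures linear association, spearman measures monotonic association\n",
    "mono_pearson, _ = pearsonr(mono_x, mono_y)\n",
    "mono_spearman, _ = spearmanr(mono_x, mono_y)\n",
    "\n",
    "# Spearman is equivalent to pearson computed on the ranks of the data\n",
    "mono_rank_r, _ = pearsonr(rankdata(mono_x), rankdata(mono_y))\n",
    "\n",
    "print('Pearson: {:1.4f}'.format(mono_pearson))\n",
    "print('Spearman: {:1.4f}'.format(mono_spearman))\n",
    "print('Pearson on ranks: {:1.4f}'.format(mono_rank_r))"
   ]
  },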
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## T-Tests\n",
    "\n",
    "Another question we might often want to ask about data is to check and detect when there is a significant difference between collections of data. \n",
    "\n",
    "For example, we might want to analyze if there is a significant different in measured feature values between some groups of interest. \n",
    "\n",
    "To do so, we can use t-tests. \n",
    "\n",
    "There are different variants of t-test, including:\n",
    "- one sample t-test\n",
    "    - test the mean of one group of data\n",
    "- independent samples t-test\n",
    "    - test for a difference of means between two independent samples of data\n",
    "- related samples t-test\n",
    "    - test for a difference of means between two related samples of data\n",
    "    \n",
    "For this example, we will explore using the independent samples t-test. \n",
    "\n",
    "Functions for the other versions are also available in `scipy.stats`. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "T-tests are statistical hypothesis tests for examining mean values and differences of groups of data. \n",
    "</div>\n",
    "\n",
    "<div class=\"alert alert-info\">\n",
    "T-tests on\n",
    "<a href=https://en.wikipedia.org/wiki/Student%27s_t-test class=\"alert-link\">wikipedia</a>. \n",
    "</div>"
   ]
  },
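  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a brief aside, the one sample and related samples variants listed above can be sketched with `ttest_1samp` and `ttest_rel` from `scipy.stats`. The simulated data and variable names here are assumptions made for illustration, and use a separate random generator so that the main example below is unaffected. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from scipy.stats import ttest_1samp, ttest_rel\n",
    "\n",
    "# Local generator, so the global random state is left untouched\n",
    "tt_rng = np.random.default_rng(1)\n",
    "\n",
    "# One sample t-test: is the mean of a single group different from 0?\n",
    "one_group = tt_rng.normal(loc=0.5, scale=1, size=100)\n",
    "t_one, p_one = ttest_1samp(one_group, popmean=0)\n",
    "\n",
    "# Related samples t-test: the same units measured twice\n",
    "pre_vals = tt_rng.normal(loc=0, scale=1, size=100)\n",
    "post_vals = pre_vals + tt_rng.normal(loc=0.5, scale=0.5, size=100)\n",
    "t_rel, p_rel = ttest_rel(pre_vals, post_vals)\n",
    "\n",
    "print('One sample: \\t t={:1.4f}, p={:1.2e}'.format(t_one, p_one))\n",
    "print('Related samples: t={:1.4f}, p={:1.2e}'.format(t_rel, p_rel))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the related samples test is equivalent to a one sample test on the pairwise differences, which is why it is appropriate when each value in one sample has a natural partner in the other."
   ]
  },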
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.stats import ttest_ind"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Simulate Data\n",
    "\n",
    "First, let's simulate some data. \n",
    "\n",
    "For this example, we will simulate two samples of normally distributed data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Settings for data simulation\n",
    "n_samples = 250\n",
    "\n",
    "# Simulate some data\n",
    "d1 = norm.rvs(loc=0.5, scale=1, size=n_samples)\n",
    "d2 = norm.rvs(loc=0.75, scale=1, size=n_samples)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXAAAAD4CAYAAAD1jb0+AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAWR0lEQVR4nO3dfXBV9Z3H8feXEAhFWxSTLCXSgBVaFYU1VZCOrSjWVkexBbe1UlQctrN1R0BHsWUdFWekPsXO0HYn40MBbStqBdqObVIWZXYKVrCUh4KoGYpRNklT8amEx+/+kYNivLnnJLn33PzC5zXD3Jxzv/eezzDw5fA7v/M75u6IiEh4+hQ6gIiIdI0auIhIoNTARUQCpQYuIhIoNXARkUD1TfNgJ5xwgldWVqZ5SOnl1q9vez3zzCxF/4iKjs9WJNJzrV+//u/uXtp+v6U5jbCqqsrXrVuX2vGk9zNre836x/jnUdGVmjIrYTKz9e5e1X6/hlBERAKlBi4iEig1cBGRQKV6EVNEJF/2799PQ0MDra2thY7SZSUlJVRUVFBcXJyoXg1cgpboGrwuXh4VGhoaOPbYY6msrMQOX90OiLvT0tJCQ0MDw4cPT/QZDaGISK/Q2trK4MGDg2zeAGbG4MGDO/U/CDVwEek1Qm3eh3U2vxq4BO3MM2Nu4gF49sy2XyK9jMbAJWgvvZSg6K0kRdLbVNdtz+n3zZ40MramqKiI0aNHs3//fvr27cv06dOZNWsWffr0oaWlhSlTpvDiiy9y9dVXs3Dhwm5nUgOXHqXzf+lGxn5udoLvTvKXUyTOgAED2LBhAwBNTU1ceeWVvP3229xxxx2UlJQwf/58Nm/ezObNm3NyPA2hiIjkQVlZGTU1NSxcuBB3Z+DAgXzxi1+kpKQkZ8dQAxcRyZMRI0Zw6NAhmpqa8vL9auAiInmUzwUD1cBFRPKkvr6eoqIiysrK8vL9uogpQRv31d2xNZtKrkghichHNTc3893vfpfrr78+b/PT1cAlaFfMjh9b/MOx81NIIj1NIWYW7dmzhzFjxnwwjXDatGnMmTPng/crKyt555132LdvH8uWLaO2tpZTTjmly8dTAxcRyZGDBw9mfX/Hjh05PV7sGLiZjTKzDUf8esfMZpnZ8WZWZ2avRK/H5TSZSAKvb+/P69v7Z60p27+Zsv25mXcr0pPENnB3f9ndx7j7GOBM4J/AM8BcYKW7nwysjLZFUlV9/Weovv4zWWu+vfsbfHv3N1JKJJKezs5COR94zd3/BlwGLIr2LwIm5zKYiIhk19kG/k3gF9HP5e6+CyB6zThPxsxmmtk6M1vX3Nzc9aQiIvIRiRu4mfUDLgWe7MwB3L3G3avcvaq0tLSz+UREpAOdOQP/KvCSuzdG241mNgQges3PvaIiIpJRZ6YRfosPh08AVgDTgQXR6/Ic5hIR6Z5Vd+f2+867NbYk23KydXV1zJ07l3379tGvXz/uvfdeJk6c2K1IiRq4mX0CmAT8+xG7FwBLzWwGsBOY2q0kIiKBy7ac7AknnMCvf/1rPv3pT7N582a+8pWv8MYbb3TreIkauLv/Exjcbl8LbbNSRApm9sK/xdY8PujpFJKIfNTh5WS/8IUvcPvttzN27NgP3jv11FNpbW1l79699O+f/T6GbHQnpgTtxJF7Y2uaik9LIYnIxx25nGx5efkH+59++mnGjh3breYNauAiInnVfjnZLVu2cMstt1BbW9vt79ZyshK0pdVlLK3OvlTnBe/+Fxe8+18pJRL5UPvlZBsaGrj88stZvHgxJ510Ure/Xw1cgrb22UGsfXZQ1prRrUsZ3bo0pUQibdovJ7t7924uvvhi7r77biZMmJCTY2gIRUR6pwTT/nIt23KyCxcu5NVXX2X+/PnMn9+2xHFtbW23HvagBi4ikiPZlpOdN28e8+bNy+nxNIQiIhIoNXARkUCpgYtIr5HP
J8CnobP5NQYuQav4bGtsTWPfU1NIIoVWUlJCS0sLgwcPzttDhPPJ3WlpaaGkpCTxZ9TAJWhzfrIztubnx/0qhSRSaBUVFTQ0NBDycwdKSkqoqKhIXK8GLiK9QnFxMcOHDy90jFRpDFxEJFBq4BK0OReOZM6FI7PWzG4exezmUSklEkmPGriISKDUwEVEAqUGLiISKDVwEZFAqYGLiAQqUQM3s0Fm9pSZbTOzrWY23syON7M6M3slej0u32FFRORDSW/k+RHwO3efYmb9gE8A3wdWuvsCM5sLzAVuyVNOkYym3tAYW/OHY+5MIYlI+mIbuJl9EjgXuBrA3fcB+8zsMuDLUdki4DnUwCVl4y9+O7Zm04B/SyGJSPqSnIGPAJqBR83sDGA9cANQ7u67ANx9l5llfKyEmc0EZgIMGzYsJ6FFumLczpqO31w1ODcHKcBTYOTolWQMvC/wr8BP3X0s8D5twyWJuHuNu1e5e1VpaWkXY4pktua3n2LNbz+VtWb0nicYveeJlBKJpCfJGXgD0ODuL0TbT9HWwBvNbEh09j0EaMpXSJGOPPmjciD7UMoF790GwBpuTCWTSFpiz8Dd/f+A183s8GIS5wN/BVYA06N904HleUkoIiIZJZ2F8p/A49EMlHrgGtqa/1IzmwHsBKbmJ6KIiGSSqIG7+wagKsNb5+c2joiIJKU7MUVEAqUGLiISKDVwEZFA6ZmYErQHarfH1lSXvgzE3MgjEiCdgYuIBEoNXEQkUBpCkaA98B9t6+vM+cnODmuufOvrANRzUSqZRNKiBi5Ba3i1JLam/MAWQA1ceh81cOk5Vt3NuJ0tnfzQfUDMBcoBXY8k0pNpDFxEJFBq4CIigVIDFxEJlBq4iEigdBFTgnbpuWtjaxoPjE4hiUj61MAlaHO/81RsTf3+C1NIIpI+DaGIiARKDVyCtm3HULbtGJq1ZqA1MtAaU0okkh4NoUjQrr1rNgB/fOimDmtOL3kMgDV79FBj6V10Bi4iEqhEZ+BmtgN4FzgIHHD3KjM7HngCqAR2AFe4+1v5iSkiIu115gz8PHcf4+6HH248F1jp7icDK6NtERFJSXeGUC4DFkU/LwImdz+OiIgklbSBO1BrZuvNbGa0r9zddwFEr2WZPmhmM81snZmta25u7n5iEREBks9CmeDub5pZGVBnZtuSHsDda4AagKqqKu9CRhERySBRA3f3N6PXJjN7BjgLaDSzIe6+y8yGAE15zCmS0SPzqmNrNrZelUISkfTFNnAzGwj0cfd3o58vBO4EVgDTgQXR6/J8BpX0VNfFP+k9Hzr/MAf4XOUbsTXve3lszZr6zh87k7UHOv97N3vSyJwcW44+Sc7Ay4FnzOxw/c/d/Xdm9iKw1MxmADuBqfmLKSIi7cU2cHevB87IsL8FOD8foUSSWrB4CpB9UasRxbWAFrWS3kd3YkrQVqwex4rV47LWlPfdRHnfTSklEkmPGriISKDUwEVEAqUGLiISKDVwEZFAqYGLiARKD3SQoI0a1hBb896hjMv0iARPDVyC9uhtD8bWbNo7LYUkIunTEIqISKDUwEVEAqUGLkE757r7OOe6+7LWjB9wP+MH3J9SIpH0qIGLiARKDVxEJFBq4CIigVIDFxEJlBq4iEig1MBFRAKlOzElaDdPezK25rV9k1JIIpI+NXAJ2uQvvRBb03Tw9BSSiKQv8RCKmRWZ2Z/N7DfR9nAze8HMXjGzJ8ysX/5iiohIe50ZA78B2HrE9g+Banc/GXgLmJHLYCJJLHv+bJY9f3bWmrKijZQVbUwpkUh6EjVwM6sALgYeirYNmAgcfhT4ImByPgKKZHPPkqncs2Rq1pqT+tVxUr+6lBKJpCfpGfiDwM3AoWh7MLDb3Q9E2w3A0EwfNLOZZrbOzNY1Nzd3K6yIiHwotoGb2SVAk7uvP3J3hlLP9Hl3r3H3KnevKi0t7WJMERFpL8kslAnApWb2NaAE+CRtZ+SDzKxvdBZeAbyZv5gi
ItJebAN391uBWwHM7MvATe7+bTN7EpgC/BKYDizPY04psHE7awodQUTa6c6dmLcAc8zsVdrGxB/OTSQREUmiUzfyuPtzwHPRz/XAWbmPJCIiSehOTAnaHx+6KbZmzZ4bU0gikj4tZiUiEig1cBGRQKmBS9CuuXMW19w5K2vN6P5LGN1/SUqJRNKjMXAJ2ss7K2JrjunTlEISkfTpDFxEJFBq4CIigVIDFxEJlMbARXKoS0sOrBrcufrzbu38MaRX0hm4iEigdAYuQbv03LWxNY0HRqeQRCR9auAStLnfeSq2pn7/hSkkEUmfhlBERAKlBi5B27ZjKNt2ZHya3wcGWiMDrTGlRCLp0RCKBO3au2YD2VclPL3kMUCrEkrvozNwEZFAqYGLiARKDVxEJFBq4CIigYpt4GZWYmZ/MrO/mNkWM7sj2j/czF4ws1fM7Akz65f/uCIicliSM/C9wER3PwMYA1xkZuOAHwLV7n4y8BYwI38xRUSkvdhphO7uwHvRZnH0y4GJwJXR/kXA7cBPcx9RpGOPzKuOrdnYelUKSUTSl2geuJkVAeuBzwI/Bl4Ddrv7gaikAch4N4WZzQRmAgwbNqy7eUU+4nOVb8TWvO/lKSQRSV+ii5juftDdxwAVwFnA5zOVdfDZGnevcveq0tLSricVEZGP6NQsFHffDTwHjAMGmdnhM/gK4M3cRhOJt2DxFBYsnpK1ZkRxLSOKa1NKJJKeJLNQSs1sUPTzAOACYCuwCjj8N2c6sDxfIUU6smL1OFasHpe1przvJsr7bkopkUh6koyBDwEWRePgfYCl7v4bM/sr8Eszuwv4M/BwHnOKiEg7SWahbATGZthfT9t4uIiIFIDuxBQRCZQauIhIoLQeuEiBralv6VT92gPbc3Lc2ZNG5uR7pHDUwCVoo4Y1xNa8d6gshSQi6VMDl6A9etuDsTWb9k5LIYlI+jQGLiISKDVwEZFAqYFL0M657j7Oue6+rDXjB9zP+AH3p5RIJD0aA+/BqutyM9tARHonnYGLiARKDVxEJFBq4CIigVIDFxEJlBq4iEigNAtFgnbztCdja17bNymFJCLpUwOXoE3+0guxNU0HT08hiUj6NIQiIhIoNXAJ2rLnz2bZ82dnrSkr2khZ0caUEomkR0MoErR7lkwFsg+lnNSvDoCmPRpKkd4lyVPpTzSzVWa21cy2mNkN0f7jzazOzF6JXo/Lf1wRETksyRDKAeBGd/88MA74npmdAswFVrr7ycDKaFtERFIS28DdfZe7vxT9/C6wFRgKXAYsisoWAZPzFVJERD6uUxcxzawSGAu8AJS7+y5oa/JAxudWmdlMM1tnZuuam5u7l1ZERD6QuIGb2THA08Asd38n6efcvcbdq9y9qrS0tCsZRUQkg0QN3MyKaWvej7v7r6LdjWY2JHp/CNCUn4giIpJJ7DRCMzPgYWCruz9wxFsrgOnAguh1eV4SimTxx4duiq1Zs+fGFJKIpC/JPPAJwDRgk5ltiPZ9n7bGvdTMZgA7gan5iSgiIpnENnB3/1/AOnj7/NzGka4Yt7Om0BFEpAB0K70E7Zo7Z3HNnbOy1ozuv4TR/ZeklEgkPbqVXoL28s6K2Jpj+uj6uvROOgMXEQmUGriISKDUwEVEAqUGLiISKDVwEZFAaRaKBO3Sc9fG1jQeGJ1CEpH0qYFL0OZ+56nYmvr9F6aQRCR9GkIREQmUGrgEbduOoWzbMTRrzUBrZKA1ppRIJD0aQpGgXXvXbCD7qoSnlzwG9J5VCXO29s2qwR2/d96tuTmG5JXOwEVEAqUz8BjVddsLHUFEJCOdgYuIBEoNXEQkUBpCEZGPW3V3fr9fF0lzQmfgIiKB0hm4BO2RedWxNRtbr0ohiUj6kjyV/hHgEqDJ3U+L9h0PPAFUAjuAK9z9rfzFDJeeV5lfn6t8I7bmfS9PIYlI+pIMofwMuKjdvrnASnc/GVgZbYuISIpiG7i7rwb+0W73ZcCi6OdFwOQc5xJJ
ZMHiKSxYPCVrzYjiWkYU16aUSCQ9Xb2IWe7uuwCi17KOCs1sppmtM7N1zc3NXTycSGYrVo9jxepxWWvK+26ivO+mlBKJpCfvs1Dcvcbdq9y9qrS0NN+HExE5anS1gTea2RCA6LUpd5FERCSJrk4jXAFMBxZEr8tzlkhEUrGmvqVgxx5/XsEO3avEnoGb2S+ANcAoM2swsxm0Ne5JZvYKMCnaFhGRFMWegbv7tzp46/wcZxERkU7QnZgStFHDGmJr3jvU4SQpkaCpgUvQHr3twdiaTXunpZBEJH1azEpEJFBq4CIigVIDl6Cdc919nHPdfVlrxg+4n/ED7k8pkUh61MBFRAKlBi4iEig1cBGRQKmBi4gESg1cRCRQauAiIoHSnZgStJunPRlb89q+SSkkEUmfGrgEbfKXXoitaTp4egpJRNIXTAOvrtue8+9M8sT47A/rEpEuWXV3/o9x3q35P0aBaQxcgrbs+bNZ9vzZWWvKijZSVrQxpUQi6QnmDFwkk3uWTAWyD6Wc1K8OgKY9GkqR3kVn4CIigVIDFxEJlBq4iEigujUGbmYXAT8CioCH3F0PNxaRHisfs9mSmD1pZF6+t8tn4GZWBPwY+CpwCvAtMzslV8FERCS77gyhnAW86u717r4P+CVwWW5iiYhIHHP3rn3QbApwkbtfF21PA8529+vb1c0EZkabo4CXux43Z04A/l7oEJ2kzPkXWl4IL3NoeaFnZP6Mu5e239mdMXDLsO9j/xq4ew0Qf8tjisxsnbtXFTpHZyhz/oWWF8LLHFpe6NmZuzOE0gCceMR2BfBm9+KIiEhS3WngLwInm9lwM+sHfBNYkZtYIiISp8tDKO5+wMyuB35P2zTCR9x9S86S5VePGtJJSJnzL7S8EF7m0PJCD87c5YuYIiJSWLoTU0QkUGrgIiKBOmobuJnda2bbzGyjmT1jZoMKnSmOmU01sy1mdsjMeuS0JmhbYsHMXjazV81sbqHzxDGzR8ysycw2FzpLEmZ2opmtMrOt0Z+HGwqdKY6ZlZjZn8zsL1HmOwqdKQkzKzKzP5vZbwqdJZOjtoEDdcBp7n46sB0I4fEdm4GvA6sLHaQjgS6x8DPgokKH6IQDwI3u/nnaHhr1vQB+j/cCE939DGAMcJGZhfDAqxuArYUO0ZGjtoG7e627H4g219I2j71Hc/et7t4T7mTNJrglFtx9NfCPQudIyt13uftL0c/v0tZghhY2VXbe5r1oszj61aNnUJhZBXAx8FChs3TkqG3g7VwLPFvoEL3EUOD1I7Yb6OHNJWRmVgmMBeKf7lxg0XDEBqAJqHP3np75QeBm4FChg3SkVz9Szcz+APxLhrd+4O7Lo5of0PZf0sfTzNaRJJl7uERLLEj3mdkxwNPALHd/p9B54rj7QWBMdL3pGTM7zd175HUHM7sEaHL39Wb25ULn6UivbuDufkG2981sOnAJcL73kAnxcZkDoCUWUmBmxbQ178fd/VeFztMZ7r7bzJ6j7bpDj2zgwATgUjP7GlACfNLMHnP3qwqc6yOO2iGU6GEUtwCXuvs/C52nF9ESC3lmZgY8DGx19wcKnScJMys9PNPLzAYAFwDbCpuqY+5+q7tXuHslbX+G/6enNW84ihs4sBA4Fqgzsw1m9t+FDhTHzC43swZgPPBbM/t9oTO1F10YPrzEwlZgaU9fYsHMfgGsAUaZWYOZzSh0phgTgGnAxOjP7oboTLEnGwKsMrONtP0jX+fuPXJqXkh0K72ISKCO5jNwEZGgqYGLiARKDVxEJFBq4CIigVIDFxEJlBq4iEig1MBFRAL1/0yJ7PORGPE8AAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Visualize our data comparison\n",
    "plt.hist(d1, alpha=0.5, label='D1');\n",
    "plt.axvline(np.mean(d1), linestyle='--', linewidth=2, color='blue')\n",
    "plt.hist(d2, alpha=0.5, label='D2');\n",
    "plt.axvline(np.mean(d2), linestyle='--', linewidth=2, color='orange')\n",
    "plt.legend();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Calculate T-Tests\n",
    "\n",
    "Now that we have some data, let's use a t-tests to statistically compare the two groups of data. \n",
    "\n",
    "For this example, we will test whether the two distributions have significantly different means. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run independent samples t-test\n",
    "t_val, p_val = ttest_ind(d1, d2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "T-Test comparison of D1 & D2:\n",
      "\tT-value \t -2.2502\n",
      "\tP-value \t 2.49e-02\n"
     ]
    }
   ],
   "source": [
    "# Check the results of the t-test\n",
    "print('T-Test comparison of D1 & D2:')\n",
    "print('\\tT-value \\t {:1.4f}'.format(t_val))\n",
    "print('\\tP-value \\t {:1.2e}'.format(p_val))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case, the t-test shows that there is a significant difference in the mean of the two arrays of data!"
   ]
  },
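  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what the independent samples t-test computes, here is a minimal sketch that calculates the t-statistic and two-sided p-value by hand, using the pooled-variance form that corresponds to the default settings of `ttest_ind`. The group names and simulated data are assumptions for illustration, drawn from a separate random generator. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from scipy.stats import ttest_ind, t as t_dist\n",
    "\n",
    "# Hypothetical groups, simulated with a local generator\n",
    "man_rng = np.random.default_rng(2)\n",
    "grp_a = man_rng.normal(loc=0.5, scale=1, size=250)\n",
    "grp_b = man_rng.normal(loc=0.75, scale=1, size=250)\n",
    "n_a, n_b = len(grp_a), len(grp_b)\n",
    "\n",
    "# Pooled variance: the sample variances weighted by their degrees of freedom\n",
    "pooled_var = ((n_a - 1) * grp_a.var(ddof=1) + (n_b - 1) * grp_b.var(ddof=1)) / (n_a + n_b - 2)\n",
    "\n",
    "# t-statistic: the difference of means divided by its standard error\n",
    "t_manual = (grp_a.mean() - grp_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))\n",
    "\n",
    "# Two-sided p-value from a t distribution with n_a + n_b - 2 degrees of freedom\n",
    "p_manual = 2 * t_dist.sf(np.abs(t_manual), df=n_a + n_b - 2)\n",
    "\n",
    "# Compare against the scipy implementation\n",
    "t_scipy, p_scipy = ttest_ind(grp_a, grp_b)\n",
    "print('Manual: t={:1.4f}, p={:1.2e}'.format(t_manual, p_manual))\n",
    "print('Scipy:  t={:1.4f}, p={:1.2e}'.format(t_scipy, p_scipy))"
   ]
  },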
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Assumptions of T-Tests\n",
    "\n",
    "Note, again, that t-tests assume normally distributed data. This is again a property of the data that should be examined before applying statistical tests. If this assumption is not met, other approaches for comparing the data may be needed."
   ]
  },
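  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As one possible way to approach this, the sketch below checks normality with the Shapiro-Wilk test, and then applies two alternatives available in `scipy.stats`: Welch's t-test, which does not assume equal variances, and the Mann-Whitney U test, a rank-based test that does not assume normality. The skewed example data is an assumption made for illustration, and the choice of checks and tests is one option among several rather than a fixed recipe. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from scipy.stats import shapiro, ttest_ind, mannwhitneyu\n",
    "\n",
    "# Hypothetical skewed (non-normal) data, simulated with a local generator\n",
    "chk_rng = np.random.default_rng(3)\n",
    "skew_a = chk_rng.exponential(scale=1.0, size=200)\n",
    "skew_b = chk_rng.exponential(scale=1.5, size=200)\n",
    "\n",
    "# Shapiro-Wilk test: a small p-value suggests the data are not normally distributed\n",
    "_, p_norm_a = shapiro(skew_a)\n",
    "_, p_norm_b = shapiro(skew_b)\n",
    "print('Normality p-values: {:1.2e}, {:1.2e}'.format(p_norm_a, p_norm_b))\n",
    "\n",
    "# Welch's t-test drops the equal variance assumption of the default t-test\n",
    "t_welch, p_welch = ttest_ind(skew_a, skew_b, equal_var=False)\n",
    "print('Welch t-test: \\t t={:1.4f}, p={:1.2e}'.format(t_welch, p_welch))\n",
    "\n",
    "# Mann-Whitney U is a rank-based test that does not assume normality\n",
    "u_val, p_mwu = mannwhitneyu(skew_a, skew_b, alternative='two-sided')\n",
    "print('Mann-Whitney U: U={:1.1f}, p={:1.2e}'.format(u_val, p_mwu))"
   ]
  },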
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Effect Size\n",
    "\n",
    "One thing to keep in mind about hypothesis tests such as the t-test above is that they while they can be used to _is there a difference_ between two sets of data, they do not answer the question of _how different are they_.\n",
    "\n",
    "Often, we would also like to measure how different groups of data are.\n",
    "\n",
    "To do so, we can use effect size measures, which can be used to estimate the magnitude of changes or differences. \n",
    "\n",
    "There are many methods and approaches to measuring effect sizes across different contexts. \n",
    "\n",
    "For this example, we will use cohens-d effect size estimate for differences in means."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "Effect size measurements are measurements of the magnitude of a particular effect.\n",
    "</div>\n",
    "\n",
    "<div class=\"alert alert-info\">\n",
    "Effect sizes on \n",
    "<a href=https://en.wikipedia.org/wiki/Effect_size class=\"alert-link\">wikipedia</a>.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Defining Effect Size Code\n",
    "\n",
    "Often, when analyzing data, we will want to apply some measure that we may not find already available, in which case we may need to implement a version ourselves. \n",
    "\n",
    "For this example, we will implement cohens-d, an effect size measure for differences of means. Briefly, is a calculation of the difference of means between two distributions, divided by the pooled standard deviation. As such, cohens-d is a standardized measure, meaning the output value is independent of the units of the inputs. \n",
    "\n",
    "Note that `math` and `statistics` are standard library modules that contain some useful basic numerical functionality. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "from math import sqrt\n",
    "from statistics import mean, stdev\n",
    "\n",
    "def compute_cohens_d(data_1, data_2):\n",
    "    \"\"\"Compute cohens-d effect size.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    data_1, data_2 : 1d array\n",
    "        Array of data to compute the effect size between.\n",
    "        \n",
    "    Returns\n",
    "    -------\n",
    "    cohens_d : float\n",
    "        The computed effect size measure. \n",
    "    \"\"\"\n",
    "\n",
    "    # Calculate group means\n",
    "    d1_mean = mean(data_1)\n",
    "    d2_mean = mean(data_2)\n",
    "    \n",
    "    # Calculate group standard deviations\n",
    "    d1_std = stdev(data_1)\n",
    "    d2_std = stdev(data_2)\n",
    "    \n",
    "    # Calculate the pooled standard deviation\n",
    "    pooled_std = sqrt((d1_std ** 2 + d2_std ** 2) / 2)\n",
    "    \n",
    "    # Calculate cohens-d\n",
    "    cohens_d = (d1_mean - d2_mean) / pooled_std\n",
    "\n",
    "    return cohens_d"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate the cohens-d effect size for our simulated data from before\n",
    "cohens_d = compute_cohens_d(d2, d1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The cohens-d effect size is 0.20.\n"
     ]
    }
   ],
   "source": [
    "# Check the measured value of the effect size\n",
    "print('The cohens-d effect size is {:1.2f}.'.format(cohens_d))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A cohens-d effect size of ~0.2 is a small or modest effect. \n",
    "\n",
    "In combination with our t-test above, we can conclude that there is a difference of means between the two groups of data, but that the magnitude of this difference is relatively small. "
   ]
  },
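  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `compute_cohens_d` function above averages the two group variances, which matches the usual pooled standard deviation when both groups have the same number of samples, as they do here. If group sizes differ, a common variant weights each variance by its degrees of freedom. The following is a minimal sketch of that variant. The function name `compute_cohens_d_weighted` and the example data are assumptions made for illustration. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def compute_cohens_d_weighted(data_1, data_2):\n",
    "    \"\"\"Compute cohens-d, weighting the pooled standard deviation by group size.\"\"\"\n",
    "\n",
    "    n_1, n_2 = len(data_1), len(data_2)\n",
    "\n",
    "    # Sample variances of each group\n",
    "    var_1 = np.var(data_1, ddof=1)\n",
    "    var_2 = np.var(data_2, ddof=1)\n",
    "\n",
    "    # Pooled standard deviation, weighted by degrees of freedom\n",
    "    pooled_std = np.sqrt(((n_1 - 1) * var_1 + (n_2 - 1) * var_2) / (n_1 + n_2 - 2))\n",
    "\n",
    "    return (np.mean(data_1) - np.mean(data_2)) / pooled_std\n",
    "\n",
    "# Hypothetical groups of unequal size, simulated with a local generator\n",
    "es_rng = np.random.default_rng(4)\n",
    "small_group = es_rng.normal(loc=0.75, scale=1, size=50)\n",
    "large_group = es_rng.normal(loc=0.5, scale=2, size=500)\n",
    "\n",
    "weighted_d = compute_cohens_d_weighted(small_group, large_group)\n",
    "print('The weighted cohens-d effect size is {:1.2f}.'.format(weighted_d))"
   ]
  },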
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "Here we have briefly explored some statistical tests and comparisons for numerical data. \n",
    "\n",
    "For more information on statistical tests of data, check out courses and resources focused on statistics."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}