Overview

Skill Level: Advanced

In this recipe, you will learn how to code a regression analysis to look at the trends and lines of best fit in your data. Understanding the line of best fit gives you insight into potential predicted values.

Ingredients

A good working knowledge of Python, Matlab, Matplotlib, NumPy and maths. An understanding of statistics is also beneficial.

Step-by-step

  1. Understanding Linear Regression.

    Linear regression analysis is used to help you establish whether there is a real relationship between two variables. In real terms, it helps you determine whether there is a statistically significant relationship between two events. Creating a regression model tests whether that relationship exists.

    In addition, if we know that there is an existing relationship, then we can predict or forecast a new observation.

    The variables play two roles:

    the independent variable is the variable we think may drive a change; this is the X axis variable (indep_xxxx in the code).
    the dependent variable is the variable that we wish to explain or predict; this is the Y axis variable (dep_yyyy in the code).

    As we see a change in the X axis variable, we are trying to determine if there is a corresponding change in the Y axis variable. In addition to the corresponding change, we look to see if there is any repeatability or consistency in the change, i.e. as we increase X, does Y also increase or does it decrease?

    Understanding this trend line allows us to use the correlated relationship between the X and Y values to predict a likely Y value for a new X value. Of course, this prediction is based on a history of previous X, Y results.

    So let's see an example of how this works.

  2. Two lists or arrays of X values and Y values

    First of all we need two arrays or lists, indep_xxxx and dep_yyyy, which hold the X and Y co-ordinates of our data points in two separate lists.
    Here are the two lists of X and Y co-ordinates we will use in this example:
    indep_xxxx = [1, 2, 3, 4, 5]
    dep_yyyy = [2, 4, 5, 4, 5]

    Next, get the means of the X and Y values in your two lists. The NumPy library makes this really simple, as shown below:

    import numpy as np
    indep_xxxx_mean = np.mean(indep_xxxx)
    dep_yyyy_mean = np.mean(dep_yyyy)

    The result of this would be:

    indep_xxxx_mean
    3.0
    dep_yyyy_mean
    4.0

     

  3. Calculate the distance of each datapoint from its mean

    Next we have to calculate the distance of each data point from its mean.
    For the independent variable (X) list we need to subtract the mean of the X list from each X value.
    The code below subtracts the mean from every entry automatically. This works because NumPy arrays apply the subtraction element by element; a plain Python list would raise an error, so we convert the list with np.array first.

    indep_xxxx_mean_subract = np.array(indep_xxxx) - indep_xxxx_mean

    This gives us the result [-2. -1. 0. 1. 2.]

    We now need to do the same thing for the dependent Y variable (dep_yyyy).

    Again the code below does this for us, by subtracting the mean of dep_yyyy from each entry in the list.

    dep_yyyy_mean_subract = np.array(dep_yyyy) - dep_yyyy_mean

    This gives us the result [-2. 0. 1. 0. 1.]

    As we have subtracted the mean from each of the X values, some of the values may be negative. To get positive numbers, we square each of the entries in the list. Again the code below does this:

    indep_xxxx_mean_subract_squared = np.array(indep_xxxx_mean_subract)**2

    [ 4. 1. 0. 1. 4.]
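
The whole of this step can be sketched as one short, self-contained script. This is a minimal illustration rather than the recipe's exact code: the shorter names x_dev, y_dev and x_dev_squared are assumptions introduced here for readability, and the lists are converted to NumPy arrays up front so that the subtraction works element by element.

```python
import numpy as np

# The example data from step 2, held as NumPy arrays so arithmetic is element-wise.
indep_xxxx = np.array([1, 2, 3, 4, 5], dtype=float)
dep_yyyy = np.array([2, 4, 5, 4, 5], dtype=float)

# Distance of every data point from the mean of its own list.
x_dev = indep_xxxx - np.mean(indep_xxxx)
y_dev = dep_yyyy - np.mean(dep_yyyy)

# Squaring removes the sign, so all entries are zero or positive.
x_dev_squared = x_dev ** 2

print(x_dev.tolist())          # [-2.0, -1.0, 0.0, 1.0, 2.0]
print(y_dev.tolist())          # [-2.0, 0.0, 1.0, 0.0, 1.0]
print(x_dev_squared.tolist())  # [4.0, 1.0, 0.0, 1.0, 4.0]
```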

     

  4. Calculating the Slope.

    The slope of the line of best fit is the sum of the products of the X and Y distances from their means, divided by the sum of the squared X distances.

    First we have to multiply the x and y distances from the previous step together, pair by pair, and sum the products.
    x = [-2. -1. 0. 1. 2.]
    y = [-2. 0. 1. 0. 1.]

    Therefore xy would be
    [ 4. -0. 0. 0. 2.]

    Summing xy gives 6.0. This is the numerator for the slope.

    Now we have to find the denominator for the slope, which is the sum of (x - mean x) squared (calculated in the previous step).
    We calculated that x - mean x was [-2. -1. 0. 1. 2.], and squaring these values gives [4. 1. 0. 1. 4.]. The sum of [4. 1. 0. 1. 4.] is 10.

    To calculate the slope of the line we divide the numerator by the denominator, which is 6 / 10 = 0.6.
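
The slope arithmetic above might be coded as follows. This is a sketch under the assumption that the data is held in NumPy arrays; the names x, y, numerator, denominator and slope are illustrative and do not come from the recipe's own code.

```python
import numpy as np

# Example data from step 2.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Distances from the means (step 3).
x_dev = x - x.mean()
y_dev = y - y.mean()

# Numerator: sum of the pairwise products of the x and y distances.
numerator = np.sum(x_dev * y_dev)

# Denominator: sum of the squared x distances.
denominator = np.sum(x_dev ** 2)

slope = numerator / denominator
print(numerator, denominator, slope)  # 6.0 10.0 0.6
```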

     

  5. Calculate the Y intercept

    To calculate the Y intercept, Bo, we use the equation of the line:

    y = Bo + B1 x (where B1 is the slope)

    The line of best fit always passes through the point (mean of x, mean of y), so we can substitute the two means into the equation.

    y is the mean of the y co-ordinates, or in our case 4.

    B1 we just calculated to be 0.6.

    x is the mean of the x co-ordinates, which is 3.

    Therefore B1 x = 0.6 * 3 = 1.8

    Solving for Bo gives 4 = Bo + 1.8, or Bo = 2.2

    Bo is the Y intercept.

    So now we know two sets of co-ordinates on the line, (0, 2.2) and (3, 4), and we can plot the line of best fit.
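
As a sketch of this step in code: rearranging the equation gives Bo = mean of y - B1 * mean of x, which for our data is 4 - 1.8 = 2.2. The example below also uses the fitted line to predict a Y value for a new X, and cross-checks the hand calculation against NumPy's np.polyfit with degree 1. The predict helper and the variable names are assumptions made for illustration; they are not part of the recipe's code.

```python
import numpy as np

# Example data from step 2.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Slope (B1) from step 4.
x_dev = x - x.mean()
b1 = np.sum(x_dev * (y - y.mean())) / np.sum(x_dev ** 2)

# The line passes through (mean x, mean y), so Bo = mean y - B1 * mean x.
bo = y.mean() - b1 * x.mean()


def predict(x_new):
    """Hypothetical helper: likely Y value for a new X on the fitted line."""
    return bo + b1 * x_new


print(round(bo, 4))          # 2.2
print(round(predict(6), 4))  # 5.8

# Cross-check: a degree-1 polyfit returns the coefficients as [slope, intercept].
slope_check, intercept_check = np.polyfit(x, y, 1)
print(np.isclose(slope_check, b1), np.isclose(intercept_check, bo))
```

If the two checks print True, the hand calculation agrees with NumPy's least-squares fit.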

    The rest of the code plots a scatter graph in Matplotlib, as well as the X and Y means for the list of data points.

  6. Graphing the X mean Y Mean and Line of Best Fit.

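    Set the X and Y graph labels here. The code assumes xlab and ylab are strings holding your axis labels, and that Matplotlib has been imported as plt.

    import matplotlib.pyplot as plt
    plt.xlabel(xlab)
    plt.ylabel(ylab)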

     

    Show the scatter graph (plots) using the X and Y lists
    plt.scatter(indep_xxxx, dep_yyyy)

    Show the mean Y value (red line)
    plt.axhline(dep_yyyy_mean, color='red', linewidth=2)

     

    Show the mean X value (green line)

    plt.axvline(indep_xxxx_mean, color='green', linewidth=2)

    Plot a blue line, which is the line of best fit. It runs from the Y intercept (0, bo), through the intersection of the X and Y means, and on for the same distance again. Here bo is the Y intercept calculated in step 5.
    plt.plot([0, indep_xxxx_mean, 2 * indep_xxxx_mean], [bo, dep_yyyy_mean, 2 * dep_yyyy_mean - bo], color='blue', linestyle='-', linewidth=2)

    plt.show()
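
An alternative way to draw the line, sketched below, is to compute its two end points directly from the slope and intercept so that it spans exactly the range of the X data. Both function names, and the default labels "X" and "Y", are assumptions made for this illustration. Matplotlib is imported inside the plotting function so that the end-point helper only needs NumPy.

```python
import numpy as np


def best_fit_line_points(x_values, slope, intercept):
    """Return the two end points of the fitted line across the range of the X data."""
    x_values = np.asarray(x_values, dtype=float)
    xs = np.array([x_values.min(), x_values.max()])
    ys = intercept + slope * xs
    return xs, ys


def plot_best_fit(x_values, y_values, slope, intercept, xlab="X", ylab="Y"):
    """Scatter the data, mark both means, and draw the line of best fit."""
    import matplotlib.pyplot as plt  # imported here so the helper above needs only NumPy

    xs, ys = best_fit_line_points(x_values, slope, intercept)
    plt.xlabel(xlab)
    plt.ylabel(ylab)
    plt.scatter(x_values, y_values)
    plt.axhline(np.mean(y_values), color='red', linewidth=2)    # mean Y value
    plt.axvline(np.mean(x_values), color='green', linewidth=2)  # mean X value
    plt.plot(xs, ys, color='blue', linestyle='-', linewidth=2)  # line of best fit
    plt.show()


# End points for the example data, using slope 0.6 and intercept 2.2.
xs, ys = best_fit_line_points([1, 2, 3, 4, 5], 0.6, 2.2)
print(xs.tolist())
print(np.round(ys, 4).tolist())
```

Calling plot_best_fit([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], 0.6, 2.2) should then display the same elements as the code above: the scatter points, the red and green mean lines, and the blue line of best fit.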
