Overview

Skill Level: Any Skill Level

Intermediate

Calculating and Plotting Confidence levels in Spark, Bluemix, Python

Ingredients

spark

python

dataframes

Pandas

matplotlib - pyplot

Step-by-step

  1. Let's quickly review the previous parts.

    To work through the entire tutorial, follow these links to all the previous parts.

    https://developer.ibm.com/recipes/tutorials/introduction-to-data-science-tools-in-bluemix/
    https://developer.ibm.com/recipes/tutorials/introduction-to-data-science-tools-in-bluemix-part-2/
    https://developer.ibm.com/recipes/tutorials/introduction-to-data-science-tools-in-bluemix-part-3/

    https://developer.ibm.com/recipes/tutorials/introduction-to-data-science-tools-in-bluemix-part-4/

  2. Extracting the Year and the month from a data-frame

    Before we start looking at the visualization of data in this recipe, let's first review, from the previous recipe, how we extract the X and Y co-ordinates to present.

    When you are looking for seasonal trends, extracting records based on years or months is a common prerequisite. The following code extracts the year and the month from the date_update column.

    year_df = new_construction_dollars_builds["date_update"].dt.year
    print(year_df)

    month_df = new_construction_dollars_builds["date_update"].dt.month
    print(month_df)

    Let's take this example a little further to make it useful. Let's extract ALL the records from our dataframe for a specific month, in this case December. This is very similar to the example where we selected records on or after a given date. However, in this case we are looking for entries in the date_update column that have a month of 12.

    Again, it is important that our data is sorted in date sequence, so to be sure we do an explicit sort before we print the dataframe contents.

    december_builds = new_construction_dollars_builds[(new_construction_dollars_builds['date_update'].dt.month==12) ]
    december_builds = december_builds.sort_values(by='date_update')
    print(december_builds)

    To select all the records for a specific year, just replace the month== with a year==. Here is a working example.

    Y2007_builds = new_construction_dollars_builds[(new_construction_dollars_builds['date_update'].dt.year==2007) ]
    Y2007_builds = Y2007_builds.sort_values(by='date_update')
    print(Y2007_builds)

    Here is the output of the print:

         constructions_total  constructions_value    date  date_update
    46                  4049               943302  Jan-07   2007-01-01
    463                 4216               991984  Feb-07   2007-02-01
    241                 4888              1161053  Mar-07   2007-03-01
    242                 4339              1003470  Apr-07   2007-04-01
    243                 5517              1343940  May-07   2007-05-01
    126                 5107              1262521  Jun-07   2007-06-01
    344                 5201              1262986  Jul-07   2007-07-01
    127                 5566              1367965  Aug-07   2007-08-01
    345                 4845              1201418  Sep-07   2007-09-01
    464                 5537              1366672  Oct-07   2007-10-01
    346                 5241              1255333  Nov-07   2007-11-01

    Setting the X and Y co-ordinate values.

    The X and Y co-ordinate values are a pair of lists. “X” is often described as the independent value and “Y” as the dependent value. The following code extracts the two columns from the Pandas dataframe.

    december_builds = new_construction_dollars_builds[(new_construction_dollars_builds['date_update'].dt.month==12) ]
    december_builds = december_builds.sort_values(by='date_update')
    print(december_builds)
    indep_xxxx = december_builds['constructions_total']
    dep_yyyy = december_builds['constructions_value']
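
    The filtering and extraction steps above can be tried without the real data set. The sketch below is a minimal illustration only: toy_builds is a hypothetical stand-in for new_construction_dollars_builds, its first three rows reuse values from the 2007 output shown earlier, and the 2008 row is made up. It filters on January rather than December because the toy data has no December rows.

```python
import pandas as pd

# Hypothetical stand-in for new_construction_dollars_builds.
# The 2007 rows come from the output above; the 2008 row is made up.
toy_builds = pd.DataFrame({
    "constructions_total": [4888, 4049, 4216, 4100],
    "constructions_value": [1161053, 943302, 991984, 950000],
    "date_update": pd.to_datetime(
        ["2007-03-01", "2007-01-01", "2007-02-01", "2008-01-01"]),
})

# Keep only the January records, then sort them into date sequence
january_builds = toy_builds[toy_builds["date_update"].dt.month == 1]
january_builds = january_builds.sort_values(by="date_update")

# Extract the X (independent) and Y (dependent) values as plain lists
indep_xxxx = january_builds["constructions_total"].tolist()
dep_yyyy = january_builds["constructions_value"].tolist()

print(indep_xxxx)  # [4049, 4100]
print(dep_yyyy)    # [943302, 950000]
```

    Calling tolist() is optional; Matplotlib also accepts the Pandas columns directly.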

  3. Introducing Confidence Levels

    Let's say we are interested in knowing how many houses will be built in January next year, and we have 40 years of housing data to work from. (Each year has 12 months, so there would be 40*12 entries in our table, 40 of them for each month.) Suppose the mean of a sample of 10 January build amounts is 4000 houses. We could use this estimate, but on its own it does not mean very much because it is based on a sample. If we could determine how much confidence to place in the sample, the estimate would be more meaningful. Sampling becomes more valuable as a tool as the population you are working with grows, for example fish in the ocean or insects on a farm.

    There is some logic to this: the larger the sample taken, the higher the confidence. It may be cost prohibitive to increase the size of the sample, so choosing an appropriate sample size together with a confidence level that suits the question can be a sound and reasonable way to estimate.

    In addition, the lower the confidence level (for example, 50% is lower than 95%), the narrower the confidence interval will be. Let's see some working examples of this.
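
    Both effects can be checked with a little arithmetic. The sketch below assumes the margin of error has the form tfactor * (std / sqrt(n)), the same form the code later in this recipe uses. margin_of_error is a hypothetical helper, the standard deviation of 500 is made up, and 1.325 and 2.086 are the 90th and 97.5th percentile entries for 20 degrees of freedom in a standard t-table. For simplicity the t-value is held fixed while the sample size changes, although in practice it also shrinks slightly as the sample grows.

```python
import math


def margin_of_error(tfactor, std, n):
    """Half-width of the interval: tfactor * (std / sqrt(n))."""
    return tfactor * (std / math.sqrt(n))


std = 500.0  # made-up standard deviation, for illustration only

# Effect of sample size: same t-value, 10 values versus 40 values
small_sample = margin_of_error(1.325, std, 10)
large_sample = margin_of_error(1.325, std, 40)

# Effect of confidence: same sample size, lower versus higher t-value
lower_confidence = margin_of_error(1.325, std, 20)
higher_confidence = margin_of_error(2.086, std, 20)

print("n=10: +/- %.1f, n=40: +/- %.1f" % (small_sample, large_sample))
print("t=1.325: +/- %.1f, t=2.086: +/- %.1f" % (lower_confidence, higher_confidence))
```

    Quadrupling the sample size halves the margin, while asking for more confidence widens it.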

  4. Calculate confidence intervals

    First, how do we calculate confidence intervals? Before we start, we need to recall the meaning of some terms from high-school math.

    1. Standard Deviation

    The standard deviation is a measure of how spread out the numbers are. It is calculated by taking the square root of the variance.

    2. Variance

    The variance is the average of the squared differences from the mean. To calculate it, first calculate the mean (average) of the list of values. Then, for each value, subtract the mean and square the result.

    Finally, average those squared differences, and that is the variance. For a sample, divide the sum of the squares by the number of values minus one, which is what the Pandas std function does by default.

    3. The easy way to get a standard deviation

    Or you can do it the fast and easy way: pass the values to the std function and get the result directly. The following line of code calculates the standard deviation of the “January” monthly builds and places it in a one-entry list, so the value is available as std[0].

    std = mm_builds.std().tolist()
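
    The definitions above can be checked by hand. The sketch below uses only the Python standard library and takes the constructions_total values for January to May 2007 from the output earlier in this recipe. It divides by the number of values minus one (the sample variance), and compares the result with the standard library's statistics.stdev, which uses the same convention.

```python
import math
import statistics

# constructions_total for January to May 2007, from the output earlier in this recipe
values = [4049, 4216, 4888, 4339, 5517]

# Step 1: the mean of the values
mean = sum(values) / len(values)

# Step 2: squared difference of each value from the mean
squared_diffs = [(v - mean) ** 2 for v in values]

# Step 3: the sample variance averages the squares over n - 1
variance = sum(squared_diffs) / (len(values) - 1)

# Step 4: the standard deviation is the square root of the variance
std = math.sqrt(variance)

print("by hand: %.2f" % std)
print("statistics.stdev: %.2f" % statistics.stdev(values))
```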

  5. Components to calculate the confidence level

    Now we have the components to calculate the confidence level. In summary they are:

    1. Sample size
    2. Mean of sample
    3. Standard deviation of sample
    4. t-table value

    OK, so we have not talked about t-table values yet. A t-table is a statistical tool that lists the critical values you use to calculate confidence intervals. A typical t-table has one row per degrees of freedom (the sample size minus one) and one column per percentile, for example from the 90th to the 99th. Find the row for your sample and the column for the confidence level you require; the t-value where they meet is what we use to calculate our confidence level. There are many of these tables on the Internet. You can find one example here: http://www.sjsu.edu/faculty/gerstman/StatPrimer/t-table.pdf
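
    Reading a t-table amounts to a two-key lookup. The sketch below is illustrative only: T_TABLE and lookup_tfactor are hypothetical names, and only the rows for 19 and 20 degrees of freedom of a standard one-tailed t-table are reproduced. Note that a percentile column such as 0.90 leaves 10% in one tail, so an interval that extends both above and below the mean using that value covers 80%; a two-sided 90% interval would use the 0.95 column.

```python
# Two rows of a standard one-tailed t-table, keyed by degrees of freedom
# and then by percentile. A full table has many more rows.
T_TABLE = {
    19: {0.90: 1.328, 0.95: 1.729, 0.975: 2.093, 0.99: 2.539, 0.995: 2.861},
    20: {0.90: 1.325, 0.95: 1.725, 0.975: 2.086, 0.99: 2.528, 0.995: 2.845},
}


def lookup_tfactor(sample_size, percentile):
    """Return the t-value for a sample size and a one-tailed percentile."""
    degrees_of_freedom = sample_size - 1
    return T_TABLE[degrees_of_freedom][percentile]


print(lookup_tfactor(21, 0.90))  # 1.325
print(lookup_tfactor(20, 0.90))  # 1.328
```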

  6. A confidence level code example

    An overview of the code below. For each month, the code:

    1. Extracts all values for the selected month
    2. Sorts the data in date sequence
    3. Selects only the column “constructions_total” from the dataframe
    4. Randomly selects a sample of 20 from the list (of 46)
    5. Calculates the standard deviation of constructions_total in the sample
    6. Calculates the mean of constructions_total in the sample
    7. Sets the t-factor from the t-value table in the previous section
    8. Calculates the square root of the sample size
    9. Calculates the upper mean (the mean plus the margin of error)
    10. Plots the error bar in Matplotlib

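    import math
    import matplotlib.pyplot as plt

    print("how many houses will be built each month")
    print("in any given year based on the last 40 years of housing builds")
    print("confidence intervals")
    mm = 1

    sample = 20
    while mm < 13:
        # Extract all records for month mm and sort them in date sequence
        mm_builds = new_construction_dollars_builds[(new_construction_dollars_builds['date_update'].dt.month==mm) ]
        mm_builds = mm_builds.sort_values(by='date_update')
        # Keep only the column we analyse, then draw a random sample from it
        mm_builds = mm_builds[['constructions_total']]
        sample_series_mm_builds = mm_builds.sample(n=sample)
        # std and mean are one-entry lists, hence std[0] and mean[0] below
        std = sample_series_mm_builds.std().tolist()
        mean = sample_series_mm_builds.mean().tolist()
        # t-table value for the 90th percentile at 20 degrees of freedom;
        # strictly, a sample of 20 has 19 degrees of freedom (1.328)
        tfactor = 1.325
        sroot_sample = math.sqrt(sample)
        upper_mean = mean[0] + (tfactor*(std[0]/sroot_sample))
        x = mm
        y = mean[0]
        # yerr is the distance from the mean to the upper limit
        plt.errorbar(x, y, yerr=upper_mean-mean[0], fmt='o')
        plt.plot(x+.01, y)
        mm = mm + 1

    plt.title('confidence intervals monthly housing builds')
    plt.show()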

  7. Understanding the Confidence Level Calculation:

    The confidence level equation looks like this:

    $$\text{Confidence level} = \mu \pm \left( \text{tfactor} \times \frac{x}{\sqrt{n}} \right)$$

    Where:

    • $x$ = standard deviation of list values
    • $\sqrt{n}$ = square root of sample quantity
    • $\mu$ = mean of sample list
    • tfactor = t-table value

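
    The calculation in this section can also be written as a small function. The sketch below is one possible reading of it, using only the Python standard library: confidence_interval is a hypothetical helper, and the input is the eleven 2007 constructions_total values from the output earlier in this recipe, used here purely to have real numbers to plug in. With 11 values there are 10 degrees of freedom, for which the 90th percentile t-table entry is 1.372.

```python
import math
import statistics


def confidence_interval(values, tfactor):
    """Return (lower, mean, upper) as mean -/+ tfactor * (std / sqrt(n))."""
    n = len(values)
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    margin = tfactor * (std / math.sqrt(n))
    return mean - margin, mean, mean + margin


# constructions_total for January to November 2007, from the output earlier in this recipe
builds_2007 = [4049, 4216, 4888, 4339, 5517, 5107, 5201, 5566, 4845, 5537, 5241]

# 1.372 is the 90th percentile t-table entry for 10 degrees of freedom
lower, mean, upper = confidence_interval(builds_2007, 1.372)
print("mean %.0f, interval %.0f to %.0f" % (mean, lower, upper))
```

    The distance from the mean to either limit is the same value, and it is that distance which an error bar draws.
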
  8. Plotting the confidence map.

    A confidence level states that we are (for example) 90% confident that a value will fall within a given distance either side of the mean. The higher the confidence, the wider the interval around the mean; the lower the confidence, the narrower the interval. Let's see how we would graph this.

    x = mm
    y = mean[0]
    plt.errorbar(x, y, yerr=upper_mean-mean[0], fmt='o')
    plt.plot(x+.01, y)
    mm = mm + 1

    First of all, we set the X value to the month number (1 to 12 in our case).

    The Y plot point needs to equal the mean of the sample for the month.

    Now we need to calculate our confidence level: Confidence level = $\mu \pm (\text{tfactor} \times (x / \sqrt{n}))$. If we subtract the mean from the upper confidence level, what remains is the error value we have calculated for the sample, and this is the value we pass as yerr.

    Once we have the three values, it is a simple task to plot the error bar in Matplotlib. An error bar is the way we visualize a confidence level on a graph.

    The output should look like this.
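
    The plotting steps in this section can also be tried end to end without the housing data. The sketch below is an assumption-laden variation, not the recipe's own code: the monthly build counts are randomly generated (made up), the statistics come from the Python standard library instead of Pandas, all twelve error bars are drawn with a single errorbar call, and the figure is saved to a hypothetical file name instead of being shown.

```python
import math
import random
import statistics

import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch also runs without a display
import matplotlib.pyplot as plt

rng = random.Random(0)  # fixed seed so the made-up data is repeatable
sample = 20
tfactor = 1.325         # t-factor used in this recipe

months, means, errors = [], [], []
for mm in range(1, 13):
    # 40 made-up yearly build counts for month mm
    mm_builds = [rng.gauss(4000 + 100 * mm, 500) for _ in range(40)]
    # Randomly select the sample, then compute its mean and standard deviation
    sample_builds = rng.sample(mm_builds, sample)
    mean = statistics.mean(sample_builds)
    std = statistics.stdev(sample_builds)
    months.append(mm)
    means.append(mean)
    # Error value: the distance from the mean to the upper limit
    errors.append(tfactor * (std / math.sqrt(sample)))

# One call draws a point per month with its error bar
plt.errorbar(months, means, yerr=errors, fmt='o')
plt.title('confidence intervals monthly housing builds')
plt.savefig('confidence_intervals.png')
```

    In a notebook, replacing the savefig call with plt.show() and dropping the matplotlib.use line displays the plot inline.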
