Step-by-step

Let's quickly review the previous parts.
To work through the entire tutorial in order, follow these links to all the previous parts.
https://developer.ibm.com/recipes/tutorials/introductiontodatasciencetoolsinbluemix/
https://developer.ibm.com/recipes/tutorials/introductiontodatasciencetoolsinbluemixpart2/
https://developer.ibm.com/recipes/tutorials/introductiontodatasciencetoolsinbluemixpart3/
https://developer.ibm.com/recipes/tutorials/introductiontodatasciencetoolsinbluemixpart4/

Extracting the Year and the month from a dataframe
Before we look at visualizing data in this recipe, let's first review from the previous recipe how we extract the X and Y coordinates to present.
Often you are looking for seasonal trends. To find them, extracting records based on years or months is a common prerequisite. The following code extracts the year and the month.
# Year of each row's date_update value
year_df = new_construction_dollars_builds["date_update"].dt.year
print(year_df)
# Month (1 to 12) of each row's date_update value
month_df = new_construction_dollars_builds["date_update"].dt.month
print(month_df)
Let's take this example a little further to make it useful. Let's extract ALL the records from our dataframe for a specific month, in this case December. This is very similar to the example where we selected records with a date >= a given date. However, in this case we are looking for entries in the date_update column that have a month of 12.
Again, it is important that our data is in date sequence, so to be sure we do an explicit sort before we print the dataframe contents.
december_builds = new_construction_dollars_builds[new_construction_dollars_builds['date_update'].dt.month == 12]
december_builds = december_builds.sort_values(by='date_update')
print(december_builds)
To select all the records for a specific year, just replace the month== test with a year== test. Here is a working example.
Y2007_builds = new_construction_dollars_builds[new_construction_dollars_builds['date_update'].dt.year == 2007]
Y2007_builds = Y2007_builds.sort_values(by='date_update')
print(Y2007_builds)
And here is the output of the print:
     constructions_total  constructions_value   date  date_update
46                  4049               943302  Jan07     20070101
463                 4216               991984  Feb07     20070201
241                 4888              1161053  Mar07     20070301
242                 4339              1003470  Apr07     20070401
243                 5517              1343940  May07     20070501
126                 5107              1262521  Jun07     20070601
344                 5201              1262986  Jul07     20070701
127                 5566              1367965  Aug07     20070801
345                 4845              1201418  Sep07     20070901
464                 5537              1366672  Oct07     20071001
346                 5241              1255333  Nov07     20071101

Setting the X and Y coordinate values.
The X and Y coordinate values are each a list of numbers. “X” is often described as the independent variable and “Y” the dependent variable. The following code extracts both as columns from the Pandas dataframe.
december_builds = new_construction_dollars_builds[new_construction_dollars_builds['date_update'].dt.month == 12]
december_builds = december_builds.sort_values(by='date_update')
print(december_builds)
# X (independent) is the build count column; Y (dependent) is the build value column
indep_xxxx = december_builds['constructions_total']
dep_yyyy = december_builds['constructions_value']
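The filter-then-extract pattern can be tried without the tutorial's data. The following is a minimal sketch, assuming a small made-up dataframe: toy_builds and all of its values are hypothetical and are not taken from the housing dataset, although the column names match.

```python
import pandas as pd

# Hypothetical stand-in for new_construction_dollars_builds (values are made up).
toy_builds = pd.DataFrame({
    "constructions_total": [410, 395, 530, 402, 388, 541],
    "constructions_value": [9100, 8800, 12500, 9000, 8700, 12900],
    "date_update": ["2006-12-01", "2007-01-01", "2007-06-01",
                    "2005-12-01", "2006-01-01", "2006-06-01"],
})

# The .dt accessor only works on datetime columns, so parse the strings first.
toy_builds["date_update"] = pd.to_datetime(toy_builds["date_update"])

# Keep only December rows, then sort them into date order.
december = toy_builds[toy_builds["date_update"].dt.month == 12]
december = december.sort_values(by="date_update")

# Pull the two columns out as plain Python lists for plotting.
x_values = december["constructions_total"].tolist()
y_values = december["constructions_value"].tolist()

print(x_values)
print(y_values)
```

Converting to plain lists with tolist() is optional here, since Matplotlib also accepts the Pandas columns directly.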
Introducing Confidence Levels
Let's say that we are interested in knowing how many houses will be built in January next year, and we have 40 years of housing data showing how many houses were built each month. (Each year has 12 months, so there would be 40*12 entries in our table, 40 of them for each month.) Suppose a sample of 10 of the January build amounts has a mean of 4000 houses. We could use this estimate as it is, but on its own it does not tell us very much because it is based on a sample. If we can state how confident we are in this sample, the estimate becomes more meaningful. Sampling becomes more valuable as a tool as the population you are working with grows, e.g. fish in the ocean or insects on a farm.
There is some logic in this: the larger the sample taken, the higher the confidence. It may be cost prohibitive to increase the size of the sample, so choosing a workable sample size together with a confidence level that suits the problem can be a sound and reasonable way to estimate.
In addition, the lower the confidence level (e.g. a 50% chance is lower than a 95% chance of something occurring), the narrower the confidence interval will be. Let's see some working examples of this.
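Both effects can be seen with a few lines of arithmetic. This is only a sketch under stated assumptions: the standard deviation is made up, the multipliers are illustrative t-values of the kind introduced later in this recipe, and margin_of_error is a hypothetical helper name. The multiplier is held fixed while the sample size changes purely to isolate the effect of sample size.

```python
import math

def margin_of_error(std_dev, sample_size, t_value):
    """Half-width of the interval: t * (standard deviation / sqrt(n))."""
    return t_value * (std_dev / math.sqrt(sample_size))

# Hypothetical standard deviation of monthly builds (made up for illustration).
std_dev = 600.0

# Same multiplier, larger sample: the margin shrinks.
small_sample = margin_of_error(std_dev, 10, 1.725)
large_sample = margin_of_error(std_dev, 40, 1.725)

# Same sample size, lower confidence (smaller multiplier): the margin also shrinks.
higher_confidence = margin_of_error(std_dev, 20, 2.086)
lower_confidence = margin_of_error(std_dev, 20, 1.325)

print(round(small_sample, 1), round(large_sample, 1))
print(round(higher_confidence, 1), round(lower_confidence, 1))
```

Because the sample size sits under a square root, quadrupling the sample only halves the margin, which is one reason very large samples are often not worth their cost.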

Calculate confidence intervals
First, how do we calculate confidence intervals? Before we start, we need to recall the meaning of some terms we used in high school math.
1. Standard Deviation
The Standard Deviation is a measure of how spread out numbers are. It is calculated by taking the square root of the Variance.
2. Variance
The Variance is the average of the squared differences from the Mean. To calculate it, first calculate the mean (average) of our list of values. Then, for each value, subtract the mean and square the result.
Finally, take the average of those squared differences, and that is the variance. (For a sample, divide the sum of squares by n - 1 rather than n; this is what the Pandas std function does by default.)
3. The easy way to get a standard deviation
Or you can do it the fast and easy way: pass a list to the std function and get the value! The following line of code calculates the standard deviation of the “January” monthly builds and places it in a one-entry list, so the value is read as std[0].
std = mm_builds.std().tolist()
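To connect the three definitions above, here is a small sketch that computes the variance and standard deviation by hand and then checks them against Python's standard statistics module. The build counts are made-up numbers for illustration, not values from the housing dataset.

```python
import math
import statistics

# Hypothetical January build counts (made-up numbers for illustration).
january_builds = [4049, 3810, 4216, 3950, 4475]

n = len(january_builds)
mean = sum(january_builds) / n

# Squared difference of every value from the mean.
squared_diffs = [(value - mean) ** 2 for value in january_builds]

# Sample variance divides by n - 1; the standard deviation is its square root.
variance = sum(squared_diffs) / (n - 1)
std_dev = math.sqrt(variance)

# The standard library gives the same answers in one call each.
print(variance, statistics.variance(january_builds))
print(std_dev, statistics.stdev(january_builds))
```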

Components to calculate the confidence level
Now we have the components to calculate the confidence level. In summary they are:
 Sample size
 Mean of sample
 Standard deviation of sample
 t-table value
OK, so we have not talked about t-table values yet. A t-table is a statistical tool that lists critical values (t-values) that you can use to determine confidence values. The t-table lists these values by degrees of freedom (normally the sample size minus one) for selected percentiles from the 90th to the 99th. There are heaps of these tables on the Internet. Simply follow the table: choose the row for your sample size and the column for the confidence level you require. The associated t-value is what we use to calculate our confidence level. You can find one example of the table here: http://www.sjsu.edu/faculty/gerstman/StatPrimer/ttable.pdf
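Reading a t-table amounts to a two-key lookup: the row is the degrees of freedom and the column is the tail probability. The sketch below hard-codes a short excerpt of a standard one-tailed t-table; the dictionary layout and the lookup_t_value helper are hypothetical and only illustrate the lookup, so consult a full table for other sizes.

```python
# Excerpt of a one-tailed t-table: {degrees_of_freedom: {tail_probability: t_value}}.
# Only a few rows are reproduced here; consult a full table for other sizes.
T_TABLE_EXCERPT = {
    10: {0.10: 1.372, 0.05: 1.812, 0.025: 2.228},
    19: {0.10: 1.328, 0.05: 1.729, 0.025: 2.093},
    20: {0.10: 1.325, 0.05: 1.725, 0.025: 2.086},
    30: {0.10: 1.310, 0.05: 1.697, 0.025: 2.042},
}

def lookup_t_value(degrees_of_freedom, tail_probability):
    """Return the critical value from the excerpt (hypothetical helper)."""
    return T_TABLE_EXCERPT[degrees_of_freedom][tail_probability]

# Degrees of freedom are normally the sample size minus one.
sample_size = 20
t_for_sample = lookup_t_value(sample_size - 1, 0.10)

# The 1.325 used in this recipe sits on the 20 row of the same column.
t_in_recipe = lookup_t_value(20, 0.10)

print(t_for_sample, t_in_recipe)
```

The two values differ only in the third decimal place, so for a sample of 20 the choice of row makes little practical difference to the plotted error bars.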

A confidence value code example
An overview of the code below is:
Loop for each month:
 Extract all values for the selected month
 Sort the data in date sequence
 Select only the column “constructions_total” from the dataframe
 Randomly select a sample of 20 from the list (of 46)
 Calculate the standard deviation of constructions_total in the sample list
 Calculate the mean of constructions_total in the sample list
 Set the tfactor from the t-value table in the previous section
 Calculate the square root of the sample size
 Calculate the upper mean
 Plot the error bar in Matplotlib
import math
import matplotlib.pyplot as plt

print("how many houses will be built each month")
print("in any given year based on the last 40 years of housing builds")
print("confidence intervals")
mm = 1
sample = 20
while mm < 13:
    # All rows for month mm, in date order
    mm_builds = new_construction_dollars_builds[new_construction_dollars_builds['date_update'].dt.month == mm]
    mm_builds = mm_builds.sort_values(by='date_update')
    # Keep only the column we are estimating, then draw a random sample from it
    mm_builds = mm_builds[['constructions_total']]
    sample_series_mm_builds = mm_builds.sample(n=sample)
    # std() and mean() return one value per column, so each list has one entry
    std = sample_series_mm_builds.std().tolist()
    mean = sample_series_mm_builds.mean().tolist()
    tfactor = 1.325  # t-value read from the t-table
    sroot_sample = math.sqrt(sample)
    upper_mean = mean[0] + (tfactor * (std[0] / sroot_sample))
    x = mm
    y = mean[0]
    # The error bar length is the distance from the mean to the upper mean
    plt.errorbar(x, y, yerr=upper_mean - mean[0], fmt='o')
    plt.plot(x + .01, y)
    mm = mm + 1
plt.title("confidence intervals monthly housing builds")
plt.show()
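As an alternative to looping over the months one at a time, the same per-month statistics can be computed in one pass with a Pandas groupby. This is a different technique from the loop above, shown only as a sketch: toy_builds and its values are made up, no random sampling is done, and the row count of each month stands in for the sample size.

```python
import pandas as pd

# Hypothetical data: three years of monthly build counts (values are made up).
dates = pd.date_range("2005-01-01", periods=36, freq="MS")
toy_builds = pd.DataFrame({
    "date_update": dates,
    "constructions_total": [4000 + 25 * (i % 12) + 40 * (i // 12) for i in range(36)],
})

tfactor = 1.325  # same t-value as the loop above; pick yours from a t-table

# One row per calendar month with the mean, standard deviation and row count.
summary = (
    toy_builds
    .groupby(toy_builds["date_update"].dt.month)["constructions_total"]
    .agg(["mean", "std", "count"])
)

# Margin of error and upper mean for every month in one step.
summary["margin"] = tfactor * summary["std"] / (summary["count"] ** 0.5)
summary["upper_mean"] = summary["mean"] + summary["margin"]

print(summary)
```

The resulting summary index (1 to 12) and the mean and margin columns could then be passed to plt.errorbar in a single call instead of twelve.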
Understanding the Confidence Level Calculation:
The confidence level equation looks like this:
Confidence level = $\bar{x} + \left(\text{tfactor} \times \frac{\sigma_x}{\sqrt{n}}\right)$
Where:
 $\sigma_x$ = standard deviation of list values
 $\sqrt{n}$ = square root of sample quantity
 $\bar{x}$ = mean of sample list
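A worked example may make the equation more concrete. The sketch below uses made-up inputs and a hypothetical helper, confidence_limits, that returns both the upper value from the equation and the matching lower value obtained by subtracting the same margin.

```python
import math

def confidence_limits(sample_mean, sample_std, sample_size, tfactor):
    """Return (lower, upper) limits around the sample mean."""
    margin = tfactor * (sample_std / math.sqrt(sample_size))
    return sample_mean - margin, sample_mean + margin

# Hypothetical inputs: mean of 4000 builds, standard deviation of 500, 20 samples.
lower, upper = confidence_limits(4000.0, 500.0, 20, 1.325)

# The error value passed to the plot is the upper limit minus the mean.
error = upper - 4000.0

print(round(lower, 2), round(upper, 2), round(error, 2))
```

Only the upper value is calculated in this recipe's code, which is sufficient because Matplotlib draws a single yerr value symmetrically above and below the plotted point.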

Plotting the confidence map.
A confidence level states that we are (for example) 90% confident that a value will fall between x and y either side of the mean. The higher the confidence, of course, the wider the gap around the mean. The lower the confidence, the shorter the gap. Let's see how we would graph this.
x = mm
y = mean[0]
plt.errorbar(x, y, yerr=upper_mean - mean[0], fmt='o')
plt.plot(x + .01, y)
mm = mm + 1
First of all we will set the X value as the value of the month (1 to 12) in our case.
The Y plot point needs to equal the mean of the sample for the month.
Now we need to calculate our confidence level: Confidence level = $\bar{x} + \left(\text{tfactor} \times \frac{\sigma_x}{\sqrt{n}}\right)$. If we subtract the mean from the confidence level, what remains is the error value we have calculated for the sample.
Once we have the three values, it is a simple task to plot the error bar in Matplotlib. An error bar is the way we visualize a confidence level on a graph.
The output should look like this.