Step-by-step

Let's quickly review the previous parts.
To follow the entire tutorial, use these links to all of the parts.
https://developer.ibm.com/recipes/tutorials/introductiontodatasciencetoolsinbluemix/
https://developer.ibm.com/recipes/tutorials/introductiontodatasciencetoolsinbluemixpart2/
https://developer.ibm.com/recipes/tutorials/introductiontodatasciencetoolsinbluemixpart3/ 
Extracting the Year and the month from a dataframe
Before we look at visualizing data in this recipe, let's first review from the previous recipe how we extract the X and Y coordinates to plot.
When you are looking for seasonal trends, extracting records by year or month is a common prerequisite. The following code extracts the month or year.
year_df = new_construction_dollars_builds["date_update"].dt.year
print(year_df)
month_df = new_construction_dollars_builds["date_update"].dt.month
print(month_df)
Let's take this example a little further to make it useful. Let's extract ALL the records from our dataframe for a specific month, in this case December. This is very similar to the example where we selected records with a date >= a given date. However, in this case we are looking for entries in the date_update column that have a month of 12.
Again, it is important that our data is in date order, so to be sure we do an explicit sort before we print the dataframe contents.
december_builds = new_construction_dollars_builds[new_construction_dollars_builds['date_update'].dt.month == 12]
december_builds = december_builds.sort_values(by='date_update')
print(december_builds)
To select all the records for a specific year, just replace the month== with a year==. Here is a working example.
Y2007_builds = new_construction_dollars_builds[new_construction_dollars_builds['date_update'].dt.year == 2007]
Y2007_builds = Y2007_builds.sort_values(by='date_update')
print(Y2007_builds)
And here is the output of the print:
     constructions_total  constructions_value   date  date_update
46                  4049               943302  Jan07     20070101
463                 4216               991984  Feb07     20070201
241                 4888              1161053  Mar07     20070301
242                 4339              1003470  Apr07     20070401
243                 5517              1343940  May07     20070501
126                 5107              1262521  Jun07     20070601
344                 5201              1262986  Jul07     20070701
127                 5566              1367965  Aug07     20070801
345                 4845              1201418  Sep07     20070901
464                 5537              1366672  Oct07     20071001
346                 5241              1255333  Nov07     20071101
Setting the X and Y coordinate values
The X and Y coordinate values are a pair of lists. “X” is often described as the independent variable and “Y” as the dependent variable. The following code extracts both columns from the Pandas dataframe.
december_builds = new_construction_dollars_builds[new_construction_dollars_builds['date_update'].dt.month == 12]
december_builds = december_builds.sort_values(by='date_update')
print(december_builds)
indep_xxxx = december_builds['constructions_total']
dep_yyyy = december_builds['constructions_value'] 
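The filtering, sorting and extraction steps above can be tried end to end with a small sketch. It assumes a hypothetical four-row dataframe named sample, built from rows of the 2007 output shown earlier, and it assumes that a date_update value such as 20070101 means 1 January 2007. The names sample, june_builds, indep_x and dep_y are illustrative only and do not appear in the original recipe.

```python
import pandas as pd

# Four rows copied from the 2007 output above, deliberately out of order.
sample = pd.DataFrame(
    {
        "constructions_total": [5107, 4049, 5241, 5517],
        "constructions_value": [1262521, 943302, 1255333, 1343940],
        "date": ["Jun07", "Jan07", "Nov07", "May07"],
        "date_update": ["20070601", "20070101", "20071101", "20070501"],
    },
    index=[126, 46, 346, 243],
)

# Assumption: the strings are dates in YYYYMMDD form. The .dt accessor
# only works once the column holds real datetime values.
sample["date_update"] = pd.to_datetime(sample["date_update"], format="%Y%m%d")

# Boolean mask on the month, followed by an explicit sort by date.
june_builds = sample[sample["date_update"].dt.month == 6]
june_builds = june_builds.sort_values(by="date_update")

# The same pattern selects a whole year.
y2007_builds = sample[sample["date_update"].dt.year == 2007]
y2007_builds = y2007_builds.sort_values(by="date_update")

# Plain Python lists for plotting: X is independent, Y is dependent.
indep_x = y2007_builds["constructions_total"].tolist()
dep_y = y2007_builds["constructions_value"].tolist()

print(june_builds)
print(indep_x)
print(dep_y)
```

Calling tolist() is optional, because Matplotlib accepts a Pandas column directly, but plain lists make it easy to inspect what will be plotted.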
Introduction to Matplotlib and Graphs with Python
For people who have used MATLAB, Matplotlib is a similar collection of functions for plotting graphs. Visualisation is an important tool for the data scientist, both to explore result sets and to explain results.
In the examples below we will see how to create:
 Scatter plots
 Bar graphs
 Line graphs
 Pie charts
We will also look at how to show the mean line and how to label the axes.
To use Matplotlib, you need to include it in your Spark analysis notebook:
%matplotlib inline
import matplotlib.pyplot as plt 
A simple Plot Graph
The following piece of code draws a simple scatter plot.
Line 1 imports the Matplotlib plotting library that produces the graph
Line 2 lets us display it inline in our Bluemix Spark analysis notebook
Lines 3 & 4 set the “x” and “y” axis labels of the plot
Line 5 plots a set of x,y coordinates taken from two arrays (construct_total_list, construct_value_list)
Line 6 draws a horizontal line at the mean of the y values
Line 7 draws a vertical line at the mean of the x values
Line 8 draws a line through three points
Line 9 displays the plot.
import matplotlib.pyplot as plt
%matplotlib inline
plt.xlabel('construct_total')
plt.ylabel('construct_value')
plt.scatter(construct_total_list, construct_value_list)
plt.axhline(construct_value_mean, color='red', linewidth=2)
plt.axvline(construct_total_mean, color='green', linewidth=2)
plt.plot([0, construct_total_mean, construct_total_mean + (construct_total_mean - 0)], [bo, construct_value_mean, (construct_value_mean + construct_value_mean - bo)], color='blue', linestyle='-', linewidth=2)
plt.show()
If construct_total_list contains the array [1, 2, 3, 4, 5] and construct_value_list contains [2, 4, 5, 4, 5], the following plot is produced. You can see the 5 plotted points as well as the mean lines for X and Y. I will review the blue line in the next recipe.
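The listing above relies on variables defined elsewhere (the two lists, the two means and bo). As a rough, self-contained sketch of the same scatter plot with its mean lines, the version below computes the means with NumPy and leaves out the blue line, since bo is only covered in the next recipe. Using np.mean for the means, the fig and ax names, and the Agg backend (so the sketch runs outside a notebook) are assumptions rather than part of the original.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; in a notebook use %matplotlib inline instead
import matplotlib.pyplot as plt
import numpy as np

# The example arrays from the text.
construct_total_list = [1, 2, 3, 4, 5]
construct_value_list = [2, 4, 5, 4, 5]

# Means of the X and Y values, used for the two reference lines.
construct_total_mean = np.mean(construct_total_list)
construct_value_mean = np.mean(construct_value_list)

fig, ax = plt.subplots()
ax.set_xlabel('construct_total')
ax.set_ylabel('construct_value')
ax.scatter(construct_total_list, construct_value_list)

# Horizontal line at the mean of Y, vertical line at the mean of X.
ax.axhline(construct_value_mean, color='red', linewidth=2)
ax.axvline(construct_total_mean, color='green', linewidth=2)

# In a notebook, finish with plt.show(); in a script, fig.savefig('scatter.png').
```

The two mean lines cross at the point (3, 4), which is the point the blue line in the original listing also passes through.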

A simple Bar Graph
The following example code draws a simple bar graph.
Lines 1, 2 and 3 are the imports and the inline setting discussed previously. NumPy is a maths extension to the Python programming language that adds support for large, multidimensional arrays and matrices
Lines 4 and 5 are arrays containing the bar values and the axis labels
Line 6 calculates the mean of the values
Line 7 sets the size of the displayed plot in inches
Line 8 sets the title of the graph
Line 9 determines the height of each of the bars in the bar graph
Line 10 sets where the tick marks go on the Y axis and the label shown at each one
Line 11 draws the mean calculated previously as a horizontal line
Line 12 displays the graph plot.
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
values = [1, 2, 3, 4, 5]
labels = [2, 4, 5, 4, 5]
values_mean = np.mean(values)
plt.gcf().set_size_inches(5, 6, forward=True)
plt.title('example plot')
plt.bar(range(len(values)), values)
plt.yticks(range(len(values)), labels)
plt.axhline(values_mean, color='green', linewidth=6)
plt.show() 
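As a variation on the bar graph, here is a minimal sketch that plots the eleven monthly constructions_total values from the 2007 output shown earlier. Unlike the listing above, which relabels the Y axis ticks, this version labels each bar on the X axis with its month. The names monthly_totals and month_labels, the rotation of the labels, and the Agg backend are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; in a notebook use %matplotlib inline instead
import matplotlib.pyplot as plt
import numpy as np

# constructions_total and date columns from the 2007 output above.
monthly_totals = [4049, 4216, 4888, 4339, 5517, 5107,
                  5201, 5566, 4845, 5537, 5241]
month_labels = ["Jan07", "Feb07", "Mar07", "Apr07", "May07", "Jun07",
                "Jul07", "Aug07", "Sep07", "Oct07", "Nov07"]
totals_mean = np.mean(monthly_totals)

positions = range(len(monthly_totals))  # 0 .. n-1, one slot per bar

fig, ax = plt.subplots()
ax.set_title('constructions_total by month')
ax.bar(positions, monthly_totals)

# Put the month name under each bar instead of the bare position.
ax.set_xticks(list(positions))
ax.set_xticklabels(month_labels, rotation=45)

# Mean of the plotted values as a horizontal reference line.
ax.axhline(totals_mean, color='green', linewidth=2)

# In a notebook, finish with plt.show(); in a script, fig.savefig('bars.png').
```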
A simple Pie Graph example
The following shows how to create a pie chart.
Line 1 sets the colours for each of the segments
Line 3 supplies the input values for each segment
Line 7 sets the angle, in degrees, at which the first segment starts
Line 8 sets the format of the percentage label
colors = ["#E13F29", "#D69A80", "#D63B59", "#AE5552", "#CB5C3B", "#EB8076", "#96624E", "#463B59", "#3E5552", "#AB5C3B", "#BB8076", "#A6624E"]
plt.pie(
indep_xxxx,
shadow=False,
colors=colors,
explode=(0,0,0,0,0,0,0,0,0,0,0,0),
startangle=90,
autopct='%1.1f%%',
)
plt.axis('equal')
plt.show() 
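One detail in the listing above is easy to trip over: explode must have exactly one entry per value, so the hard-coded tuple of twelve zeros only works when indep_xxxx holds twelve values. The sketch below, which uses the eleven constructions_total values from the 2007 output shown earlier, builds explode from the length of the data instead. The names monthly_totals and month_labels and the Agg backend are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; in a notebook use %matplotlib inline instead
import matplotlib.pyplot as plt

# constructions_total and date columns from the 2007 output above.
monthly_totals = [4049, 4216, 4888, 4339, 5517, 5107,
                  5201, 5566, 4845, 5537, 5241]
month_labels = ["Jan07", "Feb07", "Mar07", "Apr07", "May07", "Jun07",
                "Jul07", "Aug07", "Sep07", "Oct07", "Nov07"]

# One offset per segment; all zeros keeps every segment in place.
explode = [0] * len(monthly_totals)

fig, ax = plt.subplots()
ax.pie(
    monthly_totals,
    labels=month_labels,
    explode=explode,
    shadow=False,
    startangle=90,       # first segment starts at 90 degrees
    autopct='%1.1f%%',   # percentage label with one decimal place
)
ax.axis('equal')  # equal aspect ratio so the pie is drawn as a circle

# In a notebook, finish with plt.show(); in a script, fig.savefig('pie.png').
```

Leaving out the colors argument lets Matplotlib fall back to its default colour cycle, which avoids having to keep a colour list in step with the data.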
A simple line graph example
The following line graph generates a position from 0 to n-1 for every entry being plotted. This way the graph flows from left to right.
The code that generates these positions is range(len(indep_xxxx)); the values plotted are the independent X variable list indep_xxxx.
plt.title('constructions for year ' + str(yy))
plt.ylabel('construct_total')
plt.plot(range(len(indep_xxxx)),indep_xxxx)
plt.show()
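The listing above depends on yy and indep_xxxx being defined earlier. A self-contained sketch of the same idea, assuming yy is the year being plotted and using the constructions_total values from the 2007 output shown earlier, might look like this. The month labels on the X axis and the Agg backend are additions for illustration, not part of the original.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; in a notebook use %matplotlib inline instead
import matplotlib.pyplot as plt

yy = 2007  # assumption: the year shown in the title
indep_xxxx = [4049, 4216, 4888, 4339, 5517, 5107,
              5201, 5566, 4845, 5537, 5241]
month_labels = ["Jan07", "Feb07", "Mar07", "Apr07", "May07", "Jun07",
                "Jul07", "Aug07", "Sep07", "Oct07", "Nov07"]

# range(len(...)) gives the positions 0 .. n-1, so the points run left to right.
positions = range(len(indep_xxxx))

fig, ax = plt.subplots()
ax.set_title('constructions for year ' + str(yy))
ax.set_ylabel('construct_total')
ax.plot(positions, indep_xxxx)

# Replace the bare positions on the X axis with the month names.
ax.set_xticks(list(positions))
ax.set_xticklabels(month_labels, rotation=45)

# In a notebook, finish with plt.show(); in a script, fig.savefig('line.png').
```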