Overview

Skill Level: Intermediate

Graphical Visualization of your data with Cloudant, Spark, Python and Matplotlib

Ingredients

Bluemix

Spark

Python

Cloudant

Matplotlib

Step-by-step

  1. Let's quickly review the previous parts.

    To follow the entire tutorial in order, use the links below to the earlier parts.

    https://developer.ibm.com/recipes/tutorials/introduction-to-data-science-tools-in-bluemix/
    https://developer.ibm.com/recipes/tutorials/introduction-to-data-science-tools-in-bluemix-part-2/
    https://developer.ibm.com/recipes/tutorials/introduction-to-data-science-tools-in-bluemix-part-3/

  2. Extracting the year and the month from a dataframe

    Before we look at visualizing data in this recipe, let's first review how, in the previous recipe, we extracted the X and Y coordinates to plot.

    When you are looking for seasonal trends, extracting records by year or month is a common prerequisite. The following code extracts the year and the month from the date_update column.

    year_df = new_construction_dollars_builds["date_update"].dt.year
    print(year_df)

    month_df = new_construction_dollars_builds["date_update"].dt.month
    print(month_df)
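
    To try the extraction without the full dataset, here is a minimal sketch. It assumes that date_update is a datetime column, which the .dt accessor requires, and that it was parsed from the "Jan-07" style strings in the date column. The three-row sample dataframe is a stand-in for new_construction_dollars_builds and is not part of the original recipe.

```python
import pandas as pd

# Hypothetical three-row stand-in for new_construction_dollars_builds.
sample = pd.DataFrame({
    "constructions_total": [4049, 4216, 4888],
    "date": ["Jan-07", "Feb-07", "Mar-07"],
})

# Assumption: date_update is built by parsing the "Jan-07" style strings.
# %b is the abbreviated month name, %y the two-digit year.
sample["date_update"] = pd.to_datetime(sample["date"], format="%b-%y")

# .dt.year and .dt.month return one integer per row.
years = sample["date_update"].dt.year.tolist()
months = sample["date_update"].dt.month.tolist()
print(years)
print(months)
```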

    Let's take this example a little further to make it useful, and extract all the records from our dataframe for a specific month, in this case December. This is very similar to the earlier example where we selected records on or after a given date. Here, however, we are looking for entries in the date_update column that have a month of 12.

    Again, it is important that our data is in date order, so to be sure we sort explicitly before printing the dataframe contents.

    december_builds = new_construction_dollars_builds[(new_construction_dollars_builds['date_update'].dt.month==12)]
    december_builds = december_builds.sort_values(by='date_update')
    print(december_builds)

    To select all the records for a specific year, just replace the month== test with a year== test. Here is a working example.

    Y2007_builds = new_construction_dollars_builds[(new_construction_dollars_builds['date_update'].dt.year==2007)]
    Y2007_builds = Y2007_builds.sort_values(by='date_update')
    print(Y2007_builds)

    Here is the output of the print:

         constructions_total  constructions_value    date  date_update
    46                  4049               943302  Jan-07   2007-01-01
    463                 4216               991984  Feb-07   2007-02-01
    241                 4888              1161053  Mar-07   2007-03-01
    242                 4339              1003470  Apr-07   2007-04-01
    243                 5517              1343940  May-07   2007-05-01
    126                 5107              1262521  Jun-07   2007-06-01
    344                 5201              1262986  Jul-07   2007-07-01
    127                 5566              1367965  Aug-07   2007-08-01
    345                 4845              1201418  Sep-07   2007-09-01
    464                 5537              1366672  Oct-07   2007-10-01
    346                 5241              1255333  Nov-07   2007-11-01
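
    The selection works in two steps: the comparison produces a True/False value for every row, and indexing the dataframe with that mask keeps only the True rows. Here is a small sketch of the same idea on a deliberately unsorted sample. The sample dataframe, including the 2008 row, is hypothetical and only there to show a row being filtered out.

```python
import pandas as pd

# Hypothetical, deliberately unsorted sample that spans two years.
sample = pd.DataFrame({
    "constructions_total": [5241, 4049, 4888, 4100],
    "date_update": pd.to_datetime(
        ["2007-11-01", "2007-01-01", "2007-03-01", "2008-01-01"]
    ),
})

# Step 1: one boolean per row, True where the year is 2007.
in_2007 = sample["date_update"].dt.year == 2007

# Step 2: keep the True rows, then put them in date order.
y2007 = sample[in_2007].sort_values(by="date_update")

print(in_2007.tolist())
print(y2007)
```

    Note that sorting keeps each row's original index label, which is why the left-hand column of the printed output is not in numeric order.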

    Setting the X and Y coordinate values

    The X and Y coordinate values are held as two lists of equal length. “X” is often described as the independent variable and “Y” as the dependent variable. The following code extracts the two columns from the pandas dataframe.

    december_builds = new_construction_dollars_builds[(new_construction_dollars_builds['date_update'].dt.month==12)]
    december_builds = december_builds.sort_values(by='date_update')
    print(december_builds)
    indep_xxxx = december_builds['constructions_total']
    dep_yyyy = december_builds['constructions_value']
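
    The plotting code in the following steps uses names such as construct_total_list and construct_total_mean. The recipe does not show how these are built, so the sketch below is one plausible way to derive them from the two columns, not necessarily the way it was done originally. The small dataframe is a hypothetical stand-in for december_builds.

```python
import pandas as pd

# Hypothetical stand-in for december_builds, using small toy numbers.
december_builds = pd.DataFrame({
    "constructions_total": [1, 2, 3, 4, 5],
    "constructions_value": [2, 4, 5, 4, 5],
})

indep_xxxx = december_builds["constructions_total"]
dep_yyyy = december_builds["constructions_value"]

# Assumption: the lists and means used later come straight from the columns.
construct_total_list = indep_xxxx.tolist()
construct_value_list = dep_yyyy.tolist()
construct_total_mean = indep_xxxx.mean()
construct_value_mean = dep_yyyy.mean()

print(construct_total_list, construct_total_mean)
print(construct_value_list, construct_value_mean)
```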

  3. Introduction to Matplotlib and Graphs with Python

    For people who have used MATLAB, Matplotlib is a similar collection of functions for plotting graphs. Visualisation is an important tool for the data scientist, both to explore result sets and to explain results.

    In the examples below we will see how to create:

    • Scatter plots
    • Bar graphs
    • Line graphs
    • Pie charts

    We will also look at how to show the mean lines and how to label the axes.

    To use Matplotlib, you need to import it in your Spark analysis notebook:

    %matplotlib inline
    import matplotlib.pyplot as plt

  4. A simple scatter plot

    The following piece of code draws a simple scatter plot.

    Line 1 imports the Matplotlib plotting library that allows us to produce the graph

    Line 2 allows us to display it inline in our Bluemix Spark analysis notebook

    Lines 3 & 4 set the “x” and “y” axis labels of our plot

    Line 5 plots a set of x,y coordinates based on two arrays (construct_total_list, construct_value_list)

    Line 6 draws a horizontal line at the mean of the Y values

    Line 7 draws a vertical line at the mean of the X values

    Line 8 draws a straight line through three points

    Line 9 displays the plot.

    import matplotlib.pyplot as plt
    %matplotlib inline
    plt.xlabel('construct_total')
    plt.ylabel('construct_value')
    plt.scatter(construct_total_list, construct_value_list)
    plt.axhline(construct_value_mean, color='red', linewidth=2)
    plt.axvline(construct_total_mean, color='green', linewidth=2)
    plt.plot([0, construct_total_mean,construct_total_mean+(construct_total_mean-0) ], [bo, construct_value_mean, (construct_value_mean+construct_value_mean-bo)], color='blue', linestyle='-', linewidth=2)
    plt.show()

    If construct_total_list contains the array [1, 2, 3, 4, 5] and construct_value_list contains [2, 4, 5, 4, 5], the plot shows the 5 plotted points as well as the mean lines for X and Y. I will review the blue line in the next recipe.
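
    The code above relies on variables defined elsewhere, so here is a self-contained sketch using the two example arrays. One caveat: the recipe does not define bo. The sketch assumes that bo is the y-intercept of a least-squares line through the points, which is consistent with the blue line passing through the point of means, but treat that as a guess until the next recipe. The output file name is hypothetical, and the Agg backend is only selected so that the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # outside a notebook; in a notebook use %matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

construct_total_list = [1, 2, 3, 4, 5]
construct_value_list = [2, 4, 5, 4, 5]
construct_total_mean = np.mean(construct_total_list)
construct_value_mean = np.mean(construct_value_list)

# Assumption: bo is the intercept of a degree-1 least-squares fit.
slope, bo = np.polyfit(construct_total_list, construct_value_list, 1)

plt.xlabel('construct_total')
plt.ylabel('construct_value')
plt.scatter(construct_total_list, construct_value_list)
plt.axhline(construct_value_mean, color='red', linewidth=2)
plt.axvline(construct_total_mean, color='green', linewidth=2)

# Three points on the fitted line: x = 0, the X mean, and twice the X mean.
line_x = [0, construct_total_mean, 2 * construct_total_mean]
line_y = [bo + slope * x for x in line_x]
plt.plot(line_x, line_y, color='blue', linestyle='-', linewidth=2)

plt.savefig('scatter_example.png')  # hypothetical file name
```

    Under this assumption the middle point of the blue line is the point of means, so the red, green and blue lines all cross at the same place.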

  5. A simple bar graph

    The following example code draws a simple bar graph.

    Lines 1, 2 and 3 import the libraries and enable inline plotting, as discussed previously. NumPy is a math library for Python that adds support for large, multi-dimensional arrays and matrices.

    Lines 4 and 5 are lists holding the bar heights (values) and the Y axis tick labels (labels)

    Line 6 calculates the mean of the values

    Line 7 sets the size of the displayed figure, in inches

    Line 8 sets the title of the graph

    Line 9 draws the bars, taking the height of each bar from values

    Line 10 sets the tick positions on the Y axis and the label shown at each one

    Line 11 draws the mean calculated previously as a horizontal line

    Line 12 displays the graph.

    import matplotlib.pyplot as plt
    import numpy as np
    %matplotlib inline
    values = [1, 2, 3, 4, 5]
    labels = [2, 4, 5, 4, 5]
    values_mean = np.mean(values)
    plt.gcf().set_size_inches(5, 6, forward=True)
    plt.title('example plot')
    plt.bar(range(len(values)), values)
    plt.yticks(range(len(values)), labels)
    plt.axhline(values_mean, color='green', linewidth=6)
    plt.show()
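
    As a variation, here is a sketch that applies the same calls to the constructions_total figures for 2007 printed in step 2, and labels each bar with its month using plt.xticks. The month abbreviations, the figure size and the output file name are choices made for this illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # outside a notebook; in a notebook use %matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# constructions_total for January to November 2007, from the output in step 2.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov"]
totals = [4049, 4216, 4888, 4339, 5517, 5107, 5201, 5566, 4845, 5537, 5241]
totals_mean = np.mean(totals)

positions = range(len(totals))           # one X position per bar: 0, 1, 2, ...
plt.figure(figsize=(8, 4))
plt.title('constructions_total by month, 2007')
bars = plt.bar(positions, totals)        # bar heights come from totals
plt.xticks(positions, months)            # replace 0, 1, 2, ... with month names
plt.axhline(totals_mean, color='green', linewidth=2)
plt.savefig('bar_example.png')           # hypothetical file name
```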

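  6. A simple pie chart example

    The following shows how to create a pie chart.

    Line 1 sets the colours for each of the segments

    Line 3 supplies the input values for each segment

    Line 6 sets how far each segment is pulled out from the centre; it must contain one entry per input value, 12 in this example

    Line 7 sets the angle, in degrees, at which the first segment starts

    Line 8 sets the format of the percentage label shown on each segment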

    colors = ["#E13F29", "#D69A80", "#D63B59", "#AE5552", "#CB5C3B", "#EB8076", "#96624E", "#463B59", "#3E5552", "#AB5C3B", "#BB8076", "#A6624E"]
    plt.pie(
        indep_xxxx,
        shadow=False,
        colors=colors,
        explode=(0,0,0,0,0,0,0,0,0,0,0,0),
        startangle=90,
        autopct='%1.1f%%',
    )
    plt.axis('equal')
    plt.show()
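
    Because the explode setting needs exactly one entry per value, a hard-coded tuple of twelve zeros fails when the data has a different number of rows. Here is a sketch that builds it from the data instead. The four values are the January to April 2007 totals from step 2; the labels, the pulled-out first segment and the output file name are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # outside a notebook; in a notebook use %matplotlib inline
import matplotlib.pyplot as plt

# constructions_total for January to April 2007, from the output in step 2.
values = [4049, 4216, 4888, 4339]
labels = ["Jan", "Feb", "Mar", "Apr"]

# One explode entry per value, so the lengths always match.
explode = [0] * len(values)
explode[0] = 0.1                  # pull the first segment out slightly

plt.figure()
plt.pie(
    values,
    labels=labels,
    explode=explode,
    startangle=90,                # first segment starts at 90 degrees
    autopct='%1.1f%%',            # percentage with one decimal place
)
plt.axis('equal')                 # equal axis scaling keeps the pie circular
plt.savefig('pie_example.png')    # hypothetical file name
```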

  7. A simple line graph example

    The following line graph uses a generated X value, from 0 to n-1, for every entry being plotted. This way the graph flows from left to right.

    The code that generates these values is range(len(indep_xxxx)); the values plotted are those in the independent X variable list, indep_xxxx.

    plt.title('constructions for year ' + str(yy))
    plt.ylabel('construct_total')
    plt.plot(range(len(indep_xxxx)),indep_xxxx)
    plt.show()
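
    Here is a self-contained sketch of the same graph using the 2007 constructions_total figures from step 2. It assumes that yy holds the year being plotted, which the recipe does not show. The markers, the month labels and the output file name are additions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # outside a notebook; in a notebook use %matplotlib inline
import matplotlib.pyplot as plt

yy = 2007  # assumption: yy is the year being plotted
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov"]
indep_xxxx = [4049, 4216, 4888, 4339, 5517, 5107, 5201, 5566, 4845, 5537, 5241]

x_positions = list(range(len(indep_xxxx)))   # 0, 1, ..., n-1

plt.figure()
plt.title('constructions for year ' + str(yy))
plt.ylabel('construct_total')
lines = plt.plot(x_positions, indep_xxxx, marker='o')
plt.xticks(x_positions, months)              # show month names instead of 0..n-1
plt.savefig('line_example.png')              # hypothetical file name
```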
