Contents


Overview

Skill Level: Any Skill Level

Intermediate

Manipulation of your NoSQL data with Python data analysis tool kit (PANDAS)

Ingredients

spark

python

dataframes

Pandas

pandas

matplotlib - pyplot

Step-by-step

  1. Lets quickly review Part 2.

    To use and learn the entire tutorial together – follow these links for all parts.

    https://developer.ibm.com/recipes/tutorials/introduction-to-data-science-tools-in-bluemix/
    https://developer.ibm.com/recipes/tutorials/introduction-to-data-science-tools-in-bluemix-part-2/
    https://developer.ibm.com/recipes/tutorials/introduction-to-data-science-tools-in-bluemix-part-3/

    Part two of of this Tutoriall showed us how to connect to a NoSQL cloudant database. You can read it here https://developer.ibm.com/recipes/tutorials/introduction-to-data-science-tools-in-bluemix-part-2/ Now lets see some working examples of date and data manipulation that will give you some working tools to do your own data analytics.

  2. Introducing Pandas:

    It is common for the many people who use the Python language for data-science to use a tool-kit named Pandas. Pandas is a fast, flexible, toolkit that makes working with relational or labeled data simple. It helps to simplify the process of being able to do practical analysis tasks in Python.

    Pandas simplifies the creation of tabular data and allows it to be accessed simply. Its ability to manage time series data makes it suitable for complex analytics tasks. Pandas has two main data structures Series which are single dimensional DataFrame 2-dimensional.

    In the Pandas examples here we will see how to use some of the more commonly used features that make Pandas an excellent tool kit to start your data science activities.

  3. Using Python to manipulate dates.

    A big part of any data science activity is learning how to put the data in a format that helps you gain insight. A common task is looking at the data in time segments, joining them on date patterns or time of year dates. In this recipe we will look at how we transform dates so they can be used as date formats rather than text strings. In addition we will look as joining data frames from multiple data sources. Finally we will look as some simple data selection and data manipulation examples that can form the basis of any data science or data analytics task.

    In the previous recipe we created and loaded our Couldant database myexampledb – and connected to it to select some data from it. We will use this data as the basis of all the examples in this recipe.

    The myexampledb looks like this in structure:

    Note that date is just a string and as such will sort differently to a date format. Also doing date arithmetic can be a challenge and complex if dates are not is a usable format. Lets see how we can address this.

      root
    |-- _id: string (nullable = true)
    |-- _rev: string (nullable = true)
    |-- constructions_total: long (nullable = true)
    |-- constructions_value: long (nullable = true)
    |-- date: string (nullable = true)
    |-- established_dwellings_no: long (nullable = true)
    |-- established_dwellings_value: long (nullable = true)
    |-- new_dwellings_purchased: long (nullable = true)
    |-- new_dwellings_value: long (nullable = true)
    |-- total_dwellings_no: long (nullable = true)
    |-- total_dwellings_value: long (nullable = true)


    +-------------------+-------------------+------+
    |constructions_total|constructions_value| date|
    +-------------------+-------------------++
  4. Copying your Cloudant data to Pandas for manipulation

    The most simple thing we can do is copy an entire dataframe into a Pandas dataframe so we can start manipulation of it in Python and Spark. To do this – in the last recipe we created a dataframe named df, we will copy that into a Pandas dataframe called construction_dollars_builds.

    To do this enter this into your Spark Notebook.

    import pandas as pd
    construction_dollars_builds = df.toPandas()
    print construction_dollars_builds

    Now we have a dataframe in memory named “construction_dollars_builds“.

    On printing “construction_dollars_builds” you will notice that the dates all look fine. That is becuase they were loaded in Date. However the dates are “character strings” and as such will sort differently. Below is just and extract of what it looks like after a simple print.

      constructions_total constructions_value date
    0 3716 153163 Sep-86
    1 3612 158536 Nov-86
    2 2831 125399 Jan-87
    3 5538 269765 Feb-88
    4 5429 279353 Apr-88
    5 6181 326677 Sep-88
    6 5966 330393 Feb-98

    Now lets sort the data in on the date column. To do this run a simple sort of the dataframe in memory like this;

    print construction_dollars_builds.sort_values(by='date')  

    You will notice the output is sorted in Character Sequence instead of a date sequene.

      constructions_total constructions_value date
    221 4100 616327 Apr-00
    330 3862 588911 Apr-01
    225 4887 788662 Apr-02
    455 4606 857607 Apr-03
    233 4402 899755 Apr-04
    40 4478 976879 Apr-05
    44 4047 883281 Apr-06
    242 4339 1003470 Apr-07

  5. Formatting a date series in Python

    You will notice that the date is in format of MMMM-YY, this is a concern because the year is not specific. Because I know the data, I have made a rule in this case that everything less than 20 is for the year 2000 and beyond. Everything 20 and above is for the 1900s.

    The next concern is that I need my date format in YYYY_MM-DD format and there is no days in the source date. I am going to default it to 01

    Finally I need to transpose the character month to a numeric for the month, for example Dec will become 12

    To do this I have created a simple callable function named build_date.

    The build_date function accepts an input date in “MMM-YY” format and extracts the data. The month is used to index the months “dictionary” to extract the corresponding numeric month. This is just a simpler way of doing it rather than a long “if.. then..else..if…” statement.

    Finally the date is reconstructed and returned in the form of “YYYY-MM-DD”.

    Next thing we need to do is make a series of the dates.

    from datetime import date
    import datetime
    months = {"Aug":8,"Sep":9,"Jan":01,"Feb":02,"Mar":03,"Apr":04,"May":05,"Jun":06,"Jul":07,"Oct":10,"Nov":11,"Dec":12 }
    def build_date( indate ):
    hyphan = indate.index('-')
    dd = 01
    mm = indate[:hyphan] mm = int(months[mm])
    yy = int(indate[hyphan+1:])
    if yy < 20:
    yy = yy + 2000
    else:
    yy = yy + 1900
    d1 = datetime.date(yy, mm, dd)
    print d1
    the_date = str(yy)+'-'+str(mm)+'-'+str(dd)
    return (the_date)
  6. Create a new dataframe with a reformatted date

    The next thing we wish to do is create an empty list / series named new_date_series.

    new_date_series = []

    Then we will select the date data from Cloudant. Note that we are only selecting the one column.
    dateDF = cloudantdata.select(cloudantdata['date'] )

    The next section we will simply as we did before copy the selected Cloudant data into a Pandas dataframe (dateDF)

    To process the Pandas data, we need to extract all the values and place them now into a Python List. The Python list we will use is date_array_1 and it contains a list of dates in MMMM-YY format.

    Next we send the date to our conversion function we created previously and then append it to our new_date_series list.

    Now we have a list called new_date_series that is a list of our dates from Cloudant except that it is in YYYY-MM-DD format. Here is an extract of the code to do this.

    date_array_1DF = dateDF.toPandas()
    date_array_1 = date_array_1DF.values
    for i in range(len(date_array_1)):
    fixed_date = build_date(date_array_1[i][0])
    new_date_series.append(fixed_date)

    df1 = pd.DataFrame({date_update': new_date_series})
    df1[date_update] = pd.to_datetime(df1[date_update])
    print df1

    The final part of this code is the creation of a new dataframe “df1” and we have given it a new column name “date_update”. You will notice that we have used the python list as input to create this Pandas dataframe. The last line that contains the “pd.to_datetime” sets the datatype of the “date_update” column to “datetime64[ns]” rather than “object” or “string”

    To ensure that you have the right datatype encoding enter this for dataframe “df1”:

     df1.dtypes

    and the output should say that it is a :

    date_update datetime64[ns]dtype: object  
  7. Merging two dataframes

    Now we have a date formatted data-frame we can do date artithmatic on it, sort it and group it. It will work as we would expect dates to work. However to use it with our construction_dollars_builds dataframe, we need to merge it to the original dataframe. Before we start, it is important that in this example we have not modified the date orders from the way we extracted it. This way – we can just merge it back and it should match. Lets see how to merge to dataframes.

    new_construction_dollars_builds = pd.concat([construction_dollars_builds, df1], axis=1, join='inner')

    print construction_dollars_builds

    Lets have a look at the output when we print it out. You will notice the new format of the date in column “date_update”

      constructions_total constructions_value date date_update
    0 3716 153163 Sep-86 1986-09-01
    1 3612 158536 Nov-86 1986-11-01
    2 2831 125399 Jan-87 1987-01-01
    3 5538 269765 Feb-88 1988-02-01
    4 5429 279353 Apr-88 1988-04-01
    5 6181 326677 Sep-88 1988-09-01
    6 5966 330393 Feb-90 1990-02-01
    7 4730 273301 Jan-91 1991-01-01
    8 5091 313930 Mar-91 1991-03-01
    9 6510 411565 Jul-91 1991-07-01
    10 5368 340521 Dec-91 1991-12-01

    This is a diagramatic representation of what we have just done.

  8. Sorting a dataframe in Date sequence order

    Often you will need to sort your dataframes in ascending or descending date order. The example below sorts in decending order. (set ascending=True for ascending order).

    print new_construction_dollars_builds.sort_values(by='date_update',ascending=False)
      constructions_total constructions_value date date_update
    368 5542 1758147 Mar-16 2016-03-01
    263 5414 1691254 Feb-16 2016-02-01
    485 4414 1387893 Jan-16 2016-01-01
    484 6092 1907152 Dec-15 2015-12-01
    148 5955 1861069 Nov-15 2015-11-01
    367 5903 1831190 Oct-15 2015-10-01
    147 6050 1899554 Sep-15 2015-09-01
    262 5676 1725809 Aug-15 2015-08-01
    261 6109 1843711 Jul-15 2015-07-01
    483 6313 1910760 Jun-15 2015-06-01
    62 6077 1814751 May-15 2015-05-01

    Note that if we sorted by the “date” column which is just in String format, then it would sort in character sequence and not date sequence.

     print new_construction_dollars_builds.sort_values(by='date')

    Gives

     221 4100 616327 Apr-00 2000-04-01
    330 3862 588911 Apr-01 2001-04-01
    225 4887 788662 Apr-02 2002-04-01
    455 4606 857607 Apr-03 2003-04-01
    233 4402 899755 Apr-04 2004-04-01
    40 4478 976879 Apr-05 2005-04-01
    44 4047 883281 Apr-06 2006-04-01
    242 4339 1003470 Apr-07 2007-04-01
    244 4994 1232260 Apr-08 2008-04-01
    131 5887 1494900 Apr-09 2009-04-01
    473 5179 1387583 Apr-10 2010-04-01
  9. Selecting records from a dataframe after a specific date

    Now we have a data-frame sorted in date format, it is a common request to select dates or before a specific date. The example below shows how this is done.

    The example below selects the column “date_update” from our dataframe new_construction_dollars_builds. Speciifcally we are looking to select records that are >= 19961101.

    We will store this subset into a new dataframe named “after_1996”.

    after_1996= (new_construction_dollars_builds[(new_construction_dollars_builds['date_update'] >='19961101')])
    after_1996= after_1996.sort_values(by='date_update')
    print after_1996

    Here is a visualization of what the code above just did.

  10. Extracting the Year and the month from a data-frame

    Often you are looking for seasonal trends, to do this, extracting records based on years or months is a common prerequisite. The following code extracts the month or year.

    year_df = new_construction_dollars_builds["date_update"].dt.year
    print year_df


    month_df = new_construction_dollars_builds["date_update"].dt.month
    print month_df

    Lets take this example a little further to make it useful. Lets extract ALL the records from our dataframe for a specific month, in this case, December. This is very similar to the example where we selected records >= to date. However in this case we are looking for entries in the “date_update” column that have a month of “12”.

    Again, it is important that our data is sorted in sequence and to be sure we will do an explicit sort before we print the dataframe contents out.

    december_builds = new_construction_dollars_builds[(new_construction_dollars_builds['date_update'].dt.month==12) ] 
    december_builds = december_builds.sort_values(by='date_update')
    print december_builds

    To select all the records for a specific year. Just replace the month== with a year==. Here is a working example.

    Y2007_builds = new_construction_dollars_builds[(new_construction_dollars_builds['date_update'].dt.year==2007) ] 
    Y2007_builds = Y2007_builds.sort_values(by='date_update')
    print Y2007_builds

    and here is the output of the print

       constructions_total constructions_value date date_update
    46 4049 943302 Jan-07 2007-01-01
    463 4216 991984 Feb-07 2007-02-01
    241 4888 1161053 Mar-07 2007-03-01
    242 4339 1003470 Apr-07 2007-04-01
    243 5517 1343940 May-07 2007-05-01
    126 5107 1262521 Jun-07 2007-06-01
    344 5201 1262986 Jul-07 2007-07-01
    127 5566 1367965 Aug-07 2007-08-01
    345 4845 1201418 Sep-07 2007-09-01
    464 5537 1366672 Oct-07 2007-10-01
    346 5241 1255333 Nov-07 2007-11-01
    465 4576 1109614 Dec-07 2007-12-01

  11. Selecting records based on a date range

    Another common but not final task is the selection of records based on a date range. This task is common when wanting to select just a subset of data based on an event in time to look for corrolations.

    Here is the code to do this, it is very similar to the selection of records after a particular date, however it has a second date that forms the limiter of selection.

    range_1996= (new_construction_dollars_builds[(new_construction_dollars_builds['date_update'] >='19961101') & (new_construction_dollars_builds['date_update'] <='19991101')])
    range_1996= range_1996.sort_values(by='date_update')
    print range_1996

    This is what the selection gives;

     
    constructions_total constructions_value date date_update
    436 4944 484726 Nov-96 1996-11-01
    437 4543 452142 Dec-96 1996-12-01
    212 4503 448104 Jan-97 1997-01-01
    438 5164 539487 Feb-97 1997-02-01
    439 5027 533933 Mar-97 1997-03-01
    25 5892 621508 Apr-97 1997-04-01
    105 5953 634748 May-97 1997-05-01
    26 5370 571622 Jun-97 1997-06-01
    320 5938 652918 Jul-97 1997-07-01
    106 5678 631603 Aug-97 1997-08-01
    107 6158 695373 Sep-97 1997-09-01
    27 6053 682545 Oct-97 1997-10-01
    321 5286 617381 Nov-97 1997-11-01
    322 5423 633704 Dec-97 1997-12-01
    440 4503 521010 Jan-98 1998-01-01
    213 5663 646187 Feb-98 1998-02-01
    214 5908 697119 Mar-98 1998-03-01
    215 5800 661433 Apr-98 1998-04-01
    28 5866 671328 May-98 1998-05-01
    441 6410 781940 Jun-98 1998-06-01
    442 5802 731876 Jul-98 1998-07-01
    216 5114 653081 Aug-98 1998-08-01
    29 5483 709426 Sep-98 1998-09-01
    323 5344 675853 Oct-98 1998-10-01
    108 5227 666648 Nov-98 1998-11-01
    217 5178 683040 Dec-98 1998-12-01
    218 4082 524478 Jan-99 1999-01-01
    219 5471 702066 Feb-99 1999-02-01
    443 7058 934493 Mar-99 1999-03-01
    30 5937 769642 Apr-99 1999-04-01
    444 5982 818051 May-99 1999-05-01
    445 6332 884023 Jun-99 1999-06-01
    109 6038 825264 Jul-99 1999-07-01
    324 6295 884166 Aug-99 1999-08-01
    446 6429 925200 Sep-99 1999-09-01
    220 6403 930224 Oct-99 1999-10-01
    31 6864 994970 Nov-99 1999-11-01

Join The Discussion