Getting insight into market trends is easier than ever before. Here, I’ll show you how to use a few cloud-based data services to understand the worldwide automotive market, its brands, and its customers.

This tutorial walks you through:

  • setting up Apache Spark, the dashDB data warehouse, and IBM Insights for Twitter
  • importing data from Twitter
  • creating a Python notebook
  • shaping and prepping the data
  • using the Natural Language Toolkit for text processing
  • performing sophisticated analyses and visualizations in the Python notebook

Save Time! If you don’t feel like walking through installation, but want to understand the analytics, skip ahead to the Analyze in Python Notebook section and follow along using our sample notebook on GitHub: https://github.com/ibm-cds-labs/spark.samples/blob/master/notebook/DashDB%20Twitter%20Car%202015%20Python%20Notebook.ipynb

All the steps in this tutorial are also illustrated in an accompanying slide deck.

Provision Services in Bluemix

We’ll start by setting up the 3 cloud data technologies we’ll use to perform analysis: Apache Spark, dashDB, and IBM Insights for Twitter. We’ll do it all on Bluemix, IBM’s cloud platform.

  1. Log in to Bluemix (or sign up for a free trial).

  2. On your Bluemix dashboard, go to the upper left of the screen and click Create a new space.

    A new space gives you a fresh clean environment in which to run your services and data analysis.

  3. Name the space and click Create.

Create a new IBM Analytics for Apache Spark Instance

  1. On your Bluemix dashboard, click Work with Data.
  2. Click New Service.
  3. Find and click Apache Spark then click Choose Apache Spark.
  4. Click Create.

Create a dashDB Instance

  1. Return to your Bluemix dashboard.
  2. Click Work with Data.
  3. Click New Service.
  4. Find and click dashDB then click Choose dashDB.
  5. Click Create.

Create an IBM Insights for Twitter Instance

  1. Return to your Bluemix dashboard.
  2. Click Work with Data.
  3. Click New Service.
  4. Find and click Insights for Twitter then click Choose Insights for Twitter.
  5. Click Create.

Load data

Load car-related tweets from IBM Insights for Twitter into your dashDB warehouse.

  1. Launch dashDB

    Return to your Bluemix dashboard and click the dashDB tile. Then click the Launch button.

  2. From the menu on the left, choose Load > Load Twitter Data

    dashDB finds the Insights for Twitter service you just instantiated.

  3. Click Next.

  4. In the Search for Twitter data box, enter the following query:

    posted:2015-01-01,2015-12-31 followers_count:2000 listed_count:1000 (volkswagen OR vw OR toyota OR daimler OR mercedes OR bmw OR gm OR "general motors" OR tesla)

    This query limits results to tweets posted in 2015 that mention any of the listed car brands, and uses the follower-count and list-count filters to focus on more influential accounts.
  5. Click Get Tweet Count.
  6. Once the count loads, click Next.

  7. In the Load the data into new tables with this prefix field, enter any text you wish for a namespace and click Next.

    Depending on your bandwidth, data loading may take 15 to 60 minutes. dashDB shows you progress as the tweets load.

  8. When the data finishes loading, click Next.

  9. Peruse the reports that dashDB shows.

  10. Click the Data tab to explore the data.

  11. Copy connection details

    From the menu on the left, choose Connect > Connection Information and copy the User ID, Password and JDBC URL string. You’ll need these in a few minutes.

Analyze in Python Notebook

Now that we’ve got the data, we can use Apache Spark to perform some fast analysis and manage it all through a Python notebook.

Create new notebook

Create a notebook on IBM’s Data Science Experience (DSX):

  1. Sign in or create a trial account on DSX.
  2. Create a new project (or select an existing project).

    On the upper right of the screen, click the + plus sign and choose Create project.

  3. Add a new notebook (From URL) within the project.
    1. Click add notebooks.
    2. Click From URL.
    3. Enter notebook name.
    4. In the Notebook URL field, enter:
      https://github.com/ibm-cds-labs/spark.samples/raw/master/notebook/DashDB%20Twitter%20Car%202015%20Python%20Notebook.ipynb

      This sophisticated notebook comes complete with data shaping, analyses, and visualizations baked in. We borrowed the notebook from Bassel Zeidan, the IBM developer who created it. Thanks, Bassel!

    5. Select your Spark Service.
    6. Click Create Notebook.
  4. If prompted, select a kernel for the notebook. The notebook should successfully import.

When you use a notebook in DSX, you run a cell by selecting it and then clicking the Run Cell (▸) button. If you don’t see the Run Cell button and the Jupyter toolbar, click the pencil (Edit) icon in the toolbar.

Set up tools and data for analysis

  1. Import Python packages

    Run cells 1 and 2 to install the Natural Language Toolkit (NLTK). We’ll use this package to filter stop words in a few minutes.

    !pip install nltk --user
    import nltk
    nltk.download("stopwords")
  2. Import Python modules and set up the SQLContext by running the commands in cell 3.
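
    The article doesn’t reproduce cell 3. As a rough sketch (assuming only the modules that the later cells rely on; in DSX, sc and sqlContext may already be provided), it looks something like this:

    # Sketch of cell 3 (assumed): import the analysis modules used in later cells
    import re
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from operator import add                 # used by reduceByKey in cell 26
    import pyspark.sql.functions as F        # aggregation functions (count, mean, sum)
    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)              # sc is the SparkContext provided by the notebook environment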
  3. Define global variables.

    Run cell 4 to set up various data structures we’ll use throughout this notebook.

    car_makers_list = [['bmw'], ['daimler', 'mercedes'], ['gm', 'general motors'], ['tesla'], ['toyota'], ['vw', 'volkswagen']]
    
    car_makers_name_list = []
    for car_maker in car_makers_list:
        car_makers_name_list.append(car_maker[0].upper())
    
    # plotting variables
    ind = np.arange(len(car_makers_list))   # index list for plotting
    width = 0.8                             # the width of the bars in the bar plots
    
    num_car_makers = len(car_makers_list)
    
    ## car feature terms (English, German, French, Spanish)
    electric_vehicle_terms = ['electric car', 'electric vehicle', 'electric motor', 'hybrid vehicle', 'Hybrid car', 'elektroauto', 'elektrofahrzeug',
                          'hybridautos', 'voiture hyprid', 'coche híbrido', 'Auto Hibrido', 'vehículo híbrido', 'elektrovehikel', 'voiture électrique', 'coche eléctrico']
    auto_driver_terms = ['auto drive', 'autodrive', 'autonomous', 'driverless', 'self driving', 'robotic', 'autonomes', 'selbstfahrendes', 'autonome', 'autónomo']
    
    # SCHEMA is your dashDB user ID; PREFIX is the table prefix you chose when loading the tweets
    SCHEMA="DASH7504."
    PREFIX="AUTO_"
  4. Run cell 5 to set up some global helper functions for Google Maps and missing dates.
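
    Cell 5’s helpers aren’t shown in this article. As a purely illustrative sketch, a “missing dates” helper might fill gaps in a daily tweet-count series like this (fill_missing_dates is a hypothetical name, not the notebook’s function):

    import pandas as pd

    def fill_missing_dates(counts_by_day):
        # counts_by_day: pandas Series indexed by date strings like '2015-10-01'
        idx = pd.to_datetime(counts_by_day.index)
        full_range = pd.date_range(idx.min(), idx.max(), freq='D')
        # reindex so days with no tweets appear as zeros instead of gaps in the plot
        return pd.Series(counts_by_day.values, index=idx).reindex(full_range, fill_value=0)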
  5. Acquire the data from dashDB.

    Enter the credentials you copied from your dashDB connections page.

    • Copy in your user ID and password from dashDB.
    • In the jdbcurl entry, copy your dashDB JDBC URL string, only up to BLUDB.

    [Screenshot: dashDB credentials entered in the notebook cell]

    Then run the cell.
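
    For reference, that cell reads the dashDB tables over Spark’s JDBC data source, roughly along these lines (the variable names, table names, and options here are assumptions based on the rest of the notebook, not the notebook’s exact code):

    # Rough sketch of the data-acquisition cell: load the tweet and sentiment tables from dashDB
    jdbcurl  = "jdbc:db2://<your-dashdb-host>:50000/BLUDB"   # paste your JDBC URL, only up to BLUDB
    user     = "dashXXXX"                                    # your dashDB user ID
    password = "********"                                    # your dashDB password

    def load_table(table_name):
        # SCHEMA and PREFIX were defined in cell 4
        return sqlContext.read.format('jdbc').options(
            url=jdbcurl, user=user, password=password,
            dbtable=SCHEMA + PREFIX + table_name).load()

    df_TWEETS     = load_table('TWEETS')       # used throughout the rest of the notebook
    df_SENTIMENTS = load_table('SENTIMENTS')   # sentiment annotations joined in cell 8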

  6. Run cell 7 to count the tweets in the data frame; we’ll use this count in further processing.

  7. Assign sentiment score to each tweet.

    Run the commands in cell 8 to:

    • Join the Tweets and Sentiments tables
    • Encode each sentiment label as a number (POSITIVE=+1, NEGATIVE=-1, AMBIVALENT=0)
    • Average the sentiment scores associated with each tweet
    • Instrument the code (with %time) to report execution times.
    from pyspark.sql.functions import UserDefinedFunction
    from pyspark.sql.types import *
    udf = UserDefinedFunction(lambda x: 0 if x=='AMBIVALENT' else 1 if x=='POSITIVE' else -1, IntegerType())
    udf2 = UserDefinedFunction(lambda x: 'POSITIVE' if x>0 else 'NEGATIVE' if x<0 else 'AMBIVALENT', StringType())
    
    %time df=df_TWEETS.join(df_SENTIMENTS, "MESSAGE_ID")
    %time df=df.withColumn('SENTIMENT_POLARITY', udf(df.SENTIMENT_POLARITY) ).groupBy('MESSAGE_ID').agg(F.mean('SENTIMENT_POLARITY').alias("SENTIMENT_POLARITY"))
    %time df=df.withColumn('SENTIMENT', udf2(df.SENTIMENT_POLARITY))
    %time df_JOIN_TWEETS=df_TWEETS.join(df, "MESSAGE_ID")
    %time df_JOIN_TWEETS.printSchema()
    %time df_JOIN_TWEETS.count()
  8. Transform the data.

    Run the command in cell 9 to create a clean working data frame that is easier to analyze.
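
    The article doesn’t show cell 9. As a rough sketch of the idea (the source column names and the INFLUENCE definition are assumptions, not the notebook’s exact code), it selects and renames just the fields the analysis needs:

    # Sketch: keep only the columns used later, under short, analysis-friendly names
    df_cleaned_tweets = df_JOIN_TWEETS.select(
        'MESSAGE_ID',
        'MESSAGE_BODY',
        'SENTIMENT',
        'USER_GENDER',
        'USER_COUNTRY',
        df_JOIN_TWEETS['MESSAGE_POSTED_TIME'].substr(1, 10).alias('POSTING_TIME'),   # keep year-month-day only
        df_JOIN_TWEETS['USER_FOLLOWERS_COUNT'].alias('INFLUENCE'))                   # simple influence proxy
    df_cleaned_tweets.cache()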

Geographic Analysis

  1. Run cell 10 to group tweets by country.
    df_cleaned_tweets_countries = df_cleaned_tweets.groupBy('USER_COUNTRY') \
                                .agg(F.count('MESSAGE_BODY').alias('NUM_TWEETS')) \
                                .orderBy('NUM_TWEETS', ascending=False)
    df_cleaned_tweets_countries.cache()
    df_cleaned_tweets_countries.show(5)
  2. Run cell 11 to convert the Spark DataFrame to a pandas data structure for visualization.
    p_df_cleaned_tweets_countries = df_cleaned_tweets_countries.toPandas()
    # .loc replaces the deprecated .ix indexer; relabel missing countries as UNKNOWN
    p_df_cleaned_tweets_countries.loc[p_df_cleaned_tweets_countries['USER_COUNTRY'] == 'NONE', 'USER_COUNTRY'] = 'UNKNOWN'
  3. To see the geographic tweet distribution in a bar chart, run cell 12.
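
    If you’re curious what cell 12 roughly does, here is a hedged sketch of a bar chart over the pandas country counts (the notebook’s actual styling differs):

    # Sketch: bar chart of the ten countries with the most tweets
    top = p_df_cleaned_tweets_countries.head(10)      # already sorted by NUM_TWEETS, descending
    fig, ax = plt.subplots(figsize=(15, 6))
    ax.bar(np.arange(len(top)), top['NUM_TWEETS'], width)   # width was defined in cell 4
    ax.set_xticks(np.arange(len(top)))
    ax.set_xticklabels(top['USER_COUNTRY'], rotation=45)
    ax.set_ylabel('Number of tweets')
    ax.set_title('Tweets by country')
    plt.show()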
  4. See locations for tweets on a Google map by running cell 13.

    [Screenshot: tweet locations plotted on a Google map]

  5. Clean up memory.

    Resources on the Spark driver machine, including memory, are not infinite. So, run the following code in cell 14 to get rid of country info before you move on to perform more analysis.

    df_cleaned_tweets_countries.unpersist()
    df_cleaned_tweets_countries = None
    p_df_cleaned_tweets_countries = None

Sentiment Analysis

  1. Run the commands in cell 15 to group tweets by sentiment, aggregate counts, and convert the Spark DataFrame to pandas for visualization.
    #get the number of tweets per sentiment value (POSITIVE, NEGATIVE, AMBIVALENT) by grouping on sentiment
    tweets_grouped_by_sentiment = df_cleaned_tweets \
                .groupBy('SENTIMENT') \
                .agg(F.count('MESSAGE_ID').alias('NUM_TWEETS'))
    tweets_grouped_by_sentiment.cache()
    tweets_grouped_by_sentiment.show(5)
    
    #move the results to pandas
    p_tweets_grouped_by_sentiment = tweets_grouped_by_sentiment.toPandas()
  2. Visualize with a Matplotlib pie chart by running cell 16:
    #data plot 1
    plot1_labels = p_tweets_grouped_by_sentiment['SENTIMENT']
    plot1_values = p_tweets_grouped_by_sentiment['NUM_TWEETS']
    plot1_colors = ['blue', 'red', 'gray', 'yellow', 'green']
    
    #data plot 2
    cond1 = (p_tweets_grouped_by_sentiment['SENTIMENT'] == 'POSITIVE')
    cond2 = (p_tweets_grouped_by_sentiment['SENTIMENT'] == 'NEGATIVE')
    pMessage_sentiment_statistics_defined = p_tweets_grouped_by_sentiment[cond1 | cond2]
    plot2_labels = pMessage_sentiment_statistics_defined['SENTIMENT']
    plot2_values = pMessage_sentiment_statistics_defined['NUM_TWEETS']
    plot2_colors = ['blue', 'red']
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(23, 10))
    axes[0].pie(plot1_values,  labels=plot1_labels, colors=plot1_colors, autopct='%1.1f%%')
    axes[0].set_title('Percentage of Sentiment Values in all Tweets')
    axes[0].set_aspect('equal')
    axes[0].legend(loc="upper right", labels=plot1_labels)
    
    # Plot
    axes[1].pie(plot2_values,  labels=plot2_labels, colors=plot2_colors, autopct='%1.1f%%')
    axes[1].set_title('Percentage of Positive and Negative Sentiment Values in all Tweets')
    axes[1].set_aspect('equal')
    axes[1].legend(loc="upper right", labels=plot2_labels)
    fig.subplots_adjust(hspace=1)
    plt.show()

    [Screenshot: pie charts of sentiment distribution across all tweets]

  3. Clear memory again by running cell 17.

Timeline Analysis

  1. Analyze tweet timelines. Run cell 18 to do the following:
    • group by posting time and sentiment
      #group by year-month-day and the sentiment
      df_num_tweets_and_sentiment_over_time = df_cleaned_tweets.groupBy('POSTING_TIME', 'SENTIMENT') \
                  .agg(F.count('MESSAGE_BODY').alias('NUM_TWEETS')) \
                  .orderBy('POSTING_TIME', ascending=True)
    • aggregate total tweets per day, in chronological order
      #group by year-month-day
      df_num_tweets_over_time = df_num_tweets_and_sentiment_over_time.groupBy('POSTING_TIME') \
                  .agg(F.sum('NUM_TWEETS').alias('NUM_TWEETS')) \
                  .orderBy('POSTING_TIME', ascending=True)
    • convert the Spark DataFrames to pandas for visualization
      #move to Pandas
      p_df_num_tweets_and_sentiment_over_time = df_num_tweets_and_sentiment_over_time.toPandas()
      p_df_num_tweets_over_time = df_num_tweets_over_time.toPandas()
  2. Prepare data structures for plotting

    Run cell 19 to move different sentiment values into different data frames.
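
    As a rough sketch (the variable names here are assumptions, not the notebook’s exact code), cell 19 splits the pandas timeline into one frame per sentiment value:

    # Sketch: one data frame per sentiment value, for separate plot lines
    p_positive = p_df_num_tweets_and_sentiment_over_time[
        p_df_num_tweets_and_sentiment_over_time['SENTIMENT'] == 'POSITIVE']
    p_negative = p_df_num_tweets_and_sentiment_over_time[
        p_df_num_tweets_and_sentiment_over_time['SENTIMENT'] == 'NEGATIVE']
    p_ambivalent = p_df_num_tweets_and_sentiment_over_time[
        p_df_num_tweets_and_sentiment_over_time['SENTIMENT'] == 'AMBIVALENT']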

  3. To see a time series visualization, run cell 20. You’ll see a huge spike in October 2015, especially in negative tweets.

    [Screenshot: timeline of automotive tweets by sentiment, with a spike in October 2015]

  4. Clear memory again by running cell 21.

Manufacturer Analysis

  1. Set up data for a deep dive into car manufacturers.

    • Run the command in cell 22 to create a new dataframe that enriches tweets with extra metadata, including boolean values for manufacturer, electric car, and self-driving car.
      columns_names = ['MESSAGE_ID', 'MESSAGE_BODY', 'SENTIMENT', 'USER_GENDER', 'USER_COUNTRY', 'POSTING_TIME', 'INFLUENCE']
      for carMakerName in car_makers_name_list:
          columns_names.append(carMakerName)
      columns_names.append('ELECTRIC_CARS')
      columns_names.append('AUTO_DRIVE')
      
      df_tweets_car_maker = sqlContext.createDataFrame(df_cleaned_tweets
                               .map(lambda x: getAllAttributes(x)), columns_names)
    • Run cell 23 to create a dataframe for each manufacturer, aggregate counts, and order by posting time.
      car_maker_results_list = []
      for car_maker in car_makers_name_list:
          #get competitor dataframe
          df_car_maker = df_tweets_car_maker.filter(df_tweets_car_maker[car_maker] == True)
          overall_car_maker_time_data = df_car_maker.groupBy('POSTING_TIME') \
                              .agg(F.count('MESSAGE_ID').alias('COUNT')) \
                              .orderBy('POSTING_TIME', ascending=True)
          #collect each manufacturer's timeline for plotting in the next cell
          car_maker_results_list.append(overall_car_maker_time_data)
  2. Get the visual by running cell 24.
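
    Cell 24’s plotting code isn’t reproduced here. A minimal sketch, assuming the per-manufacturer results collected above, looks like this:

    # Sketch: one timeline per manufacturer on a shared axis
    fig, ax = plt.subplots(figsize=(20, 8))
    for name, result in zip(car_makers_name_list, car_maker_results_list):
        p_result = result.toPandas()                     # pull this manufacturer's daily counts to pandas
        ax.plot(pd.to_datetime(p_result['POSTING_TIME']), p_result['COUNT'], label=name)
    ax.set_xlabel('Posting date')
    ax.set_ylabel('Number of tweets')
    ax.legend(loc='upper left')
    plt.show()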

  3. Sure enough, we can see that the spike in chatter around October 2015 was all about one manufacturer:

    [Screenshot: per-manufacturer tweet timelines showing the VW spike in October 2015]

    To explain the peak, we’ll explore the nature of those comments in the next few steps.

  4. Run cell 25 to filter for only VW tweets around that time.
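
    A minimal sketch of that filter, assuming the boolean VW column created in cell 22 and a date window around the spike (the exact window is an assumption):

    # Sketch: keep only VW-related tweets posted around the October 2015 spike
    df_tweets_debacle = df_tweets_car_maker.filter(
        (df_tweets_car_maker['VW'] == True) &
        (df_tweets_car_maker['POSTING_TIME'] >= '2015-09-15') &
        (df_tweets_car_maker['POSTING_TIME'] <= '2015-10-31'))
    df_tweets_debacle.cache()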

  5. Run cell 26 to:

    • use the Natural Language Toolkit’s stop word list to filter out stop words (like of, or, the), http URLs, and other strings that carry little meaning for analysis
    • count occurrences of the remaining words
      
       # split each tweet body into words, drop URLs, non-ASCII and very short words,
       # then count word occurrences and sort by frequency, descending
       # (the full cell also removes NLTK stop words before counting)
       tagsRDD = df_tweets_debacle.flatMap( lambda t: re.split("\s", t.MESSAGE_BODY))\
               .filter( lambda word: not word.startswith("http") and all(ord(c) < 128 for c in word) and len(word) > 3)\
               .map( lambda word : (word, 1 ))\
               .reduceByKey(add, 10)\
               .map(lambda (a,b): (b,a))\
               .sortByKey(False)\
               .map(lambda (a,b):(b,a))
      
    • present a pie chart of the top 10 terms in tweets during the spike in chatter:

    [Screenshot: pie chart of the most common terms in tweets during the VW spike]

    As suspected, the conversation was all about the fraudulent emissions testing scandal.
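
    The pie chart at the end of cell 26 roughly does the following (a sketch, not the notebook’s exact code):

    # Sketch: plot the ten most frequent terms during the spike
    top10 = tagsRDD.take(10)                       # tagsRDD is sorted by count, descending
    labels = [word for (word, count) in top10]
    values = [count for (word, count) in top10]
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.pie(values, labels=labels, autopct='%1.1f%%')
    ax.set_title('Top 10 terms in tweets during the VW spike')
    ax.set_aspect('equal')
    plt.show()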

  6. If you’re interested, continue following the notebook and running cells to review more sentiment and brand analysis of this car data.

Conclusion

You’ve seen how a simple Python notebook can provide valuable insights into large-scale data problems. This tutorial shows a fast, creative approach to analyzing sweeping data sets without the need for your own computing infrastructure or expensive data experts. Explore the rest of the notebook to learn more, then use this lightweight approach to load and analyze your own data for insights.
