Lab 4: Analyzing social media data with BigSheets

To help business analysts and those without a programming background analyze big data, IBM provides a spreadsheet-style tool called BigSheets. In this lab, you’ll learn how you can explore big data through this tool without writing scripts or MapReduce applications. The sample data for this lab consists of social media posts about a popular brand (IBM Watson) that was collected using a sample application provided with BigInsights. For background information, you may want to read the article on Analyzing social media and structured data with InfoSphere BigInsights.

After completing this hands-on lab, you’ll be able to:

  • Create a BigSheets workbook
  • Analyze and customize a workbook
  • Visualize your workbook’s data in a chart
  • Create a Big SQL table based on your workbook
  • Export your workbook’s data into one of several popular formats

Allow 45 – 60 minutes to complete this lab.

Prior to this lab, you should have set up a working environment. See Getting Started with Hadoop and BigInsights for details. Please post questions or comments about this lab to the forum on Hadoop Dev at https://developer.ibm.com/answers?community=hadoop.

4.1. Creating a workbook

To get started, copy the sample blogs-data.txt file to HDFS and create a master workbook for it.

__1. Obtain the blogs-data.txt file. You’ll find this in the sampleData.zip file provided with the article mentioned earlier.

__2. Use Hadoop file system commands or the BigInsights Web console to create subdirectories in HDFS for your sample data. Under /user/biadmin, create a /sampleData directory.Beneath /user/biadmin/sampleData, create the /IBMWatson subdirectory.

image002

If you forgot how to create a subdirectory in HDFS, consult the earlier labs on Issuing Basic Hadoop Commands or Exploring and Administering Your Cluster with the BigInsights Web Console.

__3. Upload the blogs-data.txt file to the /user/biadmin/sampleData/IBMWatson directory. You can use Hadoop file system commands or the BigInsights Web console to do this. (If you forgot how to copy a file to HDFS, consult the earlier labs on Issuing Basic Hadoop Commands or Exploring and Administering Your Cluster with the BigInsights Web Console.)

image003

__4. From the Files page of the Web console, position your cursor on the /user/biadmin/sampleData/IBMWatson/blogs-data.txt file, as shown in the previous image.

__5. Click the Sheet radio button to preview this data in a spreadsheet-style format.

image004

__6. Because the sample blog data for this lab is uses a JSON Array structure, you must click on the pencil icon to select an appropriate reader (data format translator) for this data. Select the JSON Array reader and click the green check.

image005

__7. Save this as a Master Workbook named Watson Blogs. Optionally, provide a description. Click Save.image006

__8. Note that the BigSheets page of the Web console will open and your new workbook will be displayed.

image023

Now you’re ready to begin exploring this data using BigSheets.

4.2. Analyzing and customizing your workbook

BigSheets offers analysts a variety of macros, functions, and built-in analytical features. You’ll learn about a few here.

__1. To make it easier to search and manage your workbooks, add a few tags to the Watson Blogs master workbook you just created. In the upper right corner, click the icon to toggle the workbook display to show additional fields.

image008

Depending on the size of your browser, an additional scroll bar may appear at right.

__2. Scroll down to the Workbook Details section. Locate the Tags field, select the green plus sign (+) , enter a tag for Watson, and click the green check mark. Repeat the process to add separate tags for IBM and blogs.

image009

__3. Click on the Workbooks link the upper left corner of your open workbook.

image010

__4. From the list of available workbooks, you can quickly search for a specific tag. Use the drop-down Tags menu to select the blogs tag or type tag: blogs into the box.

image011

__5. Open the Watson Blogs master workbook again. (Double click on it.)

__6. Create a new workbook based on this master workbook. In BigSheets, a master workbook is a “base” workbook and has a limited set of things you can edit. So, to manipulate the data contained within a workbook, you want to create a new workbook derived from the master.

__a. Click the Build new Workbook button.

image012

__b. When the new Workbook appears, change its default name. Click the pencil icon next to the name, enter Watson Blogs Revised as the new name, and click the green check mark.

image013

__c. Click the Fit column(s) button to more easily see columns A through H on your screen

image014

__7. Remove the column IsAdult from your workbook. This is currently column E. Click on the triangle next to the column name of IsAdult and select the Remove.

image015

__8. In this case, you want to keep only a few columns. To easily remove a several columns, click the triangle again (from any column) and select Organize Columns…

__a. Click the red X button next to each column you want to remove.

image016

In this case, KEEP the following columns…

__i. Country

__ii. FeedInfo

__iii. Language

__iv. Published

__v. SubjectHtml

__vi. Tags

__vii. Type

__viii. Url

__b. Click the green check mark button when you are ready to remove the columns you selected.

image017

__9. Click on the Fit column(s) button again to show columns A through H. Verify that your screen appears similar to this:

image018

__10. From the Save menu at upper left, select Save. Provide a description for your workbook if you’d like.

__11. Apply a built-in function to further investigate the contents of this workbook. Click the Add Sheets button in the lower left corner.

image019

__12. From the pop-up menu, select Function. You’re going to apply a built-in function that extracts the URL Host information from the full URL links associated with the blog data that was captured. Doing so will enable you to identify and chart sites with greatest blog coverage of IBM Watson.

__13. From the Function menu, click Categories and Url.

image020

__14. Select the URLHOST function.

__15. In the new menu that appears, enter Get Host URL as the sheet name and select the Url column as the source of input to the URLHOST function.

image021

__16. At the bottom of the menu, click the Carry Over tab to specify which columns from the workbook you’d like to retain. Select Add All and click the green check mark.

image022

__17. Verify that your workbook contains a new URLHOST column and all previously existing columns. (Whenever you create a new Sheet or edit your workbook in some way, BigSheets will preview the results of your work against a small sample of the data represented by your workbook.) If desired, click the Fit Column button to show more columns on your screen.

image023

__18. Click Save > Save & Exit.

image024

__19. When prompted to Run or Close the workbook, click Run. “Running” a workbook instructs BigSheets to apply the logic you specified graphically against all data associated with your workbook. You can monitor the progress of your request by watching the status bar indicator in the upper right-hand side of the page.

image025

__20. When the operation completes, verify that your workbook appears similar to this:

image026

__21. If desired, use the Next button in the lower right corner to see page through the content a few times, noting the various URLHOST values. You can also use built-in BigSheets features to sort the data based on URLHOST (or other) values, filter records (such as blogs written in the English language), etc. But perhaps the quickest way to see which sites published the most blogs about IBM Watson during this time period is to chart the results. You’ll do that next.

4.3. Creating charts

Now that you’ve customized your workbook to eliminate some unwanted columns and generate a new column containing URL host information, it’s time to visualize the results. In this short exercise, you’ll create two simple charts that identify the top 10 global sites with the most blog posts about IBM Watson.

__1. If necessary, open the Watson Blogs Revised notebook.

__2. Click on the Add chart link in the lower left.

image027

__3. Select chart > Bar as the chart type.

image028

__4. Specify appropriate properties for the bar chart, paying close attention to these fields:

__a. Title: Top 10 Blog Sites for IBM Waton

__b. X Axis: URLHOST

__c. Sort By: Y Axis

__d. Occurrence Order: Descending

__e. Limit: 10

image029

__5. Click the green check mark.

__6. When prompted, Run the chart. This causes BigSheets to apply your instructions to the entire data set.

image030

__7. Inspect the results. Are you surprised that ibm.com wasn’t the top site for blog posts about IBM Watson?

image031

__8. If desired, hover over each bar to see the URL host name and the number of blogs posted at that site.

__9. Next, create a new chart of a different type to visualize the information in a different format. Select Add Chart > Categories > cloud > Bubble Cloud.

image032

__10. Provide appropriate values for the following fields:

__a. Title:Top 10 Blog Sites for IBM Watson

__b. Tags:URLHOST

__c. Occurrence Order:Descending

__d. Sort By:Count

__e. Limit:10

image033

__11. Click the green check mark.

__12. When prompted, Run the chart.

__13. Inspect the results. If desired, hover over a bubble to see the number of blog postings for that site.

image034

4.4. Creating a Big SQL table based on your workbook

BigSheets offers a wide range of built-in features, including the ability to create a Big SQL table from your workbook. This is quite handy if you have SQL-based tools or applications that you’d like to use with data you’ve customized in BigSheets.

__1. If necessary, open your Watson Blogs Revised workbook.

__2. Click Create Table button just above the columns of your workbook. When prompted, accept sheets as the target schema name and type mywatsonblogs as the target table name.

image035

__3. Click Confirm.

__4. From the Files page of the Web console, click the Catalog Tables tab in the navigation window and expand the sheets folder.

image036

__5. Click the mywatsonblogs file. Note that a preview of the table appears in the pane at right.

image037

__6. Click the Welcome tab of the Web console.In the Quick Links section, click the Run Big SQL queries link.

image038

__7. A new tab will appear in your Web browser.

image039

__8. In the box where you’re prompted to enter your Big SQL query, type this statement:

select urlhost, language, subjecthtml from sheets.mywatsonblogs

fetch first 10 rows only;

__9. Verify that the Big SQL radio button is checked (not the Big SQL V1 radio button).

__10. If necessary, use the scroll bar at right to expose the Run button just below the radio buttons. Click Run.

__11. Inspect the results.

image040

__12. Close the Big SQL browser tab.

4.5. Optional:Exporting your workbook data

In this optional exercise, you’ll see how easy it is to export data in your workbook to one of several popular formats so that other applications can easily access the data.

__1. If necessary, open your Watson Blogs Revised workbook.

__2. Click Export data. From the drop-down menu, select TSV (tab separated value) as the format type.

image041

__3. Click the File radio button to export the data to a file in your distributed file system.

__4. Use the Browse button to navigate to the directory in HDFS where you would like to export this workbook. In this case, select /user/biadmin/sampleData/IBMWatson. In the box below the directory tree, enter myworkbook as the file name. Do not add a file extension such as .tsv. Click OK.

image042

__5. Click OK again to initiate the data export operation.

image043

__6. When a message appears indicating that the operation has finished, click OK.

__7. On the Files page of the Web console, navigate to the directory you specified for the export (/user/biadmin/sampleData/IBMWatson) and locate your new myworkbook.tsv file.

image044

__8. Optionally, click the download icon to copy the file from HDFS to a directory of your choice in your local file system.

image045

 

To find the other tutorials in this series, go to Overview tutorial.

Join The Discussion

Your email address will not be published. Required fields are marked *