To help business analysts and those without a programming background analyze big data, IBM provides a spreadsheet-style tool called BigSheets. In this lab, youâ€™ll learn how you can explore big data through this tool without writing scripts or MapReduce applications. The sample data for this lab consists of social media posts about a popular brand (IBM Watson) that was collected using a sample application provided with BigInsights. For background information, you may want to read the article on Analyzing social media and structured data with InfoSphere BigInsights.
After completing this hands-on lab, youâ€™ll be able to:
- Create a BigSheets workbook
- Analyze and customize a workbook
- Visualize your workbookâ€™s data in a chart
- Create a Big SQL table based on your workbook
- Export your workbookâ€™s data into one of several popular formats
Allow 45 â€“ 60 minutes to complete this lab.
Prior to this lab, you should have set up a working environment. See Getting Started with Hadoop and BigInsights for details. Please post questions or comments about this lab to the forum on Hadoop Dev at https://developer.ibm.com/answers?community=hadoop.
To get started, copy the sample blogs-data.txt file to HDFS and create a master workbook for it.
__1. Obtain the blogs-data.txt file. Youâ€™ll find this in the sampleData.zip file provided with the article mentioned earlier.
__2. Use Hadoop file system commands or the BigInsights Web console to create subdirectories in HDFS for your sample data. Under /user/biadmin, create a /sampleData directory.Beneath /user/biadmin/sampleData, create the /IBMWatson subdirectory.
If you forgot how to create a subdirectory in HDFS, consult the earlier labs on Issuing Basic Hadoop Commands or Exploring and Administering Your Cluster with the BigInsights Web Console.
__3. Upload the blogs-data.txt file to the /user/biadmin/sampleData/IBMWatson directory. You can use Hadoop file system commands or the BigInsights Web console to do this. (If you forgot how to copy a file to HDFS, consult the earlier labs on Issuing Basic Hadoop Commands or Exploring and Administering Your Cluster with the BigInsights Web Console.)
__4. From the Files page of the Web console, position your cursor on the /user/biadmin/sampleData/IBMWatson/blogs-data.txt file, as shown in the previous image.
__5. Click the Sheet radio button to preview this data in a spreadsheet-style format.
__6. Because the sample blog data for this lab is uses a JSON Array structure, you must click on the pencil icon to select an appropriate reader (data format translator) for this data. Select the JSON Array reader and click the green check.
__8. Note that the BigSheets page of the Web console will open and your new workbook will be displayed.
Now youâ€™re ready to begin exploring this data using BigSheets.
BigSheets offers analysts a variety of macros, functions, and built-in analytical features. Youâ€™ll learn about a few here.
__1. To make it easier to search and manage your workbooks, add a few tags to the Watson Blogs master workbook you just created. In the upper right corner, click the icon to toggle the workbook display to show additional fields.
Depending on the size of your browser, an additional scroll bar may appear at right.
__2. Scroll down to the Workbook Details section. Locate the Tags field, select the green plus sign (+) , enter a tag for Watson, and click the green check mark. Repeat the process to add separate tags for IBM and blogs.
__3. Click on the Workbooks link the upper left corner of your open workbook.
__4. From the list of available workbooks, you can quickly search for a specific tag. Use the drop-down Tags menu to select the blogs tag or type tag: blogs into the box.
__5. Open the Watson Blogs master workbook again. (Double click on it.)
__6. Create a new workbook based on this master workbook. In BigSheets, a master workbook is a â€śbaseâ€ť workbook and has a limited set of things you can edit. So, to manipulate the data contained within a workbook, you want to create a new workbook derived from the master.
__a. Click the Build new Workbook button.
__b. When the new Workbook appears, change its default name. Click the pencil icon next to the name, enter Watson Blogs Revised as the new name, and click the green check mark.
__c. Click the Fit column(s) button to more easily see columns A through H on your screen
__7. Remove the column IsAdult from your workbook. This is currently column E. Click on the triangle next to the column name of IsAdult and select the Remove.
__8. In this case, you want to keep only a few columns. To easily remove a several columns, click the triangle again (from any column) and select Organize Columnsâ€¦
__a. Click the red X button next to each column you want to remove.
In this case, KEEP the following columnsâ€¦
__b. Click the green check mark button when you are ready to remove the columns you selected.
__9. Click on the Fit column(s) button again to show columns A through H. Verify that your screen appears similar to this:
__10. From the Save menu at upper left, select Save. Provide a description for your workbook if youâ€™d like.
__11. Apply a built-in function to further investigate the contents of this workbook. Click the Add Sheets button in the lower left corner.
__12. From the pop-up menu, select Function. Youâ€™re going to apply a built-in function that extracts the URL Host information from the full URL links associated with the blog data that was captured. Doing so will enable you to identify and chart sites with greatest blog coverage of IBM Watson.
__13. From the Function menu, click Categories and Url.
__14. Select the URLHOST function.
__15. In the new menu that appears, enter Get Host URL as the sheet name and select the Url column as the source of input to the URLHOST function.
__16. At the bottom of the menu, click the Carry Over tab to specify which columns from the workbook youâ€™d like to retain. Select Add All and click the green check mark.
__17. Verify that your workbook contains a new URLHOST column and all previously existing columns. (Whenever you create a new Sheet or edit your workbook in some way, BigSheets will preview the results of your work against a small sample of the data represented by your workbook.) If desired, click the Fit Column button to show more columns on your screen.
__18. Click Save > Save & Exit.
__19. When prompted to Run or Close the workbook, click Run. “Running” a workbook instructs BigSheets to apply the logic you specified graphically against all data associated with your workbook. You can monitor the progress of your request by watching the status bar indicator in the upper right-hand side of the page.
__20. When the operation completes, verify that your workbook appears similar to this:
__21. If desired, use the Next button in the lower right corner to see page through the content a few times, noting the various URLHOST values. You can also use built-in BigSheets features to sort the data based on URLHOST (or other) values, filter records (such as blogs written in the English language), etc. But perhaps the quickest way to see which sites published the most blogs about IBM Watson during this time period is to chart the results. Youâ€™ll do that next.
Now that youâ€™ve customized your workbook to eliminate some unwanted columns and generate a new column containing URL host information, itâ€™s time to visualize the results. In this short exercise, youâ€™ll create two simple charts that identify the top 10 global sites with the most blog posts about IBM Watson.
__1. If necessary, open the Watson Blogs Revised notebook.
__2. Click on the Add chart link in the lower left.
__3. Select chart > Bar as the chart type.
__4. Specify appropriate properties for the bar chart, paying close attention to these fields:
__a. Title: Top 10 Blog Sites for IBM Waton
__b. X Axis: URLHOST
__c. Sort By: Y Axis
__d. Occurrence Order: Descending
__e. Limit: 10
__5. Click the green check mark.
__6. When prompted, Run the chart. This causes BigSheets to apply your instructions to the entire data set.
__7. Inspect the results. Are you surprised that ibm.com wasnâ€™t the top site for blog posts about IBM Watson?
__8. If desired, hover over each bar to see the URL host name and the number of blogs posted at that site.
__9. Next, create a new chart of a different type to visualize the information in a different format. Select Add Chart > Categories > cloud > Bubble Cloud.
__10. Provide appropriate values for the following fields:
__a. Title:Top 10 Blog Sites for IBM Watson
__c. Occurrence Order:Descending
__d. Sort By:Count
__11. Click the green check mark.
__12. When prompted, Run the chart.
__13. Inspect the results. If desired, hover over a bubble to see the number of blog postings for that site.
BigSheets offers a wide range of built-in features, including the ability to create a Big SQL table from your workbook. This is quite handy if you have SQL-based tools or applications that youâ€™d like to use with data youâ€™ve customized in BigSheets.
__1. If necessary, open your Watson Blogs Revised workbook.
__2. Click Create Table button just above the columns of your workbook. When prompted, accept sheets as the target schema name and type mywatsonblogs as the target table name.
__3. Click Confirm.
__4. From the Files page of the Web console, click the Catalog Tables tab in the navigation window and expand the sheets folder.
__5. Click the mywatsonblogs file. Note that a preview of the table appears in the pane at right.
__6. Click the Welcome tab of the Web console.In the Quick Links section, click the Run Big SQL queries link.
__7. A new tab will appear in your Web browser.
__8. In the box where youâ€™re prompted to enter your Big SQL query, type this statement:
select urlhost, language, subjecthtml from sheets.mywatsonblogs
fetch first 10 rows only;
__9. Verify that the Big SQL radio button is checked (not the Big SQL V1 radio button).
__10. If necessary, use the scroll bar at right to expose the Run button just below the radio buttons. Click Run.
__11. Inspect the results.
__12. Close the Big SQL browser tab.
In this optional exercise, youâ€™ll see how easy it is to export data in your workbook to one of several popular formats so that other applications can easily access the data.
__1. If necessary, open your Watson Blogs Revised workbook.
__2. Click Export data. From the drop-down menu, select TSV (tab separated value) as the format type.
__3. Click the File radio button to export the data to a file in your distributed file system.
__4. Use the Browse button to navigate to the directory in HDFS where you would like to export this workbook. In this case, select /user/biadmin/sampleData/IBMWatson. In the box below the directory tree, enter myworkbook as the file name. Do not add a file extension such as .tsv. Click OK.
__5. Click OK again to initiate the data export operation.
__6. When a message appears indicating that the operation has finished, click OK.
__7. On the Files page of the Web console, navigate to the directory you specified for the export (/user/biadmin/sampleData/IBMWatson) and locate your new myworkbook.tsv file.
__8. Optionally, click the download icon to copy the file from HDFS to a directory of your choice in your local file system.
To find the other tutorials in this series, go to Overview tutorial.