Scrape, analyze and visualize insights on raw data from the web

Summary

The World Wide Web is a vast store of network-accessible information, most of it in raw, unstructured form. What if you wanted to ingest that raw information for a given topic and produce insights and visualizations from it? This code pattern shows you how, using analytics on startup companies as the example.

Description

Suppose we want to understand current startups in a particular technology, say machine learning. This code pattern evaluates their impact in the industry on the basis of:

  • How many times they have appeared in the news
  • Whether they have a Wikipedia page
  • Whether they have tech blogs
  • Whether they are active on social media

Once the unstructured data is scraped, it is processed by Watson Natural Language Understanding and converted to structured data. The structured data is fed to SPSS, which is used to explore the data and determine whether the factors listed above are present for each company and, from that, to compute a popularity score. Once the analytics are complete, this code pattern also provides a user-friendly, interactive dashboard that visualizes the data, surfaces insights, and helps simplify decision making.
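To make the idea of a popularity score concrete, here is a minimal Python sketch that combines the four factors above into a single number. The code pattern computes its score in SPSS Modeler and does not state a formula here, so the equal weighting, the cap on news mentions, and all names (`CompanySignals`, `popularity_score`, `news_cap`) are assumptions for illustration only. The company names are placeholders in the spirit of the plant names used by the pattern.

```python
from dataclasses import dataclass


@dataclass
class CompanySignals:
    """Per-company facts gathered from the web (field names are hypothetical)."""
    name: str
    news_mentions: int
    has_wikipedia_page: bool
    has_tech_blog: bool
    active_on_social_media: bool


def popularity_score(signals: CompanySignals, news_cap: int = 10) -> float:
    """Return a score in [0, 1]; the equal weighting is an assumption."""
    # Cap news mentions so one heavily covered company does not dominate.
    news_part = min(signals.news_mentions, news_cap) / news_cap
    flags = [
        signals.has_wikipedia_page,
        signals.has_tech_blog,
        signals.active_on_social_media,
    ]
    # Average the four factors with equal weight.
    return (news_part + sum(flags)) / 4


# Placeholder data, not results from the code pattern.
companies = [
    CompanySignals("Fern", 12, True, True, False),
    CompanySignals("Maple", 3, False, True, True),
]
ranked = sorted(companies, key=popularity_score, reverse=True)
for company in ranked:
    print(f"{company.name}: {popularity_score(company):.2f}")
```

A different weighting, or a model trained in SPSS Modeler, could replace `popularity_score` without changing the rest of the flow.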

After completing this code pattern, you’ll understand how to:

  • Connect to and scrape data from varied data sources on the web.
  • Convert raw web data to structured data.
  • Integrate data from multiple data sources using a Db2 Warehouse connection.
  • Perform analytics in SPSS Modeler.
  • Send integrated data to the Db2 Warehouse.
  • Derive insights and visualize them on the Watson Embedded Dashboard.

NOTE: Company names have been replaced with names of plants for this code pattern.

Flow

  1. Create and run a Python Notebook on Watson Studio.
  2. The notebook scrapes the latest news on startups.
  3. The scraped information is sent to Watson Natural Language Understanding to extract keywords, entities, and sentiments, along with their respective confidence scores.
  4. The Natural Language Understanding results are compiled into a CSV file, which is then converted to a table in Db2 Warehouse.
  5. The table is ingested into SPSS, which runs analytics and returns a score for each company. The updated table is then saved back to Db2 Warehouse.
  6. The table generated in Db2 Warehouse is fed to the dashboard, which visualizes the results.
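Step 4 of the flow above, compiling analysis results into a CSV file, can be sketched with the Python standard library alone. The nested dictionary below is a hypothetical stand-in for one analysis result. Its field names, the helper functions `flatten_keywords` and `rows_to_csv`, and the column layout are assumptions, not the notebook's actual code, and loading the CSV into Db2 Warehouse is left out.

```python
import csv
import io

# Hypothetical analysis result for one scraped article. The field names are
# assumptions for illustration and the values are placeholder data.
sample_result = {
    "company": "Fern",
    "keywords": [
        {"text": "machine learning", "relevance": 0.93, "sentiment": {"score": 0.41}},
        {"text": "funding round", "relevance": 0.71, "sentiment": {"score": 0.12}},
    ],
}

COLUMNS = ["company", "keyword", "relevance", "sentiment"]


def flatten_keywords(result: dict) -> list:
    """Turn one nested result into flat rows, one row per keyword."""
    rows = []
    for keyword in result.get("keywords", []):
        rows.append({
            "company": result["company"],
            "keyword": keyword["text"],
            "relevance": keyword["relevance"],
            # Sentiment may be absent, so fall back to None (an empty cell).
            "sentiment": keyword.get("sentiment", {}).get("score"),
        })
    return rows


def rows_to_csv(rows: list) -> str:
    """Serialize the flat rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(flatten_keywords(sample_result))
print(csv_text)
```

Keeping one row per keyword gives a flat table that a warehouse import can consume directly. Entities and their scores could be flattened the same way into a second table.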

Instructions

Get the detailed instructions in the README file. These steps will show you how to:

  1. Clone the repo.
  2. Create Watson services with IBM Cloud.
  3. Create a new Watson Studio Project.
  4. Add Db2 Warehouse Connection to your Watson Studio Project.
  5. Import the notebook to your Watson Studio Project.
  6. Configure IBM Cloud service credentials in the notebook.
  7. Run the notebook.
  8. Set up SPSS Modeler on your Watson Studio Project.
  9. Run the Modeler.
  10. Set up the Embedded Dashboard Service on your Watson Studio Project.
  11. Visualize and derive insights using Embedded Dashboard Analytics.
Smruthi Raj Mohan
Srikanth Manne
Manjula G Hosurmath