Web scraping involves using a program or algorithm to extract and process large amounts of data from the web. It allows you to gather data when there is no direct way to download it. With Python, web scraping lets you extract that data into a useful form you can work with. In this tutorial, you’ll learn how to extract data from the web using Watson Studio, and then use Watson Natural Language Understanding to derive important entities and keywords.

Learning objectives

In this tutorial, you’ll learn the following:

  1. How to scrape a Google search, with any search term of your choice.
  2. How to extract the links and summaries from the search results.
  3. How to process the extracted text with Watson Natural Language Understanding to derive important entities and keywords.

Prerequisites

Estimated time

Completing this tutorial should take about 30 minutes.

Steps

  1. Create a notebook on Watson Studio
  2. Set up the notebook

Step 1: Create a notebook on Watson Studio

  • Log in to the IBM Cloud Dashboard.
  • Click the Services dropdown and select Watson Studio.
  • Click the Get Started button at the bottom of the page.
  • Select the New Project option from the Watson Studio landing page, choose the Standard option, and create the project by giving it a name.
  • Upon successful project creation, you are taken to a dashboard view of your project. Click Assets and create a notebook.

Step 2: Set up the notebook

  1. Install the necessary package
  2. Import the necessary packages
  3. Add the Natural Language Understanding service credentials
  4. Scrape data from a website
  5. Analyze results in Watson NLU

1. Install the necessary package

  !pip install watson-developer-cloud==1.5

2. Import the necessary packages

  import time
  import requests
  from random import randint
  from bs4 import BeautifulSoup
  from watson_developer_cloud import NaturalLanguageUnderstandingV1
  from watson_developer_cloud.natural_language_understanding_v1 \
  import Features, EntitiesOptions, KeywordsOptions, SemanticRolesOptions, SentimentOptions, EmotionOptions, ConceptsOptions, CategoriesOptions

3. Add the Natural Language Understanding service credentials

  • Open the Watson Natural Language Understanding service under the Services dropdown in your IBM Cloud Dashboard.
  • Once the service is open, copy the apikey and url from the Credentials menu.
  • Insert the following code and fill in the apikey and url you copied.

    apikey = ''  # paste your apikey here
    url = ''     # paste your url here
    natural_language_understanding = NaturalLanguageUnderstandingV1(
        version='2019-01-25',
        iam_api_key=apikey,
        url=url
    )
    

4. Scrape data from a website

  • In a new cell, define the function scrape_google_summaries as shown below:

    def scrape_google_summaries(s):
        time.sleep(randint(0, 2))  # pause briefly so Google doesn't throttle repeated requests
        r = requests.get("https://www.google.co.in/search?q=%22" + s + "%22&oq=%22" + s + "%22&aqs=chrome..69i57.14096j0j9&sourceid=chrome&ie=UTF-8")
        print(r.status_code)  # print the HTTP status code
        content = r.text
        summary_items = []

        soup = BeautifulSoup(content, "html.parser")

        for item in soup.find_all(class_='g'):
            summary_dict = dict()
            for link in item.find_all('a'):
                summary_dict['news_link'] = link.get('href')
            for summary in item.find_all('span', class_="st"):
                summary_dict['summary'] = summary.text
            summary_items.append(summary_dict)
        return summary_items
    
  • Now, call the function, passing any search term.
  • If you wish to extract different information from a different website, enter a different URL, and in your browser (for example, Google Chrome) open Options > More Tools > Developer Tools. Navigate through the HTML, find the tags you wish to extract, and update the BeautifulSoup calls in the code accordingly.
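To see how the extraction logic adapts to another page layout, here is a minimal, self-contained sketch. The HTML snippet and the class names (`article`, `teaser`) are invented for illustration; only the tags and classes passed to find_all and find change, the loop structure stays the same:

```python
from bs4 import BeautifulSoup

# A stand-in for r.text from a different site; the markup and
# class names below are hypothetical examples.
html = """
<div class="article"><a href="https://example.com/a">A</a>
  <p class="teaser">First teaser text.</p></div>
<div class="article"><a href="https://example.com/b">B</a>
  <p class="teaser">Second teaser text.</p></div>
"""

soup = BeautifulSoup(html, "html.parser")
items = []
for item in soup.find_all(class_="article"):   # one container per result
    entry = {}
    link = item.find("a")                      # first link inside the container
    if link is not None:
        entry["news_link"] = link.get("href")
    teaser = item.find("p", class_="teaser")   # the summary element
    if teaser is not None:
        entry["summary"] = teaser.text.strip()
    items.append(entry)

print(items)
```

Inspecting the real page in Developer Tools tells you which container class and which inner tags to substitute here.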

5. Analyze results in Watson NLU

  • Add a new cell and define the function analyze_using_NLU as shown below:

    def analyze_using_NLU(analysistext):
        """Extract results from Watson Natural Language Understanding for each news item."""
        res = dict()
        response = natural_language_understanding.analyze(
            text=analysistext,
            features=Features(
                sentiment=SentimentOptions(),
                entities=EntitiesOptions(),
                keywords=KeywordsOptions(),
                emotion=EmotionOptions(),
                concepts=ConceptsOptions(),
                categories=CategoriesOptions(),
                semantic_roles=SemanticRolesOptions()))
        res['results'] = response
        return res
    
  • Call the function, passing the scraped data, and view the results returned in JSON format.
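The dictionary returned by analyze_using_NLU can then be post-processed in plain Python. As an illustrative sketch, the response below is a hand-written sample shaped like an NLU result (field names follow the NLU JSON schema, but the values are invented), showing how keywords and entities might be pulled out:

```python
# A hand-written sample shaped like a Watson NLU response; the values
# here are invented for illustration.
res = {
    'results': {
        'keywords': [
            {'text': 'web scraping', 'relevance': 0.94},
            {'text': 'Watson Studio', 'relevance': 0.81},
        ],
        'entities': [
            {'type': 'Company', 'text': 'IBM', 'relevance': 0.90},
        ],
    }
}

# Collect keyword strings above a relevance threshold.
keywords = [k['text'] for k in res['results']['keywords'] if k['relevance'] > 0.5]

# Collect (type, text) pairs for each detected entity.
entities = [(e['type'], e['text']) for e in res['results']['entities']]

print(keywords)   # ['web scraping', 'Watson Studio']
print(entities)   # [('Company', 'IBM')]
```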

Summary

In this tutorial, you learned how to extract data from a Google search for a given search term, and how to get insights from that data with Watson Natural Language Understanding.