IBM App Connect passes key data between apps – automatically, in real time.

You can use App Connect to crawl a website and retrieve data to pass to your apps. You map the data graphically, without the need for coding, so you can achieve a return on your investment in minutes or hours, not days or months.

This guide shows you how.

If you can’t find what you want, or have comments about the “how to” information, please add comments to the bottom of this page.

About Web Crawler

Web Crawler crawls web page URLs either to retrieve the links that are available on those web pages or to download the HTML content of the pages.

To retrieve page links, Web Crawler implements a basic web crawler algorithm that starts with a list of user-specified web page URLs to visit. As the crawler visits these URLs, it identifies all the hyperlinks in the statically defined content of each page. The crawler does not attempt to execute any scripts, so dynamically generated content is not crawled.
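
The following Python sketch illustrates the general idea of such a crawl; it is not Web Crawler's actual implementation. It assumes a breadth-first traversal in which the initial URLs are at depth 0, and it uses a hypothetical `fetch` callable and an in-memory `SITE` dictionary in place of real HTTP requests so that it runs without network access. Only `<a href>` links in static HTML are read; no scripts are executed.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags in static HTML (no script execution)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_depth):
    """Breadth-first crawl from seed_urls, following links up to max_depth.

    fetch is a hypothetical callable that returns the HTML for a URL,
    or None if the page is unavailable. Seed URLs are treated as depth 0
    (an assumption made for this sketch).
    """
    visited = {}
    queue = deque((url, 0) for url in seed_urls)
    while queue:
        url, depth = queue.popleft()
        if url in visited:
            continue
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        # Resolve relative links against the URL of the page they were found on.
        links = [urljoin(url, href) for href in parser.links]
        visited[url] = links
        if depth < max_depth:
            queue.extend((link, depth + 1) for link in links)
    return visited


# In-memory stand-in for a website, so the sketch runs without network access.
SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/c">C</a>',
    "https://example.com/b": "<p>No links here</p>",
    "https://example.com/c": '<a href="/">Home</a>',
}

result = crawl(["https://example.com/"], SITE.get, max_depth=1)
```

With `max_depth=1`, the sketch visits the initial page and the two pages it links to, and records (but does not follow) the link to `https://example.com/c`.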

You can customize the behavior of Web Crawler by setting filter parameters; for example, you can change the maximum depth of pages to be crawled, or tell Web Crawler not to respect the robots.txt file if the website defines one.
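
To show what respecting a robots.txt file means in practice, here is a minimal Python sketch that uses the standard library's `urllib.robotparser`. How Web Crawler itself evaluates robots.txt is not described in this guide; the `is_allowed` helper, its `respect_robots` switch, the `example-crawler` user agent, and the sample rules are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (hypothetical) that blocks one directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(url, robots_txt, respect_robots=True, user_agent="example-crawler"):
    """Return True if the URL may be crawled under the given robots.txt rules.

    When respect_robots is False, the rules are ignored entirely.
    """
    if not respect_robots:
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = is_allowed("https://example.com/private/report.html", ROBOTS_TXT)
permitted = is_allowed("https://example.com/public/index.html", ROBOTS_TXT)
overridden = is_allowed(
    "https://example.com/private/report.html", ROBOTS_TXT, respect_robots=False
)
```

In this sketch, the page under `/private/` is skipped unless the rules are explicitly ignored.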

Web Crawler can crawl public websites and websites on a private network (through the IBM Secure Gateway).

What should I consider first?

  • (beta) Web Crawler is made available as an open beta so that you can use it to evaluate the potential use of the Web Crawler function. We welcome your feedback about Web Crawler, which we’ll consider for enhancements to the function.
  • The Web page / Retrieve all pages action crawls the static content of web pages and returns a JSON response that includes, for each web page, the collection of links extracted from that page. The action starts with one or more initial web page URLs specified by the Fully qualified web URL(s) filter property, then crawls the links that it discovers, down to the specified maximum depth of pages to be crawled.
  • Crawling websites can take noticeable time, and is subject to the responsiveness of the website, the network, and to other factors outside the control of App Connect. Therefore, for optimal behavior and to crawl the largest websites supported, you should use Web Crawler actions within a batch process. For example:
    • In a batch process, a Web page / Retrieve all pages action can crawl a maximum of 46000 web pages. Outside a batch process, a Retrieve all pages action can crawl a maximum of 1000 web pages. If the maximum is reached, an error message is issued.
    • The maximum size of a web page that can be processed is 1 MB. If a page above this size is found, it is not processed and App Connect logs an error message; in a batch process, Web Crawler continues crawling any other web pages to be processed.
  • The Web page / Download page content action downloads the content of a web page in Base64 format.
    If needed, you can convert the downloaded web page content from Base64 format by applying the JSONata $base64decode function to the content: {{$base64decode($WebCrawlerDownloadpagecontent.Content)}}.
  • Web Crawler can crawl public website domains that do not use authentication, or can be configured to use basic authentication where a website domain requires it.
  • Web Crawler can be used to crawl public websites and websites on a private network (through the IBM Secure Gateway).

    When configuring a Web Crawler connection for a website on a private network, you must specify the host name and port (for example: https://host:port), and must use the IBM Secure Gateway to access the network. When you click Connect to create the connection account, App Connect checks whether it can reach the specified domain.

    If you’ve previously used the Secure Gateway Client to set up a network connection for an App Connect application that is on the same private network as the website, you can use this network connection with Web Crawler. If you don’t have such a network connection in place, configure one as described in Configuring a private network for IBM App Connect.
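
If you process downloaded page content outside App Connect, the Base64 conversion mentioned for the Download page content action can be done with any Base64 decoder. The following Python sketch is one way to do it; the `decode_page_content` helper is hypothetical, and treating the page as UTF-8 text is an assumption, because the page's real encoding depends on the website.

```python
import base64


def decode_page_content(content_b64, encoding="utf-8"):
    """Decode Base64 page content into text.

    The default encoding is an assumption; undecodable bytes are replaced
    rather than raising an error.
    """
    raw = base64.b64decode(content_b64)
    return raw.decode(encoding, errors="replace")


# Build a sample Base64 value locally instead of calling Web Crawler.
sample = base64.b64encode("<html><body>Hello</body></html>".encode("utf-8")).decode("ascii")
html = decode_page_content(sample)
```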

For more considerations and details about Web Crawler, see the Reference tab.

Example

A company wants to analyze the content of its vehicle website. They create an event-driven flow that uses Web Crawler to download the content of the relevant web pages and then uploads that content as documents to Watson Discovery. They can then examine the documents and run a variety of queries to find specific information and extract actionable insights.

Web Crawler flow for analysis of data retrieved from page content on vehicle web site

Flow notes:

  1. For ease of testing, a Scheduler event is used to trigger the flow.
  2. Web Crawler actions are performed within a batch process for optimal behavior crawling the vehicle website. The batch process performs a sequence of actions for each web page retrieved into its pagecollection array:
    Web Crawler pagecollection properties
  3. The Web Crawler / Retrieve all pages action is used to crawl the website from an initial page. The action is configured with the following customized filter conditions:
    • Fully qualified web URLs… equals https://ibmmotors.com/vehicle-collection/

      This initial web page is the first in a sequence of hub pages, each listing a number of links to pages for specific vehicles. To crawl all the hub pages and their links to vehicle pages, we could either list the URLs for the hub pages, like https://ibmmotors.com/vehicle-collection/;https://ibmmotors.com/vehicle-collection/hubpage/2/;...;https://ibmmotors.com/vehicle-collection/hubpage/n/ or set the filter condition Maximum depth of pages to be crawled to a value that lets Web Crawler traverse enough links.

    • Maximum depth of pages to be crawled equals 20

      This depth value was used to enable Web Crawler to crawl from the initial page to its vehicle pages, and to the next hub page and its linked vehicle pages, and so on to the last hub page and its linked vehicle pages.

    • Blacklist strings equals .jpg;.png

      This tells Web Crawler to ignore web page URLs that end with .jpg or .png.

    • Filter by sub tree equals true

      This tells Web Crawler to only crawl web pages with URLs that start with the URL of the initial page: https://ibmmotors.com/vehicle-collection/. Web Crawler ignores other web pages on the web site; for example: https://ibmmotors.com/contact-us/

  4. The Web Crawler / Download content action is used to download the content of the web page, identified by its crawled URL from the batch process pagecollection:
    Web Crawler / Download content properties
  5. A Google Sheets / Create row action is used to create an index of page IDs, URLs, and links:

    Vehicle data flow, Google Sheets properties

  6. An IBM Watson Discovery / Update or create document action is used to upload the content of each web page as a separate document, where the source ID is the ID of the web page. The action updates any document previously uploaded into the Watson Discovery environment (on a previous run of the event-driven flow). The action metadata indicates that the downloaded page content is in base64encoded format, and that the name of the document updated/created has the ID value of the web page crawled.

    Watson create document metadata mapped from Web Crawler page properties
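
The effect of the Blacklist strings and Filter by sub tree conditions in flow note 3 can be sketched as a simple URL check. This Python sketch follows only the behavior described in this guide (ignore URLs that end with a blacklisted string; keep only URLs that start with an initial page URL). The `split_list` and `should_crawl` helpers are hypothetical, and the two vehicle page URLs are made up for illustration.

```python
def split_list(value):
    """Split a semicolon-separated filter value into a list of non-empty items."""
    return [item.strip() for item in value.split(";") if item.strip()]


def should_crawl(url, initial_urls, blacklist, filter_by_subtree):
    """Decide whether a discovered URL passes the filter conditions."""
    # Blacklist strings: ignore URLs that end with any listed string.
    if any(url.endswith(suffix) for suffix in blacklist):
        return False
    # Filter by sub tree: keep only URLs under one of the initial page URLs.
    if filter_by_subtree:
        return any(url.startswith(initial) for initial in initial_urls)
    return True


initial_urls = split_list("https://ibmmotors.com/vehicle-collection/")
blacklist = split_list(".jpg;.png")

# The two vehicle URLs below are invented examples, not taken from the real site.
candidates = [
    "https://ibmmotors.com/vehicle-collection/hubpage/2/",
    "https://ibmmotors.com/vehicle-collection/sedan/photo.jpg",
    "https://ibmmotors.com/contact-us/",
]
kept = [u for u in candidates if should_crawl(u, initial_urls, blacklist, True)]
```

Here only the hub page URL is kept: the image URL is excluded by the blacklist, and the contact page is outside the sub tree of the initial page.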

Example result

When the flow was run, the batch process processed each web page crawled. The App Connect batch status view listed the number of pages processed and whether they were processed successfully:

Vehicle data flow batch view

For each web page crawled, the flow added a row to a Google spreadsheet:

Vehicle data flow Google spreadsheet

The flow then uploaded the contents of the web pages to IBM Watson Discovery for analysis:

IBM Watson Discovery showing data from Web Crawler

In IBM Watson Discovery, queries can be run to find specific information and extract actionable insights.

IBM Watson Discovery query on vehicle data from Web Crawler
