You can use App Connect to crawl a website and retrieve data to pass to your apps by mapping data graphically – without the need for coding – meaning that you can achieve a return on your investment in minutes or hours, not days or months.
This guide shows you how.
If you can't find what you want, or have comments about the "how to" information, please either add comments to the bottom of this page or send us your comments by email.
About Web Crawler
Web Crawler crawls web page URLs to retrieve links that are available on the web pages or download the HTML content of pages.
To retrieve page links, Web Crawler implements a basic web crawler algorithm that starts with a list of user-specified web page URLs to visit. As the crawler visits these URLs, it identifies all the hyperlinks in the statically defined content of each page. The crawler does not attempt to execute any scripts, so dynamically generated content is not crawled.
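The algorithm described above can be pictured as a depth-limited breadth-first traversal over static page links. The following Python sketch is purely illustrative (it is not App Connect's implementation); the `fetch` callback and the `max_depth` parameter are assumptions made for the example:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets from statically defined <a> tags only."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_urls, fetch, max_depth=2):
    """Depth-limited breadth-first crawl.

    `fetch(url)` returns the page's HTML as a string. Scripts are
    never executed, so only statically defined links are found.
    """
    visited = set()
    frontier = [(url, 0) for url in start_urls]
    pages = {}  # url -> list of links extracted from that page
    while frontier:
        url, depth = frontier.pop(0)
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(fetch(url))
        links = [urljoin(url, href) for href in parser.links]
        pages[url] = links
        frontier.extend((link, depth + 1) for link in links)
    return pages
```

Pages deeper than `max_depth` are never fetched, which is the effect of the Maximum depth of pages to be crawled filter described later in this guide.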
You can customize the behavior of Web Crawler by setting filter parameters; for example, to change the maximum depth of pages to be crawled, or to tell Web Crawler not to respect a robots.txt file if one is defined by the website.
Web Crawler can crawl public websites and websites on a private network (through the IBM Secure Gateway).
What should I consider first?
- (beta) Web Crawler is made available as an open beta so that you can evaluate its potential use. We welcome your feedback about Web Crawler, which we'll consider for enhancements to the function.
- The Web page / Retrieve all pages action crawls the static content of web pages and returns a JSON response that includes, for each web page, a collection of links extracted from the page. The action starts with one or more initial web page URLs specified by the Fully qualified web URL(s) filter property, then crawls the discovered links down to the specified maximum depth of pages to be crawled.
- Crawling websites can take noticeable time, and is subject to the responsiveness of the website, the network, and other factors outside the control of App Connect. Therefore, for optimal behavior and to crawl the largest websites supported, you should use Web Crawler actions within a batch process. For example:
- In a batch process, a Web page / Retrieve all pages action can crawl a maximum of 46000 web pages. Outside a batch process, a Retrieve all pages action can crawl a maximum of 1000 web pages. If the maximum is reached, an error message is issued.
- The maximum size of a web page that can be processed is 1 MB. If a page above this size is found, it is not processed and App Connect logs an error message; in a batch process, Web Crawler continues crawling any other web pages to be processed.
If needed, you can convert the downloaded web page content from Base64 format by using the unencode JSONata function on the content.
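In App Connect that conversion is done by a JSONata function in the graphical mapping. As a plain illustration of the underlying transformation (not App Connect's own mechanism), Base64 decoding looks like this in Python; the sample page content is invented for the example:

```python
import base64

# Hypothetical value: downloaded page content as a Base64 string
# (Web Crawler returns downloaded content Base64 encoded).
encoded = base64.b64encode(b"<html><body>Vehicle page</body></html>").decode("ascii")

# Convert the Base64 content back into readable HTML text.
html = base64.b64decode(encoded).decode("utf-8")
print(html)  # <html><body>Vehicle page</body></html>
```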
When configuring a Web Crawler connection for a website on a private network, you must specify the host name and port (for example, https://host:port), and must use the IBM Secure Gateway to access the network. When you click Connect to create the connection account, App Connect checks whether it can reach the specified domain.
If you've previously used the Secure Gateway Client to set up a network connection for an App Connect application that is on the same private network as the website, you can use this network connection with Web Crawler. If you don't have such a network connection in place, configure one as described in Configuring a private network for IBM App Connect.
For more considerations and details about Web Crawler, see the Reference tab.
A company wants to analyze the content of its vehicle website to find specific information and extract actionable insights. They create an event-driven flow that uses Web Crawler to download the content of appropriate web pages and upload that content as documents to Watson Discovery for analysis. They can then examine the documents and run a variety of queries to find specific information and extract actionable insights.
- For ease of testing, a Scheduler event is used to trigger the flow.
- Web Crawler actions are performed within a batch process for optimal behavior when crawling the vehicle website. The batch process performs a sequence of actions for each web page retrieved into its pagecollection array:
- The Web Crawler / Retrieve all pages action is used to crawl the website from an initial page. The action is configured with the following customized filter conditions:
- Fully qualified web URLs… equals https://ibmmotors.com/vehicle-collection/
This initial web page is the first in a sequence of hub pages, each listing a number of links to pages for specific vehicles. To crawl all the hub pages and their links to vehicle pages, we could list the URLs for the hub pages, like
https://ibmmotors.com/vehicle-collection/;https://ibmmotors.com/vehicle-collection/hubpage/2/;...;https://ibmmotors.com/vehicle-collection/hubpage/n/, or we can set the filter condition Maximum depth of pages to be crawled to a value that lets Web Crawler traverse enough links.
- Maximum depth of pages to be crawled equals 20
This depth value was used to enable Web Crawler to crawl from the initial page to its vehicle pages, and to the next hub page and its linked vehicle pages, and so on to the last hub page and its linked vehicle pages.
- Blacklist strings equals
This tells Web Crawler to ignore web page URLs that end with .jpg or .png.
- Filter by sub tree equals
This tells Web Crawler to only crawl web pages with URLs that start with the URL of the initial page:
https://ibmmotors.com/vehicle-collection/. Web Crawler ignores other web pages on the website.
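Taken together, the filter conditions above amount to a simple URL predicate: stay inside the sub-tree and skip blacklisted file types. The following Python sketch only illustrates that logic; the function name and its defaults are invented for the example and are not part of App Connect:

```python
def should_crawl(url,
                 subtree="https://ibmmotors.com/vehicle-collection/",
                 blacklist=(".jpg", ".png")):
    """Mirror the example filter conditions: crawl only URLs under
    the configured sub-tree, and skip blacklisted file extensions."""
    return url.startswith(subtree) and not url.lower().endswith(blacklist)
```

For example, a vehicle page under the sub-tree passes the predicate, while an image URL or a page elsewhere on the site is rejected.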
- The Web Crawler / Download content action is used to download the content of each web page, identified by its crawled URL from the batch process pagecollection.
- A Google Sheets / Create row action is used to create an index of page IDs, URLs, and links:
- An IBM Watson Discovery / Update or create document action is used to upload the content of each web page as a separate document, where the source ID is the ID of the web page. The action updates any document previously uploaded to the Watson Discovery environment (on a previous run of the event-driven flow). The action metadata indicates that the downloaded page content is in Base64-encoded format, and that the name of the updated or created document is the ID of the crawled web page.
When the flow was run, the batch process processed each web page crawled. The App Connect batch status view listed the number of pages processed and whether they were processed successfully:
For each web page crawled, the flow added a row to a Google spreadsheet:
… then the flow uploaded the contents of the web pages to IBM Watson Discovery for analysis:
In IBM Watson Discovery, queries can be run to find specific information and extract actionable insights.