This article offers hints and tips based on experience building a search crawler implementation with the WCM seedlist. They are most helpful when used alongside the product documentation and previously published WCM seedlist articles.
Using the WCM seedlist to implement a crawler has advantages such as:
- The seedlist approach gives content owners finer-grained control over how content and metadata are crawled.
- With incremental crawling only items that have been added, changed, or deleted since the previous crawl are retrieved.
The WCM seedlist is used by WebSphere Portal to crawl and index WCM content. The seedlist is based on an Atom feed format that is easily accessible programmatically. This format provides not only URLs to content but also metadata such as author, categories, keywords, and publish date, as well as custom metadata. The format also indicates how to handle each item: whether the content is new, is an update, or is to be removed from the search index. For more information about the seedlist REST API, see Seedlist 1.0 REST service API.
1. Seedlist Requests
A great feature of WCM seedlists is that it does not take much effort to get started. WCM honors seedlist requests regardless of whether Portal Search is deployed. The following example shows a seedlist request:
TIP: If there is a large amount of data to index, configure your crawler to point to a specific node in your cluster.
Paging Seedlist Responses
WCM Search Seedlist 1.0 uses a pagination approach. You can specify the number of items to fetch in each page with the Range parameter. As a general rule of thumb:
- A higher Range value results in fewer page requests but requires more memory
- A lower Range value results in more page requests but requires less memory
The following are some best practices:
- If you have more than 5,000 content items, use a range of 1000
- If you have fewer than 5,000 content items, use a range of 100, which is the default value
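The rule of thumb above can be written down as a small helper. This is only a sketch: the function name `choose_range` is hypothetical, and the handling of exactly 5,000 items is an assumption, since the guidance above does not cover that case.

```python
def choose_range(content_item_count: int) -> int:
    """Pick a Range value following the rule of thumb above."""
    # More than 5,000 items: larger pages mean fewer requests,
    # at the cost of more memory per response.
    if content_item_count > 5000:
        return 1000
    # Otherwise keep the default page size of 100.
    # Assumption: exactly 5,000 items is treated like the smaller case.
    return 100
```

A crawler could call `choose_range` once per crawl, using an estimate of the number of content items in the libraries being crawled.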
To summarize when the State and Timestamp parameters are used in seedlist requests:
- Initial Crawl => State value present: No. Timestamp value present: No
- Initial Crawl Subsequent Page => State value present: Yes. Timestamp value present: No
- Incremental Crawl => State value present: No. Timestamp value present: Yes
- Incremental Crawl Subsequent Page => State value present: Yes. Timestamp value present: Yes
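Read as a rule, the summary above says that State accompanies every subsequent-page request and Timestamp accompanies every incremental-crawl request. The sketch below encodes that rule. The function name `build_query` is hypothetical, the State and Timestamp values are treated as opaque strings that the crawler has already obtained, and the other parameters a real seedlist request needs are deliberately left out.

```python
from typing import Dict, Optional


def build_query(incremental: bool,
                first_page: bool,
                range_size: int = 100,
                state: Optional[str] = None,
                timestamp: Optional[str] = None) -> Dict[str, str]:
    """Return the paging-related query parameters for one seedlist request."""
    params = {"Range": str(range_size)}

    # State is present only when requesting a subsequent page.
    if not first_page:
        if state is None:
            raise ValueError("a subsequent page request needs a State value")
        params["State"] = state

    # Timestamp is present only during an incremental crawl.
    if incremental:
        if timestamp is None:
            raise ValueError("an incremental crawl needs a Timestamp value")
        params["Timestamp"] = timestamp

    return params
```

For example, `build_query(incremental=True, first_page=False, state="s1", timestamp="t1")` yields all three parameters, while `build_query(incremental=False, first_page=True)` yields only `Range`.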
2. Delete Actions
Delete actions can be emitted in the seedlist response for a number of use cases such as:
- Deleted content
- Expired content
- Moved content: content moved to a library other than the libraries identified in the seedlist request.
However, deleting a WCM library does not result in a series of delete actions for the resources originating from that library.
NOTE: The delete actions reference content IDs, not URLs. Your crawler therefore needs to store this ID when the content item is first processed so that it can look the item up if it is later deleted.
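One way to satisfy the note above is to keep a mapping from content ID to indexed URL on the crawler side. The following is a minimal in-memory sketch; the class name `CrawlIndex` and its methods are hypothetical, and a real crawler would persist the mapping and also remove the document from its search index.

```python
from typing import Dict, Optional


class CrawlIndex:
    """Remembers which URL was indexed for each WCM content ID."""

    def __init__(self) -> None:
        self._url_by_id: Dict[str, str] = {}

    def upsert(self, content_id: str, url: str) -> None:
        # Called for new and updated items: store the ID next to the URL.
        self._url_by_id[content_id] = url

    def delete(self, content_id: str) -> Optional[str]:
        # Called for delete actions, which carry only the content ID.
        # Returns the URL to purge, or None if the ID was never indexed.
        return self._url_by_id.pop(content_id, None)


index = CrawlIndex()
index.upsert("id-1", "https://server/wps/myportal?urile=example")
removed_url = index.delete("id-1")      # URL to purge from the search index
unknown = index.delete("id-unknown")    # None: nothing to purge
```

Returning `None` for an unknown ID lets the crawler ignore delete actions for items it never indexed, such as content that was deleted before the first crawl.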
3. Content URLs
This results in content URLs such as: https://server/wps/myportal?urile=wcm:path%3A%2FMyLibrary%2F%2BHome%2FContent%2FMySiteArea%2FMarketing_and_Advertising%2FGuidance
With these types of content URLs, you will need to ensure that web content associations are properly defined so that content is rendered on the proper pages.
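When debugging such URLs, it can help to decode the `urile` query parameter back into a readable WCM path. The sketch below uses only the Python standard library; the function name `wcm_path_from_url` is hypothetical, and the `wcm:path:` prefix is inferred from the sample URL above rather than from a specification.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def wcm_path_from_url(url: str) -> Optional[str]:
    """Return the decoded WCM path from a content URL, or None if absent."""
    # parse_qs percent-decodes the value (for example %2F becomes "/").
    values = parse_qs(urlsplit(url).query).get("urile")
    if not values:
        return None
    prefix = "wcm:path:"  # assumption: prefix as seen in the sample URL
    value = values[0]
    return value[len(prefix):] if value.startswith(prefix) else None


sample = ("https://server/wps/myportal?urile=wcm:path%3A%2FMyLibrary%2F%2BHome"
          "%2FContent%2FMySiteArea%2FMarketing_and_Advertising%2FGuidance")
path = wcm_path_from_url(sample)
# path == "/MyLibrary/+Home/Content/MySiteArea/Marketing_and_Advertising/Guidance"
```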
The seedlist response also provides the WCM servlet URL for content items, which the crawler can use to index the core content (without processing any additional content that may be on the associated page):
NOTE: The presentation template mapping for the content items must be configured in order for the WCM servlet to render the content items properly for the crawler.
Finally, you may find that you do not want to process some entries in the seedlist response, or that you need to include other metadata with content items. There are techniques for addressing both of these issues. Read the Related WCM seedlist articles for more information.
4. Related WCM seedlist articles
- Defining seedlist filters for WebSphere Portal and Web Content Management
- Seedlist filter for WCM to exclude attachments based on file extension for V8.x
- Extend WebSphere Portal WCM Seedlist with Custom Metadata
Thanks to Denny Ma and Andreas Prokoph for their assistance.