Introduction
Search quality depends primarily on the relevance scoring methods of the Portal Search component, which applies the standard relevance calculation provided by the underlying open source Lucene search engine (http://lucene.apache.org/).
In general terms, scoring is based on the standard “tf x idf” measure (term frequency times inverse document frequency); see http://en.wikipedia.org/wiki/Tf%E2%80%93idf.
A more detailed explanation of the ranking formula used in Lucene is available in the Lucene documentation.
What is not taken into account is the structure of the information in a 'document', e.g. whether a keyword also appears in the 'title' or in another structural element of the document. One would typically assume that a keyword appearing in the title is more significant for that document than the same keyword appearing only in the regular body text of other documents.
When, then, is search quality 'bad'? In most cases, when the search result does not meet the user's expectation. This could be because:
- The title and summary information presented for the top 'n' results do not look like what the user was searching for
- Opening the top hits in the search result reveals that they are not what the user was looking for, while the information that is relevant to the user appears further down the list
Is 'Search' then bad? Not necessarily. There are two sides to the coin:
- The content itself is of poor quality, e.g. the title of the document suggests a different type of information than the main text actually provides. Example: the one significant keyword of the title does not appear anywhere else in the text, or appears only once
- The calculated relevance score is correct, given the statistical information provided by the corpus and the candidates identified for the search result, but some (less important) keywords dominate over others. For example, a five-word query may match best to a document that contains only three of those keywords, because these three occur very often within that document
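The dominance effect described in the second point can be reproduced with a minimal sketch of textbook tf x idf scoring. It is an illustration only: Lucene's actual formula includes further factors, and the function name, the tiny corpus and the documents are made up for this example.

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_tokens, corpus):
    """Textbook tf x idf: sum of (term count in doc) * log(N / df) per query term."""
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        # df = number of documents in the corpus that contain the term
        df = sum(1 for doc in corpus if term in doc)
        if df:
            score += counts[term] * math.log(n_docs / df)
    return score

# Hypothetical corpus: doc_a repeats three query terms, doc_b has all five once.
doc_a = ["content", "management", "portal"] * 6
doc_b = ["edit", "web", "content", "management", "portal"]
corpus = [doc_a, doc_b, ["portal", "theme"], ["news", "archive"]]
query = ["edit", "web", "content", "management", "portal"]

score_a = tf_idf_score(query, doc_a, corpus)
score_b = tf_idf_score(query, doc_b, corpus)
```

Under these assumptions doc_a scores higher than doc_b although it matches only three of the five query terms, because the repeated terms accumulate term frequency.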
Options available to improve search quality
There are three options available to improve search quality and thus overall search experience.
- Use the Suggested Links portlet and capability of Portal Search
Available since WebSphere Portal V6.1
- Change the default search behavior so that every returned document must contain all keywords, rather than at least one keyword
Available since WebSphere Portal V7.0.0.2
- Apply a boost factor for the document in cases when the keywords appear in certain 'fields' (metadata) within the document
Available since WebSphere Portal V8.0.0.1 CF09
Note: None of the three options above requires rebuilding or re-crawling the search collection(s).
Using Suggested Links
When there is a strong requirement that a specific document be listed at the top of a search result, the only way to achieve this is to use the 'Suggested Links' feature and portlet.
At a high level, a keyword (or multiple keywords) is assigned to a document. When the Suggested Links portlet later receives a user query, it checks in parallel whether one or more of the query terms were explicitly assigned to a document. If so, that document is displayed prominently in the Suggested Links portlet.
Administering Suggested Links in Portal is as easy as tagging. For an administrator, the Search Center presents a link for every entry in the search result list. Selecting that link opens a dialog box in which keyword(s) can be assigned. Once saved, the keyword is associated with that item, and when it is searched for, the link (document) is presented in the Suggested Links portlet. See illustration below:
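The lookup described above can be sketched as a simple keyword-to-document mapping. This is not the portlet's implementation; the function name, the document paths and the restriction to single-word keywords are assumptions made for illustration.

```python
def suggested_links(query, assignments):
    """Return documents that have at least one assigned keyword in the query.

    assignments maps a document link to the keywords an administrator
    assigned to it (hypothetical data structure, single-word keywords only).
    """
    terms = set(query.lower().split())
    hits = []
    for doc, keywords in assignments.items():
        if terms & {k.lower() for k in keywords}:
            hits.append(doc)
    return hits

# Hypothetical assignments made by an administrator.
assignments = {
    "/docs/vacation-policy.html": ["vacation", "holiday"],
    "/docs/expense-report.html": ["expenses"],
}

links = suggested_links("How many vacation days", assignments)
```

Because the match runs against the assigned keywords rather than the document text, a document can be surfaced even if its relevance score for the query is low.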
Change default query operator from 'OR' to 'AND'
When a user enters more than one search term, the Portal search engine applies a logical 'OR' operator by default. As a consequence, a document qualifies for the search result list if it contains just one of those terms; if it contains two or more, all the better.
In terms of search quality, this can mean that only a few of the terms actually contribute to the relevance score and thus dominate the rank of the returned items. That may be acceptable, but often it is not, because users typically expect the top-ranked documents to contain ALL of the keywords they specified.
Users learn this expectation from Internet search engines like Google, which use 'AND' as the default operator: all of the terms in the query must also be found in each of the returned documents.
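The difference between the two operators can be shown with a minimal sketch. It only models which documents qualify for the result list, not how they are ranked; the function name and sample documents are hypothetical.

```python
def qualifies(query_terms, doc_terms, operator="or"):
    """Decide whether a document qualifies for the result list.

    'or'  : at least one query term must be present.
    'and' : every query term must be present.
    """
    present = [term in doc_terms for term in query_terms]
    return all(present) if operator == "and" else any(present)

query = ["portal", "search", "tuning"]
doc_partial = {"portal", "theme", "skin"}          # contains one query term
doc_full = {"portal", "search", "tuning", "tips"}  # contains all query terms

or_partial = qualifies(query, doc_partial, "or")
and_partial = qualifies(query, doc_partial, "and")
and_full = qualifies(query, doc_full, "and")
```

With 'OR' the partially matching document is returned; with 'AND' only the document containing all three terms qualifies.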
How to enable 'AND' as default operator
The change is performed by adding a configuration parameter for the search service.
After the parameter has been applied, restart the Portal Server and/or the remote search service, as applicable.
The configuration parameters are:
For V7.x
index.DEFAULT_SEARCH_OPERATOR=and
For V8 and V8.0.0.1
DEFAULT_SEARCH_OPERATOR=and
Note: This change to the search service does not require rebuilding the search collection.
Applying boost factors to specific fields (meta data)
So far, the relevance calculation does not account for the structural information of the content or for the informative weight of keywords stored in individual metadata fields. To put this into context: if a search term appears in the title of a document, its contribution to the relevance score should be higher than that of an occurrence of the same term somewhere in a sentence of the body text.
Values from metadata fields like 'title', 'description' or 'keywords' are automatically added to the generic 'content' field in the search index. A simple search that does not point at a specific field therefore also finds hits in any of these default metadata fields. However, the information about where a search term actually occurs is not taken into account.
For that purpose, WebSphere Portal V8.0.0.1 CF09 introduces a new feature that allows defining which metadata fields to focus on additionally and how much each such field contributes to the relevance calculation for a qualifying document.
To enable this feature, set the following search service configuration parameter.
Name: boostingSettings
Value: {"phraseBoost": {"Enabled":"true"}, "fieldBoost": [{"field":"title", "boost": 3.0} , {"field":"description", "boost":3.0}, {"field":"keywords", "boost":2.0}]}
phraseBoost: not mandatory; could add value, though very much language dependent
fieldBoost: sample provided for default/commonly found metadata fields
Can also include any other metadata fields (with string-based values)
“boost” should be specified in the range from 1.0 to 10.0 and should be used with care (staying in the range between 1.0 and 3.0 is suggested).
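Before applying the parameter, the value can be checked for well-formed JSON and for boost factors within the ranges given above. The following is a sketch of such a check, not a tool shipped with the product; the function name and message texts are made up for illustration.

```python
import json

# Sample value of the boostingSettings parameter as shown above.
BOOSTING_SETTINGS = (
    '{"phraseBoost": {"Enabled":"true"}, "fieldBoost": ['
    '{"field":"title", "boost": 3.0} , '
    '{"field":"description", "boost":3.0}, '
    '{"field":"keywords", "boost":2.0}]}'
)

def check_boosting_settings(raw):
    """Parse the JSON value and report boost factors outside the stated ranges."""
    settings = json.loads(raw)  # raises ValueError if the JSON is malformed
    problems = []
    for entry in settings.get("fieldBoost", []):
        boost = float(entry["boost"])
        if not 1.0 <= boost <= 10.0:
            problems.append(f"{entry['field']}: boost {boost} is outside 1.0 to 10.0")
        elif boost > 3.0:
            problems.append(f"{entry['field']}: boost {boost} exceeds the suggested 3.0")
    return problems

problems = check_boosting_settings(BOOSTING_SETTINGS)
```

For the sample value the list of problems is empty; a factor such as 12.0 would be reported as out of range.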
Example:
A user searches for the terms “Editing content in web content management”.
A qualifying document contains the terms 'edit' and 'content' in its title. In addition, all the terms appear in the 'description' field, and the phrases 'editing content' and 'web content management' appear in the body text.
This document will then have a very high relevance score, because the terms are boosted for their occurrences in 'title' and 'description' and the respective phrases occur in the body of the document.
About phraseBoost and language dependency: if, for example, the "phrase" were a name like 'John Smith', then an exact match on that phrase would get boosted. However, some languages represent that same name as "Smith, John", which would not be counted as a match because the order of the terms is reversed (and thus it is not the same phrase).
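The effect of field boosts on such a document can be approximated with a deliberately simplified model: count the query term occurrences per field and multiply each count by the boost of that field, with unlisted fields such as the body weighted 1.0. How Portal Search actually combines boosts with the Lucene score is not described here, so the function and the sample documents below are assumptions for illustration only.

```python
# Boost factors taken from the sample boostingSettings value.
FIELD_BOOSTS = {"title": 3.0, "description": 3.0, "keywords": 2.0}

def boosted_score(query_terms, doc_fields, boosts=FIELD_BOOSTS):
    """Sum per-field term counts, each weighted by the boost of its field."""
    score = 0.0
    for field, text in doc_fields.items():
        tokens = text.lower().split()
        weight = boosts.get(field, 1.0)  # fields without a boost count once
        score += weight * sum(tokens.count(term) for term in query_terms)
    return score

query = ["editing", "content"]
doc_title_hit = {"title": "Editing content", "body": "how to manage pages"}
doc_body_hit = {"title": "Page management", "body": "editing content is simple"}

score_title = boosted_score(query, doc_title_hit)
score_body = boosted_score(query, doc_body_hit)
```

Both documents contain each query term exactly once, but in this model the document with the terms in its title scores three times as high as the one with the terms only in its body.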
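The order sensitivity of a phrase match can be illustrated with a small sketch that looks for the query terms as a contiguous, ordered sequence of tokens. The tokenization (lowercasing, treating commas as separators) is an assumption and much simpler than what a real analyzer does.

```python
def contains_phrase(phrase, text):
    """True if the phrase tokens occur in the text contiguously and in order."""
    wanted = phrase.lower().split()
    tokens = text.lower().replace(",", " ").split()
    return any(
        tokens[i:i + len(wanted)] == wanted
        for i in range(len(tokens) - len(wanted) + 1)
    )

same_order = contains_phrase("John Smith", "Contact John Smith today")
reversed_order = contains_phrase("John Smith", "Contact Smith, John today")
```

Both texts contain the two terms, so both would match a plain keyword search, but only the first one contains them as a phrase and would qualify for a phrase boost.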
Summary
This article presented methods and tuning options available with WebSphere Portal Search that can be applied to improve the quality of search for end users. Please note that not all options are available for all current releases of WebSphere Portal.