In 2000, computer science researchers at Carnegie Mellon University started a project to fundamentally change how search results are organized. The idea behind the approach, called clustering, was to find meaningful connections among Internet search results to speed up and improve research.
During the next five years the resulting company, Vivísimo, commercialized the original clustering engine and extended its offerings to include meta (federated) search and its own search engine. Called Velocity, this technology threesome should have wide appeal. Academics, scientists, government analysts, market researchers, online publishers, and product managers in any industry will benefit from Velocity 4.2, as they all must search through and make sense of large, diverse data sources.
Velocity specifically includes three components: Vivísimo Clustering Engine, which automatically categorizes search results on the fly into meaningful hierarchical folders (it overlays any search or database query engine); Vivísimo Content Integrator, for simultaneously querying multiple content sources — such as search engines and databases — in one step and combining the retrieved information; and the Vivísimo Search Engine.
Enterprises will typically start with clustering and metasearch because most already have some type of search engine in place. I tested all three components, however, on an Intel-based server running Red Hat Enterprise Linux 3.0.
Velocity is an especially deep product, as reflected in the number of options available from the UI. But that UI may confuse first-time users: some menus are several layers deep and not always labeled intuitively. Vivísimo developers are working with usability experts to address this shortcoming.
Still, in the more important performance areas, Velocity delivers. To evaluate clustering, I connected Velocity to an existing Verity Ultraseek search engine. The process involves completing two Web forms: one that describes the XML output from the search engine and a second that indicates how to parse the results. Although this does require knowledge of your original search implementation, I had clustering running in approximately 30 minutes. Vivísimo has done an excellent job of organizing results into clusters by intelligently using words and phrases contained in the original searches.
Although Velocity didn’t have a specific setup for Ultraseek, there are clustering templates for other common enterprise search engines, including the Google Search Appliance; these prepopulated forms should save administrators time and reduce setup errors.
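To make the two-form setup more concrete, here is a minimal Python sketch of the same idea: one piece of configuration describes where results live in the engine's XML, another says which fields to pull out, and the parsed results are then grouped into folders. The XML shape, the `RESULT_PATH` and `FIELDS` names, and the shared-title-word grouping are all assumptions made for illustration. This is not Ultraseek's actual output format, and the grouping is a deliberately naive stand-in for Vivísimo's clustering, whose internals the product does not expose.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical XML output from a search engine; a real engine's format will differ.
SAMPLE_XML = """
<results>
  <hit><url>http://intranet/a</url><title>Expense report policy</title></hit>
  <hit><url>http://intranet/b</url><title>Travel expense form</title></hit>
  <hit><url>http://intranet/c</url><title>Travel booking guide</title></hit>
</results>
"""

# Stand-in for the first form: where each result sits in the XML.
RESULT_PATH = "hit"
# Stand-in for the second form: which child elements to extract per result.
FIELDS = {"url": "url", "title": "title"}

STOPWORDS = {"the", "a", "of", "and", "for"}


def parse_results(xml_text):
    """Turn the engine's XML into a list of {field: value} dictionaries."""
    root = ET.fromstring(xml_text.strip())
    return [
        {name: hit.findtext(path, default="") for name, path in FIELDS.items()}
        for hit in root.findall(RESULT_PATH)
    ]


def cluster_by_term(results, min_size=2):
    """Naive grouping: a folder is any title word shared by min_size+ results."""
    folders = defaultdict(list)
    for result in results:
        terms = {w for w in result["title"].lower().split() if w not in STOPWORDS}
        for term in terms:
            folders[term].append(result["url"])
    return {term: urls for term, urls in folders.items() if len(urls) >= min_size}


if __name__ == "__main__":
    for folder, urls in cluster_by_term(parse_results(SAMPLE_XML)).items():
        print(folder, urls)
```

Note that a result can land in more than one folder here, which matches the general idea of overlapping clusters but says nothing about how Velocity ranks or labels them.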
Configuring Vivísimo’s search engine required little effort. I easily created a source by defining the starting URL of an intranet Web site. Then I selected a few other options, such as the maximum link depth. Again, clustered results were very precise; no fine-tuning was required.
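For readers unfamiliar with what a starting URL and maximum link depth control, the following Python sketch shows a depth-limited, breadth-first crawl. The `LINKS` table is a made-up, in-memory stand-in for a real intranet, and the function names are hypothetical; Velocity's own crawler is configured through its administration forms and may behave differently.

```python
from collections import deque

# Hypothetical link graph standing in for a real intranet site.
LINKS = {
    "http://intranet/": ["http://intranet/hr", "http://intranet/it"],
    "http://intranet/hr": ["http://intranet/hr/policies"],
    "http://intranet/it": ["http://intranet/"],
    "http://intranet/hr/policies": ["http://intranet/hr/policies/travel"],
}


def crawl(start_url, max_depth, get_links=lambda url: LINKS.get(url, [])):
    """Visit pages breadth-first, following links at most max_depth hops out."""
    seen = {start_url}
    visited = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the configured depth
        for link in get_links(url):
            if link not in seen:  # avoid revisiting pages and looping forever
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


if __name__ == "__main__":
    print(crawl("http://intranet/", max_depth=1))
```

A small depth keeps the index focused on pages near the starting point; raising it pulls in deeper pages at the cost of a longer crawl.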
Next, I used the built-in search to index documents on a file server and in a Microsoft SQL Server database. Besides handling typical file formats (Microsoft Office, PDF, e-mail archives, and Zip archives), the search engine crawls sources that require authentication, such as a content management system. In that case the software correctly hid results from users not authorized to view them.
With the basics done, I examined customization. Velocity offers just about all the control an enterprise requires. For example, customized HTML parsing allowed me to strip unnecessary markup from pages, including navigation and sidebar links. As a consequence, result summaries were even cleaner and more relevant.
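The effect of stripping navigation and sidebar markup before indexing can be sketched in a few lines of Python using the standard library's `html.parser`. The choice of tags to skip (`nav`, `aside`, `script`, `style`) is an assumption for this example; real pages often mark navigation with site-specific `div` classes, and Velocity's parser is configured through its own rules rather than code like this.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collect page text while ignoring anything inside the skipped tags."""

    # Assumed set of non-content tags; adjust for the site being indexed.
    SKIP_TAGS = {"nav", "aside", "script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the page text with navigation and sidebar content removed."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = (
        "<html><body><nav><a href='/'>Home</a></nav>"
        "<p>Quarterly results rose.</p>"
        "<aside>Related links</aside></body></html>"
    )
    print(extract_text(page))
```

Summaries built from the extracted text no longer start with menu labels, which is the kind of cleanup that made the result snippets more relevant in my testing.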
Velocity ships with extensive knowledge bases of synonyms, acronyms, abbreviations, and associations (in English and other languages) that improve clustering of similar ideas. Using the administration tool, I then added words and phrases specific to my organization. (Specialized optional modules are available for general science, government, news, and biomedicine.)
Yet another interesting function allows you to attach external metadata to documents; most other solutions require this information to be embedded in each document, which takes a lot of manual effort. In my testing I employed a separate spreadsheet containing additional information about marketing material that existed in PDF format. Using the parser, I associated each Acrobat document with the spreadsheet row holding its metadata, which produced more precise results.
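The underlying idea is a simple join between a table of metadata and the documents being indexed. The Python sketch below assumes the spreadsheet has been exported to CSV and that documents are matched by file name; the column names, file names, and function names are invented for illustration and do not reflect Velocity's parser configuration.

```python
import csv
import io

# Hypothetical spreadsheet export: one row of metadata per PDF.
METADATA_CSV = """filename,product,audience
brochure_a.pdf,Widget,Retail
datasheet_b.pdf,Gadget,Enterprise
"""


def load_metadata(csv_text, key="filename"):
    """Index spreadsheet rows by the column that identifies each document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row[key]: {name: value for name, value in row.items() if name != key}
        for row in reader
    }


def attach_metadata(documents, metadata):
    """Return copies of the documents with their external metadata attached."""
    return [
        {**doc, "metadata": metadata.get(doc["filename"], {})}
        for doc in documents
    ]


if __name__ == "__main__":
    docs = [{"filename": "brochure_a.pdf"}, {"filename": "unlisted.pdf"}]
    for doc in attach_metadata(docs, load_metadata(METADATA_CSV)):
        print(doc)
```

Documents without a matching row simply get empty metadata, so the spreadsheet can be filled in gradually without touching the PDFs themselves.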
To wrap up, I created a metasearch source that collected my Verity search of an external Web site, the Vivísimo intranet search, and an external XML news feed. The federated results were correctly shown together in one search and clustered into pertinent categories. Administrators could do the same bundling of other external sources — such as premium news and research services — and popular databases. (MySQL, PostgreSQL, IBM DB2, Oracle, and Sybase are also supported.)
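Conceptually, a metasearch source sends one query to several back ends and merges what comes back. The Python sketch below shows that merge step with two stub sources; the source functions, URLs, and result fields are hypothetical, and the code makes no claim about how Vivísimo Content Integrator queries, ranks, or deduplicates results.

```python
def web_search(query):
    """Stub for an external Web search source (hypothetical)."""
    return [{"url": "http://example.com/1", "title": f"{query} overview"}]


def intranet_search(query):
    """Stub for an intranet search source (hypothetical)."""
    return [
        {"url": "http://intranet/1", "title": f"{query} policy"},
        {"url": "http://example.com/1", "title": f"{query} overview"},
    ]


def federated_search(query, sources):
    """Query every source and merge hits, keeping one entry per URL."""
    merged = {}
    for name, search in sources.items():
        for hit in search(query):
            entry = merged.setdefault(
                hit["url"],
                {"url": hit["url"], "title": hit["title"], "sources": []},
            )
            entry["sources"].append(name)  # remember which sources returned it
    return list(merged.values())


if __name__ == "__main__":
    sources = {"web": web_search, "intranet": intranet_search}
    for result in federated_search("travel", sources):
        print(result)
```

The merged list could then be handed to a clustering step so that results from all sources appear together in the same folders.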
Vivísimo Velocity reflects its design by computer scientists: It’s a thorough, high-performing search platform, but it requires some schooling for maximum tuning. That said, the basics can be deployed within a day at much lower cost than typical enterprise-search undertakings. And within a few days you can likely have the system interpreting large amounts of information from many internal and external sources and presenting the results in organized folders. That’s a timeline and economy that other solutions, which typically involve long-term taxonomy projects requiring costly professional services, will have a hard time matching. For all these reasons, Vivísimo Velocity 4.2 should be on your short list as a primary information-retrieval platform.