The Lemur Toolkit
Features
The Lemur Toolkit provides the following features:
Sophisticated structured query languages (using InQuery and Indri)
Support for XML and structured document retrieval
Commonly used with a wide range of research test collections (e.g., TREC CDs 1-5, wt10g, RCV1, gov, gov2)
Index your web pages with an "out-of-the-box" site search capability
Interactive interfaces for Windows, Linux, and Web
Distributed information retrieval and document clustering applications
Cross-platform, fast and modular code written in C++
C++, Java and C# APIs
Free and open-source software
In use for over 6 years by a large and growing user community
Indexing:
Multiple indexing methods for small, medium and large-scale (terabyte) collections
Built-in support for English, Chinese and Arabic text
Porter and Krovetz word stemming
Incremental indexing
Out-of-the-box indexing support for TREC Text, TREC Web, plain text, HTML, XML, PDF, MBox, Microsoft Word, and Microsoft PowerPoint
Indexes inline and offset text annotations (e.g., part-of-speech and named entities)
Indexes document attributes
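
To make the indexing features above more concrete, here is a minimal sketch of an inverted index that accepts documents incrementally and stores per-document attributes. It is a toy illustration only: the class TinyIndex and its methods are hypothetical names invented for this example, and nothing here reflects Lemur's actual index format or API. Tokenization is a plain lowercase whitespace split, with no stemming.

```python
from collections import defaultdict


class TinyIndex:
    """Toy in-memory inverted index (hypothetical; not Lemur's API)."""

    def __init__(self):
        # term -> {doc_id: term frequency in that document}
        self.postings = defaultdict(dict)
        # doc_id -> attribute dictionary (e.g. a source or a date)
        self.attributes = {}

    def add_document(self, doc_id, text, **attrs):
        """Add one document without rebuilding the existing postings."""
        for term in text.lower().split():
            self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1
        self.attributes[doc_id] = attrs

    def lookup(self, term):
        """Return {doc_id: term frequency} for a term, empty if unseen."""
        return dict(self.postings.get(term.lower(), {}))


index = TinyIndex()
index.add_document("d1", "lemur toolkit indexing", source="web")
# A later addition only touches the postings of the new document's terms.
index.add_document("d2", "incremental indexing of text")
```

A real large-scale index would keep postings on disk and merge new segments rather than hold everything in a dictionary; the sketch only shows the idea of adding documents without reindexing the old ones.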
Retrieval:
Supports major language modeling approaches such as Indri and KL-divergence, as well as vector space, tf.idf, Okapi and InQuery
Relevance feedback and pseudo-relevance feedback
Wildcard term expansion (using Indri)
Passage and XML element retrieval
Cross-lingual retrieval
Smoothing via Dirichlet priors and Markov chains
Supports arbitrary document priors (e.g., PageRank, URL depth)
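
As an illustration of the smoothing item in the list above, here is a minimal sketch of query-likelihood scoring with Dirichlet prior smoothing, where each term probability is estimated as (count in document + mu * collection probability) / (document length + mu). This is the standard textbook formula, not code taken from Lemur. The function name dirichlet_score, the sample documents, and the value mu=2000 are assumptions made for this example.

```python
import math
from collections import Counter


def dirichlet_score(query_terms, doc_terms, collection_counts, collection_len, mu=2000.0):
    """Log query likelihood of a document under Dirichlet prior smoothing.

    mu is the smoothing parameter; 2000 is an illustrative value only.
    """
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        # Background probability of the term in the whole collection.
        p_coll = collection_counts.get(term, 0) / collection_len
        if p_coll == 0.0:
            continue  # term unseen in the collection: skipped in this sketch
        p_term = (doc_counts[term] + mu * p_coll) / (doc_len + mu)
        score += math.log(p_term)
    return score


# Hypothetical two-document collection, already tokenized.
docs = {
    "d1": "lemur toolkit for language modeling".split(),
    "d2": "document clustering and retrieval tools".split(),
}
collection_counts = Counter(t for terms in docs.values() for t in terms)
collection_len = sum(collection_counts.values())

query = ["lemur", "modeling"]
scores = {
    doc_id: dirichlet_score(query, terms, collection_counts, collection_len)
    for doc_id, terms in docs.items()
}
```

Because of the smoothing, d2 still receives a finite score even though it contains neither query term; it simply ranks below d1. A document prior such as PageRank could be folded in by adding its log to the score, though that step is not shown here.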
Are you sure Lemur is really the name of the software? I'm taking this from the official site... -_-"