What do you need this indexserver for?
The strength of this package is that it can index all sorts of things (websites, file
systems, files, database tables, ...).
You feed the indexer with information, and later you can query it.
Features:
- Boolean search operators like + - (and, not).
- Search for "fixed names" (e.g. "Bill Gates") using right neighbors.
- Stemming, metaphone, soundex, and fast part-word searches like foo*, foo*bar and *foo.
- Good weighting.
- Foreign keys (for db-related indexes).
- Stopword lists (multilingual).
- Settings via XML (only the basics are implemented so far).
- For db's: settings calculated automatically from naming conventions and table structures (table scanning).
- Returns hints such as "Did you mean xy?" after a search.
- Support for different data formats, currently:
  - strings,
  - arrays,
  - db tables,
  - text files,
  - built-in mime-type handlers for html, pdf, doc and xls,
  - a generic interface for custom mime-type handlers.
- Automatic creation of the internal (MySQL) database tables.
- Highlighting of the matched words in the results.
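To make the query features above more concrete, here is a minimal Python sketch of a tiny in-memory index that understands the + and - operators and trailing-wildcard terms such as foo*. It is not this package's code or API: the names TinyIndex, add and search are hypothetical, and treating terms without a prefix as optional is an assumption made for the illustration.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Hypothetical toy index for illustration; not the package's real API."""

    def __init__(self):
        # word -> set of document ids that contain the word
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Feed the index with one document."""
        for word in re.findall(r"\w+", text.lower()):
            self.postings[word].add(doc_id)

    def _docs_for(self, term):
        """Resolve one term; a trailing * matches every word with that prefix."""
        term = term.lower()
        if term.endswith("*"):
            prefix = term[:-1]
            hits = set()
            for word, docs in self.postings.items():
                if word.startswith(prefix):
                    hits |= docs
            return hits
        return set(self.postings.get(term, set()))

    def search(self, query):
        """+term must match, -term must not match, bare terms are optional."""
        required, excluded, optional = [], [], []
        for token in query.split():
            if token.startswith("+") and len(token) > 1:
                required.append(self._docs_for(token[1:]))
            elif token.startswith("-") and len(token) > 1:
                excluded.append(self._docs_for(token[1:]))
            else:
                optional.append(self._docs_for(token))
        if required:
            result = set.intersection(*required)
        else:
            result = set().union(*optional)
        for docs in excluded:
            result -= docs
        return sorted(result)


index = TinyIndex()
index.add(1, "Bill Gates founded Microsoft")
index.add(2, "The garden gates were open")
index.add(3, "Bill paid the bill")
print(index.search("+bill -gates"))
print(index.search("gat*"))
```

A real indexer would keep the postings in database tables rather than in memory, and would need a smarter structure than a linear scan to make part-word searches fast.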
Examples:
- South Park:
  Episode script files are indexed line by line.
- The search on this website is done using this plugin.
- Shakespeare:
  A file that is indexed line by line.
- The download contains the Shakespeare example, the South Park example, and a filesystem example:
  some directories with text, HTML, PDF, Word and Excel files.
How does the weighting work?
After results have been found for the keywords, they must be ordered by
relevance. To achieve this, different parts of the content are weighted
differently.
- Weight points can be given to different parts of the content that gets indexed.
  Default weight properties exist for different data types (db's, html).
  Examples:
  - The words in the title of a website are more important than the
    words in the body.
  - A CHAR(20) db field is more important than a BLOB. Foreign key fields
    are even less important.
- A count is maintained for each word, so we know whether a word is special or common
  for your application. 'madonna' may be a special word if you're indexing the
  world, but if you're indexing a db about Madonna songs it is a common one.
- If a word is used 30 times in a text of 1'000 words, it is more important
  than a word that is used once in 10'000 words.
- Long words are considered more special and are therefore more important when searching.
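The four factors above could be combined along the following lines. This is only a sketch under stated assumptions: the function name score_word, the weight values, the use of a per-document count for rarity, and the choice to multiply the factors are all invented for illustration and are not taken from the package.

```python
import math

# Hypothetical weight points per content part (assumed values).
FIELD_WEIGHTS = {"title": 10.0, "body": 1.0}


def score_word(word, field, count_in_doc, doc_length, docs_with_word, total_docs):
    """Combine field weight, rarity, density and word length into one score."""
    field_weight = FIELD_WEIGHTS.get(field, 1.0)
    # Rarity: words found in few documents are more special for this index.
    rarity = math.log((1 + total_docs) / (1 + docs_with_word)) + 1.0
    # Density: 30 hits in 1'000 words beats 1 hit in 10'000 words.
    density = count_in_doc / doc_length
    # Length: longer words get a mild bonus (the log keeps it from dominating).
    length_bonus = 1.0 + math.log(len(word))
    return field_weight * rarity * density * length_bonus


dense = score_word("madonna", "body", 30, 1000, 5, 100)
sparse = score_word("madonna", "body", 1, 10000, 5, 100)
print(dense > sparse)
```

Multiplying the factors means that a word scoring poorly on any single factor is pulled down overall. An additive scheme with tunable points per factor would be an equally plausible reading of the description.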