/**
 * This is the txt version of a talk on Lucene. If you need the
 * pdf version, please contact me (pengjy@263.net).
 * Author: pengjy
 * Date: 2002-04
* keyWords: lucene, api, token, index, chinese, unicode
*/
................page 1 ................
Lucene
an open source text search engine API
high-performance,
full-featured, pure Java
pengjy@263.net
................page 2 ................
Agenda
Overview
APIs
How does a Search Engine Work
Features
For Chinese Characters
................page 3 ................
Overview
An Apache Jakarta Project
High-performance, full-featured
Open source text search engine APIs
Easy to use, fast to build your own search engine
................page 4 ................
Overview
Version 1.2 rc4
Applications using Lucene
2a.WebSearch
Jive Forums
RockyNewsgroup.org
................page 5 ................
APIs
org.apache.lucene.analysis
defines an abstract Analyzer API for converting
text from a java.io.Reader into a TokenStream,
an enumeration of Token's. A TokenStream is composed
by applying TokenFilter's to the output of a Tokenizer.
A few simple implementations are provided, including
StopAnalyzer and the grammar-based StandardAnalyzer
(built with JavaCC).
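The Tokenizer/TokenFilter pipeline above can be sketched in plain Java, without the Lucene classes. The letter-run splitting and the lowercase filter below are illustrative stand-ins, not Lucene's own implementations:

```java
import java.util.ArrayList;
import java.util.List;

// A minimal sketch of the analysis pipeline: a tokenizer produces
// tokens, and filters transform the token stream (names are
// illustrative, not the Lucene classes themselves).
public class AnalysisSketch {

    // Tokenizer step: split input into maximal runs of letters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    // TokenFilter step: lowercase every token in the stream.
    static List<String> lowercaseFilter(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) out.add(t.toLowerCase());
        return out;
    }
}
```

An analyzer, in these terms, is just the composition lowercaseFilter(tokenize(text)).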
................page 6 ~ 9 ................
APIs
org.apache.lucene.document
provides a simple Document class. A document is
simply a set of named Field's, whose values may be
strings or instances of java.io.Reader.
org.apache.lucene.index
provides two primary classes: IndexWriter, which
creates and adds documents to indices; and IndexReader,
which accesses the data in the index.
org.apache.lucene.queryParser
uses JavaCC to implement a QueryParser
org.apache.lucene.search
provides data structures to represent queries
(TermQuery for individual words, PhraseQuery for phrases,
and BooleanQuery for boolean combinations of queries) and
the abstract Searcher which turns queries into Hits.
IndexSearcher implements search over a single IndexReader.
org.apache.lucene.store
defines an abstract class for storing persistent
data, the Directory: a collection of named files written
by an OutputStream and read by an InputStream. Two
implementations are provided, FSDirectory, which uses
a file system directory to store files, and RAMDirectory
which implements files as memory-resident data structures.
org.apache.lucene.util
contains a few handy data structures, e.g.,
BitVector and PriorityQueue.
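Putting the packages above together, indexing and searching one document looks roughly like this against the Lucene 1.2 API (a sketch: the classes are those listed above, Field.Text and the static QueryParser.parse are the 1.x-era signatures, and the RAMDirectory-based flow and sample text are illustrative):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

public class LuceneSketch {
    // Index one document in a RAMDirectory, then count the hits
    // for a query over the "body" field.
    static int countHits() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Text("body", "Lucene is a text search engine API"));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);
        Query query = QueryParser.parse("search engine", "body", new StandardAnalyzer());
        Hits hits = searcher.search(query);
        int n = hits.length();
        searcher.close();
        return n;
    }
}
```

The same QueryParser also accepts fielded syntax such as author:Hamilton, which is how the field-specific queries described later are expressed.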
................page 10 ................
How does a Search Engine Work
Create indices
input --analyzer--> tokens --filters--> indices
       (tokenize)
................page 11 ~ 14 ................
How does a Search Engine Work
Store Indices
Rather than maintaining a single index, it builds
multiple index segments. For each new document indexed,
Lucene creates a new index segment.
It merges small segments with larger ones -- this
keeps the total number of segments small so searches remain
fast.
To prevent conflicts (or locking overhead) between
index readers and writers, Lucene never modifies segments
in place; it only creates new ones. When merging segments,
Lucene writes a new segment and deletes the old ones --
after any active readers have closed them.
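The merge behaviour can be sketched as follows. The mergeFactor of 10 is an assumption for illustration (it is Lucene's default), and segment sizes are simplified to document counts:

```java
import java.util.ArrayList;
import java.util.List;

// A sketch of Lucene-style segment merging: each new document
// becomes a one-document segment, and whenever MERGE_FACTOR
// segments of equal size accumulate, they are merged into one.
public class MergeSketch {
    static final int MERGE_FACTOR = 10; // illustrative; Lucene's default

    final List<Integer> segments = new ArrayList<Integer>(); // sizes in docs

    void addDocument() {
        segments.add(1); // a new single-document segment
        maybeMerge();
    }

    private void maybeMerge() {
        // Merge the newest MERGE_FACTOR segments while they share a size.
        while (segments.size() >= MERGE_FACTOR) {
            int n = segments.size();
            int size = segments.get(n - 1);
            boolean sameSize = true;
            for (int i = n - MERGE_FACTOR; i < n; i++) {
                if (!segments.get(i).equals(size)) { sameSize = false; break; }
            }
            if (!sameSize) break;
            // Replace the MERGE_FACTOR small segments with one big one.
            for (int i = 0; i < MERGE_FACTOR; i++) segments.remove(n - 1 - i);
            segments.add(size * MERGE_FACTOR);
        }
    }
}
```

After 100 addDocument calls this leaves a single 100-document segment; the total segment count stays logarithmic in the number of documents, which is what keeps searches fast.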
A Lucene index segment consists of several files:
A dictionary index, containing one entry for each 100 entries
in the dictionary.
A dictionary, containing one entry for each unique word.
A postings file, containing an entry for each posting.
Since Lucene never updates segments in place, they
can be stored in flat files instead of complicated B-trees.
For quick retrieval, the dictionary index contains offsets
into the dictionary file, and the dictionary holds offsets
into the postings file.
Lucene also implements a variety of tricks to compress
the dictionary and posting files -- thereby reducing disk
I/O -- without incurring substantial CPU overhead.
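The two-level lookup (dictionary index, then dictionary, then postings) can be sketched as follows. The interval of 100 matches the layout above; the in-memory arrays and class name are simplifications for illustration:

```java
import java.util.Arrays;

// Sketch of the two-level dictionary lookup: the dictionary index
// holds every 100th term, so a search binary-searches the small
// index first and then scans at most 100 dictionary entries.
public class DictionarySketch {
    static final int INDEX_INTERVAL = 100;

    final String[] dictionary;      // all unique terms, sorted
    final String[] dictionaryIndex; // every 100th term

    DictionarySketch(String[] sortedTerms) {
        dictionary = sortedTerms;
        int n = (sortedTerms.length + INDEX_INTERVAL - 1) / INDEX_INTERVAL;
        dictionaryIndex = new String[n];
        for (int i = 0; i < n; i++) {
            dictionaryIndex[i] = sortedTerms[i * INDEX_INTERVAL];
        }
    }

    // Returns the term's position in the dictionary, or -1 if absent.
    int lookup(String term) {
        // 1. Binary-search the small dictionary index.
        int pos = Arrays.binarySearch(dictionaryIndex, term);
        int block = pos >= 0 ? pos : -pos - 2; // block whose first term <= term
        if (block < 0) return -1;              // term sorts before everything
        // 2. Scan at most INDEX_INTERVAL dictionary entries.
        int start = block * INDEX_INTERVAL;
        int end = Math.min(start + INDEX_INTERVAL, dictionary.length);
        for (int i = start; i < end; i++) {
            if (dictionary[i].equals(term)) return i;
        }
        return -1;
    }
}
```

In the real index the values stored are file offsets (index into the dictionary file, dictionary into the postings file), but the access pattern is the same.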
................page 15 ~ 22 ................
Features
Incremental indexing
Incremental indexing allows easy adding of documents to
an existing index. Lucene supports both incremental and batch
indexing.
Data sources
Lucene allows developers to deliver the document to the
indexer through a String or an InputStream, permitting the
data source to be abstracted from the data. However, with
this approach, the developer must supply the appropriate
readers for the data.
Indexing control
Some search engines can automatically crawl through a
directory tree or a Website to find documents to index.
Since Lucene operates primarily in incremental mode, it lets
the application find and retrieve documents.
File formats
Lucene supports a filter mechanism, which offers a simple
way to index word-processing documents, SGML documents,
and other file formats.
Content tagging
Lucene supports content tagging by treating documents
as collections of fields, and supports queries that
specify which field(s) to search. This permits semantically
richer queries like "author contains 'Hamilton' AND body
contains 'Constitution'".
Stop-word processing
Search engines will not index certain words, called stop
words, such as "a", "and", and "the". Lucene handles stop
words with the more general Analyzer mechanism, and provides
the StopAnalyzer class, which eliminates stop words from the
input stream.
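A stop filter of this kind can be sketched as one more filtering pass over the token stream, in the spirit of StopAnalyzer (the word list here is an illustrative fragment, not Lucene's actual list):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of stop-word removal as a filtering pass over a token
// stream, in the spirit of Lucene's StopFilter.
public class StopSketch {
    // Illustrative fragment of a stop-word list.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "and", "the"));

    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t.toLowerCase())) out.add(t);
        }
        return out;
    }
}
```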
Query features
Lucene supports a wide range of query features, including
all of those listed below:
Boolean queries.
Queries that return a "relevance" score with each hit.
Adjacency or proximity queries -- "search followed
by engine" or "Knicks near Celtics".
Searches on single keywords.
Searches over multiple indexes at once, merging the results
to give a meaningful relevance score.
However, Lucene does not support the valuable "Soundex",
or "sounds like," query.
Concurrency
Lucene allows users to search an index transactionally,
even if another user is simultaneously updating the index.
Non-English support
As Lucene preprocesses the input stream through the
Analyzer class provided by the developer, it is possible to
perform language-specific filtering.
................page 23 ................
For Chinese Characters
JavaCC -- the Java Compiler Compiler.
Used to build complex compilers for languages such as
Java or C++, and to write tools that parse Java source
code and perform automatic analysis or transformation tasks.
Grammars are written in EBNF (Extended Backus-Naur Form).
................page 24 ................
For Chinese Characters
org.apache.lucene.analysis.standard.StandardTokenizer.jj
TOKEN : { // token patterns
  <ALPHANUM: (<LETTER>|<DIGIT>)+ >
  <EMAIL: <ALPHANUM> "@" <ALPHANUM> ("." <ALPHANUM>)+ > // email address
}
................page 25 ................
For Chinese Characters
Add Unicode CJK ranges to StandardTokenizer.jj
[
"\u4e00"-"\u9faf", //CJK Unified Ideographs
"\u3400"-"\u4dbf", //CJK Unified Ideographs Extension A
"\u3000"-"\u303f", //CJK Symbols and Punctuation
"\u2e80"-"\u2eff", //CJK Radicals Supplement
"\u3200"-"\u32ff", //Enclosed CJK Letters and Months
"\ufe30"-"\ufe4f", //CJK Compatibility Forms
"\u3300"-"\u33ff", //CJK Compatibility
"\uf900"-"\ufaff" //CJK Compatibility Ideographs
]
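What the added ranges buy can be checked in plain Java; isCjk below mirrors the character classes listed above (the method name and standalone form are illustrative, not part of the .jj grammar):

```java
// Sketch: a plain-Java predicate equivalent to the CJK character
// ranges added to StandardTokenizer.jj above.
public class CjkSketch {
    static boolean isCjk(char c) {
        return (c >= '\u4e00' && c <= '\u9faf')  // CJK Unified Ideographs
            || (c >= '\u3400' && c <= '\u4dbf')  // Extension A
            || (c >= '\u3000' && c <= '\u303f')  // CJK Symbols and Punctuation
            || (c >= '\u2e80' && c <= '\u2eff')  // CJK Radicals Supplement
            || (c >= '\u3200' && c <= '\u32ff')  // Enclosed CJK Letters and Months
            || (c >= '\ufe30' && c <= '\ufe4f')  // CJK Compatibility Forms
            || (c >= '\u3300' && c <= '\u33ff')  // CJK Compatibility
            || (c >= '\uf900' && c <= '\ufaff'); // CJK Compatibility Ideographs
    }
}
```

With these ranges in the grammar, the tokenizer stops dropping CJK characters from the input instead of treating them as non-letters.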
................page 26 ................
For Chinese Characters
Add Unicode CJK
Build Lucene (using the Lucene 1.2 source and Ant 1.4)
Tested on Windows 2000 Server + WebLogic 6.1 sp2 +
MS SQL Server 2000 + Jive 2.2.3 + Lucene
................page 27 ................
Thank you!
My mail: pengjy@263.net
................The end ................