The lucene file indexer class.
This class indexes files on disc, either one by one or as a whole file hierarchy tree.
Located in /lucene-defs.php (line 1421)
Application we are indexing for
Index fields definitions array. Contains definitions
Host to connect to
ID generation offset
ID generation prefix
ID generation source
Fields for indexing. This is an array of fieldname/value
The index ID
Path to a lockfile we should give way to. If this value
Number of seconds to wait on a lockfile. If zero, wait forever.
The index object which does the work
Scan for meta tags as fields in file content. Recommended.
Meta fields definitions array. Contains definitions
Port to connect to
Timeout for indexing commands in seconds (can usually leave
Indexing execution timer
Constructor
Create a new lucene indexer
Define a lockfile which we must avoid during indexing. If defined then no indexing will take place while the lockfile exists. The second parameter allows you to specify a limit to the patience of this process, in seconds. Zero means wait forever.
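The lockfile behaviour described above can be sketched as a simple polling loop. This is a minimal illustration only, not the class's actual implementation: the function name `sketch_wait_for_lockfile` and the `$poll_usecs` parameter are hypothetical.

```php
<?php
// Hypothetical helper, not part of the documented class: wait while a
// lockfile exists, giving up after $wait_secs seconds (0 = wait forever).
function sketch_wait_for_lockfile(string $lockfile, int $wait_secs = 0, int $poll_usecs = 200000): bool
{
    $started = time();
    while (true) {
        // PHP caches stat results, so clear them before each check.
        clearstatcache(true, $lockfile);
        if (!file_exists($lockfile)) {
            return true;   // lockfile gone (or never there): safe to index
        }
        if ($wait_secs > 0 && (time() - $started) >= $wait_secs) {
            return false;  // patience exhausted, lockfile still present
        }
        usleep($poll_usecs);
    }
}
```

A caller would proceed with indexing only when the function returns TRUE.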
Define a field. We supply the name of the field, its type (Text, Date or Id), and whether it should be stored by Lucene for later retrieval in queries. For example you would not store the raw document/content as this is usually stored elsewhere.
IMPORTANT NOTE: Fields defined here will automatically be included as meta fields.
Set the source for ID generation. Since we are indexing a bunch of files, the IDs have to be generated on demand inside the loop. So we provide for various ways here, and you can extend this class to provide more if required.
Main ways:
ID_FROM_INC       Increment a counter by 1 each time (with offset)
ID_FROM_NAME      Take the filename, strip the extension, add prefix
ID_FROM_FILENAME  Take the full filename, add prefix
ID_FROM_PATH      Take the full file path
NB: These are all defined as integer constants.
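As an illustration of the four ID sources, here is a rough sketch of how an ID might be derived for one file. It is not the library's code: the `SKETCH_*` constants and their values are placeholders (the real ID_FROM_* integer values are not given here), and `sketch_generate_id` is a hypothetical name.

```php
<?php
// Placeholder constants for this sketch only; the real ID_FROM_* constants
// are integers defined by the library and may have different values.
const SKETCH_ID_FROM_INC      = 1;
const SKETCH_ID_FROM_NAME     = 2;
const SKETCH_ID_FROM_FILENAME = 3;
const SKETCH_ID_FROM_PATH     = 4;

// $counter is passed by reference and should start at the desired offset.
function sketch_generate_id(int $source, string $path, string $prefix, int &$counter): string
{
    switch ($source) {
        case SKETCH_ID_FROM_INC:
            // Use the current counter value, then advance it by one.
            return (string) $counter++;
        case SKETCH_ID_FROM_NAME:
            // File name without directory or extension, with prefix.
            return $prefix . pathinfo($path, PATHINFO_FILENAME);
        case SKETCH_ID_FROM_FILENAME:
            // File name including extension, with prefix.
            return $prefix . basename($path);
        case SKETCH_ID_FROM_PATH:
            // Full path; the tree-indexing notes say the prefix applies here too.
            return $prefix . $path;
        default:
            throw new InvalidArgumentException("Unknown ID source: $source");
    }
}
```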
Supply field content for indexing. This causes Lucene to take the given fieldname and index the given value against it.
The field name can have the field type included in the form 'Foo:Date', where 'Date' is the type in this instance. In fact, since 'Text' is the default field type, 'Date' is probably the only one you need to use as the current implementation stands.
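The 'Foo:Date' naming convention can be illustrated with a small parser that splits a field specification into a name and a type, falling back to 'Text' when no type is given. The function name `sketch_split_field_spec` is hypothetical and the class may parse names differently.

```php
<?php
// Hypothetical helper: split 'Name:Type' into its parts, defaulting to 'Text'.
function sketch_split_field_spec(string $spec, string $default_type = 'Text'): array
{
    // Limit to two parts so only the first colon separates name and type.
    $parts = explode(':', $spec, 2);
    $name  = $parts[0];
    $type  = (isset($parts[1]) && $parts[1] !== '') ? $parts[1] : $default_type;
    return ['name' => $name, 'type' => $type];
}
```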
Index a file located at the given path, using given ID.
You can also use the parameter $fields to supply an array of fieldname/value pairs to index with this file, for one-off indexing of files. If the field is a date field, make sure to give its name as 'Foo:Date', so that the field is defined with the correct type.
Index a tree of files starting at the path given. We index these in one of four modes, which determines how we generate the ID for each item:
'ID_FROM_INC' mode uses an incremental counter starting at 1. If $prefix holds a number, the counter will start at this number instead of one. Each item has an ID incremented by one from the last one.
'ID_FROM_NAME' mode uses the filename, stripped of any path and extension as the ID. If prefix is not a nullstring, then it is prefixed to every filename ID.
'ID_FROM_FILENAME' mode uses the filename, including any extension as the ID. If prefix is not a nullstring, then it is prefixed to every filename ID.
'ID_FROM_PATH' mode uses the full path to the item being indexed as the ID. If prefix is not a nullstring, then it is prefixed to every filename ID.
The file will simply be indexed as a single Text field, with the appropriate ID, and no other index fields unless $metascan is set to TRUE. If this is the case, the system will scan the file for HTML meta tags of form: '<meta name="foo" content="bar">'. In this example a field of name 'foo' would be given value 'bar'.
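To make the meta tag scan concrete, the following sketch extracts name/content pairs from tags of the form shown above. It is an assumption-laden illustration, not the class's scanner: `sketch_scan_meta_tags` is a hypothetical name, and the pattern only handles double-quoted attributes with `name` appearing before `content`.

```php
<?php
// Hypothetical helper: collect <meta name="..." content="..."> pairs.
// Only double-quoted attributes in name-then-content order are matched.
function sketch_scan_meta_tags(string $html): array
{
    $fields  = [];
    $pattern = '/<meta\s+name\s*=\s*"([^"]*)"\s+content\s*=\s*"([^"]*)"\s*\/?>/i';
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            // $m[1] is the tag's name attribute, $m[2] its content attribute.
            $fields[$m[1]] = $m[2];
        }
    }
    return $fields;
}
```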
Define a field as a meta tag. This ensures that the field will be picked up from the file meta tags, if present. If it is not listed here then it will be ignored.
IMPORTANT NOTE: We define the strict rule that ONLY fields which have been defined here can be added to the indexing via the meta tag scanning. I.e. you must define fields here explicitly, or via the define_field() method, or they will be ignored even if they turn up as a meta tag. This is so we can restrict the indexing, and be sure of field types.
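The strict rule above amounts to filtering scanned tags against the set of defined fields. A minimal sketch, assuming the definitions are held as a fieldname => type array (the actual internal structure is not documented here, and `sketch_filter_meta_fields` is a hypothetical name):

```php
<?php
// Hypothetical helper: keep only scanned meta tags whose names were defined.
// $defined maps field name => type; $scanned maps tag name => content.
function sketch_filter_meta_fields(array $scanned, array $defined): array
{
    $accepted = [];
    foreach ($scanned as $name => $value) {
        if (isset($defined[$name])) {
            // The type comes from the definition, never from the file itself.
            $accepted[$name] = ['type' => $defined[$name], 'value' => $value];
        }
    }
    return $accepted;
}
```

Taking the type from the definition rather than from the file is what lets the indexer be sure of field types.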
Flag that we should NOT do a tag scan on the content of the files.
Flag that we should do a tag scan on the content of the files to try and extract fields to index. Note that any tags thus found will only be used if the field name has been defined with the method define_field(). This causes both the <title> tag and <meta> tags to be considered.
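Since the tag scan also considers the <title> tag, here is a rough sketch of pulling a title out of file content. The function name `sketch_scan_title` is hypothetical, and how the class maps the title to a field name is not stated here.

```php
<?php
// Hypothetical helper: return the trimmed <title> text, or null if absent.
function sketch_scan_title(string $html): ?string
{
    // 'i' ignores tag case; 's' lets the title span multiple lines.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim($m[1]);
    }
    return null;
}
```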
Documentation generated by phpDocumentor 1.3.0RC3