frutch [wiki]:ParseXmlProposal

''DRAFT''

Summary of issue

1. HTML Title

Today, for HTML, the title field is populated with the content of the /html/head/title element. That element is sometimes empty or irrelevant: the "real" title of the page may sit elsewhere, for example in a p or a div with a specific class, or in the first h1.

However, to access those elements one has to edit an existing plugin or create a new one, compile it, and repeat the process for each specific kind of content to parse and index.

2. XML parsing and fields

Nutch has numerous plugins for parsing and indexing specific fields for specific content types or structures: Creative Commons, RSS, etc.

Whenever a new content structure or a new piece of information needs to be indexed, a new plugin has to be written and compiled.

A more generic approach would be useful: one in which fields could be configured and populated without creating a new plugin or compiling anything.

Proposed remedy

This proposal aims to create a generic plugin for parsing and indexing XML-related content, be it "real" XML (XHTML, RSS, ...) or plain HTML after HTML-to-XML conversion, using TagSoup for example.

The plugin would transform each document to be indexed into an indexable document using XSLT.

The indexable document would be an XML document listing every field with the appropriate values taken from the document to be indexed (see format in ParseSchemaProposal).
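The transformation step can be illustrated with the XSLT support that ships with the JDK. This is only a sketch under assumptions: the class name, the stylesheet, and the `<fields>/<field name="...">` output layout are all hypothetical (the actual format is the one defined in ParseSchemaProposal), and the input is assumed to be well-formed XML already, i.e. after any HTML-to-XML conversion.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltFieldExtractor {

    // Hypothetical stylesheet: fills the "title" field from the first h1
    // instead of /html/head/title. The <fields>/<field> layout is an
    // assumption, not the ParseSchemaProposal format.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"/\">"
      + "<fields><field name=\"title\">"
      + "<xsl:value-of select=\"normalize-space((//h1)[1])\"/>"
      + "</field></fields>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the given XSLT to the given XML and returns the result as text.
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // A page whose head/title is empty but whose first h1 holds the real title.
        String page = "<html><head><title></title></head>"
            + "<body><h1> Real title </h1><p>Body text</p></body></html>";
        System.out.println(transform(page, STYLESHEET));
    }
}
```

Changing which element feeds the title field then only means editing the XPath expression in the stylesheet, with nothing to recompile.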

The XSLT stylesheet to apply to a given document would be chosen by matching the document's URL against a list of regular expressions.

The list of regexps and associated XSL stylesheets would reside in a configuration file such as:
<file-to-index url-format="regexp">
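The lookup itself could work roughly as follows. This is a sketch, not the plugin's design: the class and method names, the example URLs, and the stylesheet file names are hypothetical, and it assumes that rules are tried in the order they appear in the configuration file and that a regexp must match the whole URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

class StylesheetSelector {

    // One configuration entry: a URL regexp and the stylesheet it selects.
    private static final class Rule {
        final Pattern urlPattern;
        final String stylesheet;

        Rule(Pattern urlPattern, String stylesheet) {
            this.urlPattern = urlPattern;
            this.stylesheet = stylesheet;
        }
    }

    // Rules are kept in insertion order; the first match wins (an assumption).
    private final List<Rule> rules = new ArrayList<>();

    StylesheetSelector add(String urlRegexp, String stylesheet) {
        rules.add(new Rule(Pattern.compile(urlRegexp), stylesheet));
        return this;
    }

    // Returns the stylesheet of the first rule whose regexp matches the whole URL.
    Optional<String> select(String url) {
        for (Rule rule : rules) {
            if (rule.urlPattern.matcher(url).matches()) {
                return Optional.of(rule.stylesheet);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical rules; in the proposal they would be read from the configuration file.
        StylesheetSelector selector = new StylesheetSelector()
            .add("http://example\\.org/news/.*", "news.xsl")
            .add(".*\\.rss", "rss.xsl");

        System.out.println(selector.select("http://example.org/news/today.html").orElse("(none)"));
        System.out.println(selector.select("http://example.org/about").orElse("(none)"));
    }
}
```

Returning an empty result for unmatched URLs leaves it to the caller to decide whether such documents are skipped or handed to another parser.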

Fields configuration

The fields to be used should be defined independently of the transformation of the files to be indexed because:
- fields are specific to a Nutch application, not to a specific content type/structure (as is the case today: the "title" field is the same whether the indexed file is HTML or PDF)
- each field carries information that needs to be configured

The configuration of a field should include the following information:
- name of the field
- type (Lucene field type):

. Keyword
. UnIndexed
. UnStored
. Text

Note: the language/analyzer to be used should not be tied to the field but to the document.

What are fields for

Lucene fields definition:
- Keyword: "not analyzed, but indexed and stored". Very useful for storing 'metadata', i.e. data that should be searched 'as is'.
- UnIndexed: "neither analyzed nor indexed, but stored 'as is'". Very useful for storing data that should not be searched but that is needed when displaying a document (typically, the URL).
- UnStored: "analyzed and indexed, but not stored". Default field type for searching in plain text. Needs to be used for the 'default' general field, but can also be used to perform searches on sub-parts of the document, such as a "summary" field.
- Text: "analyzed and indexed, and sometimes stored" (stored when the value is passed as a String, not stored when it is passed as a Reader). (I don't know what to use it for).
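The four definitions above boil down to three yes/no properties per type: stored, indexed, analyzed. The sketch below models that with a plain Java enum and does not call Lucene itself; the names `FieldTypes`, `FieldType`, and `exampleConfiguration`, as well as the "author" and "title" fields, are hypothetical and only illustrate what a field configuration might hold.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FieldTypes {

    // The four Lucene field types listed above, as (stored, indexed, analyzed) flags.
    enum FieldType {
        KEYWORD(true, true, false),
        UNINDEXED(true, false, false),
        UNSTORED(false, true, true),
        // Modeled as stored here; the definition above says "sometimes stored".
        TEXT(true, true, true);

        final boolean stored;
        final boolean indexed;
        final boolean analyzed;

        FieldType(boolean stored, boolean indexed, boolean analyzed) {
            this.stored = stored;
            this.indexed = indexed;
            this.analyzed = analyzed;
        }
    }

    // Hypothetical field configuration: field name -> field type.
    static Map<String, FieldType> exampleConfiguration() {
        Map<String, FieldType> fields = new LinkedHashMap<>();
        fields.put("url", FieldType.UNINDEXED);    // displayed, never searched
        fields.put("author", FieldType.KEYWORD);   // hypothetical metadata, searched 'as is'
        fields.put("summary", FieldType.UNSTORED); // searched as text, not stored
        fields.put("title", FieldType.TEXT);       // hypothetical choice
        return fields;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, FieldType> e : exampleConfiguration().entrySet()) {
            FieldType t = e.getValue();
            System.out.println(e.getKey() + ": " + t
                + " stored=" + t.stored
                + " indexed=" + t.indexed
                + " analyzed=" + t.analyzed);
        }
    }
}
```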
