Correspondence between Haystack fields and Xapian index

jorgecarleitao edited this page May 22, 2014 · 5 revisions

This page is intended for developers who want to know more about the internals of Xapian-Haystack (XH).

In Xapian, a document is a set of terms, which can optionally store the positions (integers) at which each term occurs in the document. Xapian therefore has no concept equivalent to a Haystack field, so XH must define a way to map the different Haystack fields of a SearchIndex into Xapian.

The objective of this page is to explain how this is done.

Indexing

Notation

XH uses 4 different ways to index a term:

  1. unstemmed term: a single word
    • Indexed as <unstemmed term>
    • Example: Happiness.
  2. stemmed term: the stem of the term
    • Indexed as Z<stemmed term>
    • Example: Zhappi (stem of Happiness)
  3. unstemmed term in the field
    • Indexed as X<field_name><unstemmed term> where field_name is the field name in uppercase
    • Example: XSUMMARYhappiness
  4. stemmed term in the field
    • Indexed as ZX<field_name><stemmed term> where field_name is the field name in uppercase
    • Example: ZXSUMMARYhappi

Items 1 and 2 are intended for searches on the whole document (using the keyword content in .filter()), and items 3 and 4 are intended for searches on a specific field (using the appropriate keyword in .filter()).
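
The four forms above can be sketched in plain Python. This is only an illustration, not XH's actual code: term_forms is a hypothetical helper, the stem is passed in by hand instead of coming from a stemmer, and lowercasing the word is an assumption based on the examples.

```python
def term_forms(word, stem, field_name):
    """Return the four index entries for one word (hypothetical helper).

    `stem` is supplied by the caller; a real index would get it from a stemmer.
    """
    word = word.lower()                  # assumption: terms are lowercased
    prefix = 'X' + field_name.upper()    # field prefix, e.g. XSUMMARY
    return [
        word,                 # 1. unstemmed term
        'Z' + stem,           # 2. stemmed term
        prefix + word,        # 3. unstemmed term in the field
        'Z' + prefix + stem,  # 4. stemmed term in the field
    ]


print(term_forms('Happiness', 'happi', 'summary'))
# ['happiness', 'Zhappi', 'XSUMMARYhappiness', 'ZXSUMMARYhappi']
```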

Correspondence

In Haystack, fields have different types (e.g. text, float), which can hold different Python types. XH uses this information about the field to map its content to a Xapian index. Specifically it uses:

  • type:
    • text
    • integer
    • float
    • datetime
    • date
  • field_name (name the user gave to the field).
  • multi_valued (True for MultiValued fields).

With the above notation, we can now describe how XH creates the index:

  • text: indexed by splitting the string into terms and indexing each term using the 4 items described above. During this process XH tells Xapian to store the positional information of each term in the document. This way, users can use __exact with full sentences.

  • datetime: indexed by converting it to two terms (%Y-%m-%d and %H:%M:%S) and storing them using items 1. and 3.

  • date: indexed by converting it to a term (%Y-%m-%d) and storing it using items 1. and 3.

  • integer: same as above with %d.

  • float: same as above with %f.

  • multivalued: these fields are always of type text, so the recipe for text is applied to each value of the field.
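
The recipes for the non-text types can be sketched as follows. This is a minimal illustration, not XH's implementation: value_to_terms and index_entries are hypothetical names, and the time format is assumed to be %H:%M:%S, which matches the 12:04:02 term in the example further down this page.

```python
from datetime import date, datetime


def value_to_terms(value):
    """Convert a non-text value into its term strings (hypothetical helper)."""
    # datetime must be tested before date, since datetime subclasses date.
    if isinstance(value, datetime):
        return [value.strftime('%Y-%m-%d'), value.strftime('%H:%M:%S')]
    if isinstance(value, date):
        return [value.strftime('%Y-%m-%d')]
    if isinstance(value, int):
        return ['%d' % value]
    if isinstance(value, float):
        return ['%f' % value]
    raise TypeError('unsupported type: %r' % type(value))


def index_entries(value, field_name):
    """Store each term using item 1 (bare) and item 3 (field-prefixed)."""
    prefix = 'X' + field_name.upper()
    entries = []
    for term in value_to_terms(value):
        entries.append(term)            # item 1: unstemmed term
        entries.append(prefix + term)   # item 3: unstemmed term in the field
    return entries


print(index_entries(datetime(2010, 2, 2, 12, 4, 2), 'pub_date'))
# ['2010-02-02', 'XPUB_DATE2010-02-02', '12:04:02', 'XPUB_DATE12:04:02']
```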

Besides the fields on the SearchIndex, XH has 3 "private" fields:

  • id: a unique identifier of the document:
    • Indexed as: Q<app_name>.<model_name>.<instance_pk>
    • Example: Qmyapp.book.1
  • django_ct: the identifier of the django model
    • Indexed as: CONTENTTYPE<app_name>.<model_name>
    • Example: CONTENTTYPEmyapp.book
  • django_id: the unique identifier of the django instance
    • Indexed as: QQ<instance_pk>
    • Example: QQ1

These are used for searches of the form .filter(django_id=2) and .models('book'); XH automatically constructs the search with the correct prefix.
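
As a rough sketch, the three private terms can be built from the app name, model name and primary key. The function name private_terms is made up for this illustration and is not part of XH.

```python
def private_terms(app_name, model_name, instance_pk):
    """Build the three "private" terms for one instance (hypothetical helper)."""
    return {
        'id': 'Q%s.%s.%s' % (app_name, model_name, instance_pk),
        'django_ct': 'CONTENTTYPE%s.%s' % (app_name, model_name),
        'django_id': 'QQ%s' % instance_pk,
    }


print(private_terms('myapp', 'book', 1))
# {'id': 'Qmyapp.book.1', 'django_ct': 'CONTENTTYPEmyapp.book', 'django_id': 'QQ1'}
```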

Example

As an example, let's say we have the following SearchIndex:

from haystack import indexes

class DocumentIndex(indexes.SearchIndex):
    title = indexes.CharField()
    text = indexes.CharField(document=True, use_template=True)
    # DateTimeField, since the instance below stores a full datetime
    pub_date = indexes.DateTimeField()

and let's say we have one instance with

  • title='this is happiness'
  • text='this is a text'
  • pub_date=datetime(2010, 2, 2, 12, 4, 2)

The Django model is Document in the app library, and the instance's primary key is 1.

After indexing, the terms in the Xapian document corresponding to this instance are:

this Zthis is Zis a Za text Ztext happiness Zhappi 2010-02-02 12:04:02 [repeat with prefix field_name] Qlibrary.document.1 QQ1 CONTENTTYPElibrary.document

Other technicalities

  • Fields of type text that have more than one term (e.g. 'this car') get a term '^' added at the beginning and a term '$' added at the end, so they are effectively indexed as the string '^ this car $'. This makes __exact queries work as expected: summary__exact='this car' matches 'this car' but not 'in this car'. More information can be found in the wiki entry on queries.

  • Fields with a single term (e.g. 'car') are indexed as ^<term>$ (e.g. '^car$'). This allows exact matches on fields with only one term (e.g. integers).
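
The effect of the boundary markers can be illustrated with a small sketch. add_boundaries is a hypothetical helper, the input is assumed to be non-empty, and a plain substring test stands in for Xapian's positional phrase matching.

```python
def add_boundaries(text):
    """Wrap a field's content in '^' and '$' markers (hypothetical helper)."""
    terms = text.split()
    if len(terms) == 1:
        # Single term: the markers are glued to the term itself.
        return '^%s$' % terms[0]
    # Several terms: the markers become separate terms around the phrase.
    return '^ %s $' % ' '.join(terms)


print(add_boundaries('this car'))  # ^ this car $
print(add_boundaries('car'))       # ^car$

# Without markers, 'this car' is also found inside 'in this car'.
print('this car' in 'in this car')                                  # True
# With markers, the exact query no longer matches the longer field.
print(add_boundaries('this car') in add_boundaries('in this car'))  # False
```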

Values and Slots

Besides terms, a document also has "slots" that can hold "values". Values are not indexed and are used for e.g. sorting results or faceting.

XH stores every field except MultiValued fields as a value. Xapian sorts values character by character (as strings), which means that XH has to serialize the fields other than text into values in such a way that their sort order is preserved.

  • integer: serialized as '%012d'
  • float: uses the Xapian serializer (which preserves sortability)
  • date: '%Y%m%d000000'
  • datetime: '%Y%m%d%H%M%S'
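
The sketch below shows why these formats keep values sortable as strings. It is an illustration under assumptions, not XH's code: serialize_value is a hypothetical helper, the date format is read as %Y%m%d followed by six zeros, only non-negative integers are considered, and floats are left out because they rely on Xapian's own serializer.

```python
from datetime import date, datetime


def serialize_value(value):
    """Serialize a value so that string order matches value order (sketch)."""
    # datetime must be tested before date, since datetime subclasses date.
    if isinstance(value, datetime):
        return value.strftime('%Y%m%d%H%M%S')
    if isinstance(value, date):
        # Assumption: a date is padded with six zeros in place of the time.
        return value.strftime('%Y%m%d') + '000000'
    if isinstance(value, int):
        # Zero padding keeps non-negative integers in numeric order.
        return '%012d' % value
    raise TypeError('unsupported type: %r' % type(value))


numbers = [100, 2, 10]
print(sorted(str(n) for n in numbers))              # ['10', '100', '2']
print(sorted(serialize_value(n) for n in numbers))
# ['000000000002', '000000000010', '000000000100']

print(serialize_value(date(2010, 2, 2)))                 # 20100202000000
print(serialize_value(datetime(2010, 2, 2, 12, 4, 2)))   # 20100202120402
```

Because a date is padded to the same width as a datetime, the two kinds of values compare consistently when sorted as strings.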