Correspondence between Haystack fields and Xapian index
This page is intended for developers who want to know more about the internals of Xapian-Haystack (XH).
In Xapian, a document is a set of terms, optionally stored together with the positions (integers) at which each term occurs in the document. Xapian therefore has no concept equivalent to a Haystack field, so XH must define a way to map the different Haystack fields in a SearchIndex into Xapian.
The objective of this page is to explain how this is done.
XH uses 4 different ways to index a term:
1. unstemmed term: a single word
   - Indexed as: <unstemmed term>
   - Example: Happiness
2. stemmed term: the stem of the term
   - Indexed as: Z<stemmed term>
   - Example: Zhappi (stem of Happiness)
3. unstemmed term in the field
   - Indexed as: X<field_name><unstemmed term>, where field_name is the field name in uppercase
   - Example: XSUMMARYhappiness
4. stemmed term in the field
   - Indexed as: ZX<field_name><stemmed term>, where field_name is the field name in uppercase
   - Example: ZXSUMMARYhappi
Items 1 and 2 are intended for searches on the whole document (using the keyword content on .filter()), and items 3 and 4 are intended for searches on a specific field (using the appropriate keyword on .filter()).
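The four forms above can be sketched in plain Python. This is only an illustration and not XH's own code: term_variants is a hypothetical helper, and the stem is passed in by hand because the real stemming is done by Xapian.

```python
def term_variants(word, stem, field_name):
    """Return the four index terms for one word of one field.

    `stem` is supplied by the caller; this sketch does not stem anything.
    """
    prefix = 'X' + field_name.upper()
    return [
        word,                 # 1. unstemmed term
        'Z' + stem,           # 2. stemmed term
        prefix + word,        # 3. unstemmed term in the field
        'Z' + prefix + stem,  # 4. stemmed term in the field
    ]


variants = term_variants('happiness', 'happi', 'summary')
print(variants)
```

The first two entries serve whole-document searches and the last two serve searches restricted to the summary field.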
In Haystack, fields have different types (e.g. text, float), which can hold different Python types. XH uses this information about the field to map its content to a Xapian index. Specifically, it uses:
- type: text, integer, float, datetime, date.
- field_name (the name the user gave to the field).
- multi_valued (True for MultiValued fields).
With the above notation, we can now describe how XH creates the index:
- text: indexed by splitting the string into terms and indexing each term using the 4 items described above. During this process, XH tells Xapian to store the positional information of each term in the document. This way, users can use the __exact lookup with full sentences.
- datetime: indexed by converting it to two terms (%Y-%m-%d %H:%M:%S) and storing them using items 1 and 3.
- date: indexed by converting it to a term (%Y-%m-%d) and storing it using items 1 and 3.
- integer: same as above with %d.
- float: same as above with %f.
- multivalued: these are always of type text, so the recipe for text is applied to each value of the multivalued field.
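As a rough sketch of the conversions listed above, the function below turns a field value into its unprefixed terms. The name value_to_terms is hypothetical, and splitting text on whitespace is a simplifying assumption; XH relies on Xapian for the real term splitting and stemming.

```python
from datetime import date, datetime


def value_to_terms(field_type, value):
    """Convert a field value into the list of terms that would be indexed."""
    if field_type == 'text':
        # Assumption: whitespace splitting stands in for Xapian's term generator.
        return value.split()
    if field_type == 'datetime':
        # Two terms: one for the date part, one for the time part.
        return [value.strftime('%Y-%m-%d'), value.strftime('%H:%M:%S')]
    if field_type == 'date':
        return [value.strftime('%Y-%m-%d')]
    if field_type == 'integer':
        return ['%d' % value]
    if field_type == 'float':
        return ['%f' % value]
    raise ValueError('unknown field type: %r' % field_type)


print(value_to_terms('datetime', datetime(2010, 2, 2, 12, 4, 2)))
print(value_to_terms('float', 1.5))
```

Each returned term would then be stored using item 1 (bare term) and item 3 (term with the field prefix).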
Besides the fields on the SearchIndex, XH has 3 "private" fields:
- id: a unique identifier of the document
  - Indexed as: Q<app_name>.<model_name>.<instance_pk>
  - Example: Qmyapp.book.1
- django_ct: the identifier of the Django model
  - Indexed as: CONTENTTYPE<app_name>.<model_name>
  - Example: CONTENTTYPEmyapp.book
- django_id: the unique identifier of the Django instance
  - Indexed as: QQ<instance_pk>
  - Example: QQ1
These are used for searches of the form .filter(django_id=2) and .models('book'); XH automatically constructs the search with the correct prefix.
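The three private terms can be built with simple string formatting. The helper below is a hypothetical sketch of the naming scheme described above, not a function from XH.

```python
def private_terms(app_name, model_name, instance_pk):
    """Build the three private terms for one model instance."""
    content_type = '%s.%s' % (app_name, model_name)
    return {
        'id': 'Q%s.%s' % (content_type, instance_pk),
        'django_ct': 'CONTENTTYPE' + content_type,
        'django_id': 'QQ%s' % instance_pk,
    }


terms = private_terms('myapp', 'book', 1)
print(terms)
```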
As an example, let's say we have the following SearchIndex:
    from haystack import indexes

    class DocumentIndex(indexes.SearchIndex):
        title = indexes.CharField()
        text = indexes.CharField(document=True, use_template=True)
        pub_date = indexes.DateField()
and let's say we have one instance with
    title='this is happiness'
    text='this is a text'
    pub_date=datetime(2010, 2, 2, 12, 4, 2)
where the Django model is Document in the app library and the instance's primary key is 1.
After indexing, the terms in the Xapian document corresponding to this instance are:
    this Zthis is Zis a Za text Ztext happiness Zhappi 2010-02-02 12:04:02
    [repeat with prefix field_name]
    Qlibrary.document.1 QQ1 CONTENTTYPElibrary.document
- Fields of type text that have more than one term (e.g. 'this car') have a term '^' added at the beginning and a term '$' added at the end, and are thus effectively indexed as the string '^ this car $'. This is intended to make __exact queries work as expected: summary__exact='this car' matches 'this car' but not 'in this car'. More information can be found in the wiki entry on queries.
- Fields with a single term (e.g. 'car') are indexed as ^<term>$ (e.g. '^car$'). This is intended to allow exact matches on fields with only one term (e.g. integers).
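To see why the boundary markers matter, the sketch below models a phrase search as a contiguous-sublist check over term lists. Both helpers are hypothetical and only imitate the behaviour described above; in XH the phrase matching itself is done by Xapian using the stored positions.

```python
def with_boundaries(text):
    """Return the terms of a text field with the '^' and '$' markers added."""
    terms = text.split()
    if len(terms) == 1:
        # Single-term fields are indexed as one term: ^<term>$
        return ['^' + terms[0] + '$']
    return ['^'] + terms + ['$']


def contains_phrase(tokens, phrase):
    """True if `phrase` occurs as a contiguous run inside `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


query = with_boundaries('this car')

# With markers, only the exact field content matches.
print(contains_phrase(with_boundaries('this car'), query))
print(contains_phrase(with_boundaries('in this car'), query))

# Without markers, the phrase would also match inside a longer field.
print(contains_phrase('in this car'.split(), 'this car'.split()))
```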
Besides terms, a document also has "slots" that can hold "values". Values are not indexed and are used for e.g. sorting results or faceting.
XH stores every field except MultiValued fields as a value. Xapian sorts values character by character (as strings), which means that XH has to serialize the fields other than text into values in such a way that sortability is preserved:
- integer: serialized as '%012d'
- float: uses the Xapian serializer (which preserves sortability)
- date: '%Y%m%d000000'
- datetime: '%Y%m%d%H%M%S'
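The effect of these formats on sorting can be checked with a small sketch. serialize_value is a hypothetical helper; it assumes non-negative integers and reads the date format as the date followed by six zeros. Floats are left out because they go through Xapian's own serializer.

```python
from datetime import date, datetime


def serialize_value(field_type, value):
    """Serialize a value so that string order matches natural order."""
    if field_type == 'integer':
        # Zero-padding to 12 digits; assumes value >= 0.
        return '%012d' % value
    if field_type == 'date':
        # Assumption: the time part is filled with zeros.
        return value.strftime('%Y%m%d') + '000000'
    if field_type == 'datetime':
        return value.strftime('%Y%m%d%H%M%S')
    raise ValueError('unsupported field type: %r' % field_type)


numbers = [300, 5, 40]
# Plain strings sort character by character, which breaks numeric order.
plain = sorted(str(n) for n in numbers)
# Zero-padded strings sort in the same order as the integers themselves.
padded = sorted(serialize_value('integer', n) for n in numbers)
print(plain)
print(padded)
```

Because a date is serialized with the same width as a datetime, a date value sorts as midnight of that day.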