The next few days the release of the new version will be available in the branch "0.7.0" for one month (or more when needed) before it goes to the master branch. It would be really helpful when you can test the integration of the upcoming version in your existing projects during this time and give me back some feedback.
These are the basic changes:
- sources transformed to native ES6 modules
- added middleware layer
- support for hooks
- complete separation of language specific logic (that was the most important issue)
- new language handler (language packs)
- new core consists of a "Document Model" and an "Index Model" (that was the most important improvement)
- implementation of "tags" rewritten
- production ready native ES6 modules (within "dist/module/")
- new bundler, better customization
- improved performance
- smaller filesize
- WebWorker isn't available but will come back at next
When it comes to raw search speed FlexSearch outperforms every single searching library out there and also provides flexible search capabilities like multi-field search, phonetic transformations or partial matching. Depending on the used options it also provides the most memory-efficient index. FlexSearch introduce a new scoring algorithm called "contextual index" based on a pre-scored lexical dictionary architecture which actually performs queries up to 1,000,000 times faster compared to other libraries. FlexSearch also provides you a non-blocking asynchronous processing model as well as web workers to perform any updates or queries on the index in parallel through dedicated balanced threads.
FlexSearch Server is available here: https://github.com/nextapps-de/flexsearch-server.
Installation Guide • API Reference • Custom Builds • Flexsearch Server • Changelog
Supported Platforms:
- Browser
- Node.js
Demos:
Library Comparison:
Get Latest (Stable Release):
Build | File | CDN |
flexsearch.min.js | Download | https://rawcdn.githack.com/nextapps-de/flexsearch/master/dist/flexsearch.min.js |
flexsearch.light.js | Download | https://rawcdn.githack.com/nextapps-de/flexsearch/master/dist/flexsearch.light.js |
flexsearch.compact.js | Download | https://rawcdn.githack.com/nextapps-de/flexsearch/master/dist/flexsearch.compact.js |
flexsearch.custom.js | Custom Build |
All Features:
Feature | flexsearch.min.js | flexsearch.compact.js | flexsearch.light.js |
Presets | ✓ | ✓ | - |
Async Search | ✓ | ✓ | - |
Web-Workers (not available in Node.js) | ✓ | - | - |
Contextual Indexes | ✓ | ✓ | ✓ |
Index Documents (Field-Search) | ✓ | ✓ | - |
Logical Operators | ✓ | ✓ | - |
Where / Find / Tags | ✓ | - | - |
Partial Matching | ✓ | ✓ | ✓ |
Relevance Scoring | ✓ | ✓ | ✓ |
Auto-Balanced Cache by Popularity | ✓ | - | - |
Pagination | ✓ | - | - |
Suggestions | ✓ | - | - |
Phonetic Matching | ✓ | ✓ | - |
Customizable: Matcher, Encoder, Tokenizer, Stemmer, Filter | ✓ | ✓ | ✓ |
File Size (gzip) | 6.6 kb | 4.7 kb | 2.7 kb |
It is also pretty simple to make Custom Builds
Comparison: Benchmark "Gulliver's Travels"
Rank | Library Name | Library Version | Single Phrase (op/s) | Multi Phrase (op/s) | Not Found (op/s) |
1 | FlexSearch * | 0.3.6 | 363757 | 182603 | 1627219 |
2 | Wade | 0.3.3 | 899 | 6098 | 214286 |
3 | JS Search | 1.4.2 | 735 | 8889 | 800000 |
4 | JSii | 1.0 | 551 | 9970 | 75000 |
5 | Lunr.js | 2.3.5 | 355 | 1051 | 25000 |
6 | Elasticlunr.js | 0.9.6 | 327 | 781 | 6667 |
7 | BulkSearch | 0.1.3 | 265 | 535 | 2778 |
8 | bm25 | 0.2 | 71 | 116 | 2065 |
9 | Fuse | 3.3.0 | 0.5 | 0.4 | 0.7 |
* The preset "fast" was used for this test
Run Comparison: Benchmark "Gulliver's Travels"
Note: This feature is actually not enabled by default. Read here how to enable.
FlexSearch introduce a new scoring mechanism called Contextual Search which was invented by Thomas Wilkerling, the author of this library. A Contextual Search incredibly boost up queries to a complete new level but also requires some additional memory (depending on depth). The basic idea of this concept is to limit relevance by its context instead of calculating relevance through the whole (unlimited) distance. In this way contextual search also improves the results of relevance-based queries on a large amount of text data.
The index consists of an in-memory pre-scored dictionary as its base. The biggest complexity of these algorithm occurs during the calculation of intersections. As a consequence each additional term in the query has a significant potential to increase complexity. A contextual map comes into play when the query contains more than 1 term and increase effect for each additional term by cutting down the complexity for the intersection calculations. Instead of an increase, the complexity is lowered for each additional term. The contextual index itself is also based on a pre-scored dictionary and follows a memory-friendly strategy.
Type | Complexity |
Each single term query: | 1 |
Lexical Pre-Scored Dictionary (Solo): | TERM_COUNT * TERM_MATCHES |
Lexical Pre-Scored Dictionary + Context-based Map: | TERM_MATCHES / TERM_COUNT |
The complexity for one single term is always 1.
BulkSearch | FlexSearch | |
Access | Read-Write optimized index | Read-Memory optimized index |
Memory | Large: ~ 1 Mb per 100,000 words | Tiny: ~ 100 Kb per 100,000 words |
Index Type | Bulk of encoded string data divided into chunks |
|
Strength |
|
|
Weaks |
|
|
Pagination | Yes | No |
Wildcards | Yes | No |
Keep in mind that updating existing items or removing items from the index has a significant cost. When existing items of your index needs to be updated/removed continuously then BulkSearch may be a better choice.
Use flexsearch.min.js for production and flexsearch.js for development.
<html>
<head>
<script src="js/flexsearch.min.js"></script>
</head>
...
Use latest from CDN:
<script src="https://cdn.jsdelivr.net/gh/nextapps-de/flexsearch@master/dist/flexsearch.min.js"></script>
Or a specific version:
<script src="https://cdn.jsdelivr.net/gh/nextapps-de/[email protected]/dist/flexsearch.min.js"></script>
AMD:
var FlexSearch = require("./flexsearch.js");
npm install flexsearch
In your code include as follows:
var FlexSearch = require("flexsearch");
Or pass in options when requiring:
var index = require("flexsearch").create({/* options */});
Global methods:
- FlexSearch.create(<options>)
- FlexSearch.registerMatcher({KEY: VALUE})
- FlexSearch.registerEncoder(name, encoder)
- FlexSearch.registerLanguage(lang, {stemmer:{}, filter:[]})
- FlexSearch.encode(name, string)
Index methods:
- Index.add(id, string)
- Index.search(string, <limit>, <callback>)
- Index.search(<options>)
- Index.update(id, string)
- Index.remove(id)
- Index.clear()
- Index.destroy()
- Index.init(<options>)
- Index.info()
- Index.find()
- Index.where()
- Index.addMatcher({KEY: VALUE})
- Index.encode(string)
- Index.export()
- Index.import(string)
Index properties:
FlexSearch.create(<options>)
var index = new FlexSearch();
alternatively you can also use:
var index = FlexSearch.create();
Create a new index and choosing one of the presets:
var index = new FlexSearch("speed");
Create a new index with custom options:
var index = new FlexSearch({
// default values:
encode: "balance",
tokenize: "forward",
threshold: 0,
async: false,
worker: false,
cache: false
});
Create a new index and extend a preset with custom options:
var index = new FlexSearch("memory", {
encode: "balance",
tokenize: "forward",
threshold: 0
});
See all available custom options.
Index.add(id, string)
index.add(10025, "John Doe");
Index.search(string | options, <limit>, <callback>)
index.search("John");
Limit the result:
index.search("John", 10);
Perform queries asynchronously:
index.search("John", function(result){
// array of results
});
Passing a callback always will perform as asynchronous even if the "async" option was not set.
Perform queries asynchronously (Promise-based):
Make sure the option "async" is enabled on this instance to receive promises.
index.search("John").then(function(result){
// array of results
});
Alternatively ES6:
async function search(query){
const result = await index.search(query);
// ...
}
Pass custom options for each query:
index.search({
query: "John",
limit: 1000,
threshold: 5, // >= threshold
depth: 3, // <= depth
callback: function(results){
// ...
}
});
The same from above could also be written as:
index.search("John", {
limit: 1000,
threshold: 5,
depth: 3
}, function(results){
// ....
});
See all available custom search options.
FlexSearch is providing a cursor-based pagination which has the ability to inject into the most-inner process. This enables the possibility of many performance improvements.
The cursor implementation may be changed often. Just take the cursor as it is and do not expect any specific value or format.
To enable pagination you have to pass a page field within the custom search object (optionally also a limit as maximum items per page).
Get the first page of results:
var response = index.search("John Doe", {
limit: 5,
page: true
});
Always when passing a page within custom search the response have this format:
{
"page": "xxx:xxx",
"next": "xxx:xxx",
"result": []
}
- page is the pointer to the current page
- next is the pointer to the next page or null when no pages are left
- result yields the searching results
Get the second (next) page of results:
index.search("John Doe", {
limit: 10,
page: response.next
});
The limit can be modified for each query.
Get also suggestions for a query:
index.search({
query: "John Doe",
suggest: true
});
When suggestion is enabled all results will be filled up (until limit, default 1000) with similar matches ordered by relevance.
Actually phonetic suggestions are not supported, for that purpose use the encoder and tokenizer which provides similar functionality. Suggestions comes into game when a query has multiple words/phrases. Assume a query contains 3 words. When the index just match 2 of 3 words then normally you will get no results, but with suggestion enabled you will also get results when 2 of 3 words was matched as well 1 of 3 words was matched (depends on the limit), also sorted by relevance.
Note: Is is planned to improve this feature and providing more flexibility.
Index.update(id, string)
index.update(10025, "Road Runner");
Index.remove(id)
index.remove(10025);
index.clear();
index.destroy();
Index.init(<options>)
Initialize (with same options):
index.init();
Initialize with new options:
index.init({
/* options */
});
Re-initialization will also destroy the old index.
Get the length of an index:
var length = index.length;
Get the index (register) of an instance:
var index = index.index;
The register has the format "@" + id.
Important: Do not modify manually, just use it as read-only.
FlexSearch.registerMatcher({REGEX: REPLACE})
Add global matchers for all instances:
FlexSearch.registerMatcher({
'ä': 'a', // replaces all 'ä' to 'a'
'ó': 'o',
'[ûúù]': 'u' // replaces multiple
});
Add private matchers for a specific instance:
index.addMatcher({
'ä': 'a', // replaces all 'ä' to 'a'
'ó': 'o',
'[ûúù]': 'u' // replaces multiple
});
Assign a custom encoder by passing a function during index creation/initialization:
var index = new FlexSearch({
encode: function(str){
// do something with str ...
return str;
}
});
The encoder function gets a string as a parameter and has to return the modified string.
Call a custom encoder directly:
var encoded = index.encode("sample text");
FlexSearch.registerEncoder(name, encoder)
Global encoders can be shared/used by all instances.
FlexSearch.registerEncoder("whitespace", function(str){
return str.replace(/\s/g, "");
});
Initialize index and assign a global encoder:
var index = new FlexSearch({ encode: "whitespace" });
Call a global encoder directly:
var encoded = FlexSearch.encode("whitespace", "sample text");
FlexSearch.registerEncoder('mixed', function(str){
str = this.encode("icase", str); // built-in
str = this.encode("whitespace", str); // custom
// do something additional with str ...
return str;
});
A tokenizer split words into components or chunks.
Define a private custom tokenizer during creation/initialization:
var index = new FlexSearch({
tokenize: function(str){
return str.split(/\s-\//g);
}
});
The tokenizer function gets a string as a parameter and has to return an array of strings (parts).
Stemmer: several linguistic mutations of the same word (e.g. "run" and "running")
Filter: a blacklist of words to be filtered out from indexing at all (e.g. "and", "to" or "be")
Assign a private custom stemmer or filter during creation/initialization:
var index = new FlexSearch({
stemmer: {
// object {key: replacement}
"ational": "ate",
"tional": "tion",
"enci": "ence",
"ing": ""
},
filter: [
// array blacklist
"in",
"into",
"is",
"isn't",
"it",
"it's"
]
});
Using a custom stemmer, e.g.:
var index = new FlexSearch({
stemmer: function(value){
// apply some replacements
// ...
return value;
}
});
Using a custom filter, e.g.:
var index = new FlexSearch({
filter: function(value){
// just add values with length > 1 to the index
return value.length > 1;
}
});
Or assign stemmer/filters globally to a language:
Stemmer are passed as a object (key-value-pair), filter as an array.
FlexSearch.registerLanguage("us", {
stemmer: { /* ... */ },
filter: [ /* ... */ ]
});
Or use some pre-defined stemmer or filter of your preferred languages:
<html>
<head>
<script src="js/flexsearch.min.js"></script>
<script src="js/lang/en.min.js"></script>
<script src="js/lang/de.min.js"></script>
</head>
...
Now you can assign built-in stemmer during creation/initialization:
var index_en = new FlexSearch({
stemmer: "en",
filter: "en"
});
var index_de = new FlexSearch({
stemmer: "de",
filter: [ /* custom */ ]
});
In Node.js you just have to require the language pack files to make them available:
require("flexsearch.js");
require("lang/en.js");
require("lang/de.js");
It is also possible to compile language packs into the build as follows:
node compile SUPPORT_LANG_EN=true SUPPORT_LANG_DE=true
Set the tokenizer at least to "reverse" or "full" when using RTL.
Just set the field "rtl" to true and use a compatible tokenizer:
var index = FlexSearch.create({
encode: "icase",
tokenize: "reverse",
rtl: true
});
Set a custom tokenizer which fits your needs, e.g.:
var index = FlexSearch.create({
encode: false,
tokenize: function(str){
return str.replace(/[\x00-\x7F]/g, "").split("");
}
});
You can also pass a custom encoder function to apply some linguistic transformations.
index.add(0, "一个单词");
var results = index.search("单词");
This feature is available in DEBUG mode.
index.info();
Returns information e.g.:
{
"id": 0,
"memory": 10000,
"items": 500,
"sequences": 3000,
"matchers": 0,
"chars": 3500,
"cache": false,
"matcher": 0,
"worker": false,
"threshold": 7,
"depth": 3,
"contextual": true
}
Assume the document is an array of data like:
var docs = [{
id: 0,
title: "Title",
cat: "Category",
content: "Body"
},{
id: 1,
title: "Title",
cat: "Category",
content: "Body"
}];
Provide a document descriptor doc when initializing a new index, e.g. related to the example above:
var index = new FlexSearch({
tokenize: "strict",
depth: 3,
doc: {
id: "id",
field: "content"
}
});
The above example will just index the field "content", to index multiple fields pass an array:
var index = new FlexSearch({
doc: {
id: "id",
field: [
"title",
"cat",
"content"
]
}
});
You are also able to provide custom presets for each field separately:
var index = new FlexSearch({
doc: {
id: "id",
field: {
title: {
encode: "extra",
tokenize: "reverse",
threshold: 7
},
cat: {
encode: false,
tokenize: function(val){
return [val];
}
},
content: "memory"
}
}
});
Assume the document array looks more complex (has nested branches etc.), e.g.:
var docs = [{
data:{
id: 0,
title: "Foo",
body: {
content: "Foobar"
}
}
},{
data:{
id: 1,
title: "Bar",
body: {
content: "Foobar"
}
}
}];
Then use the colon separated notation "root:child:child" to define hierarchy within the document descriptor:
var index = new FlexSearch({
doc: {
id: "data:id",
field: [
"data:title",
"data:body:content"
]
}
});
Hint: This is an alternative for indexing documents which has nested arrays: nextapps-de/flexsearch#36
Just pass the document array (or a single object) to the index:
index.add(docs);
Update index with a single object or an array of objects:
index.update({
data:{
id: 0,
title: "Foo",
body: {
content: "Bar"
}
}
});
Remove a single object or an array of objects from the index:
index.remove(docs);
When the id is known, you can also simply remove by (faster):
index.remove(id);
The search gives you several options when using documents.
The colon notation also has to be applied for the searching respectively.
This will search through all indexed fields:
var results = index.search("body");
This will search on a specific field):
var results = index.search({
field: "title",
query: "foobar"
});
var results = index.search({
field: "data:body:content",
query: "foobar"
});
This could also be written as:
var results = index.search("foobar", {
field: "data:body:content",
});
Search the same query on multiple fields:
Using bool as a logical operator when searching through multiple fields. The default operator when not set is "or".
var results = index.search({
query: "foobar",
field: ["title", "body"],
bool: "or"
});
Could also be written as:
var results = index.search("foobar", {
field: ["title", "body"],
bool: "or"
});
Search different queries on multiple fields:
var results = index.search([{
field: "title",
query: "foo"
},{
field: "body",
query: "bar"
}]);
See all available field-search options.
You can define independently which fields should be indexed and which fields should be stored. This way you can index fields which should not be included in the search result.
When the
store
attribute was set, you have to include all fields which should be stored explicitly (acts like a whitelist).
When the
store
attribute was not set, the original document is stored as a fallback.
var index = new FlexSearch({
doc: {
id: "id",
field: "body:content", // index
store: "title" // store
}
});
Also multiple fields:
var index = new FlexSearch({
doc: {
id: "id",
field: ["title" , "body:content"],
store: ["title" , "body:content"]
}
});
There are 3 different operators (and, or, not). Just pass the field bool in custom search:
var results = index.search([{
field: "title",
query: "foobar",
bool: "and"
},{
field: "body",
query: "content",
bool: "or"
},{
field: "blacklist",
query: "xxx",
bool: "not"
}]);
- "and" indicates to be required in the result
- "or" indicates to be optional in the result
- "not" indicates to be prohibited in the result
When indexing documents, you are also able to get results by specific attributes.
The colon notation also has to be applied for using "where" and "find" respectively.
index.find("cat", "comedy");
Same as:
index.find({"cat": "comedy"});
To get by ID, you can also use short form:
index.find(1);
Getting a doc by ID is actually the fastest way to retrieve a result from documents.
Find by a custom function:
index.find(function(item){
return item.cat === "comedy";
});
Get just the first result:
index.find({
cat: "comedy",
year: "2018"
});
Get all matched results:
index.where({
cat: "comedy",
year: "2018"
});
Get all results and set a limit:
index.where({
cat: "comedy",
year: "2018"
}, 100);
Get all by a custom function:
index.where(function(item){
return item.cat === "comedy";
});
Add some content, e.g.:
index.add([{
id: 0,
title: "Foobar",
cat: "adventure",
content: "Body"
},{
id: 1,
title: "Title",
cat: "comedy",
content: "Foobar"
}]);
Using search and also apply a where-clause:
index.search("foo", {
field: [
"title",
"body"
],
where: {
"cat": "comedy"
},
limit: 10
});
IMPORTANT NOTICE: This feature will be removed due to the lack of scaling and redundancy.
Tagging is pretty much the same like adding an additional index to a database column. Whenever you use index.where() on an indexed/tagged attribute will really improve performance but also at a cost of some additional memory.
The colon notation also has to be applied for tags respectively.
Add one single tag to the index:
var index = new FlexSearch({
doc: {
id: "id",
field: ["title", "content"],
tag: "cat"
}
});
Or add multiple tags to the index:
var index = new FlexSearch({
doc: {
id: "id",
field: ["title", "content"],
tag: ["cat", "year"]
}
});
Add some content:
index.add([{
id: 0,
title: "Foobar",
cat: "adventure",
year: "2018",
content: "Body"
},{
id: 1,
title: "Title",
cat: "comedy",
year: "2018",
content: "Foobar"
}]);
Find all documents by an attribute:
index.where({"cat": "comedy"}, 10);
Since the attribute "cat" was tagged (has its own index) this expression performs really fast. This is actually the fastest way to retrieve multiple results from documents.
Search documents and also apply a where-clause:
index.search("foo", {
field: [
"title",
"content"
],
where: {
"cat": "comedy"
},
limit: 10
});
An additional where-clause has a significant cost. Using the same expression without where performs significantly better (depending on the count of matches).
The colon notation also has to be applied for a custom sort respectively.
Sort by an attribute:
var results = index.search("John", {
limit: 100,
sort: "data:title"
});
The default sorting order is from lowest to highest.
Sort by a custom function:
var results = index.search("John", {
limit: 100,
sort: function(a, b){
return (
a.id < b.id ? -1 : (
a.id > b.id ? 1 : 0
));
}
});
Simply chain methods like:
var index = FlexSearch.create()
.addMatcher({'â': 'a'})
.add(0, 'foo')
.add(1, 'bar');
index.remove(0).update(1, 'foo').add(2, 'foobar');
Create an index and just set the limit of relevance as "depth":
var index = new FlexSearch({
encode: "icase",
tokenize: "strict",
threshold: 7,
depth: 3
});
Only the tokenizer "strict" is actually supported by the contextual index.
The contextual index requires additional amount of memory depending on depth.
Try to use the lowest depth and highest threshold which fits your needs.
It is possible to modify values for threshold and depth during search (see custom search). The restriction is that the threshold can only be raised, on the other hand the depth can only be lowered.
Create index and just set a limit of cache entries:
var index = new FlexSearch({
profile: "score",
cache: 1000
});
When passing a number as a limit the cache automatically balance stored entries related to their popularity.
When just using "true" the cache is unbounded and perform actually 2-3 times faster (because the balancer do not have to run).
Worker get its own dedicated memory and also run in their own dedicated thread without blocking the UI while processing. Especially for larger indexes, web worker improves speed and available memory a lot. FlexSearch index was tested with a 250 Mb text file including 10 Million words.
When the index isn't big enough it is faster to use no web worker.
Create index and just set the count of parallel threads:
var index = new FlexSearch({
encode: "icase",
tokenize: "full",
async: true,
worker: 4
});
Adding items to worker index as usual (async enabled):
index.add(10025, "John Doe");
Perform search and simply pass in callback like:
index.search("John Doe", function(results){
// do something with array of results
});
Or use promises accordingly:
index.search("John Doe").then(function(results){
// do something with array of results
});
FlexSearch ist highly customizable. Make use of the the right options can really improve your results as well as memory economy and query time.
Option | Values | Description |
profile |
"memory" "speed" "match" "score" "balance" "fast" |
The configuration profile. Choose your preferation. |
tokenize |
"strict" "forward" "reverse" "full" function() |
The indexing mode (tokenizer). Choose one of the built-ins or pass a custom tokenizer function. |
split |
RegExp string |
The rule to split words when using non-custom tokenizer (built-ins e.g. "forward"). Use a string/char or use a regular expression (default: /\W+/ ). |
encode |
false "icase" "simple" "advanced" "extra" "balance" function() |
The encoding type. Choose one of the built-ins or pass a custom encoding function. |
cache |
false true {number} |
Enable/Disable and/or set capacity of cached entries. When passing a number as a limit the cache automatically balance stored entries related to their popularity. Note: When just using "true" the cache has no limits and is actually 2-3 times faster (because the balancer do not have to run). |
async |
true false |
Enable/Disable asynchronous processing. Each job will be queued for non-blocking processing. Recommended when using WebWorkers. |
worker |
false {number} |
Enable/Disable and set count of running worker threads. |
depth |
false {number} |
Enable/Disable contextual indexing and also sets contextual distance of relevance. Depth is the maximum number of words/tokens away a term to be considered as relevant. |
threshold |
false {number} |
Enable/Disable the threshold of minimum relevance all results should have. Note: It is also possible to set a lower threshold for indexing and pass a higher value when calling index.search(options). |
resolution | {number} | Sets the scoring resolution (default: 9). |
stemmer |
false {string} {function} |
Disable or pass in language shorthand flag (ISO-3166) or a custom object. |
filter |
false {string} {function} |
Disable or pass in language shorthand flag (ISO-3166) or a custom array. |
rtl |
true false |
Enables Right-To-Left encoding. |
Option | Values | Description |
limit | number | Sets the limit of results. |
suggest | true, false | Enables suggestions in results. |
where | object | Use a where-clause for non-indexed fields. |
field | string, Array<string> | Sets the document fields which should be searched. When no field is set, all fields will be searched. Custom options per field are also supported. |
bool | "and", "or" | Sets the used logical operator when searching through multiple fields. |
page | true, false, cursor | Enables paginated results. |
You can also override these following index settings via custom search (v0.7.0):
- encode
- split
- tokenize
- threshold
- cache
- async
Custom-Search options will override index options.
Option | Values | Description |
limit | number | Sets the limit of results per field. |
suggest | true, false | Enables suggestions in results per field. |
bool | "and", "or", "not" | Sets the used logical operator when searching through multiple fields. |
boost | number | Enables boosting fields. |
You can also override these following index settings per field via custom field-search:
- encode
- split
- tokenize
- threshold
Field-Search options will override custom-search options and index options.
Whereas depth is the minimum relevance for the contextual index, threshold is the minimum relevance for the lexical index. The threshold score is an enhanced variation of a conventional scoring calculation, it uses on document distance and partial distance instead of TF-IDF. The final scoring value is based on 3 kinds of distance.
Resolution on the other hand specify the max scoring value. The final score value is an integer value, so resolution affect how many segments the score may have. When the resolution is 1, then there exist just one scoring level for all matched terms. To get more differentiated results you need to raise the resolution.
The difference of both affects the performance on higher values (complexity = resolution - threshold).
The combination of resolution and threshold gives you a good controlling of your matches as well as performance, e.g. when the resolution is 25 and the threshold is 22, then the result only contains matches which are super relevant. The goal should always be just have items in result which are really needed. On top, that also improves performance a lot.
Tokenizer affects the required memory also as query time and flexibility of partial matches. Try to choose the most upper of these tokenizer which fits your needs:
Option | Description | Example | Memory Factor (n = length of word) |
"strict" | index whole words | foobar |
* 1 |
"forward" | incrementally index words in forward direction | fo obarfoob ar |
* n |
"reverse" | incrementally index words in both directions | foobar fo obar |
* 2n - 1 |
"full" | index every possible combination | fooba rf oob ar |
* n * (n - 1) |
Encoding affects the required memory also as query time and phonetic matches. Try to choose the most upper of these encoders which fits your needs, or pass in a custom encoder:
Option | Description | False-Positives | Compression |
false | Turn off encoding | no | no |
"icase" (default) | Case in-sensitive encoding | no | no |
"simple" | Phonetic normalizations | no | ~ 7% |
"advanced" | Phonetic normalizations + Literal transformations | no | ~ 35% |
"extra" | Phonetic normalizations + Soundex transformations | yes | ~ 60% |
function() | Pass custom encoding via function(string):string |
Reference String: "Björn-Phillipp Mayer"
Query | icase | simple | advanced | extra |
björn | yes | yes | yes | yes |
björ | yes | yes | yes | yes |
bjorn | no | yes | yes | yes |
bjoern | no | no | yes | yes |
philipp | no | no | yes | yes |
filip | no | no | yes | yes |
björnphillip | no | yes | yes | yes |
meier | no | no | yes | yes |
björn meier | no | no | yes | yes |
meier fhilip | no | no | yes | yes |
byorn mair | no | no | no | yes |
(false positives) | no | no | no | yes |
The required memory for the index depends on several options:
Encoding | Memory usage of every ~ 100,000 indexed word |
false | 260 kb |
"icase" (default) | 210 kb |
"simple" | 190 kb |
"advanced" | 150 kb |
"extra" | 90 kb |
Mode | Multiplied with: (n = average length of indexed words) |
"strict" | * 1 |
"forward" | * n |
"reverse" | * 2n - 1 |
"full" | * n * (n - 1) |
Contextual Index | Multiply the sum above with: |
* (depth * 2 + 1) |
Adding, removing or updating existing items has a similar complexity. The contextual index grows exponentially, that's why it is actually just supported for the tokenizer "strict".
The book "Gulliver's Travels" (Swift Jonathan 1726) was used for this test.
You can pass a preset during creation/initialization. They represents these following settings:
"default": Standard profile
{
encode: "icase",
tokenize: "forward",
resolution: 9
}
"memory": Memory-optimized profile
{
encode: "extra",
tokenize: "strict",
threshold: 0,
resolution: 1
}
"speed": Speed-optimized profile
{
encode: "icase",
tokenize: "strict",
threshold: 1,
resolution: 3,
depth: 2
}
"match": Matching-tolerant profile
{
encode: "extra",
tokenize: "full",
threshold: 1,
resolution: 3
}
"score": Relevance-optimized profile
{
encode: "extra",
tokenize: "strict",
threshold: 1,
resolution: 9,
depth: 4
}
"balance": Most-balanced profile
{
encode: "balance",
tokenize: "strict",
threshold: 0,
resolution: 3,
depth: 3
}
"fast": Absolute fastest profile
{
encode: "icase",
threshold: 8,
resolution: 9,
depth: 1
}
Compare these presets:
Methods to retrieve results sorted from fastest to slowest:
index.find(id) -> doc
index.where({field: string}) -> Arrary<doc>
with a tag on the same fieldindex.search(query) -> Arrary<id>
when just adding id and content to the index (no documents)index.search(query) -> Arrary<doc>
when using documentsindex.search(query, { where }) -> Arrary<doc>
when using documents and a where clauseindex.where({field: [string, string]}) -> Arrary<doc>
when a tag was set to one of two fieldsindex.where({field: string}) -> Arrary<doc>
when no tag was set to this field
Methods to change index from fastest to slowest:
index.add(id, string)
index.add(docs)
index.delete(id, string)
index.delete(docs)
index.update(id, string)
index.update(docs)
Performance Checklist:
- Using just id-content-pairs for the index performs almost faster than using docs
- An additional where-clause in
index.search()
has a significant cost - When adding multiple fields of documents to the index try to set the lowest possible preset for each field separately
- Make sure the auto-balanced cache is enabled and has a meaningful value
- Using
index.where()
to find documents is very slow when not using a tagged field - Getting a document by ID via
index.find(id)
is extremely fast - Do not enable async as well as worker when the index does not claim it
- Use numeric IDs (the datatype length of IDs influences the memory consumption significantly)
- Try to enable contextual index by setting the depth to a minimum meaningful value and tokenizer to "strict"
- Pass a limit when searching (lower values performs better)
- Pass a minimum threshold when searching (higher values performs better)
- Try to minify the content size of indexed documents by just adding attributes you really need to get back from results
Whenever you can, try to divide content by categories and add them to its own index, e.g.:
var action = new FlexSearch();
var adventure = new FlexSearch();
var comedy = new FlexSearch();
This way you can also provide different settings for each category. This is actually the fastest way to perform a fuzzy search.
To make this workaround more extendable you can use a short helper:
var index = {};
function add(id, cat, content){
(index[cat] || (
index[cat] = new FlexSearch
)).add(id, content);
}
function search(cat, query){
return index[cat] ?
index[cat].search(query) : [];
}
Add content to the index:
add(1, "action", "Movie Title");
add(2, "adventure", "Movie Title");
add(3, "comedy", "Movie Title");
Perform queries:
var results = search("action", "movie title"); // --> [1]
Split indexes by categories improves performance significantly.
It is recommended to use numeric id values as reference when adding content to the index. The byte length of passed ids influences the memory consumption significantly. If this is not possible you should consider to use a index table and map the ids with indexes, this becomes important especially when using contextual indexes on a large amount of content.
index.export() returns a serialized dump as a string.
index.import(string) takes a serialized dump as a string and load it to the index.
Assuming you have one or several indexes:
var feeds_2017 = new FlexSearch();
var feeds_2018 = new FlexSearch();
var feeds_2019 = new FlexSearch();
Export indexes, e.g. to the local storage:
localStorage.setItem("feeds_2017", feeds_2017.export());
localStorage.setItem("feeds_2018", feeds_2018.export());
localStorage.setItem("feeds_2019", feeds_2019.export());
Import indexes, e.g. from the local storage:
feeds_2017.import(localStorage.getItem("feeds_2017"));
feeds_2018.import(localStorage.getItem("feeds_2018"));
feeds_2019.import(localStorage.getItem("feeds_2019"));
Pass a config flag to control serialization:
var data = index.export({serialize: false});
Use the same config accordingly when importing data:
index.import(data, {serialize: false});
Pass a config flag to separate index and documents from the export:
var idx = index.export({index: true, doc: false});
var docs = index.export({index: false, doc: true});
Use the same config accordingly when importing data:
index.import(idx, {index: true, doc: false});
index.import(docs, {index: false, doc: true});
You can also pass documents as external data, assume data
is an array of documents:
index.import(idx, {index: true, doc: data});
Do not use DEBUG in production builds.
If you get issues, you can temporary set the DEBUG flag to true on top of flexsearch.js:
DEBUG = true;
This enables console logging of several processes. Just open the browsers console to make this information visible.
Do not use PROFILER in production builds.
To collect some performance statistics of your indexes you need to temporary set the PROFILER flag to true on top of flexsearch.js:
PROFILER = true;
This enables profiling of several processes.
An array of all profiles is available on:
window.stats;
You can also just open the browsers console and enter this line to get stats.
The index of the array corresponds to the index.id.
Get stats from a specific index:
index.stats;
The returning stats payload is divided into several categories. Each of these category provides its own statistic values.
Property | Description |
time | The sum of time (ms) the process takes (lower is better) |
count | How often the process was called |
ops | Average operations per seconds (higher is better) |
nano | Average cost (ns) per operation/call (lower is better) |
Full Build:
npm run build
Compact Build:
npm run build-compact
Light Build:
npm run build-light
Build Language Packs:
npm run build-lang
Custom Build:
npm run build-custom SUPPORT_WORKER=true SUPPORT_ASYNC=true
On custom builds each build flag will be set to false by default.
Alternatively you can also use:
node compile SUPPORT_WORKER=true
The custom build will be saved to flexsearch.custom.xxxxx.js (the "xxxxx" is a hash based on the used build flags).
Flag | Values |
DEBUG | true, false |
PROFILER | true, false |
SUPPORT_ENCODER | true, false |
SUPPORT_DOCUMENT | true, false |
SUPPORT_WHERE | true, false |
SUPPORT_WORKER | true, false |
SUPPORT_CACHE | true, false |
SUPPORT_ASYNC | true, false |
SUPPORT_PRESET | true, false |
SUPPORT_INFO | true, false |
SUPPORT_SERIALIZE | true, false |
SUPPORT_SUGGESTION | true, false |
SUPPORT_PAGINATION | true, false |
SUPPORT_OPERATOR | true, false |
SUPPORT_CALLBACK | true, false |
Language Packs |
|
SUPPORT_LANG_EN | true, false |
SUPPORT_LANG_DE | true, false |
Compiler Flags |
|
LANGUAGE_OUT |
ECMASCRIPT3 ECMASCRIPT5 ECMASCRIPT5_STRICT ECMASCRIPT6 ECMASCRIPT6_STRICT ECMASCRIPT_2015 ECMASCRIPT_2017 STABLE |
Feel free to contribute to this project and also feel free to contact me (https://github.com/ts-thomas) when you have any questions.
Copyright 2019 Nextapps GmbH
Released under the Apache 2.0 License