README-dev.org

TODO: https://doc.rust-lang.org/rustdoc/

Vision

A network of small crawl, archive, and index nodes. One node can consist of multiple machines but is administered by one entity. Nodes crawl domains that the node administrators trust:

  • to contain relevant content
  • to be easily crawlable
    • require no execution of Javascript
    • do not try to trick index ranking
    • do not intentionally mislead crawlers

Nodes serve search queries from their local index, optionally forward them to other nodes, and present the results of other nodes interleaved with local results. Trust in and ranking of remote peers is continuously adapted.
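
A minimal sketch of how this interleaving could look, assuming a per-peer trust score in [0, 1]; the types, field names, and weighting are illustrative only, not a recorded design.

#+begin_src rust
// Illustrative types; names and weighting are assumptions for this sketch.
struct SearchResult {
    url: String,
    score: f64, // relevance as reported by the node that produced the result
}

/// Interleave local results with results from remote peers by discounting
/// remote scores with the trust currently assigned to each peer and sorting
/// everything into one list.
fn merge_results(
    local: Vec<SearchResult>,
    remote: Vec<(f64, Vec<SearchResult>)>, // (peer trust in [0, 1], peer results)
) -> Vec<SearchResult> {
    let mut merged = local;
    for (trust, results) in remote {
        for mut r in results {
            r.score *= trust; // discount by peer trust
            merged.push(r);
        }
    }
    merged.sort_by(|a, b| b.score.total_cmp(&a.score));
    merged
}
#+end_src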

Node discovery is done by gossip or metadata from crawled web pages.

Nodes also serve as a distributed internet archive for the crawled pages.

Long term vision

TODO: Elaborate

  • Wikipedia is not trustworthy
  • MediaWiki is actually a restriction of the original idea of the web: it centralizes moderation of content. On the web, everybody can host a web page somewhere and declare that its content is relevant for one or more keywords.
  • It is then the job of a search engine to propose a set of relevant web pages for given keywords.
  • Web pages should provide standardized APIs for sending patches to the content of the page and for viewing the page history. ATOM could help.
  • The actual hosting of web pages should also be distributed, e.g. IPFS to help with archiving and censorship resistance.
  • Naming: check pet names and the GNU Name System
  • All of this requires different browsers

Architectural Decision Log (ADL)

Programming Language: Rust

A modern programming language can considerably enhance the productivity of a small team of skilled programmers. This can enable projects that would otherwise require an order of magnitude more not-so-skilled programmers working with “easier” languages like Go or Python.

Haskell would have been the first choice.

However, Rust seems to have more libraries (indexing!) and is about as modern as Haskell.

Also, the long-term vision requires changes in browsers. There is a new browser project in Rust (Servo).

Database: Postgres

Installations might span multiple machines but are not expected to grow beyond what one Postgres installation can handle. Postgres itself can be scaled.

SQL can replace a lot of complex client side code.

Postgres should host neither the index nor the archive.

No async

No async Rust code is used. Async has no support for drop (there are no asynchronous destructors). (TODO: elaborate)
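
A minimal sketch of what the resulting style looks like: plain blocking I/O plus OS threads instead of an async runtime. The use of the reqwest crate (with its blocking API) is an assumption for illustration, not a decision recorded here.

#+begin_src rust
use std::thread;

// Assumes the reqwest crate with the "blocking" feature enabled.
fn fetch(url: &str) -> Result<String, reqwest::Error> {
    // Ordinary blocking call: no executor, no Pin, normal Drop semantics.
    reqwest::blocking::get(url)?.text()
}

fn main() {
    let urls = ["https://example.org/", "https://example.com/"];
    let handles: Vec<_> = urls
        .into_iter()
        .map(|url| thread::spawn(move || fetch(url).map(|body| body.len())))
        .collect();
    for handle in handles {
        println!("{:?}", handle.join().unwrap());
    }
}
#+end_src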

Partition crawlers by domain

This is still experimental but looks stable (a sketch follows below this list):

  • easy to respect politeness
  • good compression of warc files
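
A minimal sketch of the partitioning idea, assuming a fixed number of crawler workers and hashing on the host; registrable-domain extraction (e.g. via a public suffix list) is left out.

#+begin_src rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Map a URL's host to one of `n_crawlers` partitions so that a given domain
/// is always handled by the same crawler: politeness delays stay local to one
/// worker, and the resulting WARC files contain similar pages, which helps
/// compression.
fn crawler_for(host: &str, n_crawlers: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    host.hash(&mut hasher);
    hasher.finish() % n_crawlers
}

fn main() {
    for host in ["example.org", "blog.example.org", "example.com"] {
        println!("{host} -> crawler {}", crawler_for(host, 4));
    }
}
#+end_src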

Other Search engine projects

There are even more projects, e.g.:

The search projects below are all open source, have their own crawler, and are developed in English.

Notable other projects:

  • Mojeek: not open source but big index
  • SeSe Engine: large, open source, but Chinese source code
  • clew: has not yet released its code?
| Project        | year  | lang   |
|----------------+-------+--------|
| indi.-search ✝ |       |        |
| marginalia     |       |        |
| mwmbl          |       |        |
| PeARS          | <2016 | Python |
| phinde         |       | PHP    |
| searchmysite   |       |        |
| stract         | ~2023 | rust   |
| unobtanium     |       |        |
| Wiby           |       | PHP    |
| YaCy           | ~2003 | Java   |

indieweb-search

indieweb-search @capjamesg (archived)

marginalia

mwmbl

mwmbl @mwmbl @daoudclarke

PeARS

Contact: https://aurelieherbelot.net

  • Conference Paper, April 2016: PeARS: a Peer-to-peer Agent for Reciprocated Search (doi) (pdf)

phinde

  • phinde @cweiske (only own domain + linked)

searchmysite

stract

unobtanium

Wiby

YaCy

Links, Ideas, Stuff

crawlers

indexing

ranking

p2p search

peer trust

  • Check the search results returned by peers by asynchronously downloading the URL and confirming the presence of the search terms in the result. Adapt trust accordingly. (via YaCy paper; a sketch follows below)
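
A minimal sketch of that check, again assuming the reqwest crate (blocking API) and a simple exponential moving average for the trust score; all names and the smoothing factor are illustrative.

#+begin_src rust
// Verify a peer-supplied result by fetching the page and checking that all
// query terms actually occur; error handling is simplified for the sketch.
fn result_matches(url: &str, terms: &[&str]) -> bool {
    match reqwest::blocking::get(url).and_then(|response| response.text()) {
        Ok(body) => {
            let body = body.to_lowercase();
            terms.iter().all(|term| body.contains(&term.to_lowercase()))
        }
        Err(_) => false, // an unreachable page counts as a failed check
    }
}

/// Exponential moving average over verification outcomes; trust stays in [0, 1].
fn update_trust(trust: f64, verified: bool, alpha: f64) -> f64 {
    let outcome = if verified { 1.0 } else { 0.0 };
    (1.0 - alpha) * trust + alpha * outcome
}
#+end_src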

duplicate detection

database schema

crawl job

  • get job with lease
  • update lease (can be done by the work manager by observing the URL frontier of the crawl job?); a sketch follows below
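
A minimal sketch of the lease, assuming a hypothetical crawl_job table (job_id, domain, lease_until) and the postgres crate; the schema, column names, and connection parameters are made up for illustration.

#+begin_src rust
use postgres::{Client, NoTls};

/// "get job with lease": atomically pick an unleased (or expired) job and
/// extend its lease, using SELECT ... FOR UPDATE SKIP LOCKED so concurrent
/// workers do not grab the same row.
fn claim_job(
    client: &mut Client,
    lease_minutes: i32,
) -> Result<Option<(i64, String)>, postgres::Error> {
    let row = client.query_opt(
        "UPDATE crawl_job
            SET lease_until = now() + make_interval(mins => $1)
          WHERE job_id = (
                SELECT job_id FROM crawl_job
                 WHERE lease_until IS NULL OR lease_until < now()
                 ORDER BY job_id
                 FOR UPDATE SKIP LOCKED
                 LIMIT 1)
        RETURNING job_id, domain",
        &[&lease_minutes],
    )?;
    Ok(row.map(|r| (r.get(0), r.get(1))))
}

fn main() -> Result<(), postgres::Error> {
    // Connection parameters are placeholders.
    let mut client = Client::connect("host=localhost user=crawler dbname=crawl", NoTls)?;
    if let Some((job_id, domain)) = claim_job(&mut client, 10)? {
        println!("leased job {job_id} for domain {domain}");
    }
    Ok(())
}
#+end_src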

url frontier

url priority

  • domain-crossing hops instead of external depth
  • track depth in parallel to priority to have an absolute crawl frontier
  • signals:
    • sitemap.xml itself and linked sitemaps have a constant priority of 1
    • priority of the find location (the page where the URL was found)
    • number of URLs on the find location
    • number of query parameters
    • number of path elements
    • outlink context
  • properties (a sketch follows below this list):
    • priority from sitemaps satisfies 0 < p <= 1
    • query parameters are worse than path elements
    • remaining depth goes down
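
A minimal sketch that combines these signals into a single priority; the field names and weights are made up, the only intent is to show the stated properties.

#+begin_src rust
/// Signals collected for a newly found URL; field names are illustrative.
struct UrlSignals {
    sitemap_priority: Option<f64>, // 0 < p <= 1 if the URL came from a sitemap
    find_location_priority: f64,   // priority of the page the URL was found on
    urls_on_find_location: u32,    // how many URLs that page contained
    query_params: u32,
    path_elements: u32,
}

/// Combine the signals: sitemap priorities are capped at 1, query parameters
/// hurt more than path elements, and inheriting from the find location is
/// diluted the more links that page contains.
fn priority(signals: &UrlSignals) -> f64 {
    if let Some(p) = signals.sitemap_priority {
        return p.clamp(f64::MIN_POSITIVE, 1.0);
    }
    let dilution = 1.0 + (signals.urls_on_find_location as f64).ln_1p();
    let inherited = signals.find_location_priority / dilution;
    let penalty = 0.2 * signals.query_params as f64 + 0.05 * signals.path_elements as f64;
    (inherited - penalty).max(0.0)
}

/// Remaining depth only goes down; a domain-crossing hop costs an extra step.
fn child_depth(remaining_depth: u32, crosses_domain: bool) -> u32 {
    remaining_depth.saturating_sub(if crosses_domain { 2 } else { 1 })
}
#+end_src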

url to crawl map

  • Is work stealing possible if crawl A already crawls a URL that is external to crawl B? Or rather work injection, if a crawl for that domain already exists?

queries

  • insert initial URL(s)
  • insert URLs found
    • only if the URL does not exist yet
  • get next uncrawled URL
    • order by priority
    • round robin over URLs
  • set URL to crawled
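
A minimal sketch of these queries against a hypothetical url_frontier table (url, domain, priority, crawled_at), again using the postgres crate; the schema is an assumption, and round robin over domains is left to the caller.

#+begin_src rust
use postgres::Client;

/// "insert URLs found, only if the URL does not exist yet"
/// (assumes a unique constraint on url)
fn insert_url(
    client: &mut Client,
    url: &str,
    domain: &str,
    priority: f64,
) -> Result<u64, postgres::Error> {
    client.execute(
        "INSERT INTO url_frontier (url, domain, priority)
         VALUES ($1, $2, $3)
         ON CONFLICT (url) DO NOTHING",
        &[&url, &domain, &priority],
    )
}

/// "get next uncrawled URL, order by priority" for one domain.
fn next_uncrawled(client: &mut Client, domain: &str) -> Result<Option<String>, postgres::Error> {
    Ok(client
        .query_opt(
            "SELECT url FROM url_frontier
             WHERE domain = $1 AND crawled_at IS NULL
             ORDER BY priority DESC
             LIMIT 1",
            &[&domain],
        )?
        .map(|row| row.get(0)))
}

/// "set URL to crawled"
fn mark_crawled(client: &mut Client, url: &str) -> Result<u64, postgres::Error> {
    client.execute(
        "UPDATE url_frontier SET crawled_at = now() WHERE url = $1",
        &[&url],
    )
}
#+end_src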

crawl archive

  • get archived pages for URL
  • get all archived URLs for domain below path

crates to use

MIME Types

Input:

  1. MIME type declaration on the element pointing to the URL:
    • <link type="text/css" rel="stylesheet" href="…">
    • <a href="…" type="foo/bar">
    • <img> -> image sniffer
  2. MIME type from the HTTP header, overriding 1.? File extension?
  3. Sniffing, if there is no HTTP header (a sketch follows below this list)
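
A minimal sketch of that precedence with plain strings and a stub sniffer; real code would use a proper MIME type and a sniffing crate.

#+begin_src rust
/// Resolve the MIME type of a fetched resource: the HTTP header wins, then the
/// type declared on the linking element, and only then content sniffing.  The
/// file-extension question from the list above is left open here.
fn resolve_mime(
    http_content_type: Option<&str>,
    declared_type: Option<&str>,
    body: &[u8],
) -> String {
    if let Some(ct) = http_content_type {
        // strip parameters such as "; charset=utf-8"
        return ct.split(';').next().unwrap_or(ct).trim().to_lowercase();
    }
    if let Some(t) = declared_type {
        return t.trim().to_lowercase();
    }
    sniff(body)
}

/// Stub sniffer: only a few magic numbers, for illustration.
fn sniff(body: &[u8]) -> String {
    if body.starts_with(b"%PDF-") {
        "application/pdf".into()
    } else if body.starts_with(b"\x89PNG") {
        "image/png".into()
    } else if body.starts_with(b"<!DOCTYPE html") || body.starts_with(b"<html") {
        "text/html".into()
    } else {
        "application/octet-stream".into()
    }
}
#+end_src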

robots.txt

RFC 9309, Google, Yandex docs, robotstxt.org

The Google-based robotstxt crate does not(?) provide an object to hold a parsed robots.txt, is from a time before RFC 9309, and seems to be very “unrusty” (a lot of mutable global state). After all, it is a simple transliteration from C++.
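
For contrast, a minimal sketch of what an owned, queryable representation of a parsed robots.txt could look like; it is hand-rolled, not RFC 9309 complete (no wildcard patterns, no crawl-delay), and not based on any of the crates linked here.

#+begin_src rust
/// Minimal owned representation of a parsed robots.txt.
struct RobotsTxt {
    /// (lowercased user-agent token, rules in file order)
    groups: Vec<(String, Vec<Rule>)>,
}

enum Rule {
    Allow(String),
    Disallow(String),
}

impl RobotsTxt {
    fn parse(text: &str) -> Self {
        let mut groups: Vec<(String, Vec<Rule>)> = Vec::new();
        for line in text.lines() {
            let line = line.split('#').next().unwrap_or("").trim();
            let Some((key, value)) = line.split_once(':') else { continue };
            let (key, value) = (key.trim().to_lowercase(), value.trim().to_string());
            match key.as_str() {
                "user-agent" => groups.push((value.to_lowercase(), Vec::new())),
                "allow" => if let Some(group) = groups.last_mut() { group.1.push(Rule::Allow(value)) },
                "disallow" => if let Some(group) = groups.last_mut() { group.1.push(Rule::Disallow(value)) },
                _ => {}
            }
        }
        RobotsTxt { groups }
    }

    /// Longest matching rule wins; no matching rule means the path is allowed.
    fn allowed(&self, user_agent: &str, path: &str) -> bool {
        let ua = user_agent.to_lowercase();
        let group = self
            .groups
            .iter()
            .find(|(token, _)| ua.contains(token.as_str()))
            .or_else(|| self.groups.iter().find(|(token, _)| token.as_str() == "*"));
        let Some((_, rules)) = group else { return true };
        let mut best: Option<(usize, bool)> = None;
        for rule in rules {
            let (prefix, allow) = match rule {
                Rule::Allow(p) => (p, true),
                Rule::Disallow(p) => (p, false),
            };
            if prefix.is_empty() || !path.starts_with(prefix.as_str()) {
                continue;
            }
            if best.map_or(true, |(len, _)| prefix.len() > len) {
                best = Some((prefix.len(), allow));
            }
        }
        best.map_or(true, |(_, allow)| allow)
    }
}
#+end_src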

sitemaps

Crates

Canonical Link Element

URL normalization

remove tracking URL parameters

crates

  • query_map - generic wrapper around HashMap<String, Vec<String>> to handle different transformations like URL query strings
  • clearurl - implementation for ClearURL
  • clearurls - rm tracking params
  • qstring - query string parser
  • shucker - Tracking-param filtering library, designed to strip URLs down to their canonical forms
  • urlnorm - url normalization
  • url-cleaner - rm tracking garbage
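
A minimal sketch of the stripping itself, using the url crate directly rather than any of the crates listed above; the parameter list is illustrative and far from complete.

#+begin_src rust
use url::Url;

const TRACKING_PARAMS: &[&str] = &["gclid", "fbclid", "msclkid"];

/// Remove known tracking parameters (and anything starting with "utm_")
/// from a URL's query string.
fn strip_tracking(input: &str) -> Result<String, url::ParseError> {
    let mut url = Url::parse(input)?;
    let kept: Vec<(String, String)> = url
        .query_pairs()
        .filter(|(k, _)| !(k.starts_with("utm_") || TRACKING_PARAMS.contains(&k.as_ref())))
        .map(|(k, v)| (k.into_owned(), v.into_owned()))
        .collect();
    if kept.is_empty() {
        url.set_query(None);
    } else {
        url.query_pairs_mut().clear().extend_pairs(kept);
    }
    Ok(url.to_string())
}

fn main() {
    let cleaned = strip_tracking("https://example.org/a?utm_source=x&id=42&fbclid=abc").unwrap();
    assert_eq!(cleaned, "https://example.org/a?id=42");
    println!("{cleaned}");
}
#+end_src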

compiling with openssl on Debian

sfackler/rust-openssl#2333

sudo apt install libc6-dev libssl-dev
sudo ln -s /usr/include/x86_64-linux-gnu/openssl/opensslconf.h /usr/include/openssl/opensslconf.h
sudo ln -s /usr/include/x86_64-linux-gnu/openssl/configuration.h /usr/include/openssl/configuration.h

interesting stuff

p2p

deployment

marketing

fetch scheduling, time series, forecast

protocols in general

postgres

postgres crates

LISTEN/NOTIFY with postgres, diesel

crates

HTML content / article extraction