README-dev.org

TODO: https://doc.rust-lang.org/rustdoc/

Vision

A network of small crawl, archive, and index nodes. One node can consist of multiple machines but is administered by one entity. Nodes crawl domains that the node administrators trust:

  • to contain relevant content
  • to be easily crawlable
    • require no execution of Javascript
    • do not try to trick index ranking
    • do not intentionally mislead crawlers

Nodes serve search queries from their local index, optionally forward them to other nodes, and present the results of other nodes interleaved with local results. Trust in and ranking of remote peers is continuously adapted.
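
A minimal sketch of how this interleaving could look, assuming a per-peer trust score in [0, 1]; the types, field names, and weighting are illustrative only, not a recorded design.

#+begin_src rust
// Illustrative types; names and weighting are assumptions for this sketch.
struct SearchResult {
    url: String,
    score: f64, // relevance as reported by the node that produced the result
}

/// Interleave local results with results from remote peers by discounting
/// remote scores with the trust currently assigned to each peer and sorting
/// everything into one list.
fn merge_results(
    local: Vec<SearchResult>,
    remote: Vec<(f64, Vec<SearchResult>)>, // (peer trust in [0, 1], peer results)
) -> Vec<SearchResult> {
    let mut merged = local;
    for (trust, results) in remote {
        for mut r in results {
            r.score *= trust; // discount by peer trust
            merged.push(r);
        }
    }
    merged.sort_by(|a, b| b.score.total_cmp(&a.score));
    merged
}
#+end_src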

Node discovery is done by gossip or metadata from crawled web pages.

Nodes also serve as a distributed internet archive for the crawled pages.

Long term vision

TODO: Elaborate

  • Wikipedia is not trustworthy
  • MediaWiki is actually a restriction of the original idea of the web: it centralizes moderation of content. On the web, everybody can host a web page somewhere and declare that its content is relevant for one or more keywords.
  • It is then the job of a search engine to propose a set of relevant web pages for given keywords.
  • Web pages should provide standardized APIs for sending patches to the content of the page and for viewing the page history. ATOM could help.
  • The actual hosting of web pages should also be distributed, e.g. IPFS to help with archiving and censorship resistance.
  • Naming: check pet names and the GNU Name System
  • All of this requires different browsers

Architectural Decision Log (ADL)

Programming Language: Rust

A modern programming language can considerably enhance the productivity of a small team of skilled programmers. This can enable projects that would otherwise require an order of magnitude more not-so-skilled programmers working with “easier” languages like Go or Python.

Haskell would have been the first choice.

However, Rust seems to have more libraries (indexing!) and is about as modern as Haskell.

Also, the long-term vision requires changes in browsers. There is a new browser project in Rust (Servo).

Database: Postgres

Installations might span multiple machines but are not expected to grow beyond what one Postgres installation can handle. Postgres itself can be scaled.

SQL can replace a lot of complex client side code.

Postgres should host neither the index nor the archive.

No async

No async Rust code is used. Async has no support for drop (there are no asynchronous destructors). (TODO: elaborate)
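
A minimal sketch of what the resulting style looks like: plain blocking I/O plus OS threads instead of an async runtime. The use of the reqwest crate (with its blocking API) is an assumption for illustration, not a decision recorded here.

#+begin_src rust
use std::thread;

// Assumes the reqwest crate with the "blocking" feature enabled.
fn fetch(url: &str) -> Result<String, reqwest::Error> {
    // Ordinary blocking call: no executor, no Pin, normal Drop semantics.
    reqwest::blocking::get(url)?.text()
}

fn main() {
    let urls = ["https://example.org/", "https://example.com/"];
    let handles: Vec<_> = urls
        .into_iter()
        .map(|url| thread::spawn(move || fetch(url).map(|body| body.len())))
        .collect();
    for handle in handles {
        println!("{:?}", handle.join().unwrap());
    }
}
#+end_src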

Partition crawlers by domain

This is still experimental but looks stable (a sketch follows below this list):

  • easy to respect politeness
  • good compression of warc files
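
A minimal sketch of the partitioning idea, assuming a fixed number of crawler workers and hashing on the host; registrable-domain extraction (e.g. via a public suffix list) is left out.

#+begin_src rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Map a URL's host to one of `n_crawlers` partitions so that a given domain
/// is always handled by the same crawler: politeness delays stay local to one
/// worker, and the resulting WARC files contain similar pages, which helps
/// compression.
fn crawler_for(host: &str, n_crawlers: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    host.hash(&mut hasher);
    hasher.finish() % n_crawlers
}

fn main() {
    for host in ["example.org", "blog.example.org", "example.com"] {
        println!("{host} -> crawler {}", crawler_for(host, 4));
    }
}
#+end_src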

Other Search engine projects

There are even more projects, e.g.:

The search projects below are all open source, have their own crawler, and are developed in English.

Notable other projects:

  • Mojeek: not open source but big index
  • SeSe Engine: large, open source, but Chinese source code
  • clew: has not yet released its code?
| Project        | year  | lang   |
|----------------+-------+--------|
| indi.-search ✝ |       |        |
| marginalia     |       |        |
| mwmbl          |       |        |
| PeARS          | <2016 | Python |
| phinde         |       | PHP    |
| searchmysite   |       |        |
| stract         | ~2023 | rust   |
| unobtanium     |       |        |
| Wiby           |       | PHP    |
| YaCy           | ~2003 | Java   |

indieweb-search

indieweb-search @capjamesg (archived)

marginalia

mwmbl

mwmbl @mwmbl @daoudclarke

PeARS

Contact: https://aurelieherbelot.net

  • Conference Paper, April 2016: PeARS: a Peer-to-peer Agent for Reciprocated Search (doi) (pdf)

phinde

  • phinde @cweiske (only own domain + linked)

searchmysite

stract

unobtanium

Wiby

YaCy

Links, Ideas, Stuff

crawlers

indexing

ranking

p2p search

peer trust

  • Check the search results returned by peers by asynchronously downloading the URL and confirming the presence of the search terms in the result. Adapt trust accordingly. (via YaCy paper; a sketch follows below)
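
A minimal sketch of that check, again assuming the reqwest crate (blocking API) and a simple exponential moving average for the trust score; all names and the smoothing factor are illustrative.

#+begin_src rust
// Verify a peer-supplied result by fetching the page and checking that all
// query terms actually occur; error handling is simplified for the sketch.
fn result_matches(url: &str, terms: &[&str]) -> bool {
    match reqwest::blocking::get(url).and_then(|response| response.text()) {
        Ok(body) => {
            let body = body.to_lowercase();
            terms.iter().all(|term| body.contains(&term.to_lowercase()))
        }
        Err(_) => false, // an unreachable page counts as a failed check
    }
}

/// Exponential moving average over verification outcomes; trust stays in [0, 1].
fn update_trust(trust: f64, verified: bool, alpha: f64) -> f64 {
    let outcome = if verified { 1.0 } else { 0.0 };
    (1.0 - alpha) * trust + alpha * outcome
}
#+end_src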

duplicate detection

database schema

crawl job

  • get job with lease
  • update lease (can be done by the work manager by observing the URL frontier of the crawl job?); a sketch follows below
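
A minimal sketch of the lease, assuming a hypothetical crawl_job table (job_id, domain, lease_until) and the postgres crate; the schema, column names, and connection parameters are made up for illustration.

#+begin_src rust
use postgres::{Client, NoTls};

/// "get job with lease": atomically pick an unleased (or expired) job and
/// extend its lease, using SELECT ... FOR UPDATE SKIP LOCKED so concurrent
/// workers do not grab the same row.
fn claim_job(
    client: &mut Client,
    lease_minutes: i32,
) -> Result<Option<(i64, String)>, postgres::Error> {
    let row = client.query_opt(
        "UPDATE crawl_job
            SET lease_until = now() + make_interval(mins => $1)
          WHERE job_id = (
                SELECT job_id FROM crawl_job
                 WHERE lease_until IS NULL OR lease_until < now()
                 ORDER BY job_id
                 FOR UPDATE SKIP LOCKED
                 LIMIT 1)
        RETURNING job_id, domain",
        &[&lease_minutes],
    )?;
    Ok(row.map(|r| (r.get(0), r.get(1))))
}

fn main() -> Result<(), postgres::Error> {
    // Connection parameters are placeholders.
    let mut client = Client::connect("host=localhost user=crawler dbname=crawl", NoTls)?;
    if let Some((job_id, domain)) = claim_job(&mut client, 10)? {
        println!("leased job {job_id} for domain {domain}");
    }
    Ok(())
}
#+end_src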

url frontier

url priority

  • domain-crossing hops instead of external depth
  • track depth in parallel to priority to have an absolute crawl frontier
  • signals:
    • sitemap.xml itself and linked sitemaps have a constant priority of 1
    • priority of the find location (the page where the URL was found)
    • number of URLs on the find location
    • number of query parameters
    • number of path elements
    • outlink context
  • properties (a sketch follows below this list):
    • priority from sitemaps satisfies 0 < p <= 1
    • query parameters are worse than path elements
    • remaining depth goes down
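
A minimal sketch that combines these signals into a single priority; the field names and weights are made up, the only intent is to show the stated properties.

#+begin_src rust
/// Signals collected for a newly found URL; field names are illustrative.
struct UrlSignals {
    sitemap_priority: Option<f64>, // 0 < p <= 1 if the URL came from a sitemap
    find_location_priority: f64,   // priority of the page the URL was found on
    urls_on_find_location: u32,    // how many URLs that page contained
    query_params: u32,
    path_elements: u32,
}

/// Combine the signals: sitemap priorities are capped at 1, query parameters
/// hurt more than path elements, and inheriting from the find location is
/// diluted the more links that page contains.
fn priority(signals: &UrlSignals) -> f64 {
    if let Some(p) = signals.sitemap_priority {
        return p.clamp(f64::MIN_POSITIVE, 1.0);
    }
    let dilution = 1.0 + (signals.urls_on_find_location as f64).ln_1p();
    let inherited = signals.find_location_priority / dilution;
    let penalty = 0.2 * signals.query_params as f64 + 0.05 * signals.path_elements as f64;
    (inherited - penalty).max(0.0)
}

/// Remaining depth only goes down; a domain-crossing hop costs an extra step.
fn child_depth(remaining_depth: u32, crosses_domain: bool) -> u32 {
    remaining_depth.saturating_sub(if crosses_domain { 2 } else { 1 })
}
#+end_src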

url to crawl map

  • Is work stealing possible if crawl A already crawls a URL that is external to crawl B? Or rather work injection, if a crawl for that domain already exists?

queries

  • insert initial URL(s)
  • insert URLs found
    • only if the URL does not exist yet
  • get next uncrawled URL
    • order by priority
    • round robin over URLs
  • set URL to crawled
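
A minimal sketch of these queries against a hypothetical url_frontier table (url, domain, priority, crawled_at), again using the postgres crate; the schema is an assumption, and round robin over domains is left to the caller.

#+begin_src rust
use postgres::Client;

/// "insert URLs found, only if the URL does not exist yet"
/// (assumes a unique constraint on url)
fn insert_url(
    client: &mut Client,
    url: &str,
    domain: &str,
    priority: f64,
) -> Result<u64, postgres::Error> {
    client.execute(
        "INSERT INTO url_frontier (url, domain, priority)
         VALUES ($1, $2, $3)
         ON CONFLICT (url) DO NOTHING",
        &[&url, &domain, &priority],
    )
}

/// "get next uncrawled URL, order by priority" for one domain.
fn next_uncrawled(client: &mut Client, domain: &str) -> Result<Option<String>, postgres::Error> {
    Ok(client
        .query_opt(
            "SELECT url FROM url_frontier
             WHERE domain = $1 AND crawled_at IS NULL
             ORDER BY priority DESC
             LIMIT 1",
            &[&domain],
        )?
        .map(|row| row.get(0)))
}

/// "set URL to crawled"
fn mark_crawled(client: &mut Client, url: &str) -> Result<u64, postgres::Error> {
    client.execute(
        "UPDATE url_frontier SET crawled_at = now() WHERE url = $1",
        &[&url],
    )
}
#+end_src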

crawl archive

  • get archived pages for URL
  • get all archived URLs for domain below path

crates to use

MIME Types

Input:

  1. MIME type declaration on the element pointing to the URL:
    • <link type="text/css" rel="stylesheet" href="…">
    • <a href="…" type="foo/bar">
    • <img> -> image sniffer
  2. MIME type from the HTTP header, overriding 1.? File extension?
  3. Sniffing, if there is no HTTP header (a sketch follows below this list)
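
A minimal sketch of that precedence with plain strings and a stub sniffer; real code would use a proper MIME type and a sniffing crate.

#+begin_src rust
/// Resolve the MIME type of a fetched resource: the HTTP header wins, then the
/// type declared on the linking element, and only then content sniffing.  The
/// file-extension question from the list above is left open here.
fn resolve_mime(
    http_content_type: Option<&str>,
    declared_type: Option<&str>,
    body: &[u8],
) -> String {
    if let Some(ct) = http_content_type {
        // strip parameters such as "; charset=utf-8"
        return ct.split(';').next().unwrap_or(ct).trim().to_lowercase();
    }
    if let Some(t) = declared_type {
        return t.trim().to_lowercase();
    }
    sniff(body)
}

/// Stub sniffer: only a few magic numbers, for illustration.
fn sniff(body: &[u8]) -> String {
    if body.starts_with(b"%PDF-") {
        "application/pdf".into()
    } else if body.starts_with(b"\x89PNG") {
        "image/png".into()
    } else if body.starts_with(b"<!DOCTYPE html") || body.starts_with(b"<html") {
        "text/html".into()
    } else {
        "application/octet-stream".into()
    }
}
#+end_src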

robots.txt

RFC 9309, Google, Yandex docs, robotstxt.org

The Google-based robotstxt crate does not(?) provide an object to hold a parsed robots.txt, is from a time before RFC 9309, and seems to be very “unrusty” (a lot of mutable global state). After all, it is a simple transliteration from C++.
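
For contrast, a minimal sketch of what an owned, queryable representation of a parsed robots.txt could look like; it is hand-rolled, not RFC 9309 complete (no wildcard patterns, no crawl-delay), and not based on any of the crates linked here.

#+begin_src rust
/// Minimal owned representation of a parsed robots.txt.
struct RobotsTxt {
    /// (lowercased user-agent token, rules in file order)
    groups: Vec<(String, Vec<Rule>)>,
}

enum Rule {
    Allow(String),
    Disallow(String),
}

impl RobotsTxt {
    fn parse(text: &str) -> Self {
        let mut groups: Vec<(String, Vec<Rule>)> = Vec::new();
        for line in text.lines() {
            let line = line.split('#').next().unwrap_or("").trim();
            let Some((key, value)) = line.split_once(':') else { continue };
            let (key, value) = (key.trim().to_lowercase(), value.trim().to_string());
            match key.as_str() {
                "user-agent" => groups.push((value.to_lowercase(), Vec::new())),
                "allow" => if let Some(group) = groups.last_mut() { group.1.push(Rule::Allow(value)) },
                "disallow" => if let Some(group) = groups.last_mut() { group.1.push(Rule::Disallow(value)) },
                _ => {}
            }
        }
        RobotsTxt { groups }
    }

    /// Longest matching rule wins; no matching rule means the path is allowed.
    fn allowed(&self, user_agent: &str, path: &str) -> bool {
        let ua = user_agent.to_lowercase();
        let group = self
            .groups
            .iter()
            .find(|(token, _)| ua.contains(token.as_str()))
            .or_else(|| self.groups.iter().find(|(token, _)| token.as_str() == "*"));
        let Some((_, rules)) = group else { return true };
        let mut best: Option<(usize, bool)> = None;
        for rule in rules {
            let (prefix, allow) = match rule {
                Rule::Allow(p) => (p, true),
                Rule::Disallow(p) => (p, false),
            };
            if prefix.is_empty() || !path.starts_with(prefix.as_str()) {
                continue;
            }
            if best.map_or(true, |(len, _)| prefix.len() > len) {
                best = Some((prefix.len(), allow));
            }
        }
        best.map_or(true, |(_, allow)| allow)
    }
}
#+end_src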

sitemaps

Crates

Canonical Link Element

URL normalization

remove tracking URL parameters

crates

  • query_map - generic wrapper around HashMap<String, Vec<String>> to handle different transformations like URL query strings
  • clearurl - implementation for ClearURL
  • clearurls - rm tracking params
  • qstring - query string parser
  • shucker - Tracking-param filtering library, designed to strip URLs down to their canonical forms
  • urlnorm - url normalization
  • url-cleaner - rm tracking garbage
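
A minimal sketch of the stripping itself, using the url crate directly rather than any of the crates listed above; the parameter list is illustrative and far from complete.

#+begin_src rust
use url::Url;

const TRACKING_PARAMS: &[&str] = &["gclid", "fbclid", "msclkid"];

/// Remove known tracking parameters (and anything starting with "utm_")
/// from a URL's query string.
fn strip_tracking(input: &str) -> Result<String, url::ParseError> {
    let mut url = Url::parse(input)?;
    let kept: Vec<(String, String)> = url
        .query_pairs()
        .filter(|(k, _)| !(k.starts_with("utm_") || TRACKING_PARAMS.contains(&k.as_ref())))
        .map(|(k, v)| (k.into_owned(), v.into_owned()))
        .collect();
    if kept.is_empty() {
        url.set_query(None);
    } else {
        url.query_pairs_mut().clear().extend_pairs(kept);
    }
    Ok(url.to_string())
}

fn main() {
    let cleaned = strip_tracking("https://example.org/a?utm_source=x&id=42&fbclid=abc").unwrap();
    assert_eq!(cleaned, "https://example.org/a?id=42");
    println!("{cleaned}");
}
#+end_src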

compiling with openssl on Debian

sfackler/rust-openssl#2333

sudo apt install libc6-dev libssl-dev
sudo ln -s /usr/include/x86_64-linux-gnu/openssl/opensslconf.h /usr/include/openssl/opensslconf.h
sudo ln -s /usr/include/x86_64-linux-gnu/openssl/configuration.h /usr/include/openssl/configuration.h

interesting stuff

p2p

deployment

marketing

fetch scheduling, time series, forecast

protocols in general

postgres

postgres crates

LISTEN/NOTIFY with postgres, diesel

crates

HTML content / article extraction