TODO: https://doc.rust-lang.org/rustdoc/
A network of small crawl, archive and index nodes. One node can consist of multiple machines but is administrated by one entity. Nodes crawl domains trusted by the node administrators:
- to contain relevant content
- to be easily crawlable
- to require no execution of JavaScript
- not to try to trick index ranking
- not to intentionally mislead crawlers
Nodes serve search queries from their local index, optionally forward them to other nodes, and present search results of other nodes interleaved with local results. Trust and ranking of remote peers are constantly adapted.
Node discovery is done by gossip or metadata from crawled web pages.
Nodes also serve as a distributed internet archive for the crawled pages.
TODO: Elaborate
- Wikipedia is not trustworthy
- Mediawiki is actually a restriction of the original idea of the web: Mediawiki centralizes moderation of content. On the web, everybody can host a web page somewhere and declare that its content is relevant for one or more keywords.
- It is then the job of a search engine to propose a set of relevant web pages for given keywords.
- Web pages should provide standardized APIs for sending patches for the content of the page and for viewing the page history. Atom could help.
- The actual hosting of web pages should also be distributed, e.g. IPFS to help with archiving and censorship resistance.
- Naming: check petnames and the GNU Name System
- All of this requires different browsers
A modern programming language can considerably enhance the productivity of a small team of skilled programmers. This can enable projects that would otherwise require an order of magnitude more not-so-skilled programmers working with “easier” languages like Go or Python.
Haskell would have been the first choice.
However, Rust seems to have more libraries (indexing!) and is about as modern as Haskell.
Also, the long-term vision requires changes in browsers. There is a new browser project in Rust (Servo).
Installations might span multiple machines but are not expected to grow beyond what one Postgres installation can handle. Postgres itself can be scaled.
SQL can replace a lot of complex client side code.
Postgres should not host the index, nor the archive.
No async Rust code is used. Async has no support for drop (there is no async Drop). (TODO: elaborate)
This is still experimental but looks stable:
- easy to respect politeness
- good compression of WARC files
There are even more projects, e.g.:
- Rohan Kumar’s list: A look at search engines with their own indexes
The search projects below are all open source, have their own crawler, and are developed in English.
Notable other projects:
- Mojeek: not open source but big index
- SeSe Engine: large, open source, but Chinese source code
- clew: has not yet released its code?
| Project | year | lang |
|---|---|---|
| indieweb-search ✝ | | |
| marginalia | | |
| mwmbl | | |
| PeARS | <2016 | Python |
| phinde | | PHP |
| searchmysite | | |
| stract | ~2023 | Rust |
| unobtanium | | |
| Wiby | | PHP |
| YaCy | ~2003 | Java |
- indieweb-search @capjamesg (archived)
- marginalia @MarginaliaSearch
- mwmbl @mwmbl @daoudclarke
- PeARS: contact https://aurelieherbelot.net
- phinde @cweiske (only own domain + linked)
- searchmysite @searchmysite @m-i-l
- stract @StractOrg @mikkeldenker
- 8.2.2024, 404 Media: This Guy Has Built an Open Source Search Engine as an Alternative to Google in His Spare Time (behind free subscription wall)
- 4.2.2024, Hacker News: Stract: Open-source, non-profit search engine
- Wiby @wibyweb
- YaCy @Orbiter
- https://github.com/Nandakumartc/scraper-crawler: list of scrapers, crawlers, and spiders in different languages
- BM25 (see the formula below)
- Check the search results returned by peers by asynchronously downloading the URL and confirming the presence of the search terms in the result; adapt trust accordingly (via the YaCy paper; sketch below).
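A minimal sketch of that check, assuming a blocking reqwest client (the project avoids async Rust); the function name is illustrative, not an existing API:

```rust
use std::error::Error;

/// Re-download a result URL reported by a peer and confirm that all query
/// terms actually occur in the page. The caller adapts the peer's trust
/// depending on the outcome.
fn verify_peer_result(url: &str, terms: &[&str]) -> Result<bool, Box<dyn Error>> {
    let body = reqwest::blocking::get(url)?.text()?.to_lowercase();
    Ok(terms.iter().all(|term| body.contains(&term.to_lowercase())))
}
```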
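And for reference, the standard BM25 ranking formula noted above (common defaults: $k_1$ between 1.2 and 2.0, $b = 0.75$):

$$
\operatorname{score}(D,Q) = \sum_{i=1}^{n} \operatorname{IDF}(q_i)\cdot\frac{f(q_i,D)\,(k_1+1)}{f(q_i,D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\operatorname{IDF}(q_i) = \ln\!\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)
$$

where $f(q_i,D)$ is the term frequency of $q_i$ in document $D$, $|D|$ the document length, $\mathrm{avgdl}$ the average document length, $N$ the number of documents, and $n(q_i)$ the number of documents containing $q_i$.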
- advisory locks with diesel: diesel-rs/diesel#3459
- get job with lease (see the sketch after this item)
- update lease (can the work manager do this by observing the URL frontier of the crawl job?)
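A minimal sketch of "get job with lease" on Postgres via diesel's sql_query, assuming a hypothetical crawl_jobs table; FOR UPDATE SKIP LOCKED lets several workers claim jobs without blocking each other (an alternative to advisory locks):

```rust
use diesel::prelude::*;
use diesel::sql_query;
use diesel::sql_types::Text;

/// Hypothetical row type; the crawl_jobs table and its columns are assumptions.
#[derive(QueryableByName, Debug)]
struct Job {
    #[diesel(sql_type = Text)]
    url: String,
}

/// Claim one uncrawled URL and lease it for one minute.
fn get_job_with_lease(conn: &mut PgConnection) -> QueryResult<Option<Job>> {
    sql_query(
        "UPDATE crawl_jobs \
         SET lease_until = now() + interval '1 minute' \
         WHERE url = (SELECT url FROM crawl_jobs \
                      WHERE crawled_at IS NULL \
                        AND (lease_until IS NULL OR lease_until < now()) \
                      ORDER BY priority DESC \
                      FOR UPDATE SKIP LOCKED \
                      LIMIT 1) \
         RETURNING url",
    )
    .get_result::<Job>(conn)
    .optional()
}
```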
- domain crossing hops instead of external depth
- depth in parallel to priority to have an absolute crawl frontier
- signals (to be combined into one priority; see the sketch after this list):
- sitemap.xml itself and linked sitemaps have constant priority 1
- priority of the find location (the page the URL was found on)
- number of URLs on the find location
- number of query params
- number of path elements
- outlink context
- properties
- priority from sitemaps satisfies 0 < p <= 1
- query params are worse than path elements
- remaining depth goes down
- work stealing possible if crawl A already crawls a URL external to crawl B? Or rather work injection, if a crawl for the domain already exists?
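One way these signals could combine into a single priority; all weights and the shape of the formula are made-up placeholders, not a tuned design:

```rust
/// Combine frontier signals into a crawl priority in (0, 1].
/// Weights are illustrative assumptions.
fn url_priority(
    sitemap_priority: Option<f64>, // 0 < p <= 1 from the sitemap, if any
    find_location_priority: f64,   // priority of the page the URL was found on
    urls_on_find_location: usize,  // many outlinks dilute each single link
    query_params: usize,           // query params are worse ...
    path_elements: usize,          // ... than path elements
) -> f64 {
    // Sitemap priority wins; otherwise inherit from the find location,
    // diluted by the number of URLs found there.
    let base = sitemap_priority
        .unwrap_or(find_location_priority / (1.0 + (urls_on_find_location as f64).ln_1p()));
    // Penalize path depth mildly and query params more strongly.
    let penalty = 1.0 + 0.1 * path_elements as f64 + 0.3 * query_params as f64;
    (base / penalty).clamp(f64::MIN_POSITIVE, 1.0)
}
```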
- insert initial URL(s)
- insert URLs found (see the sketch after this list)
- only if the URL does not exist yet
- get next uncrawled URL
- order by priority
- round robin over URLs
- set URL to crawled
- get archived pages for a URL
- get all archived URLs for a domain below a path
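For "insert URLs found only if they do not exist yet", ON CONFLICT DO NOTHING keeps the insert idempotent; a sketch using the same hypothetical crawl_jobs table as above:

```rust
use diesel::prelude::*;
use diesel::sql_query;
use diesel::sql_types::{Double, Text};

/// Insert a discovered URL unless it is already in the frontier.
/// Table and column names are assumptions, matching the lease sketch above.
fn insert_found_url(conn: &mut PgConnection, url: &str, priority: f64) -> QueryResult<usize> {
    sql_query(
        "INSERT INTO crawl_jobs (url, priority) VALUES ($1, $2) \
         ON CONFLICT (url) DO NOTHING",
    )
    .bind::<Text, _>(url)
    .bind::<Double, _>(priority)
    .execute(conn)
}
```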
- https://crates.io/crates/scraper - HTML parsing and querying with CSS selectors
- https://crates.io/crates/anyhow - Flexible concrete Error type built on std::error::Error
- https://github.com/utkarshkukreti/select.rs - extract data from HTML
- https://crates.io/crates/rouille - mini HTTP server for status and control pages
- There’s a standard! MIME Sniffing via Content Sniffing (WP)
- https://crates.io/crates/mime_classifier Implementation of the standard from Servo
- How Mozilla determines MIME Types
Input:
- MIME type declaration pointing to a URL:
- <link type="text/css" rel="stylesheet" href="…">
- <a href="…" type="foo/bar">
- <img> -> image sniffer
- MIME type from the HTTP header, overriding 1.? File extension?
- Sniffing, if no HTTP header
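A plain-Rust sketch of that precedence; the real rules would come from the WHATWG MIME Sniffing standard (e.g. via the mime_classifier crate), and putting the header first reflects the open question above:

```rust
/// Resolve a MIME type from the three inputs listed above. Illustrative
/// only; a real implementation would follow the MIME Sniffing standard.
fn resolve_mime(
    http_content_type: Option<&str>, // Content-Type header, if present
    declared_type: Option<&str>,     // type="…" on a <link>/<a> pointing here
    body: &[u8],                     // response body for sniffing
) -> String {
    if let Some(ct) = http_content_type {
        return ct.to_string(); // assume the header overrides the declaration
    }
    if let Some(t) = declared_type {
        return t.to_string();
    }
    sniff(body) // no header: sniff the content
}

/// Trivial stand-in for a real content sniffer.
fn sniff(body: &[u8]) -> String {
    if body.starts_with(b"<!DOCTYPE html") || body.starts_with(b"<html") {
        "text/html".to_string()
    } else if body.starts_with(&[0x89, b'P', b'N', b'G']) {
        "image/png".to_string()
    } else {
        "application/octet-stream".to_string()
    }
}
```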
RFC 9309, Google, Yandex docs, robotstxt.org
The Google-based robotstxt crate does not(?) provide an object to hold a parsed robots.txt, predates RFC 9309, and seems very “unrusty” (much mutable global state). After all, it is a direct transliteration from C++.
- texting_robots
- forked by Spire-rs’ kit/exclusion
- robotparser-rs
- forked by spider
- robots_txt - unstable, WIP, no activity for over 4 years
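How texting_robots would be used, going by its documentation (hedged; check the crate docs for the current API):

```rust
use texting_robots::{get_robots_url, Robot};

fn robots_example(robots_txt: &[u8]) {
    // Derive the robots.txt URL for a given page URL.
    let robots_url = get_robots_url("https://example.com/some/page").unwrap();
    println!("fetch {robots_url}");

    // Parse once per host, then check individual URLs against it.
    let robot = Robot::new("ExampleCrawler", robots_txt).unwrap();
    if robot.allowed("https://example.com/some/page") {
        // crawl it, respecting robot.delay (Crawl-delay) if set
    }
    println!("delay: {:?}, sitemaps: {:?}", robot.delay, robot.sitemaps);
}
```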
- https://developers.google.com/search/docs/crawling-indexing/sitemaps
- https://sitemaps.org
- https://en.m.wikipedia.org/wiki/Sitemaps
- https://crates.io/crates/sitemap xml-rs, old but 8 dependents
- https://crates.io/crates/sitemap-iter roxmltree (2022-02)
- https://crates.io/crates/sitemaps quick-xml (2024-06), experimental learning project
- https://crates.io/crates/wls - check for ideas
- https://crates.io/crates/sitemapo quick-xml (2023-07), dead repo
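If none of these crates fit, extracting the <loc> entries with quick-xml directly is small. A minimal sketch (assuming quick-xml ≥ 0.30), ignoring namespaces and <sitemapindex> recursion:

```rust
use quick_xml::events::Event;
use quick_xml::Reader;

/// Collect the text content of all <loc> elements in a sitemap document.
fn extract_sitemap_urls(xml: &str) -> Vec<String> {
    let mut reader = Reader::from_str(xml);
    let mut urls = Vec::new();
    let mut in_loc = false;
    loop {
        match reader.read_event() {
            Ok(Event::Start(e)) if e.name().as_ref() == b"loc" => in_loc = true,
            Ok(Event::End(e)) if e.name().as_ref() == b"loc" => in_loc = false,
            Ok(Event::Text(t)) if in_loc => {
                if let Ok(text) = t.unescape() {
                    urls.push(text.into_owned());
                }
            }
            Ok(Event::Eof) | Err(_) => break, // stop at end or on malformed XML
            _ => {}
        }
    }
    urls
}
```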
- search topics
- “crawling strategy”
- https://frontera.readthedocs.io
- Crawling strategies
- https://stackoverflow.com/questions/10331738/strategy-for-how-to-crawl-index-frequently-updated-webpages
- “re-crawl strategy” or “page refresh policy”
- https://frontera.readthedocs.io
- https://ssrg.eecs.uottawa.ca/docs/Benjamin-Thesis.pdf Strategy for Efficient Crawling of Rich Internet Applications
- “focused crawling”
- https://developers.google.com/search/docs/crawling-indexing
- https://crates.io/crates/urlnorm
- https://en.wikipedia.org/wiki/URI_normalization
- “Schonfeld et al. (2006) present a heuristic called DustBuster for detecting DUST (different URIs with similar text)”
- https://github.com/brave/brave-browser/wiki/Query-String-Filter
- https://gitlab.com/ClearURLs/ClearUrls
- https://gitlab.com/ClearURLs/rules -> data.min.json -> “globalRules”
- query_map - generic wrapper around HashMap<String, Vec<String>> to handle different transformations like URL query strings
- clearurl - implementation for ClearURL
- clearurls - rm tracking params
- qstring - query string parser
- shucker - Tracking-param filtering library, designed to strip URLs down to their canonical forms
- urlnorm - url normalization
- url-cleaner - rm tracking garbage
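Whichever crate wins, the core operation is small. A sketch with the url crate, using a tiny illustrative subset of tracking parameters (real lists would come from e.g. the ClearURLs rules):

```rust
use url::Url;

/// Strip common tracking parameters; Url::parse already lowercases the
/// host and drops the default port. The parameter list is illustrative.
fn clean_url(input: &str) -> Result<Url, url::ParseError> {
    const TRACKING: &[&str] = &["utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"];
    let mut url = Url::parse(input)?;
    let kept: Vec<(String, String)> = url
        .query_pairs()
        .filter(|(k, _)| !TRACKING.contains(&k.as_ref()))
        .map(|(k, v)| (k.into_owned(), v.into_owned()))
        .collect();
    if kept.is_empty() {
        url.set_query(None); // avoid a dangling "?"
    } else {
        url.query_pairs_mut()
            .clear()
            .extend_pairs(kept.iter().map(|(k, v)| (k.as_str(), v.as_str())));
    }
    Ok(url)
}
```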
```sh
sudo apt install libc6-dev libssl-dev
sudo ln -s /usr/include/x86_64-linux-gnu/openssl/opensslconf.h /usr/include/openssl/opensslconf.h
sudo ln -s /usr/include/x86_64-linux-gnu/openssl/configuration.h /usr/include/openssl/configuration.h
```
- GOGGLES: Democracy dies in darkness, and so does the Web, a paper by the Brave Search team, via Spyglass
- https://github.com/spyglass-search
- https://github.com/iipc - International Internet Preservation Consortium
- https://sans-io.readthedocs.io/how-to-sans-io.html
- The Niri WM was programmed following sans-io principles (hand-written English subtitles by the author)
- https://github.com/dhamaniasad/awesome-postgres
- https://www.postgresguide.com
- https://github.com/elierotenberg/coding-styles/blob/master/postgres.md
- https://github.com/sfackler/rust-postgres
- Rust implementation of the wire protocol, but uses tokio even in the synchronous client
- probably problems due to async? sfackler/rust-postgres#725
- postgres-protocol, postgres-types do not depend on tokio
- https://crates.io/crates/pgwire
- recommends sfackler's rust-postgres for clients, focuses on servers
- depends on tokio
- diesel
- uses the pq-sys C wrapper for libpq
- the raw connection is not pub
- no support for notifications
- previous request for LISTEN diesel-rs/diesel#2166
- https://docs.diesel.rs/2.2.x/src/diesel/pg/connection/raw.rs.html
- issues
- Removing libpq (to enable async)
- Async I/O
- Postgres: We should avoid sending one query per custom type bind enum!
- PostgreSQL Large Objects - would require access to internals?
- testing diesel-rs/diesel#1549
- diesel-rs/diesel#4420
- waiting for notifications is more involved, as it requires select()ing on a file descriptor
- https://blog.pjam.me/posts/select-syscall-in-rust/
- crates nix or rustix help
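For comparison, listening with sfackler's synchronous postgres client is straightforward, going by its documented Notifications API (hedged; channel name and connection string are placeholders):

```rust
use fallible_iterator::FallibleIterator; // from the fallible-iterator crate
use postgres::{Client, NoTls};

fn main() -> Result<(), postgres::Error> {
    let mut client = Client::connect("host=localhost user=postgres", NoTls)?;
    client.batch_execute("LISTEN crawl_events")?;

    // Block until notifications arrive on the channel.
    let mut notifications = client.notifications();
    let mut iter = notifications.blocking_iter();
    while let Some(note) = iter.next()? {
        println!("channel={} payload={}", note.channel(), note.payload());
    }
    Ok(())
}
```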
- https://github.com/rinja-rs/askama Type-safe, compiled Jinja-like templates
- https://crates.io/crates/fetcher Automatic news fetching and parsing
- https://crates.io/crates/httptest HTTP testing facilities including a mock server
- https://github.com/lipanski/mockito HTTP mocking for Rust! https://zupzup.org/rust-http-testing/
- https://crates.io/crates/tempfile
- https://crates.io/crates/pretty_assertions
- https://crates.io/crates/nonzero
- https://crates.io/crates/webpage
- https://crates.io/crates/warc
- https://crates.io/crates/feedfinder Auto-discovery of feeds in HTML content
- https://crates.io/crates/governor - a rate-limiting implementation in Rust, used in spyglass-search/netrunner
- https://crates.io/crates/thiserror
- https://crates.io/crates/tracing https://gist.github.com/oliverdaff/d1d5e5bc1baba087b768b89ff82dc3ec
- https://crates.io/crates/apalis - background job processing
- https://github.com/poem-web/poem - web framework
- https://crates.io/crates/metrics-dashboard uses poem and metrics
- https://crates.io/crates/metrics_server
- https://crates.io/crates/memberlist-core - Gossip protocol for cluster membership
- displaydoc derive macro for the standard library’s core::fmt::Display, especially for errors
- scopeguard run a given closure when it goes out of scope (like defer in D)