Consider:
- the Egyptian hieroglyphics syntax
- 'Birth_date_and_age' vs 'Birth-date_and_age'.
- the partial implementation of inline CSS,
- deep recursion of similar-syntax templates,
- the unexplained hashing scheme for image paths,
- the custom encoding of whitespace and punctuation,
- right-to-left values in left-to-right templates.
wtf_wikipedia supports many recursive shenanigans, deprecated and obscure template variants, and illicit 'wiki-esque' shorthands.
It will try its best, and fail in reasonable ways.
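For a taste of the template-variant problem above, here is a minimal, hypothetical sketch (not the library's actual code) of normalizing the 'Birth_date_and_age' / 'Birth-date_and_age' spellings onto one parser:

```javascript
// Both underscore and hyphen spellings must map to the same template parser.
function parseBirthDate(tmpl) {
  // strip the {{ }} wrapper and split on pipes
  const parts = tmpl.replace(/^\{\{|\}\}$/g, '').split('|');
  // normalize hyphens/underscores so all variants compare equal
  const name = parts[0].trim().toLowerCase().replace(/[-_]/g, ' ');
  if (name !== 'birth date and age') return null;
  const [year, month, day] = parts.slice(1, 4).map(Number);
  return { year, month, day };
}

parseBirthDate('{{Birth_date_and_age|1961|5|6}}');
// → { year: 1961, month: 5, day: 6 }
parseBirthDate('{{Birth-date_and_age|1961|5|6}}');
// → same result
```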
→ building your own parser is never a good idea →
← but this library aims to be a straightforward way to get data out of Wikipedia
npm install wtf_wikipedia
var wtf = require('wtf_wikipedia');
wtf.fetch('Whistling').then(doc => {
doc.categories();
//['Oral communication', 'Vocal music', 'Vocal skills']
doc.sections('As communication').text();
// 'A traditional whistled language named Silbo Gomero..'
doc.images(0).thumb();
// 'https://upload.wikimedia.org..../300px-Duveneck_Whistling_Boy.jpg'
doc.sections('See Also').links().map(l => l.page)
//['Slide whistle', 'Hand flute', 'Bird vocalization'...]
});
On the client side:
<script src="https://unpkg.com/wtf_wikipedia@latest/builds/wtf_wikipedia.min.js"></script>
<script>
//(follows redirect)
wtf.fetch('On a Friday', 'en', function(err, doc) {
var data = doc.infobox(0).data
data['current_members'].links().map(l => l.page);
//['Thom Yorke', 'Jonny Greenwood', 'Colin Greenwood'...]
});
</script>
- Detects and parses redirects and disambiguation pages
- Parses infoboxes into a formatted key-value object
- Handles recursive templates and links, like [[.. [[...]] ]]
- Per-sentence plaintext and link resolution
- Parses and formats internal links
- Creates image thumbnail URLs from File:XYZ.png filenames
- Properly resolves {{CURRENTMONTH}} and {{CONVERT ..}} type templates
- Parses images, headings, and categories
- Converts 'DMS-formatted' (59°12'7.7"N) geo-coordinates to lat/lng
- Parses citation metadata
- Eliminates XML, LaTeX, CSS, and table-sorting cruft
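As an illustration of the DMS-coordinate feature, here is a hedged sketch of the conversion; the library's own implementation may differ:

```javascript
// Convert degrees/minutes/seconds plus a hemisphere letter into a
// signed decimal value, e.g. 59°12'7.7"N → ~59.2021
function dmsToDecimal(deg, min, sec, hemisphere) {
  const sign = /[SW]/i.test(hemisphere) ? -1 : 1; // south/west are negative
  return sign * (deg + min / 60 + sec / 3600);
}

dmsToDecimal(59, 12, 7.7, 'N'); // ≈ 59.2021
```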
Wikimedia's Parsoid JavaScript parser is the official wikiscript parser. It reliably turns wikiscript into HTML, but not into valid XML.
To use it for data-mining, you'll need to:
parsoid(wikiText) -> [headless/pretend-DOM] -> screen-scraping
which is fine, but getting structured data this way (say, sentences or infobox values) is still a complex and weird process. Arguably, you're not any closer than you were with wikitext. This library has lovingly ❤️ borrowed a lot of code and data from the Parsoid project, and thanks its contributors.
wtf_wikipedia was built to work with dumpster-dive, which lets you parse a whole Wikipedia dump on a laptop in a couple of hours. It's definitely the way to go, instead of fetching many pages from the API.
const wtf = require('wtf_wikipedia')
//parse a page
var doc = wtf(wikiText, [options])
//fetch & parse a page - wtf.fetch(title, [lang_or_wikiid], [options], [callback])
(async () => {
var doc = await wtf.fetch('Toronto');
console.log(doc.text())
})();
//(callback format works too)
wtf.fetch(64646, 'en', (err, doc) => {
console.log(doc.categories());
});
- .sections() - ==these things==
- .sentences()
- .links()
- .tables()
- .lists()
- .images()
- .templates() - {{these|things}}
- .categories()
- .citations() - <ref>these guys</ref>
- .infoboxes()
- .coordinates()
- .json() - handy, workable data
- .text() - reader-focused plaintext
- .html()
- .markdown()
- .latex() - (ftw)
- .isRedirect() - boolean
- .isDisambiguation() - boolean
- .title() - guess the title of this page
Flip your wikimedia markup into a Document object:
import wtf from 'wtf_wikipedia'
wtf("==In Popular Culture==\n*harry potter's wand\n* the simpsons fence");
// Document {plaintext(), html(), latex()...}
Retrieves the raw contents of a MediaWiki article from the Wikipedia action API.
This method supports the errback callback form, or returns a Promise if the callback is missing.
To call a non-English Wikipedia API, pass its language name as the second parameter:
wtf.fetch('Toronto', 'de', function(err, doc) {
doc.plaintext();
//Toronto ist mit 2,6 Millionen Einwohnern..
});
You may also pass the Wikipedia page id as a parameter instead of the page title:
wtf.fetch(64646, 'de').then(console.log).catch(console.log)
The fetch method follows redirects.
Returns only the nice text of the article:
var wiki =
"[[Greater_Boston|Boston]]'s [[Fenway_Park|baseball field]] has a {{convert|37|ft}} wall.<ref>{{cite web|blah}}</ref>";
var text = wtf(wiki).text();
//"Boston's baseball field has a 37ft wall."
wtf(page).sections(1).children()
wtf(page).sections('see also').remove()
s = wtf(page).sentences(4)
s.links()
s.bolds()
s.italics()
s.dates() //structured date templates
img = wtf(page).images(0)
img.url() // the full-size wikimedia-hosted url
img.thumbnail() // 300px, by default
img.format() // jpg, png, ..
img.exists() // HEAD req to see if the file is alive
If you're scripting this from the shell, or from another language, install with -g, and then run:
$ wtf_wikipedia George Clooney --plaintext
# George Timothy Clooney (born May 6, 1961) is an American actor ...
$ wtf_wikipedia Toronto Blue Jays --json
# {text:[...], infobox:{}, categories:[...], images:[] }
The Wikipedia API is pretty welcoming, though it recommends three things if you're going to hit it heavily:
- 1️⃣ pass an Api-User-Agent header, so they can easily identify and throttle bad scripts
- 2️⃣ bundle multiple pages into one request, as an array
- 3️⃣ run requests serially, or at least slowly.
wtf.fetch(['Royal Cinema', 'Aldous Huxley'], 'en', {
'Api-User-Agent': '[email protected]'
}).then((docList) => {
let allLinks = docList.map(doc => doc.links());
console.log(allLinks);
});
Join in! Projects like these are only done with many hands, and we try to be friendly and easy.
Thank you to the cross-fetch and jshashes libraries.
MIT