Expose the instance of WTF to the options so that it's possible to use .extend() #106
Expose the instance of WTF to the options so that it's possible to use `wtf.extend()`. Despite WTF being up to date, I couldn't find any way to extend the functionality of WTF through dumpster-dive.
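For concreteness, here is a rough sketch of the kind of API being asked for; note that a `wtf` option does not exist in dumpster-dive today, and the plugin shown is only an illustration:

```js
const wtf = require('wtf_wikipedia');
const dumpster = require('dumpster-dive');

// extend the instance first, with full access to require()...
wtf.extend((models) => {
  models.Doc.prototype.shout = function () {
    return this.title().toUpperCase();
  };
});

// ...then hand the already-extended instance to dumpster-dive
// (a hypothetical `wtf` option, not part of the current API)
dumpster({
  file: './dump.xml.bz2',
  db: 'enwiki',
  wtf: wtf
});
```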
I ended up with another issue by allowing this: we need to be able to pass async functions as well. Here is an updated JSONfn that supports them:

```js
let JSONfn;
if (!JSONfn) {
  JSONfn = {};
}

(function () {
  // stringify, turning functions (including async ones) into strings
  JSONfn.stringify = function (obj) {
    return JSON.stringify(obj, function (key, value) {
      if (typeof value === 'function') {
        if (value.constructor.name === 'AsyncFunction') {
          return '__async_fn__:' + value.toString();
        } else {
          return value.toString();
        }
      } else {
        return value;
      }
    });
  };

  // parse, reviving serialized functions with eval()
  JSONfn.parse = function (str) {
    return JSON.parse(str, function (key, value) {
      if (typeof value !== 'string') return value;
      if (value.substring(0, 13) === '__async_fn__:') {
        return eval('(' + value.substring(13) + ')');
      } else if (value.substring(0, 8) === 'function') {
        return eval('(' + value + ')');
      } else {
        return value;
      }
    });
  };
})();

exports.JSONfn = JSONfn;
```
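For clarity, a minimal sketch of how this JSONfn variant would round-trip an async callback (the module path is illustrative, and `custom` stands in for the callback option that dumpster-dive forwards to its workers):

```js
// assuming the module above is saved as jsonfn.js (illustrative path)
const { JSONfn } = require('./jsonfn.js');

// an options object carrying an async callback, like dumpster-dive's `custom`
const options = {
  db: 'wikipedia',
  custom: async function (doc) {
    return { title: doc.title() };
  }
};

// the async function is stored as an '__async_fn__:'-prefixed string
const wire = JSONfn.stringify(options);

// on the worker side, eval() rebuilds it as a real AsyncFunction
const revived = JSONfn.parse(wire);
console.log(revived.custom.constructor.name); // 'AsyncFunction'
```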
It's also necessary to update the […]. It would be useful to pass not only the […]. Additionally, I think we will have to update the driver passed to the […]. All of this is to be able to use WTF extensions, especially those that make secondary requests to fetch data that is not in the dump, such as pageviews, some images, etc. Here is an example of my use:

```js
const data = {
  pageID: doc.pageID(),
  wdID: doc.wikidata(),
  title: doc.title(),
  url: doc.url(),
  image: '',
  redirectTo: isRedirect,
  isDisambiguation: isDisambiguation,
  ranking: 1,
  categories: isArticle || isCategoryTerm ? doc.categories() : [],
  terms: [],
  links: isDisambiguation
    ? doc
        .links()
        .filter(l => l.type() === 'internal' && l.text().length)
        .map(l => l.page() || l.text())
    : [],
  content: doc.content(isArticle),
  metadata: options.wikidata && isArticle && !isCategoryTerm ? await doc.getMetadata() : [],
  categoryData: isCategoryTerm ? await wtf.getCategoryPages(term) : []
}

try {
  if (isArticle || isDisambiguation) {
    const irt = await doc.getImageViewsRedirects()
    data.image = irt.image
    data.ranking = irt.ranking
    data.terms = irt.redirects
    // data.content = irt.content
  }
} catch (err) {
  throw new Error('Failed to fetch extra data from Wikipedia: ' + err.message)
}
```

PS: If you pass an async custom function, the stdout logging needs a better approach too.
hey André - yup, I agree this is a bad limitation. You're right that any callback sent to the worker needs to go through JSONfn, and this really reduces what is possible to do. What about something like this: `dumpster({ file: './myfile.xml.bz2', wtf_lib: 'path/to/lib.js' })`? Then you can extend or change wtf in any way. Would that work for you?
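A hypothetical `path/to/lib.js` along those lines might just re-export an already-extended wtf_wikipedia; the file name and the plugin body below are illustrative only:

```js
// path/to/lib.js (illustrative): the module dumpster-dive would require()
// instead of its bundled wtf_wikipedia
const wtf = require('wtf_wikipedia');

// extensions applied here run in normal module scope,
// so require()/import references stay intact
wtf.extend((models) => {
  models.Doc.prototype.isTrueRedirect = function () {
    // ...custom logic; reusing the built-in redirect check as a placeholder...
    return this.isRedirect();
  };
});

module.exports = wtf;
```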
@spencermountain I even made a fork for `custom` and it worked, but it was kind of a "hack". You should update the wtf and mongodb versions too; I can send a PR for this if you want. For Mongo, some files need to be updated. I think your suggestion would also solve it. Would it work asynchronously?
ha, love it. Of course - PRs welcome! Let me say - if I were you and had a dump working, I wouldn't call the Wikimedia API anymore. cheers
The problem is that I need pageviews for my current project. Some images on the Portuguese Wikipedia don't come with the dump, and I don't know why. Wikimedia doesn't make things easy, and wikitext is a nightmare; my extension is called "wtf from hell", haha. I gave up on working with data from Wiktionary, it's too much of a headache. You are a warrior for maintaining wtf_wikipedia.
Just an update: I updated dumpster-dive to receive `extension: wikiFromHell` in the dumpster options, and updated JSONfn to a more current version that supports async and arrow functions. The big problem, as far as I can tell, is that when it is passed into dumpster-dive, `wikiFromHell` loses several references to functions that are pulled in via require/import.

```js
const dumpsterOptions = {
  file: './dumps/wikipedia/wiki.dump.xml',
  db: 'wikipedia',
  db_url: 'mongodb://' + process.env.MONGO_HOST,
  skip_redirects: false,
  skip_disambig: false,
  wikidata: false,
  extension: wikiFromHell,
  custom: wtfHellDocument
}
```

In parseWiki.js:

```js
if (options.extension) {
  wtf.extend(options.extension);
}
```

And wikiFromHell:

```js
const isTrueRedirectFN = require('./isTrueRedirect.js');

const wikiFromHell = (models) => {
  const doc = models.Doc.prototype;
  const wtf = models.wtf;

  doc.isTrueRedirect = function () {
    return isTrueRedirectFN();
  };
}
```

In this scenario, `isTrueRedirectFN` lost its reference, and I had to put the whole function directly inside the […]. For now I'm using a patched dumpster-dive with `wikiFromHell` extended directly inside parseWiki. Is there a way to bypass the worker so that parseWiki can extend the function?
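One possible workaround, sketched under the assumption that the extension still has to survive JSONfn serialization: make the plugin fully self-contained, so the helper is defined inside the function body instead of being require()'d from module scope (the helper body here is a placeholder):

```js
// Self-contained variant of wikiFromHell: the helper lives inside the plugin,
// so nothing is lost when the function is stringified, sent to the worker,
// and eval()'d back.
const wikiFromHell = (models) => {
  // inlined instead of require('./isTrueRedirect.js')
  const isTrueRedirectFN = (doc) => {
    // ...the body of isTrueRedirect.js would go here...
    return Boolean(doc.isRedirect && doc.isRedirect());
  };

  models.Doc.prototype.isTrueRedirect = function () {
    return isTrueRedirectFN(this);
  };
};
```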
yeah, I've seen some have luck with using […]. Hope to get some time to look at this properly, next week or so.