-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Uri prediction #115
base: develop
Are you sure you want to change the base?
Uri prediction #115
Conversation
Merging develop to master
-intial class URI key update to take into account the prediction labels -Predicted label and the true label -feature vector
-intial class
This reverts commit 0b97505.
…into uri_prediction
This reverts commit 8b128a7
- update method to be called in other classes Initial Test case for the prediction
RegressionLearn extending RegressionLearner of the library Calling the weightUpdate in the frontier
…in() method from FrontierImpl to FrontierComponent
-Include a resource file with 2 URIS (RDF, non RDF) Update the prediction test case with the training weight
- Worker will crawler the URI if the predicted value is 1 or a fraction of the URIs
…he crawling - Worker will crawler the URI if the predicted value is 1 or a fraction of the URIs" This reverts commit c27ce6b
…into uri_prediction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that the feature is nearly integrated. However, there are some issues remaining:
- The usage of integers as class labels does not make sense to me (outside of the predictor class). With that, you force all other parts involved into the usage of URI classes to know the IDs of the classes. Since the classes may change over time and since we use Strings as URI type labels at all other positions, this would be a huge effort to change it. Please note that because of that all the
curi.addData(Constants.URI_TRUE_CLASS, 1)
lines in the annotators should be changed to set a String instead of an integer. - The Binomial and Multinomial classes have a huge amount of code in common. Yes, it is an easy and fast way to simply copy all the classes and if you are running out of time, that could be a way to go. However, in a typical project you would like to keep a high quality of your code and make sure that the parts both approaches have in common are implemented only once.
- We need to find a common way to handle the training data files that are not used for the JUnit tests. Having the data for JUnit tests in the test/java/resources directory is fine. However, At the moment, you have methods that download the data from a web server and at the same time the data is uploaded to github. Both methods are not really good. Using a makefile or something similar that downloads the data once from the web server if it is not available might be an option... 🤔
@@ -108,9 +118,15 @@ public void init() throws Exception { | |||
queue = new InMemoryQueue(); | |||
knownUriFilter = new InMemoryKnownUriFilter(doRecrawling, recrawlingTime); | |||
} | |||
// Training the URI predictor model with a training dataset | |||
try { | |||
predictor = new MultinomialPredictor.MultinomialPredictorBuilder().withFile("multiNomialTrainData.txt").build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will have to choose. Either, you use a Spring bean for the predictor (as defined in the frontier.xml file) OR you use this line of code to create the predictor. Note that the first solution is better but might be more challenging to define in the xml file. 🤔
However, both together do not make any sense 😉
/** | ||
* Object for Predictor | ||
*/ | ||
protected Predictor predictor; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is not really helpful, is it? I think you can write a comment for this attribute that is much more helpful than that.
public FrontierImpl(UriNormalizer normalizer, KnownUriFilter knownUriFilter, URIReferences uriReferences, UriQueue queue, boolean doesRecrawling, Predictor predictor) { | ||
this(normalizer, knownUriFilter, uriReferences, queue, null, doesRecrawling, DEFAULT_GENERAL_RECRAWL_TIME, DEFAULT_TIMER_PERIOD); | ||
this.predictor = predictor; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you update only one single constructor? Now, it is not possible to use the predictor and for example a user-defined general recrawling time. Please check how the constructors are connected to each other and adapt your implementation of setting the predictor accordingly.
} catch (Exception e) { | ||
LOGGER.info("Exception happened while predicting", e); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The predictor should only be used if the
URI_TYPE
is not known. If we already know the type, we do not need to predict it. - It would be better if the
Predictor
has exactly one method that returns the predicted label. I do not understand why the interface forces the developer to call thefeatureHashing
method before. If the Predictor has to calculate hashes for the features it would like to use it is the problem of the Predictor and not of the developer who is using it, right? So it should be kept inside the Predictor (apart from the fact that not all Predictor implementations may need feature hashing). - I do not understand why the predictor returns an integer value. How should the crawler handle this value? We are using String values for the type in all other parts of the crawler.
// After knownUriFilter uri should be classified according to | ||
// UriProcessor | ||
|
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as a hint: more empty lines do not increase the readability of code. If you think that code becomes unreadable, you may have to split up large methods or structure the code in a better way. But simply adding more empty lines as something like a "break" between parts of the code is in general not helpful (and is sometimes even removed by code formatting programs)
/** | ||
* A MultinomialPredictor | ||
*/ | ||
public final class MultinomialPredictor implements Predictor{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please have a look at the comments for the other predictor. I will not repeat all of them again.
|
||
classList.add("SPARQL"); | ||
classList.add("DUMP"); | ||
classList.add("CKAN"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the classes come from the training data? Why are you hard coding them here?
* Gets the model being used by the predictor | ||
* @return the models | ||
*/ | ||
RegressionModel getModel(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does the predictor interface define all of this detailed methods that are not necessary for every predictor? As far as I understand it, the interface should have exactly two methods:
- predict the class for a given URI
- update the internal model based on the given URI (which should contain its true class label)
All the other methods shouldn't be part of the interface from my point of view.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We defined the getter and setter methods in the interface so that they can be accessed from out side using the object for predictor declared using the interface. But we have removed the detailed methods from the interface since you suggested it.
* @param dataUri | ||
* @param trainFilePath | ||
*/ | ||
void createTrainDataFile(String dataUri, String trainFilePath); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method shouldn't be part of the interface. It is a very special case that your data is located on an HTTP server. Typically, you would expect that it is given, e.g., in a local file.
uri.addData(Constants.URI_TRUE_LABEL,1); | ||
} else { | ||
uri.addData(Constants.URI_TRUE_LABEL, 0); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, you are mixing up the binomial classification with the multinomial classification, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
URI_TRUE_LABEL is required if the user wants to perform binary classification and URI_TRUE_CLASS for multi class classification. Both these values have been modified to store string values containing the class name of the URI
…into uri_prediction � Conflicts: � squirrel.frontier/src/main/java/org/dice_research/squirrel/predictor/BinomialPredictor.java � squirrel.frontier/src/main/java/org/dice_research/squirrel/predictor/MultinomialPredictor.java
No description provided.