Uri prediction #115

Scharfi · 2019-06-29T13:21:23Z

No description provided.

MichaelRoeder

I see that the feature is nearly integrated. However, there are some issues remaining:

Using integers as class labels does not make sense to me (outside of the predictor class). It forces every other component that works with URI classes to know the IDs of the classes. Since the classes may change over time, and since we use Strings as URI type labels everywhere else, this would be a huge effort to change later. Because of that, all the curi.addData(Constants.URI_TRUE_CLASS, 1) lines in the annotators should set a String instead of an integer.
The Binomial and Multinomial classes have a huge amount of code in common. Yes, simply copying all the classes is easy and fast, and if you are running out of time, that could be a way to go. However, in a typical project you want to keep the code quality high and make sure that the parts both approaches share are implemented only once.
We need to find a common way to handle the training data files that are not used for the JUnit tests. Having the data for JUnit tests in the test/java/resources directory is fine. However, at the moment you have methods that download the data from a web server, and at the same time the data is uploaded to GitHub. Neither approach is really good. Using a makefile or something similar that downloads the data once from the web server if it is not available might be an option... 🤔
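
One way to address the second point (code shared by the Binomial and Multinomial classes) is an abstract base class that holds the common steps once. The following is only a sketch under assumptions: the class names, the two toy features, the decision rules, and the "RDF"/"NON_RDF" labels are all hypothetical and are not taken from the pull request.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical base class: everything both predictors share lives here once.
abstract class AbstractUriPredictor {
    /** Shared step: turn a URI string into a simple numeric feature vector. */
    protected double[] extractFeatures(String uri) {
        // Illustrative features only: URI length and number of '/' characters.
        long slashes = uri.chars().filter(c -> c == '/').count();
        return new double[] { uri.length(), slashes };
    }

    /** Shared entry point; subclasses only supply the decision step. */
    public final String predict(String uri) {
        return decide(extractFeatures(uri));
    }

    protected abstract String decide(double[] features);
}

class BinomialSketch extends AbstractUriPredictor {
    @Override
    protected String decide(double[] features) {
        // Placeholder rule standing in for a trained binary model.
        return features[1] > 3 ? "RDF" : "NON_RDF";
    }
}

class MultinomialSketch extends AbstractUriPredictor {
    private final List<String> classes = Arrays.asList("SPARQL", "DUMP", "CKAN");

    @Override
    protected String decide(double[] features) {
        // Placeholder rule standing in for a trained multi-class model.
        return classes.get((int) features[1] % classes.size());
    }
}
```

With this layout, a change to the feature extraction is made in one place, and each subclass contains only what actually differs between the two approaches.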

MichaelRoeder · 2020-01-10T10:19:14Z

squirrel.frontier/src/main/java/org/dice_research/squirrel/components/FrontierComponent.java

@@ -108,9 +118,15 @@ public void init() throws Exception {
            queue = new InMemoryQueue();
            knownUriFilter = new InMemoryKnownUriFilter(doRecrawling, recrawlingTime);
        }
+        // Training the URI predictor model with a training dataset
+        try {
+            predictor = new MultinomialPredictor.MultinomialPredictorBuilder().withFile("multiNomialTrainData.txt").build();


You will have to choose. Either you use a Spring bean for the predictor (as defined in the frontier.xml file) OR you use this line of code to create the predictor. Note that the first solution is better but might be more challenging to define in the XML file. 🤔
However, using both together does not make any sense 😉
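
The first option (the predictor comes from outside, for example from a bean definition) can be sketched in plain Java as follows. This is an illustration under assumptions, not the project's code: the interface, the class, and the setter name are hypothetical, and no Spring API is used here.

```java
import java.util.Objects;

// Hypothetical minimal predictor contract, for illustration only.
interface UriTypePredictor {
    String predict(String uri);
}

// Sketch of a component that receives its predictor from outside
// (e.g. from a bean definition) instead of constructing one in init().
class InjectedFrontierComponentSketch {
    private UriTypePredictor predictor;

    // Setter that a container (or a test) would call before init().
    public void setPredictor(UriTypePredictor predictor) {
        this.predictor = predictor;
    }

    public void init() {
        // No builder call here: the injection is the single source of the predictor.
        Objects.requireNonNull(predictor, "predictor must be injected before init()");
    }

    public UriTypePredictor getPredictor() {
        return predictor;
    }
}
```

The point of the sketch is that the component never decides which predictor implementation to use; that decision lives in exactly one place, the configuration.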

MichaelRoeder · 2020-01-10T10:21:07Z

squirrel.frontier/src/main/java/org/dice_research/squirrel/frontier/impl/FrontierImpl.java

+    /**
+     * Object for Predictor
+     */
+    protected Predictor predictor;


This comment is not really helpful, is it? I think you can write a comment for this attribute that is much more helpful than that.

MichaelRoeder · 2020-01-10T10:23:40Z

squirrel.frontier/src/main/java/org/dice_research/squirrel/frontier/impl/FrontierImpl.java

+    public FrontierImpl(UriNormalizer normalizer, KnownUriFilter knownUriFilter, URIReferences uriReferences, UriQueue queue, boolean doesRecrawling, Predictor predictor) {
+        this(normalizer, knownUriFilter, uriReferences, queue, null, doesRecrawling, DEFAULT_GENERAL_RECRAWL_TIME, DEFAULT_TIMER_PERIOD);
+        this.predictor = predictor;
+


Why do you update only a single constructor? Now it is not possible to use the predictor together with, for example, a user-defined general recrawling time. Please check how the constructors are connected to each other and adapt your implementation of setting the predictor accordingly.
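
A common way to keep chained constructors consistent is to add the new parameter to the most specific constructor and let every shorter constructor delegate to it. The sketch below is a simplified, hypothetical class: the parameter list is reduced, the default values are assumed, and `Object` stands in for the Predictor type.

```java
// Hypothetical, simplified frontier: only the parameters needed to show the pattern.
class ChainedFrontierSketch {
    // Assumed default values, chosen for illustration only.
    static final long DEFAULT_GENERAL_RECRAWL_TIME = 7L * 24 * 60 * 60 * 1000;
    static final long DEFAULT_TIMER_PERIOD = 60L * 60 * 1000;

    final boolean doesRecrawling;
    final long generalRecrawlTime;
    final long timerPeriod;
    final Object predictor; // stands in for the Predictor type

    // Convenience constructor: fills in defaults and passes the predictor along.
    ChainedFrontierSketch(boolean doesRecrawling, Object predictor) {
        this(doesRecrawling, DEFAULT_GENERAL_RECRAWL_TIME, DEFAULT_TIMER_PERIOD, predictor);
    }

    // Most specific constructor: the only place where fields are assigned.
    ChainedFrontierSketch(boolean doesRecrawling, long generalRecrawlTime,
            long timerPeriod, Object predictor) {
        this.doesRecrawling = doesRecrawling;
        this.generalRecrawlTime = generalRecrawlTime;
        this.timerPeriod = timerPeriod;
        this.predictor = predictor;
    }
}
```

Because the predictor is assigned only in the last constructor, every combination of parameters that reaches it also carries the predictor.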

MichaelRoeder · 2020-01-10T10:28:06Z

squirrel.frontier/src/main/java/org/dice_research/squirrel/frontier/impl/FrontierImpl.java

+            } catch (Exception e) {
+                LOGGER.info("Exception happened while predicting", e);
+            }
+        }


The predictor should only be used if the URI_TYPE is not known. If we already know the type, we do not need to predict it.

It would be better if the Predictor had exactly one method that returns the predicted label. I do not understand why the interface forces the developer to call the featureHashing method first. If the Predictor has to calculate hashes for the features it wants to use, that is the problem of the Predictor and not of the developer who is using it, right? So it should be kept inside the Predictor (apart from the fact that not all Predictor implementations may need feature hashing).

I do not understand why the predictor returns an integer value. How should the crawler handle this value? We are using String values for the type in all other parts of the crawler.
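
The three remarks above (predict only when the type is unknown, a single prediction method, a String result) can be sketched together. Everything named here is an assumption made for illustration: the key name `uri-type`, the nested `LabelPredictor` interface, and the `resolveType` helper do not come from the pull request.

```java
import java.util.Map;

// Hypothetical sketch of how a frontier could consult a predictor.
class TypeAwareFrontierSketch {
    static final String URI_TYPE_KEY = "uri-type"; // assumed key name

    // A predictor with exactly one method that returns a String label.
    interface LabelPredictor {
        String predict(String uri);
    }

    private final LabelPredictor predictor;

    TypeAwareFrontierSketch(LabelPredictor predictor) {
        this.predictor = predictor;
    }

    /** Returns the known type if present; otherwise asks the predictor. */
    String resolveType(String uri, Map<String, Object> data) {
        Object known = data.get(URI_TYPE_KEY);
        if (known != null) {
            // The type is already known, so no prediction is needed.
            return known.toString();
        }
        // Feature hashing, if any, happens inside the predictor.
        return predictor.predict(uri);
    }
}
```

The caller never sees feature vectors or integer IDs; it only passes a URI and receives a label of the same kind it uses elsewhere.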

MichaelRoeder · 2020-01-10T10:34:42Z

squirrel.frontier/src/main/java/org/dice_research/squirrel/frontier/impl/FrontierImpl.java

        // After knownUriFilter uri should be classified according to
        // UriProcessor

+
+


Just as a hint: more empty lines do not increase the readability of code. If you think that the code is becoming unreadable, you may have to split up large methods or structure the code in a better way. Simply adding more empty lines as a kind of "break" between parts of the code is in general not helpful (and such lines are sometimes even removed by code formatters).

MichaelRoeder · 2020-01-10T12:32:27Z

squirrel.frontier/src/main/java/org/dice_research/squirrel/predictor/MultinomialPredictor.java

+/**
+ * A MultinomialPredictor
+ */
+public final class MultinomialPredictor implements Predictor{


Please have a look at the comments for the other predictor. I will not repeat all of them again.

MichaelRoeder · 2020-01-10T12:34:24Z

...ier/src/main/java/org/dice_research/squirrel/predictor/MultinomialTrainDataProviderImpl.java

+
+        classList.add("SPARQL");
+        classList.add("DUMP");
+        classList.add("CKAN");


Shouldn't the classes come from the training data? Why are you hard-coding them here?
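
As a rough illustration of deriving the class list from the data, the sketch below collects the distinct labels in order of first appearance. The line format (`uri,label`, with the label after the last comma) and the class and method names are assumptions; the actual training file format is not shown in this pull request.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the reviewed code.
class ClassListSketch {
    /** Collects the distinct labels from lines of the assumed form "uri,label". */
    static List<String> classesFrom(List<String> trainingLines) {
        // LinkedHashSet removes duplicates and keeps first-seen order.
        Set<String> labels = new LinkedHashSet<>();
        for (String line : trainingLines) {
            int sep = line.lastIndexOf(',');
            if (sep < 0) {
                continue; // skip lines without a label column
            }
            String label = line.substring(sep + 1).trim();
            if (!label.isEmpty()) {
                labels.add(label);
            }
        }
        return new ArrayList<>(labels);
    }
}
```

A new class that appears in the training data is then picked up without any code change.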

MichaelRoeder · 2020-01-10T12:36:17Z

squirrel.frontier/src/main/java/org/dice_research/squirrel/predictor/Predictor.java

+     * Gets the model being used by the predictor
+     * @return the models
+     */
+    RegressionModel getModel();


Why does the predictor interface define all of these detailed methods that are not necessary for every predictor? As far as I understand it, the interface should have exactly two methods:

predict the class for a given URI

update the internal model based on the given URI (which should contain its true class label)

All the other methods shouldn't be part of the interface from my point of view.
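
A two-method interface along these lines might look as follows. This is a sketch, not the project's Predictor interface: `LabeledUri` is a hypothetical stand-in for the crawler's URI object, and the majority-vote implementation is a toy model used only to make the example executable.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the crawler's URI object.
class LabeledUri {
    final String uri;
    final String trueLabel; // null while the true class is unknown

    LabeledUri(String uri, String trueLabel) {
        this.uri = uri;
        this.trueLabel = trueLabel;
    }
}

// The interface exposes only what every predictor can offer.
interface TwoMethodPredictor {
    /** Predicts the class for the given URI. */
    String predict(LabeledUri uri);

    /** Updates the internal model with a URI that carries its true class. */
    void update(LabeledUri uri);
}

// Toy implementation: predicts the most frequently seen true label.
class MajorityLabelPredictor implements TwoMethodPredictor {
    private final Map<String, Integer> counts = new HashMap<>();
    private final String fallback;

    MajorityLabelPredictor(String fallback) {
        this.fallback = fallback;
    }

    @Override
    public String predict(LabeledUri uri) {
        String best = fallback;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    @Override
    public void update(LabeledUri uri) {
        if (uri.trueLabel != null) {
            counts.merge(uri.trueLabel, 1, Integer::sum);
        }
    }
}
```

Model details such as weights, thresholds, or the underlying regression model stay in the implementing class, where callers that need them can access them through the concrete type.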

CatherineChiramel · 2020-01-17

We defined the getter and setter methods in the interface so that they could be accessed from outside through a predictor object declared with the interface type. As you suggested, we have now removed the detailed methods from the interface.

MichaelRoeder · 2020-01-10T12:38:25Z

squirrel.frontier/src/main/java/org/dice_research/squirrel/predictor/TrainingDataProvider.java

+     * @param dataUri
+     * @param trainFilePath
+     */
+    void createTrainDataFile(String dataUri, String trainFilePath);


This method shouldn't be part of the interface. It is a very special case that your data is located on an HTTP server. Typically, you would expect that it is given, e.g., in a local file.
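
An interface that assumes the data is already available locally could be as small as the sketch below. The interface and class names are hypothetical; the only assumption about the data is that it is line-based. Where the `Reader` comes from (a local file, a test resource, a file fetched once by a build step) is left to the caller.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical provider contract: it reads examples, it does not fetch them.
interface LocalTrainingDataProvider {
    List<String> readExamples(Reader source);
}

// Minimal implementation that returns the non-empty, trimmed lines.
class LineBasedProvider implements LocalTrainingDataProvider {
    @Override
    public List<String> readExamples(Reader source) {
        return new BufferedReader(source).lines()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList());
    }
}
```

For a local file, the caller could pass the result of `java.nio.file.Files.newBufferedReader(path)`; a test could pass a `java.io.StringReader`.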

MichaelRoeder · 2020-01-10T13:28:01Z

squirrel.worker/src/main/java/org/dice_research/squirrel/worker/impl/WorkerImpl.java

+            uri.addData(Constants.URI_TRUE_LABEL,1);
+        } else {
+            uri.addData(Constants.URI_TRUE_LABEL, 0);
+        }


Here, you are mixing up the binomial classification with the multinomial classification, right?

CatherineChiramel · 2020-01-17

URI_TRUE_LABEL is required if the user wants to perform binary classification, and URI_TRUE_CLASS is required for multi-class classification. Both values have been changed to store Strings containing the class name of the URI.

…into uri_prediction
Conflicts:
squirrel.frontier/src/main/java/org/dice_research/squirrel/predictor/BinomialPredictor.java
squirrel.frontier/src/main/java/org/dice_research/squirrel/predictor/MultinomialPredictor.java

sritejakv and others added 30 commits April 8, 2019 15:33

Merge pull request #1 from sritejakv/develop

fc49a60

Merging develop to master

Initial implementation of the prediction

0ae0329

-intial class URI key update to take into account the prediction labels -Predicted label and the true label -feature vector

Initial implementation of the prediction

76b452e

-intial class

Constructor for the prediction class

adab0e2

Update the structure of the prediction class

0b97505

Update the structure of the prediction class

8b128a7

Update the structure of the prediction class

40d308e

Revert "Update the structure of the prediction class"

f061f45

This reverts commit 0b97505.

Merge branch 'uri_prediction' of https://github.com/sritejakv/Squirrel …

d4e926b

…into uri_prediction

Update the structure of the prediction class

02ae341

Revert wrong commit

a1ca884

This reverts commit 8b128a7

Added the feature set of URIs to URI map

40f24bc

Obtained the prediction of URI and added it to URI map

0df57af

Updated true label key of the prediction with the right value

33607c0

Prediction class update

5990914

- update method to be called in other classes Initial Test case for the prediction

Interface for the predictor

93a352b

Changing the PredictorImpl instance creation to FrontierComponent

99d1d58

Added intrinsic properties of the referring URI to the feature set of…

ae04387

… URI.

Reform Interface and predict method

60a51fc

Added URI type to the URI map

28f9015

Update predictor class and test case

8536c99

Setting up input stream to train URI predictor

548bec2

Adding sample train data set

f322049

Implementing the Weight Update and Integrating it to the Predictor

dd1d828

RegressionLearn extending RegressionLearner of the library Calling the weightUpdate in the frontier

Updating the training data set temporarily and moving the call of tra…

4a98d6b

…in() method from FrontierImpl to FrontierComponent

Testcase for featureHashing method

d448acf

Test case for the Leaner training

7290af6

-Include a resource file with 2 URIS (RDF, non RDF) Update the prediction test case with the training weight

Include a condition based on the predicted value to Perform the crawling

c27ce6b

- Worker will crawler the URI if the predicted value is 1 or a fraction of the URIs

Revert "Include a condition based on the predicted value to Perform t…

e21090e

…he crawling - Worker will crawler the URI if the predicted value is 1 or a fraction of the URIs" This reverts commit c27ce6b

Updating train dataset and modifying train() method call location

97a9d6c

CatherineChiramel and others added 18 commits December 23, 2019 16:52

Fixing weight updation test case.

043f5b0

Writing train data to a file for binomial predictor test cases

83d32a3

Structural modifications and insertion of necessary JavaDocs

2804b2d

Changing the datatype of predicted value for binomial predictor

e46ddb7

Merge branch 'develop' into uri_prediction

3bd3796

Adding try-catch for predictor in frontier component

7dc1824

Merge branch 'uri_prediction' of https://github.com/sritejakv/Squirrel …

89c94b3

…into uri_prediction

JDK check

eb9d686

JDK test

252201e

Slight modifications

547d6fe

Removal of unecessary classess

3ad2c0c

JDK version test

88516db

Removal of unnecessary tests

a41c75d

Removal of unnecessary imports

12c687c

Fixing build error

e9fa7e2

Adding dist:trusty to travis.yml

cdb038f

Adding more train data to fix multinomial training testcase failure

54dbabf

Fixing filenotfound exception for multinomial train data file

1cff12c

MichaelRoeder requested changes Jan 10, 2020

CatherineChiramel and others added 11 commits January 10, 2020 21:16

Adding getter and setter methods for setting threshold in binomial pr…

f6613c2

…edictor.

Changing the predicted values from integer numbers to the real class …

7300bfd

…names

Deleting an unwanted file

da33b58

Adding train data for binomial predictor

8f2d1fb

Passing predictor object to necessary frontier constructors

e4c4d77

Comments for the parameters

8d24a4b

Comments for the parameters

0833591

Moving feature generating into a new class

0c39afb

Small changes to integrate the review comments

b4da6e2

Adding Javadoc comments to the predictor parameters

e4dedb3

CatherineChiramel Jan 17, 2020

CatherineChiramel Jan 17, 2020
