Uri prediction #115

Open
wants to merge 116 commits into base: develop
fc49a60
Merge pull request #1 from sritejakv/develop
sritejakv Apr 8, 2019
0ae0329
Initial implementation of the prediction
May 24, 2019
76b452e
Initial implementation of the prediction
May 24, 2019
adab0e2
Constructor for the prediction class
May 25, 2019
0b97505
Update the structure of the prediction class
May 29, 2019
8b128a7
Update the structure of the prediction class
May 29, 2019
40d308e
Update the structure of the prediction class
May 29, 2019
f061f45
Revert "Update the structure of the prediction class"
May 29, 2019
d4e926b
Merge branch 'uri_prediction' of https://github.com/sritejakv/Squirre…
May 29, 2019
02ae341
Update the structure of the prediction class
May 29, 2019
a1ca884
Revert wrong commit
May 29, 2019
40f24bc
Added the feature set of URIs to URI map
CatherineChiramel Jun 2, 2019
0df57af
Obtained the prediction of URI and added it to URI map
CatherineChiramel Jun 2, 2019
33607c0
Updated true label key of the prediction with the right value
Jun 2, 2019
5990914
Prediction class update
Jun 7, 2019
93a352b
Interface for the predictor
Jun 7, 2019
99d1d58
Changing the PredictorImpl instance creation to FrontierComponent
CatherineChiramel Jun 8, 2019
ae04387
Added intrinsic properties of the referring URI to the feature set of…
CatherineChiramel Jun 8, 2019
60a51fc
Reform Interface and predict method
Jun 8, 2019
28f9015
Added URI type to the URI map
Jun 8, 2019
8536c99
Update predictor class and test case
Jun 16, 2019
548bec2
Setting up input stream to train URI predictor
CatherineChiramel Jun 17, 2019
f322049
Adding sample train data set
CatherineChiramel Jun 20, 2019
dd1d828
Implementing the Weight Update and Integrating it to the Predictor
anirudhash Jun 22, 2019
4a98d6b
Updating the training data set temporarily and moving the call of tra…
CatherineChiramel Jun 22, 2019
d448acf
Testcase for featureHashing method
CatherineChiramel Jun 22, 2019
7290af6
Test case for the Learner training
Jun 23, 2019
c27ce6b
Include a condition based on the predicted value to Perform the crawling
Jun 23, 2019
e21090e
Revert "Include a condition based on the predicted value to Perform t…
Jun 26, 2019
97a9d6c
Updating train dataset and modifying train() method call location
CatherineChiramel Jun 27, 2019
5b412ca
JUnit Test Case for the weightUpdate method.
anirudhash Jun 28, 2019
2a3d01f
Merge branch 'uri_prediction' of https://github.com/sritejakv/Squirre…
anirudhash Jun 28, 2019
571da2d
Extend the Test case for the train method
Jun 28, 2019
69beea6
Set the Weight using the updated parameters
anirudhash Jul 1, 2019
04655c3
Small Reform
Jul 3, 2019
269cb4c
Training test case small enhancement
Jul 3, 2019
daf856c
Creation of an evaluation function for the predictor to test the perf…
CatherineChiramel Oct 21, 2019
778359a
Creation of an evaluation function for the predictor to test the perf…
CatherineChiramel Oct 21, 2019
590afbc
Changing the size of the feature vector.
CatherineChiramel Oct 25, 2019
772cc9e
Merge branch 'develop' into uri_prediction
CatherineChiramel Oct 26, 2019
0cc4d79
Changes to resolve merge conflicts
CatherineChiramel Oct 26, 2019
87f8ea3
Adding Oraclejdk8 to fix build failure
CatherineChiramel Oct 29, 2019
d4d24b6
Modifying the train method to access the train dataset online.
CatherineChiramel Oct 31, 2019
191d485
Deletion of local train data set files.
CatherineChiramel Nov 1, 2019
f7acadd
Moving the evaluation function to a different class with a main() method
CatherineChiramel Nov 1, 2019
fe57b1d
Adding and modifying necessary comments
CatherineChiramel Nov 2, 2019
13494fc
Integrating review comments
CatherineChiramel Nov 8, 2019
80a624f
Defining predictor in frontier-context.xml file
CatherineChiramel Nov 10, 2019
a36907f
An interface defining a method for getting training data and a class …
anirudhash Nov 10, 2019
719dece
Merge branch 'uri_prediction' of https://github.com/sritejakv/Squirre…
anirudhash Nov 10, 2019
c9e2634
Adaptation to the newly added TrainingDataProvider
anirudhash Nov 10, 2019
23832dc
Configurable Learning Rate
anirudhash Nov 10, 2019
ad12696
Fixing a mistake in reading data from train data file
CatherineChiramel Nov 17, 2019
d77bdfb
Making L1, L2, Beta parameters configurable
CatherineChiramel Nov 17, 2019
9398fb4
Inclusion of Builder pattern to the Predictor
anirudhash Nov 18, 2019
baac1a8
Merge remote-tracking branch 'origin/uri_prediction' into uri_prediction
anirudhash Nov 18, 2019
52e271d
Slight modifications
anirudhash Nov 23, 2019
c240d2f
Adding method to perform k-fold cross validation of the predictor
CatherineChiramel Nov 23, 2019
b2d64a6
Creation of Multinomial train method to train a multi-class classifier
CatherineChiramel Nov 24, 2019
9041fcb
Intermediate version of builder pattern for Predictor
anirudhash Dec 1, 2019
7f8b20e
Merge remote-tracking branch 'origin/uri_prediction' into uri_prediction
anirudhash Dec 1, 2019
de94573
Complete implementation of multinomial training and corresponding k-f…
CatherineChiramel Dec 1, 2019
650c61d
Switching to multinomial training and creating confusion matrix
CatherineChiramel Dec 7, 2019
f6d6593
Adding true classes of the crawled URIs to map
CatherineChiramel Dec 7, 2019
9ac3c99
Introduction of multinomialModelWeightUpdate method, which updates th…
anirudhash Dec 8, 2019
3dd58ee
Builder pattern for the Multinomial model and its parameters
anirudhash Dec 8, 2019
6fcd7b9
Switching to multinomial classifier (Re-commit)
CatherineChiramel Dec 9, 2019
591fc5e
Binomial Predictor and Multinomial Predictor with its respective buil…
anirudhash Dec 10, 2019
3f25777
Multinomial Predictor with its respective builder pattern
anirudhash Dec 10, 2019
1778d66
Merge branch 'uri_prediction' of https://github.com/sritejakv/Squirre…
anirudhash Dec 10, 2019
9521200
Slight modifications
anirudhash Dec 10, 2019
978fa4a
Testcase for multiNomial training method and addition of a small trai…
CatherineChiramel Dec 15, 2019
8d5b9a9
Reducing the size of the train data file for multinomial training
CatherineChiramel Dec 15, 2019
e1b7208
Improved builder patterns for the Predictors
anirudhash Dec 15, 2019
7cd0a19
Fixing multinomial model initialisation
CatherineChiramel Dec 16, 2019
8fdebe9
Test Case for Multinomial model weight update
anirudhash Dec 16, 2019
bc4a349
Modified method for multinomial weight update
anirudhash Dec 16, 2019
2fdd40f
Slight modifications
anirudhash Dec 16, 2019
295920f
Slight modifications
anirudhash Dec 16, 2019
1b84df6
Slight modifications
anirudhash Dec 16, 2019
438745f
Small fixes for multinomial weight update method and test case.
CatherineChiramel Dec 16, 2019
b8cee08
Merge branch 'uri_prediction' of https://github.com/sritejakv/Squirre…
CatherineChiramel Dec 16, 2019
9092b59
Implementing separate TrainDataProviderImpl classes for binomial and …
CatherineChiramel Dec 17, 2019
307b0d0
Creating separate test classes for binomial and multinomial predictors.
CatherineChiramel Dec 17, 2019
5e3ad81
Small fixes in BinomialPredictor class.
CatherineChiramel Dec 20, 2019
04e4be8
Creating separate evaluation classes for binomial and multinomial pre…
CatherineChiramel Dec 20, 2019
29bdc34
Fixing an error in weightupdate test case for binomial predictor
CatherineChiramel Dec 20, 2019
043f5b0
Fixing weight updation test case.
CatherineChiramel Dec 23, 2019
83d32a3
Writing train data to a file for binomial predictor test cases
CatherineChiramel Dec 23, 2019
2804b2d
Structural modifications and insertion of necessary JavaDocs
anirudhash Dec 30, 2019
e46ddb7
Changing the datatype of predicted value for binomial predictor
CatherineChiramel Dec 30, 2019
3bd3796
Merge branch 'develop' into uri_prediction
CatherineChiramel Dec 30, 2019
7dc1824
Adding try-catch for predictor in frontier component
CatherineChiramel Dec 30, 2019
89c94b3
Merge branch 'uri_prediction' of https://github.com/sritejakv/Squirre…
CatherineChiramel Dec 30, 2019
eb9d686
JDK check
anirudhash Jan 5, 2020
252201e
JDK test
anirudhash Jan 5, 2020
547d6fe
Slight modifications
CatherineChiramel Jan 5, 2020
3ad2c0c
Removal of unnecessary classes
anirudhash Jan 6, 2020
88516db
JDK version test
anirudhash Jan 6, 2020
a41c75d
Removal of unnecessary tests
anirudhash Jan 6, 2020
12c687c
Removal of unnecessary imports
anirudhash Jan 6, 2020
e9fa7e2
Fixing build error
CatherineChiramel Jan 7, 2020
cdb038f
Adding dist:trusty to travis.yml
CatherineChiramel Jan 7, 2020
54dbabf
Adding more train data to fix multinomial training testcase failure
CatherineChiramel Jan 9, 2020
1cff12c
Fixing filenotfound exception for multinomial train data file
CatherineChiramel Jan 9, 2020
f6613c2
Adding getter and setter methods for setting threshold in binomial pr…
CatherineChiramel Jan 10, 2020
7300bfd
Changing the predicted values from integer numbers to the real class …
CatherineChiramel Jan 14, 2020
da33b58
Deleting an unwanted file
CatherineChiramel Jan 14, 2020
8f2d1fb
Adding train data for binomial predictor
CatherineChiramel Jan 14, 2020
e4c4d77
Passing predictor object to necessary frontier constructors
CatherineChiramel Jan 15, 2020
8d24a4b
Comments for the parameters
anirudhash Jan 16, 2020
1a98f2b
Merge branch 'uri_prediction' of https://github.com/sritejakv/Squirre…
anirudhash Jan 16, 2020
0833591
Comments for the parameters
anirudhash Jan 17, 2020
0c39afb
Moving feature generating into a new class
CatherineChiramel Jan 17, 2020
b4da6e2
Small changes to integrate the review comments
CatherineChiramel Jan 17, 2020
e4dedb3
Adding Javadoc comments to the predictor parameters
CatherineChiramel Jan 18, 2020
12 changes: 12 additions & 0 deletions .travis.yml
@@ -2,8 +2,20 @@ sudo: required

language: java

jdk: oraclejdk8

dist: trusty

addons:
apt:
packages:
- oracle-java8-installer



services:
- docker

before_install:
- docker pull rethinkdb:2.3.5

31 changes: 16 additions & 15 deletions docker-compose-sparql.yml
@@ -31,6 +31,7 @@ services:
- ./data/frontier:/var/squirrel/data
- ./seed/seeds.csv:/var/squirrel/seeds.csv:ro
- ./whitelist/whitelist.txt:/var/squirrel/whitelist.txt:ro
- ./spring-config:/var/squirrel/spring-config
command: java -cp squirrel.jar org.dice_research.squirrel.components.FrontierComponentStarter

virtuosohost:
@@ -136,18 +137,18 @@ services:
- ./spring-config:/var/squirrel/spring-config
command: java -cp squirrel.jar org.dice_research.squirrel.components.WorkerComponentStarter

deduplicator:
image: squirrel
container_name: deduplicator
environment:
DEDUPLICATION_ACTIVE: "true"
HOBBIT_RABBIT_HOST: rabbit
OUTPUT_FOLDER: /var/squirrel/data
MDB_HOST_NAME: mongodb
MDB_PORT: 27017
SPARQL_HOST_NAME: sparqlhost
SPARQL_HOST_PORT: 3030
SERVICE_PRECONDITION: "rethinkdb:28015 rabbit:5672"
volumes:
- ./data/deduplicator:/var/squirrel/data
command: java -cp squirrel.jar org.hobbit.core.run.ComponentStarter org.aksw.simba.squirrel.components.DeduplicatorComponent
# deduplicator:
# image: squirrel
# container_name: deduplicator
# environment:
# DEDUPLICATION_ACTIVE: "true"
# HOBBIT_RABBIT_HOST: rabbit
# OUTPUT_FOLDER: /var/squirrel/data
# MDB_HOST_NAME: mongodb
# MDB_PORT: 27017
# SPARQL_HOST_NAME: sparqlhost
# SPARQL_HOST_PORT: 3030
# SERVICE_PRECONDITION: "rethinkdb:28015 rabbit:5672"
# volumes:
# - ./data/deduplicator:/var/squirrel/data
# command: java -cp squirrel.jar org.hobbit.core.run.ComponentStarter org.aksw.simba.squirrel.components.DeduplicatorComponent
1 change: 1 addition & 0 deletions spring-config/frontier-context.xml
@@ -54,4 +54,5 @@




</beans>
@@ -46,6 +46,32 @@ public class Constants {
*/
public static final String URI_PREFERRED_RECRAWL_ON = "recrawl-on";

/*
* The data related to the predictor
*/
/**
* This key stores the value predicted by the Predictor for each URI denoting
* the class it belongs to (Positive class or Negative class)
*/
public static final String URI_PREDICTED_LABEL = "predicted-label";
/**
* This key stores the value denoting the true class of each URI
*/
public static final String URI_TRUE_LABEL = "true-label";
/**
* This key stores an integer denoting the true class of the URI
*/
public static final String URI_TRUE_CLASS = "true_class";
/**
* This key stores the feature vector generated for each URI for prediction purpose
*/
public static final String FEATURE_VECTOR = "feature-vector";
/**
* This key stores the parent URI of each crawled URI
*/
public static final String REFERRING_URI = "referring-uri";
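A minimal sketch of how keys like these might be used, assuming a URI's metadata is held in a plain `Map<String, Object>`. The key strings are copied from the constants above; the class `UriDataKeysSketch`, its helper methods, and the label value `"RDF"` are hypothetical and not part of this pull request.

```java
import java.util.HashMap;
import java.util.Map;

public class UriDataKeysSketch {
    // Key strings copied from the Constants diff above.
    static final String URI_PREDICTED_LABEL = "predicted-label";
    static final String URI_TRUE_LABEL = "true-label";
    static final String REFERRING_URI = "referring-uri";

    /** Hypothetical helper: metadata for a newly discovered URI. */
    public static Map<String, Object> describe(String referringUri, String predictedLabel) {
        Map<String, Object> data = new HashMap<>();
        data.put(REFERRING_URI, referringUri);
        data.put(URI_PREDICTED_LABEL, predictedLabel);
        return data;
    }

    /**
     * Hypothetical helper: stores the class observed after crawling and
     * reports whether the earlier prediction matched it.
     */
    public static boolean recordTrueLabel(Map<String, Object> data, String trueLabel) {
        data.put(URI_TRUE_LABEL, trueLabel);
        return trueLabel.equals(data.get(URI_PREDICTED_LABEL));
    }

    public static void main(String[] args) {
        Map<String, Object> data = describe("http://example.org/parent", "RDF");
        boolean correct = recordTrueLabel(data, "RDF");
        System.out.println("prediction correct: " + correct);
    }
}
```

Keeping the predicted and the true label under separate keys lets the true label be filled in later, once the crawl result is known, without overwriting the prediction.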


//////////////////////////////////////////////////
// URIs
//////////////////////////////////////////////////
8 changes: 7 additions & 1 deletion squirrel.frontier/pom.xml
@@ -21,6 +21,12 @@
<groupId>org.dice-research</groupId>
<artifactId>squirrel.web-api</artifactId>
</dependency>
<!-- Online machine learning library used to implement better URI type prediction in crawler -->
<dependency>
<groupId>de.jungblut.ml</groupId>
<artifactId>tjungblut-online-ml</artifactId>
<version>0.5</version>
</dependency>
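For readers unfamiliar with online learning, the general idea can be sketched in plain Java. This is not the API of the `tjungblut-online-ml` library and not the predictor implemented in this pull request; it is a hypothetical binary logistic-regression learner that combines feature hashing with a per-example weight update, two steps the commit list mentions. The class name, the tokenisation rule, and the parameter values are assumptions.

```java
public class OnlineLearnerSketch {
    private final double[] weights;
    private final double learningRate;

    public OnlineLearnerSketch(int dimensions, double learningRate) {
        this.weights = new double[dimensions];
        this.learningRate = learningRate;
    }

    /** Feature hashing: every token of the URI is counted in a fixed-size vector. */
    public double[] hashFeatures(String uri) {
        double[] x = new double[weights.length];
        for (String token : uri.split("[^A-Za-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            // floorMod keeps the index non-negative for negative hash codes.
            int index = Math.floorMod(token.hashCode(), weights.length);
            x[index] += 1.0;
        }
        return x;
    }

    /** Probability of the positive class under the current weights. */
    public double predict(double[] x) {
        double z = 0.0;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** One stochastic gradient step on the log loss for a single example. */
    public void update(double[] x, int label) {
        double error = label - predict(x);
        for (int i = 0; i < weights.length; i++) {
            weights[i] += learningRate * error * x[i];
        }
    }

    public static void main(String[] args) {
        OnlineLearnerSketch model = new OnlineLearnerSketch(64, 0.1);
        double[] x = model.hashFeatures("http://example.org/data.rdf");
        double before = model.predict(x);
        model.update(x, 1);
        System.out.println(before + " -> " + model.predict(x));
    }
}
```

Because each example updates the weights immediately, such a model can keep learning from crawl results as they arrive instead of being retrained on a full data set.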

<!-- ~~~~~~~~~~~~~~~~~~~ Testing ~~~~~~~~~~~~~~~~~~~~~~ -->
<!-- JUnit -->
@@ -63,4 +69,4 @@
</plugin>
</plugins>
</build>
</project>
</project>
@@ -35,7 +35,9 @@
import org.dice_research.squirrel.frontier.impl.QueueBasedTerminationCheck;
import org.dice_research.squirrel.frontier.impl.TerminationCheck;
import org.dice_research.squirrel.frontier.impl.WorkerGuard;
import org.dice_research.squirrel.predictor.*;
import org.dice_research.squirrel.queue.InMemoryQueue;
import org.dice_research.squirrel.queue.IpAddressBasedQueue;
import org.dice_research.squirrel.queue.UriQueue;
import org.dice_research.squirrel.rabbit.RPCServer;
import org.dice_research.squirrel.rabbit.RespondingDataHandler;
@@ -54,6 +56,7 @@
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Component;


@Component
@Qualifier("frontierComponent")
public class FrontierComponent extends AbstractComponent implements RespondingDataHandler {
@@ -77,10 +80,16 @@ public class FrontierComponent extends AbstractComponent implements RespondingDa
private final WorkerGuard workerGuard = new WorkerGuard(this);
private final boolean doRecrawling = true;
private long recrawlingTime = 1000L * 60L * 60L * 24L * 30;


private Timer timerTerminator;


public static final boolean RECRAWLING_ACTIVE = true;


protected Predictor predictor;

@Override
public void init() throws Exception {
super.init();
@@ -108,9 +117,15 @@ public void init() throws Exception {
queue = new InMemoryQueue();
knownUriFilter = new InMemoryKnownUriFilter(doRecrawling, recrawlingTime);
}
// Training the URI predictor model with a training dataset
try {
predictor = new MultinomialPredictor.MultinomialPredictorBuilder().withFile("multiNomialTrainData.txt").build();
Review comment (Member):
You will have to choose: either use a Spring bean for the predictor (as defined in the frontier.xml file) or use this line of code to create the predictor. The first solution is better but might be more challenging to define in the XML file. 🤔
However, using both together does not make sense. 😉
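The point of the comment, that the predictor should be created in exactly one place, can be illustrated with a small sketch. Everything here is hypothetical: the `Predictor` interface and the `Frontier` class are simplified stand-ins rather than the project's types, the labels are made up, and the `wire()` method merely plays the role of whichever single mechanism (the Spring context or the component's `init()` method) ends up building the predictor.

```java
import java.util.Objects;

public class PredictorWiringSketch {

    /** Hypothetical stand-in for the predictor abstraction. */
    public interface Predictor {
        String predict(String uri);
    }

    /** Hypothetical consumer: it receives a predictor and never builds one itself. */
    public static class Frontier {
        private final Predictor predictor;

        public Frontier(Predictor predictor) {
            this.predictor = Objects.requireNonNull(predictor, "predictor");
        }

        public String classify(String uri) {
            return predictor.predict(uri);
        }
    }

    /** The single place where the predictor is created and handed over. */
    public static Frontier wire() {
        // Toy rule with made-up labels; a real predictor would be a trained model.
        Predictor predictor = uri -> uri.endsWith(".rdf") ? "dereferenceable" : "unknown";
        return new Frontier(predictor);
    }

    public static void main(String[] args) {
        Frontier frontier = wire();
        System.out.println(frontier.classify("http://example.org/data.rdf"));
    }
}
```

With constructor injection the consumer is indifferent to who builds the predictor, so switching from hand-written creation code to a container-managed bean would not require changing the consumer.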

} catch (Exception e) {
LOGGER.error("Failed to create the URI predictor from the training data file.", e);
}

// Build frontier
frontier = new ExtendedFrontierImpl(new NormalizerImpl(), knownUriFilter, uriReferences, queue, doRecrawling);
frontier = new ExtendedFrontierImpl(new NormalizerImpl(), knownUriFilter, uriReferences, (IpAddressBasedQueue) queue, doRecrawling, predictor);

rabbitQueue = this.incomingDataQueueFactory.createDefaultRabbitQueue(Constants.FRONTIER_QUEUE_NAME);
receiver = (new RPCServer.Builder()).responseQueueFactory(outgoingDataQueuefactory).dataHandler(this)
@@ -139,11 +154,13 @@ public void init() throws Exception {
+ webConfiguration.isVisualizationOfCrawledGraphEnabled()
+ ". No WebServiceSenderThread will be started!");
}


}

@Override
public void run() throws Exception {

terminationMutex.acquire();
}

@@ -177,7 +194,7 @@ public void handleData(byte[] data) {

@Override
public void handleData(byte[] data, ResponseHandler handler, String responseQueueName, String correlId) {

Object deserializedData;
try {
deserializedData = serializer.deserialize(data);
@@ -200,7 +217,7 @@ public void handleData(byte[] data, ResponseHandler handler, String responseQueu
if (deserializedData instanceof UriSetRequest) {
responseToUriSetRequest(handler, responseQueueName, correlId, (UriSetRequest) deserializedData);
} else if (deserializedData instanceof UriSet) {

if(timerTerminator == null) {
LOGGER.info("Initializing Terminator task...");
TimerTask terminatorTask = new TerminatorTask(queue, terminationMutex, this.workerGuard);
@@ -212,6 +229,7 @@ public void handleData(byte[] data, ResponseHandler handler, String responseQueu
} else if (deserializedData instanceof CrawlingResult) {
CrawlingResult crawlingResult = (CrawlingResult) deserializedData;
LOGGER.warn("Received the message that the crawling for {} URIs is done.", crawlingResult.uris.size());

frontier.crawlingDone(crawlingResult.uris);
workerGuard.removeUrisForWorker(crawlingResult.idOfWorker, crawlingResult.uris);
} else if (deserializedData instanceof AliveMessage) {
@@ -298,11 +316,11 @@ public void run() {
break;
}
}

if(!stillHasUris && terminationCheck.shouldFrontierTerminate(queue)) {
terminationMutex.release();
}
}
}

}
}
}
@@ -12,6 +12,7 @@
import org.dice_research.squirrel.data.uri.norm.UriNormalizer;
import org.dice_research.squirrel.deduplication.hashing.UriHashCustodian;
import org.dice_research.squirrel.frontier.ExtendedFrontier;
import org.dice_research.squirrel.predictor.Predictor;
import org.dice_research.squirrel.queue.IpAddressBasedQueue;
import org.dice_research.squirrel.queue.UriQueue;

@@ -29,11 +30,12 @@ public class ExtendedFrontierImpl extends FrontierImpl implements ExtendedFronti
* @param generalRecrawlTime used to select the general Time after URIs should be recrawled. If Value is null the default Time is used.
* @param timerPeriod used to select if URIs should be recrawled.
* @param uriHashCustodian used to access and write hash values for uris.
* @param predictor {@link Predictor} used to predict the type of the URI.
*/
@SuppressWarnings("unused")
public ExtendedFrontierImpl(UriNormalizer normalizer, KnownUriFilter knownUriFilter, UriQueue queue, boolean doesRecrawling,
long generalRecrawlTime, long timerPeriod, UriHashCustodian uriHashCustodian) {
super(normalizer, knownUriFilter, queue, doesRecrawling, generalRecrawlTime, timerPeriod, uriHashCustodian);
long generalRecrawlTime, long timerPeriod, UriHashCustodian uriHashCustodian, Predictor predictor) {
super(normalizer, knownUriFilter, queue, doesRecrawling, generalRecrawlTime, timerPeriod, uriHashCustodian, predictor);
}

/**
@@ -45,9 +47,10 @@ public ExtendedFrontierImpl(UriNormalizer normalizer, KnownUriFilter knownUriFil
* @param queue {@link UriQueue} used to manage the URIs that should be
* crawled.
* @param doesRecrawling used to select if URIs should be recrawled.
* @param predictor {@link Predictor} used to predict the type of the URI.
*/
public ExtendedFrontierImpl(UriNormalizer normalizer, KnownUriFilter knownUriFilter, IpAddressBasedQueue queue, boolean doesRecrawling) {
super(normalizer, knownUriFilter, queue, doesRecrawling);
public ExtendedFrontierImpl(UriNormalizer normalizer, KnownUriFilter knownUriFilter, IpAddressBasedQueue queue, boolean doesRecrawling, Predictor predictor) {
super(normalizer, knownUriFilter, queue, doesRecrawling, predictor);
}

/**
@@ -60,9 +63,12 @@ public ExtendedFrontierImpl(UriNormalizer normalizer, KnownUriFilter knownUriFil
* @param queue {@link UriQueue} used to manage the URIs that should be
* crawled.
* @param doesRecrawling used to select if URIs should be recrawled.
* @param predictor {@link Predictor} used to predict the type of the URI.
*/
public ExtendedFrontierImpl(UriNormalizer normalizer, KnownUriFilter knownUriFilter, URIReferences uriReferences, UriQueue queue, boolean doesRecrawling) {
super(normalizer, knownUriFilter, uriReferences, queue, doesRecrawling);

public ExtendedFrontierImpl(UriNormalizer normalizer, KnownUriFilter knownUriFilter, URIReferences uriReferences, IpAddressBasedQueue queue, boolean doesRecrawling, Predictor predictor) {
super(normalizer, knownUriFilter, uriReferences, queue, doesRecrawling, predictor);

}

@Override
@@ -78,4 +84,4 @@ public void informAboutDeadWorker(String idOfWorker, List<CrawleableUri> lstUris
setIps.forEach(ip -> ipQueue.markIpAddressAsAccessible(ip));
}
}
}
}