Developed version 0.5.0 #149

Open
wants to merge 133 commits into
base: master
b8a863f
Merge pull request #1 from dice-group/master
abhihc Oct 24, 2018
3fcc30d
Merge pull request #2 from dice-group/master
abhihc Dec 3, 2018
f8da704
Merge pull request #3 from abhihc/master
sritejakv Dec 3, 2018
ffe80fd
Yaml File update for javascript case
Dec 9, 2018
cde1c80
Initial version of HtmlUnit Integration
sritejakv Dec 9, 2018
9ae63f8
Yaml File update for javascript case
Dec 10, 2018
5c22129
Merge remote-tracking branch 'origin/enhanced_data_portal_crawling' i…
Dec 10, 2018
e8a4036
HtmlUnit Integration into HtmlScrapper (Without reading from a file)
sritejakv Dec 14, 2018
4ee15af
Implemented junit test for WorkerImpl to test the worker
Ajaykps Dec 23, 2018
a5e681e
Adding Key constant for Time out
Dec 25, 2018
961d052
Manual transfer of code of mime type detector functionality from http…
abhihc Dec 26, 2018
43c9ffb
Mistake correction
Jan 5, 2019
acc6cf8
Yaml file for cambridgeshireinsight.org.uk
Jan 9, 2019
fb8c8f8
First Yaml file for cambridgeshireinsight.org.uk
Jan 9, 2019
a4cb0a9
Adding Javadoc comments.
abhihc Jan 10, 2019
46f55df
Changes made for collecting performance data for the crawler. Task: h…
ajrox090 Jan 11, 2019
1b4f35b
Updating parameter type to CrawleableUri,
Jan 13, 2019
f6f08fa
Updating parameter type to CrawleableUri,
Jan 13, 2019
7ddbe02
Standalone worker implementation.
sritejakv Jan 20, 2019
b4c676a
Testcases for ZipArchiver, SqlBasedIterator, HttpFetcher, SimpleOrder…
Ajaykps Jan 22, 2019
3026e28
Merge pull request #6 from abhihc/master
abhihc Jan 22, 2019
13ab5d9
Merge pull request #7 from abhihc/develop
Ajaykps Jan 22, 2019
b9feb40
Merge pull request #8 from abhihc/develop
abhihc Jan 22, 2019
cb66638
Merge pull request #10 from abhihc/Implementation_test_scenarios
Ajaykps Jan 22, 2019
050d69a
Merge pull request #9 from abhihc/Robustness
abhihc Jan 22, 2019
2a0934a
Merge pull request #11 from abhihc/develop
abhihc Jan 23, 2019
3564036
Changes made in context-sparql.xml file according to dice-group/Squirrel
abhihc Jan 23, 2019
9c42ea2
Merge remote-tracking branch 'origin/master'
abhihc Jan 23, 2019
b9f78d0
Final Changes made in context-sparql.xml file according to dice-group…
abhihc Jan 23, 2019
b3f01fa
Merging changes from dice_group/master
abhihc Jan 23, 2019
397221f
Merge branch 'dice-group-master'
abhihc Jan 23, 2019
afbab67
Fixed null pointer execption.
abhihc Jan 23, 2019
b01bd05
Fixed null pointer execption.
abhihc Jan 23, 2019
20a2f97
Fixed null pointer execption.
abhihc Jan 23, 2019
fb40474
Fixed null pointer execption.
abhihc Jan 23, 2019
a9b5f96
Fixed assertion error
abhihc Jan 23, 2019
ade55b1
Merge pull request #14 from abhihc/Robustness
abhihc Jan 23, 2019
a0770cc
Added JavaDoc comments to FetcherDummy4Tests and renamed pre and post…
Ajaykps Jan 24, 2019
3f2bc68
Changes made according to the pull request : https://github.com/dice-…
abhihc Jan 24, 2019
eb46c36
FilebasedSink: Moved the debugging of triple count from addTriple met…
ajrox090 Jan 24, 2019
3640b16
Testing all the generated crawlable URI objects
Ajaykps Jan 24, 2019
f10be5d
commented PerformanceAnalysis debug because of travis build failure
ajrox090 Jan 24, 2019
4a4f6de
Merge pull request #19 from abhihc/master
abhihc Jan 24, 2019
4fd4ed8
Merge pull request #20 from abhihc/develop
abhihc Jan 24, 2019
fae2f2e
Merge pull request #21 from abhihc/Robustness
abhihc Jan 24, 2019
9344b9d
Merge pull request #22 from abhihc/develop
abhihc Jan 24, 2019
f8e216a
Merge pull request #23 from abhihc/Testcase
abhihc Jan 24, 2019
560ccb4
Merge pull request #24 from abhihc/develop
abhihc Jan 24, 2019
88aa570
Enhancement of HtmlUnit implementation
Jan 25, 2019
aee8e4b
Restored the orginal similar to DICE group
abhihc Jan 25, 2019
3421990
Merge remote-tracking branch 'origin/master'
abhihc Jan 25, 2019
5833cd3
Ended with point in javadoc comments in FetcherDummy4Tests and remove…
Ajaykps Jan 25, 2019
8868f0a
Removed extra space in formatting
Ajaykps Jan 25, 2019
da0bdb9
Merge pull request #26 from abhihc/TestScenarios
abhihc Jan 25, 2019
fc5ed1b
Merge pull request #27 from abhihc/develop
abhihc Jan 25, 2019
4117263
Enhancement of HtmlUnit implementation 2
Jan 31, 2019
7574c2a
Changes made to fix the issues got from codacy
abhihc Feb 4, 2019
7dc3e53
Changes made to fix the issues of codacy build
Ajaykps Feb 4, 2019
d350b40
Changes made to fix the issues of codacy build
Ajaykps Feb 4, 2019
8b3d87a
Merge pull request #29 from abhihc/TestScenarios
abhihc Feb 4, 2019
0ce8f85
Merge pull request #30 from abhihc/develop
abhihc Feb 4, 2019
8d0d447
Merge pull request #31 from abhihc/master
abhihc Feb 4, 2019
03d9eb5
Merge pull request #32 from abhihc/develop
abhihc Feb 4, 2019
1698009
Enhancement of HtmlUnit implementation 2
Feb 10, 2019
897d925
Merge branch 'master' into enhanced_data_portal_crawling
sritejakv Feb 11, 2019
159b439
Merge branch 'develop' into enhanced_data_portal_crawling
sritejakv Feb 11, 2019
1c22ce0
Configuration of 5 portals:
Feb 16, 2019
e20db8b
Merge remote-tracking branch 'origin/enhanced_data_portal_crawling' i…
Feb 16, 2019
7c23c21
Revert "Merge remote-tracking branch 'origin/enhanced_data_portal_cra…
Feb 16, 2019
3ffa769
Data portals configuration
Feb 25, 2019
c611845
Yaml files for three data portals
sritejakv Mar 3, 2019
70108f9
-yaml files 2 portals
Mar 3, 2019
a0cc1cd
Yaml files for 5 portals.
sritejakv Mar 6, 2019
c75ebb1
-yaml files 2 portals
Mar 9, 2019
5e536cc
Merge branch 'enhanced_data_portal_crawling' of https://github.com/ab…
Mar 9, 2019
4a480c6
Revert "Merge branch 'enhanced_data_portal_crawling' of https://githu…
Mar 9, 2019
02d2f87
Adding two yaml files along with the test cases
sritejakv Mar 16, 2019
c8ed875
Fixed issues in HtmlScrapper.java and Adding few more yaml files
sritejakv Mar 21, 2019
acaffaf
Adding few more yaml files.
sritejakv Mar 28, 2019
dc0b0ba
-yaml files with test cases
Mar 31, 2019
35a55c4
Adding timeout of Salma's task
sritejakv Apr 1, 2019
96f8ceb
Added Codacy suggestions and few yaml files
sritejakv Apr 7, 2019
c93878f
Enhancing test cases for yaml file of some portals
Apr 7, 2019
3cf5cbb
Moving the unit test cases for data portals to parameterized tests
sritejakv Apr 11, 2019
7f88732
Added the page load on button click functionality of HtmlUnit
sritejakv Apr 12, 2019
ce16595
Portal Parametrized test case with yaml files update
Apr 13, 2019
51a50bf
Incorporating review comments in test cases for data portals.
sritejakv Apr 14, 2019
fefccb4
Update the test case
Apr 14, 2019
924e7ad
Updating html scrapper and a test case
sritejakv Apr 19, 2019
3159aba
Commiting the html file for a test case
sritejakv Apr 19, 2019
b57f235
Parametrized test cases for 6 portals
Apr 19, 2019
3fdca9a
Added parametrized test cases for seven data portals.
sritejakv Apr 19, 2019
ee54fb4
Adding few more test cases and corresponding yaml files.
sritejakv Apr 28, 2019
ba6bf0c
Test case for all the rest the portals
Apr 28, 2019
f09e6da
Correction of a mistake in a test case
Apr 29, 2019
36c0464
Removing the old test case file.
sritejakv May 4, 2019
84c174c
Final update and correction of the test cases of the portals
May 28, 2019
ebcc780
Merge branch 'mergeDataPortal' of https://github.com/dice-group/Squir…
sritejakv May 28, 2019
d381f38
Merge branch 'dice-group-mergeDataPortal' into enhanced_data_portal_c…
sritejakv May 28, 2019
abeed04
Correcting the pull request (mergeDataPortal into enhanced_data_porta…
sritejakv May 28, 2019
dcbd179
Java doc for the abstract test case of data portals
Jun 8, 2019
eacb0ed
RDFaSemarglParser file deleted
Jun 8, 2019
c699540
Possible better way to skip downloading the page while running a test…
Jun 8, 2019
0c12187
Incorporating review comments.
sritejakv Jun 10, 2019
ef6dbf0
Adding review comments in the test cases.
sritejakv Jun 15, 2019
6dbe55b
Update Test cases with solving build fail
Jun 15, 2019
2f7d27c
Adding java docs for the new classes.
sritejakv Jun 16, 2019
e5354e2
Removing the outdated context files.
sritejakv Jun 18, 2019
3932b6a
Merge pull request #100 from abhihc/enhanced_data_portal_crawling
MichaelRoeder Aug 23, 2019
18068c2
Javascript update
Mar 6, 2020
413fc10
Parameters update
Mar 6, 2020
98d74c4
Merge branches 'develop' and 'mergeDataPortal' of https://github.com/…
Mar 6, 2020
435a920
Fix some error
Mar 6, 2020
2c44e56
Merge branches 'develop' and 'mergeDataPortal' of https://github.com/…
Mar 12, 2020
88921d2
Fix some error and update branch
Mar 12, 2020
faa3f34
Fixed a problem in the extended frontier.
MichaelRoeder Nov 7, 2021
84a7095
Added make file commands for orca-related images.
MichaelRoeder Nov 12, 2021
16dcca3
Fixed a problem with the Spring framework and the shade plugin which …
MichaelRoeder Nov 12, 2021
c146ff9
Merged develop into mergeDataPortal. Encountered issues with dataport…
MichaelRoeder Feb 17, 2022
c331b30
Renamed test classes.
MichaelRoeder Feb 17, 2022
a5fa88e
Increased the version of HTMLUnit.
MichaelRoeder Feb 19, 2022
f0d7220
Updated the JavaScript handling. Removed the repeated button clicking…
MichaelRoeder Feb 19, 2022
ac718ce
Merge pull request #139 from dice-group/mergeDataPortal
MichaelRoeder Feb 19, 2022
187010b
Increased version to 0.5.0.
MichaelRoeder Feb 19, 2022
6dab8ef
Changed the Makefile to load the project version from Maven. Added th…
MichaelRoeder Feb 8, 2024
7ffa503
Added the usage of a white list filter. Added missing initialization …
MichaelRoeder Feb 8, 2024
31a3f88
Aligned project versions.
MichaelRoeder Feb 8, 2024
118cb2f
Cleaned up and improved robustness of FrontierImpl and MongoDBKnowUri…
MichaelRoeder Feb 8, 2024
854d6fc
Added missing spring config to the frontier. Switched back to the see…
MichaelRoeder Feb 8, 2024
98837e4
Fixed the URI Filter hierarchy. Removed filter-related init methods f…
MichaelRoeder Feb 8, 2024
c4c5f5e
Fixed problems with filters.
MichaelRoeder Feb 8, 2024
5684fd2
Fixed the termination timer task. Added a status task that logs the f…
MichaelRoeder Feb 8, 2024
a3cd8db
Several small changes. Added a JUnit test for regex-based filters.
MichaelRoeder Feb 8, 2024
16 changes: 0 additions & 16 deletions Dockerfile.web

This file was deleted.

39 changes: 30 additions & 9 deletions Makefile
@@ -1,19 +1,40 @@
default: build

write-version:
mvn org.apache.maven.plugins:maven-help-plugin:3.2.0:evaluate -Dexpression=project.version -Doutput=version.txt

build:
docker-compose -f docker-compose.yml down
mvn clean install -U -DskipTests -Dmaven.javadoc.skip=true

dockerize:
docker build -f Dockerfile.frontier -t dicegroup/squirrel.frontier .
docker build -f Dockerfile.worker -t dicegroup/squirrel.worker .
docker build -f Dockerfile.web -t squirrel.web .
docker build -f Dockerfile.mockup -t dicegroup/squirrel.mockup .
dockerize: write-version
docker build -f Dockerfile.frontier -t dicegroup/squirrel.frontier:"$$(cat version.txt)" .
docker build -f Dockerfile.worker -t dicegroup/squirrel.worker:"$$(cat version.txt)" .
# docker build -f Dockerfile.web -t squirrel.web:"$$(cat version.txt)" .
docker build -f Dockerfile.mockup -t dicegroup/squirrel.mockup:"$$(cat version.txt)" .

push-images: write-version
docker push dicegroup/squirrel.frontier:"$$(cat version.txt)"
docker push dicegroup/squirrel.worker:"$$(cat version.txt)"
docker push dicegroup/squirrel.mockup:"$$(cat version.txt)"

tag-orca-images: write-version
docker tag dicegroup/squirrel.frontier:"$$(cat version.txt)" git.project-hobbit.eu:4567/ldcbench/ldcbench-squirrel-adapter/squirrel-frontier:"$$(cat version.txt)"
docker tag dicegroup/squirrel.worker:"$$(cat version.txt)" git.project-hobbit.eu:4567/ldcbench/ldcbench-squirrel-adapter/squirrel-worker:"$$(cat version.txt)"

push-orca-images: write-version
docker push git.project-hobbit.eu:4567/ldcbench/ldcbench-squirrel-adapter/squirrel-frontier:"$$(cat version.txt)"
docker push git.project-hobbit.eu:4567/ldcbench/ldcbench-squirrel-adapter/squirrel-worker:"$$(cat version.txt)"

tag-latest: write-version
docker tag dicegroup/squirrel.frontier:"$$(cat version.txt)" dicegroup/squirrel.frontier:latest
docker tag dicegroup/squirrel.worker:"$$(cat version.txt)" dicegroup/squirrel.worker:latest
docker tag dicegroup/squirrel.mockup:"$$(cat version.txt)" dicegroup/squirrel.mockup:latest

push-images:
docker push dicegroup/squirrel.frontier
docker push dicegroup/squirrel.worker
docker push dicegroup/squirrel.mockup
push-latest-images:
docker push dicegroup/squirrel.frontier:latest
docker push dicegroup/squirrel.worker:latest
docker push dicegroup/squirrel.mockup:latest

start: dockerize
docker-compose -f docker-compose-sparql.yml up
15 changes: 0 additions & 15 deletions build-squirrel

This file was deleted.

File renamed without changes.
4 changes: 2 additions & 2 deletions docker-compose-sparql.yml
@@ -28,8 +28,8 @@ services:
- QUEUE_FILTER_PERSIST=true
- COMMUNICATION_WITH_WEBSERVICE=false
- VISUALIZATION_OF_CRAWLED_GRAPH=false
- JVM_ARGS=-Xmx8g
- JVM_ARGS=-Xmx8g

volumes:
- ./data/frontier:/var/squirrel/data
- ./seed/seeds.csv:/var/squirrel/seeds.csv:ro
6 changes: 4 additions & 2 deletions docker-compose.yml
@@ -15,8 +15,9 @@ services:
environment:
- HOBBIT_RABBIT_HOST=rabbit
- URI_WHITELIST_FILE=/var/squirrel/whitelist.txt
- URI_BLACKLIST_FILE=/var/squirrel/blacklist.txt
- FRONTIER_CONTEXT_CONFIG_FILE=/var/squirrel/spring-config/frontier-context.xml
- SEED_FILE=/var/squirrel/seeds.csv
- SEED_FILE=/var/squirrel/seeds.txt
- MDB_HOST_NAME=mongodb
- MDB_PORT=27017
- MDB_CONNECTION_TIME_OUT=5000
@@ -26,12 +27,13 @@ services:
- COMMUNICATION_WITH_WEBSERVICE=false
- VISUALIZATION_OF_CRAWLED_GRAPH=false
- JVM_ARGS=-Xmx8g

volumes:
- ./data/frontier:/var/squirrel/data
- ./seed/seeds.csv:/var/squirrel/seeds.csv:ro
- ./seed/seeds.txt:/var/squirrel/seeds.txt:ro
- ./whitelist/whitelist.txt:/var/squirrel/whitelist.txt:ro
- ./whitelist/blacklist.txt:/var/squirrel/blacklist.txt:ro
- ./spring-config:/var/squirrel/spring-config
command: java -cp squirrel.jar org.dice_research.squirrel.components.FrontierComponentStarter


19 changes: 18 additions & 1 deletion pom.xml
@@ -5,7 +5,7 @@
<modelVersion>4.0.0</modelVersion>
<groupId>org.dice-research</groupId>
<artifactId>squirrel</artifactId>
<version>0.4.0</version>
<version>0.5.0</version>
<packaging>pom</packaging>
<inceptionYear>2017</inceptionYear>
<name>Squirrel</name>
@@ -345,11 +345,17 @@
<version>4.12</version>
<scope>test</scope>
</dependency>
<!-- System rules for setting environment variables
<dependency>
<groupId>com.github.stefanbirkner</groupId>
<artifactId>system-rules</artifactId>
<scope>test</scope> -->
<!-- System Lambda for setting environment variables -->
<dependency>
<groupId>com.github.stefanbirkner</groupId>
<artifactId>system-lambda</artifactId>
<version>1.0.0</version>
<scope>test</scope>
</dependency>
<!-- ~~~~~~~~~~~~~~~~~~~ End Testing ~~~~~~~~~~~~~~~~~~~~~~ -->

@@ -486,6 +492,17 @@
</transformer>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
<transformer
implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
<resource>META-INF/spring.handlers</resource>
</transformer>
<!-- Important for Spring to be able to map XML
Schema files to local files. Otherwise, Spring may abort the processing of
XML files because of unavailable XML Schema files -->
<transformer
implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
<resource>META-INF/spring.schemas</resource>
</transformer>
</transformers>
</configuration>
<executions>
1 change: 1 addition & 0 deletions seed/seeds.txt
@@ -0,0 +1 @@
http://dbpedia.org/resource/New_York
30 changes: 23 additions & 7 deletions spring-config/frontier-context.xml
@@ -72,10 +72,12 @@
<constructor-arg index="0" ref="mongoDBKnowUriFilter" />
<constructor-arg index="1">
<list>
<!-- <ref bean="depthFilter" /> -->
<ref bean="schemeFilter" />
<ref bean="whiteFilter" />
<ref bean="blackFilter" />
</list>
</constructor-arg>
<constructor-arg index="2" value="OR" />
<constructor-arg index="2" value="AND" />
</bean>

<!-- Triple Store sparql implementation -->
@@ -98,19 +100,33 @@


<bean id="mongoDBKnowUriFilter"
class="org.dice_research.squirrel.data.uri.filter.MongoDBKnowUriFilter">
class="org.dice_research.squirrel.data.uri.filter.MongoDBKnowUriFilter"
init-method="open">
<constructor-arg index="0"
value="#{systemEnvironment['MDB_HOST_NAME']}" />
<constructor-arg index="1"
value="#{systemEnvironment['MDB_PORT']}" />

</bean>


<bean id="depthFilter"
<bean id="schemeFilter"
class="org.dice_research.squirrel.data.uri.filter.SchemeBasedUriFilter">
</bean>
<bean id="whiteFilter"
class="org.dice_research.squirrel.data.uri.filter.RegexBasedWhiteListFilter">
<constructor-arg index="0" value="#{systemEnvironment['URI_WHITELIST_FILE']}" />
<constructor-arg index="1" value="false" />
</bean>
<bean id="blackFilter"
class="org.dice_research.squirrel.data.uri.filter.RegexBasedBlackListFilter">
<constructor-arg index="0" value="#{systemEnvironment['URI_BLACKLIST_FILE']}" />
<constructor-arg index="1" value="false" />
</bean>


<!-- <bean id="depthFilter"
class="org.dice_research.squirrel.data.uri.filter.DepthFilter">
<constructor-arg index="0" value="3" />
</bean>
</bean>-->
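
The configuration above switches the composer's operator from OR to AND and adds scheme, whitelist and blacklist filters. The following is a minimal sketch of why the operator matters, not Squirrel's implementation: the class `FilterCompositionSketch`, its methods, the example patterns, and the use of `Predicate<String>` in place of `UriFilter` and `CrawleableUri` are all hypothetical. How the real composer treats the known-URI filter, and whether the regex filters match the whole URI or only part of it, is not visible in this diff.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Hypothetical sketch; none of these names are taken from Squirrel. */
class FilterCompositionSketch {

    /** Stand-in for a scheme filter: accepts only http and https URIs. */
    static final Predicate<String> SCHEME =
            uri -> uri.startsWith("http://") || uri.startsWith("https://");

    /** Accepts a URI if the pattern is found in it (assumed whitelist semantics). */
    static Predicate<String> whiteList(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return uri -> pattern.matcher(uri).find();
    }

    /** Rejects a URI if the pattern is found in it (assumed blacklist semantics). */
    static Predicate<String> blackList(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return uri -> !pattern.matcher(uri).find();
    }

    /** "AND" requires every filter to accept; anything else is treated as OR. */
    static boolean isUriGood(String uri, List<Predicate<String>> filters, String operator) {
        if ("AND".equals(operator)) {
            return filters.stream().allMatch(f -> f.test(uri));
        }
        return filters.stream().anyMatch(f -> f.test(uri));
    }

    public static void main(String[] args) {
        List<Predicate<String>> filters = Arrays.asList(
                SCHEME,
                whiteList("^https?://dbpedia\\.org/"),
                blackList("\\.jpg$"));
        String image = "http://dbpedia.org/picture.jpg";
        System.out.println(isUriGood(image, filters, "AND")); // prints false
        System.out.println(isUriGood(image, filters, "OR"));  // prints true
    }
}
```

Under OR, a URI rejected by the blacklist still passes as long as any other filter accepts it, which is presumably why a configuration that adds a blacklist also moves to AND.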



6 changes: 3 additions & 3 deletions squirrel.api/pom.xml
@@ -6,7 +6,7 @@
<parent>
<groupId>org.dice-research</groupId>
<artifactId>squirrel</artifactId>
<version>0.4.0</version>
<version>0.5.0</version>
</parent>
<artifactId>squirrel.api</artifactId>
<packaging>jar</packaging>
@@ -182,7 +182,7 @@
<artifactId>hsqldb</artifactId>
</dependency>

<!-- We use the simpleframework to use local HTTP servers for our
<!-- We use the simpleframework to use local HTTP servers for our
JUnit tests and test scenarios -->
<dependency>
<groupId>org.simpleframework</groupId>
@@ -228,4 +228,4 @@
<!-- ~~~~~~~~~~~~~~~~~~~ End Logging ~~~~~~~~~~~~~~~~~~~~~~ -->
</dependencies>

</project>
</project>
@@ -41,6 +41,7 @@ public class Constants {
public static final String URI_HASH_KEY = "HashValue";
public static final String UUID_KEY = "UUID";


/**
* The preferred date for recrawling a URI is assumed to be a timestamp (in ms
* from 1st January 1970).
@@ -79,4 +80,5 @@ public class Constants {

public static final Charset DEFAULT_CHARSET = StandardCharsets.UTF_8;
public static final String DEFAULT_USER_AGENT = "Squirrel";
public static final int JAVASCRIPT_WAIT_TIME = 30 * 1000; //30 seconds
}
@@ -55,4 +55,9 @@ public void close() throws IOException {
}
}

@Override
public void open() {
decorated.open();
}

}
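
The hunk above adds an `open()` method that forwards to the decorated object. A minimal sketch of that forwarding pattern follows. All type names (`OpenableFilterSketch`, `ResourceBackedFilterSketch`, `FilterDecoratorSketch`) are hypothetical and simplified to plain strings; the actual class in this file is not named in the diff.

```java
/** Hypothetical minimal filter contract with a lifecycle method. */
interface OpenableFilterSketch {
    boolean isUriGood(String uri);

    void open();
}

/** Hypothetical inner filter that only works after open() has been called. */
class ResourceBackedFilterSketch implements OpenableFilterSketch {
    private boolean opened = false;

    @Override
    public void open() {
        opened = true; // a real filter might connect to a database here
    }

    public boolean isOpened() {
        return opened;
    }

    @Override
    public boolean isUriGood(String uri) {
        if (!opened) {
            throw new IllegalStateException("open() has not been called");
        }
        return true;
    }
}

/** Decorator that forwards every call, including open(), to the wrapped filter. */
class FilterDecoratorSketch implements OpenableFilterSketch {
    protected final OpenableFilterSketch decorated;

    FilterDecoratorSketch(OpenableFilterSketch decorated) {
        this.decorated = decorated;
    }

    @Override
    public boolean isUriGood(String uri) {
        return decorated.isUriGood(uri);
    }

    @Override
    public void open() {
        decorated.open(); // without this forwarding, the inner filter would never be opened
    }
}
```

If the decorator did not forward `open()`, callers that only hold the decorator would have no way to initialize the wrapped filter.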
@@ -5,49 +5,63 @@
import org.dice_research.squirrel.data.uri.CrawleableUri;

/**
* A {@link UriFilter} that works like a blacklist filter and contains only those
* URIs on its blacklist that the crawler already has seen before.
* A {@link UriFilter} that works like a blacklist filter and contains only
* those URIs on its blacklist that the crawler already has seen before.
*
* @author Michael R&ouml;der ([email protected])
*/
public interface KnownUriFilter extends UriFilter {

/**
* Adds the given URI to the list of already known URIs. Works like calling {@link #add(CrawleableUri, long)} with the current system time.
* Adds the given URI to the list of already known URIs. Works like calling
* {@link #add(CrawleableUri, long)} with the current system time.
*
* @param uri the URI that should be added to the list.
* @param nextCrawlTimestamp The time at which the given URI should be crawled next.
*
*/
public void add(CrawleableUri uri, long nextCrawlTimestamp);
default void add(CrawleableUri uri) {
add(uri, System.currentTimeMillis()); // TODO This does not make much sense, since it will set the
// nextCrawlTimestamp to the current time!
}

/**
* Adds the given URI to the list of already known URIs together with the the time at which it has been crawled.
* Adds the given URI to the list of already known URIs. Works like calling
* {@link #add(CrawleableUri, long)} with the current system time.
*
* @param uri the URI that should be added to the list.
* @param uri the URI that should be added to the list.
* @param nextCrawlTimestamp The time at which the given URI should be crawled
* next.
*/
void add(CrawleableUri uri, long nextCrawlTimestamp);

/**
* Adds the given URI to the list of already known URIs together with the
* time at which it has been crawled.
*
* @param uri the URI that should be added to the list.
* @param lastCrawlTimestamp the time at which the given URI has been crawled.
* @param nextCrawlTimestamp The time at which the given URI should be crawled next.
* @param nextCrawlTimestamp The time at which the given URI should be crawled
* next.
*/
void add(CrawleableUri uri, long lastCrawlTimestamp, long nextCrawlTimestamp);

public default void add(CrawleableUri uri) {
add(uri, System.currentTimeMillis());
}

/**
* Returns all {@link CrawleableUri}s which have to be recrawled. This means their time to next crawl has passed.
* Returns all {@link CrawleableUri}s which have to be recrawled. This means
* their time to next crawl has passed.
*
* @return The outdated {@link CrawleableUri}s.
*/
public List<CrawleableUri> getOutdatedUris();
List<CrawleableUri> getOutdatedUris();

/**
* Counts the number of known URIs.
*
* @return the number of entries from an open queue
*/
long count();

/**
* Opens the queue and allocates necessary resources.
*/
public void open();
void open();
}
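
The TODO in the interface above notes that the one-argument `add` passes the current time as `nextCrawlTimestamp`. The sketch below illustrates the consequence under stated assumptions: `KnownUrisSketch` and `InMemoryKnownUrisSketch` are hypothetical, URIs are plain strings, and "outdated" is assumed to mean that the next crawl time is not later than the time of the query.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified stand-in for a known-URI store. */
interface KnownUrisSketch {

    /** Stores the URI together with the time at which it should be crawled next. */
    void add(String uri, long nextCrawlTimestamp);

    /** Convenience overload: delegates with the current time as the next crawl time. */
    default void add(String uri) {
        add(uri, System.currentTimeMillis());
    }

    /** Returns all URIs whose next crawl time is not later than the given time. */
    List<String> getOutdatedUris(long now);
}

/** Hypothetical in-memory implementation used only for illustration. */
class InMemoryKnownUrisSketch implements KnownUrisSketch {
    private final Map<String, Long> nextCrawlTimes = new HashMap<>();

    @Override
    public void add(String uri, long nextCrawlTimestamp) {
        nextCrawlTimes.put(uri, nextCrawlTimestamp);
    }

    @Override
    public List<String> getOutdatedUris(long now) {
        List<String> outdated = new ArrayList<>();
        for (Map.Entry<String, Long> entry : nextCrawlTimes.entrySet()) {
            if (entry.getValue() <= now) {
                outdated.add(entry.getKey());
            }
        }
        return outdated;
    }

    public static void main(String[] args) {
        InMemoryKnownUrisSketch store = new InMemoryKnownUrisSketch();
        store.add("http://example.org/a"); // next crawl time = now
        store.add("http://example.org/b", System.currentTimeMillis() + 86_400_000L); // one day later
        // One second later, only the URI added via the default method counts as outdated.
        System.out.println(store.getOutdatedUris(System.currentTimeMillis() + 1000L));
    }
}
```

In this sketch a URI added through the default method is due for recrawling almost immediately, which matches the concern raised in the TODO comment.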
@@ -21,13 +21,4 @@ public interface UriFilter {
* requirements imposed by this filter. Otherwise false is returned.
*/
public boolean isUriGood(CrawleableUri uri);


/**
* Adds the given URI to the list of already known URIs. Works like calling {@link #add(CrawleableUri, long)} with the current system time.
*
* @param uri the URI that should be added to the list.
*
*/
public void add(CrawleableUri uri);
}
@@ -5,14 +5,14 @@
* This class represents a composition of two or more filters,
* requiring at least one {@link org.dice_research.squirrel.data.uri.filter.KnownUriFilter}
*
* * @author Geraldo de Souza Junior ([email protected])
* @author Geraldo de Souza Junior ([email protected])
*
*/

public interface UriFilterComposer extends UriFilter {

/**
* Returnsthe KnowUriFilter from this {@link UriFilterComposer}
* Returns the KnowUriFilter from this {@link UriFilterComposer}
*
* @return KnownUriFilter
*/
@@ -25,5 +25,7 @@ public interface UriFilterComposer extends UriFilter {
* @param knownUriFilter
*/
public void setKnownUriFilter(KnownUriFilter knownUriFilter);



}